Hi community, has anyone used prefect to track data lineage? Prefect Community #ask-community

Florian Kühnlenz

08/18/2021, 6:24 PM

Hi community, has anyone used Prefect to track data lineage? We have a project that would greatly benefit from knowing what data flows from where to where. Since Prefect already builds a graph, we have been thinking that we just need to put some effort into how we structure our flows, so that load and store happen in easily identified tasks, and then ask the right questions of the resulting DAG. If anyone has done something similar already, it would be very interesting to hear how it went!
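
A minimal sketch of "asking the right questions of the DAG", assuming load and store tasks can be recognised by a naming convention. The task names, the `load_`/`store_` prefixes, and the plain dictionary standing in for the flow graph are all illustrative assumptions; this does not use the Prefect API.

```python
from collections import deque

# Hypothetical flow graph: task name -> list of downstream task names.
# The "load_" / "store_" prefixes are an assumed convention, not a Prefect feature.
EDGES = {
    "load_orders": ["clean_orders"],
    "load_customers": ["join"],
    "clean_orders": ["join"],
    "join": ["store_report"],
    "store_report": [],
}


def downstream(edges, start):
    """Return every task reachable from `start` (breadth-first walk)."""
    seen, queue = set(), deque(edges[start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(edges[node])
    return seen


def lineage(edges):
    """Map each load task to the sorted list of store tasks it feeds."""
    return {
        task: sorted(t for t in downstream(edges, task) if t.startswith("store_"))
        for task in edges
        if task.startswith("load_")
    }


result = lineage(EDGES)
print(result)
```

With a real flow, the dictionary would be derived from the flow's own edges rather than written by hand.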

Kevin Kho

08/18/2021, 6:27 PM

I would like to hear this as well if anyone has done it. Are you doing it yourself? Someone else here was suggesting Marquez, but I haven’t actually used it. We build a graph of tasks, but that doesn’t know your data by default 😅. Where would you persist this metadata?

An Hoang

08/18/2021, 6:52 PM

Data lineage is a thing that Kedro has done really well and is constantly improving. See kedro-viz: https://github.com/quantumblacklabs/kedro-viz Their live demo: https://quantumblacklabs.github.io/kedro-viz/ Since there are many parallels between Kedro and Prefect, I think Prefect users can inspect the `target` argument of a task (one target per task for now) or use the direct data dependency from upstream tasks (I don't know how to do this) and do a similar visualization.
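
A rough sketch of the `target` idea, assuming each task declares at most one target and knows its upstream tasks. `TaskSpec` is a hypothetical stand-in, not a Prefect class, and the file names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskSpec:
    """Hypothetical stand-in for a task: a name, an optional target, upstream tasks."""
    name: str
    target: Optional[str] = None
    upstream: List["TaskSpec"] = field(default_factory=list)


def dataset_edges(tasks: List[TaskSpec]) -> List[Tuple[str, str]]:
    """Return (upstream target, downstream target) pairs where both tasks declare a target."""
    edges = []
    for task in tasks:
        if task.target is None:
            continue
        for up in task.upstream:
            if up.target is not None:
                edges.append((up.target, task.target))
    return edges


raw = TaskSpec("extract", target="raw.parquet")
clean = TaskSpec("clean", target="clean.parquet", upstream=[raw])
report = TaskSpec("report", target="report.csv", upstream=[clean])

edges = dataset_edges([raw, clean, report])
print(edges)
```

A task without a target between two tasks with targets breaks the chain in this sketch, so a fuller version would have to walk through such tasks.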

davzucky

08/19/2021, 12:01 AM

This is something I was looking at as well. It would be nice to have support for OpenLineage; I raised this feature with @Chris White before. I know that Airflow can be extended with that capability. For me there are two points which are important:
1. Lineage of the flow: input, transformation, output.
2. Inside a task, being able to access a lineage object, like we have with the logger, to augment the information.
Should I open an issue about it?

👍 1
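
One way the second point could look, sketched with the standard library only. `task_lineage` and `get_lineage` are hypothetical names; nothing here is an existing Prefect or OpenLineage API. The idea is that a task fetches the current lineage record the same way it would fetch a logger.

```python
import contextvars
from contextlib import contextmanager

# Holds the lineage record of whichever task is currently running.
_current = contextvars.ContextVar("lineage_record")


@contextmanager
def task_lineage(task_name):
    """Open a lineage record for the duration of one task run (hypothetical helper)."""
    record = {"task": task_name, "inputs": [], "outputs": []}
    token = _current.set(record)
    try:
        yield record
    finally:
        _current.reset(token)


def get_lineage():
    """Return the record of the running task, similar to fetching a logger."""
    return _current.get()


def copy_table():
    # Inside the task body, augment the record with what was read and written.
    lin = get_lineage()
    lin["inputs"].append("db.orders")
    lin["outputs"].append("s3://bucket/orders.parquet")


with task_lineage("copy_table") as record:
    copy_table()

print(record)
```

Calling `get_lineage()` outside a running task raises `LookupError` here; a real integration would probably want a no-op fallback instead.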

davzucky

08/19/2021, 12:01 AM

https://github.com/OpenLineage/OpenLineage

Florian Kühnlenz

08/19/2021, 8:48 AM

The biggest problem I see is that every task can in theory load any kind of data from anywhere. So I believe in order to derive lineage there needs to be a convention to use a specific subclass (?) for each system.

davzucky

08/19/2021, 11:34 AM

Yes, I thought about it. We would need some kind of tagging on the task, with a type of none, source, transform, or output. I expect a source would need to get the input attribute. Can you think of any other information?
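
A small sketch of that tagging idea, assuming a decorator records a role and an optional dataset name per task. `Role`, `lineage_role`, and `REGISTRY` are hypothetical names chosen for illustration, and the functions are plain Python rather than Prefect tasks.

```python
from enum import Enum


class Role(Enum):
    NONE = "none"
    SOURCE = "source"
    TRANSFORM = "transform"
    OUTPUT = "output"


# Task name -> lineage metadata, filled in at decoration time.
REGISTRY = {}


def lineage_role(role, dataset=None):
    """Tag a function with its lineage role and, optionally, the dataset it touches."""
    def decorator(fn):
        REGISTRY[fn.__name__] = {"role": role, "dataset": dataset}
        return fn
    return decorator


@lineage_role(Role.SOURCE, dataset="orders.csv")
def read_orders():
    return [1, 2, 3]


@lineage_role(Role.TRANSFORM)
def double(rows):
    return [r * 2 for r in rows]


@lineage_role(Role.OUTPUT, dataset="doubled.csv")
def write_rows(rows):
    return len(rows)


sources = [name for name, meta in REGISTRY.items() if meta["role"] is Role.SOURCE]
print(sources)
```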

Florian Kühnlenz

08/19/2021, 4:13 PM

For our use case I have been thinking about having a subclass for files and one for queries, each taking the unparameterized query or filename as an instance parameter and then some kwargs in the run method with the parameters, so that these can be included in the lineage. I might be able to prototype something next week.

👍 2
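
A sketch of how such subclasses might be shaped, assuming the template is stored on the instance and the parameters arrive as kwargs to `run`. `LineageTask`, `FileTask`, and `QueryTask` are hypothetical plain-Python classes, not Prefect task classes, and the actual loading is stubbed out.

```python
class LineageTask:
    """Hypothetical base class: keeps the unparameterized template and logs each run."""
    kind = "generic"

    def __init__(self, template):
        self.template = template
        self.records = []

    def run(self, **params):
        # str.format is only for illustration; real SQL should use bound parameters.
        resolved = self.template.format(**params)
        self.records.append(
            {"kind": self.kind, "template": self.template,
             "params": params, "resolved": resolved}
        )
        return self.load(resolved)

    def load(self, resolved):
        raise NotImplementedError


class FileTask(LineageTask):
    kind = "file"

    def load(self, resolved):
        return resolved  # a real task would open and read the file here


class QueryTask(LineageTask):
    kind = "query"

    def load(self, resolved):
        return resolved  # a real task would execute the query here


query = QueryTask("SELECT * FROM sales WHERE day = '{day}'")
query.run(day="2021-08-19")
print(query.records[0]["resolved"])
```

Keeping the template and the parameters separate in the record means lineage can be grouped by template across runs while still showing which concrete file or query each run touched.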

Kevin Kho

08/19/2021, 9:05 PM

Yes @davzucky, this looks like a good conversation to log in GitHub as an issue.

Brad

09/15/2021, 9:49 AM

Hi everyone - I just found this thread. I opened https://github.com/PrefectHQ/prefect/discussions/4935 to discuss the same thing (more recently than this discussion), and I’m very interested in having a go at implementing something. Would anyone be interested in collaborating / providing feedback?

Brad

09/15/2021, 9:50 AM

@Florian Kühnlenz @An Hoang @davzucky just tagging you in case you miss this

👍 2

Brad

09/15/2021, 9:52 AM

> The biggest problem I see is that every task can in theory load any kind of data from anywhere. So I believe in order to derive lineage there needs to be a convention to use a specific subclass (?) for each system.

I’ve been thinking about this too - I think from Prefect's perspective, lineage can only start from the point where data first hits your Prefect flow. It would be on the user to implement anything upstream from there (likely in a separate way).

👍 1

davzucky

09/16/2021, 12:33 AM

@Brad Thank you for pinging me. I commented on the discussion.

Brad

09/16/2021, 12:35 AM

OpenLineage has an open ticket for Prefect integration - https://github.com/OpenLineage/OpenLineage/issues/81

👍 2

Brad

09/16/2021, 1:08 AM

hey @Kevin Kho - is this something that is on the roadmap for Prefect? I'm going to have a play around with some integration here (in a separate repo) but I don't want to step on anyone's toes

Brad

09/17/2021, 6:26 AM

Just an update here for anyone who is following along - I've opened a PR on OpenLineage to add a Prefect integration: https://github.com/OpenLineage/OpenLineage/pull/293 Would be great to hear any thoughts/comments.

👍 3
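
For context on what an integration like this would emit, here is a rough sketch of an OpenLineage-style run event built with the standard library. The field names follow the OpenLineage run event layout as I understand it and should be checked against the spec; the namespace, job name, dataset names, and producer URL are made up, and this is not the code from the PR.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type, job_name, inputs, outputs, namespace="prefect-demo"):
    """Build a dict shaped like an OpenLineage run event (illustrative, unvalidated)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer URI identifying the emitting integration.
        "producer": "https://example.com/hypothetical-prefect-integration",
    }


event = run_event("COMPLETE", "daily_flow.store_report", ["db.orders"], ["report.csv"])
print(json.dumps(event, indent=2))
```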
