# ask-community
f
Hi community, has anyone used Prefect to track data lineage? We have a project that would greatly benefit from knowing what data flows from where to where. Since Prefect already builds a graph, we have been thinking that we just need to put some effort into how we structure our flows, so that load and store happen in easily identified tasks, and then ask the right questions of the resulting DAG. If anyone has done something similar already, it would be very interesting to hear how it went!
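A minimal sketch of the "ask the right questions of the DAG" idea, in plain Python so it runs without Prefect. The edge list, the task names, and the `load_*` / `store_*` naming convention are all made up for illustration; a real version would have to read the edges from the flow object instead.

```python
from collections import defaultdict

# Hypothetical flow: each edge points from an upstream task to a downstream task.
# Assumed convention: tasks named load_* read data, tasks named store_* write data.
edges = [
    ("load_orders", "clean_orders"),
    ("load_customers", "join"),
    ("clean_orders", "join"),
    ("join", "store_report"),
    ("load_customers", "store_customer_copy"),
]


def upstream_map(edges):
    """Map each task to the set of tasks directly upstream of it."""
    parents = defaultdict(set)
    for up, down in edges:
        parents[down].add(up)
    return parents


def sources_of(task, parents):
    """Walk upstream from `task` and collect every load_* task found."""
    seen, stack, found = set(), [task], set()
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent in seen:
                continue
            seen.add(parent)
            if parent.startswith("load_"):
                found.add(parent)
            stack.append(parent)
    return found


def lineage(edges):
    """For every store_* task, list the load_* tasks that feed it."""
    parents = upstream_map(edges)
    tasks = {name for edge in edges for name in edge}
    return {
        task: sorted(sources_of(task, parents))
        for task in sorted(tasks)
        if task.startswith("store_")
    }


if __name__ == "__main__":
    for sink, sources in lineage(edges).items():
        print(sink, "<-", ", ".join(sources))
```

This only answers "which loads feed which stores" at the task level. It says nothing about which columns or rows moved, which is the part the graph does not know by default.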
k
I would like to hear this as well if anyone has done it. Are you doing it yourself? Someone else here was suggesting Marquez but I haven’t actually used it. We build a graph of tasks, but that doesn’t know your data by default 😅. Where would you persist this metadata?
a
Data lineage is something Kedro has done really well and is constantly improving. See kedro-viz: https://github.com/quantumblacklabs/kedro-viz Their live demo: https://quantumblacklabs.github.io/kedro-viz/ Since there are many parallels between Kedro and Prefect, I think Prefect users could inspect the `target` argument of a task (one target per task for now), or use the direct data dependency from upstream tasks (I don't know how to do this), and build a similar visualization.
d
This is something I was looking at as well. It would be nice to have support for OpenLineage; I raised this feature with @Chris White before. I know that Airflow can be extended with that capability. For me there are two points that are important: 1. Lineage of the flow: input, transformation, output. 2. Inside a task, being able to access a lineage object, like we have with the logger, to augment the information. Should I open an issue about it?
👍 1
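A rough sketch of point 2, a lineage object that task code can reach the same way it reaches a logger. Nothing here is a Prefect or OpenLineage API: `LineageRecorder`, `task_scope`, and `get_lineage` are hypothetical names, and the sketch uses only the standard-library `contextvars` module.

```python
import contextvars
from contextlib import contextmanager

# Holds the recorder for whichever task is currently running (None outside a task).
_current_recorder = contextvars.ContextVar("lineage_recorder", default=None)


class LineageRecorder:
    """Hypothetical per-task object that collects extra lineage facts."""

    def __init__(self, task_name):
        self.task_name = task_name
        self.facts = {}

    def add(self, key, value):
        self.facts[key] = value


@contextmanager
def task_scope(task_name):
    """Stand-in for what a task runner would do around each task run."""
    recorder = LineageRecorder(task_name)
    token = _current_recorder.set(recorder)
    try:
        yield recorder
    finally:
        _current_recorder.reset(token)


def get_lineage():
    """Called from inside task code, like fetching the logger."""
    recorder = _current_recorder.get()
    if recorder is None:
        raise RuntimeError("get_lineage() called outside of a running task")
    return recorder


def load_orders():
    # Task body: augment the lineage record with whatever it knows.
    get_lineage().add("input", "orders.csv")  # made-up dataset name
    get_lineage().add("row_count", 3)
    return [1, 2, 3]


if __name__ == "__main__":
    with task_scope("load_orders") as recorder:
        load_orders()
    print(recorder.task_name, recorder.facts)
```

The runner owns the scope and the task only ever calls `get_lineage()`, so task code stays free of plumbing. How the collected facts would be exported (to Marquez, OpenLineage, or anything else) is left open here.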
f
The biggest problem I see is that every task can in theory load any kind of data from anywhere. So I believe in order to derive lineage there needs to be a convention to use a specific subclass (?) for each system.
d
Yes, I thought about it. We would need some kind of tagging on the task with a type: nothing, source, transform, or output. I expect a source would need to get the input attribute. Can you think of any other information?
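One way the tagging could look, as a sketch only. `LineageRole`, `LineageTag`, and the `lineage_tag` decorator are hypothetical names, not Prefect features, and the dataset strings are placeholders. Requiring a dataset for sources and outputs is an assumption based on the "input attribute" remark above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LineageRole(Enum):
    NONE = "none"
    SOURCE = "source"
    TRANSFORM = "transform"
    OUTPUT = "output"


@dataclass(frozen=True)
class LineageTag:
    role: LineageRole
    dataset: Optional[str] = None  # what a source reads or an output writes


def lineage_tag(role, dataset=None):
    """Attach a LineageTag to a task function without changing its behavior."""
    needs_dataset = role in (LineageRole.SOURCE, LineageRole.OUTPUT)
    if needs_dataset and dataset is None:
        raise ValueError(f"{role.value} tasks must name a dataset")

    def decorator(fn):
        fn.lineage = LineageTag(role, dataset)
        return fn

    return decorator


@lineage_tag(LineageRole.SOURCE, dataset="orders.csv")  # placeholder name
def read_orders():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]


@lineage_tag(LineageRole.TRANSFORM)
def total(rows):
    return sum(row["amount"] for row in rows)


@lineage_tag(LineageRole.OUTPUT, dataset="report.txt")  # placeholder name
def write_report(value):
    return f"total={value}"


if __name__ == "__main__":
    for fn in (read_orders, total, write_report):
        print(fn.__name__, fn.lineage.role.value, fn.lineage.dataset)
```

Untagged tasks would simply be treated as `LineageRole.NONE` by whatever reads the tags.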
f
For our use case I have been thinking about having a subclass for files and one for queries that would take the unparameterized query/filename as an instance parameter and then some kwargs in the run method with the parameters, so that these can be included in the lineage. I might be able to prototype something next week.
👍 2
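A sketch of what that subclass idea might look like. To keep it runnable without Prefect, the base class here is a plain Python class; in a real flow it would presumably subclass Prefect's `Task`. The class names, the `records` list, and the use of `str.format` for templating are all assumptions made for illustration, and no real file or database is touched.

```python
class LineageTask:
    """Hypothetical base: holds an unparameterized template, records each use."""

    kind = "generic"

    def __init__(self, template):
        self.template = template  # e.g. a filename or query with {placeholders}
        self.records = []         # one lineage record per run() call

    def run(self, **params):
        resolved = self.template.format(**params)
        self.records.append(
            {
                "kind": self.kind,
                "template": self.template,
                "params": dict(params),
                "resolved": resolved,
            }
        )
        return self.execute(resolved)

    def execute(self, resolved):
        raise NotImplementedError


class FileLoad(LineageTask):
    kind = "file"

    def execute(self, resolved):
        # Placeholder for the real read; returns a description instead of data.
        return f"would read {resolved}"


class QueryLoad(LineageTask):
    kind = "query"

    def execute(self, resolved):
        # Placeholder only. A real query task should use bound parameters
        # rather than string formatting to avoid SQL injection.
        return f"would run {resolved}"


if __name__ == "__main__":
    orders = FileLoad("data/{date}/orders.csv")
    print(orders.run(date="2021-09-01"))
    print(orders.records[0])
```

Keeping the template on the instance and the parameters in `run` means the lineage record can show both the stable identity of the dataset and the concrete values used in a given run.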
k
Yes @davzucky, this looks like a good conversation to log in GitHub as an issue
b
Hi everyone - I just found this thread. I opened https://github.com/PrefectHQ/prefect/discussions/4935 to discuss the same thing (more recently than this discussion) but I’m very interested in having a go at implementing something. Would anyone be interested in collaborating / providing feedback?
@Florian Kühnlenz @An Hoang @davzucky just tagging you in case you miss this
👍 2
> The biggest problem I see is that every task can in theory load any kind of data from anywhere. So I believe in order to derive lineage there needs to be a convention to use a specific subclass (?) for each system.
I’ve been thinking about this too - I think from Prefect's perspective lineage can only start from when data first hits your Prefect flow. It would be on the user to implement anything upstream from there (likely in a separate way)
👍 1
d
@Brad Thank you for pinging me. I commented on the discussion.
b
OpenLineage has an open ticket for Prefect integration - https://github.com/OpenLineage/OpenLineage/issues/81
👍 2
hey @Kevin Kho - is this something that is on the roadmap for Prefect? I'm going to have a play around with some integration here (in a separate repo) but I don't want to step on anyone's toes
Just an update here for anyone who is following along - I've opened a PR on OpenLineage to add a Prefect integration: https://github.com/OpenLineage/OpenLineage/pull/293. Would be great to hear any thoughts/comments
👍 3