Hey, another quick question: we would like to export data lineage metadata from Prefect to our Data Catalog - is there an API from which we can pull such metadata?
Kevin Kho
04/08/2021, 5:32 PM
Hey @Remi Paulin, do you mean that when a flow runs, you want to keep track of its metadata (which tasks ran?) and store it in your Data Catalog? What format would the data catalog need it in?
Remi Paulin
04/08/2021, 5:44 PM
Yes! Any format should be fine, actually. We're thinking of using Data Galaxy as our Data Catalog, and the metadata would need to be parsed before being ingested into Data Galaxy anyway.
Kevin Kho
04/08/2021, 5:47 PM
Ah ok so we have a GraphQL API where you can query your flows and task runs. Assuming they are named descriptively, you can pull out the data, parse it, and feed it into the data catalog.
Kevin Kho
04/08/2021, 5:48 PM
I think what this would look like is a separate Python script that queries your Flow data, parses it, and uploads it to Data Galaxy (maybe even as another Prefect flow).
👍 1
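A minimal sketch of such a script, for illustration only. It assumes the Prefect Cloud GraphQL endpoint at `https://api.prefect.io/graphql`, bearer-token authentication, and the field names shown in the query (`flow_run`, `task_runs`, `task { name }`); check these against the schema in your own Interactive API tab before relying on them. The helper names `fetch_flow_runs` and `to_catalog_records` are hypothetical, and the upload step to Data Galaxy is left out because its ingestion format is not discussed here.

```python
import json
import urllib.request

# Assumption: Prefect Cloud endpoint; a self-hosted server would use its own URL.
API_URL = "https://api.prefect.io/graphql"

# Assumption: these field names match your API schema; verify before use.
QUERY = """
query {
  flow_run(limit: 10, order_by: {start_time: desc}) {
    id
    name
    state
    start_time
    end_time
    flow { name }
    task_runs {
      state
      start_time
      end_time
      task { name }
    }
  }
}
"""


def fetch_flow_runs(api_token, url=API_URL):
    """POST the GraphQL query and return the decoded JSON response."""
    payload = json.dumps({"query": QUERY}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth with a Prefect API token.
            "Authorization": f"Bearer {api_token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_catalog_records(response):
    """Flatten the nested response into one record per task run."""
    records = []
    for run in response["data"]["flow_run"]:
        for task_run in run["task_runs"]:
            records.append(
                {
                    "flow": run["flow"]["name"],
                    "flow_run_id": run["id"],
                    "flow_run_name": run["name"],
                    "task": task_run["task"]["name"],
                    "state": task_run["state"],
                    "start_time": task_run["start_time"],
                    "end_time": task_run["end_time"],
                }
            )
    return records
```

Calling `to_catalog_records(fetch_flow_runs(token))` would give a flat list of dictionaries that can be serialized to whatever format the catalog ends up needing.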
Kevin Kho
04/08/2021, 5:49 PM
Out of curiosity, do you know how detailed you want your lineage to be? Does it need to specify that “these 3 data sources were joined and filtered to produce this dataset”? Or is it more about carrying over the schema and descriptions from the original columns?
Remi Paulin
04/08/2021, 5:55 PM
OK, amazing, I'll check this out. Ideally we'd like quite detailed lineage, as in the example you mentioned (the joins and filters involved). But from what I understand, since Prefect Cloud doesn't have much visibility into the actual logic, this might not be easy to implement. Using Prefect with dbt, for instance, might let us retrieve such detailed metadata (only for the flows governed by dbt, of course).
Kevin Kho
04/08/2021, 6:03 PM
I think it can be implemented if your Tasks are well named. It would also help if there is a relatively clean “separation of concerns” with your data engineering. You are right though that we actually don’t see the data (with our Hybrid Model).
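To illustrate what "well named" could buy you, here is a sketch that derives lineage edges from task names alone. The naming convention `<operation>:<source>+<source>-><target>` is purely hypothetical (it is not a Prefect feature), as are the pattern and the `lineage_edges` helper; any convention your team agrees on would work the same way.

```python
import re

# Hypothetical convention: "<operation>:<source>[+<source>...]-><target>",
# e.g. "join:orders+customers->orders_enriched".
TASK_NAME_PATTERN = re.compile(
    r"^(?P<operation>\w+):(?P<sources>[\w.+]+)->(?P<target>[\w.]+)$"
)


def lineage_edges(task_name):
    """Return one source-to-target edge per input dataset named in the task."""
    match = TASK_NAME_PATTERN.match(task_name)
    if match is None:
        # Task names that do not follow the convention carry no lineage.
        return []
    return [
        {
            "source": source,
            "target": match["target"],
            "operation": match["operation"],
        }
        for source in match["sources"].split("+")
    ]
```

Under this convention, a task named `join:orders+customers->orders_enriched` yields two edges into `orders_enriched`, while a task such as `send_slack_alert` yields none.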
Remi Paulin
04/08/2021, 6:07 PM
OK, got it. I definitely need to think about this more. Thanks again for your help!!