Henning Holgersen

04/10/2023, 2:48 PM
There is a chance I’m kicking in open doors and reinventing the wheel, but a few weeks ago I talked with @Will Raphaelson about blocks, observability, lineage, events etc., and it sparked a thought. So I have made a proof-of-concept Snowflake block with integrated data lineage. There is a lot of rambling in my readme, but the TL;DR of my hypothesis is:
- Blocks are classes, and can contain a lot of logic.
- Lineage in Prefect should be thought of as a series of events.
- Marquez is able to receive “partial” lineage records, which corresponds well to events.
So... a Marquez endpoint + a block with some extra logic = lineage sent to Marquez more or less automatically, simply by using the block (the usual asterisk applies).
Please note that this is only a proof-of-concept: buggy, inelegant, underdocumented, full of strange code, and intended only as a way of thinking out loud through code. It might be a starting point though, and any thoughts, ideas etc. are welcome. https://github.com/radbrt/eventbased-lineage
🙌 12
blob attention gif 2
🎉 8
🦜 2
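To make the hypothesis above concrete, here is a minimal sketch of the pattern, not the code from the linked repository. It assumes a Marquez instance accepting OpenLineage events at `/api/v1/lineage` on a local default port. `LineageEmitter`, `LineageSnowflakeSketch`, `read_table` and `write_table` are hypothetical names, the event dictionaries carry only a simplified subset of the OpenLineage run event fields, and using the `OTHER` event type for mid-run records is an assumption. A real block would subclass Prefect's `Block` and actually run queries.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

# Assumed local Marquez endpoint; adjust host and port for a real deployment.
MARQUEZ_LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def post_to_marquez(event, url=MARQUEZ_LINEAGE_URL):
    """POST one event as JSON. Pass this as `send` to talk to a real server."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


class LineageEmitter:
    """Hypothetical helper that sends one partial lineage record per call."""

    def __init__(self, namespace, job_name, send=None):
        self.namespace = namespace
        self.job_name = job_name
        self.run_id = str(uuid.uuid4())  # shared by every event of this run
        self.sent = []
        # By default events are only recorded locally, so the sketch runs offline.
        self._send = send if send is not None else self.sent.append

    def emit(self, event_type, inputs=(), outputs=()):
        event = {
            "eventType": event_type,
            "eventTime": datetime.now(timezone.utc).isoformat(),
            "run": {"runId": self.run_id},
            "job": {"namespace": self.namespace, "name": self.job_name},
            "inputs": [{"namespace": self.namespace, "name": n} for n in inputs],
            "outputs": [{"namespace": self.namespace, "name": n} for n in outputs],
            "producer": "https://example.com/hypothetical-lineage-block",
        }
        self._send(event)
        return event


class LineageSnowflakeSketch:
    """Stand-in for a block: every data access also emits a lineage event."""

    def __init__(self, emitter):
        self.emitter = emitter
        self.emitter.emit("START")

    def read_table(self, table):
        self.emitter.emit("OTHER", inputs=[table])
        return []  # a real block would execute the query here

    def write_table(self, table, rows):
        self.emitter.emit("OTHER", outputs=[table])

    def complete_run(self):
        self.emitter.emit("COMPLETE")
```

Because every event shares one `runId`, the receiving side can stitch the partial records into a single run, which is the property the hypothesis relies on.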

Will Raphaelson

04/10/2023, 2:57 PM
Woah! This looks awesome. I’ll check this out later today.

Brad

04/11/2023, 3:07 AM
Adding a note that I'm quite interested in this space, and willing to contribute. @Henning Holgersen I looked at Marquez/OpenLineage and OpenMetadata and was thinking that OpenMetadata was a bit nicer. Did you look into alternatives and (if you or anyone else cares to comment) what appealed to you most about Marquez?

Henning Holgersen

04/11/2023, 4:37 AM
There was no deep thought behind the choice initially; the OpenLineage spec is clearly documented and I just happened to start there in a different project. I hope we can make this multi-format.
👍 1
But that said, the ability to send events (or “partial” records) is kind of key, and might be a limitation in what we can use.

Brad

04/11/2023, 4:40 AM
Agreed ^

Will Raphaelson

04/11/2023, 5:48 PM
this is super cool, and sure, it’s not finished, but it solves a real problem and i enjoyed reading through it. I think the relationship between the orchestrators and observability tools (lineage, but also catalogues, APM, log aggregators) is one of the least solved or settled relationships in the data space right now. We’ve some hypotheses here at Prefect that the two are super complementary, and we’re building out our observability features with this hypothesis front of mind. you’ll see we actually record all block method calls now, so you can see them in your event feed. In the short term these features are comparatively broad - you’ll be able to see and analyze and make automations in response to events, but deducing a dependency tree across events (necessary for real lineage use cases) is somewhat far off. All that to say, love this, we’re thinking about some of the same problems, and let’s continue to chat as we build out our respective projects 🙂

Henning Holgersen

04/11/2023, 6:28 PM
I knew extending observability was on your roadmap, and it is of course difficult to guess how this is going to fit with what everybody else is doing. And I see the block events pop up nicely in the events feed. Is the next step to connect it to flow runs? I will try to summarize a little more later, but a few thoughts for now:
In Prefect, it would be great to have a more consistent context object, so that I could always find the flow name and flow id in the same place no matter if it is being called from a task or from a flow. This would be a great way to tie everything together. For all I know this might be an incredibly difficult thing, or it might be near trivial. I have no idea, so I’m just mentioning it.
The bigger nut to crack is to find a replacement for the `complete_run` method I call at the end to send a message that the flow is over. I mentioned something about the events API as a possible replacement, but I don’t think that would be a good solution. Ideally, maybe some kind of post-hook system? I have no idea. Perhaps making developers call that method isn’t too bad, and we can live with it.
From my side, I really like the idea of a kind of lineage-block SDK, providing a pattern and common functions that can be adapted for specific third-party services. That is where my mind will be going forward.
Lastly, I am also pondering data lineage vs just surfacing dependencies. This is a very vague thought, but it came from my realization that a flow can reach out to databases/servers/APIs and have a real dependency without any real data being involved. Not in the “data lineage” sense anyway. We might want to flesh out such relationships as well, but I don’t think there is anything at all in that space. Marquez and friends are focused on datasets, with schemas and such; trying to ram non-dataset stuff into that probably won’t work well.
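One lightweight way to avoid relying on developers remembering the final `complete_run` call is to wrap the block in a context manager. The sketch below is an assumption-laden illustration, not something from the repository: `lineage_run` and `RecordingBlock` are hypothetical names, and it assumes `complete_run` takes no arguments.

```python
from contextlib import contextmanager


@contextmanager
def lineage_run(block):
    """Yield the block and call its complete_run() when the body exits.

    The `finally` clause runs even if the body raises, so the end-of-run
    message is sent on failures too. A real implementation might want to
    send a different event in that case.
    """
    try:
        yield block
    finally:
        block.complete_run()


class RecordingBlock:
    """Hypothetical stand-in for a lineage block; records which methods ran."""

    def __init__(self):
        self.calls = []

    def query(self, sql):
        self.calls.append("query")

    def complete_run(self):
        self.calls.append("complete_run")
```

Usage would look like `with lineage_run(my_block) as block: block.query("select 1")`, with the completion message sent when the `with` body ends.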

Will Raphaelson

04/12/2023, 3:55 PM
> I knew extending observability was on your roadmap, and it is of course difficult to guess how this is going to fit with what everybody else is doing. And I see the block events pop up nicely in the events feed. Is the next step to connect it to flow runs?
Yeah you should see flows/tasks as related resources on block events at this point
> In Prefect, it would be great to have a more consistent context object, so that I could always find the flow name and flow id in the same place no matter if it is being called from a task or from a flow. This would be a great way to tie everything together. For all I know this might be an incredibly difficult thing, or it might be near trivial. I have no idea, so I’m just mentioning it.
We intend to expand this context object pretty readily, any use cases or objects in particular you’d like added?
> From my side, I really like the idea of a kind of lineage-block SDK, providing a pattern and common functions that can be adapted for specific third-party services. That is where my mind will be going forward.
Yeah, we’re focused on a more or less auto-instrumented approach in the near future, both of blocks and potentially all executed code. @Chris White and I have noodled on an `@autoinstrumented` decorator you could put on a class or function that would ping out to Prefect, but eventually pinging other specific providers isn’t an unreasonable thought.
> Lastly, I am also pondering data lineage vs just surfacing dependencies…
This is a super important thought that I/we share. A core hypothesis of the observability roadmap is that existing lineage solutions are necessary but not sufficient to really understand a data-intensive stack. The vision for events with their primary and related resources is that it can be the place where we map out this looser dependency tree that’s not quite data, not quite process, it’s something squishier, and we think instrumenting blocks is a good first step to see what this graph might look like.
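As a rough illustration of what a decorator like the `@autoinstrumented` idea mentioned in this message could look like, here is a purely hypothetical sketch. Nothing here is a Prefect API: `autoinstrumented`, `emit` and the in-memory `EVENTS` list are invented for the example, and the event shape (an event name, one primary resource, a list of related resources) only loosely mirrors the description above.

```python
import functools
import inspect

EVENTS = []  # stand-in sink; a real version would send events to a server


def emit(event, resource, related=()):
    """Record one event with a primary resource and optional related ones."""
    EVENTS.append({"event": event, "resource": resource, "related": list(related)})


def _wrap(func, label):
    """Return a wrapper that emits an event before delegating to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        emit(event=f"{label}.called", resource={"id": label})
        return func(*args, **kwargs)

    return wrapper


def autoinstrumented(obj):
    """Instrument a function, or every public plain method of a class."""
    if inspect.isclass(obj):
        for name, member in list(vars(obj).items()):
            # Skip private names; static and class methods are left untouched.
            if inspect.isfunction(member) and not name.startswith("_"):
                setattr(obj, name, _wrap(member, f"{obj.__name__}.{name}"))
        return obj
    return _wrap(obj, obj.__name__)


@autoinstrumented
class Warehouse:
    """Hypothetical class whose public method calls become events."""

    def load(self, table):
        return f"loaded {table}"

    def _connect(self):
        return "connected"
```

Each call to `Warehouse().load(...)` would append one event to `EVENTS`, while `_connect` stays silent. Events like these could then be the raw material for the looser dependency graph described above.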

Chris White

04/12/2023, 3:58 PM
Minor side note: `prefect.runtime` is a candidate for the universal context interface you're looking for @Henning Holgersen - I don't think we currently have the two fields you explicitly mentioned (fields are easy to add), but for example `prefect.runtime.flow_run.id` can be accessed anywhere that there is an overriding known flow run ID (for example, you can reference this ID outside of either a flow or a task function so long as the script is being run via a deployment!)
👀 1
👍 1
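A small sketch of how a lineage block might use `prefect.runtime.flow_run.id` to tie its events to a flow run. The helper names `current_flow_run_id` and `lineage_run_id` are hypothetical, and the fallback to a random UUID when no flow run is known is an assumption made for the example, not Prefect behavior. The import is done lazily so the snippet also runs where Prefect is not installed.

```python
import uuid


def current_flow_run_id(flow_run=None):
    """Return the known flow run ID, or None if there is none.

    `flow_run` can be any object with an `id` attribute. When omitted,
    the function tries `prefect.runtime.flow_run`.
    """
    if flow_run is None:
        try:
            from prefect.runtime import flow_run
        except ImportError:
            return None  # Prefect is not installed in this environment
    return flow_run.id or None


def lineage_run_id(flow_run=None):
    """Use the flow run ID as the lineage run ID, else a fresh UUID (assumption)."""
    return current_flow_run_id(flow_run) or str(uuid.uuid4())
```

Reusing the flow run ID as the lineage run ID would mean task-level and flow-level calls to a block end up under the same run without passing anything around explicitly.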

Henning Holgersen

04/12/2023, 7:39 PM
Wow, the `prefect.runtime` object was exactly what I wanted. Thanks!

Will Raphaelson

04/12/2023, 7:40 PM
yeah sorry about that, thought this is what you were referring to by saying context object, and that you wanted it to have more. glad this looks like what you need!

Henning Holgersen

04/12/2023, 7:41 PM
It is the “what you see is all there is” phenomenon I guess. I knew about the context object, and it didn’t really occur to me to look elsewhere.