Laura Lorenz (she/her)

03/27/2020, 7:02 PM
Just a reminder I’ll be at the Cantina (aka on a Google Hangout) in an hour if you want to come chat about Core roadmap/PRs/issues/etc! The hangout link is: https://meet.google.com/dok-gvpz-wxn

manesioz

03/27/2020, 8:05 PM
I won't be able to make this one but I hope this will be a regular thing!
:upvote: 1

Alex Cano

03/27/2020, 8:11 PM
Same, unfortunately had a scheduling conflict, but would love to come to future ones!

Alexander Hirner

03/27/2020, 8:56 PM
Thanks for hosting that hangout. We at moonvision.io are in early testing. So far our hotspots for "prefectizing" (tm Laura) computer-vision-related tasks are:
• async/video stream processing (might overlap with PIN-14 and @Alex Cano)
• depth-first or sub-flow execution to avoid OOM from decoded images
• checkpointing/serializing training loops using pytorch (rough sketch below)
It's still very early and all over the place for us, but I'm sure I'll chime in on specific topics!
❤️ 4
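
To make the checkpointing point above concrete, here is a minimal sketch of a Prefect task that saves a pytorch training checkpoint. The model, optimizer, epoch, and path are placeholders for illustration, not code from this thread:

```python
# Rough sketch only: a Prefect task that checkpoints a pytorch training loop.
# The model/optimizer/epoch arguments and the checkpoint path are placeholders.
import torch
from prefect import task

@task
def train_one_epoch(model, optimizer, epoch, checkpoint_path="/tmp/checkpoint.pt"):
    # ... the actual training loop over batches would run here ...
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        checkpoint_path,
    )
    return checkpoint_path  # downstream tasks can resume from this file
```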

Laura Lorenz (she/her)

03/27/2020, 8:59 PM
Ya @Alex Cano we have another streaming flows fan ^

Alex Cano

03/27/2020, 9:02 PM
Wooo!!! Regardless of what direction the streaming thing goes in, I’d love to follow along on it or help with some of the code changes!

Alexander Hirner

03/27/2020, 9:08 PM
I'm working on a prototype right now which is very specific; it needs to cover very low-latency scenarios (20fps and no serialization), so it may only be loosely compatible with the "web world" of things. What's interesting, and could be common ground, is streamz, a lightweight pet project of Matthew Rocklin's that got some support from Nvidia lately. Its control flow concepts are principled and there is a dask adapter. https://github.com/python-streamz/streamz
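
For anyone who hasn't seen streamz, a toy sketch of the control-flow primitives mentioned above (map, accumulate, sink); the "frame" strings are placeholders and this is not the prototype being described:

```python
# Toy streamz pipeline illustrating map/accumulate/sink; not the actual prototype.
from streamz import Stream

source = Stream()

# keep a running count of emitted "frames" and print each update
counts = source.map(lambda frame: 1).accumulate(lambda total, x: total + x, start=0)
counts.sink(print)

for i in range(5):
    source.emit(f"frame-{i}")  # stand-in for decoded video frames
```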

Alex Cano

03/27/2020, 9:23 PM
Just so I’m understanding what you’re aiming for here, it’d be stream processing of a batch job, correct? And not a stateful processor? It’d be interesting to see what kind of abstractions could be built on top of a system that has a notably batch input, but with a goal of streaming. I know the Apache Beam programming model abstracts over both streaming and batch processing, but when evaluating it for work it felt a bit clunky for both.

Since data would be streaming through a pipeline, there’d probably need to be a notion of watermarking processed items, potentially caches that need to be updatable (rocksdb, anyone?), and ways to deal with any wonkiness that may come from combining at-least-once processing with non-idempotent stream processors.

Adding all of that together, from a user perspective it’d be great to be able to specify a configuration on the flow runner saying whether to run in streaming or batch mode, though that would probably lead to a very different API. It’s a little removed from Prefect, but have you checked out Ray? https://ray.readthedocs.io/en/latest/
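
Purely to make the "streaming or batch mode" switch concrete, a hypothetical sketch; the commented-out execution_mode argument is not a real Prefect parameter, only the shape of the idea being discussed:

```python
# Hypothetical only: Prefect has no execution_mode switch; this just illustrates
# the "same flow, different runner mode" idea from the message above.
from prefect import Flow, task

@task
def double(x):
    return x * 2

with Flow("toy-flow") as flow:
    double(21)

flow.run()  # today's batch-style run
# flow.run(execution_mode="streaming")  # imagined API, not real
```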

Alexander Hirner

03/27/2020, 9:34 PM
Yes, I've checked out Ray: it breaks new ground, as Berkeley projects often do, but puts too much emphasis on distributed computing for us. We're aiming for streaming, time-window batches, and "lean" actors (e.g. https://streamz.readthedocs.io/en/latest/api.html#streamz.accumulate ). To me that would naturally be a prefect Task, but yes, neatly defined in terms of a SubFlow with some higher-level breakpoints. This would somewhat bring the two modes together (re "specify a configuration on the flow runner to say whether to run in streaming or batch mode"). Conversely, your use case might be covered by faust? It goes to rocksdb for aggregations, as far as I recall. https://faust.readthedocs.io
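
Since faust comes up here, a rough sketch of the kind of table-backed aggregation being referred to, based on the faust docs; the app, topic, and record names are made up, and a running Kafka broker is assumed:

```python
# Rough faust sketch of a table-backed aggregation; names are made up and a Kafka
# broker at localhost:9092 is assumed. Run with the faust worker CLI.
import faust

app = faust.App("demo-app", broker="kafka://localhost:9092")

class Event(faust.Record):
    key: str
    value: int

events_topic = app.topic("events", value_type=Event)
totals = app.Table("totals", default=int)

@app.agent(events_topic)
async def aggregate(events):
    # group by key so each partition owns its slice of the table
    async for event in events.group_by(Event.key):
        totals[event.key] += event.value  # aggregated state lives in the table
```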

Alex Cano

03/27/2020, 9:53 PM
It’s a little less of a “use case” on my end, and more me generally thinking about what it would take to bring a stream processing mindset into Prefect and its mental model. Streaming is all fun and games when everything is idempotent, but I was giving a bit of thought to what happens if you have a non-idempotent prefect task that crashes for whatever reason; there probably needs to be some kind of recovery or expected behavior. I think faust is super neat and covers a lot of the functionality you’d need from a streaming system, but Kafka is a huge dependency. I love Kafka, but having to spin up a kafka cluster to execute a prefect flow might be a bit too much to ask 😅. Who knows though, I don’t know the resource usage of a single-node zookeeper process and a single-node kafka process, so maybe it’s less intrusive than I think it might be?
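
As a reference point for the "recovery or expected behavior" question, a minimal sketch of what Prefect already offers for the batch case, task-level retries; whether retries are enough for non-idempotent streaming work is exactly the open question above:

```python
# Minimal sketch of Prefect's existing batch-style recovery: task-level retries.
# A streaming equivalent would likely need more (offsets, watermarks, dedup), per the
# discussion above.
from datetime import timedelta
from prefect import Flow, task

@task(max_retries=3, retry_delay=timedelta(seconds=10))
def flaky_side_effect(item):
    # a non-idempotent step (e.g. a write to an external system) that may crash;
    # retries simply rerun it, which is only safe if the write is made idempotent
    return item

with Flow("retry-example") as flow:
    flaky_side_effect(1)
```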

Alexander Hirner

03/27/2020, 10:22 PM
Kafka is for sure too far off for us as well; there might be a core that works without it, but I didn't check. Thanks for giving me so much food for thought. This topic is best explored from several angles.

Alex Cano

03/27/2020, 10:25 PM
Agreed, there are way too many different use cases and angles for a one-size-fits-all perfect solution. Would love to hear what you end up going with to solve your problem.