Antony Southworth
07/28/2020, 2:15 AM
`PythonOperator`), but generally you don't want them too small or else things get sluggish due to scheduler overhead.
2. How does Prefect handle data for mapped tasks? For example, if I have a flow `t = lambda x: x + 1`, `r1 = t.map(range(10))`, `r2 = t.map(r1)`, my understanding is that Prefect would distribute the computation of `r1` across the workers, then collect the results into a list on the "central" machine (where `flow.run()` was invoked), then serialise each element again and send the elements to the workers in order to distribute the computation of `r2`. This seems a bit inefficient: we compute the results on the worker, serialise, send back to the central scheduler, deserialise, then serialise again, send back to a worker, deserialise, and then do more work.
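To make the concern concrete, here is a minimal stand-in for that round trip, using `pickle` purely as an illustration (Prefect's actual serialization and scheduling machinery is different; `simulate_map` is a hypothetical helper, not a Prefect API):

```python
import pickle

def t(x):
    return x + 1

def simulate_map(func, items):
    """Illustrate the described round trip: each input is serialised and
    shipped to a worker, the worker computes, and each result is serialised
    back to the central machine and deserialised there."""
    results = []
    for item in items:
        payload = pickle.dumps(item)           # input shipped to a worker
        value = func(pickle.loads(payload))    # worker computes
        # result serialised on the worker, deserialised centrally
        results.append(pickle.loads(pickle.dumps(value)))
    return results

r1 = simulate_map(t, range(10))  # [1, 2, ..., 10]
r2 = simulate_map(t, r1)         # [2, 3, ..., 11] -- every element crossed the wire twice
```

The point is only that each element is serialised and deserialised twice between the two map stages, which is the overhead the question is asking about.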
3. How do folks use `Task.map` in practice? For example, would it be weird to run `Task.map` for each row in a dataframe? I guess this is related to the first question; basically, "how small is too small for tasks/`Task.map`?"
4. Is there any way to "re-run" old instances of tasks? E.g. if I had a scheduled process running daily and I needed to re-run the one from Wednesday last week, is there a convenient way of doing that in the UI? I guess I'm basically asking "can you do `airflow backfill`?"
5. How do people handle returning large results from tasks? For example, if I have a task in my flow that returns a 1GB dataframe, I don't really want that to be serialised and sent back to the "central" machine (where `flow.run()` was invoked), because that machine might have other work to do (and I don't want to waste time on ser/de, or on loading the entire thing into memory just to serialise it and send it to another worker). For Airflow, I usually store things on S3 and use XCom to tell other tasks where the result was stored. Would the "Prefect way" of doing this be for my task to upload its results to S3 and return the path as the return value of the task?
6. Is there any way to run flows in a "streaming" or asynchronous fashion? For example, if my task could `yield` rows from a dataframe, and the downstream tasks accepted an `Iterator`. Again, just thinking in terms of memory usage, it would be nice not to require the entire dataframe to be loaded in memory.
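In plain Python, the desired shape is ordinary generator chaining; the sketch below (with made-up functions, not Prefect tasks) shows how downstream stages can consume an iterator lazily so only a few rows are ever in memory at once:

```python
import itertools

def read_rows(n):
    # Hypothetical source: yield rows one at a time instead of
    # materialising a whole dataframe.
    for i in range(n):
        yield {"id": i, "value": i * 2}

def double_values(rows):
    # A downstream "task" that accepts an iterator and stays lazy.
    for row in rows:
        yield row["value"] * 2

# Only three rows are ever materialised, even though the source
# claims a billion rows.
pipeline = double_values(read_rows(10**9))
first_three = list(itertools.islice(pipeline, 3))  # [0, 4, 8]
```

The open question in the thread is how to get this generator-style laziness across Prefect task boundaries, since task results are materialised between tasks.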
Apologies if these are all covered in the docs; I thought I got through them pretty thoroughly, but it's possible I may have missed some parts.

Chris White
`yield` would violate. If you search Slack / GitHub you'll find many instances of people talking about how to incorporate streaming into Prefect - it's a hot topic of discussion right now.

Antony Southworth
07/28/2020, 3:04 AM