Preston Marshall

02/06/2020, 10:58 PM
Also is there a way to pass streams across task boundaries? I'd like to stream data from sftp to GCS if possible, that way it doesn't need to be all downloaded to disk first

Chris White

02/06/2020, 11:21 PM
Hi Preston - this is not possible; as a workflow tool, Prefect has a strong concept of dependency between tasks. In this case, a Task must be completely finished before a downstream dependency can begin running. In your case, you might consider combining your two tasks into a single task to support the streaming behavior you’re after
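
The single-task approach suggested above can be sketched roughly as follows. This is only a minimal illustration, not Prefect's or the thread authors' implementation: `stream_copy` is a hypothetical helper that copies between any two file-like objects in fixed-size chunks, so only one chunk is held in memory at a time. In a real flow the function body would presumably live inside one Prefect task, with `src` being an open SFTP file handle and `dst` a writable handle to the GCS object; both of those are assumptions and are replaced here by in-memory buffers.

```python
import io


def stream_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst chunk by chunk and return the number of bytes copied.

    src and dst are any binary file-like objects (hypothetical stand-ins for
    an SFTP file handle and a GCS upload handle).
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory buffers standing in for the remote endpoints.
source = io.BytesIO(b"x" * 2500)
sink = io.BytesIO()
copied = stream_copy(source, sink, chunk_size=1000)
```

Because the download and the upload happen inside the same function, nothing has to cross a task boundary and the file never needs to be fully written to disk or held in memory.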

Preston Marshall

02/06/2020, 11:29 PM
Gotcha, that's what I landed on. I'm trying to just pull the file down and send it up to GCS using the GCSUpload task, and it seems like it expects the whole file as a string? That seems like it would cause a lot of problems, serializing multi-gigabyte files and sending them over the wire. Am I missing something?

Chris White

02/06/2020, 11:30 PM
No, you’re correct; that task is probably better as a template than something that should be used for large datasets

Preston Marshall

02/06/2020, 11:35 PM
got it, thanks
as far as data locality, I can only depend on a task having access to the same filesystem, right? so outside of those boundaries all bets are off

Chris White

02/06/2020, 11:39 PM
Actually that largely depends on what environment / executor you execute with; for example, if you ran your Flow on a Dask cluster your tasks would all run on different machines; if you use the LocalExecutor and run your flows in a non-dockerized environment, then all tasks will run in the same process on the same machine, so it's relatively easy to reason about locality in that scenario.
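
To make the locality point concrete, here is a small sketch using plain Python functions as hypothetical stand-ins for two tasks (`download` and `upload` are illustrative names, not Prefect APIs). The first returns only a file path, and the second reads that path. This works when both run on the same machine, as in the same-process case described above, but the path would mean nothing to a worker on a different machine.

```python
import os
import tempfile


def download(workdir):
    """Stand-in for a download task: writes a file and returns only its path."""
    path = os.path.join(workdir, "payload.bin")
    with open(path, "wb") as f:
        f.write(b"example bytes")
    return path


def upload(path):
    """Stand-in for an upload task: reads the file by path, returns its size."""
    with open(path, "rb") as f:
        return len(f.read())


# Both functions run in one process, so the path is valid for both of them.
with tempfile.TemporaryDirectory() as workdir:
    size = upload(download(workdir))
```

When tasks may land on different machines, passing a reference that every worker can resolve (for example an object-store location) is the safer assumption than passing a local path.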