Daniel Manson

03/20/2020, 7:37 PM
Hi all, I came across Prefect for the first time yesterday, have read most of the documentation (I think?), and have had a little bit of a play with it. We are looking at migrating a bunch of ETL stuff from ad-hoc bash scripts to a “proper” workflow tool like Prefect (it certainly seems nicer than Airflow). Before I dive in and start doing everything wrong, are there any examples/talks I could refer to in terms of real-world usage? As an example of the sorts of things I plan on implementing, the first thing I want to try is downloading an 8GB file over SFTP, unzipping it, and importing the resulting files into Postgres (to be clear, this is one piece of a much larger hypothetical DAG). How should I architect this? E.g. should I try to do this whole thing in a single task, or split it up? If I split it up, should I pass 8GB of data (plus more once unzipped) through the return values? If I am to use Docker/k8s, how best to draw lines around things? etc. Thanks!

Chris White

03/21/2020, 7:54 PM
Hi @Daniel Manson, and welcome! Because data pipelines tend to be intimately tied with business processes, there are not many code examples of production usage of Prefect at this time (outside of high level commentary). That being said, your question is a big one - ultimately how you architect it is up to you and your needs. I usually recommend that folks think of Prefect tasks as visible units of business logic that might require independent retries, for which failure is an informative mode, and that have reasonable (and recoverable) inputs and outputs.
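A minimal sketch of how that advice might map onto the SFTP → unzip → Postgres example above, assuming Prefect Core’s functional API; the paths and the download/load bodies are placeholders, and each task passes a small file-path reference downstream rather than the 8GB payload itself:

```python
from datetime import timedelta
from pathlib import Path
import zipfile

from prefect import task, Flow


@task(max_retries=3, retry_delay=timedelta(minutes=5))
def download_via_sftp(remote_path: str) -> str:
    """Fetch the archive to local disk and return its path (retried independently on failure)."""
    local_path = Path("/data") / Path(remote_path).name
    # actual SFTP transfer (e.g. paramiko/pysftp) would go here
    return str(local_path)


@task
def unzip(archive_path: str) -> list:
    """Extract the archive and return the list of extracted file paths."""
    out_dir = Path(archive_path).with_suffix("")
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(out_dir)
    return [str(p) for p in out_dir.iterdir()]


@task(max_retries=2, retry_delay=timedelta(minutes=1))
def load_into_postgres(file_path: str) -> None:
    """COPY one extracted file into Postgres (e.g. via psycopg2 copy_expert)."""
    ...


with Flow("sftp-to-postgres") as flow:
    archive = download_via_sftp("/exports/dump.zip")
    files = unzip(archive)
    load_into_postgres.map(files)  # one retryable task run per extracted file
```

Each boundary above is a point where an independent retry or an informative failure makes sense, which is the rule of thumb Chris describes.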

Daniel Manson

03/23/2020, 2:56 PM
Thanks for coming back to me. I am reading the X-Files tutorial (which I missed earlier). It’s somewhat helpful, but I still wonder if there is maybe some documentation around how different parts of Prefect are designed to scale, maybe a bit like MongoDB’s “limits and thresholds” page: https://docs.mongodb.com/manual/reference/limits/

Chris White

03/23/2020, 2:59 PM
For sure! Interesting: that makes sense for a DB, but because Prefect users can scale their compute environments in many different ways, there aren’t as many restrictions on what sorts of work Prefect Flows can do. The only scaling rule of thumb we have right now is that a mapped task should spawn fewer than 10,000 child tasks.
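One hedged way to stay under that limit when the natural unit of work is much smaller is to map over coarser batches (all names and sizes below are made up):

```python
from prefect import task, Flow


@task
def list_keys() -> list:
    # imagine ~1,000,000 items: far too many to map over one-by-one
    return [f"key-{i}" for i in range(1_000_000)]


@task
def make_batches(keys: list, batch_size: int = 200) -> list:
    """Group keys so the downstream map spawns ~5,000 children instead of 1,000,000."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


@task
def process_batch(batch: list) -> None:
    ...  # handle a whole batch inside a single task run


with Flow("batched-map") as flow:
    batches = make_batches(list_keys())
    process_batch.map(batches)
```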

Daniel Manson

03/23/2020, 3:02 PM
OK, that was one of the things I was thinking about. And what about the size of the values returned by a task: is it really just about how much RAM you want to throw at the problem?

Chris White

03/23/2020, 3:03 PM
Yea, essentially. This is true even when using Prefect Cloud, because no data is sent back to our servers.

Daniel Manson

03/23/2020, 3:10 PM
So if I’m dealing with GBs, then I probably don’t want to use Prefect return values directly; instead I should just store a reference to a file/table. But does the cache system fire some sort of event when I need to delete this external resource?
It looks like I can’t use checkpointing either, because that’s also based around the idea of having the entire result in memory?

Chris White

03/23/2020, 3:13 PM
Prefect never deletes things on your behalf; however, we do have some work coming soon that you might find useful: it will allow you to tie task run states to the existence of data. See https://docs.prefect.io/core/PINs/PIN-16-Results-and-Targets.html
👀 1
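A rough sketch of what that looks like once the Result/target interface from PIN-16 landed in later Prefect Core releases (0.11+); the directory and filename template here are illustrative, and only the small path string returned by the task is checkpointed, not the payload:

```python
from prefect import task
from prefect.engine.results import LocalResult


@task(
    checkpoint=True,
    result=LocalResult(dir="/data/prefect-results"),
    target="{task_name}-{today}",  # if this result file already exists, the run is marked Cached
)
def download_via_sftp(remote_path: str) -> str:
    local_path = f"/data/{remote_path.rsplit('/', 1)[-1]}"
    # actual SFTP transfer would go here
    return local_path  # only this small reference is persisted by the Result
```

With Core alone, checkpointing also has to be switched on via the PREFECT__FLOWS__CHECKPOINTING=true environment variable; Prefect Cloud enables it by default.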

Daniel Manson

03/23/2020, 3:33 PM
Not sure I fully understand that (or the linked PIN-2 and PIN-4), but I was hoping there would be something like a “destructor” method on the result object (interface), so that you know precisely when to free up the external resources. Of course, if shutdown isn’t graceful you might miss a destructor call, in which case you might need some kind of persisted record of which destructors have been called. Not sure if you were already thinking along these lines at all?

Chris White

03/23/2020, 3:40 PM
Hmmm, interesting idea. Honestly, no: this is the first time it has come up, and I’m not sure when users would expect such a destructor method to be called within Prefect’s pipeline. What would your expectation be here?

Daniel Manson

03/23/2020, 3:41 PM
when no downstream tasks or cache rules require it to still exist

Chris White

03/23/2020, 3:44 PM
I’m not sure how we would ever be able to detect such a scenario - downstream tasks expect upstream tasks to produce data, but whether that data was generated from a cache or not is irrelevant from the downstream task’s perspective

Daniel Manson

03/23/2020, 3:46 PM
Suppose we have task A => task B. Once task B has completed successfully, so long as task A hasn’t specified some kind of caching, it’s OK to free A’s result, no? If A does specify caching, then you just need to know when the caching conditions are no longer met (plus the above condition relating to B).
Of course, I obviously don’t have a particularly good grasp of how things work in Prefect, so I could be way off here.
I mean, presumably you already deal with this kind of thing, just in memory; all I’m asking for is a hook to destroy external resources at the point where you would have destroyed an in-memory resource.

Chris White

03/23/2020, 5:24 PM
Hmm, yea, that’s interesting; I’m not sure there’s a way to implement that exactly as you describe, because it implicitly introduces a dependency of Task A on Task B’s successful completion, but we could definitely explore exposing a delete method on the result interface that users can manually call (or call within a state handler if they want).
🤔 1
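A hedged sketch of the “call it from a state handler” option: nothing below is a built-in Prefect delete API, the path is a made-up agreed location, and the handler simply removes the upstream artifact once the downstream task finishes successfully:

```python
import os

from prefect import task

SCRATCH_ARCHIVE = "/data/dump.zip"  # hypothetical intermediate artifact written by an upstream task


def cleanup_on_success(task_obj, old_state, new_state):
    """State handler: once this task succeeds, remove the upstream scratch file."""
    if new_state.is_successful() and os.path.exists(SCRATCH_ARCHIVE):
        os.remove(SCRATCH_ARCHIVE)
    return new_state


@task(state_handlers=[cleanup_on_success])
def load_into_postgres(file_path: str) -> None:
    ...  # load the extracted file; the handler then cleans up the archive
```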

Daniel Manson

03/23/2020, 7:20 PM
Well thanks for the help. I will be getting more stuck into it this week, I hope.

Chris White

03/23/2020, 7:55 PM
Anytime! Let us know if you have additional questions once you get into the weeds; we’re always happy to help.