d
hi - this is quite a basic question I think, but I haven't found a clear answer so far... Is the recommendation to run Prefect flows against a shared filesystem, or is the only persistence between tasks supposed to be explicit use of S3 etc.? Some hints that a shared filesystem is expected are the filesystem tasks and the multiple mentions of EFS in the (deprecated) Fargate Agent, plus I see that, according to AWS themselves, ECS (which Prefect now uses instead of raw Fargate) is intended to have EFS plugged in. However, there are a few mentions of files not being persisted between tasks, including this Dask page. In general it seems that a lot of tasks will be a bit painful/impossible to use if you don't have a shared filesystem, unless you want to pass large files through Prefect Task return values. Thanks!
k
Hey @Daniel Manson, I'm not 100% sure I follow, but I think I know what you mean.
1. You're very welcome to explicitly use S3 and pass the URL to downstream tasks to load the persisted data.
2. Prefect has checkpointing, which stores the return value as a `Result`. This is needed to restart a flow from the point at which it failed. You can turn checkpointing off if you don't want it.
3. You can check the Results page; you can use the `Result` class explicitly like this as an easier way to write data, or the pipeline can persist it for you like this.
4. If combining with mapping, you can template result names (see the sketch below).
5. If you are using ECS, you can add that EFS volume yourself to the container, but to be honest I see S3 used more.
With regards to the Dask link you sent, of course writing to the local filesystem is not recommended in a distributed setting. I see GCS/S3 used more, but you might be able to add a volume across the Dask cluster (not sure). Prefect ultimately adjusts to you, and anything you can do in Python, you can do in Prefect. Some users just do everything on a local machine with local storage, and that ultimately works for them. I suppose the question of a shared filesystem comes up more with Dask, and I think the answer there is that you can probably get it working by mounting something to the Dask cluster.
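To make point 4 concrete, a minimal sketch of templated result names under mapping - untested, the bucket name is a placeholder, and it assumes Prefect 1.x's templated result locations with checkpointing enabled:

from prefect import task, Flow
from prefect.engine.results import S3Result

# Each mapped child fills the location template from prefect.context,
# so {map_index} gives every child its own result file (0, 1, 2, ...).
# "my-bucket" is a made-up name; checkpointing must be enabled for
# results to actually be written (it is by default on Cloud/Server).
@task(result=S3Result(bucket="my-bucket", location="{task_name}-{map_index}.txt"))
def double(x):
    return x * 2

with Flow("templated-results") as flow:
    double.map([1, 2, 3])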
Does that help?
d
Thanks Kevin. In summary, that sounds like a pretty clear "no" to my question, which is not a surprise, but I do find it a bit confusing still... I understand that you can explicitly store/retrieve data from GCS/S3, but the advantage of a filesystem is that there's no additional step needed each time: if a file was put there by one task, another task can just access it again with no additional "fuss". In particular, I can't quite see the value of the Filesystem tasks if there isn't a shared filesystem between tasks, because changes you make to that filesystem will not be visible across tasks. Also, the S3Upload/S3Download tasks don't make a great deal of sense as single tasks if a single task actually needs to do all of {download, execute something, upload}. I understand that the `Results` concept can help with that, but it is a bit confusing to say the least. In general, it feels like maybe this is foremost a documentation issue, though perhaps there is also some aspect of the API itself that isn't quite matching what I'm looking for. Anyway, thanks again for the quick response.
k
Wait… sorry, I'm a bit confused haha. Maybe I'm not understanding you well, but there shouldn't be additional fuss if you do

from prefect import task, Flow

# note: checkpointing must be enabled for the target/result to be written
@task(target="/path/to/a.txt")
def abc():
    return "a"

@task
def bcd(x):
    return x

with Flow(...) as flow:
    x = abc()
    bcd(x)

the output of the `abc` task gets saved in the file without any additional steps on the local machine. You can also do

from prefect import task
from prefect.engine.results import LocalResult

@task(result=LocalResult(location="/path/to/a.txt"))
def abc():
    return "a"

and "a" gets saved at that location. Similarly you can do

from prefect import task
from prefect.engine.results import S3Result

@task(result=S3Result(bucket="xxx", location="/path/to/a.txt"))
def abc():
    return "a"

and that will be saved to S3 without any additional steps. If you end up having to restart this flow, Prefect will know where that `Result` is stored and can load it in without any additional fuss. I'm unclear what you mean about changes to the filesystem not being visible across tasks?
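On the earlier S3Upload/S3Download point: those task-library tasks compose by passing data through task return values rather than through a shared disk. A rough, untested sketch, assuming Prefect 1.x's `prefect.tasks.aws.s3` tasks and made-up bucket/key names:

from prefect import task, Flow
from prefect.tasks.aws.s3 import S3Download, S3Upload

# Task-library tasks are instantiated once, then called inside the flow
download = S3Download(bucket="my-bucket")  # returns the object's contents
upload = S3Upload(bucket="my-bucket")      # takes the data to write as its first arg

@task
def transform(text):
    # stand-in for the "execute something" step between download and upload
    return text.upper()

with Flow("s3-roundtrip") as flow:
    raw = download(key="in/data.txt")
    result = transform(raw)
    upload(result, key="out/data.txt")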
d
yeah, I get the examples you've just given, in so far as with `S3Result` you can change where you persist the result to, but maybe it's best to explain with an example... imagine you want to:
(a) download a 1GB zip file from a 3rd party; then
(b) unzip it;
(c) find a specific file that's been extracted (perhaps using some pattern matching rather than an exact filename); and
(d) copy that file into a Postgres table.
From my understanding, you can't (easily?) make use of any of the out-of-the-box `Tasks` unless you are on a shared filesystem, in which case you can do:
(a) a custom download task to the local filesystem;
(b) the Prefect `Unzip` task on the local filesystem;
(c) a custom task to find the file of interest;
(d) the existing Postgres query task, with the query being a copy-from-local-file command (Postgres itself supports this).
Using the `S3Result` concept (assuming you want to keep roughly the above task granularity):
(a) a custom download task that copies/streams into an `S3Result`;
(b) a custom unzip task that reads the `S3Result` from memory onto disk (or perhaps directly), unzips it, and sends another `S3Result` (I'm not actually sure how you'd represent an arbitrary unzipped directory structure as an `S3Result`, in fact);
(c) to get the relevant file, you might be able to use the S3-list task, followed by a custom filtering task (I haven't looked at the docs for this in detail);
(d) a custom task to download the file of interest from S3 and run some custom Postgres to import it into the db.
Of course you could squish all of it into a single task (see the sketch below), so that you have a shared filesystem at your disposal, but then you're not really using much of the Prefect machinery. I hope that makes a bit more sense?
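For concreteness, a rough sketch of that "squish it all into a single task" version - untested, with the URL, filename pattern, table name, and connection string all made up:

import re
import tempfile
import zipfile
from pathlib import Path

import psycopg2
import requests
from prefect import task, Flow

@task
def ingest_zip(url: str, dsn: str):
    # A single task, so all four steps share one (temporary) local filesystem
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "data.zip"
        # (a) stream the zip down from the 3rd party
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(zip_path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        # (b) unzip it next to the download
        with zipfile.ZipFile(zip_path) as z:
            z.extractall(tmp)
        # (c) pattern-match to find the file of interest
        matches = [p for p in Path(tmp).rglob("*.csv") if re.search(r"interesting", p.name)]
        target = matches[0]
        # (d) COPY it into a hypothetical Postgres table via the client
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(target) as f:
            cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)

with Flow("zip-to-postgres") as flow:
    ingest_zip("https://example.com/data.zip", "postgresql://localhost/mydb")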
k
Ah ok, everything is right in your understanding here. Yes, you would need a custom task to find the file of interest. But in general, am I right in saying there is a way to achieve what you need? Maybe it's just not as "off-the-shelf"?
d
there's definitely a way to do it, yeah, but I was hoping/thought Prefect was guiding users towards a more standardized task-library kind of concept... but maybe I was wrong there. Also, it seems like for large files there's a lot of transferring going on that could happen much more efficiently if done directly on a filesystem... though I guess ultimately there's not a lot you can do on that front. Thanks for the help - my main aim here was to make sure I've not misunderstood the right way to be using Prefect. I did learn a few things, but it seems ultimately it doesn't quite work the way I was hoping.
k
I would say in general the Task Library is more of a guide, and a lot of users subclass those tasks or edit them to fit their use case - 90% of the task library is community-contributed.
👍 1
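As an example of that subclassing pattern, something like the following might work - a hypothetical, untested sketch that assumes Prefect 1.x's `PostgresExecute` task and its `query`/`commit` run arguments:

from prefect.tasks.postgres import PostgresExecute

# Hypothetical subclass: reuse PostgresExecute's connection handling,
# but build the COPY statement from a path and table name.
# Note that COPY ... FROM runs server-side, so the path must be
# visible to the Postgres server itself.
class PostgresCopyFrom(PostgresExecute):
    def run(self, path: str, table: str, **kwargs):
        query = f"COPY {table} FROM '{path}' WITH CSV HEADER"
        return super().run(query=query, commit=True, **kwargs)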