Daniel Manson
09/15/2021, 8:01 PM

Kevin Kho
Result. This is needed to restart a flow from the point at which it failed. You can turn checkpointing off if you don't want it.
3. You can check the Results page. You can use the Result class explicitly as an easier way to write data, or the pipeline can persist it for you.
4. If combining with mapping, you can template result names (points 3 and 4 are sketched after this list).
5. If you are using ECS, you can add that EFS volume yourself to the container, but to be honest I see S3 used more.
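Roughly what I mean for points 3 and 4, as a sketch (Prefect 1.x; the paths, file names, and task names here are made up):

from prefect import task, Flow
from prefect.engine.results import LocalResult

# Point 3: use a Result class explicitly to write data yourself.
@task
def save_explicitly(data):
    result = LocalResult(dir="/tmp/results", location="explicit-output.txt")
    # write() persists the value and returns a Result pointing at it
    return result.write(data).location

# Point 4: with mapping, template the result location (filled in from
# prefect.context) so each mapped child writes its own file.
@task(checkpoint=True,
      result=LocalResult(dir="/tmp/results",
                         location="{task_name}-{map_index}.txt"))
def double(x):
    return x * 2

with Flow("results-example") as flow:
    doubled = double.map([1, 2, 3])
    save_explicitly(doubled)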
With regards to the Dask link you sent, of course writing to the local filesystem is not recommended in a distributed setting. I see GCS/S3 used more but you might be able to add a volume across the Dask cluster (not sure).
Prefect ultimately adjusts to you, and anything you can do in Python, you can do in Prefect. Some users just do everything on a local machine with local storage, and that ultimately works for them.
I suppose the question about a shared filesystem comes up more with Dask, and I think the answer here is that you can probably get it working by mounting something to the Dask cluster.

Kevin Kho

Daniel Manson
09/16/2021, 12:02 PM
The Results concept can help with that, but it is a bit confusing to say the least.
In general, it feels like maybe this is foremost a documentation issue, though perhaps there is also some aspect of the API itself that isn't quite matching what I'm looking for.
Anyway, thanks again for the quick response.

Kevin Kho
from prefect import task, Flow

# target: a location Prefect checkpoints the task's output to
@task(target="/path/to/a.txt")
def abc():
    return "a"

@task
def bcd(x):
    return x

with Flow(...) as flow:
    x = abc()
    bcd(x)
the output of the abc task gets saved in the file without any additional steps on the local machine. You can also do
from prefect.engine.results import LocalResult

@task(result=LocalResult(location="/path/to/a.txt"))
def abc():
    return "a"
and “a” gets saved at that location.
Similarly you can do
from prefect.engine.results import S3Result

@task(result=S3Result(bucket="xxx", location="/path/to/a.txt"))
def abc():
    return "a"
and that will be saved to S3 without any additional steps.
If you end up having to restart this Flow, Prefect will know where that Result is stored and can load it in without the additional fuss.
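To make that concrete, a rough sketch of the restart/caching behavior with the target example above (assuming a local run with checkpointing enabled, e.g. PREFECT__FLOWS__CHECKPOINTING=true in the environment):

flow.run()  # first run: abc executes and writes /path/to/a.txt
flow.run()  # second run: the target file exists, so abc loads the
            # stored Result instead of re-executing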
Unclear what you mean that changes to the filesystem are not visible across tasks?

Daniel Manson
09/16/2021, 4:26 PM
With S3Result you can change where you persist the result to, but maybe it's best to explain with an example...
imagine you want to (a) download a 1GB zip file from a 3rd party; then (b) unzip it; (c) find a specific file that's been extracted (perhaps using some pattern matching rather than an exact filename); (d) copy that file into a postgres table.
From my understanding, you can't (easily?) make use of any of the out-of-the-box Tasks unless you are on a shared filesystem, in which case you can do (rough sketch after this list):
(a) custom download task to the local file system
(b) use the Prefect Unzip task on the local file system
(c) custom task to find the file of interest
(d) the existing Postgres query task, with the query being a copy-from-local-file command (Postgres itself supports this).
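i.e. something like this rough sketch (untested; the URL, paths, table, and connection details are all made up, and I've used stdlib zipfile plus psycopg2 as stand-ins for the library tasks):

import glob
import urllib.request
import zipfile

import psycopg2
from prefect import task, Flow

@task
def download(url, dest="/shared/data.zip"):
    # (a) pull the third-party zip onto the shared filesystem
    urllib.request.urlretrieve(url, dest)
    return dest

@task
def unzip(zip_path, extract_dir="/shared/extracted"):
    # (b) stand-in for the Prefect Unzip task
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    return extract_dir

@task
def find_file(extract_dir):
    # (c) pattern-match the extracted file of interest
    return glob.glob(f"{extract_dir}/**/*.csv", recursive=True)[0]

@task
def copy_to_postgres(path):
    # (d) Postgres COPY from a local file
    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur, open(path) as f:
        cur.copy_expert("COPY my_table FROM STDIN WITH CSV", f)

with Flow("zip-to-postgres") as flow:
    zip_path = download("https://example.com/data.zip")
    extract_dir = unzip(zip_path)
    path = find_file(extract_dir)
    copy_to_postgres(path)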
Using the S3Result concept (assuming you want to keep roughly the above task granularity; partial sketch after this list):
(a) custom download task that copies/streams into an S3Result
(b) custom unzip task that reads the S3Result from memory onto disk (or perhaps directly) unzips it, and sends another S3Result (not actually sure how you'd represent an arbitrary unzipped directory structure as an S3Result, in fact)
(c) to get the relevant file, you might be able to use the S3-list Task, followed by a custom filtering task (I haven't looked at the docs for this in detail)
(d) a custom task to download the file of interest from S3 and run some custom Postgres to import it into the db.
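For (a) and (d) in this version, the sort of thing I have in mind (again untested, with made-up bucket/key names, using boto3 directly rather than the S3Result API):

import urllib.request

import boto3
from prefect import task

@task
def download_to_s3(url, bucket="my-bucket", key="raw/data.zip"):
    # (a) stream the download straight into S3
    with urllib.request.urlopen(url) as resp:
        boto3.client("s3").upload_fileobj(resp, bucket, key)
    return key

@task
def import_from_s3(key, bucket="my-bucket"):
    # (d) pull the file of interest back down, then COPY it in
    boto3.client("s3").download_file(bucket, key, "/tmp/file.csv")
    # ...then run the Postgres COPY as in the previous sketch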
Of course you could squish all of it into a single task, so that you have a shared filesystem at your disposal, but then you're not really using much of the Prefect machinery.
I hope that makes a bit more sense?

Kevin Kho
Daniel Manson
09/16/2021, 4:46 PM

Kevin Kho