# ask-community
Jacob Warwick:
Hey folks. Is there a way to have Prefect treat a file on disk that was created during a task's execution as a Result that can be persisted using S3Result (for example), without loading that file into memory or returning it from the task function? I am trying to see if my organization can use Prefect, but our core need is to run 3rd-party programs that produce large output files that may not fit in memory. Thanks, and I apologize if this is already in the docs.
Nate:
hey @Jacob Warwick! As you alluded to, you should be able to use something like the S3Result within tasks to write new results or read previous results from a bucket. Instead of writing to disk and/or passing task results between tasks, you could just pass the S3 object name between tasks
Jacob Warwick:
Hi @Nate, thanks for the reply. My read of the S3Result.write method is that it expects to be given in-memory data, not a file reference - correct?
It seems like the solution here would be for me to write the output files to s3 manually, using boto3, and then return a Result with a pointer to that location, does that sound right?
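A minimal sketch of that manual approach (the function name, bucket/key arguments, and the injectable client are illustrative, not from the thread) - boto3's upload_file streams the file from disk in managed multipart chunks, so the contents never have to fit in memory:

```python
def upload_output_file(path: str, bucket: str, key: str, s3_client=None) -> str:
    """Stream a local file to S3 and return its object key for downstream tasks."""
    if s3_client is None:
        import boto3  # assumed available in the flow's execution environment
        s3_client = boto3.client("s3")
    # upload_file reads and uploads the file in chunks from disk,
    # rather than loading the whole object into memory
    s3_client.upload_file(path, bucket, key)
    return key
```

The returned key is what a downstream task would receive instead of the data itself.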
Kevin Kho:
Hey @Jacob Warwick, what Nate is saying is to use the result interface explicitly. Combined with what you are saying, it will look like this:
```python
@task()
def task_1():
    my_result = S3Result(bucket="omlds-prefect", location="forecast.csv", serializer=PandasSerializer("csv"))
    res = my_result.write(df)  # df built earlier in the task
    return res.location

@task()
def task_2(location):
    my_result = S3Result(bucket="omlds-prefect", location=location, serializer=PandasSerializer("csv"))
    df = my_result.read(location).value
    ...
    return modified_df
```
there is an example of this in the docs here
Jacob Warwick:
Hi @Kevin Kho, I am sorry if I'm missing the point here, but my question is about a situation where task_1's "df" is too large to fit in memory. It would be a file in the local filesystem. I'd like to provide a path - or even better, a list of paths or a whole directory. Similarly, I'd like S3Result to download to disk instead of read into memory. Seems like I'd have to write a custom Result myself.
Kevin Kho:
Ah ok yeah I think you would need a custom result for the download to disk
For the first scenario, I think you should bypass the result interface for more control
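A hedged sketch of what the core of such a custom, download-to-disk result could look like (names and the injectable S3 client are illustrative; a real version would subclass prefect.engine.result.Result, omitted here so the sketch stands alone). The "value" it handles is a local file path, never the file's contents:

```python
import os

class FileOnDiskResult:
    """Sketch: a result whose value is a local path; data moves
    between disk and S3 without ever being held in memory."""

    def __init__(self, bucket, s3_client):
        self.bucket = bucket
        self.s3 = s3_client  # e.g. boto3.client("s3")

    def write(self, local_path, key):
        # stream the file from disk to S3
        self.s3.upload_file(local_path, self.bucket, key)
        return key

    def read(self, key, download_dir):
        # download to disk and hand back a path, not a deserialized object
        local_path = os.path.join(download_dir, os.path.basename(key))
        self.s3.download_file(self.bucket, key, local_path)
        return local_path
```

Downstream tasks would then pass paths/keys around, exactly as Jacob described.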
Jacob Warwick:
ok, thanks, I'm sorry to hear that - I really like the direction Prefect is going, but it probably isn't the right tool for us.
a
@Jacob Warwick if you load a file which is too large to fit into memory, you have a couple of options:
1. Read the data in chunks so that you don't process all of it at once.
2. Prefect supports and works closely with Dask - Dask is able to parallelize processing of big data, and with Prefect you have features such as mapping which provide a user-friendly interface to tackle such use cases.
I think the entire discussion about "Results" perhaps wasn't entirely productive with respect to the problem you are facing? I would definitely encourage you to explore the two options above. In general, Prefect doesn't limit you in any way in how you want to process your data - if you want to use Spark instead of Dask, you can use Prefect to trigger processing on a Spark cluster, and there is even a resource manager that simplifies work with such remote clusters.
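Option 1 above can be sketched in plain Python (the chunk size and the line-counting aggregation are illustrative choices, not from the thread) - at most one fixed-size chunk is ever in memory, regardless of the file's size:

```python
def count_lines_in_chunks(path: str, chunk_size: int = 1 << 20) -> int:
    """Process a large file chunk by chunk; here the 'processing'
    is just counting newlines, as a stand-in for real work."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes in memory
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

The same pattern applies with pandas' chunked readers or a Dask dataframe when the per-chunk work is heavier.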