# ask-community
d
Hello! I wanted to see if anyone has suggestions/experience with this problem. Let's say I have workflows that are called through a deployment, and the output of these workflows is a bunch of files collected in a folder. What is the best way to pass these files on, as the persisted result of the run, to successive flows that don't run on the same machine? We specifically use S3. Initially, the simplest solution was to just put the files in a bucket and pass the URI to the folder, but this is not fully reliable: the folder might be deleted while Prefect still sees a cached result for the URI of the deleted folder, so the operations are not repeated and the workflow is effectively broken. Are there better strategies to achieve this? Thank you! :)
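For reference, a rough sketch of a guard on top of the naive URI-passing approach (bucket and prefix names here are made up, not our real ones), so that a dangling URI forces the work to be redone instead of being trusted:

```python
from prefect import flow, task
import boto3

BUCKET = "my-output-bucket"  # hypothetical bucket name


@task
def prefix_exists(prefix: str) -> bool:
    # Re-check that the "folder" still has at least one object in it
    # before trusting a previously returned URI.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0


@task
def produce_files(prefix: str) -> str:
    # ... write the output files under the run-scoped prefix ...
    return f"s3://{BUCKET}/{prefix}"


@flow
def producer(run_id: str) -> str:
    prefix = f"outputs/{run_id}/"
    # If the folder was deleted out from under us, redo the work
    # instead of handing downstream a dangling URI.
    if not prefix_exists(prefix):
        return produce_files(prefix)
    return f"s3://{BUCKET}/{prefix}"
```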
b
Hey Davide! I have a few general q's about the pattern you'd like to implement here (sorry it's not an immediate answer lol). What sort of files are these? Parquet? CSV? When a new file lands in the bucket, do you want the successive flow to pick up the file and process it immediately? Would you like each downstream flow to process one file? Or do you want all of the new files to be processed collectively by one flow? Last but not least, are you using Prefect Cloud or OSS?
d
We have multiple image files in GeoTIFF format, they are to be processed altogether by one downstream flow, and we're using Cloud!
(Adding @Nicholas Pini to the conversation)
b
Ahh cool, thanks for the context! Originally I was thinking that you may benefit from the S3 data lake pattern that our team put together a demo for. In essence, what it does is: Flow A gets raw data from API ➡️ Raw data stored in S3 ➡️ AWS EventBridge Rule is triggered ➡️ CloudEvent sent through Prefect Webhook ➡️ Event arrives in Prefect Cloud ➡️ Automation kicks off Flow B to process raw data file ➡️ Processed data stored in bucket.
Flow A and Flow B are running in completely separate ECS tasks, so the idea of having the flows running on separate machines (while keeping track of the files requiring processing) is applicable with this pattern.
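For illustration, the downstream side of that pattern might boil down to something like this; `bucket` and `key` are whatever the Automation pulls out of the webhook event (names here are illustrative, not the demo's actual code):

```python
from prefect import flow, task
import boto3


@task
def process_geotiff(bucket: str, key: str) -> str:
    s3 = boto3.client("s3")
    local_path = f"/tmp/{key.rsplit('/', 1)[-1]}"
    s3.download_file(bucket, key, local_path)
    # ... run the actual raster processing here ...
    processed_key = key.replace("raw/", "processed/", 1)
    s3.upload_file(local_path, bucket, processed_key)
    return processed_key


@flow
def process_raw_file(bucket: str, key: str) -> str:
    # The Automation passes in the bucket/key of the object that just
    # landed in S3, so Flow B only touches that one file.
    return process_geotiff(bucket, key)
```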
LMK what you think. I'll give it a bit more thought since what you're looking for is having one downstream process gather up all of the files that need processing as a group.
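One possible way to get the "gather as a group" behavior: have the downstream flow list everything currently sitting under the raw prefix and process it in a single run, e.g. (prefix name is made up):

```python
from prefect import flow
import boto3


@flow
def process_pending(bucket: str, raw_prefix: str = "raw/") -> list[str]:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=raw_prefix)
        for obj in page.get("Contents", [])
    ]
    # Process the whole batch in one flow run, then move/delete the
    # originals so a later run doesn't pick them up again.
    for key in keys:
        ...  # process each GeoTIFF
    return keys
```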