
Wolfgang Kerzendorf

03/05/2020, 4:03 PM
okay - I want to give this a shot. I'm downloading data from an S3 bucket. This is then extracted into ~1e6 folders, and each of these folders is processed, which results in one file per folder. I would like to have a workflow where I can see what went wrong with each of these 1e6 tasks (for those that fail). So do I start with the first task being a glob, and then string tasks onto this? Sorry for the stupid questions

nicholas

03/05/2020, 4:18 PM
Hi @Wolfgang Kerzendorf, no questions are stupid! To start, I don't think S3 allows `glob`-like server side filtering. However boto3 allows you to pass the `Prefix` argument when inspecting a bucket, which, depending on the structure of your bucket and the nature of the files you're looking for, may serve your use case. Depending on the number of files you're transforming, you may want to do some batch processing here to avoid bottlenecks and to allow your download/uploads to share a client. You can map these batches as necessary and use the Prefect Logger to raise any issues that come up when doing the processing.
The Prefect Docs have a bare-bones example of an ETL flow here: https://docs.prefect.io/core/examples/etl.html
For an example of using the boto3 client to upload files, boto3 has an example here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
I also found a python library that might help with generating prefixes to pass to the boto3 client, if standard strings won't work for your use case: https://github.com/asciimoo/exrex
Last, if you want to read more about Prefect loggers, you can do so here: https://docs.prefect.io/core/concepts/logging.html#logging-configuration

Wolfgang Kerzendorf

03/05/2020, 4:19 PM
sorry - i will download the s3 externally before starting - I should have mentioned that

nicholas

03/05/2020, 4:22 PM
Ah ok, in which case using the `glob` module is fine!
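A minimal sketch of that first step, assuming the data has already been downloaded and extracted so that every folder sits directly under one local root directory. The function name `list_extracted_folders` and the flat layout are assumptions for illustration, not something stated in the thread.

```python
import glob
import os


def list_extracted_folders(root):
    """Return the sorted sub-folders found directly under `root`."""
    # "*" matches every non-hidden entry in root; keep only the directories.
    candidates = glob.glob(os.path.join(root, "*"))
    return sorted(path for path in candidates if os.path.isdir(path))
```

The returned list is the kind of value a first task could hand to the tasks that process each folder.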

Wolfgang Kerzendorf

03/05/2020, 4:23 PM
do you think it's bad to have 1e6 transforms (that is essentially the backlog) and then a few hundred every week?

nicholas

03/05/2020, 4:28 PM
No, I don't see any issues, assuming you've got the resources to handle the load. I think batching is still really valuable here, since it'll help you reduce bottlenecks. We have some users mapping over a few hundred thousand tasks right now.

Wolfgang Kerzendorf

03/05/2020, 4:30 PM
okay - looking at the example this does not split up the tasks - `extract` gives a list and then `transform` works on this list - rather than having 3 "transformers"

nicholas

03/05/2020, 4:36 PM
You're right. You could of course apply multiple transforms as needed; it's entirely up to you. An example of a flow where a list is generated and then multiple transforms are applied to each item can be found here: https://docs.prefect.io/core/examples/map_reduce.html
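The map/reduce shape can be sketched in plain Python, without any Prefect API, to show how per-item failures stay visible. Everything here is hypothetical: `process_folder` is a stand-in for the real per-folder step, and `map_with_errors` and `summarize` are illustrative names. In a Prefect flow, each mapped call would instead be its own task run with its own state.

```python
def process_folder(folder):
    """Hypothetical stand-in for the real per-folder processing step."""
    if "bad" in folder:
        raise ValueError(f"could not process {folder}")
    return folder + ".out"


def map_with_errors(func, items):
    """Map step: apply func to every item, recording failures instead of stopping."""
    results, failures = {}, {}
    for item in items:
        try:
            results[item] = func(item)
        except Exception as exc:
            # Keep going so one bad item does not hide the others.
            failures[item] = repr(exc)
    return results, failures


def summarize(results, failures):
    """Reduce step: collapse the per-item outcomes into counts."""
    return {"succeeded": len(results), "failed": len(failures)}
```

Calling `map_with_errors(process_folder, folders)` returns one dictionary of outputs and one of error messages keyed by folder, which `summarize` then reduces to a single record.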

Wolfgang Kerzendorf

03/05/2020, 4:52 PM
yep - I think that is what I want - map/reduce. I'm an astrophysicist for most of my time and only occasionally do etl - so I sometimes have language problems 🙂
thanks!

nicholas

03/05/2020, 4:53 PM
No problem at all, glad we could help!

Jeremiah

03/05/2020, 6:06 PM
@Wolfgang Kerzendorf I fully echo everything @nicholas said and you can map as much as you need - I would just caution that we have generally found that once you get over 10,000 mapped tasks, managing and keeping track starts to get more difficult, and you’ll need to be more careful with your infrastructure. This isn’t a technical limitation, it’s purely a matter of resource contention and complexity - Prefect should run them all, but you’ll want to be more attuned to (for example) out of memory errors, depending on your executor. Adding some batching into your tasks (perhaps processing 10-100 items at a time) could help alleviate that.
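The batching idea can be sketched as a small helper, assuming the inputs are already held in a list. The name `make_batches` and the default size of 100 are illustrative choices, with the size taken from the upper end of the 10-100 range suggested above.

```python
def make_batches(items, batch_size=100):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

With 1e6 folders and a batch size of 100, this yields 10,000 batches, so a map over batches stays around the 10,000-task mark mentioned above rather than two orders of magnitude beyond it.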

Wolfgang Kerzendorf

03/05/2020, 9:30 PM
okay - so I'm playing around with your mapping task and it's quite nice. So for testing things out in my map stuff - can I somehow tell it to just run one tree of the maps
and how do I hook the webserver stuff up
sorry the dashboard

Jeremiah

03/05/2020, 10:01 PM
@Wolfgang Kerzendorf there isn’t a “built in” way to only run one branch of the tree, it’ll always run all of them. For testing, my suggestion would be to modify the task (or insert a new task) to filter the list of mapped inputs. For example, if your mapped input is a list of 100 items, add a task that filters it down to the one you want to test and returns a list of just that item.
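A sketch of such a filter step in plain Python. The name `select_for_testing` and the `wanted` parameter are hypothetical, and the function would still need to be wrapped as a task to sit inside a flow.

```python
def select_for_testing(items, wanted=None):
    """Return all items, or only those listed in `wanted` when it is given."""
    if wanted is None:
        # Normal runs: pass every mapped input through unchanged.
        return list(items)
    wanted = set(wanted)
    # Test runs: keep only the requested inputs, preserving their order.
    return [item for item in items if item in wanted]
```

Leaving `wanted` unset keeps the full run intact, so the same flow can serve both testing and production.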

Wolfgang Kerzendorf

03/05/2020, 10:02 PM
So I'm extracting a bunch of tar files into individual files and then going further.

Jeremiah

03/05/2020, 10:02 PM
To spin up the dashboard, you can sign up for a free account here, and then follow the first two pages of the tutorial here. The first page will help you authenticate your local environment with Prefect Cloud, the second one will show you how to register and run your flow.

Wolfgang Kerzendorf

03/05/2020, 10:02 PM
in the end I only want to run the extraction if the tar file has changed. I'm writing the md5 to a pandas df
yep - I have an account - will look into this
so to stop individual branches of a tree - do I just fail?

Jeremiah

03/05/2020, 10:04 PM
Ah ok, to do this entirely within Prefect you may want to look at the `ifelse` conditional (for a more formal version) or, more simply, bake the conditional directly into a mapped task which checks if that branch should proceed and raises a `SKIP` signal otherwise (you can learn more about signaling here).
You can fail, or use a `SKIP` to differentiate intentional skips from true failures
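The check that such a mapped task would run can be sketched in plain Python. This assumes the stored checksums are available as a dictionary keyed by file path, a stand-in for the pandas DataFrame mentioned earlier; `file_md5` and `needs_extraction` are hypothetical names. Inside a Prefect task, the `False` case is where the `SKIP` signal would be raised.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(path, known_md5s):
    """Return True when the tar file is new or its checksum has changed."""
    return known_md5s.get(path) != file_md5(path)
```

Reading in chunks keeps memory use flat even for large tar files.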

Wolfgang Kerzendorf

03/05/2020, 10:05 PM
I see - that will then show up on the dashboard in a nicer way - right?

Jeremiah

03/05/2020, 10:05 PM
Exactly
Prefect also has some caching mechanisms you could use to avoid recomputing tasks - but I’m not 100% sure if they’ll map exactly to your use case and don’t want to send you down a road that might not help

Wolfgang Kerzendorf

03/06/2020, 4:20 PM
okay - I'm trying to map a map again so the first map returns a list - then the second task makes a list of lists out of these tasks. I want the next task to directly work on the items of the list of lists. does that make sense?
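This question is not answered in this part of the thread. One common approach, offered here as an assumption rather than as the thread's answer, is to add a reduce step that flattens the list of lists so the next map receives individual items. The helper name `flatten` is illustrative.

```python
from itertools import chain


def flatten(list_of_lists):
    """Turn [[a, b], [c]] into [a, b, c], preserving order."""
    return list(chain.from_iterable(list_of_lists))
```

Only one level of nesting is removed, which matches the list-of-lists case described above.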