Prefect Community #ask-community

Wolfgang Kerzendorf

03/05/2020, 4:03 PM

okay - I want to give this a shot. I'm downloading data from an S3 bucket. This is then extracted into ~1e6 folders, and each of these folders is processed, which results in one file per folder. I would like to have a workflow where I can see what went wrong with each of these 1e6 tasks (for those that fail). So do I start with the first task being a glob, and then string tasks onto this? Sorry for the stupid questions

nicholas

03/05/2020, 4:18 PM

Hi @Wolfgang Kerzendorf, no questions are stupid! To start, I don't think S3 allows `glob`-like server-side filtering. However, boto3 allows you to pass the `Prefix` argument when inspecting a bucket, which, depending on the structure of your bucket and the nature of the files you're looking for, may serve your use case. Depending on the number of files you're transforming, you may want to do some batch processing here to avoid bottlenecks and to allow your download/uploads to share a client. You can map these batches as necessary and use the Prefect Logger to raise any issues that come up when doing the processing. The Prefect Docs have a bare-bones example of an ETL flow here: https://docs.prefect.io/core/examples/etl.html For an example of using the boto3 client to upload files, boto3 has an example here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html I also found a python library that might help with generating prefixes to pass to the boto3 client, if standard strings won't work for your use case: https://github.com/asciimoo/exrex Last, if you want to read more about Prefect loggers, you can do so here: https://docs.prefect.io/core/concepts/logging.html#logging-configuration
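
A minimal sketch of the `Prefix` approach described above. It assumes a boto3 S3 client is passed in (for example one created with `boto3.client("s3")`); the function name `list_keys` and any bucket or prefix values are hypothetical and not from the thread.

```python
from typing import Iterator


def list_keys(client, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key in `bucket` that starts with `prefix`.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # A single list_objects_v2 call returns a limited number of keys,
    # so a paginator is used to walk through all result pages.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page that has no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Passing the client in as an argument lets all downloads in a batch share one client, which is the sharing mentioned in the message above.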

Wolfgang Kerzendorf

03/05/2020, 4:19 PM

sorry - I will download the S3 data externally before starting - I should have mentioned that

nicholas

03/05/2020, 4:22 PM

Ah ok, in which case using the `glob` module is fine!
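
As a rough illustration of starting from a glob, the sketch below lists the sub-folders under a local root directory. The name `find_folders` and the flat one-level layout are assumptions; the thread does not describe the actual directory structure.

```python
import glob
import os
from typing import List


def find_folders(root: str) -> List[str]:
    """Return the sorted sub-folders directly under `root`."""
    # glob matches files and folders alike, so keep directories only.
    candidates = glob.glob(os.path.join(root, "*"))
    return sorted(path for path in candidates if os.path.isdir(path))
```

The returned list is the kind of value a first task could hand to a mapped processing task.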

Wolfgang Kerzendorf

03/05/2020, 4:23 PM

do you think it's bad to have 1e6 transforms (that is essentially the backlog) and then a few hundred every week?

nicholas

03/05/2020, 4:28 PM

No, I don't see any issues, assuming you've got the resources to handle the load. I think batching is still really valuable here, since it'll help you reduce bottlenecks. We have some users mapping over a few hundred thousand tasks right now.

Wolfgang Kerzendorf

03/05/2020, 4:30 PM

okay - looking at the example, this does not split up the tasks - `extract` gives a list and then `transform` works on this list - rather than having 3 "transformers"

nicholas

03/05/2020, 4:36 PM

You're right. You could of course apply multiple transforms as needed; it's entirely up to you. An example of a flow where a list is generated and then multiple transforms are applied to each item can be found here: https://docs.prefect.io/core/examples/map_reduce.html

Wolfgang Kerzendorf

03/05/2020, 4:52 PM

yep - I think that is what I want - map/reduce. I'm an astrophysicist for most of my time and only occasionally do ETL - so I sometimes have language problems 🙂

Wolfgang Kerzendorf

03/05/2020, 4:52 PM

thanks!

nicholas

03/05/2020, 4:53 PM

No problem at all, glad we could help!

Jeremiah

03/05/2020, 6:06 PM

@Wolfgang Kerzendorf I fully echo everything @nicholas said and you can map as much as you need - I would just caution that we have generally found that once you get over 10,000 mapped tasks, managing and keeping track starts to get more difficult, and you’ll need to be more careful with your infrastructure. This isn’t a technical limitation, it’s purely a matter of resource contention and complexity - Prefect should run them all, but you’ll want to be more attuned to (for example) out of memory errors, depending on your executor. Adding some batching into your tasks (perhaps processing 10-100 items at a time) could help alleviate that.
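
The batching suggestion could look roughly like the sketch below, written in plain Python so it runs without Prefect. The names `make_batches` and `process_batch` are hypothetical. In a flow, each batch would presumably be one mapped task input, with per-item errors collected so that individual failures stay visible.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def make_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Split `items` into consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def process_batch(batch: List[str], process_one: Callable[[str], object]) -> Dict[str, str]:
    """Run `process_one` on each item and return {item: error} for failures."""
    failures: Dict[str, str] = {}
    for item in batch:
        try:
            process_one(item)
        except Exception as exc:
            # Record the error and keep going so one bad folder
            # does not hide the results for the rest of the batch.
            failures[item] = repr(exc)
    return failures
```

With 1e6 folders and a batch size of 100, this yields 10,000 batches, which is around the size Jeremiah describes as still manageable.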

Wolfgang Kerzendorf

03/05/2020, 9:30 PM

okay - so I'm playing around with your mapping task and it's quite nice. So for testing things out in my map stuff - can I somehow tell it to just run one tree of the maps?

Wolfgang Kerzendorf

03/05/2020, 9:30 PM

and how do I hook the webserver stuff up

Wolfgang Kerzendorf

03/05/2020, 9:32 PM

sorry the dashboard

Jeremiah

03/05/2020, 10:01 PM

@Wolfgang Kerzendorf there isn’t a “built in” way to only run one branch of the tree, it’ll always run all of them. For testing, my suggestion would be to modify the task (or insert a new task) to filter the list of mapped inputs. For example, if your mapped input is a list of 100 items, add a task that filters it to the one you want to test and returns a list of just that item.
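
A minimal sketch of such a filter step, in plain Python. The name `keep_only` is hypothetical; in a flow, this function would presumably sit between the task that produces the list and the mapped task.

```python
from typing import List


def keep_only(items: List[str], wanted: str) -> List[str]:
    """Return a list containing just the items equal to `wanted`."""
    # The result is still a list, so a downstream map works unchanged.
    return [item for item in items if item == wanted]
```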

Wolfgang Kerzendorf

03/05/2020, 10:02 PM

So I'm extracting a bunch of tar files into individual files and then going further.

Jeremiah

03/05/2020, 10:02 PM

To spin up the dashboard, you can sign up for a free account here, and then follow the first two pages of the tutorial here. The first page will help you authenticate your local environment with Prefect Cloud, the second one will show you how to register and run your flow.

Wolfgang Kerzendorf

03/05/2020, 10:02 PM

in the end I only want to run the extraction if the tar file has changed. I'm writing the md5 to a pandas df

Wolfgang Kerzendorf

03/05/2020, 10:03 PM

yep - I have an account - will look into this

Wolfgang Kerzendorf

03/05/2020, 10:03 PM

so to stop individual branches of a tree - do I just fail?

Jeremiah

03/05/2020, 10:04 PM

Ah ok, to do this entirely within Prefect you may want to look at the `ifelse` conditional (for a more formal version) or, more simply, bake the conditional directly into a mapped task which checks if that branch should proceed and raises a `SKIP` signal otherwise (you can learn more about signaling here).
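
The "has the tar file changed" check could be sketched as below, using only the standard library. A plain dict stands in for the pandas DataFrame of checksums mentioned earlier, and the function names are hypothetical. Inside a mapped task, the branch where `needs_extraction` returns `False` is where the `SKIP` signal would presumably be raised.

```python
import hashlib
from typing import Dict


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(path: str, known_md5: Dict[str, str]) -> bool:
    """True if `path` is new or its checksum differs from the recorded one."""
    return known_md5.get(path) != file_md5(path)
```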

Jeremiah

03/05/2020, 10:05 PM

You can fail, or use a `SKIP` to differentiate intentional skips from true failures

Wolfgang Kerzendorf

03/05/2020, 10:05 PM

I see - that will then show up on the dashboard in a nicer way - right?

Jeremiah

03/05/2020, 10:05 PM

Exactly

Jeremiah

03/05/2020, 10:07 PM

Prefect also has some caching mechanisms you could use to avoid recomputing tasks - but I’m not 100% sure if they’ll map exactly to your use case and don’t want to send you down a road that might not help

Wolfgang Kerzendorf

03/06/2020, 4:20 PM

okay - I'm trying to map a map again so the first map returns a list - then the second task makes a list of lists out of these tasks. I want the next task to directly work on the items of the list of lists. does that make sense?
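
One possible reading of this question is that the nested result needs to be flattened before the next map. The thread does not answer it, so the sketch below is only an interpretation. It is plain Python, and the name `flatten` is hypothetical.

```python
from itertools import chain
from typing import List


def flatten(list_of_lists: List[List[str]]) -> List[str]:
    """Turn a list of lists into one flat list, preserving order."""
    return list(chain.from_iterable(list_of_lists))
```

A step like this placed between the two maps would let the next mapped task receive individual items rather than inner lists.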
