05/10/2020, 11:28 PM
Hi team - I have a question about nested mapping that maybe someone has thought about or come across. I have a list of zipped files I’d like to map over, but each file contains either raw data or one of a couple of nesting formats (weekly or monthly/weekly), depending on how old the file is. What I’d really like to do is write a simple parser for each type and just flat-map out the data files for further mapping. Obviously I can’t do this in prefect, but I also don’t want to do an enormous reduce, because the amount of data is too large. What I currently have is just a simple map over the files, but I’m really not getting the parallelism or granularity I’d like. I’m using dask, so I thought about just grabbing a worker client and submitting tasks from tasks, but then I lose the benefits of having prefect tasks. Is there anything anyone can suggest?


05/11/2020, 8:32 AM
Is identifying the file type a costly operation (one that involves unzipping), or can it be determined from the zip creation date? I want to clarify, since you mentioned that the format depends on how old the file is. By granularity, do you mean that each format parser should be its own task and map over only its matching files? If that is the case, you could use multiple filters to get multiple lists, each made of a single format. Then each list can be its own DAG branch, mapped over by its matching parser. You could also write a custom multi-way filter to partition all files into their own format lists in a single pass.
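A rough sketch of the single-pass multi-way filter idea, in plain Python rather than any Prefect API. The names `partition_by_format` and `classify_by_name`, and the rule of guessing the format from the file name, are hypothetical; a real classifier might look at the zip creation date or peek inside the archive instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def partition_by_format(
    paths: Iterable[str], classify: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Split paths into one list per format label in a single pass."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[classify(path)].append(path)
    return dict(groups)


def classify_by_name(path: str) -> str:
    """Hypothetical classifier: infer the nesting format from the file name."""
    if "monthly" in path:
        return "monthly_weekly"
    if "weekly" in path:
        return "weekly"
    return "raw"
```

Each resulting list could then feed its own mapped branch with the matching parser, so the per-format branches stay independent.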


05/11/2020, 1:50 PM
Hi @Brad - keep an eye on this issue, we’re looking at implementing a flat_map operator for exactly this type of use case. I don’t have an exact date for you, but the preliminary work (refactoring how mapping works to minimize the need for reduce steps) is starting this week.
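For anyone unfamiliar with the term, here is a small plain-Python illustration of what flat-map semantics mean. It is only a sketch of the concept, not the Prefect operator mentioned above, and `list_members` is a hypothetical stand-in for a task that returns the data files found inside one archive.

```python
from itertools import chain
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def flat_map(fn: Callable[[T], Iterable[U]], items: Iterable[T]) -> List[U]:
    """Apply fn to every item and concatenate the resulting iterables."""
    return list(chain.from_iterable(fn(item) for item in items))


def list_members(archive: str) -> List[str]:
    """Hypothetical stand-in: pretend every archive holds two data files."""
    return [f"{archive}/part{i}" for i in range(2)]
```

The flattened list is what a second map would then run over, one element per inner data file rather than one per archive.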