
Chris Hart

07/30/2019, 10:16 PM
A follow-up question on mapping tasks for pagination: I’m using the technique of having an upstream task that builds a giant dictionary of all possible pages, and then maps that into the task that ingests (“hydrates”) the data from an API. It’s working, yay! My question now is: any suggestions for how I can chunk the pages list or yield/stream it to the next task? The reason is that I want to a) limit the memory footprint of the downstream task and b) limit the size of failures that would then need to be retried, preventing duplicate fetching at scale.
It seems that the generation of subtasks for the mapped items is working, and I wonder if just switching to the DaskExecutor will let the downstream task run in parallel, since right now the task that hydrates all records runs sequentially and locally
(and thus keeps all data from all pages in memory before proceeding to the last task map)

Chris White

07/31/2019, 4:42 AM
Great questions; a few things to call out:
- At this exact moment, “chunking” or batching of mapped tasks is not a first-class feature, but it’s something we recently began considering implementing. Consequently, to do this today you need to manually create the batches yourself (i.e., map over a list-of-lists); a sketch follows below. Feel free to open an issue formally requesting this!
- Using the DaskExecutor will certainly gain you parallelism of the mapped tasks at each level of mapping; I’m not 100% sure if this is all you’re asking about, so let me know if there was something more subtle here that I’m missing.
- One caveat with mapping is that the results (outputs) of all the mapped tasks will be “reduced”, i.e., gathered into memory, so just be mindful of this when considering your resource requirements.
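For reference, a minimal sketch of that manual list-of-lists batching approach against the Prefect 0.x API of the time; the flow name, task names, batch size, and page payloads are illustrative placeholders, not anything from the thread:

```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

BATCH_SIZE = 50  # illustrative batch size

@task
def build_pages():
    # stand-in for the upstream task that enumerates every page to fetch
    return [{"page": i} for i in range(1, 501)]

@task
def make_batches(pages):
    # group the pages into a list of lists so each mapped child
    # processes one batch of pages instead of a single page
    return [pages[i:i + BATCH_SIZE] for i in range(0, len(pages), BATCH_SIZE)]

@task
def hydrate_batch(batch):
    # stand-in for the real API call that hydrates each page in the batch
    return [{"page": p["page"], "data": None} for p in batch]

with Flow("paginated-ingest") as flow:
    pages = build_pages()
    batches = make_batches(pages)
    hydrated = hydrate_batch.map(batches)

if __name__ == "__main__":
    # the DaskExecutor runs the mapped batch tasks in parallel
    flow.run(executor=DaskExecutor())
```

With this shape, each mapped child handles one batch, so a failure only forces a retry of that batch’s pages; note that, per the caveat above, the outputs of all mapped children are still gathered into memory once the map completes.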

Chris Hart

07/31/2019, 4:07 PM
Ah ok, cool, thanks! Will start by adding a dedicated batching/chunking task for that.