Hi everyone. I am trying to get a better understanding of mapped tasks and memory implications/best practices when using a large number of them.
I have a pipeline that:
1) Queries a DB for a list of app IDs. It usually gets about 25k.
2) Calls an API N times for each app, where N is currently 14 (one call per country). The JSON response for each app is not too big: a single dictionary with several fields.
3) Combines and stores the output of all tasks in S3.
This is currently implemented in Airflow with one branch per country. Each branch queries the API repeatedly for each app and stores the results in a single file for each country. If the task fails, all apps need to be reprocessed.
What I am wondering is:
* If I create a mapped task that gets the list of apps + countries, does it create all 350k (14 * 25k) child tasks in memory at once and put them in some sort of queue, or are they created lazily?
* I suppose that if I did nothing special about caching the results to an external system like S3, it would hold all the data in memory until it gets to the reducer task that dumps the output to a file. This may require a lot of memory because the reducer won't start until all children finish. Is that correct?
* Would this be alleviated if I use caching to S3? Would the memory be released once each task's results are persisted to S3?
* Each child task's output would be pretty small, and having that many S3 files with a little data in each does not seem great. Would you recommend that instead of having each child task process a single app, it processes a small batch of apps, say 50?
* I suppose there is no garbage collection on persisted results. Is the recommendation to use S3 lifecycle rules to clear old task outputs?
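To put numbers on the batching idea, here is a minimal sketch in plain Python, with no orchestration library involved. The app IDs and country codes are placeholders, and `chunked` and `build_work_items` are hypothetical helper names used only for illustration. It shows how batches of 50 would change the number of child tasks (and so the number of small result files).

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunked(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_work_items(
    app_ids: List[int], countries: List[str], batch_size: int
) -> List[Tuple[str, List[int]]]:
    """Build one (country, batch of app IDs) pair per child task."""
    batches = list(chunked(app_ids, batch_size))
    return [(country, batch) for country in countries for batch in batches]


# Placeholder inputs: 25k fake app IDs and 14 fake country codes.
app_ids = list(range(25_000))
countries = [f"C{i:02d}" for i in range(14)]

work_items = build_work_items(app_ids, countries, batch_size=50)

# One task per (app, country) pair versus one task per (batch, country) pair.
print(len(app_ids) * len(countries))  # 350000
print(len(work_items))                # 7000
```

With these assumed sizes, batching by 50 turns 350k child tasks into 7,000, at the cost of reprocessing up to 50 apps when a single batch fails.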
05/30/2020, 6:15 PM
Hi @Pedro Machado! I think longer discussions like this are better suited for GitHub so let’s transition there. I’ll open an issue for you
@Marvin open “Best practices for managing memory with mapped pipelines?”