Gary
07/16/2020, 3:24 AM
from prefect import task, Flow

@task
def get_stock_id_list():
    # Query the stock id list from the database
    stock_id_list = query_stock_id_list_from_db()
    return stock_id_list

@task
def crawl_stock_data_from_external_website(stock_id):
    # Crawler-related work
    return crawl_stock_data(stock_id)

@task
def perform_calculation(crawled_data):
    # Some calculation with pandas
    perform_some_calculation(crawled_data)

with Flow('Example flow') as flow:
    # The stock id list contains about ten thousand ids.
    stock_id_list = get_stock_id_list()
    crawled_data_list = crawl_stock_data_from_external_website.map(stock_id_list)
    perform_calculation.map(crawled_data_list)
Is it okay for Prefect to generate about 10,000-50,000 mapped tasks within the above flow without any problems?
Another scenario is to generate about 1 million tasks within a flow. In that scenario, we query a user id list from our database and perform some calculation for user behavior analysis (one user is mapped to one mapped task).
Is that okay? Or is there a better way to do this?

Chris White
Jackson Maxfield Brown
07/16/2020, 3:47 AM

Gary
07/16/2020, 4:12 AM
> we usually recommend people limit the number of mapped tasks at any given level to ~10,000, both for performance reasons and also at that scale it's not clear that prefect is providing much visibility.
"At any given level to ~10,000" -- say, if I have a flow as follows:
with Flow("another flow") as flow:
    ids = get_stock_id_list()
    job_1_result = job_1.map(ids)
    job_2_result = job_2.map(ids)
    ...
    job_10_result = job_10.map(ids)
Did you mean that the sum of all mapped tasks within this flow should be 10k, i.e. 1k for each job function? Or 10k for each job function (100k mapped tasks in total for this flow)?

> the open source prefect server would fall over at this scale, so if you want the UI I would highly recommend using prefect cloud for this scale
Yes, we want the UI. Could you provide more info about why the open source prefect server would fall over at this scale?
nicholas
Jackson Maxfield Brown
07/16/2020, 4:32 AM

Jackson Maxfield Brown
07/16/2020, 4:34 AM
Submissions to the DaskExecutor are done one by one now, instead of simply passing all inputs to client.map under the hood, which is what causes a lot of problems

Gary
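The contrast Jackson draws above is between submitting work items one at a time and handing the scheduler the whole input list in a single call. A minimal stdlib sketch of the two patterns, with ThreadPoolExecutor standing in for Dask's distributed Client (whose client.submit / client.map pair is analogous) and crawl as a made-up placeholder task:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(stock_id):
    # Stand-in for per-stock work (the real task would hit an external site).
    return stock_id * 2

stock_ids = list(range(1000))

with ThreadPoolExecutor(max_workers=8) as executor:
    # One-by-one submission: one scheduling round trip per input.
    # This mirrors the pattern described above that hurts at large scale.
    futures = [executor.submit(crawl, s) for s in stock_ids]
    one_by_one = [f.result() for f in futures]

    # Bulk submission: the whole input list is handed over in one call,
    # analogous to a single client.map with Dask's distributed Client.
    bulk = list(executor.map(crawl, stock_ids))

assert one_by_one == bulk
```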
07/16/2020, 4:43 AM
> Could you elaborate on why you want this many mapped tasks? While prefect can support incredibly granular tasks, it's usually worthwhile to take a step back and ask whether this level of granularity is actually providing you value, vs. batching users into fewer groups (on the order of thousands)
For the first scenario, we currently want to process U.S. stock data with Prefect. The number of U.S. stocks we have to deal with is fewer than 8,000, so the recommended mapped-task number is okay for us. However, in the future we will not only focus on the U.S. stock market; we will eventually process stocks in different countries, e.g. Europe, China, Hong Kong, etc. Moreover, tasks for a stock can sometimes fail unexpectedly due to many kinds of issues (e.g., the crawler fails because the target webpage throttles us, or because of stability issues with our proxy service). Therefore, it would be really helpful for us if Prefect can handle many flows, each with a few tens of thousands of mapped tasks. For the latter scenario, batch-processing user data is okay for us. Just asking out of curiosity. 🙂
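The batching approach Chris suggests for the million-user scenario can be sketched as follows: group the user ids into batches and map over the batches, so each mapped task processes a whole batch rather than one user. Here chunk and analyze_users are illustrative names, not Prefect APIs:

```python
def chunk(ids, batch_size):
    # Split a flat id list into consecutive batches of at most batch_size.
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def analyze_users(user_id_batch):
    # Placeholder for the per-user behavior analysis, applied to a whole
    # batch inside a single mapped task instead of one task per user.
    return [uid % 7 for uid in user_id_batch]

user_ids = list(range(1_000_000))
batches = chunk(user_ids, 1_000)

# 1,000 mapped tasks instead of 1,000,000.
print(len(batches))  # → 1000
```

In the flow itself you would then map the batch-level task over the batches (analyze_users.map(batches)), keeping the mapped-task count on the order of thousands.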
Gary
07/16/2020, 5:33 AM

Chris White
The ~10,000 recommendation is per call to .map, so your overall flow may have many more tasks than that. Also, we recently refactored our mapping pipeline, so it's likely much more efficient now. Lastly:
> if Prefect can work under many flows, for each flow with a few tens of thousand mapped tasks
Prefect Cloud can absolutely handle this load, so no worries there!
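Spelling out the arithmetic for the ten-job flow above under this per-.map reading (numbers taken from Gary's example):

```python
ids_per_level = 10_000   # each .map(ids) creates one mapped task run per id
map_levels = 10          # job_1 through job_10

# The ~10k guideline applies to each level individually...
assert ids_per_level <= 10_000
# ...while the flow as a whole contains far more mapped task runs.
total_mapped_tasks = map_levels * ids_per_level
print(total_mapped_tasks)  # → 100000
```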
Gary
07/19/2020, 2:46 AM