Gary

07/16/2020, 3:24 AM
Hi folks, we are evaluating Prefect to handle our data flow. So far everything looks great for our needs. Our only concern is whether it is okay to generate a massive number of mapped tasks. One of our scenarios is to use Prefect to crawl, calculate, and store financial data (e.g., financial statements and daily trading data of US stocks). Here is a simplified example:
from prefect import task, Flow

@task
def get_stock_id_list():
  # query stock id list from database
  stock_id_list = query_stock_id_list_from_db()
  return stock_id_list

@task
def crawl_stock_data_from_external_website(stock_id):
  # Crawler related work
  return crawl_stock_data(stock_id)

@task
def perform_calculation(crawled_data):
  # Some calculation with Pandas
  perform_some_calculation(crawled_data)


with Flow('Example flow') as flow:
  # The list contains about ten thousand stock IDs.
  stock_id_list = get_stock_id_list()

  crawled_data_list = crawl_stock_data_from_external_website.map(stock_id_list)

  perform_calculation.map(crawled_data_list)
Is it okay for Prefect to generate about 10,000~50,000 mapped tasks within the above flow? Another scenario is to generate about 1 million tasks within a flow. In this scenario, we query a list of user IDs from our database and perform some user behavior analysis, with one mapped task per user. Is that okay, or is there a better way to do this?

Chris White

07/16/2020, 3:33 AM
Hey Gary and welcome! In general we are always interested in hearing about the experiences of users who push the limits of the system. That being said, here are a few notes:
• we usually recommend people limit the number of mapped tasks at any given level to ~10,000, both for performance reasons and also at that scale it’s not clear that prefect is providing much visibility. I’ve personally run mapped pipelines with 40,000 tasks and it worked fairly well, but I was also running on a large dask cluster
• For large pipelines like this we recommend running with a dask executor for the best performance
• the open source prefect server would fall over at this scale, so if you want the UI I would highly recommend using prefect cloud for this scale
• Could you elaborate on why you want this many mapped tasks? While prefect can support incredibly granular tasks, it’s usually worthwhile to take a step back and ask whether this level of granularity is actually providing you value, vs. batching users into fewer groups (on the order of thousands)
In short, I wouldn’t recommend it but it could be an interesting case study
:upvote: 2
👍 1
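The batching suggestion above can be sketched in plain Python. This is only an illustration, not Prefect's API: `chunked` and the batch size of 500 are hypothetical choices, and in a real flow the helper would presumably run inside a task whose output is then mapped over, so each mapped task handles one batch instead of one user.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical input: one million user IDs, as in the second scenario.
user_ids = list(range(1_000_000))

# Mapping over batches of 500 gives 2,000 units of work instead of 1,000,000.
batches = list(chunked(user_ids, 500))
print(len(batches))  # 2000
```

The trade-off is visibility: a failure is then reported per batch rather than per user, so the batch size is a balance between scheduler load and how precisely a failure can be located.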

Jackson Maxfield Brown

07/16/2020, 3:47 AM
This is interesting to learn. We were planning on doing something similar: we want to process something like 200,000 single-cell images. We can break them into subcategories and process about 20,000 at a time, but the "dataset" as a whole is about 200,000.

Gary

07/16/2020, 4:12 AM
@Chris White Thanks for your reply.
we usually recommend people limit the number of mapped tasks at any given level to ~10,000, both for performance reasons and also at that scale it’s not clear that prefect is providing much visibility.
"At any given level to ~10,000" Say, if I have a flow as follows:
with Flow("another flow") as flow:
  ids = get_stock_id_list()
  job_1_result = job_1.map(ids)
  job_2_result = job_2.map(ids)
  .
  .
  .
  job_10_result = job_10.map(ids)
Did you mean that the total number of mapped tasks within this flow should be 10k (1k for each job function), or 10k for each job function (100k mapped tasks in total for this flow)?
the open source prefect server would fall over at this scale, so if you want the UI I would highly recommend using prefect cloud for this scale
Yes, we want the UI. Could you provide more info about why the open source Prefect Server would fall over at this scale?

nicholas

07/16/2020, 4:27 AM
@Gary - hopping in for Chris, that's a soft per-task limit and is usually a matter of best practice. As for Prefect Server vs Cloud, it's more of a larger services/infrastructure problem; the various services and the database need to handle a LOT of state setting and log writing when mapping over that many tasks. Prefect Cloud has a myriad of elegant caching and volume layers that allow it to handle state-setting at scale, but these aren't available in Prefect Server.
@Jackson Maxfield Brown - something I've found to be useful in jobs that large is batching wherever possible. It sounds like you're already prepared to split tasks at a sub-category level, but batching within those at whatever granularity is possible will be a boon to you as you maintain your pipeline.
👍 1

Jackson Maxfield Brown

07/16/2020, 4:32 AM
@nicholas totally agree -- https://github.com/PrefectHQ/prefect/issues/2459 Originally I thought this was a good idea (I still think it would be useful in some situations), but by and large I now think sub-setting the data into large batches myself is better.
:upvote: 1
I believe part of the problem was also resolved by 0.12.0 or 0.12.1. I forget which PR it was, but in one of those releases, tasks submitted to the DaskExecutor are now submitted one by one instead of all inputs simply being passed to client.map under the hood, which is what caused a lot of problems.
🚀 2

Gary

07/16/2020, 4:43 AM
Could you elaborate on why you want this many mapped tasks? While prefect can support incredibly granular tasks, it’s usually worthwhile to take a step back and ask whether this level of granularity is actually providing you value, vs. batching users into fewer groups (on the order of thousands)
For the first scenario, we currently want to process U.S. stock data with Prefect. The number of U.S. stocks we have to deal with is fewer than 8,000, so the recommended number of mapped tasks is okay for us. However, in the future we will not only focus on the US stock market. We will eventually process stocks from different regions, e.g., Europe, China, Hong Kong, etc. Moreover, tasks for a stock could unexpectedly fail sometimes due to many kinds of issues (e.g., the crawler fails because the target webpage throttles it, or because of stability issues with the proxy service). Therefore, if Prefect can work under many flows, for each flow with a few tens of thousand mapped tasks, it will be really helpful for us. For the latter scenario, processing user data in batches is okay for us. I just asked out of curiosity. 🙂
@nicholas Providing this information in the docs on the official Prefect website may help developers evaluate whether they need Prefect Cloud. Thanks for your detailed explanation. 😃
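One way to reconcile batching with the per-stock failures described above is to isolate failures inside each batch, so a single throttled request does not fail the whole batch. The following is a minimal sketch in plain Python under that assumption; `process_batch` and `fake_crawl` are hypothetical names for illustration and are not part of Prefect or of the flow shown earlier.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(
    stock_ids: Iterable[str],
    crawl: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[str]]:
    """Crawl every stock in a batch, collecting failures instead of aborting."""
    results: Dict[str, dict] = {}
    failed: List[str] = []
    for stock_id in stock_ids:
        try:
            results[stock_id] = crawl(stock_id)
        except Exception:
            # Record the ID so it can be retried later in a smaller batch.
            failed.append(stock_id)
    return results, failed


def fake_crawl(stock_id: str) -> dict:
    """Stand-in crawler (hypothetical): fails for one ID to simulate throttling."""
    if stock_id == "BAD":
        raise RuntimeError("throttled by target website")
    return {"id": stock_id}


results, failed = process_batch(["AAPL", "BAD", "MSFT"], fake_crawl)
print(sorted(results), failed)
```

The list of failed IDs could then feed a follow-up run, which keeps the number of mapped tasks low while still retrying individual stocks.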

Chris White

07/16/2020, 4:06 PM
Hey @Gary - sorry for the delayed response; +1 to everything nicholas said. In addition, the 10,000 task rule of thumb I mentioned is for each time you call .map, so your overall flow may have many more tasks than that. Also, we recently refactored our mapping pipeline and so it’s likely much more efficient now. Lastly
if Prefect can work under many flows, for each flow with a few tens of thousand mapped tasks
Prefect Cloud can absolutely handle this load, so no worries there!
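Since the ~10,000 rule of thumb applies to each .map call, a batch size can be derived from it with simple arithmetic. The sketch below assumes that rule of thumb as a default; `batch_size_for` is a hypothetical helper, not something Prefect provides.

```python
import math


def batch_size_for(n_items: int, max_children: int = 10_000) -> int:
    """Smallest batch size that keeps one .map call at or below `max_children` batches."""
    if n_items < 0 or max_children < 1:
        raise ValueError("n_items must be >= 0 and max_children must be >= 1")
    return max(1, math.ceil(n_items / max_children))


# One million users: batches of 100 give exactly 10,000 mapped batches.
size = batch_size_for(1_000_000)
n_batches = math.ceil(1_000_000 / size)
print(size, n_batches)  # 100 10000

# Fewer than 8,000 stocks already fit under the limit, so no batching is needed.
print(batch_size_for(8_000))  # 1
```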

Gary

07/19/2020, 2:46 AM
Sounds great! I am building some prototypes to verify these scenarios. Thank you Chris and the Prefect team. 😀
🚀 1