<@ULVA73B9P> are there any known issues in prefect...
# ask-marvin
m
@Marvin are there any known issues in Prefect 2.x where a lot of flows submitted at once don't actually get submitted properly via the API, no error is returned, and no future object is created? Sometimes we submit 10,000 flows on GPUs, similar to a batch process, and the number of flows submitted does not match the total number of jobs there should actually be.
m
It seems there are no direct GitHub issues specifically mentioning that Prefect 2.x flows are submitted but not executed, without returning any error or creating a future object, when dealing with large batches (like 10,000 flows). However, there are some related issues that might be indirectly impacting or similar to your concern:
1. Dask Task Scheduling Hangs: There was an issue with tasks hanging and eventually failing, particularly with the DaskTaskRunner in large task scheduling scenarios. This was about tasks not being scheduled after a certain point, and the flow eventually crashing with a `PoolTimeout`. (Issue Link)
2. Concurrency and Parallel Flow Runs: Prefect does not fully support running multiple concurrent flow runs of the same flow in certain environments, and this might cause unexpected behavior when dealing with multi-threading or multi-processing. (Issue Link)
3. Flow Deployment Issues: There were issues with flows stuck in a Pending state when deploying them as part of certain modules, which might hint at problems with the deployment and flow execution process. (Issue Link)
4. Subflow Submission: There were proposals to add ways for subflows to operate more efficiently, including running in the background. Some difficulties in this area might influence how large batches are handled. (Issue Link)
To better investigate and potentially resolve your issue, you might want to follow the discussion and resolution updates on these issues, reach out on the Prefect community forums, or open a new issue if your problem persists. You can find those issues on GitHub using the provided links.
n
do you have concurrency limits that might keep flow runs in pending unexpectedly?
m
yes, there is a concurrency limit of 250, but nonetheless the total number of subflows created via run_deployment does not match.
n
how are you checking that they're created?
m
At a very high level, we query a database, create a template for that job (Pydantic objects), save it to S3 as JSON, then pass the Pydantic object as a dictionary.
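For context, the fan-out pattern described above looks roughly like the sketch below. It only illustrates the shape of the parent flow; the `JobSpec` model, the `save_template_to_s3` helper, the bucket, and the `gpu-job/production` deployment name are hypothetical stand-ins, not the actual code.

```python
import boto3
from prefect import flow
from prefect.deployments import run_deployment
from pydantic import BaseModel


class JobSpec(BaseModel):
    """Hypothetical job template; real fields come from the database query."""
    job_id: int
    input_path: str


def save_template_to_s3(spec: JobSpec, bucket: str = "my-job-templates") -> None:
    # Persist the job template as JSON so input/output pairs can be checked later.
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"templates/{spec.job_id}.json",
        Body=spec.json(),  # pydantic v1 style serialization
    )


@flow
def parent(job_specs: list[JobSpec]):
    flow_runs = []
    for spec in job_specs:
        save_template_to_s3(spec)
        # Submit one subflow per job via its deployment; timeout=0 returns the
        # created FlowRun immediately instead of waiting for it to finish.
        fr = run_deployment(
            name="gpu-job/production",
            parameters=spec.dict(),
            timeout=0,
        )
        flow_runs.append(fr)
    return flow_runs
```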
n
sorry, I mean in terms of Prefect: are you looking for flow run records? Or when you say there's a discrepancy between what you submit and what you end up seeing, I'm wondering how you're arriving at what you "end up seeing", if that makes sense
m
The objects in S3 don't match the number of flows. If we run again from the parent, it is not deterministic which jobs do or don't get created. At this scale it's something like 10 jobs that don't get submitted.
n
> The objects in S3 don't match the number of flows
so your flows have a side effect that writes to S3? and you're using that fact as a proxy for whether or not the flow "happened"?
m
I'm going to debug later this week by checking that the future objects match the number of jobs and that they all get states created, to see if it has to do with the API
👍 1
The parent essentially just writes the parameters of the subflows that should be run, so that we have a quick check to see which have corresponding input-output pairs
thanks
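A minimal sketch of that debugging check, assuming the parent keeps the FlowRun objects that `run_deployment(..., timeout=0)` returns; the function and variable names here are made up for illustration. It re-reads each submitted run from the API and flags count mismatches or missing states.

```python
import asyncio
from prefect.client.orchestration import get_client


async def verify_submissions(flow_runs, expected_count: int):
    """Confirm every submission produced a flow run the API knows about and
    that each one has a state; mismatches point at the submission side."""
    if len(flow_runs) != expected_count:
        print(f"only {len(flow_runs)} of {expected_count} calls returned a flow run")

    stateless = []
    async with get_client() as client:
        for fr in flow_runs:
            refreshed = await client.read_flow_run(fr.id)
            if refreshed.state is None:
                stateless.append(refreshed.id)

    if stateless:
        print(f"{len(stateless)} flow runs have no state: {stateless}")
    else:
        print("all submitted flow runs exist and have a state")


# e.g. asyncio.run(verify_submissions(flow_runs, expected_count=len(job_specs)))
```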
n
gotcha, yeah my naive suspicion would still be totally occupied concurrency slots of some kind, but yeah feel free to pop back here later!
👍 1
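On the occupied-slot theory, one way to check is to query the API for runs sitting in Pending and for tag-based concurrency limits. Below is a rough sketch against a recent Prefect 2.x client (import paths vary across 2.x versions); note that tag-based limits apply to task runs, while a work-queue or work-pool concurrency limit like the 250 mentioned above is configured separately.

```python
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType


async def check_concurrency_pressure():
    """List flow runs stuck in Pending and the configured tag-based concurrency
    limits, to see whether occupied slots explain runs that never start."""
    async with get_client() as client:
        pending = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.PENDING])
                )
            ),
            limit=200,  # the API paginates; raise or offset this for bigger batches
        )
        print(f"{len(pending)} flow runs currently Pending (first page)")

        for cl in await client.read_concurrency_limits(limit=100, offset=0):
            print(
                f"tag={cl.tag!r} limit={cl.concurrency_limit} "
                f"active_slots={len(cl.active_slots)}"
            )


# asyncio.run(check_concurrency_pressure())
```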