
Robin Weiß

10/21/2022, 9:33 AM
Hey there! I have a really weird issue with Kubernetes jobs that I can’t wrap my head around: I try to start around 150 flows from one orchestrator flow. I do this because I need to parallelize across K8s pods instead of threads. This works, but only around 50 of my flow runs actually transition into the “Running” state and start a K8s job & pod; the other 100 flow runs just idle around in “Pending”. I have searched everywhere for hints about why they won’t start. There are definitely enough K8s compute resources for them to be scheduled, and there are no concurrency limits set via tags or on the work queue directly. The K8s work queue just shows them as “Pending” in the UI. Does anyone have any idea where else to look? Cheers 🙂
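For context, the fan-out pattern described here can be sketched roughly like this. `trigger_flow_run` is a hypothetical stand-in for the real submission call (in Prefect 2.x, `run_deployment` from `prefect.deployments` plays that role), so this only illustrates the shape of the orchestration, not Prefect’s API:

```python
import asyncio

# Hypothetical stand-in for the call that creates a child flow run on the
# work queue (e.g. prefect.deployments.run_deployment in Prefect 2.x).
async def trigger_flow_run(name: str) -> str:
    await asyncio.sleep(0)  # placeholder for the actual API round trip
    return f"{name}: Scheduled"

async def orchestrate(n: int = 150) -> list:
    # Fan out: submit all child runs concurrently instead of one by one.
    return await asyncio.gather(
        *(trigger_flow_run(f"child-{i}") for i in range(n))
    )

states = asyncio.run(orchestrate())
print(len(states))  # 150
```

Submitting is cheap here; the bottleneck discussed in this thread is what happens downstream, between “Scheduled”/“Pending” and “Running”.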

redsquare

10/21/2022, 9:36 AM
Does kubectl describe pods ${POD_NAME} reveal anything?

Robin Weiß

10/21/2022, 9:37 AM
There are no pods (or jobs) on K8s yet, that’s the problem. Prefect doesn’t even seem to try to start the jobs.

redsquare

10/21/2022, 9:37 AM
Ah sorry, I thought you meant a pending pod, because of this:
“This works, but only around 50 of my pods actually get into a running state, the other 100 just idle around in ‘Pending’”

Robin Weiß

10/21/2022, 9:40 AM
You are completely right, I described that very badly. Fixed it now. Thanks a lot 🙂

redsquare

10/21/2022, 9:40 AM
🙂
Are you using a single Prefect agent in K8s?

Robin Weiß

10/21/2022, 9:43 AM
Yes, I am.
I was thinking there might be some implicit limit on spawned jobs, but the weird thing is that it spawns exactly 52 jobs, not 50.

redsquare

10/21/2022, 9:46 AM
Hmmm, odd number. I have not tried to spin up that many flows at once. Just wondering if you need to shard it across agents; does the agent log report anything?

Robin Weiß

10/21/2022, 9:47 AM
Unfortunately it logs everything that every single pod logs, so it’s pretty hard to get any useful information out of it 😕

redsquare

10/21/2022, 10:03 AM
I am not sure whether two agents on one queue distribute the work semi-evenly, or whether the first one to poll grabs all the work. You may need to set a concurrency limit on the work queue to stop that from happening and split the load across n agents.
Hmm, but I’m not sure if the concurrency limit is per agent or overall.

Robin Weiß

10/21/2022, 10:07 AM
I have just a single agent instance running, so this shouldn’t be a problem I think 🤔 I also tried setting the work queue concurrency limit to 200, but it didn’t change anything. I have this fear that it’s somehow related to rate limiting on the free Prefect Cloud tier, but I have no idea how to verify this.

redsquare

10/21/2022, 10:20 AM
Looking at the agent code, it fetches just 10 flow runs each time it polls, so you could run another agent and see if you get over 52 concurrent. I have not seen any word about limits 🤷‍♂️
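To illustrate why the per-poll fetch size matters: if each poll returns at most 10 runs, submission throughput is capped no matter how many runs are Pending. The batch size of 10 comes from the discussion above; the 5-second poll interval below is an assumption for illustration, not a confirmed Prefect internal:

```python
# Toy model of agent submission throughput.
def polls_needed(pending_runs: int, batch_size: int = 10) -> int:
    """Poll cycles required to submit all pending runs (ceiling division)."""
    return -(-pending_runs // batch_size)

POLL_INTERVAL_S = 5  # assumed poll interval, for illustration only

print(polls_needed(150))                     # 15 polls for 150 runs
print(polls_needed(150) * POLL_INTERVAL_S)   # ~75 s before the last submit
```

A second agent polling the same queue would roughly halve that ramp-up time, which is why sharding across agents is a plausible experiment, even though it wouldn’t by itself explain a hard cap at 52.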

Robin Weiß

10/21/2022, 10:52 AM
The problem is that then I’d also need to change the DB, because it’s currently using the SQLite one 😕

redsquare

10/21/2022, 11:02 AM
The agents are independent of the Orion DB backed by SQLite, though?

Robin Weiß

10/21/2022, 12:37 PM
Are you sure about that? 🤔 I just saw the following line in the agent manifest on discourse:
replicas: 1  # We're using SQLite, so we should only run 1 pod

Christopher Boyd

10/21/2022, 1:10 PM
How many max connections do you have on the database?
SQLite only allows one mutation per connection, and it locks; when you are running 50 tasks in a flow concurrently, SQLite is only performing a single write transaction at a time, for each state, each task, each transition.
It’s not really designed or meant to scale like that; that’s where Postgres should come in.
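SQLite’s single-writer behavior is easy to reproduce with the standard library alone. The table name here is just an illustration, not Orion’s actual schema; `timeout=0` makes a blocked writer fail immediately instead of waiting:

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file, mimicking concurrent writers.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer_a = sqlite3.connect(path, timeout=0, isolation_level=None)  # autocommit
writer_b = sqlite3.connect(path, timeout=0, isolation_level=None)
writer_a.execute("CREATE TABLE flow_run (id INTEGER PRIMARY KEY, state TEXT)")

# Writer A takes the write lock for its transaction...
writer_a.execute("BEGIN IMMEDIATE")
writer_a.execute("INSERT INTO flow_run (state) VALUES ('RUNNING')")

# ...so writer B's mutation is rejected until A commits.
try:
    writer_b.execute("INSERT INTO flow_run (state) VALUES ('PENDING')")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

writer_a.execute("COMMIT")
writer_b.execute("INSERT INTO flow_run (state) VALUES ('PENDING')")  # succeeds now
print(blocked)  # True
```

Every state transition of every flow run is a write like this, so with many runs updating state concurrently, writes queue up behind a single lock; Postgres allows concurrent writers and removes that bottleneck.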

Robin Weiß

10/24/2022, 7:37 AM
@Christopher Boyd Thanks, that’s very helpful! Is there any way to validate that it’s definitely the SQLite limiting concurrency and not some other factor? I just want to avoid adding new infrastructure to our stack only to end up with the same problem again :D
@Christopher Boyd I just realized one more thing: is it even possible to configure Orion to use Postgres as a DB if I use Prefect Cloud? Does the agent even use its own DB if it’s just running the
["prefect", "agent", "start", "-q", "kubernetes"]
command? Am I mixing stuff up here? Cheers!

Christopher Boyd

10/25/2022, 12:34 PM
Sorry, I didn’t realize you were using Prefect Cloud; they are separate.
I don’t rightly know here. I imagine you’re hitting a limit, but I need to get some more details and do some research.
Do you have any tags registered or used at all?

Robin Weiß

10/25/2022, 1:38 PM
It’s very interesting: the logs show that the agent keeps trying to submit the same flow run again and again:
13:37:34.302 | DEBUG   | prefect.agent - Checking for flow runs...
13:37:34.524 | INFO    | prefect.agent - Submitting flow run '<SAME_FLOW_RUN_ID>'
This keeps reappearing 100 times.
Hm, there are some tags on the deployments, but I don’t really use them for anything other than organizing the UI.

Christopher Boyd

10/25/2022, 2:12 PM
Concurrency limits are set on tags, so it’s possible one is in effect, even if by accident.

Robin Weiß

10/25/2022, 2:42 PM
I do, but they are all way lower (5, 10), so I don’t think it can be them, since I see ~40–50 pods actually spin up.

Christopher Boyd

10/25/2022, 4:24 PM
At the risk of stating the obvious, what happens if you set the limit to something like 100?

Robin Weiß

10/26/2022, 1:03 PM
Nothing, unfortunately ;(