# prefect-community
r
Hey there! I have a really weird issue with Kubernetes Jobs that I can’t wrap my head around: I try to start around 150 flows from one orchestrator flow. I do this because I need to parallelize across K8s pods instead of threads. This works, but only around 50 of my flow runs actually transition into
Running
state and start a K8s job & pod; the other 100 flow runs just idle around in “Pending”. I have searched everywhere to find hints about why they won’t start. There are definitely enough K8s compute resources for them to be scheduled, and there are no concurrency limits set via tags or on the work queue directly. The K8s work queue just shows them as “Pending” in the UI. Does anyone have any idea where else to look? Cheers 🙂
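For reference, that orchestrator pattern looks roughly like the sketch below (a simplified illustration, not the actual code from this thread; the deployment name and parameters are placeholders, and it assumes Prefect 2's run_deployment helper):
```python
# Minimal sketch of "one orchestrator flow fans out ~150 flow runs" so that
# each child run gets its own Kubernetes job/pod. Placeholder names only.
from prefect import flow
from prefect.deployments import run_deployment


@flow
def orchestrator():
    for i in range(150):
        # timeout=0 returns as soon as the flow run is created rather than
        # waiting for it to finish; the runs then wait in the work queue
        # until the agent submits them as K8s jobs.
        run_deployment(
            name="child-flow/kubernetes",    # placeholder deployment name
            parameters={"partition": i},     # placeholder parameters
            timeout=0,
        )


if __name__ == "__main__":
    orchestrator()
```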
r
does kubectl describe pods ${POD_NAME} reveal anything?
r
There are no pods (or jobs) on K8s yet, that’s the problem. Prefect doesn’t even try to start the jobs, it seems
r
Ah sorry I thought you meant pending pod
This works, but only around 50 of my pods actually get into a running state, the other 100 just idle around in “Pending”
r
You are completely right, I described that very badly. Fixed it now. Thanks a lot 🙂
r
🙂
Are you using a single Prefect agent in k8s?
r
Yes, I do
I was thinking that there might be some implicit limit for jobs spawned, but the weird thing is that it spawns exactly 52 jobs and not 50
r
Hmmm, odd number - I have not tried to spin up that many flows at once. Just wondering if you need to shard it across agents; does the agent log report anything?
r
unfortunately it logs everything that every single pod logs so it’s pretty hard to get any useful information from it 😕
r
I am not sure if two agents on one queue distribute the work semi-evenly, or if the first one to poll grabs 'all' the work - you may need to set the concurrency on the work queue to stop this from happening and split the load across n agents
hmm but not sure if the concurrency is per agent or overall
r
I have just a single agent instance running so this shouldn’t be a problem I think 🤔 I also tried setting the work queue concurrency limit to 200 but it didn’t change anything. I have this fear that it’s somehow related to rate limiting on the free Prefect Cloud tier, but I have no idea how to verify this
r
Looking at the agent code, it gets just 10 flow runs each time it polls, so you could run another agent and see if you get over 52 concurrent - I have not seen any word about limits 🤷‍♂️
r
The problem is that then I’ll also need to change the DB because it’s currently using the SQLite one 😕
r
the agents are independent of the Orion DB backed by SQLite though?
r
Are you sure about that? 🤔 I just saw the following line in the agent manifest on discourse:
replicas: 1  # We're using SQLite, so we should only run 1 pod
c
how many max connections do you have on the database?
SQLite only allows one mutation per connection, and it locks; when you are running 50 tasks in a flow concurrently, SQLite is only performing a single write transaction at a time, for each state, each task, each transition
it’s not really designed or meant to scale; that’s where Postgres should come in
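To see that single-writer behavior outside of Prefect, here is a tiny standalone sketch (plain Python sqlite3, hypothetical table and file names) that spins up many concurrent writers against one SQLite file; only one write transaction proceeds at a time while the rest wait on the lock:
```python
# Illustration of SQLite serializing writes: many threads insert into the
# same database file, and each write transaction must wait for the lock.
import sqlite3
import threading

DB = "demo.db"  # hypothetical database file

conn = sqlite3.connect(DB)
conn.execute("CREATE TABLE IF NOT EXISTS states (id INTEGER PRIMARY KEY, note TEXT)")
conn.commit()
conn.close()


def writer(worker_id: int) -> None:
    # Each thread opens its own connection and writes repeatedly; writers
    # that wait longer than the timeout raise "database is locked".
    local = sqlite3.connect(DB, timeout=1)
    for i in range(20):
        try:
            local.execute(
                "INSERT INTO states (note) VALUES (?)",
                (f"worker {worker_id} row {i}",),
            )
            local.commit()
        except sqlite3.OperationalError as exc:
            print(f"worker {worker_id}: {exc}")
    local.close()


threads = [threading.Thread(target=writer, args=(n,)) for n in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```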
r
@Christopher Boyd Thanks, that’s very helpful! Is there any way to validate that it’s definitely SQLite limiting the concurrency and not some other factor? I just want to avoid adding new infrastructure to our stack only to end up with the same problem again :D
@Christopher Boyd I just realized one more thing: Is it even possible to configure Orion to use Postgres as a DB if I use Prefect Cloud? Does the agent even use its own DB if it’s just running the
["prefect", "agent", "start", "-q", "kubernetes"]
command? Am I mixing stuff up here? Cheers!
c
sorry, I didn’t realize you were using Prefect Cloud, they are separate
I don’t rightly know here, I imagine you’re hitting a limit, but I need to get some more details and do some research
Do you have any tags registered or used at all?
r
it’s very interesting, the logs show that the agent keeps trying to schedule the same flow run again and again:
13:37:34.302 | DEBUG   | prefect.agent - Checking for flow runs...
13:37:34.524 | INFO    | prefect.agent - Submitting flow run '<SAME_FLOW_RUN_ID>'
This keeps re-appearing 100 times
Hm, there are some tags on the deployments but I don’t really use them other than for organizing the UI
c
Concurrency limits are set on tags, so it’s possible one is set, even if by accident
r
I do have some, but they are all way lower (5, 10), so it couldn’t be them I think, since I see ~40-50 pods actually spin up
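One way to double-check what tag limits actually exist is to list them programmatically; the sketch below uses Prefect 2's client methods as I understand them, so verify the calls against your installed version:
```python
# Hedged sketch (not from the thread): list the tag-based concurrency limits
# registered in the workspace to confirm nothing is capping runs by accident.
import asyncio

from prefect.client import get_client


async def list_tag_limits() -> None:
    async with get_client() as client:
        # read_concurrency_limits is assumed here to take pagination args;
        # check your Prefect version's client if the call differs.
        limits = await client.read_concurrency_limits(limit=200, offset=0)
        for cl in limits:
            print(f"tag={cl.tag!r} limit={cl.concurrency_limit}")


if __name__ == "__main__":
    asyncio.run(list_tag_limits())
```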
c
at the risk of stating the obvious, what happens if you set the limit to something like 100?
r
Nothing unfortunately ;(