# prefect-community
m
Hi folks, our Prefect Kubernetes agent sometimes fails to deploy flow runs due to a connection error - namely a ReadTimeoutError (more details in the thread). Note that the same agent is able to deploy tens to hundreds of flows every day but seems to hiccup with these errors, leading to flow runs never getting submitted on our Kubernetes cluster.
We are running the following prefect agent image on our EKS cluster:
Copy code
prefecthq/prefect:0.14.22-python3.8
Here is a snippet of the logs:
Copy code
[2022-05-20 08:00:00,000] INFO - prefect-agent-staging | Deploying flow run ccefa859-6380-48c7-9a58-c7f1030cb294 to execution environment...
INFO:prefect-agent-staging:Deploying flow run ccefa859-6380-48c7-9a58-c7f1030cb294 to execution environment...
[2022-05-20 08:00:01,276] INFO - prefect-agent-staging | Completed deployment of flow run ccefa859-6380-48c7-9a58-c7f1030cb294
INFO:prefect-agent-staging:Completed deployment of flow run ccefa859-6380-48c7-9a58-c7f1030cb294
INFO:prefect-agent-staging:Deploying flow run a8647852-1667-461f-a3ee-e749b580f2ac to execution environment...
[2022-05-20 08:00:37,759] INFO - prefect-agent-staging | Deploying flow run a8647852-1667-461f-a3ee-e749b580f2ac to execution environment...
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)")': /
As you can see, the first flow run (ccefa859-6380-48c7-9a58-c7f1030cb294) at 08:00:00 UTC is deployed just fine. The second flow run a8647852-1667-461f-a3ee-e749b580f2ac, however, fails deployment - the agent shows this warning indicating an error happened:
Copy code
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)")': /
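For context, here is a minimal sketch of the retry/timeout combination that warning describes - this is my own illustration, not Prefect's actual client code:
Copy code
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# mirror the Retry(total=5, ...) policy shown in the warning
session.mount("https://", HTTPAdapter(max_retries=Retry(total=5)))

# (connect timeout, read timeout) - the 15 matches "read timeout=15" in the log;
# if the API does not answer within 15 s, a ReadTimeoutError is raised and
# urllib3 retries the request until the retry budget is exhausted
response = session.get("https://api.prefect.io/", timeout=(3.05, 15))
print(response.status_code)
If I'm reading the 0.x config right, that 15 second read timeout comes from cloud.request_timeout, so raising it (e.g. via the PREFECT__CLOUD__REQUEST_TIMEOUT env var on the agent) might be one knob worth looking at.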
Searching through Slack, the closest thread I found that might be related to this is the following one, where unfortunately there is no clear resolution
We have had this issue of flow runs never getting started for some time now (a few months at least), but it got much worse last week and yesterday. That made us look into it further, and we just narrowed it down to the agent failing to deploy the flow run.
a
It looks like your agent failed to deploy the flow run and it got stuck in the Submitted state, which indicates an issue on the execution layer. Did you check whether there are enough resources and everything is fine on your K8s cluster?
m
No, that is not accurate: the Kubernetes job never gets created.
Additionally, the same exact flow run executes successfully every day.
So to clarify, the agent fails to create the job, from what I can tell by looking at our cluster's Kubernetes events.
a
Exactly, I don't deny it - it didn't get created due to an issue on the execution layer, and the agent is part of it.
m
Oh, are you suggesting I increase the resources allocated to the agent?
a
I don't know - it could be worth examining more closely, because it seems that the agent did the right thing in picking up the flow run but somehow couldn't deploy it (i.e. couldn't create a K8s job for the flow run).
So we would need to find out why in order to know how to solve it.
m
My hunch is that the agent failed connecting to Prefect Cloud, causing this failure. I will take a closer look at the agent code to see what happens after the agent logs "Deploying flow run ..."
👍 1
Here is the code snippet from 0.14.22:
Copy code
self.logger.info(
    f"Deploying flow run {flow_run.id} to execution environment..."
)

self._mark_flow_as_submitted(flow_run)

# Call the main deployment hook
deployment_info = self.deploy_flow(flow_run)
Before the agent starts creating the Kubernetes job in self.deploy_flow, it has to run _mark_flow_as_submitted, where it makes a self.client.set_flow_run_state call and then a series of self.client.set_task_run_state calls.
Given the connection error, I think one of these client calls is failing and is not being retried correctly.
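Roughly what those calls look like, simplified from my reading of the 0.14.22 agent source (not the exact code; the version value below is a placeholder):
Copy code
from prefect.client import Client
from prefect.engine.state import Submitted

client = Client()

# each of these calls is an HTTPS request to api.prefect.io, so each one is
# exposed to the same 15 s read timeout seen in the warning above
client.set_flow_run_state(
    flow_run_id="a8647852-1667-461f-a3ee-e749b580f2ac",
    version=1,  # placeholder version
    state=Submitted(message="Submitted for execution"),
)
# ...followed by a set_task_run_state call for each task run in the flow run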
I will follow up with links to the github code soon, need to leave the keyboard for now
a
I see - in that case, the next good step would be figuring out why this call fails or why it's not retried. It could be that this is due to resource allocation.
You could also try upgrading your image and your agent to a newer Prefect version - or even better, spin up a new agent process with a more recent Prefect version and a different label, and re-register this flow with this new label.
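Just a sketch of that suggestion - the label "k8s-canary" and the project name "my-project" below are placeholders, swap in your own:
Copy code
from prefect import Flow, task
from prefect.run_configs import KubernetesRun


@task
def say_hello():
    print("hello from the canary agent")


with Flow("example-flow") as flow:
    say_hello()

# give the flow a label that only the new agent listens to
flow.run_config = KubernetesRun(labels=["k8s-canary"])
flow.register(project_name="my-project")

# then start the extra agent with a newer image and the matching label, e.g.
#   prefect agent kubernetes start --label k8s-canary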
m
Ideally we would want confirmation that the relevant code has been improved in later Prefect versions to properly resolve this issue. This is a hard-to-reproduce bug given it doesn't happen too frequently, so it's best to pinpoint why a Prefect upgrade would resolve this.
a
So I'm not suggesting upgrading directly, but rather spinning up a new agent process as an extra and testing whether this fixes the issue.
Unfortunately, I can't give you a confirmation that an upgrade will directly fix it; it depends on many variables.
m
Sorry, I was wrong about the error occurring in _mark_flow_as_submitted - the error is clearly happening in deploy_flow, here: https://github.com/PrefectHQ/prefect/blob/0.14.22/src/prefect/agent/kubernetes/agent.py#L418
This is the function call that is failing:
Copy code
self.batch_client.create_namespaced_job(
    namespace=self.namespace, body=job_spec
)
This issue was first reported as a bug here: https://github.com/PrefectHQ/prefect/issues/3278. Retry logic was added to resolve it following that report, but it seems that in our case the retries are not always enough.
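For illustration, this is the kind of retry wrapper one could put around that call - my own sketch, not the code Prefect ships, and the attempt counts/delays are made up:
Copy code
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError

config.load_incluster_config()  # the agent runs inside the cluster
batch_client = client.BatchV1Api()


def create_job_with_retries(namespace, job_spec, attempts=5, delay=5):
    """Retry create_namespaced_job on API/connection errors before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return batch_client.create_namespaced_job(
                namespace=namespace, body=job_spec
            )
        except (ApiException, HTTPError):
            if attempt == attempts:
                raise
            time.sleep(delay)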
Do you know if increasing the resources allocated to the agent would help in this regard?
a
Hard to say, but it seems worth trying.