# prefect-community
m
Hi folks, our Prefect Kubernetes agent sometimes fails to deploy flow runs due to a connection error - namely a ReadTimeoutError (more details in the thread). Note that the same agent is able to deploy tens to hundreds of flows every day but seems to hiccup with these errors, leading to flow runs never getting submitted on our Kubernetes cluster.
We are running the following prefect agent image on our EKS cluster:
Copy code
prefecthq/prefect:0.14.22-python3.8
Here is a snippet of the logs:
Copy code
[2022-05-20 08:00:00,000] INFO - prefect-agent-staging | Deploying flow run ccefa859-6380-48c7-9a58-c7f1030cb294 to execution environment...
INFO:prefect-agent-staging:Deploying flow run ccefa859-6380-48c7-9a58-c7f1030cb294 to execution environment...
[2022-05-20 08:00:01,276] INFO - prefect-agent-staging | Completed deployment of flow run ccefa859-6380-48c7-9a58-c7f1030cb294
INFO:prefect-agent-staging:Completed deployment of flow run ccefa859-6380-48c7-9a58-c7f1030cb294
INFO:prefect-agent-staging:Deploying flow run a8647852-1667-461f-a3ee-e749b580f2ac to execution environment...
[2022-05-20 08:00:37,759] INFO - prefect-agent-staging | Deploying flow run a8647852-1667-461f-a3ee-e749b580f2ac to execution environment...
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)")': /
As you can see, the first flow run (ccefa859-6380-48c7-9a58-c7f1030cb294) at 08:00:00 UTC is deployed just fine. The second flow run a8647852-1667-461f-a3ee-e749b580f2ac, however, fails deployment - the agent shows this warning indicating an error happened:
Copy code
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)")': /
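For context, here is a minimal sketch of the retry/timeout combination that warning describes - this is my own illustration, not Prefect's actual client code:
Copy code
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# mirror the Retry(total=5, ...) policy shown in the warning
session.mount("https://", HTTPAdapter(max_retries=Retry(total=5)))

# (connect timeout, read timeout) - the 15 matches "read timeout=15" in the log;
# if the API does not answer within 15 s, a ReadTimeoutError is raised and
# urllib3 retries the request until the retry budget is exhausted
response = session.get("https://api.prefect.io/", timeout=(3.05, 15))
print(response.status_code)
If I'm reading the 0.x config right, that 15 second read timeout comes from cloud.request_timeout, so raising it (e.g. via the PREFECT__CLOUD__REQUEST_TIMEOUT env var on the agent) might be one knob worth looking at.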
Searching through Slack, the closest thread I found that might be related to this is the following one, where unfortunately there is no clear resolution
We have had this issue of flow runs never getting started for some time now (a few months at least), but it got much worse last week and yesterday. That made us look into it further, and we just narrowed it down to the agent failing to deploy the flow run.
a
It looks like your agent failed to deploy the flow run and it got stuck in the Submitted state, which indicates an issue on the execution layer. Did you check whether there are enough resources and everything is fine on your K8s cluster?
m
No, that is not accurate: the Kubernetes job never gets created.
Additionally, the same exact flow run executes successfully every day.
So to clarify, the agent fails to create the job, from what I can tell by looking at our cluster's Kubernetes events.
a
Exactly, I don't deny it - it didn't get created due to an issue on the execution layer, and the agent is part of it.
m
Oh, are you suggesting I increase the resources allocated to the agent?
a
I don't know - it could be worth examining more closely, because it seems that the agent did the right thing in picking up the flow run but somehow couldn't deploy it (i.e. couldn't create a K8s job for the flow run).
So we would need to find out why in order to know how to solve it.
m
My hunch is that the agent failed connecting to Prefect Cloud, causing this failure. I will take a closer look at the agent code to see what happens after the agent logs "Deploying flow run ..."
👍 1
Here is the code snippet from 0.14.22:
Copy code
self.logger.info(
    f"Deploying flow run {flow_run.id} to execution environment..."
)

self._mark_flow_as_submitted(flow_run)

# Call the main deployment hook
deployment_info = self.deploy_flow(flow_run)
Before the agent starts creating the Kubernetes job in self.deploy_flow, it has to run _mark_flow_as_submitted, where it makes a self.client.set_flow_run_state call and then a series of self.client.set_task_run_state calls.
Given the connection error, I think one of these client calls is failing and is not being retried correctly.
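Roughly what those calls look like, simplified from my reading of the 0.14.22 agent source (not the exact code; the version value below is a placeholder):
Copy code
from prefect.client import Client
from prefect.engine.state import Submitted

client = Client()

# each of these calls is an HTTPS request to api.prefect.io, so each one is
# exposed to the same 15 s read timeout seen in the warning above
client.set_flow_run_state(
    flow_run_id="a8647852-1667-461f-a3ee-e749b580f2ac",
    version=1,  # placeholder version
    state=Submitted(message="Submitted for execution"),
)
# ...followed by a set_task_run_state call for each task run in the flow run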
I will follow up with links to the github code soon, need to leave the keyboard for now
a
I see - in that case, the next good step would be figuring out why this call fails or why it's not retried. It could be that this is due to resource allocation.
You could also try upgrading your image and your agent to a newer Prefect version - or even better, spin up a new agent process with a more recent Prefect version and a different label, and re-register this flow with this new label.
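Just a sketch of that suggestion - the label "k8s-canary" and the project name "my-project" below are placeholders, swap in your own:
Copy code
from prefect import Flow, task
from prefect.run_configs import KubernetesRun


@task
def say_hello():
    print("hello from the canary agent")


with Flow("example-flow") as flow:
    say_hello()

# give the flow a label that only the new agent listens to
flow.run_config = KubernetesRun(labels=["k8s-canary"])
flow.register(project_name="my-project")

# then start the extra agent with a newer image and the matching label, e.g.
#   prefect agent kubernetes start --label k8s-canary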
m
Ideally we would want confirmation that the relevant code has been improved in later Prefect versions to properly resolve this issue. This is a hard-to-reproduce bug given it doesn't happen too frequently, so it's best to pinpoint why a Prefect upgrade would resolve this.
a
So I'm not suggesting upgrading directly, but rather spinning up a new agent process as an extra and testing whether this fixes the issue.
Unfortunately, I can't give you a confirmation that an upgrade will directly fix it; it depends on many variables.
m
Sorry, I was wrong about the error occurring in _mark_flow_as_submitted - the error is clearly happening in deploy_flow, here: https://github.com/PrefectHQ/prefect/blob/0.14.22/src/prefect/agent/kubernetes/agent.py#L418
This is the function call that is failing:
Copy code
self.batch_client.create_namespaced_job(
    namespace=self.namespace, body=job_spec
)
This issue was first reported as a bug here: https://github.com/PrefectHQ/prefect/issues/3278. Retry logic was added to resolve it following that report, but it seems that in our case the retries are not always enough.
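For illustration, this is the kind of retry wrapper one could put around that call - my own sketch, not the code Prefect ships, and the attempt counts/delays are made up:
Copy code
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError

config.load_incluster_config()  # the agent runs inside the cluster
batch_client = client.BatchV1Api()


def create_job_with_retries(namespace, job_spec, attempts=5, delay=5):
    """Retry create_namespaced_job on API/connection errors before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return batch_client.create_namespaced_job(
                namespace=namespace, body=job_spec
            )
        except (ApiException, HTTPError):
            if attempt == attempts:
                raise
            time.sleep(delay)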
Do you know if increasing the resources allocated to the agent would help in this regard?
a
Hard to say, but it seems worth trying.