Hi. I keep getting flows failing with `HTTPSCon...
# prefect-community
t
Hi. I keep getting flows failing with `HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)`. I'm running a k8s agent. What is weird is that it does not happen every time, but it does happen at a regular frequency (the picture shows how the same flow fails only every second time).
It might be related to the job manager, because according to this, `self.batch_client.list_namespaced_job` might be causing the issue: http://mail-archives.apache.org/mod_mbox/airflow-commits/201911.mbox/%3CJIRA.13270065.1574417390000.198297.1574432460155@Atlassian.JIRA%3E
j
Hi @Thomas Hoeck - Thanks for the question. I'll check with the wider team to see if they have any insight. Hopefully others can chime in here if they've seen anything similar.
t
Thank you @Jenny 🙂
j
Hi Thomas, the prefect agent doesn't start a watch, so that's unlikely to be it, but it may be related. It's curious that you're seeing failed flow runs from this; errors at the k8s level shouldn't (necessarily) bubble up to be flow run failures. You say you're using the k8s agent. What prefect `Environment` are you using? What version of prefect? Do you see things in the agent or flow run logs that might be useful in diagnosing?
t
@Jim Crist-Harif I'm using LocalEnvironment. I'll quickly see what logs I can dig out. I'm running prefect 0.13.5.
This is from the agent's logs:
j
Is the agent running prefect 0.13.5 as well?
Thanks, those logs are helpful
t
@Jim Crist-Harif Yes, the agent is running 0.13.5 as well.
@Jim Crist-Harif I can also add that it looks like this error either causes or is caused by the containers never starting, i.e. the containers for the failed flows never started.
j
Ok, so the error here appears to be a timeout on POST to the k8s api. We'll need to redo the error handling here to support retries in the agent. I see you've opened an issue for this, I'll comment there, and we can continue the work in the repo.
t
@Jim Crist-Harif Makes sense, because if I restart the run it works as expected.
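For anyone landing on this thread later: the retry handling discussed above could look roughly like the sketch below. This is not how the prefect agent actually implements it. `call_with_retries`, `FlakyApi`, and `create_job` are hypothetical names made up for illustration, and the fake API simply fails every second call to mimic the pattern described in the thread. With the real Kubernetes client, the exception to retry on would likely be urllib3's read timeout error rather than the built-in `TimeoutError` used here.

```python
import random
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on the given exception types.

    Waits base_delay * 2**attempt seconds (plus a little jitter) between
    attempts and re-raises the last error once retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay / 10))


class FlakyApi:
    """Hypothetical stand-in for an API client whose POST times out
    on every second call (1st, 3rd, 5th, ...)."""

    def __init__(self):
        self.calls = 0

    def create_job(self, name):
        self.calls += 1
        if self.calls % 2 == 1:
            raise TimeoutError("Read timed out. (read timeout=None)")
        return f"job/{name} created"


if __name__ == "__main__":
    api = FlakyApi()
    # The first attempt times out, the retry succeeds.
    print(call_with_retries(api.create_job, "demo",
                            retry_on=(TimeoutError,), base_delay=0.1))
```

One caveat with this approach: a timed-out POST may still have been processed by the server, so retrying a create call is only safe if a duplicate request is harmless or gets rejected (for example, because the job name is fixed and already taken).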