
Deceivious

07/20/2023, 5:41 PM
Hi, I am noticing a weird bug. I see many of my flow runs in a PENDING state. I have Kubernetes infra. Looking at the pod list, the pods associated with those flow runs are in an ERROR state, and the logs show a "Temporary failure in name resolution" error message. Why are the flow runs in PENDING state and not CRASHED?
The code has not even entered the flow function yet; the error originates in the import block of the file containing the flow.
from prefect import flow

import stuff  # error here: this fails at import time, before the flow function ever runs

@flow
def my_flow():
    ...

Jake Kaplan

07/20/2023, 6:33 PM
Are you using a Kubernetes infrastructure block or a Kubernetes Worker?

Deceivious

07/20/2023, 6:33 PM
Infrastructure block.
The Kubernetes agent is deployed using the Helm chart provided by Prefect.

Jake Kaplan

07/20/2023, 6:34 PM
To mark a flow run as crashed, the agent/worker has to be able to get a response back about the state of the infra. It's possible the "Temporary failure in name resolution" is stopping that status from coming back?

Deceivious

07/20/2023, 6:35 PM
And when does the first report of status occur?
According to the docs, PENDING means "The run has been submitted to run, but is waiting on necessary preconditions to be satisfied." Not sure what the preconditions are.

Jake Kaplan

07/20/2023, 6:39 PM
After the job is created, it's checked for completion via watch.stream().
Do you have agent logs? It might help.
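For reference, this is roughly how a Kubernetes Job can be watched with the official Python client's watch.stream(). It's a sketch, not Prefect's actual agent code, and the namespace and job name below are placeholders:

from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
batch_v1 = client.BatchV1Api()

w = watch.Watch()
# The k8s API server pushes Job updates over this stream (event-driven rather
# than fixed-interval polling); the loop ends when the Job succeeds or fails.
for event in w.stream(
    batch_v1.list_namespaced_job,
    namespace="prefect2-worker-prod",            # placeholder namespace
    field_selector="metadata.name=my-flow-job",  # placeholder job name
):
    job = event["object"]
    if job.status.succeeded:
        print("Job completed successfully")
        w.stop()
    elif job.status.failed:
        print("Job failed; a non-zero exit code would be reported")
        w.stop()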

Deceivious

07/20/2023, 6:40 PM
The temporary failure in name resolution is not a Prefect issue.
Basically my code has a global variable whose value is fetched from Azure Key Vault. That's where the error is occurring.
I would still expect the status on Prefect to be Failed or Crashed.
It's been an hour 😄
Is watch.stream() an interval-polling system or an event-based trigger system?

Jake Kaplan

What's your container's exit code?

Deceivious

07/20/2023, 6:45 PM
Let me try describing the pod.

Jake Kaplan

07/20/2023, 6:50 PM
What I think is happening is:
• your pod is crashing from the import-ish error
• the agent can't get the error code back from the infra due to the k8s networking error
• the agent will only crash your flow if it gets a non-zero status code back from k8s
You may see a log message like:
f"An error occured while monitoring flow run '{flow_run.id}'. The flow run will not be marked as failed, but an issue may have occurred."
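In other words, the agent-side policy is roughly the following. This is only an illustrative sketch of the behavior described above, not the actual Prefect agent source; the function and its callable parameters are hypothetical:

import logging

logger = logging.getLogger(__name__)

def monitor_flow_run(flow_run_id, get_job_exit_code, mark_crashed):
    # get_job_exit_code: callable that watches the k8s Job (e.g. via watch.stream())
    #                    and returns the container's exit code.
    # mark_crashed: callable that proposes a CRASHED state for the flow run.
    try:
        exit_code = get_job_exit_code()
    except Exception:
        # DNS / networking failures land here: the agent cannot tell what happened,
        # so it only logs a warning and leaves the flow run state (e.g. PENDING) alone.
        logger.warning(
            f"An error occured while monitoring flow run '{flow_run_id}'. "
            "The flow run will not be marked as failed, but an issue may have occurred."
        )
        return
    if exit_code != 0:
        # Only a non-zero exit code that actually makes it back to the agent
        # results in the flow run being marked as crashed.
        mark_crashed(exit_code)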

Deceivious

07/20/2023, 6:51 PM
Exit code is 1
State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 20 Jul 2023 17:23:36 +0000
      Finished:     Thu, 20 Jul 2023 17:24:35 +0000
    Ready:          False

Jake Kaplan

07/20/2023, 6:52 PM
gotcha
if that exit code was getting back to the agent, you're correct, it would report crashed

Deceivious

07/20/2023, 6:52 PM
Is this expected tho?
or even intended

Jake Kaplan

07/20/2023, 6:52 PM
I'd look for the log message I sent above to confirm
but it is expected that if there is an unexpected exception while monitoring, the flow run would not be crashed
for example, if you fixed the import error

Deceivious

07/20/2023, 6:53 PM
Is that error message on the agent or the server?

Jake Kaplan

07/20/2023, 6:53 PM
the agent would not want to crash your flow just because it couldn't reach k8s
it would be in the agent logs

Deceivious

07/20/2023, 6:55 PM
Well that's not going to be easy to find because I have 10 agents running 😄
on different machines

Jake Kaplan

07/20/2023, 6:57 PM
😅
if it was a worker, you would be able to see it in the prefect logs for that flow run I believe

Deceivious

07/20/2023, 6:58 PM
from prefect import flow

try:
    import stuff
except Exception as err:
    import_error = err  # keep the exception object; `err` itself is cleared after the except block
else:
    import_error = None

@flow
def a():
    if import_error is not None:
        raise RuntimeError(f"Import failed: {import_error}")
👀
It's an agent, not a worker. We haven't moved to workers yet because there are some issues regarding permissions on user scope in the Helm deployments.

Jake Kaplan

07/20/2023, 7:01 PM
gotcha, makes sense

Deceivious

07/20/2023, 7:02 PM
But wouldn't it make sense to flag a flow run as failed after X number of attempts?
I see around 5 pods were created, so it did make 5 attempts and just gave up after that.
I manually cancelled the flow run and restarted it. Now it's complete.
This is painful, as it blocks work queues with limited concurrency,
technically stopping the pipeline.
Or do PENDING flow runs not take a slot on work queues?

Jake Kaplan

07/20/2023, 7:06 PM
If everything is working as intended (without the network issue, I'm guessing), the status code would be reported as 1 and the flow run would crash, I believe.
PENDING, RUNNING and CANCELLING count as in progress towards a work queue.
While monitoring, the agent operates on the policy that not being able to reach the container is not enough to crash the flow. If the import error didn't exist, for example, the flow run would finish fine even though the agent could not monitor it, if that makes sense.
I don't believe you're using Cloud if I remember right, but that is certainly a case an automation could handle (in PENDING for X minutes).
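On a self-hosted server, a small script against the Prefect 2.x client can approximate that automation. This is only a sketch: the threshold, the function name, and the exact import paths (which vary between Prefect 2.x releases) are assumptions:

import asyncio
from datetime import timedelta

import pendulum
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed

MAX_PENDING = timedelta(minutes=30)  # hypothetical threshold

async def crash_stale_pending_runs():
    async with get_client() as client:
        pending = FlowRunFilter(
            state=FlowRunFilterState(type=FlowRunFilterStateType(any_=[StateType.PENDING]))
        )
        for run in await client.read_flow_runs(flow_run_filter=pending):
            if pendulum.now("UTC") - run.created > MAX_PENDING:
                # Force the transition since no engine process owns this run.
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Crashed(), force=True
                )

asyncio.run(crash_stale_pending_runs())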

Deceivious

07/20/2023, 7:09 PM
I'd expect any exception raised by the script I wrote to have been marked as Failed.

Jake Kaplan

07/20/2023, 7:10 PM
do you have logs from the flow run container? depending on the error maybe it should be crashed? it depends on when/how it's happening

Deceivious

07/20/2023, 7:10 PM
The Prefect server and the Prefect agents are running on the same machines that have the connection error as well. Like I said, it's not a termination of the connection between the server and the agent, but rather the connection to Azure services.
Do you mean the logs from the pod?

Jake Kaplan

07/20/2023, 7:11 PM
yes

Deceivious

07/20/2023, 7:11 PM
Yes I do.
Oof, it just got cleared.

Jake Kaplan

07/20/2023, 7:16 PM
ahhh darn.
depending on what's happening on the container it could report your flow as failed/crashed directly, but it's hard to tell without logs
I guess I was assuming the container wasn't able to even begin the entrypoint

Deceivious

07/20/2023, 7:17 PM
Found 1
kubectl logs  piquant-teal-jh567-dlpgh -n prefect2-worker-prod
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/app/my_library/__init__.py", line 5, in <module>
      ======CODE REDACTED=====
      (but the my_library/__init__.py import is calling the azure core functions below)

  File "/app/.venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/azure/keyvault/secrets/_client.py", line 72, in get_secret
    bundle = self._client.get_secret(
  File "/app/.venv/lib/python3.10/site-packages/azure/keyvault/secrets/_generated/_operations_mixin.py", line 1640, in get_secret
    return mixin_instance.get_secret(vault_base_url, secret_name, secret_version, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/azure/keyvault/secrets/_generated/v7_4/operations/_key_vault_client_operations.py", line 760, in get_secret
    pipeline_response: PipelineResponse = self._client._pipeline.run(  # pylint: disable=protected-access
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 202, in run
    return first_node.send(pipeline_request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_redirect.py", line 156, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 470, in send
    raise err
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_retry.py", line 448, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/policies/_authentication.py", line 113, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
  [Previous line repeated 1 more time]
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/_base.py", line 101, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/app/.venv/lib/python3.10/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 364, in send
    raise error
azure.core.exceptions.ServiceRequestError: <urllib3.connection.HTTPSConnection object at 0x7f56dc3fc670>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

Jake Kaplan

07/20/2023, 7:21 PM
what storage are you using? is the flow code just in the image or it's remote?
d

Deceivious

07/20/2023, 7:22 PM
It's in an image.
We basically build an image, push it to ACR, and create a Prefect deployment pointing to that image.

Jake Kaplan

07/20/2023, 7:41 PM
Is there a traceback around that that is from Prefect code? Or any Prefect logs? If not, I don't think the Python process is getting off the ground. So the Prefect process on that container never starts and doesn't have an opportunity to crash the flow; the agent would crash this if the error code made its way fully back.
How consistent are those DNS errors? I'd expect that to only happen some of the time, and other times it should crash?
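That matches the traceback above: the secret is fetched at module import, before any Prefect engine code runs. A rough sketch of the difference (the vault URL and secret name are placeholders, and this is an illustration rather than the code from this thread):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from prefect import flow, get_run_logger

# Module level: this runs when the file is imported. In this thread the import
# appears to happen before the Prefect engine starts, so a DNS failure here
# kills the pod with exit code 1 and no state is ever reported back.
# SECRET = SecretClient(
#     vault_url="https://example-vault.vault.azure.net",  # placeholder
#     credential=DefaultAzureCredential(),
# ).get_secret("my-secret")

@flow
def my_flow():
    # Inside the flow: the same failure is caught by the Prefect engine and the
    # run ends in a Failed state instead of sitting in PENDING.
    secret = SecretClient(
        vault_url="https://example-vault.vault.azure.net",  # placeholder
        credential=DefaultAzureCredential(),
    ).get_secret("my-secret")
    get_run_logger().info("fetched secret %s", secret.name)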

Deceivious

07/20/2023, 7:42 PM
That's the entire log.
This is the first time in 6 months that I've seen the DNS error.

Jake Kaplan

07/20/2023, 7:44 PM
if you're able to see this happen consistently with a reproducible example and record the agent logs it would be great if you could file an issue for it!

Deceivious

07/20/2023, 7:45 PM
I think that should be easy. I'll get to it when I can.
🙌 1
Thanks for the help.