
Noam polak

02/22/2022, 7:01 AM
Hey dear community. We recently hit an issue when triggering 3 instances of the same parent flow, each of which triggers child flows (it's not the common use case, but we wanted to test this case too). We get errors from prefect-server that look like this; I will write them in the thread. What could be the cause? Thanks
error message num 1:
Task 'some_child_flow': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/prefect/tasks/prefect/flow_run.py", line 260, in wait_for_flow_run
    for log in watch_flow_run(
  File "/usr/local/lib/python3.10/site-packages/prefect/backend/flow_run.py", line 94, in watch_flow_run
    flow_run = flow_run.get_latest()
  File "/usr/local/lib/python3.10/site-packages/prefect/backend/flow_run.py", line 424, in get_latest
    return self.from_flow_run_id(
  File "/usr/local/lib/python3.10/site-packages/prefect/backend/flow_run.py", line 576, in from_flow_run_id
    flow_run_data = cls._query_for_flow_run(where={"id": {"_eq": flow_run_id}})
  File "/usr/local/lib/python3.10/site-packages/prefect/backend/flow_run.py", line 618, in _query_for_flow_run
    result = client.graphql(flow_run_query)
  File "/usr/local/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'request to http://prefect-hasura.prefect:3000/v1alpha1/graphql failed, reason: connect ECONNREFUSED 10.10.11.202:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['flow_run'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-hasura.prefect:3000/v1alpha1/graphql failed, reason: connect ECONNREFUSED 10.10.11.202:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]

error message num 2:
Failed to set task state with error: ClientError([{'message': 'Unable to complete operation. An internal API error occurred.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Unable to complete operation. An internal API error occurred.'}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.10/site-packages/prefect/client/client.py", line 1922, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'message': 'Unable to complete operation. An internal API error occurred.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Unable to complete operation. An internal API error occurred.'}}}]

Anna Geller

02/22/2022, 10:14 AM
The internal server error is a generic message from an API call that failed; in this case it tells us that Prefect couldn't set task run states. There can be many reasons for this, so it's hard to say. How do you run your Server? Did this flow run execute on Kubernetes? Did you use Dask? It could be that a Kubernetes pod or a Dask worker/client somehow died, and therefore Prefect Server couldn't infer the task run state to be set. Is any of this a long-running job? What is this flow doing? What happens if you start a new flow run: does it complete now (i.e. was it a transient issue)?

Noam polak

02/22/2022, 1:02 PM
Hey Anna. We run Prefect on Kubernetes. Each "parent-flow" creates 3-4 child flows and uses their results. Each parent flow can last 1-4 hours. It only happens when we try to run multiple "parent-flow" instances, and it is resolved when I restart them.

Khen Price

02/23/2022, 5:24 AM
@Anna Geller please note that @Noam polak described 2 different errors (we are on the same team): one is related to what you mentioned, the failure to set the task state, and the other is about a refused connection
{'message': 'request to http://prefect-hasura.prefect:3000/v1alpha1/graphql failed, reason: connect ECONNREFUSED 10.10.11.202:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
Does this give any clue on how to debug?
Also, would you suggest avoiding the flow-of-flows approach and using a simpler flow with tasks instead, or should we expect a flow of flows to be handled well by the system?

Anna Geller

02/23/2022, 9:31 AM
The flow-of-flows orchestration pattern is fully supported and works well, so nothing speaks against it. The connection refused error may be a networking issue on Kubernetes, hard to say. What Prefect version do you use on Server vs. in your Kubernetes job template?
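One way to narrow down a possible networking issue is to check whether the Hasura endpoint from the error message accepts TCP connections at all. The following is only a minimal sketch using the Python standard library; the helper name `can_connect` is hypothetical, and it assumes the probe runs from a pod in the same cluster as the flow runs so that the service name resolves.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers ConnectionRefusedError (ECONNREFUSED),
        # timeouts, and DNS resolution failures.
        return False
```

Calling `can_connect("prefect-hasura.prefect", 3000)` repeatedly while several parent flows are running would show whether the refusals are constant or only appear under load.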

Noam polak

02/23/2022, 9:41 AM
Prefect Core Version: 0.15.0
And our job templates for the flows are: apiVersion: batch/v1

Anna Geller

02/23/2022, 9:47 AM
I was only asking about the Prefect version you use on your: 1. Server 2. Agent 3. Flow, e.g. set in KubernetesRun as image

Noam polak

02/23/2022, 10:05 AM
server - 2021.07.06 agent - 0.15.0 flow - 0.15.13
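A quick way to spot this kind of drift is to compare the Core versions numerically. This is a minimal sketch with hypothetical helper names (`parse_version`, `find_mismatches`); it only compares the agent and flow values from this message, on the assumption that the server value uses a different, date-like numbering and is not directly comparable.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.15.13' into (0, 15, 13) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_mismatches(versions: Dict[str, str], reference: str) -> List[str]:
    """Return component names whose version differs from the reference component."""
    expected = parse_version(versions[reference])
    return [
        name
        for name, version in versions.items()
        if name != reference and parse_version(version) != expected
    ]


# Values reported in this thread.
core_versions = {"agent": "0.15.0", "flow": "0.15.13"}
print(find_mismatches(core_versions, reference="agent"))  # ['flow']
```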

Anna Geller

02/23/2022, 10:33 AM
I was asking about the versions because a Prefect version mismatch could be the cause: the agent may be using API routes that differ from the ones your Server version exposes. The INTERNAL_SERVER_ERROR looks like a transient error to me. Perhaps Server was overloaded with too many requests at once and couldn't set task run states? Regarding the ECONNREFUSED, you can follow up on this thread; it discussed the same error message, so it is potentially a similar issue
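If the failures really are transient, a generic mitigation is to retry the failing call with exponential backoff. The sketch below is not Prefect's own retry mechanism; `retry_with_backoff` and its parameters are hypothetical names, and in a flow the exception to retry on would presumably be the `prefect.exceptions.ClientError` seen in the tracebacks rather than the built-in `ConnectionError` used here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.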