Robin
10/08/2020, 9:27 PM
Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 3417baee-44a2-4b39-82f4-c6ac6d073d1e: provided a running state but associated flow run 51cc335d-f029-45c6-80b4-8c88a0173dbc is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 128, in call_runner_target_handlers
    cache_for=self.task.cache_for,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1321, in set_task_run_state
    version=version,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 294, in graphql
    raise ClientError(result["errors"])
prefect.utilities.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 3417baee-44a2-4b39-82f4-c6ac6d073d1e: provided a running state but associated flow run 51cc335d-f029-45c6-80b4-8c88a0173dbc is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
```
Has anyone already experienced this?
How to debug it? 😕
Chris White
Since your flow run is not in a Running state, your task runs cannot enter a Running state either. Maybe check your logs - something caused your Flow Run to “finish” while you still had outstanding task runs that hadn’t started yet.
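One quick way to check that from the client side (a minimal sketch assuming the Prefect 0.13.x Client API; the flow run ID is taken from the error above):
```python
from prefect import Client

# sketch: look up the flow run that rejected the task state update
client = Client()
info = client.get_flow_run_info("51cc335d-f029-45c6-80b4-8c88a0173dbc")

# anything other than a Running state here explains the rejected task states
print(info.state)
```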
Robin
10/08/2020, 9:34 PM
We previously saw an issue with snowflake-sqlalchemy 1.2.3 which disappeared when upgrading to 1.2.4, so I wanted to hint at the possibility that the snowflake dependencies might cause problems.
Please find attached the logs of the last successful and first failing tasks.
How can I dig deeper?
last successful log: https://cloud.prefect.io/accure/flow-run/51cc335d-f029-45c6-80b4-8c88a0173dbc?logId=cd2f7b79-a3ca-4486-9f84-391f8b29d4d9
first failing log:
https://cloud.prefect.io/accure/flow-run/51cc335d-f029-45c6-80b4-8c88a0173dbc?logId=cd2f7b79-a3ca-4486-9f84-391f8b29d4d9
Chris White
Robin
10/08/2020, 9:39 PM
Chris White
"2020-10-08T16:52:40.096577+00:00"
and your task attempted to set its final state one second later at "2020-10-08T16:52:41.709352+00:00"
which is really interestingRobin
Robin
10/08/2020, 9:43 PM
```python
flow.environment = DaskKubernetesEnvironment(
    min_workers=1, max_workers=10, labels=["k8s"]
)

# register flow on AWS ECR
module_dir = path.dirname(path.dirname(path.abspath(__file__)))
flow.storage = Docker(
    python_dependencies=[
        "numpy",
        "pandas",
        "snowflake-connector-python[pandas]==2.3.2",
        "snowflake-sqlalchemy>=1.2.4",
        "tqdm",
    ],
    registry_url="asdasd.dkr.ecr.eu-central-1.amazonaws.com",
    image_name="fetch_flow",
    image_tag="beta_"
    + datetime.now().strftime("%Y%m%d_%H%M%S"),  # unique tag avoids AWS caching
    files={module_dir: "/modules/accure_analytics"},
    extra_dockerfile_commands=["RUN pip install -e /modules/accure_analytics"],
)
flow.register(project_name="eks_test_01")
# flow.visualize()
```
task concurrency = 10
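As an aside, Cloud task concurrency limits are applied per task tag, so a "task concurrency = 10" limit would apply to the tasks carrying the limited tag; a sketch with an illustrative tag name (not Robin's actual code):
```python
from prefect import task

# sketch: the concurrency limit applies to every task carrying this tag
@task(tags=["snowflake"])
def copy_table(name: str):
    print(f"copying {name}")
```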
Chris White
Your flow run entered a Failed state - something happens in your dask cluster at shutdown, causing dask to rerun some of the task futures even though they’ve already finished, causing these noisy error logs but not actually doing any additional work.
Robin
10/08/2020, 9:49 PM
Chris White
Robin
10/08/2020, 9:53 PM
Chris White
Robin
10/08/2020, 9:56 PM
Dylan
Robin
10/08/2020, 10:03 PM
prefect.config.logging.level = "DEBUG"
Does this work? Does DEBUG logging work for prefect cloud?
Chris White
PREFECT__LOGGING__LEVEL="DEBUG"
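That environment variable could, for example, be baked into the flow's Docker storage so the flow-run containers log at DEBUG; a sketch extending the storage definition shown earlier (env_vars is a Docker storage argument, registry/image names copied from above, `flow` is the Flow object built in that snippet):
```python
from prefect.environments.storage import Docker

# sketch: env vars baked into the image are picked up by the flow runner at run time
flow.storage = Docker(
    registry_url="asdasd.dkr.ecr.eu-central-1.amazonaws.com",
    image_name="fetch_flow",
    env_vars={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)
```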
Robin
10/08/2020, 10:34 PM
Chris White
Robin
10/08/2020, 10:43 PM
Chris White
Robin
10/08/2020, 10:48 PM
When running flow.register() we get the following warnings:
UserWarning: No result handler was specified on your Flow. Cloud features such as input caching and resuming
task runs from failure may not work properly.
no_url=no_url,
and
/opt/prefect/healthcheck.py:149: UserWarning: Task <Task: create_or_update_copy_progress_table> has retry settings but some upstream dependencies do not have result types. See <https://docs.prefect.io/core/concepts/results.html> for more details.
result_check(flows)
/opt/prefect/healthcheck.py:149: UserWarning: Task <Task: copy_batteryconfiguration> has retry settings but some upstream dependencies do not have result types. See <https://docs.prefect.io/core/concepts/results.html> for more details.
result_check(flows)
Could that cause the trouble?
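For reference, a minimal self-contained sketch of how results could be attached to silence those warnings (LocalResult and the task names here are illustrative, not from Robin's flow; any Result type such as S3Result works the same way):
```python
from datetime import timedelta

from prefect import Flow, task
from prefect.engine.results import LocalResult

# sketch: the upstream task persists its output as a result...
@task(result=LocalResult(dir="/tmp/prefect-results"))
def extract():
    return [1, 2, 3]

# ...so a downstream task with retry settings can recover its inputs on retry
@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load(rows):
    print(f"loading {len(rows)} rows")

# a flow-level result acts as the default for all tasks
with Flow("results-sketch", result=LocalResult(dir="/tmp/prefect-results")) as flow:
    load(extract())
```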
Raphaël Riel
10/09/2020, 12:27 PM
Robin
10/09/2020, 1:16 PM
Raphaël Riel
10/14/2020, 7:58 PM
Robin
10/14/2020, 9:52 PM
Chris White
Robin
10/14/2020, 11:35 PM
We tried switching to flow.run_config, however the parallelization with flow.run_config and flow.executor = DaskExecutor() did not work for us. Therefore, we finally switched back to flow.environment = DaskKubernetesEnvironment(...) and run the flow now in small batches to see whether this reduces the failing tasks and flows, however without debug-level logging. This is also a somewhat desperate attempt to at least start running the tasks for a larger number of systems successfully.
To summarize, the open issues are:
1. Failed to set task state with error: ClientError
2. flow runs failed before running all tasks
3. tasks don't parallelize with flow.run_config and flow.executor = DaskExecutor() (sketch of this setup after the list)
4. debug level logging not available for DaskKubernetesEnvironment on AWS EKS
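A sketch of the run_config + executor combination from point 3 (assuming Prefect 0.13.x import paths; this mirrors the setup reported as not parallelizing in issue 3510 linked below, not a fix, and `flow` is the Flow built earlier):
```python
from prefect.engine.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

# sketch: run config replaces DaskKubernetesEnvironment for flow-run submission
flow.run_config = KubernetesRun(labels=["k8s"])
# with no address, DaskExecutor spins up a temporary local dask cluster in the job pod
flow.executor = DaskExecutor()
```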
Chris White
Robin
10/14/2020, 11:52 PM
Chris White
Robin
10/15/2020, 10:02 AM
Failed to set task state with error: ClientError:
• https://cloud.prefect.io/accure/flow-run/81060a1b-8e15-4c8d-9672-4da86a64f489?logId=7216403e-8741-4c82-90b7-7103d4908d12
• https://cloud.prefect.io/accure/flow-run/9bc734ad-805f-4904-aa95-88b1d81426a4?logId=daa523a9-1e39-4f0f-b412-5d5a225a47aa
Flow runs failed before running all tasks:
• https://cloud.prefect.io/accure/flow-run/81060a1b-8e15-4c8d-9672-4da86a64f489?logId=7216403e-8741-4c82-90b7-7103d4908d12
• https://cloud.prefect.io/accure/flow-run/9bc734ad-805f-4904-aa95-88b1d81426a4?logId=daa523a9-1e39-4f0f-b412-5d5a225a47aa
• https://cloud.prefect.io/accure/flow-run/51cc335d-f029-45c6-80b4-8c88a0173dbc?logId=e013e144-045f-413f-aedb-64804a7e2338
Currently (in the last ~7 days) it is hard to identify those flows, as the schematic view does not work for most flows and the UI is a little unresponsive in general.
Debug level logging not available for DaskKubernetesEnvironment on AWS EKS, see: https://github.com/PrefectHQ/prefect/issues/3509
Tasks don't parallelize with flow.run_config and flow.executor = DaskExecutor(), see: https://github.com/PrefectHQ/prefect/issues/3510
Dylan
```python
run_config = KubernetesRun(cpu_limit=2, cpu_request=2)
executor = LocalDaskExecutor(num_workers=4)
```
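Presumably wired onto the flow along these lines (a sketch; the cpu_limit/cpu_request values and LocalDaskExecutor(num_workers=4) are Dylan's suggestion above, import paths assumed for 0.13.x, `flow` is the Flow from earlier in the thread):
```python
from prefect.engine.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun

# sketch: request/limit CPU on the flow-run job pod...
flow.run_config = KubernetesRun(cpu_limit=2, cpu_request=2)
# ...and parallelize with local dask threads instead of a separate dask cluster
flow.executor = LocalDaskExecutor(num_workers=4)
```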
Robin
10/15/2020, 4:45 PM
Dylan
Chris White
Robin
10/15/2020, 5:50 PM
Regarding Failed to set task state with error: ClientError: so it looks like this issue is the common issue of all those flows that abort prematurely, right?
Regarding flows that failed before running all tasks: on the prefect side it boils down to the following question: why is the flow run state set to failed? Some feedback from the dask cluster has to trigger this event, right? 🤔
Does prefect provide a means to analyze why prefect sets the state to failed? If not yet, it would be great if prefect forwarded information about the related events that trigger the flow state change.
Apart from this, I guess that the dask cluster logs will tell us more about what triggers prefect to set the flow state to failed? We also wonder whether coiled avoids or simplifies these issues. Do you already recommend ways of how to set up prefect agents with coiled?
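One way to dig into that from the API side is to pull the flow run's state history, including the state messages, via GraphQL; a sketch that assumes the flow_run_state table and fields exposed by the Prefect Cloud/Server schema (flow run ID taken from the links above):
```python
from prefect import Client

client = Client()
# sketch: list every state this flow run went through, with timestamps and messages
result = client.graphql(
    """
    query {
      flow_run_state(
        where: { flow_run_id: { _eq: "81060a1b-8e15-4c8d-9672-4da86a64f489" } }
        order_by: { timestamp: asc }
      ) {
        timestamp
        state
        message
      }
    }
    """
)
print(result)
```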
Chris White
Two things could be going on here:
- your dask workers are being killed (for example, in flow run 81060a1b-8e15-4c8d-9672-4da86a64f489 there’s a clear KilledWorker error)
- the dask cluster still (partially) crashes, but in a way that the scheduler simply releases all work. This won’t necessarily result in an error in the flow runner process but would still result in incomplete task runs.
Either way, this is very much a dask resource issue. You can use coiled directly without new agents or anything: the coiled package has a Cluster object that you can feed directly into the DaskExecutor when configuring your Flow. As long as your coiled credentials are present (possibly in the docker image you run your flows with) then it will “just work” and create a cluster in coiled land and farm your tasks out to it.
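A minimal sketch of that coiled setup (assuming DaskExecutor's cluster_class/cluster_kwargs arguments and a pre-built coiled software environment named "my-software-env"; this is not a configuration from the thread):
```python
import coiled
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def say_hi():
    print("hi from a coiled worker")

with Flow("coiled-sketch") as flow:
    say_hi()

# sketch: the executor creates the coiled cluster at flow-run time and tears it
# down afterwards; coiled credentials must be available to the flow runner process
flow.executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={"n_workers": 5, "software": "my-software-env"},
)
```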
Robin
10/15/2020, 7:24 PM
Chris White
Robin
10/15/2020, 9:27 PM
Chris White