Jeremy Phelps
09/27/2021, 6:45 PM
with Flow('my-flow') as f:
    things = get_things()
    process_thing.map(things)
...where each process_thing returns a value that we don't really need. When Prefect runs this flow, the second task shows that all the individual elements of things have been mapped, yet the flow run remains in the "Running" state. I'm using Prefect 0.14.22.
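(For reference: a fuller, self-contained sketch of the flow described above, since only the flow body was pasted; the task bodies are hypothetical placeholders.)

from prefect import Flow, task

@task
def get_things():
    # hypothetical placeholder: produce the iterable that gets mapped over
    return list(range(10))

@task
def process_thing(thing):
    # hypothetical placeholder: do some work; the return value isn't needed downstream
    return thing * 2

with Flow('my-flow') as f:
    things = get_things()
    process_thing.map(things)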
Kevin Kho
Jeremy Phelps
09/27/2021, 6:49 PMJeremy Phelps
09/27/2021, 6:49 PMKevin Kho
Jeremy Phelps
09/27/2021, 6:57 PMJeremy Phelps
09/27/2021, 7:01 PM
Kevin Kho
There is a gather step that collects the results in memory, so I think it's this gather step that might be hanging. Is the process_thing returning a large thing?
Jeremy Phelps
09/27/2021, 7:02 PM
gather works?
Jeremy Phelps
09/27/2021, 7:03 PM
Jeremy Phelps
09/27/2021, 7:05 PM
gather must have succeeded if its job is to get the results so they can be passed downstream.
Jeremy Phelps
09/27/2021, 7:05 PM
Kevin Kho
Prefect submits each task as a future to Dask with client.submit(). When you have a mapped task, it uses client.submit() on each one, and then collects them back with client.gather() to bring them back to the client. So the scheduler orchestrates this operation and could run out of memory if it has to collect a lot of big results.
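(A minimal sketch of the submit/gather pattern described above, using dask.distributed directly rather than Prefect internals; the worker function and inputs are placeholders.)

from dask.distributed import Client

def process_thing(thing):
    # placeholder work; whatever is returned here is what gather() pulls back
    return thing * 2

if __name__ == "__main__":
    client = Client()  # local cluster, for illustration only
    futures = [client.submit(process_thing, t) for t in range(10)]  # one future per mapped item
    results = client.gather(futures)  # all results are collected into this process's memory
    print(results)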
Kevin Kho
Jeremy Phelps
09/27/2021, 7:07 PMJeremy Phelps
09/27/2021, 7:07 PM
Succeeded.
Kevin Kho
Kevin Kho
Jeremy Phelps
09/27/2021, 7:17 PM
The prefecthq/prefect:0.14.22-python3.7 Docker image.
Jeremy Phelps
09/27/2021, 7:26 PM
Flow run 90259fef-86ba-4ddd-80be-ee636da5ef16 not found
Traceback (most recent call last):
File "/usr/local/bin/prefect", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/prefect/cli/execute.py", line 54, in flow_run
raise ValueError("Flow run {} not found".format(flow_run_id))
ValueError: Flow run 90259fef-86ba-4ddd-80be-ee636da5ef16 not found
I've debugged this error before. The Prefect client assumes that any GraphQL error can only be caused by the flow run not existing. The real error that the GraphQL server sent is thrown away.
Jeremy Phelps
09/27/2021, 7:28 PMKevin Kho
Jeremy Phelps
09/27/2021, 7:30 PMJeremy Phelps
09/27/2021, 7:30 PM
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: [2021-09-27 00:01:37,967] ERROR - agent | Error while managing existing k8s jobs
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: Traceback (most recent call last):
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: self.manage_jobs()
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/agent/kubernetes/agent.py", line 333, in manage_jobs
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: flow_run_id
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/client/client.py", line 1262, in get_flow_run_state
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: return prefect.engine.state.State.deserialize(flow_run.serialized_state)
Sep 27 00:01:37 prefect-agent agent-script.sh[87937]: AttributeError: 'NoneType' object has no attribute 'serialized_state'
Kevin Kho
Jeremy Phelps
09/27/2021, 7:33 PMJeremy Phelps
09/27/2021, 7:34 PMKevin Kho
Kevin Kho
Jeremy Phelps
09/27/2021, 7:38 PMJeremy Phelps
09/27/2021, 7:39 PMunbiased-flamingo demand-forecasting-delivery-scheduler Running 2 days ago 2021-09-27 05:00:13 a88c2062-6594-4976-8bbe-8112aac6610d
amber-tody demand-forecasting-delivery-scheduler Running 3 days ago 2021-09-27 05:00:14 f9ca5bdd-6e4e-4fa8-aa09-6b76f0a0d35d
unselfish-bug demand-forecasting-delivery-scheduler Running 3 days ago 2021-09-26 05:00:11 7ec91aeb-3f35-4c2f-8b99-feeed9ce7469
festive-mantis demand-forecasting-delivery-scheduler Running 3 days ago 2021-09-27 05:00:13 38b59b3f-6ee6-4160-a5cf-2ad69bf31042
impossible-giraffe demand-forecasting-delivery-scheduler Running 3 days ago 2021-09-26 05:00:11 49556f7c-5e84-4dfe-9d7b-9fdc823c9c9a
impressive-walrus demand-forecasting-delivery-scheduler Running 4 days ago 2021-09-25 05:00:10 4773b20c-a1c3-4f28-b94a-fbf2d32da504
sassy-cicada demand-forecasting-delivery-scheduler Running 4 days ago 2021-09-26 05:00:11 f798a5d0-64fc-422e-9ae8-0ea6978bbbea
curly-muskrat one_market_item_ingestion Running 5 days ago 2021-09-27 07:00:16 16ae6995-f029-4a32-994a-bedcb51406d6
Kevin Kho
Zach Angell
For a88c2062-6594-4976-8bbe-8112aac6610d, our system is showing several tasks still in a Pending state. For example, I see that a run of a task named insert_django_object_groups is still in a Pending state. Are you able to see any tasks still in a Pending state in the UI?
Jeremy Phelps
09/28/2021, 5:44 PMJeremy Phelps
09/28/2021, 5:46 PM
16ae6995-f029-4a32-994a-bedcb51406d6 is an example of a flow run where all the tasks show up as complete.
Jeremy Phelps
09/28/2021, 5:47 PMJeremy Phelps
09/28/2021, 5:50 PM
ded26dfd-19cf-46df-a24c-f23f735f5805 has a failed task, yet is still running, even though the flow is configured to fail if any task fails. It shouldn't waste resources attempting to run the remaining tasks to completion.
Jeremy Phelps
10/02/2021, 7:04 PM
a0c8f568-2986-4ee3-8d31-4e1e7048e63f is a new example of a flow run where all the tasks have completed, yet Prefect still marks the flow run as "Running". There is no task_run with a state other than Success or Mapped.
Kevin Kho
Jeremy Phelps
10/04/2021, 4:35 PMKevin Kho
Kevin Kho
For ded26dfd-19cf-46df-a24c-f23f735f5805, a Flow with a failed task can continue to run if there are still tasks that don't depend on the failed tasks. Are other tasks that depend on the failed task continuing to execute? For a0c8f568-2986-4ee3-8d31-4e1e7048e63f, the best guess right now is that the FlowRunner might be dying during your flow run, causing the Flow to hang. You mentioned that a downstream task continues. Curious if adding an empty "reduce" step makes this flow succeed? If it doesn't, that would indicate the FlowRunner is dying.
Jeremy Phelps
10/04/2021, 6:23 PM
> For ded26dfd-19cf-46df-a24c-f23f735f5805, a Flow with a failed task can continue to run if there are still tasks that don't depend on the failed tasks.
So I've noticed. But this behavior doesn't make any sense, since we'll ultimately have to regard the entire flow as having failed. All that is accomplished by completing the remaining tasks is to waste resources.
> For a0c8f568-2986-4ee3-8d31-4e1e7048e63f, the best idea right now is that the FlowRunner might be dying during your flow run causing the Flow to hang.
What is a FlowRunner and how do I stop it from dying?
Kevin Kho
The flow will still end in a FAILED state. If a task can be executed independently of other tasks, it doesn't make sense to just fail them when they are independent. Either way, you do have the tools to control the behavior by making the dependency explicit. If the flow is still showing a SUCCESS state even if a task fails, that means that task is not a reference task, and you can set it with flow.reference_tasks = List[Tasks]
The FlowRunner is the Python process that submits the tasks for execution. It looks to be getting stuck somewhere; it can die like any Python process (memory or CPU shortage).
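(A small sketch of the two knobs mentioned above, with hypothetical tasks a and b: upstream_tasks makes the dependency explicit, and reference_tasks controls which tasks determine the flow run's final state.)

from prefect import Flow, task

@task
def a():
    return 1

@task
def b():
    return 2

with Flow("example") as flow:
    result_a = a()
    # no data flows from a to b, but the explicit dependency means
    # b will not run (it gets TriggerFailed) if a fails
    result_b = b(upstream_tasks=[result_a])

# only these tasks determine whether the flow run ends Success or Failed
flow.reference_tasks = [result_a, result_b]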
Jeremy Phelps
10/04/2021, 6:35 PM
Jeremy Phelps
10/04/2021, 6:36 PMKevin Kho
Jeremy Phelps
10/04/2021, 6:37 PMJeremy Phelps
10/04/2021, 6:37 PM
Flow run 90259fef-86ba-4ddd-80be-ee636da5ef16 not found
Traceback (most recent call last):
File "/usr/local/bin/prefect", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/prefect/cli/execute.py", line 54, in flow_run
raise ValueError("Flow run {} not found".format(flow_run_id))
ValueError: Flow run 90259fef-86ba-4ddd-80be-ee636da5ef16 not found
Kevin Kho
Jeremy Phelps
10/04/2021, 6:41 PMJeremy Phelps
10/04/2021, 6:45 PM
a0c8f568-2986-4ee3-8d31-4e1e7048e63f ran, but this one does:
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: Traceback (most recent call last):
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: self.manage_jobs()
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/agent/kubernetes/agent.py", line 333, in manage_jobs
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: flow_run_id
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: File "/home/agent/.pyenv/versions/3.7.3/lib/python3.7/site-packages/prefect/client/client.py", line 1262, in get_flow_run_state
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: return prefect.engine.state.State.deserialize(flow_run.serialized_state)
Oct 1 05:21:11 prefect-agent agent-script.sh[87937]: AttributeError: 'NoneType' object has no attribute 'serialized_state'
Jeremy Phelps
10/04/2021, 6:46 PMJeremy Phelps
10/04/2021, 6:47 PMJeremy Phelps
10/04/2021, 6:48 PMKevin Kho
Jeremy Phelps
10/04/2021, 6:52 PMKevin Kho
Kevin Kho
Kevin Kho
1. The serialized_state error is a Cloud bug and needs to be fixed on our end.
2. The flow stuck in a Running state is likely having problems collecting results after a mapping executes.
3. For the flow run not found error: we could not find the flow run id. This may be because it was past the retention period. Not sure what would trigger that. Is the flow being registered or deleted mid-run?
Jeremy Phelps
10/05/2021, 7:16 PM
@execute.command(hidden=True)
def flow_run():
    """
    Execute a flow run in the context of a backend API.
    """
    flow_run_id = prefect.context.get("flow_run_id")
    if not flow_run_id:
        click.echo("Not currently executing a flow within a Cloud context.")
        raise Exception("Not currently executing a flow within a Cloud context.")

    query = {
        "query": {
            with_args("flow_run", {"where": {"id": {"_eq": flow_run_id}}}): {
                "flow": {"name": True, "storage": True, "run_config": True},
                "version": True,
            }
        }
    }

    client = Client()
    result = client.graphql(query)
    flow_run = result.data.flow_run

    if not flow_run:
        click.echo("Flow run {} not found".format(flow_run_id))
        raise ValueError("Flow run {} not found".format(flow_run_id))
Notice how it ignores any GraphQL error and just reports "not found" no matter what really happened. "Flow run not found" errors are intermittent.
Jeremy Phelps
10/05/2021, 7:18 PMKevin Kho
client.graphql does raise errors if there are other errors. That error message looks right.
There is nothing to do on our end about collecting results after a mapping. This is just distributed compute. Your driver node needs enough resources to pull the collection of the mapped task results into memory; otherwise the process can die.
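(One way to reduce that memory pressure, sketched under the assumption that downstream tasks don't need the full results: each mapped task persists its output and returns only a small reference, so the collection step only moves short strings. The local-file storage here is just a stand-in for S3/GCS/etc.)

import json
import os
import tempfile

from prefect import Flow, task

@task
def get_things():
    return list(range(100))

@task
def process_thing(thing):
    # stand-in for expensive work that produces a large result
    big_result = [thing] * 10_000
    # persist the result somewhere durable and return only a small reference,
    # so gathering the mapped results only brings back short strings
    path = os.path.join(tempfile.gettempdir(), f"result-{thing}.json")
    with open(path, "w") as fh:
        json.dump(big_result, fh)
    return path

with Flow("my-flow") as f:
    things = get_things()
    process_thing.map(things)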
Jeremy Phelps
10/05/2021, 7:30 PM
Jeremy Phelps
10/05/2021, 7:30 PMJeremy Phelps
10/05/2021, 7:31 PMJeremy Phelps
10/05/2021, 7:33 PMJeremy Phelps
10/05/2021, 7:37 PMKevin Kho
Kevin Kho
Jeremy Phelps
10/05/2021, 7:39 PMKevin Kho
Jeremy Phelps
10/05/2021, 7:39 PMKevin Kho
Kevin Kho
Jeremy Phelps
10/05/2021, 7:44 PMKevin Kho
Jeremy Phelps
10/05/2021, 7:45 PMJeremy Phelps
10/05/2021, 7:45 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:37 PM
MALLOC_TRIM_THRESHOLD_ defaults to 65536, and if that value is passed directly to malloc_trim(3), then that's in bytes.
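(For reference, a hedged sketch of one way to apply this with Prefect 0.14.x: set the variable in the flow-run environment via the run config so that the Dask worker processes spawned locally by a DaskExecutor inherit it; glibc reads the value, in bytes, when each process starts. The wiring below is illustrative, not the only way to do it.)

from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

flow = Flow("my-flow")
# illustrative value; glibc interprets it as a byte threshold for trimming freed memory
flow.run_config = KubernetesRun(env={"MALLOC_TRIM_THRESHOLD_": "65536"})
flow.executor = DaskExecutor()  # workers spawned inside the flow-run pod inherit the env var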
Kevin Kho
Jeremy Phelps
10/05/2021, 8:45 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:46 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:47 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:48 PMJeremy Phelps
10/05/2021, 8:49 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:49 PMJeremy Phelps
10/05/2021, 8:50 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:51 PMKevin Kho
Kevin Kho
Jeremy Phelps
10/05/2021, 8:53 PMKevin Kho
Jeremy Phelps
10/05/2021, 8:53 PM