Cameron Raynor

02/12/2023, 4:40 AM
Hi everyone, I'm getting an error in all of my flows when running from Prefect Cloud. After ~9 minutes 50 seconds (almost exactly), whichever task is executing within the currently running subflow gets terminated. I see an error like this:
Crash detected! Execution was cancelled by the runtime environment.

ERROR
Encountered exception during execution:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 1869, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/local/lib/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/usr/local/lib/python3.9/site-packages/prefect/engine.py", line 1613, in cancel_flow_run
    raise TerminationSignal(signal=signal.SIGTERM)
prefect.exceptions.TerminationSignal
The subflow then keeps running indefinitely despite having a failed task, and the main flow fails (it seems similar to https://github.com/PrefectHQ/prefect/issues/8481). This only happens once the flow has been running for 9 minutes and 50 seconds, and it hits whichever subflow/task is running at that time. Is there anything that might be causing this? Any help is greatly appreciated.
To clarify the point on the subflow running but the primary flow failing, I've attached what I see in Prefect Cloud. The primary flow (etl-flow) shows that it has failed, but the subflow (cleanup-flow) is still running and has been running longer than the primary flow of which it is a subflow.
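For reference, the flows are structured roughly like this (heavily simplified sketch; the task body is a placeholder):

from prefect import flow, task

@task
def cleanup_task():
    ...  # placeholder; whichever task happens to be running at ~9:50 is the one that crashes

@flow(name="cleanup-flow")
def cleanup_flow():
    cleanup_task()

@flow(name="etl-flow")
def etl_flow():
    cleanup_flow()  # subflow stays in Running even after etl-flow is marked Failed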

Ryan Peden

02/12/2023, 6:11 AM
Hi Cameron! It looks like this is happening because the default timeout for Cloud Run flows is 10 minutes: https://github.com/PrefectHQ/prefect-gcp/blob/44a280ea3bcfe47b9433cfacdfdba9cbff43e106/prefect_gcp/cloud_run.py#L285
Since Cloud Run lets jobs run for up to an hour, increasing the timeout value in your Cloud Run infrastructure block should help.
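For example, something along these lines should do it (the block, credentials, and image names below are placeholders; reuse whatever your existing block already has):

from prefect_gcp.cloud_run import CloudRunJob
from prefect_gcp.credentials import GcpCredentials

# Reuse your existing GCP credentials block (name here is an example)
credentials = GcpCredentials.load("my-gcp-creds")

# Re-save the Cloud Run infrastructure block with a longer timeout
cloud_run_job = CloudRunJob(
    image="us-docker.pkg.dev/my-project/my-repo/etl:latest",  # example image
    region="us-east1",
    credentials=credentials,
    timeout=3600,  # seconds; the default is 600, Cloud Run allows up to 3600
)
cloud_run_job.save("etl-cloud-run", overwrite=True)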

Cameron Raynor

02/12/2023, 4:57 PM
Thanks Ryan, that worked! For long-running flows, would I be better off using a Vertex AI Custom Training job?

Aleksandr Liadov

02/14/2023, 4:32 PM
@Ryan Peden Hello, I have the same issue as @Cameron Raynor. However, mine runs on k8s infrastructure and it doesn't depend on time... Traceback:
Crash details:
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 1637, in cancel_flow_run
    raise TerminationSignal(signal=signal.SIGTERM)
prefect.exceptions.TerminationSignal

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 1689, in report_task_run_crashes
    yield
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 1328, in begin_task_run
    state = await orchestrate_task_run(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 1418, in orchestrate_task_run
    resolved_parameters = await resolve_inputs(parameters)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 1758, in resolve_inputs
    return await run_sync_in_worker_thread(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
asyncio.exceptions.CancelledError

Or it could be:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/engine.py", line 650, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/opt/pysetup/.venv/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 165, in run_sync_in_interruptible_worker_thread
    assert result is not NotSet
AssertionError
Any ideas, or suggestions on where I should look?

Cameron Raynor

02/16/2023, 2:18 AM
Which infrastructure block are you using, @Aleksandr Liadov?

Aleksandr Liadov

02/16/2023, 8:11 AM
@Cameron Raynor k8s job

Cameron Raynor

02/16/2023, 7:50 PM
Unfortunately I'm not very familiar with k8s, but the error does look similar. In my case, it was caused by the job timing out due to the configuration of the Cloud Run job. My best guess is that something in your Kubernetes job config is causing the job to stop, which is sending the termination signal to Prefect.
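If it is a timeout, one thing worth checking (just a sketch; the block name below is an example) is whether your KubernetesJob block's manifest or settings put a deadline on the job:

from prefect.infrastructure import KubernetesJob

# Load the existing KubernetesJob infrastructure block (name is an example)
k8s_job = KubernetesJob.load("my-k8s-job")

# A spec.activeDeadlineSeconds in the job manifest makes Kubernetes SIGTERM the pod
print(k8s_job.job.get("spec", {}).get("activeDeadlineSeconds"))

# The block's watch timeouts are also worth a look
print(k8s_job.job_watch_timeout_seconds)
print(k8s_job.pod_watch_timeout_seconds)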