<@ULVA73B9P> I have a task with 1 retry specified....
# marvin-in-the-wild
m
@Marvin I have a task with 1 retry specified. When I cancel the parent or master flow, the flow goes to a cancelling status then the task restarts. How do I ensure that all the child flows actually cancel?
m
It appears that multiple factors can affect whether a child flow is properly cancelled when the parent flow is cancelled. According to the Prefect documentation, here are some key considerations:
1. Infrastructure Support: Ensure that the infrastructure block for the flow run supports cancellation.
2. Identifier Scope: The identifier scope must match when attempting to cancel a flow run. Otherwise, the worker will be unable to cancel the flow run.
3. Infrastructure Availability: If the associated infrastructure for the run cannot be found or has already been killed, the worker will mark the flow run as cancelled but may not enforce the cancellation.
4. Errors During Cancellation: If the worker encounters an unexpected error during cancellation, the flow run may or may not be cancelled. The worker will try again, and another worker may attempt cancellation.
Enhanced Cancellation: For improved reliability, you can enable the experimental enhanced cancellation feature:
prefect config set PREFECT_EXPERIMENTAL_ENABLE_ENHANCED_CANCELLATION=True
Cancellation via CLI: You can also cancel a flow run from the command line using:
prefect flow-run cancel <FLOW_RUN_ID>
For more details, you can refer to the Prefect documentation here. If these steps do not resolve the issue, consider discussing it on Slack or filing a GitHub issue.
m
@Marvin what does PREFECT_EXPERIMENTAL_ENABLE_ENHANCED_CANCELLATION=True do?
m
Enabling `PREFECT_EXPERIMENTAL_ENABLE_ENHANCED_CANCELLATION=True` improves the handling of cases where flow or task cancellation can fail. Some highlights of this enhanced cancellation feature include:
- Worker Retry Mechanisms: If a worker encounters an error during cancellation, it will attempt the cancellation again, and other workers may also try.
- Improved Cancellation Tracking: The system better handles scenarios where infrastructure specifics (like `infrastructure_pid`) are missing, ensuring that cancellation is properly acknowledged.
- Fault Tolerance: Enhanced tolerance to unexpected errors during cancellation processes, providing a more reliable cancellation experience.
You can enable this feature by configuring Prefect with the following command:
prefect config set PREFECT_EXPERIMENTAL_ENABLE_ENHANCED_CANCELLATION=True
For more information, refer to the Prefect documentation on flow cancellation. If you encounter any issues with this feature, you can report them in the Prefect Slack community or by opening a GitHub issue.
n
that setting is enabled by default nowadays, you shouldn't have to toggle that unless you're on an older Prefect version
m
@Nate Related to this, we're having an issue where cancelling the currently running parent flows doesn't seem to cancel the running and pending children/subflows. Do you have any recommendations? We're on prefect 2.16.0, so maybe this was updated in later releases...
It seems like a running task receives a cancellation error, then retries
n
hmm, so enhanced cancellation should be on by default as of 2.14.5. can you say more about your infrastructure? are you using `run_deployment` for subflows?
m
we're using run_deployment to launch them from the parent, running the jobs in an eks cluster. Each subflow has n tasks with a retry of 1. The work queue has a flow run concurrency limit of 10, where more than 10 subflows are launched. We have 1 gpu node running or dead start and use an autoscaler to scale with the workload
@Nate Late jobs continue to start even after cancelling the jobs that are currently running
n
I can't remember right now if there's some weirdness with cancellation around `run_deployment` specifically here (I can look more into that)
> we're using run_deployment to launch them from the parent, running the jobs in an eks cluster.
a safeguard you could use regardless, if you were inclined, is to add an `on_cancellation` hook that uses `client.set_flow_run_state(..., force=True)` for each flow run returned by `client.read_flow_runs` where `FlowRunParentRunIdFilter(any_=[current_flow_run.id])`
hmm, I'm not sure I understand this statement
> Late jobs continue to start even after cancelling the jobs that are currently running
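For reference, the `on_cancellation` safeguard described above might look roughly like the sketch below. It is not a confirmed fix for this thread. The names `pick_cancel_state` and `cancel_child_runs` are hypothetical, and the filter classes `FlowRunFilter` and `FlowRunFilterParentFlowRunId` are assumed to be the Prefect 2.x equivalents of the `FlowRunParentRunIdFilter` named in the message, so check them against your installed version. Sending running children to Cancelling and queued ones straight to Cancelled is also an assumption, chosen so that Late and Pending subflows do not start afterwards.

```python
from typing import Optional

# Terminal or already-cancelling runs are skipped.
SKIP_STATE_TYPES = {"COMPLETED", "FAILED", "CANCELLED", "CRASHED", "CANCELLING"}


def pick_cancel_state(state_type: Optional[str]) -> Optional[str]:
    """Return the name of the state to force on a child run, or None to skip it."""
    if state_type is None or state_type in SKIP_STATE_TYPES:
        return None
    # Assumption: a RUNNING child goes to Cancelling so its infrastructure can
    # still be torn down; queued children (Scheduled/Late, Pending, Paused) go
    # straight to Cancelled so they never start.
    return "Cancelling" if state_type == "RUNNING" else "Cancelled"


async def cancel_child_runs(flow, flow_run, state):
    """on_cancellation hook: force-cancel the subflow runs of this parent run."""
    # Imported here so the helper above stays importable without Prefect.
    from prefect.client.orchestration import get_client
    from prefect.client.schemas.filters import (
        FlowRunFilter,
        FlowRunFilterParentFlowRunId,  # assumed name, verify for your version
    )
    from prefect.states import Cancelled, Cancelling

    make_state = {"Cancelled": Cancelled, "Cancelling": Cancelling}
    async with get_client() as client:
        # Results may be capped by the API's default page size.
        children = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                parent_flow_run_id=FlowRunFilterParentFlowRunId(any_=[flow_run.id])
            )
        )
        for child in children:
            state_type = child.state.type.value if child.state else None
            target = pick_cancel_state(state_type)
            if target is not None:
                await client.set_flow_run_state(
                    child.id, state=make_state[target](), force=True
                )
```

The hook would be attached to the parent with `@flow(on_cancellation=[cancel_child_runs])`. Keeping the state decision in a small pure function makes it easy to adjust, for example to force every child directly to Cancelled.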
m
Rephrased: Late, pending and running subflows continue to run once another flow finishes. e.g. I cancel one of the running flows, the late or pending flow in the queue begins to run next.