Oscar Björhn

05/05/2023, 12:33 PM
Is anyone else having issues with Prefect agents telling flows to delete themselves while they're running in 2.10.7? Maybe it's an issue specific to Azure Container Instances. I've never seen it before, and it keeps happening today.
It looks a bit like this when it happens:
11:58:05.607 | INFO    | prefect.infrastructure.container-instance-job - AzureContainerInstanceJob 'landing-to-raw': Running command...
11:58:05.742 | INFO    | prefect.agent - Completed submission of flow run '559eb34d-9f65-4c45-a42e-fca2cae208b8'
11:59:08.374 | INFO    | prefect.infrastructure.container-instance-job - AzureContainerInstanceJob 'landing-to-raw': Deleting container...
11:59:14.153 | INFO    | prefect.infrastructure.container-instance-job - AzureContainerInstanceJob 'landing-to-raw': Container deleted.
11:59:14.154 | ERROR   | prefect.agent - An error occured while monitoring flow run '559eb34d-9f65-4c45-a42e-fca2cae208b8'. The flow run will not be marked as failed, but an issue may have occurred.
The agent starts the flow, lets it run for a while (about a minute in the log above), then starts deleting the container for some reason. It then tries to poll the state of the flow run but throws an exception (probably because the container has been deleted?). The flow still looks like it's running in the web interface, but nothing related to shutting down is logged and the container is gone in Azure.
Azure Container Instances seems to have had issues all day, so this is probably a weird symptom of that. Ignore for now, I'll open an issue if it persists!

Ryan Peden

05/05/2023, 11:20 PM
I think this could happen if the ACI infrastructure block receives an error from the Azure management SDK when it pings Azure to check the status of the task container. The block errs on the side of deleting the container group when something goes wrong to avoid accidentally filling up your resource group with broken container groups. I could see it being useful for the ACI block (and the new ACI worker) to perhaps retry the status check a few times to make it more tolerant when your flow container group is running well but the Azure API call to get the container group status is flaky.
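To make the retry idea concrete, here's a rough sketch of what a more tolerant status check could look like. This is not the actual ACI block code: `get_status_with_retries` and `fetch_status` are hypothetical names, and `fetch_status` stands in for whatever call asks Azure for the container group state. It assumes any exception from that call might be transient, which is probably too broad for a real implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def get_status_with_retries(
    fetch_status: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch_status, retrying with exponential backoff on failure.

    fetch_status is a hypothetical zero-argument callable that wraps the
    status request; it is not part of Prefect or the Azure SDK.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch_status()
        except Exception as exc:  # assumption: every error is worth retrying
            last_error = exc
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)

    # Every attempt failed: re-raise so the caller can decide to clean up.
    assert last_error is not None
    raise last_error
```

With something like this, the container group would only get deleted after several status checks in a row have failed, instead of on the first bad response from the Azure API.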

Oscar Björhn

05/08/2023, 12:49 PM
Hey Ryan! I think you're right, this is what's happening. ACI has been really flaky for almost a week now, and there's probably not much that can be done about it on the client side. It might be time for us to look at alternatives; after using ACI for six months, we're starting to realize it's not quite production ready.
I mean, adding more retries to the status check on the client side would help alleviate some of the issues, but not all of them. Might be worth giving that a shot first, I suppose.