Hello I am using Prefect OSS 3 1 14 which I have deployed on Prefect Community #ask-community

Johan sh

01/31/2025, 3:07 PM

Hello, I am using Prefect OSS 3.1.14, which I have deployed on an Azure VM with a worker. A PostgreSQL database is connected when I start the Prefect Server. The work pool is configured as Container Instances, where I use a Prefect 3.1.14 image as well. Everything works fine, except that sometimes (about 1 in 10 times, I would say), some flows restart in the middle of execution, sometimes even multiple times. When I check the logs of the container instance during execution, I notice that the image is pulled multiple times (at the start of the flow, then 10 minutes later, then 20 minutes later, etc.). Does anyone else have the same issue?

Johan sh

01/31/2025, 3:08 PM

Example here : “Opening process” 2 times

Johan sh

01/31/2025, 3:11 PM

The container logs of a normal execution :

Jake Kaplan

01/31/2025, 3:54 PM

👋 I would check the logs on the Azure container specifically to see if it restarted and why. It doesn't look like this is a double submission by the worker. Do you have a container restart policy set? It's normally set to never by the worker, but it is configurable depending on what you're doing.

Johan sh

01/31/2025, 4:02 PM

Hi Jake! Nope, I just checked the work pool settings: “restartPolicy”: “Never”

Johan sh

01/31/2025, 4:07 PM

I feel that the problem is from the worker but I don't understand why. I have the same work pool settings in Prefect Cloud (Push version Container Instances) and I never had this kind of error

Jake Kaplan

01/31/2025, 4:50 PM

It's always possible there's a bug! But given the logs you've shared, I think it's unlikely that the worker double-submitted a run. Are you able to get the container logs out of Azure? You have two python processes running here that basically each executed your flow run once. When the submission starts up the container instance, this should obviously only happen once, but it can be duplicated via the things I mentioned above, where the container job can suddenly restart outside of the worker's purview.

Jake Kaplan

01/31/2025, 4:51 PM

so narrowing down where/how those python processes are executing can point to double submission, container restart etc. Which is what I'm hoping the Azure logs will show.

Johan sh

01/31/2025, 5:50 PM

Thanks for your answer, Jake! I’ll check the Azure container logs and see if there’s anything unusual

Jake Kaplan

01/31/2025, 6:33 PM

sounds good, curious to see what you find!

Johan sh

02/02/2025, 2:35 PM

Hey @Jake Kaplan I’m not sure if I’m looking in the wrong place, but it seems like there isn’t much more information in my container logs other than 'Opening process...', which makes debugging quite challenging. However, I’m not sure if this is related, but I was also encountering another recurring error:

"An error occurred while monitoring flow run 'xxxxxxxxx'. The flow run will not be marked as failed, but an issue may have occurred.
(...)
File "/home/azureuser/prefect-env/lib/python3.12/site-packages/prefect_azure/workers/container_instance.py", line 948, in _stream_output
    if line_time > last_written_time:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: can't compare offset-naive and offset-aware datetimes"

To fix this, I modified the `_stream_output` function by adding the following just before the `line_time > last_written_time` comparison:

if line_time.tzinfo is None:
    line_time = line_time.replace(tzinfo=timezone.utc)
if last_written_time.tzinfo is None:
    last_written_time = last_written_time.replace(tzinfo=timezone.utc)

Since making this change, I haven’t encountered either of the two recurring errors. I’m not sure if the issues are related, but I’m keeping my fingers crossed! 🤷🏻‍♂️
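The fix described above can be tried outside the worker. The following is a minimal, self-contained sketch, not the actual `prefect_azure` code: the helper name `ensure_utc` and the sample timestamps are hypothetical, and it assumes, as the patch above does, that any naive timestamp is already in UTC.

```python
from datetime import datetime, timezone


def ensure_utc(dt: datetime) -> datetime:
    """Return dt unchanged if it is timezone-aware; otherwise label it as UTC.

    Assumption: a naive timestamp is already expressed in UTC.
    """
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt


# Hypothetical values standing in for line_time and last_written_time.
line_time = datetime(2025, 2, 2, 14, 0, 0)  # naive
last_written_time = datetime(2025, 2, 2, 13, 59, 0, tzinfo=timezone.utc)  # aware

try:
    line_time > last_written_time
except TypeError as exc:
    # Mixing naive and aware datetimes in an ordering comparison raises TypeError.
    print(f"Without normalization: {exc}")

# After normalization both values are aware, so the comparison works.
print(ensure_utc(line_time) > ensure_utc(last_written_time))
```

Note that `replace(tzinfo=...)` only attaches a timezone label and does not shift the clock time, so the result is only correct if the naive value really was in UTC.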

Jake Kaplan

02/02/2025, 8:44 PM

Interesting 🤔. Not sure if you're able to reproduce this consistently or not, but these are the things I would check for in the container logs (not the worker logs, just to be clear; the containers that the worker is spinning up):

1. Are these "Opening process..." messages appearing in logs from two separate container instances, or within the same container?
2. Could you check the Azure activity logs or container instance metrics around the time of these restarts to see if there are any automatic container restarts or other infrastructure events occurring?
3. In the Azure portal, under the Container Instance details, could you check the Events tab for any container lifecycle events (stops, starts, restarts) during these times?

Separately, for the `TypeError: can't compare offset-naive and offset-aware datetimes` error, that looks like it occurred once the job was created and as part of the monitoring. It should just mean that if your flow run crashed, it wouldn't be able to report it back. My hunch is that it's a separate issue, but it's hard to say. It looks like there's an assumption that the logs it pulls are not timezone naive. I'm glad you were able to resolve it though! Would you consider contributing your fix to https://github.com/PrefectHQ/prefect/tree/main/src/integrations/prefect-azure?
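For the first check above, one rough way to tell "two separate containers" from "one container that started the process twice" is to count the start marker per container in exported logs. This is only a sketch: the `(container_name, message)` record shape, the container names, and the sample messages are hypothetical, and real Azure or Prefect log exports will need their own parsing.

```python
from collections import Counter

# The start message quoted in this thread.
START_MARKER = "Opening process"


def count_starts(records):
    """Count start-marker messages per container.

    records: iterable of (container_name, message) pairs (hypothetical shape).
    """
    counts = Counter()
    for container_name, message in records:
        if START_MARKER in message:
            counts[container_name] += 1
    return counts


# Hypothetical sample data, not taken from the logs shared in this thread.
sample = [
    ("container-a", "Opening process..."),
    ("container-a", "Flow run started"),
    ("container-a", "Opening process..."),
    ("container-b", "Opening process..."),
]

counts = count_starts(sample)
# Containers with more than one start marker re-ran the process internally.
restarted = [name for name, n in counts.items() if n > 1]
print(restarted)
```

A container that appears in `restarted` points toward a restart inside the same container instance, whereas a marker count of one in each of several containers would point toward a double submission.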

Johan sh

02/05/2025, 9:51 AM

Hey @Jake Kaplan

1. "Opening process..." messages appeared within the same container instance (in the container instance logs).
2. See the screenshot below.
3. The screenshot in the original post is the "Events" tab of the container instance.

Johan sh

02/05/2025, 9:51 AM

And sure I'll try to contribute my fix today!

Johan sh

02/05/2025, 9:53 AM

Capture d’écran 2025-02-05 à 10.52.24.png

Johan sh

02/05/2025, 9:57 AM

I know it doesn't make sense, but since this fix, I haven't had this error anymore, unlike before when it happened every 10 or 15 flows.

🙌 1

Jake Kaplan

02/05/2025, 2:33 PM

I'm glad it's working for you!! But that is really so odd to me 😅 It looks like a single create deployment here, like you've shown. So that means the worker isn't double submitting. But inside of the container the process is recurring... Would you be able to share the container instance logs? Totally understand if that's not possible. Also, you're under no obligation to continue answering my debugging questions; I don't want to take up your time. Just trying to understand 😄

Johan sh

02/05/2025, 2:46 PM

it doesn't bother me at all. Thank you for trying to understand! 😊

Johan sh

02/05/2025, 2:46 PM

the container logs are actually the logs that appear in the Prefect UI, right?

Johan sh

02/05/2025, 2:51 PM

i just anonymized the logs

logs - logs.csv

Jake Kaplan

02/05/2025, 3:35 PM

ah sorry! I mean the logs in Azure itself. I was hoping there might be something hidden there that wasn't being reported back through the prefect logs

Jake Kaplan

02/05/2025, 3:36 PM

ironically it would be the logs that likely would have come through from `stream_output` had that been working before your fix...

Johan sh

02/05/2025, 3:56 PM

ah yes, I can't check these logs because the container instance is deleted 😕

Jake Kaplan

02/05/2025, 4:27 PM

Ah understood. Thanks anyways!
