# ask-community
j
Hello, I am using Prefect OSS 3.1.14, which I have deployed on an Azure VM with a worker. A PostgreSQL database is connected when I start the Prefect Server. The work pool is configured as Container Instances, where I use a Prefect 3.1.14 image as well. Everything works fine, except that sometimes (about 1 in 10 times, I would say), some flows restart in the middle of execution, sometimes even multiple times. When I check the logs of the container instance during execution, I notice that the image is pulled multiple times (at the start of the flow, then 10 minutes later, then 20 minutes later, etc.). Does anyone else have the same issue?
Example here : “Opening process” 2 times
The container logs of a normal execution :
j
👋 I would check the logs on the Azure container specifically to see if it restarted and why. It doesn't look like this is a double submission by the worker. Do you have a container restart policy set? It's normally set to never by the worker, but it is configurable depending on what you're doing.
j
Hi Jake! Nope, I just checked the work pool settings: “restartPolicy”: “Never”
I feel that the problem is from the worker but I don't understand why. I have the same work pool settings in Prefect Cloud (Push version Container Instances) and I never had this kind of error
j
It's always possible there's a bug! But given the logs you've shared, I think it's unlikely that the worker double submitted a run. Are you able to get the container logs out of Azure? You have two Python processes running here that basically each executed your flow run once. When the submission starts up the container instance, this should obviously only happen once, but it can be duplicated via the things I mentioned above, where the container job can suddenly restart outside of the worker's purview.
so narrowing down where/how those python processes are executing can point to double submission, container restart etc. Which is what I'm hoping the Azure logs will show.
j
Thanks for your answer, Jake! I’ll check the Azure container logs and see if there’s anything unusual
j
sounds good, curious to see what you find!
j
Hey @Jake Kaplan I’m not sure if I’m looking in the wrong place, but it seems like there isn’t much more information in my container logs other than `'Opening process...'`, which makes debugging quite challenging. However, I’m not sure if this is related, but I was also encountering another recurring error:
```text
"An error occurred while monitoring flow run 'xxxxxxxxx'. The flow run will not be marked as failed, but an issue may have occurred.
(...)
File "/home/azureuser/prefect-env/lib/python3.12/site-packages/prefect_azure/workers/container_instance.py", line 948, in _stream_output
    if line_time > last_written_time:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: can't compare offset-naive and offset-aware datetimes"
```
To fix this, I modified the `_stream_output` function by adding the following just before the `line_time > last_written_time` comparison:
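```python
# Requires `from datetime import timezone` at the top of the module.
# Treat any naive timestamp as UTC so both values are comparable.
if line_time.tzinfo is None:
    line_time = line_time.replace(tzinfo=timezone.utc)
if last_written_time.tzinfo is None:
    last_written_time = last_written_time.replace(tzinfo=timezone.utc)
```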
Since making this change, I haven’t encountered either of the two recurring errors. I’m not sure if the issues are related, but I’m keeping my fingers crossed! 🤷🏻‍♂️
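For anyone who wants to reproduce the datetime error outside of the worker, here is a minimal standalone sketch of the same idea. The `ensure_utc` helper and the sample timestamps are hypothetical and are not part of `prefect-azure`; the sketch also assumes that any naive timestamp is meant to be UTC, which may not hold for every log source.

```python
from datetime import datetime, timezone


def ensure_utc(value: datetime) -> datetime:
    """Return `value` unchanged if it is timezone-aware; otherwise tag it as UTC.

    Hypothetical helper for illustration only. It assumes naive values are UTC.
    """
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value


# Made-up timestamps: one naive, one timezone-aware.
naive = datetime(2025, 2, 5, 10, 52, 24)
aware = datetime(2025, 2, 5, 10, 50, 0, tzinfo=timezone.utc)

# Comparing a naive and an aware datetime directly raises TypeError.
try:
    naive > aware
except TypeError as exc:
    print(f"Without normalization: {exc}")

# After normalizing both sides, the comparison works.
print(ensure_utc(naive) > ensure_utc(aware))
```

Normalizing both operands at the point of comparison keeps the change local, which is roughly what the patch above does inside `_stream_output`.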
j
Interesting 🤔. Not sure if you're able to reproduce this consistently or not, but these are the things I would check for in the container logs (not the worker logs, just to be clear, but the logs of the containers that the worker is spinning up):
1. Are these "Opening process..." messages appearing in logs from two separate container instances, or within the same container?
2. Could you check the Azure activity logs or container instance metrics around the time of these restarts to see if there are any automatic container restarts or other infrastructure events occurring?
3. In the Azure portal, under the Container Instance details, could you check the Events tab for any container lifecycle events (stops, starts, restarts) during these times?
Separately, for the `TypeError: can't compare offset-naive and offset-aware datetimes` error, that looks like it occurred once the job was created and as part of the monitoring. It should just mean that if your flow run crashed, it wouldn't be able to report it back. My hunch is that it's a separate issue, but it's hard to say. It looks like there's an assumption that the logs it pulls are not timezone naive. I'm glad you were able to resolve it though! Would you consider contributing your fix to https://github.com/PrefectHQ/prefect/tree/main/src/integrations/prefect-azure?
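If the container logs can be exported as text, the first check (same container versus separate containers) could be sketched roughly like this. The `count_process_starts` function, the container names, and the sample log lines are all made up for illustration; the only thing taken from this thread is the "Opening process" marker, and real log lines may be formatted differently.

```python
def count_process_starts(
    logs_by_container: dict[str, str], marker: str = "Opening process"
) -> dict[str, int]:
    """Count the log lines containing `marker` for each container.

    Hypothetical helper: `logs_by_container` maps a container name to its raw log text.
    """
    return {
        name: sum(marker in line for line in text.splitlines())
        for name, text in logs_by_container.items()
    }


# Made-up sample logs, not real Azure output.
sample_logs = {
    "container-a": "Pulling image\nOpening process...\nFlow run started\nOpening process...\n",
    "container-b": "Pulling image\nOpening process...\nFlow run finished\n",
}

counts = count_process_starts(sample_logs)
for name, starts in counts.items():
    note = " (process started more than once)" if starts > 1 else ""
    print(f"{name}: {starts}{note}")
```

A count above one for a single container would point toward a restart inside that container rather than a second submission by the worker.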
j
Hey @Jake Kaplan
1. "Opening process..." messages appeared within the same container instance (in the container instance logs)
2. See the screenshot below
3. The screenshot in the original post is the Events tab of the container instance
And sure I'll try to contribute my fix today!
Capture d’écran 2025-02-05 à 10.52.24.png
I know it doesn't make sense, but since this fix, I haven't had this error anymore, unlike before when it happened every 10 or 15 flows.
j
I'm glad it's working for you!! But that is really so odd to me 😅 It looks like there's a single create deployment here, as you've shown. So that means the worker isn't double submitting. But inside of the container the process is recurring... Would you be able to share the container instance logs? Totally understand if that's not possible. Also, you're under no obligation to continue answering my debugging questions; I don't want to take up your time. Just trying to understand 😄
j
it doesn't bother me at all. Thank you for trying to understand! 😊
the container logs are actually the logs that appear in the Prefect UI, right?
i just anonymized the logs
j
ah sorry! I mean the logs in Azure itself. I was hoping there might be something hidden there that wasn't being reported back through the prefect logs
ironically it would be the logs that likely would have come through from `_stream_output` had that been working before your fix...
j
ah yes, I can't check these logs because the container instance is deleted 😕
j
Ah understood. Thanks anyways!