Prefect Community #prefect-ui: After upgrading Prefect to 3.2.7, a manually retried flow run is stuck in AwaitingRetry

Giacomo Chiarella

02/26/2025, 1:41 PM

Hi everyone! After upgrading Prefect to 3.2.7, I tried to retry a failed flow run by clicking the retry button. The flow run is now stuck in the AwaitingRetry state. Any idea what the cause could be? The deployment is a simple one: it executes the flow run locally on the Prefect instance. If I run the deployment directly, it runs fine.

Jenny

02/26/2025, 5:22 PM

Hi @Giacomo Chiarella - can you give a bit more info about your setup?
1. Are you on Cloud or OSS?
2. What type of work pool are you using?
3. Does it happen with a new work pool?
4. Do you have any sort of deployment or work pool concurrency set?
5. How long has the run been stuck in AwaitingRetry?
And anything else you think might help!

Giacomo Chiarella

02/26/2025, 5:42 PM

Hi @Jenny, sure. I’ve just set up Prefect by installing it with

python -m pip install prefect==3.2.7

Once that was done, I deployed this flow:

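from prefect import flow, task

# Placeholder: the real value of DAG_NAME is not shown in the thread.
DAG_NAME = "simple_flow"


@task(name="simple_task", retries=1, retry_delay_seconds=30, retry_jitter_factor=1)
def simple_task():
    # Always fails, to exercise the retry path.
    raise Exception("Test")


@flow(name=DAG_NAME, retries=1, retry_delay_seconds=30, description="Test")
def flow_entrypoint():
    a = simple_task.submit()
    a.wait()
    # Fail the flow run unless the task run ended in the Completed state.
    if a.state.name.lower() != "completed":
        raise Exception("Flow error")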

If I just start a brand new flow run, everything works as expected and the flow run fails. If, once it has failed, I use the retry button in the UI, the flow run gets stuck in AwaitingRetry. The flow run concurrency is set to 1 and there are no other flow runs at all in the whole Prefect instance. It stays stuck indefinitely: I had to cancel it after it had been in AwaitingRetry for 2 hours. The work pool is of type Process. I think there is something off with the retry button submission, because everything else works as expected.

Jenny

02/26/2025, 6:12 PM

Thanks! And are you using Prefect Cloud or Prefect Server?

🙌 1

Giacomo Chiarella

02/26/2025, 6:13 PM

I’ve installed it on an EC2 instance in AWS.

Jenny

02/26/2025, 6:27 PM

Hmmm... when I try to reproduce this, my runs are picked up pretty snappily. If I set a concurrency limit, I do see it all working, but the second run has to wait a while. Can you try increasing your concurrency limit and see if your runs get picked up? I suspect something is blocking it. Sorry I don't have an easy fix for you!

Giacomo Chiarella

02/27/2025, 8:03 AM

@Jenny no problem! It looks like it is related only to manual retries, so I’m not too concerned, although it would be good to know what the problem is. I will try setting concurrency to 2 and see what happens. Meanwhile, I have another question. In Prefect 3.2.7 I’m randomly having issues when calling .wait() on a task future. For example, in the code I showed you above, at the end of the flow I do

a.wait()
logger.info(a.task_run_id)
task_run_name = asyncio.run(get_task_run_name_by_id(a.task_run_id))

where the function get_task_run_name_by_id is simply this

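from uuid import UUID

from prefect.client.orchestration import get_client


async def get_task_run_name_by_id(task_run_id: UUID):
    # Look up the task run through the Prefect API and return its name.
    async with get_client() as client:
        task_run = await client.read_task_run(task_run_id)
        return task_run.name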

The reason I’m doing that is that I check the state, a.state.name, in order to make the flow run fail if the task is not in the Completed state. The task run name is needed to log each task run with its state. Sometimes, randomly, I get

Encountered exception during execution: ObjectNotFound(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/client/orchestration/__init__.py", line 843, in read_task_run
    response = await self._client.get(f"/task_runs/{task_run_id}")
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1768, in get
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1540, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 354, in send
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 162, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '404 Not Found' for url 'http://prefect_orion:4200/api/task_runs/8317922a-ebd4-4eaa-85ed-7c374c07e69c'

Response: {'detail': 'Task not found'}

when I call task_run = await client.read_task_run(task_run_id). The strange thing is that I get this error at the very beginning of the flow, when the task has not started yet. Basically, the wait method is not waiting. Is it possible that it is called too early? Why does this happen? Is there a better way to wait for a task to finish? Can I mitigate this behaviour with something better than a time.sleep() before the wait? The logger call after the wait (which does not wait) prints the task run id. It looks like the task run has not been created in the database yet, and the wait does not block until execution finishes.

Giacomo Chiarella

02/27/2025, 3:43 PM

@Jenny regarding the retry issue, I’ve noticed that at the moment it only happens with the sample flow I showed you. Maybe something is off with the deployment; I will try to redeploy it. Meanwhile, I would really appreciate any ideas regarding the other issue. For now I’ve mitigated it by calling time.sleep before every future.wait(), but I wonder if there is a Prefect way to do that.
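
One possible alternative to a fixed time.sleep, offered only as a sketch and not as a confirmed Prefect recommendation: retry the lookup itself with a short backoff whenever it reports "not found". The helper name read_with_retry and its parameters are hypothetical and use only the standard library; the thread does not establish why wait() returns early, so this only covers the case where the task run is not yet readable from the API.

```python
import asyncio
from typing import Awaitable, Callable, Type, TypeVar

T = TypeVar("T")


async def read_with_retry(
    read: Callable[[], Awaitable[T]],
    not_found: Type[BaseException],
    attempts: int = 5,
    initial_delay: float = 0.5,
) -> T:
    """Call `read` until it stops raising `not_found` or attempts run out.

    Hypothetical helper: `read` is any zero-argument coroutine function,
    and `not_found` is the exception type that means "not visible yet".
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return await read()
        except not_found:
            if attempt == attempts:
                # Out of attempts: surface the original error.
                raise
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff between attempts
    raise ValueError("attempts must be at least 1")
```

Inside get_task_run_name_by_id this could be used as task_run = await read_with_retry(lambda: client.read_task_run(task_run_id), ObjectNotFound), assuming ObjectNotFound is imported from prefect.exceptions (the exception type named in the traceback above). Unlike a fixed sleep, the delay is only paid when the lookup actually fails.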

Jenny

02/27/2025, 8:04 PM

Thanks for following up! I don't know the answer to your question off the top of my head, I'm afraid. It's perhaps best asked as a new question in #CL09KU1K7

Giacomo Chiarella

02/28/2025, 7:28 AM

Thank you anyway!

👍 1
