# prefect-cloud
Sergey Zakharchenko:
Hello everyone! 👋 Has anyone experienced endless retries and rerunning of tasks? I have a flow with 10 tasks that run simultaneously via `.submit()`, and a final task with `wait_for` set to all of the previous tasks. The `@task` decorator is configured with `retries=0`. In addition, I explicitly set the global Prefect settings like this:
```
prefect config set PREFECT_FLOW_DEFAULT_RETRIES=0
prefect config set PREFECT_TASK_DEFAULT_RETRIES=0
```
Nevertheless, I can still see `Retrying` and rerunning states on the tasks, and it happens without any messages in the logs. Some tasks cannot actually finish because of this repetition, but I want each task to be executed only once. Did I do something wrong? Some screenshots are below ⤵️
Serina:
Hi @Sergey Zakharchenko, is it possible to check your Event Feed page in the Prefect UI, like in the attached screenshot, and then click into "Flow run retrying" to determine how the flow run is moving to a `Retrying` / `AwaitingRetry` state?
Sergey Zakharchenko:
@Serina hello Serina, and thank you for your attention! I cannot see any `prefect.flow-run.Retrying` event in my Events list; see the first picture. An interesting detail: I have the flow `looky_elt_prod` (`whispering-carp`) with the task `looky-prod-tap-appmetrica-target-bigquery--flex`, which restarted at 06:13:16 PM (second picture). I can see only `prefect.flow-run.Pending` and `prefect.flow-run.Running` events (third picture), and later there are many more `prefect.flow-run.Running` events again and again (4th, 5th, and 6th pictures).
Serina:
> I cannot see any `prefect.flow-run.Retrying` event in my Events list; see the first picture.
My bad, `AwaitingRetry` is what I meant 🤦
If you click into the actual event for one of those and share the output, I think that would be helpful
Sergey Zakharchenko:
@Serina the only one I have is `prefect.flow-run.Running` 🤔 There is an interesting point: there is a `RUNNING` -> `RUNNING` transition here:
```json
{
  "id": "0f558e6f-bb20-4731-934c-d7d28c9a5e2c",
  "account": "e3cc5c8e-a339-46ff-9382-26e31cbaf166",
  "event": "prefect.flow-run.Running",
  "occurred": "2023-09-27T07:05:01.039Z",
  "payload": {
    "intended": {
      "to": "RUNNING",
      "from": "RUNNING"
    },
    "initial_state": {
      "name": "Running",
      "type": "RUNNING"
    },
    "validated_state": {
      "name": "Running",
      "type": "RUNNING"
    }
  },
  "received": "2023-09-27T07:05:01.281Z",
  "related": [
    {
      "prefect.resource.id": "prefect.flow.86900e6f-84df-4f91-a5b7-b6b7d1bf19de",
      "prefect.resource.name": "looky_elt_prod",
      "prefect.resource.role": "flow"
    },
    {
      "prefect.resource.id": "prefect.deployment.3a772541-50a0-4e9d-a4aa-da914568639d",
      "prefect.resource.name": "looky_elt_prod",
      "prefect.resource.role": "deployment"
    },
    {
      "prefect.resource.id": "prefect.work-queue.ab858e30-62fa-49e9-8950-3203876661bb",
      "prefect.resource.name": "looky-prod",
      "prefect.resource.role": "work-queue"
    },
    {
      "prefect.resource.id": "prefect.work-pool.ee2d8837-d027-4f2f-a7ca-8b3c8ca28a05",
      "prefect.resource.name": "looky-prod",
      "prefect.resource.role": "work-pool"
    },
    {
      "prefect.resource.id": "prefect.tag.0.1.42",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.tag.auto-scheduled",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.tag.looky-prod",
      "prefect.resource.role": "tag"
    },
    {
      "prefect.resource.id": "prefect.deployment.3a772541-50a0-4e9d-a4aa-da914568639d",
      "prefect.resource.name": "looky_elt_prod",
      "prefect.resource.role": "creator"
    }
  ],
  "resource": {
    "prefect.state-name": "Running",
    "prefect.state-type": "RUNNING",
    "prefect.resource.id": "prefect.flow-run.03c28448-6763-487c-b9c5-4880c7733873",
    "prefect.resource.name": "nonchalant-hog",
    "prefect.state-message": "",
    "prefect.state-timestamp": "2023-09-27T07:05:01.039595+00:00"
  },
  "workspace": "21b03e33-1cd9-4b21-9305-7e4dbb7339d9"
}
```
We also discovered an insufficient-resources problem at the K8s level, which resulted in killed jobs and even lost pods. I suppose it might be the root of the problem, and we're currently investigating K8s. But what came first: the simultaneous task rerunning or the lack of resources? We are testing it right now. 🧐
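One thing that might help separate the two is dumping the flow run's full state history from the API: if it shows `Running` appearing repeatedly with no `AwaitingRetry`/`Retrying` in between, that points at the infrastructure restarting the run rather than a Prefect retry. A small sketch, assuming the Prefect 2.x Python client (the flow-run ID below is the one from the event payload above):
```python
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client


async def print_state_history(flow_run_id: str) -> None:
    async with get_client() as client:
        # every state this flow run has passed through
        states = await client.read_flow_run_states(UUID(flow_run_id))
        for state in states:
            print(state.timestamp, state.type.value, state.name)


if __name__ == "__main__":
    # flow-run ID taken from the event payload above
    asyncio.run(print_state_history("03c28448-6763-487c-b9c5-4880c7733873"))
```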
Serina:
Ah wait, you originally said you could only see retrying states on the tasks, not the flows, so do you have events for retrying tasks?
Sergey Zakharchenko:
> Ah wait, you originally said you could only see retrying states on the tasks, not the flows
Yep, I did.
> so do you have events for retrying tasks?
No, absolutely not. I can see only "running" events. 🤷‍♀️ Another flow was afflicted by the same issue; look at the picture. Here the task was working and then simply started again for no reason, and I can only tell that from the logs. Going to the Event Feed page, I can see only `prefect.flow-run.Running` for this flow and `prefect.task-run.Running` for a couple of tasks, but not for all of them! And there are no "retrying" events at all! 😖
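One more way to confirm what is happening might be to look at `run_count` on the task runs of the affected flow run: a count above 1 with no retrying events would mean the tasks really did execute more than once without ever entering a retry state. A rough sketch, assuming the Prefect 2.x client schemas (the flow-run ID below is a placeholder):
```python
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import FlowRunFilter, FlowRunFilterId


async def print_task_run_counts(flow_run_id: str) -> None:
    async with get_client() as client:
        task_runs = await client.read_task_runs(
            flow_run_filter=FlowRunFilter(id=FlowRunFilterId(any_=[UUID(flow_run_id)]))
        )
        for tr in task_runs:
            # run_count > 1 means the task body executed more than once
            print(tr.name, tr.state_name, tr.run_count)


if __name__ == "__main__":
    # placeholder flow-run ID; replace with the affected run's ID
    asyncio.run(print_task_run_counts("00000000-0000-0000-0000-000000000000"))
```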