# prefect-server
j
Hi everyone! I am experiencing something unusual in Prefect Cloud. I have a flow scheduled every day at 5:00 am (UTC). I also have an automation (SLA) that cancels any flow run that does not start within 1h (and sends a notification to Slack). For the last three days I have been receiving a notification from that automation, even though the flow runs were correctly executed at 5:00 am (UTC) every day. For example, the last notification said that flow run
51ec0178-bd12-43c4-8b00-fa4ae61300ef
was cancelled because of that automation. But in the dashboard that flow run does not exist, and a flow run was correctly executed at the expected time. Do you know what might be happening? Thank you in advance!
a
1. Can you share how you defined the automation? 2. I saw similar issues when users had both automations and cloud hooks configured simultaneously. Can you check whether you happen to have both configured for this flow?
j
I have three automations:
• cancel the flow run if it doesn’t start within 1h
• cancel the flow run if it runs for more than 24h
• send a Slack notification if any flow run finishes in a Failed or Cancelled state
I don’t have a cloud hook configured. The first time it happened was three days ago. I didn’t change anything in Prefect, but I did make a deployment that day (to fix code issues, not related to the Prefect flow or its configuration).
a
What agent do you use, and what labels did you assign to it? Can you share the flow definition, especially your run config? When I look at the logs for this flow run, it says the flow run was Scheduled, but it was in fact stuck in a Scheduled state for one hour until it was cancelled by the automation. Usually this happens when there is a label mismatch between the flow and your agent. This thread provides more info.
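As a rough sketch (not your exact setup), this is what the label matching looks like: the run config's labels must be a subset of the labels the agent was started with, otherwise no agent ever picks up the run and it sits in a Scheduled state.
Copy code
# Rough sketch only: the label on the flow's run config must match a label
# the agent was started with, otherwise the run stays in a Scheduled state.
from prefect.run_configs import ECSRun

run_config = ECSRun(labels=["iebs-analytics-prefect-prod"])

# The ECS agent then needs to run with the same label, e.g.:
#   prefect agent ecs start --label iebs-analytics-prefect-prod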
j
Yes, of course, I include it at the end of this message. The issue is that the flow (called
main
) has been correctly executed, so it is kind of scheduled twice. I’ll use today’s flow runs as the example. The flow run
2eacfb3e-0300-4988-89e5-7eec59fe4013
failed (it stayed in a Scheduled state for more than one hour), but flow run
312d9de6-d23f-4a30-82fb-958f4c00c66f
executed correctly (flow main scheduled at 5:00 am). Also, I don’t see any cancelled flow run in the dashboard for the flow
main
. The run configuration for this flow is:
Copy code
{
  "cpu": "1024",
  "env": null,
  "type": "ECSRun",
  "image": null,
  "labels": [
    "iebs-analytics-prefect-prod"
  ],
  "memory": "2048",
  "__version__": "0.14.20",
  "task_role_arn": "arn:aws:iam::XXXX",
  "run_task_kwargs": null,
  "task_definition": null,
  "execution_role_arn": "arn:aws:iam::XXXX",
  "task_definition_arn": null,
  "task_definition_path": null
}
a
Why do you have this automation in the first place? Are your flow runs getting stuck in a Scheduled state often?
j
This flow exports some data from SQL tables to AWS S3, then does some transformations, and finally sends emails, updates dashboards, etc. The flow runs for 2.5 to 3h, and the emails are sent in the morning. If the flow run does not start within 1 hour, I assume that something went wrong, so I cancel it. The other automation cancels flow runs that have been running for more than 24 hours, because the next day’s flow run takes precedence (and because something clearly went wrong with that run).
a
Can you share your flow definition and how you trigger this flow? You say that the flow is scheduled, but the
auto_scheduled
flag is set to False in the logs, so something looks suspicious here.
Another suspicious thing is that each of the failed flow runs you sent me was scheduled and then cancelled 10 days later, rather than after 1 or 24 hours. I would be curious to see how you started your ECS agent. Do you run it as a service? Automations don't seem to be the problem here; there is some deeper issue in your ECS execution layer or flow configuration/scheduling.
Also, the flow run that succeeded has version 4 and a proper schedule attached, while the suspicious flow runs that were cancelled after 10 days in a Scheduled state had flow version 2. Could it be that your new flow version now works as expected, and the issue was in the older flow versions?
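If it helps, here is a rough way to pull those details for one of the suspicious runs through the GraphQL API. This is just a sketch using the Python Client; the field names assume the standard Prefect 1.0 schema.
Copy code
# Sketch: inspect a suspicious flow run via the GraphQL API.
# Field names assume the standard Prefect 1.0 schema.
from prefect.client import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "51ec0178-bd12-43c4-8b00-fa4ae61300ef"}}) {
        name
        state
        auto_scheduled
        scheduled_start_time
        flow { version }
      }
    }
    """
)
print(result)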
j
Oh, I’m going to check my ECS task. Indeed, it is a service in AWS ECS. I noticed some time ago that a new agent appeared in the Prefect Cloud dashboard (it seems the task died and the service started a new instance). It could be related to that. Let me check it
I see that there is only one running task, and all seems ok
Here is the flow configuration:
Copy code
import pendulum
from prefect import Flow
from prefect.engine.results import S3Result
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import ECSRun
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock
from prefect.storage import docker

run_config = ECSRun(task_role_arn=TASK_ROLE_ARN, execution_role_arn=EXECUTION_ROLE_ARN, cpu=PREFECT_AGENT_CPU, memory=PREFECT_AGENT_MEMORY)
storage = docker.Docker(registry_url=REGISTRY_BASE_URL, image_name=REGISTRY_IMAGE_NAME, image_tag=FLOW_NAME, base_image="prefecthq/prefect:0.14.20-python3.8")  # ... other args omitted
result = S3Result(bucket=BUCKET_PREFECT_RESULTS)
executor = LocalDaskExecutor(scheduler="processes", num_workers=8)
schedule = Schedule(clocks=[CronClock("0 5 * * *", start_date=pendulum.datetime(2021, 1, 1, tz="Europe/Madrid"))])
with Flow(FLOW_NAME, run_config=run_config, storage=storage, result=result, executor=executor, schedule=schedule) as flow:
    ...
Oh, and regarding how I run the flow: it always runs on the schedule. Just sometimes, when something goes wrong (a bug in my code), I need to run it manually, but that rarely happens.
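A sketch of the registration step that would attach the label seen in the run config above; the project name is a placeholder, not taken from this thread.
Copy code
# Sketch: registering the flow so that the label ends up on the stored run config.
flow.register(
    project_name="my-project",               # placeholder
    labels=["iebs-analytics-prefect-prod"],  # must match the agent's label
)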
a
Your flow config looks good. It’s also great that you run your agent as an ECS service. If the new version is working as expected now, I would monitor it for a few days; so far it looks like your new flow version, starting from v4, fixed the issue. Do you run it with Fargate or EC2 nodes? If EC2, perhaps the flow run was stuck in Scheduled because the cluster didn’t have enough capacity?
j
It is Fargate
And why don’t the cancelled flow runs appear in my Prefect Cloud dashboard? I tried to debug it from there in the first place, but I cannot find them there (I didn’t try the GraphQL API).
It might help me debug what is producing the error.
a
You can re-register the flow and see if this fixes your issue. You can also try upgrading your flow's base image to a higher Prefect version. But from the Automations perspective, everything seems to work as expected, and the issue was in the execution layer. You could also set the log level to DEBUG on your agent and flow run to see if you get more information that way:
Copy code
ECSRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
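In your case that would mean merging the env into the run config from your flow definition above, roughly:
Copy code
# Sketch: the DEBUG logging level merged into the existing run config
# (argument names taken from the flow definition shared earlier in the thread).
run_config = ECSRun(
    task_role_arn=TASK_ROLE_ARN,
    execution_role_arn=EXECUTION_ROLE_ARN,
    cpu=PREFECT_AGENT_CPU,
    memory=PREFECT_AGENT_MEMORY,
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)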
j
Thank you very much @Anna Geller, I will try the DEBUG log level first!
👍 1
Hi @Anna Geller, I made a new flow deployment to include the DEBUG logging level, and the deployment itself solved the problem. I tried to get information about the cancelled flow runs using the Interactive API (GraphQL), and those flow runs didn’t exist for me. I think something might have been wrong in the previous deployment I did, and for some reason the scheduler “tried to run” the flow with an old version and wrong labels (my agent never noticed those flow runs!). However, it is solved now. Thanks for your help!
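In case that theory ever needs double-checking, a rough way would be to query whether the older versions of main are archived, since only unarchived versions should get runs scheduled (a sketch along the same lines as the query above; field names assume the standard Prefect 1.0 schema):
Copy code
# Sketch: list versions of the "main" flow and whether they are archived.
from prefect.client import Client

client = Client()
result = client.graphql(
    """
    query {
      flow(where: {name: {_eq: "main"}}, order_by: {version: desc}) {
        id
        version
        archived
      }
    }
    """
)
print(result)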
💯 1
a
Nice work! Great to hear it’s solved now.