# prefect-server
j
Hi everyone! I am experiencing something unusual in Prefect Cloud. I have a flow scheduled every day at 5:00 am (UTC). I also have an automation (SLA) that cancels any flow run that does not start within 1h (and sends a notification to Slack). For the last three days I have been receiving a notification from that automation, even though the flow runs were correctly executed at 5:00 am (UTC) every day. For example, the last notification said that flow run
51ec0178-bd12-43c4-8b00-fa4ae61300ef
was cancelled because of that automation. But in the dashboard that flow run does not exist, and a flow run was correctly executed at the expected time. Do you know what might be happening? Thank you in advance!
a
1. Can you share how you defined the automation? 2. I saw similar issues when users had both automations and cloud hooks configured simultaneously. Can you check whether you happen to have both configured for this flow?
j
I have three automations:
• cancel the flow run if it doesn’t start within 1h
• cancel the flow run if it runs for more than 24h
• send a Slack notification if any flow run finishes in a Failed or Cancelled state
I don’t have a cloud hook configured. The first time it happened was three days ago. I didn’t change anything in Prefect, but I did make a deployment that day (to fix code issues, not related to the Prefect flow or its configuration).
a
What agent do you use, and what labels did you assign to it? Can you share the flow definition, especially your run config? When I look at the logs for this flow run, it says the flow run was Scheduled, but it was in fact stuck in a Scheduled state for one hour until it was cancelled by the automation. Usually this happens when there is a label mismatch between the flow and your agent. This thread provides more info.
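As a rough sketch (not your exact setup), this is what the label matching looks like: the run config's labels must be a subset of the labels the agent was started with, otherwise no agent ever picks up the run and it sits in a Scheduled state.
Copy code
# Rough sketch only: the label on the flow's run config must match a label
# the agent was started with, otherwise the run stays in a Scheduled state.
from prefect.run_configs import ECSRun

run_config = ECSRun(labels=["iebs-analytics-prefect-prod"])

# The ECS agent then needs to run with the same label, e.g.:
#   prefect agent ecs start --label iebs-analytics-prefect-prod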
j
Yes, of course, I include it at the end of this message. The issue is that the flow (called
main
) has been correctly executed, so it is kind of scheduled twice. I’ll use today’s flow runs as the example. The flow run
2eacfb3e-0300-4988-89e5-7eec59fe4013
failed (it stayed in a Scheduled state for more than one hour), but flow run
312d9de6-d23f-4a30-82fb-958f4c00c66f
executed correctly (flow main scheduled at 5:00 am). Also, I don’t see any cancelled flow run in the dashboard for the flow
main
. The run configuration for this flow is:
Copy code
{
  "cpu": "1024",
  "env": null,
  "type": "ECSRun",
  "image": null,
  "labels": [
    "iebs-analytics-prefect-prod"
  ],
  "memory": "2048",
  "__version__": "0.14.20",
  "task_role_arn": "arn:aws:iam::XXXX",
  "run_task_kwargs": null,
  "task_definition": null,
  "execution_role_arn": "arn:aws:iam::XXXX",
  "task_definition_arn": null,
  "task_definition_path": null
}
a
Why do you have this automation in the first place? Are your flow runs getting stuck in a Scheduled state often?
j
This flow exports some data from SQL tables to AWS S3, then does some transformations, and finally sends emails, updates dashboards, etc. The flow runs for 2.5 to 3h, and the emails are sent in the morning. If the flow run does not start within 1 hour, I assume that something went wrong, so I cancel it. The other automation cancels flow runs that have been running for more than 24 hours, because the next day’s flow run takes precedence (and because something clearly went wrong with that run).
a
Can you share your flow definition and how you trigger this flow? You say that the flow is scheduled, but the
auto_scheduled
flag is set to False in the logs, so something looks suspicious here.
Another suspicious thing is that each of the failed flow runs you sent me was scheduled and then cancelled 10 days later, rather than after 1 or 24 hours. I would be curious to see how you started your ECS agent. Do you run it as a service? Automations don't seem to be the problem here; there is some deeper issue in your ECS execution layer or flow configuration/scheduling.
Also, the flow run that succeeded has version 4 and a proper schedule attached, while the suspicious flow runs that were cancelled after 10 days in a Scheduled state had flow version 2. Could it be that your new flow version now works as expected, and the issue was in the older flow versions?
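If it helps, here is a rough way to pull those details for one of the suspicious runs through the GraphQL API. This is just a sketch using the Python Client; the field names assume the standard Prefect 1.0 schema.
Copy code
# Sketch: inspect a suspicious flow run via the GraphQL API.
# Field names assume the standard Prefect 1.0 schema.
from prefect.client import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "51ec0178-bd12-43c4-8b00-fa4ae61300ef"}}) {
        name
        state
        auto_scheduled
        scheduled_start_time
        flow { version }
      }
    }
    """
)
print(result)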
j
Oh, I’m going to check my ECS task. Indeed, it is a service in AWS ECS. I noticed some time ago that a new agent appeared in the Prefect Cloud dashboard (it seems the task died and the service started a new instance). It could be related to that. Let me check it
I see that there is only one running task, and all seems ok
Here is the flow configuration:
Copy code
import pendulum
from prefect import Flow
from prefect.engine.results import S3Result
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import ECSRun
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock
from prefect.storage import docker

run_config = ECSRun(task_role_arn=TASK_ROLE_ARN, execution_role_arn=EXECUTION_ROLE_ARN, cpu=PREFECT_AGENT_CPU, memory=PREFECT_AGENT_MEMORY)
storage = docker.Docker(registry_url=REGISTRY_BASE_URL, image_name=REGISTRY_IMAGE_NAME, image_tag=FLOW_NAME, base_image="prefecthq/prefect:0.14.20-python3.8")  # ... other args omitted
result = S3Result(bucket=BUCKET_PREFECT_RESULTS)
executor = LocalDaskExecutor(scheduler="processes", num_workers=8)
schedule = Schedule(clocks=[CronClock("0 5 * * *", start_date=pendulum.datetime(2021, 1, 1, tz="Europe/Madrid"))])
with Flow(FLOW_NAME, run_config=run_config, storage=storage, result=result, executor=executor, schedule=schedule) as flow:
    ...
Oh, and regarding how I run the flow: it always runs on the schedule. Just sometimes, when something goes wrong (a bug in my code), I need to run it manually, but that rarely happens.
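A sketch of the registration step that would attach the label seen in the run config above; the project name is a placeholder, not taken from this thread.
Copy code
# Sketch: registering the flow so that the label ends up on the stored run config.
flow.register(
    project_name="my-project",               # placeholder
    labels=["iebs-analytics-prefect-prod"],  # must match the agent's label
)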
a
Your flow config looks good. It’s also great that you run your agent as an ECS service. If the new version is working as expected now, I would monitor it for a few days; so far it looks like your new flow version, starting from v4, fixed the issue. Do you run it with Fargate or EC2 nodes? If EC2, perhaps the flow run was stuck in Scheduled because the cluster didn’t have enough capacity?
j
It is Fargate
And why don’t the cancelled flow runs appear in my Prefect Cloud dashboard? I tried to debug it from there in the first place, but I cannot find them there (I didn’t try the GraphQL API).
It might help me debug what is producing the error.
a
You can re-register the flow and see if this fixes your issue. You can also try upgrading your flow's base image to a higher Prefect version. But from the Automations perspective, everything seems to work as expected, and the issue was in the execution layer. You could also set the log level to DEBUG on your agent and flow run to see if you get more information that way:
Copy code
ECSRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
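In your case that would mean merging the env into the run config from your flow definition above, roughly:
Copy code
# Sketch: the DEBUG logging level merged into the existing run config
# (argument names taken from the flow definition shared earlier in the thread).
run_config = ECSRun(
    task_role_arn=TASK_ROLE_ARN,
    execution_role_arn=EXECUTION_ROLE_ARN,
    cpu=PREFECT_AGENT_CPU,
    memory=PREFECT_AGENT_MEMORY,
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},
)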
j
Thank you very much @Anna Geller, I will try the DEBUG log level first!
👍 1
Hi @Anna Geller, I made a new flow deployment to include the DEBUG logging level, and the deployment itself solved the problem. I tried to get information about the cancelled flow runs using the Interactive API (GraphQL), and those flow runs didn’t exist for me. I think something might have been wrong in the previous deployment I did, and for some reason the scheduler “tried to run” the flow with an old version and wrong labels (my agent never noticed those flow runs!). However, it is solved now. Thanks for your help!
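In case that theory ever needs double-checking, a rough way would be to query whether the older versions of main are archived, since only unarchived versions should get runs scheduled (a sketch along the same lines as the query above; field names assume the standard Prefect 1.0 schema):
Copy code
# Sketch: list versions of the "main" flow and whether they are archived.
from prefect.client import Client

client = Client()
result = client.graphql(
    """
    query {
      flow(where: {name: {_eq: "main"}}, order_by: {version: desc}) {
        id
        version
        archived
      }
    }
    """
)
print(result)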
💯 1
a
Nice work! Great to hear it’s solved now.