# ask-community
d
Has anyone had the problem where a scheduled flow just stops getting scheduled?
```python
from prefect import Flow
from prefect.schedules import CronSchedule

# DEFAULT_START_DATE is defined elsewhere in our code
weekly = CronSchedule("15 1 * * *", start_date=DEFAULT_START_DATE)
with Flow("Daily Extract", schedule=weekly) as flow:
    ...
```
This flow was happily running at its scheduled time of 1:15 a.m. every day, but now it has just stopped getting scheduled. Curiously, today is the first day of the next month.
There should be upcoming runs, and under Activity today's run is missing; the last one is from yesterday.
The documentation says:
The scheduler periodically queries for flows with active schedules and creates flow runs corresponding to the next 10 scheduled start times of the flow.
Any ideas what could cause it to stop doing that?
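For reference, this is roughly how I've been checking whether the scheduler has actually created any upcoming runs, via the Python client against the Server's GraphQL API (just a sketch - I'm assuming the usual Hasura-generated flow_run fields, and you'd substitute your own flow ID):
```python
from prefect.client import Client

client = Client()  # pointed at our Prefect Server's apollo endpoint

# Ask the Server for the upcoming scheduled runs of the flow.
# (Sketch only - assumes the standard Hasura-generated schema with a
#  flow_run table exposing state and scheduled_start_time.)
result = client.graphql(
    """
    query {
      flow_run(
        where: {
          flow_id: { _eq: "<your-flow-id>" }
          state: { _eq: "Scheduled" }
        }
        order_by: { scheduled_start_time: asc }
      ) {
        id
        scheduled_start_time
        state
      }
    }
    """
)
print(result)  # a healthy scheduler keeps roughly the next 10 runs here
```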
n
Hi @Daniel Caldeweyher - thanks for the report! Could you provide your flow or flow group ID for that flow? You can find these in the details of the flow page (top left tile, details tab).
d
Flow ID 8c23a270-9546-4b71-add8-a686fb89334b, but I doubt this is going to help you as we are using Prefect community/server.
n
Ah ok, you're correct - I won't be able to look up that flow ID. In that case you may need to do some digging: check that your containers have the CPU/memory they need, and I'd also inspect the flow and flow group objects to see that the schedules are active (you can do this through the Interactive API in the UI).
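Something like this in the Interactive API (or from Python with the client) should show whether the schedule is still marked active on the flow and the flow group - treat the exact fields as a sketch against the standard Server schema:
```python
from prefect.client import Client

client = Client()

# Check that the schedule is still marked active on the flow, and look at
# any schedule set on the flow group. (Sketch - assumes the standard
# Prefect Server schema; swap in your own flow ID.)
result = client.graphql(
    """
    query {
      flow(where: { id: { _eq: "<your-flow-id>" } }) {
        id
        name
        version
        is_schedule_active
        flow_group {
          id
          schedule
        }
      }
    }
    """
)
print(result)
```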
d
Something I just came across (I'd only briefly looked into it before): would the re-scheduling be done by Lazarus?
n
Re-scheduling would happen from Lazarus if the flow run heartbeat was lost for some reason; the initial scheduling happens in the scheduler service.
d
What about:
```json
{
  "severity": "ERROR",
  "name": "prefect-server.Scheduler",
  "message": "Unexpected error: ConnectError(gaierror(-2, 'Name or service not known'))"
}
```
This is from my towel service logs.
n
Yeah, that could be the issue - I'm not sure how you've deployed your server, but could there be some networking issues?
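One quick thing to check from inside the towel container is whether the other services resolve by name at all, since that gaierror is a DNS failure - something along these lines (the service names and ports are whatever your compose file uses; these are just the usual defaults):
```python
import socket

# gaierror(-2, 'Name or service not known') means DNS resolution failed,
# so check whether the towel container can resolve its peers by name.
# (Sketch - hostnames/ports depend on your docker-compose file.)
for host, port in [("hasura", 3000), ("graphql", 4201), ("apollo", 4200)]:
    try:
        socket.getaddrinfo(host, port)
        print(f"{host}:{port} resolves")
    except socket.gaierror as exc:
        print(f"{host}:{port} does NOT resolve: {exc}")
```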
d
Deployed on ECS via docker-compose, based on https://github.com/PrefectHQ/prefect/blob/master/src/prefect/cli/docker-compose.yml. I can debug it if I know what is trying to connect to where.
n
Got it - unfortunately it's really difficult for me to give much guidance on a custom docker-compose file. It looks like there's an issue with your containers not being able to communicate; most likely the scheduler service is unable to find the graphql/apollo containers.
d
Looking at docker-compose.yml... the hasura client defaults to config.hasura.graphql_url, which defaults to
graphql_url = "http://${server.hasura.host}:${server.hasura.port}/v1alpha1/graphql"
according to https://github.com/PrefectHQ/prefect/blob/master/src/prefect/config.toml.
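(As a quick sanity check of what that actually resolves to, I ran something like this inside the towel container - assuming the server.hasura keys from that config.toml are exposed on prefect.config:)
```python
import prefect

# Print the Hasura connection settings the config resolves to.
# (Sketch - assumes the server.hasura keys from prefect's config.toml;
#  the server services may read an equivalent key under a different root.)
print(prefect.config.server.hasura.host)
print(prefect.config.server.hasura.port)
print(prefect.config.server.hasura.graphql_url)
```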
My gut feeling tells me that my docker-compose towel service needs:
```yaml
links:
  - hasura
```
The above change seems to have fixed it (unless simply having restarted as part of the redeployment did the trick)... but another issue I was having also seems to be resolved now. Regardless, I am not sure that the provided docker-compose in the prefect repo is correct.
n
Hm, glad you got it fixed, but since all Prefect Server deployments run off that docker-compose file I'd be surprised if it were incorrect - feel free to open an issue though if you're able to reproduce it with a vanilla deployment (using `prefect server start`).
d
Yes, I know - Docker should find the service by name without using the link. I will try undoing it and see if it breaks again. It might also only be an issue when deployed on ECS.
n
That's entirely possible - it's also odd that it was working previously and then stopped; that lends credence to your theory that restarting is what really fixed it.
d
Still... the official docker-compose only links it to `graphql`, whereas the apollo service links to both, which is also reflected in the env variables.
Well... it schedules the first 10 by default, so I won't really know until my flows hit the 11th scheduled run.
Thanks for your help @nicholas
n
Sure thing - let me know if you see this again and we can dig further. A note on those `depends_on` lines: they dictate when a container can start, rather than creating actual links between the containers. In those cases the various container definitions depend on another container being started and healthy - so for apollo, the graphql container needs to be started in order to correctly build the apollo service.
But those health checks have also raised issues in the past (including some open ones), so I wouldn't rule them out as potential sources.