Amanda Wee

05/13/2021, 1:14 AM
My team is running Prefect Server on ECS (apollo, hasura, graphql, and towel as separate containers in a single ECS task; postgres as an Aurora cluster; Prefect UI as another ECS task), with a single local Prefect agent also on ECS. From time to time we encounter HTTP 502 errors and/or read timeout errors. We also have a flow with a large task that kept hitting heartbeat timeouts; after we broke it up, Prefect UI started reporting that the flow runs "forever" at one of the resulting smaller tasks. The errors seem pretty inconsistent, but they have started to occur a bit more frequently recently. Prefect Server itself has remained up, so I originally thought there was something finicky about the internal load balancer, yet it does work. Would anyone have any ideas about what might be going wrong?
It might be easier if I presented the significant errors for a particular case of a failed flow run in a "table":
12:28:38: Task 'get_vend_outlet[0]': Starting task run...
12:28:46: Error getting flow run info (...) requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: <http://internal-load-balancer.amazonaws.com:80/graphql>
12:28:59: Task 'get_vend_outlet[0]': Finished task run for task with final state: 'Success'
(... some successful task runs ...)
12:29:58: Task 'setup_retailer_outlet_register_dataframe[0]': Starting task run...
12:37:39: No heartbeat detected from the remote task; marking the run as failed.
12:40:38: Rescheduled by a Lazarus process. This is attempt 1.
12:40:43: Flow run is no longer in a running state; the current state is: <Scheduled: "Rescheduled by a Lazarus process.">
(... flow run restarts ...)
(... some successful task runs ...)
12:40:50: Task 'setup_retailer_outlet_register_dataframe[0]': Starting task run...
12:40:50: Task 'setup_retailer_outlet_register_dataframe[0]': Finished task run for task with final state: 'Failed'
(.. some successful task runs and also trigger failed task runs ...)
12:41:26: Flow run FAILED: some reference tasks failed.
12:41:47: Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">
12:48:43: Task 'get_vend_register_open_sequence[0]': Starting task run...
12:48:43: prefect.utilities.exceptions.ClientError: [{'message': 'State update failed for task run ID 168abb5c-a7af-4ec7-b221-81637152cf24: provided a running state but associated flow run 9d271ae3-58bf-4501-8a84-036c8e24fa6d is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 168abb5c-a7af-4ec7-b221-81637152cf24: provided a running state but associated flow run 9d271ae3-58bf-4501-8a84-036c8e24fa6d is not in a running state.'}}}]
12:48:43: Task 'get_vend_register_open_sequence[0]': Finished task run for task with final state: 'ClientFailed'
(... same error for next task ...)
(... some task runs with Finished task run for task with final state: 'Pending' ...)
12:48:45: Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">
12:48:45: Flow run RUNNING: terminal tasks are incomplete.
12:48:45: Heartbeat process died with exit code 1
(end)
Weirdly, if I look at setup_retailer_outlet_register_dataframe[0], it says it succeeded (and so does the parent task before the mapping). So I'm confused.
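
(For reference, a minimal sketch of the kind of split described above, using the Prefect 0.x functional API: one long-running task broken into mapped batches so each task run stays short enough to heartbeat within the timeout. The task and flow names are illustrative placeholders, not the actual flow.)

# split one heavy task into mapped batches so each run is short enough
# to emit heartbeats regularly
from prefect import Flow, task

@task
def make_batches():
    # e.g. chunk the retailer/outlet/register work into pieces
    return [list(range(i, i + 100)) for i in range(0, 1000, 100)]

@task
def process_batch(batch):
    # each mapped task run handles a small slice of the original work
    return sum(batch)

@task
def combine(results):
    return sum(results)

with Flow("split-large-task") as flow:
    batches = make_batches()
    partials = process_batch.map(batches)
    combine(partials)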

Kevin Kho

05/13/2021, 2:48 PM
I'll bring this up with the team, @Amanda Wee.

merlin

05/13/2021, 4:17 PM
If I were in this situation, I would try restarting the server instance. I'd like to know if there are reasons not to -- as I understand it, the postgres db maintains the state and history, and the rest of the architecture is replaceable. As long as the postgres db is storing backup snapshots, the system should be resilient to crashes, is that right?
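
(One way to check that assumption before restarting, sketched with boto3; the cluster identifier below is a placeholder, not the actual Aurora resource name.)

# confirm the Aurora cluster backing Prefect Server has recent automated
# snapshots before cycling the server containers
import boto3

rds = boto3.client("rds")
resp = rds.describe_db_cluster_snapshots(
    DBClusterIdentifier="prefect-server-db",  # placeholder identifier
    SnapshotType="automated",
)
for snap in resp["DBClusterSnapshots"]:
    print(snap["DBClusterSnapshotIdentifier"], snap["SnapshotCreateTime"])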

Amanda Wee

05/20/2021, 2:20 AM
Yeah, I might try that next, @merlin. My team hasn't needed to run that flow for now, but we'll need to run it again closer to the end of the month. @Kevin Kho any ideas from the team? Thanks!

Kevin Kho

05/20/2021, 2:24 AM
Yeah, I did bring this up. It really just sounds resource-constrained; if you can provide a reproducible example with a simple setup, we can look into it further. Otherwise, though, this goes beyond the architecture maturity we can support.
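
(A minimal reproducer along those lines might look like the sketch below: a single long-sleeping task registered against the same Prefect Server and run through the ECS agent, to see whether the heartbeat/502 behaviour recurs without the real workload. The project and flow names are placeholders.)

import time
from prefect import Flow, task

@task
def long_running(minutes):
    # long enough to exercise heartbeats and any load-balancer idle timeouts
    time.sleep(minutes * 60)
    return minutes

with Flow("heartbeat-repro") as flow:
    long_running(15)

# register against the existing Prefect Server, then trigger a run from the UI or CLI
flow.register(project_name="debugging")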

Amanda Wee

05/20/2021, 3:57 AM
Ah, tragic but to be expected. I should have another chance to look into this next week, so hopefully I'll have something more reproducible to share... or maybe I'll come up with a fix myself. About the resource-constrained bit: do you think this is on the side of the local agent/executor, or on my instance of Prefect Server? Thanks again!

Kevin Kho

05/20/2021, 12:33 PM
This seems to be on the execution side. Resource-constrained executors lead to a bunch of Dask workers hanging.
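
(If the flow runs on a Dask-based executor, one mitigation to try is capping its parallelism so task runs don't exhaust the agent container's CPU/memory and starve the heartbeat process. A sketch assuming Prefect 0.14+, where executors live under prefect.executors; the flow and task names are placeholders.)

from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def crunch(n):
    return n * n

with Flow("bounded-parallelism") as flow:
    crunch.map(list(range(20)))

# limit concurrent task runs instead of letting Dask size itself to the host
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)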