Martin Durkac
03/14/2022, 8:38 AM
Failed to retrieve task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='apollo', port=4200): Read timed out. (read timeout=15)"))
Sometimes the flow continues without any problem, because we set it up as a never-ending flow that starts a new run even when the previous run failed:
import time
import uuid
from prefect.tasks.prefect import create_flow_run

def never_ending_state_handler(obj, old_state, new_state):
    # Notify by email when a running flow fails (send_message_to_email is our own helper).
    if old_state.is_running() and new_state.is_failed():
        send_message_to_email(obj)
    # After either outcome, wait briefly and start the next run.
    if old_state.is_running() and (new_state.is_successful() or new_state.is_failed()):
        time.sleep(5)
        create_flow_run.run(flow_name="our_flow", project_name="our_project", run_name=str(uuid.uuid4()))
    return new_state
But when we receive the error "A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed."
we are not able to continue. The flow run is marked as failed, but the state handler no longer fires to reschedule it.
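The handler's two conditions can be checked in isolation, without a Prefect backend. The following is only a minimal sketch: `FakeState`, `should_notify`, and `should_reschedule` are hypothetical stand-ins introduced here, not Prefect classes or functions, and they reproduce nothing more than the `is_running()`, `is_successful()`, and `is_failed()` checks used in the handler.

```python
from dataclasses import dataclass


@dataclass
class FakeState:
    """Hypothetical stand-in for a Prefect state (not a Prefect class)."""
    kind: str  # e.g. "scheduled", "running", "success", "failed"

    def is_running(self) -> bool:
        return self.kind == "running"

    def is_successful(self) -> bool:
        return self.kind == "success"

    def is_failed(self) -> bool:
        return self.kind == "failed"


def should_notify(old_state, new_state) -> bool:
    # Mirrors the first `if` of the handler: email only on Running -> Failed.
    return old_state.is_running() and new_state.is_failed()


def should_reschedule(old_state, new_state) -> bool:
    # Mirrors the second `if`: a new run is created only when the
    # previous state was Running and the new one is Success or Failed.
    return old_state.is_running() and (
        new_state.is_successful() or new_state.is_failed()
    )


# A normal failure of a running flow triggers both branches.
print(should_reschedule(FakeState("running"), FakeState("failed")))    # True
# A failure that does not start from Running triggers neither branch.
print(should_reschedule(FakeState("scheduled"), FakeState("failed")))  # False
```

Both conditions require the old state to be Running. If a run is marked as failed from some other state, or the state change is made outside the process that executes the handler, the reschedule branch would not run. Whether that is what happens with the Lazarus failure is an assumption, not something confirmed here.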
Does anybody know what may cause the Read timed out error or the Lazarus failure?
Anna Geller
03/14/2022, 10:36 AM
"Every container has a local agent"
This is quite an unusual setup. Can you explain more about why you did it that way? Normally you would just spin up a Docker agent on your VM:
prefect agent docker start --label yourlabel
and then the Prefect agent will take care of spinning up containers for the flow runs. Your setup could be part of the problem, especially because your Server is itself running within a container, which makes the networking between containers quite complicated.
With such long-running jobs you may try setting this env variable:
from prefect.run_configs import UniversalRun
flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
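For context on the variable name: in Prefect 1.x, environment variables that start with `PREFECT__` are read into the configuration, with each double underscore separating one level of nesting, so this one corresponds to `cloud.heartbeat_mode`. The helper below is a hypothetical illustration of that naming convention only; `env_var_to_config_path` is not part of Prefect, and it does not reproduce Prefect's actual config loader.

```python
def env_var_to_config_path(name: str, prefix: str = "PREFECT__") -> str:
    """Translate a PREFECT__SECTION__KEY variable name into a dotted config path."""
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    # Each double underscore after the prefix marks one level of nesting.
    parts = name[len(prefix):].split("__")
    return ".".join(part.lower() for part in parts)


print(env_var_to_config_path("PREFECT__CLOUD__HEARTBEAT_MODE"))  # cloud.heartbeat_mode
```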
Also, as a workaround, you could disable Lazarus for such a flow as described here.
But if you want to find the root cause of the issue:
do you happen to have any unclosed DB connections or other resources, such as HTTP clients, that you use in your flow? I saw a similar issue occurring due to resources failing to close or shut down.
Martin Durkac
03/14/2022, 12:15 PM
Anna Geller
03/14/2022, 12:36 PM
DockerRun
run configuration:
from prefect.run_configs import DockerRun
flow.run_config = DockerRun(image="prefect-oracle:latest")
And as long as you use the same image, your Docker client should be able to reuse it without having to pull the image every time at runtime - I didn't benchmark this though.