    Martin Durkac

    6 months ago
    Hi all, we have a running Prefect Server (version 0.15.4) on EC2 (8 CPU, 64 GB of RAM) with approximately 15 Docker containers with specified flows. Each container contains at least 1 running flow (max. 3). Every container has a local agent connected to the Prefect server via the --api parameter. Six of our flows run as infinite flows (when a flow finishes, a state handler starts the same flow again with the create_flow_run() function). The rest of the flows are scheduled to run only once an hour. The problem is that once every week or two, most of our infinite flows fail with the error:
    Failed to retrieve task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='apollo', port=4200): Read timed out. (read timeout=15)"))
    Sometimes the flow can continue without any problem, because we have the infinite flows set to continue even when a flow fails:
    import time
    import uuid

    from prefect.tasks.prefect import create_flow_run

    def never_ending_state_handler(obj, old_state, new_state):
        # Alert on failure
        if old_state.is_running() and new_state.is_failed():
            send_message_to_email(obj)
        # Reschedule the same flow whenever a run finishes, successful or not
        if old_state.is_running() and (new_state.is_successful() or new_state.is_failed()):
            time.sleep(5)
            create_flow_run.run(flow_name="our_flow", project_name="our_project", run_name=str(uuid.uuid4()))
        return new_state
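    The transition logic of that handler can be exercised without a running server. A minimal sketch with stand-in state objects (the real handler receives Prefect `State` instances; `FakeState` and `should_reschedule` are illustrative names, not part of the original code):

    ```python
    class FakeState:
        """Stand-in for a Prefect State, exposing only the checks the handler uses."""
        def __init__(self, running=False, failed=False, successful=False):
            self._running, self._failed, self._ok = running, failed, successful
        def is_running(self): return self._running
        def is_failed(self): return self._failed
        def is_successful(self): return self._ok

    def should_reschedule(old_state, new_state):
        # Mirrors the handler above: only a Running -> Success/Failed
        # transition triggers a new create_flow_run call.
        finished = new_state.is_successful() or new_state.is_failed()
        return old_state.is_running() and finished
    ```

    This makes explicit why a run killed by Lazarus never restarts: the flow run is marked Failed without ever passing through a Running state that the handler observes.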
    But when we receive the error:
    A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
    we are not able to continue... The flow is marked as failed, and the state handler no longer runs to reschedule the failed flow. Does anybody know what may cause the read timeout or the Lazarus problem?
    Anna Geller

    6 months ago
    Every container has a local agent
    This is quite an unusual setup - can you explain more about why you did it that way? Normally you would just spin up a Docker agent on your VM:
    prefect agent docker start --label yourlabel
    and then the Prefect agent will take care of spinning up containers for the flow runs. This could be part of the problem, especially because your Server is running within a container itself, and the networking between containers becomes quite complicated here. With such long-running jobs you may try setting this environment variable:
    from prefect.run_configs import UniversalRun
    flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
    Also, as a workaround, you could disable Lazarus for such a flow as described here. But if you want to find the root cause of the issue: do you happen to have any unclosed DB connections or other resources, like HTTP clients, that you use in your flow? I saw a similar issue occur due to resources failing to close/shut down.
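    On the unclosed-resources point: wrapping every connection in a context manager guarantees it is released even when a task raises. A minimal sketch, using sqlite3 from the standard library as a stand-in for the MongoDB/Oracle clients (`run_query` and the schema are illustrative):

    ```python
    import sqlite3
    from contextlib import closing

    def run_query(db_path):
        # closing() guarantees conn.close() runs even if a query raises;
        # a leaked connection can hold sockets open inside the container
        # and contribute to the kind of hangs/timeouts described above.
        with closing(sqlite3.connect(db_path)) as conn:
            with conn:  # commits on success, rolls back on error
                conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
                conn.execute("INSERT INTO t (x) VALUES (1)")
            count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        return count
    ```

    The same pattern applies to HTTP clients (e.g. a `requests.Session` used via `with`): acquire the resource inside the task and release it on the same code path, rather than keeping module-level connections alive for the life of the container.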
    Martin Durkac

    6 months ago
    First of all, thank you for your response. Our setup is complicated because we want to separate flows by their requirements: some flows need access to MongoDB and others need access to OracleDB. Because the Oracle library for Python only works when the Oracle Instant Client is installed, we decided to separate our flows into containers to prevent a messy environment on EC2. Those containers hold the infinite flows. We were not able to schedule them to run every minute, because sometimes a flow lasts only 10 seconds and sometimes, when more data is processed, it takes 2-10 minutes. You suggested using a Docker agent which spins up a container to run each flow. Does this solution affect overall execution time? Execution takes 20 seconds - what about booting up and destroying the container?
    Anna Geller

    6 months ago
    I don't have benchmarks on it, but it seems that using a Docker agent could make your setup less "complicated" 😄 You could, e.g., build one Docker image with MongoDB dependencies and another one with Oracle dependencies, and provide a specific image in your DockerRun run configuration:
    flow.run_config = DockerRun(image="prefect-oracle:latest")
    And as long as you use the same image, your Docker client should be able to reuse it without having to pull the image every time at runtime - I didn't benchmark this, though.
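    Putting the suggestions together, each flow would then carry its own image plus a label matching the agent, so only the Docker agent started with `--label yourlabel` picks it up. A sketch against the Prefect 0.x API (the flow, image, and label names are placeholders):

    ```python
    from prefect import Flow
    from prefect.run_configs import DockerRun

    with Flow("oracle_flow") as flow:
        ...  # tasks that need the Oracle Instant Client baked into the image

    # The agent only claims runs whose run_config carries a matching label;
    # the heartbeat env var from above can be set on the same run config.
    flow.run_config = DockerRun(
        image="prefect-oracle:latest",
        labels=["yourlabel"],
        env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    )
    ```

    This keeps the per-database dependency isolation Martin described, while letting one agent on the VM manage container lifecycles instead of 15 long-lived containers each running their own agent.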