Thread
#prefect-community

    Dilip Thiagarajan

    1 year ago
    hi all, is there a reason that CloudFlowRunner and CloudTaskRunner are used when running a flow using a LocalAgent? I was expecting FlowRunner and TaskRunner to be used (the backend is also “server”)

    Jim Crist-Harif

    1 year ago
    The Cloud* prefix is a bit of a misnomer; those classes work with Server as well and are also used during Server-orchestrated flow runs.

    Dilip Thiagarajan

    1 year ago
    Thanks! I’m trying to add a task dynamically to a flow upstream of all other tasks, and it seems to run fine locally, but when running on the server I’m encountering issues. Specifically, I find that the original root tasks, which now have this setup task upstream, aren’t able to find the result of the upstream task, and accordingly throw an error. Do you know why that might happen?
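    Roughly, the pattern looks like this (a sketch, assuming an existing flow object and a hypothetical setup task):
    from prefect import task

    @task(name="Setup ai-core")
    def setup_ai_core():
        ...  # hypothetical setup logic

    # capture the current roots first, then wire the setup task upstream of each
    roots = flow.root_tasks()
    flow.add_task(setup_ai_core)
    for root in roots:
        flow.add_edge(upstream_task=setup_ai_core, downstream_task=root)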

    Jim Crist-Harif

    1 year ago
    Hmmm, that's odd. No, not without further information, sorry. Can you post the output of flow.diagnostics() below?

    Dilip Thiagarajan

    1 year ago
    Sure:
    {
      "config_overrides": {},
      "env_vars": [
        "PREFECT__FLOWS__CHECKPOINTING",
        "PREFECT__CONTEXT__SECRETS__AWS_CREDENTIALS",
        "PREFECT__SERVER__HOST",
        "PREFECT__BACKEND"
      ],
      "flow_information": {
        "environment": false,
        "result": {
          "type": "LocalResult"
        },
        "run_config": {
          "labels": true,
          "type": "UniversalRun"
        },
        "schedule": false,
        "storage": {
          "_flows": {
            "Mock Train with Persistence": true
          },
          "_labels": false,
          "add_default_labels": true,
          "directory": true,
          "flows": {
            "Mock Train with Persistence": true
          },
          "path": false,
          "result": true,
          "secrets": false,
          "stored_as_script": false,
          "type": "Local"
        },
        "task_count": 6
      },
      "system_information": {
        "platform": "Linux-5.4.0-58-generic-x86_64-with-debian-bullseye-sid",
        "prefect_backend": "server",
        "prefect_version": "0.14.0",
        "python_version": "3.6.9"
      }
    }

    Jim Crist-Harif

    1 year ago
    Ok, nothing odd jumping out there. Can you post the flow run logs (and exceptions) you're seeing?

    Dilip Thiagarajan

    1 year ago
    [2020-12-29 22:25:20,687] INFO - agent | Found 1 flow run(s) to submit for execution.
    [2020-12-29 22:25:20,824] INFO - agent | Deploying flow run fa3ff6e6-5d91-460b-bd9f-cda45869e98b
    [2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
    [2020-12-29 17:25:22-0500] INFO - prefect | Launching data loading for task "Setup ai-core" in the background...
    [2020-12-29 17:25:22-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Starting task run...
    [2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
    [2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Beginning dependency setup: "Setup ai-core"...
    [2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
    [2020-12-29 17:26:07-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Finished task run for task with final state: 'Success'
    [2020-12-29 22:40:07,269] INFO - agent | Found 1 flow run(s) to submit for execution.
    [2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Starting task run...
    [2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Finished task run for task with final state: 'Failed'
    the trace is:
    <Failed: "Failed to retrieve task results: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'">

    Jim Crist-Harif

    1 year ago
    Can you provide the context around the stack trace (logs before/after)? It'd be useful to see where in the execution it occurred.

    Dilip Thiagarajan

    1 year ago
    Sure thing - here’s what I have logged:
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/cloud/task_runner.py", line 292, in load_results
      File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/state.py", line 125, in load_result
      File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/results/local_result.py", line 84, in read
    FileNotFoundError: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'
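    (Side note: the failing read is in local_result.py; by default a LocalResult writes under ~/.prefect/results on whichever machine executes the task. A minimal sketch of pointing results at an explicit directory instead, assuming a path visible to every run:)
    from prefect.engine.results import LocalResult

    # hypothetical shared directory; the default is ~/.prefect/results on the executing machine
    result = LocalResult(dir="/shared/prefect-results")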

    Jim Crist-Harif

    1 year ago
    Thanks. Have you configured any Result objects for your tasks explicitly (it looks like you have)? If so, can you provide the configuration for those?
    I'm taking off for the evening, but can continue looking into this tomorrow.

    Dilip Thiagarajan

    1 year ago
    Thanks! For the new root task, I actually haven’t configured any explicit result objects, but I have for the downstream task:
    result=S3Result(
        'paige-ai-flow-persistence-s3-dev1-use1',
        location="mock-train/fetch_labels.prefect",
        boto3_kwargs=boto3_kwargs
    )
    but I figured this shouldn’t affect anything, given that the trace shows a LocalResult.

    Jim Crist-Harif

    1 year ago
    Hmmm, that's interesting. You wouldn't happen to be using a DaskExecutor backed by a distributed cluster?
    If you're not using a DaskExecutor, then I'm afraid I'm out of ideas; I'd need a reproducible example to continue debugging further.
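    (For context: a DaskExecutor backed by a distributed cluster runs tasks on separate workers, so a result written to one machine's local filesystem wouldn't be readable from another. A minimal sketch of attaching one, with an illustrative scheduler address:)
    from prefect.executors import DaskExecutor

    # hypothetical address of an existing distributed cluster's scheduler
    flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")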

    Dilip Thiagarajan

    1 year ago
    Thanks, Jim. I was actually able to resolve this by refactoring so that the setup task is handled in a state handler for the flow (a rough sketch follows the logs below). One thing I’m still curious about, though: when running on the server, I’m seeing a large delay between the end of a task and the deployment of a new flow run to complete the downstream task:
    [2020-12-30 11:00:25-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
    # LARGE DELAY
    [2020-12-30 16:20:11,594] INFO - agent | Found 1 flow run(s) to submit for execution.
    [2020-12-30 16:20:11,757] INFO - agent | Deploying flow run 1db98dc5-9188-4c5f-a90f-3519532f5513
    [2020-12-30 11:20:14-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
    [2020-12-30 11:20:14-0500] INFO - prefect | Beginning dependency setup.
    [2020-12-30 11:20:14-0500] INFO - prefect | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
    [2020-12-30 11:20:26-0500] INFO - prefect | Done setting up dependencies.
    [2020-12-30 11:21:05-0500] INFO - prefect.CloudFlowRunner | Flow run SUCCESS: all reference tasks succeeded
    Is there a good way of debugging something like this, or is this expected behavior?
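    For reference, the state-handler refactor mentioned above looks roughly like this (a sketch with a hypothetical setup helper):
    from prefect import Flow

    def setup_dependencies(flow, old_state, new_state):
        # hypothetical: run the ai-core setup when the flow run starts
        if new_state.is_running():
            run_ai_core_setup()  # hypothetical helper
        return new_state

    flow = Flow("Mock Train with Persistence", state_handlers=[setup_dependencies])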

    Jim Crist-Harif

    1 year ago
    What executor are you using?

    Dilip Thiagarajan

    1 year ago
    I’m using a local executor

    Jim Crist-Harif

    1 year ago
    The log before the long delay indicates that the flow finished without all tasks completing. This usually happens if a task has retries enabled and there's a long retry delay between retries. Rather than having the flow sit idle while it waits to retry, Prefect will stop the flow run, then resubmit it after the retry delay. Do you happen to have retries enabled for any tasks in your flow, with a retry_delay set?
    If you don't, then we'd need to see a reproducible example.
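    The pattern described above, as a minimal sketch with an illustrative task:
    from datetime import timedelta
    from prefect import task

    # with a retry_delay set, Prefect stops the flow run while it waits
    # and resubmits it after the delay rather than sitting idle
    @task(max_retries=3, retry_delay=timedelta(minutes=10))
    def some_task():
        ...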

    Dilip Thiagarajan

    1 year ago
    I see - I don’t have any retries or retry_delay set, but I did actually notice this in the logs in the UI:
    Rescheduled by a Lazarus process. This is attempt 1.
    And this seems to happen between each level of the DAG

    Jim Crist-Harif

    1 year ago
    That indicates that your flow run was restarted after dying partway through. This is usually due to some infrastructure/resource issue (say, your flow exceeds a memory limit and is killed by k8s). See https://docs.prefect.io/orchestration/concepts/services.html#lazarus.
    Usually rescheduling is pretty quick though (~10 min), not 5 hours. Did your Prefect Server instance go down at any point?
    Another possibility is network connectivity issues between where your flows are running (it sounds like you're using the local agent?) and Prefect Server. Small blips in connectivity are fine, but a larger connectivity issue may lead to similar behavior.

    Dilip Thiagarajan

    1 year ago
    Thanks for the link. I think the 5-hour difference above is just different time zones being used by the agent and the server; the wait was about 20 minutes, I think. I don’t think the server instance went down at any point, but connectivity is probably worth inspecting, so I’ll look into that next and follow up as needed. Thanks so much for your help!

    Jim Crist-Harif

    1 year ago
    Ah, I missed the timezone indication above. Good catch. Feel free to reach out if you continue to have issues.