
Dilip Thiagarajan

12/29/2020, 10:02 PM
hi all, is there a reason that CloudFlowRunner and CloudTaskRunner are used when running a flow using a LocalAgent? I was expecting FlowRunner and TaskRunner to be used (the backend is also “server”)

Jim Crist-Harif

12/29/2020, 10:04 PM
The Cloud* prefix is a bit of a misnomer; those classes work with Server as well and are used during Server-orchestrated flow runs.
👍 1

Dilip Thiagarajan

12/29/2020, 10:13 PM
Thanks! I’m trying to add a task dynamically to a flow upstream of all other tasks, and it seems to run fine locally, but when running on the server I’m encountering issues. Specifically, I find that the original root tasks, which now have this setup task upstream, aren’t able to find the result of the upstream task, and accordingly throw an error. Do you know why that might happen?
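Roughly what the wiring looks like (heavily simplified - the task bodies here are placeholders, not the real flow code):

from prefect import Flow, task

@task(name="Setup ai-core")
def setup_ai_core():
    # placeholder: dependency setup happens here
    pass

@task(name="Fetch Slides")
def fetch_slides():
    # placeholder: one of the original root tasks
    pass

with Flow("Mock Train with Persistence") as flow:
    fetch_slides()

# dynamically wire the setup task upstream of every existing root task
roots = flow.root_tasks()
flow.add_task(setup_ai_core)
for root in roots:
    flow.add_edge(upstream_task=setup_ai_core, downstream_task=root)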

Jim Crist-Harif

12/29/2020, 10:16 PM
Hmmm, that's odd. No, not without further information, sorry. Can you post the output of flow.diagnostics() below?

Dilip Thiagarajan

12/29/2020, 10:18 PM
Sure:
{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__FLOWS__CHECKPOINTING",
    "PREFECT__CONTEXT__SECRETS__AWS_CREDENTIALS",
    "PREFECT__SERVER__HOST",
    "PREFECT__BACKEND"
  ],
  "flow_information": {
    "environment": false,
    "result": {
      "type": "LocalResult"
    },
    "run_config": {
      "labels": true,
      "type": "UniversalRun"
    },
    "schedule": false,
    "storage": {
      "_flows": {
        "Mock Train with Persistence": true
      },
      "_labels": false,
      "add_default_labels": true,
      "directory": true,
      "flows": {
        "Mock Train with Persistence": true
      },
      "path": false,
      "result": true,
      "secrets": false,
      "stored_as_script": false,
      "type": "Local"
    },
    "task_count": 6
  },
  "system_information": {
    "platform": "Linux-5.4.0-58-generic-x86_64-with-debian-bullseye-sid",
    "prefect_backend": "server",
    "prefect_version": "0.14.0",
    "python_version": "3.6.9"
  }
}

Jim Crist-Harif

12/29/2020, 10:23 PM
Ok, nothing odd jumping out there. Can you post the flow run logs (and exceptions) you're seeing?

Dilip Thiagarajan

12/29/2020, 10:42 PM
[2020-12-29 22:25:20,687] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-29 22:25:20,824] INFO - agent | Deploying flow run fa3ff6e6-5d91-460b-bd9f-cda45869e98b
[2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
[2020-12-29 17:25:22-0500] INFO - prefect | Launching data loading for task "Setup ai-core" in the background...
[2020-12-29 17:25:22-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Starting task run...
[2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
[2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Beginning dependency setup: "Setup ai-core"...
[2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
[2020-12-29 17:26:07-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Finished task run for task with final state: 'Success'
[2020-12-29 22:40:07,269] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Starting task run...
[2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Finished task run for task with final state: 'Failed'
the trace is:
<Failed: "Failed to retrieve task results: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'">

Jim Crist-Harif

12/29/2020, 10:53 PM
Can you provide the context around the stack trace (logs before/after)? It'd be useful to see where in the execution it occurred.

Dilip Thiagarajan

12/29/2020, 10:56 PM
Sure thing - here’s what I have logged:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/cloud/task_runner.py", line 292, in load_results
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/state.py", line 125, in load_result
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/results/local_result.py", line 84, in read
FileNotFoundError: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'

Jim Crist-Harif

12/29/2020, 11:02 PM
Thanks. Have you configured any Result objects for your tasks explicitly (it looks like you have) - if so, can you provide the configuration for those?
I'm taking off for the evening, but can continue looking into this tomorrow.
👍 1

Dilip Thiagarajan

12/29/2020, 11:13 PM
Thanks! For the new root task, I actually haven’t configured any explicit result objects, but I have for the downstream task.
result=S3Result(
    'paige-ai-flow-persistence-s3-dev1-use1',
    location="mock-train/fetch_labels.prefect",
    boto3_kwargs=boto3_kwargs
)
but I figured this shouldn’t affect anything, given that the trace shows LocalResult
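In context, that result is attached to the downstream task roughly like this (simplified - the function name and boto3_kwargs value here are placeholders):

from prefect import task
from prefect.engine.results import S3Result

boto3_kwargs = {}  # placeholder for the real boto3 credentials/config

@task(
    result=S3Result(
        'paige-ai-flow-persistence-s3-dev1-use1',
        location="mock-train/fetch_labels.prefect",
        boto3_kwargs=boto3_kwargs,
    )
)
def fetch_labels():
    # placeholder body for the downstream task
    ...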

Jim Crist-Harif

12/30/2020, 4:27 PM
Hmmm, that's interesting. You wouldn't happen to be using a DaskExecutor backed by a distributed cluster?
If you're not using a DaskExecutor, then I'm afraid I'm out of ideas - I'd need a reproducible example to continue debugging further.

Dilip Thiagarajan

12/30/2020, 4:31 PM
thanks Jim - I was actually able to resolve this by refactoring so that the setup task is handled in a state handler for the flow. One thing I’m still curious about, though: I’m finding that there’s a large delay between the end of a task and the redeployment of the flow run to complete the downstream task when running on the server:
[2020-12-30 11:00:25-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
# LARGE DELAY
[2020-12-30 16:20:11,594] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-30 16:20:11,757] INFO - agent | Deploying flow run 1db98dc5-9188-4c5f-a90f-3519532f5513
[2020-12-30 11:20:14-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
[2020-12-30 11:20:14-0500] INFO - prefect | Beginning dependency setup.
[2020-12-30 11:20:14-0500] INFO - prefect | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
[2020-12-30 11:20:26-0500] INFO - prefect | Done setting up dependencies.
[2020-12-30 11:21:05-0500] INFO - prefect.CloudFlowRunner | Flow run SUCCESS: all reference tasks succeeded
is there a good way of debugging something like this? or is this expected behavior?
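For reference, the state-handler refactor mentioned above looks roughly like this (simplified - the setup body is a placeholder):

from prefect import Flow
from prefect.engine.state import Running

def setup_dependencies(flow, old_state, new_state):
    # placeholder: run the ai-core setup when the flow run enters Running,
    # instead of modeling it as an upstream task
    if isinstance(new_state, Running):
        print("Beginning dependency setup.")
    return new_state

flow = Flow("Mock Train with Persistence", state_handlers=[setup_dependencies])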

Jim Crist-Harif

12/30/2020, 4:33 PM
What executor are you using?

Dilip Thiagarajan

12/30/2020, 4:35 PM
I’m using a local executor

Jim Crist-Harif

12/30/2020, 4:54 PM
The log before the long delay indicates that the flow run finished without all tasks completing. This usually happens if a task has retries enabled with a long retry delay: rather than having the flow sit idle while it waits to retry, Prefect will stop the flow run, then resubmit it after the retry delay. Do you happen to have retries enabled for any tasks in your flow, with a retry_delay set?
If you don't, then we'd need to see a reproducible example.
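For context, the kind of configuration that triggers this behavior looks roughly like the following (illustrative only - the task and delay here are made up):

from datetime import timedelta
from prefect import task

# a task with retries and a long retry_delay: while waiting out the delay,
# the flow run exits and is resubmitted by the agent afterwards
@task(max_retries=2, retry_delay=timedelta(minutes=20))
def flaky_task():
    ...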

Dilip Thiagarajan

12/30/2020, 4:58 PM
I see - I don’t have any retries or retry_delay set, but I did actually notice this in the logs in the UI:
Rescheduled by a Lazarus process. This is attempt 1.
And this seems to happen between each level of the DAG

Jim Crist-Harif

12/30/2020, 5:01 PM
That indicates that your flow run was restarted because it died partway through. This is usually due to infrastructure/resource issues (say, your flow exceeds a memory limit and is killed by k8s). See https://docs.prefect.io/orchestration/concepts/services.html#lazarus.
Usually rescheduling is pretty quick (~10 minutes), not 5 hours. Did your Prefect Server instance go down at any point?
Another possibility is network connectivity issues between where your flows are running (it sounds like you're using the local agent?) and Prefect Server. Small blips in connectivity are fine, but a larger connectivity issue may lead to similar behavior.

Dilip Thiagarajan

12/30/2020, 5:06 PM
thanks for the link. I think the 5-hour difference above is just different time zones being used by the agent and the server - the wait was about 20 minutes, I think. I don’t think the server instance went down at any point, but connectivity is probably worth inspecting, so I’ll look into that next and follow up as needed. Thanks so much for your help!
👍 1

Jim Crist-Harif

12/30/2020, 5:07 PM
Ah, I missed the timezone indication above. Good catch. Feel free to reach out if you continue to have issues.
👍 1