hi all, is there a reason that `CloudFlowRunner` a...
# ask-community
d
hi all, is there a reason that `CloudFlowRunner` and `CloudTaskRunner` are used when running a flow using a `LocalAgent`? I was expecting `FlowRunner` and `TaskRunner` to be used (the backend is also “server”)
j
The `Cloud*` prefix is a bit of a misnomer; those classes also work with Server and are used during Server-orchestrated flow runs.
👍 1
d
Thanks! I’m trying to add a task dynamically to a flow upstream of all other tasks, and it seems to run fine locally, but when running on the server I’m encountering issues. Specifically, I find that the original root tasks, which now have this setup task upstream, aren’t able to find the result of the upstream task, and accordingly throw an error. Do you know why that might happen?
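(A minimal sketch of the pattern described above, assuming Prefect 0.14's imperative flow API; the task names and bodies are illustrative, not the user's actual code:)
```python
from prefect import Flow, task

@task(name="Setup ai-core")
def setup_ai_core():
    ...  # e.g. clone/install dependencies

with Flow("Mock Train with Persistence") as flow:
    ...  # the existing tasks

# Snapshot the current root tasks before adding the new one, then make
# every original root task depend on the setup task.
roots = list(flow.root_tasks())
flow.add_task(setup_ai_core)
for root in roots:
    root.set_upstream(setup_ai_core, flow=flow)
```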
j
Hmmm, that's odd. No, not without further information, sorry. Can you post the output of `flow.diagnostics()` below?
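(For context, a minimal usage sketch; the flow here stands in for the one defined in the user's script:)
```python
from prefect import Flow

flow = Flow("Mock Train with Persistence")
# diagnostics() returns a JSON string summarizing storage, run_config,
# result configuration, and Prefect-related environment variables.
print(flow.diagnostics())
```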
d
Sure:
```
{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__FLOWS__CHECKPOINTING",
    "PREFECT__CONTEXT__SECRETS__AWS_CREDENTIALS",
    "PREFECT__SERVER__HOST",
    "PREFECT__BACKEND"
  ],
  "flow_information": {
    "environment": false,
    "result": {
      "type": "LocalResult"
    },
    "run_config": {
      "labels": true,
      "type": "UniversalRun"
    },
    "schedule": false,
    "storage": {
      "_flows": {
        "Mock Train with Persistence": true
      },
      "_labels": false,
      "add_default_labels": true,
      "directory": true,
      "flows": {
        "Mock Train with Persistence": true
      },
      "path": false,
      "result": true,
      "secrets": false,
      "stored_as_script": false,
      "type": "Local"
    },
    "task_count": 6
  },
  "system_information": {
    "platform": "Linux-5.4.0-58-generic-x86_64-with-debian-bullseye-sid",
    "prefect_backend": "server",
    "prefect_version": "0.14.0",
    "python_version": "3.6.9"
  }
}
```
j
Ok, nothing odd jumping out there. Can you post the flow run logs (and exceptions) you're seeing?
d
```
[2020-12-29 22:25:20,687] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-29 22:25:20,824] INFO - agent | Deploying flow run fa3ff6e6-5d91-460b-bd9f-cda45869e98b
[2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
[2020-12-29 17:25:22-0500] INFO - prefect | Launching data loading for task "Setup ai-core" in the background...
[2020-12-29 17:25:22-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Starting task run...
[2020-12-29 17:25:22-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
[2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Beginning dependency setup: "Setup ai-core"...
[2020-12-29 17:25:23-0500] INFO - prefect.Setup ai-core | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
[2020-12-29 17:26:07-0500] INFO - prefect.CloudTaskRunner | Task 'Setup ai-core': Finished task run for task with final state: 'Success'
[2020-12-29 22:40:07,269] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Starting task run...
[2020-12-29 11:20:27-0500] INFO - prefect.CloudTaskRunner | Task 'Fetch Slides': Finished task run for task with final state: 'Failed'
```
the trace is:
```
<Failed: "Failed to retrieve task results: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'">
```
j
Can you provide the context around the stack trace (logs before/after)? It'd be useful to see where in the execution it occurred.
d
Sure thing - here’s what I have logged:
```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/cloud/task_runner.py", line 292, in load_results
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/state.py", line 125, in load_result
  File "/home/aicompute/.local/lib/python3.6/site-packages/prefect/engine/results/local_result.py", line 84, in read
FileNotFoundError: [Errno 2] No such file or directory: '/home/dilip.thiagarajan/.prefect/results/prefect-result-2020-12-29t16-05-36-905299-00-00'
```
j
Thanks. Have you configured any `Result` objects for your tasks explicitly (it looks like you have)? If so, can you provide the configuration for those?
I'm taking off for the evening, but can continue looking into this tomorrow.
👍 1
d
Thanks! For the new root task, I actually haven’t configured any explicit result objects, but I have for the downstream task:
```python
result=S3Result(
    'paige-ai-flow-persistence-s3-dev1-use1',
    location="mock-train/fetch_labels.prefect",
    boto3_kwargs=boto3_kwargs
)
```
but I figured this shouldn’t affect anything, given that the trace shows a `LocalResult`
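(For reference, a hedged sketch of how such an explicit result is typically attached to a task in Prefect 0.14; the task name and body are illustrative:)
```python
from prefect import task
from prefect.engine.results import S3Result

# Illustrative only: the downstream task gets an explicit S3Result, while a
# task with no explicit result falls back to the flow-level result (here a
# LocalResult, whose files live on whichever machine wrote them).
@task(
    name="Fetch Labels",
    result=S3Result(
        bucket="paige-ai-flow-persistence-s3-dev1-use1",
        location="mock-train/fetch_labels.prefect",
    ),
)
def fetch_labels():
    ...
```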
j
Hmmm, that's interesting. You wouldn't happen to be using a `DaskExecutor` backed by a distributed cluster?
If you're not using a `DaskExecutor`, then I'm afraid I'm out of ideas - I'd need a reproducible example to continue debugging further.
d
thanks Jim - I was actually able to resolve this by refactoring so that the setup task is handled in a state handler for the flow (a sketch of that pattern follows the log below). One thing I’m still curious about, though - I’m finding that there’s a large delay between the end of a task and the deployment of the flow run to complete the downstream task when running on the server:
```
[2020-12-30 11:00:25-0500] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.
# LARGE DELAY
[2020-12-30 16:20:11,594] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-12-30 16:20:11,757] INFO - agent | Deploying flow run 1db98dc5-9188-4c5f-a90f-3519532f5513
[2020-12-30 11:20:14-0500] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Mock Train with Persistence'
[2020-12-30 11:20:14-0500] INFO - prefect | Beginning dependency setup.
[2020-12-30 11:20:14-0500] INFO - prefect | Commit hash for ai-core setup: f7b8552705e9eba33c62b4e11d42e7806631771d
[2020-12-30 11:20:26-0500] INFO - prefect | Done setting up dependencies.
[2020-12-30 11:21:05-0500] INFO - prefect.CloudFlowRunner | Flow run SUCCESS: all reference tasks succeeded
```
is there a good way of debugging something like this? or is this expected behavior?
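(A hedged sketch of the state-handler refactor mentioned above, assuming Prefect 0.14's flow state-handler signature; the setup function is a stand-in:)
```python
from typing import Optional

from prefect import Flow
from prefect.engine.state import Running, State

def setup_ai_core():
    # Hypothetical stand-in for the actual dependency setup (clone, install, etc.)
    print("setting up ai-core dependencies...")

def setup_handler(flow: Flow, old_state: State, new_state: State) -> Optional[State]:
    # Run the setup once, when the flow run transitions into Running.
    if isinstance(new_state, Running) and not isinstance(old_state, Running):
        setup_ai_core()
    return new_state

flow = Flow("Mock Train with Persistence", state_handlers=[setup_handler])
```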
j
What executor are you using?
d
I’m using a local executor
j
The log before the long delay indicates that the flow finished without all tasks completing. This usually happens if a task has retries enabled with a long delay between retries. Rather than having the flow sit idle while it waits to retry, Prefect will stop the flow run, then resubmit it after the retry delay. Do you happen to have retries enabled for any tasks in your flow, with a `retry_delay` set?
If you don't, then we'd need to see a reproducible example.
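(For reference, a minimal sketch of the retry configuration being asked about, using Prefect 0.14's task API; the task itself is illustrative:)
```python
from datetime import timedelta

from prefect import task

# A task configured like this produces the behavior described above: the flow
# run exits while waiting out the retry_delay, and the agent resubmits it
# once the retry is due.
@task(max_retries=3, retry_delay=timedelta(minutes=20))
def fetch_slides():
    ...
```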
d
I see - I don’t have any retries or `retry_delay` set, but I did actually notice this in the logs in the UI:
```
Rescheduled by a Lazarus process. This is attempt 1.
```
And this seems to happen between each level of the DAG
j
That indicates that your flow run was restarted due to it dying partway through. This is usually due to some infrastructure/resource issue (say your flow exceeds a memory limit and is killed by k8s). See https://docs.prefect.io/orchestration/concepts/services.html#lazarus.
Usually rescheduling is pretty quick though (~10 min), not 5 hours. Did your prefect server instance go down at any point?
Another possibility is network connectivity issues between where your flows are running (it sounds like you're using the local agent?) and prefect server. Small blips in connectivity are fine, but a larger connectivity issue may lead to similar behavior.
d
thanks for the link. I think the 5 hour difference above is just different time zones being used for the agent and the server - the wait was about 20 min, I think. I don’t think the server instance went down at any point, but I think connectivity is probably good to inspect, so I will look into that next and follow up as needed. Thanks so much for your help!
👍 1
j
Ah, I missed the timezone indication above. Good catch. Feel free to reach out if you continue to have issues.
👍 1