Hello We had four deployments fail last night with the same Prefect Community #ask-community

Hello! We had four deployments fail last night wit...

Ilya Galperin

01/06/2023, 6:53 PM

Hello! We had four deployments fail last night with the same error a few hours into their runs (see below). All four deployments were long-running (2+ hours) and had a large number of tasks (3.5k, 3.4k, 1.1k, and 675 respectively). These deployments were part of two different flows with two very different sets of code. One of these flows is running Prefect 2.7.5, the other is running 2.6.7. Our agent is running 2.7.5 and we are using Prefect Cloud. Any insights as to what may have happened and how to fix this issue?

prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

(full traceback in thread)

Ilya Galperin

01/06/2023, 6:54 PM

Copy code

Encountered exception during execution:
Traceback (most recent call last):
 File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
  result = await run_sync(flow_call)
 File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
  return await anyio.to_thread.run_sync(
 File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
  return await get_asynclib().run_sync_in_worker_thread(
 File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
  return await future
 File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
  result = context.run(func, *args)
 File "/opt/prefect/flow.py", line 160, in entrypoint
  results.append(future.result())
 File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 226, in result
  return sync(
 File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 267, in sync
  return run_async_from_worker_thread(__async_fn, *args, **kwargs)
 File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
  return anyio.from_thread.run(call)
 File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
  return asynclib.run_async_from_thread(func, *args)
 File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
  return f.result()
 File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
  return self.__get_result()
 File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
  raise self._exception
 File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
  return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
 File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 100, in _get_state_result
  raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

Kalise Richmond

01/06/2023, 7:33 PM

Hi @Ilya Galperin, There was a bug in 2.7.5 that causes this error. I recommend updating immediately to 2.7.6.

Ilya Galperin

01/06/2023, 7:45 PM

Hi @Kalise Richmond - this is only happening intermittently on flows and we are not using Prefect Orion using SQLite - we are on Prefect Cloud (adding clarification to my original message). It doesn’t seem like the bug linked to 2.7.5 is related?

🤔 1

Zanie

01/06/2023, 7:55 PM

This error occurs after the flow run finishes?

Zanie

01/06/2023, 7:56 PM

These deployments usually run without problems?

Ilya Galperin

01/06/2023, 8:02 PM

It looks like they were surfaced at the end of the flow runs. One of flows (flow a) (a ~2.5 hour one) usually runs without problems, and the other flow (flow b) with 3 deployments is expected to run >4 hours per deployment. Flow b runs fine on other deployments (using the same code) when it has shorter runtimes — we’re re-trying these long-running deployments now.

Ilya Galperin

01/06/2023, 8:03 PM

For “flow a” though, it is using dask and the flow will fail in the middle of the flow before it reaches its final set of tasks. After all the dask tasks finish during which the error happened, the flow re-surfaces the error and doesn’t trigger any remaining downstream tasks.

Zanie

01/06/2023, 8:04 PM

Can you share timestamps for the failures? Are you using result persistence?

Ilya Galperin

01/06/2023, 8:06 PM

For flow b: 110600 PM 110545 PM 110650 PM Flow a: 105127

Ilya Galperin

01/06/2023, 8:06 PM

I don’t think we’re persisting any results and we do not have retries enabled.

Ilya Galperin

01/06/2023, 8:07 PM

Unless result persistence is enabled somewhere by default that I’m not sure of?

Zanie

01/06/2023, 8:07 PM

It’s not often automatically enabled at the flow level

Zanie

01/06/2023, 8:08 PM

What timezone are those?

Ilya Galperin

01/06/2023, 8:08 PM

Sorry these are PST from last night

Ilya Galperin

01/06/2023, 8:08 PM

Jan 5th

Zanie

01/06/2023, 8:08 PM

🙂 thanks

Ilya Galperin

01/06/2023, 8:08 PM

Happy to give you run IDs if that helps at all as well

Zanie

01/06/2023, 8:09 PM

Yeah that’d be helpful for identifiying if it’s a server-sie issue

Zanie

01/06/2023, 8:09 PM

I suspect this is a client-side bug though, and it’s going to be hard to find without a minimal example that reproduces the problem.

Ilya Galperin

01/06/2023, 8:09 PM

Copy code

bad5b6dd-8f0f-428c-a045-e95cf72da4cd
82b0ca57-1501-47a2-91c8-ed621f184e82
646d0c66-320a-4435-9fb2-e0f4d03bb547
012ae57c-9296-46d0-af83-16eafc1218e0

Zanie

01/06/2023, 8:11 PM

Thanks! I’ll forward this to a Cloud engineer and we’ll rule out server errors.

Zanie

01/06/2023, 8:12 PM

Can you explain more about what’s going on here?

Copy code

File "/opt/prefect/flow.py", line 160, in entrypoint
  results.append(future.result())

Ilya Galperin

01/06/2023, 8:14 PM

Sure, this is doing something like…

Copy code

futures = []
    for e, x in enumerate(mylist):
        futures.append(
            sometask.with_options(name=task_name).submit(
                x=x,
                y=y,
                z=z,
            )
        )
    results = []
    for future in futures:
        results.append(future.result())

Ilya Galperin

01/06/2023, 8:15 PM

So it is submitting a number of tasks to dask, then waiting for them to complete

Ilya Galperin

01/06/2023, 8:15 PM

Before moving to the next step

Ilya Galperin

01/06/2023, 8:15 PM

This is only happening in flow B, flow A is not using Dask (the traceback might actually be slightly different let me check…)

Ilya Galperin

01/06/2023, 8:16 PM

Here is the traceback for flow A — different line in flow.py

Copy code

Encountered exception during execution:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/prefect/flow.py", line 228, in entrypoint
    temp_table_created = create_temp_table(
  File "/usr/local/lib/python3.10/site-packages/prefect/tasks.py", line 436, in __call__
    return enter_task_run_engine(
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 927, in enter_task_run_engine
    return run_async_from_worker_thread(begin_run)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
    return await future._result()
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 100, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

Ilya Galperin

01/06/2023, 8:18 PM

And another run in Flow A - also a different line in flow.py

Copy code

Encountered exception during execution:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/prefect/flow.py", line 279, in entrypoint
    cols, _ = get_table_schema(
  File "/usr/local/lib/python3.10/site-packages/prefect/tasks.py", line 436, in __call__
    return enter_task_run_engine(
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 927, in enter_task_run_engine
    return run_async_from_worker_thread(begin_run)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
    return await future._result()
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 100, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

Zanie

01/06/2023, 8:19 PM

Do you have in-memory storage of task results disabled?

Ilya Galperin

01/06/2023, 8:20 PM

Hmm I am not sure - how do I check that?

Zanie

01/06/2023, 8:20 PM

It’d be

Copy code

cache_result_in_memory=False

Ilya Galperin

01/06/2023, 8:20 PM

I wouldn’t think so since we’re definitely storing some results in memory between tasks

Zanie

01/06/2023, 8:20 PM

Are you using cache keys?

Ilya Galperin

01/06/2023, 8:20 PM

We are not

Zanie

01/06/2023, 8:20 PM

😄

Zanie

01/06/2023, 8:21 PM

So I suspect the problem is that the results are being dropped from memory for some reason. Perhaps on the way back from a Dask worker.

Ilya Galperin

01/06/2023, 8:22 PM

Maybe - only one of the flows is using Dask though, 3 of the failures are just running the SequentialTaskRunner

Ilya Galperin

01/06/2023, 8:23 PM

Not sure if it’s relevant but the Dask one has never failed (with this error) before last night when we had all of these deployments running simultaneously - could it maybe have something to do with high load against the agent or API?

Ilya Galperin

01/06/2023, 8:24 PM

All of these deployments are running on different k8s pods as well (so 4 pods total between these 4 runs)

Zanie

01/06/2023, 8:25 PM

Is that latest traceback from the sequential task runner?

Ilya Galperin

01/06/2023, 8:25 PM

Ya, the first log was from the flow running DaskTaskRunner

Ilya Galperin

01/06/2023, 8:25 PM

The later 2 are from the default task runner (not defined in the flow) which I believe is just sequential

Zanie

01/06/2023, 8:25 PM

If you’re using

.submit

the default is concurrent.

Zanie

01/06/2023, 8:26 PM

Or are you just calling the tasks?

Ilya Galperin

01/06/2023, 8:26 PM

The first one we have explicitly called dasktaskrunner

Ilya Galperin

01/06/2023, 8:26 PM

the other flow doens’t use .submit

Ilya Galperin

01/06/2023, 8:26 PM

it just calls them directly

Zanie

01/06/2023, 8:26 PM

🤯 okay that’s really weird

Zanie

01/06/2023, 8:26 PM

Are your dependencies pinned or could a version of a package changed?

Zanie

01/06/2023, 8:27 PM

i.e. could pydantic have changed versions or something?

Ilya Galperin

01/06/2023, 8:30 PM

They are all pinned

Zanie

01/06/2023, 8:33 PM

I’m out of questions, I think. This is very peculiar.

Zanie

01/06/2023, 8:34 PM

If you can create a minimal example that reproduces it, we can debug further but I’m not sure what would cause this

Ilya Galperin

01/06/2023, 8:35 PM

Hmm, I can try but it might be tough. This behavior only seems to be happening when we’re running thousands of tasks simultaneously — possibly across flows in a long-running manner.

Ilya Galperin

01/06/2023, 8:35 PM

Does it sound like there were no issues with the API/cloud services around this time last night?

Zanie

01/06/2023, 8:36 PM

We don’t see anything suspicious

Zanie

01/06/2023, 8:42 PM

Can you take a look at these three task runs?

Zanie

01/06/2023, 8:42 PM

Copy code

9bf44e45-bcc6-413b-a67e-0ee8a1b06c77 at 02:06:30.633 EST
0424b8f8-06ff-4f78-808d-13543167632f at 02:06:00.680 EST
3ae5277e-61f1-4a10-9ff3-cf89af740aa7 at 02:05:45.138 EST

Zanie

01/06/2023, 8:43 PM

• Are they in the affected flow runs? • What state transitions did they go through? • Are these the task runs your sequential flows crashed on?

Ilya Galperin

01/06/2023, 8:43 PM

Let me check

Ilya Galperin

01/06/2023, 8:46 PM

Yep, all of them are in the affected flow runs

Ilya Galperin

01/06/2023, 8:46 PM

These are all the sequential flows and all the tasks are still in pending

Ilya Galperin

01/06/2023, 8:46 PM

The actual error isn’t surfaced on the task level though, it’s attached to the flow run

Ilya Galperin

01/06/2023, 8:46 PM

But it seems like around the time the flow crashed yes

Ilya Galperin

01/06/2023, 8:47 PM

Let me confirm but I think the tracebacks I posted above those were the tasks called in flow.py at the time..

Ilya Galperin

01/06/2023, 8:47 PM

Yep, confirming that

Zanie

01/06/2023, 8:48 PM

Suspicious 🙂

Zanie

01/06/2023, 8:48 PM

Well, if these are all in PENDING states that explains the error! A pending state cannot have a result attached to it.

Zanie

01/06/2023, 8:48 PM

Are you passing data from upstream tasks into these?

Ilya Galperin

01/06/2023, 8:50 PM

Let me see.. all 3 of these are different tasks in the same flow 😉

Ilya Galperin

01/06/2023, 8:53 PM

It looks like 2/3 are. The one with id

3ae5277e-61f1-4a10-9ff3-cf89af740aa7

only uses values that are passed in as arguments to the flow function.

Zanie

01/06/2023, 8:54 PM

Hm okay. Are there logs for these tasks runs? Presuming there aren’t DEBUG level logs 😢

Ilya Galperin

01/06/2023, 8:54 PM

The other 2 are using values that are captured from a previous task then stored in a var in flow.py

Ilya Galperin

01/06/2023, 8:55 PM

Here’s all the logs I see..

Copy code

3ae5277e-61f1-4a10-9ff3-cf89af740aa7
Task run '3ae5277e-61f1-4a10-9ff3-cf89af740aa7' already finished.

9bf44e45-bcc6-413b-a67e-0ee8a1b06c77
Task run '9bf44e45-bcc6-413b-a67e-0ee8a1b06c77' already finished.

9bf44e45-bcc6-413b-a67e-0ee8a1b06c77
Task run '9bf44e45-bcc6-413b-a67e-0ee8a1b06c77' already finished.

Ilya Galperin

01/06/2023, 8:55 PM

All INFO level

Zanie

01/06/2023, 8:56 PM

https://github.com/PrefectHQ/prefect/blob/main/src/prefect/engine.py#L1319-L1323

Ilya Galperin

01/06/2023, 8:56 PM

Not sure if it matters but the Dask flow does not have any tasks still remaining in pending state (or failed or crashed for that matter)

Zanie

01/06/2023, 8:57 PM

It looks like the server sent back an ABORT instruction here

Ilya Galperin

01/06/2023, 8:57 PM

Ah interesting..

Zanie

01/06/2023, 8:59 PM

That generally means the server thought the task was already being executed elsewhere

Zanie

01/06/2023, 9:01 PM

Unfortunately I’m not sure we can recover the “reason” attached to the abort response.

Ilya Galperin

01/06/2023, 9:05 PM

Hmm

Zanie

01/06/2023, 9:06 PM

I can add logs so we get more information in the future

🙏 1

Ilya Galperin

01/06/2023, 9:06 PM

How do you recommend we proceed for now?

Zanie

01/06/2023, 9:07 PM

There’s not much to do in the meantime. You can use

return_state=True

on the task call and just like.. call it again if you happen to get a pending state back, but that’s pretty hacky.

Ilya Galperin

01/06/2023, 9:08 PM

Hmm ya, I don’t know if that would solve the issue with the Dask flow though? That one is already returning just futures

Ilya Galperin

01/06/2023, 9:08 PM

And there is no task stuck in pending there 🤷

Zanie

01/06/2023, 9:10 PM

Yeah I’m not sure what’s going on there yet either. I’ll add more logs in the next release and then I’d recommend running this with DEBUG logs.

Ilya Galperin

01/06/2023, 9:10 PM

Will do, thanks MIchael.

Zanie

01/06/2023, 9:10 PM

I think once we get a better picture of what’s going on it’ll be an easy enough fix

Ilya Galperin

01/06/2023, 9:10 PM

Do we need to do anything to enable another level of debug logs? I thought cloud captures debug level logs too

Ilya Galperin

01/06/2023, 9:11 PM

Or do we need to set some env var on the flow container?

Zanie

01/06/2023, 9:11 PM

PREFECT_LOGGING_LEVEL=DEBUG

— we only capture what you set the level to

Ilya Galperin

01/06/2023, 9:11 PM

Our Prefect logger debug level logs are captured

Zanie

01/06/2023, 9:11 PM

Oh okay yeah there just aren’t debug level logs in this spot yet.

Ilya Galperin

01/06/2023, 9:11 PM

Gotcha

Ilya Galperin

01/06/2023, 9:12 PM

Oh duh I set that env var in the docker image for these flows 😛

Ilya Galperin

01/06/2023, 9:12 PM

Will keep up with the next release, thanks again

Ilya Galperin

01/06/2023, 9:12 PM

I’ll let you know if we’re able to reproduce

Ilya Galperin

01/06/2023, 9:14 PM

Thanks again for all the help troubleshooting this

🙏 1

Zanie

01/06/2023, 9:14 PM

Thank you too!

🙏 1

Zanie

01/06/2023, 9:15 PM

I’m sure you’ve caught a currently obscure bug that will become more common as people scale up, I’m eager to see what’s going on here

✅ 1

Ilya Galperin

01/06/2023, 9:15 PM

Me too 🙂

Zanie

01/06/2023, 9:16 PM

btw the next release will be next Thurdsay, I’m sure I’ll have a PR up before then if you want to test beforehand

Zanie

01/06/2023, 9:16 PM

You may want to look into getting up to the latest release in the meantime

Ilya Galperin

01/06/2023, 9:17 PM

Noted, thanks! We’ll upgrade ASAP.

Ilya Galperin

01/06/2023, 11:11 PM

A quick follow-up @Michael - we just experienced the same error/failure with the same 3 deployments from the Sequential task runner flow that failed last night, so it does look like the issue is reproducible. We’re going to try staggering these runs and see if the problem persists.

Zanie

01/06/2023, 11:22 PM

Here’s a patch with better logs https://github.com/PrefectHQ/prefect/pull/8097

👍 1

Ilya Galperin

01/06/2023, 11:34 PM

Thank you! We’ll try to install this PR on the container running the flow and see what we catch.

Zanie

01/06/2023, 11:41 PM

Should be doable with the

EXTRA_PIP_PACKAGES

variable if you’re using our image entrypoint e.g.

Copy code

docker run --env EXTRA_PIP_PACKAGES="git+<https://github.com/PrefectHQ/prefect@task-run-abort-log>" prefecthq/prefect:2-latest prefect version

🙏 1

Ilya Galperin

01/10/2023, 10:34 PM

Hi @Michael - an update from our end. We’ve upgraded to 2.7.7 and have done some more testing. 1. The number of flows we have running in parallel at the time doesn’t seem to matter, we get the same crashing behavior when running just one instance of this flow. 2. Flows are virtually always crashing at almost exactly the 4 hour mark. 3. We installed the logging PR you created and here is the full output for the pending task:

Copy code

Task run '8775d5ff-1614-4df3-83bd-dc3b3fc4bb0f' received abort during orchestration: The enclosing flow must be running to begin task execution.. Task run is in PENDING state.

Zanie

01/10/2023, 10:40 PM

Ah interesting. This task run failure might be misleading here. If the flow run is marked as failed, we won’t let task runs move into a running state.

Zanie

01/10/2023, 10:40 PM

(as that abort reason suggests)

Zanie

01/10/2023, 10:41 PM

So right now we have a missing result error which is caused by a pending task run which is caused by an aborted transition which is caused by a flow run that is not in a running state anymore.

Zanie

01/10/2023, 10:42 PM

Does the flow run itself have any relevant logs?

Alix Cook

01/10/2023, 11:18 PM

we're also seeing similar weirdness in some of our flows, though we're seeing it in flows that only run for a couple seconds. all the tasks in our flow complete successfully, then we see the same errorr:

Copy code

refect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

we're on prefect 2.7.1. we use prefect cloud we don't have any retries setup, and we use whatever the default persistence is (the docs say that it should be enabled). The first time we saw this error was 2022-12-15 163948 UTC

Alix Cook

01/10/2023, 11:19 PM

we've only noticed in the last couple days because we have this set of jobs that run monthly that we have failure notifications set up for

Alix Cook

01/10/2023, 11:22 PM

its not impacting our data flows but our on call has to look at every failure to ensure that the job actually did what it was supposed to do, which is a bit cumbersome when you're running thousands of these flows a day

Andrew Huang

01/10/2023, 11:24 PM

I just encountered this; here’s a minimal example with prefect-dask

Copy code

import dask.dataframe
import dask.distributed
from prefect import flow, task
from prefect_dask import DaskTaskRunner, get_dask_client

client = dask.distributed.Client()

@task
def read_data(start: str, end: str) -> dask.dataframe.DataFrame:
    df = dask.datasets.timeseries(start, end, partition_freq="4w")
    return df

@task
def process_data(df) -> dask.dataframe.DataFrame:
    df_yearly_avg = df.groupby(df.index.year).mean()
    return df_yearly_avg.compute()

@flow(task_runner=DaskTaskRunner(address=client.scheduler.address))
def dask_flow():
    df = read_data.submit("1988", "2022")
    df_yearly_average = process_data.submit(df)
    return df_yearly_average

dask_flow()

in this case, it’s because the memory maxed out on a single worker wrapping with get_dask_client in my case allows it to run successfully

Copy code

@task
def process_data(df) -> dask.dataframe.DataFrame:
    with get_dask_client():
        df_yearly_avg = df.groupby(df.index.year).mean()
        return df_yearly_avg.compute()

Zanie

01/10/2023, 11:27 PM

@Alix Cook are you using Dask as well? @Andrew Huang since we have reproductions without Dask, I’m hesitant to go down the Dask rabbit hole. Especially OOM on a worker is always going to cause some sort of ungraceful shutdown.

👍 1

Alix Cook

01/10/2023, 11:27 PM

no, our flows are running as local processes

Ilya Galperin

01/10/2023, 11:28 PM

Hi Michael - no additional logging on the flow level unfortunately. Id o notice the State Message attached to the flow is :

Flow run infrastructure exited with non-zero status code -1.

Since this is almost always crashing at the 4 hour mark, it makes me suspicious that there is some timeout we’re missing. However, our other very long-running flows are not timing out at any point. I should also note the other flow where we saw this intermittent behavior on crashed at an arbitrary time after 2 hours, so maybe not.

Andrew Huang

01/10/2023, 11:28 PM

oh I missed the reproduction

Zanie

01/10/2023, 11:31 PM

@Ilya Galperin that sounds like the failure is being reported by the agent then the flow is crashing since the agent has marked it as failed.

Zanie

01/10/2023, 11:31 PM

You’re using Kubernetes jobs?

Ilya Galperin

01/10/2023, 11:31 PM

Yes

Zanie

01/10/2023, 11:33 PM

The -1 status code is reported by Prefect in a few cases. https://github.com/PrefectHQ/prefect/blob/2fe69ace54de28f9a47f2360bca43c9aeffbc6ef/src/prefect/infrastructure/kubernetes.py#L606 seems most notable.

🤔 1

Zanie

01/10/2023, 11:34 PM

@Alix Cook you’re using process infrastructure or just running flows without an agent? These may be different issues.

Alix Cook

01/10/2023, 11:40 PM

we're running flows with an agent, the agent is just running them in a local process.

Zanie

01/10/2023, 11:52 PM

Great thanks! Your issue actually sounds distinct from Ilya’s. Can you open an issue with details like DEBUG level agent and flow run logs?

Ilya Galperin

01/11/2023, 12:34 AM

I’m going to try the crash again and see if I can’t find anything in the container’s logs that might not be being shipped back.

Alix Cook

01/11/2023, 4:24 PM

sure

Ilya Galperin

01/11/2023, 6:30 PM

Hi @Zanie - unfortunately no additional logs on the container or otherwise. Not sure if it’s relevant but the pod itself is not crashing, the container spins down and the pod stays alive with a succeeded state. Any other avenues you think might be worth digging into?

Zanie

01/11/2023, 6:35 PM

I’m very confident this is a bug in the Kubernetes library that we will need to add a workaround for (https://github.com/PrefectHQ/prefect/issues/8050#issuecomment-1379289312)

✅ 1

Ilya Galperin

01/11/2023, 7:28 PM

Yes this looks really really similar - I think you are probably right. Thanks for flagging this Michael, we will keep track of this issue!

7 Views

Open in Slack

Previous Next