We are seeing errors when trying to register artifacts in ou Prefect Community #prefect-ui

Steven Fong

05/14/2022, 8:36 AM

We are seeing errors when trying to register artifacts in our flows using prefect cloud, causing all flows with artifacts to enter a failed state.

prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4ffb8484-51c9-427e-8604-168dfb707993 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]

Anna Geller

05/14/2022, 12:37 PM

Could you share more information? INTERNAL_SERVER_ERROR is a generic error. Can you share the flow code that caused this issue and your prefect diagnostics?

Anna Geller

05/14/2022, 12:37 PM

you can also share the flow run ID so that we can check the logs

Florian Kühnlenz

05/14/2022, 1:53 PM

Hi I think we are seeing the exact same issue since today.

Anna Geller

05/14/2022, 3:35 PM

@Florian Kühnlenz could you elaborate a bit more on that - is any flow run that uses Artifacts failing all of a sudden? Could you share an example flow run ID and perhaps a code snippet I could use to reproduce the issue? I can open a GitHub issue but would be great to collect more information first.

Steven Fong

05/14/2022, 4:38 PM

Flow ID of failed run.

296f24ef-ec7f-4fcc-9ada-b626ca92baed

Steven Fong

05/14/2022, 4:40 PM

Prefect Diagnostics

{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__CLOUD__API",
    "PREFECT__CLOUD__AGENT__AUTH_TOKEN",
    "PREFECT__CLOUD__AGENT__LABELS",
    "PREFECT__CLOUD__API_KEY",
    "PREFECT__CLOUD__AGENT__AGENT_ADDRESS",
    "PREFECT__CLOUD__TENANT_ID",
    "PREFECT__BACKEND"
  ],
  "system_information": {
    "platform": "Linux-4.14.138-rancher-x86_64-with-glibc2.2.5",
    "prefect_backend": "cloud",
    "prefect_version": "0.15.5",
    "python_version": "3.8.12"
  }
}

Steven Fong

05/14/2022, 4:51 PM

INFO:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Starting task run...

Calculating Metrics:   0%|          | 0/24 [00:00<?, ?it/s]
Calculating Metrics:   8%|▊         | 2/24 [00:00<00:07,  3.09it/s]
Calculating Metrics:  17%|█▋        | 4/24 [00:00<00:03,  5.45it/s]
Calculating Metrics:  79%|███████▉  | 19/24 [00:01<00:00, 14.81it/s]
Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 17.81it/s]
Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 14.02it/s]
[2022-05-14 08:00:55+0000] ERROR - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "<string>", line 106, in run_ge_validation
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
    return _create_task_run_artifact("markdown", {"markdown": markdown})
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
ERROR:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "<string>", line 106, in run_ge_validation
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
    return _create_task_run_artifact("markdown", {"markdown": markdown})
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
[2022-05-14 08:00:56+0000] INFO - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Finished task run for task with final state: 'Failed'

Steven Fong

05/14/2022, 4:52 PM

this is the log from the k8s job container

Florian Kühnlenz

05/14/2022, 5:41 PM

@Anna Geller Our flow only contains a StartFlowRun task which seems to be the one that fails.

[2022-05-14 14:04:22+0200] ERROR - prefect.CloudTaskRunner | Task 'StartFlowRun[0]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 479, in run_task_with_timeout
    return run_with_thread_timeout(
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 254, in run_with_thread_timeout
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/tasks.py", line 456, in method
    return run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/tasks/prefect/flow_run.py", line 466, in run
    create_link_artifact(urlparse(run_link).path)
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
    return _create_task_run_artifact("link", {"link": link})
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
ERROR:prefect.CloudTaskRunner:Task 'StartFlowRun[0]': Exception encountered during task execution!

Flow ID is ccc38835-4618-4852-8a0e-b233a29df1c5

Steven Fong

05/14/2022, 6:03 PM

we had to hotfix our pipelines and comment out the artifact storage due to our SLA, and things returned to normal. so it definitely seems that the artifact service is the root cause

Florian Kühnlenz

05/14/2022, 6:04 PM

It seems we are seeing more and more flows failing with similar problems.

Anna Geller

05/14/2022, 6:20 PM

sorry to hear, but the flow ID won't help us, can you both share a flow run ID of the failing flow runs?

Florian Kühnlenz

05/14/2022, 6:26 PM

Where can we find it 😅

Anna Geller

05/14/2022, 6:27 PM

it's the URL of the flow run page

Steven Fong

05/14/2022, 6:27 PM

here is a list of some of our failed runs

468ac0af-83de-4c36-8d7f-7a88dc6f2ad2
8b403142-e675-4b5d-9c5c-9f5d07cc732c
e3bd219b-9e63-4f8a-9a32-d968caeb5244
97db0cfd-a7be-493e-8f5f-043990778c64

👍 1

Anna Geller

05/14/2022, 6:27 PM

part of it strictly speaking
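
For readers looking for the same ID: a minimal sketch of pulling the flow run ID out of the flow run page URL. The URL layout in the example is an assumption (the thread only says the ID is part of the URL), and `flow_run_id_from_url` is a hypothetical helper name, not part of Prefect.

```python
import re
from urllib.parse import urlparse

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def flow_run_id_from_url(url: str) -> str:
    """Return the last UUID found in the path of a flow run page URL."""
    path = urlparse(url).path
    matches = UUID_RE.findall(path)
    if not matches:
        raise ValueError(f"no UUID found in URL path: {path!r}")
    return matches[-1]


# Assumed URL layout; only the run ID itself is taken from this thread.
example = "https://cloud.prefect.io/my-team/flow-run/f8426950-0d53-4cea-975d-6d47f05d5e12"
print(flow_run_id_from_url(example))
```

Taking the last UUID in the path, rather than a fixed position, keeps the helper usable if the URL carries other IDs earlier in the path.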

Florian Kühnlenz

05/14/2022, 6:31 PM

I was searching for it everywhere else in the UI 😅

f8426950-0d53-4cea-975d-6d47f05d5e12

This one shows the log from above. I have not yet checked the others in detail, but I assume similar problems since they ran fine previously.

265077fa-e8d3-46b1-96d7-88971d5d03bd
5533eda5-138d-4d40-a26b-f13b783ba426

Anna Geller

05/14/2022, 6:31 PM

Steven I don't see any failed task runs in any of the flow runs you sent - weird

Steven Fong

05/14/2022, 6:34 PM

this is what i mean, none of the failures bubble up to the UI

Steven Fong

05/14/2022, 6:34 PM

the only logs are from the k8s job container

👍 1

Anna Geller

05/14/2022, 6:35 PM

I couldn't find anything suspicious in the logs, and I reported the issue. I will report back tomorrow if I get a response over the weekend; otherwise I'll find out more on Monday at the latest. Keep us posted if you find any more information that can help identify the root cause here.

Steven Fong

05/14/2022, 6:35 PM

even the agent doesn't detect the failure in the logs

Anna Geller

05/14/2022, 6:35 PM

this is what i mean, none of the failures bubble up to the UI

I see, that explains a lot, thanks!

Steven Fong

05/14/2022, 6:36 PM

the specific logs are all the same as i posted above in this thread. commenting out the artifact submission did fix the issue, obviously we would like to store our artifacts for auditing so we will monitor this thread.

Florian Kühnlenz

05/14/2022, 6:38 PM

If this is indeed an issue with the API shouldn't this be covered by the SLA?

☝️ 1

Steven Fong

05/14/2022, 6:43 PM

the prefect cloud status page could use some work and should report information on all dependent services rather than just if the cloud service is responding to pings/online, as seeing everything still marked as green is probably dishonest in this scenario

Anna Geller

05/14/2022, 7:20 PM

Here is the response I got from the infrastructure team: there is no indication of a service outage on our side. It's possible to hit a race condition if you very quickly create an artifact after a task run, but based on the provided task run ids that doesn't appear to be happening.

👍 1

Anna Geller

05/14/2022, 7:21 PM

We will investigate it in more detail on Monday and get back to you then

Steven Fong

05/14/2022, 7:24 PM

Thanks, this issue started sometime last night (from our logs) and is fairly consistent, out of 11 attempted runs 9 failed to upload artifacts all with the same error.

Steven Fong

05/14/2022, 7:24 PM

i doubt this is a race condition as this flow worked flawlessly for months previous to last night

Florian Kühnlenz

05/14/2022, 7:25 PM

We are not even actively uploading artifacts, we are just using the integrated StartFlowRun task. Since the same flow worked previously and we get a response from the API, this clearly seems like a problem on Prefect's side to me.

Steven Fong

05/14/2022, 7:27 PM

@Florentino Bexiga your task must be triggering the artifacts based on the error you posted

create_link_artifact(urlparse(run_link).path)
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
    return _create_task_run_artifact("link", {"link": link})
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]

it's trying to insert the run link into your artifacts.

Florian Kühnlenz

05/14/2022, 7:35 PM

I am pretty certain that this all happens inside the supplied StartFlowRun task.

Florian Kühnlenz

05/15/2022, 7:05 AM

Is there anything else we can do to move this issue forward? We don't even have a workaround yet.

Florian Kühnlenz

05/15/2022, 10:12 AM

Looking at the code at https://github.com/PrefectHQ/prefect/blob/8d45a8a49a4123efa1b2b9c62895bc3fd9829d1e/src/prefect/tasks/prefect/flow_run.py#L475, I just realized that the sub flows are probably getting triggered and only saving the URL as an artifact fails. I am not sure it is a good design where an unnecessary(?) side effect like this breaks the whole task. Also, there seems to be no way to bypass this step, nor are there proper logs in Prefect itself about what is happening.
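
The point about an optional side effect breaking the whole task can be illustrated with a minimal sketch, assuming the artifact call lives in your own task code (as in the `run_ge_validation` traceback earlier in the thread). `best_effort` is a hypothetical helper, not a Prefect API, and it cannot change what happens inside the library's StartFlowRun task.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("artifact_helper")


def best_effort(create_artifact: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Call create_artifact; log a warning and return None if it raises."""
    try:
        return create_artifact(*args, **kwargs)
    except Exception as exc:  # Broad on purpose: the artifact is treated as optional.
        logger.warning("Artifact creation failed, continuing: %s", exc)
        return None


# Hypothetical usage inside a task, using the function named in the traceback:
#   from prefect.backend.artifacts import create_markdown_artifact
#   best_effort(create_markdown_artifact, report_markdown)
```

The trade-off is that a missing artifact only shows up as a warning in the logs, which may not be acceptable where artifacts are needed for auditing.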

Anna Geller

05/15/2022, 12:30 PM

I agree that artifacts are not necessary to trigger a child flow run from parent - @Florian Kühnlenz could you switch to use the create_flow_run task instead of StartFlowRun? This "workaround" should solve the issue

Anna Geller

05/15/2022, 3:01 PM

@Florian Kühnlenz and @Steven Fong we've investigated the issue more closely and your issue should be resolved now - could you confirm? Florian, even with StartFlowRun task and artifact creation it should work now

Steven Fong

05/15/2022, 3:02 PM

it will likely be tomorrow during business hours before we are able to release a new build to confirm

👍 1

Florian Kühnlenz

05/15/2022, 7:57 PM

I think we see a 50% failure rate in c3154113-9700-431a-a9dd-10174ee6ce24

Anna Geller

05/15/2022, 8:35 PM

thanks for reporting back @Florian Kühnlenz - in that case, I'd indeed recommend switching to the create_flow_run task to avoid the artifact creation that happens in the StartFlowRun task, since it doesn't seem to be an issue in the Cloud API backend anymore - the existing issue got resolved at 1:45 PM UTC and the error in this flow run log was at 6 PM UTC. Could you try switching to the create_flow_run task and report back if this helped? (it should)

Florian Kühnlenz

05/16/2022, 5:44 AM

We will look into implementing the workaround today. However it sounds a bit strange to me that at first there was no issue. Now there was a fix. But the remaining problems should have nothing to do with the issue or the fix? I will update with some logs soon.

Florian Kühnlenz

05/16/2022, 6:35 AM

It seems the flows running in the morning are doing fine so far. The above-mentioned c3154113-9700-431a-a9dd-10174ee6ce24 is part of our ETL process and should not be run during the day, so we will see how it behaves in the evening. However, in the UI it still shows the same message:

Error during execution of task: ClientError([{'path': ['create_task_run_artifact'], 'message': 'Task run 8d67c55d-dc9e-41e3-9993-4d673d0131d2 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])

I cannot retrieve the logs, however. I also had a look at create_flow_run, and to me it seems this is not a good replacement for StartFlowRun since it does not wait for the flow run to finish. So we would need to combine it with wait_for_flow_run. This, however, is a bit difficult since in many cases we map over StartFlowRun. Any suggestions appreciated.

Anna Geller

05/16/2022, 11:39 AM

I created an example for you here

Anna Geller

05/16/2022, 11:40 AM

This example shows how you can combine those two tasks using mapping to create child flow runs and wait for their completion in a downstream task

👍 1

Anna Geller

05/16/2022, 3:01 PM

btw @Florian Kühnlenz I added this PR https://github.com/PrefectHQ/prefect/pull/5795

Florian Kühnlenz

05/17/2022, 6:46 PM

@Anna Geller the problem is still(?) happening, see flow run d7024b10-e2da-4d5a-bf28-9b6c9ca6cd22. We did not manage to rewrite the flow yet using the workaround you provided.

Anna Geller

05/17/2022, 6:54 PM

I have no other recommendation as of now other than not using StartFlowRun. The PR I submitted got merged, so you may soon be able to change that with a single flag once we release it, but this would require an upgrade - not sure if that's easier than rewriting to create_flow_run, especially given that this task is much nicer than StartFlowRun

Florian Kühnlenz

05/17/2022, 6:58 PM

Alright. But as mentioned, we sometimes need to wait for the flow to finish, which made StartFlowRun easier for us to use. Is there an ongoing effort to investigate the problem with the API?

Anna Geller

05/17/2022, 7:03 PM

Sorry, I should have mentioned that: we investigated the issue and have an open ticket about it internally. It seems that when you start that many StartFlowRun task runs, which spin up child flow runs and create artifacts, a race condition occurs: your flow tries to create an artifact for a task run which doesn't exist yet. We are aware of the issue on the Cloud side and recommend the workaround as discussed. Sorry, I should have mentioned the open issue right away.
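
If the failure really is a race where the task run is not yet visible to the API, one generic mitigation in your own task code is to retry the artifact call with a short backoff. The sketch below is an assumption, not something confirmed in this thread to fix the problem; `retry_call` and its parameters are hypothetical names, not a Prefect API.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff when it raises retry_on."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical usage inside a task, using names from the tracebacks above:
#   from prefect.backend.artifacts import create_markdown_artifact
#   from prefect.exceptions import ClientError
#   retry_call(lambda: create_markdown_artifact(report_markdown), retry_on=(ClientError,))
```

Restricting `retry_on` to the client error keeps unrelated bugs from being retried silently. As with any wrapper in user code, this does not reach the artifact call made inside StartFlowRun itself.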

Florian Kühnlenz

05/17/2022, 7:04 PM

Okay, thanks. That's much more encouraging 😃.

👍 1
