# prefect-ui
s
We are seeing errors when trying to register artifacts in our flows using prefect cloud, causing all flows with artifacts to enter a failed state.
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4ffb8484-51c9-427e-8604-168dfb707993 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
a
Could you share more information? INTERNAL_SERVER_ERROR is a generic error. Can you share the flow code that caused this issue and your prefect diagnostics?
you can also share the flow run ID so that we can check the logs
f
Hi I think we are seeing the exact same issue since today.
a
@Florian Kühnlenz could you elaborate a bit more on that - is any flow run that uses Artifacts failing all of a sudden? Could you share an example flow run ID and perhaps a code snippet I could use to reproduce the issue? I can open a GitHub issue but would be great to collect more information first.
s
Flow ID of failed run.
296f24ef-ec7f-4fcc-9ada-b626ca92baed
Prefect Diagnostics
{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__CLOUD__API",
    "PREFECT__CLOUD__AGENT__AUTH_TOKEN",
    "PREFECT__CLOUD__AGENT__LABELS",
    "PREFECT__CLOUD__API_KEY",
    "PREFECT__CLOUD__AGENT__AGENT_ADDRESS",
    "PREFECT__CLOUD__TENANT_ID",
    "PREFECT__BACKEND"
  ],
  "system_information": {
    "platform": "Linux-4.14.138-rancher-x86_64-with-glibc2.2.5",
    "prefect_backend": "cloud",
    "prefect_version": "0.15.5",
    "python_version": "3.8.12"
  }
}
INFO:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Starting task run...

Calculating Metrics:   0%|          | 0/24 [00:00<?, ?it/s]
Calculating Metrics:   8%|▊         | 2/24 [00:00<00:07,  3.09it/s]
Calculating Metrics:  17%|█▋        | 4/24 [00:00<00:03,  5.45it/s]
Calculating Metrics:  79%|███████▉  | 19/24 [00:01<00:00, 14.81it/s]
Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 17.81it/s]
Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 14.02it/s]
[2022-05-14 08:00:55+0000] ERROR - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "<string>", line 106, in run_ge_validation
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
    return _create_task_run_artifact("markdown", {"markdown": markdown})
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
ERROR:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "<string>", line 106, in run_ge_validation
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
    return _create_task_run_artifact("markdown", {"markdown": markdown})
  File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
[2022-05-14 08:00:56+0000] INFO - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Finished task run for task with final state: 'Failed'
this is the log from the k8s job container
f
@Anna Geller Our flow only contains a StartFlowRun task which seems to be the one that fails.
[2022-05-14 14:04:22+0200] ERROR - prefect.CloudTaskRunner | Task 'StartFlowRun[0]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 479, in run_task_with_timeout
    return run_with_thread_timeout(
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 254, in run_with_thread_timeout
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/tasks.py", line 456, in method
    return run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/tasks/prefect/flow_run.py", line 466, in run
    create_link_artifact(urlparse(run_link).path)
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
    return _create_task_run_artifact("link", {"link": link})
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
ERROR:prefect.CloudTaskRunner:Task 'StartFlowRun[0]': Exception encountered during task execution!
Flow ID is ccc38835-4618-4852-8a0e-b233a29df1c5
s
we had to hotfix our pipelines and comment out the artifact storage due to our SLA, and things returned to normal. so it definitely seems that the artifact service is the root cause
f
It seems we are seeing more and more flows failing with similar problems.
a
sorry to hear, but the flow ID won't help us, can you both share a flow run ID of the failing flow runs?
f
Where can we find it 😅
a
it's the URL of the flow run page
s
here is a list of some of our failed runs
468ac0af-83de-4c36-8d7f-7a88dc6f2ad2
8b403142-e675-4b5d-9c5c-9f5d07cc732c
e3bd219b-9e63-4f8a-9a32-d968caeb5244
97db0cfd-a7be-493e-8f5f-043990778c64
👍 1
a
part of it strictly speaking
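For anyone else hunting for the ID: a minimal sketch of pulling the flow run ID out of a flow run page URL. The URL layout and the tenant slug `my-team` are assumptions for illustration, and `flow_run_id_from_url` is a hypothetical helper, not part of Prefect.

```python
import re
import uuid
from typing import Optional
from urllib.parse import urlparse

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def flow_run_id_from_url(url: str) -> Optional[str]:
    """Return the last UUID found in the URL path, or None if there is none."""
    matches = UUID_PATTERN.findall(urlparse(url).path)
    if not matches:
        return None
    # Round-trip through uuid.UUID to validate and normalise to lowercase.
    return str(uuid.UUID(matches[-1]))


# Assumed URL layout; only the trailing UUID comes from this thread.
example_url = (
    "https://cloud.prefect.io/my-team/flow-run/"
    "f8426950-0d53-4cea-975d-6d47f05d5e12"
)
print(flow_run_id_from_url(example_url))
```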
f
I was searching it everywhere else in the UI 😅
f8426950-0d53-4cea-975d-6d47f05d5e12
This one shows the log from above. I have not yet checked the others in detail, but I assume similar problems since they ran fine previously.
265077fa-e8d3-46b1-96d7-88971d5d03bd
5533eda5-138d-4d40-a26b-f13b783ba426
a
Steven I don't see any failed task runs in any of the flow runs you sent - weird
s
this is what i mean, none of the failures bubble up to the UI
the only logs are from the k8s job container
👍 1
a
I couldn't find anything suspicious in the logs, so I reported the issue. I will report back tomorrow if I get a response over the weekend; otherwise I'll find out more on Monday at the latest. Keep us posted if you find any more information that can help identify the root cause here.
s
even the agent doesn't detect the failure in the logs
a
this is what i mean, none of the failures bubble up to the UI
I see that explains a lot, thanks!
s
the specific logs are all the same as i posted above in this thread. commenting out the artifact submission did fix the issue, obviously we would like to store our artifacts for auditing so we will monitor this thread.
f
If this is indeed an issue with the API shouldn't this be covered by the SLA?
☝️ 1
s
the prefect cloud status page could use some work and should report information on all dependent services rather than just if the cloud service is responding to pings/online, as seeing everything still marked as green is probably dishonest in this scenario
a
Here is the response I got from the infrastructure team: there is no indication of a service outage on our side. It's possible to hit a race condition if you very quickly create an artifact after a task run, but based on the provided task run ids that doesn't appear to be happening.
👍 1
We will investigate it in more detail on Monday and get back to you then
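If a race between task run creation and artifact creation is the cause, one client-side mitigation is to retry the artifact call with backoff when the backend reports "not found". The following is only a sketch under that assumption: `retry_on_not_found` is a hypothetical helper, and `flaky_create_artifact` is a stand-in that mimics the error text rather than a real Prefect call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_not_found(
    action: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying with exponential backoff while it reports 'not found'."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Retry only the symptom seen in this thread; re-raise anything else,
            # and give up once the attempt budget is exhausted.
            if "not found" not in str(exc) or attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")


# Stand-in for the artifact call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_create_artifact() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Task run 1234 not found")
    return "artifact-id"


# The no-op sleep keeps the demonstration instant.
result = retry_on_not_found(flaky_create_artifact, sleep=lambda _seconds: None)
print(result, calls["count"])
```

Inside a task, the wrapped action would be the artifact call itself, for example a lambda around `create_markdown_artifact(markdown)` from the traceback above.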
s
Thanks, this issue started sometime last night (from our logs) and is fairly consistent, out of 11 attempted runs 9 failed to upload artifacts all with the same error.
i doubt this is a race condition as this flow worked flawlessly for months previous to last night
f
We are not even actively uploading artifacts; we are just using the integrated StartFlowRun task. Since the same flow worked previously and we get a response from the API, this clearly seems like a problem on Prefect's side to me.
s
@Florentino Bexiga your task must be triggering the artifacts based on the error you posted
create_link_artifact(urlparse(run_link).path)
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
    return _create_task_run_artifact("link", {"link": link})
  File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
    return client.create_task_run_artifact(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
    result = self.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
it's trying to insert the run link into your artifacts.
f
I am pretty certain that this all happens inside the supplied StartFlowRun task.
Is there anything else we can do to move this issue forward? We don't even have a workaround yet.
Looking at the code at https://github.com/PrefectHQ/prefect/blob/8d45a8a49a4123efa1b2b9c62895bc3fd9829d1e/src/prefect/tasks/prefect/flow_run.py#L475, I just realized that the sub flows are probably getting triggered and only saving the URL as an artifact fails. I am not sure it is a good design where an unnecessary(?) side effect like this breaks the whole task. Also, there seems to be no way to bypass this step, nor are there proper logs in Prefect itself about what is happening.
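To illustrate the design point, here is a sketch of treating the artifact step as best effort, so that a failure is logged instead of failing the task. This is not how StartFlowRun is implemented; `best_effort`, `failing_link_artifact`, and `start_child_flow_run` are hypothetical names, and the flow run ID is a placeholder.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("artifact_sketch")


def best_effort(side_effect: Callable[[], str]) -> Optional[str]:
    """Run an optional side effect; log and swallow failures instead of raising."""
    try:
        return side_effect()
    except Exception as exc:
        logger.warning("Optional artifact step failed, continuing: %s", exc)
        return None


def failing_link_artifact() -> str:
    # Stand-in for the artifact call; mimics the error text from this thread.
    raise RuntimeError("Task run 0ca85d4b not found")


def start_child_flow_run() -> str:
    """Hypothetical task body: the real work succeeds even if the link is not stored."""
    flow_run_id = "child-run-1"  # placeholder for the ID returned by the backend
    best_effort(failing_link_artifact)
    return flow_run_id


print(start_child_flow_run())
```

Swallowing the error only makes sense when the artifact is a convenience. Where artifacts are needed for auditing, as mentioned earlier in the thread, failing loudly or retrying is probably the better choice.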
a
I agree that artifacts are not necessary to trigger a child flow run from parent - @Florian Kühnlenz could you switch to use the create_flow_run task instead of StartFlowRun? This "workaround" should solve the issue
@Florian Kühnlenz and @Steven Fong we've investigated the issue more closely and your issue should be resolved now - could you confirm? Florian, even with StartFlowRun task and artifact creation it should work now
s
it will likely be tomorrow during business hours before we are able to release a new build to confirm
👍 1
f
I think we see a 50% failure rate in c3154113-9700-431a-a9dd-10174ee6ce24
a
thanks for reporting back @Florian Kühnlenz - in that case, I'd indeed recommend switching to the create_flow_run task to avoid the artifact creation that happens in the StartFlowRun task, since it doesn't seem to be an issue in the Cloud API backend anymore - the existing issue got resolved at 1:45 PM UTC and the error in this flow run log was at 6 PM UTC. Could you try switching to this create_flow_run task and report back if this helped? (it should)
f
We will look into implementing the workaround today. However it sounds a bit strange to me that at first there was no issue. Now there was a fix. But the remaining problems should have nothing to do with the issue or the fix? I will update with some logs soon.
It seems the flows running in the morning are doing fine so far. The above-mentioned c3154113-9700-431a-a9dd-10174ee6ce24 is part of our ETL process and should not be run during the day, so we will see how it behaves in the evening. However, in the UI it still shows the same message:
Error during execution of task: ClientError([{'path': ['create_task_run_artifact'], 'message': 'Task run 8d67c55d-dc9e-41e3-9993-4d673d0131d2 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
I cannot retrieve the logs, however. I also had a look at create_flow_run, and to me it seems this is not a good replacement for StartFlowRun since it does not wait for the flow run to finish. So we would need to combine it with wait_for_flow_run. This, however, is a bit difficult since in many cases we map over StartFlowRun. Any suggestions appreciated.
a
I created an example for you here
This example shows how you can combine those two tasks using mapping to create child flow runs and wait for their completion in a downstream task
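The linked example did not survive in this export. As a rough stand-in, and not necessarily what the linked example contained, a mapped combination of the two tasks might look like the sketch below for Prefect 1.x. `build_parent_flow` and its arguments are hypothetical names, and the child flows are assumed to be already registered in the given project.

```python
def build_parent_flow(child_flow_names, project_name):
    """Build a parent flow that starts one child run per name and waits for each."""
    # Imported lazily so this sketch can be loaded without Prefect 1.x installed.
    from prefect import Flow, unmapped
    from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

    with Flow("parent_flow") as flow:
        # One create_flow_run task run per child flow name.
        child_run_ids = create_flow_run.map(
            flow_name=child_flow_names,
            project_name=unmapped(project_name),
        )
        # One wait task run per child flow run ID; each blocks until its child
        # flow run has finished.
        wait_for_flow_run.map(flow_run_id=child_run_ids)
    return flow
```

Calling `build_parent_flow(["child_a", "child_b"], "my-project")` (made-up names) would return a flow that can be registered as usual. Whether a failed child also fails the parent depends on the installed version: later 1.x releases appear to offer a `raise_final_state` argument on `wait_for_flow_run`, which is worth checking before relying on it.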
👍 1
btw @Florian Kühnlenz I added this PR https://github.com/PrefectHQ/prefect/pull/5795
f
@Anna Geller the problem is still(?) happening, see flow run d7024b10-e2da-4d5a-bf28-9b6c9ca6cd22. We did not manage to rewrite the flow yet using the workaround you provided.
a
I have no other recommendation as of now other than not using StartFlowRun. The PR I submitted got merged, so you may soon be able to change that with a single flag once we release it, but this would require an upgrade - not sure if that's easier than rewriting to create_flow_run, especially given that this task is much nicer than StartFlowRun
f
Alright. But as mentioned, we sometimes need to wait for the flow to finish, which made StartFlowRun easier for us to use. Is there an ongoing effort to investigate the problem with the API?
a
Sorry, I should have mentioned that: we investigated the issue and have an open ticket about it internally. It seems that when you start that many StartFlowRun task runs that spin up child flow runs and create artifacts, a race condition occurs: your flow tries to create an artifact for a task run which doesn't exist yet. We are aware of the issue on the Cloud side and recommend the workaround as discussed. Sorry, I should have mentioned the open issue right away.
f
Okay, thanks. That's much more encouraging 😃.
👍 1