
    Steven Fong

    4 months ago
    We are seeing errors when trying to register artifacts in our flows using Prefect Cloud, causing all flows with artifacts to enter a failed state.
    prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4ffb8484-51c9-427e-8604-168dfb707993 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]

    Anna Geller

    4 months ago
    Could you share more information? INTERNAL_SERVER_ERROR is a generic error. Can you share the flow code that caused this issue and your prefect diagnostics output?
    You can also share the flow run ID so that we can check the logs.

    Florian Kühnlenz

    4 months ago
    Hi, I think we have been seeing the exact same issue since today.

    Anna Geller

    4 months ago
    @Florian Kühnlenz could you elaborate a bit more on that - is any flow run that uses Artifacts failing all of a sudden? Could you share an example flow run ID and perhaps a code snippet I could use to reproduce the issue? I can open a GitHub issue but would be great to collect more information first.

    Steven Fong

    4 months ago
    Flow ID of a failed run:
    296f24ef-ec7f-4fcc-9ada-b626ca92baed
    Prefect diagnostics:
    {
      "config_overrides": {},
      "env_vars": [
        "PREFECT__CLOUD__API",
        "PREFECT__CLOUD__AGENT__AUTH_TOKEN",
        "PREFECT__CLOUD__AGENT__LABELS",
        "PREFECT__CLOUD__API_KEY",
        "PREFECT__CLOUD__AGENT__AGENT_ADDRESS",
        "PREFECT__CLOUD__TENANT_ID",
        "PREFECT__BACKEND"
      ],
      "system_information": {
        "platform": "Linux-4.14.138-rancher-x86_64-with-glibc2.2.5",
        "prefect_backend": "cloud",
        "prefect_version": "0.15.5",
        "python_version": "3.8.12"
      }
    }
    INFO:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Starting task run...
    
    Calculating Metrics:   0%|          | 0/24 [00:00<?, ?it/s]
    Calculating Metrics:   8%|▊         | 2/24 [00:00<00:07,  3.09it/s]
    Calculating Metrics:  17%|█▋        | 4/24 [00:00<00:03,  5.45it/s]
    Calculating Metrics:  79%|███████▉  | 19/24 [00:01<00:00, 14.81it/s]
    Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 17.81it/s]
    Calculating Metrics: 100%|██████████| 24/24 [00:01<00:00, 14.02it/s]
    [2022-05-14 08:00:55+0000] ERROR - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Exception encountered during task execution!
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
        value = prefect.utilities.executors.run_task_with_timeout(
      File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
        return task.run(*args, **kwargs)  # type: ignore
      File "<string>", line 106, in run_ge_validation
      File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
        return _create_task_run_artifact("markdown", {"markdown": markdown})
      File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
        return client.create_task_run_artifact(
      File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
        result = self.graphql(
      File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
        raise ClientError(result["errors"])
    prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
    ERROR:prefect.CloudTaskRunner:Task 'run_ge_validation[5]': Exception encountered during task execution!
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
        value = prefect.utilities.executors.run_task_with_timeout(
      File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
        return task.run(*args, **kwargs)  # type: ignore
      File "<string>", line 106, in run_ge_validation
      File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 85, in create_markdown_artifact
        return _create_task_run_artifact("markdown", {"markdown": markdown})
      File "/usr/local/lib/python3.9/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
        return client.create_task_run_artifact(
      File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1836, in create_task_run_artifact
        result = self.graphql(
      File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
        raise ClientError(result["errors"])
    prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 4d86a853-5ac4-4778-9285-ee5287100c97 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
    [2022-05-14 08:00:56+0000] INFO - prefect.CloudTaskRunner | Task 'run_ge_validation[5]': Finished task run for task with final state: 'Failed'
    this is the log from the k8s job container
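    For context, the failing frame in the traceback is the artifact call at the end of our validation task; a minimal sketch of that pattern (a hypothetical task body, not our actual flow code) would be:
    from prefect import task
    from prefect.backend import create_markdown_artifact

    @task
    def run_ge_validation(batch):
        # ... run the Great Expectations validation suite (omitted) ...
        markdown = f"# Validation report for batch {batch}"
        # this is the call that raises ClientError when the Cloud API
        # responds with "Task run <id> not found"
        create_markdown_artifact(markdown)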

    Florian Kühnlenz

    4 months ago
    @Anna Geller Our flow only contains a StartFlowRun task which seems to be the one that fails.
    [2022-05-14 14:04:22+0200] ERROR - prefect.CloudTaskRunner | Task 'StartFlowRun[0]': Exception encountered during task execution!
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 876, in get_task_run_state
        value = prefect.utilities.executors.run_task_with_timeout(
      File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 479, in run_task_with_timeout
        return run_with_thread_timeout(
      File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 254, in run_with_thread_timeout
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/prefect/utilities/tasks.py", line 456, in method
        return run_method(self, *args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/prefect/tasks/prefect/flow_run.py", line 466, in run
        create_link_artifact(urlparse(run_link).path)
      File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
        return _create_task_run_artifact("link", {"link": link})
      File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
        return client.create_task_run_artifact(
      File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
        result = self.graphql(
      File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
        raise ClientError(result["errors"])
    prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
    ERROR:prefect.CloudTaskRunner:Task 'StartFlowRun[0]': Exception encountered during task execution!
    Flow ID is ccc38835-4618-4852-8a0e-b233a29df1c5
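    For reference, the flow is essentially just a mapped StartFlowRun; a simplified sketch (placeholder project and flow names, not our real configuration):
    from prefect import Flow
    from prefect.tasks.prefect import StartFlowRun

    # wait=True makes the parent wait for each child flow run to finish
    start_child = StartFlowRun(project_name="my-project", wait=True)

    with Flow("parent-flow") as flow:
        # StartFlowRun.run() internally calls create_link_artifact() with the
        # child run's URL - the call that fails in the traceback above
        start_child.map(flow_name=["child-a", "child-b"])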

    Steven Fong

    4 months ago
    We had to hotfix our pipelines and comment out the artifact storage due to our SLA, and things returned to normal, so it definitely seems that the artifact service is the root cause.

    Florian Kühnlenz

    4 months ago
    It seems we are seeing more and more flows failing with similar problems.

    Anna Geller

    4 months ago
    Sorry to hear that, but the flow ID won't help us. Can you both share flow run IDs of the failing flow runs?

    Florian Kühnlenz

    4 months ago
    Where can we find it? 😅

    Anna Geller

    4 months ago
    it's the URL of the flow run page

    Steven Fong

    4 months ago
    Here is a list of some of our failed runs:
    468ac0af-83de-4c36-8d7f-7a88dc6f2ad2
    8b403142-e675-4b5d-9c5c-9f5d07cc732c
    e3bd219b-9e63-4f8a-9a32-d968caeb5244
    97db0cfd-a7be-493e-8f5f-043990778c64

    Anna Geller

    4 months ago
    part of the URL, strictly speaking

    Florian Kühnlenz

    4 months ago
    I was searching for it everywhere else in the UI 😅
    f8426950-0d53-4cea-975d-6d47f05d5e12
    This one shows the log from above. I have not yet checked the others in detail, but I assume similar problems since they did run previously.
    265077fa-e8d3-46b1-96d7-88971d5d03bd
    5533eda5-138d-4d40-a26b-f13b783ba426

    Anna Geller

    4 months ago
    Steven, I don't see any failed task runs in any of the flow runs you sent - weird

    Steven Fong

    4 months ago
    This is what I mean: none of the failures bubble up to the UI.
    The only logs are from the k8s job container.

    Anna Geller

    4 months ago
    I couldn't find anything suspicious in the logs, so I reported the issue. I'll report back tomorrow if I get a response over the weekend; at the latest, I'll find out more on Monday. Keep us posted if you find any more information that can help identify the root cause here.

    Steven Fong

    4 months ago
    even the agent doesn't detect the failure in the logs

    Anna Geller

    4 months ago
    This is what I mean: none of the failures bubble up to the UI.
    I see, that explains a lot, thanks!

    Steven Fong

    4 months ago
    The specific logs are all the same as what I posted above in this thread. Commenting out the artifact submission did fix the issue; obviously we would like to store our artifacts for auditing, so we will monitor this thread.

    Florian Kühnlenz

    4 months ago
    If this is indeed an issue with the API, shouldn't this be covered by the SLA?

    Steven Fong

    4 months ago
    The Prefect Cloud status page could use some work and should report information on all dependent services rather than just whether the Cloud service is responding to pings/online; seeing everything still marked as green is probably dishonest in this scenario.

    Anna Geller

    4 months ago
    Here is the response I got from the infrastructure team: there is no indication of a service outage on our side. It's possible to hit a race condition if you create an artifact very quickly after a task run, but based on the provided task run IDs that doesn't appear to be happening.
    We will investigate it in more detail on Monday and get back to you then

    Steven Fong

    4 months ago
    Thanks. This issue started sometime last night (from our logs) and is fairly consistent: out of 11 attempted runs, 9 failed to upload artifacts, all with the same error.
    I doubt this is a race condition, as this flow worked flawlessly for months prior to last night.

    Florian Kühnlenz

    4 months ago
    We are not even actively uploading artifacts; we are just using the integrated StartFlowRun task. Since the same flow worked previously and we get a response from the API, this clearly seems like a problem on Prefect's side to me.

    Steven Fong

    4 months ago
    @Florian Kühnlenz your task must be triggering the artifacts, based on the error you posted:
    create_link_artifact(urlparse(run_link).path)
      File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 52, in create_link_artifact
        return _create_task_run_artifact("link", {"link": link})
      File "/usr/local/lib/python3.8/site-packages/prefect/backend/artifacts.py", line 28, in _create_task_run_artifact
        return client.create_task_run_artifact(
      File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 2160, in create_task_run_artifact
        result = self.graphql(
      File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 570, in graphql
        raise ClientError(result["errors"])
    prefect.exceptions.ClientError: [{'path': ['create_task_run_artifact'], 'message': 'Task run 0ca85d4b-dbef-4090-a601-30451c5a9fc8 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
    It's trying to insert the run link into your artifacts.

    Florian Kühnlenz

    4 months ago
    I am pretty certain that this all happens inside the supplied StartFlowRun task.
    Is there anything else we can do to move this issue forward? We don't even have a workaround yet.
    Looking at the code at https://github.com/PrefectHQ/prefect/blob/8d45a8a49a4123efa1b2b9c62895bc3fd9829d1e/src/prefect/tasks/prefect/flow_run.py#L475, I just realized that the child flows probably are getting triggered and only saving the URL as an artifact fails. I am not sure it is a good design when an unnecessary(?) side effect like this breaks the whole task. Also, there seems to be no way to bypass this step, nor are there proper logs in Prefect itself about what is happening.

    Anna Geller

    4 months ago
    I agree that artifacts are not necessary to trigger a child flow run from a parent - @Florian Kühnlenz could you switch to the create_flow_run task instead of StartFlowRun? This "workaround" should solve the issue.
    @Florian Kühnlenz and @Steven Fong we've investigated the issue more closely and your issue should be resolved now - could you confirm? Florian, even with the StartFlowRun task and artifact creation it should work now.
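    Roughly, the swap would look like this - a minimal sketch with placeholder flow and project names, not a drop-in for your exact flow:
    from prefect import Flow
    from prefect.tasks.prefect import create_flow_run

    with Flow("parent-flow") as flow:
        # create_flow_run only triggers the child flow run; per the suggestion
        # above, it avoids the artifact creation step that fails in StartFlowRun
        create_flow_run(flow_name="child-flow", project_name="my-project")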

    Steven Fong

    4 months ago
    it will likely be tomorrow during business hours before we are able to release a new build to confirm

    Florian Kühnlenz

    4 months ago
    I think we are seeing a 50% failure rate in
    c3154113-9700-431a-a9dd-10174ee6ce24

    Anna Geller

    4 months ago
    Thanks for reporting back @Florian Kühnlenz - in that case, I'd indeed recommend switching to the create_flow_run task to avoid the artifact creation that happens in the StartFlowRun task, since this no longer seems to be an issue in the Cloud API backend - the existing issue got resolved at 1:45 PM UTC and the error in this flow run's log was at 6 PM UTC. Could you try switching to the create_flow_run task and report back whether this helped? (it should)

    Florian Kühnlenz

    4 months ago
    We will look into implementing the workaround today. However, it sounds a bit strange to me that at first there was no issue, then there was a fix, but the remaining problems should have nothing to do with the issue or the fix? I will update with some logs soon.
    It seems the flows running in the morning are doing fine so far. The above-mentioned
    c3154113-9700-431a-a9dd-10174ee6ce24
    is part of our ETL process and should not be run during the day, so we will see how it behaves in the evening. However, in the UI it still shows the same message:
    Error during execution of task: ClientError([{'path': ['create_task_run_artifact'], 'message': 'Task run 8d67c55d-dc9e-41e3-9993-4d673d0131d2 not found', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
    I cannot retrieve the logs, however. I also had a look at create_flow_run, and to me it seems this is not a good replacement for StartFlowRun since it does not wait for the flow run to finish. So we would need to combine it with wait_for_flow_run. This, however, is a bit difficult since in many cases we map over StartFlowRun. Any suggestions appreciated.

    Anna Geller

    4 months ago
    I created an example for you here
    This example shows how you can combine those two tasks using mapping to create child flow runs and wait for their completion in a downstream task
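    A minimal sketch of that pattern (placeholder flow and project names, example parameters - adjust to your flows):
    from prefect import Flow, unmapped
    from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

    with Flow("parent-flow") as flow:
        # kick off one child flow run per parameter set
        child_run_ids = create_flow_run.map(
            parameters=[{"x": 1}, {"x": 2}, {"x": 3}],
            flow_name=unmapped("child-flow"),
            project_name=unmapped("my-project"),
        )
        # wait for each child flow run to reach a final state
        wait_for_flow_run.map(child_run_ids, stream_logs=unmapped(True))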
    btw @Florian Kühnlenz I added this PR https://github.com/PrefectHQ/prefect/pull/5795

    Florian Kühnlenz

    4 months ago
    @Anna Geller the problem is still(?) happening, see flow run
    d7024b10-e2da-4d5a-bf28-9b6c9ca6cd22
    We have not managed to rewrite the flow yet using the workaround you provided.

    Anna Geller

    4 months ago
    I have no other recommendation as of now other than not using StartFlowRun - the PR I submitted got merged, so you may soon be able to change that with a single flag once we release it, but this would require an upgrade - not sure if that's easier than rewriting to create_flow_run, especially given that this task is much nicer than StartFlowRun.

    Florian Kühnlenz

    4 months ago
    Alright. But as mentioned, we sometimes need to wait for the flow to finish, which made StartFlowRun easier for us to use. Is there an ongoing effort to investigate the problem with the API?

    Anna Geller

    4 months ago
    Sorry, I should have mentioned that: we investigated the issue and have an open ticket about it internally. It seems that when you start that many StartFlowRun task runs that spin up child flow runs and create artifacts, you can hit a race condition where your flow tries to create an artifact for a task run which doesn't exist yet. We are aware of the issue on the Cloud side and recommend the workaround as discussed. Sorry, I should have mentioned the open issue right away.

    Florian Kühnlenz

    4 months ago
    Okay, thanks. That's much more encouraging 😃.