How can I handle "internal error" in Prefect cloud...
# ask-community
h
How can I handle "internal error" in Prefect cloud?
Copy code
Failed to retrieve task state with error: ClientError([{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
    task_run_info = self.client.get_task_run_info(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1798, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 569, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}]
z
Hey @haf -- what version are you running?
We've seen this happening for 0.15.5+ since there was a change in task/flow registration. There have only been a couple of reports and I'm eager to find out what's going on. cc @Joël Luijmes @Zach Angell
h
I just upgraded prefect yes
0.15.6
FROM prefecthq/prefect:0.15.6-python3.8
z
Have you reregistered any of your flows since upgrading, or are you running flows registered with an older version?
h
I have reregistered
Previously was 0.15.4
z
And you're using Cloud?
h
Yes
z
Did you just upgrade this one failing flow or did you upgrade the rest as well and they're fine?
h
Just this
It's failing on something in between the configuration and the mapping of these Parameters into the initial tasks
z
Can you create an MRE? Perhaps by creating a no-op flow that maps a parameter and some other values, as your real flow does in its first step?
Can you also confirm this occurs on 0.15.5 as well?
Which executor are you using?
(We've opened a new tracking issue for this at https://github.com/PrefectHQ/prefect/issues/5075)
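(For reference, a no-op repro along those lines might look like the sketch below; all names are placeholders mirroring the first step of the real flow, not actual Prefect internals.)
Copy code
from os import getenv

from prefect import Flow, Parameter, task

# Placeholder parameter with an environment-based default, as in the real flow
pg_db = Parameter("LOGARY_PG_DB", default=getenv("LOGARY_PG_DB", default="analytics"))

@task(nout=2)
def noop(dbname: str):
    # No-op stand-in for the first real task; returns two values like fetch_model does
    return dbname, {"dbname": dbname}

with Flow("uuid-error-repro") as flow:
    model_id, model = noop(pg_db)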
h
executor = unspecified
but it's on kubernetes with the cloud agent
I haven't tested 0.15.5 and I need some sleep right now
z
Thanks! Sounds good. Feel free to report back in that tracking issue; I've also opened a branch to explore debugging solutions.
h
Copy code
pg_db = Parameter("LOGARY_PG_DB", default=getenv("LOGARY_PG_DB", default="analytics"))
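# The other inputs used below (pg_user, pg_password, pg_host, pg_port, pg_sslmode) are
# defined elsewhere in the module; later in the thread they turn out to be a mix of
# Parameter and LogarySecret tasks.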
@task(nout=2)
def fetch_model(dsn_params: DSNParams) -> Tuple[str, ModelDTO]:
    m = _fetch_model(dsn_params)
    return m.id, m

with Flow(
    "run_mmm",
    state_handlers=[print_state_callback],
) as flow:
    # https://deepnote.com/project/Media-Mix-Model-5xns-00xTG6nRlUK1f9DfA/%2Ftest_preprocessing.ipynb

    #
    ########### CONFIG #######

    dsn_params = lambda name: build_dsn_params(
        user=pg_user,
        password=pg_password,
        host=pg_host,
        port=pg_port,
        dbname=pg_db,
        sslmode=pg_sslmode,
        application_name=f"mmm/{name}",
    )
    model_id, model = fetch_model(dsn_params("fetch_model"))
According to the graph this is a minimal repro
Also this works when running it locally
It only fails in the cloud k8s runner
z
Locally meaning prefect run -p or flow.run? Does it work with the cloud local agent?
h
flow.run
z
👍
h
"cloud local agent" sounds like a paradox
z
You could give it a quick go with prefect run --name "run_mmm" --execute to do an agentless run (that still interacts with Cloud)
z
Hey Haf - could you double check the Prefect version on your Kubernetes agent? From what I can see, the state set on that flow run was using Prefect version 0.14.22.
h
Yes it’s an old version. Let me upgrade it
Upgrading to 0.15.7, which you just released, across the Dockerfile, poetry, and the agent
Still a problem!
[2021-10-22 10:40:06+0000] ERROR - prefect.CloudTaskRunner | Failed to retrieve task state with error: ClientError([{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}])
Upgraded everything
0.15.7
Same problem when hard-deleting the flow and starting over
Downgrading.
Would be great to hear about solutions to this when you have them.
I realise now that because of how Prefect's k8s agent is configured (it isn't properly in source control, so I can't override the job template), I had to change how the agent.py file looks on disk: you'll find all of that discussion further up.
Copy code
volumeMounts:
        - mountPath: /opt/job_template.yaml
          name: prefect-agent-conf
          subPath: job_template.yaml 
        - mountPath: /usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py
          name: prefect-agent-conf
          subPath: kubernetes_agent.py
If I could have proper job template support with on-disk k8s yamls, I wouldn't have to do this
This didn't help, though; now I'm getting the same error on 0.15.4
a
@haf if you reregister the flow after upgrading, do you still get the same error? I remember the UUID error could be fixed with reregistering. If multiple flows are affected, you could register them in bulk with
Copy code
prefect register --project project_name -p /path/to/flows/that/need/registration
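(For a single flow, the Python-side equivalent is simply a call like the one below; the project name is a placeholder.)
Copy code
# Register one flow from Python instead of the bulk CLI command above;
# assumes `flow` is the Flow object built in your registration script.
flow.register(project_name="project_name")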
h
Yes I reregistered it all
a
ok, thanks for that. We will work on a fix then.
🙌 1
@haf just to confirm: did you register the affected flows in bulk? Could you try to register a single one and see if the error occurs again? This would tell us whether it's related to batched flow registration.
h
No I did a flow.register()
Just this one
j
@Anna Geller I was/am also running into this. I've registered the flows with a manual script calling flow.register(), just as haf mentioned https://prefect-community.slack.com/archives/CL09KU1K7/p1632735840323400
😔 1
I have some time in a couple of hours to debug this issue further. If there are suggestions to try, let me know.
upvote 1
a
thanks to you both. If you could provide the following info, this would be helpful: @haf for you it happened in both 0.15.7 and 0.15.4 with Kubernetes agent and Local executor, correct? @Joël Luijmes which Prefect version, agent and executor were you using when this error occurred?
h
Not quite correct because it initially only happened after upgrading the flow to 0.15.6
j
0.15.6 I tried (upgraded from .2 directly), running on Kubernetes (GCP), using the LocalDaskExecutor
👍 1
h
But I have another flow that is running just fine
👍 1
As we speak
a
thanks, it's valuable to know that it only happens sporadically
h
It always happens for one of these flows
j
Hm, that is interesting. In my case all of the 40ish flows failed to start
h
But feel free to access my account and have a look
This is my deployment
Copy code
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-agent

spec:
  replicas: 1

  template:
    spec:
      serviceAccountName: prefect-agent

      containers:
      - name: agent
        args:
        - prefect agent kubernetes start --job-template /opt/job_template.yaml
        command: [ "/bin/bash", "-c" ]

        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: prefect-agent
              key: prefect-cloud-token
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: flows
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '["dev"]'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: IfNotPresent
        - name: SERVICE_ACCOUNT_NAME
          value: prefect-agent
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        # - name: PREFECT__LOGGING__LEVEL
        #   value: DEBUG
        - name: PREFECT__CLOUD__AGENT__LEVEL
          value: DEBUG

        image: prefecthq/prefect:0.15.4-python3.8

        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40

        resources:
          requests:
            cpu: 200m
            memory: 40Mi
          limits:
            cpu: 1000m
            memory: 1024Mi

        volumeMounts:
        - mountPath: /opt/job_template.yaml
          name: prefect-agent-conf
          subPath: job_template.yaml 
        - mountPath: /usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py
          name: prefect-agent-conf
          subPath: kubernetes_agent.py

      volumes:
      - name: prefect-agent-conf
        configMap:
          name: prefect-agent
👍 1
z
Can either of you confirm if this issue occurs on 0.15.5 as well?
@haf you mentioned you saw this error on 0.15.4?
Identifying the first version where this was introduced will be the first step to determining the cause.
h
I did not initially see this error on 0.15.4
I have not tested 0.15.5
Upgrading the flow to 0.15.6 and registering it (without upgrading the k8s agent) causes this error for that flow
Downgrading that flow to 0.15.4 again did not help; the error remained
@Zanie
Other, untouched flows do not experience this error.
z
Downgrading the flow to 0.15.4 being: registering it on 0.15.4 / running it on a 0.15.4 container / using a 0.15.4 agent?
h
yes
all of the above
z
Can you reproduce this with an agentless run as I mentioned before?
If so, can you reproduce this while using the branch at https://github.com/PrefectHQ/prefect/pull/5076 so we can get some additional logs?
j
I tried replicating it earlier today by deploying a fresh “staging” install, so that I’m not affecting the production Prefect flows. Unfortunately, I was unable to replicate it that way. What I did:
1. Deploy Prefect 0.15.2 (as our production was on that version)
2. Upgrade the deployment to .6
3. Create a dummy flow and register it from my local machine
4. Run a flow => no issues
The differences between the deployments are a) a dummy flow which just prints hello world, b) a fresh instance / database, c) a different version of the Helm chart (although I doubt that matters much). At this time I can’t risk upgrading the production version due to running flows. I’ll try the upgrade again tomorrow and see if it’s still broken. If so, I can try to run from the branch. Should I install that version in the flow, or in the backend services (and if so, how)?
h
Hi guys, any updates on this?
@Zanie How do I reproduce it? I use poetry: how can I configure it to use your branch inside my docker container, and where do I take the agent token from inside the docker container? Do I use the same service account? What happens when you have these logs?
a
@haf I believe you can do:
Copy code
git clone --branch task-run-info-missing-id https://github.com/PrefectHQ/prefect.git
pip install ./prefect
not 100% sure, but in poetry this could be:
Copy code
prefect = { git = "https://github.com/PrefectHQ/prefect.git", branch = "task-run-info-missing-id" }
h
Thanks, I'll try that then. Is this a priority for your team to fix?
a
@haf I believe so. It's hard to determine the root cause because it occurs only for a small fraction of flows. Feel free to chime in on the open issue if you know more about the use cases in which it happens (flow running on storage A with run configuration B, how it was registered, agent X, Prefect version Y, executor Z). I tried to reproduce the issue in several ways and the error did not occur in any of the flows. Btw the PR shows how to install it:
Copy code
pip install -U git+https://github.com/PrefectHQ/prefect@task-run-info-missing-id#egg=prefect
h
Why don't you just instrument your code properly?
Add OpenTelemetry everywhere, plus structured logging. If you have a bug you can't immediately find from reading the logs, you're not logging enough.
It feels so inadequate just to add two new error statements like this PR is doing.
Log all contexts, log all control flow, make it opt-in when debugging, via an env var.
The latest one is best at explaining how to log context
This is what you should do to really root out bugs like this. At some scale even rare events will consume disproportionate amounts of time (like this bug is doing right now)
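(A minimal sketch of the kind of opt-in, env-var-gated context logging being described here; the env var name and helper are hypothetical, not existing Prefect functionality.)
Copy code
import logging
from os import getenv

# Hypothetical opt-in switch: only emit verbose context logs when explicitly enabled
DEBUG_TASK_CONTEXT = getenv("DEBUG_TASK_CONTEXT", "") == "1"

def log_context(logger: logging.Logger, context: dict) -> None:
    # Dump the full task run context so rare failures carry enough detail to diagnose
    if DEBUG_TASK_CONTEXT:
        logger.debug("task run context: %s", {k: context[k] for k in sorted(context)})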
Trying the GH branch now
Running from the GH branch works without a crash!
🎉 2
🙌 1
👍 1
@Zanie Now it started failing again
Copy code
[2021-10-25 20:50:18+0000] ERROR - prefect.CloudTaskRunner | Failed to retrieve task state with error: ValueError('`task_id` missing from task run context')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 158, in initialize_run
    raise ValueError("`task_id` missing from task run context")
ValueError: `task_id` missing from task run context
And
Task slug(s) missing from the current flow missing from the flow stored in the Prefect backend: {'LOGARY_PG_SSLMODE', 'LOGARY_PG_PASSWORD-1', 'LOGARY_PG_HOST', 'LOGARY_PG_USER-1'}
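(A quick way to check which of those slugs exist in the locally built flow at registration time; a sketch that assumes flow is the Flow object being registered on Prefect 0.15.x.)
Copy code
# Compare the backend's "missing" slugs against the slugs in the local flow object
reported_missing = {"LOGARY_PG_SSLMODE", "LOGARY_PG_PASSWORD-1",
                    "LOGARY_PG_HOST", "LOGARY_PG_USER-1"}
local_slugs = {t["slug"] for t in flow.serialize()["tasks"]}
print("present in the local flow:", reported_missing & local_slugs)
print("absent from the local flow:", reported_missing - local_slugs)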
z
Thanks for getting us that debug information!
Are these all parameters?
h
What do you mean?
It's all the logs say, but they aren't all the parameters that the flow takes
z
I'm wondering if those slugs all belong to "Parameter" tasks
h
They are all LogarySecrets:
Copy code
from typing import Optional

from prefect.client.secrets import Secret as _Secret
from prefect.core.task import Task
from prefect.engine.results.secret_result import SecretResult


class LogarySecret(Task):
    """
    Prefect Secrets Task.  This task retrieves the underlying secret through
    the Prefect Secrets API (which has the ability to toggle between local vs. Cloud secrets).

    Args:
        - name (str, optional): The name of the underlying secret
        - **kwargs (Any, optional): additional keyword arguments to pass to the Task constructor

    Raises:
        - ValueError: if a `result` keyword is passed
    """

    secret_name: Optional[str]
    default: Optional[str]

    def __init__(self, name=None, default=None, **kwargs):
        if kwargs.get("result"):
            raise ValueError("Result types for Secrets are not configurable.")
        kwargs["checkpoint"] = False
        self.secret_name = name
        self.default = default
        super().__init__(name=name, **kwargs)
        self.result = SecretResult(secret_task=self)

    def run(self, name: str = None):
        """
        The run method for Secret Tasks.  This method actually retrieves and returns the
        underlying secret value using the `Secret.get()` method.  Note that this method first
        checks context for the secret value, and if not found either raises an error or queries
        Prefect Cloud, depending on whether `config.cloud.use_local_secrets` is `True` or
        `False`.

        Args:
            - name (str, optional): the name of the underlying Secret to retrieve. Defaults
                to the name provided at initialization.

        Returns:
            - Any: the underlying value of the Prefect Secret
        """
        if name is None:
            name = self.secret_name

        if name is None:
            raise ValueError("A secret name must be provided.")

        _s = _Secret(name)
        if not _s.exists() and self.default is not None:
            return self.default

        return _s.get()
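(Illustrative usage only, with hypothetical names and defaults rather than the actual flow code: a secret task whose default comes from an environment variable, so what gets built depends on the shell or container the flow is constructed in.)
Copy code
from os import getenv

# Hypothetical instantiation of the secret tasks (LogarySecret as defined above;
# names and defaults are placeholders)
pg_password = LogarySecret("LOGARY_PG_PASSWORD", default=getenv("LOGARY_PG_PASSWORD"))
pg_sslmode = LogarySecret("LOGARY_PG_SSLMODE", default=getenv("LOGARY_PG_SSLMODE", default="require"))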
z
I'm curious how these secrets are being passed into tasks. Are they a part of build_dsn_params?
h
@Zanie I hope you're able to make heads or tails of this with the flow I sent you in DMs
Any update to this?
z
Hey haf, I'm balancing this with a lot of other work. I'll report back when I have something.
😔 1
h
ok, thank you
z
From looking at your flow, these missing slugs are a combination of parameters and secrets. The two with -1 appended are secrets; the others are parameters.
If you want things to go faster, the main work item here is to create a minimal reproducible example. We have not been able to reproduce this, which makes it very hard for us to debug it for you.
h
Wouldn't it be more constructive to add tracing to the code-base?
z
We're not going to be able to find a bug in traces from a large flow like that
Are you registering your flows via the CLI or flow.register()?
h
flow.register
Ok, I think I can find the bug in the logs; I'm very used to that. So let me rephrase the question: are you interested in adding tracing ability to your non-Orion codebase?
z
I'm not really sure what you're asking for, but I've pushed some additional logs to the registration code on that branch; if you want to give that a go, we can see if those slugs are missing.
h
Hi any update?
z
Hey, did you run the latest tracing to see if the slugs are missing at registration time?
h
No I’ll run it before I go to bed today! Thank you for enabling me to debug!
j
Bit later than expected, but I just reupgraded my server to 0.15.6 and rebuilt and reregistered with 0.15.6, and right now the flows are running just fine. 🤯
h
@Zanie I ran with the new code but it's still crashing; I sent you the client trace in DM
Looking forward to hearing something, at least a workaround; I've tried deleting the job, reregistering, and so on
Hey, so @Zanie found the problem and the solution: I had environment variables that made the usage of parameters conditional on their presence. This explains why the flows sometimes ran and sometimes didn't, depending on which env vars were set in the shell I registered the flow from. Big thank you to Michael!
marvin 4
🙏 2
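(A minimal illustration of that failure mode, with placeholder names rather than the actual flow: if the flow's structure depends on environment variables at build time, the task slugs registered from one shell will not match the flow rebuilt in another environment.)
Copy code
from os import getenv

from prefect import Flow, Parameter

with Flow("conditional-structure") as flow:
    if getenv("LOGARY_PG_HOST"):
        # The Parameter task only exists if the env var was set in the shell
        # used to build and register the flow...
        host = Parameter("LOGARY_PG_HOST", default=getenv("LOGARY_PG_HOST"))
    else:
        # ...otherwise the flow is built without it, so the set of task slugs
        # differs between registration and the runtime environment.
        host = "localhost"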
z
Glad we sorted it out! @Joël Luijmes here's your chance to confess if you're up to anything weird like that! 😛
😀 1
k
Big thank you to @Zanie!
j
Nope, I don't really work with parameters (only 1 of the many flows uses them). Must have been something bogus 🤷‍♂️ At least glad I can run the latest versions again 🙂
z
Let us know if you see it again -- that branch will help narrow down the misbehaving slugs. We'll probably roll some of those logs into an actual release as well.
👍 1