How can I handle "internal error" in Prefect cloud...
# ask-community
h
How can I handle "internal error" in Prefect cloud?
Copy code
Failed to retrieve task state with error: ClientError([{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
    task_run_info = self.client.get_task_run_info(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 1798, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 569, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}]
z
Hey @haf -- what version are you running?
We've seen this happening for 0.15.5+ since there was a change in task/flow registration. There have only been a couple of reports and I'm eager to find out what's going on. cc @Joël Luijmes @Zach Angell
h
I just upgraded prefect yes
0.15.6
FROM prefecthq/prefect:0.15.6-python3.8
z
Have you reregistered any of your flows since upgrading, or are you running flows registered with an older version?
h
I have reregistered
Previously was 0.15.4
z
And you're using Cloud?
h
Yes
z
Did you just upgrade this one failing flow or did you upgrade the rest as well and they're fine?
h
Just this
It's failing on something in between the configuration and the mapping of these Parameters into the initial tasks
z
Can you create an MRE? Perhaps by creating a no-op flow that maps a parameter and some other values, as your real flow does in its first step?
Can you also confirm this occurs on 0.15.5 as well?
Which executor are you using?
(We've opened a new tracking issue for this at https://github.com/PrefectHQ/prefect/issues/5075)
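(For reference, a no-op repro along those lines might look like the sketch below; all names are placeholders mirroring the first step of the real flow, not actual Prefect internals.)
Copy code
from os import getenv

from prefect import Flow, Parameter, task

# Placeholder parameter with an environment-based default, as in the real flow
pg_db = Parameter("LOGARY_PG_DB", default=getenv("LOGARY_PG_DB", default="analytics"))

@task(nout=2)
def noop(dbname: str):
    # No-op stand-in for the first real task; returns two values like fetch_model does
    return dbname, {"dbname": dbname}

with Flow("uuid-error-repro") as flow:
    model_id, model = noop(pg_db)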
h
executor = unspecified
but it's on kubernetes with the cloud agent
I haven't tested 0.15.5 and I need some sleep right now
z
Thanks! Sounds good. Feel free to report back in that tracking issue; I've also opened a branch to explore debugging solutions.
h
Copy code
pg_db = Parameter("LOGARY_PG_DB", default=getenv("LOGARY_PG_DB", default="analytics"))
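# The other inputs used below (pg_user, pg_password, pg_host, pg_port, pg_sslmode) are
# defined elsewhere in the module; later in the thread they turn out to be a mix of
# Parameter and LogarySecret tasks.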
@task(nout=2)
def fetch_model(dsn_params: DSNParams) -> Tuple[str, ModelDTO]:
    m = _fetch_model(dsn_params)
    return m.id, m

with Flow(
    "run_mmm",
    state_handlers=[print_state_callback],
) as flow:
    # https://deepnote.com/project/Media-Mix-Model-5xns-00xTG6nRlUK1f9DfA/%2Ftest_preprocessing.ipynb

    #
    ########### CONFIG #######

    dsn_params = lambda name: build_dsn_params(
        user=pg_user,
        password=pg_password,
        host=pg_host,
        port=pg_port,
        dbname=pg_db,
        sslmode=pg_sslmode,
        application_name=f"mmm/{name}",
    )
    model_id, model = fetch_model(dsn_params("fetch_model"))
According to the graph this is a minimal repro
Also this works when running it locally
It only fails in the cloud k8s runner
z
Locally meaning prefect run -p or flow.run? Does it work with the cloud local agent?
h
flow.run
z
👍
h
"cloud local agent" sounds like a paradox
z
You could give it a quick go with prefect run --name "run_mmm" --execute to do an agentless run (that still interacts with Cloud)
z
Hey Haf - could you double check the Prefect version on your Kubernetes agent? From what I can see, the state set on that flow run was using Prefect version 0.14.22.
h
Yes it’s an old version. Let me upgrade it
Upgrading to 0.15.7, which you just released, across the Dockerfile, poetry, and the agent
Still a problem!
[2021-10-22 10:40:06+0000] ERROR - prefect.CloudTaskRunner | Failed to retrieve task state with error: ClientError([{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}])
Upgraded everything
0.15.7
Same problem when hard-deleting the flow and starting over
Downgrading.
Would be great to hear about solutions to this when you have them.
I realise now that because of how Prefect's k8s agent is configured (it isn't properly in source control, so I can't override the job template), I had to change how the agent.py file looks on disk: you'll find all of that discussion further up.
Copy code
volumeMounts:
        - mountPath: /opt/job_template.yaml
          name: prefect-agent-conf
          subPath: job_template.yaml 
        - mountPath: /usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py
          name: prefect-agent-conf
          subPath: kubernetes_agent.py
If I could have proper job template support with on-disk k8s yamls, I wouldn't have to do this
This didn't help, though; now I'm getting the same error on 0.15.4
a
@haf if you reregister the flow after upgrading, do you still get the same error? I remember the UUID error could be fixed with reregistering. If multiple flows are affected, you could register them in bulk with
Copy code
prefect register --project project_name -p /path/to/flows/that/need/registration
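(For a single flow, the Python-side equivalent is simply a call like the one below; the project name is a placeholder.)
Copy code
# Register one flow from Python instead of the bulk CLI command above;
# assumes `flow` is the Flow object built in your registration script.
flow.register(project_name="project_name")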
h
Yes I reregistered it all
a
ok, thanks for that. We will work on a fix then.
🙌 1
@haf just to confirm: did you register the affected flows in bulk? Could you try to register a single one and see if the error occurs again? This would tell us whether it's related to batched flow registration.
h
No I did a flow.register()
Just this one
j
@Anna Geller I was/am also running into this. I've registered the flows with a manual script calling flow.register(), just as haf mentioned https://prefect-community.slack.com/archives/CL09KU1K7/p1632735840323400
😔 1
I have some time in a couple of hours to debug this issue further. If there are suggestions to try, let me know.
upvote 1
a
thanks to you both. If you could provide the following info, this would be helpful: @haf for you it happened in both 0.15.7 and 0.15.4 with Kubernetes agent and Local executor, correct? @Joël Luijmes which Prefect version, agent and executor were you using when this error occurred?
h
Not quite correct because it initially only happened after upgrading the flow to 0.15.6
j
0.15.6 I tried (upgraded from .2 directly), running on Kubernetes (GCP), using the LocalDaskExecutor
👍 1
h
But I have another flow that is running just fine
👍 1
As we speak
a
thanks, it's valuable to know that it only happens sporadically
h
It always happens for one of these flows
j
Hm, that is interesting. In my case all of the 40ish flows failed to start
h
But feel free to access my account and have a look
This is my deployment
Copy code
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-agent

spec:
  replicas: 1

  template:
    spec:
      serviceAccountName: prefect-agent

      containers:
      - name: agent
        args:
        - prefect agent kubernetes start --job-template /opt/job_template.yaml
        command: [ "/bin/bash", "-c" ]

        env:
        - name: PREFECT__CLOUD__AGENT__AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: prefect-agent
              key: prefect-cloud-token
        - name: PREFECT__CLOUD__API
          value: https://api.prefect.io
        - name: NAMESPACE
          value: flows
        - name: IMAGE_PULL_SECRETS
          value: ''
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '["dev"]'
        - name: JOB_MEM_REQUEST
          value: ''
        - name: JOB_MEM_LIMIT
          value: ''
        - name: JOB_CPU_REQUEST
          value: ''
        - name: JOB_CPU_LIMIT
          value: ''
        - name: IMAGE_PULL_POLICY
          value: IfNotPresent
        - name: SERVICE_ACCOUNT_NAME
          value: prefect-agent
        - name: PREFECT__BACKEND
          value: cloud
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://:8080
        # - name: PREFECT__LOGGING__LEVEL
        #   value: DEBUG
        - name: PREFECT__CLOUD__AGENT__LEVEL
          value: DEBUG

        image: prefecthq/prefect:0.15.4-python3.8

        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /api/health
            port: 8080
          initialDelaySeconds: 40
          periodSeconds: 40

        resources:
          requests:
            cpu: 200m
            memory: 40Mi
          limits:
            cpu: 1000m
            memory: 1024Mi

        volumeMounts:
        - mountPath: /opt/job_template.yaml
          name: prefect-agent-conf
          subPath: job_template.yaml 
        - mountPath: /usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py
          name: prefect-agent-conf
          subPath: kubernetes_agent.py

      volumes:
      - name: prefect-agent-conf
        configMap:
          name: prefect-agent
👍 1
z
Can either of you confirm if this issue occurs on 0.15.5 as well?
@haf you mentioned you saw this error on 0.15.4?
Identifying the first version where this was introduced will be the first step to determining the cause.
h
I did not initially see this error on 0.15.4
I have not tested 0.15.5
Upgrading the flow to 0.15.6 and registering it (without upgrading the k8s agent) causes this error for that flow
Downgrading that flow to 0.15.4 again did not help; the error remained
@Zanie
Other, untouched flows do not experience this error.
z
Downgrading the flow to 0.15.4 being: registering it on 0.15.4 / running it on a 0.15.4 container / using a 0.15.4 agent?
h
yes
all of the above
z
Can you reproduce this with an agentless run as I mentioned before?
If so, can you reproduce this while using the branch at https://github.com/PrefectHQ/prefect/pull/5076 so we can get some additional logs?
j
I tried replicating it earlier today by deploying a fresh “staging” install, so that I’m not affecting the production Prefect flows. Unfortunately, I was unable to replicate it that way. What I did:
1. Deploy Prefect 0.15.2 (as our production was on that version)
2. Upgrade the deployment to .6
3. Create a dummy flow and register it from my local machine
4. Run a flow => no issues
The differences between the deployments are a) a dummy flow which just prints hello world, b) a fresh instance / database, c) a different version of the Helm chart (although I doubt that matters much). At this time I can’t risk upgrading the production version due to running flows. I’ll try the upgrade again tomorrow and see if it’s still broken. If so, I can try to run from the branch. Should I install that version in the flow, or in the backend services (and if so, how)?
h
Hi guys, any updates on this?
@Zanie How do I reproduce it? I use poetry: how can I configure it to use your branch inside my docker container, and where do I take the agent token from inside the docker container? Do I use the same service account? What happens when you have these logs?
a
@haf I believe you can do:
Copy code
git clone --branch task-run-info-missing-id https://github.com/PrefectHQ/prefect.git
pip install ./prefect
not 100% sure, but in poetry this could be:
Copy code
prefect = { git = "https://github.com/PrefectHQ/prefect.git", branch = "task-run-info-missing-id" }
h
Thanks, I'll try that then. Is this a priority for your team to fix?
a
@haf I believe so. It's hard to determine the root cause because it occurs only for a small fraction of flows. Feel free to chime in on the open issue if you know more about the use cases in which it happens (flow running on storage A with run configuration B, how it was registered, agent X, Prefect version Y, executor Z). I tried to reproduce the issue in several ways and the error did not occur in any of the flows. Btw the PR shows how to install it:
Copy code
pip install -U git+https://github.com/PrefectHQ/prefect@task-run-info-missing-id#egg=prefect
h
Why don't you just instrument your code properly?
Add OpenTelemetry everywhere, plus structured logging. If you have a bug you can't immediately find from reading the logs, you're not logging enough.
It feels so inadequate just to add two new error statements like this PR is doing.
Log all contexts, log all control flow, make it opt-in when debugging, via an env var.
The latest one is best at explaining how to log context
This is what you should do to really root out bugs like this. At some scale even rare events will consume disproportionate amounts of time (like this bug is doing right now)
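(A minimal sketch of the kind of opt-in, env-var-gated context logging being described here; the env var name and helper are hypothetical, not existing Prefect functionality.)
Copy code
import logging
from os import getenv

# Hypothetical opt-in switch: only emit verbose context logs when explicitly enabled
DEBUG_TASK_CONTEXT = getenv("DEBUG_TASK_CONTEXT", "") == "1"

def log_context(logger: logging.Logger, context: dict) -> None:
    # Dump the full task run context so rare failures carry enough detail to diagnose
    if DEBUG_TASK_CONTEXT:
        logger.debug("task run context: %s", {k: context[k] for k in sorted(context)})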
Trying the GH branch now
Running from the GH branch works without a crash!
🎉 2
🙌 1
👍 1
@Zanie Now it started failing again
Copy code
[2021-10-25 20:50:18+0000] ERROR - prefect.CloudTaskRunner | Failed to retrieve task state with error: ValueError('`task_id` missing from task run context')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 158, in initialize_run
    raise ValueError("`task_id` missing from task run context")
ValueError: `task_id` missing from task run context
And
Task slug(s) missing from the current flow missing from the flow stored in the Prefect backend: {'LOGARY_PG_SSLMODE', 'LOGARY_PG_PASSWORD-1', 'LOGARY_PG_HOST', 'LOGARY_PG_USER-1'}
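(A quick way to check which of those slugs exist in the locally built flow at registration time; a sketch that assumes flow is the Flow object being registered on Prefect 0.15.x.)
Copy code
# Compare the backend's "missing" slugs against the slugs in the local flow object
reported_missing = {"LOGARY_PG_SSLMODE", "LOGARY_PG_PASSWORD-1",
                    "LOGARY_PG_HOST", "LOGARY_PG_USER-1"}
local_slugs = {t["slug"] for t in flow.serialize()["tasks"]}
print("present in the local flow:", reported_missing & local_slugs)
print("absent from the local flow:", reported_missing - local_slugs)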
z
Thanks for getting us that debug information!
Are these all parameters?
h
What do you mean?
It's all the logs say, but they aren't all the parameters that the flow takes
z
I'm wondering if those slugs all belong to "Parameter" tasks
h
They are all LogarySecrets:
Copy code
from typing import Optional

from prefect.client.secrets import Secret as _Secret
from prefect.core.task import Task
from prefect.engine.results.secret_result import SecretResult


class LogarySecret(Task):
    """
    Prefect Secrets Task.  This task retrieves the underlying secret through
    the Prefect Secrets API (which has the ability to toggle between local vs. Cloud secrets).

    Args:
        - name (str, optional): The name of the underlying secret
        - **kwargs (Any, optional): additional keyword arguments to pass to the Task constructor

    Raises:
        - ValueError: if a `result` keyword is passed
    """

    secret_name: Optional[str]
    default: Optional[str]

    def __init__(self, name=None, default=None, **kwargs):
        if kwargs.get("result"):
            raise ValueError("Result types for Secrets are not configurable.")
        kwargs["checkpoint"] = False
        self.secret_name = name
        self.default = default
        super().__init__(name=name, **kwargs)
        self.result = SecretResult(secret_task=self)

    def run(self, name: str = None):
        """
        The run method for Secret Tasks.  This method actually retrieves and returns the
        underlying secret value using the `Secret.get()` method.  Note that this method first
        checks context for the secret value, and if not found either raises an error or queries
        Prefect Cloud, depending on whether `config.cloud.use_local_secrets` is `True` or
        `False`.

        Args:
            - name (str, optional): the name of the underlying Secret to retrieve. Defaults
                to the name provided at initialization.

        Returns:
            - Any: the underlying value of the Prefect Secret
        """
        if name is None:
            name = self.secret_name

        if name is None:
            raise ValueError("A secret name must be provided.")

        _s = _Secret(name)
        if not _s.exists() and self.default is not None:
            return self.default

        return _s.get()
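(Illustrative usage only, with hypothetical names and defaults rather than the actual flow code: a secret task whose default comes from an environment variable, so what gets built depends on the shell or container the flow is constructed in.)
Copy code
from os import getenv

# Hypothetical instantiation of the secret tasks (LogarySecret as defined above;
# names and defaults are placeholders)
pg_password = LogarySecret("LOGARY_PG_PASSWORD", default=getenv("LOGARY_PG_PASSWORD"))
pg_sslmode = LogarySecret("LOGARY_PG_SSLMODE", default=getenv("LOGARY_PG_SSLMODE", default="require"))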
z
I'm curious how these secrets are being passed into tasks. Are they a part of build_dsn_params?
h
@Zanie I hope you're able to make heads or tails of this with the flow I sent you in DMs
Any update to this?
z
Hey haf, I'm balancing this with a lot of other work. I'll report back when I have something.
😔 1
h
ok, thank you
z
From looking at your flow, these missing slugs are a combination of parameters and secrets. The two with -1 appended are secrets; the others are parameters.
If you want things to go faster, the main work item here is to create a minimal reproducible example. We have not been able to reproduce this, which makes it very hard for us to debug it for you.
h
Wouldn't it be more constructive to add tracing to the code-base?
z
We're not going to be able to find a bug in traces from a large flow like that
Are you registering your flows via the CLI or flow.register()?
h
flow.register
Ok, I think I can find the bug in the logs; I'm very used to that. So let me rephrase the question: are you interested in adding tracing ability to your non-Orion codebase?
z
I'm not really sure what you're asking for, but I've pushed some additional logs to the registration code on that branch; if you want to give that a go, we can see if those slugs are missing.
h
Hi any update?
z
Hey, did you run the latest tracing to see if the slugs are missing at registration time?
h
No I’ll run it before I go to bed today! Thank you for enabling me to debug!
j
Bit later than expected, but I just reupgraded my server to 0.15.6 and rebuilt and reregistered with 0.15.6, and right now the flows are running just fine. 🤯
h
@Zanie I ran with the new code but it's still crashing; I sent you the client trace in DM
Looking forward to hearing something, at least a workaround; I've tried deleting the job, reregistering, and so on
Hey, so @Zanie found the problem and the solution: I had environment variables that made the usage of parameters conditional on their presence. This explains why the flows sometimes ran and sometimes didn't, depending on which env vars were set in the shell I registered the flow from. Big thank you to Michael!
marvin 4
🙏 2
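(A minimal illustration of that failure mode, with placeholder names rather than the actual flow: if the flow's structure depends on environment variables at build time, the task slugs registered from one shell will not match the flow rebuilt in another environment.)
Copy code
from os import getenv

from prefect import Flow, Parameter

with Flow("conditional-structure") as flow:
    if getenv("LOGARY_PG_HOST"):
        # The Parameter task only exists if the env var was set in the shell
        # used to build and register the flow...
        host = Parameter("LOGARY_PG_HOST", default=getenv("LOGARY_PG_HOST"))
    else:
        # ...otherwise the flow is built without it, so the set of task slugs
        # differs between registration and the runtime environment.
        host = "localhost"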
z
Glad we sorted it out! @Joël Luijmes here's your chance to confess if you're up to anything weird like that! 😛
😀 1
k
Big thank you to @Zanie!
j
Nope, I don't really work with parameters (only 1 of the many flows uses them). Must have been something bogus 🤷‍♂️ At least glad I can run the latest versions again 🙂
z
Let us know if you see it again -- that branch will help narrow down the misbehaving slugs. We'll probably roll some of those logs into an actual release as well.
👍 1