Andrew Lawlor

    5 months ago
    I am running pipelines in GKE using Prefect Cloud, and I intermittently see the following error, which seems to come up more often when I am running lots of flows at the same time. Can anyone help?
    Failed to load and execute flow run: RefreshError(TransportError("Failed to retrieve <http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ACCOUNT/token?scopes=SCOPES> from the Google Compute Engine metadata service. Status: 500 Response:\nb'Unable to generate access token; IAM returned \\n'", <google.auth.transport.requests._Response object at 0x7f5b071fd670>))
Anna Geller

    5 months ago
    Googling around a bit, it looks like a service account issue. You could try to create a new service account and change the service account on the pods. From here: "It turned out that the Google Service Account (attached to the pod) lost its IAM binding for `roles/iam.workloadIdentityUser`. Re-issuing the `gcloud iam service-accounts add-iam-policy-binding` command fixed the issue."
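    For reference, re-issuing that binding looks roughly like this (a sketch only - the service account, project, namespace, and Kubernetes service account names are placeholders for your actual setup):

```shell
# Re-grant the Workload Identity binding the pod's Google Service Account lost.
# MY_GSA, MY_PROJECT, MY_NAMESPACE, and MY_KSA are placeholders.
gcloud iam service-accounts add-iam-policy-binding \
  MY_GSA@MY_PROJECT.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:MY_PROJECT.svc.id.goog[MY_NAMESPACE/MY_KSA]"
```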
Andrew Lawlor

    5 months ago
    I am able to connect most of the time though, even from pods. I just ran 79 flows, and 78 of them worked while 1 failed. There wasn't any difference in the configuration between them.
Anna Geller

    5 months ago
    do you have a GCP support subscription? GCP would know more about it. I would guess this is a transient issue related to some rate limiting on the GCP side - perhaps there are API call limits on the metadata service for retrieving the IAM token, and you are crossing them? Can you share how you define your flow, especially the run config? Perhaps there is a way to retrieve that token fewer times.
Andrew Lawlor

    5 months ago
    run_config = KubernetesRun(
        image=docker_image,
        labels=[project_id],
        cpu_request='200m',
        memory_request='128Mi',
        env={ENV_VARS},
    )
    I also use a LocalDaskExecutor and GCS storage.
Anna Geller

    5 months ago
    I see, it looks like this error message comes from the IAM service granting permissions to GCS - would you mind trying to switch to something like GitHub storage to check whether this fixes the issue? Worth checking to confirm that GCS is the issue here.
Andrew Lawlor

    5 months ago
    I'd have to talk to my team about that. We don't have any GitHub repositories set up right now. And either way, we would need GCP access to talk to GCS/BigQuery/Cloud SQL in our pipelines.
    I did ask GCP though.
Anna Geller

    5 months ago
    I didn't mean to change all your flows immediately, but rather to try it out with 1-3 flows that were previously failing, just to confirm that GCS is the culprit here.
Andrew Lawlor

    5 months ago
    Don't I need to set up a GitHub repo though?
Anna Geller

    5 months ago
    Yes, for sure, but it's free to use, and you can even create a private repo and store the access token in the Prefect Cloud Secrets backend. So no need to rewrite your flows or anything. Or what's your worry about GitHub - have you not used GitHub before?
Andrew Lawlor

    5 months ago
    My worry was that I don't want our stuff to be public, and I thought private GitHub repos weren't free - but I misunderstood the GitHub pricing. I can create a private repo.
    Similarly, I also see this sometimes:
    google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see <https://cloud.google.com/docs/authentication/getting-started>
    Again, it doesn't happen every time, only when lots of flows are running at once. My credentials are set on the pods, so I think it's also related to a rate limit. This time it happens on connecting to BigQuery.
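    Since these failures are transient and load-dependent, one generic mitigation (independent of any Prefect setting) is to wrap whatever call first needs GCP credentials in retries with exponential backoff and jitter. A stdlib-only sketch - `fetch_token` here is a stand-in for the real call (e.g. constructing the BigQuery client), not an actual GCP API:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off 1x, 2x, 4x ... the base delay, plus random jitter so
            # 160 concurrent runs don't all retry at the same instant.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))

# Demo: a flaky stand-in that fails twice, then succeeds.
calls = {"n": 0}
def fetch_token():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("metadata service 500")
    return "token"

print(with_backoff(fetch_token, base_delay=0.01))  # prints: token
```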
Anna Geller

    5 months ago
    I see - interesting! No idea how GCP handles that under the hood. I know on AWS there are many "soft" limits, so if you write to them, you can increase the default limits to some extent. Perhaps it's worth sending a message to GCP and asking? It's the first time I've seen this type of error reported by the community. How many flow runs and task runs are you running concurrently when those errors are triggered?
Andrew Lawlor

    5 months ago
    I don't know the exact numbers. I saw it that time with 160 concurrent flow runs, all of which were hitting BigQuery.
Anna Geller

    5 months ago
    thanks a lot, noted that 📝 LMK if there is any way I can help - so far I've shared pretty much all I know about the issue. As a next step, I would write to GCP and try switching storage to GitHub to be sure that the issue is due to GCS. As a workaround, you could also use concurrency limits to ensure that no more than, say, 10 runs with a specific label/tag are running at the same time, which may help mitigate the problem.
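    On the concurrency-limit idea: in Prefect Cloud these are configured per label (flow runs) or tag (task runs) via the UI or API, but the underlying throttle is conceptually just a counting semaphore. A stdlib-only illustration of the idea (the limit of 3 and the dummy work are made up for the example):

```python
import threading
import time

MAX_CONCURRENT = 3          # e.g. "at most 3 runs with this label at once"
slots = threading.Semaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = {"now": 0, "peak": 0}

def run_flow(i):
    with slots:             # blocks while MAX_CONCURRENT runs are active
        with lock:
            active["now"] += 1
            active["peak"] = max(active["peak"], active["now"])
        time.sleep(0.02)    # stand-in for the actual flow work
        with lock:
            active["now"] -= 1

threads = [threading.Thread(target=run_flow, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(active["peak"])       # never exceeds MAX_CONCURRENT
```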
Andrew Lawlor

    5 months ago
    Thank you. I'm trying GitHub storage now. I will also try concurrency limits.