# prefect-community
a
I am running pipelines in GKE using Prefect Cloud, and I intermittently see the following error, which seems to come up more often when I am running lots of flows at the same time. Can anyone help?
```
Failed to load and execute flow run: RefreshError(TransportError("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ACCOUNT/token?scopes=SCOPES from the Google Compute Engine metadata service. Status: 500 Response:\nb'Unable to generate access token; IAM returned \\n'", <google.auth.transport.requests._Response object at 0x7f5b071fd670>))
```
a
Googling around a bit, it looks like a service account issue. You could try to create a new service account and change the service account in the pods. From here: "It turned out that the Google Service Account (attached to the pod) lost its IAM binding for `roles/iam.workloadIdentityUser`. Re-issuing the `gcloud iam service-accounts add-iam-policy-binding` command fixed the issue."
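For reference, a minimal sketch of re-issuing that Workload Identity binding; the project, service account, namespace, and Kubernetes service account names below are placeholders, not values from this thread:
```sh
# Placeholders: replace PROJECT_ID, GSA_NAME, NAMESPACE, and KSA_NAME.
# Re-grants the Kubernetes service account permission to impersonate the
# Google service account via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"
```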
a
I am able to connect most of the time though, even from pods. I just ran 79 flows, and 78 of them worked while 1 failed. There wasn't any difference in the configuration between them.
a
Do you have a GCP support subscription? GCP would know more about it. I would guess this is a transient issue related to some rate limiting on the GCP side - perhaps there are API call limits on the metadata service for retrieving the IAM token, and you are crossing them? Can you share how you define your flow, especially the run config? Perhaps there is a way to retrieve that token fewer times.
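As a hedged sketch of "retrieving the token fewer times" (the function name and client here are illustrative, not from this thread): create one authenticated client per process and reuse it across tasks, so the metadata server is only asked for a new token when the cached one expires.
```python
from functools import lru_cache

from google.cloud import storage


@lru_cache(maxsize=1)
def gcs_client() -> storage.Client:
    # google.auth.default() runs only once per pod/process; afterwards the
    # client library refreshes the token only when it expires, instead of
    # every task call hitting the metadata server for a fresh one.
    return storage.Client()
```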
a
```python
run_config = KubernetesRun(
    image=docker_image,
    labels=[project_id],
    cpu_request='200m',
    memory_request='128Mi',
    env={ENV_VARS},
)
```
I also use a LocalDaskExecutor and GCS storage.
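For context, a minimal sketch of how those pieces fit together in Prefect 1.x; the image, label, and bucket names below are placeholders rather than values from this thread:
```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun
from prefect.storage import GCS


@task
def extract():
    return [1, 2, 3]


run_config = KubernetesRun(
    image="gcr.io/my-project/my-flow-image:latest",  # placeholder image
    labels=["my-project"],                           # placeholder label
    cpu_request="200m",
    memory_request="128Mi",
)

with Flow(
    "example-flow",
    run_config=run_config,
    executor=LocalDaskExecutor(),
    storage=GCS(bucket="my-flow-storage-bucket"),    # placeholder bucket
) as flow:
    extract()
```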
a
I see, it looks like this error message comes from the IAM service granting permissions to GCS. Would you mind trying to switch to something like GitHub storage to check whether this fixes the issue? It's worth checking to confirm that GCS is the issue here.
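As a hedged sketch of what that switch could look like in Prefect 1.x (the repo, path, and secret names are placeholders):
```python
from prefect.storage import GitHub

# The access token is read from a Prefect Cloud Secret with the name given
# below, so it never has to live in the flow code or the image.
storage = GitHub(
    repo="my-org/my-flows-repo",                # placeholder private repo
    path="flows/my_flow.py",                    # placeholder path to the flow file
    access_token_secret="GITHUB_ACCESS_TOKEN",  # placeholder secret name
)
```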
a
I'd have to talk to my team about that. We don't have any GitHub repositories set up right now, and either way we would need to talk to GCP to reach GCS/BigQuery/Cloud SQL in our pipelines.
I did ask GCP, though.
a
I didn't mean that you should change all your flows immediately, but rather just try it out with 1-3 flows that were previously failing, only to confirm that GCS is the culprit here.
a
Don't I need to set up a GitHub repo, though?
a
Yes, for sure, but it's free to use, and you can even create a private repo and store the access token in the Prefect Cloud Secrets backend, so there's no need to rewrite your flows or anything. Or what's your worry about GitHub? Haven't you used GitHub yet?
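For illustration, a hedged sketch of reading such a token back from the Prefect Cloud Secrets backend in Prefect 1.x; the secret name is a placeholder, and GitHub storage performs this lookup automatically when access_token_secret is set:
```python
from prefect.client import Secret

# Resolves the value stored under this name in Prefect Cloud.
token = Secret("GITHUB_ACCESS_TOKEN").get()
```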
a
My worry is that I don't want our stuff to be public, and I thought private GitHub repos aren't free. Actually, I think I misunderstood the GitHub pricing; I can create a private repo.
Similarly, I also see this sometimes:
```
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
```
Again, it doesn't happen every time, only when lots of flows are running at once. My credentials are set on the pods, so I think it's also related to a rate limit. This time it happens when connecting to BigQuery.
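Since the same pods authenticate successfully most of the time, one hedged workaround (an assumption on my side, not something from this thread) is to retry client creation with backoff when these transient credential errors show up:
```python
import time

from google.auth.exceptions import DefaultCredentialsError, RefreshError
from google.cloud import bigquery


def bigquery_client_with_retry(max_attempts: int = 5, base_delay: float = 1.0) -> bigquery.Client:
    """Create a BigQuery client, retrying transient credential failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return bigquery.Client()
        except (DefaultCredentialsError, RefreshError):
            if attempt == max_attempts:
                raise
            # Exponential backoff before asking the metadata server again.
            time.sleep(base_delay * 2 ** (attempt - 1))
```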
a
I see - interesting! No idea how GCP handles that under the hood. I know that on AWS there are many "soft" limits, so if you write to them, you can increase the defaults to some extent. Perhaps it's worth sending a message to GCP and asking? It's the first time I've seen this type of error reported by the community. What's the number of flow runs and task runs you are running concurrently when those errors are triggered?
a
I don't know the exact numbers. I saw it that time with 160 concurrent flow runs, all of which were hitting BigQuery.
a
Thanks a lot, noted that 📝 LMK if there is any way I can help - so far I've shared pretty much all I know about the issue. As a next step, I would write to GCP and try switching storage to GitHub to be sure that the issue is due to GCS. As a workaround, you could also use concurrency limits to ensure that no more than, say, 10 runs with a specific label/tag are running at the same time, which may help mitigate the problem.
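As a hedged sketch of the tag side of that workaround in Prefect 1.x (the tag name and task are placeholders; the actual limit of e.g. 10 concurrent task runs for the tag is then configured in Prefect Cloud, not in code):
```python
from prefect import task


# The "bigquery" tag is a placeholder; a task concurrency limit for that tag
# is set in Prefect Cloud so only a bounded number of tagged tasks run at once.
@task(tags=["bigquery"])
def load_to_bigquery(rows):
    ...
```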
a
Thank you. I'm trying GitHub storage now. I will also try concurrency limits.