# prefect-community
a
I am running pipelines in GKE using Prefect Cloud, and I intermittently see the following error, which seems to come up more often when I am running lots of flows at the same time. Can anyone help?
```
Failed to load and execute flow run: RefreshError(TransportError("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ACCOUNT/token?scopes=SCOPES from the Google Compute Engine metadata service. Status: 500 Response:\nb'Unable to generate access token; IAM returned \\n'", <google.auth.transport.requests._Response object at 0x7f5b071fd670>))
```
a
Googling around a bit, it looks like a service account issue. You could try to create a new service account and change the service account in the pods. From here: "It turned out that the Google Service Account (attached to the pod) lost its IAM binding for `roles/iam.workloadIdentityUser`. Re-issuing the `gcloud iam service-accounts add-iam-policy-binding` command fixed the issue."
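For reference, a minimal sketch of re-issuing that Workload Identity binding; the project, service account, namespace, and Kubernetes service account names below are placeholders, not values from this thread:
```sh
# Placeholders: replace PROJECT_ID, GSA_NAME, NAMESPACE, and KSA_NAME.
# Re-grants the Kubernetes service account permission to impersonate the
# Google service account via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"
```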
a
I am able to connect most of the time though, even from pods. I just ran 79 flows, and 78 of them worked while 1 failed. There wasn't any difference in the configuration between them.
a
Do you have a GCP support subscription? GCP would know more about it. I would guess this is a transient issue related to some rate limiting on the GCP side - perhaps there are API call limits on the metadata service for retrieving the IAM token, and you are crossing them? Can you share how you define your flow, especially the run config? Perhaps there is a way to retrieve that token fewer times.
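As a hedged sketch of "retrieving the token fewer times" (the function name and client here are illustrative, not from this thread): create one authenticated client per process and reuse it across tasks, so the metadata server is only asked for a new token when the cached one expires.
```python
from functools import lru_cache

from google.cloud import storage


@lru_cache(maxsize=1)
def gcs_client() -> storage.Client:
    # google.auth.default() runs only once per pod/process; afterwards the
    # client library refreshes the token only when it expires, instead of
    # every task call hitting the metadata server for a fresh one.
    return storage.Client()
```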
a
```python
run_config = KubernetesRun(
    image=docker_image,
    labels=[project_id],
    cpu_request='200m',
    memory_request='128Mi',
    env={ENV_VARS},
)
```
I also use a LocalDaskExecutor and GCS storage.
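For context, a minimal sketch of how those pieces fit together in Prefect 1.x; the image, label, and bucket names below are placeholders rather than values from this thread:
```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import KubernetesRun
from prefect.storage import GCS


@task
def extract():
    return [1, 2, 3]


run_config = KubernetesRun(
    image="gcr.io/my-project/my-flow-image:latest",  # placeholder image
    labels=["my-project"],                           # placeholder label
    cpu_request="200m",
    memory_request="128Mi",
)

with Flow(
    "example-flow",
    run_config=run_config,
    executor=LocalDaskExecutor(),
    storage=GCS(bucket="my-flow-storage-bucket"),    # placeholder bucket
) as flow:
    extract()
```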
a
I see, it looks like this error message comes from the IAM service granting permissions to GCS. Would you mind trying to switch to something like GitHub storage to check whether this fixes the issue? It's worth checking to confirm that GCS is the issue here.
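As a hedged sketch of what that switch could look like in Prefect 1.x (the repo, path, and secret names are placeholders):
```python
from prefect.storage import GitHub

# The access token is read from a Prefect Cloud Secret with the name given
# below, so it never has to live in the flow code or the image.
storage = GitHub(
    repo="my-org/my-flows-repo",                # placeholder private repo
    path="flows/my_flow.py",                    # placeholder path to the flow file
    access_token_secret="GITHUB_ACCESS_TOKEN",  # placeholder secret name
)
```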
a
I'd have to talk to my team about that. We don't have any GitHub repositories set up right now, and either way we would need to talk to GCP to reach GCS/BigQuery/Cloud SQL in our pipelines.
I did ask GCP, though.
a
I didn't mean that you should change all your flows immediately, but rather just try it out with 1-3 flows that were previously failing, only to confirm that GCS is the culprit here.
a
Don't I need to set up a GitHub repo, though?
a
Yes, for sure, but it's free to use, and you can even create a private repo and store the access token in the Prefect Cloud Secrets backend, so there's no need to rewrite your flows or anything. Or what's your worry about GitHub? Haven't you used GitHub yet?
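For illustration, a hedged sketch of reading such a token back from the Prefect Cloud Secrets backend in Prefect 1.x; the secret name is a placeholder, and GitHub storage performs this lookup automatically when access_token_secret is set:
```python
from prefect.client import Secret

# Resolves the value stored under this name in Prefect Cloud.
token = Secret("GITHUB_ACCESS_TOKEN").get()
```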
a
My worry is that I don't want our stuff to be public, and I thought private GitHub repos aren't free. Actually, I think I misunderstood the GitHub pricing; I can create a private repo.
Similarly, I also see this sometimes:
```
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
```
Again, it doesn't happen every time, only when lots of flows are running at once. My credentials are set on the pods, so I think it's also related to a rate limit. This time it happens when connecting to BigQuery.
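Since the same pods authenticate successfully most of the time, one hedged workaround (an assumption on my side, not something from this thread) is to retry client creation with backoff when these transient credential errors show up:
```python
import time

from google.auth.exceptions import DefaultCredentialsError, RefreshError
from google.cloud import bigquery


def bigquery_client_with_retry(max_attempts: int = 5, base_delay: float = 1.0) -> bigquery.Client:
    """Create a BigQuery client, retrying transient credential failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return bigquery.Client()
        except (DefaultCredentialsError, RefreshError):
            if attempt == max_attempts:
                raise
            # Exponential backoff before asking the metadata server again.
            time.sleep(base_delay * 2 ** (attempt - 1))
```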
a
I see - interesting! No idea how GCP handles that under the hood. I know that on AWS there are many "soft" limits, so if you write to them, you can increase the defaults to some extent. Perhaps it's worth sending a message to GCP and asking? It's the first time I've seen this type of error reported by the community. What's the number of flow runs and task runs you are running concurrently when those errors are triggered?
a
I don't know the exact numbers. I saw it that time with 160 concurrent flow runs, all of which were hitting BigQuery.
a
Thanks a lot, noted that 📝 LMK if there is any way I can help - so far I've shared pretty much all I know about the issue. As a next step, I would write to GCP and try switching storage to GitHub to be sure that the issue is due to GCS. As a workaround, you could also use concurrency limits to ensure that no more than, say, 10 runs with a specific label/tag are running at the same time, which may help mitigate the problem.
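As a hedged sketch of the tag side of that workaround in Prefect 1.x (the tag name and task are placeholders; the actual limit of e.g. 10 concurrent task runs for the tag is then configured in Prefect Cloud, not in code):
```python
from prefect import task


# The "bigquery" tag is a placeholder; a task concurrency limit for that tag
# is set in Prefect Cloud so only a bounded number of tagged tasks run at once.
@task(tags=["bigquery"])
def load_to_bigquery(rows):
    ...
```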
a
Thank you. I'm trying GitHub storage now. I will also try concurrency limits.