# ask-community
m
Hey all, trying to run a local instance of a Prefect server via Kubernetes pods and an RDS database, and I am running into the following issue: when calling run_deployment from inside one flow to kick off a subflow, this sometimes results in one subflow being created, but other times 2, 3, or 4 subflows are created (at different times and run in different orders). Has anyone else experienced this before? Any thoughts on what could be causing it? I set up an MWE where I literally just have two flows, A and B, and flow A calls a task 'run_flow_b' which just calls run_deployment on flow B, and this still occurs. There is no looping or fancy logic, just a task calling run_deployment.
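In rough outline the MWE is just this (a minimal sketch; the deployment name here is illustrative):
```python
from prefect import flow, task
from prefect.deployments import run_deployment


@task
def run_flow_b():
    # kick off flow B via its deployment (name is illustrative)
    return run_deployment(name="flow-b/flow-b-deployment")


@flow
def flow_a():
    run_flow_b()
```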
c
I've seen this happen before whenever your API returns a bad status code and the client retries; you can avoid this by providing an idempotency_key to the run_deployment call. I recommend using the parent flow run ID as the idempotency key, as this will ensure only one subflow run is created per parent flow run, but allows new run creation across new parent runs
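Something along these lines (an untested sketch; the deployment name is illustrative, and the runtime accessor may vary slightly by Prefect version):
```python
from prefect import task
from prefect.deployments import run_deployment
from prefect.runtime import flow_run


@task
def run_flow_b():
    # using the parent flow run's ID as the idempotency key means a retried
    # create-flow-run request won't spawn a duplicate subflow run
    return run_deployment(
        name="flow-b/flow-b-deployment",
        idempotency_key=flow_run.id,
    )
```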
m
Hey Chris, thanks for that. Would we see a bad status code show up in the logs? The weird thing is that the subflows all pass; it is not like they crash or fail with a bad status code. They are all valid, running subflows.
Furthermore, this is not impossible to deal with, but it does make it quite challenging, in an orchestration flow with thousands of child flows, to know whether all the intended subflows are being kicked off.
c
I don't think a retried bad status code gets logged tbh
m
How is this handled by Prefect Cloud, if you know? This is not an issue on our Prefect Cloud, and we use the same infrastructure there
c
sorry I don't think I understand your question - what do you mean? The problem here is that your API is flaky (potentially a scale issue or networking issue of some kind). You don't see the issue in Prefect Cloud because the API is more stable than self-hosted APIs in these situations
m
I see. So the real issue here is the connection stability of the API
Thanks. This is a great help
c
yea exactly
anytime!
m
Are there any suggestions you would have on this front? It seems like replicas, more resources, etc. all exacerbate the issue instead of resolving it. We aren't noticing extreme loads when this occurs, or anything out of the ordinary. I will say, just as an additional piece of information, that we have seen a few errors like this as well:
ConnectionRefusedError: [Errno 111] Connect call failed ('XX.YYY.ZZZ.ABC', 4200)
But I am still working through trying to figure out if the two are related, as this is much less frequent now that we are not hitting compute thresholds
c
Hm, that looks like the API is saturated for some reason, or it's possible that it's a DNS/networking issue (especially if you're saying replicas exacerbate the issue); we have a Helm chart with sane defaults for the services that could be a good place to compare against. Are you using Postgres as your database?
m
Yes, we use Postgres. The weird thing to me about it being a DNS/networking issue is that it does not always happen, and even the connection refused error is inconsistent: sometimes we see a worker experience it on startup but connect after a few retries, and other times the worker starts with no issues at all
and for reference, we use your Helm chart; we just use Terraform to "fill it in"