# ask-community
m
Hey all, trying to run a local instance of a Prefect server via Kubernetes pods and an RDS database, and I am running into the following issue: when calling run_deployment from inside one flow to kick off a subflow, this sometimes results in one subflow being created, but other times 2, 3, or 4 subflows are created (at different times and run in different orders). Has anyone else experienced this before? Any thoughts on what could be causing it? I set up an MWE where I literally just have two flows, A and B, and flow A calls a task 'run_flow_b' which just calls run_deployment on flow B, and this still occurs. There is no looping or fancy logic, just a task calling run_deployment.
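In rough outline the MWE is just this (a minimal sketch; the deployment name here is illustrative):
```python
from prefect import flow, task
from prefect.deployments import run_deployment


@task
def run_flow_b():
    # kick off flow B via its deployment (name is illustrative)
    return run_deployment(name="flow-b/flow-b-deployment")


@flow
def flow_a():
    run_flow_b()
```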
c
I've seen this happen before whenever your API returns a bad status code and the client retries; you can avoid this by providing an idempotency_key to the run_deployment call. I recommend using the parent flow run ID as the idempotency key, as this will ensure only one subflow run is created per parent flow run, but allows new run creation across new parent runs
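Something along these lines (an untested sketch; the deployment name is illustrative, and the runtime accessor may vary slightly by Prefect version):
```python
from prefect import task
from prefect.deployments import run_deployment
from prefect.runtime import flow_run


@task
def run_flow_b():
    # using the parent flow run's ID as the idempotency key means a retried
    # create-flow-run request won't spawn a duplicate subflow run
    return run_deployment(
        name="flow-b/flow-b-deployment",
        idempotency_key=flow_run.id,
    )
```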
m
Hey Chris, thanks for that. Would we see a bad status code show up in the logs? The weird thing is that the subflows all pass; it is not like they crash or fail with a bad status code. They are all valid, running subflows.
Furthermore, this is not impossible to deal with, but it does make it quite challenging, in an orchestration flow with thousands of child flows, to know whether all the intended subflows are being kicked off.
c
I don't think a retried bad status code gets logged tbh
m
How is this handled by Prefect Cloud, if you know? This is not an issue on our Prefect Cloud, and we use the same infrastructure there
c
sorry I don't think I understand your question - what do you mean? The problem here is that your API is flaky (potentially a scale issue or networking issue of some kind). You don't see the issue in Prefect Cloud because the API is more stable than self-hosted APIs in these situations
m
I see. So the real issue here is the connection stability of the API
Thanks. This is a great help
c
yea exactly
anytime!
m
Are there any suggestions you would have on this front? It seems like replicas, more resources, etc. all exacerbate the issue instead of resolving it. We aren't noticing extreme loads when this occurs, or anything out of the ordinary. I will say, just as an additional piece of information, that we have seen a few errors like this as well:
ConnectionRefusedError: [Errno 111] Connect call failed ('XX.YYY.ZZZ.ABC', 4200)
But I am still working through trying to figure out if the two are related, as this is much less frequent now that we are not hitting compute thresholds
c
Hm, that looks like the API is saturated for some reason, or it's possible that it's a DNS/networking issue (especially if you're saying replicas exacerbate the issue); we have a Helm chart with sane defaults for the services that could be a good place to compare against. Are you using Postgres as your database?
m
Yes, we use Postgres. The weird thing to me about it being a DNS/networking issue is that it does not always happen, and even the connection refused error is inconsistent: sometimes we see a worker experience it on startup but connect after a few retries, and other times the worker starts with no issues at all
and for reference, we use your Helm chart; we just use Terraform to "fill it in"