Samuel Hinton
08/28/2025, 11:35 PM
`run_deployment`, which I worry might be a bug and not user error. The summary is that I'm scheduling flow runs on NERSC via Slurm, which means a flow run often sits in Pending for many minutes before Slurm allocates resources. The `timeout` parameter is kept at `None`, but it seems that if a poll fails, `run_deployment` errors out. I.e., in my parent flow logs I can see:
```
PrefectHTTPStatusError("Server error '500 Internal Server Error' for url 'http://prefect.prefect-pipelines.production.svc.spin.nersc.org/api/flow_runs/9418ee41-b55c-40ee-a7cb-879df77a766a'\nResponse: {'exception_message': 'Internal Server Error'}\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500")
```
This comes from:
```
  File "/usr/local/lib/python3.13/site-packages/prefect/deployments/flow_runs.py", line 203, in run_deployment
    flow_run = await client.read_flow_run(flow_run_id)
```
When I wait a minute for the job to be scheduled, I can hit this endpoint without any problem:
```json
{
  "id": "9418ee41-b55c-40ee-a7cb-879df77a766a",
  "created": "2025-08-28T23:20:39.035659Z",
  "updated": "2025-08-28T23:20:43.784032Z",
  "name": "preprocess_/data/level=raw/runs/run_id=25_056_084/science_red.fits",
  "flow_id": "344eea82-0256-46b2-bdc2-615a21be48e6",
  "state_id": "0198f2fb-c55b-7ab9-aab6-971eb870d3ff",
  "deployment_id": "c29d351e-ee20-4267-93ee-d157b91f1d6b",
  "deployment_version": "104f2d07",
  "work_queue_id": "b0ac5ee7-ce2f-47ed-8f6d-7ddce9c0a1b0",
  "work_queue_name": "default",
  ...
}
```
So I feel like `run_deployment` shouldn't be raising an error in this instance. Any devs around to share their thoughts?
Specifically, the code in `run_deployment`:
```python
with anyio.move_on_after(timeout):
    while True:
        flow_run = await client.read_flow_run(flow_run_id)
        flow_state = flow_run.state
        if flow_state and flow_state.is_final():
            return flow_run
        await anyio.sleep(poll_interval)
```
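This uses `timeout` to bound the wait for flow completion, but it does seem to assume relatively immediate flow registration: there is no error handling around `client.read_flow_run`, so a single failed poll propagates straight out of `run_deployment`, even with `timeout=None`.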
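For illustration, here is a minimal sketch of a polling loop that keeps going through a transient 500 while the run is still pending. It is hypothetical and not Prefect's API: `read_state` stands in for `client.read_flow_run(...)` returning a state name, `TransientServerError` stands in for a retryable `PrefectHTTPStatusError`, the terminal state names are plain strings assumed to mirror Prefect's final states, `max_consecutive_errors` is an invented cap so a genuinely broken server still surfaces an error, and it uses plain `asyncio` instead of `anyio`.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class TransientServerError(Exception):
    """Hypothetical stand-in for a retryable 5xx response (e.g. a 500 from the API)."""


# Assumed terminal state names; plain strings, not Prefect's StateType enum.
FINAL_STATES = frozenset({"COMPLETED", "FAILED", "CRASHED", "CANCELLED"})


async def wait_for_final_state(
    read_state: Callable[[], Awaitable[Optional[str]]],
    poll_interval: float = 5.0,
    timeout: Optional[float] = None,
    max_consecutive_errors: int = 5,
) -> Optional[str]:
    """Poll read_state() until it returns a final state.

    Transient server errors are retried rather than propagated, up to
    max_consecutive_errors in a row. Returns None if the timeout expires
    first (similar to move_on_after, which swallows the cancellation).
    """

    async def poll() -> str:
        consecutive_errors = 0
        while True:
            try:
                state = await read_state()
            except TransientServerError:
                consecutive_errors += 1
                if consecutive_errors >= max_consecutive_errors:
                    raise  # give up: the server looks persistently broken
            else:
                consecutive_errors = 0  # a successful read resets the error budget
                if state in FINAL_STATES:
                    return state
            await asyncio.sleep(poll_interval)

    try:
        # With timeout=None, asyncio.wait_for waits indefinitely.
        return await asyncio.wait_for(poll(), timeout)
    except asyncio.TimeoutError:
        return None


async def demo() -> Optional[str]:
    # Fake server: two 500s while the run is pending, then normal progress.
    responses = iter([
        TransientServerError("500"),
        TransientServerError("500"),
        "PENDING",
        "RUNNING",
        "COMPLETED",
    ])

    async def fake_read_state() -> Optional[str]:
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    return await wait_for_final_state(fake_read_state, poll_interval=0.01)


print(asyncio.run(demo()))  # COMPLETED
```

The design choice here is to treat a bounded run of consecutive server errors as "not ready yet" rather than as a failure of the parent flow; whether a 500 from `read_flow_run` should really be retried is exactly the open question above.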