Paweł Biernat
06/12/2024, 9:25 AM
from prefect import flow, task
from prefect.concurrency.sync import concurrency
from prefect.deployments import run_deployment
from prefect.states import raise_state_exception

@flow(log_prints=True)
def worker(a: int):
    print(f"Run a worker with a={a}")

# retries for handling infrastructure issues
@task(retries=10)
def run_deployment_task(deployment_name, parameters):
    with concurrency("aci-max-instances"):
        flow_run = run_deployment(name=deployment_name, parameters=parameters)
        raise_state_exception(flow_run.state)

# A flow that submits multiple tasks
@flow
def submit_tasks(n_tasks: int):
    # actual parameters go here
    parameters = [{"a": i} for i in range(n_tasks)]
    for task_parameters in parameters:
        run_deployment_task.submit(
            deployment_name="worker/test-deployment",
            parameters=task_parameters,
        )

if __name__ == "__main__":
    worker.deploy(
        name="test-deployment",
        work_pool_name="aci-pool",
        image="worker:test",
    )
    submit_tasks.deploy(
        name="test-submit-deployment",
        work_pool_name="aci-pool",
        image="submit-tasks:test",
    )
    run_deployment(
        "submit-tasks/test-submit-deployment",
        parameters={"n_tasks": 10_000},
    )
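A minimal sketch, not part of the original flow: the loop in submit_tasks hands every parameter set to run_deployment_task.submit at once, so one variant worth trying is to split the parameter list into fixed-size batches and finish each batch before starting the next. The helper name chunked and the batch size of 4 are hypothetical, and the Prefect calls appear only in comments because how the futures should be awaited depends on the Prefect version in use.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Same shape as the parameters built in submit_tasks.
    parameters = [{"a": i} for i in range(10)]
    for batch in chunked(parameters, 4):
        # Hypothetical use inside the flow: submit one task per entry in
        # `batch` via run_deployment_task.submit(...), then wait for those
        # task runs to finish before moving on to the next batch.
        print([p["a"] for p in batch])
```

Whether this helps depends on where the resource pressure actually comes from; it only bounds how many task runs are in flight at the same time.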
Paweł Biernat
06/12/2024, 9:27 AM
Paweł Biernat
06/12/2024, 9:34 AM
Paweł Biernat
06/12/2024, 7:26 PM
az container list.
I looked into the main container running the submit-tasks flow, and there I'm seeing the "too many open files" error followed by
19:20:44.546 | ERROR | Task run 'run_deployment_task-272' - Crash detected! Execution was cancelled by the runtime environment.
19:20:41.902 | ERROR | Task run 'run_deployment_task-366' - Encountered exception during execution:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/prefect/engine.py", line 2103, in orchestrate_task_run
File "/usr/local/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 327, in aresult
File "/usr/local/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 352, in _run_sync
File "/opt/prefect/protpardelle/prefect_pipeline/parallel-test.py", line 15, in run_deployment_task
File "/usr/local/lib/python3.12/contextlib.py", line 137, in __enter__
File "/usr/local/lib/python3.12/site-packages/prefect/concurrency/sync.py", line 61, in concurrency
File "/usr/local/lib/python3.12/site-packages/prefect/concurrency/sync.py", line 103, in _call_async_function_from_sync
File "/usr/local/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 421, in __call__
File "/usr/local/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 308, in run
File "/usr/local/lib/python3.12/asyncio/runners.py", line 193, in run
File "/usr/local/lib/python3.12/asyncio/runners.py", line 58, in __enter__
File "/usr/local/lib/python3.12/asyncio/runners.py", line 137, in _lazy_init
File "/usr/local/lib/python3.12/asyncio/events.py", line 823, in new_event_loop
File "/usr/local/lib/python3.12/asyncio/events.py", line 720, in new_event_loop
File "/usr/local/lib/python3.12/asyncio/unix_events.py", line 64, in __init__
File "/usr/local/lib/python3.12/asyncio/selector_events.py", line 63, in __init__
File "/usr/local/lib/python3.12/selectors.py", line 349, in __init__
OSError: [Errno 24] Too many open files
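Errno 24 means the process hit its per-process limit on open file descriptors. The traceback shows the failure happening while a new asyncio event loop (and its selector) was being created inside concurrency. A rough diagnostic sketch, assuming a Linux or other Unix container and using only the standard library, that reports how close a process is to that limit; the function name fd_usage is hypothetical:

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in these directories corresponds to one open descriptor.
    # /proc/self/fd exists on Linux; /dev/fd is the fallback on other Unixes.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # Listing the directory briefly opens one extra descriptor, so the
    # count can be off by one.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

Calling something like this from inside the task (or logging it periodically from the flow) would show whether the descriptor count climbs with the number of submitted tasks.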
Paweł Biernat
06/12/2024, 7:27 PM
Paweł Biernat
06/12/2024, 7:30 PM
Paweł Biernat
06/12/2024, 7:40 PM
Paweł Biernat
06/13/2024, 8:05 AM
# retries for handling infrastructure issues
@task(retries=10, tags=["aci-max-instances"])
def run_deployment_task(deployment_name, parameters):
    flow_run = run_deployment(name=deployment_name, parameters=parameters)
    raise_state_exception(flow_run.state)
and it still fails at 10k tasks but differently. This time I get to almost 10k tasks submitted (in the pending status) but the entire flow still stops working with a crash state, after executing ~50 tasks. I inspected all logs and couldn't find what goes wrong this time.
The same flow with 100 tasks total executes just fine (second image).
Paweł Biernat
06/13/2024, 5:21 PM
Crash detected! Execution was interrupted by an unexpected exception: PrefectHTTPStatusError: Server error '500 Internal Server Error' for url '<https://10.0.1.4:4200/api/task_runs/>'
For more information check: <https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500>
06:01:37 PM
prefect.flow_runs
Server logs only show some warnings of this kind:
prefect-server_1 | 16:00:25.718 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 5.473042 seconds to run, which is longer than its loop interval of 4 seconds.
prefect-server_1 | 16:00:34.742 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 5.147579 seconds to run, which is longer than its loop interval of 5.0 seconds.
prefect-server_1 | 16:00:39.770 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 5.938708 seconds to run, which is longer than its loop interval of 4 seconds.
prefect-server_1 | 16:00:44.711 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 5.452976 seconds to run, which is longer than its loop interval of 5.0 seconds.
prefect-server_1 | 16:00:55.901 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 6.098662 seconds to run, which is longer than its loop interval of 5.0 seconds.
prefect-server_1 | 16:00:55.980 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 4.162087 seconds to run, which is longer than its loop interval of 4 seconds.
prefect-server_1 | 16:00:57.025 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 7.29833 seconds to run, which is longer than its loop interval of 5.0 seconds.
prefect-server_1 | 16:00:57.213 | WARNING | prefect.server.services.recentdeploymentsscheduler - RecentDeploymentsScheduler took 6.463927 seconds to run, which is longer than its loop interval of 5 seconds.
I switched to hosting workers on a VM because the ACI instances were running into hard open-file limits. My main flow has 1,844 Task runs (164 pending, 825 running, 27 failed, 1371 crashed, 7 scheduled) and 550 Flow runs. The flow runs come from run_deployment, which starts an ACI instance with the actual work to do.
I'm running out of ideas, any help would be appreciated.
Paweł Biernat
06/13/2024, 5:23 PM