# prefect-aws
j
I've configured a worker service on an ECS cluster, using a non-push work pool. The ECS cluster uses an EC2 auto-scaling group as the capacity provider. The auto-scaling group has a minimum of 0, desired of 0, and maximum of 40. It seems the worker is unable to spin up instances on the cluster when I submit deployments to it (see thread for stack trace). When I set the desired count to 1, the deployment runs successfully. Is this the same story as the Push work pool error mentioned before by Luis, that capacityProvider support doesn't exist for either type of pool until prefect-aws#312 is merged and a new prefect-aws release includes it?
Logs of failed deployment:
Worker 'ECSWorker 81a53a34-8a8c-4206-ac5c-cf2950c720c3' submitting flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54'
10:40:07 AM
prefect.flow_runs.worker
Registering ECS task definition...
10:40:10 AM
prefect.flow_runs.worker
Using ECS task definition 'arn:aws:ecs:us-east-1:840419303237:task-definition/prefect:1'...
10:40:11 AM
prefect.flow_runs.worker
Creating ECS task run...
10:40:11 AM
prefect.flow_runs.worker
Failed to submit flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 1524, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: No Container Instances were found in your cluster.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 843, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 567, in run
    ) = await run_sync_in_worker_thread(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 728, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 724, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7f0a4032c3d0 state=finished raised InvalidParameterException>]
10:40:15 AM
prefect.flow_runs.worker
Completed submission of flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54'
10:40:15 AM
prefect.flow_runs.worker
Reported flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54' as crashed: Flow run could not be submitted to infrastructure
10:40:15 AM
prefect.flow_runs.worker
I simply want my deployments to spin up on EC2 instances and spin down after completion. If there's another way than capacityProvider+ASG to do this please let me know!
hi @Jake Kaplan could you confirm?
j
hey! by default the ECSWorker is configured to use FARGATE. Until that PR is merged the ECSWorker won't be able to support passing the `capacityProvider` params correctly. If you're interested in taking it up from the original contributor, it's waiting on tests to get fixed. Otherwise I'll try to get to it when possible.
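For context, a minimal sketch of the difference at the ECS API level. The RunTask call accepts either `launchType` or `capacityProviderStrategy`, not both. The helper name `build_run_task_request`, the cluster name, and the capacity provider name below are hypothetical and only illustrate the request shape; this is not how the ECSWorker builds its request.

```python
def build_run_task_request(cluster, task_definition, capacity_provider=None):
    """Build keyword arguments for an ECS RunTask call (illustrative helper)."""
    request = {"cluster": cluster, "taskDefinition": task_definition}
    if capacity_provider:
        # ECS requires launchType to be omitted when a
        # capacityProviderStrategy is supplied.
        request["capacityProviderStrategy"] = [
            {"capacityProvider": capacity_provider, "weight": 1, "base": 0}
        ]
    else:
        # FARGATE is the default the thread mentions for the ECSWorker.
        request["launchType"] = "FARGATE"
    return request


# Hypothetical names; substitute your own cluster and capacity provider.
fargate_request = build_run_task_request("my-cluster", "prefect:1")
asg_request = build_run_task_request(
    "my-cluster", "prefect:1", capacity_provider="my-asg-capacity-provider"
)

# With boto3 installed and AWS credentials configured, the request could be
# submitted like this (left commented out so the sketch runs offline):
# import boto3
# boto3.client("ecs").run_task(**asg_request)
```

With an auto-scaling group capacity provider that has managed scaling enabled, a task submitted this way would typically wait in a provisioning state while ECS scales the group up from zero, instead of failing immediately because no container instances exist.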
j
i tried a few nights ago actually, for some reason my setup broke though and even on the main branch all the ecs worker tests started generating failures (500 codes I think).
I might try again this weekend with a fresh environment
thanks for letting me know Jake!
if I were to contribute, would it make sense for me to fork Luis' fork and open a PR to merge into the branch that he is trying to merge in #312? I've never had to contribute to someone else's fork before.
j
I don't know if you'd be able to fork his fork necessarily. If it were me I would probably start my own fork and copy the changes if i'm being honest 😅
a
@Jake Kaplan I'm testing the branch live and see that if I add an extra parameter under the key `capacityProviderStrategy` or `capacity_provider_strategy` under `work_pool.job_variables`, it's not passed through to the `configuration` of the `ECSWorker` run method in `prefect-aws`, even though it shows up in the UI under `Infra Overrides`. Can you shed some light on what the issue might be?
j
hey, I think you're looking to add the variable under the advanced tab on the default job template on the work pool
once that's there, any matching infra overrides will be used to fill out that placeholder
a
Ah, I see your point. Let me try it out and see if it’s working as expected
Yes. It's working as expected. We also need to add `capacity_provider_strategy` to the `variables` section in the Advanced tab as well. Thank you, @Jake Kaplan. Can you point me to a place where I could open a PR to update the default base job template to include this option?
I'll see if I can fix the test case that is failing in the open PR as well.
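The two changes described above (a placeholder in the job configuration plus a matching entry under `variables`) can be sketched roughly as follows. This is an illustration under assumptions: the helper `add_capacity_provider_variable` is hypothetical, and the trimmed template only mimics the general `job_configuration` / `variables` layout of a work pool base job template, not the full ECS template.

```python
import copy


def add_capacity_provider_variable(base_job_template):
    """Return a copy of the template that exposes capacity_provider_strategy."""
    template = copy.deepcopy(base_job_template)

    # 1. Declare the variable so job variables / infra overrides can fill it.
    variables = template.setdefault("variables", {})
    properties = variables.setdefault("properties", {})
    properties["capacity_provider_strategy"] = {
        "type": "array",
        "title": "Capacity Provider Strategy",
        "default": [],
        "items": {"type": "object"},
    }

    # 2. Reference it from the job configuration with a placeholder.
    job_configuration = template.setdefault("job_configuration", {})
    task_run_request = job_configuration.setdefault("task_run_request", {})
    task_run_request["capacityProviderStrategy"] = "{{ capacity_provider_strategy }}"
    return template


# Assumed, heavily trimmed template used only to exercise the helper.
example_template = {
    "job_configuration": {"task_run_request": {"cluster": "{{ cluster }}"}},
    "variables": {"type": "object", "properties": {"cluster": {"type": "string"}}},
}
updated_template = add_capacity_provider_variable(example_template)
```

A deployment could then set `capacity_provider_strategy` in its job variables, and the matching placeholder would be filled in when the worker prepares the run. Whether the worker passes the resulting value to ECS correctly depends on the branch or release in use, as discussed in the thread.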
Ping @Jake Kaplan @James Gatter 😄 Is there anything I could help with?
j
I didn't end up having time to probe the test and I have to focus on my work so I've had to abandon using Prefect. I think the only thing really is figuring out that test failure. I don't think it should be hard if you use a debugger. I just couldn't get to that point because something weird started happening with my environment (maybe the debugger?) that caused all tests to start failing.