I've configured a worker service on an ECS Cluster non-push Prefect Community #prefect-aws

James Gatter

10/03/2023, 4:31 PM

I've configured a worker service on an ECS Cluster non-push work pool. The ECS Cluster uses an EC2 auto-scaling group as the capacity provider. The capacity group has a minimum of 0, a desired capacity of 0, and a maximum of 40. It seems the worker is unable to spin up instances on the cluster when I submit deployments to it (see thread for stack trace). When I set the desired capacity to 1, the deployment runs successfully. Is this the same story as the Push work pool error mentioned before by Luis, that capacityProvider support doesn't exist for either type of pool until prefect-aws#312 is merged and a new prefect-aws release includes it?

James Gatter

10/03/2023, 4:32 PM

Logs of failed deployment:

Worker 'ECSWorker 81a53a34-8a8c-4206-ac5c-cf2950c720c3' submitting flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54'
10:40:07 AM
prefect.flow_runs.worker
Registering ECS task definition...
10:40:10 AM
prefect.flow_runs.worker
Using ECS task definition 'arn:aws:ecs:us-east-1:840419303237:task-definition/prefect:1'...
10:40:11 AM
prefect.flow_runs.worker
Creating ECS task run...
10:40:11 AM
prefect.flow_runs.worker
Failed to submit flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 1524, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: No Container Instances were found in your cluster.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 843, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 567, in run
    ) = await run_sync_in_worker_thread(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 728, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 724, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7f0a4032c3d0 state=finished raised InvalidParameterException>]
10:40:15 AM
prefect.flow_runs.worker
Completed submission of flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54'
10:40:15 AM
prefect.flow_runs.worker
Reported flow run '685458d3-e5f8-4d11-a8aa-3f3b6fa50e54' as crashed: Flow run could not be submitted to infrastructure
10:40:15 AM
prefect.flow_runs.worker

James Gatter

10/03/2023, 4:36 PM

I simply want my deployments to spin up on EC2 instances and spin down after completion. If there's another way than capacityProvider+ASG to do this please let me know!

James Gatter

10/05/2023, 7:49 PM

hi @Jake Kaplan could you confirm?

Jake Kaplan

10/06/2023, 1:16 PM

hey! by default the ECSWorker is configured to use FARGATE. Until that PR is merged the ECSWorker won't be able to support passing the capacityProvider params correctly. If you're interested in taking it up from the original contributor it's waiting on tests to get fixed. Otherwise will try to get to it when possible
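
For context on the launch type point above, here is a minimal sketch of the two request shapes that boto3's ecs_client.run_task accepts. The helper name build_run_task_request, the cluster name, and the provider name my-asg-provider are hypothetical, and this is not how prefect-aws builds its request. The sketch only assembles keyword arguments and never calls AWS. It relies on one rule of the RunTask API: launchType must be omitted when capacityProviderStrategy is supplied.

```python
from typing import Any, Optional


def build_run_task_request(
    cluster: str,
    task_definition_arn: str,
    launch_type: str = "FARGATE",
    capacity_provider_strategy: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build keyword arguments for ecs_client.run_task (hypothetical helper)."""
    request: dict[str, Any] = {
        "cluster": cluster,
        "taskDefinition": task_definition_arn,
    }
    if capacity_provider_strategy:
        # RunTask rejects requests that set both fields, so drop launchType.
        request["capacityProviderStrategy"] = capacity_provider_strategy
    else:
        request["launchType"] = launch_type
    return request


# Hypothetical capacity provider backed by the EC2 auto-scaling group.
asg_request = build_run_task_request(
    cluster="my-cluster",
    task_definition_arn="prefect:1",
    capacity_provider_strategy=[
        {"capacityProvider": "my-asg-provider", "weight": 1, "base": 0}
    ],
)

# With no strategy, the request falls back to the default launch type.
default_request = build_run_task_request("my-cluster", "prefect:1")
```

As far as I understand, a request sent with launchType set to EC2 against a cluster with zero instances fails right away with the "No Container Instances were found in your cluster" error seen in the logs, while a request that names a capacity provider with managed scaling enabled can wait for the auto-scaling group to add an instance.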

James Gatter

10/06/2023, 2:40 PM

i tried a few nights ago actually, for some reason my setup broke though and even on the main branch all the ecs worker tests started generating failures (500 codes I think).

James Gatter

10/06/2023, 2:40 PM

I might try again this weekend with a fresh environment

James Gatter

10/06/2023, 2:40 PM

thanks for letting me know Jake!

James Gatter

10/06/2023, 2:40 PM

if I were to contribute, would it make sense for me to fork Luis' fork and open a PR to merge into the branch that he is trying to merge into #312? Never have had to contribute to someone else's fork before.

Jake Kaplan

10/06/2023, 8:24 PM

I don't know if you'd be able to fork his fork necessarily. If it were me I would probably start my own fork and copy the changes if i'm being honest 😅

Anh Pham

10/11/2023, 2:37 PM

@Jake Kaplan I’m testing the branch live and see that if I add an extra parameter under the keys capacityProviderStrategy / capacity_provider_strategy under work_pool.job_variables, it’s not parsed into the configuration of the ECSWorker run method in prefect-aws, even though it shows up in the UI under Infra Overrides. Can you shed some light on what the issue might be?

Jake Kaplan

10/11/2023, 6:09 PM

hey, I think you're looking to add the variable under the advanced tab on the default job template on the work pool

Jake Kaplan

10/11/2023, 6:10 PM

once that's there any matching infra overrides will be used to fill out that placeholder
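
To illustrate the placeholder mechanism described here, below is a rough sketch. It is not Prefect's actual implementation. It assumes placeholders are written as {{ name }} in the job configuration, that a string consisting of exactly one placeholder is replaced by the override value, and that placeholders with no value are dropped. The function name fill_placeholders, the sample template, and the provider name are hypothetical.

```python
import re
from typing import Any

# Matches a string that is exactly one placeholder, e.g. "{{ cluster }}".
PLACEHOLDER = re.compile(r"^\{\{\s*(\w+)\s*\}\}$")
_MISSING = object()


def fill_placeholders(template: Any, values: dict[str, Any]) -> Any:
    """Recursively replace '{{ name }}' strings with values[name]."""
    if isinstance(template, dict):
        filled = {}
        for key, item in template.items():
            result = fill_placeholders(item, values)
            if result is not _MISSING:  # drop keys whose placeholder has no value
                filled[key] = result
        return filled
    if isinstance(template, list):
        results = [fill_placeholders(item, values) for item in template]
        return [r for r in results if r is not _MISSING]
    if isinstance(template, str):
        match = PLACEHOLDER.match(template)
        if match:
            return values.get(match.group(1), _MISSING)
    return template


overrides = {
    "capacity_provider_strategy": [{"capacityProvider": "my-asg-provider", "weight": 1}]
}

# Template that contains the placeholder: the override is picked up.
with_placeholder = fill_placeholders(
    {
        "task_run_request": {
            "cluster": "{{ cluster }}",
            "capacityProviderStrategy": "{{ capacity_provider_strategy }}",
        }
    },
    overrides,
)

# Template without the placeholder: the override is never read.
without_placeholder = fill_placeholders(
    {"task_run_request": {"cluster": "{{ cluster }}"}}, overrides
)
```

Under these assumptions, an override that appears under Infra Overrides but has no matching placeholder in the template has no effect on the resulting configuration, which matches the behavior reported in this thread.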

Anh Pham

10/11/2023, 7:13 PM

Ah, I see your point. Let me try it out and see if it’s working as expected

👍 1

Anh Pham

10/11/2023, 10:35 PM

Yes. It’s working as expected. We also need to add capacity_provider_strategy to the variables section in the Advanced tab as well. Thank you, @Jake Kaplan. Can you point me to a place where I could open a PR to update the default base job template to include this option?
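
A sketch of what that pair of additions might look like, written as a Python dict mirroring the JSON shown in the Advanced tab. The field names under variables.properties (title, description, type, items) and the surrounding keys are assumptions based on the usual shape of a work pool base job template. They are not copied from the PR or from any prefect-aws release.

```python
# Hypothetical excerpt of an ECS work pool base job template.
base_job_template = {
    "variables": {
        "type": "object",
        "properties": {
            # Declares the variable so that deployments can override it.
            "capacity_provider_strategy": {
                "title": "Capacity Provider Strategy",
                "description": "Capacity provider strategy passed to RunTask.",
                "type": "array",
                "items": {"type": "object"},
            },
        },
    },
    "job_configuration": {
        "task_run_request": {
            # Placeholder that refers to the variable declared above.
            "capacityProviderStrategy": "{{ capacity_provider_strategy }}",
        },
    },
}

# Names of the variables the template declares.
declared = set(base_job_template["variables"]["properties"])
```

Both halves are needed per the message above: the entry under variables and the placeholder under job_configuration.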

Anh Pham

10/11/2023, 10:36 PM

I’ll see if I can fix the failing test case in the open PR as well.

Anh Pham

10/18/2023, 2:46 PM

Ping @Jake Kaplan @James Gatter 😄 Is there something I could help with?

James Gatter

10/18/2023, 2:51 PM

I didn't end up having time to probe the test and I have to focus on my work so I've had to abandon using Prefect. I think the only thing really is figuring out that test failure. I don't think it should be hard if you use a debugger. I just couldn't get to that point because something weird started happening with my environment (maybe the debugger?) that caused all tests to start failing.
