# prefect-aws
r
Hello there! I'm launching flows on ECS tasks running on EC2 Spot instances with some success. It works fine on machines with up to 8 cores, but once I try 16 cores, I get an error saying that I can't override the cpu reservation above 10240 units (i.e. 10 cores). Is this a hard limit set by AWS, or is it possible to configure a work pool so that 16-core tasks are possible? Wondering if anybody has had this problem before.
I tried to override the cpu and memory values this way:
"task_run_request": {
      "cluster": "prefect",
      "overrides": {
        "cpu": 0,
        "memory": 0,
        "taskRoleArn": "xyz",
        "containerOverrides": [
          {
            "cpu": 0,
            "memory": 0
          }
        ]
      },
But I get this error:
Invalid type for parameter overrides.cpu, value: 0, type: , valid types:
n
hi @Romain Vincent - do you have the whole error log from this?
r
Hi Nate, here is the full error:
Failed to submit flow run 'b372179f-6e67-4942-8630-303ca4d20e21' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 640, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 752, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 748, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 330, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 467, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 368, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 410, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 183, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 470, in __call__
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 1633, in _create_task_run
    task = ecs_client.run_task(**task_run_request)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 974, in _make_api_call
    request_dict = self._convert_to_request_dict(
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1048, in _convert_to_request_dict
    request_dict = self._serializer.serialize_to_request(
  File "/usr/local/lib/python3.10/site-packages/botocore/validate.py", line 381, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid type for parameter overrides.cpu, value: 0, type: , valid types:
Setting cpu to something above 10240 returns:
Failed to submit flow run '0832e9e2-876d-4916-a678-cb494d91088c' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 640, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 752, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 748, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 330, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 467, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 368, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 410, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 183, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 470, in __call__
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 1633, in _create_task_run
    task = ecs_client.run_task(**task_run_request)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: Task Overrides 'cpu' setting must be at most 10240
n
thank you!
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: Task Overrides 'cpu' setting must be at most 10240
this appears to be coming straight from boto. which version of prefect-aws are you using?
r
prefect is 2.16.5 and prefect-aws is 0.4.9
n
thanks! i'll take a look at this - but i would think the best move (if anything) is to add validation so that we fail earlier? do you have any thoughts?
r
I think validation is a good idea if we know exactly what's going on. So far, I could not pinpoint whether it's the override itself that fails or the attempt to launch a container requiring more than 10 cpu on a larger ECS task.
I'm going to investigate whether this is related to a service quota in any way. I'll come back to you if it isn't!
n
👍
r
Following up on this one:
• It's not a quota issue (there is no quota entry for cpu per task).
• From this doc, it seems that cpu cannot be higher than 10240 for the EC2 launch type (not Fargate), so it seems that I reached a hard limit.
• From this doc, it seems that cpu (task override) must be a string and cpu (container override) must be an integer. So this is the part that could be taken into account in the validation process, but it's quite niche. Maybe a good solution would be to print the expected type in the stack trace (for now, it seems to be empty).
Thanks Nate for looking that up!
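The findings above can be sketched as a small pre-flight check. This is only an illustration, not something prefect-aws provides: `normalize_overrides`, `MAX_TASK_CPU_UNITS` and `CPU_UNITS_PER_VCPU` are hypothetical names, the 10240 ceiling is copied from the RunTask error in this thread and may not hold for every launch type or over time, and the string/integer split follows the typing described in the last bullet.

```python
from copy import deepcopy

# Ceiling quoted by the RunTask error in this thread; an assumption, not a
# value read from AWS at runtime.
MAX_TASK_CPU_UNITS = 10240
# ECS expresses cpu in units, with 1024 units per vCPU.
CPU_UNITS_PER_VCPU = 1024


def normalize_overrides(overrides: dict) -> dict:
    """Hypothetical helper: return a copy of `overrides` typed the way RunTask expects."""
    fixed = deepcopy(overrides)

    # Task-level cpu and memory are strings in the RunTask request.
    for key in ("cpu", "memory"):
        if key in fixed:
            fixed[key] = str(fixed[key])

    # Fail early instead of waiting for the API to reject the request.
    cpu = fixed.get("cpu")
    if cpu is not None and cpu.isdigit() and int(cpu) > MAX_TASK_CPU_UNITS:
        raise ValueError(
            f"task override cpu={cpu} exceeds {MAX_TASK_CPU_UNITS} units "
            f"({MAX_TASK_CPU_UNITS // CPU_UNITS_PER_VCPU} vCPUs)"
        )

    # Container-level cpu and memory are integers.
    for container in fixed.get("containerOverrides", []):
        for key in ("cpu", "memory"):
            if key in container:
                container[key] = int(container[key])

    return fixed
```

With this sketch, `normalize_overrides({"cpu": 8192, "containerOverrides": [{"cpu": "8192"}]})` yields a task-level `"8192"` and a container-level `8192`, while a 16-core request (`16 * CPU_UNITS_PER_VCPU`) raises `ValueError` before anything is sent to ECS.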