Kathryn Klarich
07/15/2021, 3:59 PM
"Submitted for execution" and eventually, after three retries, it marks the flow as failed: "A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed." On the agent side, the only log I see is "Deploying flow run". It appears that the agent is creating the task definition, running the task, and then immediately de-registering it (as seen here), which could be the problem, as I don't see how the task can run if the definition is de-registered before the task run is complete. I have looked through this issue, but I don't think this is the same problem because I am using Prefect Cloud. Any help is much appreciated.
nicholas
Kathryn Klarich
07/15/2021, 4:36 PM
Kathryn Klarich
07/15/2021, 4:38 PM
Kathryn Klarich
07/15/2021, 4:39 PM
2021-07-15T11:27:13.912-04:00 [2021-07-15 15:27:13,912] INFO - agent | Found 1 flow run(s) to submit for execution.
2021-07-15T11:27:14.140-04:00 [2021-07-15 15:27:14,139] INFO - agent | Deploying flow run '0baca4bd-ff7d-408e-9952-4f12044a83f9'
Zanie
resp = self.ecs_client.run_task(taskDefinition=taskdef_arn, **kwargs)

# Always deregister the task definition if a new one was registered
if new_taskdef_arn:
    self.logger.debug("Deregistering task definition %s", taskdef_arn)
    self.ecs_client.deregister_task_definition(taskDefinition=taskdef_arn)

if resp.get("tasks"):
    task_arn = resp["tasks"][0]["taskArn"]
    self.logger.debug("Started task %r for flow run %r", task_arn, flow_run.id)
    return f"Task {task_arn}"

raise ValueError(
    "Failed to start task for flow run {0}. Failures: {1}".format(
        flow_run.id, resp.get("failures")
    )
)
Kathryn Klarich
07/15/2021, 5:03 PM
Kevin Kho
prefect agent ecs start, and then you add the log level there with --log-level=DEBUG, and maybe we can debug from this agent?
Kevin Kho
Kathryn Klarich
07/15/2021, 5:09 PM
Kathryn Klarich
07/15/2021, 5:10 PM
Kevin Kho
Kevin Kho
Kathryn Klarich
07/15/2021, 5:44 PM
Kathryn Klarich
07/15/2021, 5:45 PM
  File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6268f5de10>, 'Connection to api.prefect.io timed out. (connect timeout=15)'))
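For anyone hitting the same ConnectTimeout: a quick way to tell a networking problem from a Prefect problem is to try a plain TCP connection from inside the task's container. The sketch below is only an illustration; the helper name `can_connect` is made up here and is not part of Prefect or boto3. The 15 second default just mirrors the connect timeout shown in the traceback.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 15.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for debugging; it only checks reachability, not TLS or auth.
    """
    try:
        # create_connection resolves the host and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures
        return False


# Example (run inside the task container): can_connect("api.prefect.io")
```

If this returns False inside the task but True on the agent, the task's subnets, security groups, or route to the internet are the likely place to look.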
Zanie
Kathryn Klarich
07/15/2021, 6:43 PM
An error occurred (InvalidParameterException) when calling the RunTask operation: Network Configuration is not valid for the given networkMode of this task definition.
It is strange that this is the error we are getting, because we shouldn't have to specify a networkConfiguration for networkMode = bridge.
import yaml

task_definition = yaml.safe_load(
"""
networkMode: awsvpc
cpu: 1024
memory: 1900
containerDefinitions:
  - name: flow
requiresCompatibilities:
  - EC2
"""
)
Kathryn Klarich
07/15/2021, 6:44 PM
A networkConfiguration is now only specified for network mode awsvpc.
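The rule quoted above (a networkConfiguration only goes with network mode awsvpc) can be sketched as a small helper that builds the extra run_task arguments. This is an illustration under assumptions, not what the Prefect agent does internally: the function name `build_network_kwargs` and its parameters are hypothetical, and the subnet and security group IDs are placeholders.

```python
from typing import Dict, List, Optional


def build_network_kwargs(
    network_mode: str,
    subnets: Optional[List[str]] = None,
    security_groups: Optional[List[str]] = None,
) -> Dict[str, dict]:
    """Return run_task kwargs, adding networkConfiguration only for awsvpc.

    Hypothetical helper: for bridge/host/none it returns an empty dict, since
    ECS rejects a networkConfiguration for those network modes.
    """
    if network_mode == "awsvpc":
        if not subnets:
            raise ValueError("awsvpc tasks need at least one subnet")
        return {
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": list(subnets),
                    "securityGroups": list(security_groups or []),
                    "assignPublicIp": "DISABLED",
                }
            }
        }
    if subnets or security_groups:
        raise ValueError(
            f"networkConfiguration is not valid for networkMode {network_mode!r}"
        )
    return {}


# Placeholder IDs, as in the thread
awsvpc_kwargs = build_network_kwargs("awsvpc", ["subnet-1"], ["sg-1"])
bridge_kwargs = build_network_kwargs("bridge")
```

With bridge the result is empty, so if RunTask still complains about the network configuration, something else (for example the agent's own defaults) is probably adding one.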
Kevin Kho
Kevin Kho
Kathryn Klarich
07/15/2021, 6:50 PM
Kathryn Klarich
07/15/2021, 6:50 PM
Kathryn Klarich
07/15/2021, 6:51 PM
Network Configuration is not valid for the given networkMode of this task definition.
Kathryn Klarich
07/15/2021, 6:52 PM
Kevin Kho
Kevin Kho
Kathryn Klarich
07/15/2021, 7:03 PM
Kathryn Klarich
07/15/2021, 7:03 PM
Kathryn Klarich
07/15/2021, 7:04 PM
Kevin Kho
json and restart?
Kathryn Klarich
07/15/2021, 7:12 PM
Kevin Kho
Kevin Kho
launchType is taken from the agent and defaults to Fargate. So if you specify it on the RunConfig, it won’t override. So yes, I think they need to match.
Kevin Kho
launchType follows the agent, but then requiresCompatibilities is overridden by the RunConfig; otherwise it takes the launch type. I guess I don’t have a clear answer as to whether they need to match. I’ll look a bit more.
Kathryn Klarich
07/15/2021, 8:07 PM
Kevin Kho
Kathryn Klarich
07/15/2021, 8:44 PM
import yaml
from prefect.run_configs import ECSRun

task_definition = yaml.safe_load(
"""
networkMode: awsvpc
cpu: 1024
memory: 2048
containerDefinitions:
  - name: flow
requiresCompatibilities:
  - EC2
"""
)
RUN_CONFIG = ECSRun(
    run_task_kwargs={
        "launchType": "EC2",
        "cluster": "jupiter-eos-swift-prefect-dev",
        # awsvpc network mode requires an explicit networkConfiguration
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "assignPublicIp": "DISABLED",
                "subnets": ["subnet-1", "subnet-2"],
                "securityGroups": ["sg-1"],
            }
        },
    },
    task_role_arn="arn:aws:iam::my-task-role",
    execution_role_arn="arn:aws:iam::my-task-execution_role",
    task_definition=task_definition,
)
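Since the thread leaves open whether launchType and requiresCompatibilities have to match, a local sanity check before registering the flow may save a round trip. The sketch below is an assumption-laden illustration: `check_launch_type` is a hypothetical name, it only compares the two dicts passed to it, and it says nothing about what the agent overrides at submit time.

```python
from typing import List


def check_launch_type(task_definition: dict, run_task_kwargs: dict) -> List[str]:
    """Return a list of likely mismatches between a task definition and run_task kwargs.

    Hypothetical helper; an empty list means no obvious conflict was found.
    """
    problems = []
    launch_type = run_task_kwargs.get("launchType")
    compatibilities = task_definition.get("requiresCompatibilities", [])

    # The launch type should be one the task definition declares support for
    if launch_type and compatibilities and launch_type not in compatibilities:
        problems.append(
            f"launchType {launch_type} not in requiresCompatibilities {compatibilities}"
        )

    # Fargate tasks only support the awsvpc network mode
    if launch_type == "FARGATE" and task_definition.get("networkMode") != "awsvpc":
        problems.append("launchType FARGATE requires networkMode awsvpc")

    return problems


# Same shape as the config in the thread
ok = check_launch_type(
    {"networkMode": "awsvpc", "requiresCompatibilities": ["EC2"]},
    {"launchType": "EC2"},
)
mismatch = check_launch_type(
    {"networkMode": "bridge", "requiresCompatibilities": ["EC2"]},
    {"launchType": "FARGATE"},
)
```

Here `ok` comes back empty, while `mismatch` reports both the launch type conflict and the Fargate network mode requirement.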
Kathryn Klarich
07/15/2021, 8:44 PM
Kathryn Klarich
07/15/2021, 9:08 PM
Kathryn Klarich
07/15/2021, 9:10 PM
Kevin Kho
Zanie
Kathryn Klarich
07/16/2021, 1:38 PM
CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve ref
even though the flow and container haven't changed at all. We did add one permission to the task role but didn't remove anything, so right now I have no idea why it worked before and doesn't now.
Kathryn Klarich
07/16/2021, 1:39 PM
Kevin Kho
Kathryn Klarich
07/16/2021, 1:47 PM
Kevin Kho
Kathryn Klarich
07/16/2021, 2:01 PM
Kevin Kho
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin XXXXXXXX.dkr.ecr.us-east-2.amazonaws.com
Kathryn Klarich
07/16/2021, 2:15 PM
Kathryn Klarich
07/16/2021, 2:26 PM
Kathryn Klarich
07/16/2021, 9:15 PM
ECSAgent as a Fargate task on an ECS cluster. We use the agent to submit flows as ECS tasks (so pretty much a standard production setup). I started by using the default params for the ECSRun config, but got an error saying that I couldn't create Fargate tasks, so I changed the launchType to EC2 and then ran into this issue. However, when trying the fix that Kevin suggested (i.e. setting networkMode to bridge), I would get the error: Network Configuration is not valid for the given networkMode of this task definition. This error doesn't make sense, as boto will only allow you to pass a networkConfiguration with the awsvpc network mode, so we had to switch to awsvpc. However, we were then unable to connect to the Prefect API, as the container running inside the task did not have any network access, so the requests timed out. Since I knew the Prefect agent was able to connect to the Prefect API, I decided to copy all of the agent configs to the task definition, and magically it worked. Which leaves us with a couple of questions: why, when using the networkMode bridge, do we get an error about network configuration? Do the configurations and launchType need to be identical between the agent and the task it's trying to launch?
Kathryn Klarich
07/27/2021, 6:02 PM
Kevin Kho
Kathryn Klarich
07/27/2021, 8:01 PM
Ben Muller
07/27/2021, 8:15 PM
Kathryn Klarich
08/04/2021, 5:18 PM