# ask-community
s
I feel like I'm so close to getting my ECS Task to run w/ the ECS Agent, but I'm having an odd issue where the flow that's calling this task just hangs forever in a scheduled/pending state. I can see in the AWS console that the ECS task is being run, and prefect is indeed passing a lot of ENV variables to the container that's getting spun up, but nonetheless my flow is always stuck in a "Submitted" state and I'm not seeing any logs in the Cloud UI
in the AWS console:
all I ever see in Prefect UI:
z
Hey @Sean Talia, can you start your agent with the setting `PREFECT__CLOUD__AGENT__LEVEL=DEBUG` and provide the logs of it deploying the flow?
s
certainly, one second
[2021-03-03 21:51:20,615] DEBUG - Sean ECS Agent | Found flow runs ['220201ce-ad50-4a59-b4ec-d08e2f7c639e']
[2021-03-03 21:51:20,615] DEBUG - Sean ECS Agent | Querying flow run metadata
[2021-03-03 21:51:20,798] INFO - Sean ECS Agent | Found 1 flow run(s) to submit for execution.
[2021-03-03 21:51:20,798] DEBUG - Sean ECS Agent | Updating states for flow run 220201ce-ad50-4a59-b4ec-d08e2f7c639e
[2021-03-03 21:51:20,803] DEBUG - Sean ECS Agent | Next query for flow runs in 0.25 seconds
[2021-03-03 21:51:20,804] DEBUG - Sean ECS Agent | Flow run 220201ce-ad50-4a59-b4ec-d08e2f7c639e is in a Scheduled state, updating to Submitted
[2021-03-03 21:51:21,033] INFO - Sean ECS Agent | Deploying flow run '220201ce-ad50-4a59-b4ec-d08e2f7c639e'
[2021-03-03 21:51:21,033] DEBUG - Sean ECS Agent | Using task definition prefect-test-task-stage:5 for flow 447bf955-c4b3-464b-a4b6-b9e15c3497a5
[2021-03-03 21:51:21,059] DEBUG - Sean ECS Agent | Querying for flow runs
[2021-03-03 21:51:21,244] DEBUG - Sean ECS Agent | No flow runs found
[2021-03-03 21:51:21,245] DEBUG - Sean ECS Agent | Next query for flow runs in 0.5 seconds
[2021-03-03 21:51:21,748] DEBUG - Sean ECS Agent | Querying for flow runs
[2021-03-03 21:51:22,084] DEBUG - Sean ECS Agent | No flow runs found
[2021-03-03 21:51:22,084] DEBUG - Sean ECS Agent | Next query for flow runs in 1.0 seconds
[2021-03-03 21:51:22,298] DEBUG - Sean ECS Agent | Started task 'arn:aws:ecs:us-east-2:<ACCOUNT-ID>:task/<CLUSTER-NAME>/91f00f616a094bef9985dc10adfe5d49' for flow run '220201ce-ad50-4a59-b4ec-d08e2f7c639e'
[2021-03-03 21:51:22,465] DEBUG - Sean ECS Agent | Completed flow run submission (id: 220201ce-ad50-4a59-b4ec-d08e2f7c639e)
z
Hmm okay so everything looks good on the agent's end. Perhaps your flow task is unable to communicate with the Cloud API?
I'm no ECS expert -- can you pull logs from the container?
s
yeah that was also my suspicion, I was hoping someone might have been like "oh yeah i've had this happen easy fix"
i also am no ECS expert, this is my first time working with it so sadly my debugging skills here are quite lacking
z
Haha understandable. I've pinged someone on our devops team.
Are you using a custom task definition or the default?
s
i'm using a custom task that I registered to ECS through terraform; my org has a framework for quickly spinning up tasks that have all kinds of bells and whistles attached to them that we don't want to have to manually configure
z
Are you running ECS on Fargate?
s
which i'm sure will make it more difficult for me to provide insight into how the task has been configured 😇
yep
I think the best next step is to get ahold of the container logs
Since you're already using a custom task it should be pretty straightforward?
s
yeah i think that's my only hope
ha that's what I would have thought, but I don't see my logs showing up in cloudwatch either
(classic)
I'll figure out what's going on and report back when I solve this
okay, we're getting somewhere
it's odd, the image that i'm using for the flow is a custom one that uses `prefecthq/prefect:0.14.5-python3.8` as its base
it seems like the command on the container isn't getting set or overridden or something
z
Huh that's weird. Can you inspect the `command` on the actual task definition?
s
i actually didn't specify one on the task definition itself because i assumed that the `ECSRun` config was going to override it
but what's interesting is that i just changed the `image` in my ECS task definition itself to be something different from what my flow requires, and i'm seeing that the image is actually not being overridden either
but actually yes i see that thing i was just referencing is happening in the `register_task_definition` function and not in the `deploy_flow` function
that's the problem then i think
I think that `container["command"] = ["/bin/sh", "-c", get_flow_run_command(flow_run)]` needs to get passed to `containerOverrides`
z
I've pinged our run config expert 🙂 we'll look into this
It looks like right now you'd have to include the flow run command in your custom task definition
Jim said he'd look at fixing this tomorrow, basically setting a default command if you haven't. I think it's written as is so you can define more complex commands if you want.
d
@Marvin open “ECSRun with custom task definition does not set default container options”
s
awesome, thanks for opening this up! just to follow up on this, i've made some adjustments to my flow and got it to successfully run! it's a fairly trivial example, but it is executing the flow body and writing the logs, the only issue is that it's never getting out of the `Submitted` state, despite the logs showing that the flow actually finished
it's a little odd, I guess the flow states are never getting communicated back to the agent; I also see that the task results aren't getting published to S3 as I'd expect
would any of the prefect ECS experts have an idea of what might be happening here? I spent most of the morning on this and am pretty stumped...it's weird to me that the cloud instance is having the logs communicated back to it, logs which show the tasks starting and succeeding, and yet my Flow as a whole is never moving past the "Submitted" state and none of the task results are being written to S3 (as they are just fine when I use the DockerAgent / DockerRun pair)
z
What command are you using to run the flow in your task definition?
s
I'm passing these in via my flow's run config:
"containerOverrides": [
    {
        "name": "flow",
        "command": ["/bin/sh", "-c", "prefect execute flow-run"],
    }
]
which I think is just the command that would be getting run if prefect had been responsible for creating/registering the ECS task from scratch, right?
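For reference, the fragment above can be written out as the `overrides` structure that the ECS `run_task` call accepts. This is only a minimal sketch: the helper name `flow_command_override` is hypothetical, the container name `flow` and the command are the ones quoted in this thread, and passing the result through the run config's `run_task_kwargs` is an assumption to check against your Prefect version.

```python
import json


def flow_command_override(container_name="flow"):
    """Build an ECS `overrides` dict that sets the flow-run command.

    `container_name` must match the container name in the task definition;
    "flow" is the name used in this thread.
    """
    return {
        "containerOverrides": [
            {
                "name": container_name,
                # Same command quoted above: run the flow via the Prefect CLI.
                "command": ["/bin/sh", "-c", "prefect execute flow-run"],
            }
        ]
    }


# Assumption: this dict is what gets handed to the run config as run_task_kwargs.
run_task_kwargs = {"overrides": flow_command_override()}
print(json.dumps(run_task_kwargs, indent=2))
```

Keeping the override in a small function makes it easy to reuse the same command for several flows that share one custom task definition.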
z
Yeah.. hmm.
It's missing some environment variables that are also static in the original definition. This will also be addressed in that issue -- the custom task definition was a user-contributed feature and it doesn't do any setup for you as is.
s
oh that's interesting – can you tell me which ENV variables those would be?
I'm basically just doing a POC right now so if I can manually cobble some stuff together to get it going I'd be ecstatic
z
Jim says:
env = {
    "PREFECT__CLOUD__USE_LOCAL_SECRETS": "false",
    "PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS": "prefect.engine.cloud.CloudFlowRunner",
    "PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS": "prefect.engine.cloud.CloudTaskRunner",
}
It's being executed using the plain `FlowRunner` rather than the API-connected one, I presume
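A minimal sketch of wiring those three variables into a container override by hand, for anyone in the same spot. ECS takes container environment as a list of name/value pairs rather than a plain dict, so the dict has to be converted. The helper names `to_ecs_environment` and `add_env_to_override` are hypothetical, and the container name `flow` and the command are the ones used earlier in the thread.

```python
# The three settings quoted above, as a plain dict.
CLOUD_RUNNER_ENV = {
    "PREFECT__CLOUD__USE_LOCAL_SECRETS": "false",
    "PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS": "prefect.engine.cloud.CloudFlowRunner",
    "PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS": "prefect.engine.cloud.CloudTaskRunner",
}


def to_ecs_environment(env):
    """Convert a dict into the list of {"name": ..., "value": ...} pairs ECS expects."""
    return [{"name": key, "value": value} for key, value in sorted(env.items())]


def add_env_to_override(override, env):
    """Return a copy of a container override with `env` merged in.

    Variables already present on the override are kept unless `env`
    sets the same name, in which case the value from `env` wins.
    """
    merged = {item["name"]: item["value"] for item in override.get("environment", [])}
    merged.update(env)
    return {**override, "environment": to_ecs_environment(merged)}


flow_override = add_env_to_override(
    {"name": "flow", "command": ["/bin/sh", "-c", "prefect execute flow-run"]},
    CLOUD_RUNNER_ENV,
)
```

The resulting `flow_override` would then sit inside the `containerOverrides` list, next to the command.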
s
ohhhh now that's very interesting
for what it's worth i do have
[cloud]
use_local_secrets = false
in the image's `~/.prefect/config.toml`, but obviously overriding that w/ the ENV var is better
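As a rough illustration of why the ENV var and the TOML entry target the same setting: assuming the naming convention from Prefect's configuration docs (a `PREFECT__` prefix, with double underscores separating nested keys), the variable name maps onto the `[cloud]` section like this. The function `env_var_to_config_path` is hypothetical and is not Prefect's own implementation.

```python
def env_var_to_config_path(name, prefix="PREFECT__"):
    """Split a PREFECT__SECTION__KEY style variable name into config keys.

    Assumes the double-underscore nesting convention; raises ValueError
    for names that do not start with the prefix.
    """
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    return name[len(prefix):].lower().split("__")


path = env_var_to_config_path("PREFECT__CLOUD__USE_LOCAL_SECRETS")
print(".".join(path))
```

Under that assumption the variable addresses `use_local_secrets` under `[cloud]`, which is the same key set in the TOML snippet above.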
i'm going to add these and see what happens 😄
wow
@Jim Crist-Harif can I buy you a coffee
that did it
also i now see those were things being set as part of the agent's `generate_task_definition()` ... sigh
okay, thanks for your help everyone, this is awesome
is there anything I can help with in terms of compiling all of these issues in one place and maybe making a feature suggestion/request about it?
I feel like my use case isn't so crazy that it wouldn't be helpful to address some of these difficulties I was having in a future release
z
If you'd like to open an issue that explains how you set up the logging and inspected the container logs, that would be nice. Otherwise I think Jim plans to resolve all of the `generate_task_definition()` issues in a single PR.