# prefect-community
m
I’ve got an issue with Prefect v1 and the ECS agent: a flow is stuck spinning on
Submitted for execution: Task arn:<task-arn>
with no further logs generated. In CloudWatch, it only shows:
2022-05-17T16:18:52.575-04:00	[2022-05-17 20:18:52,575] INFO - agent | Deploying flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989 to execution environment...
2022-05-17T16:18:53.805-04:00	[2022-05-17 20:18:53,805] INFO - agent | Completed deployment of flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989
k
Are these CloudWatch logs from the specific ECS task? Does it say anything if you go to the task page?
m
they are from the agent log group, and reference that flow run ID correctly
if I click through the logs link in the task page, it shows me the info logs
2022-05-17T16:18:52.575-04:00	[2022-05-17 20:18:52,575] INFO - agent | Deploying flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989 to execution environment...
2022-05-17T16:18:53.805-04:00	[2022-05-17 20:18:53,805] INFO - agent | Completed deployment of flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989
2022-05-17T16:38:47.650-04:00	[2022-05-17 20:38:47,650] INFO - agent | Deploying flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989 to execution environment...
2022-05-17T16:38:48.748-04:00	[2022-05-17 20:38:48,748] INFO - agent | Completed deployment of flow run f2f45aac-ccfb-4bb9-88db-fb1d00426989
k
Ohh are you using a LocalDaskExecutor?
m
our data engineer is submitting a flow run to the ECS agent via Prefect?
it shows up in the Prefect Cloud UI, tied to the ECS agent
k
I know but what is the executor on the Flow?
m
where can I see that information?
k
Ah, if you didn’t specify any, it should just be the default LocalExecutor. I was asking because the LocalDaskExecutor with processes was not sending logs recently. If the Flow is still stuck in Submitted on the Prefect UI, I think we have a different issue though. Looking at this again, it feels like the container may not have started right. This is the log associated with the agent. Do you have logs associated with the Flow?
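For reference, the executor is set on the Flow object in the Flow code itself, so if nothing like this sketch appears there, you’re on the default LocalExecutor (the flow name here is just illustrative):

from prefect import Flow
from prefect.executors import LocalDaskExecutor

with Flow("hello-flow") as flow:  # illustrative flow name
    ...

# If this line is absent, the Flow runs with the default LocalExecutor
flow.executor = LocalDaskExecutor(scheduler="processes")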
m
I never get a log entry beyond “submitted”
k
I get that. On the Prefect side, it likely means there was an error even before the Flow started running. Is the Flow configured to log to CloudWatch too?
m
how do you configure that?
I basically set up the TF module Prefect provided, and AFAIK the Prefect Cloud config is fairly standard
k
This is a good example.
So this will add logging to the Flow so you can get visibility into errors that happen between Flow spin-up and execution (the Prefect logger is not set up yet at that point, which is why we don’t get visibility).
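Roughly, the idea is to attach an awslogs log configuration to the flow’s container in the ECS task definition, so output produced before the Prefect logger is up still lands in CloudWatch. A minimal sketch, assuming placeholder values for the log group, region, and label (not the exact config from the linked example):

from prefect.run_configs import ECSRun

RUN_CONFIG = ECSRun(
    task_definition={
        "containerDefinitions": [
            {
                # the Prefect ECS agent runs the flow in a container named "flow"
                "name": "flow",
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/prefect-flows",  # placeholder log group
                        "awslogs-region": "us-east-1",          # placeholder region
                        "awslogs-stream-prefix": "flow",
                    },
                },
            }
        ]
    },
    labels=["ecs"],  # placeholder label
)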
m
ah
and that should work out of the box, since we already configured cloudwatch permissions via the provided TF module, correct?
k
Yes, as long as the CloudWatch log group already exists.
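If the group doesn’t exist yet, something along these lines creates it (region and group name are placeholders and need to match whatever the task definition references):

import boto3

logs = boto3.client("logs", region_name="us-east-1")      # placeholder region
logs.create_log_group(logGroupName="/ecs/prefect-flows")  # placeholder group name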
m
ok, I’ll give that a try and report back, one moment
just to clear up…. that’s a config for a Flow run right, not an agent?
k
Yes, that is attached to the Flow, so you need to re-register.
m
lol now I don’t even get the “Submitted” log
just spins on “scheduled”
k
Uhh, that shouldn’t be the case. You just added the logging section to the Flow, right? Is the agent still running to pick it up?
m
agent is live
I just copy/pasted the basic tutorial flow to eliminate other variables
import prefect
from prefect import Flow, task

@task
def hello_task():
    logger = prefect.context.get("logger")
    logger.info("Hi from Prefect %s from flow %s", prefect.__version__, FLOW_NAME)
    return


with Flow(FLOW_NAME, run_config=RUN_CONFIG) as flow:
    hello_task()

flow.register(project_name="mike-test")
k
Ah, this looks like it used the default Local storage, which adds a local hostname label by default. So now you have a label mismatch between the Flow and the agent, and the agent won’t pick it up.
m
ah, so labels aren’t partial
interestingly enough my data engineer’s flow had the same problem but he got “submitted”
k
I think they must have used another storage without the default labels; only the default Local storage adds them. Local storage will also not work on ECS, because the Flow is on your local machine and the container won’t have access to it. Agent labels must be a superset of Flow labels for the agent to pick the run up.
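Concretely, whatever labels end up on the Flow (from the storage defaults or from the run config) also have to be present on the agent; a sketch, with “ecs” as an illustrative label:

from prefect import Flow
from prefect.run_configs import ECSRun

with Flow("hello-flow") as flow:  # illustrative flow name
    ...

flow.run_config = ECSRun(labels=["ecs"])  # labels attached to the Flow

# The agent must carry at least the same label, e.g.:
#   prefect agent ecs start --label ecs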
m
I suspect he had other config abnormalities tho
k
Typically with ECS, you use S3 storage or Docker storage hosted in ECR. What I really suspect happened with your initial error, though, is that the agent could not pull the image (missing the appropriate IAM roles) or the image simply can’t run on ECS (an architecture issue, like being built on an M1 Mac).
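For instance, S3 storage plus the stock Prefect image would look roughly like this (bucket name, image tag, and label are placeholders):

from prefect import Flow
from prefect.run_configs import ECSRun
from prefect.storage import S3

with Flow("hello-flow") as flow:  # illustrative flow name
    ...

flow.storage = S3(bucket="my-prefect-flows")    # placeholder bucket
flow.run_config = ECSRun(
    image="prefecthq/prefect:1.2.0-python3.9",  # placeholder Prefect image tag
    labels=["ecs"],                             # placeholder label
)
flow.register(project_name="mike-test")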
m
hmm
is all of this in the docs for running on ECS?
if so, totally missed it
also, is this the error that indicates local/cloud mismatch?
An error occurred (ServerException) when calling the RunTask operation (reached max retries: 2): Service Unavailable. Please try again later.
k
Not exactly. I think the docstring here covers a bunch of this, but not that error specifically.
I believe that is an ECS issue specifically, not a Prefect log
m
FWIW I do not see anything re: the contents of this discussion on: https://docs.prefect.io/orchestration/agents/ecs.html#ecs-agent
in fact in a couple of places it seems to imply local storage is ok:
https://docs.prefect.io/orchestration/agents/ecs.html#ecs-agent
To provide your own task definition template, you can use the --task-definition flag. This takes a path to a job template YAML file. The path can be local to the agent, or stored in cloud storage on S3.
k
Ah, well, more like: you can use anything but Local (CodeCommit, Bitbucket, GitHub, etc.).
That is not the Flow storage, though. That flag is for the task definition of the ECS task.
m
k
so pretty clearly no reference to this requirement on that page then
k
If you do use local Flow storage, you get an error that’s described in the FAQ. You can actually get it if you do:
with Flow(...) as flow:
    ...

flow.storage.add_default_labels = False
m
ok, so you actually need a remote… storage mechanism somewhere files are hosted?
k
Any besides Local. Well, you can actually use Local if the Flow file already lives inside the container that the ECS task is using; it will be looked up relative to the container’s file paths.
This is an example where you can use Local storage with Kubernetes and the flow file lives inside the specified image.
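A rough sketch of that pattern: the flow file is baked into the image and Local storage just points at it (the path and flow name are placeholders):

from prefect import Flow
from prefect.storage import Local

with Flow("hello-flow") as flow:  # illustrative flow name
    ...

# Nothing gets uploaded; the container is expected to already contain this file
flow.storage = Local(
    path="/opt/prefect/flows/hello_flow.py",  # placeholder path inside the image
    stored_as_script=True,
)
flow.storage.add_default_labels = False  # avoid the local hostname label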
m
ok
do I need to upload the flow file to the bucket first, or does that occur as part of registration?
k
It occurs as part of registration for you
m
kk, still getting that “Service Unavailable”, which in a random SO thread seems to indicate a config value is null somewhere?
k
I saw that. Hard to tell what causes that. If you just want to test a working setup, you could use the Prefect image. What kind of things are you setting in ECSRun or do you have your own task definition?
m
literally just copied that github file link
including using the Prefect image
only changes are a couple of hard coded values
k
Yeah that seems like it should work. Have not encountered that specific error message before myself.
m
where’s the best place to post issues for official response?
k
from Prefect? or from AWS? That specific error is hard to help with without a reproducible example.
m
Prefect
if this keeps happening without some kind of status change from AWS re: service availability I think we can reasonably eliminate that as the actual cause
k
That’s not entirely true, though. We do see some ECS containers fail to spin up because they intermittently fail to get resources (I don’t know if the AWS service is really down here). The Stack Overflow post you mentioned suggests something is wrong with the task definition, which we just pass through. I don’t know what more official response you can get: you can post a GitHub issue, but we really need a reproducible example to debug it. You could also reach out to Professional Services and they will help debug.
m
ok, no problem, thank you again, I’ll take this back to our group, I know we’re in evaluation stage with some dataflow products so this is valuable information
k
Of course! If you find a path to reproduce that, you could open an issue on the repo too
m
well it still happens on that flow….
I dunno if that qualifies as reproducible
k
If you DM me the code I can take a look
m
sent