Alex Welch

02/16/2021, 2:17 PM
Hi all, I am trying to run a Prefect agent on ECS and have been following this tutorial. I used the configuration below (along with exporting the runner token, AWS creds, etc. to environment variables). However, I keep receiving a
ValueError: Failed to infer default networkConfiguration, please explicitly configure using --run-task-kwargs
error. I have tried all the different combinations I can think of and been through the docs (both Prefect and ECS) but can't find a solution that works. Has anyone else been able to solve this?
export networkConfiguration="{'awsvpcConfiguration': {'assignPublicIp': 'ENABLED', 'subnets': ['<EC2 instance subnet>'], 'securityGroups': ['<EC2 instance security group>']}}"
prefect agent ecs start --token $RUNNER_TOKEN_FARGATE_AGENT \
 --task-role-arn=arn:aws:iam::<AWS ACCOUNT ID>:role/ECSTaskS3Role \
 --log-level INFO --label prod --label s3-flow-storage \
 --name prefect-prod
when i run
--log-level DEBUG
it looks like nothing is being passed to the environment variables. I’m thinking this is the problem?
but when i add in the
networkConfiguration
as an
--env
I get
Error: Got unexpected extra arguments

ale

02/16/2021, 2:30 PM
Hey @Alex Welch, from the docs it seems that if you want to provide
networkConfiguration
you have to provide it in a YAML file. Something like this
prefect agent ecs start --run-task-kwargs /path/to/options.yaml
If you want to provide the YAML file from S3:
prefect agent ecs start --run-task-kwargs s3://bucket/path/to/options.yaml
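Since the thread never shows the file itself, here is a minimal sketch of what such an options file might contain. It assumes the file's top-level keys are passed through as keyword arguments to the ECS `run_task` call, so `networkConfiguration` follows the `awsvpcConfiguration` shape that call expects. The helper name `render_run_task_kwargs` and the subnet and security group IDs are placeholders, not values from the thread.

```python
from pathlib import Path


def render_run_task_kwargs(subnets, security_groups, assign_public_ip="ENABLED"):
    """Return YAML text for a --run-task-kwargs options file.

    Assumption: the agent forwards these keys to ECS run_task, whose
    networkConfiguration argument nests an awsvpcConfiguration mapping.
    """
    lines = [
        "networkConfiguration:",
        "  awsvpcConfiguration:",
        f"    assignPublicIp: {assign_public_ip}",
        "    subnets:",
    ]
    lines += [f"      - {subnet}" for subnet in subnets]
    lines.append("    securityGroups:")
    lines += [f"      - {group}" for group in security_groups]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder IDs; substitute the subnet and security group you were given.
    text = render_run_task_kwargs(
        subnets=["subnet-0123456789abcdef0"],
        security_groups=["sg-0123456789abcdef0"],
    )
    Path("options.yaml").write_text(text)
    print(text)
```

Saving the rendered text as options.yaml and passing that path to `--run-task-kwargs` is the step described in the message above.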

Alex Welch

02/16/2021, 3:18 PM
i noticed that, but i didn't see an example of what the yaml needed to look like
i found some docs on the available environment variables in prefect and networkConfiguration wasn’t one of them

ale

02/16/2021, 3:25 PM
This might help you: https://github.com/PrefectHQ/prefect/blob/18fcb5dfce7ba3ecf9db3c6a6f0efa21d9403d88/tests/agent/test_ecs_agent.py#L244 Maybe someone from the Prefect team can point you to the
kwargs.yaml
, I’m not able to find it 😅

Alex Welch

02/16/2021, 3:38 PM
thanks, i'll give it a try
👍 1
looks like that solved the networking part… i think? I now get
Failed to load and execute Flow's environment: UnpicklingError("invalid load key, '{'.")
when i run my flow

ale

02/16/2021, 3:45 PM
Would you mind sharing the YAML content? I’m not sure this is related to networkConfiguration, though

Alex Welch

02/16/2021, 3:48 PM
based on some earlier threads, I did validate that both my flow and the ECS Prefect versions were 0.14.8
and that cloudpickle is running at 1.6.0

ale

02/16/2021, 3:58 PM
Can you share your logs? Maybe setting logs to debug (if not already on debug level) would help in troubleshooting

Alex Welch

02/16/2021, 3:59 PM
that is what I have
in the actual agent it looks like

ale

02/16/2021, 4:02 PM
I suggest you restart the agent with the following options:
--show-flow-logs --verbose
This way you can see the flow run logs in the agent console

Alex Welch

02/16/2021, 4:04 PM
Error: no such option: --show-flow-logs
this is what is available

ale

02/16/2021, 4:06 PM
Try setting log level to debug
I also suggest you try to call this function
prefect.utilities.debug.is_serializable
on your flow https://docs.prefect.io/api/latest/utilities/debug.html#functions

Alex Welch

02/16/2021, 4:09 PM
so i have log level set to debug
give me one second to dig into that doc
here is my flow
oh! it worked
but…. why lol

ale

02/16/2021, 4:13 PM
😅

Alex Welch

02/16/2021, 4:13 PM
well that's amazing

ale

02/16/2021, 4:14 PM
Well, we don’t know how we did it….but we did it!

Jim Crist-Harif

02/16/2021, 5:00 PM
Failed to load and execute Flow's environment: UnpicklingError("invalid load key, '{'.")
This error indicates that the image used to run the flow was older than prefect 0.14.3, while the version of prefect used to register the flow was >= 0.14.3. In 0.14.3 we changed the serialization format to include some metadata like version numbers, so in the future if this happens you'll get a nice error message. We've never made guarantees that running a flow using a version of prefect older than the version the flow was registered with would work, so this wasn't really a breaking change, but the error that happens if you've accidentally been doing this is a bit cryptic.
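To see why an older runtime produces exactly this message, here is a minimal sketch using only the standard library. It assumes, for illustration, that the newer serialized payload is JSON text wrapping the flow plus version metadata; the payload contents and the names `new_style_payload` and `describe_failure` are hypothetical. What matters is that the bytes start with `{`, which plain pickle does not recognise as an opcode.

```python
import json
import pickle

# Hypothetical stand-in for a newer-format payload: JSON metadata around the flow.
new_style_payload = json.dumps(
    {"versions": {"prefect": "0.14.8"}, "flow": "<encoded flow bytes>"}
).encode("utf-8")

# Stand-in for an older-format payload: raw pickle bytes with no wrapper.
old_style_payload = pickle.dumps({"name": "my-flow"})


def describe_failure(payload: bytes):
    """Load the payload the way a runtime expecting raw pickle bytes would.

    Returns the error text if unpickling fails, or None if it succeeds.
    """
    try:
        pickle.loads(payload)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print("old format:", describe_failure(old_style_payload))
    print("new format:", describe_failure(new_style_payload))
```

The first byte of the JSON-wrapped payload is `{`, so the loader rejects it immediately, which is why the message points at a load key rather than at a version mismatch.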
Regarding the
networkConfiguration
thing, specifying that via a yaml file provided to
--run-task-kwargs
is the correct way to fix it. Is there something we could do to make this clearer? The error message you got specifically calls out using
--run-task-kwargs
to fix it, and the behavior of this flag is documented in both the CLI help and the docs: https://docs.prefect.io/orchestration/agents/ecs.html#custom-runtime-options. Happy to make any docs updates needed to help future users who may run into this issue.

ale

02/17/2021, 8:46 AM
Hey @Jim Crist-Harif 🙂 Thanks a lot for the explanation of the error, very useful!

Alex Welch

02/18/2021, 4:27 PM
@Jim Crist-Harif thank you for the explanation. I might recommend either adding some verbiage to the pickling error message that recommends checking your prefect versions or putting something in a troubleshooting page in the docs
i was following a tutorial you guys had put out but (either i missed it or it wasn’t in the article) i didn't see anything about the version of prefect used
regarding the
networkConfiguration
I would recommend some documentation around how to best use and structure
--run-task-kwargs
when i was doing my research into how to use them I got hung up on what arguments prefect could take and ended up here. It was only after @ale shared the code from the test_ecs_agent that I understood what and how to pass things. So maybe an example of what that
options.yml
could or should look like? and again - I’ve seen the
networkConfiguration
issue mentioned in other threads here. Maybe it makes sense to dump it in a Troubleshooting section of the documentation?
here is the test_ecs_agent code

Sean Talia

02/25/2021, 7:07 PM
i'm also just starting to play around with the ECS agent and am running into some of the same difficulties 😎
why is part of the setup for the agent to try to fetch the default VPC and then set
assignPublicIp= 'ENABLED'
on the vpc configuration?
do you NEED to have the
assignPublicIp
key set to ENABLED?
@Jim Crist-Harif i do think that it would be helpful for there to be some mention in the ECS Agent documentation that you need to either have a default VPC set or you must feed some required set of specifics about your VPC + network configuration (e.g. subnets, security groups, etc.)

Alex Welch

02/25/2021, 10:02 PM
I think that is a great point
having something that explains the reasoning and/or alternatives

Jim Crist-Harif

02/25/2021, 11:18 PM
I confess that I'm not the best at AWS stuff, so there might be a better way. We tried to make this as simple as possible - with zero configuration specified, the ECS agent attempts to infer the proper VPC configuration for you. For most deployments you'll need
assignPublicIp
configured, since otherwise you won't have public internet access (so pulling an image from dockerhub wouldn't work). Users that would want to override this probably also know what to do since this requires extra AWS effort on their end.
👍 1
I agree this could be better documented though.
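The inference described here can be sketched roughly as follows. This is an illustration under stated assumptions, not Prefect's actual implementation: the function name `infer_network_configuration` is hypothetical, and the inputs are plain lists shaped like the `Vpcs` and `Subnets` entries that EC2's describe calls return (`VpcId`, `IsDefault`, `SubnetId`).

```python
def infer_network_configuration(vpcs, subnets):
    """Build a default awsvpc network configuration from the default VPC.

    vpcs:    list of dicts with "VpcId" and "IsDefault" keys (assumed shape)
    subnets: list of dicts with "SubnetId" and "VpcId" keys (assumed shape)
    Raises ValueError when the account exposes no default VPC.
    """
    default_vpc = next((vpc for vpc in vpcs if vpc.get("IsDefault")), None)
    if default_vpc is None:
        raise ValueError(
            "Failed to infer default networkConfiguration, "
            "please explicitly configure using --run-task-kwargs"
        )
    subnet_ids = [
        subnet["SubnetId"]
        for subnet in subnets
        if subnet["VpcId"] == default_vpc["VpcId"]
    ]
    return {
        "awsvpcConfiguration": {
            "subnets": subnet_ids,
            # A public IP gives the task outbound internet access,
            # for example to pull an image from a public registry.
            "assignPublicIp": "ENABLED",
        }
    }


if __name__ == "__main__":
    example_vpcs = [{"VpcId": "vpc-default", "IsDefault": True}]
    example_subnets = [{"SubnetId": "subnet-a", "VpcId": "vpc-default"}]
    print(infer_network_configuration(example_vpcs, example_subnets))
```

When no default VPC is visible to the agent's credentials, a lookup like this has nothing to fall back on, so the configuration has to be supplied explicitly.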

Alex Welch

02/26/2021, 4:32 AM
Users that would want to override this probably also know what to do since this requires extra AWS effort on their end
In my case, all the networking had been handled by another team and I didn’t have any control over it. To make things more complicated, I did not have access to the
default vpc
and so I had to go the route i ended up at. I think this may be true for many analytics-engineer type roles. The infra/network teams would handle all of that stuff to ensure security.

felice

02/27/2021, 2:17 AM
hi all, this was very helpful - thank you! i had the same problem when trying to run on a cluster in my company's aws account (though no problems in my own account), so likely it was specific to the networking set up by our infra/devops team, which, similar to @Alex Welch, i have no control over.