# prefect-community
m
Hey All, I'm running an EKS cluster where I host the Prefect Agent. However, I'm not able to get flows working. I am using an S3 storage block without the credentials set, so it means that the EKS cluster needs to have permission to pull the code from S3. When I trigger a flow execution I get this in the agent logs:
| _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s): infra-dev-plexflow-2...
23:25:42.455 | INFO    | prefect.agent - Submitting flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:25:44.631 | INFO    | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod has status 'Pending'.
23:25:44.705 | INFO    | prefect.agent - Completed submission of flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:26:44.628 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod never started.
23:26:44.792 | INFO    | prefect.agent - Reported flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a' as crashed: Flow run infrastructure exited with non-zero status code -1.
.. and these are the logs from the job pod..
kubectl logs flow-infra-data-engineering-infra-dev-jhqhm-74k5l --follow
/usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
23:31:09.701 | INFO    | Flow run 'economic-jerboa' - Downloading flow code from storage at ''
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
Traceback (most recent call last):
...
..
..
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.eu-west-1.amazonaws.com/"
.. it gets stuck trying to download the code. The role I have assigned to the cluster has S3 full access. My S3 Block has this configuration:
Bucket Path: <bucket-name>
AWS Access Key Id: None
AWS Secret Access Key: None
There seem to be multiple problems:
1. Why does the flow fail right after it starts with "Pod never started."?
2. Why can't it pull the code from S3, and why does the log show "Downloading flow code from storage at ''" with an empty path?
n
hi @Maikel Penz - hmm these lines look the most suspect to me
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
...
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.eu-west-1.amazonaws.com/"
this error definitely seems permissions-related somehow. a couple of ideas to check:
1. can your worker nodes actually assume your full S3 access role? (docs)
2. could this be a networking issue? the timeout to STS is kind of odd to me
if you're just getting started, I'd recommend checking out Prefect projects / workers / work pools for managing your deployment and its execution environment, since that is our recommendation going forward
happy to continue debugging with you if you're still blocked
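for the networking angle, here's a minimal sketch of a reachability check you could run from inside the job pod. it's not Prefect-specific and only uses the Python standard library; the helper name `can_connect` and the 5 second default timeout are assumptions made for illustration. it only tells you whether a TCP connection to the endpoint opens at all, which is the step that timed out in the traceback, and says nothing about IAM permissions.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it checks network reachability only,
    not credentials or IAM permissions.
    """
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS failures, refused connections, and timeouts.
        return False
```

e.g. calling `can_connect("sts.eu-west-1.amazonaws.com")` in a Python shell in the pod: if it returns False there but True from a machine outside the VPC, that points at routing, DNS, or security group settings rather than at the role itself.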
m
thanks for the input Nate. Yeah it turns out there's an issue with a new custom VPC we're creating. It seems to be unrelated to Prefect. 👍
n
👍