Maikel Penz
04/13/2023, 11:35 PM
| _ \ _ \ __| __| __/ __|_ _| /_\ / __| __| \| |_ _|
| _/ / _|| _|| _| (__ | | / _ \ (_ | _|| .` | | |
|_| |_|_\___|_| |___\___| |_| /_/ \_\___|___|_|\_| |_|
Agent started! Looking for work from queue(s): infra-dev-plexflow-2...
23:25:42.455 | INFO | prefect.agent - Submitting flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:25:44.631 | INFO | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod has status 'Pending'.
23:25:44.705 | INFO | prefect.agent - Completed submission of flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:26:44.628 | ERROR | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod never started.
23:26:44.792 | INFO | prefect.agent - Reported flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a' as crashed: Flow run infrastructure exited with non-zero status code -1.
.. and these are the logs from the job pod..
kubectl logs flow-infra-data-engineering-infra-dev-jhqhm-74k5l --follow
/usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
23:31:09.701 | INFO | Flow run 'economic-jerboa' - Downloading flow code from storage at ''
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
Traceback (most recent call last):
...
..
..
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.eu-west-1.amazonaws.com/"
.. it gets stuck trying to download the code. The role I have assigned to the cluster has S3 full access.
My S3 Block has this configuration:
Bucket Path: <bucket-name>
AWS Access Key Id: None
AWS Secret Access Key: None
There seem to be multiple problems, but:
1. Why does the flow fail right after it starts with "Pod never started."?
2. And why can't it pull the code from S3, and why does the log show "Downloading flow code from storage at ''" with an empty path?
Nate
04/14/2023, 3:08 PM
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
...
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.eu-west-1.amazonaws.com/"
this error definitely seems permissions-related somehow, a couple ideas to check:
1. can your worker nodes actually assume your full s3 access role? docs
2. could this be a networking issue? the timeout to STS is kind of odd to me
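one way to narrow down idea 2 is a plain TCP check against the STS endpoint from the traceback, run from a pod on the same nodes. This is only a rough sketch: `can_reach` is a hypothetical helper (not part of Prefect or botocore), it uses only the Python standard library, and a successful connection says nothing about IAM permissions, only that the network path to the endpoint is open.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Endpoint copied from the ConnectTimeoutError in the pod logs.
    host = "sts.eu-west-1.amazonaws.com"
    print(f"{host}:443 reachable: {can_reach(host)}")
```

if this reports the endpoint as unreachable from inside the cluster, the timeout is more likely a routing or egress problem than a role problem; if it is reachable, idea 1 (whether the nodes can actually assume the role) becomes the better lead.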
if you're just getting started, I'd recommend checking prefect projects / workers / work pools managing your deployment and its execution environment, since that is our recommendation going forward
happy to continue debugging with you if you're still blocked
Maikel Penz
04/16/2023, 11:50 PM
Nate
04/16/2023, 11:51 PM