Maikel Penz
04/13/2023, 11:35 PM
Agent started! Looking for work from queue(s): infra-dev-plexflow-2...
23:25:42.455 | INFO | prefect.agent - Submitting flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:25:44.631 | INFO | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod has status 'Pending'.
23:25:44.705 | INFO | prefect.agent - Completed submission of flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a'
23:26:44.628 | ERROR | prefect.infrastructure.kubernetes-job - Job 'flow-infra-data-engineering-infra-dev-96hss': Pod never started.
23:26:44.792 | INFO | prefect.agent - Reported flow run '69af3c9b-f22b-42db-a0a5-21595ec5408a' as crashed: Flow run infrastructure exited with non-zero status code -1.
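The two agent timestamps above ('Pending' and 'Pod never started.') are 59.997 seconds apart, essentially one minute. Assuming this is governed by the Prefect 2.x KubernetesJob block's pod_watch_timeout_seconds setting (which, as far as I know, defaults to 60), the agent may simply have stopped waiting before the image pull or node scheduling finished. A minimal stdlib sketch for measuring the gap from the pasted log lines; log_timestamp and seconds_between are hypothetical helper names:

```python
from datetime import datetime

# Two agent log lines copied from above: the pod was reported 'Pending',
# then the agent gave up with 'Pod never started.'
PENDING_LINE = (
    "23:25:44.631 | INFO | prefect.infrastructure.kubernetes-job - "
    "Job 'flow-infra-data-engineering-infra-dev-96hss': Pod has status 'Pending'."
)
NEVER_STARTED_LINE = (
    "23:26:44.628 | ERROR | prefect.infrastructure.kubernetes-job - "
    "Job 'flow-infra-data-engineering-infra-dev-96hss': Pod never started."
)


def log_timestamp(line: str) -> datetime:
    """Parse the leading HH:MM:SS.mmm timestamp of a Prefect log line."""
    stamp = line.split(" | ", 1)[0]
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds between two log lines written on the same day."""
    return (log_timestamp(second) - log_timestamp(first)).total_seconds()


if __name__ == "__main__":
    gap = seconds_between(PENDING_LINE, NEVER_STARTED_LINE)
    print(f"gap between 'Pending' and 'Pod never started': {gap:.3f} s")
```

Run as a script, it reports a gap just under 60 seconds. If the timeout assumption holds, running kubectl describe pod on the job's pod to see why it stayed Pending, or raising the timeout on the infrastructure block, would be reasonable next steps.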
...and these are the logs from the job pod:
kubectl logs flow-infra-data-engineering-infra-dev-jhqhm-74k5l --follow
/usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
23:31:09.701 | INFO | Flow run 'economic-jerboa' - Downloading flow code from storage at ''
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
Traceback (most recent call last):
...
..
..
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "<https://sts.eu-west-1.amazonaws.com/>"
...it gets stuck trying to download the code. The role I have assigned to the cluster has full S3 access.
My S3 Block has this configuration:
Bucket Path: <bucket-name>
AWS Access Key Id: None
AWS Secret Access Key: None
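Because both keys in the S3 block are None, the S3 client in the flow-run pod should fall back to the default AWS credential chain. Assuming this is EKS with IAM roles for service accounts (IRSA), that chain exchanges a web identity token with STS, which would fit the failing call to sts.eu-west-1.amazonaws.com; with a plain node instance profile, credentials would instead come from the instance metadata service. A minimal sketch, assuming the standard botocore environment variable names, that reports which credential-related variables a pod can see; credential_env_report is a hypothetical helper:

```python
import os
from typing import Mapping, Optional

# Environment variables read by botocore's default credential chain.
# AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE are what the EKS pod identity
# webhook injects when a service account is annotated with an IAM role (IRSA).
STATIC_KEY_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
WEB_IDENTITY_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def credential_env_report(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarise which AWS credential env vars are visible to this process."""
    env = os.environ if env is None else env
    return {
        # Explicit keys in the environment take precedence over a role.
        "static_keys": all(env.get(v) for v in STATIC_KEY_VARS),
        # Both IRSA variables must be set for web identity to be used.
        "web_identity": all(env.get(v) for v in WEB_IDENTITY_VARS),
        # 'regional' makes STS calls go to sts.<region>.amazonaws.com.
        "sts_regional": env.get("AWS_STS_REGIONAL_ENDPOINTS") == "regional",
        "region": env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION"),
    }


if __name__ == "__main__":
    # Hypothetical usage: run in a pod with the same image and service
    # account as the flow run.
    for key, value in credential_env_report().items():
        print(f"{key}: {value}")
```

If web_identity comes back False in such a pod, the role is probably not reaching the pod at all, whatever permissions it has.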
There seem to be multiple problems, but:
1. Why does the flow fail with "Pod never started." right after it starts?
2. Why can't it pull the code from S3, and why does the log show "Downloading flow code from storage at ''", which is empty?
Nate
04/14/2023, 3:08 PM
23:36:17.233 | WARNING | aiobotocore.credentials - Refreshing temporary credentials failed during mandatory refresh period.
...
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "<https://sts.eu-west-1.amazonaws.com/>"
this error definitely seems permissions-related somehow; a couple of ideas to check:
1. can your worker nodes actually assume your full S3 access role? (docs)
2. could this be a networking issue? the timeout to STS is kind of odd to me
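For the networking question, a plain TCP check from inside the cluster separates "cannot reach STS at all" from "reached STS but was refused credentials". A minimal stdlib sketch; can_reach is a hypothetical helper, and it assumes you can run Python in a pod that shares the flow run's network path:

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only tests network reachability (DNS, routing, security groups,
    NAT/egress); it says nothing about IAM permissions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts
        # (socket.timeout and socket.gaierror both subclass OSError).
        return False


if __name__ == "__main__":
    # The endpoint botocore timed out on in the pod logs.
    print("sts.eu-west-1.amazonaws.com:443 reachable:",
          can_reach("sts.eu-west-1.amazonaws.com"))
```

If this returns False from a pod on the same nodes, egress (NAT gateway, security groups, or an STS VPC endpoint) seems worth checking before IAM; if it returns True, the role's trust policy becomes the more likely suspect.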
if you're just getting started, I'd recommend checking out prefect projects / workers / work pools for managing your deployment and its execution environment, since that's our recommended approach going forward
happy to continue debugging with you if you're still blocked
Maikel Penz
04/16/2023, 11:50 PM
Nate
04/16/2023, 11:51 PM