Adam
07/01/2025, 8:41 PM
Marvin
07/01/2025, 8:41 PM
Marvin
07/01/2025, 8:42 PM
timeout_seconds parameter
- Set up monitoring to detect and handle stuck runs automatically
To prevent this issue, ensure that worker changes are done with proper shutdown procedures, and consider implementing infrastructure monitoring to detect when workers become unavailable.
If you're frequently changing worker names, you might want to consider using more stable identifiers in your worker configuration and letting the name be purely descriptive while maintaining consistent underlying infrastructure identification.
For more details, you can refer to the Prefect workers documentation and the work pools API reference.
Adam
07/01/2025, 8:46 PM
Marvin
07/01/2025, 8:46 PM
Marvin
07/01/2025, 8:47 PM
Adam
07/01/2025, 8:49 PM
Marvin
07/01/2025, 8:50 PM
Marvin
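07/01/2025, 8:50 PM
```yaml
# In your Kubernetes deployment
spec:
  template:
    spec:
      # Time Kubernetes waits after signalling shutdown before force-killing the pod
      terminationGracePeriodSeconds: 60
      containers:
        - name: prefect-worker
          lifecycle:
            preStop:
              exec:
                # Unverified: confirm that `prefect worker stop` exists in your Prefect
                # version. The container also receives SIGTERM once this hook returns.
                command: ["/bin/sh", "-c", "prefect worker stop --grace-period 60"]
```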
2. Use a StatefulSet Instead of a Deployment:
- StatefulSets maintain a sticky identity for pods
- This helps maintain consistent naming even after pod replacement
- Example:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prefect-worker
spec:
  serviceName: prefect-worker
  podManagementPolicy: Parallel
  # ... rest of your StatefulSet config
```
3. Configure Flow Run Timeouts:
```python
from prefect import flow

@flow(timeout_seconds=3600)  # 1 hour timeout
def your_flow():
    # your flow code
    ...
```
4. Implement Health Checks:
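```yaml
spec:
  containers:
    - name: prefect-worker
      # Assumes the worker serves /health on this port (for example when started
      # with --with-healthcheck). 4200 is the Prefect server's default port, so
      # verify the port your worker actually listens on.
      livenessProbe:
        httpGet:
          path: /health
          port: 4200
        initialDelaySeconds: 30
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health
          port: 4200
        initialDelaySeconds: 30
        periodSeconds: 30
```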
5. Set Up Work Pool Health Monitoring:
- Configure alerts for worker disconnections
- Monitor work pool queue depths
- Set up automatic cleanup of stuck runs using the Prefect API
6. Use Pod Disruption Budgets:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prefect-worker-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: prefect-worker
```
7. Implement Automatic Flow Run Cleanup:
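```python
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Failed


async def cleanup_stuck_runs():
    async with get_client() as client:
        # Find runs that started more than 1 hour ago and are still Running
        cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
        stuck_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            )
        )
        for run in stuck_runs:
            # Mark each stuck run as Failed
            await client.set_flow_run_state(
                flow_run_id=run.id,
                state=Failed(message="Terminated due to stuck state"),
            )
```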
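For the worker disconnection alerts mentioned in item 5, here is a minimal sketch of the staleness check itself. It deliberately avoids calling Prefect so it runs anywhere: `WorkerRecord`, `find_stale_workers`, and the two-minute threshold are hypothetical names and values chosen for illustration. In a real setup the records would presumably come from the Prefect API (for example the client's `read_workers_for_work_pool`, which you should verify against your Prefect version).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class WorkerRecord:
    """Hypothetical stand-in for a worker entry returned by the Prefect API."""
    name: str
    last_heartbeat_time: datetime


def find_stale_workers(workers, now, max_silence=timedelta(minutes=2)):
    """Return names of workers whose last heartbeat is older than max_silence."""
    return [w.name for w in workers if now - w.last_heartbeat_time > max_silence]


# Example: one healthy worker and one that stopped reporting 10 minutes ago
now = datetime(2025, 7, 1, 20, 50, tzinfo=timezone.utc)
workers = [
    WorkerRecord("worker-0", now - timedelta(seconds=30)),
    WorkerRecord("worker-1", now - timedelta(minutes=10)),
]
print(find_stale_workers(workers, now))  # ['worker-1']
```

Any name this returns is a candidate for an alert, and flow runs last picked up by that worker are candidates for the cleanup routine in item 7.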
These measures will help make your Prefect deployment more resilient to pod replacements. The combination of StatefulSets for stable naming, proper shutdown handling, and automated cleanup will minimize disruptions when pods are replaced.
If I may add a personal observation, it's rather like trying to maintain order in a universe that's fundamentally chaotic. At least your pods have a better chance of a dignified death than the bowl of petunias that suddenly materialized and fell from space...
Adam
07/02/2025, 9:59 AM
Marvin
07/02/2025, 9:59 AM
Marvin
07/02/2025, 9:59 AM
Adam
07/02/2025, 10:01 AM
Marvin
07/02/2025, 10:01 AM
Marvin
07/02/2025, 10:02 AM