tony
10/10/2025, 7:44 PM
id=UUID('068e92c0-b816-796f-8000-58ea45be7220') name='friendly-sheep' flow_id=UUID('301be4f1-ec40-4972-84a5-106b4fe39153') state_id=UUID('068e9512-b1cf-7af9-8000-26ed9f411fed') deployment_id=UUID('12326fbc-3954-4a6b-9e75-df2bca814b69') deployment_version='31ff4223' work_queue_name='default' flow_version='8be7733e18f3912d423d585b42d81d02' parameters={'part': 2} idempotency_key='scheduled 12326fbc-3954-4a6b-9e75-df2bca814b69 b6cda2b0-0363-4b4a-addf-2fbb9075d234 2025-10-10 12:00:00-04:00' context={} empirical_policy=FlowRunPolicy(max_retries=0, retry_delay_seconds=0.0, retries=0, retry_delay=0, pause_keys=set(), resuming=False, retry_type=None) tags=['auto-scheduled'] labels={'prefect.flow.id': '301be4f1-ec40-4972-84a5-106b4fe39153', 'prefect.deployment.id': '12326fbc-3954-4a6b-9e75-df2bca814b69'} parent_task_run_id=None run_count=2 expected_start_time=DateTime(2025, 10, 10, 16, 0, 0, tzinfo=Timezone('UTC')) next_scheduled_start_time=None start_time=DateTime(2025, 10, 10, 16, 2, 54, 491333, tzinfo=Timezone('UTC')) end_time=None total_run_time=datetime.timedelta(seconds=8956, microseconds=621941) estimated_run_time=datetime.timedelta(seconds=13238, microseconds=229302) estimated_start_time_delta=datetime.timedelta(seconds=174, microseconds=491333) auto_scheduled=True infrastructure_document_id=None infrastructure_pid='038775ed-3011-4148-829d-3220b345264b:prefect:friendly-sheep-zzqhs' created_by=CreatedBy(id=UUID('b6cda2b0-0363-4b4a-addf-2fbb9075d234'), type='SCHEDULE', display_value='CronSchedule') work_queue_id=UUID('e955b6e6-1380-46b8-b503-2d966130b63b') work_pool_id=UUID('517425c7-20f9-459f-be62-75b36b6bf590') work_pool_name='analysis-pool' state=Running(message='', type=RUNNING, result=None) job_variables={} state_type=StateType.RUNNING state_name='Running'
Marvin
10/10/2025, 7:50 PM
run_count increments each time the same flow run is (re)started after leaving a Running state. This can happen for reasons other than “retries,” such as crash recovery, a manual resume/restart, an infrastructure restart, or a worker re-submission after a missed heartbeat. To find the exact reason for your run’s run_count=2, check the run’s state history, logs, and worker/infrastructure events around the time it restarted.
Here’s a fast way to investigate:
1) Inspect the flow run’s state history
- This will show every state transition and the message that explains it (e.g., “crash detected,” “resumed by <user>,” etc.)
- CLI:
prefect flow-run inspect 068e92c0-b816-796f-8000-58ea45be7220
- In the UI: open the flow run page → State history / Timeline
Docs:
- Flow run states API
2) Review the flow run logs
- Look for a second “Starting flow run...” line or messages like “Resubmitting after crash,” “Lost heartbeat,” “Infrastructure exited,” or “Resumed by …” (a sketch after this step shows one way to filter for these from Python)
- CLI:
prefect flow-run logs 068e92c0-b816-796f-8000-58ea45be7220 --reverse
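If you’d rather do that filtering from Python, here is a minimal sketch. It reuses the client calls shown in step 5 below; the function name and the keyword list are only illustrative, since the exact log phrasing varies by Prefect version and worker type.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import LogFilter, LogFilterFlowRunId
from prefect.client.schemas.sorting import LogSort

FLOW_RUN_ID = "068e92c0-b816-796f-8000-58ea45be7220"
# Illustrative phrases only; adjust to what your workers actually emit
KEYWORDS = ("starting flow run", "crash", "heartbeat", "resubmit", "resumed", "exited")

async def show_restart_hints():
    async with get_client() as client:
        # Pull the run's logs oldest-first and print only restart-looking lines
        logs = await client.read_logs(
            LogFilter(flow_run_id=LogFilterFlowRunId(any_=[FLOW_RUN_ID])),
            limit=2000,
            sort=LogSort.TIMESTAMP_ASC,
        )
        for log in logs:
            if any(k in log.message.lower() for k in KEYWORDS):
                print(log.timestamp, log.level, log.message)

asyncio.run(show_restart_hints())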
3) Check worker logs (most telling for crash recovery/resubmission)
- Identify the worker handling your pool/queue (work_pool_name='analysis-pool', work_queue_name='default')
- On that worker’s host/cluster logs, grep the run id:
# Example on the worker host
grep 068e92c0-b816-796f-8000-58ea45be7220 /path/to/worker/logs/*
- Look for messages indicating re-attachment/resubmission, infrastructure restarts, or missed heartbeats.
4) Check events/timeline
- In the UI timeline, look for events like “Worker claimed,” “Infrastructure started,” “Run crashed,” “Resubmitted,” “Resumed by <user>”
- CLI stream (you can filter with grep):
prefect events stream --format text | grep 068e92c0-b816-796f-8000-58ea45be7220
5) Optional: Query via Python to dump states and logs with messages
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import LogFilter, LogFilterFlowRunId
from prefect.client.schemas.sorting import LogSort

flow_run_id = "068e92c0-b816-796f-8000-58ea45be7220"

async def debug():
    async with get_client() as client:
        # Current run metadata
        fr = await client.read_flow_run(flow_run_id)
        print("run_count:", fr.run_count, "infrastructure_pid:", fr.infrastructure_pid)

        # Full state history, including the message attached to each transition
        states = await client.read_flow_run_states(flow_run_id)
        for s in states:
            print(s.timestamp, s.name, s.type, "|", s.message or "")

        # All logs for the run, oldest first
        logs = await client.read_logs(
            LogFilter(flow_run_id=LogFilterFlowRunId(any_=[flow_run_id])),
            limit=2000,
            sort=LogSort.TIMESTAMP_ASC,
        )
        for log in logs:
            print(log.timestamp, log.level, log.message)

asyncio.run(debug())
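Running that script prints the full state history (timestamp, name, type, message) followed by the run’s logs in order; those two lists are exactly what you’ll match against the patterns below.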
How to interpret what you find (a sketch after this list shows one way to spot these patterns from the state history)
- Running → Crashed → Scheduled/AwaitingRun → Running
- Infrastructure or process crash/restart; worker likely resubmitted. Check worker logs and your infra (e.g., container/pod restarts).
- Running → Paused → Running
- Manual resume or automation resume; the message usually says who/what resumed it.
- Running → Cancelling/Cancelled → Running
- A manual restart (UI/CLI) of the same run.
- Running → AwaitingRetry → Running
- Would indicate an internal retry; not your case since retries=0.
- Running → NotReady/MissingConcurrency → Running
- Resumed after a concurrency gate; depending on conditions, this can increment run_count. The state message will indicate waiting for a slot.
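To make that mapping concrete, here’s a small sketch that walks the state history and shows which state immediately preceded each entry into Running. It reuses read_flow_run_states from step 5; the function name is just illustrative, and treat it as a starting point rather than an exhaustive classifier.
import asyncio

from prefect.client.orchestration import get_client

FLOW_RUN_ID = "068e92c0-b816-796f-8000-58ea45be7220"

async def explain_run_count():
    async with get_client() as client:
        states = await client.read_flow_run_states(FLOW_RUN_ID)
        # Sort by timestamp before walking the transitions
        states = sorted(states, key=lambda s: s.timestamp)
        for prev, curr in zip(states, states[1:]):
            if curr.type.value == "RUNNING":
                # Each (re)entry into Running lines up with a run_count increment;
                # the preceding state (Crashed, Paused, Cancelled, AwaitingRetry, ...)
                # tells you which pattern above applies.
                print(
                    f"{curr.timestamp}: entered Running after {prev.name} "
                    f"({prev.type.value}) | {prev.message or ''}"
                )

asyncio.run(explain_run_count())
For the run above, the line printed just before the second Running entry is the transition that bumped run_count to 2.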
A few notes on your snippet
- The idempotency_key you see ties the run to its schedule slot; it doesn’t cause new runs, and it won’t explain run_count.
- infrastructure_pid changes can indicate a restart; you can often infer a restart if the pid in earlier logs differs from the current one.
If you want, paste the output of:
- prefect flow-run inspect 068e92c0-b816-796f-8000-58ea45be7220
- prefect flow-run logs 068e92c0-b816-796f-8000-58ea45be7220 --tail
and I can help pinpoint the exact transition that bumped run_count to 2.