    matta

    1 year ago
    Heya! So, I'm pulling from a database that goes down for ~75 minutes at random times. I set my tasks to have
    @task(max_retries=3, retry_delay=timedelta(minutes=30))
    but apparently Zombie Killer doesn't like that? Looking through the logs, I see
    No heartbeat detected from the remote task; marking the run as failed.
    then `Flow run is no longer in a running state; the current state is: <Failed: "Some reference tasks failed.">`, then
    Heartbeat process died with exit code -9
    then
    Failed to set task state with error: ClientError([{'message': 'State update failed for task run ID 43f52f19-fffb-4d16-8223-da4ffc5668b2: provided a running state but associated flow run 8c8fc810-eb3d-447c-ab70-76dd1dc2acaa is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 43f52f19-fffb-4d16-8223-da4ffc5668b2: provided a running state but associated flow run 8c8fc810-eb3d-447c-ab70-76dd1dc2acaa is not in a running state.'}}}],)
    Michael Adkins

    1 year ago
    Hi @matta! I think this looks like a bug; we'll have to take a look at the desired behavior for long retries like this. Can you check on your database as the first task of your flow and, if it is not ready, use the
    StartFlowRun
    task with a
    scheduled_start_time
    to kick off a new run in the future?
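
A minimal sketch of the "check first, reschedule if down" decision described above, in plain Python rather than Prefect code. The helper name `plan_start`, the `db_is_up` callable, and the 90-minute default delay are all hypothetical; the delay is only chosen to exceed the ~75-minute outage mentioned earlier.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def plan_start(
    db_is_up: Callable[[], bool],
    now: datetime,
    delay: timedelta = timedelta(minutes=90),  # assumption: longer than the ~75 min outage
) -> Optional[datetime]:
    """Return None if the pull can start now, else the time for a new run."""
    if db_is_up():
        return None
    return now + delay


# Example: database is down at 12:00 UTC, so a new run is planned for later.
now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
restart_at = plan_start(lambda: False, now)
```

In a flow, the first task could call something like this and, when it gets a datetime back, hand that value to `StartFlowRun` as its `scheduled_start_time` instead of continuing with the pull.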
    matta

    1 year ago
    Sometimes it goes down in the middle of the pull (the whole thing takes about 3 hours, I'm replicating a whole db). Buuut I guess I could do that within a flow maybe? Make a trigger like "more than 10% failed" downstream from that step, and then have it do
    StartFlowRun
    ?
    Okay, this is coming together in my head. Thanks!
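
The "more than 10% failed" condition can be sketched in plain Python as below. This is not Prefect's trigger API; the function name `failed_fraction_exceeds` and the list-of-booleans input are assumptions for illustration only. Prefect's `prefect.triggers` module has a `some_failed` trigger that may cover the same idea natively.

```python
from typing import Iterable


def failed_fraction_exceeds(results: Iterable[bool], threshold: float = 0.10) -> bool:
    """results holds True for each successful pull and False for each failed one."""
    outcomes = list(results)
    if not outcomes:
        # Nothing ran, so there is nothing to react to.
        return False
    failed = sum(1 for ok in outcomes if not ok)
    # Strictly "more than" the threshold, matching the wording above.
    return failed / len(outcomes) > threshold
```

A downstream step could use a check like this to decide whether to kick off the rescheduled run or let the flow finish normally.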
    ale

    1 year ago
    I guess there's an option to disable the heartbeat check. Not the best option, but maybe something worth considering in this case?
    Michael Adkins

    1 year ago
    We do think this is a possible bug on our end though, I'll post back in this thread if we identify an issue.
    matta

    1 year ago
    We're still using 0.13.19 btw
    Not sure if there might have been a fix since then.