    Theo Platt

    9 months ago
    Hi there. I think this is a simple question! I have some flows (generally many hours long) where the last task fails but the flow continues to run. The last task happens to be an AWSClientWait on a Batch job but I don't know if this is the reason or not. Or alternatively is there a way to fail a flow if any of the tasks fail? Thanks as always!
    Kevin Kho

    9 months ago
    Hey @Theo Platt a Flow should fail automatically if a terminal task fails. You can set this when you construct your flow with flow.terminal_tasks:
    with Flow(...) as flow:
       a = task_one()
       b = task_two()
    
    flow.terminal_tasks = [a,b]
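    A minimal sketch of the related reference-task mechanism, assuming Prefect 1.x: the flow run's final state is determined by the flow's reference tasks, which default to the terminal tasks, so they can also be pinned explicitly (the flow and task names below just reuse the example above).
    from prefect import Flow, task

    @task
    def task_one():
        return 1

    @task
    def task_two():
        return 2

    with Flow("example") as flow:
        a = task_one()
        b = task_two()

    # Reference tasks drive the flow run's final state; if any of them fail,
    # the flow run is marked Failed.
    flow.set_reference_tasks([a, b])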
    With AWSClientWait, I’ve seen issues with timeouts. There might be a default timeout there causing failures on the boto3 side (I’ve seen 12 hours). I have also seen issues when mapping a lot of these calls (over 100). Are you doing that?
    Oh that previous issue was also you lol
    I was wondering why I see so many AWS Batch questions 😆
    Theo Platt

    9 months ago
    Yes - a lot of those are me! 😉 Trying to pay it back with the solutions I find.
    In this case it was the very last, terminal task, but I still had to manually cancel the run in Prefect Cloud as it was still 'blue'.
    And here's the code for that final task. In this case it did actually time out after delay * max_attempts had passed, so the task failed but the flow still thought all was good.
    import prefect
    from prefect import task
    from prefect.tasks.aws.client_waiter import AWSClientWait

    @task
    def wait_batch(job_id, delay, max_attempts):

        logger = prefect.context.get('logger')

        logger.info(f"Waiting for job to complete: {job_id}")

        # Poll the AWS Batch job until it reaches a terminal state
        waiter = AWSClientWait(
            client='batch',
            waiter_name='JobComplete',
        )
        waiter.run(
            waiter_kwargs={
                'jobs': [job_id],
                'WaiterConfig': {
                    'Delay': delay,
                    'MaxAttempts': max_attempts
                }
            },
        )

        logger.info(f"Job complete: {job_id}")

        return job_id
    Kevin Kho

    9 months ago
    Are you on ECS by chance?
    Theo Platt

    9 months ago
    yes
    both the agent and the flow execution
    Kevin Kho

    9 months ago
    I think this is a similar issue. We opened an internal ticket for it. Do you get a heartbeat failure in the logs? Does your ECS container show it exited?
    Theo Platt

    9 months ago
    The flow logs look like this so the flow is still running -
    Kevin Kho

    9 months ago
    Ok thanks. Will add this to our internal issue. Do you get any message like this though on the ECS side?
    Theo Platt

    9 months ago
    Nope - the actual Batch job kept on running and eventually finished successfully about an hour after the timeout we'd imposed on the AWSClientWait
    Kevin Kho

    9 months ago
    But the batch is different compute than the ECS container right? So ECS shows everything went as normal?
    Theo Platt

    9 months ago
    Yes - that's right
    Everything seems to be working great except for the flow not failing when that last task fails
    Kevin Kho

    9 months ago
    Ok, it’s honestly hard for me to tell if this is expected or not. A failed task does not necessarily update the state of the Flow, so it looks like the ECS compute is still going even though the task is marked as failed and the FlowRunner has not exited yet. Part of this probably has to do with the fact that timeouts are best-effort and not guaranteed. It’s very hard to terminate an ongoing Python program, and it gets even harder when it’s happening on different machines (not the case here). It could be that the timeout here is simply failing to terminate the task, and the Flow just keeps going.
    Either way, I think the thing for me to do here is add your issue to our internal issue for the previously mentioned thread. It may be related.
    Theo Platt

    9 months ago
    Thanks as always @Kevin Kho! I'll keep looking my side too. It does look like the AWSClientWait fails ok and raises a FAIL. https://github.com/PrefectHQ/prefect/blob/d44b72a950ebda9f7bc6a9712fc71e2e9c680d25/src/prefect/tasks/aws/client_waiter.py#L118-L121
    Maybe, since I have this wrapped in my own task, I should be capturing the exception?
    @task
    def wait_batch(job_id, delay, max_attempts):

        logger = prefect.context.get('logger')

        logger.info(f"Waiting for job to complete: {job_id}")

        waiter = AWSClientWait(
            client='batch',
            waiter_name='JobComplete',
        )
        waiter.run(
            waiter_kwargs={
                'jobs': [job_id],
                'WaiterConfig': {
                    'Delay': delay,
                    'MaxAttempts': max_attempts
                }
            },
        )

        logger.info(f"Job complete: {job_id}")

        return job_id
    Kevin Kho

    9 months ago
    Oh I see. Yes, I suppose capturing that may give you more informative logs.
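    A minimal sketch of what that capture might look like, assuming Prefect 1.x, where AWSClientWait raises a prefect.engine.signals.FAIL on a waiter error (per the lines linked above): log the failure, then re-raise the signal so the task run is still marked Failed.
    import prefect
    from prefect import task
    from prefect.engine.signals import FAIL
    from prefect.tasks.aws.client_waiter import AWSClientWait

    @task
    def wait_batch(job_id, delay, max_attempts):
        logger = prefect.context.get('logger')
        logger.info(f"Waiting for job to complete: {job_id}")

        waiter = AWSClientWait(
            client='batch',
            waiter_name='JobComplete',
        )
        try:
            waiter.run(
                waiter_kwargs={
                    'jobs': [job_id],
                    'WaiterConfig': {
                        'Delay': delay,
                        'MaxAttempts': max_attempts,
                    },
                },
            )
        except FAIL as exc:
            # Surface the waiter error in the task logs, then re-raise the signal
            # so the task run (and, if this is a reference task, the flow run) fails.
            logger.error(f"Batch job {job_id} did not complete: {exc}")
            raise

        logger.info(f"Job complete: {job_id}")
        return job_id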