Job "late" for long time what can be the cause of... | Prefect Community #ask-community

08/15/2021, 2:15 AM

Job "late" for a long time. What can be the cause of:

No heartbeat detected from the remote task; retrying the run.This will be retry 1 of 2

I also have a flow that I started manually, but it does not start, even though I have no other flows running. In general, if I have a flow that I need to ensure runs at an exact time (no more than a few seconds off), is it good to use Prefect for this, or is it better to use a cron job?

08/15/2021, 2:30 AM

It is now 7 minutes behind. I tried to delete and re-register the flow, but this did not help. https://prefect.status.io/ does not show any issues.

08/15/2021, 5:17 PM

jobs still waiting and waiting ...

08/15/2021, 7:17 PM

I deleted the flow and registered it again. I tried to use "quick run" and to adjust the schedule so it would run by itself. The job just waits ...

08/16/2021, 3:38 AM

I cancelled the job after 6 hours of waiting...

Kevin Kho

08/16/2021, 6:39 AM

Hey @YD, will answer this tomorrow

08/16/2021, 6:49 AM

thanks

Kevin Kho

08/16/2021, 2:56 PM

Hey, on the second issue: whenever a flow is scheduled and not running, it is because of labels 99% of the time. Check these docs. Flows can only be picked up by an Agent with matching labels; the Agent labels need to be a superset of the Flow labels. What were the labels for your Flow and Agent when you tried this? Yes, schedules will fire on time if there is an agent to pick up the Flow.
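The superset rule described above can be sketched in plain Python. This is only an illustration of the rule as stated in the message, not Prefect's own matching code; the function name `agent_can_pick_up` and the label values are hypothetical.

```python
def agent_can_pick_up(flow_labels, agent_labels):
    """Return True when every flow label is also present on the agent."""
    return set(flow_labels).issubset(agent_labels)


# Hypothetical label sets, for illustration only.
# The agent has an extra label, which is fine: it is still a superset.
print(agent_can_pick_up(["aws"], ["aws", "gpu"]))
# The flow carries a label the agent lacks, so the run would stay scheduled.
print(agent_can_pick_up(["aws", "ip-10-0-0-1"], ["aws"]))
```

Comparing the two label lists shown in the UI this way is usually enough to tell whether any agent is eligible for a given flow.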

Dustin Ngo

08/16/2021, 6:03 PM

Hi @YD, regarding your question about heartbeats above, Prefect will start a heartbeat process along with a flow to tell the server that a flow is still running. If this process stops sending heartbeats, the server will assume that something has gone wrong (like an infrastructure failure) and will retry the run--this is intentional to prevent the flow from appearing as "running" in the UI forever. Sometimes though, the heartbeat process can be prematurely terminated by the system (for example, in the event of a memory issue), and if this happens consistently, please give us more information and we can see if something more is going on and try other solutions, such as disabling heartbeats completely.
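To make the heartbeat idea above concrete, here is a minimal conceptual sketch of a server-side check that flags runs whose heartbeats have stopped. It is not Prefect's implementation; the class name `HeartbeatMonitor`, the 60-second timeout, and the run ids are all assumptions chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per run and flags runs that went silent."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # run id -> time of the most recent heartbeat

    def beat(self, run_id, now=None):
        # Record a heartbeat; `now` can be injected to make the logic testable.
        self.last_seen[run_id] = time.monotonic() if now is None else now

    def stale_runs(self, now=None):
        # A run is stale when its last heartbeat is older than the timeout.
        now = time.monotonic() if now is None else now
        return [
            run_id
            for run_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        ]


monitor = HeartbeatMonitor(timeout_seconds=60)  # hypothetical timeout
monitor.beat("run-a", now=0)
monitor.beat("run-b", now=0)
monitor.beat("run-a", now=50)  # run-a keeps reporting, run-b goes silent
print(monitor.stale_runs(now=90))  # only run-b has exceeded the timeout
```

A run flagged this way is the kind that would be retried, which is why a heartbeat process killed by memory pressure can trigger a retry even though the task itself is still working.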

08/16/2021, 6:23 PM

The systems were an AWS c6g.2xlarge (running some ML code) and a small AWS Postgres. I did notice that the Postgres was very high on resource usage. I installed Postgres on the EC2, and its resources were not very highly utilized. I will check the labels tonight. The agent on the EC2 was communicating with Prefect Cloud.

08/16/2021, 7:17 PM

How can I check if it was a label issue? Are there any logs that I can look at?

Kevin Kho

08/16/2021, 7:23 PM

There will be no logs because nothing started. Check the flow labels here in the UI, and then check the labels of your agents and see if there is an agent capable of picking up that flow

08/16/2021, 7:34 PM

those flows ran before, without any labels

Kevin Kho

08/16/2021, 7:37 PM

Prefect adds default labels for local storage and local agent. What storage and agent are you using?

08/16/2021, 8:02 PM

I ran the agent without giving it a specific label, so it shows the IP address of my AWS machine

08/16/2021, 8:02 PM

I am running a local agent on the AWS

08/16/2021, 8:03 PM

but I can try and add labels

Kevin Kho

08/16/2021, 8:06 PM

You can turn it off by doing

prefect agent local start --no-hostname-label

08/16/2021, 8:09 PM

what I did was

prefect agent local start

Kevin Kho

08/16/2021, 8:10 PM

Yep the local agent has a default label. Where are you storing the flow?

08/16/2021, 8:12 PM

The flows are on an AWS EC2 machine (this is where the code is and where I run flow.register from). On that machine I added the Cloud key to the .prefect/config.toml file. Is this what you are asking?

Kevin Kho

08/16/2021, 8:13 PM

Ah, I’m asking about the flow's Storage class, like this. The default, if you don’t specify one, is to store it on the local machine.

08/16/2021, 8:20 PM

I just used the defaults, I did not specify and storage location

from prefect import task, Flow
from prefect.executors import LocalDaskExecutor
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock
import pendulum


@task  # task options were elided as (...) in the original post
def my_task():
    pass  # task body was not shown in the original post


def main():
    # Fire at 09:02 on weekdays, New York time
    start_date = pendulum.datetime(2019, 1, 1, tz="America/New_York")
    schedule = Schedule(clocks=[CronClock('2 9 * * mon-fri', start_date=start_date)])

    with Flow("my flow name", schedule=schedule, executor=LocalDaskExecutor()) as flow:
        my_task()

    # No storage or labels are given, so the defaults apply
    flow.register(project_name='my project name')


if __name__ == "__main__":
    main()

Kevin Kho

08/16/2021, 8:21 PM

Did you register this on the same VM as the agent?

08/16/2021, 8:22 PM

yes.. I use a single EC2 for this process

Kevin Kho

08/16/2021, 8:23 PM

The labels should have been the same by default. Do you see your agent in Prefect Cloud, and do both the flow and agent have the same label?

08/16/2021, 8:29 PM

Both have the same label, but when I click on "more" in the agent, I do not see any acknowledgement that this flow was submitted, even though I do see other flows that were submitted from the same machine in the same way.

Kevin Kho

08/16/2021, 8:35 PM

A bit confused. You don’t see your agent in the Agents screen, but other flows have been running successfully?

08/16/2021, 9:14 PM

I do see the agent in the agent screen, and both the agent and the flow have the same label. Some of the flows run without an issue, while some just hang and do not run at all.

Kevin Kho

08/16/2021, 9:36 PM

And what are the labels of the ones that don’t hang? Does the agent show logs that it picked it up?

08/16/2021, 11:38 PM

All have the same labels

Kevin Kho

08/16/2021, 11:46 PM

So confused, because we actually have people complaining that too many flows run on the agent, causing it to crash sometimes. As long as the agent can pick up the flow, it will, without worrying about the memory/CPU available.

08/16/2021, 11:55 PM

I have 4 flows

08/16/2021, 11:56 PM

is the "memory/cpu" on the prefect cloud side or the agent side ?

Kevin Kho

08/16/2021, 11:57 PM

The LocalAgent is the one responsible for executing the Flow. It executes the Flow as a local process. What happens when you click Quick Run? Do you just see a new yellow bar?

08/17/2021, 12:10 AM

yes... just the yellow

Kevin Kho

08/17/2021, 12:10 AM

can you click into the yellow and take a screen shot of the dashboard?

08/17/2021, 12:11 AM

I am running on the EC2 the command

nohup prefect agent local start --no-hostname-label > ~/tmp/prefect_agent.log &

then deleting all flows and re-registering them

Kevin Kho

08/17/2021, 12:12 AM

Wait, if you register them, they will get the default label (looking at your code above). So with the agent, I think you don’t want to use --no-hostname-label. Either way, just note the labels for the flows.
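The mismatch described above can be checked by listing which flow labels an agent lacks. The sketch below is plain Python rather than anything from the Prefect API, and it assumes, based on this thread, that the default label is the machine's hostname; the helper name `missing_labels` and the hostname `ip-10-0-0-1` are hypothetical.

```python
def missing_labels(flow_labels, agent_labels):
    """Return the flow labels the agent lacks, sorted for stable output."""
    return sorted(set(flow_labels) - set(agent_labels))


# Hypothetical values: the flow carries a default hostname label.
flow_labels = ["ip-10-0-0-1"]

# Agent started with --no-hostname-label: the hostname label is missing.
print(missing_labels(flow_labels, []))
# Agent started with its default hostname label: nothing is missing.
print(missing_labels(flow_labels, ["ip-10-0-0-1"]))
```

An empty result means the agent is eligible; any label in the result explains why the run never leaves the scheduled state.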

08/17/2021, 12:17 AM

Ok I'll do

nohup prefect agent local start -l aws > ~/tmp/prefect_agent.log &

and in the code use

flow.register(project_name='my flow name', labels=['aws'])

Kevin Kho

08/17/2021, 12:26 AM

yeah that looks good. just make sure also the labels show up the same in the UI

08/30/2021, 2:54 PM

I have the same issue again with a production pipeline

08/30/2021, 2:55 PM

how can I find the root cause? and how can I prevent this from happening ?

Kevin Kho

08/30/2021, 2:56 PM

This seems to show that the flows are not even being picked up by an agent. I think it would be a label issue where no agent is capable of picking up these flows?

08/30/2021, 2:57 PM

will try to run the agent…

08/30/2021, 3:01 PM

yes, the agent was not running… how can this happen ? I run it using

nohup prefect agent local start -l aws --agent-config-id <instance-id> > ~/tmp/prefect_agent.log &

08/30/2021, 3:01 PM

and have in the crontab

@reboot nohup prefect agent local start -l aws --agent-config-id <instance-id> > ~/tmp/prefect_agent.log &

08/30/2021, 3:02 PM

what is the best way to ensure that the agent always runs?

08/30/2021, 3:04 PM

I also have an automation set up for an alert but did not get one

Kevin Kho

08/30/2021, 3:10 PM

Isn’t crontab @reboot meant to run something when the machine reboots? I guess this won’t restart the process if it just dies but the machine doesn’t reboot. We have this section in our docs about using supervisor here to always run the agent. Did that agent config work previously for you?
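The difference between a one-time start at boot and a supervised process can be illustrated with a small restart loop. This is a sketch of the general idea only, not a replacement for supervisor and not part of Prefect; the function name `keep_alive` and its parameters are hypothetical, and the stand-in command simply exits immediately.

```python
import subprocess
import sys
import time


def keep_alive(command, max_restarts, delay_seconds=1.0):
    """Run `command`, restarting it each time it exits, up to `max_restarts` times.

    Returns the total number of times the command was started.
    """
    starts = 0
    while True:
        process = subprocess.Popen(command)
        starts += 1
        process.wait()  # blocks until the process exits for any reason
        if starts > max_restarts:
            return starts
        time.sleep(delay_seconds)  # brief pause before restarting


# Stand-in command that exits right away, so the restart logic is visible.
short_lived = [sys.executable, "-c", "pass"]
print(keep_alive(short_lived, max_restarts=2, delay_seconds=0))
```

In a real setup the command would be the agent start command from this thread (for example `prefect agent local start -l aws`), and a dedicated process manager such as supervisor would also handle logging and starting at boot.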

08/30/2021, 3:25 PM

The machine was not rebooted. I am not sure why it died, and also why I did not get an alert email.

Kevin Kho

08/30/2021, 3:26 PM

I wouldn’t know, but maybe you can try spinning up something with the config and then turning it down and seeing if you get an email?

08/30/2021, 3:29 PM

I’ll try today to generate an alert by killing the agent… to make sure that my set up is ok
