Job "late" for long time what can be the cause of...
# ask-community
y
Job "late" for long time what can be the cause of:
No heartbeat detected from the remote task; retrying the run.This will be retry 1 of 2
I also have a flow that I started manually, but it does not start, even though I have no other flows running. In general, if I have a flow that must run at an exact time (no more than a few seconds off), is Prefect a good fit for this, or is it better to use a cron job?
It is now 7 minutes behind. I tried to delete and re-register the flow, but this did not help. https://prefect.status.io/ does not show any issues.
jobs still waiting and waiting ...
I deleted the flow and registered it again. I tried "Quick Run", and I also adjusted the schedule so it would run by itself. The job just waits ...
I cancelled the job after 6 hours of waiting...
k
Hey @YD, will answer this tomorrow
y
thanks
k
Hey, on the second issue: whenever a flow is scheduled but not running, it is because of labels 99% of the time. Check the docs on labels. Flows can only be picked up by an Agent with matching labels: the Agent labels need to be a superset of the Flow labels. What were the labels for your Flow and Agent when you tried this? And yes, schedules will fire on time if there is an agent to pick up the Flow.
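The superset rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated in the thread, not Prefect's own matching code; the function name `agent_can_pick_up` and the hostname label `ip-10-0-0-1` are hypothetical.

```python
def agent_can_pick_up(flow_labels, agent_labels):
    """Return True when every flow label is also present on the agent.

    Illustrative only: models the "agent labels must be a superset of
    the flow labels" rule described in the thread.
    """
    return set(flow_labels) <= set(agent_labels)


# Agent carries extra labels: still a superset, so the flow can be picked up.
print(agent_can_pick_up(["aws"], ["aws", "gpu"]))

# Flow was registered with a hostname label (hypothetical value), but the
# agent was started without one: the agent is not a superset.
print(agent_can_pick_up(["ip-10-0-0-1"], []))
```

Comparing the two label sets shown in the UI by eye amounts to the same check.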
d
Hi @YD, regarding your question about heartbeats above: Prefect starts a heartbeat process along with a flow to tell the server that the flow is still running. If this process stops sending heartbeats, the server assumes that something has gone wrong (like an infrastructure failure) and retries the run. This is intentional, to prevent the flow from appearing as "running" in the UI forever. Sometimes, though, the heartbeat process can be prematurely terminated by the system (for example, in the event of a memory issue). If this happens consistently, please give us more information and we can see if something more is going on and try other solutions, such as disabling heartbeats completely.
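As a rough mental model of the behaviour described above, the decision the server makes can be sketched like this. It is a conceptual sketch only, not Prefect's implementation: the function names, the two-minute timeout, and the returned strings are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; the real timeout is not stated in the thread.
ASSUMED_TIMEOUT = timedelta(minutes=2)


def heartbeat_expired(last_heartbeat, now, timeout=ASSUMED_TIMEOUT):
    """True when no heartbeat has arrived within the timeout window."""
    return now - last_heartbeat > timeout


def next_action(last_heartbeat, now, retries_used, max_retries):
    """Decide what to do with a run, based on its last heartbeat."""
    if not heartbeat_expired(last_heartbeat, now):
        return "still running"
    if retries_used < max_retries:
        return f"retry {retries_used + 1} of {max_retries}"
    return "mark as failed"


now = datetime(2021, 1, 1, 12, 0, 0)
stale = now - timedelta(minutes=10)
print(next_action(stale, now, retries_used=0, max_retries=2))
```

With a stale heartbeat and no retries used yet, the sketch yields the same "retry 1 of 2" wording seen in the log message quoted at the top of the thread.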
y
The systems were an AWS c6g.2xlarge (running some ML code) and a small AWS Postgres instance. I did notice that the Postgres instance was very high on resource usage. I then installed Postgres on the EC2 machine, and its resources were not highly utilized. I will check the labels tonight. The agent on the EC2 machine was communicating with Prefect Cloud.
How can I check if it was a label issue? Are there any logs that I can look at?
k
There will be no logs because nothing started. Check the flow labels here in the UI, and then check the labels of your agents and see if there is an agent capable of picking up that flow
y
Those flows ran before, without any labels.
k
Prefect adds default labels for local storage and local agent. What storage and agent are you using?
y
I ran the agent without giving it a specific label, so it shows the IP address of my AWS machine
I am running a local agent on the AWS
but I can try and add labels
k
You can turn it off by doing
prefect agent local start --no-hostname-label
y
what I did was
prefect agent local start
k
Yep the local agent has a default label. Where are you storing the flow?
y
The flows are on the AWS EC2 machine (this is where the code is and where I run flow.register from). On that machine I added the Cloud key to the .prefect/config.toml file. Is this what you are asking?
k
Ah, I'm asking about the flow's Storage class. The default, if you don't specify one, is to store it on the local machine.
y
I just used the defaults. I did not specify any storage location.
from prefect import task, Flow
from prefect.executors import LocalDaskExecutor
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock
import pendulum

@task  # task options elided in the original message
def my_task():
    ...  # task body elided in the original message

def main():
    start_date = pendulum.datetime(2019, 1, 1, tz="America/New_York")
    schedule = Schedule(clocks=[CronClock('2 9 * * mon-fri', start_date=start_date)])

    with Flow("my flow name", schedule=schedule, executor=LocalDaskExecutor()) as flow:
        my_task()

    flow.register(project_name='my project name')

if __name__ == "__main__":
    main()
k
Did you register this on the same VM as the agent?
y
yes.. I use a single EC2 for this process
k
The labels should have been the same by default. Do you see your agent in Prefect Cloud, and do both the flow and the agent have the same label?
y
Both have the same label, but when I click on "more" in the agent, I do not see any acknowledgement that this flow was submitted, even though I do see other flows that were submitted from the same machine in the same way.
k
A bit confused. You don’t see your agent in the Agents screen, but other flows have been running successfully?
y
I do see the agent in the agent screen, and both the agent and the flow have the same label. Some of the flows run without an issue, while some just hang and do not run at all.
k
And what are the labels of the ones that don't hang? Does the agent show logs that it picked it up?
y
All have the same labels
k
I'm confused, because we actually have people complaining that too many flows run on the agent, causing it to crash sometimes. As long as the agent can pick up the flow, it will, without worrying about the memory/CPU available.
y
I have 4 flows
is the "memory/cpu" on the prefect cloud side or the agent side ?
k
The LocalAgent is the one responsible for executing the Flow. It executes the Flow as a local process. What happens when you click Quick Run? Do you just see a new yellow bar?
y
yes... just the yellow bar
k
can you click into the yellow and take a screen shot of the dashboard?
y
I am running on the EC2 the command
nohup prefect agent local start --no-hostname-label > ~/tmp/prefect_agent.log &
then deleting all flows and re-registering them
k
Wait, if you register them, they will get the default label (looking at your code above). So with the agent, I think you don't want to use --no-hostname-label. Either way, just note the labels for the flows.
y
Ok I'll do
nohup prefect agent local start -l aws > ~/tmp/prefect_agent.log &
and in the code use
flow.register(project_name='my flow name', labels=['aws'])
k
yeah that looks good. just make sure also the labels show up the same in the UI
y
I have the same issue again with a production pipeline
how can I find the root cause? and how can I prevent this from happening ?
k
This seems to show that the flows are not even being picked up by an agent. I think it would be a label issue where no agent is capable of picking up these flows?
y
will try to run the agent…
yes, the agent was not running… how can this happen ? I run it using
nohup prefect agent local start -l aws --agent-config-id <instance-id> > ~/tmp/prefect_agent.log &
and have in the crontab
@reboot nohup prefect agent local start -l aws --agent-config-id <instance-id> > ~/tmp/prefect_agent.log &
what is the best way to ensure that the agent always runs?
I also have an automation set up for an alert but did not get one
k
Isn't crontab @reboot meant to run something when the machine reboots? I guess it won't restart the process if it just dies but the machine doesn't restart. We have a section in our docs about using supervisor to always run the agent. Did that agent config work previously for you?
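To illustrate why @reboot alone is not enough, here is a minimal sketch of what a process supervisor does: it relaunches the command whenever it exits, not only at boot. This is not a replacement for supervisor; the function name `keep_running` and its parameters are hypothetical, and the agent command in the trailing comment is the one quoted earlier in the thread.

```python
import subprocess
import sys
import time


def keep_running(command, max_launches=3, delay_seconds=5.0):
    """Launch `command`, and relaunch it each time it exits.

    Stops after `max_launches` launches and returns how many were made.
    A real supervisor would also handle logging, backoff, and signals.
    """
    launches = 0
    while launches < max_launches:
        launches += 1
        exit_code = subprocess.call(command)  # blocks until the process exits
        print(f"launch {launches}: exited with code {exit_code}", file=sys.stderr)
        if launches < max_launches:
            time.sleep(delay_seconds)  # brief pause before restarting
    return launches


# Example call (not executed here), using the agent command from the thread:
# keep_running(["prefect", "agent", "local", "start", "-l", "aws"], max_launches=100)
```

A cron @reboot entry covers only the first launch; the loop above is what restarts a process that dies while the machine stays up.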
y
The machine was not rebooted. I am not sure why the agent died, or why I did not get an alert email.
k
I wouldn’t know, but maybe you can try spinning up something with the config and then turning it down and seeing if you get an email?
y
I’ll try today to generate an alert by killing the agent… to make sure that my set up is ok