# ask-community
v
hi, is there a quick way to check the health and status of the lazarus process/service? perhaps a REST endpoint we could query or a command line utility we can invoke?
j
I assume you're asking about a health check for that service when running your own version of Prefect Server (not one for pinging cloud)?
v
yes @Jim Crist-Harif that is correct!
j
Sure, I'll open an issue
v
excellent thanks 👍
j
@Marvin open "Ensure health check available for all services (where it makes sense)" in server
v
@Jim Crist-Harif in the meantime is there anything we can do to sort of "manually" figure out if everything is running fine?
as of right now, some of our flows are having tasks stuck in the pending state forever
j
I don't believe any check is currently done for pending tasks (pending only means the task id is created, not that it ever started), but zombie running tasks should be reaped periodically in a healthy system.
v
although the flow still shows as running in the Prefect Server UI, and the agent at least is able to talk to GraphQL fine. i'm trying to work with our ops team to figure out if anything is off with our lazarus process, so it would be helpful to know if there are any specific steps for them to take to figure this out. thank you!
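Until a dedicated health check exists, one stopgap an ops team could use is a plain HTTP liveness probe against whichever Server components do serve HTTP. The sketch below uses only the Python standard library. The URL is a placeholder assumption, not a documented Prefect Server route, and a successful probe says nothing about Lazarus itself, which does not serve HTTP as far as this thread indicates.

```python
import sys
import urllib.error
import urllib.request

# Placeholder URL (assumption): point this at an HTTP endpoint your deployment exposes.
DEFAULT_URL = "http://localhost:4200/"

# Ignore http(s)_proxy env vars: the probe targets an internal service directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str = DEFAULT_URL, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection refused, DNS failure, timeouts,
        # non-2xx responses (HTTPError) and malformed URLs.
        return False


if __name__ == "__main__":
    # Exit code 0 = reachable, 1 = not reachable, for cron or container checks.
    sys.exit(0 if is_reachable() else 1)
```

Because it exits non-zero on failure, a script like this can be dropped into cron or a container liveness check without extra glue.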
j
cc @Zanie, our resident server expert ^^
v
i see. i thought lazarus would check that "hey this flow shows as running but its tasks have been stuck as pending for a while, let me do something"... but i guess that's not what it does
j
No, as long as the main flow run process is still healthy the flow is marked as active. What executor are you using?
We're working on some features that would let you detect the above though, not released yet.
z
(I do not believe those features will be available in Server though)
You could probably write a loop service that checks for the case you’re interested in if this is a significant issue for you.
v
@Jim Crist-Harif we actually have our own executor we've implemented to run tasks on our HPC
@Zanie i think this might be an issue for us, we could write the GQL query to detect the issue but how would we get prefect to handle it?
z
You’d have to write and run a service like Lazarus
although it seems likely this is a bug in your executor?
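Along the lines of the loop service suggested here, a minimal polling sketch. The GraphQL field names (`flow_run`, `task_runs`, `state`, `updated`) are an assumption about the Hasura-backed schema Prefect Server exposes and should be checked against your own server. `fetch` and `on_stuck` are hypothetical callables you would supply: for example, POST the query to your GraphQL endpoint and return `data["flow_run"]`, then send an alert or take whatever corrective action you decide on. This is not how Lazarus itself is implemented.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional

# ASSUMPTION: field names follow Prefect Server's Hasura-style schema; verify first.
STUCK_TASKS_QUERY = """
query {
  flow_run(where: {state: {_eq: "Running"}}) {
    id
    task_runs(where: {state: {_eq: "Pending"}}) {
      id
      updated
    }
  }
}
"""


def find_stuck_flow_runs(
    flow_runs: Iterable[dict], now: datetime, threshold: timedelta
) -> List[str]:
    """Return ids of running flow runs with a task run Pending longer than `threshold`.

    `now` must be timezone-aware so it can be compared with the parsed timestamps.
    """
    stuck = []
    for fr in flow_runs:
        for tr in fr.get("task_runs", []):
            # Normalize a trailing "Z" so fromisoformat works on Python < 3.11.
            updated = datetime.fromisoformat(tr["updated"].replace("Z", "+00:00"))
            if now - updated > threshold:
                stuck.append(fr["id"])
                break  # one stuck task is enough to flag this flow run
    return stuck


def run_loop(
    fetch: Callable[[], list],
    on_stuck: Callable[[str], None],
    threshold: timedelta,
    interval: float = 60.0,
    max_iterations: Optional[int] = None,
) -> None:
    """Poll `fetch()` every `interval` seconds; call `on_stuck(id)` for each hit."""
    i = 0
    while max_iterations is None or i < max_iterations:
        now = datetime.now(timezone.utc)
        for flow_run_id in find_stuck_flow_runs(fetch(), now, threshold):
            on_stuck(flow_run_id)  # hypothetical hook: alert, log, or intervene
        i += 1
        if max_iterations is None or i < max_iterations:
            time.sleep(interval)
```

Keeping the detection logic in a pure function makes it easy to test against canned query results before pointing the loop at a live server.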
v
it could be a bug, i just don't know where to look though, because when we run directly on our HPC using `flow.run()` everything runs fine
as i mentioned in a thread yesterday, in this particular case, what's happening is that:
1. the flow is registered to the server
2. an agent is brought up on our HPC
3. the flow is triggered to run from the prefect server UI
4. the "top level" tasks (parameters in the case of this flow) execute and return quickly, but then
5. the "next" level of tasks in the DAG remain stuck in the pending state

the difference between `flow.run` vs the agent picking up a flow run is that `flow.run` runs a loop over the tasks of the flow until `flow.is_finished`, whereas when the agent deploys a flow, it uses the flow runner's `run()`, which only submits the flow's tasks a single time
so something somewhere i think is off, i could also be totally misunderstanding 😛
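A toy model (not Prefect's actual FlowRunner or executor code) of why submitting every task a single time can still be enough: if the runner hands each task its upstream *futures* and the executor waits on those before running the task, every task is submitted once yet still runs in dependency order. `ToyExecutor` and `run_once` are hypothetical names built on `concurrent.futures`. In this model, an executor whose futures never resolve, or that never waits on upstream futures, would leave downstream work unfinished, which would look much like tasks stuck in Pending.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class ToyExecutor:
    """Hypothetical executor: resolves upstream futures before running a task."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, fn: Callable, upstream: List[Future]) -> Future:
        def call():
            # Block until every upstream result is ready, then run this task.
            args = [f.result() for f in upstream]
            return fn(*args)

        return self._pool.submit(call)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def run_once(
    tasks: Dict[str, Callable],
    deps: Dict[str, List[str]],
    order: List[str],
    executor: ToyExecutor,
) -> Dict[str, object]:
    """Submit every task exactly once, in topological `order`, then wait for all.

    Submitting in topological order means a worker blocked on an upstream
    future never waits on a task that is still sitting in the pool's queue.
    """
    futures: Dict[str, Future] = {}
    for name in order:
        futures[name] = executor.submit(tasks[name], [futures[d] for d in deps.get(name, [])])
    # The single pass only *submits*; results come from waiting on the futures.
    return {name: fut.result() for name, fut in futures.items()}
```

The point of the model is that "submitted once" and "finished" are different things: completion depends on the executor actually resolving the futures it hands back.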
j
If the flow is still reporting as active, this sounds like an issue with your custom executor (which isn't a supported feature of prefect). If you can reproduce the issue with a supported builtin executor we could investigate further.
v
i see, thanks. any pointers about where we should look in our custom exec implementation?
j
Nope. I'd check whether you have any tasks pending or any locking going on internally. Figuring out what the flow runner is waiting on would be useful here.
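One way to see what the flow runner is waiting on is to dump the stack of every thread in the process while it is stuck. A minimal sketch using only the standard library; `sys._current_frames` is a CPython implementation detail, and where you call the dump from (a timer, a debug hook in your agent or flow-run process) is left to you.

```python
import faulthandler
import signal
import sys
import threading
import traceback


def format_all_thread_stacks() -> str:
    """Return the current stack of every live thread as text (CPython-specific)."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append(f"--- thread {names.get(ident, '?')} (ident={ident}) ---\n")
        parts.append("".join(traceback.format_stack(frame)))
    return "".join(parts)


def enable_sigusr1_dump() -> bool:
    """On Unix, make `kill -USR1 <pid>` print all thread stacks to stderr."""
    if not hasattr(signal, "SIGUSR1"):
        return False  # e.g. Windows has no SIGUSR1
    faulthandler.register(signal.SIGUSR1, all_threads=True)
    return True
```

The signal-based variant is handy when the process is already hung, since it needs no code path to run in the stuck process beyond the registration done at startup.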
v
so, curiously the flow runner does run through all the tasks of the flow, but it's only doing this once as far as i can tell...