# prefect-community
a
Hey there, I have a problem with prefect server. My setup used to work but now there might be something wrong with the version or something. I’m getting this error in the UI when running a flow:
[3:18pm]: Exception raised while calling state handlers: HTTPError('400 Client Error: Bad Request for url: <http://localhost:4200/graphql/alpha>')
Any idea on how to start debugging?
n
Hi @Avi A - sounds like your agent might not be able to communicate with the server; where do you have Prefect Server deployed?
a
I have it deployed on one server, and the agent is on the other. On the agent, I have set up an SSH tunnel between them and `prefect agent start` works and communicates with the server. It then receives the job but ends up like this. On the agent itself it says nothing
n
When you say the agent itself says nothing, you mean you're not getting any output from the agent when you start it?
a
it says this:
[2020-06-03 12:15:36,746] INFO - agent | Waiting for flow runs...
[2020-06-03 12:18:33,032] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-06-03 12:18:33,066] INFO - agent | Deploying flow run 679fe6ae-b245-47a0-97c7-a413468096ef
but nothing further
it actually does so for every flow… is there a way to see logs on the agent end?
n
@Avi A you can start your agent with a `--verbose` flag to get some more output. This is an interesting problem I haven't seen before
j
@Avi A is that original message at [3:18] something you’re seeing in the logs in the UI?
Just trying to suss out where the communication break is happening - if so, sounds like your agent can pull work and runs can communicate logs back to the server but your state handler is having an issue
a
yes, here’s an example screenshot
j
And are you using any custom state handlers or just the automatic ones for setting states?
a
at the moment I can’t reproduce the issue b/c the communication is off entirely (task stuck on `Submitted for execution`)… will update
I’m not customizing the state handlers
AFAIK
j
Ok
a
ohhh wait. I have set the Slack state handler. It used to work fine, but maybe that’s causing the problem? Where should the WEBHOOK secret be defined? On the agent or on the server?
j
I think it’ll need to be on the agent, since the flows it launches will inherit its config
a
any idea on how to get the webhook URL secret for the app after installing? or do I have to reinstall and get a new URL?
I’m talking about the slack app. I somehow lost the URL
j
😬 I’m really sorry, I don’t know the answer to that; maybe someone else will
a
nm, I’ll just reinstall the app
btw the URL can’t be restored. It says so upon installation 🙂
Please store it securely. If you lose your webhook URL, you will need to uninstall and reinstall the Prefect slack integration!
@Jeremiah now I’m using `--verbose` and getting this on the agent when submitting the job:
[2020-06-03 13:55:02,998] DEBUG - agent | Querying for flow runs
[2020-06-03 13:55:03,064] DEBUG - agent | Found flow runs ['14e59135-b8fb-4d35-9140-d5eda3a709aa']
[2020-06-03 13:55:03,065] DEBUG - agent | Querying flow run metadata
[2020-06-03 13:55:03,098] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-06-03 13:55:03,099] DEBUG - agent | Updating states for flow run 14e59135-b8fb-4d35-9140-d5eda3a709aa
[2020-06-03 13:55:03,101] DEBUG - agent | Flow run 14e59135-b8fb-4d35-9140-d5eda3a709aa is in a Scheduled state, updating to Submitted
[2020-06-03 13:55:03,103] DEBUG - agent | Next query for flow runs in 0.25 seconds
[2020-06-03 13:55:03,137] INFO - agent | Deploying flow run 14e59135-b8fb-4d35-9140-d5eda3a709aa
[2020-06-03 13:55:03,141] DEBUG - agent | Submitted flow run 14e59135-b8fb-4d35-9140-d5eda3a709aa to process PID 26768
[2020-06-03 13:55:03,160] DEBUG - agent | Completed flow run submission (id: 14e59135-b8fb-4d35-9140-d5eda3a709aa)
On the UI (different server) it still says `Submitted for execution`
j
Ok, so far that’s all expected. My guess is that when the flow spins up and attempts to enter a `Running` state, the state handler fails due to the Slack config and prevents progress. This is an issue we’ve seen in the past - since the state handler is responsible for communicating state updates back to the server, including errors, errors in the state handler itself get tricky to handle
a
I’ve set the secret now, shouldn’t it work?
I’ll disable it for now, see what happens
I’ve disabled slack and getting the same behavior
I’m using `LocalDaskExecutor` as a remote environment, perhaps that’s related. But it worked fine before. I mean, it failed before, but it reported the logs ok to the UI
n
@Avi A this is looking like it's probably something unique to your setup with the SSH tunnel, would you mind putting together a minimum code repro and opening a GitHub issue?
a
I don’t think it’s related to the tunnel. I’ve set it up before and it worked perfectly (I’m setting a tunnel on ports 8080 and 4200). In any case, I’ve changed the config and am using it without the SSH tunnel now. In `~/.prefect/config.toml` I put
[server]
host = "http://prefect"
Since the agent connects well to the server and receives the job submission, I figured this was a good setup. Is there anything else to configure for this to work?
j
I think you are encountering a different issue now. Originally, your flow was able to communicate with the server, at least enough to set a failed state when the Slack handler failed. Now it sounds like it can’t set states at all, which means either the flow isn’t starting, or once the flow gets to the dask cluster it’s unable to communicate with the server. In both cases the most likely culprit is an inability to reach the API. It’s going to be very hard for us to diagnose a communication issue from here, so I suggest opening a GitHub issue with a description of the environment for visibility and testing.
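One way to narrow down an "inability to reach the API" is to send a trivial GraphQL request from the machine where the flow actually runs. The snippet below is only a sketch using the Python standard library, not anything from Prefect itself: the helper names `build_graphql_request` and `check_api` are made up for illustration, and the `{ __typename }` query is a generic GraphQL probe that assumes the endpoint accepts standard introspection.

```python
import json
import urllib.error
import urllib.request


def build_graphql_request(url, query="{ __typename }"):
    """Build a POST request carrying a minimal GraphQL query (hypothetical helper)."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_api(url, timeout=5.0):
    """Return (ok, detail) describing whether the endpoint answered (hypothetical helper)."""
    try:
        with urllib.request.urlopen(build_graphql_request(url), timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server was reached but rejected the request (e.g. 400 Bad Request).
        return False, f"HTTP {exc.code}: {exc.reason}"
    except OSError as exc:
        # Connection refused, DNS failure, timeout, and similar network errors.
        return False, f"unreachable: {exc}"


# Example call, using the URL from the error message earlier in this thread;
# adjust host and port for your own setup:
# print(check_api("http://localhost:4200/graphql/alpha"))
```

An "unreachable" result points at networking (tunnel, hostname, port), while an HTTP error code means the server was reached and the problem is in the request itself.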
a
I hear you, but I still think that opening an issue would be a waste of time for you guys. How could you reproduce the issue with this amount of info? btw I also changed the environment back to the default so it’s probably not an issue with Dask
n
@Avi A on GitHub you’ll have access to a broader support base for the issue; this is equally difficult to diagnose here. It seems that your configuration is set correctly given that your setup was working correctly previously.
a
ok I’ve now run the agent with the flag `-f / --show-flow-logs` to output the logs from the run itself, and indeed it reports problems with the state:
[2020-06-03 15:06:48] INFO - prefect.CloudFlowRunner | Beginning Flow run for 'Smart Groups'
[2020-06-03 15:06:49] DEBUG - prefect.CloudFlowRunner | Failed to retrieve flow state with error: RegistryError("Multiple classes with name 'StateSchema' were found. Please use the full, module-qualified path.")
@nicholas! I know why I was getting these errors! Our internal piece of code was also registering an object named `State` under the name `StateSchema`, and it caused a collision with Prefect’s `StateSchema`. Question is… what can I do with it?
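For readers who have not hit this kind of collision before, here is a rough model of how it arises. This is a toy registry written for illustration, not the actual code of Prefect or of its serialization library; `ClassRegistry` and the module names `prefect_like.state` and `internal_app.models` are invented. It only assumes what the error message suggests: classes are looked up by short name, and the lookup fails when two classes share that name.

```python
class RegistryError(Exception):
    """Raised when a name lookup is missing or ambiguous."""


class ClassRegistry:
    """Toy name-keyed class registry (a simplified model, not a real library's code)."""

    def __init__(self):
        self._classes = {}

    def register(self, cls):
        short_name = cls.__name__
        full_path = f"{cls.__module__}.{short_name}"
        # The short name can be shared by classes from different modules.
        bucket = self._classes.setdefault(short_name, [])
        if cls not in bucket:
            bucket.append(cls)
        # The module-qualified path stays unambiguous.
        self._classes[full_path] = [cls]

    def get(self, name):
        found = self._classes.get(name, [])
        if not found:
            raise RegistryError(f"Class with name {name!r} was not found.")
        if len(found) > 1:
            raise RegistryError(
                f"Multiple classes with name {name!r} were found. "
                "Please use the full, module-qualified path."
            )
        return found[0]


# Two unrelated classes that happen to share the name "StateSchema".
prefect_schema = type("StateSchema", (), {"__module__": "prefect_like.state"})
internal_schema = type("StateSchema", (), {"__module__": "internal_app.models"})

registry = ClassRegistry()
registry.register(prefect_schema)
registry.register(internal_schema)

try:
    registry.get("StateSchema")  # ambiguous short name
except RegistryError as exc:
    print(exc)

# The module-qualified path still resolves to exactly one class.
assert registry.get("prefect_like.state.StateSchema") is prefect_schema
```

In this model there are two ways out: rename one of the colliding classes, or have the lookup use the module-qualified path (or the class object itself) instead of the bare name.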
j
@Avi A thanks so much for following up on this - I’ve got a growing list of reasons that Prefect serialization is getting “too complex” and I’ll add this to the list. I just opened a PR that might solve your issue, would you mind giving it a look? https://github.com/PrefectHQ/prefect/pull/2738 These are the only two places that the StateSchema appears to be referenced as a class-name reference.
a
Thanks a lot, @Jeremiah. I installed from that branch and it works for my case!
j
Glad to hear it!