# prefect-community
n
Hi. I'd really appreciate some help on connecting to a remote k8s cluster running in azure. I can execute flows in the cluster when using
flow.run(...)
but when I
flow.register(...
) and then run it from the UI, I get errors in the dask workers in k8s saying that cannot connect to
localhost:4200
. which makes sense since that localhost:4200 has to be replaced by the IP where the prefect server is running. how to do that? (https://github.com/PrefectHQ/prefect/issues/3185)
j
Hi @Nuno Silva you can directly set an API endpoint for your flow runs that the agent deploys with the
--api
flag
n
Hi @josh. Sounds good, can you elaborate a bit more, where and how can I set that flag?
prefect agent start --api <server_ip>
?
j
Exactly! When you call either
prefect agent start
or
prefect agent install
you can say
--api <server_ip>
n
ok, still having some issues:
prefect agent start --api http://server_ip:4200/graphql
I get these logs from the agent:
[2020-08-19 15:08:50,895] INFO - agent | Agent connecting to the Prefect API at http://vm-python-server1.westeurope.cloudapp.azure.com:4200/graphql
[2020-08-19 15:08:50,902] INFO - agent | Waiting for flow runs...
[2020-08-19 15:08:58,496] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-08-19 15:08:58,529] INFO - agent | Deploying flow run 3b06a5ef-5970-4fa0-8744-313cc66cdf93
[2020-08-19 15:09:34,368] INFO - agent | Process PID 22615 returned non-zero exit code
so the flow is found in the UI, but nothing runs in the dask cluster; at some point the agent gives this error, and the UI still shows nothing as failed
j
Can you run the agent with the
--show-flow-logs
flag to see what the logs from the process are showing?
n
ok found it: HTTPConnectionPool(host=server_ip, port=4200): Max retries exceeded with url: /graphql/graphql
I'm running the server in a VM in azure
the agent is also running in the VM in azure
it looks like the agent cannot resolve the ip, which is its own
j
Not sure if this is it but are you appending
/graphql
to the
--api
endpoint you are passing in? It looks like it’s attempting to talk to
/graphql/graphql
(the client automatically adds it)
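[Editor's note: the double-append can be sketched in a few lines. This is a hypothetical illustration of the fix, not Prefect's actual client code — the point is simply that the client adds the suffix itself, so the user-supplied endpoint should not already end with it.]

```python
def normalize_api_endpoint(endpoint: str) -> str:
    """Return the full GraphQL URL, avoiding a double "/graphql" suffix.

    Hypothetical sketch: the client appends "/graphql" on its own,
    so strip a trailing "/graphql" from user input first.
    """
    endpoint = endpoint.rstrip("/")
    if endpoint.endswith("/graphql"):
        endpoint = endpoint[: -len("/graphql")]
    return endpoint + "/graphql"
```

Passing either form then resolves to the same single-suffix URL.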
n
shall I try 1)
--api server_ip
or 2)
--api server_ip:4200
?
j
Do it with the
4200
port
n
same error: timeout, cannot connect
j
Do you know if your dask cluster is permissioned to access that endpoint?
To verify, you could try reregistering your flow without the executor in the
LocalEnvironment
and see what happens
n
explicitly it's not, they're both in azure in the same resource-group. when I run from that server/endpoint the same flow with
flow.run
it works
that server communicates with the k8s cluster and back, no probs
but with the UI, the explicit calls to localhost:4200 are failing
j
flow.run
doesn’t make any calls to the server so I would expect that to succeed. Could you try the reregister w/o the executor?
n
yes
register without the executor should work but the agent still gives the same error when I set
--api
j
Oh interesting. On the same instance where you are starting the agent could you test this snippet:
from prefect import Client

# add the /graphql this time
c = Client(api_server="<endpoint>/graphql")

c.graphql("{hello}")
and see if it raises an error
Actually that wouldn’t matter because you are able to start the agent somehow 🤔
n
ok, so if I run that from inside the server, it fails, from outside the server it works
j
By inside the server do you mean in a pod/container/etc. where the prefect server is running? Because agents should be started externally to the prefect server
n
I have a vm in azure (ubuntu server), I installed conda and then prefect. I
prefect server start
and then
prefect agent start
. the dask cluster (executor) is a dask-kubernetes cluster that I deployed myself in azure aks
when I'm just using the
LocalEnvironment
without an executor, having the prefect server and agent inside the same machine works fine, the agent finds the flows and runs them
j
When getting a success using the LocalEnvironment without an executor are you still setting the
--api
or no?
n
if i set it, it fails, if i dont it works
I can have the prefect server in a VM in azure and then from my pc run the agent pointing to the api in that VM running the server?
so your example
c = Client(api_server="<endpoint>/graphql")
works from my local machine. when I run that from a python terminal in the VM it doesn't work
j
Okay that makes sense. So what appears to be happening is that your agent wants to communicate with the prefect server on localhost:4200 since it’s on the same machine, but dask would need to communicate over that external endpoint. What I would recommend is trying one of these options:
• Deploy your agent somewhere where it can talk to that external endpoint via the
--api
flag and then it will be propagated down to the workers
• Keep your agent on that instance without setting
--api
and instead on your dask cluster somehow set the env var
PREFECT__CLOUD__API=<endpoint>
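[Editor's note: one way to set that env var on dask-kubernetes workers is through the worker pod template. The manifest below is an assumption — the thread never shows the cluster's manifests — so the names, image, and endpoint are illustrative placeholders; only the env entry is the point.]

```yaml
# worker-spec.yaml — hypothetical dask-kubernetes worker pod template
kind: Pod
metadata:
  labels:
    app: dask-worker
spec:
  containers:
    - name: dask-worker
      image: daskdev/dask:latest
      args: ["dask-worker", "--nthreads", "1"]
      env:
        # double-underscore form of the Prefect config env var
        - name: PREFECT__CLOUD__API
          value: http://<server-ip>:4200
```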
n
cool, let me work on that. Thanks a lot @josh
j
Yeah you could try running the agent on your individual machine with the api set and see if that results in a success
n
which ports does the VM need to have open? right now I have open 8080 and 4200
j
Yep those are the only ones the deployment exposes 👍
n
quick update: if on my local machine I have no prefect server running and I just do
prefect agent start --api server_ip:4200
it fails immediately with
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=4200)
so it looks like the agent by default always goes first to localhost. I can change the host in config.toml
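[Editor's note: for reference, the config.toml override would look roughly like this. The key name follows the standard Prefect env-var mapping (PREFECT__CLOUD__API ↔ [cloud] api); the exact layout is an assumption about the Prefect 0.x config file, and <server-ip> is a placeholder.]

```toml
# ~/.prefect/config.toml — assumed Prefect 0.x key names
[cloud]
api = "http://<server-ip>:4200"
```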
I'll try to dask env variable
j
Oh that’s weird I think that may be a bug
n
that would also explain why when running the agent in the VM where the server is running, even with the
--api
flag it only fails when the flow is scheduled, since only then it tries to use it
j
Yeah it would!
Looking into it
About to submit a PR but if you want you could try one thing: set the
export
PREFECT__CLOUD__API=<endpoint>
in the same process on your VM before you start the agent and see if it resolves anything
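[Editor's note: concretely, something like the following in the same shell session. The double-underscore spelling is the standard Prefect config mapping, and <server-ip> is a placeholder.]

```shell
# run on the VM, in the same shell that will start the agent
export PREFECT__CLOUD__API=http://<server-ip>:4200
prefect agent start
```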
PR here: https://github.com/PrefectHQ/prefect/pull/3186 feel free to also test off of this branch or master once merged 🙂
n
it solves the issue of the agent not starting, now it starts if I set the env var before
still it doesn't schedule the flow: meaning, the agent in my local pc doesn't run the flow in the server in the VM in azure
and the agent in the local machine does actually connect to it because if I change the port from 4200 to 4199 it doesn't start
j
The agent might not be running the flow due to the labels on the flow’s environment and on the agent. You should be able to see the labels on your flow in the UI and the agent’s labels at start.
Flows with the default local storage won’t be able to be picked up by agents running on other machines because they are stored in the registration machine’s filesystem
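[Editor's note: the label rule can be sketched as a simple subset check. This is a simplified illustration of the matching behavior described above, not Prefect's actual implementation.]

```python
def agent_can_pick_up(flow_labels, agent_labels):
    # An agent only submits flow runs whose labels are all present
    # among its own labels. With default Local storage, the flow is
    # labeled with the registering machine's hostname, so an agent
    # on a different machine never matches.
    return set(flow_labels) <= set(agent_labels)
```

Here that explains the mismatch: the flow carries the local PC's hostname as a label, while the agent on the VM advertises the VM's hostname.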
n
the labels are indeed different
one is my local pc name, the other is the vm name
I can switch to docker.storage, thanks for the pointers. I'll keep digging
j
Anytime! There’s also other cloud-based storage options as well 🙂 https://docs.prefect.io/orchestration/execution/storage_options.html
n
I was trying the azure storage and, good news, it finds the flow immediately, but I get errors in the agent:
AttributeError: 'NoneType' object has no attribute 'rstrip'
in
python3.8/site-packages/azure/storage/blob/_shared/base_client.py", line 349, in parse_connection_str
which azure.storage.blob version does prefect expect?
btw, setting
PREFECT__CLOUD__API
in dask workers in kubernetes doesn't work
j
setup.py has the requirement as
"azure-storage-blob >= 12.1.0, < 13.0"
, I don’t know too much about this storage integration so if there are some updates it needs due to azure API changes we welcome the contribution haha
n
dask's adlfs also struggles a lot with the breaking changes in azure.storage, so that is normal at the moment I guess. I'll contribute when I fix it for sure
from the error, it looks like the agent on my local machine doesn't get the connection string that was set in the flow running on the server
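[Editor's note: that diagnosis fits the traceback — parse_connection_str calls .rstrip() on a connection string that is None, i.e. one that was never set in the runtime environment. A quick check along these lines can confirm it; the helper is hypothetical, and the env var name is the standard Azure one, assumed here to be what the storage integration reads.]

```python
import os


def require_azure_connection_string():
    # The machine that *runs* the flow needs the connection string too,
    # not just the machine that registered it.
    conn = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
    if conn is None:
        raise RuntimeError(
            "AZURE_STORAGE_CONNECTION_STRING is not set; azure.storage.blob "
            "would fail with \"'NoneType' object has no attribute 'rstrip'\""
        )
    return conn
```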
d
Were you able to pinpoint where you got the error from?
AttributeError: 'NoneType' object has no attribute 'rstrip'