# prefect-server
r
Hi all, we're testing out running prefect server with an ECS Agent and Docker Containers for flow storage. Everything is working great so long as we use the LocalExecutor or LocalDaskExecutor for our flows. However, when we try to use the DaskExecutor to launch a temporary Fargate dask cluster, the worker nodes on that temporary dask cluster start complaining about not being authenticated with the cloud service. Any ideas about what is going wrong here? Has anyone else attempted this kind of setup before?
Full error message located in the dask worker task cloudwatch log is:
ERROR - prefect.CloudTaskRunner | Failed to retrieve task state with error: AuthorizationError('Malformed response received from Cloud - please ensure that you are authenticated. See prefect auth login --help.')
prefect.exceptions.AuthorizationError: Malformed response received from Cloud - please ensure that you are authenticated. See prefect auth login --help.
k
Hi @Ryan Smith, are you using Prefect Server or Cloud, and what version of Prefect are you on?
r
Hi @Kevin Kho. Using Prefect Server, with version 0.15.0 of the docker containers. The rest of the setup is working fine; this seems like the error I hit earlier where I forgot to run
prefect backend server
, but I'm not really sure where I should be looking for that since DaskExecutor takes care of spinning itself up for the most part. Thinking maybe I need to use a custom Docker image as the base image when I'm building the flow storage (currently just referencing
prefecthq/prefect:0.15.0
), and then I could "bake" in a call to
prefect backend server
in there? Let me know if this feels like the right track or if there might be an easier way.
k
Yes, you are on the right track: the workers are being automatically configured to point to Cloud, and you need them to point to Server. I think you have to do
prefect backend server
and then I think you just need the
config.toml
in that image that points to the right API.
[server]
endpoint = "http://<YOUR_VM_IP>:4200/graphql/"
or maybe you can set the environment variable
PREFECT__SERVER__ENDPOINT
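For reference, a quick way to sanity-check the resulting image is to print the resolved config from inside the container. This is a minimal sketch assuming the config keys prefect.config.backend and prefect.config.server.endpoint, which should be verified against the Prefect version in use:

import prefect

# Expect "server" after `prefect backend server` has been run in the image,
# and the Server GraphQL URL if the config.toml / env var above took effect.
print(prefect.config.backend)
print(prefect.config.server.endpoint)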
r
@Kevin Kho, Okay, going to give that a try. Does it make sense that I didn't need to do this in order to get things working when I'm not using the DaskExecutor? Because everything is working fine when I just rely on either LocalExecutor
k
Yes, because LocalDaskExecutor and LocalExecutor use the configuration of your local machine. DaskExecutor doesn't, so it will need to be configured. But what I am not sure about is whether it's expected for the agent configuration to propagate to the Dask workers. If you use
LocalRun
, I think env variables are carried over, but I don't think they are for the other RunConfigs. Yes, this is a common issue though: even if you're able to send work to the Dask cluster somehow, the workers won't be able to update the state of the tasks if they don't point to the server correctly.
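One way to get the settings onto the temporary Fargate workers regardless of the agent is to pass them to the cluster itself through the executor. A minimal sketch, assuming dask-cloudprovider's FargateCluster accepts an environment mapping (check its docs for the exact keyword) and using placeholder image and endpoint values:

from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def say_hello():
    print("hello from a Fargate dask worker")

with Flow("fargate-dask-example") as flow:
    say_hello()

# Point the temporary Fargate scheduler/workers at Server instead of Cloud by
# shipping the backend settings as environment variables on the cluster.
flow.executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={
        "image": "prefecthq/prefect:0.15.0",   # or your custom base image
        "n_workers": 2,
        "environment": {                       # assumed kwarg -- verify
            "PREFECT__BACKEND": "server",
            "PREFECT__SERVER__ENDPOINT": "http://<YOUR_VM_IP>:4200/graphql/",
        },
    },
)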
r
Also, probably not related, but we're currently stuck on version 0.15.0 because when I try to build a docker image on anything newer than that, I get the following error when installing
dask-cloudprovider[aws]
via pip:
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

boto3 1.18.23 requires botocore<1.22.0,>=1.21.23, but you'll have botocore 1.20.106 which is incompatible.
I tried running the image regardless, but it fails to create the Dask cluster, complaining about aiobotocore being used improperly, so clearly there is a real incompatibility somewhere.
k
If you have been doing this since last Thursday,
aiobotocore
had a 1.4 release that broke multiple Dask-related things, and you would need to go down to 1.3.3 for that.
r
@Kevin Kho okay great, building our flows from a custom base image that calls
prefect backend server
and sets the correct ENV var for our server did the trick!
👍 1
Also, looks like if I explicitly install
aiobotocore==1.3.3
and
boto3==1.17.106
, then I can get pip install to work against the prebuilt
0.15.4
prefect image. My only concern is that the base 0.15.4 prefect image came preinstalled with 1.18.XX of
boto3
. Do you think I'm going to hit any weird issues by downgrading boto3 as well?
k
I don't expect you to run into problems, and pip itself is not a full dependency resolver, so you will likely need to manage those dependencies yourself. For more complex dependency resolution, you'll need something like conda.
👍 1
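If it helps, the explicit pins can also be baked in when the flow image is built with Prefect's Docker storage. A sketch assuming the base_image and python_dependencies parameters of prefect.storage.Docker, with a placeholder registry URL:

from prefect.storage import Docker

# Pin the conflicting packages alongside dask-cloudprovider when the flow
# image is built, mirroring the versions mentioned above.
storage = Docker(
    registry_url="my-registry.example.com",   # placeholder registry
    base_image="prefecthq/prefect:0.15.4",
    python_dependencies=[
        "dask-cloudprovider[aws]",
        "aiobotocore==1.3.3",
        "boto3==1.17.106",
    ],
)
# then attach it with flow.storage = storage before registering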
v
I have been encountering a similar issue where a version bump from 0.14.22 to 0.15.0 within the docker image alters the behavior of the worker authorization. In 0.14.22, the
PREFECT__BACKEND
and
PREFECT__SERVER__HOST
configurations from the agent are passed successfully to the dask-scheduler and dask-workers, but in 0.15.0+ the workers no longer inherit this specification from the agent. I have been able to find a workaround by setting those environment variables in the docker image, but I would prefer for the agent to pass this info down to the workers. Any suggestions?
k
Hey @Vincent, what type of agent are you using? I think only the Local agent passes env vars. Will bring this up to the team though.
v
I am using a kubernetes agent
k
Did you use that before 0.15.0, and did the env vars pass through from the agent?
v
I don't see environment variables set on the docker pod, but these pods mysteriously have some knowledge of which server to ping home to, and this changed between 0.14.22 and 0.15.0.
👍 1
z
Hey Vincent, I'm not sure what caused this change in behavior but I'd love to restore passing these settings to the workers 🙂 if you help track it down we can get a fix in quickly
v
Yes - I am still investigating the root cause, but the 0.15.0 bump introduced many changes. How were credentials passed to the workers prior to 0.15.0? (What files should I focus on?) Are these passed via the task, or are they present before any work gets started?
z
I think we pass the
context
(which contains the populated
settings
object) to the
TaskRunner.run
method which is submitted to the dask workers. This means that settings should be loaded from the
context.settings
instead of
prefect.settings
where they need to be respected on workers. I suspect that this is a result of my changes in the
Client
as I am not aware of the settings environment variables being set on the dask workers (although it could certainly be happening somewhere).
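As a toy illustration of that distinction (not Prefect internals): settings that are serialized into the shipped context stay correct on remote workers, while module-level settings rebuilt from each worker's own environment can silently fall back to defaults such as Cloud:

import os
from dataclasses import dataclass, field

@dataclass
class Context:
    settings: dict = field(default_factory=dict)

# Module-level settings: rebuilt on every worker from its local environment.
module_settings = {"backend": os.environ.get("PREFECT__BACKEND", "cloud")}

def run_task(context: Context) -> str:
    # Reading from the shipped context respects the agent/flow configuration;
    # falling back to module_settings uses whatever the worker's env says.
    return context.settings.get("backend", module_settings["backend"])

ctx = Context(settings={"backend": "server"})
assert run_task(ctx) == "server"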
v
Okay. I found the issue! It turns out that there is a missing
self
at line 156 for
api_server
https://github.com/PrefectHQ/prefect/blob/master/src/prefect/client/client.py#L156 Small bug, but it works now!
z
Wonderful! Want to PR?
v