# prefect-community

Evan Curtin

05/06/2022, 6:12 PM
Hey all, has anybody successfully started a dask cluster + prefect on top of an existing spark cluster (databricks)? I have a databricks cluster but want to use distributed prefect to schedule

Anna Geller

05/06/2022, 6:23 PM
To work with Databricks, the easiest is to use https://docs.prefect.io/api/latest/tasks/databricks.html. Running Dask on Databricks is 💣 😄; even if it works, it would be potentially dangerous. It's probably easier to stick with Spark if you rely on Databricks. I saw a blog post on Medium showing how someone could in theory run Dask on Databricks, but I'm not sure it's worth your time. Can you explain your use case a bit more? What type of work are you trying to parallelize?
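A sketch of what Anna suggests: submitting a job to Databricks from a Prefect 1.x flow with the built-in `DatabricksSubmitRun` task. The cluster spec, notebook path, and secret name below are placeholder assumptions, not values from this thread.

```python
# Run-submit payload for the Databricks Jobs API (all values are
# placeholder assumptions for illustration).
run_config = {
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Users/me/experiment"},
}

# With Prefect 1.x installed, the flow side would look roughly like:
# from prefect import Flow
# from prefect.tasks.databricks import DatabricksSubmitRun
#
# submit = DatabricksSubmitRun(
#     json=run_config,
#     databricks_conn_secret="DATABRICKS_CONNECTION",  # Secret name (assumption)
# )
# with Flow("databricks-submit") as flow:
#     submit()
```

This keeps the orchestration in Prefect while letting Databricks own the Spark cluster, which is the trade-off Anna is pointing at.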

Evan Curtin

05/06/2022, 6:24 PM
It’s large-scale ML experiments I want to run, one simulation per worker. Databricks is just the easiest way for me to get a cluster. I’m currently using Spark to do the orchestration, but it’s really not meant for what I’m doing

Kevin Kho

05/06/2022, 6:32 PM
Yes dude! I was trying this out with my collaborator for my other open source project. This is the init script:
```bash
#!/bin/bash
# Databricks cluster init script: start a Dask scheduler on the driver
# node and a Dask worker on every other node.
echo $DB_IS_DRIVER
/databricks/python/bin/pip install prefect dask distributed fugue[sql,duckdb] tune fs-s3fs optuna
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  # Driver node: run the scheduler in the background
  dask-scheduler &>/dev/null &
else
  # Worker node: connect to the scheduler on the driver's IP
  dask-worker tcp://$DB_DRIVER_IP:8786 --nworkers 4 --nthreads 1 &>/dev/null &
fi
```
but really, don’t do this lol. All packages need to be installed before Dask starts, too; I think the order matters. But with Databricks Connect you can just do:
```python
from prefect import task
from pyspark.sql import SparkSession

@task
def spark_thing():
    # With Databricks Connect, this session is routed to the remote cluster
    spark = SparkSession.builder.getOrCreate()
    spark.createDataFrame(...)
```

Evan Curtin

05/06/2022, 8:09 PM
we’re using the Docker runtime, so installing dependencies is trivial. If we just need a way to start the workers and the scheduler and set up communication, it feels doable
it doesn’t need to happen at cluster init either
```bash
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  dask-scheduler &>/dev/null &
else
  dask-worker tcp://$DB_DRIVER_IP:8786 --nworkers 4 --nthreads 1 &>/dev/null &
fi
```
so I could potentially run this right?

Kevin Kho

05/06/2022, 8:59 PM
I guess you could try yeah
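If the scheduler/worker approach above does come up, the flow side would still need to be pointed at that scheduler. A minimal sketch, assuming Prefect 1.x and that `DB_DRIVER_IP` is set the same way the init script assumes:

```python
import os

# Build the same scheduler address the dask-worker command uses, so the
# Prefect flow and the workers agree on it (the 127.0.0.1 fallback is an
# assumption for local testing only).
driver_ip = os.environ.get("DB_DRIVER_IP", "127.0.0.1")
scheduler_address = f"tcp://{driver_ip}:8786"

# With Prefect 1.x installed, the flow would then attach to that cluster:
# from prefect.executors import DaskExecutor
# flow.run(executor=DaskExecutor(address=scheduler_address))
```

Passing an `address` to `DaskExecutor` connects to an existing cluster instead of spinning up a local one, which matches the "1 simulation per worker" goal in this thread.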