Hi, I have a flow on GitHub and a Prefect agent running...
# ask-community
a
Hi, I have a flow on GitHub and a Prefect agent running on GKE, plus a Dockerfile holding all the custom modules, which eventually goes to GCR. Things were working fine, but now I need to install PySpark in the Dockerfile. I included it the same way we do in our other Dockerfiles (we already have PySpark in another Dockerfile for this project and that one works), but when I include it in the current Dockerfile and build with Cloud Build, the build fails saying
Unable to locate package openjdk-8-jdk
Is the issue because of the base image? In the other Dockerfiles where Spark runs we have ubuntu 20.04 as the base image, but for Prefect we have the Prefect image as the base. Below is the Dockerfile:
FROM prefecthq/prefect:0.15.6-python3.8
# for spark
ENV JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
ENV SPARK_HOME="/spark/spark-3.1.2-bin-hadoop3.2/"
ENV PYTHONPATH="/spark/spark-3.1.2-bin-hadoop3.2/python:$PYTHONPATH"
ENV PYSPARK_PYTHON="python3"
ENV PATH="$PATH:/spark/spark-3.1.2-bin-hadoop3.2/bin"
ENV PATH="$PATH:$JAVA_HOME"
ENV PATH="$PATH:$JAVA_HOME/bin"
ENV PATH="$PATH:$JAVA_HOME/jre/bin"
ENV SPARK_LOCAL_IP="127.0.0.1"
WORKDIR /
COPY . /
RUN apt-get update && \
apt-get install -y  \
openjdk-8-jdk  \
python3-pip
ADD https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz spark.tgz
RUN mkdir -p spark && \
tar -zxvf spark.tgz -C spark/ && \
rm spark.tgz
# for prefect
RUN pip install feast feast-postgres sqlalchemy google-auth scikit-learn
RUN pip install feast[gcp]
RUN pip install --upgrade google-cloud
RUN pip install --upgrade google-cloud-bigquery
RUN pip install --upgrade google-cloud-storage
WORKDIR /opt/prefect
COPY flow_utilities/ /opt/prefect/flow_utilities/
COPY flow_utilities_bigQ_Datastore/ /opt/prefect/flow_utilities_bigQ_Datastore/
COPY setup.py /opt/prefect/setup.py
COPY .feastignore /opt/prefect/.feastignore
RUN pip install .
a
Correct, in this case it may be easier to use an Ubuntu base image if you really need OpenJDK there; here is an SO issue with some pointers: https://stackoverflow.com/questions/32942023/ubuntu-openjdk-8-unable-to-locate-package But I'm not sure whether you need it at all - where is your Spark cluster running? Perhaps you can start the PySpark job via an API (e.g. a Databricks cluster) or via a ShellTask submitting the job to the cluster?
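For reference, here is a minimal sketch of the Ubuntu route (Ubuntu 20.04 and the package list are assumptions based on your other working Dockerfiles):
FROM ubuntu:20.04
# openjdk-8-jdk is available in Ubuntu 20.04's default repos, while the
# Debian-based image behind prefecthq/prefect:0.15.6-python3.8 no longer
# carries it - hence the "Unable to locate package" error
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        openjdk-8-jdk python3-pip && \
    rm -rf /var/lib/apt/lists/*
The DEBIAN_FRONTEND=noninteractive part avoids the interactive tzdata prompt that the JDK's dependencies can otherwise trigger during the build.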
a
OK, so now I am sure it is because of the Prefect base image, because when I changed the base image from Prefect to Ubuntu it worked. But as far as I remember, I cannot use Ubuntu as the base image, because in the past (in a DM) I was facing issues where Prefect could not find the custom modules, and you told me I had to use Prefect as the base image, and then it worked.
@Anna Geller Inside our container, where the other custom modules are.
a
Well, it's not that you can't use a base image other than the PrefectHQ one; we just recommend those since they are configured to have everything you need.
o
@Anna Geller Can we use everything in this image to create a Prefect image with Ubuntu as the base image? https://github.com/PrefectHQ/prefect/blob/master/Dockerfile
a
sure, you definitely can do that
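A rough sketch of that idea, assuming Python 3.8 and Prefect 0.15.6 to match your current tag (the linked Dockerfile remains the authoritative reference for what the official image actually does):
FROM ubuntu:20.04
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3.8 python3-pip openjdk-8-jdk && \
    rm -rf /var/lib/apt/lists/*
# Prefect no longer comes with the base image, so install it explicitly
RUN pip3 install prefect==0.15.6
# mirror the official image's layout so your existing paths keep working
RUN mkdir -p /opt/prefect
WORKDIR /opt/prefect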
a
OK, when I use Ubuntu as the base image now, what would be the working directory (the same /opt/prefect?) for storing custom modules? Because as far as I remember, I had to use /opt/prefect, otherwise Prefect was not able to find the custom modules.
a
Since you are installing your custom modules as a package, I think the working directory doesn’t matter that much, but if you want, you can set it using the WORKDIR command in a Dockerfile. Prefect will be able to find your modules because of this:
COPY setup.py /opt/prefect/setup.py
RUN pip install .
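So in an Ubuntu-based image the same pattern carries over, whatever directory you pick (the paths below are just illustrative, matching your current Dockerfile):
WORKDIR /opt/prefect
COPY flow_utilities/ /opt/prefect/flow_utilities/
COPY setup.py /opt/prefect/setup.py
# pip installs the modules into site-packages, so imports resolve
# regardless of the working directory at flow runtime
RUN pip install .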