
Wrong pyspark version during notebook runtime #187


Description

@moschnetzsch

Hi,

I created a custom Docker image using this Dockerfile.

FROM databricksruntime/minimal:14.3-LTS

ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"

# Installs python 3.9 and virtualenv for Spark and Notebooks
RUN apt update && apt upgrade -y
RUN apt install -y curl software-properties-common apt-utils
RUN add-apt-repository ppa:deadsnakes/ppa -y
RUN apt update
# CONTAINER_TIMEZONE is not set elsewhere in this Dockerfile; a default is assumed here
ARG CONTAINER_TIMEZONE=Etc/UTC
RUN ln -snf /usr/share/zoneinfo/$CONTAINER_TIMEZONE /etc/localtime && echo $CONTAINER_TIMEZONE > /etc/timezone
RUN apt install -y python${python_version} python${python_version}-dev python${python_version}-distutils
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version}
RUN rm get-pip.py

RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
  && sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
  && /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
  /usr/local/lib/python${python_version}/dist-packages/virtualenv_support/

# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download  --no-setuptools

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/13.3.html#system-environment

COPY requirements.txt /databricks/.
COPY databricks_requirements.txt /databricks/.

# strip pywin32 as it is not needed on linux
RUN sed -i '/pywin32/d' /databricks/requirements.txt

RUN /databricks/python3/bin/pip install -r /databricks/requirements.txt
RUN /databricks/python3/bin/pip install -r /databricks/databricks_requirements.txt

# Install Databricks CLI
RUN apt install unzip -y
RUN curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
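
For reference, a quick way to confirm which pyspark actually ends up in the /databricks/python3 environment built above is to query pip's metadata from that interpreter. This check is not part of the Dockerfile; it is just a sketch, run with /databricks/python3/bin/python3 inside the built image:

# Run with /databricks/python3/bin/python3 inside the built image.
# Reports the pyspark version that pip resolved as a sub-dependency of requirements.txt.
import importlib.metadata

print(importlib.metadata.version("pyspark"))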

As in the example, the PYSPARK_PYTHON variable is pointing to my custom Python 3.9 environment. When I check the imported pyspark version, it is different from the one that is installed as a sub-dependency in my Python environment, as seen in the image.

[Screenshot: notebook output showing the imported pyspark version differing from the one installed in the Python environment]
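
A minimal sketch of this kind of check from a notebook cell (the exact commands in the screenshot may differ):

import sys
import pyspark

# sys.executable should be the interpreter that PYSPARK_PYTHON points at
print(sys.executable)
# Version and location of the pyspark module that actually gets imported
print(pyspark.__version__)
print(pyspark.__file__)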

This leads to a lot of complications, e.g. when using the dataengineering client with pyspark. How can I make sure the imported pyspark version is the one installed in my Python environment?
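
For what it's worth, the Dockerfile comment above says pyspark is injected when the cluster is launched, so a diagnostic like the following (purely illustrative, not from the original report) shows which sys.path entry the imported copy is coming from:

import sys

# Import search order: an injected Spark python path appearing earlier in the list
# would shadow the pyspark installed into /databricks/python3 (assumption based on
# the "injected when the cluster is launched" comment in the Dockerfile).
for entry in sys.path:
    print(entry)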
