
Wrong pyspark version during notebook runtime #187


Description

@moschnetzsch

Hi,

I created a custom Docker image using this Dockerfile.

FROM databricksruntime/minimal:14.3-LTS

ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"

# Installs python 3.9 and virtualenv for Spark and Notebooks
RUN apt update && apt upgrade -y
RUN apt install -y curl software-properties-common apt-utils
RUN add-apt-repository ppa:deadsnakes/ppa -y
RUN apt update
# CONTAINER_TIMEZONE is not set elsewhere in this Dockerfile; a default is assumed here
ARG CONTAINER_TIMEZONE=Etc/UTC
RUN ln -snf /usr/share/zoneinfo/$CONTAINER_TIMEZONE /etc/localtime && echo $CONTAINER_TIMEZONE > /etc/timezone
RUN apt install -y python${python_version} python${python_version}-dev python${python_version}-distutils
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version}
RUN rm get-pip.py

RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
  && sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
  && /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
  /usr/local/lib/python${python_version}/dist-packages/virtualenv_support/

# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download  --no-setuptools

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/13.3.html#system-environment

COPY requirements.txt /databricks/.
COPY databricks_requirements.txt /databricks/.

# strip pywin32 as it is not needed on linux
RUN sed -i '/pywin32/d' /databricks/requirements.txt

RUN /databricks/python3/bin/pip install -r /databricks/requirements.txt
RUN /databricks/python3/bin/pip install -r /databricks/databricks_requirements.txt

# Install Databricks CLI
RUN apt install unzip -y
RUN curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
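
For reference, a quick way to confirm which pyspark actually ends up in the /databricks/python3 environment built above is to query pip's metadata from that interpreter. This check is not part of the Dockerfile; it is just a sketch, run with /databricks/python3/bin/python3 inside the built image:

# Run with /databricks/python3/bin/python3 inside the built image.
# Reports the pyspark version that pip resolved as a sub-dependency of requirements.txt.
import importlib.metadata

print(importlib.metadata.version("pyspark"))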

As in the example, the PYSPARK_PYTHON variable is pointing to my custom Python 3.9 environment. When I check the imported pyspark version, it is different from the one that is installed as a sub-dependency in my Python environment, as seen in the image.

[Screenshot: notebook output showing the imported pyspark version differing from the one installed in the Python environment]
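
A minimal sketch of this kind of check from a notebook cell (the exact commands in the screenshot may differ):

import sys
import pyspark

# sys.executable should be the interpreter that PYSPARK_PYTHON points at
print(sys.executable)
# Version and location of the pyspark module that actually gets imported
print(pyspark.__version__)
print(pyspark.__file__)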

This leads to a lot of complications, e.g. when using the dataengineering client with pyspark. How can I make sure the imported pyspark version is the one installed in my Python environment?
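
For what it's worth, the Dockerfile comment above says pyspark is injected when the cluster is launched, so a diagnostic like the following (purely illustrative, not from the original report) shows which sys.path entry the imported copy is coming from:

import sys

# Import search order: an injected Spark python path appearing earlier in the list
# would shadow the pyspark installed into /databricks/python3 (assumption based on
# the "injected when the cluster is launched" comment in the Dockerfile).
for entry in sys.path:
    print(entry)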
