Apache Spark learning log [#1] - Installing PySpark and enabling IPython in PyCharm

Hi, I'm starting a new series of short posts describing my process of learning Apache Spark.

Choice of the programming language

Apache Spark offers high-level APIs for Java, Scala, Python, and R. Python seems like the best choice for me at the moment because I use it both at work and in my private projects. I considered Scala as well, but I only had a brief episode with that language at one of my previous jobs and haven't used it since. Sticking with Python seems the most reasonable option.

Installation process

Python's binding for Apache Spark is called PySpark and is available on PyPI. Therefore, installing it with pip should be pretty straightforward. There is one dependency, though, that you'll need to take care of to make it work.

Java

Spark requires Java. I had already installed OpenJDK 11 beforehand. Once you have Java installed on your machine, you need to set the JAVA_HOME environment variable. I'm working on Ubuntu 20.04 with the zsh shell, so I set this variable in the ~/.zshrc file as follows:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
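
After reloading the shell (or running source ~/.zshrc), a quick optional sanity check is to confirm the variable is visible from Python. This is just a small sketch of mine; it isn't Spark-specific, it only reads the environment and calls java:

# optional sanity check: confirm JAVA_HOME is set and java is callable
import os
import subprocess

print(os.environ.get("JAVA_HOME"))    # expect /usr/lib/jvm/java-11-openjdk-amd64
subprocess.run(["java", "-version"])  # prints the JVM version (to stderr)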

PySpark

I use PyCharm and, for the purpose of learning Spark, I created a new project with a fresh virtual environment named pyspark-learning.

Installing PySpark is as easy as typing the following in the terminal:

pip install pyspark

This installs the latest version of the package (3.1.1 in my case). If you have a poor or unstable internet connection, keep in mind that the pyspark package weighs about 235 MB, so it might take a little while to download.
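
Once the download finishes, a quick way to confirm which version landed in the virtualenv:

# confirm the package is importable and check the installed version
import pyspark

print(pyspark.__version__)  # e.g. 3.1.1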

Switching from the default Python shell to IPython

To make sure that the installation succeeded, I ran the command below in PyCharm's terminal:

pyspark

This started a PySpark shell inside PyCharm's terminal and everything worked properly. However, PySpark was using Python's default shell, which doesn't offer coloring or autocompletion, so I tried to switch to IPython. First, I had to install it:

pip install ipython

When running the pyspark command, PySpark would again start inside Python's default shell. PySpark's documentation (available here for version 3.1.1) advises setting the PYSPARK_DRIVER_PYTHON variable to ipython.

On UNIX-based systems, you usually activate a virtualenv by running source /path/to/venv/bin/activate.

Any variables that you define inside the activate file will be set and available after your environment is started (note that a plain assignment is only visible to the shell itself; it needs to be exported to reach child processes). I added PYSPARK_DRIVER_PYTHON=ipython there, but that still did not work as expected: PySpark would start with Python's default shell again.

After some trial and error, I came to the conclusion that PySpark would load with the IPython shell if I executed the pyspark file directly from the installed package location (and not from path/to/venv/bin/pyspark) while setting PYSPARK_DRIVER_PYTHON at the same time. To make it clearer:

  • /home/kuba/.virtualenvs/pyspark-learning is my virtualenv's main directory. Its /home/kuba/.virtualenvs/pyspark-learning/bin subdirectory contains the activate file, which starts the whole virtual environment, as well as a pyspark file that starts PySpark's Python shell. Whenever you type pyspark inside PyCharm's terminal, it executes /home/kuba/.virtualenvs/pyspark-learning/bin/pyspark.
  • /home/kuba/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark is where the PySpark package was installed by pip. It also contains a /home/kuba/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark/bin directory where a second pyspark file lives (see the snippet below for a way to find this path without typing it by hand).
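
Incidentally, you can ask Python itself where the package lives instead of hunting through site-packages manually (the exact path will of course differ between machines):

# locate the pip-installed pyspark package; its bin/ directory holds the launcher
import os
import pyspark

package_dir = os.path.dirname(pyspark.__file__)
print(os.path.join(package_dir, "bin", "pyspark"))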

It turned out that whenever I set the PYSPARK_DRIVER_PYTHON=ipython variable and ran the pyspark file from within the package installation path (and not the venv bin folder), PySpark's shell loaded with IPython! The command for that looks like this:

PYSPARK_DRIVER_PYTHON=ipython ~/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark/bin/pyspark

To avoid typing this long path, I added an alias for pyspark inside the /home/kuba/.virtualenvs/pyspark-learning/bin/activate file:

# file /home/kuba/.virtualenvs/pyspark-learning/bin/activate

alias pyspark="PYSPARK_DRIVER_PYTHON=ipython /home/kuba/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark/bin/pyspark"

And now, whenever I type pyspark inside PyCharm's terminal with the virtualenv activated, it starts PySpark's shell using IPython :-).

Ensuring Spark works inside a Python application

The last thing I wanted to check was whether PySpark initializes properly from within Python files. To test this, I created a little program that just creates a Spark session.

# file test_spark.py

from pyspark.sql import SparkSession


def main():
    # getOrCreate() starts a local SparkSession (booting the JVM behind it)
    spark = SparkSession.builder.getOrCreate()
    assert spark


if __name__ == '__main__':
    main()

It ran successfully and I could see Spark's initialization output printed to the console. The only thing that caught my eye was a WARNING: An illegal reflective access operation has occurred message. After some quick googling, it looks like it's caused by the fact that I'm using Java 11 instead of Java 8. Since this is for learning purposes and not production code, I'll leave it as it is for now and won't downgrade my Java installation.
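
If you want to go one step further than just creating the session, a tiny end-to-end check (an optional sketch, not something the setup requires) builds a small DataFrame and triggers an action so a real job runs on the JVM side:

# file test_spark.py (extended) -- optional end-to-end check

from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.getOrCreate()
    # a toy DataFrame with an explicit list of column names
    df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
    assert df.count() == 2  # count() is an action, so a job actually executes
    spark.stop()


if __name__ == '__main__':
    main()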

And that's it for this article!
Best Regards,
Kuba