Hi, I'm starting a new series of short posts describing my process of learning Apache Spark.
Choice of the programming language
Apache Spark offers high-level APIs for Java, Scala, Python, and R. Python seems the best choice for me at the moment because I use it both at work and in my private projects. I considered Scala as well, but I only had a short episode with that language in one of my previous jobs and I haven't used it since. Sticking with Python seems most reasonable.
Python's binding for Apache Spark is called PySpark and is available on PyPI, so installing it with pip should be pretty straightforward. There is one dependency, though, that you'll need to take care of to make it work.
Spark requires Java. I had already installed OpenJDK 11 before. Once you have Java installed on your machine, you need to set the JAVA_HOME environment variable. I'm working on Ubuntu 20.04 with the zsh shell, so I set this variable in my ~/.zshrc file as follows:
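The original snippet didn't survive into this copy, so here is a minimal sketch of that ~/.zshrc entry; the path below is the default OpenJDK 11 location on Ubuntu and may differ on your machine:

```shell
# ~/.zshrc
# Default OpenJDK 11 path on Ubuntu; verify yours with: readlink -f "$(which java)"
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```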
I use PyCharm and, for the purpose of learning Spark, I created a new project with a fresh virtual environment named pyspark-learning.
Installing Spark is as easy as typing in the terminal:
pip install pyspark
This installs the latest version of the package (which was 3.1.1 in my case). If you have a poor or unstable internet connection, keep in mind that the pyspark package weighs about 235 MB, so it might take a little while to download.
Switching from default Python shell to IPython
To make sure that the installation succeeded, I ran the pyspark command in PyCharm's terminal.
This started a PySpark shell inside PyCharm's terminal and everything worked properly. However, PySpark was using Python's default shell, which doesn't support coloring or autocompletion, so I tried to switch to IPython. First, I had to install it:
pip install ipython
However, after running the pyspark command, PySpark would again start inside Python's default shell. PySpark's documentation (available here for version 3.1.1) advises setting the PYSPARK_DRIVER_PYTHON environment variable to ipython.
When you start your virtualenv on UNIX-based systems, you usually do that by sourcing the activate script. Any variables that you define inside the activate file will be set and available for you after your environment is started. I added PYSPARK_DRIVER_PYTHON=ipython there, but that still did not work as expected: PySpark would start with Python's default shell again.
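For reference, a sketch of what that addition to the activate script might look like (path/to/venv stands for your virtualenv directory; note that an export keyword is needed for the variable to be visible to child processes at all):

```shell
# appended at the end of path/to/venv/bin/activate (sketch)
# "export" makes the variable visible to processes started from this shell
export PYSPARK_DRIVER_PYTHON=ipython
```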
After some trial and error, I came to the conclusion that PySpark would load with the IPython shell if I executed the pyspark file directly from the installed package location (and not from path/to/venv/bin/pyspark) while setting PYSPARK_DRIVER_PYTHON at the same time. To make it clearer:
/home/kuba/.virtualenvs/pyspark-learning is my virtualenv's main directory. In the /home/kuba/.virtualenvs/pyspark-learning/bin directory, we have the activate file, which starts the whole virtual environment, as well as a pyspark file that starts PySpark's Python shell. Whenever you type pyspark inside PyCharm's terminal, it executes that file. The PySpark package itself was installed with pip under lib/python3.8/site-packages/pyspark inside the virtualenv, and it also contains its own bin/pyspark script.
It turned out that when I set the PYSPARK_DRIVER_PYTHON=ipython variable and ran pyspark from within the package installation path (and not from the venv's bin folder), PySpark's shell loaded with IPython. The command for that looks like this:
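Spelled out with my virtualenv's paths, it is the same command that the alias below wraps:

```shell
PYSPARK_DRIVER_PYTHON=ipython /home/kuba/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark/bin/pyspark
```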
To avoid typing this long path, I added an alias to the activate file:
```shell
# file /home/kuba/.virtualenvs/pyspark-learning/bin/activate
alias pyspark="PYSPARK_DRIVER_PYTHON=ipython /home/kuba/.virtualenvs/pyspark-learning/lib/python3.8/site-packages/pyspark/bin/pyspark"
```
And now, whenever I type pyspark inside PyCharm's terminal with the virtualenv activated, it starts PySpark's shell using IPython.
Ensuring Spark works inside a Python application
The last thing I wanted to check was whether PySpark was properly initialized from within Python files. To test this, I created a little program that just creates a Spark session.
```python
# file test_spark.py
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.getOrCreate()
    assert spark


if __name__ == '__main__':
    main()
```
It ran successfully and I could see Spark's initialization output printed to the console. The only thing that caught my eye was the WARNING: An illegal reflective access operation has occurred message. After some quick googling, it looks like it is caused by the fact that I'm using Java 11 instead of Java 8. Since this is for learning purposes, not production code, I'll probably leave it as is for now and won't downgrade my Java installation.
And that's it for this article!