Configuring Spark to work with Jupyter Notebook and Anaconda
Question
I've spent a few days trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, Spark launches just fine in my command-line shell, and the output of sc is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help me solve this?
Just to clarify: there is nothing after the colon at the end of the error. I also tried to create my own start-up file by following another post, which I quote here so you don't have to go look for it:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Answer
Conda can help correctly manage a lot of dependencies...
Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
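Before going further, it may help to confirm that SPARK_HOME really points at a Spark installation. The following is a minimal sketch, not part of the original answer: the helper name check_spark_home is hypothetical, and it assumes the usual layout of a Spark binary distribution, with a bin/pyspark launcher and a python/pyspark package (the same directory that appears in the error message from the question).

```python
import os


def check_spark_home(spark_home):
    """Return a list of problems found; an empty list means the layout looks usable.

    Hypothetical helper. It only checks that paths exist, not that Spark runs.
    """
    if not spark_home:
        return ["SPARK_HOME is not set"]
    if not os.path.isdir(spark_home):
        return ["SPARK_HOME is not a directory: " + spark_home]

    problems = []
    # Paths assumed to exist in a Spark binary distribution.
    expected = [
        os.path.join("bin", "pyspark"),      # launcher script
        os.path.join("python", "pyspark"),   # Python package used by the driver
    ]
    for rel in expected:
        if not os.path.exists(os.path.join(spark_home, rel)):
            problems.append("missing " + rel + " under " + spark_home)
    return problems


if __name__ == "__main__":
    found = check_spark_home(os.environ.get("SPARK_HOME"))
    for line in found or ["SPARK_HOME looks usable"]:
        print(line)
```

Run it from a new shell after editing ~/.bashrc, so that the exported variables are actually in effect.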
Create a conda environment with all needed dependencies apart from Spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment:
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python3 notebook.
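If the notebook kernel is not running from the conda environment, the imports will fail. As a quick check (a sketch only; missing_modules is a hypothetical helper, not part of findspark or Jupyter), the first cell can print which interpreter the kernel uses and whether findspark is importable from it:

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point inside the activated conda environment.
print("Kernel interpreter:", sys.executable)
print("Missing modules:", missing_modules(["findspark"]))
```

An empty list means findspark is visible to the kernel. If it is listed as missing, the notebook server was probably started outside the activated environment.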
Try calculating pi with the following script (borrowed from a linked example):
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
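To sanity-check the number Spark prints, the same Monte Carlo estimate can be run in plain Python on a much smaller sample. This is a sketch for comparison only: estimate_pi and its seed parameter are hypothetical names, and it uses no Spark at all.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points in the unit square, without Spark."""
    rng = random.Random(seed)  # local generator so a seed gives repeatable runs
    count = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # same test as inside() in the Spark script
            count += 1
    return 4 * count / num_samples


print(estimate_pi(200000, seed=0))
```

Both results should land close to 3.14. If the Spark version is far off or fails while this one works, the problem is in the Spark setup rather than in the script.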