Goals
Local environment (Mac OS X)
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export SPARK_HOME="$HOME/opt/spark-2.3.2-bin-hadoop2.6"
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$PATH
Python environment
virtualenv -p python3 spark-python3
source spark-python3/bin/activate
pip install pyspark
deactivate (run only when you want to leave the current venv)
helloworld
./bin/spark-submit examples/src/main/python/wordcount.py README.md
spark-shell
IDE (PyCharm)
- Preferences > Project Interpreter
- helloworld walkthrough
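The helloworld walkthrough can be followed against a minimal sketch like the one below. It mirrors the flatMap / map / reduceByKey shape of a word count but is not a copy of the bundled wordcount.py; the names tokenize, word_count and WordCountSketch are illustrative assumptions. The pyspark import sits inside the function only so that the tokenizer can be tried without a Spark installation.

```python
from operator import add


def tokenize(line):
    """Split one line of text on spaces, dropping empty strings."""
    return [word for word in line.split(" ") if word]


def word_count(path):
    """Count the words of a text file with the RDD API (requires pyspark)."""
    # Imported here so tokenize() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    try:
        lines = spark.sparkContext.textFile(path)
        counts = (
            lines.flatMap(tokenize)              # line -> words
                 .map(lambda word: (word, 1))    # word -> (word, 1)
                 .reduceByKey(add)               # sum the 1s per word
        )
        # collect() is the action that actually starts the job.
        return dict(counts.collect())
    finally:
        spark.stop()
```

Inside the activated venv, calling word_count("README.md") should return a dict that maps each word to its count.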
RDD

- partition
- transformations
- actions
- lazy evaluation
- shuffle
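The RDD terms above (partition, transformation, action, lazy evaluation, shuffle) can be seen together in one short sketch. It assumes `sc` is an already running SparkContext, such as the one the pyspark shell provides; the function and helper names are illustrative, not part of Spark.

```python
def is_even(n):
    """Predicate used by filter()."""
    return n % 2 == 0


def square(n):
    """Function applied by map()."""
    return n * n


def rdd_basics(sc):
    """Touch each RDD concept once; `sc` is assumed to be a live SparkContext."""
    # Partition: spread the numbers 0..9 over 4 partitions.
    numbers = sc.parallelize(range(10), 4)
    print(numbers.getNumPartitions())

    # Transformations are lazy: this line only records the lineage,
    # no task has run yet.
    squares_of_evens = numbers.filter(is_even).map(square)

    # An action triggers execution and brings results back to the driver.
    print(squares_of_evens.collect())

    # reduceByKey needs a shuffle: records with the same key have to end up
    # in the same partition before they can be combined.
    sums_by_parity = (
        numbers.map(lambda n: (n % 2, n))
               .reduceByKey(lambda a, b: a + b)
    )
    return dict(sums_by_parity.collect())
```

In the pyspark shell this can be tried with rdd_basics(sc).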
Parallelization
- partitions and concurrency
- worker, executor, task, job
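How partitions relate to concurrency can be worked through with simple arithmetic: a stage runs one task per partition, and each task occupies spark.task.cpus cores (1 by default) on some executor. The helper below is a back-of-the-envelope model under those assumptions, not a Spark API; it ignores scheduling delay and data skew, and its names are hypothetical.

```python
import math


def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Number of tasks that can run at the same time across all executors."""
    return num_executors * (cores_per_executor // cpus_per_task)


def task_waves(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Rounds of tasks needed to process every partition of one stage."""
    slots = task_slots(num_executors, cores_per_executor, cpus_per_task)
    if slots < 1:
        raise ValueError("no executor has enough cores for a single task")
    return math.ceil(num_partitions / slots)
```

For instance, task_waves(200, 1, 4) gives 50 rounds of tasks for 200 partitions on a single 4-core executor, while task_waves(8, 1, 4) gives 2.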

Spark Streaming, DStream
socket
./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
- web UI: http://localhost:4040
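A DStream word count over a socket can be sketched as follows. This is a minimal version written for these notes, assuming a text server is already listening on the given host and port (for example `nc -lk 9999`); the function names and the app name are illustrative. The pyspark imports are placed inside the function so the helper can be tried without Spark.

```python
def tokenize(line):
    """Split one line of text on spaces, dropping empty strings."""
    return [word for word in line.split(" ") if word]


def start_network_wordcount(host, port, batch_seconds=1):
    """Print word counts for each micro-batch read from a TCP socket."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # When running locally the master needs at least 2 threads:
    # the socket receiver permanently occupies one of them.
    sc = SparkContext(appName="NetworkWordCountSketch")
    ssc = StreamingContext(sc, batch_seconds)

    lines = ssc.socketTextStream(host, port)    # DStream of text lines
    counts = (
        lines.flatMap(tokenize)
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)   # counts within one batch
    )
    counts.pprint()                             # output operation

    ssc.start()                                 # nothing runs before this
    ssc.awaitTermination()
```

While start_network_wordcount("localhost", 9999) is running, the Streaming tab of the web UI shows the batches as they are processed.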
kafka
./bin/spark-submit --jars jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar examples/src/main/python/streaming/direct_kafka_wordcount.py 192.168.16.22:9092 jiedian_adapter_heartbeat
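The Kafka variant differs from the socket one mainly in how the stream is created. The sketch below assumes Spark 2.3.x with the spark-streaming-kafka-0-8 assembly jar on the classpath (as passed with --jars above); the broker list and topic are parameters, and the function names are illustrative.

```python
def record_value(record):
    """Direct-stream records arrive as (key, value) tuples; keep the value."""
    return record[1]


def start_kafka_wordcount(brokers, topic, batch_seconds=2):
    """Print word counts for each micro-batch read from a Kafka topic."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    # Only importable when the kafka-0-8 integration jar is available.
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaWordCountSketch")
    ssc = StreamingContext(sc, batch_seconds)

    # Direct (receiver-less) stream: Spark reads from the brokers itself.
    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    counts = (
        stream.map(record_value)
              .flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
```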
DataFrame, SparkSQL, Structured Streaming
- spark.sql.shuffle.partitions (default 200; in single-node tests this causes very high latency).
./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
- udf
- Where does the code run?
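The three points above (shuffle partitions, udf, where the code runs) can be combined in one Structured Streaming sketch. It assumes a socket source as in the command above; the value 4 for spark.sql.shuffle.partitions is only an example for a small local test, and normalize_word and the other names are hypothetical helpers written for these notes.

```python
def normalize_word(word):
    """Lower-case a word and strip surrounding punctuation; None stays None."""
    if word is None:
        return None
    return word.strip(".,!?;:\"'").lower()


def start_structured_wordcount(host, port, shuffle_partitions=4):
    """Running word count over a socket source, printed to the console."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split, udf
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder
        .appName("StructuredWordCountSketch")
        # Fewer shuffle partitions than the default 200 for local testing.
        .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
        .getOrCreate()
    )

    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )

    # The query plan is built here, on the driver. Built-in functions such as
    # split/explode run inside the JVM on the executors; the body of the
    # Python UDF runs in Python worker processes next to each executor.
    normalize = udf(normalize_word, StringType())
    words = lines.select(explode(split(col("value"), " ")).alias("word"))
    counts = (
        words.select(normalize(col("word")).alias("word"))
             .groupBy("word")
             .count()
    )

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
```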

Theory
spark
pyspark