Below you will find pages that utilize the taxonomy term “spark”
Posts
Spark Study Notes
Goals; concepts; local development.

Local environment (Mac OS X): download Spark, then set:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export SPARK_HOME="$HOME/opt/spark-2.3.2-bin-hadoop2.6"
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$PATH

Python environment:
virtualenv -p python3 spark-python3
source spark-python3/bin/activate
pip install pyspark
deactivate  (run only when you want to leave the current venv)

Hello world:
./bin/spark-submit examples/src/main/python/wordcount.py README.md
spark-shell

IDE (PyCharm): Preferences - Project Interpreter.

Hello-world walkthrough: RDD, partitions, transformations, actions, lazy evaluation, shuffle. Parallelism: partitions vs. concurrency; worker, executor, task, job. (A minimal RDD word-count sketch follows this summary.)

Spark Streaming, DStream. Socket example (a DStream sketch also follows below):
./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
Web UI: http://localhost:4040

Kafka: find the spark-streaming-kafka package matching your Spark version at https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.3.2 and place the jar in the jars directory, or specify it on the command line. Set spark.io.compression.codec to snappy in conf/spark-defaults.conf (lz4 dependency conflict; 2.2.2 does not have this problem). Two integration modes: receiver and direct (a direct-stream sketch follows below).
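To make the hello-world walkthrough concrete, here is a minimal sketch in the spirit of examples/src/main/python/wordcount.py. The app name and input file are illustrative; the chain of transformations shows where lazy evaluation, the shuffle, and the triggering action sit.

```python
# Minimal RDD word count (sketch; app name and input path are placeholders).
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("README.md")                  # transformation: nothing runs yet (lazy evaluation)
counts = (lines.flatMap(lambda l: l.split())      # transformation
               .map(lambda w: (w, 1))             # transformation
               .reduceByKey(add))                 # transformation; causes a shuffle when executed

for word, count in counts.collect():              # action: the job actually runs here
    print(word, count)

spark.stop()
```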
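The socket example above is the standard network_wordcount pattern; a compressed sketch of what that script does, with the host, port, and batch interval hardcoded for illustration (start a text source first, e.g. `nc -lk 9999`):

```python
# DStream word count over a socket (sketch; host/port/batch interval are illustrative).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="network-wordcount-sketch")
ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts to the driver log

ssc.start()
ssc.awaitTermination()
```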
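Of the two Kafka integration modes, the direct (receiver-less) approach is the one exposed to Python through the older spark-streaming-kafka-0-8 API that still ships alongside 2.3.x; it needs the connector jar mentioned above on the classpath. A rough sketch, with the broker address and topic name as placeholders:

```python
# Direct Kafka stream (sketch; requires the spark-streaming-kafka-0-8 jar;
# broker address and topic name are placeholders).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-direct-sketch")
ssc = StreamingContext(sc, 5)                     # 5-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],
    kafkaParams={"metadata.broker.list": "localhost:9092"},
)
# Each record is a (key, value) pair; here we just count messages per batch.
stream.count().pprint()

ssc.start()
ssc.awaitTermination()
```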
Spark Research Notes
Theory: Streaming 101; Streaming 102.

Spark reading list:
- 子雨大数据之Spark入门教程 (a beginner's Spark tutorial, Python edition)
- The differences between RDD, DataFrame, and DataSet
- Can different Spark Streaming batches be computed in parallel?
- Approaches to managing Kafka offsets in Spark Streaming
- Spark Streaming fault tolerance and zero data loss
- Structured Streaming Programming Guide
- It is time to drop Spark Streaming and move to Structured Streaming (a Structured Streaming sketch follows below)

Options (a sketch of setting these programmatically also follows below):
- spark.io.compression.codec snappy (lz4 dependency conflict)
- spark.streaming.concurrentJobs (Spark Streaming 实时大数据分析, section 4.4.4)
- spark.streaming.receiver.writeAheadLog.enable (Spark Streaming 实时大数据分析, section 5.6)
- spark.sql.shuffle.partitions (default 200; when testing on a single node this causes very large latency)

pyspark:
- Improving PySpark performance: Spark Performance Beyond the JVM
- Python best-practices guide

Local environment setup:
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export SPARK_HOME="$HOME/opt/spark-2.3.2-bin-hadoop2.6"
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$PATH

Python environment:
virtualenv -p python3 spark-python3
source spark-python3/bin/activate
pip install pyspark
deactivate  (run only when you want to leave the current venv)
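The options listed above can go in conf/spark-defaults.conf or be set when the session is built; a minimal sketch with illustrative values (the app name and the concrete numbers are assumptions, not recommendations):

```python
# Setting the options above at session-build time (sketch; values are illustrative).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("config-sketch")
         .config("spark.io.compression.codec", "snappy")                    # avoid the lz4 dependency conflict
         .config("spark.streaming.concurrentJobs", "2")                     # allow batch jobs to run concurrently
         .config("spark.streaming.receiver.writeAheadLog.enable", "true")   # write-ahead log for receiver fault tolerance
         .config("spark.sql.shuffle.partitions", "8")                       # default 200 is far too high for a single node
         .getOrCreate())
```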
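Since the reading list ends with the recommendation to move to Structured Streaming, here is the socket word count expressed in that API, following the Structured Streaming Programming Guide; host, port, and output mode are illustrative (start a text source first, e.g. `nc -lk 9999`):

```python
# Structured Streaming word count over a socket (sketch; host/port are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-wordcount-sketch").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")    # keep the full running counts table
         .format("console")
         .start())
query.awaitTermination()
```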