Spark Cluster Installation

Lab name: Detailed steps for installing a Spark cluster

Introduction to the Spark Cluster Installation Task

We will set up a Spark cluster on three Linux virtual machines and start its services.

Knowledge Points Used

  • Basic Linux commands
  • Spark master and worker nodes

Cluster Installation

Completing this lab requires the following background knowledge:

  1. The extraction command

tar -zxvf XX.tar.gz -C dist

  2. Using the vi editor

vi <file> opens a file; to learn more, consult the vi editor documentation.

  3. Remote copy

scp -r srcfile user@hostName:distpath

  4. Stopping the firewall

    service iptables stop

  5. Installing the JDK on Linux

  6. Passwordless SSH login on Linux (a sketch of one way to set this up follows this list)

  7. Basic knowledge of Spark clusters
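
Passwordless login (item 6 above) is usually set up with ssh-keygen and ssh-copy-id. The commands below are a minimal sketch, assuming the root account is used on all three machines and that the hostnames linux2 and linux3 already resolve; adapt them to your environment.

# On linux1: generate a key pair (press Enter to accept the defaults)
ssh-keygen -t rsa
# Copy the public key to the two worker machines
ssh-copy-id root@linux2
ssh-copy-id root@linux3
# Verify that login now works without a password prompt
ssh root@linux2 hostname
ssh root@linux3 hostname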

Pre-installation Preparation

  1. Prepare three Linux virtual machines
  2. Configure the IP addresses and hostnames. The table below shows the configuration used in this lab (an example /etc/hosts entry follows this list):

     IP              Host     Software
     192.168.1.111   linux1   spark-master
     192.168.1.112   linux2   spark-worker
     192.168.1.113   linux3   spark-worker

  3. Configure passwordless login: linux1 can log in to linux2 and linux3 without a password
  4. Install JDK 8
  5. Prepare the spark-2.2.3-bin-hadoop2.7.tgz installation package
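
The hostname-to-IP mapping from the table above normally lives in /etc/hosts on every machine. The lines below are a minimal sketch using the addresses from this lab; add them on all three nodes.

# /etc/hosts entries on each of the three machines
192.168.1.111   linux1
192.168.1.112   linux2
192.168.1.113   linux3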

Now we begin the installation.

Spark Cluster Installation Lab

  1. Upload the spark-2.2.3-bin-hadoop2.7.tgz installation file to /root/srcclauster
  2. On the master node, create a directory named apps to serve as the installation directory
[root@linux1 ~]# mkdir  /root/apps
  3. Extract Spark

[root@linux1 ~]# tar -zxvf /root/srcclauster/spark-2.2.3-bin-hadoop2.7.tgz -C /root/apps
  4. Configure Spark

Enter the Spark conf directory:

[root@linux1 ~]#
[root@linux01 ~]# cd apps/spark-2.2.3-bin-hadoop2.7/conf/
[root@linux01 conf]# 
[root@linux01 conf]# ll
total 44
-rw-r--r--. 1  501 games  996 Jan  8  2019 docker.properties.template
-rw-r--r--. 1  501 games 1105 Jan  8  2019 fairscheduler.xml.template
-rw-r--r--. 1  501 games 2025 Jan  8  2019 log4j.properties.template
-rw-r--r--. 1  501 games 7313 Jan  8  2019 metrics.properties.template
-rw-r--r--. 1 root root    17 May 21 19:48 slaves
-rw-r--r--. 1  501 games  865 Jan  8  2019 slaves.template
-rw-r--r--. 1  501 games 1292 Jan  8  2019 spark-defaults.conf.template
-rwxr-xr-x. 1  501 games 3764 Jan  8  2019 spark-env.sh.template
[root@linux01 conf]# 


Rename spark-env.sh.template to spark-env.sh:

[root@linux01 conf]# mv  spark-env.sh.template  spark-env.sh
[root@linux01 conf]# ll
total 44
-rw-r--r--. 1  501 games  996 Jan  8  2019 docker.properties.template
-rw-r--r--. 1  501 games 1105 Jan  8  2019 fairscheduler.xml.template
-rw-r--r--. 1  501 games 2025 Jan  8  2019 log4j.properties.template
-rw-r--r--. 1  501 games 7313 Jan  8  2019 metrics.properties.template
-rw-r--r--. 1 root root    17 May 21 19:48 slaves
-rw-r--r--. 1  501 games  865 Jan  8  2019 slaves.template
-rw-r--r--. 1  501 games 1292 Jan  8  2019 spark-defaults.conf.template
-rwxr-xr-x. 1  501 games 3764 Jan  8  2019 spark-env.sh
[root@linux01 conf]# 

Edit spark-env.sh; the export lines added at the end of the file are shown below:

[root@linux01 conf]# vi  spark-env.sh
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
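
# --- Settings added for this lab (adjust the paths to match your environment) ---
# JAVA_HOME, HADOOP_HOME and HADOOP_CONF_DIR point at the local JDK and Hadoop installations,
# SPARK_MASTER_IP / SPARK_MASTER_PORT identify the standalone master, and
# SPARK_EXECUTOR_MEMORY caps the memory given to each executor.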
export JAVA_HOME=/root/apps/jdk1.8.0_101
export HADOOP_HOME=/root/apps/hadoop-2.7.7
export HADOOP_CONF_DIR=/root/apps/hadoop-2.7.7/etc/hadoop
export SPARK_MASTER_IP=linux01
export SPARK_MASTER_PORT=7077
export SPARK_EXECUTOR_MEMORY=512m
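
The conf directory also contains a slaves file (visible in the listing above), which tells start-all.sh which machines should run Worker processes. The snippet below is a minimal sketch of its contents for this cluster; the hostnames come from the table in the preparation section (the startup log later in this article uses linux02/linux03, so use whatever hostnames your machines actually resolve to).

[root@linux01 conf]# vi slaves
# conf/slaves: one worker hostname per line
linux2
linux3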
  5. Configure the Spark environment variables
[root@linux1 ~]# vi /etc/profile
export SPARK_HOME=/root/apps/spark-2.2.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
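
After saving /etc/profile, reload it so the new variables take effect in the current shell; the version check below is a minimal sketch that simply confirms the Spark binaries are now on the PATH.

[root@linux1 ~]# source /etc/profile
[root@linux1 ~]# spark-submit --version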
  6. Copy the installation to the other nodes with scp
[root@linux1 ~]# scp -r /root/apps/  root@linux2:/root
[root@linux1 ~]# scp -r /root/apps/  root@linux3:/root
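
The /etc/profile changes from step 5 exist only on linux1. Copying /root/apps distributes the Spark files and spark-env.sh, which is enough for start-all.sh to launch the workers; if you also want the spark commands on the PATH of the worker nodes, one minimal approach (an optional extra, not required by the lab) is to append the same two lines on each worker:

# Run once on linux2 and once on linux3
echo 'export SPARK_HOME=/root/apps/spark-2.2.3-bin-hadoop2.7' >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile
source /etc/profile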
  7. Start the Spark cluster (run start-all.sh from the sbin directory under the Spark installation)
[root@linux01 sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /root/apps/spark-2.2.3-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-linux01.out
linux02: starting org.apache.spark.deploy.worker.Worker, logging to /root/apps/spark-2.2.3-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-linux02.out
linux03: starting org.apache.spark.deploy.worker.Worker, logging to /root/apps/spark-2.2.3-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-linux03.out
linux03: failed to launch: nice -n 0 /root/apps/spark-2.2.3-bin-hadoop2.7/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://linux01:7077
linux03: full log in /root/apps/spark-2.2.3-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-linux03.out
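
Note that in the log above the Worker on linux03 failed to launch; check the log file the message points to and re-run start-all.sh before continuing. Once the master is up, you can also confirm it from its web UI. The check below is a minimal sketch, assuming the default standalone-master web UI port of 8080:

# Open http://linux01:8080 in a browser, or fetch the status page from a shell
curl http://linux01:8080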
  8. Check whether the startup succeeded

Run the jps command on the linux1 node:

[root@linux01 sbin]# jps
1649 Jps
1493 Master
[root@linux01 sbin]# 

Run the jps command on the linux2 node; it should show a Worker process:

[root@linux02 sbin]# jps
1649 Jps
1493 Worker
[root@linux02 sbin]# 

Run the jps command on the linux3 node; it should likewise show a Worker process:

[root@linux03 sbin]# jps
1649 Jps
1493 Worker
[root@linux03 sbin]# 
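
With the master and workers running, you can optionally submit a quick job to confirm that the cluster accepts applications. The command below is a minimal sketch: it starts an interactive spark-shell against the master address shown in the startup log (spark://linux01:7077) and assumes you run it on a node where the spark commands are on the PATH.

[root@linux1 ~]# spark-shell --master spark://linux01:7077
# Inside the shell, a trivial job such as
#   sc.parallelize(1 to 100).sum()
# should run on the cluster and return 5050.0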

Lab Summary

Download and upload the Spark installation file, extract it to the chosen directory, edit the spark-env.sh file, and then scp the installation to the other nodes.

The cluster is started with the start-all.sh script in the sbin directory.
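
For completeness, the cluster is stopped the same way, with the companion script in the same directory:

[root@linux01 sbin]# ./stop-all.sh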
