Spark-Bench - Benchmarking Apache Spark v1.0


Overview

These instructions describe how to benchmark Apache Spark, installed using the Apache Bigtop packaging tool, with Spark-Bench.


Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.

It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.

The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to:

  • traditional benchmarking of algorithm implementations
  • stress-testing clusters
  • simulating multiple notebook users on one cluster
  • comparing multiple versions of Spark on multiple clusters


Highlights

  • Data Generation.
    A data generator automatically generates input data sets of various sizes. Spark-Bench can generate data according to many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, HDFS, S3, etc.
  • Workloads
    The atomic unit of organization in Spark-Bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk, and write their output, if the user wants it, back to disk. Spark-Bench provides diverse and representative workloads (extensible with new workloads):
    • Machine learning: Logistic regression, support vector machine, matrix factorization
    • Graph processing: pagerank, svdplusplus, triangle count
    • Streaming: twitter, pageview
    • SQL query applications: Hive, RDDRelation
  • Configurations
    Spark-Bench lets you launch multiple spark-submit commands by creating and launching multiple spark-submit scripts. This flexibility allows you to:
    • compare benchmark times of the same workloads under different Spark settings
    • simulate multiple batch applications hitting the same cluster at once
    • compare benchmark times across two different Spark clusters
    A sketch of such a configuration file is shown after this list.
  • Metrics:
    • supported: job execution time, input data size, data process rate
    • under development: shuffle data, RDD size, resource consumption, integration with monitoring tools
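
A minimal sketch of a spark-bench configuration file, based on the minimal SparkPi example used later in this guide. The exact keys available (for example spark-args) are described in the upstream spark-bench documentation and may differ between releases:

spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "local[2]"   // e.g. local[2], spark://<master>:7077, or yarn
    }
    workload-suites = [{
      descr = "One run of SparkPi and that's it!"
      benchmark-output = "console"
      workloads = [{
        name = "sparkpi"
        slices = 10
      }]
    }]
  }]
}

Adding more objects to the spark-submit-config list launches additional spark-submit commands, which is how the same suite can be benchmarked under different Spark settings or against different clusters.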

Workload characterization and study of parameter impacts

  • Diverse and representative data sets: Wikipedia, Google web graph, Amazon movie reviews
  • Characterizing workloads in terms of resource consumption, data access patterns, timing information, job execution time, and shuffle data
  • Studying the impact of Spark configuration parameters

Prerequisites

  • OpenJDK 8 installed

    $ java -version
  • Docker installed

    $ docker version
  • Apache Hadoop and Apache Spark should be installed from Apache Bigtop packages.
    • Follow the instructions here to install the Bigtop Hadoop and Spark components

BigTop Setup

Follow the instructions here.

Create Docker Containers


1. Create a cluster of Bigtop docker containers

$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
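
To find the container names for the login step below, you can list the running containers:

$ docker ps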

2. Log in to each container.

$ docker container exec -it <container_name> bash

3. Verify Hadoop is installed in containers.

$ hadoop
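
You can also check the installed versions (assuming the Bigtop packages put both commands on the PATH):

$ hadoop version
$ spark-submit --version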

Configure Docker Containers

1. Follow these steps inside each container.

  • Create hadoop user

    We need to create a dedicated user (hduser) for running Hadoop. This user needs to be added to the hadoop user group:

     

    $ sudo adduser hduser -G hadoop


    Give hduser a password:

    $ sudo passwd hduser


    Add hduser to sudoers list:

    On Debian:

    $ sudo adduser hduser sudo


    On CentOS:

    $ sudo usermod -G wheel hduser


    Switch to hduser

    $ su - hduser


    Generate ssh key for hduser

    $ ssh-keygen -t rsa -P ""


    Press <enter> to accept the default file name.

    Enable ssh access to the local machine:

    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    $ chmod 600 $HOME/.ssh/authorized_keys
    $ chmod 700 $HOME/.ssh
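
    To confirm passwordless ssh works (assuming the ssh service is running inside the container; you may be asked to confirm the host key on the first connect, but not for a password):

    $ ssh localhost exit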
  • Login as hduser

    $ su - hduser
  • Set Environment variables

    Make sure the environment variables are set for Hadoop. Add them to your bash profile.
    $ vi ~/.bashrc
    export HADOOP_HOME=/usr/lib/hadoop
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
    export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
    export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
    export YARN_HOME=/usr/lib/hadoop-yarn
    export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
    export HADOOP_USER_NAME=hdfs
    export CLASSPATH=$CLASSPATH:.
    export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.7.2.jar:$HADOOP_HOME/client/hadoop-hdfs-2.7.2.jar:$HADOOP_HOME/hadoop-auth-2.7.2.jar:/usr/lib/hadoop-mapreduce/*:/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*:
    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
    export SPARK_HOME=/usr/lib/spark
    export PATH=$HADOOP_HOME/bin:$PATH
    export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop:$CLASSPATH:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-mapreduce/*:.
    export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/:. 
    export SPARK_MASTER_HOST=local[*]
    $ source ~/.bashrc
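
    A quick check that the variables took effect in the current shell:

    $ echo $HADOOP_HOME $SPARK_HOME $HADOOP_CONF_DIR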
  • Hosts file 

    Make sure the hosts file is set up correctly. It should look like the below:

    $ sudo vi /etc/hosts
    172.17.0.4 spark-master <containerID.apache.bigtop.org> <containerID>
    172.17.0.3 spark-slave01 <containerID.apache.bigtop.org> <containerID>
    172.17.0.2 spark-slave02 <containerID.apache.bigtop.org> <containerID>
    127.0.0.1 localhost localhost.domain
    
    ::1 localhost
  • Spark Configurations


    Make sure Spark is configured properly (a sketch of the relevant settings follows this list):
    • /usr/lib/spark/conf/spark-env.sh
      This file should have STANDALONE_SPARK_MASTER_HOST pointing to the Spark master IP address.
      SPARK_MASTER_IP should also be set.
    • cp /usr/lib/spark/conf/slaves.template /usr/lib/spark/conf/slaves. Add the slave IP addresses (or hostnames) to the file in place of 'localhost'.
    • Make sure spark-defaults.conf has the master set correctly.
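
    A minimal sketch of these settings, assuming the master and slave addresses from the hosts file above and the default standalone master port 7077 (adjust to your cluster):

    # /usr/lib/spark/conf/spark-env.sh (excerpt)
    export STANDALONE_SPARK_MASTER_HOST=spark-master
    export SPARK_MASTER_IP=172.17.0.4

    # /usr/lib/spark/conf/slaves
    spark-slave01
    spark-slave02

    # /usr/lib/spark/conf/spark-defaults.conf
    spark.master    spark://spark-master:7077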


  • Verify Spark

    • Ping to make sure you can reach all nodes

      $ ping spark-master
      $ ping spark-slave01
      $ ping spark-slave02
    • Make sure the command below shows the spark-master IP and port as 'ESTABLISHED' and the other Spark ports as LISTEN

      $ netstat -n -a
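
      For a more targeted check, assuming the default standalone master port 7077:

      $ netstat -an | grep 7077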
  • Stop and start Spark

    $ /usr/lib/spark/sbin/stop-all.sh
    
    
    $ /usr/lib/spark/sbin/start-all.sh
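
    After restarting, a quick check that the daemons are up (jps is used again later in the verification step); the master should show a Master process and the slaves a Worker process:

    $ jps | grep -E 'Master|Worker'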

2. Install Spark-Bench

  • Install dependencies

    Log in to each container and install the gnupg package

    $ apt-get install gnupg

    Install the sbt package inside all containers.

    $ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
    $ sudo apt-get update
    $ sudo apt-get install sbt

    Set SBT_OPTS to increase the SBT heap space. Building spark-bench takes more heap space than the SBT default. There are several ways to set these options for SBT; this is just one. Add the below to your bash profile:

    $ export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
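
    To confirm sbt is installed and picks up these options (sbt downloads its own dependencies on the first run):

    $ sbt about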
  • Grab the .tgz source code from here, inside each container
$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
  • Unpack the tar file and cd into the newly created folder, inside each container
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
  • Run sbt compile
$ sbt compile


3. Run Spark-Bench

$ ./bin/spark-bench.sh examples/minimal-example.conf

4. Verification

  • The output should look like the below:
One run of SparkPi and that's it!                                               
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|   name|    timestamp|total_runtime|   pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master|       spark.app.id|         description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328|   1032871662|3.141851141851142|     |                        |    10|  0|     10.200.22.54|            61657|                 :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...|           driver|                 client|    local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+


  • You can also verify by opening another terminal instance while the job is running and executing the below command
$ jps -lm
  • The output should look like:
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}}
12045 sun.tools.jps.Jps -lm
11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf

5. SQL benchmark

  • Run the gen_data.sh script
$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh


Check whether the sample data sets are created in /SparkBench/sql/Input in HDFS.
If not, there is a bug in the spark-bench scripts that needs to be fixed using the following steps:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'.
- In the last else block, replace the two occurrences of the ${src} variable with ${src:8} (bash substring expansion that skips the first 8 characters of ${src}).
- This problem was spotted by a colleague at AMD, who has submitted a patch here: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, run the gen_data.sh script again and check whether the input data is created in HDFS this time. Then proceed to the next step.
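
A quick way to check for the generated input, assuming the path above (the exact location depends on your spark-bench configuration):

$ hdfs dfs -ls /SparkBench/sql/Input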

  • Run the run.sh script
    For SQL applications, by default it runs the RDDRelation workload. 
$ <SPARK_BENCH_HOME>/SQL/bin/run.sh


6. Hive Workload

To run the Hive workload, execute:


$ hive;

7. Streaming Applications

For streaming applications such as TwitterTag and StreamingLogisticRegression, first execute:


$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh # Run this in one terminal


$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh # Run this in another terminal
To run a particular streaming app (default: PageViewStream), pass a subApp parameter to gen_data.sh or run.sh, like this:
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags
     Note: some subApps do not need the gen_data step; for those you will get a "no need" string in the output.

8. Other Workloads

https://hub.docker.com/r/alvarobrandon/spark-bench/


Reference

https://codait.github.io/spark-bench/compilation/

https://github.com/codait/spark-bench/tree/legacy


Errors and Resolutions