
One component of the hr32 environment is an Apache Spark distribution. This can be used by packages such as hail and GATK to perform distributed data analysis.

Loading Spark

Spark version 3.1.2 is installed. Once the hr32 modules have been added to the module search path, Spark can be added to your execution environment using:

module load spark/3.1.2

Loading the spark module sets the environment variable SPARK_HOME that is referenced below.
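
For example, the following commands make Spark available and confirm that SPARK_HOME has been set:

module use /g/data/hr32/apps/Modules/modulefiles
module load spark/3.1.2

# SPARK_HOME should now point at the Spark 3.1.2 installation
echo ${SPARK_HOME}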

Using Spark on gadi

While a Spark cluster can run within a batch queue job smaller than a single node, the focus of this documentation is on larger, multi-node analysis tasks.

  • The principal use case that we have envisioned is creating a single-user, standalone cluster that lasts for the duration of a batch queue job (see Spark Standalone Mode in the Spark documentation).
  • PBS jobs should request jobfs for short-term storage. This is important because Spark's I/O patterns can place a high load on the global filesystems if they are used for temporary files.

Creating a cluster in standalone mode is a two step process.

  • First, a master process needs to be started on one of the nodes in the job allocation.
  • Second, on each of the nodes within the job, a worker process needs to be started that connects to the master process.

Scripts to carry out each of these tasks are in the directory ${SPARK_HOME}/sbin.
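
As an illustrative sketch of that two-step process (the helper scripts described below do this for you), starting a cluster by hand might look roughly like the following. It assumes the standard standalone scripts start-master.sh and start-worker.sh in ${SPARK_HOME}/sbin and uses ssh to reach the other nodes in the job; the exact launch mechanism and environment setup on each node may differ on gadi.

# Step 1: start the master on the current node (default port 7077)
${SPARK_HOME}/sbin/start-master.sh
MASTER_URL=spark://$(hostname):7077

# Step 2: start one worker on every node in the job and point it at the master
for node in $(sort -u ${PBS_NODEFILE}); do
    ssh ${node} "${SPARK_HOME}/sbin/start-worker.sh ${MASTER_URL}"
done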

Two scripts are provided to assist with the process of managing a cluster on gadi:

  1. ${SPARK_HOME}/bin/nci-start-cluster.sh: This starts a master process on one node and a worker process on every node in the job.
  2. ${SPARK_HOME}/bin/nci-stop-cluster.sh: Stops all of the workers and the master when the cluster is no longer needed.

The nci-start-cluster.sh script uses the following values when initialising the cluster:

Variable              Value
SPARK_LOG_DIR         ${PBS_O_WORKDIR}/spark-log
SPARK_LOCAL_DIRS      ${PBS_JOBFS}/local
SPARK_WORKER_DIR      ${PBS_JOBFS}/work

These settings place the log files from the Spark processes within the working directory of the PBS job and ensure that short-term files created by the Spark cluster are kept on fast SSD storage.
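
As a rough illustration only (not the actual contents of nci-start-cluster.sh), the effect of these defaults is similar to exporting the variables before the standalone daemons are started:

# Illustrative sketch of the defaults used by nci-start-cluster.sh
export SPARK_LOG_DIR=${PBS_O_WORKDIR}/spark-log    # daemon logs in the job's submission directory
export SPARK_LOCAL_DIRS=${PBS_JOBFS}/local         # shuffle and spill files on node-local SSD
export SPARK_WORKER_DIR=${PBS_JOBFS}/work          # per-application scratch space on node-local SSD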

Submitting applications to a Spark cluster

In Spark terminology, applications are submitted to a running Spark cluster. This is done using the spark-submit command, which takes the hostname and port number of the master process via its --master option.

The following script demonstrates how to submit an application and how to use the nci-start-cluster.sh and nci-stop-cluster.sh scripts. It uses the run-example launcher and the SparkPi example task, both of which are included in the Spark distribution.

#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=192,mem=760G,jobfs=1600G,wd
#PBS -lstorage=gdata/hr32+scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module use /g/data/hr32/apps/Modules/modulefiles
module load spark/3.1.2

nci-start-cluster.sh

spark-submit run-example --master spark://$(hostname):7077 SparkPi 10000 &> calcpi.log

nci-stop-cluster.sh
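
The same pattern applies to your own applications. As a hypothetical example (analysis.py is a placeholder for your own PySpark script), the spark-submit line in the job script above could be replaced with:

spark-submit --master spark://$(hostname):7077 analysis.py &> analysis.log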

Accessing the web user interfaces

Spark serves several web user interfaces that can be used to monitor the cluster and the progress of Spark applications. These can be accessed from your local computer using ssh port forwarding (see the example after the list below).

  • Master web UI. This gives details of the size of the cluster and any previous or currently running applications. This is usually on port 8080.
  • Application web UI. Once an application is running on the cluster, a web interface is created that allows the progress of the application to be monitored. This is usually on port 4040.
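
As a sketch, forwarding the master web UI to your local machine could look like the following, where <username> is your NCI account and <node> is the compute node running the master process (both placeholders; the node name can be found from the job's node list or the Spark logs):

# Run on your local computer; forwards local port 8080 to the master web UI on the compute node
ssh -N -L 8080:<node>:8080 <username>@gadi.nci.org.au

# Then browse to http://localhost:8080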

