Hail is a Python library that can be used for the distributed analysis of genomic data using Spark. More details can be found in the online documentation at https://hail.is.
Hail is available as part of the hr32 software project within the NCI-bio-python module.
On Gadi there are two options for running Hail: locally on a single node using Spark's local mode, or across multiple nodes using a Spark cluster started within a PBS job.
For the second option, the following shell script demonstrates how to submit a Hail calculation to a Spark cluster.
#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=140,mem=640G,jobfs=2000G,wd,storage=gdata/hr32+scratch/<abc>
#PBS -P <abc>
#PBS -q normalbw

module use /g/data/hr32/apps/Modules/modulefiles
module load NCI-bio-python/2021.07
module load spark/3.1.2
module load hadoop/3.2.2

# Start a Spark cluster on each node in the job
nci-start-cluster.sh

# Start HDFS to create a common workspace using the /jobfs directories on each node.
# This is optional, but might be useful if a large amount of temporary data will be generated.
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log
nci-start-hdfs.sh

export HAIL_HOME=/g/data/hr32/apps/NCI-bio-python/envs/2021.07/lib/python3.7/site-packages/hail/

spark-submit \
    --jars $HAIL_HOME/backend/hail-all-spark.jar \
    --conf spark.driver.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \
    --conf spark.executor.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
    --master spark://$(hostname):7077 \
    ./hail-script.py

# Stop HDFS
nci-stop-hdfs.sh

# Stop the Spark cluster
nci-stop-cluster.sh
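The hail-script.py passed to spark-submit is an ordinary Hail script. Because spark-submit already supplies the Spark master and the Hail jars, a plain hl.init() attaches to the running cluster. A minimal sketch of such a script is shown below; the small synthetic test computation is illustrative only and not part of the NCI setup.

import hail as hl

# spark-submit configures the Spark master and the Hail jars,
# so hl.init() attaches to the existing cluster rather than starting a local one.
hl.init()

# Generate a small synthetic dataset and run a trivial computation
# to confirm that work is being distributed across the cluster.
mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=1000)
print(mt.count())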
In the example, 5 Broadwell nodes are used to create a small Spark cluster. Note that for workers within the cluster, the working directory may not be the same as on the node from which the Spark job was submitted. For this reason, it is usually best to give absolute paths to input and output files.
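For example, an import/export step in the Hail script could spell out the full paths as in the sketch below; the file names and the <abc> project code are placeholders.

import hail as hl

hl.init()

# Absolute paths ensure every worker resolves the same files,
# regardless of its working directory. <abc> is a placeholder project code.
mt = hl.import_vcf('/g/data/<abc>/input/cohort.vcf.bgz')
mt.write('/scratch/<abc>/output/cohort.mt', overwrite=True)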