Hail is a python library that can be used for the distributed analysis of genomic data using Spark. More details are available in the online documentation available at https://hail.is.
Using hail
Hail is available as part of the hr32 software project within the NCI-bio-python module.
On gadi
On gadi there are two options for running hail:
- Run on a single node using a local spark cluster that is started automatically when the hail module is initialised within your python script.
- Submit a script for execution on a previously provisioned Spark cluster.
For the second option, the following shell script demonstrates how to submit a hail calculation to a spark cluster.
#!/bin/bash #PBS -lwalltime=1:00:00,ncpus=140,mem=640G,jobfs=2000G,wd,storage=gdata/hr32+scratch/<abc> #PBS -P <abc> #PBS -q normalbw module use /g/data/hr32/apps/Modules/modulefiles module load NCI-bio-python/2021.07 module load spark/3.1.2 module load hadoop/3.2.2 # Start a Spark cluster on each node in the job nci-start-cluster.sh # Start HDFS to create a common workspace using the /jobfs directories on each node # This is optional, but might be useful if a large amount of temporary data will be generated export JAVA_HOME=/usr/lib/jvm/jre-1.8.0 export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log nci-start-hdfs.sh export HAIL_HOME=/g/data/hr32/apps/NCI-bio-python/envs/2021.07/lib/python3.7/site-packages/hail/ spark-submit \ --jars $HAIL_HOME/backend/hail-all-spark.jar \ --conf spark.driver.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \ --conf spark.executor.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.kryo.registrator=is.hail.kryo.Hail.KryoRegistrator \ --master spark://$(hostname):7077 \ ./hail-script.py # Stop HDFS nci-stop-hdfs.sh # Stop the Spark cluster nci-stop-cluster.sh
In the example, 5 Broadwell nodes are used to create a small Spark cluster. Note that for workers within the cluster, the working directory may not be the same as the node from which the spark job was submitted. For this reason, it's usually best to give the absolute path to input and output files.