Hail is a Python library that can be used for the distributed analysis of genomic data using Spark. More details can be found in the online documentation at https://hail.is.
Hail is available as part of the hr32 software project within the NCI-bio-python module.
On Gadi there are two options for running Hail: locally on a single node using Spark's local mode, or across multiple nodes using a Spark cluster started within a PBS job.
For the second option, the following shell script demonstrates how to submit a Hail calculation to a Spark cluster.
#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=140,mem=640G,jobfs=2000G,wd,storage=gdata/hr32+scratch/<abc>
#PBS -P <abc>
#PBS -q normalbw

module use /g/data/hr32/apps/Modules/modulefiles
module load NCI-bio-python/2021.07
module load spark/3.1.2
module load hadoop/3.2.2

# Start a Spark cluster on each node in the job
nci-start-cluster.sh

# Start HDFS to create a common workspace using the /jobfs directories on each node.
# This is optional, but might be useful if a large amount of temporary data will be generated.
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log
nci-start-hdfs.sh

export HAIL_HOME=/g/data/hr32/apps/NCI-bio-python/envs/2021.07/lib/python3.7/site-packages/hail/

spark-submit \
    --jars $HAIL_HOME/backend/hail-all-spark.jar \
    --conf spark.driver.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \
    --conf spark.executor.extraClassPath=${HAIL_HOME}/backend/hail-all-spark.jar \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
    --master spark://$(hostname):7077 \
    ./hail-script.py

# Stop HDFS
nci-stop-hdfs.sh

# Stop the Spark cluster
nci-stop-cluster.sh
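The hail-script.py passed to spark-submit is an ordinary Hail script. Because spark-submit already supplies the Spark master and the Hail jars, a plain hl.init() attaches to the running cluster. A minimal sketch of such a script is shown below; the small synthetic test computation is illustrative only and not part of the NCI setup.

import hail as hl

# spark-submit configures the Spark master and the Hail jars,
# so hl.init() attaches to the existing cluster rather than starting a local one.
hl.init()

# Generate a small synthetic dataset and run a trivial computation
# to confirm that work is being distributed across the cluster.
mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=1000)
print(mt.count())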
In the example, 5 Broadwell nodes are used to create a small Spark cluster. Note that for workers within the cluster, the working directory may not be the same as on the node from which the Spark job was submitted. For this reason, it is usually best to give absolute paths to input and output files.
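For example, an import/export step in the Hail script could spell out the full paths as in the sketch below; the file names and the <abc> project code are placeholders.

import hail as hl

hl.init()

# Absolute paths ensure every worker resolves the same files,
# regardless of its working directory. <abc> is a placeholder project code.
mt = hl.import_vcf('/g/data/<abc>/input/cohort.vcf.bgz')
mt.write('/scratch/<abc>/output/cohort.mt', overwrite=True)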