...
HDFS, the Hadoop Distributed File System, is a distributed filesystem used by data analysis frameworks such as Spark and Hadoop.
Loading HDFS
HDFS is installed as part of a Hadoop distribution. Once the hr32 modules have been added to the module search path, the hadoop module can be added to the environment using
...
Loading the hadoop module sets the HADOOP_HOME environment variable that is referenced below.
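For example, using the module path and version that appear in the example job script at the end of this page (other versions may be available under the same path):

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2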
Using HDFS on Gadi
The use case we have envisioned is short-term storage that lasts for the duration of a batch queue job and uses the on-node SSDs available on each node. This can be used as an alternative to the Lustre filesystems by applications, such as Hail, that need multiple terabytes of scratch space.
...
${HADOOP_HOME}/bin/nci-start-hdfs.sh
: This creates the configuration files needed to start the cluster, starts DataNodes on each node in the PBS job and starts a NameNode.
- The script initialises the filesystem and creates a home directory /user/<username>.
- On the node where the NameNode runs, a directory ${PBS_JOBFS}/Name is created.
- On the data nodes, a directory ${PBS_JOBFS}/Data is created.

${HADOOP_HOME}/bin/nci-stop-hdfs.sh
: Once the cluster is no longer needed, this can be used to stop the cluster. The script also cleans up the ${PBS_JOBFS}/{Name,Data} directories on each node.
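For example, a minimal start/stop sequence inside a PBS job might look like the sketch below. The hdfs dfsadmin -report call is an optional check that the DataNodes have registered with the NameNode; it assumes the environment variables described later on this page have been set:

${HADOOP_HOME}/bin/nci-start-hdfs.sh

# Optional: confirm that the DataNodes have joined the cluster
${HADOOP_HOME}/bin/hdfs dfsadmin -report

# ... run work that uses the HDFS cluster ...

${HADOOP_HOME}/bin/nci-stop-hdfs.sh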
The dfs command
One method of interacting with the HDFS cluster is through the dfs command. This gives command line access to a number of file operations, including copying files and directories to and from the cluster. In the following, some basic commands are outlined. For detailed information, see the official File System Shell Guide, which includes the full list of commands.
...
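As an illustration, a few commonly used operations are sketched below; filetotransfer and results are placeholder names for this example:

# Copy a local file into the HDFS home directory /user/<username>
hdfs dfs -put filetotransfer
# List the contents of the home directory
hdfs dfs -ls
# Create a directory in HDFS
hdfs dfs -mkdir results
# Copy a file or directory from HDFS back to the local filesystem
hdfs dfs -get results
# Remove a file from HDFS
hdfs dfs -rm filetotransfer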
When calling ${HADOOP_HOME}/bin/hdfs, the three environment variables HADOOP_CONF_DIR, HADOOP_LOG_DIR and JAVA_HOME need to be defined.
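For example, the values used in the example job script at the end of this page are:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log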
Accessing HDFS within an application
For applications such as Spark that are able to interact with HDFS, files and directories are referenced using a URL that begins with hdfs://<NameNode>:9000/ and is then followed by the filepath.
...
hdfs://gadi-cpu-clx-0491:9000/user/<username>/data.1
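The same URLs are also accepted by the dfs command, so, for example, the file above could be listed with the following (the NameNode hostname is taken from the example URL and will differ for each job):

hdfs dfs -ls hdfs://gadi-cpu-clx-0491:9000/user/<username>/data.1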
An example job script
The following script demonstrates how to use the helper scripts to start and stop a cluster.
#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=192,mem=760G,jobfs=1600G,wd
#PBS -lstorage=gdata/hr32+scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log

nci-start-hdfs.sh

# Copy a file to the cluster.
# This will be placed in the directory /user/<username>
hdfs dfs -put filetotransfer

# List the contents of the filesystem.
# This defaults to the /user/<username> directory.
hdfs dfs -ls

nci-stop-hdfs.sh
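Saved as, say, hdfs_example.sh (a hypothetical filename), the script is submitted to the queue in the usual way:

qsub hdfs_example.sh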
Accessing the web user interface
HDFS serves a web user interface that can be used to monitor the status of the cluster. This can be accessed from your local computer using ssh and port forwarding.
...
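A minimal sketch, assuming the NameNode web interface is listening on the Hadoop 3 default port of 9870 (the port on your cluster may differ depending on the generated configuration) and that the NameNode is running on gadi-cpu-clx-0491 as in the example above; substitute the actual hostname and your username:

ssh -N -L 9870:gadi-cpu-clx-0491:9870 <username>@gadi.nci.org.au

The interface can then be opened in a local browser at http://localhost:9870.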