HDFS is a distributed filesystem that is used by data analysis frameworks such as Spark and Hadoop.
Loading HDFS
HDFS is installed as part of a hadoop distribution. Once the hr32 modules have been added to the module search path, the hadoop module can be added to the environment using
module load hadoop/3.2.2
Loading the hadoop module sets the HADOOP_HOME environment variable that is referenced below.
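For example, the commands used in the example job script at the end of this page are:
module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2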
Using HDFS on gadi
The use case that we have envisioned is short-term storage that lasts for the duration of a batch queue job and uses the on-node SSDs available on each node. This can be used as an alternative to the Lustre filesystems by applications, such as Hail, that need multiple terabytes of scratch space.
Starting an HDFS cluster involves two steps:
- First, a NameNode process needs to be started on one node in the cluster. This acts as a metadata server for the cluster, telling applications what data is available, where data can be found and where to store new data.
- Second, DataNode processes need to be started on each of the nodes that will store data. These store data in blocks and transfer data to requesting applications.
Before initialisation, HDFS needs several environment variables and files to be configured. The helper scripts that are described below generate the files automatically. However, the following variables need to be defined in your environment before starting a cluster:
JAVA_HOME
This should point to the location of the Java runtime environment that is being used.
- To use the system installed Java on gadi use:
JAVA_HOME=/usr/lib/jvm/jre-1.8.0
HADOOP_CONF_DIR
This directory will hold several one-off configuration files that store information about the cluster, such as the node where the NameNode service is running.
- A suggestion is:
HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
HADOOP_LOG_DIR
Log files written while the cluster is running will be placed in this directory.
- A suggestion is:
HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log
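Putting these suggestions together, the three variables can be set in a job script as follows (these values match the example job script at the end of this page):
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log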
Helper scripts have been created to assist with the process of managing a cluster on gadi:
${HADOOP_HOME}/bin/nci-start-hdfs.sh
: This creates the configuration files needed to start the cluster, starts DataNodes on each node in the PBS job and starts a NameNode.
- The script initialises the filesystem and creates a home directory /user/<username>.
- On the node where the NameNode runs, a directory ${PBS_JOBFS}/Name is created.
- On the data nodes, a directory ${PBS_JOBFS}/Data is created.
${HADOOP_HOME}/bin/nci-stop-hdfs.sh
: Once the cluster is no longer needed, this can be used to stop the cluster. The script also cleans up the ${PBS_JOBFS}/{Name,Data} directories on each node.
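Within a batch job the two scripts typically bracket the work that uses the cluster. A minimal sketch (the commands in between are placeholders):
${HADOOP_HOME}/bin/nci-start-hdfs.sh
# ... commands that read from and write to the HDFS cluster ...
${HADOOP_HOME}/bin/nci-stop-hdfs.sh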
The dfs command
One method of interacting with the HDFS cluster is through the dfs command. This gives command line access to a number of file operations that include copying files and directories to and from the cluster. In the following, some basic commands are outlined. For detailed information see the official File System Shell Guide, which includes the full list of commands.
The HDFS file system has a hierarchical organisation of files and directories that will be familiar from interacting with other computer systems. Many of the dfs commands will also be familiar from other settings such as ftp or the unix command line.
The basic format of commands when using HDFS is
${HADOOP_HOME}/bin/hdfs dfs <args>
Note that this is equivalent to the hadoop fs <args> syntax used in the documentation referenced above.
- To copy files to the cluster use the put command:
${HADOOP_HOME}/bin/hdfs dfs -put localfile /user/abc123/localfilecopy
- To copy files from the cluster use the get command:
${HADOOP_HOME}/bin/hdfs dfs -get /user/abc123/hdfsfile hdfsfilecopy
- The ls command can be used to list the contents of directories:
${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123/
will list the contents of the directory /user/abc123, while
${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123/hdfsfile
will check if the file /user/abc123/hdfsfile exists.
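Other operations from the File System Shell Guide, such as mkdir, cat and rm, follow the same pattern. As an illustration (the paths here are only examples):
# Create a directory, copy a local file into it, print it, then remove it.
${HADOOP_HOME}/bin/hdfs dfs -mkdir /user/abc123/inputs
${HADOOP_HOME}/bin/hdfs dfs -put localfile /user/abc123/inputs/
${HADOOP_HOME}/bin/hdfs dfs -cat /user/abc123/inputs/localfile
${HADOOP_HOME}/bin/hdfs dfs -rm /user/abc123/inputs/localfile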
When calling ${HADOOP_HOME}/bin/hdfs, the three environment variables HADOOP_CONF_DIR, HADOOP_LOG_DIR and JAVA_HOME need to be defined.
Accessing HDFS within an application
For applications such as Spark that are able to interact with HDFS, files and directories are referenced using a URL that begins hdfs://<NameNode>:9000/ and is then followed by the filepath.
As an example, if the NameNode is running on gadi-cpu-clx-0491 and you wish to access a file named data.1 in your home directory, the URL would be:
hdfs://gadi-cpu-clx-0491:9000/user/<username>/data.1
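If a job script needs to construct this URL automatically, one option is to read the fs.defaultFS property from the configuration generated by nci-start-hdfs.sh. This is only a sketch, and assumes a standard core-site.xml is written into ${HADOOP_CONF_DIR}:
# Extract the hdfs://<NameNode>:9000 prefix from the generated core-site.xml (assumed layout).
NAMENODE_URL=$(grep -A1 'fs.defaultFS' ${HADOOP_CONF_DIR}/core-site.xml | grep -o 'hdfs://[^<]*')
echo "${NAMENODE_URL}/user/${USER}/data.1"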
An example job script
The following script demonstrates how to use the helper scripts to start and stop a cluster.
#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=192,mem=760G,jobfs=1600G,wd
#PBS -lstorage=gdata/hr32+scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log

nci-start-hdfs.sh

# Copy a file to the cluster.
# This will be placed in the directory /user/<username>
hdfs dfs -put filetotransfer

# List the contents of the filesystem.
# This defaults to the /user/<username> directory.
hdfs dfs -ls

nci-stop-hdfs.sh
Accessing the web user interface
HDFS serves a web user interface that can be used to monitor the status of the cluster. This can be accessed from your local computer using ssh and port forwarding.
- The interface usually runs on port 9870.
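For example, from your local computer you could forward local port 9870 to the node running the NameNode (substitute the NameNode hostname from your own job and your username):
# Run on your local machine, then open http://localhost:9870 in a browser.
ssh -N -L 9870:gadi-cpu-clx-0491:9870 <username>@gadi.nci.org.au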