
HDFS (the Hadoop Distributed File System) is a distributed filesystem used by data analysis frameworks such as Spark and Hadoop.

Loading HDFS

HDFS is installed as part of a Hadoop distribution. The hr32 modules need to be added to the module search path, after which the hadoop module can be loaded into the environment:

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

Loading the hadoop module sets the HADOOP_HOME environment variable that is referenced below.

Using HDFS on Gadi

The use case we have envisioned is short-term storage that lasts for the duration of a batch queue job and uses the on-node SSDs available on each node. This can be used as an alternative to the Lustre filesystems by applications, such as Hail, that need multiple terabytes of scratch space.

Starting a HDFS cluster involves two steps:

  1. First, a NameNode process needs to be started on one node in the cluster. This acts as a metadata server for the cluster, telling applications what data is available, where data can be found and where to store new data.
  2. Second, DataNode processes need to be started on each of the nodes that will store data. These store data in blocks and transfer data to requesting applications.

Before initialisation, HDFS needs several environment variables and files to be configured. The helper scripts that are described below generate the files automatically. However, the following variables need to be defined in your environment before starting a cluster:

  • JAVA_HOME This should point to the location of the java runtime environment that is being used.
    • To use the system-installed Java on Gadi, use JAVA_HOME=/usr/lib/jvm/jre-1.8.0.
  • HADOOP_CONF_DIR This directory will hold several one-off configuration files that store information about the cluster, such as the node where the NameNode service is running.
    • A suggestion is HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf.
  • HADOOP_LOG_DIR Log files written while the cluster is running will be placed in this directory.
    • A suggestion is HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log.
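
For example, the suggested values above can be exported before starting the cluster; these are the same settings used in the example job script later on this page:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log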

Helper scripts have been created to assist with managing a cluster on Gadi:

  • ${HADOOP_HOME}/bin/nci-start-hdfs.sh: This creates the configuration files needed to start the cluster, starts DataNodes on each node in the PBS job and starts a NameNode.
    • The script initialises the filesystem and creates a home directory /user/<username>.
    • On the node where the NameNode runs, a directory ${PBS_JOBFS}/Name is created.
    • On the data nodes, a directory ${PBS_JOBFS}/Data is created.
  • ${HADOOP_HOME}/bin/nci-stop-hdfs.sh: Once the cluster is no longer needed, this can be used to stop the cluster. The script also cleans up the ${PBS_JOBFS}/{Name,Data} directories on each node.

The dfs command

One method of interacting with the HDFS cluster is through the dfs command. This gives command-line access to a number of file operations, including copying files and directories to and from the cluster. In the following, some basic commands are outlined. For detailed information, see the official File System Shell Guide, which includes the full list of commands.

The HDFS file system has a hierarchical organisation of files and directories that will be familiar from interacting with other computer systems. Many of the dfs commands will also be familiar from other settings such as ftp or the unix command line.

The basic format of commands when using HDFS is

${HADOOP_HOME}/bin/hdfs dfs <args>

Note that this is equivalent to the hadoop fs <args> syntax used in the documentation referenced above.
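
For instance, assuming the cluster is running and /user/abc123 is your home directory on it (the placeholder used below), the following two invocations should list the same directory:

${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123
${HADOOP_HOME}/bin/hadoop fs -ls /user/abc123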

  • To copy files to the cluster use the put command
    • ${HADOOP_HOME}/bin/hdfs dfs -put localfile /user/abc123/localfilecopy
  • To copy files from the cluster use the get command
    • ${HADOOP_HOME}/bin/hdfs dfs -get /user/abc123/hdfsfile hdfsfilecopy
  • The ls command can be used to list the contents of directories
    • ${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123/ will list the contents of the directory /user/abc123
    • ${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123/hdfsfile will check if the file /user/abc123/hdfsfile exists

When calling ${HADOOP_HOME}/bin/hdfs, the three environment variables HADOOP_CONF_DIR, HADOOP_LOG_DIR and JAVA_HOME need to be defined.
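
With those variables set, an illustrative session that stages a file in, checks it, copies it back out and cleans up might look like the following (the /user/abc123 directory and the file names are placeholders):

# Create a working directory and copy a local file into it
${HADOOP_HOME}/bin/hdfs dfs -mkdir /user/abc123/run1
${HADOOP_HOME}/bin/hdfs dfs -put input.dat /user/abc123/run1/input.dat

# Check that the file arrived, then copy it back out under a new local name
${HADOOP_HOME}/bin/hdfs dfs -ls /user/abc123/run1
${HADOOP_HOME}/bin/hdfs dfs -get /user/abc123/run1/input.dat input.copy.dat

# Remove the working directory once it is no longer needed
${HADOOP_HOME}/bin/hdfs dfs -rm -r /user/abc123/run1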

Accessing HDFS within an application

For applications such as Spark that are able to interact with HDFS, files and directories are referenced using a URL that begins with hdfs://<NameNode>:9000/ and is then followed by the file path.

As an example, if the NameNode is running on gadi-cpu-clx-0491 and you wish to access a file named data.1 in your home directory, the URL would be:

  • hdfs://gadi-cpu-clx-0491:9000/user/<username>/data.1
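
If you prefer not to hard-code the node name, the NameNode address is recorded in the configuration files under HADOOP_CONF_DIR as the fs.defaultFS setting. Assuming those files follow the standard Hadoop layout, the address can be printed with:

# Prints something like hdfs://gadi-cpu-clx-0491:9000
${HADOOP_HOME}/bin/hdfs getconf -confKey fs.defaultFS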

An example job script

The following script demonstrates how to use the helper scripts to start and stop a cluster.

#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=192,mem=760G,jobfs=1600G,wd
#PBS -lstorage=gdata/hr32+scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log

nci-start-hdfs.sh

# Copy a file to the cluster.
# This will be placed in the directory /user/<username>
hdfs dfs -put filetotransfer

# List the contents of the filesystem.
# This defaults to the /user/<username> directory.
hdfs dfs -ls

nci-stop-hdfs.sh
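
Assuming the script is saved as run_hdfs.sh (a placeholder name) and <abc> is replaced with your project code, it can be submitted in the usual way:

qsub run_hdfs.sh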

Accessing the web user interface

HDFS serves a web user interface that can be used to monitor the status of the cluster. This can be accessed from your local computer using ssh port forwarding.

  • The interface usually runs on port 9870 of the node where the NameNode is running.
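
As a sketch, assuming the NameNode is running on gadi-cpu-clx-0491 and you log in through gadi.nci.org.au, a tunnel from your local machine could be opened with the command below; the interface is then available at http://localhost:9870 in a local browser.

# Forward local port 9870 to the NameNode's web interface via the Gadi login node
ssh -N -L 9870:gadi-cpu-clx-0491:9870 <username>@gadi.nci.org.au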