
HDFS (the Hadoop Distributed File System) is a distributed filesystem used by data analysis frameworks such as Spark and Hadoop.

Loading HDFS

HDFS is installed as part of a Hadoop distribution. Once the hr32 modules have been added to the module search path, the hadoop module can be added to the environment using

Code Block
languagebash
module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

Loading the hadoop module sets the HADOOP_HOME environment variable that is referenced below.

Using HDFS on Gadi

The use case we envision is short-term storage that lasts for the duration of a batch queue job and uses the on-node SSDs available on each node. This can be used as an alternative to the Lustre filesystems by applications, such as Hail, that need multiple terabytes of scratch space.

Two helper scripts are provided to start and stop the cluster:

  • ${HADOOP_HOME}/bin/nci-start-hdfs.sh: This creates the configuration files needed to start the cluster, starts a DataNode on each node in the PBS job and starts the NameNode (a quick health check is sketched after this list).
    • The script initialises the filesystem and creates a home directory /user/<username>.
    • On the node where the NameNode runs, a directory ${PBS_JOBFS}/Name is created.
    • On the data nodes, a directory ${PBS_JOBFS}/Data is created.
  • ${HADOOP_HOME}/bin/nci-stop-hdfs.sh: This stops the cluster once it is no longer needed. The script also cleans up the ${PBS_JOBFS}/{Name,Data} directories on each node.
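
Once the cluster is running, the standard dfsadmin report can be used as a quick health check that all DataNodes have registered (a sketch; it assumes the three environment variables described under the dfs command below have been set):

Code Block
languagebash
# Summarise cluster capacity and list the live DataNodes
hdfs dfsadmin -report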

The dfs command

One method of interacting with the HDFS cluster is through the dfs command. This gives command-line access to a number of file operations, including copying files and directories to and from the cluster. Some basic commands are outlined below; for detailed information, see the official File System Shell Guide, which includes the full list of commands.

...

When calling ${HADOOP_HOME}/bin/hdfs, the three environment variables HADOOP_CONF_DIR, HADOOP_LOG_DIR and JAVA_HOME need to be defined.
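
As a short sketch of everyday use (the file and directory names are illustrative):

Code Block
languagebash
# Copy a local file into /user/<username> on the cluster
hdfs dfs -put data.1

# Create a directory and list the contents of /user/<username>
hdfs dfs -mkdir results
hdfs dfs -ls

# Copy a file back to the local filesystem under a new name
hdfs dfs -get data.1 data.1.local

# Remove a file from the cluster
hdfs dfs -rm data.1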

Accessing HDFS within an application

For applications such as Spark that are able to interact with HDFS, files and directories are referenced using a URL that begins with hdfs://<NameNode>:9000/, followed by the file path, where <NameNode> is the hostname of the node running the NameNode.

For example, with the NameNode running on the node gadi-cpu-clx-0491, a file data.1 in the home directory of <username> is referenced as

  • hdfs://gadi-cpu-clx-0491:9000/user/<username>/data.1
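
The dfs command accepts the same URLs, which gives a convenient way to check that a path is visible before pointing an application at it (a sketch; the node name is illustrative):

Code Block
languagebash
hdfs dfs -ls hdfs://gadi-cpu-clx-0491:9000/user/<username>/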

An example job script

The following script demonstrates how to use the helper scripts to start and stop a cluster.

Code Block
languagebash
#!/bin/bash
#PBS -lwalltime=1:00:00,ncpus=192,mem=760G,jobfs=1600G,wd
#PBS -lstorage=gdata/hr32+scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module use /g/data/hr32/apps/Modules/modulefiles
module load hadoop/3.2.2

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export HADOOP_CONF_DIR=${PBS_O_WORKDIR}/hadoop.conf
export HADOOP_LOG_DIR=${PBS_O_WORKDIR}/hadoop.log

nci-start-hdfs.sh

# Copy a file to the cluster.
# This will be placed in the directory /user/<username>
hdfs dfs -put filetotransfer

# List the contents of the filesystem.
# This defaults to the /user/<username> directory.
hdfs dfs -ls

nci-stop-hdfs.sh
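
Saved to a file, the script is submitted like any other batch job (the filename is illustrative):

Code Block
languagebash
qsub hdfs_example.sh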

Accessing the web user interface

HDFS serves a web user interface that can be used to monitor the status of the cluster. This can be accessed from your local computer using SSH port forwarding.
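
As a sketch, assuming the default NameNode HTTP port for Hadoop 3 (9870), a NameNode running on gadi-cpu-clx-0491 and the usual Gadi login address, a tunnel can be opened from your local machine and the interface then viewed at http://localhost:9870:

Code Block
languagebash
# Forward local port 9870 to the NameNode's web interface
ssh -N -L 9870:gadi-cpu-clx-0491:9870 <username>@gadi.nci.org.au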

...