
Introduction

Lustre is a high-performance, shared filesystem for Linux clusters. It is highly scalable and can support many thousands of client nodes, petabytes of storage, and hundreds of gigabytes per second of I/O throughput. On Raijin, the Lustre filesystems are /home, /short and /g/data[1-3].

Each Lustre filesystem is actually a set of many small filesystems, which are referred to as object storage targets (OSTs). The Lustre software presents the OSTs as a single unified filesystem.

For more information about OSTs and other Lustre filesystem components, see Main Lustre Components.

File Striping

The default policy on all Raijin Lustre filesystems (/home, /short, /g/dataN) is:

  • to create a new file on 1 OST, where the OST is selected in round-robin fashion, AND
  • to read/write file data in 1 MB blocks, also referred to as the transfer size

 

Files on the Lustre filesystems can be striped, which means they can be transparently divided into chunks that are written or read simultaneously across a set of OSTs within the filesystem. The chunks are distributed among the OSTs using a method that ensures load balancing.

Striping allows:

  • One or more clients to read/write different parts of the same file at the same time, providing higher I/O bandwidth to the file because the bandwidth is aggregated over the multiple OSTs.
  • File sizes larger than the size of a single OST.
  • Keeping all OSTs evenly balanced, which is crucial for optimal Lustre performance. A large non-striped file can fill up an OST quickly, effectively making it unavailable for the creation of new files. This reduces overall filesystem performance (capacity is not affected).
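To make the round-robin layout concrete, the following sketch (with example values, not Raijin-specific settings) computes which stripe within the OST set a given byte offset of a file lands on:

```shell
# With a 1 MB stripe size and a stripe count of 4, byte offset N of a
# striped file maps to stripe index (N / stripe_size) mod stripe_count
# within the set of OSTs holding the file. Values here are examples.
stripe_size=$((1024 * 1024))          # 1 MB, the Raijin default
stripe_count=4
offset=$((5 * 1024 * 1024 + 123))     # a byte in the file's 6th megabyte
echo $(( (offset / stripe_size) % stripe_count ))   # prints 1
```

Consecutive 1 MB blocks of the file cycle through the OSTs in order, which is why multiple clients reading different regions of a striped file hit different OSTs.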
 

When to Stripe a File

The following examples describe circumstances where it is beneficial to stripe a file:

  • Your program reads a single large input file, where many nodes read or write different parts of the file at the same time. You should stripe this file to prevent all the nodes from reading from the same OST at the same time. This will avoid creating a bottleneck in which your processes try to read from a single set of disks.
  • Your program waits while a large output file is written. You should stripe the large file so that it will write faster, in order to reduce the amount of time the processors are idle.
  • Files larger than 100 GB must be striped in order to avoid taking up too much space on any single OST, which might adversely affect the filesystem.

It is not always necessary to stripe files. For example, if your program periodically writes several small files from each processor, you don’t need to stripe the files because they will be randomly distributed across the OSTs.

Note: Periodic scans are performed on the filesystems. If files occupying more than 100 GB are found, NCI storage administrators may contact the file owners to discuss strategies, such as striping or copying to another OST, in an effort to keep the filesystems balanced across their OSTs.

 

Setting Stripe Parameters

There are default stripe configurations for each Lustre filesystem. To check the existing stripe settings on your directory, use the lfs getstripe command as follows:

lfs getstripe -d /short/project/username

 

For optimal I/O performance, you can change the stripe parameters for your own directories and files. To change stripe parameters, use the lfs setstripe command as follows:

lfs setstripe -s stripe_size -c stripe_count -o stripe_offset dir|filename
 

Parameters

You can use the following stripe parameters:

Stripe size:    -s

Sets the size of the chunks in bytes. Use with k, m, or g to specify units of KB, MB, or GB, respectively (for example, -s 2m). The specified size must be a multiple of 65,536 bytes (64 KB). The default size is 1 MB for all Raijin Lustre filesystems; specify 0 to use the default.

Stripe count:  -c

Sets the number of OSTs to stripe across. The default is 1 for most Raijin Lustre filesystems; specify 0 to use the default. Specify -1 to use all OSTs in the filesystem (for example, -c -1).

Stripe offset:  -o

Sets the index of the OST where the first stripe is to be placed. The default is -1, which results in random selection. Using a non-default value is not recommended.
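Since the stripe size must be a multiple of 65,536 bytes, it can be worth checking a candidate value before passing it to lfs setstripe. A quick arithmetic check (the 2 MB size is just an example):

```shell
# A valid stripe size must be a multiple of 65536 bytes (64 KB).
size=$((2 * 1024 * 1024))   # 2 MB, as would be set with -s 2m
if [ $((size % 65536)) -eq 0 ]; then
    echo "valid stripe size"   # prints: valid stripe size
fi
```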


Note: The stripe settings of an existing file cannot be changed. If you want to change the settings of a file, create a new file with the desired settings and copy the existing file to the newly created file.
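A minimal sketch of that re-striping workflow, assuming a file named results.dat (the file name and stripe settings are hypothetical; this must be run on a Lustre mount):

```shell
# Lustre cannot change the layout of an existing file in place, so
# create a new, empty file with the desired striping, copy the data
# into it, then swap it in for the original.
lfs setstripe -s 1m -c 8 results.dat.striped   # zero-length file, 8-way striped
cp results.dat results.dat.striped             # data is written with the new layout
mv results.dat.striped results.dat             # replace the original
```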

See the lfs man page for more options and information.

Examples of Striping

Newly created files and directories inherit the stripe settings of their parent directories. You can take advantage of this feature by organizing your large and small files into separate directories, then setting a stripe count on the large-file directory so that all new files created in the directory will be automatically striped. For example, to create a directory called “dir1” with a stripe size of 1 MB and a stripe count of 8, run:

mkdir dir1
lfs setstripe -s 1m -c 8 dir1

 

You can “pre-create” a file as a zero-length striped file by running lfs setstripe as part of your job script or as part of the I/O routine in your program. You can then write to that file later. For example, to pre-create the file “bigdir.tar” with a stripe count of 20, and then add data from the large directory “bigdir,” run:

lfs setstripe -c 20 bigdir.tar
tar cf bigdir.tar bigdir 
 

Useful Lustre Commands

The examples in this section use /short as a representative Lustre filesystem.

Listing Striping Information

To list the striping information for a specific file or directory:

lfs getstripe filename 
lfs getstripe -d directory_name
 

Note: Omitting the -d flag will display striping for all files in the directory.

Listing Disk Usage and Quotas

NCI supplies the lquota command for users to check the quota usage and limits of all their projects.

lquota

To display disk usage and limits on your /short project directory: 

lfs quota -h -g project_id /short

To display usage on each OST, add the -v option:

 lfs quota -h -v -u username /short/project/username
 

Note: Viewing usage on each OST can help you manage your quota. When you write to your /short directory, a chunk of your quota (~100 MB) is allocated to each OST that is being written to. Therefore, it is possible to exceed your soft quota limit before your actual disk usage reaches the limit. See the Lustre Best Practice Stay at Least 10% Below Soft Quota Limit for more information.
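The per-OST pre-allocation adds up quickly for striped files. A rough illustration (the numbers are examples, not exact Lustre accounting):

```shell
# If ~100 MB of quota is reserved on each OST being written to, a file
# striped over 8 OSTs can show up to ~800 MB of apparent usage before
# that much data has actually been written.
chunk_mb=100
osts_written=8
echo $((chunk_mb * osts_written))   # prints 800 (MB of quota potentially reserved)
```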

Listing Space Usage

To list space usage per OST and MDT, in human-readable format, for all Lustre filesystems or for a specific one:

lfs df -h
lfs df -h /short

Listing Inode Usage

To list inode usage for all filesystems or for a specific one:

lfs df -i
lfs df -i /short

Listing OSTs

To list all the OSTs for the filesystem:

 lfs osts /short


Main Lustre Components

Lustre filesystem components include:

Metadata Server (MDS)

Service nodes that manage all metadata operations, such as assigning and tracking the names and storage locations of directories and files on the OSTs. There is 1 active MDS per filesystem.

Metadata Target (MDT)

A storage device where the metadata (name, ownership, permissions and file type) are stored. There is 1 MDT per filesystem.

Object Storage Server (OSS)

Service nodes that run the Lustre software stack, provide the actual I/O service and network request handling for the OSTs, and coordinate file locking with the MDS. Each OSS can serve multiple OSTs. The aggregate bandwidth of a Lustre filesystem can approach the sum of bandwidths provided by the OSSes. There can be 1 or multiple OSSes per filesystem.

Object Storage Target (OST)

Storage devices where users’ file data is stored. The size of each OST varies from approximately 7 TB to approximately 22 TB, depending on the Lustre filesystem. The capacity of a Lustre filesystem is the sum of the sizes of all OSTs. There are multiple OSTs per filesystem.

Lustre Clients

Compute nodes that mount the Lustre filesystem, and access/use data in the filesystem. There are commonly thousands of Lustre clients per filesystem.

Acknowledgement

We acknowledge that the original Lustre documentation was taken from NASA and modified for NCI filesystems.
