A Lustre filesystem is a high-performance, shared filesystem for Linux clusters, managed with the Lustre software. It is highly scalable and can support many thousands of client nodes, petabytes of storage and hundreds of gigabytes per second of I/O throughput. On Raijin, the Lustre filesystems are /home, /short and /g/data[1-3].
Each Lustre filesystem is actually a set of many small filesystems, which are referred to as object storage targets (OSTs). The Lustre software presents the OSTs as a single unified filesystem.
For more information about OSTs and other Lustre filesystem components, see Main Lustre Components.
The default policy on all Raijin Lustre filesystems (/home, /short, /g/dataN) is:
- to create each new file on 1 OST, with the OST selected in round-robin fashion, and
- to read and write file data in 1 MB blocks, a unit also referred to as the transfer size.
Files on the Lustre filesystems can be striped, which means they can be transparently divided into chunks that are written or read simultaneously across a set of OSTs within the filesystem. The chunks are distributed among the OSTs using a method that ensures load balancing. Striping allows:
- One or more clients to read/write different parts of the same file at the same time, providing higher I/O bandwidth to the file because the bandwidth is aggregated over the multiple OSTs.
- File sizes larger than the size of a single OST.
- All OSTs to be kept evenly balanced, which is crucial for optimal Lustre performance. A large non-striped file can fill an OST quickly and make it effectively unavailable for the creation of new files. This reduces overall filesystem performance, although total capacity is not affected.
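As a rough illustration of how a striped file is spread across OSTs, the following sketch computes which stripe holds a given byte offset. It assumes the common round-robin (RAID-0 style) placement of chunks; the function name stripe_index and the sample values are hypothetical and chosen only for illustration.

```shell
# stripe_index OFFSET STRIPE_SIZE STRIPE_COUNT
# Prints the stripe number (0 .. STRIPE_COUNT-1) that holds byte OFFSET,
# assuming chunks are placed round-robin across the file's stripes.
stripe_index() {
    echo $(( ($1 / $2) % $3 ))
}

# Illustration: 1 MB chunks striped over 4 OSTs.
mb=$((1024 * 1024))
for offset in 0 "$mb" $((2 * mb)) $((4 * mb)) $((5 * mb + 10)); do
    echo "offset $offset -> stripe $(stripe_index "$offset" "$mb" 4)"
done
```

With these values the first megabyte lands on stripe 0, the second on stripe 1, and the fifth wraps back to stripe 0, which is why clients working on different parts of the file tend to reach different OSTs.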
When to Stripe a File
The following examples describe circumstances where it is beneficial to stripe a file:
- Your program reads a single large input file, where many nodes read or write different parts of the file at the same time. You should stripe this file to prevent all the nodes from reading from the same OST at the same time. This will avoid creating a bottleneck in which your processes try to read from a single set of disks.
- Your program waits while a large output file is written. You should stripe the large file so that it will write faster, in order to reduce the amount of time the processors are idle.
- Your files are larger than 100 GB. Such files must be striped to avoid taking up too much space on any single OST, which might adversely affect the filesystem.
It is not always necessary to stripe files. For example, if your program periodically writes several small files from each processor, you don’t need to stripe the files because they will be randomly distributed across the OSTs.
Note: Periodic scans are performed on the filesystems. If files occupying more than 100 GB are found, NCI storage administrators may contact the file owners to discuss strategies such as striping the files or copying them to another OST, in an effort to keep the filesystems balanced across their OSTs.
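To see in advance which of your own files might be affected by the 100 GB guideline, something like the following sketch could be used. It assumes GNU find, whose G suffix means GiB, so the default threshold only approximates 100 GB. The function name find_large_files and the path in the usage comment are hypothetical.

```shell
# find_large_files DIR [THRESHOLD]
# Lists regular files under DIR that are larger than THRESHOLD.
# THRESHOLD uses GNU find size syntax (k, M, G) and defaults to 100G.
find_large_files() {
    local dir threshold
    dir=$1
    threshold=${2:-100G}
    find "$dir" -type f -size +"$threshold" -print
}

# Usage (placeholder path, substitute your own directory):
#   find_large_files /short/PROJECT/USERNAME
```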
Setting Stripe Parameters
There are default stripe configurations for each Lustre filesystem. To check the existing stripe settings on your directory, run the lfs getstripe command with the -d flag on that directory.
For optimal I/O performance, you can change the stripe parameters for your own directories and files. To change stripe parameters, use the lfs setstripe command.
You can use the following stripe parameters:
Stripe size: -s
Sets the size of the chunks in bytes. Use with k, m, or g to specify units of KB, MB, or GB, respectively (for example, -s 2m). The specified size must be a multiple of 65,536 bytes (64 KB). The default size is 1 MB for all Raijin Lustre filesystems; specify 0 to use the default.
Stripe count: -c
Sets the number of OSTs to stripe across. The default is 1 for most Raijin Lustre filesystems; specify 0 to use the default. Specify -1 to use all OSTs in the filesystem (for example, -c -1).
Stripe offset: -o
Sets the index of the OST where the first stripe is to be placed. The default is -1, which results in random selection. Using a non-default value is not recommended.
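The parameters above can be combined in one lfs setstripe call. The sketch below checks the 64 KB rule before running the command. It uses the -s and -c options exactly as named in this document (some later Lustre releases spell the size option -S, so check the lfs man page on your system) and leaves the stripe offset at its default. The helper names to_bytes, valid_stripe_size and set_stripe are hypothetical, and non-numeric input is not validated.

```shell
# to_bytes SIZE
# Converts a size such as 64k, 2m, 1g or 65536 to bytes (binary units).
to_bytes() {
    local number
    case $1 in
        *[kK]) number=${1%?}; echo $(( number * 1024 )) ;;
        *[mM]) number=${1%?}; echo $(( number * 1024 * 1024 )) ;;
        *[gG]) number=${1%?}; echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# valid_stripe_size SIZE
# Succeeds if SIZE is 0 (filesystem default) or a multiple of 65,536 bytes.
valid_stripe_size() {
    local bytes
    bytes=$(to_bytes "$1")
    [ $(( bytes % 65536 )) -eq 0 ]
}

# set_stripe SIZE COUNT TARGET
# Validates SIZE, then applies the layout to TARGET (a directory or a new file).
set_stripe() {
    if ! valid_stripe_size "$1"; then
        echo "stripe size $1 is not a multiple of 64 KB" >&2
        return 1
    fi
    lfs setstripe -s "$1" -c "$2" "$3"
}

# Usage:
#   set_stripe 2m 4 mydir     # 2 MB chunks across 4 OSTs
#   set_stripe 0 -1 mydir     # default chunk size, all OSTs
```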
Note: The stripe settings of an existing file cannot be changed. If you want to change the settings of a file, create a new file with the desired settings and copy the existing file to the newly created file.
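The copy-based approach described in this note could be scripted roughly as follows. This is a sketch under stated assumptions: the function name restripe_file and the temporary file name are hypothetical, there must be room for a second copy of the file, and nothing else may be writing to the file while it is copied.

```shell
# restripe_file FILE COUNT
# Gives FILE a new stripe count by copying it into a freshly created
# file that already has the desired layout, then replacing the original.
restripe_file() {
    local src tmp
    src=$1
    tmp="$src.restripe.$$"                      # hypothetical temporary name
    lfs setstripe -c "$2" "$tmp" || return 1    # empty file with the new layout
    cp -p "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$src"                            # swap the restriped copy into place
}

# Usage:
#   restripe_file results.dat 8
```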
See the lfs man page for more options and information.
Examples of Striping
Newly created files and directories inherit the stripe settings of their parent directories. You can take advantage of this feature by organizing your large and small files into separate directories, then setting a stripe count on the large-file directory so that all new files created in the directory will be automatically striped. For example, to create a directory called “dir1” with a stripe size of 1 MB and a stripe count of 8, run:
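A minimal sketch of the dir1 example, using the -s and -c options as named in this document. The wrapper name make_striped_dir is hypothetical; the two commands inside it are what actually matter.

```shell
# make_striped_dir DIR SIZE COUNT
# Creates DIR if needed and sets its default layout, which files created
# in DIR afterwards will inherit.
make_striped_dir() {
    mkdir -p "$1" && lfs setstripe -s "$2" -c "$3" "$1"
}

# The example from the text (1 MB stripe size, stripe count of 8):
#   make_striped_dir dir1 1m 8
```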
You can “pre-create” a file as a zero-length striped file by running lfs setstripe as part of your job script or as part of the I/O routine in your program. You can then write to that file later. For example, to pre-create the file “bigdir.tar” with a stripe count of 20, and then add data from the large directory “bigdir,” run:
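The bigdir.tar example might look like the sketch below. It relies on tar writing into the already existing, pre-striped file rather than replacing it with a new one; the wrapper name tar_striped is hypothetical.

```shell
# tar_striped ARCHIVE COUNT DIR
# Pre-creates ARCHIVE as an empty file striped over COUNT OSTs,
# then writes the contents of DIR into it.
tar_striped() {
    lfs setstripe -c "$2" "$1" || return 1
    tar -cf "$1" "$3"
}

# The example from the text (stripe count of 20):
#   tar_striped bigdir.tar 20 bigdir
```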
Useful Lustre Commands
The examples in this section use /short as an example of a specific Lustre filesystem.
Listing Striping Information
To list the striping information for a specific file or directory:
Note: Omitting the -d flag will display striping for all files in the directory.
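The difference the -d flag makes can be sketched with two small wrappers around lfs getstripe. The function names dir_stripe and path_stripe and the paths in the usage comments are hypothetical.

```shell
# dir_stripe DIR
# Shows only the default layout of DIR itself (what new files will inherit).
dir_stripe() {
    lfs getstripe -d "$1"
}

# path_stripe PATH
# Shows the layout of a file, or of every file in PATH if it is a directory.
path_stripe() {
    lfs getstripe "$1"
}

# Usage (placeholder paths, substitute your own):
#   dir_stripe  /short/PROJECT/USERNAME
#   path_stripe /short/PROJECT/USERNAME/output.dat
```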
Listing Disk Usage and Quotas
NCI supplies the lquota command, which lets users check the quota usage and limits of all their projects.
To display disk usage and limits on your /short project directory:
To display usage on each OST, add the
Note: Viewing usage on each OST can help you manage your quota. When you write to your /short directory, a chunk of your quota (~100 MB) is allocated to each OST that is being written to. Therefore, it is possible to exceed your soft quota limit before your actual disk usage reaches the limit. See the Lustre Best Practice “Stay at Least 10% Below Soft Quota Limit” for more information.
Listing Space Usage
To list space usage per OST and MDT, in human-readable format, for all Lustre filesystems or for a specific one:
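A minimal sketch using lfs df with the -h flag for human-readable sizes. The wrapper name lustre_space is hypothetical.

```shell
# lustre_space [FILESYSTEM]
# Per-OST and per-MDT space usage in human-readable units.
# With no argument, every mounted Lustre filesystem is reported.
lustre_space() {
    lfs df -h "$@"
}

# Usage:
#   lustre_space            # all Lustre filesystems
#   lustre_space /short     # only /short
```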
Listing Inode Usage
To list inode usage for all filesystems or for a specific one:
To list all the OSTs for the filesystem:
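Both listings can be sketched with lfs df -i for inode usage and lfs osts for the OST list. The wrapper names lustre_inodes and lustre_osts are hypothetical.

```shell
# lustre_inodes [FILESYSTEM]
# Inode usage for all Lustre filesystems, or only for FILESYSTEM.
lustre_inodes() {
    lfs df -i "$@"
}

# lustre_osts FILESYSTEM
# Lists the OSTs that make up FILESYSTEM.
lustre_osts() {
    lfs osts "$1"
}

# Usage:
#   lustre_inodes /short
#   lustre_osts /short
```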
Main Lustre Components
Lustre filesystem components include:
Metadata Server (MDS)
Service nodes that manage all metadata operations, such as assigning and tracking the names and storage locations of directories and files on the OSTs. There is 1 active MDS per filesystem.
Metadata Target (MDT)
A storage device where the metadata (name, ownership, permissions and file type) are stored. There is 1 MDT per filesystem.
Object Storage Server (OSS)
Service nodes that run the Lustre software stack, provide the actual I/O service and network request handling for the OSTs, and coordinate file locking with the MDS. Each OSS can serve multiple OSTs. The aggregate bandwidth of a Lustre filesystem can approach the sum of bandwidths provided by the OSSes. There can be 1 or multiple OSSes per filesystem.
Object Storage Target (OST)
Storage devices where users’ file data is stored. The size of each OST varies from approximately 7 TB to approximately 22 TB, depending on the Lustre filesystem. The capacity of a Lustre filesystem is the sum of the sizes of all OSTs. There are multiple OSTs per filesystem.
Lustre Clients
Compute nodes that mount the Lustre filesystem and access or use the data in it. There are commonly thousands of Lustre clients per filesystem.
We acknowledge that this documentation is adapted from the original NASA Lustre documentation and modified for NCI filesystems.