This article explains how Lustre I/O works, and provides best practices for improving application performance.
How Does Lustre I/O Work?
At NCI, Lustre filesystems (/home, /short and /g/data) are shared among many users and many application processes, which causes contention for various Lustre resources.
When a client (a compute node from your job) needs to create or access a file, the client node queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the file's stripes. Once the file is opened and the client node obtains the striping information, the MDS is no longer involved in the file I/O process; the client node interacts directly with the object storage servers (OSSes) and object storage targets (OSTs) to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple client nodes try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency so that all clients see consistent results.
Jobs running on Raijin contend for shared resources in NCI's Lustre filesystem. The Lustre server can only handle a limited number of remote procedure calls (RPCs, inter-process communications that allow the client to cause a procedure to be executed on the server) per second. Contention slows the performance of your applications and weakens the overall health of the Lustre filesystem. To reduce contention and improve performance, please apply the practices below to your compute jobs.
1) Avoid Using ls -l
The ls -l command displays information such as ownership, permission, and size of all files and directories. The ownership and permission metadata is stored on the MDTs, but the file size metadata is only available from the OSTs. Thus, the ls -l command issues RPCs to the MDS/MDT and OSSes/OSTs for every file/directory to be listed. RPC requests to the OSSes/OSTs are very costly and can take a long time to complete if there are many files and directories.
- Use ls by itself if you just want to see if a file exists
- Use ls -l filename if you want to see all the information about a specific file
2) Avoid Having a Large Number of Files in a Single Directory
Opening a file holds a lock on the parent directory, so no other client nodes can access that directory at the same time. Opening many files in the same directory therefore creates contention.
- Split a large number of files (in the thousands or more) into multiple subdirectories to minimise contention.
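One way to split a flat directory is to bucket files into subdirectories by a hash of the filename. The following is a self-contained sketch run in a temporary directory; the filenames and the choice of 16 buckets (sub0..sub15) are illustrative assumptions, not NCI requirements:

```shell
#!/bin/sh
# Demo: spread many files across subdirectories sub0..sub15 based on a
# hash of each filename, instead of keeping them in one flat directory.
dir=$(mktemp -d); cd "$dir"
for i in $(seq 1 100); do touch "file$i.dat"; done   # stand-in files

for f in file*.dat; do
    # bucket = CRC of the filename, modulo 16
    bucket=$(( $(printf '%s' "$f" | cksum | cut -d' ' -f1) % 16 ))
    mkdir -p "sub$bucket"
    mv "$f" "sub$bucket/"
done
```

Because the bucket is derived from the name alone, any later script can recompute where a given file lives without listing the whole tree.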
3) Avoid Accessing Small Files
Accessing small files on the Lustre filesystem is not efficient. When possible, copy small files to the EXT4-mounted local filesystem on each node (accessible as $PBS_JOBFS) at the beginning of the job, and copy any results back to Lustre at the end of the job.
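The staging pattern can be sketched as below. In a real job, $PBS_JOBFS is set by the scheduler on the compute node; here it is simulated with a temporary directory so the sketch is self-contained, and the "inputs" directory and file names are hypothetical:

```shell
#!/bin/sh
# Stage small files via node-local disk, then copy results back.
PBS_JOBFS=${PBS_JOBFS:-$(mktemp -d)}   # simulated when not inside a job
lustre=$(mktemp -d)                    # stands in for a Lustre directory
mkdir -p "$lustre/inputs"; touch "$lustre/inputs/cfg.txt"

cp -r "$lustre/inputs" "$PBS_JOBFS/"   # stage in: Lustre -> local EXT4
# ... run the application against "$PBS_JOBFS/inputs" here ...
cp -r "$PBS_JOBFS/inputs" "$lustre/results"   # stage out at job end
```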
4) Use a Stripe Count of 1 for Directories with Many Small Files
If you must keep small files on Lustre, be aware that stat operations are more efficient if each small file resides on a single OST. Create a directory for the small files and set its stripe count to 1 so that only one OST is used for each file. This is useful when you extract source and header files (which are usually very small) from a tarfile. Use the Lustre utility lfs to create a specific striping pattern, or to find the striping pattern of existing files:
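A sketch of the relevant lfs commands (the directory name smallfiles is hypothetical; these require a Lustre mount to run):

```shell
# Create a directory whose new files each use a single OST.
mkdir smallfiles
lfs setstripe -c 1 smallfiles    # -c sets the stripe count

# Inspect the striping pattern of an existing file or directory:
lfs getstripe smallfiles
```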
If there are large files in the same directory tree, it may be better to allow them to stripe across more than one OST. You can create a new directory with a larger stripe count and copy the larger files to that directory. Note that moving files into that directory with the mv command will not change the stripe count of the files. Files must be created in or copied to a directory to inherit its stripe count properties.
If you have a directory with many small files (less than 100 MB) and a few very large files (greater than 1 GB), then it may be better to create a new subdirectory with a larger stripe count. Store just the large files and create symbolic links to the large files using the symlink command ln:
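A sketch of this arrangement (the names bigdir and largefile.dat are hypothetical, and the stripe count of 8 is an illustrative choice; the lfs command requires a Lustre mount):

```shell
mkdir bigdir
lfs setstripe -c 8 bigdir          # stripe new files here over 8 OSTs

cp largefile.dat bigdir/           # copy (not mv) to pick up striping
rm largefile.dat
ln -s bigdir/largefile.dat largefile.dat   # keep the old path working
```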
5) Keep Copies of Your Source Code on /home and/or MDSS
Be aware that files under /short are not backed up. Make sure that you have copies of your source codes, makefiles, and any other important files saved on your Raijin home filesystem or on MDSS.
6) Increase the stripe_count for Parallel Writes to the Same File
When multiple processes are writing blocks of data to the same file in parallel, the I/O performance for large files will improve when the stripe_count is set to a larger value. The stripe count sets the number of OSTs the file will be written to. By default, the stripe count is set to 1. While this default setting provides for efficient access of metadata (for example, to support the ls -l command), large files should use stripe counts greater than 1. This will increase the aggregate I/O bandwidth by using multiple OSTs in parallel instead of just one. A rule of thumb is to use a stripe count approximately equal to the number of gigabytes in the file.
Another good practice is to make the stripe count an integral factor of the number of processes performing the write in parallel, so that you achieve load balance among the OSTs. For example, set the stripe count to 16 instead of 15 when you have 64 processes performing the writes.
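For example, for 64 writer processes sharing one output file, a stripe count of 16 divides the writers evenly across OSTs. Running lfs setstripe on a filename creates the file with that layout (the filename shared_output.dat is hypothetical, and the command requires a Lustre mount):

```shell
# 64 writers, stripe count 16: each OST serves 4 writers.
lfs setstripe -c 16 shared_output.dat
```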
7) Limit the Number of Processes Performing Parallel I/O
Given that there are 40 OSSes and 360 OSTs on Raijin for /short, there will be contention if a large number of processes of an application are involved in parallel I/O. Instead of allowing all processes to do the I/O, choose just a few processes to do the work. For writes, these few processes should collect the data from other processes before the writes. For reads, these few processes should read the data and then broadcast the data to others.
8) Stripe Align I/O Requests to Minimise Contention
Stripe aligning means that the processes access files at offsets that correspond to stripe boundaries. This helps to minimise the number of OSTs a process must communicate with for each I/O request. It also helps to decrease the probability that multiple processes accessing the same file communicate with the same OST at the same time.
One way to stripe-align a file is to make the stripe size the same as the amount of data in the write operations of the program.
9) Avoid Repetitive "stat" Operations
Some users have implemented logic in their scripts to test for the existence of certain files. Such tests generate "stat" requests to the Lustre server. When the testing becomes excessive, it creates a significant load on the filesystem. A workaround is to slow down the testing process by adding sleep in the logic. For example, a script might test for the existence of the files WAIT and STOP to decide what to do next.
When neither the WAIT nor STOP file exists, the loop ends up testing for their existence as quickly as possible (on the order of 5,000 times per second). Adding sleep inside the loop slows down the testing.
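A sketch of such a polling loop with sleep added (WAIT and STOP are the hypothetical control-file names from the text; for this self-contained demo, a background job creates STOP after a few seconds so the loop terminates, and the sleep intervals are shortened):

```shell
#!/bin/sh
# Poll for control files, sleeping between checks so the loop does not
# issue thousands of stat RPCs per second to the Lustre servers.
( sleep 3; touch STOP ) &      # simulates another process signalling us

while :; do
    if [ -e STOP ]; then
        echo "STOP found, finishing up"
        break
    elif [ -e WAIT ]; then
        sleep 1    # pause while WAIT is present (use ~60s in a real job)
    else
        sleep 1    # throttle the existence test (use ~10s in a real job)
    fi
done
```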
10) Avoid Having Multiple Processes Open the Same File(s) at the Same Time
On Lustre filesystems, if multiple processes try to open the same file(s), some processes will not be able to find the file(s) and your job will fail.
The source code can be modified to call the sleep function between I/O operations. This will reduce the occurrence of multiple, simultaneous access attempts to the same file from different processes.
When opening a read-only file in Fortran, use ACTION='read' instead of the default ACTION='readwrite'. The former will reduce contention by not locking the file.
11) Avoid Repetitive Open/Close Operations
Opening and closing files incurs overhead, so repetitive open/close operations should be avoided.
If you intend to open the files for read only, make sure to use ACTION='READ' in the open statement. If possible, read the files once each and save the results, instead of reading the files repeatedly.
If you intend to write to a file many times during a run, open the file once at the beginning of the run. When all writes are done, close the file at the end of the run.
See Lustre Basics for more information.
12) Use the Soft Link to Refer to Your Lustre Directory
Your /g/data directory is created on a specific Lustre filesystem, such as /g/data1 or /g/data2, but you can use a soft link to refer to the directory no matter which filesystem it is on:
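The layout can be sketched as below. The project code ab1 and the physical location under /g/data1 are hypothetical; the demo recreates the structure under a temporary directory so it is self-contained:

```shell
#!/bin/sh
base=$(mktemp -d)
mkdir -p "$base/g/data1/ab1" "$base/g/data"   # physical dir + link parent
ln -s "$base/g/data1/ab1" "$base/g/data/ab1"  # the soft link users see

# Scripts refer to the link, so nothing changes if the data is later
# migrated to, say, /g/data2 and only the link target is updated.
readlink "$base/g/data/ab1"
```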
By using the soft link, you can easily access your directory without needing to know the name of the underlying filesystem. Also, you will not need to change your scripts or re-create any symbolic links if a system administrator needs to migrate your data from one Lustre filesystem to another.
13) Stay at Least 10% Below Soft Quota Limit
It is possible to reach the soft quota on your /short filesystem even when your actual disk usage is lower than the quota limit. Therefore, to ensure that you don't go over the quota, keep your Lustre disk usage at least 10% below the soft quota limit.
A Lustre filesystem is a collection of smaller filesystems, known as object storage targets (OSTs), that appear as a single filesystem. Each OST has its own quota, and each time you write to your /short directory, a chunk of your overall quota (about 100 MB) is allocated to an OST to keep you from exceeding the OST's quota. It is this allocation—which is larger than the disk space you are actually using—that is counted against your overall quota.
You can find out what is allocated on the OSTs by running the lfs command as follows:
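A sketch of the command (the username abc123 is hypothetical; the -v flag adds the per-OST breakdown, and the command requires a Lustre mount):

```shell
# Per-OST quota and allocation report for a user on /short.
lfs quota -v -u abc123 /short
```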
Note: The resulting output might be very long. Scroll to the end to see the allocated blocks.
14) Preserve Corrupted Files for Investigation
When you notice a corrupted file in your /nobackup directory, it is important to preserve the file to allow NCI staff to investigate the cause of corruption. To prevent the file from being accidentally overwritten or deleted by your scripts, we recommend that you rename the corrupted file using:
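A minimal sketch of the rename, run in a temporary directory. The filename output.dat is hypothetical and the .corrupted suffix is only a suggested convention; the point is that mv keeps the same inode and data blocks, so the corrupted data remains available for investigation:

```shell
#!/bin/sh
cd "$(mktemp -d)"
touch output.dat                     # stand-in for the corrupted file
mv output.dat output.dat.corrupted   # rename in place; do not copy
```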
Note: Do not use cp to create a new copy of the corrupted file.
Report the problem to NCI staff. Include:
– your name
– contact number
– full pathname of the affected file
– date the file was corrupted or removed
– best guess as to when the file was created and/or last modified.
If you report performance problems with a Lustre filesystem, please include:
– the time
– PBS job number
– name of the filesystem
– the path of the directory or file that you are trying to access.
Your report will help us correlate issues with recorded performance data to determine the cause of efficiency problems.
We acknowledge that this article is based on NASA's original Lustre documentation, adapted for NCI filesystems.