Storage and Archive Manager File System (DMF) Product Overview
The Mass Data Store was migrated to a new SGI Hierarchical Storage Management System in January 2012. As a result, SAM-QFS commands are now obsolete.
The new SGI-based MDSS file system is significantly more robust, with two large tape silos in separate machine rooms in two separate buildings, enabling automatic offsite copies of all data. It also has a disk cache of around 1 petabyte.
DMF software provides services which automatically manage the migration and retrieval of files between multiple levels of a storage hierarchy: from on-line disk cache to offline tape archival. Archive files or complete file systems can be moved or duplicated on multiple DMF servers, or off-site servers, for additional data protection.
On raijin, the existing mdss commands have been retained for continuity. However, not all of the old functionality has been restored (see man mdss for details of the latest options). The new massdata service provides a further set of useful DMF commands (included as arguments to mdss on raijin). The status of your files can be shown using the “dmls -l” command as an alternative to the normal Linux “ls” command. The possible states are offline (OFL), regular (REG) and dual state (DUL):
offline (OFL): the file is on tape.
regular (REG): the file is online (on disk).
dual state (DUL): the file is both on tape and disk.
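As a rough sketch of how these states can be tallied, the shell function below counts files by state from a listing read on standard input. It assumes that each line of the listing carries the state as a parenthesised token such as (OFL); that layout is an assumption, so check it against the actual output of dmls -l on your system. The function name count_dmf_states is hypothetical.

```shell
# Tally DMF file states from a listing read on standard input.
# Assumption: each line contains the state as a token like (OFL), (REG) or (DUL).
count_dmf_states() {
    awk '
        match($0, /\((OFL|REG|DUL)\)/) {
            # Strip the surrounding parentheses to get the bare state name.
            state = substr($0, RSTART + 1, RLENGTH - 2)
            count[state]++
        }
        END {
            printf "OFL=%d REG=%d DUL=%d\n", count["OFL"], count["REG"], count["DUL"]
        }
    '
}
```

If dmls is available as an argument to mdss, as described above, the function could be used as "mdss dmls -l | count_dmf_states". Lines without a recognised state token are ignored.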
Projects dealing with massive amounts of data need to carefully consider all aspects of data acquisition, storage, retrieval, navigation, and interpretation.
Projects are advised to carefully consider the data flow requirements of their data retrieval and navigation since this facet of MDSS will be the most utilized in the long term.
The following issues should be considered before writing the first byte of data to DMF:
- data organisation;
- directory structure for efficient recall;
- file layout for efficient recall;
- data navigation (aka metadata) database for archived directories and files;
- projected access/retrieval modes;
- user and application interface to data retrieval process.
- Intended for archiving large data files, particularly those created or used by batch jobs. (It is a misuse of the system to try to store large numbers of small files – please do NOT do this. See the netcp -t command option below.)
- Each project has a directory on the Mass Data Storage System (MDSS) with pathname /massdata/projectid on that system. This path CANNOT be directly accessed from raijin login.
- Remote access to your massdata directory is by the mdss utility or the netcp and netmv commands (see man mdss, man netcp and man netmv for full details). The mdss commands operate on files in that remote directory.
- Users connected to the project have rwx permissions in that directory and so may create their own files in those areas.
- NOT to be used as an extension of home directories (files changed/removed on the massdata area are not in general recoverable, as there are no back-ups of previous revisions.)
Currently, batch jobs (other than copyq jobs) cannot use the mdss utilities.
Note: always use -l other=mdss when using mdss commands in copyq jobs. This ensures that jobs only run when the mdss system is available.
- Quotas apply – use nci_account on the compute machines to see your MDSS quota and usage. See the Disk Quota Policy document for details of the ramifications of exceeding the quotas.
- The mdss access is intended for relatively modest mass data storage needs. Users with larger capacity storage or more sophisticated access needs should contact us to get an account on the data cluster.
# lists all files and directories in the user’s mass storage
# creates a directory ‘foo’ in the user’s mass storage
# submits a copyq batch job that creates a gzipped tarball named ‘mytarball.tar.gz’ from directory ‘mydir/’ and copies it into the user’s mass storage subdirectory ‘foo/’
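The three operations described above might look like the following sketch. The subcommands mdss ls, mdss mkdir and mdss put, and the PBS directive lines, are assumptions based on typical usage, not commands taken from this page; confirm them with man mdss and the batch system documentation before relying on them. The MDSS variable and the function names are hypothetical, added so the sketch can be dry-run with echo on a machine without mdss.

```shell
# Sketch only: the mdss subcommand names are assumptions, check `man mdss`.
MDSS=${MDSS:-mdss}    # set MDSS=echo for a dry run

# List all files and directories in the project's mass storage area.
mdss_list() { "$MDSS" ls; }

# Create a directory in the mass storage area.
mdss_make_dir() { "$MDSS" mkdir "$1"; }

# Print a copyq job script that tars a directory and stores the tarball.
# -l other=mdss holds the job until the mdss system is available.
write_archive_job() {
    src_dir=$1
    tarball=$2
    dest_dir=$3
    cat <<EOF
#!/bin/bash
#PBS -q copyq
#PBS -l other=mdss
cd "\$PBS_O_WORKDIR"
tar -czf "$tarball" "$src_dir"
mdss put "$tarball" "$dest_dir"
EOF
}
```

For example, "write_archive_job mydir mytarball.tar.gz foo > archive.pbs" writes a job script that could then be submitted with qsub. Resource requests such as walltime are omitted here and would need to be added for a real job.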
Bundling Small Files
Archival of small files (< 20 MB) places a processing burden on DMF and hence reduces overall throughput for all users. Users with many small files are requested to bundle them into larger files. Common tools used to bundle files are tar(1) and cpio(1). Although it is preferable to bundle files on your local machine, data-intensive projects will also have a short-term area on host raijin that can be used to bundle files before they are saved in the DMF filesystem.
# transfer files to mdss from raijin (perhaps as part of a job script)
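A minimal sketch of the bundling step, assuming find and tar are available: the hypothetical function bundle_small_files collects every regular file under a directory that is smaller than a byte threshold into a single gzipped tarball. The default threshold of 20 MiB is an approximation of the small-file size quoted in this section. File names containing newlines are not handled.

```shell
# Bundle files smaller than a threshold into one gzipped tarball.
# Usage: bundle_small_files SRC_DIR TARBALL [THRESHOLD_BYTES]
bundle_small_files() {
    src_dir=$1
    tarball=$2
    threshold=${3:-20971520}    # 20 MiB in bytes (assumed default)
    list=$(mktemp) || return 1
    # "-size -Nc" selects files strictly smaller than N bytes.
    find "$src_dir" -type f -size "-${threshold}c" > "$list"
    # -T reads the archive member names from the list file.
    tar -czf "$tarball" -T "$list"
    status=$?
    rm -f "$list"
    return "$status"
}
```

The resulting tarball should be written outside SRC_DIR so that it is not picked up by the search. It could then be transferred with a command along the lines of "mdss put smallfiles.tar.gz foo/", although the exact form of that command is an assumption to be checked against man mdss.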
Numerous Small Files
Maintaining user directories containing many small files places a great burden on DMF. The inherent overhead of archival processing for each file is large.
To quantify ‘many’ and ‘small’: an account with more than 30% of its files smaller than 10 MB has reached our cautionary limit. When this threshold is reached, the user is required to start bundling the small files into larger container files (e.g. using ‘tar’ or ‘cpio’).
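The threshold above can be estimated for a local staging directory before anything is archived. The sketch below is not an official NCI tool: the function name small_file_percent is hypothetical, and it treats 10 MB as 10 MiB (10485760 bytes), which is an assumption. It prints the percentage of regular files under a directory that fall below the size threshold, rounded down to a whole number.

```shell
# Print the percentage of regular files under DIR smaller than THRESHOLD bytes.
# Usage: small_file_percent DIR [THRESHOLD_BYTES]
small_file_percent() {
    dir=$1
    threshold=${2:-10485760}    # 10 MiB in bytes (assumed reading of "10 MB")
    # Arithmetic expansion strips any padding that wc adds around the count.
    total=$(( $(find "$dir" -type f | wc -l) ))
    small=$(( $(find "$dir" -type f -size "-${threshold}c" | wc -l) ))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( 100 * small / total ))
    fi
}
```

A result above 30 for a directory that is about to be archived suggests bundling its small files first.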
Backups of Quickly Changing Source Code
Archive copies of source trees are a common use of archival systems. The preferred backup model is to manage the source under a configuration management tool and then back up only the major software milestones. The configuration manager will manage the day-to-day changes on the user’s local disk.
The NCI National Facility consultants will consult on the selection, integration and use of common public domain configuration management tools (e.g. CVS, SCCS, RCS). These tools provide methods of tracking daily changes (thus enabling easy backout of injudicious modifications) and major branches in code development.
Locking Files onto Disk Cache
Locking the initial kilobytes of a file onto disk cache will speed the initial flow of data to the requester. However, the disk cache could easily become full of locked-down data if users routinely set this attribute. A small number of such locked files is permissible. NCI National Facility staff will closely monitor the use of this attribute for abuse. If this attribute is required by your project, you need to write to us.
To prevent user login delay, the standard startup files are automatically locked onto disk in their entirety when they are found during a weekly automated search. The default locked files are: .cshrc, .profile, .login, .rhosts, .logout, .history, and .sshrc/*.
Recovery of Lost Files
Recovery of files inadvertently removed or mangled is possible within a period of one week. The timeframe is short because the tape holding the original file might be returned to the tape pool.
The recovery process is complex and not automatic. DMF does not version pre-existing files of the same name, so the recovery involves reloading an old snapshot of the DMF database and then retrieving the desired entry. Users are requested to submit a file recovery request only for critical, irreplaceable data.
In the event of a personal disaster, send email, including the following information:
- user name;
- contact number;
- full pathname of lost file;
- date file was removed;
- best guess as to when file was created or last modified.
Projects dealing with massive amounts of data need to carefully consider all aspects of data acquisition, storage, retrieval, navigation, and interpretation. NCI National Facility consultants are available to assist projects in all phases of this cradle-to-grave process. Send email to make an appointment for an exploratory conversation. Some topics on which projects might need assistance include:
- project data flow;
- determining if a formal database manager is recommended;
- database organisation;
- efficient access of project data;
- data interpretation tools;
- source configuration management.
If your project changes direction or data patterns, let NCI National Facility staff know so we can adjust your storage environment appropriately.