National Reference Datasets
The NCI National Research Data Collection is Australia’s largest collection of climate, weather, Earth systems, environmental, satellite, and geophysics research datasets. NCI also hosts many other specialised domain datasets, such as optical astronomy and genomic data. The collection is a mix of nationally generated datasets and replicated international datasets that need to be hosted at NCI.
NCI currently manages more than 13 PB of nationally and internationally significant datasets, and many of these collections continue to grow. The data is available more generally, but an important aspect is that it is hosted alongside high performance computing and data analysis systems.
Most of the datasets have been prioritised through the NCI collaboration, particularly in the priority science domains, with NCI partners (ANU, Bureau of Meteorology, CSIRO, Geoscience Australia) and ARC Centres of Excellence.
NCI considers the following as key criteria for support as a national collection of datasets at NCI:
- Of use to multiple Australian research organisations
- Takes advantage of NCI capabilities and colocation
- Demonstrated interest and ongoing high usage
- Satisfies the FAIR and other suitable data principles
There are additional data collections and datasets under our management, which is negotiated with the relevant organisation.
NCI supports a number of key internationally recognised data principles:
- FAIR (Findable, Accessible, Interoperable, Reusable) data principles for its major datasets
- Programmable and high performance access
- Open as possible, Closed as necessary
- Use modern Data Standards wherever possible
- Transdisciplinary access
Finding and Accessing the datasets published by NCI
You can discover the datasets published and available at NCI using the NCI Data Catalogue, which uses ISO 19115 compliant data records. Each collection and constituent dataset is described by catalogue records in the NCI Data Catalogue. The data can be accessed through:
- NCI Lustre filesystems /g/data[1a,1b,2,..]/<NCI code>, which are available on NCI's Gadi, both directly and via the ARE. This data is best analysed using our Specialised Environments.
- NCI THREDDS data service (https://dapds00.nci.org.au), primarily using Open Geospatial Consortium (OGC) data services and DAP protocols (e.g., subsetting and aggregation)
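As a rough illustration of the DAP-style subsetting mentioned above, the sketch below builds an OPeNDAP (DAP2) request URL with an index-range constraint. It assumes the service follows the conventional THREDDS layout, where OPeNDAP endpoints sit under /thredds/dodsC/. The function name opendap_url, the project code ab12, and the file and variable names are hypothetical placeholders, not real NCI paths.

```python
from urllib.parse import quote

# Host taken from the text above; the /thredds/dodsC/ prefix is the
# conventional THREDDS OPeNDAP path and is an assumption for this service.
THREDDS_BASE = "https://dapds00.nci.org.au"
OPENDAP_PREFIX = "/thredds/dodsC/"


def opendap_url(dataset_path, variable=None, slices=()):
    """Build a DAP2 URL, optionally constrained to an index hyperslab.

    dataset_path: path of the file below the OPeNDAP prefix (hypothetical).
    variable:     name of the variable to subset, or None for the whole file.
    slices:       one (start, stride, stop) tuple per dimension, inclusive.
    """
    url = THREDDS_BASE + OPENDAP_PREFIX + quote(dataset_path.lstrip("/"))
    if variable is None:
        return url
    # DAP2 constraint expressions use [start:stride:stop] per dimension.
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{url}?{variable}{hyperslab}"


if __name__ == "__main__":
    # Hypothetical example: first 12 time steps of a 10 x 10 grid corner.
    print(opendap_url("ab12/example/tas.nc", "tas",
                      [(0, 1, 11), (0, 1, 9), (0, 1, 9)]))
```

A URL built this way can be handed to a DAP-aware client, so that only the requested index ranges are transferred rather than the whole file.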
While all the datasets are available via direct login to NCI, some selected services cater for specific community needs, such as international federations or desktop tools. These include:
- Earth System Grid Federation (https://esgf.nci.org.au), primarily for the international climate community undertaking the Coupled Model Intercomparison Project (CMIP) activity
- GSKY data service (https://gsky.nci.org.au) using OGC data protocols (WMS, WCS and WPS) for very large datasets (e.g., Satellite imagery)
- Sentinel Data service (https://copernicus.nci.org.au/sara.client/#/home)
- Optical Astronomy Data Repository service using IVOA data standards.
The NCI Data Catalogue addresses the F (Findable) in FAIR: it is a discoverability and search portal for these datasets. Please see our Data Catalogue User Guide for more information on how to use our catalogue to discover and access data at NCI.
NCI uses internationally recognised Digital Object Identifiers (DOIs) on datasets, which can be used to reference these datasets in journal publications or to share the location of the dataset landing page. Our goal is to ensure that each dataset includes a reference to its license, to give confidence around the use of the dataset.
NCI tracks usage statistics for all dataset accesses, both via the open data services and their different access protocols and in situ within the NCI computing systems. This provides information for planning and measuring demand for existing datasets, and for assessing the impact of upgrading or decommissioning datasets.
Data Hierarchy Definitions
There are several definitions that are fundamental to how the data catalogue and data directories at NCI are organised: Dataset, Data Subcollection, Data Collection, Data Category and Dataset Granules. While a variety of definitions for these terms are available from other sources, we use those listed below primarily because NCI’s focus is on programmatic access to data.
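The grouping rules in the definitions of this section can be sketched as plain data classes. This is only an illustrative model, not NCI's catalogue implementation: all class, field, and method names (Dataset, Subcollection, Collection, Registry, assign) are hypothetical, and only three rules are checked, namely that a Dataset carries a version, that a Dataset belongs to exactly one Subcollection, and that a Collection holds either Subcollections or lower-tier Collections but not both.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Dataset:
    name: str
    version: str      # a Dataset must include its version
    owner: str        # named Data Owner
    license: str      # single license
    data_format: str  # single data format

    def __post_init__(self):
        if not self.version:
            raise ValueError(f"Dataset {self.name!r} needs a version")


@dataclass
class Subcollection:
    name: str
    organisation: str  # the one organisation responsible for management
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    # Either all Subcollections or all lower-tier Collections, never mixed.
    children: List[Union["Collection", Subcollection]] = field(default_factory=list)

    def add(self, child):
        if self.children and type(child) is not type(self.children[0]):
            raise TypeError("cannot mix Collections and Subcollections")
        self.children.append(child)


class Registry:
    """Records which Subcollection each Dataset belongs to (exclusive)."""

    def __init__(self):
        self._home: Dict[Tuple[str, str], str] = {}

    def assign(self, sub: Subcollection, ds: Dataset) -> None:
        key = (ds.name, ds.version)
        home = self._home.setdefault(key, sub.name)
        if home != sub.name:
            raise ValueError(f"{ds.name} {ds.version} already belongs to {home}")
        if ds not in sub.datasets:
            sub.datasets.append(ds)
```

Keeping the exclusivity check in a separate registry, rather than inside Subcollection, is a design choice of this sketch: membership across all Subcollections has to be known in one place before a second assignment can be rejected.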
Dataset | A Dataset is a compilation of data that constitutes a programmable data unit and has been collected and organised using a single process. For this purpose it must have a named Data Owner, a single license, one set of semantics, ontologies and vocabularies, and a single data format and internal data convention. A Dataset must include its version. |
Data Subcollection | A Data Subcollection is an exclusive grouping of Datasets (i.e., each Dataset belongs to only one Subcollection) where the constituent Datasets are tightly managed. A single organisation must be responsible for the underlying management of its constituent Datasets. A Data Subcollection constitutes a strong connection between the component Datasets, and is organised coherently around a single scientific element (e.g., model, instrument). A Subcollection must have compatible licenses such that constituent Datasets do not need different access arrangements. |
Data Collection | A Data Collection is the highest in the hierarchy of data groupings at NCI. It comprises either an exclusive grouping of Data Subcollections or a tiered structure with an exclusive grouping of lower-tier Data Collections, where the lowest-tier Data Collection contains only Data Subcollections. |