National Reference Datasets

The NCI National Research Data Collection is Australia’s largest collection of climate, weather, Earth systems, environmental, satellite, and geophysics research datasets.  NCI also hosts many other specialised domain datasets, such as optical astronomy and genomic data. The collection is a mix of nationally generated datasets and replicated international datasets that need to be hosted at NCI.

More than 13 PB of nationally and internationally significant datasets are currently managed at NCI, and many of these collections continue to grow.  Besides being available more generally, this data is co-located with high performance computing and data analysis systems, which is one of its most important aspects.

Most of the datasets have been prioritised through the NCI collaboration, particularly in the priority science domains and with NCI partners (ANU, Bureau of Meteorology, CSIRO, Geoscience Australia) and ARC Centres of Excellence.

NCI considers the following as key criteria for support as a national collection of datasets at NCI:

  • Of use to multiple Australian research organisations
  • Takes advantage of NCI capabilities and colocation
  • Demonstrated interest and ongoing high usage
  • Satisfies the FAIR and other suitable data principles

Additional data collections and datasets are also under our management, on terms negotiated with the relevant organisation.

NCI supports a number of key internationally recognised data principles:

  • FAIR (Findable, Accessible, Interoperable, Reusable) data principles for its major datasets
  • Programmable and high performance access
  • Open as possible, Closed as necessary
  • Use modern Data Standards wherever possible
  • Transdisciplinary access

Finding and Accessing the datasets published by NCI

You can discover the datasets published and available at NCI through our NCI Data Catalogue, which uses ISO19115 compliant data records.  Each collection and constituent dataset is described by catalogue records in the NCI Data Catalogue. The data can be accessed through:

  • NCI Lustre filesystems /g/data[1a,1b,2,..]/<NCI code>, which are available on NCI's Gadi, both directly and via the ARE.  This data is best analysed using our Specialised Environments.
  • NCI THREDDS data service (https://dapds00.nci.org.au), primarily using Open Geospatial Consortium (OGC) services and DAP protocols (e.g., subsetting and aggregation)
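
As a rough illustration of the subsetting that DAP access allows, the sketch below builds a DAP2 request URL for a slice of one variable. The host comes from the text above; the `/thredds/dodsC` prefix is the usual THREDDS OPeNDAP convention and is assumed here rather than confirmed for NCI, and the dataset path and variable name in the usage note are hypothetical placeholders. Check the catalogue record of a dataset for its actual endpoint.

```python
from urllib.parse import quote

THREDDS_HOST = "https://dapds00.nci.org.au"  # host named in the text above
OPENDAP_PREFIX = "/thredds/dodsC"  # assumption: conventional THREDDS OPeNDAP prefix


def opendap_subset_url(dataset_path, variable, slices, response="ascii"):
    """Build a DAP2 URL requesting a hyperslab of one variable.

    slices holds one (start, stride, stop) index triple per dimension.
    DAP2 stop indices are inclusive. response selects the DAP2 response
    type appended to the dataset URL (e.g. "ascii", "dods", "dds").
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    # Percent-encode the path but keep the "/" separators.
    path = quote(f"{OPENDAP_PREFIX}/{dataset_path.strip('/')}")
    return f"{THREDDS_HOST}{path}.{response}?{variable}{hyperslab}"
```

For example, `opendap_subset_url("ab1/example/file.nc", "tas", [(0, 1, 9), (10, 2, 20)])` requests the first ten indices of the first dimension and every second index from 10 to 20 of the second, for a hypothetical variable `tas` in a hypothetical file.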

While all the datasets are available via direct login to NCI, some selected services cater for specific community needs, such as international federations or desktop tools.

The NCI Data Catalogue meets our requirements for the F (Findable) in FAIR: it is a discoverability and search portal for these datasets. Please see our Data Catalogue User Guide for more information on how to use our catalogue to discover and access data at NCI.

NCI uses internationally recognised Digital Object Identifiers (DOI) on datasets, which can be used to reference these datasets in journal publications or for sharing the location of the dataset landing page.  Our goal is to ensure that each dataset includes a reference to its license to give confidence around the use of the dataset.
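
A DOI resolves to its landing page through the public resolver at https://doi.org. The helper below is a minimal sketch of normalising the common ways a DOI is written into that resolver URL; the function name is illustrative, and the DOI in the usage note is a placeholder, not a real NCI dataset identifier.

```python
def doi_to_url(doi):
    """Return the https://doi.org resolver URL for a DOI.

    Accepts a bare DOI ("10.x/y"), a "doi:" prefixed DOI, or an
    existing resolver URL, and raises ValueError for anything else.
    """
    doi = doi.strip()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return f"https://doi.org/{doi}"
```

For example, `doi_to_url("doi:10.5555/example-dataset")` gives `https://doi.org/10.5555/example-dataset`.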

NCI tracks usage statistics for all dataset accesses, both via the open data services and their different access protocols and via in-situ access within the NCI computing systems. This provides information for planning and measuring demand for existing datasets, as well as for assessing the impact of upgrading or decommissioning datasets.
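
To make the idea of per-dataset, per-protocol usage tracking concrete, here is a minimal sketch that tallies access records. The record layout (`dataset` and `protocol` keys) and the example values are hypothetical; the text does not describe how NCI actually stores or processes its usage logs.

```python
from collections import Counter


def summarise_accesses(records):
    """Count accesses per (dataset, protocol) pair.

    records is an iterable of dicts with "dataset" and "protocol" keys
    (a hypothetical layout chosen for this sketch).
    """
    return Counter((r["dataset"], r["protocol"]) for r in records)


def accesses_per_dataset(records):
    """Count accesses per dataset across all protocols."""
    return Counter(r["dataset"] for r in records)
```

A dataset whose total stays near zero over a long period would be a candidate for review before an upgrade or decommissioning decision, while the per-protocol breakdown shows which services carry the demand.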

Introduction

NCI hosts and organises over 40 large data collections that cover a wide range of disciplines. Being co-located with both HPC and cloud facilities, the data collections need to be organised in a systematic way to enable fast programmatic access for in situ analysis across multiple domains, as well as being made accessible via data services.

The NCI Data Collections Catalogue manages the details of datasets through a uniform application of ISO19115:2003 - an international schema used for describing geographic information and services. Datasets stored at NCI have to be organised within the NCI catalogue, filesystems and data services in harmonised ways in order to make data accessible using a high degree of specificity and in formats suitable for programmatic (automated) access methods. Hence the organisation and information in the catalogue must be complete and synchronised with the filesystem and data services. Such programmatic access is required by:

  • NCI core services such as the NCI supercomputer and NCI cloud-based capabilities;
  • Co-located community Virtual Laboratories;
  • Remotely, through established standards-based protocols that use the NCI Data Services; and
  • Increasingly, through international federations.
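
The requirement that the catalogue stay complete and synchronised with the filesystem can be pictured as a simple two-way comparison. The sketch below assumes that each catalogue record carries the relative path of its dataset and that the set of dataset directories on disk has already been listed; both inputs and the function name are hypothetical, and this is not NCI's actual tooling.

```python
def find_sync_gaps(catalogue_paths, filesystem_paths):
    """Compare dataset paths known to the catalogue with those on disk.

    Returns a dict with two sorted lists: paths that have a catalogue
    record but no directory, and directories with no catalogue record.
    """
    catalogue = set(catalogue_paths)
    filesystem = set(filesystem_paths)
    return {
        "missing_on_filesystem": sorted(catalogue - filesystem),
        "missing_in_catalogue": sorted(filesystem - catalogue),
    }
```

Both lists being empty is the synchronised state; any entry in either list is something an automated publishing step would need to resolve before the dataset is exposed through the data services.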

This requires data to be well-organised and to meet uniform professional standards that make it usable by programs, developers of programs, and end-users alike. Data also needs to be organised so as to harmonise data operations at NCI, which must publish data simultaneously across several different data servers and services, and to address other data repository management processes and requirements.

Datasets at NCI are primarily ingested and subsequently updated by programmatic means. This can be either through network-enabled replication of datasets organised at other data repositories, or through data generation and/or processing at NCI. Therefore, the organisation of the datasets on the filesystems also needs to be predictable for the publishing aspects of the service, as well as being suitable for automated updating without requiring human intervention.

If you are interested in publishing your data at NCI or have an inquiry, you can email our NCI Help Desk (help@nci.org.au) or Data Repository team (kelsey.druken@anu.edu.au).

Roles and Responsibilities

NCI is responsible for the quality of the data repository, all its functions, and the internal consistency of all its information. The following Roles and Responsibilities have been established:

  • Data Collections are managed by NCI to agreed community and international standards that strongly relate the data to both transdisciplinary use as well as domain specific needs. NCI leads the process of broader consultation through community management as resolved through the NCI Allocation Committee, its Scientific Assessment Panel and Technical Advisory Group;

  • NCI is responsible for the organisation and coordinated activities of data within the collections, in concert with Dataset Managers and Organisational staff such as Data Stewards. This includes development of Data Management Plans (DMPs) and ensuring datasets comply with NCI’s Data Quality Strategy; and

  • To ensure uniformity in the stakeholder communication and management of the service, NCI is responsible for communications about changes to data areas. The content of advice will be developed in consultation with data providers. It is therefore important that any updates within data areas are managed under controlled procedures.

The value of any data at NCI is considered at the Data Collection and Subcollection level, including funding arrangements for the storage allocations for each of the underlying Data Subcollections.

Data Hierarchy Definitions

There are several definitions that are fundamental to how the data catalogue and data directories at NCI are organised: Dataset, Data Subcollection, Data Collection, Data Category and Dataset Granules. While a variety of definitions for these terms are available from other sources, we use those listed below primarily because NCI’s focus is on programmatic access to data.

Dataset

A Dataset is a compilation of data that constitutes a programmable data unit and that has been collected and organised using a single process. For this purpose it must have a named Data Owner, a single license, one set of semantics, ontologies, and vocabularies, and a single data format and internal data convention. A Dataset must include its version.

Data Subcollection

A Data Subcollection is an exclusive grouping of Datasets (i.e., each Dataset belongs to only one Subcollection) where the constituent Datasets are tightly managed. Responsibility for the underlying management of its constituent Datasets must lie within one organisation. A Data Subcollection constitutes a strong connection between the component Datasets, and is organised coherently around a single scientific element (e.g., model, instrument). A Subcollection must have compatible licenses such that constituent Datasets do not need different access arrangements.

Data Collection

A Data Collection is the highest in the hierarchy of data groupings at NCI. It comprises either an exclusive grouping of Data Subcollections or a tiered structure with an exclusive grouping of lower-tier Data Collections, where the lowest-tier Data Collection contains only Data Subcollections.

Dataset Granule

The term Dataset Granule is used in some scientific domains, particularly in Satellite Earth Observation. In this case it refers to the smallest aggregation of data that can be independently described, inventoried, and retrieved (https://earthdata.nasa.gov/user-resources/glossary#ed-glossary-g). Dataset Granules have their own metadata and support values associated with the additional attributes defined by parent Datasets.
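
The hierarchy defined above can be sketched as a small data model. The class and field names below are illustrative and not NCI's schema; the sketch covers only a single-tier Collection, and it checks just two of the stated rules: that every Dataset carries a version and that a Dataset belongs to only one Subcollection. License compatibility and tiered Collections are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    version: str
    owner: str
    license: str
    data_format: str


@dataclass
class Subcollection:
    name: str
    organisation: str  # the one organisation responsible for management
    datasets: list = field(default_factory=list)


@dataclass
class Collection:
    name: str
    subcollections: list = field(default_factory=list)


def validate(collection):
    """Return a list of rule violations found in a single-tier Collection."""
    problems = []
    home = {}  # Dataset -> name of the first Subcollection that held it
    for sub in collection.subcollections:
        for ds in sub.datasets:
            if not ds.version:
                problems.append(f"{ds.name}: missing version")
            if ds in home and home[ds] != sub.name:
                problems.append(f"{ds.name}: in both {home[ds]} and {sub.name}")
            home.setdefault(ds, sub.name)
    return problems
```

An empty list from `validate` means the two checked rules hold. Making `Dataset` frozen keeps it hashable, so the same Dataset object appearing under two Subcollections is detected by identity of its fields, including its version.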
