The NCI Data Collections Catalogue manages the details of datasets through a uniform application of ISO19115:2003 - an international schema used for describing geographic information and services. Datasets stored at NCI have to be organised within the NCI catalogue, filesystems and data services in systematic harmonised ways in order to make data accessible using a high degree of specificity and in formats suitable for programmatic (automated) access methods. Hence the organisation and information in the catalogue must be complete and synchronised with the filesystem and data services. Such programmatic access is required by:
The value of any data at NCI is considered at the Data Collection and SubCollection level, including funding arrangements for the storage allocations for each of the underlying data Subcollections.
Data Quality Strategy
To ensure that data is managed for all these different uses, NCI Data Quality Strategy to Enable FAIR, Programmatic Access across Large, Diverse Data Collections for High Performance Data Analysis (link to Informatics - Open Access Journal) and how we are Implementing a Data Quality Strategy to Simplify Access to Data (link to AGU 2016 abstract) and AGU-DQS-2016-Druken_v2-1.pdf (PDF).
This process is to ensure that datasets complies with known standards, can be delivered using data services and their specialised capabilities, are represented correctly through identified dataa portals of need to the community, and can be used programmatically for high performance simulation and data analysis.
Data Hierarchy Definitions
There are several definitions that are fundamental to how the data catalogue and data directories at NCI are organised: Dataset, Data Subcollection, Data Collection, Data Category and Dataset Granules. While there are a variety of definitions for the terms used that available from other sources, we use those listed below primarily because NCI’s focus is on programmatic access to data.
|Dataset||A Dataset is a compilation of data that constitutes a programmable data unit that has been collected and organised using the one process. For this purpose it must have a named Data Owner, a single license, one set of semantics, ontologies, vocabularies, and has a single data format and internal data convention. A Dataset must include its version.|
|Data Subcollection||A Data Subcollection is an exclusive grouping of Datasets (i.e., belonging to only one Subcollection) where the constituent Datasets are tightly managed. It must have responsibilities within one organisation with responsibility for the underlying management of its constituent datasets. A Data Subcollection constitutes a strong connection between the component Datasets, and is organised coherently around a single scientific element (e.g., model, instrument). A Subcollection must have compatible licenses such that constituent Datasets do not need different access arrangements.|
A Data Collection is the highest in the hierarchy of data groupings at NCI. It is comprised of either an exclusive grouping of Data Subcollections; or, it is a tiered structure with an exclusive grouping of lower tiered Data Collections, where the lowest tier Data Collection will only contain Data Subcollections.
|Dataset Granule||A Dataset Granule is sometimes used for some scientific domains – particularly in Satellite Earth Observation. In this case it refers to the smallest aggregation of data that can be independently described, inventoried, and retrieved (https://earthdata.nasa.gov/user-resources/glossary#ed-glossary-g ). Dataset granules have their own metadata and support values associated with the additional attributes defined by parent Datasets.|