Data indexing is the process of organizing and structuring metadata to enable fast and efficient search and retrieval. It involves creating an index, a data structure that stores pointers to the locations of the data, allowing quick access to relevant information. Using NCI's Intake-based indexes, users can run searches and get results in a fraction of the time it would take to search through the entire database.
Intake is a Python package for data access and management. It provides a simple interface for loading and exploring data from different data sources in both local and remote storage systems. Intake allows users to define and share data catalogs, which are collections of metadata about the data sources. Intake also provides a mechanism for lazy loading of data, allowing users to work with extremely large datasets (> PB) whose indexes exceed the available memory.
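As a minimal sketch of this workflow (the catalog path and source name below are hypothetical, for illustration only), a catalog can be opened with `intake.open_catalog`, its sources listed from the catalog metadata, and a source loaded lazily through Dask rather than read into memory all at once:

```python
import intake

# Open an Intake catalog (path is hypothetical, for illustration only).
cat = intake.open_catalog("catalog.yaml")

# List the data sources described by the catalog's metadata.
print(list(cat))

# Lazily load one source as a Dask-backed object; no data is read
# until a computation actually requires it.
source = cat["example_source"]   # hypothetical source name
data = source.to_dask()
```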
NCI provides two dataset indexing schemes based on different Intake techniques:
- Intake-spark Scheme (for expert users): For every data collection hosted by NCI, we generate intake data source files in Parquet format, encompassing all file attributes as metadata. These files can be manipulated using the intake-spark package.
- Intake-ESM Scheme (for climate users): Additionally, for certain data collections, we create lightweight data source files in CSV format containing selected metadata. These files can be used seamlessly with our NCI intake-esm indexes and the associated Intake software, as shown in the sketch after this list.
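As a minimal sketch of the Intake-ESM scheme (the datastore path and search facets below are hypothetical and vary between collections), an intake-esm datastore can be opened from its JSON descriptor, searched against the lightweight CSV metadata, and loaded into xarray datasets:

```python
import intake

# Open an intake-esm datastore; the JSON descriptor points at the CSV
# metadata file (path is hypothetical, for illustration only).
cat = intake.open_esm_datastore("nci_collection.json")

# Query the CSV metadata instead of scanning files on disk.
# The facet names here are examples and depend on the collection.
subset = cat.search(variable_id="tas", experiment_id="historical")

# Load the matching data sources into a dictionary of xarray Datasets.
dsets = subset.to_dataset_dict()
```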