A data catalog is a centralized repository designed to help users manage and organize data across different sources, formats, and storage systems. It acts as a metadata management system that stores detailed information about datasets, such as their descriptions, locations, structure, schema, data types, and access permissions. The primary goal of a data catalog is to provide an organized and easily accessible overview of the data available.

While the data catalog organizes, manages, and provides access to datasets, indexing speeds up searching for and retrieving specific data. An index is a data structure that stores pointers to where the data is located, so a query can go directly to the relevant records instead of scanning the entire data collection. In Intake, this indexing functionality lets users search data efficiently and retrieve results in a fraction of the time a full scan would take.
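The difference between a full scan and an indexed lookup can be illustrated with a minimal sketch in plain Python. This is not how Intake implements indexing; the records, field names, and helper functions below are hypothetical and serve only to show the idea of storing pointers to matching entries.

```python
from collections import defaultdict

# Hypothetical metadata records; the paths and field names are illustrative only.
records = [
    {"path": "/data/a/tas_2000.nc", "variable": "tas", "year": 2000},
    {"path": "/data/a/pr_2000.nc", "variable": "pr", "year": 2000},
    {"path": "/data/b/tas_2001.nc", "variable": "tas", "year": 2001},
]


def full_scan(records, variable):
    """Check every record: the cost grows with the size of the collection."""
    return [r["path"] for r in records if r["variable"] == variable]


def build_index(records, key):
    """Map each value of `key` to the positions of the matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[key]].append(position)
    return index


def indexed_lookup(records, index, value):
    """Follow the stored positions instead of scanning everything."""
    return [records[i]["path"] for i in index.get(value, [])]


# The index is built once and can then serve many lookups.
variable_index = build_index(records, "variable")
tas_paths = indexed_lookup(records, variable_index, "tas")
```

Both approaches return the same paths; the index pays a one-time build cost so that each later query only touches the matching entries.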

Intake is a Python package for data access and management. It offers an intuitive interface for loading and exploring data from various sources, whether local or remote. With Intake, users can define and share data catalogs, which contain metadata about data sources, and use its lazy loading feature to work with very large datasets, including collections at petabyte scale. Because data is read only when it is requested, users can work with datasets larger than the available memory, which improves performance for large-scale operations.
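The idea of a catalog that holds metadata and defers reading until data is requested can be sketched as follows. This is a plain Python stand-in and not Intake's API: the `LazySource` class, the catalog dictionary, and the sample file are all hypothetical and exist only to illustrate lazy loading.

```python
import csv
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class LazySource:
    """Hypothetical catalog entry: metadata now, data only on request."""
    description: str
    path: str
    metadata: dict = field(default_factory=dict)

    def read(self):
        """Open the file and yield rows one at a time."""
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle)


# Write a tiny CSV file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "stations.csv")
with open(sample_path, "w", newline="") as handle:
    handle.write("station,temp_c\nA,21.5\nB,19.0\n")

catalog = {
    "stations": LazySource(
        description="Example station temperatures",
        path=sample_path,
        metadata={"format": "csv"},
    )
}

# Browsing the catalog touches only metadata; no file is read here.
descriptions = {name: src.description for name, src in catalog.items()}

# Data is read only when explicitly requested, row by row.
rows = list(catalog["stations"].read())
```

Because `read()` is a generator, a caller could also iterate over the rows without ever holding the whole file in memory.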

At NCI, we use two dataset indexing schemes based on different Intake techniques:

  1. Intake-Spark Scheme: For each data collection hosted by NCI, we generate Intake data source files in Parquet format, which include all relevant file attributes as metadata. These files can be processed using the intake-spark package.
  2. Intake-ESM Scheme: For specific data collections, we create lightweight Intake data source files in CSV format containing only essential metadata. These files can be used with the intake-esm package and associated software for climate data analysis.
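To show how a lightweight CSV index of file metadata can be searched, here is a minimal sketch using only the Python standard library. It does not use intake-esm or intake-spark, and the column names (`path`, `variable`, `frequency`), the paths, and the `search` helper are assumptions made for illustration rather than the layout of the NCI index files.

```python
import csv
import io

# Hypothetical index contents; the column names and paths are assumptions.
INDEX_CSV = """path,variable,frequency
/data/x/tas_mon_2000.nc,tas,mon
/data/x/tas_day_2000.nc,tas,day
/data/x/pr_mon_2000.nc,pr,mon
"""


def search(index_rows, **criteria):
    """Return the rows whose columns match every keyword criterion."""
    return [
        row for row in index_rows
        if all(row.get(col) == value for col, value in criteria.items())
    ]


# Load the index (metadata only); the data files themselves are not opened.
index_rows = list(csv.DictReader(io.StringIO(INDEX_CSV)))

# Narrow the collection down to monthly "tas" files.
matches = search(index_rows, variable="tas", frequency="mon")
match_paths = [row["path"] for row in matches]
```

Only the files listed in `match_paths` would then need to be opened, which is what makes a small metadata index useful in front of a large collection.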

 
