A data catalog is a centralized repository designed to help users manage and organize data across various sources, formats, and storage systems. It acts as a metadata management system that stores detailed information about datasets, such as their descriptions, locations, structure, schema, data types, and access permissions.

The primary goal of a data catalog is to provide an organized and easily accessible overview of the data available.
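To make the idea concrete, the sketch below shows the kind of metadata entry a catalog might hold for one dataset. The dataset name, fields, and path are invented for illustration, not taken from any real NCI catalog:

```python
# A toy data-catalog entry: metadata only, no actual data is touched.
# All names and paths here are hypothetical.
catalog = {
    "sea_surface_temp": {
        "description": "Daily sea surface temperature",
        "location": "/data/ocean/sst/",   # hypothetical path
        "format": "netCDF",
        "schema": {"time": "datetime64", "lat": "float64",
                   "lon": "float64", "sst": "float32"},
        "access": "read-only",
    },
}

def describe(name):
    """Return the stored metadata for a dataset, without opening any files."""
    return catalog[name]

entry = describe("sea_surface_temp")
print(entry["format"])  # the catalog answers questions from metadata alone
```

The point is that questions like "what format is this dataset in?" are answered from the catalog's metadata, without reading the data itself.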

While a data catalog helps organize, manage, and provide access to datasets, indexing improves the speed and efficiency of searching for and retrieving specific data. Indexing involves creating a data structure (an index) that stores pointers to the locations of the data, allowing rapid access to relevant information.

By using an index, users can search and access datasets much more quickly than by scanning the entire data collection. In Intake, this indexing functionality helps users search data efficiently and retrieve results in a fraction of the time a full scan would take.
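The contrast between a full scan and an indexed lookup can be sketched in a few lines. The records and keys below are made up for illustration:

```python
# A minimal sketch of indexing: the index maps a search key straight
# to record locations, so a query avoids touching every record.
records = [
    {"id": 0, "variable": "tas"},
    {"id": 1, "variable": "pr"},
    {"id": 2, "variable": "tas"},
]

def full_scan(variable):
    """O(n): examine every record to find matches."""
    return [r["id"] for r in records if r["variable"] == variable]

# Build the index once: key -> list of matching record ids.
index = {}
for r in records:
    index.setdefault(r["variable"], []).append(r["id"])

def indexed_lookup(variable):
    """O(1) dictionary access instead of a scan."""
    return index.get(variable, [])

assert full_scan("tas") == indexed_lookup("tas")  # same answer, faster path
```

The one-time cost of building the index pays off whenever the collection is queried repeatedly, which is the usual situation for a shared data catalog.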

Intake is a Python package for data access and management. It offers an intuitive interface for loading and exploring data from various sources, whether local or remote. With Intake, users can define and share data catalogs, which contain metadata about data sources, and leverage its lazy loading feature to work with large datasets (even those exceeding petabytes). The indexing capabilities in Intake ensure that data can be accessed beyond the available memory, improving performance for large-scale operations.
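The lazy-loading idea can be illustrated with the standard library alone: rows are parsed only when requested, so a file larger than memory can still be streamed record by record. (Intake itself defers to backends such as dask for this; the tiny in-memory "file" and its columns below are invented for illustration.)

```python
import csv
import io

# Stand-in for a large on-disk CSV file; columns are hypothetical.
raw = io.StringIO(
    "time,sst\n"
    "2020-01-01,14.2\n"
    "2020-01-02,14.5\n"
    "2020-01-03,14.1\n"
)

def lazy_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(handle):
        yield row

rows = lazy_rows(raw)   # nothing has been read yet
first = next(rows)      # only the first record is materialized
print(first["sst"])     # -> 14.2
```

Because the generator holds only one row at a time, memory use stays flat regardless of how long the input is, which is the same property that lets Intake work with datasets larger than RAM.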

...

At NCI, we use two dataset indexing schemes based on different Intake techniques:

  1. Intake-Spark Scheme: For each data collection hosted by NCI, we generate Intake data source files in Parquet format, which include all relevant file attributes as metadata. These files can be processed using the intake-spark package.
  2. Intake-ESM Scheme: For specific data collections, we create lightweight Intake data source files in CSV format containing essential metadata. These files can be easily managed using the intake-esm indexes and associated software for climate data analysis.
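The CSV-based scheme can be mimicked with a short sketch: a lightweight index holds one row of metadata per file, and queries filter the index rather than opening any data files. The column names and paths below are hypothetical, not NCI's actual schema:

```python
import csv
import io

# Stand-in for a lightweight CSV index file: one metadata row per data file.
index_csv = io.StringIO(
    "path,variable,frequency\n"
    "/data/a/tas_day.nc,tas,day\n"
    "/data/a/pr_day.nc,pr,day\n"
    "/data/a/tas_mon.nc,tas,mon\n"
)

def search(handle, **query):
    """Return index rows whose columns match every key=value in the query."""
    handle.seek(0)
    return [row for row in csv.DictReader(handle)
            if all(row.get(k) == v for k, v in query.items())]

hits = search(index_csv, variable="tas", frequency="day")
print([h["path"] for h in hits])  # -> ['/data/a/tas_day.nc']
```

This is the shape of a catalog query in the CSV scheme: narrow the index down to the files of interest first, then open only those files for analysis.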
