
File Format

Catalog File

An intake catalog file is a YAML or JSON file that describes a collection of data sources stored in formats such as CSV, HDF5, and NetCDF. It can include the location, format, and any relevant metadata of each data source. Each dataset in the NCI dataset indexing scheme has its own catalog file in YAML or JSON format.

In the NCI intake-spark scheme, the catalog file is in YAML format.
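As a rough illustration of what a catalog describes, the sketch below builds and reads a minimal catalog using the general intake layout (a top-level `sources` mapping whose entries carry `description`, `driver`, `args`, and `metadata`). It uses JSON only so that it runs with the Python standard library; the same structure can be written in YAML. The source name, driver, path, and metadata values are hypothetical and are not taken from an actual NCI catalog.

```python
import json

# Hypothetical catalog with a single source. All names, paths, and metadata
# values are illustrative assumptions, not real NCI entries.
catalog = {
    "metadata": {"version": 1},
    "sources": {
        "example_dataset": {
            "description": "Hypothetical index of an example dataset",
            "driver": "parquet",  # assumed driver name
            "args": {"urlpath": "/path/to/example_dataset/index.parquet"},
            "metadata": {"source_formats": ["NetCDF"]},
        }
    },
}

# Serialise the catalog as it would be stored on disk.
catalog_text = json.dumps(catalog, indent=2)


def describe_sources(text):
    """Return (name, driver, urlpath) for every source in a catalog document."""
    doc = json.loads(text)
    return [
        (name, entry["driver"], entry["args"]["urlpath"])
        for name, entry in doc["sources"].items()
    ]


for name, driver, urlpath in describe_sources(catalog_text):
    print(f"{name}: driver={driver}, urlpath={urlpath}")
```

Each source entry records where the data lives (`args`), how to read it (`driver`), and any descriptive metadata, which matches the location, format, and metadata described above.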

Data Source File

In the NCI intake-spark scheme, the data source file collects all variable, dimension, and configuration attributes from all files of a dataset. The dataset indexes must therefore support heterogeneous schemas across datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar data storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Parquet files are also often used with analytical frameworks such as Apache Spark, which makes it easier to perform powerful "big data" analytics.
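To make the heterogeneous-schema requirement concrete, the following is a minimal standard-library sketch, not the NCI indexing code and not Parquet itself. It takes per-file records with differing and nested attributes, flattens the tree-structured metadata into dotted column names, and arranges the values column by column with `None` where a record lacks a field. The record contents and the helper names `flatten` and `to_columns` are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_columns(records):
    """Turn row records with differing fields into a column-oriented mapping."""
    rows = [flatten(r) for r in records]
    # Collect the union of column names in first-seen order.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # Missing fields become None so that every column has one value per row.
    return {name: [row.get(name) for row in rows] for name in names}


# Hypothetical attribute records harvested from two dataset files.
records = [
    {"file": "a.nc", "variable": "tas", "attrs": {"units": "K"}},
    {
        "file": "b.nc",
        "variable": "pr",
        "attrs": {"units": "kg m-2 s-1", "cell_methods": "time: mean"},
    },
]

columns = to_columns(records)
for name, values in columns.items():
    print(name, values)
```

A columnar format stores each of these columns contiguously, which is why a query that touches only a few attributes does not need to read the rest.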

