File Format

Catalog File

In the NCI intake-spark scheme, the catalog file 

Data Source File

In the NCI intake-spark scheme, the catalog data source file collects all attributes of a single variable, together with the dimensions and configurations, from all dataset files. The dataset indexes must therefore support heterogeneous schemas across datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet offers improved storage efficiency, faster queries, and reduced data loading times. Parquet files are also commonly used with analytical frameworks such as Apache Spark, which makes it easier to run "big data" analytics.
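The idea of a columnar index over heterogeneous records can be illustrated with a small, standard-library-only sketch. It does not read or write Parquet and is not the NCI implementation; the record fields (`file`, `variable`, `dims`, `units`, `cell_methods`) and the helper `to_columns` are hypothetical names chosen for illustration. It shows how records with differing attribute sets can share one column-oriented layout, with missing attributes stored as nulls and nested values kept inside a column.

```python
# Hypothetical per-variable index records. The attribute names and values are
# illustrative assumptions, not the actual NCI index schema.
records = [
    {"file": "a.nc", "variable": "tas", "dims": ["time", "lat", "lon"], "units": "K"},
    {"file": "b.nc", "variable": "pr", "dims": ["time", "lat", "lon"]},
    {"file": "c.nc", "variable": "orog", "dims": ["lat", "lon"], "units": "m",
     "cell_methods": "area: mean"},
]


def to_columns(rows):
    """Pivot row records into a column-oriented mapping.

    The column set is the union of all keys, in first-seen order.
    Attributes missing from a record are stored as None.
    """
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in rows] for name in names}


columns = to_columns(records)

# A query on one attribute only needs to touch that column.
units = columns["units"]
print(list(columns))
print(units)
```

In a columnar layout like this, a query that filters on one attribute reads only that column, which is the property the paragraph above attributes to Parquet.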

...