
File Format

Catalog File

In NCI intake-spark scheme, the catalog file

Data Source File

In the NCI Intake-spark scheme, the catalog data source file collects the attributes of each variable, dimension, and configuration across all dataset files. The dataset indexes must therefore support heterogeneous schemas among datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet offers improved storage efficiency, faster queries, and shorter data loading times. Parquet files are also commonly used with analytical frameworks such as Apache Spark, which makes it easier to run "big data" analytics on the indexes.
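As a rough illustration of the heterogeneous-schema problem described above, the sketch below pivots per-file metadata records with differing fields into a column-oriented layout, padding missing fields with nulls. It uses only the Python standard library and does not write Parquet itself. The helper `to_columns`, the file names, and the field names (`file`, `variable`, `dims`, `attrs`) are all hypothetical and are not taken from the NCI catalog.

```python
from __future__ import annotations

from typing import Any


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-style records into columns, padding missing fields with None."""
    # Collect the union of field names in first-seen order.
    names: list[str] = []
    for rec in records:
        for key in rec:
            if key not in names:
                names.append(key)
    # Build one list per column; absent fields become None (a null).
    return {name: [rec.get(name) for rec in records] for name in names}


# Hypothetical metadata records: each file exposes a different set of fields,
# and some values are nested (lists, dicts) rather than flat scalars.
records = [
    {"file": "a.nc", "variable": "tas", "dims": ["time", "lat", "lon"]},
    {"file": "b.nc", "variable": "pr", "dims": ["time", "lat", "lon"],
     "attrs": {"units": "kg m-2 s-1"}},
    {"file": "c.nc", "variable": "orog"},
]

columns = to_columns(records)
for name, values in columns.items():
    print(name, values)
```

A columnar format such as Parquet stores each of these columns contiguously and can represent both the nulls and the nested values, which is why it suits indexes whose schema varies from one dataset to the next.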

...
