File Format
Catalog File
In the NCI intake-spark scheme, the catalog file
Data Source File
In the NCI intake-spark scheme, the catalog data source file collects the attributes of each individual variable, dimension, and configuration across all dataset files. The dataset indexes must therefore support heterogeneous schemas among datasets, such as a variable number of columns and tree-structured metadata, among others. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar data storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Parquet files are also often used in conjunction with analytical frameworks such as Apache Spark, which makes it easier to perform powerful "big data" analytics.
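To illustrate the heterogeneous-schema problem described above, here is a minimal sketch in plain Python. It is not the NCI indexing code. It only shows how row-oriented index entries with different attribute sets can be pivoted into a column-oriented layout, where a missing attribute becomes a null and tree-structured metadata stays nested. The helper `to_columns`, the record fields (`file`, `variable`, `dims`, `units`), and the file names are hypothetical and chosen for illustration only.

```python
def to_columns(records):
    """Pivot row-oriented records into a column-oriented dict.

    Attributes missing from a record are stored as None, which mirrors
    how a nullable column in a columnar format represents absent values.
    """
    # Collect the union of attribute names, preserving first-seen order.
    names = []
    for rec in records:
        for key in rec:
            if key not in names:
                names.append(key)
    # Build one list per column, with one slot per record.
    return {name: [rec.get(name) for rec in records] for name in names}


# Hypothetical index entries: the two records do not share the same attributes,
# and "dims" holds nested (tree-structured) metadata.
records = [
    {"file": "a.nc", "variable": "tas", "dims": {"time": 12, "lat": 180}},
    {"file": "b.nc", "variable": "pr", "units": "kg m-2 s-1"},
]

columns = to_columns(records)
for name, values in columns.items():
    print(name, values)
```

In practice a Parquet library handles this layout, together with encoding and compression, on disk. The sketch only shows why a columnar format with nullable and nested columns suits indexes whose entries do not all carry the same attributes.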
...