Catalog File

An Intake catalog file is a YAML or JSON file that describes a collection of data sources, such as CSV, HDF5, NetCDF, and other formats. It can include information about the location, format, and any relevant metadata associated with each data source. In the NCI dataset indexing scheme, each dataset has its own catalog file in YAML or JSON format.

The catalog file of the NCI Intake-spark scheme is in YAML format.
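As a minimal sketch of how such a catalog is used (the catalog path and entry name below are hypothetical examples, not part of the NCI scheme itself), a YAML catalog can be opened and inspected with the Intake Python library:

    import intake

    # Open a YAML catalog file (path is a hypothetical example).
    catalog = intake.open_catalog("my_dataset_catalog.yaml")

    # List the data sources described by the catalog.
    print(list(catalog))

    # Inspect the metadata recorded for one entry
    # ("example_source" is a hypothetical entry name).
    source = catalog["example_source"]
    print(source.describe())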

Data Source File

In the NCI Intake-spark scheme, the data source file collects the attributes of every variable, dimension, and configuration across all files of a dataset. The dataset indexes therefore need to support heterogeneous schemas across datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Additionally, Parquet files are often used in conjunction with analytical frameworks such as Apache Spark, making it easier to perform powerful "big data" analytics.
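A brief sketch of reading such a Parquet index in Python, assuming an index file has already been written (the file name and column names below are hypothetical examples):

    import pandas as pd

    # Read a Parquet data-source index into a pandas DataFrame.
    # The file name is a hypothetical example.
    index_df = pd.read_parquet("dataset_index.parquet")

    # Columnar storage allows reading only the columns that are needed,
    # e.g. just the file paths and variable names (hypothetical column names).
    subset = pd.read_parquet(
        "dataset_index.parquet",
        columns=["file_path", "variable"],
    )
    print(subset.head())

Reading only the required columns is one of the practical benefits of the columnar layout mentioned above: queries touch a fraction of the bytes on disk.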

Data layout

The Apache Spark DataFrame is used to organise the NCI indexing data layout. A Spark DataFrame is a distributed collection of (meta)data organized into named columns. It is built on top of the Apache Spark core and provides a flexible programming interface similar to SQL or data frames in R or Python, enabling both high-performance access and multi-node scalability that exploits in-memory as well as disk-based data processing.
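As a minimal sketch of this layout in use (the file name, column names, and filter value below are hypothetical examples), a Parquet index can be loaded as a Spark DataFrame and queried through either the DataFrame API or plain SQL:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("nci-index-example").getOrCreate()

    # Load a Parquet index as a distributed Spark DataFrame.
    df = spark.read.parquet("dataset_index.parquet")

    # Query with the DataFrame API: filter on a metadata column
    # and select the file paths of the matching records.
    matches = df.filter(df["variable"] == "tas").select("file_path")
    matches.show(5)

    # The same query expressed in SQL.
    df.createOrReplaceTempView("dataset_index")
    spark.sql("SELECT file_path FROM dataset_index WHERE variable = 'tas'").show(5)

Because the DataFrame is partitioned across the cluster, the same query scales from a single node to many nodes without changes to the code.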

...