
File Format

Catalog File

An Intake catalog file is a YAML or JSON file that describes a collection of data sources, such as CSV, HDF5, NetCDF, and other formats. It can include information about the location, format, and any relevant metadata associated with each data source. In the NCI dataset indexing scheme, each dataset has its own catalog file in YAML or JSON format.

The catalog files of the NCI Intake-spark scheme are in YAML format.
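
As a minimal sketch of how such a catalog file is used (the path and source name below are placeholders, not actual NCI entries), a catalog can be opened and inspected with the Intake API:

    import intake

    # Hypothetical location of a dataset's catalog file.
    cat = intake.open_catalog("/path/to/dataset_catalog.yaml")

    # List the data sources described by the catalog and inspect one entry.
    print(list(cat))
    print(cat["example_dataset"].describe())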

Data Source File

In the NCI Intake-spark scheme, the data source file collects the attributes of every variable, dimension, and configuration from all files in a dataset. The dataset indexes must therefore support heterogeneous schemas across datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Additionally, Parquet files are commonly used in conjunction with analytical frameworks such as Apache Spark, making it easier to perform powerful "big data" analytics.
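
As an illustration (the index path is a placeholder), a Parquet data source index can be read directly into Spark, which recovers the schema, including any nested metadata columns, from the file itself:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("nci-index-example").getOrCreate()

    # Hypothetical path to one of the Parquet data source indexes.
    df = spark.read.parquet("/path/to/dataset_index.parquet")

    # Parquet is self-describing: the schema, including any nested
    # (tree-structured) metadata columns, is read from the file.
    df.printSchema()
    df.show(5)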

Data Layout

The Apache Spark DataFrame is used to organise the NCI indexing data layout. A Spark DataFrame is a distributed collection of (meta)data organised into named columns. It is built on top of the Apache Spark core and provides a flexible programming interface similar to SQL or data frames in R or Python, enabling both high-performance access and multi-node scalability by exploiting in-memory as well as disk-based data processing.
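
As a small, self-contained illustration (the values are made up and unrelated to the NCI indexes), a Spark DataFrame with named columns can be created and queried through the DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    # A tiny in-memory DataFrame with named columns (illustrative values only).
    df = spark.createDataFrame(
        [("tasmax", "K", 3), ("pr", "kg m-2 s-1", 3)],
        schema=["variable", "units", "ndims"],
    )

    # Column-oriented, SQL-like operations on the named columns.
    df.select("variable", "units").filter(df["ndims"] == 3).show()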

Spark DataFrames integrate smoothly with the Intake framework. Intake makes it easy to create data catalogs from different data sources. Once a data source is defined in the catalog, it can be loaded into a Spark DataFrame using the Intake API. The Spark DataFrame can then be used for data analysis and manipulation on a powerful Spark cluster.
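
As a sketch of that path (the catalog path and source name are placeholders, and the entry is assumed to use the intake-spark "spark_dataframe" driver, whose sources expose a to_spark() method):

    import intake

    # Hypothetical catalog and source; the entry is assumed to use the
    # intake-spark "spark_dataframe" driver.
    cat = intake.open_catalog("/path/to/nci_catalog.yaml")
    source = cat["example_dataset"]

    # Materialise the source as a live Spark DataFrame.
    df = source.to_spark()
    df.printSchema()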

Tools

Two packages are widely used in the Intake-spark scheme: intake-spark and Spark SQL.

Intake-spark

Intake-spark is an Intake plugin that provides a unified interface for loading and accessing data in Apache Spark using the Intake data catalog system. Spark is a powerful distributed computing framework for processing large-scale data, but working with it can be challenging because it requires specific knowledge of Spark's API and data sources. Intake-spark simplifies this process by providing a consistent and intuitive interface for loading data into Spark DataFrames. Intake-spark supports several file formats, including Apache Parquet, Avro, CSV, and JSON, and can read data from various storage systems such as HDFS, Amazon S3, and Azure Blob Storage. Intake-spark also allows users to configure advanced settings such as partitioning and caching for improved performance.
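
As a sketch of the driver interface (the file path is a placeholder, and this assumes intake-spark's documented pattern of expressing the Spark reader chain as a list of method calls):

    import intake  # with intake-spark installed, the "spark_dataframe" driver is available

    # Each entry is a [method, args] pair applied in order to the Spark session,
    # here equivalent to spark.read.format("parquet").load("/path/to/index.parquet").
    source = intake.open_spark_dataframe([
        ["read"],
        ["format", ["parquet"]],
        ["load", ["/path/to/index.parquet"]],
    ])

    df = source.to_spark()  # returns a pyspark.sql.DataFrame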

Spark SQL 

Using SQL to access Apache Spark DataFrames provides a convenient and familiar way for users to query and manipulate data, particularly for those who are already familiar with SQL. The Spark SQL module provides an interface for executing SQL queries against Spark DataFrames, making it easier for users to work with structured data and perform complex data analysis.

To use SQL in accessing Spark DataFrames, you first need to create a SparkSession object, which is the entry point to using Spark functionality. Then, you can load your data into a DataFrame, register the DataFrame as a temporary table, and execute SQL queries on the DataFrame using the Spark SQL module.
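
These steps look roughly as follows (the index path, view name, and column names are placeholders):

    from pyspark.sql import SparkSession

    # The SparkSession is the entry point to Spark functionality.
    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # Load data into a DataFrame (hypothetical Parquet index path).
    df = spark.read.parquet("/path/to/dataset_index.parquet")

    # Register the DataFrame as a temporary view and query it with SQL.
    # The column names below are placeholders for whatever the index contains.
    df.createOrReplaceTempView("dataset_index")
    result = spark.sql(
        "SELECT file_path, variable FROM dataset_index WHERE variable = 'tasmax'"
    )
    result.show()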

Querying Spark DataFrames with SQL can provide several benefits, including the ability to leverage the powerful query optimization and distributed processing capabilities of Spark. By executing SQL queries on Spark DataFrames, users can take advantage of Spark's multi-node, distributed processing to quickly process large volumes of data and gain valuable insights from it.

Workflow

Here we introduce a workflow for using intake-spark and Spark SQL, sketched below. Users can adopt other tools and methods to access the NCI data indexes.
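
A hedged end-to-end sketch of that workflow (the catalog path, source name, view name, and column names are all placeholders, and the catalog entry is assumed to use the spark_dataframe driver):

    import intake
    from pyspark.sql import SparkSession

    # 1. Start (or reuse) a Spark session; cluster configuration is site-specific.
    spark = SparkSession.builder.appName("nci-index-workflow").getOrCreate()

    # 2. Open the dataset's Intake catalog and load the index as a Spark DataFrame.
    cat = intake.open_catalog("/path/to/nci_catalog.yaml")   # placeholder path
    df = cat["example_dataset"].to_spark()                   # placeholder source name

    # 3. Query the index with Spark SQL.
    df.createOrReplaceTempView("dataset_index")
    files = spark.sql(
        "SELECT file_path FROM dataset_index WHERE variable = 'tasmax'"  # placeholder columns
    )
    files.show()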


