Catalog data source
NCI regularly crawls its data collections across multiple domains (climate and weather, earth observation and environmental science, and geophysics) and produces catalog data source files under project dk92 in /g/data/dk92/catalog/v2/. These catalogues can then be used with our data analysis software environments and packages.
File Format
In the NCI Intake-spark scheme, the catalog file for each dataset collects the variable attributes, dimensions, and configurations of all of the dataset's files. One important requirement of our dataset indexes is therefore to support heterogeneous schemas across datasets, such as a varying number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Additionally, Parquet files are often used in conjunction with analytical frameworks such as Apache Spark, making it easier to perform powerful "big data" analytics.
Data layout
The Apache Spark DataFrame is used to organise the NCI indexing data layout. A Spark DataFrame is a distributed collection of (meta)data organised into named columns. It is built on top of the Apache Spark core and provides a flexible programming interface similar to SQL or data frames in R and Python, enabling both high-performance access and multi-node scalability that exploits both in-memory and disk-based data processing.
Spark DataFrames can be used smoothly with the Intake framework. Intake makes it easy to create data catalogs from different data sources. Once a data source is defined in the catalog, it can be loaded into a Spark DataFrame using the Intake API, and the DataFrame can then be used for data analysis and manipulation on a powerful Spark cluster.
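As a rough illustration (not an NCI-specific recipe; the file path is a hypothetical placeholder), the intake-spark driver describes a DataFrame as a chain of Spark reader calls:

```python
import intake
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for intake-spark to attach to.
spark = SparkSession.builder.appName("intake-spark-demo").getOrCreate()

# Each [method, args] pair is applied to the Spark session in turn,
# equivalent to spark.read.format("parquet").load("<path>").
source = intake.open_spark_dataframe([
    ["read", ],
    ["format", ["parquet", ]],
    ["load", ["/path/to/example.parquet", ]],
])

df = source.to_spark()  # returns a pyspark.sql.DataFrame
df.printSchema()
```

The same chain can also be declared in an Intake catalog YAML file and loaded with intake.open_catalog().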
Tools
There are many tools that can be used to access the NCI Intake data catalog files and data source files. Two packages are widely used in the Intake-spark scheme: intake-spark and Spark SQL.
...
Using SQL to access Spark DataFrames provides several benefits, including the ability to leverage Spark's query optimisation and its multi-node, distributed processing to quickly work through large volumes of catalog data and gain valuable insights from it.
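As a minimal sketch, assuming a Spark session spark and a DataFrame df already loaded from the cmip6-oi10 index (see the Workflow section below; the view name is arbitrary):

```python
# Register the DataFrame as a temporary SQL view.
df.createOrReplaceTempView("cmip6_oi10")

# Run a distributed SQL query over the catalog data,
# e.g. count the indexed files per institution.
result = spark.sql("""
    SELECT attributes.institution_id AS institution_id,
           COUNT(*)                  AS n_files
    FROM   cmip6_oi10
    GROUP  BY attributes.institution_id
    ORDER  BY n_files DESC
""")
result.show(truncate=False)
```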
NCI platform and software environments
NCI provides multiple platforms for working with our data indexes, such as an interactive ARE JupyterLab session or Gadi PBS jobs.
NCI also provides software environments that support the NCI data indexes:
...
Workflow
Here we introduce the workflow for using intake-spark and Spark SQL. Users can adopt other tools and methods to access the NCI data indexes.
...
You can use one of the following two options to load catalog data into a Spark DataFrame.
Option 1: Loading the catalog file

The catalog file for each dataset contains information about a set of data sources, such as their location, format, and metadata. You can use the Intake API to open the catalog file and convert the chosen entry into a Spark DataFrame.

Example code: Loading cmip6-oi10 catalog data via catalog file
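A minimal sketch of this option, assuming intake-spark is installed; the catalog file name and entry name under /g/data/dk92/catalog/v2/ are illustrative placeholders:

```python
import intake
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for intake-spark to attach to.
spark = SparkSession.builder.appName("cmip6-oi10-catalog").getOrCreate()

# Hypothetical catalog file name under the dk92 catalog directory.
cat = intake.open_catalog("/g/data/dk92/catalog/v2/yaml/cmip6-oi10.yaml")

# Convert the chosen catalog entry into a Spark DataFrame.
df = cat["cmip6_oi10"].to_spark()
df.printSchema()
```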
Option 2: Loading data source files directly

In this option you can use Spark to read the Parquet data source files into a DataFrame directly, without going through the catalog file.

Example code: Loading cmip6-oi10 catalog data from data source file
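A minimal sketch of this option; the location of the Parquet data source files is an illustrative placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cmip6-oi10-datasource").getOrCreate()

# Read the dataset's Parquet data source files directly into a DataFrame.
df = spark.read.parquet("/g/data/dk92/catalog/v2/data/cmip6-oi10/*.parquet")
df.printSchema()
```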
Processing Catalog Data
Get columns of a Spark DataFrame
...
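For example, with the DataFrame loaded as df, the columns and the nested schema can be inspected with standard PySpark calls:

```python
# Top-level column names of the catalog DataFrame.
print(df.columns)

# Full schema, including nested fields such as the "attributes" struct.
df.printSchema()
```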
Get unique values of columns
Show unique values of a single column, e.g. "attributes.experiment_id" or "attributes.institution_id":
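A minimal sketch using the loaded DataFrame df:

```python
# Unique values of each column on its own.
df.select("attributes.experiment_id").distinct().show()
df.select("attributes.institution_id").distinct().show()
```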
Show unique values of combined columns, e.g. "attributes.experiment_id" and "attributes.institution_id" together:
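A minimal sketch showing the unique combinations of the two columns:

```python
# Unique (experiment_id, institution_id) combinations across the index.
df.select("attributes.experiment_id",
          "attributes.institution_id").distinct().show()
```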
You can pass a larger number to the show() function and set truncate=False to display more rows at their full length.
...