In the NCI Intake-spark scheme, the catalog file for a dataset collects the variable, dimension and configuration attributes of all of the dataset's files in one place. The dataset indexes therefore need to support heterogeneous schemas across datasets, as described below.

Catalog data source

NCI regularly crawls its data collections across multiple domains (climate and weather, earth observation and environmental science, and geophysics) and produces catalog data source files under project dk92 (in /g/data/dk92/catalog/v2/). These catalogues can then be used with our data analysis software environments and packages.

File Format

One important requirement of our dataset indexes is to support heterogeneous schemas among datasets, such as a variable number of columns and tree-structured metadata. To address this, we use the Apache Parquet file format for the data source indexes. As a columnar storage format, Parquet provides many benefits, including improved storage efficiency, increased query performance, and reduced data loading times. Additionally, Parquet files are often used in conjunction with analytical frameworks such as Apache Spark, making it easier to perform powerful "big data" analytics.

Data layout

The Apache Spark DataFrame is used to organise the NCI indexing data layout. A Spark DataFrame is a distributed collection of (meta)data organised into named columns. It is built on top of the Apache Spark core and provides a flexible programming interface similar to SQL or to data frames in R and Python, enabling both high-performance access and multi-node scalability that exploits in-memory and disk-based data processing.

Spark DataFrames work smoothly with the intake framework. Intake makes it easy to create data catalogs from different data sources. Once a data source is defined in the catalog, it can be loaded into a Spark DataFrame using the intake API. The Spark DataFrame can then be used for data analysis and manipulation on a powerful Spark cluster.

Tools

Many tools can be used to access the NCI intake data catalog files and data source files. Here we introduce the two packages most widely used in the Intake-spark scheme: intake-spark and Spark SQL.

...

Using SQL to query Spark DataFrames provides several benefits, including the ability to leverage Spark's query optimization and multi-node, distributed processing. By executing SQL queries on Spark DataFrames, users can quickly process large volumes of catalog data and gain valuable insights from it.
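
As a minimal sketch (not taken from the NCI examples), a loaded catalog DataFrame can be registered as a temporary view and queried with Spark SQL; the view name cmip6_oi10 and the filter value are illustrative only.

Code Block
languagepy
from pyspark.sql import SparkSession

# Assumes `df` is a catalog DataFrame loaded as shown in the Workflow section below.
spark = SparkSession.builder.getOrCreate()

# Register the DataFrame as a temporary SQL view (the view name is illustrative).
df.createOrReplaceTempView("cmip6_oi10")

# Spark plans, optimizes and distributes the query across the cluster.
result = spark.sql("""
    SELECT DISTINCT attributes.institution_id, attributes.experiment_id
    FROM cmip6_oi10
    WHERE attributes.experiment_id LIKE 'ssp%'
""")
result.show(truncate=False)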

NCI platform and software environments

NCI provides multiple platforms for working with our data indexes, such as interactive ARE JupyterLab sessions or Gadi PBS jobs.

NCI also provides software environments that each support the NCI data indexes:

...

Workflow

Here we introduce the workflow for using intake-spark and Spark SQL. Users can adopt other tools and methods to access the NCI data indexes.

...

You can use one of two options to load catalog data into a Spark DataFrame.

Option 1: Loading the catalog file.

The catalog file for each dataset contains information about a set of data sources, such as their location, format, and metadata. You can use the intake.open_catalog() method to open the catalog file and load the data sources specified within it. It returns a Catalog object, which allows users to access and load the data sources described in the catalog. You can then convert a data source into a Spark DataFrame via its to_spark() method.

Option 2: Loading data source files directly.

In this option, you use the intake.open_spark_dataframe() method to load the data source files (in Parquet format) directly. This method takes a set of parameters specifying the data source and any necessary options, and returns a data source whose to_spark() method yields a Spark DataFrame ready for analysis and processing.

Example code: Loading cmip6-oi10 catalog data via catalog file

Code Block
languagepy
import intake
catalog = intake.open_catalog('/g/data/dk92/catalog/yml/cmip6-oi10.yml')
df = catalog.mydata.to_spark()


Example code: Loading cmip6-oi10 catalog data from data source file

Code Block
languagepy
import intake
cat_path="/g/data/dk92/catalog/v2/data/cmip6-oi10"
source = intake.open_spark_dataframe([
    ['read', ],
    ['option', ["mergeSchema", "true"]],
    ['parquet', [cat_path,]]
    ])
df = source.to_spark()


Processing Catalog Data

Get columns of a Spark DataFrame

...
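
As a quick, generic PySpark illustration (not part of the original examples), the column names and nested schema of the catalog DataFrame can be inspected with the standard DataFrame methods.

Code Block
languagepy
# Generic PySpark sketch: inspect the structure of the catalog DataFrame `df`.
print(df.columns)   # top-level column names
df.printSchema()    # full nested schema, including the fields under "attributes"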

Get unique values of columns

Show unique values of the single columns "attributes.experiment_id" and "attributes.institution_id".

Show unique values of the combined columns "attributes.experiment_id" and "attributes.institution_id".


Code Block
languagepy
In[3]: df.select("attributes.experiment_id")\
.distinct().show() 
+--------------+
| experiment_id|
+--------------+
|   ssp534-over|
|piClim-histall|
| esm-piControl|
|        ssp585|
|     piControl|
|        ssp460|
|piClim-control|
|        ssp370|
|        ssp126|
|        ssp119|
|        ssp245|
|       1pctCO2|
|  abrupt-4xCO2|
|          amip|
|    historical|
|piClim-histghg|
|        ssp434|
|      esm-hist|
|piClim-histaer|
|      hist-aer|
+--------------+
only showing top 20 rows 
 
In[3]: df.select("attributes.institution_id")\
.distinct().show() 
+-----------------+
|institution_id   |
+-----------------+
|UA               |
|IPSL             |
|E3SM-Project     |
|CCCma            |
|CAS              |
|MIROC            |
|NASA-GISS        |
|MRI              |
|NCC              |
|HAMMOZ-Consortium|
|NCAR             |
|NUIST            |
|NOAA-GFDL        |
|NIMS-KMA         |
|BCC              |
|CNRM-CERFACS     |
|AWI              |
|MPI-M            |
|KIOST            |
|MOHC             |
+-----------------+
only showing top 20 rows 



Code Block
languagepy
In[3]: df.select("attributes.institution_id",\
          "attributes.experiment_id")\ .distinct().show(40,truncate=False)

+--------------+--------------+
|institution_id|experiment_id |
+--------------+--------------+
|NCC           |ssp585        |
|NCAR          |esm-piControl |
|MRI           |esm-hist      |
|NOAA-GFDL     |ssp370        |
|E3SM-Project  |ssp585        |
|UA            |ssp245        |
|NUIST         |ssp245        |
|UA            |ssp370        |
|CNRM-CERFACS  |ssp245        |
|NOAA-GFDL     |ssp585        |
|CMCC          |abrupt-4xCO2  |
|MIROC         |ssp534-over   |
|CAMS          |1pctCO2       |
|CNRM-CERFACS  |ssp126        |
|NASA-GISS     |piClim-histall|
|CMCC          |piControl     |
|MRI           |historical    |
|BCC           |piControl     |
|MPI-M         |esm-hist      |
|NCAR          |historical    |
|MIROC         |ssp126        |
|MOHC          |piClim-histall|
|CAMS          |piControl     |
|MIROC         |ssp370        |
|MOHC          |ssp534-over   |
|MPI-M         |amip          |
|NOAA-GFDL     |ssp126        |
|CCCma         |piClim-histall|
|CAMS          |abrupt-4xCO2  |
|CAMS          |amip          |
|CNRM-CERFACS  |ssp370        |
|KIOST         |ssp126        |
|CNRM-CERFACS  |ssp585        |
|NCAR          |piClim-control|
|UA            |ssp126        |
|MOHC          |ssp119        |
|CMCC          |amip          |
|MRI           |abrupt-4xCO2  |
|NUIST         |ssp126        |
|NASA-GISS     |piClim-control|
+--------------+--------------+
only showing top 40 rows


You can specify a larger number in the show() function and set truncate=False to display more rows at full length.
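
For example (an illustrative snippet, not from the original page), you can count the distinct combinations first and then pass that number to show() so that every row is displayed.

Code Block
languagepy
# Count the distinct combinations, then display them all at full length.
combos = df.select("attributes.institution_id", "attributes.experiment_id").distinct()
n_combos = combos.count()
combos.show(n_combos, truncate=False)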

...