The details of our intake-esm software are described here. Once intake-esm is loaded in your environment, you can start using our intake-esm data catalogs. NCI currently provides intake-esm data catalogs for the following CMIP5/CMIP6 data collections on NCI:
- Earth System Grid Federation (ESGF) Australian CMIP6-era Datasets
- Earth System Grid Federation (ESGF) Replicated CMIP6-era Datasets
- Earth System Grid Federation (ESGF) Replicated CMIP5-era Datasets
- CSIRO-BOM ACCESS1-3 model output prepared for CMIP5
Our intake-esm catalog files are all located on the filesystem under /g/data/dk92/catalog/v2/esm. Note that you must have joined project dk92 to access them.
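Each collection above has its own catalog.json entry point under that directory. As a minimal sketch (assuming a layout with one catalog.json per collection subdirectory), you can discover the available catalogs like this:

```python
import glob

# List every catalog.json under the dk92 intake-esm catalog root
for path in sorted(glob.glob("/g/data/dk92/catalog/v2/esm/*/catalog.json")):
    print(path)
```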
Operations
First of all, you need to open a catalog file via intake's open_esm_datastore method.
```python
import intake

cmip6 = intake.open_esm_datastore("/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json")
```
Calling the loaded esm_datastore gives an overview of its contents.
```python
cmip6
```
The datastore exposes a df attribute, which is a pandas DataFrame.
```python
cmip6.df.head()
```
Using `cmip6.df.columns` lists all the columns/keys that can be used to search the data.
```python
cmip6.df.columns
```
The unique() method lists all the unique values of each column as a dictionary, so you can see which values are available to search on.
```python
values_dict = cmip6.unique()
print(values_dict)
```
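If you only want the available options for a single column, you can also query the underlying DataFrame directly with standard pandas (a minimal sketch; `experiment_id` is just an example column):

```python
# List the unique values of one column via the underlying DataFrame
print(cmip6.df["experiment_id"].unique())
```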
Let's select a subset by passing the search() method a combination of columns. The returned summary shows that the subset contains 18 files spanning multiple metadata combinations.
```python
subset = cmip6.search(
    source_id=["MPI-ESM-1-2-HAM", "NorESM2-LM"],
    experiment_id=["ssp370-lowNTCF"],
    variable_id="tas",
    table_id="Amon",
    grid_label="gn",
)
subset
```
The search results are often split into multiple keys based on metadata columns, and each key represents a unique combination of metadata attributes. This allows you to distinguish between different datasets that match your query.
You can list all the keys as below:
```python
subset.keys()
```
In this case, it contains the following keys:
```
['f.AerChemMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.ssp370-lowNTCF.r1i1p1f1.mon.atmos.Amon.tas.gn.v20190627',
 'f.AerChemMIP.NCC.NorESM2-LM.ssp370-lowNTCF.r1i1p1f1.mon.atmos.Amon.tas.gn.v20200206',
 'f.AerChemMIP.NCC.NorESM2-LM.ssp370-lowNTCF.r2i1p1f1.mon.atmos.Amon.tas.gn.v20200206',
 'f.AerChemMIP.NCC.NorESM2-LM.ssp370-lowNTCF.r3i1p1f1.mon.atmos.Amon.tas.gn.v20200206']
```
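Each key is simply the catalog's grouping attributes joined with dots. In recent intake-esm versions you can inspect which attributes are used for grouping (a sketch; the exact attribute path may vary between intake-esm releases):

```python
# Metadata attributes used to group files into dataset keys
print(subset.esmcat.aggregation_control.groupby_attrs)
```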
Now you can open the datasets directly via the to_dataset_dict() API. It is recommended to start a Dask cluster first to speed up the loading.
For example, you can quickly set up a local Dask cluster using the resources of a single node as below.
```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
```
Now you can invoke the to_dataset_dict() API, which returns a dictionary of xarray datasets, one entry per key.
```python
dset_dict = subset.to_dataset_dict()
print(dset_dict)
```
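As a quick sanity check, you can loop over the dictionary and print each dataset's dimensions (a minimal sketch using standard xarray attributes):

```python
# Print each key together with the dimension sizes of its dataset
for key, ds in dset_dict.items():
    print(key, dict(ds.sizes))
```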
Finally, you can load a single dataset using its key.
```python
ds = dset_dict['f.AerChemMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.ssp370-lowNTCF.r1i1p1f1.mon.atmos.Amon.tas.gn.v20190627']
ds
```
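From here the dataset behaves like any other xarray object. For example, a simple global-mean time series of tas could be computed as below (a sketch assuming the grid uses lat/lon dimension names, which is typical for Amon data on the native grid):

```python
# Area-unweighted global mean of near-surface air temperature at each time step
tas_mean = ds["tas"].mean(dim=["lat", "lon"])
tas_mean.compute()
```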