For this example, we will use the TileDB-CF-Py library to convert an example NetCDF file to TileDB arrays in a Python Jupyter notebook.

The following libraries will need to be imported for this example:

import tiledb
import tiledb.cf
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil

The nci_ipynb package can change the working directory to the location of the notebook. This is particularly useful in an ARE JupyterLab session, where the default working directory is always the user's home directory.

import nci_ipynb
os.chdir(nci_ipynb.dir())

Let's look at an example NetCDF file from the ERA5 monthly averaged data on pressure levels collection and print its metadata.

netcdf_file = '/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/w_era5_moda_pl_20200101-20200131.nc'
ds = xr.open_dataset(netcdf_file)
print(ds)

This NetCDF file contains a variable called "w" with 4 coordinates: "longitude", "latitude", "level" and "time".


<xarray.Dataset> Size: 154MB
Dimensions:    (longitude: 1440, latitude: 721, level: 37, time: 1)
Coordinates:
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) int32 148B 1 2 3 5 7 10 20 ... 875 900 925 950 975 1000
  * time       (time) datetime64[ns] 8B 2020-01-01
Data variables:
    w          (time, level, latitude, longitude) float32 154MB ...
Attributes:
    Conventions:  CF-1.6
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....

 

Next we can define our output directory and the folder where the TileDB arrays will be written to:

output_dir = './dataset/tiledb'

### Folder where TileDB arrays will be written to
uri1 = f"{output_dir}/from_netcdf"

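# Start from a clean output directory; ignore_errors avoids a failure when it does not exist yet
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir)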

Now we can convert the NetCDF file above using the from_netcdf function provided by the TileDB-CF-Py library:

tiledb.cf.from_netcdf(netcdf_file, uri1)

Let's check the file structure of the resulting TileDB arrays:

!tree $uri1

Each coordinate and variable of the original NetCDF file has been converted into a separate array. Specifically, the variable "w" is converted into the TileDB array named "array4".

./dataset/tiledb/from_netcdf
├── __group
│   └── __1713868510463_1713868510463_bd0772ad9f2541b781d54ca17d7334be_2
├── __meta
│   └── __1713868510497_1713868510497_4b731228555e45c6bf7e89b5b7514905
├── __tiledb_group.tdb
├── array0
│   ├── __commits
│   │   └── __1713868510520_1713868510520_01ac756b068543e68d7790ac1bab5283_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510520_1713868510520_01ac756b068543e68d7790ac1bab5283_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510518_1713868510518_8da9574e9ee94e91be73ee244aaa61d6
│   └── __schema
│       └── __1713868510352_1713868510352_4b959e9d22794ef3a529b5f0b0975caa
├── array1
│   ├── __commits
│   │   └── __1713868510623_1713868510623_780b26db1aa4417f86a60e68a8cf70cb_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510623_1713868510623_780b26db1aa4417f86a60e68a8cf70cb_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510622_1713868510622_c9f5ea9235674c44ab93f89c560d3e1f
│   └── __schema
│       └── __1713868510411_1713868510411_0e59d6e79ed1438eb1b9d227821e2afb
├── array2
│   ├── __commits
│   │   └── __1713868510781_1713868510781_c823542d13f14a25adcc249c4b623341_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510781_1713868510781_c823542d13f14a25adcc249c4b623341_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510780_1713868510780_0f40c0ce66364be78606ec19a50fa029
│   └── __schema
│       └── __1713868510425_1713868510425_3dfa3d3d144545d2afcb6ca832fa0546
├── array3
│   ├── __commits
│   │   └── __1713868510900_1713868510900_a19e8fd0ab3d4cfd86ef3db371d7ec6d_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510900_1713868510900_a19e8fd0ab3d4cfd86ef3db371d7ec6d_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510899_1713868510899_f16404fbdb6349bda5bfab183f477759
│   └── __schema
│       └── __1713868510438_1713868510438_cb2c5687f74d4fbab05eb33a21644d0a
└── array4
    ├── __commits
    │   └── __1713868511500_1713868511500_7579133e0b204675a39e403d70dfad60_19.wrt
    ├── __fragment_meta
    ├── __fragments
    │   └── __1713868511500_1713868511500_7579133e0b204675a39e403d70dfad60_19
    │       ├── __fragment_metadata.tdb
    │       └── a0.tdb
    ├── __labels
    ├── __meta
    │   └── __1713868511065_1713868511065_905859f0d66c4d468c9b55f508642e2a
    └── __schema
        └── __1713868510451_1713868510451_f04050a87cb94785afef1c74ef04a241

42 directories, 28 files
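Where the tree command is not installed, a similar overview can be produced in Python. The sketch below uses only the standard library; list_tree and its max_depth parameter are hypothetical helpers written for this page, not part of TileDB, and the output format is simpler than that of tree.

```python
import os

def list_tree(root, max_depth=2, _depth=0):
    """Return indented lines for the entries under root, sorted by name.

    Directories are marked with a trailing "/" and are followed
    only while the current depth is below max_depth.
    """
    lines = []
    if _depth >= max_depth:
        return lines
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        is_dir = entry.is_dir(follow_symlinks=False)
        lines.append("    " * _depth + entry.name + ("/" if is_dir else ""))
        if is_dir:
            lines.extend(list_tree(entry.path, max_depth, _depth + 1))
    return lines

# Path used in this example; nothing is printed if it does not exist.
uri1 = "./dataset/tiledb/from_netcdf"
if os.path.isdir(uri1):
    print("\n".join(list_tree(uri1)))
```

With max_depth=2 the listing stops at the level of __commits, __fragments, __schema and so on, which is usually enough to see which arrays were created.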

Run a sanity check to ensure that we can open one of our new TileDB arrays.

w_path = f"{uri1}/array4"

### view some metadata 
with tiledb.Array(w_path) as A:
    print(A.meta.keys())
    print("")
    print(A.schema)
    print("")
    print(A.meta["__tiledb_attr.w.long_name"])


The output below lists the metadata keys and shows that the TileDB attribute "w" (the equivalent of the NetCDF variable) has 4 dimensions:

['__tiledb_attr.w.add_offset', '__tiledb_attr.w.long_name', '__tiledb_attr.w.missing_value', '__tiledb_attr.w.scale_factor', '__tiledb_attr.w.standard_name', '__tiledb_attr.w.units']

ArraySchema(
  domain=Domain(*[
    Dim(name='time', domain=(0, 0), tile=1, dtype='uint64'),
    Dim(name='level', domain=(0, 36), tile=13, dtype='uint64'),
    Dim(name='latitude', domain=(0, 720), tile=241, dtype='uint64'),
    Dim(name='longitude', domain=(0, 1439), tile=480, dtype='uint64'),
  ]),
  attrs=[
    Attr(name='w', dtype='int16', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  sparse=False,
)


Vertical velocity 
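The tile extents in the schema determine how the dense array is split into tiles. As a rough illustration, the sketch below works out the number of tiles and the uncompressed size of one tile from the values printed above. The numbers are copied from the schema output, and the variable names are illustrative only.

```python
from math import ceil, prod

# (dimension length, tile extent) pairs copied from the printed ArraySchema.
# A domain of (0, 36) holds 37 cells, and so on.
dims = {
    "time": (1, 1),
    "level": (37, 13),
    "latitude": (721, 241),
    "longitude": (1440, 480),
}

INT16_BYTES = 2  # attribute "w" is stored as int16

# Number of tiles needed to cover each dimension.
tiles_per_dim = {name: ceil(length / tile) for name, (length, tile) in dims.items()}
n_tiles = prod(tiles_per_dim.values())

# Cells and uncompressed bytes held by a single tile.
cells_per_tile = prod(tile for _, tile in dims.values())
tile_bytes = cells_per_tile * INT16_BYTES

print(tiles_per_dim)
print(f"{n_tiles} tiles, {cells_per_tile:,} cells each, {tile_bytes:,} bytes per tile uncompressed")
```

Each of the level, latitude and longitude dimensions is covered by 3 tiles, giving 27 tiles of roughly 3 MB each before any compression.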

We can also use Xarray to open the resulting TileDB arrays and carry out further operations, such as making a plot.

data = xr.open_dataset(w_path, engine="tiledb")
data


data.w[0,0,:,:].plot()
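The attribute "w" is stored as int16, and its metadata includes scale_factor, add_offset and missing_value keys, which suggests CF-style packing. Whether the tiledb engine for Xarray applies the unpacking automatically is not confirmed here, so it is worth checking the range of the plotted values. If they turn out to be raw integers, the CF convention for unpacking is packed * scale_factor + add_offset. The sketch below shows that formula in plain Python; the unpack helper and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def unpack(packed, scale_factor, add_offset, missing_value=None):
    """Apply CF-style unpacking to a sequence of packed integers.

    Values equal to missing_value are returned as NaN.
    """
    return [
        math.nan if value == missing_value else value * scale_factor + add_offset
        for value in packed
    ]

# Hypothetical numbers for illustration only. The real ones can be read from
# the array metadata, e.g. A.meta["__tiledb_attr.w.scale_factor"].
scale_factor = 0.5
add_offset = 10.0
missing_value = -32767

print(unpack([0, 2, -4, -32767], scale_factor, add_offset, missing_value))
```

The same expression applies elementwise to NumPy arrays, which is the more practical form for a full field.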

Further resources:

Website: https://tiledb.com/embedded

Source Code: https://github.com/TileDB-Inc/TileDB

Documentation: https://docs.tiledb.io/en/stable/

