For this example, we will use the TileDB-CF-Py library to convert an example NetCDF file to TileDB arrays in a Python Jupyter Notebook.
The following libraries will need to be imported for this example:
import tiledb
import tiledb.cf
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil
The nci_ipynb package can change the working directory to the location of the notebook. This is particularly useful in an ARE JupyterLab session, where the default working directory always starts from the user's home directory.
import nci_ipynb
os.chdir(nci_ipynb.dir())
Let's look at an example NetCDF file from the ERA5 monthly averaged data on pressure levels collection and print its metadata.
netcdf_file = '/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/w_era5_moda_pl_20200101-20200131.nc'
ds = xr.open_dataset(netcdf_file)
print(ds)
This NetCDF file contains a variable called "w" with 4 coordinates: "longitude", "latitude", "level" and "time".
<xarray.Dataset> Size: 154MB
Dimensions:    (longitude: 1440, latitude: 721, level: 37, time: 1)
Coordinates:
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) int32 148B 1 2 3 5 7 10 20 ... 875 900 925 950 975 1000
  * time       (time) datetime64[ns] 8B 2020-01-01
Data variables:
    w          (time, level, latitude, longitude) float32 154MB ...
Attributes:
    Conventions:  CF-1.6
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....
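The 154MB that xarray reports for "w" can be checked by hand from the dimension lengths and the float32 dtype in the printout. The following is a small arithmetic sketch only; it does not open the file, and the variable names are chosen here for illustration.

```python
import math

# Dimension lengths as printed by xarray for this file.
dims = {"time": 1, "level": 37, "latitude": 721, "longitude": 1440}
bytes_per_value = 4  # float32, the dtype xarray reports for "w"

n_values = math.prod(dims.values())
n_bytes = n_values * bytes_per_value

print(f"{n_values:,} values, {n_bytes:,} bytes, about {n_bytes / 1e6:.0f} MB")
```

The coordinate arrays add only a few kilobytes on top of this, so the data variable accounts for essentially the whole dataset size.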
Next we can define our output directory and the folder where the TileDB arrays will be written to:
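output_dir = './dataset/tiledb'
### Folder where TileDB arrays will be written to
uri1 = f"{output_dir}/from_netcdf"
### Start from a clean output directory; ignore_errors avoids a failure when it does not exist yet
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir)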
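Because a notebook is often re-run from the top, it can help to wrap the clean-up in a small helper that works whether or not the folder already exists (a plain `shutil.rmtree` call raises `FileNotFoundError` on a missing directory). The sketch below is one possible way to do this; `reset_dir` is a hypothetical helper name, and the demonstration uses a temporary directory rather than the real output folder.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path):
    """Delete `path` if it exists, then recreate it empty."""
    path = Path(path)
    # ignore_errors=True makes the removal a no-op when the directory is missing.
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True)
    return path


# Demonstrate in a throwaway location (hypothetical paths).
base = Path(tempfile.mkdtemp())
out = reset_dir(base / "dataset" / "tiledb")   # first run: nothing to delete
(out / "stale.txt").write_text("left over from a previous run")
out = reset_dir(out)                            # second run: stale file is removed
print(sorted(p.name for p in out.iterdir()))
```

In the notebook this would be called once as `reset_dir(output_dir)` before the conversion step.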
Now we can convert the above NetCDF file with the from_netcdf function provided by the TileDB-CF-Py library:
tiledb.cf.from_netcdf(netcdf_file, uri1)
Let's check the file structure of the produced TileDB arrays:
!tree $uri1
Each coordinate and variable of the original NetCDF file is converted into a separate array. Specifically, the variable "w" is converted into the TileDB array named "array4".
./dataset/tiledb/from_netcdf
├── __group
│   └── __1713868510463_1713868510463_bd0772ad9f2541b781d54ca17d7334be_2
├── __meta
│   └── __1713868510497_1713868510497_4b731228555e45c6bf7e89b5b7514905
├── __tiledb_group.tdb
├── array0
│   ├── __commits
│   │   └── __1713868510520_1713868510520_01ac756b068543e68d7790ac1bab5283_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510520_1713868510520_01ac756b068543e68d7790ac1bab5283_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510518_1713868510518_8da9574e9ee94e91be73ee244aaa61d6
│   └── __schema
│       └── __1713868510352_1713868510352_4b959e9d22794ef3a529b5f0b0975caa
├── array1
│   ├── __commits
│   │   └── __1713868510623_1713868510623_780b26db1aa4417f86a60e68a8cf70cb_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510623_1713868510623_780b26db1aa4417f86a60e68a8cf70cb_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510622_1713868510622_c9f5ea9235674c44ab93f89c560d3e1f
│   └── __schema
│       └── __1713868510411_1713868510411_0e59d6e79ed1438eb1b9d227821e2afb
├── array2
│   ├── __commits
│   │   └── __1713868510781_1713868510781_c823542d13f14a25adcc249c4b623341_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510781_1713868510781_c823542d13f14a25adcc249c4b623341_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510780_1713868510780_0f40c0ce66364be78606ec19a50fa029
│   └── __schema
│       └── __1713868510425_1713868510425_3dfa3d3d144545d2afcb6ca832fa0546
├── array3
│   ├── __commits
│   │   └── __1713868510900_1713868510900_a19e8fd0ab3d4cfd86ef3db371d7ec6d_19.wrt
│   ├── __fragment_meta
│   ├── __fragments
│   │   └── __1713868510900_1713868510900_a19e8fd0ab3d4cfd86ef3db371d7ec6d_19
│   │       ├── __fragment_metadata.tdb
│   │       └── a0.tdb
│   ├── __labels
│   ├── __meta
│   │   └── __1713868510899_1713868510899_f16404fbdb6349bda5bfab183f477759
│   └── __schema
│       └── __1713868510438_1713868510438_cb2c5687f74d4fbab05eb33a21644d0a
└── array4
    ├── __commits
    │   └── __1713868511500_1713868511500_7579133e0b204675a39e403d70dfad60_19.wrt
    ├── __fragment_meta
    ├── __fragments
    │   └── __1713868511500_1713868511500_7579133e0b204675a39e403d70dfad60_19
    │       ├── __fragment_metadata.tdb
    │       └── a0.tdb
    ├── __labels
    ├── __meta
    │   └── __1713868511065_1713868511065_905859f0d66c4d468c9b55f508642e2a
    └── __schema
        └── __1713868510451_1713868510451_f04050a87cb94785afef1c74ef04a241

42 directories, 28 files
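When the `tree` command is not available, a similar overview can be obtained from Python. The sketch below relies only on the on-disk layout visible in the listing above. It assumes that a subdirectory is an array if it contains `__schema`, and that each directory under `__fragments` is one fragment. `summarize_group` and the mock fragment name `__frag_1` are hypothetical, and the mock tree exists only so the example runs without TileDB.

```python
import tempfile
from pathlib import Path


def summarize_group(group_dir):
    """Map each array directory name to its number of fragment directories."""
    summary = {}
    for child in sorted(Path(group_dir).iterdir()):
        # Assumption: only array directories contain a `__schema` folder.
        if not (child / "__schema").is_dir():
            continue
        fragments = child / "__fragments"
        if fragments.is_dir():
            count = sum(1 for f in fragments.iterdir() if f.is_dir())
        else:
            count = 0
        summary[child.name] = count
    return summary


# Build a tiny mock layout that mimics the structure shown above.
mock = Path(tempfile.mkdtemp()) / "from_netcdf"
for name in ("array0", "array1"):
    (mock / name / "__schema").mkdir(parents=True)
    (mock / name / "__fragments" / "__frag_1").mkdir(parents=True)
(mock / "__meta").mkdir()  # group-level folder, not an array

print(summarize_group(mock))
```

In the notebook, `summarize_group(uri1)` would be the equivalent call. Since this inspects internal directories rather than using the TileDB API, it is best treated as a quick diagnostic.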
Run a sanity check to ensure that we can open one of our new TileDB arrays.
w_path = f"{uri1}/array4"
### view some metadata
with tiledb.Array(w_path) as A:
    print(A.meta.keys())
    print("")
    print(A.schema)
    print("")
    print(A.meta["__tiledb_attr.w.long_name"])
The output shows that the TileDB attribute "w" (which is equivalent to the variable in NetCDF) has 4 dimensions, as below:
['__tiledb_attr.w.add_offset', '__tiledb_attr.w.long_name', '__tiledb_attr.w.missing_value', '__tiledb_attr.w.scale_factor', '__tiledb_attr.w.standard_name', '__tiledb_attr.w.units']
ArraySchema(
domain=Domain(*[
Dim(name='time', domain=(0, 0), tile=1, dtype='uint64'),
Dim(name='level', domain=(0, 36), tile=13, dtype='uint64'),
Dim(name='latitude', domain=(0, 720), tile=241, dtype='uint64'),
Dim(name='longitude', domain=(0, 1439), tile=480, dtype='uint64'),
]),
attrs=[
Attr(name='w', dtype='int16', var=False, nullable=False),
],
cell_order='row-major',
tile_order='row-major',
sparse=False,
)
Vertical velocity
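The `tile` value of each dimension in the schema sets how the dense array is partitioned. As a rough worked example, the number of tiles along each dimension is the domain length divided by the tile extent, rounded up. The numbers below are copied from the schema printed above; the byte estimate assumes 2 bytes per cell for the int16 attribute and ignores any filters or compression.

```python
import math

# (name, domain_min, domain_max, tile_extent) copied from the printed schema.
schema_dims = [
    ("time", 0, 0, 1),
    ("level", 0, 36, 13),
    ("latitude", 0, 720, 241),
    ("longitude", 0, 1439, 480),
]
bytes_per_cell = 2  # the "w" attribute is stored as int16

# Tiles along a dimension = ceil(domain length / tile extent).
tiles_per_dim = {
    name: math.ceil((hi - lo + 1) / extent) for name, lo, hi, extent in schema_dims
}
total_tiles = math.prod(tiles_per_dim.values())
cells_per_tile = math.prod(extent for _, _, _, extent in schema_dims)

print(tiles_per_dim)
print(total_tiles, "tiles of", cells_per_tile, "cells,",
      cells_per_tile * bytes_per_cell, "bytes each before filters")
```

With these extents the array splits into 3 tiles along each of level, latitude and longitude, so a read that touches a single pressure level still has to fetch tiles that span 13 levels.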
We can also use Xarray to open the produced TileDB arrays and conduct further operations, such as making a plot.
data = xr.open_dataset(w_path, engine="tiledb")
data
data.w[0,0,:,:].plot()
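The schema stores "w" as int16, and the array metadata carries `scale_factor` and `add_offset` entries, which is the usual CF convention for packed data. If raw integers are read directly from the TileDB array, physical values would be recovered with the CF formula `unpacked = raw * scale_factor + add_offset`. The sketch below illustrates the formula with made-up numbers; the real values live in the metadata keys `__tiledb_attr.w.scale_factor` and `__tiledb_attr.w.add_offset`, and `unpack` is a hypothetical helper.

```python
def unpack(raw, scale_factor, add_offset):
    """Apply the CF packing formula: unpacked = raw * scale_factor + add_offset."""
    return raw * scale_factor + add_offset


# Hypothetical numbers for illustration only, not taken from the ERA5 file.
scale_factor = 0.5
add_offset = 1.0

for raw in (-2, 0, 4):
    print(raw, "->", unpack(raw, scale_factor, add_offset))
```

The same expression works element-wise on a NumPy array of raw values. Cells equal to the `missing_value` recorded in the metadata should be masked before the formula is applied.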
Further resources:
Website: https://tiledb.com/embedded
Source Code: https://github.com/TileDB-Inc/TileDB
Documentation: https://docs.tiledb.io/en/stable/