Store Xarray Datasets into TileDB Arrays

The TileDB-cf-py library can convert a NetCDF file into TileDB arrays directly, as described here.

When a dataset is spread across multiple NetCDF files, TileDB-cf-py also provides a convenient way to convert it into TileDB arrays.

Read Dataset from multiple NetCDF files

Before using Xarray to read a dataset from NetCDF files, it is recommended to set up a Dask cluster to enable parallel processing.

You can easily set up a local Dask cluster as shown below.

Set up a Dask cluster.

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)

Now we can use Xarray to fetch the dataset from multiple NetCDF files by using the local Dask cluster:

Read datasets

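import xarray as xr

# Glob pattern matching the 2020 monthly-averaged ERA5 files
filenames = '/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/*.nc'
# Open all files as a single dataset, reading them in parallel with Dask
xrdata = xr.open_mfdataset(filenames, combine='by_coords', parallel=True)
# Keep only January to March 2020
start_year = "2020-01-01"
end_year = "2020-03-31"
xrdata = xrdata.sel(time=slice(start_year, end_year))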
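
The label-based time selection can be tried on a small synthetic dataset, without access to the ERA5 files. The following is only a sketch: the variable name `w` follows the script above, while the array shape, the `level` values and the name `demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for one year of monthly data (12 time steps).
time = pd.date_range("2020-01-01", periods=12, freq="MS")
demo = xr.Dataset(
    {"w": (("time", "level"), np.zeros((12, 2), dtype="float32"))},
    coords={"time": time, "level": [1, 2]},
)

# Label-based slices in Xarray include both end points.
subset = demo.sel(time=slice("2020-01-01", "2020-03-31"))
months = [int(m) for m in subset["time"].dt.month.values]
print(months)
```

Because the slice is inclusive on both ends, the three monthly time steps from January to March are kept, which is consistent with a time dimension of length 3.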

The 'xrdata' object in the script above is an Xarray dataset with the following metadata and attributes.

Print the dataset

print(xrdata)

<xarray.Dataset> Size: 461MB
Dimensions:    (longitude: 1440, latitude: 721, level: 37, time: 3)
Coordinates:
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) int32 148B 1 2 3 5 7 10 20 ... 875 900 925 950 975 1000
  * time       (time) datetime64[ns] 24B 2020-01-01 2020-02-01 2020-03-01
Data variables:
    w          (time, level, latitude, longitude) float32 461MB dask.array<chunksize=(1, 13, 241, 480), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....

The variable "w" has 4 dimensions and contains 3 months of data, sliced from the 1-year dataset loaded from the directory "/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/".
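
The reported size and the Dask chunk layout can be cross-checked with a little arithmetic. This sketch uses only the dimension lengths and chunk sizes shown in the listing above, and assumes 4 bytes per float32 value.

```python
import math

# Values copied from the dataset listing above.
shape = {"time": 3, "level": 37, "latitude": 721, "longitude": 1440}
chunks = {"time": 1, "level": 13, "latitude": 241, "longitude": 480}
itemsize = 4  # bytes per float32 value

total_bytes = math.prod(shape.values()) * itemsize
chunk_bytes = math.prod(chunks.values()) * itemsize
# Chunks needed along each dimension (the last chunk may be partial).
chunks_per_dim = {d: math.ceil(shape[d] / chunks[d]) for d in shape}
n_chunks = math.prod(chunks_per_dim.values())

print(f"total: {total_bytes / 1e6:.0f} MB")
print(f"per chunk: {chunk_bytes / 1e6:.1f} MB, chunks: {n_chunks}")
```

The total comes to about 461 MB, matching the listing, split into 81 chunks of roughly 6 MB each.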

Store the dataset into TileDB arrays

We can simply use the tiledb.cf.from_xarray() function to store an Xarray dataset in a TileDB group, as below.

Write the dataset to TileDB

import tiledb.cf

output_dir = "./dataset/tiledb"  # parent directory of the TileDB group
uri2 = f"{output_dir}/example2"
tiledb.cf.from_xarray(xrdata, uri2)

Now we can open the TileDB group via Xarray to check its metadata, which matches that of the dataset read from the NetCDF files (note that the "level" coordinate is now reported as float64 rather than int32).

Open the TileDB group with Xarray

ds = xr.open_dataset(uri2, engine="tiledb")
ds

<xarray.Dataset> Size: 461MB
Dimensions:    (time: 3, level: 37, latitude: 721, longitude: 1440)
Coordinates:
  * time       (time) datetime64[ns] 24B 2020-01-01 2020-02-01 2020-03-01
  * level      (level) float64 296B 1.0 2.0 3.0 5.0 ... 925.0 950.0 975.0 1e+03
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
Data variables:
    w          (time, level, latitude, longitude) float32 461MB ...
Attributes:
    Conventions:  CF-1.6
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
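
Comparing the two listings by eye is error-prone, so a small helper can report which coordinates changed dtype between the original and the round-tripped dataset. The sketch below is illustrative only: `coord_dtype_changes` is a hypothetical helper, not part of Xarray or TileDB-cf-py, and it is exercised on tiny synthetic datasets that mimic the change in the "level" coordinate.

```python
import numpy as np
import xarray as xr


def coord_dtype_changes(before, after):
    """Return {coord: (dtype_before, dtype_after)} for coords whose dtype differs."""
    changes = {}
    for name in before.coords:
        if name in after.coords and before[name].dtype != after[name].dtype:
            changes[name] = (str(before[name].dtype), str(after[name].dtype))
    return changes


# Tiny synthetic datasets: integer "level" before, float64 "level" after.
levels = [1, 2, 3]
data = np.zeros((3,), dtype="float32")
original = xr.Dataset(
    {"w": ("level", data)},
    coords={"level": np.array(levels, dtype="int32")},
)
roundtrip = xr.Dataset(
    {"w": ("level", data)},
    coords={"level": np.array(levels, dtype="float64")},
)

changes = coord_dtype_changes(original, roundtrip)
print(changes)
```

With the real data, the same call would be `coord_dtype_changes(xrdata, ds)`.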

Create the TileDB group

uri2 = "./dataset/tiledb/example2"
creator.create_group(uri2)

The directory structure of the TileDB group is shown below. At this stage, no actual data has been stored in the TileDB group yet.

Open TileDB group

You can validate the TileDB group by loading the variable "w" from the TileDB arrays.
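
One possible way to do this is to reuse the Xarray "tiledb" engine shown earlier. This is a minimal sketch, not necessarily the original procedure: `load_variable` is a hypothetical helper, and it assumes TileDB-cf-py is installed so that the "tiledb" engine is registered with Xarray.

```python
import xarray as xr


def load_variable(group_uri, name="w"):
    """Open a TileDB group through the Xarray "tiledb" engine and return one variable."""
    # Assumes TileDB-cf-py is installed so that engine="tiledb" is available.
    ds = xr.open_dataset(group_uri, engine="tiledb")
    return ds[name]


# Example call (not executed here, because it needs the group written earlier):
# w = load_variable("./dataset/tiledb/example2")
# print(w.dims, w.dtype)
```

If the group was written correctly, the returned variable should have the dimensions (time, level, latitude, longitude) shown in the listing above.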

You can also visualise the data for each month in 'w'.
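
A per-month plot could be produced with Xarray's built-in faceted plotting, which requires matplotlib. The sketch below uses a small random array in place of the real data: the values, grid sizes and the chosen level are made up, and only the dimension names follow the dataset above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in with the same dimension names as "w", but tiny.
time = pd.date_range("2020-01-01", periods=3, freq="MS")
w = xr.DataArray(
    np.random.default_rng(0).normal(size=(3, 2, 5, 8)).astype("float32"),
    dims=("time", "level", "latitude", "longitude"),
    coords={
        "time": time,
        "level": [500, 850],
        "latitude": np.linspace(90, -90, 5),
        "longitude": np.linspace(-180, 180, 8, endpoint=False),
    },
    name="w",
)

# One panel per month at a single pressure level.
grid = w.sel(level=850).plot(col="time")
```

With the real data, `w` would be the variable loaded from the TileDB group, for example `ds["w"]`.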
