The TileDB-cf-py library can convert a NetCDF file into TileDB arrays directly, as described here.
For converting a dataset spread across multiple NetCDF files into TileDB arrays, TileDB-cf-py also provides a convenient solution.
Read Dataset from multiple NetCDF files
Before using Xarray to read the dataset from NetCDF files, we suggest setting up a Dask cluster to enable parallel processing.
You can set up a local Dask cluster as below:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
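By default, `LocalCluster()` chooses the number of workers from the machine's cores. As a minimal sketch, assuming you want to set the resources explicitly and release them when finished, the cluster and client can be used as context managers. The worker settings, `processes=False` (threads instead of separate processes) and the `square` helper are illustrative choices, not part of the original workflow.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    """Toy task used only to confirm that the cluster executes work."""
    return x * x


# Illustrative settings (assumptions): two single-threaded workers running as
# threads in this process, with the diagnostics dashboard disabled.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as cluster, Client(cluster) as client:
    futures = client.map(square, range(5))
    results = client.gather(futures)

print(results)
```

Leaving the `with` block closes both the client and the cluster, which avoids stray worker threads in long notebook sessions.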
Now we can use Xarray to fetch the dataset from multiple NetCDF files by using the local Dask cluster:
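import xarray as xr

# Glob pattern matching all NetCDF files for 2020
filenames = '/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/*.nc'
# Open the files lazily as one dataset, reading them in parallel with Dask
xrdata = xr.open_mfdataset(filenames, combine='by_coords', parallel=True)

# Keep only the first three months
start_year = "2020-01-01"
end_year = "2020-03-31"
xrdata = xrdata.sel(time=slice(start_year, end_year))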
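The time selection above relies on label-based slicing, which includes both end points. The following sketch shows this on a small synthetic dataset; the `demo` dataset and its values are made up for illustration and stand in for the ERA5 files.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the ERA5 files: 12 monthly time steps for 2020.
time = pd.date_range("2020-01-01", periods=12, freq="MS")
demo = xr.Dataset(
    {"w": (("time", "level"), np.zeros((12, 2), dtype="float32"))},
    coords={"time": time, "level": [500, 1000]},
)

# Label-based slices include both end points, so this keeps January to March.
subset = demo.sel(time=slice("2020-01-01", "2020-03-31"))
months = subset.time.dt.month.values.tolist()
print(subset.sizes["time"], months)
```

Because the time stamps fall on the first day of each month, an end label of "2020-03-31" still selects the March time step.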
The 'xrdata' object in the above script is an Xarray dataset with the following metadata and attributes.
print(xrdata)
<xarray.Dataset> Size: 461MB
Dimensions:    (longitude: 1440, latitude: 721, level: 37, time: 3)
Coordinates:
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) int32 148B 1 2 3 5 7 10 20 ... 875 900 925 950 975 1000
  * time       (time) datetime64[ns] 24B 2020-01-01 2020-02-01 2020-03-01
Data variables:
    w          (time, level, latitude, longitude) float32 461MB dask.array<chunksize=(1, 13, 241, 480), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....
The variable "w" has four dimensions and holds three months of data, sliced from the one-year dataset loaded from the directory "/g/data/rt52/era5/pressure-levels/monthly-averaged/w/2020/".
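The reported size and the Dask chunking can be checked with a little arithmetic. The sketch below uses only the dimension sizes and chunk shape printed in the summary; the chunk count assumes chunks are uniform apart from a smaller final chunk along each axis, which may not hold exactly for the real files.

```python
import math

# Dimension sizes and chunk shape copied from the printed dataset summary.
shape = {"time": 3, "level": 37, "latitude": 721, "longitude": 1440}
chunks = {"time": 1, "level": 13, "latitude": 241, "longitude": 480}
bytes_per_value = 4  # float32

total_bytes = math.prod(shape.values()) * bytes_per_value
chunk_bytes = math.prod(chunks.values()) * bytes_per_value
# Assumption: chunks are uniform apart from a smaller final chunk per axis.
chunks_per_axis = {name: math.ceil(shape[name] / chunks[name]) for name in shape}
n_chunks = math.prod(chunks_per_axis.values())

print(f"total size : {total_bytes / 1e6:.0f} MB")
print(f"full chunk : {chunk_bytes / 1e6:.1f} MB")
print(f"chunk count: {n_chunks}")
```

The total of 460,978,560 bytes agrees with the 461MB shown for "w", and each full chunk is about 6 MB, a size that individual Dask tasks can handle comfortably.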
Store the dataset into TileDB arrays
We can use the tiledb.cf.from_xarray() function to store an Xarray dataset in a TileDB object, as below.
import tiledb.cf

# output_dir must already be defined as the directory for the TileDB output
uri2 = f"{output_dir}/example2"
tiledb.cf.from_xarray(xrdata, uri2)
Now we can open the TileDB object with Xarray to check its metadata, which should match the dataset read from the NetCDF files.
ds = xr.open_dataset(uri2, engine="tiledb")
ds
<xarray.Dataset> Size: 461MB
Dimensions:    (time: 3, level: 37, latitude: 721, longitude: 1440)
Coordinates:
  * time       (time) datetime64[ns] 24B 2020-01-01 2020-02-01 2020-03-01
  * level      (level) float64 296B 1.0 2.0 3.0 5.0 ... 925.0 950.0 975.0 1e+03
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
Data variables:
    w          (time, level, latitude, longitude) float32 461MB ...
Attributes:
    Conventions:  CF-1.6
    history:      2020-11-05 14:53:38 UTC+1100 by era5_replication_tools-1.5....
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...
    title:        ERA5 pressure-levels monthly-averaged vertical_velocity 202...
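Comparing the two printouts, the "level" coordinate is int32 in the NetCDF-backed dataset and float64 when read back from TileDB. A small helper can list such differences. This is a sketch only: `coord_dtype_changes` is a hypothetical name, and the two tiny synthetic datasets stand in for `xrdata` and `ds`.

```python
import numpy as np
import xarray as xr


def coord_dtype_changes(before, after):
    """Return {name: (dtype_before, dtype_after)} for coordinates whose dtype differs."""
    changes = {}
    for name in before.coords:
        if name in after.coords and before[name].dtype != after[name].dtype:
            changes[name] = (str(before[name].dtype), str(after[name].dtype))
    return changes


# Synthetic stand-ins: the same levels stored as int32 and as float64.
levels = np.array([1, 2, 3], dtype="int32")
data = np.zeros((3,), dtype="float32")
before = xr.Dataset({"w": ("level", data)}, coords={"level": levels})
after = xr.Dataset({"w": ("level", data)}, coords={"level": levels.astype("float64")})

changes = coord_dtype_changes(before, after)
# The coordinate values are unchanged; only the dtype differs.
same_values = np.array_equal(before["level"].values, after["level"].values)
print(changes, same_values)
```

With the real data, the call would be `coord_dtype_changes(xrdata, ds)`. If the integer dtype matters downstream, the coordinate can be cast back after loading.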
uri2 = "./dataset/tiledb/example2"
creator.create_group(uri2)
The directory structure of the TileDB group is shown below. At this stage, no actual data has been stored in the TileDB group yet.
Open TileDB group
You can validate the TileDB group by loading the variable "w" from the TileDB arrays.
You can then visualise the data for each month in 'w' as below.