Blog from February, 2016

 

I have installed and played around with pyesgf (http://esgf-pyclient.readthedocs.org/en/latest/index.html), a Python module that lets you interact with the ESGF API from Python.

Basically, you build a search by passing constraints, just as you can online, and get back the search results.
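As a rough sketch of what such a search looks like, the snippet below uses pyesgf's SearchConnection, new_context, hit_count and search. The helper name search_datasets and its arguments are made up for illustration, and the index node URL is left as a parameter because it depends on which node you query.

```python
def search_datasets(index_url, distrib=True, **constraints):
    """Run an ESGF dataset search and return (hit_count, results).

    `index_url` is the esg-search endpoint of an index node (supply your own);
    `constraints` are search facets such as project, model or experiment.
    """
    # Imported here so the module can be loaded without pyesgf installed.
    from pyesgf.search import SearchConnection

    conn = SearchConnection(index_url, distrib=distrib)
    ctx = conn.new_context(**constraints)
    # hit_count asks the server how many datasets match the constraints;
    # search() returns the result set to iterate over.
    return ctx.hit_count, ctx.search()
```

Called with, for instance, project='CMIP5', model='MIROC5' and experiment='historical', it should return the number of matching datasets together with the results themselves.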

The advantage is that you can decide how these results are retrieved. For example, Lawson scripts and my own Python scripts currently retrieve a wget file, which is then parsed to extract information on the available files, such as checksum, download URL, etc.

With pyesgf you have different options to retrieve the search results:

wget

opendap aggregation

a "list of datasets" (DatasetResult objects), each of which has a "list of files" (FileResult objects)
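The third option is the one that can replace parsing a wget file. A minimal sketch, assuming `datasets` is what a pyesgf search returns (a sequence of DatasetResult objects); collect_file_info is a hypothetical helper name, and it only touches attributes that pyesgf exposes on DatasetResult and FileResult.

```python
def collect_file_info(datasets):
    """Return one dict per file for every dataset in `datasets`."""
    records = []
    for ds in datasets:
        # file_context() builds a file-level search restricted to this dataset.
        for f in ds.file_context().search():
            records.append({
                "dataset_id": ds.dataset_id,
                "filename": f.filename,
                "checksum": f.checksum,
                "checksum_type": f.checksum_type,
                "download_url": f.download_url,
                "size": f.size,
            })
    return records
```

The result is a plain list of dictionaries, so the checksum and download URL are available directly instead of being scraped out of a wget script.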

The following are some of the properties that can be retrieved for each of these.

DatasetResult:

attributes and methods: 'context', 'dataset_id', 'download_url', 'file_context', 'index_node', 'json', 'las_url', 'number_of_files', 'opendap_url', 'urls'

All the information is hidden in the "json" attribute (a Python dictionary). It is the entire response from the server for that record, so it gives you much more information, including the version. The dict keys are:

[u'index_node', u'version', u'dataset_id_template_', u'cf_standard_name', u'number_of_aggregations', u'south_degrees', u'drs_id', u'cmor_table', u'replica', u'west_degrees', u'master_id', u'height_top', u'id', u'datetime_stop', u'height_bottom', u'access', u'realm', u'data_node', u'title', u'description', u'instance_id', u'height_units', u'variable_long_name', u'metadata_format', u'variable_units', u'score', u'datetime_start', u'number_of_files', u'_version_', u'size', u'type', u'ensemble', u'product', u'experiment', u'format', u'timestamp', u'time_frequency', u'_timestamp', u'variable', u'east_degrees', u'experiment_family', u'forcing', u'institute', u'north_degrees', u'project', u'url', u'model', u'latest']

FileResult:  

attributes and methods: 'checksum', 'checksum_type', 'context', 'download_url', 'file_id', 'filename', 'index_node', 'json', 'las_url', 'opendap_url', 'size', 'tracking_id', 'urls'

json dictionary keys:

[u'index_node', u'version', u'dataset_id_template_', u'cf_standard_name', u'cmor_table', u'replica', u'master_id', u'dataset_id', u'id', u'size', u'instance_id', u'realm', u'data_node', u'title', u'description', u'drs_id', u'variable_long_name', u'metadata_format', u'variable_units', u'score', u'_version_', u'forcing', u'type', u'ensemble', u'product', u'experiment', u'format', u'timestamp', u'time_frequency', u'variable', u'_timestamp', u'institute', u'checksum', u'experiment_family', u'project', u'url', u'tracking_id', u'checksum_type', u'model', u'latest']

It's also easy to add an extra attribute/property to the class so that it is retrieved nicely; I've done that for version, for example, on my git clone.
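One way to get a similar effect without editing the source is to attach a property at runtime. This sketches that alternative rather than the change made in the clone: add_json_property is a made-up helper, FakeResult is a stand-in so the snippet runs without pyesgf, and the only thing assumed about the real classes is that they carry the "json" dictionary described above.

```python
def add_json_property(cls, name):
    """Expose the json entry `name` as a read-only property on `cls`."""
    def getter(self):
        # Missing keys give None instead of raising KeyError.
        return self.json.get(name)
    setattr(cls, name, property(getter))
    return cls


# Stand-in for a result class: anything with a `json` dict works.
class FakeResult(object):
    def __init__(self, json):
        self.json = json


add_json_property(FakeResult, "version")
print(FakeResult({"version": "20120710"}).version)
```

With pyesgf itself, the same call would be applied to DatasetResult or FileResult after importing them; where exactly they are imported from is something to check against the installed version.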
The other interesting piece of data is the dataset_id. Straight from the author's GitHub repo:

ESGF search records contain various ids. These include:

  • id: internal identifier so that Solr can keep track of each record. Currently the instance_id + index node
  • drs_id: DRS identifier without the version number. Effectively an identifier for each dataset (including all versions and replicas)
  • master_id: The same as drs_id. I think this was created because of confusion about the purpose of DRS and drs_id. E.g. it's not clear whether a DRS identifier includes a version
  • instance_id: .. I.e. the DRS including the version number

The dataset_id listed above is the first of these, but we could potentially retrieve any of them. Examples are:

id - cmip5.output1.MIROC.MIROC5.historical.mon.atmos.Amon.r5i1p1.v20120710|aims3.llnl.gov (this is the dataset_id property)

drs_id - cmip5.output1.MIROC.MIROC5.historical.mon.atmos.Amon.r5i1p1

master_id - cmip5.output1.MIROC.MIROC5.historical.mon.atmos.Amon.r5i1p1

instance_id - cmip5.output1.MIROC.MIROC5.historical.mon.atmos.Amon.r5i1p1.v20120710
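Since the ids above are built from one another, the others can be derived from the id with plain string handling. A small sketch, assuming the layout shown in the examples (instance_id, then "|", then the node, with the version as a trailing "v" plus digits); split_esgf_id is a hypothetical helper, not part of pyesgf.

```python
def split_esgf_id(record_id):
    """Split an ESGF `id` into instance_id, drs_id, version and node."""
    instance_id, _, node = record_id.partition("|")
    drs_id, _, last = instance_id.rpartition(".")
    if last.startswith("v") and last[1:].isdigit():
        version = last
    else:
        # No trailing version component: treat the whole string as the drs_id.
        drs_id, version = instance_id, None
    return {
        "instance_id": instance_id,
        "drs_id": drs_id,
        "version": version,
        "node": node or None,
    }
```

Applied to the id in the example above, this gives back the instance_id and drs_id listed there, plus "v20120710" and "aims3.llnl.gov" as separate fields.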


At the moment the only issue seems to be a bug with a regex. If you do a search across nodes, there's one node that doesn't fit the regex, and that causes the program to crash. It can be easily fixed, though I'll probably ask the author why no exception handling was set up around the regex, and whether excluding such a node is intentional.
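The general shape of this kind of crash, and of the fix, can be shown with a toy example. The pattern and the function name below are hypothetical and are not the regex pyesgf uses; the point is only that re.match returns None when nothing matches, so calling .group() on the result without checking raises AttributeError.

```python
import re

# Hypothetical pattern for illustration only; not the one used by pyesgf.
HOST_PATTERN = re.compile(r"^https?://(?P<host>[^/]+)/thredds/")


def extract_host(url):
    """Return the host part of `url`, or None if the pattern does not match."""
    match = HOST_PATTERN.match(url)
    if match is None:
        # Without this check, match.group() would raise AttributeError.
        return None
    return match.group("host")
```

Whether an unmatched node should be skipped like this or reported as an error is exactly the question to put to the author.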