
Introduction

The ImageNet dataset is widely used for evaluating large-scale visual recognition algorithms. We have added ImageNet to our local collection of AI/DL datasets under project wb00. If you have an NCI account, you can join this project.

We use the ILSVRC 2012 dataset, one of the largest and best-known datasets in the ImageNet collection. It can be sourced from the following website: https://image-net.org/challenges/LSVRC/index.php.

NCI Filesystem location

We have downloaded and processed the ILSVRC 2012 dataset and made it available on NCI under the project wb00 in the following directory: /g/data/wb00/ImageNet/. Please join project wb00 to access this dataset.

We downloaded the following archives (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar, ILSVRC2012_bbox_train_v2.tar.gz) from the official website and then processed the dataset to make it compatible with both TensorFlow and PyTorch. NCI hosts this dataset in two formats, raw images and TFrecords, both described below.

Raw Data Format

As mentioned above, several archives are downloaded directly from the ImageNet site. The first contains the training data (ILSVRC2012_img_train.tar), the second the validation data (ILSVRC2012_img_val.tar), and the third (ILSVRC2012_bbox_train_v2.tar.gz) the bounding boxes for objects inside the images; that is, the bounding boxes identify the objects that are used in the actual training.

Once you extract the first two archives, you should see two folders containing the raw training and validation images; together they hold more than 1.2 million images. The third archive holds XML files with the coordinates of the bounding boxes inside the respective images. For training, these images and annotations have to be grouped according to a predefined hierarchy.
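
The bounding-box annotations are one small XML file per image. As a rough illustration only, the sketch below reads one such file with the Python standard library; the element names (filename, object/name, bndbox/xmin, ...) follow the PASCAL VOC-style layout that the ImageNet annotations generally use, and the file path is hypothetical, so verify both against an actual file from the archive.

Parsing a bounding-box annotation (Python sketch)
import xml.etree.ElementTree as ET

xml_path = "n01440764/n01440764_10026.xml"   # hypothetical annotation file

root = ET.parse(xml_path).getroot()
filename = root.findtext("filename")

for obj in root.iter("object"):
    synset = obj.findtext("name")            # WordNet synset ID, e.g. n01440764
    box = obj.find("bndbox")
    xmin, ymin, xmax, ymax = (int(box.findtext(tag))
                              for tag in ("xmin", "ymin", "xmax", "ymax"))
    print(filename, synset, (xmin, ymin, xmax, ymax))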

The training data has 1000 subcategories of images; those subcategories, along with their classification hierarchy, are shown here: https://opus.nci.org.au/x/R4DSCQ. To aid training, ImageNet represents each subcategory with an alphanumeric code (a WordNet synset ID); the mapping from subcategories to codes is listed here: https://image-net.org/challenges/LSVRC/2015/browse-synsets.php. All 1.2 million images have to be moved into per-class subfolders according to this classification, which takes a considerable amount of time.
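
This grouping step has already been done on the NCI copy, so you do not need to run it yourself. Purely as a sketch of what it involves, the snippet below moves images into one folder per synset, assuming a flat folder of files named <synset>_<id>.JPEG so the class can be derived from the filename prefix; the source folder name is hypothetical.

Grouping images into class subfolders (Python sketch)
import pathlib
import shutil

src = pathlib.Path("flat_train_images")    # hypothetical folder of ungrouped images
dst = pathlib.Path("train")

for image in src.glob("*.JPEG"):
    synset = image.name.split("_")[0]      # class code is the filename prefix, e.g. n01440764
    class_dir = dst / synset
    class_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(image), str(class_dir / image.name))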

We have already done this time-consuming step for researchers and made the deep-learning-ready dataset available at the following gdata location: /g/data/wb00/admin/staging/ImageNet/ILSVRC2012/raw-data/. The following code block shows a portion of the image files inside the train folder.

Raw ImageNet data
train/n01440764/n01440764_10026.JPEG  train/n01440764/n01440764_12433.JPEG  train/n01440764/n01440764_172.JPEG    train/n01440764/n01440764_31283.JPEG  train/n01440764/n01440764_5776.JPEG  train/n01440764/n01440764_7946.JPEG
train/n01440764/n01440764_10027.JPEG  train/n01440764/n01440764_12435.JPEG  train/n01440764/n01440764_1735.JPEG   train/n01440764/n01440764_31293.JPEG  train/n01440764/n01440764_5781.JPEG  train/n01440764/n01440764_7950.JPEG
train/n01440764/n01440764_10029.JPEG  train/n01440764/n01440764_12446.JPEG  train/n01440764/n01440764_17454.JPEG  train/n01440764/n01440764_3129.JPEG   train/n01440764/n01440764_5785.JPEG  train/n01440764/n01440764_7963.JPEG
train/n01440764/n01440764_10040.JPEG  train/n01440764/n01440764_1244.JPEG   train/n01440764/n01440764_17501.JPEG  train/n01440764/n01440764_31406.JPEG  train/n01440764/n01440764_5802.JPEG  train/n01440764/n01440764_7967.JPEG
train/n01440764/n01440764_10042.JPEG  train/n01440764/n01440764_1245.JPEG   train/n01440764/n01440764_17514.JPEG  train/n01440764/n01440764_3146.JPEG   train/n01440764/n01440764_5848.JPEG  train/n01440764/n01440764_7982.JPEG
...
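
Because the raw data uses the one-subfolder-per-class layout shown above, it can be consumed directly by PyTorch. The sketch below is one minimal way to do this with torchvision's ImageFolder; it assumes the train subdirectory sits under the raw-data path as in the listing, and the transform values are the common ImageNet defaults rather than anything NCI-specific.

Loading the raw data with PyTorch (sketch)
import torch
from torchvision import datasets, transforms

RAW_DIR = "/g/data/wb00/admin/staging/ImageNet/ILSVRC2012/raw-data"

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder(f"{RAW_DIR}/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)   # e.g. torch.Size([256, 3, 224, 224]) torch.Size([256])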


TensorFlow Record Format

Although the raw images are ready for deep learning with PyTorch, TensorFlow requires some additional steps: reading many small raw image files slows the training process down. To speed up training, the raw images are converted into TensorFlow Records (TFrecords) using the script and instructions given on the following page: https://github.com/tensorflow/models/blob/master/research/slim/datasets/download_and_convert_imagenet.sh
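
Conceptually, the conversion stores each image as a serialised tf.train.Example holding the encoded JPEG bytes plus its label, and packs many such examples into each sharded TFRecord file. The sketch below illustrates that idea only; the feature names and file paths here are illustrative assumptions, not necessarily the exact ones used by the conversion script.

Writing an image into a TFRecord shard (conceptual Python sketch)
import tensorflow as tf

def to_example(jpeg_bytes, label):
    # One image per tf.train.Example: the encoded JPEG bytes plus its class label.
    return tf.train.Example(features=tf.train.Features(feature={
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

jpeg_bytes = tf.io.read_file("train/n01440764/n01440764_10026.JPEG").numpy()
with tf.io.TFRecordWriter("train-00000-of-01024") as writer:
    writer.write(to_example(jpeg_bytes, label=1).SerializeToString())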

The download and conversion can take a considerable amount of time depending on the computational power of the machine. We have already downloaded and converted the entire dataset into TFrecords; the converted training and validation TFrecord files can be found under the following folder: /g/data/wb00/ImageNet/ILSVRC2012

Specifically, the training and validation TFrecords are located in the directory /g/data/wb00/admin/staging/ImageNet/ILSVRC2012/data_dir/. Some data loader libraries need an index file for each TFrecord file; we have also created index files for all TFrecords, and they are in the folder /g/data/wb00/admin/staging/ImageNet/ILSVRC2012/idx_dir/
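
One example of a loader that uses such index files is NVIDIA DALI, whose TFRecord reader takes the record shards together with matching index files. The sketch below is only a rough illustration: the index-file naming inside idx_dir and the feature keys are assumptions on our part, and it presumes a GPU node, so check the directory contents and the DALI documentation before relying on it.

Reading the shards with DALI index files (Python sketch)
import glob

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.tfrecord as tfrec

DATA_DIR = "/g/data/wb00/admin/staging/ImageNet/ILSVRC2012/data_dir"
IDX_DIR = "/g/data/wb00/admin/staging/ImageNet/ILSVRC2012/idx_dir"

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def imagenet_pipeline():
    inputs = fn.readers.tfrecord(
        path=sorted(glob.glob(f"{DATA_DIR}/train-*")),
        index_path=sorted(glob.glob(f"{IDX_DIR}/train-*")),   # index-file naming is an assumption
        features={
            "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        })
    images = fn.decoders.image(inputs["image/encoded"])
    return fn.resize(images, resize_shorter=256), inputs["image/class/label"]

pipe = imagenet_pipeline()
pipe.build()
images, labels = pipe.run()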

The code block below shows the converted TensorFlow record files from the above folder. Here the small individual images are packed into contiguous TensorFlow records, which are easier to use with distributed deep learning: bundling many images into larger files reduces I/O latency. By joining the wb00 project, a user can access both the training and validation TFrecords and use the dataset to benchmark TensorFlow models; a sketch of how the shards can be read back is given after the listing.

TensorFlow Records of ImageNet Data
-rw-r--r-- 1 mah900 z00 132M Jun  2 20:38 data_dir/train-00000-of-01024
-rw-r--r-- 1 mah900 z00 141M Jun  2 20:38 data_dir/train-00001-of-01024
-rw-r--r-- 1 mah900 z00 140M Jun  2 20:38 data_dir/train-00002-of-01024
-rw-r--r-- 1 mah900 z00 135M Jun  2 20:38 data_dir/train-00003-of-01024
-rw-r--r-- 1 mah900 z00 132M Jun  2 20:38 data_dir/train-00004-of-01024
-rw-r--r-- 1 mah900 z00 139M Jun  2 20:38 data_dir/train-00005-of-01024
-rw-r--r-- 1 mah900 z00 143M Jun  2 20:38 data_dir/train-00006-of-01024
-rw-r--r-- 1 mah900 z00 140M Jun  2 20:38 data_dir/train-00007-of-01024
-rw-r--r-- 1 mah900 z00 133M Jun  2 20:38 data_dir/train-00008-of-01024
-rw-r--r-- 1 mah900 z00 147M Jun  2 20:38 data_dir/train-00009-of-01024
-rw-r--r-- 1 mah900 z00 139M Jun  2 20:38 data_dir/train-00010-of-01024
-rw-r--r-- 1 mah900 z00 141M Jun  2 20:38 data_dir/train-00011-of-01024
-rw-r--r-- 1 mah900 z00 135M Jun  2 20:38 data_dir/train-00012-of-01024
-rw-r--r-- 1 mah900 z00 139M Jun  2 20:38 data_dir/train-00013-of-01024
-rw-r--r-- 1 mah900 z00 133M Jun  2 20:38 data_dir/train-00014-of-01024
-rw-r--r-- 1 mah900 z00 133M Jun  2 20:38 data_dir/train-00015-of-01024
-rw-r--r-- 1 mah900 z00 137M Jun  2 20:38 data_dir/train-00016-of-01024
-rw-r--r-- 1 mah900 z00 132M Jun  2 20:38 data_dir/train-00017-of-01024
-rw-r--r-- 1 mah900 z00 143M Jun  2 20:38 data_dir/train-00018-of-01024
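
As a minimal sketch of reading these sharded files back with tf.data, the snippet below parses each record and decodes the JPEG. The feature keys ("image/encoded", "image/class/label") are the ones commonly written by the TF-Slim conversion script, but treat them as an assumption and confirm them by inspecting a record before relying on this parsing spec.

Reading the TFRecord shards with tf.data (Python sketch)
import tensorflow as tf

DATA_DIR = "/g/data/wb00/admin/staging/ImageNet/ILSVRC2012/data_dir"

feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one serialised example into a resized image tensor and its label.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["image/class/label"]

files = tf.data.Dataset.list_files(f"{DATA_DIR}/train-*-of-01024")
dataset = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))

for images, labels in dataset.take(1):
    print(images.shape, labels.shape)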


License information 

All the images in the dataset were collected from the internet, and ImageNet does not hold the copyright for any of them. The dataset is free for researchers to use, and the website gives details on how to cite it in your work. The URLs of the original images can be found on the ImageNet website, so researchers can download the images themselves. However, the ImageNet website also gives free access to the curated dataset for non-commercial and research purposes, and anyone can register to download it for educational purposes.

Citation

The ImageNet dataset is created and maintained by the Stanford Vision Lab (https://svl.stanford.edu/) in conjunction with Princeton University. If you want to use this dataset in your research, refer to the ImageNet website (https://image-net.org/challenges/LSVRC/index.php) for the latest citation information.

