ImageNet is a large visual database containing millions of images, designed for computer vision research. It is one of the first and most popular image databases for deep learning, and it is widely used to train large-scale deep neural networks: the huge collection of images allows a network to learn a wide variety of image features. Modern deep neural networks can have more than 100 layers and run on multiple GPUs; such a network can easily process thousands of images per second, which is why a large image dataset is necessary. Once a model is trained, it can be used for image classification and object detection.

Fig.1 shows some examples from the ImageNet dataset: various categories and subcategories and how they are linked together. The images are collected from the internet and painstakingly categorised by researchers at the Stanford Vision Lab. An image can fall under several different categories and subcategories; Fig.1 shows two such chains of categories.

At the top of Fig.1, one can see that placental animals are a subcategory of mammals; in turn, carnivorous animals are a subcategory of placental animals. All the canine animal images are grouped together, and in the next subcategory, dog images are singled out from the images of canine animals. One can easily see how animals of similar species are grouped in the same subcategories. The classification of images is not limited to animals: the chain of categories at the bottom shows different subcategories of vehicles. ImageNet has hundreds of such chains for different animals and objects.


Fig.1. ImageNet Dataset examples (source:

ImageNet Hierarchy 

The entire ImageNet dataset contains 27 high-level categories and over 21,000 subcategories. It is not possible to show all of them on this page; a chart of about 1,000 subcategories is shown in the following link:

On average, there are about 500 images in each subcategory. The "animal" category has the largest number of subcategories and images, with about 3.8k subcategories and 2.8 million images. On the other hand, the "appliance" category has one of the smallest numbers of subcategories and images.

All the subcategories are taken from another Princeton project called WordNet. WordNet is a large lexical database of English in which words are categorised and linked together by semantic relationships. Words with similar meanings are grouped into sets called synsets. Nouns, verbs, adjectives and adverbs are grouped into synsets of cognitive synonyms; each synset expresses a distinct concept, and the synsets are interlinked by conceptual-semantic and lexical relations. ImageNet uses only a subset of the WordNet dataset.
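The hypernym ("is a kind of") chains described above can be sketched with a simple parent map. This is a minimal illustration only: the synset names below stand in for WordNet's real identifiers (which look like "n02084071"), and the miniature hierarchy mirrors the animal chain in Fig.1 rather than the actual WordNet graph.

```python
# Hypothetical miniature of a WordNet-style hierarchy: each synset maps
# to its parent (hypernym). Names are illustrative, not real WordNet IDs.
hypernyms = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "animal",
}

def hypernym_chain(synset):
    """Walk from a synset up to the root of the hierarchy."""
    chain = [synset]
    while synset in hypernyms:
        synset = hypernyms[synset]
        chain.append(synset)
    return chain

print(hypernym_chain("dog"))
# → ['dog', 'canine', 'carnivore', 'placental', 'mammal', 'animal']
```

In the real WordNet graph a synset can have more than one hypernym, so the structure is a directed acyclic graph rather than a strict tree; a single parent map is enough to show the chain idea, though.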

Use in computer vision

The hierarchical nature of this dataset makes the tasks of image classification and object detection much easier. A deep learning model trained on this dataset can choose among several subcategories when it is not sure about a specific subclass, and during classification it can be tuned to select the category it predicts with higher confidence rather than one with lower confidence.
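Picking the most confident subcategory typically amounts to a softmax over the model's raw scores followed by a top-k selection. The sketch below shows that mechanism in isolation; the labels and scores are hypothetical placeholders, not real ImageNet model outputs.

```python
import math

def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, logits, k=3):
    """Return the k labels the model is most confident about."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for a few cat subcategories, echoing Fig.2.
labels = ["Egyptian cat", "tabby cat", "tiger cat", "lynx"]
logits = [2.1, 1.3, 0.4, -1.0]
for label, prob in top_k(labels, logits):
    print(f"{label}: {prob:.2f}")
```

Reporting the top-k predictions rather than a single label is exactly what makes the hierarchy useful: if the top few subcategories all share a parent synset (here, all cats), the model can fall back on that broader category.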

The following three images show the usefulness of the ImageNet dataset. Fig.2 shows an image of a cat and how it is classified using the hierarchical architecture; it also shows that an image can belong to more than one path in the hierarchy. Based on the prediction, there is a greater chance that it is an image of an Egyptian cat than of a tabby cat, and it is less likely still to be a tiger cat. Next, Fig.3 shows the image of a dog, with a 65% chance that it is a Clumber Spaniel. Now compare this result to that of Fig.4, where there is a 96% chance that the image shows a Golden Retriever. The difference is that Fig.4 focuses more on the face, while Fig.3 covers a wider angle that includes other pictorial information besides the face, which is why the prediction is less confident. Clearly, if the image is more focused on the important features, the trained model achieves higher prediction accuracy. This observation gives rise to the concept of the bounding box, which is discussed next.


Fig.2. Classification of a cat image (source:

Fig.3. Classification of a dog image.

Fig.4. Classification of another dog image.

Bounding box

As discussed in the section above, model training is more effective when the image focuses on the important features. However, not all images are focused properly, which is why bounding boxes were introduced in computer vision. For example, when training with dog images, if a rectangular box is drawn around the dog's body, the model can focus on that box and ignore the rest of the image. ImageNet publishes bounding box annotations for images in the form of pixel coordinates. Over a million images have been annotated with bounding boxes, and they help to achieve higher training accuracy.
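Cropping to a bounding box can be sketched as a pure slicing operation. The snippet below assumes the common (xmin, ymin, xmax, ymax) pixel convention with the origin at the top-left corner; the tiny hand-made "image" is illustrative, standing in for a real photo array.

```python
def crop_bbox(image, xmin, ymin, xmax, ymax):
    """Keep only the pixels inside the bounding box.

    `image` is a row-major grid (a list of rows). Coordinates are assumed
    to follow the (xmin, ymin, xmax, ymax) convention with (0, 0) at the
    top-left corner, upper bounds exclusive.
    """
    return [row[xmin:xmax] for row in image[ymin:ymax]]

# A tiny 4x6 "image" where the 1s mark the object of interest.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
crop = crop_bbox(image, xmin=1, ymin=1, xmax=4, ymax=3)
print(crop)
# → [[1, 1, 1], [1, 1, 1]]
```

During training, either the crop itself is fed to the model or the box coordinates are used as the regression target for object detection; in both cases the background pixels outside the box no longer dilute the signal.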

Research and Data quality 

ImageNet has played an important part in the development of modern computer vision algorithms. It has many different subcategories of images, all stored in a logical hierarchical structure. Before ImageNet, datasets and neural networks were largely segregated: one dataset and algorithm would be designed for dog images, while another algorithm was developed for vehicle recognition. An algorithm would work only with a particular dataset, and datasets were not interchangeable. ImageNet changed all that by combining data from different synsets, which is why the same neural network can now detect different types of objects.

ImageNet also introduces diversity to the image collection. Each subcategory contains images of the same object or animal under different circumstances: different camera angles, lighting conditions, and backgrounds. Models trained on such diverse data are more robust, which is why ImageNet is a great benchmark for measuring the performance of neural network models.

Usage on NCI

At NCI, ImageNet is used to train models across multiple GPUs and nodes. A tutorial on training deep learning models using ImageNet and ResNet on Gadi can be found in the following link:  The ImageNet dataset is also part of NCI's AI-DL data collection and is available on the gdata filesystem.

Use in research and ImageNet Challenge

ImageNet's data labelling is done through crowdsourcing. Numerous volunteers have participated in classifying the images into the various subcategories, and their work is cross-checked by other volunteers. Hundreds of thousands of person-hours have been spent creating and validating this very large dataset. Thus, by using ImageNet, a researcher can save many months of work that would otherwise have been spent preparing a dataset.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition held between 2010 and 2017. Each year a new benchmark was prepared for the challenge, containing about 1,000 synsets and over a million images. Those benchmarks are now part of the larger ImageNet database, which can be freely downloaded and used for research projects.
