# Data

In this part of the README, we describe the datasets we used and how to create them.

## ImageNet Data

- when we say ImageNet data, we refer to the data used for the ImageNet Large Scale Visual Recognition
  Challenge (ILSVRC)
- an overview of the challenges can be found [here](http://image-net.org/challenges/LSVRC/)
- [paper](https://arxiv.org/pdf/1409.0575.pdf)
- **object categories**
  - total of 1000 synsets
  - synsets follow the WordNet hierarchy (2014)
  - the categories used have remained consistent since 2012
- **data collection**
  - images are retrieved by querying multiple search engines
- **image classification**
  - humans label the images (using Amazon Mechanical Turk) based on the Wikipedia definition of each category
  - multiple users label each image (at least 10 per image, until a confidence threshold is passed)
- **statistics**
  - 1000 object classes
  - ~1.2 million training images
  - ~50 thousand validation images
  - ~100 thousand test images

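
For reference, the sketch below shows one common way to load this data with `torchvision`; the root path is a placeholder and the preprocessing is only an example, not a prescription.

```python
# Minimal sketch: loading the ILSVRC 2012 validation split via torchvision.
# Assumes the archives listed in the Download section below are already
# placed in <imagenet-root>; the transform is only an example.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# split="train" or split="val"; torchvision extracts the archives on first use
imagenet_val = datasets.ImageNet(root="<imagenet-root>", split="val", transform=preprocess)
print(len(imagenet_val))  # roughly 50 thousand validation images
```
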
## COCO

- [COCO dataset](https://cocodataset.org/)
- COCO stands for Common Objects in Context
- large-scale object detection, segmentation, and captioning dataset
- general info
  - total of 80 categories
  - each image can have multiple categories
  - the train, val, and test sets together contain more than 200,000 images
- the dataset defines multiple challenges:
  - Object Detection
  - Keypoint Detection
  - Stuff Segmentation
  - Panoptic Segmentation
  - Image Captioning
  - DensePose

### Focus

- for us, the most interesting part is the data provided for the **Object Detection** use case
- the data ([can be downloaded here](https://cocodataset.org/#download)) is structured in two parts:
  - the images
  - the annotations
- we need the *train/val annotations*, which contain 6 files covering 3 annotation types (one train and one val file each)
  - person_keypoints
  - captions
  - instances
- interesting for us are the *instances* annotations; each annotation file contains the following data
  - general info about the file (JSON object)
  - list of licenses
  - list of annotations
  - list of categories
- of this data, the *annotations* and the *categories* are of interest
- they have the following format ([see here](https://cocodataset.org/#format-data)):

  ```json
  annotation{
      "id": int,
      "image_id": int,
      "category_id": int,
      "segmentation": RLE or [polygon],
      "area": float,
      "bbox": [x, y, width, height],
      "iscrowd": 0 or 1,
  }

  categories[{
      "id": int,
      "name": str,
      "supercategory": str,
  }]
  ```
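
To get a feeling for this structure, the annotation files can be read as plain JSON; below is a minimal sketch (the file path is an example and needs to be adapted).

```python
# Minimal sketch: inspecting the instances annotations of the COCO val split.
# The path is an example; adjust it to where the annotations were extracted.
import json

with open("<coco-root>/annotations/instances_val2017.json") as f:
    instances = json.load(f)

# map category id -> human-readable category name
id_to_name = {cat["id"]: cat["name"] for cat in instances["categories"]}

# collect the set of category ids annotated per image
categories_per_image = {}
for ann in instances["annotations"]:
    categories_per_image.setdefault(ann["image_id"], set()).add(ann["category_id"])

first_image_id, cat_ids = next(iter(categories_per_image.items()))
print(first_image_id, [id_to_name[c] for c in cat_ids])
```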

## Customized COCO dataset

- the goal of creating our customized COCO dataset is to obtain a dataset that is similar to/compatible with the
  ImageNet dataset, but drawn from a different distribution than the ImageNet data
- also, we want to split the subset of the COCO data that we use into multiple classes

### Creation

- the images in the ImageNet data have only one defined category
- also, in most images you can see only the one object defining the category

- to create a similar dataset, we filter the COCO images (a rough sketch of this filtering idea follows after this list):
  - first, we extract all images that have only one assigned category
  - out of these images, we then extract those whose category is also part of the ImageNet dataset
- finally, we keep only the filtered images and store them together with only the relevant annotation data

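
The following is only a rough sketch of that filtering idea, not the actual implementation in `extract_custom_coco.py`; in particular, `imagenet_category_names` is an assumed, simplified stand-in for however the ImageNet categories are matched.

```python
# Rough sketch of the filtering idea (illustrative, not the actual script):
# keep only COCO images that carry exactly one category, and only if that
# category also occurs in the ImageNet label set.
import json

def filter_single_category_images(instances_path, imagenet_category_names):
    with open(instances_path) as f:
        instances = json.load(f)

    id_to_name = {cat["id"]: cat["name"] for cat in instances["categories"]}

    categories_per_image = {}
    for ann in instances["annotations"]:
        categories_per_image.setdefault(ann["image_id"], set()).add(ann["category_id"])

    kept = {}
    for image_id, cat_ids in categories_per_image.items():
        if len(cat_ids) == 1:  # exactly one assigned category
            name = id_to_name[next(iter(cat_ids))]
            if name in imagenet_category_names:  # category also exists in ImageNet
                kept[image_id] = name
    return kept
```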

- the custom COCO dataset is created by executing the [extract-custom-coco](custom/extract_custom_coco.py) script with
  the following arguments
  - `--coco-train-root-path <coco-root>/train2017`
  - `--coco-train-annotations <coco-root>/annotations/instances_train2017.json`
  - `--coco-val-root-path <coco-root>/val2017`
  - `--coco-val-annotations <coco-root>/annotations/instances_val2017.json`
  - `--imagenet-root <imagenet-root>`
  - `--target-root <target-dir>`

### Download

- **download initial data**
  - [coco-train2017](http://images.cocodataset.org/zips/train2017.zip)
  - [coco-val2017](http://images.cocodataset.org/zips/val2017.zip)
  - [coco-annotations2017](http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
    (contains `instances_train2017.json` and `instances_val2017.json`)
  - the `<imagenet-root>` needs to contain the following files
    - `ILSVRC2012_devkit_t12.tar.gz`
    - `ILSVRC2012_img_val.tar`
    - unfortunately, we cannot include a direct download link here because a login is required
    - the root link is [here](http://www.image-net.org/challenges/LSVRC/2012/downloads)
    - **WARNING**: after logging in for the download you get redirected to the data for 2010; change the link to 2012
      (http://www.image-net.org/challenges/LSVRC/2012/downloads) before downloading!
- **customized COCO data**
  - [full-dataset](https://owncloud.hpi.de/s/TRCzfvxwyHCRIQr)

## Custom COCO Datasets 512

- for our experiments we create four custom COCO datasets
  - custom-coco-food-512, custom-coco-outdoor-512, custom-coco-indoor-512, custom-coco-zebra-512
- each dataset contains the first 512 items of its categories
- we created them using the script `create_coco_n.py` (a rough sketch of the idea follows after this list)
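
The sketch below only illustrates the idea of building a 512-item subset per category group; it is not the actual `create_coco_n.py` code, and the `(image_path, category)` sample format is an assumption.

```python
# Illustrative sketch (not the actual create_coco_n.py): build a subset of at
# most n items per category group from a list of (image_path, category) samples.
from collections import defaultdict

def build_subsets(samples, groups, n=512):
    """samples: iterable of (image_path, category); groups: {group_name: set of categories}."""
    subsets = defaultdict(list)
    for image_path, category in samples:
        for group_name, categories in groups.items():
            if category in categories and len(subsets[group_name]) < n:
                subsets[group_name].append((image_path, category))
    return subsets
```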

### Download

- [custom-coco-food-512](https://owncloud.hpi.de/s/Pp2f4hdKFvUrYMm)
- [custom-coco-outdoor-512](https://owncloud.hpi.de/s/T083xp5fBt5S7OI)
- [custom-coco-indoor-512](https://owncloud.hpi.de/s/m5XjelcaVm577i4)
- [custom-coco-zebra-512](https://owncloud.hpi.de/s/MgfxezMgdWbvOYu)

## Pretrained Models: Used Data

We refer to the results and models listed in the
[official PyTorch documentation](https://pytorch.org/docs/stable/torchvision/models.html) (last accessed 01.12.2020).

Essential questions to answer for the pre-trained models (AlexNet, VGG-19, ResNet18, ResNet50, ResNet152):

- What data was used to pre-train the models? (short answer: **data of the ImageNet challenge 2012**)
- Was the validation dataset used to train the models? (short answer: **NO**)
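
For context, a minimal sketch of how these pretrained torchvision models can be loaded; the `pretrained=True` flag downloads the weights discussed below (API as in the documentation version linked above).

```python
# Minimal sketch: loading the pretrained torchvision models discussed here.
# pretrained=True downloads the weights obtained from ImageNet pre-training.
from torchvision import models

alexnet = models.alexnet(pretrained=True)
vgg19 = models.vgg19(pretrained=True)
resnet18 = models.resnet18(pretrained=True)
resnet50 = models.resnet50(pretrained=True)
resnet152 = models.resnet152(pretrained=True)

resnet18.eval()  # switch to inference mode for evaluation
```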

Exactly what data was used to train the models cannot be answered from the information given in the documentation.
Nevertheless, we can make some qualified guesses.

What data was used to train the models?

- for all models, the documentation says they were pre-trained on ImageNet
- furthermore, the provided
  [dataloader uses the ImageNet data from 2012](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/torchvision/datasets/imagenet.py#L11-L15)
  - thus we can be fairly sure that the data used to pre-train the models is the **data of the ImageNet challenge
    2012**. The data can be downloaded [here](http://image-net.org/challenges/LSVRC/2012/downloads.php#images)

Was the validation set used to train the models?

- a good overview of common definitions of the train, test, and validation sets can be
  found [here](https://machinelearningmastery.com/difference-test-validation-datasets/)
- they give the following definitions:
  - Training Dataset: The sample of data used to fit the model.
  - Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training
    dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset
    is incorporated into the model configuration.
  - Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training
    dataset.
- following these definitions, the validation dataset should not be used to train the model, but it is also said:
  - *the final model could be fit on the aggregate of the training and validation datasets*
- taking a look at the [GitHub issue](https://github.com/pytorch/vision/issues/2469) asking for the code that was used
  to generate the pre-trained models, we find that:
  - the validation split is used to generate the *dataset_test*
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L110-L138))
  - this data is **only** used to generate the *data_loader_test*
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L164))
  - the *data_loader_test* is **only** used in the *evaluate* method
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L48-L71))
  - the *evaluate* method calculates no gradients and makes no use of the optimizer (see the sketch after this list)
  - thus, **the validation set was not used to pre-train the models we use**
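
To illustrate the last point: an evaluation pass of this kind runs under `torch.no_grad()` and never touches an optimizer, so it cannot modify the model weights. The sketch below only mirrors that pattern; it is not the actual torchvision reference code.

```python
# Minimal sketch of an evaluation loop in the style of the referenced
# evaluate() function: no gradients, no optimizer, so the weights stay fixed.
import torch

def evaluate(model, data_loader_test, device="cpu"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # disables gradient computation entirely
        for images, targets in data_loader_test:
            outputs = model(images.to(device))
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets.to(device)).sum().item()
            total += targets.size(0)
    return correct / total
```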