Commit fe1969b (1 parent c3e57d4): added datasets

9 files changed (+611, -0)

data/README.md (+183 lines)

# Data

This part of the README describes the datasets we used and how to create them.

## ImageNet Data

- when we say ImageNet data, we refer to the data used for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- an overview of the challenges can be found [here](http://image-net.org/challenges/LSVRC/)
- [paper](https://arxiv.org/pdf/1409.0575.pdf)
- **object categories**
  - total of 1000 synsets
  - synsets follow the WordNet hierarchy (2014)
  - since 2012 the categories used have remained consistent
- **data collection**
  - images are retrieved by querying multiple search engines
- **image classification**
  - humans label the images (via Amazon Mechanical Turk) based on the Wikipedia definition of each category
  - multiple users label each image (at least 10 per image, until a confidence threshold is passed)
- **statistics**
  - 1000 object classes
  - ~1.2 million training images
  - ~50 thousand validation images
  - ~100 thousand test images

## COCO

- [COCO dataset](https://cocodataset.org/)
- COCO - Common Objects in Context
- large-scale object detection, segmentation, and captioning dataset
- general info
  - total of 80 categories
  - each picture can have multiple categories
  - train, val, and test set together: more than 200,000 pictures
- defines multiple challenges:
  - Object Detection
  - Keypoint Detection
  - Stuff Segmentation
  - Panoptic Segmentation
  - Image Captioning
  - DensePose

### Focus

- most interesting for us is the data provided for the **Object Detection** use case
- the data ([can be downloaded here](https://cocodataset.org/#download)) is structured in two parts:
  - the images
  - the annotations
- we need the *train/val annotations*, which consist of 6 files: 3 categories, each with a train and a val split
  - person_keypoints
  - captions
  - instances
- interesting for us are the *instances* annotations; each annotation file contains the following data
  - general info about the file (JSON object)
  - list of licenses
  - list of annotations
  - list of categories
- of this data, the *annotations* and the *categories* are of interest
- they have the following format ([see here](https://cocodataset.org/#format-data)):

```json
annotation{
  "id": int,
  "image_id": int,
  "category_id": int,
  "segmentation": RLE or [polygon],
  "area": float,
  "bbox": [x, y, width, height],
  "iscrowd": 0 or 1,
}

categories[{
  "id": int,
  "name": str,
  "supercategory": str,
}]
```
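
To illustrate how these two lists fit together, here is a minimal sketch that reads an *instances* file with plain `json` and counts how many images carry more than one category (the annotation file path is an assumption, adjust it to your local copy):

```python
import json

# path is an assumption: point it at a local copy of the annotations
with open('annotations/instances_val2017.json') as f:
    instances = json.load(f)

# the categories list maps category ids to human-readable names
id_to_name = {c['id']: c['name'] for c in instances['categories']}

# group the annotations by image to see how many categories each image has
cats_per_image = {}
for ann in instances['annotations']:
    cats_per_image.setdefault(ann['image_id'], set()).add(ann['category_id'])

multi = sum(1 for cats in cats_per_image.values() if len(cats) > 1)
print(f'{multi} of {len(cats_per_image)} annotated images have more than one category')
```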

## Customized COCO dataset

- the goal of our customized COCO dataset is a dataset that is similar to/compatible with the ImageNet dataset but drawn from a different distribution than the ImageNet data
- also, we want to split the subset of the COCO data that we use into multiple classes

### Creation

- the images in the ImageNet data have only one defined category
- also, most images show only the one object that defines the category

- to create a similar dataset, we filter the images (see the sketch below):
  - first, we extract all images that have only one assigned category
  - out of these images, we then extract those whose category is also part of the ImageNet dataset
  - finally, we store only the filtered images together with the relevant metadata
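
A rough sketch of this filtering logic (the actual implementation lives in the script referenced below; the file path and the `coco_to_imagenet` mapping here are illustrative placeholders):

```python
import json

# placeholder mapping from COCO category ids to ImageNet WNIDs; the real
# mapping is constructed by the extraction script
coco_to_imagenet = {24: 'n02391049'}  # e.g. COCO "zebra" -> the ImageNet zebra synset

with open('annotations/instances_train2017.json') as f:
    instances = json.load(f)

# step 1: collect the categories per image and keep single-category images
cats_per_image = {}
for ann in instances['annotations']:
    cats_per_image.setdefault(ann['image_id'], set()).add(ann['category_id'])
single_cat = {img: cats.pop() for img, cats in cats_per_image.items() if len(cats) == 1}

# step 2: of those, keep the images whose category also exists in ImageNet
filtered = {img: cat for img, cat in single_cat.items() if cat in coco_to_imagenet}
print(f'{len(filtered)} images remain after filtering')
```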

- the custom coco dataset is created by executing the [extract-custom-coco](custom/extract_custom_coco.py) script with the following arguments:
  - `--coco-train-root-path <coco-root>/train2017`
  - `--coco-train-annotations <coco-root>/annotations/instances_train2017.json`
  - `--coco-val-root-path <coco-root>/val2017`
  - `--coco-val-annotations <coco-root>/annotations/instances_val2017.json`
  - `--imagenet-root <imagenet-root>`
  - `--target-root <target-dir>`

### Download

- **download initial data**
  - [coco-train2017](http://images.cocodataset.org/zips/train2017.zip)
  - [coco-val2017](http://images.cocodataset.org/zips/val2017.zip)
  - [coco-annotations2017](http://images.cocodataset.org/annotations/annotations_trainval2017.zip) (contains `instances_train2017.json` and `instances_val2017.json`)
- the `<imagenet-root>` needs to contain the following files
  - `ILSVRC2012_devkit_t12.tar.gz`
  - `ILSVRC2012_img_val.tar`
  - unfortunately, we cannot include a direct download link here because a login is required
  - the root link is [here](http://www.image-net.org/challenges/LSVRC/2012/downloads)
  - **WARNING**: after logging in you get redirected to the data for 2010; change the link back to 2012 (http://www.image-net.org/challenges/LSVRC/2012/downloads) before downloading!
- **customized coco data**
  - [full-dataset](https://owncloud.hpi.de/s/TRCzfvxwyHCRIQr)

## Custom Coco Datasets 512

- for our experiments we create 4 custom coco datasets:
  - custom-coco-food-512, custom-coco-outdoor-512, custom-coco-indoor-512, custom-coco-zebra-512
- each dataset contains the first 512 items of its categories (see the loading sketch below)
- we created them using the script `create_coco_n.py`

### Download

- [custom-coco-food-512](https://owncloud.hpi.de/s/Pp2f4hdKFvUrYMm)
- [custom-coco-outdoor-512](https://owncloud.hpi.de/s/T083xp5fBt5S7OI)
- [custom-coco-indoor-512](https://owncloud.hpi.de/s/m5XjelcaVm577i4)
- [custom-coco-zebra-512](https://owncloud.hpi.de/s/MgfxezMgdWbvOYu)
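
Once downloaded, such a subset behaves like any other `CustomCoco` dataset (see `data/custom/custom_coco.py` below). A minimal loading sketch, where the local root path is a placeholder for wherever the archive was extracted:

```python
from torch.utils.data import DataLoader

from experiments.data.custom.custom_coco import InferenceCustomCoco

# the root path is a placeholder
dataset = InferenceCustomCoco(root='datasets/custom-coco-zebra-512')
loader = DataLoader(dataset, batch_size=32, shuffle=False, num_workers=4)

images, labels = next(iter(loader))
print(len(dataset))    # 512
print(images.shape)    # torch.Size([32, 3, 224, 224])
```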

## Pretrained Models: Used Data

We refer to the results and models listed in the
[official PyTorch documentation](https://pytorch.org/docs/stable/torchvision/models.html) (last accessed: 01.12.2020).

Essential questions to answer for the pre-trained models (AlexNet, VGG-19, ResNet18, ResNet50, ResNet152):

- What data was used to pre-train the models? (short answer: **data of the ImageNet challenge 2012**)
- Was the validation dataset used to train the models? (short answer: **NO**)

Exactly which data was used to train the models cannot be answered from the information given in the documentation.
Nevertheless, we can make some qualified guesses.

What data was used to train the models?

- for all models, the documentation says pre-trained on ImageNet
- furthermore, the provided [dataloader uses the ImageNet data from 2012](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/torchvision/datasets/imagenet.py#L11-L15)
- thus we can be fairly sure that the data used to pre-train the models is the **data of the ImageNet challenge 2012**; the data can be downloaded [here](http://image-net.org/challenges/LSVRC/2012/downloads.php#images)

Was the validation set used to train the models?

- a good overview of common definitions of train, test, and validation set can be found [here](https://machinelearningmastery.com/difference-test-validation-datasets/)
- they give the following definitions:
  - Training Dataset: The sample of data used to fit the model.
  - Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
  - Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
- following these definitions, the validation dataset shouldn't be used to train the model, but it is also said:
  - *the final model could be fit on the aggregate of the training and validation datasets*
- taking a look at the [GitHub issue](https://github.com/pytorch/vision/issues/2469) asking for the code that was used to generate the pre-trained models, we can find out that:
  - the validation split is used to generate *dataset_test* ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L110-L138))
  - this data is **only** used to generate the *data_loader_test* ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L164))
  - the *data_loader_test* is **only** used in the *evaluate* method ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L48-L71))
  - the *evaluate* method calculates no gradients and makes no use of the optimizer
- thus **the validation set was not used to pretrain the models we use**
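
To make the last two points concrete, the referenced *evaluate* method boils down to a loop of the following shape (a schematic reduction, not the original code): no gradients are computed and no optimizer step is taken, so the model weights cannot change.

```python
import torch


def evaluate(model, data_loader, device):
    model.eval()                       # disable dropout / batch-norm updates
    correct = total = 0
    with torch.no_grad():              # no gradients are calculated ...
        for image, target in data_loader:
            image, target = image.to(device), target.to(device)
            output = model(image)
            correct += (output.argmax(dim=1) == target).sum().item()
            total += target.size(0)
    # ... and no optimizer is ever involved, so the weights stay untouched
    return correct / total
```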

data/custom/create_coco_n.py (+60 lines)

```python
import argparse
import json
import os
import shutil

from experiments.data.custom.custom_coco import CustomCoco, FILE_NAME, COCO_META_JSON, COCO_META

IMAGES = 'images'


def main(args):
    data = CustomCoco(root=args.data_root, id_subset_json=args.id_subset_json, num_samples=args.size)

    root_path = os.path.abspath(args.dst_path)
    images_path = os.path.join(root_path, IMAGES)

    os.mkdir(root_path)
    os.mkdir(images_path)

    # copy all images that are part of the reduced dataset
    included_file_names = []
    for sample in data._items:
        src_path = os.path.join(args.data_root, IMAGES, sample[FILE_NAME])
        dst_path = os.path.join(root_path, IMAGES, sample[FILE_NAME])
        shutil.copy(src_path, dst_path)
        included_file_names.append(sample[FILE_NAME])

    # create a reduced meta JSON that references only the copied images
    current_json_path = os.path.join(args.data_root, COCO_META_JSON)
    new_coco_meta_list = []
    with open(current_json_path) as f:
        j_doc = json.load(f)
        coco_meta = j_doc[COCO_META]
        for e in coco_meta:
            file_name = e[FILE_NAME]
            if file_name in included_file_names:
                new_coco_meta_list.append(e)

    j_doc[COCO_META] = new_coco_meta_list

    with open(os.path.join(root_path, COCO_META_JSON), 'w') as f:
        json.dump(j_doc, f)


def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--size', help='the size in number of samples of the dataset', type=int, required=True)
    parser.add_argument('--data-root', help='the root path to the custom coco dataset', type=str, required=True)
    parser.add_argument('--id-subset-json', help='file to specify which ids are included', type=str, required=True)
    parser.add_argument('--dst-path', help='the path where the new dataset is created', type=str, required=True)

    _args = parser.parse_args()

    return _args


if __name__ == '__main__':
    args = parse_args()
    main(args)
```

data/custom/custom_coco.py (+115 lines)

```python
import json
import os
from typing import Optional, Callable, Any

from PIL import Image
from torchvision import transforms
from torchvision.datasets import VisionDataset

FILE_NAME = 'file_name'
IMAGES = 'images'
INCLUDED_COCO_IDS = 'included-coco-ids'
COCO_META = 'coco_meta'
COCO_META_JSON = 'coco_meta.json'
COCO_IMAGE_TYPE = ".jpg"
COCO_FILENAME_NUMBERS = 12
COCO_CLASSES = 91  # COCO category ids are not contiguous; 91 is an upper bound
COCO_SPLIT = 'coco_split'
COCO_CATEGORY_ID = 'coco_category_id'
COCO_IMAGE_ID = 'coco_image_id'
IMAGENET_WNID = 'imagenet_wnid'
IMAGENET_CLASS_ID = 'imagenet_class_id'

# standard ImageNet normalization constants
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

inference_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])


def _included_ids(id_subset_json):
    with open(id_subset_json, 'r') as json_file:
        included_id_json = json.load(json_file)
        ids = included_id_json[INCLUDED_COCO_IDS]
    return ids


class CustomCoco(VisionDataset):

    def __init__(self,
                 root: str,
                 ann_file: str = COCO_META_JSON,
                 id_subset_json: str = None,
                 transform: Optional[Callable] = None,
                 target_transform: Optional[Callable] = None,
                 transforms: Optional[Callable] = None,
                 num_samples: int = None
                 ) -> None:
        super(CustomCoco, self).__init__(root, transforms, transform, target_transform)
        self.images_path = os.path.join(self.root, IMAGES)
        self.ann_file = os.path.join(self.root, ann_file)

        if id_subset_json:
            self.included_ids = _included_ids(id_subset_json)
        else:
            # no subset given -> include all possible COCO category ids
            self.included_ids = list(range(COCO_CLASSES + 1))

        # collect all meta entries whose category id is included
        self._items = []
        with open(self.ann_file) as f:
            ann_data = json.load(f)
            coco_meta = ann_data[COCO_META]
            for e in coco_meta:
                coco_cat = e[COCO_CATEGORY_ID]
                if coco_cat in self.included_ids:
                    self._items.append(e)

        if num_samples is not None:
            if len(self._items) >= num_samples:
                # sort by image id so the subset is deterministic
                self._items.sort(key=lambda x: x[COCO_IMAGE_ID])
                self._items = self._items[:num_samples]
                assert len(self._items) == num_samples
            else:
                raise ValueError('The given num_samples is higher than the available number of samples')

    def __getitem__(self, index: int) -> Any:
        item = self._items[index]
        image_path = os.path.join(self.images_path, item[FILE_NAME])
        img = Image.open(image_path).convert('RGB')
        label = item[IMAGENET_CLASS_ID]

        if self.transform:
            img = self.transform(img)
        if self.target_transform:
            label = self.target_transform(label)

        return img, label

    def __len__(self) -> int:
        return len(self._items)


class InferenceCustomCoco(CustomCoco):

    def __init__(self, root: str, ann_file: str = COCO_META_JSON, id_subset_json: str = None,
                 target_transform: Optional[Callable] = None, transforms: Optional[Callable] = None,
                 num_samples: int = None):
        transform = inference_transforms
        super().__init__(root, ann_file, id_subset_json, transform, target_transform, transforms, num_samples)


class TrainCustomCoco(CustomCoco):

    def __init__(self, root: str, ann_file: str = COCO_META_JSON, id_subset_json: str = None,
                 target_transform: Optional[Callable] = None, transforms: Optional[Callable] = None,
                 num_samples: int = None):
        transform = train_transforms
        super().__init__(root, ann_file, id_subset_json, transform, target_transform, transforms, num_samples)
```
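
For reference, a short sketch of how the `id_subset_json` file is expected to look (a JSON object with the category ids under the `included-coco-ids` key, as read by `_included_ids` above) and how the training variant can be used; paths and the category id are illustrative:

```python
import json

from experiments.data.custom.custom_coco import TrainCustomCoco

# the subset file lists the COCO category ids to include; id 24 is illustrative
with open('zebra_subset.json', 'w') as f:
    json.dump({'included-coco-ids': [24]}, f)

# root is a placeholder for a local copy of the customized COCO dataset
dataset = TrainCustomCoco(root='datasets/custom-coco', id_subset_json='zebra_subset.json')
img, label = dataset[0]
print(img.shape, label)  # torch.Size([3, 224, 224]) and the ImageNet class id
```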
