# Data

In this part of the README, we describe the datasets we used and how to create them.

## ImageNet Data

- when we say ImageNet data, we refer to the data used for the ImageNet Large Scale Visual Recognition
  Challenge (ILSVRC)
- an overview of the challenges can be found [here](http://image-net.org/challenges/LSVRC/)
- [paper](https://arxiv.org/pdf/1409.0575.pdf)
- **object categories**
  - total of 1000 synsets
  - synsets follow the WordNet hierarchy (2014)
  - the categories used have remained consistent since 2012
- **data collection**
  - images are retrieved by querying multiple search engines
- **image classification**
  - humans label the images (using Amazon Mechanical Turk) based on the Wikipedia definition of each category
  - multiple users label each image (at least 10 per image, until a confidence threshold is passed)
- **statistics**
  - 1000 object classes
  - ~1.2 million training images
  - ~50 thousand validation images
  - ~100 thousand test images

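
For reference, the sketch below shows one common way to load this data with `torchvision`; the root path is a placeholder and the preprocessing is only an example, not a prescription.

```python
# Minimal sketch: loading the ILSVRC 2012 validation split via torchvision.
# Assumes the archives listed in the Download section below are already
# placed in <imagenet-root>; the transform is only an example.
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# split="train" or split="val"; torchvision extracts the archives on first use
imagenet_val = datasets.ImageNet(root="<imagenet-root>", split="val", transform=preprocess)
print(len(imagenet_val))  # roughly 50 thousand validation images
```
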
## COCO

- [COCO dataset](https://cocodataset.org/)
- COCO stands for Common Objects in Context
- large-scale object detection, segmentation, and captioning dataset
- general info
  - total of 80 categories
  - each image can have multiple categories
  - the train, val, and test sets together contain more than 200,000 images
- the dataset defines multiple challenges:
  - Object Detection
  - Keypoint Detection
  - Stuff Segmentation
  - Panoptic Segmentation
  - Image Captioning
  - DensePose

### Focus

- for us, the most interesting part is the data provided for the **Object Detection** use case
- the data ([can be downloaded here](https://cocodataset.org/#download)) is structured in two parts:
  - the images
  - the annotations
- we need the *train/val annotations*, which contain 6 files covering 3 annotation types (one train and one val file each)
  - person_keypoints
  - captions
  - instances
- interesting for us are the *instances* annotations; each annotation file contains the following data
  - general info about the file (JSON object)
  - list of licenses
  - list of annotations
  - list of categories
- of this data, the *annotations* and the *categories* are of interest
- they have the following format ([see here](https://cocodataset.org/#format-data)):

  ```json
  annotation{
      "id": int,
      "image_id": int,
      "category_id": int,
      "segmentation": RLE or [polygon],
      "area": float,
      "bbox": [x, y, width, height],
      "iscrowd": 0 or 1,
  }

  categories[{
      "id": int,
      "name": str,
      "supercategory": str,
  }]
  ```
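
To get a feeling for this structure, the annotation files can be read as plain JSON; below is a minimal sketch (the file path is an example and needs to be adapted).

```python
# Minimal sketch: inspecting the instances annotations of the COCO val split.
# The path is an example; adjust it to where the annotations were extracted.
import json

with open("<coco-root>/annotations/instances_val2017.json") as f:
    instances = json.load(f)

# map category id -> human-readable category name
id_to_name = {cat["id"]: cat["name"] for cat in instances["categories"]}

# collect the set of category ids annotated per image
categories_per_image = {}
for ann in instances["annotations"]:
    categories_per_image.setdefault(ann["image_id"], set()).add(ann["category_id"])

first_image_id, cat_ids = next(iter(categories_per_image.items()))
print(first_image_id, [id_to_name[c] for c in cat_ids])
```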

## Customized COCO dataset

- the goal of creating our customized COCO dataset is to obtain a dataset that is similar to/compatible with the
  ImageNet dataset, but drawn from a different distribution than the ImageNet data
- also, we want to split the subset of the COCO data that we use into multiple classes

### Creation

- the images in the ImageNet data have only one defined category
- also, in most images you can see only the one object defining the category

- to create a similar dataset, we filter the COCO images (a rough sketch of this filtering idea follows after this list):
  - first, we extract all images that have only one assigned category
  - out of these images, we then extract those whose category is also part of the ImageNet dataset
- finally, we keep only the filtered images and store them together with only the relevant annotation data

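
The following is only a rough sketch of that filtering idea, not the actual implementation in `extract_custom_coco.py`; in particular, `imagenet_category_names` is an assumed, simplified stand-in for however the ImageNet categories are matched.

```python
# Rough sketch of the filtering idea (illustrative, not the actual script):
# keep only COCO images that carry exactly one category, and only if that
# category also occurs in the ImageNet label set.
import json

def filter_single_category_images(instances_path, imagenet_category_names):
    with open(instances_path) as f:
        instances = json.load(f)

    id_to_name = {cat["id"]: cat["name"] for cat in instances["categories"]}

    categories_per_image = {}
    for ann in instances["annotations"]:
        categories_per_image.setdefault(ann["image_id"], set()).add(ann["category_id"])

    kept = {}
    for image_id, cat_ids in categories_per_image.items():
        if len(cat_ids) == 1:  # exactly one assigned category
            name = id_to_name[next(iter(cat_ids))]
            if name in imagenet_category_names:  # category also exists in ImageNet
                kept[image_id] = name
    return kept
```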

- the custom COCO dataset is created by executing the [extract-custom-coco](custom/extract_custom_coco.py) script with
  the following arguments
  - `--coco-train-root-path <coco-root>/train2017`
  - `--coco-train-annotations <coco-root>/annotations/instances_train2017.json`
  - `--coco-val-root-path <coco-root>/val2017`
  - `--coco-val-annotations <coco-root>/annotations/instances_val2017.json`
  - `--imagenet-root <imagenet-root>`
  - `--target-root <target-dir>`

### Download

- **download initial data**
  - [coco-train2017](http://images.cocodataset.org/zips/train2017.zip)
  - [coco-val2017](http://images.cocodataset.org/zips/val2017.zip)
  - [coco-annotations2017](http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
    (contains `instances_train2017.json` and `instances_val2017.json`)
  - the `<imagenet-root>` needs to contain the following files
    - `ILSVRC2012_devkit_t12.tar.gz`
    - `ILSVRC2012_img_val.tar`
    - unfortunately, we cannot include a direct download link here because a login is required
    - the root link is [here](http://www.image-net.org/challenges/LSVRC/2012/downloads)
    - **WARNING**: after logging in for the download you get redirected to the data for 2010; change the link to 2012
      (http://www.image-net.org/challenges/LSVRC/2012/downloads) before downloading!
- **customized COCO data**
  - [full-dataset](https://owncloud.hpi.de/s/TRCzfvxwyHCRIQr)

## Custom COCO Datasets 512

- for our experiments we create four custom COCO datasets
  - custom-coco-food-512, custom-coco-outdoor-512, custom-coco-indoor-512, custom-coco-zebra-512
- each dataset contains the first 512 items of its categories
- we created them using the script `create_coco_n.py` (a rough sketch of the idea follows after this list)
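
The sketch below only illustrates the idea of building a 512-item subset per category group; it is not the actual `create_coco_n.py` code, and the `(image_path, category)` sample format is an assumption.

```python
# Illustrative sketch (not the actual create_coco_n.py): build a subset of at
# most n items per category group from a list of (image_path, category) samples.
from collections import defaultdict

def build_subsets(samples, groups, n=512):
    """samples: iterable of (image_path, category); groups: {group_name: set of categories}."""
    subsets = defaultdict(list)
    for image_path, category in samples:
        for group_name, categories in groups.items():
            if category in categories and len(subsets[group_name]) < n:
                subsets[group_name].append((image_path, category))
    return subsets
```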

### Download

- [custom-coco-food-512](https://owncloud.hpi.de/s/Pp2f4hdKFvUrYMm)
- [custom-coco-outdoor-512](https://owncloud.hpi.de/s/T083xp5fBt5S7OI)
- [custom-coco-indoor-512](https://owncloud.hpi.de/s/m5XjelcaVm577i4)
- [custom-coco-zebra-512](https://owncloud.hpi.de/s/MgfxezMgdWbvOYu)

## Pretrained Models: Used Data

We refer to the results and models listed in the
[official PyTorch documentation](https://pytorch.org/docs/stable/torchvision/models.html) (last accessed 01.12.2020).

Essential questions to answer for the pre-trained models (AlexNet, VGG-19, ResNet18, ResNet50, ResNet152):

- What data was used to pre-train the models? (short answer: **data of the ImageNet challenge 2012**)
- Was the validation dataset used to train the models? (short answer: **NO**)
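
For context, a minimal sketch of how these pretrained torchvision models can be loaded; the `pretrained=True` flag downloads the weights discussed below (API as in the documentation version linked above).

```python
# Minimal sketch: loading the pretrained torchvision models discussed here.
# pretrained=True downloads the weights obtained from ImageNet pre-training.
from torchvision import models

alexnet = models.alexnet(pretrained=True)
vgg19 = models.vgg19(pretrained=True)
resnet18 = models.resnet18(pretrained=True)
resnet50 = models.resnet50(pretrained=True)
resnet152 = models.resnet152(pretrained=True)

resnet18.eval()  # switch to inference mode for evaluation
```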

Exactly what data was used to train the models cannot be answered from the information given in the documentation.
Nevertheless, we can make some qualified guesses.

What data was used to train the models?

- for all models, the documentation says they were pre-trained on ImageNet
- furthermore, the provided
  [dataloader uses the ImageNet data from 2012](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/torchvision/datasets/imagenet.py#L11-L15)
  - thus we can be fairly sure that the data used to pre-train the models is the **data of the ImageNet challenge
    2012**. The data can be downloaded [here](http://image-net.org/challenges/LSVRC/2012/downloads.php#images)

Was the validation set used to train the models?

- a good overview of common definitions of the train, test, and validation sets can be
  found [here](https://machinelearningmastery.com/difference-test-validation-datasets/)
- they give the following definitions:
  - Training Dataset: The sample of data used to fit the model.
  - Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training
    dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset
    is incorporated into the model configuration.
  - Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training
    dataset.
- following these definitions, the validation dataset should not be used to train the model, but it is also said:
  - *the final model could be fit on the aggregate of the training and validation datasets*
- taking a look at the [GitHub issue](https://github.com/pytorch/vision/issues/2469) asking for the code that was used
  to generate the pre-trained models, we find that:
  - the validation split is used to generate the *dataset_test*
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L110-L138))
  - this data is **only** used to generate the *data_loader_test*
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L164))
  - the *data_loader_test* is **only** used in the *evaluate* method
    ([code](https://github.com/pytorch/vision/blob/6e7ed49a93a1b0d47cef7722ea2c2f525dcb8795/references/classification/train.py#L48-L71))
  - the *evaluate* method calculates no gradients and makes no use of the optimizer (see the sketch after this list)
  - thus, **the validation set was not used to pre-train the models we use**
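
To illustrate the last point: an evaluation pass of this kind runs under `torch.no_grad()` and never touches an optimizer, so it cannot modify the model weights. The sketch below only mirrors that pattern; it is not the actual torchvision reference code.

```python
# Minimal sketch of an evaluation loop in the style of the referenced
# evaluate() function: no gradients, no optimizer, so the weights stay fixed.
import torch

def evaluate(model, data_loader_test, device="cpu"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # disables gradient computation entirely
        for images, targets in data_loader_test:
            outputs = model(images.to(device))
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets.to(device)).sum().item()
            total += targets.size(0)
    return correct / total
```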