Creating Multi-Modal datasets using Lance
-----------------------------------------
Thanks to the Lance file format's ability to store data of different modalities, one of the use cases where Lance shines is storing multi-modal datasets.
In this brief example, we will go over how you can take a multi-modal dataset and store it in the Lance file format.

The dataset of choice here is the `Flickr8k dataset <https://github.com/goodwillyoga/Flickr8k_dataset>`_. Flickr8k is a benchmark collection for sentence-based image description and search, consisting of 8,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events.
The images were chosen from six different Flickr groups and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

We will create an image-caption pair dataset for multi-modal model training from the above-mentioned Flickr8k dataset, saving it in the form of a Lance dataset with the image file names, all captions for every image (order preserved), and the image itself (in binary format).

Imports and Setup
~~~~~~~~~~~~~~~~~
We assume you have downloaded the dataset, more specifically the "Flickr8k.token.txt" file and the "Flicker8k_Dataset/" folder, and that both are present in the current directory.
These can be downloaded from `here <https://github.com/goodwillyoga/Flickr8k_dataset?tab=readme-ov-file>`_ (download both the dataset and text zip files).

We also assume you have pyarrow and pylance installed, as well as opencv and matplotlib (for decoding and displaying images) and tqdm (for progress bars).

Now let's start with the imports and define the caption file and the image dataset folder.

.. code-block:: python

    import os
    import cv2
    import random

    import lance
    import pyarrow as pa

    import matplotlib.pyplot as plt

    from tqdm.auto import tqdm

    captions = "Flickr8k.token.txt"
    image_folder = "Flicker8k_Dataset/"

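Before going further, a quick sanity check can save some debugging later. This is a minimal sketch, assuming both files were extracted into the current directory as described above:

.. code-block:: python

    # Fail fast if the downloaded files are not where this example expects them
    assert os.path.isfile(captions), "Flickr8k.token.txt not found in the current directory"
    assert os.path.isdir(image_folder), "Flicker8k_Dataset/ folder not found in the current directory"
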
Loading and Processing
~~~~~~~~~~~~~~~~~~~~~~

In the Flickr8k dataset, each image has multiple corresponding captions, and the captions are ordered.
We are going to put all these captions in a list for each image, with each caption's position in the list representing the order in which it originally appears.
Let's load the annotations (the image name and corresponding captions) into a list, with each element being a tuple consisting of the image name, the caption number, and the caption itself.

.. code-block:: python

    with open(captions, "r") as fl:
        annotations = fl.readlines()

    # Each line looks like "<image name>#<caption number>\t<caption>";
    # convert it into a tuple of (image name, caption number, caption)
    annotations = [
        (*line.split('\t')[0].split('#'), line.split('\t')[1])
        for line in annotations
    ]

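To make the parsing concrete, here is what a single parsed annotation looks like. The file name and caption below are hypothetical (the exact strings depend on your copy of the token file), but the tuple shape is exactly what the code above produces:

.. code-block:: python

    # A raw line such as (hypothetical file name and caption):
    #   "12345.jpg#0\tA dog runs across the field .\n"
    # becomes the tuple:
    #   ("12345.jpg", "0", "A dog runs across the field .\n")
    print(annotations[0])
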
Now, for each image, we will gather all of its captions into a list sorted by their caption number.

.. code-block:: python

    captions = []
    image_ids = set(ann[0] for ann in annotations)
    for img_id in tqdm(image_ids):
        current_img_captions = []
        for ann_img_id, num, caption in annotations:
            if img_id == ann_img_id:
                current_img_captions.append((num, caption))

        # Sort by the caption number so the original ordering is preserved
        current_img_captions.sort(key=lambda x: x[0])
        captions.append((img_id, tuple([x[1] for x in current_img_captions])))

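As a quick check (assuming a complete download of the dataset), the grouped list should contain about 8,000 entries with five captions each:

.. code-block:: python

    print(len(captions))        # expected: ~8000, per the dataset description
    print(len(captions[0][1]))  # expected: 5 captions per image
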
Converting to a Lance Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that our captions list is in the proper format, we will write a ``process()`` function that takes the captions list as an argument and yields PyArrow record batches, each consisting of an ``image_id``, the ``image`` and the ``captions``.
The image in each record batch will be in binary format, and all the captions for an image will be in a list with their ordering preserved.

.. code-block:: python

    def process(captions):
        for img_id, img_captions in tqdm(captions):
            try:
                with open(os.path.join(image_folder, img_id), 'rb') as im:
                    binary_im = im.read()

            except FileNotFoundError:
                print(f"img_id '{img_id}' not found in the folder, skipping.")
                continue

            img_id = pa.array([img_id], type=pa.string())
            img = pa.array([binary_im], type=pa.binary())
            capt = pa.array([img_captions], pa.list_(pa.string()))

            yield pa.RecordBatch.from_arrays(
                [img_id, img, capt],
                ["image_id", "image", "captions"]
            )

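If you want to peek at what ``process()`` yields before writing anything, you can pull a single record batch off the generator. This is a throwaway check (it re-reads that one image from disk):

.. code-block:: python

    # Take one record batch from the generator and inspect it
    sample_batch = next(process(captions[:1]))
    print(sample_batch.num_rows)  # 1 row per batch
    print(sample_batch.schema)
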
Let's also define the schema to tell PyArrow the type of data it should expect in the table.

.. code-block:: python

    schema = pa.schema([
        pa.field("image_id", pa.string()),
        pa.field("image", pa.binary()),
        pa.field("captions", pa.list_(pa.string())),
    ])

We are including the ``image_id`` (which is the original image name) so that images are easier to reference and debug in the future.
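
As an optional check, you can confirm that a batch yielded by ``process()`` matches this schema exactly; the reader below requires every batch to conform to it. A small sketch:

.. code-block:: python

    # The schemas must match for the RecordBatchReader below to work
    assert next(process(captions[:1])).schema.equals(schema)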

Finally, we define a reader to iteratively read those record batches and write them to a Lance dataset on disk.

.. code-block:: python

    reader = pa.RecordBatchReader.from_batches(schema, process(captions))
    lance.write_dataset(reader, "flickr8k.lance", schema)

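To verify the result, here is a minimal sketch of reading the dataset back and decoding one image, assuming the write above succeeded. This is where opencv comes in handy; ``numpy`` (a dependency of opencv) is used to wrap the raw bytes for decoding:

.. code-block:: python

    import numpy as np

    ds = lance.dataset("flickr8k.lance")
    row = ds.take([0]).to_pylist()[0]

    # Decode the stored binary image back into a pixel array
    img = cv2.imdecode(np.frombuffer(row["image"], dtype=np.uint8), cv2.IMREAD_COLOR)
    print(row["image_id"], img.shape)
    print(row["captions"])
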
And that's basically it! If you want to run this example in notebook form, you can check it out in our deep learning recipes repository `here <https://github.com/lancedb/lance-deeplearning-recipes/tree/main/examples/flickr8k-dataset>`_.

For more deep learning examples using Lance datasets, be sure to check out the `lance-deeplearning-recipes <https://github.com/lancedb/lance-deeplearning-recipes>`_ repository!