
Commit 0238ab2

docs: flickr8k_dataset_creation_example (lancedb#2489)
Adds the Flickr8k dataset creation example (multi-modal dataset creation and model training duo complete).
1 parent ec04990 commit 0238ab2

3 files changed, +118 -1 lines changed

docs/examples/clip_training.rst

+1 -1
@@ -2,7 +2,7 @@ Training Multi-Modal models using a Lance dataset
 -------------------------------------------------

 In this example we will be training a CLIP model for natural image based search using a Lance image-text dataset.
-In particular, we will be using the `flickr_8k Lance dataset <https://www.kaggle.com/datasets/heyytanay/flickr-8k-lance>`_
+In particular, we will be using the `flickr_8k Lance dataset <https://www.kaggle.com/datasets/heyytanay/flickr-8k-lance>`_.

 The model architecture and part of the training code are adapted from Manan Goel's `Implementing CLIP with PyTorch Lightning <https://wandb.ai/manan-goel/coco-clip/reports/Implementing-CLIP-With-PyTorch-Lightning--VmlldzoyMzg4Njk1>`_ with necessary changes for a minimal, lance-compatible training example.

docs/examples/examples.rst

+1
@@ -7,4 +7,5 @@ Examples
 Creating text dataset for LLM training using Lance <./llm_dataset_creation.rst>
 Training LLMs using a Lance text dataset <./llm_training.rst>
 Reading and writing a Lance dataset in Rust <./write_read_dataset.rst>
+Creating Multi-Modal datasets using Lance <./flickr8k_dataset_creation.rst>
 Training Multi-Modal models using a Lance dataset <./clip_training.rst>

docs/examples/flickr8k_dataset_creation.rst

+116
@@ -0,0 +1,116 @@
Creating Multi-Modal datasets using Lance
-----------------------------------------

Thanks to the Lance file format's ability to store data of different modalities, one of the use-cases where Lance really shines is storing multi-modal datasets.
In this brief example, we will go over how you can take a multi-modal dataset and store it in the Lance file format.

The dataset of choice here is the `Flickr8k dataset <https://github.com/goodwillyoga/Flickr8k_dataset>`_. Flickr8k is a benchmark collection for sentence-based image description and search, consisting of 8,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events.
The images were chosen from six different Flickr groups, tend not to contain any well-known people or locations, and were manually selected to depict a variety of scenes and situations.

We will be creating an image-caption pair dataset for multi-modal model training from the above-mentioned Flickr8k dataset, saving it as a Lance dataset with the image file name, all the captions for every image (order preserved), and the image itself (in binary format).

Imports and Setup
~~~~~~~~~~~~~~~~~

We assume that you have downloaded the dataset, specifically the ``Flickr8k.token.txt`` file and the ``Flicker8k_Dataset/`` folder, and that both are present in the current directory.
These can be downloaded from `here <https://github.com/goodwillyoga/Flickr8k_dataset?tab=readme-ov-file>`_ (download both the dataset and text zip files).

We also assume you have ``pyarrow`` and ``pylance`` installed, as well as ``opencv`` (for reading images) and ``tqdm`` (for progress bars).

Now let's start with the imports and define the caption file and the image dataset folder.

.. code-block:: python

    import os
    import cv2
    import random

    import lance
    import pyarrow as pa

    import matplotlib.pyplot as plt

    from tqdm.auto import tqdm

    captions = "Flickr8k.token.txt"
    image_folder = "Flicker8k_Dataset/"

Loading and Processing
~~~~~~~~~~~~~~~~~~~~~~

In the Flickr8k dataset, each image has multiple corresponding captions that are ordered.
We are going to put all the captions for each image in a list, with their position in the list representing the order in which they originally appear.
Let's load the annotations (the image name and the corresponding captions) into a list, with each element being a tuple of the image name, the caption number, and the caption itself.

.. code-block:: python

    with open(captions, "r") as fl:
        annotations = fl.readlines()

    # Convert the annotations so that each element of the list is a tuple:
    # (image file name, caption number, caption)
    annotations = list(map(lambda x: tuple([*x.split('\t')[0].split('#'), x.split('\t')[1]]), annotations))
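
To make the parsing concrete, here is a minimal sketch of what a single line of ``Flickr8k.token.txt`` looks like and what the ``map`` above produces from it (the image name and caption below are made up for illustration):

.. code-block:: python

    # A made-up line in the "<image name>#<caption number>\t<caption>" format
    sample = "12345_example.jpg#0\tA dog runs across a grassy field .\n"

    # The same split performed by the map() above; note that readlines()
    # keeps the trailing newline on the caption text
    name_and_num, caption_text = sample.split('\t')
    img_name, caption_num = name_and_num.split('#')
    print((img_name, caption_num, caption_text))
    # ('12345_example.jpg', '0', 'A dog runs across a grassy field .\n')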

Now, for all the captions belonging to the same image, we will put them in a single list sorted by their ordering.

.. code-block:: python

    captions = []
    image_ids = set(ann[0] for ann in annotations)
    for img_id in tqdm(image_ids):
        current_img_captions = []
        for ann_img_id, num, caption in annotations:
            if img_id == ann_img_id:
                current_img_captions.append((num, caption))

        # Sort by the annotation number
        current_img_captions.sort(key=lambda x: x[0])
        captions.append((img_id, tuple([x[1] for x in current_img_captions])))
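
Note that the nested loop above re-scans the full annotation list once per image, which is fine at Flickr8k's scale but quadratic in general. As a sketch of an alternative, the same grouping can be built in a single pass over the annotations with a dictionary:

.. code-block:: python

    from collections import defaultdict

    # Group the (caption number, caption) pairs by image in one pass
    grouped = defaultdict(list)
    for ann_img_id, num, caption in annotations:
        grouped[ann_img_id].append((num, caption))

    # Sort each image's captions by caption number, as before
    captions = [
        (img_id, tuple(c for _, c in sorted(pairs, key=lambda x: x[0])))
        for img_id, pairs in grouped.items()
    ]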

Converting to a Lance Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that our captions list is in the proper format, we will write a ``process()`` function that takes the said captions as an argument and yields PyArrow record batches consisting of the ``image_id``, ``image`` and ``captions`` columns.
The image in each record batch will be in binary format, and all the captions for an image will be in a list with their ordering preserved.

.. code-block:: python

    def process(captions):
        for img_id, img_captions in tqdm(captions):
            try:
                with open(os.path.join(image_folder, img_id), 'rb') as im:
                    binary_im = im.read()

            except FileNotFoundError:
                print(f"img_id '{img_id}' not found in the folder, skipping.")
                continue

            img_id = pa.array([img_id], type=pa.string())
            img = pa.array([binary_im], type=pa.binary())
            capt = pa.array([img_captions], pa.list_(pa.string(), -1))

            yield pa.RecordBatch.from_arrays(
                [img_id, img, capt],
                ["image_id", "image", "captions"]
            )
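
Each batch yielded above holds a single row, which keeps memory usage low but produces many small batches. As an illustrative variation (``process_batched`` and ``batch_size`` are made-up names, not part of the original example), several images can be packed into each record batch instead:

.. code-block:: python

    def process_batched(captions, batch_size=32):
        for i in tqdm(range(0, len(captions), batch_size)):
            ids, images, capts = [], [], []
            for img_id, img_captions in captions[i:i + batch_size]:
                try:
                    # Only keep the row if the image file actually exists
                    with open(os.path.join(image_folder, img_id), 'rb') as im:
                        images.append(im.read())
                except FileNotFoundError:
                    continue
                ids.append(img_id)
                capts.append(img_captions)

            yield pa.RecordBatch.from_arrays(
                [pa.array(ids, type=pa.string()),
                 pa.array(images, type=pa.binary()),
                 pa.array(capts, pa.list_(pa.string(), -1))],
                ["image_id", "image", "captions"]
            )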

Let's also define a matching schema to tell PyArrow the type of data it should expect in each column of the table.

.. code-block:: python

    schema = pa.schema([
        pa.field("image_id", pa.string()),
        pa.field("image", pa.binary()),
        pa.field("captions", pa.list_(pa.string(), -1)),
    ])

We are including the ``image_id`` (the original image file name) to make referencing and debugging easier in the future.

Finally, we define a reader to iterate over those record batches and write them to a Lance dataset on disk.

.. code-block:: python

    reader = pa.RecordBatchReader.from_batches(schema, process(captions))
    lance.write_dataset(reader, "flickr8k.lance", schema)
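
As a quick sanity check, here is a sketch (assuming the write above succeeded and ``numpy`` is available) that opens the dataset back up, decodes the first stored image using the ``cv2`` and ``matplotlib`` imports from earlier, and prints its captions:

.. code-block:: python

    import numpy as np

    ds = lance.dataset("flickr8k.lance")
    print(ds.count_rows())  # number of images that were written

    # Take the first row; this returns a PyArrow table
    tbl = ds.take([0])
    img_bytes = tbl["image"][0].as_py()

    # Decode the raw bytes back into pixels and display the image
    img = cv2.imdecode(np.frombuffer(img_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.title(tbl["image_id"][0].as_py())
    plt.show()

    print(tbl["captions"][0].as_py())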

And that's basically it! If you want to run this in notebook form, you can check out this example in our deep learning recipes repository `here <https://github.com/lancedb/lance-deeplearning-recipes/tree/main/examples/flickr8k-dataset>`_.

For more deep learning examples using Lance datasets, be sure to check out the `lance-deeplearning-recipes <https://github.com/lancedb/lance-deeplearning-recipes>`_ repository!
