Commit 024a268

chore: initial commit of benchmark documentation with images and gifs
#905
1 parent f784720 commit 024a268

14 files changed: +356 -0 lines changed

.cspell/custom_misc.txt (+1)

@@ -65,6 +65,7 @@ typecheck
 venv
 WMMD
 wspace
+xlarge
 xticks
 yerr
 yscale

.gitattributes (+1)

@@ -3,3 +3,4 @@
 *.gif -text
 *.png -text
 *.svg -text
+examples/benchmarking_images/* filter=lfs diff=lfs merge=lfs -text

documentation/source/benchmark.rst (+325)

@@ -0,0 +1,325 @@
Benchmarking Coreset Algorithms
===============================

In this benchmark, we assess the performance of four coreset algorithms:
:class:`~coreax.solvers.KernelHerding`, :class:`~coreax.solvers.SteinThinning`,
:class:`~coreax.solvers.RandomSample`, and :class:`~coreax.solvers.RPCholesky`.
Each algorithm is evaluated across four tests, providing a comparison of their
performance and applicability to various datasets.

Test 1: Benchmarking Coreset Algorithms on the MNIST Dataset
------------------------------------------------------------

The first test evaluates the performance of the coreset algorithms on the
**MNIST dataset** using a simple neural network classifier. The process follows
these steps:

1. **Dataset**: The MNIST dataset consists of 60,000 training images and 10,000
   test images. Each image is a 28x28 pixel grey-scale image of a handwritten
   digit.

2. **Model**: A Multi-Layer Perceptron (MLP) neural network is used for
   classification. The model consists of a single hidden layer with 64 nodes.
   Images are flattened into vectors for input.

3. **Dimensionality Reduction**: To speed up computation and reduce
   dimensionality, a density-preserving :class:`~umap.umap_.UMAP` embedding is
   applied to project the 28x28 images down to 16 components before applying any
   coreset algorithm.

4. **Coreset Generation**: Coresets of various sizes are generated using the
   different coreset algorithms. For :class:`~coreax.solvers.KernelHerding` and
   :class:`~coreax.solvers.SteinThinning`, :class:`~coreax.solvers.MapReduce` is
   employed to handle large-scale data (a sketch of this composition follows the
   list).

5. **Training**: The model is trained using the selected coresets, and accuracy
   is measured on the test set of 10,000 images.

6. **Evaluation**: Due to randomness in the coreset algorithms and training
   process, the experiment is repeated 5 times with different random seeds. The
   benchmark is run on an **Amazon g4dn.12xlarge instance** with 4 NVIDIA T4
   Tensor Core GPUs, 48 vCPUs, and 192 GiB memory.
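
Step 4's pairing of a kernel-based solver with :class:`~coreax.solvers.MapReduce`
can be sketched as follows. This is a minimal illustration rather than the
benchmark's exact code: the kernel and its ``length_scale``, the ``leaf_size``,
the coreset size, and the random stand-in for the UMAP-reduced data are all
illustrative assumptions.

.. code-block:: python

    import numpy as np

    from coreax import Data
    from coreax.kernels import SquaredExponentialKernel
    from coreax.solvers import KernelHerding, MapReduce

    # Stand-in for the UMAP-reduced MNIST training set: 60,000 points in 16 dimensions.
    embeddings = Data(np.random.default_rng(0).normal(size=(60_000, 16)))

    # Kernel herding greedily selects points so that the coreset's kernel mean
    # tracks that of the full dataset.
    herding = KernelHerding(
        coreset_size=1_000,
        kernel=SquaredExponentialKernel(length_scale=1.0),
    )

    # MapReduce partitions the data, runs the base solver on each partition,
    # and reduces the merged selections, keeping the cost manageable at scale.
    solver = MapReduce(base_solver=herding, leaf_size=4_000)
    coreset, _ = solver.reduce(embeddings)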

**Results**:
The accuracy of the MLP classifier when trained on the full MNIST dataset
(60,000 training images) was 97.31%, serving as a baseline for evaluating the
performance of the coreset algorithms.

- Plots showing the accuracy (with error bars) of the MLP's predictions on the
  test set, along with the time taken to generate the coreset, for each coreset
  size and algorithm.

.. image:: ../../examples/benchmarking_images/mnist_benchmark_accuracy.png
   :alt: Benchmark Results for MNIST Coreset Algorithms

**Figure 1**: Accuracy of coreset algorithms on the MNIST dataset. Bar heights
represent the average accuracy. Error bars represent the min-max range for
accuracy for each coreset size across 5 runs.

.. image:: ../../examples/benchmarking_images/mnist_benchmark_time_taken.png
   :alt: Time Taken Benchmark Results for MNIST Coreset Algorithms

**Figure 2**: Time taken to generate a coreset with each coreset algorithm. Bar
heights represent the average time taken. Error bars represent the min-max range
for each coreset size across 5 runs.

Test 2: Benchmarking Coreset Algorithms on a Synthetic Dataset
--------------------------------------------------------------

In this second test, we evaluate the performance of the coreset algorithms on a
**synthetic dataset**. The dataset consists of 1,000 points in two-dimensional
space, generated using :func:`sklearn.datasets.make_blobs`. The process follows
these steps:

1. **Dataset**: A synthetic dataset of 1,000 points is generated to test the
   quality of the coreset algorithms.

2. **Coreset Generation**: Coresets of different sizes (10, 50, 100, and 200
   points) are generated using each coreset algorithm.

3. **Evaluation Metrics**: Two metrics evaluate the quality of the generated
   coresets: :class:`~coreax.metrics.MMD` and :class:`~coreax.metrics.KSD`.

4. **Optimisation**: We optimise the weights for each coreset to minimise the
   MMD score, then recompute both the MMD and KSD metrics. This entire process
   is repeated 5 times with 5 random seeds and the metrics are averaged. A
   sketch of these steps follows this list.
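
:class:`~coreax.metrics.MMD` is based on the maximum mean discrepancy between the
full dataset :math:`X` and the coreset :math:`Y` under a kernel :math:`k`, whose
squared form is

.. math::

   \text{MMD}^2(X, Y) = \mathbb{E}_{x, x'}\left[k(x, x')\right]
   - 2\,\mathbb{E}_{x, y}\left[k(x, y)\right]
   + \mathbb{E}_{y, y'}\left[k(y, y')\right].

The steps above can be pieced together as in the minimal sketch below. This is
not the benchmark's exact code: the kernel settings, the random key, and the
``MMDWeightsOptimiser`` usage for step 4 are assumptions about the ``coreax``
API rather than confirmed configuration.

.. code-block:: python

    import numpy as np
    from jax import random
    from sklearn.datasets import make_blobs

    from coreax import Data
    from coreax.kernels import SquaredExponentialKernel
    from coreax.metrics import MMD
    from coreax.solvers import RandomSample
    from coreax.weights import MMDWeightsOptimiser

    # Step 1: a synthetic dataset of 1,000 two-dimensional points.
    x, _ = make_blobs(n_samples=1_000, n_features=2, random_state=0)
    dataset = Data(np.asarray(x, dtype=np.float32))

    # Step 2: draw a size-100 coreset (RandomSample shown; other solvers swap in).
    solver = RandomSample(coreset_size=100, random_key=random.PRNGKey(0))
    coreset, _ = solver.reduce(dataset)

    # Step 3: score the coreset against the full dataset with unweighted MMD.
    kernel = SquaredExponentialKernel(length_scale=1.0)
    score = MMD(kernel=kernel).compute(dataset, coreset.coreset)

    # Step 4 (sketch): optimise coreset weights to minimise MMD before
    # recomputing the weighted metrics.
    weights = MMDWeightsOptimiser(kernel=kernel).solve(dataset, coreset.coreset)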

**Results**:
The tables below show the performance metrics (Unweighted MMD, Unweighted KSD,
Weighted MMD, Weighted KSD, and Time) for each coreset algorithm and each coreset
size. For each metric and coreset size, the best (lowest) score is highlighted in
bold.

.. list-table:: Coreset Size 10 (Original Sample Size 1,000)
   :header-rows: 1
   :widths: 20 15 15 15 15 15

   * - Method
     - Unweighted MMD
     - Unweighted KSD
     - Weighted MMD
     - Weighted KSD
     - Time
   * - KernelHerding
     - **0.071504**
     - 0.087505
     - 0.037931
     - 0.082903
     - 5.884511
   * - RandomSample
     - 0.275138
     - 0.106468
     - 0.080327
     - **0.082597**
     - **2.705248**
   * - RPCholesky
     - 0.182342
     - 0.079254
     - **0.032423**
     - 0.085621
     - 3.177700
   * - SteinThinning
     - 0.186064
     - **0.078773**
     - 0.087347
     - 0.085194
     - 4.450125

.. list-table:: Coreset Size 50 (Original Sample Size 1,000)
   :header-rows: 1
   :widths: 20 15 15 15 15 15

   * - Method
     - Unweighted MMD
     - Unweighted KSD
     - Weighted MMD
     - Weighted KSD
     - Time
   * - KernelHerding
     - **0.016602**
     - 0.080800
     - 0.003821
     - **0.079875**
     - 5.309067
   * - RandomSample
     - 0.083658
     - 0.084844
     - 0.005009
     - 0.079948
     - **2.636160**
   * - RPCholesky
     - 0.133182
     - **0.061976**
     - **0.001859**
     - 0.079935
     - 3.201798
   * - SteinThinning
     - 0.079028
     - 0.074763
     - 0.009652
     - 0.080119
     - 3.735810

.. list-table:: Coreset Size 100 (Original Sample Size 1,000)
   :header-rows: 1
   :widths: 20 15 15 15 15 15

   * - Method
     - Unweighted MMD
     - Unweighted KSD
     - Weighted MMD
     - Weighted KSD
     - Time
   * - KernelHerding
     - **0.007747**
     - 0.080280
     - 0.001582
     - 0.080024
     - 5.425807
   * - RandomSample
     - 0.032532
     - 0.077081
     - 0.001638
     - 0.080073
     - **3.009871**
   * - RPCholesky
     - 0.069909
     - **0.072023**
     - **0.000977**
     - 0.079995
     - 3.497632
   * - SteinThinning
     - 0.118452
     - 0.081853
     - 0.002652
     - **0.079836**
     - 3.766622

.. list-table:: Coreset Size 200 (Original Sample Size 1,000)
   :header-rows: 1
   :widths: 20 15 15 15 15 15

   * - Method
     - Unweighted MMD
     - Unweighted KSD
     - Weighted MMD
     - Weighted KSD
     - Time
   * - KernelHerding
     - **0.003937**
     - 0.079932
     - 0.001064
     - 0.080012
     - 5.786333
   * - RandomSample
     - 0.048701
     - 0.077522
     - 0.000913
     - 0.080059
     - **2.964436**
   * - RPCholesky
     - 0.052085
     - **0.075708**
     - **0.000772**
     - 0.080050
     - 3.722556
   * - SteinThinning
     - 0.129073
     - 0.084883
     - 0.002329
     - **0.079847**
     - 4.004353

**Visualisation**: The results in these tables can be visualised as follows:

.. image:: ../../examples/benchmarking_images/blobs_benchmark_results.png
   :alt: Benchmark Results for Synthetic Dataset

**Figure 3**: Line graphs depicting the average performance metrics across 5 runs
of each coreset algorithm on the synthetic dataset.

Test 3: Benchmarking Coreset Algorithms on Pixel Data from an Image
-------------------------------------------------------------------

This test evaluates the performance of coreset algorithms on pixel data extracted
from an input image. The process follows these steps:

1. **Image Preprocessing**: An image is loaded and converted to grey-scale. Pixel
   locations and values are extracted for use in the coreset algorithms (a sketch
   of this step follows the list).

2. **Coreset Generation**: Coresets of size 20% of the number of pixels in the
   original image are generated using each coreset algorithm.

3. **Visualisation**: The original image is plotted alongside the coresets
   generated by each algorithm. This visual comparison helps assess how well each
   algorithm represents the image.
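
The preprocessing in step 1 might look like the sketch below; the file name
``david.png`` is a hypothetical stand-in for the benchmark's input image, and
scaling intensities to [0, 1] is an assumption.

.. code-block:: python

    import numpy as np
    from PIL import Image

    # Load the input image and convert it to grey-scale ("L" = one luminance channel).
    img = np.asarray(Image.open("david.png").convert("L"), dtype=np.float32)

    # One data point per pixel: (row, column, intensity), intensity scaled to [0, 1].
    rows, cols = np.indices(img.shape)
    pixel_data = np.column_stack([rows.ravel(), cols.ravel(), img.ravel() / 255.0])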

**Results**: The following plot visualises the pixels chosen by each coreset
algorithm.

.. image:: ../../examples/benchmarking_images/david_benchmark_results.png
   :alt: Coreset Visualisation on Image

**Figure 4**: The original image and pixels selected by each coreset algorithm
plotted side-by-side for visual comparison.

Test 4: Benchmarking Coreset Algorithms on Frame Data from a GIF
----------------------------------------------------------------

The fourth and final test evaluates the performance of coreset algorithms on data
extracted from an input **GIF**. This test involves the following steps:

1. **Input GIF**: A GIF is loaded, and its frames are preprocessed.

2. **Dimensionality Reduction**: A density-preserving :class:`~umap.umap_.UMAP`
   embedding is applied to reduce the dimensionality of each frame to 25
   components (a sketch of this step follows the list).

3. **Coreset Generation**: Coresets are generated using each coreset algorithm,
   and the selected frames are saved as new GIFs.
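
The density-preserving reduction in step 2 corresponds to UMAP's ``densmap`` mode.
A minimal sketch follows, with randomly generated stand-in frames in place of the
real GIF data; the frame count and resolution are illustrative assumptions.

.. code-block:: python

    import numpy as np
    import umap

    # Stand-in for the GIF: 120 frames, each flattened from 64x64 pixels to a vector.
    frames = np.random.default_rng(0).random((120, 64 * 64))

    # densmap=True enables the density-preserving variant; reduce each frame to
    # 25 components.
    reducer = umap.UMAP(densmap=True, n_components=25, random_state=0)
    embedded_frames = reducer.fit_transform(frames)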

**Result**:

- GIF files showing the selected frames for each coreset algorithm.

.. image:: ../../examples/pounce/pounce.gif
   :alt: Coreset Visualisation on GIF Frames

**GIF 1**: The original GIF file.

.. image:: ../../examples/benchmarking_images/RandomSample_coreset.gif
   :alt: Coreset Visualisation on GIF Frames

**GIF 2**: Frames selected by Random Sample.

.. image:: ../../examples/benchmarking_images/SteinThinning_coreset.gif
   :alt: Coreset Visualisation on GIF Frames

**GIF 3**: Frames selected by Stein Thinning.

.. image:: ../../examples/benchmarking_images/RPCholesky_coreset.gif
   :alt: Coreset Visualisation on GIF Frames

**GIF 4**: Frames selected by RP Cholesky.

.. image:: ../../examples/benchmarking_images/KernelHerding_coreset.gif
   :alt: Coreset Visualisation on GIF Frames

**GIF 5**: Frames selected by Kernel Herding.

.. image:: ../../examples/benchmarking_images/pounce_frames.png
   :alt: Coreset Visualisation on GIF Frames

**Figure 5**: Frames chosen by each coreset algorithm, with the action frames
(the frames in which the pounce takes place) highlighted in red.

Conclusion
----------

In this benchmark, we evaluated four coreset algorithms across various datasets
and tasks, including image classification, synthetic datasets, and pixel/frame
data processing. Based on the results, **Kernel Herding** emerges as the preferred
choice for most tasks due to its consistent performance. For larger datasets,
combining Kernel Herding with the divide-and-conquer **MapReduce** strategy is
recommended to ensure scalability and efficiency.

For specialised tasks, such as frame selection from GIFs (Test 4), **Stein
Thinning** demonstrated superior performance and may be the optimal choice.

Ultimately, this conclusion reflects one interpretation of the results, and
readers are encouraged to analyse the benchmarks and derive their own insights
based on the specific requirements of their tasks.

documentation/source/conf.py (+1)

@@ -157,6 +157,7 @@
     "sklearn": ("https://scikit-learn.org/stable", None),
     "tqdm": ("https://tqdm.github.io/docs", str(TQDM_CUSTOM_PATH)),
     "equinox": ("https://docs.kidger.site/equinox", None),
+    "umap": ("https://umap-learn.readthedocs.io/en/latest", None),
 }

 nitpick_ignore = [

documentation/source/index.rst (+1)

@@ -44,6 +44,7 @@ Contents
    coresets
    quickstart
    faq
+   benchmark

 .. toctree::
    :hidden: