Over the past couple of days I have tried various instance types on AWS, from the C6 and C7 series to R6 and R7. All of the instances I tried have more than 96 cores and 256 GB of RAM.
I found that the k-means initialization of the clustering for flop consistently takes around 5~11 hours.
However, an HPC6a.48xlarge instance showed a dramatic improvement: the initialization finished in about 12 minutes!!!
The HPC instance I used has around 384 GB of RAM (which was barely touched), 96 cores, and, most importantly, 512 MB of L3 cache and 48 MB of L2 cache. A C6a.32xlarge only has 128 MB of L3 cache, so the cache size seems to be the key factor behind the training time improvement.
My conclusion is that the clustering does a significant amount of repetitive memory fetching and writing, so we could apply a loop tiling technique to significantly boost the clustering efficiency. A rough sketch of what I have in mind is below.
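To make the idea concrete, here is a minimal sketch of what a tiled nearest-centroid assignment step could look like in NumPy. This is not the project's actual code: the function name, tile sizes, and the squared-norm distance formulation are all just illustrative assumptions, and the real initialization would need its own tuning.

```python
import numpy as np

def assign_points_tiled(points, centroids, point_tile=8192, centroid_tile=256):
    """Nearest-centroid assignment with loop tiling.

    Both arrays are processed in fixed-size tiles so the active centroid
    tile (and its partial distance results) stays resident in L2/L3 cache
    while a block of points streams through it, instead of re-reading the
    full centroid matrix from RAM for every point.

    points:    (n, d) float32 array
    centroids: (k, d) float32 array
    returns:   (n,) int64 array of nearest-centroid indices
    """
    n = points.shape[0]
    k = centroids.shape[0]

    best_dist = np.full(n, np.inf, dtype=np.float32)
    best_idx = np.zeros(n, dtype=np.int64)

    # Precompute squared norms once; ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c
    point_sq = np.einsum("ij,ij->i", points, points)
    cent_sq = np.einsum("ij,ij->i", centroids, centroids)

    for p0 in range(0, n, point_tile):
        p1 = min(p0 + point_tile, n)
        pblk = points[p0:p1]                       # tile of points
        for c0 in range(0, k, centroid_tile):
            c1 = min(c0 + centroid_tile, k)
            cblk = centroids[c0:c1]                # tile of centroids

            # (p1-p0, c1-c0) squared distances for this tile pair
            dists = (point_sq[p0:p1, None]
                     + cent_sq[None, c0:c1]
                     - 2.0 * pblk @ cblk.T)

            local_best = dists.argmin(axis=1)
            local_dist = dists[np.arange(p1 - p0), local_best]

            # Merge this tile's result with the running best so far
            improved = local_dist < best_dist[p0:p1]
            best_dist[p0:p1][improved] = local_dist[improved]
            best_idx[p0:p1][improved] = local_best[improved] + c0

    return best_idx
```

The point is just that each centroid tile gets reused against a whole block of points while it is still hot in cache; the tile sizes would have to be tuned against the actual L2/L3 sizes of the target instance.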
What do you guys think!
this sounds quite promising! it is certainly true that we do a ton of repetitive memory access in the training steps, and i've not thought about how to optimize these operations. what do you imagine a loop tiling approach could look like here?