[Discussion] Loop Tiling Could possibly make it faster for clustering #36

Open
andyafter opened this issue Mar 9, 2025 · 2 comments

@andyafter

Over the past couple of days I have tried various AWS instances, from the C6 and C7 series to R6 and R7. All of the instances I tried have more than 96 cores and 256GB of RAM.

I found that the kmeans initialization for the flop clustering consistently took around 5~11 hours on these instances.

However, an hpc6a.48xlarge instance showed a significant improvement: initialization took only 12 minutes!!!

The HPC instance I used has around 384GB of RAM (which was barely used) and 96 cores, and most importantly, 512MB of L3 cache and 48MB of L2 cache. By comparison, c6a.32xlarge has 128MB of L3 cache, so cache size seems to be the key to this training time improvement.
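A rough way to sanity-check the cache hypothesis is to compare the size of the data the inner loop keeps re-reading against the L3 sizes quoted above. The sketch below is only back-of-the-envelope arithmetic: the vector count, dimension, and 4-byte element size are made-up placeholders, not numbers taken from this project, and the helper names are hypothetical.

```python
MIB = 1024 * 1024


def working_set_bytes(n_rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold n_rows vectors of `dim` values each."""
    return n_rows * dim * bytes_per_value


def fits_in_cache(n_rows: int, dim: int, cache_bytes: int,
                  bytes_per_value: int = 4) -> bool:
    """True if the whole table could stay resident in a cache of that size."""
    return working_set_bytes(n_rows, dim, bytes_per_value) <= cache_bytes


# Hypothetical sizes (NOT from the project): 500_000 vectors of 128 values.
table = working_set_bytes(500_000, 128)          # 256_000_000 bytes
small = fits_in_cache(500_000, 128, 128 * MIB)   # False: exceeds 128MB of L3
large = fits_in_cache(500_000, 128, 512 * MIB)   # True: fits in 512MB of L3
```

If the real working set falls between the two L3 sizes, that would be consistent with the jump in runtime, though it would not prove it; profiling cache misses would be the more direct check.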

My conclusion is that the clustering does a significant amount of repetitive memory reads and writes, so we could implement a loop tiling technique to significantly boost clustering efficiency.
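To make the idea concrete, here is a minimal sketch of loop tiling applied to a generic kmeans assignment step (nearest centroid by squared Euclidean distance). It is not the project's clustering code: the function names, the distance metric, and the tile sizes are all assumptions for illustration. Pure Python will not show the speedup; the point is the loop order, which in a compiled language keeps one block of points and one block of centroids hot in cache while they are reused.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed metric, for illustration only)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_naive(points: List[Vector], centroids: List[Vector]) -> List[int]:
    """Untiled baseline: each point streams over every centroid."""
    out = []
    for p in points:
        best, best_d = 0, float("inf")
        for j, c in enumerate(centroids):
            d = sq_dist(p, c)
            if d < best_d:
                best, best_d = j, d
        out.append(best)
    return out


def assign_tiled(points: List[Vector], centroids: List[Vector],
                 point_tile: int = 256, centroid_tile: int = 64) -> List[int]:
    """Tiled version: a block of points is compared against a block of
    centroids before moving on, so both blocks are reused while resident."""
    n, k = len(points), len(centroids)
    best = [0] * n
    best_d = [float("inf")] * n
    for p0 in range(0, n, point_tile):            # outer tile over points
        p1 = min(p0 + point_tile, n)
        for c0 in range(0, k, centroid_tile):     # outer tile over centroids
            c1 = min(c0 + centroid_tile, k)
            for i in range(p0, p1):               # inner loops stay in-tile
                p = points[i]
                for j in range(c0, c1):
                    d = sq_dist(p, centroids[j])
                    if d < best_d[i]:
                        best_d[i] = d
                        best[i] = j
    return best
```

Both versions visit centroids in ascending index order for each point and use the same strict comparison, so they return identical assignments. The tile sizes would need to be tuned so that one point tile plus one centroid tile fits in the target cache level.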

What do you guys think?

@krukah
Owner

krukah commented Mar 10, 2025

this sounds quite promising! it is certainly true that we do a ton of repetitive memory access in the training steps, and i've not thought about how to optimize these operations. what do you imagine a loop tiling approach could look like here?

@andyafter
Author

Still trying to get a good understanding of the clustering code. Let me answer this a bit later.
