
[Request] Are there any benchmarks on the resource consumption of each step? #35

Open
andyafter opened this issue Mar 7, 2025 · 5 comments


@andyafter

It seems that when I run this program, the flop clustering takes around 25 days on a c6a.32xlarge instance, which has 128 virtual CPUs.

Is there an overview of the entire training process? That would be very helpful.

@andyafter
Author

To add some information: I ran robopoker with cargo run (which appears to train a full heads-up game) on a c6a.32xlarge instance, which has 128 virtual CPUs and 256 GB of RAM.

It seems to be stuck on "clustering kmeans, flop", which by my understanding shouldn't take this long (estimated at 900+ hours).

[screenshot of the "clustering kmeans, flop" progress output]

Is there anything that I am doing wrong?

@krukah
Owner

krukah commented Mar 10, 2025

Thanks for opening the issue! It doesn't look like you're doing anything wrong here; the calculation is just very CPU-intensive. I'm quite interested in these two issues, however, since they would be huge for getting the abstraction to run orders of magnitude faster.

From my experience running on a 176-core machine, the flop clustering does indeed take about 4 days to complete, so call it roughly 700 core-days of work (176 cores × ~4 days).
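For rough planning, here is a back-of-envelope sketch of that scaling, assuming the clustering parallelizes roughly linearly with core count; the core and day figures are just the ones from this thread, and SMT, memory bandwidth, and per-core speed differences are ignored:

```rust
/// Naive scaling estimate: total work in core-days observed on one machine,
/// divided by the core count of another machine.
fn estimated_days(reference_cores: f64, reference_days: f64, target_cores: f64) -> f64 {
    let core_days = reference_cores * reference_days; // total work, in core-days
    core_days / target_cores // wall-clock days on the target machine
}

fn main() {
    // ~176 cores for ~4 days ≈ 700 core-days of flop clustering (figures from this thread).
    println!("total work ≈ {} core-days", 176.0 * 4.0);
    // 128 vCPUs is likely 64 physical cores with SMT, so both bounds are shown.
    println!("128 vCPUs ≈ {:.1} days", estimated_days(176.0, 4.0, 128.0));
    println!(" 64 cores ≈ {:.1} days", estimated_days(176.0, 4.0, 64.0));
}
```

Actual throughput will differ across machines, which may be part of why the in-program 900h+ estimate is so much higher than a naive linear projection.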

May I ask what parameters you are using in lib.rs?

@andyafter
Author

I have not changed anything in the source code, so the configuration should be the defaults from the repository:

```rust
const KMEANS_FLOP_TRAINING_ITERATIONS: usize = 32; // eyeball test seems to converge around here for K = 128
const KMEANS_TURN_TRAINING_ITERATIONS: usize = 32; // eyeball test seems to converge around here for K = 144
const KMEANS_FLOP_CLUSTER_COUNT: usize = 128;
const KMEANS_TURN_CLUSTER_COUNT: usize = 144;
const KMEANS_EQTY_CLUSTER_COUNT: usize = 101;
```
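For a sense of why the flop step dominates with these defaults, here is a rough cost model; this is only a sketch, not code from the repository. The observation count below is the commonly cited number of suit-isomorphic (hole cards, flop) deals and is my assumption about what gets clustered, and each distance evaluation is, as I understand it, an earth mover's distance between histograms, which is itself expensive:

```rust
/// Rough cost model for Lloyd-style k-means: every observation is compared to
/// every centroid once per iteration.
fn distance_evaluations(observations: u64, clusters: u64, iterations: u64) -> u64 {
    observations * clusters * iterations
}

fn main() {
    // ~1.29M suit-isomorphic (hole cards, flop) deals -- an assumption, not a figure
    // read out of the robopoker source.
    let flop_observations: u64 = 1_286_792;
    let evals = distance_evaluations(flop_observations, 128, 32);
    println!("flop k-means ≈ {evals} distance evaluations"); // ≈ 5.3 billion
}
```

If the ~700 core-days figure above is right, that works out to roughly 10 ms per distance evaluation, which seems plausible for an EMD over sizeable histograms and would explain why this step dwarfs the others.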

@krukah
Owner

krukah commented Mar 10, 2025

Did this particular run eventually terminate? Or did you find a different machine that just ran it much faster?

@andyafter
Author

I quit the program and re-ran it on a new HPC instance, and it is much faster for sure. I am waiting for the flop clustering to finish.
