
Benchmark scripts #2525


Open · wants to merge 9 commits into main
Conversation

ved1beta commented Apr 30, 2025

scripts PR
docs PR

These are the scripts in the format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.

BenjaminBossan (Member) commented:

Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (benchmark_result.json)? We will run the experiments on our hardware to get reproducible results.

BenjaminBossan (Member) left a comment:

Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with similar usability to the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.

Regarding the result log, it is currently just a flat json. I would like to see more structure there. In MetaMathQA, we have a json with 3 keys: run_info, train_info, and meta_info. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.

We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical to those of the base model. There are some PEFT methods that cannot be zero-initialized, however, which means for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
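
For illustration, a minimal sketch of per-token timing, assuming a transformers causal LM and tokenizer (the function name is hypothetical, not part of the PR):

import time

import torch


def generation_time_per_token(model, tokenizer, prompt, max_new_tokens=20):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    # Normalize by the number of newly generated tokens so that methods which
    # produce longer outputs are not automatically penalized.
    num_new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return elapsed / max(num_new_tokens, 1)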

One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal how, but let me explain what I mean: Right now, there is a LoRA config with rank 8 and another one with rank 16. The rest is identical. If we want to add more ranks, each time, we need to create a copy. And what if we want to parametrize another setting? The number of configs would increase polynomially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?

Also, please add a README.md and run make style before pushing your changes.

    ],
    "long": [
        """Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
        Include a detailed comparison of at least five different approaches, their theoretical foundations,
BenjaminBossan (Member):

Note that the leading spaces are part of the prompt, which is undesired. Either remove them here, which looks a bit ugly, or wrap the whole text with textwrap.dedent.
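
For example, something along these lines (a sketch, not the exact prompt used in the PR):

import textwrap

prompt = textwrap.dedent("""\
    Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
    Include a detailed comparison of at least five different approaches.
""")
# textwrap.dedent strips the common leading whitespace, so the indentation used
# for code readability does not end up in the prompt itself.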

Comment on lines 97 to 110
# If a prompts file is specified, load it
if "prompts_file" in config:
    file_path = config["prompts_file"]
    if os.path.exists(file_path):
        with open(file_path, "r") as f:
            file_prompts = json.load(f)
        # Update or add to default prompts
        for category, prompt_list in file_prompts.items():
            prompts[category] = prompt_list

# If custom prompts are specified directly in config
if "custom_prompts" in config:
    for category, prompt_list in config["custom_prompts"].items():
        prompts[category] = prompt_list
BenjaminBossan (Member):

AFAICT, this is currently not being used and I'm not quite sure what the intent for this is. Could you please either explain this with an example, or just remove it for now?

prompts[category] = prompt_list

# If specific categories are requested, filter to just those
if "prompt_categories" in config:
BenjaminBossan (Member):

Hmm, so right now, for each experiment we can define what type of prompts we would like? Would this not mean we cannot fully compare the results between different experiments? I think what I would prefer is that we run the benchmark for each prompt category and track the metrics separately. Then we have, for example, inference speed for short, medium, long ... prompts. We might also need to adjust the max generated tokens accordingly.
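
For illustration, the per-category loop could look roughly like this (run_inference_benchmark and max_tokens_per_category are hypothetical names, not part of the PR):

results = {}
for category, prompt_list in prompts.items():  # e.g. "short", "medium", "long"
    results[category] = run_inference_benchmark(
        model,
        prompt_list,
        max_new_tokens=max_tokens_per_category[category],  # adjust generation length per category
    )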

return prompts


def get_prompts_by_length(prompts: Dict[str, List[str]], length: str = "all") -> List[str]:
BenjaminBossan (Member):

If we make the change I proposed above of testing each prompt category, I think this function won't be necessary anymore.

Comment on lines 87 to 96
memory_allocated_log: List[float] = field(default_factory=list)
memory_reserved_log: List[float] = field(default_factory=list)

# Performance metrics
inference_times: Dict[str, float] = field(default_factory=dict)
inference_overhead: Dict[str, float] = field(default_factory=dict)
training_throughput: float = 0.0 # tokens/second

# Additional metrics
metrics: List[Dict[str, Any]] = field(default_factory=list)
BenjaminBossan (Member):

We can use list[str], dict[str, Any], etc.; there is no need for List and Dict.
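
For example, the fields quoted above would become (the builtin generics work on Python 3.9+):

memory_allocated_log: list[float] = field(default_factory=list)
inference_times: dict[str, float] = field(default_factory=dict)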

"""Configuration for benchmarking PEFT methods."""
# Model configuration
model_id: str
peft_method: Literal["lora", "adalora", "bone", "ia3", "prompt_tuning", "prefix_tuning", "none"]
BenjaminBossan (Member):

Let's just put str here, since this list is much bigger than this and will grow in the future. I'm not even sure if we need it at all, as we can use peft_config.peft_type
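
For illustration, assuming the current PEFT API, the method name is already carried by the config object:

from peft import LoraConfig

config = LoraConfig(task_type="CAUSAL_LM")
print(config.peft_type)  # PeftType.LORA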

Comment on lines 137 to 152
seed: int = 42
num_inference_runs: int = 5
max_new_tokens: int = 20
train_batch_size: int = 4
train_steps: int = 10

# Data settings
prompt_categories: List[str] = field(default_factory=lambda: ["short", "medium"])
num_prompt_samples: int = 2
reserve_output_tokens: int = 50

# Optional settings
use_4bit: bool = False
use_8bit: bool = False
compile_model: bool = False
merge_adapter: bool = False
BenjaminBossan (Member):

Again, let's not set any defaults if they're defined by the config file. Also, compile_model and merge_adapter are not used, so let's remove them for now.

merge_adapter: bool = False

# Method-specific parameters (these would be overridden by the experiment config)
peft_params: Dict[str, Any] = field(default_factory=dict)
BenjaminBossan (Member):

Not sure why we need the special handling of PEFT params, could you please explain?

return sum(p.numel() for p in model.parameters() if p.requires_grad)


def time_function(fn: Callable, *args, **kwargs) -> Tuple[Any, float]:
BenjaminBossan (Member):

It appears this function is not being used.

ved1beta (Author) commented May 5, 2025

Tried to cover all the changes, please have a look :)

BenjaminBossan (Member) left a comment:

Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.

@@ -0,0 +1,12 @@
{
BenjaminBossan (Member):

Let's create a default json instead of having a sample_config.json.

ved1beta (Author) commented May 5, 2025

I am not sure why it's not running on your side; the rest we can mostly fix. Please can you guide me with that?

ved1beta force-pushed the benchmark2scripts branch from ea4bb46 to 627a038 on May 5, 2025 17:15
ved1beta (Author) commented May 5, 2025

There are some import issues, so we can't run it directly:

cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

We need to specify the path and all to make it work. I am working on it; let me know if you have any idea how to fix it.

githubnemo (Collaborator) commented:

Can you be a bit more specific on what import errors you're experiencing?

ved1beta (Author) commented May 13, 2025

@githubnemo The import errors occur because the script imports its sibling modules directly (from data import ... and from utils import ...), and those imports only resolve when the script's directory is on the Python path. We can either:

1. Use absolute imports:

   from method_comparison.peft_bench.data import prepare_benchmark_prompts
   from method_comparison.peft_bench.utils import ...

2. Or add the script's directory to the Python path:

   import os
   import sys

   sys.path.append(os.path.dirname(os.path.abspath(__file__)))

For now it has to be run like this:

cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

Hey @BenjaminBossan, please can you have a look and give me the final batch of changes? Willing to finish this ASAP :)

githubnemo (Collaborator) left a comment:

Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.

Regarding the import path issues (and the experiment path resolving) I'd suggest to limit the execution of experiments to the method_comparison/peft_bench folder - every other folder is unsupported, making the code a bit simpler, I think.

I think there are still some comments from @BenjaminBossan that are unresolved regarding number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.

Comment on lines 87 to 121
# Default meta_info
self.meta_info = {
    "model_id": self.model_id,
    "peft_method": self.peft_method,
    "parameters": {
        "base_params": 0,
        "trainable_params": 0,
        "total_params": 0,
        "param_ratio": 0.0,
    },
    "model_size": {
        "base_model_size_mb": 0.0,
        "adapter_size_mb": 0.0,
    },
}

# Default train_info
self.train_info = {
    "training_throughput": 0.0,  # tokens/second
    "memory": {
        "peak_gpu_memory_mb": 0.0,
        "peak_ram_memory_mb": 0.0,
        "memory_logs": [],
    },
    "inference": {
        "times": {},
        "overhead": {},
    },
}

# Default metrics structure
self.metrics = {
    "by_category": {},  # Will hold metrics for each prompt category
    "overall": {},  # Overall metrics across all categories
}
githubnemo (Collaborator):

Thank you for restructuring the benchmark results! I think we don't need to mimic MetaMathQA exactly; we can merely take its structure as an inspiration. Since we're not doing training, it would probably be best to not have a train_info section. We're measuring inference performance, so generation_info or something similar would be better suited, I think.

Maybe I'm misunderstanding something but isn't train_info.inference redundant with metrics.*.inference_time? Would it make sense to place the metrics under generation_info? E.g., generation_info.{by_category,overall}.[...]?
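
One possible shape, purely as a sketch of the idea (key names are not final):

result = {
    "run_info": {},   # run metadata, as in MetaMathQA
    "meta_info": {},  # model / adapter metadata
    "generation_info": {
        "by_category": {
            "short": {"inference_time": 0.0, "time_per_token": 0.0},
            "medium": {"inference_time": 0.0, "time_per_token": 0.0},
        },
        "overall": {"inference_time": 0.0, "time_per_token": 0.0},
    },
}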

Comment on lines 365 to 372
# Use benchmark_params.json if exists, otherwise use default config
if os.path.exists(benchmark_params_path):
    benchmark_config = BenchmarkConfig.from_json(benchmark_params_path)
elif os.path.exists(default_config_path):
    print(f"No benchmark_params.json found in {path}, using default configuration")
    benchmark_config = BenchmarkConfig.from_json(default_config_path)
else:
    raise FileNotFoundError(f"Neither benchmark_params.json nor default_config.json found")
githubnemo (Collaborator):

Let's always load the default config first, then load the specific benchmark config and merge the two so that the benchmark config only needs to specify the values that are diverging, keeping the experiment configs small and readable.
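
A rough sketch of that merge, reusing the variable names from the snippet above (and assuming BenchmarkConfig can be constructed from keyword arguments):

import json
import os

with open(default_config_path) as f:
    config_dict = json.load(f)

# Experiment-specific values override the defaults
if os.path.exists(benchmark_params_path):
    with open(benchmark_params_path) as f:
        config_dict.update(json.load(f))

benchmark_config = BenchmarkConfig(**config_dict)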

@@ -0,0 +1,52 @@
{
githubnemo (Collaborator):

I think this example is still worth keeping but can probably be thinned out quite a bit once default config + experiment config are merged upon load (see comment below).

@@ -0,0 +1,17 @@
{
"base_model_name_or_path": "facebook/opt-350m",
githubnemo (Collaborator):

Additionally this seems to be a default config for an adapter, not an experiment. validate_experiment_path suggests that this should be a default config for an experiment (and I concur :))

It should also reside in configs/.

Comment on lines +20 to +51
"peft_config_variants": [
{
"variant_name": "lora_r8",
"peft_method": "lora",
"r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
},
{
"variant_name": "lora_r16",
"peft_method": "lora",
"r": 16,
"lora_alpha": 32,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
},
{
"variant_name": "lora_r32",
"peft_method": "lora",
"r": 32,
"lora_alpha": 64,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
}
]
githubnemo (Collaborator):

I like the idea of having config variants to keep the experiment config small. Would it make sense to have a peft_config_base key alongside peft_config_variants so that the variant only needs to update the keys that change? This would remove a lot of the repetition we see currently.
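
A small sketch of what that could look like, written here as Python dicts (the key names are only a suggestion):

peft_config_base = {
    "peft_method": "lora",
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": ["q_proj", "v_proj"],
}
peft_config_variants = [
    {"variant_name": "lora_r8", "r": 8, "lora_alpha": 16},
    {"variant_name": "lora_r16", "r": 16, "lora_alpha": 32},
    {"variant_name": "lora_r32", "r": 32, "lora_alpha": 64},
]
# The effective config for each variant is the base updated with its overrides
effective_configs = [{**peft_config_base, **variant} for variant in peft_config_variants]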

Comment on lines 48 to 50
# Apply textwrap.dedent to remove leading spaces from multiline prompts
for category, prompt_list in prompts.items():
    prompts[category] = [textwrap.dedent(prompt) for prompt in prompt_list]
githubnemo (Collaborator):

I think this is not necessary anymore since we store the prompts in a JSON file now (which does not have multi-line strings).

Comment on lines 320 to 333
# Handle relative paths - if the path doesn't exist as provided, try within experiments directory
experiment_path = args.experiment_path
if not os.path.exists(experiment_path):
    script_dir = os.path.dirname(os.path.abspath(__file__))
    alt_path = os.path.join(script_dir, experiment_path)
    if os.path.exists(alt_path):
        experiment_path = alt_path
    else:
        # Try one more time with experiments/ prefix
        alt_path = os.path.join(
            script_dir,
            "experiments",
            os.path.basename(os.path.dirname(experiment_path)),
            os.path.basename(experiment_path),
        )
        if os.path.exists(alt_path):
            experiment_path = alt_path
githubnemo (Collaborator):

I'm not sure I can see the benefit of doing this. Let's be strict here and not attempt to guess what the user meant. Either it is the correct experiment directory or it isn't.
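
A stricter version could be as simple as the following sketch (reusing args and os from the script):

experiment_path = args.experiment_path
if not os.path.isdir(experiment_path):
    raise FileNotFoundError(f"Experiment directory does not exist: {experiment_path}")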

ved1beta (Author) commented May 24, 2025

Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.
