
Benchmark scripts #2525


Open · wants to merge 9 commits into main
Conversation

ved1beta commented Apr 30, 2025

scripts PR
docs PR

These are the scripts in the format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.

BenjaminBossan (Member) commented:

Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (benchmark_result.json)? We will run the experiments on our hardware to get reproducible results.

BenjaminBossan (Member) left a comment:

Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with similar usability to the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.

Regarding the result log, it is currently just a flat json. I would like to see more structure there. In MetaMathQA, we have a json with 3 keys: run_info, train_info, and meta_info. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.

We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical to those of the base model. There are some PEFT methods that cannot be zero-initialized, however, which means for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
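
For illustration, a minimal sketch of per-token timing, assuming a transformers causal LM and tokenizer (the function name is hypothetical, not part of the PR):

import time

import torch


def generation_time_per_token(model, tokenizer, prompt, max_new_tokens=20):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    # Normalize by the number of newly generated tokens so that methods which
    # produce longer outputs are not automatically penalized.
    num_new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return elapsed / max(num_new_tokens, 1)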

One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal how, but let me explain what I mean: Right now, there is a LoRA config with rank 8 and another one with rank 16. The rest is identical. If we want to add more ranks, each time, we need to create a copy. And what if we want to parametrize another setting? The number of configs would increase polynomially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?

Also, please add a README.md and run make style before pushing your changes.

    ],
    "long": [
        """Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
        Include a detailed comparison of at least five different approaches, their theoretical foundations,
BenjaminBossan (Member):

Note that the leading spaces are part of the prompt, which is undesired. Either remove them here, which looks a bit ugly, or wrap the whole text with textwrap.dedent.
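
For example, something along these lines (a sketch, not the exact prompt used in the PR):

import textwrap

prompt = textwrap.dedent("""\
    Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
    Include a detailed comparison of at least five different approaches.
""")
# textwrap.dedent strips the common leading whitespace, so the indentation used
# for code readability does not end up in the prompt itself.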

Comment on lines 97 to 110
# If a prompts file is specified, load it
if "prompts_file" in config:
    file_path = config["prompts_file"]
    if os.path.exists(file_path):
        with open(file_path, "r") as f:
            file_prompts = json.load(f)
        # Update or add to default prompts
        for category, prompt_list in file_prompts.items():
            prompts[category] = prompt_list

# If custom prompts are specified directly in config
if "custom_prompts" in config:
    for category, prompt_list in config["custom_prompts"].items():
        prompts[category] = prompt_list
BenjaminBossan (Member):

AFAICT, this is currently not being used and I'm not quite sure what the intent for this is. Could you please either explain this with an example, or just remove it for now?

prompts[category] = prompt_list

# If specific categories are requested, filter to just those
if "prompt_categories" in config:
BenjaminBossan (Member):

Hmm, so right now, for each experiment we can define what type of prompts we would like? Would this not mean we cannot fully compare the results between different experiments? I think what I would prefer is that we run the benchmark for each prompt category and track the metrics separately. Then we have, for example, inference speed for short, medium, long ... prompts. We might also need to adjust the max generated tokens accordingly.
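
For illustration, the per-category loop could look roughly like this (run_inference_benchmark and max_tokens_per_category are hypothetical names, not part of the PR):

results = {}
for category, prompt_list in prompts.items():  # e.g. "short", "medium", "long"
    results[category] = run_inference_benchmark(
        model,
        prompt_list,
        max_new_tokens=max_tokens_per_category[category],  # adjust generation length per category
    )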

return prompts


def get_prompts_by_length(prompts: Dict[str, List[str]], length: str = "all") -> List[str]:
BenjaminBossan (Member):

If we make the change I proposed above of testing each prompt category, I think this function won't be necessary anymore.

Comment on lines 87 to 96
memory_allocated_log: List[float] = field(default_factory=list)
memory_reserved_log: List[float] = field(default_factory=list)

# Performance metrics
inference_times: Dict[str, float] = field(default_factory=dict)
inference_overhead: Dict[str, float] = field(default_factory=dict)
training_throughput: float = 0.0 # tokens/second

# Additional metrics
metrics: List[Dict[str, Any]] = field(default_factory=list)
BenjaminBossan (Member):

We can use list[str], dict[str, Any], etc.; there is no need for List and Dict.
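
For example, the fields quoted above would become (the builtin generics work on Python 3.9+):

memory_allocated_log: list[float] = field(default_factory=list)
inference_times: dict[str, float] = field(default_factory=dict)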

"""Configuration for benchmarking PEFT methods."""
# Model configuration
model_id: str
peft_method: Literal["lora", "adalora", "bone", "ia3", "prompt_tuning", "prefix_tuning", "none"]
BenjaminBossan (Member):

Let's just put str here, since this list is much bigger than this and will grow in the future. I'm not even sure if we need it at all, as we can use peft_config.peft_type
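
For illustration, assuming the current PEFT API, the method name is already carried by the config object:

from peft import LoraConfig

config = LoraConfig(task_type="CAUSAL_LM")
print(config.peft_type)  # PeftType.LORA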

Comment on lines 137 to 152
seed: int = 42
num_inference_runs: int = 5
max_new_tokens: int = 20
train_batch_size: int = 4
train_steps: int = 10

# Data settings
prompt_categories: List[str] = field(default_factory=lambda: ["short", "medium"])
num_prompt_samples: int = 2
reserve_output_tokens: int = 50

# Optional settings
use_4bit: bool = False
use_8bit: bool = False
compile_model: bool = False
merge_adapter: bool = False
BenjaminBossan (Member):

Again, let's not set any defaults if they're defined by the config file. Also, compile_model and merge_adapter are not used, so let's remove them for now.

merge_adapter: bool = False

# Method-specific parameters (these would be overridden by the experiment config)
peft_params: Dict[str, Any] = field(default_factory=dict)
BenjaminBossan (Member):

Not sure why we need the special handling of PEFT params, could you please explain?

return sum(p.numel() for p in model.parameters() if p.requires_grad)


def time_function(fn: Callable, *args, **kwargs) -> Tuple[Any, float]:
BenjaminBossan (Member):

It appears this function is not being used.

ved1beta (Author) commented May 5, 2025

Tried to cover all the changes, please have a look :)

BenjaminBossan (Member) left a comment:

Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.

@@ -0,0 +1,12 @@
{
BenjaminBossan (Member):

Let's create a default json instead of having a sample_config.json.

ved1beta (Author) commented May 5, 2025

I am not sure why it's not running on your side; the rest we can mostly fix. Please can you guide me with that?

ved1beta force-pushed the benchmark2scripts branch from ea4bb46 to 627a038 on May 5, 2025 17:15
ved1beta (Author) commented May 5, 2025

There are some import issues, so we can't run it directly:

cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

We need to specify the path and all to make it work. I am working on it; let me know if you have any idea how to fix it.

githubnemo (Collaborator) commented:

Can you be a bit more specific on what import errors you're experiencing?

ved1beta (Author) commented May 13, 2025

@githubnemo The import errors occur because the script imports its sibling modules directly (from data import ... and from utils import ...), and those imports only resolve when the script's directory is on the Python path. We can either:

1. Use absolute imports:

   from method_comparison.peft_bench.data import prepare_benchmark_prompts
   from method_comparison.peft_bench.utils import ...

2. Or add the script's directory to the Python path:

   import os
   import sys

   sys.path.append(os.path.dirname(os.path.abspath(__file__)))

For now it has to be run like this:

cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

Hey @BenjaminBossan, please can you have a look and give me the final batch of changes? Willing to finish this ASAP :)

githubnemo (Collaborator) left a comment:

Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.

Regarding the import path issues (and the experiment path resolving) I'd suggest to limit the execution of experiments to the method_comparison/peft_bench folder - every other folder is unsupported, making the code a bit simpler, I think.

I think there are still some comments from @BenjaminBossan that are unresolved regarding number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.

Comment on lines 87 to 121
# Default meta_info
self.meta_info = {
    "model_id": self.model_id,
    "peft_method": self.peft_method,
    "parameters": {
        "base_params": 0,
        "trainable_params": 0,
        "total_params": 0,
        "param_ratio": 0.0,
    },
    "model_size": {
        "base_model_size_mb": 0.0,
        "adapter_size_mb": 0.0,
    },
}

# Default train_info
self.train_info = {
    "training_throughput": 0.0,  # tokens/second
    "memory": {
        "peak_gpu_memory_mb": 0.0,
        "peak_ram_memory_mb": 0.0,
        "memory_logs": [],
    },
    "inference": {
        "times": {},
        "overhead": {},
    },
}

# Default metrics structure
self.metrics = {
    "by_category": {},  # Will hold metrics for each prompt category
    "overall": {},  # Overall metrics across all categories
}
githubnemo (Collaborator):

Thank you for restructuring the benchmark results! I think we don't need to mimic MetaMathQA exactly; we can merely take its structure as an inspiration. Since we're not doing training, it would probably be best to not have a train_info section. We're measuring inference performance, so generation_info or something similar would be better suited, I think.

Maybe I'm misunderstanding something but isn't train_info.inference redundant with metrics.*.inference_time? Would it make sense to place the metrics under generation_info? E.g., generation_info.{by_category,overall}.[...]?
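
One possible shape, purely as a sketch of the idea (key names are not final):

result = {
    "run_info": {},   # run metadata, as in MetaMathQA
    "meta_info": {},  # model / adapter metadata
    "generation_info": {
        "by_category": {
            "short": {"inference_time": 0.0, "time_per_token": 0.0},
            "medium": {"inference_time": 0.0, "time_per_token": 0.0},
        },
        "overall": {"inference_time": 0.0, "time_per_token": 0.0},
    },
}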

Comment on lines 365 to 372
# Use benchmark_params.json if exists, otherwise use default config
if os.path.exists(benchmark_params_path):
    benchmark_config = BenchmarkConfig.from_json(benchmark_params_path)
elif os.path.exists(default_config_path):
    print(f"No benchmark_params.json found in {path}, using default configuration")
    benchmark_config = BenchmarkConfig.from_json(default_config_path)
else:
    raise FileNotFoundError(f"Neither benchmark_params.json nor default_config.json found")
githubnemo (Collaborator):

Let's always load the default config first, then load the specific benchmark config and merge the two so that the benchmark config only needs to specify the values that are diverging, keeping the experiment configs small and readable.
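
A rough sketch of that merge, reusing the variable names from the snippet above (and assuming BenchmarkConfig can be constructed from keyword arguments):

import json
import os

with open(default_config_path) as f:
    config_dict = json.load(f)

# Experiment-specific values override the defaults
if os.path.exists(benchmark_params_path):
    with open(benchmark_params_path) as f:
        config_dict.update(json.load(f))

benchmark_config = BenchmarkConfig(**config_dict)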

@@ -0,0 +1,52 @@
{
githubnemo (Collaborator):

I think this example is still worth keeping but can probably be thinned out quite a bit once default config + experiment config are merged upon load (see comment below).

@@ -0,0 +1,17 @@
{
"base_model_name_or_path": "facebook/opt-350m",
githubnemo (Collaborator):

Additionally this seems to be a default config for an adapter, not an experiment. validate_experiment_path suggests that this should be a default config for an experiment (and I concur :))

It should also reside in configs/.

Comment on lines +20 to +51
"peft_config_variants": [
{
"variant_name": "lora_r8",
"peft_method": "lora",
"r": 8,
"lora_alpha": 16,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
},
{
"variant_name": "lora_r16",
"peft_method": "lora",
"r": 16,
"lora_alpha": 32,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
},
{
"variant_name": "lora_r32",
"peft_method": "lora",
"r": 32,
"lora_alpha": 64,
"lora_dropout": 0.05,
"bias": "none",
"task_type": "CAUSAL_LM",
"target_modules": ["q_proj", "v_proj"]
}
]
githubnemo (Collaborator):

I like the idea of having config variants to keep the experiment config small. Would it make sense to have a peft_config_base key alongside peft_config_variants so that the variant only needs to update the keys that change? This would remove a lot of the repetition we see currently.
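
A small sketch of what that could look like, written here as Python dicts (the key names are only a suggestion):

peft_config_base = {
    "peft_method": "lora",
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": ["q_proj", "v_proj"],
}
peft_config_variants = [
    {"variant_name": "lora_r8", "r": 8, "lora_alpha": 16},
    {"variant_name": "lora_r16", "r": 16, "lora_alpha": 32},
    {"variant_name": "lora_r32", "r": 32, "lora_alpha": 64},
]
# The effective config for each variant is the base updated with its overrides
effective_configs = [{**peft_config_base, **variant} for variant in peft_config_variants]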

Comment on lines 48 to 50
# Apply textwrap.dedent to remove leading spaces from multiline prompts
for category, prompt_list in prompts.items():
    prompts[category] = [textwrap.dedent(prompt) for prompt in prompt_list]
githubnemo (Collaborator):

I think this is not necessary anymore since we store the prompts in a JSON file now (which does not have multi-line strings).

Comment on lines 320 to 333
# Handle relative paths - if the path doesn't exist as provided, try within experiments directory
experiment_path = args.experiment_path
if not os.path.exists(experiment_path):
    script_dir = os.path.dirname(os.path.abspath(__file__))
    alt_path = os.path.join(script_dir, experiment_path)
    if os.path.exists(alt_path):
        experiment_path = alt_path
    else:
        # Try one more time with experiments/ prefix
        alt_path = os.path.join(
            script_dir,
            "experiments",
            os.path.basename(os.path.dirname(experiment_path)),
            os.path.basename(experiment_path),
        )
        if os.path.exists(alt_path):
            experiment_path = alt_path
githubnemo (Collaborator):

I'm not sure I can see the benefit of doing this. Let's be strict here and not attempt to guess what the user meant. Either it is the correct experiment directory or it isn't.
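
A stricter version could be as simple as the following sketch (reusing args and os from the script):

experiment_path = args.experiment_path
if not os.path.isdir(experiment_path):
    raise FileNotFoundError(f"Experiment directory does not exist: {experiment_path}")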

ved1beta (Author) commented May 24, 2025

Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.
