bench mark scripts #2525
Conversation
Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (…)
Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with similar usability to the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.
Regarding the result log, it is currently just a flat JSON. I would like to see more structure there. In MetaMathQA, we have a JSON with 3 keys: `run_info`, `train_info`, and `meta_info`. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.
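As a rough sketch (the exact fields are up for discussion, this only mirrors the MetaMathQA layout):

```python
# Indicative only -- the concrete fields should follow what MetaMathQA records.
result = {
    "run_info": {},    # e.g. timestamps, duration of the run, the experiment/PEFT config used
    "train_info": {},  # e.g. the metrics collected while running the benchmark
    "meta_info": {},   # e.g. model info, package versions, hardware -- currently missing entirely
}
```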
We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical to those of the base model. There are some PEFT methods that cannot be zero-initialized, however, which means for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
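A minimal sketch of what I mean by tracking generation time per token (the helper name is made up):

```python
import time

import torch


def time_per_generated_token(model, tokenizer, prompt, max_new_tokens=20):
    # Hypothetical helper: normalize the wall-clock generation time by the
    # number of newly generated tokens, so methods that generate longer
    # outputs are not automatically penalized.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    num_new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    return elapsed / max(num_new_tokens, 1)
```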
One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal how, but let me explain what I mean: Right now, there is a LoRA config with rank 8 and another one with rank 16. The rest is identical. If we want to add more ranks, each time, we need to create a copy. And what if we want to parametrize another setting? The number of configs would increase polynomially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?
Also, please add a `README.md` and run `make style` before pushing your changes.
method_comparison/peft_bench/data.py
Outdated
    ],
    "long": [
        """Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
        Include a detailed comparison of at least five different approaches, their theoretical foundations,
Note that the leading spaces are part of the prompt, which is undesired. Either remove them here, which looks a bit ugly, or wrap the whole text with `textwrap.dedent`.
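For reference, the `textwrap.dedent` variant would look roughly like this (prompt text shortened):

```python
import textwrap

# dedent strips the common leading whitespace, so the source can stay nicely
# indented without the spaces leaking into the prompt itself.
prompt = textwrap.dedent("""\
    Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present.
    Include a detailed comparison of at least five different approaches, their theoretical foundations,
    ...""")
```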
method_comparison/peft_bench/data.py
Outdated
    # If a prompts file is specified, load it
    if "prompts_file" in config:
        file_path = config["prompts_file"]
        if os.path.exists(file_path):
            with open(file_path, "r") as f:
                file_prompts = json.load(f)
            # Update or add to default prompts
            for category, prompt_list in file_prompts.items():
                prompts[category] = prompt_list

    # If custom prompts are specified directly in config
    if "custom_prompts" in config:
        for category, prompt_list in config["custom_prompts"].items():
            prompts[category] = prompt_list
AFAICT, this is currently not being used and I'm not quite sure what the intent for this is. Could you please either explain this with an example, or just remove it for now?
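For context, the code as written would presumably be driven by an experiment config along these lines (hypothetical example, only to make the question concrete):

```python
# Hypothetical config entry that the custom_prompts branch above would consume.
config = {
    "custom_prompts": {
        "short": ["What is parameter-efficient fine-tuning?"],
        "medium": ["Explain the difference between LoRA and full fine-tuning in a few sentences."],
    }
}
```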
method_comparison/peft_bench/data.py
Outdated
            prompts[category] = prompt_list

    # If specific categories are requested, filter to just those
    if "prompt_categories" in config:
Hmm, so right now, for each experiment we can define what type of prompts we would like? Would this not mean we cannot fully compare the results between different experiments? I think what I would prefer is that we run the benchmark for each prompt category and track the metrics separately. Then we have, for example, inference speed for short, medium, and long prompts. We might also need to adjust the max generated tokens accordingly.
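Something along these lines is what I have in mind (the function names are placeholders):

```python
# Placeholder sketch: benchmark every prompt category and keep the metrics
# separate, instead of filtering categories per experiment.
def benchmark_all_categories(model, prompts, run_benchmark):
    # `run_benchmark` stands in for whatever callable measures a single category.
    max_new_tokens_per_category = {"short": 20, "medium": 64, "long": 256}
    results = {}
    for category, prompt_list in prompts.items():
        results[category] = run_benchmark(
            model,
            prompt_list,
            max_new_tokens=max_new_tokens_per_category.get(category, 64),
        )
    return results
```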
method_comparison/peft_bench/data.py
Outdated
    return prompts


def get_prompts_by_length(prompts: Dict[str, List[str]], length: str = "all") -> List[str]:
If we make the change I proposed above of testing each prompt category, I think this function won't be necessary anymore.
    memory_allocated_log: List[float] = field(default_factory=list)
    memory_reserved_log: List[float] = field(default_factory=list)

    # Performance metrics
    inference_times: Dict[str, float] = field(default_factory=dict)
    inference_overhead: Dict[str, float] = field(default_factory=dict)
    training_throughput: float = 0.0  # tokens/second

    # Additional metrics
    metrics: List[Dict[str, Any]] = field(default_factory=list)
We can use `list[str]`, `dict[str, Any]`, etc.; no need for `List` and `Dict`.
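For example, the built-in generics work directly on Python 3.9+ (class name below is only illustrative):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BenchmarkResult:
    # Built-in generics, no typing.List / typing.Dict needed.
    memory_allocated_log: list[float] = field(default_factory=list)
    inference_times: dict[str, float] = field(default_factory=dict)
    metrics: list[dict[str, Any]] = field(default_factory=list)
```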
"""Configuration for benchmarking PEFT methods.""" | ||
# Model configuration | ||
model_id: str | ||
peft_method: Literal["lora", "adalora", "bone", "ia3", "prompt_tuning", "prefix_tuning", "none"] |
Let's just put `str` here, since this list is much bigger than this and will grow in the future. I'm not even sure if we need it at all, as we can use `peft_config.peft_type`.
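For example, something like this should work instead of maintaining the Literal (sketch, not tested against this PR):

```python
from peft import LoraConfig

# Each PEFT config already carries its method type, so the benchmark can
# derive the name from the config instead of a hard-coded Literal.
peft_config = LoraConfig(task_type="CAUSAL_LM")
method_name = peft_config.peft_type.value.lower()  # "lora"
```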
    seed: int = 42
    num_inference_runs: int = 5
    max_new_tokens: int = 20
    train_batch_size: int = 4
    train_steps: int = 10

    # Data settings
    prompt_categories: List[str] = field(default_factory=lambda: ["short", "medium"])
    num_prompt_samples: int = 2
    reserve_output_tokens: int = 50

    # Optional settings
    use_4bit: bool = False
    use_8bit: bool = False
    compile_model: bool = False
    merge_adapter: bool = False
Again, let's not set any defaults if they're defined by the config file. Also, `compile_model` and `merge_adapter` are not used, so let's remove them for now.
    merge_adapter: bool = False

    # Method-specific parameters (these would be overridden by the experiment config)
    peft_params: Dict[str, Any] = field(default_factory=dict)
Not sure why we need the special handling of PEFT params, could you please explain?
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def time_function(fn: Callable, *args, **kwargs) -> Tuple[Any, float]:
It appears this function is not being used.
Tried to cover all the changes, please have a look :)
Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.
@@ -0,0 +1,12 @@
{
Let's create a default JSON instead of having a `sample_config.json`.
I am not sure why it's not running on your end; the rest we can mostly fix. Please can you guide me with that?
Force-pushed from ea4bb46 to 627a038.
There are some import issues, so we can't run it directly.
Can you be a bit more specific on what import errors you're experiencing?
@githubnemo These are the imports in question:

from method_comparison.peft_bench.data import prepare_benchmark_prompts
from method_comparison.peft_bench.utils import ...

As a workaround I added:

import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

and run it with:

cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

Hey @BenjaminBossan, please can you have a look and give me the final batch of changes? Willing to finish this ASAP :)
Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.
Regarding the import path issues (and the experiment path resolving), I'd suggest limiting the execution of experiments to the `method_comparison/peft_bench` folder - every other folder is unsupported, making the code a bit simpler, I think.
I think there are still some comments from @BenjaminBossan that are unresolved regarding number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.
        # Default meta_info
        self.meta_info = {
            "model_id": self.model_id,
            "peft_method": self.peft_method,
            "parameters": {
                "base_params": 0,
                "trainable_params": 0,
                "total_params": 0,
                "param_ratio": 0.0,
            },
            "model_size": {
                "base_model_size_mb": 0.0,
                "adapter_size_mb": 0.0,
            },
        }

        # Default train_info
        self.train_info = {
            "training_throughput": 0.0,  # tokens/second
            "memory": {
                "peak_gpu_memory_mb": 0.0,
                "peak_ram_memory_mb": 0.0,
                "memory_logs": [],
            },
            "inference": {
                "times": {},
                "overhead": {},
            },
        }

        # Default metrics structure
        self.metrics = {
            "by_category": {},  # Will hold metrics for each prompt category
            "overall": {},  # Overall metrics across all categories
        }
Thank you for restructuring the benchmark results! I think we don't need to mimic MetaMathQA exactly but merely take its structure as an inspiration. Since we're not doing training, it would probably be best not to have a `train_info` section. We're measuring inference performance, so `generation_info` or something similar would be better suited, I think.

Maybe I'm misunderstanding something, but isn't `train_info.inference` redundant with `metrics.*.inference_time`? Would it make sense to place the metrics under `generation_info`? E.g., `generation_info.{by_category,overall}.[...]`?
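Rough sketch of the layout I'm thinking of (key names are only a suggestion):

```python
# Indicative structure only; the point is one place for all generation metrics.
result = {
    "run_info": {},
    "meta_info": {},
    "generation_info": {
        "memory": {"peak_gpu_memory_mb": 0.0, "peak_ram_memory_mb": 0.0},
        "by_category": {
            "short": {"inference_time": 0.0, "time_per_token": 0.0},
            "medium": {"inference_time": 0.0, "time_per_token": 0.0},
        },
        "overall": {"inference_time": 0.0, "time_per_token": 0.0},
    },
}
```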
    # Use benchmark_params.json if exists, otherwise use default config
    if os.path.exists(benchmark_params_path):
        benchmark_config = BenchmarkConfig.from_json(benchmark_params_path)
    elif os.path.exists(default_config_path):
        print(f"No benchmark_params.json found in {path}, using default configuration")
        benchmark_config = BenchmarkConfig.from_json(default_config_path)
    else:
        raise FileNotFoundError(f"Neither benchmark_params.json nor default_config.json found")
Let's always load the default config first, then load the specific benchmark config and merge the two so that the benchmark config only needs to specify the values that are diverging, keeping the experiment configs small and readable.
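A minimal sketch of the loading order I mean (the function name and the shallow merge are only illustrative):

```python
import json
import os


def load_benchmark_config(default_config_path, benchmark_params_path):
    # Always start from the default config ...
    with open(default_config_path) as f:
        config = json.load(f)
    # ... and let the experiment's benchmark_params.json override only the
    # values it actually defines (shallow merge; nested keys may need more care).
    if os.path.exists(benchmark_params_path):
        with open(benchmark_params_path) as f:
            config.update(json.load(f))
    return config
```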
@@ -0,0 +1,52 @@
{
I think this example is still worth keeping but can probably be thinned out quite a bit once default config + experiment config are merged upon load (see comment below).
@@ -0,0 +1,17 @@
{
    "base_model_name_or_path": "facebook/opt-350m",
Additionally, this seems to be a default config for an adapter, not an experiment. `validate_experiment_path` suggests that this should be a default config for an experiment (and I concur :)). It should also reside in `configs/`.
"peft_config_variants": [ | ||
{ | ||
"variant_name": "lora_r8", | ||
"peft_method": "lora", | ||
"r": 8, | ||
"lora_alpha": 16, | ||
"lora_dropout": 0.05, | ||
"bias": "none", | ||
"task_type": "CAUSAL_LM", | ||
"target_modules": ["q_proj", "v_proj"] | ||
}, | ||
{ | ||
"variant_name": "lora_r16", | ||
"peft_method": "lora", | ||
"r": 16, | ||
"lora_alpha": 32, | ||
"lora_dropout": 0.05, | ||
"bias": "none", | ||
"task_type": "CAUSAL_LM", | ||
"target_modules": ["q_proj", "v_proj"] | ||
}, | ||
{ | ||
"variant_name": "lora_r32", | ||
"peft_method": "lora", | ||
"r": 32, | ||
"lora_alpha": 64, | ||
"lora_dropout": 0.05, | ||
"bias": "none", | ||
"task_type": "CAUSAL_LM", | ||
"target_modules": ["q_proj", "v_proj"] | ||
} | ||
] |
I like the idea of having config variants to keep the experiment config small. Would it make sense to have a `peft_config_base` key alongside `peft_config_variants` so that the variant only needs to update the keys that change? This would remove a lot of the repetition we see currently.
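Roughly what I have in mind, sketched in Python for brevity (the key names are just a suggestion):

```python
# A shared base plus small per-variant overrides; the runner would merge them
# before building the actual PeftConfig.
peft_config_base = {
    "peft_method": "lora",
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": ["q_proj", "v_proj"],
}
peft_config_variants = [
    {"variant_name": "lora_r8", "r": 8, "lora_alpha": 16},
    {"variant_name": "lora_r16", "r": 16, "lora_alpha": 32},
    {"variant_name": "lora_r32", "r": 32, "lora_alpha": 64},
]
resolved_variants = [{**peft_config_base, **variant} for variant in peft_config_variants]
```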
method_comparison/peft_bench/data.py
Outdated
    # Apply textwrap.dedent to remove leading spaces from multiline prompts
    for category, prompt_list in prompts.items():
        prompts[category] = [textwrap.dedent(prompt) for prompt in prompt_list]
I think this is not necessary anymore since we store the prompts in a JSON file now (which does not have multi-line strings).
method_comparison/peft_bench/run.py
Outdated
    # Handle relative paths - if the path doesn't exist as provided, try within experiments directory
    experiment_path = args.experiment_path
    if not os.path.exists(experiment_path):
        script_dir = os.path.dirname(os.path.abspath(__file__))
        alt_path = os.path.join(script_dir, experiment_path)
        if os.path.exists(alt_path):
            experiment_path = alt_path
        else:
            # Try one more time with experiments/ prefix
            alt_path = os.path.join(
                script_dir,
                "experiments",
                os.path.basename(os.path.dirname(experiment_path)),
                os.path.basename(experiment_path),
            )
            if os.path.exists(alt_path):
                experiment_path = alt_path
I'm not sure I can see the benefit of doing this. Let's be strict here and not attempt to guess what the user meant. Either it is the correct experiment directory or it isn't.
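In other words, something as simple as this, mirroring the snippet above, should be enough (sketch only):

```python
    # Fail loudly instead of guessing what the user meant.
    experiment_path = args.experiment_path
    if not os.path.isdir(experiment_path):
        raise FileNotFoundError(f"Experiment directory not found: {experiment_path}")
```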
Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.
scripts PR
docs PR
This is the script format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.