This repository contains a benchmarking toolkit for evaluating Large Language Models (LLMs) on competitive programming tasks. The toolkit provides a standardized way to test your LLM's code generation capabilities across a diverse set of problems.
LiveCodeBench Pro evaluates LLMs on their ability to generate solutions for programming problems. The benchmark includes problems of varying difficulty levels from different competitive programming platforms.
- Python 3.12 or higher
- pip package manager
Install the required dependencies:

```bash
pip install -r requirements.txt
```
Create your own LLM class by extending the abstract `LLMInterface` class in `api_interface.py`. Your implementation needs to override the `call_llm` method.
Example:

```python
from api_interface import LLMInterface


class YourLLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # Initialize your LLM client or resources here

    def call_llm(self, user_prompt: str):
        # Implement your logic to call your LLM with user_prompt.
        # Return a tuple containing (response_text, metadata).
        # Example:
        response = your_llm_client.generate(user_prompt)
        return response.text, response.metadata
```
You can use the `ExampleLLM` class as a reference, which shows how to integrate with OpenAI's API.
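For orientation, a minimal OpenAI-backed implementation might look roughly like the sketch below. This is an illustration rather than the actual `ExampleLLM` code: the class name, the model name, and the shape of the returned metadata are assumptions.

```python
from openai import OpenAI

from api_interface import LLMInterface


class MyOpenAILLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # Reads the API key from the OPENAI_API_KEY environment variable.
        self.client = OpenAI()

    def call_llm(self, user_prompt: str):
        # Ask the model to solve the competitive programming problem.
        response = self.client.chat.completions.create(
            model="gpt-4o",  # assumed model name; adjust to whatever you are testing
            messages=[{"role": "user", "content": user_prompt}],
        )
        text = response.choices[0].message.content
        # The benchmark does not prescribe a metadata format; token usage is one reasonable choice.
        metadata = {"model": response.model, "usage": response.usage.model_dump()}
        return text, metadata
```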
Edit the `benchmark.py` file to use your LLM implementation:
```python
from your_module import YourLLM

# Replace the existing llm_instance assignment with your LLM class:
llm_instance = YourLLM()
```
Execute the benchmark script:

```bash
python benchmark.py
```
The script will:
- Load the LiveCodeBench-Pro dataset from Hugging Face
- Process each problem with your LLM
- Save the results to `benchmark_result.json` (a rough sketch of this flow is shown below)
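Conceptually, the script's flow is similar to the following sketch. It is illustrative only: `benchmark.py` already implements this for you, and the split name, dataset field name, and output record layout below are assumptions.

```python
import json

from datasets import load_dataset

from your_module import YourLLM

# Split and field names below are assumptions; check the dataset schema for the real ones.
dataset = load_dataset("anonymous1926/anonymous_dataset", split="test")
llm_instance = YourLLM()

results = []
for problem in dataset:
    prompt = problem["problem_statement"]  # hypothetical field name
    response_text, metadata = llm_instance.call_llm(prompt)
    results.append({"problem": prompt, "response": response_text, "metadata": metadata})

with open("benchmark_result.json", "w") as f:
    json.dump(results, f, indent=2)
```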
Send your `benchmark_result.json` file to [email protected] for evaluation.
Please include the following information in your submission:
- LLM name and version
- Any relevant configuration details (for example, prompting setup or decoding parameters)
- Contact information for receiving your results
`api_interface.py` defines the abstract interface for LLM integration (a rough sketch of the interface follows the list):

- `LLMInterface`: Abstract base class with methods for LLM interaction
- `ExampleLLM`: Example implementation using OpenAI's GPT-4o
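Based on the description above, the interface has roughly the following shape. This is a sketch, not the verbatim source; refer to `api_interface.py` for the exact definition.

```python
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Abstract base class the benchmark uses to talk to an LLM."""

    @abstractmethod
    def call_llm(self, user_prompt: str):
        """Return a (response_text, metadata) tuple for the given prompt."""
        raise NotImplementedError
```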
`benchmark.py` is the main benchmarking script that:

- Loads the dataset
- Processes each problem through your LLM
- Collects and saves results
The benchmark uses the `anonymous1926/anonymous_dataset` dataset from Hugging Face, which contains competitive programming problems with varying difficulty levels.
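If you want to inspect the data before running the full benchmark, you can load it directly with the `datasets` library; the exact splits and column names are not documented here, so check the printed schema.

```python
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub on first use.
dataset = load_dataset("anonymous1926/anonymous_dataset")

# Shows the available splits, column names, and row counts.
print(dataset)
```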
For questions or support, please contact us at [email protected].