Refactor BaseEvaluation with Minor Project Structure Change, Rate Limiting, and pyproject.toml Migration #4
base: dev
Conversation
Oh, I also changed the structure of the results. It is now a dict {"scores": scores, "stats": stats}, so you can access the stats via results.get('stats'). The README is updated, but here is the new flow for evaluation:

from chunking_evaluation import BaseChunker, GeneralEvaluation
from chunking_evaluation.utils import RateLimiter
from chromadb.utils import embedding_functions

# Define a custom chunking class
class CustomChunker(BaseChunker):
    def split_text(self, text):
        # Custom chunking logic: fixed-size 1200-character chunks
        return [text[i:i+1200] for i in range(0, len(text), 1200)]

# Instantiate the custom chunker and evaluation
chunker = CustomChunker()
evaluation = GeneralEvaluation()

# Choose embedding function
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="OPENAI_API_KEY",
    model_name="text-embedding-3-large"
)

# Create a RateLimiter instance
rate_limiter = RateLimiter(
    # Set your rate limits as needed (this is OpenAI's tier 1 rate limit)
    max_tokens_per_minute=1_000_000,
    max_requests_per_minute=3_000,
)

# Evaluate the chunker
results = evaluation.run(chunker, default_ef, rate_limiter)  # set use_tqdm=True to see a progress bar
print(results.get('stats'))
# {'iou_mean': 0.17715979570301696, 'iou_std': 0.10619791407460026,
#  'recall_mean': 0.8091207841640163, 'recall_std': 0.3792297991952294}
Also, there is a bit of a hack in the RateLimiter where I decrease the tokens per minute by 20%, because I could not resolve a bug with the OpenAI embedding endpoint. I outlined it here: https://community.openai.com/t/discrepancy-between-tiktoken-token-count-and-openai-embeddings-api-token-count-exceeding-tpm-limit-in-tier-2-account/959298
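For reference, here is a minimal sketch of that kind of workaround, assuming a simple fixed-window limiter; the class internals and method names below are hypothetical, and only the 20% reduction comes from the comment above:

import time
import tiktoken

class RateLimiterSketch:
    def __init__(self, max_tokens_per_minute, max_requests_per_minute):
        # Hack: only spend 80% of the advertised TPM, because tiktoken's
        # count can undershoot what the embeddings endpoint bills against TPM.
        self.effective_tpm = int(max_tokens_per_minute * 0.8)
        self.max_rpm = max_requests_per_minute
        self.tokens_used = 0
        self.requests_made = 0
        self.window_start = time.monotonic()
        self.encoding = tiktoken.get_encoding("cl100k_base")

    def wait_if_needed(self, texts):
        tokens = sum(len(self.encoding.encode(t)) for t in texts)
        elapsed = time.monotonic() - self.window_start
        if elapsed >= 60:
            # A new one-minute window has started; reset the counters.
            self.tokens_used, self.requests_made = 0, 0
            self.window_start = time.monotonic()
        elif (self.tokens_used + tokens > self.effective_tpm
              or self.requests_made + 1 > self.max_rpm):
            # Budget exhausted: sleep out the rest of the current window.
            time.sleep(60 - elapsed)
            self.tokens_used, self.requests_made = 0, 0
            self.window_start = time.monotonic()
        self.tokens_used += tokens
        self.requests_made += 1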
…gs with large lists to be embedded
Removed the hack in RateLimiter by implementing batching on top of the TPM and RPM rate limiting.
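A sketch of how batching might replace that hack, assuming a limiter with a wait_if_needed(texts) method like the one above and an embed_fn callable that embeds a list of texts (both names are illustrative, not the actual API):

import tiktoken

def embed_in_batches(texts, embed_fn, rate_limiter, max_batch_tokens=250_000):
    # Illustrative only: split a large list into token-bounded batches and let
    # the TPM/RPM limiter pace each request, so no 20% safety margin is needed.
    enc = tiktoken.get_encoding("cl100k_base")
    embeddings, batch, batch_tokens = [], [], 0
    for text in texts:
        n = len(enc.encode(text))
        if batch and batch_tokens + n > max_batch_tokens:
            rate_limiter.wait_if_needed(batch)  # block until the batch fits the limits
            embeddings.extend(embed_fn(batch))
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n
    if batch:
        rate_limiter.wait_if_needed(batch)
        embeddings.extend(embed_fn(batch))
    return embeddings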
- Refactored the BaseEvaluation class (evaluation_utils.py): broke the .run method's logic down into several internal helper functions (see the sketch after this list).
- Implemented RateLimiter: added a RateLimiter class in src/chunking_evaluation/utils.py to regulate the embedding request flow; BaseEvaluation uses it via the _add_documents_to_collection helper function.
- Introduced tqdm for progress tracking: added a tqdm progress bar option to the BaseEvaluation class for better visualization of long-running processes; pass use_tqdm=True to the run method to enable it (also shown in the sketch below).
- Migrated from setup.py to pyproject.toml: moved the package configuration to pyproject.toml, following modern Python packaging standards (a minimal example follows the sketch below).
- Google Colab notebook management.
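To make the refactor concrete, here is a minimal sketch of the resulting shape; every _helper name below is hypothetical, and only run()'s arguments, the use_tqdm option, and the {"scores": ..., "stats": ...} return shape come from this PR:

from tqdm import tqdm

class BaseEvaluationSketch:
    def run(self, chunker, embedding_function, rate_limiter, use_tqdm=False):
        questions = self._load_questions()
        iterator = tqdm(questions, desc="Evaluating") if use_tqdm else questions
        scores = [self._score_question(q, chunker, embedding_function, rate_limiter)
                  for q in iterator]
        return {"scores": scores, "stats": self._compute_stats(scores)}

    def _load_questions(self):
        return []  # stub: the real class loads its benchmark questions

    def _score_question(self, question, chunker, embedding_function, rate_limiter):
        return {}  # stub: chunk, embed (rate-limited), retrieve, and score

    def _compute_stats(self, scores):
        return {}  # stub: aggregate stats such as iou/recall means and stds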
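And a minimal pyproject.toml of the kind this migration typically produces; the field values are illustrative, not copied from the repository:

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "chunking_evaluation"  # illustrative; check the actual file
version = "0.1.0"             # illustrative
dependencies = [              # illustrative subset
    "chromadb",
    "tiktoken",
    "tqdm",
]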
Please test on your end and let me know if you have questions.