
Fixed Issue: #1977 #2181

Merged
5 commits merged into MaartenGr:master on Dec 9, 2024

Conversation

SSivakumar12
Contributor

@SSivakumar12 SSivakumar12 commented Oct 14, 2024

First of all, I want to take the opportunity to thank the awesome work being done by Martin and all contributors both past and present. This is a really good tool that we have used at work and has provided us with a lot of meaningful insights so I figured that I could do my bit to contribute to a tool that has helped me previously :)

This is my first PR and I sincerely hope it is not my last especially within this repo since I really want to give back to this project.

What does this PR do?

This PR addresses the issue raised in #1977, which I believe has not yet been resolved.

Fixes # (issue)
I have added an edge case so that if a tokenizer doesn't match any of the expected patterns, a ValueError is raised with a constructive error message, hopefully preventing a repeat of the error. I did not update the documentation or add any tests. That being said, I am more than happy to add tests if deemed appropriate/valuable.

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  - didn't update the documentation but added an error message
  • Did you write any new necessary tests?
  - no, but happy to do so in a separate PR since I think it falls outside the scope of this PR

@MaartenGr
Owner

Thank you for the PR! Is this exception raised whenever the documents are actually truncated or when you instantiate the related LLM? If it is the former, you would get an error after the clustering has been done which seems like wasted resources for the user. Would it perhaps be an idea to raise this error earlier?

@SSivakumar12
Contributor Author

Good point! In my case right now, the error would be raised when I instantiate the relevant LLM. I will try to find a suitable place to raise the error earlier and amend accordingly.

@SSivakumar12
Contributor Author

SSivakumar12 commented Oct 19, 2024

Looking at an example implementation of a representation class such as llamacpp, is the purpose of truncate_document to prepare the prompt so that it can be sent to the LLM?

I might be wrong, so I do apologise, but wouldn't this mean the documents are truncated in the representation model rather than during clustering? In that case, the issue couldn't surface until after clustering anyway. Therefore, would this mean the current location of the check is reasonable?
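For context, here is a minimal sketch of what truncation along these lines might look like; the mode names mirror the {'char', 'whitespace'} options discussed in this thread, the 'vectorizer' option is omitted for brevity, and the exact BERTopic implementation may differ:

```python
def truncate_document(doc_length, tokenizer, document):
    """Illustrative sketch of prompt-side truncation (not BERTopic's exact code).

    `tokenizer` selects the unit in which `doc_length` is counted.
    """
    if doc_length is None:
        # No limit requested: pass the document through untouched.
        return document
    if tokenizer == "char":
        # Truncate to a number of characters.
        return document[:doc_length]
    if tokenizer == "whitespace":
        # Truncate to a number of whitespace-separated words.
        return " ".join(document.split()[:doc_length])
    if hasattr(tokenizer, "encode") and hasattr(tokenizer, "decode"):
        # Callable tokenizer: truncate in its own token space.
        tokens = tokenizer.encode(document)
        return tokenizer.decode(tokens[:doc_length])
    raise ValueError("Unrecognized `tokenizer` option.")
```

Because this runs only when the representation model builds its prompt, any invalid `tokenizer` value would indeed surface at representation time, after clustering.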

@MaartenGr
Owner

Looking at an example implementation of a representation class such as llamacpp, is the purpose of truncate_document to prepare the prompt so that it can be sent to the LLM?

That is correct.

I might be wrong, so I do apologise, but wouldn't this mean the documents are truncated in the representation model rather than during clustering? In that case, the issue couldn't surface until after clustering anyway. Therefore, would this mean the current location of the check is reasonable?

I understand your reasoning. However, that would mean this very basic issue only gets flagged after most of the computation (and potentially hours of training) has already been done. Moreover, this is something we already know the moment a given LLM is initialized. So by moving this:

    raise ValueError(
        "Please select from one of the valid options for the `tokenizer` parameter: \n"
        "{'char', 'whitespace', 'vectorizer'} \n"
        "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
    )

to the __init__ of all LLM-based representations, we could do something like this:

    if tokenizer is None and doc_length is not None:
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
        )

That will show the error the moment you actually create the LLM, so before any computation has been done and users can adjust accordingly.

@MaartenGr
Owner

@SSivakumar12 Could you create the errors as a function that is imported from _utils.py? Right now we are duplicating the same code, which makes it hard to maintain.
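A shared helper along the requested lines might look like this (a sketch only; the function name comes from the review comments below, and the merged code in `_utils.py` may differ):

```python
def validate_truncate_document_parameters(tokenizer, doc_length):
    """Raise early if `doc_length` is set without a usable `tokenizer`.

    Sketch of the shared validator discussed in this thread, meant to be
    called from each LLM representation's __init__ so the error surfaces
    before any clustering or training work is done.
    """
    if tokenizer is None and doc_length is not None:
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
        )
```

Each representation's __init__ would then make a single call, e.g. `validate_truncate_document_parameters(self.tokenizer, self.doc_length)`, instead of repeating the raise block.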

        "{'char', 'whitespace', 'vectorizer'} \n"
        "If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
    )
    _ = validate_truncate_document_parameters(self.tokenizer, self.doc_length)
Owner


Are you returning the _ for a specific reason? Looking at validate_truncate_document_parameters, there doesn't seem to be anything that is returned.

Comment on lines +63 to +68

    if tokenizer is None and doc_length is not None:
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
        )
Owner


Should we also include a check for the opposite case, where someone does use a tokenizer but not a doc_length?
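One way to cover that opposite case in the same helper would be a non-fatal warning, since a tokenizer without a doc_length simply makes truncation a no-op. This is a sketch of the idea, not the merged code:

```python
import warnings


def validate_truncate_document_parameters(tokenizer, doc_length):
    """Sketch covering both directions of the check (not the merged code)."""
    if tokenizer is None and doc_length is not None:
        # Truncation was requested but cannot be performed: fail fast.
        raise ValueError(
            "Please select from one of the valid options for the `tokenizer` parameter: \n"
            "{'char', 'whitespace', 'vectorizer'} \n"
            "If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
        )
    elif tokenizer is not None and doc_length is None:
        # A tokenizer without a doc_length is silently ignored, so warn
        # rather than raise.
        warnings.warn(
            "Since `doc_length` is None, `tokenizer` will be ignored and "
            "documents will not be truncated."
        )
```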

Comment on lines 73 to 74

    else:
        pass
Owner


I believe this isn't necessary, right?

Contributor Author


No, it is not necessary, but I included it to be explicit about the possible edge case. That being said, happy to remove it.

Owner


Let's remove it and I'll make sure to merge it after that! Other than that, it looks great!

@MaartenGr
Copy link
Owner

@SSivakumar12 Could you check the linting? It is failing there.

@SSivakumar12
Copy link
Contributor Author

Apologies for the oversight; I have resolved the linting issues locally and the build should now pass linting.

@MaartenGr
Copy link
Owner

Awesome, everything is looking good. Thank you for your work on this!

@MaartenGr MaartenGr merged commit 50d9a49 into MaartenGr:master Dec 9, 2024
6 checks passed