Fixed issue #2144 #2191
No problem, if a simpler fix is possible I would definitely like to go for that.
I thought this wasn't possible since we need either the c-TF-IDF representations of the topics or the topic embeddings to reduce the topics. Neither the c-TF-IDF nor the topic embeddings are outcomes of the clustering procedure, so how can then
I still have to wrap my head around this one. By adding this, aren't we essentially ignoring an issue here that is introduced by the PR? As you mentioned in the previous PR:
Although this is an edge case, I actually think this might happen more often than you think. Users do not know the number of clustered topics beforehand and might overshoot. Or am I missing something here?
bertopic/_bertopic.py
else:
    # Extract topics by calculating c-TF-IDF
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)

    # Reduce topics
else:
    # Reduce topics if needed, extract topics by calculating c-TF-IDF, and get representations.
    if self.nr_topics:
        documents = self._reduce_topics(documents)
    else:
        self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
I'm missing something basic here... How is it possible that we can run _reduce_topics
while no topic embeddings/ctfidf has been created already? Or has that been done somewhere and I'm completely glossing over it?
Long story short:
Long story long:
However, this was addressed here by adding Regarding the other point. This also touches on my recently opened issue regarding logging and, more broadly, conceptualization of the pipeline. In particular, our confusion here arises from the question of when topics are "extracted". The way I see it, the topics are formed, as a group of documents/embeddings, at the end of the clustering procedure. Here is where the column The It was unnecessary to rely on While _reduce_topics() could use the c_tf-idf, this is only the case when the argument
That's exactly the thing. In order to use I forked your branch and ran the following which gave me an error referencing exactly this problem:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
model = BERTopic(nr_topics=2, verbose=True).fit(docs)
Which gave me the following error log:
2024-10-22 13:49:12,464 - BERTopic - Embedding - Completed ✓
2024-10-22 13:49:12,464 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 13:49:16,580 - BERTopic - Dimensionality - Completed ✓
2024-10-22 13:49:16,581 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 13:49:17,012 - BERTopic - Cluster - Completed ✓
2024-10-22 13:49:17,013 - BERTopic - Topic reduction - Reducing number of topics
2024-10-22 13:49:17,014 - BERTopic - WARNING: No topic embeddings were found despite they are supposed to be used (`use_ctfidf` is False). Defaulting to c-TF-IDF representation.
After which I get the following error:
TypeError Traceback (most recent call last)
Cell In[20], line 5
2 from sklearn.datasets import fetch_20newsgroups
4 docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
----> 5 model = BERTopic(nr_topics=2, verbose=True).fit(docs)
File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
322 def fit(
323 self,
324 documents: List[str],
(...)
327 y: Union[List[int], np.ndarray] = None,
328 ):
329 """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
330
331 Arguments:
(...)
362 ```
363 """
--> 364 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
365 return self
File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:494, in BERTopic.fit_transform(self, documents, embeddings, images, y)
491 else:
492 # Reduce topics if needed, extract topics by calculating c-TF-IDF, and get representations.
493 if self.nr_topics:
--> 494 documents = self._reduce_topics(documents)
495 else:
496 self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:4351, in BERTopic._reduce_topics(self, documents, use_ctfidf)
4349 if isinstance(self.nr_topics, int):
4350 if self.nr_topics < initial_nr_topics:
-> 4351 documents = self._reduce_to_n_topics(documents, use_ctfidf)
4352 else:
4353 logger.info(
4354 f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(documents['Topic'].unique())})."
4355 )
File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:4383, in BERTopic._reduce_to_n_topics(self, documents, use_ctfidf)
4380 topics = documents.Topic.tolist().copy()
4382 # Create topic distance matrix
-> 4383 topic_embeddings = select_topic_representation(
4384 self.c_tf_idf_, self.topic_embeddings_, use_ctfidf, output_ndarray=True
4385 )[0][self._outliers :]
4386 distance_matrix = 1 - cosine_similarity(topic_embeddings)
4387 np.fill_diagonal(distance_matrix, 0)
TypeError: 'NoneType' object is not subscriptable
As shown, because no topic embeddings or c-TF-IDF representations were generated, the
But the strangest thing to me is that the tests seem to pass without any issues, even though they really shouldn't.
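To make the failing step in the traceback concrete, here is a minimal sketch of the distance-matrix computation that topic reduction depends on (`1 - cosine_similarity(...)` with a zeroed diagonal). It is written in plain Python rather than with the library's helpers, and `cosine_distance_matrix` and the toy vectors are hypothetical names used only for illustration. It assumes every vector is non-zero. The point is simply that the step cannot run unless some topic representation already exists.

```python
import math
from typing import List, Optional, Sequence


def cosine_distance_matrix(
    vectors: Optional[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Pairwise cosine distances between topic vectors, with a zero diagonal."""
    if vectors is None:
        # Stand-in for the failure above: nothing was computed before reduction.
        raise ValueError("No topic representations available; extract topics first.")

    # Assumes non-zero vectors, otherwise the division below would fail.
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # keep the diagonal at 0, like np.fill_diagonal(..., 0)
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            matrix[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return matrix


# Hypothetical toy vectors for three topics.
toy_topic_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
distances = cosine_distance_matrix(toy_topic_vectors)
```

Calling `cosine_distance_matrix(None)` raises immediately with a readable message, which is roughly the behaviour one would want instead of the `TypeError` shown in the traceback.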
True! I had my scripts checking only for the edge case and not for normal topic reduction. It is strange indeed that it doesn't show in the tests. I might check on that in the near-to-mid future. The solution might then revert to the previous version (the one with the included I can amend the commit taking into account the comments of the previous PR to improve clarity
Ok, I made a new commit. I didn't amend the previous commit to keep some reference (Note: I ended up force-pushing a logging detail, my bad), and I thought that maybe you can do that when merging (still getting my head around shared repos). If I am wrong, let me know and I can squash everything into one commit. This returns to the previous solution, but now the Importantly, the edge case is still fully handled in the code and the test. When it comes to logging, a normal fit looks the same as before
But when a desired number of topics is given, the modified log will appear as
The 'for topic reduction' is added because this log only appears when
Thanks for updating this and taking the time to merge these two commits. This logging is definitely much nicer now and should be more intuitive to users! I think this solution is arguably one of the cleanest here. One last question though, could you add a test that showcases that it now works? Also, I'm seeing the tests fail for both the linting as well as the code.
What kind of test are you referring to? Indeed the tests are failing for linting, I forgot to run the When I get clarity on the kind of test you want I can see about it and force push everything into a single commit.
The one that checks whether topic vectors/c-tf-idf are calculated so that we can actually reduce the number of topics correctly. It might be that the testing pipeline doesn't actually reduce the number of topics, so we might need to up the number of documents in order to create enough clusters to reduce.
Sorry for the long text. By reviewing the test scripts, I can't find where the tests are checking whether the model is correctly fitted or not. I will describe my understanding, but I am not familiar with the pytest framework, so please point it out if I missed something. In the
And then these models are used to perform tests that are related to most of the model class methods. To give an example illustrative of the current issue, the following is what tests the
However, it seems that the This helps us understand what you noticed before:
As the test is configured, topic reduction is tested only when is called after the model is fitted (thus, there is an existing ctfidf), and the error that you point out occurs only while fitting (with a defined number of topics, or auto reduce). I am sure that this particular error doesn't happen anymore, as I've been testing mostly by fitting my local dataset considering the different cases and edge cases (also verified by my other issue to improve the logger). However, there are no scripted tests to systematically verify that. A direct solution would involve creating a new test for the I hope this was clear enough.
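The gap described above can be illustrated with a small toy model. This is a sketch under stated assumptions and not BERTopic code: `ToyTopicModel`, its `extract_before_reduce` flag, and the placeholder "reduction" are all hypothetical. It only mimics the ordering of the steps, to show why a test that reduces topics after fitting can pass while reduction requested at fit time fails.

```python
from typing import List, Optional


class ToyTopicModel:
    """Hypothetical stand-in for the pipeline ordering; not the real model."""

    def __init__(self, nr_topics: Optional[int] = None, extract_before_reduce: bool = True):
        self.nr_topics = nr_topics
        self.extract_before_reduce = extract_before_reduce
        self.topic_vectors: Optional[List[List[float]]] = None
        self.labels: List[int] = []

    def _extract_topics(self) -> None:
        # One placeholder vector per cluster; a real model would compute c-TF-IDF here.
        self.topic_vectors = [[float(t)] for t in sorted(set(self.labels))]

    def reduce_topics(self, nr_topics: int) -> None:
        if self.topic_vectors is None:
            raise TypeError("topic representations are missing")
        # Placeholder reduction: fold labels into the requested number of topics.
        self.labels = [label % nr_topics for label in self.labels]
        self._extract_topics()

    def fit(self, cluster_labels: List[int]) -> "ToyTopicModel":
        self.labels = list(cluster_labels)
        if self.nr_topics:
            if self.extract_before_reduce:
                self._extract_topics()
            self.reduce_topics(self.nr_topics)
        else:
            self._extract_topics()
        return self


labels = [0, 1, 2, 3, 0, 1]

# Path the existing tests take: fit first, reduce afterwards. Works either way.
post_fit = ToyTopicModel(extract_before_reduce=False).fit(labels)
post_fit.reduce_topics(2)

# Path the tests miss: reduction requested at fit time without representations.
try:
    ToyTopicModel(nr_topics=2, extract_before_reduce=False).fit(labels)
    fit_time_failed = False
except TypeError:
    fit_time_failed = True
```

A scripted test covering the second path would only need to fit with a reduction target set and check the resulting number of topics, as in `ToyTopicModel(nr_topics=2).fit(labels)`.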
Ah, this makes perfect sense. Thanks for taking the time to research this. It indeed seems I reduced the number of topics after fitting the model as a way to reduce wall time. A nasty side-effect is that
It definitely was, thanks! I agree, this is a bit out of scope for now and since you tested this thoroughly I think this should be all for now. I will check it with my code, but if everything else passes, I think it's time to merge this.
It works for me! Thank you for working on this, I'll go ahead and merge this 😄
What does this PR do?
I'm sorry for making a new PR, but hopefully this is much more clear/clean
Fixes #2144 (issue) by optimizing the topic extraction process when using fit_transform() with nr_topics="auto" or int for reducing topics. The main improvements are:
On my previous PR I added a calculate representations argument. This is not necessary here, as the initial number of topics is now retrieved as initial_nr_topics = len(documents["Topic"].unique()), which is an outcome of the clustering procedure. So this doesn't require a previous call on extract_topics(), in any form.
Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics by adding self._sort_mappings_by_frequency(documents) in def _reduce_topics() in line 4369. This was not accounted for by the previous tests (they raised a false negative), so test_representations.py was modified to account for this.
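As a rough illustration of the two points above, the sketch below counts the initial topics directly from the cluster labels and handles the edge case by only re-sorting labels by frequency. It uses a plain list instead of the `documents` DataFrame, and `needs_reduction` and `sort_labels_by_frequency` are hypothetical helpers, not the library's methods. It covers only an integer target (not "auto"), and it assumes that sorting by frequency means the largest cluster becomes topic 0 while -1 stays reserved for outliers.

```python
from collections import Counter
from typing import Dict, List, Optional


def needs_reduction(topic_labels: List[int], nr_topics: Optional[int]) -> bool:
    """Decide from the cluster labels alone whether reduction should run."""
    # Analogous to len(documents["Topic"].unique()); no topic extraction needed.
    initial_nr_topics = len(set(topic_labels))
    return nr_topics is not None and nr_topics < initial_nr_topics


def sort_labels_by_frequency(topic_labels: List[int]) -> List[int]:
    """Relabel topics so that 0 is the largest cluster; -1 is kept as is (assumption)."""
    counts = Counter(label for label in topic_labels if label != -1)
    mapping: Dict[int, int] = {-1: -1}
    for new_label, (old_label, _) in enumerate(counts.most_common()):
        mapping[old_label] = new_label
    return [mapping[label] for label in topic_labels]


# Hypothetical clustering output: three topics plus one outlier.
clustered = [2, 2, 2, 0, 1, 1, -1]

if needs_reduction(clustered, nr_topics=10):
    final_labels = clustered  # a real pipeline would merge topics here
else:
    # Edge case: the requested number is equal to or higher than what clustering found.
    final_labels = sort_labels_by_frequency(clustered)
```

Here the requested 10 topics exceed the 4 unique labels, so nothing is merged and the labels are only reordered by cluster size.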
Before submitting
- Modified test_representations to account for the edge case mentioned above