-
Notifications
You must be signed in to change notification settings - Fork 774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed issue #2144 #2180
Fixed issue #2144 #2180
Conversation
Ok, so I solved the issue with the It just needed one line of code to I hope is enough and self-explanatory. I forced pushed into the same commit to keep git history clean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.
One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here;
else:
logger.info(
f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
)
documents = self._sort_mappings_by_frequency(documents)
self._extract_topics(documents, verbose=self.verbose)
return documents
action = "Representation" if calculate_representation else "Topics" | ||
logger.info( | ||
f"{action} - Extracting topics from clusters{' using representation models' if calculate_representation else ''}." | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure I understood this correctly, are the two options as follows:
"Representation - Extracting topics from clusters using representation models"
Does this refer to the case where all representations are calculated (and not just c-TF-IDF)?
"Topics - Extracting topics from clusters"
Does this then refer to the case where only c-TF-IDF is calculate?
Also, perhaps it would be nice to create a variable for the second if/else statement the same way you did for the first.
method = ' using representation models' if calculate_representation else ''
logger.info(f"{action} - Extracting topics from clusters{method}."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, the "Topics - Extracting topics from clusters"
refers to the case where only the c-TF-IDF is calculated.
And I would suggest changing it to "Extracting base representation from bag of words"
I can make the change suggested for the if else statement, but please check my comment below.
@@ -3980,18 +3980,29 @@ def _extract_topics( | |||
embeddings: The document embeddings | |||
mappings: The mappings from topic to word | |||
verbose: Whether to log the process of extracting topics | |||
calculate_representation: Whether to extract the topic representations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be the "additional" topic representations instead since c-TF-IDF is already calculated? I'm struggling a bit with having calculated c-TF-IDF before or not. Now it seems like c-TF-IDF is not calculated at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be the "additional" representation
The c-TF-IDF is always calculated, since is not conditional to that boolean argument.
Is calculated in line 3995 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
and then used to calculate the default representation in def _extract_words_per_topic
if not self.representation_model or not calculate_representation: # Default representation: c_tf_idf + top_n_words topics = {label: values[: self.top_n_words] for label, values in topics.items()}
words, | ||
documents, | ||
calculate_representation=calculate_representation, | ||
calculate_aspects=calculate_representation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure I understand this correctly. You added this here because when you only calculate the base representation (calculate_representation=False
) you do not want the additional representations (calculate_aspects=False
), right?
If so, we might need to think about the term calculate_representation
because it should refer to fine-tuning representations instead right? Or did you name it did way because technically, it might also be the default representation when you set it as "Main"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats right.
Indeed is about fine tuning. But yea, the nomenclature here is confusing for me as well.
It could be changed to "fine_tune_representations"
I will gather my thoughts on one single comment bellow, since this touches on the other comments.
@@ -4468,7 +4488,7 @@ def _auto_reduce_topics(self, documents: pd.DataFrame, use_ctfidf: bool = False) | |||
# Update documents and topics | |||
self.topic_mapper_.add_mappings(mapped_topics, topic_model=self) | |||
documents = self._sort_mappings_by_frequency(documents) | |||
self._extract_topics(documents, mappings=mappings) | |||
self._extract_topics(documents, mappings=mappings, verbose=self.verbose) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you added verbosity here. I purposefully left this to False to prevent too much logging but was wondering why you think it was necessary to add it.
Yes, that solves the edge case. The issue I had before was that test_representation.py would raise a false negative error, but i got that fixed by editing the test. Regarding the other comments: I think there is a higher level confusion between the documented modularity and the current logging verbosity. In the code, and logger calls, the pipeline is defined as: However, in the documentation and paper, the pipeline is more detailed for the later steps as : I could suggest a modification for the logging, if you would allow it, and put it here or in other PR. However, this requires some conceptual clarifications:
Also the word "extract" is a bit ambiguous for me.
I would propose aligning the logging to the documentation pipeline. As a rough example of a fit_transform with verbosity (ignoring the Completed ✓ for brevity): As a non-CS researcher, I find it very helpful to understand and adapt the different steps of the implementation through the logging, that is also why I added the extra verbosity in Long story short, If what I reflected above make sense I can present some proposal on how to align the logger to the documented pipeline. This will probably help solve the other comments you made. Otherwise, I can make small modifications for the specific comments you made, and come back with my logging suggestions in a new issue or PR |
Ok I closed this PR because i found a much cleaner solution. I will make a new one, you will notice it has a minimal change that might not even require major review. I will open the logging discussion as a new issue, to distinguish the two issues and make my point better. |
What does this PR do?
Fixes #2144 (issue) by optimizing the topic extraction process when using
fit_transform()
withnr_topics="auto"
orint
for reducing topics. The main improvements are:calculate_representation
parameter in the_extract_topics
method to control when representations are calculated.fit_transform
method to calculate representations only after topic reduction (if needed).These changes significantly improve performance, particularly when using computationally expensive representation models like LLMs.
Additionally, this PR addresses an edge case where
self.nr_topics >= initial_nr_topics
. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.This is easily solvable by adding
self._sort_mappings_by_frequency(documents)
indef _reduce_topics()
in line 4369. But runs into problems whenpytest tests\test_representation\test_representations.py
tests for the edge caseself.nr_topics >= initial_nr_topics
Thus, some fix in how this edge cases are tested is required.
Before submitting