Recommendation systems (RS) have become a powerful tool for guiding users through the vast world of digital content. In this research, we focus on a content-based approach, in which the system identifies the attributes of a product the user liked and uses them to recommend other products with similar features. We work with book plot summaries, using NLP methods to extract meaningful features. The primary objective of this research is to develop three book recommendation algorithms and to assess their strengths and limitations.
We used the CMU Book Summary Dataset from Kaggle (https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset). It contains plot summaries for 16,559 books extracted from Wikipedia, along with their metadata.
- Cosine Similarity of Plots: This approach represents each book's plot as a TF-IDF vector and computes the cosine similarity between vectors to recommend books with similar themes or plots.
- Named Entities in Plots: This approach extracts the named entities (e.g., characters, locations, events) mentioned in the plots and uses them to establish similarities between books.
- Topic Modeling using Latent Dirichlet Allocation (LDA): This approach applies LDA to identify underlying topics in book plots, representing each book's plot as a distribution over these topics.
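A minimal sketch of the first approach in the list above, assuming scikit-learn is available. The three toy summaries and the recommend helper are hypothetical stand-ins, not the project's data or code. The vectorizer settings (English stop words removed; unigrams, bigrams, and trigrams) follow the configuration this report describes for the TF-IDF method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned plot summaries (hypothetical, not from the dataset).
plots = {
    "Space Saga": "a young prince fights for control of a desert planet and its empire",
    "Desert Heir": "the prince of a desert planet leads a rebellion against the empire",
    "Village Romance": "two neighbours in a quiet english village slowly fall in love",
}
titles = list(plots)

# Remove English stop words and build unigram, bigram, and trigram features.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
tfidf = vectorizer.fit_transform(list(plots.values()))

# Pairwise cosine similarity between all summaries (n_books x n_books matrix).
similarity = cosine_similarity(tfidf)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    ranked = similarity[idx].argsort()[::-1]  # indices by descending similarity
    return [titles[i] for i in ranked if i != idx][:top_n]


print(recommend("Space Saga", top_n=1))
```

Precomputing the full similarity matrix keeps lookups simple for a toy corpus; for all 16,559 summaries it may be preferable to compute one row at a time to limit memory use.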
While RS are relatively successful at suggesting content, their performance suffers when little information about a user's preferences is available. This happens when the user is new to the system, when the system itself is new and has no users yet, or when an item is new and has no preference history. Additionally, user privacy considerations may prevent the system from recording preferences. Content-based methods built on NLP can address this "cold start" problem to some extent and improve the user experience in these challenging scenarios. In the results section we evaluate the recommendations generated by each approach and discuss their respective strengths and limitations. We expect some methods to perform better in certain domains, e.g. NER for biographies and sequels.
To create a simple book RS that uses TF-IDF and cosine similarity: We tokenized the text, lemmatized each token, and removed non-alphabetic tokens. We then used scikit-learn's TfidfVectorizer to transform the cleaned plot summaries into TF-IDF feature vectors, with parameters set to remove English stop words and to generate unigrams, bigrams, and trigrams. Finally, we calculated the cosine similarity between each pair of TF-IDF vectors representing the plot summaries.
To create a book RS based on named entities: We preprocessed the text with spaCy, performing tokenization only. We did not lowercase or lemmatize the text, nor remove stop words, since NER relies on capitalization patterns to identify proper nouns accurately. The code defines a function extract_named_entities(text) that extracts named entities from text using spaCy's NER. Entities are filtered by label: we keep types such as PERSON, LOC, and ORG and exclude types such as times and ordinal and cardinal numbers, because they provide less value. We then computed entity similarity between books and generated recommendations from it.
To create a book RS based on LDA and cosine similarity: Preprocessing included text cleaning (removing special characters, digits, and punctuation), filtering out stop words, short words (length <= 4), and non-alphabetic words, and lemmatization. Bigram and trigram models were built with Gensim's Phrases module to identify common phrases of two or three words, such as "Animal Farm". A bag-of-words representation was then created for each document in the corpus using the doc2bow method, in which each word is represented by its ID and its frequency in the document.
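The entity-based method above does not state how entity similarity is computed. The sketch below assumes Jaccard overlap between entity sets, which is one common choice and not necessarily the measure used in this project. Here extract_named_entities takes the loaded spaCy pipeline as an extra nlp argument so that the block stays self-contained; KEEP_LABELS and recommend_by_entities are illustrative names, and the entity sets are hand-written for the example rather than spaCy output.

```python
# Entity labels to keep. PERSON, LOC, and ORG come from the text above;
# GPE and EVENT are assumed additions. TIME, ORDINAL, CARDINAL are left out.
KEEP_LABELS = {"PERSON", "LOC", "ORG", "GPE", "EVENT"}


def extract_named_entities(text, nlp, keep_labels=KEEP_LABELS):
    """Return the set of entity strings in `text` whose label is kept.

    `nlp` is a loaded spaCy pipeline (any callable returning an object
    with an `ents` sequence of items exposing `.text` and `.label_`).
    """
    return {ent.text for ent in nlp(text).ents if ent.label_ in keep_labels}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_by_entities(title, entities_by_title, top_n=3):
    """Rank the other books by entity overlap with `title`."""
    query = entities_by_title[title]
    scores = {
        other: jaccard(query, entities)
        for other, entities in entities_by_title.items()
        if other != title
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hand-written entity sets for illustration only (not spaCy output).
entities_by_title = {
    "Dune": {"Paul Atreides", "Arrakis", "House Harkonnen"},
    "Dune Messiah": {"Paul Atreides", "Arrakis", "Bene Gesserit"},
    "Casino Royale": {"James Bond", "Le Chiffre"},
}

print(recommend_by_entities("Dune", entities_by_title, top_n=1))
# → ['Dune Messiah']
```

In practice, nlp would be a loaded pipeline such as spacy.load("en_core_web_sm"), and entities_by_title would be built by calling extract_named_entities on each plot summary. Because Jaccard overlap rewards any shared string, a common first name alone can link unrelated books.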
The recommendations generated by the three book recommendation systems differed significantly, underscoring the distinct strengths and limitations inherent in each approach.
Cosine Similarity of Plots:
- Identified thematic similarities. For example, the recommendations for "War and Peace" by Tolstoy included novels with similar themes of multi-generational sagas and exploration of society.
- Considered broader themes and genres, offering recommendations beyond direct sequels. For Dune, it suggested Star Wars and other books in the science fiction and fantasy genres.
- Recognized books within the same series inconsistently, occasionally missing opportunities to recommend sequels. For Dune, it suggested no other books from the Dune series, whereas for Harry Potter it did suggest sequels.
- Produced some seemingly unrelated recommendations, such as a book about Robin Hood when the query was a book about James Bond.
Named Entities in Plots:
- Performed well on sequels (for Dune, it suggested all parts of the Dune series) and on books set in the same universe (for a James Bond book, The Moneypenny Diaries).
- Appeared to identify thematic elements related to adventure, but in fact matched on locations, which makes it relevant for travel-related literature.
- Sometimes matched characters incorrectly: it suggested books about characters named Harry that are unrelated to Harry Potter.
- Performed inconsistently across domains. For War and Peace, it suggested an unrelated science fiction book about Russian genetics experiments and a French comic book that mentions Russian satellites. Both books had shorter plot summaries, which is also a factor in the weaker recommendations.
Topic Modeling using LDA:
- Captured topics effectively. For a James Bond book, it suggested books about an intelligence agent and a private investigator; for Into the Wild, it suggested books that explore themes of societal norms, individuality, and rebellion.
- Recognized sequels well: for Harry Potter, it recommended multiple Harry Potter books.
- Provided diverse recommendations beyond genre constraints, including ones that are not immediately obvious but may still resonate with readers. For Dune, it suggested both a book from the same series by Frank Herbert and science fiction novels about futuristic empires.
- Struggled with specific domains, such as biographies, and with books that have complex narratives.
A more thorough evaluation of the RS outputs could be performed by human evaluators, who could detect errors or inconsistencies in the recommendations, such as misclassifications or irrelevant suggestions. Overall, the results of our research demonstrate the effectiveness of the developed book RS. By combining the three approaches with different weights, we could build a versatile book recommendation system that accounts for various aspects of book content and structure.
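One way the weighted combination suggested above could look is a weighted average of the per-method similarity scores. This is only a sketch: the combine_scores function, the method names, the scores, and the weights are all hypothetical, and it assumes each method's scores are already on a comparable 0 to 1 scale.

```python
def combine_scores(scores_by_method, weights):
    """Weighted average of per-method similarity scores for each candidate.

    A candidate missing from a method's scores counts as 0.0 for that method.
    """
    total_weight = sum(weights.values())
    candidates = set()
    for scores in scores_by_method.values():
        candidates.update(scores)
    return {
        title: sum(
            weights[method] * scores_by_method[method].get(title, 0.0)
            for method in weights
        ) / total_weight
        for title in candidates
    }


# Hypothetical similarity of two candidate books to one query book.
scores_by_method = {
    "tfidf": {"Book A": 0.6, "Book B": 0.2},
    "entities": {"Book A": 0.1, "Book B": 0.8},
    "lda": {"Book A": 0.5, "Book B": 0.3},
}
# Hypothetical weights; in practice they would need tuning.
weights = {"tfidf": 0.4, "entities": 0.2, "lda": 0.4}

combined = combine_scores(scores_by_method, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
# → ['Book A', 'Book B']
```

Raising the weight on the entity score would favor sequels and same-universe books, while raising the TF-IDF or LDA weights would favor thematic matches.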
- NumPy (np)
- Pandas (pd)
- SpaCy
- Regular Expressions (re)
- CSV (csv)
- TQDM (tqdm)
- Matplotlib (matplotlib.pyplot)
- Sklearn
- NumPy (np)
- Pandas (pd)
- SpaCy
- Regular Expressions (re)
- CSV (csv)
- TQDM (tqdm)
- Sklearn
- pyLDAvis
- Gensim
- Matplotlib