180-text_mining.Rmd

# Text Mining

Pretty much everything in this chapter comes from the "Text Mining in R" book

Book: https://www.tidytextmining.com/

Stemming:

- https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html

Datacamp:

- https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words
- https://www.datacamp.com/courses/sentiment-analysis-in-r-the-tidy-way
- https://www.datacamp.com/courses/sentiment-analysis-in-r

Podcast:

- https://www.datacamp.com/community/podcast/text-mining-nlproc

Misc:

- https://www.datacamp.com/community/blog/text-mining-in-r-and-python-tips
- Drew Conway's "Beter Wordcloud": http://drewconway.com/zia/2013/3/26/building-a-better-word-cloud

(Time: 1 hour)

## One-token-per-row (unnest_tokens)

- String: Text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
- Corpus: These types of objects typically contain raw strings annotated with additional metadata and details.
- Document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term. The value in the matrix is typically word count or tf-idf (see Chapter 3).

```{r}
library(tibble)
```


```{r}
text <- c("I'm your basic average girl",
          "And I'm here to save the world",
          "You can't stop me",
          "Cause I'm Kim Pos-si-ble")
```

```{r}
text
```

```{r}
text_df <- data_frame(line = 1:4, text = text)
text_df
```

```{r}
library(tidytext)
library(magrittr)

text_df %>%
  unnest_tokens(word, text)
```

## Jane Austen

```{r}
library(janeaustenr)
library(dplyr)
library(stringr)
```

```{r}
austen_books()
```


```{r}
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

original_books
```

```{r}
library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books

```


## Stop words and Frequencies

```{r}
tidy_books %>%
  count(word, sort = TRUE) 
```


```{r}
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)

```

```{r}
tidy_books %>%
  count(word, sort = TRUE) 

```

```{r}
library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

```

## gutenbergr package

gutenbergr package provides access to the public domain works from the Project Gutenberg collection

## Sentiment Analytis

```{r}
library(tidytext)

sentiments

```

3 main sentiment dictionaries (based on single words)

- AFINN from Finn Årup Nielsen
    - assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment
- bing from Bing Liu and collaborators
    - binary fashion into positive and negative categories.
- nrc from Saif Mohammad and Peter Turney.
    - binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness,surprise, and trust.
    - 

```{r}
get_sentiments("afinn")
```

```{r}
get_sentiments("bing")
```

```{r}
get_sentiments("nrc")
```

```{r}
library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

```

```{r}
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

```

## tf-idf: Term frequency, inverse document frequency

what a document is about by looking at the words.

tf-idf: measure how important a word is to a document in a collection (or corpus) of documents

```{r}
library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

book_words

```

Term frequency (standarized)

```{r}
library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")

```

Zipf’s law states that the frequency that a word appears is inversely proportional to its rank.


find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents

## bind_tf_idf

https://en.wikipedia.org/wiki/Tf%E2%80%93idf:

- tf: The weight of a term that occurs in a document is simply proportional to the term frequency
- idf: The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs

```{r}
book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

```

Get high tf-idf scores

```{r}
book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
```

```{r}
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(book) %>% 
  top_n(15) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()

```


## Relationships (n-grams)

```{r}
library(dplyr)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

austen_bigrams

```

```{r}
austen_bigrams %>%
  count(bigram, sort = TRUE)
```

Remove stop words and get new counts

```{r}
library(tidyr)

bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

```

go back to bi-grams

```{r}
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united

```

Analyzing bi-grams

```{r}
bigrams_filtered %>%
  filter(word2 == "street") %>%
  count(book, word1, sort = TRUE)
```

```{r}
bigram_tf_idf <- bigrams_united %>%
  count(book, bigram) %>%
  bind_tf_idf(bigram, book, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf

```

```{r}
library(ggplot2)

bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(book) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(bigram, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free") +
  coord_flip() +
  labs(y = "tf-idf of bigram to novel",
       x = "")

```

## Stemming

https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html

```{r}
library(corpus)
```

```{r}
text <- "love loving lovingly loved lover lovely love"
text_tokens(text, stemmer = "en") # english stemmer
```

Using `tm`

```{r}
library(tm)
```

```{r}
complicate <- c("complicated", "complication", "complicatedly")
```

```{r}
stemDocument(complicate)
```