Text Mining with R by Julia Silge and David Robinson provides a wonderful overview of ways to explore and analyse texts using their tidytext package. It allows students to immediately engage with historical and literary texts through the approaches of the tidyverse. However, the book understandably assumes that students already have a solid foundation in the R programming language, which they need in order to fully understand the code it uses. Although students in my own module MO5161 should be making progress in their introductory R and tidyverse training through their DataCamp assignments, the notes below are designed to help explain some of the book’s material.
In addition to the book itself, other important sources include the package vignette (vignette("tidytext")), the CRAN entry for tidytext, where you can download the reference manual, and R for Data Science by Hadley Wickham and Garrett Grolemund.
The unnest_tokens() function is a useful command that forms part of the pre-processing workflow. The book’s first example takes a few lines of poetry by Emily Dickinson, puts them into a tibble (a special kind of data frame class often used in tidy tools), and then tokenizes them.
The c() command combines a collection of items into a vector.
my_fruit_basket <- c("apple","orange","pear")
The dplyr package is a collection of tools, or rather “a grammar of data manipulation.” Read more with vignette("dplyr") or at its homepage. All of its functions assume that you will offer them a tibble (or data frame) as their first argument, but you can also pass data to them with a “pipe” in the form of %>%.
. Here is an example. If we start with this:
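To set this example up we need two vectors (the size values here are invented for illustration, chosen so that only the orange and the pear are larger than 3):

library(dplyr)
my_fruit_basket <- c("apple","orange","pear")
my_fruit_sizes <- c(2,4,6)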
Here we have two vectors, one holding the names of the fruits, the other their sizes. Here is where the pipe comes in handy. The following two pieces of code do the same thing, the first without the pipe:
basket_tibble <- tibble(fruit=my_fruit_basket,size=my_fruit_sizes)
big_fruit <- filter(basket_tibble, size>3)
big_fruit
## # A tibble: 2 x 2
## fruit size
## <chr> <dbl>
## 1 orange 4
## 2 pear 6
And the second with:
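The piped version sends the tibble into filter() as its invisible first argument:

big_fruit <- basket_tibble %>% filter(size>3)
big_fruit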
## # A tibble: 2 x 2
## fruit size
## <chr> <dbl>
## 1 orange 4
## 2 pear 6
The dplyr package’s filter() command can be used to filter data by certain conditions. Here, if we had run filter(fruit=="apple") it would have returned only the apple and its size (notice that it uses == instead of =, since the latter is used to assign variables, not to check for equivalency). You may also filter for things which don’t match something by using != instead of == (all fruits except the apple: filter(fruit!="apple")). The other major verbs used by dplyr are select() (to select only certain columns), mutate() (to add new columns/variables), arrange() (to sort rows), and summarise() (to collapse groups of rows into summary values).
Older editions of the printed book use the now-deprecated data_frame() command. The tibble() command builds a data frame from a collection of input variables that also has the special tibble class tbl_df. It offers some advantages over a plain data frame and is the preferred replacement for it in the tidyverse. Above, we fed tibble() two input vectors: the character vector my_fruit_basket and the numeric vector my_fruit_sizes.
In the Silge and Robinson book, the first use of unnest_tokens() transforms the text_df tibble, which contains two columns (“line” and “text”), and carries out tokenization on the text variable. It runs the command unnest_tokens(word, text), which produces a new tibble with two columns (“line” and “word”). The basic structure of this command, assuming you are sending it a tibble via pipe, is:
unnest_tokens(name_of_output_column_for_tokenized_words, name_of_column_in_input_tibble_to_tokenize)
The output table in the book also contains the “line” column, showing which line each word came from. The tokenizer merely copied the “line” column over from the first tibble, preserving its name and repeating its contents for every word that came from that line. It may be a bit clearer if we use some nonsense data with completely different column names:
some_numbers <- c("one","two","three","four")
some_sentences <- c("Here is the first sentence","Here is the second","And a third sentence","Finally, a fourth sentence.")
all_together <- tibble(number_list=some_numbers,sentence_list=some_sentences)
all_together
## # A tibble: 4 x 2
## number_list sentence_list
## <chr> <chr>
## 1 one Here is the first sentence
## 2 two Here is the second
## 3 three And a third sentence
## 4 four Finally, a fourth sentence.
Here you can see that I have just created a tibble similar in structure to the Dickinson poem, but instead of “line” and “text” we have used different column titles.
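Now we tokenize it. Judging from the column names in the output below, the tokenizing call is:

library(tidytext)
all_together %>% unnest_tokens(just_the_words, sentence_list)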
## # A tibble: 17 x 2
## number_list just_the_words
## <chr> <chr>
## 1 one here
## 2 one is
## 3 one the
## 4 one first
## 5 one sentence
## 6 two here
## 7 two is
## 8 two the
## 9 two second
## 10 three and
## 11 three a
## 12 three third
## 13 three sentence
## 14 four finally
## 15 four a
## 16 four fourth
## 17 four sentence
From this we can see that the number_list column was simply copied over and preserved for each row containing a word from the matching line. Obviously, numeric line numbers are more useful than these written-out numbers.
The subsequent sections of the chapter work with texts that are loaded into variables thanks to the janeaustenr and gutenbergr packages. In most cases, however, you will want to bring your own historical sources: texts you have obtained from historical databases and other web resources, OCRed and cleaned beforehand, or which have been transcribed. If you have PDFs that are not image scans but contain text, you can use command-line utilities such as pdftotext, which comes with xpdf, to extract it (macOS users can install xpdf, which includes pdftotext, with Homebrew).
Once you have a text you wish to work with in R, you have an abundance of ways to import it. Note that all of the following can import a file directly from your hard drive (remember to set your working directory from the Session menu in RStudio or with setwd()).
I suggest you use the readr package’s read_lines() command as your main way to get files into R. Compare the following alternatives that you may see used in other people’s code:
The scan() command, in the base collection of R utilities, produces a character vector with each line in an element. If you are importing text, assign "character" to the "what" parameter and set the "sep" parameter to the separator character used by your text (such as the newline "\n"). The examples below work with a sample text file, lear.txt.
lear <- scan("lear.txt",what="character",sep="\n")
It also works directly on texts imported from a web URL:
household_management <- scan("http://www.gutenberg.org/cache/epub/10136/pg10136.txt",what="character",sep="\n")
The read.table() command, also among the basic utilities of R, is better used for reading in a regular table of information that is delimited by some character, such as a comma (csv) or a tab (tsv). You can, however, set the delimiter to a newline character and read in a file. The output will be a data frame, rather than a character vector.
lear <- read.table("lear.txt",sep="\n")
household_management <- read.table("http://www.gutenberg.org/cache/epub/10136/pg10136.txt",sep="\n")
The readLines() command is also in the base collection of R utilities. However, readLines() may be slower than either the readr command read_lines() or the brio version of read_lines().
lear <- readLines("lear.txt")
hm <- readLines("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The read_file() command is in the tidyverse’s readr package. It will read an entire file into a single character vector of length one. It does not divide up the lines.
lear <- read_file("lear.txt")
household_management <- read_file("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The read_lines() command is in the tidyverse’s readr package. It is a great default command to use for importing text into a character vector. It appears to work faster than the readLines() command.
lear <- read_lines("lear.txt")
household_management <- read_lines("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The brio package is an input/output package that always reads and writes UTF-8 Unicode files. It has its own read_lines() command, which appears to run faster than the readr version. This might be useful if you are reading lots of very large files. Note, however, that it will not download files remotely through a URL.
lear <- brio::read_lines("lear.txt")
This section makes use of the package janeaustenr. The austen_books() command returns a tibble with two columns, text and book. The first contains each line of the six completed, published novels of Jane Austen, and the second gives the title of the book each line comes from. The goal of the first few commands is to add a column of line numbers.
To do this, it uses the group_by() command to “group by” book, repeating the addition of line numbers for each book and starting again from one each time. It also uses a regular expression, together with str_detect() and cumsum(), to add chapter numbers.
Let us explain what is happening step by step. First, what does group_by() do here? Examine the example below, using data from the gapminder package:
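Loading the package and printing the raw dataset shows its structure:

library(gapminder)
gapminder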
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
gapminder_withmedian <- gapminder %>%
group_by(year) %>%
mutate(medianlife=median(lifeExp),
over_under=(lifeExp-median(lifeExp))) %>%
ungroup()
This dataset has a lifeExp column. When we group_by(year) we are combining all the different observations of the various countries into groups: one for all the observations from 1952, one for all the observations from 1957, and so on. Then we pass this on to mutate(), which adds two new columns. The median(lifeExp) grabs the median of all the life expectancies in each grouped year, and over_under subtracts that year’s median from the life expectancy of each row in turn, telling us how much a country’s life expectancy is over or under the global median for that year. When we are done performing actions on a particular grouped set of data, we ungroup(). We see the resulting data for Cuba below:
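A simple filter() on the country column pulls those rows out:

gapminder_withmedian %>% filter(country=="Cuba")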
## # A tibble: 12 x 8
## country continent year lifeExp pop gdpPercap medianlife over_under
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Cuba Americas 1952 59.4 6007797 5587. 45.1 14.3
## 2 Cuba Americas 1957 62.3 6640752 6092. 48.4 14.0
## 3 Cuba Americas 1962 65.2 7254373 5181. 50.9 14.4
## 4 Cuba Americas 1967 68.3 8139332 5690. 53.8 14.5
## 5 Cuba Americas 1972 70.7 8831348 5305. 56.5 14.2
## 6 Cuba Americas 1977 72.6 9537988 6380. 59.7 13.0
## 7 Cuba Americas 1982 73.7 9789224 7317. 62.4 11.3
## 8 Cuba Americas 1987 74.2 10239839 7533. 65.8 8.34
## 9 Cuba Americas 1992 74.4 10723260 5593. 67.7 6.71
## 10 Cuba Americas 1997 76.2 10983007 5432. 69.4 6.76
## 11 Cuba Americas 2002 77.2 11226999 6341. 70.8 6.33
## 12 Cuba Americas 2007 78.3 11416987 8948. 71.9 6.34
Read more in section 5.1.3 of R for Data Science here.
As we saw above, the mutate() function or “verb” can add new variables or new columns to a tibble, usually by carrying out some transformation on existing data. In the example above, we added two columns: one with the median life expectancy across all countries in a given year, and one which shows the gap between each country and that median. Read more in section 5.5 of R for Data Science here.
In the Jane Austen example, two new columns are added. One of them uses row_number()
to give an incrementing line number to each line, in each group containing all the lines of a book. The second is more complicated:
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
Let us examine this line step by step, from the inside out. We want to find the location of the beginning of each chapter. This will be a line which reads something like “Chapter 14” or “Chapter IV”. So a regular expression which, in a case-insensitive way, looks for a line beginning with the word “chapter” (as a regular expression, ^chapter) followed by a space and any of the following: any digit, or the letters i, v, x, l, or c ([\\divxlc]) will find the lines we are looking for. We don’t have to capture anything else: as long as we have found this, we have seen enough to conclude that this is the beginning of a chapter. This explains the regex("^chapter [\\divxlc]", ignore_case = TRUE).
Now for the str_detect(). This stringr function will search a character vector for a given pattern (in this case a regular expression, rather than a simple string) and will return TRUE wherever it finds it, and FALSE otherwise. For example, let us imagine I have a fruit basket with an apple, a pear, and a banana. If I want to search the basket to see if I have a banana, here is what that would look like with str_detect():
library(stringr)
fruit_basket <- c("apple", "pear", "banana")
find_banana <- fruit_basket %>% str_detect("banana")
find_banana
## [1] FALSE FALSE TRUE
As you can see, it returned FALSE twice, because neither the apple nor the pear is a banana. Then, finding the banana, it returned TRUE. This logical vector can now be used to “select” just a few items from a list. In this case, we could make a new basket with just the banana we found:
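In base R, that selection can be done by subsetting the basket with the logical vector:

fruit_basket[find_banana]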
## [1] "banana"
But what about the case of our book? str_detect() has returned a TRUE for every line in every Jane Austen book which is the beginning of a chapter. For example, the tenth line of original_books is “CHAPTER 1”, so the tenth element of the vector returned by str_detect() will be TRUE. Here is where cumsum() comes in. If I have a vector x which contains some numbers, cumsum() will take the number found at each position (any TRUE gets interpreted or “coerced” to a 1, and FALSE gets coerced to a 0) and add to it the cumulative sum of the items that come before it. So if x <- c(1,2,3), then cumsum(x) will result in 1, 3, and 6. The sum of 1 is just 1, the sum of 1+2 is 3, and the sum of 1+2+3 is 6. With logical vectors containing FALSE and TRUE, here is an example:
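Judging from the walkthrough that follows, the vector in question is:

cumsum(c(TRUE,FALSE,TRUE,FALSE,TRUE))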
## [1] 1 1 2 2 3
The first TRUE is coerced to 1, then 1+0=1, then 1+0+1=2, then 1+0+1+0=2, then 1+0+1+0+1=3. This is why we end up with 1 1 2 2 3. And this is how the chapter numbers were obtained: the chapter count increments each time str_detect() finds (TRUE) another chapter heading.
After using unnest_tokens() to break the text down into words, the next piece of code loads a dataset called stop_words from the tidytext package. This is a simple dataset with two columns: “word”, holding a list of English-language stop words, and “lexicon”, indicating the source of each word (onix, SMART, or snowball).
The anti_join() command is like a cookie cutter: we “cut out” the words in the stop words list from the tidy_books variable. (See also the discussion of some of the other join commands further below.) Let us look at another example:
seven_sins <- tibble(characteristics=c("gluttony","lust","greed","sloth","wrath","envy","pride"))
my_personality <- tibble(characteristics=c("gluttony","modesty","lust","restraint","greed","industriousness","sloth","tolerance","wrath","generosity","envy","strength","pride"))
my_personality
## # A tibble: 13 x 1
## characteristics
## <chr>
## 1 gluttony
## 2 modesty
## 3 lust
## 4 restraint
## 5 greed
## 6 industriousness
## 7 sloth
## 8 tolerance
## 9 wrath
## 10 generosity
## 11 envy
## 12 strength
## 13 pride
Notice that my_personality has a characteristics column, the same as the characteristics column in the seven_sins tibble. We now have a list of sins, and a collection of personality traits with some sins among them. Let us now remove all the sins from our personality and examine the result:
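The join that produces the result below is:

my_personality %>% anti_join(seven_sins)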
## # A tibble: 6 x 1
## characteristics
## <chr>
## 1 modesty
## 2 restraint
## 3 industriousness
## 4 tolerance
## 5 generosity
## 6 strength
Since the two tibbles had a column with the same name, we didn’t have to tell anti_join() which columns matched each other. If our sins had been listed in a column named name instead, we would get an error when running anti_join(). To fix that, we can tell anti_join() which column matches which with: anti_join(seven_sins, by = c("characteristics" = "name"))
Now that the stop words have been removed, the remaining words are counted and sorted with the dplyr verb count(). This counts things in an input tibble and produces a new tibble, optionally sorted, with the frequencies in an “n” column. Here is another very simple example:
drawer_contents <- tibble(objects=c("pen","pen","pen","pen","pen","pen","pencil","pencil","eraser","highlighter","highlighter","highlighter","usb stick","usb stick"))
drawer_contents %>%
count(objects, sort = TRUE)
## # A tibble: 5 x 2
## objects n
## <chr> <int>
## 1 pen 6
## 2 highlighter 3
## 3 pencil 2
## 4 usb stick 2
## 5 eraser 1
Finally, ggplot2 is used to plot the resulting frequencies. ggplot2 is not something easily explained in passing, and I would suggest you read the (online) chapter 3 of R for Data Science for more details.
Much of this section builds on techniques from the earlier sections, but uses the gutenbergr package to download and work with multiple books at once. There is one somewhat complicated passage which uses bind_rows() to merge together several tibbles, uses select() with a negative to exclude a single column, and then uses spread() and gather() from the tidyr package to reshape the table.
The bind_rows() function from dplyr takes two or more tibbles or data frames and merges them together by stacking them on top of each other. Here is a simple example:
some_countries <- tibble(name=c("Hungary","Lebanon","Panama"),language=c("Hungarian","Arabic","Spanish"))
more_countries <- tibble(name=c("Cambodia","Morocco","Argentina"),language=c("Khmer","Arabic","Spanish"))
all_countries <- bind_rows(some_countries,more_countries)
all_countries %>% arrange(language)
## # A tibble: 6 x 2
## name language
## <chr> <chr>
## 1 Lebanon Arabic
## 2 Morocco Arabic
## 3 Hungary Hungarian
## 4 Cambodia Khmer
## 5 Panama Spanish
## 6 Argentina Spanish
Here I used the dplyr verb arrange() to sort the final results by language. In the text, you will notice that instead of just passing bind_rows() the tidy_bronte and tidy_hgwells etc. tibbles directly, the book first used mutate() to add a column of author information to each.
Another column that is added holds the proportion of a text’s total words that a given word accounts for, with mutate(proportion = n / sum(n)). If the most frequent word is found 1,200 times in a text of 1,200,000 words, then the proportional frequency of that word is 0.1% (1,200/1,200,000). The sum(n) derives the total number of words in the text by adding all the frequencies together.
The select(-n) uses the dplyr verb select(), which grabs data from only certain columns. However, if you put a “-” before the name of a column, you are saying, “Give me all the columns except this one.” For example, using the all_countries tibble above, I can grab a list of just the names of the countries with either of the following:
## # A tibble: 6 x 1
## name
## <chr>
## 1 Hungary
## 2 Lebanon
## 3 Panama
## 4 Cambodia
## 5 Morocco
## 6 Argentina
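And the negative form excludes the column we don’t want, producing the same result:

all_countries %>% select(-language)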
## # A tibble: 6 x 1
## name
## <chr>
## 1 Hungary
## 2 Lebanon
## 3 Panama
## 4 Cambodia
## 5 Morocco
## 6 Argentina
The spread() and gather() functions have, in recent versions of the tidyr package, been renamed pivot_wider() and pivot_longer(). The best way to understand what is happening here is to take a closer look at the tidyr vignette on pivoting.
One technique for making data more “tidy” (each column a variable, each row an observation) is to “spread”, or pivot_wider(), a dataset. Imagine a case where we have some data on things collected on walks in the forest. The following data is not tidy because we have the thing collected in one field and the amount collected in another:
foraging <- tibble(day=c("Monday Trip","Monday Trip","Tuesday Trip","Wednesday Trip","Wednesday Trip","Friday Trip"),item=c("raspberry","mushroom","raspberry","blackberry","mushroom","strawberry"),amount=c(14,3,27,35,6,18))
foraging
## # A tibble: 6 x 3
## day item amount
## <chr> <chr> <dbl>
## 1 Monday Trip raspberry 14
## 2 Monday Trip mushroom 3
## 3 Tuesday Trip raspberry 27
## 4 Wednesday Trip blackberry 35
## 5 Wednesday Trip mushroom 6
## 6 Friday Trip strawberry 18
If we think of our observations as composed of the combined harvest of single trips to the forest, then the number of mushrooms collected is one variable, the number of strawberries another, and so on. These should therefore be columns, not values in separate rows. We can resolve this problem with pivot_wider() (the old spread()):
library(tidyr)
tidy_foraging <- foraging %>% pivot_wider(names_from=item,values_from=amount,values_fill=0)
The names_from argument determines where the names of the new columns will be pulled from. Here the item column will be replaced by four columns corresponding to the four different kinds of things we collected. The values_from argument indicates where the values to be associated with those columns are to be found. Finally, values_fill handles cases where there is no available value, for example when one of the things was not collected on a given trip. The resulting data will look like this:
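The result is printed simply by naming the new tibble:

tidy_foraging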
## # A tibble: 4 x 5
## day raspberry mushroom blackberry strawberry
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Monday Trip 14 3 0 0
## 2 Tuesday Trip 27 0 0 0
## 3 Wednesday Trip 0 6 35 0
## 4 Friday Trip 0 0 0 18
What about pivot_longer()? Imagine an untidy dataset which has a list of candidates and the votes they won in various years (this is very similar to the example used here):
votes <- tibble(name=c("Mickey","Goofy","Minnie","Donald"),"2004"=c(1200,1600,1400,1700),"2008"=c(1400,900,1600,2000),"2011"=c(1000,1000,2500,900))
votes
## # A tibble: 4 x 4
## name `2004` `2008` `2011`
## <chr> <dbl> <dbl> <dbl>
## 1 Mickey 1200 1400 1000
## 2 Goofy 1600 900 1000
## 3 Minnie 1400 1600 2500
## 4 Donald 1700 2000 900
This is not a tidy dataset because our rows are not individual observations: the columns contain multiple years, and the individual cells are the observations. To convert this into a tidy format in which each row constitutes a single observation, we can use pivot_longer():
tidy_votes <- votes %>% pivot_longer(c("2004","2008","2011"),names_to="years", values_to="votes")
tidy_votes
## # A tibble: 12 x 3
## name years votes
## <chr> <chr> <dbl>
## 1 Mickey 2004 1200
## 2 Mickey 2008 1400
## 3 Mickey 2011 1000
## 4 Goofy 2004 1600
## 5 Goofy 2008 900
## 6 Goofy 2011 1000
## 7 Minnie 2004 1400
## 8 Minnie 2008 1600
## 9 Minnie 2011 2500
## 10 Donald 2004 1700
## 11 Donald 2008 2000
## 12 Donald 2011 900
Now each line of the dataset is a single observation: how many votes someone got in a particular year.
The plot that follows uses ggplot2 together with the scales package, neither of which can be covered in detail here. The final cor.test() is a simple correlation test between paired samples using Pearson’s product-moment correlation. You can read more in the documentation for this command.
The introductory section discusses the nature and origins of the lexicons of words which lie at the heart of the exercises in the chapter. In addition to the tidytext
package, you may need to install the textdata
package (install.packages("textdata")
). Running the get_sentiments()
command will sometimes prompt you to accept a license for use of the lexicon, or commit to citing the data before downloading it. It is good to note that there are many more lexicons than those discussed in the book, and as the authors point out, it is important to learn more about their composition, limitations, and target sources.
The get_sentiments() command merely loads a sentiment lexicon into a tibble. It is best to examine the resulting tibble, as the column names vary. For example, the “bing” lexicon uses “sentiment” as a column name (with negative and positive as its values), while “afinn” uses “value” as a column name, with negative and positive integer values.
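For example, you can load and compare two of the lexicons like this (the afinn lexicon may first prompt you to download its data via textdata):

library(tidytext)
get_sentiments("bing")   # columns: word, sentiment ("negative"/"positive")
get_sentiments("afinn")  # columns: word, value (integers from -5 to 5)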
An inner join is one of a series of ways that two datasets can be combined to produce a variety of results. Some of these are “mutating joins”, which combine information together; others are “filtering joins”. Above we encountered the filtering join anti_join(), which acts like a cookie cutter, cutting out the elements that two tibbles have in common. We used the example of a list of the seven sins being used to remove them from a list of personality characteristics.
An inner join combines two datasets, but retains only the rows that are found in both of them. Let us imagine I have a list of the fruits in my basket and a separate price list from the market that covers some, but not all, of them.
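Here is a minimal sketch in the spirit of the earlier fruit baskets (the fruits and prices are invented for illustration):

library(dplyr)
fruit_basket <- tibble(fruit=c("apple","pear","banana"))
market_prices <- tibble(fruit=c("apple","banana","cherry"),price=c(0.50,0.30,2.00))
# keeps only the apple and banana rows: the pear has no price, and the
# cherry is not in the basket
fruit_basket %>% inner_join(market_prices, by="fruit")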
The GitHub Repository for this handout and its files.
Konrad M. Lawson. @kmlawson
Creative Commons - Attribution CC BY, 2020.