Text Mining with R by Julia Silge and David Robinson provides a wonderful overview of ways to explore and analyse texts using their tidytext package. It allows students to immediately engage with historical and literary texts through the approaches of the tidyverse. However, the book understandably assumes that students already have a solid foundation in the R programming language, which they need in order to fully understand the code it uses. Although students in my own module MO5161 should be making progress in their introductory R and tidyverse training through their DataCamp assignments, the notes below are designed to help explain some of the book’s material.
In addition to the book itself, other important sources include the package vignette (vignette("tidytext")), the CRAN entry for tidytext, where you can download the reference manual, and R for Data Science by Hadley Wickham and Garrett Grolemund.
The unnest_tokens() function is a useful command that forms part of the pre-processing workflow. The book’s first example takes a few lines of poetry by Emily Dickinson, puts them into a tibble (a special kind of data frame class often used in tidy tools), and then tokenizes them.
The c() command combines a collection of items into a vector.
my_fruit_basket <- c("apple","orange","pear")
The dplyr package is a collection of tools, or rather “a grammar of data manipulation.” Read more with vignette("dplyr") or at its homepage. All of its functions assume that you will offer them a tibble (or data frame) as their first argument, but you can also pass data to them with a “pipe” in the form of %>%.
. Here is an example. If we start with this:
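To set this example up we need two vectors (the size values here are invented for illustration, chosen so that only the orange and the pear are larger than 3):

library(dplyr)
my_fruit_basket <- c("apple","orange","pear")
my_fruit_sizes <- c(2,4,6)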
Here we have two vectors, one holding the names of the fruits, the other their sizes. Here is where the pipe comes in handy. The following two pieces of code do the same thing, the first without the pipe:
basket_tibble <- tibble(fruit=my_fruit_basket,size=my_fruit_sizes)
big_fruit <- filter(basket_tibble, size>3)
big_fruit
## # A tibble: 2 x 2
## fruit size
## <chr> <dbl>
## 1 orange 4
## 2 pear 6
And the second with:
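The piped version sends the tibble into filter() as its invisible first argument:

big_fruit <- basket_tibble %>% filter(size>3)
big_fruit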
## # A tibble: 2 x 2
## fruit size
## <chr> <dbl>
## 1 orange 4
## 2 pear 6
The dplyr package’s filter() command can be used to filter data by certain conditions. Here, if we had run filter(fruit=="apple") it would have returned only the apple and its size (notice that it uses == instead of =, since the latter is used to assign variables, not to check for equivalency). You may also filter for things which don’t match something by using != instead of == (all fruits except the apple: filter(fruit!="apple")). The other major verbs used by dplyr are select() (to select only certain columns), mutate() (to add new columns/variables), arrange() (to sort rows), and summarise() (to collapse groups of rows into summary values).
Older editions of the printed book use the now-deprecated data_frame() command. The tibble() command builds a data frame from a collection of input variables that also has the special tibble class tbl_df. It offers some advantages over a plain data frame and is the preferred replacement for it in the tidyverse. Above, we fed tibble() two input vectors: the character vector my_fruit_basket and the numeric vector my_fruit_sizes.
In the Silge and Robinson book, the first use of unnest_tokens() transforms the text_df tibble, which contains two columns (“line” and “text”), and carries out tokenization on the text variable. It runs the command unnest_tokens(word, text), which produces a new tibble with two columns (“line” and “word”). The basic structure of this command, assuming you are sending it a tibble via pipe, is:
unnest_tokens(name_of_output_column_for_tokenized_words, name_of_column_in_input_tibble_to_tokenize)
The output table in the book also contains the “line” column, showing which line each word came from. The tokenizer merely copied the “line” column over from the first tibble, preserving its name and repeating its contents for every word that came from that line. It may be a bit clearer if we use some nonsense data with completely different column names:
some_numbers <- c("one","two","three","four")
some_sentences <- c("Here is the first sentence","Here is the second","And a third sentence","Finally, a fourth sentence.")
all_together <- tibble(number_list=some_numbers,sentence_list=some_sentences)
all_together
## # A tibble: 4 x 2
## number_list sentence_list
## <chr> <chr>
## 1 one Here is the first sentence
## 2 two Here is the second
## 3 three And a third sentence
## 4 four Finally, a fourth sentence.
Here you can see that I have just created a tibble similar in structure to the Dickinson poem, but instead of “line” and “text” we have used different column titles.
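Now we tokenize it. Judging from the column names in the output below, the tokenizing call is:

library(tidytext)
all_together %>% unnest_tokens(just_the_words, sentence_list)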
## # A tibble: 17 x 2
## number_list just_the_words
## <chr> <chr>
## 1 one here
## 2 one is
## 3 one the
## 4 one first
## 5 one sentence
## 6 two here
## 7 two is
## 8 two the
## 9 two second
## 10 three and
## 11 three a
## 12 three third
## 13 three sentence
## 14 four finally
## 15 four a
## 16 four fourth
## 17 four sentence
From this we can see that the number_list column was simply copied over and preserved for each row containing a word from the matching line. Obviously, numeric line numbers are more useful than these written-out numbers.
The subsequent sections of the chapter work with texts that are loaded into variables thanks to the janeaustenr and gutenbergr packages. In most cases, however, you will want to bring your own historical sources: texts you have obtained from historical databases and other web resources, OCRed and cleaned beforehand, or which have been transcribed. If you have PDFs that are not image scans but contain text, you can use command-line utilities such as pdftotext, which comes with xpdf, to extract it (macOS users can install xpdf, which includes pdftotext, with Homebrew).
Once you have a text you wish to work with in R, you have an abundance of ways to import it. Note that all of the following can import a file directly from your hard drive (remember to set your working directory from the Session menu in RStudio or with setwd()).
I suggest you use the readr package’s read_lines() command as your main way to get files into R. Compare the following alternatives that you may see used in other people’s code:
The scan() command, in the base collection of R utilities, produces a character vector with each line in an element. If you are importing text, assign "character" to the "what" parameter and set the "sep" parameter to the separator character used by your text (such as the newline "\n"). The examples below work with a sample text file, lear.txt.
lear <- scan("lear.txt",what="character",sep="\n")
It also works directly on texts imported from a web URL:
household_management <- scan("http://www.gutenberg.org/cache/epub/10136/pg10136.txt",what="character",sep="\n")
The read.table() command, also among the basic utilities of R, is better used for reading in a regular table of information that is delimited by some character, such as a comma (csv) or a tab (tsv). You can, however, set the delimiter to a newline character and read in a file. The output will be a data frame, rather than a character vector.
lear <- read.table("lear.txt",sep="\n")
household_management <- read.table("http://www.gutenberg.org/cache/epub/10136/pg10136.txt",sep="\n")
The readLines() command is also in the base collection of R utilities. However, readLines() may be slower than either the readr command read_lines() or the brio version of read_lines().
lear <- readLines("lear.txt")
hm <- readLines("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The read_file() command is in the tidyverse’s readr package. It will read an entire file into a single character vector of length one. It does not divide up the lines.
lear <- read_file("lear.txt")
household_management <- read_file("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The read_lines() command is in the tidyverse’s readr package. It is a great default command to use for importing text into a character vector. It appears to work faster than the readLines() command.
lear <- read_lines("lear.txt")
household_management <- read_lines("http://www.gutenberg.org/cache/epub/10136/pg10136.txt")
The brio package is an input/output package that always reads and writes UTF-8 Unicode files. It has its own read_lines() command, which appears to run faster than the readr version. This might be useful if you are reading lots of very large files. Note, however, that it will not download files remotely through a URL.
lear <- brio::read_lines("lear.txt")
This section makes use of the package janeaustenr. The austen_books() command returns a tibble with two columns, text and book. The first contains each line of the six completed, published novels of Jane Austen, and the second gives the title of the book each line comes from. The goal of the first few commands is to add a column of line numbers.
To do this, it uses the group_by() command to “group by” book, repeating the addition of line numbers for each book and starting again from one each time. It also uses a regular expression, together with str_detect() and cumsum(), to add chapter numbers.
Let us explain what is happening step by step. First, what does group_by() do here? Examine the example below, using data from the gapminder package:
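Loading the package and printing the raw dataset shows its structure:

library(gapminder)
gapminder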
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
gapminder_withmedian <- gapminder %>%
group_by(year) %>%
mutate(medianlife=median(lifeExp),
over_under=(lifeExp-median(lifeExp))) %>%
ungroup()
This dataset has a lifeExp column. When we group_by(year) we are combining all the different observations of the various countries into groups: one for all the observations from 1952, one for all the observations from 1957, and so on. Then we pass this on to mutate(), which adds two new columns. The median(lifeExp) grabs the median of all the life expectancies in each grouped year, and over_under subtracts that year’s median from the life expectancy of each row in turn, telling us how much a country’s life expectancy is over or under the global median for that year. When we are done performing actions on a particular grouped set of data, we ungroup(). We see the resulting data for Cuba below:
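A simple filter() on the country column pulls those rows out:

gapminder_withmedian %>% filter(country=="Cuba")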
## # A tibble: 12 x 8
## country continent year lifeExp pop gdpPercap medianlife over_under
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Cuba Americas 1952 59.4 6007797 5587. 45.1 14.3
## 2 Cuba Americas 1957 62.3 6640752 6092. 48.4 14.0
## 3 Cuba Americas 1962 65.2 7254373 5181. 50.9 14.4
## 4 Cuba Americas 1967 68.3 8139332 5690. 53.8 14.5
## 5 Cuba Americas 1972 70.7 8831348 5305. 56.5 14.2
## 6 Cuba Americas 1977 72.6 9537988 6380. 59.7 13.0
## 7 Cuba Americas 1982 73.7 9789224 7317. 62.4 11.3
## 8 Cuba Americas 1987 74.2 10239839 7533. 65.8 8.34
## 9 Cuba Americas 1992 74.4 10723260 5593. 67.7 6.71
## 10 Cuba Americas 1997 76.2 10983007 5432. 69.4 6.76
## 11 Cuba Americas 2002 77.2 11226999 6341. 70.8 6.33
## 12 Cuba Americas 2007 78.3 11416987 8948. 71.9 6.34
Read more in section 5.1.3 of R for Data Science here.
As we saw above, the mutate() function or “verb” can add new variables or new columns to a tibble, usually by carrying out some transformation on existing data. In the example above, we added two columns: one with the median life expectancy across all countries in a given year, and one which shows the gap between each country and that median. Read more in section 5.5 of R for Data Science here.
In the Jane Austen example, two new columns are added. One of them uses row_number()
to give an incrementing line number to each line, in each group containing all the lines of a book. The second is more complicated:
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
Let us examine this line step by step, from the inside out. We want to find the location of the beginning of each chapter. This will be a line which reads something like “Chapter 14” or “Chapter IV”. So a regular expression which, in a case-insensitive way, looks for a line beginning with the word “chapter” (as a regular expression, ^chapter) followed by a space and any of the following: any digit, or the letters i, v, x, l, or c ([\\divxlc]) will find the lines we are looking for. We don’t have to capture anything else: as long as we have found this, we have seen enough to conclude that this is the beginning of a chapter. This explains the regex("^chapter [\\divxlc]", ignore_case = TRUE).
Now for the str_detect(). This stringr function will search a character vector for a given pattern (in this case a regular expression, rather than a simple string) and will return TRUE wherever it finds it, and FALSE otherwise. For example, let us imagine I have a fruit basket with an apple, a pear, and a banana. If I want to search the basket to see if I have a banana, here is what that would look like with str_detect():
library(stringr)
fruit_basket <- c("apple", "pear", "banana")
find_banana <- fruit_basket %>% str_detect("banana")
find_banana
## [1] FALSE FALSE TRUE
As you can see, it returned FALSE twice, because neither the apple nor the pear is a banana. Then, finding the banana, it returned TRUE. This logical vector can now be used to “select” just a few items from a list. In this case, we could make a new basket with just the banana we found:
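In base R, that selection can be done by subsetting the basket with the logical vector:

fruit_basket[find_banana]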
## [1] "banana"
But what about the case of our book? str_detect() has returned a TRUE for every line in every Jane Austen book which is the beginning of a chapter. For example, the tenth line of original_books is “CHAPTER 1”, so the tenth element of the vector returned by str_detect() will be TRUE. Here is where cumsum() comes in. If I have a vector x which contains some numbers, cumsum() will take the number found at each position (any TRUE gets interpreted or “coerced” to a 1, and FALSE gets coerced to a 0) and add to it the cumulative sum of the items that come before it. So if x <- c(1,2,3), then cumsum(x) will result in 1, 3, and 6. The sum of 1 is just 1, the sum of 1+2 is 3, and the sum of 1+2+3 is 6. With logical vectors containing FALSE and TRUE, here is an example:
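Judging from the walkthrough that follows, the vector in question is:

cumsum(c(TRUE,FALSE,TRUE,FALSE,TRUE))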
## [1] 1 1 2 2 3
The first TRUE is coerced to 1, then 1+0=1, then 1+0+1=2, then 1+0+1+0=2, then 1+0+1+0+1=3. This is why we end up with 1 1 2 2 3. And this is how the chapter numbers were obtained: the chapter count increments each time str_detect() finds (TRUE) another chapter heading.
After using unnest_tokens() to break the text down into words, the next piece of code loads a dataset called stop_words from the tidytext package. This is a simple dataset with two columns: “word”, holding a list of English-language stop words, and “lexicon”, indicating the source of each word (onix, SMART, or snowball).
The anti_join() command is like a cookie cutter: we “cut out” the words in the stop words list from the tidy_books variable. (See also the discussion of some of the other join commands further below.) Let us look at another example:
seven_sins <- tibble(characteristics=c("gluttony","lust","greed","sloth","wrath","envy","pride"))
my_personality <- tibble(characteristics=c("gluttony","modesty","lust","restraint","greed","industriousness","sloth","tolerance","wrath","generosity","envy","strength","pride"))
my_personality
## # A tibble: 13 x 1
## characteristics
## <chr>
## 1 gluttony
## 2 modesty
## 3 lust
## 4 restraint
## 5 greed
## 6 industriousness
## 7 sloth
## 8 tolerance
## 9 wrath
## 10 generosity
## 11 envy
## 12 strength
## 13 pride
Notice that my_personality has a characteristics column, the same as the characteristics column in the seven_sins tibble. We now have a list of sins, and a collection of personality traits with some sins among them. Let us now remove all the sins from our personality and examine the result:
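The join that produces the result below is:

my_personality %>% anti_join(seven_sins)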
## # A tibble: 6 x 1
## characteristics
## <chr>
## 1 modesty
## 2 restraint
## 3 industriousness
## 4 tolerance
## 5 generosity
## 6 strength
Since the two tibbles had a column with the same name, we didn’t have to tell anti_join() which columns matched each other. If our sins had been listed in a column named name instead, we would get an error when running anti_join(). To fix that, we can tell anti_join() which column matches which with: anti_join(seven_sins, by = c("characteristics" = "name"))
Now that the stop words have been removed, the remaining words are counted and sorted with the dplyr verb count(). This counts things in an input tibble and produces a new tibble, optionally sorted, with the frequencies in an “n” column. Here is another very simple example:
drawer_contents <- tibble(objects=c("pen","pen","pen","pen","pen","pen","pencil","pencil","eraser","highlighter","highlighter","highlighter","usb stick","usb stick"))
drawer_contents %>%
count(objects, sort = TRUE)
## # A tibble: 5 x 2
## objects n
## <chr> <int>
## 1 pen 6
## 2 highlighter 3
## 3 pencil 2
## 4 usb stick 2
## 5 eraser 1
Finally, ggplot2 is used to plot the resulting frequencies. ggplot2 is not something easily explained in passing, and I would suggest you read the (online) chapter 3 of R for Data Science for more details.
Much of this section builds on techniques from the earlier sections, but uses the gutenbergr package to download and work with multiple books at once. There is one somewhat complicated passage which uses bind_rows() to merge together several tibbles, uses select() with a negative to exclude a single column, and then uses spread() and gather() from the tidyr package to reshape the table.
The bind_rows() function from dplyr takes two or more tibbles or data frames and merges them together by stacking them on top of each other. Here is a simple example:
some_countries <- tibble(name=c("Hungary","Lebanon","Panama"),language=c("Hungarian","Arabic","Spanish"))
more_countries <- tibble(name=c("Cambodia","Morocco","Argentina"),language=c("Khmer","Arabic","Spanish"))
all_countries <- bind_rows(some_countries,more_countries)
all_countries %>% arrange(language)
## # A tibble: 6 x 2
## name language
## <chr> <chr>
## 1 Lebanon Arabic
## 2 Morocco Arabic
## 3 Hungary Hungarian
## 4 Cambodia Khmer
## 5 Panama Spanish
## 6 Argentina Spanish
Here I used the dplyr verb arrange() to sort the final results by language. In the text, you will notice that instead of just passing bind_rows() the tidy_bronte and tidy_hgwells etc. tibbles directly, the book first used mutate() to add a column of author information to each.
Another column that is added holds the proportion of a text’s total words that a given word accounts for, with mutate(proportion = n / sum(n)). If the most frequent word is found 1,200 times in a text of 1,200,000 words, then the proportional frequency of that word is 0.1% (1,200/1,200,000). The sum(n) derives the total number of words in the text by adding all the frequencies together.
The select(-n) uses the dplyr verb select(), which grabs data from only certain columns. However, if you put a “-” before the name of a column, you are saying, “Give me all the columns except this one.” For example, using the all_countries tibble above, I can grab a list of just the names of the countries with either of the following:
## # A tibble: 6 x 1
## name
## <chr>
## 1 Hungary
## 2 Lebanon
## 3 Panama
## 4 Cambodia
## 5 Morocco
## 6 Argentina
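And the negative form excludes the column we don’t want, producing the same result:

all_countries %>% select(-language)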
## # A tibble: 6 x 1
## name
## <chr>
## 1 Hungary
## 2 Lebanon
## 3 Panama
## 4 Cambodia
## 5 Morocco
## 6 Argentina
The spread() and gather() functions have, in recent versions of the tidyr package, been renamed pivot_wider() and pivot_longer(). The best way to understand what is happening here is to take a closer look at the tidyr vignette on pivoting.
One technique for making data more “tidy” (each column a variable, each row an observation) is to “spread”, or pivot_wider(), a dataset. Imagine a case where we have some data on things collected on walks in the forest. The following data is not tidy because we have the thing collected in one field and the amount collected in another:
foraging <- tibble(day=c("Monday Trip","Monday Trip","Tuesday Trip","Wednesday Trip","Wednesday Trip","Friday Trip"),item=c("raspberry","mushroom","raspberry","blackberry","mushroom","strawberry"),amount=c(14,3,27,35,6,18))
foraging
## # A tibble: 6 x 3
## day item amount
## <chr> <chr> <dbl>
## 1 Monday Trip raspberry 14
## 2 Monday Trip mushroom 3
## 3 Tuesday Trip raspberry 27
## 4 Wednesday Trip blackberry 35
## 5 Wednesday Trip mushroom 6
## 6 Friday Trip strawberry 18
If we think of our observations as composed of the combined harvest of single trips to the forest, then the number of mushrooms collected is one variable, the number of strawberries another, and so on. These should therefore be columns, not values in separate rows. We can resolve this problem with pivot_wider() (the old spread()):
library(tidyr)
tidy_foraging <- foraging %>% pivot_wider(names_from=item,values_from=amount,values_fill=0)
The names_from argument determines where the names of the new columns will be pulled from. Here the item column will be replaced by four columns corresponding to the four different kinds of things we collected. The values_from argument indicates where the values to be associated with those columns are to be found. Finally, values_fill handles cases where there is no available value, for example when one of the things was not collected on a given trip. The resulting data will look like this:
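The result is printed simply by naming the new tibble:

tidy_foraging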
## # A tibble: 4 x 5
## day raspberry mushroom blackberry strawberry
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Monday Trip 14 3 0 0
## 2 Tuesday Trip 27 0 0 0
## 3 Wednesday Trip 0 6 35 0
## 4 Friday Trip 0 0 0 18
What about pivot_longer()? Imagine an untidy dataset which has a list of candidates and the votes they won in various years (this is very similar to the example used here):
votes <- tibble(name=c("Mickey","Goofy","Minnie","Donald"),"2004"=c(1200,1600,1400,1700),"2008"=c(1400,900,1600,2000),"2011"=c(1000,1000,2500,900))
votes
## # A tibble: 4 x 4
## name `2004` `2008` `2011`
## <chr> <dbl> <dbl> <dbl>
## 1 Mickey 1200 1400 1000
## 2 Goofy 1600 900 1000
## 3 Minnie 1400 1600 2500
## 4 Donald 1700 2000 900
This is not a tidy dataset because our rows are not individual observations: the columns contain multiple years, and the individual cells are the observations. To convert this into a tidy format in which each row constitutes a single observation, we can use pivot_longer():
tidy_votes <- votes %>% pivot_longer(c("2004","2008","2011"),names_to="years", values_to="votes")
tidy_votes
## # A tibble: 12 x 3
## name years votes
## <chr> <chr> <dbl>
## 1 Mickey 2004 1200
## 2 Mickey 2008 1400
## 3 Mickey 2011 1000
## 4 Goofy 2004 1600
## 5 Goofy 2008 900
## 6 Goofy 2011 1000
## 7 Minnie 2004 1400
## 8 Minnie 2008 1600
## 9 Minnie 2011 2500
## 10 Donald 2004 1700
## 11 Donald 2008 2000
## 12 Donald 2011 900
Now each line of the dataset is a single observation: how many votes someone got in a particular year.
The plot that follows uses ggplot2 together with the scales package, neither of which can be covered in detail here. The final cor.test() is a simple correlation test between paired samples using Pearson’s product-moment correlation. You can read more in the documentation for this command.
The introductory section discusses the nature and origins of the lexicons of words which lie at the heart of the exercises in the chapter. In addition to the tidytext
package, you may need to install the textdata
package (install.packages("textdata")
). Running the get_sentiments()
command will sometimes prompt you to accept a license for use of the lexicon, or commit to citing the data before downloading it. It is good to note that there are many more lexicons than those discussed in the book, and as the authors point out, it is important to learn more about their composition, limitations, and target sources.
The get_sentiments() command merely loads a sentiment lexicon into a tibble. It is best to examine the resulting tibble, as the column names vary. For example, the “bing” lexicon uses “sentiment” as a column name (with negative and positive as its values), while “afinn” uses “value” as a column name, with negative and positive integer values.
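For example, you can load and compare two of the lexicons like this (the afinn lexicon may first prompt you to download its data via textdata):

library(tidytext)
get_sentiments("bing")   # columns: word, sentiment ("negative"/"positive")
get_sentiments("afinn")  # columns: word, value (integers from -5 to 5)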
An inner join is one of a series of ways that two datasets can be combined to produce a variety of results. Some of these are “mutating joins”, which combine information together; others are “filtering joins”. Above we encountered the filtering join anti_join(), which acts like a cookie cutter, cutting out the elements that two tibbles have in common. We used the example of a list of the seven sins being used to remove them from a list of personality characteristics.
An inner join combines two datasets, but retains only the rows that are found in both of them. Let us imagine I have a list of the fruits in my basket and a separate price list from the market that covers some, but not all, of them.
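Here is a minimal sketch in the spirit of the earlier fruit baskets (the fruits and prices are invented for illustration):

library(dplyr)
fruit_basket <- tibble(fruit=c("apple","pear","banana"))
market_prices <- tibble(fruit=c("apple","banana","cherry"),price=c(0.50,0.30,2.00))
# keeps only the apple and banana rows: the pear has no price, and the
# cherry is not in the basket
fruit_basket %>% inner_join(market_prices, by="fruit")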
The GitHub Repository for this handout and its files.
Konrad M. Lawson. @kmlawson
Creative Commons - Attribution CC BY, 2020.