In this example, I’ll walk through some of the steps (on Mac OS X) of getting a collection of texts into R in a tidy format, so that you can apply some of the techniques explored in Text Mining with R by Julia Silge and David Robinson.
The first step is to get your material into raw text form. If your starting point is a PDF in which the text cannot be selected (that is, it has not yet been OCRed), then the workflow will be to install the tesseract OCR engine:
brew install tesseract
from the terminal. Tesseract operates on image files rather than on PDFs directly, so you will first need to convert the pages of your PDF into TIFF images. Then run:

tesseract file-name.tiff file-name
This will produce a text file containing the text (tesseract appends the .txt extension to the output name itself, so the result here is file-name.txt), though likely with some proportion of errors.

If you already have a PDF in which the text is selectable, you can either simply select all the text and copy and paste it into an empty text file in, for example, Atom, or you can use pdftotext, installed with:
brew cask install pdftotext
from the terminal. Navigate to the directory containing your PDFs using the cd command. To go “up” a directory, enter cd ..; to find out what directory you are currently in, enter pwd; to see what is in the current directory, enter ls; and to enter a directory within the current one, enter cd name-of-directory, or give the full path of the directory in your file hierarchy, for example cd /home/user/path/to/directory. Once you are in the directory that holds your PDFs, enter:
pdftotext name-of-file.pdf
or, if you have a large collection of files, you can enter:

for file in *.pdf; do pdftotext "$file" "${file%.pdf}.txt"; done
which will convert every PDF in the current directory into text.
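As an aside, if you would rather stay in R for this step, the pdftools package can extract selectable text directly. This is just a sketch under my own assumptions (pdftools is not part of the workflow above, and the file name is a placeholder):

# install.packages("pdftools") # assumption: not used elsewhere in this walkthrough
library(pdftools)
library(readr)

pages <- pdf_text("name-of-file.pdf") # one character string per page
write_lines(pages, "name-of-file.txt") # save as a plain text file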
In this example, we’ll work with some volumes of Gandhi’s complete works (for the example code I’ll just work with volumes 15 and 16). First we’ll rename the files so that they refer only to the volume number (so that, for example, a file named something like collected-works-15.txt becomes 15.txt). This will use the rename command, which, if you don’t have it installed, can be installed with Homebrew using brew install rename.
rename 's/.*(\d\d)\.txt/$1.txt/' *
Each volume varies in size from just over 600KB to 1.3MB.
NOTE: If you run the code below yourself, I suggest working first with just one or two volumes; then, when you are happy with the results overall, run it for all the volumes, which is much more time consuming. In my test, it took 30 minutes to run this script with 15 volumes.
Let us load some libraries for our use:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.7
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(tidytext)
Now let us import all the volumes (in this example, I’ll only work with volumes 15 and 16). We’ll first create an empty variable mergedlines and a for loop that cycles through each of the files in the directory containing our text files. Then we’ll import each file.
mergedlines <- NULL
for (i in list.files("gandhi/")) {
  # Read one volume as a vector of lines, recording the volume
  # number extracted from its file name
  importedfile <- as_data_frame(list(text = read_lines(paste0("gandhi/", i)),
                                     volume = str_replace(i, ".txt", "")))
  # Append this volume's lines to those of the previous volumes
  mergedlines <- bind_rows(mergedlines, importedfile)
}
nrow(mergedlines)
## [1] 48092
The read_lines() command reads each file into a vector of lines of text (I construct the location of the file by joining the directory location and the file name of the current imported file i using the paste0() command). The as_data_frame() command, which creates a tibble, expects a list, so we give it one consisting of the imported file as text and, separately, the volume number, which we have extracted from the file name using str_replace(). Each time I import a file, I use bind_rows() to merge it with the lines of all the previous volumes. Using nrow() you can see that these two volumes alone contain just over 48,000 lines of text; the full run of volumes comes to close to two million lines.
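As an aside, if you would rather not grow a data frame inside a loop, roughly the same import can be written with purrr’s map_dfr(), which reads each file and row-binds the results in a single step. This is just a sketch, assuming the same gandhi/ directory as above:

# Read every file in gandhi/ and row-bind the resulting tibbles
mergedlines <- map_dfr(list.files("gandhi/"), function(i) {
  as_data_frame(list(text = read_lines(paste0("gandhi/", i)),
                     volume = str_replace(i, ".txt", "")))
})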
Each volume consists of a series of numbered documents whose headings take the form of a number, a period, and a title in capital letters. The following loop walks through every line; whenever it detects such a heading, it records it as the current document, and each line is then tagged with the heading of the document it belongs to:

currentdoc <- NA
mergedlines <- mergedlines %>%
  mutate(document = "Unknown")
count <- 0
for (i in mergedlines$text) {
  count <- count + 1
  # Does this line look like a document heading?
  if (str_detect(i, "\\d+\\. [A-Z][A-Z]+")) {
    currentdoc <- i
  }
  # Tag the line with the most recent heading, stripping any
  # form feed characters left over from the PDF conversion
  mergedlines$document[count] <- str_replace(currentdoc, "\f", "")
}
NOTE: This process up to here took 6 seconds for me with a single volume (#15) and 42 seconds with three volumes. If you add all the volumes from vol. 15 to 97, for example, it may take as long as 20-30 minutes to process everything. It is not unusual for large data sets to take a very long time to process.
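Most of that time goes into the row-by-row loop above. If it proves too slow, here is a vectorized sketch of the same tagging step, an alternative I am substituting here using tidyr’s fill() to carry the last-seen heading down the rows:

mergedlines <- mergedlines %>%
  # Mark heading lines, stripping form feed characters; all other lines get NA
  mutate(heading = ifelse(str_detect(text, "\\d+\\. [A-Z][A-Z]+"),
                          str_replace(text, "\f", ""), NA_character_)) %>%
  # Carry the most recent heading down to the lines that follow it
  fill(heading) %>%
  mutate(document = heading) %>%
  select(-heading)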
Now let us break this down into tokens, remove common stop words and look at basic frequencies:
tidy_volumes <- mergedlines %>%
  unnest_tokens(word, text)

data(stop_words) # load the stop words data frame

# Remove the stop words from the word list using anti_join():
tidy_volumes <- tidy_volumes %>%
  anti_join(stop_words)
## Joining, by = "word"
# Get the word frequencies:
word_freq <- tidy_volumes %>%
  count(word, sort = TRUE)

head(word_freq, 40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()
It looks like some numbers are worth adding to the stop word list…
my_stop_words <- c("gandhi","letter","1","2","3","4","5","6","7","8","9") # a few stop words of my own from the letters
my_stop_words_df <- data_frame(word = my_stop_words, lexicon = "custom") # put them in a data frame matching stop_words
stop_words <- stop_words %>% bind_rows(my_stop_words_df) # add the custom stop words to the list

# Remove the stop words again:
tidy_volumes <- tidy_volumes %>%
  anti_join(stop_words)
## Joining, by = "word"
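To check that the custom stop words are gone, we can redraw the frequency plot using the same plotting code as before:

tidy_volumes %>%
  count(word, sort = TRUE) %>%
  head(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()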
What if we wanted to look only at word frequencies for letters to one group of recipients?
chhaganlal <- tidy_volumes %>%
  # Filter for letters to Chhaganlal, Maganlal, Ramdas, Narandas, Manilal,
  # or Khushalchand Gandhi - or to Hermann Kallenbach
  filter(str_detect(document, "GANDHI") | str_detect(document, "KALLENBACH")) %>%
  count(word, sort = TRUE)

head(chhaganlal, 40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()
What if we wanted a list of all documents that use a certain term?
caste <- tidy_volumes %>%
  filter(word == "caste" | word == "castes") %>%
  distinct(document)

caste
What if we wanted all the lines that mention a particular word?
caste_lines <- mergedlines %>%
  filter(str_detect(text, "caste")) %>%
  select(text, document)

caste_lines
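Note that str_detect() as used here is case-sensitive, so a line beginning “Caste…” would be missed. A small variation, a sketch using stringr’s regex() modifier, catches both cases:

caste_lines <- mergedlines %>%
  filter(str_detect(text, regex("caste", ignore_case = TRUE))) %>%
  select(text, document)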
Finally, let us add tf-idf, which weights a word’s frequency within a document against the number of documents it appears in, highlighting the words most distinctive of each document:
counts_by_document <- tidy_volumes %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

# bind_tf_idf() expects one row per document-word pair, so we work from
# the counts directly rather than joining them back onto the token table
gandhi_tf_idf <- counts_by_document %>%
  bind_tf_idf(word, document, n)

gandhi_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE)
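To see the most distinctive words at a glance, you could plot the top tf-idf terms in the same style as the frequency plots above. This is just a sketch; the cut-off of 20 words is an arbitrary choice of mine:

gandhi_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  coord_flip()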