This is a quick R Notebook to help a student work with an XML file using the xml2 library, extracting various useful information from it.

It uses the example file found here.

Setup:

library(xml2)
library(tidyverse)

In xml2 you can use read_xml() to import a document. It returns an object of class xml_document (which is also an xml_node) that can be further manipulated.

xmlimport <- read_xml("xml/example.xml")
class(xmlimport)
## [1] "xml_document" "xml_node"
xmlimport
## {xml_document}
## <TEI.2>
## [1] <text>\n  <body>\n    <div0 type="sessionsPaper" id="16770601">\n      <i ...
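read_xml() also accepts literal XML in a string, which is handy for experimenting without the example file. The snippet below is a minimal sketch; the tiny `toy` document is a made-up stand-in that only mimics the top of the real file.

```r
library(xml2)

# Hypothetical miniature document (an assumption, not the real example file)
toy <- read_xml(
  '<TEI.2><text><body><div0 type="sessionsPaper" id="16770601"/></body></text></TEI.2>'
)

# The document object points at the root element
root_name <- xml_name(toy)

# Number of child elements directly under the root
n_children <- xml_length(toy)

print(root_name)
print(n_children)
```

xml_name() and xml_length() give a quick first impression of an unfamiliar file before you start stepping into it.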

Children

You can manually “step down” through an XML file with xml_children(), which returns all the child elements of a node. Here are some examples of stepping slowly down into the file (not the recommended way if you just want to grab something inside!).

firstchild <- xmlimport %>% xml_children()
firstchild
## {xml_nodeset (1)}
## [1] <text>\n  <body>\n    <div0 type="sessionsPaper" id="16770601">\n      <i ...
secondchild <- firstchild %>% xml_children()
secondchild
## {xml_nodeset (1)}
## [1] <body>\n  <div0 type="sessionsPaper" id="16770601">\n    <interp inst="16 ...
thirdchild <- secondchild %>% xml_children()
thirdchild
## {xml_nodeset (1)}
## [1] <div0 type="sessionsPaper" id="16770601">\n  <interp inst="16770601" type ...
fourthchild <- thirdchild %>% xml_children()
fourthchild
## {xml_nodeset (17)}
##  [1] <interp inst="16770601" type="collection" value="BAILEY"/>
##  [2] <interp inst="16770601" type="year" value="1677"/>
##  [3] <interp inst="16770601" type="uri" value="sessionsPapers/16770601"/>
##  [4] <interp inst="16770601" type="date" value="16770601"/>
##  [5] <xptr type="transcription" doc="16770601"/>
##  [6] <div1 type="frontMatter" id="f16770601-1">\n  <interp inst="f16770601-1" ...
##  [7] <div1 type="trialAccount" id="t16770601-1">\n  <interp inst="t16770601-1 ...
##  [8] <div1 type="trialAccount" id="t16770601-2">\n  <interp inst="t16770601-2 ...
##  [9] <div1 type="trialAccount" id="t16770601-3">\n  <interp inst="t16770601-3 ...
## [10] <div1 type="trialAccount" id="t16770601-4">\n  <interp inst="t16770601-4 ...
## [11] <div1 type="trialAccount" id="t16770601-5">\n  <interp inst="t16770601-5 ...
## [12] <div1 type="trialAccount" id="t16770601-6">\n  <interp inst="t16770601-6 ...
## [13] <div1 type="trialAccount" id="t16770601-7">\n  <interp inst="t16770601-7 ...
## [14] <div1 type="trialAccount" id="t16770601-8">\n  <interp inst="t16770601-8 ...
## [15] <div1 type="trialAccount" id="t16770601-9">\n  <interp inst="t16770601-9 ...
## [16] <div1 type="trialAccount" id="t16770601-10">\n  <interp inst="t16770601- ...
## [17] <div1 type="punishmentSummary" id="s16770601-1">\n  <interp inst="s16770 ...
print("Now let us just grab one of these children:")
## [1] "Now let us just grab one of these children:"
fifthchild <- fourthchild[6]
fifthchild
## {xml_nodeset (1)}
## [1] <div1 type="frontMatter" id="f16770601-1">\n  <interp inst="f16770601-1"  ...
# This is a div1 section (here the frontMatter, not a trial) and its enclosed elements
div1child <- fifthchild %>% xml_children()
# We are now inside one of the div1 sections
div1child
## {xml_nodeset (13)}
##  [1] <interp inst="f16770601-1" type="collection" value="BAILEY"/>
##  [2] <interp inst="f16770601-1" type="year" value="1677"/>
##  [3] <interp inst="f16770601-1" type="uri" value="sessionsPapers/16770601"/>
##  [4] <interp inst="f16770601-1" type="date" value="16770601"/>
##  [5] <p><xptr type="pageFacsimile" doc="16770601001"/>A true NARRATIVE Of the ...
##  [6] <p>At a Sessions there held On the 1st. and 2d. of June 1677.</p>
##  [7] <p>Being a true Relation of the Tryal and Condemnation of the grand High ...
##  [8] <p>With the Tryal of the Midwife for pretending to be deliverd of a ston ...
##  [9] <p>With the Tryal of the two Searchers that were her Confederates.</p>
## [10] <p>And all other considerable Transactions there, with the number of tho ...
## [11] <p>With Allowance. Ro. L'Estrange.</p>
## [12] <p>LONDON: Printed for D.M. 1677.</p>
## [13] <p><xptr type="pageFacsimile" doc="16770601002"/>A Narrative of the Proc ...
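If you only want one child at each level, xml2 also provides xml_child(), which takes either a position or an element name. The following is a small sketch on a made-up `toy` document (an assumption for illustration); it uses nested calls instead of the pipe so that it runs with xml2 alone.

```r
library(xml2)

# Hypothetical miniature document shaped like the top of the example file
toy <- read_xml('<TEI.2><text><body>
  <div0 type="sessionsPaper" id="16770601">
    <interp type="year" value="1677"/>
    <div1 type="frontMatter" id="f16770601-1"/>
    <div1 type="trialAccount" id="t16770601-1"/>
  </div0>
</body></text></TEI.2>')

# Step down by element name, one level at a time
div0 <- xml_child(xml_child(xml_child(toy, "text"), "body"), "div0")

# Pick a child by position: the second child element of div0
second <- xml_child(div0, 2)

# Pick a child by name: the first child element called div1
first_div1 <- xml_child(div0, "div1")

print(xml_attr(second, "id"))
print(xml_attr(first_div1, "type"))
```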

Xpath

You can find something inside an XML file using the powerful XPath query language. Using xml_find_all() you can specify an XPath expression that refers to all the div1 tags in the file, for example. You can then grab a single one of those by position by indexing into the resulting nodeset with [[ ]].

# Find all div1 tags
mycases <- xmlimport %>% xml_find_all("//div1")
# Let us grab just the first case (after the frontMatter) by direct numerical call:
mycases[[2]]
## {xml_node}
## <div1 type="trialAccount" id="t16770601-1">
## [1] <interp inst="t16770601-1" type="collection" value="BAILEY"/>
## [2] <interp inst="t16770601-1" type="year" value="1677"/>
## [3] <interp inst="t16770601-1" type="uri" value="sessionsPapers/16770601"/>
## [4] <interp inst="t16770601-1" type="date" value="16770601"/>
## [5] <join result="criminalCharge" id="t16770601-1-off2-c1" targOrder="Y" targ ...
## [6] <p>The first Tryal was of a \n               <persName id="t16770601-1-de ...
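The difference between single and double brackets matters when indexing a nodeset, and xml_find_first() is a shortcut when you only need the first match. This is a minimal sketch on a made-up `toy` document with shortened, hypothetical ids.

```r
library(xml2)

# Hypothetical miniature document with three div1 sections
toy <- read_xml('<body>
  <div1 type="frontMatter" id="f1"/>
  <div1 type="trialAccount" id="t1"/>
  <div1 type="trialAccount" id="t2"/>
</body>')

cases <- xml_find_all(toy, "//div1")

# Single brackets keep the nodeset wrapper (a nodeset of length 1)
as_nodeset <- cases[2]

# Double brackets return the node itself
as_node <- cases[[2]]

# xml_find_first() returns only the first match in document order
first_hit <- xml_find_first(toy, "//div1")

print(class(as_nodeset))
print(class(as_node))
print(xml_attr(first_hit, "id"))
```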

You can also further refine your search. Let us use XPath again to grab only the div1 tags that have the attribute “type” set to “trialAccount”:

trialaccounts <- xmlimport %>% xml_find_all('//div1[@type="trialAccount"]')
trialaccounts
## {xml_nodeset (10)}
##  [1] <div1 type="trialAccount" id="t16770601-1">\n  <interp inst="t16770601-1 ...
##  [2] <div1 type="trialAccount" id="t16770601-2">\n  <interp inst="t16770601-2 ...
##  [3] <div1 type="trialAccount" id="t16770601-3">\n  <interp inst="t16770601-3 ...
##  [4] <div1 type="trialAccount" id="t16770601-4">\n  <interp inst="t16770601-4 ...
##  [5] <div1 type="trialAccount" id="t16770601-5">\n  <interp inst="t16770601-5 ...
##  [6] <div1 type="trialAccount" id="t16770601-6">\n  <interp inst="t16770601-6 ...
##  [7] <div1 type="trialAccount" id="t16770601-7">\n  <interp inst="t16770601-7 ...
##  [8] <div1 type="trialAccount" id="t16770601-8">\n  <interp inst="t16770601-8 ...
##  [9] <div1 type="trialAccount" id="t16770601-9">\n  <interp inst="t16770601-9 ...
## [10] <div1 type="trialAccount" id="t16770601-10">\n  <interp inst="t16770601- ...
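Before filtering on an attribute it can help to see which values it actually takes. One possible way, sketched here on a made-up `toy` document, is to pull the attribute from every matching node with xml_attr() and tabulate it with base R's table().

```r
library(xml2)

# Hypothetical miniature document with mixed div1 types
toy <- read_xml('<div0>
  <div1 type="frontMatter" id="f1"/>
  <div1 type="trialAccount" id="t1"/>
  <div1 type="trialAccount" id="t2"/>
  <div1 type="punishmentSummary" id="s1"/>
</div0>')

# xml_attr() is vectorised over a nodeset: one value per node
types <- xml_attr(xml_find_all(toy, "//div1"), "type")

# Count how often each type occurs
type_counts <- table(types)
print(type_counts)

# The same attribute test used as an XPath predicate
trials <- xml_find_all(toy, '//div1[@type="trialAccount"]')
print(length(trials))
```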

If you wanted all the defendants of your cases you could search for them across all cases:

alldefendants <- xmlimport %>% xml_find_all('//persName[@type="defendantName"]')
alldefendants
## {xml_nodeset (11)}
##  [1] <persName id="t16770601-1-defend1" type="defendantName">young fellow<int ...
##  [2] <persName id="t16770601-2-defend2" type="defendantName">pickpocket<inter ...
##  [3] <persName id="t16770601-3-defend4" type="defendantName">young <rs id="t1 ...
##  [4] <persName id="t16770601-4-defend5" type="defendantName">Highway-man<inte ...
##  [5] <persName id="t16770601-5-defend7" type="defendantName">\n  <rs id="t167 ...
##  [6] <persName id="t16770601-6-defend9" type="defendantName">\n  <rs id="t167 ...
##  [7] <persName id="t16770601-6-defend10" type="defendantName">aged poor women ...
##  [8] <persName id="t16770601-7-defend11" type="defendantName">\n  <rs id="t16 ...
##  [9] <persName id="t16770601-8-defend13" type="defendantName">person<interp i ...
## [10] <persName id="t16770601-9-defend15" type="defendantName">woman<interp in ...
## [11] <persName id="t16770601-10-defend16" type="defendantName">young Man<inte ...

If you wanted to find all the defendants who are also women (note that this query returns their gender interp nodes rather than the persName nodes themselves):

womendefendants <- xmlimport %>% xml_find_all('//persName[@type="defendantName"]/interp[@type="gender" and @value="female"]')
womendefendants
## {xml_nodeset (4)}
## [1] <interp inst="t16770601-3-defend4" type="gender" value="female"/>
## [2] <interp inst="t16770601-6-defend9" type="gender" value="female"/>
## [3] <interp inst="t16770601-6-defend10" type="gender" value="female"/>
## [4] <interp inst="t16770601-9-defend15" type="gender" value="female"/>
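If you would rather get the persName nodes back, the gender test can be moved into a predicate so that it filters the persName elements instead of selecting their interp children. A small sketch on a made-up `toy` document follows; the ids and the victimName entry are hypothetical.

```r
library(xml2)

# Hypothetical miniature trial with two defendants and one other person
toy <- read_xml('<div1>
  <persName id="d1" type="defendantName">young fellow<interp type="gender" value="male"/></persName>
  <persName id="d2" type="defendantName">young Girl<interp type="gender" value="female"/></persName>
  <persName id="v1" type="victimName">woman<interp type="gender" value="female"/></persName>
</div1>')

# The second predicate keeps only persName nodes that have a matching interp child
women <- xml_find_all(
  toy,
  '//persName[@type="defendantName"][interp[@type="gender" and @value="female"]]'
)

women_ids <- xml_attr(women, "id")
women_text <- xml_text(women, trim = TRUE)

print(women_ids)
print(women_text)
```

This assumes, as in the query above, that the gender interp is a direct child of persName.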

Create a Tibble with a List of all Defendants

You can extract the raw material from inside the XML file with a combination of xml_attr(), which returns the value of an attribute, and xml_text(), which returns the text inside a tag such as <bla>text</bla>.
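As a quick illustration of the two functions on a single node, here is a minimal sketch. The node is made up for the example; in particular the attributes on the rs tag are hypothetical.

```r
library(xml2)

# Hypothetical single node with a nested tag inside it
node <- read_xml(
  '<persName id="t16770601-3-defend4" type="defendantName">young <rs type="occupation">Girl</rs></persName>'
)

# One attribute by name
node_type <- xml_attr(node, "type")

# Asking for an attribute that is not there gives NA rather than an error
no_such <- xml_attr(node, "missing")

# All attributes at once, as a named character vector
all_attrs <- xml_attrs(node)

# xml_text() concatenates the text of the node and of everything nested in it
node_text <- xml_text(node, trim = TRUE)

print(node_type)
print(all_attrs)
print(node_text)
```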

# We already did this above: trialaccounts <- xmlimport %>% xml_find_all('//div1[@type="trialAccount"]')

# Start with NULL; bind_rows() ignores it, so the tibble is built from the first row added:
defendants <- NULL
for(i in seq_along(trialaccounts)) {
  # The trial id is part of the div1 tag, which is the node we are currently on, so just grab the attribute:
  trialid <- trialaccounts[[i]] %>% xml_attr("id")
  # There should be only one year so just find the first hit among the interp tags:
  year <- trialaccounts[[i]] %>% xml_find_first('.//interp[@type="year"]') %>% xml_attr("value")
  # Get the gender of all the defendants:
  genderdefendants <- trialaccounts[[i]] %>% 
    xml_find_all('.//persName[@type="defendantName"]/interp[@type="gender"]') %>%
    xml_attr("value")
  # Get the text inside the persName tag which gives us the original description:
  descrip <- trialaccounts[[i]] %>% 
    xml_find_all('.//persName[@type="defendantName"]') %>% 
    xml_text(trim=TRUE)
  # Now we have all the information we need, so let us add the data as rows of a tibble.
  # A trial may have more than one defendant, so we need another loop. Index both descrip and
  # genderdefendants by j so that each defendant is added exactly once.
  # Note that defendantid holds the position of the trial, so co-defendants share the same value:
  for(j in seq_along(genderdefendants)) { 
    defendants <- defendants %>%
      bind_rows(tibble(defendantid=i,trial_id=trialid,year_tried=year,description=descrip[j],gender=genderdefendants[j]))
  }
}
defendants
## # A tibble: 11 x 5
##    defendantid trial_id     year_tried description     gender
##          <int> <chr>        <chr>      <chr>           <chr> 
##  1           1 t16770601-1  1677       young fellow    male  
##  2           2 t16770601-2  1677       pickpocket      male  
##  3           3 t16770601-3  1677       young Girl      female
##  4           4 t16770601-4  1677       Highway-man     male  
##  5           5 t16770601-5  1677       Victualer       male  
##  6           6 t16770601-6  1677       Midwife         female
##  7           6 t16770601-6  1677       aged poor women female
##  8           7 t16770601-7  1677       Gentleman       male  
##  9           8 t16770601-8  1677       person          male  
## 10           9 t16770601-9  1677       woman           female
## 11          10 t16770601-10 1677       young Man       male
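The same table can also be built without loops, because xml_attr(), xml_text() and xml_find_first() all work on a whole nodeset at once. The following is only a sketch of that alternative, run on a made-up `toy` document with shortened ids. It assumes that every defendant persName has a gender interp as a direct child and sits inside exactly one trialAccount div1.

```r
library(xml2)
library(tibble)

# Hypothetical miniature document: one trial with two defendants, one with a single defendant
toy <- read_xml('<div0>
  <div1 type="trialAccount" id="t1">
    <interp type="year" value="1677"/>
    <persName type="defendantName">Midwife<interp type="gender" value="female"/></persName>
    <persName type="defendantName">aged poor women<interp type="gender" value="female"/></persName>
  </div1>
  <div1 type="trialAccount" id="t2">
    <interp type="year" value="1677"/>
    <persName type="defendantName">young Man<interp type="gender" value="male"/></persName>
  </div1>
</div0>')

# One node per defendant, across all trials
persons <- xml_find_all(
  toy, '//div1[@type="trialAccount"]//persName[@type="defendantName"]'
)

# For each defendant, the nearest enclosing div1 (same length as persons)
trials <- xml_find_first(persons, "ancestor::div1[1]")

defendants <- tibble(
  trial_id    = xml_attr(trials, "id"),
  year_tried  = xml_attr(xml_find_first(trials, './interp[@type="year"]'), "value"),
  description = xml_text(persons, trim = TRUE),
  gender      = xml_attr(xml_find_first(persons, './interp[@type="gender"]'), "value")
)

print(defendants)
```

Because each column is derived from the same `persons` nodeset, the rows stay aligned per defendant, and a defendant without a gender interp would get NA instead of shifting the values.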

The GitHub Repository for this handout and its files.
Konrad M. Lawson. @kmlawson
Creative Commons - Attribution CC BY, 2020.