This is a quick R Notebook to help a student work with an XML file using the xml2
library, extracting various useful information from it.
It uses the example file found here.
Setup:
In xml2 you can use read_xml() to import a document. The result is an object of class xml_document that can be further manipulated.
## [1] "xml_document" "xml_node"
## {xml_document}
## <TEI.2>
## [1] <text>\n <body>\n <div0 type="sessionsPaper" id="16770601">\n <i ...
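The import step can be sketched as follows. This is only a minimal illustration, assuming the xml2 package is installed: it parses a tiny inline stand-in document (the string sample_xml is hypothetical and merely modelled on the structure printed above) rather than the real example file.

```r
library(xml2)

# Hypothetical miniature document modelled on the printed structure;
# it is NOT the real example file.
sample_xml <- '<TEI.2><text><body>
  <div0 type="sessionsPaper" id="16770601">
    <interp inst="16770601" type="year" value="1677"/>
  </div0>
</body></text></TEI.2>'

# read_xml() accepts a file path, a URL, or (as here) literal XML.
xmlimport <- read_xml(sample_xml)

class(xmlimport)     # classes of the imported object
xml_name(xmlimport)  # name of the root element
```

With a real file you would pass its path or URL to read_xml() instead of the string.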
You can manually “step down” through an XML file with xml_child() or xml_children(). Here are some examples of stepping slowly down into the file (not the recommended way if you just want to grab something inside!).
## {xml_nodeset (1)}
## [1] <text>\n <body>\n <div0 type="sessionsPaper" id="16770601">\n <i ...
## {xml_nodeset (1)}
## [1] <body>\n <div0 type="sessionsPaper" id="16770601">\n <interp inst="16 ...
## {xml_nodeset (1)}
## [1] <div0 type="sessionsPaper" id="16770601">\n <interp inst="16770601" type ...
## {xml_nodeset (17)}
## [1] <interp inst="16770601" type="collection" value="BAILEY"/>
## [2] <interp inst="16770601" type="year" value="1677"/>
## [3] <interp inst="16770601" type="uri" value="sessionsPapers/16770601"/>
## [4] <interp inst="16770601" type="date" value="16770601"/>
## [5] <xptr type="transcription" doc="16770601"/>
## [6] <div1 type="frontMatter" id="f16770601-1">\n <interp inst="f16770601-1" ...
## [7] <div1 type="trialAccount" id="t16770601-1">\n <interp inst="t16770601-1 ...
## [8] <div1 type="trialAccount" id="t16770601-2">\n <interp inst="t16770601-2 ...
## [9] <div1 type="trialAccount" id="t16770601-3">\n <interp inst="t16770601-3 ...
## [10] <div1 type="trialAccount" id="t16770601-4">\n <interp inst="t16770601-4 ...
## [11] <div1 type="trialAccount" id="t16770601-5">\n <interp inst="t16770601-5 ...
## [12] <div1 type="trialAccount" id="t16770601-6">\n <interp inst="t16770601-6 ...
## [13] <div1 type="trialAccount" id="t16770601-7">\n <interp inst="t16770601-7 ...
## [14] <div1 type="trialAccount" id="t16770601-8">\n <interp inst="t16770601-8 ...
## [15] <div1 type="trialAccount" id="t16770601-9">\n <interp inst="t16770601-9 ...
## [16] <div1 type="trialAccount" id="t16770601-10">\n <interp inst="t16770601- ...
## [17] <div1 type="punishmentSummary" id="s16770601-1">\n <interp inst="s16770 ...
## [1] "Now let us just grab one of these children:"
## {xml_nodeset (1)}
## [1] <div1 type="frontMatter" id="f16770601-1">\n <interp inst="f16770601-1" ...
# This is a div1 case and its enclosed sections
div1child <- fifthchild %>% xml_children()
# We are now inside one of the div1 sections:
div1child
## {xml_nodeset (13)}
## [1] <interp inst="f16770601-1" type="collection" value="BAILEY"/>
## [2] <interp inst="f16770601-1" type="year" value="1677"/>
## [3] <interp inst="f16770601-1" type="uri" value="sessionsPapers/16770601"/>
## [4] <interp inst="f16770601-1" type="date" value="16770601"/>
## [5] <p><xptr type="pageFacsimile" doc="16770601001"/>A true NARRATIVE Of the ...
## [6] <p>At a Sessions there held On the 1st. and 2d. of June 1677.</p>
## [7] <p>Being a true Relation of the Tryal and Condemnation of the grand High ...
## [8] <p>With the Tryal of the Midwife for pretending to be deliverd of a ston ...
## [9] <p>With the Tryal of the two Searchers that were her Confederates.</p>
## [10] <p>And all other considerable Transactions there, with the number of tho ...
## [11] <p>With Allowance. Ro. L'Estrange.</p>
## [12] <p>LONDON: Printed for D.M. 1677.</p>
## [13] <p><xptr type="pageFacsimile" doc="16770601002"/>A Narrative of the Proc ...
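Stepping down by hand, as in the outputs above, might look roughly like this. The sketch assumes only the xml2 package and uses a hypothetical miniature document, not the real file; the variable names are illustrative. It is written without the pipe so that no other package is needed.

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<TEI.2><text><body>
  <div0 type="sessionsPaper" id="16770601">
    <interp inst="16770601" type="year" value="1677"/>
    <div1 type="frontMatter" id="f16770601-1"><p>Front matter.</p></div1>
    <div1 type="trialAccount" id="t16770601-1"><p>First trial.</p></div1>
  </div0>
</body></text></TEI.2>')

# xml_child() returns ONE child element, chosen by position or by name.
text_node <- xml_child(doc, 1)             # <text>
body_node <- xml_child(text_node, "body")  # <body>
div0_node <- xml_child(body_node, 1)       # <div0>

# xml_children() returns ALL element children as a nodeset.
sections <- xml_children(div0_node)
xml_name(sections)          # element names of the children
xml_attr(sections, "type")  # their "type" attributes
```

The difference matters later on: xml_child() gives you a single node, while xml_children() gives you a nodeset that you still have to index into.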
You can find something inside an XML file using the incredibly powerful xpath. Using xml_find_all() you can specify an xpath that refers to all the div1 tags in the file, for example. Then you could grab just a single one of those by number by digging into the list.
# Find all div1 tags
mycases <- xmlimport %>% xml_find_all("//div1")
# Let us grab just the first case (after the frontMatter) by direct numerical call:
mycases[[2]]
## {xml_node}
## <div1 type="trialAccount" id="t16770601-1">
## [1] <interp inst="t16770601-1" type="collection" value="BAILEY"/>
## [2] <interp inst="t16770601-1" type="year" value="1677"/>
## [3] <interp inst="t16770601-1" type="uri" value="sessionsPapers/16770601"/>
## [4] <interp inst="t16770601-1" type="date" value="16770601"/>
## [5] <join result="criminalCharge" id="t16770601-1-off2-c1" targOrder="Y" targ ...
## [6] <p>The first Tryal was of a \n <persName id="t16770601-1-de ...
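When picking one result by number, it helps to know that double and single brackets behave differently on a nodeset. The following sketch, assuming only xml2 and a hypothetical three-section document, shows the distinction:

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<div0>
  <div1 type="frontMatter" id="f16770601-1"/>
  <div1 type="trialAccount" id="t16770601-1"/>
  <div1 type="trialAccount" id="t16770601-2"/>
</div0>')

mycases <- xml_find_all(doc, "//div1")

one_node <- mycases[[2]]  # double brackets: a single xml_node
subset   <- mycases[2:3]  # single brackets: still an xml_nodeset

class(one_node)
class(subset)
xml_attr(one_node, "id")  # id of the second div1
```

Both forms work with xml_attr() and xml_find_all(), but a single node returns one value where a nodeset returns a vector.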
You can also further refine your search. Let us use xpath again to grab only the div1 tags that have the attribute “type” set to “trialAccount”:
## {xml_nodeset (10)}
## [1] <div1 type="trialAccount" id="t16770601-1">\n <interp inst="t16770601-1 ...
## [2] <div1 type="trialAccount" id="t16770601-2">\n <interp inst="t16770601-2 ...
## [3] <div1 type="trialAccount" id="t16770601-3">\n <interp inst="t16770601-3 ...
## [4] <div1 type="trialAccount" id="t16770601-4">\n <interp inst="t16770601-4 ...
## [5] <div1 type="trialAccount" id="t16770601-5">\n <interp inst="t16770601-5 ...
## [6] <div1 type="trialAccount" id="t16770601-6">\n <interp inst="t16770601-6 ...
## [7] <div1 type="trialAccount" id="t16770601-7">\n <interp inst="t16770601-7 ...
## [8] <div1 type="trialAccount" id="t16770601-8">\n <interp inst="t16770601-8 ...
## [9] <div1 type="trialAccount" id="t16770601-9">\n <interp inst="t16770601-9 ...
## [10] <div1 type="trialAccount" id="t16770601-10">\n <interp inst="t16770601- ...
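The code chunk for this step is missing from this copy. A minimal sketch of the attribute filter, using the same xpath that appears later in the handout but run against a hypothetical miniature document, could be:

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<div0>
  <div1 type="frontMatter" id="f16770601-1"/>
  <div1 type="trialAccount" id="t16770601-1"/>
  <div1 type="trialAccount" id="t16770601-2"/>
  <div1 type="punishmentSummary" id="s16770601-1"/>
</div0>')

# [@type="trialAccount"] is a predicate: keep only the div1 elements
# whose "type" attribute has exactly that value.
trialaccounts <- xml_find_all(doc, '//div1[@type="trialAccount"]')

length(trialaccounts)          # how many div1 elements matched
xml_attr(trialaccounts, "id")  # their ids
```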
If you wanted all the defendants of your cases you could search for them across all cases:
## {xml_nodeset (11)}
## [1] <persName id="t16770601-1-defend1" type="defendantName">young fellow<int ...
## [2] <persName id="t16770601-2-defend2" type="defendantName">pickpocket<inter ...
## [3] <persName id="t16770601-3-defend4" type="defendantName">young <rs id="t1 ...
## [4] <persName id="t16770601-4-defend5" type="defendantName">Highway-man<inte ...
## [5] <persName id="t16770601-5-defend7" type="defendantName">\n <rs id="t167 ...
## [6] <persName id="t16770601-6-defend9" type="defendantName">\n <rs id="t167 ...
## [7] <persName id="t16770601-6-defend10" type="defendantName">aged poor women ...
## [8] <persName id="t16770601-7-defend11" type="defendantName">\n <rs id="t16 ...
## [9] <persName id="t16770601-8-defend13" type="defendantName">person<interp i ...
## [10] <persName id="t16770601-9-defend15" type="defendantName">woman<interp in ...
## [11] <persName id="t16770601-10-defend16" type="defendantName">young Man<inte ...
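A search like the one that produced this output can be sketched as below. It assumes only xml2 and a hypothetical miniature document; the persName with type "hypotheticalOther" is invented purely to show that the predicate filters it out.

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<div0>
  <div1 type="trialAccount" id="t16770601-1">
    <p>Tryal of a <persName id="t16770601-1-defend1" type="defendantName">young fellow</persName>
       accused by <persName type="hypotheticalOther">a neighbour</persName>.</p>
  </div1>
  <div1 type="trialAccount" id="t16770601-2">
    <p>Tryal of a <persName id="t16770601-2-defend2" type="defendantName">pickpocket</persName>.</p>
  </div1>
</div0>')

# "//" searches the whole document, at any depth, across all trials.
alldefendants <- xml_find_all(doc, '//persName[@type="defendantName"]')

length(alldefendants)          # number of defendants found
xml_attr(alldefendants, "id")  # their ids
```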
If you wanted to find all the defendants who are also women:
womendefendants <- xmlimport %>% xml_find_all('//persName[@type="defendantName"]/interp[@type="gender" and @value="female"]')
womendefendants
## {xml_nodeset (4)}
## [1] <interp inst="t16770601-3-defend4" type="gender" value="female"/>
## [2] <interp inst="t16770601-6-defend9" type="gender" value="female"/>
## [3] <interp inst="t16770601-6-defend10" type="gender" value="female"/>
## [4] <interp inst="t16770601-9-defend15" type="gender" value="female"/>
The problem with this, however, is that you have jumped outside the hierarchy and you would have to use the “id” number part of the persName id to figure out “where you are”. One solution would be to loop through the cases one by one, extract the information you need into variables, and then assign them to a row (per case) in a data frame.
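To make the “where you are” problem concrete, here is one possible way to recover the trial from a matched interp node. This is an illustrative sketch, not the approach used in this handout. It assumes that the inst value always ends in "-defend" plus a number, as in the output above, and it uses a hypothetical miniature document. As a cross-check it also walks back up the tree with the xpath ancestor axis.

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<div0>
  <div1 type="trialAccount" id="t16770601-3">
    <persName id="t16770601-3-defend4" type="defendantName">young Girl<interp inst="t16770601-3-defend4" type="gender" value="female"/></persName>
  </div1>
  <div1 type="trialAccount" id="t16770601-4">
    <persName id="t16770601-4-defend5" type="defendantName">Highway-man<interp inst="t16770601-4-defend5" type="gender" value="male"/></persName>
  </div1>
  <div1 type="trialAccount" id="t16770601-6">
    <persName id="t16770601-6-defend9" type="defendantName">Midwife<interp inst="t16770601-6-defend9" type="gender" value="female"/></persName>
    <persName id="t16770601-6-defend10" type="defendantName">aged poor women<interp inst="t16770601-6-defend10" type="gender" value="female"/></persName>
  </div1>
</div0>')

hits <- xml_find_all(doc,
  '//persName[@type="defendantName"]/interp[@type="gender" and @value="female"]')

# Option 1: strip the "-defendN" suffix from the inst attribute
# (assumes the id pattern seen in the output above).
trial_from_id <- sub("-defend[0-9]+$", "", xml_attr(hits, "inst"))

# Option 2: climb back up to the enclosing div1 and read its id.
trial_from_tree <- xml_attr(xml_find_first(hits, "ancestor::div1"), "id")

trial_from_id
trial_from_tree
```

Either option works for a quick lookup, but looping over the trials keeps all the fields of one case together.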
So, let us work with just one case, the first one. You can restrict an xpath search to the current node and everything below it by starting the expression with “.//”:
## {xml_nodeset (1)}
## [1] <persName id="t16770601-1-defend1" type="defendantName">young fellow<inte ...
Note that this found only the defendant in this trial, not all of them. Had we not included that initial “.” (start at this node), it would again have found all of the defendants, as if we had run xmlimport %>% xml_find_all('//persName[@type="defendantName"]'), since it “knows” all the nodes, despite the fact that we have assigned mytrial only the first of the trials.
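The difference between the two forms can be checked side by side. The sketch below assumes only xml2 and a hypothetical two-trial document; mytrial follows the naming used in the text.

```r
library(xml2)

# Hypothetical miniature document; NOT the real example file.
doc <- read_xml('<div0>
  <div1 type="trialAccount" id="t16770601-1">
    <persName id="t16770601-1-defend1" type="defendantName">young fellow</persName>
  </div1>
  <div1 type="trialAccount" id="t16770601-2">
    <persName id="t16770601-2-defend2" type="defendantName">pickpocket</persName>
  </div1>
</div0>')

trialaccounts <- xml_find_all(doc, '//div1[@type="trialAccount"]')
mytrial <- trialaccounts[[1]]

# ".//" starts at mytrial and only looks below it.
relative_hits <- xml_find_all(mytrial, './/persName[@type="defendantName"]')

# "//" starts at the document root, even though we call it on mytrial.
absolute_hits <- xml_find_all(mytrial, '//persName[@type="defendantName"]')

length(relative_hits)  # defendants in this trial only
length(absolute_hits)  # defendants in the whole document
```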
You can extract the raw material from inside the XML file with a combination of xml_attr() to find the contents of attributes, and xml_text() to extract the text inside <bla>tags</bla>.
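As a small illustration of the two functions, consider the following sketch. It assumes only xml2 and uses a hypothetical single-trial fragment; the bare <rs> element is simplified, since its attributes are truncated in the output above.

```r
library(xml2)

# Hypothetical single-trial fragment; NOT the real example file.
trial <- read_xml('<div1 type="trialAccount" id="t16770601-3">
  <interp inst="t16770601-3" type="year" value="1677"/>
  <p>A <persName id="t16770601-3-defend4" type="defendantName">young <rs>Girl</rs><interp inst="t16770601-3-defend4" type="gender" value="female"/></persName> was tried.</p>
</div1>')

# xml_attr() reads an attribute of the node you are on.
trial_id <- xml_attr(trial, "id")

# Combine with a search to read attributes of nodes further down.
year <- xml_attr(xml_find_first(trial, './/interp[@type="year"]'), "value")

# xml_text() collects the text of the node and all nested tags.
person <- xml_find_first(trial, './/persName')
descrip <- xml_text(person, trim = TRUE)

# A missing attribute comes back as NA rather than raising an error.
missing_attr <- xml_attr(person, "nosuchattribute")
```

Note that year is returned as a character string; convert it with as.integer() if you need a number.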
# We already did this above: trialaccounts <- xmlimport %>% xml_find_all('//div1[@type="trialAccount"]')
# Start from NULL; bind_rows() ignores it and builds up the tibble row by row:
defendants <- NULL
for (i in seq_along(trialaccounts)) {
  # The trial id is part of the div1 tag, which is the node we are currently on, so just grab the attribute:
  trialid <- trialaccounts[[i]] %>% xml_attr("id")
  # There should be only one year so just find the first hit among the interp tags:
  year <- trialaccounts[[i]] %>% xml_find_first('.//interp[@type="year"]') %>% xml_attr("value")
  # Get the gender of all the defendants:
  genderdefendants <- trialaccounts[[i]] %>%
    xml_find_all('.//persName[@type="defendantName"]/interp[@type="gender"]') %>%
    xml_attr("value")
  # Get the text inside the persName tag which gives us the original description:
  descrip <- trialaccounts[[i]] %>%
    xml_find_all('.//persName[@type="defendantName"]') %>%
    xml_text(trim=TRUE)
  # Now we have all the information we need, so let us add the data to rows of a data frame.
  # A trial may have more than one defendant, so we need another loop.
  # descrip must be indexed with j too: passing the whole vector would recycle it
  # against the single gender value and duplicate rows for trials with several defendants.
  # Note that defendantid holds the trial's position i, so co-defendants share a value.
  for (j in seq_along(genderdefendants)) {
    defendants <- defendants %>%
      bind_rows(tibble(defendantid=i,trial_id=trialid,year_tried=year,description=descrip[j],gender=genderdefendants[j]))
  }
}
defendants
## # A tibble: 11 x 5
##    defendantid trial_id     year_tried description     gender
##          <int> <chr>        <chr>      <chr>           <chr>
##  1           1 t16770601-1  1677       young fellow    male
##  2           2 t16770601-2  1677       pickpocket      male
##  3           3 t16770601-3  1677       young Girl      female
##  4           4 t16770601-4  1677       Highway-man     male
##  5           5 t16770601-5  1677       Victualer       male
##  6           6 t16770601-6  1677       Midwife         female
##  7           6 t16770601-6  1677       aged poor women female
##  8           7 t16770601-7  1677       Gentleman       male
##  9           8 t16770601-8  1677       person          male
## 10           9 t16770601-9  1677       woman           female
## 11          10 t16770601-10 1677       young Man       male
The GitHub Repository for this handout and its files.
Konrad M. Lawson. @kmlawson
Creative Commons - Attribution CC BY, 2020.