Simple Network Visualisation with R

There are a number of user-friendly tools for visualizing networks that don’t require any programming knowledge, including Cytoscape and Gephi. The language R, however, has become a popular and powerful platform for data analysis, data cleaning, and the visualization of texts, networks, and geographical information. R benefits from a large ecosystem of open source packages, and in recent years a collection of them known as the Tidyverse has made the process of exploring data significantly easier. On the network analysis front, mature packages like statnet and igraph are joined by newer ones, including the pair tidygraph and ggraph, which make it fairly easy to visualize and explore networks in R.

Network science is an active and well developed field in the sciences and in social sciences such as sociology. In history, however, while use of and reference to “networks” is now very common in publications, the use of formal network analysis has been rather more limited. Learn more about work in this area at the website dedicated to Historical Network Research.

Effective use of formal network analysis depends on strong familiarity with the science and mathematics of graphs. However, there are many contexts in which visualising a historical network is useful without the more advanced techniques that tap the full analytical potential of network exploration. Illustrating historical research with a network graph diagram can help the reader better grasp the scope and connections of a group of individuals you may be discussing in your research. This illustrative value of social network visualisations depends in great part on the ability to craft visualisations that communicate well. Network visualisation is also a way to explore connections and patterns in your historical materials, especially as your collection of individuals and organisations (if you create a “bimodal” network, see below) grows beyond the scale at which you can easily derive patterns by browsing a table or spreadsheet. We might call this the heuristic value of social network visualisations. Such exploration may include some use of basic network analysis tools, but usually as a path to finding new questions to ask about your material, or as a way to cast the spotlight on possible patterns that you can explore in depth, perhaps returning to other sources and methods. This tutorial is primarily for students and scholars in the humanities who are interested in network visualisations for their illustrative and heuristic potential, but who may also want to gain some familiarity with and exposure to their analytical potential.

This tutorial was designed for history students in a masters level skills module at the University of St Andrews MLitt programme in Global, Transnational, and Spatial history to get a first taste of how R might be used to explore historical networks. In this exercise we will practice creating some simple network visualisations using a fictional network of East Asian gangsters and revolutionaries.

Prerequisites: My students working with this tutorial R Notebook have done a little bit of previous work with R and text analysis with material from Text Mining with R by Julia Silge & David Robinson and Text Analysis with R for Students of Literature by Matthew Jockers, have read some of A User’s Guide to Network Analysis in R on igraph, and have completed the DataCamp module Introduction to the Tidyverse. I would suggest trying this tutorial if you have had at least some basic introduction to R and familiarity with RStudio.

This tutorial is inspired by or adapts material by Jesse Sadler and Douglas A. Luke, among others (see the bibliography below). It was created as an “R Notebook”, which anyone with R installed can use directly by opening the file in RStudio. You can download the files used in this tutorial here in the GitHub repository. If you open this notebook in RStudio, you will see the code and can run all of it in one go with cmd/ctrl-option-r. Alternatively, you can run code from a single section using cmd/ctrl-shift-enter. Many of the questions below ask you to see what happens when you tweak some of the code found here.

For this exercise you need the packages:

readr dplyr tidyr stringr forcats tibble ggplot2 ggraph tidygraph igraph visNetwork scales

In RStudio, choose “Install Packages…” from the Tools menu, paste in the above list of packages (separated by spaces, as they are here), and press Install. After the packages are installed (you may have had some of them already), load them as follows:
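The loading chunk itself is not reproduced in this version of the text. A minimal sketch, assuming all twelve packages listed above are already installed (and that the factor-handling package in the list is forcats), is simply one library() call per package:

```r
# Attach each package named in the list above.
# library() stops with an error if a package is not installed.
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(tibble)
library(ggplot2)
library(ggraph)
library(tidygraph)
library(igraph)
library(visNetwork)
library(scales)
```

Several of these packages export functions with the same name, so R will print messages about masked objects when they are attached. These messages are expected and are not errors.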

Importing the Data

Now we need to get our data into our network.

The raw data that is used for network analysis and visualisation is usually in the form of edge and node tables. When visualised, these are the lines and points of a graph diagram. The nodes of your network are very often individuals, along with any attributes tied to those individuals. You can style the nodes in your network graphs using these attributes.

Even more important than the nodes is the table of edges, which contains the relational information of your network: the relationships between your agents, or between agents and organisations in the case of bimodal graphs (see below). These relationships may also have attributes that can be visualised with styling. John Scott’s introductory text Social Network Analysis has a nice chapter on considerations for collecting and organising your data for analysis and visualisation that you might want to consult.

Put the nodes.csv and edges.csv files that I have shared with you into the working directory where you should be keeping this notebook file (or set the working directory to the right place in the Session menu).

We will now load the nodes into a nodes data frame and the edges into an edges data frame. The head() command with 10 as a parameter will give you a peek at the contents of each file.
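The import chunk is not shown here, so the following is only a sketch of the step just described. It assumes the two files are comma-separated and sit in the working directory. The fallback branch is my own addition so that the sketch runs without the files: it uses a handful of rows copied from the printed tables, not the full data set.

```r
library(readr)
library(tibble)

if (file.exists("nodes.csv") && file.exists("edges.csv")) {
  # Normal case: read the two shared files from the working directory.
  nodes <- read_csv("nodes.csv")
  edges <- read_csv("edges.csv")
} else {
  # Fallback sample (assumption: only there to make the sketch runnable).
  nodes <- tribble(
    ~person,     ~location,  ~age, ~nationality, ~mentions, ~discuss, ~gender,
    "Tomohiko",  "Tokyo",    22,   "Japan",      14,        1,        "m",
    "Kyŏngmin",  "Seoul",    55,   "Korea",      12,        0,        "m",
    "Jiurong",   "Shanghai", 44,   "China",      3,         0,        "f",
    "Yoshinobu", "Tokyo",    66,   "Japan",      67,        1,        "m",
    "Hayun",     "Pusan",    22,   "Korea",      2,         0,        "m",
    "Minjae",    "Pusan",    30,   "Korea",      12,        0,        "m"
  )
  edges <- tribble(
    ~from,       ~to,        ~kind, ~intensity, ~year_start, ~year_end,
    "Hayun",     "Minjae",   3,     1,          1902,        1943,
    "Jiurong",   "Tomohiko", 3,     1,          1896,        1947,
    "Kyŏngmin",  "Jiurong",  3,     1,          1895,        1920,
    "Tomohiko",  "Jiurong",  3,     1,          1910,        1936,
    "Yoshinobu", "Tomohiko", 3,     2,          1898,        1915
  )
}

# Peek at the first ten rows of each table.
head(nodes, 10)
head(edges, 10)
```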

person     location  age  nationality  mentions  discuss  gender
Tomohiko   Tokyo     22   Japan        14        1        m
Kyŏngmin   Seoul     55   Korea        12        0        m
Jiurong    Shanghai  44   China        3         0        f
Sangok     Pusan     33   Korea        5         0        m
Yoshinobu  Tokyo     66   Japan        67        1        m
Wei        Qingdao   57   China        30        1        f
Songbae    Seoul     36   Korea        26        0        m
Minjun     Pusan     55   Korea        4         0        m
Hayun      Pusan     22   Korea        2         0        m
Minjae     Pusan     30   Korea        12        0        m

from       to        kind  intensity  year_start  year_end
Chŏngsu    Minjae    3     3          1907        1921
Hayun      Minjae    3     1          1902        1943
Jiurong    Tomohiko  3     1          1896        1947
Kyŏngmin   Jiurong   3     1          1895        1920
Minjae     Chŏngsu   3     3          1907        1921
Takamasa   Kei       3     2          1898        1934
Tomohiko   Jiurong   3     1          1910        1936
Wei        Guoran    3     3          1872        1920
Yoshinobu  Tomohiko  3     2          1898        1915
Chŏngsu    Yŏngsik   2     1          1919        1931

Notice that the nodes have age, nationality, location, and mentions (let us say this is the number of times they appear in some source or collection of sources). I have also added an arbitrary binary discuss column where I have manually flagged up a few important characters I might want to emphasise.

When preparing a collection of nodes and edges for network visualization it is usually best to have a column in the nodes table with unique id numbers that are used as a reference key to all other information about that agent. Then, in the edges table, you would see only the relevant id numbers, instead of the names. However, for this simple example, to increase the readability of the files as we learn the basics, I have chosen to use the given names of the fictional individuals (there are just one or two real given names that fit the description of individuals for this network to add to the fun for East Asian historians) without any special id column.

Creating a Simple Network

Let us create a network object from our nodes and edges:

This creates an igraph network object, a format that is easily understood by ggraph and most of its features. Later in this exercise we will convert this to a tidygraph tibble graph. For now, we can very easily create a simple graph diagram using the ggraph() command. It works in a very similar fashion to ggplot, which it extends. You tell it the network to use, assign a layout type, then add options. In this case we will simply add a geom_edge_link(), which will give us the edges, and a geom_node_point(), which will display points.
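The two steps described above (building the igraph object, then drawing it) can be sketched as follows. The sample rows are copied from the printed tables so that the block runs on its own; the object name network and the "kk" (Kamada-Kawai) layout are assumptions on my part.

```r
library(tibble)
library(igraph)
library(ggraph)

# A few rows copied from the printed tables, not the full data set.
nodes <- tribble(
  ~person,     ~location,  ~age, ~nationality, ~mentions, ~discuss, ~gender,
  "Tomohiko",  "Tokyo",    22,   "Japan",      14,        1,        "m",
  "Kyŏngmin",  "Seoul",    55,   "Korea",      12,        0,        "m",
  "Jiurong",   "Shanghai", 44,   "China",      3,         0,        "f",
  "Yoshinobu", "Tokyo",    66,   "Japan",      67,        1,        "m",
  "Hayun",     "Pusan",    22,   "Korea",      2,         0,        "m",
  "Minjae",    "Pusan",    30,   "Korea",      12,        0,        "m"
)
edges <- tribble(
  ~from,       ~to,        ~kind, ~intensity, ~year_start, ~year_end,
  "Hayun",     "Minjae",   3,     1,          1902,        1943,
  "Jiurong",   "Tomohiko", 3,     1,          1896,        1947,
  "Kyŏngmin",  "Jiurong",  3,     1,          1895,        1920,
  "Tomohiko",  "Jiurong",  3,     1,          1910,        1936,
  "Yoshinobu", "Tomohiko", 3,     2,          1898,        1915
)

# The first column of `vertices` (person) becomes the node name, and the
# first two columns of `edges` are read as the source and target names.
network <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

# Network, layout, then one layer for edges and one for nodes.
p <- ggraph(network, layout = "kk") +
  geom_edge_link() +
  geom_node_point()
p
```

Because the node table is passed as vertices, every remaining column (location, age, and so on) is stored as a node attribute that later layers can map to size or colour.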

This is very simple. We can see that it is placed on an x,y axis and looks like a kind of special ggplot diagram. There are lots of things we might want to do to improve this.

Adding Labels

Let us start by adding labels to the graph. Under geom_node_point() we will add a geom_node_label(). In its aesthetics we will map the label to the name column of our nodes, and we will also set the font face.

Then, back outside the aesthetics aes(), we will set the transparency level to 60% (alpha=0.6). This may seem like something we would put inside the aesthetic, but because we are giving it a specific value, and not mapping it to our data, it goes outside. This will allow us to see any edge lines and nodes behind the label.

We also add the repel = TRUE here to help with the formatting of the location of the labels.

Questions 1

Try the following questions below.

  1. What happens if you remove the repel=TRUE (remember to cut out the trailing comma too)?
  2. How would you show the age instead of the name?
  3. What would you do if you wanted to remove the nodes altogether and just use labels instead of nodes? Try this without transparency and removing the repel feature.

Adding a Theme

Both ggplot and ggraph can work with “themes” that store lots of custom settings that we can apply to our graphs. You can store a theme in a function that calls the theme and then add that theme function to any graph you call. You might, for example, create several to match different purposes. See the R for Data Science book, or ggplot2: Elegant Graphics for Data Analysis, or the DataCamp class Communicating with Data in R (Tidyverse), or just run ?theme for more on themes.

Let us create a theme to use for our graphs. It will make the background a light grey, extend the margins, and remove the axis text, ticks, and titles. It will also remove all the grid lines.

We need only call network_theme() at the end of our graphs to apply these settings.
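A theme function along the lines described above might look like the sketch below. The function name network_theme() comes from the text, but the particular grey and the margin sizes are my assumptions, not the original settings.

```r
library(ggplot2)
library(igraph)
library(ggraph)

# Sketch of a reusable theme; colour and margin values are assumptions.
network_theme <- function() {
  theme(
    plot.background  = element_rect(fill = "grey95", colour = NA),
    panel.background = element_rect(fill = "grey95", colour = NA),
    plot.margin      = margin(1, 1, 1, 1, unit = "cm"),
    axis.text        = element_blank(),
    axis.ticks       = element_blank(),
    axis.title       = element_blank(),
    panel.grid       = element_blank()
  )
}

# Usage on a tiny graph built from three of the edges printed earlier.
edges <- data.frame(
  from = c("Hayun", "Jiurong", "Kyŏngmin"),
  to   = c("Minjae", "Tomohiko", "Jiurong")
)
p <- ggraph(graph_from_data_frame(edges), layout = "kk") +
  geom_edge_link() +
  geom_node_point() +
  network_theme()
p
```

Wrapping theme() in a function, rather than storing its result in a variable, makes it easy to add arguments later (a base font size, say) without changing the plots that call it.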

Now in our next graph, notice the changes caused by our theme and no other changes:

Adding Labels and More

Now let us begin adding some more things to our graph. Using the labs() function, we can add a title and a caption on the source of the data at the bottom right. Also in labs() we can rename the legend titles. Notice I use the escaped newline character (\n) in one case to create a two line legend header. Notice that, in the case of the edge width, I had to use an “edge_” prefix before naming the legend header.

We’ll also make some other additions to our graph diagram. In the geom_edge_link() aesthetics, we will tell it to vary the width of the edge by the intensity column of our data. Then outside the aesthetics we will fix the color of the lines to a mid level grey.

We can control the scale of the width with the scale_edge_width() function, which sets the range to a minimum of 0.2 in width and a maximum of 2, scaling the numbers to something within that range.

For our nodes, our aes() now scales the size of the node by the number of mentions in the sources, and the color of the nodes according to the nationality.
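Putting the pieces from this section together gives something like the sketch below. The title, caption, and legend wording are placeholders of my own, and the data are a few rows copied from the printed tables.

```r
library(tibble)
library(igraph)
library(ggraph)

# A few rows copied from the printed tables, not the full data set.
nodes <- tribble(
  ~person,     ~nationality, ~mentions,
  "Tomohiko",  "Japan",      14,
  "Kyŏngmin",  "Korea",      12,
  "Jiurong",   "China",      3,
  "Yoshinobu", "Japan",      67,
  "Hayun",     "Korea",      2,
  "Minjae",    "Korea",      12
)
edges <- tribble(
  ~from,       ~to,        ~intensity,
  "Hayun",     "Minjae",   1,
  "Jiurong",   "Tomohiko", 1,
  "Kyŏngmin",  "Jiurong",  1,
  "Tomohiko",  "Jiurong",  1,
  "Yoshinobu", "Tomohiko", 2
)
network <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

p <- ggraph(network, layout = "kk") +
  # width is mapped to data (inside aes); colour is fixed (outside aes)
  geom_edge_link(aes(width = intensity), colour = "grey50") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_point(aes(size = mentions, colour = nationality)) +
  geom_node_label(aes(label = name), alpha = 0.6, repel = TRUE) +
  labs(
    title      = "A fictional network",          # placeholder wording
    caption    = "Source: fictional sample data", # placeholder wording
    size       = "Mentions\nin sources",          # \n gives a two line header
    colour     = "Nationality",
    edge_width = "Intensity"                      # note the edge_ prefix
  )
p
```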

Questions 2

  1. How would you change the code so that the transparency varied according to mentions?
  2. How would you vary the color by location rather than nationality?

Showing Direction with Arrows

We have a directed network, meaning that relationships between two individuals may go in only one direction or, sometimes, in both. You can add arrows to the geom_edge_link() as below. Notice I also switched from width to transparency to show varying intensity and, instead of showing nationality with node color, now show the gender.

Questions 3

  1. Change the code so that instead of varying the size of the node by the number of source mentions, it adjusts the size by the discuss column of the node table. This is a number 0 or 1.

  2. Scale the size of the points drawn by geom_node_point() from 2 to 8 by adding a scale_size().

Creating a Subgraph by Filtering on an Edge attribute

One of the variables we have that we haven’t used is the kind column in the edge data, which is a number from 1-3. What if we wanted to create a second diagram that only shows those relationships which are of kind 3?

The tidygraph library has a nice activate() method that allows you to manipulate nodes and edges or filter them in various ways. Instead of calling activate(edges) before manipulating edges, there is also a nice shortcut: %E>% in place of the usual pipe to work with your edges, or %N>% to work with your nodes. For this we need to take our igraph network and convert it to a tbl_graph with as_tbl_graph(), and then we can use the filter() command to find just the edges which have a kind==3. If we graphed this immediately, we would see the filtered edges, but also a number of isolated nodes no longer connected to the rest of the graph. We can activate the node layer and then filter out the isolated nodes with filter(!node_is_isolated()).
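The filtering steps just described can be sketched like this. The graph is built only from the ten edge rows printed earlier, so node attributes are left out; the object names are my own.

```r
library(tibble)
library(dplyr)
library(igraph)
library(tidygraph)

# The ten edge rows printed earlier in the tutorial.
edges <- tribble(
  ~from,       ~to,        ~kind, ~intensity, ~year_start, ~year_end,
  "Chŏngsu",   "Minjae",   3,     3,          1907,        1921,
  "Hayun",     "Minjae",   3,     1,          1902,        1943,
  "Jiurong",   "Tomohiko", 3,     1,          1896,        1947,
  "Kyŏngmin",  "Jiurong",  3,     1,          1895,        1920,
  "Minjae",    "Chŏngsu",  3,     3,          1907,        1921,
  "Takamasa",  "Kei",      3,     2,          1898,        1934,
  "Tomohiko",  "Jiurong",  3,     1,          1910,        1936,
  "Wei",       "Guoran",   3,     3,          1872,        1920,
  "Yoshinobu", "Tomohiko", 3,     2,          1898,        1915,
  "Chŏngsu",   "Yŏngsik",  2,     1,          1919,        1931
)
network <- graph_from_data_frame(edges, directed = TRUE)

kind3 <- network %>%
  as_tbl_graph() %E>%           # switch to the edge table
  filter(kind == 3) %N>%        # keep kind 3 edges, then switch to nodes
  filter(!node_is_isolated())   # drop nodes left without any edge

kind3
```

In this sample the only edge removed is the kind 2 tie between Chŏngsu and Yŏngsik, so Yŏngsik, who has no other tie here, is the one node dropped by the second filter.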

Questions 4

  1. Go back and change the filter to look only for edges of kind 2, then again for kind 1.
  2. Instead of filtering by kind, create a graph diagram of only the people based in Tokyo, or only the Koreans in the network.
  3. How would you create a subgraph showing only those whose relationship year_start was before 1890 and year_end after 1910? You can do this with two filter commands, or with a compound & statement.

Community Detection

Network scientists have developed a variety of algorithms to detect communities in a network. While the analytical value of this algorithmically derived grouping in the context of historical research may be limited, for larger networks it can help you identify clusters to explore. For more on this, read the chapter on “Subgroups” in the book A User’s Guide to Network Analysis in R. The tidygraph package inherits many of the community detection algorithms embedded in igraph and makes them available to us, including Edge-betweenness (group_edge_betweenness), Leading eigenvector (group_leading_eigen), Fast-greedy (group_fast_greedy), Louvain (group_louvain), Walktrap (group_walktrap), Label propagation (group_label_prop), Infomap (group_infomap), Spinglass (group_spinglass), and Optimal (group_optimal). Some community algorithms are designed to take into account direction or weight, while others ignore them. Below we try Walktrap, which is not, in fact, designed for directed networks, but try comparing its results with other community detection algorithms and note the differences.
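As a rough sketch of the Walktrap step, the block below stores each node's group in a new community column and colours the nodes by it. It uses only the ten edge rows printed earlier, so the groups it finds say nothing about the full network; the column and object names are my own.

```r
library(dplyr)
library(igraph)
library(tidygraph)
library(ggraph)

# The ten edge rows printed earlier, as from/to pairs.
edges <- data.frame(
  from = c("Chŏngsu", "Hayun", "Jiurong", "Kyŏngmin", "Minjae",
           "Takamasa", "Tomohiko", "Wei", "Yoshinobu", "Chŏngsu"),
  to   = c("Minjae", "Minjae", "Tomohiko", "Jiurong", "Chŏngsu",
           "Kei", "Jiurong", "Guoran", "Tomohiko", "Yŏngsik")
)
network <- graph_from_data_frame(edges, directed = TRUE)

communities <- network %>%
  as_tbl_graph() %N>%
  # group_walktrap() returns one integer group id per node;
  # as.factor() makes ggplot treat the ids as discrete categories.
  mutate(community = as.factor(group_walktrap()))

p <- ggraph(communities, layout = "kk") +
  geom_edge_link(colour = "grey50") +
  geom_node_point(aes(colour = community), size = 4) +
  geom_node_text(aes(label = name), repel = TRUE)
p
```

To compare algorithms, swap group_walktrap() for another of the group_ functions listed above and redraw.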

Questions 5

  1. Had you done so manually, would you have divided up the graph into “communities” along these lines? Which assignments by the algorithm look out of place to you?
  2. Try the other community detection algorithms and compare the results.

Bimodal Networks

Bimodal, bipartite, or affiliation networks have two different types of nodes and generally only link between the two types of nodes. As the term “affiliation network” suggests, this is often in the form of the affiliation of an individual to an organisation of some kind.

Let us import a list of edges between individuals and organisations.

From       To
Tomohiko   Toilers of the Great East
Jiurong    Green Crane Society
Minjun     Workers Alliance
Hyejin     East Wind
Yoshinobu  Kawakami-gumi
Wei        Great Harmony Society
Wei        Green Crane Society
Hyejin     Toilers of the Great East
Kyŏngmin   Toilers of the Great East
Sangok     East Wind

We now have a table with relationships between individuals and organisations, but it would be nice to create a merged node table which joins all the attribute information from organisations, which includes the location of the organisations’ headquarters, and all the attribute data for individuals. We can use full_join() for this.
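A sketch of the merge, under one assumption: that the organisation table keeps its names in a column called person, so that it lines up with the node table (the merged table suggests this, but the original organisation file is not shown). The rows are copied from the printed tables.

```r
library(tibble)
library(dplyr)

# Three individuals copied from the node table printed earlier.
nodes <- tribble(
  ~person,    ~location,  ~age, ~nationality, ~mentions, ~discuss, ~gender,
  "Tomohiko", "Tokyo",    22,   "Japan",      14,        1,        "m",
  "Kyŏngmin", "Seoul",    55,   "Korea",      12,        0,        "m",
  "Jiurong",  "Shanghai", 44,   "China",      3,         0,        "f"
)

# Organisations and their headquarters; the column name `person` is assumed.
orgs <- tribble(
  ~person,                     ~HQ,
  "Toilers of the Great East", "Pusan",
  "Green Crane Society",       "Beijing",
  "Workers Alliance",          "Seoul"
)

# full_join() keeps every row of both tables. No name appears in both,
# so individuals get NA for HQ and organisations get NA elsewhere.
all_nodes <- full_join(nodes, orgs, by = "person")
all_nodes
```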

person                     location   age  nationality  mentions  discuss  gender  HQ
Yŏngsu                     Tokyo      31   Korea        34        0        f       NA
Yōsuke                     Nagoya     26   Japan        10        0        m       NA
Kei                        Osaka      24   Japan        3         0        m       NA
Senjūrō                    Kagoshima  35   Japan        1         0        m       NA
Masahirō                   Kagoshima  41   Japan        1         0        m       NA
Takamasa                   Kōchi      45   Japan        4         0        m       NA
Michiō                     Niigata    37   Japan        1         0        m       NA
Kanno                      Osaka      32   Japan        44        1        f       NA
Fumiko                     Seoul      29   Japan        31        1        f       NA
Kikue                      Tokyo      40   Japan        14        0        f       NA
Zhen                       Yizheng    30   China        29        1        f       NA
Jongmyung                  Seoul      23   Korea        10        0        f       NA
Toilers of the Great East  NA         NA   NA           NA        NA       NA      Pusan
Green Crane Society        NA         NA   NA           NA        NA       NA      Beijing
Workers Alliance           NA         NA   NA           NA        NA       NA      Seoul
East Wind                  NA         NA   NA           NA        NA       NA      Shanghai
Kawakami-gumi              NA         NA   NA           NA        NA       NA      Tokyo
Iwaguchi-gumi              NA         NA   NA           NA        NA       NA      Kagoshima
Great Harmony Society      NA         NA   NA           NA        NA       NA      Beijing
Red Wave Association       NA         NA   NA           NA        NA       NA      Tokyo

Now we can create a network object from this merged information. In order to keep track of what nodes are part of each mode (individuals or organisations) we’ll add a type column to the node data that will get a TRUE value if it is one of the organisations.

Now we can create a graph diagram of our bimodal network. In the code, I have made a few customisations to our usual graphs above by setting the shape of the node to correspond to whether it is an individual or an organisation, choosing a circle (ggplot shape number 19) or a square (15). I increased the fig.width to make the chart wider, and used some conditionals in the form of ifelse() to distinguish the organisations by color and to assign labels only to individuals.
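The type column and the bimodal plot can be sketched as below. To keep the block short, the node table here is derived from the ten affiliation rows printed above rather than from the merged table, and the colour labels and point size are my own choices.

```r
library(tibble)
library(igraph)
library(ggraph)

# The ten affiliation rows printed above.
affiliations <- tribble(
  ~From,       ~To,
  "Tomohiko",  "Toilers of the Great East",
  "Jiurong",   "Green Crane Society",
  "Minjun",    "Workers Alliance",
  "Hyejin",    "East Wind",
  "Yoshinobu", "Kawakami-gumi",
  "Wei",       "Great Harmony Society",
  "Wei",       "Green Crane Society",
  "Hyejin",    "Toilers of the Great East",
  "Kyŏngmin",  "Toilers of the Great East",
  "Sangok",    "East Wind"
)

# type is TRUE for organisations and FALSE for individuals.
bi_nodes <- tibble(
  name = c(unique(affiliations$From), unique(affiliations$To)),
  type = name %in% affiliations$To
)
bi_network <- graph_from_data_frame(affiliations, vertices = bi_nodes,
                                    directed = FALSE)

p <- ggraph(bi_network, layout = "kk") +
  geom_edge_link(colour = "grey50") +
  geom_node_point(aes(shape = type,
                      colour = ifelse(type, "organisation", "individual")),
                  size = 4) +
  # circle (19) for individuals, square (15) for organisations
  scale_shape_manual(values = c(`FALSE` = 19, `TRUE` = 15)) +
  # label individuals only; NA labels are dropped with a warning
  geom_node_label(aes(label = ifelse(type, NA, name)), repel = TRUE) +
  labs(colour = "Mode", shape = "Organisation?")
p
```

igraph's bipartite functions look for a logical vertex attribute called type by default, which is why the column is given that name.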

Note: If you run this code in RStudio, note the difference between the appearance of the plots within RStudio and the exported web page version.

You can also use a special bipartite layout for the graph that produces a hierarchical look. Sometimes the tree layout will also produce a desirable effect as well.

Bimodal graphs are nice for visualising the connections between two different types of things. As Scott Weingart has argued in several web posts, including his overview of bimodal networks, they are significantly more difficult to analyse using formal network analysis methods, including the challenge of exploring various forms of centrality or clustering coefficients.

They are valuable, however, as a heuristic visualisation to explore your network and discover new questions, or areas to focus in on for more research. They can also serve more simple illustrative purposes when you are exploring a historical network in your narrative and want to illustrate visually relationships between individuals and organisations or some other combination of two modes even without formal analysis being carried out.

One transformation of your bimodal networks that can be particularly useful, especially for larger networks than the one we are dealing with here, is to explore connections between the nodes in one mode or the other by means of their connections to the other mode. In our historical example, we might explore what the connectivity is between organisations based on members who tie them together, or what connections there are between individuals by virtue of the fact that they share membership in an organisation. These are called projections of bimodal networks.

To create these projections we can use the igraph function bipartite.projection(). This will create a list with two projections, proj1 and proj2, one for each mode. Let us assign each one to its own network object and then plot them.
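A minimal sketch of the projection step, using the ten affiliation rows printed earlier. One caveat: current igraph releases spell the function bipartite_projection() and treat the dotted spelling used in the text as a deprecated alias, so the underscore form is used here. Object names are my own.

```r
library(tibble)
library(igraph)

# The ten affiliation rows printed earlier.
affiliations <- tribble(
  ~From,       ~To,
  "Tomohiko",  "Toilers of the Great East",
  "Jiurong",   "Green Crane Society",
  "Minjun",    "Workers Alliance",
  "Hyejin",    "East Wind",
  "Yoshinobu", "Kawakami-gumi",
  "Wei",       "Great Harmony Society",
  "Wei",       "Green Crane Society",
  "Hyejin",    "Toilers of the Great East",
  "Kyŏngmin",  "Toilers of the Great East",
  "Sangok",    "East Wind"
)
bi_nodes <- tibble(
  name = c(unique(affiliations$From), unique(affiliations$To)),
  type = name %in% affiliations$To   # TRUE for organisations
)
bi_network <- graph_from_data_frame(affiliations, vertices = bi_nodes,
                                    directed = FALSE)

# Uses the logical `type` vertex attribute to tell the two modes apart.
projections <- bipartite_projection(bi_network)

people_net <- projections$proj1  # nodes with type FALSE: individuals
org_net    <- projections$proj2  # nodes with type TRUE: organisations

# Each projected edge carries a `weight`: the number of shared neighbours.
E(people_net)$weight
E(org_net)$weight
```

In this ten-row sample every pair shares at most one organisation or member, so all weights are 1. The thicker lines mentioned below appear only where pairs overlap more than once in the full data.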

The lines here are thicker in the cases where members were more linked to each other by mutual membership in multiple organisations. In the second plot we see that four of the organisations each share two members. Not terribly revealing in this case, but with much larger networks, this may reveal interlocking organisations with overlapping memberships that might not be immediately obvious by perusing a table of membership data.

One Plot to Rule Them All

Bimodal networks include only connections between two different modes. But there is nothing preventing you from flattening a bimodal graph and including all the edges from our unimodal network. That is, you can create a visualisation, for illustrative or heuristic purposes, that depicts both relationships between individuals and between these individuals and the organisations. Please note that if formal analysis plays any role in your exploration of these networks, this is not methodologically sound for any number of reasons. Among the issues is that we are mixing a directed network (of individuals) with an undirected network (of affiliations).

To create our mega plot, we will merge the edge table with relationships between individuals and organisations using bind_rows(), with that of individuals to individuals. For simplicity, we will first assign an intensity of 1 and type 4 to all affiliation relationships, and leave all date info as NA. We’ll also standardise the naming of the columns as “From” and “To” are capitalised in one case and not in the other. mutate() makes it easy to rename the columns.
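The merge might be sketched as follows. Two things here are assumptions: the “type 4” mentioned above is stored in the existing kind column, and the columns are renamed with rename() rather than mutate(), which may differ from the original chunk. The rows are a small sample copied from the printed tables.

```r
library(tibble)
library(dplyr)

# Three person-to-person edges and two affiliations from the printed tables.
edges <- tribble(
  ~from,       ~to,        ~kind, ~intensity, ~year_start, ~year_end,
  "Jiurong",   "Tomohiko", 3,     1,          1896,        1947,
  "Kyŏngmin",  "Jiurong",  3,     1,          1895,        1920,
  "Yoshinobu", "Tomohiko", 3,     2,          1898,        1915
)
affiliations <- tribble(
  ~From,      ~To,
  "Tomohiko", "Toilers of the Great East",
  "Jiurong",  "Green Crane Society"
)

affiliation_edges <- affiliations %>%
  rename(from = From, to = To) %>%   # match the lower-case column names
  mutate(
    kind       = 4,                  # assumed home for "type 4"
    intensity  = 1,
    year_start = NA_real_,           # no date information for affiliations
    year_end   = NA_real_
  )

# Stack the two edge tables; columns are matched by name.
all_edges <- bind_rows(edges, affiliation_edges)
all_edges
```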

We can then visualise all the edges together, and use various visual features to help make the plot more readable, but anyone who has used software such as Cytoscape, for example, will see that it is much easier to customise the visualisation of multiple networks together there than here, as far as I have been able to determine. Especially if the aim is just to explore your data as a part of the research and thinking process, then Cytoscape is a much easier alternative to R and igraph/ggraph.

Now let us create a new network object with this merged edge table and our previously merged node table and plot the results:

This plot includes too much information to communicate its contents clearly at this size. If you plan on creating complex plots, I suggest you use ggsave() (see below) to export large versions of the graph after playing with the figure widths and heights.

Other Layouts

Up until now we have been mostly using the Kamada-Kawai layout algorithm to determine the look of our network. There is a range of other layouts you can use by changing the layout type.

Below see our graph with the Fruchterman-Reingold layout.

There is also a “circular” layout, which takes a bit more tweaking of the parameters and size to get it to fit well:

Questions 6 Playing with the Layouts

  1. Try changing the layout="" setting to each of the following layouts: sugiyama, star, dh, gem, graphopt, drl, and compare the results.
  2. Why did I add the fig.width=5 option in the declaration of the r code section for the circular layout? What happens if you remove it?
  3. Why did I hard code the font size, size=2.2, outside of the aes() for the geom_node_label()? What happens if you cut that out?
  4. What happens if you add another ggplot option (don’t forget the + on the end of the previous line!) with coord_cartesian(xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))?
  5. How could I colour the labels by the nationality of the nodes? By the location of the members?
  6. How could I set it so that the size of the labels changes according to the age of the members of the network?
  7. How could I limit the range of the size of the fonts from sizes 2 to 4?

Adding Some Network Analysis

Although we have been using the ggraph package to visualise our network, the graph itself is an igraph object and can take advantage of all the analytical tools in igraph:

Look how easy it is to use our trusty dplyr mutate() to add columns with the betweenness, closeness, and eigenvector centrality computed for our nodes, together with the total, in, and out degrees.
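As a sketch of this step, the block below computes each measure with igraph and attaches it to the node table. It relies on the node table being in the same order as the vertices of the graph, which holds when the graph was built from that table. The column names other than degree_all are my own, and the four-node sample is copied from the printed tables.

```r
library(tibble)
library(dplyr)
library(igraph)

# Four connected individuals and their ties, from the printed tables.
nodes <- tribble(
  ~person,     ~nationality, ~mentions,
  "Tomohiko",  "Japan",      14,
  "Kyŏngmin",  "Korea",      12,
  "Jiurong",   "China",      3,
  "Yoshinobu", "Japan",      67
)
edges <- tribble(
  ~from,       ~to,
  "Jiurong",   "Tomohiko",
  "Kyŏngmin",  "Jiurong",
  "Tomohiko",  "Jiurong",
  "Yoshinobu", "Tomohiko"
)
network <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)

# Each igraph function returns one value per vertex, in vertex order.
nodes_stats <- nodes %>%
  mutate(
    betweenness = betweenness(network),
    closeness   = closeness(network),
    eigen       = eigen_centrality(network)$vector,
    degree_all  = degree(network, mode = "all"),
    degree_in   = degree(network, mode = "in"),
    degree_out  = degree(network, mode = "out")
  )
nodes_stats
```

Note that eigen_centrality() ignores edge direction unless you pass directed = TRUE, so decide which reading suits your material before comparing it with the directed measures.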

Note: If your graph is in tidygraph you can also use the wide variety of centrality_ prefixed functions.

We can do a quick comparison of in, out and total degree of the nodes, which measures the outgoing and incoming relationships, or their total, minus any overlapping edges. Notice I used the fct_reorder() function from the forcats library to re-sort the names by their total degree (degree_all). Comment out that line to see what happens to the graph.

With this data we could easily plot the relationship between various kinds of centrality. Betweenness centrality is a measure of the degree to which a node is a gatekeeper to other nodes. How many of the shortest paths between nodes must pass through a given node? Eigenvector centrality tries to judge the importance of a node by the relative connectivity of its neighbors. Read more about it here. Let us compare the two in our own network:

How about the relationship between Eigenvector centrality and another measure, closeness centrality? Closeness centrality is a measure of how close a given node is to all the other nodes.

Now that we have all this information, we can also now redo our network graph using any of these measures. Let us get a network graph that incorporates all the new variables we had added to the node table:

For example, here is a graph diagram with the size of the node changed to indicate its betweenness.

Questions 7

  1. How would you change this to colour by location, but size by closeness centrality? Or eigenvector centrality?
  2. How would you create a ggplot that showed the relationship between betweenness centrality and the mentions in the sources?
  3. Challenge: How would you create a ggplot that visualized the comparison of the average betweenness of women in the network compared to men?
  4. Challenge: What steps would you need to go through to compare the total density (ratio of the number of the edges vs. possible edges) of the members of the network in each of the three nationalities? What about in each location? How could you plot this in a simple bar graph? You may have to do some exploring in the documentation for igraph or ggraph.

Creating an Interactive Network Graph with visNetwork

There are a number of ways to make your network graph interactive, especially in a website. These include using a Shiny app, D3.js and its R connector networkD3, or the R package visNetwork. See Jesse Sadler’s network tutorial for a comparison of D3.js and visNetwork, as well as a demonstration of how you can use networkD3 to create what is known as a Sankey diagram.

To convert our simple network to a visNetwork that will allow interaction, we’ll have to abandon our use of given names in place of id numbers as a key. If you have been using id numbers from the start (recommended) in your node and edge tables, you don’t need this step at all. To convert our tables, we’ll add an id number column to our nodes, and then replace all the given names in the edges table with their corresponding id number. First let us add an id column to the nodes and view the top ten rows of the resulting data frame:

id  person     location  age  nationality  mentions  discuss  gender
1   Tomohiko   Tokyo     22   Japan        14        1        m
2   Kyŏngmin   Seoul     55   Korea        12        0        m
3   Jiurong    Shanghai  44   China        3         0        f
4   Sangok     Pusan     33   Korea        5         0        m
5   Yoshinobu  Tokyo     66   Japan        67        1        m
6   Wei        Qingdao   57   China        30        1        f
7   Songbae    Seoul     36   Korea        26        0        m
8   Minjun     Pusan     55   Korea        4         0        m
9   Hayun      Pusan     22   Korea        2         0        m
10  Minjae     Pusan     30   Korea        12        0        m

Now let’s replace the names with the node id numbers in the from and to columns of the edge table using the match() function. For each row of our new from column, we ask it to supply us the node id for the row in which the name in the edges from column matches the name in the person column of the now id-equipped nodes_wids variable.
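Both conversion steps (adding the id column, then swapping names for ids) can be sketched as below. The names nodes_wids and edges_wids come from the text; the sample is the first ten nodes and those printed edges whose endpoints both fall among them, with other columns omitted for brevity.

```r
library(tibble)
library(dplyr)

# The first ten individuals, in the order printed in the node table.
nodes <- tibble(person = c("Tomohiko", "Kyŏngmin", "Jiurong", "Sangok",
                           "Yoshinobu", "Wei", "Songbae", "Minjun",
                           "Hayun", "Minjae"))
# Printed edges whose endpoints are both among those ten.
edges <- tribble(
  ~from,       ~to,
  "Hayun",     "Minjae",
  "Jiurong",   "Tomohiko",
  "Kyŏngmin",  "Jiurong",
  "Tomohiko",  "Jiurong",
  "Yoshinobu", "Tomohiko"
)

# Step 1: number the nodes and move the id column to the front.
nodes_wids <- nodes %>%
  mutate(id = row_number()) %>%
  select(id, everything())

# Step 2: match() gives, for each name, its row position in `person`;
# indexing `id` by that position returns the corresponding id number.
edges_wids <- edges %>%
  mutate(
    from = nodes_wids$id[match(from, nodes_wids$person)],
    to   = nodes_wids$id[match(to, nodes_wids$person)]
  )
edges_wids
```

A name in the edge table that is missing from the node table comes back as NA, so checking anyNA(edges_wids$from) and anyNA(edges_wids$to) is a quick way to catch spelling mismatches between the two files.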

from  to  kind  intensity  year_start  year_end
14    10  3     3          1907        1921
9     10  3     1          1902        1943
3     1   3     1          1896        1947
2     3   3     1          1895        1920
10    14  3     3          1907        1921
25    22  3     2          1898        1934
1     3   3     1          1910        1936
6     19  3     3          1872        1920
5     1   3     2          1898        1915
14    15  2     1          1919        1931

Now we can produce the visNetwork interactive plot with our new nodes_wids and edges_wids node and edge tables.

This is a very limited and boring graph, however. You can click on and manipulate the nodes but its physics allows for very limited moving of things around before they spring back into place. It is also missing almost everything else useful to communicate anything.

Now let us create a visNetwork object that communicates more information and that you can freely manipulate by clicking on nodes. It will also include navigation buttons for easy manipulation of zoom levels and panning. If we add columns to the data named for properties such as size (of the nodes), width (of the edges), and color (of the nodes), visNetwork will use them to style the graph. Zoom in on the graph and you will see that the labels fade in and out depending on your zoom level.
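A sketch of such a richer visNetwork object follows. The rows are copied from the id-keyed tables above; the colour palette, the size and width formulas, and the choice of visIgraphLayout() to fix node positions are all assumptions on my part, not the original settings.

```r
library(visNetwork)

# A few rows copied from the id-keyed tables printed above.
nodes_wids <- data.frame(
  id          = c(1, 2, 3, 5),
  person      = c("Tomohiko", "Kyŏngmin", "Jiurong", "Yoshinobu"),
  nationality = c("Japan", "Korea", "China", "Japan"),
  mentions    = c(14, 12, 3, 67)
)
edges_wids <- data.frame(
  from      = c(3, 2, 1, 5),
  to        = c(1, 3, 3, 1),
  intensity = c(1, 1, 1, 2)
)

# Hypothetical colour for each nationality.
nat_colours <- c(Japan = "tomato", Korea = "steelblue", China = "goldenrod")

# visNetwork reads specially named columns: label, title (tooltip),
# size and color for nodes; width and arrows for edges.
vis_nodes <- transform(
  nodes_wids,
  label = person,
  title = paste0(person, " (", nationality, ")"),
  size  = 10 + 3 * sqrt(mentions),
  color = unname(nat_colours[nationality])
)
vis_edges <- transform(edges_wids, width = 2 * intensity, arrows = "to")

vis <- visNetwork(vis_nodes, vis_edges)
vis <- visNodes(vis, shape = "dot")   # "dot" honours the size column
vis <- visIgraphLayout(vis)           # fixed igraph layout, physics off
vis <- visInteraction(vis, navigationButtons = TRUE)
vis
```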

There are many more ways to customise the visNetwork options. For more on this, see the documentation for the visNetwork package.

Saving Your Plots

There are a number of ways of extracting the plots you produce. One convenient way is the use of the ggsave() command in the ggplot2 package.

This is not only useful for you to embed any of the graphs seen here in a separate document but gives you the ability to create a crystal clear SVG version that is unpixelated at any zoom, or save a PNG version, for example, at a size much larger than those shown here, so that there is less chance of nodes overlapping.

The following ggsave() command, for example, will save the last plot you have made to the disk as plot.svg which is zoom independent in its resolution, and a second version saved as a png file but with a fixed size:
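A sketch of the two calls is given below. The file name plot.svg comes from the text; plot.png and its dimensions are assumptions. SVG output relies on the svglite package, which is not in the package list above, so the sketch skips that file if svglite is missing.

```r
library(ggplot2)
library(igraph)
library(ggraph)

# A small plot to save, built from three of the edges printed earlier.
edges <- data.frame(
  from = c("Hayun", "Jiurong", "Kyŏngmin"),
  to   = c("Minjae", "Tomohiko", "Jiurong")
)
p <- ggraph(graph_from_data_frame(edges), layout = "kk") +
  geom_edge_link() +
  geom_node_point()

# Vector version: stays sharp at any zoom level. Leaving out `plot`
# would save the most recently displayed plot instead.
if (requireNamespace("svglite", quietly = TRUE)) {
  ggsave("plot.svg", plot = p)
}

# Raster version at a fixed, larger size (in inches) and resolution.
ggsave("plot.png", plot = p, width = 12, height = 8, dpi = 300)
```

When width and height are omitted, ggsave() uses the size of the current graphics device and prints a message reporting the dimensions it chose.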

## Saving 7 x 5 in image

This should give you a good start at creating network graph diagrams using R. See some of these resources for more:

Further Reading

Books

Luke, Douglas A. A User’s Guide to Network Analysis in R. Cham: Springer, 2015. - Much of the code here is adapted from examples in this volume. Uses statnet and igraph but also shows how to convert between them. Unfortunately, all plots are with base R plots, rather than ggplot.

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. New York, NY: Springer, 2016. - ggraph is built on top of ggplot and many of the customisations to graphs benefit from understanding how ggplot works.

Scott, John. Social Network Analysis. 4th ed., 2017. - A great introductory text on the topic for humanities students.

Scott, John, and Peter J. Carrington, eds. The SAGE Handbook of Social Network Analysis. London ; Thousand Oaks, Calif: SAGE, 2011.

Wasserman, Stanley, and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994. - The classic text in the field, which delves into the mathematical foundations of graph theory.

Online Tutorials and Resources

This R Notebook was written with the help of various books and tutorials mentioned above, but mostly thanks to 40-60 google searches, with the answers found generally on the websites above, Stack Overflow, and obscure online bulletin boards.