Throughout the tutorial, a basic knowledge of R and network analysis is assumed

The main focus of this tutorial is empirical analysis of networks and skips a lot of additional functionality of `igraph`

. If you are interested in, e.g. creating random networks, consult the literature at the end of this document.

After this tutorial, you should have a basic understanding on how to import network data into R and perform first analyses. These include computing descriptive statistics, centrality scores and clustering.

To run all the code in this tutorial, you need to install and load two packages.

`install.packages(c("igraph","netrankr"))`

`igraph`

is used for the majority of analytic tasks and for its data structures. `netrankr`

is a package for network centrality. Its main functionality is explained in another tutorial. Here, we only need some indices that are implemented.

```
library(igraph)
library(netrankr)
```

`igraph`

can deal with many different foreign network formats with the function `read_graph`

. The `rgexf`

package can be used to import Gephi files.

```
read_graph(file, format = c("edgelist", "pajek", "ncol", "lgl",
"graphml", "dimacs", "graphdb", "gml", "dl"), ...)
```

I personally have encountered some issue when importing `graphml`

files that were produced in visone. Also `dl`

files sometimes through an error.

If your network data is in one of the above formats you will find it easy to import your network. Below, we read in the Game of Thrones interaction network from the first season, which is stored as a `graphml`

file.

```
gotS1 <- read_graph("data/GoT/gotS1.graphml",format="graphml")
gotS1
```

```
## IGRAPH 833b0c8 UNW- 127 550 --
## + attr: title (g/c), name (v/c), clu (v/c), size (v/n), id (v/c),
## | Season (e/n), weight (e/n)
## + edges from 833b0c8 (vertex names):
## [1] Ned --Robert Daenerys --Jorah
## [3] Jon --Sam Ned --Littlefinger
## [5] Ned --Varys Daenerys --Drogo
## [7] Ned --Arya Catelyn --Robb
## [9] Bronn --Tyrion Ned --Cersei
## [11] Cersei --Robert Littlefinger--Varys
## [13] Shae --Tyrion Ned --Catelyn
## + ... omitted several edges
```

The output shows a summary of the network. The first line contains the number of nodes (127) and the number of edges (550). These can also be obtained via `vcount(gotS1)`

and `ecount(gotS1)`

.

The following rows list the graph, vertex, and edge attributes of the graph. The attributes can be distinguished by the string in parentheses. `(g/c)`

are character graph attributes, and `(v/n)`

/`(v/c)`

numeric/character vertex attributes. The same holds for edges (`(e/n)`

and `(e/c)`

). The remaining rows show an excerpt from the edges.

One of the advantages of using R over other software tools is that you can easily work with complete collections of networks. The below code reads all Game of Thrones networks at once and stores the network in a list.

```
file_names <- c("data/GoT/gotS1.graphml","data/GoT/gotS2.graphml",
"data/GoT/gotS2.graphml","data/GoT/gotS4.graphml",
"data/GoT/gotS3.graphml","data/GoT/gotS6.graphml",
"data/GoT/gotS7.graphml")
gotS1to7 <- lapply(file_names,read_graph,format = "graphml")
```

If your network data is not in a network file format, you will need one of the following functions to turn raw network data into an `igraph`

object: `graph_from_edgelist()`

, `graph_from_adjacency_matrix()`

, `graph_from_adj_list()`

and `graph_from_data_frame()`

.

Before using these functions, however, you still need to get the raw data into R. The procedure depends on the file format. If your data is stored as an excel spreadsheet, you need additional packages. If you are familiar with the `tidyverse`

, you can use the `readxl`

package. Other options are, e.g. the `xlsx`

package. However, I recommend to refrain from using excel spreadsheets, especially when your network is large.

Most network data you’ll find is in a plain text format (csv or tsv), either as an edgelist or adjacency matrix. To read in such data, you can use `read.table()`

.1 This is the base R solution. Further options are given at the end of this section.

Make sure you check the following before trying to load a file: Does it contain a header (e.g. row/column names of an adjacency matrix)? How are values delimited (comma, whitespace or tab)?

The data repository for the Game of Thrones networks contains two adjacency matrices of the first season. The file `gotS1_matrix.dat`

has a header (and row names) and values are separated by a single space. The file `gotS1_matrix1.dat`

on the other hand has no header (and no row names) and values are separated by a comma.

Knowing this, you can read the files as follows.

```
#file with header, rownames and separated by a space
A1 <- read.table(file = "data/GoT/gotS1_matrix.dat",
sep = " ",header = TRUE,row.names = 1)
#file without header and rownames, and comma separated
A2 <- read.table("data/GoT/gotS1_matrix1.dat",
header = FALSE,sep=",")
```

Note that `read.table()`

reads the data as a data frame, so we need to convert the result into a matrix.

```
A1 <- as.matrix(A1)
A2 <- as.matrix(A2)
A1[1:5,1:5]
```

```
## Ned Daenerys Jon Littlefinger Arya
## Ned 0 3 29 107 90
## Daenerys 3 0 0 0 0
## Jon 29 0 0 0 20
## Littlefinger 107 0 0 0 5
## Arya 90 0 20 5 0
```

`A2[1:5,1:5]`

```
## V1 V2 V3 V4 V5
## [1,] 0 3 29 107 90
## [2,] 3 0 0 0 0
## [3,] 29 0 0 0 20
## [4,] 107 0 0 0 5
## [5,] 90 0 20 5 0
```

The procedure is very similar if your data is an edgelist. We thus skip this here.

The data is now ready to be converted into an igraph object. The networks are both undirected and weighted, so we need to adjust the corresponding parameters.

```
gA1 <- graph_from_adjacency_matrix(A1,mode = "undirected",weighted = TRUE)
gA2 <- graph_from_adjacency_matrix(A2,mode = "undirected",weighted = TRUE)
gA1
```

```
## IGRAPH 3dda64e UNW- 127 550 --
## + attr: name (v/c), weight (e/n)
## + edges from 3dda64e (vertex names):
## [1] Ned--Daenerys Ned--Jon Ned--Littlefinger
## [4] Ned--Arya Ned--Catelyn Ned--Cersei
## [7] Ned--Joffrey Ned--Sansa Ned--Tyrion
## [10] Ned--Jeor Ned--Robb Ned--Bran
## [13] Ned--Jaime Ned--Loras Ned--Jory.Cassel
## [16] Ned--Jorah Ned--Ros Ned--Benjen
## [19] Ned--Greatjon.Umber Ned--Pyp Ned--Renly
## [22] Ned--Barristan Ned--Hound Ned--Theon
## + ... omitted several edges
```

`gA2`

```
## IGRAPH 84322f9 UNW- 127 550 --
## + attr: name (v/c), weight (e/n)
## + edges from 84322f9 (vertex names):
## [1] V1--V2 V1--V3 V1--V4 V1--V5 V1--V6 V1--V8 V1--V10
## [8] V1--V11 V1--V12 V1--V13 V1--V14 V1--V15 V1--V16 V1--V17
## [15] V1--V18 V1--V19 V1--V20 V1--V22 V1--V23 V1--V27 V1--V28
## [22] V1--V30 V1--V35 V1--V36 V1--V37 V1--V38 V1--V39 V1--V41
## [29] V1--V43 V1--V44 V1--V46 V1--V48 V1--V49 V1--V51 V1--V52
## [36] V1--V53 V1--V56 V1--V64 V1--V65 V1--V66 V1--V67 V1--V76
## [43] V1--V77 V1--V85 V1--V90 V1--V102 V1--V105 V1--V106 V1--V107
## [50] V1--V109 V1--V114 V1--V116 V1--V117 V1--V118 V1--V125 V1--V126
## + ... omitted several edges
```

Note that `gA1`

and `gA2`

are basically the same network, only that the second one is missing the real name attribute.

Speaking of attributes: both functions `graph_from_edgelist()`

and `graph_from_adjacency_matrix()`

come with a drawback. You can’t add node or edge attributes automatically.

In general, it is always a good idea to organize network data in two different data frames. One for the nodes (with attributes) and one for the edges (edgelist+attributes). This will save you a lot of pain when trying to convert the data into an igraph object.

If you are wondering what this network represents: Nodes are characters from Grey’s Anatomy and links indicate who slept with whom…

```
edges <- read.csv("data/greys-edges.csv", header = FALSE,stringsAsFactors = FALSE)
nodes <- read.csv("data/greys-nodes.csv", header = TRUE,stringsAsFactors = FALSE)
ga <- graph_from_data_frame(d = edges,directed = FALSE,vertices = nodes)
ga
```

```
## IGRAPH 7ef648a UN-- 54 57 --
## + attr: name (v/c), sex (v/c), race (v/c), birthyear (v/n),
## | position (v/c), season (v/n), sign (v/c)
## + edges from 7ef648a (vertex names):
## [1] Arizona Robbins--Leah Murphy Alex Karev --Leah Murphy
## [3] Arizona Robbins--Lauren Boswell Arizona Robbins--Callie Torres
## [5] Erica Hahn --Callie Torres Alex Karev --Callie Torres
## [7] Mark Sloan --Callie Torres George O'Malley--Callie Torres
## [9] Izzie Stevens --George O'Malley Meredith Grey --George O'Malley
## [11] Denny Duqutte --Izzie Stevens Izzie Stevens --Alex Karev
## [13] Derek Sheperd --Meredith Grey Preston Burke --Cristina Yang
## + ... omitted several edges
```

All code in this section is sufficient if your networks only have several hundreds of nodes. But at one point, `read.table()`

may become a bottleneck. Consider switching to `readr::read_csv()`

, `data.table::fread()`

or `vroom::vroom()`

instead.

Once you have properly read in your network you can start by looking at somee very basic network statistics.

Node and edge counts can be computed as follows There appears to be quite a lot of action in Grey’s Anatomy

`vcount(ga)`

`## [1] 54`

`ecount(ga)`

`## [1] 57`

To do this for a list of networks, use the `sapply()`

function.

`sapply(gotS1to7,vcount)`

`## [1] 127 129 129 172 124 143 81`

`sapply(gotS1to7,ecount)`

`## [1] 550 486 486 667 504 542 412`

The density and diameter of networks are calculated with `graph.density()`

and `diameter()`

`graph.density(ga)`

`## [1] 0.03983229`

`diameter(ga)`

`## [1] 8`

`sapply(gotS1to7,graph.density)`

```
## [1] 0.06874141 0.05886628 0.05886628 0.04535564 0.06608969 0.05338324
## [7] 0.12716049
```

`sapply(gotS1to7,diameter,weights = NA) #remember that the networks are weighted`

`## [1] 6 6 6 8 7 7 5`

The degree distribution can be visualised with the `hist()`

function.

```
hist(degree(ga), col="firebrick", xlab="Vertex Degree",
ylab="Frequency", main="")
```

The neighbors of a specific node can be extracted with the `ego()`

function. Below, we are looking for all characters that “were intimate” with Alex Karev.

```
karev <- which(V(ga)$name=="Alex Karev")
ego(ga,order = 1,nodes = karev, mode = "all",mindist = 1)
```

```
## [[1]]
## + 11/54 vertices, named, from 7ef648a:
## [1] Addison Montgomery Rebecca Pope Izzie Stevens
## [4] Lexie Grey Lucy Fields Dana Seabury
## [7] Nurse Olivia Callie Torres Heather Brooks
## [10] Jo Wilson Leah Murphy
```

The clustering coefficient of a network can be computed with the `transitivity()`

function.

Why is the clustering coefficient of Grey’s Anatomy zero?

`transitivity(ga)`

`## [1] 0`

`sapply(gotS1to7,transitivity)`

`## [1] 0.3818514 0.4157044 0.4157044 0.4355249 0.4810657 0.4632628 0.4669358`

In a directed network, there are 16 possible triads. The *triad census* of a network gives the number of occurrences of each triad. Triads are labelled xyzL where x is the number of reciprocated ties, y is the number of unreciprocated ties and z is the number of null ties. The L term is a letter (U,C,D or T) which differentiates between different triads in which these numbers are the same.

If you were at Sunbelt 2019, you may have learned how the triad census shows that social licking among cows is similar to legislators sponsoring bills.

One of the many applications of the triad census is to compare a set of networks. In this example, we are tackling the question of “how transitive is football?” and asses structural differences among a set of 112 football leagues. If you played football as a kid, you for sure have heard the sentence: “We won against B and B won against C so naturally we will win against C”. Of course reality is not that simple.

RDS files store R objects directly without needing to read from csv files. The data repository contains `football.RDS`

which we can read with `readRDS()`

.

`fb_graphs <- readRDS("data/football.RDS")`

`fb_graphs`

is a list which contains networks of 112 football leagues as igraph objects. A directed link between team A and B indicates that A won a match against B. Note that there can also be an edge from B to A, since most leagues play a double round robin. For the sake of simplicity, all draws were deleted so that there could also be null ties between two teams if both games ended in a draw.

Below, we calculate the triad census for all network at once using `lapply()`

. The function returns the triad census for each network as a list, which we turn into a matrix in the second step. Afterwards, we manually add the row and column names of the matrix. The name of the league is stored as a graph attribute and we gather all names with `sapply()`

.

```
fb_triads <- lapply(fb_graphs,triad_census)
fb_triads <- matrix(unlist(fb_triads),ncol=16,byrow = T)
rownames(fb_triads) <- sapply(fb_graphs,function(x) x$name)
colnames(fb_triads) <- c("003","012","102","021D","021U","021C","111D","111U",
"030T","030C","201","120D","120U","120C","210","300")
#normalize to make proportions comparable across leagues
fb_triads_norm <- fb_triads/rowSums(fb_triads)
#check the Top 5 leagues
idx <- which(rownames(fb_triads)%in%c("england","spain","germany",
"italy","france"))
fb_triads[idx,]
```

```
## 003 012 102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C
## england 2 10 0 58 31 40 34 44 338 29 19 118 129 143
## france 1 23 5 30 33 44 48 40 332 41 16 132 108 160
## germany 0 21 6 27 19 49 38 46 165 16 23 77 79 117
## italy 1 4 2 35 43 30 30 22 419 38 5 164 116 118
## spain 0 8 4 27 42 45 32 35 364 43 11 126 105 148
## 210 300
## england 131 14
## france 114 13
## germany 120 13
## italy 99 14
## spain 130 20
```

The figure below shows the fraction of transitive triads per league (Big five European leagues in red).

It would be interesting to compare these values to other sports. On average, 28% of all triads are transitive. The most transitive league is Macao with more than 60%. From the top European leagues, the German Bundesliga is the least transitive (and thus the most competitive?). The most transitive league is the Italian Serie A (and thus the least competitive?).

In empirical studies, we are not necessarily only interested in transitive triads, but rather how the triad census profiles compare across networks. Katherine Faust suggests to use correspondence analysis (CA) for this purpose. There are several possibilities to do a CA in R. We here use the packages `FactoMineR`

and `factoextra`

. You may also need to install `ggrepel`

to use `repel = TRUE`

in `fviz_ca_row()`

to reduce label overlaps. The function `CA()`

from `FactoMineR`

is used to do the CA and `fviz_ca_row()`

from `factoextra`

to visualize the first two dimensions. The parameter `ncp`

controls the number of dimensions to extract. We here follow Katherine Faust and use four.

```
#install.packages(c("FactoMineR","factoextra"))
library(FactoMineR)
library(factoextra)
ca_triads <- CA(fb_triads,ncp = 4,graph = FALSE)
fviz_ca_row(ca_triads,repel = TRUE)
```

Unfortunately we have a huge overplotting problem. It may in this case be better to not plot the labels.

`fviz_ca_row(ca_triads,repel = TRUE,geom = "point")`

How to interpret the dimensions? To investigate this question, we take a closer look at the first two dimensions of the CA solution and compare it to some network descriptives. For the sake of brevity, we here only look at the density and proportion of 030T ties. In principle, any node/dyad/triad level statistic could be used.

```
plot(
x = sapply(fb_graphs,graph.density),
y = ca_triads$row$coord[,"Dim 1"],
xlab = "density",ylab = "dimension 1"
)
```

The correlation is -0.96. We thus have a strong association between the first dimension of the CA and the density of the networks.

```
plot(
x = fb_triads_norm[,"030T"],
y = ca_triads$row$coord[,"Dim 2"],
xlab = "proportion of 030T",ylab = "dimension 2"
)
```

In this case, the correlation is -0.83 and we have a strong association between the second dimension of the CA and the proportion of transitive triads.

For a more extensive treatment of the triad census, see this paper by Katherine Faust.

Besides the triad census, there is also the dyad census, which counts the number of mutual asymmetric and null ties. The code below computes the dyad census for all leagues and prints the result for the big five European leagues.

```
fb_dyads <- lapply(fb_graphs,dyad_census)
fb_dyads <- matrix(unlist(fb_dyads),ncol=3,byrow = T)
rownames(fb_dyads) <- sapply(fb_graphs,function(x) x$name)
colnames(fb_dyads) <- c("mutual","asym","null")
fb_dyads[idx,]
```

```
## mutual asym null
## england 45 131 14
## france 44 131 15
## germany 43 94 16
## italy 39 141 10
## spain 44 134 12
```

If you want more indices, check out the `centiserve`

and `CINNA`

package. However, the standard indices are usually more than enough.

`igraph`

implements all of the commonly used centrality indices:

- degree (
`degree()`

) - weighted degree (
`graph.strength()`

) - betweenness (
`betweenness()`

) - closeness (
`closeness()`

) - eigenvector (
`eigen_centrality()`

) - alpha centrality (
`alpha_centrality()`

) - power centrality (
`power_centrality()`

) - PageRank (
`page_rank()`

) - eccentricity (
`eccentricity()`

) - hubs and authorities (
`authority_score()`

and`hub_score()`

) - subgraph centrality (
`subgraph_centrality()`

)

Although using these functions is fairly straightforward, there are several potential pitfalls to be aware of. The functions `eigen_centrality`

, `page_rank()`

, `subgraph_centrality()`

, `authority_score()`

and `hub_score()`

always return the centrality scores for all vertices. For the others, you can control for which vertices the scores should be calculated. The parameter names are not consistent though. Check the help if it is either `v`

or `vids`

.

If a graph has an edge attribute `weight`

, then most centrality functions will use the weights by default. An exception is `degree()`

, since its weighted version is `graph.strength()`

.

If the graph is directed, then the `mode`

attribute controls which version to compute. Possible values are “in”, “out” and “all”. However, some functions have a logical parameter `directed`

that controls if directed or undirected scores should be calculated.

Given these potential pitfalls, it is always a good idea to check the help before applying any index.

The below example shows the difference between weighted and unweighted betweenness.

`rbind(betweenness(gotS1)[1:10], betweenness(gotS1,weights = NA)[1:10])`

```
## Ned Daenerys Jon Littlefinger Arya Catelyn Bronn
## [1,] 1599.531 294.8472 209.5429 209.6917 325.0375 778.6164 75.56667
## [2,] 2364.557 798.4258 672.8643 123.3748 516.7144 1041.7014 23.41364
## Cersei Shae Joffrey
## [1,] 543.6563 0 345.0673
## [2,] 138.9682 0 224.6781
```

Similar to descriptive statistics, we can also compute centrality scores for a network collection stored in a list. If you want to compare centrality scores across networks, you need to normalize the scores. Most indices have a built-in normalization parameter.

```
wdeg_GoT <- lapply(gotS1to7,graph.strength)
wdeg_GoT_norm <- lapply(wdeg_GoT,function(x) x/sum(x))
```

lapply(wdeg_GoT_norm,function(x) sort(x,decreasing = TRUE)[1:5])

betw_GoT <- lapply(gotS1to7,betweenness,normalize=TRUE) lapply(betw_GoT,function(x) sort(x,decreasing = TRUE)[1:5])

sort(eigen_centrality(ga)$vector)

This example should make you aware of some general issues of centrality indices and make you think about the question “how do I choose an appropriate index”?

The data repository contains two small networks that are used in this example.

```
g1 <- readRDS("data/cent_example_1.rds")
g2 <- readRDS("data/cent_example_2.rds")
```

Both networks have the same number of nodes (11) and a similar density (0.31 and 0.35).

The networks are undirected and unweighted. We start by calculating all scores of indices that are applicable for such networks.

```
cent1_df <- data.frame(
vertex = 1:11,
dc = degree(g1),
bc = betweenness(g1),
cc = closeness(g1),
ec = eigen_centrality(g1)$vector,
sc = subgraph_centrality(g1))
cent2_df <- data.frame(
vertex = 1:11,
dc = degree(g2),
bc = betweenness(g2),
cc = closeness(g2),
ec = eigen_centrality(g2)$vector,
sc = subgraph_centrality(g2))
```

The table below shows the scores for the first network.

vertex | dc | bc | cc | ec | sc |
---|---|---|---|---|---|

1 | 1 | 0 | 0.037 | 0.226 | 1.8251 |

2 | 1 | 0 | 0.0294 | 0.0646 | 1.5954 |

3 | 2 | 0 | 0.04 | 0.3786 | 3.1486 |

4 | 2 | 9 | 0.04 | 0.2415 | 2.4231 |

5 | 3 | 3.8333 | 0.05 | 0.5709 | 4.3871 |

6 | 4 | 9.8333 | 0.0588 |
0.9847 | 7.8073 |

7 | 4 | 2.6667 | 0.0526 | 1 |
7.9394 |

8 | 4 | 16.3333 |
0.0556 | 0.8386 | 6.6728 |

9 | 4 | 7.3333 | 0.0556 | 0.9114 | 7.0327 |

10 | 4 | 1.3333 | 0.0526 | 0.9986 | 8.2421 |

11 | 5 |
14.6667 | 0.0556 | 0.845 | 7.3896 |

Notice how every index considers a different vertex to be the most central one (in bold). Let’s check out the second network.

vertex | dc | bc | cc | ec | sc |
---|---|---|---|---|---|

1 | 3 | 0 | 0.0588 | 0.557 | 10.337 |

2 | 3 | 0 | 0.0588 | 0.557 | 10.337 |

3 | 3 | 0 | 0.0588 | 0.557 | 10.337 |

4 | 5 | 1 | 0.0667 | 0.751 | 18.0023 |

5 | 2 | 0 | 0.0556 | 0.3986 | 5.9211 |

6 | 2 | 0 | 0.0556 | 0.3986 | 5.9211 |

7 | 7 | 5.5 | 0.0769 | 0.8898 | 24.7149 |

8 | 1 | 0 | 0.0526 | 0.2109 | 2.5587 |

9 | 1 | 0 | 0.0526 | 0.2109 | 2.5587 |

10 | 1 | 0 | 0.0526 | 0.2109 | 2.5587 |

11 | 10 |
29.5 |
0.1 |
1 |
31.159 |

Here, all indices agree on the most central node! If you look more closely, you will even see that they all induce the same node ranking.

You may be wondering, why we are only looking at the ranking and not the actual values. Effectively, the values themselves don’t have any meaning. There is no such thing as a “unit of centrality”, if we look at it from a measurement perspective. For instance, we can’t say that a node is “twice as between” as another if its betweenness value is twice as high. Centrality should thus not be considered to be on an interval scale, but rather an ordinal one. This might seem like a restriction at first, but we will see later on2 this is part of another tutorial that it facilitates many theoretical examinations.

The two networks illustrate the big problem of choice. We have only tried 5 different indices, so we actually can’t make any conclusive statements about central nodes. After all, 5 indices can in the best case produce 5 completely different rankings. But theoretically, there are 11!=39,916,800 possibilities to rank the nodes of the network without allowing ties, which indices actually do. So, what if we missed hundreds of thousands of potential indices that would rank, say, node nine on top for network 1? What if those 5 indices are exactly the ones that rank node eleven on top for network 2, but no other index does that? In the next example, we add some (made up) empirical context to illustrate the problem of how to validate the appropriateness of chosen indices.

Centrality indices are commonly used as an explanatory variable for some observed phenomenon or node attribute in a network. Let’s say we have the following abstract research question. Given a network where each node is equipped with a binary attribute, which could signify the presence or absence of some property. Can a centrality index explain the presence of this attribute?

The data repository contains a network that is equipped with such a binary attribute. Since the network is rather large, there is also an RDS file which contains the precomputed centrality scores.

```
g3 <- readRDS("data/cent_example_3.rds")
cent3_df <- readRDS("data/cent_example3_scores.rds")
head(cent3_df)
```

```
## attr degree betweenness closeness eigen subgraph infocent
## 1 0 1 0.0000 9.448224e-05 9.356028e-03 830.570056 0.5634474
## 2 0 1 0.0000 9.559316e-05 6.160046e-03 364.526673 0.5843892
## 3 0 1 0.0000 7.303535e-05 5.509022e-05 1.632701 0.3409455
## 4 0 3 112.0519 1.089562e-04 1.107902e-02 2814.242127 1.0096304
## 5 1 1 0.0000 7.750136e-05 5.010141e-05 1.898569 0.4697716
## 6 0 5 3181.8374 1.013993e-04 3.520925e-03 167.154082 1.0072708
```

If we assume that one of the indices is somehow related with the attribute, then we should see that nodes with the attribute tend to be ranked on top of the induced ranking. Below, we calculate the number of nodes having the attribute that are ranked in the top 150 for each index.

```
apply(cent3_df[,2:7],2,function(x)
sum(cent3_df$attr[order(x,decreasing = TRUE)][1:150]))
```

```
## degree betweenness closeness eigen subgraph infocent
## 63 53 56 72 76 62
```

According to this crude evaluation, subgraph centrality is best in explaining the node attribute. But how conclusive is this now? Note that we did not specify any real hypothesis so basically any index could be a valid choice. Instead of trying out one of the other mentioned ones though, we now try to design a new index which hopefully gives us an even better fit. After some wild math, we may end up with something like this: \[ c(u)= ccoef(u) \left[\sum\limits_{v \in N(u)} \sum\limits_{k=0}^{\infty} \frac{(A^{[u]})_{vv}^{2k+1}}{(2k+1)!} \right] \]

Ok, so what is happening here? `ccoef(u)`

is the clustering coefficient of the node u (`transitivity(g,type="local")`

). The first sum is over all neighbors v of u. The second sum is used to sum up all closed walks of even length weighted by the inverse factorial of the length.

We can directly invent a second one, based on the walks of odd length. \[ c(u)= ccoef(u) \left[\sum\limits_{v \in N(u)} \sum\limits_{k=0}^{\infty} \frac{(A^{[u]})_{vv}^{2k+1}}{(2k+1)!} \right] \]

Mathematically fascinating, yet both indices defy any rational meaning and have not yet been considered in the literature.3 please do not write a paper about them! The indices are implemented in `netrankr`

as the *hyperbolic index*.

How do they compare to subgraph centrality, the best performing index so far?

```
cent3_df$hyp_eve <- hyperbolic_index(g3,type = "even")
cent3_df$hyp_odd <- hyperbolic_index(g3,type = "odd")
apply(cent3_df[,c(6,8:9)],2,function(x)
sum(cent3_df$attr[order(x,decreasing = TRUE)][1:150]))
```

```
## subgraph hyp_eve hyp_odd
## 76 99 102
```

Both indices are far superior. Around 66% of the top 150 nodes are equipped with the attribute, compared to 50% for subgraph centrality. A better evaluation may be to treat the problem as a binary classification problem and calculate the area under the ROC curve as a performance estimate. You have to trust me, that the indices also outperform the others with that method.

Obviously, this is a very contrived example, yet it emphasizes some important points. First, it is relatively easy to design an index that gives you the results you intend to get and hence justify the importance of the index. Second, you can never be sure, though, that you found “the best” index for the task. There may well be some even more obscure index that gives you better results. Third, if you do not find a fitting index, you can not be sure that there does not exist one after all.

The goal of clustering is to find cohesive subgroups within a network.4 also referred to as “community detection” The following algorithms for graph clustering are implemented in `igraph`

.

```
## [1] "cluster_edge_betweenness" "cluster_fast_greedy"
## [3] "cluster_infomap" "cluster_label_prop"
## [5] "cluster_leading_eigen" "cluster_louvain"
## [7] "cluster_optimal" "cluster_spinglass"
## [9] "cluster_walktrap"
```

It is important to note that there is no real theoretically basis for what constitutes a cluster in a network. Only the vague “internally dense” versus “externally sparse” argument. As such, there is no convincing argument for or against certain algorithms/methods. In this tutorial, we follow the “modularity maximization” principle.

As for centrality, if a `weight`

attribute is present, it will be used by default!

No matter which algorithm is chosen, the workflow is always the same.

```
clu <- cluster_louvain(ga)
#membership vector
mem <- membership(clu)
head(mem)
```

```
## Addison Montgomery Adele Webber Teddy Altman
## 4 8 7
## Amelia Shepherd Arizona Robbins Rebecca Pope
## 7 6 6
```

```
#communities as list
com <- communities(clu)
com[[1]]
```

```
## [1] "Miranda Bailey" "Ben Warren" "Lloyd" "Tucker Jones"
## [5] "Eli James"
```

An example for the infamous karate network is shown below.

```
karate <- make_graph("Zachary")
imc <- cluster_infomap(karate)
lec <- cluster_leading_eigen(karate)
loc <- cluster_louvain(karate)
sgc <- cluster_spinglass(karate)
wtc <- cluster_walktrap(karate)
scores <- c(infomap = modularity(karate,membership(imc)),
eigen = modularity(karate,membership(lec)),
louvain = modularity(karate,membership(loc)),
spinglass = modularity(karate,membership(sgc)),
walk = modularity(karate,membership(wtc)))
scores
```

```
## infomap eigen louvain spinglass walk
## 0.4020381 0.3934089 0.4188034 0.4197896 0.3532216
```

For the karate network, `cluster_spinglass()`

produces the highest modularity score. In general, though, it is advisable to use `cluster_louvain()`

since it has the best speed/performance trade-off.

This tutorial focused on the package `igraph`

yet there are many other packages for more specific network analytic tasks:

- statnet: large collection of SNA methods including ergms
- multinet: Analysis and Mining of Multilayer Social Networks
- bipartite: Analysis of two-mode networks (focus on ecology)
- egor: ego network analysis
- blockmodeling: Generalized blockmodeling for valued networks
- RSiena: Stochastic actor oriented models

The book “Statistical Analysis of Network Data with R” by Eric Kolaczyk and Gábor Csárdi is an excellent source for SNA material in R.

There also exists a variety of tutorials on the web that cover basic and more advanced network analysis using R. One of the most extensive ones is “Network Analysis and Visualization with R and igraph” by Katherine Ognyanova (link). Her tutorial covers most of the functionality of `igraph`

with in-depth explanations of the built-in plotting function.

The “Network Analysis R Cookbook” by Sacha Epskamp (link) makes use of the `qgraph`

package which implements several methods to analyze psychometric networks.

“Introduction to Network Analysis with R” by Jesse Sadler (link) provides a short comparison of `igraph`

, `network`

and `tidygraph`

.