netdiffuseR_sunbelt2016_workshop_simulations.Rmd

---
title: |
  | Understanding Diffusion with __netdiffuseR__
  | _Simulating data_
subtitle: Sunbelt 2016 INSNA
author: George Vega Yon \and Thomas Valente
date: |
  | Newport Beach, CA
  | April 5, 2016
institute: |
  | Department of Preventive Medicine
  | University of Southern California
header-includes: \usepackage{grffile}
output: beamer_presentation
fontsize: 10pt
---

```{r Config-knitr, echo=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  out.width = ".8\\textwidth", dev = "pdf", fig.align = "center",
  fig.width = 7, fig.height = 5)
pdf.options(family="Palatino")
library(sna)
library(SparseM)
```

# The plan

1. Review some concepts
2. Simulating diffusion networks
3. __netdiffuseR__'s core functions
4. Bonus track (if we have time)

# Introduction

Before start, a review of concepts that we will be using here

1. Exposure: What proportion/number of your neighbors has adopted an innovation.
2. Threshold: What was the proportion/number of your neighbors had adopted by the time you adopted.
3. Infectiousness: How much $i$'s adoption affects her alters
4. Susceptibility: How much $i$'s alters' adoption affects her.
5. Structural equivalence: How similar are $i$ and $j$ in terms of position on the network.

# Simulating diffusion networks

We will simulate a diffusion network with the following parameters:

1. Will have 1,000 vertices,
2. Will span 20 time periods,
3. The set of early adopters will be random,
4. Early adopters will be a 10\% of the network,
5. The graph will be small-world,
6. Will use the WS algorithmwith $p=.2$ (probability of rewire).
7. Threshold levels will be uniformly distributed between [0.3, 0.7\]

# Simulating diffusion networks
## In __netdiffuseR__

To generate such diffusion network we can use the `rdiffnet` function included in the package:

\footnotesize

```{r Generating the random graph}
# Loading the package
library(netdiffuseR)

# Setting the seed for the RNG
set.seed(1213)

# Generating a random diffusion network
net <- rdiffnet(
  n              = 1e3,                         # 1.
  t              = 20,                          # 2.
  seed.nodes     = "random",                    # 3.
  seed.p.adopt   = .1,                          # 4.
  seed.graph     = "small-world",               # 5.
  rgraph.args    = list(p=.2),                  # 6.
  threshold.dist = function(x) runif(1, .3, .7) # 7.
  )
```

\normalsize


# Simulating diffusion networks
## In __netdiffuseR__ (cont. 1)

This output can be printed to see some information about this network

\footnotesize

```{r Printing the random graph}
net
```

\normalsize

# Simulating diffusion networks
## In __netdiffuseR__ (cont. 2)

Including some summary stats (we only will see times 1, 5, 10, 15 and 20).

\footnotesize
```{r Summarizing the random graph, cache=TRUE}
summary(net, slices=c(1,5,10,15,20))
```
\normalsize


# Simulating diffusion networks
## In __netdiffuseR__ changing the first adopters

Using different parameters for `seed.nodes` (who are the first adopters), we get different [theoretical\] resuls:

```{r Comparing seed.nodes, echo=FALSE, fig.cap="Cumulative Adopters for Different seed.nodes"}
set.seed(1213)
net_mar <- rdiffnet(
  n              = 1e3,                         # 1.
  t              = 20,                          # 2.
  seed.nodes     = "marginal",                  # 3.
  seed.p.adopt   = .1,                          # 4.
  seed.graph     = "small-world",               # 5.
  rgraph.args    = list(p=.2),                  # 6.
  threshold.dist = function(x) runif(1, .3, .7) # 6.
  )
set.seed(1213)
net_cen <- rdiffnet(
  n              = 1e3,                         # 1.
  t              = 20,                          # 2.
  seed.nodes     = "central",                   # 3.
  seed.p.adopt   = .1,                          # 4.
  seed.graph     = "small-world",               # 5.
  rgraph.args    = list(p=.2),                  # 6.
  threshold.dist = function(x) runif(1, .3, .7) # 6.
  )

# Plotting
cols <- c(rgb(.8,.5,.5,1), rgb(.5,.8,.5,1), rgb(.5,.5,.8,1))
  
plot_adopters(net_cen, what="cumadopt", include.legend = FALSE, cex=2, 
              pch = 16,           bg=cols[1], col=cols[1], main="", xlab="Time (log-scale)")
plot_adopters(net, what="cumadopt", include.legend = FALSE, cex=2,
              pch = 16, add=TRUE, bg=cols[2], col=cols[2])
plot_adopters(net_mar, what="cumadopt", include.legend = FALSE, cex=2,
              pch = 16, add=TRUE, bg=cols[3], col=cols[3])
legend("bottomright",
       legend=c("Central", "Random", "Marginal"),
       col = cols, pch=16, bty="n")
```

# Simulating diffusion networks
## Testing theoretical results

-   Given dynamic graphs of size $n=150$, with density $\rho\sim 0.013$ and $t=5$ time periods. What should we expect from the following combinations:

    $$\left(\mbox{Bernoulli, Scale-free, Small-world}\right) \times \left(\mbox{Marginal, Central, Random}\right)$$

    In which of these the diffusion process is fastest?

-   Lets run some monte carlo simulations to find out!

# Simulating diffusion networks
## Testing theoretical results (cont.)

Recall that densities for these graphs can be computed as follow:

-   Bernoulli 
    $$\rho = p$$
-   Small-world 
    $$\rho = \frac{n\times k}{n\times(n-1)} =\frac{k}{n-1}$$
-   Scale-free
    $$\rho=\frac{m\times t}{(m_0 + t)(m_0 + t - 1)}$$
    
We use this to generate graphs with similar densities (see on the code).

# Simulating diffusion networks
## Theoretically, what should we expect from... marginal

```{r, echo=FALSE, warning=FALSE, message=FALSE, cache=TRUE}

set.seed(1312)
n <- 150

# Creating the seed graphs
library(igraph)
ba <- function() as_adj(barabasi.game(n, m=2))
d  <- nlinks(ba())/n/(n-1)
er <- function() as_adj(erdos.renyi.game(n, d))
ws <- function() as_adj(sample_smallworld(1, n, d*n/2, .2))

# Parameters for the simulations
N <- 1e3
t <- 5
p <- .2 # initial adopters
a <- "marginal"

# Montecarlo
bernoulli <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = er(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
bernoulli <- do.call(rbind, bernoulli)
# boxplot(bernoulli)

# Montecarlo
scale_free <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ba(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
scale_free <- do.call(rbind, scale_free)
# boxplot(scale_free)

# Montecarlo
small_world <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ws(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
small_world <- do.call(rbind, small_world)
# boxplot(small_world)

# Function to plot all the outcomes
plot_all <- function() {
  plot(colMeans(bernoulli), x=1:5, ylim=c(0,1), type="b", pch=1, col="blue",
       xlab = "Time", ylab="Cummlative % of adopters",
       main = NA)
  lines(colMeans(scale_free), x=1:5, type="b", pch=2, col="red")
  lines(colMeans(small_world), x=1:5, type="b", pch=3, col="black")
  
  # Adding legend and  info
  legend(
    "bottomright",
    legend = c("Bernoulli", "Scale-free", "Small-world"),
    col    = c("blue", "red", "black"),
    pch    = c(1,2,3), bty="n")
  
  text(1, .8,
       sprintf("# of vertices: %d\n# of simulations: %d\nAprox. density: %.4f\n%% of initial adopters: %.2f (%s)", n, N, d, p*100, a),
       pos=4)
}

plot_all()
```

Small-worlds! In _better_ connected networks, marginal nodes are closer to the entire graph.

# Simulating diffusion networks
## Theoretically, what should we expect from... central

```{r, echo=FALSE, warning=FALSE, message=FALSE, cache=TRUE}

set.seed(1312)

# Creating the seed graphs
library(igraph)
ba <- function() as_adj(barabasi.game(n, m=2))
d  <- nlinks(ba())/n/(n-1)
er <- function() as_adj(erdos.renyi.game(n, d))
ws <- function() as_adj(sample_smallworld(1, n, d*n/2, .2))

# Parameters for the simulations
N <- 1e3
t <- 5
p <- .2 # initial adopters
a <- "central"

# Montecarlo
bernoulli <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = er(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
bernoulli <- do.call(rbind, bernoulli)
# boxplot(bernoulli)

# Montecarlo
scale_free <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ba(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
scale_free <- do.call(rbind, scale_free)
# boxplot(scale_free)

# Montecarlo
small_world <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ws(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
small_world <- do.call(rbind, small_world)
# boxplot(small_world)

plot_all()
```

Scale-free! Central nodes have higher degree in these networks.

# Simulating diffusion networks
## Theoretically, what should we expect from... random

```{r, echo=FALSE, warning=FALSE, message=FALSE, cache=TRUE}

set.seed(1312)

# Creating the seed graphs
library(igraph)
ba <- function() as_adj(barabasi.game(n, m=2))
d  <- nlinks(ba())/n/(n-1)
er <- function() as_adj(erdos.renyi.game(n, d))
ws <- function() as_adj(sample_smallworld(1, n, d*n/2, .2))

# Parameters for the simulations
N <- 1e3
t <- 5
p <- .2 # initial adopters
a <- "random"

# Montecarlo
bernoulli <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = er(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
bernoulli <- do.call(rbind, bernoulli)
# boxplot(bernoulli)

# Montecarlo
scale_free <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ba(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
scale_free <- do.call(rbind, scale_free)
# boxplot(scale_free)

# Montecarlo
small_world <- lapply(seq_len(N), function(x) {
  net <- rdiffnet(t=t, seed.graph = ws(), seed.p.adopt = p, seed.nodes = a)
  cumulative_adopt_count(net)["prop",]
})

# Coercing into a matrix and observing the output
small_world <- do.call(rbind, small_world)
# boxplot(small_world)

plot_all()
```

Any clues here?

# __netdiffuseR__'s core functions

- Now we will review some of the package's main functions.
- Among __netdiffuseR__ features, we find three classical Diffusion Network Datasets:
    - `brfarmersDiffNet` Brazilian farmers and the innovation of Hybrid Corn Seed (1966).
    - `medInnovationsDiffNet` Doctors and the innovation of Tetracycline (1955).
    - `kfamilyDiffNet` Korean women and Family Planning methods (1973).
- For the review we will use the Medical Innovations dataset.


# Medical Innovations

\scriptsize
```{r Plot of medical Innovations, out.width=".5\\linewidth"}
data("medInnovationsDiffNet")
plot(medInnovationsDiffNet)
```
\normalsize


# __netdiffuseR__'s core functions
## Structural equivalence

-   Structural equivalence between vertices $(i,j)$ is defined as

    $$SE_{ij} = \frac{(dmax_i - d_{ji})^v}{\sum_{k\neq i}^n(dmax_i-d_{ki})^v}$$

    with the summation over $k\neq i$, and $d_{ji}$, eucledian distance in terms of geodesics, is defined as

    $$d_{ji} = \left[(z_{ji} - z_{ij})^2 + \sum_k^n (z_{jk} - z_{ik})^2 +  \sum_k^n (z_{ki} - z_{kj})^2\right]^\frac{1}{2}$$

    with $z_{ij}$ as the geodesic (shortest path) from $i$ to $j$, and $dmax_i$ equal to largest Euclidean distance between $i$ and any other vertex in the network. All summations are made over $k\not\in \{i,j\}$.
  
- This is a disimilarity measure! So as $SE_{ij}\to \infty$ it means that $i$ and $j$ are more \emph{structurally} different.


# __netdiffuseR__'s core functions
## Structural equivalence (cont.)

\tiny
```{r Structural equiv}
# Computing and printing
se <- struct_equiv(medInnovationsDiffNet, groupvar="city")
se

# Taking a look at the first elements at time 1
se[[1]]$SE[1:5, 1:5]
```
\normalsize


# __netdiffuseR__'s core functions
## Exposure

The exposure for the vertex $i\in V$ can be computed as

$$e_{i,t} = \frac{\sum_{j\neq i}s_{i,j}\times x_j\times a_j}{\sum_{j\neq i}s_{i,j}\times x_j}$$

Where $s_{i,j}\in S_t$ is the $ij-th$ element of the adjacency matrix, $x_j\in \mathbf{x}_t$ is $j$'s attribute in time $t$, and $a_j \in A_t$ is a dichotomous scalar equal to 1 if $j$ has adopted the innovation in time $t$. In matrix notation would be:

$$E_t = \left(S_t \times \left[\mathbf{x}_t \circ A_t\right]\right) / (S_t \times \mathbf{x}_t)$$


# __netdiffuseR__'s core functions
## Exposure (cont. 1)

\tiny
```{r Computing exposure, out.width=".9\\textwidth", fig.height=4, fig.width=8}
# Computing
e1 <- exposure(medInnovationsDiffNet)
e2 <- exposure(medInnovationsDiffNet, attrs="proage")

oldpar <- par(no.readonly = TRUE)
par(mfrow=c(1,2))
plot_threshold(medInnovationsDiffNet, expo=e1, main="Plain exposure")
plot_threshold(medInnovationsDiffNet, expo=e2, main="Attribute weighted exposure")
par(oldpar)
```
\normalsize

# __netdiffuseR__'s core functions
## Infectiousness and Susceptibility

Susceptibility of $i\in V$

$$
S_i = \frac{%
\sum_{k=1}^K\sum_{j=1}^n x_{ij(t-k+1)}z_{j(t-k)}\times \frac{1}{w_k}}{%
\sum_{k=1}^K\sum_{j=1}^n x_{ij(t-k+1)}z_{j(1\leq t \leq t-k)} \times \frac{1}{w_k} }\qquad \mbox{for }i,j=1,\dots,n\quad i\neq j
$$

Infectiousness of $i\in V$

$$
I_i = \frac{%
\sum_{k=1}^K \sum_{j=1}^n x_{ji(t+k-1)}z_{j(t+k)}\times \frac{1}{w_k}}{%
\sum_{k=1}^K \sum_{j=1}^n x_{ji(t+k-1)}z_{j(t+k\leq t \leq T)}\times \frac{1}{w_k} }\qquad \mbox{for }i,j=1,\dots,n\quad i\neq j
$$

Normalized

$$
S_i' = \frac{S_i}{\sum_{k=1}^K\sum_{j=1}^nz_{j(t-k)}\times \frac{1}{w_k}} %
\qquad I_i' = \frac{I_i}{\sum_{k=1}^K\sum_{j=1}^nz_{j(t-k)} \times\frac{1}{w_k}}
$$


# __netdiffuseR__'s core functions
## Infectiousness and Susceptibility (cont. 1)

\scriptsize
```{r out.width=".7\\linewidth"}
# Computing and plotting both with K=5
is <- plot_infectsuscep(net, K=5L, logscale = FALSE)
```
\normalsize

# __netdiffuseR__'s core functions
## Infectiousness and Susceptibility (cont. 2)

\tiny
```{r out.width=".7\\linewidth"}
# Is there any thing?
t.test(is$infect, alternative = "greater", na.rm=TRUE)
t.test(is$suscept, alternative = "greater", na.rm=TRUE)
```
\normalsize


<!-- # Other functions -->

<!-- \tiny -->

<!-- ```{r others1} -->
<!-- # Counting -->
<!-- nnodes(net) -->
<!-- nlinks(net)[[1]] -->
<!-- nslices(net) -->

<!-- ``` -->
<!-- \normalsize -->

<!-- # Other functions (cont. 1) -->

<!-- - Users can access subsets of `diffnet` objects easily -->

<!-- \tiny -->

<!-- ```R -->
<!-- # Removing nodes from a diffnet -->
<!-- net - 1:5 -->
<!-- net - net[1:5] -->
<!-- net[6:nnodes(net)] -->
<!-- ``` -->
<!-- \normalsize -->

<!-- \tiny -->
<!-- ```{r others2, echo=FALSE} -->
<!-- net - 1:5 -->
<!-- ``` -->
<!-- \normalsize -->

<!-- # Other functions (cont. 2) -->

<!-- - You can also manage diffnet objects' vertices and attributes -->

<!-- \tiny -->
<!-- ```{r others3} -->
<!-- # Subsetting slices -->
<!-- net[,,1:3] -->

<!-- # Attributes management -->
<!-- head(net[["real_threshold"]]) -->

<!-- # Assing new attributes -->
<!-- net[["indegree"]] <- dgr(net, cmode="indegree") -->
<!-- head(net[["indegree"]][[1]]) -->
<!-- ``` -->

<!-- \normalsize -->

# Bonus track: Structural dependence test

- A novel statistical method (work-in-progress) that allows conducting inference.
- Included in the package, tests whether a particular network statistic actually depends on network structure
- Suitable to be applied to network thresholds (you can't use thresholds in regression-like models!)


# Bonus track: Structural dependence test
## Idea

-   Let $\mathcal{G} = (V,E)$ be a graph, $\gamma$ a vertex attribute, and $\beta = f(\gamma,\mathcal{G})$, then

    $$\gamma \perp \mathcal{G} \implies \mathbb{E}\left[\beta(\gamma,\mathcal{G})|\mathcal{G}\right] = \mathbb{E}\left[\beta(\gamma,\mathcal{G})\right]$$

- This is, if for example time of adoption is independent on the structure of the network, then the average threshold level will be independent from the network structure as well.

- Another way of looking at this is that the test will allow us to see how probable is to have this combination of network structure and network threshold (if it is uncommon then we say that the diffusion model is highly likely)

# Bonus track: Structural dependence test
## Structural dependence test: Not random TOA

- To use this test, __netdiffuseR__ has the `struct_test` function.
- Basically it simulates networks with the same density and computes a particular statistic every time, generating an EDF (Empirical Distribution Function) under the Null hyphothesis (p-values).

\scriptsize
```{r Struct test non random toa, cache=TRUE}
# Running the test
ps <- c(20*100, rep(200, 19))
test <- struct_test(
  graph     = net, 
  statistic = function(x) mean(threshold(x), na.rm = TRUE),
  R         = 1e3,
  rewire.args = list(p=ps, algorithm="swap"),
  ncpus=8, parallel="multicore"
  )

# See the output
test
```
\normalsize

# Bonus track: Structural dependence test
## Structural dependence test: Not random TOA (cont. 2)

```{r, echo=FALSE}
hist(test)
```

# Bonus track: Structural dependence test
## Structural dependence test: Random TOA (cont. 3)

\scriptsize
```{r Struct test Random toa, cache=TRUE}
# Resetting TOAs (now will be completely random)
diffnet.toa(net) <- sample(c(NA, 1:20), nnodes(net), TRUE)

# Running the test
test <- struct_test(
  graph     = net, 
  statistic = function(x) mean(threshold(x), na.rm = TRUE),
  R         = 1e3,
  rewire.args = list(p=ps, algorithm="swap"),
  ncpus=8, parallel="multicore"
  )

# See the output
test
```
\normalsize

# Bonus track: Structural dependence test
## Structural dependence test: Random TOA (cont. 4)

```{r, echo=FALSE}
hist(test)
```