Last updated: April 05, 2016

This tutorial is a joint product of the Statnet Development Team:

Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Skye Bender-deMoll (Oakland)
Pavel N. Krivitsky (University of Wollongong)
Martina Morris (University of Washington)

For general questions and comments, please refer to the statnet wiki and the statnet users group and mailing list:
http://statnet.org/statnet_users_group.shtml

1. Installation

Open an R session, and set your working directory to the location where you would like to save this work.

To install all of the CRAN packages in the statnet suite:

install.packages('statnet')
library(statnet)

To install the ergm.ego package:

install.packages('ergm.ego')

2. Overview of ergm.ego

The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.

In many empirical contexts, it is not feasible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.

Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here is to combine what is observed, with assumptions, to define a class of models that represent a distribution of networks that are centered on the observed properties. The variation in these networks quantifies some of the uncertainty introduced by the assumptions.

The package comprises:

The package is designed to work with the other statnet packages. For example, once you have fit a model, you can use the summary and diagnostic functions from ergm, use simulate to generate complete network realizations from the model, and use the network descriptives from sna to explore the properties of the network. You can also use other R functions and packages after converting the network data structure into a data frame.

Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.

3. Background

The full technical details on ERGM estimation and inference from egocentrically sampled data are in a paper that is currently under review. The working paper can be found here. This tutorial provides a brief introduction to the key concepts.

3a. Exponential-family random graph models

ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks, obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness-of-fit; perform various types of model comparison; and simulate additional networks with the underlying probability distribution implied by that model.

The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(x\) denotes the attributes of the actors, \(g(y,x)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(\kappa(\theta,x)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).
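
For a very small network, equation (1) can be evaluated by brute force. The following is a minimal sketch in base R, not part of the statnet packages: it assumes a 3-node undirected network, a single edge-count statistic, and an arbitrary illustrative coefficient `theta = -1`. All object names are hypothetical.

```r
# All undirected graphs on 3 nodes: 3 dyads, so 2^3 = 8 possible graphs.
# Each row is one graph; the columns are the dyads (1,2), (1,3), (2,3).
graphs <- as.matrix(expand.grid(y12 = 0:1, y13 = 0:1, y23 = 0:1))

# Model statistic g(y): the number of edges (an illustrative choice).
g <- rowSums(graphs)

# Illustrative coefficient value (an assumption, not an estimate).
theta <- -1

# Numerator exp(theta * g(y)) for every graph, and the normalizing
# constant kappa, which sums the numerator over all 8 graphs.
numerator <- exp(theta * g)
kappa <- sum(numerator)

# P(Y = y; theta) for each of the 8 graphs.
prob <- numerator / kappa

data.frame(graphs, edges = g, prob = round(prob, 4))
```

Enumeration like this is only feasible for toy examples, since the number of possible graphs doubles with every additional dyad.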

The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more or less common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, these statistics must be observed in the sample (more details in Section 4).

A key distinction in model terms is dyad independence or dyad dependence. Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are dyad dependent.

3b. Network Sampling

Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link tracing designs (e.g., snowball and respondent driven sampling) and egocentric designs.

3b2. Egocentric Designs

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and then asked to provide information on the characteristics of the alters and/or the ties. The alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.

Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

For the moment, ergm.ego uses the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected. The minimal design is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs that include identifiable alter matrices. However, development of estimation for designs where alter–alter matrices are available is being planned.
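
One way to picture data from the minimal design is as two tables, one for egos and one for alters. The sketch below uses base R data frames with made-up values. The column names (`egoID`, `sex`, `age`) are hypothetical and are not a format required by ergm.ego.

```r
# Ego table: one row per respondent, with the ego's own attributes.
egos <- data.frame(
  egoID = 1:4,
  sex   = c("F", "M", "F", "M"),
  age   = c(23, 31, 27, 45)
)

# Alter table: one row per nominated alter. egoID links each alter to the
# nominating ego. There is no alter identifier, so the same person
# nominated by two egos would appear as two unrelated rows.
alters <- data.frame(
  egoID = c(1, 1, 2, 4),
  sex   = c("M", "F", "F", "F"),
  age   = c(25, 22, 30, 41)
)

# Each ego's degree is its number of alter rows; ego 3 reported no alters.
degree <- tabulate(alters$egoID, nbins = nrow(egos))
degree
```

Note that egos with no alters still appear in the ego table, which is what keeps isolates visible in the data.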

3c. Existing methods for sampled network data

Model-based

Handcock and Gile (2010): Likelihood inference for partially observed networks, has egocentric data as a special case.

Koskinen and Robins (2010): Bayesian inference for partially observed networks, has egocentric data as a special case.

Pros:
  • Can fit any ERGM that can be identified.
  • Can handle link-tracing designs.
Cons:
  • Requires alters to be identifiable.
  • Cannot take into account sampling weights (unless all attributes that affect sampling weights are part of the model).
  • Might not scale.
  • Requires knowledge of the population distribution of actor attributes used in the model.

Design-based

Krivitsky and Morris (2015): Use design-based estimators for sufficient statistics of the ERGM of interest and then transfer their properties to the ERGM estimate.

Pros:
  • Does not require alters to be identifiable.
  • Borrows directly from design-based inference methods. (Can easily incorporate sampling weights, stratification, etc.)
  • Can fit any ERGM that can be identified (though see below).
  • Can be made invariant to network size for some models.
Cons:
  • Requires “reimplementation” of the model statistics as “EgoStats”: currently does not support alter–alter statistics or directed or bipartite networks.
  • Relies on independent sampling from the population of interest in some form.
  • Cannot be fit to more complex (e.g., RDS) designs.
  • Requires knowledge of the population distribution of actor attributes used in the model.
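
The design-based idea can be sketched with a weighted survey estimate of a population total. This is a simplified illustration in base R, not the estimator as implemented in ergm.ego. The sample values, the weights, and the population size `N_size` are all made up, and the per-ego contribution to the edge count is taken to be half the ego's degree.

```r
# Hypothetical sample of 5 egos: per-ego contributions h(e_i) to an edge
# count (degree / 2), and sampling weights (inverse inclusion
# probabilities, up to scale).
degree  <- c(2, 0, 1, 3, 1)
h       <- degree / 2
weights <- c(1.0, 2.0, 1.0, 0.5, 1.5)

# Assumed known population size |N|.
N_size <- 1000

# Weighted mean of h over the sample, scaled up to the population.
h_bar <- sum(weights * h) / sum(weights)
g_hat <- N_size * h_bar
g_hat
```

Because the estimate is an ordinary weighted mean, standard survey tools for weights and stratification carry over directly.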

4. Theoretical Framework and Definitions

Some notation (sorry)

Population network

\(N\)
the population being studied: a very large, but finite, set of actors whose relations are of interest
\(x _ i\)
attribute (e.g., age, sex, race) vector of actor \(i \in N\)
\(x_N\) (or just \(x\), when there is no ambiguity)
the attributes of actors in \(N\)
\(\mathbb{Y}(N)\)
the set of dyads (potential ties) in an undirected network of actors in \(N\)
\(y\subseteq \mathbb{Y}(N)\)
the population network: a fixed but unknown set of relationships of interest

In particular,

\(y_{ij}\)
an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\)
\(y _ i=\{j\in N: y _ {ij}=1\}\)
the set of \(i\)’s network neighbors.

Egocentric sample

\(e_i\)
the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts:
\(e^e_i \equiv x_i\):
\(i\)’s own attributes
\(e^a_i \equiv (x_{j})_{j\in y_i}\):
an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but not their identities (indices in \(N\))

Also, denote the \(k\)th attribute/covariate observed on ego \(i\) and on its alters by \(e^e_{i,k}\equiv x_{i,k}\) and \(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\), respectively.

Then,

\(e_{N}\)
the egocentric census, the information retained by the minimal egocentric sampling design
\(S\subseteq N\)
the set of egos in the sample
\(e_{S}\)
the data contained in an egocentric sample
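
To make the notation concrete, the sketch below builds \(e_i\), \(e_N\), and \(e_S\) from a toy population network in base R. The data and the helper names (`neighbors`, `ego_view`) are hypothetical and chosen only to mirror the symbols above.

```r
# A toy population of 5 actors with one attribute each (hypothetical data).
x <- c("A", "A", "B", "B", "A")

# Population network y as an edge list of undirected ties (i < j).
y <- rbind(c(1, 2), c(1, 3), c(2, 5), c(3, 4))

# y_i: the set of i's network neighbors.
neighbors <- function(i) c(y[y[, 1] == i, 2], y[y[, 2] == i, 1])

# e_i: the ego's own attributes plus the attributes of its alters.
# Sorting the alter attributes discards neighbor identity and order.
ego_view <- function(i) list(ego = x[i], alters = sort(x[neighbors(i)]))

# The egocentric census e_N, and a sample e_S for S = {1, 4}.
e_N <- lapply(seq_along(x), ego_view)
S <- c(1, 4)
e_S <- e_N[S]
e_S
```

Once `ego_view` has been applied, nothing in `e_S` records which actors the alters were, which is the defining feature of the minimal design.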

4a. Specifying Egocentric ERGMs

Egocentric ERGMs are specified the same way as in plain ergm: via terms (e.g., nodematch) that represent predictors on the right-hand side of the formulas used in:

  • calls to summary (to obtain measurements of network statistics on a dataset)
  • calls to ergm.ego (to estimate an ERGM)
  • calls to simulate (to simulate networks from an ERGM fit)

The terms that can be used in an ERGM depend on the type of network being analyzed (directed or undirected, one-mode or two-mode (“bipartite”), binary or valued edges) and on the statistics that can be observed in the sample.

Even if the whole population is egocentrically observed (i.e., \(S=N\), a census), the alters are still not uniquely identifiable. This limits the kinds of network statistics that can be observed, and the ERGM terms that can be fit to such data. We turn to the notion of sufficiency to identify those that can be.

Egocentric Statistics

We call a network statistic \(g_{k}(\cdot,\cdot)\) egocentric if it can be expressed as \[ g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i) \] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.

The space of egocentric statistics includes dyadic-independent statistics that can be expressed in the general form of \[ g_{k}(y,x)=\sum_{ij\in y} f_k(x_i,x_j) \] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some dyadic-dependent statistics that can be expressed as \[ g_{k}(y,x)=\sum_{i\in N} f_k ({x_{i},(x_j)_{j\in y_i}}) \] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.
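
As a check on these definitions, the sketch below computes a homophily count on a toy network in two ways: as a sum over ties and as a sum of per-ego contributions \(h_k(e_i)\). The data are hypothetical, and taking \(h_k\) to be half the number of matching alters is one natural choice for an undirected network.

```r
# Toy population (hypothetical data): one attribute per actor and an
# undirected edge list.
x <- c("A", "A", "B", "B", "A")
y <- rbind(c(1, 2), c(1, 3), c(2, 5), c(3, 4))

# Dyad-sum form: count ties whose endpoints share the attribute value.
g_dyads <- sum(x[y[, 1]] == x[y[, 2]])

# Egocentric form: each ego contributes half the number of its alters
# that match it, because every tie is seen once from each end.
h <- function(i) {
  alters <- c(y[y[, 1] == i, 2], y[y[, 2] == i, 1])
  sum(x[alters] == x[i]) / 2
}
g_egos <- sum(sapply(seq_along(x), h))

c(dyad_sum = g_dyads, ego_sum = g_egos)
```

The two totals agree, and the egocentric form uses only each ego's own attribute and the attributes of its alters, never their identities.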

What is “egocentric” depends on available data.

Egocentric with basic design
  • Homophily
  • Covariate effects
  • Degree distribution
Egocentric with alter-alter ties
  • Triadic closure (transitive/cyclical ties, triangles)
  • 4-cycles (possibly)
Egocentric with star sample (full set of alter’s ties)
  • Degree assortativity
Not Egocentric for other reasons
  • Mean degree (\(g_{k}(y,x)=2|y|/|N|\)): \(e _ i\) doesn’t know how big the network is