Cambridge Colloquium on Complexity and Social Networks (CCCSN)

Dealing with Missing Data

SocNet responses to a request from Ines Mergel, Kennedy School of Government, Harvard University and University of St. Gallen, Switzerland

This page contains the initial request on the topic "Dealing with Missing Data", an extracted literature list, and the responses of SocNet members:

SocNet Request, 10/28/2003: Follow up question on "Dealing with Missing Data"

====================================================

I am in the process of analyzing my data, collected on the diffusion and adoption of educational technologies in a higher education institution. My research focuses on how network variables affected the adoption of different kinds of educational technologies in an environment that formally has no hierarchy (all actors have the same official position).

My network consists of 98 faculty members. The response rate is 87% on the overall questionnaire and 70% on the relational questions. Some actors refused to answer the relational questions (give/receive information on educational technologies, meet on social occasions, and meet on professional occasions). The frequency of interactions is coded from 1 to 3 (never, seldom, very often).

I would like to get opinions from people on how to deal with different forms of missing data:

How to code _and_ analyze:

1. Actors who never replied to the questionnaire (after four rounds of requests)?

2. Actors who refused to answer the relational questions (e.g. because they consider them too private)? Would it be OK, for example, to impute the missing values and then examine whether the network (including both reported and imputed data) affected adoption?

3. Actors who were mentioned by other respondents but left the organization right before or during my study and never responded?

Do I have to treat these different categories of non-response as missing values, or is there a way to code and analyze them differently? How should the results be interpreted?

Thanks a lot for your input - I will of course compile the answers and make them available to the list!

Ines
====================================================

Literature recommendations from SocNet members:

Butts, Carter T. (2003): "Network Inference, Error, and Informant (In)Accuracy: A Bayesian Approach", Social Networks, 25(2), 103-140.
-> slides from Sunbelt talk 2002

Kossinets, G. (2003): Effects of missing data in social networks, October 30, 2003.

Little, R. J. A./Rubin, D. B. (2002), Statistical Analysis with Missing Data, New York.

Stork, D./Richards, W. D. (1992): Nonrespondents in Communication Network Studies: Problems and Possibilities, in: Group & Organization Management, 17(2), 193-209.

Valente, T. (2003): Network Models and Methods for Studying the Diffusion of Innovations, To appear in: P. Carrington, S. Wasserman, & J. Scott (Eds.) Recent Advances in Network Analysis.

====================================================

Replies from SocNet list members:

1. Bill Richards:

Stork, D./Richards, W. D. (1992): Nonrespondents in Communication Network Studies: Problems and Possibilities, in: Group & Organization Management, 17(2), 193-209.

2. Tom Valente:

Ines, Well if you can get a time stamp, even a rough one, there are lots of other methods you can use. Attached is a recent review chapter. - Tom

ines_mergel@harvard.edu wrote:

> Hi Tom,
>
> I am just collecting data on the extent of the adoption, by counting the (increasing) activities over four semesters (since the beginning of the eLearning platform, which everyone has to use). I also collected self-reported data on the extent of usage (=adoption), but not in as much detail as mentioned above and without a specific time reference.

Literature: Valente, T. (2003): Network Models and Methods for Studying the Diffusion of Innovations, To appear in: P. Carrington, S. Wasserman, & J. Scott (Eds.) Recent Advances in Network Analysis.

3. Gueorgi Kossinets, Columbia University

Literature:

- Little and Rubin (2002), Statistical Analysis with Missing Data.
- http://arxiv.org/abs/cond-mat/0306335

Hi Ines,

You can try analysing your data as is (simply dropping non-respondents) and under some 'optimistic' assumptions, e.g. that non-respondents always reciprocate. Then compare the findings. It may well be that the amount of missing data is small enough that your results are robust.
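A minimal Python sketch of this comparison, assuming a binary directed adjacency matrix `adj` and a boolean `responded` mask. The function names, the toy data and the choice of density as the statistic are illustrative assumptions, not part of the original post:

```python
import numpy as np

def drop_nonrespondents(adj, responded):
    """Analyse the data 'as is': keep only actors who answered."""
    idx = np.flatnonzero(responded)
    return adj[np.ix_(idx, idx)]

def assume_reciprocation(adj, responded):
    """'Optimistic' variant: a non-respondent returns every tie
    received from a respondent. Ties among non-respondents stay 0."""
    filled = adj.copy()
    for j in np.flatnonzero(~responded):
        filled[j, responded] = adj[responded, j]
    return filled

def density(adj):
    """Share of possible directed ties that are present."""
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))

# Toy network (assumption): actor 3 did not respond, so row 3 is unobserved.
adj = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
responded = np.array([True, True, True, False])

dropped = drop_nonrespondents(adj, responded)
optimistic = assume_reciprocation(adj, responded)
print(density(dropped), density(optimistic))
```

Any other network statistic could be swapped in for `density`; the point is only to compute it under both treatments and see whether the conclusions differ.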

I've written an exploratory study on the effects of missing data on different graph properties, you can take a look at it here: http://arxiv.org/abs/cond-mat/0306335

For non-response, the general approach would be to infer the likelihood of relationships under some sensible assumptions. It is possible to use different assumptions for different categories of non-respondents (e.g. say that non-respondents who are in the organization tend to reciprocate, while those non-respondents who have left might have been less connected in the recent past, etc). For a good discussion of imputation see Little and Rubin (2002), Statistical Analysis with Missing Data.

One ad hoc technique might be based on the assumption that actors are similar in their behaviour, therefore the average is a good characterization of the population. For example, if A, B and C nominate D who does not respond at all, you could estimate D's out-degree using mean in-degree z_in and mean out-degree z_out, estimated from available data. The imputed out-degree for D will be k_out(D) = k_in(D)*z_out/z_in (k_in is number of received nominations). Then pick D's connections randomly or preferentially reciprocating in-ties. If certain attributes are known for both respondents and non-respondents, then, as was recently mentioned on this list, you could use logistic regression to estimate the probability of a tie given the attributes for every dyad.
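The degree-based imputation above can be sketched in Python as follows. This is only one possible reading: here z_out is taken as the mean out-degree of respondents and z_in as the mean number of nominations received (from respondents) over all actors, and ties are filled by reciprocating in-ties first and then choosing random other actors. The function names, the rounding of the imputed degree and the toy data are assumptions for illustration:

```python
import numpy as np

def impute_out_degree(adj, responded, d):
    """k_out(D) = k_in(D) * z_out / z_in, estimated from observed rows.
    Assumes at least one observed nomination, so that z_in > 0."""
    observed = adj[np.flatnonzero(responded)]   # rows reported by respondents
    z_out = observed.sum(axis=1).mean()         # mean out-degree of respondents
    z_in = observed.sum(axis=0).mean()          # mean nominations received, all actors
    k_in = observed[:, d].sum()                 # nominations received by D
    return k_in * z_out / z_in

def impute_ties(adj, responded, d, rng):
    """Fill D's row, preferentially reciprocating in-ties."""
    n = adj.shape[0]
    k_out = min(int(round(impute_out_degree(adj, responded, d))), n - 1)
    nominators = np.flatnonzero(adj[:, d])
    others = np.setdiff1d(np.arange(n), np.append(nominators, d))
    # Nominators come first, so in-ties are reciprocated before random picks.
    order = np.concatenate([rng.permutation(nominators), rng.permutation(others)])
    filled = adj.copy()
    filled[d, order[:k_out]] = 1
    return filled

# Toy network (assumption): actors 0, 1 and 2 nominate actor 3, who did not respond.
adj = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
])
responded = np.array([True, True, True, False, True])

k_out = impute_out_degree(adj, responded, d=3)
filled = impute_ties(adj, responded, d=3, rng=np.random.default_rng(0))
print(k_out, filled[3])
```

Because the random picks change from draw to draw, it is probably safer to repeat the imputation several times than to rely on a single filled-in network.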

Regards,

Gueorgi Kossinets
Graduate Fellow, ISERP and
Teaching Fellow, Dept. of Sociology
Columbia University
E-mail: gk297 at columbia.edu
Phone: +1 (212) 854 0367

4. Carol Hon

Dear Ines,

I am troubled by the same issue. I found the paper

Stork, D. and Richards, W. D. (1992) Nonrespondents in Communication Network Studies: Problems and Possibilities. Group and Organization Management, 17(2), 193-209

very useful.

It stimulated me to think about the best way to deal with my own situation.

I asked 11 actors about their frequency of contact in the current construction stage and also asked them to recall it for the design stage, but I have a lot of missing data for the design stage due to the rotation of staff. My solution is to symmetrize the communication data by taking the maximum tie value, in order to minimize missing data. The frequency of contact is measured on a Likert scale from 1 to 5, with 1 = very infrequent and 5 = very frequent. I treat non-respondents and no contact as missing data, without separating them.
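One way to read "symmetrize with maximum ties" is to take, for each pair, the larger of the two reported values and to fall back on the single available report when the other is missing. A minimal Python sketch under that assumption (the matrix and variable names are illustrative, and missing entries are coded as NaN):

```python
import numpy as np

nan = np.nan

# Toy reports (assumption): reports[i, j] is actor i's rating of contact with j
# on the 1-5 scale; actor 2 did not respond, so row 2 is entirely missing.
reports = np.array([
    [nan, 4.0, 3.0],
    [2.0, nan, nan],
    [nan, nan, nan],
])

# np.fmax returns the non-NaN value when only one side is missing,
# and NaN only when both reports are missing.
symmetrized = np.fmax(reports, reports.T)
print(symmetrized)
```

Here the tie between actors 0 and 2 is recovered from actor 0's report alone, while the tie between actors 1 and 2 stays missing because neither side reported it.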

Hope this also stimulates your thinking.

Carol

5. Carter Butts

I posted somewhat recently on this general topic to SOCNET, and hence don't want to spam the list by posting again -- I would recommend searching the archives for some past discussion of this issue. More proximately, a family of models which can be used to account for the effects of missing data are given in

Butts, Carter T. (2003). "Network Inference, Error, and Informant (In)Accuracy: A Bayesian Approach." Social Networks, 25(2), 103-140.

Since you do not have much in the way of replicate observations, you will not be able to conduct inference on error rates; nevertheless, the model can still be used to account for the impact of missing data, so long as the data is ignorably missing. I presented a more complex family of models, which allows for non-ignorably missing data, at Sunbelt last year. The slides from this talk can be found at

http://erzuli.ss.uci.edu/~buttsc/distribution/netbayes2.pres.out.pdf

As a starting point, I would recommend using the models from the published paper (fixed error rates or the pooled error model). Try a range of assumed error rates (you might start with those found in the paper as examples), examine the posterior predictive in each case, and ensure that your results are robust to small changes in error parameters (and reasonable network priors). If so, you can be reasonably confident that your results would hold given the full data.

Hope that helps,

Carter

6. Bob Faris

Hello--

Following on the missing data discussion, I'd like to get feedback on an imputation procedure that Jim Moody has helped me out with. We are working with 26 networks of middle and high school students, each with around 150 to 300 students. Roughly 15 percent of the students refused to participate in the survey.

We recognized that it would be impossible to impute their missing friendship nominations, but thought that it might be reasonable to impute reciprocity--i.e., given that a participating A nominates a missing B, we attempt to model whether B reciprocates.

Students were allowed to nominate up to 5 friends; among participants, reciprocity was typically close to 50%, problematizing any wholesale assumption about reciprocity (either treating the missing ties as 0's, or symmetrizing the ties).

So we ran a logit on all dyads where there was at least one tie, the dependent variable being whether the tie was reciprocated. We included in our model various measures of tie strength (emotional closeness, frequency of interaction, etc.), the outdegree & indegree of sender, the indegree of the recipient, and a measure of transitivity (i.e., how many triads would be transitive if B reciprocates).
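A rough Python sketch of the kind of dyad-level logit described here, fitted by plain gradient ascent with NumPy. It uses a single hypothetical covariate (emotional closeness on a 1-5 scale) and made-up data; the other covariates from the post (sender out/indegree, recipient indegree, transitivity) would simply be further columns of `X`. This is an illustration of the general idea, not the model actually estimated:

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logit(X, y, steps=5000, lr=0.1):
    """Maximum-likelihood logit via gradient ascent on the log-likelihood."""
    X1 = add_intercept(X)
    beta = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        beta += lr * X1.T @ (y - p) / len(y)
    return beta

def predict(beta, X):
    """Predicted probability that a tie is reciprocated."""
    return 1.0 / (1.0 + np.exp(-add_intercept(X) @ beta))

# Toy dyads among participants (assumption): one row per observed tie A -> B.
closeness = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 1], dtype=float)
reciprocated = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0], dtype=float)

X = closeness.reshape(-1, 1)
beta = fit_logit(X, reciprocated)

# Predicted reciprocation probabilities for ties sent to missing students.
X_missing = np.array([[1.0], [3.0], [5.0]])
p_missing = predict(beta, X_missing)
print(beta, p_missing)
```

The predicted probabilities for ties sent to non-participants can then serve as the reciprocity estimates used in an imputation step.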

Has anyone done something similar to this? Any thoughts?

Thanks in advance,
Bob Faris

7. Carter Butts

bob faris wrote:

> We recognized that it would be impossible to impute their missing friendship nominations, but thought that it might be reasonable to impute reciprocity--i.e., given that a participating A nominates a missing B, we attempt to model whether B reciprocates. So we ran a logit on all dyads where there was at least one tie, the dependent variable being whether the tie was reciprocated. We included in our model various measures of tie strength (emotional closeness, frequency of interaction, etc.), the outdegree & indegree of sender, the indegree of the recipient, and a measure of transitivity (i.e., how many triads would be transitive if B reciprocates). Has anyone done something similar to this? Any thoughts?

Aside from the question of estimating the reciprocity rate, a somewhat more robust procedure would be to use your reciprocity estimates to draw repeatedly from the set of graphs (conditional on size, observed ties, and your estimated reciprocity rates), to calculate your quantity of interest on each draw, and then to study the distribution which results (instead of any single estimate). If your results hold across the vast majority of draws (and if you believe the reciprocity model), then you have some evidence that results would be unlikely to change if you had the full data. (This can be thought of as multiple imputation, although one can frame it in other ways if one likes.)
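The repeated-draw procedure can be sketched in Python roughly as follows. The sketch assumes a binary adjacency matrix, a boolean `responded` mask, and a reciprocity probability `p_recip` that is either a single number or a matrix of per-dyad estimates; the names, the toy data, the value 0.5 and the chosen statistic (share of ties that are reciprocated) are all illustrative assumptions:

```python
import numpy as np

def draw_graph(adj, responded, p_recip, rng):
    """One draw: keep observed ties, and let each missing B return a tie
    from a responding A with probability p_recip."""
    missing = ~responded
    # candidates[b, a] is True when respondent a nominated missing actor b.
    candidates = (adj.T == 1) & missing[:, None] & responded[None, :]
    accepted = rng.random(adj.shape) < p_recip
    filled = adj.copy()
    filled[candidates & accepted] = 1
    return filled

def reciprocity(adj):
    """Quantity of interest here: share of directed ties that are mutual."""
    ties = adj.sum()
    return (adj * adj.T).sum() / ties if ties else 0.0

# Toy network (assumption): actor 3 did not respond.
adj = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
responded = np.array([True, True, True, False])

rng = np.random.default_rng(0)
stats = np.array([
    reciprocity(draw_graph(adj, responded, 0.5, rng)) for _ in range(1000)
])
low, high = np.percentile(stats, [2.5, 97.5])
print(stats.mean(), low, high)
```

The spread of `stats` across draws, rather than any single imputed network, is what indicates how sensitive the quantity of interest is to the missing ties.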

For that matter, though, you might want to consider other, more general models for the data which incorporate other sorts of structural biases. Since I think I know the data set to which you refer, I believe that Jim has already done work on this. Extrapolating p*/ERGM parameters to graphs of larger size (in order to take draws, which is what you'd ultimately want to do) is somewhat problematic, but the mean value parameterization might provide a way around this (especially since you're not adding all that many vertices). Perhaps Mark or others would like to chime in here....

-Carter

8. Gindo Tampubolon, University of Manchester

Hi Ines,

Hope this is not too late. You might also want to take a look at Peter Hoff's work, e.g. 'Random Effects Models for Network Data'. A few weeks ago, if I remember correctly, he announced on this list the availability of his software, which could possibly do this kind of thing.

HTH

JOHN F. KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY