Cambridge Colloquium on Complexity
and Social Networks (CCCSN)
Dealing with Missing Data
SocNet responses to the request of Ines Mergel,
Kennedy School of Government, Harvard University and University
of St. Gallen, Switzerland
This page contains the initial request on the topic
"Dealing with Missing Data", an
extracted literature list and the responses
of SocNet-members:
SocNet Request, 10/28/2003: Follow up question
on "Dealing with Missing Data"
====================================================
I am in the process of analyzing my data, collected
on the diffusion and adoption of educational technologies in a higher
education institution. My research interest is focused on how the
network variables affected the adoption of different kinds of educational
technologies in an environment, which formally has no hierarchies
(all actors have the same official position).
My network consists of 98 faculty members. The
response rate is: 87 % on the overall questionnaire and 70 % on
the relational questions. Some actors refused to answer the relational
questions (give/receive information on educational technologies,
meet on social occasions, and meet on professional occasions). The
frequency of interactions is coded from 1-3 (never-seldom-very often).
I would like to get opinions from people on how
to deal with different forms of missing data:
How to code _and_ analyze:
1. Actors, who never replied to the questionnaire?
(After four rounds of requests)
2. Actors, who refused to answer the relational
questions? (e.g., because they think that it is too private) Would it
be ok, for example, to impute the missing values and then examine
whether the network (including both responded and imputed data)
affected adoption?
3. Actors, who were mentioned by other respondents,
but left the organization right before or during my study and never
responded.
Do I have to treat these different categories of
non-responses as missings or is there a way to code and analyze
them differently? How to interpret the results?
Thanks a lot for your input - I will of course
compile the answers and make them available to the list!
Ines
====================================================
Literature recommendations from SocNet
members:
Butts, Carter T. (2003): "Network Inference,
Error, and Informant (In)Accuracy: A Bayesian Approach", Social
Networks, 25(2), 103-140.
-> slides
from Sunbelt talk 2002
Kossinets, G. (2003): Effects
of missing data in social networks, October 30, 2003.
Little, R. J. A./Rubin, D. B. (2002), Statistical
Analysis with Missing Data, New York.
Stork, D./Richards, W. D. (1992): Nonrespondents
in Communication Network Studies: Problems and Possibilities, in:
Group & Organization Management, 17(2), 193-209.
Valente, T. (2003): Network Models and Methods
for Studying the Diffusion of Innovations, To appear in: P. Carrington,
S. Wasserman, & J. Scott (Eds.) Recent Advances in Network Analysis.
====================================================
Replies from SocNet list members:
1. Bill Richards:
Stork, D./Richards, W. D. (1992): Nonrespondents
in Communication Network Studies: Problems and Possibilities, in:
Group & Organization Management, 17(2), 193-209.
2. Tom Valente:
Ines, Well if you can get a time stamp, even a
rough one, there are lots of other methods you can use. Attached
is a recent review chapter. - Tom
ines_mergel@harvard.edu wrote:
> Hi Tom,
>
> I am just collecting data on the extent of the adoption, by
> counting the (increasing) activities over four semesters (since the
> beginning of the eLearning platform, which everyone has to use).
> I also collected self-reported data on the extent of usage (=adoption),
> but not in as much detail as mentioned above and without a specific
> time reference.
Literature: Valente, T. (2003): Network Models
and Methods for Studying the Diffusion of Innovations, To appear
in: P. Carrington, S. Wasserman, & J. Scott (Eds.) Recent Advances
in Network Analysis.
3. Gueorgi Kossinets, Columbia University
Literature:
- Little and Rubin (2002), Statistical Analysis
with Missing Data.
- http://arxiv.org/abs/cond-mat/0306335
Hi Ines,
You can try analysing your data as is (simply dropping
non-respondents) and under some 'optimistic' assumptions, e.g. that
non-respondents always reciprocate. Then compare the findings. It
may well be that the amount of missing data is small enough so that
your results are robust.
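The comparison suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actor labels, the toy tie list, and the choice of density as the statistic are hypothetical and not part of the original message; any graph measure of interest could be substituted.

```python
# Hypothetical toy data: directed nominations reported by respondents only.
respondents = {"A", "B", "C"}
nonrespondents = {"D"}
ties = {("A", "B"), ("B", "A"), ("A", "D"), ("C", "D"), ("B", "C")}

def drop_nonrespondents(ties, respondents):
    """Scenario 1: keep only ties whose sender and receiver both responded."""
    return {(i, j) for (i, j) in ties if i in respondents and j in respondents}

def assume_reciprocation(ties, nonrespondents):
    """Scenario 2 ('optimistic'): every nomination of a non-respondent is returned."""
    returned = {(j, i) for (i, j) in ties if j in nonrespondents}
    return set(ties) | returned

def density(ties, n_actors):
    """Share of possible directed ties that are present."""
    return len(ties) / (n_actors * (n_actors - 1))

dropped = drop_nonrespondents(ties, respondents)
optimistic = assume_reciprocation(ties, nonrespondents)

density_dropped = density(dropped, len(respondents))
density_optimistic = density(optimistic, len(respondents | nonrespondents))
print(density_dropped, density_optimistic)
```

If the substantive conclusion is the same under both scenarios, that is some indication that the missing data do not drive the result.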
I've written an exploratory study on the effects
of missing data on different graph properties, you can take a look
at it here: http://arxiv.org/abs/cond-mat/0306335
For non-response, the general approach would be
to infer the likelihood of relationships under some sensible assumptions.
It is possible to use different assumptions for different categories
of non-respondents (e.g. say that non-respondents who are in the
organization tend to reciprocate, while those non-respondents who
have left might have been less connected in the recent past, etc).
For a good discussion of imputation see Little and Rubin (2002),
Statistical Analysis with Missing Data.
One ad hoc technique might be based on the assumption
that actors are similar in their behaviour, therefore the average
is a good characterization of the population. For example, if A,
B and C nominate D who does not respond at all, you could estimate
D's out-degree using mean in-degree z_in and mean out-degree z_out,
estimated from available data. The imputed out-degree for D will
be k_out(D) = k_in(D)*z_out/z_in (k_in is number of received nominations).
Then pick D's connections randomly or preferentially reciprocating
in-ties. If certain attributes are known for both respondents and
non-respondents, then, as was recently mentioned on this list, you
could use logistic regression to estimate the probability of a tie
given the attributes for every dyad.
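A minimal sketch of this ad hoc degree imputation follows. Several details are assumptions made for illustration and are not specified in the message: z_out is taken as the mean out-degree of respondents, z_in as the mean number of nominations received over all actors, the imputed degree is rounded and capped at the number of other actors, and "preferentially reciprocating" is read as "return in-ties first, then fill up at random". The function names and toy data are hypothetical.

```python
import random

def impute_out_degree(ties, respondents, actors, target):
    """k_out(target) = k_in(target) * z_out / z_in, rounded to an integer."""
    observed = [(i, j) for (i, j) in ties if i in respondents]
    z_out = len(observed) / len(respondents)  # mean out-degree of respondents
    z_in = len(observed) / len(actors)        # mean nominations received per actor
    k_in = sum(1 for (_, j) in observed if j == target)
    return round(k_in * z_out / z_in)

def impute_out_ties(ties, respondents, actors, target, rng):
    """Pick the target's out-ties, returning received nominations first."""
    k_out = min(impute_out_degree(ties, respondents, actors, target),
                len(actors) - 1)
    nominators = sorted({i for (i, j) in ties if j == target and i in respondents})
    others = sorted(actors - set(nominators) - {target})
    rng.shuffle(nominators)  # random order among actors who nominated the target
    rng.shuffle(others)      # random order among everyone else
    return {(target, j) for j in (nominators + others)[:k_out]}

# Toy example: D never responded but was nominated by A, B and C.
actors = {"A", "B", "C", "D", "E", "F"}
respondents = actors - {"D"}
ties = {("A", "B"), ("A", "D"), ("B", "A"), ("B", "D"), ("C", "D"),
        ("C", "E"), ("E", "F"), ("E", "A"), ("F", "E"), ("F", "C")}

imputed = impute_out_ties(ties, respondents, actors, "D", random.Random(0))
print(sorted(imputed))
```

Because the random choices affect the result, it is worth repeating the imputation with different seeds and checking how much the findings vary.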
Regards,
Gueorgi Kossinets
Graduate Fellow, ISERP and
Teaching Fellow, Dept. of Sociology
Columbia University
E-mail: gk297 at columbia.edu
Phone: +1 (212) 854 0367
4. Carol Hon
Dear Ines,
I am troubled by the same issue. I found the paper
Stork, D. and Richards, W. D. (1992): Nonrespondents
in Communication Network Studies: Problems and Possibilities.
Group and Organization Management, 17(2), 193-209 very useful.
It stimulated me to think of the best way to deal
with my own situation.
I asked 11 actors about their frequency of contact in the current
construction stage and also asked them to recall it for the design stage,
but I have a lot of missing data for the design stage due to the rotation
of staff. My solution is to symmetrize the communication
data, taking the maximum of the two reports for each pair, to minimize
missing data. I measure frequency of contact on a Likert scale from 1 to 5,
with 1 = Very Infrequent and 5 = Very Frequent. I treat
non-respondents and no contact alike as missing data, without separating them.
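As an illustration of symmetrizing by the maximum, here is a small sketch. The matrix layout, the use of None for missing reports, and the toy values are assumptions made for the example; they are not taken from the message.

```python
def symmetrize_max(matrix):
    """Give each pair the larger of the two reported values, ignoring missing ones."""
    n = len(matrix)
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-ties
            reports = [v for v in (matrix[i][j], matrix[j][i]) if v is not None]
            result[i][j] = max(reports) if reports else None
    return result

# Rows are senders; actor 2 did not respond, so the whole row is missing.
contact = [
    [None, 4, 2],
    [3, None, 5],
    [None, None, None],
]

symmetric = symmetrize_max(contact)
for row in symmetric:
    print(row)
```

A pair stays missing only when neither actor reported on it, which is why this approach reduces the amount of missing data.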
Hope this also stimulates your thinking.
Carol
5. Carter Butts
I posted somewhat recently on this general topic
to SOCNET, and hence don't want to spam the list by posting again
-- I would recommend searching the archives for some past discussion
of this issue. More
proximately, a family of models which can be used to account for
the effects of missing data are given in
Butts, Carter T. (2003). ``Network Inference, Error,
and Informant (In)Accuracy: A Bayesian Approach.'' Social Networks,
25(2), 103-140.
Since you do not have much in the way of replicate
observations, you will not be able to conduct inference on error
rates; nevertheless, the model can still be used to account for
the impact of missing data, so
long as the data is ignorably missing. I presented a more complex
family of models, which allows for non-ignorably missing data, at
Sunbelt last year. The slides from this talk can be found at
http://erzuli.ss.uci.edu/~buttsc/distribution/netbayes2.pres.out.pdf
As a starting point, I would recommend using the
models from the published paper (fixed error rates or the pooled
error model). Try a range of assumed error rates (you might start
with those found in the paper as examples), examine the posterior
predictive in each case, and ensure that your results are robust
to small changes in error parameters (and reasonable network priors).
If so, you can be reasonably confident
that your results would hold given the full data.
Hope that helps,
Carter
6. Bob Faris
Hello--
Following on the missing data discussion, I'd like
to get feedback on an imputation procedure that Jim Moody has helped
me out with. We are working with 26 networks of middle and high school
students, each with around 150 to 300 students. Roughly 15 percent
of the students refused to participate in the survey.
We recognized that it would be impossible to impute
their missing friendship nominations, but thought that it might
be reasonable to impute reciprocity--i.e., given that a participating
A nominates a missing B, we
attempt to model whether B reciprocates.
Students were allowed to nominate up to 5 friends;
among participants, reciprocity was typically close to 50%, problematizing
any wholesale assumption about reciprocity (either treating the
missing ties as 0's, or
symmetrizing the ties).
So we ran a logit on all dyads where there was
at least one tie, the dependent variable being whether the tie was
reciprocated. We included in our model various measures of tie strength
(emotional closeness,
frequency of interaction, etc.), the outdegree & indegree of
sender, the indegree of the recipient, and a measure of transitivity
(i.e., how many triads would be transitive if B reciprocates).
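For readers who want to see the shape of such a model, the following is a rough sketch only, not the procedure actually used. It fits a logit by plain gradient ascent using the standard library, with just two illustrative predictors (a closeness score and a count of triads that would become transitive). The data, the helper names `fit_logit` and `predict`, and the step settings are all hypothetical; in practice a statistics package would be used and all the predictors listed above would enter the model.

```python
import math

def predict(weights, row):
    """Probability that the tie is reciprocated, given predictor values."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, steps=5000, rate=0.1):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * (len(X[0]) + 1)  # intercept first, then one weight per predictor
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = label - predict(weights, row)
            grad[0] += err
            for k, value in enumerate(row):
                grad[k + 1] += err * value
        weights = [w + rate * g / len(X) for w, g in zip(weights, grad)]
    return weights

# Hypothetical dyads among participants: [closeness (1-5), transitive triads].
X = [[1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 2],
     [4, 1], [4, 2], [5, 1], [5, 3], [2, 2], [4, 0]]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1]  # 1 = nomination was reciprocated

weights = fit_logit(X, y)

# Estimated chance that a missing student B returns A's nomination.
p_strong = predict(weights, [5, 3])
p_weak = predict(weights, [1, 0])
print(p_strong, p_weak)
```

The fitted probabilities can then be applied to dyads where a participant nominated a non-participant.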
Has anyone done something similar to this? Any
thoughts?
Thanks in advance,
Bob Faris
7. Carter Butts
bob faris wrote:
> We recognized that it would be impossible
> to impute their missing friendship nominations, but thought that
> it might be reasonable to impute reciprocity--i.e., given that a
> participating A nominates a missing B, we
> attempt to model whether B reciprocates. So we ran a logit
> on all dyads where there was at least one tie, the dependent variable
> being whether the tie was reciprocated. We included in our model
> various measures of tie strength (emotional closeness, frequency
> of interaction, etc.), the outdegree & indegree of sender, the
> indegree of the recipient, and a measure of transitivity (i.e.,
> how many triads would be transitive if
> B reciprocates). Has anyone done something similar to this? Any
> thoughts?
Aside from the question of estimating the reciprocity
rate, a somewhat more robust procedure would be to use your reciprocity
estimates to draw repeatedly from the set of graphs (conditional
on size, observed ties,
and your estimated reciprocity rates), to calculate your quantity
of interest on each draw, and then to study the distribution which
results (instead of any single estimate). If your results hold across
the vast majority of draws (and if you believe the reciprocity model),
then you have some evidence that results would be unlikely to change
if you had the full data. (This can be thought of as multiple imputation,
although one can frame it in other ways if one likes.)
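The repeated-draw idea can be sketched as follows. This is a simplified illustration under assumptions: each unobserved return tie is drawn independently with its estimated probability, the quantity of interest is the number of mutual dyads, and the toy data, probabilities, and function names are hypothetical.

```python
import random
import statistics

def draw_completed_network(observed, candidate_ties, rng):
    """One draw: add each unobserved return tie with its estimated probability."""
    drawn = {tie for tie, p in candidate_ties.items() if rng.random() < p}
    return set(observed) | drawn

def mutual_dyads(ties):
    """Number of pairs connected in both directions."""
    return sum(1 for (i, j) in ties if i < j and (j, i) in ties)

# Observed ties are held fixed; D did not participate.
observed = {("A", "B"), ("B", "A"), ("A", "D"), ("C", "D"), ("B", "C")}
# Estimated probabilities that D returns each nomination (hypothetical values).
candidate_ties = {("D", "A"): 0.5, ("D", "C"): 0.5}

rng = random.Random(0)
draws = [mutual_dyads(draw_completed_network(observed, candidate_ties, rng))
         for _ in range(1000)]

print(statistics.mean(draws), min(draws), max(draws))
```

The conclusion is then read off the distribution of `draws`, not from any single completed network.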
For that matter, though, you might want to consider
other, more general models for the data which incorporate other
sorts of structural biases. Since I think I know the data set to
which you refer, I believe that Jim has already done work on this.
Extrapolating p*/ERGM parameters to graphs of larger size (in order
to take draws, which is what you'd ultimately want to do) is somewhat
problematic, but the mean value parameterization might provide a
way around this (especially since you're not adding all that many
vertices). Perhaps Mark or others would like to chime in here....
-Carter
8. Gindo Tampubolon, University of Manchester
Hi Ines,
I hope this is not too late. You might also want
to take a look at Peter Hoff's work, e.g. 'Random Effects Models
for Network Data'. A few weeks ago, if I remember correctly, he announced
on this list the availability of his software, which could possibly do this
kind of thing.
HTH
© 2005 The President and Fellows of Harvard College