11.2.3 Missing Data
Data can be incomplete in ways other than having an unobserved
variable. A data set can simply be missing the values of some variables for
some of the tuples.
When some of the values of the variables are missing, one must be very careful in using
the data set because the missing data may be correlated with the
phenomenon of interest.
Example 11.6:
Suppose you have a (claimed) treatment for a disease that does nothing to
cure the disease or relieve its symptoms. All it does is make sick people
sicker. If you were to randomly assign patients to the treatment, the
sickest people would drop out of the study, because they would become too sick to
participate. The sick people who took the treatment would drop out at
a faster rate than the sick people who did not take the treatment. Thus,
if the patients for whom the data is missing are ignored,
it looks like the treatment works; there are fewer sick people in
the set of those who took the treatment and remained in the study!
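The effect in this example can be reproduced with a small simulation. The following is only a sketch under made-up assumptions, none of which come from the example itself: each patient is sick with probability 0.5 regardless of treatment, healthy patients never drop out, and sick patients drop out with probability 0.2 if untreated and 0.6 if treated. The names simulate, p_sick, and p_dropout_if_sick are hypothetical.

```python
import random


def simulate(n_patients=100_000, seed=0):
    """Simulate a randomized study in which dropout depends on sickness and treatment."""
    rng = random.Random(seed)

    # Hypothetical parameters, chosen only for illustration.
    p_sick = 0.5                                          # same in both arms: treatment cures no one
    p_dropout_if_sick = {"control": 0.2, "treated": 0.6}  # treated sick patients get sicker and leave

    counts = {
        arm: {"all": 0, "all_sick": 0, "stay": 0, "stay_sick": 0}
        for arm in ("control", "treated")
    }

    for _ in range(n_patients):
        arm = "treated" if rng.random() < 0.5 else "control"  # random assignment
        sick = rng.random() < p_sick
        # Only sick patients drop out; healthy patients always remain.
        dropped = sick and rng.random() < p_dropout_if_sick[arm]

        c = counts[arm]
        c["all"] += 1
        c["all_sick"] += sick
        if not dropped:
            c["stay"] += 1
            c["stay_sick"] += sick

    return {
        arm: {
            # Fraction sick among everyone assigned to this arm.
            "true_rate": c["all_sick"] / c["all"],
            # Fraction sick among those whose data is not missing.
            "observed_rate": c["stay_sick"] / c["stay"],
        }
        for arm, c in counts.items()
    }


result = simulate()
for arm, rates in result.items():
    print(arm, rates)
```

Under these assumptions the true fraction of sick patients is 0.5 in both groups, but among the patients who remain it is 0.4/0.9 (about 0.44) for the untreated group and 0.2/0.7 (about 0.29) for the treated group in expectation. Ignoring the patients who dropped out therefore makes the useless treatment appear to help.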
If the data is missing at random, so that whether a value is missing is
unrelated to the phenomenon of interest, the missing data can be
ignored. However, "missing at random" is a strong assumption. In general, an agent
should construct a model of why the data is missing or, preferably,
it should go out into the world and find out why the data is missing.