Sections 1.3, 1.4, 1.5
The proper collection of data is as important as its analysis. In this section we discuss various questions related to data collection.
Most of the time, a sample is selected from the population, and data is collected on that sample. Some of the questions that arise in this context are:
The selection of a sample is crucial. For instance, many years ago a national poll meant to predict the winner of an election produced completely wrong estimates because it targeted a sample (individuals who were well off during a depression) that was not representative of the population as a whole.
No amount of clever analysis can compensate for badly collected data.
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Sir Ronald A. Fisher
There are broadly speaking four different sources of data:
Observational studies follow a number of individuals, recording information about them, often over an extended period of time. Typically in observational studies there is little direct involvement with and control over the subjects.
Observational studies come in two forms:
Controlled experiments are the gold standard of data collection, but they are not always practical. In a controlled experiment, the researcher directly sets the values of the predictor variables and measures the response. For example, if we wanted to investigate whether smoking causes cancer, a very unethical way to do this would be a controlled experiment in which we force people to smoke for extended periods of time and observe whether they develop lung cancer.
Controlled experiments differ from observational studies in that they are more intrusive: the researcher controls the experiment parameters rather than simply monitoring them.
Sampling from a population is a very common approach to collecting information. It carries many risks, however, and they all come down to the same outcome: selecting a sample that is not representative of the population as a whole. If the sample does not represent the population, there is very little we can do to draw reliable conclusions.
A key step in ensuring a proper sample is to start by selecting a random sample, by which we mean that every individual in the population has an equal chance of being selected for our sample. This is a great start as it ensures for example that the researcher doesn’t just contact people they know, who are likely to share the researcher’s views and therefore not really represent the population.
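As a minimal sketch of the idea, the following Python snippet draws a random sample in which every individual has the same chance of being selected. The population of 1000 labelled individuals, the sample size of 50, and the function name `simple_random_sample` are all made up for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every individual is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)  # sampling without replacement


# Hypothetical population: 1000 individuals identified only by a label.
population = [f"person_{i}" for i in range(1000)]

sample = simple_random_sample(population, 50, seed=1)
print(len(sample), "individuals selected")
```

Because the selection is left to the random number generator, the researcher has no way to favor people they know.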
Even within sampling, however, there are a couple of different possibilities:
A stratified random sample aims to ensure that we have sufficient representation from the various geographical locations or groups that we study. The general goal is to group together in the same stratum cases that are in some way similar.
For example, if we wanted to investigate the difference between Greeks and non-Greeks, we could divide our population into 4 strata: male Greeks, female Greeks, male non-Greeks and female non-Greeks. Then we draw a simple random sample within each stratum, which ensures sufficient representation of all the different groups. This may, however, result in unequal representation overall: we may end up with just as many male Greeks as male non-Greeks in the sample, even though there are perhaps very few male non-Greeks in the population.
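The stratified procedure can be sketched in Python as follows. The stratum sizes, the per-stratum sample size of 20, and the function name `stratified_sample` are assumptions chosen for illustration, not real data.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Group individuals into strata, then draw a simple random sample in each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    # Take the same number from every stratum (or the whole stratum if it is smaller).
    return {
        label: rng.sample(members, min(n_per_stratum, len(members)))
        for label, members in strata.items()
    }


# Made-up stratum sizes: male non-Greeks are deliberately rare.
sizes = {
    ("male", "greek"): 400,
    ("female", "greek"): 400,
    ("male", "non-greek"): 40,
    ("female", "non-greek"): 160,
}
population = [
    {"sex": sex, "status": status}
    for (sex, status), count in sizes.items()
    for _ in range(count)
]

sample = stratified_sample(
    population, lambda p: (p["sex"], p["status"]), n_per_stratum=20, seed=2
)
for label, members in sample.items():
    print(label, len(members))
```

Every stratum contributes 20 individuals here, so male non-Greeks make up a quarter of the sample although they are only 4% of this hypothetical population. That is the unequal overall representation described above.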
In a cluster sample we divide our population into lots of small clusters, then select some clusters at random and sample everyone in those clusters. For example, in order to do a demographic study, we could choose some counties around the country at random, then collect data on (almost) everyone living in those counties.
The difference from stratified samples is that we expect individuals within a cluster to differ from each other, but the clusters overall to be similar to each other. In stratified sampling, on the other hand, we expect the individuals within a stratum to be similar to each other, but the strata to differ from each other.
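For contrast, here is a minimal Python sketch of cluster sampling. The 20 counties with 30 residents each and the function name `cluster_sample` are hypothetical. Note that the randomness applies to the choice of clusters, not to the individuals inside them.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every individual in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random choice of cluster names
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: 20 counties, each with 30 residents.
counties = {
    f"county_{c}": [f"resident_{c}_{i}" for i in range(30)]
    for c in range(20)
}

sample = cluster_sample(counties, n_clusters=3, seed=3)
for name, residents in sample.items():
    print(name, len(residents))
```

In the stratified sketch every group is visited and only some individuals are taken from each. Here only some groups are visited, but everyone in them is included.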
Here is a brief summary of the methods:
As experiments involve a greater degree of control, a number of principles have emerged to ensure proper experimentation and minimize biases.