Note: Much of the content in the first half of this module is presented in a 38-minute lecture by Professor Lisa Sullivan. A transcript of the lecture on the basics of probability is also available.
Sampling individuals from a population into a sample is a critically important step in any biostatistical analysis, because we are making generalizations about the population based on that sample. When selecting a sample from a population, it is important that the sample is representative of the population, i.e., the sample should be similar to the population with respect to key characteristics. For example, studies have shown that the prevalence of obesity is inversely related to educational attainment (i.e., persons with higher levels of education are less likely to be obese). Consequently, if we were to select a sample from a population in order to estimate the overall prevalence of obesity, we would want the educational level of the sample to be similar to that of the overall population in order to avoid an over- or underestimate of the prevalence of obesity.
There are two types of sampling: probability sampling and non-probability sampling. In probability sampling, each member of the population has a known probability of being selected. In non-probability sampling, members of the population are selected without the use of probability, so the chance that any given member is selected is unknown.
Simple Random Sampling
In simple random sampling, one starts by identifying the sampling frame, i.e., a complete list or enumeration of all of the population elements (e.g., people, houses, phone numbers, etc.). Each of these is assigned a unique identification number, and elements are selected at random to determine the individuals to be included in the sample. As a result, each element has an equal chance of being selected, and the probability of being selected can be easily computed. This sampling strategy is most useful for small populations, because it requires a complete enumeration of the population as a first step.
Many introductory statistics textbooks contain tables of random numbers that can be used to ensure random selection, and statistical computing packages can be used to generate random numbers. Excel, for example, has a built-in function that can be used to generate random numbers.
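As an illustration of the procedure described above, the following is a minimal sketch of simple random sampling using Python's standard `random` module. The sampling frame of 1,000 ID numbers, the sample size of 100, and the function name `simple_random_sample` are assumptions chosen for the example, not part of the module.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n elements from the frame without replacement.

    Every element has the same chance, n / len(frame), of being selected.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: unique ID numbers 1 through 1,000.
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 100, seed=42)

# Probability that any given element is selected: n / N.
inclusion_probability = len(sample) / len(frame)
print(sorted(sample)[:10])
print(inclusion_probability)
```

Sampling without replacement means that no ID number can appear in the sample twice.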
Systematic Sampling
Systematic sampling also begins with the complete sampling frame and assignment of unique identification numbers. However, in systematic sampling, subjects are selected at fixed intervals, e.g., every third or every fifth person is selected. The spacing or interval between selections is determined by the ratio of the population size to the sample size (N/n). For example, if the population size is N=1,000 and a sample size of n=100 is desired, then the sampling interval is 1,000/100 = 10, so every tenth person is selected into the sample. The selection process begins by selecting the first person at random from the first ten subjects in the sampling frame using a random number table; every tenth subject after that is then selected.
If the desired sample size is n=175, then the sampling interval is 1,000/175 ≈ 5.7, so we round this down to five and take every fifth person. Once the first person is selected at random from the first five subjects, every fifth person is selected from that point on through the end of the list. Because the interval was rounded down, this yields 200 selections, somewhat more than the 175 desired.
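The interval arithmetic and selection steps for systematic sampling can be sketched as follows. This is a minimal illustration under assumed inputs: the frame of 1,000 ID numbers and the function name `systematic_sample` are hypothetical, and the interval is rounded down as in the text.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select every k-th element, where k = N // n (interval rounded down)."""
    N = len(frame)
    if not 1 <= n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    k = N // n  # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting position among the first k
    return frame[start::k]  # every k-th element through the end of the list


# Hypothetical sampling frame: unique ID numbers 1 through 1,000.
frame = list(range(1, 1001))

s100 = systematic_sample(frame, 100, seed=1)  # interval 1000 // 100 = 10
s175 = systematic_sample(frame, 175, seed=1)  # interval 1000 // 175 = 5

print(len(s100), s100[:3])
print(len(s175), s175[:3])
```

With an interval of 5, running through the end of a 1,000-element list selects 200 elements, which is more than the 175 requested.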
With systematic sampling like this, it is possible to obtain non-representative samples if there is a systematic arrangement of individuals in the population. For example, suppose that the population of interest consisted of married couples and that the sampling frame was set up to list each husband and then his wife. Selecting every tenth person (or using any other even-numbered interval) would result in selecting all males or all females, depending on the starting point. This is an extreme example, but one should consider all potential sources of systematic bias in the sampling process.
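The married-couples example can be checked with a short sketch. The frame of 500 hypothetical couples and the helper name `every_kth` are assumptions made for illustration only.

```python
# Hypothetical frame: 500 couples, each husband listed just before his wife.
frame = [
    (couple_id, role)
    for couple_id in range(1, 501)
    for role in ("husband", "wife")
]


def every_kth(frame, k, start):
    """Take every k-th element beginning at 0-based position `start`."""
    return frame[start::k]


# An even interval always lands on the same role as the starting position.
roles_even_start_husband = {role for _, role in every_kth(frame, 10, 0)}
roles_even_start_wife = {role for _, role in every_kth(frame, 10, 1)}

# An odd interval alternates between husbands and wives.
roles_odd = {role for _, role in every_kth(frame, 5, 0)}

print(roles_even_start_husband)
print(roles_even_start_wife)
print(sorted(roles_odd))
```

In this arrangement an odd interval avoids the problem, but in general the remedy depends on how the frame is ordered.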
Stratified Sampling
In stratified sampling, we split the population into non-overlapping groups or strata (e.g., men and women, people under 30 years of age and people 30 years of age and older), and then sample within each stratum. The purpose is to ensure adequate representation of subjects in each stratum.
Sampling within each stratum can be by simple random sampling or systematic sampling. For example, if a population contains 70% men and 30% women, and we want the same representation in the sample, we can stratify by sex and sample men and women in those proportions. If the desired sample size is n=200, then n=140 men and n=60 women could be sampled either by simple random sampling or by systematic sampling.
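The 70%/30% example can be sketched as proportional allocation followed by simple random sampling within each stratum. The population of 700 men and 300 women, the ID labels, and the function name `stratified_sample` are assumptions for illustration.

```python
import random


def stratified_sample(strata, n, seed=None):
    """Sample each stratum in proportion to its share of the population.

    `strata` maps a stratum name to the list of its members.
    Rounding can make the total differ slightly from n in general.
    """
    rng = random.Random(seed)
    N = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_h = round(n * len(members) / N)  # stratum sample size
        sample[name] = rng.sample(members, n_h)  # simple random sample
    return sample


# Hypothetical population: 70% men and 30% women.
strata = {
    "men": [f"M{i}" for i in range(700)],
    "women": [f"W{i}" for i in range(300)],
}
sample = stratified_sample(strata, 200, seed=0)

for name, members in sample.items():
    print(name, len(members))
```

Systematic sampling could be substituted for `rng.sample` within each stratum without changing the allocation step.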
There are many situations in which it is not possible to generate a sampling frame, and the probability that any individual is selected into the sample is unknown. What is most important, however, is selecting a sample that is representative of the population. In these situations non-probability samples can be used. Some examples of non-probability samples are described below.
Convenience Sampling
In convenience sampling, we select individuals into our sample based on their availability to the investigators rather than selecting subjects at random from the entire population. For example, we might approach patients seeking medical care at a particular hospital in a waiting or reception area. As a result, the extent to which the sample is representative of the target population is not known. Convenience samples are useful for collecting preliminary or pilot data, but they should be used with caution for statistical inference, since they may not be representative of the target population.
Quota Sampling
In quota sampling, we determine a specific number of individuals to select into our sample in each of several specific groups. This is similar to stratified sampling in that we develop non-overlapping groups and sample a predetermined number of individuals within each. For example, suppose our desired sample size is n=300, and we wish to ensure that the distribution of subjects' ages in the sample is similar to that in the population. We know from census data that approximately 30% of the population are under age 20; 40% are between 20 and 49; and 30% are 50 years of age and older. We would then sample n=90 persons under age 20, n=120 between the ages of 20 and 49 and n=90 who are 50 years of age and older.
| Age (years) | Distribution in Population | Quota to Achieve n=300 |
| Under 20 | 30% | 90 |
| 20 to 49 | 40% | 120 |
| 50 and older | 30% | 90 |
Sampling proceeds until these totals, or quotas, are reached. Quota sampling differs from stratified sampling in that individuals within each stratum of a stratified sample are selected at random, whereas individuals in a quota sample are not. Quota sampling achieves a representative age distribution, but it isn't a random sample, because the sampling frame is unknown. Therefore, the sample may not be representative of the population with respect to other characteristics.
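The quota arithmetic and the "sample until the quotas are reached" rule can be sketched as follows. The stream of arriving subjects is simulated and entirely hypothetical, as are the names `quota_sample` and `age_group`; the point is only that subjects are taken in the order they become available, not at random from a sampling frame.

```python
import random
from collections import Counter


def age_group(age):
    """Assign an age in years to one of the three non-overlapping groups."""
    if age < 20:
        return "under 20"
    if age < 50:
        return "20 to 49"
    return "50 and older"


def quota_sample(arrivals, quotas, group_of):
    """Accept subjects in arrival order until every group's quota is filled."""
    remaining = dict(quotas)
    selected = []
    for person in arrivals:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas reached
    return selected


# Quotas for n = 300 from the population percentages given in the text.
n = 300
percentages = {"under 20": 30, "20 to 49": 40, "50 and older": 30}
quotas = {group: n * pct // 100 for group, pct in percentages.items()}

# Hypothetical stream of available subjects, represented by their ages.
rng = random.Random(0)
arrivals = [rng.randint(0, 89) for _ in range(5000)]

selected = quota_sample(arrivals, quotas, age_group)
counts = Counter(age_group(age) for age in selected)
print(quotas)
print(dict(counts))
```

The age distribution of the result matches the quotas by construction, but nothing in the procedure controls who happens to arrive first within each group.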