Suppose that the screening test is based on analysis of a blood sample taken from women early in pregnancy. The investigator plans on using a 95% confidence interval (so Z=1.96) and wants a margin of error of 5 units. The difference in pain will be computed for each patient. This value can be used to plan the trial. σ again reflects the standard deviation of the outcome variable. This number is usually represented by n. The size of a sample influences two statistical properties: 1) the precision of our estimates and 2) the power of the study to draw conclusions. Again, these sample sizes refer to the numbers of participants with complete data. When performing sample size computations, we use the large sample formula shown here. Many mistakenly think the two terms can be used interchangeably. The investigators hypothesized a 10% attrition (or drop-out) rate (in both groups). This may or may not be a reasonable assumption. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Notice that there is much higher power when there is a larger difference between the mean under H0 as compared to H1 (i.e., 90 versus 98). How to calculate a sample size: it is fairly easy to determine your desired sample size. An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. To facilitate interpretation, we will continue this discussion with X̄ as opposed to Z. How many patients should be recruited into the study? Consequently, if there is no information available to approximate p, then p=0.5 can be used to generate the most conservative, or largest, sample size. The study reported a standard deviation in weight lost over 8 weeks on a low fat diet of 8.4 pounds and a standard deviation in weight lost over 8 weeks on a low carbohydrate diet of 7.7 pounds.
Although there is a vast literature discussing sample size estimation, incorrect or improper formulas continue to be applied. Had we assumed a standard deviation of 15, the sample size would have been n=35. The sample size is computed as follows: n = p(1-p)(Z/E)² = 0.0043(1 - 0.0043)(1.96/0.0010)² = 16,447.8. A sample of size n=16,448 will ensure that a 95% confidence interval estimate of the prevalence of breast cancer is within 0.10% (or to within 10 women per 10,000) of its true value. A medical device manufacturer produces implantable stents. An investigator hypothesizes that in people free of diabetes, fasting blood glucose, a risk factor for coronary heart disease, is higher in those who drink at least 2 cups of coffee per day. The investigators must decide if this would be sufficiently precise to answer the research question. ES is the effect size, defined as ES = |p1 - p2| / √(p(1-p)), where |p1 - p2| is the absolute value of the difference in proportions between the two groups expected under the alternative hypothesis, H1, and p is the overall proportion, based on pooling the data from the two comparison groups (p can be computed by taking the mean of the proportions in the two comparison groups, assuming that the groups will be of approximately equal size). During a typical year, approximately 35% of the students experience flu. Similar to the situation for two independent samples and a continuous outcome at the top of this page, it may be the case that data are available on the proportion of successes in one group, usually the untreated (e.g., placebo control) or unexposed group. A sample of size n=869 will ensure that a two-sided test with α=0.05 has 90% power to detect a 5% difference in the proportion of patients with a history of cardiovascular disease who have an elevated LDL cholesterol level. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference.
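To make the effect size for two proportions concrete, here is a minimal Python sketch. It computes ES with p taken as the mean of the two proportions, as described above, and then applies the usual large-sample formula for two equal groups, n per group = 2((Z1-α/2 + Z1-β)/ES)². The function name and the proportions 0.60 and 0.90 are assumptions chosen for illustration; they are not values stated in this text.

```python
import math
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test comparing two proportions.

    Assumes two groups of roughly equal size and the large-sample
    (normal approximation) formula.
    """
    p = (p1 + p2) / 2                            # pooled proportion (mean of the two)
    es = abs(p1 - p2) / math.sqrt(p * (1 - p))   # effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)


# Hypothetical cure proportions of 60% and 90% (illustrative assumption).
print(n_per_group_two_proportions(0.60, 0.90))  # 33
```

Always round up, since rounding down would leave the test slightly underpowered.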
A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. In studies where the plan is to perform a test of hypothesis on the mean difference in a continuous outcome variable based on matched data, the hypotheses of interest are H0: μd = 0 versus H1: μd ≠ 0, where μd is the mean difference in the population. Each child will then be randomly assigned to either the low fat or the low carbohydrate diet. In designing studies most people consider power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. Interested readers can see Fleiss (reference 4) for more details. Therefore, a sample of size n=31 will ensure that a two-sided test with α=0.05 has 80% power to detect a 5 mg/dL difference in mean fasting blood glucose levels. The formula for determining sample size to ensure that the test has a specified power is n = ((Z1-α/2 + Z1-β) / ES)², where α is the selected level of significance and Z1-α/2 is the value from the standard normal distribution holding 1-α/2 below it. Here we shed light on some methods and tools for sample size determination. However, it is more often the case that data on the variability of the outcome are available from only one group, usually the untreated (e.g., placebo control) or unexposed group. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. An alternative is to conduct a matched case-control study rather than the above unmatched design. To test the hypotheses, suppose we select a sample of size n=100.
A sample of size n=32 patients with migraine will ensure that a two-sided test with α=0.05 has 80% power to detect a mean difference of 10 points in pain before and after treatment, assuming that all 32 patients complete the treatment. Figure: Rejection region for the test of H0: μ = 90 versus H1: μ ≠ 90 at α=0.05. We describe a novel strategy for power and sample size determination developed for studies utilizing investigational technologies with limited available preliminary data, specifically of imaging biomarkers. In the planned study, participants will be asked to fast overnight and to provide a blood sample for analysis of glucose levels. National data suggest that 12% of infants are born prematurely. Statistical power is a fundamental consideration when designing research experiments. Usually, studies have a power of around 80%, which means that you accept the possibility that in 20% of the cases, the real difference was missed (you concluded there was no effect when there was one). Chirayath M. Suchindran, in Encyclopedia of Social Measurement, 2005.
[Note: The resultant sample size might be small, and in the analysis stage, the appropriate confidence interval formula must be used.] Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria. In studies where the plan is to estimate the difference in means between two independent populations, the formula for determining the sample sizes required in each comparison group is ni = 2(Zσ/E)², where ni is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used and E is the desired margin of error. If they anticipate a 10% attrition rate, the investigators should enroll 556 participants.
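As a sketch of how the per-group formula for a difference in means might be applied, the Python snippet below evaluates ni = 2(Zσ/E)² and rounds up. The standard deviation of 17.1 is the HDL cholesterol value quoted elsewhere in this module; the margin of error of 3 units and the function name are assumptions made for illustration.

```python
import math


def n_per_group_mean_difference(z, sigma, margin):
    """Per-group n so that the CI for a difference in two means
    has a margin of error no larger than `margin`."""
    return math.ceil(2 * (z * sigma / margin) ** 2)


# sigma = 17.1 (HDL cholesterol); margin of 3 units is an assumed target.
n_each = n_per_group_mean_difference(1.96, 17.1, 3)
print(n_each, 2 * n_each)  # 250 500
```

Under these assumptions the study needs 500 participants with complete data, which is consistent with enrolling 556 when 10% attrition is expected.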
Sample size for case-control studies is dependent upon prevalence of exposure, not the rate of outcome. Here we are planning a study to generate a 95% confidence interval for the unknown population proportion, p. The equation to determine the sample size for determining p seems to require knowledge of p, but this is obviously a circular argument, because if we knew the proportion of successes in the population, then a study would not be necessary! However, the estimate must be realistic. Suppose that a similar study was conducted 2 years ago and found that the prevalence of smoking was 27% among freshmen. The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. Compute the sample size required to ensure high power when hypothesis testing. We can take the formula above and, with some algebra, solve for n: First, multiply both sides of the equation by the square root of n. Then cancel out the square root of n from the numerator and denominator on the right side of the equation (since any number divided by itself is equal to 1). A two-sided test will be used with a 5% level of significance. To solve for n, we must input "Z," "σ," and "E." We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size. The critical values for a two-sided test with α=0.05 are 86.08 and 93.92 (these values correspond to -1.96 and 1.96, respectively, on the Z scale), so the decision rule is as follows: Reject H0 if X̄ < 86.08 or if X̄ > 93.92. The sample sizes (i.e., numbers of women who smoked and did not smoke during pregnancy) can be computed using the formula shown above. Again the issue is determining the variability in the outcome of interest (σ), here the standard deviation in pounds lost over 8 weeks. If so, the known proportion can be used for both p1 and p2 in the formula shown above.
The margin of error in the one sample confidence interval for μ can be written as follows: E = Zσ/√n. Our goal is to determine the sample size, n, that ensures that the margin of error, "E," does not exceed a specified value. This translates to a proportion of 0.0043 (0.43%) or a prevalence of 43 per 10,000 women. The research team, with input from clinical investigators and biostatisticians, must carefully evaluate the implications of selecting a sample of size n = 5,000, n = 16,448 or any size in between. Sample size determination is one of the most essential components of every research study. Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study over 12 weeks. The determination of the appropriate sample size involves statistical criteria as well as clinical or practical considerations. The study will be conducted in the spring. However, treatment with another antibiotic frequently does not cure the C. difficile infection. Estimation of statistical power and sample size is a key aspect of experimental design. To be informative, an investigator might want the margin of error to be no more than 5 or 10 pounds (meaning that the 95% confidence interval would have a width (lower limit to upper limit) of 10 or 20 pounds). This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. What is sample size and why is it important? Each child will follow the assigned diet for 8 weeks, at which time they will again be weighed. The concept of statistical power can be difficult to grasp. Note that the above is based on the assumption that the prevalence of breast cancer in Boston is similar to that reported nationally.
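Solving the margin of error expression E = Zσ/√n for n gives n = (Zσ/E)². The short Python sketch below, with hypothetical function names, computes that sample size and then checks the resulting margin of error. It uses the values mentioned in this module: Z=1.96, a margin of error of 5 units, and a standard deviation of 15.

```python
import math


def n_for_mean(z, sigma, margin):
    """Smallest n for which the margin of error Z*sigma/sqrt(n) is <= margin."""
    return math.ceil((z * sigma / margin) ** 2)


def margin_of_error(z, sigma, n):
    """Margin of error of a one-sample confidence interval for a mean."""
    return z * sigma / math.sqrt(n)


n = n_for_mean(1.96, 15, 5)
print(n)  # 35
print(margin_of_error(1.96, 15, n) <= 5)  # True
```

The result is the minimum number of participants with complete data; it does not yet allow for attrition.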
How many patients should be enrolled in the trial to ensure that the power of the test is 80% to detect this difference? Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view. In such hypothesis-free science, neither the number or class of important analytes nor the effect size is known a priori. In studies where the plan is to perform a test of hypothesis comparing the proportions of successes in two independent populations, the hypotheses of interest are H0: p1 = p2 versus H1: p1 ≠ p2, where p1 and p2 are the proportions in the two comparison populations. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Figure: Distribution of X̄ under H0: μ = 90 and under H1: μ = 94. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. The sample size is then calculated so that inferences and decisions about the parameter can be correctly made. The effect size represents the meaningful difference in the population mean - here 95 versus 100, or 0.51 standard deviation units different. A study that is insufficiently precise or lacks the power to reject a false null hypothesis is a waste of time and money. An investigator wants to compare two diet programs in children who are obese. Suppose, for example, we increase α to α=0.10. The upper critical value would be 93.29 (90 + 1.645 × 2) instead of 93.92 (90 + 1.96 × 2). Just as it is important to consider both statistical and clinical significance when interpreting results of a statistical analysis, it is also important to weigh both statistical and logistical issues in determining the sample size for a study.
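The conceptual discussion of power can be checked numerically. The Python sketch below computes the power of a two-sided Z test of H0: μ = 90 for a given true mean. It assumes a standard error of 2, which is what an upper critical value of 93.92 at Z=1.96 implies; the function name is hypothetical.

```python
from statistics import NormalDist


def power_two_sided_z(mu0, mu1, se, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z * se, mu0 + z * se   # critical values for the sample mean
    sampling = NormalDist(mu=mu1, sigma=se)     # distribution of the sample mean under H1
    return sampling.cdf(lower) + (1 - sampling.cdf(upper))


for true_mean in (94, 98):
    print(true_mean, round(power_two_sided_z(90, true_mean, se=2), 3))

# Raising alpha moves the critical values toward 90 and increases power.
print(round(power_two_sided_z(90, 94, se=2, alpha=0.10), 3))
```

Running it shows power of roughly one half when the true mean is 94 and close to one when it is 98, which is the pattern described in the text.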
Three factors influence statistical power: (a) the significance level (α), (b) the magnitude or size of the treatment effect (effect size), and (c) the sample size (n). Antibiotic therapy sometimes diminishes the normal flora in the colon to the point that C. difficile flourishes and causes infection with symptoms ranging from diarrhea to life-threatening inflammation of the colon. If women are enrolled into the study during pregnancy, then more than 57 women will need to be enrolled so that after excluding those who deliver prematurely, 57 with outcome information will be available for analysis. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study performed in a different but comparable population. It is critical to understand that different study designs need different methods of sample size estimation. For your computations, use a two-sided test with a 5% level of significance. A recent report from the Framingham Heart Study indicated that 26% of people free of cardiovascular disease had elevated LDL cholesterol levels, defined as LDL > 159 mg/dL (reference 9). An investigator hypothesizes that a higher proportion of patients with a history of cardiovascular disease will have elevated LDL cholesterol. For example, if 5% of the women are expected to deliver prematurely (i.e., 95% will deliver full term), then 60 women must be enrolled to ensure that 57 deliver full term. The value of p that maximizes p(1-p) is p=0.5. The formulas presented here generate estimates of the necessary sample size(s) required based on statistical criteria. Clostridium difficile is also referred to as "C. difficile" or "C. diff." The sample size must be large enough to adequately answer the research question, yet not too large so as to involve too many patients when fewer would have sufficed.
Look at the chart below and identify which study found a real treatment effect and which one didn’t. Buschman NA, Foster G, Vickers P. Adolescent girls and their babies: achieving optimal birth weight. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. Now substitute the effect size and the appropriate Z values for α and power to compute the sample size. The mean fasting blood glucose level in people free of diabetes is reported as 95.0 mg/dL with a standard deviation of 9.8 mg/dL (reference 7). If the mean blood glucose level in people who drink at least 2 cups of coffee per day is 100 mg/dL, this would be important clinically. The formula shown above generates sample size estimates for samples of equal size. Enter the expected frequency (an estimate of the true prevalence, e.g., 80% ± your minimum standard). The Z1-β values for these popular scenarios are Z0.80 = 0.84 for 80% power and Z0.90 = 1.282 for 90% power. ES is the effect size, defined as ES = |μ1 - μ0| / σ, where μ0 is the mean under H0, μ1 is the mean under H1 and σ is the standard deviation of the outcome of interest. We conduct a study and generate a 95% confidence interval as follows: 125 ± 40 pounds, or 85 to 165 pounds. The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated. This tutorial shows how to determine the optimal sample size. The Cohort or Cross-Sectional window opens. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure. This procedure is designed to help determine the appropriate sample size and parameters for common control charts.
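For the fasting blood glucose example, the substitution can be sketched in Python as follows. The snippet assumes the one-sample formula n = ((Z1-α/2 + Z1-β)/ES)² with ES = |μ1 - μ0|/σ, and takes the Z values from the standard normal distribution rather than from rounded table entries; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_one_sample_mean(mu0, mu1, sigma, alpha=0.05, power=0.80):
    """Sample size for a two-sided one-sample test of a mean."""
    es = abs(mu1 - mu0) / sigma                      # effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)             # about 0.84 for 80% power
    return math.ceil(((z_alpha + z_beta) / es) ** 2)


# Mean 95.0 mg/dL under H0, 100 mg/dL under H1, standard deviation 9.8 mg/dL.
print(n_one_sample_mean(95.0, 100.0, 9.8))  # 31
```

Asking for 90% power instead of 80% raises the required sample size, since Z1-β increases from about 0.84 to about 1.28.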
1-β is the selected power, and Z1-β is the value from the standard normal distribution holding 1-β below it. In participants who attended the seventh examination of the Offspring Study and were not on treatment for high cholesterol, the standard deviation of HDL cholesterol is 17.1. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. In studies where the plan is to estimate the difference in proportions between two independent populations (i.e., to estimate the risk difference), the formula for determining the sample sizes required in each comparison group is ni = [p1(1-p1) + p2(1-p2)](Z/E)², where ni is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), and E is the desired margin of error. HDL cholesterol will be measured in each participant after 12 weeks on the assigned treatment. In order to ensure that the total sample size of 112 is available at 8 weeks, the investigator needs to recruit more participants to allow for attrition. N (number to enroll) × (% retained) = desired sample size; therefore, N (number to enroll) = desired sample size / (% retained). This concept was discussed in the module on Hypothesis Testing. Samples of size n1=33 and n2=33 will ensure that the test of hypothesis will have 80% power to detect this difference in the proportions of patients who are cured of C. diff. The power is the probability that the study will be able to detect a true effect of a drug or intervention of a specified size or greater. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. It goes hand-in-hand with sample size. How precisely can we estimate the prevalence with a sample of size n=5,000?
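The attrition adjustment, N = desired sample size / (% retained), is easy to get wrong by rounding down. A minimal Python sketch follows; the function name is hypothetical, and integer arithmetic is used so that the rounding up is exact. The first call uses the premature delivery example (57 completers, 5% excluded); the second assumes a target of 500 completers with 10% attrition.

```python
def number_to_enroll(n_complete, attrition_pct):
    """Participants to enroll so that n_complete remain after attrition.

    attrition_pct is a whole-number percentage between 0 and 99.
    """
    if not 0 <= attrition_pct < 100:
        raise ValueError("attrition_pct must be in the range 0..99")
    retained_pct = 100 - attrition_pct
    # Integer ceiling of n_complete * 100 / retained_pct.
    return (n_complete * 100 + retained_pct - 1) // retained_pct


print(number_to_enroll(57, 5))    # 60
print(number_to_enroll(500, 10))  # 556
```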
Nonetheless, there is a direct relationship between α and power (as α increases, so does power). p1 and p2 are the proportions of successes in each comparison group. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. In order to compute the effect size, an estimate of the variability in systolic blood pressures is needed. This module will focus on formulas that can be used to estimate the sample size needed to produce a confidence interval estimate with a specified margin of error (precision) or to ensure that a test of hypothesis has a high probability of detecting a meaningful difference in the parameter. The formula produces the minimum sample size to ensure that the margin of error in a confidence interval will not exceed E. In planning studies, investigators should also consider attrition or loss to follow-up. Recall from the module on Hypothesis Testing that, when we performed tests of hypothesis comparing the means of two independent groups, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome. Birth weights in infants clearly have a much more restricted range than weights of female college students. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate. Of 16 patients in the infusion group, 13 (81%) had resolution of C. difficile–associated diarrhea after the first infusion. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used (see Howell, reference 3, for more details).
Because we have no information on the proportion of freshmen who smoke, we use 0.5 to estimate the sample size as follows: n = 0.5(1 - 0.5)(1.96/0.05)² = 384.2. In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 385 is needed.
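The conservative choice p=0.5 can be compared with a more informed guess using a few lines of Python. This sketch assumes the large-sample formula n = p(1-p)(Z/E)² for a confidence interval for a proportion; the function name is hypothetical.

```python
import math


def n_for_proportion(margin, p=0.5, z=1.96):
    """Sample size so that the CI for a proportion has margin of error <= margin.

    p defaults to 0.5, which maximizes p(1-p) and so gives the largest n.
    """
    return math.ceil(p * (1 - p) * (z / margin) ** 2)


print(n_for_proportion(0.05))             # 385, no prior information about p
print(n_for_proportion(0.05, p=0.27))     # smaller n, using the earlier 27% estimate
print(n_for_proportion(0.001, p=0.0043))  # 16448, the rare-outcome prevalence example
```

Any value of p other than 0.5 yields a smaller sample size for the same margin of error, which is why 0.5 is the safe default when nothing is known about p.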