Statistics for the Clinician
A look at the statistical structure and design of a clinical trial.
BY SARAH ROSNER, MPH, AND BERNARD ROSNER, PhD
Recently, Genaera Corporation reported limited clinical results for Evizon (squalamine lactate), a drug under investigation for wet AMD. The trial included 6 patients, and all 12 eyes enrolled in the study showed preserved or improved vision.1 These results are encouraging, but what should a clinician conclude from these data? What about the data reported for other CNV treatments nearing an approval decision, such as Retaane (15 mg anecortave acetate, Alcon Labs, Inc.), Lucentis (ranibizumab, Genentech), or the recently approved Macugen (pegaptanib sodium, Eyetech/Pfizer)? The studies behind these drugs represent a small part of the flood of clinical data that clinicians must wade through in seeking the best care for their patients. The statistical designs and analyses applied to a particular drug program are important to understand and deserve a thorough review, given the new drugs that will potentially be available to ophthalmologists in the coming months and years.
When a drug enters clinical evaluation it is investigated using a prospective design. Clinical studies can also use a retrospective structure; however, retrospective designs are typical of epidemiologic studies and are not applicable to the drug development process. A prospective clinical trial is the most powerful experimental design for testing the effectiveness of an intervention in human populations. A prospective design means that subjects are followed forward in time from a well-defined baseline point. The key feature of a clinical trial is that it involves an intervention, such as a device or therapeutic agent, that is assigned according to a randomization code. A proper design always includes a control, which can be a placebo agent and/or a standard therapy. Including a proper control limits the potentially confounding effects of natural disease progression, seasonality, and patient effects such as the placebo effect, the Hawthorne effect, and regression to the mean. This article will review basic statistical concepts needed to interpret data from clinical trials. For simplicity, a 2-group study consisting of an intervention group and a control group with a continuous outcome variable will be considered.
STATISTICAL ANALYSIS FOR A CLINICAL TRIAL
When interpreting results from a clinical trial, many clinicians will simply focus on the P-value. What exactly is the P-value, and why is 0.05 so often used to define statistical significance? By convention, if the P-value for a test statistic is less than 0.05, the results are statistically significant and represent a finding of interest; when the P-value is greater than or equal to 0.05, the findings are not statistically significant and thus may be disregarded by the clinician. However, one must remember that 0.05 is an arbitrary cutoff. Studies with "significant" results may represent false positives, and studies with "nonsignificant" results can still provide extremely valuable information. A brief review of some basic statistical concepts will give the clinician tools for evaluating the results of a clinical trial beyond simply looking at the P-value.
Hypothesis Test
It is important to understand the concept of a hypothesis test. When conducting a clinical trial, the investigators must first decide whether they are interested in testing a 1-sided or 2-sided hypothesis. A 1-sided hypothesis looks for differences in only 1 direction, for example, whether the intervention is better than the control. A 2-sided hypothesis looks for differences in both directions: it tests whether the intervention is either better or worse than the control. Unless there is a strong justification for expecting a difference in only 1 direction, 2-sided hypothesis tests are typically used. It is important to designate the null and alternative hypotheses at the start of a study. For example, if one were comparing the mean response level between the treatment (intervention) and control groups, one would test the null hypothesis H0: μintervention = μcontrol, or equivalently μintervention - μcontrol = 0. If there is no effect of the treatment, it is assumed that the mean responses for the 2 groups are equal, or that the difference between the mean responses is 0. The appropriate 2-sided alternative hypothesis would be HA: μintervention ≠ μcontrol, or equivalently μintervention - μcontrol ≠ 0. The alternative hypothesis states that the intervention is either better or worse than the control, but not the same.
The goal of a study is to test the null hypothesis and decide whether or not it can be rejected. If it is rejected then the alternative hypothesis is likely to be true. There are 4 possible outcomes of a hypothesis test (Table 1):
(1) H0 true, H0 is not rejected
(2) H0 true, H0 is rejected (Type I error)
(3) HA true, H0 is rejected
(4) HA true, H0 is not rejected (Type II error)
It is possible that a study can incorrectly reject the null hypothesis, thereby producing false positive findings. The probability of this type of result is termed the significance level or the Type I error rate, normally designated α (alpha). The significance level of a study is normally set at 0.05, but this is an arbitrary choice. The significance level relates to the P-value of a statistical test. The P-value is the probability of observing an absolute difference between the mean responses of the groups as large as, or larger than, the difference observed, given that the null hypothesis is true. If a P-value is very small, then it is unlikely that you would have observed the difference seen between the intervention and control groups if there were truly no difference between the 2 groups. If the P-value is less than the significance level (0.05), then the null hypothesis is rejected. It is also possible for a study to have false negative findings, where the null hypothesis is accepted when it is in fact false. The probability of incorrectly accepting the null hypothesis is termed the Type II error rate, normally designated β (beta).
Table 1. Four Possible Outcomes of a Hypothesis Test.

Hypothesis Test Result* | "True State": H0 True | "True State": HA True
Reject H0 | Type I error (α) | No error (1 - β)
Do not reject H0 | No error (1 - α) | Type II error (β)
*Reject H0 when P-value < α.
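To make the reject/do-not-reject decision rule of Table 1 concrete, the following sketch (in Python, using the numpy and scipy libraries) runs a 2-sided, 2-sample t test on simulated data and applies the conventional 0.05 cutoff; the group means, standard deviations, and sample sizes are hypothetical and chosen purely for illustration.

import numpy as np
from scipy import stats

# Simulated outcome data for illustration only (hypothetical effect size).
rng = np.random.default_rng(seed=0)
intervention = rng.normal(loc=1.0, scale=2.0, size=50)
control = rng.normal(loc=0.0, scale=2.0, size=50)

alpha = 0.05  # Type I error rate (significance level)
t_stat, p_value = stats.ttest_ind(intervention, control)  # 2-sided by default

if p_value < alpha:
    print(f"P = {p_value:.4f} < {alpha}: reject H0; the difference is statistically significant")
else:
    print(f"P = {p_value:.4f} >= {alpha}: do not reject H0")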
Confidence Interval
When designing a clinical trial it is important to be able to reject the null hypothesis when the alternative hypothesis is true. In other words, if there is truly a difference between the treatment and control groups, you want your study to be able to detect it.
Generally, a confidence interval for the sample difference is reported along with the P-value. For example, if α is 0.05, then a 95% confidence interval will be reported. So what does a 95% confidence interval really mean? On average, a 95% confidence interval calculated from a data set will contain the true population difference with a probability of 0.95. For example, if you were to repeat your study 100 times, on average 95 of the confidence intervals calculated would contain the true population difference while 5 of them would not. The interval gives you an idea of the range of values that are consistent with your data. Ideally, you do not want the null value (0) to be contained in the confidence interval. For example, given a 95% confidence interval of (-2, 3), you would conclude that there could be a difference between your groups of anywhere from -2 to 3 units, including no difference at all. You would want to obtain a confidence interval that does not include the null value, such as (-4, -2) or (3, 5). The P-value relates directly to the confidence interval: if P is greater than or equal to 0.05, then the null value will be contained in the 95% confidence interval; if P<0.05, then there is a significant difference between the 2 groups and the null value will not be contained in the 95% confidence interval.
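As a sketch of that relationship, the function below (a hypothetical helper, assuming the standard equal-variance t interval) computes a confidence interval for the difference in means; if the returned interval excludes 0, the corresponding 2-sided t test has P<0.05.

import numpy as np
from scipy import stats

def mean_diff_ci(x, y, alpha=0.05):
    """Equal-variance confidence interval for mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled estimate of the common within-group variance.
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    t_crit = stats.t.ppf(1 - alpha / 2, df=nx + ny - 2)
    return diff - t_crit * se, diff + t_crit * se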
Sample Size
Power is the ability of a study to detect a true difference of a specified size between the treatment groups. Power is the probability of concluding that the alternative hypothesis is true when it is actually true. Power is related to the Type II error rate in that power is equal to 1 - β. For example, if β is 0.2, then the study has an 80% chance of detecting a difference of a specified size δ between treatment and control groups if the difference truly exists. One of the challenges in designing a clinical trial is to determine the smallest difference between groups that is still of clinical significance.
The concept of power is directly related to the sample size of a study. The size of a study should be determined in the planning phase of the trial. It is essential that a study have a large enough sample size to detect a clinically meaningful difference between treatment groups with high probability. If the sample size is too small, a study may not have enough power to detect a true effect. The calculation of sample size depends on 4 parameters: 1) the power (1 - β); 2) the significance level (α); 3) the detectable difference between groups (δ); and 4) the standard deviation within groups (σ). Typically, the significance level is set at 0.05 and the power at 0.8 or 0.9. Although it may seem desirable to set the power as close as possible to 100%, the extremely large sample sizes required to attain such power make it an unfeasible choice; the budget of a study will also limit the sample size.
There are numerous formulas available for calculating the sample size for a study with a given power, significance level, and detectable difference between groups, depending on the study design. The details of these formulas are beyond the scope of this article; however, they can be found in most texts on biostatistics.2 It is important to note that these formulas only provide estimates of the sample size needed for the study; therefore, it is advisable to use a sample size slightly larger than what the formula determines. Given these considerations, the statistical powering of a study is heavily influenced by the expected differences, which can be taken from previous studies in the literature or from earlier proof-of-concept clinical studies done in the same patient population.
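As one concrete instance, a standard large-sample approximation for comparing 2 means gives n = 2(z1-α/2 + z1-β)²(σ/δ)² subjects per group. The sketch below implements that formula; the example numbers are hypothetical, and per the caveat above the result is only an estimate.

import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a 2-sided, 2-sample
    comparison of means (detectable difference delta, within-group SD sigma)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)

# Hypothetical example: detect a 5-letter difference with SD 12 at 80% power.
print(n_per_group(delta=5, sigma=12))  # roughly 91 subjects per group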
In the field of ophthalmology, the calculation of sample size is not as straightforward as in other disciplines. Since each person contributes 2 eyes to the study population, does that mean that you need only half as many people? Unfortunately, no: because a person's 2 eyes are not independent units, they cannot be counted as 2 separate "subjects." Special sample size formulas are available for ophthalmic data where the eye is the unit of analysis,3 and a rough approximation is sketched below.
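One rough way to see the effect of correlated eyes is the generic cluster "design effect," 1 + (m - 1)ρ, with m = 2 eyes per person and inter-eye correlation ρ. The sketch below uses this approximation; it is not the exact ophthalmic formulas of reference 3.

import math

def eyes_needed(n_independent, rho):
    """Inflate a sample size computed for independent eyes by the
    design effect 1 + (m - 1)*rho, with m = 2 eyes per person."""
    return math.ceil(n_independent * (1 + rho))

# With rho = 0, 200 independent eyes come from 100 people (half as many
# subjects); with rho = 1, the fellow eye adds no information and 200
# people are still required.
print(eyes_needed(200, 0.0) // 2, eyes_needed(200, 1.0) // 2)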
COMMON STATISTICAL TESTS
Once the study has been designed and implemented, the data can be analyzed using statistical tests. The investigator has a choice of 2 categories of test statistics: parametric and nonparametric. Parametric tests assume that the data are sampled from populations that follow a particular distribution, usually a Gaussian, or bell-shaped, distribution. Nonparametric tests make no assumptions about the distribution of the study populations; they perform calculations based on the ranks of the values rather than on the actual data values. Therefore, outliers have less influence than with parametric tests.
So how do you know whether a particular test statistic calculated from your study population follows a Gaussian distribution? If the sample size is large, the central limit theorem generally ensures that the test statistic will be approximately Gaussian, and a parametric test can be used. If the sample size is small, or the outcome variable is on an ordinal scale (i.e., pain, comfort, or slit-lamp exam scores), it is better to use a nonparametric test. Table 2 summarizes some common statistical tests, and a brief sketch after the table illustrates the contrast. For each parametric test, there is a nonparametric equivalent. The choice of a statistical test will depend on how many groups you want to compare and whether or not you have paired data.
Table 2. Common Statistical Tests.

Goal of Test | Parametric Test | Nonparametric Test
To compare 2 unpaired groups | 2-sample t test | Mann-Whitney U test (equivalent to the Wilcoxon rank sum test)
To compare 2 paired groups | Paired t test | Wilcoxon signed rank test
To compare 3+ unpaired groups | ANOVA (analysis of variance) | Kruskal-Wallis test
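The sketch below (simulated data with a single extreme outlier added) shows the contrast described above: the rank-based test is largely unaffected by the outlier, while the t test is not.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.8, scale=1.0, size=30)
group_b[0] = 25.0  # one wild outlier

# Parametric: the outlier inflates the variance and distorts the mean.
print("t test P =", stats.ttest_ind(group_a, group_b).pvalue)
# Nonparametric: only the outlier's rank matters, so its influence is limited.
print("Mann-Whitney P =", stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue)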
STUDY DESIGN
In addition to looking at the statistical results of a trial, it is also important to consider the study design. Traditionally, clinical trials are superiority trials, in which the purpose is to show that a new treatment is more effective than placebo. More recently, however, there have been many equivalence or noninferiority trials, in which the goal is to show that 2 treatments have the same therapeutic effect (or that the new treatment is not meaningfully worse than the standard).
Since it can be difficult to show that a new treatment is superior to other treatments, many development programs try to show equivalence across drugs. Although 2 drugs may have the same therapeutic effect, a newer treatment may have other desirable properties, such as fewer side effects or extended duration of action. One of the challenges involved in an equivalence trial is that a "region of therapeutic equivalence" must be defined. Before starting a trial the investigator must determine which range of values corresponds to therapeutic equivalence. This can be a subjective process, which can dramatically influence the results of the trial if the range selected is too large or too small.
A recent example is the Retaane 15 mg (anecortave acetate suspension, Alcon Labs) versus Visudyne photodynamic therapy (verteporfin, QLT/Novartis, Vancouver, Canada) study that reported results in late 2004.4 This trial defined rigorous noninferiority parameters, based on previous studies, at the initiation of the trial, and these greatly affected the results. Alcon defined equivalence as a difference of no more than 7 percentage points between the Retaane and Visudyne groups in the percentage of patients who maintained vision (subjects who lost fewer than 3 lines of vision); however, this number is highly subjective, and it has been argued that a larger window could have been justified for this study.
Choosing this range of equivalence is a vitally important part of the study design: as initially reported, Retaane would have met its predefined endpoint had a slightly larger window been used; because a 7% window was used, the endpoint was not met. Thus, it is possible to obtain a different study outcome from the same results. It is also interesting to note that a chi-square analysis showed no statistically significant difference between the Retaane and Visudyne treatment groups.
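A common way to operationalize such a window, sketched below with hypothetical rates, sample sizes, and a simple Wald-type interval (not the actual Retaane analysis), is to declare noninferiority only if the lower confidence bound for the difference in response rates stays above the negative of the chosen margin.

import math
from scipy.stats import norm

def noninferior(p_new, n_new, p_ref, n_ref, margin, alpha=0.05):
    """Wald-type check: noninferior if the lower CI bound for
    (p_new - p_ref) exceeds -margin."""
    diff = p_new - p_ref
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    lower = diff - norm.ppf(1 - alpha / 2) * se
    return lower > -margin

# Hypothetical data: the verdict flips as the margin widens.
print(noninferior(0.47, 500, 0.49, 500, margin=0.07))  # False: endpoint missed
print(noninferior(0.47, 500, 0.49, 500, margin=0.10))  # True: endpoint met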
Several subanalyses were also performed to look at the effect of controllable factors, such as treatment interval and drug reflux, on the results (Table 3). These subanalyses reduced the number of subjects included in the analysis; however, they showed numeric differences, which led Alcon to further investigate potentially controllable factors in the study. Table 3 shows that the Retaane group had 57% of patients with preserved vision. This analysis is valuable in showing an overall trend toward better efficacy after excluding patients who experienced reflux during drug administration and/or were retreated after 6 months.
After analyzing the results above, Alcon initiated a small clinical study to specifically investigate the effect of drug reflux and whether a specially designed counter-pressure device (CPD) could prevent it. Results reported at the Macula Society meeting in February indicated that the CPD was effective in preventing drug reflux in 100% of patients, and serum levels of the drug confirmed that eliminating reflux correlated with a higher level of drug absorption.
Table 3. Subgroup Analysis of Patients Who Maintained Vision(a) in Retaane and Visudyne Treatment Groups at 12 Months.

| Retaane | Visudyne
Patients who maintained vision at 12 months(a) | 57% (n=75)(b) | 49% (n=220)
(a) Maintained vision is defined as losing fewer than 3 lines of visual acuity.
(b) This subgroup excludes patients who experienced reflux during drug administration and patients retreated after the 6-month dosing interval.
CROSSOVER VERSUS PARALLEL DESIGN
In addition to deciding whether to perform a superiority or equivalence trial, the investigator must also decide between a parallel group design and a crossover design. In a parallel group design, each subject is assigned to a single treatment group for the duration of the study, and at the end of the study the experience of each treatment group is compared. In contrast, with the crossover design, each subject receives both treatments, albeit at different times: subjects receive 1 treatment for a period of time and then switch to the other treatment. The major advantage of crossover trials is that each subject serves as his or her own control. However, if there is any carryover effect of the first treatment that a patient receives, then one must use only the first-period results to compare treatments, which will generally provide less power than a standard parallel design with baseline and follow-up periods.
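A small simulation (hypothetical data, assuming no carryover) illustrates the gain from self-controls: when outcomes share a strong subject-to-subject component, the paired analysis removes it, while an unpaired analysis must absorb it as noise.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
subject_effect = rng.normal(loc=0.0, scale=3.0, size=20)  # between-subject variability
on_control = subject_effect + rng.normal(0.0, 1.0, 20)
on_treatment = subject_effect + 0.8 + rng.normal(0.0, 1.0, 20)  # true effect of 0.8

# Paired test: each subject serves as his or her own control.
print("paired P =", stats.ttest_rel(on_treatment, on_control).pvalue)
# Unpaired test: the subject effect remains in the error term, reducing power.
print("unpaired P =", stats.ttest_ind(on_treatment, on_control).pvalue)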
When interpreting the results of a clinical trial it is important to consider not only the P-value, but also the power, sample size, and study design of the trial.
Sarah Rosner, MPH, is a ScD candidate in epidemiology at the Harvard School of Public Health. Bernard Rosner, PhD, is a professor in the Department of Biostatistics, Channing Laboratory, Harvard Medical School. Neither author has any financial interest in the information contained in this article. Dr. Bernard Rosner can be reached by e-mail at stbar@channing.harvard.edu.
REFERENCES
1. Genaera Corporation. Genaera reports positive preliminary two-month data from phase II study for squalamine in age-related macular degeneration [press release]. Plymouth Meeting, PA; January 10, 2005.
2. Rosner B. Fundamentals of Biostatistics. 6th ed. Belmont, CA: Wadsworth; 1999.
3. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 3rd ed. New York, NY: Springer; 1998.
4. American Academy of Ophthalmology national conference. New Orleans, LA; October 15, 2004.