Requiring an amyloid-β1-42 biomarker may improve the efficiency of a study, and simulations may help in planning studies

A recent article by Schneider and colleagues has generated a lot of interest in simulation studies as a way to improve study design. The study also illustrates the foremost principle in simulation studies, which is that the results of a simulation are an embodiment of the assumptions that went into it. This simulation study assumes that the effect size is proportional to the mean to standard deviation ratio of the Alzheimer Disease Assessment Scale - cognitive subscale in the population being studied. Under this assumption, selecting a subgroup for a clinical trial based on biomarkers will not affect the efficiency of the study, despite achieving the desired increase in the mean to standard deviation ratio.

The simulation study reported by Schneider and colleagues is a valuable contribution to the field of Alzheimer's disease. The study was conducted under a detailed protocol and clearly lays out the assumptions that were made and the criteria that were used for each set of simulations [1]. The article makes the point that some situations are complicated enough that standard power calculations do not capture the whole picture, because they require simplifying assumptions that may not hold. In these cases, power calculations may more accurately reflect reality when based on simulations that do not rely as much on distributional assumptions.
In this study, one critical assumption is the basis of the main conclusion. Table 1 presents the results taken from Schneider and colleagues' study, including the mean and standard deviation (SD) for each group for the Alzheimer Disease Assessment Scale - cognitive subscale (ADAS-cog).
The difference between group means divided by the SD is called a standardised difference (or Cohen's D value [2]) and allows estimation of power based on a t test. If you take the placebo group mean and subtract the treatment group mean and then divide that difference by the placebo SD, using numbers that are all shown in the table, you obtain the effect size shown in the third column, within the rounding error (25%, 35% and 45%). This exercise illustrates that the effect size used in this simulation study increases and decreases proportionally to the standardised difference, which is tied mathematically to the power. In other words, although the sensitivity of the ADAS-cog to decline over time is increased with the biomarker selection methods, the treatment difference was decreased in order to maintain the same standardised difference.
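The calculation described above can be sketched in a few lines of Python. The function name and the input values are hypothetical and chosen only for illustration; they are not taken from Table 1.

```python
def cohens_d(placebo_mean, treatment_mean, placebo_sd):
    """Standardised difference: (placebo mean - treatment mean) / placebo SD.

    All inputs are means and SDs of the change from baseline.
    """
    if placebo_sd <= 0:
        raise ValueError("placebo_sd must be positive")
    return (placebo_mean - treatment_mean) / placebo_sd


# Hypothetical values for illustration only (not from Table 1).
d = cohens_d(placebo_mean=3.0, treatment_mean=1.5, placebo_sd=6.0)
print(f"standardised difference = {d:.2f}")
```

Applying the same function to each row of a results table and comparing the output with the planned effect size is one way to check whether the planned and observed effect sizes agree.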
Although this same type of approach seems to have been taken for the Clinical Dementia Rating scale sum of boxes (CDR-sb), calculating the observed effect size (Cohen's D value) by taking the difference between group means divided by the SD does not correspond to the planned effect size shown in Schneider and colleagues' Table 3 [1]. The rows with a planned effect size of 25% have calculated values ranging from 21 to 22%, the rows with a planned effect size of 35% have calculated values ranging from 27 to 30%, and the rows with a planned effect size of 45% have calculated values ranging from 35 to 39% (data not shown). It is unclear why the simulations consistently provide outcomes with lower effect sizes than those planned, particularly when the ADAS-cog observed effect sizes are not biased.

Defining effect size
There are several different ways to define effect size [3,4]. Because the estimated power is often compared between different scenarios assuming an equal effect size, it is important to know which effect size is assumed to be equal, and what impact that assumption is expected to have on the estimated power. The means and SDs referred to here are the mean and SD of the change from baseline for each treatment group.

Four different definitions of effect size will be compared: one that is unstandardised (the absolute difference); and three that are standardised values, calculated through dividing by some type of scaling factor (Cohen's D value using the baseline SD, Cohen's D value using the change from baseline SD, and the percentage of placebo decline that uses the placebo mean change from baseline).

Absolute difference
The absolute difference between treatment groups is calculated by simply subtracting the two treatment group means (usually the mean changes from baseline):
Difference = active mean - placebo mean
This observed treatment difference is often reported in addition to some type of standardised effect size. This difference is nonstandardised so it is difficult to compare between different instruments, because a 2-point difference on the ADAS-cog is not comparable with a 2-point difference on the CDR-sb. The absolute difference on a single scale is also difficult to compare between studies if the studies include different patient populations. For instance, a 2-point difference on the ADAS-cog may be more meaningful in a mild patient population than in a

Cohen's D value using the baseline standard deviation
One way to standardise the effect size is to divide the observed treatment difference by the baseline SD. This procedure is common and appropriate when the baseline scores represent some type of normal or healthy state from which patients may deteriorate, and then to which they may possibly return. This value is the number of SDs of difference between the two groups relative to the baseline population:
Cohen's D value using baseline SD = difference / baseline SD
In the case of Alzheimer's disease, mild cognitive impairment or prodromal Alzheimer's disease, the baseline population represents an already deteriorated patient population, so standardising based on this non-healthy population can lead to unusual effect sizes. For instance, a homogeneous group of patients (that is, a population with very similar severity at baseline) may have an SD that is one-half that of a less homogeneous population with the same baseline mean. If the same absolute treatment difference is observed in these two populations, then the first population would have a Cohen's D value that is twice as large as the second population due solely to the differences in baseline variability.
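The homogeneity example above can be checked numerically. The sketch below uses hypothetical values for the treatment difference and the baseline SDs; only the factor of one-half between the two SDs comes from the text.

```python
def cohens_d_baseline(difference, baseline_sd):
    """Cohen's D using the baseline SD as the scaling factor."""
    return difference / baseline_sd


# Hypothetical numbers: same absolute treatment difference in both populations.
difference = 2.0
heterogeneous_sd = 8.0                 # less homogeneous population
homogeneous_sd = heterogeneous_sd / 2  # SD is one-half, as in the text

d_heterogeneous = cohens_d_baseline(difference, heterogeneous_sd)
d_homogeneous = cohens_d_baseline(difference, homogeneous_sd)

# Halving the baseline SD doubles the effect size for the same point difference.
ratio = d_homogeneous / d_heterogeneous
print(d_heterogeneous, d_homogeneous, ratio)
```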
Cohen's D value using change from baseline standard deviation (z-score effect size or standardised difference)
If the absolute treatment difference is divided by the SD of the change from baseline, then this effect size also represents the number of SDs of difference between the two groups relative to the changes from baseline that were observed. This is a type of z-score calculation and is often referred to as a standardised difference. This is the effect size that was used by Schneider and colleagues [1]:
Cohen's D value effect size (standardised difference) = D = difference / placebo SD
Although the placebo SD is shown in the equation, this calculation sometimes uses the pooled SD across treatment groups. This type of effect size calculation is less susceptible to population differences at baseline, but it is still susceptible to differences in the homogeneity of the change over time. So if a group is more homogeneous at baseline, it is also likely that the changes from baseline will be more homogeneous, making comparison between the groups complicated.
The other issue that factors into this calculation is the sensitivity of the instrument. If an instrument is used that has substantial variability in the change from baseline over time, then the Cohen's D values will be lower than with an instrument with less variability. Although one could argue that an effect on a more variable instrument should be penalised because of the variability, it means that a 35% effect, for instance, on a variable instrument could be quite a lot larger than a 35% effect on a less variable instrument. There is a direct relationship between this standardised difference (D), the sample size and power for a two-sample t test: where K is a constant that depends on α (the type 1 error rate, traditionally selected to be 0.05), and n is the sample size per group.
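One common way to write this relationship is the large-sample normal approximation below. It is offered as a sketch rather than as the exact expression intended here, and it assumes that K is the two-sided critical value of the standard normal distribution (about 1.96 for α = 0.05) and that Φ denotes the standard normal cumulative distribution function:

```latex
\text{Power} \approx \Phi\!\left( D \sqrt{\frac{n}{2}} - K \right),
\qquad K = z_{1-\alpha/2}
```

Under this approximation, power depends on the data only through the product D√(n/2), so any change in study design that leaves D unchanged also leaves the power unchanged for a fixed n.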

Percentage placebo effect
Because Alzheimer's disease, including mild cognitive impairment and prodromal Alzheimer's disease, is a degenerative disease, a natural scaling factor is the placebo rate. Dividing the absolute difference by the placebo mean change from baseline results in an effect size that represents the percentage reduction in the placebo decline (an effect size >100% indicates an improvement over baseline):
Percentage placebo effect (% reduction in decline) = difference / placebo mean
This effect size has the advantage that it is standardised to time rather than to the variability of a group of patients. A 30% effect size, for instance, can therefore be interpreted as a reduction of 30% in the rate of the placebo group. This effect size is easily comparable across different instruments in the same disease state because the sensitivity of the instrument does not affect the effect size. This effect size is also at least somewhat comparable between patient groups in different disease states, since any floor or ceiling effects that may impact the instrument sensitivity may similarly affect the difference, thus not impacting the effect size.
An additional metric, referred to as the signal to noise ratio, measures the sensitivity of a particular instrument in a specific population of patients and is useful when using the percentage placebo effect:
Sensitivity (signal to noise ratio) = placebo mean / placebo SD
This metric allows comparison of instruments within a population, and also allows estimation of ceiling and floor effects. The signal to noise ratio multiplied by the percentage placebo effect is equal to the Cohen's D value effect size using the change from baseline SD. This relationship allows us to make a set of power curves based on the sensitivity of an instrument which can then be used to compare the power between different percentage placebo effects.
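The identity linking the three quantities follows directly from the definitions, since (placebo mean / placebo SD) × (difference / placebo mean) = difference / placebo SD. A minimal Python check, with hypothetical function names and input values, might look like this:

```python
import math


def sensitivity(placebo_mean, placebo_sd):
    """Signal to noise ratio of the instrument in this population."""
    return placebo_mean / placebo_sd


def percentage_placebo_effect(difference, placebo_mean):
    """Treatment difference as a fraction of the placebo decline."""
    return difference / placebo_mean


def standardised_difference(difference, placebo_sd):
    """Cohen's D using the change-from-baseline (placebo) SD."""
    return difference / placebo_sd


# Hypothetical values for illustration only.
placebo_mean, placebo_sd, difference = 4.0, 6.0, 2.0

product = (sensitivity(placebo_mean, placebo_sd)
           * percentage_placebo_effect(difference, placebo_mean))
d = standardised_difference(difference, placebo_sd)

# Sensitivity x percentage placebo effect equals the standardised difference.
assert math.isclose(product, d)
```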

Discussion
The three rightmost columns in Table 1 show that as the sensitivity increases, the percentage placebo effect size decreases. Consider the example shown in the first three rows of Table 1, with n = 100 per group and a dropout rate of 20%. The sensitivity of the ADAS-cog increases from 0.51 for the amnestic mild cognitive impairment (aMCI) group to 0.62 or 0.63 with the biomarker selected groups. If the percentage placebo effect of 0.69 that is shown for the aMCI group is also used for the two biomarker selected groups, we can estimate the power using Figures 1 and 2. For the aMCI group, the power would be approximately 0.60 (using Figure 1, effect size = 0.70; PASS 2005 [5] used for all power calculations). For the biomarker selected groups, the power is approximately 0.75 (using Figure 2, effect size = 0.70). Also using Figure 2, a sample size of approximately 70 per group can achieve power of 0.60, comparable with the power achieved with a sample size of 100 in the aMCI group.
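These figure-based estimates can be roughly cross-checked with a normal approximation to the two-sample t test. The sketch below is not the PASS 2005 calculation: it assumes a two-sided α of 0.05 and, as a further assumption, treats the 20% dropout as reducing the analysed sample to 80% of those enrolled. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(sensitivity, pct_placebo_effect, n_enrolled,
                 dropout=0.20, alpha=0.05):
    """Normal-approximation power for a two-sample comparison.

    The standardised difference is sensitivity x percentage placebo effect.
    Assumption: only completers (enrolled x (1 - dropout)) are analysed.
    """
    d = sensitivity * pct_placebo_effect
    n_completers = n_enrolled * (1 - dropout)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n_completers / 2) - z_crit)


# Sensitivities and percentage placebo effect quoted in the text.
power_amci = approx_power(0.51, 0.69, n_enrolled=100)
power_biomarker = approx_power(0.62, 0.69, n_enrolled=100)
power_biomarker_70 = approx_power(0.62, 0.69, n_enrolled=70)

print(f"aMCI, n=100:      {power_amci:.2f}")
print(f"biomarker, n=100: {power_biomarker:.2f}")
print(f"biomarker, n=70:  {power_biomarker_70:.2f}")
```

Because this is an approximation, the values should land close to, but not exactly on, those read from the power curves.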
There are critical differences in the approaches that have been used to discuss power and effect size in clinical trials. Below are three assumptions that correspond to assuming equal effects using three different methods of reporting treatment effects.
The first method is the absolute difference. Assuming that the absolute treatment difference (point difference) is the same across different trial scenarios implies that a treatment can give X points benefit no matter how much the placebo group declines, how sensitive the instrument is, or how heterogeneous the population being studied. This approach cannot reasonably be used to compare power between two different instruments such as the ADAS-cog and CDR-sb, since the same point differences on these two instruments would not be comparable.
The second reporting method is the standardised difference or Cohen's D value using the placebo standard deviation (used in Schneider and colleagues' article [1]). Assuming that the standardised difference is the same across trial scenarios implies that a treatment gives the same percentage benefit relative to the SD of the change from baseline of the instrument. If different instruments used to measure a disease are similarly sensitive to decline over time, then this type of comparison may be valid. Using this method, however, an increase in measurement error, such as that introduced with careless instrument administration, would be associated with an increase in the expected efficacy of the treatment under consideration, sufficient to counteract the decrease in power due to increased variability.
The final method is the percentage placebo effect. Assuming that the percentage difference relative to placebo is the same across trial scenarios implies that a treatment gives the same percentage benefit relative to the decline of the placebo group. This approach could only be considered for a disease with an increasing outcome or a degenerative disease such as Alzheimer's disease. In Alzheimer's disease, use of the method assumes that the treatment is expected to reduce the decline by the same percentage across different trial scenarios. This assumption may be justified when studying the same patient population with different instruments; it may not be reasonable when comparing different disease stages, however, since a treatment may not have the same percentage benefit in these different patient populations. This is the basis of the argument for earlier treatment. Treatments may be able to affect the disease more in the earlier stages. It is not clear whether this would be related to the position in the disease or to the slower decline rate that may be expected earlier in the disease (which, incidentally, may be due to a ceiling effect of an instrument). When selecting a population based on biomarkers in order to increase the decline rate seen over the study period, it is not clear whether the same percentage effect would be expected in this subgroup, or whether the percentage effect might actually go down due to the more rapid progression of this subgroup. This method does have the advantage of not depending on the sensitivity of the instrument being used. Figure 3 shows the difference between a percentage placebo effect and a standardised difference. Figure 3a shows a 50% effect as a percentage of the placebo decline of 4 points (2-point effect). Figure 3b shows a 50% standardised difference effect when the SD is 6 points (3-point effect).
Using the same 50% effect but a scenario with a smaller placebo decline (3-point decline instead of 4-point decline), a 1.5-point difference is obtained for the placebo decline effect (Figure 3c) and a 3-point difference for the standardised effect (Figure 3d) since the SD was kept at 6 points. These data illustrate the difference between a percentage effect relative to placebo and a standardised difference that is relative to the SD.
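The arithmetic behind the four Figure 3 scenarios can be reproduced with a short Python sketch. The function names are hypothetical; the 50% effect, the 4-point and 3-point placebo declines and the 6-point SD are the values given in the text.

```python
def points_from_pct_placebo(pct_effect, placebo_decline):
    """Point difference implied by a percentage placebo effect."""
    return pct_effect * placebo_decline


def points_from_standardised(std_effect, sd):
    """Point difference implied by a standardised difference."""
    return std_effect * sd


effect = 0.50  # 50% effect under either definition
sd = 6.0       # SD held constant across scenarios

for placebo_decline in (4.0, 3.0):
    pct_points = points_from_pct_placebo(effect, placebo_decline)
    std_points = points_from_standardised(effect, sd)
    print(f"placebo decline {placebo_decline}: "
          f"percentage placebo effect = {pct_points} points, "
          f"standardised difference = {std_points} points")
```

The percentage placebo effect scales with the placebo decline, whereas the standardised difference stays at 3 points as long as the SD is unchanged.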
Although observing similar power when comparing biomarker selected groups and the aMCI group as a whole is a direct result of using the same standardised difference, column J in Table 1 shows that the absolute difference is also quite similar between the biomarker selected groups and the aMCI group, and is actually larger for the aMCI group. This indicates that the conclusion that the power may not be much improved with biomarker selection may be valid if a treatment has a similar absolute difference in all three groups. In fact, the power differences would be even less than were shown since the absolute treatment difference shown for the aMCI group is slightly smaller than for the two biomarker selected groups.
The question therefore comes down to the issue of whether selecting a faster declining patient group, which generally increases the mean to SD ratio of an instrument (by increasing both the mean and the SD, but increasing the mean more than the SD), will also result in an increase in effect proportional to the increase in the mean placebo decline. Previous power comparisons have assumed that it will. Schneider and colleagues assume that it will not, and that the effect will instead stay proportionally the same relative to the SD, resulting in no effect on power [1]. Another way of assuming that this selection will not result in an increase in effect proportional to the increase in the mean placebo decline would be to assume a constant absolute treatment effect. This assumption also results in very similar power between aMCI and biomarker selected groups.

Conclusions
Simulation studies are an appropriate way to explore the impact of different study design decisions in order to improve the study design. The results of a simulation are an embodiment of the assumptions that went into it. This simulation study assumes that the effect size is proportional to the mean to SD ratio of the ADAS-cog in the population being studied. Because this type of effect size increases proportionally to the mean to SD ratio, increasing the mean to SD ratio cannot affect the power. The small differences in power that were observed by Schneider and colleagues between the three selection methods are probably due to the differences in simulating the measurement error component of the treatment response. In addition, the CDR-sb is not able to show increased power despite its improved sensitivity to decline (signal to noise ratio), because the effect size, as defined, increases proportionally to the signal to noise ratio, although there are some concerns about the observed effect sizes calculated from Table 3 in Schneider and colleagues' paper [1]. Assuming a constant absolute treatment effect also results in very similar estimated power between the aMCI and biomarker selected groups.
Hendrix Alzheimer's Research & Therapy 2011, 3:10 http://alzres.com/content/3/2/10
Assuming a constant percentage placebo effect size does show differences in the power for the selected patient subgroups. This assumption also shows improved power of the CDR-sb over the ADAS-cog, specifically due to the improvement in sensitivity or signal to noise ratio.
Separating the evaluation of an instrument in its ability to measure the decline in Alzheimer's disease over time (sensitivity) from the ability of a treatment to affect the decline over time (percentage placebo effect size) clarifies the discussion of power, efficiency and sample size.
There is no way to know whether selecting a faster declining patient group, which generally increases the mean to SD ratio of an instrument, will also result in an increase in effect proportional to the increase in the mean placebo decline. Previous power comparisons have assumed that it will. Both the approach described in Schneider and colleagues' article and an approach using a constant absolute treatment effect assume that this selection would not increase the effect, resulting in very similar power between the aMCI and biomarker selected groups.