In this vignette, we illustrate how to calibrate a two-arm one-stage phase II design with binary endpoints from a Bayesian perspective. Details on the methodology can be found in (Kelter 2026). Our main assumption is that the observed data in the two groups are realizations of two random variables \(Y_1,Y_2\) which follow binomial distributions with sample sizes \(n_1\) and \(n_2\) and success probabilities \(p_1\) and \(p_2\), respectively: \[Y_1\sim \mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim \mathrm{Bin}(n_2,p_2)\]
In its current form, the package implements four different hypothesis tests for such trials:
\[H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2\] Alternatively, a well-known parameterization of this test introduces a difference parameter \(\eta=p_2-p_1\) and the grand mean \(\zeta=\frac{1}{2}(p_1+p_2)\). Using this parameterization, we have \[p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}\] and the hypotheses can be rewritten as: \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0\] Next to this two-sided test, three directional tests are available in the package: \[H_0:\eta \leq 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0\] \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0\] \[H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta < 0\]
For each of the four tests, a separate Bayes factor exists and can be used. For the two-sided test, we denote the Bayes factor as \(BF_{01}\), and for the three directional tests above we denote the Bayes factors as \(BF_{+-}\), \(BF_{+0}\) and \(BF_{-0}\). In this notation, the test of \(H_0:\eta \leq 0\) versus \(H_1:\eta >0\) can also be written as \(H_-:p_2 \leq p_1\) versus \(H_+:p_2 > p_1\).
A natural choice for the priors is the beta distribution. The \(\mathrm{Beta}(a_0,b_0)\) distribution is a conjugate prior for the binomial likelihood, so when it is chosen as the prior, the posterior \(P_{p \mid Y}\) is also Beta-distributed. We assume a Beta design prior under \(H_0\) as follows: \[p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)\] Thus, under \(H_0:\eta = 0\), both probabilities are identical, \(p_1=p_2\), and take some value \(p\in [0,1]\), which has a beta design prior. Likewise, we pick independent Beta design priors under \(H_1:\eta \neq 0\): \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)\] For the analysis priors \(P_{p_1}^a\), \(P_{p_2}^a\) under \(H_1\), we also choose independent Beta priors, with possibly different values \(a_i^a\) and \(b_i^a\) for \(i=1,2\), where the superscript \(a\) signals that the hyperparameters belong to the analysis prior rather than the design prior (superscript \(d\)): \[p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)\] Lastly, for the analysis prior \(P_{p}^a\) under \(H_0:\eta=0\), we choose a Dirac prior with all probability on \(\eta=p_2-p_1=0\), conditionally on a uniform prior on \(\zeta\), that is \[p_1=p_2=p \mid H_0 \sim 1_{\{\eta=0\}} \mid \zeta \sim U(0,1)\] for the analysis with the Bayes factor.
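To illustrate why conjugacy is convenient here, the following minimal sketch computes a two-sided Bayes factor \(BF_{01}\) as a ratio of beta-binomial marginal likelihoods. It assumes that the analysis prior under \(H_0\) amounts to a \(\mathrm{Beta}(a_0^a,b_0^a)\) prior on the common \(p\) (uniform for \(a_0^a=b_0^a=1\)). The function name bf01_sketch is hypothetical, and the code is not taken from the package, so it need not coincide with the package's own implementation.

```r
# Sketch only: BF01 from conjugate marginal likelihoods (assumed form,
# not taken from the bfbin2arm source code).
bf01_sketch <- function(y1, y2, n1, n2,
                        a0 = 1, b0 = 1,   # Beta prior on the common p under H0
                        a1 = 1, b1 = 1,   # Beta prior on p1 under H1
                        a2 = 1, b2 = 1) { # Beta prior on p2 under H1
  # The binomial coefficients appear in both marginal likelihoods and cancel.
  log_m0 <- lbeta(a0 + y1 + y2, b0 + n1 + n2 - y1 - y2) - lbeta(a0, b0)
  log_m1 <- lbeta(a1 + y1, b1 + n1 - y1) - lbeta(a1, b1) +
    lbeta(a2 + y2, b2 + n2 - y2) - lbeta(a2, b2)
  exp(log_m0 - log_m1)
}

# One patient per arm, both responding: BF01 = (1/3) / (1/4) = 4/3
bf01_sketch(y1 = 1, y2 = 1, n1 = 1, n2 = 1)
```

Working on the log scale with lbeta() avoids numerical underflow for larger sample sizes.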
First, we load the package after installation:
library(bfbin2arm)
Next, we illustrate the main calibration function for a two-arm
one-stage trial by re-analyzing a phase II trial in the context of
oncology. While no Bayesian approach was used in the original
statistical analysis of the trial, the step-by-step walkthrough below
showcases what a structured approach to designing and calibrating a
Bayesian two-arm one-stage phase II trial with the
bfbin2arm package looks like. Importantly, the trial must
have two trial arms (treatment and control) and binary endpoints. We
assume further that one of the four tests detailed above is carried out
using Bayes factors as the test criterion.
The ICT-107 trial (Wen et al. 2019) was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic response status. Here, we focus on the secondary endpoint for illustration purposes.
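Reported results for the secondary endpoint (ITT population): 12 of 43 patients in the placebo (control) arm and 49 of 81 patients in the ICT-107 (treatment) arm showed an immunologic response.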
We start by calculating the Bayes factor(s) for the ICT-107 trial data:
## -------------------------------------------------------------
## 2. ICT-107 trial (immunologic response)
## Placebo (control): 12 responders, 31 non-responders
## ICT-107 (treatment): 49 responders, 32 non-responders
## -------------------------------------------------------------
y1_ict <- 12 # control successes
n1_ict <- 12 + 31
y2_ict <- 49 # treatment successes
n2_ict <- 49 + 32
cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n")
#>
#> === ICT-107 Trial (n1 = 43 , n2 = 81 ) ===
# BF01
BF01_ict <- twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict,
                          a_0_a = 1, b_0_a = 1,
                          a_1_a = 1, b_1_a = 1,
                          a_2_a = 1, b_2_a = 1)
# BF+1
BFp1_ict <- BFplus1(y1_ict, y2_ict, n1_ict, n2_ict,
                    a_1_a = 1, b_1_a = 1,
                    a_2_a = 1, b_2_a = 1)
# BF-1
BFm1_ict <- BFminus1(y1_ict, y2_ict, n1_ict, n2_ict,
                     a_1_a = 1, b_1_a = 1,
                     a_2_a = 1, b_2_a = 1)
# BF+0
cat("=== ICT-107 Trial === Bayes factor BF+0 results in ", BFplus0(BFp1_ict, BF01_ict))
#> === ICT-107 Trial === Bayes factor BF+0 results in 186.6192
# BF+-
cat("=== ICT-107 Trial === Bayes factor BF+- results in ", BFplusMinus(BFp1_ict, BFm1_ict))
#> === ICT-107 Trial === Bayes factor BF+- results in 3702.659
The most relevant Bayes factor here is \(BF_{+-}\), because it is directional and
leaves open the possibility of the placebo group having a larger
response rate than the treatment group. Note that the hyperparameters of
the beta analysis priors are specified in twoarmbinbf01 via
a_0_a = 1, b_0_a = 1 et cetera.
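As a rough cross-check of what a directional Bayes factor measures, the sketch below computes \(BF_{+-}\) as the posterior odds of \(p_2 > p_1\) divided by the prior odds. This rests on the assumption that \(H_+\) and \(H_-\) are obtained by restricting the independent Beta analysis priors to \(p_2 > p_1\) and \(p_2 \leq p_1\). The helper names prob_p2_greater and bf_plus_minus_sketch are hypothetical, and the code is not claimed to mirror the internals of BFplusMinus().

```r
# Sketch only: probability that p2 > p1 for independent
# p1 ~ Beta(a1, b1) and p2 ~ Beta(a2, b2), via one-dimensional integration.
prob_p2_greater <- function(a1, b1, a2, b2) {
  integrand <- function(x) {
    dbeta(x, a1, b1) * pbeta(x, a2, b2, lower.tail = FALSE)
  }
  integrate(integrand, lower = 0, upper = 1)$value
}

# Assumed form of BF+-: posterior odds of H+ divided by prior odds of H+.
bf_plus_minus_sketch <- function(y1, y2, n1, n2,
                                 a1 = 1, b1 = 1, a2 = 1, b2 = 1) {
  prior <- prob_p2_greater(a1, b1, a2, b2)
  post <- prob_p2_greater(a1 + y1, b1 + n1 - y1, a2 + y2, b2 + n2 - y2)
  (post / (1 - post)) / (prior / (1 - prior))
}

# ICT-107 immunologic response data with flat Beta(1, 1) analysis priors
bf_plus_minus_sketch(y1 = 12, y2 = 49, n1 = 43, n2 = 81)
```

Whether this reproduces the package value depends on the stated assumption about how \(H_+\) and \(H_-\) are constructed.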
Now, a key question is which operating characteristics can be
expected based on the actual sample sizes used in the trial. The
powertwoarmbinbf01 function can provide the answer:
ict_results <- powertwoarmbinbf01(
  n1 = n1_ict, n2 = n2_ict,
  k = 1/3, k_f = 3,
  test = "BF+-", # H+: p2 > p1 vs H-: p2 <= p1
  a_0_d = 1, b_0_d = 1, a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 1, a_2_d = 1, b_2_d = 1,
  a_1_a = 1, b_1_a = 1, a_2_a = 1, b_2_a = 1,
  output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
                  Power             Type1_Error
              0.8788106               0.0214111
                  CE_H0 Frequentist_Type1_Error
              0.8788106               0.2871811
attr(,"hypothesis")
[1] "H[+]:~p[2] > p[1] ~~ vs ~~ H[-]:~p[2] <= p[1]"
attr(,"compute_freq_t1e")
[1] TRUE
We see that based on the actual sample sizes and a moderate evidence
threshold \(k=1/3\), the Bayesian power
is sufficiently large at \(87.9\%\).
Still, the frequentist type-I-error rate is far too high at \(28.7\%\), so we tighten the evidence
threshold to \(k=1/10\) (strong
evidence) and use the design_twoarm_onestage_bf() function to calibrate
the design based on our requirements next.
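One way to see why a threshold of 3 is lenient from a frequentist viewpoint: if the prior odds of \(p_2 > p_1\) equal 1 (as under flat, exchangeable analysis priors), then \(BF_{+-} > 3\) is the same as a posterior probability of \(p_2 > p_1\) above \(3/4\). The sketch below enumerates all possible outcomes of a small hypothetical design and sums the probability of those that cross the threshold when \(p_1 = p_2 = p\). The function names, the single evaluation point \(p\), and the posterior-odds form of \(BF_{+-}\) are assumptions made for illustration, not the package's algorithm.

```r
# Sketch only: exact frequentist rejection probability at p1 = p2 = p for the
# rule "reject H- when BF+- > threshold", assuming flat Beta(1, 1) analysis
# priors so that BF+- equals the posterior odds of p2 > p1.
post_prob_plus <- function(y1, y2, n1, n2) {
  integrand <- function(x) {
    dbeta(x, 1 + y1, 1 + n1 - y1) *
      pbeta(x, 1 + y2, 1 + n2 - y2, lower.tail = FALSE)
  }
  integrate(integrand, lower = 0, upper = 1)$value
}

freq_type1_sketch <- function(n1, n2, p, threshold) {
  grid <- expand.grid(y1 = 0:n1, y2 = 0:n2)
  post <- mapply(post_prob_plus, grid$y1, grid$y2,
                 MoreArgs = list(n1 = n1, n2 = n2))
  # Posterior odds > threshold is equivalent to post > threshold / (1 + threshold)
  reject <- post > threshold / (1 + threshold)
  sum(dbinom(grid$y1, n1, p) * dbinom(grid$y2, n2, p) * reject)
}

# Hypothetical small design: 10 patients per arm, common response rate 0.5
t1_k3 <- freq_type1_sketch(n1 = 10, n2 = 10, p = 0.5, threshold = 3)
t1_k10 <- freq_type1_sketch(n1 = 10, n2 = 10, p = 0.5, threshold = 10)
c(threshold_3 = t1_k3, threshold_10 = t1_k10)
```

A stricter threshold can only shrink the rejection region, so the second value cannot exceed the first.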
The core working function to design a Bayesian two-arm one-stage
trial with the package is the design_twoarm_onestage_bf()
function. It searches over a grid of total sample sizes and returns a
design object that contains the selected sample sizes per arm (n1,
n2) and their sum (n_total), together with the Bayesian and frequentist
operating characteristics at the chosen design.
Internally, the function uses the same numerical engine as the legacy
ntwoarmbinbf01() function, but exposes a richer,
object-based interface and S3 methods for printing, summarizing, and
plotting. The old function ntwoarmbinbf01() remains
available as a compatibility wrapper that now returns the same design
object.
First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and strong evidence thresholds (\(k=1/10\), \(k_f=10\)), using the directional Bayes factor \(BF_{+-}\). Note that evidence in favour of \(H_-\) is obtained when \(BF_{+-}<k\) for \(k<1\). Internally, the function therefore uses the Bayes factor \(BF_{-+}\) when calibrating the design, but for our purposes this does not matter. Selecting test = "BF+-" applies the directional test we intend to use when calibrating our design:
des <- design_twoarm_onestage_bf(
  n_min = 10,
  n_max = 75,
  k = 1/10,
  k_f = 10,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.80,
  # design and analysis priors: flat Beta(1,1) everywhere
  a_0_d = 1, b_0_d = 1,
  a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 1,
  a_2_d = 1, b_2_d = 1,
  a_1_a = 1, b_1_a = 1,
  a_2_a = 1, b_2_a = 1,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # equal randomisation
  alloc1 = 0.5,
  alloc2 = 0.5,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
We can summarize or print the results with the print()
and summary() methods:
summary(des)
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: No feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: no
Search overview
n evaluated = 66
pointwise feasible = 0
sustained feasible = 0
first pointwise n = NA
first sustained n = NA
Also, we can plot the results:
Figure 1: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint
The summary and plot show that for the range of sample sizes provided
to the function, under flat design priors, no sample size satisfies the
requirement of Bayesian power \(\geq
0.80\). Thus, if we want to obtain a calibrated design we can
either increase n_max or choose more informative design
priors. Alternatively, we could shift to a less stringent threshold for
evidence \(k\), e.g. \(k=1/3\) instead of \(k=1/10\), so it becomes easier for the
Bayes factor to accumulate evidence in favour of \(H_+\).
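To make the notion of Bayesian power more tangible, the following Monte Carlo sketch draws response rates from a design prior under \(H_+\), simulates trial data, and records how often the evidence threshold is crossed. It assumes that the design prior under \(H_+\) consists of independent Beta distributions restricted to \(p_2 > p_1\) (sampled by rejection), that the analysis priors are flat, and that \(BF_{+-}\) equals the posterior odds of \(p_2 > p_1\). All function names and the sample sizes are hypothetical, and the package may compute power differently (for instance without simulation).

```r
# Sketch only: Monte Carlo estimate of Bayesian power for BF+- under the
# assumptions stated above (not taken from the package).
post_prob_plus <- function(y1, y2, n1, n2) {
  integrand <- function(x) {
    dbeta(x, 1 + y1, 1 + n1 - y1) *
      pbeta(x, 1 + y2, 1 + n2 - y2, lower.tail = FALSE)
  }
  integrate(integrand, lower = 0, upper = 1)$value
}

simulate_post_probs <- function(n1, n2, n_sim,
                                a1_d = 1, b1_d = 1, a2_d = 1, b2_d = 1) {
  vapply(seq_len(n_sim), function(i) {
    repeat {  # rejection sampling from the design prior restricted to p2 > p1
      p1 <- rbeta(1, a1_d, b1_d)
      p2 <- rbeta(1, a2_d, b2_d)
      if (p2 > p1) break
    }
    post_prob_plus(rbinom(1, n1, p1), rbinom(1, n2, p2), n1, n2)
  }, numeric(1))
}

set.seed(1)
post <- simulate_post_probs(n1 = 20, n2 = 20, n_sim = 2000)
power_k3 <- mean(post > 3 / 4)     # BF+- > 3,  i.e. k = 1/3
power_k10 <- mean(post > 10 / 11)  # BF+- > 10, i.e. k = 1/10
c(k_1_3 = power_k3, k_1_10 = power_k10)
```

Under flat design priors many simulated trials have a true \(p_2\) only slightly above \(p_1\), which is one reason why a strict threshold is hard to reach with high probability.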
The arguments correspond closely to the conceptual requirements:
- k is the evidence threshold for rejecting the null (inverted Bayes factor). Here, k = 1/10 corresponds to strong evidence against the null.
- k_f is the threshold for compelling evidence in favour of the null (here, \(k_f = 10\) in favour of \(H_-\)).
- calibration selects which constraints to enforce: "Bayesian" uses Bayesian power, type-I error, and CE(H0); "frequentist", "hybrid", and "full" add frequentist constraints. Frequentist calibration implies that frequentist power and type-I-error are calibrated, hybrid calibration implies that Bayesian power and frequentist type-I-error are calibrated, and full calibration implies that both frequentist and Bayesian power and type-I-error are calibrated.
- target_power, target_type1, and target_ce_h0 are the Bayesian calibration targets.
- p1_power and p2_power specify the assumed proportions for frequentist power (when used), here \(p_1 = 0.3\) and \(p_2 = 0.6\).
- alloc1 and alloc2 specify randomisation probabilities for control and treatment; here we use equal allocation.
The resulting object des shows in its print and summary
output whether a feasible design was found and, if so, which sample
sizes and operating characteristics are selected. The old three-panel
plot is now available via the plot() method of the design object,
which reproduces the original
ntwoarmbinbf01(output = "plot") visualisation. For this
default plot, it is also possible to just call plot(des).
In addition, two more compact plot types are provided:
plot(des, type = "oc")          # operating characteristics across n_total
plot(des, type = "feasibility") # pointwise vs sustained feasibility across n_total
The old function ntwoarmbinbf01() is still available for
backward compatibility. It now returns the same design object as
design_twoarm_onestage_bf() and internally calls that
function. A simple compatibility call is:
des_legacy <- ntwoarmbinbf01(
  k = 1/10, k_f = 10,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 75), n_step = 1,
  progress = FALSE,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  alloc1 = 0.5, alloc2 = 0.5,
  output = "numeric"
)
We plot the results:
Figure 2: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint
This code path is mainly intended for users with existing scripts;
new analyses should use design_twoarm_onestage_bf()
directly.
The example above used flat design priors, which might be unrealistic
in a variety of settings. While it would be possible to increase the
maximum sample size n_max in the search range to eventually
find a calibrated trial design, a more helpful approach is to use
informative design priors. Such design priors should reflect the
expectations investigators have about the effect of a novel drug or
treatment. In particular, it is strongly recommended to use at least
slightly informative design priors: if investigators hold no expectation
at all about the effect of the drug or treatment (e.g. from prior phase I
trials), it is questionable why and whether a phase II trial should be
conducted in the first place. Using flat design priors is thus highly
unrealistic in several respects.
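As a small illustration of what a slightly informative design prior encodes, the sketch below looks at a \(\mathrm{Beta}(1,2)\) prior for the control rate \(p_1\) and a \(\mathrm{Beta}(2,1)\) prior for the treatment rate \(p_2\), the values used in the next example. It computes the prior means and the prior probability that the treatment rate exceeds the control rate; the helper name prior_mean is hypothetical.

```r
# Sketch only: what the design priors Beta(1, 2) for control and Beta(2, 1)
# for treatment encode, using base R.
prior_mean <- function(a, b) a / (a + b)
prior_mean(1, 2)  # control:   1/3
prior_mean(2, 1)  # treatment: 2/3

# Prior probability that p2 > p1 under independent Beta(1, 2) and Beta(2, 1)
integrand <- function(x) dbeta(x, 1, 2) * pbeta(x, 2, 1, lower.tail = FALSE)
p_better <- integrate(integrand, lower = 0, upper = 1)$value
p_better  # 5/6, compared with 1/2 under two flat Beta(1, 1) priors
```

The value \(5/6\) follows analytically from \(\int_0^1 2(1-x)(1-x^2)\,dx\), so even these mild hyperparameters shift noticeable prior mass towards a beneficial treatment effect.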
Next, we therefore perform a sample size search for the ICT-107-type
trial (balanced arms) under informative design priors with very strong
evidence thresholds k = 1/30 and k_f = 30.
Notice the changed parameters
a_1_d = 1, b_1_d = 2 and a_2_d = 2, b_2_d = 1,
which are the hyperparameters of the Beta design priors for
\(p_1\) and \(p_2\) under \(H_+\). These express slight optimism about
the treatment effect in the sense that they can be thought of as having
already observed 1 success and 2 failures in the control group and 2
successes and 1 failure in the treatment group. Also, we lower our
requirement for the probability of compelling evidence in favour of
\(H_0\) to, say, \(60\%\). We additionally request the
reporting of the frequentist type-I-error for the calibrated design by
specifying report_freq_type1 = TRUE in the function
call:
des_informative <- design_twoarm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/30,
  k_f = 30,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.60,
  # design priors: flat under H0, informative under H+;
  # analysis priors: flat Beta(1,1)
  a_0_d = 1, b_0_d = 1,
  a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_a = 1, b_1_a = 1,
  a_2_a = 1, b_2_a = 1,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # report frequentist type-I-error? (optional here)
  report_freq_type1 = TRUE,
  # equal randomisation
  alloc1 = 0.5,
  alloc2 = 0.5,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
We summarize the results:
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: yes
Search overview
n evaluated = 91
pointwise feasible = 28
sustained feasible = 27
first pointwise n = 72
first sustained n = 74
Selected design
n_total = 74, n1 = 37, n2 = 37
The output shows that the first sample size for which the target constraints hold pointwise is \(n=72\). However, we require that the operating characteristics also do not violate their respective constraints over the next ten sample sizes (that is, power should not decrease below its target threshold, type-I-error should not increase above its target threshold, and the probability of compelling evidence should not drop below its target threshold). The first sample size for which this holds is \(n=74\). This leads to the selected design with \(n_1=37\) and \(n_2=37\) patients in the control and treatment group.
Details on the implementation: For each operating
characteristic we also compute a metric‑specific sustained attainment
sample size that respects the user‑supplied sustain_n constraint.
Concretely, we form separate logical indicators over the search grid for
Bayesian power (≥ target_power), Bayesian type‑I error (≤target_type1),
CE(H0) (≥target_ce_h0), and frequentist power (≥target_freq_power).
Given such an indicator vector for a particular metric, we then search
for the first total sample size \(n\)
such that the metric’s target is satisfied not only at \(n\), but also for all subsequent total
sample sizes in the forward window of length sustain_n + 1,
truncated at the upper end of the search range. The vertical reference
lines in the diagnostic plots are drawn at these metric‑specific
sustained crossing points. This ensures that the plotted “required”
sample sizes reflect the same sustained feasibility logic as the
calibration itself, so that, for example, Bayesian power may first reach
its nominal threshold at \(n=72\), but
the corresponding vertical line will only be shown at \(n=74\) if the power constraint fails to
remain satisfied over the next sustain_n total sample
sizes.
The above figure illustrates the sustained feasibility logic which
is currently implemented in the calibration algorithm. In this toy example with
sustain_n + 1 = 5, the threshold of 80% is achieved at \(n=71\), yet the
sample size eventually selected is \(n=76\). For \(n=76\), all \(5\) sample sizes in the window from \(n=76\) to \(n=80\) satisfy the operating characteristic
threshold of at least 80%, which is not the case for the window starting at \(n=71\).
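The sustained feasibility rule described above can be sketched in a few lines of base R. The function first_sustained and the toy indicator vector are hypothetical: the pattern (target met at \(n=71\), missed for \(n=72,\dots,75\), met from \(n=76\) onwards) is made up to match the description, and the code is an illustration of the logic rather than the package's implementation.

```r
# Sketch only: first grid value from which a target stays satisfied over a
# forward window of sustain_n + 1 sample sizes (truncated at the grid end).
first_sustained <- function(n_grid, ok, sustain_n) {
  stopifnot(length(n_grid) == length(ok))
  for (i in seq_along(n_grid)) {
    window <- i:min(i + sustain_n, length(n_grid))
    if (all(ok[window])) return(n_grid[i])
  }
  NA_integer_  # no sample size in the grid is feasible in the sustained sense
}

# Toy indicator: target met at n = 71, missed for n = 72..75, met from n = 76 on
n_grid <- 70:85
ok <- n_grid == 71 | n_grid >= 76

n_grid[which(ok)[1]]                        # first pointwise n: 71
first_sustained(n_grid, ok, sustain_n = 4)  # first sustained n: 76
```

Applying the same function separately to each operating characteristic gives one sustained crossing point per metric, which is the quantity the vertical reference lines are described as marking.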
Now, back to our calibrated design. We plot the results:
Figure 3: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under the alternative hypothesis
We see that now the Bayesian power is calibrated for \(n=74\) patients in total and does not drop below the required 80% for at least the next ten sample sizes (it does not drop below 80% for any sample size up to \(n=100\), as can be verified by the plot). Frequentist power is calibrated for \(n=81\) patients in total. The Bayesian type-I-error is already calibrated for \(n=10\), requiring only \(5\) patients per trial arm. Importantly, the frequentist type-I-error is also calibrated and is \(0.034<0.05\), as can be inspected by printing the design object:
print(des_informative)
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 74, n1 = 37, n2 = 37
Operating characteristics
Power = 0.8004
Type-I error = 0.0021
CE(H0) = 0.6697
Freq. Type-I = 0.0340
Freq. Power = 0.7778
The probability of compelling evidence for \(H_-\) is shown in the bottom plot. It is calibrated for \(n=49\), so the trial design is fully calibrated from a Bayesian perspective if \(n=74\) patients are recruited in total (\(n_1=37\) in the control and \(n_2=37\) in the treatment group): at that sample size, the probability of compelling evidence also meets its target.
Based on the above plot we can see that the probability of compelling evidence does not reach 80% in the sample size range up to \(n=100\) patients. However, suppose we want a trial design which achieves such a high probability of compelling evidence for \(H_0\), but we cannot afford to recruit more than \(n=100\) patients in total. A possible solution is to modify the design priors under \(H_-\) to express more information about our expectation of the effect the novel drug or treatment has.
Thus, we perform a sample size search for a new ICT-107-type trial
(balanced arms) under informative design priors with very strong
evidence thresholds, and change the design prior under \(H_-\) to achieve the
target probability of compelling evidence PCE(H0) at even smaller
sample sizes. Note that now, additionally, the
hyperparameters of the Beta design priors for \(p_1\) and \(p_2\) under \(H_-\) are specified via
a_1_d_Hminus = 2, b_1_d_Hminus = 1 and
a_2_d_Hminus = 1, b_2_d_Hminus = 2. Note also that we increased
target_ce_h0 = 0.60 to target_ce_h0 = 0.80:
des_informative_higher_ce <- design_twoarm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/30,
  k_f = 30,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.80,
  # design priors: flat under H0, informative under H+;
  # analysis priors: flat Beta(1,1)
  a_0_d = 1, b_0_d = 1,
  a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_a = 1, b_1_a = 1,
  a_2_a = 1, b_2_a = 1,
  # design prior parameters under H_-
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # report frequentist type-I-error? (optional here)
  report_freq_type1 = TRUE,
  # equal randomisation
  alloc1 = 0.5,
  alloc2 = 0.5,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
We check the results:
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: yes
Search overview
n evaluated = 91
pointwise feasible = 28
sustained feasible = 27
first pointwise n = 72
first sustained n = 74
Selected design
n_total = 74, n1 = 37, n2 = 37
The design has not changed. Why is that? We plot the results:
Figure 4: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under both hypotheses and a stronger requirement on the probability of compelling evidence (80% instead of only 60%)
The plot shows that the calibration sample sizes for Bayesian power, type-I-error and frequentist power remain identical to the previous function call. The only things that changed are the design priors under \(H_-\) in the top panel, and the bottom panel for the probability of compelling evidence. First, the design priors under \(H_-:p_2 \leq p_1\) now put more prior probability mass on small success probabilities in the treatment group with parameter \(p_2\), and more prior probability mass on large success probabilities in the control group with parameter \(p_1\). This is precisely what \(H_-:p_2 \leq p_1\) expresses, and thus under \(H_-\) (which plays the role of \(H_0\) in the directional test), we can expect that evidence for \(H_0\) accumulates faster. This is reflected in the bottom panel for the probability of compelling evidence, as now \(n=74\) patients suffice to reach 80% probability of compelling evidence for \(H_0\).
The result is a fully calibrated Bayesian design which meets Bayesian power demands of 80%, Bayesian type-I-error rate requirements of less than 5%, and our requirement of 80% on the probability of compelling evidence for \(H_0\) (that is, \(H_-\) in this case).
What about the frequentist operating characteristics of this design? We see that \(n=81\) patients in total suffice to calibrate the design additionally in terms of frequentist power.
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 74, n1 = 37, n2 = 37
Operating characteristics
Power = 0.8004
Type-I error = 0.0011
CE(H0) = 0.8004
Freq. Type-I = 0.0340
Freq. Power = 0.7778
The type-I-error is still calibrated, so choosing \(n=81\) patients in total even yields a fully calibrated design both from a Bayesian and frequentist perspective.
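The calibration function design_twoarm_onestage_bf
reveals several aspects. If a balanced design with equal randomization
probabilities is desired, then \(n=74\) patients in total (\(n_1=n_2=37\)) suffice for a Bayesian calibration with the thresholds \(k=1/30\) and \(k_f=30\), and \(n=81\) patients in total additionally calibrate frequentist power.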
In the original ICT-107 trial, \(2/3\) of the patients were randomized into
the treatment group, while \(1/3\) of
the patients were randomized into the control group. We can use the
parameters alloc1 and alloc2 to specify
randomization probabilities for the control and treatment arms and carry
out the Bayesian sample size calculations based on these randomization
probabilities. As an example, we rerun the last calibration, but use the
randomization probabilities of the ICT-107 trial:
des_informative_higher_ce_uneq_alloc <- design_twoarm_onestage_bf(
  n_min = 10,
  n_max = 100,
  k = 1/30,
  k_f = 30,
  test = "BF+-",
  calibration = "Bayesian",
  target_power = 0.80,
  target_type1 = 0.05,
  target_ce_h0 = 0.80,
  # design priors: flat under H0, informative under H+;
  # analysis priors: flat Beta(1,1)
  a_0_d = 1, b_0_d = 1,
  a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_a = 1, b_1_a = 1,
  a_2_a = 1, b_2_a = 1,
  # design prior parameters under H_-
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  # assumed true proportions for frequentist power (optional here)
  p1_power = 0.3, p2_power = 0.6,
  # report frequentist type-I-error? (optional here)
  report_freq_type1 = TRUE,
  # 2:1 randomisation (treatment:control), as in the ICT-107 trial
  alloc1 = 1/3,
  alloc2 = 2/3,
  # require sustained feasibility over the next 10 larger n
  sustain_n = 10L,
  progress = FALSE
)
We summarize the results:
Summary: One-stage two-arm Bayes factor design
---------------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Feasible: yes
Search overview
n evaluated = 91
pointwise feasible = 18
sustained feasible = 18
first pointwise n = 83
first sustained n = 83
Selected design
n_total = 83, n1 = 28, n2 = 55
We plot the results:
Figure 5: Visualization of the calibrated Bayesian two-arm one-stage phase II design with a binary endpoint, using informative design priors under both hypotheses and a stronger requirement on the probability of compelling evidence (80% instead of only 60%). Additionally, unequal randomization probabilities are used when calibrating the design.
Remember that the sample size shown on the x-axis of the power and type-I-error rate plot, as well as of the probability of compelling evidence plot, is the total sample size over both arms. We see that now we need \(n=83\) patients in total to reach Bayesian power of 80%, while \(n=88\) patients in total are required for frequentist power calibration of 80%. The probability of compelling evidence reaches 80% at \(n=83\) patients in total. The frequentist type-I-error rate is still below the required 5% threshold, too:
One-stage two-arm Bayes factor design
------------------------------------
Mode: optimal
Status: Smallest feasible one-stage two-arm design found.
Calibration: Bayesian
Optional freq. Type-I reporting: on
Design: n_total = 83, n1 = 28, n2 = 55
Operating characteristics
Power = 0.8018
Type-I error = 0.0011
CE(H0) = 0.8018
Freq. Type-I = 0.0369
Freq. Power = 0.7829
If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, the requirements translate into the following sample sizes.
To fulfill all four requirements, it suffices if \(n_1=29\) patients in the control arm and \(n_2=59\) in the treatment arm are enrolled in the trial, and the Bayes factor thresholds \(k=1/30\) and \(k_f=30\) are used for decision making about the hypotheses \(H_+\) and \(H_-\) under consideration.
For a Bayesian calibration only, it suffices if \(n_1=28\) patients in the control arm and \(n_2=55\) in the treatment arm are enrolled in the trial.
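The arm-specific sample sizes reported in this vignette are consistent with a simple rounding rule applied to the total sample size. The sketch below shows one such rule; the function split_total is hypothetical, and how the package rounds internally is an assumption here.

```r
# Sketch only: one rounding rule that reproduces the reported arm sizes.
# alloc1 is the randomisation probability of the control arm.
split_total <- function(n_total, alloc1) {
  n1 <- round(n_total * alloc1)
  c(n1 = n1, n2 = n_total - n1)
}

split_total(74, 1/2)  # n1 = 37, n2 = 37
split_total(83, 1/3)  # n1 = 28, n2 = 55
split_total(88, 1/3)  # n1 = 29, n2 = 59
```

Assigning the remainder to the second arm guarantees that the two arm sizes always add up to the total sample size on the search grid.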
This vignette has illustrated how to design and calibrate two‑arm
one‑stage Bayes factor trials with binary endpoints using the
bfbin2arm package. The core workflow starts from specifying
a Bayes factor test (two‑sided or directional), choosing coherent design
and analysis priors under the competing hypotheses, and then mapping
clinical requirements onto calibration targets for Bayesian power,
Bayesian type‑I error, and the Bayesian probability of compelling
evidence for the null (or \(H_-\) in
directional tests). The central calibration function
design_twoarm_onestage_bf() searches over a user‑defined
grid of total sample sizes and returns a design object that contains the
selected allocation \((n_1, n_2)\), the
corresponding total sample size \(n_{\text{total}}\), and both Bayesian and
frequentist operating characteristics at the chosen design.
A key innovation is the use of a sustained feasibility constraint,
controlled by the argument sustain_n, which guards against
oscillatory behaviour of operating characteristics driven by the
discreteness of the binomial model. Instead of treating a sample size as
feasible as soon as it meets its calibration thresholds pointwise, the
algorithm only accepts a candidate \(n_{\text{total}}\) if all relevant targets
hold at that \(n_{\text{total}}\) and
continue to hold for at least the next sustain_n larger
total sample sizes within the search range. The diagnostic plots reflect
this logic: for each operating characteristic (Bayesian power, Bayesian
type‑I error, CE(H0), and optional frequentist power), the vertical
reference line is drawn at the first total sample size where the
corresponding metric attains its target in this sustained sense. As a
result, the graphical summaries and numerical design recommendations are
aligned and directly interpretable as robust to local oscillations in
the operating characteristic curves.
Using the ICT‑107 phase II trial as a running example, we have shown
how flat design priors can be replaced by more informative priors that
encode realistic expectations about treatment and control response
rates. This shift often allows one (1) to achieve the desired
calibration targets at substantially smaller total sample sizes compared
to flat priors and (2) to meet stricter requirements on certain operating
characteristics such as the probability of compelling evidence for
identical sample sizes, especially when strong evidence thresholds
(e.g. \(k = 1/30\), \(k_f = 30\)) are required. The vignette has
also demonstrated how to handle equal and unequal randomization, how to
request frequentist type‑I error and power alongside the Bayesian
criteria, and how to interpret the resulting design recommendations in
terms of total sample size \(n_{\text{total}}\) and arm‑specific
allocations. Overall, the bfbin2arm package provides a
flexible, unified framework in which Bayesian, frequentist, hybrid, and
fully dual calibrations can be performed and visualised in a way that is
directly tied to clinically meaningful decision thresholds.
Further details on the methodology can be found in (Kelter 2026).