When should we test the non-inferiority hypothesis?

An article from the Stitch Fix team suggests using a non-inferiority trial approach in marketing and product A/B testing. This approach is useful when the new solution being tested has benefits that the test metrics themselves cannot capture.

The simplest example is cutting costs: say we automate the process of assigning the first lesson, but we don't want end-to-end conversion to drop too much. Or we test a change aimed at one segment of users while making sure that conversion for the other segments does not sag too much (when testing several hypotheses at once, don't forget about multiple-comparison corrections).

Choosing the right non-inferiority margin adds extra challenges at the test design stage. The question of how to choose Δ is not covered well in the article, and it seems this choice is not completely transparent in clinical trials either: a review of medical publications on non-inferiority reports that only half of the publications justified the choice of margin, and those justifications were often ambiguous or lacking in detail.

In any case, this approach looks interesting: by reducing the required sample size, it can increase the speed of testing and, hence, the speed of decision making. — Daria Mukhina, product analyst for the Skyeng mobile app.

The Stitch Fix team loves to test different things, and so does basically the whole tech community. Which version of the site attracts more users, A or B? Does version A of the recommender model make more money than version B? Almost always, to test hypotheses, we use the simplest approach from a basic statistics course:

H0: μB − μA = 0 (there is no difference between the versions)
H1: μB − μA ≠ 0 (one of the versions is better than the other)

Although we rarely use the term, this form of testing is called superiority testing. With this approach, we assume there is no difference between the two options and stick with that assumption until the data are convincing enough to abandon it, that is, until they show that one option (A or B) is better than the other.
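For concreteness, here is a minimal sketch of what this standard superiority-style check usually looks like in practice; the data, group sizes and the 0.05 threshold below are illustrative assumptions, not something taken from the article:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user metric for the two variants (e.g. a satisfaction score).
a = rng.normal(loc=3.80, scale=1.0, size=5000)
b = rng.normal(loc=3.83, scale=1.0, size=5000)

# Classic two-sided superiority test:
# H0: mu_B - mu_A = 0, H1: mu_B - mu_A != 0.
t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject H0, one variant looks better than the other")
else:
    print(f"p = {p_value:.3f}: failed to reject H0 (NOT the same as 'A equals B')")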

Superiority hypothesis testing is suitable for solving a variety of problems. We release the B version of the recommender model only if it is clearly better than the A version already in use. But in some cases, this approach does not work so well. Let's look at a few examples.

1) We use a third-party service that helps identify fake bank cards. We have found another service that costs significantly less. If the cheaper service works as well as the one we currently use, we will switch to it; it does not have to be better than the one we are using.

2) We want to retire data source A and replace it with data source B. We could delay retiring A if B turned out to be much worse, but keeping A around indefinitely is not an option.

3) We would like to move from modeling approach A to approach B, not because we expect better results from B, but because B gives us more operational flexibility. We have no reason to believe B will be worse, but we will not make the switch if it is.

4) We have made some qualitative changes to the website design (version B) and believe that this version is better than version A. We do not expect changes in conversion or any of the KPIs we normally use to evaluate the site, but we believe there are advantages in aspects that are either unmeasurable or that we do not yet have the technology to measure.

In all these cases, superiority testing is not the best fit, yet most practitioners in such situations use it by default. We carefully run the experiment to estimate the size of the effect correctly. If versions A and B really do behave very similarly, chances are we will not be able to reject the null hypothesis. Do we then conclude that A and B work essentially the same way? No! Failing to reject the null hypothesis and accepting the null hypothesis are not the same thing.

Sample size calculations (which you did, of course) usually impose tighter bounds on the Type I error (the probability of wrongly rejecting the null hypothesis, usually called alpha) than on the Type II error (the probability of failing to reject the null hypothesis when it is in fact false, usually called beta). A typical value for alpha is 0.05, while a typical value for beta is 0.20, which corresponds to a statistical power of 0.80. That means there is a 20% chance we will miss a true effect of the size we specified in our power calculation, and that is a rather serious gap in information. As an example, consider the following hypotheses:


H0: my backpack is NOT in my room
H1: my backpack is in my room

If I searched my room and found my backpack, great, I can drop the null hypothesis. But if I looked around the room and couldn't find my backpack (Figure 1), what conclusion should I draw? Am I sure it's not there? Have I searched carefully enough? What if I only searched 80% of the room? To conclude that there is definitely no backpack in the room would be a rash decision. No wonder we can't "accept the null hypothesis".
[Figure 1: a room with the searched area shaded ("the area we searched"). We didn't find the backpack - should we accept the null hypothesis?]

Figure 1. Searching 80% of a room is about the same as doing a search with 80% power. If you didn't find a backpack after looking around 80% of the room, can you conclude that it's not there?

So what should a data scientist do in this situation? You could greatly increase the power of the study, but then you would need a much larger sample size, and even then a non-significant result would still not let you accept the null hypothesis.

Fortunately, such problems have long been studied in the world of clinical research. Drug B is cheaper than drug A; drug B is expected to cause fewer side effects than drug A; drug B is easier to transport because it does not need to be refrigerated, while drug A does. In such cases we test the hypothesis of non-inferiority: we want to show that version B is just as good as version A, at least within some predetermined non-inferiority margin, Δ. We'll talk about how to set this margin a bit later. For now, assume it is the minimum difference that is practically meaningful (in clinical trials this is usually called clinical significance).

Non-inferiority hypotheses turn everything upside down:

H0: μB − μA ≤ −Δ (B is worse than A by more than Δ)
H1: μB − μA > −Δ (B is worse than A by no more than Δ, or not worse at all)

Now, instead of assuming there is no difference, we assume that version B is worse than version A, and we stick to that assumption until we show it is not the case. This is exactly the situation where a one-sided hypothesis test makes sense! In practice, the easiest way to do this is to construct a confidence interval and check whether the whole interval lies above −Δ (Figure 2).
[Figure 2: confidence intervals for the difference between means, μB − μA, compared against the non-inferiority margin −Δ]
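In code, the check comes down to a confidence interval for the difference of means compared against −Δ. A minimal sketch, where the data, the margin Δ = 0.1 and the use of a 95% two-sided interval (equivalent to a one-sided test at α = 0.025) are all illustrative assumptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
delta_margin = 0.10  # non-inferiority margin (illustrative)

# Hypothetical per-user metric for the current version A and the new version B.
a = rng.normal(loc=3.80, scale=1.0, size=4000)
b = rng.normal(loc=3.79, scale=1.0, size=4000)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# 95% two-sided CI for mu_B - mu_A.
z = stats.norm.ppf(0.975)
lower, upper = diff - z * se, diff + z * se

if lower > -delta_margin:
    print(f"CI = ({lower:.3f}, {upper:.3f}): B is non-inferior to A within {delta_margin}")
else:
    print(f"CI = ({lower:.3f}, {upper:.3f}): cannot claim non-inferiority")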

Choosing Δ

How do we choose the right Δ? Choosing Δ involves both statistical reasoning and subject-matter judgment. In the world of clinical research there are regulatory guidelines which suggest that the margin should be the smallest clinically meaningful difference, one that would matter in practice. Here is a quote from a European guideline to check yourself against: “If the margin has been chosen appropriately, a confidence interval that lies entirely between –∆ and 0… is still sufficient to demonstrate non-inferiority. If this result does not seem acceptable, it means that ∆ was not chosen appropriately.”

The margin certainly should not exceed the effect size of version A relative to the true control (placebo / no treatment), because then we could declare that B is "non-inferior" to A while it is actually no better than, or even worse than, the true control. Suppose that before version A was introduced, version 0 was in its place, or the feature did not exist at all (see Figure 3).

A superiority test at that time showed an effect of size E (that is, μ̂A − μ̂0 = E). Now A is our standard, and we want to make sure that B is as good as A. The null hypothesis μB − μA ≤ −Δ can be rewritten as μB ≤ μA − Δ. If we choose Δ equal to or greater than E, then μA − Δ ≤ μA − E ≈ μ0, the placebo level. Our estimate of μB could then lie entirely above μA − Δ, which rejects the null hypothesis and lets us conclude that B is non-inferior to A, while at the same time μB could be at or below the placebo level, which is not what we want (Figure 3).

Figure 3. The risk of choosing too large a non-inferiority margin. If the margin is too wide, we may conclude that B is non-inferior to A while being indistinguishable from placebo. We would not trade a drug that is clearly more effective than placebo (A) for a drug that is only as effective as placebo.
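To make this concrete with purely illustrative numbers (they are not from the article): suppose that on a 5-point satisfaction scale the old baseline scored μ0 = 3.5 and version A scored μA = 3.8, so E = 0.3. If we then set Δ = 0.3, a confidence interval for μB − μA lying entirely above −0.3 would let us declare B "non-inferior" even if μB ≈ 3.5, that is, no better than the baseline A already beat. Choosing Δ well below E, say 0.1, avoids this trap.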

Choosing α

Now let's turn to the choice of α. You could use the standard value α = 0.05, but that would not be entirely fair, a bit like buying something online and applying several discount codes that were never supposed to stack: the developer just made a mistake and you got away with it. Since the non-inferiority test is one-sided, α should be half of the α used in a two-sided superiority test, i.e. 0.05 / 2 = 0.025.

Sample size

How do we estimate the sample size? If you assume that the true difference in means between A and B is 0, the calculation is the same as for the superiority test, except that you replace the effect size with the non-inferiority margin (provided you set α_non-inferiority = 1/2 α_superiority). If you have reason to believe that B is actually slightly better than A, but you only need to show that it is no more than Δ worse, you are in luck: this effectively reduces the required sample size, because it is easier to demonstrate that B is not much worse than A when you believe it is actually a bit better rather than exactly equal.

A worked example

Let's say you want to switch to version B, provided that it is no more than 0.1 points worse than version A on a 5-point customer satisfaction scale. Let's first approach this problem with a superiority test.

To test the superiority hypothesis, we would calculate the sample size as follows:

n per group = 2σ²(Z1−α/2 + Z1−β)² / θ²,

where θ is the minimum effect size we want to detect (here 0.1), α = 0.05 and the power 1 − β = 0.90.

That is, with 2103 observations per group you can be 90% sure of detecting an effect of 0.10 or more. But if an effect of 0.10 is too large for your purposes, perhaps it is not worth running a superiority test for it at all. You might want to power the study for a smaller effect, say 0.05; in that case you need 8407 observations, and the sample grows almost 4 times. What if we keep the original effect size but raise the power to 0.99, so that a negative result leaves little doubt? Then n per group becomes 3676, which is better than 8407 but still inflates the sample by more than 50%. And in the end, if we fail to reject the null hypothesis, we still will not get an answer to the question we actually care about.
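As a rough sketch of this calculation, here is the normal-approximation formula in code; σ = 1 is an assumption, and the results can differ from the figures above by an observation or two depending on rounding and on whether a t-distribution correction is applied:

import math
from scipy import stats

def n_superiority(theta, sigma=1.0, alpha=0.05, power=0.90):
    # Per-group sample size for a two-sided superiority test (normal approximation).
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / theta**2)

print(n_superiority(0.10))              # ~2102 per group
print(n_superiority(0.05))              # ~8406 per group, about 4x more
print(n_superiority(0.10, power=0.99))  # ~3675 per group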

What if instead we test the non-inferiority hypothesis?

n per group = 2σ²(Z1−α + Z1−β)² / (Δ + (μB − μA))²

The sample size is calculated with almost the same formula. The differences from the superiority formula are as follows:

- Z1−α/2 is replaced by Z1−α; but if you follow the rules and replace α = 0.05 with α = 0.025, this is the same number (1.96)

- (μB − μA) appears in the denominator

- θ (the effect size) is replaced by Δ (the non-inferiority margin)

If we assume that μB = μA, then (μB − μA) = 0, and the sample size calculation with the non-inferiority margin gives exactly what we got in the superiority calculation for an effect size of 0.1. Great: we can run a study of the same size, with different hypotheses and a different approach to drawing conclusions, and answer the question we really want to answer.

Now suppose we don't really believe that μB = μA and think that μB is actually a little better, say by 0.01 units. This increases the denominator, reducing the sample size per group to 1737.
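The same sketch adapted to the non-inferiority case (again assuming σ = 1 and the normal approximation):

import math
from scipy import stats

def n_non_inferiority(delta_margin, true_diff=0.0, sigma=1.0, alpha=0.025, power=0.90):
    # Per-group sample size for a one-sided non-inferiority test,
    # H0: mu_B - mu_A <= -delta_margin (normal approximation).
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / (delta_margin + true_diff)**2)

print(n_non_inferiority(0.10))                  # ~2102, same as the superiority case
print(n_non_inferiority(0.10, true_diff=0.01))  # ~1737 if B is in fact slightly better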

What happens if version B is actually better than version A? We reject the null hypothesis that B is more than Δ worse than A and accept the alternative that B, if worse at all, is worse by no more than Δ, and may even be better. Try putting that conclusion into a cross-functional presentation and see what happens (seriously, try it). When decisions about the future are being made, nobody wants to settle for "worse by no more than Δ and possibly better."

In this case, we can run what is known as a combined non-inferiority and superiority test. It uses two sets of hypotheses:

The first set (the same as in the non-inferiority test):

H0: μB − μA ≤ −Δ
H1: μB − μA > −Δ

The second set (the same as in the superiority test):

H0: μB − μA = 0
H1: μB − μA ≠ 0

We test the second set only if the first null hypothesis is rejected. With this sequential testing, the overall Type I error rate (α) is preserved. In practice, this can be done by constructing a 95% confidence interval for the difference between the means and checking whether the entire interval lies above −Δ. If it does not, we cannot reject the null hypothesis and we stop. If the entire interval does lie above −Δ, we go on and check whether the interval contains 0.
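A minimal sketch of this sequential decision rule; the interval bound and the margin below are illustrative numbers, not values from the article:

def sequential_decision(ci_lower, delta_margin):
    # ci_lower is the lower bound of a 95% two-sided CI for mu_B - mu_A;
    # for this one-sided sequential procedure only the lower bound is needed.
    # Step 1: non-inferiority, the whole interval must lie above -delta_margin.
    if ci_lower <= -delta_margin:
        return "stop: cannot reject H0, non-inferiority not shown"
    # Step 2: superiority, tested only because step 1 succeeded,
    # which keeps the overall Type I error rate at alpha.
    if ci_lower > 0:
        return "B is non-inferior and also superior to A"
    return "B is non-inferior to A; superiority is not shown"

print(sequential_decision(ci_lower=-0.04, delta_margin=0.10))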

There is another type of research that we have not discussed - equivalence studies.

Studies of this type are sometimes used in place of non-inferiority studies and vice versa, but there is an important difference between them. A non-inferiority test aims to show that B is at least as good as A. An equivalence test aims to show that B is at least as good as A and that A is at least as good as B, which is harder. In essence, we are trying to determine whether the entire confidence interval for the difference between the means lies between −Δ and Δ. Such studies require larger samples and are run less often. So the next time you run a study whose main goal is to make sure the new version is at least as good as the old one, don't settle for "failing to reject the null hypothesis." If you want to test a hypothesis that really matters, consider the different options.

Source: habr.com
