How Do We Know If a COVID-19 Drug Really Works?

Flawed studies appear to offer hope, but rigorous studies can tell a different story.

Edmond Alkaslassy
12 min read · Sep 22, 2020


Dr. Stella Immanuel is a pediatrician who claims to have treated over 400 COVID-19 patients with hydroxychloroquine (HCQ) without a single death; she concludes that HCQ is an effective treatment for COVID-19. Her video statement was retweeted by the president of the United States and became international news. Should we believe her?

Although HCQ and other therapies (oleandrin!) have been touted as cures for COVID-19, the experts have not been convinced. Instead, Dr. Fauci and others have urged caution about studies that are not properly “randomized” and “controlled.” But what does that mean? How does Dr. Fauci know that Dr. Immanuel’s work is not trustworthy?

Experts use straightforward, easy-to-understand criteria to determine the trustworthiness of clinical research on experimental drugs. The gold standard is a randomized, double-blind study with a control group taking a placebo and an experimental group taking the experimental drug. This is called a “randomized double-blind placebo-controlled” study. Reliable studies also include a large number of subjects, analyze the data using statistical methods, and are peer reviewed.

Each of these elements makes a unique contribution to the trustworthiness of a study. Together, they ensure logical thought, reduce bias and provide transparency. (What’s not to like?) Let’s examine each of these seven elements to see how they increase our confidence, and then use them to evaluate Dr. Immanuel’s claims.

Having control and experimental groups ensures logical thought. The only way to determine the effect (positive, negative or neutral) of an experimental drug is to compare it to a control. The control group takes a placebo (e.g., a sugar pill). If the patients in the experimental group fare better or worse than the patients in the control group then the difference must be due to the experimental drug. If both groups fare equally well then the experimental drug is as effective as a sugar pill.

But imagine a study with no control group (or “control arm”). For example, suppose one hundred patients take a drug and 90 of them get better. That sounds promising, but maybe 100% of the patients in a control group would have gotten better. Or maybe the experimental group got better more slowly than the control group would have. There is simply no way to know. In the absence of a control, it is a logical impossibility to determine the effect of any treatment.

Randomization reduces researcher bias by randomly assigning patients to the experimental and control groups. Suppose a cancer researcher hopes that an experimental drug will be effective (and it would be perfectly natural to have that bias). Consciously or not, the researcher could put a thumb on the scale to increase the odds that the study will show the drug to be effective. How could she do that? She could assign slightly healthier patients to the experimental group. That would increase the odds that the experimental group would fare better than the control group, and lead to the potentially erroneous conclusion that the experimental drug is effective. Randomization prevents this bias by ensuring that the two groups are populated at random.
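
In practice, randomization can be as mechanical as shuffling the patient list. Here is a minimal sketch in Python, with hypothetical patient IDs; real trials use more elaborate schemes (stratified or block randomization, for instance), but the core idea is just this:

```python
import random

# Hypothetical patient IDs; a real trial would use its own roster.
patients = [f"patient_{i:03d}" for i in range(1, 201)]

random.shuffle(patients)  # chance, not a researcher, decides who goes where

half = len(patients) // 2
experimental_group = patients[:half]  # these patients receive the drug
control_group = patients[half:]       # these patients receive the placebo
```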

Another strategy for reducing bias is double blindness. This ensures that both patients and researchers are in the dark about which group (control or experimental) a patient belongs to until after the data are collected. Consider our researcher who wants a cancer drug to be effective. If she knows that a particular cancer patient is in the experimental group, she might hope that the patient’s tumor shrank as a result of taking the drug. So when the researcher measures the size of that patient’s tumor at the end of the study she may (unconsciously?) tend to “round down” by excluding small, faint bits of tumor on the CT scan that an unbiased person might have included. Her biased measurements would show that the tumor shrank even if it did not. Her actions are not necessarily nefarious; it is just human nature (although nefarious motives are of course also possible). Regardless of motive, the data would be compromised by her bias; double blindness prevents it.

It is also better if the patients do not know whether they are in the control or experimental group. Placebos can make people feel better. If patients think they are taking a promising experimental drug, they may feel happier and less stressed, which might lead to measurable consequences unrelated to the actual treatment. The remedy for both researcher bias and the “placebo effect” is double blindness.

Statistical methods also reduce bias. If 14 of 100 patients in the experimental group and 11 of 100 patients in the control group were cured of a disease, would those results indicate that the experimental drug is effective? Before statistical analysis became standard practice, one could only eyeball the data and offer an opinion. Now reliable scientific studies use statistical tests to determine whether the difference between two groups is “statistically significant” (meaningful) or not. These tests are unbiased: unlike a researcher eyeballing the data, statistical tests do not (cannot) care about the outcome. ++
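
For the curious, the 14-versus-11 example above can be run through such a test in a few lines of Python. Fisher’s exact test is one standard choice for comparing two cure rates; the numbers here are the hypothetical ones from this paragraph, not data from any real trial:

```python
from scipy.stats import fisher_exact

# Hypothetical data: 14 of 100 cured in the experimental group,
# 11 of 100 cured in the control group.
#         cured  not cured
table = [[14, 86],   # experimental group
         [11, 89]]   # control group

odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")  # comes out well above 0.05: not significant
```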

A large sample size (hundreds or thousands of subjects) further increases our confidence. If you flipped a coin five times you might get five heads in a row. Maybe not often, but it can and does happen, purely by chance. But flip a coin 1000 times and you have my personal assurance that you will not flip 1000 heads in a row (with a “clean” coin, the probability of that outcome is (1/2) to the 1000th power, and none of us will live to see it). Small sample sizes are more likely to produce such chance outcomes, and that possibility reduces our confidence in the results. Large sample sizes are more resistant to chance events and so increase our confidence that results are meaningful and not simply due to chance.
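
You can check both claims on your own machine. A minimal Python sketch: the exact probability of 1,000 straight heads, plus a quick simulation showing how often a mere five flips come up all heads.

```python
import random

random.seed(1)  # fixed seed so the demo is reproducible

# Probability of 1,000 straight heads with a fair coin:
print(0.5 ** 1000)  # about 9.3e-302, effectively zero

# How often do five flips come up all heads? Simulate 100,000 runs.
runs = 100_000
all_heads = sum(
    all(random.random() < 0.5 for _ in range(5)) for _ in range(runs)
)
print(all_heads / runs)  # about 0.03, i.e. roughly 1 run in every 32
```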

Peer review increases our confidence by offering transparency. Research earns our trust when we know the best minds in the field have read and critiqued it. Research submitted to a journal for publication is peer reviewed. The reviewers may have concerns that are minor, major or both. Two heads are better than one, and the more expert heads scrutinizing a study, the more likely it is that important flaws will be detected. Only 30–45% of submitted articles pass the peer review process, and even fewer (10% or less) pass muster at more prestigious journals. Peer review is a crucial tool for vetting scientific research.

Studies that scrupulously follow this checklist of best practices leave very little room for doubt; they offer the greatest possible level of trustworthiness. However, it is worth noting two important caveats:

First, very few of us are expert scientists, and nobody becomes one via web searches or by watching any number of YouTube videos. Efforts to read new scientific research are laudable, and many of us are capable of getting the gist of published research, but we must be humble and realistic. Few of us will read a stack of scientific research articles on HCQ and have this thought: “This in vitro-in vivo disparity may be partly because of the complex pharmacokinetics of 4-aminoquinolines, and hence, the same applies to HCQ.” No, that is not going to happen. Deep critique of the science underlying new research requires skills and knowledge that are now, and always will be, far out of the reach of most of us. (Hence the majority’s prudent decision to listen to the experts.) To put it plainly, knowledge of the seven “gold standard” practices does not make anyone an expert scientist.

And second, these standards do not apply equally to all forms of research. The standards apply readily to clinical research on experimental drugs (e.g., studies searching for a cure for COVID-19). But research in other fields (e.g., genetics or climate change) is structured differently and so may not always be amenable to evaluation via the seven gold standards described here. Although the principles underlying the checklist have wide application, we should not expect every kind of research to precisely follow the checklist.

With those caveats in mind, let’s apply our checklist of best practices to Dr. Immanuel’s claims and find out how much confidence we should have in her work.

Her study included 400 patients. This is a respectable though not spectacular sample size; we are off to a promising start.

But there was no control group. A control group would have consisted of patients who received a placebo instead of the experimental drug, in this case HCQ (or an HCQ cocktail; more on this later). Dr. Immanuel claims that every patient taking HCQ survived, which seems like great news. But what if there had been a control group and everyone taking the placebo had also survived? That outcome would indicate that HCQ is as effective as a sugar pill.

It is even possible that Dr. Immanuel’s patients took longer to recover than they would have had they not taken HCQ, or that the HCQ treatment caused undesirable side effects that the control group would not have experienced. We have no way of knowing. In the absence of a control group, it is impossible to determine the effect of her treatment on her patients. The lack of a control is a catastrophic flaw for which there is no remedy.

The lack of a control causes additional problems. If the researchers and patients knew that all patients were taking HCQ (and they did, since there was no control group), then the study cannot be blind, let alone double blind. If the only measurement made at the end of the study was the binary categorization of whether subjects were “alive” or “dead” then this becomes a moot point; few would be so biased as to record a dead person as alive or vice versa. But if non-binary measurements (e.g., lung capacity) were also made then the potential for researcher bias exists.

In the absence of a control group, randomization is also impossible, but the larger problem is the absence of two groups (control and experimental) to whom patients should have been randomly assigned.

And without a control group it is impossible to use statistical methods to determine the efficacy of the treatment. The survival rate of the experimental group cannot be compared to that of the control group because there wasn’t one.

Clearly, the absence of a control group has extremely detrimental and cascading effects, all of which reduce our confidence to a very low level.

All of those methodological flaws would have been flagged in the peer review process. But has her work been subjected to peer review? Dr. Immanuel has not published her HCQ research. (I know of no evidence that she has even written a report, let alone submitted one for publication.) Thus her research has not been subjected to peer review, which is a necessity if her work is to be published in a reputable journal. Instead, we have only one researcher’s opinion about the quality of her research: hers.

The peer review process (like other aspects of good science) moves slowly and so is at odds with a public that is hungry for the latest research on COVID-19. But unvetted science is likely poor science. We should treat unreviewed research presented in informal venues — video statements (like those of Dr. Immanuel), press releases and the like — with skepticism. Our desire for a cure for COVID-19 makes it hard to contain our excitement when we hear of promising results, but we should curb our enthusiasm if the research has not been subjected to peer review.

However well intentioned Dr. Immanuel’s research and video statements may be, the absence of crucial components of clinical research (a control group, [double] blindness, randomization, statistical analysis, and peer review) makes it impossible to have confidence in her work.

And there are additional concerns about the quality of her research, concerns that would certainly have been noted by peer reviewers.

First, Dr. Immanuel says she has treated patients with HCQ “along with zinc, and the antibiotic Zithromax.” For argument’s sake, let us grant that the HCQ cocktail had a positive effect (although in the absence of a control group this is unknowable). How do we know that HCQ had a positive effect? Might the combination of only zinc and Zithromax (without HCQ) have had the same effect? We have no way of knowing.

Second, we do not know the ages of all 400 patients. Dr. Immanuel says that her research includes asthmatics, diabetics and the elderly, but how many of each, and how old were they? Perhaps her subjects consisted of 399 children and one elderly person. Or 399 elderly people and one child. We do not know, but this is vitally important information, and here’s why:

Younger people die from COVID-19 at a much lower rate than do older people. The CDC found that in 2,572 cases of COVID-19 in patients under the age of 18 there were only three deaths, or roughly 1 death out of every 850 cases. Dr. Immanuel is a pediatrician, so perhaps some or many of her patients were children. Treating 399 children and one elderly person for COVID-19 and having none die would not be a noteworthy success, since only about 1 in 850 children would be expected to die anyway. But successfully treating 399 elderly patients and one child would tell a different story. It is (again) difficult to have confidence in her research because she has not provided enough information; her claims lack sufficient transparency.
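
The arithmetic behind that point is easy to check. A sketch using only the CDC figure quoted above (the 399-children cohort is hypothetical):

```python
# CDC figure cited above: 3 deaths in 2,572 pediatric cases.
child_death_rate = 3 / 2572  # roughly 1 in 850

# Hypothetical cohort of 399 children:
expected_child_deaths = 399 * child_death_rate
print(round(expected_child_deaths, 2))  # about 0.47
```

With fewer than half a death expected, zero deaths among 399 children would be exactly what chance predicts, HCQ or no HCQ.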

Dr. Immanuel’s research does provide one definitive conclusion: Taking whatever combination of drugs she gave her 400 patients did not kill them. Although this is good news it comes up rather short of her claim that her HCQ cocktail cures COVID-19.

Perhaps there is more to her research than she has shared in her videos and interviews. Perhaps she will write and submit a report for publication. Perhaps her peers will find that her work is worthy of publication and it will be published in a scientific journal, and you and I will be able to read it for ourselves, and evaluate it in light of the gold standard practices. But none of that has happened. And it may never happen. What we currently know about Dr. Immanuel’s research indicates that it is very far from meeting any gold standard; it fails to earn our trust.

(In an effort to have the reader evaluate Dr. Immanuel’s research with an unbiased mind, I have refrained from noting until now that her medical expertise apparently extends to the DNA of aliens and the spermatozoa of demons.)

Sadly, claims that HCQ is an effective treatment for COVID-19 come from relatively weak studies such as Dr. Immanuel’s and these (see Table 2), all of which have material flaws in methodology: small sample size, lack of controls, lack of randomization and absence of peer review. These flaws seriously undermine our confidence; they are not niceties.

Unfortunately, these time-tested “best practices” are not universally appreciated. Consider HCQ researcher Dr. Raoult, whose own flawed work was subjected to heavy criticism by his peers. Dr. Raoult decried his critics (a.k.a. his peers) with a piquant phrase that may outlast his research: He says he is a victim of “the dictatorship of the methodologists.” We should be grateful for such a dictatorship.

The best studies on HCQ (like these and this one) deserve our confidence because they follow the best practices known to mankind. Regrettably, these reliable studies find that HCQ is not the cure for COVID-19 that we are all hoping for.

This does not mean the final word is in. Science is provisional: a final word is hard to come by. Perhaps a new HCQ study using a different combination of drugs or some other innovation will show promise. Studies on HCQ are ongoing and we should keep an open mind to future research.

The public should know that clinical studies on HCQ and other possible cures for COVID-19 earn our trust when they include a large sample size, control and experimental groups, researchers and subjects who are both “blind” to which group (experimental or control) each patient belongs, statistical analysis of the data, and vigorous peer review before (and after) publication. This is how experts determine which studies are trustworthy. And if you don’t trust the experts, you can read new studies and see for yourself whether they follow these best practices.

We are incredibly fortunate to live in a time when scientists know how to conduct clinical research of such high quality; we should accept nothing less.

++ Statistical significance is often reported as “p < 0.05” (read “p is less than five percent”); the “p” stands for “probability.” In studies that have an experimental and control group, this statement means: “If the drug actually had no effect, the probability that chance alone would produce a difference between the groups as large as the one we observed is less than five percent.” Because such a result would be unlikely to arise by chance, we conclude that the experimental group truly fared better than the control group, and thus that the drug is effective. (A study could also find that the experimental group fared significantly worse than the control group.) Sometimes we see even smaller numbers, such as p < 0.01 or p < 0.001. The smaller the number, the more confident we can be that the two groups are meaningfully different, e.g., that the drug works. And when p > 0.05 (read “p is greater than five percent”) we conclude that the study found no significant difference between the two groups.
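
For readers who like to tinker, here is a minimal sketch of where such a probability comes from, using the hypothetical 14-versus-11 example from earlier. Assume the drug does nothing, so both groups share the pooled cure rate, and count how often chance alone produces a gap at least as large as the observed one:

```python
import random

random.seed(42)  # fixed seed so the demo is reproducible

# Hypothetical example from earlier: 14 of 100 cured with the drug,
# 11 of 100 cured with the placebo.
cured_drug, cured_placebo, n = 14, 11, 100
observed_gap = abs(cured_drug - cured_placebo)

# Null hypothesis: the drug does nothing, so both groups share the
# pooled cure rate of (14 + 11) / 200 = 12.5%.
pooled_rate = (cured_drug + cured_placebo) / (2 * n)

trials = 100_000
hits = 0
for _ in range(trials):
    sim_drug = sum(random.random() < pooled_rate for _ in range(n))
    sim_placebo = sum(random.random() < pooled_rate for _ in range(n))
    if abs(sim_drug - sim_placebo) >= observed_gap:
        hits += 1

# Fraction of chance-only trials with a gap at least as large as the
# observed one: an estimate of the (two-sided) p-value.
print(f"simulated p is about {hits / trials:.2f}")
```

The simulated p comes out far above 0.05, which is exactly why a 14-versus-11 split on its own would not persuade a statistician.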

Author biography: Edmond Alkaslassy is Faculty Emeritus, Assistant Professor of Biology at Pacific University in Forest Grove, Oregon, and is writing a book that compares the daily lives of humans and other animals.
