Fool's Gold. RCTs are neither golden nor a standard.

As a professional evaluator, nothing gets my goat like reading the phrase: 'Randomised Controlled Trials (RCTs) are the gold standard of evaluation'. When I read this I always yell 'NO THEY ARE NOT' at my cat. (She doesn't care, as she is a supporter of RCTs.) RCTs are a good evaluation method, but they are NOT the gold standard! There is no such thing as an evaluation method that is best and most appropriate across all contexts.

 

"A photo of the author's face when she reads the phrase "RCTs are the gold standard of evaluation."

"A photo of the author's face when she reads the phrase "RCTs are the gold standard of evaluation."

This is why I was so happy to see Michael Patton's article 'Fool's gold: the widely touted methodological "gold standard" is neither golden nor a standard', originally published on the Better Evaluation blog on the 4th of December 2014. Patton is a giant in the field of evaluation theory and practice, and as usual he can discuss the reasons why RCTs should not be held up as the gold standard of evaluation far more eloquently than I can. So take it away, Patton!

The Wikipedia entry for Randomised Controlled Trials (RCTs), reflecting common usage, designates such designs as the “gold standard” for research. News reports of research findings routinely repeat and reinforce the “gold standard” designation for RCTs. Government agencies and scientific associations that review and rank studies for methodological quality acclaim RCTs as the gold standard.

The Gold Standard Versus Methodological Appropriateness

A consensus has emerged in evaluation research that evaluators need to know and use a variety of methods in order to address the priority questions of particular stakeholders in specific situations. But researchers and evaluators get caught in contradictory assertions:

(a) select methods appropriate for a specific evaluation purpose and question, and use multiple methods—both quantitative and qualitative—to triangulate and increase the credibility and utility of findings, but

(b) one question is more important than others (the causal attribution question), and one method (RCTs) is superior to all other methods in answering that question.

Thus, we have a problem. The ideal of researchers and evaluators being situationally responsive, methodologically flexible, and sophisticated in using a variety of methods runs headlong into the conflicting ideal that experiments are the gold standard and all other methods are, by comparison, inferior. Who wants to conduct (or fund) a second-rate study if there is an agreed-on gold standard?

The Rigidity of a Single, Fixed Standard

The gold standard allusion derives from international finance, in which the rates of exchange among national currencies were fixed to the value of gold. Economic historians share a “remarkable degree of consensus” about the gold standard as the primary cause of the Great Depression.

The gold standard system collapsed in 1971 following the United States’ suspension of convertibility from dollars to gold. The system failed because of its rigidity. And not just the rigidity of the standard itself but also the rigid ideology of the people who believed in it: policymakers across Europe and North America clung to the gold standard despite the huge destruction it was causing. There was a clouded mind-set with a moral and epistemological tinge that kept them advocating the gold standard until political pressure emerging from the disaster became overwhelming.

Treating RCTs as the gold standard is no less rigid. Asserting a gold standard inevitably leads to demands for standardization and uniformity (Timmermans & Berg, 2003). Distinguished evaluation pioneer Eleanor Chelimsky (2007) has offered an illuminative analogy:

It is as if the Department of Defense were to choose a weapon system without regard for the kind of war being fought; the character, history, and technological advancement of the enemy; or the strategic and tactical givens of the military campaign. (p. 14)

The gold standard accolade means that funders and policymakers begin by asking, “How can we do an experimental design?” rather than asking, “Given the state of knowledge and the priority inquiry questions at this time, what is the appropriate design?” Here are examples of the consequences of this rigid mentality:

  • At an African evaluation conference, a program director came up to me in tears. She directed an empowerment program with women in 30 rural villages. The funder, an international agency, had just told her that to have the funding renewed, she would have to stop working in half the villages (selected randomly by the funder) in order to create a control group going forward. The agency was under pressure for not having enough “gold standard evaluations.” But, she explained, the villages and the women were networked together and were supporting each other. Even if they didn’t get funding, they would continue to support each other. That was the empowerment message. Cutting half of them off made no sense to her. Or to me.
  • At a World Bank conference on youth service learning, the director of a university service-learning program presented her approach during an exercise in evaluation design. She explained that she carefully selected 40 students each year and matched them to villages that needed the kind of assistance the students could offer. Matching students and villages was key, she explained. A senior World Bank economist told her and the group to forget matching. He advised an RCT in which she would randomly assign students to villages and then create a control group of qualified students and villages that did nothing to serve as a counterfactual. He said, “That’s the only design we would pay any attention to here. You must have a counterfactual. Your case studies of students and villages are meaningless and useless.” The participants were afterward aghast that he had completely dismissed the heart of the intervention: matching students and villages.
  • I’ve encountered several organizations, domestic and international, that give bonuses to managers who commission RCTs for evaluation to enhance the organization’s image as a place that emphasizes rigor. The incentives to do experimental designs are substantial and effective. Whether they are appropriate or not is a different question.

Those experiences, multiplied 100 times, are what have generated this rumination.

Evidence-Based Medicine and RCTs

Medicine is often held up as the bastion of RCT research in its commitment to evidence-based medicine. But here again, gold standard designation has a downside, as observed by the psychologist Gary Klein (2014):

Sure, scientific investigations have done us all a great service by weeding out ineffective remedies. For example, a recent placebo-controlled study found that arthroscopic surgery provided no greater benefit than sham surgery for patients with osteoarthritic knees. But we also are grateful for all the surgical advances of the past few decades (e.g., hip and knee replacements, cataract treatments) that were achieved without randomized controlled trials and placebo conditions. Controlled experiments are therefore not necessary for progress in new types of treatments and they are not sufficient for implementing treatments with individual patients who each have unique profiles.

Worse, reliance on EBM can impede scientific progress. If hospitals and insurance companies mandate EBM, backed up by the threat of lawsuits if adverse outcomes are accompanied by any departure from best practices, physicians will become reluctant to try alternative treatment strategies that have not yet been evaluated using randomized controlled trials. Scientific advancement can become stifled if front-line physicians, who blend medical expertise with respect for research, are prevented from exploration and are discouraged from making discoveries.

RCTs and Bias

RCTs aim to control bias, but implementation problems turn out to be widespread:

Even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials. (Chan, Hróbjartsson, Haahr, Gøtzsche, & Altman, 2004, p. 2457)

The result is that “a great many published research findings are false” (Ioannidis, 2005).

Methodological Appropriateness as the Platinum Standard

It may be too much to hope that the gold standard designation will disappear from popular usage. So perhaps we need to up the ante and aim to supplant the gold standard with a new platinum standard: methodological pluralism and appropriateness. To do so, I offer the following seven-point action plan (and resources below):

1. Educate yourself about the strengths and weaknesses of RCTs.

2. Never use the "gold standard" designation yourself. If it comes up, refer to the “so-called gold standard.”

3. When you encounter someone referring to RCTs as "the gold standard", don’t be shy. Explain the negative consequences and even dangers of such a rigid pecking order of methods.

4. Understand and be able to articulate the case for methodological pluralism and appropriateness, to wit, adapting designs to the existing state of knowledge, the available resources, the intended uses of the inquiry results, and other relevant particulars of the inquiry situation.

5. Promote the platinum standard as higher on the hierarchy of research excellence. 

6. Don’t be argumentative and aggressive in challenging gold standard narrow-mindedness. It’s more likely a matter of ignorance than intolerance. Be kind, sensitive, understanding, and compassionate, and say, “Oh, you haven’t heard. The old RCT gold standard has been supplanted by a new, more enlightened, Knowledge-Age platinum standard.” (Beam wisely.)

7. Repeat Steps 1 to 6 over and over again.