
Assessing impact needs a reliable comparison group


This letter discusses an article in Stanford Social Innovation Review and was first published there.

“Dressed to Thrive” [in Stanford Social Innovation Review, Winter 2013] describes the work of Fitted For Work (FFW) in helping women into work. By way of demonstrating FFW’s effectiveness, it reports that “75 percent of women who received wardrobe support and interview coaching from FFW find employment within three months… In comparison…about 48 percent of women who rely on Australian federal job agencies find work within a three-month period.”

But the comparison isn’t valid, and doesn’t demonstrate anything about FFW’s effect. This is because women who get FFW’s support differ from those who don’t in (at least) two respects. First, they found out about FFW and chose to approach it for help. It’s quite possible that the women who do this are better networked and more motivated than those who don’t. That would be a ‘selection bias’ in the women whom FFW serves. And second, of course, the women who come to FFW get FFW’s support. The comparison doesn’t show how much of the difference is due to the selection effect and how much is due to FFW’s support.

The purpose of any social intervention is to improve something more than would have happened anyway. So it’s important that we use reliable comparisons. That means isolating the effect of the intervention from other effects such as selection bias.
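To make the selection effect concrete, here is a minimal simulation sketch (not part of the original letter). Every number in it is hypothetical, and the programme is assumed to have no true effect at all: people simply differ in their baseline chances of finding work, and the better-placed people are more likely to seek help.

```python
# A minimal sketch, not from the article: all numbers are hypothetical and the
# programme is assumed to have ZERO true effect, so any gap between the two
# groups is pure selection bias.
import random

random.seed(0)

outcomes = {"participants": [], "comparison": []}

for _ in range(100_000):
    # Each person's baseline chance of finding work within three months.
    employability = random.uniform(0.2, 0.8)

    # Better-networked, more motivated people are more likely to seek out
    # the programme -- the selection effect.
    joins_programme = random.random() < employability

    # The programme does nothing here: the outcome depends only on the
    # person's baseline employability.
    found_work = random.random() < employability

    group = "participants" if joins_programme else "comparison"
    outcomes[group].append(found_work)

for group, results in outcomes.items():
    print(f"{group}: {sum(results) / len(results):.0%} found work")

# Approximate output:
#   participants: 56% found work
#   comparison:   44% found work
# A double-digit gap, with no programme effect whatsoever.
```

A naive comparison of participants against everyone else shows a sizeable apparent benefit even though, by construction, the programme achieves nothing. Correcting for this, for instance by comparing like with like or by randomising who is offered help, is exactly what the studies described next set out to do.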

This isn’t just theory. Microloans to poor villages in Northeast Thailand appeared to be having a positive effect when analysed using readily-available comparators. But these analyses didn’t deal with selection bias in the people who took the loans. A careful study which did correct for selection bias, and looked at how those people would have fared anyway, found that the loans had little impact. They had no effect at all on the amounts that households save, the time they spend working or the amount they spend on education.

Without such careful research, we risk wasting precious resources on programmes which don’t actually work. Worse, selection effects are sometimes so strong that a programme can appear to work even if it’s actually producing harm. [Example below.] The importance of using reliable comparisons is clear from the unusually ardent title of a medical journal editorial last year about medical trials: it was called ‘Blood on Our Hands: See The Evil In Inappropriate Comparators’.

None of which is to say that FFW’s programme doesn’t work; rather, these data don’t show whether it works or not. We all need to be rigorous in assessing social programmes, lest we waste our resources helping people only a little when we could be helping a lot.

[Added later]: Medicine has many examples of practices being withdrawn after a proper comparison shows them to be harmful. Here’s Ben Goldacre on just one:

“We used to think that hormone-replacement therapy reduced the risk of heart attacks by around half, for example, because this was the finding of a small trial, and a large observational study. That research had limitations. The small trial looked only at “surrogate outcomes”, blood markers that are associated with heart attack, rather than real-world attacks; the observational study was hampered by the fact that women who got prescriptions for HRT from their doctors were healthier to start with. But at the time, this research represented our best guess, and that’s often all you have to work with.

When a large randomised trial looking at the real-world outcome of heart attacks was conducted, it turned out that HRT increased the risk by 29%.”

This high-profile innovation relies on an unreliable comparison too →

How to get a reliable comparison →


