This was originally posted on the ISCON Facebook page; I repost it here in its entirety:
Klaus Fiedler has granted me permission to share a letter that he wrote to a reporter (Bruce Bowers) in response to the replication project. This letter contains Klaus’s words only and the only part I edited was to remove his phone number. I thought this would be of interest to the group.
These are his words on the 2015 article on estimating the replicability of psychology.
Thanks for your email. You can call me tomorrow but I guess what I have to say is summarized in this email.
Before I try to tell it like it is, I ask you to please attend to my arguments, not just the final evaluations, which may appear unbalanced. So if you want to include my statement in your article, maybe along with my name, I would be happy not to detach my evaluative judgment from the arguments that in my opinion inevitably lead to my critical evaluation.
First of all I want to make it clear that I have been a big fan of properly conducted replication and validation studies for many years – long before the current hype of what one might call a shallow replication research program. Please note also that one of my own studies has been included in the present replication project; the original findings have been borne out more clearly than in the original study. So there is no self-referent motive for me to be overly critical.
However, I have to say that I am more than disappointed by the present report. In my view, such an expensive, time-consuming, and resource-intensive replication study, which can be expected to receive so much attention and to have such a strong impact on the field and on its public image, should live up (at least) to the same standards of scientific scrutiny as the studies that it evaluates. I’m afraid this is not the case, for the following reasons …
The rationale is to plot the effect size of replication results as a function of original results. Such a plot is necessarily subject to regression toward the mean. On a-priori-grounds, to the extent that the reliability of the original results is less than perfect, it can be expected that replication studies regress toward weaker effect sizes. This is very common knowledge. In a scholarly article one would try to compare the obtained effects to what can be expected from regression alone. The rule is simple and straightforward. Multiply the effect size of the original study (as a deviation score) with the reliability of the original test, and you get the expected replication results (in deviation scores) – as expected from regression alone. The informative question is to what extent the obtained results are weaker than the to-be-expected regressive results.
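The regression rule described above can be sketched in a few lines of Python. This is only an illustration: the function name, the `grand_mean` parameter (the mean effect size that the deviation scores are taken from), and the numbers are my own assumptions, not values from the replication report.

```python
def expected_replication_effect(original, reliability, grand_mean=0.0):
    """Effect size expected in a replication from regression to the mean alone.

    The original effect is expressed as a deviation from grand_mean,
    shrunk by the reliability, and then shifted back.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    deviation = original - grand_mean
    return grand_mean + reliability * deviation


# Hypothetical numbers: original effect 0.50, mean effect across studies 0.20,
# reliability of the original result 0.60.
expected = expected_replication_effect(0.50, 0.60, grand_mean=0.20)
print(round(expected, 2))  # 0.38
```

With these made-up values, a replication effect of 0.38 would be exactly what regression predicts; only a shortfall below 0.38 would say anything beyond regression.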
To be sure, the article’s muteness regarding regression is related to the fact that the reliability was not assessed. This is a huge source of weakness. It has been shown (in a nice recent article by Stanley & Spence, 2014, in PPS) that measurement error and sampling error alone will greatly reduce the replicability of empirical results, even when the hypothesis is completely correct. In order not to be fooled by statistical data, it is therefore of utmost importance to control for measurement error and sampling error. This is the lesson we took from Frank Schmidt (2010). It is also very common wisdom.
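To see how measurement error and sampling error together can sink a replication of a true effect, here is a rough sketch. It assumes classical test theory with error only in the dependent measure (so a standardized mean difference shrinks by the square root of the reliability) and a normal approximation to the power of a two-group comparison. The function names and all numbers are hypothetical and are not taken from Stanley & Spence or from the replication report.

```python
from math import sqrt
from statistics import NormalDist


def attenuated_d(true_d, reliability):
    """Observed standardized effect when only the dependent measure is unreliable."""
    return true_d * sqrt(reliability)


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group test.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_crit)


# Hypothetical true effect d = 0.5 with 50 participants per group.
for reliability in (1.0, 0.7, 0.4):
    d_obs = attenuated_d(0.5, reliability)
    print(reliability, round(d_obs, 3), round(approx_power(d_obs, 50), 3))
```

The loop prints the attenuated effect and the approximate power at each reliability, so one can see the chance of a "successful" replication dropping even though the hypothesis is, by construction, correct.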
The failure to assess the reliability of the dependent measures greatly reduces the interpretability of the results. Some studies may use single measures to assess an effect whereas others may use multiple measures and thereby enhance the reliability, according to a principle well-known since Spearman & Brown. Thus, some of the replication failures may simply reflect the naïve reliance on single-item dependent measures. This is of course a weakness of the original studies, but a weakness different from non-replicability of the theoretically important effect. Indeed, contrary to the notion that researchers perfectly exploit their degrees of freedom and always come up with results that overestimate their true effect size, they often make naïve mistakes.
By the way, this failure to control for reliability might explain the apparent replication advantage of cognitive over social psychology. Social psychologists may simply often rely on single measures, whereas cognitive psychologists use multi-trial designs resulting in much higher reliability.
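The Spearman & Brown principle mentioned above is easy to put into numbers. The sketch below uses the standard Spearman-Brown prophecy formula; the single-item reliability of 0.30 and the trial counts are made-up values for illustration only.

```python
def spearman_brown(single_item_reliability, k):
    """Reliability of a composite of k parallel items or trials."""
    r = single_item_reliability
    return k * r / (1 + (k - 1) * r)


# Hypothetical single-item reliability of 0.30.
for k in (1, 5, 10, 40):
    print(k, round(spearman_brown(0.30, k), 3))
```

Under these assumptions a single item stays at 0.30, while ten parallel trials already reach roughly 0.81, which is the kind of gap between single-measure and multi-trial designs the letter points to.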
The failure to consider reliability refers to the dependent measure. A similar failure to systematically include manipulation checks renders the independent variables equivocal. The so-called Duhem-Quine problem refers to the unwarranted assumption that some experimental manipulation can be equated with the theoretical variable. An independent variable can be operationalized in multiple ways. A manipulation that worked a few years ago need not work now, simply because no manipulation provides a plain manipulation of the theoretical variable proper. It is therefore essential to include a manipulation check, to make sure that the very premise of a study is met, namely a successful manipulation of the theoretical variable. Simply running the same operational procedure as years before is not sufficient, logically.
Last but not least, the sampling rule that underlies the selection of the 100 studies strikes me as hard to tolerate. Replication teams could select their studies from the first 20 articles published in a journal in a year (if I correctly understand this sentence). What might have motivated the replication teams’ choices? Could this procedure be sensitive to their attitude towards particular authors or their research? Could they have selected simply studies with a single dependent measure (implying low reliability)? – I do not want to be too suspicious here but, given the costs of the replication project and the human resources, does this sampling procedure represent the kind of high-quality science the whole project is striving for?
Across all replication studies, power is presupposed to be a pure function of the size of participant samples. The notion of a truly representative design in which tasks and stimuli and context conditions and a number of other boundary conditions are taken into account is not even mentioned (cf. Westfall & Judd).
What do you think about this?
I 100% agree with his concern about the expense. Speaking with some of the replicators, we estimated the endeavor cost over 1 million euros, all told. This paid for the time of 300 psychologists, who ‘donated’ their time to the endeavor. The taxpayer paid for this… Is it the best use of their tax dollars? I guess not.
I also definitely agree with his assessment about regression to the mean.