Guest post by Eyal Peer
Many researchers who use MTurk are concerned about the quality of data they can get from their MTurk workers. Can we trust our “Turkers” to read instructions carefully, answer all the questions candidly and follow our procedures prudently? Moreover, are there specific types of workers we should be targeting to ensure high quality data? I stumbled upon this question when I tried to run an experiment about cheating on MTurk. I gave Turkers the opportunity to cheat by over-reporting lucky coin tosses, and found very low levels of cheating. I was surprised, especially in light of all the research on how dishonest people can be when given a chance. I then realized that, as I usually do, I had sampled only “high reputation” Turkers (those who had more than 95% of previous HITs approved). I was even more surprised to see that when I removed this restriction and allowed any Turker to take part in my study, cheating went up considerably. This led us (Joachim Vosgerau, Alessandro Acquisti and me) to ask whether high reputation workers are indeed less dishonest and, presumably, produce higher quality data. If so, then perhaps sampling only high reputation workers can be sufficient to obtain high quality data. We found that to be true. In a paper forthcoming in Behavior Research Methods, we report how high reputation workers produced very high quality data, even when no efforts were made to increase their attention.
Until then, my colleagues and I had been relying mostly on attention-check questions (henceforth, ACQs) to ensure high quality data. These are “trick” questions that test respondents’ attention by prescribing a specific response to those who read the questions or the preceding instructions (e.g., “have you ever had a fatal heart attack?” see Paolacci, Chandler, & Ipeirotis, 2010). Using ACQs has inherent disadvantages. First, including ACQs might disrupt the natural flow of our survey and might even (in extreme cases) interfere with our manipulation. Second, it’s possible that attentive Turkers might get offended by such “trick” questions, and might react unfavorably. Lastly, and perhaps most importantly, even when failing ACQs can be considered a reliable signal that the participant has not paid attention in the rest of the survey, a sampling bias might be created if those who fail ACQs are excluded post-hoc from the sample. On the other hand, dropping ACQs and sampling only high reputation workers could also restrict the sample’s size, reduce the response rate and create a response bias. So, we compared these two methods for ensuring high quality data: using ACQs vs. sampling only high reputation workers (e.g., those with 95% or more approved HITs).
In the first study, we found that high reputation workers did provide higher quality data on all measures: they failed common ACQs very rarely, their questionnaires’ reliability was high and they replicated known effects such as the anchoring effect. This was true whether or not these workers received ACQs. Namely, high reputation workers who did not receive any ACQs produced data of similarly high quality to those who did receive (and pass) our ACQs. Low reputation workers, on the other hand, failed ACQs more often and their data quality was much lower. However, ACQs did make a difference in that group. Those who passed the ACQs provided higher quality data, sometimes very similar to the quality of data obtained from high reputation workers. So, it seemed that ACQs were not necessary for high reputation workers. Moreover, sampling only high reputation workers did not reduce the response rate (in fact, it increased it) and we couldn’t find any evidence for a sampling bias, as their demographics were similar to those of low reputation workers. However, we used pretty common ACQs (such as the one mentioned above), and it was very likely that high reputation workers, being highly active on MTurk, recognized those questions, which would have made the questions less effective for them. We thus ran another study with novel ACQs, and found similar results: high reputation workers produced high quality data with or without ACQs, even when the ACQs were not familiar to them. We also found that, among the high reputation workers, those who had been more productive on MTurk (completed more HITs in the past) produced slightly higher quality data than less productive workers. But even the less productive high reputation workers produced very high quality data, with or without ACQs.
Our final conclusion was that reputation is a sufficient condition for obtaining high quality data on MTurk. Although using ACQs doesn’t seem to hurt much (we did not find any evidence of ACQs causing reactance), they also seem to be unnecessary when using high reputation workers. Because sampling only high reputation workers did not reduce the response rate or create a selection bias, and since most researchers we know use only those workers anyway, it seems to be the preferable method for ensuring high quality data. ACQs might have been necessary in the past, when completing behavioral surveys and experiments was not very common on MTurk, but their efficacy seems to have diminished recently. As MTurk is a highly dynamic environment, this assertion might not hold true in the future, as new and different people join MTurk. Nevertheless, we believe that if one has to choose whether or not to use ACQs, it’s better not to use them and, instead, simply sample high reputation workers to get high quality data. Personally, I have stopped using ACQs altogether and am still getting reliable and high quality data in all my studies on MTurk.
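For readers who post HITs programmatically, here is a minimal sketch of what "sampling only high reputation workers" could look like as a qualification requirement. It is not taken from our paper. It assumes the request shape used by the MTurk API (as exposed through boto3) and the system qualification ID for the percentage of approved assignments, which you should verify against the current MTurk documentation. The function name and the 95 default are illustrative; the sketch uses "95% or more", so switch the comparator if you want a strict "more than 95%".

```python
# Assumption: this is MTurk's system qualification ID for
# "percent of assignments approved"; confirm it in the MTurk docs.
PERCENT_APPROVED_QUAL_ID = "000000000000000000L0"


def high_reputation_requirement(min_percent=95):
    """Build a qualification requirement that admits only workers whose
    approval rate is at least `min_percent` (hypothetical helper)."""
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    return {
        "QualificationTypeId": PERCENT_APPROVED_QUAL_ID,
        # "95% or more"; use "GreaterThan" for a strict cutoff.
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [min_percent],
        # Workers below the cutoff can still see the HIT but cannot accept it.
        "ActionsGuarded": "Accept",
    }


requirement = high_reputation_requirement()
```

With boto3, a dictionary like this would be passed in the `QualificationRequirements` list when creating a HIT; requesters who use the MTurk web interface can set the same restriction through the worker requirements options instead.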
Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.
Peer, E., Vosgerau, J., & Acquisti, A. (in press). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods.