Posted by: Gabriele Paolacci | July 19, 2010


Do workers on AMT pay attention to the directions provided by experimenters?

We recruited participants from AMT, discussion boards around the Internet, and an introductory subject pool at a Midwestern American university. After completing some tasks, participants were administered the Subjective Numeracy Scale (SNS; Fagerlin et al., 2007). The SNS is an eight-item self-report measure of perceived ability to perform various mathematical tasks and of preference for numerical versus prose information, and it provided an ideal context for an Instructional Manipulation Check (IMC), as discussed in Oppenheimer et al. (2009). Embedded among the SNS items, participants encountered a question with a precise and obvious answer (“While watching television, have you ever had a fatal heart attack?”). The question used a six-point scale anchored at “Never” and “Often,” very similar to the SNS scales, making it an ideal test of whether participants were paying attention to the survey.

Participants in the three subject pools did not differ in the attention they paid to the survey. Participants on Mechanical Turk had the lowest IMC failure rate, although the number of respondents who failed the IMC was very low and did not differ significantly across subject pools, χ²(2, N = 301) = 0.187, p = .91. See here for a more detailed analysis of the experiment.
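For readers who want to run this kind of comparison themselves, here is a minimal sketch of the chi-square test of independence on a pass/fail-by-pool table. The counts below are purely hypothetical (the post does not report the raw contingency table); only the computation is illustrated, using nothing beyond the Python standard library.

```python
# Pearson chi-square test of independence, computed by hand.
# The counts are HYPOTHETICAL, chosen only to illustrate the calculation.

def pearson_chi2(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Rows: passed IMC, failed IMC; columns: AMT, discussion boards, lab pool.
table = [[98, 97, 96],   # hypothetical pass counts
         [3, 3, 4]]      # hypothetical fail counts
stat, dof = pearson_chi2(table)
print(f"chi2({dof}) = {stat:.3f}")  # compare against a chi-square table for p
```

With failure counts this small, the statistic stays near zero, which is the pattern the post reports: no detectable difference in attention across pools.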


Fagerlin, A., Zikmund-Fisher, B., Ubel, P., Jankovic, A., Derry, H., & Smith, D. (2007). Measuring numeracy without a math test: Development of the subjective numeracy scale. Medical Decision Making, 27, 672–680.

Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.


  1. Is it really that common to run experiments — on AMT or elsewhere — without including catch trials of some sort?

    The percentage of AMT participants paying attention seems to vary a great deal depending on features of the task: how long it is meant to take, how many related HITs there are, etc. I’ve definitely seen failure rates as high as 20–30% on catch trials, though I’ve also had rates as low as 0–5%.

    • We routinely run very large tasks on AMT without including “catch trials”. We rely on enough redundancy, and on latent class models, to jointly infer the quality of the workers and the correct answer for each question.

      While the latent class models can easily incorporate “known cases”/“catch trials”, we did not see a significant improvement on very large tasks when using them. The benefits were apparent only at the very early stages of the annotation tasks and were minimal later on.

      Of course, our tasks are mainly annotation/coding tasks, for which we know there is a single correct answer. It is unclear whether this approach works for the tasks you focus on.
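The latent-class idea in the comment above can be sketched in a few lines: jointly estimate each worker’s reliability and each item’s correct label, with no catch trials. This is a simplified one-accuracy-parameter-per-worker variant of the Dawid–Skene EM approach, not the commenter’s actual models; the worker names and votes are illustrative.

```python
# Minimal latent-class sketch: infer worker accuracy and item labels jointly.
from collections import Counter, defaultdict
import math

def infer_labels(votes, n_iter=10, smooth=1.0):
    """votes: list of (worker_id, item_id, label). Returns (answers, accuracy)."""
    labels_by_item = defaultdict(list)
    for w, i, lab in votes:
        labels_by_item[i].append((w, lab))

    # Initialize each item's answer by simple majority vote.
    answers = {i: Counter(l for _, l in pairs).most_common(1)[0][0]
               for i, pairs in labels_by_item.items()}

    accuracy = {}
    for _ in range(n_iter):
        # Worker accuracy = smoothed agreement rate with current answers.
        agree, total = Counter(), Counter()
        for w, i, lab in votes:
            total[w] += 1
            agree[w] += (lab == answers[i])
        accuracy = {w: (agree[w] + smooth) / (total[w] + 2 * smooth)
                    for w in total}
        # Re-estimate answers, weighting each vote by the worker's
        # log-odds of being correct: reliable workers count for more.
        for i, pairs in labels_by_item.items():
            scores = defaultdict(float)
            for w, lab in pairs:
                scores[lab] += math.log(accuracy[w] / (1 - accuracy[w]))
            answers[i] = max(scores, key=scores.get)
    return answers, accuracy

# Illustrative votes: two reliable workers plus one who always answers wrong.
votes = [("good1", "q1", "A"), ("good2", "q1", "A"), ("spam", "q1", "B"),
         ("good1", "q2", "B"), ("good2", "q2", "B"), ("spam", "q2", "A"),
         ("good1", "q3", "A"), ("good2", "q3", "A"), ("spam", "q3", "B")]
answers, acc = infer_labels(votes)
```

With enough redundancy per item, the inferred accuracies separate unreliable workers from reliable ones, which is why “known cases” add little once the task is large.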

  2. I agree. Failure rates vary a lot in our field as well, and these were lower than usual.

    It might be that our preceding tasks were not so cognitively demanding, and that people in all subject pools generally got to the IMC without strong temptations to engage in satisficing behavior. In other words, we may have been below a threshold that produces differences in the IMC between AMT and more traditional subject pools.

    I have a feeling that the result is pretty robust, but one should definitely look more into it. If other studies showed different patterns, it would be informative about what kinds of experiment one should and should not run on AMT.

  3. This post took a lot of writing, as I was trying to keep it from sounding too harsh. I enjoy your blog and work. Part of what I write about, though, is research methods, and this was an ideal launching-off point.

  4. I’m glad we incidentally provided a start for a possibly interesting discussion, which really does not relate to this single result. Check out my comment on your blog.

  5. I don’t think it is all that informative to treat the number of people who pass the catch trial as the be-all, end-all proportion of “people paying attention”. The number of people who pass or fail varies as a function of many variables. What matters is the relative proportion of people who pass/fail across categories.

  6. I agree with Jesse’s comment, of course. Catch trials give us an estimate of how many people are paying attention, but obviously easier catch trials will show more people are paying attention. They are presumably useful for tossing out the *worst* participants, thus increasing your power (that’s the point, I think, of the paper that was referenced). And you can compare different testing scenarios.

