Guest post by David J. Hauser

In this new article, Norbert Schwarz and I show in two experiments that answering an instructional manipulation check (IMC) changes the way participants approach later survey questions.

IMCs are often included in online research (and especially on MTurk) in order to assess whether participants are paying attention to instructions. However, participants can potentially see them as “trick” questions that violate conversational norms of trust. As a result, these questions may make participants more cautious when answering later questions in an effort to avoid being tricked again.

Two studies provided support for this hypothesis. In one study, participants received an IMC and the Cognitive Reflection Test (Frederick, 2005), a math test assessing the tendency to reflect and correct intuitive answers. Crucially, half of the participants completed the IMC before the CRT, whereas the other half completed the math test first. Completing the IMC first increased CRT scores (vs when the CRT came first), suggesting it increased systematic thinking.

In a second study, participants received an IMC and a probabilistic reasoning task assessing rational decision making (Toplak, West, & Stanovich, 2011). As before, half of the participants completed the IMC before the reasoning task, whereas the other half completed the reasoning task first. Completing the IMC first increased accuracy on the reasoning task (compared to completing the reasoning task first). Thus, answering an IMC teaches participants that there may be more than meets the eye to later questions, a conclusion that significantly alters participants’ reasoning strategies.

IMCs are typically conceptualized as measures, not interventions. However, as demonstrated here, they can also act as interventions that change how participants answer later questions. One should therefore exercise caution in IMC use.


This document guides you through a simple method to avoid recruiting MTurk workers who have already participated in one of your earlier studies. The core of the procedure relies on Excel (as opposed to CLT or the MTurk API) to assign a Qualification to multiple workers at the same time. Using this procedure will allow you to exclude from recruitment workers who participated in a previous related study (e.g., a study you are now replicating), and it can serve other goals too (e.g., executing longitudinal research, building your own panel).
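The first step of any such exclusion is a de-duplicated list of the workers who took part in earlier studies. The following is only a minimal sketch of that step in Python, not the Excel procedure itself. It assumes each earlier study was exported as a CSV file with a worker ID column named "WorkerId"; the column name, the function name, and the sample data are illustrative assumptions. Assigning the Qualification to these workers is not shown.

```python
import csv
import io


def previous_worker_ids(csv_files, id_column="WorkerId"):
    """Collect the unique worker IDs found in a set of results files.

    csv_files: iterable of open text files (or file-like objects).
    id_column: name of the worker ID column (an assumption; adjust to
               match the header of your own export).
    """
    ids = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            worker_id = (row.get(id_column) or "").strip()
            if worker_id:  # skip blank or missing IDs
                ids.add(worker_id)
    return ids


# Made-up data standing in for two earlier studies.
study1 = io.StringIO("WorkerId,Answer\nA1,yes\nA2,no\n")
study2 = io.StringIO("WorkerId,Answer\nA2,yes\nA3,no\n")

past_workers = previous_worker_ids([study1, study2])
print(sorted(past_workers))
```

Worker A2 appears in both made-up studies but is listed only once, which is what you want before assigning a Qualification in bulk.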

MTurk Workshop at ACR

The use of MTurk by behavioral researchers continues to increase. Despite the evidence on the benefits (and drawbacks) of MTurk, many researchers, reviewers, and editors intuitively distrust the reliability and validity of online labor markets.

On October 25, we will host a workshop at ACR called "Questioning the Turk: Conducting High Quality Research with Amazon Mechanical Turk". We will answer and debate questions from the ACR community regarding MTurk, and raise some new questions. We will discuss the current issues that arise from MTurk's use, as well as some of the solutions and replications.

The North American ACR conference will take place on October 24-26 at the Hilton Baltimore in Baltimore, MD.

Update: Thanks to all participants for contributing to a fruitful discussion! The slides we used in the workshop can be found here.

Review of MTurk as a Participant Pool

We recently published in Current Directions in Psychological Science a review of MTurk as a source of survey and experimental data. We discuss the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors. The Psych Report published a nice summary of the paper, which you can find here.


Paolacci, G., Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184-188.

2nd Workshop on Crowdsourcing and Online Behavioral Experiments

The Annual Workshop on Crowdsourcing and Online Behavioral Experiments (COBE) seeks to bring together researchers and academics to present their latest online behavioral experiments (e.g., on MTurk) and share new results, methods and best practices. See below the details of the workshop.

Quick survey about software for online data collection

Guest post by Todd Gureckis

Dear Experimental Turk readers,

My lab has been developing some open source software to simplify the process of running online experiments (e.g., using MTurk). We are interested in getting feedback from a wide swath of researchers about what types of features would be useful in this software.

If you have an interest (or experience) in running online experiments and have a moment, please fill out this quick survey. Your responses might help us gear our software development effort in a way that maximizes its utility for the overall community.

We will share the results of the survey as a comment to this post in a few weeks.


Todd Gureckis

Reputation as a Sufficient Condition for High Data Quality on MTurk

Guest post by Eyal Peer

Many researchers who use MTurk are concerned about the quality of data they can get from their MTurk workers. Can we trust our “Turkers” to read instructions carefully, answer all the questions candidly and follow our procedures prudently? Moreover, are there specific types of workers we should be targeting to ensure high quality data? I stumbled upon this question when I tried to run an experiment about cheating on MTurk. I gave Turkers the opportunity to cheat by over-reporting lucky coin tosses, and found very low levels of cheating. I was surprised, especially in light of all the research on how dishonest people can be when given a chance. I then realized that, as I usually do, I had sampled only “high reputation” Turkers (those who had more than 95% of previous HITs approved). I was more surprised to see that when I removed this restriction and allowed any Turker to take part in my study, cheating went up considerably. This led us (Joachim Vosgerau, Alessandro Acquisti and I) to wonder whether high reputation workers are indeed less dishonest and, presumably, produce higher data quality. If so, then perhaps sampling only high reputation workers can be sufficient to obtain high quality data. We found that to be true. In a paper forthcoming in Behavior Research Methods we report how high reputation workers produced very high quality data, even when no efforts to increase their attention were made.

Until then, my colleagues and I had been mostly relying on attention-check questions (henceforth, ACQs) to ensure high quality data. These are “trick” questions that test respondents’ attention by prescribing a specific response to those who read the questions or the preceding instructions (e.g., “have you ever had a fatal heart attack?” see Paolacci, Chandler, & Ipeirotis, 2010). Using ACQs has inherent disadvantages. First, including ACQs might disrupt the natural flow of our survey and might even (in extreme cases) interfere with our manipulation. Second, it’s possible that attentive Turkers might get offended by such “trick” questions, and might react unfavorably. Lastly, and perhaps most important, even when failing ACQs can be considered a reliable signal that the participant has not paid attention in the rest of the survey, a sampling bias might be created if those who fail ACQs are excluded post-hoc from the sample. On the other hand, not using ACQs and sampling only high reputation workers can also restrict the sample’s size, reduce response rate and might also create a response bias. So, we compared these two methods for ensuring high quality data: using ACQs vs. sampling only high reputation workers (e.g., those with 95% or more approved HITs).

In the first study, we found that high reputation workers did provide higher quality data on all measures: they failed common ACQs very rarely, their questionnaires’ reliability was high and they replicated known effects such as the anchoring effect. This was true whether or not these workers received ACQs. Namely, high reputation workers who did not receive any ACQ showed similarly high quality data compared to those who did receive (and pass) our ACQs. Low reputation workers, on the other hand, failed ACQs more often and their data quality was much lower. However, ACQs did make a difference in that group. Those who passed the ACQs provided higher data quality, sometimes very similar to the quality of data obtained from high reputation workers. So, it seemed that ACQs were not necessary for high reputation workers. Moreover, sampling only high reputation workers did not reduce response rate (in fact, it increased it) and we couldn’t find any evidence for a sampling bias, as their demographics were similar to those of low reputation workers. However, we used pretty common ACQs (such as the one mentioned above), and it was very likely that high reputation workers – being highly active on MTurk – recognized those questions, which would have made the questions less effective for them. We thus ran another study with novel ACQs, and found similar results: high reputation workers produced high quality data with or without ACQs, even when the ACQs were not familiar to them. We also found that, among the high reputation workers, those who had been more productive on MTurk (completed more HITs in the past) produced slightly higher data quality than less productive workers. But even the less productive high-reputation workers produced very high data quality, with or without ACQs.

Our final conclusion was that reputation is a sufficient condition for obtaining high quality data on MTurk. Although using ACQs doesn’t seem to hurt much (we did not find any evidence of ACQs causing reactance), they also seem to be unnecessary when using high reputation workers. Because sampling only high reputation workers did not reduce response rate or create a selection bias, and also since most researchers we know use only those workers anyway, it seems to be the preferable method to ensure high quality data. ACQs might have been necessary in the past, when completing behavioral surveys and experiments was not very common on MTurk, but their efficacy seems to have diminished recently. As MTurk is a highly dynamic environment, this assertion might not hold true in the future, as new and different people join MTurk. Nevertheless, we believe that if one has to choose between using ACQs or not, it’s better to not use them and, instead, simply sample high reputation workers to get high quality data. Personally, I have stopped using ACQs altogether and am still getting reliable and high quality data in all my studies on MTurk.


MTurk Roundtable at ACR

The North American Conference of the Association for Consumer Research is taking place on October 3-6 at the Hilton Palmer House Hotel in Chicago, IL. On October 4, the conference will host a roundtable about MTurk. Many issues will be discussed, ranging from practical decisions that researchers need to make when conducting studies (e.g., how much to compensate participants) to priorities for future research about crowdsourcing social science. The roundtable will take place at 11:00 am in the Indiana room.

Consequences of Worker Nonnaïveté: The Cognitive Reflection Test

In a previous post, we documented the existence of MTurk workers that are disproportionately likely to show up in academic studies, potentially leading to foreknowledge of experimental procedures. We report here a study that illustrates the potential challenges that foreknowledge can have for MTurk data validity.

The Cognitive Reflection Test (CRT; Frederick, 2005) is typically used to measure a stable individual difference in cognitive orientation. It consists of three questions, each of which elicits an intuitive response that can be recognized as wrong with some additional thought. As a result, the number of correct answers to the CRT can serve as a parsimonious measure of the individual’s tendency to make reflective decisions. Foreknowledge – either as a result of previous exposure or through information shared by others who have completed the task – is problematic for the CRT because it increases the likelihood that the individual has discovered the correct response and can provide it without reflection, or at a minimum is aware that there is a “trick” that necessitates that the question receive additional scrutiny.

The CRT appears frequently on MTurk, so workers who spend more time using MTurk should be more likely to provide the correct answer. We recruited one hundred workers who varied in their (known) prior experience on MTurk. Participants completed a study that included, among other measures, the original version of the CRT (Frederick, 2005). We found that workers who were known to have completed more research studies on MTurk answered more CRT questions correctly, suggesting that their performance was in fact improving with experience.

One alternative explanation for this finding is that more productive workers differ in some meaningful way from less productive workers. For example, they could be more reflective or conscientious. To rule this out, we asked the same workers to complete a “novel” version of the CRT (from Finucane & Gullion, 2010) prior to completing the “original” version. The questions posed by the two tests are logically identical and only differ in terms of their familiarity to workers. As expected, performances on the original and novel test were highly correlated. However, whereas prior experience significantly predicted performance on the original CRT, it did not predict performance on the novel CRT. This suggests that the results were not caused by a fundamental difference in the cognitive style of more experienced workers.

This study illustrates that using measures that are familiar to MTurk workers while assuming that participants are naïve is problematic. By collecting the CRT among non-naïve participants, researchers might draw false conclusions about their levels of cognitive reflection and about the relationship between CRT performance and any variable that correlates with worker experience. Moreover, nonnaïveté introduces another source of error that might obscure the relationship between cognitive reflection and other constructs of interest.

See full paper for more details about the study and a broader discussion of the challenges connected to worker nonnaïveté.


How naïve are MTurk workers?

More and more social scientists have adopted MTurk as a venue for their research, praising its speed, cost, and diversity relative to undergraduate samples. However, many of them may fail to take into account some other critical aspects that differentiate MTurk samples from undergraduate subject pool samples.

In a paper just published in Behavior Research Methods, we find worker non-naïveté to be a serious concern. One general issue is that MTurk workers share information about HITs with each other publicly and searchably on various forums, including on two different subreddits (see here and here for some collected examples of manipulation checks and common measures that have become common knowledge among workers via forums).

More specifically, while the probability that any worker has seen some manipulation may be low, there is a population of “superturkers”, i.e., extremely prolific workers, who are significantly more likely to end up in your studies. We pooled 16,408 HITs in 132 unique studies, and found that the HITs were completed by 7,498 unique workers.  While the average worker completed 2.2 HITs, the top 1% of most prolific workers (15+ HITs) completed 11% of the total, and the top 10% (5+ HITs) completed nearly half (41%) of the total HITs.
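As a quick check, 16,408 HITs spread over 7,498 workers gives 16,408 / 7,498 ≈ 2.2 HITs per worker, matching the average reported above. The concentration figures can be computed from a simple log with one worker ID per completed HIT. The sketch below is illustrative only: the function name and the toy data are assumptions, not the analysis code or data from the paper.

```python
from collections import Counter


def share_of_top_workers(hit_worker_ids, top_fraction):
    """Fraction of all HITs completed by the most prolific workers.

    hit_worker_ids: one worker ID per completed HIT.
    top_fraction:   e.g. 0.01 for the top 1% of workers by HIT count.
    """
    # HITs per worker, most prolific first.
    counts = sorted(Counter(hit_worker_ids).values(), reverse=True)
    # Number of workers in the top group (at least one).
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)


# Toy data (made up): one prolific worker and nine occasional workers.
toy_hits = ["W0"] * 11 + [f"W{i}" for i in range(1, 10)]

mean_hits = len(toy_hits) / len(set(toy_hits))
top_share = share_of_top_workers(toy_hits, 0.10)
print(mean_hits, top_share)
```

In the toy log the average worker completes 2 HITs, yet the single most prolific worker (the top 10%) accounts for 55% of all HITs. A low average can hide this kind of concentration.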


As mentioned, superturkers can be problematic because:

  • They are more likely to have seen standard manipulations (see figure below)
  • They are significantly more likely to read MTurk blogs/forums
  • They are significantly more likely to receive notifications each time you (as an academic requester) post new HITs


However, on the plus side, we find that:

  • They are less likely to be multitasking while on MTurk
  • They are much better at responding to and completing follow-up studies (one year later, 75% of these workers completed a follow-up), which matters if you are interested in longitudinal research.

In sum, superturkers are a mixed blessing.  MTurk has the capability for more sophisticated designs, including longitudinal studies, and superturkers are reliable enough to make this viable.  Since they are less likely to be multitasking, they may also be good participants in studies that require more attention (e.g., reaction time).

However, their non-naïveté, and worker non-naïveté in general, is a serious concern.  Researchers can and should take steps to exclude workers from subsequent studies within their lines of research (the paper provides one solution; this method is another good solution if your studies are on Qualtrics).  Ideally, they would also communicate with other researchers who do similar work, so that one’s previous participants could be excluded from the other’s study, and vice versa.  Beyond that, though, researchers should be very wary about using “classic” manipulations or measures on MTurk.  The next blog post will detail the data that supports this admonition.


