Posted by: Gabriele Paolacci | July 30, 2015

How many people can your lab reach on MTurk?

Guest post by Neil Stewart

How many people can your lab reach on MTurk? We used the capture-recapture method¹ from wildlife ecology to estimate how many workers you are sampling from. Our estimate is 7,300 workers.

Using 114,460 HITs completed from 2012 onwards we estimated, for each of our labs, how many workers we are sampling from. We then used a random-effects meta-analysis to estimate the number of workers a new lab, which could be yours, could reach. Why does this matter? Well, there is an exponential-like increase in the number of publications using this MTurk population—and we, like others, have found considerable overlap between our laboratories. And if you are planning a series of experiments or running adequately powered replications, you could run out of workers quite fast.

What can you do to increase your reach? Surprisingly, paying more doesn’t help. Our population estimate was reduced for higher paying HITs—we think because the most active workers seek out these HITs and crowd out the less active workers.  (Still, no reason not to pay a living wage to our participants!) Running larger batch sizes does help. Our larger batches sampled from a population nearly three times larger than the smaller batches. One last strategy is to wait. We estimate that it takes about 7 months for half of the workers on MTurk to leave and be replaced.

View the paper in-press in Judgment and Decision Making here.

Neil Stewart, Christoph Ungemach, Adam Harris, Dan Bartels, Ben Newell, Gabriele Paolacci, and Jesse Chandler

¹ The intuition behind the capture-recapture method is not too hard. Ecologists might, for example, use it to estimate the number of fish in a pond. Go fishing on Day 1: catch some fish, tag them, and release them. Then, on Day 2, go fishing again: catch some fish and observe the proportion that are tagged. You now know the number tagged in the pond from Day 1 and the proportion tagged in Day 2’s catch, so you can estimate the total number in the pond. If you tag five fish on Day 1 and observe that one quarter of Day 2’s catch is tagged, then there must be about 20 fish. We used WorkerIDs as tags.
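The footnote’s intuition corresponds to the classic Lincoln–Petersen estimator. A minimal sketch of that arithmetic (the paper itself fits a more sophisticated capture-recapture model; this only illustrates the footnote’s example):

```python
def lincoln_petersen(tagged_day1, caught_day2, recaptured):
    """Estimate total population size N via capture-recapture.

    tagged_day1: fish tagged and released on Day 1
    caught_day2: total fish caught on Day 2
    recaptured:  Day 2 fish that carry a Day 1 tag

    Solves tagged_day1 / N = recaptured / caught_day2 for N.
    """
    return tagged_day1 * caught_day2 / recaptured

# Five fish tagged on Day 1; a quarter of Day 2's catch of 8 is tagged:
print(lincoln_petersen(5, 8, 2))  # 20.0
```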


Posted by: Gabriele Paolacci | June 16, 2015

Using Nonnaive Participants Can Reduce Effect Sizes

Guest post by Jesse Chandler

In a new Psychological Science article we provide direct evidence that effect sizes for experimental results are reduced among participants who have previously completed an experimental paradigm. Specifically, we recruited MTurk workers who participated in the Many Labs 1 series of two-condition experiments and invited them to participate in a research study that included the exact same package of experiments. We found that effect sizes decreased the second time around, especially among those who were exposed to opposite conditions at the two different time points.

Previous studies have demonstrated that MTurk worker performance changes as workers become more experienced. For example, we have demonstrated that worker scores on the Cognitive Reflection Task (a commonly used measure of intellectual ability) are correlated with worker experience. Likewise, Dave Rand and Winter Mason have led projects that provided evidence that workers get better at economic games over time. All of these findings are new twists on older observations that attitudes of survey panel members tend to change over time (a phenomenon known as panel conditioning in the survey literature) and that people tend to improve on measures of aptitude (a phenomenon known as a practice effect within the psychometric testing literature).

Our findings illustrate that participant experience can also affect experimental results, even when dependent measures are not straightforward measures of ability. These findings are surprising (at least to us) because we have tended to assume that workers are relatively unengaged while completing HITs and complete so many tasks that any individual experiment could hardly be memorable. But apparently they are.

Fortunately there is good news. First, we see some evidence that this effect wears off over time, suggesting that people eventually forget whatever information they may have seen. Second, if all you care about is the direction and statistical significance of an effect (rather than an exact point estimate), smaller effect sizes can be offset by increased sample size. Third, there is an increasing number of tools available to prevent duplicate workers from participating in an experiment.

  • The MTurk API or GUI allows you to create qualifications that filter out unwanted workers.
  • Qualtrics can be set up to check workers against a predefined list of workers and exclude those with matching WorkerIDs. This is useful if you know which workers you want to exclude but have not worked with them before (and so cannot assign them a Qualification).
  • TurkGate will do something similar. It needs to run on a server, but it is likely easier to maintain than Qualtrics, particularly for a lab group or research team that wants to coordinate their efforts.
  • TurkPrime and UniqueTurker are newer solutions that seem easier to use for individual researchers. We have not tested them, but readers may wish to experiment with them.
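For the first option, the MTurk API route amounts to creating a Qualification, assigning it to prior participants, and then requiring that new HITs be visible only to workers who do NOT hold it. A hedged sketch of the requirement object (the field names match the MTurk QualificationRequirement structure; the qualification ID is a placeholder you would get back from `create_qualification_type`):

```python
def exclusion_requirement(qualification_type_id):
    """QualificationRequirement that hides a HIT from workers who HOLD
    the given qualification, i.e. prior participants you have flagged."""
    return {
        "QualificationTypeId": qualification_type_id,
        "Comparator": "DoesNotExist",
        # Flagged workers cannot even discover or preview the HIT:
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }

# Usage sketch with boto3 (requires AWS credentials; names are illustrative):
# client = boto3.client("mturk")
# qual = client.create_qualification_type(
#     Name="Completed study X", Description="Prior participants",
#     QualificationTypeStatus="Active")
# client.create_hit(..., QualificationRequirements=[
#     exclusion_requirement(qual["QualificationType"]["QualificationTypeId"])])
```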

In short, there are lots of ways to limit worker participation. Of course, this does raise serious concerns for experimental paradigms that are used to the point of abuse (trolley problem, I’m looking at you) and highlights that the finite size of the MTurk pool means a finite limit on the number of times a particular experimental paradigm can be run.
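As a back-of-the-envelope illustration of the trade-off between effect size and sample size mentioned above: under the usual two-sample, two-sided z approximation, the required n per group scales with 1/d², so a halved effect size roughly quadruples the sample you need. A sketch (the α, power, and d values are illustrative, not from the paper):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample test of
    standardized effect size d: n ~ 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z**2 / d**2)

print(n_per_group(0.5))   # 63 per group for a medium effect
print(n_per_group(0.25))  # 252 per group if the effect is halved
```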

Posted by: Gabriele Paolacci | May 29, 2015

MTurk workshop at EMAC

On May 27 I held a workshop at EMAC on conducting behavioral research using Amazon Mechanical Turk samples. Slides are available here.

Guest post by David J. Hauser

In this new article, Norbert Schwarz and I show in two experiments that answering an instructional manipulation check (IMC) changes the way participants approach later survey questions.

IMCs are often included in online research (and especially on MTurk) in order to assess whether participants are paying attention to instructions. However, participants can potentially see them as “trick” questions that violate conversational norms of trust. As a result, these questions may make participants more cautious when answering later questions in an effort to avoid being tricked again.

Two studies provided support for this hypothesis. In one study, participants received an IMC and the Cognitive Reflection Test (Frederick, 2005), a math test assessing the tendency to reflect and correct intuitive answers. Crucially, half of the participants completed the IMC before the CRT, whereas the other half completed the math test first. Completing the IMC first increased CRT scores (vs when the CRT came first), suggesting it increased systematic thinking.

In a second study, participants received an IMC and a probabilistic reasoning task assessing rational decision making (Toplak, West, & Stanovich, 2011). Like before, half of the participants completed the IMC before the reasoning task, whereas the other half completed the reasoning task first. Completing the IMC first increased accuracy on the reasoning task (compared to completing the reasoning task first). Thus, answering an IMC teaches participants that there may be more than meets the eye to later questions, a conclusion that significantly alters participants’ reasoning strategies.

IMCs are typically conceptualized as measures of attention, not interventions. However, as demonstrated here, they can act as both. One should therefore exercise caution in IMC use.


Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19, 25-42.

Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition, 39, 1275-1289.

This document guides you through a simple method to avoid recruiting MTurk workers for your studies who already participated in a certain study of yours. The core of the procedure relies on Excel (as opposed to the MTurk Command Line Tools or the MTurk API) to assign a Qualification to multiple workers at the same time. Using this procedure will allow you to exclude from recruitment workers who participated in a previous related study (e.g., a study you are now replicating), and it can serve other goals as well (e.g., executing longitudinal research, building your own panel).
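The batch-assignment step this document performs in Excel can also be scripted. A sketch, assuming a standard MTurk batch results CSV with a `WorkerId` column (the boto3 calls in the comments are indicative only; `QUAL_ID` would be the ID of a Qualification you created):

```python
import csv
import io

def build_assignments(worker_csv_text, qualification_type_id):
    """Build one qualification-assignment request per unique WorkerID.

    worker_csv_text: contents of a batch results file with a WorkerId column.
    """
    rows = csv.DictReader(io.StringIO(worker_csv_text))
    seen = set()
    requests = []
    for row in rows:
        wid = row["WorkerId"]
        if wid in seen:  # the same worker may appear in several HITs
            continue
        seen.add(wid)
        requests.append({
            "QualificationTypeId": qualification_type_id,
            "WorkerId": wid,
            "IntegerValue": 1,
            "SendNotification": False,  # don't email workers about the flag
        })
    return requests

# Usage sketch with boto3 (requires AWS credentials):
# client = boto3.client("mturk")
# for req in build_assignments(open("batch_results.csv").read(), QUAL_ID):
#     client.associate_qualification_with_worker(**req)
```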

Update June 10, 2015: Arnoud Plantinga developed an R script based on this method in which you don’t have to create the new variables yourself. You can download the script here.

Posted by: Gabriele Paolacci | October 14, 2014

MTurk Workshop at ACR

The use of MTurk by behavioral researchers continues to increase. Despite the evidence on the benefits (and drawbacks) of MTurk, many researchers, reviewers, and editors intuitively distrust the reliability and validity of online labor markets.

On October 25, we will host a workshop at ACR called “Questioning the Turk: Conducting High Quality Research with Amazon Mechanical Turk”. We will answer and debate questions from the ACR community regarding MTurk, and raise some new questions. We will discuss the current issues that arise from MTurk’s use, as well as some of the solutions and replications. Questions can be submitted using the hashtag #mturkacr via Twitter (@aconsres, @joekgoodman, @gpaolacci) or Facebook (ACR page), as a comment to this post, or via email to the organizers, Joseph Goodman and Gabriele Paolacci.

The North American ACR conference will take place on October 24-26 at the Hilton Baltimore in Baltimore, MD. The MTurk workshop will take place on Saturday, October 25, 2pm in the room Key 5.

Update: Thanks to all participants for contributing to a fruitful discussion! The slides we used in the workshop can be found here. Joe & Gabriele.

Posted by: Gabriele Paolacci | July 10, 2014

Review of MTurk as a Participant Pool

We recently published in Current Directions in Psychological Science a review of MTurk as a source of survey and experimental data. We discuss the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors. The Psych Report published a nice summary of the paper, which you can find here.


Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184-188.

Posted by: Gabriele Paolacci | April 10, 2014

2nd Workshop on Crowdsourcing and Online Behavioral Experiments

The Annual Workshop on Crowdsourcing and Online Behavioral Experiments (COBE) seeks to bring together researchers and academics to present their latest online behavioral experiments (e.g., on MTurk) and share new results, methods, and best practices. See below the details of the workshop.

– Workshop Date: June 8 or 9, 2014 (TBD)

– Location: Stanford University, Palo Alto, California, before the 15th ACM Conference on Electronic Commerce.

– Call for papers:

Posted by: Gabriele Paolacci | March 20, 2014

Quick survey about software of online data collection

Guest post by Todd Gureckis

Dear Experimental Turk readers,

My lab has been developing some open source software to simplify the process of running online experiments (e.g., using MTurk). We are interested in getting feedback from a wide swath of researchers about what types of features would be useful in this software.

If you have an interest (or experience) running online experiments and have a moment, please fill out this quick survey. Your responses might help us gear our software development effort in a way that maximizes its utility for the overall community.

We will share the results of the survey as a comment to this post in a few weeks.


Todd Gureckis

Posted by: Gabriele Paolacci | December 11, 2013

Reputation as a Sufficient Condition for High Data Quality on MTurk

Guest post by Eyal Peer

Many researchers who use MTurk are concerned about the quality of data they can get from their MTurk workers. Can we trust our “Turkers” to read instructions carefully, answer all the questions candidly, and follow our procedures prudently? Moreover, are there specific types of workers we should be targeting to ensure high quality data? I stumbled upon this question when I tried to run an experiment about cheating on MTurk. I gave Turkers the opportunity to cheat by over-reporting lucky coin tosses, and found very low levels of cheating. I was surprised, especially in light of all the research on how dishonest people can be when given a chance. I then realized that, as I usually do, I had sampled only “high reputation” Turkers (those who had more than 95% of previous HITs approved). I was more surprised to see that when I removed this restriction and allowed any Turker to take part in my study, cheating went up considerably. This led us (Joachim Vosgerau, Alessandro Acquisti, and I) to ask whether high reputation workers are indeed less dishonest and, presumably, produce higher quality data. If so, then perhaps sampling only high reputation workers is sufficient to obtain high quality data. We found that to be true. In a paper forthcoming in Behavior Research Methods we report how high reputation workers produced very high quality data, even when no efforts to increase their attention were made.

Until then, my colleagues and I had mostly been relying on attention-check questions (henceforth, ACQs) to ensure high quality data. These are “trick” questions that test respondents’ attention by prescribing a specific response to those who read the questions or the preceding instructions (e.g., “have you ever had a fatal heart attack?”; see Paolacci, Chandler, & Ipeirotis, 2010). Using ACQs has inherent disadvantages. First, including ACQs might disrupt the natural flow of our survey and might even (in extreme cases) interfere with our manipulation. Second, attentive Turkers might be offended by such “trick” questions and react unfavorably. Lastly, and perhaps most importantly, even when failing ACQs can be considered a reliable signal that the participant has not paid attention in the rest of the survey, a sampling bias might be created if those who fail ACQs are excluded post hoc from the sample. On the other hand, not using ACQs and sampling only high reputation workers can also restrict the sample’s size, reduce the response rate, and might create a response bias. So, we compared these two methods for ensuring high quality data: using ACQs vs. sampling only high reputation workers (e.g., those with 95% or more approved HITs).

In the first study, we found that high reputation workers did provide higher quality data on all measures: they failed common ACQs very rarely, their questionnaires’ reliability was high, and they replicated known effects such as the anchoring effect. This was true whether or not these workers received ACQs. Namely, high reputation workers who did not receive any ACQ showed similarly high quality data compared to those who did receive (and pass) our ACQs. Low reputation workers, on the other hand, failed ACQs more often and their data quality was much lower. However, ACQs did make a difference in that group: those who passed the ACQs provided higher quality data, sometimes very similar to the quality of data obtained from high reputation workers. So, it seemed that ACQs were not necessary for high reputation workers. Moreover, sampling only high reputation workers did not reduce the response rate (in fact, it increased it), and we couldn’t find any evidence of a sampling bias, as their demographics were similar to those of low reputation workers. However, we used pretty common ACQs (such as the one mentioned above), and it was very likely that high reputation workers – being highly active on MTurk – recognized those questions, which made the ACQs less effective for them. We thus ran another study with novel ACQs, and found similar results: high reputation workers produced high quality data with or without ACQs, even when the ACQs were not familiar to them. We also found that, among the high reputation workers, those who had been more productive on MTurk (completed more HITs in the past) produced slightly higher quality data than less productive workers. But even the less productive high-reputation workers produced very high quality data, with or without ACQs.

Our final conclusion was that reputation is a sufficient condition for obtaining high quality data on MTurk. Although using ACQs doesn’t seem to hurt much (we did not find any evidence of ACQs causing reactance), they also seem to be unnecessary when using high reputation workers. Because sampling only high reputation workers did not reduce response rate or create a selection bias, and also since most researchers we know use only those workers anyway, it seems to be the preferable method to ensure high quality data. ACQs might have been necessary in the past, when completing behavioral surveys and experiments was not very common on MTurk, but their efficacy seems to have diminished recently. As MTurk is a highly dynamic environment, this assertion might not hold true in the future, as new and different people join MTurk. Nevertheless, we believe that if one has to choose between using ACQs or not, it’s better to not use them and, instead, simply sample high reputation workers to get high quality data. Personally, I have stopped using ACQs altogether and am still getting reliable and high quality data in all my studies on MTurk.
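For readers sampling via the API, the ≥95% approval filter discussed above corresponds to MTurk’s built-in PercentAssignmentsApproved system qualification (ID `000000000000000000L0`). A minimal sketch of the requirement you would attach when creating a HIT (threshold is a parameter; the boto3 usage in the comment is indicative only):

```python
# MTurk's built-in system qualification for a worker's HIT approval rate:
APPROVAL_RATE_QUAL = "000000000000000000L0"

def reputation_requirement(min_percent=95):
    """QualificationRequirement restricting a HIT to workers whose
    approval rate is at least min_percent."""
    return {
        "QualificationTypeId": APPROVAL_RATE_QUAL,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [min_percent],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }

# Usage sketch with boto3 (requires AWS credentials):
# client = boto3.client("mturk")
# client.create_hit(..., QualificationRequirements=[reputation_requirement(95)])
```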


Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.

Peer, E., Vosgerau, J., & Acquisti, A. (in press). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods. Available here.
