Posted by: Gabriele Paolacci | April 10, 2014

2nd Workshop on Crowdsourcing and Online Behavioral Experiments

The Annual Workshop on Crowdsourcing and Online Behavioral Experiments (COBE) seeks to bring together researchers and academics to present their latest online behavioral experiments (e.g., on MTurk) and share new results, methods, and best practices. See below for the details of the workshop.

– Workshop Date: June 8 or 9, 2014 (TBD)

– Location: Stanford University, Palo Alto, California, before the 15th ACM Conference on Electronic Commerce.

– Call for papers:

Posted by: Gabriele Paolacci | March 20, 2014

Quick survey about software for online data collection

Guest post by Todd Gureckis

Dear Experimental Turk readers,

My lab has been developing some open source software to simplify the process of running online experiments (e.g., using MTurk). We are interested in getting feedback from a wide swath of researchers about what types of features would be useful in this software.

If you have an interest in (or experience with) running online experiments and have a moment, please fill out this quick survey. Your responses might help us gear our software development effort toward maximizing its utility for the overall community.

We will share the results of the survey as a comment to this post in a few weeks.


Todd Gureckis

Posted by: Gabriele Paolacci | December 11, 2013

Reputation as a Sufficient Condition for High Data Quality on MTurk

Guest post by Eyal Peer

Many researchers who use MTurk are concerned about the quality of data they can get from their MTurk workers. Can we trust our “Turkers” to read instructions carefully, answer all the questions candidly, and follow our procedures prudently? Moreover, are there specific types of workers we should be targeting to ensure high-quality data? I stumbled upon this question when I tried to run an experiment about cheating on MTurk. I gave Turkers the opportunity to cheat by over-reporting lucky coin tosses, and found very low levels of cheating. I was surprised, especially in light of all the research on how dishonest people can be when given a chance. I then realized that, as I usually do, I had sampled only “high-reputation” Turkers (those who had more than 95% of their previous HITs approved). I was more surprised to see that when I removed this restriction and allowed any Turker to take part in my study, cheating went up considerably. This led us (Joachim Vosgerau, Alessandro Acquisti, and me) to ask whether high-reputation workers are indeed less dishonest and, presumably, produce higher-quality data. If so, then perhaps sampling only high-reputation workers is sufficient to obtain high-quality data. We found that to be true. In a paper forthcoming in Behavior Research Methods, we report how high-reputation workers produced very high-quality data, even when no efforts were made to increase their attention.
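
For readers curious about the logic, over-reporting in a coin-toss paradigm like this can be detected at the aggregate level with a simple normal-approximation binomial test. This is an illustrative sketch with made-up numbers, not the paper's actual analysis:

```python
from math import erf, sqrt

def overreporting_z(reported_lucky, n_tosses, p=0.5):
    """Normal-approximation z statistic for reported 'lucky' outcomes
    exceeding the chance rate p (one-sided)."""
    expected = n_tosses * p
    sd = sqrt(n_tosses * p * (1 - p))
    return (reported_lucky - expected) / sd

def one_sided_p(z):
    """Upper-tail p-value from the standard normal CDF."""
    return 0.5 * (1 - erf(z / sqrt(2)))

# Hypothetical aggregate: 1,000 workers report 5 tosses each;
# 2,700 'lucky' tosses reported where chance predicts 2,500.
z = overreporting_z(2700, 5000)
print(round(z, 2))  # z ≈ 5.66 → strong evidence of aggregate over-reporting
```

Note that this only detects cheating in the aggregate: no individual reporting, say, 5 lucky tosses out of 5 can be proven dishonest.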

Until then, my colleagues and I had mostly been relying on attention-check questions (henceforth, ACQs) to ensure high data quality. These are “trick” questions that test respondents’ attention by prescribing a specific response to those who read the question or the preceding instructions (e.g., “Have you ever had a fatal heart attack?”; see Paolacci, Chandler, & Ipeirotis, 2010). Using ACQs has inherent disadvantages. First, including ACQs might disrupt the natural flow of our survey and might even (in extreme cases) interfere with our manipulation. Second, attentive Turkers might be offended by such “trick” questions and react unfavorably. Lastly, and perhaps most importantly, even when failing ACQs can be considered a reliable signal that the participant has not paid attention in the rest of the survey, a sampling bias might be created if those who fail ACQs are excluded post hoc from the sample. On the other hand, not using ACQs and sampling only high-reputation workers can also restrict the sample’s size, reduce the response rate, and might create a response bias. So, we compared these two methods of ensuring high data quality: using ACQs vs. sampling only high-reputation workers (e.g., those with 95% or more approved HITs).

In the first study, we found that high-reputation workers did provide higher-quality data on all measures: they failed common ACQs very rarely, their questionnaires’ reliability was high, and they replicated known effects such as the anchoring effect. This was true whether or not these workers received ACQs. Namely, high-reputation workers who did not receive any ACQ provided data of similarly high quality to those who did receive (and pass) our ACQs. Low-reputation workers, on the other hand, failed ACQs more often, and their data quality was much lower. However, ACQs did make a difference in that group: those who passed the ACQs provided higher-quality data, sometimes very similar to the quality obtained from high-reputation workers. So, it seemed that ACQs were not necessary for high-reputation workers. Moreover, sampling only high-reputation workers did not reduce the response rate (in fact, it increased it), and we could not find any evidence of a sampling bias, as their demographics were similar to those of low-reputation workers. However, we used pretty common ACQs (such as the one mentioned above), and high-reputation workers – being highly active on MTurk – probably recognized those questions, which made them less effective. We thus ran another study with novel ACQs and found similar results: high-reputation workers produced high-quality data with or without ACQs, even when the ACQs were not familiar to them. We also found that, among the high-reputation workers, those who had been more productive on MTurk (completed more HITs in the past) produced slightly higher-quality data than less productive workers. But even the less productive high-reputation workers produced very high-quality data, with or without ACQs.

Our final conclusion was that reputation is a sufficient condition for obtaining high-quality data on MTurk. Although ACQs do not seem to hurt much (we did not find any evidence of ACQs causing reactance), they also seem to be unnecessary when using high-reputation workers. Because sampling only high-reputation workers did not reduce the response rate or create a selection bias, and because most researchers we know use only those workers anyway, it seems to be the preferable method of ensuring high data quality. ACQs might have been necessary in the past, when completing behavioral surveys and experiments was not very common on MTurk, but their efficacy seems to have diminished recently. As MTurk is a highly dynamic environment, this assertion might not hold true in the future, as new and different people join MTurk. Nevertheless, we believe that if one has to choose between using ACQs or not, it is better not to use them and, instead, simply sample high-reputation workers to get high-quality data. Personally, I have stopped using ACQs altogether and am still getting reliable, high-quality data in all my studies on MTurk.
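
For requesters who want to implement this recommendation programmatically, restricting a HIT to high-reputation workers amounts to a single qualification requirement on the HIT. The sketch below uses Python with boto3; the HIT parameters are hypothetical, and the qualification ID is the built-in approval-rate qualification from the MTurk API documentation:

```python
# System qualification ID for "Worker_PercentAssignmentsApproved"
# (a built-in MTurk qualification; ID taken from the MTurk API docs).
PERCENT_APPROVED = "000000000000000000L0"

def high_reputation_requirement(threshold=95):
    """Qualification requirement admitting only workers whose past
    approval rate is at least `threshold` percent."""
    return [{
        "QualificationTypeId": PERCENT_APPROVED,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [threshold],
    }]

def external_question(url, frame_height=600):
    """ExternalQuestion XML wrapper for an externally hosted survey."""
    return (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{url}</ExternalURL>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        "</ExternalQuestion>"
    )

def post_survey_hit(survey_url, title="Academic survey"):
    """Post the HIT (requires configured AWS requester credentials)."""
    import boto3  # AWS SDK for Python
    client = boto3.client("mturk", region_name="us-east-1")
    return client.create_hit(
        Title=title,
        Description="A short academic survey",
        Keywords="survey, research",
        Reward="0.50",
        MaxAssignments=100,
        AssignmentDurationInSeconds=3600,
        LifetimeInSeconds=86400,
        Question=external_question(survey_url),
        QualificationRequirements=high_reputation_requirement(),
    )
```

This reproduces, in API form, the ">95% approved" sampling restriction discussed in the post; thresholds and payment are illustrative choices, not recommendations from the paper.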


Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.

Peer, E., Vosgerau, J., & Acquisti, A. (in press). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods.

Posted by: Gabriele Paolacci | October 2, 2013

MTurk Roundtable at ACR

The North American Conference of the Association for Consumer Research is taking place on October 3-6 at the Palmer House Hilton in Chicago, IL. On October 4, the conference will host a roundtable about MTurk. Many issues will be discussed, ranging from practical decisions that researchers need to make when conducting studies (e.g., how much to compensate participants) to priorities for future research about crowdsourcing social science. The roundtable will take place at 11:00 am in the Indiana room.

Posted by: Gabriele Paolacci | August 5, 2013

Consequences of Worker Nonnaïvete: The Cognitive Reflection Test

In a previous post, we documented the existence of MTurk workers who are disproportionately likely to show up in academic studies, potentially leading to foreknowledge of experimental procedures. Here we report a study that illustrates the challenges that such foreknowledge poses for MTurk data validity.

The Cognitive Reflection Test (CRT; Frederick, 2005) is typically used to measure a stable individual difference in cognitive orientation. It consists of three questions, each of which elicits an intuitive response that can be recognized as wrong with some additional thought. As a result, the number of correct answers to the CRT can serve as a parsimonious measure of the individual’s tendency to make reflective decisions. Foreknowledge – either as a result of previous exposure or through information shared by others who have completed the task – is problematic for the CRT because it increases the likelihood that the individual has discovered the correct response and can provide it without reflection, or at a minimum is aware that there is a “trick” that necessitates that the question receive additional scrutiny.

The CRT appears frequently on MTurk; therefore, workers who spend more time on MTurk should be more likely to provide the correct answers. We recruited one hundred workers who varied in their (known) prior experience on MTurk. Participants completed a study that included, among other measures, the original version of the CRT (Frederick, 2005). We found that workers known to have completed more research studies on MTurk answered more CRT questions correctly, suggesting that their performance was in fact improving with experience.

One alternative explanation for this finding is that more productive workers differ in some meaningful way from less productive workers. For example, they could be more reflective or conscientious. To rule this out, we asked the same workers to complete a “novel” version of the CRT (from Finucane & Gullion, 2010) prior to completing the “original” version. The questions posed by the two tests are logically identical and differ only in their familiarity to workers. As expected, performance on the original and novel tests was highly correlated. However, whereas prior experience significantly predicted performance on the original CRT, it did not predict performance on the novel CRT. This suggests that the results were not caused by a fundamental difference in the cognitive style of more experienced workers.

This study illustrates that it is problematic to use measures that are familiar to MTurk workers while assuming that participants are naïve. By collecting the CRT among non-naïve participants, researchers might draw false conclusions about their levels of cognitive reflection and about the relationship between CRT performance and any variable that correlates with worker experience. Moreover, non-naïveté introduces another source of error that might obscure the relationship between cognitive reflection and other constructs of interest.

See the full paper for more details about the study and a broader discussion of the challenges connected to worker non-naïveté.


Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112-130.

Finucane, M. L., & Gullion, C. M. (2010). Developing a tool for measuring the decision-making competence of older adults. Psychology and Aging, 25(2), 271.

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25–42.

Posted by: Gabriele Paolacci | July 11, 2013

How naïve are MTurk workers?

More and more social scientists have adopted MTurk as a venue for their research, praising its speed, cost, and diversity relative to undergraduate samples. However, many of them may fail to take into account some other critical aspects that differentiate MTurk samples from undergraduate subject pool samples.

In a paper just published in Behavior Research Methods, we find worker non-naïveté to be a serious concern. One general issue is that MTurk workers share information about HITs with each other publicly and searchably on various forums, including two different subreddits (see here and here for some collected examples of manipulation checks and common measures that have become common knowledge among workers via forums).

More specifically, while the probability that any given worker has seen a given manipulation may be low, there is a population of “superturkers” – extremely prolific workers – who are significantly more likely to end up in your studies. We pooled 16,408 HITs from 132 unique studies and found that they were completed by 7,498 unique workers. While the average worker completed 2.2 HITs, the top 1% most prolific workers (15+ HITs) completed 11% of the total, and the top 10% (5+ HITs) completed nearly half (41%) of the total HITs.
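
A toy version of this concentration calculation, with made-up worker IDs standing in for the real pooled dataset:

```python
from collections import Counter

def top_share(worker_ids, fraction):
    """Share of all HIT completions accounted for by the most prolific
    `fraction` of unique workers (e.g., 0.01 for the top 1%)."""
    counts = sorted(Counter(worker_ids).values(), reverse=True)
    k = max(1, int(len(counts) * fraction))  # at least one worker
    return sum(counts[:k]) / len(worker_ids)

# Toy completion log: worker "w1" is a "superturker" with 8 HITs,
# six other workers completed 2 HITs each (20 HITs total).
log = ["w1"] * 8 + ["w2", "w3", "w4", "w5", "w6", "w7"] * 2
print(round(top_share(log, 0.10), 2))  # → 0.4
```

Run on the real log of 16,408 HITs, the same two lines yield the 11% and 41% figures reported above.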


As mentioned, superturkers can be problematic because:

  • They are more likely to have seen standard manipulations (see figure below)
  • They are significantly more likely to read MTurk blogs/forums
  • They are significantly more likely to receive notifications each time you (as an academic requester) post new HITs


However, on the plus side, we find that:

  • They are less likely to be multitasking while on MTurk
  • They are much better at responding to and completing follow-up studies (one year later, 75% of these workers completed a follow-up), which matters if you are interested in longitudinal research.

In sum, superturkers are a mixed blessing.  MTurk has the capability for more sophisticated designs, including longitudinal studies, and superturkers are reliable enough to make this viable.  Since they are less likely to be multitasking, they may also be good participants in studies that require more attention (e.g., reaction time).

However, their non-naïveté, and worker non-naïveté in general, is a serious concern.  Researchers can and should take steps to exclude workers from subsequent studies within their lines of research (the paper provides one solution; this method is another good solution if your studies are on Qualtrics).  Ideally, they would also communicate with other researchers who do similar work, so that one’s previous participants could be excluded from the other’s study, and vice versa.  Beyond that, though, researchers should be very wary about using “classic” manipulations or measures on MTurk.  The next blog post will detail the data that supports this admonition.
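
One common way to implement such exclusions (a general pattern, not necessarily the specific solution described in the paper) is to create a custom "past participant" qualification, assign it to everyone who completed an earlier study, and require its absence on new HITs. A sketch using the boto3 MTurk client, with hypothetical qualification IDs:

```python
def exclusion_requirement(past_participant_qual_id):
    """Admit only workers who do NOT hold the 'past participant'
    qualification (a custom QualificationType you create once)."""
    return [{
        "QualificationTypeId": past_participant_qual_id,
        "Comparator": "DoesNotExist",
        # Hide the HIT from excluded workers entirely:
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }]

def mark_past_participants(client, past_participant_qual_id, worker_ids):
    """Tag workers who completed an earlier study so that future HITs
    in the same line of research screen them out.

    `client` is a boto3 MTurk client; `worker_ids` come from the
    assignment results of the earlier study."""
    for worker_id in worker_ids:
        client.associate_qualification_with_worker(
            QualificationTypeId=past_participant_qual_id,
            WorkerId=worker_id,
            IntegerValue=1,
            SendNotification=False,  # avoid alerting workers to the tag
        )
```

Collaborating labs can share a single qualification type, which achieves the mutual exclusion between researchers suggested above.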


Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté Among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers. Behavior Research Methods, 46(1), 112-130.

Posted by: Gabriele Paolacci | April 22, 2013

TurkGate: Grouping and Access Tools for External Surveys
Guest post by Gideon Goldin and Adam Darlow

As MTurk was not designed for psychological research, it cannot be expected to provide the experimental control that psychologists typically exercise when recruiting participants for laboratory studies. In particular, MTurk lacks:

(I) The ability to exclude participants that have already participated in related studies.

(II) The ability to prevent study previews.

(III) The ability to verify participants’ completion of a study.

TurkGate, or Grouping and Access Tools for External surveys (for use with Amazon Mechanical Turk), gives researchers an easy-to-use web application for providing such control when using MTurk with externally hosted studies (e.g., Qualtrics surveys).

TurkGate groups related HITs together, such that participants may only access one HIT per group. These HITs only link to surveys after workers accept them, and only if they have not already accessed related HITs. Workers attempting to preview HITs receive information about the HIT, but no link. Once the HIT is accepted, the worker’s ID is tested against a database to verify eligibility before the worker is granted access to the study. TurkGate also generates completion codes that can be automatically verified.
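
The core gating logic – one HIT per worker per group – is simple. The sketch below is a minimal in-memory stand-in for TurkGate's server-side database check (illustrative only; the real tool backs this with MySQL behind a web server):

```python
class Gate:
    """Toy TurkGate-style access gate: each worker may be admitted to
    at most one HIT per group of related studies."""

    def __init__(self):
        self.seen = set()  # (group, worker_id) pairs already admitted

    def request_access(self, group, worker_id):
        """Return True and record the visit if the worker has not yet
        accessed any HIT in this group; otherwise return False."""
        key = (group, worker_id)
        if key in self.seen:
            return False  # already took a related HIT
        self.seen.add(key)
        return True

gate = Gate()
gate.request_access("anchoring-study", "W123")   # first visit: admitted
gate.request_access("anchoring-study", "W123")   # second visit: refused
gate.request_access("framing-study", "W123")     # different group: admitted
```

Making two studies mutually exclusive is then just a matter of giving them the same group name, as described below.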

Pe’er et al. (2012) have already created a relatively simple method to resolve some of these issues, which we refer to as the Qualtrics method since it is based on functionality inherent to Qualtrics. Like TurkGate, their method supports hosting studies on a variety of sites, as users can adopt (the free version of) Qualtrics for the purpose of screening participants before redirecting them elsewhere. In addition, their method also uses a script to prevent study previews. Researchers who use the Qualtrics method – and who also track response IDs – can address all of the MTurk limitations described above without needing to configure or maintain any software or hardware.

In contrast, TurkGate needs to be downloaded and installed on a web server (e.g., Apache HTTP Server) with a database management system (e.g., MySQL). As such, the security and reliability of TurkGate will depend on the system it is installed on. If the server goes down, TurkGate goes down with it–an unlikely occurrence with a professional service like Qualtrics. And although TurkGate is designed to work out-of-the-box, it may require some administration, such as updating versions or setting up database backups. These requirements suggest that TurkGate is best suited for an entire laboratory or department of researchers, where a single computer-savvy individual or IT professional can maintain it.

In return for this investment, however, TurkGate offers a streamlined workflow with several distinct advantages:

(i) TurkGate manages related studies with minimal overhead. Researchers using either TurkGate or the Qualtrics method add a script to their Web Interface HITs that checks workers’ IDs against a list of restricted IDs. The difference is that the Qualtrics method maintains the list of IDs within each survey, whereas TurkGate maintains a global list of IDs for all surveys, separated by group. A centralized database has several advantages, especially when researchers collaborate. Namely, there is no need for researchers to store, share, or update lists, because all surveys use the same list and multiple researchers can use the same TurkGate installation. To make any pair of studies mutually exclusive, researchers simply assign them the same group name. Another benefit is that researchers can run their studies simultaneously, since TurkGate’s list is updated automatically and in real time. Obviating the need to manage multiple lists of IDs makes creating surveys faster. A researcher simply submits a URL and group name for their survey into TurkGate to get the aforementioned script. They are then ready to create their HIT and run their survey. It is this highly optimized workflow that represents the original raison d’être of TurkGate.

(ii) TurkGate disables HIT previews while preventing unnecessary HIT returns. The Qualtrics method, like TurkGate, prevents HIT previews by sending workers to an intermediary page prior to the actual study. However, explicit care was taken in developing TurkGate to prevent workers from ever needing to return HITs for which they are not eligible. This prevents the artificial and undesirable inflation of workers’ return rates. Instead, workers are provided with a group name, and if they recall having participated in the group, they know not to accept the HIT. If they do not recall participating, they can simply verify their eligibility by submitting their worker ID.

(iii) TurkGate offers intuitive, verifiable, and anonymous completion codes. TurkGate’s completion codes were crafted to support a number of features. First, the codes themselves contain useful information in human-readable form, including the MTurk worker ID, group name, survey identifier, and an optional Qualtrics or LimeSurvey record ID (researchers can augment their codes with any number of additional key-value pairs). Critically, each code also contains an encrypted segment used to prevent fake codes. After running a batch of HITs, researchers simply copy and paste their MTurk results file into TurkGate, which then instantly flags invalid records and duplicates. For experimenters (or IRBs) concerned about anonymity, TurkGate can also verify participation without using response IDs (which can be coupled with study data). Those using Qualtrics often use response IDs as completion codes, but this requires manual verification and precludes complete anonymity.
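
The verifiable-code idea can be sketched with a keyed hash: the code stays human-readable, but a short MAC segment makes fabricated codes detectable. This illustration uses an HMAC and a simplified field layout; it is not TurkGate's actual scheme, and the secret key is hypothetical:

```python
import hashlib
import hmac

SECRET = b"per-installation-secret"  # hypothetical; each deployment uses its own

def make_code(worker_id, group, survey_id):
    """Completion code: human-readable fields plus a short MAC segment
    so fabricated codes can be flagged automatically."""
    payload = f"{worker_id}:{group}:{survey_id}"
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:10]
    return f"{payload}:{mac}"

def verify_code(code):
    """True only if the MAC segment matches the payload."""
    payload, _, mac = code.rpartition(":")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:10]
    return hmac.compare_digest(mac, expected)
```

Verification of a whole results file then reduces to calling verify_code on each submitted code and flagging mismatches and duplicate payloads.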

For researchers in psychology laboratories or departments that already have access to IT support capable of configuring and maintaining a web server and database, TurkGate will likely serve as a convenient and long-term solution. However, the Qualtrics method is better suited for researchers who are uninterested in the overhead of deploying a separate system, especially if they already use Qualtrics.

TurkGate is an actively developed, open-source project that users are free to download (and modify) via GitHub. It is used in multiple laboratories and continues to evolve based on the feedback and contributions of its users. Learn more about TurkGate here.

(Suggested citation: Goldin, G., & Darlow, A. (2013). TurkGate (Version 0.4.0) [Software]. Available from GitHub.)


Pe’er, E., Paolacci, G., Chandler, J., & Mueller, P. (2012). Selectively recruiting participants from Amazon Mechanical Turk using Qualtrics. Available at SSRN.

Posted by: Gabriele Paolacci | March 13, 2013

Using MTurk to Study Clinical Populations

Guest Post by Danielle Shapiro and Jesse Chandler

Relative to the behavioral sciences, the clinical sciences have been slow to adopt MTurk as a recruitment tool. This is unfortunate, because a major obstacle in clinical research is locating individuals who score in the extremes of clinically relevant variables – individuals who by definition make up a minority of the population and are thus hard to find in large numbers.

We investigated the use of MTurk as a recruitment tool for populations of interest to clinical scientists. In line with numerous previous studies of MTurk, we found that data quality was high. Scale items used to measure underlying psychological constructs held together well. More importantly, the relationship between self-reported demographic information, life experiences, and psychological constructs was largely consistent with prior research, e.g., unemployment predicts depression, women report more anxiety, and men drink more alcoholic beverages. In general, workers looked a lot like the US population as a whole, except they reported surprisingly high levels of social anxiety.

We also learned a few things that may be of interest to researchers in other fields. First, in line with previous results, we found that workers are basically honest about personal details when payment is not contingent on their responses. We asked workers to report demographic information at two different time points more than a week apart, and for workers from US IP addresses, virtually all of them reported the same information both times.

Second, workers may be less honest when details relevant to payment are concerned. For example, a surprising number (around 6%) of workers who claimed US residence in fact came from IP addresses assigned to Eastern Europe and India. This is probably because US based workers are paid in cash rather than Amazon credit. Similarly, we measured malingering – the tendency to report symptoms that seem clinically relevant but are in fact rarely reported in clinical populations. We found a substantial portion of the population (around 10%) reported unusually high levels (>3 SD above the norm) of malingering. One interpretation of this finding is that workers infer the purpose of a survey and try to provide information that is relevant to what the requester wants. This interpretation is in line with earlier research that shows a higher level of social desirability bias among MTurk workers than among other populations (Behrend, Sharek, Meade & Wiebe, 2011).
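
The malingering screen described above – flagging respondents more than 3 SDs above the published norm – is straightforward to express in code. The numbers here are toys; actual scale norms come from the clinical literature:

```python
def flag_malingering(scores, norm_mean, norm_sd, cutoff=3.0):
    """Return the indices of respondents scoring more than `cutoff`
    standard deviations above the scale norm (hypothetical norms)."""
    return [i for i, s in enumerate(scores)
            if (s - norm_mean) / norm_sd > cutoff]

# Toy data: norm mean 10, SD 2 → flagging threshold is a score above 16.
print(flag_malingering([9, 11, 17, 10, 22], 10, 2))  # → [2, 4]
```

On real data, the proportion of flagged indices is the "around 10%" figure reported above; whether those respondents are malingering or second-guessing the requester's intent is, as noted, a matter of interpretation.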

Moreover, we learned that a substantial proportion of workers are unemployed or underemployed – far more than the US national average. The fact that these workers are willing to work for very low wages should not be construed as satisfaction with current payment rates.

You can find our full report here.


Behrend, T. S., Sharek, D. J., Meade, A. W., & Wiebe, E. N. (2011). The viability of crowdsourcing for survey research. Behavior Research Methods, 43, 800-813.

Shapiro, D.N., Chandler, J., & Mueller, P. (in press). Using Mechanical Turk to Study Clinical Populations. Clinical Psychological Science.

Posted by: Gabriele Paolacci | January 21, 2013

Conference on Experiments with Crowd Sourced Subjects

The Nuffield Centre for Experimental Social Sciences at the University of Oxford (UK) has organized a one-day Conference on Experiments with Crowd Sourced Subjects for February 14, 2013. The conference will introduce the development and use of crowd-sourced experiments to researchers who are interested in conducting such experiments in their fields, with a special focus on AMT. See here for the conference program and registration details.

Posted by: Gabriele Paolacci | October 9, 2012

Slides from ACR 2012

On October 5, we held a special session at the Association for Consumer Research North American Conference called “Inside the Turk: Methodological Concerns and Solutions in Mechanical Turk Experimentation.” Below you can find the presenters’ slides (click on the title of the talks).

Data Collection in a Flat World: Strengths and Weaknesses of Mechanical Turk Samples (Joseph Goodman)
We compare Mechanical Turk participants to community and student samples on personality, financial, and consumption dimensions, as well as classic decision-making biases. We find many similarities between Mechanical Turk participants and traditional samples, but also find important differences researchers should consider when using Mechanical Turk for consumer research.

Screening Participants on Mechanical Turk: Techniques and Justifications (Emily Peel)
Concerns about the quality of Mechanical Turk participants induce researchers to screen participants. We evaluate screening strategies according to their ability to identify observations that contribute only noise. Our results suggest that omitting participants based on these indicators would likely bias the sample rather than improve data quality.

Under the Radar: Determinants of Honesty in an Online Labor Market (Dan Goldstein)
Online subject pools depend on participants’ honesty. After establishing a baseline level of dishonesty on Mechanical Turk, we manipulate the incentives to cheat and the probability of detection. We find workers act like intuitive statisticians, cheating at a level below statistical detection at the individual, but not aggregate, level.

Non-Naivety among Experimental Participants on Amazon Mechanical Turk (Gabriele Paolacci, including introduction to the session)
We conducted two studies to identify the extent to which participant cross-talk and duplicate participation contribute to non-naivety among participants in Mechanical Turk. Whereas cross-talk is not a critical issue, there is evidence of numerous duplicate participants. We discuss the implications for Mechanical Turk experimentation.





