This document guides you through a simple method to avoid recruiting MTurk workers who have already participated in a particular study of yours. The core of the procedure relies on Excel (as opposed to CLT or the MTurk API) to assign a Qualification to multiple workers at the same time. Using this procedure will allow you to exclude from recruitment workers who participated in a previous related study (e.g., a study you are now replicating), and it can serve other goals too (e.g., running longitudinal research, building your own panel).
The use of MTurk by behavioral researchers continues to increase. Despite the evidence on the benefits (and drawbacks) of MTurk, many researchers, reviewers, and editors intuitively distrust the reliability and validity of online labor markets.
On October 25, we will host a workshop at ACR called “Questioning the Turk: Conducting High Quality Research with Amazon Mechanical Turk”. We will answer and debate questions from the ACR community regarding MTurk, and raise some new questions. We will discuss the current issues that arise from MTurk’s use, as well as some of the solutions and replications. Questions can be submitted using the hashtag #mturkacr via Twitter (@aconsres, @joekgoodman, @gpaolacci) or Facebook (ACR page), as a comment to this post, or via email to the organizers, Joseph Goodman and Gabriele Paolacci.
The North American ACR conference will take place on October 24-26 at the Hilton Baltimore in Baltimore, MD. The MTurk workshop will take place on Saturday, October 25, at 2 pm in room Key 5.
Update: Thanks to all participants for contributing to a fruitful discussion! The slides we used in the workshop can be found here. Joe & Gabriele.
We recently published in Current Directions in Psychological Science a review of MTurk as a source of survey and experimental data. We discuss the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors. The Psych Report published a nice summary of the paper, which you can find here.
Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184-188.
The Annual Workshop on Crowdsourcing and Online Behavioral Experiments (COBE) seeks to bring together researchers to present their latest online behavioral experiments (e.g., on MTurk) and share new results, methods and best practices. See below for the details of the workshop.
- Workshop Date: June 8 or 9, 2014 (TBD)
- Location: Stanford University, Palo Alto, California, before the 15th ACM Conference on Electronic Commerce: http://www.sigecom.org/ec14/
- Call for papers: http://decisionresearchlab.com/cobe/
Guest post by Todd Gureckis
Dear Experimental Turk readers,
My lab has been developing some open source software to simplify the process of running online experiments (e.g., using MTurk). We are interested in getting feedback from a wide swath of researchers about what types of features would be useful in this software.
If you have an interest in (or experience with) running online experiments and have a moment, please fill out this quick survey. Your responses might help us gear our software development effort to maximize its utility for the overall community.
We will share the results of the survey as a comment to this post in a few weeks.
Guest post by Eyal Peer
Many researchers who use MTurk are concerned about the quality of data they can get from their MTurk workers. Can we trust our “Turkers” to read instructions carefully, answer all the questions candidly and follow our procedures prudently? Moreover, are there specific types of workers we should be targeting to ensure high quality data? I stumbled upon this question when I tried to run an experiment about cheating on MTurk. I gave Turkers the opportunity to cheat by over-reporting lucky coin tosses, and found very low levels of cheating. I was surprised, especially in light of all the research on how dishonest people can be when given a chance. I then realized that I had, as I usually do, sampled only “high reputation” Turkers (those who had more than 95% of previous HITs approved). I was more surprised to see that when I removed this restriction and allowed any Turker to take part in my study, cheating went up considerably. This led us (Joachim Vosgerau, Alessandro Acquisti and me) to wonder whether high reputation workers are indeed less dishonest and, presumably, produce higher data quality. If so, then perhaps sampling only high reputation workers can be sufficient to obtain high quality data. We found that to be true. In a paper forthcoming in Behavior Research Methods we report how high reputation workers produced very high quality data, even when no efforts to increase their attention were made.
Until then, my colleagues and I had been mostly relying on attention-check questions (henceforth, ACQs) to ensure high quality data. These are “trick” questions that test respondents’ attention by prescribing a specific response to those who read the questions or the preceding instructions (e.g., “have you ever had a fatal heart attack?”; see Paolacci, Chandler, & Ipeirotis, 2010). Using ACQs has inherent disadvantages. First, including ACQs might disrupt the natural flow of our survey and might even (in extreme cases) interfere with our manipulation. Second, it’s possible that attentive Turkers might get offended by such “trick” questions, and might react unfavorably. Lastly, and perhaps most important, even when failing ACQs can be considered a reliable signal that the participant has not paid attention in the rest of the survey, a sampling bias might be created if those who fail ACQs are excluded post-hoc from the sample. On the other hand, not using ACQs and sampling only high reputation workers might also restrict the sample’s size, reduce the response rate, and create a response bias. So, we compared these two methods for ensuring high quality data: using ACQs vs. sampling only high reputation workers (e.g., those with 95% or more approved HITs).
In the first study, we found that high reputation workers did provide higher quality data on all measures: they failed common ACQs very rarely, their questionnaires’ reliability was high and they replicated known effects such as the anchoring effect. This was true whether or not these workers received ACQs. Namely, high reputation workers who did not receive any ACQ showed similarly high quality data compared to those who did receive (and pass) our ACQs. Low reputation workers, on the other hand, failed ACQs more often and their data quality was much lower. However, ACQs did make a difference in that group. Those who passed the ACQs provided higher data quality, sometimes very similar to the quality of data obtained from high reputation workers. So, it seemed that ACQs were not necessary for high reputation workers. Moreover, sampling only high reputation workers did not reduce response rate (in fact, it increased it) and we couldn’t find any evidence for a sampling bias, as their demographics were similar to those of low reputation workers. However, we used pretty common ACQs (such as the one mentioned above), and it was very likely that high reputation workers, being highly active on MTurk, recognized those questions, which would have made the ACQs less effective for them. We thus ran another study with novel ACQs, and found similar results: high reputation workers produced high quality data with or without ACQs, even when the ACQs were not familiar to them. We also found that, among the high reputation workers, those who had been more productive on MTurk (completed more HITs in the past) produced slightly higher data quality than less productive workers. But even the less productive high-reputation workers produced very high data quality, with or without ACQs.
Our final conclusion was that reputation is a sufficient condition for obtaining high quality data on MTurk. Although using ACQs doesn’t seem to hurt much (we did not find any evidence of ACQs causing reactance), they also seem to be unnecessary when using high reputation workers. Because sampling only high reputation workers did not reduce response rate or create a selection bias, and also since most researchers we know use only those workers anyway, it seems to be the preferable method to ensure high quality data. ACQs might have been necessary in the past, when completing behavioral surveys and experiments was not very common on MTurk, but their efficacy seems to have diminished recently. As MTurk is a highly dynamic environment, this assertion might not hold true in the future, as new and different people join MTurk. Nevertheless, we believe that if one has to choose between using ACQs or not, it’s better to not use them and, instead, simply sample high reputation workers to get high quality data. Personally, I have stopped using ACQs altogether and am still getting reliable and high quality data in all my studies on MTurk.
Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.
Peer, E., Vosgerau, J., & Acquisti, A. (in press). Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods. Available here.
The North American Conference of the Association for Consumer Research is taking place on October 3-6 at the Hilton Palmer House Hotel in Chicago, IL. On October 4, the conference will host a roundtable about MTurk. Many issues will be discussed, ranging from practical decisions that researchers need to make when conducting studies (e.g., how much to compensate participants) to priorities for future research about crowdsourcing social science. The roundtable will take place on October 4 at 11:00 am in room Indiana.
In a previous post, we documented the existence of MTurk workers who are disproportionately likely to show up in academic studies, potentially leading to foreknowledge of experimental procedures. We report here a study that illustrates the challenges that foreknowledge can pose for MTurk data validity.
The Cognitive Reflection Test (CRT; Frederick, 2005) is typically used to measure a stable individual difference in cognitive orientation. It consists of three questions, each of which elicits an intuitive response that can be recognized as wrong with some additional thought. As a result, the number of correct answers to the CRT can serve as a parsimonious measure of the individual’s tendency to make reflective decisions. Foreknowledge – either as a result of previous exposure or through information shared by others who have completed the task – is problematic for the CRT because it increases the likelihood that the individual has discovered the correct response and can provide it without reflection, or at a minimum is aware that there is a “trick” that necessitates that the question receive additional scrutiny.
The CRT appears frequently on MTurk; therefore, workers who spend more time using MTurk should be more likely to provide the correct answer. We recruited one hundred workers who varied in their (known) prior experience on MTurk. Participants completed a study that included, among other measures, the original version of the CRT (Frederick, 2005). We found that workers who were known to have completed more research studies on MTurk answered more CRT questions correctly, suggesting that their performance does in fact improve with experience.
One alternative explanation for this finding is that more productive workers differ in some meaningful way from less productive workers. For example, they could be more reflective or conscientious. To rule this out, we asked the same workers to complete a “novel” version of the CRT (from Finucane & Gullion, 2010) prior to completing the “original” version. The questions posed by the two tests are logically identical and only differ in terms of their familiarity to workers. As expected, performance on the original and novel tests was highly correlated. However, whereas prior experience significantly predicted performance on the original CRT, it did not predict performance on the novel CRT. This suggests that the results were not caused by a fundamental difference in the cognitive style of more experienced workers.
This study illustrates that it is problematic to use measures that are familiar to MTurk workers while assuming that participants are naïve. By collecting the CRT among non-naïve participants, researchers might draw false conclusions about their levels of cognitive reflection and about the relationship between CRT performance and any variable that correlates with worker experience. Moreover, nonnaïveté introduces another source of error that might obscure the relationship between cognitive reflection and other constructs of interest.
See the full paper for more details about the study and a broader discussion of the challenges connected to worker nonnaïveté.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté Among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers. Behavior Research Methods, 46(1), 112-130.
Finucane, M. L., & Gullion, C. M. (2010). Developing a tool for measuring the decision-making competence of older adults. Psychology and Aging, 25(2), 271.
Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4), 25–42.
More and more social scientists have adopted MTurk as a venue for their research, praising its speed, cost, and diversity relative to undergraduate samples. However, many of them may fail to take into account some other critical aspects that differentiate MTurk samples from undergraduate subject pool samples.
In a paper just published in Behavior Research Methods, we find worker non-naïveté to be a serious concern. One general issue is that MTurk workers share information about HITs with each other publicly and searchably on various forums, including on two different subreddits (see here and here for some collected examples of manipulation checks and common measures that have become common knowledge among workers via forums).
More specifically, while the probability that any given worker has seen a particular manipulation may be low, there is a population of “superturkers”, i.e., extremely prolific workers, who are significantly more likely to end up in your studies. We pooled 16,408 HITs in 132 unique studies, and found that the HITs were completed by 7,498 unique workers. While the average worker completed 2.2 HITs, the top 1% of most prolific workers (15+ HITs) completed 11% of the total, and the top 10% (5+ HITs) completed 41% of the total HITs.
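The concentration figures above can be reproduced from any HIT log with a few lines of Python. The sketch below is only an illustration: the function name `concentration` and the toy log are hypothetical, and only the two pooled totals (16,408 HITs, 7,498 workers) come from the post.

```python
from collections import Counter


def concentration(hit_worker_ids, top_fraction):
    """Share of all HITs completed by the most prolific `top_fraction` of workers."""
    # One count per unique worker, most prolific first.
    counts = sorted(Counter(hit_worker_ids).values(), reverse=True)
    # Always include at least one worker in the "top" group.
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)


# Mean HITs per worker from the pooled totals reported in the post.
mean_hits = 16408 / 7498
print(round(mean_hits, 1))

# Hypothetical toy log: one entry per completed HIT, labeled by worker ID.
toy_log = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
print(concentration(toy_log, 0.25))  # share completed by the top 25% of workers
```

Running the same function over your own list of worker IDs (one entry per completed HIT) shows how much of your data comes from a small group of repeat participants.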
As mentioned, superturkers can be problematic because:
- They are more likely to have seen standard manipulations (see figure below)
- They are significantly more likely to read MTurk blogs/forums
- They are significantly more likely to receive notifications from www.turkalert.com each time you (as an academic requester) post new HITs
However, on the plus side, we find that:
- They are less likely to be multitasking while on MTurk
- They are much better at responding to and completing follow-up studies (one year later, 75% of these workers completed a follow-up), which helps if you are interested in longitudinal research.
In sum, superturkers are a mixed blessing. MTurk has the capability for more sophisticated designs, including longitudinal studies, and superturkers are reliable enough to make this viable. Since they are less likely to be multitasking, they may also be good participants in studies that require more attention (e.g., reaction time).
However, their non-naïveté, and worker non-naïveté in general, is a serious concern. Researchers can and should take steps to exclude workers from subsequent studies within their lines of research (the paper provides one solution; this method is another good solution if your studies are on Qualtrics). Ideally, they would also communicate with other researchers who do similar work, so that one’s previous participants could be excluded from the other’s study, and vice versa. Beyond that, though, researchers should be very wary about using “classic” manipulations or measures on MTurk. The next blog post will detail the data that supports this admonition.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté Among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers. Behavior Research Methods, 46(1), 112-130.
As MTurk was not designed for psychological research, it cannot be expected to provide the experimental control that psychologists typically exercise when recruiting participants for laboratory studies. In particular, MTurk lacks:
(I) The ability to exclude participants that have already participated in related studies.
(II) The ability to prevent study previews.
(III) The ability to verify participants’ completion of a study.
TurkGate, or Grouping and Access Tools for External surveys (for use with Amazon Mechanical Turk), gives researchers an easy-to-use web application for providing such control when using MTurk with externally hosted studies (e.g., Qualtrics surveys).
TurkGate groups related HITs together, such that participants may only access one HIT per group. These HITs only link to surveys after workers accept them, and only if they have not already accessed related HITs. Workers attempting to preview HITs receive information about the HIT, but no link. Once the HIT is accepted, the worker’s ID is tested against a database to verify eligibility before the worker is granted access to the study. TurkGate also generates completion codes that can be automatically verified.
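The gating logic described above can be sketched in Python. This is not TurkGate's code: the class `GroupGate`, its method names, and the in-memory dictionary (standing in for the database) are all hypothetical. The sketch also assumes that MTurk marks a previewed external HIT by passing the placeholder assignment ID `ASSIGNMENT_ID_NOT_AVAILABLE`.

```python
# Placeholder assignment ID assumed to be sent by MTurk while a HIT is only previewed.
PREVIEW_ASSIGNMENT_ID = "ASSIGNMENT_ID_NOT_AVAILABLE"


class GroupGate:
    """Hypothetical in-memory stand-in for a group-based access database."""

    def __init__(self):
        self._seen = {}  # group name -> set of worker IDs that accessed it

    def request_access(self, worker_id, group, assignment_id, survey_url):
        """Return the survey URL if the worker may proceed, otherwise None."""
        # Previews get information about the HIT, but never the link.
        if assignment_id == PREVIEW_ASSIGNMENT_ID:
            return None
        seen = self._seen.setdefault(group, set())
        # One HIT per group: refuse workers who already accessed a related HIT.
        if worker_id in seen:
            return None
        seen.add(worker_id)
        return survey_url


gate = GroupGate()
url = "https://example.org/survey"  # hypothetical survey address
print(gate.request_access("W1", "anchoring", PREVIEW_ASSIGNMENT_ID, url))
print(gate.request_access("W1", "anchoring", "A1", url))
print(gate.request_access("W1", "anchoring", "A2", url))
```

Because the record is keyed by group rather than by survey, two studies become mutually exclusive simply by sharing a group name.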
Pe’er et al. (2012) have already created a relatively simple method to resolve some of these issues, which we refer to as the Qualtrics method since it is based on functionality inherent to Qualtrics. Like TurkGate, their method supports hosting studies on a variety of sites, as users can adopt (the free version of) Qualtrics for the purpose of screening participants before redirecting them elsewhere. In addition, their method also uses a script to prevent study previews. Researchers who use the Qualtrics method, and who also track response IDs, can address all of the MTurk limitations described above without needing to configure or maintain any software or hardware.
In contrast, TurkGate needs to be downloaded and installed on a web server (e.g., Apache HTTP Server) with a database management system (e.g., MySQL). As such, the security and reliability of TurkGate will depend on the system it is installed on. If the server goes down, TurkGate goes down with it–an unlikely occurrence with a professional service like Qualtrics. And although TurkGate is designed to work out-of-the-box, it may require some administration, such as updating versions or setting up database backups. These requirements suggest that TurkGate is best suited for an entire laboratory or department of researchers, where a single computer-savvy individual or IT professional can maintain it.
In return for this investment, however, TurkGate offers a streamlined workflow with several distinct advantages:
(i) TurkGate manages related studies with minimal overhead. Researchers using either TurkGate or the Qualtrics method add a script to their Web Interface HITs that checks workers’ IDs against a list of restricted IDs. The difference is that the Qualtrics method maintains the list of IDs within each survey, whereas TurkGate maintains a global list of IDs for all surveys, separated by group. Having a centralized database has several advantages, especially when researchers collaborate. Namely, there is no need for researchers to store, share or update lists because all surveys use the same list and multiple researchers can use the same TurkGate installation. To make any pair of studies mutually exclusive, researchers simply assign them the same group name. Another benefit is that researchers can run their studies simultaneously, since TurkGate’s list is updated automatically and in real-time. Obviating the need to manage multiple lists of IDs makes creating surveys faster. A researcher simply submits a URL and group name for their survey into TurkGate to get the aforementioned script. They are then ready to create their HIT and run their survey. It is this highly optimized workflow that represents the original raison d’etre of TurkGate.
(ii) TurkGate disables HIT previews while preventing unnecessary HIT returns. The Qualtrics method, like TurkGate, prevents HIT previews by sending workers to an intermediary page prior to the actual study. However, explicit care was taken in developing TurkGate to prevent workers from ever needing to return HITs for which they are not eligible. This prevents the artificial and undesirable inflation of workers’ return rates. Instead, workers are provided with a group name, and if they recall having participated in the group, they know not to accept the HIT. If they do not recall participating, they can simply verify their eligibility by submitting their worker ID.
(iii) TurkGate offers intuitive, verifiable, and anonymous completion codes. TurkGate’s completion codes were crafted to support a number of features. First, the codes themselves contain useful information in human-readable form, including MTurk worker ID, group name, survey identifier, and an optional Qualtrics or LimeSurvey record ID (researchers can augment their codes with any number of additional key-value pairs). Critically, each code also contains an encrypted segment used to prevent fake codes. After running a batch of HITs, researchers simply copy and paste their MTurk results file into TurkGate, which then instantly flags invalid records and duplicates. For experimenters (or IRBs) concerned about anonymity, TurkGate can also verify participation without using response IDs (which can be linked to study data). Those using Qualtrics often use response IDs as completion codes (http://www.qualtrics.com/university/researchsuite/faqs#codenumber), but this requires manual verification and precludes complete anonymity.
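To illustrate the general idea of a human-readable code with a tamper-evident segment, here is a minimal Python sketch. It is not TurkGate's actual format: the post does not say how the encrypted segment is built, so the sketch swaps in a truncated HMAC-SHA256 tag, and the field layout, the key, and the function names (`make_code`, `verify_code`, `flag_batch`) are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-private-key"  # hypothetical; must be kept private


def _tag(payload, key):
    # Truncated HMAC-SHA256 over the readable part of the code.
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()[:16]


def make_code(worker_id, group, survey_id, key=SECRET_KEY):
    """Build a readable completion code followed by a verification tag."""
    payload = f"w={worker_id};g={group};s={survey_id}"
    return f"{payload};sig={_tag(payload, key)}"


def verify_code(code, key=SECRET_KEY):
    """Return True only if the tag matches the readable part of the code."""
    payload, sep, tag = code.rpartition(";sig=")
    if not sep:
        return False  # no tag present at all
    return hmac.compare_digest(tag.encode(), _tag(payload, key).encode())


def flag_batch(codes, key=SECRET_KEY):
    """Split a batch into (invalid codes, repeated valid codes)."""
    seen, invalid, duplicates = set(), [], []
    for code in codes:
        if not verify_code(code, key):
            invalid.append(code)
        elif code in seen:
            duplicates.append(code)
        else:
            seen.add(code)
    return invalid, duplicates


code = make_code("A1B2C3", "anchoring", "study7")
print(verify_code(code))
print(flag_batch([code, code, "w=X;g=anchoring;s=study7"]))
```

Because the tag depends on a key that workers never see, a code cannot be forged by editing the readable fields, and the whole batch can be checked automatically rather than by hand.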
For researchers in psychology laboratories or departments that already have access to IT support capable of configuring and maintaining a web server and database, TurkGate will likely serve as a convenient and long-term solution. However, the Qualtrics method is better suited for researchers who are uninterested in the overhead of deploying a separate system, especially if they already use Qualtrics.
TurkGate is an actively developed, open-source project that users are free to download (and modify) via GitHub. It is used in multiple laboratories and continues to evolve based on the feedback and contributions of its users. Learn more about TurkGate here.
(Suggested citation: Goldin, G., & Darlow, A. (2013). TurkGate (Version 0.4.0) [Software]. Available from http://gideongoldin.github.com/TurkGate/)
Pe’er, E., Paolacci, G., Chandler, J., & Mueller, P. (2012). Selectively Recruiting Participants from Amazon Mechanical Turk Using Qualtrics (May 2, 2012). Available at SSRN: http://ssrn.com/abstract=2100631 or http://dx.doi.org/10.2139/ssrn.2100631