Posted by: Gabriele Paolacci | May 23, 2016

Using MTurk to Study Political Ideology

Guest post by Scott Clifford, Ryan Jewell, and Philip Waggoner

MTurk is increasingly used to study questions about politics and political psychology. MTurk samples are well known to deviate from the national population on a number of dimensions, particularly political ideology. The underrepresentation of conservatives has led some scholars, most notably Dan Kahan, to worry that the conservatives who opt into MTurk are not “real” conservatives. As Dan puts it, the underrepresentation of conservatives means we “can infer there is something different about the conservatives who do sign up from the ones who don’t” (see Dan’s discussion here and here). For example, they may differ from other conservatives in psychological dispositions central to their identities. If this claim were true, it could render MTurk samples invalid for studying political and ideological divides. This would be particularly worrisome for research that uses ideology or partisanship as a moderator of experimental treatment effects or that examines psychological differences between liberals and conservatives.

In a recent article published in Research & Politics, we evaluated this concern using a large sample recruited from MTurk (N = 1,500). We compared this sample to two nationally representative benchmark surveys from the American National Election Studies 2012 Time Series Study, which was conducted before and after the 2012 US presidential election. The ANES study recruited 1,413 respondents for face-to-face interviews and 3,860 respondents for a web-based survey (through GfK).

In our MTurk survey we asked a series of questions that allowed us to make a direct comparison to the ANES surveys. Following research in political psychology, we focused on two sets of variables that we expected to be associated with political ideology: the Big Five personality traits and values (egalitarianism, moral traditionalism, racial resentment, authoritarianism). Our first analysis compared the levels of each trait and value across ideological groups. Contrary to the concerns discussed above, our MTurk conservatives looked nearly identical to ANES conservatives on all of these measures. Surprisingly, it was liberals who looked different – our MTurk liberals consistently held more liberal values and issue attitudes than our ANES liberals.

As our primary test, we estimated models predicting political ideology as a function of either personality traits or values, while controlling for standard demographics. The figure below plots the coefficients and 95% confidence intervals for each sample. As is clear from the figure, the results are highly similar across samples. In fact, across a broader set of tests, over 90% of the coefficients are statistically indistinguishable in size across samples. Thus, a researcher investigating the psychological predictors of political ideology would have reached largely the same conclusions whether using MTurk or the ANES.

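[Figure: coefficients and 95% confidence intervals from models predicting political ideology from personality traits or values, plotted separately for the MTurk and ANES samples]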
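The post does not say which test was used to judge whether coefficients are “statistically indistinguishable” across samples. One common approach for a coefficient estimated in two independent samples is a z-test on the difference, z = (b1 - b2) / sqrt(se1² + se2²). Here is a minimal sketch. It assumes independent samples and approximately normal estimates. The numbers are made up for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import NormalDist


def coef_difference_test(b1, se1, b2, se2):
    """Two-sided z-test of H0: the same coefficient is equal in two samples.

    Assumes the two samples are independent (so the covariance term is zero)
    and that each estimate is approximately normally distributed.
    """
    se_diff = sqrt(se1 ** 2 + se2 ** 2)
    z = (b1 - b2) / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical estimates for one predictor (e.g., Openness) in two samples.
z, p = coef_difference_test(b1=0.30, se1=0.05, b2=0.25, se2=0.04)
print(f"z = {z:.2f}, p = {p:.3f}")
```

If a whole set of such tests is run, some differences will reach significance by chance alone. That is one reason to read a figure like "over 90% indistinguishable" as a summary rather than as a single test.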

Overall, we found that MTurk liberals and conservatives closely mirror the psychological divisions between liberals and conservatives in the mass public. This provides little support for the concern that self-selection creates a pool of conservatives who are psychologically distinct from their counterparts in the larger population. We did find, however, that MTurk liberals hold more characteristically liberal values and attitudes than liberals from nationally representative samples. As a result, we encourage researchers to use more robust measures of political ideology, such as an index of political attitudes, in order to capture variation in political ideology more fully. Nonetheless, we find little reason to believe that liberals and conservatives recruited from MTurk are psychologically distinct from their counterparts in the mass public.
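The post does not specify how such an index should be built. One simple approach is to standardize each attitude item, flip items keyed in the opposite direction, and average the z-scores for each respondent. Here is a minimal sketch under those assumptions. The item layout and data are hypothetical, and reliability checks and item weighting are left out.

```python
from statistics import mean, pstdev


def attitude_index(responses, reverse=()):
    """Return one index score per respondent: the mean of z-scored items.

    responses: list of per-respondent lists, one numeric answer per item,
               where higher values mean a more conservative answer...
    reverse:   ...except for the item indices listed here, which are keyed
               the other way (higher = more liberal) and get flipped.
    """
    n_items = len(responses[0])
    z_columns = []
    for j in range(n_items):
        column = [r[j] for r in responses]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            raise ValueError(f"item {j} has no variance")
        sign = -1 if j in reverse else 1
        z_columns.append([sign * (x - mu) / sd for x in column])
    # Average each respondent's standardized answers across items.
    return [mean(z[i] for z in z_columns) for i in range(len(responses))]


# Three hypothetical respondents answering three 1-7 items; item 2 is reverse-keyed.
scores = attitude_index([[1, 2, 7], [4, 4, 4], [7, 6, 1]], reverse={2})
```

Averaging several items yields a more graded and less noisy score than a single self-placement question. This is the kind of variation among liberals that the authors suggest a single ideology item may miss.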

Guest Post by Mark Keuschnigg

Local confinement of samples and results has raised questions about the external validity of social science experiments. The last 15 years have thus seen a sharp increase in experiments conducted at multiple locations, including developing countries and small-scale societies. However, cross-regional comparisons of economic behavior have run into obstacles due to the limited transferability of standardized decision situations into parallel laboratory set-ups. In a recent article in Social Science Research we use cross-regional participation on MTurk to circumvent common pitfalls of traditional multi-location experimentation.

We argue that MTurk experiments provide a sorely needed complement to laboratory research, transporting a homogeneous decision situation into various living conditions and social contexts. In fact, we believe that quasi-experimental variation of the characteristics people bring to the experimental situation is the key potential of crowdsourced online designs. Our research shows that such analyses of “virtual pools” can be adapted to study local patterns of behavior.

We use the Ultimatum Game (UG) and the Dictator Game (DG) to generate data (N = 991). We chose bargaining games specifically because norms of fairness are strongly conditional on local context. UG and DG thus reveal expectations about the social norms that prevail in a particular population.
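For readers unfamiliar with the two games, the payoff rules are short. In the DG the proposer splits a stake and the recipient must accept the split. In the UG the responder can reject, and then both players get nothing. Here is a minimal sketch of those payoffs. Modeling the responder as having a minimum acceptable offer is an assumption made here for illustration; it is not taken from the paper.

```python
def dictator_payoffs(stake, offer):
    """Dictator Game: the recipient has no choice, so the proposed split is final."""
    if not 0 <= offer <= stake:
        raise ValueError("offer must be between 0 and the stake")
    return stake - offer, offer  # (proposer, recipient)


def ultimatum_payoffs(stake, offer, min_acceptable):
    """Ultimatum Game: the responder rejects any offer below their threshold
    (an assumed decision rule), in which case both players receive nothing."""
    if not 0 <= offer <= stake:
        raise ValueError("offer must be between 0 and the stake")
    if offer >= min_acceptable:
        return stake - offer, offer  # (proposer, responder)
    return 0, 0


# With a $4 stake: a $1 dictator allocation leaves (3, 1), while the same
# ultimatum offer to a responder who demands at least $1.50 is rejected.
print(dictator_payoffs(4, 1), ultimatum_payoffs(4, 1, 1.5))
```

In these terms, a "more selfish" proposer makes a smaller offer, and a "more demanding" responder has a higher minimum acceptable offer.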

To assess the importance of context, our design includes an experimental variation of monetary stakes ($0, $1, $4, and $10) as a benchmark. Our marginal totals correspond closely to laboratory findings: monetary incentives induce more selfish behavior, but the particular size of a positive stake appears irrelevant.

Analyses of “virtual pools” first mirror standard sub-group analyses contrasting participants from different regions. We illustrate this by comparing the behavior of workers from India and the US. Controlling for differences in the socio-demographic composition of the national pools, we do not find a cross-country difference in a parametric situation (DG). Culture, however, seems to be relevant in strategic interaction (UG): participants in India are more selfish (as proposers) and less demanding (as responders) than US Americans. Within the US, Southerners appear both more selfish (as proposers) and more demanding (as responders) than Northerners.

More importantly, however, participants’ geographical locations provide an interface for directly including macro variables that may influence individual behavior. We limit our analysis to regional variation in economic affluence and social capital across US states. According to our estimates, dictators’ allocations from wealthier and more socially integrated states are 13 percent larger on average than those from less-advantaged states. Interestingly, the total size of contextual influence clearly exceeds the stake effects. Most important from a sociological perspective, context effects are both more pronounced and more theoretically consistent than the effects of individual attributes.

  • For cross-country comparability we used tokens and weighted payoffs for Indian participants using a purchasing power parity conversion factor.
  • To balance national pools we posted four HITs daily (early morning and late afternoon at local time in each country) and recruited for each daily session as many US Americans as we had recruited Indians earlier that day.
  • To avoid waiting time and drop-out, subjects were not matched in real time. Instead, just before payoff, each participant was matched with a decision drawn (without replacement) from a pool of preceding participants’ decisions.
  • Submissions were only accepted once per worker-ID. We also disabled participation from IP-addresses similar to those existing in our database to impede multiple participations by one household.
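The ex-post matching described in the third point can be sketched as a small pool. Each completed decision is stored. At payoff time, a partner decision from a different, earlier participant is drawn at random and removed, so it is never reused. The class and method names are hypothetical, and the post does not describe how the authors implemented this.

```python
import random


class ExPostMatcher:
    """Match each participant, at payoff time, to a stored decision of an
    earlier participant, drawing without replacement."""

    def __init__(self, seed=None):
        self._pool = []  # list of (worker_id, decision) pairs
        self._rng = random.Random(seed)

    def add_decision(self, worker_id, decision):
        self._pool.append((worker_id, decision))

    def draw_partner(self, worker_id):
        # Exclude the worker's own stored decision from the candidates.
        candidates = [i for i, (wid, _) in enumerate(self._pool) if wid != worker_id]
        if not candidates:
            return None  # nobody suitable is left in the pool yet
        idx = self._rng.choice(candidates)
        return self._pool.pop(idx)  # removing it means "without replacement"


matcher = ExPostMatcher(seed=1)
matcher.add_decision("A", 3)  # e.g., worker A offered 3 tokens
matcher.add_decision("B", 5)
print(matcher.draw_partner("A"))  # only B's decision is eligible for A
```

Because nobody waits for a live partner, a participant who drops out mid-session never leaves another participant stranded.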

So far, the use of “virtual pools” has received scant attention in experimental research. We know a great deal about how institutional arrangements affect fairness, trust, cooperation, and reciprocity in economic games, yet we know little about how local socio-economic conditions and strategies learned in daily interaction influence outcomes of social experiments. Bringing context back into social experiments is particularly relevant for sociological research which—unlike most experimental research in economics and psychology—fully acknowledges the importance of context effects in a multi-level explanation of individual action.


Keuschnigg, M., Bader, F., Bracher, J. (2016) Using crowdsourced online experiments to study context-dependency of behavior. Social Science Research, doi: 10.1016/j.ssresearch.2016.04.014.

Posted by: Gabriele Paolacci | July 30, 2015

How many people can your lab reach on MTurk?

Guest post by Neil Stewart

How many people can your lab reach on MTurk? We used the capture-recapture method¹ from wildlife ecology to estimate how many workers you are sampling from. Our estimate is 7,300 workers.

Using 114,460 HITs completed from 2012 onwards we estimated, for each of our labs, how many workers we are sampling from. We then used a random-effect meta-analysis to estimate the number of workers a new lab, which could be yours, could reach. Why does this matter? Well, there is an exponential-like increase in the number of publications using this MTurk population—and we, like others, have found considerable overlap between our laboratories. And if you are planning a series of experiments or running adequately powered replications, you could run out of workers quite fast.

What can you do to increase your reach? Surprisingly, paying more doesn’t help. Our population estimate was smaller for higher-paying HITs, we think because the most active workers seek out these HITs and crowd out the less active workers. (Still, that is no reason not to pay our participants a living wage!) Running larger batches does help: our larger batches sampled from a population nearly three times larger than the smaller batches did. One last strategy is to wait. We estimate that it takes about 7 months for half of the workers on MTurk to leave and be replaced.
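To see what a 7-month half-life implies for a lab planning a series of studies, one can treat turnover as a constant-rate (exponential) process. This is a simplifying assumption made here, not the model used in the paper. Under it, the share of today's workers still active after t months is 0.5^(t/7).

```python
HALF_LIFE_MONTHS = 7  # estimate quoted in the post


def share_remaining(months, half_life=HALF_LIFE_MONTHS):
    """Fraction of the current worker population still active after `months`,
    assuming exponential (constant-rate) turnover."""
    return 0.5 ** (months / half_life)


for t in (1, 3, 7, 12, 14):
    print(f"{t:>2} months: {share_remaining(t):.0%} of today's workers remain")
```

Under this assumption, waiting 14 months would leave about a quarter of today's workers in the pool.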

View the in-press paper in Judgment and Decision Making here.

Neil Stewart, Christoph Ungemach, Adam Harris, Dan Bartels, Ben Newell, Gabriele Paolacci, and Jesse Chandler

¹ The intuition behind the capture-recapture method is not too hard. Ecologists might, for example, use it to estimate the number of fish in a pond. Go fishing on Day 1. Catch some fish, tag them, and release them. Then, on Day 2, go fishing again. Catch some fish and observe the proportion that are tagged. Now you have an estimate of the proportion tagged in the pond from Day 2, and the number tagged in the pond from Day 1, so you can estimate the total number in the pond. If you tag five fish on Day 1 and observe that one quarter of Day 2’s catch is tagged, then there must be 20 fish. We used WorkerIDs like tags.
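The footnote's arithmetic is the classic two-occasion Lincoln–Petersen estimator: N = M × C / m, where M is the number tagged on Day 1, C is the size of the Day 2 catch, and m is the number of tagged fish in that catch. The paper's own estimates come from a more elaborate model fitted to many batches. This sketch only reproduces the footnote's two-day intuition. It also includes Chapman's bias-corrected variant, which is often preferred for small samples. The WorkerID helper and its inputs are hypothetical.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Two-occasion population estimate N = M * C / m."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return marked_first * caught_second / recaptured


def chapman(marked_first, caught_second, recaptured):
    """Chapman's bias-corrected estimator; defined even with zero recaptures."""
    return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1


def estimate_from_worker_ids(batch1, batch2):
    """Treat WorkerIDs as tags: workers seen in both batches are 'recaptured'."""
    first, second = set(batch1), set(batch2)
    return lincoln_petersen(len(first), len(second), len(first & second))


# Footnote example, assuming Day 2's catch was four fish, one of them tagged.
print(lincoln_petersen(5, 4, 1))  # 20.0
```

The estimator assumes a closed population in which every individual is equally likely to be caught. Worker turnover and very active workers who show up in every batch both violate that assumption. This is why a real analysis of MTurk needs a richer model than the footnote's.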


Posted by: Gabriele Paolacci | June 16, 2015

Using Nonnaive Participants Can Reduce Effect Sizes

Guest post by Jesse Chandler

In a new Psychological Science article we provide direct evidence that effect sizes for experimental results are reduced among participants who have previously completed an experimental paradigm. Specifically, we recruited MTurk workers who had participated in the Many Labs 1 series of two-condition experiments and invited them to participate in a research study that included the exact same package of experiments. We found that effect sizes decreased the second time around, especially among those who were exposed to opposite conditions at the two time points.

Previous studies have demonstrated that MTurk worker performance changes as workers become more experienced. For example, we have demonstrated that workers’ scores on the Cognitive Reflection Test (a commonly used measure of intellectual ability) are correlated with worker experience. Likewise, Dave Rand and Winter Mason have led projects that provided evidence that workers get better at economic games over time. All of these findings are new twists on older observations: the attitudes of survey panel members tend to change over time (a phenomenon known as panel conditioning in the survey literature), and people tend to improve on measures of aptitude (a phenomenon known as a practice effect in the psychometric testing literature).

Our findings illustrate that participant experience can also affect experimental results, even when dependent measures are not straightforward measures of ability. These findings are surprising (at least to us) because we have tended to assume that workers are relatively unengaged while completing HITs and complete so many tasks that any individual experiment could hardly be memorable. But apparently they are.

Fortunately, there is good news. First, we see some evidence that this effect wears off over time, suggesting that people eventually forget whatever information they may have seen. Second, if all you care about is the direction of an effect (rather than an exact point estimate), smaller effect sizes can be offset by larger sample sizes. Third, there is an increasing number of tools available to prevent duplicate workers from participating in an experiment.

  • Both the MTurk API and the GUI allow you to create qualifications that filter out unwanted workers.
  • Qualtrics can be set up to check workers against a predefined list of workers and exclude those with matching WorkerIDs. This is useful if you know of workers you want to exclude, but have not worked with them before.
  • TurkGate will do something similar. It needs to run on a server, but it is likely easier to maintain than Qualtrics, particularly for a lab group or research team that wants to coordinate their efforts.
  • TurkPrime and UniqueTurker are newer solutions that seem easier to use for individual researchers. We have not tested them, but readers may wish to experiment with them.
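At their core, the list-based tools above (Qualtrics, TurkGate) collect the WorkerIDs of past participants and check each new WorkerID against that list. Here is a minimal sketch of that check. It is not how any of these tools is actually implemented. It assumes the earlier studies' results files are CSVs with a `WorkerId` column, as MTurk's downloaded batch results typically have, and the file contents below are hypothetical.

```python
import csv
import io


def load_excluded_ids(csv_files, column="WorkerId"):
    """Collect WorkerIDs from earlier studies' results files (file-like objects)."""
    excluded = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            worker_id = (row.get(column) or "").strip().upper()
            if worker_id:
                excluded.add(worker_id)
    return excluded


def is_eligible(worker_id, excluded):
    # Normalize the same way as the stored IDs before comparing.
    return worker_id.strip().upper() not in excluded


# Hypothetical results file from a previous study.
previous_study = io.StringIO("AssignmentId,WorkerId\nX1,A1B2C3\nX2,A9Z8Y7\n")
excluded = load_excluded_ids([previous_study])
print(is_eligible("a1b2c3 ", excluded))  # False: this worker took part before
```

A check like this runs only after a worker has opened the study, so the worker still sees the HIT. A qualification-based filter, by contrast, can keep the HIT from being accepted in the first place.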

In short, there are lots of ways to limit worker participation. Of course, this does raise serious concerns for experimental paradigms that are used to the point of abuse (trolley problem, I’m looking at you). It also highlights that the finite size of the MTurk pool places a finite limit on the number of times a particular experimental paradigm can be run.
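To see how much extra sample the second point implies, a standard normal-approximation formula gives the participants per condition for a two-group comparison: n ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size. The effect sizes below are illustrative and are not values from the study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per condition for a two-sided, two-sample
    comparison of means with standardized effect size d (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)


for d in (0.5, 0.4, 0.3):
    print(f"d = {d}: about {n_per_group(d)} per condition")
```

Because n scales with 1/d², a drop from d = 0.5 to d = 0.4 among nonnaive participants raises the required sample by roughly 56%. Every extra participant also draws on the finite pool discussed above.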

Posted by: Gabriele Paolacci | May 29, 2015

MTurk workshop at EMAC

On May 27 I held a workshop at EMAC on conducting behavioral research using Amazon Mechanical Turk samples. Slides are available here.

Guest post by David J. Hauser

In this new article, Norbert Schwarz and I show in two experiments that answering an instructional manipulation check (IMC) changes the way participants approach later survey questions.

IMCs are often included in online research (and especially on MTurk) in order to assess whether participants are paying attention to instructions. However, participants can potentially see them as “trick” questions that violate conversational norms of trust. As a result, these questions may make participants more cautious when answering later questions in an effort to avoid being tricked again.

Two studies provided support for this hypothesis. In one study, participants received an IMC and the Cognitive Reflection Test (Frederick, 2005), a math test assessing the tendency to reflect and correct intuitive answers. Crucially, half of the participants completed the IMC before the CRT, whereas the other half completed the math test first. Completing the IMC first increased CRT scores (vs when the CRT came first), suggesting it increased systematic thinking.
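A design like this depends on randomly assigning participants to the two orders in roughly equal numbers. Here is one simple sketch: shuffle a balanced list of order labels. The labels and function name are hypothetical, and the post does not say how assignment was done.

```python
import random

ORDERS = ("IMC_first", "CRT_first")


def balanced_orders(n_participants, seed=None):
    """Return a shuffled list of order labels with equal (or off-by-one)
    counts of each order."""
    labels = [ORDERS[i % len(ORDERS)] for i in range(n_participants)]
    random.Random(seed).shuffle(labels)
    return labels


assignments = balanced_orders(101, seed=3)
```

In practice, survey platforms usually offer a built-in randomizer with an "evenly present" option, which achieves the same balance.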

In a second study, participants received an IMC and a probabilistic reasoning task assessing rational decision making (Toplak, West, & Stanovich, 2011). As before, half of the participants completed the IMC before the reasoning task, whereas the other half completed the reasoning task first. Completing the IMC first increased accuracy on the reasoning task (compared to completing the reasoning task first). Thus, answering an IMC teaches participants that there may be more to later questions than meets the eye, a conclusion that significantly alters participants’ reasoning strategies.

IMCs are typically conceptualized as measures, not interventions. However, as these studies demonstrate, they can also act as interventions. Researchers should therefore exercise caution when using IMCs.


Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19, 25-42.

Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition, 39, 1275-1289.

This document guides you through a simple method for not recruiting, into your new studies, MTurk workers who already participated in a certain study of yours. The core of the procedure relies on Excel (as opposed to the MTurk Command Line Tools or the MTurk API) to assign a Qualification to multiple workers at the same time. This procedure allows you to exclude workers who participated in a previous related study (e.g., a study you are now replicating), and it can serve other goals too (e.g., conducting longitudinal research or building your own panel).

Update June 10, 2015: Arnoud Plantinga developed an R script based on this method that creates the new variables for you. You can download the script here.
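For comparison with the Excel route, the same bulk assignment can be scripted against the MTurk API. The sketch below assumes boto3 (the AWS SDK for Python), whose MTurk client provides `create_qualification_type` and `associate_qualification_with_worker`. The qualification name and worker IDs are placeholders. Workers who hold the qualification are then kept away from a new HIT by a requirement with the `DoesNotExist` comparator.

```python
def exclusion_requirement(qualification_type_id):
    """QualificationRequirement that hides the HIT from anyone holding the
    qualification (they cannot discover, preview, or accept it)."""
    return {
        "QualificationTypeId": qualification_type_id,
        "Comparator": "DoesNotExist",
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }


def tag_past_workers(client, worker_ids, name="Participated in study X"):
    """Create a qualification and assign it to each past participant.

    `client` is expected to be a boto3 MTurk client; `name` is a placeholder.
    """
    response = client.create_qualification_type(
        Name=name,
        Description="Assigned to workers who completed an earlier related study",
        QualificationTypeStatus="Active",
    )
    qid = response["QualificationType"]["QualificationTypeId"]
    for worker_id in sorted(set(worker_ids)):  # de-duplicate before assigning
        client.associate_qualification_with_worker(
            QualificationTypeId=qid,
            WorkerId=worker_id,
            IntegerValue=1,
            SendNotification=False,
        )
    return qid


# Usage (requires boto3 and AWS credentials; try the sandbox first):
#   import boto3
#   client = boto3.client(
#       "mturk", region_name="us-east-1",
#       endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com")
#   qid = tag_past_workers(client, past_worker_ids)
#   client.create_hit(..., QualificationRequirements=[exclusion_requirement(qid)])
```

boto3 is imported only in the usage comment, so the two helpers can be read (and tested with a stand-in client) without AWS access.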

Posted by: Gabriele Paolacci | October 14, 2014

MTurk Workshop at ACR

The use of MTurk by behavioral researchers continues to increase. Despite the evidence on the benefits (and drawbacks) of MTurk, many researchers, reviewers, and editors intuitively distrust the reliability and validity of data from online labor markets.

On October 25, we will host a workshop at ACR called “Questioning the Turk: Conducting High Quality Research with Amazon Mechanical Turk”. We will answer and debate questions from the ACR community regarding MTurk, and raise some new questions. We will discuss the current issues that arise from MTurk’s use, as well as some of the solutions and replications. Questions can be submitted using the hashtag #mturkacr via Twitter (@aconsres, @joekgoodman, @gpaolacci) or Facebook (ACR page), as a comment to this post, or by emailing the organizers, Joseph Goodman and Gabriele Paolacci.

The North American ACR conference will take place on October 24-26 at the Hilton Baltimore in Baltimore, MD. The MTurk workshop will take place on Saturday, October 25, at 2 pm in room Key 5.

Update: Thanks to all participants for contributing to a fruitful discussion! The slides we used in the workshop can be found here. Joe & Gabriele.

Posted by: Gabriele Paolacci | July 10, 2014

Review of MTurk as a Participant Pool

We recently published a review of MTurk as a source of survey and experimental data in Current Directions in Psychological Science. We discuss the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that of other pools and depends on controllable and uncontrollable factors. The Psych Report published a nice summary of the paper, which you can find here.


Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184-188.

Posted by: Gabriele Paolacci | April 10, 2014

2nd Workshop on Crowdsourcing and Online Behavioral Experiments

The Annual Workshop on Crowdsourcing and Online Behavioral Experiments (COBE) seeks to bring together researchers and academics to present their latest online behavioral experiments (e.g., on MTurk) and share new results, methods, and best practices. The details of the workshop are below.

– Workshop Date: June 8 or 9, 2014 (TBD)

– Location: Stanford University, Palo Alto, California, before the 15th ACM Conference on Electronic Commerce.

– Call for papers:
