Guest Post by Antonio Alonso Arechar, Gordon Kraft-Todd, David Rand, and Jesse Chandler

People who participate in a research study differ depending on when the study occurs. For example, people who work in traditional white collar jobs may be unavailable to complete a study during regular business hours (when they would ordinarily be at work). Likewise, in college samples, students who sign up to complete studies at the beginning of the semester can differ from students who sign up to complete studies at the end of the semester. These differences likely vary across populations and recruitment methods, and in many cases evidence for their existence resides only in the beliefs and tacit knowledge of study recruiters and researchers.

Online experiments expand the times at which studies can be completed and the population available to complete them. At the same time, online experiments often distance researchers from the recruitment process, making it harder to have a sense of who is available to complete a study at different times. To address this knowledge gap, we present results from two recent working papers that explore differences among Turkers participating in HITs at different times: Arechar et al. 2016 (hereafter AKR) examine the demographics, big-5 personality traits, and incentivized economic behaviors of 2,336 Turkers as a function of their local time when completing the study; and Casey et al. 2016 (hereafter CCLBS) examine demographics and big-5 personality traits in 9,770 Turkers as a function of the time at which HITs were posted. Our studies show convergent evidence of some differences in demographics and personality based on when HITs are posted, and when within a HIT’s posting period a subject participates. The incentivized behavioral measures of AKR, however, generally do not show that these differences translate into meaningful differences in actual behavior.

Available workers are different on different days and at different times.

Both of our papers found that participants with more prior experience on MTurk (self-reported in AKR, cross-referenced for participation in a prior study by CCLBS) were more likely to complete HITs earlier in the day. AKR also found that participants with less prior experience on MTurk were more likely to complete HITs on weekends.

We both found that participants’ age varies with time and day, although the observed patterns were less consistent: CCLBS observed that workers were younger on Thursdays and older on Saturdays, and AKR observed that participants were younger in the evening and older in the morning.

Participants also differed across time in terms of personality. In both studies, participants who scored lower on the big-5 personality dimension of conscientiousness were more likely to complete HITs later in the day. AKR also found that participants who scored higher on the big-5 personality dimension of neuroticism were more likely to complete HITs later in the day.

AKR also examined a personality dimension not considered by CCLBS: intuitive versus reflective cognitive style, measured with the “Cognitive Reflection Test,” a set of math problems with intuitively compelling but incorrect answers (Frederick, 2005). Cognitive style has been linked to a wide range of behaviors and beliefs, with more deliberative people (who score better on this test) being, for example, less impatient, less religious, less inclined to hold traditional moral values, and less susceptible to pseudo-profound bullshit (Pennycook et al. 2015). AKR found that participants on the weekends performed more poorly on this task, indicating that they were less deliberative (i.e., more intuitive).

CCLBS also examined additional demographic characteristics and found that workers recruited later in the day were more likely to complete the survey using cellphones and to be of Asian American ancestry. Moreover, participants were more likely to be Asian on Wednesdays and more likely to have a full-time job on Sundays, and single participants were more likely to complete the survey later in the day.

Finally, AKR examined a range of incentivized economic behaviors: various measures of prosociality (Prisoner’s Dilemma, Dictator Game, a charitable giving decision, and an honesty task where participants could lie to earn more money), third party punishment, and intertemporal choice. They found no significant day or time differences in behavior, except that participants late at night donated more money to charity (and took longer to complete the study). They also found that participants on the weekends failed comprehension questions more often.

Early responders are different from late responders.

Both studies found that workers who participated earlier in the data collection process were substantially more experienced, and tended to score higher on the big-5 personality dimension of agreeableness.

CCLBS also found that, earlier in the data collection process, participants tended to be older, more likely to have a full-time job, more likely to be Asian American, less neurotic, more conscientious, and more likely to be male. And AKR found that early participants were more likely to answer comprehension questions correctly, more deliberative (i.e., they performed better on the Cognitive Reflection Test), gave less to charity, and took less time to complete the study.

Take-home

These findings have several implications for researchers using MTurk:

First, differences across day and time are most crucial for researchers trying to make point estimates about the worker population (e.g., to understand the average number of completed experiments, opinions about piecework employment, or online labor market dynamics). Studies on these topics need a sampling strategy that appropriately weights workers who are online at different times and who have different levels of experience.
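As a rough illustration (not part of AKR or CCLBS), a post-stratification weighting step might look like the following Python sketch; the stratum definitions, file name, and target shares are hypothetical placeholders and would have to come from external knowledge of the worker population, not from the sample itself.

    import pandas as pd

    # Hypothetical respondent-level data with a stratum label,
    # e.g., time of day of participation crossed with prior MTurk experience.
    respondents = pd.read_csv("respondents.csv")  # placeholder file name

    # Target shares per stratum in the worker population (placeholder values;
    # these must be estimated from sources other than the sample).
    target_shares = {
        "weekday_experienced": 0.40,
        "weekday_new": 0.20,
        "weekend_experienced": 0.25,
        "weekend_new": 0.15,
    }

    # Shares actually observed in the sample.
    sample_shares = respondents["stratum"].value_counts(normalize=True)

    # Post-stratification weight: up-weight strata that are under-represented.
    respondents["weight"] = respondents["stratum"].map(
        lambda s: target_shares[s] / sample_shares[s]
    )

    # Weighted point estimate of the quantity of interest.
    weighted_mean = (
        (respondents["outcome"] * respondents["weight"]).sum()
        / respondents["weight"].sum()
    )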

Second, the impact of study launch time on participant characteristics varies across paradigms. Paradigms in which the reported differences are likely to matter should be posted at carefully selected times. For example, studies that rely on workers using a computer (e.g., reaction time studies) are more efficient to run earlier in the day, and studies of charitable donations may benefit from the larger variance observed in evening populations. Also, complex paradigms for which the risk of non-comprehension is high may benefit from the more experienced, conscientious, and/or reflective participants who are more likely to participate earlier in the day and on weekdays. However, there are paradigms for which the timing of the study launch does not seem to have much impact, such as behavior in the simple economic games and decisions examined by AKR. Given the observed variation in participant characteristics, we recommend reporting when data are collected from MTurk (time of day and days of the week) in the study methods section as a best practice.

Finally, the timing of participation during a study (serial position) matters. Early responders are more experienced at completing surveys and may also be more diligent (as reflected by differences in the CRT and comprehension checks in AKR). Crucially, these differences affect not only early and late responders in large studies, but also participants across sequences of studies that exclude workers who have completed previous studies in the sequence. These changes may influence the potential to replicate earlier studies in the sequence: as measurement error increases (because participants don’t understand the question, don’t think about it carefully, or are less interested in helping the researcher), so too does the necessary sample size. Replication will become more difficult if researchers do not account for this change.

Posted by: Gabriele Paolacci | May 23, 2016

Using MTurk to Study Political Ideology

Guest post by Scott Clifford, Ryan Jewell, and Philip Waggoner

MTurk is increasingly used to study questions about politics and political psychology. MTurk samples are well known to deviate from the national population on a number of dimensions, particularly political ideology. The underrepresentation of conservatives has led some scholars, most notably Dan Kahan, to worry that the conservatives who opt into MTurk are not “real” conservatives. As Dan puts it, the underrepresentation of conservatives means we “can infer there is something different about the conservatives who do sign up from the ones who don’t” (see Dan’s discussion here and here). For example, they may differ from other conservatives in psychological dispositions central to their identities. If this claim were true, it might render MTurk samples invalid for studying political and ideological divides. This would be particularly worrisome for research using ideology or partisanship as a moderator of experimental treatment effects or examining psychological differences between liberals and conservatives.

In a recent article published in Research & Politics, we evaluated this concern using a large sample recruited from MTurk (N = 1,500). We compared this sample to two nationally representative benchmark surveys from the American National Election Studies 2012 Time Series Study, which was conducted before and after the 2012 US presidential election. The ANES study recruited 1,413 respondents for face-to-face interviews and 3,860 respondents for a web-based survey (through GfK).

In our MTurk survey we asked a series of questions that allowed us to make a direct comparison to the ANES surveys. Following research in political psychology, we focused on two sets of variables that we expected would be associated with political ideology: the Big Five personality traits and values (egalitarianism, moral traditionalism, racial resentment, authoritarianism). Our first analysis consisted of looking at the levels of each trait and value across political ideology. Contrary to the concerns discussed above, our MTurk conservatives looked nearly identical to ANES conservatives across all of these measures. Surprisingly, it was liberals who looked different – our MTurk liberals consistently held more liberal patterns of values and issue attitudes than our ANES liberals.

As our primary test, we estimated models predicting political ideology as a function of either personality traits or values, while controlling for standard demographics. The figure below plots the coefficients and 95% confidence intervals for each sample. As is clear from the figure, the results are highly similar across samples. In fact, across a broader set of tests, over 90% of the coefficients are statistically indistinguishable in size across samples. Thus, a researcher investigating the psychological predictors of political ideology would have reached largely the same conclusions whether using MTurk or the ANES.

[Figure: coefficients and 95% confidence intervals predicting political ideology, MTurk versus ANES samples]
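For readers who want to see the shape of this analysis, here is a minimal Python sketch of the coefficient comparison; the file names, column names, and linear (rather than ordered) specification are illustrative assumptions, not the article’s exact models.

    import pandas as pd
    import statsmodels.formula.api as smf

    def fit_ideology_model(df):
        # OLS of self-reported ideology on the Big Five, controlling for demographics.
        formula = ("ideology ~ openness + conscientiousness + extraversion"
                   " + agreeableness + neuroticism + age + female + education")
        return smf.ols(formula, data=df).fit()

    mturk_fit = fit_ideology_model(pd.read_csv("mturk_sample.csv"))  # placeholder files
    anes_fit = fit_ideology_model(pd.read_csv("anes_sample.csv"))

    # Plotting these side by side reproduces the logic of the figure above:
    # coefficients with 95% confidence intervals for each sample.
    print(mturk_fit.params, mturk_fit.conf_int())
    print(anes_fit.params, anes_fit.conf_int())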

Overall, we found that liberals and conservatives closely mirror the psychological divisions of liberals and conservatives in the mass public, providing little support for the concern that self-selection creates a pool of conservatives who are psychologically distinct from their counterparts in the larger population. We did find, however, that MTurk liberals hold more characteristically liberal values and attitudes than liberals from nationally representative samples. As a result, we encourage researchers to use more robust measures of political ideology, such as an index of political attitudes, in order to more fully capture variation in political ideology. Nonetheless, we find little reason to believe that liberals and conservatives recruited from MTurk are psychologically distinct from their counterparts in the mass public.

Guest Post by Mark Keuschnigg

The local confinement of samples and results has raised questions about the external validity of social science experiments. The last 15 years have thus seen a sharp increase in experiments conducted at multiple locations, including developing countries and small-scale societies. However, cross-regional comparisons of economic behavior have run into obstacles due to the limited transferability of standardized decision situations into parallel laboratory set-ups. In a recent article in Social Science Research we use cross-regional participation on MTurk to circumvent common pitfalls of traditional multi-location experimentation.

We argue that MTurk experiments provide a sorely needed complement to laboratory research, transporting a homogeneous decision situation into various living conditions and social contexts. In fact, we believe that quasi-experimental variation of the characteristics people bring to the experimental situation is the key potential of crowdsourced online designs. Our research shows that such analyses of “virtual pools” can be adapted to study local patterns of behavior.

We use the Ultimatum Game (UG) and the Dictator Game (DG) for data generation (N = 991). We chose bargaining games specifically because norms of fairness are strongly conditional on local context; UG and DG thus reveal expectations about the social norms that hold in a particular population.

To assess the importance of context, our design includes an experimental variation of monetary stakes ($0, $1, $4, and $10) as a benchmark. Our marginal totals correspond closely to laboratory findings: monetary incentives induce more selfish behavior but, in line with most laboratory evidence, the particular size of a positive stake appears irrelevant.

Analyses of “virtual pools” first mirror standard sub-group analyses contrasting participants from different regions. We illustrate this by comparing the behavior of workers from India and the US: controlling for differences in the socio-demographic composition of the national pools, we find no cross-country difference in the parametric situation (DG). Culture, however, seems to be relevant in strategic interaction (UG): participants in India were more selfish (as proposers) and less demanding (as responders) than US Americans. Within the US, Southerners were both more selfish (as proposers) and more demanding (as responders) than Northerners.

More importantly, however, participants’ geographical locations provide an interface for the direct inclusion of macro variables that potentially influence individual behavior. We limit our analysis to regional variation in economic affluence and social capital across US states. According to our estimates, dictators’ allocations from wealthier and more socially integrated states are 13 percent larger on average than those from less-advantaged states. Interestingly, the total size of this contextual influence clearly exceeds the stake effects, and, most important from a sociological perspective, context effects are both more pronounced and more theoretically consistent than the effects of individual attributes.

  • For cross-country comparability we used tokens and weighted payoffs for Indian participants using a purchasing power parity conversion factor (a minimal sketch of this conversion appears after this list).
  • To balance national pools we posted four HITs daily (early morning and late afternoon at local time in each country) and recruited for each daily session as many US Americans as we had recruited Indians earlier that day.
  • To avoid waiting times and drop-out, subjects were matched only at the payoff stage, against decisions drawn (without replacement) from a pool of preceding participants.
  • Submissions were accepted only once per worker ID. We also disabled participation from IP addresses similar to those already in our database to impede multiple participation from the same household.
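A minimal sketch of the payoff conversion mentioned in the first bullet, in Python with entirely hypothetical numbers (the token value and the PPP conversion factor are placeholders, not the study’s actual parameters):

    # Hypothetical parameters for illustration only.
    TOKEN_VALUE_USD = 0.10          # value of one token for US participants (placeholder)
    PPP_INR_PER_INTL_DOLLAR = 17.0  # PPP conversion factor, rupees per intl. dollar (placeholder)

    def payoff(tokens, country):
        """Convert a token payoff into local currency with comparable purchasing power."""
        if country == "US":
            return tokens * TOKEN_VALUE_USD                            # paid in USD
        if country == "IN":
            return tokens * TOKEN_VALUE_USD * PPP_INR_PER_INTL_DOLLAR  # paid in INR
        raise ValueError("unexpected country code")

    print(payoff(40, "US"), payoff(40, "IN"))  # 4.0 USD vs. 68.0 INR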

So far, the use of “virtual pools” has received scant attention in experimental research. We know a great deal about how institutional arrangements affect fairness, trust, cooperation, and reciprocity in economic games, yet we know little about how local socio-economic conditions and strategies learned in daily interaction influence outcomes of social experiments. Bringing context back into social experiments is particularly relevant for sociological research which—unlike most experimental research in economics and psychology—fully acknowledges the importance of context effects in a multi-level explanation of individual action.

Reference

Keuschnigg, M., Bader, F., & Bracher, J. (2016). Using crowdsourced online experiments to study context-dependency of behavior. Social Science Research, doi:10.1016/j.ssresearch.2016.04.014.

Posted by: Gabriele Paolacci | July 30, 2015

How many people can your lab reach on MTurk?

Guest post by Neil Stewart

How many people can your lab reach on MTurk? We used the capture-recapture method¹ from wildlife ecology to estimate how many workers you are sampling from. Our estimate is 7,300 workers.

Using 114,460 HITs completed from 2012 onwards we estimated, for each of our labs, how many workers we are sampling from. We then used a random-effect meta-analysis to estimate the number of workers a new lab, which could be yours, could reach. Why does this matter? Well, there is an exponential-like increase in the number of publications using this MTurk population—and we, like others, have found considerable overlap between our laboratories. And if you are planning a series of experiments or running adequately powered replications, you could run out of workers quite fast.

What can you do to increase your reach? Surprisingly, paying more doesn’t help. Our population estimate was reduced for higher paying HITs—we think because the most active workers seek out these HITs and crowd out the less active workers.  (Still, no reason not to pay a living wage to our participants!) Running larger batch sizes does help. Our larger batches sampled from a population nearly three times larger than the smaller batches. One last strategy is to wait. We estimate that it takes about 7 months for half of the workers on MTurk to leave and be replaced.

View the in-press paper in Judgment and Decision Making here.

Neil Stewart, Christoph Ungemach, Adam Harris, Dan Bartels, Ben Newell, Gabriele Paolacci, and Jesse Chandler

¹ The intuition behind the capture-recapture method is not too hard. Ecologists might, for example, use it to estimate the number of fish in a pond. Go fishing on Day 1. Catch some fish, tag them, and release them. Then, on Day 2, go fishing again. Catch some fish and observe the proportion that are tagged. Now you have an estimate of the proportion tagged in the pond from Day 2, and the number tagged in the pond from Day 1, so you can estimate the total number in the pond. If you tag five fish on Day 1 and observe that one quarter of Day 2’s catch is tagged, then there must be about 20 fish. We used WorkerIDs as tags.
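For concreteness, the simplest version of this estimator (the Lincoln-Petersen estimator) can be written in a few lines of Python; this reproduces only the footnote’s toy example, whereas the paper itself fits a richer model to WorkerIDs across many HITs and labs.

    def lincoln_petersen(tagged_day1, caught_day2, tagged_in_day2_catch):
        """Estimate total population size from one capture-recapture pass."""
        return tagged_day1 * caught_day2 / tagged_in_day2_catch

    # Footnote example: 5 fish tagged on Day 1; one quarter of Day 2's catch is tagged
    # (say 2 tagged out of 8 caught), so the estimated pond holds 5 * 8 / 2 = 20 fish.
    print(lincoln_petersen(5, 8, 2))  # -> 20.0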

[Figure: capture-recapture estimates]

Posted by: Gabriele Paolacci | June 16, 2015

Using Nonnaive Participants Can Reduce Effect Sizes

Guest post by Jesse Chandler

In a new Psychological Science article we provide direct evidence that effect sizes for experimental results are reduced among participants who have previously completed an experimental paradigm. Specifically, we recruited MTurk workers who participated in the Many Labs 1 series of two-condition experiments and invited them to participate in a research study that included the exact same package of experiments. We found that effect sizes decreased the second time around, especially among those who were exposed to opposite conditions at the two different time points.

Previous studies have demonstrated that MTurk worker performance changes as workers become more experienced. For example, we have demonstrated that worker scores on the Cognitive Reflection Test (a commonly used measure of intellectual ability) are correlated with worker experience. Likewise, Dave Rand and Winter Mason have led projects providing evidence that workers get better at economic games over time. All of these findings are new twists on older observations that the attitudes of survey panel members tend to change over time (a phenomenon known as panel conditioning in the survey literature) and that people tend to improve on measures of aptitude (a phenomenon known as a practice effect in the psychometric testing literature).

Our findings illustrate that participant experience can also affect experimental results, even when the dependent measures are not straightforward measures of ability. These findings are surprising (at least to us) because we had tended to assume that workers are relatively unengaged while completing HITs and complete so many tasks that any individual experiment could hardly be memorable. But apparently they are.

Fortunately, there is good news. First, we see some evidence that this effect wears off over time, suggesting that people eventually forget whatever information they may have seen. Second, if all you care about is the direction of an effect (rather than an exact point estimate), smaller effect sizes can be offset by increased sample sizes. Third, there is an increasing number of tools available to prevent duplicate workers from participating in an experiment.

  • The MTurk API or GUI allows you to create qualifications that filter out unwanted workers (see the sketch after this list).
  • Qualtrics can be set up to check incoming workers against a predefined list of WorkerIDs and exclude those that match. This is useful if you know of workers you want to exclude but have not worked with them before.
  • TurkGate will do something similar. It needs to run on a server, but it is likely easier to maintain than Qualtrics, particularly for a lab group or research team that wants to coordinate their efforts.
  • TurkPrime and UniqueTurker are newer solutions that seem easier to use for individual researchers. We have not tested them, but readers may wish to experiment with them.

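As an illustration of the first option, here is a minimal sketch using boto3’s MTurk client in Python; the qualification name, worker IDs, and file handling are placeholders, and you would typically test against the requester sandbox before using the live endpoint.

    import boto3

    # Assumes AWS credentials are already configured for your requester account.
    mturk = boto3.client("mturk", region_name="us-east-1")

    # Create a qualification marking workers who already completed the earlier study.
    qual = mturk.create_qualification_type(
        Name="Completed-Study-XYZ",  # placeholder name
        Description="Worker already participated in study XYZ",
        QualificationTypeStatus="Active",
    )
    qual_id = qual["QualificationType"]["QualificationTypeId"]

    # Assign it to each previous participant (WorkerIDs from your own records).
    previous_workers = ["A1EXAMPLEID", "A2EXAMPLEID"]  # placeholder IDs
    for worker_id in previous_workers:
        mturk.associate_qualification_with_worker(
            QualificationTypeId=qual_id,
            WorkerId=worker_id,
            IntegerValue=1,
            SendNotification=False,
        )

    # When posting the new HIT, add a QualificationRequirement with
    # Comparator="DoesNotExist" for this qualification to screen these workers out.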
In short, there are lots of ways to limit worker participation. Of course, this does raise serious concerns for experimental paradigms that are used to the point of abuse (trolley problem, I’m looking at you) and highlights that the finite size of the MTurk pool places a finite limit on the number of times a particular experimental paradigm can be run.

Posted by: Gabriele Paolacci | May 29, 2015

MTurk workshop at EMAC

On May 27 I held a workshop at EMAC on conducting behavioral research using Amazon Mechanical Turk samples. Slides are available here.

Guest post by David J. Hauser

In this new article, Norbert Schwarz and I show in two experiments that answering an instructional manipulation check (IMC) changes the way participants approach later survey questions.

IMCs are often included in online research (and especially on MTurk) in order to assess whether participants are paying attention to instructions. However, participants can potentially see them as “trick” questions that violate conversational norms of trust. As a result, these questions may make participants more cautious when answering later questions in an effort to avoid being tricked again.

Two studies provided support for this hypothesis. In one study, participants received an IMC and the Cognitive Reflection Test (Frederick, 2005), a math test assessing the tendency to reflect on and correct intuitive answers. Crucially, half of the participants completed the IMC before the CRT, whereas the other half completed the math test first. Completing the IMC first increased CRT scores (versus when the CRT came first), suggesting that it increased systematic thinking.

In a second study, participants received an IMC and a probabilistic reasoning task assessing rational decision making (Toplak, West, & Stanovich, 2011). As before, half of the participants completed the IMC before the reasoning task, whereas the other half completed the reasoning task first. Completing the IMC first increased accuracy on the reasoning task (compared to completing the reasoning task first). Thus, answering an IMC teaches participants that there may be more to later questions than meets the eye, a conclusion that significantly alters participants’ reasoning strategies.

IMCs are typically conceptualized as measures, not interventions. However, as demonstrated here, they can act as interventions in their own right. Researchers should therefore exercise caution when using IMCs.

References:

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19, 25-42.

Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a predictor of performance on heuristics-and-biases tasks. Memory & Cognition, 39, 1275-1289.

This document guides you through a simple method to avoid recruiting MTurk workers who have already participated in a certain study of yours. The core of the procedure relies on Excel (as opposed to the MTurk Command Line Tools or the MTurk API) to assign a Qualification to multiple workers at the same time. Using this procedure will allow you to exclude from recruitment workers who participated in a previous related study (e.g., a study you are now replicating), and it can serve other goals too (e.g., conducting longitudinal research, building your own panel).

Update June 10, 2015: Arnoud Plantinga developed an R script based on this method in which you don’t have to create the new variables yourself. You can download the script here.
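The same bookkeeping can also be scripted. Below is a minimal Python/pandas sketch of the core step (flagging, in the worker file downloaded from the requester interface, everyone who appears in a previous study’s results); the file names and column headers are placeholders that need to be matched to the files MTurk actually produces.

    import pandas as pd

    # WorkerIDs of everyone who completed the earlier study (from its results file).
    previous = pd.read_csv("previous_study_results.csv")  # placeholder file name
    previous_ids = set(previous["WorkerId"])              # column name may differ

    # Worker file downloaded from the Manage Workers page of the requester interface.
    workers = pd.read_csv("workers.csv")                  # placeholder file name

    # Add the qualification column: 1 = already participated, 0 otherwise.
    qual_column = "UPDATE-AlreadyParticipated"            # placeholder column header
    workers[qual_column] = workers["Worker ID"].isin(previous_ids).astype(int)

    # Re-upload this file to assign the qualification to all flagged workers at once,
    # then require the qualification to be absent when recruiting for the new study.
    workers.to_csv("workers_updated.csv", index=False)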

Posted by: Gabriele Paolacci | October 14, 2014

MTurk Workshop at ACR

The use of MTurk by behavioral researchers continues to increase. Despite the evidence on the benefits (and drawbacks) of MTurk, many researchers, reviewers, and editors intuitively distrust the reliability and validity of online labor markets.

On October 25, we will host a workshop at ACR called “Questioning the Turk: Conducting High Quality Research with Amazon Mechanical Turk”. We will answer and debate questions from the ACR community regarding MTurk, and raise some new questions. We will discuss the current issues that arise from MTurk’s use, as well as some of the solutions and replications. Questions can be submitted using the hashtag #mturkacr via Twitter (@aconsres, @joekgoodman, @gpaolacci) or Facebook (ACR page), as a comment to this post, or via email to the organizers, Joseph Goodman and Gabriele Paolacci.

The North American ACR conference will take place on October 24-26 at the Hilton Baltimore in Baltimore, MD. The MTurk workshop will take place on Saturday, October 25, at 2pm in room Key 5.

Update: Thanks to all participants for contributing to a fruitful discussion! The slides we used in the workshop can be found here. Joe & Gabriele.

Posted by: Gabriele Paolacci | July 10, 2014

Review of MTurk as a Participant Pool

We recently published in Current Directions in Psychological Science a review of MTurk as a source of survey and experimental data. We discuss the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors. The Psych Report published a nice summary of the paper, which you can find here.

Reference:

Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184-188.

