Posted by: Gabriele Paolacci | January 20, 2012

AMT symposium at SPSP 2012

We organized a Mechanical Turk symposium at the Annual Meeting of the Society for Personality and Social Psychology (January 26-28, 2012, San Diego, CA). The symposium will provide information for researchers at any level of MTurk experience, beginning with an introduction to AMT and moving through research and tutorials on improving data quality, collecting data across time, incentivizing workers, and conducting true group dynamics experiments. It will take place on Saturday, January 28, 9:45 – 11:00 am, in Room 26. Below you can find more details about the talks.

We plan to share some symposium-related information and material on the blog, so stay tuned!

Research Using Mechanical Turk: Getting The Most Out of Crowdsourcing
Chair: Jesse Chandler
Co-Chair: Pam Mueller

Mechanical Turk: An Introduction and Initial Evaluation
Michael Buhrmester, Tracy Kwang, Sam Gosling
Mechanical Turk is a unique online marketplace that contains the major elements required to conduct research: a simple participant payment system; access to a large participant pool; and a streamlined interface for study design, participant recruitment, and data collection. After introducing these fundamental mechanics, we will evaluate findings that bear on MTurk’s potential validity and suitability for research purposes. Our findings indicate that (a) MTurk participants are more demographically diverse than typical college samples; (b) under the right parameters, participants can be recruited rapidly and inexpensively without affecting data quality; and (c) MTurk data can be just as reliable as data obtained via traditional methods. Finally, to help ease concerns of novice users, we will provide a quick beginner’s walkthrough of the major elements required to get a study off the ground.

Are Your Participants Gaming The System? Improving Data Quality on Mechanical Turk
Julie S. Downs, Mandy B. Holbrook
Amazon’s Mechanical Turk provides an efficient means to recruit large samples quickly, but this ease may be at the cost of lower control over data quality. Previous research has shown that simple measures (e.g. time stamps) are insufficient to differentiate between conscientious workers and people looking for free money. Popular strategies for quality control, including instructional manipulation checks, only identify the most egregious attentional lapses. Additionally, they violate Gricean conversational norms, breach the scientific trust relationship, and bias the study sample. Unlike most MTurk tasks, psychological surveys cannot be assessed directly for worker performance without violating scientific impartiality and ethical edicts against punishing participants for their responses. In this talk, we will present an empirical assessment of other strategies for restricting data collection and data retention to those truly participating in the study, and will discuss the implications for generalizability of MTurk data in general and when using screening procedures.

Advanced Uses of Mechanical Turk Crowdsourcing in Psychological Research
Pam A. Mueller, Jesse J. Chandler, Gabriele Paolacci
Mechanical Turk has many tools and capabilities that can be quite useful for behavioral researchers, but which are not immediately evident to users. We discuss the features of MTurk that give it important advantages over other online collection methods, and address the problem of duplicate participants across programmatic research. We also present an introduction to advanced uses of MTurk for researchers with minimal programming knowledge. These tools improve data quality by, for instance, allowing workers to be incentivized, preventing workers from completing related studies, and facilitating direct communication between a requester and workers. We also discuss how these tools enable more sophisticated data collection (e.g. prescreening, longitudinal studies). We demonstrate the effectiveness of these techniques through their implementation in our own work, and provide a potential solution to the issues that may arise as MTurk workers become non-naïve participants through their involvement in numerous behavioral studies.

Conducting Synchronous Experiments on Mechanical Turk
Siddharth Suri, Winter Mason
Crowdsourcing platforms, including Amazon’s Mechanical Turk, are a new and fruitful means of conducting online research for relatively low cost. However, many psychological studies require groups of participants to interact synchronously, and the mechanisms for accomplishing this on MTurk are not built-in and are far from obvious. We will describe a technique we have developed for accomplishing this, which has four key components: recruitment of participants into a panel, notification of a start time, a “waiting” room that accumulates participants up to a threshold, and methods for handling attrition. We will discuss some common pitfalls associated with running synchronous experiments online and with crowdsourcing platforms, and demonstrate the efficacy of our technique with research we have conducted.
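
As a rough illustration of the “waiting” room and attrition components described in this abstract, here is a minimal sketch. It is not the authors’ implementation: the class name `WaitingRoom`, its methods, and the worker IDs are all hypothetical, and a real deployment would also need the panel recruitment and start-time notification steps, plus a web server in front of it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WaitingRoom:
    """Hypothetical waiting room: holds arrivals until a full group is present."""

    threshold: int                                   # participants needed to start
    waiting: List[str] = field(default_factory=list)

    def arrive(self, worker_id: str) -> Optional[List[str]]:
        """Register an arrival; return a full group once the threshold is met."""
        if worker_id not in self.waiting:            # ignore repeat arrivals
            self.waiting.append(worker_id)
        if len(self.waiting) >= self.threshold:
            group = self.waiting[: self.threshold]
            self.waiting = self.waiting[self.threshold:]
            return group
        return None

    def leave(self, worker_id: str) -> None:
        """Handle attrition: drop a worker who quits before the session starts."""
        if worker_id in self.waiting:
            self.waiting.remove(worker_id)


room = WaitingRoom(threshold=3)
room.arrive("worker_a")
room.arrive("worker_b")
room.leave("worker_b")            # worker_b drops out while waiting
room.arrive("worker_c")
print(room.arrive("worker_d"))    # the third remaining worker fills the group
```

The sketch only covers attrition before a session starts; dropouts during a running session would need separate handling.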

Posted by: Gabriele Paolacci | October 25, 2011

Science covers AMT

In the October 21 issue of Science, a News Focus article discussed using AMT as a source of experimental participants. The article included some exciting examples of research that was conducted on AMT that would have been difficult to conduct elsewhere, like a study that recently examined whether competence cues within politicians can be detected across cultures (download it here).

It also suggests something that many AMT experimenters may already know or suspect, but which has not yet been studied systematically: Different payments may attract different workers in terms of intrinsic and extrinsic motivations. Previous investigations found no evidence that low payments negatively affect data quality. This article suggests that high payments might attract the wrong workers, with bad consequences, especially on tasks sensitive to intrinsic motivation. Not only may worker quality depend on the interaction between task and payment level, but the relationship also may not be linear. These are additional considerations to account for when choosing a payment level, and they should perhaps be systematically examined.

The full text of the Science article can be accessed here.

Posted by: Gabriele Paolacci | September 14, 2011

Computational Social Science and the Wisdom of Crowds

Computational social science is an emerging academic research area at the intersection of computer science, statistics, and the social sciences, in which quantitative methods and computational tools are used to identify and answer social science questions. The second Workshop on “Computational Social Science and the Wisdom of Crowds” will be held at NIPS 2011, on December 17th in Sierra Nevada, Spain. Computer scientists, psychologists, economists, and other social scientists who do research on AMT (using it as a research setting or as a tool) should definitely consider submitting a paper. See last year’s contributions to get an idea of what you may find in Sierra Nevada (besides this).

Papers must be submitted by October 7th, 2011. The call for papers and other details about the workshop are available here.

Posted by: ssuri0 | July 13, 2011

Honesty on AMT

(Guest post by Sid Suri)

For an online labor market such as AMT to function, there must be some degree of honesty between the workers and employers. Workers need to have some confidence that employers will pay them for their work. Employers need to have some confidence that workers will submit honest work.

In a series of three behavioral experiments, we measured the degree to which workers on AMT are honest and explored different factors that could affect their honesty. The first experiment in the series asked workers to roll a 6-sided die and report the outcome. If the worker didn’t have a die, we provided a link to random.org, which simulates fair dice rolls. We paid the worker $0.25 plus $0.25 times the reported outcome, providing an obvious incentive to be dishonest. If everyone reported the roll honestly, the mean reported roll would be 3.5. The mean reported roll by the participants was 3.91, caused by an underreporting of 1s and 2s and an overreporting of 5s and 6s (see Figure 1 below). This is a clear indication of dishonesty, although perhaps less blatant than one might expect. One possible reason why the participants were as dishonest as we observed could be related to the relative amount they stood to gain by misreporting their roll.

Our second experiment aimed to test this hypothesis by changing the relative amount they could earn by misreporting their roll. We kept the average payoff about the same as in the first experiment but reduced the variance; we paid $1.00 plus $0.05 times the roll of a die. The results were not statistically different from the previous experiment, suggesting the variance in the pay was not the leading factor in our participants’ dishonesty.
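
To make the comparison between the two payment schemes concrete, the following sketch computes the mean and variance of the payment under honest reporting of a fair die. The function name `payoff_stats` is made up for this illustration; the dollar amounts are the ones stated above.

```python
from fractions import Fraction
from math import sqrt

FACES = range(1, 7)  # outcomes of a fair 6-sided die


def payoff_stats(base, per_pip):
    """Mean and variance of the payment when a fair roll is reported honestly."""
    payoffs = [Fraction(base) + Fraction(per_pip) * face for face in FACES]
    mean = sum(payoffs) / len(payoffs)
    variance = sum((p - mean) ** 2 for p in payoffs) / len(payoffs)
    return float(mean), float(variance)


# Experiment 1: $0.25 plus $0.25 times the roll.
# Experiment 2: $1.00 plus $0.05 times the roll.
for label, base, per_pip in [("Experiment 1", "0.25", "0.25"),
                             ("Experiment 2", "1.00", "0.05")]:
    mean, variance = payoff_stats(base, per_pip)
    print(f"{label}: mean ${mean:.3f}, standard deviation ${sqrt(variance):.3f}")
```

Under honest reporting the expected payment is $1.125 in the first design and $1.175 in the second, while the standard deviation falls from roughly $0.43 to roughly $0.09. Put differently, reporting a 6 instead of a 1 gains $1.25 in the first design but only $0.25 in the second.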

Our final study explored how the detectability of deception would affect behavior. When a worker rolls only one die, it is impossible to know whether that specific worker was being deceptive. Therefore, we asked workers to roll 30 dice and input all of the outcomes, so that we could at least detect egregious deception from a single individual. We paid the workers $0.01 times the sum of the rolls, and found the mean reported roll was 3.57, which, although still reliably different from fair, represents much less deception than in the previous two studies. As Figures 1 and 2 (below) show, there is a dramatic difference in the distribution of reported outcomes. One explanation for this result is that people may have some notion of what a likely distribution of rolls would look like, which guided them to cheat less and to cheat less egregiously in the multiple roll experiment so that they could not be caught. Finally, despite recording responses from a demographically diverse group of participants, we did not see any significant relationship between any of the reported characteristics and the probability of being dishonest.

Figure 1: The distribution of rolls in the single roll, baseline experiment. Error bars are confidence intervals for two-sided binomial tests relative to chance (p = 0.167) with Bonferroni-corrected α = 0.05.

Figure 2: The distribution of rolls in the multiple roll experiment (when each participant rolled 30 dice). Error bars are confidence intervals for two-sided binomial tests relative to chance (p = 0.167) with α = 0.05.
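
For readers who want to build error bars like those described in the captions, here is one possible sketch. Several things in it are assumptions rather than details from the paper: it uses the Wilson score interval (the captions do not say which interval was used), it applies the Bonferroni correction by dividing α by the six faces, and the counts are invented purely for illustration and are not the study’s data.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, alpha):
    """Two-sided Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


CHANCE = 1 / 6      # probability of each face of a fair die (about 0.167)
ALPHA = 0.05
N_FACES = 6         # assumed number of comparisons for the Bonferroni correction

# Hypothetical counts of reported faces, NOT the data from the study.
counts = {1: 20, 2: 25, 3: 40, 4: 45, 5: 55, 6: 65}
n = sum(counts.values())

for face, k in counts.items():
    low, high = wilson_interval(k, n, ALPHA / N_FACES)
    verdict = "consistent with chance" if low <= CHANCE <= high else "differs from chance"
    print(f"face {face}: {k / n:.3f} [{low:.3f}, {high:.3f}] {verdict}")
```

Dropping the division by `N_FACES` gives the uncorrected α = 0.05 intervals, which are narrower.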

References

Suri, S., Goldstein, D., and Mason, W. (2011). Honesty in an Online Labor Market. Human Computation Workshop (HComp) 2011 (pdf).

Posted by: jessicahullman | April 28, 2011

Managing Reasoning Styles on AMT

Including measures to ensure that subjects are paying attention, such as instructional manipulation checks, or at the least verification questions to filter subjects who don’t understand the task, is a regular practice when running crowdsourcing experiments. Such measures act as stop-gaps for tendencies toward satisficing behavior in environments like AMT. While working for rewards encourages reasoned, accurate answers, a worker who can find and apply a heuristic that results in “accurate enough” responses can complete more HITs faster, optimizing rewards.

While such checks may be effective in many cases, or the need for them avoided altogether by methods that strategically combine answers into an accurate aggregate signal, there are cases where including them isn’t enough to bring about the desired quality of responses. These checks can also introduce problems of their own, such as when verification questions interrupt the flow of a worker’s thinking or anchor a worker’s subsequent responses.

Dual reasoning accounts of cognition suggest that people can process incoming information intuitively, automatically, and relatively effortlessly, or they can apply more deliberative, systematic and analytical thinking to make a decision. In a recent paper (Hullman, 2011), I discuss how evidence from psychological experiments that manipulate task stimuli in ways that encourage one reasoning style over another can be integrated into crowdsourced research to improve the quality and validity of experimental results. In many cases, a central challenge to experimentation on AMT is how to activate the sort of systematic thinking that will lead to better responses, and possibly more skilled workers in the long run, while still creating HITs that workers want to do!

One technique to induce active, systematic cognitive processing of presented task information (such as images or text) involves integrating more difficult-to-parse stimuli in key places in the task. Harder-to-read fonts, for example, have been shown to increase recall and comprehension of textual information (Alter & Oppenheimer, 2009), as has using more cognitively “costly” legends as opposed to labels on graphs (Shah et al., 2011). While subjects may perceive such stimuli as more difficult, the effect is less interruptive of their cognitive processing of the target task than verification questions or instructional manipulation checks.

Motivation, the self-directed component of psychological activity, can both strengthen and balance the effects of the above “desirable difficulties” techniques. Active, engaged processing of information can be induced by increasing a reasoner’s desire to engage with the content. What if researchers devoted more attention to creating an aesthetically-pleasing or personalized HIT? Doing so could strike a balance between increasing a worker’s motivation and sense of enjoyment of the task while simultaneously introducing cognitive difficulties that decrease the likelihood of more erroneous automatic reasoning.

References

Hullman, J. (2011). Not all HITs are Created Equal: Controlling for Reasoning and Learning Processes in MTurk. Position paper. ACM CHI 2011, Vancouver, BC.

Alter, A. L., & Oppenheimer, D. M. (2009). Uniting the Tribes of Fluency to Form a Metacognitive Nation. Personality and Social Psychology Review, 13(3), 219-235.

Shah, P., Miyake, A., & Freedman, E. (2011). Are Labels Really Better Than Legends? The Effects of Display Characteristics and Topic Familiarity on the Comprehension of Multivariate Line Graphs. Working Paper in preparation.

Posted by: Gabriele Paolacci | January 11, 2011

Synchronous Experiments

To what extent can researchers rely on AMT for experiments with simultaneous participation? Designs that require participants to interact with each other at the same time are very common (e.g., in experimental economics), yet it is difficult to implement them on the web: One needs to have a certain number of participants to be present at a certain time, to handle dropouts, etc. Although AMT has some features that might facilitate the implementation of synchronous designs, there are only a few available examples. Mason and Suri (2010) describe a possible procedure to run synchronous experiments on AMT, which was successfully used by Suri and Watts (2010). Another description of how a synchronous study was conducted on AMT can be found in Lydia Chilton’s Master’s Thesis. We welcome other examples of studies with simultaneous participation conducted on AMT (or elsewhere).

References

Chilton, Lydia B. (2009). Seaweed: A Web Application for Designing Economic Games. M.Eng. Thesis, Massachusetts Institute of Technology.

Mason, Winter and Suri, Siddharth, Conducting Behavioral Research on Amazon’s Mechanical Turk (October 12, 2010). Available at SSRN: http://ssrn.com/abstract=1691163

Suri, Siddharth and Watts, Duncan J. (2010). Cooperation and contagion in networked public goods experiments. Available at: http://arxiv.org/abs/1008.1276

Posted by: Gabriele Paolacci | September 29, 2010

Cross-Cultural Research

A remarkable feature of AMT is that tasks can be restricted to workers who live in specific countries, allowing for focused comparisons between subjects from two or more groups. This can eliminate many of the barriers to conducting cross-cultural comparisons of basic psychological processes (e.g., finding a subject pool in the country of interest). An example of such research can be found in Eriksson and Simpson (2010), who used AMT to recruit participants from the US and India and asked them to complete a questionnaire. As Mechanical Turk is populated by an increasingly internationalized workforce, we foresee large scope for cross-cultural comparisons in the future.

References

Eriksson, K., & Simpson, B. (2010). Emotional reactions to losing explain gender differences in entering a risky lottery. Judgment and Decision Making, 5, 159–163.

Posted by: Gabriele Paolacci | July 19, 2010

Attention

Do workers on AMT pay attention to the directions provided by experimenters?

We recruited participants from AMT, discussion boards around the Internet, and an introductory subject pool at a Midwestern American university. After completing some tasks, participants completed the Subjective Numeracy Scale (SNS; Fagerlin et al., 2007). The SNS is an eight-item self-report measure of perceived ability to perform various mathematical tasks and preference for the use of numerical versus prose information, and it provided an ideal context for an Instructional Manipulation Check (IMC), as discussed in Oppenheimer et al. (2009). Along with the SNS items, participants read a question that required them to give a precise and obvious answer (“While watching the television, have you ever had a fatal heart attack?”). This question employed a six-point scale anchored on “Never” and “Often”, very similar to those in the SNS, thus representing an ideal test of whether participants paid attention to the survey or not.

Participants in the three subject pools did not differ in terms of attention paid to the survey. Participants on Mechanical Turk had the lowest IMC failure rate, although the number of respondents who failed the IMC was very low and did not differ significantly across subject pools, χ²(2, N = 301) = 0.187, p = 0.91. See here for a more detailed analysis of the experiment.
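
The reported p-value can be checked by hand. With 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2), so no statistics library is needed. The function name below is made up for this sketch.

```python
from math import exp


def chi2_upper_tail_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return exp(-statistic / 2)


# Test statistic reported above for the comparison across the three subject pools.
p_value = chi2_upper_tail_df2(0.187)
print(round(p_value, 2))  # 0.91
```

This shortcut only holds for 2 degrees of freedom; other values need the general chi-square distribution function.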

References

Fagerlin, A., Zikmund-Fisher, B., Ubel, P., Jankovic, A., Derry, H., & Smith, D. (2007). Measuring numeracy without a math test: Development of the subjective numeracy scale. Medical Decision Making, 27, 672–680.

Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.

Posted by: Gabriele Paolacci | April 7, 2010

HCOMP 2010

Human computation is a relatively new research area that studies the process of channeling the vast internet population to perform tasks or provide data towards solving difficult problems that no known efficient computer algorithms can yet solve. The second Human Computation Workshop (HComp 2010) will be held on July 25th in Washington DC, collocated with KDD 2010. The goal of HComp 2010 is to bring together academic and industry researchers in a stimulating discussion of existing human computation applications and future directions of this new subject area. Amazon Mechanical Turk is a human computation application that coordinates workers to perform tasks in exchange for monetary rewards, and researchers who use this platform (e.g. to run experiments) should consider HComp 2010 for paper submission. Papers must be submitted by May 7, 2010 (originally May 3, 2010). The call for papers and other details about the workshop are available here.

Posted by: Gabriele Paolacci | February 11, 2010

Altruism

David Rand posted on Crowdflower about a great AMT study he recently conducted along with John Horton on altruism (as measured by cooperative behavior on a Prisoner’s Dilemma), that also used religious priming. The authors found that (rearranged from the original post):

1. A majority of Turkers cooperate in a Prisoner’s Dilemma. Thus even in the entirely anonymous and profit-motivated online labor market of AMT, many people still choose to help each other.

2. Reading a religious passage about the importance of charity makes religious Turkers more altruistic, but has no effect on Turkers who do not believe in god. This shows that Turkers respond in basically the same way as “normal” lab subjects, and is fairly intuitive. Those who believe in god are receptive to calls for generosity phrased in religious language, while non-believers aren’t.

The original post about the study can be accessed here.
