PITFALLS AND POTENTIALS OF CROWD SCIENCE: A META-ANALYSIS OF CONTEXTUAL INFLUENCES

ABSTRACT: Crowd science is becoming an integral part of research in many disciplines. The research discussed in this paper lies at the intersection of spatial and behavioral sciences, two of the greatest beneficiaries of crowd science. As a young methodological development, crowd science needs attention from the perspective of a rigorous evaluation of the data collected to explore potentials as well as limitations (pitfalls). Our research has addressed a variety of contextual effects on the validity of crowdsourced data such as cultural, linguistic, regional, as well as methodological differences that we will discuss here in light of semantics.


INTRODUCTION
Crowd science (used here interchangeably with crowdsourcing) is becoming an integral part of current research in many disciplines (Khatib et al., 2011, Clery, 2011). Two of the greatest beneficiaries of crowd science are the spatial and behavioral sciences. As a young methodological development, crowd science needs attention from the perspective of a rigorous evaluation of the data collected to explore potentials as well as limitations (pitfalls). Conceptually, crowdsourcing can be either active or passive. Active crowdsourcing involves a software platform and the active elicitation of input from the crowd. It can occur either 'in situ' via mobile devices (e.g., Citizens as Sensors (Goodchild, 2007)) or address any kind of (geographic) topic that can be communicated electronically via a computer, such as a mapping or data collection project (e.g., OpenStreetMap or Ushahidi) or any kind of behavioral experiment that can be deployed electronically. There are numerous applications for active crowdsourcing; we will discuss Amazon's Mechanical Turk as the most prominent yet recently controversial platform. Passive crowdsourcing targets information that has been made publicly available, but either not in response to a particular request or in response to a request different from the research question at hand. In other words, the information collected is unsolicited. Web sites, Twitter feeds, or Facebook entries are examples of such information. Advances, especially in natural language processing (Woodward et al., 2010, Socher et al., 2013) and georeferencing (Hu and Ge, 2007, Gelernter and Balaji, 2013), are enabling access to an immense reservoir of data, information, and knowledge that is potentially related to specific aspects of geographical space.
Our research has addressed a variety of contextual effects on the validity of crowdsourced data such as cultural, linguistic, regional, as well as methodological differences that we will discuss here in light of semantics.

CULTURAL, LINGUISTIC, AND REGIONAL CONTEXTS
Sourcing from the crowd opens up the possibility of drawing on diverse cultural and linguistic backgrounds to contribute to questions such as linguistic relativity (Gumperz and Levinson, 1996, Boroditsky, 2000). Linguistic relativity, a major research area in the cognitive sciences, addresses the influence of language on cognitive processing. The theory is that a person's native language substantially influences the way information is processed. Such studies are usually performed as rather expensive field or lab studies, so being able to target groups in the crowd that share characteristics, such as speaking the same language, or speaking the same language in different environmental contexts, is an exciting possibility. Though running experiments via the Internet has great appeal, the downside is that certain crowd science platforms, such as Amazon Mechanical Turk, may not be globally available, or that the infrastructure in a country is not sufficiently developed. Additionally, it may be more difficult to control unwanted influencing parameters and to prevent participants from cheating, for example, by using translation services to feign mastery of a foreign language.
We have run several experiments in this context, using active crowdsourcing (eliciting responses from the crowd after forming a research question), using passive crowdsourcing (re-using information the crowd made publicly available, potentially for a different purpose), and using a mixed approach in which we compared results from a field study against a crowdsourced experiment; results were largely positive. In (Klippel et al., 2013), we demonstrated the feasibility of using crowdsourcing via the Internet platform Amazon Mechanical Turk (AMT) as a means to address questions of linguistic relativity by comparing responses of English-, Chinese-, and Korean-speaking participants (cf. Figure 1). The question addressed was: How many spatial relations between overlapping extended spatial entities do people intuitively distinguish? (See Figure 5 (left) for examples of the stimulus material used in this study.) While participants shared a current cultural and linguistic environment (their computers were located in the US), their mother tongues (English, Chinese, or Korean) differed. The research approach mimicked studies in the psychological sciences that try to replace expensive overseas field studies (Papafragou and Selimis, 2010). We were able to demonstrate a) that it is possible to elicit feedback from diverse linguistic backgrounds through platforms such as AMT; b) that results are reliable; and c) that in this particular experiment the main distinction between non-overlapping, overlapping, and proper-part relations outweighs potential language-specific differences. We were able to confirm the validity of individual responses by having native language speakers on the research team. However, collecting data from native Korean speakers proved challenging due to the smaller number of AMT workers, which forced us to accept lower AMT approval rates. This invited cheating: some participants used translation services to read the instructions and provide answers to the questions. While we were able to identify these participants, the lesson learned is that high AMT approval rates are essential for ensuring the validity of research results obtained through AMT. (There is a lively discussion about the validity of AMT for research results that cannot be covered in detail here; for an overview see, for example, (Crump et al., 2013). Amazon also recently announced changes to its financial model that will potentially lower the attractiveness of AMT for academic research.)
Another study (Xu et al., 2014) used a framework for passive crowdsourcing (Jaiswal et al., 2011) and collected a corpus of >11,000 instances of route directions from web pages covering the entire USA at the granularity of states. Through various tools that aided data processing, we were able to show regional differences in the way that people give route directions, that is, whether they prefer cardinal or relative directions (see Figure 2). Data validation in this case is challenging, as there is no way to access the 'participants' or to learn anything about their backgrounds (e.g., have the people who put route directions on web pages lived in an area for a long time, or grown up there, such that they absorbed enough regional specificities?). One aspect in favor of this crowdsourcing approach is the large number of participants, that is, the size of the corpus. In this sense the study fulfills a classic promise of crowd science, namely that large numbers (the classic interpretation of the crowd) exhibit intelligence (Surowiecki, 2005). This is a valid point here, as it is fair to assume that the majority of people who put route directions on the web have had substantial exposure to their environments, especially at the level of states and regions (it was not the point of this study to prove that people in Connecticut differ from people in New Hampshire in the way they give route directions). In addition, we were able to confirm the results through theoretical considerations such as different environments (mountains versus plains) as well as historical planning strategies. While one point of big data is that it may mean the end of theory (Anderson, 2008), we believe that we are not there yet in terms of data quality and reliability, especially with respect to behavioral studies using [...]. We were able to show that the patterns that emerged in our study (see Figure 2) were in line with many regional characteristics and planning aspects of cities across the US. The bottom line is that, without improved methods of accessing background information about the crowd, many behavioral studies benefit from a theoretical grounding of their findings as well as from large numbers.
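The regional comparison of cardinal versus relative direction usage described above can be illustrated with a minimal sketch. It is not the processing pipeline of (Xu et al., 2014); the keyword patterns, the function name relative_share_by_state, and the example sentences are assumptions made purely for illustration, and a real corpus would need disambiguation (e.g., "right after the bridge").

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns; the lexicon used in the original study is not given here.
CARDINAL = re.compile(r"\b(north|south|east|west)(?:bound|ern|ward)?\b", re.IGNORECASE)
RELATIVE = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def relative_share_by_state(documents):
    """Return {state: share of relative terms among all direction terms}.

    documents is an iterable of (state, text) pairs, one per route description.
    States without any direction term are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [relative count, cardinal count]
    for state, text in documents:
        counts[state][0] += len(RELATIVE.findall(text))
        counts[state][1] += len(CARDINAL.findall(text))

    shares = {}
    for state, (relative, cardinal) in counts.items():
        total = relative + cardinal
        if total:
            shares[state] = relative / total
    return shares


# Invented toy input, not data from the study.
docs = [
    ("PA", "Turn left on Main St, then turn right."),
    ("KS", "Go north two miles, then head east."),
    ("KS", "Turn left at the silo."),
]
shares = relative_share_by_state(docs)
```

A share close to 1 corresponds to the dark (relative) end of the scale in Figure 2 and a share close to 0 to the light (cardinal) end.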
Crowdsourcing can also be used to complement field studies. In a recent study (Klippel et al., 2015), we addressed emerging topics in the area of landscape conceptualization and explicitly used a diversity-fostering approach to uncover potentials, challenges, complexities, and patterns in human landscape concepts. Based on a representation of different landscapes (see Figure 5 (right) for examples of the images used as stimulus material), responses from two different populations were elicited: Navajo and the (US) crowd. Data from Navajo participants was obtained through field studies, while data from English-speaking participants was collected via AMT. Results support the idea of conceptual pluralism, that is, even within a linguistically homogeneous group of participants different conceptualizations of reality (geographic landscapes) exist (see also Section 4).

EXPERTS VERSUS LAY PEOPLE VERSUS DIFFERENT INPUT DATA SOURCES
One of the potentially most exciting developments in crowd science is the possibility of extending earth observations beyond artificial sensors and using the crowd to aid in unprecedentedly extensive data collection (Salk et al., 2015, Comber et al., 2013, Goodchild and Glennon, 2010). There are excellent reasons to use the crowd as human sensors: in certain situations, the crowd outperforms artificial sensors. One of the best examples is birding applications, in which volunteers contribute tremendous and reliable insight into the distribution and migration patterns of birds (see http://www.birds.cornell.edu/). This data would be impossible to collect through current sensor networks. In other areas, such as land cover data, human sensors complement artificial sensors to, potentially, increase the availability of ad hoc data (Heinzelman and Waters, 2010) or improve artificial sensors (Comber et al., 2013). The Geo-Wiki Project (Fritz et al., 2009) provides aerial photos of the earth's surface to online participants and asks them to classify these patches of land into various land cover classes. While there are studies that explore the accuracy and reliability of this Geo-Wiki data (Foody and Boyd, 2013, Perger et al., 2012, See et al., 2013), there is a need to further understand citizens' perception and their classification process of the environment. The Citizen Observatory Web, for example, aims to have citizens create environmental data through mobile devices in and around the area where they live. One of its goals is to work with citizens throughout this process in order to better understand their environmental perception and learn how they go about the data creation process. Although the community is making progress, we are far from understanding humans' abilities to sense environmental information reliably.
While projects such as Geo-Wiki have generated a lot of excitement, a comprehensive set of studies we performed on humans' abilities to reliably identify land cover types shows that the high accuracy of human land cover classifications claimed in other studies is only possible at a coarse level of granularity or for specific land cover types. Figure 3 shows, in the form of confusion matrices, the results of five experiments we conducted (Sparks et al., 2015a, Sparks et al., 2015b), which tested the effect of participant expertise, methodological design, and different input data sources and perspectives (i.e., ground-based photos and aerial photos) on land cover classification. Correctly classified images lie along the diagonal (top-left to bottom-right).
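For readers less familiar with confusion matrices, the following minimal sketch shows how one could be tabulated from (reference class, chosen class) pairs, with rows giving the reference class and columns the class chosen by participants, as in Figure 3. The function names and the toy responses are invented for illustration and are not data from the experiments.

```python
from collections import Counter


def confusion_matrix(pairs, classes):
    """Tabulate (reference_class, chosen_class) pairs.

    Rows follow the reference class, columns the class chosen by participants,
    both in the order given by classes.
    """
    counts = Counter(pairs)
    return [[counts[(ref, chosen)] for chosen in classes] for ref in classes]


def per_class_accuracy(matrix, classes):
    """Share of responses on the diagonal for each reference class that has responses."""
    accuracy = {}
    for i, name in enumerate(classes):
        row_total = sum(matrix[i])
        if row_total:
            accuracy[name] = matrix[i][i] / row_total
    return accuracy


# Invented toy responses using three of the eleven class names.
classes = ["Forest", "Grassland", "Pasture/Hay"]
responses = (
    [("Forest", "Forest")] * 3
    + [("Grassland", "Grassland"), ("Grassland", "Pasture/Hay")]
    + [("Pasture/Hay", "Grassland")]
)
matrix = confusion_matrix(responses, classes)
accuracy = per_class_accuracy(matrix, classes)
```

Off-diagonal cells such as Grassland versus Pasture/Hay are where class confusions become visible.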
All experiments asked participants to classify photos of land cover into one of 11 possible categories. The two methodological designs varied the size of the photos and their visual availability. The first design presented the participant with a series of ground-based photos all at once, side by side, as relatively small icons. This allowed the participant to see all the images at all times throughout the classification process. The second design presented the participant with one ground-based (and aerial) photo at a time, so the images were larger than in the first design but could not be viewed simultaneously. These categorical classification tasks proved difficult for participants. The experiments demonstrated that a) experts are not significantly different from educated lay participants (i.e., participants given definitions and prototypical images of the land cover classes before the experiment) when classifying land cover, b) methodological changes in classification tasks did not significantly affect participants' classification, and c) the addition of aerial photos (plus ground images) did not significantly change participants' classification.
The earth's surface can be complex and heterogeneous, so asking crowdsourced participants to classify this complexity into relatively low-level categories is perhaps not the most effective method; that is, the level of granularity at which humans are able to classify land cover might be rather coarse. This is especially the case when understanding these low-level categories relies so much on participants' interpretation of class names. This interpretation perhaps has the largest influence on classification outcome, as we see that variation in expertise, methodological design, and input data sources has little influence on it. Some land cover classes are more challenging to interpret than others: participants classified classes like Forest, Developed, and Open Water more consistently, and more challenging classes like Grassland and Pasture less consistently. As previously mentioned, this pattern persisted across differences in participant expertise and varying input data sources/perspectives (ground-based photos versus aerial photos).

THE COMPLEXITY OF THE HUMAN MIND-COGNITIVE SEMANTICS
The final aspect to discuss in this short paper is the competing conceptualizations humans may have of the same set of stimuli (Foucault, 1994, Wrisley III, 2008, Barsalou, 1983, Gärdenfors, 2000). We have made substantial progress in analyzing crowdsourced data in depth and provide a statistical measure of the agreement of participants with respect to the task they perform (in most of our experiments participants create categories for stimuli they are presented with, such as landscape images). While this is a rather specific task, it does reveal some important aspects of the human mind (cognitive semantics) that sound straightforward but are difficult to quantify: the more complex the stimulus/task is, the more varied participants' responses are. This is particularly true for unrestricted sampling from the crowd.
To quantify this relation, we developed, for example, a cross-method-similarity-index (CMSI, see (Wallgrün et al., 2014)). The CMSI measures agreement between the results of applying different hierarchical clustering methods (cf. (Kos and Psenicka, 2000)) to the data collected in category construction experiments for a given number of clusters (c). The value is computed for different values of c. Analyses from two experiments are provided in Figure 4, with examples of the icons used in the respective experiments shown in Figure 5. Without going into too much detail: consistency of human conceptualizations (cognitive semantics) is established in a bootstrapping approach by sampling from a participant pool (actual responses) with increasing sample sizes.
The average CMSI values are then plotted over the sample size. The top part of Figure 4 shows results for the above-mentioned experiment on overlap relations. It is clear that even a small number of participants converges on the most reliable solution, that is, a separation into three categories (non-overlapping, overlapping, and proper-part relations). This is indicated by the line for three clusters in the graph approaching 1 (the ideal solution) quickly and for low numbers of participants. In contrast, data from a recent experiment on landscape concepts (Klippel et al., 2015) shows that there is no universally acceptable category structure that, on an abstract level, would work for all participants, that is, no number of clusters converges to 1.
This finding, partially in combination with results discussed above, has resulted in three lines of current research:
• The quantification of how complex individual stimuli are.
• The statistical identification of conceptually consistent subgroups of participants.
• The definition of conceptual pluralism as a means to statistically determine the complete set of intuitive conceptualizations the crowd may have on the stimulus used.
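The bootstrapped cross-method comparison described in this section can be sketched as follows. This is not the CMSI as defined in (Wallgrün et al., 2014), whose exact formula is not reproduced here: as stand-ins, the sketch assumes single, complete, and average linkage as the hierarchical methods, a dissimilarity equal to the share of participants who separated two stimuli, and the Rand index as the agreement measure. All function names are hypothetical.

```python
import random
from itertools import combinations


def dissimilarity(partitions):
    """Share of participants who put stimuli i and j into different groups."""
    n = len(partitions[0])
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        apart = sum(p[i] != p[j] for p in partitions)
        d[i][j] = d[j][i] = apart / len(partitions)
    return d


# Assumed set of linkage rules; each maps a list of pairwise distances to one value.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda values: sum(values) / len(values),
}


def agglomerate(d, c, linkage):
    """Naive agglomerative clustering cut at c clusters; returns one label per stimulus."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > c:
        def cost(pair):
            a, b = pair
            return linkage([d[i][j] for i in clusters[a] for j in clusters[b]])
        a, b = min(combinations(range(len(clusters)), 2), key=cost)
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    labels = [0] * len(d)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def rand_index(x, y):
    """Share of stimulus pairs on which two partitions agree (together vs. apart)."""
    pairs = list(combinations(range(len(x)), 2))
    agree = sum((x[i] == x[j]) == (y[i] == y[j]) for i, j in pairs)
    return agree / len(pairs)


def cross_method_agreement(partitions, c):
    """Mean pairwise Rand index between the c-cluster solutions of all linkages."""
    d = dissimilarity(partitions)
    results = [agglomerate(d, c, fn) for fn in LINKAGES.values()]
    scores = [rand_index(p, q) for p, q in combinations(results, 2)]
    return sum(scores) / len(scores)


def bootstrap_curve(partitions, c, sample_sizes, repetitions=50, seed=0):
    """Average agreement over resampled participant pools of increasing size."""
    rng = random.Random(seed)
    curve = {}
    for n in sample_sizes:
        values = [cross_method_agreement(rng.choices(partitions, k=n), c)
                  for _ in range(repetitions)]
        curve[n] = sum(values) / repetitions
    return curve
```

Each participant is represented by a list of group labels, one per stimulus. Plotting the returned values over the sample size for several values of c would give curves of the kind shown in Figure 4; a pool of participants who all produce the same three groups yields 1 for c = 3 at every sample size.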

CONCLUSIONS
We focused on a meta-discussion of the lessons learned so far about different aspects of semantics in crowd science. Crowd science is still a young discipline and as such requires discussions about pitfalls and potentials. We argue that the semantic diversity of the crowd is an opportunity rather than a downside. It does require, however, attention to detail to harvest the full potential of this diversity. First, there needs to be some quality control, either in the form of reliability scores (AMT), hands-on validation, or a thorough theoretical underpinning against which the results can be evaluated. Additionally, we need statistical methods that allow for identifying relevant semantic contexts, that is, we need new methods that intelligently process data collected from the crowd and identify consistent views/performance by sub-groups.
When the crowd is used to assist in earth observation, it is important to make the crowd's task as objective as possible. As seen in the land cover classification experiments described above, when subjective interpretation of terms is allowed, the consistency and reliability of responses drop and the variety of unique responses increases. A relatively high number of classes to choose from, combined with classes that are relatively broad in their interpretation, leaves much more room for subjectivity than for objectivity.
To address this problem, we are currently designing experiments that replace a categorical land cover classification scheme with a feature-based classification scheme. This feature-based scheme mimics a decision tree process, asking the participant a series of 'either-or' questions (e.g., Is this photo either primarily vegetated or primarily non-vegetated?). Our hypothesis is that participants are more likely to agree on the presence or absence of environmental features than on lower-level categorical classifications. This scheme reduces the variety of class name interpretations in the classification task and creates a more objective approach.
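A feature-based scheme of this kind could be represented as a small tree of yes/no questions, as in the following minimal sketch. Only the first question is taken from the text; the follow-up questions, the leaf labels, and the names TREE and classify are hypothetical and do not describe the experiments under design.

```python
# Hypothetical question tree: only the first question comes from the text above;
# the remaining questions and the leaf labels are invented for illustration.
TREE = {
    "question": "Is this photo primarily vegetated?",
    "yes": {
        "question": "Is the vegetation primarily woody?",
        "yes": "vegetated / woody",
        "no": "vegetated / herbaceous",
    },
    "no": {
        "question": "Is the surface primarily water?",
        "yes": "non-vegetated / water",
        "no": "non-vegetated / other",
    },
}


def classify(tree, answer):
    """Walk the tree until a leaf label is reached.

    answer is a callable that takes a question string and returns True or False.
    Returns the leaf label and the list of (question, reply) pairs along the path.
    """
    node = tree
    path = []
    while isinstance(node, dict):
        reply = bool(answer(node["question"]))
        path.append((node["question"], reply))
        node = node["yes" if reply else "no"]
    return node, path


# One simulated participant answering two questions about one photo.
answers = {
    "Is this photo primarily vegetated?": True,
    "Is the vegetation primarily woody?": False,
}
label, path = classify(TREE, answers.get)
```

Recording the full path rather than only the leaf makes it possible to compare participants question by question, which is where the hypothesized gain in agreement would show up.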

Figure 1: Screenshot of the three start screens of CatScan in three different languages (English, Chinese, Korean)

Figure 2: Proportion of relative direction vs. cardinal direction usage for expressing change of direction in the U.S. (Dark: more relative direction usage; light: more cardinal direction usage)

Figure 3: Comparison of patterns of responses of five experiments. Row/column names of each matrix represent unique land cover classes the participants could choose from (Barren, Cultivated Crops, Developed Low Intensity, Developed High Intensity, Emergent Herbaceous Wetlands, Forest, Grassland, Open Water, Pasture/Hay, Shrub/Scrub, Woody Wetlands). The first three matrices (left to right) represent the first three experiments, testing the influence of expertise in classification. The last two represent the last two experiments, testing the influence of added aerial photos. Results show agreement against NLCD data; more precisely, the numbers represent how often images of the class given by the row have been categorized as the class given by the column (i.e., a confusion matrix). Darker (red) colors indicate higher error rates. More important is the comparison of similarity between patterns.

Figure 4: Results of cluster validation using CMSI for two experiments. Top: experiment on overlap relations (see Figure 5 (left)); bottom: experiment on landscape conceptualizations (see Figure 5 (right)).

Figure 5: Example of the icons used in the Mode of Overlap experiment (left) and the Navajo Landscape Concepts (right).