DETECTING SPATIAL PATTERNS OF NATURAL HAZARDS FROM THE WIKIPEDIA KNOWLEDGE BASE
Keywords: Volunteered Geographic Information; User-Generated Knowledge; Topic Modeling; Big Geospatial Data; Wildfire
Abstract. The Wikipedia database is a data source of immense richness and variety. It contains thousands of geotagged articles, among them almost real-time updates on current and historic natural hazards. These articles carry user-contributed information about the location of natural hazards, the extent of the disasters, and many details relating to response, impact, and recovery. In this research, a computational framework is proposed to detect spatial patterns of natural hazards from the Wikipedia database by combining topic modeling methods with spatial analysis techniques. The computation is performed on the Neon Cluster, a high-performance computing cluster at the University of Iowa. This work uses wildfires as the exemplar hazard, but the framework can be generalized to other types of hazards, such as hurricanes or flooding. A Latent Dirichlet Allocation (LDA) model is first trained on the entire English Wikipedia dump, transforming the database dump into a 500-dimensional topic model. Over 230,000 geotagged articles, spatially covering the contiguous United States, are then extracted from the Wikipedia database. The geotagged articles are converted into the LDA topic space based on the topic model, with each article represented as a weighted multidimensional topic vector. By treating each article's topic vector as an observed point in geographic space, a probability surface is calculated for each of the topics. Wikipedia articles about wildfires are also extracted from the database, forming a wildfire corpus and creating a basis for the topic vector analysis. The spatial distribution of wildfire outbreaks in the US is estimated by calculating the weighted sum of the topic probability surfaces using a map algebra approach, and is mapped using GIS. To evaluate the approach, the estimate is compared to wildfire hazard potential maps created by the USDA Forest Service.
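The last two steps of the framework, per-topic probability surfaces followed by a weighted sum, can be illustrated with a minimal sketch. It is not the implementation used in this work. The Gaussian kernel, the `bandwidth` parameter, the function names, and the toy coordinates and weights are all assumptions made for illustration. The abstract does not specify how the surfaces are interpolated or how the topic weights are chosen, so the weights here simply stand in for values that might be derived from the wildfire corpus.

```python
import numpy as np

def topic_surfaces(xy, theta, grid_x, grid_y, bandwidth=1.0):
    """Return one normalized surface per topic, shape (K, H, W).

    xy:    (N, 2) article coordinates (hypothetical planar units)
    theta: (N, K) topic weights of each article
    Assumption: a Gaussian kernel spreads each article's topic weight
    over the grid; the paper's actual interpolation may differ.
    """
    gx, gy = np.meshgrid(grid_x, grid_y)                  # each (H, W)
    cells = np.stack([gx.ravel(), gy.ravel()], axis=1)    # (H*W, 2)
    # Squared distance from every grid cell to every article.
    d2 = ((cells[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))         # (H*W, N)
    surf = kernel @ theta                                 # (H*W, K)
    # Normalize each topic surface so its cells sum to 1.
    totals = surf.sum(axis=0, keepdims=True)
    surf = np.divide(surf, totals, out=np.zeros_like(surf), where=totals > 0)
    return surf.T.reshape(theta.shape[1], len(grid_y), len(grid_x))

def hazard_surface(surfaces, topic_weights):
    """Map algebra step: cell-by-cell weighted sum of the topic surfaces."""
    w = np.asarray(topic_weights, dtype=float)
    return np.tensordot(w, surfaces, axes=1)              # (H, W)

# Toy data: two articles, two topics, a 5 x 5 grid (all values hypothetical).
xy = np.array([[0.0, 0.0], [4.0, 4.0]])
theta = np.array([[1.0, 0.0],    # article 0 is entirely topic 0
                  [0.0, 1.0]])   # article 1 is entirely topic 1
grid = np.linspace(0.0, 4.0, 5)
surfaces = topic_surfaces(xy, theta, grid, grid)
hazard = hazard_surface(surfaces, topic_weights=[1.0, 0.0])
peak_row, peak_col = np.unravel_index(hazard.argmax(), hazard.shape)
```

With all weight placed on topic 0, the estimated surface peaks at the grid cell nearest the article that carries that topic. In practice the grid would be a raster over the contiguous United States and `theta` would hold the 500-dimensional topic vectors of the geotagged articles.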