BUILDING A BILLION SPATIO-TEMPORAL OBJECT SEARCH AND VISUALIZATION PLATFORM
Keywords: Solr, Lucene, big data, OpenStack, containers, visualization, social media, spatio-temporal, open source, research, GIS, geospatial, spatio-temporal, real time, streaming, cloud-based
Abstract. With funding from the Sloan Foundation and Harvard Dataverse, the Harvard Center for Geographic Analysis (CGA) has developed a prototype spatio-temporal visualization platform called the Billion Object Platform or BOP. The goal of the project is to lower barriers for scholars who wish to access large, streaming, spatio-temporal datasets. The BOP is now loaded with the latest billion geo-tweets, and is fed a real-time stream of about 1 million tweets per day. The geo-tweets are enriched with sentiment and census/admin boundary codes when they enter the system. The system is open source and is currently hosted on Massachusetts Open Cloud (MOC), an OpenStack environment with all components deployed in Docker orchestrated by Kontena. This paper will provide an overview of the BOP architecture, which is built on an open source stack consisting of Apache Lucene, Solr, Kafka, Zookeeper, Swagger, scikit-learn, OpenLayers, and AngularJS. The paper will further discuss the approach used for harvesting, enriching, streaming, storing, indexing, visualizing and querying a billion streaming geo-tweets.