Twitter-Based Flood Damage Mapping Using Text Classification and Geoparsing Techniques
Keywords: Text Classification, Geoparsing, Natural Language Processing, Natural Disaster, Machine Learning, Social Media
Abstract. Social media is a prominent source of real-time information for understanding disasters. This study presents a method that combines natural language processing (NLP), machine learning, and geoparsing to classify and map flood damage from Twitter data. It addresses a key gap in prior work, which has primarily classified tweets as either damage-related or not. In contrast, the proposed framework categorizes tweets into multiple risk classes, enabling more detailed assessment, and improves spatial resolution by analyzing risk at the city level rather than providing only a broad overview. For this purpose, we retrieved a dataset of 3,000 tweets from India and Pakistan from the CrisisNLP database. After text cleaning, 1,000 tweets were manually annotated into three damage classes. Three machine learning classifiers, Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression, were trained on Term Frequency-Inverse Document Frequency (TF-IDF) features. SVM performed best in terms of accuracy, precision, and recall. The trained models were then used to label the remaining 2,000 tweets. For spatial analysis, a rule-based geoparsing strategy built on a curated list of states and cities was used, and geographic coordinates were retrieved using the Geopy library. Tweets were then grouped by location, and predicted flood damage trends were mapped in ArcGIS. Validation involved visual comparison of satellite images acquired before and after the flood, which confirmed damage detection in selected cities. Results indicate that combining social media analysis with geospatial techniques can effectively assess flood damage in areas lacking organized or geotagged data.
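
The rule-based geoparsing and location grouping steps described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the gazetteer entries, the alias handling, the class labels ("high", "moderate", "low"), and the function names extract_places and aggregate_by_place are all assumptions made for illustration, since the curated list and the annotation scheme are not given here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical gazetteer: lower-cased surface form -> canonical place name.
# The study's curated list of states and cities is not reproduced here.
GAZETTEER = {
    "karachi": "Karachi",
    "lahore": "Lahore",
    "mumbai": "Mumbai",
    "bombay": "Mumbai",  # alias mapped to the canonical name
    "chennai": "Chennai",
}

# Single case-insensitive pattern with word boundaries; longer names are
# listed first so that multi-word names would win over their prefixes.
_NAMES = sorted(GAZETTEER, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)


def extract_places(text):
    """Return canonical place names found in text, in order, without repeats."""
    found = []
    for match in _PATTERN.finditer(text):
        place = GAZETTEER[match.group(1).lower()]
        if place not in found:
            found.append(place)
    return found


def aggregate_by_place(labelled_tweets):
    """Count predicted damage classes per place.

    labelled_tweets is an iterable of (text, label) pairs, where label is
    the class assigned by a classifier (labels here are hypothetical).
    """
    counts = defaultdict(Counter)
    for text, label in labelled_tweets:
        for place in extract_places(text):
            counts[place][label] += 1
    return {place: dict(counter) for place, counter in counts.items()}


# Made-up example tweets with made-up predicted labels.
tweets = [
    ("Severe flooding in Karachi, houses destroyed", "high"),
    ("Roads waterlogged in Mumbai after heavy rain", "low"),
    ("Bombay local trains suspended, #MumbaiRains", "moderate"),
    ("Stay safe everyone", "low"),
]
summary = aggregate_by_place(tweets)
```

Because of the word boundaries, a hashtag such as #MumbaiRains is not matched in this sketch; whether hashtags should be split first is a design choice the abstract does not specify. Each canonical name in the summary could then be passed to a Geopy geocoder to obtain coordinates for mapping; which geocoding service the study used is not stated.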
