ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Articles | Volume X-1/W1-2023
https://doi.org/10.5194/isprs-annals-X-1-W1-2023-303-2023
05 Dec 2023

MMCPP: A MULTI-MODAL CONTRASTIVE PRE-TRAINING MODEL FOR PLACE REPRESENTATION BASED ON THE SPATIO-TEMPORAL FRAMEWORK

Y. Chen, X. S. Yu, and K. Qin

Keywords: Place Embedding, Geographic Contexts, Spatial Interaction, Temporal Position Encoding, Self-supervised Learning

Abstract. The concept of "place" is crucial for understanding geographical environments from a human perspective. Place representation learning converts places into low-dimensional dense numerical vectors and is a fundamental procedure for artificial intelligence in geography (GeoAI). However, most studies ignore the multi-level distance constraints and spatial proximity interactions that give rise to behavioral interactions between places. Furthermore, representing the temporal characteristics of these interactions in trajectory sequences remains a challenge for techniques from natural language processing and other fields. In addition, most existing methods require all modalities as input because they integrate the modalities through joint training. To address these issues, we propose a Multi-Modal Contrastive Pre-training model for Place representation (MMCPP). Our model consists of three encoders that capture place attributes in the corresponding modalities: points of interest (POIs), images, and trajectories. The trajectory encoder, named RodtFormer, takes fine-grained spatio-temporal trajectories as input and leverages self-attention with rotary temporal interval position embedding to simulate dynamic spatial and behavioral proximity interactions between places. By using a coordinated pre-training framework, MMCPP encodes place representations in each modality independently, improving model reusability. We verify the effectiveness of our model on a taxi trajectory dataset using the task of predicting the location at the next n seconds, with n set to 30, 180, and 300 seconds (s). Our results demonstrate that, compared to existing embedding methods, our model learns higher-quality place representations during pre-training, leading to improved performance on downstream tasks.
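The abstract does not give RodtFormer's equations, so the snippet below is only a minimal, hypothetical sketch of the idea behind a "rotary temporal interval position embedding": standard rotary position embedding is adapted so that query/key feature pairs are rotated through angles proportional to each trajectory point's timestamp, which makes the attention score between two points depend on their temporal interval. The function names, the NumPy formulation, and the choice of frequency base are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rotary_temporal_embedding(x, timestamps, base=10000.0):
    """Rotate feature pairs of x by angles proportional to each token's timestamp.

    x:          (seq_len, dim) query or key vectors; dim must be even
    timestamps: (seq_len,) elapsed time (e.g. seconds) of each trajectory point
    """
    seq_len, dim = x.shape
    # One rotation frequency per feature pair, following the usual rotary scheme.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.outer(timestamps, inv_freq)               # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # paired features
    # Pairwise 2-D rotation; the relative rotation between two tokens then
    # depends only on the difference of their timestamps (the temporal interval).
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def attention_scores(q, k, timestamps):
    """Scaled dot-product attention scores that are sensitive to temporal intervals."""
    q_rot = rotary_temporal_embedding(q, timestamps)
    k_rot = rotary_temporal_embedding(k, timestamps)
    return q_rot @ k_rot.T / np.sqrt(q.shape[-1])
```

Because the rotation is applied to both queries and keys, the inner product between two rotated vectors depends on the timestamp difference rather than on absolute time, which is one plausible way to encode temporal intervals inside self-attention as described in the abstract.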