MMCPP: A MULTI-MODAL CONTRASTIVE PRE-TRAINING MODEL FOR PLACE REPRESENTATION BASED ON THE SPATIO-TEMPORAL FRAMEWORK

The concept of "place" is crucial for understanding geographical environments from a human perspective. Place representation learning converts places into low-dimensional dense numerical vectors and is a fundamental procedure for artificial intelligence in geography (GeoAI). However, most studies ignore the multi-level distance constraints and spatial proximity interactions that enable behavioral interactions between places. Furthermore, representing the temporal characteristics of these interactions in trajectory sequences is a challenge for techniques borrowed from natural language processing and other fields. In addition, most existing methods require all modalities as input because they integrate multiple modalities through joint training. To address these issues, we propose a Multi-Modal Contrastive Pre-training model for Place representation (MMCPP). Our model consists of three encoders that capture place attributes from different modalities: points of interest (POIs), images, and trajectories. The trajectory encoder, named RodtFormer, takes fine-grained spatio-temporal trajectories as input and leverages self-attention with rotary temporal interval position embedding to simulate dynamic spatial and behavioral proximity interactions between places. By using a coordinated pre-training framework, MMCPP encodes place representations independently in each modality, which improves model reusability. We verify the effectiveness of our model on a taxi trajectory dataset using the task of predicting the location after the next n seconds, with n set to 30, 180, and 300. Our results demonstrate that, compared with existing embedding methods, our model learns higher-quality place representations during pre-training, leading to improved performance on downstream tasks.


INTRODUCTION
Places offer meaningful insight into the geographical environment from a human perspective, as they contain attributes related to spatial features and human behavior (Liu, Yao, et al. 2020). Place representation learning has emerged as a key component in urban studies and applications, aiming to represent places as low-dimensional numerical vectors. These vectors can be used to reveal the inherent laws of places (Huang et al. 2021) and improve the performance of downstream tasks such as location prediction (Li et al. 2022), urban function identification (Jenkins et al. 2019; Paul et al. 2021), and house price forecasting (Das et al. 2021). Place representation learning drives the development of artificial intelligence in geography (GeoAI) (Mai, Janowicz, et al. 2022; Janowicz et al. 2020).
Representing a place requires describing its attributes. These attributes include two aspects: first-order attributes intrinsic to the place itself, and second-order attributes shaped by spatial and behavioral proximity interactions among places. Different modalities of data capture distinct facets of these attributes. For instance, points of interest (POIs) (Jenkins et al. 2019) and images (Wang, Wang, et al. 2020) can reveal first-order attributes such as place functions and surface information. Trajectories can simulate dynamic interactions between places and provide insight into second-order attributes such as spatial autocorrelation and spatial complementarity (Liu et al. 2022).
Several recent studies have demonstrated that combining multimodal data can yield higher-quality place representations. To address the limitations of existing methods, we aim to develop a pre-training model that simultaneously captures first-order and second-order attributes using multi-modal data, considering not only the attributes of the place itself but also dynamic spatial proximity and behavioral proximity interactions. The model can be used in tasks where some modalities are missing while still exploiting multi-modal information, which ensures higher reusability.
The main contributions of this paper are summarized as follows:

• We propose MMCPP, a model that uses three types of geographic data: POIs, images, and trajectories. MMCPP has three encoders, one per modality, each of which can represent places independently, making the model more reusable.

• We propose a trajectory encoder named RodtFormer. Built on the RoFormer structure (a variant of Transformer), it uses a self-attention mechanism with a rotary temporal interval position embedding, which allows it to simulate dynamic spatial proximity interactions and behavioral proximity interactions among places in fine-grained trajectory sequences.

• We design a contrastive sample construction method for integrating multi-modal information. Through contrastive learning, each encoder in MMCPP integrates information from the other modalities yet can still represent a place independently when only its own modality is given, which improves the reusability of the pre-trained model.

• We apply MMCPP to the task of predicting the location after the next n seconds (30, 180, and 300 seconds) and conduct experiments on a taxi trajectory dataset. The experimental results show improved prediction performance over the baseline methods.

RELATED WORK
Aligning heterogeneous and multi-source geographic data to corresponding places and integrating information of different modalities have become challenges in multimodal place representation modeling.
Some studies extract features of different modalities and concatenate them as the place representation vector of each geographical unit (Xuan et al. 2016; Li 2018), which can be expressed as $[x_1; x_2; \ldots; x_n]$, where $x_i$ denotes the features of the $i$-th modality. Place representation vectors obtained in this way are highly interpretable. However, this approach involves a heavy workload and only considers first-order attributes of the place, ignoring second-order attributes generated by interactions among places.
Alternatively, some studies integrate information from different modalities when constructing the input and learn a place representation containing multi-modal information, which can be expressed as $\mathrm{Enc}([x_1; x_2; \ldots; x_n])$. For example, scholars use heterogeneous graphs to integrate information from different modalities (Liu et al. 2022; Paul et al. 2021), or treat trajectories as sentences and use natural language processing methods to jointly integrate multi-modal information (Wan et al. 2022; Zhao et al. 2017; Zhu et al. 2019).
Moreover, information from different modalities can be integrated by adding a fusion structure to the model, expressed as $\mathrm{Fusion}(\mathrm{Enc}_1(x_1), \mathrm{Enc}_2(x_2), \ldots, \mathrm{Enc}_n(x_n))$. Examples include statistical operators such as mean aggregation (Wang et al. 2021) and deep learning modules such as fully connected layers (Jenkins et al. 2019; Zhang et al. 2020; Luo et al. 2022; Wang et al. 2022) and attention mechanism variants (Zhang et al. 2020; Luo et al. 2022; Sun et al. 2022). Encoder-decoder and other special training structures can also be used for modality integration (Du et al. 2019; Zhang et al. 2019; Fu et al. 2019).
Some studies design training tasks, i.e., loss functions, that enable the representation model to integrate multi-modal information during training. These tasks can be categorized into masked data recovery tasks (Zhang et al. 2017; Lin et al. 2021) and contrastive learning tasks (Huang et al. 2021; Radford et al. 2021).
Most studies using the first three methods above adopt joint representation structures that rely on all modal inputs (Baltrušaitis et al. 2019). Such a model cannot produce a representation when the downstream task lacks any modality, so its reusability is low. In contrast, most models trained with a contrastive learning strategy are coordinated representation structures (Baltrušaitis et al. 2019) that remain applicable when only one modality is available. However, few studies apply this method to place representation (Huang et al. 2021).
Furthermore, most studies only consider the interaction between stay points in trajectory data, such as the pick-up and drop-off points of taxi trajectories. They overlook multi-level distance constraints and spatial proximity interactions in fine-grained trajectory sequences (Yao et al. 2018; Du et al. 2019; Liu, Miranda, et al. 2020).
It is also usually difficult to represent the geographical characteristics of places (e.g., spatio-temporal characteristics) by directly transferring representation methods from specific fields such as natural language processing. For example, the self-attention mechanism in the Transformer encoder encodes the semantics of the sequence context and uses absolute sequence positions (0, 1, 2, …) to supplement the position information lost in self-attention. However, capturing the sequence order and temporal characteristics of place interactions in trajectories is crucial for building place representations. Even though some studies introduce temporal positions (Lin et al. 2021), the representation ability remains insufficient.

METHODOLOGY
Figure 1 illustrates the structure of MMCPP, which includes three encoders and two pre-training phases. In phase 1, the place attributes described by each modality are encoded by the three encoders individually, using self-supervised pre-training tasks. In phase 2, the encoded place attributes from each modality are integrated coordinately through contrastive learning. This section explains the structure of MMCPP in greater detail.

POIs Encoder
The POI encoder extracts functional attributes from the categories and geographical location distribution of POI data to encode the first-order attributes of a place. Its structure, shown in Figure 2, includes an input layer and an attention pooling layer for capturing the dominant functional properties of the place.
The input layer comprises a POI category coding layer that converts discrete POI categories into continuous numerical vectors. Additionally, a POI location coding layer uses the sinusoidal multi-scale position encoder (Mai, Xuan, et al. 2022) to encode the two-dimensional coordinates (latitude and longitude) of the POIs within the place. For places without POIs, a special "pseudo-POI" with the category name "[NAN]" and coordinates at the geometric center of the place is added, so that places without POIs do not all receive the same representation. After category embedding and coordinate coding, the POI category and location encodings are combined by addition. Then, the attention pooling layer performs a weighted aggregation of the POIs of different categories at different locations within the place to capture its main functional attributes. Finally, a feature vector that integrates the POI information is output as the representation of the place in this modality.
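The additive combination and attention pooling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learnable `query` vector, the toy two-dimensional embeddings, and the function name `attention_pool` are assumptions made purely for illustration.

```python
import math

def attention_pool(features, query):
    """Aggregate POI feature vectors into one place vector via softmax attention.

    `query` stands in for a learnable attention vector (an assumption of this sketch).
    """
    # Score each POI feature vector against the query.
    scores = [sum(f_k * q_k for f_k, q_k in zip(f, query)) for f in features]
    # Numerically stable softmax over the POIs of the place.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the POI vectors gives the pooled place representation.
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Toy category embeddings and location encodings for three POIs in one place.
cat_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
loc_enc = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]
# Category and location information are combined additively.
poi_features = [[c + l for c, l in zip(ce, le)] for ce, le in zip(cat_emb, loc_enc)]

place_vec, weights = attention_pool(poi_features, query=[1.0, 0.0])
```

POIs whose features align with the query receive larger weights, which is how a pooling layer of this kind can emphasize the dominant functions of a place.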

Image Encoder
The image encoder of MMCPP adopts the Vision Transformer (ViT) (Dosovitskiy et al. 2021), which is used to extract surface information from the images within a place, such as the size, shape, and spatial distribution of ground objects.
The input to ViT is the images within the place, each containing 200×200 pixels. The size of each image patch is set to 10×10 pixels, resulting in 400 image patches per image. ViT prepends a learnable "[CLS]" token to the image patch embeddings, which serves as a representation of the entire image corresponding to the place. Therefore, the feature vector of "[CLS]" is used as the representation of the place in the image modality.
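The patch arithmetic above (a 200×200 image cut into 10×10 patches gives 20 × 20 = 400 patches) can be checked with a short sketch. The helper `split_into_patches` and the synthetic single-channel image are illustrative assumptions; a real ViT would additionally flatten and linearly project each patch.

```python
def split_into_patches(image, patch_size):
    """Split an image (a list of pixel rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

# Synthetic single-channel 200x200 image standing in for one map tile.
image = [[(r * 200 + c) % 256 for c in range(200)] for r in range(200)]
patches = split_into_patches(image, patch_size=10)

# The encoder sequence length is the patch count plus one "[CLS]" token.
num_tokens = len(patches) + 1
```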

Trajectory Encoder
Fine-grained trajectory sequences simultaneously describe the dynamic interactions of spatial proximity and behavioral proximity among places. The representation of a place spreads to the representations of other places along trajectory sequences that carry the movement behavior of the crowd. The autocorrelation and complementary effects of different places likewise spread to the representations of the target places. This paper proposes a trajectory encoder, named RodtFormer, as shown in Figure 3. It is based on the RoFormer structure (a variant of Transformer), takes fine-grained spatio-temporal trajectories as input, and uses the self-attention mechanism with rotary temporal interval position embedding to simulate the dynamic interactions among places. It captures spatial proximity interactions, behavioral proximity interactions, and temporal characteristics, including the absolute chronological order and the relative time intervals among places. As a result, the encoder represents the second-order attributes of places.
RodtFormer begins by converting the coordinate sequence of a trajectory into the corresponding place ID sequence. This is followed by the place ID coding layer, which maps each place ID to a continuous feature vector.
These vectors are then input into a multi-head self-attention mechanism. However, the self-attention mechanism loses the sequence position information (Vaswani et al. 2017). Transformer and RoFormer use simple sequential orders (0, 1, 2, 3, ...), combined with sinusoidal position encoding and rotary position encoding respectively, to restore position information. However, in trajectories, the time intervals between visits to places are not uniform and contain important information, such as locations' visit frequencies or stay durations (Lin et al. 2021).
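This paper introduces rotary temporal interval position encoding to capture two aspects of the temporal characteristics of trajectory data: absolute chronological order and relative time intervals. The new encoding replaces the "simple sequential orders" with the "first time interval" to introduce temporal position information. The first time interval of each position in the sequence is the difference between its timestamp $t_i$ and the timestamp $t_1$ of the starting point. With $\mathbf{R}_m$ denoting the rotary matrix determined by the first time interval of position $m$, the rotated query and key vectors satisfy

$$\mathbf{q}_m'^{\top}\mathbf{k}_n' = (\mathbf{R}_m\mathbf{q}_m)^{\top}(\mathbf{R}_n\mathbf{k}_n) \tag{1}$$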
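To make the idea concrete, the following minimal sketch applies a RoPE-style rotation whose angle is driven by the time elapsed since the first trajectory point rather than by the sequence index. It is an illustration under stated assumptions, not the authors' code: the first time interval is taken as `t_i - t_1` in seconds, the frequency base of 10000 follows the usual RoPE convention, and the vectors are toy values.

```python
import math

def rotate(vec, delta_t, base=10000.0):
    """Rotate consecutive 2-D pairs of `vec` by angles proportional to `delta_t`."""
    dim = len(vec)
    out = []
    for k in range(0, dim, 2):
        theta = delta_t * base ** (-k / dim)  # per-pair frequency, as in RoPE
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Timestamps (seconds) of three trajectory points with non-uniform spacing.
timestamps = [1000.0, 1012.0, 1075.0]
# Assumed definition of the "first time interval": time since the first point.
first_intervals = [t - timestamps[0] for t in timestamps]

q = [0.5, -1.0, 0.3, 0.8]   # toy query vector
k = [1.2, 0.4, -0.7, 0.1]   # toy key vector

# Attention logit between the third point (query) and the second point (key).
score = dot(rotate(q, first_intervals[2]), rotate(k, first_intervals[1]))
# Same 63-second gap placed later on the time axis gives the same logit.
shifted = dot(rotate(q, 163.0), rotate(k, 100.0))
```

Because each pair is rotated by an angle linear in the elapsed time, the dot product depends only on the difference between the two intervals, so the attention logit reflects the relative time gap while the rotated vectors themselves still carry the absolute order.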

Phase 2: Multi-modal coordinated pre-training
Building upon the results of phase 1, we propose a contrastive sample construction method that uses contrastive learning to train MMCPP coordinately, allowing the model to integrate the first-order and second-order attributes of places from multi-modal data while maintaining each encoder's independent encoding ability.
Figure 4 shows the training process of phase 2.
To begin, a batch of places is randomly selected from the research area, and corresponding POIs and images are collected.
Trajectories are then sampled from the existing collections of trajectories that include these places as stop points. After the data of each modality is input into its respective encoder, the representation of each place in each modality is output.
where $\mathcal{L}_{A-B}$ is the symmetric cross-entropy loss between modalities A and B, $q$ is a unit (identity) matrix, and $p$ is the similarity matrix of representations between the modalities.
For places without POIs or trajectories, we use the encoded representations of "pseudo-POIs" and "pseudo-trajectories," respectively. The "pseudo-trajectory" consists of a single place ID, and its timestamp can be initialized randomly because the "first time interval" of the first point of a trajectory is always converted to 1. This approach unifies the inputs across all modalities in the multi-modal contrastive learning task, including cases where some modalities are missing.
Upon completion of phase 1 and phase 2 pre-training, we obtain three encoder components that integrate information from the other modalities and can still encode independently: the POI encoder $\mathcal{F}_P$, the image encoder $\mathcal{F}_I$, and the trajectory encoder $\mathcal{F}_T$.
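The phase 2 objective, a sum of three symmetric cross-entropy losses over cross-modal cosine similarity matrices with the identity matrix as the target, can be sketched in plain Python as follows. This is a hedged illustration rather than the paper's implementation: the `temperature` scaling is borrowed from CLIP-style training and is an assumption, as are the function names and the toy two-dimensional embeddings.

```python
import math

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between two batches of vectors."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b_n] for u in a_n]

def cross_entropy_identity(p):
    """-sum(q * log(softmax(p))) with q the identity matrix (row-wise softmax)."""
    loss = 0.0
    for i, row in enumerate(p):
        top = max(row)
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        loss -= row[i] - log_z
    return loss

def symmetric_loss(a, b, temperature=0.07):
    """Average of the A-to-B and B-to-A cross entropies (temperature is assumed)."""
    p = [[s / temperature for s in row] for row in cosine_similarity_matrix(a, b)]
    p_t = [list(col) for col in zip(*p)]
    return 0.5 * (cross_entropy_identity(p) + cross_entropy_identity(p_t))

def total_loss(poi, img, traj):
    """Sum of the POI-image, POI-trajectory, and trajectory-image losses."""
    return (symmetric_loss(poi, img)
            + symmetric_loss(poi, traj)
            + symmetric_loss(traj, img))

# Toy batch of two places; row i of every modality describes place i.
poi = [[1.0, 0.0], [0.0, 1.0]]
img = [[0.9, 0.1], [0.1, 0.9]]
traj = [[0.8, 0.2], [0.2, 0.8]]

aligned = total_loss(poi, img, traj)
# Swapping the image rows breaks the place alignment and raises the loss.
misaligned = total_loss(poi, img[::-1], traj)
```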

EXPERIMENTS
To assess the quality of the place representations generated by MMCPP, they were incorporated into a location prediction model and compared with other location embedding methods.

Study area
The rectangular envelope of the Third Ring Road in Wuhan, China, is taken as the research area. Considering the sampling interval and average speed of most taxi trajectories, the research unit is set as a longitude-latitude geographic grid with a side length of 0.0018° (each grid cell is about 160 meters × 200 meters in the WGS84 coordinate system). The research area comprises a total of 22,950 geographic grids (170 × 135).
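As an illustration of how such a grid can be indexed, the sketch below maps a coordinate to a grid ID. The south-west origin `(ORIGIN_LON, ORIGIN_LAT)`, the choice of 170 columns by 135 rows, and the row-major ID scheme are hypothetical; the paper only states the cell size and the 170 × 135 layout.

```python
import math

GRID_SIZE_DEG = 0.0018          # side length of one grid cell, from the text
N_COLS, N_ROWS = 170, 135       # assumed orientation of the 170 x 135 layout
ORIGIN_LON, ORIGIN_LAT = 114.0, 30.4   # hypothetical south-west corner

def grid_id(lon, lat):
    """Return the row-major grid ID of a point, or None if it is outside the area."""
    col = math.floor((lon - ORIGIN_LON) / GRID_SIZE_DEG)
    row = math.floor((lat - ORIGIN_LAT) / GRID_SIZE_DEG)
    if not (0 <= col < N_COLS and 0 <= row < N_ROWS):
        return None
    return row * N_COLS + col

total_cells = N_COLS * N_ROWS
# A point in the middle of the cell at column 2, row 1.
example_id = grid_id(ORIGIN_LON + 2.5 * GRID_SIZE_DEG, ORIGIN_LAT + 1.5 * GRID_SIZE_DEG)
outside = grid_id(ORIGIN_LON - 0.1, ORIGIN_LAT)
```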

Datasets
The proposed model uses POIs, maptiles, and taxi trajectories as data sources for the corresponding three modalities.

Points of Interest (POIs)
POIs describe the typical functions and activities of an area and can be used to mine the functional semantics of cities, as shown in previous studies (Jenkins et al. 2019). The POI data used in this study were obtained in 2018 from the Amap platform (https://lbs.amap.com) and cover 17 categories.

Images
Maptiles offer geospatial information such as the shape, color, and size of ground objects (Wang, Wang, et al. 2020). The maptiles used in this study were obtained from the TencentMap platform, with a zoom level of 17, a resolution of around 1 meter, and three channels. As a data augmentation measure, maptiles of cities of similar size to Wuhan were also collected and mixed with the Wuhan data for model pre-training. These included parts of four Chinese cities: Hangzhou, Chengdu, Nanjing, and Changsha. All maptiles were resampled to 0.000009°, so that each geographic grid of 0.0018° contained 200×200 pixels.

Trajectories
The passenger trajectories are extracted from a taxi trajectory dataset to simulate dynamic interactions between the grids. "Meshing" converts trajectory points into their corresponding grids based on latitude and longitude, resulting in grid ID sequences. Because the aim for taxi trajectory data is to model the interactions between grids, the trajectory points inside each grid are merged, and only the first trajectory point entering the grid is retained. As a special case, when multiple trajectory points fall within the final grid of a sequence, the last trajectory point in that grid is kept to preserve the complete time span of the trajectory. This transformation process is illustrated in Figure 5. Finally, any transformed trajectory with a length shorter than three is filtered out.
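The merging rules described above can be sketched as follows. The input is assumed to be already meshed into chronological `(grid_id, timestamp)` pairs, and consecutive points in the same grid are treated as one visit; both are assumptions of this illustration, as is the function name `merge_by_grid`.

```python
def merge_by_grid(points, min_length=3):
    """Merge consecutive points that fall in the same grid.

    Keeps the first point of every grid visit, except in the final grid, where
    the last point is kept so that the full time span is preserved. Returns
    None for trajectories shorter than `min_length` after merging.
    """
    merged = []
    for gid, ts in points:
        if merged and merged[-1][0] == gid:
            continue  # still inside the same grid: keep only the entering point
        merged.append((gid, ts))
    if merged:
        # The last raw point always lies in the final grid of the sequence.
        merged[-1] = (merged[-1][0], points[-1][1])
    return merged if len(merged) >= min_length else None

raw = [(5, 0), (5, 10), (6, 20), (7, 30), (7, 40), (7, 50)]
merged = merge_by_grid(raw)

# After merging, this trajectory has only two grids and is filtered out.
too_short = merge_by_grid([(5, 0), (5, 10), (6, 20)])
```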

Figure 5. The process of meshing and merging trajectories
The final trajectory dataset contains 2,533,754 passenger trajectories from 4,121 taxis in the first 4 weeks (May 27, 2019 to June 23, 2019) and 663,574 passenger trajectories from 4,129 taxis in the last week (June 24, 2019 to June 30, 2019). The data cover more than 90% of the grids in the study area.

Baseline Place Representation Methods
To evaluate MMCPP, we compare it with two classic representation methods and two state-of-the-art place embedding methods.

RegionEncoder (Jenkins et al. 2019):
A deep learning model for learning low-dimensional distributed representations of discrete spatial regions, which utilizes POIs, satellite images, and taxi trajectories as data sources. In our experiments, the satellite images were replaced by maptiles.

Settings
During the single-modal pre-training in phase 1, we used POI data from Wuhan, maptiles from Wuhan and 4 other cities, and trajectory data from the first 4 weeks in Wuhan as the pre-training datasets for the respective modalities. For the POI encoder of MMCPP, we implemented an input layer with a hidden size of 128 and an attention pooling layer with a hidden size of 128. The POI encoder was optimized with a batch size of 512, a maximum of 150 training epochs, and AdamW with a learning rate of 0.0002. Regarding the MAE structure, the image encoder adopted a 4-layer ViT with an input layer of hidden size 128, an 8-head self-attention mechanism, and a feed-forward layer with a hidden size of 512. The decoder had 2 Transformer layers with an 8-head self-attention mechanism and a feed-forward layer with a hidden size of 512. The MAE was optimized with a batch size of 32, a maximum of 50 training epochs, and AdamW with a learning rate of 0.0002. The trajectory encoder used 3 layers of RodtFormer with an input layer of hidden size 128, an 8-head self-attention mechanism, and a feed-forward layer with a hidden size of 512. The trajectory encoder was optimized with a batch size of 128, a maximum of 20 training epochs, and AdamW with a learning rate of 0.0008.
During phase 2, the multi-modal coordinated pre-training, we used the single-modal components that performed best in phase 1, and the number of sampled trajectories was set to 20. For phase 2 optimization, we used a batch size of 128, a maximum of 20 training epochs, and AdamW with a learning rate of 0.00001.
In both pre-training phases, 95% of each dataset was allocated to the training set and the remaining 5% was used for validation. The losses on the validation sets were used to decide whether to stop training early.
To evaluate the models, we used the trajectory dataset from the last week as the downstream task data. Of this dataset, 94% was used to train the models, 1% was set aside as the validation set for selecting the best downstream task model, and the remaining 5% was used to evaluate the performance of the downstream models.

Experimental Results
Figure 6 compares the performance of different models on location prediction after the next n seconds, with n set to 30, 180, and 300. In the chart, a redder color indicates better performance on the downstream task metric, while a bluer color indicates worse performance. Values in bold mark the best result in each row.
In the experiment, the model trained with the place representation encoded by MMCPP outperformed the baseline methods on the test set, with an F1 score 1.96% higher on average. [...] training process. Furthermore, Yan et al. 2019 found that the dot-product operation in Transformer encoding does not appropriately model time-interval characteristics, which makes it less effective than MMCPP in this task. The embedding produced by RegionEncoder performs the worst on the task, possibly because its convolutional neural network structure is weaker than ViT at encoding images. Additionally, RegionEncoder only considers the interaction between pick-up and drop-off points, ignoring finer details such as the visit order and visit time intervals contained in fine-grained trajectories. Lastly, due to the sparseness of the taxi flow matrix and POI data, it may be challenging to build sufficient place representations.
MMCPP integrates the information of multiple modalities, uses the rotary temporal interval position embedding to introduce the time information of the trajectory sequence, and considers both the absolute chronological order and the relative time intervals among places. These designs enable MMCPP to build higher-quality place representations during pre-training, which help downstream location prediction models achieve better performance.

Ablation Study
Ablation experiments are used to verify the effectiveness of the rotary temporal interval position embedding and the effectiveness of integrating various multi-modal components.

Verification of the effectiveness of the rotary temporal interval position embedding
This paper compares the position embedding marked as deltaT(sμ)+RoPE with three variants shown in Table 1.

Figure 1. The structure of the multi-modal contrastive pre-training model for place representation (MMCPP)

Figure 4. Phase 2: multi-modal coordinated pre-training. Three encoders that integrate multi-modal information and can encode independently are obtained after pre-training.

Using the alignment relationship that "information from different modalities describes the same place," representations of the same place in each modality are paired as positive samples (colored squares on the diagonal in Figure 4(d)). Representations of different places in each modality are paired as negative samples (white squares in Figure 4(d)). The three encoders coordinately learn a multi-modal embedding space through contrastive learning: they maximize the cosine similarity between the POI and image representations, the POI and trajectory representations, and the trajectory and image representations of the 3N positive pairs in the batch, while minimizing the cosine similarity of the embeddings of the 3(N² − N) negative pairs. We optimize the sum of the three symmetric cross-entropy losses $\mathcal{L}$ (Radford et al. 2021) over these similarity scores, which are computed as follows:

$$\mathcal{L} = \mathcal{L}_{P-I} + \mathcal{L}_{P-T} + \mathcal{L}_{T-I} \tag{5}$$
$$\mathcal{L}_{A-B} = \frac{H(q, p) + H(q, p^{\top})}{2} \tag{6}$$
$$H(q, p) = -\mathrm{sum}(q \odot \log(\mathrm{softmax}(p))) \tag{7}$$

CBOW (Mikolov et al. 2013):
An implementation of Word2Vec, which captures the semantics of sequence points through the correlation between the target place and its context.

Skip-Gram (Mikolov et al. 2013):
Another implementation of Word2Vec. It takes target places as input and predicts the context places within a certain window as output, thus capturing the semantics of sequence points.

CTLE (Lin et al. 2021):
The Context and Time aware Location Embedding model uses the Transformer as the backbone with a sinusoidal timestamp position embedding, and uses masked trajectory, masked hour, and masked weekday recovery as the pre-training objectives.
Figure 5 panels: (a) Original trajectory; (b) Meshing trajectory; (c) Merge trajectory points.

The location prediction models with different initial grid representations are trained with the cross-entropy loss and evaluated with the weighted F1 score. For all embedding models, the dimension of the place representation vectors is set to 128. We implement CTLE as a 3-layer network with a hidden size of 512. CTLE was pre-trained on the training sets for 100 epochs using AdamW with a learning rate of 0.00008. Following Lin et al. 2021, for our MMCPP model and the CTLE model we used the place ID encoding layer of the trajectory encoder, in MMCPP (marked as Traj-GE(MMCPP)) and in CTLE respectively, to obtain the grid representations. The place representations produced by the baselines and by the method in this paper are fixed as non-learnable parameters, and all downstream task models use a single-layer LSTM with a hidden vector dimension of 128. All models are trained with early stopping to select the best-performing epoch on the evaluation sets. AdamW with an initial learning rate of 0.001 was chosen as the optimizer for the LSTM in the downstream task. We implement all baseline models and our model in PyTorch. All experiments were run on Intel(R) Xeon(R) CPUs and NVIDIA Tesla T4 GPUs.
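For reference, the weighted F1 score used for evaluation averages the per-class F1 scores with weights proportional to each class's support in the ground truth. The sketch below is a plain-Python illustration with made-up labels, not the evaluation code used in the experiments.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy grid IDs: true next locations versus predicted next locations.
y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 3]
score = weighted_f1(y_true, y_pred)
```

In this toy case the per-class F1 scores are 2/3, 2/3, and 1 with supports 2, 1, and 1, so the weighted average is 0.75. Weighting by support suits location prediction, where a few grids are visited far more often than others.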

Figure 6. Experimental results of the downstream task using different grid representations encoded by the corresponding place representation methods

Figure 7. Experimental results of the downstream task using models with different position embedding methods

The results are presented in Figure 7.

Table 1. List of models for the ablation experiment with different position embedding methods