Cross-view geo-localization aims to localize a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task must accommodate diverse ground images captured by users with varying orientations and limited fields of view (FoVs).
However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. To tackle this challenge, we propose ConGeo, a single- and cross-modal Contrastive method for Geo-localization that improves the base model's robustness across various ground view variations.
ConGeo enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location.
ConGeo's learning pipeline. For feature representation in the left and right boxes, the North-aligned ground image, the transformed ground image, and the aerial view are sent to their respective encoders. Then in the feature space, the single- and cross-modal contrastive learning losses are applied to enforce the proximity of the paired images.
The proposed training objective combines cross-modal and single-modal contrastive learning through four losses.
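As a rough illustration of the two kinds of contrastive terms described above, the following sketch implements a generic symmetric InfoNCE loss in plain Python. It is not ConGeo's exact formulation: the four individual losses and their weighting are not reproduced here, and the function names, the temperature value, and the toy embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] form the positive pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(x, y, temperature=0.1):
    """Average the loss in both retrieval directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Toy embeddings (hypothetical): two locations, 2-D features.
ground = [[1.0, 0.0], [0.0, 1.0]]               # North-aligned ground views
ground_transformed = [[0.8, 0.2], [0.2, 0.8]]   # shifted or FoV-cropped views
aerial = [[0.9, 0.1], [0.1, 0.9]]               # aerial views

cross_modal = symmetric_info_nce(ground, aerial)
single_modal = symmetric_info_nce(ground, ground_transformed)
```

In this sketch the cross-modal term pulls each ground embedding toward the aerial embedding of the same location, while the single-modal term pulls the transformed ground view toward its North-aligned counterpart.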
Experiments are performed on four cross-view geo-localization benchmarks to investigate ConGeo's robustness, adaptability, and generalization ability. Analyses of orientation invariance and activation maps further confirm ConGeo's robustness.
Using a single model, ConGeo outperforms SOTA FoV-specific or orientation-specific models under ground-view variations and achieves leading performance under North alignment.
Comparison with SOTA FoV-specific methods on different settings on the CVUSA dataset.
Retrieval results of the baseline model and of ConGeo under the North-aligned setting and the limited FoV setting.
ConGeo shows effectiveness compared with different data augmentation methods.
Comparison between ConGeo and different data augmentation methods under the unknown orientation setting and the limited FoV setting on the CVUSA dataset. “Shift” denotes using shifted query images, “FoV” denotes using query images of limited FoVs, and “Rotate” denotes randomly rotating aerial images as data augmentation.
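The "Shift" and "FoV" ground view variations can be sketched as simple operations on a panorama stored as a list of pixel rows. This is a minimal illustration, assuming the panorama width spans the full 360 degrees; the function names, the shift direction, and the rounding of angles to whole columns are choices made for the example rather than details taken from the ConGeo code.

```python
def cyclic_shift(panorama, angle_deg):
    """Cyclically shift a 360-degree panorama horizontally by angle_deg.

    Columns that leave one side of the image re-enter on the other side,
    which simulates a ground camera with a different heading.
    """
    width = len(panorama[0])
    shift = round(angle_deg / 360.0 * width) % width
    return [row[shift:] + row[:shift] for row in panorama]


def crop_fov(panorama, fov_deg, start_deg=0.0):
    """Keep only a fov_deg-wide window starting at start_deg.

    The window wraps around the panorama seam when it runs past the
    right edge, which simulates a limited-FoV query image.
    """
    width = len(panorama[0])
    start = round(start_deg / 360.0 * width) % width
    n_cols = max(1, round(fov_deg / 360.0 * width))
    return [[row[(start + k) % width] for k in range(n_cols)] for row in panorama]


# Toy panorama: one row of 8 columns, so each column covers 45 degrees.
panorama = [[0, 1, 2, 3, 4, 5, 6, 7]]
shifted = cyclic_shift(panorama, 90)        # [[2, 3, 4, 5, 6, 7, 0, 1]]
limited = crop_fov(panorama, 180, 270)      # [[6, 7, 0, 1]]
```

Sampling the shift angle and the window start at random would yield query images with unknown orientation and limited FoV.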
ConGeo generalizes better to unseen ground view variations than the baseline model, with or without data augmentation.
Comparison on unseen ground view variations (e.g., Random FoVs, Random Zooming, Gaussian Noise, and Motion Blur) between ConGeo and baselines on the CVUSA dataset. “DA” means data augmentation.
Orientation invariance analysis that showcases models’ vulnerabilities to orientation shifts and activation map visualization that investigates the models’ focus.
ConGeo shows better orientation invariance. We cyclically shift the ground view by a given angle (x-axis) and feed it to the model to test its retrieval performance. Note that “N-A” denotes the North-aligned setting and “DA” means data augmentation.
ConGeo’s activation areas are more consistent across ground view shifts.
@article{mi2024congeo,
title={ConGeo: Robust Cross-view Geo-localization across Ground View Variations},
author={Mi, Li and Xu, Chang and Castillo-Navarro, Javiera and Montariol, Syrielle and Yang, Wen and Bosselut, Antoine and Tuia, Devis},
journal={arXiv preprint arXiv:2403.13965},
year={2024}
}