ConGeo:
Robust Cross-view Geo-localization across Ground View Variations

Li Mi^*, Chang Xu^*, Javiera Castillo-Navarro, Syrielle Montariol,

Wen Yang, Antoine Bosselut and Devis Tuia

EPFL, Wuhan University

ECCV 2024
^*Indicates Equal Contribution

ConGeo enhances base models’ robustness to ground view variations by aligning the representations of the same location.

As a generic learning objective for cross-view geo-localization, when integrated into state-of-the-art pipelines, ConGeo significantly boosts the performance of three base models on four geo-localization benchmarks for diverse ground view variations.

Motivation

Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view. In real-world scenarios, the task must accommodate diverse ground images captured by users with varying orientations and reduced fields of view (FoVs), including the following settings:

North-aligned: Ground-view and aerial-view images are aligned to the North.
Unknown Orientation: Ground-view images are of arbitrary orientations (FoV=360°).
Limited Field of View: Ground-view images are of arbitrary orientations and limited fields of view (e.g., FoV=70°, 90°, or 180°).
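
To make the three settings concrete, here is a minimal sketch of how the ground-view variations could be simulated from a 360° panorama. It assumes the panorama is stored as a list of pixel rows whose columns span the full 360°; the helper names (shift_panorama, limit_fov, random_ground_view) are hypothetical and are not taken from the ConGeo code.

```python
import random

def shift_panorama(pano, angle_deg):
    """Cyclically shift a 360-degree panorama (list of pixel rows) by angle_deg."""
    width = len(pano[0])
    offset = round(angle_deg / 360.0 * width) % width
    # Columns that leave the left edge wrap around to the right edge.
    return [row[offset:] + row[:offset] for row in pano]

def limit_fov(pano, fov_deg, start_deg=0.0):
    """Keep a window of fov_deg degrees that starts at heading start_deg."""
    width = len(pano[0])
    crop_w = round(fov_deg / 360.0 * width)
    shifted = shift_panorama(pano, start_deg)
    return [row[:crop_w] for row in shifted]

def random_ground_view(pano, fov_deg=360.0, rng=random):
    """Unknown orientation when fov_deg == 360, limited FoV when fov_deg < 360."""
    start = rng.uniform(0.0, 360.0)
    return limit_fov(pano, fov_deg, start)

# Toy panorama: 2 rows, 8 columns, each pixel labelled by its column index.
pano = [list(range(8)) for _ in range(2)]
shifted = shift_panorama(pano, 90)   # 90 degrees corresponds to 2 of 8 columns
cropped = limit_fov(pano, 180, 90)   # 180-degree window starting at 90 degrees
```

In this toy setup, the North-aligned setting corresponds to using pano unchanged, while random_ground_view covers the other two settings depending on fov_deg.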

However, existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations. To tackle this challenge, we propose ConGeo, a single- and cross-modal Contrastive method for Geo-localization that improves the base model's robustness across various ground view variations.

Method

ConGeo enhances robustness and consistency in feature representations to improve a model's invariance to orientation and its resilience to FoV variations, by enforcing proximity between ground view variations of the same location.

ConGeo's learning pipeline. For feature representation in the left and right boxes, the North-aligned ground image, the transformed ground image, and the aerial view are sent to their respective encoders. Then in the feature space, the single- and cross-modal contrastive learning losses are applied to enforce the proximity of the paired images.

The proposed training objective combines cross-modal and single-modal contrastive learning through four losses.
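
As a rough illustration of what such a contrastive term can look like, the sketch below assumes an InfoNCE-style loss over batches of paired embeddings. The function names, the temperature value, and the symmetric averaging are assumptions made for this example; it does not reproduce ConGeo's four losses or their weighting.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair."""
    q = [l2_normalize(x) for x in queries]
    k = [l2_normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def symmetric_info_nce(a, b, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE terms (an assumed design choice)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

# Toy embeddings for two locations (row i of each list is the same place).
ground = [[1.0, 0.0], [0.0, 1.0]]              # North-aligned ground view
ground_transformed = [[0.9, 0.1], [0.2, 0.8]]  # shifted or cropped ground view
aerial = [[1.0, 0.1], [0.1, 1.0]]              # aerial view

# Cross-modal terms pull each ground variant towards its aerial view.
cross_modal = (symmetric_info_nce(ground, aerial)
               + symmetric_info_nce(ground_transformed, aerial))
# A single-modal term pulls the two ground variants of a location together.
single_modal = symmetric_info_nce(ground, ground_transformed)
```

Each term decreases as the paired embeddings of the same location move closer together relative to the other locations in the batch, which is the proximity constraint described above.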

Results

Experiments are performed on four cross-view geo-localization benchmarks to investigate ConGeo's robustness, adaptability, and generalization ability. Analyses of orientation invariance and activation maps further confirm ConGeo's robustness.

Robustness across different settings

Using a single model, ConGeo outperforms SOTA FoV-specific or orientation-specific models under ground-view variations and achieves leading performance under North alignment.

Comparison with SOTA FoV-specific methods on different settings on the CVUSA dataset.

Retrieval results of the baseline model and of ConGeo under the North-aligned setting and the limited FoV setting.

Superiority over data augmentations

ConGeo is more effective than the data augmentation methods it is compared with.

Comparison between ConGeo and different data augmentation methods on the CVUSA dataset under the unknown orientation setting and the limited FoV setting. “Shift” denotes using shifted query images, “FoV” denotes using query images of limited FoVs, and “Rotate” denotes randomly rotating aerial images as data augmentation.

Generalization ability on unseen ground view variations

ConGeo generalizes better to unseen ground view variations than the baseline model, with or without data augmentation.

Comparison on unseen ground view variations (e.g., Random FoVs, Random Zooming, Gaussian Noise, and Motion Blur) between ConGeo and baselines on the CVUSA dataset. “DA” means data augmentation.

Analysis: How does ConGeo achieve robustness?

We conduct an orientation invariance analysis, which exposes the models’ vulnerabilities to orientation shifts, and an activation map visualization, which investigates where the models focus.

ConGeo shows better orientation invariance. We cyclically shift the ground view by a given angle (x-axis) before feeding it to the model and measure the resulting retrieval performance. Note that “N-A” denotes the North-aligned setting and “DA” means data augmentation.
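
The shift-and-retrieve protocol can be sketched as follows. This is a toy illustration under stated assumptions: panoramas are lists of pixel rows, retrieval uses cosine similarity, and the two encoders (sorted_pixels and raw_pixels) are hypothetical stand-ins for a shift-invariant and a shift-sensitive model, not ConGeo or any baseline from the comparison.

```python
import math

def shift_panorama(pano, angle_deg):
    """Cyclically shift a 360-degree panorama (list of pixel rows) by angle_deg."""
    width = len(pano[0])
    offset = round(angle_deg / 360.0 * width) % width
    return [row[offset:] + row[:offset] for row in pano]

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def top1_accuracy(ground_embs, aerial_embs):
    """Fraction of queries whose best-scoring reference has the same index."""
    hits = 0
    for i, g in enumerate(ground_embs):
        scores = [cosine(g, ref) for ref in aerial_embs]
        hits += int(scores.index(max(scores)) == i)
    return hits / len(ground_embs)

def orientation_sweep(panos, aerial_embs, embed_ground, angles):
    """Top-1 retrieval accuracy for each cyclic shift angle of the query."""
    results = {}
    for angle in angles:
        embs = [embed_ground(shift_panorama(p, angle)) for p in panos]
        results[angle] = top1_accuracy(embs, aerial_embs)
    return results

# Hypothetical toy encoders: one ignores column order, the other depends on it.
def sorted_pixels(pano):
    return sorted(pano[0])

def raw_pixels(pano):
    return list(pano[0])

# Toy data: three one-row panoramas; references are embeddings of unshifted views.
panos = [[[1, 0, 0, 0]], [[1, 1, 0, 0]], [[1, 1, 1, 0]]]
angles = [0, 90, 180, 270]
invariant = orientation_sweep(panos, [sorted_pixels(p) for p in panos], sorted_pixels, angles)
sensitive = orientation_sweep(panos, [raw_pixels(p) for p in panos], raw_pixels, angles)
```

Plotting accuracy against the shift angle for each model gives the kind of curve described in the caption: a flat curve indicates orientation invariance, while a curve that drops away from 0° indicates sensitivity to orientation.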

ConGeo’s activation areas are more consistent across ground view shifts.

BibTeX

@article{mi2024congeo,
  title={ConGeo: Robust Cross-view Geo-localization across Ground View Variations},
  author={Mi, Li and Xu, Chang and Castillo-Navarro, Javiera and Montariol, Syrielle and Yang, Wen and Bosselut, Antoine and Tuia, Devis},
  journal={arXiv preprint arXiv:2403.13965},
  year={2024}
}