Prompting-MammAlps: Fine-Grained Text-to-Video Retrieval for Camera-Trap Data

Valentin Gabeff, Baptiste Maquignaz, Jennifer Shan, Sepideh Mamooler, Gencer Sumbul, Blair Costelloe, Devis Tuia, Alexander Mathis.

Overview

MammAlps-S2 extends MammAlps with a second field season (2024). It contains 2,865 video clips from camera traps deployed at three sites in the Swiss National Park, annotated with per-frame bounding boxes, individual tracks, and behavioural labels: species, action, activity, and, for deer, age and sex. Compared to MammAlps, it adds two new species (chamois and marten), extends the label vocabulary to 23 actions and 12 activities, and covers an earlier season capturing different ecological dynamics (roe deer courtship, distinct species occurrences).

Built on these dense annotations, we introduce Prompting-MammAlps, a text-to-video retrieval (TVR) benchmark with 135 natural-language queries. Each query is matched to the subset of train/test videos whose content fully satisfies the query; partial matches are excluded. The benchmark is designed to reflect realistic retrieval conditions, including queries with no matching video and queries that involve multiple individuals or require temporal context.

The dataset is publicly released to encourage research in text-to-video retrieval, species and behaviour recognition, multi-animal tracking, action segmentation, and spatio-temporal action localisation.

2,865 video clips
18.4 h total duration
15 unique camera views
135 text queries
2 field seasons (2023–2024)
23 actions | 12 activities | 7 species
Query: A juvenile red deer scratching its body.
Query: An adult male roe deer running while participating in courtship.
Query: An animal foraging, then reacting to a camera, and then returning to foraging.

Examples of query-video associations. Two matching videos with ground-truth associations are shown for each query.

Text-to-Video Retrieval Benchmark: Prompting-MammAlps

Prompting-MammAlps includes 135 natural-language queries manually defined in collaboration with behavioural ecologists. Each query is paired with the subset of train and test videos whose content fully satisfies the query — partial matches are excluded to enforce strict retrieval semantics. Some queries have no matching video (reflecting realistic field conditions where an event may never have been captured).
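The strict matching rule described above can be illustrated with a minimal sketch. It assumes each video is summarised as a list of individuals, each carrying a set of labels per attribute, and that a query is a set of required attribute values that must all hold for the same individual. The data layout, the label strings, the clip names, and the helper names (`individual_satisfies`, `retrieve`) are hypothetical and are not taken from the released annotation files.

```python
from typing import Dict, List, Set

# Hypothetical layout: one dict per tracked individual, mapping an
# attribute name to the set of labels observed for that individual.
Individual = Dict[str, Set[str]]
Query = Dict[str, str]


def individual_satisfies(individual: Individual, query: Query) -> bool:
    """True only if every required attribute value is present on this individual."""
    return all(value in individual.get(attr, set()) for attr, value in query.items())


def retrieve(videos: Dict[str, List[Individual]], query: Query) -> List[str]:
    """Return the ids of videos in which at least one individual fully satisfies the query."""
    return sorted(
        video_id
        for video_id, individuals in videos.items()
        if any(individual_satisfies(ind, query) for ind in individuals)
    )


# Toy annotations (illustrative only).
videos = {
    "clip_a": [
        {"species": {"red deer"}, "age": {"juvenile"}, "action": {"walking", "scratching body"}},
    ],
    # Partial match: the juvenile walks while a different, adult deer scratches.
    "clip_b": [
        {"species": {"red deer"}, "age": {"juvenile"}, "action": {"walking"}},
        {"species": {"red deer"}, "age": {"adult"}, "action": {"scratching body"}},
    ],
    "clip_c": [
        {"species": {"roe deer"}, "age": {"adult"}, "sex": {"male"},
         "action": {"running"}, "activity": {"courtship"}},
    ],
}

juvenile_scratching = {"species": "red deer", "age": "juvenile", "action": "scratching body"}
chamois_running = {"species": "chamois", "action": "running"}

print(retrieve(videos, juvenile_scratching))  # ['clip_a']
print(retrieve(videos, chamois_running))      # []
```

In this sketch, `clip_b` is rejected because no single individual carries all three required attributes, and the second query returns an empty list, mirroring the case of an event that was never captured.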

Ecological Relevance

Rare species, rare behaviours, or specific weather conditions
Courtship-related behaviours
Other social interactions (chasing, playing, nursing…)
Animals reacting to the camera trap
Frequent baseline events

Retrieval Complexity

A single attribute (e.g., species or action)
Two or more attributes combined
Two or more individuals simultaneously
Temporal or relational reasoning beyond attribute presence
Content similarity to a reference video
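Temporal queries such as "An animal foraging, then reacting to a camera, and then returning to foraging" cannot be resolved from attribute presence alone, because the order of events matters. The sketch below checks whether a required sequence of actions occurs in order within one individual's action timeline. It assumes the timeline is available as an ordered list of action labels and that other actions may occur in between; both the representation and the label strings are assumptions made for illustration, not the benchmark's actual matching procedure.

```python
from typing import Iterable, Sequence


def occurs_in_order(timeline: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if the labels in `pattern` appear in `timeline` in the same order.

    Other actions may occur between the required steps (an assumption of
    this sketch). Each membership test consumes the shared iterator, so a
    later step can only match after the earlier one was found.
    """
    remaining = iter(timeline)
    return all(step in remaining for step in pattern)


# Hypothetical action labels for the example query.
pattern = ["foraging", "looking at camera", "foraging"]

returns_to_foraging = ["walking", "foraging", "looking at camera", "foraging"]
walks_away = ["foraging", "looking at camera", "walking"]
wrong_order = ["looking at camera", "foraging"]

print(occurs_in_order(returns_to_foraging, pattern))  # True
print(occurs_in_order(walks_away, pattern))           # False
print(occurs_in_order(wrong_order, pattern))          # False
```

A clip in which the animal forages and looks at the camera but never resumes foraging contains every required label, yet it is rejected here, which is the distinction between temporal reasoning and attribute presence.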

Getting Started

Follow the steps below to get started:

  1. Download the Dataset: Visit the Hugging Face repository to download the dataset.
  2. Documentation: Check out the GitHub repository for detailed instructions and examples.

If you run into problems, feel free to open a ticket on our GitHub Issues page.