Prompting-MammAlps

Overview

MammAlps-S2 extends MammAlps with a second field season (2024). It contains 2,865 video clips from camera traps deployed at three sites in the Swiss National Park, annotated with per-frame bounding boxes, individual tracks, and behavioural labels (species, action, activity, and, for deer, age and sex). Compared to MammAlps, it adds two new species (chamois and marten), extends the label vocabulary to 23 actions and 12 activities, and covers an earlier season, capturing different ecological dynamics (roe deer courtship, distinct species occurrences).

Built on these dense annotations, we introduce Prompting-MammAlps, a text-to-video retrieval (TVR) benchmark with 135 natural-language queries. Each query is matched to the subset of train/test videos whose content fully satisfies the query; partial matches are excluded. The benchmark is designed to reflect realistic retrieval conditions, including queries with no matching video and queries spanning multiple individuals or temporal context.
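To make the "fully satisfies" rule concrete, here is a minimal sketch of strict query-to-video matching. It assumes a hypothetical annotation schema (the `Individual` class, the query dictionary keys, and the label strings are illustrative and not taken from the released dataset): a clip matches only if at least one tracked individual carries every attribute the query asks for.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Individual:
    """One tracked animal in a clip (hypothetical schema, for illustration only)."""
    species: str
    actions: FrozenSet[str]
    age: Optional[str] = None
    sex: Optional[str] = None


def individual_satisfies(ind: Individual, query: dict) -> bool:
    """True only if every attribute requested by the query holds for this individual."""
    if "species" in query and ind.species != query["species"]:
        return False
    if "age" in query and ind.age != query["age"]:
        return False
    if "sex" in query and ind.sex != query["sex"]:
        return False
    # Every requested action must have been observed for this individual.
    return set(query.get("actions", ())) <= ind.actions


def matching_videos(query: dict, videos: Dict[str, List[Individual]]) -> Set[str]:
    """Ids of clips in which at least one individual fully satisfies the query."""
    return {
        vid
        for vid, individuals in videos.items()
        if any(individual_satisfies(ind, query) for ind in individuals)
    }


# Toy annotations: only clip_a satisfies all three requested attributes.
videos = {
    "clip_a": [Individual("red deer", frozenset({"scratching body"}), age="juvenile")],
    "clip_b": [Individual("red deer", frozenset({"walking"}), age="juvenile")],  # partial match
    "clip_c": [Individual("roe deer", frozenset({"scratching body"}), age="adult")],
}
query = {"species": "red deer", "age": "juvenile", "actions": ["scratching body"]}
matches = matching_videos(query, videos)
print(matches)  # {'clip_a'}
```

A query for a species that never appears in the toy annotations returns an empty set, which corresponds to the "no matching video" case described above.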

The dataset is publicly released to encourage research in text-to-video retrieval, species and behaviour recognition, multi-animal tracking, action segmentation, and spatio-temporal action localisation.

2,865 video clips
18.4 h total duration
unique camera views
135 text queries
field seasons (2023–2024)
23 actions | 12 activities | 7 species

Query: A juvenile red deer scratching its body.

Query: An adult male roe deer running while participating in courtship.

Query: An animal foraging, then reacting to a camera, and then returning to foraging.

Examples of query-video associations. Two matching videos with ground-truth associations are shown for each query.

Text-to-Video Retrieval Benchmark: Prompting-MammAlps

Prompting-MammAlps includes 135 natural-language queries manually defined in collaboration with behavioural ecologists. Each query is paired with the subset of train and test videos whose content fully satisfies the query; partial matches are excluded to enforce strict retrieval semantics. Some queries have no matching video, reflecting realistic field conditions where an event may never have been captured.
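Queries with an empty ground-truth set need care at evaluation time, because recall-style metrics are undefined for them. The source does not state which metrics the benchmark uses, so the following is only a sketch under that assumption: it computes Recall@K per query and skips queries with no matching video when averaging. All function names, query ids, and clip ids are hypothetical.

```python
from typing import Dict, List, Optional, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> Optional[float]:
    """Fraction of relevant clips found in the top-k; None if nothing is relevant."""
    if not relevant:
        return None  # recall is undefined for a query with no matching video
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> Optional[float]:
    """Average Recall@K over the queries that have at least one relevant clip."""
    scores = [recall_at_k(rankings[q], ground_truth[q], k) for q in ground_truth]
    defined = [s for s in scores if s is not None]
    return sum(defined) / len(defined) if defined else None


# Toy example: q3 has no matching video and is excluded from the mean.
ground_truth = {"q1": {"v1", "v2"}, "q2": {"v3"}, "q3": set()}
rankings = {
    "q1": ["v1", "v4", "v2"],
    "q2": ["v1", "v2", "v3"],
    "q3": ["v2", "v1", "v3"],
}
mean_r2 = mean_recall_at_k(rankings, ground_truth, k=2)
print(mean_r2)  # 0.25
```

Skipping empty queries is one possible design choice. An alternative is to score them separately, for example by checking whether a model's top similarity score falls below a rejection threshold.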

Ecological Relevance

Rare species, rare behaviours, or specific weather conditions

Courtship-related behaviours

Other social interactions (chasing, playing, nursing…)

Animals reacting to the camera trap

Frequent baseline events

Retrieval Complexity

A single attribute (e.g., species or action)

Two or more attributes combined

Two or more individuals simultaneously

Temporal or relational reasoning beyond attribute presence

Find clips with content similar to a reference video
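Temporal queries such as "An animal foraging, then reacting to a camera, and then returning to foraging" cannot be resolved by attribute presence alone, since the order of actions matters. A minimal sketch of such a check follows, assuming per-frame action labels are available for one individual's track. The label strings and function names are hypothetical, and treating "then" as an ordered but not necessarily contiguous subsequence is an assumption of this sketch.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_runs(frame_actions: Sequence[str]) -> List[str]:
    """Collapse per-frame labels into consecutive segments: a a b b a -> a b a."""
    return [label for label, _ in groupby(frame_actions)]


def contains_in_order(segments: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `segments` as an ordered subsequence."""
    it = iter(segments)
    # `step in it` advances the iterator until it finds `step`, so each
    # pattern step must appear after the previous one.
    return all(step in it for step in pattern)


pattern = ["foraging", "looking at camera", "foraging"]

# Track that forages, reacts to the camera, then resumes foraging.
track_ok = ["foraging"] * 3 + ["looking at camera"] * 2 + ["foraging"] * 4
# Track with the same labels but in the wrong order.
track_bad = ["looking at camera"] * 2 + ["foraging"] * 3

print(contains_in_order(collapse_runs(track_ok), pattern))   # True
print(contains_in_order(collapse_runs(track_bad), pattern))  # False
```

Both tracks contain the same set of action labels, so a presence-only matcher would treat them identically; only the ordered check separates them.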

Getting Started

Follow the steps below to get started:

Download the Dataset: Visit the Hugging Face repository to download the dataset.
Documentation: Check out the GitHub repository for detailed instructions and examples.

If you encounter any issues, feel free to open an issue on our GitHub Issues page.

Prompting-MammAlps: Fine-Grained Text-to-Video Retrieval for Camera-Trap Data
