Extreme Rotation Estimation in the Wild

Hana Bezalel1 Dotan Ankri1 Ruojin Cai2 Hadar Averbuch-Elor1,2
1Tel Aviv University, 2Cornell University

Given a pair of images captured in the wild — e.g., under arbitrary illumination and intrinsic camera parameters — and in extreme settings (with little or no overlap), such as the images of the Dam Square in Amsterdam depicted in the red and blue boxes above, can we leverage 3D priors to estimate the relative 3D rotation between the images?


Abstract

We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping fields of view. Prior work targeting extreme rotation estimation assumes constrained 3D environments and emulates perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods.


Overview of our Method

We design a network architecture optimized for estimating 3D relative rotation from pairs of images captured in challenging, real-world conditions. Given a pair of input Internet images, we extract image features using a pretrained LoFTR model. These features are combined with auxiliary channels, including keypoint and pairwise-match masks and segmentation maps (visualized on the bottom left). The combined features are reshaped into tokens and concatenated with Euler angle position embeddings, which are then processed by our Rotation Estimation Transformer module. The output Euler angle tokens and averaged image tokens are concatenated and processed by MLPs to predict probability distributions over Euler angles, representing the relative 3D rotation between the input images.
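For intuition, below is a minimal PyTorch sketch of such a pipeline. It is not the authors' exact implementation: the feature dimensionality, the number of angle bins, the use of three learnable Euler-angle tokens, the layer sizes, and the class name RotationEstimationTransformer are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

NUM_BINS = 360  # assumed discretization: one bin per degree for each Euler angle

class RotationEstimationTransformer(nn.Module):
    """Toy sketch of the rotation estimation module described above."""

    def __init__(self, feat_dim=256, depth=6, heads=8):
        super().__init__()
        # Three learnable Euler-angle tokens (e.g., yaw, pitch, roll).
        self.euler_tokens = nn.Parameter(torch.randn(3, feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # One MLP head per Euler angle, predicting a distribution over angle bins.
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * feat_dim, feat_dim),
                           nn.ReLU(),
                           nn.Linear(feat_dim, NUM_BINS)) for _ in range(3)]
        )

    def forward(self, image_tokens):
        # image_tokens: (B, N, feat_dim) -- image features plus auxiliary
        # channels (keypoint/match masks, segmentation maps), flattened into tokens.
        B = image_tokens.shape[0]
        euler = self.euler_tokens.unsqueeze(0).expand(B, -1, -1)
        x = self.encoder(torch.cat([euler, image_tokens], dim=1))
        euler_out, img_avg = x[:, :3], x[:, 3:].mean(dim=1)
        # Concatenate each output Euler token with the averaged image token
        # and predict a probability distribution over its angle bins.
        return [head(torch.cat([euler_out[:, i], img_avg], dim=-1)).softmax(-1)
                for i, head in enumerate(self.heads)]
```

In a sketch like this, the three output distributions would be supervised with a classification loss against discretized ground-truth Euler angles, consistent with the probability-distribution formulation described above.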


ExtremeLandmarkPairs Dataset

In this paper, we present a new approach that tackles the problem of extreme rotation estimation in the wild. Internet (i.e., in-the-wild) images may vary due to a wide range of factors, including transient objects, weather conditions, time of day, and the cameras' intrinsic parameters. To explore this problem, we introduce a new dataset, ExtremeLandmarkPairs (ELP), assembled from publicly-available scene-level Internet image collections. This dataset contains a training set with nearly 34K non-overlapping pairs originating from over 2K unique landmarks, constructed from the MegaScenes dataset. Additionally, for evaluation, we have created two test sets to separately examine image pairs captured in a single-camera setting with constant illumination (sELP) and image pairs captured in the wild (wELP). These test sets are respectively sourced from the Cambridge Landmarks and MegaDepth datasets. The ExtremeLandmarkPairs (ELP) dataset can be accessed via this link.
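To make the pair annotations concrete, the sketch below derives a ground-truth relative rotation and per-angle class labels from the two absolute camera poses of a pair. The world-to-camera pose convention, the 'zyx' Euler order, the one-degree binning, and the function name relative_euler_labels are assumptions for illustration, not the dataset's exact format.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_euler_labels(R1, R2):
    """Ground-truth relative rotation for an image pair, as three 1-degree bin labels.

    R1, R2: 3x3 world-to-camera rotation matrices (assumed convention).
    """
    R_rel = R2 @ R1.T  # rotation taking camera-1 coordinates to camera-2 coordinates
    angles = Rotation.from_matrix(R_rel).as_euler('zyx', degrees=True)
    return ((angles + 180.0) % 360.0).astype(int)  # three bin indices in [0, 360)
```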


This figure shows the camera distribution for the Vatican, Rome scene from the ExtremeLandmarkPairs dataset. From the dense imagery reconstruction in (a), we construct real perspective image pairs with predominantly rotational motion, shown in (b) and (c).
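As a rough illustration of this kind of geometric filtering, the sketch below keeps a pair when the two camera centers are close relative to the scene scale while the relative rotation is large. The pose convention, the thresholds, and the function names are assumptions, not the exact construction procedure used to build the dataset.

```python
import numpy as np

def rotation_angle_deg(R_rel):
    """Geodesic magnitude of a rotation matrix, in degrees."""
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def is_rotation_dominant(R1, t1, R2, t2, scene_scale,
                         max_baseline_ratio=0.05, min_rot_deg=45.0):
    """Keep pairs with a small baseline but a large relative rotation.

    R, t: world-to-camera rotation and translation (assumed convention);
    the thresholds are illustrative.
    """
    c1, c2 = -R1.T @ t1, -R2.T @ t2          # camera centers in world coordinates
    baseline = np.linalg.norm(c1 - c2)
    R_rel = R2 @ R1.T                        # relative rotation between the views
    return (baseline / scene_scale < max_baseline_ratio
            and rotation_angle_deg(R_rel) > min_rot_deg)
```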


Results

See our interactive visualization for results of all models on the sELP and wELP test sets.

We visualize the results of our model over different overlap levels. The images on the left serve as reference views, and their coordinate systems determine the relative rotation that maps to the images on the right. The ellipsoids representing the ground truth are color-coded to match their respective images, with the estimated relative rotation illustrated by a cyan dashed line. As illustrated by the examples above, our method accurately predicts relative rotations for diverse image pairs with varying appearance and intrinsic parameters.
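For reference, predictions of this kind are typically scored by the geodesic (angular) distance between the estimated and ground-truth relative rotations. A minimal sketch in degrees, using SciPy (the function name geodesic_error_deg is illustrative):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def geodesic_error_deg(R_pred, R_gt):
    """Angle, in degrees, of the rotation taking R_pred to R_gt."""
    return np.degrees(Rotation.from_matrix(R_pred.T @ R_gt).magnitude())

# Example: a prediction of identity against a 30-degree ground-truth rotation
# about the z-axis yields an error of ~30 degrees.
R_gt = Rotation.from_euler('z', 30, degrees=True).as_matrix()
print(geodesic_error_deg(np.eye(3), R_gt))
```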


Acknowledgements

This work was partially supported by ISF (grant number 2510/23).