Extreme Rotation Estimation in the Wild
Given a pair of images captured in the wild — e.g., under arbitrary illumination and with arbitrary camera intrinsics — and in an extreme setting (with little or no overlap), such as the images of Dam Square in Amsterdam shown in the red and blue boxes above, can we leverage 3D priors to estimate the relative 3D rotation between them? (Hover over the image above to see the full scene.)
Abstract
We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping fields of view. Prior work targeting extreme rotation estimation assumes constrained 3D environments and emulates perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extreme-view Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods.
Overview of our Method
ExtremeLandmarkPairs Dataset
In this paper, we present a new approach that tackles the problem of extreme rotation estimation in the wild. Internet (i.e., in the wild) images may vary due to a wide range of factors, including transient objects, weather conditions, time of day, and the cameras' intrinsic parameters. To explore this problem, we introduce a new dataset, ExtremeLandmarkPairs (ELP), assembled from publicly-available scene-level Internet image collections. This dataset contains a training set with nearly 34K non-overlapping pairs originating from over 2K unique landmarks, constructed from the MegaScenes dataset. Additionally, for evaluation, we have created two test sets to separately examine image pairs captured in a single-camera setting with constant illumination (sELP) and image pairs captured in the wild (wELP). These test sets are respectively sourced from the Cambridge Landmarks and MegaDepth datasets. The ExtremeLandmarkPairs (ELP) dataset can be accessed via this link.
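For concreteness, the quantity estimated for each image pair is the relative 3D rotation between the two cameras. The sketch below shows that definition under an assumed convention of 3x3 world-to-camera rotation matrices; the function names and the convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def relative_rotation(R1: np.ndarray, R2: np.ndarray) -> np.ndarray:
    """Rotation mapping camera-1 coordinates to camera-2 coordinates.

    Assumes R1 and R2 are 3x3 world-to-camera rotation matrices; under this
    (assumed) convention the relative rotation is R2 @ R1.T.
    """
    return R2 @ R1.T

def rotation_angle_deg(R: np.ndarray) -> float:
    """Geodesic magnitude (in degrees) of a rotation matrix, e.g., for
    sorting pairs by how extreme their relative rotation is."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```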
This figure shows the camera distribution for the Vatican, Rome scene from the ExtremeLandmarkPairs dataset. From the dense imagery reconstruction in (a), we construct real perspective image pairs with predominant rotational motion, shown in (b) and (c).
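As a rough illustration of what "predominant rotational motion" could mean, the sketch below flags a pair whose camera-center baseline is small relative to the spatial extent of the reconstructed scene. The use of camera centers and reconstructed points, and the 5% threshold, are our assumptions for illustration, not the dataset's actual selection criterion.

```python
import numpy as np

def is_rotation_dominant(c1: np.ndarray, c2: np.ndarray,
                         scene_points: np.ndarray,
                         max_baseline_ratio: float = 0.05) -> bool:
    """Heuristic check that a pair's motion is mostly rotational.

    c1, c2: (3,) camera centers of the two views.
    scene_points: (N, 3) reconstructed 3D points of the scene.
    The pair is considered rotation-dominant when the baseline between the
    camera centers is small compared to the scene's bounding-box diagonal
    (the 0.05 ratio is an illustrative assumption).
    """
    baseline = np.linalg.norm(c2 - c1)
    scene_extent = np.linalg.norm(scene_points.max(axis=0) - scene_points.min(axis=0))
    return baseline <= max_baseline_ratio * scene_extent
```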
Results
See our interactive visualization for results of all models on the sELP and wELP test sets.
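Relative rotation estimates in this setting are commonly scored by the geodesic distance between the predicted and ground-truth rotations; the sketch below shows that standard error measure (that this is the exact metric reported here is our assumption).

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance in degrees between predicted and ground-truth 3x3 rotations."""
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```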
Acknowledgements
This work was partially supported by ISF (grant number 2510/23).