ProtoSnap: Prototype Alignment for Cuneiform Signs

Rachel Mikulinsky*1 Morris Alper*1 Shai Gordin2 Enrique Jiménez3 Yoram Cohen1 Hadar Averbuch-Elor1,4
1Tel Aviv University 2Ariel University 3LMU 4Cornell University
* Equal Contribution

ICLR 2025


TL;DR: Given a target image of a cuneiform sign and a corresponding prototype, we align the prototype's skeleton with the target image ("snapping" the prototype into place).


Abstract

The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. We will release our code and data to the research community, foreseeing their use in a variety of applications in the digital humanities.

Sample results: aligning the prototypes (first row) to target cuneiform images (second row). Results are shown both after global alignment (third row) and after local refinement (bottom row).


How does it work?

We propose an optimization-based approach that does not require an alignment dataset. We leverage diffusion features, extracted from a fine-tuned Stable Diffusion model, to compute meaningful similarity scores between every pair of pixels in the prototype and target images. We then store these similarities in a 4D similarity volume, as illustrated below:
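A minimal PyTorch sketch of this step, assuming the (C, H, W) diffusion feature maps for both images have already been extracted (function and variable names are illustrative, not taken from our released code):

    import torch
    import torch.nn.functional as F

    def similarity_volume(proto_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        # proto_feats, target_feats: (C, H, W) feature maps, e.g. diffusion
        # features extracted from a fine-tuned Stable Diffusion model.
        # Returns an (H, W, H, W) volume where entry [i, j, k, l] is the cosine
        # similarity between prototype pixel (i, j) and target pixel (k, l).
        C, H, W = proto_feats.shape
        p = F.normalize(proto_feats.reshape(C, -1), dim=0)   # (C, H*W)
        t = F.normalize(target_feats.reshape(C, -1), dim=0)  # (C, H*W)
        sim = p.T @ t                                        # (H*W, H*W)
        return sim.reshape(H, W, H, W)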

We use the 4D similarity volume to find Best-Buddies correspondences, defined as pairs of pixels in the two images that are mutual nearest neighbors according to their similarity scores. These correspondences are then used to fit an affine transformation defining a global alignment of the prototype to the target image, as sketched below.
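The following sketch extracts mutual nearest-neighbor pairs from the volume and fits the global affine transform with plain least squares; a robust estimator (e.g. RANSAC) could be substituted, and the exact fitting procedure shown here is illustrative:

    import torch

    def best_buddies(sim: torch.Tensor):
        # sim: (H, W, H, W) similarity volume from the previous step.
        # Returns two (N, 2) tensors of matched (row, col) coordinates.
        H, W, _, _ = sim.shape
        flat = sim.reshape(H * W, H * W)
        nn_p2t = flat.argmax(dim=1)         # best target match per prototype pixel
        nn_t2p = flat.argmax(dim=0)         # best prototype match per target pixel
        idx = torch.arange(H * W)
        mutual = nn_t2p[nn_p2t] == idx      # keep only mutual nearest neighbors
        to_xy = lambda i: torch.stack([i // W, i % W], dim=1).float()
        return to_xy(idx[mutual]), to_xy(nn_p2t[mutual])

    def fit_affine(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        # Least-squares fit of a 3x2 affine matrix M such that [src | 1] @ M ~ dst.
        A = torch.cat([src, torch.ones(len(src), 1)], dim=1)
        return torch.linalg.lstsq(A, dst).solution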
The similarities are then used again for a per-stroke local refinement, allowing each stroke to "snap" into place. The refinement optimizes a per-stroke transformation via gradient descent; a sketch follows.
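A minimal sketch of the refinement loop, assuming a differentiable sim_lookup function that bilinearly samples the similarity volume at continuous coordinates; the rotation/translation/scale parameterization and the loss here are illustrative simplifications rather than the exact formulation in the paper:

    import torch

    def refine_stroke(stroke_pts, sim_lookup, steps=200, lr=1e-2):
        # stroke_pts: (N, 2) skeleton points of one stroke after global alignment.
        # sim_lookup: differentiable map from (N, 2) coordinates to similarity scores.
        theta = torch.zeros(1, requires_grad=True)  # rotation angle
        trans = torch.zeros(2, requires_grad=True)  # translation
        scale = torch.ones(1, requires_grad=True)   # isotropic scale
        opt = torch.optim.Adam([theta, trans, scale], lr=lr)
        center = stroke_pts.mean(dim=0)
        for _ in range(steps):
            cos, sin = torch.cos(theta), torch.sin(theta)
            R = torch.stack([torch.cat([cos, -sin]), torch.cat([sin, cos])])
            pts = (stroke_pts - center) @ (scale * R).T + center + trans
            loss = -sim_lookup(pts).mean()  # pull the stroke toward high similarity
            opt.zero_grad()
            loss.backward()
            opt.step()
        return pts.detach()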


Evaluation

We introduce a test set of 272 cuneiform signs annotated by experts. We use this dataset to quantitatively evaluate our method, comparing it to several generic correspondence-matching baselines, including a geometry-based method (SIFT) and deep feature-based methods (DINOv2, DIFT). As illustrated below, our method significantly outperforms these baselines. Furthermore, our local refinement stage provides a performance boost beyond simply learning a global transform.


Boosting Downstream OCR Performance

We leveraged our method to create a dataset of paired cuneiform signs and aligned skeletons, and used it to fine-tune ControlNet so that it generates new cuneiform signs conditioned only on a prototype. We used this model to generate synthetic training data, which was added to a real dataset for learning cuneiform sign classification (denoted as "+CN Data" below). We show that by structurally controlling the generated signs, we improve classification more than by simply adding synthetic data generated with a fine-tuned Stable Diffusion model (denoted as "+SD Data" below).
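For illustration, generating one synthetic sign with such a fine-tuned ControlNet could look as follows using the Hugging Face diffusers library; the checkpoint paths, input skeleton image, and prompt below are placeholders, not our released assets:

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Placeholder path for a ControlNet fine-tuned on (sign, skeleton) pairs.
    controlnet = ControlNetModel.from_pretrained(
        "path/to/cuneiform-controlnet", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # An aligned prototype skeleton rendered as an image (placeholder path).
    skeleton = Image.open("skeleton.png").convert("RGB")
    result = pipe("a photograph of a cuneiform sign",
                  image=skeleton, num_inference_steps=30).images[0]
    result.save("synthetic_sign.png")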

By controlling the sign structure, we can generate the exact sign required, matching the correct era and variant. In contrast, signs generated with Stable Diffusion alone lack such conditioning.


Acknowledgements

This research was funded by the TAU Center for Artificial Intelligence & Data Science (TAD) and by the LMU-TAU Research Cooperation Program.

The method and the test set were developed using the cuneiform OCR dataset. The photographs of tablets are from the British Museum Digital Collections.


Citation

@misc{mikulinsky2025protosnapprototypealignmentcuneiform,
      title={ProtoSnap: Prototype Alignment for Cuneiform Signs},
      author={Rachel Mikulinsky and Morris Alper and Shai Gordin and Enrique Jiménez and Yoram Cohen and Hadar Averbuch-Elor},
      year={2025},
      eprint={2502.00129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.00129},
}