TECCI

Tricky Edits of Collected and Curated Images

Aishwarya Agrawal*,1,†, Roy Hirsch*,1, Yasumasa Onoe*,2, Sherry Ben2, Jason Baldridge2
1Google Research 2Google DeepMind
*Equal contribution, ordered alphabetically.
†Work partially done while Aishwarya Agrawal was at Google DeepMind.
@article{AgrawalTECCI2026,
  author        = {Aishwarya Agrawal and Roy Hirsch and Yasumasa Onoe and Sherry Ben and Jason Baldridge},
  title         = {{TECCI: Tricky Edits of Collected and Curated Images}},
  journal       = {arXiv},
  year          = {2026}
}

Abstract

Despite tremendous recent progress, current text-guided image editing methods still struggle with three aspects of editing: following the instruction, editing the source image minimally, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as edits involving position, motion, viewpoint, or scale, and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark, TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images that we are releasing. The images in TECCI span 7 image categories. Both the images and the categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7,550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following than at minimal editing and visual quality, 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

Dataset

TECCI comprises two complementary subsets: TECCI-IRCS, which contains 530 images with 530 edit instructions (one per image), and TECCI-GGIS, which contains 1,404 images paired with 7,020 edit instructions (five instructions per image). TECCI-IRCS consists of challenging manually written edit instructions. The edit instructions for the images in TECCI-GGIS were generated automatically using Gemini 3 Pro.
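As a quick sanity check, the subset sizes above can be added up to reproduce the 7,550 image-instruction pairs reported in the abstract. The sketch below only restates the published counts; the `Subset` class and its field names are illustrative and are not part of any released TECCI tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Illustrative container for the subset sizes stated on this page."""
    name: str
    num_images: int
    instructions_per_image: int

    @property
    def num_pairs(self) -> int:
        # Each image contributes one pair per edit instruction.
        return self.num_images * self.instructions_per_image


# Counts copied from the dataset description above.
SUBSETS = [
    Subset("TECCI-IRCS", num_images=530, instructions_per_image=1),
    Subset("TECCI-GGIS", num_images=1404, instructions_per_image=5),
]

total_images = sum(s.num_images for s in SUBSETS)
total_pairs = sum(s.num_pairs for s in SUBSETS)

for s in SUBSETS:
    print(f"{s.name}: {s.num_images} images, {s.num_pairs} pairs")
print(f"Total: {total_images} images, {total_pairs} pairs")
```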


Benchmarking Image Editing Models on TECCI

Models Evaluated: We evaluate five state-of-the-art image editing models on TECCI: Nano Banana 2, Nano Banana Pro, Grok Imagine Pro, Seedream 5.0 Lite, and GPT Image 1.5. Representative samples are shown in the figure below.

Evaluation Criteria: The quality of instruction-based image editing is inherently multi-dimensional and subjective, so we defined granular scoring guidelines.

  1. Instruction Following (IF):

    Assesses the semantic alignment between the edit instruction and the resulting edited image. It measures whether the generative model accurately and completely fulfilled all the required modifications, serving as the primary benchmark for the system’s functional utility.

  2. Image Consistency (IC):

    Evaluates the preservation of the original image’s identity and non-targeted regions and elements. This criterion penalizes "over-editing" or unrequested alterations to the background and perspective, ensuring the edit is minimal.

  3. Visual Quality (VQ):

    Captures the aesthetic and technical excellence of the edit, identifying any introduced artifacts such as blurring, pixelation, or unnatural blending. It determines if the modifications are seamlessly integrated to maintain a realistic, high-resolution appearance.

Human Evaluation on a subset of TECCI: We report the overall success rate alongside the per-criterion success rates. The low overall scores across the model suite highlight the inherent difficulty of TECCI. Even the most capable models struggle to exceed a 22.3% overall success rate. TECCI-IRCS stands out as the significantly more challenging subset.
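To illustrate how an overall success rate can sit well below the per-criterion rates, here is a minimal sketch. It assumes that each output receives a binary pass/fail judgment per criterion and that an edit counts as an overall success only if it passes all three; this page does not spell out the aggregation rule, so treat that rule, the function name, and the toy data as assumptions.

```python
from typing import Dict, List

CRITERIA = ("IF", "IC", "VQ")


def success_rates(judgments: List[Dict[str, bool]]) -> Dict[str, float]:
    """Return per-criterion and overall success rates as percentages."""
    if not judgments:
        raise ValueError("need at least one judgment")
    n = len(judgments)
    rates = {c: 100.0 * sum(j[c] for j in judgments) / n for c in CRITERIA}
    # ASSUMPTION: an edit succeeds overall only if it passes every criterion.
    passed_all = sum(all(j[c] for c in CRITERIA) for j in judgments)
    rates["overall"] = 100.0 * passed_all / n
    return rates


# Toy judgments (made up for illustration, not TECCI results).
toy = [
    {"IF": True, "IC": True, "VQ": True},
    {"IF": True, "IC": False, "VQ": True},
    {"IF": True, "IC": True, "VQ": False},
    {"IF": False, "IC": True, "VQ": True},
]
rates = success_rates(toy)
print(rates)
```

Under this conjunctive rule, each criterion passes on three of the four toy outputs, yet only one output passes all three, so the overall rate is much lower than any single criterion's rate.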

Automatic Evaluation on the full set of TECCI: We propose an MLLM-based automatic evaluation framework that enables streamlined and reproducible benchmarking on the TECCI dataset. The autorater takes as input the source image, the edit instruction, and the resulting edited image. We prompt the model to perform a rigorous, systematic analysis of the visual output, independently rating each of the three evaluation criteria on a 1–5 Likert scale. The observed trends largely align with the human evaluation.
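One way to compare 1–5 Likert ratings from an autorater against binary human judgments is to binarize the ratings and count agreements. The sketch below is only an illustration of that idea: the pass threshold of 4, the per-criterion matching, and all names are hypothetical and are not taken from the TECCI evaluation protocol.

```python
from typing import Dict, List

CRITERIA = ("IF", "IC", "VQ")
PASS_THRESHOLD = 4  # ASSUMPTION: hypothetical cut-off, not specified by TECCI


def binarize(likert: Dict[str, int], threshold: int = PASS_THRESHOLD) -> Dict[str, bool]:
    """Map 1-5 Likert ratings to pass/fail per criterion."""
    for c in CRITERIA:
        if not 1 <= likert[c] <= 5:
            raise ValueError(f"{c} rating {likert[c]} is outside the 1-5 scale")
    return {c: likert[c] >= threshold for c in CRITERIA}


def agreement_accuracy(
    auto_scores: List[Dict[str, int]],
    human_labels: List[Dict[str, bool]],
) -> float:
    """Percentage of (example, criterion) decisions where autorater and human agree."""
    if not auto_scores or len(auto_scores) != len(human_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = 0
    total = 0
    for scores, labels in zip(auto_scores, human_labels):
        predicted = binarize(scores)
        for c in CRITERIA:
            matches += int(predicted[c] == labels[c])
            total += 1
    return 100.0 * matches / total


# Toy data (made up for illustration, not TECCI results).
auto = [{"IF": 5, "IC": 2, "VQ": 4}, {"IF": 3, "IC": 4, "VQ": 1}]
human = [
    {"IF": True, "IC": False, "VQ": False},
    {"IF": False, "IC": True, "VQ": False},
]
accuracy = agreement_accuracy(auto, human)
print(f"{accuracy:.1f}% agreement")
```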


Downloads

🤗 TECCI can be used via Hugging Face Datasets 🤗

The annotations and images are licensed by Google LLC under the CC BY 4.0 license.


Related Project