MultiEdits: Simultaneous
Multi-Aspect Editing
with Text-to-Image Diffusion Models

Department of Computer Science and Engineering
University at Buffalo, State University of New York

Abstract

Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying existing methods sequentially for multi-aspect edits increases computational demands and reduces efficiency. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single-attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operate across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support the evaluation of image-editing tasks involving multiple objects and attributes simultaneously. This dataset serves as a benchmark for evaluating text-driven image editing methods in multifaceted scenarios.

Pipeline


Our method, MultiEdits, takes a source image, a source prompt, and a target prompt as input and produces an edited image. The target prompt specifies the edits to be made to the source image. Attention maps for all edited aspects are first collected. Aspect Grouping then categorizes each aspect into one of N groups (in the figure above, N = 5). Each group is assigned to a branch, and each branch acts as a rigid editing branch, a non-rigid editing branch, or a global editing branch. Finally, the query/key/value features at the self-attention and cross-attention layers are adjusted within each branch.
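
As a rough illustration of the grouping step, the Python sketch below assigns edited aspects to branches by edit type. The type-to-branch mapping here is an assumption for exposition only; MultiEdits derives the grouping from the collected attention maps rather than from a fixed lookup.

# Illustrative sketch of branch assignment; the mapping below is assumed.
RIGID_TYPES = {"change_object", "change_color", "change_material"}
NON_RIGID_TYPES = {"add_object", "remove_object", "change_pose", "change_content"}
GLOBAL_TYPES = {"change_background", "change_style"}

def group_aspects(edit_types):
    """Assign each edited aspect to a rigid, non-rigid, or global branch."""
    groups = {"rigid": [], "non_rigid": [], "global": []}
    for aspect, edit_type in edit_types.items():
        if edit_type in RIGID_TYPES:
            groups["rigid"].append(aspect)
        elif edit_type in NON_RIGID_TYPES:
            groups["non_rigid"].append(aspect)
        else:
            groups["global"].append(aspect)
    # Keep only the branches that actually received an aspect.
    return {name: members for name, members in groups.items() if members}

# Aspects from the running example: cat->dog, sitting->standing, wooden->metal.
print(group_aspects({"dog": "change_object",
                     "standing": "change_pose",
                     "metal": "change_material"}))
# {'rigid': ['dog', 'metal'], 'non_rigid': ['standing']}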

Results

PIE-Bench++

What is PIE-Bench++?

PIE-Bench++ builds upon the foundation laid by the original PIE-Bench dataset introduced by Ju et al. (2024), and is designed to provide a comprehensive benchmark for multi-aspect image-editing evaluation. This enhanced dataset contains 700 images and prompts across nine distinct edit categories, encompassing a wide range of manipulations:

  • Object-Level Manipulations: Additions, removals, and modifications of objects within the image.
  • Attribute-Level Manipulations: Changes in content, pose, color, and material of objects.
  • Image-Level Manipulations: Adjustments to the background and overall style of the image.

While retaining the original images, the enhanced dataset features revised source prompts and editing prompts, augmented with additional metadata such as editing types and aspect mappings. This comprehensive augmentation aims to facilitate more nuanced and detailed evaluations in the domain of multi-aspect image editing.

Dataset Structure

  • Images
    • 0_random_140
      • 000000000001.jpg
      • ...
      • 000000000140.jpg
    • 1_change_object_80
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 2_add_object_80
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 3_delete_object_80
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 4_change_attribute_content_40
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 5_change_attribute_pose_40
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 6_change_attribute_color_40
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 7_change_attribute_material_40
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 8_change_background_80
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
    • 9_change_style_80
      • 1_artificial
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
      • 2_natural
        • 1_animal
          • ...
        • 2_human
          • ...
        • 3_indoor
          • ...
        • 4_outdoor
          • ...
  • annotation.json
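
A minimal Python sketch for loading the annotations, assuming the dataset has been downloaded to a local directory named PIE-Bench++ (the local path is the only assumption; the layout follows the structure above):

import json
from pathlib import Path

DATASET_ROOT = Path("PIE-Bench++")  # assumed local checkout of the dataset

# annotation.json sits at the dataset root and is keyed by zero-padded
# image IDs; each entry records the relative image path and its prompts.
with open(DATASET_ROOT / "annotation.json") as f:
    annotations = json.load(f)

for image_id, ann in annotations.items():
    image_file = DATASET_ROOT / ann["image_path"]
    print(image_id, image_file, "|", ann["source_prompt"])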

Data Annotation Guide

Overview

Our dataset annotations are structured to provide comprehensive information for each image, facilitating a deeper understanding of the editing process. Each annotation consists of the following key elements:

  • Source Prompt: The original description or caption of the image before any edits are made.
  • Target Prompt: The description or caption of the image after the edits are applied.
  • Edit Action: A detailed specification of the changes made to the image, including:
    • The position index in the source prompt where changes occur.
    • The type of edit applied (e.g., 1:change object, 2:add object, 3:remove object, 4:change attribute content, 5:change attribute pose, 6:change attribute color, 7:change attribute material, 8:change background, 9:change style).
    • The operation required to achieve the desired outcome: '+' / '-' means adding/removing the words at the specified position, and a word string (e.g., 'cat' for the entry keyed by 'dog') gives the existing source words being replaced.
  • Aspect Mapping: A mapping that connects objects undergoing editing to their respective modified attributes. This helps identify which objects are subject to editing and the specific attributes that are altered.

Example Annotation

Here is an example annotation for an image in our dataset:

{
  "000000000002": {
    "image_path": "0_random_140/000000000002.jpg",
    "source_prompt": "a cat sitting on a wooden chair",
    "target_prompt": "a [red] [dog] [with flowers in mouth] [standing] on a [metal] chair",
    "edit_action": {
      "red": {"position": 1, "edit_type": 6, "action": "+"},
      "dog": {"position": 1, "edit_type": 1, "action": "cat"},
      "with flowers in mouth": {"position": 2, "edit_type": 2, "action": "+"},
      "standing": {"position": 2, "edit_type": 5, "action": "sitting"},
      "metal": {"position": 5, "edit_type": 7, "action": "wooden"}
    },
    "aspect_mapping": {
      "dog": ["red", "standing"],
      "chair": ["metal"],
      "flowers": []
    },
    "blended_words": [
      "cat,dog",
      "chair,chair"
    ],
    "mask": "0 262144"
  }
}
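
To make the edit_action semantics concrete, here is a short Python sketch that rebuilds the target prompt of the example above from its source prompt. The handling of '+', '-', and replacement actions follows our reading of the annotation guide; treat it as an illustrative sketch, not official dataset tooling.

def apply_edit_actions(source_prompt, edit_action):
    """Rebuild the target prompt from a source prompt and its edit actions."""
    tokens = source_prompt.split()
    inserts = {}   # source index -> new words inserted before that token
    replace = {}   # source index -> replacement words (None = delete)
    for new_words, info in edit_action.items():
        pos, act = info["position"], info["action"]
        if act == "+":
            inserts.setdefault(pos, []).append(new_words)
        elif act == "-":
            replace[pos] = None
        else:          # act holds the source words being replaced by the key
            replace[pos] = new_words
    out = []
    for i, tok in enumerate(tokens):
        out.extend(inserts.get(i, []))        # insertions attach before token i
        if i in replace:
            if replace[i] is not None:
                out.append(replace[i])
        else:
            out.append(tok)
    return " ".join(out)

edit_action = {
    "red": {"position": 1, "edit_type": 6, "action": "+"},
    "dog": {"position": 1, "edit_type": 1, "action": "cat"},
    "with flowers in mouth": {"position": 2, "edit_type": 2, "action": "+"},
    "standing": {"position": 2, "edit_type": 5, "action": "sitting"},
    "metal": {"position": 5, "edit_type": 7, "action": "wooden"},
}
print(apply_edit_actions("a cat sitting on a wooden chair", edit_action))
# a red dog with flowers in mouth standing on a metal chair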
      

BibTeX

      @misc{huang2024multiedits,
        title={MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models}, 
        author={Mingzhen Huang and Jialing Cai and Shan Jia and Vishnu Suresh Lokhande and Siwei Lyu},
        year={2024},
        eprint={2406.00985},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }
      

Acknowledgement

Our dataset is an extension of the original PIE-Bench dataset introduced in the paper PnP Inversion: Boosting Diffusion-based Editing with 3 Lines of Code. We thank the authors for their contributions.