As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that VPGen offers stronger control over object counts, spatial relations, and scales than state-of-the-art T2I generation models.
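A minimal Python sketch of this three-step data flow is shown below. The helper functions and their hard-coded outputs are hypothetical stand-ins for the finetuned LM and the layout-guided image generator described above, not the actual implementation.

"""Minimal sketch of a VPGen-style step-by-step T2I pipeline (illustrative)."""

def generate_objects_and_counts(prompt: str) -> dict[str, int]:
    # Step 1 (LM): parse the prompt into objects and their counts.
    # Hard-coded here for the example prompt used below.
    return {"dog": 2, "frisbee": 1}

def generate_layout(prompt: str, objects: dict[str, int]):
    # Step 2 (LM): one normalized bounding box (x0, y0, x1, y1) per object instance.
    return [
        ("dog", (0.05, 0.45, 0.45, 0.95)),
        ("dog", (0.55, 0.45, 0.95, 0.95)),
        ("frisbee", (0.40, 0.10, 0.60, 0.30)),
    ]

def layout_to_image(prompt: str, layout) -> str:
    # Step 3: a layout-guided image generator renders the final image.
    # A placeholder string is returned to keep this sketch dependency-free.
    return f"<image for {prompt!r} with {len(layout)} boxes>"

if __name__ == "__main__":
    prompt = "two dogs playing with a frisbee"
    objects = generate_objects_and_counts(prompt)  # step 1
    layout = generate_layout(prompt, objects)      # step 2
    print(layout_to_image(prompt, layout))         # step 3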
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model, which is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and it also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval correlates better with human judgments than widely used single-model-based evaluations, on both skill-specific and open-ended prompts.
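To make the module interface concrete, here is a small, hypothetical stub of one such module: it checks a single skill (a spatial relation) and returns a score together with a textual explanation. A real module wraps an expert model (e.g., an object detector) and also returns a visual explanation, so this is a schematic of the interface rather than an actual VPEval module.

from dataclasses import dataclass

@dataclass
class ModuleResult:
    score: float       # 1.0 if the check passes, 0.0 otherwise
    explanation: str   # textual explanation shown alongside the score

def spatialEval(image: dict, obj_a: str, relation: str, obj_b: str) -> ModuleResult:
    """Toy spatial-relation check over pre-detected boxes (x0, y0, x1, y1)."""
    def center_x(box):
        return (box[0] + box[2]) / 2
    box_a, box_b = image["boxes"][obj_a], image["boxes"][obj_b]
    ok = relation == "left of" and center_x(box_a) < center_x(box_b)
    return ModuleResult(float(ok), f"{obj_a!r} is{'' if ok else ' not'} {relation} {obj_b!r}")

if __name__ == "__main__":
    image = {"boxes": {"dog": (0.1, 0.5, 0.4, 0.9), "frisbee": (0.6, 0.2, 0.8, 0.4)}}
    result = spatialEval(image, "dog", "left of", "frisbee")
    print(result.score, "-", result.explanation)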
We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models.
Although our evaluation with skill-specific prompts covers five important and diverse image generation skills, user-written prompts can sometimes be even more complex and require multiple evaluation criteria. To handle such open-ended prompts, we extend the VPEval setup with evaluation programs that use many visual modules together. We generate the evaluation programs for open-ended prompts with an LLM; each program then outputs the average score and the visual+textual explanations from its visual modules. Program generation involves choosing which prompt elements to evaluate and which modules should evaluate them.
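As a rough illustration of how such a program produces a score, the sketch below parses an assumed program string, runs each module call, and reports the average score along with the per-module textual explanations. The program syntax and the placeholder modules are simplifications, not the released evaluation code.

# Placeholder modules over a toy "image" (a list of detected object names);
# each returns (score, textual_explanation).
def objectEval(image, obj):
    found = obj in image["objects"]
    return float(found), f"{obj!r} {'found' if found else 'missing'}"

def countEval(image, obj, target):
    n = image["objects"].count(obj)
    return float(n == int(target)), f"expected {target} {obj!r}, detected {n}"

MODULES = {"objectEval": objectEval, "countEval": countEval}

def run_program(program: str, image):
    scores, explanations = [], []
    for call in program.split(";"):
        name, arg_str = call.strip().rstrip(")").split("(", 1)
        args = [a.strip().strip("'") for a in arg_str.split(",")][1:]  # drop the 'img' placeholder
        score, explanation = MODULES[name](image, *args)
        scores.append(score)
        explanations.append(explanation)
    # Open-ended prompts: the final score is the average over all module calls.
    return sum(scores) / len(scores), explanations

if __name__ == "__main__":
    image = {"objects": ["dog", "dog", "frisbee"]}
    program = "objectEval(img, 'frisbee'); countEval(img, 'dog', 2)"
    print(run_program(program, image))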
As annotating evaluation programs for open-ended prompts can be expensive, we use ChatGPT to generate evaluation programs via in-context learning. As illustrated in the figure below, we give ChatGPT the list of visual modules and example text prompt-program pairs, then ask the model to generate a program for a new prompt. For reproducible and accessible evaluation, we release the evaluation programs so that VPEval users do not have to generate them. We will also release a public LM (finetuned for evaluation program generation on ChatGPT outputs) that can run on local machines.
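A rough sketch of this in-context learning setup is shown below. The module list, the example prompt-program pairs, and the call_chat_llm placeholder are illustrative assumptions, not the exact prompt or client used for VPEval.

"""Sketch of generating an evaluation program via in-context learning (illustrative)."""

MODULE_LIST = "objectEval, countEval, spatialEval, scaleEval, textEval, vqa"

IN_CONTEXT_EXAMPLES = [
    ("two dogs playing with a frisbee",
     "objectEval(img, 'frisbee'); countEval(img, 'dog', 2)"),
    ("a red stop sign above a green door",
     "vqa(img, 'what color is the stop sign?', 'red'); "
     "vqa(img, 'what color is the door?', 'green'); "
     "spatialEval(img, 'stop sign', 'above', 'door')"),
]

def build_prompt(new_prompt: str) -> str:
    # List the available modules, show example prompt-program pairs,
    # then ask for a program for the new prompt.
    lines = [f"Available visual modules: {MODULE_LIST}",
             "Write an evaluation program for each text prompt.", ""]
    for text, program in IN_CONTEXT_EXAMPLES:
        lines += [f"Prompt: {text}", f"Program: {program}", ""]
    lines += [f"Prompt: {new_prompt}", "Program:"]
    return "\n".join(lines)

def call_chat_llm(prompt: str) -> str:
    # Placeholder: replace with a call to your chat LLM client of choice.
    raise NotImplementedError

def generate_eval_program(new_prompt: str) -> str:
    return call_chat_llm(build_prompt(new_prompt)).strip()

if __name__ == "__main__":
    print(build_prompt("a pizza to the left of a laptop"))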
@inproceedings{Cho2023VPT2I,
  author    = {Jaemin Cho and Abhay Zala and Mohit Bansal},
  title     = {Visual Programming for Text-to-Image Generation and Evaluation},
  booktitle = {NeurIPS},
  year      = {2023},
}