Visual Programming for
Text-to-Image Generation and Evaluation
(NeurIPS 2023)

UNC Chapel Hill


As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.

First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that VPGen offers better control over object counts, spatial relations, and scales than state-of-the-art T2I generation models.

Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation.

We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models.


Illustration of the proposed visual programming frameworks for text-to-image (T2I) generation. In (a) VPGen, we first generate a list of objects, then object positions, and finally an image, by executing three modules step-by-step. In (b) VPEval, we use evaluation programs with a mixture of evaluation modules that handle different skills and provide visual+textual explanation of evaluation results.

VPGen: Visual Programming for Step-by-Step Text-to-Image Generation

VPGen is a novel visual programming framework for interpretable step-by-step text-to-image (T2I) generation. As illustrated in the figure below, we decompose the text-to-image generation task into three steps: (1) object/count generation, (2) layout generation, and (3) image generation. VPGen employs an LM to handle the first two steps, (1) object/count generation and (2) layout generation, making it easy to leverage the knowledge of pretrained LMs and enabling layout generation for objects that are unseen during text-to-layout training (e.g., ‘pikachu’). Then VPGen uses a layout-to-image module to generate images from the predicted layouts. For the layout generation LM, we finetune Vicuna 13B on text-layout pair annotations from three public datasets: Flickr30K Entities, MS COCO, and PaintSkills. For the layout-to-image module, we use GLIGEN.
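The three-step decomposition above can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the released implementation: the stub functions below stand in for the finetuned Vicuna LM (steps 1 and 2) and for GLIGEN (step 3), and their hard-coded outputs are hypothetical.

```python
# Minimal sketch of the VPGen three-step pipeline.
# The function bodies are hypothetical stubs standing in for the
# finetuned Vicuna LM and the GLIGEN layout-to-image module.

def generate_objects(prompt):
    # Step 1: the LM lists objects and their counts from the prompt.
    # Stubbed with a fixed output for illustration.
    return [("dog", 1), ("frisbee", 1)]

def generate_layout(prompt, objects):
    # Step 2: the LM assigns each object instance a bounding box
    # (x1, y1, x2, y2), normalized to [0, 1].
    return [("dog", (0.10, 0.40, 0.50, 0.90)),
            ("frisbee", (0.55, 0.20, 0.80, 0.45))]

def generate_image(prompt, layout):
    # Step 3: a layout-to-image model (GLIGEN in the paper) renders an
    # image conditioned on the prompt and the boxes. Placeholder here.
    return {"prompt": prompt, "layout": layout}

def vpgen(prompt):
    objects = generate_objects(prompt)
    layout = generate_layout(prompt, objects)
    return generate_image(prompt, layout)

image = vpgen("a dog catching a frisbee")
```

Because the intermediate object list and layout are explicit, each step can be inspected or edited before the image is rendered, which is what makes the pipeline interpretable.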

VPEval: Visual Programming for Explainable Evaluation of Text-to-Image Generation

VPEval is a novel interpretable/explainable evaluation framework for T2I generation models, based on visual programming. Unlike existing T2I evaluation methods that compute image-text alignment scores with an end-to-end model, our evaluation provides an interpretable program and visual+textual explanations for the evaluation results. We propose two types of evaluation prompts: (1) skill-based evaluation and (2) open-ended evaluation. In skill-based evaluation, we define five image generation skills and use a set of skill-specific prompts and evaluation programs. In open-ended evaluation, we use a diverse set of prompts that require multiple image generation skills, and use a language model to dynamically generate an evaluation program for each text prompt.
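Conceptually, a VPEval evaluation program is a sequence of module calls whose scores are aggregated into a final image-text alignment score. The sketch below illustrates that idea under stated assumptions: the module implementations and the dict-based "image" are hypothetical stand-ins (in the paper, each module is an expert model such as an object detector, OCR, or VQA model).

```python
# Sketch of how a VPEval evaluation program aggregates module scores.
# Module bodies are hypothetical stand-ins for the expert models.

def object_eval(image, obj):
    # Does the object appear in the image at all?
    return 1.0 if obj in image["objects"] else 0.0

def count_eval(image, obj, count):
    # Does the object appear the requested number of times?
    return 1.0 if image["objects"].count(obj) == count else 0.0

MODULES = {"objectEval": object_eval, "countEval": count_eval}

def run_program(image, program):
    # A program is a list of (module_name, args) calls; the final
    # score is the average of the per-module scores.
    scores = [MODULES[name](image, *args) for name, args in program]
    return sum(scores) / len(scores)

# Pretend an object detector found these objects in a generated image.
image = {"objects": ["dog", "dog", "cat"]}
program = [("objectEval", ("cat",)), ("countEval", ("dog", 2))]
score = run_program(image, program)  # both checks pass -> 1.0
```

Routing each aspect of the prompt to a specialized module is what lets a single program stay accurate across skills where one end-to-end scoring model is unreliable.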

Evaluation Modules

Unlike previous T2I evaluation methods that use a single model to evaluate all kinds of image generation skills, we use a set of visual modules specialized for different tasks. In the following figure, we show Python pseudocode of the evaluation modules.
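In the spirit of that pseudocode, here is a hedged sketch of one such module, a spatial-relation check. The interface and relation names are illustrative assumptions; in VPEval the bounding boxes would come from an object detector run on the generated image.

```python
# Illustrative sketch of a spatial-relation evaluation module.
# `boxes` maps object name -> (x1, y1, x2, y2); in VPEval these
# would be produced by an object detector, not supplied by hand.

def spatial_eval(boxes, obj_a, relation, obj_b):
    # If either object is missing, the relation cannot hold.
    if obj_a not in boxes or obj_b not in boxes:
        return 0.0
    # Compare the box centers of the two objects.
    ax = (boxes[obj_a][0] + boxes[obj_a][2]) / 2
    ay = (boxes[obj_a][1] + boxes[obj_a][3]) / 2
    bx = (boxes[obj_b][0] + boxes[obj_b][2]) / 2
    by = (boxes[obj_b][1] + boxes[obj_b][3]) / 2
    checks = {
        "left": ax < bx, "right": ax > bx,
        "above": ay < by, "below": ay > by,  # image y grows downward
    }
    return 1.0 if checks[relation] else 0.0

boxes = {"dog": (0.1, 0.4, 0.5, 0.9), "cat": (0.6, 0.4, 0.9, 0.9)}
left_score = spatial_eval(boxes, "dog", "left", "cat")
```

Other modules follow the same pattern with different expert backbones, e.g., a counting module over detector outputs, an OCR module for rendered text, and a VQA module for open-ended questions.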

Skill-based Evaluation with Visual Programs

For skill-based evaluation, we create text prompts with various skill-specific templates that are used for image generation and evaluation with our programs. In the figure below, we illustrate our skill-based evaluation in VPEval. Given text prompts that require different image-generation skills, our evaluation programs measure image-text alignment scores by calling the relevant visual modules. Unlike existing T2I evaluation methods, our evaluation programs provide visual+textual explanations of the evaluation results.
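The visual+textual explanations can be sketched as modules that return an explanation string alongside the score. The function name and message format below are illustrative, not the released API.

```python
# Sketch of a skill-based module that returns a score together with a
# textual explanation (illustrative names and message format).

def count_eval_with_explanation(detections, obj, target):
    # `detections` is the list of object names found by a detector.
    found = detections.count(obj)
    score = 1.0 if found == target else 0.0
    explanation = (f"Expected {target} {obj}(s), detector found {found}; "
                   + ("correct." if score else "incorrect."))
    return score, explanation

# A prompt asked for two dogs, but the detector found three.
score, why = count_eval_with_explanation(["dog", "dog", "dog"], "dog", 2)
```

A visual explanation would additionally overlay the detector's bounding boxes on the generated image, so a user can see exactly which detections drove the score.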

Open-ended Evaluation with Visual Program Generator LM

Although our evaluation with skill-specific prompts covers five important and diverse image generation skills, user-written prompts can sometimes be even more complex and need multiple evaluation criteria. To handle such open-ended prompts, we extend the VPEval setup with evaluation programs that use many visual modules together. We generate open-ended prompt evaluation programs with an LLM; the evaluation programs then output the average score and the visual+textual explanations from their visual modules. Program generation involves choosing which prompt elements to evaluate and which modules will evaluate those elements.

As annotating evaluation programs for open-ended prompts can be expensive, we use ChatGPT to generate evaluation programs via in-context learning. As illustrated in the figure below, we give ChatGPT the list of visual modules and example text prompts and programs, then ask the model to generate a program for a new prompt. For reproducible and accessible evaluation, we release the evaluation programs so that VPEval users do not have to generate them. We will also release a public LM (finetuned for evaluation program generation using ChatGPT outputs) that can run on local machines.
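The in-context prompt construction can be sketched as follows. The module list, the example prompt/program pair, and the prompt template are illustrative placeholders, not the exact prompt used in the paper, and the call to a chat API is omitted.

```python
# Sketch of building an in-context learning prompt for evaluation-
# program generation. Module names follow the paper; the template and
# example pair are illustrative placeholders.

MODULE_LIST = "objectEval, countEval, spatialEval, scaleEval, textEval, vqa"

# (text prompt, evaluation program) demonstrations shown to the LLM.
EXAMPLES = [
    ("two dogs on the left of a cat",
     "countEval('dog', 2); spatialEval('dog', 'left', 'cat')"),
]

def build_program_prompt(new_prompt):
    lines = [f"Available modules: {MODULE_LIST}",
             "Write an evaluation program for each text prompt.", ""]
    for text, program in EXAMPLES:
        lines += [f"Prompt: {text}", f"Program: {program}", ""]
    # The LLM completes the final "Program:" line for the new prompt.
    lines += [f"Prompt: {new_prompt}", "Program:"]
    return "\n".join(lines)

icl_prompt = build_program_prompt("a red car next to a stop sign")
```

The returned string would be sent to ChatGPT (or the finetuned local LM), whose completion is the evaluation program to execute.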



@inproceedings{cho2023visual,
  author    = {Jaemin Cho and Abhay Zala and Mohit Bansal},
  title     = {Visual Programming for Text-to-Image Generation and Evaluation},
  booktitle = {NeurIPS},
  year      = {2023},
}