An experiment to enhance the quality of DALL.E-3 by analyzing backgrounds and poses using GPT-4v

5 min readOct 24, 2023

Preparation:

Using GPT-4v, we aim to read multiple images and adjust materials, compositions, poses, etc., to get them closer to the ideal. Even if you can visualize the finished artwork in your mind, expressing it in words and passing it to DALL.E-3 can be quite challenging. Therefore, we prepare photos that match our own images and have GPT-4v put them into words.

We loaded three images for this experiment:

The inside of a custom-made PC case
A photo of a marathon runner crossing the finish line
A cluttered computer desk with wires everywhere

Each image is numbered to make it easier for GPT-4v to understand (see the example below).

Prompt 1: Photo Analysis

Analyze these photos in detail. Write the analysis results separately in the lower left corner in English.

GPT4v

**1.**
This image showcases a high-end PC setup. Visible components and characteristics include:
- RGB lighting, illuminating various components in the rig.
- A GPU, specifically an “NVIDIA GEFORCE GTX”, which is a high-performance graphics card.
- AORUS branding, suggesting components from this manufacturer, likely the motherboard or other hardware.
- Liquid cooling system with transparent tubes, which help dissipate heat from the computer’s components.
- White cables, likely custom-sleeved, suggesting attention to aesthetics.
- Various cooling fans, with at least one featuring RGB lighting.
**2.**
This image captures a moment of a female marathon runner crossing the finish line. Key observations include:
- The runner is in a joyful and triumphant pose, indicating her accomplishment.
- She is wearing pink athletic attire, possibly representing a team or sponsor.
- The backdrop suggests the event is the “Osaka Women’s Marathon”, pinpointing the location and nature of the race.
- There is a crowd in the background, possibly fellow participants or spectators, indicating the significance of the event.
- The runner’s posture and expression convey a sense of relief, pride, and accomplishment.
**3.**
This image depicts a cluttered office space. Features and items include:
- A large number of binders or files on shelves.
- An old-fashioned calculator, which stands out due to its size and design.
- A desktop computer monitor and a keyboard.
- A disorganized array of cables and wires.
- Several paper coffee cups, indicating prolonged work sessions or multiple users.
- A pile of crumpled paper, suggesting discarded ideas or drafts.
- Office supplies like a phone, a mouse, and a pair of headphones.
- Multiple drawers, likely for storage.

Prompt 2: Creating a Prompt for DALL.E-3

You are an excellent prompt engineer. Please write a prompt for DALL.E-3 (DALL.E-3 creates illustrations from text and faithfully depicts prompts) in English.
Based on the photos you analyzed: The background resembles Photo 1 with plenty of code similar to Photo 3. Within this background, there is a long-haired Japanese girl in the pose from Photo 2 (do not mention any information from Photo 2 except for the pose). The scene should convey a sense of escape. The attire should be a suit inspired by a kimono. Depict it in an anime-like style.
Please write the information you analyzed without specifying “Photo 1,” etc., in the prompt.

GPT4v

“A young Japanese girl with long hair stands amidst a backdrop of intricate and entangled cables, reminiscent of the inside of a computer or server room. She strikes a triumphant pose, mirroring that of a marathon runner crossing the finish line. The atmosphere evokes a sense of escape or breaking free. Instead of traditional attire, she wears a suit that is an innovative fusion of a kimono. The entire scene is rendered in a vibrant, anime-inspired art style.”

Prompt 3: Providing the Above Prompt to DALL.E-3

From this point on, we will continue the conversation using ChatGPT and DALL.E-3. Here is the prompt provided above, and the resulting image is as follows.

I’d like to choose the first image. When you click on the image in ChatGPT DALL.E-3, the prompt is displayed, so I’ll copy it.

In the following prompt, please make the face appear more refined as it is currently distorted. Change the color of the clothing to pink and the hair to black. Ensure that the image is drawn in a 16:9 aspect ratio.
Illustration in a vibrant, anime-inspired art style depicting a young Japanese girl with long hair standing amidst a backdrop of intricate and entangled cables, similar to the inside of a computer or server room. She poses triumphantly, similar to a marathon runner crossing the finish line. Her outfit is a creative blend of a kimono and a modern suit, symbolizing innovation and tradition. The overall atmosphere conveys a sense of escape or liberation.

Make the face look more refined. Create a sense of impending arrival and convey a feeling of speed.

Attention:

DALL.E-3 is too faithful to the prompt and won’t yield random results like the Stable Diffusion system. If you use the same prompt, you’ll get nearly identical images each time. Be cautious as it can result in wasteful attempts. Apart from adjusting the prompt, there’s no other way to change the art style.

Therefore, I believe that if we analyze the image generated here with GPT-4v and pass it to ChatGPT DALL.E-3, we will likely get a very similar image.

An experiment to enhance the quality of DALL.E-3 by analyzing backgrounds and poses using GPT-4v

Preparation:

Prompt 1: Photo Analysis

Prompt 2: Creating a Prompt for DALL.E-3

Prompt 3: Providing the Above Prompt to DALL.E-3

Attention:

Bonus image

Written by Shinano Matsumoto

No responses yet