Flux Models Compared with GPT and Gemini - Image Processing Capabilities of Popular LLMs

Generating high-quality images through prompt engineering has grown into a separate area of specialisation within the AI and ML space, primarily because of its potential to reduce the operational overheads a business incurs. These overheads often take the form of monetary or human resources allocated to, say, creating a piece of marketing content for social media distribution, among other business operations. Yet even with fairly substantial resources allocated to the task, the resulting output is not guaranteed to be of the highest quality. In 2024, to secure the expected quality from the outset, business leaders started exploring Generative AI (GenAI) tools to create a foundation or structure for content, which skilled personnel would then refine for authenticity and relevance within the scope of the business. Skilled staff who would otherwise create content from scratch can now focus on ensuring that the generated content is refined enough for distribution.

But when the attainable value is uncertain, business leaders would like to experience the utility of GenAI tools (in this case, their image generation capabilities) before breaking the bank. Premium tools, however, come at a premium price, which keeps such businesses from even exploring useful tools, let alone adopting them. Business leaders then look at open-source alternatives, but implementation and support bottlenecks cause further reluctance to actually procure a solution. If, however, a repeatable playbook exists that breaks down how to unleash the power of such open-source models, business leaders would be more open to the idea of embracing GenAI tools.

Through this article, we at Codemonk are devising an open playbook for the effortless installation of ComfyUI to harness the power of the Flux models by Black Forest Labs, thereby gaining access to one of the most powerful image generation tools available today.

Let's dive into the world of Flux.

Flux is a powerful generative AI model developed by Black Forest Labs, designed for image generation by combining the strengths of transformer and diffusion models, leading to superior image quality and prompt adherence. Similar to contemporary large language models, Flux uses a transformer to encode text prompts into a numerical representation. The encoded prompts are then used to guide the generation process, where a noise-added image is gradually refined to match the desired content.
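The idea of gradually refining a noisy image can be illustrated with a deliberately simplified toy. This is not Flux's actual algorithm: in a real model a neural network, conditioned on the encoded prompt, predicts each refinement step, whereas the sketch below assumes the desired result (`target`) is already known and simply walks a noisy vector towards it. The function name `toy_refine` and all of its parameters are hypothetical.

```python
import random


def toy_refine(target, steps=20, seed=0):
    """Toy stand-in for iterative refinement: start from noise, step towards target.

    ASSUMPTION: `target` plays the role of "the content the prompt asks for".
    A real diffusion-style model never sees the target; it predicts each
    update from the current noisy image and the encoded prompt.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap, so the final step lands on the target.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors


final, errors = toy_refine([0.2, 0.8, 0.5])
print("largest remaining error per step:", [round(e, 3) for e in errors])
```

The printed errors shrink at every step, which is the only point of the sketch: the output emerges through many small corrections rather than in a single pass.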

Prominent applications of Flux

Creative Arts: Generating unique and visually appealing images for various artistic purposes, wherein factors such as granularity and detail of the image can be controlled.

Design: Creating concept art, product designs, and visual assets for marketing and branding, including templatised design artefacts that can be generated repeatedly.

Gaming: Developing high-quality game graphics and environments, incorporating different design languages based on the theming engine employed.

Research: Assisting in scientific research by generating visual representations of data.

Here are a few examples that highlight why Flux could be preferred over its contemporaries.

Prompt - Focus on Photorealistic Style of Image Generation

Generate an Image for the following: 
Subject: a white colored, 5-door, Suzuki Jimny 
Action: standing on the roadside next to a coffee plantation with misty mountains in the backdrop. 
More Context: Include a man drinking coffee next to the Suzuki Jimny
Art form: **Photorealistic image**
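The prompt above follows a labelled layout (subject, action, additional context, art form). When the same layout is reused across several prompts or models, it can help to assemble it in code so that every model receives an identically structured prompt. The following is a minimal sketch, assuming a hypothetical helper named `build_prompt`; it is plain string formatting and not part of any Flux, ComfyUI, GPT, or Gemini API.

```python
def build_prompt(subject, action, art_form, more_context=None):
    """Assemble a labelled image prompt.

    Hypothetical helper for illustration only; the labels mirror the
    prompt layout used in this article.
    """
    lines = [
        "Generate an Image for the following:",
        f"Subject: {subject}",
        f"Action: {action}",
    ]
    # The context line is optional; it is skipped when nothing is supplied.
    if more_context:
        lines.append(f"More Context: {more_context}")
    # The art form is emphasised with ** as in the prompt above.
    lines.append(f"Art form: **{art_form}**")
    return "\n".join(lines)


prompt = build_prompt(
    subject="a white colored, 5-door, Suzuki Jimny",
    action="standing on the roadside next to a coffee plantation "
           "with misty mountains in the backdrop.",
    more_context="Include a man drinking coffee next to the Suzuki Jimny",
    art_form="Photorealistic image",
)
print(prompt)
```

Keeping the art form as a separate argument makes it easy to rerun the same scene in another style and compare how each model handles the change.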

GPT Output:

Positives:

Of the three images generated, the GPT output is the most visually appealing, owing to the way the misty mountains are rendered.

Negatives:

The prompt specified “a man drinking coffee next to the Suzuki Jimny,” but in the generated image the man appears to be seated on thin air, violating the laws of physics.

The prompt specified “a Suzuki Jimny, standing on the roadside next to a coffee plantation,” but the generated image shows what looks like a tea plantation.

Gemini Output:

Positives:

Of the three images generated, Gemini's captures intricate details that the prompt does not mention, such as the detail within the leaves of the coffee plants, hinting at the embellishments Gemini made over the given prompt.

Negatives:

The prompt specified “a man drinking coffee next to the Suzuki Jimny,” but in the generated image the man is missing.

The prompt specified “a Suzuki Jimny, standing on the roadside next to a coffee plantation,” but the road appears to be missing and the Jimny instead looks like it is parked in a trench.

A foreign object (what looks like a hat) is placed on the Jimny, which should not exist in the image.

Flux Output:

Positives:

Among the three images generated, Flux's adheres most closely to correct proportions and physical laws.

Its rendering of the Suzuki Jimny, the coffee plantation, and the misty mountains is the closest to how they look in the real world.

Negatives: None with respect to the prompt provided.

Let's take a look at another example, with one prompt executed on the same three models.

Prompt - Focus on Illustrative Style of Image Generation

Create an image for the following: 
Subject of the Image: A curious cat exploring a deserted alley
Action: Peering into a glowing box
Time and Day: Midnight under a full moon
Art Form: **Illustration**

GPT Output:

Positives:

Every element mentioned in the prompt appears in the image.

Negatives:

The art style is not an illustration.

Although the prompt specifies a deserted alley, the generated image shows a well-lit alley with buildings on both sides.

The cat’s eyes look alien.

Gemini Output:

Positives:

None (too many negatives)

Negatives:

The cat, the box, and even the moon in the sky are not to scale.

The image resembles a painting more than an illustration.

Parts of a single element (the cat’s head and feet) are out of proportion with one another.

Flux Output:

Positives:

Most importantly, the output retains the illustration style, whereas the other two generations look like paintings.

It stays true to the prompt and captures every element specified in it.

It captures scenic descriptions, such as the deserted alley, more accurately than the other two outputs.

Negatives:

The cat is looking away from the box, whereas the prompt specifies that the cat is peering into it.

Inference: The above observations illustrate how accurately each model translates a text prompt into an image. Although GPT and Gemini present compelling cases, Flux stays closest to the intent conveyed through the given prompts, emerging as the clear winner in this comparison in terms of accuracy and detail.

With applicability and demonstrations out of the way, let's look at how anyone could actually use Flux through ComfyUI in the subsequent series of articles published here.