Grayphite - Professional Software Development and IT Services Company in US

The Journey: How AI Went From Blurry Faces to Photorealism


Back in 2014, AI image generation was like staring at a dream through frosted glass: blurry, unstable, and often surreal. Generative Adversarial Networks (GANs) changed the game by teaching two neural networks to compete: one creating fake images, the other exposing them. This rivalry sparked the first wave of AI-generated faces and landscapes, but it came with cracks: unstable training, repetitive outputs, and those infamous “GAN-smile” portraits.

Fast forward a few years, and diffusion models arrived to change everything. Instead of leaping from noise to realism in one unstable jump, they learned to build images step by step, carefully peeling order out of randomness. That steadier approach now powers tools like MidJourney, Stable Diffusion, and DALL·E, which define today’s creative AI revolution.

This shift marks more than just better pictures; it’s the story of how AI went from imitating reality to reimagining it with precision and control.

What Are GANs?

A Generative Adversarial Network (GAN) is a type of machine learning model built to generate new, realistic data by learning patterns from existing datasets. It works in an unsupervised learning framework powered by deep learning. The architecture involves two competing neural networks:

  • Generator → creates synthetic data (like images).
     
  • Discriminator → evaluates whether the data is real or generated.
     

Through this adversarial process, the generator improves until it produces outputs that can fool the discriminator, resulting in increasingly realistic images, sounds, or text. While deep learning had already excelled in tasks like image classification and speech recognition, GANs tackled a harder challenge: not just recognizing data, but inventing new data — a leap that set the stage for modern generative AI.
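The adversarial setup above can be sketched with a deliberately tiny example. This is not code from any real GAN library: the generator here is just a linear map G(z) = a·z + b trying to match a one-dimensional target distribution N(3, 1), and the discriminator is a logistic classifier D(x) = sigmoid(w·x + c). The gradients are written out by hand so the generator-vs-discriminator loop is visible.

```python
# Minimal toy GAN (illustrative sketch, 1-D data, hand-derived gradients).
# Generator G(z) = a*z + b tries to map N(0,1) noise onto the target N(3,1);
# discriminator D(x) = sigmoid(w*x + c) tries to tell real from fake.
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 0.0            # generator parameters
w, c = 0.1, 0.0            # discriminator parameters
lr, batch = 0.05, 64

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for step in range(3000):
    real = rng.normal(3.0, 1.0, batch)     # samples from the real data
    z = rng.normal(0.0, 1.0, batch)        # latent noise
    fake = a * z + b                       # generator output

    # --- discriminator update: push D(real) -> 1 and D(fake) -> 0 ---
    ds_real = sigmoid(w * real + c) - 1.0  # d(loss)/d(score), real batch
    ds_fake = sigmoid(w * fake + c)        # d(loss)/d(score), fake batch
    w -= lr * np.mean(ds_real * real + ds_fake * fake)
    c -= lr * np.mean(ds_real + ds_fake)

    # --- generator update (non-saturating loss): push D(fake) -> 1 ---
    ds = sigmoid(w * fake + c) - 1.0       # d(-log D(fake))/d(score)
    a -= lr * np.mean(ds * w * z)          # chain rule through G
    b -= lr * np.mean(ds * w)

print(f"generated mean is roughly {b:.2f} (target 3.0)")
```

After training, the generator's offset `b` drifts toward the real data's mean of 3.0, which is exactly the "fool the discriminator" dynamic described above, in miniature.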




NVIDIA Research Uses AI to Turn Cats Into Dogs, Lions, and Tigers, Too



GANs were a breakthrough, but training them reliably was notoriously fragile. The adversarial tug-of-war between the two networks (generator vs. discriminator) often led to unstable dynamics: vanishing gradients, oscillating losses, convergence failures, and the notorious mode collapse, where the generator “gets stuck” producing only a few very similar outputs.

This instability often showed up in the generated images themselves. One especially persistent quirk was what users dubbed the “GAN-smile” portrait: a default half-smile or overly symmetric grin that the generator produced as a “safe bet.” The GAN leaned into this because smiling faces were abundant in the training data, and a gentle smirk was easier to model plausibly than more expressive or varied facial emotions.


Diffusion Models: The New Contender

At its core, diffusion in AI is about breaking down and rebuilding structure. Inspired by non-equilibrium statistical physics, the idea is to gradually destroy the patterns in data through a forward process and then teach a model to reverse that destruction. By learning how to add noise step by step and then carefully remove it, diffusion models provide a powerful and flexible way to generate entirely new data, like images, from pure randomness.

Forward Diffusion: Destroying Structure

The essential idea, borrowed from non-equilibrium statistical physics, is to gradually destroy structure in data through an iterative forward diffusion process. Imagine starting with a clear image and then slowly sprinkling noise over it until nothing recognizable remains. This step-by-step corruption process makes the data distribution easier to model.
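The forward process can be written down in a few lines. The sketch below assumes the standard DDPM setup (a linear beta schedule is my choice here, not something stated in the article): the closed form x_t = √(ᾱ_t)·x₀ + √(1 − ᾱ_t)·ε lets us jump to any corruption level t in one shot.

```python
# Forward diffusion sketch: corrupt a "clean image" with Gaussian noise
# under a linear beta schedule. alpha_bar[t] is the cumulative fraction of
# the original signal that survives after t noising steps.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise variance added at each step
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal fraction

x0 = rng.standard_normal((8, 8))     # stand-in for a clean image

def noise_to_step(x0, t):
    """Sample x_t directly from x0 using the closed-form shortcut."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_early = noise_to_step(x0, 10)      # still almost all signal
x_late = noise_to_step(x0, T - 1)    # essentially pure noise
print(alpha_bar[10], alpha_bar[T - 1])
```

Early in the schedule nearly all of the image survives (ᾱ close to 1); by the final step the signal fraction is vanishingly small, which is the "nothing recognizable remains" state described above.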

Reverse Diffusion: Rebuilding from Noise

Once the data is fully “destroyed” into noise, the challenge is to reverse the process. The reverse diffusion process trains a neural network to peel away noise, little by little, until an image emerges again. Unlike GANs, which must jump from pure noise to a full image in a single leap, diffusion models tackle the problem in small, manageable steps, making training more stable and the results more realistic.

How Image Generation Works in Diffusion Models

  1. Training:
     
    • The model learns how to remove noise from images step by step.
       
    • Once it masters that, it can “reverse” the process: start with random noise and turn it into a new image.
       
  2. Generation:
     
    • You give the model random noise.
       
    • It removes the noise little by little (over many steps).
       
    • The final result is a brand-new image that looks like the training data but isn’t an exact copy.
       
  3. Steps tradeoff:
     
    • Fewer steps = faster, but less detail.
       
    • More steps = slower, but more accurate and detailed.
       
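The generation loop above can be sketched as a standard DDPM sampling loop. One large caveat: a real system uses a trained denoising network, and `predict_noise` below is a hypothetical placeholder I am inserting so the loop runs end to end. The structure (start from noise, denoise over many steps, re-inject a little noise except at the last step) is the point.

```python
# DDPM-style sampling-loop sketch. `predict_noise` stands in for a trained
# denoiser eps_theta(x_t, t); here it is a dummy returning zeros.
import numpy as np

rng = np.random.default_rng(0)
T = 50                               # fewer steps = faster, less detail
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t):
    # hypothetical placeholder for the learned model
    return np.zeros_like(x)

x = rng.standard_normal((8, 8))      # 1. start from pure random noise
for t in reversed(range(T)):         # 2. remove noise little by little
    eps = predict_noise(x, t)
    # DDPM posterior mean: subtract the predicted noise, then rescale
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:                        # add fresh noise except at the final step
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(x.shape)                       # 3. a brand-new 8x8 "image"
```

Raising `T` trades speed for fidelity, which is exactly the steps tradeoff listed above.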
Guided Diffusion Models (adding control)

Basic diffusion models just produce random variations; you can’t tell them what you want.
So researchers added guidance:

  • Text-to-image diffusion: You give a text prompt (“a giraffe wearing a top hat”), and the model makes an image that matches.
     
  • This works because the diffusion model is paired with a language model (like CLIP or GPT-like models) that understands the text and guides the image creation.
     

Two main ways to guide:

  1. Classifier-guided: Needs an extra classifier to say “this looks like a dog / cat / giraffe.” Limited to the categories it was trained on.
     
  2. Classifier-free: More flexible. Uses embeddings from text (like CLIP) and conditions the diffusion model directly. Can even handle new categories it wasn’t trained on.
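The classifier-free variant boils down to one well-known formula: run the denoiser twice, once without conditioning and once with the text embedding, then extrapolate from the unconditional prediction toward the conditional one. A minimal sketch (array contents are made up; only the combination rule matters):

```python
# Classifier-free guidance sketch: blend an unconditional and a
# text-conditioned noise prediction. A guidance scale of 1 reproduces the
# conditional prediction; larger values push harder toward the prompt.
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 4))   # stand-in: prediction with no prompt
eps_c = np.ones((4, 4))    # stand-in: prediction conditioned on the text

print(guided_noise(eps_u, eps_c, 1.0)[0, 0])   # scale 1 = conditional only
print(guided_noise(eps_u, eps_c, 7.5)[0, 0])   # scale 7.5 = amplified
```

In practice the scale (often around 5 to 10 in popular tools) controls how literally the image follows the prompt versus how diverse the samples are.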
     
Latent Diffusion Models (faster + cheaper)

Problem: Normal diffusion models work directly with raw pixels, which is slow and expensive.

Solution:

  • First, compress the image into a smaller “hidden representation” (latent space) using an autoencoder (like how you zip a file).
     
  • Do the diffusion steps in this smaller space (much faster).
     
  • Then decode it back into a full image at the end.
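To make the compress/diffuse/decode pipeline concrete, here is a toy stand-in for the autoencoder: "encoding" is 2x average pooling and "decoding" is nearest-neighbour upsampling. A real latent diffusion model uses a learned autoencoder, but the shape bookkeeping is the same.

```python
# Latent-diffusion sketch with a toy "autoencoder": encode shrinks the image
# 2x in each dimension (4x fewer values), diffusion would run in that small
# latent space, and decode blows the result back up to full size.
import numpy as np

def encode(img):               # (H, W) -> (H/2, W/2) via average pooling
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(lat):               # (H/2, W/2) -> (H, W) by repeating pixels
    return lat.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)
lat = encode(img)              # diffusion steps would happen on this 4x4 grid
out = decode(lat)              # final upscale back to image space
print(img.size, lat.size)      # 64 values vs 16 values per denoising step
```

Every denoising step now touches 16 values instead of 64; at real resolutions (e.g. 512x512 pixels compressed to a 64x64 latent) this is where the large speed and cost savings come from.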


MidJourney

MidJourney is a proprietary text-to-image system that, like DALL·E and Stable Diffusion, is built on the logic of latent diffusion models (LDMs). Instead of working directly on raw pixels, the model first compresses images into a smaller “latent space,” performs the diffusion denoising steps there, and then reconstructs the final image. This makes training and generation far more efficient while preserving quality.

What is U-Net, and why it matters

Karray presents U-Net as a powerful backbone for early diffusion models:

  • U-Net has an encoder–decoder architecture:
     
    • The encoder compresses the input image through successive convolution and downsampling layers, extracting multiscale feature maps.
       
    • The decoder upsamples and reconstructs those feature maps back into an output image (or prediction).
  • The key architectural trick in U-Net is the skip connections between matching encoder and decoder layers. These let the decoder reuse high-resolution spatial detail from the encoder, improving reconstruction and preserving fine structure.
     
  • In diffusion models, U-Net acts as the denoiser: given a noisy image at some step, the U-Net predicts how to remove noise and recover the underlying clean image (or a slightly less noisy image).
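The encoder/decoder/skip structure can be shown without any learned weights. This is a purely structural sketch of my own (pooling for downsampling, pixel repetition for upsampling, and a simple average standing in for the concatenate-then-convolve step), meant only to show how skip connections carry high-resolution detail past the bottleneck.

```python
# Structural U-Net sketch (no learned weights): the encoder downsamples while
# stashing feature maps; the decoder upsamples and merges the matching
# encoder map back in via a skip connection.
import numpy as np

def down(x):                    # encoder stage: 2x average pooling
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):                      # decoder stage: 2x nearest upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(0).standard_normal((16, 16))
skips = []
for _ in range(2):              # encoder path: 16 -> 8 -> 4
    skips.append(x)             # remember high-resolution features
    x = down(x)

for _ in range(2):              # decoder path: 4 -> 8 -> 16
    x = up(x)
    skip = skips.pop()          # skip connection from the matching level
    x = 0.5 * (x + skip)        # stand-in for concat + convolution

print(x.shape)
```

Without the `skips` list, everything the decoder sees would have passed through the blurry 4x4 bottleneck; the skip connections are what let fine spatial detail survive, which is why U-Nets reconstruct sharp images.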


DALL·E’s Text-to-Image via Diffusion

DALL·E turns prompts into pictures by first encoding the text into a latent representation (capturing objects, actions, and relationships), then seeding a noisy image and iteratively denoising it with a diffusion model, typically a U-Net conditioned on those text features. Trained on large text–image pairs, this step-by-step process refines structure and detail until the final image matches the prompt, illustrating exactly how diffusion “unravels noise” into coherent visuals.

U-Net works mainly with local patterns (small areas of the image at a time). While great for sharp, detailed pictures, it struggles with long-range consistency—for example, keeping multiple objects coherent in the same scene, or ensuring smooth transitions across video frames.

Diffusion Transformers (DiT) and Sora

To overcome these limits, researchers introduced Diffusion Transformers (DiT). Instead of using convolutions like U-Net, DiT uses the transformer architecture—the same technology behind GPT models. Transformers can look at the entire image or video sequence at once, making them much better at handling global structure and temporal coherence.
A strong example is OpenAI’s Sora, which uses DiT to generate videos from text prompts. By treating frames as tokens and applying attention across space and time, Sora can keep the same characters, objects, and motion consistent across seconds of video. Where U-Net gave us photorealistic images, DiT marks the step into cinematic, coherent video generation.
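The "frames as tokens" idea is mostly reshaping. The sketch below (my own illustration; the shapes and patch size are arbitrary, not Sora's actual configuration) cuts a tiny 3-frame "video" into 4x4 spatial patches and flattens them into one sequence, so a transformer's attention can relate any patch to any other patch across both space and time.

```python
# DiT-style tokenisation sketch: turn a (T, H, W) "video" into a flat
# sequence of spatio-temporal patch tokens for a transformer to attend over.
import numpy as np

T, H, W, P = 3, 8, 8, 4        # frames, height, width, patch size
video = np.random.default_rng(0).standard_normal((T, H, W))

# (T, H, W) -> (T, H/P, P, W/P, P) -> group patch pixels together
patches = video.reshape(T, H // P, P, W // P, P).transpose(0, 1, 3, 2, 4)
tokens = patches.reshape(-1, P * P)   # one row per patch, across all frames

print(tokens.shape)            # 3 frames x 2 x 2 patches = 12 tokens of 16 values
```

Because patches from different frames sit in the same sequence, attention can directly compare, say, a character's face in frame 1 with the same region in frame 3, which is the mechanism behind the temporal consistency described above.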

The development of generative AI is not merely an academic progression but a practical toolkit that is actively transforming domains such as design, entertainment, research, and communication. Models such as MidJourney, DALL·E, and Sora exemplify this evolution, demonstrating applications ranging from stylized visual synthesis to prompt-driven photorealism and temporally coherent video generation. Each architectural transition, from Generative Adversarial Networks (GANs) to diffusion models and, more recently, to transformer-based DiT frameworks, has addressed specific computational and representational constraints. Looking ahead, the central research question will shift from whether machines can generate content to how novel modes of creativity, analysis, and storytelling can be achieved through human–AI collaboration.

 


Aima Adil

09/25/2025

