The Journey: How AI Went From Blurry Faces to Photorealism
Back in 2014, AI image generation was like staring at a dream through frosted glass: blurry, unstable, and often surreal. Generative Adversarial Networks (GANs) changed the game by teaching two neural networks to compete: one creating fake images, the other exposing them. This rivalry sparked the first wave of AI-generated faces and landscapes, but it came with cracks: unstable training, repetitive outputs, and those infamous “GAN-smile” portraits.
Fast forward a few years, and diffusion models arrived to change everything. Instead of leaping from noise to realism in one unstable jump, they learned to build images step by step, carefully peeling order out of randomness, powering tools like MidJourney, Stable Diffusion, and DALL·E that define today’s creative AI revolution.
This shift marks more than just better pictures; it’s the story of how AI went from imitating reality to reimagining it with precision and control.
What Are GANs?
A Generative Adversarial Network (GAN) is a type of machine learning model built to generate new, realistic data by learning patterns from existing datasets. It works in an unsupervised learning framework powered by deep learning. The architecture involves two competing neural networks:

- A generator, which turns random noise into fake samples and tries to make them indistinguishable from real data.
- A discriminator, which examines samples and tries to tell the real ones from the generator’s fakes.
Through this adversarial process, the generator improves until it produces outputs that can fool the discriminator, resulting in increasingly realistic images, sounds, or text. While deep learning had already excelled in tasks like image classification and speech recognition, GANs tackled a harder challenge: not just recognizing data, but inventing new data — a leap that set the stage for modern generative AI.
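The adversarial setup can be sketched numerically. The code below uses throwaway linear “networks” (purely illustrative stand-ins, not a real GAN architecture) just to show the two competing losses of the minimax game: the discriminator is rewarded for telling real from fake, the generator for fooling it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two networks (hypothetical linear models).
W_g = rng.normal(size=(2, 4))   # generator: latent z (2-d) -> fake sample (4-d)
W_d = rng.normal(size=(4,))     # discriminator: sample (4-d) -> real/fake score

def generator(z):
    return z @ W_g

def discriminator(x):
    # Sigmoid turns the raw score into a probability "this sample is real".
    return 1.0 / (1.0 + np.exp(-(x @ W_d)))

real = rng.normal(loc=3.0, size=(8, 4))   # pretend batch of "real" data
fake = generator(rng.normal(size=(8, 2))) # batch of generated samples

# Discriminator loss: push D(real) toward 1 and D(fake) toward 0.
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1 - discriminator(fake)))
# Generator loss: fool the discriminator, pushing D(fake) toward 1.
g_loss = -np.mean(np.log(discriminator(fake)))

print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

In a real GAN, gradient updates alternate between the two losses; the instability described below comes from this tug-of-war never settling into a clean minimum.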
GANs were a breakthrough, but training them reliably was notoriously fragile. The adversarial nature of two neural networks competing (generator vs. discriminator) often led to unstable dynamics: vanishing gradients, oscillating losses, convergence failures, and the notorious mode collapse, where the generator “gets stuck” producing only a few very similar outputs.
This instability often showed up in the generated images themselves. One especially persistent quirk was what users dubbed the “GAN-smile” portrait: a default half-smile or overly symmetric grin that the generator produced as a “safe bet.” The GAN leaned into this because smiling faces were abundant in training data, and a gentle smirk was easier to model plausibly than more expressive or varied facial emotions.
Diffusion Models: The New Contender
At its core, diffusion in AI is about breaking down and rebuilding structure. Inspired by non-equilibrium statistical physics, the idea is to gradually destroy the patterns in data through a forward process and then teach a model to reverse that destruction. By learning how to add noise step by step and then carefully remove it, diffusion models provide a powerful and flexible way to generate entirely new data, like images, from pure randomness.
Forward Diffusion: Destroying Structure
The essential idea, borrowed from non-equilibrium statistical physics, is to gradually destroy structure in data through an iterative forward diffusion process. Imagine starting with a clear image and then slowly sprinkling noise over it until nothing recognizable remains. This step-by-step corruption process makes the data distribution easier to model.
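The forward process has a convenient closed form: the noisy image at step t can be sampled in one shot as sqrt(ᾱ_t)·x₀ + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of the per-step noise schedule. A minimal NumPy sketch, with an illustrative (untuned) linear schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "image": an 8x8 grid of values in [0, 1].
x0 = rng.random((8, 8))

# Linear noise schedule (illustrative values, not a tuned schedule).
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_diffuse(x0, t):
    """Sample x_t directly via the closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x_early = forward_diffuse(x0, 5)     # still mostly signal
x_late = forward_diffuse(x0, T - 1)  # almost pure noise

# Correlation with the original image fades as t grows.
corr = lambda a, b: np.corrcoef(a.ravel(), b.ravel())[0, 1]
print(f"corr at t=5:  {corr(x0, x_early):.2f}")
print(f"corr at t=49: {corr(x0, x_late):.2f}")
```

The printed correlations show exactly the “step-by-step corruption” described above: early steps barely disturb the image, while by the final step almost nothing of the original survives.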
Reverse Diffusion: Rebuilding from Noise
Once the data is fully “destroyed” into noise, the challenge is to reverse the process. The reverse diffusion process trains a neural network to peel away noise, little by little, until an image emerges again. Unlike GANs, which must jump from pure noise to a full image in a single leap, diffusion models tackle the problem in small, manageable steps, making training more stable and the results more realistic.
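The small, manageable steps can be sketched as a DDPM-style sampling loop. Note the `predict_noise` function here is a hypothetical placeholder standing in for the trained denoising network; only the loop structure is the real technique.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50
betas = np.linspace(1e-4, 0.2, T)   # same illustrative schedule as the forward pass
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Placeholder for the trained denoiser (hypothetical).
    A real model would be a neural network predicting the noise in x_t."""
    return np.zeros_like(x_t)

def reverse_diffuse(shape):
    """DDPM sampling: start from pure noise and denoise step by step."""
    x = rng.normal(size=shape)            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        # Mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)  # sampling noise
    return x

sample = reverse_diffuse((8, 8))
print(sample.shape)
```

Each iteration only has to undo one small slice of noise, which is why training is so much more stable than a GAN’s single noise-to-image leap.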
How Image Generation Works in Diffusion Models
Basic diffusion models just make random variations; you can’t tell them what you want. So researchers added guidance, conditioning the denoising process on a signal such as a text prompt.
Two main ways to guide:

- Classifier guidance: a separate classifier scores the partially denoised image, and its gradients steer each denoising step toward the desired class or caption.
- Classifier-free guidance: the model is trained both with and without the conditioning signal, and at sampling time the two predictions are blended, extrapolating toward the conditioned one.

Problem: normal diffusion models work directly with raw pixels, which is slow and expensive.
Solution: latent diffusion. Compress the image into a smaller latent space with an autoencoder, run the diffusion steps there, and decode the result back to pixels.
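One widely used way to steer a diffusion model with a prompt, classifier-free guidance, reduces to one line of arithmetic at sampling time. In this sketch, `predict_noise` is a hypothetical stand-in for a text-conditioned denoiser; only the blending formula at the end is the real technique.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(x_t, prompt_embedding):
    """Placeholder for a text-conditioned denoiser (hypothetical).
    Passing prompt_embedding=None means 'unconditional'."""
    base = 0.1 * x_t
    if prompt_embedding is None:
        return base
    return base + 0.05 * prompt_embedding  # conditioning nudges the prediction

x_t = rng.normal(size=(8, 8))        # current noisy image
prompt = rng.normal(size=(8, 8))     # stand-in for an encoded text prompt

w = 7.5  # guidance scale: how strongly to follow the prompt
eps_uncond = predict_noise(x_t, None)
eps_cond = predict_noise(x_t, prompt)

# Classifier-free guidance: extrapolate past the conditional prediction.
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
```

Larger guidance scales push the sample harder toward the prompt, usually at the cost of diversity.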
MidJourney
MidJourney is a proprietary text-to-image system that, like DALL·E and Stable Diffusion, is built on the logic of latent diffusion models (LDMs). Instead of working directly on raw pixels, the model first compresses images into a smaller “latent space,” performs the diffusion denoising steps there, and then reconstructs the final image. This makes training and generation far more efficient while preserving quality.
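The encode → denoise → decode pipeline can be sketched with toy stand-ins. A real latent diffusion model uses a learned autoencoder (a VAE), not the average pooling used here; the point is only that the expensive denoising happens on the small grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img):
    """Toy 'encoder': 2x2 average pooling halves each dimension.
    (A real LDM uses a learned autoencoder, not pooling.)"""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(latent):
    """Toy 'decoder': nearest-neighbour upsampling back to full size."""
    return latent.repeat(2, axis=0).repeat(2, axis=1)

image = rng.random((64, 64))
latent = encode(image)              # diffusion runs here, on a 32x32 grid
noisy_latent = latent + 0.1 * rng.normal(size=latent.shape)
# ... denoising steps would run in this smaller space ...
reconstructed = decode(noisy_latent)

print(latent.shape, reconstructed.shape)   # (32, 32) (64, 64)
```

Working on a grid a quarter the size (or smaller, in real systems) is what makes training and generation far cheaper without a visible quality loss.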
What is U-Net, and why it matters
Karray presents U-Net as a powerful backbone for early diffusion models: an encoder–decoder convolutional network whose contracting path captures broad context while skip connections carry fine spatial detail straight to the expanding path, letting the model denoise images without losing sharp edges and textures.
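The skip-connection idea can be shown with toy pooling and upsampling in place of real convolution blocks; the concatenation at the end is the U-Net trick that preserves detail through the bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(x):
    """2x2 average pooling (stand-in for a strided conv block)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling (stand-in for a transposed conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = rng.random((16, 16, 8))        # input feature map: H x W x channels

# Encoder path: resolution shrinks, context grows.
skip = x                           # saved for the skip connection
bottleneck = downsample(x)         # 8 x 8 x 8

# Decoder path: upsample, then concatenate the saved encoder features
# so fine spatial detail survives the bottleneck.
up = upsample(bottleneck)          # back to 16 x 16 x 8
merged = np.concatenate([up, skip], axis=-1)   # 16 x 16 x 16

print(merged.shape)
```

Without the concatenation, everything the network outputs would have passed through the blurry bottleneck; with it, sharp local structure flows around the bottleneck directly.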
DALL·E’s Text-to-Image via Diffusion
DALL·E turns prompts into pictures by first encoding the text into a latent representation (capturing objects, actions, and relationships), then seeding a noisy image and iteratively denoising it with a diffusion model, typically a U-Net conditioned on those text features. Trained on large text–image pairs, this step-by-step process refines structure and detail until the final image matches the prompt, illustrating exactly how diffusion “unravels noise” into coherent visuals.
U-Net works mainly with local patterns (small areas of the image at a time). While great for sharp, detailed pictures, it struggles with long-range consistency—for example, keeping multiple objects coherent in the same scene, or ensuring smooth transitions across video frames.
Diffusion Transformers (DiT) and Sora
To overcome these limits, researchers introduced Diffusion Transformers (DiT). Instead of using convolutions like U-Net, DiT uses the transformer architecture—the same technology behind GPT models. Transformers can look at the entire image or video sequence at once, making them much better at handling global structure and temporal coherence.
A strong example is OpenAI’s Sora, which uses DiT to generate videos from text prompts. By treating frames as tokens and applying attention across space and time, Sora can keep the same characters, objects, and motion consistent across seconds of video. Where U-Net gave us photorealistic images, DiT marks the step into cinematic, coherent video generation.
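The “frames as tokens” step is essentially a reshape. This sketch (illustrative sizes, not Sora’s actual configuration) turns a tiny video into flat patch tokens that a transformer could then attend over across space and time:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(frames, patch=4):
    """Split a stack of frames into flat patch tokens, the way DiT-style
    models turn images (or video frames) into a transformer's input
    sequence. Dimensions here are illustrative only."""
    t, h, w = frames.shape
    return (frames
            .reshape(t, h // patch, patch, w // patch, patch)
            .transpose(0, 1, 3, 2, 4)
            .reshape(t * (h // patch) * (w // patch), patch * patch))

video = rng.random((8, 32, 32))    # 8 frames of 32x32 "video"
tokens = patchify(video)

# 8 frames x (8 x 8 patches per frame) = 512 tokens, each 16 values long.
print(tokens.shape)
```

Because every token can attend to every other token, a patch in frame 1 and a patch in frame 8 are directly connected, which is what makes the long-range and temporal consistency possible.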
The development of generative AI is not merely an academic progression but a practical toolkit actively transforming domains such as design, entertainment, research, and communication. Models such as MidJourney, DALL·E, and Sora exemplify this evolution, with applications ranging from stylized visual synthesis to prompt-driven photorealism and temporally coherent video generation. Each architectural transition, from Generative Adversarial Networks (GANs) to diffusion models and, more recently, to transformer-based DiT frameworks, has addressed specific computational and representational constraints. Looking ahead, the central question will shift from whether machines can generate content to how novel modes of creativity, analysis, and storytelling can be achieved through human-AI collaboration.
Aima Adil
09/25/2025