ChatGPT launched on November 30, 2022, and reached one million users in five days. It's the kind of statistic that sounds like marketing hyperbole until you realize that Netflix took 3.5 years to hit the same milestone, and Facebook took 10 months. Something fundamentally different had arrived, and almost nobody was truly prepared for it.
But the story doesn't begin in November 2022. It begins years earlier, in a series of research labs, with a simple question: what if we trained a language model on basically all the text that ever existed?
Before the Revolution: The Quiet Build-Up
The transformer architecture — the engine that powers every modern large language model — was introduced in a 2017 Google paper with the now-legendary title "Attention Is All You Need." At the time, it was a clever solution to sequence-to-sequence translation tasks. Nobody foresaw what it would eventually become.
The key insight was the self-attention mechanism: instead of processing tokens sequentially (as RNNs did), transformers could look at all tokens in a sequence simultaneously and learn which ones were relevant to which. Because no step has to wait for the previous token's result, training became far easier to parallelize, which made massive scale suddenly tractable.
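To make the "all tokens at once" idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the paper's full architecture: there is one head, no masking, and the projection matrices `W_q`, `W_k`, `W_v` are identity matrices chosen purely for the demo.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    # Plain list-of-rows matrix product: (n x k) @ (k x m) -> (n x m).
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence.

    X: seq_len rows of d_model features (token embeddings).
    W_q, W_k, W_v: d_model x d_head projection matrices (illustrative).
    """
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_head = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    # One matrix product scores every token against every other token.
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, V), weights

# Three toy "tokens" with two features each; identity projections for clarity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(X, I, I, I)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token attends to every other token. Nothing in the computation depends on processing tokens in order, which is what lets the whole thing run as a few large matrix products.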
OpenAI's GPT-2 (2019) was the first model to make mainstream headlines, not because it was deployed, but because OpenAI initially withheld the full model, citing "concerns about malicious applications," and released it only in stages over the following months. That decision, in retrospect, was both prescient and a bit overstated. GPT-2 could generate coherent paragraphs, but it still rambled. The limits were obvious to anyone who spent five minutes with it.
GPT-3 (2020) changed the conversation. 175 billion parameters. Trained largely on web text, including roughly 570GB of Common Crawl data that remained after filtering about 45TB of compressed raw text. It was the model that introduced many developers to the idea of in-context learning: give it a few examples in the prompt, and it would often generalize to new tasks without any fine-tuning at all.
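In practice, in-context learning is mostly a matter of how the prompt is laid out. The sketch below builds a few-shot prompt as a plain string. The `Input:` / `Output:` labels and the helper name `build_few_shot_prompt` are assumptions made for illustration, not a format that any particular model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: worked examples, then an unanswered query."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
prompt = build_few_shot_prompt(
    examples, "dog", instruction="Translate English to French."
)
print(prompt)
```

The model's weights never change here. The "learning" happens entirely in the forward pass, as the model continues the pattern set up by the examples.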
The Inflection Point: Instruction Tuning and RLHF
Raw scale wasn't enough. The early GPT-3 API was impressive to researchers but awkward for most users — it would complete your prompt, but not necessarily in a helpful way. Ask it a question, and it would statistically continue the pattern of questions followed by answers from its training data, sometimes correctly, often not.
The breakthrough came with instruction tuning and Reinforcement Learning from Human Feedback (RLHF). The idea was deceptively simple:
- Fine-tune the base model on examples of high-quality instructions and responses
- Train a "reward model" to score outputs based on human preferences
- Use RL to push the language model toward higher-scoring outputs
The result was InstructGPT (2022). OpenAI reported that human labelers preferred outputs from a 1.3-billion-parameter InstructGPT model over those of the 175-billion-parameter GPT-3, even though the smaller model had more than 100x fewer parameters. Alignment and helpfulness, it turned out, were at least as important as raw scale.
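The reward-model step in the recipe above is commonly trained with a pairwise preference loss: given two responses to the same prompt, the loss is small when the response humans preferred gets the higher score. Here is a minimal sketch of that loss for a single pair. The function name and the scalar rewards are illustrative; a real reward model is a neural network producing these scores, and the loss is averaged over many labeled comparisons.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one comparison."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is low when the preferred answer already scores higher...
good_ranking = preference_loss(reward_chosen=2.0, reward_rejected=0.5)
# ...and high when the reward model has the order backwards.
bad_ranking = preference_loss(reward_chosen=0.5, reward_rejected=2.0)
print(good_ranking, bad_ranking)
```

Minimizing this loss pushes the reward model to rank responses the way the human labelers did. The RL step then uses those scores as its training signal.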
ChatGPT was essentially InstructGPT with a chat interface and some additional tuning. The interface was the product — giving anyone access to a genuinely helpful AI assistant through a familiar conversational UI.
The Cambrian Explosion
The roughly eighteen months following ChatGPT's launch saw an extraordinary proliferation of models and capabilities:
- GPT-4 (March 2023) — Multimodal (accepts image as well as text input), dramatically improved reasoning, and, according to OpenAI, a score around the top 10% of test takers on a simulated bar exam
- Claude (Anthropic) — Focused on safety and constitutional AI training, introduced "harmlessness, honesty, and helpfulness" as design principles
- Gemini (Google DeepMind) — Deeply integrated with Google's search and workspace ecosystem
- Llama / Llama 2 / Llama 3 (Meta) — Open-weight models that democratized fine-tuning and sparked an entire ecosystem of derivatives
- Mistral, Falcon, Command R, Phi — Efficient models optimized for specific use cases, often dramatically smaller than their benchmark-topping counterparts
What Actually Changed (and What Didn't)
It's worth being precise about what LLMs genuinely revolutionized, because the hype has been so intense that it's easy to lose the signal in the noise.
What demonstrably changed:
- Developer productivity — GitHub Copilot, Cursor, Claude Code, and similar tools have measurably accelerated certain coding tasks. Some controlled studies report roughly 20-55% faster completion on well-specified tasks, though results vary with the task and the developer's experience.
- First-draft generation — Marketing copy, emails, summaries, documentation. Not perfect, but a useful starting point.
- Information retrieval — Conversational Q&A is often faster than traditional search for well-defined factual questions.
- Accessibility of technical knowledge — A junior developer can now ask "why is my async code deadlocking" and get a useful answer instead of needing to decode a Stack Overflow thread.
What hasn't fundamentally changed (yet):
- Long-horizon planning and reliable multi-step reasoning on novel problems
- Factual accuracy without retrieval augmentation (hallucination remains an issue)
- Physical-world understanding and embodied tasks
- Genuine creativity vs. sophisticated recombination
The Economics: A Brutal Shakeout Is Coming
Training a frontier model is estimated to cost anywhere from a hundred million dollars to, by some projections, billions. By some estimates, the compute used for the largest training runs has been doubling roughly every six months as researchers push the scaling frontier. This creates a market structure that few observers discuss openly: only a handful of organizations can play at the frontier.
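It helps to see how quickly a six-month doubling time compounds. The short calculation below assumes the doubling rate stays constant, which is a simplification for illustration rather than a forecast.

```python
def compute_growth(years, doubling_months=6.0):
    """Multiplicative growth in compute after `years` at a fixed doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)

for years in (1, 2, 5):
    print(f"{years} yr: {compute_growth(years):,.0f}x")
```

At that pace compute grows 4x in one year, 16x in two, and 1,024x in five, which is why budgets that look enormous today can look modest a few model generations later.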
The economics look something like this:
"You need the compute budget of a nation-state, the engineering depth of a top-tier tech company, and the data flywheel of a platform with billions of users. That describes maybe five organizations on Earth."
The strategic response from everyone else is specialization: smaller, faster, cheaper models fine-tuned for specific domains — legal, medical, coding, customer service. The commodity tier of the market is already being contested aggressively on price.
What Comes Next
The honest answer is: nobody knows. The scaling hypothesis has largely held so far, and no hard wall has been clearly demonstrated yet. But researchers are increasingly exploring directions beyond pure scale:
- Mixture of Experts (MoE) — Only activate a fraction of parameters for any given input, improving efficiency at scale
- Retrieval-Augmented Generation (RAG) — Connect models to external document stores or live databases so answers can be grounded in retrieved text, reducing hallucination and enabling up-to-date knowledge
- Multimodality — Vision, audio, video, and structured data as first-class inputs
- Agentic frameworks — Models that can take actions, use tools, and complete multi-step tasks autonomously
- Test-time compute — o1-style chain-of-thought reasoning that trades inference time for accuracy on hard problems
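Of the directions listed above, Mixture of Experts is perhaps the easiest to show in a few lines. The sketch below routes an input to its top-k experts and mixes their outputs. It is a toy under stated assumptions: the router scores are passed in by hand (a real model computes them with a learned layer), each "expert" is a simple scalar function rather than a feed-forward network, and the names `moe_forward` and `make_expert` are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Send x to the k highest-scoring experts and mix their outputs.

    router_logits: one score per expert (hand-written here for the demo).
    experts: list of callables; only the k selected ones are evaluated.
    """
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize the gate weights over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    output = 0.0
    for gate, i in zip(gates, top):
        output += gate * experts[i](x)
    return output, top

calls = []  # records which experts actually ran

def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return scale * x
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
output, chosen = moe_forward(1.0, router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts, k=2)
print(output, chosen, calls)
```

Only two of the four experts run for this input, so the compute per input scales with k rather than with the total number of experts. That is the efficiency argument in miniature.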
Conclusion: We're Still in the Prologue
The AI revolution is real, but we're still in the early chapters. The transformation of software development, knowledge work, and creative industries is underway — but it's happening at human timescales, not internet timescales. Companies are integrating AI tools gradually, workers are adapting (and resisting), and the regulatory environment is still catching up.
What's clear is that language models have demonstrated something profound: text is a universal interface. Nearly every domain of human knowledge is encoded in text. A model trained on enough text, with the right techniques, can reason about that knowledge in surprisingly sophisticated ways. We don't yet know how far this scales — but the experiments are running.
Whatever comes next, the five years between the transformer paper and ChatGPT will be seen as one of the most consequential periods in the history of computing. We're lucky to be watching it happen in real time.