Agentic engineering is the discipline that will define how software gets built for the next decade. But it did not appear overnight. It is the product of seven decades of research, three waves of AI hype, a handful of viral open-source projects, one Stanford PhD who keeps coining the right term at the right time, and an industry that finally has models smart enough to act on their own.
This is the complete history — from Alan Turing's first spark to Andrej Karpathy's February 2026 declaration that vibe coding is passe, and from AutoGPT's 100,000-star explosion to the Agentic AI Foundation that now governs the standards. Every milestone, every inflection point, every thread that connects the dots.
TL;DR: Agentic engineering — coined by Karpathy in Feb 2026 — is orchestrating AI agents with human oversight. It evolved through 70+ years: Turing (1950) → deep learning (2012) → Transformers (2017) → AutoGPT (2023) → MCP (2024) → vibe coding (2025) → agentic engineering (2026). The agentic AI market is projected to grow from $7-9B in 2026 to $47-93B by 2030-2032 (Fortune Business Insights, Grand View Research, MarketsandMarkets). Gartner predicts 40% of enterprise apps will have AI agents by end of 2026, up from less than 5% in 2025. Taskade Genesis embodies this evolution — 150,000+ apps built with AI agents, automations, and workspace-level orchestration.
What Is Agentic Engineering?
Agentic engineering is a software development approach where humans orchestrate AI agents who do the actual coding, testing, and deployment, while the human provides architectural oversight, quality standards, and strategic direction. The term was coined by Andrej Karpathy on February 8, 2026, as the professional successor to vibe coding.
Karpathy's exact words:
"Agentic, because the new default is that you are not writing the code directly 99% of the time. You are orchestrating agents who do and acting as oversight. Engineering, to emphasize that there is an art and science and expertise to it."
The distinction is precise:
| | Vibe Coding | Agentic Engineering |
|---|---|---|
| Who writes code | AI generates, human accepts | AI generates, human reviews with the same rigor as a human PR |
| Planning | Start prompting immediately | Plan before prompting — design docs, specs, architecture |
| Testing | Hope it works | Test relentlessly — the biggest differentiator |
| Ownership | "It works, I think" | Own the system — docs, version control, CI, monitoring |
| Best for | Prototypes, exploration, learning | Production systems, team projects, anything that must be maintained |
| Risk | 1.7x more major issues, 2.74x more security vulnerabilities (CodeRabbit data) | Human-level quality with AI-level speed |
| Who benefits most | Beginners getting started | Senior engineers as force multipliers (Osmani) |
Google's Addy Osmani identified the "80% Problem": agents generate 80% of a solution fast, but the remaining 20% — architecture, edge cases, production hardening — requires deep engineering knowledge. Agentic engineering is the discipline of directing that last 20%.
This is not casual prompting. It is not "accept all and hope for the best." It is a discipline — with principles, tools, patterns, and a 70-year intellectual lineage that makes it the logical conclusion of everything computer science has been building toward.
To understand why agentic engineering matters, you need to understand where it came from.

The Prehistory: Foundations of Machine Intelligence (1950–2011)
Alan Turing and the First Spark (1950)
Every history of AI begins with Alan Turing. His 1950 paper "Computing Machinery and Intelligence" asked the question that launched the field: Can machines think?
Turing proposed what became known as the Turing Test — if a machine can converse with a human and the human cannot reliably distinguish it from another human, the machine can be said to "think." This was not a technical specification. It was a philosophical provocation. And it worked — it gave the field a North Star.

A rebuilt "Bombe" machine designed by Alan Turing. The device allowed the British to decipher encrypted German communication during World War II. Image credit: Antoine Taveneaux
The Birth of AI as a Field (1956)
In 1956, John McCarthy coined the term "artificial intelligence" at the Dartmouth Conference — a summer workshop where a small group of researchers declared that "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."
The optimism was extraordinary. Herbert Simon predicted in 1957 that within ten years, a computer would be chess champion and discover an important mathematical theorem. He was off by three decades on the chess part (Deep Blue arrived in 1997, not 1967) and arguably still waiting on the math.
The First AI Winter (1974–1980)
Early AI research hit a wall. The models were too simple, the computers too slow, and the problems too hard. Funding dried up. DARPA cut grants. The field entered its first "AI winter" — a period of reduced funding and pessimism that would repeat.
Expert Systems and the Second Winter (1980–1993)
The 1980s brought expert systems — rule-based programs that encoded human knowledge into if-then rules. On a pivotal 1984 episode of The Computer Chronicles, three of AI's founding figures laid out the vision: John McCarthy (who coined "artificial intelligence" and invented LISP), Nils Nilsson (Stanford), and Edward Feigenbaum (who coined the term "knowledge engineering").
The promise was intoxicating. MYCIN could diagnose 20 infectious diseases using 300 hand-coded rules. Companies like Digital Equipment Corporation deployed XCON, which saved $40 million annually configuring computer orders. Dendral could infer molecular structures from mass spectrometry data. AI was a billion-dollar industry.
But McCarthy, even as the field celebrated, identified the fatal flaw: expert systems had no common sense. They could diagnose a rare blood infection but could not understand that a patient is a person who lives in a world with gravity, weather, and emotions. Feigenbaum's knowledge engineers could extract specialist expertise, but the "things everybody knows" — the vast ocean of implicit knowledge humans navigate unconsciously — proved impossible to formalize into rules.
Nilsson called these systems brittle — a word that would prove prophetic. A system that works perfectly within its narrow domain and fails catastrophically one step outside it is not intelligence. It is a lookup table with ambitions. By the late 1980s, expert systems collapsed under the weight of their own maintenance costs and inflexibility. The second AI winter followed.
The irony is that expert systems were the first proto-agents — software that made autonomous decisions within a domain. The concept of "knowledge engineering" — encoding human expertise into a system that can act on it — is a direct ancestor of today's agentic engineering. The difference: modern AI agents learn from data rather than from hand-coded rules, and they generalize across domains rather than shattering at the boundary.
Expert Systems → Modern AI Agents: The Lineage
```
Expert System (1980s)          AI Agent (2026)
┌──────────────────┐           ┌────────────────────────┐
│ Hand-coded rules │           │ Learned weights        │
│ 300 rules max    │           │ Billions of parameters │
│ One domain only  │           │ Cross-domain           │
│ Brittle at edges │           │                        │
│ No learning      │           │ Continuous learning    │
│ No memory        │           │                        │
│ No tool use      │           │ 22+ tools              │
└──────────────────┘           └────────────────────────┘

Same goal: autonomous decision-making
Different foundation: rules vs. learned representations
```
From Perceptrons to Hopfield Networks: The Memory Problem (1957–1986)
Frank Rosenblatt's Perceptron stunned the world in 1957 — a machine that could learn to recognize patterns completely automatically. The New York Times reported it was "expected to walk, talk, see, write, reproduce itself, and be conscious of its existence." It learned by adjusting weighted connections between inputs (dials multiplying signals) until it could classify patterns correctly. The Perceptron Learning Rule was elegant: if the output is wrong, adjust the weights by a fixed learning rate. If correct, leave them alone.
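The rule fits in a few lines. A minimal sketch in Python (the AND-gate training data and the `predict` helper are illustrative choices, not Rosenblatt's original setup):

```python
def perceptron_train(samples, epochs=20, lr=0.1):
    """Perceptron Learning Rule: adjust weights only when the output is wrong."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Binary step activation: fire if the weighted sum crosses threshold
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out  # zero when correct -> weights left alone
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# AND is linearly separable, so the rule converges to a correct classifier
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(data)
```

Run the same loop on XOR's truth table and it never converges, no matter how many epochs you give it — which is exactly the limitation Minsky and Papert would soon formalize.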
But Marvin Minsky and Seymour Papert's 1969 book Perceptrons exposed a fatal limitation: single-layer networks could not learn non-linearly separable patterns like XOR. The field stalled — nobody could train networks with multiple layers. Widrow and Hoff's LMS algorithm came agonizingly close but could not push gradients through layers with binary step functions (slope = zero everywhere). Neural network research nearly died.
Then in 1982, John Hopfield published a paper that changed how we think about memory itself. His Hopfield network — a recurrent network where neurons influence each other through weighted connections — showed that memories in neural networks are not stored in locations like computer RAM. They are stored as stable states of the entire network. Feed the network a corrupted version of a memory and it auto-completes, gravitating back to the stored pattern. This is associative memory: you recall by content, not by address.
The insight was profound: computer memory has a place (a binary address), but neural network memory has a time — a dynamic trajectory toward a stable attractor. Hopfield proved that networks of simple neurons exhibit emergent memory as a natural behavior of the system, not as an engineered feature. His work won the 2024 Nobel Prize in Physics — recognition that the physics of neural networks is foundational science, not applied engineering.
This matters for the agentic engineering story because the same principle — memory as a dynamic property of connected systems, not static storage — is exactly what separates agentic workspaces from traditional software. A Workspace DNA system stores knowledge not as files in folders but as patterns of context that agents can retrieve associatively: ask a question and the relevant memory surfaces. Hopfield networks proved this was physically possible. Modern AI agents make it practical.
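The mechanism itself is simple enough to show in a few lines. A toy sketch of a Hopfield network (six neurons, one stored pattern, ±1 coding; the network size and corruption example are mine):

```python
def hopfield_store(patterns):
    """Hebbian learning: weights are averaged outer products of stored patterns."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def hopfield_recall(W, state, steps=10):
    """Each neuron repeatedly aligns with its weighted input until the
    network settles into a stable attractor -- a stored memory."""
    s = list(state)
    for _ in range(steps):
        for i in range(len(s)):
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if h >= 0 else -1
    return s

stored = [1, 1, 1, -1, -1, -1]      # one memory, +/-1 coding
W = hopfield_store([stored])
corrupted = [1, -1, 1, -1, -1, -1]  # flip one bit
recalled = hopfield_recall(W, corrupted)
```

Feed in the corrupted pattern and the dynamics pull it back to the stored one: recall by content, not by address.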
The Backpropagation Breakthrough and Neural Network Renaissance (1986–2011)
The solution to multi-layer training came in 1986 when Rumelhart, Hinton, and Williams replaced the binary step activation function with a smooth sigmoid curve — giving gradients a slope to follow. The backpropagation algorithm generalized Widrow and Hoff's delta rule through the chain rule of calculus, propagating error signals backward through every layer. The same principle — adjusted for scale — trains every neural network today, including the 175 billion parameters of GPT-3 and the transformer architectures behind modern AI agents.
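The difference between the step function and the sigmoid is the whole story: one has zero slope everywhere, the other gives the chain rule something to propagate. A schematic sketch (the two-layer weights and input are arbitrary numbers for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Smooth curve -> nonzero slope everywhere, so error signals
    # can flow backward through the chain rule.
    s = sigmoid(x)
    return s * (1.0 - s)

def step_grad(x):
    # Binary step: slope is zero everywhere it is defined,
    # which is why LMS could not train hidden layers.
    return 0.0

# Chain rule through two layers: dL/dw1 = dL/dy * dy/dh * dh/dw1
x, w1, w2, target = 1.0, 0.5, -0.3, 1.0
h = sigmoid(w1 * x)          # hidden activation
y = sigmoid(w2 * h)          # output
dL_dy = 2 * (y - target)     # squared-error gradient
dy_dh = sigmoid_grad(w2 * h) * w2
dh_dw1 = sigmoid_grad(w1 * x) * x
grad_w1 = dL_dy * dy_dh * dh_dw1   # nonzero: the hidden weight is trainable
```

Swap `sigmoid_grad` for `step_grad` in that product and `grad_w1` collapses to zero — the pre-1986 dead end in one line.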
IBM's Deep Blue defeated world chess champion Garry Kasparov in 1997 — the moment AI entered public consciousness.

Garry Kasparov competing against IBM's Deep Blue chess computer in 1997. Image credit: kasparov.com
The 2000s brought big data, better algorithms, and increasing compute. By 2011, IBM Watson won Jeopardy!, and the stage was set for the deep learning revolution that would change everything.
| Year | Milestone | Significance |
|---|---|---|
| 1950 | Turing's "Computing Machinery and Intelligence" | Proposed the Turing Test, launched the field |
| 1956 | Dartmouth Conference | McCarthy coins "artificial intelligence" |
| 1957 | Perceptron (Frank Rosenblatt) | First neural network hardware — learns by adjusting weighted connections |
| 1969 | Perceptrons (Minsky & Papert) | Exposed single-layer limits (XOR problem), nearly killed neural network research |
| 1974 | First AI Winter begins | Funding cuts, pessimism |
| 1982 | Hopfield network | Memory as stable states, not addresses — associative recall (2024 Nobel Prize in Physics) |
| 1984 | Expert systems peak (MYCIN, XCON) | McCarthy warns: no common sense |
| 1986 | Backpropagation (Rumelhart, Hinton, Williams) | Smooth activation functions + chain rule let gradients flow through layers |
| 1997 | Deep Blue defeats Kasparov | AI enters public consciousness |
| 2011 | IBM Watson wins Jeopardy! | NLP reaches mainstream awareness |
The Deep Learning Revolution (2012–2016)
ImageNet and the AlexNet Moment (2012)
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. It won by a staggering margin — reducing the error rate from 26% to 15.3%. This was not an incremental improvement. It was a paradigm shift.
The key insight: deep convolutional neural networks, trained on GPUs, could learn visual features that hand-engineered systems could not. The entire computer vision field pivoted to deep learning within months.
This matters for the agentic engineering story because one of AlexNet's co-authors — Ilya Sutskever — would go on to co-found OpenAI. And one of the students in the Stanford lab that developed the ImageNet dataset was Andrej Karpathy, who would later coin both "vibe coding" and "agentic engineering."
Andrej Karpathy: The Thread Through the Story
To understand agentic engineering, you need to understand the man who named it.
Andrej Karpathy was born in Bratislava, Czechoslovakia, in 1986. His family moved to Toronto when he was 15. He completed his undergraduate degree in Computer Science and Physics at the University of Toronto in 2009, a master's at the University of British Columbia in 2011, and a PhD at Stanford in 2015 under Fei-Fei Li — the computer scientist behind ImageNet.
During his PhD, Karpathy interned at Google Brain (2011), Google Research (2013), and DeepMind (2015). He created and was the primary instructor of Stanford's CS 231n: Convolutional Neural Networks for Visual Recognition, one of the largest classes at Stanford, growing from 150 students in 2015 to 750 by 2017.
| Period | Role | Key Contribution |
|---|---|---|
| 2009–2015 | Stanford PhD student | ImageNet research, CS 231n course |
| 2015–2017 | OpenAI founding member | Research scientist, built core AI capabilities |
| 2017–2022 | Tesla Director of AI | Led Autopilot vision, real-world AI deployment |
| Feb 2023 | Returned to OpenAI | Brief second stint |
| Feb 2024 | Left OpenAI | Founded Eureka Labs |
| Feb 2025 | Coined "vibe coding" | Changed how millions think about AI-assisted building |
| Jun 2025 | YC AI Startup School | "Software Is Changing (Again)" — defined Software 3.0 |
| Dec 2025 | 2025 LLM Year in Review | Identified 6 paradigm shifts including "ghosts" and "vibe coding" |
| Feb 2026 | Coined "agentic engineering" | Declared vibe coding passe, named the next era |
| Mar 2026 | Released autoresearch | Open-source proof of agentic engineering in ML research |
| Mar 2026 | Launched AgentHub | Agent-first collaboration platform — "GitHub for agents" |
Karpathy is not just an observer. He is the thread that connects deep learning research, real-world AI deployment at Tesla, OpenAI's foundational work, and the conceptual frameworks that name each era. When he coins a term, the industry listens.
DeepMind, AlphaGo, and Reinforcement Learning (2014–2016)
While Karpathy was at Stanford, Google acquired DeepMind in January 2014 for approximately $500 million. In March 2016, DeepMind's AlphaGo defeated world Go champion Lee Sedol 4-1 — a feat that many AI researchers had predicted was decades away.
AlphaGo's significance for the agentic engineering story: it demonstrated that AI could make decisions in complex, ambiguous environments with long-term consequences. Go has more possible board positions than atoms in the universe. AlphaGo learned to evaluate positions and plan sequences of moves — a precursor to the planning capabilities that modern AI agents would need.
The Transformer Paradigm (2017–2022)
"Attention Is All You Need" (2017)
In June 2017, eight Google researchers published a paper that would reshape the entire field: "Attention Is All You Need." The Transformer architecture they introduced replaced sequential processing with parallel attention mechanisms, enabling models to process entire sequences simultaneously.
The Transformer made everything that follows in this history possible — GPT, BERT, Claude, Gemini, and every AI agent that orchestrates them.
The same month the Transformer paper was published, Karpathy left OpenAI to become Tesla's Director of AI, where he would spend five years applying deep learning to real-world autonomous systems.
The GPT Series (2018–2022)
OpenAI used the Transformer to build the GPT (Generative Pre-trained Transformer) series:
| Model | Year | Parameters | Key Innovation |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Proved unsupervised pre-training works |
| GPT-2 | 2019 | 1.5B | "Too dangerous to release" (initially withheld) |
| GPT-3 | 2020 | 175B | Few-shot learning, first signs of emergent behavior |
| InstructGPT | 2022 | — | RLHF alignment, followed instructions better |
| ChatGPT | Nov 2022 | — | 100M users in 2 months, fastest-growing consumer app ever |
ChatGPT's launch in November 2022 was the moment AI went mainstream. It reached 100 million users in two months — faster than TikTok (9 months) and Instagram (2.5 years). For the first time, anyone could have a conversation with an AI that felt genuinely intelligent.
But ChatGPT was a chatbot, not an agent. It could answer questions, not take actions. The gap between "impressive conversational AI" and "autonomous AI agent" would take another year to begin closing.
Anthropic CEO Dario Amodei drew this exact line in his interview with Nikhil Kamath (2026): "Coding is going away first. The broader task of software engineering will take longer." The elements that remain human — system design, understanding user demand, managing teams of AI models — are precisely the skills agentic engineering would later formalize.
The Academic Foundations of Agentic AI (2022)
Two academic papers published in 2022 laid the theoretical groundwork for everything that would follow:
Chain of Thought Prompting (Wei et al., 2022) — Researchers at Google demonstrated that prompting language models to "think step by step" dramatically improved performance on complex reasoning tasks. This was the first proof that LLMs could decompose problems into sequential steps — a prerequisite for any agent that needs to plan.
ReAct: Reasoning + Acting (Yao et al., 2022) — This paper introduced the agent loop that would power every subsequent AI agent framework: think → act → observe → repeat. ReAct showed that LLMs could synergize reasoning traces with tool use, overcoming hallucination by grounding responses in real-world interactions.
These papers were not consumer products. They were not viral tweets. But without Chain of Thought and ReAct, there is no AutoGPT, no LangChain, no Claude Code, and no agentic engineering.
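The ReAct loop itself is easy to sketch. In the toy below, the `llm` function is a hard-coded stand-in for a model call; the `Action: tool[input]` format follows the paper's style, but the tool names and stop condition are illustrative assumptions:

```python
def calculator(expression: str) -> str:
    # Demo-only arithmetic tool; restricted eval, still not for production
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def llm(history):
    # Mocked policy: before any observation, reason and pick a tool;
    # afterwards, emit a final answer grounded in the observation.
    if not any(step.startswith("Observation:") for step in history):
        return "Thought: I should compute this.\nAction: calculator[17 * 23]"
    obs = history[-1].split(": ", 1)[1]
    return f"Final Answer: {obs}"

def react(question, max_steps=5):
    """think -> act -> observe -> repeat, until a final answer appears."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        output = llm(history)
        history.append(output)
        if "Final Answer:" in output:
            return output.split("Final Answer: ", 1)[1]
        # Parse "Action: tool[input]" and ground the next turn in its result
        action = output.split("Action: ", 1)[1]
        name, arg = action.split("[", 1)
        history.append(f"Observation: {TOOLS[name](arg.rstrip(']'))}")
    return None

answer = react("What is 17 * 23?")
```

The grounding step is the point: the model's next turn is conditioned on a real tool result, not on its own guess, which is how ReAct curbs hallucination.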
The Autonomous Agent Explosion (2023)
Toolformer: Machines Learn to Use Tools (February 2023)
In February 2023, Meta AI published Toolformer — a model that could teach itself which external tools (calculators, search engines, APIs) to call, when to call them, and how to incorporate results. This was the missing piece: language models that could not only reason but interact with the outside world.
AutoGPT: The Viral Proof of Concept (March 2023)
On March 30, 2023, game developer Toran Bruce Richards released AutoGPT — an open-source project that connected GPT-4 to a loop of planning, execution, and self-evaluation. AutoGPT could browse the web, write and execute code, manage files, and pursue multi-step goals with minimal human intervention.
The repository exploded. Within weeks, it had over 100,000 GitHub stars — one of the fastest-growing open-source projects in history.
AutoGPT was deeply flawed. It burned through API credits, got stuck in loops, and hallucinated confidently. But it proved something that academic papers could not: autonomous AI agents were not a research curiosity. They were a product category.
BabyAGI: The Minimalist Vision (April 2023)
Days after AutoGPT went viral, venture capitalist Yohei Nakajima released BabyAGI — a stripped-down Python script that demonstrated the core autonomous agent loop in just 140 lines of code. BabyAGI could create tasks, prioritize them, and execute them using GPT-4 and a vector database for memory.
If AutoGPT was the flashy demo, BabyAGI was the elegant proof that the agent pattern could be simple, composable, and practical.
LangChain: The Infrastructure Layer (2023)
Harrison Chase's LangChain emerged as the connective tissue of the agent ecosystem. What began as a library for chaining LLM calls evolved into a full orchestration framework with:
- Agent abstractions for tool use and planning
- Memory systems for maintaining conversation context
- Retrieval-augmented generation (RAG) for grounding responses in documents
- Integration with dozens of LLM providers and tools
LangChain's download numbers tell the story: 47+ million PyPI downloads and the largest community ecosystem in the agent space.
The Lilian Weng Blog Post (June 2023)
In June 2023, OpenAI researcher Lilian Weng published "LLM Powered Autonomous Agents" — a comprehensive blog post that became the definitive reference for how agent systems work. She formalized the architecture into four components:
- Planning — Task decomposition and self-reflection
- Memory — Short-term (context window) and long-term (vector databases)
- Tool use — APIs, code execution, web browsing
- Action — Executing plans in the real world
This framework became the blueprint that every subsequent agent platform would follow — including Taskade's AI Agents.
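Weng's four components map naturally onto a data structure. A structural sketch (the field names, toy planner, and `tool arg` step format are my assumptions, not her reference design):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    planner: Callable[[str], list[str]]                # Planning: decompose a goal
    memory: list[str] = field(default_factory=list)    # Memory: long-term notes
    tools: dict[str, Callable] = field(default_factory=dict)  # Tool use

    def act(self, goal: str) -> list[str]:
        """Action: run each planned step through a matching tool."""
        results = []
        for step in self.planner(goal):
            tool_name, _, arg = step.partition(" ")
            result = self.tools[tool_name](arg)
            self.memory.append(f"{step} -> {result}")  # record for later recall
            results.append(result)
        return results

agent = Agent(
    planner=lambda goal: [f"search {goal}", f"summarize {goal}"],
    tools={"search": lambda q: f"3 results for {q}",
           "summarize": lambda q: f"summary of {q}"},
)
outputs = agent.act("agent history")
```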
| Project | Launched | GitHub Stars | Key Innovation |
|---|---|---|---|
| AutoGPT | Mar 2023 | 100K+ | First viral autonomous agent |
| BabyAGI | Apr 2023 | 20K+ | Minimalist agent loop (140 lines) |
| LangChain | 2023 | 94K+ | Agent orchestration framework |
| MetaGPT | Mid 2023 | 48K+ | Multi-agent software company simulation |
| GPT-Engineer | Mid 2023 | 52K+ | Full codebase generation from prompts |

The Infrastructure Year (2024)
If 2023 was the year of viral demos, 2024 was the year the industry built real infrastructure.
GPT-4o and the Reasoning Revolution (2024)
OpenAI's GPT-4o launched in May 2024 — the first truly multimodal model handling text, audio, and vision in real-time. But the real paradigm shift came in September with o1-preview, OpenAI's first reasoning model that "thinks step by step" before answering.
This mattered enormously for agents: reasoning models could plan multi-step workflows, evaluate their own output, and course-correct — the exact capabilities that separate a useful agent from a hallucinating loop.
Devin: The First AI Software Engineer (March 2024)
On March 12, 2024, Cognition Labs announced Devin — marketed as "the world's first AI software engineer." Devin could plan and execute complex engineering tasks end-to-end, using a shell, code editor, and browser within a sandboxed environment.
Devin resolved 13.86% of real-world GitHub issues on the SWE-bench benchmark — far exceeding the previous state-of-the-art of 1.96%.
The reaction was polarizing. Some called it the beginning of the end for software engineering. Others pointed out that 13.86% was still failing 86% of the time. But Devin proved that autonomous coding agents were a real product category, not just an open-source experiment.
Anthropic's Model Context Protocol — MCP (November 2024)
In November 2024, Anthropic released the Model Context Protocol (MCP) — an open standard for connecting AI models to external tools and data sources. MCP defined how agents could securely interact with databases, APIs, file systems, and external services.
MCP was the USB-C of AI agents — a universal connector that made tools portable across platforms and reduced vendor lock-in. Its importance cannot be overstated: before MCP, every agent framework had its own proprietary tool integration. After MCP, tools became interoperable.
But adoption exposed a design problem. Jeremiah Lowin, creator of FastMCP and CEO of Prefect, observed that most early MCP servers simply mirrored CRUD operations — create_user, get_user, update_user, delete_user — which is "REST-brain" thinking. Lowin articulated the core principle that would define good MCP server design: design for outcomes, not operations. A single outcome-oriented tool (like check_order_status) can replace four or five CRUD tools, cutting token usage and reducing agent confusion. He also identified a critical performance threshold: agent quality degrades noticeably above approximately 50 tools, making curation essential. These design principles — flatten arguments, respect the token budget, curate ruthlessly, and treat errors as prompts for progressive disclosure — became the emerging best practices for the MCP ecosystem.
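The contrast is easy to see in code. A schematic sketch of the outcome-vs-operations principle (the order data, the CRUD-style tool names, and the return formats are hypothetical; a production server would register tools through an MCP SDK rather than plain functions):

```python
ORDERS = {"A-1001": {"status": "shipped", "carrier": "UPS", "eta": "2026-03-14"}}

# "REST-brain": the agent must chain several CRUD calls and join the results,
# spending tokens and risking confusion at every hop.
def get_order(order_id):             return ORDERS[order_id]
def get_shipment(order_id):          return {"carrier": ORDERS[order_id]["carrier"]}
def get_delivery_estimate(order_id): return {"eta": ORDERS[order_id]["eta"]}

# Outcome-oriented: one tool, flat arguments, answers the actual question.
def check_order_status(order_id: str) -> str:
    """Return a human-readable status the agent can relay directly."""
    o = ORDERS.get(order_id)
    if o is None:
        # Error text doubles as a prompt steering the agent's next step
        return f"No order {order_id!r} found; ask the user to confirm the ID."
    return f"Order {order_id} is {o['status']} via {o['carrier']}, ETA {o['eta']}."

summary = check_order_status("A-1001")
```

One tool replaces three, the arguments stay flat, and even the failure message tells the agent what to do next.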
By March 2026, MCP has been adopted by OpenAI, Google DeepMind, Microsoft, and dozens of other companies. It was donated to the Linux Foundation's Agentic AI Foundation in December 2025.
Karpathy's LLM OS Vision (2024)
Throughout 2024, Karpathy developed his vision of the LLM Operating System — the idea that LLMs are not chatbots but the kernel process of a new computing paradigm. He described the system:
"LLMs not as a chatbot, but the kernel process of a new Operating System. It orchestrates input and output across modalities (text, audio, vision), code interpreter ability to write and run programs, browser/internet access, and embeddings database for files and internal memory storage and retrieval."
This framing was prophetic. Every major agent platform in 2025-2026 — Taskade Genesis, Cursor, Claude Code, Devin — implements some version of the LLM OS architecture.
The Competitive Landscape Crystallizes
| Framework | Category | Launch | Key Innovation |
|---|---|---|---|
| LangGraph | Enterprise orchestration | 2024 | Graph-based stateful agent workflows |
| CrewAI | Business automation | 2024 | Role-based multi-agent systems |
| AutoGen (Microsoft) | Research | 2023-2024 | Asynchronous multi-agent conversations |
| OpenAI Function Calling | API | 2023-2024 | Native tool use in GPT models |
| Anthropic MCP | Standard | Nov 2024 | Universal agent-tool protocol |
| Devin (Cognition) | Autonomous coder | Mar 2024 | End-to-end software engineering |
The Vibe Coding Phenomenon (2025)
February 2, 2025: The Tweet That Changed Everything
On February 2, 2025, Andrej Karpathy posted a tweet that would become the most influential statement about software development since "move fast and break things":
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
He elaborated: "I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like 'decrease the padding on the sidebar by half' because I'm too lazy to find it. I 'Accept All' always, I don't read the diffs anymore."
The term went supernova. Within months:
- Collins Dictionary named "vibe coding" its 2025 Word of the Year
- The vibe coding market grew to $4.7 billion (projected $12.3B by 2027, 38% CAGR)
- 63% of vibe coding users were non-developers
- r/vibecoding grew to 153,000+ members
- 25% of Y Combinator startups built 95% of their codebases using AI
Vibe coding gave permission. It told millions of people — many of them non-developers — that they could build software by describing what they wanted. The AI handles the code. You handle the vision.
Karpathy's Software 3.0 Framework (June 2025)
At Y Combinator's AI Startup School on June 17, 2025, Karpathy delivered a keynote titled "Software Is Changing (Again)" that formalized his thinking into the Software 3.0 framework:
| Era | Paradigm | Programming Interface | Who Programs |
|---|---|---|---|
| Software 1.0 | Code | Explicit instructions (C, Python, Java) | Trained developers |
| Software 2.0 | Weights | Data + optimization (neural networks) | ML engineers |
| Software 3.0 | Prompts | Natural language (English) | Everyone |
The key insight: LLMs are a new kind of programmable entity, and the programming language is natural language itself. This was not an incremental change — it was "the most profound shift in software development since the 1940s."
Karpathy's prescription: build "Iron Man suits" that augment expert capabilities, with a highly efficient "AI Generation → Human Verification" loop.
The Explosion of Vibe Coding Platforms (2025)
The vibe coding concept spawned an entire category of AI-powered development platforms:
| Platform | Category | Key Metric | Approach |
|---|---|---|---|
| Cursor | AI code editor | $2B ARR in 24 months | Background Agents in VS Code |
| Replit | Cloud IDE | 30M+ users | Browser-based, instant deployment |
| Lovable | App builder | $100M ARR | No-code, prompt-to-app |
| Bolt.new | Web builder | Rapid growth | Instant web app generation |
| Taskade Genesis | AI workspace | 150K+ apps built | Agents + automations + workspace |
| Windsurf | Code editor | Acquired by OpenAI ($3B) | AI-first development |
| v0 | UI builder | Vercel ecosystem | React component generation |
The Problems Surface (2025)
As vibe coding scaled, its limitations became impossible to ignore:
- Quality degradation — AI-generated code that "worked" on first test broke in edge cases, under load, or after updates
- Maintenance nightmare — Code nobody understands is code nobody can maintain
- Tech debt acceleration — Zoho CEO Sridhar Vembu's critique landed: "Vibe coding just piles up tech debt faster"
- Security vulnerabilities — Code generated without review contained injection vulnerabilities, leaked credentials, and insecure defaults
- The 80% problem — AI agents reliably handle 80% of a task but struggle with the remaining 20% that determines production readiness
Google's Addy Osmani crystallized the 80% problem: agents produce impressive first drafts that fail at the edges. The gap between "demo-quality" and "production-quality" became the central challenge.
Karpathy's 2025 LLM Year in Review (December 2025)
On December 19, 2025, Karpathy published his annual review identifying six paradigm shifts:
- RLVR (Reinforcement Learning from Verifiable Rewards) — The new dominant training methodology replacing RLHF
- Ghosts vs. Animals — LLMs are "summoned ghosts, not evolved animals" — optimized under entirely different constraints than biological intelligence
- Cursor / New LLM App Layer — Revealed a distinct bundling and orchestration layer for LLM applications
- Claude Code / AI on Your Computer — First convincing demonstration of extended agentic problem-solving: "a little spirit/ghost that lives on your computer"
- Vibe Coding — Code became "free, ephemeral, malleable, discardable after single use"
- Nano Banana / LLM GUI — First hints of graphical interfaces for LLMs
His conclusion about coding agents: they had "crossed a qualitative threshold since December — from brittle demos to sustained, long-horizon task completion with coherence and tenacity."
He described delegating an entire local deployment — SSH keys, vLLM, model download, benchmarking, server endpoint, UI, systemd service, and report — with minimal intervention. The future was not typing code. It was orchestrating agents.
The Convergence on Harness Engineering (2026)
By early 2026, the emerging discipline of harness engineering began crystallizing from multiple independent sources. OpenAI published a blog post titled "Harness Engineering." Anthropic released a guide on building effective harnesses for long-running agents. Manus (the AI company later acquired by Meta) published their context engineering lessons after rebuilding their entire agent framework five times in six months.
The term "harness" describes everything wrapped around the model: what context it can see, what tools it has access to, how it recovers from failures, and how it maintains state across sessions. The evolution was clear: prompt engineering (optimize a single turn) gave way to context engineering (optimize a single session), which gave way to harness engineering (design systems that work across sessions, agents, and workflows).
The EPICS Agent benchmark — which tests AI on real professional tasks that take humans 1-2 hours — revealed why this matters. The best frontier model completed those tasks only 24% of the time, despite scoring above 90% on standard benchmarks; even after eight attempts, the success rate rose to only ~40%. The failures were not about model intelligence. The agents could reason through the problems fine. They failed at execution and orchestration — getting lost after too many steps, looping on failed approaches, losing track of the original objective.
Three of the most successful agent systems arrived at the same insight from completely different directions:
- OpenAI Codex: Layered architecture — an orchestrator plans, an executor handles tasks, and a recovery layer catches failures
- Claude Code: Minimal harness — just four tools (read, write, edit, bash) with extensibility via MCP and skills
- Manus: Reduce-offload-isolate — shrink context, use file system as memory, spin up sub-agents, bring back summaries
All three converged on the same conclusion: the harness matters more than the model. Richard Sutton's bitter lesson — that approaches scaling with compute always beat hand-engineered knowledge — applied directly: as models get smarter, harnesses should get simpler, not more complex.
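Claude Code's internals are not public, but the minimal four-tool pattern named above is easy to picture. A hypothetical sketch of the dispatch step (tool names from the list above; everything else, including the argument shapes, is illustrative):

```python
import subprocess
from pathlib import Path

def run_tool(name, args):
    """Execute one tool call and return the observation fed back to the model."""
    if name == "read":
        return Path(args["path"]).read_text()
    if name == "write":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "edit":
        # Replace the first occurrence of an exact string, like a targeted patch.
        p = Path(args["path"])
        p.write_text(p.read_text().replace(args["old"], args["new"], 1))
        return "ok"
    if name == "bash":
        done = subprocess.run(args["cmd"], shell=True, capture_output=True, text=True)
        return done.stdout + done.stderr
    raise ValueError(f"unknown tool: {name}")
```

The loop around it is equally small: send context to the model, receive a (tool, args) action, call `run_tool`, append the observation, repeat until the model declares the task done. Extensions like MCP and skills add entries to this dispatch rather than replacing it.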

The Agentic Engineering Era (2026)
February 8, 2026: Karpathy Declares Vibe Coding Passe
Exactly one year after coining vibe coding, Karpathy declared his own term obsolete:
"LLMs have gotten much smarter. Vibe coding is passe."
His replacement — agentic engineering — was deliberately chosen:
"Agentic, because the new default is that you are not writing the code directly 99% of the time. You are orchestrating agents who do and acting as oversight. Engineering, to emphasize that there is an art and science and expertise to it."
The key phrase: "orchestrating agents who do and acting as oversight." The human role shifted from code writer to system architect, agent director, and quality gatekeeper.
Why the Name Change Matters
This was not semantic wordplay. The shift from "vibe coding" to "agentic engineering" represented three critical changes:
| Dimension | Vibe Coding (2025) | Agentic Engineering (2026) |
|---|---|---|
| Philosophy | "Forget the code exists" | "Own the architecture, delegate the implementation" |
| Human role | Prompter | Architect + reviewer + orchestrator |
| Quality bar | "Does it seem to work?" | "Does it pass the test suite?" |
| AI role | Code generator | Autonomous agent with tools |
| Maintenance | "I'll prompt it again later" | Persistent memory + continuous testing |
| Professional legitimacy | Awkward in job descriptions | "Agentic Engineer" on your resume |
| Accountability | Unclear | Human owns the system |
Addy Osmani's Principles (February 2026)
Google engineering lead Addy Osmani published the most comprehensive framework for agentic engineering practice, which quickly became industry consensus:
1. Plan Before Prompting — Write a specification before touching an AI agent. Design docs, structured prompts, or task breakdowns — the spec is the highest-leverage artifact.
2. Direct with Precision — Give agents well-scoped tasks. The skill is decomposition: breaking a project into agent-sized work packages with clear inputs, outputs, and success criteria.
3. Review Rigorously — Evaluate AI output with the same rigor you would apply to a human engineer's PR. Do not assume the agent got it right because it looks right.
4. Test Relentlessly — "The single biggest differentiator between agentic engineering and vibe coding is testing." Test suites are deterministic validation for non-deterministic generation.
5. Own the System — Maintain documentation, use version control and CI, monitor production. The AI accelerates the work; you are responsible for the system.
The Factory Model: From Coder to Conductor
Osmani also published "The Factory Model," describing the generational evolution of AI coding tools:
| Generation | Model | Human Role | Example |
|---|---|---|---|
| 1st Gen | Accelerated autocomplete | Writer with suggestions | GitHub Copilot (early) |
| 2nd Gen | Synchronous agents | Director with real-time review | Cursor, Claude Code |
| 3rd Gen | Autonomous agents | Architect with checkpoint review | Background Agents, Devin 2.0 |
The critical insight: "You are no longer just writing code. You are building the factory that builds your software."
And the data backed it up:
- New website creation: +40% year-over-year
- New iOS apps: +49% increase
- GitHub code pushes in US: +35% jump
These metrics had been flat for years. Agentic engineering was not just changing how software was built — it was changing how much software existed.
The Four Species of AI Agents (2026)
As agentic engineering matured, a critical realization emerged: saying "agents" was too vague. Not all agents are the same — and using the wrong species for the wrong work is one of the most common and costly mistakes in production AI systems.
All four species share the same primitive — LLM + tools + feedback loop. What differs is the construction of that loop: the context, scope, human involvement, and optimization target. Getting this taxonomy right is fundamental to practicing agentic engineering effectively.
| Species | Scale | Human Role | Quality Gate | When to Use |
|---|---|---|---|---|
| Coding Harness | Individual task | Manager — decomposes, delegates, reviews | Human judgment | Your judgment is the gold standard |
| Project Harness | Team / project | Architect — involved at beginning and end | Planner agent + human review | 8-20 developers' worth of complexity |
| Dark Factory | Fully autonomous pipeline | Spec writer + evaluator | Automated eval + optional human review | You trust the evals, want to minimize bottlenecks |
| Auto Research | Metric optimization | Goal setter + result reviewer | Metric improvement | You have a measurable rate to optimize |
Coding Harnesses are the simplest pattern — an agent taking the place of a developer in an engineering process. Claude Code, Codex, and the Peter Steinberger model all operate here. The critical skill is decomposition: breaking a big problem into well-defined chunks, each given to a single-threaded agent. Karpathy runs his agents 16 hours a day; Steinberger manages multiple agents simultaneously across 10+ repository checkouts.
Project Harnesses extend the pattern to team-scale work. Cursor proved this across browsers and compilers — millions of lines of code — using a planner agent that manages tasks, keeps notes, tracks memory, and evaluates executor work. Short-running "grunt" agents are spun up for exactly one problem, then disposed. The critical lesson: Cursor tried three levels of management hierarchy, and it failed. Simple scales well with agents.
Dark Factories remove humans from the middle entirely. Spec goes in, software comes out. Humans are heavily involved at the top (design, requirements, excellent specifications) and at the end (verifying evals, code review for accountability), but the system runs autonomously in between. The name comes from Chinese automated factories where the lights are off — robots work end-to-end. Amazon learned the risks the hard way when AI-generated incidents from junior engineers triggered a company-wide review by senior and principal engineers.
Auto Research is a different species entirely — descended from classical machine learning, not software engineering. The agent climbs a hill by relentlessly running experiments to optimize a specific metric. Shopify CEO Tobi Lutke used it to make a 20-year-old codebase 53% faster overnight. Karpathy's autoresearch ran 700 experiments in two days. The critical distinction: is your problem software-shaped or metric-shaped? If you have a rate to optimize, use auto research. If you need working software, use a harness.
A fifth pattern — Orchestration — routes work across agents with genuinely specialized roles (researcher → writer → editor, or ticket pickup → research → resolution). Frameworks like LangGraph and CrewAI serve this pattern. The coordination overhead only pays off at scale — 10,000+ items, not 100.
How to Pick the Right Species: The Decision Flowchart
The most common mistake teams make is using the wrong agent species for the wrong kind of work. Use this decision tree:
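The branching logic implied by the "When to Use" column of the species table reduces to a few questions. A minimal sketch (the function and argument names are illustrative, not from the source):

```python
def pick_agent_species(metric_shaped, scale, evals_trusted):
    """Map the 'When to Use' criteria onto a decision tree.

    metric_shaped: do you have a single measurable rate to optimize?
    scale:         "individual" task, or "team"-sized complexity?
    evals_trusted: would you ship on automated evals alone?
    """
    if metric_shaped:
        return "auto research"      # you have a rate to optimize
    if scale == "individual":
        return "coding harness"     # your judgment is the gold standard
    if evals_trusted:
        return "dark factory"       # spec in, software out, eval-gated
    return "project harness"        # planner agent + human review
```

If none of the branches fit cleanly, that is usually a sign the work has not been decomposed far enough to hand to any agent yet.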
Real-World Agent Species in Production (2026)
| Company | Agent Species | What They Built | Result |
|---|---|---|---|
| Shopify (Tobi Lutke) | Auto Research | Optimized 20-year-old Liquid framework | 53% faster runtime overnight |
| Anthropic (Boris Cherny) | Coding Harness | Claude Code — multiple parallel instances | 70% productivity gain per engineer |
| Cursor | Project Harness | Browser + compiler via planner-executor | Millions of lines, shipped to production |
| OpenAI (Sherwin Wu) | Coding Harness | 95% of engineers on Codex daily | 70% more PRs from agentic-leaning engineers |
| Monday.com (Eran Zinman) | Dark Factory | Replaced 100-person SDR team with AI agents | Response time: 24h → 3 minutes |
| Stripe | Coding Harness | Agent-authored PRs at scale | 1,000+ PRs/week merged from agents |
| Karpathy | Auto Research | Autoresearch for GPT-2 optimization | 700 experiments in 2 days, 11% speed gain |
Where Taskade Genesis Fits in the Taxonomy
Taskade Genesis operates as a runtime dark factory — but with a critical difference from code-generating dark factories. Traditional dark factories produce code that still needs deployment, hosting, and maintenance. Genesis produces deployed, living applications with AI agents, automations, and a database built in.
The Workspace DNA architecture maps directly to the species taxonomy: Memory (projects and databases) provides the context that all four species need. Intelligence (AI agents with 22+ built-in tools) provides the execution engine. Execution (automations with 100+ integrations) provides the reliable workflow layer. When a user prompts Genesis, the system acts as an integrated dark factory where the "spec" is the prompt, the "eval" is the live application, and the "human review" is the builder iterating in real time.
For teams that want to experience agentic engineering without building their own agent infrastructure — no harness configuration, no prompt engineering, no deployment pipeline — Genesis is the fastest path from intent to deployed system. Over 150,000 apps built and counting.
The Anti-Patterns: What Goes Wrong
The anti-patterns are just as important as the patterns:
| Anti-Pattern | Why It Fails | What to Do Instead |
|---|---|---|
| Using auto research to build software | Auto research optimizes a metric; it does not produce working software | Use a coding harness or dark factory |
| Calling individual assistants a "dark factory" | Going to make coffee for 20 min ≠ autonomous pipeline | Be honest about human involvement level |
| Adding complexity to agent architectures | Cursor tried 3 management levels — failed. Manus rebuilt 5 times, got simpler | Keep the harness simple; complexity kills agents |
| Skipping decomposition for individual harnesses | Decomposition is the skill | Break problems into agent-sized chunks first |
| Using orchestration at low scale | Coordination overhead exceeds value under 1,000 items | Use a simple coding harness instead |
The deeper lesson: the art of building good agents is often the art of finding different simple configurations that enable the agent to do the particular work you have in front of you. Frame your work around making it easy for the agent — not around keeping the human at the center of everything.
OpenAI's Internal Evidence
The shift from coder to conductor is not theoretical — it is already the default at the companies building the models themselves. Sherwin Wu, head of engineering for OpenAI's API and developer platform, shared that 95% of OpenAI engineers use Codex daily and 100% of PRs are reviewed by Codex. Engineers who lean into agentic tools open 70% more PRs than those who do not, and the gap is widening.
"Engineers are becoming tech leads. They're managing fleets and fleets of agents. It literally feels like we're wizards casting all these spells. And these spells are kind of like going out and doing things for you." — Sherwin Wu, OpenAI
Wu described engineers running 10 to 20 parallel Codex threads simultaneously — not actively coding, but steering agents, checking output, and providing feedback. One internal team is maintaining a 100% Codex-written codebase with no human escape hatch, forcing them to solve the exact context and documentation problems that agentic engineering principles address.
The biggest lesson from that experiment: when agents fail, the problem is almost always context — underspecified instructions or missing tribal knowledge. The fix is encoding that knowledge into the codebase via documentation, .md files, and structured code comments — exactly the kind of specification-first discipline that Osmani's five principles demand.
The pace of reinvention required is staggering. As Harry Stebbings observed on the 20VC podcast (2026): "The prize for winning is to reinvent the company from scratch and the product from scratch every 6 to 9 months." Companies that treat agentic engineering as a one-time adoption rather than a continuous discipline will fall behind.
Inside Claude Code: Building for the Model 6 Months From Now
Boris Cherny, the creator of Claude Code and a former Meta principal engineer, revealed a design philosophy that captures the essence of agentic engineering. In a 2025 interview, Cherny described the principle that guides Claude Code's development:
"Don't build for the model of today. Build for the model 6 months from now."
The product should get better as models improve — without changing any code. This is the opposite of traditional software engineering, where features are hand-built for current capabilities. Claude Code's architecture is designed so that smarter models automatically unlock better agentic workflows.
Cherny also described how agentic engineering has already transformed Anthropic internally: even though the company tripled in size, productivity per engineer grew ~70% because of Claude Code. Engineers run multiple Claude Code instances in parallel, let them work for hours, and return to completed PRs. Cherny gives agents tools like Puppeteer so they can see UI and self-correct — exactly the kind of feedback loop that distinguishes agentic engineering from passive code generation.
The hiring philosophy at Anthropic reinforces the shift. Cherny's Claude Code team recruits generalists — engineers who code, do product work, design, and talk to users:
"Our product managers code, our data scientists code, our user researchers code a little bit. I just love these generalists."
This is Osmani's "coder to conductor" transition made concrete. When the AI handles most implementation, the engineer who can think across product, design, and infrastructure becomes the highest-leverage contributor. Cherny's career arc — from building Undux (React state management) and writing the TypeScript book, to directing AI agents at Anthropic — is itself the agentic engineering story in miniature.
One more principle from Cherny crystallizes the discipline: latent demand, which he calls the most important principle in product development. At Meta, 40% of Facebook Group posts were buy/sell activity. Users were already doing commerce; Marketplace just formalized it. The same pattern drives agentic engineering adoption: developers were already copy-pasting code from ChatGPT into their editors. Claude Code just formalized the workflow.
"You can never get people to do something they do not yet do. Find the intent they have and steer it." — Boris Cherny
The Standards War (Late 2025 – 2026)
The Agentic AI Foundation — AAIF (December 2025)
On December 9, 2025, the Linux Foundation announced the formation of the Agentic AI Foundation (AAIF) — the first neutral governance body for AI agent standards.
Founding contributions:
- Anthropic → Model Context Protocol (MCP)
- Block → goose (open-source local-first agent framework)
- OpenAI → AGENTS.md (project-specific guidance standard)
Platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.
This was unprecedented. The companies building the most advanced AI systems — companies that compete fiercely on model quality — agreed to collaborate on the standards that connect those models to the real world.
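Of the three founding contributions, AGENTS.md is the simplest to picture: a plain Markdown file at the repository root that agents read for project-specific guidance. A minimal illustrative example (the section names below are common conventions, not a required schema):

```markdown
# AGENTS.md

## Setup
- Install dependencies: `npm install` (Node 20+).

## Testing
- Run `npm test` before every commit; the full suite must pass.

## Conventions
- TypeScript strict mode; avoid `any`.
- Commit messages follow Conventional Commits.
```

The file encodes exactly the "tribal knowledge" that, as teams keep rediscovering, is the most common cause of agent failure when left implicit.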
Google's Agent2Agent Protocol — A2A (2025)
Google launched the Agent2Agent (A2A) protocol in April 2025 with support from over 50 partners including Salesforce, SAP, and ServiceNow. While MCP standardizes how agents connect to tools, A2A standardizes how agents communicate with each other.
The emerging stack:
| Layer | Standard | Purpose | Governed By |
|---|---|---|---|
| Agent-to-Tool | MCP | Connect agents to external tools and data | AAIF (Linux Foundation) |
| Agent-to-Agent | A2A | Inter-agent communication and coordination | Linux Foundation |
| Agent-to-Project | AGENTS.md | Project-specific agent configuration | AAIF |
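To make the agent-to-tool layer concrete: MCP messages are JSON-RPC 2.0. A client discovers a server's tools with a `tools/list` request and gets back tool descriptors it can later invoke via `tools/call`. A sketch of the exchange (the weather tool is illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "get_forecast",
        "description": "Fetch the weather forecast for a city",
        "inputSchema": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    ]
  }
}
```

A2A sits one level up, standardizing how whole agents advertise capabilities and exchange tasks with each other rather than how one agent reaches its tools.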
The Enterprise Adoption Wave
Gartner and McKinsey data paint a clear picture of where the industry is heading:
| Metric | Value | Source |
|---|---|---|
| Enterprise apps with AI agents by end of 2026 | 40% (up from <5% in 2025) | Gartner |
| Enterprise software with agentic AI by 2028 | 33% | Gartner |
| Agentic AI annual value potential | $2.6T–$4.4T | McKinsey |
| Median ROI for mature implementations | 540% | McKinsey |
| Organizations investing in agentic AI | 61% (19% significant, 42% conservative) | Gartner |
| Agentic AI projects canceled by end of 2027 | >40% | Gartner |
| Day-to-day decisions made by agentic AI by 2028 | 15% (up from 0% in 2024) | Gartner |
One statistic in that table is sobering: Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027. Agentic engineering is not magic. Without the discipline Karpathy and Osmani describe, agent projects fail.
Karpathy's Autoresearch: Agentic Engineering in Action (March 2026)
On March 7, 2026, Karpathy open-sourced autoresearch — a 630-line Python tool that lets AI agents run autonomous ML experiments on a single GPU. It was not just a tool release. It was a live demonstration of every agentic engineering principle.
How It Works
Autoresearch gives an AI agent a small but real LLM training setup and lets it experiment overnight:
- Agent reads human-provided instructions (the spec)
- Agent modifies training code — architecture, optimizers, hyperparameters
- Training runs for exactly 5 minutes per experiment
- Agent evaluates results against an unambiguous metric: validation bits-per-byte (lower is better)
- Agent keeps or discards the change
- Repeat — approximately 12 experiments per hour, ~100 experiments overnight
```
AUTORESEARCH: AGENTIC ENGINEERING IN PRACTICE
═════════════════════════════════════════════

HUMAN (Agentic Engineer)            AI AGENT
┌──────────────────────┐        ┌──────────────────────┐
│ 1. Write spec        │───────►│ 2. Read instructions │
│ 2. Set metric        │        │ 3. Modify code       │
│ 3. Review results    │◄───────│ 4. Train (5 min)     │
│ 4. Adjust direction  │        │ 5. Evaluate metric   │
│                      │        │ 6. Keep or discard   │
│                      │        │ 7. Repeat x100       │
└──────────────────────┘        └──────────────────────┘

Principles demonstrated:
✓ Plan before prompting (human writes spec)
✓ Direct with precision (5-min time budget, single metric)
✓ Test relentlessly (every experiment evaluated)
✓ Own the system (human reviews final results)
```
The Three-File Architecture: Why Autoresearch Works
Autoresearch's elegance lies in a strict three-file constraint that prevents the agent from gaming its own evaluation:
- program.md — The human-written instruction file. Defines the goal, constraints, and rules the agent must follow. This is the most important file — the human setting the objective. Karpathy optimized his own program.md extensively before letting the agent run.
- train.py — The one file the agent can modify. This could be training code, a configuration, a prompt template, a marketing script — literally anything you want optimized. The constraint is crucial: one file, not two, not zero.
- prepare.py — The evaluation script the agent cannot touch. This defines what "better" means. Without this restriction, the agent could rewrite the scoring function to fake its results. The metric must be unambiguous and automatically computable.
The fixed 5-minute time budget per experiment is equally critical. By giving every experiment the same compute budget, the system ensures fair comparison — only the raw quality of the idea wins, not how long the agent trains. As Karpathy explains: if you give one applicant seven days and another seven minutes, the results are meaningless. Equal time makes every experiment directly comparable.
TL;DR: One file to change, one metric to chase, one time budget per experiment. If you can score it, you can auto-research it.
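The keep-or-discard loop is simple enough to sketch end to end. The version below simulates it: a toy objective stands in for "run train.py for 5 minutes and score it with prepare.py," and a random perturbation stands in for the agent's code edit. Everything here is illustrative, not Karpathy's implementation:

```python
import random

def evaluate(params):
    # Toy stand-in for a 5-minute training run scored by prepare.py:
    # a smooth bowl whose minimum plays the role of the best configuration.
    return (params["lr"] - 0.003) ** 2 * 1e5 + (params["depth"] - 8) ** 2 * 0.01 + 1.0

def propose(params, rng):
    # Stand-in for the agent editing train.py: perturb one setting at a time.
    new = dict(params)
    if rng.random() < 0.5:
        new["lr"] *= rng.choice([0.5, 0.8, 1.25, 2.0])
    else:
        new["depth"] = max(1, new["depth"] + rng.choice([-2, -1, 1, 2]))
    return new

def autoresearch_loop(n_experiments=100, seed=0):
    rng = random.Random(seed)
    best = {"lr": 0.01, "depth": 4}      # the baseline "human config"
    best_score = evaluate(best)          # lower is better, like val_bpp
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score < best_score:           # the keep-or-discard gate
            best, best_score = candidate, score
    return best, best_score
```

The real tool swaps the toy objective for an actual training run and the random perturbation for an LLM reading program.md and editing train.py; the gate logic stays the same.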
What the Agent Actually Found
The results were remarkable — not just for the improvements, but for what they revealed about agent-driven research:
- 56% improvement in validation bits-per-byte (val_bpp) on the Tiny Stories dataset — a metric where even single-digit gains are considered significant in language modeling
- The agent discovered and fixed bugs in the training code that humans had missed for years — subtle issues in data loading and gradient accumulation that only surfaced through systematic experimentation
- Running on a single consumer GPU, the agent matched or exceeded results that would typically require a researcher spending days or weeks of manual hyperparameter tuning
- The agent's experimentation log showed it developing what Karpathy called "intuition by brute force" — trying architectural modifications, learning rate schedules, and tokenization changes that a human researcher might dismiss but that yielded measurable gains
```
AUTORESEARCH RESULTS (TINY STORIES DATASET)
═══════════════════════════════════════════

Metric: val_bpp (validation bits-per-byte — lower = better)

Baseline (human config)   ████████████████████████████  1.00x
After 50 experiments      ██████████████████████        0.78x
After 100 experiments     ████████████████              0.62x
Final (agent-optimized)   ████████████                  0.44x  ← 56% improvement

Key findings by the agent:
✓ Fixed data loading bug (missed by humans for years)
✓ Discovered non-obvious learning rate schedule
✓ Identified optimal tokenization strategy
✓ Found architecture modifications humans wouldn't try
```
The most telling metric was not the improvement itself but the prediction accuracy: on a held-out set of Tiny Stories completions, the agent-optimized model's predictions were nearly indistinguishable from human predictions — approaching the theoretical floor of what is predictable given the randomness inherent in language.
Real-World Impact
Following the release, Shopify CEO Tobi Lutke adapted the autoresearch framework internally. An agent-optimized smaller model achieved a 19% improvement in validation scores, eventually outperforming a larger model configured through standard manual methods.
This was agentic engineering working exactly as Karpathy described: human sets the goal, agent executes autonomously, results are objectively measurable, and the human reviews and adjusts direction. The autoresearch experiment proved something deeper: agents are not just automating existing research workflows — they are finding things humans miss, because they test hypotheses a human would dismiss as unlikely and they never get tired of systematic iteration.
Beyond Text: Autoresearch for Music Generation
The autoresearch framework proved its generality when developers applied it beyond text to ABC notation sheet music — training a model on the Sanderwoods Irishman dataset of traditional Irish folk music. The results demonstrated that autoresearch's power extends to any domain with a measurable objective:
- Baseline: val BPB of 2.08 (model essentially lost, producing garbled notation that sounded like "a child running on a piano")
- After 18 experiments: val BPB dropped to 0.97 — a 53% improvement
- The optimized model produced coherent melodies with proper chord progressions, bar structure, and musical rhythm
- Key insight: the optimal strategy for small, structured, low-entropy datasets was making the model smaller and faster to see the data more times within the 5-minute budget, rather than building a larger model that barely completes one pass
The winning configuration: aspect ratio of 32, head dimension of 64, batch size of 2^14, depth of 8, and 5% warm-up — discovered entirely by the agent through systematic experimentation. The biggest single win came from reducing batch size (4x more optimizer steps), not from increasing model capacity. This counterintuitive finding — that for structured data, throughput beats capacity — is exactly the kind of insight agents find because they test hypotheses humans dismiss.
Autoresearch as a Work Primitive
The deeper significance of autoresearch extends beyond ML research. The concept of an iterative agentic loop — define goal, execute experiment, measure result, keep or discard, repeat — is emerging as a new fundamental work primitive.
Work primitives are basic building blocks so fundamental they show up everywhere across roles and industries. New ones don't appear often. The last major primitive was arguably the spreadsheet (1979). Autoresearch demonstrates that agentic loops may be the next one:
- A/B testing for marketing — agent writes landing page variants, sends traffic, measures conversions, keeps winners, iterates indefinitely
- Niche optimization agents — Amazon listing experimenter, email sequence tuner for realtors, SaaS pricing optimizer — each a packaged autoresearch loop tuned for one painful niche
- Trading signal generation — agent runs backtests of simple trading rules overnight, keeps promising strategies
- CRM lead qualification — agent tests scoring rules and follow-up messages against conversion data, surfaces only high-value leads
- Internal productivity labs — define KPIs (response time, close rate, ticket resolution), let agents iterate on workflows, templates, and routing rules
The scale of this shift is staggering. As marketing strategist Eric Siu observed: "Most marketing teams run 30 experiments per year. The next generation will run 36,000 — roughly 100 per day." Each experiment follows the same autoresearch pattern: the agent modifies the copy, measures conversions, and decides whether to keep or discard.
Practical autoresearch use cases emerging in 2026:
- Website performance optimization — Agent tweaks CSS, JavaScript, and asset loading; measures page load time via Puppeteer benchmarks; keeps improvements, reverts regressions. In one demo, a portfolio site went from 50ms to 25ms load time — a 50% improvement — in under 4 minutes of autonomous iteration
- Trading strategy refinement — Agent adjusts buy/sell rules and risk parameters across years of historical market data, scoring each experiment by its Sharpe ratio (risk-adjusted returns). Hundreds of strategies tested overnight while the trader sleeps
- Prompt engineering at scale — Agent fine-tunes system instructions behind AI agents, testing different phrasing, tone levels (beginner, PhD-level), even different languages to find which prompt configuration produces the best task completion rate
- Open-source model compression — Developers point autoresearch at open-source LLMs to find configurations that run faster on consumer hardware. The prediction: Sonnet-quality models running on iPhones within months, discovered entirely through agent-driven experimentation
- Email and ad creative testing — Agent generates subject lines, body copy, and CTA variants; sends to test segments; measures open rates and click-through; iterates 100x faster than any human marketing team
The key to successful autoresearch in business contexts is the metric hierarchy — a three-tier scoring system that prevents agents from gaming shallow metrics:
| Tier | Role | Example (Email Marketing) | Example (Landing Page) |
|---|---|---|---|
| Primary | The metric you optimize for | Reply rate | Conversion rate |
| Secondary | Supporting metrics that validate quality | Open rate, click-through rate | Time on page, scroll depth |
| Guardrail | Hard limits the agent cannot violate | Unsubscribe rate < 2%, spam rate < 0.1% | Bounce rate < 40%, load time < 3s |
Without guardrail metrics, agents find shortcuts — subject lines that maximize opens but tank conversions, or landing pages that convert but load so slowly they lose 60% of visitors. The metric hierarchy is intent engineering applied to autoresearch: primary metrics define what you want, guardrail metrics define what you will not sacrifice to get it.
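The hierarchy is mechanical to enforce: guardrails are hard constraints checked before the primary metric is even read. A minimal sketch using the email-marketing limits from the table (metric names and values are illustrative):

```python
def score_variant(metrics, guardrails):
    """Return the primary-metric score, or None if any guardrail is violated."""
    for name, cap in guardrails.items():
        if metrics[name] >= cap:   # guardrails here are "must stay below" caps
            return None            # disqualified: not tradable for primary gains
    return metrics["reply_rate"]   # the one number the agent optimizes

guardrails = {"unsubscribe_rate": 0.02, "spam_rate": 0.001}

safe   = {"reply_rate": 0.12, "unsubscribe_rate": 0.010, "spam_rate": 0.0004}
spammy = {"reply_rate": 0.30, "unsubscribe_rate": 0.050, "spam_rate": 0.0004}
```

Here the spammy variant scores `None` despite a much higher reply rate: the agent never gets credit for a win that burns the list.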
A practical 4-week autoresearch implementation roadmap:
| Week | Focus | Deliverable |
|---|---|---|
| Week 1 | Define metric + baseline | Primary metric chosen, guardrail limits set, baseline measured across 7 days |
| Week 2 | Build the loop | Agent configured with one editable variable, evaluation automated, first 50 experiments run |
| Week 3 | Analyze + refine | Review winning experiments, adjust metric hierarchy if guardrails triggered, expand to secondary variables |
| Week 4 | Scale + systematize | Move from single variable to multi-variable optimization, document learnings, share pattern with team |
In Taskade Genesis, this maps to: Week 1 — create a project with your baseline data. Week 2 — train an AI agent to modify and test one variable. Week 3 — set up automations to run the loop on schedule. Week 4 — expand the workspace with additional agents for multi-variable experiments.
Stripe CEO Patrick Collison and Shopify CEO Tobi Lutke have both publicly endorsed the pattern — recognizing that autoresearch is not limited to ML but applies to any measurable business process.
Shopify CEO Tobi Lutke captured this shift: "Auto research works even better for optimizing any piece of software. Make an auto folder. Add a program.md and a bench script. Make a branch and let it rip."
The pattern is the same everywhere: human defines the objective and evaluation metric, agent executes the search autonomously, results are measured against ground truth. The only things that change are the domain, the search space, and the metric. This is agentic engineering distilled to its essence.
The Three Conditions (And Where Autoresearch Fails)
Autoresearch works when three conditions are met simultaneously:
- A clear metric — One number with a clear direction (lower latency, higher conversion rate, better Sharpe ratio). Not a committee vote, not a feeling, not "does this look good?"
- An automated evaluation — No human in the loop during the experiment cycle. If you need a human to judge each result, the loop runs at human speed and loses its power. The evaluation must be scriptable.
- One file the agent can change — A single, bounded search space. Multiple files create combinatorial explosion that agents handle poorly.
Remove any one condition and the loop breaks:
| Missing Condition | What Happens |
|---|---|
| No clear metric | Agent optimizes in a random direction with high confidence |
| Human in the loop | Loop slows to human speed; no longer runs while you sleep |
| Multiple files to edit | Combinatorial search space; agent makes conflicting changes |
Where autoresearch fails: Brand design, UX feel, pricing strategy (for low-traffic sites), editorial voice — anything where "better" is subjective. If the success criterion is a judgment call or a feeling, the agent cannot tell what is working. It will optimize confidently in the wrong direction.
The key insight: if you give it a bad metric, it will very confidently optimize the wrong thing. Choosing the right metric is the human skill that makes autoresearch valuable — and the skill that will separate practitioners from amateurs in the agentic engineering era.
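Taken together, the three conditions make a usable preflight checklist. A small hypothetical helper (names are illustrative):

```python
def autoresearch_preflight(metric, eval_command, editable_files):
    """Check the three conditions; return a list of blockers (empty means go)."""
    blockers = []
    if not metric:
        blockers.append("no clear metric: the agent will optimize a random direction")
    if not eval_command:
        blockers.append("no scripted eval: the loop will run at human speed")
    if len(editable_files) != 1:
        blockers.append("search space must be exactly one editable file")
    return blockers
```

Any non-empty result means the loop will break in one of the three ways described above before it ever produces a useful experiment.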
AgentHub: GitHub for Agents
Following autoresearch, Karpathy launched AgentHub — an agent-first collaboration platform described as "GitHub for agents." Where GitHub organizes human collaboration around branches, PRs, and merges, AgentHub strips all of that away:
- No main branch — a sprawling DAG of commits in every direction
- No PRs or merges — agents commit directly
- Message board — agents coordinate via a shared message board rather than code review
- First use case: autoresearch, but designed to be far more general
AgentHub represents a vision where agent swarms work on the same codebase simultaneously, each exploring different directions. Autoresearch was the proving ground, with multiple agents running parallel experiments on the same training code, but the architecture supports any collaborative agent workflow.
As Karpathy wrote: "Think of it like a stripped-down GitHub where there's no main branch, no PRs, no merges — a sprawling DAG of commits in every direction with a message board for agents to coordinate." The repo already has 25,000+ GitHub stars.
Karpathy's SETI@Home Vision for AI Research
Karpathy's end vision for autoresearch reaches far beyond individual experiments. In the early 2000s, the SETI@Home project let anyone donate spare computer power to search for extraterrestrial intelligence. Karpathy envisions the same model for AI research: millions of AI agents distributed across thousands of computers, with humans allocating where that research effort goes.
This is not speculative — it is the logical extension of autoresearch + AgentHub. If one agent running 100 experiments overnight can achieve a 56% improvement on a training benchmark, what happens when thousands of agents run millions of experiments across distributed infrastructure? The answer is recursive self-improvement at civilization scale — and Karpathy believes we may already be in its early stages.
"We might be in the early stages of the singularity."
Every frontier AI lab — OpenAI, Anthropic, Google DeepMind — is investing tens of millions in researchers doing essentially this same work manually. Karpathy made the pattern open-source and accessible to anyone with a GPU and a clear metric.
Karpathy's Claws: The Layer Above Agents
In his March 2026 interview on the No Priors podcast, Karpathy described a new abstraction layer above agents called claws — persistent autonomous entities with their own sandboxes, looping independently, with sophisticated memory systems:
"Really, when I say a claw, I mean this layer that takes persistence to a whole new level. It's not something that you are interactively in the middle of. It kind of has its own little sandbox, does stuff on your behalf even if you're not looking."
His personal claw, Dobby the House Elf, controls his entire home. The discovery was startling in its simplicity — he told an agent "I think I have Sonos at home. Can you try to find it?" The agent did an IP scan of the local network, found the Sonos system (which had no password protection), reverse-engineered the APIs, and played music. Three prompts from discovery to playback.
"I can't believe I just typed in 'can you find my Sonos?' And suddenly it's playing music."
Dobby now controls lights, HVAC, shades, the pool and spa, and a security camera system where a Qwen vision model watches camera feeds via change detection and sends WhatsApp alerts — "Hey, a FedEx truck just pulled up." Six separate smart home apps replaced by one natural language interface.
"I used to use like six apps, completely different apps and I don't have to use these apps anymore. Dobby controls everything in natural language. It's amazing."
The implications extend beyond home automation. Karpathy sees claws as the consumer-ready layer of AI — where agents are still semi-finished primitives requiring interactive guidance, claws are autonomous entities that maintain state, make decisions, and execute without human intervention. The hierarchy is clear: LLMs (raw token generators) → Agents (semi-finished) → Claws (consumer-ready, deployable).
For builders on Taskade Genesis, the claw pattern maps directly to Workspace DNA: Memory provides the persistent state, AI Agents provide the intelligence, and Automations provide the autonomous execution loop. A Genesis app with workspace memory, trained agents, and triggered automations is functionally a claw — a system that acts on your behalf without requiring your presence.
The Multi-Agent Reality: Token Throughput as the New Metric
The interview revealed how top practitioners actually work with agents in 2026. Karpathy described the Peter Steinberger model — multiple Codex agents displayed on a monitor wall, each running ~20-minute tasks across 10+ repository checkouts simultaneously:
"It's not just like here's a line of code, here's a new function. It's like here's a new functionality and delegate it to agent one. Here's a new functionality that's not going to interfere with the other one. Give it to two."
The developer's role becomes orchestration at the macro action level — research agent, code agent, planning agent, all running in parallel. The metric that matters is no longer lines of code or features shipped. It is token throughput:
"What is your token throughput and what token throughput do you command? I feel nervous when I have subscription left over — that just means I haven't maximized my token throughput."
Karpathy compared this to his PhD days when idle GPUs felt like wasted potential. The resource anxiety shifted from FLOPs to tokens. And when capability outstrips what any individual can direct, the diagnosis is always the same:
"It all kind of feels like skill issue when it doesn't work. It's not that the capability is not there. It's that you just haven't found a way to string it together of what's available."
This framing — that agent limitations are configuration problems, not capability problems — has profound implications for agentic engineering. The agents.md file, the memory system, the parallelization strategy — these are the new engineering skills. Karpathy's progression rule maps the path: single session → multiple agents → agent teams → claws → optimization over claws.
Karpathy's autoresearch is part of a broader wave of autonomous AI research systems: Google DeepMind's FunSearch (2023) discovered new mathematical constructions by having LLMs write and evaluate programs. Weco AI's AIDE automates ML engineering pipelines end-to-end. Sakana AI's The AI Scientist generates research hypotheses, runs experiments, and writes papers. What unites them all is the agentic engineering pattern: human defines the objective and evaluation metric, agent executes the search, results are measured against ground truth.
Evolutionary Agents: From Stepping Stones to Scientific Discovery
Agentic engineering is not limited to coding and deployment. In March 2026, Sakana AI published Shinka Evolve — a system that uses frontier LLMs as mutation operators inside an evolutionary algorithm to discover new solutions to open mathematical and scientific problems.
The architecture mirrors agentic engineering principles. A population of programs is maintained in a database. Parent programs are sampled, paired with "inspiration" programs, and handed to an LLM that proposes mutations — diffs, full rewrites, or crossovers between two parents. Each mutated program is evaluated against a fitness function, and successful innovations propagate through the tree.
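A toy version of this population loop can make the mechanics concrete. `llm_mutate` and `fitness` are hypothetical callables standing in for the LLM mutation operator and the evaluation harness; this is a sketch of the pattern, not Sakana AI's actual code.

```python
import random

def evolve(seed_program, llm_mutate, fitness, generations=50, pop_cap=20):
    """Toy evolutionary loop in the Shinka Evolve style: sample a parent and
    an 'inspiration' program, ask an LLM for a mutation, score the child, and
    keep the best programs in the population database."""
    population = [(fitness(seed_program), seed_program)]
    for _ in range(generations):
        parent = random.choice(population)[1]          # sample a parent program
        inspiration = random.choice(population)[1]     # pair with an inspiration
        child = llm_mutate(parent, inspiration)        # LLM proposes diff/rewrite/crossover
        population.append((fitness(child), child))     # evaluate against the fitness function
        population.sort(key=lambda p: p[0], reverse=True)
        del population[pop_cap:]                       # successful innovations propagate
    return population[0]                               # best (score, program) pair
```

The human contribution is the same as in autoresearch: choosing `fitness`. The evolutionary machinery only amplifies whatever that metric rewards.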
Three innovations made Shinka Evolve remarkably sample-efficient, matching or exceeding Google DeepMind's AlphaEvolve results in under 200 program evaluations:
Multi-model ensembling with bandit selection — Instead of using a single frontier model, Shinka Evolve ensembles models from OpenAI, Anthropic, and Google, using an Upper Confidence Bound (UCB) algorithm to adaptively select which model proposes each mutation. Different models excel at different types of edits, and the system learns which to deploy when.
Meta scratch pad — Programs are summarized, and global insights are extracted and fed back into the system prompt. This creates a form of semantic memory — the evolutionary process accumulates not just better programs but better understanding of why they work.
Adaptive operator selection — The algorithm itself co-evolves alongside the solutions. The evolutionary strategy adapts on the fly — hence the name: Shinka means "evolve" in Japanese, so Shinka Evolve literally means "evolve evolve."
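The bandit selection step can be sketched with the standard UCB1 rule. The model names and the reward bookkeeping below are illustrative assumptions, not Shinka Evolve's actual configuration; the point is the explore/exploit trade-off.

```python
import math

def ucb1_pick(stats, c=1.4):
    """UCB1 over mutation models: balance each model's average reward
    (e.g. fraction of accepted mutations) against how rarely it has
    been tried. `stats` maps model name -> (times used, summed reward)."""
    total = sum(n for n, _ in stats.values())
    best, best_ucb = None, -1.0
    for model, (n, reward_sum) in stats.items():
        if n == 0:
            return model                    # try every model at least once
        ucb = reward_sum / n + c * math.sqrt(math.log(total) / n)
        if ucb > best_ucb:
            best, best_ucb = model, ucb
    return best

# hypothetical running tallies for three provider models
stats = {"gpt": (10, 7.0), "claude": (10, 6.0), "gemini": (0, 0.0)}
ucb1_pick(stats)  # → "gemini" (an untried model is always picked first)
```

Over many mutations the tallies update, and the system drifts toward whichever model earns acceptances for the edit type at hand, which is exactly the "different models excel at different edits" behavior described above.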
The deepest insight from this work echoes Kenneth Stanley's Why Greatness Cannot Be Planned: sometimes solving the wrong problem works better. Shinka Evolve's circle packing experiments showed that using a relaxed fitness function (allowing tiny circle overlaps as a surrogate problem) converged faster than the exact formulation. The surrogate problem served as a stepping stone — a concept from open-endedness research where intermediate discoveries enable future breakthroughs even when they do not directly solve the target problem.
This has profound implications for agentic engineering. Current AI agents optimize for the exact problem they are given. But human researchers routinely reformulate problems, invent proxies, and transfer insights across domains. The next frontier of agentic systems — what Robert Lange calls "vibe optimization" and "vibe researching" — envisions AI shepherds overseeing populations of evolving solutions across parallel threads, checking results in the morning like a researcher reviewing overnight experiments.
The connection to Workspace DNA is structural: Memory stores the population of solutions and accumulated insights. Intelligence (multi-model agents) proposes mutations and evaluates fitness. Execution runs the evaluations and propagates successful innovations. The evolutionary loop is the Memory-Intelligence-Execution cycle, operating at the frontier of scientific discovery.
The Shopify Precedent: Agentic Engineering Goes Corporate
Shopify's adoption of agentic engineering principles deserves special attention because it shows where every company is heading.
In April 2025, Shopify CEO Tobi Lutke sent an internal memo that became public:
"Reflexive AI usage is now a baseline expectation at Shopify."
The key mandate: before requesting additional headcount, teams must demonstrate why they cannot accomplish the work using AI. The memo asked teams to consider: "What would this area look like if autonomous AI agents were already part of the team?"
This is agentic engineering applied to organizational design — not just code, but every knowledge work function.
Monday.com co-CEO Eran Zinman shared a concrete example of this shift on the 20VC podcast (2026): his company replaced its entire 100-person SDR team with AI agents, cutting response times from 24 hours to 3 minutes while improving conversion rates across every metric. All Monday.com developers now use Claude Code and Cursor. "Nobody will want to buy software that's not doing the majority of the work for them," Zinman said — a statement that makes agentic engineering not optional but existential for software companies.
How Taskade Genesis Embodies Agentic Engineering
When Karpathy described agentic engineering — "orchestrating agents who do and acting as oversight" — he described the architecture Taskade Genesis has been building since launch.
The Workspace DNA Architecture
Taskade Genesis implements agentic engineering through three pillars that form a self-reinforcing loop:
| Agentic Engineering Principle | Workspace DNA Pillar | Implementation |
|---|---|---|
| Persistent context | Memory (Projects) | Projects store data, history, and context across 8 views (List, Board, Calendar, Table, Mind Map, Gantt, Org Chart, Timeline) |
| Autonomous execution | Intelligence (Agents) | AI Agents v2 with 22+ built-in tools, custom tools via MCP, persistent memory, multi-agent collaboration |
| Reliable workflows | Execution (Automations) | Automations with durable execution, 100+ integrations, branching/looping/filtering |
Memory feeds Intelligence → Intelligence triggers Execution → Execution creates Memory. This is not a marketing framework. It is the engineering architecture that makes agentic engineering practical at scale.
Why Platform Beats Framework
The tools comparison for agentic engineering reveals a critical insight:
| Approach | Example | Requires | Deploys To | Maintains Via |
|---|---|---|---|---|
| Code generator | Cursor, Devin | Developer skills | Separate hosting | Manual updates |
| Agent framework | CrewAI, LangGraph | Python skills | BYO infrastructure | Custom code |
| AI workspace | Taskade Genesis | Natural language | Instant (built-in) | Agents + automations |
For the 63% of AI-assisted builders who are non-developers, Taskade Genesis is the only platform that implements all five agentic engineering principles without requiring code:
- Plan → Write a detailed prompt (the spec) — or grab one from the prompt template library
- Direct → AI agents build the app using 11+ frontier models from OpenAI, Anthropic, and Google
- Review → Interact with the live app immediately
- Test → Iterate by describing changes
- Own → AI agents and automations maintain the system over time
150,000+ apps built. Custom domains, password protection, Community Gallery publishing, 7-tier RBAC (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer).

The Complete Timeline: From Turing to Agentic Engineering
| Year | Event | Significance for Agentic Engineering |
|---|---|---|
| 1950 | Turing's "Computing Machinery and Intelligence" | First formal framework for machine intelligence |
| 1956 | Dartmouth Conference — "AI" coined | Field gets a name |
| 1986 | Backpropagation popularized (Rumelhart, Hinton & Williams) | Neural networks can learn |
| 1997 | Deep Blue defeats Kasparov | AI beats humans at complex strategy |
| 2012 | AlexNet wins ImageNet | Deep learning revolution begins |
| 2015 | OpenAI founded (Karpathy co-founds) | Mission: safe, beneficial AGI |
| 2016 | AlphaGo defeats Lee Sedol | AI handles ambiguous, long-horizon planning |
| 2017 | "Attention Is All You Need" (Transformer) | Architecture that enables everything |
| 2017 | Karpathy joins Tesla as Director of AI | Real-world AI deployment at scale |
| 2018 | GPT-1 | Unsupervised pre-training works |
| 2020 | GPT-3 (175B parameters) | Emergent few-shot learning |
| 2022 | Chain of Thought prompting (Wei et al.) | LLMs can reason step-by-step |
| 2022 | ReAct: Reasoning + Acting (Yao et al.) | Think → Act → Observe loop |
| Nov 2022 | ChatGPT launches | AI goes mainstream (100M users in 2 months) |
| Feb 2023 | Toolformer (Meta) | LLMs learn to use external tools |
| Mar 2023 | AutoGPT released | 100K+ stars, autonomous agents go viral |
| Apr 2023 | BabyAGI released | Minimalist agent loop proves the pattern |
| Jun 2023 | Lilian Weng's agent architecture post | Definitive reference for agent design |
| 2023 | LangChain ecosystem emerges | Agent orchestration infrastructure |
| Feb 2024 | Karpathy leaves OpenAI, founds Eureka Labs | Independent AI education and research |
| Mar 2024 | Devin announced (Cognition) | "First AI software engineer" — 13.86% SWE-bench |
| Sep 2024 | OpenAI o1-preview | First reasoning model, think-before-answer |
| Nov 2024 | Anthropic releases MCP | Universal agent-tool protocol |
| Dec 2024 | OpenAI o3 preview | 87.5% on ARC-AGI benchmark |
| Feb 2025 | Karpathy coins "vibe coding" | "Forget the code exists" — goes viral |
| Apr 2025 | Google launches A2A protocol | Agent-to-agent communication standard |
| Apr 2025 | Shopify memo: "Reflexive AI usage" | Enterprise agentic engineering mandate |
| Jun 2025 | Karpathy YC keynote: Software 3.0 | Natural language as programming interface |
| Aug 2025 | GPT-5 launches | Algorithmic efficiency > brute-force scale |
| Nov 2025 | Collins Dictionary: "vibe coding" Word of the Year | Cultural mainstreaming of AI-assisted building |
| Dec 2025 | AAIF formed (Linux Foundation) | Neutral governance for agent standards |
| Dec 2025 | Karpathy: 2025 LLM Year in Review | 6 paradigm shifts, "ghosts on your computer" |
| Feb 2026 | Karpathy coins "agentic engineering" | Declares vibe coding passe |
| Feb 2026 | Osmani publishes agentic engineering principles | 5 principles become industry consensus |
| Mar 2026 | Karpathy releases autoresearch | Live demo of agentic engineering in ML research |
What Comes Next: The Agentic Engineering Roadmap
The trajectory from vibe coding to agentic engineering points to a clear future:
Phase 1: Vibe Coding (2025) — Completed
Humans prompt, AI generates, humans accept or reject. Minimal oversight, minimal quality control. Proved the concept: AI can write functional software.
Phase 2: Agentic Engineering (2026) — Current
Humans architect and oversee, AI agents implement with human review. The middle loop emerges. Quality improves dramatically. The discipline gets a name and principles.
Phase 3: Supervised Autonomy (2027–2028)
AI agents handle entire subsystems with human checkpoint reviews. Agents run test suites, fix their own bugs, and flag only high-risk changes for human review. The middle loop becomes shorter and more focused.
Phase 4: Autonomous Systems (2029+)
AI agents build, maintain, and improve software autonomously. Humans set goals and constraints; agents handle everything else. Karpathy's "tokens tsunami" — tight agentic loops requiring massive token throughput — becomes the dominant compute workload.
Taskade Genesis is built for this trajectory. Workspace DNA — Memory, Intelligence, Execution — provides the foundation where each phase builds on the previous one. Today's agentic engineering becomes tomorrow's supervised autonomy, all within the same workspace.

The Agentic Engineering Stack (2026)
For Non-Developers
| Layer | Tool | Purpose |
|---|---|---|
| Specification | Natural language prompt | Define what to build |
| Building | Taskade Genesis | AI agents build the app |
| Infrastructure | Taskade Workspace | Database, hosting, security, 8 views |
| Intelligence | Taskade AI Agents | 22+ tools, persistent memory, multi-agent |
| Automation | Taskade Automations | 100+ integrations, durable execution |
| Deployment | Instant (built-in) | Custom domains, password protection |
For Developers
| Layer | Tool Options | Purpose |
|---|---|---|
| Specification | Design docs, structured specs | Define architecture + requirements |
| Building | Cursor, Claude Code, Devin, Taskade Genesis | AI agents write code |
| Orchestration | LangGraph, CrewAI, AutoGen | Multi-agent coordination |
| Testing | TDD frameworks, CI pipelines | Deterministic validation |
| Standards | MCP, A2A, AGENTS.md | Interoperability |
| Deployment | CI/CD, or Taskade for instant deploy | Ship to production |
The Convergence
The agentic engineering landscape is moving toward what industry analysts call the Agentic Mesh — a modular ecosystem where different tools specialize in different layers:
| Layer | Best Tool | Function |
|---|---|---|
| End-user apps | Taskade Genesis | Non-developers build living software |
| Business automation | CrewAI | Role-based multi-agent workflows |
| Enterprise orchestration | LangGraph | Production agent systems |
| Code development | Cursor, Devin, Claude Code | AI-assisted engineering |
| Standards | MCP + A2A (AAIF) | Universal interoperability |
| Model infrastructure | OpenAI, Anthropic, Google | Foundation models |
The winning strategy is not choosing one tool. It is choosing the right tool for each layer. For most teams, that means Taskade Genesis for end-user applications and team tools, combined with developer-focused agents for custom engineering work.
Start practicing agentic engineering →
Related Reading
- From Vibe Coding to Agentic Engineering: What Karpathy's New Term Means — Deep dive on the paradigm shift
- Agentic Engineering Tools and Platforms — 10+ platforms compared
- What Is Vibe Coding? — The foundational concept Karpathy evolved from
- Best Claude Code Alternatives — Terminal-first AI coding agents compared
- Best OpenClaw Alternatives — Managed alternatives to the open-source agent framework
- Best Vibe Coding Tools — 15 tools for the full spectrum
- What Is OpenAI? Complete History — The company behind GPT and the agent revolution
- What Is Anthropic? History of Claude AI — MCP, Claude Code, and the safety-first approach
- What Are AI Agents? — Foundational guide to AI agents
- How Workspace DNA Works Inside Taskade Genesis — The architecture behind it
- Taskade Genesis Reviews — What users are building with agentic engineering
- Vibe Coding vs No-Code vs Low-Code — How AI app building compares
- What Are AI Micro Apps? — The output of agentic engineering at scale
- Vibe Coding for Teams — Team-level agentic engineering in practice
- AI Prompts Library — 1,000+ ready-to-use prompts for agentic workflows
- AI Convert Tools — Transform content with AI agents
Context Engineering: The Foundation of Agentic Systems
Context engineering is the discipline of designing the information environment that AI agents operate in — what data they can access, which documents they reference, what tools they can call, and how instructions are structured. The term gained traction in 2026 through Gartner research and Philipp Schmid at Hugging Face, who argued that most agent failures are not model failures but context failures.
The relationship between context engineering and agentic engineering is hierarchical. Context engineering is the foundation; agentic engineering is the execution layer built on top of it.
The Vercel case study illustrates this perfectly. When Vercel's team analyzed their AI coding agent's accuracy, they discovered that removing unnecessary tools from the agent's context — giving it fewer options, not more — pushed accuracy from 80% to 100%, reduced token usage by 40%, and made responses 3.5x faster. The lesson: better context beats bigger models.
This aligns with the EPICS benchmark findings (2026), which tested frontier models on real professional tasks across engineering, product management, and customer support. The result: even the best models achieved only 24% success on authentic workplace tasks. The bottleneck was not model intelligence — it was context. Models failed when they lacked the right documents, the right tool access, or the right framing of the problem.
Each layer builds on the previous. Prompt engineering handles single-turn instructions. Context engineering designs what the model sees. Harness engineering adds pipelines, guardrails, and routing. Agentic engineering adds autonomous decision-making, multi-step execution, and human oversight loops.
Taskade's Workspace DNA implements all four layers natively: Memory provides context (documents, knowledge bases, project history), Intelligence provides agentic capabilities (AI agents with 22+ tools and persistent memory), and Execution provides harness-level automation (100+ integrations with branching, looping, and error handling).
Intent Engineering: The Third Discipline
Prompt engineering taught us how to talk to AI. Context engineering taught us what AI needs to know. Intent engineering — the discipline emerging in 2026 — teaches us what AI needs to want.
The distinction matters because AI agents that succeed at the wrong objective cause more damage than agents that fail entirely. In January 2026, fintech company Klarna reported that its AI agent handled 2.3 million customer conversations across 23 markets in 35 languages, doing the work of 700 full-time employees. Resolution times dropped from 11 minutes to 2. The CEO projected $60 million in savings.
Then customers started complaining. Generic answers, robotic tone, no judgment. The AI agent was technically brilliant — optimizing for exactly the metric it was given (resolve tickets fast). But Klarna's actual organizational goal was not fast resolution. It was building lasting customer relationships that drive lifetime value in a competitive fintech market. Those are profoundly different objectives requiring profoundly different decisions at the point of interaction.
Klarna CEO Sebastian Siemiatkowski later reflected on this publicly. In a 20VC interview, he acknowledged the early approach had "too much focus on cost" and described the pivot: "The future of VIP experience will be the human connection, the relationship... We need to transform our customer service from thinking about it as just good customer service to making it the human part of what Klarna is." Klarna now recruits its most passionate customers — not outsourced call center workers — as part-time support agents through an Uber-style model, resulting in dramatically higher NPS and customer satisfaction.
The deeper lesson: Siemiatkowski also explained why Klarna could not buy customer service off the shelf. "For customer service agents, whether AI or human, to answer questions really well, they need as much context as possible. Where is that context? It's in the source code of your software." This is the intent engineering problem in miniature — AI agents need not just data but organizational context: how the company calculates interest, when to bend policy, which customers are at risk. When that tacit knowledge was never formalized, the AI optimized a proxy metric (speed) instead of the real objective (relationships).
A senior human agent with five years at the company knows when to bend policy, when to spend extra time because a customer's tone signals they are about to churn, when efficiency is the right move versus when generosity is the right move. That knowledge was never documented — it lived in tacit institutional experience. When the human agents were laid off, that knowledge walked out the door.
The three disciplines of AI engineering stack on each other:
| Discipline | Era | Core Question | What It Governs |
|---|---|---|---|
| Prompt Engineering | 2023-2024 | How do I talk to AI? | Individual instructions |
| Context Engineering | 2025-2026 | What does AI need to know? | Information state, RAG, MCP |
| Intent Engineering | 2026+ | What does AI need to want? | Goals, values, trade-offs, decision boundaries |
Intent engineering requires something most organizations have never had to produce: machine-readable expressions of organizational purpose. Not "increase customer satisfaction" (a human-readable aspiration), but structured parameters an agent can act on: What signals indicate satisfaction in our context? What data sources contain those signals? What actions am I authorized to take? What trade-offs am I empowered to make — speed versus thoroughness, cost versus quality? Where are the hard boundaries I may not cross?
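One way to picture an agent-actionable intent spec is as a structured object rather than prose. Every field name and value below is a hypothetical illustration of the idea (loosely modeled on the Klarna example above), not an established schema:

```python
from dataclasses import dataclass, field

@dataclass
class IntentSpec:
    """Hypothetical machine-readable intent: the objective an agent optimizes,
    the signals that measure it, and the boundaries it may not cross."""
    objective: str                    # the real goal, not the easiest proxy metric
    success_signals: list[str]        # where "satisfaction" actually shows up in data
    allowed_actions: list[str]        # what the agent is authorized to do
    tradeoffs: dict[str, float]       # weights the agent is empowered to apply
    hard_boundaries: list[str] = field(default_factory=list)

# illustrative support-agent intent: relationships outweigh raw speed
support_intent = IntentSpec(
    objective="build lasting customer relationships, not just fast ticket closure",
    success_signals=["repeat NPS", "churn-risk score", "resolution quality rating"],
    allowed_actions=["refund up to policy limit", "extend payment deadline",
                     "escalate to human"],
    tradeoffs={"speed": 0.3, "relationship": 0.7},
    hard_boundaries=["never quote uncommitted rates",
                     "never close a churn-risk ticket unresolved"],
)
```

A spec like this is what would have told Klarna's agent that a slow, generous resolution can beat a fast, generic one: the trade-off weights and boundaries encode the tacit judgment that otherwise walks out the door with senior staff.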
This is why Workspace DNA matters at the organizational level. Memory stores the institutional knowledge that senior employees carry in their heads. Intelligence (AI agents) interprets that knowledge against live context. Execution (automations) acts within defined boundaries. The workspace becomes the intent layer — encoding not just what the agent can do, but what it should do given the organization's actual values.
Deloitte's 2026 State of AI report found that 84% of companies have not redesigned jobs around AI capabilities and only 21% have a mature model for agent governance. Meanwhile, 74% report no tangible value from AI deployments. The models work. The context pipelines are improving. What is missing is the organizational infrastructure that connects AI capability to organizational purpose.
The Microsoft Copilot story reinforces this pattern. One of the most heavily invested enterprise AI products in history — billions in infrastructure, AI embedded in every Office application — achieved 85% Fortune 500 adoption. Then it stalled. Gartner found only 5% of organizations moved from Copilot pilot to larger-scale deployment. Bloomberg reported Microsoft slashing internal sales targets. Inside companies that signed six-figure Copilot deals, employees preferred other AI tools. The issue was not model quality or UX — it was deploying AI across an organization without intent alignment: forty thousand knowledge workers given AI tools but never told how those tools connect to what the company is trying to accomplish.
The investment behind this gap is staggering. Big tech's combined AI capital expenditure approached half a trillion dollars in 2025 and is projected to exceed that in 2026 — with the big five (Amazon, Microsoft, Google, Meta, Oracle) planning to add over $2 trillion in AI-related assets in the next four years. Meanwhile, the SWE-bench coding benchmark went from 4% AI solve rate in 2023 to approximately 90-95% saturation in 2025 — a capability doubling time that is itself shrinking. The models are not the bottleneck. The organizational infrastructure that connects model capability to organizational purpose — that is the bottleneck.
The companies that win the next phase will not be the ones with the best model subscription. They will be the ones with the best organizational intent architecture — goals, values, decision frameworks, and trade-off hierarchies that are discoverable, structured, and agent-actionable. As one analyst put it: a company with a mediocre model and extraordinary intent infrastructure will outperform a company with a frontier model and fragmented organizational knowledge every single time.
The autoresearch insight applies here too: if you give the agent a bad metric, it will very confidently optimize the wrong thing. Choosing the right metric — the one that reflects actual organizational intent, not just the one that is easiest to measure — is the skill that separates successful AI deployments from expensive failures.
For teams building with Taskade Genesis, intent engineering starts with Workspace DNA: define your goals as structured project data (Memory), train AI agents with explicit decision boundaries and knowledge bases (Intelligence), and encode your workflows with the right triggers and escalation rules (Automations). The workspace is your intent layer — persistent, collaborative, and auditable. Start building →
Agentic Engineering Platforms Compared
The agentic engineering ecosystem in 2026 spans no-code platforms, developer frameworks, and low-code automation tools. Here is how the major platforms compare:
| Platform | Code Required | Multi-Agent | Memory | Integrations | Pricing |
|---|---|---|---|---|---|
| Taskade Genesis | No | Yes | Persistent | 100+ | $16/mo (10 users) |
| CrewAI | Python | Yes | Custom | Via code | Open source |
| LangGraph | Python | Yes | Custom | Via code | Open source |
| n8n | Low-code | Limited | Basic | 400+ | $20+/mo |
| AutoGen | Python | Yes | Custom | Via code | Open source |
Taskade Genesis is the only platform that delivers agentic engineering without code — persistent memory across sessions, multi-agent collaboration, and 100+ native integrations out of the box. Developer frameworks like CrewAI, LangGraph, and AutoGen offer more customization but require Python expertise and custom infrastructure. n8n bridges the gap as a low-code option but has limited multi-agent orchestration.
For most teams, the right approach is Taskade Genesis for business workflows and team tools, combined with developer-focused frameworks for custom engineering projects. See our full agentic engineering tools comparison.
Get Started: Build Your First Agentic Workflow
You do not need to be a developer to practice agentic engineering. Taskade Genesis lets any team build agentic workflows in minutes:
Step 1: Create a workspace. Go to taskade.com/create and describe what you want to build. Genesis generates a living application — not a prototype, but a deployed system with a database, UI, and logic.
Step 2: Add AI agents with custom tools and knowledge. Configure AI agents with persistent memory, train them on your documents and knowledge sources, and equip them with 22+ built-in tools. Browse the Prompts Library for ready-to-use agent instructions.
Step 3: Connect automations to trigger agent workflows. Set up automation workflows with 100+ integrations — Slack, email, CRM, payments, and more. Agents run on schedule, on trigger, or on demand. Explore what others have built in the Community Gallery.
This is agentic engineering in practice: you define the goal, configure the agents, set the guardrails, and let the system execute. The same pattern Karpathy describes for code applies to every workflow — plan, direct, review, test, own.
Start building your first agentic workflow →
FAQ
What exactly is agentic engineering?
Agentic engineering is orchestrating AI agents who write, test, and deploy code while you provide architectural oversight, quality standards, and strategic direction. Coined by Andrej Karpathy in February 2026, it emphasizes that directing AI agents effectively is an art and science — not just casual prompting. The five core principles: plan, direct, review, test, own.
How is agentic engineering different from vibe coding?
Vibe coding means accepting whatever AI generates without rigorous review. Agentic engineering adds five disciplines: plan before prompting, direct with precision, review rigorously, test systematically, and own the architecture. Both use AI to build software, but agentic engineering produces production-quality results.
Who coined the term and when?
Andrej Karpathy coined agentic engineering on February 8, 2026. He had previously coined vibe coding on February 2, 2025. Almost exactly a year later, he declared vibe coding passe because LLMs had gotten smart enough that casual prompting was no longer sufficient — orchestration with oversight was the new professional standard.
What are the five principles of agentic engineering?
Google's Addy Osmani codified them: 1) Plan before prompting — write specs and break work into agent-sized tasks, 2) Direct with precision — give agents well-scoped tasks, 3) Review rigorously — evaluate output like a human PR, 4) Test relentlessly — the single biggest differentiator from vibe coding, 5) Own the system — maintain docs, version control, CI, and production monitoring.
Do I need to be a developer to practice agentic engineering?
No. The principles apply to anyone orchestrating AI agents. On Taskade Genesis, non-developers practice agentic engineering by writing detailed prompts (planning), reviewing generated apps (oversight), iterating on designs (testing), and deploying AI agents for ongoing improvement. 63% of AI-assisted builders are non-developers.
What is the Model Context Protocol (MCP)?
MCP is an open standard created by Anthropic in November 2024 for connecting AI models to external tools and data sources. Think of it as USB-C for AI agents — a universal connector. It was donated to the Linux Foundation's Agentic AI Foundation in December 2025 and adopted by OpenAI, Google, Microsoft, and dozens of others.
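Under the hood, MCP messages are JSON-RPC 2.0 exchanged over stdio or HTTP. Here is a sketch of what a `tools/call` request and its response look like on the wire; the envelope and method name follow the MCP specification, but the tool name and arguments are made up for illustration:

```python
import json

# Client -> server: ask the server to invoke one of its tools.
# "tools/call" is the MCP method; "search_docs" is a hypothetical tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "refund policy"},
    },
}

# Server -> client: same id, with the tool's result as content blocks.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "Refunds within 30 days."}],
    },
}

wire = json.dumps(request)  # what actually travels between processes
decoded = json.loads(wire)
print(decoded["params"]["name"])
```

Because every compliant server speaks this same envelope, an agent can discover and call tools from any MCP server without bespoke integration code, which is exactly the "USB-C" property.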
What are the best agentic engineering tools?
By category: Taskade Genesis for non-developers (free tier, Pro $16/mo for 10 users). CrewAI for role-based business automation (open-source). LangGraph for enterprise orchestration. Cursor ($20/mo) and Devin 2.0 ($20/mo) for professional coding. Claude Code for terminal-based workflows. See our full agentic engineering tools comparison.
What did Gartner predict about agentic AI?
Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from less than 5% in 2025. By 2028, 33% of enterprise software will include agentic AI. However, they also predict over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
What is Karpathy's autoresearch project?
Autoresearch is a 630-line Python tool released by Karpathy on March 7, 2026. It gives an AI agent an LLM training setup and lets it experiment autonomously — approximately 12 experiments per hour, 100 overnight. It demonstrates agentic engineering: human sets the goal and metric, agent executes autonomously, results are objectively measurable.
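The loop this describes can be sketched generically: the human fixes the metric and the budget, and the agent varies configurations and keeps whatever improves the metric. Everything below (the config space, the fake training run) is invented for illustration and is not Karpathy's actual code:

```python
import random

random.seed(0)  # deterministic for the sketch

def run_experiment(config: dict) -> float:
    """Stand-in for an LLM training run; returns the metric to
    maximize. A real setup would train and report validation score."""
    return 1.0 - abs(config["lr"] - 3e-4) * 1000 - config["layers"] * 0.01

def autonomous_search(budget: int) -> tuple[dict, float]:
    """Agent loop: propose a config, run it, keep the best.
    The human only sets the metric and the budget."""
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {
            "lr": random.choice([1e-4, 3e-4, 1e-3]),
            "layers": random.choice([2, 4, 8]),
        }
        score = run_experiment(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = autonomous_search(budget=100)
print(best, score)
```

The pattern is why the results are "objectively measurable": the metric is defined up front, so every experiment the agent runs overnight can be ranked without human judgment.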
How does Taskade Genesis implement agentic engineering?
Taskade Genesis implements agentic engineering through Workspace DNA — Memory (projects as databases), Intelligence (AI agents with 22+ tools and persistent memory), and Execution (automations with 100+ integrations). Users orchestrate these components to build, deploy, and maintain living software — exactly the pattern Karpathy describes.
What is the middle loop in agentic engineering?
The middle loop is supervisory work between writing code (inner loop) and delivery operations (outer loop). It involves directing AI agents, evaluating their output, calibrating trust, and maintaining architectural coherence. Senior engineering leaders identified it as the most important emerging skill category for the AI era.
Is agentic engineering a fad or a lasting shift?
Agentic engineering represents a permanent shift. The $4.7B vibe coding market growing at 38% CAGR, Gartner's 40% enterprise adoption forecast, the Linux Foundation's AAIF, and MCP becoming the universal standard all point to structural change. The discipline of orchestrating agents becomes more valuable as AI becomes more capable, not less.
What is cognitive debt?
Cognitive debt is the gap between system complexity and human understanding — when AI-generated systems work but no human fully comprehends why. It is the agentic engineering equivalent of technical debt. Taskade Genesis reduces cognitive debt by keeping architecture visible (workspace structure), agents transparent (inspectable instructions), and history preserved.
How does agentic engineering connect to the "SaaS is dead" debate?
Y Combinator CEO Garry Tan predicted non-technical teams would vibe-code custom solutions instead of buying SaaS, naming Taskade among the disruptors. Klarna CEO Sebastian Siemiatkowski went further in his 20VC interview, arguing that AI agents will demolish SaaS switching costs entirely: "The next thing that's going to hit everyone bad is the switching cost of data... What's going to happen is people are going to start solving that problem — how do I get all my data from the existing vendor and move it to the new vendor with the help of AI through one click."

On a weekend, Siemiatkowski built what he calls "company in a box" — an open-source accounting system + CRM + Claude agent that could bookkeep invoices and manage customers via natural language. The winner of the future, he argues, is not a siloed SaaS tool but something "extremely broad" — an AI-native operating system for the entire company. Klarna has already dropped Salesforce and approximately 1,200 other SaaS services, shrinking from 7,000 employees to below 3,000 through AI-driven agentic workflows.

Agentic engineering elevates the SaaS debate: teams will orchestrate AI agents to build, deploy, and maintain living software that replaces over-bundled per-seat tools. See: The SaaSpocalypse Explained and Will Vibe Coding Kill SaaS?
What is the difference between agentic engineering and context engineering?
Context engineering focuses on designing the information environment for AI — what data, documents, and tools agents can access. Agentic engineering is broader: it includes context engineering plus the orchestration patterns, tool use, and autonomous decision-making that make agents useful. Think of context engineering as the foundation and agentic engineering as the full building. Taskade's Workspace DNA implements both — Memory provides context, Intelligence provides agentic capabilities, Execution automates the results.
How do I start with agentic engineering without code?
Taskade Genesis lets non-technical teams build agentic workflows without writing a single line of code. Create AI agents with 22+ built-in tools, train them on your knowledge sources, connect 100+ integrations, and set up automation workflows — all through a visual interface. Over 150,000 apps have been built this way. Start free →