
What Is Grokking in AI? When Models Suddenly Learn to Generalize (2026)

Grokking is when neural networks suddenly transition from memorizing data to truly understanding patterns. Discovered by accident at OpenAI, this phenomenon reveals how AI models learn trigonometric identities to solve math — and what it means for the future of AI. Updated March 2026.

March 16, 2026 · 30 min read · Dawid Bednarski · AI · #grokking #machine-learning #neural-networks

In 2021, a researcher at OpenAI was training a small neural network on a simple math problem — modular arithmetic. The model memorized the training examples quickly, and the results looked unremarkable. So the researcher went on vacation and left the experiment running.

When they came back, something had changed. The model, which had shown zero improvement on unseen data for thousands of training steps, had suddenly achieved perfect generalization. Not gradual improvement. Not a slow climb. A near-instantaneous leap from rote memorization to genuine understanding.

The team called this phenomenon grokking — after the Martian word from Robert Heinlein's 1961 novel Stranger in a Strange Land, meaning to understand something so deeply that you merge with it. And the name stuck, because what this small model did was genuinely alien.

This is one of the most surprising discoveries in modern AI research. It challenges everything we thought we knew about how neural networks learn, when they learn, and what they're really doing beneath the surface. 🧪

TL;DR: Grokking is when a neural network suddenly transitions from memorizing training data to truly understanding the underlying pattern — often thousands of training steps after memorization is complete. Discovered by accident at OpenAI in 2021, grokking reveals that models can build hidden trigonometric solutions while appearing stagnant. Build with AI agents that learn from your data →


🧠 What Is Grokking?

Grokking is a sudden phase transition in neural network training where a model shifts from memorizing its training data to genuinely generalizing — understanding the underlying pattern well enough to solve examples it has never seen. The transition happens after an extended period of apparent stagnation, during which standard training metrics show no improvement whatsoever.

The term was introduced in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" by researchers at OpenAI. They borrowed it from Heinlein's science fiction novel, where the Martian word means to understand something so profoundly that you become one with it. It is a fitting name — the model does not merely learn a shortcut or heuristic. It discovers the actual mathematical structure of the problem.

Andrej Karpathy, former director of AI at Tesla, has described the strangeness of neural network behavior this way: "Training LLMs is less like building animal intelligence and more like summoning ghosts." Grokking is perhaps the clearest example of why. A model sits there, seemingly stuck, and then — without any change to the training process — it spontaneously reorganizes its internal representations and solves the problem perfectly.

| Aspect | Memorization | Standard Learning | Grokking |
| --- | --- | --- | --- |
| Training performance | Perfect | Improves gradually | Perfect early |
| Test performance | Poor | Improves with training | Flat, then sudden jump |
| Internal structure | Lookup table | Gradual feature extraction | Hidden structure building |
| When it happens | Immediately | During training | Long after memorization |
| What the model learns | Input-output pairs | Approximate patterns | Exact mathematical structure |

Understanding grokking matters because it reveals that what a model appears to know and what it actually knows can be completely different things. This has profound implications for AI safety, model evaluation, and our understanding of how large language models work.

🔬 The Accidental Discovery

The story of grokking begins with a simple experiment at OpenAI in 2021. Researchers were training small transformer models on algorithmic tasks — the kind of clean mathematical problems where you can verify whether a model truly understands the pattern or is just memorizing.

The task was modular arithmetic: given two numbers X and Y, compute (X + Y) mod P, where P is a prime number. Think of it as clock math — on a clock with P hours, if you start at hour X and move forward Y hours, where do you land?

For a small prime like P = 5, the complete dataset is a 5 by 5 table:

| + (mod 5) | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 | 0 |
| 2 | 2 | 3 | 4 | 0 | 1 |
| 3 | 3 | 4 | 0 | 1 | 2 |
| 4 | 4 | 0 | 1 | 2 | 3 |

The researchers split this table into training and test sets — say, 70% of the cells for training and 30% held out. They trained a small transformer on the training portion and watched what happened.
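The setup is easy to reproduce in a few lines of plain Python — a sketch using the 70/30 split described above and an arbitrary random seed:

```python
import random

P = 5  # modulus; the key grokking experiments use P = 113

# Every (x, y) pair with its answer (x + y) mod P -- the complete dataset.
dataset = [((x, y), (x + y) % P) for x in range(P) for y in range(P)]

# Hold out 30% of the table's cells as a test set, as described above.
random.seed(0)
random.shuffle(dataset)
split = int(0.7 * len(dataset))
train, test = dataset[:split], dataset[split:]

print(len(dataset), len(train), len(test))  # 25 17 8
```

For P = 113 the same two lines produce all 12,769 cells of the addition table; nothing else about the setup changes.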

The initial results were unsurprising. The model memorized the training data within a few hundred steps. Training accuracy hit 100%. But test accuracy — performance on the held-out examples — stayed near random chance. The model had memorized which input pairs mapped to which outputs without learning any underlying rule.

At this point, most researchers would stop the experiment. The model had converged. The loss was flat. Standard practice says: your model has overfit, move on.

But the researcher left the training running — inadvertently, while on vacation. And when they returned days later, they checked the logs and found something nobody expected.

Somewhere around step 7,000, the test accuracy had jumped from near-zero to 100%. Not gradually. The curve looked like a step function — flat, flat, flat, then perfect. The model had gone from a pure memorizer to a perfect generalizer, all while the researcher was not even watching.

The OpenAI team published these findings in January 2022, and the paper sent shockwaves through the machine learning community. It raised an uncomfortable question: how many models have we stopped training just before they were about to grok?

📐 The Modular Arithmetic Problem

To understand why grokking is remarkable, you need to understand what the model is actually seeing — because it is not seeing "numbers."

Modular arithmetic is clock math. On a 12-hour clock, 10 + 5 = 3, because you wrap around past 12. The same idea works with any modulus P. When P = 113 (the prime number used in the key grokking experiments), you have a clock with 113 hours.

But here is the critical detail: the neural network does not receive the numbers 0 through 112 as numeric values. Instead, each number is represented as a one-hot encoded vector — a list of 113 zeros with a single 1 in the position corresponding to that number.

Input: 47 + 81 = ?  (mod 113)

Token "47": [0,0,...,0,1,0,...,0] ← 114-dim vector, 1 at position 47
Token "81": [0,0,...,0,1,0,...,0] ← 114-dim vector, 1 at position 81
Token "=":  [0,0,...,0,0,0,...,1] ← special token, 1 at position 113

Total input: 114 × 3 matrix (113 digits + equals token, 3 tokens)

The model receives three tokens — the first number, the second number, and an equals sign — each represented as a 114-dimensional one-hot vector (113 possible digits plus the equals token). That is a 114 by 3 input matrix.

From the model's perspective, there is no inherent relationship between "47" and "48." They are just two completely different patterns of zeros and ones. The model has no concept of "numbers" or "addition." It must discover the mathematical structure entirely from the patterns in the data.
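The encoding itself is a few lines of plain Python. This is a sketch; placing the equals token at index 113 is an assumed convention:

```python
VOCAB = 114   # 113 digit tokens plus one "=" token
EQUALS = 113  # index of the "=" token (assumed convention)

def one_hot(index, size=VOCAB):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# The three input tokens for "47 + 81 = ?" (mod 113):
tokens = [one_hot(47), one_hot(81), one_hot(EQUALS)]

# Stacked, this is the 114 x 3 input matrix described above.
assert len(tokens) == 3 and all(len(t) == VOCAB for t in tokens)
```

Note that `one_hot(47)` and `one_hot(48)` share no nonzero positions, which is exactly why the model sees no built-in relationship between adjacent numbers.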

The architecture is a small transformer: an embedding matrix (114 to 128 dimensions), an attention block, an MLP (multi-layer perceptron), and an unembedding layer that maps back to 113 possible outputs.

One-Hot Input (114×3)
        │
        ▼
┌─────────────────┐
│  Embedding      │  114 → 128 dimensions
│  Matrix         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Attention      │  Token interactions
│  Block          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  MLP            │  Where the magic happens
│  (2 layers)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Unembedding    │  128 → 113 outputs
│  Matrix         │
└─────────────────┘
         │
         ▼
    Answer: 15

The question is: what does this model learn internally? A lookup table of memorized answers? Or something far more elegant?

🌊 The Three Phases of Grokking

Careful analysis of grokking reveals three distinct phases, each with radically different internal dynamics. What makes grokking so striking is that standard training metrics cannot distinguish Phase 2 from convergence — the model appears stuck while secretly building a solution.

Phase 1: Memorization (~0-200 steps) — training accuracy → 100%
Phase 2: Structure Building (~200-7,000 steps) — hidden trig representations grow
Phase 3: Generalization (~7,000+ steps) — test accuracy jumps to 100%

Phase 1: Memorization (~0-200 steps)

The model rapidly memorizes the training examples. Within about 200 training steps, training accuracy reaches 100%. The model has essentially built an internal lookup table — for each training input pair (X, Y), it has stored the correct answer.

At this point, the model's internal representations show no discernible mathematical structure. If you visualize the neuron activations in the MLP layer, they look like noise. The model treats each input pair as an independent fact to be stored, with no relationship between (3 + 7) and (4 + 6) even though both equal 10 mod 113.

Test performance is at or near chance level. The model has memorized answers, not learned a rule.

Phase 2: Structure Building (~200-7,000 steps)

This is the phase that makes grokking genuinely mysterious. Both training loss and test loss appear completely flat. Standard metrics suggest the model has converged and nothing is changing.

But something is changing.

In early 2023, Neel Nanda and collaborators published a groundbreaking analysis showing exactly what happens during this "dormant" phase. Using a metric called excluded loss — which strips specific frequency components from the model's output before measuring performance — they proved that the model is steadily building trigonometric representations beneath its memorized solution.

Here is what excluded loss reveals: remove the memorization component from the model's output, and you can see a new signal growing stronger step by step. The model is constructing sine and cosine functions of its inputs in the embedding layer, wiring them together through the MLP, and slowly building the machinery for a fundamentally different solution strategy.

The reason standard metrics miss this is simple: the memorized solution works perfectly on the training data and masks the emerging trigonometric solution. It is like watching someone build a new house inside an old house — from the outside, nothing changes until the old walls come down.
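The projection idea behind excluded loss can be seen in a toy: remove a signal's component along the sine/cosine directions for a few "key" frequencies and check what survives. This is a simplified, hypothetical sketch of the idea — not the paper's implementation, which operates on the model's logits over the training set:

```python
import math

P = 113
KEY_FREQS = [3, 4]  # hypothetical "key" frequencies k

def fourier_direction(k, phase):
    """Unit-normalized cos (phase=0) or sin (phase=1) vector over outputs z."""
    fn = math.cos if phase == 0 else math.sin
    v = [fn(2 * math.pi * k * z / P) for z in range(P)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def exclude(logits, freqs=KEY_FREQS):
    """Remove the logits' projection onto the key-frequency directions."""
    out = list(logits)
    for k in freqs:
        for phase in (0, 1):
            d = fourier_direction(k, phase)
            proj = sum(a * b for a, b in zip(out, d))
            out = [a - proj * b for a, b in zip(out, d)]
    return out

# A purely trigonometric "solution" lives entirely in the key frequencies,
# so excluding them leaves essentially nothing behind.
trig_logits = [math.cos(2 * math.pi * 3 * z / P) for z in range(P)]
residual = exclude(trig_logits)
print(max(abs(r) for r in residual))  # essentially zero
```

A memorized solution, by contrast, is spread across many frequencies and mostly survives the exclusion — which is why the growing trigonometric component becomes visible only once it is stripped out.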

Phase 3: Generalization + Cleanup (~7,000+ steps)

The phase transition. Around step 7,000, the trigonometric solution becomes strong enough to compete with the memorized solution. Test accuracy shoots from near-zero to near-perfect in a span of just a few hundred steps.

But there is a second, equally important process: cleanup. After generalization, the model actively removes its memorized representations. The internal lookup table that served it through Phase 1 is dismantled, and the clean trigonometric solution is all that remains.

Performance
    │
100%├─────────────■■■■■■■■■■■■■■■■■■■■■■■■■■■  Training
    │                                    ┌──■■■■■  Test
    │                                    │
 50%├────────────────────────────────────┤
    │                                    │
  0%├──■─────────────────────────────────┘
    └──┬──────────┬──────────┬──────────┬──▶ Steps
       0        2000       5000       7000

   ◄──Phase 1──►◄───Phase 2────►◄─Phase 3─►
   Memorize      Build Structure  Generalize

This three-phase pattern has been replicated across many modular arithmetic tasks and other algorithmic problems. It appears to be a fundamental property of how small neural networks discover mathematical structure — and it raises deep questions about what might be happening inside large language models during their own training.


🎵 The Trigonometric Solution

This is the most extraordinary part of the grokking story. When researchers cracked open the model and examined what it had learned, they did not find a cleverer lookup table or a brute-force approximation. They found that the neural network had independently discovered trigonometry.

What the Embedding Layer Learns

After grokking, the embedding matrix transforms each one-hot input into a 128-dimensional vector. When researchers applied a sparse linear probe — a technique that finds interpretable directions in high-dimensional space — they discovered that the most important components of these embedding vectors were sine and cosine functions of the input values.

For each input number x, the embedding contains strong representations of sin(2πkx/113) and cos(2πkx/113) for specific frequencies k. The model had discovered that circular functions are the natural way to represent numbers on a modular clock.

What the MLP Neurons Compute

The MLP (multi-layer perceptron) is where the computation happens. When researchers plotted the output of individual MLP neurons as a function of the two inputs x and y, they found sweeping sine wave patterns.

Even more revealing: when they plotted pairs of neurons against each other as scatter plots, the data points traced out circles and loops. This is the geometric signature of sine and cosine — the model had organized its neurons into circular representations.

A discrete Fourier transform of the neuron activations confirmed specific dominant frequencies: 8π/113 and 6π/113 appeared as the strongest components. The model had not learned all possible frequencies — it had selected a sparse set of frequencies that were sufficient to solve the problem.
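The frequency analysis is easy to replicate on a synthetic neuron. Here a naive discrete Fourier transform recovers the dominant frequency of an activation that stands in for a real post-grokking MLP neuron (the frequency k = 4 and the phase are illustrative choices):

```python
import math

P = 113

# Synthetic neuron activation with a single dominant frequency k = 4.
activation = [math.cos(2 * math.pi * 4 * z / P + 0.7) for z in range(P)]

def dft_power(signal, k):
    """Power of frequency k in a real signal (naive DFT, O(n) per frequency)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(signal))
    return re * re + im * im

powers = {k: dft_power(activation, k) for k in range(1, P // 2 + 1)}
dominant = max(powers, key=powers.get)
print(dominant)  # 4
```

Run against real neuron activations, the same analysis is what reveals that only a sparse handful of frequencies carry nearly all the power.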

The Sum-of-Angles Identity

Here is where it all comes together. The model needs to compute (x + y) mod 113 — it needs to find the sum of its two inputs. But the MLP's basic operation is multiplication (matrix multiply followed by nonlinearity). How do you convert products into sums?

The answer is a trigonometric identity that every calculus student has seen:

cos(x + y) = cos(x) · cos(y) - sin(x) · sin(y)

This is the sum-of-angles identity. It converts the sum (x + y) — which is what the model needs — into a combination of products of cos(x), cos(y), sin(x), and sin(y) — which are exactly the representations the embedding layer has built.

The Model's Solution (decoded):

Step 1: Embed → sin(kx), cos(kx), sin(ky), cos(ky)
Step 2: Products → cos(kx)·cos(ky), sin(kx)·sin(ky)
Step 3: Identity → cos(kx)·cos(ky) - sin(kx)·sin(ky) = cos(k(x+y))
Step 4: Decode → "For which answer z does cos(k(x+y)) peak?"

The model computes cos(kx) · cos(ky) as the strongest component of certain MLP neurons. It then combines these with the sin products via the sum-of-angles identity to produce cos(k(x + y)) — a function that depends only on the sum of x and y, which is exactly the quantity it needs to compute.
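The decoded strategy can be verified numerically. This minimal sketch uses a single frequency k, whereas the real model sums evidence over several frequencies through noisy neurons:

```python
import math

P, k = 113, 4  # modulus and one dominant frequency

def answer(x, y):
    # Steps 1-3: build cos(k(x+y)) and sin(k(x+y)) from products of the
    # embedded sines/cosines via the sum-of-angles identities.
    a, b = 2 * math.pi * k * x / P, 2 * math.pi * k * y / P
    cos_sum = math.cos(a) * math.cos(b) - math.sin(a) * math.sin(b)
    sin_sum = math.sin(a) * math.cos(b) + math.cos(a) * math.sin(b)

    # Step 4: score(z) = cos(k(x+y) - kz), maximal exactly when
    # z = (x + y) mod P, since gcd(k, P) = 1 for prime P.
    def score(z):
        c = 2 * math.pi * k * z / P
        return cos_sum * math.cos(c) + sin_sum * math.sin(c)

    return max(range(P), key=score)

print(answer(47, 81))  # 15, i.e. (47 + 81) mod 113
```

The point of the sketch is that no addition ever happens directly: the sum enters only through products of trig functions, exactly as the identity predicts.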

Diagonal Symmetry

The most visually striking evidence comes from plotting neuron activations as a heat map over all possible (x, y) pairs. After grokking, individual neurons show diagonal stripe patterns: a neuron fires strongly for all input pairs where x + y equals the same value (modulo 113).

           y
         0  1  2  3  4
       ┌──┬──┬──┬──┬──┐
    0  │0 │1 │2 │3 │4 │  ← Diagonal lines =
    1  │1 │2 │3 │4 │0 │    same sum (mod 5)
x   2  │2 │3 │4 │0 │1 │
    3  │3 │4 │0 │1 │2 │  The model learns to
    4  │4 │0 │1 │2 │3 │  fire along these
       └──┴──┴──┴──┴──┘  diagonals!

For the actual experiment with mod 113, a neuron might fire for all (x, y) pairs where x + y = 65 (mod 113). That includes (0, 65), (1, 64), (2, 63), and so on — but also (100, 78), because 100 + 78 = 178, and 178 mod 113 = 65. The neuron fires along a diagonal that wraps around the grid, which is exactly what you would expect from a circular (trigonometric) representation.
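An idealized post-grokking neuron makes the wrap-around diagonal easy to check: any function of (x + y) mod P takes the same value everywhere on the diagonal, including the wrapped pairs. A sketch with an assumed frequency k = 4:

```python
import math

P, k = 113, 4

def ideal_neuron(x, y):
    """Idealized post-grokking neuron: depends only on (x + y) mod P,
    because cos is periodic with period 2*pi."""
    return math.cos(2 * math.pi * k * (x + y) / P)

# All on the diagonal x + y = 65 (mod 113), including the wrap-around pair:
pairs = [(0, 65), (1, 64), (2, 63), (100, 78)]  # 100 + 78 = 178 = 65 mod 113
values = [ideal_neuron(x, y) for x, y in pairs]
print(all(abs(v - values[0]) < 1e-9 for v in values))  # True
```

A heat map of this function over all (x, y) pairs produces exactly the diagonal stripes observed in the real neurons.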

This is a neural network that was given nothing but random-looking patterns of ones and zeros. It received no hint that the numbers it was working with had any circular structure. And yet it independently discovered:

  1. That circular functions are the right representation
  2. That specific frequencies capture the necessary information
  3. That the sum-of-angles identity converts products into sums
  4. That diagonal symmetry over the input grid solves the problem

As Nanda's analysis concluded, grokking gives us a transparent box in a world of black boxes — a case where we can fully trace how a neural network solves a problem, from input to output, with no hidden mystery.

Phase transition: Random Weights (Initialization) → Memorization Circuits (Lookup Table) → Competing Circuits (Trig + Memo Coexist) → Generalizing Circuits (Trig Solution Wins) → Cleanup (Memo Circuits Pruned)


🔍 Why Grokking Matters

Grokking is a fundamental phenomenon that reveals hidden dynamics of how neural networks learn, with consequences that stretch from training methodology to AI safety.

Standard metrics: training loss drops (steps 0-200) → test loss plateaus (steps 200-7,000) → hidden learning detected? No, the model appears converged → early stopping ✗, miss generalization.
Excluded loss probe: hidden learning detected? Yes, trig representations are growing → continue training → sudden generalization ✓, perfect test accuracy.

Hidden Learning Is Real

The most important lesson of grokking is that models can appear to have stopped learning while secretly developing new capabilities. During Phase 2, every standard metric — training loss, test loss, validation accuracy — shows no improvement. Any reasonable practitioner would conclude the model has converged.

But the model is not converged. It is actively building a fundamentally different solution strategy. The only way to detect this hidden learning is with specialized probes like excluded loss or mechanistic analysis of internal representations.

This has immediate practical implications: we may be stopping training too early on many models. If grokking-like dynamics occur in larger networks (and there is growing evidence they do), we could be leaving significant generalization performance on the table by using standard early stopping criteria.

Implications for AI Safety

For the AI safety community, grokking is a cautionary tale. If a model can develop hidden capabilities that are invisible to standard evaluation during training, then our current safety evaluation methods may be fundamentally insufficient.

Consider the alignment scenario: a model might appear to behave as intended during evaluation — all benchmarks look good, all safety filters pass — while internally developing representations that could lead to unexpected behavior. Grokking proves that this is not a theoretical concern. It happens in practice, in simple models, on simple tasks.

The phenomenon also connects to ongoing work in mechanistic interpretability — the effort to understand what neural networks are computing internally, rather than evaluating them only by their outputs. Grokking models are valuable test cases because we can verify mechanistic explanations against the known trigonometric solution.

The Physics of Phase Transitions

Grokking is not just a machine learning curiosity — it is a phase transition, the same kind of sudden shift that physicists study in magnets, superconductors, and boiling water.

In 1920, German physicist Wilhelm Lenz and his student Ernst Ising studied how blocks of iron become magnets. In the Ising model, atoms carry tiny magnetic spins (up or down) that interact with their neighbors. At high temperatures, spins flip randomly — disorder. As the system cools, adjacent spins begin aligning, lowering the system's energy. At a critical temperature, the system undergoes a phase transition: disorder snaps into order, and the block becomes a permanent magnet.

The mathematics of grokking maps directly onto this framework. Imagine the model's parameter space as an energy landscape — a mountainous terrain where altitude represents the loss function. During Phase 1 (memorization), the model rolls into a local valley: a lookup table solution. The landscape around this valley is flat — standard metrics see no gradient pushing the model elsewhere.

ENERGY LANDSCAPE OF GROKKING

Energy
│
│   Memorization              Trigonometric
│      Valley                     Valley
│
│    ╔═══╗      Barrier        ╔═══╗
│    ║   ║     ┌──────┐        ║   ║
│    ║ ● ║     │      │        ║   ║
│    ║   ║     │      │        ║ ★ ║ ← deeper minimum
│    ╚═══╝     │      │        ╚═══╝   (true solution)
│              └──────┘
└──────────────────────────────────── Parameters

● = model during Phase 1-2 (stuck in local minimum)
★ = model after Phase 3 (grokked — found global minimum)

But the trigonometric solution — the real answer — lies in a deeper valley on the other side of an energy barrier. During Phase 2, weight decay (regularization) acts like slowly raising the temperature: it destabilizes the memorization valley, giving the model the energy to escape. When the model finally crests the barrier and falls into the trigonometric valley, that is the phase transition — the moment of grokking.

This is not a metaphor. The Ising model's phase transition, grokking's sudden generalization, and the Hopfield network's convergence to stored memories are all described by the same mathematics: systems minimizing energy functions and settling into stable configurations. In October 2024, the Nobel Prize in Physics went to John Hopfield and Geoffrey Hinton for recognizing this connection — that the physics of magnets and the learning dynamics of neural networks are governed by the same laws.

This physical perspective also explains why weight decay accelerates grokking. Stronger regularization raises the effective "temperature" of the system, making it easier for the model to escape local minima. Weaker regularization keeps the model trapped in the memorization valley longer. It is thermodynamics applied to learning.
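The escape dynamic can be illustrated with a toy 1-D landscape (all numbers here are assumed for illustration): a shallow "memorization" valley at w = 5 and a deeper "trigonometric" valley at w = 1. Without weight decay, gradient descent started in the shallow valley stays trapped; with enough weight decay, the extra pull toward the origin erases the shallow valley's restoring force and the model rolls into the deeper one:

```python
import math

def base_loss_grad(w):
    # Gradient of a toy loss with two valleys: a shallow "memorization"
    # valley at w = 5 and a deeper "trigonometric" valley at w = 1
    # (a hypothetical 1-D stand-in for the high-dimensional landscape).
    return (2 * (w - 5) * math.exp(-(w - 5) ** 2)
            + 3 * (w - 1) * math.exp(-(w - 1) ** 2))

def train(weight_decay, w=5.0, lr=0.01, steps=5000):
    """Plain gradient descent; weight decay adds the pull 2*lambda*w."""
    for _ in range(steps):
        w -= lr * (base_loss_grad(w) + 2 * weight_decay * w)
    return w

print(round(train(weight_decay=0.0), 2))  # stays near 5.0: trapped
print(round(train(weight_decay=0.2), 2))  # ends near 0.9: escaped
```

The toy captures only the qualitative story — in the real experiments the "valleys" are regions of a million-dimensional parameter space, but the role of weight decay as an escape mechanism is the same.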

Random weights (initialization; no structure, random activations)
→ Memorization via fast gradient descent (local minimum: the lookup-table solution; standard metrics flat)
→ Critical point as weight decay raises the "temperature" (energy barrier crossed, like the Ising critical temperature)
→ Sudden phase transition to generalization (global minimum: trigonometric circuits)
→ Stable trig solution

Emergent Capabilities in Large Models

Researchers have observed grokking-like phenomena in larger models and more complex tasks. The sudden appearance of new capabilities in large language models — what the field calls "emergent abilities" — may share underlying mechanisms with grokking in small transformers.

When a model suddenly becomes capable of chain-of-thought reasoning, or suddenly learns to follow instructions, or suddenly develops the ability to do arithmetic at scale — these phase transitions echo the sudden generalization in grokking. The connection remains an active area of research, but grokking provides a concrete, interpretable example of how such sudden shifts can occur.

The broader question for AI development is whether we can predict when these transitions will happen, and whether we can design training procedures that encourage them rather than leaving them to chance. The physics of phase transitions suggests that these shifts are not random — they are governed by the geometry of the loss landscape and the dynamics of optimization. Understanding that geometry may be the key to making emergent capabilities predictable rather than surprising.

Latent Operators: What Grokking Teaches Us About Mental Rotation

Grokking is not just about modular arithmetic. The same principle — a model reorganizing its internal representations to discover the true structure of a problem — appears in a much broader class of tasks involving symmetries.

Consider mental rotation: when you see a rotated image of a familiar object, your brain does not memorize every possible orientation. Instead, it learns the transformation itself — the operation of rotation — and applies it to recognize the object from any angle. Recent research shows that neural networks can learn similar latent operators — internal transformations that act on representations in latent space rather than on raw pixels.

MENTAL ROTATION: BRUTE FORCE vs. LATENT OPERATORS

Brute Force (Memorization)      Latent Operators (Grokking-like)
┌─────────────────────────┐     ┌─────────────────────────┐
│ Store every rotation:   │     │ Learn the rotation      │
│ 🪑 0°   → "chair"       │     │ OPERATOR in latent      │
│ 🪑 15°  → "chair"       │     │ space:                  │
│ 🪑 30°  → "chair"       │     │                         │
│ 🪑 45°  → "chair"       │     │ Object → Canonical      │
│ ...                     │     │ Pose → Recognize        │
│ 🪑 345° → "chair"       │     │                         │
│ (24 entries per object) │     │ Works for ANY angle     │
│ BRITTLE: fails at 17°   │     │ including never-seen    │
└─────────────────────────┘     └─────────────────────────┘

The parallel to grokking is direct. A network that memorizes rotation angles is in Phase 1 — it has a lookup table. A network that discovers the rotation operator has grokked — it found the underlying symmetry. And just as the grokking model transitions from a memorized lookup table to trigonometric representations, a vision model transitions from storing individual orientations to learning a canonical pose in latent space plus a transformation operator.

This solves one of the deepest problems in AI vision: brittleness. A model trained on objects at 0°, 90°, 180°, and 270° will fail at 45° — unless it discovers the symmetry. The grokking insight is that models can discover these symmetries without being told about them, but only if training continues past the memorization phase. The key is what researchers call data efficiency through structure discovery — the same principle that lets the modular arithmetic model reach 100% test accuracy after training on only 70% of the possible examples.
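The contrast can be made concrete with a toy numpy sketch (the 2-D latent space and every function name here are illustrative assumptions, not the setup from the cited research): a lookup table only recognizes the angles it stored, while a learned rotation operator canonicalizes any view.

```python
import numpy as np

def rotation(theta_deg):
    """2-D rotation matrix — the 'latent operator' in this toy setup."""
    t = np.radians(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

canonical = np.array([1.0, 0.0])      # canonical pose of the object

# Brute force: store one embedding per trained angle, nothing else.
lookup = {a: rotation(a) @ canonical for a in (0, 90, 180, 270)}

def recognize_lookup(view_angle):
    return view_angle in lookup       # fails for any angle never stored

def recognize_operator(view_angle):
    """Undo the rotation, then compare against the canonical pose.

    In a real network the angle would itself be inferred; it is given
    here only to keep the sketch minimal.
    """
    view = rotation(view_angle) @ canonical
    recovered = rotation(-view_angle) @ view
    return bool(np.allclose(recovered, canonical))

print(recognize_lookup(45))     # False — 45° was never stored
print(recognize_operator(45))   # True  — the operator generalizes
```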

This connects directly to how the brain works. Neuroscientists have found that grid cells in the entorhinal cortex — the same cells that create hexagonal coordinate systems for spatial navigation — also encode abstract transformations. The brain does not memorize every rotated view of a coffee cup. It learns the rotation operation and applies it in a latent space provided by these grid cells. Grokking models and biological brains appear to converge on the same solution: learn the symmetry, not the instances.

Surrogate Problems: When Solving the Wrong Problem Works Better

One of the most counterintuitive lessons from grokking research emerged from evolutionary AI experiments at Sakana AI in early 2026. Their Shinka Evolve system — which uses frontier LLMs as mutation operators inside an evolutionary algorithm — discovered that solving a relaxed version of a problem often converges faster than solving the exact formulation.

In circle packing experiments, the team first used a fitness function that allowed tiny overlaps between circles (a surrogate problem). The system found state-of-the-art solutions rapidly. When they reran the experiment with the exact constraint (zero overlap), it took significantly longer to reach the same quality. The surrogate problem served as a stepping stone — an intermediate discovery that enabled the final breakthrough.
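The relaxation can be sketched with a toy fitness function (purely illustrative — not Sakana AI's actual code): the exact objective rejects any overlap outright, while the surrogate trades small overlaps against packed radius, giving the search a smooth slope to climb.

```python
import numpy as np

def overlap(c1, r1, c2, r2):
    """Overlap depth between two circles (0 if they don't touch)."""
    dist = np.linalg.norm(np.asarray(c1) - np.asarray(c2))
    return max(0.0, (r1 + r2) - dist)

def fitness_exact(circles):
    """Exact formulation: any overlap at all invalidates the packing."""
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            if overlap(*circles[i], *circles[j]) > 0:
                return float("-inf")          # hard constraint: reject
    return sum(r for _, r in circles)

def fitness_surrogate(circles, penalty=10.0):
    """Relaxed surrogate: small overlaps cost fitness but don't reject."""
    total_overlap = sum(
        overlap(*circles[i], *circles[j])
        for i in range(len(circles))
        for j in range(i + 1, len(circles)))
    return sum(r for _, r in circles) - penalty * total_overlap

# Two overlapping unit circles: rejected by the exact objective,
# scored smoothly by the surrogate (2.0 radius - 10 * 0.5 overlap).
packing = [((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0)]
print(fitness_exact(packing))      # -inf
print(fitness_surrogate(packing))  # -3.0
```

The design point is the gradient: the exact fitness is a cliff (valid or rejected), while the surrogate rewards every small reduction in overlap, so intermediate "stepping stone" configurations stay in the population.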

This echoes grokking's own dynamics. During Phase 2, the model is not solving the modular arithmetic problem directly. It is building trigonometric representations — a different, more fundamental problem whose solution happens to solve the original. The model discovers that "how do sine and cosine relate to modular groups?" is the right question to answer, even though nobody asked it.

Kenneth Stanley's Why Greatness Cannot Be Planned formalizes this insight: the path to a solution often runs through problems you did not set out to solve. In evolutionary computation, maintaining a diverse population of "stepping stones" — partially useful solutions that do not directly optimize the target — produces better outcomes than pure optimization. Open-endedness researchers argue that this is why biological evolution produces such remarkable complexity: it is not optimizing for anything in particular, just accumulating useful building blocks.

The connection to grokking is structural:

  • Memorization = optimizing the obvious objective (fit the training data)
  • Phase 2 stagnation = building stepping stones (trigonometric representations that do not yet improve test performance)
  • Grokking = the stepping stones suddenly assemble into a complete solution

Current AI agents optimize for exactly the problem they are given. But the grokking phenomenon — and the Shinka Evolve experiments — suggest that the next frontier is co-evolution of problems and solutions, where the AI system reformulates the problem itself as part of the search process. Robert Lange of Sakana AI calls this the "problem problem": not just finding solutions, but inventing the right problems to solve.

For agentic engineering, this means designing systems where agents can explore tangential paths, accumulate intermediate insights, and bring back unexpected stepping stones — rather than converging as quickly as possible on the first plausible answer.

🤖 From Grokking to Workspace Intelligence

Grokking demonstrates that AI systems can develop deep understanding from exposure to data — moving from surface-level pattern matching to discovering the fundamental structure of a problem. This principle resonates with how modern AI agents develop contextual intelligence in practical applications.

Taskade AI agents embody a similar philosophy of progressive understanding. When you train an agent on your workspace data — documents, conversations, project histories, and team workflows — it does not just memorize keywords. Through persistent memory and knowledge training, agents build progressively deeper representations of how your team operates.

Here is the connection:

  • Grokking models go from memorizing (X + Y) mod 113 to discovering trigonometric identities
  • Taskade agents go from matching keywords to understanding workflow context — knowing that a "sprint review" means different things to your engineering team versus your marketing team

This progression is powered by Workspace DNA: Memory (persistent context from projects and documents), Intelligence (11+ frontier models from OpenAI, Anthropic, and Google), and Execution (100+ integrations and automated workflows that act on insights).

What you can build today:

  • Custom AI agents with 22+ built-in tools, persistent memory, and slash commands that understand your domain
  • Automated workflows that trigger based on context — not just rules — using durable execution
  • Genesis Apps that turn prompts into live dashboards, portals, and tools your team can use immediately
  • Multi-agent teams where specialized agents collaborate on complex tasks, each with their own knowledge base

The same way grokking reveals that small models can discover profound mathematical truths, workspace AI reveals that agents with the right data and architecture can develop genuine operational intelligence. Try building your first AI agent →

🔮 The Bigger Picture

From a word in a 1961 science fiction novel about a Martian who understood things so deeply he could make them disappear, "grokking" has become one of the most important concepts in modern AI research.

The phenomenon reminds us that artificial intelligence is genuinely strange. A neural network with no concept of trigonometry, trained on nothing but patterns of zeros and ones, independently discovers sine waves, Fourier analysis, and the sum-of-angles identity. It finds the same solution that took human mathematicians centuries to develop — and it finds it by accident, while its trainer was on vacation.

Karpathy's observation keeps proving true: "Training LLMs is less like building animal intelligence and more like summoning ghosts." These intelligences are alien. They do not think the way we think. They find solutions we would not consider, using representations we can barely interpret.

But grokking also offers hope. It is one of the few cases in all of AI where we can fully understand what a neural network is doing — a transparent box in a world of black boxes. As the field of mechanistic interpretability grows, grokking provides both inspiration and methodology. If we can understand grokking, maybe we can understand the rest.

The next time you watch a training loss curve flatten out and think "time to stop," remember: the model might be about to grok. 🧪

Watch: Which AI model should you build with? — choosing the right model for your Taskade Genesis apps.

❓ Frequently Asked Questions

What is grokking in AI?

Grokking is a phenomenon in neural network training where a model suddenly transitions from memorizing its training data to truly generalizing — understanding the underlying mathematical pattern. It typically occurs long after the model has perfectly memorized the training set, during a period when standard training metrics show no improvement. The term was coined by OpenAI researchers in 2022, borrowing from Robert Heinlein's 1961 novel Stranger in a Strange Land.

How is grokking different from normal learning?

In normal machine learning, training and test performance improve together — as the model learns the training data, it simultaneously gets better at unseen examples. In grokking, the model first memorizes the training data (with no test improvement), then appears to stagnate for thousands of steps, and finally achieves sudden perfect generalization. The key difference is the delayed phase transition between memorization and understanding.

Why did the model discover trigonometry?

The model was not taught trigonometry or given any mathematical hints. It discovered trigonometric representations because circular functions are the natural solution to modular arithmetic. Modular arithmetic is inherently circular — numbers "wrap around" after reaching the modulus, just like hours on a clock. Sine and cosine functions are the mathematical tools that describe circular behavior. The model found the most efficient solution through gradient descent, and that solution happened to be trigonometry.
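This answer can be checked numerically. Below is a minimal numpy sketch (the frequency k = 5 is an arbitrary illustrative choice; trained models select a handful of key frequencies on their own) showing that pure cosines implement modular addition via the sum-of-angles identity.

```python
import numpy as np

P = 113                                   # prime modulus from the benchmark
k = 5                                     # illustrative "key frequency"
angles = 2 * np.pi * k * np.arange(P) / P

def mod_add_trig(a, b):
    """Compute (a + b) mod P using only cosines — the shape of the grokked circuit.

    cos(theta_a + theta_b - theta_c) equals 1 exactly when c ≡ a + b (mod P),
    by the sum-of-angles identity, so the argmax over c recovers the answer.
    """
    logits = np.cos(angles[a] + angles[b] - angles)   # one logit per candidate c
    return int(np.argmax(logits))

# Verify against ordinary modular arithmetic for every pair.
assert all(mod_add_trig(a, b) == (a + b) % P
           for a in range(P) for b in range(P))
print("trig circuit matches modular addition for all", P * P, "pairs")
```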

Can grokking happen in large language models?

Evidence suggests that grokking-like dynamics occur in larger models and more complex tasks. The sudden emergence of new capabilities in large language models — such as chain-of-thought reasoning or instruction following — may share underlying mechanisms with grokking. However, the clean three-phase pattern is most clearly demonstrated in small models on algorithmic tasks, where the full internal computation can be analyzed.

What is excluded loss and why does it matter?

Excluded loss is a diagnostic metric that removes specific frequency components from a model's logits before measuring performance. It was developed by Neel Nanda and collaborators to reveal hidden learning during Phase 2 of grokking. Standard loss metrics cannot detect the trigonometric solution being built, because the memorized solution masks it. Excluded loss strips out the components at the key frequencies of the emerging trigonometric circuit, so what remains measures pure memorization — and watching that remainder steadily degrade reveals the model shifting its reliance toward generalization even while standard metrics look stuck.
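A conceptual numpy sketch of the idea (illustrative only, not the original implementation): zero out the Fourier components of the logits at a set of key frequencies. A purely trigonometric solution at those frequencies then vanishes entirely, while a memorized lookup table would largely survive.

```python
import numpy as np

P = 113
key_freqs = [5]   # illustrative; real key frequencies are found empirically

def excluded_logits(logits, freqs, P=P):
    """Zero out the given Fourier components of the logits over the P classes."""
    spectrum = np.fft.fft(logits)
    for f in freqs:
        spectrum[f] = 0
        spectrum[(-f) % P] = 0      # remove the conjugate component too
    return np.fft.ifft(spectrum).real

# A purely trigonometric logit vector at frequency 5, peaked at c = 7,
# standing in for the grokked circuit's output:
c = np.arange(P)
trig_logits = np.cos(2 * np.pi * 5 * (7 - c) / P)

print(np.argmax(trig_logits))                                    # 7
print(np.allclose(excluded_logits(trig_logits, key_freqs), 0))   # True
```

Because the trig circuit lives entirely in a few frequency bins, excluding those bins erases it; anything left over must come from some other mechanism, which is what makes this a clean progress measure.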

What does grokking mean for AI training practices?

Grokking suggests that early stopping — halting training when metrics plateau — may cause us to miss significant generalization improvements. It also suggests that weight decay and regularization play important roles in encouraging the transition from memorization to generalization. Stronger regularization tends to speed up grokking, while weaker regularization delays it.

How does grokking connect to mechanistic interpretability?

Grokking is a cornerstone case study in mechanistic interpretability — the field of reverse-engineering neural networks to understand their internal computations. Because the grokked solution (trigonometric identities for modular arithmetic) is mathematically clean and fully understood, researchers can verify their interpretability techniques against a known ground truth. This makes grokking models invaluable testbeds for developing tools that might eventually explain frontier models.

How can I experiment with grokking myself?

You can reproduce grokking with relatively modest compute. Train a small transformer (1-2 layers, 128-dimensional embeddings) on modular addition with a prime modulus (P = 113 is the standard benchmark). Use about 70% of the complete dataset for training, apply weight decay regularization, and train for at least 10,000 steps. Monitor both training and test accuracy — you should see the characteristic flat period followed by sudden generalization.
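The recipe above can be sketched concretely. Here is a minimal numpy setup for the dataset and split (the hyperparameters in the docstring are the commonly reported ones, not guarantees); the resulting arrays can be fed to any small transformer implementation.

```python
import numpy as np

def modular_addition_dataset(P=113, train_frac=0.7, seed=0):
    """All P*P pairs (a, b) with label (a + b) mod P, split for a grokking run.

    Standard recipe from the grokking literature: feed (a, b) as two tokens
    to a 1-2 layer transformer with ~128-dim embeddings, train full-batch
    with AdamW (weight decay around 1.0, lr around 1e-3) for at least
    10,000 steps, and log train AND test accuracy throughout.
    """
    a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)   # shape (P*P, 2)
    labels = pairs.sum(axis=1) % P                     # shape (P*P,)

    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = idx[:n_train], idx[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])

(train_x, train_y), (test_x, test_y) = modular_addition_dataset()
print(train_x.shape, test_x.shape)   # (8938, 2) (3831, 2)
```

With this split, the characteristic signature to watch for is train accuracy hitting 100% within a few hundred steps while test accuracy stays near chance, then snaps upward thousands of steps later.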

🚀 Build AI That Understands Your Work

Grokking shows that neural networks can achieve deep understanding — not just memorization. Taskade AI agents bring that same principle to your workspace.

  • ✅ Custom AI agents with persistent memory and 22+ built-in tools
  • ✅ Knowledge training — agents learn from your docs, projects, and team data
  • ✅ Multi-model support — 11+ frontier models from OpenAI, Anthropic, and Google
  • ✅ Automated workflows with 100+ integrations
  • ✅ Genesis Apps — build live tools from prompts, deploy instantly

👉 Start building with Taskade AI agents →

💡 Before you go... Check out these related articles:

  1. What Are AI Agents? — How autonomous agents plan, reason, and act
  2. How Do LLMs Work? — Transformers, training, and inference explained
  3. What Is Mechanistic Interpretability? — Reverse-engineering neural networks
  4. What Is Generative AI? — The technology behind modern AI
  5. What Is OpenAI? — History of ChatGPT and GPT models
  6. What Is Anthropic? — History of Claude AI
  7. Agentic Workspaces — AI-powered workspace intelligence
  8. From Bronx Science to Taskade Genesis — Connecting the dots of AI history
  9. They Generate Code. We Generate Runtime — The Genesis Manifesto
  10. The BFF Experiment — From Noise to Life
  11. What Is Artificial Life? — How intelligence emerges from code
  12. What Is Intelligence? — From neurons to AI agents
  13. Explore Taskade Community — Templates, agents, and workflows


What Is Grokking in AI? Phase Transitions in Learning (2026) | Taskade Blog