Which LLM writes the best Manim code?

There is no clear winner in 2026. Gemini 3 Pro Thinking, the latest GPT, and Claude Sonnet 4.5 all produce usable Manim for typical 2D scenes, with similar pass rates and similar failure patterns. The differences are smaller than they were a year ago. Pick by ecosystem and pricing rather than by Manim quality.

Why is Manim easier for LLMs than other animation libraries?

Three reasons. The API is small and declarative. The training data includes Grant Sanderson's open-source repo and the community fork, which together contain hundreds of well-written examples. And the output is deterministic Python code, not pixels, so the model is operating in its strongest modality.

What kinds of animations still fail consistently?

Complex 3D, custom mobjects with overridden render methods, animations that require precise timing across multiple tracks, color palettes outside the named-color set, and any scene with more than about 200 mobjects. These fail not because the model cannot reason about them but because the prompt-to-code mapping has too many degrees of freedom.

Will multimodal models read example images and produce matching animations?

They already do, partially. Gemini 3 and GPT-5 can ingest a reference frame and produce code that approximates it. The match is rough at the figure level (right shapes, right colors) and unreliable at the layout level. We expect the next generation to close most of that gap.

If LLMs are this good, why does Madio need a retry loop?

Because first-try success is well short of 100 percent. A meaningful fraction of generations need a fix. Most of the problems are small (a renamed class or a misordered argument), but the code will not run as-is. The retry loop catches the error, sends it back to the model with a fix instruction, and ships a working video on attempt 2 or 3 the large majority of the time.

Why LLMs are good at writing Manim (and where they fail)

May 10, 2026 · Sanatan Sharma

LLMs writing Manim code works better than it has any right to. A few years ago you would have expected most prompts to fail outright. In May 2026 most well-formed single-concept prompts produce running code on the first try, and the gap to a usable video closes quickly with one or two retries. This post explains why the underlying generation is this good, where it still falls over, and what wrappers like Madio do to close the remaining gap.

It is the post to read if you are deciding whether to use raw Grant Sanderson Manim, the community fork, or a hosted layer on top.

Why Manim is LLM-friendly

Three properties of Manim line up with what current LLMs do well.

A small declarative API

Manim's user-facing API is small. A typical scene uses 10 to 20 classes: Scene, Circle, Square, Line, MathTex, Text, Create, Write, Transform, FadeIn, FadeOut, MoveAlongPath, Rotate, ApplyMatrix, plus a handful of camera and 3D variants. That is a manageable surface area for any model.

Compare that to a graphics library like three.js, where you compose materials, geometries, lights, cameras, and a render loop. Manim hides the render loop. You declare what you want and the framework computes the frames. That is exactly the kind of declarative target an LLM can reason about, because the difficulty is structural, not procedural.

Abundant high-quality training data

Grant Sanderson's repo has been public on GitHub since 2017. It contains the source code for hundreds of 3Blue1Brown videos, each one a worked example of a math animation. The community fork has thousands of additional examples in its docs and on GitHub. Stack Overflow has Manim questions back to 2018. Discord and Reddit have years of #help channel logs.

This is a large, clean corpus of how to write Manim. Crucially, the examples are written by people who care about clarity, because the original purpose was teaching. The code style is idiomatic and consistent.

Pure-Python output

The model outputs Python code. Modern LLMs are at their best in Python. They are at their worst writing graphics shaders, GLSL, or pixel-level descriptions. Manim sits squarely in the LLM's strongest modality because the artifact is text, not pixels. The pixels are the renderer's job.

What current LLMs handle well

In practice, these scene types are the most reliable on attempt 1 across the major frontier models.

Basic 2D shapes and graphs. Plotting a function, drawing a polygon, animating a line moving across a plane.
Step-by-step derivations. Algebraic manipulation rendered with MathTex and Transform. The model knows the syntax, knows the LaTeX, and knows the standard layout.
Geometric proofs with named objects. Pythagoras, similar triangles, angle bisector. As long as the entities are named in the prompt.
Single-axis transformations. Rotation, reflection, scaling. The 2D matrix is in the model's training data verbatim.
Coordinate systems and grids. Axes, NumberPlane, and the shared CoordinateSystem methods are well understood.

These succeed because the prompt has a one-to-one mapping to a scene that already exists somewhere in the training data, possibly in 3Blue1Brown's own repo. The model can pattern-match, not reason from scratch.
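As an illustration of how standard this material is, the 2D rotation matrix mentioned above fits in a few lines. This is a plain-Python sketch with hypothetical helper names (rotation_matrix, apply_matrix), not code from any Manim repo; a scene would hand a matrix like this to ApplyMatrix.

```python
import math


def rotation_matrix(angle_rad):
    """Return the 2x2 counterclockwise rotation matrix as nested lists."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s], [s, c]]


def apply_matrix(matrix, point):
    """Multiply a 2x2 matrix by a 2D point given as (x, y)."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y,
        matrix[1][0] * x + matrix[1][1] * y,
    )


# Rotating the unit x vector by a quarter turn lands on the unit y vector
# (up to floating-point error).
quarter_turn = apply_matrix(rotation_matrix(math.pi / 2), (1.0, 0.0))
```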

Where LLMs fail at Manim

Failures cluster. They are predictable, not random, and knowing the failure modes is half the work of getting good output.

Complex 3D

The 3D camera in Manim works but is tricky to set up. The model sometimes forgets to call set_camera_orientation, sometimes uses an axis order that flips up and down, sometimes places lights in the wrong spot. 3D scenes have a noticeably lower first-try pass rate than 2D, and they consume more retries when they do fail.

The fix in production is to retry with the error included. The fix in your prompt is to either go 2D or to give the camera angle explicitly: "Use a 3D scene with the camera at phi=70 degrees, theta=-45 degrees."

Custom mobjects with overridden methods

When a prompt requires a shape Manim does not have built in (a Mobius strip, a fractal, a custom logo), the model has to subclass VMobject and override generate_points. This succeeds inconsistently. The override is subtle and the model often produces points that render but look wrong.

The fix: use a built-in or a parametric function. Mobius strips can be done with Surface and a parametric function. Logos should be SVGs imported via SVGMobject.
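For the Mobius strip case, the parametric function is the whole trick. Below is a sketch of one common parametrization in plain Python; the function name mobius_point and the radius default are assumptions made for illustration.

```python
import math


def mobius_point(u, v, radius=1.0):
    """Map (u, v) to a point on a Mobius strip.

    u in [0, 2*pi] runs around the loop; v in [-1, 1] runs across the band.
    """
    half_width = (v / 2) * math.cos(u / 2)
    x = (radius + half_width) * math.cos(u)
    y = (radius + half_width) * math.sin(u)
    z = (v / 2) * math.sin(u / 2)
    return (x, y, z)


# In a ThreeDScene this would be wrapped for Surface, for example:
#   Surface(lambda u, v: np.array(mobius_point(u, v)),
#           u_range=[0, TAU], v_range=[-1, 1])
```

Because the geometry lives in one pure function, it can be checked without rendering anything, which is much harder to do with an overridden generate_points.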

Color palettes outside named colors

Manim has a set of named colors: BLUE, RED, GREEN, YELLOW, WHITE, BLACK, GRAY, plus shades like BLUE_E and RED_A. The model knows these. Asking for "a soft pastel pink" causes the model to invent a hex code, often a bad one. The fix: stick to the named palette or specify hex codes explicitly.

Race conditions in animations

When two animations overlap in time and one mutates an object the other reads, Manim sometimes renders a frame between them and produces a flicker. The model does not reason about this race condition because it is not visible in static code review. The fix: use AnimationGroup with lag_ratio to serialize, or use Succession to run sequentially.

Many objects on screen

Beyond about 200 mobjects, the model starts to lose track of which object is which. Variable names collide, animations target the wrong object, or the layout becomes unreadable. Cap at 50 to 100 objects per scene for reliable output. If the visual genuinely requires more, generate the objects in a loop with deterministic names.
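One way to get deterministic names is to derive each name from the object's grid position. The following is a plain-Python sketch; the helper grid_positions and the "dot_r{row}_c{col}" naming scheme are assumptions, not a Manim convention.

```python
def grid_positions(rows, cols, spacing=0.5):
    """Return {name: (x, y, z)} for a grid centered on the origin.

    Names encode row and column, so no two objects can collide and any
    later animation can address an object by where it sits.
    """
    positions = {}
    for r in range(rows):
        for c in range(cols):
            name = f"dot_r{r}_c{c}"
            x = (c - (cols - 1) / 2) * spacing
            y = ((rows - 1) / 2 - r) * spacing
            positions[name] = (x, y, 0.0)
    return positions


# In a scene, each entry would become one mobject, for example:
#   dots = {name: Dot(point=np.array(pos)) for name, pos in positions.items()}
positions = grid_positions(rows=2, cols=3)
```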

Animations requiring precise timing

"The dot reaches the curve at exactly 1.7 seconds while the equation finishes writing at 1.65 seconds." The model can write code with run_time parameters but cannot reliably hit sub-second timing across multiple tracks. The fix: do not micromanage timing, accept the defaults, or animate the tracks sequentially.

How Madio mitigates the failures

Three layers sit between the user prompt and the rendered MP4 to push the success rate up.

Prompt augmentation

The user prompt is wrapped in a system prompt that:

Pins the Manim version to community v0.18.1 so the model uses the right API.
Lists forbidden patterns: subprocess calls, network access, file IO outside the working directory.
Caps scene length to the user's tier (30s on Free, up to 300s on Team).
Injects a palette guard: "Use at most 4 named colors from the standard Manim palette."
Adds a one-shot example for the prompt category if the prompt looks like a derivation, a graph plot, or a 3D scene.

The added context is a few hundred tokens but it improves first-try success measurably over an unaugmented prompt, especially on category-specific scenes.
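The wrapper can be pictured as a small template function. This is a sketch of the idea, not Madio's actual prompt: the function name, the wording of each line, and the placeholder examples are all hypothetical. Only the version pin, the two tier limits, the forbidden patterns, and the palette guard come from the list above.

```python
MANIM_VERSION = "0.18.1"  # version pin named in this post

# Only the two limits stated above; other tiers are left out on purpose.
TIER_MAX_SECONDS = {"free": 30, "team": 300}

# Hypothetical placeholders; real one-shot examples would be full scenes.
ONE_SHOT_EXAMPLES = {
    "derivation": "# example derivation scene goes here",
    "graph": "# example graph scene goes here",
    "3d": "# example 3D scene goes here",
}


def build_system_prompt(tier, category=None):
    """Assemble the system prompt that wraps a user prompt."""
    lines = [
        f"Write code for Manim Community v{MANIM_VERSION}.",
        "Do not use subprocess calls, network access, "
        "or file IO outside the working directory.",
        f"Keep the scene under {TIER_MAX_SECONDS[tier]} seconds.",
        "Use at most 4 named colors from the standard Manim palette.",
    ]
    # Add a one-shot example only when the prompt category is recognized.
    example = ONE_SHOT_EXAMPLES.get(category)
    if example is not None:
        lines.append("Example for this kind of scene:\n" + example)
    return "\n".join(lines)
```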

Syntax check before render

The generated code goes through ast.parse to catch Python syntax errors fast, and a regex sweep flags forbidden imports. If either check fires, the prompt loops back to the model with the error included before any container boots. This catches roughly 1 percent of generations and saves the Docker boot cost on those.
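A pre-render check of this kind needs only the standard library. The sketch below assumes a hypothetical function name (precheck) and an illustrative list of forbidden modules; the real list is not given in this post.

```python
import ast
import re

# Illustrative module list (an assumption, not the production list).
FORBIDDEN_IMPORT = re.compile(
    r"^\s*(?:import|from)\s+(subprocess|socket|requests|urllib)\b",
    re.MULTILINE,
)


def precheck(code):
    """Return an error message to send back to the model, or None if OK."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"SyntaxError at line {exc.lineno}: {exc.msg}"

    match = FORBIDDEN_IMPORT.search(code)
    if match:
        return f"Forbidden import: {match.group(1)}"
    return None
```

A regex sweep like this is a cheap first filter, not a security boundary. It will miss dynamic imports, which is one reason the render still runs in a sandbox.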

The retry loop

When the render fails, the pipeline catches stderr, parses the Python traceback, and sends it back to Gemini with a fix instruction. Madio allows up to 3 attempts per prompt. The shape of what we see in production: most prompts succeed on attempt 1, a meaningful chunk succeed on attempt 2, a smaller chunk on attempt 3, and a small residual fail all three (in which case the credit is refunded).

The retry loop is doing real work. The user-visible success rate is materially higher with it than the underlying first-try rate alone, especially for harder multi-step scenes.
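The control flow of such a loop is short. The sketch below is a simplified stand-in for the production pipeline: generate and render are hypothetical callables supplied by the caller (the model call and the sandboxed render), and the traceback "parsing" is reduced to taking the last non-empty line of stderr.

```python
def summarize_traceback(stderr):
    """Return the last non-empty line of stderr, usually the exception line."""
    lines = [line.strip() for line in stderr.splitlines() if line.strip()]
    return lines[-1] if lines else "unknown error"


def generate_with_retries(prompt, generate, render, max_attempts=3):
    """Try up to max_attempts times, feeding each error into the next attempt.

    generate(prompt, error) -> code string (error is None on attempt 1).
    render(code) -> (ok, stderr).
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt, error)
        ok, stderr = render(code)
        if ok:
            return {"ok": True, "attempts": attempt, "code": code}
        error = summarize_traceback(stderr)
    # All attempts failed; the caller can refund the credit at this point.
    return {"ok": False, "attempts": max_attempts, "error": error}
```

Passing generate and render in as arguments keeps the loop testable with fakes, without a model or a container.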

The future: multimodal models reading examples

The next leap is models that can ingest a reference image and produce matching code. Gemini 3 and GPT-5 already accept images. The reasoning chain is roughly: look at the reference, identify the shapes and colors, infer the layout, write the Manim code that reproduces it.

In informal testing, this works at the figure level (right shapes, right colors, roughly right positions) most of the time. The match at the layout level (exact spacing, exact font size, exact alignment) is much less reliable. We expect the next generation of models to close most of that gap.

The interesting consequence is that the prompt becomes a screenshot. A teacher photographs a textbook diagram, drops it into Madio, and gets an animated version of the same figure. We are testing this now and it works better than it should.

What this means for choosing a tool

If you are picking between a raw LLM (paste the Manim docs into ChatGPT and ask), the community Manim with self-prompting, and a hosted tool like Madio, the main tradeoff is success rate against cost and setup effort.

Raw LLM, no wrapper. Free if you already have a chat subscription. Decent first-try success on simple scenes; harder prompts require manual debugging and a local Manim install.
Self-hosted Manim with your own prompt loop. You can build the retry loop yourself. Most people do not, because the engineering effort is non-trivial.
Hosted (Madio and similar). The retry loop, sandbox, and cache are pre-built. You pay per credit. The vast majority of prompts succeed without you seeing any errors.

This is not a sales pitch. If you are a researcher or a hobbyist with time, self-hosting Manim and a retry loop is great. If you are a teacher with a class tomorrow, a hosted tool is the right call.

Where to go next

If you want a deeper view of the system that wraps the LLM, the prompt-to-MP4 pipeline post walks through every stage. If you want to see how to write prompts that play to the LLM's strengths, the 12 patterns post is the practical guide. The text-to-Manim AI tools survey compares the major hosted players head to head.

Madio's pricing covers Free through Team. The editor is the way to test the prompts in this article. The gallery shows the rendered output across topic areas. The templates library has copy-paste-ready prompts for the patterns that succeed most often.