LLMs writing Manim code works better than it has any right to. A few years ago you would have expected most prompts to fail outright. In May 2026 most well-formed single-concept prompts produce running code on the first try, and the gap to a usable video closes quickly with one or two retries. This post explains why the underlying generation is this good, where it still falls over, and what wrappers like Madio do to close the remaining gap.
It is the post to read if you are deciding between Grant Sanderson's original Manim, the community fork, or a hosted layer on top.
Why Manim is LLM-friendly
Three properties of Manim line up with what current LLMs do well.
A small declarative API
Manim's user-facing API is small. A typical scene uses 10 to 20 classes: Scene, Circle, Square, Line, MathTex, Text, Create, Write, Transform, FadeIn, FadeOut, MoveAlongPath, Rotate, ApplyMatrix, plus a handful of camera and 3D variants. That is a manageable surface area for any model.
Compare that to a graphics library like three.js, where you compose materials, geometries, lights, cameras, and a render loop. Manim hides the render loop. You declare what you want and the framework computes the frames. That is exactly the kind of declarative target an LLM can reason about because the cognitive load is structural, not procedural.
Abundant high-quality training data
Grant Sanderson's repo has been public on GitHub since 2017. It contains the source code for hundreds of 3Blue1Brown videos, each one a worked example of a math animation. The community fork has thousands of additional examples in its docs and on GitHub. Stack Overflow has Manim questions back to 2018. Discord and Reddit have years of #help channel logs.
This is a large, clean corpus of how to write Manim. Crucially, the examples are written by people who care about clarity, because the original purpose was teaching. The code style is idiomatic and consistent.
Pure-Python output
The model outputs Python code. Modern LLMs are at their best in Python. They are at their worst writing graphics shaders, GLSL, or pixel-level descriptions. Manim sits squarely in the LLM's strongest modality because the artifact is text, not pixels. The pixels are the renderer's job.
What current LLMs handle well
In practice, these scene types are the most reliable on attempt 1 across the major frontier models.
- Basic 2D shapes and graphs. Plotting a function, drawing a polygon, animating a line moving across a plane.
- Step-by-step derivations. Algebraic manipulation rendered with MathTex and Transform. The model knows the syntax, knows the LaTeX, and knows the standard layout.
- Geometric proofs with named objects. Pythagoras, similar triangles, angle bisector. As long as the entities are named in the prompt.
- Single-axis transformations. Rotation, reflection, scaling. The 2D matrix is in the model's training data verbatim.
- Coordinate systems and grids. Axes, NumberPlane, Coordinate are well-understood.
These succeed because the prompt has a one-to-one mapping to a scene that already exists somewhere in the training data, possibly in 3Blue1Brown's own repo. The model can pattern-match, not reason from scratch.
Where LLMs fail at Manim
Failures cluster. They are predictable, not random. Knowing the failure modes is half the work of getting good output.
Complex 3D
The 3D camera in Manim works but is tricky to set up. The model sometimes forgets to call set_camera_orientation, sometimes uses an axis order that flips up and down, sometimes places lights in the wrong spot. 3D scenes have a noticeably lower first-try pass rate than 2D, and they consume more retries when they do fail.
The fix in production is to retry with the error included. The fix in your prompt is to either go 2D or to give the camera angle explicitly: "Use a 3D scene with the camera at phi=70 degrees, theta=-45 degrees."
Custom mobjects with overridden methods
When a prompt requires a shape Manim does not have built in (a Mobius strip, a fractal, a custom logo), the model has to subclass VMobject and override generate_points. This succeeds inconsistently. The override is subtle and the model often produces points that render but look wrong.
The fix: use a built-in or a parametric function. Mobius strips can be done with Surface and a parametric function. Logos should be SVGs imported via SVGMobject.
Color palettes outside named colors
Manim has a set of named colors: BLUE, RED, GREEN, YELLOW, WHITE, BLACK, GRAY, plus shades like BLUE_E and RED_A. The model knows these. Asking for "a soft pastel pink" causes the model to invent a hex code, often a bad one. The fix: stick to the named palette or specify hex codes explicitly.
Race conditions in animations
When two animations overlap in time and one mutates an object the other reads, Manim sometimes renders a frame between them and produces a flicker. The model does not reason about this race condition because it is not visible in static code review. The fix: use AnimationGroup with lag_ratio=1 to serialize, or use Succession to run sequentially.
Many objects on screen
Beyond about 200 mobjects, the model starts to lose track of which object is which. Variable names collide, animations target the wrong object, or the layout becomes unreadable. Cap at 50 to 100 objects per scene for reliable output. If the visual genuinely requires more, generate the objects in a loop with deterministic names.
Animations requiring precise timing
"The dot reaches the curve at exactly 1.7 seconds while the equation finishes writing at 1.65 seconds." The model can write code with run_time parameters but cannot reliably hit sub-second timing across multiple tracks. The fix: do not micromanage timing, accept the defaults, or animate the tracks sequentially.
How Madio mitigates the failures
Three layers sit between the user prompt and the rendered MP4 to push the success rate up.
Prompt augmentation
The user prompt is wrapped in a system prompt that:
- Pins the Manim version to community v0.18.1 so the model uses the right API.
- Lists forbidden patterns: subprocess calls, network access, file IO outside the working directory.
- Caps scene length to the user's tier (30s on Free, up to 300s on Team).
- Injects a palette guard: "Use at most 4 named colors from the standard Manim palette."
- Adds a one-shot example for the prompt category if the prompt looks like a derivation, a graph plot, or a 3D scene.
The added context is a few hundred tokens but it improves first-try success measurably over an unaugmented prompt, especially on category-specific scenes.
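To show the shape of such a wrapper, here is a rough sketch. It is not Madio's actual prompt or code: the function names, the keyword lists, and the one-shot hints are all assumptions for illustration. Only the version pin, the forbidden patterns, the tier caps, and the palette rule come from the list above.

```python
MANIM_VERSION = "0.18.1"                      # version pin named in the post
TIER_MAX_SECONDS = {"free": 30, "team": 300}  # tier caps named in the post

# Hypothetical keyword lists and one-shot hints; a real system would likely
# use something more robust than substring matching.
CATEGORY_KEYWORDS = {
    "derivation": ("derive", "derivation", "solve", "simplify"),
    "graph": ("plot", "graph"),
    "3d": ("3d", "surface", "sphere"),
}
ONE_SHOT_HINTS = {
    "derivation": "Example: show each algebra step with MathTex and Transform.",
    "graph": "Example: build Axes, then draw the curve with axes.plot(...).",
    "3d": "Example: subclass ThreeDScene and call set_camera_orientation first.",
}


def guess_category(user_prompt):
    """Return the first category whose keywords appear in the prompt, else None."""
    text = user_prompt.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return None


def build_system_prompt(user_prompt, tier="free"):
    """Wrap the user's request in the guard rails described above."""
    rules = [
        f"Write code for Manim Community v{MANIM_VERSION}.",
        "Do not use subprocess calls, network access, or file IO "
        "outside the working directory.",
        f"Keep the scene under {TIER_MAX_SECONDS[tier]} seconds.",
        "Use at most 4 named colors from the standard Manim palette.",
    ]
    category = guess_category(user_prompt)
    if category is not None:
        rules.append(ONE_SHOT_HINTS[category])
    return "\n".join(rules) + f"\n\nUser request: {user_prompt}"
```

Calling build_system_prompt("Plot x^2 from -2 to 2") would produce the four rules, the graph hint, and then the user's request.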
Syntax check before render
The generated code goes through ast.parse to catch Python syntax errors fast, and a regex sweep flags forbidden imports. If either fires, the prompt loops back to the model with the parse error included before any container boots. This catches roughly 1 percent of generations and saves the Docker boot cost on those.
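A pre-render check along these lines fits in a few lines of standard-library Python. The sketch below is an illustration, not Madio's implementation: the function name precheck and the specific deny-list of modules are assumptions.

```python
import ast
import re

# Assumed deny-list; the real set of forbidden modules is not given in this post.
FORBIDDEN_IMPORT = re.compile(
    r"^[ \t]*(?:import|from)[ \t]+(subprocess|socket|requests|urllib|http)\b",
    re.MULTILINE,
)


def precheck(code):
    """Return a list of problems; an empty list means the code can go to render."""
    problems = []
    try:
        ast.parse(code)  # parses only; nothing in the code is executed
    except SyntaxError as exc:
        problems.append(f"SyntaxError on line {exc.lineno}: {exc.msg}")
    for match in FORBIDDEN_IMPORT.finditer(code):
        problems.append(f"Forbidden import: {match.group(1)}")
    return problems
```

Because ast.parse never runs the code, the check is safe to do outside the sandbox and costs a tiny fraction of a container boot. A regex sweep is a coarse filter, so the sandbox still has to do the real enforcement.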
The retry loop
When the render fails, the pipeline catches stderr, parses the Python traceback, and sends it back to Gemini with a fix instruction. Madio retries up to 3 times. The shape of what we see in production: most prompts succeed on attempt 1, a meaningful chunk succeed on attempt 2, a smaller chunk on attempt 3, and a small residual fail all three (in which case the credit is refunded).
The retry loop is doing real work. The user-visible success rate is materially higher with it than the underlying first-try rate alone, especially for harder multi-step scenes.
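The control flow of such a loop is simple enough to sketch. This is a generic illustration under stated assumptions, not Madio's code: generate and render are hypothetical callables standing in for the model call and the sandboxed render, and RenderResult is an invented container for the render outcome.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # the post describes up to three attempts per prompt


@dataclass
class RenderResult:
    ok: bool
    stderr: str = ""


def last_traceback_line(stderr):
    """Return the final non-empty stderr line, usually 'ErrorType: message'."""
    lines = [line.strip() for line in stderr.splitlines() if line.strip()]
    return lines[-1] if lines else "unknown error"


def generate_with_retries(prompt, generate, render, max_attempts=MAX_ATTEMPTS):
    """Return (code, attempt) on success, or (None, max_attempts) if all fail."""
    request = prompt
    for attempt in range(1, max_attempts + 1):
        code = generate(request)
        result = render(code)
        if result.ok:
            return code, attempt
        # Feed the error back so the next generation targets the actual failure.
        request = (
            f"{prompt}\n\nThe previous code failed with:\n"
            f"{last_traceback_line(result.stderr)}\n"
            f"Previous code:\n{code}\n"
            "Fix the error and return the full scene."
        )
    return None, max_attempts
```

A (None, 3) result is the case where the credit would be refunded. If you self-host, this is the part you would build yourself, plus the sandbox around render.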
The future: multimodal models reading examples
The next leap is models that can ingest a reference image and produce matching code. Gemini 3 and GPT-5 already accept images. The reasoning chain is roughly: look at the reference, identify the shapes and colors, infer the layout, write the Manim code that reproduces it.
In informal testing, this works at the figure level (right shapes, right colors, roughly right positions) most of the time. The match at the layout level (exact spacing, exact font size, exact alignment) is much less reliable. We expect the next generation of models to close most of that gap.
The interesting consequence is that the prompt becomes a screenshot. A teacher photographs a textbook diagram, drops it into Madio, and gets an animated version of the same figure. We are testing this now and it works better than it should.
What this means for choosing a tool
If you are picking between a raw LLM (paste the Manim docs into ChatGPT and ask), the community Manim with self-prompting, and a hosted tool like Madio, the tradeoff is success rate against cost and setup effort.
- Raw LLM, no wrapper. Free if you already have a chat subscription. Decent first-try success on simple scenes; harder prompts require manual debugging and a local Manim install.
- Self-hosted Manim with your own prompt loop. You can build the retry loop yourself. Most people do not, because the engineering effort is non-trivial.
- Hosted (Madio and similar). The retry loop, sandbox, and cache are pre-built. You pay per credit. The vast majority of prompts succeed without you seeing any errors.
This is not a sales pitch. If you are a researcher or a hobbyist with time, self-hosting Manim and a retry loop is great. If you are a teacher with a class tomorrow, a hosted tool is the right call.
Where to go next
If you want a deeper view of the system that wraps the LLM, the prompt-to-MP4 pipeline post walks through every stage. If you want to see how to write prompts that play to the LLM's strengths, the 12 patterns post is the practical guide. The text-to-Manim AI tools survey compares the major hosted players head to head.
Madio's pricing covers Free through Team. The editor is the way to test the prompts in this article. The gallery shows the rendered output across topic areas. The templates library has copy-paste-ready prompts for the patterns that succeed most often.