A text-to-MP4 AI animation pipeline looks like one button to the user. Behind the button there are six stages, three failure modes, and a queue. This post walks through every stage in order, explains what determines latency, and shows where the system gives up and refunds.
The same architecture applies to any pipeline that turns prompts into Manim videos, with small variations. The numbers in this post are from Madio's production logs as of May 2026.
The user-side flow
From the user's seat the flow is short.
- Open the editor and write a prompt. Optionally pick a template.
- Click Generate. A progress bar appears with stages: thinking, rendering, finalizing.
- Between 30 seconds and 4 minutes later, the MP4 plays inline. The video is downloadable, the prompt is editable, and the source code is visible (Pro and Team only).
That is the entire user surface. Everything below this section happens in the backend, hidden behind that progress bar.
Stage 1: prompt augmentation
The raw user prompt is rarely shippable to the LLM as-is. Madio's content layer prepends a system prompt, a few-shot example, and a list of safety rules. The augmentation does five things:
- Strips known bad patterns (asking for an external image URL, requesting copyrighted music, mentioning specific 3Blue1Brown videos).
- Injects a palette guard so the model picks four named colors maximum.
- Caps the scene duration based on the user's plan tier (30s, 60s, 180s, or 300s).
- Adds the active Manim version (v0.18.1) so the model writes code against the correct API surface.
- Picks the model: Gemini 3 Flash for short prompts, Gemini 3 Pro Thinking for prompts flagged as derivations or proofs.
The augmented prompt is roughly 1500 tokens of system context plus the user's 100-token prompt.
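The five augmentation steps above can be sketched as one function. This is a minimal illustration, not Madio's code: the pattern list, the plan-to-cap mapping (the post gives the four caps but not which tier gets which), the keyword routing, and the model labels `flash` and `pro-thinking` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Everything below is hypothetical: names, patterns, and the tier mapping.
DURATION_CAP_S = {"free": 30, "starter": 60, "pro": 180, "team": 300}
BAD_PATTERNS = [
    # External image URLs.
    re.compile(r"https?://\S+\.(?:png|jpe?g|gif|webp)\S*", re.IGNORECASE),
    # Requests for copyrighted music.
    re.compile(r"\bcopyrighted music\b", re.IGNORECASE),
]
COMPLEX_HINTS = ("derive", "derivation", "proof", "prove")
MANIM_VERSION = "v0.18.1"


@dataclass
class AugmentedPrompt:
    model: str
    system: str
    user: str


def augment(raw_prompt: str, plan: str) -> AugmentedPrompt:
    # 1. Strip known bad patterns from the user's text.
    cleaned = raw_prompt
    for pattern in BAD_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    cleaned = " ".join(cleaned.split())

    # 2-4. Palette guard, duration cap, and Manim version go into the system context.
    cap = DURATION_CAP_S[plan]
    system = "\n".join([
        "Return only Python code for Manim scenes. No markdown fences.",
        "Use at most four named colors.",
        f"The scene must run no longer than {cap} seconds.",
        f"Target Manim Community {MANIM_VERSION}.",
    ])

    # 5. Route derivations and proofs to the slower thinking model.
    lowered = cleaned.lower()
    is_complex = any(hint in lowered for hint in COMPLEX_HINTS)
    model = "pro-thinking" if is_complex else "flash"
    return AugmentedPrompt(model=model, system=system, user=cleaned)
```

Keyword matching is a crude complexity flag (it would also fire on "improve"); a production classifier would presumably be more careful.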
Stage 2: LLM code generation
The augmented prompt goes to Google Gemini via the official SDK. The response is a single Python file containing one or more Manim Scene subclasses.
We use two models.
- Gemini 3 Flash for the default path. Median latency 1.4s, p95 latency 3.2s. Costs about $0.0003 per generation at our prompt sizes.
- Gemini 3 Pro Thinking for derivations, proofs, and any prompt flagged for complexity. Median latency 6s, p95 18s. Costs about $0.005 per generation.
The model is asked to return only the code, no markdown fences, no explanation. We strip fences anyway in case it ignores that instruction. Roughly 4 percent of generations need a fence strip.
Before the code goes near Docker, a fast static check runs. Python's ast.parse rejects roughly 1 percent of generations as syntactically invalid. A regex sweep flags forbidden imports and calls (subprocess, os.system, urllib) in fewer than 0.1 percent. If either check fails, generation is rerun with the error included as feedback.
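A minimal sketch of the fence strip and the two pre-render checks, assuming the model response arrives as a plain string. The function names and the exact regexes are illustrative; only ast.parse and the three forbidden names come from the description above.

```python
import ast
import re
from typing import Optional

FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable
FENCE_RE = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\n(.*?)\n" + FENCE + r"\s*$", re.DOTALL
)
FORBIDDEN_RE = re.compile(r"\b(subprocess|os\.system|urllib)\b")


def strip_fences(text: str) -> str:
    """Remove a single wrapping markdown fence if the model added one."""
    match = FENCE_RE.match(text.strip())
    return match.group(1) if match else text


def validate(code: str) -> Optional[str]:
    """Return None if the code passes, else an error string to feed back to the model."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"SyntaxError on line {exc.lineno}: {exc.msg}"
    hit = FORBIDDEN_RE.search(code)
    if hit:
        return f"Forbidden reference: {hit.group(1)}"
    return None
```

A regex sweep is cheap but can be evaded by indirect imports, which is one reason the render still runs in a container without network access.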
Stage 3: code cache lookup
Before booting a Docker container, the pipeline hashes the generated code and looks it up in a two-tier cache.
- L1 is a per-process Python dict, populated on hit, evicted with an LRU policy. Sub-millisecond lookup.
- L2 is Redis, shared across all worker processes. Roughly 4ms lookup including network.
If either tier returns a hit, the pipeline skips straight to stage 6 (S3 upload of the cached MP4), bypassing the render and post-processing and shaving up to 90 seconds off the request. Cache hit rate is around 12 percent overall and rises to 40 percent for prompts that come from a template.
A code cache, not a prompt cache, is the right design. Two different prompts can produce the same code (or near-identical code after whitespace normalization). The hash is on the deterministic post-processed source.
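The two-tier lookup might look like the sketch below. The class and method names are hypothetical, the normalization is a deliberately light guess, and the L2 store is an in-memory stand-in: any client exposing get(key) and set(key, value), such as a Redis client, could be passed in its place.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


def cache_key(code: str) -> str:
    # Light normalization (an assumption): drop trailing whitespace and blank lines.
    lines = [line.rstrip() for line in code.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TwoTierCache:
    def __init__(self, l2, l1_capacity: int = 256):
        self._l1 = OrderedDict()   # per-process tier, kept in LRU order
        self._l2 = l2              # shared tier with get(key) / set(key, value)
        self._capacity = l1_capacity

    def _remember(self, key: str, value: str) -> None:
        self._l1[key] = value
        self._l1.move_to_end(key)
        if len(self._l1) > self._capacity:
            self._l1.popitem(last=False)  # evict the least recently used entry

    def get(self, code: str) -> Optional[str]:
        key = cache_key(code)
        if key in self._l1:
            self._l1.move_to_end(key)
            return self._l1[key]
        value = self._l2.get(key)
        if value is not None:
            self._remember(key, value)  # L1 is populated on hit
        return value

    def put(self, code: str, mp4_location: str) -> None:
        self._l2.set(cache_key(code), mp4_location)


class FakeL2:
    """In-memory stand-in for the shared store, for local experiments only."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
```

Because the key is derived from the generated source rather than the prompt, two differently worded prompts that yield the same file share one cache entry.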
Stage 4: sandboxed Manim render
This is the longest stage. The generated .py file is mounted read-only into a Docker container running Python 3.11, Manim community v0.18.1, ffmpeg 6.1, and the standard math fonts. The container has no network access and a CPU and memory cap.
The container runs:
manim -ql --output_file out.mp4 generated_scene.py
-ql is low quality, 480p, 15 fps. Why low quality? Because the bottleneck is not pixels. It is the symbolic computation that Manim does to lay out the scene. Rendering at 480p first lets us catch errors fast and re-render at higher quality only on the final pass.
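One way to express the sandbox constraints is as a docker run argument list. The flags used here (--rm, --network none, --cpus, --memory, -v with :ro, -w) are standard Docker CLI options, but the image name, mount paths, and limit values are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def docker_render_argv(
    scene_path: Path,
    out_dir: Path,
    image: str = "manim-sandbox:0.18.1",  # hypothetical image name
    cpus: str = "2",                      # assumed CPU cap
    memory: str = "2g",                   # assumed memory cap
) -> List[str]:
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no network access inside the sandbox
        "--cpus", cpus,
        "--memory", memory,
        # Generated code is mounted read-only; only the media directory is writable.
        "-v", f"{scene_path.resolve()}:/work/generated_scene.py:ro",
        "-v", f"{out_dir.resolve()}:/work/media",
        "-w", "/work",
        image,
        "manim", "-ql", "--output_file", "out.mp4", "generated_scene.py",
    ]
```

Passing the result to subprocess.run with a timeout would give the caller both the exit code and the captured stderr that the next step relies on.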
Render times observed in production:
- 2D scenes, under 30 seconds: median 8s, p95 25s.
- 2D scenes, 30 to 60 seconds: median 18s, p95 50s.
- 3D scenes, any duration: median 35s, p95 110s.
- Scenes with MathTex rendering many equations: median 40s, p95 130s. LaTeX is slow.
If the render exits non-zero, the pipeline catches stderr, parses the Python traceback, and triggers stage 4b.
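Extracting the failing line from stderr could be done roughly as follows. The sketch assumes stderr carries a standard CPython traceback; if Manim is configured to print rich-formatted tracebacks, the frame lines look different and this parser would need adjusting. The names CrashInfo and parse_crash are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches frame lines of the form:  File "/path/file.py", line 12, in construct
FRAME_RE = re.compile(r'^\s*File "(?P<file>[^"]+)", line (?P<line>\d+)', re.MULTILINE)


class CrashInfo(NamedTuple):
    line: Optional[int]  # line number inside the generated file, if found
    error: str           # final "ExceptionType: message" line


def parse_crash(stderr: str, filename: str = "generated_scene.py") -> CrashInfo:
    # Keep the last frame that points into the generated file; frames inside
    # Manim itself are less useful to the model when it writes a fix.
    line_no = None
    for match in FRAME_RE.finditer(stderr):
        if match.group("file").endswith(filename):
            line_no = int(match.group("line"))
    # The final non-empty line of a standard traceback names the exception.
    non_empty = [ln.strip() for ln in stderr.splitlines() if ln.strip()]
    error = non_empty[-1] if non_empty else ""
    return CrashInfo(line=line_no, error=error)
```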
Stage 4b: the retry loop
Manim crashes are common in LLM-generated code. The most frequent causes:
- Calling a class that does not exist (ParametricFunction was renamed to ParametricCurve in some versions).
- Animating a mobject that has not been added to the scene.
- Using Text with a font the container does not have.
- Off-by-one in a for loop that produces a list of mobjects.
When a crash happens, the pipeline sends the error message, the failing line, and the original prompt back to Gemini with a fix instruction. The model re-emits a corrected file. Madio makes up to 3 attempts in total. The breakdown by attempt:
- 76 percent of generations succeed on attempt 1.
- 18 percent succeed on attempt 2.
- 4 percent succeed on attempt 3.
- 2 percent fail all 3 attempts. The credit is refunded automatically.
The retry loop adds 5 to 60 seconds depending on which attempt succeeds. Most of the time it is invisible to the user because the progress bar still says "rendering".
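The control flow of the retry loop is small enough to sketch. The generate and render callables are placeholders for the Gemini call and the sandboxed render; their signatures are assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 3


def generate_with_retries(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    render: Callable[[str], Tuple[bool, str]],
) -> Tuple[Optional[str], int]:
    """Return (code, attempts_used); code is None if every attempt failed."""
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # On attempts after the first, the model sees the previous error.
        code = generate(prompt, feedback)
        ok, stderr = render(code)
        if ok:
            return code, attempt
        feedback = stderr
    return None, MAX_ATTEMPTS
```

A None result is the point at which the caller would refund the credit.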
Stage 5: ffmpeg post-processing
Manim's raw output is an MP4, but it is not the MP4 the user gets. ffmpeg runs a sequence of steps:
- Re-encode to H.264 high profile, level 4.0, for broad device support.
- Set GOP to 30 frames so scrubbing is responsive.
- Normalize loudness if narration is enabled, target -16 LUFS for social platforms.
- Add a watermark for free-tier users, transparent PNG in the bottom-right corner at 30 percent opacity.
- Re-encode to 720p (Free), 1080p (Starter and Pro), or 4K (Team).
The 4K re-encode is the slowest, around 8 to 15 seconds. Most users on Free, Starter, and Pro see ffmpeg add 2 to 4 seconds.
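The encode settings above map onto ffmpeg options fairly directly. The sketch below builds the argument list only; the 16-pixel watermark margin, the plan-to-height table keys, and the yuv420p pixel format are assumptions, and loudness normalization is left out. One caveat that is not from the post: H.264 level 4.0 does not cover 2160p frame sizes, so the sketch assumes level 5.1 for the Team tier.

```python
from typing import List, Optional

# Output height per plan (720p, 1080p, 1080p, 4K).
HEIGHTS = {"free": 720, "starter": 1080, "pro": 1080, "team": 2160}


def ffmpeg_argv(
    src: str, dst: str, plan: str, watermark_png: Optional[str] = None
) -> List[str]:
    height = HEIGHTS[plan]
    # Assumption: raise the level for 4K, since level 4.0 tops out below 2160p.
    level = "4.0" if height <= 1080 else "5.1"
    argv = ["ffmpeg", "-y", "-i", src]
    if watermark_png:
        argv += ["-i", watermark_png]
        graph = (
            f"[0:v]scale=-2:{height}[base];"
            # 30 percent opacity for the watermark's alpha channel.
            "[1:v]format=rgba,colorchannelmixer=aa=0.3[wm];"
            # Bottom-right corner with a 16 px margin.
            "[base][wm]overlay=W-w-16:H-h-16[out]"
        )
        # "0:a?" keeps the audio track if one exists and is ignored otherwise.
        argv += ["-filter_complex", graph, "-map", "[out]", "-map", "0:a?"]
    else:
        argv += ["-vf", f"scale=-2:{height}"]
    argv += [
        "-c:v", "libx264", "-profile:v", "high", "-level", level,
        "-g", "30",              # GOP of 30 frames
        "-pix_fmt", "yuv420p",
        dst,
    ]
    return argv
```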
Stage 5b: optional AI narration
Pro and Team plans get AI narration. The pipeline calls a separate prompt to Gemini to generate a 30-to-180-word narration script timed to the video, then sends the script to edge-tts (Microsoft Edge's neural voices, 7 voices selectable) for synthesis.
The audio is mixed under the visual track at -8 dB and the master is normalized again. Narration adds 5 to 10 seconds total. The Pro plan ships this on by default and the user can mute it; the Team plan makes it scriptable via the API.
Free and Starter do not get narration. We considered a watermarked narration for Free, but at $0 revenue every narrated render would run at a loss.
Stage 6: S3 upload
The final MP4 goes to an S3 bucket in us-east-1. The upload is multipart, 8 MiB chunks, with content-disposition set to inline so the video plays in the browser. Median upload time is 800ms for a 30-second 720p clip and 4 seconds for a 5-minute 4K clip.
The S3 URL is signed with a 7-day expiry. Pro and Team plans get long-term storage in their dashboard; Free and Starter videos expire after 7 days.
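The upload parameters reduce to a little arithmetic. The helper below is a hypothetical planning function, not an S3 call: with boto3, the chunk size would typically go into TransferConfig(multipart_chunksize=...), the disposition into ExtraArgs, and the expiry into generate_presigned_url(ExpiresIn=...).

```python
import math

CHUNK_BYTES = 8 * 1024 * 1024        # 8 MiB multipart chunks
URL_EXPIRY_S = 7 * 24 * 60 * 60      # 7-day signed URL, in seconds


def upload_plan(size_bytes: int) -> dict:
    """Describe how a file of the given size would be uploaded."""
    parts = max(1, math.ceil(size_bytes / CHUNK_BYTES))
    return {
        "parts": parts,
        "chunk_bytes": CHUNK_BYTES,
        "extra_args": {"ContentDisposition": "inline"},
        "url_expiry_s": URL_EXPIRY_S,
    }
```

For example, a 20 MiB clip would go up in three parts: two full 8 MiB chunks and one 4 MiB remainder.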
A row goes into Postgres with the prompt, the generated code, the cache key, the credit cost, the user ID, and the URL. This is what the dashboard renders when the user comes back later.
Where time goes
A typical 30-second 1080p generation on Starter without narration spends time roughly like this:
- Prompt augmentation and Gemini call: 2s
- Cache lookup: under 0.01s (miss path)
- Docker container boot: 0.4s (warm pool)
- Manim render at 480p: 8s
- Re-encode to 1080p with ffmpeg: 3s
- S3 upload: 1s
- Postgres write and webhook: under 0.1s
Total: about 14.5 seconds in the typical case. Going by the retry breakdown in stage 4b, the 18 percent of generations that need a second attempt add 8 to 15 seconds, and the 4 percent that need a third add 16 to 30 seconds.
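The stage figures above can be totalled directly. The short script below uses the listed values, taking the "under" entries at their upper bounds, and reports each stage's share of the total; the dictionary keys and function name are just labels chosen for the example.

```python
STAGE_SECONDS = {
    "prompt augmentation + Gemini call": 2.0,
    "cache lookup (miss)": 0.01,
    "container boot (warm pool)": 0.4,
    "Manim render at 480p": 8.0,
    "ffmpeg re-encode to 1080p": 3.0,
    "S3 upload": 1.0,
    "Postgres write + webhook": 0.1,
}


def budget_report(stages):
    """Return the total time and each stage's fraction of it."""
    total = sum(stages.values())
    shares = {name: secs / total for name, secs in stages.items()}
    return total, shares


total, shares = budget_report(STAGE_SECONDS)
for name, secs in STAGE_SECONDS.items():
    print(f"{name:36s} {secs:6.2f}s  {shares[name]:6.1%}")
print(f"{'total':36s} {total:6.2f}s")
```

The render is a little over half of the budget, which is why the factors in the next section mostly concern what Manim has to draw.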
What slows things down
If your videos take longer than expected, in order of likelihood:
- 3D scene. Manim's 3D camera is slower because it renders shaded polygons. Stay 2D when you can.
- Long duration. Render time scales linearly with duration up to about 90 seconds; beyond that, GC pressure makes it grow faster than linearly.
- Lots of LaTeX. MathTex shells out to actual LaTeX, which is single-threaded and disk-bound.
- Many mobjects on screen. Each frame has to lay out every visible object. 100 objects is fine, 1000 is not.
- Cache miss on a popular prompt. Sometimes the deterministic hash misses because of trivial whitespace differences. We are working on canonicalization.
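One possible approach to the canonicalization mentioned in the last item, offered as a sketch rather than a description of what Madio is building: round-trip the source through Python's ast module before hashing. ast.unparse (Python 3.9+) emits a normalized rendering, so differences in spacing, blank lines, and comments no longer change the key. The function name canonical_hash is hypothetical.

```python
import ast
import hashlib


def canonical_hash(code: str) -> str:
    # Parse and re-emit the source so that formatting-only differences vanish.
    canonical = ast.unparse(ast.parse(code))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = "x = 1\ny=x+2   # comment\n"
b = "x = 1\n\n\ny = x + 2\n"
print(canonical_hash(a) == canonical_hash(b))  # same program, same key
```

The trade-off is that comments are discarded from the key, which is harmless for rendering since they do not affect the output video.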
How this compares to other AI animation tools
Pipelines that wrap 3Blue1Brown's manim or the community fork all look roughly like this. The differences are in three places: the prompt augmentation (how much safety the wrapper enforces), the retry strategy (how aggressive, how many attempts, whether the model sees the error), and the cache (a true code cache vs a prompt cache vs no cache). Madio is on the aggressive end of all three. If you are curious about how long each stage takes in practice, "How long does AI take to render a math animation" breaks down the timing on real prompts, and "Why LLMs are good at writing Manim" covers why the underlying generation works at all.
Try it
The pipeline is the same on Free as on Team, with the duration cap and resolution differing per pricing tier. The gallery shows what the output looks like across topic areas. To run a prompt yourself, the editor is the entry point.