Madio

From prompt to MP4: the AI animation pipeline explained

May 10, 2026 · Sanatan Sharma

A text-to-MP4 AI animation pipeline looks like one button to the user. Behind the button there are six stages, three failure modes, and a queue. This post walks through every stage in order, explains what determines latency, and shows where the system gives up and refunds.

The same architecture applies to any pipeline that turns prompts into Manim videos, with small variations. The numbers in this post are from Madio's production logs as of May 2026.

The user-side flow

From the user's seat the flow is short.

  1. Open the editor and write a prompt. Optionally pick a template.
  2. Click Generate. A progress bar appears with stages: thinking, rendering, finalizing.
  3. Between 30 seconds and 4 minutes later, the MP4 plays inline. The video is downloadable, the prompt is editable, and the source code is visible (Pro and Team only).

That is the entire user surface. Everything below this section happens in the backend, hidden behind that progress bar.

Stage 1: prompt augmentation

The raw user prompt is rarely shippable to the LLM as-is. Madio's content layer prepends a system prompt, a few-shot example, and a list of safety rules. The augmentation does five things:

The augmented prompt is roughly 1500 tokens of system context plus the user's 100-token prompt.
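
To make the assembly step concrete, here is a minimal sketch of how a system prompt, a few-shot example, and safety rules could be prepended to the user's text. The constant values and the function name augment_prompt are hypothetical placeholders, not Madio's actual prompt content.

```python
# Hypothetical placeholder content; the real system prompt, example,
# and rules are not shown in this post.
SYSTEM_PROMPT = "You write Manim Community scenes. Return only Python code."
FEW_SHOT_EXAMPLE = (
    "Prompt: draw a circle\n"
    "Code: class Demo(Scene): ..."
)
SAFETY_RULES = [
    "Do not import subprocess.",
    "Do not access the network.",
]


def augment_prompt(user_prompt: str) -> str:
    """Prepend system context, one example, and rules to the user's prompt."""
    rules = "\n".join(f"- {rule}" for rule in SAFETY_RULES)
    sections = [
        SYSTEM_PROMPT,
        "Example:\n" + FEW_SHOT_EXAMPLE,
        "Rules:\n" + rules,
        "User prompt:\n" + user_prompt.strip(),
    ]
    return "\n\n".join(sections)
```

Keeping the user's text last means the fixed context is identical across requests and only the tail varies.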

Stage 2: LLM code generation

The augmented prompt goes to Google Gemini via the official SDK. The response is a single Python file containing one or more Manim Scene subclasses.

We use two models.

The model is asked to return only the code, no markdown fences, no explanation. We strip fences anyway in case it ignores that instruction. Roughly 4 percent of generations need a fence strip.
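
A fence strip of this kind can be sketched with one regular expression. This is an illustrative version, not the production code; strip_fences and its return shape are assumptions.

```python
import re
from typing import Tuple

# Built this way so the example itself contains no literal fence.
FENCE = "`" * 3

# One optional language tag after the opening fence, then the code body.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(response: str) -> Tuple[str, bool]:
    """Return (code, stripped); stripped is True if a fence was removed."""
    match = _FENCE_RE.match(response)
    if match:
        return match.group(1).strip("\n") + "\n", True
    return response.strip("\n") + "\n", False
```

Returning the boolean alongside the code makes it cheap to track how often the model ignores the instruction.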

Before the code goes near Docker, a fast syntax check runs. Python's ast.parse catches syntax errors in roughly 1 percent of generations. A regex sweep catches forbidden imports (subprocess, os.system, urllib) in less than 0.1 percent. If either check fails, generation is retried with the error included as feedback.
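
The two checks can be sketched as a single function that returns feedback text for the model. The name precheck and the message format are assumptions; the forbidden list is the one named above.

```python
import ast
import re
from typing import Optional

# The three names come from the post; a real deny-list is likely longer.
FORBIDDEN = re.compile(r"\b(subprocess|os\.system|urllib)\b")


def precheck(source: str) -> Optional[str]:
    """Return an error message to feed back to the model, or None if clean."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"SyntaxError on line {exc.lineno}: {exc.msg}"
    hit = FORBIDDEN.search(source)
    if hit:
        return f"Forbidden reference: {hit.group(1)}"
    return None
```

A regex sweep is deliberately blunt: it also flags the forbidden names inside comments and strings, which is acceptable when the cost of a false positive is one extra generation.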

Stage 3: code cache lookup

Before booting a Docker container, the pipeline hashes the generated code and looks it up in a two-tier cache.

If either tier returns a hit, the pipeline skips straight to stage 6 (S3 upload of the cached MP4) and shaves 90 seconds off the request. Cache hit rate is around 12 percent overall and rises to 40 percent for prompts that come from a template.

A code cache, not a prompt cache, is the right design. Two different prompts can produce the same code (or near-identical code after whitespace normalization). The hash is on the deterministic post-processed source.
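
As an illustration, a cache key over post-processed source might look like the sketch below. The normalization rules and the choice of SHA-256 are assumptions; the post does not say which hash or which post-processing Madio uses.

```python
import hashlib


def normalize(source: str) -> str:
    """Light whitespace normalization: unify newlines, trim line ends,
    and drop blank lines. (Assumed rules, for illustration only.)"""
    lines = source.replace("\r\n", "\n").split("\n")
    kept = [line.rstrip() for line in lines if line.strip()]
    return "\n".join(kept) + "\n"


def cache_key(source: str) -> str:
    """Hex digest of the normalized source, used as the cache lookup key."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
```

Two generations that differ only in trailing spaces, blank lines, or line endings then map to the same key, while any change to the code itself produces a different one.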

Stage 4: sandboxed Manim render

This is the longest stage. The generated .py file is mounted read-only into a Docker container running Python 3.11, Manim Community v0.18.1, ffmpeg 6.1, and the standard math fonts. The container has no network access, and its CPU and memory are capped.

The container runs:

manim -ql --output_file out.mp4 generated_scene.py

-ql is low quality, 480p, 15 fps. Why low quality? Because the bottleneck is not pixels. It is the symbolic computation that Manim does to lay out the scene. Rendering at 480p first lets us catch errors fast and re-render at higher quality only on the final pass.
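
The container setup and command described above can be sketched as an argument list for docker run. The image name, mount path, and limit values are hypothetical; only the flags themselves (--network none, --cpus, --memory, a :ro bind mount) and the manim command are taken as given.

```python
from pathlib import Path
from typing import List


def docker_render_argv(
    scene_path: Path,
    image: str = "manim-sandbox:latest",  # hypothetical image name
    cpus: str = "2",                      # hypothetical CPU cap
    memory: str = "2g",                   # hypothetical memory cap
) -> List[str]:
    """Build the docker run command for one sandboxed low-quality render."""
    mount = f"{scene_path.resolve()}:/work/generated_scene.py:ro"
    return [
        "docker", "run", "--rm",
        "--network", "none",      # no network access
        "--cpus", cpus,
        "--memory", memory,
        "-v", mount,              # generated file, read-only
        "-w", "/work",
        image,
        "manim", "-ql", "--output_file", "out.mp4", "generated_scene.py",
    ]
```

Getting out.mp4 back out of the container, for example through a separate writable mount, is left out of this sketch. Passing the command as a list rather than a shell string also avoids quoting problems.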

Render times observed in production:

If the render exits non-zero, the pipeline catches stderr, parses the Python traceback, and triggers stage 4b.
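
Parsing the traceback might look like the following sketch. It assumes the plain CPython traceback format; if the container emits rich-formatted tracebacks, the frame pattern would need to change. CrashInfo and parse_traceback are illustrative names.

```python
import re
from typing import NamedTuple, Optional


class CrashInfo(NamedTuple):
    error: str               # final line, e.g. "NameError: ..."
    line_no: Optional[int]   # failing line in the generated file, if found


_FRAME_RE = re.compile(r'File "([^"]+)", line (\d+)')


def parse_traceback(stderr: str, filename: str = "generated_scene.py") -> CrashInfo:
    """Extract the error message and the innermost failing line in filename."""
    line_no = None
    for path, number in _FRAME_RE.findall(stderr):
        if path.endswith(filename):
            line_no = int(number)  # later frames overwrite earlier ones
    lines = [line.strip() for line in stderr.splitlines() if line.strip()]
    error = lines[-1] if lines else ""
    return CrashInfo(error, line_no)
```

Keeping only frames from the generated file points at the line the model wrote, not at a line deep inside Manim.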

Stage 4b: the retry loop

Manim crashes are common in LLM-generated code. The most frequent causes:

When a crash happens, the pipeline sends the error message, the failing line, and the original prompt back to Gemini with a fix instruction. The model re-emits a corrected file. Madio retries up to 3 times. The retry rate breakdown:

The retry loop adds 5 to 60 seconds depending on which attempt succeeds. Most of the time it is invisible to the user because the progress bar still says "rendering".
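
The control flow of the loop can be sketched with the render and fix steps passed in as callables. The function name and signatures are assumptions; the limit of 3 retries comes from the text above.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 3  # one initial attempt plus up to three retries


def render_with_retries(
    code: str,
    render: Callable[[str], Tuple[bool, str]],  # code -> (ok, stderr)
    fix: Callable[[str, str], str],             # (code, stderr) -> new code
) -> Tuple[Optional[str], int]:
    """Return (working code or None, number of retries used)."""
    retries = 0
    while True:
        ok, stderr = render(code)
        if ok:
            return code, retries
        if retries == MAX_RETRIES:
            return None, retries  # caller gives up and refunds
        code = fix(code, stderr)
        retries += 1
```

Injecting render and fix keeps the loop testable without Docker or a model call.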

Stage 5: ffmpeg post-processing

Manim's raw output is an MP4 but it is not the MP4 the user gets. ffmpeg runs a sequence of filters:

The 4K re-encode is the slowest, around 8 to 15 seconds. Most users on Free, Starter, and Pro see ffmpeg add 2 to 4 seconds.

Stage 5b: optional AI narration

Pro and Team plans get AI narration. The pipeline sends a separate prompt to Gemini to generate a 30-to-180-word narration script timed to the video, then sends the script to edge-tts (Microsoft Edge's neural voices, 7 voices selectable) for synthesis.

The audio is mixed under the visual track at -8 dB and the master is normalized again. Narration adds 5 to 10 seconds total. The Pro plan ships this on by default and the user can mute it. The Team plan makes it scriptable via the API.
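
For a sense of what -8 dB means in practice, the decibel figure converts to a linear amplitude multiplier as follows. This is a worked conversion, not a description of the actual mixing code; the function name is illustrative.

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel level to a linear amplitude multiplier."""
    return 10 ** (db / 20)


# -8 dB scales sample amplitude to roughly 40 percent of the original.
NARRATION_GAIN = db_to_gain(-8)
```

ffmpeg's volume filter accepts the decibel form directly (volume=-8dB), so the linear value is mainly useful when mixing samples by hand.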

Free and Starter do not get narration. We considered a watermarked narration for Free, but at $0 revenue the margin on every render becomes negative.

Stage 6: S3 upload

The final MP4 goes to an S3 bucket in us-east-1. The upload is multipart, 8 MiB chunks, with content-disposition set to inline so the video plays in the browser. Median upload time is 800 ms for a 30-second 720p clip and 4 seconds for a 5-minute 4K clip.
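
As a quick worked example of the chunking, the number of parts for a given file size follows from the 8 MiB chunk size. The helper name is illustrative, and the file size in the comment is an assumed figure, not a production one.

```python
import math

CHUNK_BYTES = 8 * 1024 * 1024  # 8 MiB, as stated above


def part_count(size_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of multipart parts needed; even an empty object uses one."""
    return max(1, math.ceil(size_bytes / chunk_bytes))


# An assumed 20 MiB clip splits into two full parts and one 4 MiB remainder.
EXAMPLE_PARTS = part_count(20 * 1024 * 1024)
```

S3 allows at most 10,000 parts per upload, so 8 MiB chunks cover objects up to roughly 78 GiB, far beyond any clip size mentioned here.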

The S3 URL is signed with a 7-day expiry. Pro and Team plans get long-term storage in their dashboard. Free and Starter videos expire after 7 days.

A row goes into Postgres with the prompt, the generated code, the cache key, the credit cost, the user ID, and the URL. This is what the dashboard renders when the user comes back later.

Where time goes

A typical 30-second 1080p generation on Starter without narration spends time roughly like this:

Total: 14 seconds typical. The 4 percent that retry once add 8 to 15 seconds. The 0.4 percent that retry twice add 16 to 30 seconds.
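
The retry figures above imply a mean latency only slightly higher than the typical case. The sketch below works that out, assuming the midpoint of each quoted range and ignoring generations that retry three times, for which no figures are given.

```python
BASE_SECONDS = 14.0

# (share of generations, (min added seconds, max added seconds))
RETRY_CASES = [
    (0.04, (8, 15)),    # retry once
    (0.004, (16, 30)),  # retry twice
]


def expected_latency(base, cases):
    """Mean latency, using the midpoint of each added-time range (an assumption)."""
    total = base
    for share, (low, high) in cases:
        total += share * (low + high) / 2
    return total


# 14 + 0.04 * 11.5 + 0.004 * 23 = 14.552 seconds
EXPECTED_SECONDS = expected_latency(BASE_SECONDS, RETRY_CASES)
```

Retries barely move the mean; their effect shows up in the tail of the distribution.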

What slows things down

If your videos take longer than expected, in order of likelihood:

  1. 3D scene. Manim's 3D camera is slower because it renders shaded polygons. Stay 2D when you can.
  2. Long duration. Render time scales linearly with duration up to about 90 seconds, then GC pressure makes it superlinear.
  3. Lots of LaTeX. MathTex shells out to actual LaTeX, which is single-threaded and disk-bound.
  4. Many mobjects on screen. Each frame has to lay out every visible object. 100 objects is fine, 1000 is not.
  5. Cache miss on a popular prompt. Sometimes the deterministic hash misses because of trivial whitespace differences. We are working on canonicalization.
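
One possible approach to the canonicalization mentioned in the last item is to round-trip the source through Python's AST, which discards comments and formatting differences. This is a sketch of the idea, not the approach Madio has chosen; it requires Python 3.9 or later for ast.unparse.

```python
import ast


def canonicalize(source: str) -> str:
    """Parse and re-emit the source so formatting differences disappear.

    Comments, blank lines, and spacing are dropped or normalized;
    the program structure is preserved.
    """
    return ast.unparse(ast.parse(source)) + "\n"
```

Hashing the canonical form instead of the raw text means two files that differ only in spacing, blank lines, or comments produce the same cache key.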

How this compares to other AI animation tools

Pipelines that wrap 3Blue1Brown's manim or the community fork all look roughly like this. The differences are in three places: the prompt augmentation (how much safety the wrapper enforces), the retry strategy (how aggressive, how many attempts, whether the model sees the error), and the cache (a true code cache vs a prompt cache vs no cache). Madio is on the aggressive end of all three. If you are curious about how long each stage takes in practice, "How long does AI take to render a math animation" breaks down the timing on real prompts, and "Why LLMs are good at writing Manim" covers why the underlying generation works at all.

Try it

The pipeline is the same on Free as on Team, with the duration cap and resolution differing per pricing tier. The gallery shows what the output looks like across topic areas. To run a prompt yourself, the editor is the entry point.
