I Built a Pipeline to Generate YouTube Shorts Programmatically

A TypeScript CLI that takes a script and spits out a finished YouTube Short using Claude, Fal.ai, ElevenLabs, and FFmpeg — no video editor required.


I got bored one weekend and started wondering: how hard would it actually be to generate a YouTube Short entirely from code? Not screen-record something, not stitch clips manually, but write a script in a text file and have a program hand you back a finished .mp4.

Turns out: not that hard. Also not free. Here’s what I built, how it works, and what I’d do differently.

You can find the full code on GitHub: github.com/caden311/content-generator

Here’s an example of one of the Shorts it generated: youtube.com/shorts/pbITB7jEUzc


What it actually does

The pipeline takes a plain text script as input and runs it through five stages:

  1. Scene breakdown — Claude reads your script and returns a JSON breakdown with one scene per segment. Each scene gets an image prompt, a video prompt, narration text, and a target duration in seconds.
  2. Asset generation — For every scene, the pipeline fires off image generation (Fal FLUX), video generation (Fal Kling), and text-to-speech (ElevenLabs) in parallel.
  3. Subtitle generation — Each audio file gets transcribed by Whisper to get word-level timestamps, which are converted to an ASS subtitle file.
  4. Assembly — FFmpeg concatenates the clips, merges the voiceover, and burns in the subtitles.
  5. YouTube metadata — Claude generates a title, description, and tags, saved to upload.json next to the video.

The whole thing runs from one command:

npm run generate -- my-script.txt --format shorts
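The five stages above can be sketched as a small sequential orchestrator. This is an illustration only, not the repository's actual code; the stage names and types here are hypothetical:

```typescript
// Minimal sketch of the five-stage pipeline. Types and names are
// illustrative, not the repo's real interfaces.
type Scene = {
  imagePrompt: string;
  videoPrompt: string;
  narration: string;
  durationSeconds: number;
};

type Pipeline = {
  breakdown: (script: string) => Promise<Scene[]>;          // 1. Claude
  generateAssets: (scenes: Scene[]) => Promise<string[]>;   // 2. Fal + ElevenLabs
  generateSubtitles: (assets: string[]) => Promise<string>; // 3. Whisper -> .ass
  assemble: (assets: string[], subs: string) => Promise<string>; // 4. FFmpeg -> .mp4
  metadata: (script: string) => Promise<object>;            // 5. upload.json contents
};

async function runPipeline(p: Pipeline, script: string): Promise<string> {
  const scenes = await p.breakdown(script);
  const assets = await p.generateAssets(scenes);
  const subs = await p.generateSubtitles(assets);
  const video = await p.assemble(assets, subs);
  await p.metadata(script); // written next to the video as upload.json
  return video;
}
```

Each stage only depends on the previous one's output, which is what makes the single-command flow possible.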

The stack

Claude handles the script-to-scenes breakdown. I gave it a system prompt that asks for strict JSON output with image prompts, video prompts, and narration split per scene. Using a structured prompt instead of freeform prose made parsing predictable.

Fal.ai runs both FLUX (images) and Kling (video generation). Fal uses an async queue model: you submit a job, get back a status_url and response_url, then poll until it finishes. This is fine for a CLI but means each video clip can take a few minutes.
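The polling loop for that queue model is simple to sketch. A minimal version, with the status-fetching function injected so the endpoint shape stays out of the picture (the real adapter's code will differ):

```typescript
// Sketch of Fal.ai's async queue flow: submit a job, then poll its status
// until completion. Status values simplified for illustration.
type QueueStatus = { status: "IN_QUEUE" | "IN_PROGRESS" | "COMPLETED" };

async function pollUntilDone(
  getStatus: () => Promise<QueueStatus>,
  intervalMs = 2000,
  maxAttempts = 150, // ~5 minutes at the default interval
): Promise<void> {
  for (let i = 0; i < maxAttempts; i++) {
    const { status } = await getStatus();
    if (status === "COMPLETED") return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Fal job timed out");
}
```

Once the status is COMPLETED, you fetch the response_url to get the generated asset.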

ElevenLabs handles the narration voice. The default voice is Rachel (21m00Tcm4TlvDq8ikWAM) from their multilingual v2 model. You can swap it with the --voice flag.

OpenAI Whisper transcribes the generated audio back to word-level timestamps, which power the subtitles. Yes, you generate speech and then transcribe it. The timestamps are accurate enough that it works.
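Converting those word timestamps into ASS events is mostly string formatting. A rough sketch (the real subtitle generator groups words into readable chunks and emits the full [Script Info]/[Events] headers; this just shows the timestamp math):

```typescript
// ASS timestamps use H:MM:SS.cc (centiseconds).
function formatAssTime(seconds: number): string {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = seconds % 60;
  return `${h}:${String(m).padStart(2, "0")}:${s.toFixed(2).padStart(5, "0")}`;
}

type Word = { word: string; start: number; end: number };

// One Dialogue event per word, using the default style.
function toAssDialogue(words: Word[]): string {
  return words
    .map(
      (w) =>
        `Dialogue: 0,${formatAssTime(w.start)},${formatAssTime(w.end)},Default,,0,0,0,,${w.word}`,
    )
    .join("\n");
}
```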

FFmpeg does the final assembly. If a video clip failed to generate, it falls back to a still image with a Ken Burns zoom effect so the video doesn’t look dead.
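One common way to get that Ken Burns effect is FFmpeg's zoompan filter; the exact filter string the project uses may differ, but the idea looks like this:

```typescript
// Builds a zoompan filter that slowly zooms a still image from 1.0x to ~1.2x
// over the clip's duration. Zoom step and cap are illustrative values.
function kenBurnsFilter(
  durationSeconds: number,
  fps = 25,
  width = 1080,
  height = 1920, // 9:16 for Shorts
): string {
  const frames = Math.round(durationSeconds * fps);
  return `zoompan=z='min(zoom+0.0015,1.2)':d=${frames}:s=${width}x${height}:fps=${fps}`;
}
```

The resulting string is passed to ffmpeg via -vf when a scene falls back to its still image.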

The adapters are all behind interfaces, so swapping one provider for another means writing one new class. Three tiers are built in: budget, standard, and premium, each mapping to a different adapter set.
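The adapter idea boils down to one interface per capability plus a tier-to-adapter mapping. A hypothetical sketch (interface and function names are mine, not the repo's; the model assignments come from the tier table later in the post):

```typescript
// Each capability is an interface; a tier is just a bundle of choices.
interface ImageAdapter {
  generateImage(prompt: string): Promise<string>; // returns a file path
}
interface TtsAdapter {
  synthesize(text: string): Promise<string>; // returns a file path
}

type Tier = "budget" | "standard" | "premium";

function adaptersForTier(tier: Tier): { imageModel: string; ttsProvider: string } {
  switch (tier) {
    case "budget":
      return { imageModel: "FLUX Schnell", ttsProvider: "OpenAI TTS" };
    case "standard":
      return { imageModel: "FLUX Schnell", ttsProvider: "ElevenLabs" };
    case "premium":
      return { imageModel: "DALL-E 3", ttsProvider: "ElevenLabs" };
  }
}
```

Swapping providers then means implementing one new class against the interface, with no changes to the pipeline itself.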


Getting everything talking to Claude

The trickiest design decision was the scene breakdown prompt. I needed Claude to return consistent JSON every time, with scenes that summed to under 60 seconds for Shorts format.

The fix was simple: inject a {{MAX_DURATION_INSTRUCTION}} placeholder into the system prompt that only gets filled in when you’re targeting 9:16 format.

// src/adapters/llm/claude.ts
const SYSTEM_PROMPT = `You are a video production assistant...
{{MAX_DURATION_INSTRUCTION}}`;

async breakdownScript(script: string, maxDurationSeconds?: number) {
  const maxDurationInstruction = maxDurationSeconds !== undefined
    ? `- IMPORTANT: Total duration MUST NOT exceed ${maxDurationSeconds} seconds`
    : "";
  const systemPrompt = SYSTEM_PROMPT.replace(
    "{{MAX_DURATION_INSTRUCTION}}",
    maxDurationInstruction
  );
  // ...
}

Claude also occasionally wraps the JSON in a markdown code block. The response parser strips that before calling JSON.parse:

const jsonMatch = jsonStr.match(/```(?:json)?\s*([\s\S]*?)```/);
if (jsonMatch?.[1]) {
  jsonStr = jsonMatch[1];
}

Small thing, but JSON.parse would throw on the fenced output without it.


The hard part: timing

Getting audio, video, and subtitles to sync up correctly was messier than I expected.

Each scene has a durationSeconds from Claude’s breakdown. But the actual generated audio is rarely exactly that long. ElevenLabs paces speech differently depending on the narration content, and Kling generates clips in fixed 5 or 10 second chunks regardless of what you asked for.

The subtitle system only half-handles this. Whisper returns accurate word timestamps within each scene, but the subtitle generator tracks a running offset across scenes that accumulates the durationSeconds targets, not the real audio lengths. That mismatch meant subtitles could drift by a second or two on longer videos.

Here’s how the offset calculation works in the orchestrator:

// src/pipeline/orchestrator.ts
let offset = 0;
const sceneAudioInfos = [];

for (const asset of sortedAssets) {
  const scene = project.breakdown.scenes[asset.sceneIndex];
  if (asset.audioPath) {
    sceneAudioInfos.push({
      narration: scene.narration,
      audioPath: asset.audioPath,
      offsetSeconds: offset,        // cumulative offset passed to Whisper
    });
  }
  offset += scene.durationSeconds;  // uses target duration, not real duration
}

The fix would be to measure actual audio duration with ffprobe and use that for the offset instead. I didn’t get around to it.
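That fix might look something like this (a hypothetical sketch, not code from the repo: ffprobe invocation plus a pure offset accumulator fed real durations instead of targets):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Ask ffprobe for the container's real duration in seconds.
async function audioDurationSeconds(path: string): Promise<number> {
  const { stdout } = await run("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    path,
  ]);
  return parseFloat(stdout.trim());
}

// Same accumulation as the orchestrator, but over measured durations:
// each scene's subtitle offset is the sum of the real lengths before it.
function cumulativeOffsets(durations: number[]): number[] {
  const offsets: number[] = [];
  let offset = 0;
  for (const d of durations) {
    offsets.push(offset);
    offset += d;
  }
  return offsets;
}
```

One ffprobe call per scene at assembly time, and the drift disappears.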

The final assembly step trims the output to Math.min(videoDuration, audioDuration) to avoid a silent tail if the audio runs shorter than the video. That part at least works cleanly.
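Sketched as an ffmpeg argument list (illustrative only; the repo's actual assembly command is more involved), the trim is just a -t flag set to the shorter duration:

```typescript
// Trim the muxed output to the shorter of the video and audio tracks,
// so a short voiceover doesn't leave a silent tail. Codec choices are
// illustrative assumptions.
function trimArgs(
  videoPath: string,
  audioPath: string,
  videoDuration: number,
  audioDuration: number,
  outPath: string,
): string[] {
  const t = Math.min(videoDuration, audioDuration);
  return [
    "-i", videoPath,
    "-i", audioPath,
    "-t", t.toFixed(3),
    "-c:v", "copy",
    "-c:a", "aac",
    outPath,
  ];
}
```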


What went wrong: subtitles

The subtitle burn-in was the most frustrating part of the whole project.

FFmpeg can burn ASS subtitles using a libass filter. The command looks like:

ffmpeg -vf "ass=filename=subtitles.ass" ...

But libass is not included in the default Homebrew FFmpeg build on macOS. You get a cryptic “No such filter” error at runtime. The workaround is:

brew install libass
brew reinstall ffmpeg

The pipeline now catches the specific error string and logs a helpful message instead of crashing:

// src/assembly/ffmpeg.ts
} catch (err: any) {
  if (err?.stderr?.includes("No such filter")) {
    logger.warn(
      "ffmpeg built without libass — subtitles skipped. " +
      "Fix: brew install libass && brew reinstall ffmpeg"
    );
  } else {
    throw err;
  }
}

If subtitle burning fails, the video still gets assembled, just without the text overlay. For Shorts this matters a lot since most people watch without sound.


How to try it yourself

Prerequisites:

  • Node.js 22+
  • FFmpeg with libass: brew install libass && brew install ffmpeg (reinstall ffmpeg if it was installed before libass)
  • API keys for Anthropic, OpenAI, Fal.ai, and ElevenLabs

Setup:

git clone https://github.com/caden311/content-generator
cd content-generator
npm install

Create a .env file:

ANTHROPIC_API_KEY=your_key
OPENAI_API_KEY=your_key
FAL_KEY=your_key
ELEVENLABS_API_KEY=your_key

Write a script. Plain text, narration style, a few paragraphs. The shorter the better for Shorts. Save it as script.txt.

Generate:

# YouTube Short (9:16, 60s max)
npm run generate -- script.txt --format shorts

# Standard YouTube video (16:9)
npm run generate -- script.txt

# Dry run (no API calls, generates placeholder media)
npm run generate -- script.txt --dry-run

Output lands in ./output/001_your-video-title/output.mp4 alongside a breakdown.json, upload.json with YouTube metadata, and all the intermediate assets.

Model tiers:

Tier        Images          TTS
budget      FLUX Schnell    OpenAI TTS
standard    FLUX Schnell    ElevenLabs
premium     DALL-E 3        ElevenLabs

All tiers use Kling for video generation. You can switch with --tier premium.

Cost estimate: A standard 4-scene Short runs roughly $0.30-0.60 depending on the tier. Most of that is Kling. Image and TTS costs are small.


Where it is now

It works. The output quality is… fine. Kling generates reasonably coherent motion clips. The ElevenLabs voice sounds natural. The visuals are a little random since there’s no style consistency between scenes, but for a weekend experiment it’s genuinely impressive that it works at all.

The project is not something I’m actively maintaining. It was a curiosity project that answered its question: yes, you can generate short-form video content from a text file in an afternoon. Whether the content is actually good is a separate problem.


What I’d do differently

Add background music. The biggest thing missing from the generated Shorts is audio atmosphere. The narration sits on dead silence, which feels unpolished. Adding a royalty-free background track and ducking it under the voiceover would make a meaningful difference to the final feel.

Measure real audio duration for subtitle offsets. As mentioned above, using the target durationSeconds from Claude instead of the actual audio file length causes subtitle drift. A single ffprobe call per scene at assembly time would fix it.

Add a visual style constraint to image prompts. Right now each scene generates an image independently, so the visual style can jump around. Injecting a consistent style prefix into every image prompt (something like “cinematic, warm color grading, consistent lighting”) would make multi-scene videos look less like a random slideshow.

The underlying approach is solid. The pipeline architecture, the parallel asset generation, the adapter pattern for swapping providers: all of that held up. The rough edges are mostly surface-level quality problems that more prompt engineering and one or two extra processing steps would fix.
