📚 Engage with our Skool collective for assistance, premium resources, and further materials!

Join a rapidly expanding cohort and contribute to the expansion of this utility's capabilities

Overview

An open-source tool for automatically producing short-form videos. The Automated Short-Clip Fabricator combines text-to-speech narration, automatic captions, stock background videos, and music to build short videos from plain text input.

This project aims to provide a free alternative to compute-intensive video generation (and to expensive third-party API calls). It does not generate video from images or image prompts.

This repository was open-sourced by the AI Agents A-Z YouTube channel. Check out the channel for more AI-related content and tutorials.

The server exposes both an MCP endpoint and a REST API.

The MCP interface lets AI agents (for example, in n8n) call the service directly, while the REST endpoints give more flexibility for manual or programmatic video generation.

You can examine sample n8n automation sequences leveraging this REST/MCP server within this repository.

Operation

Runtime Environment Variables
REST Interface Details
Customization Parameters
MCP Service Exposure

Supplementary Information

Core Capabilities
Operational Flow
Current Constraints
Conceptual Frameworks
Issue Resolution
Cloud Deployment Guidance
Frequently Asked Questions
Required Components for Rendering
Contribution Guidelines
Licensing Details
Credits and Recognition

n8n Integration Guide

Showcase Media

Core Capabilities

Generates complete short videos from text prompts.
Converts text to speech.
Generates and styles captions automatically.
Searches and selects background videos via Pexels.
Adds background music, selectable by genre/mood.
Runs as both a REST API and a Model Context Protocol (MCP) server.

Operational Flow

The tool takes simple text inputs and search terms, then runs the following steps:

Converts the text to speech using the Kokoro TTS engine.
Generates accurate captions via the Whisper speech recognition model.
Finds relevant background videos on Pexels.
Composes all elements with the Remotion framework.
Renders a short video with captions timed to the narration.

Current Constraints

The video output is presently restricted to English voiceovers (due to Kokoro-js language limitations).
Visual assets are exclusively sourced from Pexels.

Mandatory Prerequisites

An internet connection.
A free Pexels API key.
At least 3 GB of free RAM; 4 GB is recommended.
At least 2 vCPUs.
At least 5 GB of disk space.

Conceptual Frameworks

Segment (Scene)

Every final production is constructed from a series of discrete segments. Each segment is defined by:

Narration Text: The script content that the TTS engine will vocalize, which subsequently forms the captions.
Sourcing Keywords: The descriptive terms supplied to the system for querying and selecting appropriate visuals from the Pexels catalog. Should no matches be found, fallback terms are employed (nature, globe, space, ocean).
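As a rough illustration of the scene concept, the sketch below models a scene as a small Python dataclass and serializes a list of scenes into the `scenes` array used by the `POST /api/short-video` example in this document. The names `Scene` and `scenes_payload` are hypothetical; only the `text` and `searchTerms` field names come from the REST example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Scene:
    """One scene: narration text plus Pexels search keywords."""

    text: str
    search_terms: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Field names follow the REST example: "text" and "searchTerms".
        return {"text": self.text, "searchTerms": list(self.search_terms)}


def scenes_payload(scenes: list[Scene]) -> str:
    """Serialize scenes into a JSON object holding the "scenes" array."""
    return json.dumps({"scenes": [scene.to_dict() for scene in scenes]})
```

Calling `scenes_payload([Scene("Hello world!", ["river"])])` yields a JSON string with a single scene, which can then be merged with a `config` object before sending.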

Initial Setup

Docker Deployment (Recommended)

Three Docker images are available, each aimed at a different use case. Most users should start with the `tiny` image.

Minimalist Variant (`tiny`)

Employs the tiny.en variant of the Whisper.cpp model.
Utilizes the q4 quantization setting for the Kokoro model.
Sets CONCURRENCY=1 to mitigate potential Out-Of-Memory (OOM) errors associated with Remotion under resource constraints.
Sets VIDEO_CACHE_SIZE_IN_BYTES=2097152000 (2GB) to alleviate OOM issues within the Remotion rendering pipeline.

```bash
docker run -it --rm --name short-video-maker -p 3123:3123 \
  -e LOG_LEVEL=debug \
  -e PEXELS_API_KEY= \
  gyoridavid/short-video-maker:latest-tiny
```
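The `VIDEO_CACHE_SIZE_IN_BYTES` value listed for this image works out to 2000 MiB, which the "2GB" label approximates. The snippet below only checks that arithmetic; the helper name `mib_to_bytes` is made up for illustration.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


cache_bytes = mib_to_bytes(2000)
print(cache_bytes)  # 2097152000
```

The same helper can be used to derive a smaller value (for example `mib_to_bytes(1000)`) when experimenting with a lower cache size on constrained hosts.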

Standard Variant (`Normal`)

Employs the base.en variant of the Whisper.cpp model.
Utilizes the full precision (fp32) Kokoro model.
Sets CONCURRENCY=1 to manage Remotion resource usage.
Sets VIDEO_CACHE_SIZE_IN_BYTES=2097152000 (2GB) for frame caching.

```bash
docker run -it --rm --name short-video-maker -p 3123:3123 \
  -e LOG_LEVEL=debug \
  -e PEXELS_API_KEY= \
  gyoridavid/short-video-maker:latest
```

CUDA Optimized Variant

For users possessing an Nvidia Graphics Processing Unit (GPU), this image enables accelerated processing via GPU offloading for the Whisper model.

Employs the medium.en Whisper.cpp model (leveraging GPU acceleration).
Utilizes the full precision (fp32) Kokoro model.
Sets CONCURRENCY=1 to control parallel rendering threads.
Sets VIDEO_CACHE_SIZE_IN_BYTES=2097152000 (2GB) for frame buffering.

```bash
docker run -it --rm --name short-video-maker -p 3123:3123 \
  -e LOG_LEVEL=debug \
  -e PEXELS_API_KEY= \
  --gpus=all \
  gyoridavid/short-video-maker:latest-cuda
```

Docker Compose Integration

This setup is useful when orchestrating short-video-maker alongside other containerized services like n8n.

```yaml
version: "3"

services:
  short-video-maker:
    image: gyoridavid/short-video-maker:latest-tiny
    environment:
      - LOG_LEVEL=debug
      - PEXELS_API_KEY=
    ports:
      - "3123:3123"
    volumes:
      - ./videos:/app/data/videos # Map for persistent video storage
```

If you are using the Self-hosted AI starter kit, add `networks: ['demo']` to the `short-video-maker` service definition so that n8n can reach it at `http://short-video-maker:3123`.

NPM Installation

While Docker is the preferred deployment method, execution via npm or npx is also feasible. Beyond the general requirements, the following platform-specific dependencies must be satisfied for server operation:

Supported Operating Environments

Ubuntu: Version 22.04 or newer (requires libc 2.5 or higher for Whisper.cpp).
Required system libraries: git wget cmake ffmpeg curl make libsdl2-dev libnss3 libdbus-1-3 libatk1.0-0 libgbm-dev libasound2 libxrandr2 libxkbcommon-dev libxfixes3 libxcomposite1 libxdamage1 libatk-bridge2.0-0 libpango-1.0-0 libcairo2 libcups2
macOS:
FFmpeg: install via Homebrew (`brew install ffmpeg`).
Node.js (version 22+ tested).

Windows is not supported currently due to frequent installation failures of the Whisper.cpp component.

Browser Interface (Web UI)

@mushitori has developed a graphical interface accessible via a web browser for simplified video synthesis.

Access the interface at http://localhost:3123

Runtime Environment Variables

🟢 Configuration Settings

Variable	Purpose	Default Value
PEXELS_API_KEY	Your (complimentary) Pexels API credential.	(Empty)
LOG_LEVEL	Verbosity level for the Pino logging framework.	info
WHISPER_VERBOSE	Flag to route whisper.cpp standard output to the console.	false
PORT	The network port the server will actively listen on.	3123

⚙️ System Configuration

Variable	Description	Default Value
KOKORO_MODEL_PRECISION	Specifies the required precision/size of the Kokoro model. Accepts: `fp32`, `fp16`, `q8`, `q4`, `q4f16`.	Varies based on Docker image used (see above).
CONCURRENCY	Dictates the count of parallel browser instances utilized during rendering (each instance processes web content for capture). Adjusting this aids performance stability on constrained hardware.	Varies based on Docker image used (see above).
VIDEO_CACHE_SIZE_IN_BYTES	Defines the maximum memory allocation for caching frames within Remotion's `<OffthreadVideo>` components. Modification can help stabilize rendering under low memory conditions.	Varies based on Docker image used (see above).

⚠️ Advanced/Hazardous Settings

Variable	Description	Default Value
WHISPER_MODEL	Selection for the underlying whisper.cpp acoustic model. Options: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large-v1`, `large-v2`, `large-v3`, `large-v3-turbo`.	Depends on deployment context. Default for npm is `medium.en`.
DATA_DIR_PATH	The local filesystem path where project data is stored.	`~/.ai-agents-az-video-generator` (for npm); `/app/data` (in Docker)
DOCKER	Boolean indicator for execution within a containerized environment.	`true` (in Docker images); `false` otherwise
DEV	Development mode toggle.	`false`
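One way to keep these settings together is an env file passed to Docker with `docker run --env-file`. The sketch below is a hypothetical helper (`render_env_file` is not part of the project) that checks `KOKORO_MODEL_PRECISION` and `WHISPER_MODEL` against the values listed in the tables above and renders `KEY=value` lines. It assumes the server accepts exactly the values documented here.

```python
KOKORO_PRECISIONS = {"fp32", "fp16", "q8", "q4", "q4f16"}
WHISPER_MODELS = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3", "large-v3-turbo",
}


def render_env_file(settings: dict[str, str]) -> str:
    """Validate the enumerated variables and return sorted KEY=value lines."""
    precision = settings.get("KOKORO_MODEL_PRECISION")
    if precision is not None and precision not in KOKORO_PRECISIONS:
        raise ValueError(f"unsupported KOKORO_MODEL_PRECISION: {precision}")

    model = settings.get("WHISPER_MODEL")
    if model is not None and model not in WHISPER_MODELS:
        raise ValueError(f"unsupported WHISPER_MODEL: {model}")

    # Sorting keeps the generated file stable between runs.
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))
```

Writing the returned string to a file such as `short-video-maker.env` and passing it with `--env-file` avoids repeating `-e` flags on the command line.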

Customization Parameters

Parameter	Description	Default
paddingBack	Duration, in milliseconds, that the final frame should persist after narration concludes (end screen duration).	0
music	The thematic selection for the background audio. See the GET `/api/music-tags` endpoint for valid selections.	random
captionPosition	Vertical placement of text overlays: `top`, `center`, or `bottom`.	`bottom`
captionBackgroundColor	The solid color applied behind the currently displayed subtitle segment.	`blue`
voice	The specific synthesized voice identity selected from the available Kokoro models.	`af_heart`
orientation	The aspect ratio for the final visual output: `portrait` or `landscape`.	`portrait`
musicVolume	Sets the relative loudness of the backing track. Options: `low`, `medium`, `high`, or `muted`.	`high`
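Client code can catch typos in these options before a request is sent. The following is a minimal sketch, not the project's own validation: `build_config` is a hypothetical helper that applies the defaults from the table and rejects values outside the documented choices. It leaves `music` unset unless provided, on the assumption that the server then picks a random tag.

```python
ALLOWED_VALUES = {
    "captionPosition": {"top", "center", "bottom"},
    "orientation": {"portrait", "landscape"},
    "musicVolume": {"low", "medium", "high", "muted"},
}

DEFAULTS = {
    "paddingBack": 0,
    "captionPosition": "bottom",
    "captionBackgroundColor": "blue",
    "voice": "af_heart",
    "orientation": "portrait",
    "musicVolume": "high",
}


def build_config(**overrides) -> dict:
    """Merge overrides into the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS) - {"music"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    config = {**DEFAULTS, **overrides}
    for key, allowed in ALLOWED_VALUES.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")

    padding = config["paddingBack"]
    if not isinstance(padding, int) or padding < 0:
        raise ValueError("paddingBack must be a non-negative number of milliseconds")
    return config
```

For example, `build_config(paddingBack=1500, music="chill")` returns the defaults with those two values overridden, while `build_config(orientation="square")` raises `ValueError`.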

Operational Use

MCP Service Endpoint

Server Communication Paths

/mcp/sse

/mcp/messages

Callable Functionality

create-short-video: Initiates the production of a short video. The controlling LLM determines the optimal parameters; explicit configuration requires careful prompting.
get-video-status: Intended for polling the processing state of a job. Due to inherent temporal reasoning limitations in many AI agents, relying on the REST API for status checks is often more reliable.

REST Interface

GET `/health`

Service operational verification endpoint.

```bash
curl --location 'localhost:3123/health'
```

```json
{
  "status": "ok"
}
```

POST `/api/short-video`

Submits a request to render a new video.

```bash
curl --location 'localhost:3123/api/short-video' \
--header 'Content-Type: application/json' \
--data '{
    "scenes": [
      {
        "text": "Hello world!",
        "searchTerms": ["river"]
      }
    ],
    "config": {
      "paddingBack": 1500,
      "music": "chill"
    }
}'
```

```json
{
  "videoId": "cma9sjly700020jo25vwzfnv9"
}
```
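The same request can be issued from Python using only the standard library. This is a sketch under a few assumptions: the function names are hypothetical, the server runs at the base URL you pass in, and a `config` object is always sent (empty by default) because this document does not say whether it may be omitted.

```python
import json
import urllib.request
from typing import Optional


def build_create_request(
    base_url: str, scenes: list[dict], config: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/short-video request."""
    body = {"scenes": scenes, "config": config or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/short-video",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_video(request: urllib.request.Request) -> str:
    """Send the request and return the "videoId" from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["videoId"]
```

Separating request construction from sending makes the payload easy to inspect or log before any network call happens.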

GET `/api/short-video/{id}/status`

Retrieves the current processing status for a specified video identifier.

```bash
curl --location 'localhost:3123/api/short-video/cm9ekme790000hysi5h4odlt1/status'
```

```json
{
  "status": "ready"
}
```
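Because rendering takes a while, clients typically poll this endpoint until the status becomes `ready`. The sketch below shows one way to do that with a timeout. The helper names are hypothetical, and only the `ready` and `processing` values seen in this document are assumed; any status other than `ready` is treated as "keep waiting".

```python
import json
import time
import urllib.request
from typing import Callable


def status_fetcher(base_url: str, video_id: str) -> Callable[[], str]:
    """Return a function that reads the current status over HTTP."""
    url = f"{base_url.rstrip('/')}/api/short-video/{video_id}/status"

    def fetch() -> str:
        with urllib.request.urlopen(url) as response:
            return json.load(response)["status"]

    return fetch


def wait_until_ready(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the status is "ready"; return False if the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "ready":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(status_fetcher("http://localhost:3123", video_id))`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.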

GET `/api/short-video/{id}`

Fetches the finalized video binary data.

```bash
curl --location 'localhost:3123/api/short-video/cm9ekme790000hysi5h4odlt1'
```

Response body contains the raw video stream.
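Since the response is a binary stream, it is usually written to disk in chunks rather than loaded into memory at once. A minimal sketch, with hypothetical helper names and an output path chosen by the caller:

```python
import shutil
import urllib.request
from typing import BinaryIO


def save_stream(source: BinaryIO, destination_path: str, chunk_size: int = 64 * 1024) -> int:
    """Copy a binary stream to a file in chunks and return the bytes written."""
    with open(destination_path, "wb") as out:
        shutil.copyfileobj(source, out, chunk_size)
        return out.tell()


def download_video(base_url: str, video_id: str, destination_path: str) -> int:
    """Fetch GET /api/short-video/{id} and save the body to destination_path."""
    url = f"{base_url.rstrip('/')}/api/short-video/{video_id}"
    with urllib.request.urlopen(url) as response:
        return save_stream(response, destination_path)
```

Keeping `save_stream` separate from the HTTP call means the copy logic can be reused with any file-like object.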

GET `/api/short-videos`

Lists all currently processed or processing video jobs.

```bash
curl --location 'localhost:3123/api/short-videos'
```

```json
{
  "videos": [
    {
      "id": "cma9wcwfc0000brsi60ur4lib",
      "status": "processing"
    }
  ]
}
```
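A listing like the one above is easy to filter client-side, for instance to find which videos can already be downloaded. The helper below is a hypothetical sketch that assumes the response shape shown here (a `videos` array of objects with `id` and `status`).

```python
def ids_with_status(listing: dict, status: str) -> list[str]:
    """Return the ids of all videos in the listing that have the given status."""
    return [
        video["id"]
        for video in listing.get("videos", [])
        if video.get("status") == status
    ]
```

With the sample response, `ids_with_status(listing, "processing")` returns `["cma9wcwfc0000brsi60ur4lib"]` and `ids_with_status(listing, "ready")` returns an empty list.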

DELETE `/api/short-video/{id}`

Removes a video artifact from the system storage.

```bash
curl --location --request DELETE 'localhost:3123/api/short-video/cma9wcwfc0000brsi60ur4lib'
```

```json
{
  "success": true
}
```

GET `/api/voices`

Lists all identifiers for supported synthetic voices.

```bash
curl --location 'localhost:3123/api/voices'
```

```json
[
  "af_heart", "af_alloy", "af_aoede", "af_bella", "af_jessica", "af_kore", "af_nicole",
  "af_nova", "af_river", "af_sarah", "af_sky",
  "am_adam", "am_echo", "am_eric", "am_fenrir", "am_liam", "am_michael", "am_onyx",
  "am_puck", "am_santa",
  "bf_emma", "bf_isabella", "bm_george", "bm_lewis", "bf_alice", "bf_lily",
  "bm_daniel", "bm_fable"
]
```

GET `/api/music-tags`

Lists available mood tags for background music selection.

```bash
curl --location 'localhost:3123/api/music-tags'
```

```json
[
  "sad", "melancholic", "happy", "euphoric/high", "excited", "chill",
  "uneasy", "angry", "dark", "hopeful", "contemplative", "funny/quirky"
]
```

Issue Resolution

Docker Container Operation

The service needs at least 3 GB of free RAM. Make sure Docker has enough memory allocated.

If you run Docker through WSL2 on Windows, set the memory limit in the global `.wslconfig` file (see the Microsoft documentation); otherwise, adjust it in the Docker Desktop settings.

NPM Installation Path

Confirm that all required Node Package Manager dependencies have been successfully installed.

n8n Configuration Notes

Establishing the connectivity path between n8n and the service (whether via MCP or REST) is contingent on the local deployment topology of both components. Refer to the matrix below for connection string recommendations:

Scenario	n8n Running Locally (`n8n start`)	n8n Running in Docker Locally	n8n Hosted in the Cloud
`short-video-maker` in Docker (Local)	`http://localhost:3123`	Connection relies on host networking (`http://host.docker.internal:3123`) or shared Docker network configuration (`http://short-video-maker:3123`)	Requires external cloud deployment of the video service.
`short-video-maker` via npm/npx (Local)	`http://localhost:3123`	Accessible via host gateway: `http://host.docker.internal:3123`	Requires external cloud deployment of the video service.
`short-video-maker` in Cloud Environment	Use your specific public IP: `http://{YOUR_IP}:3123`	Use your specific public IP: `http://{YOUR_IP}:3123`	Use your specific public IP: `http://{YOUR_IP}:3123`
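The table can also be expressed as a small lookup, which may help when scripting an n8n setup. This is only a sketch of the rules above: the function name, the string labels for each location, and the `shared_docker_network` flag are all hypothetical.

```python
from typing import Optional

PORT = 3123


def service_base_url(
    service: str,
    n8n: str,
    public_ip: Optional[str] = None,
    shared_docker_network: bool = False,
) -> str:
    """Pick the base URL n8n should use to reach short-video-maker.

    service: "docker", "npm", or "cloud" (where short-video-maker runs)
    n8n:     "local", "docker", or "cloud" (where n8n runs)
    """
    if service == "cloud":
        if not public_ip:
            raise ValueError("a public IP is needed for a cloud-hosted service")
        return f"http://{public_ip}:{PORT}"
    if n8n == "cloud":
        # A cloud-hosted n8n cannot reach a service on your local machine.
        raise ValueError("deploy short-video-maker to the cloud first")
    if n8n == "local":
        return f"http://localhost:{PORT}"
    if n8n == "docker":
        if service == "docker" and shared_docker_network:
            return f"http://short-video-maker:{PORT}"
        return f"http://host.docker.internal:{PORT}"
    raise ValueError(f"unknown n8n location: {n8n}")
```

For example, `service_base_url("docker", "docker", shared_docker_network=True)` returns `http://short-video-maker:3123`, matching the shared-network case in the table.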

Cloud Deployment Strategy

Although configurations vary across VPS providers, follow these guidelines for public deployment:

Operating System: Use Ubuntu, version 22.04 or newer.
Resource Allocation: Minimum 4GB RAM, 2 vCPUs, and 5GB storage.
Process Management: Employ pm2 for server lifecycle management (start, stop, logging).
Environment Variables: Configure persistent environment variables, typically within the .bashrc file or equivalent shell profile.

Frequently Asked Questions

Can other languages (e.g., French, German) be processed?

No. The underlying speech synthesis library, Kokoro-js, currently supports English only.

Can I supply custom images or video clips for stitching?

No, the system is not designed to ingest and combine user-provided visual assets.

Which execution method is superior: npm or Docker?

Docker is the recommended way to run the project because it bundles the required dependencies.

How heavily is the GPU utilized during rendering?

GPU usage is minimal, confined solely to accelerating Whisper.cpp inference if a CUDA-enabled build is used. The core visual assembly (Remotion) and TTS generation (Kokoro-js) are predominantly CPU-bound tasks.

Is there an interactive graphical interface available for generating content?

Not natively, though a community-developed Web UI is available (see the "Browser Interface (Web UI)" section).

Can the background video source be customized away from Pexels, or can I upload my own?

No, Pexels remains the sole provider for background footage.

Can the system generate videos based on static images?

No, image-to-video conversion is outside the scope of this tool.

Required Components for Video Synthesis

Component	Version	License	Function
Remotion	^4.0.286	Remotion License	Programmatic video assembly and final output rendering.
Whisper CPP	v1.5.5	MIT	Speech-to-text conversion for subtitle generation.
FFmpeg	^2.1.3	LGPL/GPL	Fundamental audio and visual stream handling/manipulation.
Kokoro.js	^1.2.0	MIT	Text-to-speech rendering engine.
Pexels API	N/A	Pexels Terms	Source for environmental background video clips.

Contribution Guidelines

We welcome contributions via Pull Requests. Consult the CONTRIBUTING.md file for detailed instructions on configuring your local development environment.

Licensing

This project operates under the provisions of the MIT License.

Credits and Recognition

❤️ Remotion for enabling declarative video creation.
❤️ Whisper for high-accuracy speech recognition.
❤️ Pexels for access to stock visual media.
❤️ FFmpeg for comprehensive media processing capabilities.
❤️ Kokoro for synthetic voice narration.

automated-short-clip-fabricator

Author

gyoridavid
