The Technology Inside Foundry: Multiple AIs Working Together

Technology Deep Dive

Foundry is not one AI. It is several specialized AI systems that coordinate during every generation. Each handles a different part of the process, and together they produce results that none of them could achieve alone.

The music generator: ACE-Step 1.5

Audio creation starts with a Diffusion Transformer (DiT) model combined with a VAE decoder. The DiT works in a compressed latent space, generating audio step by step from noise, guided by your description. The VAE then decodes the latent representation into actual waveform audio.
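The pipeline above can be sketched in miniature: iteratively denoise a compressed latent, then expand it into a much longer waveform. This is a toy illustration with made-up shapes and stand-in functions, not ACE-Step's actual networks — a real DiT predicts noise conditioned on the text prompt, and a real VAE is a learned decoder.

```python
import numpy as np

def denoise_step(latent, noise_pred, step_size=0.1):
    # Each step removes a fraction of the predicted noise from the latent.
    return latent - step_size * noise_pred

def generate_latent(steps=50, latent_shape=(8, 256), seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)  # start from pure noise
    for _ in range(steps):
        # Stand-in for the DiT's prompt-conditioned noise prediction:
        # using the latent itself shrinks it steadily toward zero.
        noise_pred = latent
        latent = denoise_step(latent, noise_pred)
    return latent

def vae_decode(latent, upsample=64):
    # Stand-in for the VAE decoder: expand the compressed latent
    # into a waveform-length signal.
    return np.repeat(latent.mean(axis=0), upsample)

latent = generate_latent()
audio = vae_decode(latent)
print(audio.shape)
```

The key property the sketch preserves is the division of labor: the step loop operates entirely in the small latent space, and only the final decode produces audio-rate samples.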

This is fully synthesized audio. Every note, every vocal syllable, every drum hit is generated from scratch for your specific prompt. No samples, no loops, no splicing of pre-existing recordings.

On powerful GPUs, the generator reaches up to 15x realtime speed: a 3-minute track renders in roughly 12 seconds rather than minutes.
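The speed claim is simple division — at N-times realtime, render time is track duration over the speed factor:

```python
# Back-of-envelope check of the realtime-speed figure above.
track_seconds = 3 * 60        # a 3-minute track
speed_factor = 15             # 15x realtime on a powerful GPU
render_seconds = track_seconds / speed_factor
print(render_seconds)  # 12.0
```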

The 5Hz Planner LM

Before the music generator starts, your text description passes through a language model called the Planner. Its job is to expand a simple idea into structured musical metadata: tempo, key, scale, time signature, section layout, instrumentation, and detailed production notes.

Think of it this way. You write "warm jazz ballad with soft piano." The Planner translates that into something the generator can work with precisely: specific BPM, a particular key, song structure with defined sections, instrument assignments, and production characteristics.
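A blueprint of that shape might look like the following. The field names, values, and checks here are illustrative assumptions for the example prompt above, not Foundry's actual schema.

```python
# Hypothetical Planner output for "warm jazz ballad with soft piano".
plan = {
    "bpm": 72,
    "key": "Eb",
    "scale": "major",
    "time_signature": "4/4",
    "sections": ["intro", "verse", "chorus", "verse", "chorus", "outro"],
    "instruments": ["piano", "upright bass", "brushed drums"],
    "production": "warm, intimate, soft dynamics, light room reverb",
}

def validate_plan(plan):
    # Minimal sanity checks a downstream generator might require.
    assert 40 <= plan["bpm"] <= 240, "tempo out of range"
    assert plan["sections"], "song needs at least one section"
    assert plan["instruments"], "song needs at least one instrument"
    return True

print(validate_plan(plan))
```

The point of the structured form is that every downstream decision — section boundaries, instrument assignment, mix character — is explicit before a single audio sample is generated.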

The Planner runs as a quantized GGUF model using a high-speed inference engine. Foundry uses Q4, Q5, or Q6 quantization levels depending on available VRAM, keeping the model responsive while fitting within practical hardware constraints.
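Why quantization level tracks available VRAM comes down to bits per weight. The sketch below uses approximate effective bit widths and a 20% overhead factor as illustrative assumptions; real GGUF quants mix block formats and carry scale metadata, so exact sizes vary by model.

```python
# Rough VRAM estimate: parameters x bits-per-weight, plus overhead.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5}  # assumed effective widths

def estimate_vram_gb(params_billion, quant, overhead=1.2):
    bits = BITS_PER_WEIGHT[quant]
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

for q in ("Q4", "Q5", "Q6"):
    print(q, round(estimate_vram_gb(7, q), 2))
```

For a hypothetical 7B-parameter Planner, the estimate lands near 4.7 GB at Q4 and near 6.8 GB at Q6 — which is why the quantization choice moves with the VRAM tier.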

Creative AI: structured writing

The Creative AI (also called Music Writer) is a separate system that helps you draft generation-ready prompts and lyrics. It is more than autocomplete. The AI understands song structure, duration constraints, and the relationship between caption detail and generation quality.

It validates outputs against practical limits, catches prompts that would produce poor results, and iterates with you conversationally until the creative brief is solid. Multiple intelligence levels let you choose between fast drafts and carefully refined output.
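Pre-flight validation of that kind can be sketched as a simple rule check. The limits and messages below are hypothetical, not Foundry's actual rules.

```python
# Toy sketch of a creative-brief validator: flag prompts that would
# produce poor results before any generation time is spent.
def check_brief(caption, duration_s, max_duration_s=240):
    issues = []
    if duration_s > max_duration_s:
        issues.append(f"duration {duration_s}s exceeds {max_duration_s}s limit")
    if len(caption.split()) < 3:
        issues.append("caption too sparse; add genre, mood, or instrumentation")
    return issues

print(check_brief("jazz", 300))                             # two issues
print(check_brief("warm jazz ballad with soft piano", 180)) # clean brief
```

An empty issue list means the brief is generation-ready; otherwise the conversational loop described above would surface the problems and iterate.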

This system also runs as a quantized model, separate from the Planner and the generator, to keep the writing workflow responsive while other components are active.

LM-CFG: negative keyword steering

Traditional text-to-audio systems struggle with negative prompts. Tell them "not aggressive" and the model often latches onto the word "aggressive" itself, pushing the result toward the very quality you wanted to avoid. Foundry handles this with a two-stage LM-CFG system:

  1. Steering stage: The Planner LM is guided away from restricted concepts while building the metadata blueprint. The avoidance happens at the planning level, before the generator even starts.
  2. Inference rewrite stage: A separate rewrite step refines the final caption so the generator receives a clean, consistent instruction that respects your negative keywords without destroying the creative intent.

The result is negative prompting that actually works. "Energetic but not aggressive" gives you energy with control, instead of silence or chaos.
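The steering stage relies on classifier-free-guidance-style vector arithmetic: extrapolate from the negative-conditioned output through the conditional output, moving the result away from the unwanted concept. This is a toy numeric sketch of that idea, not Foundry's LM-CFG implementation.

```python
import numpy as np

def guided(cond, neg_cond, scale=2.0):
    # scale > 1 pushes the result further away from the negative concept.
    return neg_cond + scale * (cond - neg_cond)

cond = np.array([0.8, 0.2])   # toy output for "energetic"
neg = np.array([0.6, 0.6])    # toy output leaning toward "aggressive"
out = guided(cond, neg)
print(out)
```

Because the steering happens on model outputs rather than by deleting words from the prompt, the positive intent ("energetic") survives while the negative direction ("aggressive") is actively suppressed.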

Stem separation: Demucs

When you use the Separate feature, Foundry runs a neural network (Demucs) that splits a mixed audio track into individual stems: vocals, drums, bass, guitar, piano, and an "other" catch-all. A karaoke mode combines everything except vocals into a single instrumental track.

The separated stems land as aligned tracks in the timeline, immediately ready for editing, remixing, or recombination.
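The karaoke mode described above is, at its core, a sum over stems with vocals excluded. A minimal sketch with toy one-channel signals:

```python
import numpy as np

# Toy separated stems (real stems are full-length multichannel audio).
stems = {
    "vocals": np.array([0.5, 0.5, 0.5]),
    "drums":  np.array([0.1, 0.0, 0.1]),
    "bass":   np.array([0.2, 0.2, 0.2]),
    "other":  np.array([0.0, 0.1, 0.0]),
}

# Karaoke mix: every stem except vocals, summed into one instrumental.
instrumental = sum(s for name, s in stems.items() if name != "vocals")
print(instrumental)
```

Because the stems are sample-aligned, summing them (or any subset) reconstructs a coherent mix — the same property that makes the timeline remixing described above possible.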

VRAM management: fitting it all in

Running multiple AI models on one GPU creates a real engineering problem: VRAM is finite. Foundry uses a centralized VRAM estimator to figure out what fits and what needs to swap.

Practical VRAM tiers:

  • 8 to 10 GB: the music generator works well, but the Planner LM may not fit alongside it
  • 10 to 12 GB: generator plus a quantized Planner, often with Ultra-VRAM mode enabled
  • 12 GB and above: the recommended baseline, smooth operation of the full system
  • 16 GB and above: extra headroom for larger Creative AI models and less swapping

Ultra-VRAM mode changes the peak VRAM requirement from "everything loaded at once" to "the largest single model." Components swap in and out as needed. It adds a small overhead per generation but lets lower-VRAM cards run the complete workflow.
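The swap-based scheduling reduces to one line of arithmetic: the resident peak drops from the sum of all model sizes to the largest single model. The sizes below are illustrative assumptions, not Foundry's actual footprints.

```python
# Hypothetical component sizes in GB.
models = {"generator": 6.0, "planner": 4.5, "creative": 3.0}

peak_all_loaded = sum(models.values())  # everything resident at once
peak_ultra_vram = max(models.values())  # swap in one model at a time

print(peak_all_loaded, peak_ultra_vram)
```

In this example the requirement falls from 13.5 GB to 6.0 GB — the difference between needing a 16 GB card and fitting on a 8-to-10 GB one, at the cost of swap time per generation.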

Why the multi-model approach matters

A single monolithic model cannot plan structure, generate audio, handle negative prompts, and assist with writing all at the same quality level. Specialized components, each optimized for its role, produce better results than one model trying to do everything.

When you use Foundry, these systems coordinate behind the scenes. You type a description, the Planner expands it, the generator creates audio, and the timeline lets you shape the result. The complexity is real, but you never have to think about it.