For about three years, the unwritten rule of AI video was: generate the picture, then hire a sound editor. As of this spring, that rule no longer holds. Veo 3.1, Kling 3.0, and Vidu Q3 all now produce dialogue, sound effects, ambient noise, and music in the same forward pass as the visuals.
This is a quieter shift than a new flagship model launch, and it matters more than most of those launches. The post-production gap, the chunk of any AI film pipeline where a silent clip waited around for a human to score it, is closing fast.
What "native audio" actually means
The phrase gets thrown around loosely. The specific thing happening is that the audio track is generated jointly with the video frames, conditioned on the same prompt and the same internal representation. Footsteps land on the beat. Rain gets louder when the camera pushes in on a window. A line of dialogue is timed to the lips that speak it, in the take that gets sampled, with the room tone the scene called for.
This is not a text-to-speech pass stitched to a render. It is one model emitting two synchronized streams. That distinction is what makes the output usable on the first try, rather than something you have to nudge into alignment with a DAW.
Google's Veo 3.1 leans on ambient and environmental sound and clean English speech. Kuaishou's Kling 3.0 ships a multilingual lip-sync variant covering five languages and a multi-shot mode that holds audio continuity across cuts. Shengshu's Vidu Q3 is the dark horse, especially for music-led pieces. All three changed shape between January and April. None of them are perfect. All of them are good enough that the silent-render assumption is dead.
What this breaks in a typical pipeline
A standard AI short used to look like this: prompt to silent clip, manual edit, manual sound design, manual score, manual mix. Maybe ten hours of post for a thirty-second piece. The new shape is shorter and stranger.
The bottleneck moves from sourcing audio to managing it. You now have to pre-decide what the audio should be at the prompt stage, because the model is going to commit to a take. If you let the model improvise, you get a soundtrack that is coherent within a shot and inconsistent across shots, because the model has no idea what the previous clip sounded like. A character's voice drifts. The wind goes from a whisper to a howl between cuts. Background chatter changes language. The model picks up the visual energy of each shot independently, which is fine inside a single take and brutal across an edit.
The fix is the same fix that solved character drift in the visual layer last year: a reference. The serious workflow that is emerging is to lock dialogue with a single voice reference, lock ambience with a written audio bible, and let the model fill in Foley and incidentals shot by shot. Treat the model like a junior sound designer who needs a brief, not a magic box. The teams already doing this are getting clips that cut together cleanly. The teams that aren't are spending the post-production hours they thought they saved.
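As a concrete sketch of what that brief could look like in practice, here is one way to carry the locked decisions into every per-shot prompt. The AudioBible structure, its field names, and build_shot_prompt are hypothetical scaffolding for whatever generation API you are calling, not any model's actual interface.

```python
# A minimal sketch of carrying one audio brief across shots.
# The structure and names are illustrative, not any model's real API.

from dataclasses import dataclass

@dataclass
class AudioBible:
    voice_reference: str   # voice description or reference clip reused for every shot
    ambience: str          # the bed of sound the whole scene shares
    music: str             # score direction, or "none" to keep dialogue clear

def build_shot_prompt(visual_direction: str, dialogue: str, bible: AudioBible) -> str:
    """Fold the locked audio decisions into each per-shot prompt,
    so the model only improvises Foley and incidentals."""
    return "\n".join([
        f"Visual: {visual_direction}",
        f"Dialogue ({bible.voice_reference}): {dialogue}",
        f"Ambience: {bible.ambience}",
        f"Music: {bible.music}",
        "Foley: match the on-screen action",
    ])

scene = AudioBible(
    voice_reference="same low, tired male voice as shot 1",
    ambience="steady rain on glass, distant traffic, dense room tone",
    music="none",
)

# Two shots from the same scene draw on the same brief, so the voice
# and the rain stay consistent across the cut.
for shot, (visual, line) in enumerate([
    ("slow push-in on the window", "I told you it would rain."),
    ("cut to a close-up of his hands", "You never listen."),
], start=1):
    print(f"--- shot {shot} ---")
    print(build_shot_prompt(visual, line, scene))
```

The specific fields matter less than the discipline: anything that has to survive a cut gets written down once and restated in every prompt, and anything that can vary shot to shot is left to the model.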
The new failure modes
Three problems are showing up in real productions. Lip sync looks correct frame by frame but reads as uncanny across a longer take, because the micro-expressions don't quite track the phonemes. Music tries to score the action and sometimes buries the dialogue in the mix. And the audio is too clean. Real productions have room tone, breath, the scuff of a chair. Native audio defaults to a slightly sterile mix that telegraphs "AI" the way bad temporal artifacts used to telegraph "AI" in the visual layer.
None of these are dealbreakers. They are the next round of work for a human supervisor. Watch for a class of editors who specialize in nudging native-audio output toward the imperfection of a real shoot. That role did not exist in January and will be in demand by autumn.
What to do about it
If you are working on anything longer than a thirty-second piece, stop treating audio as a post step and start writing audio direction into your prompts. A line of dialogue should be specified the same way you specify a camera move. Ambience should be specified the way you specify a color palette. Build an audio reference set the same way you build a character bible.
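To make "specify dialogue the way you specify a camera move" concrete, here is a hypothetical shot prompt written that way. The character, the wording, and the line order are illustrative, not a required format.

```python
# A hypothetical shot prompt where every audio decision is written with the
# same specificity as the camera and palette decisions.
shot_prompt = "\n".join([
    "Camera: handheld, slow push-in on the kitchen window.",
    "Palette: warm tungsten interior against blue-grey rain outside.",
    'Dialogue: MARA, dry and close-mic\'d, almost a whisper: "You left the stove on again."',
    "Ambience: rain on glass, refrigerator hum, low room tone throughout.",
    "Foley: chair scuff as she stands, mug set down on wood.",
    "Music: none; hold silence under the line.",
])
print(shot_prompt)
```

The content changes for every shot; the structural point is that the audio lines are as specific as the visual ones.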
The teams that win in the next twelve months will be the ones that move sound design earlier in the pipeline rather than later. That is the shape of an orchestrated pipeline like Promvie's: script, cast, shots, and score moving together rather than in sequence. Native audio does not replace sound designers. It replaces the silent draft.
The post-production gap is closing. The teams that close it on purpose will ship faster than the ones that keep treating audio as a separate problem.