MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide
Table of Contents
- MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide
- Quick Summary
- Contents
- What This Recreates
- Final Working State
- Why This Setup Exists
- 1. The intended local path was MOSS-TTS on MLX, not open Voxtral TTS
- 2. Plain python3 -m venv plus pip install failed under Homebrew Python / PEP 668
- 3. Python 3.14 was a bad fit for this dependency graph
- 4. The original setup instructions referenced a script that did not exist locally
- 5. The codec path was wrong when running from the workspace root
- 6. Mixed narration plus background music hurt voice cloning quality
- 7. MossFormer2 model naming was inconsistent
- 8. Generated audio had hiss
- 9. Generated words were being cut off
- 10. Hugging Face warnings still appear even when generation is local
- Canonical Workspace Layout
- Prerequisites
- Rebuild From Scratch
- Standard Commands
- Generate speech from a reference voice
- Clean a mixed reference, then clone from the cleaned voice
- Generate from inline text instead of a text file
- Disable adaptive behavior and force fixed generation settings
- Clean a mixed narration reference with Demucs only
- Clean a mixed narration reference with Demucs plus MossFormer2
- Use SAM-Audio instead of Demucs
- Apply de-hiss to a generated WAV after the fact
- Force strict offline behavior
- What Is Required vs Optional
- Known Warnings and What They Mean
- Exact Helper Scripts
- Notes for Future Reconstruction
- Minimal Reproduction Checklist
MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide
This guide documents a fully local MOSS-TTS voice cloning setup on Apple Silicon using MLX, a pinned mlx-audio checkout, and a small set of custom helper scripts. It is aimed at macOS users on M1, M2, M3, or M4 hardware who want local voice cloning without a hosted TTS service.
It is designed to be public, searchable, and reconstruction-friendly. A human or an LLM should be able to rebuild the same working environment from this document alone.
Topics covered here include:
- MOSS-TTS 8B on MLX
- Apple Silicon local voice cloning
mlx-audiopinned setup- reference audio cleanup
- Demucs, SAM-Audio, and MossFormer2 integration
- de-hiss cleanup
- adaptive chunking and token handling for long-form generation
This guide intentionally excludes personal paths, machine names, personal filenames, and private transcript contents. All commands use example paths and filenames.
Quick Summary
If you only need the working shape:
- Create a clean working directory.
- Add the root helper scripts from this document.
- Run
setup_moss_m1.shto create the Python3.12environment, clonemlx-audio, pin the commit, and download the local MOSS model folders. - Add the custom
mlx-audio/moss_clone_local.pyfrom this document into the cloned repo. - Optionally install
demucsandtorchcodecfor reference cleanup. - Generate speech with the example commands in the run section.
Contents
- What this recreates
- Final working state
- Why this setup exists
- Canonical workspace layout
- Prerequisites
- Rebuild from scratch
- Standard commands
- Required vs optional pieces
- Known warnings and fixes
- Exact helper scripts
- Minimal reproduction checklist
What This Recreates
This recovers a workspace with:
- A root working directory containing a Python virtualenv and helper scripts.
- A pinned
mlx-audioclone at commitecb8d6a3b3e09efa3c323cb1563baeff6e421d13. - Local model directories for:
mlx-community/MOSS-TTS-8B-8bitOpenMOSS-Team/MOSS-Audio-Tokenizer
- A custom
moss_clone_local.pyrunner that:- accepts MP3 or WAV reference audio
- resolves the codec path correctly
- chunks long text
- adaptively increases token budget and splits capped chunks
- optionally applies post-generation de-hiss
- An optional reference-audio cleanup pipeline using:
- Demucs vocal extraction
- optional SAM-Audio extraction
- optional MossFormer2 enhancement
- A standalone de-hiss helper for generated WAVs
Final Working State
The working state captured here is:
- Apple Silicon target
- Python
3.12.x mlx-audiogit commitecb8d6a3b3e09efa3c323cb1563baeff6e421d13mlx0.31.1mlx-lm0.31.1huggingface_hub1.9.0sounddevice0.5.3librosa0.11.0misaki0.9.4demucs4.0.1torchcodec0.11.0
The root workspace itself does not need to be a git repo. The only pinned git checkout is ./mlx-audio.
Why This Setup Exists
This was not a straight upstream install. The current state is the result of several fixes applied in order.
1. The intended local path was MOSS-TTS on MLX, not open Voxtral TTS
The target was fully local, Apple Silicon, 8B-class voice cloning from reference audio. The practical path was:
MOSS-TTS-8B-8biton MLX- not the open Voxtral TTS release
That led to pinning the mlx-audio branch/commit that contained the MOSS-TTS integration.
2. Plain python3 -m venv plus pip install failed under Homebrew Python / PEP 668
An early attempt hit:
error: externally-managed-environment
The practical fix was to avoid relying on the default interpreter and explicitly use a Homebrew Python 3.12 binary to create the virtualenv.
3. Python 3.14 was a bad fit for this dependency graph
The first venv path used a newer interpreter and the install failed because mlx-audio[tts] depends on:
misaki>=0.9.4
At that moment the available wheel compatibility did not line up with the chosen interpreter, which caused the pip install -e ".[tts]" step to fail.
The fix was:
- pin the local env to Python
3.12
4. The original setup instructions referenced a script that did not exist locally
The initial workflow assumed a moss_clone_local.py script already existed beside the downloaded model folders. It did not.
A custom runner script was created to provide:
- model loading
- local text input loading
- MP3 to WAV conversion for the reference voice
- chunked generation
- WAV output writing
5. The codec path was wrong when running from the workspace root
Running the custom runner from the workspace root produced:
FileNotFoundError: Config not found at moss-audio-tokenizer-full/config.json
Root cause:
mlx-audioexpected the codec folder to be discoverable relative to the process working directory- the runner was being executed from outside the
mlx-audiodirectory
The fix was to add a loader working-directory resolver so the runner temporarily changes to a directory where ./moss-audio-tokenizer-full/config.json is visible before calling load_model(...).
6. Mixed narration plus background music hurt voice cloning quality
Reference audio that included music was usable, but identity and cleanliness degraded badly.
An optional voice-isolation helper was added so the reference audio can be cleaned before cloning. The helper evolved through a few stages:
- first SAM-Audio
- then SAM plus MossFormer2
- then Demucs-based extraction became the default practical option
The current helper defaults to:
demucs
This was the most practical local path for narration plus background music in this workspace.
7. MossFormer2 model naming was inconsistent
One attempt to use MossFormer2 failed with a 404 on an older or renamed Hugging Face repo path.
The helper was updated to try multiple candidate IDs:
starkdmi/MossFormer2-SEstarkdmi/MossFormer2_SE_48K_MLX
8. Generated audio had hiss
A standalone dehiss_wav.py helper was added using ffmpeg filters.
Then the same de-hiss logic was integrated directly into the MOSS runner behind:
--dehiss--dehiss-preset
9. Generated words were being cut off
The most important generation-quality bug was not just hiss. Some generations dropped word endings or jumped to the next word.
This turned out to be primarily a token-budget problem:
- the runner defaulted to a fixed
--max-tokens 300 - at least some chunks hit the ceiling exactly
- exact cap hits meant the model was being truncated mid-utterance
The fix was not to force manual tuning every run. The runner was extended with:
- adaptive token budget selection
- capped-chunk detection
- retry at a larger budget
- safe chunk splitting when still capped
That is now the default behavior.
10. Hugging Face warnings still appear even when generation is local
Even with local model weights present, some parts of the stack still use huggingface_hub for cache checks or fallback metadata access.
That means you may still see warnings like:
- unauthenticated Hugging Face Hub requests
This does not by itself mean generation is happening remotely.
If strict offline behavior is required, run with:
HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...Canonical Workspace Layout
Use this layout:
/path/to/workdir/
.venv/
setup_moss_m1.sh
isolate_voice_local.py
dehiss_wav.py
mlx-audio/
moss_clone_local.py
moss-tts-8bit/
moss-audio-tokenizer-full/Temporary outputs, personal audio, generated WAVs, and input transcript files are intentionally not part of the canonical layout.
Prerequisites
Install these system tools first:
brew install ffmpeg git [email protected]Optional but useful:
brew install pipxRebuild From Scratch
1. Create a clean working directory
mkdir -p /path/to/workdir
cd /path/to/workdir2. Create the root helper scripts
Create these root-level files from the inline script appendix later in this document:
./setup_moss_m1.sh./isolate_voice_local.py./dehiss_wav.py
Make them executable:
chmod +x ./setup_moss_m1.sh ./isolate_voice_local.py ./dehiss_wav.pyDo not try to place ./mlx-audio/moss_clone_local.py yet. The mlx-audio clone does not exist until the next step.
3. Run the setup script
./setup_moss_m1.shThis does all of the following:
- creates a fresh Python
3.12venv at./.venv - clones
mlx-audioif missing - checks out commit
ecb8d6a3b3e09efa3c323cb1563baeff6e421d13 - installs
mlx-audio[tts] - installs
huggingface_hub - downloads the MOSS-TTS model to
./mlx-audio/moss-tts-8bit - downloads the audio tokenizer to
./mlx-audio/moss-audio-tokenizer-full
4. Add the custom MOSS runner into the pinned clone
After the setup script completes, create this file from the inline script appendix later in this document:
./mlx-audio/moss_clone_local.py
This file is required because it is a custom workspace script, not an upstream file guaranteed to exist in the pinned mlx-audio checkout.
5. Install the optional reference-cleaning dependencies
The base setup script is enough for TTS generation. To reproduce the full current helper workflow, also install:
/path/to/workdir/.venv/bin/python -m pip install demucs torchcodec6. Verify the environment
/path/to/workdir/.venv/bin/python --version
/path/to/workdir/.venv/bin/pip show mlx mlx-lm huggingface_hub misaki demucs torchcodec
cd /path/to/workdir/mlx-audio
git rev-parse HEADExpected mlx-audio commit:
ecb8d6a3b3e09efa3c323cb1563baeff6e421d137. Optional offline and auth behavior
If you want higher Hugging Face rate limits for model downloads:
/path/to/workdir/.venv/bin/hf auth loginIf you want strict no-network execution after all models are already local:
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1Standard Commands
Generate speech from a reference voice
This is the normal command now. Adaptive chunk handling is on by default, so manual --max-tokens tuning is usually not needed.
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
--model /path/to/workdir/mlx-audio/moss-tts-8bit \
--ref-audio /path/to/workdir/reference_voice.mp3 \
--text-file /path/to/workdir/input.txt \
--output /path/to/workdir/output.wav \
--lang-code en \
--dehissClean a mixed reference, then clone from the cleaned voice
This is the most practical end-to-end path when the reference audio contains narration plus background music.
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
--input /path/to/workdir/reference_mixed.wav \
--output /path/to/workdir/reference_clean.wav \
--mode demucs \
--verbose
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
--model /path/to/workdir/mlx-audio/moss-tts-8bit \
--ref-audio /path/to/workdir/reference_clean.wav \
--text-file /path/to/workdir/input.txt \
--output /path/to/workdir/output.wav \
--lang-code en \
--dehissGenerate from inline text instead of a text file
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
--model /path/to/workdir/mlx-audio/moss-tts-8bit \
--ref-audio /path/to/workdir/reference_voice.mp3 \
--text "This is a local test of MOSS-TTS on Apple Silicon." \
--output /path/to/workdir/output.wav \
--lang-code en \
--dehissDisable adaptive behavior and force fixed generation settings
Only do this for debugging or controlled experiments.
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
--model /path/to/workdir/mlx-audio/moss-tts-8bit \
--ref-audio /path/to/workdir/reference_voice.wav \
--text-file /path/to/workdir/input.txt \
--output /path/to/workdir/output.wav \
--lang-code en \
--no-adaptive \
--max-tokens 500 \
--chunk-chars 240Clean a mixed narration reference with Demucs only
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
--input /path/to/workdir/reference_mixed.wav \
--output /path/to/workdir/reference_clean.wav \
--mode demucs \
--verboseClean a mixed narration reference with Demucs plus MossFormer2
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
--input /path/to/workdir/reference_mixed.wav \
--output /path/to/workdir/reference_clean.wav \
--mode demucs+mossformer2 \
--verboseUse SAM-Audio instead of Demucs
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
--input /path/to/workdir/reference_mixed.wav \
--output /path/to/workdir/reference_clean.wav \
--mode sam \
--description "A person speaking" \
--verboseApply de-hiss to a generated WAV after the fact
/path/to/workdir/.venv/bin/python /path/to/workdir/dehiss_wav.py \
--input /path/to/workdir/output.wav \
--output /path/to/workdir/output_clean.wav \
--preset balancedForce strict offline behavior
If you want the run to fail rather than touch Hugging Face at all:
HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
--model /path/to/workdir/mlx-audio/moss-tts-8bit \
--ref-audio /path/to/workdir/reference_voice.wav \
--text-file /path/to/workdir/input.txt \
--output /path/to/workdir/output.wav \
--lang-code enWhat Is Required vs Optional
Required for local voice cloning:
setup_moss_m1.shmlx-audio/moss_clone_local.py- local
moss-tts-8bitmodel folder - local
moss-audio-tokenizer-fullfolder - Python
3.12 ffmpeg
Optional but practical:
isolate_voice_local.pydehiss_wav.pydemucstorchcodecSAM-AudioMossFormer2
Known Warnings and What They Mean
externally-managed-environment
Cause:
- using a Homebrew-managed interpreter directly instead of a proper venv flow
Fix:
- create a venv explicitly with Python
3.12 - use the venv’s
pythonandpip
No matching distribution found for misaki>=0.9.4
Cause:
- incompatible interpreter choice during installation
Fix:
- use Python
3.12
can't open file '.../moss_clone_local.py'
Cause:
- running from the wrong folder or assuming the runner script exists in the workspace root
Fix:
- run the script at
./mlx-audio/moss_clone_local.py - or
cd ./mlx-audiofirst
Config not found at moss-audio-tokenizer-full/config.json
Cause:
- the tokenizer directory is local but not visible from the current working directory
Fix:
- keep
moss-tts-8bitandmoss-audio-tokenizer-fullbeside themlx-audiocheckout - use the custom runner here, which resolves the loader cwd before calling
load_model(...)
Hugging Face unauthenticated warning
Cause:
- local generation still passes through libraries that may consult
huggingface_hubfor cache checks or metadata
Fix options:
- ignore it if the run works
- set
HF_TOKEN - or force offline mode with
HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1
Words get cut off or skipped
Cause:
- token cap reached during generation
Fix:
- use the adaptive runner in this document
Reference voice sounds contaminated by background music
Cause:
- the clone model conditions on the whole reference audio, not just the voice
Fix:
- clean the reference first with
isolate_voice_local.py demucsis the most practical default in this workspace
Exact Helper Scripts
These are the canonical script bodies used for the current working state.
setup_moss_m1.sh
#!/usr/bin/env bash
set -euo pipefail
# Clean local setup for fully-local MOSS-TTS 8B voice cloning on Apple Silicon.
# Run this from the directory where you want the project to live.
ROOT_DIR="$(pwd)"
VENV_DIR="${ROOT_DIR}/.venv"
VENV_PY="${VENV_DIR}/bin/python"
VENV_HF="${VENV_DIR}/bin/hf"
PYTHON_BIN="${PYTHON_BIN:-/opt/homebrew/opt/python@3.12/bin/python3.12}"
brew install ffmpeg git [email protected]
if < class="broken-link">! -x "${PYTHON_BIN}"<>; then
PYTHON_BIN="$(command -v python3.12 || true)"
fi
if < class="broken-link">-z "${PYTHON_BIN}"<>; then
echo "python3.12 not found. Install with: brew install [email protected]"
exit 1
fi
rm -rf "${VENV_DIR}"
"${PYTHON_BIN}" -m venv "${VENV_DIR}"
"${VENV_PY}" -m pip install -U pip setuptools wheel
# Pin the exact OpenMOSS MLX branch commit behind PR #586 (add MOSS-TTS).
if < class="broken-link">.git</span>; then
cd mlx-audio
git fetch --all --tags --prune
else
git clone https://github.com/OpenMOSS/mlx-audio.git
cd mlx-audio
fi
git checkout ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
"${VENV_PY}" -m pip install -e ".[tts]"
"${VENV_PY}" -m pip install -U huggingface_hub
# Download the 8B MLX model and the codec weights into the exact default paths
# expected by the current MOSS-TTS MLX README.
"${VENV_HF}" download mlx-community/MOSS-TTS-8B-8bit --local-dir ./moss-tts-8bit
"${VENV_HF}" download OpenMOSS-Team/MOSS-Audio-Tokenizer --local-dir ./moss-audio-tokenizer-full
echo
echo "Setup complete. Run:"
echo " cd ${ROOT_DIR}/mlx-audio"
echo " ${VENV_PY} moss_clone_local.py --ref-audio ./voice.mp3 --text-file ./input.txt --output ./final.wav --lang-code en"dehiss_wav.py
#!/usr/bin/env python3
"""Apply de-hiss cleanup to a WAV file using ffmpeg audio filters.
Presets are tuned for TTS outputs with broadband high-frequency hiss.
"""
from __future__ import annotations
import argparse
import shutil
import subprocess
from pathlib import Path
PRESETS = {
"mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
"balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
"strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}
def parse_args() ->.:
p = argparse.(description="De-hiss WAV audio with ffmpeg.")
p.("--input", required=True, help="Input WAV/audio path")
p.("--output", required=True, help="Output WAV path")
p.(
"--preset",
choices=sorted(.()),
default="balanced",
help="Filter strength preset",
)
p.(
"--filter",
default=None,
help="Custom ffmpeg -af filter string (overrides --preset)",
)
return p.()
def main() -> None:
args =()
in_path =(.).().()
out_path =(.).().()
if not in_path.():
raise FileNotFoundError(f"Input not found: ")
ffmpeg = shutil.("ffmpeg")
if ffmpeg is None:
raise RuntimeError("ffmpeg not found. Install with: brew install ffmpeg")
af = args.filter if args.filter else PRESETS[args.preset]
out_path.parent.(parents=True, exist_ok=True)
cmd = [
ffmpeg,
"-y",
"-i",
str(),
"-af",
af,
"-hide_banner",
"-loglevel",
"error",
str(),
]
subprocess.(, check=True)
print(f"Wrote: ")
print(f"Filter: ")
if __name__ == "__main__":
()isolate_voice_local.py
#!/usr/bin/env python3
"""Isolate clean narration voice from mixed audio (speech + background music).
Local pipeline on Apple Silicon using mlx-audio:
1) SAM-Audio source separation (speech target extraction)
2) Optional MossFormer2 speech enhancement
"""
from __future__ import annotations
import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple
from huggingface_hub.errors import HfHubHTTPError, RemoteEntryNotFoundError
from mlx_audio import audio_io
from mlx_audio.sts import MossFormer2SEModel, SAMAudio, SAMAudioProcessor, save_audio
from mlx_audio.sts.models.sam_audio.processor import load_audio
from mlx_audio.utils import get_model_path
def parse_args() ->.:
parser = argparse.(
description="Extract clean speech from narration+music audio, fully local."
)
parser.("--input", required=True, help="Input audio path")
parser.(
"--output",
required=True,
help="Output clean voice WAV path",
)
parser.(
"--mode",
default="demucs",
choices=[
"sam",
"mossformer2",
"sam+mossformer2",
"demucs",
"demucs+mossformer2",
],
help="Processing mode (default: demucs)",
)
parser.(
"--description",
default="A person speaking",
help='SAM-Audio text prompt for target sound (default: "A person speaking")',
)
parser.(
"--sam-model",
default="mlx-community/sam-audio-large",
help="SAM model repo or local path",
)
parser.(
"--enhancer-model",
default="starkdmi/MossFormer2-SE",
help="MossFormer2 model repo or local path",
)
parser.(
"--demucs-model",
default="htdemucs_ft",
help="Demucs model name for vocal separation (default: htdemucs_ft)",
)
parser.(
"--sam-strategy",
default="auto",
choices=["auto", "separate", "long"],
help="SAM processing strategy",
)
parser.(
"--long-threshold-sec",
type=float,
default=30.0,
help="Auto mode uses separate_long when input duration exceeds this threshold",
)
parser.(
"--chunk-seconds",
type=float,
default=10.0,
help="SAM long-mode chunk size in seconds",
)
parser.(
"--overlap-seconds",
type=float,
default=3.0,
help="SAM long-mode overlap in seconds",
)
parser.(
"--ode-method",
choices=["midpoint", "euler"],
default="midpoint",
help="SAM ODE method",
)
parser.(
"--ode-step-size",
type=float,
default=(2.0 / 32.0),
help="SAM ODE step size (default: 2/32)",
)
parser.(
"--ode-decode-chunk-size",
type=int,
default=None,
help="SAM decode chunk size to reduce memory (optional)",
)
parser.(
"--anchors",
default=None,
help='Optional SAM anchors for separate mode: "+,1.2,3.0;-,4.0,5.5"',
)
parser.(
"--save-residual",
default=None,
help="Optional path to save background/residual track from SAM",
)
parser.(
"--keep-intermediate",
action="store_true",
help="Keep SAM intermediate target WAV when running sam+mossformer2",
)
parser.(
"--verbose",
action="store_true",
help="Verbose logs",
)
return parser.()
def parse_anchors(raw:[str]) ->[[[[str, float, float]]]]:
if not raw:
return None
items: List[Tuple[str, float, float]] = []
for segment in raw.(";"):
segment = segment.()
if not segment:
continue
parts = [x.() for x in segment.(",")]
if len() != 3:
raise ValueError(
f"Invalid anchor segment ''. Expected format token,start,end."
)
token = parts[0]
if token not in {"+", "-"}:
raise ValueError(f"Invalid anchor token ''. Use '+' or '-'.")
start = float([1])
end = float([2])
if end <= start:
raise ValueError(
f"Invalid anchor range ''. end must be greater than start."
)
items.((,,))
if not items:
return None
return [items]
def resolve_strategy(args:., input_path:) -> str:
if args.sam_strategy in {"separate", "long"}:
return args.sam_strategy
wav, sr =(str(), target_sr=48_000)
duration = len() / float()
return "long" if duration > args.long_threshold_sec else "separate"
def ensure_sam_weights(model_name_or_path: str) ->:
model_path =(
,
allow_patterns=["*.safetensors", "*.json", "*.pt"],
)
has_weights = bool(list(.("*.safetensors"))) or (
model_path / "checkpoint.pt"
).()
if not has_weights:
raise RuntimeError(
f"SAM model weights not found in . "
"If using facebook/sam-audio-large, you may need HF access approval. "
"Recommended ungated model: mlx-community/sam-audio-large"
)
return model_path
def run_sam(
input_path:,
output_voice_path:,
args:.,
) ->[]:
strategy =(,)
anchors =(.)
if strategy == "long" and anchors is not None:
print("Ignoring --anchors because SAM long mode does not support anchors.")
anchors = None
if args.verbose:
print(f"[SAM] strategy=")
print(f"[SAM] model=.")
print(f"[SAM] description=.")
model_path =(.)
model = SAMAudio.(str())
processor = SAMAudioProcessor.(str())
ode_opt = {"method": args.ode_method, "step_size": args.ode_step_size}
if strategy == "separate":
result = model.(
audios=[str()],
descriptions=[.],
anchors=,
ode_opt=,
ode_decode_chunk_size=.,
)
else:
result = model.(
audios=[str()],
descriptions=[.],
chunk_seconds=.,
overlap_seconds=.,
ode_opt=,
ode_decode_chunk_size=.,
verbose=.,
)
output_voice_path.parent.(parents=True, exist_ok=True)
(.[0], str(), sample_rate=.)
if args.verbose:
print(f"[SAM] wrote target voice: ")
if result.peak_memory is not None:
print(f"[SAM] peak_memory=.:.2f GB")
if args.save_residual:
residual_path =(.).().()
residual_path.parent.(parents=True, exist_ok=True)
(
.[0],
str(),
sample_rate=.,
)
if args.verbose:
print(f"[SAM] wrote residual: ")
return residual_path
return None
def run_mossformer2(input_path:, output_path:, model_name_or_path: str, verbose: bool) -> None:
candidates = [model_name_or_path]
for fallback in ("starkdmi/MossFormer2-SE", "starkdmi/MossFormer2_SE_48K_MLX"):
if fallback not in candidates:
candidates.()
last_error = None
for model_id in candidates:
try:
if verbose:
print(f"[MossFormer2] model=")
model = MossFormer2SEModel.()
enhanced = model.(str())
output_path.parent.(parents=True, exist_ok=True)
audio_io.(str(),,..)
if verbose:
print(f"[MossFormer2] wrote enhanced: ")
return
except (RemoteEntryNotFoundError, HfHubHTTPError) as exc:
last_error = exc
if verbose:
print(f"[MossFormer2] failed to load : ")
continue
if last_error is not None:
raise RuntimeError(
"Could not load a MossFormer2 enhancement model from known IDs. "
"Try passing a valid local path with --enhancer-model."
) from last_error
raise RuntimeError("Could not load MossFormer2 enhancement model.")
def run_demucs(input_path:, output_voice_path:, model_name: str, verbose: bool) ->:
"""Run Demucs two-stem separation and return vocals path."""
demucs_cmd = [sys.executable, "-m", "demucs.separate"]
if shutil.("demucs"):
demucs_cmd = ["demucs"]
out_root = output_voice_path.parent / "_demucs_tmp"
out_root.(parents=True, exist_ok=True)
cmd = [
*demucs_cmd,
"--two-stems",
"vocals",
"-n",
model_name,
"-o",
str(),
str(),
]
if verbose:
print("[Demucs] running:", " ".())
try:
subprocess.(, check=True)
except FileNotFoundError as exc:
raise RuntimeError(
"Demucs is not installed in this environment. Install with: pip install demucs torchcodec"
) from exc
vocals_path = out_root / model_name / input_path.stem / "vocals.wav"
if not vocals_path.():
# Fallback: find any vocals.wav in output tree.
candidates = sorted(.("**/vocals.wav"))
if not candidates:
raise RuntimeError("Demucs completed but no vocals.wav was produced.")
vocals_path = candidates[-1]
y, sr =(str(), target_sr=48_000)
output_voice_path.parent.(parents=True, exist_ok=True)
audio_io.(str(),,)
if verbose:
print(f"[Demucs] wrote vocals: ")
return output_voice_path
def main() -> None:
args =()
input_path =(.).().()
output_path =(.).().()
if not input_path.():
raise FileNotFoundError(f"Input audio not found: ")
if args.mode == "sam":
(input_path=, output_voice_path=, args=)
print(f"Done: ")
return
if args.mode == "mossformer2":
(
input_path=,
output_path=,
model_name_or_path=.,
verbose=.,
)
print(f"Done: ")
return
if args.mode == "demucs":
(
input_path=,
output_voice_path=,
model_name=.,
verbose=.,
)
print(f"Done: ")
return
# sam+mossformer2
if args.mode == "sam+mossformer2":
temp_prefix = "sam_voice_"
elif args.mode == "demucs+mossformer2":
temp_prefix = "demucs_voice_"
else:
raise ValueError(f"Unsupported mode: .")
temp_file = tempfile.(prefix=, suffix=".wav", delete=False)
sam_voice_path =(.)
temp_file.()
try:
if args.mode == "sam+mossformer2":
(input_path=, output_voice_path=, args=)
else:
(
input_path=,
output_voice_path=,
model_name=.,
verbose=.,
)
(
input_path=,
output_path=,
model_name_or_path=.,
verbose=.,
)
finally:
if args.keep_intermediate:
kept = output_path.(. + "_sam.wav")
try:
sam_voice_path.()
print(f"Kept intermediate SAM output: ")
except Exception:
pass
else:
sam_voice_path.(missing_ok=True)
print(f"Done: ")
if __name__ == "__main__":
()mlx-audio/moss_clone_local.py
#!/usr/bin/env python3
"""Local MOSS-TTS voice cloning helper for Apple Silicon.
Expected layout (inside mlx-audio checkout):
./moss-tts-8bit
./moss-audio-tokenizer-full
"""
from __future__ import annotations
import argparse
import contextlib
import os
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple
import numpy as np
import soundfile as sf
from mlx_audio.tts.utils import load_model
DEFAULT_AUDIO_SAMPLING = {
"temperature": 1.7,
"top_p": 0.8,
"top_k": 25,
"text_temperature": 1.5,
"text_top_p": 1.0,
"text_top_k": 50,
"repetition_penalty": 1.0,
}
DEHISS_PRESETS = {
"mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
"balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
"strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}
TOKEN_CAP_MARGIN = 3
MIN_AUTO_TOKENS = 220
MAX_SPLIT_DEPTH = 2
MIN_SPLIT_CHARS = 80
def parse_args() ->.:
parser = argparse.(description="MOSS-TTS local voice clone helper")
parser.(
"--model",
default="./moss-tts-8bit",
help="Path to local MOSS-TTS model directory (default: ./moss-tts-8bit)",
)
parser.(
"--ref-audio",
required=True,
help="Reference audio path (WAV preferred; MP3/M4A supported via ffmpeg)",
)
parser.("--text", default=None, help="Inline text to synthesize")
parser.("--text-file", default=None, help="Text file to synthesize")
parser.(
"--output",
required=True,
help="Output WAV file path (example: ./final.wav)",
)
parser.("--lang-code", default="en", help="Language code (default: en)")
parser.("--instruct", default=None, help="Optional style instruction")
parser.(
"--max-tokens",
type=int,
default=640,
help="Per-chunk token ceiling; adaptive mode chooses lower budgets automatically (default: 640)",
)
parser.(
"--chunk-chars",
type=int,
default=320,
help="Target max characters per chunk (default: 320)",
)
parser.(
"--pause-ms",
type=int,
default=140,
help="Pause between generated chunks in milliseconds (default: 140)",
)
parser.(
"--verbose",
action="store_true",
help="Print detailed generation output",
)
parser.(
"--adaptive",
action=.,
default=True,
help="Auto-size token budget and retry/split capped chunks (default: on)",
)
parser.(
"--dehiss",
action="store_true",
help="Apply post-generation de-hiss cleanup to output audio",
)
parser.(
"--dehiss-preset",
choices=sorted(.()),
default="balanced",
help="De-hiss strength preset (default: balanced)",
)
return parser.()
def load_text(args:.) -> str:
if bool(.) == bool(.):
raise ValueError("Provide exactly one of --text or --text-file")
if args.text_file:
return(.).(encoding="utf-8")
return args.text
def normalize_text(text: str) -> str:
text = text.()
text = re.(r"\s+", " ",)
if not text:
raise ValueError("Input text is empty after normalization")
return text
def sentence_split(text: str) ->[str]:
parts = re.(r"(?<=[.!?。!?])\s+",)
sentences = [part.() for part in parts if part and part.()]
return sentences if sentences else [text]
def chunk_text(text: str, max_chars: int) ->[str]:
sentences =()
chunks: List[str] = []
current = ""
for sentence in sentences:
if len() > max_chars:
# Hard-wrap very long sentence to stay within a stable generation window.
for i in range(0, len(),):
segment = sentence[i : i + max_chars].()
if segment:
if current:
chunks.()
current = ""
chunks.()
continue
candidate = f"{current} {sentence}".() if current else sentence
if len() <= max_chars:
current = candidate
else:
if current:
chunks.()
current = sentence
if current:
chunks.()
return chunks
def ensure_wav_24k_mono(path:) ->[, | None]:
ext = path.suffix.()
if ext == ".wav":
return path, None
ffmpeg = shutil.("ffmpeg")
if ffmpeg is None:
raise RuntimeError(
"ffmpeg is required for non-WAV reference audio. Install with: brew install ffmpeg"
)
tmp =(.(prefix="moss_ref_", suffix=".wav", delete=False).)
cmd = [
ffmpeg,
"-y",
"-i",
str(),
"-ac",
"1",
"-ar",
"24000",
str(),
]
subprocess.(, check=True, stdout=., stderr=.)
return tmp, tmp
@contextlib.contextmanager
def pushd(path:):
prev = Path.()
os.()
try:
yield
finally:
os.()
def resolve_loader_cwd(model_path:) ->:
"""Choose a cwd where moss-audio-tokenizer-full is discoverable by mlx-audio."""
if (model_path / "moss-audio-tokenizer-full" / "config.json").():
return Path.()
script_dir =(__file__).().parent
candidates = [
model_path.parent,
script_dir,
Path.(),
]
for base in candidates:
if (base / "moss-audio-tokenizer-full" / "config.json").():
return base
return Path.()
def generate_chunks(
model,
chunks:[str],
ref_audio_wav:,
lang_code: str,
instruct: str | None,
max_tokens: int,
verbose: bool,
adaptive: bool,
) ->[[.], int]:
generated: List[np.ndarray] = []
sample_rate: int | None = None
for idx, chunk in enumerate(, start=1):
print(f"[] generating (len() chars)")
chunk_segments, chunk_sample_rate =(
model=,
chunk=,
ref_audio_wav=,
lang_code=,
instruct=,
max_tokens=,
verbose=,
adaptive=,
depth=0,
)
generated.()
if sample_rate is None:
sample_rate = chunk_sample_rate
elif sample_rate != chunk_sample_rate:
raise RuntimeError(
f"Inconsistent sample rates across chunks: vs "
)
if sample_rate is None:
raise RuntimeError("No audio generated")
return generated, sample_rate
def choose_token_budget(chunk: str, max_tokens: int) -> int:
# Empirical fit for MOSS-TTS: character count underestimates needed tokens.
estimated = int((len() * 1.25) + 120)
return max(, min(,))
def is_token_capped(token_count: int, budget: int) -> bool:
if budget <= 0:
return False
return token_count >= max(1, -)
def split_chunk_for_retry(chunk: str) ->[str]:
sentences =()
if len() >= 2:
mid = len() // 2
left = " ".([:]).()
right = " ".([:]).()
if (
left
and right
and len() >= MIN_SPLIT_CHARS
and len() >= MIN_SPLIT_CHARS
):
return [left, right]
mid = len() // 2
left_space = chunk.(" ", 0,)
right_space = chunk.(" ",)
candidates = [pos for pos in (left_space, right_space) if pos != -1]
if not candidates:
return [chunk]
split_at = min(, key=lambda pos: abs( -))
left = chunk[:split_at].()
right = chunk[split_at + 1 :].()
if (
left
and right
and len() >= MIN_SPLIT_CHARS
and len() >= MIN_SPLIT_CHARS
):
return [left, right]
return [chunk]
def generate_once(
model,
chunk: str,
ref_audio_wav:,
lang_code: str,
instruct: str | None,
token_budget: int,
verbose: bool,
) ->[., int, int]:
kwargs = dict(
text=,
ref_audio=str(),
lang_code=,
instruct=,
max_tokens=,
verbose=,
**,
)
results = list(.(**))
if not results:
raise RuntimeError("No audio generated")
first = results[0]
arr = np.(., dtype=.)
token_count = int(getattr(, "token_count", 0) or 0)
return arr, int(.), token_count
def generate_chunk_adaptive(
model,
chunk: str,
ref_audio_wav:,
lang_code: str,
instruct: str | None,
max_tokens: int,
verbose: bool,
adaptive: bool,
depth: int,
) ->[[.], int]:
budget =(,) if adaptive else max_tokens
print(f" token budget=")
audio, sample_rate, token_count =(
model=,
chunk=,
ref_audio_wav=,
lang_code=,
instruct=,
token_budget=,
verbose=,
)
print(f" generated token_count=")
if not adaptive:
return [audio], sample_rate
capped =(,)
if not capped:
return [audio], sample_rate
print(f" token cap likely hit (/)")
# One retry at a larger budget before splitting.
if budget < max_tokens:
retry_budget = min(, max( + 64, int( * 1.35)))
print(f" retrying with higher budget=")
retry_audio, retry_sample_rate, retry_token_count =(
model=,
chunk=,
ref_audio_wav=,
lang_code=,
instruct=,
token_budget=,
verbose=,
)
print(f" retry token_count=")
if not(,):
return [retry_audio], retry_sample_rate
audio, sample_rate = retry_audio, retry_sample_rate
if depth >= MAX_SPLIT_DEPTH:
print(" max split depth reached; keeping capped chunk output")
return [audio], sample_rate
parts =()
if len() != 2:
print(" unable to split chunk safely; keeping capped chunk output")
return [audio], sample_rate
print(" splitting chunk and regenerating parts")
stitched_segments: List[np.ndarray] = []
part_sample_rate: int | None = None
for part_idx, part in enumerate(, start=1):
print(f" part : len() chars")
part_segments, part_sr =(
model=,
chunk=,
ref_audio_wav=,
lang_code=,
instruct=,
max_tokens=,
verbose=,
adaptive=,
depth= + 1,
)
if part_sample_rate is None:
part_sample_rate = part_sr
elif part_sample_rate != part_sr:
raise RuntimeError(
f"Inconsistent sample rates in split chunk: vs "
)
stitched_segments.()
if part_sample_rate is None:
raise RuntimeError("Split generation failed to produce audio")
return stitched_segments, part_sample_rate
def join_audio(segments:[.], sample_rate: int, pause_ms: int) ->.:
if len() == 1:
return segments[0]
pause = np.(int( * ( / 1000.0)), dtype=.)
joined: List[np.ndarray] = []
for i, seg in enumerate():
joined.()
if i < len() - 1 and len() > 0:
joined.()
return np.(, axis=0)
def apply_dehiss_inplace(path:, preset: str) -> None:
ffmpeg = shutil.("ffmpeg")
if ffmpeg is None:
raise RuntimeError(
"ffmpeg is required for --dehiss. Install with: brew install ffmpeg"
)
af = DEHISS_PRESETS[preset]
tmp =(.(prefix="moss_dehiss_", suffix=".wav", delete=False).)
try:
cmd = [
ffmpeg,
"-y",
"-i",
str(),
"-af",
af,
"-hide_banner",
"-loglevel",
"error",
str(),
]
subprocess.(, check=True)
tmp.()
print(f"Applied de-hiss () to ")
finally:
tmp.(missing_ok=True)
def main() -> None:
args =()
model_path =(.).().()
ref_audio_path =(.).().()
output_path =(.).().()
if not model_path.():
raise FileNotFoundError(f"Model path not found: ")
if not ref_audio_path.():
raise FileNotFoundError(f"Reference audio not found: ")
text =(())
chunks =(, max_chars=.)
print(f"Loaded text as len() chunk(s)")
ref_wav, cleanup =()
if cleanup:
print(f"Converted reference audio to WAV: ")
try:
loader_cwd =()
if loader_cwd != Path.():
print(f"Using loader working dir: ")
print(f"Loading model: ")
with():
model =(str())
segments, sample_rate =(
model=,
chunks=,
ref_audio_wav=,
lang_code=.,
instruct=.,
max_tokens=.,
verbose=.,
adaptive=.,
)
final_audio =(, sample_rate=, pause_ms=.)
output_path.parent.(parents=True, exist_ok=True)
sf.(str(),,)
if args.dehiss:
(,.)
duration_sec = len() / sample_rate
print(f"Wrote (:.2fs @ Hz)")
finally:
if cleanup is not None and cleanup.():
cleanup.(missing_ok=True)
if __name__ == "__main__":
()Notes for Future Reconstruction
- Do not assume upstream
mlx-audiocontainsmoss_clone_local.py. In this workspace it is a custom file placed inside the pinned checkout. - Do not assume Python
3.14+will install cleanly. Use Python3.12. - Keep the two MOSS model folders beside the
mlx-audiocheckout unless you also update the runner logic. - If reference audio has music, clean it first.
- If a future rebuild produces truncation artifacts, verify that adaptive mode is still enabled.
- If a future rebuild must avoid all network calls, pre-download every model and set offline env vars explicitly.
Minimal Reproduction Checklist
brew install ffmpeg git [email protected]- run
setup_moss_m1.sh - install
demucs torchcodecif reference cleaning is needed - keep
moss-tts-8bitandmoss-audio-tokenizer-fullunder./mlx-audio - run
mlx-audio/moss_clone_local.pywith--dehiss - optionally clean the reference first with
isolate_voice_local.py