# MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide

This guide documents a fully local MOSS-TTS voice cloning setup on Apple Silicon using MLX, a pinned `mlx-audio` checkout, and a small set of custom helper scripts. It is aimed at macOS users on M1, M2, M3, or M4 hardware who want local voice cloning without a hosted TTS service.

It is designed to be public, searchable, and reconstruction-friendly. A human or an LLM should be able to rebuild the same working environment from this document alone.

Topics covered here include:

- MOSS-TTS 8B on MLX
- Apple Silicon local voice cloning
- `mlx-audio` pinned setup
- reference audio cleanup
- Demucs, SAM-Audio, and MossFormer2 integration
- de-hiss cleanup
- adaptive chunking and token handling for long-form generation

This guide intentionally excludes personal paths, machine names, personal filenames, and private transcript contents. All commands use example paths and filenames.

## Quick Summary

If you only need the working shape:

1. Create a clean working directory.
2. Add the root helper scripts from this document.
3. Run `setup_moss_m1.sh` to create the Python `3.12` environment, clone `mlx-audio`, pin the commit, and download the local MOSS model folders.
4. Add the custom `mlx-audio/moss_clone_local.py` from this document into the cloned repo.
5. Optionally install `demucs` and `torchcodec` for reference cleanup.
6. Generate speech with the example commands in the run section.

## Contents

- What this recreates
- Final working state
- Why this setup exists
- Canonical workspace layout
- Prerequisites
- Rebuild from scratch
- Standard commands
- Required vs optional pieces
- Known warnings and fixes
- Exact helper scripts
- Minimal reproduction checklist

## What This Recreates

This recovers a workspace with:

- A root working directory containing a Python virtualenv and helper scripts.
- A pinned `mlx-audio` clone at commit `ecb8d6a3b3e09efa3c323cb1563baeff6e421d13`.
- Local model directories for:
  - `mlx-community/MOSS-TTS-8B-8bit`
  - `OpenMOSS-Team/MOSS-Audio-Tokenizer`
- A custom `moss_clone_local.py` runner that:
  - accepts MP3 or WAV reference audio
  - resolves the codec path correctly
  - chunks long text
  - adaptively increases token budget and splits capped chunks
  - optionally applies post-generation de-hiss
- An optional reference-audio cleanup pipeline using:
  - Demucs vocal extraction
  - optional SAM-Audio extraction
  - optional MossFormer2 enhancement
- A standalone de-hiss helper for generated WAVs

## Final Working State

The working state captured here is:

- Apple Silicon target
- Python `3.12.x`
- `mlx-audio` git commit `ecb8d6a3b3e09efa3c323cb1563baeff6e421d13`
- `mlx` `0.31.1`
- `mlx-lm` `0.31.1`
- `huggingface_hub` `1.9.0`
- `sounddevice` `0.5.3`
- `librosa` `0.11.0`
- `misaki` `0.9.4`
- `demucs` `4.0.1`
- `torchcodec` `0.11.0`

The root workspace itself does not need to be a git repo. The only pinned git checkout is `./mlx-audio`.

## Why This Setup Exists

This was not a straight upstream install. The current state is the result of several fixes applied in order.

### 1. The intended local path was MOSS-TTS on MLX, not open Voxtral TTS

The target was fully local, Apple Silicon, 8B-class voice cloning from reference audio. The practical path was:

- `MOSS-TTS-8B-8bit` on MLX
- not the open Voxtral TTS release

That led to pinning the `mlx-audio` branch/commit that contained the MOSS-TTS integration.

### 2. Plain `python3 -m venv` plus `pip install` failed under Homebrew Python / PEP 668

An early attempt hit:

- `error: externally-managed-environment`

The practical fix was to avoid relying on the default interpreter and explicitly use a Homebrew Python `3.12` binary to create the virtualenv.

### 3. Python 3.14 was a bad fit for this dependency graph

The first venv path used a newer interpreter and the install failed because `mlx-audio[tts]` depends on:

- `misaki>=0.9.4`

At the time, prebuilt wheels for these dependencies were not available for the newer interpreter, which caused the `pip install -e ".[tts]"` step to fail.

The fix was:

- pin the local env to Python `3.12`

### 4. The original setup instructions referenced a script that did not exist locally

The initial workflow assumed a `moss_clone_local.py` script already existed beside the downloaded model folders. It did not.

A custom runner script was created to provide:

- model loading
- local text input loading
- MP3 to WAV conversion for the reference voice
- chunked generation
- WAV output writing
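
The MP3-to-WAV step in the runner can be sketched as a small ffmpeg wrapper. This is an illustrative sketch, not the exact runner code: the `ensure_wav` name and the 24 kHz mono target are assumptions.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def ensure_wav(ref_path: str, sample_rate: int = 24_000) -> Path:
    """Return a WAV path for the reference audio, converting via ffmpeg if needed.

    WAV inputs pass through untouched; other formats (MP3/M4A) are converted
    to mono WAV at the given rate. The 24 kHz default is an assumption here.
    """
    src = Path(ref_path)
    if src.suffix.lower() == ".wav":
        return src
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found. Install with: brew install ffmpeg")
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    subprocess.run(
        [ffmpeg, "-y", "-hide_banner", "-loglevel", "error",
         "-i", str(src), "-ac", "1", "-ar", str(sample_rate), tmp.name],
        check=True,
    )
    return Path(tmp.name)
```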

### 5. The codec path was wrong when running from the workspace root

Running the custom runner from the workspace root produced:

- `FileNotFoundError: Config not found at moss-audio-tokenizer-full/config.json`

Root cause:

- `mlx-audio` expected the codec folder to be discoverable relative to the process working directory
- the runner was being executed from outside the `mlx-audio` directory

The fix was to add a loader working-directory resolver so the runner temporarily changes to a directory where `./moss-audio-tokenizer-full/config.json` is visible before calling `load_model(...)`.
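
The resolver idea can be sketched as a context manager. This is a hedged illustration, not the exact code in `moss_clone_local.py`; the `codec_cwd` name and the `search_dirs` parameter are hypothetical.

```python
import contextlib
import os
from pathlib import Path


@contextlib.contextmanager
def codec_cwd(codec_dirname: str = "moss-audio-tokenizer-full", search_dirs=None):
    """Temporarily chdir to the first directory containing ./<codec_dirname>/config.json."""
    prev = Path.cwd()
    bases = search_dirs or [prev, Path(__file__).resolve().parent]
    for base in bases:
        if (Path(base) / codec_dirname / "config.json").is_file():
            os.chdir(base)
            break
    try:
        yield Path.cwd()
    finally:
        os.chdir(prev)  # always restore the caller's working directory
```

Wrapping the model load in `with codec_cwd(): load_model(...)` makes the run location-independent while leaving the caller's working directory untouched afterwards.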

### 6. Mixed narration plus background music hurt voice cloning quality

Reference audio that included music was usable, but voice identity and output cleanliness both degraded badly.

An optional voice-isolation helper was added so the reference audio can be cleaned before cloning. The helper evolved through a few stages:

- first SAM-Audio
- then SAM plus MossFormer2
- then Demucs-based extraction became the default practical option

The current helper defaults to:

- `demucs`

This was the most practical local path for narration plus background music in this workspace.

### 7. MossFormer2 model naming was inconsistent

One attempt to use MossFormer2 failed with a 404 on an older or renamed Hugging Face repo path.

The helper was updated to try multiple candidate IDs:

- `starkdmi/MossFormer2-SE`
- `starkdmi/MossFormer2_SE_48K_MLX`

### 8. Generated audio had hiss

A standalone `dehiss_wav.py` helper was added using `ffmpeg` filters.

Then the same de-hiss logic was integrated directly into the MOSS runner behind:

- `--dehiss`
- `--dehiss-preset`

### 9. Generated words were being cut off

Hiss was not the most important generation-quality bug. Some generations dropped word endings or jumped straight into the next word.
This turned out to be primarily a token-budget problem:

- the runner defaulted to a fixed `--max-tokens 300`
- at least some chunks hit the ceiling exactly
- exact cap hits meant the model was being truncated mid-utterance

The fix was not to force manual tuning every run. The runner was extended with:

- adaptive token budget selection
- capped-chunk detection
- retry at a larger budget
- safe chunk splitting when still capped

That is now the default behavior.
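
The adaptive loop can be sketched as follows, reusing the constants that appear later in the runner (`TOKEN_CAP_MARGIN`, `MIN_AUTO_TOKENS`, `MAX_SPLIT_DEPTH`, `MIN_SPLIT_CHARS`). The `generate_fn` signature and the per-character budget rate are assumptions for illustration; the real runner wires this logic into MOSS generation.

```python
from typing import Callable, List, Tuple

TOKEN_CAP_MARGIN = 3     # a generation is "capped" if it lands within this of the budget
MIN_AUTO_TOKENS = 220    # floor for the auto-selected token budget
MAX_SPLIT_DEPTH = 2      # how many times a stubborn chunk may be halved
MIN_SPLIT_CHARS = 80     # never split chunks shorter than this


def pick_budget(chunk: str, per_char: float = 2.5, ceiling: int = 2000) -> int:
    """Heuristic audio-token budget from text length (per_char rate is an assumption)."""
    return min(ceiling, max(MIN_AUTO_TOKENS, int(len(chunk) * per_char)))


def split_chunk(chunk: str) -> List[str]:
    """Split roughly in half at the nearest preceding space."""
    mid = len(chunk) // 2
    cut = chunk.rfind(" ", 0, mid)
    if cut <= 0:
        cut = mid
    return [chunk[:cut].strip(), chunk[cut:].strip()]


def generate_adaptive(
    generate_fn: Callable[[str, int], Tuple[object, int]],
    chunk: str,
    depth: int = 0,
) -> List[object]:
    """generate_fn(text, max_tokens) -> (audio, tokens_used). Retry, then split, on cap hits."""
    budget = pick_budget(chunk)
    audio, used = generate_fn(chunk, budget)
    if used < budget - TOKEN_CAP_MARGIN:
        return [audio]
    # Capped: retry once at double the budget before resorting to splitting.
    budget *= 2
    audio, used = generate_fn(chunk, budget)
    if used < budget - TOKEN_CAP_MARGIN:
        return [audio]
    if depth >= MAX_SPLIT_DEPTH or len(chunk) < MIN_SPLIT_CHARS:
        return [audio]  # accept possible truncation rather than recurse forever
    out: List[object] = []
    for piece in split_chunk(chunk):
        out.extend(generate_adaptive(generate_fn, piece, depth + 1))
    return out
```

The key detail is the cap margin: a chunk that uses exactly (or nearly) its full budget is treated as truncated and retried, which is what removes the cut-off word endings without manual `--max-tokens` tuning.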

### 10. Hugging Face warnings still appear even when generation is local

Even with local model weights present, some parts of the stack still use `huggingface_hub` for cache checks or fallback metadata access.

That means you may still see warnings like:

- unauthenticated Hugging Face Hub requests

This does not by itself mean generation is happening remotely.

If strict offline behavior is required, run with:

```bash
HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...
```

## Canonical Workspace Layout

Use this layout:

```text
/path/to/workdir/
  .venv/
  setup_moss_m1.sh
  isolate_voice_local.py
  dehiss_wav.py
  mlx-audio/
    moss_clone_local.py
    moss-tts-8bit/
    moss-audio-tokenizer-full/
```

Temporary outputs, personal audio, generated WAVs, and input transcript files are intentionally not part of the canonical layout.

## Prerequisites

Install these system tools first:

```bash
brew install ffmpeg git python@3.12
```

Optional but useful:

```bash
brew install pipx
```

## Rebuild From Scratch

### 1. Create a clean working directory

```bash
mkdir -p /path/to/workdir
cd /path/to/workdir
```

### 2. Create the root helper scripts

Create these root-level files from the inline script appendix later in this document:

- `./setup_moss_m1.sh`
- `./isolate_voice_local.py`
- `./dehiss_wav.py`

Make them executable:

```bash
chmod +x ./setup_moss_m1.sh ./isolate_voice_local.py ./dehiss_wav.py
```

Do not try to place `./mlx-audio/moss_clone_local.py` yet. The `mlx-audio` clone does not exist until the next step.

### 3. Run the setup script

```bash
./setup_moss_m1.sh
```

This does all of the following:

- creates a fresh Python `3.12` venv at `./.venv`
- clones `mlx-audio` if missing
- checks out commit `ecb8d6a3b3e09efa3c323cb1563baeff6e421d13`
- installs `mlx-audio[tts]`
- installs `huggingface_hub`
- downloads the MOSS-TTS model to `./mlx-audio/moss-tts-8bit`
- downloads the audio tokenizer to `./mlx-audio/moss-audio-tokenizer-full`

### 4. Add the custom MOSS runner into the pinned clone

After the setup script completes, create this file from the inline script appendix later in this document:

- `./mlx-audio/moss_clone_local.py`

This step is required because the runner is a custom workspace script, not an upstream file guaranteed to exist in the pinned `mlx-audio` checkout.

### 5. Install the optional reference-cleaning dependencies

The base setup script is enough for TTS generation. To reproduce the full current helper workflow, also install:

```bash
/path/to/workdir/.venv/bin/python -m pip install demucs torchcodec
```

### 6. Verify the environment

```bash
/path/to/workdir/.venv/bin/python --version
/path/to/workdir/.venv/bin/pip show mlx mlx-lm huggingface_hub misaki demucs torchcodec
cd /path/to/workdir/mlx-audio
git rev-parse HEAD
```

Expected `mlx-audio` commit:

```text
ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
```

### 7. Optional offline and auth behavior

If you want higher Hugging Face rate limits for model downloads:

```bash
/path/to/workdir/.venv/bin/hf auth login
```

If you want strict no-network execution after all models are already local:

```bash
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```

## Standard Commands

### Generate speech from a reference voice

This is the normal command now. Adaptive chunk handling is on by default, so manual `--max-tokens` tuning is usually not needed.

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.mp3 \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss
```

### Clean a mixed reference, then clone from the cleaned voice

This is the most practical end-to-end path when the reference audio contains narration plus background music.

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs \
  --verbose

/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_clean.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss
```

### Generate from inline text instead of a text file

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.mp3 \
  --text "This is a local test of MOSS-TTS on Apple Silicon." \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss
```

### Disable adaptive behavior and force fixed generation settings

Only do this for debugging or controlled experiments.

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --no-adaptive \
  --max-tokens 500 \
  --chunk-chars 240
```

### Clean a mixed narration reference with Demucs only

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs \
  --verbose
```

### Clean a mixed narration reference with Demucs plus MossFormer2

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs+mossformer2 \
  --verbose
```

### Use SAM-Audio instead of Demucs

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode sam \
  --description "A person speaking" \
  --verbose
```

### Apply de-hiss to a generated WAV after the fact

```bash
/path/to/workdir/.venv/bin/python /path/to/workdir/dehiss_wav.py \
  --input /path/to/workdir/output.wav \
  --output /path/to/workdir/output_clean.wav \
  --preset balanced
```

### Force strict offline behavior

If you want the run to fail rather than touch Hugging Face at all:

```bash
HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en
```

## What Is Required vs Optional

Required for local voice cloning:

- `setup_moss_m1.sh`
- `mlx-audio/moss_clone_local.py`
- local `moss-tts-8bit` model folder
- local `moss-audio-tokenizer-full` folder
- Python `3.12`
- `ffmpeg`

Optional but practical:

- `isolate_voice_local.py`
- `dehiss_wav.py`
- `demucs`
- `torchcodec`
- `SAM-Audio`
- `MossFormer2`

## Known Warnings and What They Mean

### `externally-managed-environment`

Cause:

- using a Homebrew-managed interpreter directly instead of a proper venv flow

Fix:

- create a venv explicitly with Python `3.12`
- use the venv’s `python` and `pip`

### `No matching distribution found for misaki>=0.9.4`

Cause:

- incompatible interpreter choice during installation

Fix:

- use Python `3.12`

### `can't open file '.../moss_clone_local.py'`

Cause:

- running from the wrong folder or assuming the runner script exists in the workspace root

Fix:

- run the script at `./mlx-audio/moss_clone_local.py`
- or `cd ./mlx-audio` first

### `Config not found at moss-audio-tokenizer-full/config.json`

Cause:

- the tokenizer directory is local but not visible from the current working directory

Fix:

- keep `moss-tts-8bit` and `moss-audio-tokenizer-full` beside the `mlx-audio` checkout
- use the custom runner here, which resolves the loader cwd before calling `load_model(...)`

### Hugging Face unauthenticated warning

Cause:

- local generation still passes through libraries that may consult `huggingface_hub` for cache checks or metadata

Fix options:

- ignore it if the run works
- set `HF_TOKEN`
- or force offline mode with `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1`

### Words get cut off or skipped

Cause:

- token cap reached during generation

Fix:

- use the adaptive runner in this document

### Reference voice sounds contaminated by background music

Cause:

- the clone model conditions on the whole reference audio, not just the voice

Fix:

- clean the reference first with `isolate_voice_local.py`
- `demucs` is the most practical default in this workspace

## Exact Helper Scripts

These are the canonical script bodies used for the current working state.

### `setup_moss_m1.sh`

```bash
#!/usr/bin/env bash
set -euo pipefail

# Clean local setup for fully-local MOSS-TTS 8B voice cloning on Apple Silicon.
# Run this from the directory where you want the project to live.

ROOT_DIR="$(pwd)"
VENV_DIR="${ROOT_DIR}/.venv"
VENV_PY="${VENV_DIR}/bin/python"
VENV_HF="${VENV_DIR}/bin/hf"
PYTHON_BIN="${PYTHON_BIN:-/opt/homebrew/opt/python@3.12/bin/python3.12}"

brew install ffmpeg git python@3.12

if [[ ! -x "${PYTHON_BIN}" ]]; then
  PYTHON_BIN="$(command -v python3.12 || true)"
fi

if [[ -z "${PYTHON_BIN}" ]]; then
  echo "python3.12 not found. Install with: brew install python@3.12"
  exit 1
fi

rm -rf "${VENV_DIR}"
"${PYTHON_BIN}" -m venv "${VENV_DIR}"
"${VENV_PY}" -m pip install -U pip setuptools wheel

# Pin the exact OpenMOSS MLX branch commit behind PR #586 (add MOSS-TTS).
if [[ -d mlx-audio/.git ]]; then
  cd mlx-audio
  git fetch --all --tags --prune
else
  git clone https://github.com/OpenMOSS/mlx-audio.git
  cd mlx-audio
fi
git checkout ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
"${VENV_PY}" -m pip install -e ".[tts]"
"${VENV_PY}" -m pip install -U huggingface_hub

# Download the 8B MLX model and the codec weights into the exact default paths
# expected by the current MOSS-TTS MLX README.
"${VENV_HF}" download mlx-community/MOSS-TTS-8B-8bit --local-dir ./moss-tts-8bit
"${VENV_HF}" download OpenMOSS-Team/MOSS-Audio-Tokenizer --local-dir ./moss-audio-tokenizer-full

echo

echo "Setup complete. Run:"
echo "  cd ${ROOT_DIR}/mlx-audio"
echo "  ${VENV_PY} moss_clone_local.py --ref-audio ./voice.mp3 --text-file ./input.txt --output ./final.wav --lang-code en"
```

### `dehiss_wav.py`

```python
#!/usr/bin/env python3
"""Apply de-hiss cleanup to a WAV file using ffmpeg audio filters.

Presets are tuned for TTS outputs with broadband high-frequency hiss.
"""

from __future__ import annotations

import argparse
import shutil
import subprocess
from pathlib import Path


PRESETS = {
    "mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
    "balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
    "strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="De-hiss WAV audio with ffmpeg.")
    p.add_argument("--input", required=True, help="Input WAV/audio path")
    p.add_argument("--output", required=True, help="Output WAV path")
    p.add_argument(
        "--preset",
        choices=sorted(PRESETS.keys()),
        default="balanced",
        help="Filter strength preset",
    )
    p.add_argument(
        "--filter",
        default=None,
        help="Custom ffmpeg -af filter string (overrides --preset)",
    )
    return p.parse_args()


def main() -> None:
    args = parse_args()
    in_path = Path(args.input).expanduser().resolve()
    out_path = Path(args.output).expanduser().resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input not found: {in_path}")
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found. Install with: brew install ffmpeg")

    af = args.filter if args.filter else PRESETS[args.preset]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cmd = [
        ffmpeg,
        "-y",
        "-i",
        str(in_path),
        "-af",
        af,
        "-hide_banner",
        "-loglevel",
        "error",
        str(out_path),
    ]
    subprocess.run(cmd, check=True)
    print(f"Wrote: {out_path}")
    print(f"Filter: {af}")


if __name__ == "__main__":
    main()
```

### `isolate_voice_local.py`

```python
#!/usr/bin/env python3
"""Isolate clean narration voice from mixed audio (speech + background music).

Local pipeline on Apple Silicon using mlx-audio:
1) SAM-Audio source separation (speech target extraction)
2) Optional MossFormer2 speech enhancement
"""

from __future__ import annotations

import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

from huggingface_hub.errors import HfHubHTTPError, RemoteEntryNotFoundError
from mlx_audio import audio_io
from mlx_audio.sts import MossFormer2SEModel, SAMAudio, SAMAudioProcessor, save_audio
from mlx_audio.sts.models.sam_audio.processor import load_audio
from mlx_audio.utils import get_model_path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Extract clean speech from narration+music audio, fully local."
    )
    parser.add_argument("--input", required=True, help="Input audio path")
    parser.add_argument(
        "--output",
        required=True,
        help="Output clean voice WAV path",
    )
    parser.add_argument(
        "--mode",
        default="demucs",
        choices=[
            "sam",
            "mossformer2",
            "sam+mossformer2",
            "demucs",
            "demucs+mossformer2",
        ],
        help="Processing mode (default: demucs)",
    )
    parser.add_argument(
        "--description",
        default="A person speaking",
        help='SAM-Audio text prompt for target sound (default: "A person speaking")',
    )
    parser.add_argument(
        "--sam-model",
        default="mlx-community/sam-audio-large",
        help="SAM model repo or local path",
    )
    parser.add_argument(
        "--enhancer-model",
        default="starkdmi/MossFormer2-SE",
        help="MossFormer2 model repo or local path",
    )
    parser.add_argument(
        "--demucs-model",
        default="htdemucs_ft",
        help="Demucs model name for vocal separation (default: htdemucs_ft)",
    )
    parser.add_argument(
        "--sam-strategy",
        default="auto",
        choices=["auto", "separate", "long"],
        help="SAM processing strategy",
    )
    parser.add_argument(
        "--long-threshold-sec",
        type=float,
        default=30.0,
        help="Auto mode uses separate_long when input duration exceeds this threshold",
    )
    parser.add_argument(
        "--chunk-seconds",
        type=float,
        default=10.0,
        help="SAM long-mode chunk size in seconds",
    )
    parser.add_argument(
        "--overlap-seconds",
        type=float,
        default=3.0,
        help="SAM long-mode overlap in seconds",
    )
    parser.add_argument(
        "--ode-method",
        choices=["midpoint", "euler"],
        default="midpoint",
        help="SAM ODE method",
    )
    parser.add_argument(
        "--ode-step-size",
        type=float,
        default=(2.0 / 32.0),
        help="SAM ODE step size (default: 2/32)",
    )
    parser.add_argument(
        "--ode-decode-chunk-size",
        type=int,
        default=None,
        help="SAM decode chunk size to reduce memory (optional)",
    )
    parser.add_argument(
        "--anchors",
        default=None,
        help='Optional SAM anchors for separate mode: "+,1.2,3.0;-,4.0,5.5"',
    )
    parser.add_argument(
        "--save-residual",
        default=None,
        help="Optional path to save background/residual track from SAM",
    )
    parser.add_argument(
        "--keep-intermediate",
        action="store_true",
        help="Keep SAM intermediate target WAV when running sam+mossformer2",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Verbose logs",
    )
    return parser.parse_args()


def parse_anchors(raw: Optional[str]) -> Optional[List[List[Tuple[str, float, float]]]]:
    if not raw:
        return None
    items: List[Tuple[str, float, float]] = []
    for segment in raw.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        parts = [x.strip() for x in segment.split(",")]
        if len(parts) != 3:
            raise ValueError(
                f"Invalid anchor segment '{segment}'. Expected format token,start,end."
            )
        token = parts[0]
        if token not in {"+", "-"}:
            raise ValueError(f"Invalid anchor token '{token}'. Use '+' or '-'.")
        start = float(parts[1])
        end = float(parts[2])
        if end <= start:
            raise ValueError(
                f"Invalid anchor range '{segment}'. end must be greater than start."
            )
        items.append((token, start, end))
    if not items:
        return None
    return [items]


def resolve_strategy(args: argparse.Namespace, input_path: Path) -> str:
    if args.sam_strategy in {"separate", "long"}:
        return args.sam_strategy
    wav, sr = load_audio(str(input_path), target_sr=48_000)
    duration = len(wav) / float(sr)
    return "long" if duration > args.long_threshold_sec else "separate"


def ensure_sam_weights(model_name_or_path: str) -> Path:
    model_path = get_model_path(
        model_name_or_path,
        allow_patterns=["*.safetensors", "*.json", "*.pt"],
    )
    has_weights = bool(list(model_path.glob("*.safetensors"))) or (
        model_path / "checkpoint.pt"
    ).exists()
    if not has_weights:
        raise RuntimeError(
            f"SAM model weights not found in {model_path}. "
            "If using facebook/sam-audio-large, you may need HF access approval. "
            "Recommended ungated model: mlx-community/sam-audio-large"
        )
    return model_path


def run_sam(
    input_path: Path,
    output_voice_path: Path,
    args: argparse.Namespace,
) -> Optional[Path]:
    strategy = resolve_strategy(args, input_path)
    anchors = parse_anchors(args.anchors)
    if strategy == "long" and anchors is not None:
        print("Ignoring --anchors because SAM long mode does not support anchors.")
        anchors = None

    if args.verbose:
        print(f"[SAM] strategy={strategy}")
        print(f"[SAM] model={args.sam_model}")
        print(f"[SAM] description={args.description}")

    model_path = ensure_sam_weights(args.sam_model)
    model = SAMAudio.from_pretrained(str(model_path))
    processor = SAMAudioProcessor.from_pretrained(str(model_path))

    ode_opt = {"method": args.ode_method, "step_size": args.ode_step_size}

    if strategy == "separate":
        result = model.separate(
            audios=[str(input_path)],
            descriptions=[args.description],
            anchors=anchors,
            ode_opt=ode_opt,
            ode_decode_chunk_size=args.ode_decode_chunk_size,
        )
    else:
        result = model.separate_long(
            audios=[str(input_path)],
            descriptions=[args.description],
            chunk_seconds=args.chunk_seconds,
            overlap_seconds=args.overlap_seconds,
            ode_opt=ode_opt,
            ode_decode_chunk_size=args.ode_decode_chunk_size,
            verbose=args.verbose,
        )

    output_voice_path.parent.mkdir(parents=True, exist_ok=True)
    save_audio(result.target[0], str(output_voice_path), sample_rate=processor.audio_sampling_rate)
    if args.verbose:
        print(f"[SAM] wrote target voice: {output_voice_path}")
        if result.peak_memory is not None:
            print(f"[SAM] peak_memory={result.peak_memory:.2f} GB")

    if args.save_residual:
        residual_path = Path(args.save_residual).expanduser().resolve()
        residual_path.parent.mkdir(parents=True, exist_ok=True)
        save_audio(
            result.residual[0],
            str(residual_path),
            sample_rate=processor.audio_sampling_rate,
        )
        if args.verbose:
            print(f"[SAM] wrote residual: {residual_path}")
        return residual_path
    return None


def run_mossformer2(input_path: Path, output_path: Path, model_name_or_path: str, verbose: bool) -> None:
    candidates = [model_name_or_path]
    for fallback in ("starkdmi/MossFormer2-SE", "starkdmi/MossFormer2_SE_48K_MLX"):
        if fallback not in candidates:
            candidates.append(fallback)

    last_error = None
    for model_id in candidates:
        try:
            if verbose:
                print(f"[MossFormer2] model={model_id}")
            model = MossFormer2SEModel.from_pretrained(model_id)
            enhanced = model.enhance(str(input_path))
            output_path.parent.mkdir(parents=True, exist_ok=True)
            audio_io.write(str(output_path), enhanced, model.config.sample_rate)
            if verbose:
                print(f"[MossFormer2] wrote enhanced: {output_path}")
            return
        except (RemoteEntryNotFoundError, HfHubHTTPError) as exc:
            last_error = exc
            if verbose:
                print(f"[MossFormer2] failed to load {model_id}: {exc}")
            continue

    if last_error is not None:
        raise RuntimeError(
            "Could not load a MossFormer2 enhancement model from known IDs. "
            "Try passing a valid local path with --enhancer-model."
        ) from last_error

    raise RuntimeError("Could not load MossFormer2 enhancement model.")


def run_demucs(input_path: Path, output_voice_path: Path, model_name: str, verbose: bool) -> Path:
    """Run Demucs two-stem separation and return vocals path."""
    demucs_cmd = [sys.executable, "-m", "demucs.separate"]
    if shutil.which("demucs"):
        demucs_cmd = ["demucs"]

    out_root = output_voice_path.parent / "_demucs_tmp"
    out_root.mkdir(parents=True, exist_ok=True)

    cmd = [
        *demucs_cmd,
        "--two-stems",
        "vocals",
        "-n",
        model_name,
        "-o",
        str(out_root),
        str(input_path),
    ]
    if verbose:
        print("[Demucs] running:", " ".join(cmd))
    try:
        subprocess.run(cmd, check=True)
    except FileNotFoundError as exc:
        raise RuntimeError(
            "Demucs is not installed in this environment. Install with: pip install demucs torchcodec"
        ) from exc

    vocals_path = out_root / model_name / input_path.stem / "vocals.wav"
    if not vocals_path.exists():
        # Fallback: find any vocals.wav in output tree.
        candidates = sorted(out_root.glob("**/vocals.wav"))
        if not candidates:
            raise RuntimeError("Demucs completed but no vocals.wav was produced.")
        vocals_path = candidates[-1]

    y, sr = load_audio(str(vocals_path), target_sr=48_000)
    output_voice_path.parent.mkdir(parents=True, exist_ok=True)
    audio_io.write(str(output_voice_path), y, sr)
    if verbose:
        print(f"[Demucs] wrote vocals: {output_voice_path}")
    return output_voice_path


def main() -> None:
    args = parse_args()
    input_path = Path(args.input).expanduser().resolve()
    output_path = Path(args.output).expanduser().resolve()
    if not input_path.exists():
        raise FileNotFoundError(f"Input audio not found: {input_path}")

    if args.mode == "sam":
        run_sam(input_path=input_path, output_voice_path=output_path, args=args)
        print(f"Done: {output_path}")
        return

    if args.mode == "mossformer2":
        run_mossformer2(
            input_path=input_path,
            output_path=output_path,
            model_name_or_path=args.enhancer_model,
            verbose=args.verbose,
        )
        print(f"Done: {output_path}")
        return

    if args.mode == "demucs":
        run_demucs(
            input_path=input_path,
            output_voice_path=output_path,
            model_name=args.demucs_model,
            verbose=args.verbose,
        )
        print(f"Done: {output_path}")
        return

    # Combined modes: sam+mossformer2 or demucs+mossformer2.
    if args.mode == "sam+mossformer2":
        temp_prefix = "sam_voice_"
    elif args.mode == "demucs+mossformer2":
        temp_prefix = "demucs_voice_"
    else:
        raise ValueError(f"Unsupported mode: {args.mode}")

    temp_file = tempfile.NamedTemporaryFile(prefix=temp_prefix, suffix=".wav", delete=False)
    sam_voice_path = Path(temp_file.name)
    temp_file.close()
    try:
        if args.mode == "sam+mossformer2":
            run_sam(input_path=input_path, output_voice_path=sam_voice_path, args=args)
        else:
            run_demucs(
                input_path=input_path,
                output_voice_path=sam_voice_path,
                model_name=args.demucs_model,
                verbose=args.verbose,
            )
        run_mossformer2(
            input_path=sam_voice_path,
            output_path=output_path,
            model_name_or_path=args.enhancer_model,
            verbose=args.verbose,
        )
    finally:
        if args.keep_intermediate:
            suffix = "_sam.wav" if args.mode == "sam+mossformer2" else "_demucs.wav"
            kept = output_path.with_name(output_path.stem + suffix)
            try:
                sam_voice_path.replace(kept)
                print(f"Kept intermediate separation output: {kept}")
            except Exception:
                pass
        else:
            sam_voice_path.unlink(missing_ok=True)

    print(f"Done: {output_path}")


if __name__ == "__main__":
    main()
```
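For reference, a combined-cleanup invocation of the script above might look like the following. The flag spellings are inferred from the `args.*` attributes used in `main()` (the parser itself is defined earlier in the script), and the paths are examples only; the guard lets the snippet run harmlessly on a machine where the workspace does not exist yet.

```shell
# Hypothetical example paths; flag names inferred from the args.* attributes.
if [ -f isolate_voice_local.py ]; then
  python isolate_voice_local.py \
    --mode demucs+mossformer2 \
    --input ./ref/raw_ref.m4a \
    --output ./ref/voice_ref.wav \
    --keep-intermediate --verbose
  STATUS=ran
else
  STATUS=skipped  # workspace not set up yet; see the rebuild section
fi
echo "$STATUS"
```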

### `mlx-audio/moss_clone_local.py`

```python
#!/usr/bin/env python3
"""Local MOSS-TTS voice cloning helper for Apple Silicon.

Expected layout (inside mlx-audio checkout):
  ./moss-tts-8bit
  ./moss-audio-tokenizer-full
"""

from __future__ import annotations

import argparse
import contextlib
import os
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

import numpy as np
import soundfile as sf
from mlx_audio.tts.utils import load_model


DEFAULT_AUDIO_SAMPLING = {
    "temperature": 1.7,
    "top_p": 0.8,
    "top_k": 25,
    "text_temperature": 1.5,
    "text_top_p": 1.0,
    "text_top_k": 50,
    "repetition_penalty": 1.0,
}

DEHISS_PRESETS = {
    "mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
    "balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
    "strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}

TOKEN_CAP_MARGIN = 3
MIN_AUTO_TOKENS = 220
MAX_SPLIT_DEPTH = 2
MIN_SPLIT_CHARS = 80


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="MOSS-TTS local voice clone helper")
    parser.add_argument(
        "--model",
        default="./moss-tts-8bit",
        help="Path to local MOSS-TTS model directory (default: ./moss-tts-8bit)",
    )
    parser.add_argument(
        "--ref-audio",
        required=True,
        help="Reference audio path (WAV preferred; MP3/M4A supported via ffmpeg)",
    )
    parser.add_argument("--text", default=None, help="Inline text to synthesize")
    parser.add_argument("--text-file", default=None, help="Text file to synthesize")
    parser.add_argument(
        "--output",
        required=True,
        help="Output WAV file path (example: ./final.wav)",
    )
    parser.add_argument("--lang-code", default="en", help="Language code (default: en)")
    parser.add_argument("--instruct", default=None, help="Optional style instruction")
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=640,
        help="Per-chunk token ceiling; adaptive mode chooses lower budgets automatically (default: 640)",
    )
    parser.add_argument(
        "--chunk-chars",
        type=int,
        default=320,
        help="Target max characters per chunk (default: 320)",
    )
    parser.add_argument(
        "--pause-ms",
        type=int,
        default=140,
        help="Pause between generated chunks in milliseconds (default: 140)",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print detailed generation output",
    )
    parser.add_argument(
        "--adaptive",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Auto-size token budget and retry/split capped chunks (default: on)",
    )
    parser.add_argument(
        "--dehiss",
        action="store_true",
        help="Apply post-generation de-hiss cleanup to output audio",
    )
    parser.add_argument(
        "--dehiss-preset",
        choices=sorted(DEHISS_PRESETS.keys()),
        default="balanced",
        help="De-hiss strength preset (default: balanced)",
    )
    return parser.parse_args()


def load_text(args: argparse.Namespace) -> str:
    if bool(args.text) == bool(args.text_file):
        raise ValueError("Provide exactly one of --text or --text-file")
    if args.text_file:
        return Path(args.text_file).read_text(encoding="utf-8")
    return args.text


def normalize_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    if not text:
        raise ValueError("Input text is empty after normalization")
    return text


def sentence_split(text: str) -> List[str]:
    parts = re.split(r"(?<=[.!?。！？])\s+", text)
    sentences = [part.strip() for part in parts if part and part.strip()]
    return sentences if sentences else [text]


def chunk_text(text: str, max_chars: int) -> List[str]:
    sentences = sentence_split(text)
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Hard-wrap very long sentence to stay within a stable generation window.
            for i in range(0, len(sentence), max_chars):
                segment = sentence[i : i + max_chars].strip()
                if segment:
                    if current:
                        chunks.append(current)
                        current = ""
                    chunks.append(segment)
            continue
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def ensure_wav_24k_mono(path: Path) -> tuple[Path, Path | None]:
    """Return a reference WAV path plus an optional temp file to clean up later.

    Non-WAV inputs are converted to 24 kHz mono via ffmpeg; existing WAV files
    are passed through unchanged, whatever their sample rate or channel count.
    """
    ext = path.suffix.lower()
    if ext == ".wav":
        return path, None
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError(
            "ffmpeg is required for non-WAV reference audio. Install with: brew install ffmpeg"
        )
    tmp = Path(tempfile.NamedTemporaryFile(prefix="moss_ref_", suffix=".wav", delete=False).name)
    cmd = [
        ffmpeg,
        "-y",
        "-i",
        str(path),
        "-ac",
        "1",
        "-ar",
        "24000",
        str(tmp),
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return tmp, tmp


@contextlib.contextmanager
def pushd(path: Path):
    prev = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)


def resolve_loader_cwd(model_path: Path) -> Path:
    """Choose a cwd where moss-audio-tokenizer-full is discoverable by mlx-audio."""
    if (model_path / "moss-audio-tokenizer-full" / "config.json").exists():
        return Path.cwd()

    script_dir = Path(__file__).resolve().parent
    candidates = [
        model_path.parent,
        script_dir,
        Path.cwd(),
    ]
    for base in candidates:
        if (base / "moss-audio-tokenizer-full" / "config.json").exists():
            return base
    return Path.cwd()


def generate_chunks(
    model,
    chunks: Iterable[str],
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    max_tokens: int,
    verbose: bool,
    adaptive: bool,
) -> tuple[List[np.ndarray], int]:
    generated: List[np.ndarray] = []
    sample_rate: int | None = None
    for idx, chunk in enumerate(chunks, start=1):
        print(f"[{idx}] generating ({len(chunk)} chars)")
        chunk_segments, chunk_sample_rate = generate_chunk_adaptive(
            model=model,
            chunk=chunk,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            max_tokens=max_tokens,
            verbose=verbose,
            adaptive=adaptive,
            depth=0,
        )
        generated.extend(chunk_segments)
        if sample_rate is None:
            sample_rate = chunk_sample_rate
        elif sample_rate != chunk_sample_rate:
            raise RuntimeError(
                f"Inconsistent sample rates across chunks: {sample_rate} vs {chunk_sample_rate}"
            )
    if sample_rate is None:
        raise RuntimeError("No audio generated")
    return generated, sample_rate


def choose_token_budget(chunk: str, max_tokens: int) -> int:
    # Empirical fit for MOSS-TTS: character count underestimates needed tokens.
    estimated = int((len(chunk) * 1.25) + 120)
    return max(MIN_AUTO_TOKENS, min(max_tokens, estimated))


def is_token_capped(token_count: int, budget: int) -> bool:
    if budget <= 0:
        return False
    return token_count >= max(1, budget - TOKEN_CAP_MARGIN)


def split_chunk_for_retry(chunk: str) -> List[str]:
    sentences = sentence_split(chunk)
    if len(sentences) >= 2:
        mid = len(sentences) // 2
        left = " ".join(sentences[:mid]).strip()
        right = " ".join(sentences[mid:]).strip()
        if (
            left
            and right
            and len(left) >= MIN_SPLIT_CHARS
            and len(right) >= MIN_SPLIT_CHARS
        ):
            return [left, right]

    mid = len(chunk) // 2
    left_space = chunk.rfind(" ", 0, mid)
    right_space = chunk.find(" ", mid)
    candidates = [pos for pos in (left_space, right_space) if pos != -1]
    if not candidates:
        return [chunk]

    split_at = min(candidates, key=lambda pos: abs(pos - mid))
    left = chunk[:split_at].strip()
    right = chunk[split_at + 1 :].strip()
    if (
        left
        and right
        and len(left) >= MIN_SPLIT_CHARS
        and len(right) >= MIN_SPLIT_CHARS
    ):
        return [left, right]
    return [chunk]


def generate_once(
    model,
    chunk: str,
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    token_budget: int,
    verbose: bool,
) -> Tuple[np.ndarray, int, int]:
    kwargs = dict(
        text=chunk,
        ref_audio=str(ref_audio_wav),
        lang_code=lang_code,
        instruct=instruct,
        max_tokens=token_budget,
        verbose=verbose,
        **DEFAULT_AUDIO_SAMPLING,
    )
    results = list(model.generate(**kwargs))
    if not results:
        raise RuntimeError("No audio generated")
    first = results[0]
    arr = np.array(first.audio, dtype=np.float32)
    token_count = int(getattr(first, "token_count", 0) or 0)
    return arr, int(first.sample_rate), token_count


def generate_chunk_adaptive(
    model,
    chunk: str,
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    max_tokens: int,
    verbose: bool,
    adaptive: bool,
    depth: int,
) -> Tuple[List[np.ndarray], int]:
    budget = choose_token_budget(chunk, max_tokens) if adaptive else max_tokens
    print(f"    token budget={budget}")

    audio, sample_rate, token_count = generate_once(
        model=model,
        chunk=chunk,
        ref_audio_wav=ref_audio_wav,
        lang_code=lang_code,
        instruct=instruct,
        token_budget=budget,
        verbose=verbose,
    )
    print(f"    generated token_count={token_count}")

    if not adaptive:
        return [audio], sample_rate

    capped = is_token_capped(token_count, budget)
    if not capped:
        return [audio], sample_rate

    print(f"    token cap likely hit ({token_count}/{budget})")

    # One retry at a larger budget before splitting.
    if budget < max_tokens:
        retry_budget = min(max_tokens, max(budget + 64, int(budget * 1.35)))
        print(f"    retrying with higher budget={retry_budget}")
        retry_audio, retry_sample_rate, retry_token_count = generate_once(
            model=model,
            chunk=chunk,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            token_budget=retry_budget,
            verbose=verbose,
        )
        print(f"    retry token_count={retry_token_count}")
        if not is_token_capped(retry_token_count, retry_budget):
            return [retry_audio], retry_sample_rate
        audio, sample_rate = retry_audio, retry_sample_rate

    if depth >= MAX_SPLIT_DEPTH:
        print("    max split depth reached; keeping capped chunk output")
        return [audio], sample_rate

    parts = split_chunk_for_retry(chunk)
    if len(parts) != 2:
        print("    unable to split chunk safely; keeping capped chunk output")
        return [audio], sample_rate

    print("    splitting chunk and regenerating parts")
    stitched_segments: List[np.ndarray] = []
    part_sample_rate: int | None = None
    for part_idx, part in enumerate(parts, start=1):
        print(f"      part {part_idx}: {len(part)} chars")
        part_segments, part_sr = generate_chunk_adaptive(
            model=model,
            chunk=part,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            max_tokens=max_tokens,
            verbose=verbose,
            adaptive=adaptive,
            depth=depth + 1,
        )
        if part_sample_rate is None:
            part_sample_rate = part_sr
        elif part_sample_rate != part_sr:
            raise RuntimeError(
                f"Inconsistent sample rates in split chunk: {part_sample_rate} vs {part_sr}"
            )
        stitched_segments.extend(part_segments)

    if part_sample_rate is None:
        raise RuntimeError("Split generation failed to produce audio")
    return stitched_segments, part_sample_rate


def join_audio(segments: List[np.ndarray], sample_rate: int, pause_ms: int) -> np.ndarray:
    if len(segments) == 1:
        return segments[0]
    pause = np.zeros(int(sample_rate * (pause_ms / 1000.0)), dtype=np.float32)
    joined: List[np.ndarray] = []
    for i, seg in enumerate(segments):
        joined.append(seg)
        if i < len(segments) - 1 and len(pause) > 0:
            joined.append(pause)
    return np.concatenate(joined, axis=0)


def apply_dehiss_inplace(path: Path, preset: str) -> None:
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError(
            "ffmpeg is required for --dehiss. Install with: brew install ffmpeg"
        )
    af = DEHISS_PRESETS[preset]
    tmp = Path(tempfile.NamedTemporaryFile(prefix="moss_dehiss_", suffix=".wav", delete=False).name)
    try:
        cmd = [
            ffmpeg,
            "-y",
            "-i",
            str(path),
            "-af",
            af,
            "-hide_banner",
            "-loglevel",
            "error",
            str(tmp),
        ]
        subprocess.run(cmd, check=True)
        tmp.replace(path)
        print(f"Applied de-hiss ({preset}) to {path}")
    finally:
        tmp.unlink(missing_ok=True)


def main() -> None:
    args = parse_args()
    model_path = Path(args.model).expanduser().resolve()
    ref_audio_path = Path(args.ref_audio).expanduser().resolve()
    output_path = Path(args.output).expanduser().resolve()

    if not model_path.exists():
        raise FileNotFoundError(f"Model path not found: {model_path}")
    if not ref_audio_path.exists():
        raise FileNotFoundError(f"Reference audio not found: {ref_audio_path}")

    text = normalize_text(load_text(args))
    chunks = chunk_text(text, max_chars=args.chunk_chars)
    print(f"Loaded text as {len(chunks)} chunk(s)")

    ref_wav, cleanup = ensure_wav_24k_mono(ref_audio_path)
    if cleanup:
        print(f"Converted reference audio to WAV: {ref_wav}")

    try:
        loader_cwd = resolve_loader_cwd(model_path)
        if loader_cwd != Path.cwd():
            print(f"Using loader working dir: {loader_cwd}")

        print(f"Loading model: {model_path}")
        with pushd(loader_cwd):
            model = load_model(str(model_path))
        segments, sample_rate = generate_chunks(
            model=model,
            chunks=chunks,
            ref_audio_wav=ref_wav,
            lang_code=args.lang_code,
            instruct=args.instruct,
            max_tokens=args.max_tokens,
            verbose=args.verbose,
            adaptive=args.adaptive,
        )
        final_audio = join_audio(segments, sample_rate=sample_rate, pause_ms=args.pause_ms)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        sf.write(str(output_path), final_audio, sample_rate)
        if args.dehiss:
            apply_dehiss_inplace(output_path, args.dehiss_preset)
        duration_sec = len(final_audio) / sample_rate
        print(f"Wrote {output_path} ({duration_sec:.2f}s @ {sample_rate} Hz)")
    finally:
        if cleanup is not None and cleanup.exists():
            cleanup.unlink(missing_ok=True)


if __name__ == "__main__":
    main()
```

## Notes for Future Reconstruction

1. Do not assume upstream `mlx-audio` contains `moss_clone_local.py`. In this workspace it is a custom file placed inside the pinned checkout.
2. Do not assume Python `3.14+` will install cleanly. Use Python `3.12`.
3. Keep the two MOSS model folders beside the `mlx-audio` checkout unless you also update the runner logic.
4. If reference audio has music or background noise, clean it first with `isolate_voice_local.py` (for example `demucs+mossformer2` mode).
5. If a future rebuild produces truncation artifacts, verify that adaptive mode is still enabled.
6. If a future rebuild must avoid all network calls, pre-download every model and set offline env vars explicitly.
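For point 6, the usual switches are the standard `huggingface_hub` and `transformers` offline environment variables (these are library-level settings, not something this workspace defines):

```shell
# Force huggingface_hub and transformers to read only from local caches.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```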

## Minimal Reproduction Checklist

- `brew install ffmpeg git python@3.12`
- run `setup_moss_m1.sh`
- install `demucs torchcodec` if reference cleaning is needed
- keep `moss-tts-8bit` and `moss-audio-tokenizer-full` under `./mlx-audio`
- run `mlx-audio/moss_clone_local.py` with `--dehiss`
- optionally clean the reference first with `isolate_voice_local.py`
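As a concrete end-to-end shape for the last two checklist items, an invocation might look like this (example paths only; the flags match the `parse_args` definition in `moss_clone_local.py` above, and the guard lets the snippet run even before the checkout exists):

```shell
# Run from inside the mlx-audio checkout so the local model folders resolve.
if [ -f moss_clone_local.py ]; then
  python moss_clone_local.py \
    --ref-audio ./ref/voice_ref.wav \
    --text "Hello from the local clone." \
    --output ./out/final.wav \
    --dehiss --dehiss-preset balanced
  STATUS=ran
else
  STATUS=skipped  # checkout not present; run setup_moss_m1.sh first
fi
echo "$STATUS"
```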
