MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide

This guide documents a fully local MOSS-TTS voice cloning setup on Apple Silicon using MLX, a pinned mlx-audio checkout, and a small set of custom helper scripts. It is aimed at macOS users on M1, M2, M3, or M4 hardware who want local voice cloning without a hosted TTS service.

It is designed to be public, searchable, and reconstruction-friendly. A human or an LLM should be able to rebuild the same working environment from this document alone.

Topics covered here include:

MOSS-TTS 8B on MLX
Apple Silicon local voice cloning
mlx-audio pinned setup
reference audio cleanup
Demucs, SAM-Audio, and MossFormer2 integration
de-hiss cleanup
adaptive chunking and token handling for long-form generation

This guide intentionally excludes personal paths, machine names, personal filenames, and private transcript contents. All commands use example paths and filenames.

Quick Summary

If you only need the working shape:

Create a clean working directory.
Add the root helper scripts from this document.
Run setup_moss_m1.sh to create the Python 3.12 environment, clone mlx-audio, pin the commit, and download the local MOSS model folders.
Add the custom mlx-audio/moss_clone_local.py from this document into the cloned repo.
Optionally install demucs and torchcodec for reference cleanup.
Generate speech with the example commands in the run section.

What this recreates
Final working state
Why this setup exists
Canonical workspace layout
Prerequisites
Rebuild from scratch
Standard commands
Required vs optional pieces
Known warnings and fixes
Exact helper scripts
Minimal reproduction checklist

What This Recreates

This recovers a workspace with:

A root working directory containing a Python virtualenv and helper scripts.
A pinned mlx-audio clone at commit ecb8d6a3b3e09efa3c323cb1563baeff6e421d13.
Local model directories for:
- mlx-community/MOSS-TTS-8B-8bit
- OpenMOSS-Team/MOSS-Audio-Tokenizer
A custom moss_clone_local.py runner that:
- accepts MP3 or WAV reference audio
- resolves the codec path correctly
- chunks long text
- adaptively increases token budget and splits capped chunks
- optionally applies post-generation de-hiss
An optional reference-audio cleanup pipeline using:
- Demucs vocal extraction
- optional SAM-Audio extraction
- optional MossFormer2 enhancement
A standalone de-hiss helper for generated WAVs

Final Working State

The working state captured here is:

Apple Silicon target
Python 3.12.x
mlx-audio git commit ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
mlx 0.31.1
mlx-lm 0.31.1
huggingface_hub 1.9.0
sounddevice 0.5.3
librosa 0.11.0
misaki 0.9.4
demucs 4.0.1
torchcodec 0.11.0

The root workspace itself does not need to be a git repo. The only pinned git checkout is ./mlx-audio.

Why This Setup Exists

This was not a straight upstream install. The current state is the result of several fixes applied in order.

1. The intended local path was MOSS-TTS on MLX, not open Voxtral TTS

The target was fully local, Apple Silicon, 8B-class voice cloning from reference audio. The practical path was:

MOSS-TTS-8B-8bit on MLX
not the open Voxtral TTS release

That led to pinning the mlx-audio branch/commit that contained the MOSS-TTS integration.

2. Plain `python3 -m venv` plus `pip install` failed under Homebrew Python / PEP 668

An early attempt hit:

error: externally-managed-environment

The practical fix was to avoid relying on the default interpreter and explicitly use a Homebrew Python 3.12 binary to create the virtualenv.

3. Python 3.14 was a bad fit for this dependency graph

The first venv path used a newer interpreter and the install failed because mlx-audio[tts] depends on:

misaki>=0.9.4

At that moment the available wheel compatibility did not line up with the chosen interpreter, which caused the pip install -e ".[tts]" step to fail.

The fix was:

pin the local env to Python 3.12

4. The original setup instructions referenced a script that did not exist locally

The initial workflow assumed a moss_clone_local.py script already existed beside the downloaded model folders. It did not.

A custom runner script was created to provide:

model loading
local text input loading
MP3 to WAV conversion for the reference voice
chunked generation
WAV output writing

5. The codec path was wrong when running from the workspace root

Running the custom runner from the workspace root produced:

FileNotFoundError: Config not found at moss-audio-tokenizer-full/config.json

Root cause:

mlx-audio expected the codec folder to be discoverable relative to the process working directory
the runner was being executed from outside the mlx-audio directory

The fix was to add a loader working-directory resolver so the runner temporarily changes to a directory where ./moss-audio-tokenizer-full/config.json is visible before calling load_model(...).

6. Mixed narration plus background music hurt voice cloning quality

Reference audio that included music was usable, but identity and cleanliness degraded badly.

An optional voice-isolation helper was added so the reference audio can be cleaned before cloning. The helper evolved through a few stages:

first SAM-Audio
then SAM plus MossFormer2
then Demucs-based extraction became the default practical option

The current helper defaults to:

demucs

This was the most practical local path for narration plus background music in this workspace.

7. MossFormer2 model naming was inconsistent

One attempt to use MossFormer2 failed with a 404 on an older or renamed Hugging Face repo path.

The helper was updated to try multiple candidate IDs:

starkdmi/MossFormer2-SE
starkdmi/MossFormer2_SE_48K_MLX

8. Generated audio had hiss

A standalone dehiss_wav.py helper was added using ffmpeg filters.

Then the same de-hiss logic was integrated directly into the MOSS runner behind:

--dehiss
--dehiss-preset

9. Generated words were being cut off

The most important generation-quality bug was not just hiss. Some generations dropped word endings or jumped to the next word.

This turned out to be primarily a token-budget problem:

the runner defaulted to a fixed --max-tokens 300
at least some chunks hit the ceiling exactly
exact cap hits meant the model was being truncated mid-utterance

The fix was not to force manual tuning every run. The runner was extended with:

adaptive token budget selection
capped-chunk detection
retry at a larger budget
safe chunk splitting when still capped

That is now the default behavior.

10. Hugging Face warnings still appear even when generation is local

Even with local model weights present, some parts of the stack still use huggingface_hub for cache checks or fallback metadata access.

That means you may still see warnings like:

unauthenticated Hugging Face Hub requests

This does not by itself mean generation is happening remotely.

If strict offline behavior is required, run with:

HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...

Canonical Workspace Layout

Use this layout:

/path/to/workdir/
  .venv/
  setup_moss_m1.sh
  isolate_voice_local.py
  dehiss_wav.py
  mlx-audio/
    moss_clone_local.py
    moss-tts-8bit/
    moss-audio-tokenizer-full/

Temporary outputs, personal audio, generated WAVs, and input transcript files are intentionally not part of the canonical layout.

Prerequisites

Install these system tools first:

brew install ffmpeg git [email protected]

Optional but useful:

brew install pipx

Rebuild From Scratch

1. Create a clean working directory

mkdir -p /path/to/workdir
cd /path/to/workdir

2. Create the root helper scripts

Create these root-level files from the inline script appendix later in this document:

./setup_moss_m1.sh
./isolate_voice_local.py
./dehiss_wav.py

Make them executable:

chmod +x ./setup_moss_m1.sh ./isolate_voice_local.py ./dehiss_wav.py

Do not try to place ./mlx-audio/moss_clone_local.py yet. The mlx-audio clone does not exist until the next step.

3. Run the setup script

./setup_moss_m1.sh

This does all of the following:

creates a fresh Python 3.12 venv at ./.venv
clones mlx-audio if missing
checks out commit ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
installs mlx-audio[tts]
installs huggingface_hub
downloads the MOSS-TTS model to ./mlx-audio/moss-tts-8bit
downloads the audio tokenizer to ./mlx-audio/moss-audio-tokenizer-full

4. Add the custom MOSS runner into the pinned clone

After the setup script completes, create this file from the inline script appendix later in this document:

./mlx-audio/moss_clone_local.py

This file is required because it is a custom workspace script, not an upstream file guaranteed to exist in the pinned mlx-audio checkout.

5. Install the optional reference-cleaning dependencies

The base setup script is enough for TTS generation. To reproduce the full current helper workflow, also install:

/path/to/workdir/.venv/bin/python -m pip install demucs torchcodec

6. Verify the environment

/path/to/workdir/.venv/bin/python --version
/path/to/workdir/.venv/bin/pip show mlx mlx-lm huggingface_hub misaki demucs torchcodec
cd /path/to/workdir/mlx-audio
git rev-parse HEAD

Expected mlx-audio commit:

ecb8d6a3b3e09efa3c323cb1563baeff6e421d13

7. Optional offline and auth behavior

If you want higher Hugging Face rate limits for model downloads:

/path/to/workdir/.venv/bin/hf auth login

If you want strict no-network execution after all models are already local:

export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

Standard Commands

Generate speech from a reference voice

This is the normal command now. Adaptive chunk handling is on by default, so manual --max-tokens tuning is usually not needed.

/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.mp3 \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss

Clean a mixed reference, then clone from the cleaned voice

This is the most practical end-to-end path when the reference audio contains narration plus background music.

/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs \
  --verbose

/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_clean.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss

Generate from inline text instead of a text file

/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.mp3 \
  --text "This is a local test of MOSS-TTS on Apple Silicon." \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --dehiss

Disable adaptive behavior and force fixed generation settings

Only do this for debugging or controlled experiments.

/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en \
  --no-adaptive \
  --max-tokens 500 \
  --chunk-chars 240

Clean a mixed narration reference with Demucs only

/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs \
  --verbose

Clean a mixed narration reference with Demucs plus MossFormer2

/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode demucs+mossformer2 \
  --verbose

Use SAM-Audio instead of Demucs

/path/to/workdir/.venv/bin/python /path/to/workdir/isolate_voice_local.py \
  --input /path/to/workdir/reference_mixed.wav \
  --output /path/to/workdir/reference_clean.wav \
  --mode sam \
  --description "A person speaking" \
  --verbose

Apply de-hiss to a generated WAV after the fact

/path/to/workdir/.venv/bin/python /path/to/workdir/dehiss_wav.py \
  --input /path/to/workdir/output.wav \
  --output /path/to/workdir/output_clean.wav \
  --preset balanced

Force strict offline behavior

If you want the run to fail rather than touch Hugging Face at all:

HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
/path/to/workdir/.venv/bin/python /path/to/workdir/mlx-audio/moss_clone_local.py \
  --model /path/to/workdir/mlx-audio/moss-tts-8bit \
  --ref-audio /path/to/workdir/reference_voice.wav \
  --text-file /path/to/workdir/input.txt \
  --output /path/to/workdir/output.wav \
  --lang-code en

What Is Required vs Optional

Required for local voice cloning:

setup_moss_m1.sh
mlx-audio/moss_clone_local.py
local moss-tts-8bit model folder
local moss-audio-tokenizer-full folder
Python 3.12
ffmpeg

Optional but practical:

isolate_voice_local.py
dehiss_wav.py
demucs
torchcodec
SAM-Audio
MossFormer2

Known Warnings and What They Mean

`externally-managed-environment`

Cause:

using a Homebrew-managed interpreter directly instead of a proper venv flow

Fix:

create a venv explicitly with Python 3.12
use the venv’s python and pip

`No matching distribution found for misaki>=0.9.4`

Cause:

incompatible interpreter choice during installation

Fix:

use Python 3.12

`can't open file '.../moss_clone_local.py'`

Cause:

running from the wrong folder or assuming the runner script exists in the workspace root

Fix:

run the script at ./mlx-audio/moss_clone_local.py
or cd ./mlx-audio first

`Config not found at moss-audio-tokenizer-full/config.json`

Cause:

the tokenizer directory is local but not visible from the current working directory

Fix:

keep moss-tts-8bit and moss-audio-tokenizer-full beside the mlx-audio checkout
use the custom runner here, which resolves the loader cwd before calling load_model(...)

Hugging Face unauthenticated warning

Cause:

local generation still passes through libraries that may consult huggingface_hub for cache checks or metadata

Fix options:

ignore it if the run works
set HF_TOKEN
or force offline mode with HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1

Words get cut off or skipped

Cause:

token cap reached during generation

Fix:

use the adaptive runner in this document

Reference voice sounds contaminated by background music

Cause:

the clone model conditions on the whole reference audio, not just the voice

Fix:

clean the reference first with isolate_voice_local.py
demucs is the most practical default in this workspace

Exact Helper Scripts

These are the canonical script bodies used for the current working state.

`setup_moss_m1.sh`

#!/usr/bin/env bash
set -euo pipefail

# Clean local setup for fully-local MOSS-TTS 8B voice cloning on Apple Silicon.
# Run this from the directory where you want the project to live.

ROOT_DIR="$(pwd)"
VENV_DIR="${ROOT_DIR}/.venv"
VENV_PY="${VENV_DIR}/bin/python"
VENV_HF="${VENV_DIR}/bin/hf"
PYTHON_BIN="${PYTHON_BIN:-/opt/homebrew/opt/python@3.12/bin/python3.12}"

brew install ffmpeg git [email protected]

if <span class="broken-link">! -x &quot;${PYTHON_BIN}&quot;</span>; then
  PYTHON_BIN="$(command -v python3.12 || true)"
fi

if <span class="broken-link">-z &quot;${PYTHON_BIN}&quot;</span>; then
  echo "python3.12 not found. Install with: brew install [email protected]"
  exit 1
fi

rm -rf "${VENV_DIR}"
"${PYTHON_BIN}" -m venv "${VENV_DIR}"
"${VENV_PY}" -m pip install -U pip setuptools wheel

# Pin the exact OpenMOSS MLX branch commit behind PR #586 (add MOSS-TTS).
if <span class="broken-link">.git</span>; then
  cd mlx-audio
  git fetch --all --tags --prune
else
  git clone https://github.com/OpenMOSS/mlx-audio.git
  cd mlx-audio
fi
git checkout ecb8d6a3b3e09efa3c323cb1563baeff6e421d13
"${VENV_PY}" -m pip install -e ".[tts]"
"${VENV_PY}" -m pip install -U huggingface_hub

# Download the 8B MLX model and the codec weights into the exact default paths
# expected by the current MOSS-TTS MLX README.
"${VENV_HF}" download mlx-community/MOSS-TTS-8B-8bit --local-dir ./moss-tts-8bit
"${VENV_HF}" download OpenMOSS-Team/MOSS-Audio-Tokenizer --local-dir ./moss-audio-tokenizer-full

echo

echo "Setup complete. Run:"
echo "  cd ${ROOT_DIR}/mlx-audio"
echo "  ${VENV_PY} moss_clone_local.py --ref-audio ./voice.mp3 --text-file ./input.txt --output ./final.wav --lang-code en"

`dehiss_wav.py`

#!/usr/bin/env python3
"""Apply de-hiss cleanup to a WAV file using ffmpeg audio filters.

Presets are tuned for TTS outputs with broadband high-frequency hiss.
"""

from __future__ import annotations

import argparse
import shutil
import subprocess
from pathlib import Path


PRESETS = {
    "mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
    "balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
    "strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="De-hiss WAV audio with ffmpeg.")
    p.add_argument("--input", required=True, help="Input WAV/audio path")
    p.add_argument("--output", required=True, help="Output WAV path")
    p.add_argument(
        "--preset",
        choices=sorted(PRESETS.keys()),
        default="balanced",
        help="Filter strength preset",
    )
    p.add_argument(
        "--filter",
        default=None,
        help="Custom ffmpeg -af filter string (overrides --preset)",
    )
    return p.parse_args()


def main() -> None:
    args = parse_args()
    in_path = Path(args.input).expanduser().resolve()
    out_path = Path(args.output).expanduser().resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input not found: {in_path}")
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found. Install with: brew install ffmpeg")

    af = args.filter if args.filter else PRESETS[args.preset]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cmd = [
        ffmpeg,
        "-y",
        "-i",
        str(in_path),
        "-af",
        af,
        "-hide_banner",
        "-loglevel",
        "error",
        str(out_path),
    ]
    subprocess.run(cmd, check=True)
    print(f"Wrote: {out_path}")
    print(f"Filter: {af}")


if __name__ == "__main__":
    main()

`isolate_voice_local.py`

#!/usr/bin/env python3
"""Isolate clean narration voice from mixed audio (speech + background music).

Local pipeline on Apple Silicon using mlx-audio:
1) SAM-Audio source separation (speech target extraction)
2) Optional MossFormer2 speech enhancement
"""

from __future__ import annotations

import argparse
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

from huggingface_hub.errors import HfHubHTTPError, RemoteEntryNotFoundError
from mlx_audio import audio_io
from mlx_audio.sts import MossFormer2SEModel, SAMAudio, SAMAudioProcessor, save_audio
from mlx_audio.sts.models.sam_audio.processor import load_audio
from mlx_audio.utils import get_model_path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Extract clean speech from narration+music audio, fully local."
    )
    parser.add_argument("--input", required=True, help="Input audio path")
    parser.add_argument(
        "--output",
        required=True,
        help="Output clean voice WAV path",
    )
    parser.add_argument(
        "--mode",
        default="demucs",
        choices=[
            "sam",
            "mossformer2",
            "sam+mossformer2",
            "demucs",
            "demucs+mossformer2",
        ],
        help="Processing mode (default: demucs)",
    )
    parser.add_argument(
        "--description",
        default="A person speaking",
        help='SAM-Audio text prompt for target sound (default: "A person speaking")',
    )
    parser.add_argument(
        "--sam-model",
        default="mlx-community/sam-audio-large",
        help="SAM model repo or local path",
    )
    parser.add_argument(
        "--enhancer-model",
        default="starkdmi/MossFormer2-SE",
        help="MossFormer2 model repo or local path",
    )
    parser.add_argument(
        "--demucs-model",
        default="htdemucs_ft",
        help="Demucs model name for vocal separation (default: htdemucs_ft)",
    )
    parser.add_argument(
        "--sam-strategy",
        default="auto",
        choices=["auto", "separate", "long"],
        help="SAM processing strategy",
    )
    parser.add_argument(
        "--long-threshold-sec",
        type=float,
        default=30.0,
        help="Auto mode uses separate_long when input duration exceeds this threshold",
    )
    parser.add_argument(
        "--chunk-seconds",
        type=float,
        default=10.0,
        help="SAM long-mode chunk size in seconds",
    )
    parser.add_argument(
        "--overlap-seconds",
        type=float,
        default=3.0,
        help="SAM long-mode overlap in seconds",
    )
    parser.add_argument(
        "--ode-method",
        choices=["midpoint", "euler"],
        default="midpoint",
        help="SAM ODE method",
    )
    parser.add_argument(
        "--ode-step-size",
        type=float,
        default=(2.0 / 32.0),
        help="SAM ODE step size (default: 2/32)",
    )
    parser.add_argument(
        "--ode-decode-chunk-size",
        type=int,
        default=None,
        help="SAM decode chunk size to reduce memory (optional)",
    )
    parser.add_argument(
        "--anchors",
        default=None,
        help='Optional SAM anchors for separate mode: "+,1.2,3.0;-,4.0,5.5"',
    )
    parser.add_argument(
        "--save-residual",
        default=None,
        help="Optional path to save background/residual track from SAM",
    )
    parser.add_argument(
        "--keep-intermediate",
        action="store_true",
        help="Keep SAM intermediate target WAV when running sam+mossformer2",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Verbose logs",
    )
    return parser.parse_args()


def parse_anchors(raw: Optional[str]) -> Optional[List[List[Tuple[str, float, float]]]]:
    if not raw:
        return None
    items: List[Tuple[str, float, float]] = []
    for segment in raw.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        parts = [x.strip() for x in segment.split(",")]
        if len(parts) != 3:
            raise ValueError(
                f"Invalid anchor segment '{segment}'. Expected format token,start,end."
            )
        token = parts[0]
        if token not in {"+", "-"}:
            raise ValueError(f"Invalid anchor token '{token}'. Use '+' or '-'.")
        start = float(parts[1])
        end = float(parts[2])
        if end <= start:
            raise ValueError(
                f"Invalid anchor range '{segment}'. end must be greater than start."
            )
        items.append((token, start, end))
    if not items:
        return None
    return [items]


def resolve_strategy(args: argparse.Namespace, input_path: Path) -> str:
    if args.sam_strategy in {"separate", "long"}:
        return args.sam_strategy
    wav, sr = load_audio(str(input_path), target_sr=48_000)
    duration = len(wav) / float(sr)
    return "long" if duration > args.long_threshold_sec else "separate"


def ensure_sam_weights(model_name_or_path: str) -> Path:
    model_path = get_model_path(
        model_name_or_path,
        allow_patterns=["*.safetensors", "*.json", "*.pt"],
    )
    has_weights = bool(list(model_path.glob("*.safetensors"))) or (
        model_path / "checkpoint.pt"
    ).exists()
    if not has_weights:
        raise RuntimeError(
            f"SAM model weights not found in {model_path}. "
            "If using facebook/sam-audio-large, you may need HF access approval. "
            "Recommended ungated model: mlx-community/sam-audio-large"
        )
    return model_path


def run_sam(
    input_path: Path,
    output_voice_path: Path,
    args: argparse.Namespace,
) -> Optional[Path]:
    strategy = resolve_strategy(args, input_path)
    anchors = parse_anchors(args.anchors)
    if strategy == "long" and anchors is not None:
        print("Ignoring --anchors because SAM long mode does not support anchors.")
        anchors = None

    if args.verbose:
        print(f"[SAM] strategy={strategy}")
        print(f"[SAM] model={args.sam_model}")
        print(f"[SAM] description={args.description}")

    model_path = ensure_sam_weights(args.sam_model)
    model = SAMAudio.from_pretrained(str(model_path))
    processor = SAMAudioProcessor.from_pretrained(str(model_path))

    ode_opt = {"method": args.ode_method, "step_size": args.ode_step_size}

    if strategy == "separate":
        result = model.separate(
            audios=[str(input_path)],
            descriptions=[args.description],
            anchors=anchors,
            ode_opt=ode_opt,
            ode_decode_chunk_size=args.ode_decode_chunk_size,
        )
    else:
        result = model.separate_long(
            audios=[str(input_path)],
            descriptions=[args.description],
            chunk_seconds=args.chunk_seconds,
            overlap_seconds=args.overlap_seconds,
            ode_opt=ode_opt,
            ode_decode_chunk_size=args.ode_decode_chunk_size,
            verbose=args.verbose,
        )

    output_voice_path.parent.mkdir(parents=True, exist_ok=True)
    save_audio(result.target[0], str(output_voice_path), sample_rate=processor.audio_sampling_rate)
    if args.verbose:
        print(f"[SAM] wrote target voice: {output_voice_path}")
        if result.peak_memory is not None:
            print(f"[SAM] peak_memory={result.peak_memory:.2f} GB")

    if args.save_residual:
        residual_path = Path(args.save_residual).expanduser().resolve()
        residual_path.parent.mkdir(parents=True, exist_ok=True)
        save_audio(
            result.residual[0],
            str(residual_path),
            sample_rate=processor.audio_sampling_rate,
        )
        if args.verbose:
            print(f"[SAM] wrote residual: {residual_path}")
        return residual_path
    return None


def run_mossformer2(input_path: Path, output_path: Path, model_name_or_path: str, verbose: bool) -> None:
    candidates = [model_name_or_path]
    for fallback in ("starkdmi/MossFormer2-SE", "starkdmi/MossFormer2_SE_48K_MLX"):
        if fallback not in candidates:
            candidates.append(fallback)

    last_error = None
    for model_id in candidates:
        try:
            if verbose:
                print(f"[MossFormer2] model={model_id}")
            model = MossFormer2SEModel.from_pretrained(model_id)
            enhanced = model.enhance(str(input_path))
            output_path.parent.mkdir(parents=True, exist_ok=True)
            audio_io.write(str(output_path), enhanced, model.config.sample_rate)
            if verbose:
                print(f"[MossFormer2] wrote enhanced: {output_path}")
            return
        except (RemoteEntryNotFoundError, HfHubHTTPError) as exc:
            last_error = exc
            if verbose:
                print(f"[MossFormer2] failed to load {model_id}: {exc}")
            continue

    if last_error is not None:
        raise RuntimeError(
            "Could not load a MossFormer2 enhancement model from known IDs. "
            "Try passing a valid local path with --enhancer-model."
        ) from last_error

    raise RuntimeError("Could not load MossFormer2 enhancement model.")


def run_demucs(input_path: Path, output_voice_path: Path, model_name: str, verbose: bool) -> Path:
    """Run Demucs two-stem separation and return vocals path."""
    demucs_cmd = [sys.executable, "-m", "demucs.separate"]
    if shutil.which("demucs"):
        demucs_cmd = ["demucs"]

    out_root = output_voice_path.parent / "_demucs_tmp"
    out_root.mkdir(parents=True, exist_ok=True)

    cmd = [
        *demucs_cmd,
        "--two-stems",
        "vocals",
        "-n",
        model_name,
        "-o",
        str(out_root),
        str(input_path),
    ]
    if verbose:
        print("[Demucs] running:", " ".join(cmd))
    try:
        subprocess.run(cmd, check=True)
    except FileNotFoundError as exc:
        raise RuntimeError(
            "Demucs is not installed in this environment. Install with: pip install demucs torchcodec"
        ) from exc

    vocals_path = out_root / model_name / input_path.stem / "vocals.wav"
    if not vocals_path.exists():
        # Fallback: find any vocals.wav in output tree.
        candidates = sorted(out_root.glob("**/vocals.wav"))
        if not candidates:
            raise RuntimeError("Demucs completed but no vocals.wav was produced.")
        vocals_path = candidates[-1]

    y, sr = load_audio(str(vocals_path), target_sr=48_000)
    output_voice_path.parent.mkdir(parents=True, exist_ok=True)
    audio_io.write(str(output_voice_path), y, sr)
    if verbose:
        print(f"[Demucs] wrote vocals: {output_voice_path}")
    return output_voice_path


def main() -> None:
    args = parse_args()
    input_path = Path(args.input).expanduser().resolve()
    output_path = Path(args.output).expanduser().resolve()
    if not input_path.exists():
        raise FileNotFoundError(f"Input audio not found: {input_path}")

    if args.mode == "sam":
        run_sam(input_path=input_path, output_voice_path=output_path, args=args)
        print(f"Done: {output_path}")
        return

    if args.mode == "mossformer2":
        run_mossformer2(
            input_path=input_path,
            output_path=output_path,
            model_name_or_path=args.enhancer_model,
            verbose=args.verbose,
        )
        print(f"Done: {output_path}")
        return

    if args.mode == "demucs":
        run_demucs(
            input_path=input_path,
            output_voice_path=output_path,
            model_name=args.demucs_model,
            verbose=args.verbose,
        )
        print(f"Done: {output_path}")
        return

    # sam+mossformer2
    if args.mode == "sam+mossformer2":
        temp_prefix = "sam_voice_"
    elif args.mode == "demucs+mossformer2":
        temp_prefix = "demucs_voice_"
    else:
        raise ValueError(f"Unsupported mode: {args.mode}")

    temp_file = tempfile.NamedTemporaryFile(prefix=temp_prefix, suffix=".wav", delete=False)
    sam_voice_path = Path(temp_file.name)
    temp_file.close()
    try:
        if args.mode == "sam+mossformer2":
            run_sam(input_path=input_path, output_voice_path=sam_voice_path, args=args)
        else:
            run_demucs(
                input_path=input_path,
                output_voice_path=sam_voice_path,
                model_name=args.demucs_model,
                verbose=args.verbose,
            )
        run_mossformer2(
            input_path=sam_voice_path,
            output_path=output_path,
            model_name_or_path=args.enhancer_model,
            verbose=args.verbose,
        )
    finally:
        if args.keep_intermediate:
            kept = output_path.with_name(output_path.stem + "_sam.wav")
            try:
                sam_voice_path.replace(kept)
                print(f"Kept intermediate SAM output: {kept}")
            except Exception:
                pass
        else:
            sam_voice_path.unlink(missing_ok=True)

    print(f"Done: {output_path}")


if __name__ == "__main__":
    main()

`mlx-audio/moss_clone_local.py`

#!/usr/bin/env python3
"""Local MOSS-TTS voice cloning helper for Apple Silicon.

Expected layout (inside mlx-audio checkout):
  ./moss-tts-8bit
  ./moss-audio-tokenizer-full
"""

from __future__ import annotations

import argparse
import contextlib
import os
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

import numpy as np
import soundfile as sf
from mlx_audio.tts.utils import load_model


DEFAULT_AUDIO_SAMPLING = {
    "temperature": 1.7,
    "top_p": 0.8,
    "top_k": 25,
    "text_temperature": 1.5,
    "text_top_p": 1.0,
    "text_top_k": 50,
    "repetition_penalty": 1.0,
}

DEHISS_PRESETS = {
    "mild": "afftdn=nf=-20:nt=w,lowpass=f=10000",
    "balanced": "afftdn=nf=-23:nt=w,lowpass=f=9000",
    "strong": "afftdn=nf=-26:nt=w,lowpass=f=8000",
}

TOKEN_CAP_MARGIN = 3
MIN_AUTO_TOKENS = 220
MAX_SPLIT_DEPTH = 2
MIN_SPLIT_CHARS = 80


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="MOSS-TTS local voice clone helper")
    parser.add_argument(
        "--model",
        default="./moss-tts-8bit",
        help="Path to local MOSS-TTS model directory (default: ./moss-tts-8bit)",
    )
    parser.add_argument(
        "--ref-audio",
        required=True,
        help="Reference audio path (WAV preferred; MP3/M4A supported via ffmpeg)",
    )
    parser.add_argument("--text", default=None, help="Inline text to synthesize")
    parser.add_argument("--text-file", default=None, help="Text file to synthesize")
    parser.add_argument(
        "--output",
        required=True,
        help="Output WAV file path (example: ./final.wav)",
    )
    parser.add_argument("--lang-code", default="en", help="Language code (default: en)")
    parser.add_argument("--instruct", default=None, help="Optional style instruction")
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=640,
        help="Per-chunk token ceiling; adaptive mode chooses lower budgets automatically (default: 640)",
    )
    parser.add_argument(
        "--chunk-chars",
        type=int,
        default=320,
        help="Target max characters per chunk (default: 320)",
    )
    parser.add_argument(
        "--pause-ms",
        type=int,
        default=140,
        help="Pause between generated chunks in milliseconds (default: 140)",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print detailed generation output",
    )
    parser.add_argument(
        "--adaptive",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Auto-size token budget and retry/split capped chunks (default: on)",
    )
    parser.add_argument(
        "--dehiss",
        action="store_true",
        help="Apply post-generation de-hiss cleanup to output audio",
    )
    parser.add_argument(
        "--dehiss-preset",
        choices=sorted(DEHISS_PRESETS.keys()),
        default="balanced",
        help="De-hiss strength preset (default: balanced)",
    )
    return parser.parse_args()


def load_text(args: argparse.Namespace) -> str:
    if bool(args.text) == bool(args.text_file):
        raise ValueError("Provide exactly one of --text or --text-file")
    if args.text_file:
        return Path(args.text_file).read_text(encoding="utf-8")
    return args.text


def normalize_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    if not text:
        raise ValueError("Input text is empty after normalization")
    return text


def sentence_split(text: str) -> List[str]:
    parts = re.split(r"(?<=[.!?。！？])\s+", text)
    sentences = [part.strip() for part in parts if part and part.strip()]
    return sentences if sentences else [text]


def chunk_text(text: str, max_chars: int) -> List[str]:
    sentences = sentence_split(text)
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Hard-wrap very long sentence to stay within a stable generation window.
            for i in range(0, len(sentence), max_chars):
                segment = sentence[i : i + max_chars].strip()
                if segment:
                    if current:
                        chunks.append(current)
                        current = ""
                    chunks.append(segment)
            continue
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def ensure_wav_24k_mono(path: Path) -> tuple[Path, Path | None]:
    ext = path.suffix.lower()
    if ext == ".wav":
        return path, None
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError(
            "ffmpeg is required for non-WAV reference audio. Install with: brew install ffmpeg"
        )
    tmp = Path(tempfile.NamedTemporaryFile(prefix="moss_ref_", suffix=".wav", delete=False).name)
    cmd = [
        ffmpeg,
        "-y",
        "-i",
        str(path),
        "-ac",
        "1",
        "-ar",
        "24000",
        str(tmp),
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return tmp, tmp


@contextlib.contextmanager
def pushd(path: Path):
    prev = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)


def resolve_loader_cwd(model_path: Path) -> Path:
    """Choose a cwd where moss-audio-tokenizer-full is discoverable by mlx-audio."""
    if (model_path / "moss-audio-tokenizer-full" / "config.json").exists():
        return Path.cwd()

    script_dir = Path(__file__).resolve().parent
    candidates = [
        model_path.parent,
        script_dir,
        Path.cwd(),
    ]
    for base in candidates:
        if (base / "moss-audio-tokenizer-full" / "config.json").exists():
            return base
    return Path.cwd()


def generate_chunks(
    model,
    chunks: Iterable[str],
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    max_tokens: int,
    verbose: bool,
    adaptive: bool,
) -> tuple[List[np.ndarray], int]:
    generated: List[np.ndarray] = []
    sample_rate: int | None = None
    for idx, chunk in enumerate(chunks, start=1):
        print(f"[{idx}] generating ({len(chunk)} chars)")
        chunk_segments, chunk_sample_rate = generate_chunk_adaptive(
            model=model,
            chunk=chunk,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            max_tokens=max_tokens,
            verbose=verbose,
            adaptive=adaptive,
            depth=0,
        )
        generated.extend(chunk_segments)
        if sample_rate is None:
            sample_rate = chunk_sample_rate
        elif sample_rate != chunk_sample_rate:
            raise RuntimeError(
                f"Inconsistent sample rates across chunks: {sample_rate} vs {chunk_sample_rate}"
            )
    if sample_rate is None:
        raise RuntimeError("No audio generated")
    return generated, sample_rate


def choose_token_budget(chunk: str, max_tokens: int) -> int:
    # Empirical fit for MOSS-TTS: character count underestimates needed tokens.
    estimated = int((len(chunk) * 1.25) + 120)
    return max(MIN_AUTO_TOKENS, min(max_tokens, estimated))


def is_token_capped(token_count: int, budget: int) -> bool:
    if budget <= 0:
        return False
    return token_count >= max(1, budget - TOKEN_CAP_MARGIN)


def split_chunk_for_retry(chunk: str) -> List[str]:
    sentences = sentence_split(chunk)
    if len(sentences) >= 2:
        mid = len(sentences) // 2
        left = " ".join(sentences[:mid]).strip()
        right = " ".join(sentences[mid:]).strip()
        if (
            left
            and right
            and len(left) >= MIN_SPLIT_CHARS
            and len(right) >= MIN_SPLIT_CHARS
        ):
            return [left, right]

    mid = len(chunk) // 2
    left_space = chunk.rfind(" ", 0, mid)
    right_space = chunk.find(" ", mid)
    candidates = [pos for pos in (left_space, right_space) if pos != -1]
    if not candidates:
        return [chunk]

    split_at = min(candidates, key=lambda pos: abs(pos - mid))
    left = chunk[:split_at].strip()
    right = chunk[split_at + 1 :].strip()
    if (
        left
        and right
        and len(left) >= MIN_SPLIT_CHARS
        and len(right) >= MIN_SPLIT_CHARS
    ):
        return [left, right]
    return [chunk]


def generate_once(
    model,
    chunk: str,
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    token_budget: int,
    verbose: bool,
) -> Tuple[np.ndarray, int, int]:
    kwargs = dict(
        text=chunk,
        ref_audio=str(ref_audio_wav),
        lang_code=lang_code,
        instruct=instruct,
        max_tokens=token_budget,
        verbose=verbose,
        **DEFAULT_AUDIO_SAMPLING,
    )
    results = list(model.generate(**kwargs))
    if not results:
        raise RuntimeError("No audio generated")
    first = results[0]
    arr = np.array(first.audio, dtype=np.float32)
    token_count = int(getattr(first, "token_count", 0) or 0)
    return arr, int(first.sample_rate), token_count


def generate_chunk_adaptive(
    model,
    chunk: str,
    ref_audio_wav: Path,
    lang_code: str,
    instruct: str | None,
    max_tokens: int,
    verbose: bool,
    adaptive: bool,
    depth: int,
) -> Tuple[List[np.ndarray], int]:
    budget = choose_token_budget(chunk, max_tokens) if adaptive else max_tokens
    print(f"    token budget={budget}")

    audio, sample_rate, token_count = generate_once(
        model=model,
        chunk=chunk,
        ref_audio_wav=ref_audio_wav,
        lang_code=lang_code,
        instruct=instruct,
        token_budget=budget,
        verbose=verbose,
    )
    print(f"    generated token_count={token_count}")

    if not adaptive:
        return [audio], sample_rate

    capped = is_token_capped(token_count, budget)
    if not capped:
        return [audio], sample_rate

    print(f"    token cap likely hit ({token_count}/{budget})")

    # One retry at a larger budget before splitting.
    if budget < max_tokens:
        retry_budget = min(max_tokens, max(budget + 64, int(budget * 1.35)))
        print(f"    retrying with higher budget={retry_budget}")
        retry_audio, retry_sample_rate, retry_token_count = generate_once(
            model=model,
            chunk=chunk,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            token_budget=retry_budget,
            verbose=verbose,
        )
        print(f"    retry token_count={retry_token_count}")
        if not is_token_capped(retry_token_count, retry_budget):
            return [retry_audio], retry_sample_rate
        audio, sample_rate = retry_audio, retry_sample_rate

    if depth >= MAX_SPLIT_DEPTH:
        print("    max split depth reached; keeping capped chunk output")
        return [audio], sample_rate

    parts = split_chunk_for_retry(chunk)
    if len(parts) != 2:
        print("    unable to split chunk safely; keeping capped chunk output")
        return [audio], sample_rate

    print("    splitting chunk and regenerating parts")
    stitched_segments: List[np.ndarray] = []
    part_sample_rate: int | None = None
    for part_idx, part in enumerate(parts, start=1):
        print(f"      part {part_idx}: {len(part)} chars")
        part_segments, part_sr = generate_chunk_adaptive(
            model=model,
            chunk=part,
            ref_audio_wav=ref_audio_wav,
            lang_code=lang_code,
            instruct=instruct,
            max_tokens=max_tokens,
            verbose=verbose,
            adaptive=adaptive,
            depth=depth + 1,
        )
        if part_sample_rate is None:
            part_sample_rate = part_sr
        elif part_sample_rate != part_sr:
            raise RuntimeError(
                f"Inconsistent sample rates in split chunk: {part_sample_rate} vs {part_sr}"
            )
        stitched_segments.extend(part_segments)

    if part_sample_rate is None:
        raise RuntimeError("Split generation failed to produce audio")
    return stitched_segments, part_sample_rate


def join_audio(segments: List[np.ndarray], sample_rate: int, pause_ms: int) -> np.ndarray:
    if len(segments) == 1:
        return segments[0]
    pause = np.zeros(int(sample_rate * (pause_ms / 1000.0)), dtype=np.float32)
    joined: List[np.ndarray] = []
    for i, seg in enumerate(segments):
        joined.append(seg)
        if i < len(segments) - 1 and len(pause) > 0:
            joined.append(pause)
    return np.concatenate(joined, axis=0)


def apply_dehiss_inplace(path: Path, preset: str) -> None:
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError(
            "ffmpeg is required for --dehiss. Install with: brew install ffmpeg"
        )
    af = DEHISS_PRESETS[preset]
    tmp = Path(tempfile.NamedTemporaryFile(prefix="moss_dehiss_", suffix=".wav", delete=False).name)
    try:
        cmd = [
            ffmpeg,
            "-y",
            "-i",
            str(path),
            "-af",
            af,
            "-hide_banner",
            "-loglevel",
            "error",
            str(tmp),
        ]
        subprocess.run(cmd, check=True)
        tmp.replace(path)
        print(f"Applied de-hiss ({preset}) to {path}")
    finally:
        tmp.unlink(missing_ok=True)


def main() -> None:
    args = parse_args()
    model_path = Path(args.model).expanduser().resolve()
    ref_audio_path = Path(args.ref_audio).expanduser().resolve()
    output_path = Path(args.output).expanduser().resolve()

    if not model_path.exists():
        raise FileNotFoundError(f"Model path not found: {model_path}")
    if not ref_audio_path.exists():
        raise FileNotFoundError(f"Reference audio not found: {ref_audio_path}")

    text = normalize_text(load_text(args))
    chunks = chunk_text(text, max_chars=args.chunk_chars)
    print(f"Loaded text as {len(chunks)} chunk(s)")

    ref_wav, cleanup = ensure_wav_24k_mono(ref_audio_path)
    if cleanup:
        print(f"Converted reference audio to WAV: {ref_wav}")

    try:
        loader_cwd = resolve_loader_cwd(model_path)
        if loader_cwd != Path.cwd():
            print(f"Using loader working dir: {loader_cwd}")

        print(f"Loading model: {model_path}")
        with pushd(loader_cwd):
            model = load_model(str(model_path))
        segments, sample_rate = generate_chunks(
            model=model,
            chunks=chunks,
            ref_audio_wav=ref_wav,
            lang_code=args.lang_code,
            instruct=args.instruct,
            max_tokens=args.max_tokens,
            verbose=args.verbose,
            adaptive=args.adaptive,
        )
        final_audio = join_audio(segments, sample_rate=sample_rate, pause_ms=args.pause_ms)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        sf.write(str(output_path), final_audio, sample_rate)
        if args.dehiss:
            apply_dehiss_inplace(output_path, args.dehiss_preset)
        duration_sec = len(final_audio) / sample_rate
        print(f"Wrote {output_path} ({duration_sec:.2f}s @ {sample_rate} Hz)")
    finally:
        if cleanup is not None and cleanup.exists():
            cleanup.unlink(missing_ok=True)


if __name__ == "__main__":
    main()

Notes for Future Reconstruction

Do not assume upstream mlx-audio contains moss_clone_local.py. In this workspace it is a custom file placed inside the pinned checkout.
Do not assume Python 3.14+ will install cleanly. Use Python 3.12.
Keep the two MOSS model folders beside the mlx-audio checkout unless you also update the runner logic.
If reference audio has music, clean it first.
If a future rebuild produces truncation artifacts, verify that adaptive mode is still enabled.
If a future rebuild must avoid all network calls, pre-download every model and set offline env vars explicitly.

Minimal Reproduction Checklist

brew install ffmpeg git [email protected]
run setup_moss_m1.sh
install demucs torchcodec if reference cleaning is needed
keep moss-tts-8bit and moss-audio-tokenizer-full under ./mlx-audio
run mlx-audio/moss_clone_local.py with --dehiss
optionally clean the reference first with isolate_voice_local.py

MOSS-TTS MLX Apple Silicon Local Voice Cloning Setup Guide

Quick Summary

Contents

What This Recreates

Final Working State

Why This Setup Exists

1. The intended local path was MOSS-TTS on MLX, not open Voxtral TTS

2. Plain python3 -m venv plus pip install failed under Homebrew Python / PEP 668

3. Python 3.14 was a bad fit for this dependency graph

4. The original setup instructions referenced a script that did not exist locally

5. The codec path was wrong when running from the workspace root

6. Mixed narration plus background music hurt voice cloning quality

7. MossFormer2 model naming was inconsistent

8. Generated audio had hiss

9. Generated words were being cut off

10. Hugging Face warnings still appear even when generation is local

Canonical Workspace Layout

Prerequisites

Rebuild From Scratch

1. Create a clean working directory

2. Create the root helper scripts

3. Run the setup script

4. Add the custom MOSS runner into the pinned clone

5. Install the optional reference-cleaning dependencies

6. Verify the environment

7. Optional offline and auth behavior

Standard Commands

Generate speech from a reference voice

Clean a mixed reference, then clone from the cleaned voice

Generate from inline text instead of a text file

Disable adaptive behavior and force fixed generation settings

Clean a mixed narration reference with Demucs only

Clean a mixed narration reference with Demucs plus MossFormer2

Use SAM-Audio instead of Demucs

Apply de-hiss to a generated WAV after the fact

Force strict offline behavior

What Is Required vs Optional

Known Warnings and What They Mean

externally-managed-environment

No matching distribution found for misaki>=0.9.4

can't open file '.../moss_clone_local.py'

Config not found at moss-audio-tokenizer-full/config.json

Hugging Face unauthenticated warning

Words get cut off or skipped

Reference voice sounds contaminated by background music

Exact Helper Scripts

setup_moss_m1.sh

dehiss_wav.py

isolate_voice_local.py

mlx-audio/moss_clone_local.py

Notes for Future Reconstruction

Minimal Reproduction Checklist

Graph View

Backlinks (0)

2. Plain `python3 -m venv` plus `pip install` failed under Homebrew Python / PEP 668

`externally-managed-environment`

`No matching distribution found for misaki>=0.9.4`

`can't open file '.../moss_clone_local.py'`

`Config not found at moss-audio-tokenizer-full/config.json`

`setup_moss_m1.sh`

`dehiss_wav.py`

`isolate_voice_local.py`

`mlx-audio/moss_clone_local.py`