Local installation of Fish Audio on Windows 10

I’ve been exploring

Fish Audio S2 Pro is one of the best text-to-speech solutions, if not the best. Getting it installed and working locally on Windows 10, however, isn’t straightforward. There are at least two ways to do it: run a prebuilt Docker container, or build and run the project yourself.

Method 0: Use the free online version

This is the easiest option, but expect usage limits: https://fish.audio/app/text-to-speech

Method 1: Fish S2 Pro Zero Docker

  1. Go to the Hugging Face Fish Audio S2 Pro project page.
  2. Ensure you’re logged into Hugging Face; you should then see the ‘Run Locally’ option.
  3. Ensure Docker is installed on the Windows desktop and WSL support is enabled in the Docker options.
  4. Open a WSL session running Ubuntu 24.04 or similar.
  5. Enter the docker command:
docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
registry.hf.space/artificialguybr-fish-s2-pro-zero:latest python app.py

  6. You’ll see the Docker container download, along with the models, and then start up:

(base) me@DESKTOP:/mnt/c/fish-audio-s2$ docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all registry.hf.space/artificialguybr-fish-s2-pro-zero:latest python app.py
Cloning into 'fish-speech'…
remote: Enumerating objects: 6605, done.
remote: Counting objects: 100% (1088/1088), done.
remote: Compressing objects: 100% (292/292), done.
remote: Total 6605 (delta 905), reused 796 (delta 796), pack-reused 5517 (from 2)
Receiving objects: 100% (6605/6605), 28.21 MiB | 10.42 MiB/s, done.
Resolving deltas: 100% (4328/4328), done.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████| 13/13 [02:19<00:00, 10.72s/it]
Fetching 13 files: 62%|████████████████████████████████████████ | 8/13 [02:19<01:13, 14.78s/it
You are using a model of type fish_qwen3_omni to instantiate a model of type `. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into Sam2Model), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Download complete: : 11.0GB [02:19, 79.0MB/s]
2026-07-03 18:59:16.787 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:504 - Injected Semantic IDs into Config: 151678-155773
2026-07-03 18:59:16.787 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:520 - Loading model from /home/user/.cache/huggingface/hub/models--fishaudio--s2-pro/snapshots/1de9996b6be38b745688de084d87a5633f714e4e, config: DualARModelArgs(model_type='dual_ar', vocab_size=155776, n_layer=36, n_head=32, dim=2560, intermediate_size=9728, n_local_heads=8, head_dim=128, rope_base=1000000, norm_eps=1e-06, max_seq_len=32768, dropout=0.0, tie_word_embeddings=True, attention_qkv_bias=False, attention_o_bias=False, attention_qk_norm=True, codebook_size=4096, num_codebooks=10, semantic_begin_id=151678, semantic_end_id=155773, use_gradient_checkpointing=True, initializer_range=0.01976423537605237, is_reward_model=False, scale_codebook_embeddings=True, audio_embed_dim=2560, n_fast_layer=4, fast_dim=2560, fast_n_head=32, fast_n_local_heads=8, fast_head_dim=128, fast_intermediate_size=9728, fast_attention_qkv_bias=False, fast_attention_qk_norm=False, fast_attention_o_bias=False, norm_fastlayer_input=True)
2026-07-03 18:59:46.228 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:552 - Loading sharded safetensors weights
2026-07-03 18:59:46.717 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:588 - Model weights loaded - Status: <All keys matched successfully>
2026-07-03 18:59:48.707 | INFO | fish_speech.models.text2semantic.inference:init_model:366 - Restored model from checkpoint
2026-07-03 18:59:48.708 | INFO | fish_speech.models.text2semantic.inference:init_model:371 - Using DualARTransformer
/usr/local/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
* Running on local URL: http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set ssr_mode=False in launch())
* To create a public link, set share=True in launch().

  7. Open a browser to http://localhost:7860
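
If you start the container often, the docker command from step 5 can be wrapped in a small script. What follows is a minimal sketch, not something that ships with the Space: the function names (build_docker_command, wait_for_ui, launch) are my own, and forwarding HF_TOKEN with -e assumes the Hugging Face client inside the container reads that variable, which the "unauthenticated requests" warning in the log suggests but does not prove.

```python
import os
import subprocess
import time
import urllib.error
import urllib.request

# Image name copied from the docker command in step 5.
IMAGE = "registry.hf.space/artificialguybr-fish-s2-pro-zero:latest"


def build_docker_command(port=7860, image=IMAGE, pass_hf_token=False):
    """Return the docker run command from step 5 as an argument list."""
    cmd = [
        "docker", "run", "-it",
        "-p", f"{port}:7860",  # host port -> port 7860 inside the container
        "--platform=linux/amd64",
        "--gpus", "all",
    ]
    if pass_hf_token:
        # Assumption: forwarding the host's HF_TOKEN avoids the rate-limit warning.
        cmd += ["-e", "HF_TOKEN"]
    cmd += [image, "python", "app.py"]
    return cmd


def wait_for_ui(url="http://localhost:7860", timeout=900.0, interval=5.0):
    """Poll the web UI until it answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not up yet (still downloading or loading)
    return False


def launch():
    """Start the container, then report when the UI becomes reachable."""
    command = build_docker_command(pass_hf_token="HF_TOKEN" in os.environ)
    print("Running:", " ".join(command))
    process = subprocess.Popen(command)
    if wait_for_ui():
        print("UI is ready at http://localhost:7860")
    process.wait()
```

Calling launch() from the WSL shell starts the container and prints a message once the page responds. The generous default timeout is there because the first run downloads about 11 GB of model files, as the log above shows.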

Method 2: Build and run locally

  1. Clone the GitHub project: https://github.com/fishaudio/fish-speech
  2. Open a WSL Ubuntu 24.04 installation.
  3. Ensure NVIDIA GPU support is installed in WSL. A reboot is often required afterwards.
  4. Follow the project’s installation/build instructions:
    • Run the conda setup steps
    • Run the uv steps for CPU or GPU, depending on your install
    • Skip the Docker part
  5. WebUI: in the left-hand menu, select ‘Inference’
    • Download the model weights with the hf command
      • Optionally, test the install using the command-line inference steps
    • Scroll down to the WebUI inference section
    • Install Gradio if you want the older-style UI (less recommended, but easier to set up than Awesome WebUI)
    • Install the ‘Awesome WebUI’
    • Start the ‘Awesome WebUI’ using the python command
  6. Server
    • Select ‘Server’ from the left-hand menu
    • Run the python command to start the server locally
    • Try one of the api_client.py commands to test it
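
Most of the trouble with this method comes from a missing prerequisite, so it can help to check for the tools before starting. Here is a minimal sketch that looks for the commands the steps above mention. The tool list is my reading of those steps rather than an official requirement list, nvidia-smi only matters for a GPU install, and the helper names are hypothetical.

```python
import shutil

# Command-line tools the steps above rely on, as invoked from a WSL shell.
REQUIRED_TOOLS = {
    "git": "cloning the fish-speech repository",
    "nvidia-smi": "NVIDIA GPU support inside WSL",
    "conda": "the conda setup steps",
    "uv": "the uv install steps",
    "hf": "downloading the model weights",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {tool: purpose} for every tool that is not found on PATH."""
    return {name: why for name, why in tools.items() if which(name) is None}


def report(missing):
    """Format a short human-readable summary of the check."""
    if not missing:
        return "All required tools were found."
    lines = [f"- {name}: needed for {why}" for name, why in missing.items()]
    return "Missing tools:\n" + "\n".join(lines)
```

Running print(report(missing_tools())) inside the WSL session lists anything that still needs installing. The `which` parameter exists only so the lookup can be swapped out when testing.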

Things you can do with the server:
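
For example, once the server is running you can script speech generation instead of clicking through a UI. The sketch below rests on several assumptions I have not confirmed for S2 Pro: that the server listens on 127.0.0.1:8080, that it exposes a POST /v1/tts endpoint, and that it accepts a JSON body with "text" and "format" fields. Check api_client.py in the repository for the actual address, endpoint and fields before relying on it. The function names are my own.

```python
import json
import urllib.request

# Assumptions (not confirmed here): default address and endpoint of the server.
DEFAULT_BASE_URL = "http://127.0.0.1:8080"
TTS_PATH = "/v1/tts"


def build_tts_request(text, base_url=DEFAULT_BASE_URL, audio_format="wav"):
    """Build (but do not send) an HTTP request asking the server to speak `text`."""
    payload = {"text": text, "format": audio_format}  # assumed field names
    return urllib.request.Request(
        url=base_url.rstrip("/") + TTS_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **kwargs):
    """Send the request and save the returned audio bytes to `out_path`."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

With the server up, synthesize("Hello from Fish Audio") would write output.wav to the current directory, provided the assumptions above hold.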

Other links:
