Local installation of Fish Audio on Windows 10
I’ve been exploring Fish Audio S2 Pro, one of the best (if not the best) text-to-speech solutions. Getting it installed and working locally on Windows 10, however, isn’t straightforward. There are at least two ways to do it: run a prebuilt Docker container, or build and run the project yourself.
Method 0: Use the free online version
This is the easy route, but expect usage limits: https://fish.audio/app/text-to-speech
Method 1: Fish S2 Pro Zero Docker

- Go to the Hugging Face Fish Audio S2 Pro project page.
- Ensure you’re logged into Hugging Face; you should then see the ‘Run Locally’ option.
- Ensure Docker is installed on the Windows desktop and WSL support is enabled in the Docker options.
- Open a WSL session running Ubuntu 24.04 or similar.
- Enter the docker command:
docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
registry.hf.space/artificialguybr-fish-s2-pro-zero:latest python app.py
- You’ll see Docker download the container image and the models, then start up:
(base) me@DESKTOP:/mnt/c/fish-audio-s2$ docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all registry.hf.space/artificialguybr-fish-s2-pro-zero:latest python app.py
Cloning into 'fish-speech'…
remote: Enumerating objects: 6605, done.
remote: Counting objects: 100% (1088/1088), done.
remote: Compressing objects: 100% (292/292), done.
remote: Total 6605 (delta 905), reused 796 (delta 796), pack-reused 5517 (from 2)
Receiving objects: 100% (6605/6605), 28.21 MiB | 10.42 MiB/s, done.
Resolving deltas: 100% (4328/4328), done.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████| 13/13 [02:19<00:00, 10.72s/it]
You are using a model of type fish_qwen3_omni to instantiate a model of type `. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into Sam2Model), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
Download complete: : 11.0GB [02:19, 79.0MB/s]
2026-07-03 18:59:16.787 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:504 - Injected Semantic IDs into Config: 151678-155773
2026-07-03 18:59:16.787 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:520 - Loading model from /home/user/.cache/huggingface/hub/models--fishaudio--s2-pro/snapshots/1de9996b6be38b745688de084d87a5633f714e4e, config: DualARModelArgs(model_type='dual_ar', vocab_size=155776, n_layer=36, n_head=32, dim=2560, intermediate_size=9728, n_local_heads=8, head_dim=128, rope_base=1000000, norm_eps=1e-06, max_seq_len=32768, dropout=0.0, tie_word_embeddings=True, attention_qkv_bias=False, attention_o_bias=False, attention_qk_norm=True, codebook_size=4096, num_codebooks=10, semantic_begin_id=151678, semantic_end_id=155773, use_gradient_checkpointing=True, initializer_range=0.01976423537605237, is_reward_model=False, scale_codebook_embeddings=True, audio_embed_dim=2560, n_fast_layer=4, fast_dim=2560, fast_n_head=32, fast_n_local_heads=8, fast_head_dim=128, fast_intermediate_size=9728, fast_attention_qkv_bias=False, fast_attention_qk_norm=False, fast_attention_o_bias=False, norm_fastlayer_input=True)
2026-07-03 18:59:46.228 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:552 - Loading sharded safetensors weights
2026-07-03 18:59:46.717 | INFO | fish_speech.models.text2semantic.llama:from_pretrained:588 - Model weights loaded - Status: <All keys matched successfully>
2026-07-03 18:59:48.707 | INFO | fish_speech.models.text2semantic.inference:init_model:366 - Restored model from checkpoint
2026-07-03 18:59:48.708 | INFO | fish_speech.models.text2semantic.inference:init_model:371 - Using DualARTransformer
/usr/local/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
* Running on local URL: http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set ssr_mode=False in launch())
* To create a public link, set share=True in launch().
- Open a browser to http://localhost:7860
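Because the container first clones the repository and downloads about 11 GB of model files (see the log above), the web UI can take several minutes to respond. A minimal Python sketch for polling the port until it answers might look like this. The helper names `http_ok` and `wait_until`, the timeout and the polling interval are illustrative assumptions, not part of Fish Audio or Gradio; only the URL comes from the steps above.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx status.
        return err.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timed out, or reset: not up yet.
        return False


def wait_until(probe, timeout: float = 900.0, interval: float = 5.0) -> bool:
    """Call probe() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example use (hypothetical; run it in a second terminal while the container starts):
#   ready = wait_until(lambda: http_ok("http://localhost:7860"))
#   print("UI is ready" if ready else "UI did not respond in time")
```

Keeping the probe separate from the retry loop means the same loop can be pointed at the port 8888 UI from Method 2 simply by passing a different URL.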

Method 2: Build and run locally
- Clone the GitHub project: https://github.com/fishaudio/fish-speech
- Open a WSL Ubuntu 24.04 installation.
- Ensure NVIDIA GPU support is installed in WSL. A reboot is often required afterwards.
- Follow the installation/build instructions:
  - Run the conda setup steps.
  - Run the uv steps for CPU or GPU, depending on your install.
  - Skip the Docker part.
- WebUI: in the left-hand menu, select ‘Inference’ from the list of items.
  - Download the model weights with the hf command.
  - Optionally, try the command-line inference steps to test the install.
  - Scroll down to the WebUI inference section.
  - Install Gradio if you want the older-style UI (less recommended, but easier than Awesome WebUI).
  - Install the ‘Awesome WebUI’.
  - Start the ‘Awesome WebUI’ using the python command.
  - Open a browser at http://localhost:8888/ui
- Server: in the left-hand menu, select ‘Server’ from the list of items.
  - Run the python command to start the server locally.
  - Try one of the api_client.py commands to test it.
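Before choosing between the CPU and GPU install steps, it can help to confirm that the WSL session actually sees the graphics card. A rough Python sketch follows, assuming the NVIDIA driver exposes the standard `nvidia-smi` tool inside WSL. The function name `gpu_visible` is made up for illustration and is not part of fish-speech.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed or not on PATH inside this WSL session.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


# Example use (hypothetical):
#   print("GPU install" if gpu_visible() else "CPU install, or fix the WSL driver first")
```

If this returns False right after installing the driver, the reboot mentioned above is the first thing to try.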
Things you can do with the server:
- Create your own locally cloned voice (.npy files) from sample WAV files and their transcribed text, using inference.py.
- Check out the Text to Speech API Developer’s Guide
- Check out the emotional cues you can add to the text
Other links:
- List of emotive cues you might give: https://fish.audio/blog/fish-audio-s2-fine-grained-ai-voice-control-at-the-word-level/
- Reddit article
