What is OpenAudio S1?

OpenAudio S1 is an advanced AI-powered audio processing framework designed to separate and manipulate the different elements of audio content. It uses state-of-the-art deep learning models for tasks such as background removal, voice separation, and emotional speech synthesis.

  • Background removal
  • Voice separation
  • Emotional speech synthesis (laughter, crying, and more)
  • Real-time processing & beginner-friendly

Overview of Openaudio S1

  • Name: OpenAudio S1
  • Type: Audio Processing Framework
  • Main Features: Audio separation, background removal, voice enhancement, emotional speech synthesis
  • Processing Speed: Real-time
  • User Level: Beginner-friendly

How will Fish Audio evolve OpenAudio S1 and its voice technologies?

Real-time Voice Interaction

Fish Audio is introducing real-time voice interaction, enabling seamless conversations with voice library characters. This will make interactions more natural and engaging, transforming virtual assistants, content creation, and gaming.

Language & Emotion Expansion

By expanding training data and optimizing RLHF, OpenAudio S1 will support more languages and richer emotions. Advanced emotional expression is already available for English, Chinese, and Japanese, with more coming soon—strengthening its TTS leadership.

Platform & Deployment Growth

OpenAudio S1’s GUI Inference supports Linux and Windows, with macOS support coming soon. This cross-platform expansion makes it easier to deploy and use OpenAudio S1 on the system of your choice.

Key Features of OpenAudio S1

Highly Natural Sound Quality

  • Generated voices are smooth and realistic, often indistinguishable from real human voiceovers.
  • Suitable for professional use cases like video dubbing, podcasts, and game character voices.
  • Ranked #1 on TTS-Arena and TTS-Arena2, leading benchmarks for text-to-speech evaluation, and widely recognized for its lifelike voice quality and expressive emotional range.
  • Achieved outstanding results in the Seed-TTS evaluation, with an English Word Error Rate (WER) as low as 0.008 (0.8%) and a Character Error Rate (CER) of only 0.004 (0.4%), outperforming traditional models.
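To make these error-rate figures concrete, here is a minimal sketch of how WER and CER are typically computed for TTS evaluation: synthesize speech, transcribe it with an ASR model, then compare the transcript to the original text. It uses the open-source jiwer package; the sentences are invented for illustration and are not from the Seed-TTS evaluation.

```python
# Minimal WER/CER illustration with jiwer. The example sentences are
# made up; in practice the hypothesis would be an ASR transcript of the
# synthesized audio.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps over a lazy dog"  # e.g. an ASR transcript

print("WER:", jiwer.wer(reference, hypothesis))  # 1 substituted word / 9 words ≈ 0.111
print("CER:", jiwer.cer(reference, hypothesis))  # character-level edit-distance ratio
```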

Advanced Emotional and Tone Control

  • Supports over 50 emotions and tone markers, letting users flexibly adjust voice expression with natural language instructions (a sketch of the marker syntax follows this list).
  • Includes both basic emotions (like angry, sad, excited) and advanced ones (such as disdainful, hysterical, sarcastic).
  • Offers tone markers (like shouting, whispering, soft tone) and special audio effects (such as laughing, sobbing, sighing).
  • Uses online Reinforcement Learning from Human Feedback (RLHF) to enhance emotional expression, capturing voice timbre and intonation more precisely.
  • Can smoothly shift between emotions and add subtle sound cues like whispers or laughter, making voices feel more human.
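As a rough illustration of the marker-based control described above, the sketch below builds marked-up input text. The parenthesized "(marker)" convention and the mark() helper are illustrative assumptions, not the official OpenAudio S1 syntax.

```python
# A minimal sketch of embedding emotion and tone markers in input text.
# The "(marker)" convention and the mark() helper are assumptions for
# illustration only.
def mark(text: str, *markers: str) -> str:
    """Prefix text with parenthesized emotion/tone markers."""
    prefix = " ".join(f"({m})" for m in markers)
    return f"{prefix} {text}" if prefix else text

script = "\n".join([
    mark("I can't believe you actually did it!", "excited", "shouting"),
    mark("It's fine... really, it's fine.", "sad", "whispering"),
    "(sighing) Let's just move on.",
])
print(script)
# The marked-up string is then passed to whatever TTS entry point you use
# (web UI, API, or local inference).
```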

Strong Instruction-Following and Dynamic Voice Control

  • Users can control details like speech rate, volume, pauses, and even laughter with simple text commands for highly personalized voice outputs.
  • Developers can customize tone, emphasis, and pacing in real time via API.
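For the API-driven control mentioned above, the following is a hedged sketch of what a request might look like. The endpoint URL, authentication scheme, payload field names, and response format are all assumptions for illustration; consult the Fish Audio API documentation for the real contract.

```python
# A hedged sketch of adjusting delivery (rate, volume, emotion) through an
# HTTP TTS endpoint. Endpoint, auth, and field names are assumptions.
import os
import requests

API_URL = "https://api.fish.audio/v1/tts"   # assumed endpoint
API_KEY = os.environ["FISH_AUDIO_API_KEY"]  # assumed auth: bearer token

payload = {
    "text": "(excited) Welcome back! (whispering) I saved you a seat.",
    "speed": 1.1,   # hypothetical speech-rate control
    "volume": 0,    # hypothetical relative gain
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()

with open("output.mp3", "wb") as f:
    f.write(resp.content)  # assuming the response body is raw audio bytes
```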

Multilingual and Cross-lingual Support

  • Covers 13 languages, including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish.
  • Supports advanced emotions for English, Chinese, and Japanese, with more languages coming soon.
  • Strong generalization capabilities without relying on phonemes, so it can handle text in any language script.
  • Enables global reach with ultra-low latency for multiple languages.

Ultra-Realistic Voice Cloning

  • Supports zero-shot and few-shot voice cloning, needing only 10–30 seconds of audio samples to generate high-fidelity cloned voices in under a minute.
  • Captures not just the tone, but also the unique speaking patterns, rhythm, and style of any voice.
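Before uploading or sending a reference clip, it can help to verify that it meets the 10–30 second guideline mentioned above. The helper below is an illustrative sketch using the soundfile package; any requirements beyond duration (format, sample rate, loudness) are not checked here and depend on the deployment you target.

```python
# Vet a voice-cloning reference clip against the 10-30 second guideline.
import soundfile as sf

def check_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> float:
    """Return the clip duration in seconds, raising if it falls outside the range."""
    data, sr = sf.read(path)
    duration = len(data) / sr
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"Reference clip is {duration:.1f}s; aim for {min_s:.0f}-{max_s:.0f}s "
            "of clean, single-speaker speech."
        )
    return duration

print(f"Reference OK: {check_reference('my_voice.wav'):.1f}s")
```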

Two Model Variants for Diverse Needs

  • OpenAudio S1 (4B parameters): The full-featured flagship model available on fish.audio, offering the highest quality speech synthesis and advanced features.
  • OpenAudio S1-mini (0.5B parameters): A distilled, open-source version with core capabilities, optimized for faster inference while maintaining excellent quality. Available on Hugging Face Space and GitHub for developers.
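For developers who want the open-source S1-mini weights locally, a minimal download sketch might look like the following. The Hugging Face repository id is an assumption to confirm on the fishaudio organization page, and running inference additionally requires the fish-speech codebase from GitHub.

```python
# A hedged sketch of pulling the open-source S1-mini weights for local
# experimentation. The repo_id below is an assumption.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="fishaudio/openaudio-s1-mini")
print("Model files downloaded to:", local_dir)
```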

Model Variants

S1 (4B Parameters)

Flagship Model: The full-scale flagship model, providing the richest and most nuanced performance.

S1-mini (0.5B Parameters)

Efficient & Distilled: A highly efficient, distilled version of S1, designed for scenarios where resource optimization is important, while still maintaining high quality.

OpenAudio S1 is built on the Qwen3 architecture and is natively multimodal, supporting TTS, STT, TextQA, and AudioQA tasks. At present, only the TTS features are publicly available.

The audio encoder and decoder follow a Descript Audio Codec-style design, developed from scratch and paired with a transformer for strong text modeling capabilities.

Pros and Cons

Pros

  • Exceptional sound quality
  • Natural, human-like voices
  • Advanced emotion control
  • Dynamic voice control
  • Multilingual support (13+)
  • Ultra-realistic cloning
  • Fast, efficient processing
  • Flexible deployment options
  • Broad practical uses
  • Continuous improvements

Cons

  • Free plan limits
  • Some features paid
  • Older versions robotic
  • Minor technical glitches

How to Use OpenAudio S1 on Hugging Face?

1. Access the Interface

Go to the OpenAudio S1-Mini space on Hugging Face.
You may need to log in to your Hugging Face account.

2. Input Your Text

In the Input Text box, enter the text you want to convert to speech.
The interface supports multilingual text—simply copy and paste your text, regardless of language.

3. Configure Advanced Settings

Adjust the following parameters as needed:
  • Iterative Prompt Length: Set to 0 to turn off (slider: 0–500)
  • Maximum tokens per batch: Set to 0 for no limit (slider: 0–2048)
  • Top-P: Controls randomness in generation (e.g., 0.9)
  • Repetition Penalty: Prevents repetitive output (e.g., 1.1)
  • Temperature: Controls creativity/randomness (e.g., 0.9)
  • Seed: Set to 0 for randomized inference, or use a specific number for deterministic results
4. Optional Reference Audio

For voice cloning, you can upload a reference audio sample (10–30 seconds of clear speech) to generate high-quality TTS output.

5. Generate

Click the blue Generate button to create your audio.
The generated audio will appear in the Generated Audio section.
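If you prefer to script the Space rather than click through the UI, the gradio_client package can call it programmatically. In the sketch below, the Space id, endpoint name, and parameter names are assumptions; run view_api() first and adapt the predict() call to the signature the Space actually exposes.

```python
# A hedged sketch of driving the Space programmatically with gradio_client.
# The Space id, api_name, and parameter names are assumptions.
from gradio_client import Client

client = Client("fishaudio/openaudio-s1-mini")  # assumed Space id
client.view_api()  # prints the available endpoints and their parameters

result = client.predict(
    text="(cheerful) Hello from OpenAudio S1!",
    temperature=0.9,
    top_p=0.9,
    repetition_penalty=1.1,
    seed=0,
    api_name="/tts",  # hypothetical endpoint name
)
print("Generated audio saved at:", result)
```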

OpenAudio S1 FAQs