3 changes: 2 additions & 1 deletion .gitignore
@@ -5,4 +5,5 @@ dist
*.egg-info
.idea
.vscode
test/outs
test/outs
pretrained_models
3 changes: 0 additions & 3 deletions .gitmodules
@@ -1,6 +1,3 @@
[submodule "cosyvoice"]
path = cosyvoice
url = https://github.com/FunAudioLLM/CosyVoice.git
[submodule "third_party/Matcha-TTS"]
path = third_party/Matcha-TTS
url = https://github.com/shivammehta25/Matcha-TTS.git
4 changes: 3 additions & 1 deletion .pre-commit-config.yaml
@@ -6,9 +6,11 @@ repos:
language_version: python3
args: [--line-length=120]
additional_dependencies: ['click==8.0.4']
exclude: '^cosyvoice/'
- repo: https://github.com/pycqa/flake8
rev: 3.9.0
hooks:
- id: flake8
additional_dependencies: [flake8-typing-imports==1.9.0]
args: ['--config=.flake8', '--max-line-length=120', '--ignore=C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606']
args: ['--config=.flake8', '--max-line-length=120', '--ignore=C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606']
exclude: '^cosyvoice/'
42 changes: 35 additions & 7 deletions Dockerfile
@@ -1,18 +1,46 @@
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
ARG MAMBA_VERSION=23.1.0-1
ARG CUDA_VERSION=12.8.0
FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu22.04
ARG PYTHON_VERSION=3.10
ARG MAMBA_VERSION=24.7.1-0
ARG TARGETPLATFORM
ENV PATH=/opt/conda/bin:$PATH \
CONDA_PREFIX=/opt/conda

WORKDIR /opt
WORKDIR /root

RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ca-certificates \
libssl-dev \
curl \
g++ \
make \
git && \
git \
ffmpeg \
unzip && \
rm -rf /var/lib/apt/lists/*

RUN git clone --recursive https://github.com/ModelTC/light-tts.git
RUN cd light-tts && pip3 install -r requirements.txt
WORKDIR /opt/light-tts
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MAMBA_ARCH=aarch64 ;; \
*) MAMBA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -o ~/mambaforge.sh "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh" && \
bash ~/mambaforge.sh -b -p /opt/conda && \
rm ~/mambaforge.sh

RUN case ${TARGETPLATFORM} in \
"linux/arm64") exit 1 ;; \
*) /opt/conda/bin/conda update -y conda && \
/opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
esac && \
/opt/conda/bin/conda clean -ya

COPY ./requirements.txt /lighttts/requirements.txt
RUN pip install -U pip
RUN pip install -r /lighttts/requirements.txt --no-cache-dir

COPY . /lighttts
WORKDIR /lighttts
RUN cd pretrained_models/CosyVoice-ttsfrd/ && \
unzip resource.zip -d . && \
pip install ttsfrd_dependency-0.1-py3-none-any.whl && \
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
22 changes: 22 additions & 0 deletions NOTICE
@@ -0,0 +1,22 @@
Light TTS
Copyright (c) 2024 Light TTS Contributors

This project contains code from the following third-party projects:

================================================================================

CosyVoice (cosyvoice/)
https://github.com/FunAudioLLM/CosyVoice
Copyright (c) Alibaba, Inc. and its affiliates.
Licensed under the Apache License, Version 2.0
Original commit: bc34459

The cosyvoice/ directory contains a modified copy of the CosyVoice project.
We have integrated and adapted this code for use in Light TTS.
The original LICENSE file is preserved in cosyvoice/LICENSE.

All modifications to the original CosyVoice code are also licensed under
the Apache License, Version 2.0.

================================================================================

204 changes: 123 additions & 81 deletions README.md
@@ -1,8 +1,13 @@
![Light TTS Banner](asset/light-tts.jpg)

# light-tts
# LightTTS

**light-tts** is a lightweight and high-performance text-to-speech (TTS) inference and service framework based on Python. It is built around the [cosyvoice](https://github.com/FunAudioLLM/CosyVoice) model and based on the [lightllm](https://github.com/ModelTC/lightllm), with optimizations to support fast, scalable, and service-ready TTS deployment.
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Docker](https://img.shields.io/badge/docker-ready-brightgreen.svg)](https://hub.docker.com/r/lighttts/light-tts)

**⚡ Lightning-Fast Text-to-Speech Inference & Service Framework**

**LightTTS** is a lightweight, high-performance, Python-based text-to-speech (TTS) inference and service framework. It supports the **CosyVoice2** and **CosyVoice3** models, building on the [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) architecture and the [LightLLM](https://github.com/ModelTC/lightllm) framework, with optimizations for fast, scalable, and service-ready TTS deployment.

---

@@ -19,34 +24,33 @@

### Installation

- Installing with Docker
- (Option 1 Recommended) Run with Docker
```bash
# The easiest way to install Lightllm is by using the official image. You can directly pull and run the official image
docker pull lighttts/light-tts:v1.0
# The easiest way to install LightTTS is by using the official image. You can directly pull and run the official image
docker pull lighttts/light-tts:latest

# Or you can manually build the image
docker build -t light-tts:v1.0 .
docker build -t light-tts:latest .

# Run the image
docker run -it --gpus all -p 8080:8080 --shm-size 4g -v your_local_path:/data/ light-tts:v1.0 /bin/bash
docker run -it --gpus all -p 8080:8080 --shm-size 4g -v your_local_path:/data/ light-tts:latest /bin/bash

- Installing from Source
- (Option 2) Install from Source

```bash
# Clone the repo
git clone --recursive https://github.com/ModelTC/light-tts.git
cd light-tts
git clone --recursive https://github.com/ModelTC/LightTTS.git
cd LightTTS
# If the submodule fails to clone due to network issues, rerun the following command until it succeeds
# cd light-tts
# cd LightTTS
# git submodule update --init --recursive

# (Recommended) Create a new conda environment
conda create -n light-tts python=3.10 -y
conda create -n light-tts python=3.10
conda activate light-tts

# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# Install dependencies (We use the latest torch==2.9.1, but other versions are also compatible)
pip install -r requirements.txt

# If you encounter sox compatibility issues
# ubuntu
@@ -55,23 +59,25 @@
sudo yum install sox sox-devel
```

### Model download
### Model Download

We now only support CosyVoice2 model.
We now support CosyVoice2 and CosyVoice3 models.

```python
# SDK模型下载
# ModelScope SDK model download (SDK模型下载)
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
```python
# Download the models via git (make sure git lfs is installed)
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd

# For overseas users, HuggingFace SDK model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

(We have already installed the ttsfrd package in the docker image. If you are using docker image, you can skip this installation)
For better text normalization performance, you can optionally install the ttsfrd package and unzip its resources. This step is not required — if skipped, the system will fall back to WeTextProcessing by default.

```bash
@@ -80,73 +86,109 @@ unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
📝 This setup instruction is based on the original guide from the [CosyVoice repository](https://github.com/FunAudioLLM/CosyVoice).

### Start the Model Service

**Note:** It is recommended to enable the `load_trt` parameter for acceleration. The default flow precision is fp16 for CosyVoice2 and fp32 for CosyVoice3.

**For CosyVoice2:**

```bash
# It is recommended to enable the load_trt parameter for acceleration.
# The default is fp16 mode.
python -m light_tts.server.api_server --model_dir ./pretrained_models/CosyVoice2-0.5B-latest --load_trt True --max_total_token_num 65536 --max_req_total_len 32768
python -m light_tts.server.api_server --model_dir ./pretrained_models/CosyVoice2-0.5B
```

- max_total_token_num: llm arg, the total token nums the gpu and model can support, equals = `max_batch * (input_len + output_len)`
- max_req_total_len: llm arg, the max value for `req_input_len + req_output_len`, 32768 is set here because the `max_position_embeddings` of the llm part is 32768
- There are many other parameters that can be viewed in `light_tts/server/api_cli.py`
**For CosyVoice3:**

```bash
python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
```

**With custom data type** (float32, bfloat16, or float16; default: float16):

```bash
# Use float32 for better accuracy or float16 for faster speed
python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512 --data_type float32
```

**Available Parameters:**

The default values are usually the fastest and generally do not need to be adjusted. If you need to customize them, please refer to the following parameter descriptions:
- `load_trt`: Whether to load the flow_decoder in TensorRT mode (default: True).
- `data_type`: Data type for LLM inference (default: float16).
- `load_jit`: Whether to load the flow_encoder in JIT mode (default: False).
- `max_total_token_num`: LLM arg; the total number of tokens the GPU and model can support, i.e. `max_batch * (input_len + output_len)` (default: 64 * 1024).
- `max_req_total_len`: LLM arg; the maximum value of `req_input_len + req_output_len` (default: 32768, matching `max_position_embeddings`).
- `graph_max_len_in_batch`: Maximum sequence length for CUDA graph capture in the decoding stage (default: 32768).
- `graph_max_batch_size`: Maximum batch size for CUDA graph capture in the decoding stage (default: 16).

For more parameters, see `light_tts/server/api_cli.py`. A combined launch example is sketched below.
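
For reference, a launch command combining several of these options might look like the following sketch. It assumes each listed parameter maps to a CLI flag of the same name (as in the earlier examples); the model path and values are illustrative and simply mirror the documented defaults:

```bash
# Illustrative launch: CosyVoice2 with the TensorRT flow decoder, float16 LLM
# inference, and explicit token/batch limits (values match the stated defaults)
python -m light_tts.server.api_server \
    --model_dir ./pretrained_models/CosyVoice2-0.5B \
    --load_trt True \
    --data_type float16 \
    --max_total_token_num 65536 \
    --max_req_total_len 32768 \
    --graph_max_batch_size 16
```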

Wait for a while, this service will be started. The default startup is localhost:8080.
Wait for the service to initialize. The default address is `http://localhost:8080`.

### Request Examples

When your service is started, you can call the service through the http API. We support three modes: non-streaming, streaming and bi-streaming.

- non-streaming and streaming. You can also use `test/test_zero_shot.py`, which can print information such as rtf and ttft.


```python
import requests
import time
import soundfile as sf
import numpy as np
import os
import threading
import json

url = "http://localhost:8080/inference_zero_shot"
path = "cosyvoice/asset/zero_shot_prompt.wav" # wav file path
prompt_text = "希望你以后能够做的比我还好呦。"
tts_text = "收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。"
stream = True # Whether to use streaming inference
files = {
"prompt_wav": ("sample.wav", open(path, "rb"), "audio/wav")
}
data = {
"tts_text": tts_text,
"prompt_text": prompt_text,
"stream": stream
}
response = requests.post(url, files=files, data=data, stream=True)
sample_rate = 24000

audio_data = bytearray()
try:
for chunk in response.iter_content(chunk_size=4096):
if chunk:
audio_data.extend(chunk)
except Exception as e:
print(f"Exception: {e}")
print(f"Error: {response.status_code}, {response.text}")
return
audio_np = np.frombuffer(audio_data, dtype=np.int16)
if response.status_code == 200:
output_wav = f"./outs/output{'_stream' if stream else ''}_{index}.wav"
sf.write(output_wav, audio_np, samplerate=sample_rate, subtype="PCM_16")
print(f"saved as {output_wav}")
else:
print("Error:", response.status_code, response.text)
```
Once the service is running, you can interact with it through the HTTP API. We support three modes: **non-streaming**, **streaming**, and **bi-streaming**.

- **Non-streaming and Streaming**: Use `test/test_zero_shot.py` for examples, which prints metrics such as RTF (Real-Time Factor) and TTFT (Time To First Token); a minimal request sketch is also shown below.
- **Bi-streaming**: Uses WebSocket interface. See usage examples in `test/test_bistream.py`
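
For a quick check outside the test scripts, a minimal non-streaming request could look like the sketch below. It assumes the `/inference_zero_shot` endpoint with `prompt_wav`, `prompt_text`, `tts_text`, and `stream` fields, and a raw 16-bit PCM response at 24 kHz, matching what `test/test_zero_shot.py` uses; treat it as illustrative rather than a reference client.

```python
import numpy as np
import requests
import soundfile as sf

url = "http://localhost:8080/inference_zero_shot"

# Reference (prompt) audio, its transcript, and the text to synthesize
files = {"prompt_wav": ("prompt.wav", open("cosyvoice/asset/zero_shot_prompt.wav", "rb"), "audio/wav")}
data = {
    "prompt_text": "希望你以后能够做的比我还好呦。",
    "tts_text": "收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。",
    "stream": False,  # non-streaming: the full audio is returned in a single response body
}

response = requests.post(url, files=files, data=data)
response.raise_for_status()

# The response body is raw 16-bit PCM at 24 kHz; wrap it in a WAV container
audio = np.frombuffer(response.content, dtype=np.int16)
sf.write("output.wav", audio, samplerate=24000, subtype="PCM_16")
print("saved as output.wav")
```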

## 📊 Performance Benchmarks

We have conducted performance benchmarks on different GPU configurations to demonstrate the throughput and latency characteristics of LightTTS in both non-streaming and streaming modes.

Model: `Fun-CosyVoice3-0.5B-2512`, data type: `float16`

### NVIDIA GeForce RTX 4090D
Non-streaming (`test/test_zs.py`):

|num_workers|cost time 50%|cost time 90%|cost time 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
|------|------|------|------|------|------|------|------|------|------|
|1|0.61|1.09|1.51|0.13|0.16|0.22|0.13|33.95|1.47|
|2|0.8|1.24|1.71|0.15|0.22|0.25|0.16|21.46|2.33|
|4|1.02|1.88|2.27|0.22|0.29|0.38|0.23|15.31|3.27|
|8|1.76|2.36|3.48|0.33|0.49|0.62|0.36|12.18|4.1|

Streaming (`test/test_zs_stream.py`):

- bi-streaming. We use the websocket interface implementation, and we can find usage examples in `test/test_bistream.py`.
|num_workers|cost time 50%|cost time 90%|cost time 99%|ttft 50%|ttft 90%|ttft 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
|------|------|------|------|------|------|------|------|------|------|------|------|------|
|1|1.01|2.15|2.82|0.33|0.34|0.9|0.21|0.25|0.34|0.22|60.13|0.83|
|2|1.83|3.56|5.16|0.93|1.53|2.3|0.34|0.63|0.81|0.4|52.47|0.95|
|4|3.43|5.76|7.31|2.62|4.37|5.8|0.7|1.28|2.16|0.81|48.74|1.03|
|8|7.27|10.01|10.45|6.4|8.55|9.03|1.28|2.67|3.66|1.57|47.37|1.06|

### NVIDIA GeForce RTX 5090
Non-streaming:

|num_workers|cost time 50%|cost time 90%|cost time 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
|------|------|------|------|------|------|------|------|------|------|
|1|0.51|0.81|1.61|0.11|0.13|0.23|0.11|28.9|1.73|
|2|0.64|1.1|1.48|0.13|0.16|0.26|0.13|17.54|2.85|
|4|0.87|1.28|1.68|0.17|0.23|0.36|0.18|11.45|4.37|
|8|1.32|1.86|2.14|0.25|0.4|0.6|0.29|8.97|5.57|

Streaming:

|num_workers|cost time 50%|cost time 90%|cost time 99%|ttft 50%|ttft 90%|ttft 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
|------|------|------|------|------|------|------|------|------|------|------|------|------|
|1|0.76|1.41|2.27|0.28|0.3|0.31|0.16|0.18|0.22|0.16|44.06|1.13|
|2|1.45|2.34|3.46|0.74|1.28|1.75|0.27|0.45|0.7|0.3|38.82|1.29|
|4|2.9|4.04|4.7|2.16|3.03|3.4|0.5|1.04|1.51|0.61|37.75|1.32|
|8|5.78|7.74|8.49|5.01|6.73|7.35|1.03|2.09|2.85|1.22|37.67|1.33|

**Metrics Explanation:**
- **num_workers**: Number of concurrent workers
- **cost time**: Total request processing time in seconds (50th/90th/99th percentile)
- **ttft**: Time to First Token in seconds (50th/90th/99th percentile)
- **rtf**: Real-Time Factor (50th/90th/99th percentile)
- **avg rtf**: Average Real-Time Factor
- **total_cost_time**: Total benchmark duration in seconds
- **qps**: Queries Per Second
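
As a point of reference for the RTF figures: RTF here follows the usual definition of synthesis wall-clock time divided by the duration of the generated audio, so producing 10 seconds of audio in 2 seconds gives an RTF of 0.2, and values below 1 mean faster-than-real-time synthesis.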

## License
This repository is released under the [Apache-2.0](LICENSE) license.

This repository is released under the [Apache-2.0](LICENSE) license.

### Third-Party Code Attribution

This project includes code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) (Copyright Alibaba, Inc. and its affiliates), which is also licensed under Apache-2.0. The CosyVoice code is located in the `cosyvoice/` directory and has been integrated and modified as part of LightTTS. See the [NOTICE](NOTICE) file for complete attribution details.
1 change: 0 additions & 1 deletion cosyvoice
Submodule cosyvoice deleted from c939c8