CosyVoice-300M-SFT cannot generate longer speech #2280

ieayoio · 2024-09-11T07:38:58Z

System Info

OS: Ubuntu 22.04.4 LTS
CUDA Version: 12.6
GPU: Tesla V100S-PCIE-32GB
Docker version: 24.0.7

Running Xinference with Docker?

docker
pip install
installation from source

Version info

Docker image: xprobe/xinference:v0.15.0

The command used to start Xinference

docker-compose.yml

version: "3"
services:
  xinference8992:
    container_name: xinference8992
    restart: always
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
    volumes:
      - ./appdata:/root/.xinference
      - /data/.cache/huggingface:/root/.cache/huggingface
      - /data/.cache/modelscope:/root/.cache/modelscope
    ports:
      - 8992:9997
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    image: xprobe/xinference:v0.15.0
    command: xinference-local -H 0.0.0.0 --log-level debug

Reproduction

1. Launch the CosyVoice-300M-SFT model with default parameters.

2. Test it with the following curl command:

curl http://xx.xx.xx.xx:8992/v1/audio/speech \
  -H "Authorization: Bearer ddddd" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice-300M-SFT",
    "input": "Java是一种广泛使用的计算机编程语言，它是由Sun Microsystems公司（现已被甲骨文公司收购）于1995年推出的。Java是一种面向对象的编程语言，设计之初就旨在使应用程序能够“一次编写，到处运行”（Write Once, Run Anywhere，简称WORA）。这意味着一个Java程序在编写完成后，可以在任何安装了Java虚拟机（JVM）的设备上运行，无论是Windows、macOS、Linux还是移动设备等。\n\n以下是Java的一些关键特点：\n\n1. **面向对象**：Java使用面向对象编程（OOP）的概念，如类（Class）、对象（Object）、继承 （Inheritance）、封装（Encapsulation）和多态（Polymorphism）。\n\n2. **平台无关性**：Java代码在编译后生成字节码（.class 文件），这些字节码可以在任何支持Java虚拟机的平台上运行。\n\n3. **简单性**：Java设计时考虑了易学易用，它的语法相对简单， 且没有C++中的指针和多继承等复杂特性。\n\n4. **强类型**：Java是一种强类型语言，这意味着变量在使用前必须声明其类型。\n\n5. **安全性**：Java提供了许多安全特性，如异常处理、访问控制、字节码验证等。\n\n6. **多线程**：Java内置了对多线程的支持，这使得它可以同时执行多个任务。\n\n7. **库丰富**：Java有一个庞大的标准库，提供了各种工具和API，可以用于网络编程、图形用户界面（GUI）开发、数据库连接等。\n\nJava被广泛应用于企业级应用、Android移动应用开发、Web应用开发、大数据处理、云计算等领域 。由于其跨平台特性和强大的生态系统，Java成为了全球范围内最受欢迎的编程语言之一。",
    "voice": "中文女"
  }' \
  --output speech.mp3
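
For readers who prefer to reproduce this from Python, the curl command above can be approximated with the standard library alone. This is a minimal sketch, not part of the original report: the function names `build_speech_request` and `synthesize` are made up for illustration, and the base URL must be replaced with the real host (it is redacted as xx.xx.xx.xx in the issue).

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str = "中文女",
                         model: str = "CosyVoice-300M-SFT",
                         api_key: str = "ddddd") -> urllib.request.Request:
    """Build the same POST request as the curl command in the report."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str = "speech.mp3") -> int:
    """Send the request, write the returned audio bytes, return the byte count."""
    with urllib.request.urlopen(build_speech_request(base_url, text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Calling `synthesize` once with the first sentence only and once with the full text, then comparing the returned sizes, gives a rough check of whether the extra text produced any additional audio.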

3. Play speech.mp3. The audio only reads up to “Java....于1995年推出的。”, that is, a single sentence; all of the text after it is ignored.

4. The docker-compose logs are as follows:

xinference8992    | 2024-09-11 00:25:05,175 xinference.core.supervisor 140 DEBUG    [request f73f6490-700e-11ef-8b05-0242ac1a0002] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x798f5e3896c0>,CosyVoice-300M-SFT, kwargs:
xinference8992    | 2024-09-11 00:25:05,176 xinference.core.worker 140 DEBUG    Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x798f5e38a520>, kwargs: model_uid=CosyVoice-300M-SFT-1-0
xinference8992    | 2024-09-11 00:25:05,176 xinference.core.worker 140 DEBUG    Leave get_model, elapsed time: 0 s
xinference8992    | 2024-09-11 00:25:05,176 xinference.core.supervisor 140 DEBUG    [request f73f6490-700e-11ef-8b05-0242ac1a0002] Leave get_model, elapsed time: 0 s
xinference8992    | 2024-09-11 00:25:05,179 xinference.core.model 158 DEBUG    Request speech, current serve request count: 0, request limit: None for the model CosyVoice-300M-SFT-1-0
xinference8992    | 2024-09-11 00:25:05,179 xinference.core.model 158 DEBUG    [request f740062a-700e-11ef-bf1f-0242ac1a0002] Enter speech, args: <xinference.core.model.ModelActor object at 0x7faf286c7010>, kwargs: input=Java是一种广泛使用的计算机编程语言，它是由Sun Microsystems公司（现已被甲骨文公司收购）于1995年推出的。Java是一种面向对象的编程语言，设计之初就旨在使应用程 序能够“一次编写，...,voice=中文女,response_format=mp3,speed=1.0,stream=False
xinference8992    | 2024-09-11 00:25:05,181 xinference.model.audio.cosyvoice 158 INFO     CosyVoice inference_sft
xinference8992    | 2024-09-11 00:25:12,253 xinference.core.model 158 DEBUG    [request f740062a-700e-11ef-bf1f-0242ac1a0002] Leave speech, elapsed time: 7 s
xinference8992    | 2024-09-11 00:25:12,254 xinference.core.model 158 DEBUG    After request speech, current serve request count: 0 for the model CosyVoice-300M-SFT-1-0

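Expected behavior

Long text passed in input should be synthesized into complete audio. If the limit exists for performance reasons, please expose a parameter to configure it; reading only one sentence is too little.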


qinxuye · 2024-09-11T11:35:49Z

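This is most likely related to the model. Check the punctuation: punctuation marks that are not in CosyVoice's training corpus may cause generation to terminate.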
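
One way to test this hypothesis is to strip Markdown markers and unusual punctuation from the input before sending it. The sketch below is only an illustration: `SAFE_PUNCT` is a hypothetical, conservative whitelist, since the thread does not say which marks CosyVoice was actually trained on, and `normalize_for_tts` is a made-up helper name.

```python
import re

# ASSUMPTION: hypothetical whitelist of punctuation; not taken from CosyVoice.
SAFE_PUNCT = "，。！？、；：,.!?;:"


def normalize_for_tts(text: str) -> str:
    """Replace non-whitelisted symbols with spaces and collapse whitespace."""
    # Any character that is not a word character (letters, digits, CJK),
    # whitespace, or whitelisted punctuation (e.g. "*", "（", "“") becomes a space.
    cleaned = re.sub(rf"[^\w\s{re.escape(SAFE_PUNCT)}]", " ", text)
    # Newlines and runs of spaces collapse into a single space.
    return re.sub(r"\s+", " ", cleaned).strip()
```

If the normalized text is still cut off after the first sentence, punctuation is unlikely to be the cause.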

ieayoio · 2024-09-12T07:04:03Z

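> This is most likely related to the model. Check the punctuation: punctuation marks that are not in CosyVoice's training corpus may cause generation to terminate.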

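When I integrated CosyVoice into Dify, I found that Dify could only read out one segment of audio. Calling the endpoint exposed by Xinference directly with curl also returns only one segment.

I found the relevant code: xinference/model/audio/cosyvoice.py calls self._model.inference_sft.

The official example iterates over the result with enumerate and gets multiple objects, so the way Xinference currently calls it does indeed yield only one segment of audio.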
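
The difference between taking one item and iterating the whole result can be sketched with a stand-in generator. Everything here is an assumption for illustration: `fake_inference_sft` is a hypothetical stub (not the CosyVoice API), the `"tts_speech"` key mirrors what I recall from the upstream example and should be checked against it, and plain lists stand in for audio tensors. It does not claim to reproduce the actual Xinference code.

```python
from typing import Dict, Iterable, Iterator, List


def fake_inference_sft(text: str, voice: str) -> Iterator[Dict[str, List[float]]]:
    """Hypothetical stand-in: yields one audio chunk per sentence."""
    for sentence in filter(None, text.split("。")):
        # Pretend each character becomes one audio sample.
        yield {"tts_speech": [0.0] * len(sentence)}


def first_chunk_only(chunks: Iterable[Dict[str, List[float]]]) -> List[float]:
    """Takes only the first yielded chunk, so later sentences are lost."""
    return next(iter(chunks))["tts_speech"]


def all_chunks(chunks: Iterable[Dict[str, List[float]]]) -> List[float]:
    """Iterates the whole generator and concatenates every chunk."""
    samples: List[float] = []
    for chunk in chunks:
        samples.extend(chunk["tts_speech"])
    return samples
```

With a two-sentence input, `first_chunk_only` returns the samples of the first sentence only, which matches the symptom described in this issue, while `all_chunks` covers the full text.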

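However, from the Xinference logs I can see that Dify (version 0.7.0) also splits the text into many segments when using CosyVoice, and calls the CosyVoice model once per segment, in non-streaming mode. Seen that way, Xinference returning only one segment per call should not affect Dify, yet Dify still reads out only one segment, so Dify probably has a bug of its own 😂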
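
Client-side splitting of this kind can be sketched as follows. This is not Dify's actual logic, which the thread does not show; `split_sentences` and `synthesize_by_sentence` are hypothetical helpers, and `synth` is whatever callable performs one non-streaming request and returns the audio bytes.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split at newlines and after Chinese/English sentence-ending marks."""
    # Each match is a run of non-terminator characters plus an optional
    # trailing terminator; newlines are skipped between matches.
    parts = re.findall(r"[^。！？!?\n]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def synthesize_by_sentence(text: str, synth: Callable[[str], bytes]) -> List[bytes]:
    """Call `synth` once per sentence and return the audio of each call in order."""
    return [synth(sentence) for sentence in split_sentences(text)]
```

The per-sentence results are returned as a list rather than joined, because simply concatenating encoded audio such as MP3 is not guaranteed to produce a valid file.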

XprobeBot added the gpu label Sep 11, 2024

XprobeBot added this to the v0.15 milestone Sep 11, 2024

qinxuye mentioned this issue Sep 18, 2024: BUG: Fix CosyVoice missing output #2320 (Merged)

qinxuye closed this as completed in #2320 Sep 18, 2024
