A large-scale pre-trained RVC model specialized for Japanese pronunciation

Overview

HiFi-GAN / ContentVec

Hakumai is a pre-trained model that focuses on breath sounds (inhaling and exhaling) to achieve authentic Japanese pronunciation and realistic vocal delivery.

Why Only 32 kHz?

Higher sampling rates such as 48 kHz proved extremely finicky to work with and gave almost no audible benefit.
They might slightly reduce latency during real-time inference, but the difference is negligible.
When fine-tuning for a target speaker, higher rates often caused problems such as reverberation noise and instability.

After considering multiple factors, the model outputs at 32 kHz by design.
If you want additional high-frequency range, it is better to extend it afterward with an expander or similar post-processing.
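
To make "additional high-frequency range" concrete, here is a small arithmetic sketch. It relies only on the standard Nyquist relation (a sampled signal can represent frequencies up to half its sampling rate); the function and variable names are illustrative and not part of this model or any RVC tool.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


MODEL_SR = 32_000   # output rate stated in this README
HIGHER_SR = 48_000  # the alternative rate discussed above

model_band = nyquist_hz(MODEL_SR)
higher_band = nyquist_hz(HIGHER_SR)

# The band a 48 kHz model could carry that a 32 kHz model cannot.
missing_band = (model_band, higher_band)

print(f"32 kHz output covers content up to {model_band:.0f} Hz")
print(f"48 kHz would add the {missing_band[0]:.0f}-{missing_band[1]:.0f} Hz band")
```

Any post-processing that extends the high end is therefore synthesizing content above 16 kHz rather than recovering it from the model output.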

V2 (Stable)

147 speakers (all Japanese): 61 hours
SR: 32 kHz
Batch 64
FP32

V5 (Professional use)

Superior quality pretrained model

225 speakers (all Japanese)
SR: 32 kHz
Batch 4
FP32
Remastered datasets
LR D/G : 4e-5/1e-4
Shared layers LR : 2e-6
Encoder/Flow LR : 1.0e-5
Vocoder LR : 1.8e-6
Optimizer/weight decay : AdamW/5e-4
Mel/KL Decay Start Ratio : 0.58
Mel/KL Decay End Ratio : 0.9
Mel Min Scale : 0.45
KL Min Scale : 0.35
Freeze foundation model speakers (109)
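
The Mel/KL decay settings above can be read as a schedule that lowers the loss weights during training. The sketch below is one possible interpretation, not the author's training code: it assumes the ratios are fractions of total training progress, that each weight starts at 1.0, and that the decay between the start and end ratios is linear. None of these three points is stated in this README, and `DecayConfig` and `loss_scale` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecayConfig:
    # Values copied from the V5 list above.
    start_ratio: float = 0.58    # Mel/KL Decay Start Ratio
    end_ratio: float = 0.9       # Mel/KL Decay End Ratio
    mel_min_scale: float = 0.45  # Mel Min Scale
    kl_min_scale: float = 0.35   # KL Min Scale


def loss_scale(progress: float, min_scale: float, cfg: DecayConfig) -> float:
    """Loss weight at a given training progress in [0, 1].

    ASSUMPTION: weight is 1.0 before start_ratio, falls linearly to
    min_scale at end_ratio, and stays at min_scale afterwards.
    """
    if progress <= cfg.start_ratio:
        return 1.0
    if progress >= cfg.end_ratio:
        return min_scale
    t = (progress - cfg.start_ratio) / (cfg.end_ratio - cfg.start_ratio)
    return 1.0 + t * (min_scale - 1.0)


cfg = DecayConfig()
for progress in (0.0, 0.58, 0.74, 0.9, 1.0):
    mel = loss_scale(progress, cfg.mel_min_scale, cfg)
    kl = loss_scale(progress, cfg.kl_min_scale, cfg)
    print(f"progress={progress:.2f}  mel_scale={mel:.3f}  kl_scale={kl:.3f}")
```

Under these assumptions both weights stay at full strength for the first 58% of training and reach their floors at 90%. A different curve (for example cosine) would fit the same four parameters equally well.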

Model tree for yesiampapa/Hakumai

Base model: lj1995/VoiceConversionWebUI
Quantized: IAHispano/Applio
Finetuned (5): this model