Long-form generation #44
base: master
Conversation
One way you could make the generation process consistent between chunks is by forcing part of the previous chunk T onto the next chunk T+1 in the diffusion stage. So let's suppose the chunks are of length 50; at inference you save the last 10 (a random number, just for the purpose of example) for each step in the diffusion process. Then, when doing diffusion on chunk T+1, you overlap it with chunk T such that the overlap is 10, and on each diffusion step you force the corresponding diffusion outputs from chunk T onto the overlapping region. In this way you force the model to produce continuous speech between chunks by providing context from the previous chunk.
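For concreteness, a minimal sketch of that overlap-forcing idea, assuming the noisy latents of chunk T were saved at every diffusion step (`denoise_step`, `latent_dim`, and the state layout are hypothetical placeholders, not the MARS5 API):

```python
import torch

def diffuse_with_overlap(model, chunk_t_states, chunk_len=50, overlap=10, n_steps=100):
    # chunk_t_states[s] is assumed to hold chunk T's noisy latent at
    # diffusion step s, saved while chunk T was being generated.
    x = torch.randn(chunk_len, model.latent_dim)  # chunk T+1 starts from noise
    for s in reversed(range(n_steps)):
        # Force the overlapping region to match the tail of chunk T at this
        # step, so the denoiser must stay continuous with the previous chunk.
        x[:overlap] = chunk_t_states[s][-overlap:]
        x = model.denoise_step(x, step=s)
    # Drop the overlap when concatenating; it duplicates chunk T's tail.
    return x[overlap:]
```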
@Craq That is what is happening when
I tried this method and got more consistent results.
Great PR, one small comment, otherwise everything looks good. Can you please fix it and then we can merge it in.
inference.py (Outdated)

```diff
@@ -107,6 +116,9 @@ def __init__(self, ar_ckpt, nar_ckpt, device: str = None) -> None:
         nuke_weight_norm(self.codec)
         nuke_weight_norm(self.vocos)
 
+        # Download `punkt` for sentence segmentation
+        nltk.download('punkt', quiet=True)
```
Can you add a warning that this is being downloaded? Or not keep it quiet? Having it like this seems a little weird.
I.e. this adds NLTK as a dependency in the code, but it isn't specified anywhere. Ideally make it an optional dependency or add it to the readme/requirements.txt dependencies.
I have removed the quiet downloading and added nltk into the requirements.
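For reference, a sketch of what the non-quiet variant could look like; the `nltk.data.find` guard and the warning message are illustrative assumptions, not necessarily what the PR ended up with:

```python
import logging

import nltk

logger = logging.getLogger(__name__)

try:
    # nltk.data.find raises LookupError when `punkt` is not installed yet.
    nltk.data.find('tokenizers/punkt')
except LookupError:
    logger.warning("Downloading NLTK 'punkt' tokenizer for sentence segmentation...")
    nltk.download('punkt')
```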
I have implemented this simple method to generate long-form content with MARS5. It splits the text into multiple chunks and generates audio for each chunk individually. These are then joined. There are two ways this can work: (1) it can reuse the reference provided by the user (`sliding_window_reuse_reference = True`), or (2) it can use the audio generated for the previous chunk as the reference (`sliding_window_reuse_reference = False`).
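For illustration, a minimal sketch of the two modes; `tts.generate`, its parameters, and the 1-D audio tensors are assumptions rather than the exact MARS5 interface:

```python
import torch

def generate_long_form(tts, chunks, user_ref_audio, reuse_reference=True):
    outputs, ref = [], user_ref_audio
    for chunk in chunks:
        # Generate this chunk against the current reference audio.
        audio = tts.generate(text=chunk, ref_audio=ref)
        outputs.append(audio)
        if not reuse_reference:
            # Mode (2): the freshly generated chunk becomes the next reference.
            ref = audio
    return torch.cat(outputs)
```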
Pros of reusing the same reference:

Cons of reusing the same reference:

- The `reuse` sample puts an accent on some words.

Examples
The chunks were as follows:
Using the previous chunk as reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/f1675439-2865-44b1-834c-a2b82365644e
Reusing the original reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/695ed185-a74a-405f-89ff-d016a768eb22
Sliding window size
I added the size of the sliding window as a `cfg` attribute. It simply counts the number of characters in the input and tries to split the text accordingly. This can be controlled by the user.
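A sketch of what that character-budget splitting could look like on top of NLTK's sentence segmentation; `max_chars` stands in for the actual cfg attribute, whose name isn't shown here:

```python
from nltk.tokenize import sent_tokenize

def split_text(text: str, max_chars: int = 200) -> list[str]:
    # Greedily pack whole sentences into chunks of at most ~max_chars characters.
    chunks, current = [], ""
    for sent in sent_tokenize(text):
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```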
Silences

I have lowered the `trim_db` attribute to be more aggressive. There are, however, still some silences generated in the middle of the speech. On the other hand, when two chunks are joined, they often follow each other abruptly, and it would be nice to include some additional silence there. I think a good sound engineer might be able to fix both of these issues.
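As one illustration of the kind of post-processing that could help with both issues, trim each chunk and re-insert a short fixed gap at the joins; the helper and parameter values are guesses, not the PR's actual settings:

```python
import numpy as np
import librosa

def join_chunks(chunks, sr=24000, trim_db=27, gap_ms=150):
    # Trim leading/trailing silence from each chunk, then separate the
    # chunks with a short gap so they don't follow each other abruptly.
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    trimmed = [librosa.effects.trim(c, top_db=trim_db)[0] for c in chunks]
    joined = []
    for i, c in enumerate(trimmed):
        joined.append(c)
        if i < len(trimmed) - 1:
            joined.append(gap)
    return np.concatenate(joined)
```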