Long-form generation #44

Open · wants to merge 6 commits into master
Conversation

@matus-pikuliak commented Jun 24, 2024

I have implemented this simple method to generate long-form content with MARS5. It splits the text into multiple chunks, generates audio for each chunk individually, and then joins the pieces. There are two ways this can work: (1) it can reuse the reference provided by the user (sliding_window_reuse_reference = True), or (2) it can use the audio generated for the previous chunk as the reference (sliding_window_reuse_reference = False). A minimal sketch of the loop follows the pros/cons below.

Pros of reusing the same reference:

  • It is more robust, i.e., if the generation fails in one chunk, it will not affect the other chunks.
  • It is feasible to use a short reference, so the inference is faster and you can use longer sliding windows (meaning fewer splits).

Cons of reusing the same reference:

  • The speech is less fluent. For example, if the reference is a single sentence, every generated chunk can carry stress at the start of the speech (as the model expects to be generating the next sentence at that point). This is barely noticeable, but in the examples below you can hear that the reuse sample puts stress on some words.
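Roughly, the loop looks like the sketch below. This is an illustration only, not the exact code in this PR: tts.generate, split_into_chunks (sketched under "Sliding window size" further down), and the argument names are all placeholder assumptions.

```python
import torch

def generate_long_form(tts, ref_audio, ref_transcript, text,
                       reuse_reference=True, window_chars=200):
    """Generate audio chunk by chunk and concatenate the results (sketch)."""
    chunks = split_into_chunks(text, window_chars)  # sketched under "Sliding window size"
    outputs = []
    cur_audio, cur_transcript = ref_audio, ref_transcript
    for chunk in chunks:
        audio = tts.generate(cur_audio, cur_transcript, chunk)  # hypothetical API
        outputs.append(audio)
        if not reuse_reference:
            # Mode (2): the freshly generated chunk becomes the reference for the
            # next one, so prosody flows between chunks but errors can propagate.
            cur_audio, cur_transcript = audio, chunk
        # Mode (1): otherwise keep the user's original reference for every chunk.
    return torch.cat(outputs, dim=-1)
```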

Examples

The chunks were as follows:

  • An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures
  • of dispersion such as the expected absolute deviation; for example, the variance of a sum of uncorrelated random variables
  • is equal to the sum of their variances.
  • A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from the
  • random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation
  • is finished.

Using the previous chunk as reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/f1675439-2865-44b1-834c-a2b82365644e

Reusing the original reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/695ed185-a74a-405f-89ff-d016a768eb22

Sliding window size

I added the sliding window size as a cfg attribute. The code simply counts the number of characters in the input and splits the text accordingly, so the user can control how long each chunk is.
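For illustration, a greedy word-boundary splitter along these lines would produce chunks like the example above (which break within sentences). The PR also uses NLTK sentence segmentation, so the actual splitting logic may differ, and max_chars here just stands in for the cfg attribute.

```python
def split_into_chunks(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack words into chunks of roughly at most `max_chars` characters."""
    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk once adding the next word would exceed the budget.
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks
```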

Silences

I have lowered the trim_db attribute to make trimming more aggressive. However, there are still some silences generated in the middle of the speech. On the other hand, when two chunks are joined, they often follow each other abruptly, and it would be nice to insert some additional silence there. A good sound engineer could probably fix both issues.
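As an illustration of both points, joining could trim each chunk and insert a short fixed gap between chunks. The values below (sample rate, top_db, gap length) are assumptions, not the PR's actual settings.

```python
import numpy as np
import librosa

def join_chunks(chunks: list[np.ndarray], sr: int = 24000,
                trim_db: float = 27.0, gap_seconds: float = 0.15) -> np.ndarray:
    """Trim leading/trailing silence from each chunk and join with a short pause."""
    gap = np.zeros(int(gap_seconds * sr), dtype=np.float32)
    trimmed = [librosa.effects.trim(c, top_db=trim_db)[0] for c in chunks]
    pieces = []
    for i, chunk in enumerate(trimmed):
        pieces.append(chunk)
        if i < len(trimmed) - 1:
            pieces.append(gap)  # small pause so chunks do not butt against each other
    return np.concatenate(pieces)
```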

@Craq commented Jun 25, 2024

One way you could make the generation process consistent between chunks is by forcing part of the previous chunk T onto the next chunk T+1 in the diffusion stage. Suppose the chunks are of length 50; at inference you save the last 10 frames (an arbitrary number, just for the example) at each step of the diffusion process. Then, when doing diffusion on chunk T+1, you overlap it with chunk T by those 10 frames, and at each diffusion step you force the corresponding diffusion outputs from chunk T onto the overlapping region. In this way you force the model to produce continuous speech across chunks by providing context from the previous chunk. A rough sketch follows.
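This sketch assumes a generic iterative denoiser; denoise_step, the tensor layout, and the saved per-step states are hypothetical, and the overlap of 10 is just the example number above.

```python
import torch

def diffuse_with_overlap(denoise_step, x, prev_chunk_steps, overlap=10, num_steps=50):
    """Denoise chunk T+1 while forcing its first `overlap` frames to match chunk T."""
    for t in range(num_steps):
        x = denoise_step(x, t)
        # Overwrite the overlapping region with the state chunk T had at the
        # same diffusion step, so the boundary stays continuous across chunks.
        x[..., :overlap] = prev_chunk_steps[t][..., -overlap:]
    return x
```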

@matus-pikuliak (Author)

@Craq That is what happens when sliding_window_reuse_reference = False: the previous chunk is reused as the reference. I use the entire chunk, as we need to know the transcript as well. We could use just the last X frames, but we would have to match the transcript accordingly (not trivial).

@superkido511 commented

I tried this method and got more consistent results.

@RF5 (Collaborator) left a comment

Great PR, one small comment; otherwise everything looks good. Can you please fix it and then we can merge it in.

inference.py Outdated
@@ -107,6 +116,9 @@ def __init__(self, ar_ckpt, nar_ckpt, device: str = None) -> None:
nuke_weight_norm(self.codec)
nuke_weight_norm(self.vocos)

# Download `punkt` for sentence segmentation
nltk.download('punkt', quiet=True)
@RF5 (Collaborator) commented:
Can you add a warning that this is being downloaded? Or not keep it quiet? Having it like this seems a little weird.

I.e., this adds NLTK as a dependency in the code, but it isn't specified anywhere. Ideally, make it an optional dependency or add it to the readme / requirements.txt dependencies.
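For example (just a sketch of one option, not necessarily how it should be done in this repo): import nltk lazily, check for punkt first, and warn only when it actually has to be downloaded.

```python
import logging

def ensure_punkt():
    """Make sure the NLTK punkt tokenizer is available, warning if it must be fetched."""
    try:
        import nltk
    except ImportError as e:
        raise ImportError("Long-form generation needs nltk; "
                          "install it with `pip install nltk`.") from e
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        logging.getLogger(__name__).warning("Downloading NLTK 'punkt' tokenizer...")
        nltk.download("punkt")
```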

@matus-pikuliak (Author) replied:

I have removed the quiet downloading and added nltk to the requirements.
