Long-form generation #44

Open · wants to merge 6 commits into master
Conversation

@matus-pikuliak commented Jun 24, 2024

I have implemented this simple method to generate long-form content with MARS5. It splits the text into multiple chunks, generates audio for each chunk individually, and then joins the pieces. There are two ways this can work: (1) it can reuse the reference provided by the user (sliding_window_reuse_reference = True), or (2) it can use the audio generated for the previous chunk as the reference (sliding_window_reuse_reference = False). A minimal sketch of the loop follows the pros/cons below.

Pros of reusing the same reference:

  • It is more robust, i.e., if the generation fails in one chunk, it will not affect the other chunks.
  • It is feasible to use a short reference, so the inference is faster and you can use longer sliding windows (meaning fewer splits).

Cons of reusing the same reference:

  • The speech is less fluent. For example, if the reference is a single sentence, every generated chunk can carry stress at the start of the speech (as the model expects to be generating the next sentence at that point). This is barely noticeable, but in the examples below you can hear that the reuse sample puts stress on some words.
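Roughly, the loop looks like the sketch below. This is an illustration only, not the exact code in this PR: tts.generate, split_into_chunks (sketched under "Sliding window size" further down), and the argument names are all placeholder assumptions.

```python
import torch

def generate_long_form(tts, ref_audio, ref_transcript, text,
                       reuse_reference=True, window_chars=200):
    """Generate audio chunk by chunk and concatenate the results (sketch)."""
    chunks = split_into_chunks(text, window_chars)  # sketched under "Sliding window size"
    outputs = []
    cur_audio, cur_transcript = ref_audio, ref_transcript
    for chunk in chunks:
        audio = tts.generate(cur_audio, cur_transcript, chunk)  # hypothetical API
        outputs.append(audio)
        if not reuse_reference:
            # Mode (2): the freshly generated chunk becomes the reference for the
            # next one, so prosody flows between chunks but errors can propagate.
            cur_audio, cur_transcript = audio, chunk
        # Mode (1): otherwise keep the user's original reference for every chunk.
    return torch.cat(outputs, dim=-1)
```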

Examples

The chunks were as follows:

  • An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures
  • of dispersion such as the expected absolute deviation; for example, the variance of a sum of uncorrelated random variables
  • is equal to the sum of their variances.
  • A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from the
  • random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation
  • is finished.

Using the previous chunk as reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/f1675439-2865-44b1-834c-a2b82365644e

Reusing the original reference
https://github.com/Camb-ai/MARS5-TTS/assets/9572985/695ed185-a74a-405f-89ff-d016a768eb22

Sliding window size

I added the sliding window size as a cfg attribute. The code simply counts the number of characters in the input and splits the text accordingly, so the user can control how long each chunk is.
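For illustration, a greedy word-boundary splitter along these lines would produce chunks like the example above (which break within sentences). The PR also uses NLTK sentence segmentation, so the actual splitting logic may differ, and max_chars here just stands in for the cfg attribute.

```python
def split_into_chunks(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack words into chunks of roughly at most `max_chars` characters."""
    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk once adding the next word would exceed the budget.
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks
```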

Silences

I have lowered the trim_db attribute to make trimming more aggressive. However, there are still some silences generated in the middle of the speech. On the other hand, when two chunks are joined, they often follow each other abruptly, and it would be nice to insert some additional silence there. A good sound engineer could probably fix both issues.
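As an illustration of both points, joining could trim each chunk and insert a short fixed gap between chunks. The values below (sample rate, top_db, gap length) are assumptions, not the PR's actual settings.

```python
import numpy as np
import librosa

def join_chunks(chunks: list[np.ndarray], sr: int = 24000,
                trim_db: float = 27.0, gap_seconds: float = 0.15) -> np.ndarray:
    """Trim leading/trailing silence from each chunk and join with a short pause."""
    gap = np.zeros(int(gap_seconds * sr), dtype=np.float32)
    trimmed = [librosa.effects.trim(c, top_db=trim_db)[0] for c in chunks]
    pieces = []
    for i, chunk in enumerate(trimmed):
        pieces.append(chunk)
        if i < len(trimmed) - 1:
            pieces.append(gap)  # small pause so chunks do not butt against each other
    return np.concatenate(pieces)
```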

@Craq commented Jun 25, 2024

One way you could make the generation process consistent between chunks is by forcing part of the previous chunk T onto the next chunk T+1 in the diffusion stage. Suppose the chunks are of length 50; at inference you save the last 10 frames (an arbitrary number, just for the example) at each step of the diffusion process. Then, when doing diffusion on chunk T+1, you overlap it with chunk T by those 10 frames, and at each diffusion step you force the corresponding diffusion outputs from chunk T onto the overlapping region. In this way you force the model to produce continuous speech across chunks by providing context from the previous chunk. A rough sketch follows.
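This sketch assumes a generic iterative denoiser; denoise_step, the tensor layout, and the saved per-step states are hypothetical, and the overlap of 10 is just the example number above.

```python
import torch

def diffuse_with_overlap(denoise_step, x, prev_chunk_steps, overlap=10, num_steps=50):
    """Denoise chunk T+1 while forcing its first `overlap` frames to match chunk T."""
    for t in range(num_steps):
        x = denoise_step(x, t)
        # Overwrite the overlapping region with the state chunk T had at the
        # same diffusion step, so the boundary stays continuous across chunks.
        x[..., :overlap] = prev_chunk_steps[t][..., -overlap:]
    return x
```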

@matus-pikuliak (Author)

@Craq That is what happens when sliding_window_reuse_reference = False: the previous chunk is reused as the reference. I use the entire chunk, as we need to know the transcript as well. We could use just the last X frames, but we would have to match the transcript accordingly (not trivial).

@superkido511 commented

I tried this method and got more consistent results.

@RF5 (Collaborator) left a comment

Great PR, one small comment; otherwise everything looks good. Can you please fix it and then we can merge it in.

inference.py Outdated
@@ -107,6 +116,9 @@ def __init__(self, ar_ckpt, nar_ckpt, device: str = None) -> None:
nuke_weight_norm(self.codec)
nuke_weight_norm(self.vocos)

# Download `punkt` for sentence segmentation
nltk.download('punkt', quiet=True)
@RF5 (Collaborator) commented:
Can you add a warning that this is being downloaded? Or not keep it quiet? Having it like this seems a little weird.

I.e., this adds NLTK as a dependency in the code, but it isn't specified anywhere. Ideally, make it an optional dependency or add it to the readme / requirements.txt dependencies.
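For example (just a sketch of one option, not necessarily how it should be done in this repo): import nltk lazily, check for punkt first, and warn only when it actually has to be downloaded.

```python
import logging

def ensure_punkt():
    """Make sure the NLTK punkt tokenizer is available, warning if it must be fetched."""
    try:
        import nltk
    except ImportError as e:
        raise ImportError("Long-form generation needs nltk; "
                          "install it with `pip install nltk`.") from e
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        logging.getLogger(__name__).warning("Downloading NLTK 'punkt' tokenizer...")
        nltk.download("punkt")
```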

@matus-pikuliak (Author) replied:

I have removed the quiet downloading and added nltk to the requirements.
