Here's a detailed reference for all the options accepted by the Echogarden CLI and API.
Applies to CLI operations: `speak`, `speak-file`, `speak-url`, `speak-wikipedia`, API method: `synthesize`
General:
- `engine`: identifier of the synthesis engine to use, such as `espeak`, `vits` or `google-translate` (see the full engine list). Auto-selected if not set
- `language`: language code (ISO 639-1), like `en`, `fr`, `en-US`, `pt-BR`. Auto-detected if not set
- `voice`: name of the voice to use. Can be a search string. Auto-selected if not set
- `voiceGender`: gender of the voice to use. Optional
- `speed`: speech rate factor, relative to default. In the range `0.1`..`10.0`. Defaults to `1.0`
- `pitch`: pitch factor, relative to default. In the range `0.1`..`10.0`. Defaults to `1.0`
- `pitchVariation`: pitch variation factor. In the range `0.1`..`10.0`. Defaults to `1.0`
- `splitToSentences`: split text to sentences before synthesis. Defaults to `true`
- `ssml`: the input is SSML. Defaults to `false`
- `sentenceEndPause`: pause duration (seconds) at end of sentence. Defaults to `0.75`
- `segmentEndPause`: pause duration (seconds) at end of segment. Defaults to `1.0`
- `customLexiconPaths`: a list of custom lexicon file paths. Optional
- `alignment`: prefix to provide options for alignment. Options detailed in the section for alignment
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles
- `languageDetection`: prefix to provide options for text language detection. Options detailed in the section for text language detection
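In the API, dotted CLI option prefixes become nested objects. Here's a minimal sketch of a `synthesize` call (the input text and option values are illustrative; the exact result shape is described in the API reference):

```ts
import * as Echogarden from 'echogarden'

// Synthesize a short text with eSpeak, at a slightly slower rate.
const result = await Echogarden.synthesize('Hello world! This is a test.', {
  engine: 'espeak',
  language: 'en',
  speed: 0.9,
})

// With no `outputAudioFormat.codec` set, `result.audio` is a raw audio structure.
console.log(result.audio)
```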
Plain text processing:
- `plainText.paragraphBreaks`: split to paragraphs based on single (`single`) or double (`double`) line breaks. Defaults to `double`
- `plainText.whitespace`: determines how to process whitespace within paragraphs. Can be `preserve` (leave as is), `removeLineBreaks` (convert line breaks to spaces) or `collapse` (convert runs of whitespace characters, including line breaks, to a single space character). Defaults to `collapse`
Post-processing:
- `postProcessing.normalizeAudio`: should normalize output audio. Defaults to `true`
- `postProcessing.targetPeak`: target peak (decibels) for normalization. Defaults to `-3`
- `postProcessing.maxGainIncrease`: max gain increase (decibels) when performing normalization. Defaults to `30`
- `postProcessing.speed`: target speed for time stretching. Defaults to `1.0`
- `postProcessing.pitch`: target pitch for pitch shifting. Defaults to `1.0`
- `postProcessing.timePitchShiftingMethod`: method for time and pitch shifting. Can be `sonic` or `rubberband`. Defaults to `sonic`
- `postProcessing.rubberband`: prefix for RubberBand options (TODO: document options)
Output audio format:
- `outputAudioFormat.codec`: codec identifier (note: API only; the CLI uses file extensions instead). Can be `wav`, `mp3`, `opus`, `m4a`, `ogg`, `flac`. Leaving as `undefined` would return a raw audio structure (see more information in the API documentation). Optional
- `outputAudioFormat.bitrate`: custom bitrate for encoding. Applies only to `mp3`, `opus`, `m4a`, `ogg`. By default, bitrates are selected between 48Kbps and 64Kbps, to provide good speech quality while minimizing file size. Optional
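For example, to get an MP3 file from the API, something like the following sketch should work (the assumption that `result.audio` holds the encoded bytes when a codec is set follows from the description above; check the API documentation for the exact type):

```ts
import { writeFile } from 'node:fs/promises'
import * as Echogarden from 'echogarden'

// Request MP3-encoded output instead of a raw audio structure.
const result = await Echogarden.synthesize('Testing encoded output.', {
  engine: 'espeak',
  outputAudioFormat: { codec: 'mp3', bitrate: 64 },
})

// Assumption: with a codec set, `result.audio` contains the encoded file data.
await writeFile('output.mp3', result.audio as Uint8Array)
```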
VITS:
- `vits.speakerId`: speaker ID, for VITS models that support multiple speakers. Defaults to `0`
- `vits.provider`: ONNX execution provider to use. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Using GPU acceleration for VITS may or may not be faster than CPU, depending on your hardware. Defaults to `cpu`
eSpeak:
- `espeak.rate`: speech rate, in eSpeak units. Overrides `speed` when set
- `espeak.pitch`: pitch, in eSpeak units. Overrides `pitch` when set
- `espeak.pitchRange`: pitch range, in eSpeak units. Overrides `pitchVariation` when set
- `espeak.useKlatt`: use the Klatt synthesis method. Defaults to `false`
SAM:
- `sam.pitch`: pitch value, between `0`..`255`. Overrides `pitch` when set
- `sam.speed`: speed value, between `0`..`255`. Overrides `speed` when set
- `sam.mouth`: mouth value, between `0`..`255` (defaults to `128`)
- `sam.throat`: throat value, between `0`..`255` (defaults to `128`)
SAPI:
- `sapi.rate`: SAPI speech rate, in its native units. An integer between `-10` and `10`. Setting `speed` would apply time stretching instead. The two options can be used together
Microsoft Speech Platform:
- `msspeech.rate`: same units and effects as the SAPI speech rate
Coqui Server:
- `coquiServer.serverUrl`: server URL
- `coquiServer.speakerId`: speaker ID (if applicable)
Google Cloud:
- `googleCloud.apiKey`: API key (required)
- `googleCloud.pitchDeltaSemitones`: pitch delta in semitones. Overrides `pitch` when set
- `googleCloud.customVoice.model`: name of custom voice
- `googleCloud.customVoice.reportedUsage`: reported usage of custom voice
Azure Cognitive Services:
- `microsoftAzure.subscriptionKey`: subscription key (required)
- `microsoftAzure.serviceRegion`: service region (required)
- `microsoftAzure.pitchDeltaHz`: pitch delta in Hz. Overrides `pitch` when set
Amazon Polly:
- `amazonPolly.region`: region (required)
- `amazonPolly.accessKeyId`: access key ID (required)
- `amazonPolly.secretAccessKey`: secret access key (required)
- `amazonPolly.pollyEngine`: Amazon Polly engine kind. Can be `standard` or `neural`. Defaults to `neural`
- `amazonPolly.lexiconNames`: an array of lexicon names. Optional
OpenAI Cloud:
- `openAICloud.apiKey`: API key (required)
- `openAICloud.organization`: organization identifier. Optional
- `openAICloud.baseURL`: override the default base URL for the API. Optional
- `openAICloud.model`: model to use. Can be either `tts-1` or `tts-1-hd`. Defaults to `tts-1`
- `openAICloud.timeout`: request timeout. Optional
- `openAICloud.maxRetries`: maximum retries on failure. Defaults to `10`
ElevenLabs:
- `elevenLabs.apiKey`: API key (required)
- `elevenLabs.stability`: stability. Defaults to `0.5`
- `elevenLabs.similarityBoost`: similarity boost. Defaults to `0.5`
Google Translate:
- `googleTranslate.tld`: top-level domain to connect to. Can change the dialect for a small number of voices. For example, `us` gives American English for `en`, while `com` gives British English for `en`. Defaults to `us`
Microsoft Edge:
- `microsoftEdge.trustedClientToken`: trusted client token (required). A special token required to use the service
- `microsoftEdge.pitchDeltaHz`: pitch delta in Hz. Overrides `pitch` when set
Applies to CLI operation: `list-voices`, API method: `requestVoiceList`

General:
- `language`: language code to filter by (optional)
- `voice`: name or name pattern to filter by (optional)
- `voiceGender`: gender to filter by (optional)

Also accepted are the following engine-specific options, which may be required in order to retrieve the voice list:
- `googleCloud.apiKey`
- `microsoftAzure.subscriptionKey`, `microsoftAzure.serviceRegion`
- `amazonPolly.region`, `amazonPolly.accessKeyId`, `amazonPolly.secretAccessKey`
- `elevenLabs.apiKey`
- `microsoftEdge.trustedClientToken`
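A quick sketch of requesting a filtered voice list via the API (the engine identifier and filter values are illustrative; see the API reference for the result's exact shape):

```ts
import * as Echogarden from 'echogarden'

// List English voices for the espeak engine.
const result = await Echogarden.requestVoiceList({
  engine: 'espeak',
  language: 'en',
})

console.log(result)
```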
Applies to CLI operation: `transcribe`, API method: `recognize`

General:
- `engine`: identifier of the recognition engine to use, such as `whisper` or `vosk` (see the full engine list)
- `language`: language code (ISO 639-1) for the audio, like `en`, `fr`, `de`. Auto-detected if not set
- `crop`: crop to active parts using voice activity detection before starting recognition. Defaults to `true`
- `isolate`: apply source separation to isolate voice before starting recognition. Defaults to `false`
- `alignment`: prefix to provide options for alignment. Options detailed in the section for alignment
- `languageDetection`: prefix to provide options for language detection. Options detailed in the section for speech language detection
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
- `sourceSeparation`: prefix to provide options for source separation when `isolate` is set to `true`. Options detailed in the section for source separation
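For example, via the API (the file name and option values are placeholders; a file path is assumed to be accepted as the audio input, per the API reference):

```ts
import * as Echogarden from 'echogarden'

// Transcribe a local audio file with the Whisper engine.
const result = await Echogarden.recognize('speech.wav', {
  engine: 'whisper',
  language: 'en',
  whisper: { model: 'base.en', temperature: 0.1 },
})

console.log(result.transcript)
```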
Whisper:
- `whisper.model`: selects which Whisper model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en` or `large-v3-turbo`. Defaults to `tiny` or `tiny.en`
- `whisper.temperature`: temperature setting for the text decoder. Impacts the amount of randomization for token selection. It is recommended to leave at `0.1` (close to no randomization - almost always chooses the top-ranked token) or choose a relatively low value (`0.25` or lower) for best results. Defaults to `0.1`
- `whisper.prompt`: initial text to give the Whisper model. Can be a vocabulary, or example text of some sort. Note that if the prompt is very similar to the transcript, the model may intentionally avoid producing the transcript tokens, as it may assume that they have already been transcribed. Optional
- `whisper.topCandidateCount`: the number of top candidate tokens to consider. Defaults to `5`
- `whisper.punctuationThreshold`: the minimal probability for a punctuation token, included in the top candidates, to be chosen unconditionally. A lower threshold encourages the model to output more punctuation characters. Defaults to `0.2`
- `whisper.autoPromptParts`: use the previous part's recognized text as the prompt for the next part. Disabling this may help prevent repetition carrying over between parts, in some cases. Defaults to `true` (note: currently always disabled for the `large-v3-turbo` model due to an apparent issue with corrupt output when prompted)
- `whisper.maxTokensPerPart`: maximum number of tokens to decode for each audio part. Defaults to `250`
- `whisper.suppressRepetition`: attempt to suppress decoding of repeating token patterns. Defaults to `true`
- `whisper.repetitionThreshold`: minimal repetition / compressibility score to cause a part not to be auto-prompted to the next part. Defaults to `2.4`
- `whisper.decodeTimestampTokens`: enable/disable decoding of timestamp tokens. Setting to `false` can reduce the occurrence of hallucinations and token repetition loops, possibly due to the overall reduction in the number of tokens decoded. This has no impact on the accuracy of timestamps, since they are derived independently using cross-attention weights. However, there are cases where this can cause the model to end a part prematurely, especially in singing and less speech-like voice segments, or when there are multiple speakers. Defaults to `true`
- `whisper.timestampAccuracy`: timestamp accuracy. Can be `medium` or `high`. `medium` uses a reduced subset of attention heads for alignment; `high` uses all attention heads and is thus more accurate at the word level, but slower for larger models. Defaults to `high` for the `tiny` and `base` models, and `medium` for the larger models
- `whisper.encoderProvider`: identifier for the ONNX execution provider to use with the encoder model. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). In general, GPU-based encoding should be significantly faster. Defaults to `cpu`, or `dml` if available
- `whisper.decoderProvider`: identifier for the ONNX execution provider to use with the decoder model. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Using GPU acceleration for the decoder may be faster than CPU, especially for larger models, but that depends on your particular combination of CPU and GPU. Defaults to `cpu`, and on Windows, `dml` if available for larger models (`small`, `medium`, `large`)
- `whisper.seed`: provide a custom random seed for token selection when temperature is greater than `0`. Uses a constant seed by default to ensure reproducibility
Whisper.cpp:
- `whisperCpp.model`: selects which `whisper.cpp` model to use. Can be `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large` (same as `large-v2`), `large-v1`, `large-v2`, `large-v3`, `large-v3-turbo`. The following quantized models are also supported: `tiny-q5_1`, `tiny.en-q5_1`, `tiny.en-q8_0`, `base-q5_1`, `base.en-q5_1`, `small-q5_1`, `small.en-q5_1`, `medium-q5_0`, `medium.en-q5_0`, `large-v2-q5_0`, `large-v3-q5_0`, `large-v3-turbo-q5_0`. Defaults to `base` or `base.en`
- `whisperCpp.executablePath`: a path to a custom `whisper.cpp` `main` executable (currently required for macOS)
- `whisperCpp.build`: type of `whisper.cpp` build to use. Can be set to `cpu`, `cublas-12.4.0` or `custom`. By default, builds are auto-selected and downloaded for Windows x64 (`cpu`, `cublas-12.4.0`) and Linux x64 (`cpu`). Using other builds requires providing a custom `executablePath`, which will automatically set this option to `custom`
- `whisperCpp.threadCount`: number of threads to use. Defaults to `4`
- `whisperCpp.splitCount`: number of splits of the audio data to process in parallel (called `--processors` in the `whisper.cpp` CLI). A value greater than `1` can increase memory use significantly, reduce timing accuracy, and slow down execution in some cases. Defaults to `1` (highly recommended)
- `whisperCpp.enableGPU`: enable GPU processing. Setting to `true` will try to use a CUDA build, if available for your system. Defaults to `true` when a CUDA-enabled build is selected via `whisperCpp.build`, otherwise `false`. If a custom build is used, it will enable or disable GPU for that build
- `whisperCpp.topCandidateCount`: the number of top candidate tokens to consider. Defaults to `5`
- `whisperCpp.beamCount`: the number of decoding paths to use during beam search. Defaults to `5`
- `whisperCpp.temperature`: set temperature. Defaults to `0.0`
- `whisperCpp.temperatureIncrement`: set temperature increment. Defaults to `0.2`
- `whisperCpp.repetitionThreshold`: minimal repetition / compressibility score to cause a decoded segment to be discarded. Defaults to `2.4`
- `whisperCpp.prompt`: initial text to give the Whisper model. Can be a vocabulary, or example text of some sort. Note that if the prompt is very similar to the transcript, the model may intentionally avoid producing the transcript tokens, as it may assume that they have already been transcribed. Optional
- `whisperCpp.enableDTW`: enable `whisper.cpp`'s own experimental DTW-based token alignment to be used to derive timestamps. Defaults to `false`
- `whisperCpp.enableFlashAttention`: enable flash attention. Can significantly increase performance for some configurations (note: setting this to `true` will cause `enableDTW` to always be set to `false`, since it's not compatible with flash attention). Defaults to `false`
- `whisperCpp.verbose`: show all CLI messages during execution. Defaults to `false`
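As a sketch, selecting a quantized `whisper.cpp` model via the API might look like this (the `whisper.cpp` engine identifier follows the engine list; values are illustrative):

```ts
import * as Echogarden from 'echogarden'

// Recognize using a quantized whisper.cpp model, on 8 threads.
const result = await Echogarden.recognize('speech.wav', {
  engine: 'whisper.cpp',
  whisperCpp: {
    model: 'small.en-q5_1',
    threadCount: 8,
  },
})

console.log(result.transcript)
```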
Vosk:
- `vosk.modelPath`: path to the Vosk model to be used
Silero:
- `silero.modelPath`: path to a Silero model. Note that the latest `en`, `de`, `fr` and `uk` models are automatically installed when needed, based on the selected language. This should only be used to manually specify a different model; otherwise, specify `language` instead
- `silero.provider`: ONNX execution provider to use. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Defaults to `cpu`, or `dml` if available
Google Cloud:
- `googleCloud.apiKey`: Google Cloud API key (required)
- `googleCloud.alternativeLanguageCodes`: an array of alternative language codes. Optional
- `googleCloud.profanityFilter`: censor profanity. Defaults to `false`
- `googleCloud.autoPunctuation`: add punctuation automatically. Defaults to `true`
- `googleCloud.useEnhancedModel`: use enhanced model. Defaults to `true`
Azure Cognitive Services:
- `microsoftAzure.subscriptionKey`: subscription key (required)
- `microsoftAzure.serviceRegion`: service region (required)
Amazon Transcribe:
- `amazonTranscribe.region`: region (required)
- `amazonTranscribe.accessKeyId`: access key ID (required)
- `amazonTranscribe.secretAccessKey`: secret access key (required)
OpenAI Cloud:
- `openAICloud.apiKey`: API key (required)
- `openAICloud.model`: model to use. When using the default provider (OpenAI), can only be `whisper-1`. For a custom provider, like Groq, see its documentation
- `openAICloud.organization`: organization identifier. Optional
- `openAICloud.baseURL`: override the default endpoint used by the API. For example, set to `https://api.groq.com/openai/v1` to use Groq's OpenAI-compatible API instead of the default one. Optional
- `openAICloud.temperature`: temperature. Choosing `0` uses a dynamic temperature approach. Defaults to `0`
- `openAICloud.prompt`: initial prompt for the model. Optional
- `openAICloud.timeout`: request timeout. Optional
- `openAICloud.maxRetries`: maximum retries on failure. Defaults to `10`
- `openAICloud.requestWordTimestamps`: request word timestamps from the server. Defaults to `true` for the default OpenAI endpoint, and `false` if a custom one is set using `baseURL`
Applies to CLI operation: `align`, API method: `align`

General:
- `engine`: alignment algorithm to use. Can be `dtw`, `dtw-ra` or `whisper`. Defaults to `dtw`
- `language`: language code for the audio and transcript (ISO 639-1), like `en`, `fr`, `en-US`, `pt-BR`. Auto-detected from the transcript if not set
- `crop`: crop to active parts using voice activity detection before starting. Defaults to `true`
- `isolate`: apply source separation to isolate voice before starting alignment. Defaults to `false`
- `customLexiconPaths`: an array of custom lexicon file paths. Optional
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
- `sourceSeparation`: prefix to provide options for source separation when `isolate` is set to `true`. Options detailed in the section for source separation
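A minimal API sketch of aligning audio with a known transcript (argument order assumed to be audio, then transcript, then options, per the API reference; names and values are illustrative):

```ts
import * as Echogarden from 'echogarden'

// Align an audio file with its known transcript using DTW.
const result = await Echogarden.align('speech.wav', 'Hello world! This is a test.', {
  engine: 'dtw',
  language: 'en',
})

// The resulting timeline maps transcript units to time ranges.
console.log(result.timeline)
```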
Plain text processing:
- `plainText.paragraphBreaks`: split transcript to paragraphs based on single (`single`) or double (`double`) line breaks. Defaults to `double`
- `plainText.whitespace`: determines how to process whitespace within transcript paragraphs. Can be `preserve` (leave as is), `removeLineBreaks` (convert line breaks to spaces) or `collapse` (convert runs of whitespace characters, including line breaks, to a single space character). Defaults to `collapse`
DTW:
- `dtw.granularity`: adjusts the MFCC frame width and hop size based on the profile selected. Can be set to either `xx-low` (400ms width, 160ms hop), `x-low` (200ms width, 80ms hop), `low` (100ms width, 40ms hop), `medium` (50ms width, 20ms hop), `high` (25ms width, 10ms hop), or `x-high` (20ms width, 5ms hop). For multi-pass processing, multiple granularities can be provided, like `dtw.granularity=['xx-low','medium']`. Auto-selected by default
- `dtw.windowDuration`: sets the maximum duration of the Sakoe-Chiba window when performing DTW alignment. The value can be specified in seconds, like `240`, or as an integer percentage (formatted like `15%`), relative to the total duration of the source audio. The estimated memory requirement is shown in the log before alignment starts. Recommended to be set to at least 10% - 20% of the total audio duration. For multi-pass processing, multiple durations can be provided (which can mix absolute and relative values), like `dtw.windowDuration=['15%',20]`. Auto-selected by default
DTW-RA:
- `recognition`: prefix to provide recognition options when using the `dtw-ra` method, for example: setting `recognition.engine = whisper` and `recognition.whisper.model = base.en`
- `dtw.phoneAlignmentMethod`: algorithm to use when aligning phones. Can either be set to `dtw` or `interpolation`. Defaults to `dtw`
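In the API, the `recognition` prefix becomes a nested options object, roughly as in this sketch (values illustrative):

```ts
import * as Echogarden from 'echogarden'

// Recognition-assisted alignment: dtw-ra runs a recognizer internally,
// configured through the nested `recognition` options.
const result = await Echogarden.align('speech.wav', 'The full transcript text', {
  engine: 'dtw-ra',
  recognition: {
    engine: 'whisper',
    whisper: { model: 'base.en' },
  },
})

console.log(result.timeline)
```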
Whisper:
Applies to the `whisper` engine only. To provide Whisper options for `dtw-ra`, use `recognition.whisper` instead.
- `whisper.model`: Whisper model to use. Defaults to `tiny` or `tiny.en`
- `whisper.endTokenThreshold`: minimal probability to accept an end token for a recognized part. The probability is measured via the softmax between the end token's logit and the second-highest logit. You can try to adjust this threshold in cases where the model is ending a part with too few, or too many, tokens decoded. Defaults to `0.9`. On the last audio part, it is always effectively set to `Infinity`, to ensure the remaining transcript tokens are decoded in full
- `whisper.maxTokensPerPart`: maximum number of tokens to decode per part. Should help avoid edge cases where the model never reaches an end token for the part, which otherwise may cause it to decode too many tokens and eventually crash. Defaults to `250`
- `whisper.timestampAccuracy`: timestamp accuracy. Can be `medium` or `high`. `medium` uses a reduced subset of attention heads for alignment; `high` uses all attention heads and is thus more accurate at the word level, but slower for larger models. Defaults to `high` for the `tiny` and `base` models, and `medium` for the larger models
- `whisper.encoderProvider`: encoder ONNX execution provider. See details in the recognition section above
- `whisper.decoderProvider`: decoder ONNX execution provider. See details in the recognition section above
Applies to CLI operation: `translate-speech`, API method: `translateSpeech`

General:
- `engine`: only `whisper` supported
- `sourceLanguage`: the source language code for the input speech. Auto-detected if not set
- `targetLanguage`: the target language code for the output speech. Only `en` (English) is supported by the `whisper` engine. Optional
- `crop`: crop to active parts using voice activity detection before starting. Defaults to `true`
- `isolate`: apply source separation to isolate voice before starting speech translation. Defaults to `false`
- `languageDetection`: prefix to provide options for language detection. Options detailed in the section for speech language detection
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
- `sourceSeparation`: prefix to provide source separation options when `isolate` is set to `true`
Whisper:
- `whisper`: prefix to provide options for the Whisper model. Same options as detailed in the recognition section above

Whisper.cpp:
- `whisperCpp`: prefix to provide options for the Whisper.cpp model. Same options as detailed in the recognition section above

OpenAI Cloud:
- `openAICloud`: prefix to provide options for OpenAI Cloud. Same options as detailed in the recognition section above
Applies to CLI operation: `translate-text`, API method: `translateText`

General:
- `engine`: only `google-translate` supported
- `sourceLanguage`: the source language code for the input text. Auto-detected if not set
- `targetLanguage`: the target language code for the output text. Required
- `languageDetection`: language detection options. Optional
- `plainText`: plain text processing options. Optional

Google Translate:
- `googleTranslate.tld`: top-level domain to request from. Defaults to `com`
- `googleTranslate.maxCharactersPerPart`: maximum number of characters in each part requested from the server. Defaults to `2000`
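A short API sketch (assuming the method accepts the input text followed by an options object; the result's exact fields are described in the API reference):

```ts
import * as Echogarden from 'echogarden'

// Translate text to English; the source language is auto-detected.
const result = await Echogarden.translateText('Bonjour tout le monde !', {
  engine: 'google-translate',
  targetLanguage: 'en',
})

console.log(result)
```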
Applies to CLI operation: `align-translation`, API method: `alignTranslation`

General:
- `engine`: alignment algorithm to use. Can only be `whisper`. Defaults to `whisper`
- `sourceLanguage`: language code for the source audio (ISO 639-1), like `en`, `fr`, `zh`, etc. Auto-detected from the audio if not set
- `targetLanguage`: language code for the translated transcript. Can only be `en` for now. Defaults to `en`
- `crop`: crop to active parts using voice activity detection before starting. Defaults to `true`
- `isolate`: apply source separation to isolate voice before starting alignment. Defaults to `false`
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
- `sourceSeparation`: prefix to provide options for source separation when `isolate` is set to `true`. Options detailed in the section for source separation
Whisper:
- `whisper.model`: Whisper model to use. Only multilingual models can be used. Defaults to `tiny`
- `whisper.endTokenThreshold`: see details in the alignment section above
- `whisper.encoderProvider`: encoder ONNX execution provider. See details in the recognition section above
- `whisper.decoderProvider`: decoder ONNX execution provider. See details in the recognition section above
Applies to CLI operation: `align-transcript-and-translation`, API method: `alignTranscriptAndTranslation`

General:
- `engine`: can only be `two-stage`. Defaults to `two-stage`
- `sourceLanguage`: language code for the source audio (ISO 639-1), like `en`, `fr`, `zh`, etc. Auto-detected from the audio if not set
- `targetLanguage`: language code for the translated transcript. Can only be `en` for now. Defaults to `en`
- `crop`: crop to active parts using voice activity detection before starting. Defaults to `true`
- `isolate`: apply source separation to isolate voice before starting alignment. Defaults to `false`
- `alignment`: prefix to provide options for alignment. Options detailed in the section for alignment
- `timelineAlignment`: prefix to provide options for timeline alignment. Options detailed in the section for timeline alignment
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
- `sourceSeparation`: prefix to provide options for source separation when `isolate` is set to `true`. Options detailed in the section for source separation
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles
Applies to CLI operation: `align-timeline-translation`, API method: `alignTimelineTranslation`

General:
- `engine`: alignment engine to use. Can only be `e5`. Defaults to `e5`
- `sourceLanguage`: language code for the source timeline. Auto-detected from the timeline if not set
- `targetLanguage`: language code for the translated transcript. Auto-detected if not set
- `audio`: spoken audio to play when previewing the result in the CLI (not required or used by the alignment itself). Optional
- `languageDetection`: prefix to provide options for language detection. Options detailed in the section for text language detection
- `subtitles`: prefix to provide options for subtitles. Options detailed in the section for subtitles

E5:
- `e5.model`: E5 model to use. Defaults to `e5-small-fp16` (support for additional models will be added in the future)
Applies to CLI operation: `detect-speech-language`, API method: `detectSpeechLanguage`

General:
- `engine`: `whisper` or `silero`. Defaults to `whisper`
- `defaultLanguage`: language to fall back to when the confidence for the top candidate is low. Defaults to `en`
- `fallbackThresholdProbability`: confidence threshold to cause fallback. Defaults to `0.05`
- `crop`: crop to active parts using voice activity detection before starting. Defaults to `true` (recommended, otherwise inactive sections may skew the probabilities towards various random languages)
- `vad`: prefix to provide options for voice activity detection when `crop` is set to `true`. Options detailed in the section for voice activity detection
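For example, via the API (the file name is a placeholder; the result's exact fields are in the API reference):

```ts
import * as Echogarden from 'echogarden'

// Detect the spoken language of an audio file.
const result = await Echogarden.detectSpeechLanguage('speech.wav', {
  engine: 'whisper',
  defaultLanguage: 'en',
})

console.log(result)
```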
Whisper:
- `whisper.model`: Whisper model to use. See the model list in the recognition section
- `whisper.temperature`: impacts the distribution of candidate languages when applying the softmax function to compute language probabilities over the model output. A higher temperature causes the distribution to be more uniform, while a lower temperature causes it to be more strongly weighted towards the best-scoring candidates. Defaults to `1.0`
- `whisper.encoderProvider`: encoder ONNX execution provider. See details in the recognition section above
- `whisper.decoderProvider`: decoder ONNX execution provider. See details in the recognition section above

Silero:
- `silero.provider`: ONNX execution provider to use. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Using GPU may be faster, but the initialization overhead is larger. Note: the `dml` provider seems to be unstable at the moment for this model. Defaults to `cpu`
Applies to CLI operation: `detect-text-language`, API method: `detectTextLanguage`

General:
- `engine`: `tinyld` or `fasttext`. Defaults to `tinyld`
- `defaultLanguage`: language to fall back to when the confidence for the top candidate is low. Defaults to `en`
- `fallbackThresholdProbability`: confidence threshold to cause fallback. Defaults to `0.05`
Applies to CLI operation: `detect-voice-activity`, API method: `detectVoiceActivity`

General:
- `engine`: VAD engine to use. Can be `webrtc`, `silero`, `rnnoise`, or `adaptive-gate`. Defaults to `silero`
- `activityThreshold`: minimum predicted probability for determining a frame as having speech activity. Defaults to `0.5`

WebRTC:
- `webrtc.frameDuration`: WebRTC frame duration (ms). Can be `10`, `20` or `30`. Defaults to `30`
- `webrtc.mode`: WebRTC mode (aggressiveness). Can be `0`, `1`, `2` or `3`. Defaults to `1`

Silero:
- `silero.frameDuration`: Silero frame duration (ms). Can be `30`, `60` or `90`. Defaults to `90`
- `silero.provider`: ONNX execution provider to use. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Using GPU is likely to be slower than CPU, due to inference being independently executed on each audio frame. Defaults to `cpu` (recommended)
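A minimal API sketch (the file name and threshold are illustrative):

```ts
import * as Echogarden from 'echogarden'

// Detect speech activity regions using the Silero VAD.
const result = await Echogarden.detectVoiceActivity('speech.wav', {
  engine: 'silero',
  activityThreshold: 0.5,
})

console.log(result)
```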
Applies to CLI operation: `denoise`, API method: `denoise`

General:
- `engine`: `rnnoise` or `nsnet2`. Defaults to `rnnoise`

Post-processing:
- `postProcessing.normalizeAudio`: should normalize output audio. Defaults to `false`
- `postProcessing.targetPeak`: target peak (decibels) for normalization. Defaults to `-3`
- `postProcessing.maxGainIncrease`: max gain increase (decibels) when performing normalization. Defaults to `30`
- `postProcessing.dryMixGain`: gain (decibels) of the dry (original) signal to mix back into the denoised (wet) signal. Defaults to `-100`

NSNet2:
- `nsnet2.model`: can be `baseline-16khz` or `baseline-48khz`. Defaults to `baseline-48khz`
- `nsnet2.provider`: ONNX execution provider (note: the `dml` provider seems to fail with these models). Defaults to `cpu`
- `maxAttenuation`: maximum amount of attenuation, in decibels, applied to an FFT bin when filtering the audio frames. Defaults to `30`
Applies to CLI operation: `isolate`, API method: `isolate`

General:
- `engine`: can only be `mdx-net`

MDX-NET:
- `mdxNet.model`: model to use. Currently available models are `UVR_MDXNET_1_9703`, `UVR_MDXNET_2_9682`, `UVR_MDXNET_3_9662`, `UVR_MDXNET_KARA`, and the higher-quality models `UVR_MDXNET_Main` and `Kim_Vocal_2`. Defaults to `UVR_MDXNET_1_9703`
- `mdxNet.provider`: ONNX execution provider to use. Can be `cpu`, `dml` (DirectML-based GPU acceleration - Windows only), or `cuda` (Linux only - requires CUDA Toolkit 12.x and cuDNN 9.x to be installed). Defaults to `dml` if available (Windows), or `cpu` (other platforms)
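A sketch of isolating vocals via the API (the file and model names are illustrative; the result's exact fields are in the API reference):

```ts
import * as Echogarden from 'echogarden'

// Separate the vocal track from background audio using an MDX-NET model.
const result = await Echogarden.isolate('song.mp3', {
  engine: 'mdx-net',
  mdxNet: { model: 'Kim_Vocal_2' },
})

console.log(result)
```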
These options are shared between text-to-speech, speech-to-text and alignment operations, and are usually prefixed with `subtitles.`.

- `mode`: subtitle generation mode. Can be `segment` (ensures each segment starts at a new cue), `sentence` (ensures each sentence starts at a new cue), `word` (one word per cue, no punctuation included), `phone` (one phone per cue), `word+phone` (includes both `word` and `phone` cues, with overlapping time ranges), or `line` (each text line is made a separate cue). Defaults to `sentence`
- `maxLineCount`: maximum number of lines per cue. Defaults to `2`
- `maxLineWidth`: maximum characters in a line. Defaults to `42`
- `minWordsInLine`: minimum number of remaining words to break to a new line. Defaults to `4`
- `separatePhrases`: try to separate phrases or sentences in new lines or cues, if possible. Defaults to `true`
- `maxAddedDuration`: maximum extra time (in seconds) that may be added after a cue's end time. This gives the reader additional time to read the cue, and also ensures that very short cues aren't shown in a flash. Defaults to `3.0`

Note: the options `maxLineCount`, `maxLineWidth`, `minWordsInLine` and `separatePhrases` are only effective when using the `segment` and `sentence` modes, and are ignored in all other modes. `maxAddedDuration` doesn't apply to the `word`, `phone` and `word+phone` modes (they always use the exact start and end timestamps).
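As a sketch, subtitle options pass through the `subtitles` prefix like any other option group (values illustrative):

```ts
import * as Echogarden from 'echogarden'

// Recognize speech and shape the resulting subtitle cues.
const result = await Echogarden.recognize('speech.wav', {
  engine: 'whisper',
  subtitles: {
    mode: 'sentence',
    maxLineCount: 2,
    maxLineWidth: 42,
  },
})

console.log(result)
```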
On the CLI, global options can be used with any operation. To set global options via the API, use the `setGlobalOption(key, value)` method (see the API reference for more details).

- `ffmpegPath`: sets a custom path for the FFmpeg executable
- `soxPath`: sets a custom path for the SoX executable
- `packageBaseURL`: sets a custom base URL for the remote package repository used to download missing packages. Defaults to `https://huggingface.co/echogarden/echogarden-packages/resolve/main/`. If `huggingface.co` isn't accessible in your location, you can use a mirror by changing `huggingface.co` to an alternative domain like `hf-mirror.com`
- `logLevel`: adjusts the quantity of log messages shown during processing. Possible values: `silent`, `output`, `error`, `warning`, `info`, `trace`. Defaults to `info`
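For example, from the API (paths are placeholders):

```ts
import * as Echogarden from 'echogarden'

// Point Echogarden at a custom FFmpeg binary and reduce log verbosity.
Echogarden.setGlobalOption('ffmpegPath', '/usr/local/bin/ffmpeg')
Echogarden.setGlobalOption('logLevel', 'error')
```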
These options are for the CLI only.

- `--play`, `--no-play`: enable/disable audio playback. Defaults to playing if no output file is specified
- `--player`: audio player to use. Can be `audio-io` (uses the `audio-io` package to directly output to native OS audio buffers) or `sox` (requires `sox` to be available on the path on macOS; auto-downloaded on other platforms). Defaults to `audio-io`
- `--overwrite`, `--no-overwrite`: overwrite/keep existing files. Doesn't overwrite by default
- `--debug`, `--no-debug`: show/hide the full details of JavaScript errors, if they occur. Disabled by default
- `--config=...`: path to a configuration file to use. Defaults to `echogarden.config` or `echogarden.config.json`, if found in the current directory
The CLI supports loading options from a default or custom configuration file in various formats. See the CLI Guide for more details.