Running WhisperSeg as a web service makes it possible to disentangle the environment in which WhisperSeg runs from the environment in which the segmenting function is called. For example, we can set up a WhisperSeg segmenting service on one machine, and call this service from different working environments (MATLAB, a webpage frontend, a Jupyter notebook) at different physical locations.
This enables an easy way to call WhisperSeg from MATLAB and is essential for setting up a web page for automatic vocal segmentation.
In a terminal, go to the main folder of this repository, and run the following command:
```bash
python segment_service.py --flask_port 8050 --model_path nccratliri/whisperseg-large-ms-ct2 --device cuda
```
Explanation of the parameters:
- flask_port: the port that this service listens on. Requests sent to this port will be handled by this service.
- model_path: the path to the WhisperSeg model. This can be either an original Hugging Face model, e.g., nccratliri/whisperseg-large-ms, or a CTranslate2-converted model, e.g., nccratliri/whisperseg-large-ms-ct2. If you choose the CTranslate2-converted model, please make sure the converted model exists. If you have a different trained WhisperSeg checkpoint, replace "nccratliri/whisperseg-large-ms-ct2" with the path to that checkpoint.
- device: where to run WhisperSeg, either cuda or cpu. By default the model runs on cuda.
Note: The terminal that runs this service needs to stay open. On a Linux system, one can first create a new screen session (e.g., with screen -S whisperseg), start the service inside it, and then detach (Ctrl-A followed by D) so that the service keeps running in the background.
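Before sending real requests, you can check that the service is reachable by testing whether the port accepts connections. Below is a minimal sketch using only Python's standard library; the host and port are assumptions matching the launch command above:

```python
import socket

## hypothetical quick check that the segmenting service is listening;
## adjust host/port to match the --flask_port you chose above
def service_is_up(host="localhost", port=8050, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(service_is_up())  ## True once segment_service.py is running
```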
For example, the following code segments a zebra finch recording:
```python
import requests, json, base64
import pandas as pd
import librosa

## define a function for segmentation
def call_segment_service( service_address,
                          audio_file_path,
                          sr = None,
                          channel_id = 0,
                          min_frequency = None,
                          spec_time_step = None,
                          min_segment_length = None,
                          eps = None,
                          num_trials = 3,
                          adobe_audition_compatible = False
                        ):
    ## if the sampling rate is not given, read it from the audio file header
    if sr is None:
        sr = librosa.get_samplerate(audio_file_path)
    ## encode the raw audio bytes as a base64 string so they can be sent as JSON
    audio_file_base64_string = base64.b64encode( open(audio_file_path, 'rb').read() ).decode('ASCII')
    response = requests.post( service_address,
                              data = json.dumps( {
                                  "audio_file_base64_string": audio_file_base64_string,
                                  "channel_id": channel_id,
                                  "sr": sr,
                                  "min_frequency": min_frequency,
                                  "spec_time_step": spec_time_step,
                                  "min_segment_length": min_segment_length,
                                  "eps": eps,
                                  "num_trials": num_trials,
                                  "adobe_audition_compatible": adobe_audition_compatible
                              } ),
                              headers = {"Content-Type": "application/json"}
                            )
    return response.json()
```
Note (Important):
- Running the above code does not require any further dependencies, nor does it load any model in the calling environment.
- The service_address is composed of SEGMENTING_SERVER_IP_ADDRESS + ":" + FLASK_PORT_NUMBER + "/segment" (see the short sketch after this list). If the service is running on the local machine, SEGMENTING_SERVER_IP_ADDRESS is "http://localhost"; otherwise, you will need to know the IP address of the server machine.
- channel_id is useful when the input audio file has multiple channels; it specifies which channel to segment. By default, channel_id = 0, meaning the first channel is used for segmentation.
- The parameter adobe_audition_compatible controls the format of the returned segmentation results. If adobe_audition_compatible=True, the returned result is a dictionary compatible with Adobe Audition: after converting the dictionary to a pandas DataFrame and saving it as a csv file, this csv file can be loaded directly into Adobe Audition. If adobe_audition_compatible=False, the result is a simple dictionary containing only "onset", "offset" and "cluster".
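For instance, here is a minimal sketch of composing the service_address; the values below are placeholders matching the launch command above, to be replaced with your own server IP and port:

```python
## placeholder values: replace with your server's IP address and the
## flask_port used when launching segment_service.py
SEGMENTING_SERVER_IP_ADDRESS = "http://localhost"
FLASK_PORT_NUMBER = "8050"
service_address = SEGMENTING_SERVER_IP_ADDRESS + ":" + FLASK_PORT_NUMBER + "/segment"
## -> "http://localhost:8050/segment"
```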
Now call the service on the example recording:

```python
prediction = call_segment_service( "http://localhost:8050/segment",
                                   "../data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.wav",
                                   adobe_audition_compatible = True
                                 )
## we can convert the returned dictionary into a pandas DataFrame
df = pd.DataFrame(prediction)
df
```
| | Name | Start | Duration | Time Format | Type | Description |
|---|---|---|---|---|---|---|
| 0 | | 0:00.010 | 0:00.063 | decimal | Cue | |
| 1 | | 0:00.380 | 0:00.067 | decimal | Cue | |
| 2 | | 0:00.603 | 0:00.070 | decimal | Cue | |
| 3 | | 0:00.758 | 0:00.074 | decimal | Cue | |
| 4 | | 0:00.912 | 0:00.571 | decimal | Cue | |
| 5 | | 0:01.812 | 0:00.070 | decimal | Cue | |
| 6 | | 0:01.963 | 0:00.074 | decimal | Cue | |
| 7 | | 0:02.073 | 0:00.570 | decimal | Cue | |
| 8 | | 0:02.840 | 0:00.053 | decimal | Cue | |
| 9 | | 0:02.982 | 0:00.081 | decimal | Cue | |
| 10 | | 0:03.112 | 0:00.171 | decimal | Cue | |
| 11 | | 0:03.668 | 0:00.074 | decimal | Cue | |
| 12 | | 0:03.828 | 0:00.070 | decimal | Cue | |
| 13 | | 0:03.953 | 0:00.570 | decimal | Cue | |
| 14 | | 0:05.158 | 0:00.065 | decimal | Cue | |
| 15 | | 0:05.323 | 0:00.070 | decimal | Cue | |
| 16 | | 0:05.468 | 0:00.575 | decimal | Cue | |
We can save the df to an Adobe Audition compatible csv file as follows (note: index=False and sep="\t" are necessary!):
```python
df.to_csv( "prediction_result.csv", index = False, sep="\t" )
```
Calling the service with adobe_audition_compatible = False returns the simple format instead:

```python
prediction = call_segment_service( "http://localhost:8050/segment",
                                   "../data/example_subset/Zebra_finch/test_adults/zebra_finch_g17y2U-f00007.wav",
                                   adobe_audition_compatible = False
                                 )
## we can convert the returned dictionary into a pandas DataFrame
df = pd.DataFrame(prediction)
df
```
| | onset | offset | cluster |
|---|---|---|---|
| 0 | 0.010 | 0.073 | vocal |
| 1 | 0.380 | 0.447 | vocal |
| 2 | 0.603 | 0.673 | vocal |
| 3 | 0.758 | 0.832 | vocal |
| 4 | 0.912 | 1.483 | vocal |
| 5 | 1.812 | 1.882 | vocal |
| 6 | 1.963 | 2.037 | vocal |
| 7 | 2.073 | 2.643 | vocal |
| 8 | 2.840 | 2.893 | vocal |
| 9 | 2.982 | 3.063 | vocal |
| 10 | 3.112 | 3.283 | vocal |
| 11 | 3.668 | 3.742 | vocal |
| 12 | 3.828 | 3.898 | vocal |
| 13 | 3.953 | 4.523 | vocal |
| 14 | 5.158 | 5.223 | vocal |
| 15 | 5.323 | 5.393 | vocal |
| 16 | 5.468 | 6.043 | vocal |
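Since the simple format stores plain onset and offset times in seconds, it is straightforward to post-process. For example, a small sketch (the output file name here is arbitrary):

```python
## compute each segment's duration and save the result as a regular csv
df["duration"] = df["offset"] - df["onset"]
df.to_csv( "segments.csv", index = False )
```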
First, define a MATLAB function:
```matlab
function response = call_segment_service( service_address, audio_file_path, sr, channel_id, min_frequency, spec_time_step, min_segment_length, eps, num_trials, adobe_audition_compatible )
    % read the audio file as raw bytes and encode them as a base64 string
    fileID = fopen(audio_file_path, 'r');
    fileData = fread(fileID, inf, 'uint8=>uint8');
    fclose(fileID);
    audio_file_base64_string = matlab.net.base64encode( fileData );
    % assemble the request payload and send it as JSON
    data = struct( 'audio_file_base64_string', audio_file_base64_string, ...
                   'channel_id', channel_id, ...
                   'sr', sr, ...
                   'min_frequency', min_frequency, ...
                   'spec_time_step', spec_time_step, ...
                   'min_segment_length', min_segment_length, ...
                   'eps', eps, ...
                   'num_trials', num_trials, ...
                   'adobe_audition_compatible', adobe_audition_compatible );
    jsonData = jsonencode(data);
    options = weboptions( 'RequestMethod', 'POST', 'MediaType', 'application/json' );
    response = webwrite(service_address, jsonData, options);
end
```
Then call this function in the MATLAB console:
```matlab
prediction = call_segment_service( 'http://localhost:8050/segment', '/Users/meilong/Downloads/zebra_finch_g17y2U-f00007.wav', 32000, 0, 0, 0.0025, 0.01, 0.02, 3, 0 )
disp(prediction)
```