I am trying to run a pipeline to extract embeddings. The pipeline I am running is the one in the README:
import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding
segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()
stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))
mic.read()
However, SegmentationModel has no attribute sample_rate, and the line

dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),

raises:
Traceback (most recent call last):
File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'
First of all, your sample rate should be 16000. The example in the README must be old. I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
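Concretely, the README example then becomes the following (the same code as above, with the attribute lookup replaced by the literal sample rate; this is a sketch, not executed here since it needs diart and a live microphone):

```python
import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")

mic = MicrophoneAudioSource()
stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift.
    # Use the literal 16000 Hz rate, since the sample_rate
    # attribute was removed from the model.
    dops.rearrange_audio_stream(sample_rate=16000),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))
mic.read()
```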
Concerning the 3-speaker output, this is normal and depends on how many maximum speakers are predicted by the segmentation model. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter depending on the segmentation activation. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if any predicted speaker S has at least one frame where the probability of speech p(S) satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.
Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
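The tau_active rule above can be sketched with NumPy as follows (the function name, the threshold value, and the embedding dimension are illustrative; diart's actual implementation differs in its details):

```python
import numpy as np

def filter_active_embeddings(segmentation, embeddings, tau_active=0.5):
    """Keep embeddings of speakers with at least one frame where p(S) >= tau_active.

    segmentation: (num_speakers, num_frames) speech probabilities
    embeddings:   (num_speakers, embedding_dim) speaker embeddings
    """
    # Boolean mask: True for speakers active in at least one frame
    active = (segmentation >= tau_active).any(axis=1)
    return embeddings[active]

# Example: 3 predicted speakers, but only speaker 0 is actually talking
seg = np.array([
    [0.1, 0.9, 0.8],    # speaker 0: clearly speaking
    [0.0, 0.2, 0.1],    # speaker 1: silent
    [0.05, 0.1, 0.0],   # speaker 2: silent
])
emb = np.random.randn(3, 512)
print(filter_active_embeddings(seg, emb, tau_active=0.5).shape)  # (1, 512)
```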
Thanks for your help. I submitted a PR with some changes.
Do you know how to change the num_speakers on SpeakerSegmentation? I know that we could create a config for SpeakerDiarization, not sure if we can do something similar for SpeakerSegmentation.
Changing the number of speakers would require re-training the segmentation model, or fine-tuning it, to produce a matrix of a different size (adding or removing speaker rows).
So I tried replacing it with
and all I get from the output is:
Not sure why it is detecting 3 speakers when I am the only one talking. The entire output confuses me.
Any help is appreciated.
Edit: I did come across #214, but I am still not sure how to actually perform the embedding extraction.
Edit 2: Taking out .shape does print out values.