Trying to extract embeds #239

Open
hmehdi515 opened this issue May 28, 2024 · 3 comments · Fixed by #241
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)
Milestone: Version 0.9.2

Comments

@hmehdi515
Contributor

hmehdi515 commented May 28, 2024

Hi,

I am trying to run a pipeline to extract embeddings.

The pipeline I am running is the one in the README:

import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()

However, SegmentationModel has no sample_rate attribute:

    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),

    Traceback (most recent call last):
      File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
        dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'

So I tried replacing it with:

    dops.rearrange_audio_stream(sample_rate=44100),

and all the output shows is:

# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])

I'm not sure why it detects 3 speakers when I am the only one talking; the entire output confuses me.

Any help is appreciated.

Edit: I did come across #214, but I am still not sure how to actually perform the embedding extraction.

Edit 2:

).subscribe(on_next=lambda emb: print(emb))

Removing .shape does print out values:

tensor([[[-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534,  ..., -0.0544, -0.0962,  0.0316],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483,  ..., -0.0615, -0.0730, -0.0237],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243]]])
@juanmc2005 added the documentation and question labels on May 30, 2024
@juanmc2005
Owner

Hi @hmehdi515,

First of all, your sample rate should be 16000. The example in the README must be old. I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
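
In other words, the example from your post would become the following (only the sample_rate argument changes; everything else stays as above):

import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift.
    # The pretrained models expect 16 kHz audio, so hard-code the rate
    # instead of reading the removed model.sample_rate attribute.
    dops.rearrange_audio_stream(sample_rate=16000),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()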

Concerning the 3-speaker output, this is normal: it depends on the maximum number of speakers the segmentation model can predict. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter based on the segmentation activations. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if a predicted speaker S has at least 1 frame where the probability of speech p(S) satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.

Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
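
As a rough illustration (this is not the exact pipeline code; the shapes follow the (num_speakers, num_frames) layout described above, and the 0.5 threshold is just a placeholder), the filtering could look like this:

import torch

def active_embeddings(seg: torch.Tensor, emb: torch.Tensor, tau_active: float = 0.5) -> torch.Tensor:
    # seg: (num_speakers, num_frames) speech probabilities for one chunk
    # emb: (num_speakers, embedding_dim) one embedding per predicted speaker
    # A speaker is "active" if at least one frame satisfies p(S) >= tau_active
    active = (seg >= tau_active).any(dim=1)  # boolean mask of shape (num_speakers,)
    return emb[active]  # keep only the rows of active speakers

Since your output has a leading batch dimension, you would call this on seg[0] and emb[0]; chunks where you are the only active speaker should then keep fewer than 3 rows.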

@hmehdi515
Contributor Author

Thanks for your help. I submitted a PR with some changes.

Do you know how to change num_speakers on SpeakerSegmentation? I know we can create a config for SpeakerDiarization, but I'm not sure if we can do something similar for SpeakerSegmentation.

@juanmc2005
Owner

Changing the number of speakers would require re-training the segmentation model, or fine-tuning it to produce a matrix of a different size (adding or removing speaker rows).

@juanmc2005 added this to the Version 0.9.2 milestone on Jun 28, 2024
@juanmc2005 linked a pull request on Jun 28, 2024 that will close this issue