Trying to extract embeds #239

Open
hmehdi515 opened this issue May 28, 2024 · 3 comments · Fixed by #241
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)
Milestone: Version 0.9.2

Comments

@hmehdi515
Contributor

hmehdi515 commented May 28, 2024

Hi,

I am trying to run a pipeline to extract embeddings.

The pipeline I am running is the one in the README:

import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift
    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()

However, SegmentationModel has no sample_rate attribute:

    dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),

    Traceback (most recent call last):
      File "T:\Projects\endospeech_RD\IdentifySpeechToText\obtain_embeddings.py", line 11, in <module>
        dops.rearrange_audio_stream(sample_rate=segmentation.model.sample_rate),
    AttributeError: 'SegmentationModel' object has no attribute 'sample_rate'

So I tried replacing it with:

    dops.rearrange_audio_stream(sample_rate=44100),

and all the output shows is:

# (batch_size, num_speakers, embedding_dim)
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])
torch.Size([1, 3, 512])

I'm not sure why it detects 3 speakers when I am the only one talking; the entire output confuses me.

Any help is appreciated.

Edit: I did come across #214, but I am still not sure how to actually perform the embedding extraction.

Edit 2:

).subscribe(on_next=lambda emb: print(emb))

Removing .shape does print out values:

tensor([[[-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226],
         [-0.0517, -0.0178, -0.0477,  ..., -0.0572, -0.0540, -0.0226]]])
tensor([[[-0.0507, -0.0086, -0.0534,  ..., -0.0544, -0.0962,  0.0316],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159],
         [-0.0571, -0.0187, -0.0451,  ..., -0.0532, -0.0596, -0.0159]]])
tensor([[[-0.0604, -0.0138, -0.0483,  ..., -0.0615, -0.0730, -0.0237],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243],
         [-0.0603, -0.0138, -0.0479,  ..., -0.0614, -0.0728, -0.0243]]])
@juanmc2005 added the documentation and question labels on May 30, 2024
@juanmc2005
Owner

Hi @hmehdi515,

First of all, your sample rate should be 16000. The example in the README must be old. I removed the sample_rate attribute from the model to make it easier to integrate custom models. Would you mind creating a PR to fix the example? It would be greatly appreciated!
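
In other words, the example from your post would become the following (only the sample_rate argument changes; everything else stays as above):

import rx.operators as ops
import diart.operators as dops
from diart.sources import MicrophoneAudioSource
from diart.blocks import SpeakerSegmentation, OverlapAwareSpeakerEmbedding

segmentation = SpeakerSegmentation.from_pretrained("pyannote/segmentation")
embedding = OverlapAwareSpeakerEmbedding.from_pretrained("pyannote/embedding")
mic = MicrophoneAudioSource()

stream = mic.stream.pipe(
    # Reformat stream to 5s duration and 500ms shift.
    # The pretrained models expect 16 kHz audio, so hard-code the rate
    # instead of reading the removed model.sample_rate attribute.
    dops.rearrange_audio_stream(sample_rate=16000),
    ops.map(lambda wav: (wav, segmentation(wav))),
    ops.starmap(embedding)
).subscribe(on_next=lambda emb: print(emb.shape))

mic.read()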

Concerning the 3-speaker output, this is normal: it depends on the maximum number of speakers the segmentation model can predict. In this case, the segmentation output is a matrix of shape (num_speakers=3, num_frames). To get the embeddings corresponding to "active" speakers, you should filter based on the segmentation activations. For example, in the diarization pipeline we use the tau_active threshold, which applies the following rule: if a predicted speaker S has at least 1 frame where the probability of speech p(S) satisfies p(S) >= tau_active, then S is considered "active" and we keep its embedding.

Bear in mind that this is not necessarily the best rule for every use case, so I encourage you to try different alternatives.
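
As a rough illustration (this is not the exact pipeline code; the shapes follow the (num_speakers, num_frames) layout described above, and the 0.5 threshold is just a placeholder), the filtering could look like this:

import torch

def active_embeddings(seg: torch.Tensor, emb: torch.Tensor, tau_active: float = 0.5) -> torch.Tensor:
    # seg: (num_speakers, num_frames) speech probabilities for one chunk
    # emb: (num_speakers, embedding_dim) one embedding per predicted speaker
    # A speaker is "active" if at least one frame satisfies p(S) >= tau_active
    active = (seg >= tau_active).any(dim=1)  # boolean mask of shape (num_speakers,)
    return emb[active]  # keep only the rows of active speakers

Since your output has a leading batch dimension, you would call this on seg[0] and emb[0]; chunks where you are the only active speaker should then keep fewer than 3 rows.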

@hmehdi515
Contributor Author

Thanks for your help. I submitted a PR with some changes.

Do you know how to change num_speakers on SpeakerSegmentation? I know we can create a config for SpeakerDiarization, but I'm not sure if we can do something similar for SpeakerSegmentation.

@juanmc2005
Owner

Changing the number of speakers would require re-training the segmentation model, or fine-tuning it to produce a matrix of a different size (adding or removing speaker rows).

@juanmc2005 added this to the Version 0.9.2 milestone on Jun 28, 2024
@juanmc2005 linked a pull request on Jun 28, 2024 that will close this issue