
IndexError when using disambiguate() with maxsim algorithm #59

Open
kientuongnguyen opened this issue Oct 31, 2019 · 2 comments
kientuongnguyen commented Oct 31, 2019

I'm using Google Colab.

s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

The same error occurs with "may sentiment", "might sentiment", "must sentiment", and so on.


IndexError Traceback (most recent call last)
<ipython-input> in <module>()
1 s = "would sentiment"
----> 2 disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

1 frames
/usr/local/lib/python3.6/dist-packages/pywsd/allwords_wsd.py in disambiguate(sentence, algorithm, context_is_lemmatized, similarity_option, keepLemmas, prefersNone, from_cache, tokenizer)
43 synset = algorithm(lemma_sentence, lemma, from_cache=from_cache)
44 elif algorithm == max_similarity:
---> 45 synset = algorithm(lemma_sentence, lemma, pos=pos, option=similarity_option)
46 else:
47 synset = algorithm(lemma_sentence, lemma, pos=pos, context_is_lemmatized=True,

/usr/local/lib/python3.6/dist-packages/pywsd/similarity.py in max_similarity(context_sentence, ambiguous_word, option, lemma, context_is_lemmatized, pos, best)
125 result = sorted([(v,k) for k,v in result.items()],reverse=True)
126
--> 127 return result[0][1] if best else result

IndexError: list index out of range
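
A self-contained version of the repro, assuming maxsim is aliased from pywsd.similarity.max_similarity (consistent with the max_similarity branch taken in the traceback):

from pywsd import disambiguate
from pywsd.similarity import max_similarity as maxsim

s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)
# -> IndexError: list index out of range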

BigBossAnwer commented
I'm getting the same error with a similar kind of usage on Python 3.8, pywsd 1.2.4. For example:

disambiguate('Neither was there a qualified majority within this House to revert to Article 272.', max_similarity, similarity_option='path')

gives an IndexError in pywsd.similarity.max_similarity().
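
The empty result seems to come from the pos filter: as the code below shows, max_similarity guards against unknown words with wn.synsets(ambiguous_word) (no pos), but then builds its candidates from wn.synsets(ambiguous_word, pos=pos). When the tagger assigns a pos under which the word has no synsets, the loop never runs, result stays empty, and result[0][1] raises. A quick illustration:

from nltk.corpus import wordnet as wn

print(wn.synsets('sentiment'))           # non-empty: noun synsets exist
print(wn.synsets('sentiment', pos='v'))  # [] -- no verb synsets, so the
                                         # result dict stays empty and
                                         # result[0][1] raises IndexError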

A scotch-tape patch:

# Drop-in replacement inside pywsd/similarity.py (wn, sim, lemmatize and
# word_tokenize are already imported at the top of that module):
def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous word (see https://ibin.co/4gG9zUlejUUA.png):

        \\arg\\max_{synset(a)} \\sum_{i}^{n} \\max_{synset(i)} sim(i, a)

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns the sorted
        list of (score, Synset) tuples.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If ambiguous word not in WordNet return None
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)

    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)

    if not len(result):
        # Patch: the pos filter above can leave no candidate synsets at all,
        # so bail out instead of indexing into an empty list.
        return None

    return result[0][1] if best else result

in pywsd.similarity, where

    if not len(result):
        return None

is the "fix". It doesn't really resolve the underlying issue, though.
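
With the patch applied, tokens the algorithm gives up on come back as None, the same value disambiguate already yields for words outside WordNet, so callers can simply filter them out. A sketch, assuming the patched max_similarity above:

from pywsd import disambiguate
from pywsd.similarity import max_similarity

sentence = ('Neither was there a qualified majority within this House '
            'to revert to Article 272.')
tagged = disambiguate(sentence, max_similarity, similarity_option='path')
# Drop the (word, None) pairs that the patch now produces instead of crashing:
resolved = [(word, synset) for word, synset in tagged if synset is not None]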

tcardlab commented
I was also getting this error. I found that it happens because an incorrect pos is being passed to max_similarity.

Why the wrong pos is passed probably has to do with something in the following chain:
disambiguate > lemmatize_sentence > postagger & lemmatize

However, we can still catch a bad pos by checking whether the pos-filtered synset list is empty (falsy), and fall back to an unspecified pos in that case. I have done this at the declaration of syn:

from nltk.corpus import wordnet as wn  # missing from the original snippet
from pywsd.tokenize import word_tokenize
from pywsd.utils import lemmatize
from pywsd.similarity import sim

def max_similarity_fix(context_sentence: str, ambiguous_word: str, option="path",
                       lemma=True, context_is_lemmatized=False, pos=None, best=True,
                       from_cache=False) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous word (see https://ibin.co/4gG9zUlejUUA.png):

        \\arg\\max_{synset(a)} \\sum_{i}^{n} \\max_{synset(i)} sim(i, a)

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns the sorted
        list of (score, Synset) tuples.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # Prefer the synsets for the tagged pos, but fall back to synsets of any
    # pos when the (possibly wrong) tag yields none:
    syn = wn.synsets(ambiguous_word, pos=pos) or wn.synsets(ambiguous_word)

    # If ambiguous word not in WordNet return None
    if not syn:
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]

    result = {}
    for i in syn:
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)

    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)

    return result[0][1] if best else result

You can see this works for "should sentiment. deep-water. co-beneficiary.", each of which would otherwise break it:

sentence = "should sentiment. deep-water. co-beneficiary."
print(disambiguate(sentence, algorithm=max_similarity_fix))

I am uncertain whether using an unspecified pos is a good idea. It may be better to mark these cases with a unique output that you can filter for afterward. @BigBossAnwer has a good method for that, though you may wish to return a different value than None.
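
For what it's worth, a minimal sketch of that marking idea; NO_POS_MATCH and max_similarity_marked are hypothetical names, and it delegates to the max_similarity_fix defined above:

from nltk.corpus import wordnet as wn
from pywsd.utils import lemmatize

NO_POS_MATCH = object()  # unique sentinel: word is in WordNet, just not under this pos

def max_similarity_marked(context_sentence, ambiguous_word, option="path",
                          lemma=True, context_is_lemmatized=False, pos=None,
                          best=True, from_cache=False):
    """Like max_similarity_fix, but marks bad-pos cases instead of guessing."""
    word = lemmatize(ambiguous_word)
    if not wn.synsets(word):
        return None  # not in WordNet at all
    if pos is not None and not wn.synsets(word, pos=pos):
        return NO_POS_MATCH  # tagger's pos has no synsets; filter for this later
    return max_similarity_fix(context_sentence, ambiguous_word, option=option,
                              context_is_lemmatized=context_is_lemmatized,
                              pos=pos, best=best)

Downstream code can then distinguish "not in WordNet" (None) from "in WordNet, but not under the tagged pos" (NO_POS_MATCH) and filter or retry accordingly.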
