
IndexError when using disambiguate() with maxsim algorithm #59

Open
kientuongnguyen opened this issue Oct 31, 2019 · 2 comments
kientuongnguyen commented Oct 31, 2019

I'm using Google Colab.

s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

The same error occurs with "may sentiment", "might sentiment", "must sentiment", and so on.


IndexError Traceback (most recent call last)
<ipython-input> in <module>()
1 s = "would sentiment"
----> 2 disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

1 frames
/usr/local/lib/python3.6/dist-packages/pywsd/allwords_wsd.py in disambiguate(sentence, algorithm, context_is_lemmatized, similarity_option, keepLemmas, prefersNone, from_cache, tokenizer)
43 synset = algorithm(lemma_sentence, lemma, from_cache=from_cache)
44 elif algorithm == max_similarity:
---> 45 synset = algorithm(lemma_sentence, lemma, pos=pos, option=similarity_option)
46 else:
47 synset = algorithm(lemma_sentence, lemma, pos=pos, context_is_lemmatized=True,

/usr/local/lib/python3.6/dist-packages/pywsd/similarity.py in max_similarity(context_sentence, ambiguous_word, option, lemma, context_is_lemmatized, pos, best)
125 result = sorted([(v,k) for k,v in result.items()],reverse=True)
126
--> 127 return result[0][1] if best else result

IndexError: list index out of range
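
A self-contained version of the repro, assuming maxsim is aliased from pywsd.similarity.max_similarity (consistent with the max_similarity branch taken in the traceback):

from pywsd import disambiguate
from pywsd.similarity import max_similarity as maxsim

s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)
# -> IndexError: list index out of range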

BigBossAnwer commented
I'm getting the same error with a similar kind of usage on Python 3.8, pywsd 1.2.4. For example:

disambiguate('Neither was there a qualified majority within this House to revert to Article 272.', max_similarity, similarity_option='path')

gives an IndexError in pywsd.similarity.max_similarity().
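
The empty result seems to come from the pos filter: as the code below shows, max_similarity guards against unknown words with wn.synsets(ambiguous_word) (no pos), but then builds its candidates from wn.synsets(ambiguous_word, pos=pos). When the tagger assigns a pos under which the word has no synsets, the loop never runs, result stays empty, and result[0][1] raises. A quick illustration:

from nltk.corpus import wordnet as wn

print(wn.synsets('sentiment'))           # non-empty: noun synsets exist
print(wn.synsets('sentiment', pos='v'))  # [] -- no verb synsets, so the
                                         # result dict stays empty and
                                         # result[0][1] raises IndexError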

A scotch-tape patch:

# Drop-in replacement inside pywsd/similarity.py (wn, sim, lemmatize and
# word_tokenize are already imported at the top of that module):
def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous word (see https://ibin.co/4gG9zUlejUUA.png):

        \\arg\\max_{synset(a)} \\sum_{i}^{n} \\max_{synset(i)} sim(i, a)

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns the sorted
        list of (score, Synset) tuples.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If ambiguous word not in WordNet return None
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)

    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)

    if not len(result):
        # Patch: the pos filter above can leave no candidate synsets at all,
        # so bail out instead of indexing into an empty list.
        return None

    return result[0][1] if best else result

in pywsd.similarity, where

    if not len(result):
        return None

is the "fix". It doesn't really resolve the underlying issue, though.
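
With the patch applied, tokens the algorithm gives up on come back as None, the same value disambiguate already yields for words outside WordNet, so callers can simply filter them out. A sketch, assuming the patched max_similarity above:

from pywsd import disambiguate
from pywsd.similarity import max_similarity

sentence = ('Neither was there a qualified majority within this House '
            'to revert to Article 272.')
tagged = disambiguate(sentence, max_similarity, similarity_option='path')
# Drop the (word, None) pairs that the patch now produces instead of crashing:
resolved = [(word, synset) for word, synset in tagged if synset is not None]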

tcardlab commented
I was also getting this error. I found that it happens because an incorrect pos is being passed to max_similarity.

Why the wrong pos is passed probably has to do with something in the following chain:
disambiguate > lemmatize_sentence > postagger & lemmatize

However, we can still catch a bad pos by checking whether the pos-filtered synset list is empty (falsy), and fall back to an unspecified pos in that case. I have done this at the declaration of syn:

from nltk.corpus import wordnet as wn  # missing from the original snippet
from pywsd.tokenize import word_tokenize
from pywsd.utils import lemmatize
from pywsd.similarity import sim

def max_similarity_fix(context_sentence: str, ambiguous_word: str, option="path",
                       lemma=True, context_is_lemmatized=False, pos=None, best=True,
                       from_cache=False) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous word (see https://ibin.co/4gG9zUlejUUA.png):

        \\arg\\max_{synset(a)} \\sum_{i}^{n} \\max_{synset(i)} sim(i, a)

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns the sorted
        list of (score, Synset) tuples.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # Prefer the synsets for the tagged pos, but fall back to synsets of any
    # pos when the (possibly wrong) tag yields none:
    syn = wn.synsets(ambiguous_word, pos=pos) or wn.synsets(ambiguous_word)

    # If ambiguous word not in WordNet return None
    if not syn:
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]

    result = {}
    for i in syn:
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i, k, option))
            result[i] += max(_result)

    if option in ["res", "resnik"]:  # lower score = more similar
        result = sorted([(v, k) for k, v in result.items()])
    else:  # higher score = more similar
        result = sorted([(v, k) for k, v in result.items()], reverse=True)

    return result[0][1] if best else result

You can see this works for "should sentiment. deep-water. co-beneficiary.", each of which would otherwise break it:

sentence = "should sentiment. deep-water. co-beneficiary."
print(disambiguate(sentence, algorithm=max_similarity_fix))

I am uncertain whether using an unspecified pos is a good idea. It may be better to mark these cases with a unique output that you can filter for afterward. @BigBossAnwer has a good method for that, though you may wish to return a different value than None.
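
For what it's worth, a minimal sketch of that marking idea; NO_POS_MATCH and max_similarity_marked are hypothetical names, and it delegates to the max_similarity_fix defined above:

from nltk.corpus import wordnet as wn
from pywsd.utils import lemmatize

NO_POS_MATCH = object()  # unique sentinel: word is in WordNet, just not under this pos

def max_similarity_marked(context_sentence, ambiguous_word, option="path",
                          lemma=True, context_is_lemmatized=False, pos=None,
                          best=True, from_cache=False):
    """Like max_similarity_fix, but marks bad-pos cases instead of guessing."""
    word = lemmatize(ambiguous_word)
    if not wn.synsets(word):
        return None  # not in WordNet at all
    if pos is not None and not wn.synsets(word, pos=pos):
        return NO_POS_MATCH  # tagger's pos has no synsets; filter for this later
    return max_similarity_fix(context_sentence, ambiguous_word, option=option,
                              context_is_lemmatized=context_is_lemmatized,
                              pos=pos, best=best)

Downstream code can then distinguish "not in WordNet" (None) from "in WordNet, but not under the tagged pos" (NO_POS_MATCH) and filter or retry accordingly.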
