Find canonical names from Wikipedia titles? #20

Open
AbeHandler opened this issue Jul 18, 2019 · 1 comment

AbeHandler commented Jul 18, 2019

One limitation of phrasemachine is that it returns overlapping spans, which may be unsuitable for some use cases. For instance, phrasemachine will return "Kim Kardashian", "Kim Kardashian West" and "Kardashian West" for the sentence "Kim Kardashian West will attend". For at least some phrasemachine users this will be undesirable: they will just want a single, canonical span (e.g. "Kim Kardashian").

There are definitely cases where even determining what the canonical name should be is tricky, e.g. "Sichuan hot pot dishes are delicious" => 'sichuan hot pot', 'sichuan hot pot dishes', 'hot pot', 'hot pot dishes', 'pot dishes'. [shrugs] And in other cases I could imagine all sorts of complex semantic issues at play.

But our current solution is basically to do nothing. I wonder if some users would prefer that a decision just be made for them. Maybe we should offer a small, simple model trained on canonical Wikipedia titles to choose among overlapping spans. In the first case, "Kim Kardashian" would be the "correct" answer because that is the Wikipedia page. I would imagine this would at least rule out obviously terrible canonical names, e.g. "Rev. Jean-Bertrand" for "Rev. Jean-Bertrand Aristide".
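
To make the idea concrete, here is a rough sketch of the simplest possible version: a plain title lookup rather than a trained model. Everything in it is an assumption for illustration. `pick_canonical` is a hypothetical helper, not part of phrasemachine, and the hand-written title set stands in for a real list of Wikipedia page titles. It prefers a candidate that exactly matches a title and otherwise falls back to the longest span.

```python
def pick_canonical(candidates, wikipedia_titles):
    """Pick one span from a group of overlapping candidate phrases.

    Hypothetical helper, not part of phrasemachine. Candidates that exactly
    match a known title (case-insensitive) are preferred. If none match,
    all candidates are considered. Ties are broken by taking the longest
    string.
    """
    titles = {title.lower() for title in wikipedia_titles}
    matches = [c for c in candidates if c.lower() in titles]
    pool = matches or list(candidates)
    return max(pool, key=len)


# Toy title set standing in for real Wikipedia page titles (assumption).
TITLES = {"Kim Kardashian", "Jean-Bertrand Aristide"}

kim = pick_canonical(
    ["Kim Kardashian", "Kim Kardashian West", "Kardashian West"], TITLES
)
aristide = pick_canonical(
    ["Rev. Jean-Bertrand", "Jean-Bertrand Aristide", "Rev. Jean-Bertrand Aristide"],
    TITLES,
)
print(kim)
print(aristide)
```

A lookup like this cannot help with the "Sichuan hot pot dishes" case unless one of the spans happens to be a title, which is part of why a learned model might be worth trying.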

Should this be attempted? One con is that it makes phrasemachine more complex. Another is that it might not work, or might have complications that we can't foresee. But it seems like a good standalone project for someone looking to help out with phrasemachine or tackle a contained NLP problem.


AbeHandler commented Nov 23, 2019

Some related work, for record-keeping: https://arxiv.org/pdf/1906.06703.pdf

I seem to recall there was another paper from Stanovsky or Dagan or both on this but I can't seem to find it.
