Match based on query with tags/placeables removed #1967
Comments
I have similar ideas that I want to investigate, but haven't had time to realise any of them yet. One possibility that someone mentioned is to filter out certain token types when indexing. We already filter out tokens when tokid=12. I don't know what the best reference is, but you can have a look here: http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html We will need to work out how to take this into account for the weighting. For a start we can still just use the normal Levenshtein distance as we do now, or consider weighted averages of different runs on the full text vs the reduced/stripped version, or maybe use the database rank to affect our own rank. I don't know if this makes sense, so feel free to discuss further :-)
Originally posted by Marce van Velden: I think we should use the Levenshtein distance based on the stripped versions of source and target, and have a fixed but configurable penalty for tag/token/placeable mismatch when the stripped sources are equal but the tags/tokens/placeables do not match.
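A minimal sketch of the penalty idea above, in Python. It assumes placeables are XML-like tags matched by a simple regex, and the names `TAG_RE`, `match_quality` and `tag_penalty` are hypothetical, not part of tmserver. For simplicity it applies the penalty whenever the tag sequences differ, not only when the stripped sources are equal.

```python
import re

# Assumption: placeables are XML-like tags; a real filter would be configurable.
TAG_RE = re.compile(r"<[^>]+>")


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to a 0-100 quality score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def match_quality(query, candidate, tag_penalty=5.0):
    """Score the stripped texts, then subtract a fixed penalty on tag mismatch."""
    score = similarity(TAG_RE.sub("", query), TAG_RE.sub("", candidate))
    if TAG_RE.findall(query) != TAG_RE.findall(candidate):
        score -= tag_penalty
    return max(score, 0.0)
```

With the default penalty, two strings that differ only in their tags score 95 instead of 100, so an exact match with matching tags still ranks first.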
Yes, I agree. I think the issue will be mostly to figure out which ones we feel are the relevant ones. Doing a weighted average of Levenshtein distances on the full text and the reduced text is a way of doing the penalties, I guess. For the implementation I guess we might want to either keep both the original and reduced version in the database, or simply rely on the database rank more. The value from the ranking function is not easy to work with directly (as in, its inherent meaning isn't obvious), but we mostly get it for free, as far as I know.
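The weighted-average variant could look roughly like this. It is only a sketch: the tag regex, the function names and the default weight of 0.8 are assumptions chosen for illustration, not values anyone in this thread has settled on.

```python
import re

# Assumption: placeables are XML-like tags.
TAG_RE = re.compile(r"<[^>]+>")


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to a 0-100 quality score."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def blended_quality(query, candidate, stripped_weight=0.8):
    """Weighted average of the scores on the stripped and the full text."""
    full = similarity(query, candidate)
    reduced = similarity(TAG_RE.sub("", query), TAG_RE.sub("", candidate))
    return stripped_weight * reduced + (1 - stripped_weight) * full
```

Here the penalty is implicit: a tag mismatch lowers the full-text score, and `stripped_weight` controls how much of that reaches the final quality.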
Originally posted by Marce van Velden:
Consider the following source text:
This is a sample text with a link
If this is sent to the tmserver, I would like it to search for "This is a sample text with a link" (the source without tags) and give a quality penalty if the tags don't match.
For example, the tmserver might have a match like:
This is a sample text with a link
This requires saving both the complete source text and the source text with tags removed in the TM database.
What do you think about this? I have made a sample implementation of this for tmserver (sqlite) in the past, and it worked perfectly for us. One open question, though, is which tags/placeables to filter out. Possibly we could make this settable in the configuration.
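As a rough illustration of storing both versions, here is a sketch using Python's standard `sqlite3` module. The table layout, the column names and the functions `create_tm`, `add_unit` and `lookup` are all hypothetical and do not reflect the actual tmserver schema. It only does an exact lookup on the stripped source; fuzzy matching is left out to keep it short.

```python
import re
import sqlite3

# Assumption: placeables are XML-like tags; this would be configurable.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    return TAG_RE.sub("", text)


def create_tm(conn):
    """Create a table that keeps the full and the stripped source side by side."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS units "
        "(source TEXT, stripped_source TEXT, target TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS units_stripped ON units (stripped_source)"
    )


def add_unit(conn, source, target):
    conn.execute(
        "INSERT INTO units (source, stripped_source, target) VALUES (?, ?, ?)",
        (source, strip_tags(source), target),
    )


def lookup(conn, query, tag_penalty=5):
    """Match on the stripped source; penalise rows whose full source differs."""
    rows = conn.execute(
        "SELECT source, target FROM units WHERE stripped_source = ?",
        (strip_tags(query),),
    )
    return [
        (target, 100 if source == query else 100 - tag_penalty)
        for source, target in rows
    ]
```

A query whose tags differ from the stored unit still finds it, but at quality 95 rather than 100 with the default penalty.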