Match based on query with tags/placeables removed #1967

transl8bzimport · 2011-07-01T23:22:00Z

Originally posted by Marce van Velden:

Consider the following source text:
This is a sample text with a link

If this is sent to the tmserver, I would like it to search for "This is a sample text with a link" (the source without tags) and apply a quality penalty if the tags don't match.
For example, the tmserver might have a match like:
This is a sample text with a link

This requires storing both the complete source text and the source text with tags removed in the TM database.

What do you think about this? I made a sample implementation of this for tmserver (sqlite) in the past, and it worked perfectly for us. One open question is which tags/placeables to filter out. Possibly we could make this configurable.
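
To make the proposal concrete, here is a minimal sketch of storing both versions and looking up on the stripped one. It is not the implementation mentioned above: the `units` table, its column names, the helper functions and the tag regex are all hypothetical, and the regex only recognises XML/HTML-style tags.

```python
import re
import sqlite3

# Hypothetical pattern: anything shaped like an XML/HTML tag counts as a
# placeable to strip. Which placeables to filter is still an open question.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(text):
    """Return text with tags removed and runs of whitespace collapsed."""
    return " ".join(TAG_RE.sub("", text).split())


def make_db():
    """Create an in-memory database with a hypothetical units table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE units (source TEXT, source_stripped TEXT, target TEXT)"
    )
    return conn


def add_unit(conn, source, target):
    """Store the complete source next to its tag-free version."""
    conn.execute(
        "INSERT INTO units VALUES (?, ?, ?)",
        (source, strip_tags(source), target),
    )


def lookup(conn, query):
    """Find units whose stripped source equals the stripped query."""
    rows = conn.execute(
        "SELECT source, target FROM units WHERE source_stripped = ?",
        (strip_tags(query),),
    )
    return rows.fetchall()
```

With this, a query containing a link tag would still find a stored unit whose source uses a different tag around the same words. The caller gets the complete stored source back and can compare tags to decide on a penalty.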

friedelwolff · 2011-07-02T00:27:22Z

I have similar ideas that I want to investigate, but haven't had time to realise any of them yet.

One possibility that someone mentioned is to filter out certain token types when indexing. We already filter out tokens with tokid=12. I don't know what the best reference is, but you can have a look here:

http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html
and in tmdb.py you can look for "tokid".

We will need to work out how we take this into account for the weighting. For a start we can still just use the normal Levenshtein distance as we do now, or consider weighted averages of different runs on the full text vs the reduced/stripped version, or maybe use the database rank to affect our own rank. I don't know if this makes sense, so feel free to discuss further :-)
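
As a rough illustration of the weighted-average option, the sketch below scores a candidate on both the full text and the stripped text and blends the two. The function names, the tag regex and the `stripped_weight` default of 0.7 are assumptions made for the example, not anything that exists in tmdb.py.

```python
import re

# Hypothetical pattern for what counts as a tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(text):
    """Return text with tags removed and runs of whitespace collapsed."""
    return " ".join(TAG_RE.sub("", text).split())


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to a 0-100 quality value."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def weighted_quality(query, candidate, stripped_weight=0.7):
    """Blend similarity on the stripped text with similarity on the full text."""
    full = similarity(query, candidate)
    reduced = similarity(strip_tags(query), strip_tags(candidate))
    return stripped_weight * reduced + (1 - stripped_weight) * full
```

A candidate that differs only in its tags then scores below 100 but above its plain full-text similarity, so the tag difference acts as a soft penalty whose size depends on the weight.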

transl8bzimport · 2011-07-04T19:06:04Z

Originally posted by Marce van Velden:

I think we should use the Levenshtein distance based on the stripped versions of source and target, and have a fixed but configurable penalty for tag/token/placeable mismatch if the stripped sources are equal but the tags/tokens/placeables do not match.
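
A small sketch of that scoring, under some assumptions: `TAG_PENALTY`, `match_quality` and the tag regex are hypothetical names, 5.0 is an arbitrary default, and for simplicity the penalty is applied whenever the tag sequences differ, which covers the case where the stripped sources are equal.

```python
import re

# Hypothetical pattern for what counts as a tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Hypothetical default; the idea is that this would come from configuration.
TAG_PENALTY = 5.0


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_quality(query, candidate, penalty=TAG_PENALTY):
    """Score on the stripped text, then subtract a fixed tag penalty."""
    q_tags = TAG_RE.findall(query)
    c_tags = TAG_RE.findall(candidate)
    q_text = " ".join(TAG_RE.sub("", query).split())
    c_text = " ".join(TAG_RE.sub("", candidate).split())
    longest = max(len(q_text), len(c_text))
    if longest == 0:
        score = 100.0
    else:
        score = 100.0 * (1 - levenshtein(q_text, c_text) / longest)
    if q_tags != c_tags:
        # Tags differ in content, count or order: apply the fixed penalty.
        score -= penalty
    return max(score, 0.0)
```

Comparing the tag lists in order is the strictest choice. Comparing them as sets, or only counting them, would be more lenient alternatives.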

friedelwolff · 2011-07-07T20:18:09Z

Yes, I agree. I think the issue will be mostly to figure out which ones we feel are the relevant ones. Doing a weighted average of Levenshtein distances on the full text and the reduced text is a way of doing the penalties, I guess.

For the implementation I guess we might want to either keep both the original and reduced version in the database, or simply rely on the database rank more. The value from the ranking function is not easy to work with directly (as in, its inherent meaning isn't obvious), but we mostly get it for free, as far as I know.

transl8bzimport assigned alaaosh Jul 27, 2014

unho unassigned alaaosh Aug 4, 2014
