Solr searches fail when query contains an unclosed " character
#10390
Comments
@cdrini I'd like to contribute to this issue! Would I be responsible for breaking the issue down into sub-parts? Thanks!
@tomrod10 The team met yesterday and I think we've done an okay job of breaking down the issue so that someone can start to address it. The current challenge is to investigate which of the three functions identified may not be sanitizing the unmatched quote syntax, then fix it and add a test so that such an issue doesn't regress in the future. Please give it a shot and thank you for your help!
Thank you @tomrod10! If you're open to an additional set of small changes, one more edge case we're seeing very frequently is
Sometimes the
Sounds good! I believe I have a working version that handles an unmatched '"'. I'm now reading the Lucene docs and looking into why the "*" symbol breaks the parser.
@mekarpeles According to the Lucene docs, we should not use the "*" or "?" wildcard symbols as the first character of a search term! This should be an easy fix. However, when I tried queries (openlibrary.org) where the "*" symbol is in between or after letters, I get mixed results:
Q: What purpose does a query with a search term preceded by a "*" serve?
Q: Also, what about search terms like
Note: I can't repro the parser breaking locally when I search the list of queries containing "*" you shared :/
Doc refs:
Curious to know your thoughts @cdrini!
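To make the wildcard question above concrete, here is a minimal Python sketch that flags search terms starting with "*" or "?". The helper name `terms_with_leading_wildcard` is hypothetical and is not taken from the Open Library codebase. It only splits on whitespace, so it ignores field prefixes and quoted phrases, and it does not model how Solr is actually configured to treat such terms.

```python
def terms_with_leading_wildcard(query: str) -> list[str]:
    """Return whitespace-separated terms that start with '*' or '?'.

    Hypothetical helper for illustration only: it ignores field prefixes
    (e.g. title:*foo) and quoted phrases.
    """
    # str.split() with no arguments never yields empty strings,
    # so indexing term[0] is safe.
    return [term for term in query.split() if term[0] in "*?"]


print(terms_with_leading_wildcard("*ction history"))  # leading wildcard
print(terms_with_leading_wildcard("hist*ry fict*"))   # wildcard inside/after letters
```

A check like this could help separate the queries that violate the Lucene guideline from those where the wildcard sits inside or after the letters.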
Per cdrini, I'll make my current PR focus only on fixing unmatched '"' in Solr queries. I'll work on debugging and fixing why '*' breaks the query parser in a separate issue.
After reading more about Lucene/Luqum and Solr, I realize that the initial approach of removing the lone double quote is not the best solution. It's unlikely that the user accidentally entered a double quote
For context, in Solr, a query using double quotes is treated as a term (single or phrase) that is matched against the indexed documents.
Therefore, if a query has an unmatched
Any thoughts @cdrini?
PR: #10405
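As one possible alternative to removing the lone quote, the sketch below escapes it with a backslash, which Lucene query syntax treats as a literal character. This is an illustration only, not necessarily what the PR does; `escape_unmatched_quote` is a hypothetical name, and the choice to escape the last unescaped quote is an assumption.

```python
def escape_unmatched_quote(query: str) -> str:
    """Escape the last unescaped double quote when the quotes are unbalanced.

    Hypothetical helper: assumes a backslash-escaped quote is acceptable
    to the downstream query parser.
    """
    positions = []
    i = 0
    while i < len(query):
        if query[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if query[i] == '"':
            positions.append(i)
        i += 1
    if len(positions) % 2 == 0:
        return query  # quotes are balanced, nothing to do
    last = positions[-1]
    return query[:last] + '\\"' + query[last + 1:]


print(escape_unmatched_quote('Compilation Group for the "History of Modern China'))
```

Escaping keeps the character the user typed in the query, whereas removal silently drops it.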
Problem
Solr queries fail with a 400 status code when the search term contains an unmatched double quote character. Examples of this can be found in our Sentry logs, e.g.
Compilation Group for the "History of Modern China
Files
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/worksearch/code.py
This is likely the block where the code needs to be investigated and fixed:
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/worksearch/code.py#L220-L254
Our guess is one of these three functions
Your goal is to add a test with the broken string
Compilation Group for the "History of Modern China
and see which one is not producing the right output.
Reproducing the bug
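The failing input can be pinned down in a test along these lines. This is a sketch under stated assumptions: `has_unmatched_double_quote` is a hypothetical stand-in, not one of the three functions in `code.py`, and a real test would call those functions directly and assert on their output.

```python
def has_unmatched_double_quote(query: str) -> bool:
    """Return True if the query has an odd number of unescaped double quotes.

    Hypothetical helper for illustration; not part of openlibrary.
    """
    count = 0
    i = 0
    while i < len(query):
        if query[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if query[i] == '"':
            count += 1
        i += 1
    return count % 2 == 1


def test_broken_string_is_detected():
    # The broken string quoted in this issue.
    broken = 'Compilation Group for the "History of Modern China'
    assert has_unmatched_double_quote(broken)
    # A properly closed phrase should not be flagged.
    assert not has_unmatched_double_quote('"History of Modern China"')
```

Run under pytest, a test shaped like this fails until the unmatched quote is handled, which also guards against the issue regressing later.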
Context
Breakdown
Requirements Checklist
Related files
Stakeholders
Instructions for Contributors