Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solr searches fail when query contains an unclosed " character #10390

Open
jimchamp opened this issue Jan 27, 2025 · 9 comments
Open

Solr searches fail when query contains an unclosed " character #10390

jimchamp opened this issue Jan 27, 2025 · 9 comments
Assignees
Labels
Good First Issue Easy issue. Good for newcomers. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Response Issues which require feedback from lead Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@jimchamp
Copy link
Collaborator

jimchamp commented Jan 27, 2025

Problem

Solr queries fail with a 400 status code when the search term contains a double quote character. Examples of this can be found in our Sentry logs, here.

e.g. Compilation Group for the "History of Modern China

Files

Image Image

https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/worksearch/code.py

This is likely the block where the code needs to be investigated and fixed:
https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/worksearch/code.py#L220-L254

Our guess is one of these three functions

scheme.process_user_query(param['q'])
scheme.build_q_from_params(param)
scheme.q_to_solr_params(q, solr_fields, params)

Your goal is to add a test with the broken string Compilation Group for the "History of Modern China and see which is not producing the right output

Reproducing the bug

  1. Go to ...
  2. Do ...
  • Expected behavior:
  • Actual behavior:

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@jimchamp jimchamp added Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Bug Something isn't working. [managed] labels Jan 27, 2025
@cdrini cdrini changed the title Solr searches fail when query contains a " character Solr searches fail when query contains an unclosed " character Jan 27, 2025
@mekarpeles mekarpeles added Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Good First Issue Easy issue. Good for newcomers. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] and removed Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Lead Priority: 3 Issues that we can consider at our leisure. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Jan 27, 2025
@tomrod10
Copy link

@cdrini I'd like to contribute to this issue! Would I be responsible for breaking the issue down into sub-parts? Thanks!

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 28, 2025
@mekarpeles
Copy link
Member

@tomrod10 The team met yesterday and I think we've done a okay job at breaking down the issue so that someone can start to address it. The current challenge is to investigate which of the three functions identified may not be sanitizing. The unmatched quote syntax and then both fix it and added test so that such a issue doesn't regress in the future. Please give it a shot and thank you for your help!

@mekarpeles mekarpeles removed the Needs: Response Issues which require feedback from lead label Jan 28, 2025
@tomrod10
Copy link

tomrod10 commented Jan 30, 2025

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 30, 2025
@mekarpeles
Copy link
Member

mekarpeles commented Jan 30, 2025

Thank you @tomrod10!

If you're open to an additional set of small changes, one more edge case we're seeing very frequently is * in queries. The following queries seem to break our parser (some of them are likely the result of autocomplete sending partial queries):

Lanny Ebenstein, *Milton Friedman: A Biography* (2
Henry Dodwell, *The Founder of Modern Egypt: A Stu
Hibbert, Christopher (1993), *Cavaliers
Jean-Pierre Martin, *Description lexicale du franç
Shirer, William (1960). *The Rise and Fall of the
Iverson, Kenneth E. (1962). *A Programming Languag
*Englishman's Greek Concordance of the New Testame
Preda, Alex: *Framing Finance: The Boundaries of M
*Johnson*, at 414
Ness, Erik, "The Cost of Skepticism," *Brown Alumn
Douglas *William the Conqueror* p. 51
Dwyer, Philip (2013). *Citizen Emperor: Napoleon i
Buss, D.M. (2002). "Human mating strategies". *Sam
Meredith, Martin (2005). *The Fate of Africa*. New

Sometimes the * is valid as a wildcard, e.g. for cases like... subject:*

@tomrod10
Copy link

Sounds good! I believe I have a working version handling unmatched '"'

Reading docs on Lucene and looking into "*" symbol breaking the parser.

@tomrod10
Copy link

tomrod10 commented Feb 2, 2025

@mekarpeles According to Lucene docs we should not use the "*" or "?" wildcard symbols as the first character! This should be an easy fix.

However, when I tried queries (openlibrary.org) where the "*" symbol is in between/after letters I get mixed results:

  • Searching for: Human m*ti works but Human m* or Human m*t does not.

Q: What purpose does a query with a search term preceded by a "*" accomplish?
The Lucene doc seems to point users to not use the "*" symbol unless it is preceded by a char or string.

  • Example: *Cavaliers Jean-Pierre Martin

Q: Also what about search terms like *William the Conqueror*, what is the use of two "*" here?

Note: I can't repro the parser breaking locally when I search the list of queries containing "*" you shared :/

Doc refs:
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
https://lucene.apache.org/core/9_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Wildcard_Searches

@tomrod10
Copy link

tomrod10 commented Feb 4, 2025

Curious to know your thoughts @cdrini!

@tomrod10
Copy link

tomrod10 commented Feb 5, 2025

Per cdrini, I'll make my current PR focus only on fixing unmatched '"' in solr queries. I'll work on debugging and fixing why '*' breaks the query parser in a separate issue.

@tomrod10
Copy link

tomrod10 commented Feb 6, 2025

After reading more about Lucene/Luqum and Solr, I realize that the initial approach of removing the lone double quote is not the best solution. It's unlikely that the user accidentally entered a double quote " for no reason. The user probably forgot to close it. Therefore, in cases where the count of double quotes " is odd, I'll close or match the lone quote to make the count even and ensure the Solr query is valid.

For context, in Solr, a query using double quotes will create indexed terms (Single or Phrase) that are mapped to the matching documents. Therefore, if a query has an unmatched " it will break.

Any thoughts @cdrini

PR: #10405

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good First Issue Easy issue. Good for newcomers. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Response Issues which require feedback from lead Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

No branches or pull requests

3 participants