Create Solr Query API #11

Open
hoodriverheather opened this issue Nov 6, 2024 · 20 comments

@hoodriverheather

@nutjob4life please create a new Solr Query API to allow users to query Solr metadata. For example: show me all EventIDs for LTP2 Site weRc6TUHvOru6A, or return all EventIDs by BlindedSiteID for LTP2.

@nutjob4life
Member

Hi @hoodriverheather

The Solr query API for EDRN LabCAS is available.

The URLs are:

Please use HTTP Basic Authentication with your EDRN username and password. For example, this curl command queries for all files that have an eventID of 8300386 and returns the collection name, organ, and eventID in JSON format:

curl --silent --user 'kelly:REDACTED' 'https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?fl=CollectionName,Organ,eventID&indent=on&q=eventID:8300386&wt=json'

The various APIs all accept Solr query parameters documented at https://solr.apache.org/guide/6_6/the-standard-query-parser.html
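
For developers who prefer a programming language to curl, the same request can be sketched in Python using only the standard library. This is a minimal illustration rather than an official client: the base URL and parameters are copied from the curl example above, while the function names `build_select_url` and `basic_auth_request` and the placeholder credentials are made up for this sketch.

```python
import base64
import urllib.request
from urllib.parse import urlencode

# Base URL copied from the curl example in this thread.
BASE_URL = "https://edrn-labcas.jpl.nasa.gov/data-access-api"


def build_select_url(core, **params):
    """Return the /select URL for a core such as 'collections', 'datasets', or 'files'."""
    # urlencode percent-encodes reserved characters (':' becomes %3A, ',' becomes %2C).
    return f"{BASE_URL}/{core}/select?{urlencode(params)}"


def basic_auth_request(url, username, password):
    """Build a request that carries an HTTP Basic Authentication header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


url = build_select_url(
    "files",
    q="eventID:8300386",
    fl="CollectionName,Organ,eventID",
    wt="json",
    indent="on",
)
# Placeholder credentials (hypothetical); substitute your EDRN username and password.
request = basic_auth_request(url, "your-username", "your-password")

# Sending the request needs network access and a valid EDRN account:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Letting `urlencode` do the escaping avoids hand-encoding characters such as quotation marks and spaces in the query string.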

More examples:

  • All details of all collections with SpecimenType of Serum in XML format: https://edrn-labcas.jpl.nasa.gov/data-access-api/collections/select?indent=on&q=SpecimenType:Serum&wt=xml
  • Top 10 LeadPI names of all datasets with the CollectionName of Lung Team Project 2 Images in JSON format: https://edrn-labcas.jpl.nasa.gov/data-access-api/datasets/select?fl=LeadPI&indent=on&q=CollectionName:%22Lung%20Team%20Project%202%20Images%22&wt=json
  • The ID, data custodian, and data custodian email of the top 100 files with City_of_Hope in their IDs in CSV format: https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?fl=id,DataCustodian,DataCustodianEmail&indent=on&q=id:*City_of_Hope*&rows=100&wt=csv

@hoodriverheather
Author

@nutjob4life This is cool! I got your example to work. Could you write a query for me that would return the eventIDs for a BlindedSiteID? I tried this:
curl --silent --user 'kincaid:YourPassword' 'https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?fl=eventID&indent=on&q=BlindedSiteID:"NVRiRYzqspbvMw"&wt=json'

It didn't work. I get this error:

<title>400 Unknown Reason</title>

Unknown Reason

Your browser sent a request that this server could not understand.

It would be even more helpful if it could write the output to a .csv file :)
Thanks!

@nutjob4life
Member

nutjob4life commented Nov 15, 2024

@hoodriverheather the issue is that there are quotation marks in your URL (above) which need to be encoded as %22.

Here's the URL that worked for me

https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?fl=eventID&indent=on&q=BlindedSiteID:NVRiRYzqspbvMw&wt=json
$ curl --user 'kelly:REDACTED' --silent 'https://edrn-labcas.jpl.nasa.gov/data-access-api/files/select?fl=eventID&indent=on&q=BlindedSiteID:NVRiRYzqspbvMw&wt=json'
{
  "response":{"numFound":114558,"start":0,"docs":[
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8291042"]},
      {
        "eventID":["8143000"]},
      {
        "eventID":["8143000"]}]
  }}
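
Note that the response above reports 114558 matches but lists only 10 documents (Solr returns 10 rows unless `rows` says otherwise), and the same eventID repeats because each document is a file. A rough sketch of how a program might page through results with the standard `start` and `rows` parameters and collapse duplicates follows; the helper names are hypothetical, and the sample data simply mirrors the ten documents shown above.

```python
def page_starts(num_found, rows):
    """Return the 'start' offsets needed to cover num_found results, rows at a time."""
    return range(0, num_found, rows)


def unique_event_ids(docs):
    """Collect distinct eventID values from Solr docs, keeping first-seen order."""
    seen = {}
    for doc in docs:
        # In the response above, each eventID value arrives as a list of strings.
        for event_id in doc.get("eventID", []):
            seen.setdefault(event_id, None)
    return list(seen)


# Sample shaped like the ten documents in the response above.
sample_docs = [{"eventID": ["8291042"]}] * 8 + [{"eventID": ["8143000"]}] * 2

print(unique_event_ids(sample_docs))      # ['8291042', '8143000']
print(len(page_starts(114558, 1000)))     # 115 requests at rows=1000
```

Each offset from `page_starts` would be sent as the `start` parameter of one request, with `rows` held constant.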

Here's a tip: I use ChatGPT to write the curl commands:

Write a Solr query for "curl" that will return all eventIDs for a query where the BlindedSiteID is NVRiRYzqspbvMw

Also, there's probably no need to quote NVRiRYzqspbvMw anyway as it doesn't contain spaces or special characters
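
The quoting and encoding rules discussed above can be sketched in Python. This is only an illustration under a deliberately conservative assumption: any value that is not purely letters, digits, and underscores gets wrapped in double quotes, and the whole expression is then percent-encoded so quotation marks become %22 and spaces become %20. The names `solr_term` and `encoded_q` are hypothetical, and the helper is not suitable for wildcard queries such as `id:*City_of_Hope*`, where the asterisks must stay unquoted.

```python
import re
from urllib.parse import quote


def solr_term(value):
    """Wrap value in double quotes unless it is only letters, digits, or underscores."""
    if re.fullmatch(r"\w+", value):
        return value
    # Escape backslashes and embedded quotes before wrapping the phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def encoded_q(field, value):
    """Return a percent-encoded field:value expression for use as the q parameter."""
    return quote(f"{field}:{solr_term(value)}", safe=":")


print(encoded_q("BlindedSiteID", "NVRiRYzqspbvMw"))
# BlindedSiteID:NVRiRYzqspbvMw
print(encoded_q("CollectionName", "Lung Team Project 2 Images"))
# CollectionName:%22Lung%20Team%20Project%202%20Images%22
```

The second result matches the encoded query used in the earlier dataset example for Lung Team Project 2 Images.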

@hoodriverheather
Author

@nutjob4life I still can't get this to work. :( Do you have a few minutes to help me tomorrow?

Also, can you give DMCC instructions on how to do this using our API? I don't think this is a user friendly way for them to get updates. Thanks!!

@nutjob4life
Member

nutjob4life commented Nov 19, 2024

@hoodriverheather sure, I can help on the 19th.

No, it's not user friendly. But it is developer friendly, and quite powerful—and APIs are meant for developers, not users 😉

Although a lot of developers are familiar with curl, some might prefer a programming-language client like pysolr for Python, or a tool like Postman, to construct queries. Does the DMCC have developers familiar with Python or another programming language? I think they know Postman. What would you recommend I gear a guide for?

EDIT: I searched email and saw that [email protected] was indeed using Postman, so I'll write up instructions specifically for that. You might find Postman easier to use, too, than curl.

@nutjob4life
Member

@hoodriverheather could you come up with some example queries that developers at the DMCC might like to perform? I have a few examples in this comment but those are just "toy" examples I came up with. We can include these in the document I'm writing.

You know better the kinds of questions they'd like to ask of the metadata 🎓

@hoodriverheather
Author

@nutjob4life I think the primary query would be the following:
Return all eventIDs by BlindedSiteID for CollectionName="Lung Team Project 2 Images"

Similar for PMRI data collection.

OR
For CollectionName="Combined Imaging and Blood Biomarkers for Breast Cancer Diagnosis", somehow return the list of image sets grouped by Training and Validation. I'm not sure what the image set is labeled as in Solr. The output would look like this:
Validation:

  • 2614
  • 2624
  • 2639
  • ...
    (see screenshot)
Screenshot 2024-11-19 at 10 48 57 AM

Do those make sense?

@nutjob4life
Member

@hoodriverheather that first query can be done and I will include it in the documentation. That second one, though, will require programming. That goes beyond the scope of the documentation I'm writing. (I trust a developer like [email protected] to handle it.)

I'll mention it, though, in an "Advanced Topics" section 😉

@nutjob4life
Member

@hoodriverheather I wrote some docs that should satisfy many developers

You're welcome to give it a try—or we can try it together. My schedule on 11-20 is open except from 7am to 8am.

@hoodriverheather
Author

@nutjob4life It would be great to get a quick walkthrough, or at least have you show me how to query so that I can get a list, per Collection, of eventIDs by BlindedSiteID. Even better would be to output that list to a CSV file or let me copy it into a spreadsheet. :)
Question: can I also run this on Dev?

@nutjob4life
Member

@hoodriverheather you can't get eventIDs by BlindedSiteID without programming; Solr doesn't support sorting by that field. You can get the output in CSV format with the wt parameter (wt=csv).
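
As a rough idea of what that programming might involve, here is a minimal Python sketch that groups eventIDs by BlindedSiteID on the client side and renders the result as CSV. It is not the report generator used later in this thread, which is not shown. It assumes each document was fetched with `fl=BlindedSiteID,eventID`, and that BlindedSiteID may arrive either as a string or as a list (an assumption). The `EXAMPLE_SITE` row is invented purely for illustration.

```python
import csv
import io
from collections import defaultdict


def as_list(value):
    """Normalize a Solr field value (missing, single value, or list) to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def events_by_site(docs):
    """Map each BlindedSiteID to the set of eventIDs found in docs."""
    grouped = defaultdict(set)
    for doc in docs:
        for site in as_list(doc.get("BlindedSiteID")):
            grouped[site].update(as_list(doc.get("eventID")))
    return grouped


def to_csv(grouped):
    """Render the grouping as CSV text, one (site, eventID) pair per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["BlindedSiteID", "eventID"])
    for site in sorted(grouped):
        for event_id in sorted(grouped[site]):
            writer.writerow([site, event_id])
    return buffer.getvalue()


# The first three docs reuse values seen earlier in this thread; the last is made up.
sample_docs = [
    {"BlindedSiteID": ["NVRiRYzqspbvMw"], "eventID": ["8291042"]},
    {"BlindedSiteID": ["NVRiRYzqspbvMw"], "eventID": ["8291042"]},
    {"BlindedSiteID": ["NVRiRYzqspbvMw"], "eventID": ["8143000"]},
    {"BlindedSiteID": "EXAMPLE_SITE", "eventID": ["0000001"]},
]
print(to_csv(events_by_site(sample_docs)))
```

Using a set per site removes the duplicates that appear when many files share one eventID.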

As for dev: yes, it is supported there.

Got time for a quick call?

@hoodriverheather
Author

@nutjob4life yes, I have time for a call. I'll send you a Teams meeting.

@nutjob4life
Member

@hoodriverheather here's that first report we discussed:

events-by-blinds.csv

@hoodriverheather
Author

@nutjob4life This is awesome! It will help both me and Jackie! Would it be possible to also add the CollectionID?

@nutjob4life
Member

@hoodriverheather I gotta stop reading email at night; let me restart VPN and modify the report generator 😁

@nutjob4life
Member

@hoodriverheather here you go!

events-by-blinds.csv

@hoodriverheather
Author

@nutjob4life Thanks! I’ll pass this along to Jackie.

Could you help clarify the plan for providing access to the DMCC programmer? I’m sorry—I know you’ve explained this before, but I don’t remember enough details to draft an email about it.

From what I recall, the DMCC programmer will be able to run a Solr query now. Is there any documentation available for this? Additionally, is there API access or another plan for providing broader access?

If so, we could close this issue and create a new one to track any future updates.

Thanks again for your help!

@nutjob4life
Member

nutjob4life commented Dec 6, 2024

@hoodriverheather did we conclude that the Solr API—despite being an industry standard—was too advanced for the DMCC to handle? I thought we talked about creating specific APIs to handle specific questions the DMCC would like to pose rather than use the generic Solr API.

Should we talk about this on the 12-10 staff meeting?

Regardless, yes I did write documentation for the Solr API … and you're welcome! 😇

@hoodriverheather
Author

@nutjob4life That documentation looks great. We can discuss on tomorrow's call.

@nutjob4life
Member

In our staff meeting on 2024-12-10, @dcrichto1 recommended we develop some example programs that use the LabCAS Solr API that can serve as additional supporting material for the existing documentation.

@hoodriverheather I've completed these example programs.

The documentation page has been updated to refer to these example programs.

Please review. You can try running the example programs; however, they're meant for programmers, so you can skip that 😇
