GEO.ca Semantic search engine powered by Amazon SageMaker and Amazon OpenSearch

The GeoDiscovery team, part of the Canada Centre for Mapping and Earth Observations (CCMEO), has developed an advanced semantic search engine to enhance the relevance and accuracy of geospatial dataset searches on GEO.ca, CCMEO's public dissemination platform. By integrating natural language processing (NLP) and machine learning (ML) techniques, this search engine addresses the limitations of traditional keyword-based search, which relies on exact word matches and fails to interpret context and synonyms. In contrast, semantic search enables the processing of natural language queries, understands complex context and account for synonyms, delivering more relevant and accurate search results.

Our approach uses Sentence-Transformer (SBERT) on Hugging Face for vector embeddings. We fine-tuned three widely used SBERT models: all-MiniLM-L6-v2, all-mpnet-base-v2 and paraphrase-multilingual-MiniLM-L12-v2 using geospatial metadata. To evaluate the performance of these models, we created a benchmark dataset of curated queries and responses informed by domain expertise. Evaluation results showed that all-mpnet-base-v2 outperformed the others with a Mean Reciprocal Rank at 5 (MRR@5) score of 0.33 and Accuracy at 5 score (ACC@5) of 0.43. These scores indicate the model performs reasonably well by recommending relevant results within the top 5, with 43% of the queries returning at least one relevant result in this range. Additionally, we created a semantic search demo site to compare the semantic search engine and existing keyword search results. An internal survey indicated that the semantic search engine greatly outperformed the keyword search, improving search accuracy and enhancing the overall user experiences.

The deployment architecture for this solution is shown in Figure 1. We used Shared Services Canada’s High Performance Computing (HPC) resources to fine-tune the NLP model and selected the best-performing model for deployment on Amazon Web Services (AWS). Key components include the use of AWS SageMaker for real-time model inference, AWS OpenSearch for indexing spatial and non-spatial data alongside its web application interface capabilities, and AWS Lambda to facilitate communication between SageMaker and OpenSearch. To streamline deployment and maintain consistency, the entire infrastructure was implemented using AWS CloudFormation, enabling infrastructure-as-code practices.

Figure 1. Deployment architecture of the geospatial semantic search engine

Model Fine-Tuning

We fine-tuned the Sentence-Transformer models to improve search relevancy on geospatial metadata search. Details on model fine-tuning are available in the model evaluation repository: semantic-search-model-evaluation.

CloudFormation Deployment

Deployment details are availiable in the geocore repository: geocore-semantic-search-with-opensearch.yml

Semantic Search Architecture

The Semantic Search Architecture is a serverless setup consisting of the following steps:

Finetuning the chosen sentence-transformer models on HPC.
Saving the finetuned models in an AWS S3 bucket.
Loading the model to SageMaker and hosting a SageMaker Endpoint using the finetuned model.
Creating a Vector Index in the Amazon OpenSearch domain, embedding the text dataset (in this case, geospatial metadata) into vectors using the finetuned models and loading the vectors into the Vector Index using a Lambda function.
Creating a Lambda function to call SageMaker Endpoints to generate embeddings from user search queries, performing K-Nearest Neighbors (KNN) search on the OpenSearch Vector Index and sending the query results back to the API Gateway.
The API Gateway sends the search results to the frontend and returns search results to the users.

API Endpoints

https://search-recherche.geocore-stage.api.geo.ca/search-opensearch?

GET Request

Retrieve viewer configuration data:

GET /search-opensearch?method=SemanticSearch&q=

Filter parameters

Search filters can be found below. For the most recent filter list, please refer to the CloudFormation template on the prod branch: geocore-semantic-search-with-opensearch.yml

{
    "method": "$input.params('method')",                          #mandatory: either SemanticSearch or KeywordSearch
    "q": "$input.params('q')",                                    #all other parameters are optional
    "bbox": "$input.params('bbox')",                              #comma seperated bounding box: west, south, east, north. Example: -120.0, 49.0, -110.0, 60.0
    "relation": "$input.params('relation')",                      #spatial filter relationship: instersect (default), disjoint, contains, within
    "begin": "$input.params('begin')",                            #beginning date filter
    "end": "$input.params('end')",                                #end date filter
    "org": "$input.params('org')",                                #organisation who published the dataset
    "type": "$input.params('type')",                              #dataset 'type' (api, dataset, etc.)
    "theme": "$input.params('theme')",                            #ISO 19139 theme
    "foundational": "$input.params('foundational')",
    "source_system": "$input.params('source_system')",            #defaults to all systems
    "eo_collection": "$input.params('eo_collection')",            #filter for earth observation datasets
    "polarization": "$input.params('polarization')",              #filter for earth observation datasets
    "orbit_direction": "$input.params('orbit_direction')",        #filter for earth observation datasets
    "lang": "$input.params('lang')",                              #english or french
    "sort": "$input.params('sort')",                              #sort by relevance, date, title
    "order": "$input.params('order')",                            #sort order - defaults to desc unless sort by title 
    "size": "$input.params('size')",                              #maximum results returned
    "from": "$input.params('from')"                               #used for pagination
}

Example

curl -X GET "https://search-recherche.geocore.api.geo.ca/search-opensearch?method=SemanticSearch&q=wildfire"

Response

{
  "method": "SemanticSearch",
  "response": {
    "total_hits": 320,
    "returned_hits": 10,
    "items": [
      {
        "type": "FeatureCollection",
        "features": [
          {
            "type": "Feature",
            "geometry": {
              "type": "Polygon",
              "coordinates": [
                [
                  [
                    -141,
                    60
                  ],
                  [
                    -123.8,
                    60
                  ],
                  [
                    -123.8,
                    69.7
                  ],
                  [
                    -141,
                    69.7
                  ],
                  [
                    -141,
                    60
                  ]
                ]
              ]
            },
            "properties": {
              "row_num": 1,
              "relevancy": 0.6723405,
              "id": "c4a1037c-cf9e-491a-2e3f-70cf13ee29a2",
              "title": "Wildfire Information",
              "description": "The Wildland Fire Management branch of Yukon government distributes current fire information in this web mapping application. It includes active and extinguished fires, and danger ratings at weather stations."
            }
          }
        ]
      }
    ]
  }
}

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
deployment		deployment
image		image
model		model
src		src
.gitattributes		.gitattributes
Demo1-Keyword-search.ipynb		Demo1-Keyword-search.ipynb
Demo2-Semantic-search-with-pretain-model.ipynb		Demo2-Semantic-search-with-pretain-model.ipynb
Demo3-Semantic-search-with-pretrain-model-fullstack.ipynb		Demo3-Semantic-search-with-pretrain-model-fullstack.ipynb
Demo4-Deploy-semantic-search-with-finetuned-model.ipynb		Demo4-Deploy-semantic-search-with-finetuned-model.ipynb
README.md		README.md
geocore-semantic-search-with-opensearch.yml		geocore-semantic-search-with-opensearch.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GEO.ca Semantic search engine powered by Amazon SageMaker and Amazon OpenSearch

Model Fine-Tuning

CloudFormation Deployment

Semantic Search Architecture

API Endpoints

GET Request

Filter parameters

Example

Response

About

Releases 2

Packages

Contributors 2

Languages

Canadian-Geospatial-Platform/semantic-search-with-amazon-opensearch

Folders and files

Latest commit

History

Repository files navigation

GEO.ca Semantic search engine powered by Amazon SageMaker and Amazon OpenSearch

Model Fine-Tuning

CloudFormation Deployment

Semantic Search Architecture

API Endpoints

GET Request

Filter parameters

Example

Response

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages