feat(graph): Add Custom Retrievers for Spanner Graph RAG. #122

Merged on Jan 29, 2025 (44 commits)

Collaborator

@amullick-git commented Dec 12, 2024

SpannerGraphTextToGQLRetriever
SpannerGraphVectorContextRetriever

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@amullick-git requested review from a team as code owners December 12, 2024 20:37
@product-auto-label bot added the "api: spanner" label (Issues related to the googleapis/langchain-google-spanner-python API) Dec 12, 2024
@amullick-git requested a review from mtyin December 12, 2024 20:37
@amullick-git changed the title from "feat(graph) - Add Custom Retrievers for Spanner Graph RAG." to "feat(graph): Add Custom Retrievers for Spanner Graph RAG." Dec 16, 2024
Collaborator

@mtyin left a comment

Thanks Amarnath! First batch of comments.

src/langchain_google_spanner/graph_retriever.py: 6 outdated, resolved threads
return element


class SpannerGraphGQLRetriever(BaseRetriever):
Collaborator

Can you educate me on how this differs from the QA chain?

It seems to be doing very similar things.

Collaborator Author

The QA chain talks to the LLM as the last step and produces an answer.
This one is a standalone retriever that gets the context only.

Collaborator

Optional: maybe we should change the QA chain to use this retriever internally?
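
A minimal sketch of the distinction drawn in this thread, and of the optional suggestion to build the QA step on top of a retriever. Everything here is hypothetical: `Retriever`, `LLM`, and `answer_with_context` are plain-Python stand-ins for illustration, not the LangChain or langchain-google-spanner APIs.

```python
from typing import Callable

# Hypothetical stand-ins: a retriever maps a question to context strings,
# and an LLM maps a prompt to an answer.
Retriever = Callable[[str], list[str]]
LLM = Callable[[str], str]


def answer_with_context(question: str, retrieve: Retriever, llm: LLM) -> str:
    """QA-chain-style flow: retrieve the context, then ask the LLM as the last step."""
    context = "\n".join(retrieve(question))  # the retriever stops here
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)  # only the QA flow talks to the LLM
```

In this shape the retriever stays usable on its own, and the QA flow is a thin wrapper around it.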

return documents


class SpannerGraphSemanticGQLRetriever(BaseRetriever):
Collaborator

It looks to me like SpannerGraphSemanticGQLRetriever is a more general case of SpannerGraphGQLRetriever.

Can we combine these two, so that SpannerGraphSemanticGQLRetriever also accepts no examples and, in that case, falls back to the SpannerGraphGQLRetriever behavior?

Collaborator Author

Yes, that's a reasonable observation.
At this time, I'd prefer not to overload the retrievers, so that the different retrieval techniques that can be used stay clearly separated.

)
elif self.return_properties_list:
return_properties = ",".join(
map(lambda x: node_variable + "." + x, self.return_properties_list)
Collaborator

You need to use backticks to quote all identifiers; otherwise your query could fail when an identifier is a reserved keyword.

For example, let's say order is a property of your node.

"n.order" will fail your query because order is a reserved keyword;
you have to write "n.`order`".

Collaborator Author

Fixed.
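
The quoting discussed in this thread can be sketched as follows. This is an illustration, not the code merged in the PR: `quote_identifier` and `build_return_properties` are hypothetical names, and rejecting names that contain a backtick is a simplifying assumption made to avoid relying on escape rules.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks so reserved keywords such as `order` are safe."""
    if not name or "`" in name:
        # Simplifying assumption: refuse empty names and embedded backticks.
        raise ValueError(f"Unsupported identifier: {name!r}")
    return f"`{name}`"


def build_return_properties(node_variable: str, properties: list[str]) -> str:
    """Build a comma-separated list such as n.`order`, n.`name`."""
    return ", ".join(f"{node_variable}.{quote_identifier(p)}" for p in properties)
```

For example, `build_return_properties("n", ["order", "name"])` yields the string n.`order`, n.`name`.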

documents = []
if self.expand_by_hops >= 0:
for response in responses:
elements = json.loads((response["path"]).serialize())
Collaborator

qq: what does this serialize do?

Collaborator Author

response["path"] is a google.cloud.spanner_v1.data_types.JsonObject. Its serialize method converts the object to a JSON-formatted string. https://github.com/googleapis/python-spanner/blob/7acf6dd8cc854a4792782335ac2b384d22910520/google/cloud/spanner_v1/data_types.py#L82
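
To make the round trip concrete without needing a Spanner connection, here is a small sketch. `JsonObjectStandIn` is a hypothetical substitute for `JsonObject` that only mimics a `serialize()` method returning a JSON string; `path_elements` and the sample element shape are also illustrative assumptions.

```python
import json
from typing import Any


class JsonObjectStandIn:
    """Hypothetical stand-in for google.cloud.spanner_v1.data_types.JsonObject."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def serialize(self) -> str:
        # Return the wrapped value as a JSON-formatted string.
        return json.dumps(self._value, sort_keys=True, separators=(",", ":"))


def path_elements(response: dict[str, Any]) -> Any:
    """Mirror the reviewed line: serialize to a JSON string, then parse to plain Python."""
    return json.loads(response["path"].serialize())
```

The result is made of ordinary lists and dicts, which is what the rest of the retriever code can iterate over.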

graph_name=graph_name,
node_var=node_variable,
label_expr=self.label_expr,
embeddings_column=self.embeddings_column,
Collaborator

Technically, graph_name should also be wrapped in backticks in case the graph has an unusual name.

For the others, if the value is user input, maybe you can rely on the user to backtick it explicitly.

Collaborator Author

Added backticks to the call that gets the graph name from the schema.
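
A sketch of what wrapping the graph name at lookup time might look like. It assumes the schema is a JSON string with a "Name of graph" key, as in the `json.loads(schema)["Name of graph"]` context line shown elsewhere in this review; the function name `get_quoted_graph_name` is hypothetical, and the merged code may differ.

```python
import json


def get_quoted_graph_name(schema: str) -> str:
    """Read the graph name from the schema JSON and wrap it in backticks."""
    graph_name = json.loads(schema)["Name of graph"]
    return f"`{graph_name}`"
```

Quoting once at the point where the name is read means every query template that uses it gets the quoted form.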

@@ -150,7 +150,7 @@
 "source": [
 "# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n",
 "\n",
-"PROJECT_ID = \"google.com:cloud-spanner-demo\" # @param {type:\"string\"}\n",
+"PROJECT_ID = \"\" # @param {type:\"string\"}\n",
Collaborator

nit: I like to use "my-project-id" instead of an empty value.

Collaborator Author

Done

@@ -28,6 +27,7 @@

from langchain_google_spanner.graph_store import SpannerGraphStore

from .graph_utils import extract_gql, fix_gql_syntax
Collaborator

We will have to note this as a breaking change. I can add that manually to the changelog when releasing.

Collaborator Author

Ack. Thank you for pointing this out.

return json.loads(schema)["Name of graph"]


def duplicate_braces_in_string(text):
Collaborator

All methods should have type hints on args and return types

Collaborator Author

done
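
For illustration, a type-hinted version of the helper might look like the sketch below. The body is an assumption inferred from the function name (doubling braces so the text survives `str.format`-style templating); the review only shows the signature and the final `return text`.

```python
def duplicate_braces_in_string(text: str) -> str:
    """Double every brace so the text can be embedded in a format-style template."""
    text = text.replace("{", "{{")
    text = text.replace("}", "}}")
    return text
```

With this behavior, calling `.format()` on the result gives back the original text.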

return text


def clean_element(element, embedding_column):
Collaborator

You may wish to reduce publicly exposed methods by making them private to this class

Collaborator Author

done

@amullick-git
Collaborator Author

/gcbrun

**kwargs,
)

@staticmethod
Collaborator

If this isn't needed outside this class, I would keep it as a method on the class and make it private with a double-underscore (dunder) prefix: "def __duplicate_braces_in_string(self, ...)". That way, the call SpannerGraphTextToGQLRetriever._duplicate_braces_in_string can simply become self.__duplicate_braces_in_string.

Collaborator Author

Done
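
A short sketch of how the double-underscore prefix behaves in Python. The class `TextToGQLRetrieverSketch` and its `build_prompt` method are hypothetical and exist only to show name mangling; the helper body is the same assumed brace-doubling logic.

```python
class TextToGQLRetrieverSketch:
    """Hypothetical class used only to illustrate Python name mangling."""

    def __duplicate_braces_in_string(self, text: str) -> str:
        # Inside the class body, Python rewrites this name to
        # _TextToGQLRetrieverSketch__duplicate_braces_in_string.
        return text.replace("{", "{{").replace("}", "}}")

    def build_prompt(self, schema: str) -> str:
        # Calls within the class can use the short, unmangled spelling.
        return "Schema: " + self.__duplicate_braces_in_string(schema)
```

Outside the class, the helper is reachable only under its mangled name, which discourages external callers without making access impossible.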


@classmethod
def from_params(
cls, embedding_service: Optional[Embeddings] = None, **kwargs: Any
Collaborator

Why is the embedding_service an optional parameter if you are checking if it's not None?

Collaborator Author

Good catch, fixed!
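
One way the signature could be tightened, shown as a sketch only; the thread does not say how the fix was made. `EmbeddingsLike` is a hypothetical protocol standing in for the LangChain `Embeddings` type, and `VectorContextRetrieverSketch` is an illustrative class, not the retriever from this PR.

```python
from typing import Any, Protocol


class EmbeddingsLike(Protocol):
    """Hypothetical stand-in for an embeddings interface."""

    def embed_query(self, text: str) -> list[float]: ...


class VectorContextRetrieverSketch:
    def __init__(self, embedding_service: EmbeddingsLike, **kwargs: Any) -> None:
        self.embedding_service = embedding_service
        self.options = kwargs

    @classmethod
    def from_params(
        cls, embedding_service: EmbeddingsLike, **kwargs: Any
    ) -> "VectorContextRetrieverSketch":
        # The parameter is required by the signature; the runtime check only
        # guards against callers that pass None explicitly.
        if embedding_service is None:
            raise ValueError("`embedding_service` cannot be None")
        return cls(embedding_service=embedding_service, **kwargs)
```

Making the parameter required moves the error from a runtime check to the call site, where type checkers can flag it.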

)

@staticmethod
def _clean_element(element: dict[str, Any], embedding_column: str) -> None:
Collaborator

Same comment here, and for the distance method below, about static vs. class methods and reducing the exposed API by making it private.

Collaborator Author

Done

"Either `return_properties` or `expand_by_hops` must be provided."
)

print(gql_query)
Collaborator

Remove debugging?

Collaborator Author

Done

}
)
)
print(gql_query)
Collaborator

remove debugging print?

Collaborator Author

Done
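
The threads above only ask for the debugging prints to be removed. If the generated query is still useful when troubleshooting, one common alternative (not something this PR is confirmed to do) is to log it at debug level. The logger name and the `log_query` helper below are hypothetical.

```python
import logging

# Hypothetical logger name; a real module would typically use __name__.
logger = logging.getLogger("graph_retriever_sketch")


def log_query(gql_query: str) -> None:
    """Emit the generated query at DEBUG level instead of printing it."""
    logger.debug("Generated GQL query: %s", gql_query)
```

Debug messages stay silent under the default logging configuration and can be enabled per logger when needed.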

@averikitsch merged commit bf2903a into googleapis:main Jan 29, 2025
10 checks passed
Labels
api: spanner Issues related to the googleapis/langchain-google-spanner-python API.

4 participants