-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(graph): Add Custom Retrievers for Spanner Graph RAG. #122
Conversation
…ollision for tests running parallely from different python environments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Amarnath! First batch of comments.
return element | ||
|
||
|
||
class SpannerGraphGQLRetriever(BaseRetriever): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you educate me how much difference this vs the QAChain?
It seems to be doing very similar things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QA chain talks to the LLM as the last step and produces an answer.
This one is a standalone retriever that gets the context only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: may be we should change the QA chain to use this internally?
return documents | ||
|
||
|
||
class SpannerGraphSemanticGQLRetriever(BaseRetriever): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like to me that SpannerGraphSemanticGQLRetriever is a generic case of SpannerGraphGQLRetriever ?
Can we combine these two so that:
SpannerGraphSemanticGQLRetriever allows no example, in that case, it falls back to SpannerGraphGQLRetriever behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's a reasonable observation.
at this time, I'd prefer not to overload the retrievers here to keep it clear the different types of retrieval techniques that could be used.
) | ||
elif self.return_properties_list: | ||
return_properties = ",".join( | ||
map(lambda x: node_variable + "." + x, self.return_properties_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to use bacticks to quote all identifiers otherwise your query could fail when the identifier is a reserved keyword.
For example, let's say order
is a property of your node.
"n.order" will fail your query because order
is a reserved keyword,
you have to do "n.order
".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
documents = [] | ||
if self.expand_by_hops >= 0: | ||
for response in responses: | ||
elements = json.loads((response["path"]).serialize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: what does this serialize do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
response["path"] is a google.cloud.spanner_v1.data_types.JsonObject. The serialize internally convertes this to a json formatted string. https://github.com/googleapis/python-spanner/blob/7acf6dd8cc854a4792782335ac2b384d22910520/google/cloud/spanner_v1/data_types.py#L82
return element | ||
|
||
|
||
class SpannerGraphGQLRetriever(BaseRetriever): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: may be we should change the QA chain to use this internally?
graph_name=graph_name, | ||
node_var=node_variable, | ||
label_expr=self.label_expr, | ||
embeddings_column=self.embeddings_column, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically graph_name should also be back-tick wrapped in case you got a strange name.
For others, if it's a user input, maybe you can rely on the user to explicitly backtick them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to the call which gets the graph name from schema.
docs/graph_qa_chain.ipynb
Outdated
@@ -150,7 +150,7 @@ | |||
"source": [ | |||
"# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n", | |||
"\n", | |||
"PROJECT_ID = \"google.com:cloud-spanner-demo\" # @param {type:\"string\"}\n", | |||
"PROJECT_ID = \"\" # @param {type:\"string\"}\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I like to use "my-project-id" instead of a empty value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -28,6 +27,7 @@ | |||
|
|||
from langchain_google_spanner.graph_store import SpannerGraphStore | |||
|
|||
from .graph_utils import extract_gql, fix_gql_syntax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will have to note this as a breaking change. I can add that manually to the change log when releasing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Thank you for point this out.
return json.loads(schema)["Name of graph"] | ||
|
||
|
||
def duplicate_braces_in_string(text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All methods should have type hints on args and return types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return text | ||
|
||
|
||
def clean_element(element, embedding_column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may wish to reduce publicly exposed methods by making them private to this class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/gcbrun |
**kwargs, | ||
) | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this isn't needed outside this class I would keep it as a class method and make it private by using the dunder "def __duplicate_braces_in_string(self, ...)". That means this code SpannerGraphTextToGQLRetriever._duplicate_braces_in_string
can just be self.__duplicate_braces_in_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
@classmethod | ||
def from_params( | ||
cls, embedding_service: Optional[Embeddings] = None, **kwargs: Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the embedding_service an optional parameter if you are checking if it's not None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, fixed!
) | ||
|
||
@staticmethod | ||
def _clean_element(element: dict[str, Any], embedding_column: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here and for the distance method below on about static vs class method and reducing the API exposed by making it private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"Either `return_properties` or `expand_by_hops` must be provided." | ||
) | ||
|
||
print(gql_query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
) | ||
) | ||
print(gql_query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove debugging print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
SpannerGraphTextToGQLRetriever
SpannerGraphVectorContextRetriever
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕