Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

Open
nielm opened this issue Oct 9, 2024 · 1 comment
Assignees
Labels
api: spanner Issues related to the googleapis/langchain-google-spanner-python API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@nielm
Copy link

nielm commented Oct 9, 2024

The code in main/src/langchain_google_spanner/loader.py#L230 is flaky:

Specifically:

    for partition in partitions:
      r = snapshot.process_query_batch(partition)
      results = r.to_dict_list()
      if len(results) == 0:
        break
      # ... process rows in results

That break should be a continue so that the remaining partitions are read.

Reason being that this is a partitioned query (see https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) which requires you to read all the returned partitions.

Some of these partitions may be empty, some may have data, it depends on how much data is in your tables, the query, and some Spanner internals.

This code aborts iterating through the partitions as soon as it finds the first empty partition

An additional (minor) error in this code is that the database snapshot is never closed, which is a resource leak.

Another is that this code will fail if the caller takes too long (more than 1hr) iterating over the returned values, due to some of the partitions not having been read and timing out...

Note that this partitioned query API is not that useful in this use case. It is meant to be used in a multi-threaded or multi-machine/multi-process environment (eg Dataflow/Flume) where partitions of data are processed in parallel. The code here simply processes the partitions in series (aborting if one is empty).

There is a much simpler way of querying the database, which simply returns a single ResultSet: https://cloud.google.com/python/docs/reference/spanner/latest/snapshot-usage#execute-a-sql-select-statement (see the top of the page for using a snapshot with a fixed staleness timestamp)

I recommend removing all partitioned queries from the spanner loader code, and use simple queries.

@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/langchain-google-spanner-python API. label Oct 9, 2024
@aperepel
Copy link
Contributor

aperepel commented Oct 9, 2024

I confirmed this 1-liner fixes all the flaky load issues we saw. Next step would be to explore a simple API.

@averikitsch averikitsch added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/langchain-google-spanner-python API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants