Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

nielm · 2024-10-09T20:09:34Z

The code in main/src/langchain_google_spanner/loader.py#L230 is flaky:

Specifically:

    for partition in partitions:
      r = snapshot.process_query_batch(partition)
      results = r.to_dict_list()
      if len(results) == 0:
        break
      # ... process rows in results

That break should be a continue so that the remaining partitions are read.

Reason being that this is a partitioned query (see https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) which requires you to read all the returned partitions.

Some of these partitions may be empty, some may have data, it depends on how much data is in your tables, the query, and some Spanner internals.

This code aborts iterating through the partitions as soon as it finds the first empty partition

An additional (minor) error in this code is that the database snapshot is never closed, which is a resource leak.

Another is that this code will fail if the caller takes too long (more than 1hr) iterating over the returned values, due to some of the partitions not having been read and timing out...

Note that this partitioned query API is not that useful in this use case. It is meant to be used in a multi-threaded or multi-machine/multi-process environment (eg Dataflow/Flume) where partitions of data are processed in parallel. The code here simply processes the partitions in series (aborting if one is empty).

There is a much simpler way of querying the database, which simply returns a single ResultSet: https://cloud.google.com/python/docs/reference/spanner/latest/snapshot-usage#execute-a-sql-select-statement (see the top of the page for using a snapshot with a fixed staleness timestamp)

I recommend removing all partitioned queries from the spanner loader code, and use simple queries.

The text was updated successfully, but these errors were encountered:

aperepel · 2024-10-09T23:44:28Z

I confirmed this 1-liner fixes all the flaky load issues we saw. Next step would be to explore a simple API.

product-auto-label bot added the api: spanner Issues related to the googleapis/langchain-google-spanner-python API. label Oct 9, 2024

blunderbuss-gcf bot assigned rahul2393 Oct 9, 2024

averikitsch added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

nielm commented Oct 9, 2024

aperepel commented Oct 9, 2024

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

Comments

nielm commented Oct 9, 2024

aperepel commented Oct 9, 2024