[FIX] spreadsheet: batch process `spreadsheet_revision.commands` #284

vval-odoo · 2025-06-11T08:48:06Z

[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have spreadsheet_revision records with over 10 million characters in commands. If the number of records is high, this leads to memory errors here. We distribute them in buckets of at most memory_cap characters, and use a named cursor to process them bucket by bucket. A command larger than memory_cap gets a bucket of its own.

select count(*),SUM(CHAR_LENGTH(commands)) as characters from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%'])
+-------+------------+
| count | characters |
|-------+------------|
| 2317  | 3419017646 |
+-------+------------+

select id,CHAR_LENGTH(commands) from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%']) order by CHAR_LENGTH(commands) desc limit 10
+------+-------------+
| id   | char_length |
|------+-------------|
| 3871 | 13197157    |
| 4788 | 13197157    |
| 3290 | 13197157    |
| 3557 | 13197157    |
| 4179 | 13197157    |
| 4481 | 13197157    |
| 2757 | 13197157    |
| 3022 | 13197157    |
| 2492 | 13197157    |
| 5097 | 13197157    |
+------+-------------+

Fixes upg-2899961

robodoo · 2025-06-11T08:48:08Z

Pirols

Good work! :)

src/util/spreadsheet/misc.py

Pirols · 2025-06-11T09:20:33Z

src/util/spreadsheet/misc.py



 def iter_commands(cr, like_all=(), like_any=()):
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
-    cr.execute(
+    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)


Using a context manager you do not need to close it explicitly¹.

Suggested change

-    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
+    with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:

That said, this is just in the name of a more pythonic implementation. IOW: imo you can keep your current version, if you like it better.

Footnotes

¹ https://github.com/odoo/odoo//blob/bdc28e53b7c426640e30b900064442f19c946d9e/odoo/sql_db.py#L221-L238
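
To illustrate the point about context managers, here is a minimal sketch. `FakeCursor` and the `named_cursor` function below are hypothetical stand-ins written for this example, not the real `pg.named_cursor`; they only model the open/close lifecycle and show that leaving the `with` block closes the cursor, including when the body raises.

```python
from contextlib import contextmanager


class FakeCursor:
    """Hypothetical stand-in for a server-side cursor."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def named_cursor():
    # Assumption: mimics only the open/close lifecycle of a named cursor.
    ncr = FakeCursor()
    try:
        yield ncr
    finally:
        ncr.close()  # runs on normal exit and on exceptions


with named_cursor() as ncr:
    assert not ncr.closed  # still open inside the block

closed_after_block = ncr.closed

try:
    with named_cursor() as failing:
        raise RuntimeError("boom")
except RuntimeError:
    pass

closed_after_error = failing.closed
```

Without the `with` block, the same guarantee needs an explicit `try`/`finally` around every use of the cursor.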

KangOl · 2025-06-11T13:08:09Z

upgradeci retry with always only *spreadsheet*

src/util/spreadsheet/misc.py

Pirols · 2025-06-25T07:50:42Z

Another affected request: https://upgrade.odoo.com/odoo/action-150/2988031

Pirols · 2025-06-25T13:21:13Z

src/util/spreadsheet/misc.py

+            """.format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
+            [list(like_all or like_any)],
+        )
+        for ids, datas in ncr.fetchmany(size=1):


This will only output the first bucket.

A simple solution is what you had already:

Suggested change

-        for ids, datas in ncr.fetchmany(size=1):
+        for ids, datas in ncr:

With `with pg.named_cursor(cr, itersize=1) as ncr`.

I agree with Edoardo. You just need to limit the `itersize`. Under the hood psycopg will fetch rows one by one, and each row already contains multiple records (as many as fit in one bucket).
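
The difference between the two loops can be reproduced with any DB-API cursor. The sketch below uses the standard library `sqlite3` module as a stand-in for the psycopg2 named cursor (an assumption for the sake of a runnable example: `sqlite3` has no server-side cursors and no `itersize`), and the `buckets` table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buckets (ids TEXT, datas TEXT)")
conn.executemany(
    "INSERT INTO buckets VALUES (?, ?)",
    [("1,2", "a"), ("3", "b"), ("4,5,6", "c")],
)

cur = conn.cursor()
cur.execute("SELECT ids, datas FROM buckets ORDER BY rowid")
# fetchmany(1) returns a list holding at most one row, so this loop
# only ever sees the first bucket.
first_only = [ids for ids, datas in cur.fetchmany(1)]

cur.execute("SELECT ids, datas FROM buckets ORDER BY rowid")
# Iterating the cursor keeps fetching until the result set is exhausted.
all_rows = [ids for ids, datas in cur]

conn.close()
```

With psycopg2, `itersize` only sets how many rows each network round trip brings back while iterating a named cursor; it does not limit how many rows the loop yields.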

aj-fuentes · 2025-06-25T13:32:33Z

src/util/spreadsheet/misc.py

-            cr.execute(
-                "UPDATE spreadsheet_revision SET commands=%s WHERE id=%s", [json.dumps(data_loaded), revision_id]
+
+    with pg.named_cursor(cr) as ncr:


Suggested change

-    with pg.named_cursor(cr) as ncr:
+    with pg.named_cursor(cr, itersize=1) as ncr:

https://www.psycopg.org/docs/usage.html#server-side-cursors

src/util/spreadsheet/misc.py

vval-odoo · 2025-06-25T14:41:26Z

I added ORDER BY LENGTH(commands) here

SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num

as it could reduce the number of buckets if length is randomly distributed among ids.

aj-fuentes · 2025-06-25T15:29:56Z

src/util/spreadsheet/misc.py

+                   ARRAY_AGG(commands ORDER BY id)
+              FROM buckets
+          GROUP BY num, alone
+            """.format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),


Use `format_query` to set `ALL` or `ANY`, then pass `memory_cap` as a parameter to the query in `ncr.execute`.
EDIT: `condition = util.SQLStr("ALL" if like_all else "ANY")`
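
A minimal sketch of the separation being asked for: the SQL keyword cannot be sent as a bound parameter, so it is picked from a fixed set, while values such as the cap travel as query parameters. `build_query` is a hypothetical helper written for this example; it does not use `util.format_query`, and the 200 MB default is an assumption taken from the figure mentioned later in this thread.

```python
def build_query(like_all=(), like_any=(), memory_cap=200 * 1024 * 1024):
    """Return (sql, params) for a bucketing SELECT; hypothetical helper."""
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
    # The keyword comes from a fixed set, never from user input.
    condition = "ALL" if like_all else "ANY"
    sql = (
        "SELECT id,"
        " SUM(LENGTH(commands)) OVER (ORDER BY id) / %s AS num,"
        " LENGTH(commands) > %s AS alone"
        " FROM spreadsheet_revision"
        " WHERE commands LIKE {condition}(%s::text[])"
    ).format(condition=condition)
    # One parameter per %s placeholder, in order of appearance.
    params = [memory_cap, memory_cap, list(like_all or like_any)]
    return sql, params


sql, params = build_query(like_any=['%"DELETE_SHEET"%'], memory_cap=10)
```

The returned pair would then be passed as `ncr.execute(sql, params)`, letting the driver handle quoting of the values.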

aj-fuentes · 2025-06-25T15:30:51Z

src/util/spreadsheet/misc.py

+            """
+            WITH buckets AS (
+                SELECT id,
+                       SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,


Suggested change

-                       SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
+                       SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / %s AS num,

aj-fuentes · 2025-06-25T15:31:02Z

src/util/spreadsheet/misc.py

+            WITH buckets AS (
+                SELECT id,
+                       SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
+                       LENGTH(commands) > {memory_cap} AS alone,


Suggested change

-                       LENGTH(commands) > {memory_cap} AS alone,
+                       LENGTH(commands) > %s AS alone,

aj-fuentes · 2025-06-25T15:35:12Z

> I added ORDER BY LENGTH(commands) here
> SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num
> as it could reduce the number of buckets if length is randomly distributed among ids.

Note that this could still lead to a big record being grouped with many smaller records, potentially adding up to 200 MB to the size of the fetched data compared with fetching the record alone. If we don't avoid that case, I don't see any advantage in ordering by length.

src/util/spreadsheet/misc.py

vval-odoo · 2025-06-26T08:25:15Z

> Note that this could still lead to a big record being grouped with many more "smaller" records potentially adding up to 200mb to the size of the fetched data --vs what would be fetched if the record was alone. If we don't avoid that case I don't see any advantage in ordering by length.

That was the idea behind my original query: compute buckets that would only go above memory_cap in the case of a lone record bigger than memory_cap. But for this we need to know, for each row, the size of the following one, so it complicates the logic a bit. Should I go with this one then?

        ncr.execute(
            """
            WITH RECURSIVE start_bucket AS (
                SELECT 1 AS bucket
            ),
            ordered_rows AS (
                SELECT id,
                       commands,
                       LENGTH(commands) AS length,
                       ROW_NUMBER() OVER (ORDER BY LENGTH(commands), id) AS rownum
                  FROM spreadsheet_revision
                 WHERE commands LIKE {condition}(%s::text[])
            ),
            assign AS (
                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       sb.bucket AS bucket,
                       o.length AS sum
                  FROM ordered_rows o, start_bucket sb
                 WHERE o.rownum = 1

                UNION ALL

                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       CASE
                            WHEN a.sum + o.length > %s THEN a.bucket + 1
                            ELSE a.bucket
                       END AS bucket,
                       CASE
                            WHEN a.sum + o.length > %s THEN o.length
                            ELSE a.sum + o.length
                       END AS sum
                  FROM assign a
                  JOIN ordered_rows o
                    ON o.rownum = a.rownum + 1
            )
            SELECT ARRAY_AGG(id),
                   ARRAY_AGG(commands)
              FROM assign
          GROUP BY bucket
            """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
            [list(like_all or like_any), MEMORY_CAP, MEMORY_CAP]
        )

aj-fuentes · 2025-06-26T08:31:52Z

> That was the idea behind my original query. Compute buckets that would only go above memory_cap in case of lone record bigger than memory_cap. But for this we need to know for each row what will be the size of following one, so it complicates the logic a bit. Should I go with this one then?

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, and it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore, if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings: one for big records that will be fetched individually, and another for smaller records that will be clustered by bucket.

EDIT: note that only records bigger than the bucket size need to be set alone, because the remaining records will never be fetched in a group larger than 2*MEMORY_CAP. Thus, by choosing a cap that's acceptable (I think 200MB is fine, as at most 400MB would be fetched for smaller records, while bigger ones have no cap but are fetched alone), we are done and we don't need any more complex logic here to decide about buckets. Here is a clearer way to do this without the "magic" alone column (pseudo sql):

SELECT ARRAY[id], ARRAY[commands]
  FROM table
 WHERE len(commands) > cap
 UNION
SELECT array_agg(data.id), array_agg(data.commands)
  FROM (SELECT id, commands, sum(len(commands)) OVER (ORDER BY id) / cap
          FROM table
         WHERE len(commands) <= cap
       ) AS data(id, commands, num)
 GROUP BY data.num
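
The two-branch idea above can be modelled in a few lines of Python. This is only a sketch under assumptions: `make_buckets` is a hypothetical name, records are reduced to an `{id: length}` mapping, and the running sum is taken in `id` order.

```python
from itertools import groupby


def make_buckets(lengths_by_id, cap):
    """Model the two-branch query on {id: length}; returns lists of ids."""
    # Branch 1: records bigger than the cap are fetched alone.
    alone = [[rid] for rid, n in sorted(lengths_by_id.items()) if n > cap]

    # Branch 2: running sum over the remaining records, ordered by id,
    # integer-divided by the cap to get a bucket number.
    numbered = []
    running = 0
    for rid, n in sorted(lengths_by_id.items()):
        if n <= cap:
            running += n
            numbered.append((running // cap, rid))
    grouped = [
        [rid for _, rid in group]
        for _, group in groupby(numbered, key=lambda pair: pair[0])
    ]
    return alone + grouped


# id 2 is five times the cap; the others are small.
buckets = make_buckets({1: 4, 2: 50, 3: 4, 4: 4, 5: 4}, cap=10)
# buckets == [[2], [1, 3], [4, 5]]
```

The oversized record never shares a fetch with the small ones, and it does not shift the running sum used to number the small buckets.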

KangOl · 2025-06-26T08:46:22Z

> we need to know for each row what will be the size of following one

To do so, you can use the lead window function: https://www.postgresql.org/docs/13/functions-window.html

But other window functions can be used to get memory_cap groups or rows.

KangOl · 2025-06-26T09:13:35Z

No need for lag/lead, we can just use a cumulative sum to get the ids to fetch in successive queries

  with _groups as (
    select id, sum(length(commands)) over (ORDER BY id) / {mem_cap} as cs
      from spreadsheet_revision
     where {condition}
  )
  select array_agg(id)
    from _groups
group by cs

Pirols · 2025-06-26T09:38:36Z

> What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.

I don't think that the alone trick is enough; it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c0) is 6 chars long, the second one (c1) is 6 chars long, and the third one (c2) is 5 chars long. None of them qualifies as alone. c0 goes into b0 (the first bucket), c1 goes into b1 (because 12 / 10 > 1), and c2 also goes into b1 (17 / 10 < 2); however, len(c1) + len(c2) = 11 > 10 = bucket_size:

|----b0----|----b1----|----b2----|
 **c0**
            **c1**
                  *c2**

vval-odoo · 2025-06-26T09:40:19Z

> What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings one for big records that will be fetched individually, and another for smaller records that would be clustered by bucket.

I think that two records bigger than memory_cap can't end up in the same bucket, even without the "alone" trick, because of the division. What this trick prevents is smaller records "completing" a bucket that is already too large, but that doesn't solve any problem, because a bucket of 15.something * memory_cap would make the program crash just as a bucket of 15 * memory_cap would. What I am trying to prevent is getting buckets between memory_cap and 2 * memory_cap.
I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

aj-fuentes · 2025-06-26T09:51:31Z

> What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.
>
> I don't think that the alone trick is enough, it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c1) is 6 chars long, second one (c2) is 6 chars long, third one (c3) is 5 chars long. No one qualifies as alone. First command goes into b0 (first bucket), second command goes into b1 (because 12 / 10 > 1), third command also goes into b1 (17/10 < 2), however len(c2) + len(c3) = 11 > bucket_size = 10:
> |----b0----|----b1----|----b2----|
>  **c1**
>             **c2**
>                   *c3**

I think it is correct. We don't care about reaching 2*max_cap, since we know that's the hard max. We care about a really big record, think 5*max_cap, being grouped with even more smaller ones. In your example, yes, c2+c3 > max_cap, but it is still < 2*max_cap. That's my point. The logic is simpler and, as shown by the example here, the biggest record is bigger than 2*max_cap and still acceptable for memory.
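
The 2*max_cap bound can be checked numerically. The reasoning: bucket k holds the records whose running sum falls in [k*cap, (k+1)*cap); the running sum just before the bucket's first record is at least k*cap minus that record's length, hence at least (k-1)*cap when no record exceeds the cap, so the bucket total stays below 2*cap. The sketch below is an illustration only; `bucket_totals` is a hypothetical helper and lengths are plain integers.

```python
import random


def bucket_totals(lengths, cap):
    """Total length per bucket when bucket = running_sum // cap."""
    totals = {}
    running = 0
    for n in lengths:
        running += n
        totals[running // cap] = totals.get(running // cap, 0) + n
    return totals


# The scenario from the thread: cap 10, commands of 6, 6 and 5 characters.
scenario = bucket_totals([6, 6, 5], cap=10)
# scenario == {0: 6, 1: 11}

# Randomized check of the bound for records no longer than the cap.
rng = random.Random(0)
worst = 0
for _ in range(1000):
    lengths = [rng.randint(1, 10) for _ in range(rng.randint(1, 50))]
    worst = max(worst, max(bucket_totals(lengths, cap=10).values()))
```

In the thread's scenario the second bucket totals 11, which is above the cap of 10 but below 20, matching the argument above.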

aj-fuentes · 2025-06-26T09:52:49Z

> I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

@vval-odoo yes. Which is fine.

Pirols

Remove seemingly useless ordering?

src/util/spreadsheet/misc.py

KangOl · 2025-06-26T15:22:01Z

upgradeci retry

KangOl · 2025-06-26T15:33:31Z

upgradeci retry

aj-fuentes

Use format_query.

aj-fuentes · 2025-06-27T13:50:49Z

src/util/spreadsheet/misc.py

+                   ARRAY[commands]
+              FROM filtered
+             WHERE commands_length > %s
+            """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),


Suggested change

-            """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
+            """,
+            condition=pg.SQLStr("ALL" if like_all else "ANY"),
+            )  # close format_query

aj-fuentes · 2025-06-27T13:51:44Z

src/util/spreadsheet/misc.py

+
+    with pg.named_cursor(cr, itersize=1) as ncr:
+        ncr.execute(
+            """


Use format_query. The query is becoming increasingly complex, better use the right formatting tool to avoid issues later.

Suggested change

-            """
+            util.format_query(
+                cr,
+                """

aj-fuentes

LGTM

I left an optional suggestion to ensure we return data deterministically.

src/util/spreadsheet/misc.py

aj-fuentes · 2025-07-07T11:20:59Z

src/util/spreadsheet/misc.py

+                       NULL AS num
+                  FROM filtered
+                 WHERE commands_length > %s
+              ORDER BY num NULLS LAST


This is not what we need. It isn't actually ordering the result, since all values are NULL. You are doing a UNION ALL, thus you should order each query. In the first one you order by num; here you then need to order by id. You don't need to return num since it isn't used on the Python side.

aj-fuentes

Ensure each query in the UNION ALL is ordered independently. For the first one the order that makes sense is num; for the second one it is id.

src/util/spreadsheet/misc.py

aj-fuentes · 2025-07-07T11:22:33Z

src/util/spreadsheet/misc.py

+                       ARRAY[commands],
+                       NULL AS num


Suggested change

-                       ARRAY[commands],
-                       NULL AS num
+                       ARRAY[commands]

aj-fuentes · 2025-07-07T11:22:43Z

src/util/spreadsheet/misc.py

+                       NULL AS num
+                  FROM filtered
+                 WHERE commands_length > %s
+              ORDER BY num NULLS LAST


Suggested change

-              ORDER BY num NULLS LAST
+              ORDER BY id

src/util/spreadsheet/misc.py

aj-fuentes · 2025-07-07T11:23:05Z

src/util/spreadsheet/misc.py

+                       array_agg(commands ORDER BY id),
+                       num


Suggested change

-                       array_agg(commands ORDER BY id),
-                       num
+                       array_agg(commands ORDER BY id)

vval-odoo requested review from a team, jjmaksoud and aj-fuentes June 11, 2025 08:49

Pirols reviewed Jun 11, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3752d09 to 327a6f6, June 11, 2025 11:16

aj-fuentes reviewed Jun 11, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 327a6f6 to 508732d, June 13, 2025 12:20

vval-odoo requested a review from aj-fuentes June 20, 2025 07:56

aj-fuentes reviewed Jun 23, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 508732d to 1bfec7f, June 25, 2025 13:10

Pirols reviewed Jun 25, 2025

aj-fuentes reviewed Jun 25, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch 2 times, most recently from 03175d2 to 3367e6d, June 25, 2025 14:39

aj-fuentes reviewed Jun 25, 2025

cawo-odoo reviewed Jun 26, 2025

Pirols reviewed Jun 26, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3367e6d to 3e92404, June 26, 2025 13:39

aj-fuentes reviewed Jun 27, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3e92404 to e579d7b, June 30, 2025 06:52

aj-fuentes approved these changes Jul 1, 2025

vval-odoo removed the request for review from jjmaksoud July 1, 2025 14:19

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from e579d7b to 4ba9038, July 3, 2025 09:48

aj-fuentes reviewed Jul 7, 2025

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 4ba9038 to 7eab714, July 8, 2025 11:01

vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 7eab714 to b36699a, July 8, 2025 14:48

aj-fuentes approved these changes Jul 11, 2025
