[FIX] spreadsheet: batch process spreadsheet_revision.commands
#284
base: master
Good work! :)
src/util/spreadsheet/misc.py
def iter_commands(cr, like_all=(), like_any=()):
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
    cr.execute(
    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
Using a context manager, you do not need to close it explicitly.
- ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
+ with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:
That said, this is only for the sake of a more Pythonic implementation. In other words: IMO you can keep your current version if you like it better.
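As a rough illustration of the context-manager point, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in. This is an assumption made for the sake of a runnable example: the PR uses `pg.named_cursor` on PostgreSQL, which is not reproduced here. `contextlib.closing` calls `close()` on the cursor when the block exits, so no explicit call is needed.

```python
import sqlite3
from contextlib import closing


def read_value():
    """Return (value, cursor) once the cursor has left its `with` block."""
    conn = sqlite3.connect(":memory:")  # stand-in database, not PostgreSQL
    # closing() calls cur.close() on exit, whether or not an exception occurred.
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 42")
        value = cur.fetchone()[0]
    return value, cur


def is_closed(cur):
    """A closed sqlite3 cursor raises ProgrammingError on any further use."""
    try:
        cur.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False


value, cur = read_value()
```

The helper names `read_value` and `is_closed` are hypothetical and exist only for this sketch.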
3752d09 to 327a6f6
327a6f6 to 508732d
Another affected request: https://upgrade.odoo.com/odoo/action-150/2988031
508732d to 1bfec7f
src/util/spreadsheet/misc.py
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
[list(like_all or like_any)],
)
for ids, datas in ncr.fetchmany(size=1):
This will only output the first bucket.
A simple solution is what you already had:
- for ids, datas in ncr.fetchmany(size=1):
+ for ids, datas in ncr:
combined with `with pg.named_cursor(cr, itersize=1) as ncr`.
I agree with Edoardo. You just need to limit the `itersize`. Under the hood, psycopg will then fetch rows one at a time, and each row already contains multiple records (as many as fit in one bucket).
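The difference between the two loops discussed above can be sketched with the standard-library `sqlite3` module as a stand-in (an assumption for runnability: SQLite has no named cursors and no `itersize`, so only the `fetchmany(size=1)` versus iteration behavior is shown). The table and its contents are made up for the example.

```python
import sqlite3


def make_cursor():
    """Return a cursor over three pre-aggregated 'bucket' rows (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE buckets (ids TEXT, datas TEXT)")
    conn.executemany(
        "INSERT INTO buckets VALUES (?, ?)",
        [("1,2", "a"), ("3", "b"), ("4,5", "c")],
    )
    return conn.execute("SELECT ids, datas FROM buckets ORDER BY rowid")


# fetchmany(size=1) returns a list holding a single row, so the loop body
# runs once and the remaining buckets are never visited.
first_only = [ids for ids, _ in make_cursor().fetchmany(size=1)]

# Iterating over the cursor itself keeps fetching until the result is exhausted.
every_row = [ids for ids, _ in make_cursor()]
```

With a psycopg named cursor, the iteration form would additionally fetch from the server in chunks of `itersize` rows, which is why the reviewers suggest `itersize=1` here.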
src/util/spreadsheet/misc.py
cr.execute(
    "UPDATE spreadsheet_revision SET commands=%s WHERE id=%s", [json.dumps(data_loaded), revision_id]

with pg.named_cursor(cr) as ncr:
- with pg.named_cursor(cr) as ncr:
+ with pg.named_cursor(cr, itersize=1) as ncr:
03175d2 to 3367e6d
I added
as it could reduce the number of buckets if length is randomly distributed among ids.
src/util/spreadsheet/misc.py
ARRAY_AGG(commands ORDER BY id)
FROM buckets
GROUP BY num, alone
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
Use `format_query` to set `ALL` or `ANY`. Then pass `memory_cap` as a parameter to the query in `ncr.execute`.
EDIT: `condition = util.SQLStr("ALL" if like_all else "ANY")`
src/util/spreadsheet/misc.py
"""
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
- SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
+ SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / %s AS num,
src/util/spreadsheet/misc.py
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
LENGTH(commands) > {memory_cap} AS alone,
- LENGTH(commands) > {memory_cap} AS alone,
+ LENGTH(commands) > %s AS alone,
Note that this could still lead to a big record being grouped with many more "smaller" records, potentially adding up to 200 MB to the size of the fetched data compared with what would be fetched if the record were alone. If we don't avoid that case, I don't see any advantage in ordering by length.
That was the idea behind my original query. Compute buckets that would only go above
What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get
EDIT: note that only records bigger than the bucket size need to be set alone, because all remaining records will never be fetched in a group larger than 2*MEMORY_CAP. Thus, by choosing an acceptable cap (I think 200MB is fine, as at most 400MB would be fetched for smaller records, while bigger ones won't have any cap but are fetched alone), we are done and don't need any more complex logic here to decide about buckets.
Here is a clearer way to do this without the "magic" alone column (pseudo SQL):
SELECT ARRAY[id], ARRAY[commands]
  FROM table
 WHERE len(commands)>cap
UNION
SELECT array_agg(data.id), array_agg(data.commands)
  FROM (SELECT id, commands, sum(len(commands))/cap
          FROM table
         WHERE len(commands)<=cap
       ) AS data(id,commands,num)
 GROUP BY data.num
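To make the bucketing argument concrete, here is a small pure-Python sketch of the same idea. It is an illustration only, not the PR's code: `bucketize` and the sample data are hypothetical, and the running sum is taken in `id` order, which is an assumption. Records larger than `cap` go alone; the rest are grouped by their running total divided by `cap`, so a group's total stays below `2 * cap`.

```python
from itertools import groupby


def bucketize(rows, cap):
    """Group (id, length) pairs into lists of ids.

    Records longer than `cap` are returned alone. The others are grouped by
    the integer quotient of their running length total and `cap`.
    """
    big = [[rid] for rid, length in sorted(rows) if length > cap]
    small = sorted((rid, length) for rid, length in rows if length <= cap)

    numbered = []
    running = 0
    for rid, length in small:
        running += length
        numbered.append((running // cap, rid))

    # The running total only grows, so equal bucket numbers are consecutive.
    grouped = [
        [rid for _, rid in group]
        for _, group in groupby(numbered, key=lambda pair: pair[0])
    ]
    return grouped + big


rows = [(1, 60), (2, 60), (3, 250), (4, 30), (5, 90)]
buckets = bucketize(rows, cap=100)
```

Within one group the running totals differ by less than `cap`, and the first record adds at most `cap` on top, which is where the `2 * cap` bound comes from.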
To do so, you can use the But other window functions can be used to get
No need for
with _groups as (
    select id, sum(length(commands)) over (ORDER BY id) / {mem_cap} as cs
    from spreadsheet_revision
    where {condition}
)
select array_agg(id)
from _groups
group by cs
I don't think that the
I think that two records bigger than
I think it is correct. See we don't care about
@vval-odoo yes. Which is fine.
Remove seemingly useless ordering?
3367e6d to 3e92404
upgradeci retry
Use format_query.
src/util/spreadsheet/misc.py
ARRAY[commands]
FROM filtered
WHERE commands_length > %s
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
- """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
+ """,
+ condition=pg.SQLStr("ALL" if like_all else "ANY"),
+ )  # close format_query
src/util/spreadsheet/misc.py
with pg.named_cursor(cr, itersize=1) as ncr:
    ncr.execute(
        """
Use `format_query`. The query is becoming increasingly complex; better to use the right formatting tool to avoid issues later.
- """
+ util.format_query(
+     cr,
+     """
3e92404 to e579d7b
LGTM
I left an optional suggestion to ensure we return data deterministically.
e579d7b to 4ba9038
src/util/spreadsheet/misc.py
NULL AS num
FROM filtered
WHERE commands_length > %s
ORDER BY num NULLS LAST
This is not what we need. It isn't actually ordering the result, since all values are NULL. You are doing a UNION ALL, so you should order each query. In the first one you order by `num`; here you then need to order by `id`. You don't need to return `num`, since it isn't used on the Python side.
Ensure each query in the UNION ALL is ordered independently. For the first one, the order that makes sense is `num`; for the second one, it is `id`.
array_agg(commands ORDER BY id),
num
FROM smaller
GROUP BY num
- GROUP BY num
+ GROUP BY num
+ ORDER BY num
I get the problem. However, independently ordering the sets of a union is not allowed by psql. I can either do that with subqueries or use something more hacky like this:
SELECT array_agg(id ORDER BY id),
array_agg(commands ORDER BY id),
min(id) AS sort_key
FROM smaller
GROUP BY num
UNION ALL
SELECT ARRAY[id],
ARRAY[commands],
id AS sort_key
FROM filtered
WHERE commands_length > %s
ORDER BY sort_key
The sort_key idea looks good to me :)
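The `sort_key` approach can be tried out in a minimal sketch with the standard-library `sqlite3` module. This is a stand-in chosen for runnability, not the PR's PostgreSQL query: `array_agg` is replaced by `count(*)`, and the `smaller` and `bigger` tables with their contents are made up. A single trailing `ORDER BY` on the shared column orders the whole `UNION ALL` result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE smaller (id INTEGER, num INTEGER);
    CREATE TABLE bigger (id INTEGER);
    INSERT INTO smaller VALUES (1, 0), (2, 0), (5, 1);
    INSERT INTO bigger VALUES (4), (3);
    """
)

# Both branches expose a sort_key column; the final ORDER BY applies to the
# combined result rather than to the second branch alone.
rows = conn.execute(
    """
    SELECT count(*) AS n, min(id) AS sort_key
      FROM smaller
     GROUP BY num
     UNION ALL
    SELECT 1 AS n, id AS sort_key
      FROM bigger
     ORDER BY sort_key
    """
).fetchall()
```

Each result row is `(number of records in the bucket, sort_key)`, so grouped buckets and single large records come back interleaved in `id` order.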
src/util/spreadsheet/misc.py
ARRAY[commands],
NULL AS num
- ARRAY[commands],
- NULL AS num
+ ARRAY[commands]
src/util/spreadsheet/misc.py
NULL AS num
FROM filtered
WHERE commands_length > %s
ORDER BY num NULLS LAST
- ORDER BY num NULLS LAST
+ ORDER BY id
),
[list(like_any or like_all), MEMORY_CAP, MEMORY_CAP, MEMORY_CAP],
)
for ids, commands, _ in ncr:
- for ids, commands, _ in ncr:
+ for ids, commands in ncr:
src/util/spreadsheet/misc.py
array_agg(commands ORDER BY id),
num
- array_agg(commands ORDER BY id),
- num
+ array_agg(commands ORDER BY id)
4ba9038 to 7eab714
Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors. We distribute them in buckets of `memory_cap` maximum size, and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` are each placed alone in their own bucket.
7eab714 to b36699a
[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors here. We distribute them in buckets of `memory_cap` maximum size, and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` are each placed alone in their own bucket.

Fixes upg-2899961