Implement column detection and sorting #3
The result of the sorting seems inconsistent in TreeExtract.py between Python2 and Python3.

boxes.sort(key=cmp_to_key(two_column_paper_order))

def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    if((b1[2] > b2[2] and b1[2] < b2[4]) or (b2[2] > b1[2] and b2[2] < b1[4])):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0

Using the first page:

Python 3

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
['paragraph',
164.6805999999999,
59.8111,
196.08479999999997,
203.48589999999996],
['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
['section_header',
72.88620000000003,
59.8111,
91.12019999999995,
163.90329999999997],
['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
['section_header', 651.4868, 405.183, 654.1348, 408.847],
['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]

Python 2

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
['section_header',
72.88620000000003,
59.8111,
91.12019999999995,
163.90329999999997],
['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
['paragraph',
164.6805999999999,
59.8111,
196.08479999999997,
203.48589999999996],
['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
['section_header', 651.4868, 405.183, 654.1348, 408.847],
['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
['section_header',
207.69960000000003,
449.5181,
215.77160000000003,
489.4061],
['section_header',
217.67719999999997,
467.6029,
225.74919999999997,
471.2509],
['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]

It looks like

from pprint import pprint
from functools import cmp_to_key
boxes = [['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
['section_header', 651.4868, 405.183, 654.1348, 408.847],
['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]
def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    # If overlapping, return higher box
    if((b1[2] >= b2[2] and b1[2] <= b2[4]) or
       (b2[2] >= b1[2] and b2[2] <= b1[4])):
        return float_cmp(b1[1], b2[1])
    # Return leftmost boxes first
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0

# pprint(boxes)
boxes.sort(key=cmp_to_key(two_column_paper_order))
pprint(boxes, width=120)
for i in range(len(boxes)):
    print([two_column_paper_order(boxes[i], boxes[j]) for j in range(len(boxes))])

This prints

[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
['section_header', 651.4868, 405.183, 654.1348, 408.847],
['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]
[0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 1, 0, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
[1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 1, -1, 1, 1, 1, 0, -1, -1, -1, -1, -1, -1]
[1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 0, -1, -1, -1, 1, -1]
[1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 0, -1, -1, 1, -1]
[1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 0, -1, 1, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, 1, 0, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

Basically, it's a problem of how columns are defined.
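The inconsistency in the comparison matrix can be checked mechanically. The sketch below reuses the two functions quoted in this thread; `find_intransitive_triples` is a hypothetical helper written for illustration, not part of TreeExtract.py, and the three boxes are copied from the listing above. It looks for triples where a sorts before b and b sorts before c, yet a does not sort before c. A comparator with such triples does not define a consistent ordering, so the result of `sort()` can depend on the order of the input.

```python
from itertools import permutations


def float_cmp(f1, f2):
    # Same result as the float_cmp in this thread: 1, -1 or 0.
    return (f1 > f2) - (f1 < f2)


def two_column_paper_order(b1, b2):
    # Boxes are [type, top, left, bottom, right].
    # If the left edge of one box falls inside the other's horizontal
    # extent, compare by top; otherwise compare by left.
    if ((b1[2] >= b2[2] and b1[2] <= b2[4]) or
            (b2[2] >= b1[2] and b2[2] <= b1[4])):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])


def find_intransitive_triples(items, cmp):
    """Hypothetical helper: triples with a < b and b < c but not a < c."""
    return [
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if cmp(a, b) < 0 and cmp(b, c) < 0 and cmp(a, c) >= 0
    ]


boxes = [
    ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
    ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
    ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
]

bad = find_intransitive_triples(boxes, two_column_paper_order)
for a, b, c in bad:
    print(a[0], '<', b[0], '<', c[0], 'but not', a[0], '<', c[0])
```

With these three boxes, the figure sorts before the table (the figure's left edge falls inside the table's extent, so tops are compared) and the table sorts before the paragraph, yet the figure sorts after the paragraph (no overlap, so left edges are compared).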
We've boiled this down to the need to implement a function that detects the number of columns in a document and then orders the text across the columns in reading order. This function will need to sort things consistently so that we don't hit the issue shown above.
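As a rough illustration of what such a function could look like, here is a minimal sketch. It is not the project's implementation. It assumes boxes of the form [type, top, left, bottom, right], a known page width, and that any box wider than a chosen fraction of the page spans multiple columns. The names `detect_columns`, `column_index`, `reading_order` and the parameter `max_width_ratio` are all hypothetical. Columns are found by merging the horizontal extents of the narrow boxes, and the ordering uses a key tuple rather than a pairwise comparator, so the result does not depend on an inconsistent comparison.

```python
def detect_columns(boxes, page_width, max_width_ratio=0.6):
    """Return column (left, right) intervals merged from narrow boxes.

    Boxes wider than max_width_ratio * page_width are assumed to span
    several columns and are ignored when detecting columns.
    """
    spans = sorted(
        (b[2], b[4]) for b in boxes
        if b[4] - b[2] <= max_width_ratio * page_width
    )
    columns = []
    for left, right in spans:
        if columns and left <= columns[-1][1]:
            # Overlaps the current column: extend its right edge.
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])
    return [tuple(c) for c in columns]


def column_index(box, columns):
    """Index of the column with the largest horizontal overlap with box."""
    if not columns:
        return 0
    overlaps = [min(box[4], r) - max(box[2], l) for l, r in columns]
    return overlaps.index(max(overlaps))


def reading_order(boxes, page_width):
    """Order boxes column by column, top to bottom within each column."""
    columns = detect_columns(boxes, page_width)
    return sorted(boxes, key=lambda b: (column_index(b, columns), b[1], b[2]))


# Synthetic two-column page (made-up coordinates, page width assumed 612).
demo = [
    ['paragraph', 300.0, 320.0, 400.0, 560.0],
    ['paragraph', 100.0, 50.0, 200.0, 290.0],
    ['paragraph', 100.0, 320.0, 200.0, 560.0],
    ['paragraph', 300.0, 50.0, 400.0, 290.0],
]
columns = detect_columns(demo, page_width=612.0)
ordered = reading_order(demo, page_width=612.0)
print(len(columns), 'columns')
for box in ordered:
    print(box)
```

This sketch leaves open how to place full-width elements such as tables or headers relative to the columns; here they are simply assigned to the column they overlap most.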
We will need to implement functionality for:

1. Detecting the number of columns in a document based on bboxes
2. Ordering the content in reading order

Currently, this just swaps out the inconsistent two-column code we used to have in the TreeStructure repository for a simple top-to-bottom, left-to-right comparator. We will need to fix this in the future.
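For illustration only, a top-to-bottom, left-to-right ordering can be written as a sort key instead of a pairwise comparator. This is a sketch under the assumption that boxes are [type, top, left, bottom, right]; it is not the code in the repository, and `top_left_key` is a hypothetical name. Because the key is a plain tuple of floats, the ordering is total and the same set of boxes sorts the same way regardless of input order.

```python
def top_left_key(box):
    """Sort key: top first, then left; bottom and right break ties."""
    return (box[1], box[2], box[3], box[4])


# A few boxes copied from the listing in this thread, in arbitrary order.
boxes = [
    ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
    ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
    ['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
    ['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
]

ordered = sorted(boxes, key=top_left_key)
for box in ordered:
    print(box[0], box[1], box[2])
```

Note that this ordering ignores columns entirely: the figure from the right-hand side of the page lands between the header and the first section header, which is why it can only be a stopgap.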
Using Python3 gives slight variations in the ordering of the content in the output HTML of the test input document. For example, here are the first few paragraphs after 3 separate runs.
Notice that the order of the content can vary.
To reproduce the bug, you can use a Python3 virtualenv:
and view the results in results/112823.html over a few runs of extract_tree.py.

Content order remains consistent between runs when using Python 2.