Implement column detection and sorting #3

Open
lukehsiao opened this issue Jan 4, 2018 · 3 comments
Using Python3 gives slight variations in the ordering of the content in the output HTML of the test input document. For example, here are the first few paragraphs after 3 separate runs.

[three screenshots of the output HTML, one per run]

Notice that the order of the content can vary.

To reproduce the bug, you can use a Python3 virtualenv:

virtualenv -p python3 .venv
source .venv/bin/activate
pip install -r requirements.txt
python extract_tree.py --pdf_file tests/input/112823.pdf

and view the results in results/112823.html over a few runs of extract_tree.py.

Content order remains consistent between runs when using Python 2.
[screenshot of the Python 2 output]

lukehsiao added the bug label Jan 4, 2018
lukehsiao self-assigned this Jan 4, 2018

lukehsiao commented Jan 4, 2018

The sort in TreeExtract.py seems to give different results under Python 2 and Python 3.

boxes.sort(key=cmp_to_key(two_column_paper_order))

def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    if ((b1[2] > b2[2] and b1[2] < b2[4]) or
            (b2[2] > b1[2] and b2[2] < b1[4])):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0

Using the first page of 112823.pdf, and printing the result of the sorted boxes, I have this:

Python 3

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['paragraph',
  164.6805999999999,
  59.8111,
  196.08479999999997,
  203.48589999999996],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header',
  72.88620000000003,
  59.8111,
  91.12019999999995,
  163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]

Python 2

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['section_header',
  72.88620000000003,
  59.8111,
  91.12019999999995,
  163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['paragraph',
  164.6805999999999,
  59.8111,
  196.08479999999997,
  203.48589999999996],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header',
  207.69960000000003,
  449.5181,
  215.77160000000003,
  489.4061],
 ['section_header',
  217.67719999999997,
  467.6029,
  225.74919999999997,
  471.2509],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]


lukehsiao commented Jan 4, 2018

It looks like two_column_paper_order is inconsistent as a comparator: the pairwise results it returns do not form a transitive ordering.

from pprint import pprint
from functools import cmp_to_key

boxes = [['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
         ['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
         ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
         ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
         ['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
         ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
         ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
         ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
         ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
         ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
         ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
         ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
         ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
         ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
         ['section_header', 651.4868, 405.183, 654.1348, 408.847],
         ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
         ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]


def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    # If overlapping, return higher box
    if((b1[2] >= b2[2] and b1[2] <= b2[4]) or
            (b2[2] >= b1[2] and b2[2] <= b1[4])):
        return float_cmp(b1[1], b2[1])

    # Return leftmost boxes first
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0


#  pprint(boxes)
boxes.sort(key=cmp_to_key(two_column_paper_order))
pprint(boxes, width=120)

for i in range(len(boxes)):
  print([two_column_paper_order(boxes[i], boxes[j]) for j in range(len(boxes))])

This prints

[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
 ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]
[0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  0,  -1, -1, -1, 1,  1,  1,  1,  -1, -1, -1]
[1, 1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  0,  -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  0,  -1, -1, -1, 1,  -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  1,  0,  -1, -1, 1,  -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  1,  1,  0,  -1, 1,  -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  -1, -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  -1, -1, -1, 1,  0,  -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0]

Basically, it's a problem of how columns are defined.

---1---
            ---3----
      ---2----
             

1: [1, 2, 3]
2: [1, 3, 2]
3: [1, 3, 2]
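
The inconsistency can be reproduced with just three boxes taken from the list above. This is a minimal sketch that reuses the comparator as posted; the variable names `table`, `paragraph`, `figure`, and `orders` are only for illustration. Each pairwise answer looks reasonable on its own, but together they form a cycle, so the sorted result depends on the order in which the boxes reach the sort.

```python
from functools import cmp_to_key
from itertools import permutations


def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    return 0


def two_column_paper_order(b1, b2):
    # Same logic as the comparator above; a box is
    # [type, top, left, bottom, right].
    if ((b1[2] >= b2[2] and b1[2] <= b2[4]) or
            (b2[2] >= b1[2] and b2[2] <= b1[4])):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])


# Three boxes copied from the first page of 112823.pdf.
table = ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825]
paragraph = ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001]
figure = ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137]

# table < paragraph < figure < table: a cycle, so no valid ordering exists.
print(two_column_paper_order(table, paragraph))   # -1
print(two_column_paper_order(paragraph, figure))  # -1
print(two_column_paper_order(figure, table))      # -1

# Sort every permutation of the three boxes and collect the distinct results.
orders = {
    tuple(b[0] for b in sorted(p, key=cmp_to_key(two_column_paper_order)))
    for p in permutations([table, paragraph, figure])
}
print(len(orders) > 1)  # True: the output depends on the input order
```
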

lukehsiao changed the title from "Variations in HTML output using Python3" to "Implement column detection and sorting" Jan 5, 2018
@lukehsiao

We've boiled this down to the need for a function that detects the number of columns in a document and then orders the text across those columns in reading order. The function must sort consistently so that the issue shown above does not recur.
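
One possible shape for such a function, as a rough sketch only and not the project's implementation: merge the horizontal extents of the narrow boxes into column intervals, then sort with a tuple key, which always gives a total order. The names `detect_columns`, `reading_order_key`, and `span_ratio` are hypothetical, and the rule that wide boxes are ignored during detection is an assumption.

```python
from bisect import bisect_right


def detect_columns(boxes, span_ratio=0.6):
    """Merge overlapping [left, right] extents into column intervals.

    Assumption: boxes wider than span_ratio of the occupied page width
    (for example a full-width table) are ignored during detection.
    A box is [type, top, left, bottom, right].
    """
    page_left = min(b[2] for b in boxes)
    page_right = max(b[4] for b in boxes)
    max_width = span_ratio * (page_right - page_left)

    columns = []
    narrow = sorted((b[2], b[4]) for b in boxes if b[4] - b[2] <= max_width)
    for left, right in narrow:
        if columns and left <= columns[-1][1]:
            # Overlaps the current column: extend it to the right.
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])
    return [tuple(c) for c in columns]


def reading_order_key(box, columns):
    """Sort key (column index, top, left), chosen by the box's left edge."""
    col_lefts = [c[0] for c in columns]
    col = max(0, bisect_right(col_lefts, box[2]) - 1)
    return (col, box[1], box[2])


# A subset of the boxes from the first page of 112823.pdf.
boxes = [
    ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
    ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
    ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
    ['section_header', 742.1597, 454.3368, 764.2245, 548.8016],
    ['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
    ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
]

columns = detect_columns(boxes)
ordered = sorted(boxes, key=lambda b: reading_order_key(b, columns))
print(len(columns))             # 2
print([b[0] for b in ordered])
# ['header', 'table', 'paragraph', 'figure', 'paragraph', 'section_header']
```

Because the key is a plain tuple, the result no longer depends on the input order. Placing a spanning box in the column that contains its left edge is a simplification; a real implementation would probably treat full-width boxes as breaks between column regions.
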

lukehsiao assigned senwu and unassigned lukehsiao Jan 5, 2018
lukehsiao added a commit that referenced this issue Jan 5, 2018
We will need to implement functionality for:
  1. Detecting the number of columns in a document based on bboxes
  2. Ordering the content in reading order

Currently, this just swaps out the inconsistent two-column code we used
to have in the TreeStructure repository for a simple top-to-bottom,
left-to-right comparator. We will need to fix this in the future.
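
A top-to-bottom, left-to-right ordering of this kind could look like the sketch below. The actual commit is not shown in this thread, so the name `top_left_key` and the exact key are assumptions. It ignores columns entirely, which is why it is only a stopgap, but a tuple key is a total order and so sorts the same way on every run.

```python
def top_left_key(box):
    """Hypothetical key: order by top edge, then by left edge.

    A box is [type, top, left, bottom, right].
    """
    return (box[1], box[2])


# The three boxes that form a cycle under two_column_paper_order.
table = ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825]
paragraph = ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001]
figure = ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137]

forward = sorted([table, paragraph, figure], key=top_left_key)
backward = sorted([figure, paragraph, table], key=top_left_key)

print([b[0] for b in forward])  # ['figure', 'table', 'paragraph']
print(forward == backward)      # True
```
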
lukehsiao added enhancement and removed bug labels Mar 12, 2018