Implement column detection and sorting #3

lukehsiao · 2018-01-04T08:23:53Z

Running under Python 3 produces slight run-to-run variations in the ordering of the content in the output HTML for the test input document. For example, here are the first few paragraphs after three separate runs.

Notice that the order of the content can vary.

To reproduce the bug, you can use a Python3 virtualenv:

virtualenv -p python3 .venv
source .venv/bin/activate
pip install -r requirements.txt
python extract_tree.py --pdf_file tests/input/112823.pdf

and view the results in results/112823.html over a few runs of extract_tree.py.

Content order remains consistent between runs when using Python 2.


lukehsiao · 2018-01-04T20:07:28Z

The result of the sort in TreeExtract.py seems to differ between Python 2 and Python 3.

            boxes.sort(key=cmp_to_key(two_column_paper_order))

def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    if((b1[2] > b2[2] and b1[2] < b2[4]) or (b2[2] > b1[2] and b2[2] < b1[4])):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0

Using the first page of 112823.pdf, and printing the result of the sorted boxes, I have this:

Python 3

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['paragraph',
  164.6805999999999,
  59.8111,
  196.08479999999997,
  203.48589999999996],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header',
  72.88620000000003,
  59.8111,
  91.12019999999995,
  163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]

Python 2

(Pdb) pprint(boxes)
[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['section_header',
  72.88620000000003,
  59.8111,
  91.12019999999995,
  163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['paragraph',
  164.6805999999999,
  59.8111,
  196.08479999999997,
  203.48589999999996],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header',
  207.69960000000003,
  449.5181,
  215.77160000000003,
  489.4061],
 ['section_header',
  217.67719999999997,
  467.6029,
  225.74919999999997,
  471.2509],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]

lukehsiao · 2018-01-04T23:29:01Z

It looks like two_column_paper_order is inconsistent.

from pprint import pprint
from functools import cmp_to_key

boxes = [['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
         ['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
         ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
         ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
         ['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
         ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
         ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
         ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
         ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
         ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
         ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
         ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
         ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
         ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
         ['section_header', 651.4868, 405.183, 654.1348, 408.847],
         ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
         ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]


def two_column_paper_order(b1, b2):
    '''
    b1 = [b1.type, b1.top, b1.left, b1.bottom, b1.right]
    b2 = [b2.type, b2.top, b2.left, b2.bottom, b2.right]
    '''
    # If overlapping, return higher box
    if((b1[2] >= b2[2] and b1[2] <= b2[4]) or
            (b2[2] >= b1[2] and b2[2] <= b1[4])):
        return float_cmp(b1[1], b2[1])

    # Return leftmost boxes first
    return float_cmp(b1[2], b2[2])

def float_cmp(f1, f2):
    if f1 > f2:
        return 1
    elif f1 < f2:
        return -1
    else:
        return 0


#  pprint(boxes)
boxes.sort(key=cmp_to_key(two_column_paper_order))
pprint(boxes, width=120)

# Print the full pairwise comparison matrix for the sorted boxes
for i in range(len(boxes)):
    print([two_column_paper_order(boxes[i], boxes[j]) for j in range(len(boxes))])

This prints

[['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
 ['section_header', 72.88620000000003, 59.8111, 91.12019999999995, 163.90329999999997],
 ['section_header', 111.77690000000007, 59.8111, 130.0109, 265.32250000000005],
 ['section_header', 130.47469999999998, 59.8111, 153.43470000000002, 139.2471],
 ['paragraph', 164.6805999999999, 59.8111, 196.08479999999997, 203.48589999999996],
 ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
 ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001],
 ['section_header', 744.1121, 62.7024, 763.9413000000001, 205.22760000000005],
 ['section_header', 741.8637, 303.5336, 754.9837, 307.98159999999996],
 ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
 ['section_header', 167.9121, 417.4866, 182.6721000000001, 494.9946],
 ['section_header', 207.69960000000003, 449.5181, 215.77160000000003, 489.4061],
 ['section_header', 217.67719999999997, 467.6029, 225.74919999999997, 471.2509],
 ['paragraph', 604.6788, 382.7326, 667.3356, 528.3814000000002],
 ['section_header', 651.4868, 405.183, 654.1348, 408.847],
 ['paragraph', 679.4523, 364.0252, 712.7441, 548.5977000000003],
 ['section_header', 742.1597, 454.3368, 764.2245, 548.8016]]
[0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  0,  -1, -1, -1, 1,  1,  1,  1,  -1, -1, -1]
[1, 1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  1,  1,  1,  0,  -1, -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  0,  -1, -1, -1, -1, -1, -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  0,  -1, -1, -1, 1,  -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  1,  0,  -1, -1, 1,  -1]
[1, 1,  1,  1,  1,  -1, 1,  1,  1,  1,  1,  1,  0,  -1, 1,  -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  -1, -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  -1, -1, -1, 1,  0,  -1]
[1, 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0]
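
The 1s above the diagonal in an otherwise sorted matrix suggest the comparator is not transitive. A minimal sketch that makes this explicit with three boxes taken from the list above: the table sorts before the paragraph at top 701.5, that paragraph sorts before the figure, and yet the figure sorts before the table. The comparator is the same logic as above, written more compactly; the helper name `find_cycle` is hypothetical and only here for illustration.

```python
from itertools import permutations


def float_cmp(f1, f2):
    # Three-way comparison: -1, 0, or 1.
    return (f1 > f2) - (f1 < f2)


def two_column_paper_order(b1, b2):
    # Boxes are [type, top, left, bottom, right].
    # If one box's left edge falls inside the other's horizontal extent,
    # treat them as overlapping and order by top; otherwise order by left.
    if (b2[2] <= b1[2] <= b2[4]) or (b1[2] <= b2[2] <= b1[4]):
        return float_cmp(b1[1], b2[1])
    return float_cmp(b1[2], b2[2])


def find_cycle(boxes, cmp):
    """Return the first triple (a, b, c) with a < b < c < a, or None."""
    for a, b, c in permutations(boxes, 3):
        if cmp(a, b) < 0 and cmp(b, c) < 0 and cmp(c, a) < 0:
            return a, b, c
    return None


table = ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825]
paragraph = ['paragraph', 701.5117, 59.8111, 732.2509, 337.3303000000001]
figure = ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137]

cycle = find_cycle([table, paragraph, figure], two_column_paper_order)
print([box[0] for box in cycle])  # ['table', 'paragraph', 'figure']
```

With a cycle like this there is no ordering that satisfies every pairwise comparison, so the result of `sort` depends on the input order and on which pairs happen to be compared.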

Basically, it's a problem of how columns are defined.

---1---
            ---3----
      ---2----
             

1: [1, 2, 3]
2: [1, 3, 2]
3: [1, 3, 2]

lukehsiao · 2018-01-05T04:29:43Z

We've boiled this down to a need to implement a function that detects the number of columns in a document and then orders the text across the columns in reading order. This function will need to sort things consistently so that we won't have the issue shown above.
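
One possible shape for such a function, as a rough sketch and not the planned implementation: group boxes into columns by merging horizontally overlapping extents, then sort with a key of (column, top, left). Because the order comes from a sort key, not a pairwise comparator, it is consistent by construction. The names `assign_columns` and `reading_order` and the sample page are hypothetical. Note that a full-width box, such as the table on the first page of 112823.pdf, would merge everything into a single column under this approach, so real documents would need extra handling.

```python
def assign_columns(boxes):
    """Map each box index to a column index.

    Boxes are [type, top, left, bottom, right]. Boxes whose horizontal
    extents overlap (directly or through other boxes) share a column.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][2])
    columns = {}
    col = -1
    col_right = float('-inf')
    for i in order:
        left, right = boxes[i][2], boxes[i][4]
        if left > col_right:
            # Gap to the right of the current column: start a new one.
            col += 1
            col_right = right
        else:
            # Still inside the current column: extend its right edge.
            col_right = max(col_right, right)
        columns[i] = col
    return columns


def reading_order(boxes):
    """Return boxes sorted by column, then top, then left."""
    columns = assign_columns(boxes)
    idx = sorted(range(len(boxes)),
                 key=lambda i: (columns[i], boxes[i][1], boxes[i][2]))
    return [boxes[i] for i in idx]


# Hypothetical two-column page: two boxes on the left, two on the right.
page = [
    ['paragraph', 300.0, 320.0, 400.0, 540.0],
    ['paragraph', 100.0, 60.0, 200.0, 280.0],
    ['header', 50.0, 60.0, 80.0, 250.0],
    ['figure', 60.0, 330.0, 250.0, 540.0],
]
for box in reading_order(page):
    print(box[0], box[1])
```

On this sample page the left column (header, paragraph) comes out first, followed by the right column (figure, paragraph).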

We will need to implement functionality for:

1. Detecting the number of columns in a document based on bboxes
2. Ordering the content in reading order

Currently, this just swaps out the inconsistent two-column code we used to have in the TreeStructure repository for a simple top-to-bottom, left-to-right comparator. We will need to fix this in the future.
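
A top-to-bottom, left-to-right ordering can also be expressed as a sort key in place of a comparator, which avoids the consistency problem because tuples of floats are totally ordered. A minimal sketch, assuming the same [type, top, left, bottom, right] box layout; the name `top_to_bottom_key` is hypothetical and this is not necessarily the code in the repository.

```python
def top_to_bottom_key(box):
    # Sort by top edge first, then by left edge to break ties.
    return (box[1], box[2])


boxes = [
    ['table', 243.24599999999998, 59.8111, 555.8221, 532.8825],
    ['figure', 54.34699999999998, 397.9276, 160.58950000000004, 514.6137],
    ['header', 52.92960000000005, 59.8111, 71.16359999999997, 286.05310000000003],
]
boxes.sort(key=top_to_bottom_key)
print([box[0] for box in boxes])  # ['header', 'figure', 'table']
```

The trade-off is that this interleaves the columns of a multi-column page, which is why it can only be a stopgap until column detection exists.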

lukehsiao added the bug label Jan 4, 2018

lukehsiao self-assigned this Jan 4, 2018

lukehsiao changed the title ~~Variations in HTML output using Python3~~ Implement column detection and sorting Jan 5, 2018

lukehsiao assigned senwu and unassigned lukehsiao Jan 5, 2018

lukehsiao added enhancement and removed bug labels Mar 12, 2018
