Line 43 of cli.pawls.preprocessors.tesseract in extract_page_tokens()…
… fails when the underlying text data type is not actually text. I assume this is rare, but it depends on the authoring tool used for the original source PDF. I have a PDF where one page has only a number on it, and the data type extracted into the dataframe appears to be float64. This fails in extract_page_tokens() as written. Added .astype(str) to line 43 to force conversion to string, which should cover these kinds of corner cases. It works for me, at least on the PDF that was crashing the parser. (allenai#199)
JSv4 authored Feb 6, 2023
1 parent 51b2a63 commit 57fc217
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion cli/pawls/preprocessors/tesseract.py
@@ -40,7 +40,7 @@ def extract_page_tokens(
 gp["width"].max(),
 gp["height"].max(),
 gp["conf"].mean(),
-gp["text"].str.cat(sep=" "),
+gp["text"].astype(str).str.cat(sep=" "),
 ]
 )
 )
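
The failure described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the project's code: the dataframe and the two helper functions are hypothetical stand-ins, assuming a page whose only recognized "text" is a number that pandas infers as float64.

```python
import pandas as pd

# Hypothetical stand-in for one page of OCR output: the only "text" value
# is a number, so pandas infers float64 for the column instead of object/str.
df = pd.DataFrame({"line_num": [1], "conf": [96.0], "text": [42.0]})


def join_text_original(gp: pd.DataFrame) -> str:
    # Pre-fix expression: the .str accessor is only available on string data.
    return gp["text"].str.cat(sep=" ")


def join_text_patched(gp: pd.DataFrame) -> str:
    # Post-fix expression: coerce every value to str before concatenating.
    return gp["text"].astype(str).str.cat(sep=" ")


try:
    join_text_original(df)
    original_error = None
except AttributeError as exc:
    # pandas refuses to create the .str accessor for a float64 column.
    original_error = exc

patched = join_text_patched(df)
print(type(original_error).__name__, repr(patched))
```

One consequence of the coercion worth keeping in mind: the token text becomes the string form of the float (here "42.0" rather than "42"), and a missing value would be rendered as the literal string "nan". Whether that matters depends on how the tokens are used downstream.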