PDF parser in Velociraptor #2600
Duplicate of #801 |
Adding this link to Didier Stevens pdf-parser.py https://blog.didierstevens.com/programs/pdf-tools/ |
Actually I looked into it. While extracting compressed streams from a PDF is pretty easy, actually extracting the text from the PDF is quite complex. The tool Matt linked to above just extracts the streams. PDFs generally don't have text in them; instead they consist of actual drawing commands that place the letters at positions on the page. These commands are encoded in many ways, some obvious and some not. The actual letters are sometimes encoded in terms of the font used, in fact, so we can sometimes easily read the text and sometimes not.

There are some commercial solutions to extract text from PDFs, but the open source solutions are quite simplistic and fail frequently (and were CPU intensive in my testing). The best open source solutions seem to be in Python at the moment. There are some Go libraries, all based on old code by Russ Cox, for example https://pkg.go.dev/github.com/dslipak/pdf, but in my testing these are really slow and not suitable for use on large numbers of documents. It may be possible to write something that works some of the time but not all the time. |
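To illustrate the distinction made above between extracting streams and extracting text, here is a minimal standard-library sketch. The PDF bytes are hand-built for the demo (a hypothetical single object, not a real file), and only the common FlateDecode filter is handled; note that what comes out of the stream is drawing commands, not paragraphs of text.

```python
# Sketch: pulling FlateDecode content streams out of PDF bytes.
# The "PDF" here is a hand-built fragment for illustration only.
import re
import zlib

# PDF text is stored as drawing commands: Tj paints a string at the
# current position set up by the preceding operators.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode /Length "
    + str(len(compressed)).encode()
    + b" >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

def extract_streams(pdf_bytes):
    """Yield the decoded body of every stream...endstream block."""
    for match in re.finditer(rb"stream\r?\n(.*?)\r?\nendstream",
                             pdf_bytes, re.DOTALL):
        raw = match.group(1)
        try:
            yield zlib.decompress(raw)  # FlateDecode only
        except zlib.error:
            yield raw  # uncompressed, or a filter we don't handle

for stream in extract_streams(fake_pdf):
    print(stream)
```

Even this "easy" part glosses over object streams, cross-reference tables and the other stream filters; actual text recovery on top of it is the hard problem described above.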
Hey this is great work Mike; much appreciated. |
Can you please link to PDFParser? Is this a different tool? I have not heard of it. |
apologies, it is pdf-parser... |
Was there any progress on this issue? Scanning within PDFs would be a very useful feature. |
Ah I completely forgot about this - I started writing an artifact that can do a yara scan on PDF files - but I didn't have time to back-test it against a set of maldocs. https://gist.github.com/scudette/40f49fb64383eed489667ca9fade93f4 I also started writing a blog post about it but I have not gotten around to finishing it. Thanks for reminding me, I will get to it soon :-). Until then feel free to play with the artifact and comment! |
This is awesome progress, thank you! |
Feedback: I tested your query above and it appears to work well! It found the text in my test PDF document, but one caveat was that I had to add "ascii" to the Yara rule, otherwise it returned no results with only "wide".
|
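For readers hitting the same ascii/wide issue described in the feedback above, an illustrative rule (not the tester's actual one) would look something like this:

```yara
rule search_phrase
{
    strings:
        // "wide" only matches UTF-16LE text; decoded PDF streams are
        // usually 1-byte text, so "ascii" is needed to get hits there.
        $phrase = "bluntness" ascii wide nocase
    condition:
        $phrase
}
```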
@scudette I did some more testing with your custom artefact, and found that I could only get search hits on text in PDF files that I had created (i.e. opened Word, added some text, exported to PDF, searched for it with VR with the Generic.Search.PDF artefact). I tested some PDFs from other sources (e.g. downloaded books), and I also OCR-scanned a page of text, but unfortunately got no hits for text that was there (i.e. confirmed with the find function in Adobe Reader or by extracting the text with the Python module PyPDF2). |
Are you able to share some of those pdfs? |
Sure thing, I’ll dig them out tomorrow.
|
Initial Testing

So I originally downloaded the complete works of Shakespeare in PDF to test it (from https://www.booksfree.org/the-complete-works-of-william-shakespeare-pdf-free-download/). I found a suitable word, "bluntness", that only appears once, and searched for it using VR (Generic.Search.PDF) but got no hits. I then confirmed that I could find the phrase in Adobe Reader, and I also ran a search using the PyPDF2 module in Python, which found a hit.

Python PyPDF2:

# pip install PyPDF2
from PyPDF2 import PdfReader
import re
# reader = PdfReader("The-Complete-Works-of-William-Shakespeare-booksfree.org_.pdf")
# reader = PdfReader("Lorem ipsum - Scanned OCR.pdf")
reader = PdfReader("Lorem ipsum 13k.pdf")
total_pages = len(reader.pages)
hits = 0
search_phrase = ".*bluntness.*"
for page in range(total_pages):
    page_text = reader.pages[page].extract_text()
    search_match = re.search(search_phrase, page_text, re.IGNORECASE)
    if search_match:
        hits += 1
        print(f"Hit for {search_phrase} on page {page + 1}")
        # print(page_text)
print(f"Summary: {hits} hits for search phrase: {search_phrase}")
print("Finished.")

Further Testing

I created a couple of test PDFs, using some generated Lorem Ipsum and the word "bluntness" at the end of the document (in case it was a data-size thing). VR was able to get a hit on the documents created in Word and exported to PDF. (The Lorem ipsum 13k document is PDF version 1.7.) I then printed a page of the Lorem Ipsum text with the word "bluntness" in the middle of the page, and scanned it using NAPS2 and its OCR function. Again the phrase can be found using Adobe Reader and the Python PyPDF2 module, but unfortunately not with VR. (Lorem ipsum - printed, then scanned and OCR'd - PDF version 1.4.) And for your reference, my Yara rule in VR looks like this:
|
This is what I was referring to above when I mentioned the text is not simple to extract from the PDF. If you look at the Velociraptor output for this file, you can see how the characters are encoded.

I guess ultimately we need to ask what we want to get out of parsing PDFs. Do we want to be able to detect things like embedded JS? Embedded URLs (a lot of malware delivers phishing links in PDFs)? Or do we want to be able to extract text? Each of these features has different use cases and can be difficult to properly extract in all cases. pdf-parser.py is only able to extract and decode streams, which is enough for figuring out JS or extracting URLs but not enough to decode text. PyPDF2 is much more featureful, so probably we will need to add native Go support (and we will have to write it ourselves, since there does not seem to be a Go library that is as good out there). |
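A small sketch of why decoded streams still are not text: below, a naive extractor pulls the literal strings fed to the Tj operator out of two hypothetical content streams. It works for a font with a plain encoding, but a subset font can map glyphs to arbitrary code points, where the bytes in the string are meaningless without the font's character map.

```python
# Sketch: naive text recovery from decoded PDF content streams.
# Both streams are hand-made examples, not from a real document.
import re

simple_stream = b"BT /F1 12 Tf (The quick brown fox) Tj ET"
# Hex string with a custom, font-specific encoding:
subset_stream = b"BT /T1_0 9 Tf <01020304> Tj ET"

def naive_text(stream):
    """Join the literal (...) strings passed to Tj.
    Ignores escapes, TJ arrays, hex strings and CMaps entirely."""
    return b" ".join(re.findall(rb"\(([^)]*)\)\s*Tj", stream))

print(naive_text(simple_stream))  # recovers the sentence
print(naive_text(subset_stream))  # recovers nothing useful
```

Handling the second case properly means parsing the embedded font's ToUnicode CMap, which is the part the simplistic open source tools tend to skip.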
Love the work on this; sorry, I have been on the sidelines, testing the new artefact on my 1TB of test files, and had some server hardware issues with that. In my experience the artefact picks up some documents, which is great, and the extracted text in the logs is great as well. It does seem to miss a few documents, but based on my 1TB of data it is very hard to put my finger on which files and why; it takes days to run. I can confirm an export from Word to PDF gets picked up. As for the use cases, I see 2:
|
I had a quick test of existing Go modules and they have the same issue, and like you say @scudette, they work for basic PDF files, but fall down with more complex ones. Examples being:
Other solutions are commercial products, e.g. UniDoc.

As @DFIRFRANKY says, I think those would be the two main use cases:
I have no doubt that you could work it out eventually, but the question is at what cost, and whether it's worth it for the project. |
For a particular use case that Velociraptor would provide (yara scanning files for strings remotely, without the need to download the file from the target), it would be great to see the addition of a PDF parser (such as "pdf-parser").
Currently, it is easy to scan doc, docx, zip, txt, xls and xlsx files for a string (for instance "SECRET"). This means that if a large number of endpoints need to be scanned for a leaked file or a suspicious document, it is a matter of starting a new hunt with a few clicks.
This does leave a large gap, however: PDF files. These need to be parsed before they can be scanned with yara. Having a pdf-parser-type option for yara scanning in Velociraptor would greatly enhance the functionality of the suite.
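The pipeline being asked for can be sketched end to end: decode the PDF's streams first, then scan the decoded bytes, because a scan of the raw file only sees compressed data. This is an illustrative stand-in, built on a hand-made sample; the plain substring check stands in for the real yara step, which in Velociraptor would be driven from VQL.

```python
# Sketch of "parse, then scan": yara-style matching is useless against
# compressed stream bytes, so the streams must be decoded first.
import re
import zlib

def decoded_streams(pdf_bytes):
    """Yield each stream body, inflated when it is FlateDecode."""
    for m in re.finditer(rb"stream\r?\n(.*?)\r?\nendstream",
                         pdf_bytes, re.DOTALL):
        try:
            yield zlib.decompress(m.group(1))
        except zlib.error:
            yield m.group(1)

def scan_pdf(pdf_bytes, needle):
    """Stand-in for the yara scan: match against decoded streams."""
    return any(needle in s for s in decoded_streams(pdf_bytes))

# Hand-built sample where the marker only exists inside a compressed stream.
body = zlib.compress(b"BT (Project SECRET notes) Tj ET")
sample = b"stream\n" + body + b"\nendstream\n"
print(scan_pdf(sample, b"SECRET"))  # True once the stream is decoded
```

With all the caveats from the earlier comments: this finds literal text in simple streams, but documents using subset-font encodings would still evade it without full text extraction.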