TypeError in ZIPParser for invalid zips #155
Comments
@julik I think the suggestions that you gave are pretty good for dealing with unknown errors, but we have a few cases that are handled inside the parsers and do not raise an error outside, so it's possible to drop down to the following parser. For example, wdyt about adding another one to detect the scenario that I mentioned, and moving the problem of how to deal with unknown errors to another issue?
Great discussion so far 🙌 I would like to add that we are not looking into "making FP work with all possible files regardless of how broken they are"; I am perfectly fine with seeing expected errors and ignoring them. However, it is
@linkyndy just to confirm, do you consider the treatment that I suggested for this error as an expected error? If so, I will work on a PR for it.
If we can rescue something specific, that would be great. I don't think it's a good idea to start ignoring
For your knowledge, see the previous discussion here: #152. I also think a TypeError is unexpected, but it's more valuable to fix the source of the issue rather than hide it.
I doubt splitting errors into "expected" and "unexpected" is a sensible approach here, for the following reasons:
So my proposal is more geared towards distinguishing between errors which happen in the FormatParser "shell" (everything above a specific parser module, and multiple parser modules even) and errors which parsing modules themselves raise. For the "shell" you do want to have errors always, because they are indicative of the format parser itself doing something unexpected. But for the parser modules I think we should make it optional for the caller to do something with those errors later on (for example using an optional error handling proc).
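To make the "shell" versus parser-module distinction concrete, here is a minimal sketch. It assumes a hypothetical `ParserShell` class and a hypothetical `on_parser_error:` keyword; neither is part of FormatParser's actual API. Errors raised inside a parser are passed to the optional proc and the next parser still runs, while anything raised by the shell itself would propagate as usual.

```ruby
# Hypothetical sketch: ParserShell and on_parser_error are illustrative names,
# not FormatParser's real API.
class ParserShell
  # parsers: callables taking an io and returning a result or nil
  # on_parser_error: optional proc receiving (parser, error)
  def initialize(parsers, on_parser_error: nil)
    @parsers = parsers
    @on_parser_error = on_parser_error
  end

  def parse(io)
    @parsers.each do |parser|
      result = begin
        parser.call(io)
      rescue StandardError => e
        # An error raised inside a parser module is handed to the caller's
        # proc (if one was given) and does not stop the remaining parsers.
        @on_parser_error&.call(parser, e)
        nil
      end
      return result if result
    end
    nil
  end
end

seen_errors = []
shell = ParserShell.new(
  [
    ->(_io) { raise TypeError, 'no implicit conversion from nil to integer' },
    ->(_io) { :recognized }
  ],
  on_parser_error: ->(_parser, error) { seen_errors << error }
)
outcome = shell.parse(nil) # the second parser still runs
```

With this shape the caller decides whether parser errors go to an APM, a log, or nowhere; omitting the proc simply discards them.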
Just to clarify, I created a draft PR showing my idea for tackling the specific zip problem that I found. I didn't push the file that I used for testing yet because I'm waiting for the owner's approval.
I tried a different approach and I think it is a valid one 🤔 I took a file which raises the error on the production server and removed all the content that is not relevant to the parser. By doing this I'm almost sure we can consider this file to have no user content inside it. After that, the file size was reduced to 1KB and I got the same error that I had with the original one! Let me show how I did it; I think it's something we can do for other types of files as well!
def read(n_bytes)
  return '' if n_bytes == 0 # As hardcoded for all Ruby IO objects
  raise ArgumentError, "negative length #{n_bytes} given" if n_bytes < 0 # also as per Ruby IO objects
  read = @cache.byteslice(@io, @pos, n_bytes)
  return unless read && !read.empty?
  puts "#{@pos.to_s.rjust(6, "0")}, #{n_bytes.to_s.rjust(5, "0")}" # <---- I added this
  @pos += read.bytesize
  read
end
These are the ranges that the parser reads information from, and we can't change them.
Then I ran:
exe/format_parser_inspect file.zip | sort | uniq > out.csv
This created a file like this:
require 'csv'

# Each row of out.csv is "position, size" as printed by the debug line in read
ranges = []
CSV.foreach('out.csv') do |pos, size|
  ranges << (pos.to_i..(pos.to_i + size.to_i))
end

# Keep only the ranges that are not fully covered by another range
uniq_ranges = ranges.uniq.reject do |range|
  ranges.any? { |other| other != range && other.cover?(range) }
end

puts "ranges: #{uniq_ranges}"
Then I got an output like this:
So I realized everything after position 1030 could be removed because it's not relevant to the parser (but at this point I was not sure if the zip would be considered invalid after removing it). Then, I opened the binary file in SublimeText and removed all the bytes from position 1031 until the last 10 bytes (I'm not sure how many bytes I need to keep at the end, but if I remove all of them I get an error). I'm attaching the file that I generated, so you can inspect it. With this file, I got the same error that I had running
@julik @martijnvermaat @lorenzograndi4 do you think this approach is valid to avoid asking the user for permission? I think it is, but I'm not 100% sure that I removed all the information that I had to.
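The manual trimming step could also be scripted instead of done in an editor. The sketch below assumes a hypothetical helper, `strip_unread_bytes`, that keeps the first `head_size` bytes and the last `tail_size` bytes of a binary string and drops everything in between. The values 1031 and 10 in the comment are the ones mentioned above; whether a trimmed file still triggers the same error depends on the file.

```ruby
# Hypothetical helper (not part of FormatParser): keeps the head and the tail
# of a binary string and drops the bytes in between.
def strip_unread_bytes(data, head_size:, tail_size:)
  # Nothing to drop if the head and tail already cover the whole string
  return data.dup if head_size + tail_size >= data.bytesize

  head = data.byteslice(0, head_size)
  tail = data.byteslice(data.bytesize - tail_size, tail_size)
  head + tail
end

# In-memory demonstration with made-up content
sample = ('a' * 20 + 'b' * 50 + 'c' * 5).b
trimmed = strip_unread_bytes(sample, head_size: 20, tail_size: 5)

# For a real file, something like this should work (file names are examples):
#   data = File.binread('file.zip')
#   File.binwrite('trimmed.zip', strip_unread_bytes(data, head_size: 1031, tail_size: 10))
```

Reading and writing with `File.binread` and `File.binwrite` avoids any encoding or newline translation, which matters for binary formats such as ZIP.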
@linkyndy now that you are back, could you give your feedback about it pls ☝️
Closed due to #156
Message
Backtrace
This issue was being discussed in a private repo and we decided to move the discussion here.
TLDR:
- io.pos is beyond the file size
- zip -T and I got "zip error: Zip file structure invalid". So I don't think we need to improve the parser to recognize those files
- zip -s, head -c BYTES file.zip > corrupted.zip, etc., but this way, other errors are raised
- InvalidCdirPosition = Class.new(Error), which would be rescued by the rescue FileReader::Error (this is the pattern that other errors use), but @linkyndy suggested a flag to raise the errors optionally

@julik replied this:
I believe this has more to do with how FormatParser works than with the ZIP format. For all intents and purposes, FormatParser does a brute-force application of N functions to a single payload, and some of them may raise exceptions or fail in an unexpected way. Since the functions are applied in sequence it can happen that a failure in one function (parser) will raise an exception and the code will not "drop down" to the next parser. Another concern is that when performing a parse for a specific file multiple parser modules can raise an exception actually - not only one. And in Ruby you can only recover one exception using the builtin way of handling errors, sadly.
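A minimal sketch of that situation, using hypothetical stand-in lambdas rather than real FormatParser parser modules: each parser is applied in sequence, every exception is remembered instead of stopping the loop, and a later parser can still recognize the payload.

```ruby
# Hypothetical stand-ins for parser modules: each takes a payload and either
# returns a result, returns nil, or raises.
PARSERS = {
  zip: ->(_payload) { raise TypeError, 'no implicit conversion from nil to integer' },
  mp3: ->(_payload) { raise ArgumentError, 'mislabeled character encoding' },
  gif: ->(payload) { payload.start_with?('GIF8') ? :gif : nil }
}.freeze

# Applies the parsers in order and returns [first_result, errors_by_parser].
# A failing parser is recorded and the loop drops down to the next one.
def apply_all(parsers, payload)
  errors = {}
  result = nil
  parsers.each do |name, parser|
    begin
      result = parser.call(payload)
      break if result
    rescue StandardError => e
      errors[name] = e
    end
  end
  [result, errors]
end

result, errors = apply_all(PARSERS, 'GIF89a')
```

Here `result` is the value from the first parser that succeeds, and `errors` holds one exception per failed parser, which a single `rescue` around the whole loop could not provide.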
Also, FormatParser by nature deals with a very wide variety of files. There is no guarantee that it supports all the file format features, but there is also no guarantee that the file we are parsing is valid. For example, a lot of errors raised from the id3tag module for MP3 files stem from the fact that ID3 tags have mislabeled character encodings, which bleeds into the id3tag library. When making FP there was never a goal of recovering file information at any cost even if the file is invalid. Since it was meant for previewing, the intention was that if a file is malformed or invalid you do not want to treat it as if it was OK, and you actually do not want to recognize it. That's why FormatParser parsers have, very deliberately, no fallbacks and no recovery strategies: a file has to be genuinely "mostly correct" to be treated correctly.
In terms of error handling there are three ways to handle those cases:
From the beginning with FormatParser we effectively defaulted to option 2, with the intention being to capture as many problems and quirks in FormatParser itself as well as finding as many problems in the upstream libraries as we can. This has largely worked, but it does pollute the APM store for the application which uses FormatParser. I think we could change the approach there using the following possible means
In terms of "making FP work with all possible files regardless of how broken they are" - I firmly believe this should be a non-goal. If a file is malformed we should not spend time on recovering its structure (also, malformed files are often what embeds exploits).