Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError in ZIPParser for invalid zips #155

Closed
fabioperrella opened this issue Aug 13, 2020 · 10 comments
Closed

TypeError in ZIPParser for invalid zips #155

fabioperrella opened this issue Aug 13, 2020 · 10 comments
Assignees
Labels

Comments

@fabioperrella
Copy link
Contributor

fabioperrella commented Aug 13, 2020

Message

no implicit conversion of nil into String

Backtrace

/usr/local/bundle/gems/format_parser-0.21.1/lib/parsers/zip_parser/file_reader.rb:178 initialize
/usr/local/bundle/gems/format_parser-0.21.1/lib/parsers/zip_parser/file_reader.rb:178 new
/usr/local/bundle/gems/format_parser-0.21.1/lib/parsers/zip_parser/file_reader.rb:178 read_zip_structure
/usr/local/bundle/gems/format_parser-0.21.1/lib/parsers/zip_parser.rb:17 call
/usr/local/bundle/gems/format_parser-0.21.1/lib/format_parser.rb:201 block in execute_parser_and_capture_expected_exceptions
/usr/local/bundle/gems/measurometer-1.1.0/lib/measurometer.rb:48 instrument
/usr/local/bundle/gems/format_parser-0.21.1/lib/format_parser.rb:200 execute_parser_and_capture_expected_exceptions
/usr/local/bundle/gems/format_parser-0.21.1/lib/format_parser.rb:168 block in parse
/usr/local/bundle/gems/format_parser-0.21.1/lib/format_parser.rb:172 each
/usr/local/bundle/gems/format_parser-0.21.1/lib/format_parser.rb:172 each

This issue was being discussed in a private repo and we decided to move the discussion to here.

TLDR:

  1. some zip files raise error at this pont because io.pos is beyond the file size
  2. I tested some of those files (in production) with zip -T and I got zip error: Zip file structure invalid. So I don't think we need to improve the parser to recognize those files
  3. I couldn't create a zip file to raise this error to use as a fixture in tests. I tried with zip -s, head -c BYTES file.zip > corrupted.zip, etc, but this way, other errors are raised
  4. my idea is to create a new error eg: InvalidCdirPosition = Class.new(Error) which would be rescued by the rescue FileReader::Error (this is the pattern that other errors use), but @linkyndy suggested a flag to raise the errors optionally

@julik replied this:

I believe this has more to do with how FormatParser works than with the ZIP format. For all intents and purposes, FormatParser does a brute-force application of N functions to a single payload, and some of them may raise exceptions or fail in an unexpected way. Since the functions are applied in sequence it can happen that a failure in one function (parser) will raise an exception and the code will not "drop down" to the next parser. Another concern is that when performing a parse for a specific file multiple parser modules can raise an exception actually - not only one. And in Ruby you can only recover one exception using the builtin way of handling errors, sadly.

Also, FormatParser by nature deals with a very wild variety of files. There is no guarantee that it supports all the file format features, but there is also no guarantee that the file format we are parsing is valid - for example, a lot of errors raised from the id3tag module for MP3 files stem from the fact that ID3 tags have mislabeled character encodings, which bleeds into the id3tag library. When making FP there was never a goal of recovering file information at any cost even if the file is invalid - since it was meant for previewing, the intention was that if a file is malformed or invalid you do not want to treat it as if it was OK, and you actually do want not to recognize it. That's why FormatParser parsers have, very deliberately, no fallbacks and no recovery strategies – a file has to be genuinely "mostly correct" to be treated correctly.

In terms of error handling there are three ways to handle those cases:

  1. We drop down to the next parser, and optionally register an error in the parser that raised
  2. We raise the error immediately
  3. We suppress all errors and drop down to the following parser

From the beginning with FormatParser we effectively defaulted to option 2, with the intention being to capture as many problems and quirks in FormatParser itself as well as finding as many problems in the upstream libraries as we can. This has largely worked, but it does pollute the APM store for the application which uses FormatParser. I think we could change the approach there using the following possible means

  • We introduce a monadic result type which can contain either an exception object or a result, and we present those to the caller. The caller can then decide to register the first occurred exception, or all occurred exception, or just recover the first valid result
  • We add a setting to FormatParser where a Proc can be given which will receive exceptions that occur during parsing, and will receive all of them. We can't really handle multiple errors there (Appsignal - for one - only supports one exception per transaction/unit of work) but it would allow us to make error handling / error recovery in FormatParser optional

In terms of "making FP work with all possible files regardless of how broken they are" - I firmly believe this should be a non-goal. If a file is malformed we should not spend time on recovering its structure (also, malformed files are often what embeds exploits).

@fabioperrella fabioperrella self-assigned this Aug 13, 2020
@fabioperrella
Copy link
Contributor Author

fabioperrella commented Aug 13, 2020

@julik I think the suggestions that you gave are pretty good to deal with unknown errors, but we have a few cases that are treated inside the parsers that not raise error outside, so it's possible to drop down to the following parser.

For example, ZIPParser::FileReader has the errors UnsupportedFeature, InvalidStructure and others.

wdyt adding another one to detect this scenario that I mentioned and move the problem of how to deal with unknown errors to another issue?

@linkyndy
Copy link
Contributor

Great discussion so far 🙌 I would like to add that we are not looking to "making FP work with all possible files regardless of how broken they are"; I am perfectly fine with seeing expected errors and ignoring them. However, it is TypeError, KeyError and the like that bother me; they are unexpected errors and I think we should 1) fix them, if they are indeed that 2) wrap them in something nicer, making them expected errors, so that apps using this gem can safely ignore them if they want.

@fabioperrella
Copy link
Contributor Author

I am perfectly fine with seeing expected errors and ignoring them

@linkyndy just to confirm, do you consider the treatment that I suggested for this error as an expected error? Is so, I will work in a PR for it

@linkyndy
Copy link
Contributor

If we can rescue something specific, that would be great. I don't think it's a good idea to start ignoring TypeErrors, for obvious reasons.

@lorenzograndi4
Copy link
Contributor

For your knowledge, see previous discussion here: #152

I also think a TypeError is unexpected, but it's more valuable to fix the source of the issue rather than hide it.

@julik
Copy link
Contributor

julik commented Aug 14, 2020

I doubt splitting errors into "expected" and "unexpected" is a sensible approach here, for the following reasons:

  • We do not know whether a library we are using for parsing is prone to raising an "unexpected error". For example, the ID3 tag parser raises TypeErrors and EncodingErrors and whatnot, because it does not produce fallbacks for invalid data - it just trucks on and fails loudly, early. Same for the exifr gem.
  • We apply multiple parsers to the same file, and it might happen that one parser raises an "unexpected error" while another parser will succeed. Do you really care about this "unexpected error" in that case?
  • If we want to guarantee that there will be no "unexpected errors" from any parsers the only option we have is maintaining all the parsing code ourselves, including all the recovery. This means that we need our own parsing flows also for relatively complex formats (ID3 and EXIF) and it is going to cost time. This is what the ZIP parser does at the moment (to an extent) but I don't know whether this approach scales all that well.

So my proposal is more geared towards distinguishing between errors which happen in the FormatParser "shell" (everything above a specific parser module, and multiple parser modules even) and errors which parsing modules themselves raise. For the "shell" you do want to have errors always, because they are indicative of the format parser itself doing something unexpected. But for the parser modules I think we should make it optional for the caller to do something with those errors later on (for example using an optional error handling proc).

@fabioperrella
Copy link
Contributor Author

fabioperrella commented Aug 17, 2020

just to clarify, I created a draft PR showing how is my idea to tackle with the specific zip problem that I found

#156

I didn't push the file that I used to test yet because I'm waiting for the owner approval.

@fabioperrella
Copy link
Contributor Author

fabioperrella commented Aug 18, 2020

I tried a different approach and I think it is a valid one 🤔

I got a file which is raising the error in the production server and removed all the content that is not relevant to the parser. By doing this I'm almost sure we can consider this file with no user content inside it.

After that, the file size was reduced to 1KB and I got the same error that I had with the original one!

Let me show how I did it, I think it's something we can do for other types of files as well!

  1. I generated a CSV with all the reads (position, bytes) that the parser does in Care#read
def read(n_bytes)
  return '' if n_bytes == 0 # As hardcoded for all Ruby IO objects
  raise ArgumentError, "negative length #{n_bytes} given" if n_bytes < 0 # also as per Ruby IO objects
  read = @cache.byteslice(@io, @pos, n_bytes)
  return unless read && !read.empty?
  puts "#{@pos.to_s.rjust(6, "0")}, #{n_bytes.to_s.rjust(5, "0")}"  # <---- I added this
  @pos += read.bytesize
  read
end

These are the ranges that the parser reads information and we can't change

Then I ran:

exe/format_parser_inspect file.zip | sort | uniq > out.csv

This created a file like this:

000000, 00001
000000, 00002
000000, 00004
000000, 00005
000000, 00006
000000, 00008
000000, 00009
000000, 00010
000000, 00012
000000, 00014
000008, 00004
000010, 00004
000018, 00004
000026, 00004
000034, 00004
000042, 00004
000050, 00004
000058, 00004
000066, 00004
000074, 00004
000082, 00004
000090, 00004
000098, 00004
000106, 00004
000114, 00004
000122, 00004
000130, 00004
000138, 00004
000146, 00004
000154, 00004
...
  1. Then I created a script to return only the uniq ranges, for example for the ranges (0..4) and (0..8), it's only relevant the range (0..8) because it covers the range (0..4):
require 'csv'

ranges = []
CSV.foreach('out.csv') do |pos, size|
  ranges << (pos.to_i..(pos.to_i+size.to_i))
end

uniq_ranges = ranges.map do |range|
  if other_range = ranges.find { |other_range| other_range != range && other_range.cover?(range) }
    nil
  else
    range
  end
end.uniq

puts "ranges: #{uniq_ranges}"

Then I got an output like this:

ranges: [nil, 0..14, 18..22, 26..30, 34..38, 42..46, 50..54, 58..62, 66..70, 74..78, 82..86, 90..94, 98..102, 106..110, 114..118, 
122..126, 130..134, 138..142, 146..150, 154..158, 162..166, 170..174, 178..182, 186..190, 194..198, 202..206, 210..214, 218..222, 
226..230, 234..238, 242..246, 250..254, 258..262, 266..270, 274..278, 282..286, 290..294, 298..302, 306..310, 314..318, 322..326, 
330..334, 338..342, 346..350, 354..358, 362..366, 370..374, 378..382, 386..390, 394..398, 402..406, 410..414, 418..422, 426..430, 
434..438, 442..446, 450..454, 458..462, 466..470, 474..478, 482..486, 490..494, 498..502, 506..510, 514..518, 522..526, 530..534, 
538..542, 546..550, 554..558, 562..566, 570..574, 578..582, 586..590, 594..598, 602..606, 610..614, 618..622, 626..630, 634..638, 
642..646, 650..654, 658..662, 666..670, 674..678, 682..686, 690..694, 698..702, 706..710, 714..718, 722..726, 730..734, 738..742, 
746..750, 754..758, 762..766, 770..774, 778..782, 786..790, 794..798, 802..806, 810..814, 818..822, 826..830, 834..838, 842..846, 
850..854, 858..862, 866..870, 874..878, 882..886, 890..894, 898..902, 906..910, 914..918, 922..926, 930..934, 938..942, 946..950, 
954..958, 962..966, 970..974, 978..982, 986..990, 994..998, 1002..1006, 1010..1014, 1018..1022, 1026..1030, 132458..198015]

So I realized everything after the position 1030 could be removed because it's not relevant to the parser. (but at this point I was not sure if the zip would be consider invalid after removing it)

Then, I opened the binary file in SublimeText and removed all the bytes from the position 1031 until the last 10 bytes (I'm not sure how many bytes I need to keep in the end, but if I remove all of them I got an error Unexpected end of zip file)

I'm attaching the file that I generated, so you can inspect it.

file2.zip

With this file, I got the same error that I had running zip -T or exe/format_parser_inspect!

@julik @martijnvermaat @lorenzograndi4 do you think this approach is valid to avoid asking the user for permission? I think it is, but I'm not 100% sure that I removed all the information that I had to.

@fabioperrella
Copy link
Contributor Author

@linkyndy now that you are back, could you give your feedback about it pls ☝️

@fabioperrella
Copy link
Contributor Author

closed due to #156

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants