Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Metadata filter #225

Closed
wants to merge 1 commit into from
Closed

Conversation

Nadahar
Copy link
Contributor

@Nadahar Nadahar commented Jan 13, 2017

@drewnoakes You mentioned creating a whitelist in #216. I figured that would be worth a shot, but I wanted to keep the filtered directories and tags from being created in the first place. This turned out to require changes in many places, so it will be preferrable not to have to keep this "in sync" manually.

As a result, I've created this pull request hoping that you'd consider taking it into the library itself. I've made it as general as possible. All changes to public methods are done with overloads, so it is API compatible. Most of the changes are very straight forward, but I had some challenges with TiffHandler and its implementing classes due to it's recursiveness and DirectoryTiffHandler class global _currentDirectory. I created FilteredDirectory that simply doesn't store anything, and had to redirect setParent() to the first non-filtered parent.

It is implemented as an interface MetadataFilter that can easily be implemented as an anonymous class for easy use. It can be used as either a blacklist or a whitelist.

I've tried to keep the formatting as close to the existing code as I can, but I've been very uncertain about how to handle line wrapping many places. The existing practice seems to vary some, so I'm pretty sure you'll want some changes to that.

As a side note, I think I might have found a bug in the new GIF reader. I don't have intricate knowledge of the format, but the reader always skips the size of the color table without checking if there is a global color table. It looks to me like that won't work if the GIF doesn't have a global color table.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 32.812% when pulling e120f04 on Nadahar:MetadataFilter into 0f853af on drewnoakes:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 32.812% when pulling e120f04 on Nadahar:MetadataFilter into 0f853af on drewnoakes:master.

@drewnoakes
Copy link
Owner

I really like the idea of filtering. Some thoughts follow.

The first observation is that the filter gets passed around a lot. Instead, a reference to it could be stored in Metadata and Directory instances. This would avoid breaking public APIs and reduce the number of changes considerably.

Secondly, I'm not sure that filtering at the tag level makes as much sense as filtering at the directory level. You may, for example, not be interested in ICC data, nor perhaps Exif thumnail data. I know you wished to filter out some large byte[] tag values (or lengthy strings), but perhaps that could be achieved via a different method on the filter class that is called for large tags.

Further, filtering might be format dependant. For example, when processing JPEG files, there are many segment types that metadata extractor will look through by default. You can use the JPEG-related classes it provides and pull out the segments manually for processing, but this requires knowledge of the mapping between segment types and the metadata they contain. Instead, I'd favour maintaining a set of JpegSegmentMetadataReader on the 'filter', that controls which readers to use. This only applies to JPEG, but it is a very popular format. Doing this would give better performance than directory type filtering alone, as the JPEG segments wouldn't even be read from the underlying stream.

Users may wish to exclude certain types of file from processing. The filter could contain a set of FileType enum values that is checked by ImageMetadataReader for that purpose.

I feel that such an important new API needs a bit more design attention before landing. This solution might address the needs of UMS (and I'd be thrilled to have you use this library without having to maintain a fork) but I must also consider the various other kinds of users and use cases.

Stepping back for a minute, what's the primary motivation for filtering in your case? Is it performance? Convenience? I gather the former, and if so, where is the pain? If it were enough to prevent huge tags being stored in memory (which we will definitely fix, see #221), would that suffice? Do you have any measurements of the impact of filtering? Such data would inform the design here.

@Nadahar
Copy link
Contributor Author

Nadahar commented Jan 16, 2017

@drewnoakes As I'm not as familiary with this code as you obviously are, I might not have chosen the optimal solution when I chose to pass it around. My primary concern was to have minimal impact on existing code and logic while implementing a filter, and I tried to keep "structural" changes to a minimum. I tried long and hard to avoid creating "FilteredDirectory" but in the end I couldn't find a way around it. I'm very open to that this might not be the optimal implementation, and I didn't expect it to be merged as-is in any case. My goal was to make a working implementation with minimal impact. I don't think I've broken the API anywhere though, unless you consider additional overloads to break it. Any public method exists in the exact same version as before, but many have gotten an addition overload which takes a MetadataFilter. Using the overload without the filter will simply works as before, without filtering.

As I see it, tag filtering is fundamental to what I want to achieve. As you suggested, I was thinking of using a whitelist filter (while keeping the possibility for blacklist filtereing as well). The filter I'm thinking of using will filter out directories like all the makernotes and other stuff seems irrelevant, but it will include most of the "basic" directories for the different image formats and the basic Exif directories. Within the directories that are kept, I plan on explicitly keep any tags that seems like they could be relevant. That means filtering out any unknown tags and tags I can't see the possible use for (binary data, camera make, model, name of software etc).

While #221 obviously is a big concern, I really liked the whitelist approach by simply excluding everything unknown. My goal isn't just to avoid massive binary data being included. From my testing the serialized Metadata instances generally end up in the 15 - 25 kb range. That is very much bigger than I what I had in mind, so I'm also hoping to reduce the general size of an instance.

Filtering by JPEG segments seems like a good addition. From my POV reducing the memory consumption is a primary concern because this is going to be repeated for thousands of files or streams. Reducing memory consumption will usually give better performance as well.

The same goes for FileType filtering, while it's not relevant to UMS (we already know that we want metadata for a given file before making the call to extract them) it seems like a good idea.

My primary concern is size and then performance I'd say, and after that things like maintainability and convenience also weighs in. Size is primarily a memory concern (as UMS is already a bit hard on memory use), but keeping the size of the cache/database in check is also a factor as that translates into performance for everything involving the databse. I haven't actually implemented the filter in UMS yet, so I can't say what the impact is, but I'm fairly confident that it will be substantial. Filtering also gives me the possiblity to "tune" the size to an "acceptable level" when I decide what to keep and not.

I'm still working on some other unfinished things in the "image refactoring branch". My idea was to maintain a dialogue about these issues here while completing the rest, and then make a decision on how exactly to handle this. One possible solution could be to use a fork for now and then revisit this later and and see if there's changes made to the library in the mean time that would make it possible to ditch the fork. That would allow me to finish this relatively quickly while allowing time for a more "mature" implementation here.

@Nadahar
Copy link
Contributor Author

Nadahar commented Mar 15, 2017

@drewnoakes An idea has been forming in the back of my head which I haven't implemeted as I don't know if you think it's a good idea, but here it is:

I've been thinking of creating a class, MetadataParameters, ReaderParameters or similar that can be sent to the public API calls (where it's relevant obviously). Just like MetadataFilter is, it would be optional/@Nullable and the methods which has it as a parameter would have overloads without it that is identical to how they are today,

That class could then hold everything that might be needed to include as extra information, now and in the future, like MetadataFilter, Locale/character encoding, maximum unknown tag size to keep, whether to store thumbnail data etc.

All of its values would have a null/default which resulted in the same functionality as today.

It would need to be passed around approximately as the MetadataFilter is now, but it would be for a "greater cause". It would also be possible to reduce the passing by storing it in class fields e.g on Directory, but as always using "globals" has concurrency implications. This is why I default to passing things around, it's so simple with regards to thread-safety - but since globals are already used the problem is already there and this wouldn't make it any worse.

What do you think, would it be worth my while?

@drewnoakes
Copy link
Owner

I like the idea. Somewhere along the line the idea of a MetadataContext or similar was floated. It'd be optional, and if you don't specify it you'd get a default which would include all types of metadata. The trick will be to make this simple, extensible and not too invasive on the current API.

Would be good to sketch out API ideas here.

There's some mention of the context in the PR tracking a new API for a future major release here.

@drewnoakes
Copy link
Owner

The context could also be a place to control #257

@drewnoakes
Copy link
Owner

In general I like the idea of filtering, but think that the design needs some work.

I'm doing some housekeeping and and closing old PRs. Sorry we couldn't merge your work here.

@drewnoakes drewnoakes closed this Jan 21, 2020
@Nadahar
Copy link
Contributor Author

Nadahar commented Jan 21, 2020

No problem, I had given up on this a very long time ago.

If I had understood how you wanted the MetadataContext object to be handled, I would probably have made an attempt. It was never clear enough for me how you wanted it though, and making a "I hope I got it right" implementation doesn't seem like a good way to spend my time. Since I have 10000 other things to do, it kind of stalled there 😉

@drewnoakes
Copy link
Owner

Understood. The problem is that to work out how I want it means doing 90% of the work anyway, and I also have many other things to be working on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants