Implement Metadata filter #225
I really like the idea of filtering. Some thoughts follow.
The first observation is that the filter gets passed around a lot. Instead, a reference to it could be stored in
Secondly, I'm not sure that filtering at the tag level makes as much sense as filtering at the directory level. You may, for example, not be interested in ICC data, nor perhaps Exif thumbnail data. I know you wished to filter out some large
Further, filtering might be format dependent. For example, when processing JPEG files, there are many segment types that metadata extractor will look through by default. You can use the JPEG-related classes it provides and pull out the segments manually for processing, but this requires knowledge of the mapping between segment types and the metadata they contain. Instead, I'd favour maintaining a set of
Users may wish to exclude certain types of file from processing. The filter could contain a set of
I feel that such an important new API needs a bit more design attention before landing. This solution might address the needs of UMS (and I'd be thrilled to have you use this library without having to maintain a fork) but I must also consider the various other kinds of users and use cases.
Stepping back for a minute, what's the primary motivation for filtering in your case? Is it performance? Convenience? I gather the former, and if so, where is the pain? If it were enough to prevent huge tags being stored in memory (which we will definitely fix, see #221), would that suffice? Do you have any measurements of the impact of filtering? Such data would inform the design here.
@drewnoakes As I'm not as familiar with this code as you obviously are, I might not have chosen the optimal solution when I chose to pass it around. My primary concern was to have minimal impact on existing code and logic while implementing a filter, and I tried to keep "structural" changes to a minimum. I tried long and hard to avoid creating "FilteredDirectory" but in the end I couldn't find a way around it. I'm very open to the possibility that this is not the optimal implementation, and I didn't expect it to be merged as-is in any case. My goal was to make a working implementation with minimal impact.
I don't think I've broken the API anywhere though, unless you consider additional overloads to break it. Any public method exists in the exact same version as before, but many have gotten an additional overload which takes a
As I see it, tag filtering is fundamental to what I want to achieve. As you suggested, I was thinking of using a whitelist filter (while keeping the possibility of blacklist filtering as well). The filter I'm thinking of using will filter out directories like all the makernotes and other stuff that seems irrelevant, but it will include most of the "basic" directories for the different image formats and the basic Exif directories. Within the directories that are kept, I plan on explicitly keeping any tags that seem like they could be relevant. That means filtering out any unknown tags and tags I can't see a possible use for (binary data, camera make, model, name of software etc).
While #221 obviously is a big concern, I really liked the whitelist approach of simply excluding everything unknown. My goal isn't just to avoid massive binary data being included. From my testing the serialized
Filtering by JPEG segments seems like a good addition. From my POV reducing the memory consumption is a primary concern because this is going to be repeated for thousands of files or streams. Reducing memory consumption will usually give better performance as well.
The same goes for
My primary concern is size and then performance I'd say, and after that things like maintainability and convenience also weigh in. Size is primarily a memory concern (as UMS is already a bit hard on memory use), but keeping the size of the cache/database in check is also a factor as that translates into performance for everything involving the database. I haven't actually implemented the filter in UMS yet, so I can't say what the impact is, but I'm fairly confident that it will be substantial. Filtering also gives me the possibility to "tune" the size to an "acceptable level" when I decide what to keep and what not to.
I'm still working on some other unfinished things in the "image refactoring branch". My idea was to maintain a dialogue about these issues here while completing the rest, and then make a decision on how exactly to handle this. One possible solution could be to use a fork for now and then revisit this later and see if there are changes made to the library in the meantime that would make it possible to ditch the fork. That would allow me to finish this relatively quickly while allowing time for a more "mature" implementation here.
@drewnoakes An idea has been forming in the back of my head which I haven't implemented as I don't know if you think it's a good idea, but here it is: I've been thinking of creating a class, That class could then hold everything that might be needed to include as extra information, now and in the future, like All of its values would have a It would need to be passed around approximately as the What do you think, would it be worth my while?
I like the idea. Somewhere along the line the idea of a Would be good to sketch out API ideas here. There's some mention of the context in the PR tracking a new API for a future major release here.
The context could also be a place to control #257
In general I like the idea of filtering, but think that the design needs some work. I'm doing some housekeeping and closing old PRs. Sorry we couldn't merge your work here.
No problem, I had given up on this a very long time ago. If I had understood how you wanted the
Understood. The problem is that to work out how I want it means doing 90% of the work anyway, and I also have many other things to be working on.
@drewnoakes You mentioned creating a whitelist in #216. I figured that would be worth a shot, but I wanted to keep the filtered directories and tags from being created in the first place. This turned out to require changes in many places, so it would be preferable not to have to keep this "in sync" manually.
As a result, I've created this pull request hoping that you'd consider taking it into the library itself. I've made it as general as possible. All changes to public methods are done with overloads, so it is API compatible. Most of the changes are very straightforward, but I had some challenges with TiffHandler and its implementing classes due to its recursiveness and the DirectoryTiffHandler class global _currentDirectory. I created FilteredDirectory that simply doesn't store anything, and had to redirect setParent() to the first non-filtered parent.
It is implemented as an interface MetadataFilter that can easily be implemented as an anonymous class for easy use. It can be used as either a blacklist or a whitelist.
I've tried to keep the formatting as close to the existing code as I can, but I've been very uncertain about how to handle line wrapping in many places. The existing practice seems to vary some, so I'm pretty sure you'll want some changes to that.
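To make the whitelist/blacklist idea concrete, here is a minimal sketch of how a filter interface of this kind could be implemented as an anonymous class. Only the name MetadataFilter comes from this pull request; the method names directoryFilter and tagFilter, their signatures, the FilterSketch helper class, and the choice of directory names and tag IDs are assumptions made purely for illustration and are not the API proposed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical shape of the filter: the method names and signatures
// are assumptions for illustration, not the interface from this PR.
interface MetadataFilter {
    // Return true to keep the directory, false to skip it entirely.
    boolean directoryFilter(String directoryName);

    // Return true to keep the tag inside a directory that is kept.
    boolean tagFilter(String directoryName, int tagType);
}

// Hypothetical helper class holding two example filters.
class FilterSketch {
    // Whitelist: keep only known directories and known tags.
    static MetadataFilter whitelist() {
        final Set<String> directories =
            new HashSet<>(Arrays.asList("Exif IFD0", "Exif SubIFD"));
        // 0x0112 = Orientation, 0x9003 = DateTimeOriginal (Exif tag IDs).
        final Set<Integer> tags = new HashSet<>(Arrays.asList(0x0112, 0x9003));
        return new MetadataFilter() {
            @Override
            public boolean directoryFilter(String directoryName) {
                return directories.contains(directoryName);
            }

            @Override
            public boolean tagFilter(String directoryName, int tagType) {
                return directories.contains(directoryName) && tags.contains(tagType);
            }
        };
    }

    // Blacklist: keep everything except the named directories.
    static MetadataFilter blacklist() {
        final Set<String> excluded =
            new HashSet<>(Arrays.asList("ICC Profile", "Exif Thumbnail"));
        return new MetadataFilter() {
            @Override
            public boolean directoryFilter(String directoryName) {
                return !excluded.contains(directoryName);
            }

            @Override
            public boolean tagFilter(String directoryName, int tagType) {
                return !excluded.contains(directoryName);
            }
        };
    }
}
```

In this sketch a reader would consult directoryFilter before creating a directory and tagFilter before storing a tag, so filtered data is never allocated in the first place.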
As a side note, I think I might have found a bug in the new GIF reader. I don't have intricate knowledge of the format, but the reader always skips the size of the color table without checking if there is a global color table. It looks to me like that won't work if the GIF doesn't have a global color table.
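For reference, the check in question could look roughly like the sketch below. It follows the GIF89a layout (6-byte signature and version, then a 7-byte Logical Screen Descriptor whose packed byte carries the global color table flag in bit 7 and the table size in bits 0 to 2). The class and method names are hypothetical, and this is not the library's reader code.

```java
// Sketch based on the GIF89a file layout; names are hypothetical and
// this is not taken from the library's GIF reader.
class GifHeaderSketch {
    // Returns how many bytes the global color table occupies, or 0 when
    // the Logical Screen Descriptor says no global color table follows.
    static int globalColorTableBytes(byte[] header) {
        if (header.length < 13) {
            throw new IllegalArgumentException("need signature plus screen descriptor (13 bytes)");
        }
        // Offsets 0-5: "GIF87a"/"GIF89a"; 6-9: width and height; 10: packed fields.
        int packed = header[10] & 0xFF;
        boolean hasGlobalColorTable = (packed & 0x80) != 0;
        if (!hasGlobalColorTable) {
            return 0; // nothing to skip
        }
        int sizeBits = packed & 0x07;
        // The table holds 2^(sizeBits + 1) entries of 3 bytes (R, G, B) each.
        return 3 * (1 << (sizeBits + 1));
    }
}
```

A reader would then skip exactly the returned number of bytes after the 13-byte header, which is zero for files without a global color table.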