-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002
base: master
Are you sure you want to change the base?
findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002
Conversation
Second PR, had to back out the broken one I committed yesterday. test/test_findspam.py: reinstate backed-out test case
The forced anchoring to the beginning of the post is probably something I would like to extend at some point. I don't think many posts have repeated text in the middle (though certainly that happens too) but having a lot of repeats at the end is definitely something I would like to catch, too. I'm thinking it should trigger the same rule, perhaps with a tweak to the detailed "why" message. |
I have tested on my laptop for several hours in https://chat.stackexchange.com/rooms/136393/room-for-tripleee-and-ad-oisee and this seems to do what it's supposed to do now. |
This looks quite a bit better. Using similar regex101 tests (first, second, third) produce a much better final regex, I wonder how it will perform on a post where the user has a fairly redundant dataset as the first thing in their post. However, in practice, that probably won't be all that many FP hits. Unfortunately, however, the meta-concerns and side-effects I expressed elsewhere still seem quite valid. Basically, given how things are, there's quite a bit of stuff which needs to change in MS prior to our actually being able to merge this and use it. Thus, I'm going to, reluctantly, tag it as status: deferred. I'll see if I can take a better/good look at what needs to change in MS. |
if not initial_words: | ||
return False, "" | ||
escaped_initial_words = [regex.escape(x) for x in initial_words.groups()] | ||
period = regex.match( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add regex.cache_all(False)
prior to this regex.match()
and regex.cache_all(True)
after it to prevent the single-use regexes compiled here from filling the regex
package's internal cache of compiled regular expressions.
Note: this is probably something which we should be doing in a substantial number of places in the code, but I was reminded when creating PR #7012 that it should be used here. Basically, we should be doing that wherever we use a regex which is unique per-post. We should also change things in individual functions/modules to separately store compiled regexes which are used repeatedly in order to reduce the number which we rely on the regex
package to cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've struck-out my earlier comments here and closed an open PR I had, as regex.cache_all()
is the wrong way to handle this here, or anywhere in SD, due to threading. I'll be writing a helper function which can be used for explicit regex compiles to force the implicit regex
cache to not be used. I'll update this again once I've created that (soon). I'll also create an issue in the regex
package asking to expose an option to be able to not use the implicit cache on a per-compile basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened a PR, #7025, which adds regex_compile_no_cache()
to helpers.py and imports it to findspam.py. That function can be used to compile a regular expression and not have it placed in the regex
package's implicit cache. Using that function, the code here could be something like:
period_regex = regex_compile_no_cache(
r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % (
tuple(escaped_initial_words * 2))
period = period_regex.match(s)
if not period:
return False, ""
period_words = regex.split(r"[\W_]+", period.groups(0)[0])
escaped_words = escaped_initial_words + [
regex.escape(x) for x in period_words]
repeats_regex = regex_compile_no_cache(r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}")
repeats = repeats_regex.match(s)
escaped_words = escaped_initial_words + [ | ||
regex.escape(x) for x in period_words] | ||
repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}" | ||
repeats = regex.match(repeats_regex, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same request here regarding regex.cache_all(False)
prior to this regex.match()
and regex.cache_all(True)
after it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously, it would be possible to use only a single pair of regex.cache_all(False)
and regex.cache_all(True)
to cover both post-specific uses of a regular expression.
This also touches on the larger issue of how we use the regex
package's cache, particularly within the detections. The larger issue is that we don't want to be using it for regexes which are individually generated specific to each post, but we do want any static regular expressions not to have to be recompiled for every use (e.g. through using the regex
package's cache). There are a couple of different approaches which we could take in general. I'll open an Issue to cover/discuss the topic.
Second PR, had to back out the broken one I committed yesterday.
test/test_findspam.py: reinstate backed-out test case