regex including hosts which are not in targets #38

Open
jayvdb opened this issue Apr 11, 2020 · 3 comments

Comments

@jayvdb
Owner

jayvdb commented Apr 11, 2020

https://github.com/jayvdb/https-everywhere-py/blob/a3f2b42/https_everywhere/_rules.py#L406

Many rule regexes match hosts that are not listed in the ruleset's targets.

These cases are currently detected, but they are neither rejected nor covered by the tests.
The known ones are stored in _FIXME_BROKEN_REGEX_MATCHES:

_FIXME_BROKEN_REGEX_MATCHES = [
    "affili.de",
    "www.belgium.indymedia.org",
    "m.aljazeera.com",
    "atms00.alicdn.com",
    "i06.c.aliimg.com",
    "allianz-fuercybersicherheit.de",
    "statics0.beauteprivee.fr",
    "support.bulletproofexec.com",
    "wwwimage0.cbsstatic.com",
    "cdn0.colocationamerica.com",
    "www.login.dtcc.edu",
    "ejunkie.com",
    "e-rewards.com",
    "member.eurexchange.com",
    "4exhale.org",
    "na0.www.gartner.com",
    "blog.girlscouts.org",
    "lh0.google.*",  # fixme
    "nardikt.org",
    ".instellaplatform.com",
    "m.w.kuruc.org",
    "search.microsoft.com",
    "static.millenniumseating.com",
    "watchdog.mycomputer.com",
    "a0.ec-images.myspacecdn.com",
    "a0.mzstatic.com",
    "my.netline.com",
    "img.e-nls.com",
    "x.discover.oceaniacruises.com",
    "www.data.phishtank.com",
    "p00.qhimg.com",
    "webassetsk.scea.com",
    "s00.sinaimg.cn",
    "mosr.sk",
    "sofurryfiles.com",
    "asset-g.soupcdn.com",
    "cdn00.sure-assist.com",
    "www.svenskaspel.se",
    "mail.telecom.sk",
    "s4.thejournal.ie",
    "my.wpi.edu",
    "stec-t*.xhcdn.com",  # fixme
    "www.*.yandex.st",  # fixme
    "s4.jrnl.ie",
    "b2.raptrcdn.com",
    "admin.neobookings.com",
    "webmail.vipserv.org",
    "ak0.polyvoreimg.com",
    "cdn.fora.tv",
    "cdn.vbseo.com",
    "edge.alluremedia.com",
    "secure.trustedreviews.com",
    "icmail.net",
    "www.myftp.utechsoft.com",
    "research-store.com",
    "app.sirportly.com",
    "ec7.images-amazon.com",
    "help.npo.nl",
    "css.palcdn.com",
    "legacy.pgi.com",
    "my.btwifi.co.uk",
    "orders.gigenetcloud.com",
    "owa.space2u.com",
    "payment-solutions.entropay.com",
    "static.vce.com",
    "itpol.dk",
    "orionmagazine.com",
    # fix merged, not distributed
    "citymail.com",
    "mvg-mobile.de",
    "inchinashop.com",
    "www.whispergifts",
    # already merged?
    "css.bzimages.com",
    "cdn0.spiegel.de",
]
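
For illustration, here is a minimal sketch of the kind of check that flags such hosts. It is not the code in _rules.py: the function names `target_covers` and `uncovered_hosts` are hypothetical, the candidate hostnames are assumed to be supplied by the caller, and target wildcards are approximated with shell-style matching, which may be looser than the real target semantics.

```python
import re
from fnmatch import fnmatchcase


def target_covers(target, host):
    """Return True if a target such as "*.example.com" covers host.

    Assumption: shell-style matching is a close enough stand-in for
    ruleset target wildcards in this sketch.
    """
    return fnmatchcase(host, target)


def uncovered_hosts(from_pattern, targets, candidate_hosts):
    """List candidate hosts that the rule regex matches but no target covers."""
    rule = re.compile(from_pattern)
    extra = []
    for host in candidate_hosts:
        url = "http://{}/".format(host)
        if rule.match(url) and not any(target_covers(t, host) for t in targets):
            extra.append(host)
    return extra


# Hypothetical rule: the regex accepts "www.", but the targets omit it.
pattern = r"^http://(?:www\.)?example\.com/"
hosts = ["example.com", "www.example.com", "other.org"]
print(uncovered_hosts(pattern, ["example.com"], hosts))
```

With the hypothetical rule above, only www.example.com is reported: the regex matches it, but no target covers it.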

These are mostly fixed in EFForg/https-everywhere#18949 and EFForg/https-everywhere#18957, but upstream has difficulty reviewing complex changesets. Splitting them might help, but even then progress on smaller PRs is very slow, so these problems will linger for a while and need to be fixed in this library.

There are two options. Either the hosts matched by each regex are tested properly, so that the extra hosts can be added to the targets, used in the processing logic, and optimised sanely; or the regex is simplified so that it no longer matches these extra hosts.
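
A rough sketch of the two options, under stated assumptions: `add_verified_targets` and `still_matches` are hypothetical helper names, and `is_reachable_over_https` is a caller-supplied placeholder for whatever test is considered sufficient. None of this is existing library API.

```python
import re


def add_verified_targets(targets, extra_hosts, is_reachable_over_https):
    """Option 1: extend targets with the extra hosts that pass the check."""
    verified = [h for h in extra_hosts if is_reachable_over_https(h)]
    return sorted(set(targets) | set(verified))


def still_matches(from_pattern, hosts):
    """Option 2 check: hosts that a simplified regex would still match."""
    rule = re.compile(from_pattern)
    return [h for h in hosts if rule.match("http://{}/".format(h))]


# Option 1: keep the regex, grow the targets (the check is stubbed here).
targets = add_verified_targets(
    ["example.com"],
    ["www.example.com", "dead.example.com"],
    lambda host: host != "dead.example.com",
)
print(targets)

# Option 2: keep the targets, simplify the regex by hand, then confirm
# that the extra host is no longer matched.
simplified = r"^http://example\.com/"
print(still_matches(simplified, ["www.example.com"]))
```

Option 1 keeps the rule's reach but depends on a trustworthy check for each extra host. Option 2 needs no network access, at the cost of no longer rewriting the dropped hosts.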

@jayvdb
Owner Author

jayvdb commented Apr 11, 2020

A lot of these are in EFForg/https-everywhere@cd07963 on EFForg/https-everywhere#18938

@jayvdb
Owner Author

jayvdb commented Apr 11, 2020

@jayvdb
Owner Author

jayvdb commented Apr 11, 2020

Summary of rulesets in the first possible batch to fix: roughly, the invalid rules with only a single hostname match.
