Add lazy matchers #173

Closed
masklinn opened this issue Jan 15, 2024 · 0 comments

masklinn commented Jan 15, 2024

See #171

This should be reasonably easy with the API changes of #116. Matchers should probably be a triple of lists of Matcher[*], where Matcher is

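import abc
from typing import Generic, Optional, TypeVar

T = TypeVar("T")
class Matcher(abc.ABC, Generic[T]):
    @abc.abstractmethod
    def __call__(self, ua: str) -> Optional[T]:
        # Return the parsed result, or None if the UA does not match.
        pass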

and we can update e.g. class UserAgentMatcher(Matcher[UserAgent]), making the typing line up (hopefully).
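
A minimal sketch of how the typing could line up, assuming a simplified `UserAgent` result type and a regex-backed `UserAgentMatcher`. The field names and the constructor are illustrative assumptions, not the library's actual API:

```python
import abc
import re
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class Matcher(abc.ABC, Generic[T]):
    @abc.abstractmethod
    def __call__(self, ua: str) -> Optional[T]:
        ...


@dataclass(frozen=True)
class UserAgent:
    # Hypothetical, simplified result type.
    family: str
    major: Optional[str] = None


class UserAgentMatcher(Matcher[UserAgent]):
    def __init__(self, regex: str) -> None:
        # Eager variant: the pattern is compiled at construction time.
        self.pattern = re.compile(regex)

    def __call__(self, ua: str) -> Optional[UserAgent]:
        m = self.pattern.search(ua)
        if m is None:
            return None
        return UserAgent(family=m.group(1), major=m.group(2))


# A list of Matcher[UserAgent] type-checks with the concrete subclass.
matchers: List[Matcher[UserAgent]] = [
    UserAgentMatcher(r"(ArcGIS Pro)(?: (\d+)\.(\d+)\.([^ ]+)|)"),
]
```

A lazy variant would only need to keep the same `__call__` signature to slot into the same lists.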

The question is how to provide the lazy versions of the matchers:

  • add one more generated file generating a triplet of lists of lazy matchers (at which point the codegen should probably be abstracted some as it's getting really messy)
  • add a generator which yields the result of applying an arbitrary callback to each regex, e.g.
    def user_agents(callback):
        yield callback(r'(GeoEvent Server) (\d+)(?:\.(\d+)(?:\.(\d+)|)|)')
        yield callback(r'(ArcGIS Pro)(?: (\d+)\.(\d+)\.([^ ]+)|)')
        ...
    maybe with a cache on callback somehow? Not sure there's any way that can work, might make more sense with a wrapper converting that to a list and the callbacks are just intermediate convenience?
  • generate lists of the parameters and map to a new list at runtime
  • use JSON? It might parse faster than a Python file, but it requires re-mapping, is eager, and might not work because it can't benefit from some optimisations (though I'm not sure there are any optimisations to the generated modules: they create lists of function calls, but there's no string sharing or key sharing or anything)
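
A rough sketch of the generator option combined with the "wrapper converting that to a list" idea, under the assumption that the generated file is a function yielding one callback result per regex. `LazyRegex` and `build` are hypothetical names made up for illustration:

```python
import re
from typing import Callable, Iterator, List, Optional, TypeVar

M = TypeVar("M")


def user_agents(callback: Callable[[str], M]) -> Iterator[M]:
    # Stand-in for the generated file: one yield per regexes.yaml entry.
    yield callback(r"(GeoEvent Server) (\d+)(?:\.(\d+)(?:\.(\d+)|)|)")
    yield callback(r"(ArcGIS Pro)(?: (\d+)\.(\d+)\.([^ ]+)|)")


class LazyRegex:
    """Hypothetical lazy matcher: compiles its pattern on first call."""

    def __init__(self, regex: str) -> None:
        self.regex = regex
        self._pattern: Optional[re.Pattern] = None

    def __call__(self, ua: str) -> Optional[re.Match]:
        if self._pattern is None:
            self._pattern = re.compile(self.regex)
        return self._pattern.search(ua)


def build(callback: Callable[[str], M]) -> List[M]:
    # Wrapper converting the generator to a list, so the matchers persist
    # between parses and the callback is only an intermediate convenience.
    return list(user_agents(callback))


eager = build(re.compile)  # every pattern compiled up front
lazy = build(LazyRegex)    # each pattern compiled on first use
```

With this shape a single generated file could serve both the eager and lazy cases, at the cost of running the generator once at load time.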
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
@masklinn masklinn changed the title from "Add lazy matchers post #116" to "Add lazy matchers" Feb 6, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 13, 2024
Support is added for lazy builtin matchers (with a separately compiled
file), as well as loading json or yaml files using lazy matchers.

Lazy matchers are very much a tradeoff: they improve import speed, but
slow down run speed, possibly dramatically.

Use them by default for the re2 parser, but not the basic parser:
experimentally, on Python 3.11

- importing the package itself takes ~36ms
- importing the lazy matchers takes ~36ms (including the package, so ~0)
- importing the eager matchers takes ~97ms

The eager matchers have a significant import overhead. *However*,
running the bench on the sample file, the lazy matchers cause a runtime
increase of 700~800ms on the basic parser bench, as that ends up
instantiating *every* regex (likely due to match failures). Relatively
this is not huge (~2.5%), but the tradeoff doesn't seem great,
especially since the parser itself is initialized lazily.

The re2 parser does much better, only losing 20~30ms (~1%). This is
likely because it only needs to compile a fraction of the regexes (156
out of 1162 as of regexes.yaml version 0.18), and possibly because it
gets to avoid some of the most expensive to compile ones.

Fixes ua-parser#171, fixes ua-parser#173
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 17, 2024
Support is added for lazy builtin matchers (with a separately compiled
file), as well as loading json or yaml files using lazy matchers.

Lazy matchers are very much a tradeoff: they improve import speed, but
slow down run speed, possibly dramatically.

Use them by default for the re2 parser, but not the basic parser:
experimentally, on Python 3.11

- importing the package itself takes ~36ms
- importing the lazy matchers takes ~36ms (including the package, so ~0)
- importing the eager matchers takes ~97ms

The eager matchers have a significant import overhead. *However*,
running the bench on the sample file, the lazy matchers cause a runtime
increase of 700~800ms on the basic parser bench, as that ends up
instantiating *every* regex (likely due to match failures). Relatively
this is not huge (~2.5%), but the tradeoff doesn't seem great,
especially since the parser itself is initialized lazily.

The re2 parser does much better, only losing 20~30ms (~1%). This is
likely because it only needs to compile a fraction of the regexes (156
out of 1162 as of regexes.yaml version 0.18), and possibly because it
gets to avoid some of the most expensive to compile ones.

Fixes ua-parser#171, fixes ua-parser#173
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 18, 2024
Add lazy builtin matchers (with a separately compiled file), as well
as loading json or yaml files using lazy matchers.

Lazy matchers are very much a tradeoff: they improve import speed (and
memory consumption until triggered), but slow down run speed, possibly
dramatically:

- importing the package itself takes ~36ms
- importing the lazy matchers takes ~36ms (including the package, so
  ~0) and ~70kB RSS
- importing the eager matchers takes ~97ms and ~780kB RSS
- triggering the instantiation of the lazy matchers adds ~800kB RSS
- running bench on the sample file using the lazy matcher has
  700~800ms overhead compared to the eager matchers

While the lazy matchers are less costly across the board until they're
used, benching the sample file causes the loading of *every* regex --
likely due to matching failures -- which has a 700~800ms overhead over
eager matchers and increases the RSS by ~800kB (on top of the original
~70kB).

Thus lazy matchers are not a great default for the basic parser,
though they might be a good opt-in if the user only ever uses one of
the domains (especially if it's not the devices one, as that's by far
the largest).

With the re2 parser however, only 156 of the 1162 regexes get
evaluated, leading to a minor CPU overhead of 20~30ms (1% of bench
time) and a more reasonable memory overhead. Thus use the lazy matcher
for the re2 parser.

On the more net-negative but relatively minor side of things, the
pregenerated lazy matchers file adds 120k to the on-disk requirements
of the library, and ~25k to the wheel archive. This is also what the
_regexes and _matchers precompiled files do. pyc files seem to be even
bigger (~130k) so the tradeoff is dubious even if they are slightly
faster.

Fixes ua-parser#171, fixes ua-parser#173
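
A toy illustration of why a sequential scan ends up compiling every lazy regex: each miss walks the whole list, so the first unmatched input forces every pattern to compile. `CountingLazy` and `parse` are hypothetical names, the three patterns are made up, and this models only a linear scan, not how the re2 parser selects its candidates:

```python
import re
from typing import List, Optional


class CountingLazy:
    compiled = 0  # class-wide count of patterns compiled so far

    def __init__(self, regex: str) -> None:
        self.regex = regex
        self._pattern: Optional[re.Pattern] = None

    def __call__(self, ua: str) -> Optional[re.Match]:
        if self._pattern is None:
            self._pattern = re.compile(self.regex)
            CountingLazy.compiled += 1
        return self._pattern.search(ua)


def parse(matchers: List[CountingLazy], ua: str) -> Optional[re.Match]:
    # Sequential scan: try each matcher in order, stop at the first hit.
    for matcher in matchers:
        m = matcher(ua)
        if m is not None:
            return m
    return None


matchers = [
    CountingLazy(p)
    for p in (r"(Firefox)/(\d+)", r"(Chrome)/(\d+)", r"(Safari)/(\d+)")
]

parse(matchers, "Firefox/120")   # hit on the first matcher
after_hit = CountingLazy.compiled
parse(matchers, "curl/8.4.0")    # miss: every matcher is tried
after_miss = CountingLazy.compiled
```

Here only one pattern is compiled after the hit, but all three are compiled after a single miss, which is the behaviour the basic parser bench runs into.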