Refactor library to be closer to original implementation in C
nap committed Oct 15, 2024
1 parent 4e0b025 commit 934567a
Showing 10 changed files with 349 additions and 281 deletions.
66 changes: 2 additions & 64 deletions CHANGELOG
@@ -1,104 +1,42 @@
v2.0 (2024-09-29)

v1.8 (2016-03-22)
65da039 Fix README parsing
93de94a Update for new release with details of change in implementation
b3e4be8 Added 'CONTRIBUTOR' file
906dd09 Updated test to reflect changes
8b1b819 Removed code that is no longer reachable
332e5b1 Added enhancement suggested by @hellrich
931ab9c Revert "Fix in _get_matching_characters"
e0e7b64 Update tox.ini
8b24dfd fixed _get_matching_characters
d03966a Fix in _get_matching_characters
e70dfea update example with better formatting
20bda9a update example
v1.7 (2015-10-21)
1659545 fix for typo ensuring backward compatibility
f81227a fix argparse
003eed8 adde forgotten variable for identity
v1.6 (2015-09-06)
59b93e5 added information __all__
v1.5 (2015-09-06)
264160c remove merge messages from CHANGELOG
97d790a ignore DS_Store
ff1f53f added release script
40442e4 revert automatic release changes
350b37e update bad filename
52a2248 update release script with automatic changelog update
89b0fc6 update changelog with new format
8f10096 added useless output
9663294 added version and identity parameter with some debug
08fd0c2 added usage output
f3fd289 fix typo and update debug
3fa9bff added script to push new version
v1.4 (2015-08-30)
2f3107b update to work with py26
5da90dd update for py26 and py34
1e1b580 added py26
1496145 update test, removed frog test because of floating point issue between 2 and 3
178676a fix for py3 compatibility
06c1787 fix for py34
b549fc0 tuned tox config
b352006 added .tox directory to ignore list
c7d5889 tox config file to run tests
b5f0729 added ability to run test through test parameter
91d57b8 update travis with TOX deps and run test through setup.py
b08600a fix bad link
84214dd removed last edit
5cad2fa took faulty package out of virtenv
ab9f89f removed unknown attribute
2d134d8 update example
v0.1.3 (2015-08-20)
30b29d9 update readme
9905775 Update changelog
d538031 removed some useless assert
5f25d27 added test for new winkler disable
02d70df latest change
3a5e93a added ability to unable or disable winkler ajustment
eebe617 added author
827db01 added bugtrack_url
82aa877 added irc notification
90e634e added more badge :D
v0.1.2 (2015-08-04)
1b788e5 update release date for v0.1.2
e2b5608 updated changes
5c8c18f added coverage badge
763fd4d latest change
96f87ba added converage feature
4cbd8c0 added converage file
50026f6 added build health
bf44cdf added python path in env var
2229e8e added python path
367af91 added travis yaml config file
43e86e0 Modified version exception message
2b04148 Modified __author__ value
da70797 removed typo
f7a357a updated change information
42dd2e1 modified python version check, will raise exception now
v0.1.1 (2015-08-02)
4747455 added more information
a8ae21f change for minimal python requirement
2bb212b added ability to enforce minimal python version
f616731 changed README.md to README.rst
c78f379 Added more change
174299d Modified README with more structure and moved from MD to RST format
d3459cc Added changelog file
11bdf2d update summary, download_url, classifier, keywords, platform, and classifiers
54bc9b8 Added information in setup.py
8138f79 Update README.md
v0.1.0 (2015-07-31)
5babf40 rename package
df5c284 added more file to ignore (egg, dist, and build)
09a0411 Added setup.py
a89f1af added test to get 100% coverage. make sure test are uniform
e8499e7 removed unreachable code and useless conditions
309822e changed package import
1a68445 fix for forgotten denominator and return empty string on NoneType
a858eed added test to ensure consistency with StringUtils of common-lang library
1d33a98 renamed library file
39cb3f4 moved example script
a6cd455 added tests
e659da4 Added *.pyc to gitignore
25fadab Added gitignore
5c6f39f Added python code and example
eac8375 Update README.md

66 changes: 54 additions & 12 deletions README.md
@@ -5,28 +5,70 @@
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyjarowinkler?style=flat-square)
![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/nap/jaro-winkler-distance/workflow.yml)

Find the Jaro Winkler Distance which indicates the similarity score between two strings or words.
Jaro's measure is the weighted sum of the percentages of matched and transposed characters from each string. Winkler's factor increases this measure for matching prefix characters.
This library finds a non-Euclidean distance or similarity between two strings.

## The Implementation
The Jaro and Jaro-Winkler equations provide a score between two short words where errors are more likely at the end of the word. Jaro's measure is the weighted sum of the percentages of equal and transposed characters from each string. Winkler's factor adds a weight to Jaro's formula that increases the calculated measure when a sequence of characters (a prefix) matches between the compared items.

The original implementation is based on the Jaro Winkler Similarity Algorithm article that can be found on [Wikipedia](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). This version is based on the [Apache commons-text](https://github.com/apache/commons-text/blob/c2cb4501669e4148aebd9d7265430080f47af016/src/main/java/org/apache/commons/text/similarity/JaroWinklerSimilarity.java#L1-L167) library.
> [!NOTE]
> * Impact of the character prefix is limited to 4, as originally defined by Winkler.
> * Input strings are not modified beyond leading or trailing whitespace stripping. In-word whitespace and character case *will* optionally impact the score.
> * Returns a floating point number rounded to the desired decimals (defaults to `2`) using Python's [`round()`](https://docs.python.org/3/library/functions.html#round).
> * Consider the usual [Floating Point Arithmetic](https://docs.python.org/3/tutorial/floatingpoint.html#tut-fp-issues) characteristics.
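
The rounding and floating-point notes above can be illustrated with a short sketch that uses only Python's built-in `round()`. The sample values are illustrative assumptions and are not produced by the library.

```python
# Binary floating point cannot represent most decimal fractions exactly,
# so rounding to a fixed number of decimals can occasionally surprise.
score = 0.89801864801         # sample similarity value, for illustration only
rounded = round(score, 2)     # rounds to two decimals, like the default here
print(rounded)

# 2.675 is stored as a value slightly below 2.675, so it rounds down.
print(round(2.675, 2))
```

Comparing scores with a small tolerance, rather than with `==`, avoids most of these surprises.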
### Correctness
## Notes on Calculation

Unit tests similar to what you will find in the `commons-text` library were used to validate the implementation.
The complexity of this algorithm resides in the calculation of `matching` and `transposed` characters.

### Note
* A character of the first string is `matching` if it's found in the second string within a specified `distance`. A character in the first string cannot be matched multiple times to the same character in the second string.
* Two characters are `transposed` if they match, but aren't matched at the same position.
* The `limit` is calculated using the length of the longest string divided by two, minus one.
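
The three rules above can be sketched as follows. This is a minimal illustration of the textbook Jaro matching procedure, not the library's code: the function name `count_matches_and_transpositions` is hypothetical, and the sketch assumes the classic convention in which the transposition count is half the number of matched characters that line up with a different character.

```python
def count_matches_and_transpositions(s1: str, s2: str) -> tuple[int, int]:
    """Return (m, t): matched characters and halved transposition count."""
    if not s1 or not s2:
        return 0, 0

    # Search limit: longest length divided by two, minus one (never negative).
    limit = max(max(len(s1), len(s2)) // 2 - 1, 0)

    used = [False] * len(s2)  # positions of s2 that are already matched
    matched_s1 = []           # matched characters of s1, in s1 order

    for i, ch in enumerate(s1):
        lo = max(0, i - limit)
        hi = min(len(s2), i + limit + 1)
        for j in range(lo, hi):
            # Each character of s2 can be matched at most once.
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched_s1.append(ch)
                break

    # Matched characters of s2, in s2 order.
    matched_s2 = [c for c, flag in zip(s2, used) if flag]

    # Matched characters that disagree position by position are out of order.
    out_of_order = sum(a != b for a, b in zip(matched_s1, matched_s2))
    return len(matched_s1), out_of_order // 2


print(count_matches_and_transpositions("MARTHA", "MARHTA"))  # (6, 1)
```

For the classic pair `MARTHA` / `MARHTA`, every character matches and the swapped `T` and `H` count as a single transposition.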

A limit of `shorter / 2 + 1` is used in `commons-text`; this differs from Wikipedia and also from [Winkler's paper](https://files.eric.ed.gov/fulltext/ED325505.pdf), where a distance of `longer / 2 - 1` is used, corresponding to positions of `longer / 2`.
### Example

${d = \left \lfloor {\frac {\max(12, 13)}{2}}\right \rfloor - 1 = 5}$

```
----------------------------
P E N N C I S Y L V N I A
P 1 |
E 1 |
N 1 |
N 1 |
S 1 |
Y | 1 |
L | 1 |
V | 1 |
A | 1
N | 1
I | 1
A |
----------------------------
```
${\text{Where }|s_{1}| = 12\text{, }|s_{2}| = 13\text{, }\ell = 4\text{, }m = 11\text{, }t = 3\text{, and }p = 0.1}$.

${sim_{j}=\left\{{\begin{array}{l l}0&{\text{if }}m=0\\{\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)&{\text{otherwise}}\end{array}}\right.}$

${sim_{j}=\frac {1}{3}}\left({\frac {11}{12}}+{\frac {11}{13}}+{\frac {11-3}{11}}\right) = 0.83003108003$

${sim_{w} = sim_{j}+\ell p(1-sim_{j})}$

${sim_{w} =0.83003108003 + 4 * 0.1 * (1 - 0.83003108003) = 0.89801864801}$

${\text{round}(sim_{w}, 2) = 0.9}$
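
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the two formulas as written here, not the library's code; the function names `jaro_similarity` and `winkler_adjust` are hypothetical, and the counts are the ones given in this example.

```python
def jaro_similarity(len1: int, len2: int, m: int, t: int) -> float:
    """Jaro similarity from the string lengths, matches m and transpositions t."""
    if m == 0:
        return 0.0
    return (m / len1 + m / len2 + (m - t) / m) / 3


def winkler_adjust(sim_j: float, prefix: int, p: float = 0.1) -> float:
    """Boost sim_j by the common prefix length (capped at 4) scaled by p."""
    return sim_j + min(prefix, 4) * p * (1 - sim_j)


# Values from the example: |s1| = 12, |s2| = 13, m = 11, t = 3, l = 4, p = 0.1
sim_j = jaro_similarity(12, 13, m=11, t=3)
sim_w = winkler_adjust(sim_j, prefix=4, p=0.1)
print(sim_j, sim_w, round(sim_w, 2))
```

The printed values agree with the worked figures to about ten decimal places; the last one is the score rounded to two decimals.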

## Implementation

The original implementation is based on the [Jaro Winkler](https://www.census.gov/content/dam/Census/library/working-papers/1991/adrm/rr91-9.pdf) Similarity Algorithm article that can be found on [Wikipedia](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). This version is based on the [original C implementation of strcmp95](https://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c).

## Example

```python
from pyjarowinkler import distance
# Scaling is 0.1 by default
distance.get_jaro_distance("hello", "haloa", winkler=True, scaling=0.1)

distance.get_jaro_distance("hello", "haloa", decimals=2)
# 0.76
distance.get_jaro_distance("hello", "haloa", winkler=False, scaling=0.1)
# 0.733333333333
distance.get_jaro_winkler_distance("hello", "Haloa", scaling=0.1, ignore_case=False)
# 0.6
distance.get_jaro_winkler_distance("hello", "HaLoA", scaling=0.1, ignore_case=True)
# 0.73
```
5 changes: 3 additions & 2 deletions example.py
@@ -1,10 +1,11 @@
"""
Example for using :func:`distance.get_jaro_distance` of the ``pyjarowinkler`` module.
"""

__author__ = "Jean-Bernard Ratte - [email protected]"

from pyjarowinkler import distance

if __name__ == "__main__":
    dist: float = distance.get_jaro_distance("hello", "haloa")
    print(f"The words 'hello' and 'haloa' matches at {dist:.1%}.")
    dist: float = distance.get_jaro_distance("faremviel", "farmville")
    print(f"The words 'farmville' and 'faremviel' matches at {dist:.1%}.")
2 changes: 1 addition & 1 deletion pyjarowinkler/__about__.py
@@ -1 +1 @@
__version__ = "1.8.5"
__version__ = "1.8.0"
19 changes: 18 additions & 1 deletion pyjarowinkler/__init__.py
@@ -1 +1,18 @@
__author__ = 'Jean-Bernard Ratte - [email protected]'
"""Find the Jaro Winkler Distance which indicates the similarity score between two strings.
The Jaro measure is the weighted sum of percentage of matched characters and transposed
characters. Winkler increased this measure for matching prefix characters.
This implementation is based on the Jaro Winkler similarity algorithm
from [Wikipedia article](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).
The validation is based on the [Apache ``commons-text``](https://github.com/apache/commons-text/blob/2f45b62a4e3c0953c3fc14982006a22c1a8a1ca8/src/main/java/org/apache/commons/text/similarity/JaroWinklerSimilarity.java) implementation.
:copyright: (c) 2015 by Jean-Bernard Ratte.
:license: Apache 2.0, see :file:`LICENSE` for more details.
""" # noqa: E501

__author__ = "Jean-Bernard Ratte - [email protected]"


class JaroDistanceError(Exception):
    def __init__(self, message) -> None:
        super(Exception, self).__init__(message)