Add active learning API for predicate-based blocking and matching #54
(Note: I'm trying to use the term "blocking key" instead of "predicate". I never liked the term "predicate" from dedupe; I usually think of a predicate in the specific sense of a function that returns a boolean. I want to phrase this process more in terms of table joins, so "join key", "blocking key", etc. make more sense to me.)

In general, I am very much in favor of improving the blocking user experience. Manually defining blocking keys and combinations of keys feels far too arbitrary and not data-informed.

Would this proposal actually solve that problem? Or is it just inherent in your dataset, e.g. there just isn't a whole lot of useful info we could use? Can you manually cherry-pick a gold-standard selection of predicates that does get you the performance you are looking for? Perhaps we can reverse-engineer from there, approaching the problem as "what algorithm could we come up with that would be able to find this set of predicates out of the total search space?"

I want to take a step back and really think about the problem holistically before we re-implement what dedupe did (which is fricking neat algorithmically, but I'm intimidated by its complexity; I would love to avoid it if possible). I would also love to avoid any active or supervised learning; any unsupervised method will get a lot more traction from me. It seems like we could do something similar with an expectation-maximization algorithm here? For example, I'm curious if there is a much simpler algorithm:
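The expectation-maximization idea could be made concrete with a tiny Fellegi-Sunter-style sketch. This is not mismo or dedupe code; the function name and the binary-comparison-vector input are assumptions for illustration. It estimates, with no labels at all, the probability each comparison field agrees given a match (`m`) or a non-match (`u`), plus the overall match prior `p`:

```python
# A minimal Fellegi-Sunter-style EM sketch (hypothetical, not mismo's API):
# given binary comparison vectors for candidate pairs, estimate match and
# non-match agreement probabilities without any labelled examples.
from typing import List

def em_match_weights(vectors: List[List[int]], n_iter: int = 50):
    n_pairs, n_fields = len(vectors), len(vectors[0])
    p = 0.1                      # prior P(pair is a match)
    m = [0.9] * n_fields         # P(field k agrees | match)
    u = [0.1] * n_fields         # P(field k agrees | non-match)
    for _ in range(n_iter):
        # E-step: posterior P(match | comparison vector) for each pair
        w = []
        for g in vectors:
            pm, pu = p, 1 - p
            for k in range(n_fields):
                pm *= m[k] if g[k] else 1 - m[k]
                pu *= u[k] if g[k] else 1 - u[k]
            w.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the soft assignments
        total = sum(w)
        p = total / n_pairs
        for k in range(n_fields):
            agree_m = sum(wi * g[k] for wi, g in zip(w, vectors))
            agree_u = sum((1 - wi) * g[k] for wi, g in zip(w, vectors))
            m[k] = agree_m / total
            u[k] = agree_u / (n_pairs - total)
    return p, m, u
```

The learned `m`/`u` ratios per field could then rank candidate blocking keys by how informative they are, without a human in the loop.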
Yes, there should be a
Thanks for the feedback @NickCrews. I've been progressing further with the dataset I'm working on at the moment and am reaching similar conclusions to you about this approach.
After looking into Dedupe's API in more detail, I agree that while its method is very sophisticated, it is somewhat intimidating in its complexity. I also wonder how well it would scale to large datasets, which I see as one of the main attractions of mismo.
The first two steps you've described are more or less what I've done manually. In my particular case, there were some records that have a value

I like the idea of using a
I'd be interested to get your thoughts on this in general. I've been working a little on using an active learning method to generate pairs of records that can be used to train a logistic regression model. The model itself is fairly easy to implement using mismo's API - I can define a

I've tried out some active learning using modal-python, but this, along with a labelling UI, introduces some complexity that might be better elsewhere. This feels like a separate but related topic, so I'm happy to split it out into a separate issue if you'd prefer to keep this one focused on blocking.
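For what it's worth, that uncertainty-sampling loop can be sketched without modAL, using plain scikit-learn. This is only an illustration under assumptions: the random feature matrix stands in for mismo comparison outputs, and the "oracle" is a pre-computed label array so the example runs non-interactively.

```python
# Hypothetical active-learning sketch: repeatedly fit a logistic regression on
# the labelled pairs, then ask the oracle about the pair the model is least
# certain about (predicted probability closest to 0.5).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 3))                     # stand-in comparison features
y_true = (X.sum(axis=1) > 1.5).astype(int)   # hidden gold labels (the oracle)

# seed with one known match and one known non-match so the model can fit
labelled = [int(np.argmax(y_true == 1)), int(np.argmax(y_true == 0))]
for _ in range(20):
    model = LogisticRegression().fit(X[labelled], y_true[labelled])
    proba = model.predict_proba(X)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labelled] = np.inf           # never re-query labelled pairs
    labelled.append(int(uncertainty.argmin()))  # "ask the oracle" for this one

accuracy = (model.predict(X) == y_true).mean()
```

In a real setup, the `labelled.append(...)` step would be replaced by showing the selected record pair to a human in a labelling UI.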
The active learning method that `dedupe` implements to learn a minimal set of predicates to block records can be very useful, particularly when the user has little prior knowledge of an appropriate set of blocking rules. In some of my usage, I've found that it can be tricky to balance recall and precision when blocking on a single feature, e.g. zip code. More complex predicates would most likely help reduce the number of candidate pairs, so having a semi-supervised method to learn these could reduce the manual work required.

In practice, I have found that `dedupe`'s implementation scales quite poorly - I frequently hit memory bottlenecks when blocking on datasets of more than 10k rows. I'm hoping that by using `duckdb` via `ibis` we may be able to do something more performant.

At a high level, the active learner works as follows:
If we were to implement something like this, I think it's worth ensuring it can be done using the current `mismo` API. For example, I can see how the candidate pairs could be blocked using a `UnionBlocker` and appropriately defined `ConditionBlocker`s. Similarly, the features for the MatchLearner could be generated using a set of `LevelComparer`s, which then learns to predict `p(match | comparisons)`.

It's not immediately clear to me how best to decide the predicates for blocking based on labelled examples, but I expect that could be done by keeping track of the blocking rules. How efficient this is when we have many blockers remains to be seen.
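One possible way to decide predicates from labelled examples is the greedy weighted set-cover heuristic, which is loosely the idea behind dedupe's predicate selection (heavily simplified here; all names are hypothetical):

```python
# Hedged sketch: given, for each blocking rule, the set of labelled duplicate
# pairs it covers and a cost (e.g. how many candidate pairs it generates),
# greedily pick rules until all known duplicates are covered.
def greedy_cover(rules, duplicates):
    """rules: {name: (set_of_covered_duplicate_pairs, cost)};
    duplicates: set of labelled duplicate pairs to cover."""
    uncovered = set(duplicates)
    chosen = []
    while uncovered:
        # score each rule by newly covered duplicates per unit of cost
        name, (covered, cost) = max(
            rules.items(),
            key=lambda kv: len(kv[1][0] & uncovered) / kv[1][1],
        )
        gained = covered & uncovered
        if not gained:           # no rule covers the remainder; stop
            break
        chosen.append(name)
        uncovered -= gained
    return chosen, uncovered
```

For example, a cheap exact-match rule and a moderately cheap name rule would be preferred over an expensive zip rule that covers the same duplicates. The cost term is what keeps the learner from picking overly broad rules that flood the comparer with candidate pairs.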
I like the idea of leaving the predicates and features open to the user, as dedupe's approach of statically defining the predicates based on the data type makes it hard for me to understand where the performance bottlenecks are for a given matching/linking task.
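One lightweight way to surface those per-rule bottlenecks (a sketch, not mismo's API; the function and input shapes are assumptions) is to record which rule produced each candidate pair and score every rule against the labelled matches:

```python
# Hypothetical bookkeeping: for each blocking rule, compute how many candidate
# pairs it generates, what fraction of labelled matches it recovers (recall),
# and how many of its candidates are actual matches (precision).
def rule_stats(pairs_by_rule, labelled_matches):
    """pairs_by_rule: {rule_name: iterable of candidate pairs};
    labelled_matches: set of pairs confirmed as matches."""
    stats = {}
    for rule, pairs in pairs_by_rule.items():
        pairs = set(pairs)
        hits = pairs & labelled_matches
        stats[rule] = {
            "candidates": len(pairs),
            "recall": len(hits) / len(labelled_matches),
            "precision": len(hits) / len(pairs) if pairs else 0.0,
        }
    return stats
```

A rule with huge `candidates` but low `recall` is exactly the kind of bottleneck that is hard to spot when predicates are statically assigned by data type.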