
Commit 602534f

Update README.md
1 parent 8a278a1 commit 602534f

Showing 1 changed file with 13 additions and 12 deletions.


README.md (+13, -12)
@@ -40,8 +40,8 @@ python train.py
 ### 3.Calculate similarity matrix (inference)
 ```
 Example:
-python cal_column_similarity.py -p Test\ Data/self -m model/2022-04-11-17-10-11 -s one-to-one
-python cal_column_similarity.py -p Test\ Data/authors -m model/2022-04-11-17-10-11 -t 0.9
+python cal_column_similarity.py -p Test\ Data/self -m /model/2022-04-12-12-06-32 -s one-to-one
+python cal_column_similarity.py -p Test\ Data/authors -m /model/2022-04-12-12-06-32-11 -t 0.9
 ```
 Parameters:
 - -p: Path to test data folder, must contain **"Table1.csv" and "Table2.csv" or "Table1.json" and "Table2.json"**.
@@ -50,42 +50,43 @@ Parameters:
 - -s: Strategy, there are two options: "one-to-one" and "one-to-many". "one-to-one" means that one column can only be matched to one column. "one-to-many" means that there is no restrictions. Default is "one-to-many".
 ## Feature Engineering
 
-Features: "is_url","is_numeric","is_date","is_string","numeric:mean", "numeric:min", "numeric:max", "numeric:variance","numeric:cv", "numeric:unique/len(data_list)", "length:mean", "length:min", "length:max", "length:variance","length:cv", "length:unique/len(data_list)", "whitespace_ratios:mean","punctuation_ratios:mean","special_character_ratios:mean","numeric_ratios:mean", "whitespace_ratios:cv","punctuation_ratios:cv","special_character_ratios:cv","numeric_ratios:cv", "colname:bleu_score", "colname:edit_distance","colname:lcs","colname:tsm_cosine", "colname:one_in_one"
+Features: "is_url","is_numeric","is_date","is_string","numeric:mean", "numeric:min", "numeric:max", "numeric:variance","numeric:cv", "numeric:unique/len(data_list)", "length:mean", "length:min", "length:max", "length:variance","length:cv", "length:unique/len(data_list)", "whitespace_ratios:mean","punctuation_ratios:mean","special_character_ratios:mean","numeric_ratios:mean", "whitespace_ratios:cv","punctuation_ratios:cv","special_character_ratios:cv","numeric_ratios:cv", "colname:bleu_score", "colname:edit_distance","colname:lcs","colname:tsm_cosine", "colname:one_in_one", "instance_similarity:cosine"
 
-- tsm_cosine: cosine similarity computed by sentence-transformers using "paraphrase-multilingual-mpnet-base-v2". Support multi-language column names matching.
+- tsm_cosine: Cosine similarity of column names computed by sentence-transformers using "paraphrase-multilingual-mpnet-base-v2". Support multi-language column names matching.
+- instance_similarity:cosine: Select 20 instances each string column and compute its mean embedding using sentence-transformers. Cosine similarity is computed by each pairs.
 
 ## Performance
 
 ### Cross Validation on Training Data(Each pair to be used as test data)
 
-- Average Precision: 0.750
-- Average Recall: 0.823
+- Average Precision: 0.755
+- Average Recall: 0.829
 - Average F1: 0.766
 
 Average Confusion Matrix:
 | | Negative(Truth) | Positive(Truth) |
 |----------------|-----------------|-----------------|
-| Negative(pred) | 0.94439985 | 0.05560015 |
-| Positive(pred) | 0.1765625 | 0.8234375 |
+| Negative(pred) | 0.94343111 | 0.05656889 |
+| Positive(pred) | 0.17135417 | 0.82864583 |
 
 ### Inference on Test Data (Give confusing column names)
 
 Data: https://github.com/fireindark707/Schema_Matching_XGboost/tree/main/Test%20Data/self
 
 | | title | text | summary | keywords | url | country | language | domain | name | timestamp |
 |---------|------------|------------|------------|------------|------------|------------|------------|------------|-------|------------|
-| col1 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
-| col2 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+| col1 | 1(FN) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+| col2 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
 | col3 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
 | words | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 |
 | link | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 |
 | col6 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 |
 | lang | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 |
 | col8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 |
-| website | 0 | 0 | 0 | 0 | 0 | 1(FP) | 0 | 0 | 0(FN) | 0 |
+| website | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0(FN) | 0 |
 | col10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) |
 
-**F1 score: 0.9**
+**F1 score: 0.889**
 
 ## Cite
 ```
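The feature added in this commit, `instance_similarity:cosine`, and the existing `tsm_cosine` feature both rely on sentence-transformers: `tsm_cosine` embeds the column names, while the new feature samples 20 cell values per string column, mean-pools their embeddings, and compares columns by cosine similarity. The sketch below only illustrates that description and is not the repository's code; reusing "paraphrase-multilingual-mpnet-base-v2" for the instance embeddings, the pandas-based sampling, and the helper names `colname_tsm_cosine` and `instance_cosine` are assumptions.

```python
# Minimal sketch of the two sentence-transformer features described in the diff.
# Assumptions: the same model is reused for instance embeddings, sampling is done
# with pandas, and both helper names are invented for illustration only.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")


def colname_tsm_cosine(name_a: str, name_b: str) -> float:
    """tsm_cosine: cosine similarity between the embeddings of two column names."""
    emb = model.encode([name_a, name_b])
    return float(np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))


def instance_cosine(values_a: pd.Series, values_b: pd.Series, n: int = 20) -> float:
    """instance_similarity:cosine: sample up to n cell values per string column,
    embed them, mean-pool the embeddings, and compare the two column means.
    Assumes each column has at least one non-null value."""

    def mean_embedding(values: pd.Series) -> np.ndarray:
        cells = values.dropna().astype(str)
        sample = cells.sample(min(n, len(cells)), random_state=0)
        return model.encode(sample.tolist()).mean(axis=0)

    a, b = mean_embedding(values_a), mean_embedding(values_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Mean-pooling a small sample keeps the feature cheap to compute while letting the matcher compare column contents rather than only column names, which is presumably why only 20 instances per column are embedded.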
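A note on the updated score: reading the outcome labels in the new inference table gives 8 true positives, 2 false negatives (title and name are never matched) and no false positives, so precision = 8/8 = 1.0, recall = 8/10 = 0.8 and F1 = 2 * 1.0 * 0.8 / (1.0 + 0.8) ≈ 0.889, matching the updated figure. The previous table had 9 TP, 1 FP (website matched to country) and 1 FN, giving precision = recall = 0.9 and the earlier F1 of 0.9.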
