This package provides basic tools for attribute value normalization and string clustering in data cleaning and integration pipelines.
At the moment, because the Linux and Mac versions of the PyQt5 package are incompatible, the Linux and Mac user interfaces live in two separate branches. The master branch contains the version that works on Linux, and the pyqt5mac-dev branch contains the version that works on Mac.
To install this package:
- First install PyQt5, e.g. using one of the following methods:
  - If you have Anaconda installed on your machine, then use the following command:
    $ conda install pyqt
  - Otherwise follow the instructions at http://pyqt.sourceforge.net/Docs/PyQt5/installation.html
- Then install the package using one of the following methods:
  - Run the following command:
    $ pip install git+https://github.com/adelaneh/py_valuenormalization.git
  - First clone the package source code using the following command:
    $ git clone https://github.com/adelaneh/py_valuenormalization
    Then enter the source code root folder (`py_valuenormalization`) and install the package using the following command:
    $ python setup.py install
    You can use `--prefix` to change the destination folder for installing the package (see the help using `$ python setup.py --help`).
To use this package, import it by running the following Python command:
>>> import py_valuenormalization as vn
We have developed two main approaches to normalize values, namely:
- Manual value normalization
- Clustering-based value normalization
In manual value normalization, you merge values into clusters to normalize them:
- First load your data values into a list using the following command:
  >>> vals = vn.read_from_file('PATH-TO-TEXT-FILE')
  where the file at `PATH-TO-TEXT-FILE` contains the values to be normalized, one data value per line. You can download one of our sample datasets from https://github.com/adelaneh/py_manual_vn/tree/master/py_valuenormalization/data.
- Now run the command:
  >>> res = vn.normalize_values(vals)
  This will open the value normalization application, which gives you instructions on how to normalize the input values.
- Finally, when you finish the normalization process and close the above application, the normalization results are returned in the variable `res`: it is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.
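Once the results are available, a common next step is to replace each raw value with the label of its cluster. The sketch below assumes only the dictionary shape described above (cluster label mapped to a set of values). The sample data and the helper name `build_value_to_label` are hypothetical and are not part of the package.

```python
def build_value_to_label(clusters):
    """Invert {label: set_of_values} into {value: label}."""
    value_to_label = {}
    for label, values in clusters.items():
        for value in values:
            value_to_label[value] = label
    return value_to_label


# Hypothetical result with the same shape as `res` (not real package output).
res_example = {
    "Microsoft": {"Microsoft", "Microsoft Corp.", "MSFT"},
    "Apple": {"Apple", "Apple Inc."},
}

value_to_label = build_value_to_label(res_example)

# Replace each raw value by its cluster label; unknown values are left unchanged.
raw_values = ["MSFT", "Apple Inc.", "Microsoft Corp."]
normalized = [value_to_label.get(v, v) for v in raw_values]
print(normalized)  # ['Microsoft', 'Apple', 'Microsoft']
```

Falling back to the original value with `.get(v, v)` is one possible design choice; you may prefer to raise an error for values that were never normalized.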
In clustering-based value normalization, you first cluster the values using one of the following three methods:
- Regular hierarchical agglomerative clustering (HAC)
- Smart clustering, which finds the best HAC parameter settings using input training data
- Hybrid clustering, which finds a clustering of input values such that you need to spend minimal time to clean up the clustering
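To give a rough feel for what regular HAC does, here is a minimal pure-Python sketch of single-linkage agglomerative clustering over character 3-gram Jaccard similarity. It is an illustration only and is not the package's implementation: the function names, the lowercasing step, the rule of merging while the best similarity is at least the threshold, and the threshold value used in the example are all assumptions.

```python
def ngrams(s, n=3):
    """Return the set of character n-grams of s (the whole string if shorter than n)."""
    s = s.lower()  # assumption: case-insensitive comparison
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between the 3-gram sets of two strings."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)


def single_linkage_hac(values, thr):
    """Repeatedly merge the two most similar clusters until the best similarity drops below thr."""
    clusters = [{v} for v in dict.fromkeys(values)]  # one cluster per distinct value
    while len(clusters) > 1:
        best_sim, best_i, best_j = -1.0, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair across the two clusters.
                sim = max(jaccard(a, b) for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_sim, best_i, best_j = sim, i, j
        if best_sim < thr:
            break
        clusters[best_i] |= clusters[best_j]
        del clusters[best_j]
    return clusters


values = ["New York", "New York City", "NewYork", "Boston", "Bostn"]
clusters = single_linkage_hac(values, thr=0.35)  # illustrative threshold
for c in clusters:
    print(sorted(c))
```

This naive version recomputes all pairwise similarities on every iteration, which is fine for a handful of values but would be too slow for large inputs.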
To cluster the values, follow these steps:
- First load your data values into a list using the following command:
  >>> vals = vn.read_from_file('PATH-TO-TEXT-FILE')
  where the file at `PATH-TO-TEXT-FILE` contains the values to be normalized, one data value per line. You can download one of our sample datasets from https://github.com/adelaneh/py_manual_vn/tree/master/py_valuenormalization/data.
- Then use one of the following sequences of commands to cluster the values:
  - Regular HAC:
    >>> hac = vn.HierarchicalClustering(vals)
    >>> clusts = hac.cluster(sim_measure = '3gram Jaccard', linkage = 'single', thr = 0.7)
    where `vals` is the set of input values, `sim_measure`, `linkage` and `thr` are standard HAC parameters, and `clusts` is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.
  - Smart clustering:
    >>> (_, training_pairs) = vn.calibrate_normalization_cost_model(vals)
    >>> smc = vn.SmartClustering(vals, training_pairs)
    >>> (clusts, best_setting) = smc.cluster()
    where `training_pairs` is a dictionary where each key is a value pair `(v1, v2)` with `v1` and `v2` being distinct input values, and the corresponding value is `True` if `v1` and `v2` refer to the same entity and `False` otherwise.
    The output consists of a dictionary `clusts` and a tuple `best_setting`. Each key of the dictionary `clusts` is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster. `best_setting = (agrscore, simk, lnk, thr)` is a tuple of the agreement score and the HAC parameter settings with which `clusts` is obtained. `agrscore` is the agreement score between `clusts` and `training_pairs`, i.e. the fraction of the value pairs in `training_pairs` which agree with `clusts`. `simk` (similarity measure), `lnk` (linkage) and `thr` (threshold) are the standard HAC parameter settings with which `clusts` is obtained.
  - Hybrid clustering:
    >>> (cm, _) = vn.calibrate_normalization_cost_model(vals)
    >>> hybhac = vn.HybridClustering(vals, cm)
    >>> (clusts, mcl) = hybhac.cluster()
    where `cm` is a cost model used by the hybrid clustering algorithm to find the clustering of the input data set that requires minimum effort from you to clean up.
    The output consists of a dictionary `clusts` and an integer `mcl`. Each key of the dictionary `clusts` is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster. `mcl` is the maximum size of the clusters in `clusts`.
- Now you can clean up the clusters obtained above to arrive at the correct clustering of the input values. This phase consists of two main steps:
  - Split step, in which you split clusters containing values that refer to more than one real-world entity into smaller clusters, each containing values that refer to a single entity
  - Merge step, in which you merge clusters referring to the same entity
To clean up the clustering results run the following command:
```>>> clean_clusts = vn.normalize_clusters(clusts)```
where `clusts` is a dictionary in which each key is the label of a cluster of data values and the corresponding value is the set of data values in that cluster. This will open a graphical user interface for cleaning up `clusts`, and the results will be returned in `clean_clusts`, which is a dictionary with the same structure.
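The agreement score reported by smart clustering is described above as the fraction of labeled value pairs that agree with a clustering. The following sketch shows one way such a score could be computed for any clustering dictionary, for example to compare a clustering before and after cleanup. It is an assumption-based illustration, not the package's code: the function name `agreement_score` and the sample data are hypothetical, and a pair is taken to "agree" when its label is `True` and both values share a cluster, or its label is `False` and they do not.

```python
def agreement_score(clusters, labeled_pairs):
    """Fraction of labeled pairs whose same-entity label matches the clustering."""
    value_to_label = {v: label for label, vals in clusters.items() for v in vals}
    if not labeled_pairs:
        return 0.0
    agree = 0
    for (v1, v2), same_entity in labeled_pairs.items():
        together = value_to_label[v1] == value_to_label[v2]
        if together == same_entity:
            agree += 1
    return agree / len(labeled_pairs)


# Hypothetical clustering and labeled pairs (same shapes as `clusts` and `training_pairs`).
clusts_example = {
    "New York": {"New York", "NYC"},
    "Boston": {"Boston", "Bostn"},
    "LA": {"LA"},
}
pairs_example = {
    ("New York", "NYC"): True,   # same cluster, labeled same: agrees
    ("Boston", "Bostn"): True,   # same cluster, labeled same: agrees
    ("NYC", "Boston"): False,    # different clusters, labeled different: agrees
    ("LA", "NYC"): True,         # different clusters, labeled same: disagrees
}

score = agreement_score(clusts_example, pairs_example)
print(score)  # 0.75
```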