33 commits
9a55aa0
Added the edit distance join disk cython file and corresponding pytho…
tangirb145 Mar 2, 2018
1ff81f3
1st commit to remote
tangirb145 Mar 15, 2018
93bdeca
2nd push to origin with all the files.
tangirb145 Mar 15, 2018
b33d3be
Added functionality of glob for creating a file anywhere.
tangirb145 Apr 12, 2018
514beca
Removing extra print statements
ankitjain64 Apr 12, 2018
34d546a
Added validation to the attributes datalimit and path in Utils.
tangirb145 Apr 12, 2018
ed517b6
Validating the attributes sata_limit, path
tangirb145 Apr 12, 2018
554df1a
Merging ankit changes and my changes.
tangirb145 Apr 12, 2018
3fb42b4
Handling missing values and cleaning code
ankitjain64 Apr 12, 2018
4ce2f79
Adding header to the output file and cleaned code
ankitjain64 Apr 12, 2018
3d90a61
Added comments
ankitjain64 Apr 12, 2018
ae0e0b0
Created edit_dist_join_disk test file
ankitjain64 Apr 12, 2018
c488139
Removing output file before creating new and cleaned code.
ankitjain64 Apr 14, 2018
f536560
Modified Comment
ankitjain64 Apr 14, 2018
4460594
Modified Default value of data_limit to 100K
ankitjain64 Apr 14, 2018
cb67de1
1. Added new parameter output_file_path (Absolute path for output fil…
ankitjain64 Apr 18, 2018
3e5cc1f
Updated function Doc String
ankitjain64 Apr 18, 2018
02d59af
There is a possibility that memory goes out of bound during calculati…
ankitjain64 Apr 18, 2018
a5647d2
Modifying variable name and return value.
ankitjain64 Apr 20, 2018
19f0f9b
Create CodeReviewComments
ankitjain64 Apr 23, 2018
5d2db92
Rename CodeReviewComments to CodeReviewComments.md
ankitjain64 Apr 23, 2018
a64b34f
Update CodeReviewComments.md
tangirb145 Apr 23, 2018
1825351
Update CodeReviewComments.md
tangirb145 Apr 23, 2018
a981c76
Modifying the code as per comments received after code review.
ankitjain64 Apr 23, 2018
1b8ae14
Merge remote-tracking branch 'origin/editjoin_disk' into editjoin_disk
ankitjain64 Apr 23, 2018
f9ebbd7
Added changes to test the edit distance join disk.
ankitjain64 Apr 30, 2018
85c8ac7
Added argument checking testcases.
ankitjain64 May 3, 2018
30a6abf
Fixed File name
ankitjain64 May 3, 2018
bc8463e
Fixed File name
ankitjain64 May 3, 2018
767b9a8
Added more test cases and modified intermediate file writing code.
ankitjain64 May 9, 2018
9837f66
Added more test cases for data limits, multiprocessing and output fil…
ankitjain64 May 10, 2018
79cfe61
Added comments and modified output file name
ankitjain64 May 10, 2018
adec91c
Merged and commited the changes done by @Bharghav.
ankitjain64 May 24, 2018
17 changes: 17 additions & 0 deletions CodeReviewComments.md
@@ -0,0 +1,17 @@
- [x] Change "Cython not installed" to an Exception.
- [x] Renamed temp_file_path to temp_dir
- [x] Copied doc string to the .py file
- [x] Set the default n_jobs to -1.
- [x] Kept the output file name in the wrapper function only.
- [x] Appending timestamp to the filename.
- [x] Changed the name of output file name variable to include "default".
- [x] Edited doc string to reflect desired changes.
- [x] Changed default value of data_limit to 1 million.
- [x] Changed variable data_limit to data_limit_per_core
- [x] Changed from num_cpus to n_jobs for per core data limit computation.
- [x] Added timestamps to temporary file names after being created in the wrapper itself.
- [x] Exception to shutil caught and handled.
- [x] Moved output_header code to wrapper function.
- [x] Changed the missing pairs behavior by creating the file name in the wrapper itself.
- [x] Exception to shutil caught and handled in missing pairs.
- [x] Removed _progress_bar.
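
Two of the items above, appending a timestamp to file names and catching exceptions raised by `shutil`, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this change: the helper names `timestamped_name` and `remove_dir_quietly` are hypothetical, and the millisecond timestamp format is an assumption.

```python
import os
import shutil
import tempfile
import time


def timestamped_name(prefix, ext=".csv"):
    """Return a name like 'prefix_<milliseconds>.csv' (hypothetical helper)."""
    millis = int(time.time() * 1000)
    return "{}_{}{}".format(prefix, millis, ext)


def remove_dir_quietly(path):
    """Delete a directory tree; return False instead of raising on failure."""
    try:
        shutil.rmtree(path)
        return True
    except (OSError, shutil.Error) as exc:
        print("Could not remove {}: {}".format(path, exc))
        return False


# Create a scratch directory, write an empty timestamped file, then clean up.
scratch = tempfile.mkdtemp()
out_name = timestamped_name("join_output")
open(os.path.join(scratch, out_name), "w").close()
removed = remove_dir_quietly(scratch)
removed_again = remove_dir_quietly(scratch)  # directory is already gone
```

The second cleanup call shows the failure path: the missing directory is reported and the caller gets `False` rather than an uncaught exception.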
144 changes: 144 additions & 0 deletions my.py
@@ -0,0 +1,144 @@

# coding: utf-8

# This quickstart guide explains how to join two tables A and B using the edit distance measure. First, import the required packages as follows (installing **py_stringsimjoin** automatically installs its dependencies **py_stringmatching** and **pandas**):

# In[20]:


# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys
import time


# In[21]:


print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)
print(sys.path)


# Joining two tables using edit distance measure typically consists of three steps:
# 1. Loading the input tables
# 2. Profiling the tables
# 3. Performing the join

# # 1. Loading the input tables

# We begin by loading the two tables. For the purpose of this guide,
# we use the sample dataset that comes with the package.

# In[22]:


# construct the path of the tables to be loaded. Since we are loading a
# dataset from the package, we need to access the data from the path
# where the package is installed. If you need to load your own data, you can directly
# provide your table path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'imdb_A.csv'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'imdb_B.csv'])


# In[23]:


# Load csv files as dataframes.
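# Note: error_bad_lines was removed in pandas 2.0; on newer versions
# use on_bad_lines='skip' instead.
A = pd.read_csv(table_A_path, error_bad_lines=False)
B = pd.read_csv(table_B_path, error_bad_lines=False)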
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))
print(A.columns.values)
print(B.columns.values)

# # 2. Profiling the tables

# Before performing the join, we may want to profile the tables to
# know about the characteristics of the attributes. This can help identify:
#
# a) unique attributes in the table which can be used as key attribute when performing
# the join. A key attribute is needed to uniquely identify a tuple.
#
# b) the number of missing values present in each attribute. This can
# help you in deciding the attribute on which to perform the join.
# For example, an attribute with a lot of missing values may not be a good
# join attribute. Further, based on the missing value information you
# need to decide on how to handle missing values when performing the join
# (See the section below on 'Handling missing values' to know more about
# the options available for handling missing values when performing the join).
#
# You can profile the attributes in a table using the following command:

# In[26]:


# profile attributes in table A
#print(ssj.profile_table_for_join(A))


# In[27]:


# profile attributes in table B
#print(ssj.profile_table_for_join(B))
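# The two properties discussed above (uniqueness and missing values) can also
# be inspected with plain pandas. The sketch below is only an illustration and
# is not how ssj.profile_table_for_join is implemented; the helper name
# profile_columns and the toy table are hypothetical.

```python
import pandas as pd


def profile_columns(table):
    """Summarize each column: distinct values, missing values, key candidacy."""
    rows = []
    for col in table.columns:
        n_missing = int(table[col].isnull().sum())
        n_unique = int(table[col].nunique())  # nunique ignores missing values
        rows.append({
            "attribute": col,
            "unique_values": n_unique,
            "missing_values": n_missing,
            # A key must have no missing values and one distinct value per row.
            "key_candidate": n_missing == 0 and n_unique == len(table),
        })
    return pd.DataFrame(rows).set_index("attribute")


# Toy table: 'ID' can serve as a key, 'name' cannot.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "name": ["Alien", None, "Alien"],
})
profile = profile_columns(toy)
print(profile)
```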


# If an input table does not contain a key attribute, then you need
# to create one. In the current example, both input tables
# A and B have key attributes, and hence you can proceed to the next step.
# If a table does not have a key attribute, you can
# add one using the following command:

# In[28]:


#B['new_key_attr'] = range(0, len(B))
#print(B)
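# A slightly more defensive variant of the command above adds the key only
# when it is needed. This is a sketch under assumptions: the helper name
# add_key_if_missing is hypothetical and is not part of py_stringsimjoin.

```python
import pandas as pd


def add_key_if_missing(table, key_name="new_key_attr"):
    """Return a copy of table in which key_name is a valid key.

    If key_name is absent, or has missing or duplicate values, it is
    (re)created as the row numbers 0..n-1. The input table is not modified.
    """
    result = table.copy()
    has_key = (key_name in result.columns
               and result[key_name].notnull().all()
               and result[key_name].is_unique)
    if not has_key:
        result[key_name] = range(len(result))
    return result


toy = pd.DataFrame({"title": ["Alien", "Aliens"]})
keyed = add_key_if_missing(toy)
print(keyed)
```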


# For the purpose of this guide, we will now join tables A and B on the
# ' name' attribute of A and the 'title' attribute of B using the edit distance
# measure. Next, we need to decide on what threshold to use for the join. The
# call below uses a threshold of 1. Specifically, the join will find tuple pairs
# from A and B such that the edit distance between the two attributes is at most 1.
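# To make the join condition concrete, the sketch below computes the
# Levenshtein (edit) distance with the standard dynamic program and keeps every
# pair whose distance is within a threshold. It is a naive all-pairs
# illustration, not the filtered algorithm used by py_stringsimjoin; the
# function names and sample strings are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(t) + 1))  # distances from '' to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ''
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def naive_edit_distance_join(left, right, threshold):
    """Compare every pair and keep those with distance <= threshold."""
    matches = []
    for l_val in left:
        for r_val in right:
            dist = edit_distance(l_val, r_val)
            if dist <= threshold:
                matches.append((l_val, r_val, dist))
    return matches


pairs = naive_edit_distance_join(["Alien", "Heat"],
                                 ["Aliens", "Head", "Casino"], 1)
print(pairs)
```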

# # 3. Performing the join

# The next step is to perform the edit distance join using the following command:

# In[29]:


# Find all pairs from A and B such that the edit distance between
# A's ' name' and B's 'title' is at most 1 (the threshold passed below).
# l_out_attrs and r_out_attrs denote the attributes from the
# left table (A) and right table (B) that need to be included in the output.

tick = time.time()

output_pairs = ssj.edit_distance_join_disk(
    A, B, 'ID', 'ID', ' name', 'title', 1, 100000,
    n_jobs=4,
    l_out_attrs=[' name', ' year'],  # ' director', ' writers', ' actors '
    r_out_attrs=['title', 'year'],
    temp_dir="/afs/cs.wisc.edu/u/a/j/ajain64/private/Spring_2018/indep/git/",
    allow_missing=False,
    output_file_path="/afs/cs.wisc.edu/u/a/j/ajain64/private/Spring_2018/indep/join_output.csv")

tock = time.time()


print("Time taken in disk " + str(tock - tick))