Currently, pre-processing CryptoCurrency3.pdf fails with the following errors (already reported in #2).
Traceback (most recent call last):
File "/home/sekikn/repos/FeatureX/src/FeatureX.py", line 139, in pre_process
docmain = reduce(lambda x, y: x+' '+y, [i[0] for i in mainfeature])
TypeError: reduce() of empty sequence with no initial value
DONE R2
Traceback (most recent call last):
File "../src/CaseStudy-Evaluations.py", line 49, in <module>
RelationshipSegregator()
File "/home/sekikn/repos/FeatureX/src/RelationshipSegregator.py", line 140, in __init__
dependencygraph()
File "/home/sekikn/repos/FeatureX/src/DependencyGraph.py", line 59, in __init__
if r and r[0] != '' and rootFeature in r[0]:
UnboundLocalError: local variable 'rootFeature' referenced before assignment
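Both tracebacks follow from `mainfeature` ending up empty: `reduce()` without an initial value raises a TypeError on an empty sequence, and since pre-processing then never picks a root feature, `rootFeature` is later unassigned. A minimal reproduction (Python 3 spelling, where `reduce` lives in `functools` and the message says "iterable" rather than "sequence"; in the project's Python 2 it is a builtin):

```python
from functools import reduce

# What [i[0] for i in mainfeature] yields for CryptoCurrency3.pdf
words = []

try:
    docmain = reduce(lambda x, y: x + ' ' + y, words)
except TypeError as exc:
    print(exc)  # reduce() of an empty sequence with no initial value raises

# Supplying an initial value (or using ' '.join) avoids the crash,
# though the resulting empty string still needs to be handled:
docmain = reduce(lambda x, y: x + ' ' + y, words, '')
assert docmain == '' == ' '.join(words)
```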
This is because pre-processing fails to determine the root feature. pre_process() extracts the 10 most frequently occurring words and then removes stop words from them, but 10 words are too few to decide the root feature in the case of CryptoCurrency3.pdf, where all of the top 10 are stop words. We can confirm this with pdb:
$ python2 -m pdb ../src/CaseStudy-Evaluations.py
> /home/sekikn/repos/FeatureX/src/CaseStudy-Evaluations.py(2)<module>()
-> import FeatureX
(Pdb) b FeatureX.py:136
Breakpoint 1 at /home/sekikn/repos/FeatureX/src/FeatureX.py:136
(Pdb) c
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 7), (',', 5), ('to', 5), ('the', 5), ('routing', 3), ('?', 3), ('Tor', 3), ('of', 3), ('at', 3), ('Ip', 3)]
(Pdb) c
DONE R2
DONE R1
DONE R3
DONE R4
DONE
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 10), (',', 9), ('a', 7), ('of', 6), ('Zerocoin', 4), ('it', 4), ('the', 4), ('does', 3), ('this', 3), ('its', 2)]
(Pdb) c
DONE R2
DONE R1
DONE R3
DONE R4
DONE
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 10), ('a', 9), ('the', 9), (',', 7), ('of', 6), ('to', 5), ('and', 5), ('[', 4), ('that', 4), (']', 4)]
(Pdb) b FeatureX.py:139
Breakpoint 2 at /home/sekikn/repos/FeatureX/src/FeatureX.py:139
(Pdb) c
> /home/sekikn/repos/FeatureX/src/FeatureX.py(139)pre_process()
-> docmain = reduce(lambda x, y: x+' '+y, [i[0] for i in mainfeature])
(Pdb) p mainfeature
[]
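In miniature, the failing path is equivalent to the following sketch (the tokens and the stop-word list here are hypothetical stand-ins for the document text and for whatever RemoveNonsenseWords actually filters):

```python
from collections import Counter

# Hypothetical stand-in for the stop words RemoveNonsenseWords strips.
STOP_WORDS = {'.', ',', 'a', 'the', 'of', 'to', 'and', '[', ']', 'that'}

# For CryptoCurrency3.pdf, every one of the ten most frequent tokens is a
# stop word or punctuation, so nothing survives the filtering step.
tokens = ". . a the the of to and [ ] that , . a".split()
mainfeature = Counter(tokens).most_common(10)  # top-10 frequent words
mainfeature = [(w, c) for w, c in mainfeature if w not in STOP_WORDS]
print(mainfeature)  # -> []
```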
For CryptoCurrency1.pdf and CryptoCurrency2.pdf, words such as "routing", "Tor", and "Zerocoin" are extracted. But for CryptoCurrency3.pdf, only stop words are extracted, so the 'mainfeature' variable is empty by the time line 139 runs.
The extraction happens in:
FeatureX/src/FeatureX.py
Lines 135 to 137 in 150f486
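One possible mitigation, sketched here with a hypothetical stop-word list and input (not FeatureX's actual code), is to filter stop words before taking the top 10, so that stop words cannot crowd out every slot:

```python
from collections import Counter

# Hypothetical stop-word list; FeatureX's real filter may differ.
STOP_WORDS = {'.', ',', 'a', 'the', 'of', 'to', 'and', '[', ']', 'that',
              'it', 'its', 'this', 'does'}

def top_features(tokens, n=10):
    # Filter stop words *before* counting, so the top-n slots are not
    # wasted on words that would be removed anyway.
    content_words = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(content_words).most_common(n)

# Even when content words are rare relative to stop words, they survive:
tokens = "the a of Zerocoin , it the does this its Zerocoin".split()
print(top_features(tokens))  # -> [('Zerocoin', 2)]
```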
Other comments on featurex.pre_process():
The comment says "Take the content of the first 2 pages of the document (for determining the root feature)", but in reality it uses the first 10 sentences.
FeatureX/src/FeatureX.py
Lines 114 to 118 in 150f486
next_few_sentences and last_few_sentences are filled in the else block of the for-loop, but they are never used afterwards. They should be removed for code readability.
FeatureX/src/FeatureX.py
Lines 119 to 131 in 150f486