Currently, pre-processing CryptoCurrency3.pdf fails with the following errors (already reported in #2).
Traceback (most recent call last):
File "/home/sekikn/repos/FeatureX/src/FeatureX.py", line 139, in pre_process
docmain = reduce(lambda x, y: x+' '+y, [i[0] for i in mainfeature])
TypeError: reduce() of empty sequence with no initial value
DONE R2
Traceback (most recent call last):
File "../src/CaseStudy-Evaluations.py", line 49, in <module>
RelationshipSegregator()
File "/home/sekikn/repos/FeatureX/src/RelationshipSegregator.py", line 140, in __init__
dependencygraph()
File "/home/sekikn/repos/FeatureX/src/DependencyGraph.py", line 59, in __init__
if r and r[0] != '' and rootFeature in r[0]:
UnboundLocalError: local variable 'rootFeature' referenced before assignment
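Both tracebacks follow from `mainfeature` ending up empty: `reduce()` without an initial value raises a TypeError on an empty sequence, and since pre-processing then never picks a root feature, `rootFeature` is later unassigned. A minimal reproduction (Python 3 spelling, where `reduce` lives in `functools` and the message says "iterable" rather than "sequence"; in the project's Python 2 it is a builtin):

```python
from functools import reduce

# What [i[0] for i in mainfeature] yields for CryptoCurrency3.pdf
words = []

try:
    docmain = reduce(lambda x, y: x + ' ' + y, words)
except TypeError as exc:
    print(exc)  # reduce() of an empty sequence with no initial value raises

# Supplying an initial value (or using ' '.join) avoids the crash,
# though the resulting empty string still needs to be handled:
docmain = reduce(lambda x, y: x + ' ' + y, words, '')
assert docmain == '' == ' '.join(words)
```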
This is because pre-processing fails to determine the root feature. pre_process() extracts the 10 most frequently occurring words and then removes stop words from them, but 10 words are too few to decide the root feature in the case of CryptoCurrency3.pdf, where all of the top 10 are stop words. We can confirm this with pdb:
$ python2 -m pdb ../src/CaseStudy-Evaluations.py
> /home/sekikn/repos/FeatureX/src/CaseStudy-Evaluations.py(2)<module>()
-> import FeatureX
(Pdb) b FeatureX.py:136
Breakpoint 1 at /home/sekikn/repos/FeatureX/src/FeatureX.py:136
(Pdb) c
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 7), (',', 5), ('to', 5), ('the', 5), ('routing', 3), ('?', 3), ('Tor', 3), ('of', 3), ('at', 3), ('Ip', 3)]
(Pdb) c
DONE R2
DONE R1
DONE R3
DONE R4
DONE
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 10), (',', 9), ('a', 7), ('of', 6), ('Zerocoin', 4), ('it', 4), ('the', 4), ('does', 3), ('this', 3), ('its', 2)]
(Pdb) c
DONE R2
DONE R1
DONE R3
DONE R4
DONE
> /home/sekikn/repos/FeatureX/src/FeatureX.py(136)pre_process()
-> mainfeature = self.RemoveNonsenseWords(mainfeature)
(Pdb) p mainfeature
[('.', 10), ('a', 9), ('the', 9), (',', 7), ('of', 6), ('to', 5), ('and', 5), ('[', 4), ('that', 4), (']', 4)]
(Pdb) b FeatureX.py:139
Breakpoint 2 at /home/sekikn/repos/FeatureX/src/FeatureX.py:139
(Pdb) c
> /home/sekikn/repos/FeatureX/src/FeatureX.py(139)pre_process()
-> docmain = reduce(lambda x, y: x+' '+y, [i[0] for i in mainfeature])
(Pdb) p mainfeature
[]
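In miniature, the failing path is equivalent to the following sketch (the tokens and the stop-word list here are hypothetical stand-ins for the document text and for whatever RemoveNonsenseWords actually filters):

```python
from collections import Counter

# Hypothetical stand-in for the stop words RemoveNonsenseWords strips.
STOP_WORDS = {'.', ',', 'a', 'the', 'of', 'to', 'and', '[', ']', 'that'}

# For CryptoCurrency3.pdf, every one of the ten most frequent tokens is a
# stop word or punctuation, so nothing survives the filtering step.
tokens = ". . a the the of to and [ ] that , . a".split()
mainfeature = Counter(tokens).most_common(10)  # top-10 frequent words
mainfeature = [(w, c) for w, c in mainfeature if w not in STOP_WORDS]
print(mainfeature)  # -> []
```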
For CryptoCurrency1.pdf and CryptoCurrency2.pdf, words such as "routing", "Tor", and "Zerocoin" are extracted. But for CryptoCurrency3.pdf, only stop words are extracted, so the 'mainfeature' variable is empty by the time line 139 runs.
The extraction happens in:
FeatureX/src/FeatureX.py
Lines 135 to 137 in 150f486
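One possible mitigation, sketched here with a hypothetical stop-word list and input (not FeatureX's actual code), is to filter stop words before taking the top 10, so that stop words cannot crowd out every slot:

```python
from collections import Counter

# Hypothetical stop-word list; FeatureX's real filter may differ.
STOP_WORDS = {'.', ',', 'a', 'the', 'of', 'to', 'and', '[', ']', 'that',
              'it', 'its', 'this', 'does'}

def top_features(tokens, n=10):
    # Filter stop words *before* counting, so the top-n slots are not
    # wasted on words that would be removed anyway.
    content_words = [t for t in tokens if t.lower() not in STOP_WORDS]
    return Counter(content_words).most_common(n)

# Even when content words are rare relative to stop words, they survive:
tokens = "the a of Zerocoin , it the does this its Zerocoin".split()
print(top_features(tokens))  # -> [('Zerocoin', 2)]
```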
Other comments on featurex.pre_process():
The comment says "Take the content of the first 2 pages of the document (for determining the root feature)", but in reality it uses the first 10 sentences.
FeatureX/src/FeatureX.py
Lines 114 to 118 in 150f486
next_few_sentences and last_few_sentences are filled in the else block of the for-loop, but they are never used afterwards. They should be removed for code readability.
FeatureX/src/FeatureX.py
Lines 119 to 131 in 150f486