Strange behaviour of do_tfidf #841

@hhaensel

Description

I am trying to reproduce the example from the manual:

> res <- data.frame("text" = c("this is what it is", "which is better")) %>%
+   do_tokenize(text) %>%
+   do_tfidf(document_id, token)

which is expected to result in:

document_id token  count_per_doc count_of_docs tfidf
1           is     2             2             0.0000000
1           it     1             1             0.5773503
1           this   1             1             0.5773503
1           what   1             1             0.5773503
2           better 1             1             0.7071068
2           is     1             2             0.0000000
2           which  1             1             0.7071068
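
For reference, the expected values above are consistent with weighting each count by log(N / count_of_docs), where N is the number of documents, and then L2-normalising within each document. The base R sketch below reproduces them under that assumption. `tfidf_sketch` is a hypothetical helper written for this issue, not the package's implementation, and the tokens are typed in by hand instead of coming from do_tokenize.

```r
# Hypothetical helper: tf * log(N / df), L2-normalised per document.
# Assumed formula that matches the manual's numbers; not do_tfidf itself.
tfidf_sketch <- function(document_id, token) {
  # One row per (document, token) pair with its count.
  counts <- as.data.frame(table(document_id = document_id, token = token),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  names(counts)[3] <- "count_per_doc"

  # Number of documents that contain each token.
  n_docs <- length(unique(document_id))
  doc_freq <- table(counts$token)
  counts$count_of_docs <- as.integer(doc_freq[counts$token])

  # Raw weight, then L2 norm within each document.
  raw <- counts$count_per_doc * log(n_docs / counts$count_of_docs)
  norm <- sqrt(ave(raw^2, counts$document_id, FUN = sum))
  counts$tfidf <- ifelse(norm > 0, raw / norm, 0)

  counts <- counts[order(counts$document_id, counts$token), ]
  rownames(counts) <- NULL
  counts
}

# Tokens of "this is what it is" and "which is better", entered manually.
tokens <- data.frame(
  document_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  token = c("this", "is", "what", "it", "is", "which", "is", "better")
)
res <- tfidf_sketch(tokens$document_id, tokens$token)
print(res)
```

With this formula, "is" gets 0 in both documents, "it", "this" and "what" each get 1/sqrt(3) = 0.5773503, and "better" and "which" each get 1/sqrt(2) = 0.7071068, which is what the manual shows.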

However, I obtain:

document_id token  count_per_doc count_of_docs tfidf
1           is     2             2             0.0000000
1           it     1             1             0.0000000
1           this   1             1             0.7071068
1           what   1             1             0.7071068
2           better 1             1             0.7071068
2           is     1             2             0.0000000
2           which  1             1             0.7071068

Another strange result is the following:

> data.frame("text" = c("good it was", "is nice she", "good is she")) %>%
+   do_tokenize(text) %>%
+   do_tfidf(document_id,token)
document_id token count_per_doc count_of_docs tfidf
1           good  1             2             0.327
1           it    1             1             0.327
1           was   1             1             0.887
2           is    1             2             0.327
2           nice  1             1             0.327
2           she   1             2             0.887
3           good  1             2             0.327
3           is    1             2             0.327
3           she   1             2             0.887

Here I would expect identical tfidf values for "it" and "was", since both occur once in document 1 and in no other document.
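
To make that expectation concrete, here is a minimal base R sketch for the three-document example using a document-term matrix. It assumes the weighting is tf * log(N / count_of_docs) with L2 normalisation per document, which may not be exactly what do_tfidf does; the tokens are entered by hand and all variable names are just for illustration.

```r
# Tokens of the three documents, entered manually (no do_tokenize involved).
docs <- list(
  c("good", "it", "was"),
  c("is", "nice", "she"),
  c("good", "is", "she")
)
vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: one row per document, one column per token.
tf <- t(sapply(docs, function(d) tabulate(match(d, vocab), nbins = length(vocab))))
colnames(tf) <- vocab

# Assumed weighting: idf = log(N / number of documents containing the token).
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)
raw <- sweep(tf, 2, idf, `*`)

# L2-normalise each row (document); every row has a non-zero norm here.
tfidf <- raw / sqrt(rowSums(raw^2))
print(round(tfidf, 3))
```

Under that assumption, "it" and "was" get the same weight in document 1, and "good", "is" and "she" all get the same weight in document 3, because tokens with equal count_per_doc and count_of_docs inside one document cannot differ. For document 2 the sketch gives roughly 0.327 for "is" and "she" and 0.887 for "nice", the same three numbers as in the output above but attached to different tokens, so the values may be getting assigned to the wrong rows.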
