Reading multiple Word files uses excessive space on /tmp #180

Description

@gcpoole

readtext leaves temporary files in /tmp on Linux each time it reads a .docx file. If you read many large .docx files, the space on /tmp can get used up quickly.

I have been using Natural Language Processing on a corpus of 1000 books, where each book is a few hundred pages. Each time I open a book, about 10 MB of space is consumed on /tmp. The files are not cleaned up by readtext. So if I open and process all 1000 books (one at a time, in a loop), about 10 GB of disk space is consumed on /tmp!!

It would be really nice if readtext would clean up the /tmp files it creates before it returns its result, rather than leaving potentially large files behind.

A reprex would be to create a .docx file (foo.docx) consisting of about 300 pages of text, then run the following loop in R:

for (i in 1:1000) {
  dummy <- readtext::readtext("/path/to/foo.docx")
}

This will open foo.docx 1000 times, creating 1000 temporary folders in /tmp, which will consume 5-10 GB of disk space.
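
Until readtext cleans up after itself, a possible workaround is to delete whatever appears in the temporary directory during each call. The sketch below is not part of readtext: `with_tmp_cleanup` is a hypothetical helper, and it assumes the leftover folders are created under R's session temporary directory (`tempdir()`, which normally lives under /tmp on Linux). If they land elsewhere, pass that location as `tmp`.

```r
# Hypothetical helper (not part of readtext): evaluate `expr`, then delete
# every file or folder that appeared in `tmp` while it was running.
# Assumption: the leftover folders are created under tempdir().
with_tmp_cleanup <- function(expr, tmp = tempdir()) {
  # Snapshot the directory contents before `expr` is evaluated.
  before <- list.files(tmp, all.files = TRUE, no.. = TRUE, full.names = TRUE)
  on.exit({
    after <- list.files(tmp, all.files = TRUE, no.. = TRUE, full.names = TRUE)
    # Remove only the entries that were not there before the call.
    unlink(setdiff(after, before), recursive = TRUE, force = TRUE)
  }, add = TRUE)
  # `expr` is lazy, so it is evaluated here, after the snapshot was taken.
  force(expr)
}

# Usage with the loop from the reprex (needs readtext and a real file):
# for (i in 1:1000) {
#   dummy <- with_tmp_cleanup(readtext::readtext("/path/to/foo.docx"))
# }
```

Note that this removes anything created in `tmp` during the call, including temporary files made by other code inside the wrapped expression, so only wrap the call to readtext itself.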
