Skip to content

Code-Platoon-Assignments/tokenization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

3 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ“˜ Assignment: Tokenizing a Story with NLTK

Welcome to your first text pre-processing assignment! In this project, youโ€™ll apply your knowledge of tokenization to process a short story using Python and the Natural Language Toolkit (NLTK).

๐ŸŽฏ Objective

Read a short story from a .txt file and:

  1. Tokenize it into sentences using sent_tokenize()
  2. Tokenize it into words using word_tokenize()
  3. (Optional) Clean the story using re.sub() to remove unwanted characters

๐Ÿ“ Files Provided

  • story.txt: A short story about a beginner coder.
  • main.py: A starter Python script where you will write your code.
  • README.md: Youโ€™re reading it now!

โœ… Steps

  1. Install NLTK if you havenโ€™t already:

    pip install nltk
  2. Download NLTK resources in main.py: The script already includes the nltk.download('punkt') command for tokenizer models.

  3. Read the file: Use Pythonโ€™s file reading tools to load story.txt.

  4. Clean the text (optional): Use re.sub() to remove any characters you donโ€™t want included in your analysis.

  5. Tokenize the story:

    • Use sent_tokenize() to break the story into sentences.
    • Use word_tokenize() to break the story into individual words.
  6. Print the results to compare and see the difference between sentence and word tokenization.

๐Ÿค” Questions to Think About

  • How does sent_tokenize() determine where a sentence ends?
  • How does punctuation affect word_tokenize()?
  • How could this process help a chatbot better understand user input?

๐Ÿ’ก Bonus Challenge

  • Count how many sentences and how many words are in the story.
  • Print the most frequent word (ignoring common stopwords like "the", "and", etc.)

Happy tokenizing! ๐Ÿง ๐Ÿ’ฌ

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors