- Fork repo and clone
- Use `dev` branch
- Send pull request when complete
- Download `spam.csv` dataset
  - https://www.kaggle.com/uciml/sms-spam-collection-dataset
- Each entry is labeled as either `spam` or `ham` (not spam)
- We want to do hypothesis testing on this dataset
Let's imagine these are the messages in our inbox. Our hypothesis is
that the percentage of spam in our inbox is greater than 12.5%.
Alpha is 0.025
Do we reject or not reject the null hypothesis? What's the p-value?
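
One possible way to run this test is a one-sided z-test for a proportion, sketched below with only the standard library. The function name and the counts are placeholders, not values from the dataset, and the normal approximation is an assumption; an exact binomial test would be an equally valid choice.

```python
import math

def proportion_z_test_greater(spam_count, total, p0):
    """One-sided z-test of H0: p <= p0 against H1: p > p0."""
    p_hat = spam_count / total
    # Standard error of the sample proportion under the null hypothesis
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal: 1 - Phi(z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Placeholder counts for illustration only, NOT the real dataset values
z, p_value = proportion_z_test_greater(spam_count=150, total=1000, p0=0.125)
alpha = 0.025
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Replace the placeholder counts with the number of `spam` rows and the total number of rows in `spam.csv`.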
- You will be using the Poisson Probability Distribution
- https://en.wikipedia.org/wiki/Poisson_distribution
- It is parameterized by lambda
- In `Part I` you figured out how many messages were spam
- Let's imagine that you receive that many spam messages every four weeks
- What's the probability that you will get at least 30 spam messages per day?
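
A minimal sketch of the tail calculation, assuming four weeks means 28 days so that lambda is the spam count divided by 28. The function name and the spam count below are placeholders, not the real figure from the dataset.

```python
import math

def poisson_at_least(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term
        # Recurrence: P(X = i + 1) = P(X = i) * lam / (i + 1)
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

# Placeholder count for illustration only, NOT the real dataset value
spam_in_four_weeks = 700
lam_per_day = spam_in_four_weeks / 28  # average spam messages per day
print(poisson_at_least(30, lam_per_day))
```

If SciPy is available, `scipy.stats.poisson.sf(29, lam_per_day)` should give the same probability and is a convenient cross-check.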
- You are a human spam detector.
- Create an interactive program that will randomly show you a message from the above list.
- You have to determine if it's spam or not (do not look at the label).
- Record your answer for a small sample of messages (about 10).
- How accurate were your predictions?
- How many Type 1 and Type 2 errors did you have?
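
The interactive program could look something like the sketch below. The function names `run_quiz` and `tally` are hypothetical, and the code assumes the messages and labels are already loaded into two parallel lists. It also assumes `spam` is the positive class, so a Type 1 error is a ham message marked as spam and a Type 2 error is a spam message marked as ham.

```python
import random

def run_quiz(messages, labels, n=10, ask=input, seed=None):
    """Show n random messages and collect (true_label, guess) pairs."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(messages)), min(n, len(messages)))
    results = []
    for i in chosen:
        print("\n" + messages[i])  # the label is never shown
        guess = ""
        while guess not in ("spam", "ham"):
            guess = ask("spam or ham? ").strip().lower()
        results.append((labels[i], guess))
    return results

def tally(results):
    """Return accuracy plus Type 1 and Type 2 error counts (spam = positive)."""
    correct = sum(1 for truth, guess in results if truth == guess)
    type_1 = sum(1 for truth, guess in results if truth == "ham" and guess == "spam")
    type_2 = sum(1 for truth, guess in results if truth == "spam" and guess == "ham")
    return correct / len(results), type_1, type_2
```

Calling `tally(run_quiz(messages, labels))` prompts for ten answers and returns the accuracy and both error counts. The `ask` parameter exists only so the prompt can be swapped out when testing.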
In Part IV you were a spam detector. Let's see if we can automate this process and hopefully get better accuracy.
You will be creating a Naive Bayes Text Classification Model to determine if an incoming message is spam or not.
Here's some sample code:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# Turn raw text into token counts, then into TF-IDF weights
counts = count_vect.fit_transform(LIST_OF_MESSAGES)  # change: your list of message strings
tfidfs = tfidf_transformer.fit_transform(counts)

# Hold out 33% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(tfidfs, LIST_OF_LABELS, test_size=0.33, random_state=42)  # change: your list of labels

# Train the classifier and predict labels for the held-out messages
nb = MultinomialNB().fit(X_train, y_train)
predictions = nb.predict(X_test)

# you may need to add more code here ...
```
- What is the baseline accuracy?
- What is the accuracy from your trained model?
- How many Type 1 and Type 2 errors occurred?
- Which performed better, the human or the NB model?
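
For the accuracy questions, one common reading of "baseline accuracy" is the score you would get by always predicting the most frequent label. That reading is an assumption here, and the helper names below are hypothetical. A minimal sketch with toy labels:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return matches / len(y_true)

# Toy labels for illustration only, NOT taken from the dataset
toy_true = ["ham", "ham", "ham", "spam"]
toy_pred = ["ham", "ham", "spam", "spam"]
print(baseline_accuracy(toy_true), accuracy(toy_true, toy_pred))
```

With the sample code above, the same helpers can be called as `baseline_accuracy(list(y_test))` and `accuracy(y_test, predictions)`.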
You've now built a model that hopefully filters out all those spam messages from your inbox.
Rerun the entire dataset against your Naive Bayes model. In theory it has now filtered out all the spam messages, but you can check whether any spam got through the filter.
- Is this a Type 1 or Type 2 error?
- How many spam messages were not successfully detected?
- Redo the hypothesis testing from `Part II` but with your new (mostly filtered) dataset
- Do we reject or not reject the null hypothesis? What's the p-value?
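
One way to build the filtered dataset is to keep only the messages the model predicted as `ham` and then count how many of them are truly `spam`. The sketch below uses a hypothetical helper, `filter_inbox`, and toy labels rather than real model output.

```python
def filter_inbox(labels, predictions):
    """Keep the true labels of messages the model let through as ham."""
    return [truth for truth, pred in zip(labels, predictions) if pred == "ham"]

# Toy values for illustration only, NOT real model output
toy_labels = ["spam", "ham", "spam", "ham"]
toy_predictions = ["spam", "ham", "ham", "ham"]

remaining = filter_inbox(toy_labels, toy_predictions)
missed_spam = remaining.count("spam")  # spam that got past the filter
print(f"{missed_spam} spam out of {len(remaining)} remaining messages")
```

The two counts printed here (spam remaining and total remaining) are the inputs for the repeated hypothesis test.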