Employment Scam Detection 🕵️‍♂️🚫

An NLP-based Machine Learning project to detect fraudulent job advertisements. This project analyzes job descriptions to classify them as "Real" or "Fraudulent," helping to protect job seekers from employment scams.

📌 Project Overview

Employment scams are on the rise, often tricking applicants into providing personal information or money. This project utilizes Natural Language Processing (NLP) and Machine Learning to identify fake job postings based on textual features.

Key challenges addressed:

Imbalanced Dataset: Fraudulent cases were only ~5% of the data.
Text Processing: Converting unstructured job descriptions into meaningful numerical vectors.

📂 Project Structure

├── DataSet.csv              # The dataset containing job descriptions (EMSCAD)
├── Project_Analysis.ipynb   # Main Jupyter Notebook with code, visualizations, and models
├── requirements.txt         # List of dependencies
└── README.md                # Project documentation

🚀 Key Features & Methodology

1. Advanced Preprocessing (NLP)

Rather than only cleaning the text, I engineered features directly within it:

Token Replacement: Replaced sensitive patterns (emails, URLs, phone numbers, money amounts) with special tokens like _EMAIL_ and _MONEY_, so the model learns from the presence of such a pattern rather than from its specific value.
HTML Removal: Stripped HTML markup from the raw web-scraped data using BeautifulSoup.
Lemmatization: Used NLTK to reduce words to their dictionary form, or lemma (e.g., "hiring" -> "hire" when the word is treated as a verb).
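The token replacement step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the regular expressions are deliberately crude assumptions, and the _URL_ and _PHONE_ token names are guessed by analogy with the _EMAIL_ and _MONEY_ tokens mentioned above.

```python
import re

# Hypothetical patterns: the regexes used in the notebook are not shown in this README.
# URLs and emails are replaced first so that digits inside them
# are not later mistaken for phone numbers.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), " _URL_ "),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), " _EMAIL_ "),
    (re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?"), " _MONEY_ "),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), " _PHONE_ "),
]

def replace_sensitive_tokens(text: str) -> str:
    """Swap emails, URLs, money amounts, and phone numbers for placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    # Collapse the extra whitespace introduced by the padded tokens.
    return re.sub(r"\s+", " ", text).strip()

sample = ("Email hr@example.com or call +1 (555) 123-4567. "
          "Earn $2,500 weekly! Apply at http://jobs.example.com/apply")
print(replace_sensitive_tokens(sample))
```

Padding each token with spaces keeps it a separate word for the later TF-IDF step, even when the original pattern was glued to punctuation.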

2. Handling Class Imbalance

The dataset was highly imbalanced (Fraud: ~5%, Real: ~95%).

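Technique: Used SMOTE (Synthetic Minority Over-sampling Technique) on the training data only, generating synthetic samples for the minority (fraud) class so the model is less biased toward the majority class. The test data is left untouched so the evaluation reflects the real class distribution.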
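The project relies on Imbalanced-Learn for this step (its `SMOTE` class exposes `fit_resample`). The core idea can still be shown from scratch: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. The sketch below is an illustration of that interpolation, not the library's implementation; the function name and parameters are hypothetical. Note that SMOTE works on numeric vectors, so in practice it would operate on vectorized text rather than raw strings.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Toy 2-D minority class (illustrative values only).
fraud_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(fraud_points, n_new=10, k=2)
print(len(new_points))
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.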

3. Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency): Used with an N-gram range of (1,2), i.e., unigrams and bigrams, to capture local context (e.g., "wire transfer" vs. just "wire").
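To make the (1,2) N-gram idea concrete, here is a small pure-Python sketch of n-gram extraction and TF-IDF weighting. It is a simplification under stated assumptions: tokens are split on whitespace, the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, and no vector normalization is applied, so the numbers will not match a library vectorizer exactly. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs):
    """Smoothed TF-IDF over unigrams and bigrams; one {term: weight} dict per doc."""
    term_lists = [ngrams(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(terms).items()}
        for terms in term_lists
    ]

docs = ["send a wire transfer today", "fix the wire fence today"]
vectors = tfidf(docs)
print(vectors[0]["wire"], vectors[0]["wire transfer"])
```

In this toy corpus "wire" appears in both documents while "wire transfer" appears in only one, so the bigram receives the higher weight, which is the distinction the unigram alone cannot make.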

4. Models Implemented

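Support Vector Machine (SVM): Used with a linear kernel, which generally works well on high-dimensional, sparse text features such as TF-IDF.
K-Nearest Neighbors (KNN): Used with distance weighting, so closer neighbors count more in the vote than distant ones.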
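Distance weighting is easiest to understand in a tiny from-scratch example. The sketch below is not the project's code (which presumably uses Scikit-Learn's `KNeighborsClassifier` with `weights="distance"`); it is a hypothetical helper where each of the k nearest neighbors votes with weight 1/distance.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors votes with weight 1/distance."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, query)
        if d == 0:
            return label  # an exact match decides the prediction
        votes[label] += 1.0 / d
    return max(votes, key=votes.get)

# Toy data (illustrative only): one nearby "fraud" point, two farther "real" points.
X = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0)]
y = ["fraud", "real", "real"]
print(knn_predict(X, y, query=(0.1, 0.0), k=3))
```

With a plain majority vote the two "real" neighbors would win. With distance weighting the single very close "fraud" neighbor outweighs them, which can help when the minority class has few but tightly clustered examples.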

📊 Results & Evaluation

Because a missed fraudulent posting is more costly than a false alarm, Recall (Sensitivity) on the fraud class is the most important metric: the goal is to catch as many fraudulent postings as possible.

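| Model | Accuracy | Recall (Sensitivity) | F1-Score | ROC-AUC |
|-------|----------|----------------------|----------|---------|
| SVM   | ~98%     | High                 | High     | ~0.99   |
| KNN   | Good     | Moderate             | Moderate | ~0.90   |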

Conclusion: SVM outperformed KNN significantly in detecting fraudulent cases, making it the preferred model for this task.
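To see why recall rather than accuracy drives this comparison, consider a worked example on made-up labels with the same ~5% fraud rate. The data below is illustrative only and is not taken from the project; the helper function is a hypothetical stand-in for Scikit-Learn's metric functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall, precision, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }

# 5 fraudulent postings (label 1) among 100, mirroring the ~5% fraud rate.
y_true = [1] * 5 + [0] * 95
always_real = [0] * 100  # a useless model that never flags fraud
baseline = classification_metrics(y_true, always_real)
print(baseline)
```

The model that never flags anything still reaches 95% accuracy while catching zero fraud cases (recall 0.0), which is why accuracy alone says little on this dataset.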

🛠️ Installation & Usage

Clone the repository:

git clone https://github.com/YOUR_USERNAME/Employment-Scam-Detection.git
cd Employment-Scam-Detection

Create a Virtual Environment (Optional but recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies:

pip install -r requirements.txt

Run the Analysis:

Open Project_Analysis.ipynb in Jupyter Notebook or VS Code to see the step-by-step implementation.

📦 Libraries Used

Pandas & NumPy: Data manipulation.
Matplotlib & Seaborn: Visualization (Distribution plots, Confusion Matrix).
NLTK: Text processing (Stopwords, Lemmatizer).
Scikit-Learn: ML models and evaluation metrics.
Imbalanced-Learn: SMOTE implementation.

🤝 Future Improvements

Implement Deep Learning models like LSTM or BERT for better context understanding.
Incorporate non-text features (e.g., "Has Company Logo?", "Has Questions?").
Deploy the model as a web API using Flask or FastAPI.
