---
title: "Customer Churn Prediction Project"
author: "Alime KILINÇ"
date: "2026-01-11"
output:
  html_document:
    toc: true
    toc_float: true
    number_sections: false
    theme: united
---
# 1. Environment Setup
```{r setup, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# --- Core Data Manipulation ---
library(tidyverse)
library(lubridate)
library(stringr)
library(readr)
library(dplyr)
library(tidyr)
library(janitor)
# --- Machine Learning ---
library(fastDummies)
library(smotefamily)
library(xgboost)
library(caret)
library(ranger)
library(rpart)
library(pROC)
# --- Visualization ---
library(ggplot2)
library(gridExtra)
library(scales)
library(corrplot)
library(viridis)
library(knitr)
library(kableExtra)
library(plotly)
```
# 2. Data Acquisition & Quality Assessment
## 2.1 Load Training Data
```{r load-data}
Demographics <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_demographics_train.csv")
StatusData <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_status_level_train.csv")
MRR <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_monthly_recurring_revenue_train.csv")
Revenue <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_revenue_history_train.csv")
Support <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/support_ticket_activity_train.csv")
Bugs <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/product_bug_reports_train.csv")
Newsletter <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/newsletter_engagement_train.csv")
Satisfaction_scores <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_satisfaction_scores_train.csv")
Region_industry <- read_csv("/Users/macbook/Documents/Ceng3Kış2025/ProjectDataScience/customer_region_and_industry_train.csv")
```
## 2.2 Data Quality Report
Automatic data quality analysis for all datasets (Dimensions, Duplicates, Missing Values):
```{r data-quality-auto, results='asis'}
# --- 1. DEFINE FUNCTION ---
# This function takes a dataframe and generates a summary report
create_quality_report <- function(df, dataset_name) {
# Print Subheader
cat("\n### ", dataset_name, " Analysis\n")
# Dimensions
cat("- **Rows:** ", nrow(df), "\n")
cat("- **Columns:** ", ncol(df), "\n")
# Duplicate Check
dup_count <- sum(duplicated(df))
cat("- **Duplicate Rows:** ", dup_count, ifelse(dup_count > 0, " (WARNING!)", ""), "\n\n")
# create Detailed Table
summary_df <- data.frame(
Column = names(df),
Type = sapply(df, class),
Missing_Count = colSums(is.na(df)),
Missing_Percent = round(colSums(is.na(df)) / nrow(df) * 100, 2),
Unique_Values = sapply(df, function(x) length(unique(x)))
)
# Print Table (in HTML format)
print(
kable(summary_df, caption = paste(dataset_name, "Details"), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "left") %>%
scroll_box(height = "250px")
)
cat("\n---\n") # Separator line
}
# --- 2. EXECUTE FUNCTION (For all 9 Datasets) ---
create_quality_report(Demographics, "1. Demographics")
create_quality_report(StatusData, "2. Customer Status")
create_quality_report(MRR, "3. Monthly Recurring Revenue (MRR)")
create_quality_report(Revenue, "4. Revenue History")
create_quality_report(Support, "5. Support Tickets")
create_quality_report(Bugs, "6. Bug Reports")
create_quality_report(Newsletter, "7. Newsletter Engagement")
create_quality_report(Satisfaction_scores, "8. Satisfaction Scores")
create_quality_report(Region_industry, "9. Region & Industry")
```
### Data cleaning: removing duplicates
```{r cleaning}
Bugs <- Bugs %>% distinct()
Satisfaction_scores <- Satisfaction_scores %>% distinct()
cat("Cleaning Complete: Duplicates removed from Bugs and Satisfaction datasets.\n")
```
## 2.3 Initial Target Distribution
Before modeling, we examine the distribution of the target variable, customer status.
```{r target-dist}
# We are visualizing the 'Status' column in the StatusData table.
StatusData %>%
group_by(Status) %>%
summarise(num = n()) %>%
mutate(percent = round(num / sum(num) * 100, 1)) %>%
ggplot(aes(x = Status, y = num, fill = Status)) +
geom_col() +
geom_text(aes(label = paste0(num, " (", percent, "%)")), vjust = -0.5) +
labs(title = "Customer Status Distribution (Target Variable)",
x = "Status",
y = "Number of Customers") +
theme_minimal() +
theme(legend.position = "none")
```
# 3. Data Preprocessing Pipeline
## 3.1 Column Name Standardization
Before merging, we standardize the key identifier column to 'Customer_ID' across all datasets to ensure consistency.
```{r rename-columns}
# Standardize 'Demographics' (Change 'CUS ID' to 'Customer_ID')
Demographics <- Demographics %>%
rename(Customer_ID = `CUS ID`)
# Standardize other datasets (Change 'Customer ID' to 'Customer_ID')
# We use backticks `` because the original names contain spaces.
StatusData <- StatusData %>% rename(Customer_ID = `Customer ID`)
MRR <- MRR %>% rename(Customer_ID = `Customer ID`)
Revenue <- Revenue %>% rename(Customer_ID = `Customer ID`)
Support <- Support %>% rename(Customer_ID = `Customer ID`)
Bugs <- Bugs %>% rename(Customer_ID = `Customer ID`)
Newsletter <- Newsletter %>% rename(Customer_ID = `Customer ID`)
Satisfaction_scores <- Satisfaction_scores %>% rename(Customer_ID = `Customer ID`)
Region_industry <- Region_industry %>% rename(Customer_ID = `Customer ID`)
cat("Column standardization complete. All datasets now use 'Customer_ID' as the key.\n")
```
## 3.2 Feature Engineering - Satisfaction Surveys
We are summarizing the detailed questions in the satisfaction survey to create a single 'Average Satisfaction Score' for each customer.
```{r fe-satisfaction}
# Check the columns of the satisfaction table (are all columns except ID numerical scores?)
# Calculate the mean for each row and create a new column.
Satisfaction_Feat <- Satisfaction_scores %>%
# 1. Select only numeric columns (excluding ID)
select(Customer_ID, where(is.numeric)) %>%
# 2. Drop non-score numeric columns such as 'Year' and 'Quarter'.
select(-Year, -Quarter) %>%
# 3. Calculate the average of the remaining score columns.
mutate(Instance_Score = rowMeans(select(., -Customer_ID), na.rm = TRUE)) %>%
# Group by customer and reduce to a single row.
group_by(Customer_ID) %>%
summarise(Avg_Satisfaction = mean(Instance_Score, na.rm = TRUE))
cat("Completed: There is a single satisfaction score for each customer.\n")
head(Satisfaction_Feat)
```
## 3.3 Aggregate Additional Features
We are summarizing transaction-based tables into a single row per customer (Aggregation).
```{r fe-aggregate}
# 1. Support Tickets: How many total support requests has each customer made?
Support_Agg <- Support %>%
group_by(Customer_ID) %>%
summarise(Total_Tickets = n()) # n() counts the number of rows
# 2. Revenue History: What is the customer's total spend?
# The raw revenue columns arrive as text, so we parse them to numbers first,
# leaving Customer_ID untouched. (This table is rebuilt more thoroughly after the merge.)
Revenue_Agg <- Revenue %>%
mutate(across(where(is.character) & !any_of("Customer_ID"), parse_number)) %>%
group_by(Customer_ID) %>%
summarise(
Total_Revenue = sum(across(where(is.numeric)), na.rm = TRUE), # Sum all numeric columns
Transaction_Count = n()
)
# 3. Bug Reports: How many times has the customer reported a bug?
Bugs_Agg <- Bugs %>%
group_by(Customer_ID) %>%
summarise(Bug_Count = n())
# 4. MRR
MRR_Agg <- MRR %>%
mutate(MRR_Clean = parse_number(as.character(MRR))) %>%
group_by(Customer_ID) %>%
summarise(MRR = mean(MRR_Clean, na.rm = TRUE))
# 5. Newsletter
Newsletter_Agg <- Newsletter %>%
group_by(Customer_ID) %>%
summarise(Newsletter_Count = sum(`Company Newsletter Interaction Count`, na.rm = TRUE))
cat("Completed: Support, Revenue, Bugs, MRR, Newsletter tables have been summarized.\n")
```
## Fix ID types (type mismatch repair)
```{r fix-id-types}
Demographics$Customer_ID <- as.character(Demographics$Customer_ID)
Region_industry$Customer_ID <- as.character(Region_industry$Customer_ID)
StatusData$Customer_ID <- as.character(StatusData$Customer_ID)
MRR_Agg$Customer_ID <- as.character(MRR_Agg$Customer_ID)
Newsletter_Agg$Customer_ID <- as.character(Newsletter_Agg$Customer_ID)
Revenue_Agg$Customer_ID <- as.character(Revenue_Agg$Customer_ID)
Support_Agg$Customer_ID <- as.character(Support_Agg$Customer_ID)
Bugs_Agg$Customer_ID <- as.character(Bugs_Agg$Customer_ID)
Satisfaction_Feat$Customer_ID <- as.character(Satisfaction_Feat$Customer_ID)
cat("All Customer_ID columns have been set to 'Character' format.\n")
```
## 3.4 Master Dataset Integration
We are merging the summarized tables created during Feature Engineering into the main dataset.
```{r master-integration-v2}
Master_Data <- Demographics %>%
left_join(Region_industry, by = "Customer_ID") %>%
left_join(StatusData, by = "Customer_ID") %>%
left_join(MRR_Agg, by = "Customer_ID") %>%
left_join(Newsletter_Agg, by = "Customer_ID") %>%
# The following are the aggregated tables:
left_join(Revenue_Agg, by = "Customer_ID") %>%
left_join(Support_Agg, by = "Customer_ID") %>%
left_join(Bugs_Agg, by = "Customer_ID") %>%
left_join(Satisfaction_Feat, by = "Customer_ID")
cat("Master Dataset Created Successfully (Aggregated).\n")
cat("Final Dimensions: ", nrow(Master_Data), " Rows, ", ncol(Master_Data), " Columns\n")
# Display the table
kable(head(Master_Data), caption = "Final Master Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(width = "100%")
```
## 3.5 Missing Value Imputation
We address the missing values resulting from the merge process.
Strategy:
1. Activity counts (Tickets, Bugs, Revenue) are filled with **0** (meaning no activity).
2. Satisfaction scores are imputed with the **Median** value to maintain distribution.
```{r missing-imputation}
library(readr)
library(tidyr)
library(dplyr)
Master_Data <- Master_Data %>%
mutate(
# 1.
MRR = as.numeric(as.character(MRR)),
Total_Revenue = as.numeric(as.character(Total_Revenue)),
Newsletter_Count = as.numeric(as.character(Newsletter_Count))
) %>%
mutate(
# 2. Imputation
Total_Tickets = replace_na(Total_Tickets, 0),
Bug_Count = replace_na(Bug_Count, 0),
Total_Revenue = replace_na(Total_Revenue, 0),
Transaction_Count = replace_na(Transaction_Count, 0),
# if not in MRR, Newsletter -> 0
MRR = replace_na(MRR, 0),
Newsletter_Count = replace_na(Newsletter_Count, 0),
# if not in satisfaction score -> Median
Avg_Satisfaction = ifelse(is.na(Avg_Satisfaction),
median(Avg_Satisfaction, na.rm = TRUE),
Avg_Satisfaction)
)
# 3. Verify Imputation
cat("\n--------------------------------\n")
cat("Missing Values AFTER Imputation:\n")
print(colSums(is.na(Master_Data)))
Master_Data <- Master_Data %>%
mutate(
Region = replace_na(Region, "Unknown"),
Vertical = replace_na(Vertical, "Unknown"),
Subvertical = replace_na(Subvertical, "Unknown"),
`Customer Level` = replace_na(`Customer Level`, "Unknown")
)
cat("\n--------------------------------\n")
cat("Final check:\n")
print(colSums(is.na(Master_Data)))
```
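The median (rather than the mean) was chosen for `Avg_Satisfaction` because satisfaction scores are often skewed. A quick toy illustration with made-up scores (not project data):

```r
# Hypothetical, skewed satisfaction scores with missing entries
scores <- c(2, 3, 3, 4, 4, 4, 5, NA, NA, 10)

med <- median(scores, na.rm = TRUE)  # 4     - unaffected by the outlier 10
avg <- mean(scores, na.rm = TRUE)    # 4.375 - pulled upward by the outlier

# Fill the missing entries with the median, as in the chunk above
scores_imputed <- ifelse(is.na(scores), med, scores)
```

Imputing with the median keeps the filled-in values at the centre of the observed distribution instead of inflating them toward the long tail.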
### Recreate and merge the Revenue table.
```{r }
# 1. Define the function here first (since section 4.3 has not run yet)
cap_outliers <- function(x) {
quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
x[x > quantiles[2]] <- quantiles[2] # Cap values at the upper bound
x[x < quantiles[1]] <- quantiles[1] # Cap values at the lower bound
return(x)
}
# 2. Drop the earlier Revenue columns from Master_Data so they can be rebuilt
Master_Data <- Master_Data %>%
select(-Total_Revenue, -Transaction_Count)
# 3. Re-extract the Revenue data from the raw table with more thorough cleaning
Revenue_Agg_Fixed <- Revenue %>%
mutate(across(everything(), as.character)) %>% # Convert everything to character strings
pivot_longer(cols = -Customer_ID, values_to = "raw_value") %>% # Pivot to long format
mutate(clean_val = parse_number(raw_value)) %>% # Extract numeric values
group_by(Customer_ID) %>%
summarise(
Total_Revenue = sum(clean_val, na.rm = TRUE),
Transaction_Count = n()
)
# 4. Re-integrate into Master_Data
Master_Data <- Master_Data %>%
left_join(Revenue_Agg_Fixed, by = "Customer_ID") %>%
mutate(
# Set unmatched records (customers with no purchases) to 0
Total_Revenue = replace_na(Total_Revenue, 0),
Transaction_Count = replace_na(Transaction_Count, 0)
)
# 5. Apply Outlier Capping for this NEW column
Master_Data <- Master_Data %>%
mutate(Total_Revenue_Capped = cap_outliers(Total_Revenue))
cat("Revenue table repaired and Outlier capping applied.\n")
cat("New Average Revenue: ", mean(Master_Data$Total_Revenue, na.rm = TRUE), "\n")
```
# 4. Exploratory Data Analysis
## 4.1 Correlation Analysis
Understand the relationships between numeric variables.
```{r correlation-analysis}
library(corrplot)
# 1. Select only numeric columns for correlation analysis
numeric_vars <- Master_Data %>%
select(where(is.numeric)) %>%
# Removing 'Transaction_Count' if it has zero variance or is redundant
select(-Transaction_Count)
# 2. Compute the correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# 3. Visualize the correlation matrix
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45, # Text label rotation
addCoef.col = "black", # Add coefficient numbers
number.cex = 0.7, # Font size for coefficients
diag = FALSE, # Hide diagonal
title = "Feature Correlation Matrix",
mar = c(0,0,1,0))
```
## 4.2 Satisfaction Metrics Distribution
Analyze the distribution of the 'Avg_Satisfaction' score.
```{r }
ggplot(Master_Data, aes(x = Avg_Satisfaction)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, fill = "#2A9D8F", color = "white", alpha = 0.7) +
geom_density(color = "#E76F51", linewidth = 1) +
labs(title = "Distribution of Average Satisfaction Scores",
subtitle = "Histogram with Density Curve",
x = "Average Satisfaction Score",
y = "Density") +
theme_minimal()
```
## 4.3 Outlier Detection and Treatment
Identify and cap extreme values (outliers) to prevent model skew.
```{r }
# --- Step 1: Visualize Outliers (Boxplots) ---
# We focus on highly variable columns: Revenue, MRR, and Tickets.
p1 <- ggplot(Master_Data, aes(y = Total_Revenue)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Total Revenue Outliers") + theme_minimal()
p2 <- ggplot(Master_Data, aes(y = MRR)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "MRR Outliers") + theme_minimal()
p3 <- ggplot(Master_Data, aes(y = Total_Tickets)) +
geom_boxplot(fill = "pink") +
labs(title = "Total Tickets Outliers") + theme_minimal()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)
# --- Step 2: Outlier Treatment (Capping / Winsorization) ---
# Strategy: Cap values above the 99th percentile.
# We do not delete rows to preserve data; we just limit the maximum value.
cap_outliers <- function(x) {
quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
x[x > quantiles[2]] <- quantiles[2] # Cap upper bound
x[x < quantiles[1]] <- quantiles[1] # Cap lower bound (optional)
return(x)
}
# Apply capping to relevant numeric columns
Master_Data <- Master_Data %>%
mutate(
Total_Revenue_Capped = cap_outliers(Total_Revenue),
MRR_Capped = cap_outliers(MRR),
Total_Tickets_Capped = cap_outliers(Total_Tickets)
)
cat("Outlier treatment complete. New columns with suffix '_Capped' created.\n")
```
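To see exactly what `cap_outliers` does, here is a small check on a synthetic vector (illustrative values only, not project data):

```r
# Mirrors the cap_outliers definition from the chunk above
cap_outliers <- function(x) {
  quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
  x[x > quantiles[2]] <- quantiles[2]  # Cap upper bound
  x[x < quantiles[1]] <- quantiles[1]  # Cap lower bound
  x
}

x <- c(1:99, 1000)        # one extreme high value
capped <- cap_outliers(x)
max(capped)               # ~108 (the 99th percentile) instead of 1000
```

The extreme value is pulled down to the 99th percentile while every ordinary observation is left in place, which is why no rows need to be deleted.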
# 5. Feature Engineering and Model Preparation
## 5.1 Advanced Feature Creation
```{r }
# Goal: Create new predictive features BEFORE encoding or splitting.
# Logic: It is safer to calculate ratios on real data than on synthetic (SMOTE) data.
Model_Data_Enhanced <- Master_Data %>%
mutate(
# 1. Ticket Intensity: Tickets per month of tenure
# (Adding +1 to avoid division by zero)
Ticket_per_Month = Total_Tickets_Capped / (`Customer Age (Months)` + 1),
# 2. Value per Age: Revenue generated per month
Revenue_per_Month = Total_Revenue_Capped / (`Customer Age (Months)` + 1),
# 3. High Risk Flag: Interaction between Low Satisfaction and High Bugs
# If Satisfaction is low (e.g. 2) and Bugs are high (e.g. 5) -> Risk = (10-2)*5 = 40 (High Score)
Risk_Factor = (10 - Avg_Satisfaction) * Bug_Count
)
cat("5.1 Complete: New Features (Ratios & Interactions) created on Master Data.\n")
```
## 5.2 Selection and One-Hot Encoding
```{r }
# Goal: Select relevant columns and convert categoricals to numeric.
library(fastDummies)
library(janitor)
# 1. Prepare Target & Drop Unused Columns
Model_Data_Selected <- Model_Data_Enhanced %>%
mutate(Target = ifelse(Status == "Churn", 1, 0)) %>%
select(
-Customer_ID, # ID is not predictive
-Status, # Converted to Target
-Total_Revenue, # Using Capped version
-MRR, # Using Capped version
-Total_Tickets # Using Capped version
)
# 2. One-Hot Encoding
categorical_cols <- c("Region", "Vertical", "Subvertical", "Customer Level")
Model_Data_Encoded <- dummy_cols(
Model_Data_Selected,
select_columns = categorical_cols,
remove_selected_columns = TRUE, # Remove original text columns
remove_first_dummy = FALSE # Keep all categories
) %>%
clean_names() # Standardize names (region_turkey, etc.)
cat("5.2 Complete: Encoding Finished.\n")
cat("Dimensions:", nrow(Model_Data_Encoded), "Rows,", ncol(Model_Data_Encoded), "Columns\n")
```
## 5.3 Train-Test Split
```{r }
# Goal: Split data into 80% Training and 20% Testing.
library(caret)
set.seed(123)
train_index <- createDataPartition(Model_Data_Encoded$target, p = 0.8, list = FALSE)
X_train_raw <- Model_Data_Encoded[train_index, ] %>% select(-target)
y_train_raw <- Model_Data_Encoded[train_index, ]$target
# Test Set (We will NOT touch this with SMOTE)
Test_Set <- Model_Data_Encoded[-train_index, ]
X_test <- Test_Set %>% select(-target)
y_test <- Test_Set$target
cat("5.3 Complete: Data Split.\n")
cat("Train Size:", length(y_train_raw), "\n")
cat("Test Size: ", length(y_test), "\n")
```
## 5.4 SMOTE (only on Train)
```{r }
# Goal: Apply SMOTE only to the Training set to fix imbalance.
library(smotefamily)
# Apply SMOTE
# K=5 is standard. dup_size=0 means we don't just copy rows, we generate new ones.
smote_result <- SMOTE(X = X_train_raw, target = y_train_raw, K = 5, dup_size = 0)
# Extract Balanced Train Data
Train_Set_Balanced <- smote_result$data %>%
rename(target = class) %>%
mutate(target = as.numeric(as.character(target)))
# Separate Features and Target for the final Train set
X_train_balanced <- Train_Set_Balanced %>% select(-target)
y_train_balanced <- Train_Set_Balanced$target
cat("5.4 Complete: SMOTE applied to Training Set.\n")
cat("Original Churn Count:", sum(y_train_raw == 1), "\n")
cat("Balanced Churn Count:", sum(y_train_balanced == 1), "\n")
```
## 5.5 Final Distribution Check
```{r }
library(ggplot2)
library(gridExtra)
# Plot before and after
p1 <- data.frame(target = y_train_raw) %>%
ggplot(aes(x = factor(target), fill = factor(target))) + geom_bar() +
labs(title = "Original Train", x = "Status") + theme_minimal() + theme(legend.position="none")
p2 <- data.frame(target = y_train_balanced) %>%
ggplot(aes(x = factor(target), fill = factor(target))) + geom_bar() +
labs(title = "SMOTE Balanced Train", x = "Status") + theme_minimal() + theme(legend.position="none")
grid.arrange(p1, p2, ncol = 2)
```
# 6. Model Development and Training
## 6.1 Training Setup (Cross-Validation)
```{r }
# Goal: Define how we will validate the models (5-Fold Cross Validation).
# Note: Train-Test Split was already done in Step 5.3.
library(caret)
# 1. Convert Target to Factor (Required for Classification in Caret)
# 0 -> "Current", 1 -> "Churn"
# Caret prefers text labels for classification levels.
y_train_balanced_factor <- factor(y_train_balanced, levels = c(0, 1), labels = c("Current", "Churn"))
y_test_factor <- factor(y_test, levels = c(0, 1), labels = c("Current", "Churn"))
# 2. Define Control Parameters (5-Fold CV)
# This splits the training data into 5 parts to validate the model internally.
fit_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # Calculate probabilities (for ROC curve)
summaryFunction = twoClassSummary, # Use ROC/AUC as metric
verboseIter = FALSE
)
cat("6.1 Complete: Training setup ready with 5-Fold Cross Validation.\n")
```
## 6.2 Model1: Random Forest
```{r }
# Goal: Train a Random Forest model using the balanced dataset.
cat("Training Random Forest... (This may take a minute)\n")
# We use the 'ranger' package (faster implementation of Random Forest)
rf_model <- train(
x = X_train_balanced,
y = y_train_balanced_factor,
method = "ranger",
metric = "ROC", # Optimize for Area Under Curve
trControl = fit_control,
tuneLength = 5 # Try 5 different hyperparameter combinations automatically
)
print(rf_model)
cat("\nRandom Forest Training Complete.\n")
```
## 6.3 Model2: GBM(Gradient Boosting Machine)
```{r }
# Goal: Train a GBM model (A robust alternative to XGBoost with high stability in R)
library(gbm)
library(caret)
cat("Training GBM... (This takes a moment but is very stable)\n")
# We use the same balanced training data (X_train_balanced) that worked for Random Forest.
# Unlike XGBoost, GBM is less "finicky" and handles factors and data types very effectively.
# Define a basic hyperparameter grid for GBM
gbmGrid <- expand.grid(
interaction.depth = c(1, 3, 5), # Maximum depth of each tree
n.trees = c(100, 150), # Number of boosting iterations
shrinkage = 0.1, # Learning rate
n.minobsinnode = 10 # Minimum number of observations in terminal nodes
)
gbm_model <- train(
x = X_train_balanced, # Using the same dataset as Random Forest
y = y_train_balanced_factor, # Our binary target variable
method = "gbm", # Method set to 'gbm'
metric = "ROC",
trControl = fit_control, # Settings defined in section 6.1
tuneGrid = gbmGrid,
verbose = FALSE # Turn off excessive logging for a cleaner console
)
print(gbm_model)
cat("\nGBM Training Complete.\n")
```
# 7. Model Evaluation
```{r }
# Goal: Compare Random Forest and GBM performance visually to select the best model.
# The resamples function collects Cross-Validation results from both models
results <- resamples(list(RF = rf_model, GBM = gbm_model))
# Print summary statistics (Mean, Median, Min, Max for ROC, Sens, Spec)
summary(results)
# Visualization with Boxplots
# This chart shows which model is more stable and has higher overall scores.
bwplot(results, metric = "ROC", main = "Model Comparison: ROC Curve Distribution")
# Visualization with Dotplots
# Useful for a quick glance at the mean performance and confidence intervals.
dotplot(results, metric = "ROC", main = "Model Comparison: ROC Performance Dotplot")
cat("7 Complete: Models compared visually.\n")
```
Commentary: Based on the results of the 5-fold Cross-Validation, both the Random Forest and GBM models demonstrate exceptionally high performance. The Area Under the Curve (AUC) for both models is approximately 0.98. The Random Forest model was selected as the final model because it exhibits a slightly more stable distribution compared to GBM.
# 8. Final Predictions on Test Set
```{r }
# Goal: Use the best model (Random Forest) to predict on the Test Set.
cat("Predicting on Test Set using Random Forest...\n")
# 1. Generate Predictions (Both Probabilities and Classes)
# We are using the Random Forest model as it achieved the highest ROC score during training.
predictions_prob <- predict(rf_model, X_test, type = "prob")
predictions_class <- predict(rf_model, X_test, type = "raw")
# 2. Create the Confusion Matrix
# Evaluate model performance: How many did we get right? (Accuracy, Sensitivity, Specificity)
# We set 'positive = "Churn"' to ensure metrics reflect our ability to detect churners.
conf_matrix <- confusionMatrix(predictions_class, y_test_factor, positive = "Churn")
print(conf_matrix)
cat("\n8. Complete: Final Test Evaluation Done.\n")
```
Final Test Result: The model achieved an overall Accuracy of 95.64% on the Test Data, which it had never encountered before.
- Error Rate: Out of 413 customers in the test set, only 18 were misclassified (6 False Positives, 12 False Negatives).
- Sensitivity (Recall): 97.01%. This metric demonstrates the model's high level of success in identifying customers with churn potential.
- Kappa Score: A value of 0.91 indicates that the model's performance is well beyond chance, falling in the "Almost Perfect" agreement band.
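The headline figures can be sanity-checked directly from the reported confusion-matrix counts:

```r
total  <- 413     # customers in the test set
errors <- 6 + 12  # false positives + false negatives

accuracy   <- (total - errors) / total
error_rate <- errors / total

round(accuracy * 100, 2)    # 95.64
round(error_rate * 100, 2)  # 4.36
```

The accuracy reported by `confusionMatrix` is simply the share of correctly classified customers, so the 95.64% figure follows directly from 18 errors out of 413.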
# 9. Feature Importance
```{r }
# Goal: Re-train the optimal model with importance enabled and visualize drivers.
cat("Re-calculating importance using the 'impurity' method...\n")
# 1. Re-train the model one last time with importance enabled
# We use the best 'mtry' found during the tuning step above (mtry = 29)
final_rf_with_imp <- train(
x = X_train_balanced,
y = y_train_balanced_factor,
method = "ranger",
importance = 'impurity', # ranger only records importance when explicitly requested
tuneGrid = data.frame(mtry = 29, splitrule = "gini", min.node.size = 1),
trControl = trainControl(method = "none") # No need for CV again
)
# 2. Extract and Plot Importance
var_imp_data <- varImp(final_rf_with_imp, scale = FALSE)
# 3. Generate the Visualization
plot(var_imp_data, top = 20, main = "Top 20 Drivers of Customer Churn")
cat("9. Complete: Feature Importance Plot Generated successfully.\n")
```
Variable Analysis: When examining the most significant factors influencing customer churn, it is evident that financial metrics are the primary drivers.
- MRR_Capped (Monthly Recurring Revenue): This is by far the most decisive factor.
- Transaction_Count (Number of Transactions): The customer's activity level has a direct impact on their loyalty.
- Total_Revenue: The total value a customer has brought to the company is also a critical indicator.
These results highlight the necessity of closely monitoring the behaviors of high-revenue and high-transaction customers.