---
title: "STATS 15 Final Project - Portuguese Secondary School Academic Performance Report"
author: "Cindy Li, Eleanore Zhu, Yinbo Zhang, Joseph Liu, Haoxiang Gao"
date: "2025-05-31"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Executive Summary

This project investigates how demographic, social, and behavioral factors correlate with student academic performance, using data from two public secondary schools in Portugal. Our analysis reveals that academic outcomes are shaped by multiple variables, sometimes in unexpected ways.

Demographic factors show that older students and those from rural areas tend to underperform in Portuguese, while male students perform better in Math and female students in Portuguese. Behavioral factors such as study time, travel time, free time, and alcohol consumption are more strongly associated with Portuguese performance. As intuition suggests, students in better health and with fewer past class failures tend to perform better in both subjects. Social support also plays a key role: students with good family relationships and a desire for higher education are more likely to achieve top grades (A).

Further analysis investigated interactions between variables from different sections. While romantic relationships were initially hypothesized to reduce study time and thus grades, the data did not support this. In contrast, parental education, especially maternal education, shows a very strong positive correlation with academic performance, particularly when the mother is the student’s guardian. Finally, we found that Math and Portuguese respond to different types of influences. Overall, students perform better in Portuguese than in Math and improve their grades more over time, and the strength of the correlation between some variables and grades differs by subject.

# Section 1 - Introduction

## 1.1 - Motivating Questions and Scope of Analysis

All our group members grew up in education-emphasizing Asian families. Though we completed high school in different countries, we share the experience of working hard towards higher education. So, we are drawn to the topic of academic performance and the underlying determinants. In particular, we would like to investigate how factors other than intelligence can impact educational attainment. Rather than focusing our exploration on China or the U.S., whose education systems we are already familiar with, we decided to challenge ourselves with Portugal, which has a different educational structure and less societal emphasis on education. We are especially curious about what shapes student performance in a context that differs vastly from our own.

In this project, we focus on two public schools in Portugal and use exploratory data analysis (EDA) to answer the overarching research question: **How do demographic, social support, and behavioral factors relate to academic performance – measured by test scores over time (G1, G2, G3) – and are there subject-based differences in these relationships?**

## 1.2 - Portugal’s Educational Landscape: Historical and Social Context

Education in Portugal was established by the Constitution of the Republic in 1976 according to the democratic principles of the freedom to teach and learn (“Overview”). Ten years later, the Education Act, which defines educational objectives, structures, and modes of organization, was introduced with the aim of expanding access to compulsory education. However, underinvestment, regional disparities, and value differences still affect student educational outcomes today.

Despite Portugal’s substantial efforts to improve its educational system, in 2006, the country had an early school dropout rate of 40% for 18 to 24-year-olds, lagging behind the 15% average for various other European Union nations (Cortez & Silva, 2008). Among the challenges that Portugal faces are teacher shortages, demographic shifts, and unequal access to quality instruction. In particular, many students have reported studying without teachers in at least one subject for extended periods of time (“Overview”). The systemic difficulties of lacking educational resources not only affect immediate academic performance, but also limit long-term opportunities for higher education and career advancement. This is especially true for students from disadvantaged backgrounds or remote areas – individuals who lack opportunities to access quality education and succeed in academic pursuits.

Therefore, it is critical to understand what factors influence student achievement in this context, and by analyzing how behavioral, social, and demographic variables interact with academic outcomes in Portugal, we hope to gain insights into an education system that faces significant historical and modern challenges.

## 1.3 - Structure of the Portuguese Education System

According to the Education Act of 1986, Portugal’s education system is divided into three levels. The system begins with pre-school education, an optional education opportunity for children between the ages of three and six. Compulsory schooling spans twelve years (ages 6 - 18) and is divided into basic education and upper secondary education.

Basic education lasts nine years and is subdivided into three cycles: the 1st cycle covers Grades 1-4 (ages 6-10), the 2nd cycle Grades 5-6 (ages 10-12), and the 3rd cycle Grades 7-9 (ages 12-15). This is followed by upper secondary education in Grades 10-12 (ages 15-18), during which students may choose among general academic tracks (such as science and humanities), vocational training, or specialized artistic programs (“Overview”). The different pathways let students tailor their plans to their own circumstances, whether that means continuing to university, attending polytechnics that emphasize technical training, or entering the labor market directly upon graduation. Despite this structural flexibility, all upper secondary students are evaluated on a standard 0-20 grading scale, with 20 being the highest possible score (“Overview”). This scale is used across general and vocational tracks and also applies to the dataset we have chosen. For easier interpretation, Portugal’s 20-point scale can be mapped onto a conventional US letter grading scale as follows: A (16-20), B (14-15), C (12-13), D (10-11), and F (0-9), where A is excellent/very good, B is good, C is satisfactory, D is sufficient, and F is a fail (Cortez & Silva, 2008).

## 1.4 - Data Set Overview

The dataset we are analyzing was obtained from the UCI Machine Learning Repository and was contributed by Paulo Cortez, a professor at the Department of Information Systems, University of Minho, Portugal (Cortez). The data contains the academic outcomes of students from two public secondary schools in Portugal. It includes scores in two courses, Math and Portuguese, recorded over the school year, along with more detailed attributes of each student. During the 2005-2006 school year, the students’ scores were collected from paper school reports, which included a few attributes: the three period grades (beginning, middle, and final course grades) and the number of school absences. A questionnaire was also given to the students to supply the remaining information. The questionnaire used closed questions related to demographic, social/emotional, and behavioral variables, and it was first reviewed by professionals and pre-tested on a small group of students. The final version, with 37 questions, was answered by 788 students, of whom 111 were discarded due to lack of identification (Cortez & Silva, 2008). The resulting data was sorted into two datasets, one for Math scores and one for Portuguese language scores.

Additionally, the data was sampled from two public secondary schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS), located in the Alentejo region of southern Portugal. Alentejo is known for its agricultural economy, with extensive regions that are essentially rural and sparsely populated (Alentejo, Turismo do.). Its lagging economic development compared to urban centers like Lisbon very likely influences the educational experiences and outcomes of students in this dataset. For instance, whether students live in rural or urban areas, their travel times to school, and limited access to educational support may contribute to the differences we observe in student performance.

As the region of Alentejo makes up about one-third of Portugal (Alentejo, Turismo do.), situating our analysis within this specific region should give us a more nuanced understanding of how various factors affect local students’ academic achievement. This adds depth to our research question and allows for a more grounded interpretation of the dataset’s patterns.

## 1.5 - Observational Units and Variable Structure

Each row in the dataset represents one student enrolled in a single subject course, either Mathematics or Portuguese. Since the data was split into two subject-specific files without identifiers that would let us match students, the same student may appear in both datasets as distinct entries. For the purposes of our project, we analyze each subject separately, treating each record as an independent observational unit. At the end, we compare how the trends observed in the two courses differ.

During our initial data exploration, we did not find any missing values. This observation is supported by the original paper analyzing this dataset, which notes that entries lacking identification details were filtered out before publication (Cortez & Silva, 2008). We also checked for potential duplicate entries. However, since identifying information has been removed, it is not possible to confirm the presence of duplicates, although the structure and origin of the data suggest that such instances are unlikely.
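
As a rough illustration of how checks like these could be run, here is a minimal sketch on a small made-up data frame. The name `toy_students` and all of its values are hypothetical and are not taken from the dataset.

```{r}
# Hypothetical toy data standing in for one of the subject files
toy_students <- data.frame(
  age = c(15, 16, 16, 17),
  sex = c("F", "M", "M", "F"),
  G3  = c(12, 9, 9, 14)
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy_students))
na_per_column

# Count rows that repeat an earlier row exactly. Without student identifiers,
# a repeat may simply be two different students who share every answer.
n_repeated_rows <- sum(duplicated(toy_students))
n_repeated_rows
```

A repeated row is therefore only a hint of a duplicate, which is why we cannot confirm duplicates in the real data.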

## 1.6 - Variable Explanation

| #  | Variable    | Type                | Description |
|----|-------------|------------------|---------------------------|
| 1  | G1          | Numeric, Discrete   | First-period grade (0–20) |
| 2  | G2          | Numeric, Discrete   | Second-period grade (0–20) |
| 3  | G3          | Numeric, Discrete   | Final course grade (0–20) |
| 4  | Age         | Numeric, Discrete   | Student’s age (15–22) |
| 5  | Sex         | Character, Binary   | Student’s sex (‘F’ = female, ‘M’ = male) |
| 6  | School      | Character, Binary   | Student’s school (‘GP’ = Gabriel Pereira, ‘MS’ = Mousinho da Silveira) |
| 7  | Address     | Character, Binary   | Home address type (‘U’ = urban, ‘R’ = rural) |
| 8  | Guardian    | Character, Nominal  | Student’s guardian (‘mother’, ‘father’, ‘other’) |
| 9  | Medu        | Numeric, Ordinal    | Mother’s education (0 = none to 4 = higher ed) |
| 10 | Fedu        | Numeric, Ordinal    | Father’s education (0 = none to 4 = higher ed) |
| 11 | Mjob        | Character, Nominal  | Mother’s job (‘teacher’, ‘health’, ‘services’, ‘at_home’, ‘other’) |
| 12 | Fjob        | Character, Nominal  | Father’s job (‘teacher’, ‘health’, ‘services’, ‘at_home’, ‘other’) |
| 13 | Studytime   | Numeric, Ordinal    | Weekly study time (1 = <2h to 4 = >10h) |
| 14 | Traveltime  | Numeric, Ordinal    | Travel time to school (1 = <15min to 4 = >1h) |
| 15 | Failures    | Numeric, Discrete   | Past class failures (0 to 4+) |
| 16 | Paid        | Character, Binary   | Took extra paid classes (‘yes’, ‘no’) |
| 17 | Absences    | Numeric, Discrete   | Number of school absences (0–93) |
| 18 | Health      | Numeric, Ordinal    | Current health status (1 = very bad to 5 = very good) |
| 19 | Pstatus     | Character, Binary   | Parent cohabitation (‘T’ = together, ‘A’ = apart) |
| 20 | Schoolsup   | Character, Binary   | School educational support (‘yes’, ‘no’) |
| 21 | Famsup      | Character, Binary   | Family educational support (‘yes’, ‘no’) |
| 22 | Activities  | Character, Binary   | Participates in extracurriculars (‘yes’, ‘no’) |
| 23 | Higher      | Character, Binary   | Plans for higher education (‘yes’, ‘no’) |
| 24 | Nursery     | Character, Binary   | Attended nursery school (‘yes’, ‘no’) |
| 25 | Internet    | Character, Binary   | Internet access at home (‘yes’, ‘no’) |
| 26 | Romantic    | Character, Binary   | In a romantic relationship (‘yes’, ‘no’) |
| 27 | Famrel      | Numeric, Ordinal    | Family relationship quality (1 = very bad to 5 = very good) |
| 28 | Goout       | Numeric, Ordinal    | Going out with friends (1 = very low to 5 = very high) |
| 29 | Dalc        | Numeric, Ordinal    | Weekday alcohol consumption (1 = very low to 5 = very high) |
| 30 | Walc        | Numeric, Ordinal    | Weekend alcohol consumption (1 = very low to 5 = very high) |
| 31 | Freetime    | Numeric, Ordinal    | Free time after school (1 = very low to 5 = very high) |
| 32 | Famsize     | Character, Binary   | Family size (‘LE3’ <= 3, ‘GT3’ > 3 members) |
| 33 | Reason      | Character, Nominal  | Reason for choosing school (‘home’, ‘reputation’, ‘course’, ‘other’) |

## 1.7 - Related Research

The original collector of this dataset explored the data in-depth and intended to predict student achievement in secondary education using Business Intelligence/Data Mining techniques. This paper provides us with some insights into possible relationships we could further investigate.

The paper focuses on binary and five-level classification and compares a Naive Predictor with Decision Trees, Random Forests, Support Vector Machines, and Neural Networks (Cortez & Silva, 2008). These supervised machine learning methods can process a large amount of data and identify the key factors behind academic success. The primary purpose is to predict students’ final grades (G3) and to analyze how various factors contribute to academic success or failure. The paper reports that past failures, G1, and G2 are more influential than absences and school support. Mothers’ education also influences students’ grades, though less strongly. In addition, students’ drinking behavior is negatively correlated with grades (Cortez & Silva, 2008). Overall, this analysis offers valuable insights into the complex factors influencing student performance in Portugal: while academic history plays the most significant role, social behaviors and family background also contribute in meaningful ways.

Whereas the original paper by Cortez & Silva (2008) evaluated predictive accuracy across three tasks, namely binary classification (pass/fail), five-level classification (grade bands), and regression (numeric final grade G3), using the Portuguese and Mathematics data separately, the paper by Ali Khan (2020) took a different approach. It uses a two-variable decision tree to analyze student performance factors and ranks variable importance using MIC (Maximal Information Coefficient). Its focus is exploratory variable ranking and classification through a customized tree structure, prioritizing interpretability and policy implications and using the Portuguese and Math grades together (Ali Khan). Both papers provide meaningful exploration of and insight into the dataset. We aim to build on them by investigating new relationships between variables in the same dataset.

## References

Alentejo, Turismo do. “Visitalentejo.” Turismo Do Alentejo, www.visitalentejo.pt/en/. Accessed 21 May 2025.

Ali Khan, Yousaf. “Factors Influencing Secondary School Student’s Performance through Variable Decision Tree Data Mining Technique.” International Journal of Data Science and Analysis, vol. 6, no. 5, 2020, p. 120, https://doi.org/10.11648/j.ijdsa.20200605.11.

Cortez, P. and A. M. Gonçalves Silva. “Using data mining to predict secondary school student performance.” (2008).

“Overview.” Europa.eu, 2024, eurydice.eacea.ec.europa.eu/eurypedia/portugal/overview. Accessed 18 May 2025.

\newpage
# Section 2 - Data Cleaning

```{r, message = FALSE}
library(knitr)
library(tidyverse)
library(ggplot2)
library(stringr)
library(broom)
library(tidyr)
library(dplyr)
library(paletteer)
library(patchwork)
library(lmtest)
library(readr)
```

## 2.1 - Loading Data

First, we load our data set. The original data set contains two .csv files, one of Math grades and one of Portuguese grades, each with the same 33 variables introduced in Section 1.6.

```{r}
math_data <- read.csv2("student-mat.csv")
port_data <- read.csv2("student-por.csv")
```

## 2.2 - Exploring the Data Set

While the data set was already cleaned by its provider, who excluded missing values and records lacking identification, we still explore it first to see whether there are any other errors or outliers.

We plot the grade distributions of each subject and period first.

```{r}
math_data_longer <- math_data %>%
  pivot_longer(cols = G1:G3, names_to = "Period", values_to = "Grade")
port_data_longer <- port_data %>%
  pivot_longer(cols = G1:G3, names_to = "Period", values_to = "Grade")
```

```{r, message=FALSE}
math_data_longer %>%
  ggplot(aes(x=Grade, fill=Period))+
  geom_bar(position="dodge")+
  coord_cartesian(xlim = c(0, 20)) +
  labs(title="Math Grade Distribution by Period", x="Grade", y="Number of Students")+
  scale_fill_manual(values = as.character(paletteer::paletteer_d("MoMAColors::Klein")[c(2,1,7)]))+
  theme_bw()

port_data_longer %>%
  ggplot(aes(x=Grade, fill=Period))+
  geom_bar(position="dodge")+
  coord_cartesian(xlim = c(0, 20)) +
  labs(title="Portuguese Grade Distribution by Period",x="Grade", y="Number of Students")+
  scale_fill_manual(values = as.character(paletteer::paletteer_d("MoMAColors::Klein")[c(2,1,7)]))+
  theme_bw()
```

We notice that an increasing number of students have a 0 in G2 and G3 in both the Math and Portuguese classes, which is implausible as an actual grade. We therefore investigate why these values appear.

We first select the students who scored 0 in G1, G2, or G3.

```{r}
math_data %>%
  filter(G1 == 0 | G2 == 0 | G3 == 0) %>%
  select(G1, G2, G3) %>%
  arrange(G1, G2, G3) %>%  # sort by G1, breaking ties by G2 and then G3
  head(10)
port_data %>%
  filter(G1 == 0 | G2 == 0 | G3 == 0) %>%
  select(G1, G2, G3) %>%
  arrange(G1, G2, G3) %>%  # same ordering for the Portuguese records
  head(10)
```

We notice that every student who got a 0 in G2 also got a 0 in G3, which suggests that a 0 likely means the student dropped the class rather than earned that score. Therefore, in our further analysis, we should distinguish these students from those who took the actual test.

One exception is a student in the Portuguese class who received a 0 in G1 but 11 in both G2 and G3. It is possible that this student scored 0 in G1 and then improved, or that they missed the G1 test but took the G2 and G3 exams. For this reason, we do not exclude this record from our analysis.
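
The claim that a 0 in G2 always comes with a 0 in G3 can be checked directly rather than by reading the first ten rows. The sketch below shows one way to do so on made-up grades. `toy_grades` is hypothetical; the same expressions could be applied to `math_data` or `port_data`.

```{r}
# Hypothetical grades; the last three rows mimic students who stop being graded
toy_grades <- data.frame(
  G1 = c(12, 0, 8, 7, 6),
  G2 = c(13, 11, 0, 0, 5),
  G3 = c(13, 11, 0, 0, 0)
)

# TRUE if every student with a 0 in G2 also has a 0 in G3
g2_zero_implies_g3_zero <- all(toy_grades$G3[toy_grades$G2 == 0] == 0)
g2_zero_implies_g3_zero

# Cross-tabulate the two zero indicators to see all combinations at once
zero_table <- table(G2_zero = toy_grades$G2 == 0, G3_zero = toy_grades$G3 == 0)
zero_table
```

An empty cell for "0 in G2 but not in G3" is what supports reading these zeros as dropped classes.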

## 2.3 - Relationship between Dropping Class and Other Variables

We would like to see whether other variables are related to a student dropping the class.

For easier analysis, we create a combined table of Math and Portuguese grades with three new variables: the logical variables G2_zero and G3_zero, which indicate whether the student had dropped the class by G2 and G3, and subject, which indicates the class (Math or Portuguese) the grades belong to.

```{r}
# Flag students with a 0 in G2 or G3 and label each record with its subject
math_data <- math_data %>%
  mutate(G2_zero = G2 == 0, G3_zero = G3 == 0, subject = "Math")
port_data <- port_data %>%
  mutate(G2_zero = G2 == 0, G3_zero = G3 == 0, subject = "Portuguese")
combined_uncleaned <- bind_rows(math_data, port_data)
combined_uncleaned %>%
  select(G1, G2, G3, G2_zero, G3_zero) %>%
  head()
```
After some preliminary trials, we found that the following two variables are related to whether a student drops out.

- School Support
- Higher Education

**School Support**

- Null hypothesis: Whether a student drops the class (G3 = 0) is independent of whether they receive school support.

```{r}
library(tidyverse)
table_drop_schoolsup <- table(combined_uncleaned$G3_zero, combined_uncleaned$schoolsup)
print(table_drop_schoolsup)
fisher.test(table_drop_schoolsup)
```

- p-value = 0.0768. We fail to reject the null hypothesis at alpha = 0.05, but at a more lenient alpha = 0.10 the relationship is marginally significant. The odds ratio of 0.293 suggests that school support may reduce the odds of dropping out.
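
To make the odds ratio concrete, here is a minimal sketch on a made-up 2x2 table. The counts in `toy_counts` are hypothetical and are not the counts from our data. It computes the sample odds ratio by hand and compares it with the estimate from `fisher.test()`, which reports a conditional maximum likelihood estimate and so is usually close to, but not identical to, the hand calculation.

```{r}
# Hypothetical counts laid out like table(G3_zero, schoolsup)
toy_counts <- matrix(
  c(90, 10,   # schoolsup = "no":  90 stayed, 10 dropped
    95,  5),  # schoolsup = "yes": 95 stayed,  5 dropped
  nrow = 2,
  dimnames = list(G3_zero = c("FALSE", "TRUE"), schoolsup = c("no", "yes"))
)

# Odds of dropping = dropped / stayed, computed within each support group
odds_no  <- toy_counts["TRUE", "no"]  / toy_counts["FALSE", "no"]
odds_yes <- toy_counts["TRUE", "yes"] / toy_counts["FALSE", "yes"]

# Sample odds ratio: values below 1 mean lower dropout odds with support
sample_or <- odds_yes / odds_no
sample_or

# Conditional maximum likelihood estimate reported by fisher.test()
fisher_or <- unname(fisher.test(toy_counts)$estimate)
fisher_or
```

With this layout, an odds ratio below 1 means the "yes" column has lower odds of `G3_zero` being TRUE than the "no" column.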

**Higher Education**

```{r}
library(tidyverse)
table_drop_higher <- table(combined_uncleaned$G3_zero, combined_uncleaned$higher)
print(table_drop_higher)
fisher.test(table_drop_higher)
```

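- p-value < 0.05. We reject the null hypothesis that dropping the class is independent of planning for higher education.
- There is a strong relationship between whether the student aims for higher education and whether they drop the class: with an odds ratio of 0.32, students who aim for higher education have lower odds of dropping.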

## 2.4 - Data Cleaning Conclusion

Some students have a 0 in their G2 and G3 grades, which we deduce is due to dropping the class. We therefore exclude these records from our analysis of variables and grades in later sections.

We create four cleaned data sets that exclude students who dropped out. The first two contain the Math grades and the Portuguese grades respectively.

```{r}
math_data_cleaned <- math_data %>%
  filter(G2 != 0 & G3 != 0)
port_data_cleaned <- port_data %>%
  filter(G2 != 0 & G3 != 0)
```

We create a third data frame that combines these two data sets, with a column named "subject".

```{r}
math_data_cleaned$subject <- "Math"
port_data_cleaned$subject <- "Portuguese"
combined_data <- bind_rows(math_data_cleaned, port_data_cleaned)
combined_data %>%
  select(age, sex, subject, G1, G2, G3) %>%
  head()
```
Furthermore, for coding convenience, we create a fourth data frame that is pivoted longer: it puts the G1, G2, and G3 grades into a single "Grade" column and adds a "Period" column and a letter-grade "GradeGroup" column.

```{r}
longer_data <- combined_data %>%
  pivot_longer(cols = G1:G3, names_to = "Period", values_to = "Grade") %>%
  mutate(GradeGroup = case_when(
    Grade >= 16 ~ "A",           # Excellent/Very Good
    Grade >= 14 ~ "B",           # Good
    Grade >= 12 ~ "C",           # Satisfactory
    Grade >= 10 ~ "D",           # Sufficient
    TRUE        ~ "F"            # Fail
    ))
longer_data %>%
  select(age, sex, subject, Period, Grade, GradeGroup) %>%
  head()
```

In the following sections, we conduct an exploratory data analysis of various variables and academic performance.

\newpage
# Section 3 - Demographic

## 3.1 - Age

### Age Distribution

```{r}
data_19plus <- combined_data%>%
  mutate(age_group = ifelse(age >= 19, "19+", as.character(age)))

summary1 <- data_19plus %>%
  group_by(age_group) %>%
  count()

summary1

ggplot(data_19plus, aes(x = age_group)) +
  geom_bar() +
  labs(title = "Age Distribution",
       x = "Age",
       y = "Count")+
  theme_bw()

```

1. Ages 16 and 17 are the most common.
2. Ages 15 and 18 have moderate representation.
3. The 19+ group has the fewest students.

### Q1: How does the proportion of letter grades vary by age group across periods?

- Null Hypothesis: The proportion of letter grades does not differ by age group across periods.

```{r,warning=FALSE}
longer_data_19plus <- longer_data %>%
  mutate(age_group = ifelse(age >= 19, "19+", as.character(age)))

ggplot(longer_data_19plus, aes(x = age_group, fill = GradeGroup)) +
  geom_bar(position = "fill") +
  facet_grid(Period ~ subject) +
  labs(title = "Proportion of Letter Grade by Age",
       x = "Age", y = "Proportion",
       fill = "Letter Grade") +
  scale_fill_paletteer_d("PrettyCols::Beach", labels = c(
    "A (16-20)",
    "B (14-15)",
    "C (12-13)",
    "D (10-11)",
    "F (0 - 9)")) +
  theme_bw()

longer_data_19plus %>%
  filter(subject == "Math", Period == "G1") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()

longer_data_19plus %>%
  filter(subject == "Portuguese", Period == "G1") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()

longer_data_19plus %>%
  filter(subject == "Math", Period == "G2") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()

longer_data_19plus %>%
  filter(subject == "Portuguese", Period == "G2") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()

longer_data_19plus %>%
  filter(subject == "Math", Period == "G3") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()

longer_data_19plus %>%
  filter(subject == "Portuguese", Period == "G3") %>%
  with(table(GradeGroup, age_group)) %>%
  chisq.test()
```

**Visual Observation**

As age increases, the proportion of lower letter grades increases in both classes across all periods.

**Statistical Test (Chi-squared)**

- Portuguese: the p-values (1.501e-05, 0.001157, 0.002147) are all < 0.05, so we reject H0; the proportion of letter grades differs by age group in every period.
- Math: the p-values (0.6684, 0.4071, 0.5002) are all > 0.05, so we fail to reject H0; we find no evidence that the proportion of letter grades differs by age group in any period.
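
Because six tests address the same question, one optional check is whether the conclusions survive a multiple-comparison adjustment. The sketch below applies a Holm adjustment to the six p-values reported above. The adjustment is our addition rather than part of the original analysis, and the labels assume the p-values are listed in the order G1, G2, G3.

```{r}
# p-values reported above; the period labels assume the order G1, G2, G3
p_values <- c(
  Portuguese_G1 = 1.501e-05, Portuguese_G2 = 0.001157, Portuguese_G3 = 0.002147,
  Math_G1 = 0.6684, Math_G2 = 0.4071, Math_G3 = 0.5002
)

# Holm adjustment for running six tests on the same question
p_holm <- p.adjust(p_values, method = "holm")
round(p_holm, 5)

# Tests that remain significant at alpha = 0.05 after adjustment
names(p_holm)[p_holm < 0.05]
```

Under these assumptions, the three Portuguese tests remain below 0.05 and the three Math tests do not, so the adjustment does not change the conclusions.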

**Answer**

This indicates that age is associated with letter grade outcomes in Portuguese at each time point: older student groups tend to receive a smaller share of high grades, and the 19+ group consistently performs the worst across all periods in both classes. For Math, however, we find no significant association between age and letter grade.

## 3.2 - Address

### Address Distribution

```{r}
summary2 <- combined_data%>%
  group_by(address) %>%
  count()

summary2
```

The number of urban students is almost three times the number of rural students.

### Q1: How does letter grade distribution vary between urban and rural students across different periods and subjects?

- Null Hypothesis: There is no difference in the distribution of letter grades between urban and rural students, and this distribution is the same across all periods and subjects.

```{r}
ggplot(longer_data, aes(x = address, fill = GradeGroup)) +
  geom_bar(position = "fill") +
  facet_grid(Period ~ subject) +
  labs(title = "Letter Grade Proportions by Address, Period, and Subject",
       fill = "Letter Grade",
       x = "Address",
       y = "Proportion") +
  scale_fill_paletteer_d("PrettyCols::Beach", labels = c(
    "A (16-20)",
    "B (14-15)",
    "C (12-13)",
    "D (10-11)",
    "F (0 - 9)")) +
  theme_bw()+
  scale_x_discrete(labels = c("R" = "Rural", "U" = "Urban"))

longer_data %>%
  filter(subject == "Math") %>%
  with(table(GradeGroup, address)) %>%
  chisq.test()

longer_data %>%
  filter(subject == "Portuguese") %>%
  with(table(GradeGroup, address)) %>%
  chisq.test()
```

**Visual Observation**

Overall, rural students seem to obtain lower letter grades than urban students in both classes and across all periods.

**Statistical Test (Chi-squared)**

1. Portuguese: p-value = 3.012e-09 < 0.05, so we reject H0; address is significantly associated with letter grade in Portuguese class.
2. Math: p-value = 0.002299 < 0.05, so we reject H0; address is significantly associated with letter grade in Math class.
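
The two tests above pool all three periods, so each student contributes up to three rows to the same table. A per-period version, which keeps one observation per student in each test, could look like the sketch below. The counts in `toy_counts_long` are hypothetical; in the report the same pattern could be applied to `longer_data` after filtering by subject.

```{r}
# Hypothetical counts for each Period x address x GradeGroup combination
toy_counts_long <- expand.grid(
  Period     = c("G1", "G2", "G3"),
  address    = c("R", "U"),
  GradeGroup = c("A", "C", "F")
)
toy_counts_long$n <- c(4, 5, 4, 12, 11, 12,   # A: rural G1-G3, then urban G1-G3
                       8, 8, 9, 10, 10, 9,    # C
                       8, 7, 7, 8, 9, 9)      # F

# Expand the counts into one row per student and period
toy_long <- toy_counts_long[rep(seq_len(nrow(toy_counts_long)),
                                toy_counts_long$n), ]

# Run one chi-squared test per period and collect the p-values
p_by_period <- sapply(
  split(toy_long, toy_long$Period),
  function(d) chisq.test(table(d$GradeGroup, d$address))$p.value
)
p_by_period
```

Reporting one p-value per period would let a claim about every period rest on a test rather than on the plot alone.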

**Answer**

Address is associated with letter grade in both subjects. With the three periods pooled, there is a significant difference in letter grades between urban and rural students in each class, and the plots suggest the gap is present in every period. In other words, rural students tend to perform worse than urban students in both subjects throughout the year. This made us curious about whether the reason for attending the school is related to this difference.

### Q2: How does the proportion of rural and urban students vary across reason to attend school?

- Null Hypothesis: There is no association between student address (urban vs. rural) and reason for attending school; the proportion of rural and urban students is the same across all reasons.

```{r}
ggplot(combined_data, aes(x = reason, fill = address)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of Rural vs. Urban Students by Reason",
       x = "Reason",
       y = "Proportion",
       fill = "Address") +
  scale_fill_discrete(labels = c(
    "R" = "Rural",
    "U" = "Urban")) +
  scale_x_discrete(labels = c(
    "course" = "Course Preference",
    "home" = "Close to Home",
    "other" = "Other Reason",
    "reputation" = "School Reputation"))+
  theme_bw()


table2_data <- table(combined_data$reason, combined_data$address)
chisq.test(table2_data)
```

**Visual Observation**

1. The reason with the largest proportion of urban students is closeness to home, followed by school reputation.

2. The reason with the largest proportion of rural students is "other", followed by course preference.

**Statistical Test (Chi-squared)**

Chi-squared p-value = 4.195e-05 < 0.05, reject H0; reason and address are dependent.

**Answer**

Address may influence grades over time by shaping students' motivation for attending school, which could indirectly affect their academic performance even when they attend the same school.

The significant association between address and reason for attending school (p < 0.001) suggests that rural students more often cite course preference or other reasons, while urban students more often choose a school for its proximity or reputation.

Although they attend the same schools, these differences in motivation may reflect underlying disparities in academic preparation, expectations, or external support, which could help explain why rural students tend to perform worse. For instance, urban students can choose a school close to home, which suggests they live closer to the better study resources available in the city. This supports the idea that address is related to student performance over time, not because of school differences, but because of differences in student context and reason for enrollment.

## 3.3 - Sex

### Sex Distribution

```{r}
summary3 <- combined_data%>%
  group_by(sex) %>%
  count()

summary3
```

There are more female students than male students.

### Q1: How does student average grade over time differ by sex, and does this pattern vary between Math and Portuguese?

- Null Hypothesis: There is no difference in average grade by sex in either subject.

```{r}
avg_grade_sex <- longer_data %>%
  group_by(sex, Period, subject) %>%
  summarize(average_grade = mean(Grade), .groups = "drop")

ggplot(avg_grade_sex, aes(x = Period, y = average_grade, fill = sex)) +
  geom_col(position = "dodge") +
  facet_wrap(~subject) +
  labs(title = "Average Grade by Sex Across Periods",
       x = "Period",
       y = "Average Grade",
       fill = "Sex") +
  scale_fill_discrete(labels = c(
    "F" = "Female",
    "M" = "Male"))+
  theme_bw()+
  ylim(0,20)

sex_test1 <- aov(Grade ~ sex * Period * subject, data = longer_data)
summary(sex_test1)
```

**Visual Observation**

On average, male students perform better in math class than female students, and vice versa in Portuguese class.

**Statistical Test (ANOVA)**

1. sex: p-value = 0.09642 > 0.05, fail to reject H0; sex alone does not significantly affect grade
2. sex:subject: p-value = 7.8e-09 < 0.05, reject H0; the effect of sex on grade differs significantly by subject

**Answer**

Although sex alone is not a significant predictor of grade (p-value = 0.09642), the sex:subject interaction (p-value = 7.8e-09) is highly significant, meaning the effect of sex on grade depends on the subject. Gender-based performance differences are therefore subject-specific, which matches the plot: male students on average perform better in math class and worse in Portuguese class.

## Demographic Section Conclusion

There are three main variables in demographics: age (15-19+), address (urban & rural), and sex (male & female).

Age: Older student groups consistently receive higher proportions of high grades. The only exception is the 19+ age group, which consistently performs the worst across all periods for both classes. However, for math, students' performance isn't impacted by their age.

Address: In both classes, there is a significant difference in letter grade between urban and rural students, and the plots show the same pattern in every period. This means rural students consistently perform worse than urban students in both subjects throughout the year. Also, the different reasons for attending school may reflect underlying disparities between students from different addresses.

Sex: Gender-based performance differences are subject-specific, which is confirmed by the plot where male students on average perform better in math class and worse in Portuguese class.

\newpage
# Section 4 - Social Support

## EDA Overview: Social Support Variables
In this section, we explore select variables associated with "social support." These capture family, school, and external support systems that may assist or influence a student's academic success. Additionally, we explore how the social support variables may interact with demographic variables, such as age and sex.

## Data Wrangling & Summary Statistics
```{r}
df <- read.csv("combined_data_cleaned.csv")

df_ss <- df |> select(age,sex,higher,famrel,G1,G2,G3,subject)

df_ss <- df_ss |> mutate( higher = ifelse(higher=="yes",1,0),
                          male = ifelse(sex=="M",1,0),
                          age_group = ifelse(age >= 19, "19+", as.character(age)) )

# Note: this overwrites the earlier longer_data with a two-level GradeGroup
longer_data <- df %>%
  pivot_longer(cols = G1:G3, names_to = "Period", values_to = "Grade") %>%
  mutate(GradeGroup = case_when(
    Grade >= 16 ~ "A",     # A equivalent (16-20)
    Grade < 16 ~ "Not A"   # everything below an A
  ))

```

```{r, warning=F}
mean_sum <- df_ss |> select(-subject) |>
  summarize(across(everything(), ~ signif(mean(.x, na.rm = TRUE), 4)))

kable(mean_sum|>select(famrel,higher),caption="Mean Values for Famrel and Higher")
```

## Section 1: Family Relationship

The *famrel* variable measures the quality of a student's family relationships (numeric: from 1 – very bad to 5 – excellent).

```{r}
ggplot(df_ss, aes(x=famrel))+geom_histogram(binwidth = 1,color="black",fill="pink",alpha=.8)+
  labs(title="Distribution of Family Relation Scores")+
  theme_bw()
```

From the histogram, we see that $famrel$ is left-skewed (most scores are 4 or 5, with a tail toward the low scores), and from our summary statistic table, the mean $famrel$ score is 3.93.

### Q1: Does family relationship quality affect grades?

```{r}
g1 <- ggplot(df_ss, aes(x = famrel, y = G1)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  geom_hline(yintercept = 16) +
  coord_cartesian(ylim = c(0, 20)) +
  labs(x = "Family Relation Score", y = "G1 Grade")

g2 <- ggplot(df_ss, aes(x = famrel, y = G2)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  geom_hline(yintercept = 16) +
  coord_cartesian(ylim = c(0, 20)) +
  labs(
    x = "Family Relation Score",
    y = "G2 Grade",
    title = "Family Relation vs Grades"
  )

g3 <- ggplot(df_ss, aes(x = famrel, y = G3)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  geom_hline(yintercept = 16) +
  coord_cartesian(ylim = c(0, 20)) +
  labs(x = "Family Relation Score", y = "G3 Grade")

g1 + g2 + g3 + plot_layout(nrow = 1)

ggplot(df_ss,aes(x=famrel,y=G1))+
  geom_jitter(alpha=0.2,height=NULL) + geom_hline(yintercept=16)+
  facet_wrap(~age_group)+
  labs(x= "Family Relation Score", y = "G1 Grade",
       title="Family Relation vs G1 Score, by Age Group")

ggplot(df_ss,aes(x=famrel,y=G2))+
  geom_jitter(alpha=0.2,height=NULL) + geom_hline(yintercept=16)+
  facet_wrap(~age_group)+
  labs(x= "Family Relation Score", y = "G2 Grade",
       title="Family Relation vs G2 Score, by Age Group")

ggplot(df_ss,aes(x=famrel,y=G3))+
  geom_jitter(alpha=0.2,height=NULL) + geom_hline(yintercept=16)+
  facet_wrap(~age_group)+
  labs(x= "Family Relation Score", y = "G3 Grade",
       title="Family Relation vs G3 Score, by Age Group")


```
```{r}
longer_data_19plus <- longer_data %>%
mutate(famrel_4 = factor(ifelse(famrel>=4, 1,0)),age_group = ifelse(age >= 19, "19+", as.character(age))) |>
  filter(Period == "G3")
ggplot(longer_data_19plus, aes(x = famrel_4, fill = GradeGroup)) +
geom_bar(position = "fill") +
scale_x_discrete(labels=c("Famrel < 4","Famrel >= 4"))+
labs(title = "Proportion of G3 Letter Grade by Famrel",
x = "Family Relation Score", y = "Proportion",
fill = "G3 Grade") +
theme_bw()
```


```{r}
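# Step 1: Create famrel group
df_ss <- df_ss %>%
  mutate(famrel_group = ifelse(famrel >= 4, ">= 4", "< 4"))

# Step 2: Plot the mean G3 per group
# (geom_col on the raw rows would overplot one bar per student)
ggplot(df_ss, aes(x = age_group, y = G3, fill = famrel_group)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge") +
  labs(
    x = "Age Group",
    y = "Mean G3 Grade",
    fill = "Family Relation Score",
    title = "Mean G3 Score by Age Group and Family Relation Score"
  ) +
  theme_minimal()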


```
From the scatter plots, we see that across all test periods (G1-G3), those with famrel scores of 3 or more have more grades of 16 or above (an A equivalent). Faceting by age group shows a similar trend, especially for students aged 18 or under. It is possible that having a good family relationship increases the likelihood of receiving an A. The next question is how to define a "good" family relationship.

```{r}
df_ss_g1 <- df_ss |> mutate(gradeA = G1 >= 16, goodFamrel = famrel >= 4)
chisq_test <- chisq.test(table(df_ss_g1$gradeA, df_ss_g1$goodFamrel))
chisq_test

df_ss_g2 <- df_ss |> mutate(gradeA = G2 >= 16, goodFamrel = famrel >= 4)
chisq_test <- chisq.test(table(df_ss_g2$gradeA, df_ss_g2$goodFamrel))
chisq_test

df_ss_g3 <- df_ss |> mutate(gradeA = G3 >= 16, goodFamrel = famrel >= 4)
chisq_test <- chisq.test(table(df_ss_g3$gradeA, df_ss_g3$goodFamrel))
chisq_test
```

We find that when defining a "good" family relationship to be $famrel \ge 4$, at the 5% level we reject $H_0$ of the Chi-squared test for period G3 only, so there is a relationship between (1) getting an A and (2) having a good family relationship in that period. Notably, we fail to reject $H_0$ for G1 and G2. We continue the exploration using this definition: goodFamrel = ($famrel \ge 4$).

```{r}
famrel_prop <- df_ss_g3 |> group_by(goodFamrel) |>
  count(gradeA) |>
  mutate(prop_4_A = n / sum(n))
kable(famrel_prop)

x <- c(18, 104)
n_total <- c(230, 761)
prop.test(x = x, n = n_total, correct = FALSE)
```

In G3, the percentage of students with As is 13.67% for those with a good family relationship, whereas it is 7.83% for those without one. Running a two-proportion z-test, we reject $H_0$; the 5.84 percentage-point difference in the proportions of students getting As is statistically significant.
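
To make the two-proportion z-test concrete, the sketch below recomputes it by hand from the counts passed to `prop.test()` above and checks that the result agrees with `prop.test()`. The helper names (`p_hat`, `p_pool`, `se`, `z`, `p_val`) are introduced here for illustration only.

```r
# Counts taken from the prop.test() call above
x <- c(18, 104)    # students with an A in G3: famrel < 4, famrel >= 4
n <- c(230, 761)   # group sizes

p_hat  <- x / n                  # sample proportions
p_pool <- sum(x) / sum(n)        # pooled proportion under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- (p_hat[2] - p_hat[1]) / se
p_val  <- 2 * pnorm(-abs(z))     # two-sided p-value

# prop.test() without continuity correction reports X-squared = z^2
pt <- prop.test(x = x, n = n, correct = FALSE)
c(z = z, z_squared = z^2, X_squared = unname(pt$statistic), p_value = p_val)
```

The pooled proportion is used for the standard error because, under $H_0$, both groups share the same probability of getting an A.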

## Section 2: Students' Desire to Pursue Higher Education

$Higher$ is a binary variable indicating whether or not the student intends to pursue higher education after high school. We hypothesize that wanting to attend higher education will increase G3 grades in school.

From the earlier table of means, we know that 92.13% of the students want to attend higher education (i.e. higher = 1). Does this differ with sex or age?

### Q1: Does Wanting to Pursue Higher Education Affect Grades?

```{r}
higher_prop <- df_ss_g1 |> group_by(higher) |> count(gradeA) |> mutate(prop_higher_A = n / sum(n))
kable(higher_prop,caption="Contingency Table: Higher vs A-Proportion (G1)")

tbl <- matrix(c(78, 0, 826, 87), nrow = 2, byrow = TRUE)
rownames(tbl) <- c("higher = 0", "higher = 1")
colnames(tbl) <- c("not_A", "A")
fisher.test(tbl)

higher_prop <- df_ss_g2 |> group_by(higher) |> count(gradeA) |> mutate(prop_higher_A = n / sum(n))
kable(higher_prop,caption="Contingency Table: Higher vs A-Proportion (G2)")

tbl <- matrix(c(78, 0, 820, 93), nrow = 2, byrow = TRUE)
rownames(tbl) <- c("higher = 0", "higher = 1")
colnames(tbl) <- c("not_A", "A")
fisher.test(tbl)

higher_prop <- df_ss_g3 |> group_by(higher) |> count(gradeA) |> mutate(prop_higher_A = n / sum(n))
kable(higher_prop,caption="Contingency Table: Higher vs A-Proportion (G3)")

tbl <- matrix(c(78, 0, 791, 122), nrow = 2, byrow = TRUE)
rownames(tbl) <- c("higher = 0", "higher = 1")
colnames(tbl) <- c("not_A", "A")
fisher.test(tbl)
```
First, we want to determine whether higher is associated with getting an A. Running a Fisher's exact test for each period, we reach the same conclusion in all three: we reject $H_0$. Because no student with $higher = 0$ earned an A, the estimated odds ratio is infinite, so the association between the two variables is very strong.
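
The contingency tables above are typed in by hand. As a sketch of an alternative, the table can be built from counts with `xtabs()`, which avoids transcription errors. The data frame `counts` below is introduced for illustration and reuses the G1 counts from the hard-coded matrix.

```r
# G1 counts copied from the hard-coded matrix above
counts <- data.frame(
  higher = c(0, 0, 1, 1),
  gradeA = c(FALSE, TRUE, FALSE, TRUE),
  n      = c(78, 0, 826, 87)
)

# xtabs() builds the 2x2 table from counts; with row-level data,
# table(df_ss_g1$higher, df_ss_g1$gradeA) would give the same layout
tbl <- xtabs(n ~ higher + gradeA, data = counts)
res <- fisher.test(tbl)

res$p.value    # below 0.05: reject H0 of no association
res$estimate   # Inf, because the (higher = 0, A) cell is zero
```

The infinite estimate reflects the zero cell; it does not mean the effect is unboundedly large.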

### Q2: Does the effect of the variable "higher" on grades depend on student sex?

```{r,message=F}
higher_hist <- ggplot(df_ss,aes(x=higher))+
  geom_bar(fill="pink",color="black") + facet_wrap(~sex) +
  labs(title="Distribution of Higher, by Sex",
       x="Higher", y="Count") + theme_bw()
jitter_g1 <- ggplot(df_ss,aes(x=higher,y=G1))+geom_jitter(alpha=0.2,height=NULL) +
  facet_wrap(~sex) + geom_hline(yintercept=16) +
  labs(title="Scatterplot of Higher vs G1 Score",x="Higher") +theme_bw()

jitter_g2 <- ggplot(df_ss,aes(x=higher,y=G2))+geom_jitter(alpha=0.2,height=NULL) +
  facet_wrap(~sex) + geom_hline(yintercept=16) +
  labs(title="Scatterplot of Higher vs G2 Score",x="Higher") +theme_bw()

jitter_g3 <- ggplot(df_ss,aes(x=higher,y=G3))+geom_jitter(alpha=0.2,height=NULL) +
  facet_wrap(~sex) + geom_hline(yintercept=16) +
  labs(title="Scatterplot of Higher vs G3 Score",x="Higher") +theme_bw()

higher_hist + jitter_g1 + jitter_g2 + jitter_g3 + plot_layout(ncol=2)
```
From the jittered scatterplots, for both sexes, no one with $higher = 0$ got an A (in any period), so it is unlikely that the effect of $higher$ on $G3$ differs based on gender.

```{r}

# G1
error_g1 <- ggplot(df_ss, aes(x = factor(higher), y = G1, color = sex)) +
  stat_summary(fun = mean, geom = "point", position = position_dodge(0.2)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               width = 0.2, position = position_dodge(0.2)) +
  coord_cartesian(ylim = c(8, 12)) +   # <-- set y-axis scale
  labs(x = "Higher", y = "G1") +
  theme_bw()

# G2
error_g2 <- ggplot(df_ss, aes(x = factor(higher), y = G2, color = sex)) +
  stat_summary(fun = mean, geom = "point", position = position_dodge(0.2)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               width = 0.2, position = position_dodge(0.2)) +
  coord_cartesian(ylim = c(8, 12)) +   # <-- set y-axis scale
  labs(x = "Higher", y = "G2") +
  theme_bw()

# G3
error_g3 <- ggplot(df_ss, aes(x = factor(higher), y = G3, color = sex)) +
  stat_summary(fun = mean, geom = "point", position = position_dodge(0.2)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               width = 0.2, position = position_dodge(0.2)) +
  coord_cartesian(ylim = c(8, 12)) +   # <-- set y-axis scale
  labs(x = "Higher", y = "G3") +
  theme_bw()

# Combine into one layout; plot_annotation() puts the title over the whole figure
error_g1 + error_g2 + error_g3 +
  plot_layout(ncol = 2) +
  plot_annotation(title = "Mean G1, G2, G3 Scores by Higher Education Aspiration and Gender")
```

Since the error bars for mean score by gender overlap (for both values of $higher$), we do not have enough evidence to say that the effect of $higher$ on G1, G2, and G3 varies based on sex. This confirms our initial suspicion.
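
Overlapping error bars are only a rough guide. One more formal way to ask whether the effect of `higher` differs by sex would be to fit a linear model with a `higher:sex` interaction term and inspect its p-value. This was not part of the analysis above. The sketch below uses simulated data (`toy`, hypothetical values) with no built-in interaction, purely to show the mechanics.

```r
# Simulated stand-in for df_ss (hypothetical values, not the project data)
set.seed(42)
toy <- data.frame(
  higher = rbinom(300, 1, 0.9),
  sex    = sample(c("F", "M"), 300, replace = TRUE)
)
# Grades depend on higher only, so no true higher-by-sex interaction exists
toy$G3 <- pmin(20, pmax(0, round(9 + 2 * toy$higher + rnorm(300, sd = 3))))

# higher * sex expands to higher + sex + higher:sex
fit <- lm(G3 ~ higher * sex, data = toy)
interaction_p <- summary(fit)$coefficients["higher:sexM", "Pr(>|t|)"]
interaction_p
```

Applied to `df_ss`, a large p-value for the interaction term would be consistent with the conclusion drawn from the error bars.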

### Q3: Does proportion of students wanting to pursue higher education change with age?

We would also like to explore if proportion of students with $higher = 1$ decreases with age. This is a different exploration that focuses on the interaction between a social and demographic factor, rather than just focusing on grades.

```{r}
df_higher_prop <- df_ss |> group_by(age_group) |>
  count(higher) |>  mutate(prop = n / sum(n)) #|> filter(higher==1)
kable(df_higher_prop |> filter(higher == 1) |> select(-higher),
      caption = "Proportion of Students with higher = 1, by Age Group")
```

The table shows that the proportion of students who want to attend higher education decreases with age, and the graph below provides a way to visualize this negative correlation.

```{r}
df_higher_prop <- df_higher_prop |> mutate(higher = factor(higher,levels=c(0,1)))

ggplot(df_higher_prop, aes(x = age_group, y = prop, fill = factor(higher), group = 1)) +
  geom_col(position=position_stack(reverse=F)) +
  labs(title = "Proportion of Students Wanting to Pursue Higher Ed, by Age",
       x = "Age Group", y = "Proportion",
       fill = "Higher Ed Aspiration") +
  theme_bw()
```
From the graph above, we can visualize the negative correlation between $age$ and $higher$. As age increases, higher education aspiration decreases.
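
A single number for this trend can be obtained by correlating age with the 0/1 `higher` indicator (a point-biserial correlation). The sketch below uses a small made-up data frame (`toy`) in which aspiration falls with age. Applying the same call to `df_ss$age` and `df_ss$higher` is one way such a coefficient could be computed for the project data.

```r
# Made-up data: 10 students per age, fewer with higher = 1 at older ages
toy <- data.frame(
  age    = rep(15:19, each = 10),
  higher = c(rep(1, 10), rep(1, 10),
             rep(c(1, 0), c(9, 1)),
             rep(c(1, 0), c(8, 2)),
             rep(c(1, 0), c(6, 4)))
)

# Proportion with higher = 1 at each age
prop_by_age <- tapply(toy$higher, toy$age, mean)
prop_by_age

# Pearson correlation between age and the 0/1 indicator
r <- cor(toy$age, toy$higher)
r
```

Because `higher` is binary and most students have `higher = 1`, the coefficient tends to be small in magnitude even when the proportions fall steadily with age.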

## Social Support Section Conclusion

Exploring $famrel$ and $higher$ from the social support variable category, we find that the proportion of students getting As in G3 is almost 6 percentage points higher when the family relationship is good. We also find that a desire to pursue higher education is associated with getting an A in all grade periods, but there is no gender-based difference in this "effect." Finally, we find that the proportion of students wanting to pursue higher education decreases with age, with a negative correlation of around -0.233. This aligns with a finding in the Demographic section: older students get lower grades, particularly in Portuguese.

\newpage
# Section 5 - Behavioural Variables

## 5.1 - Traveltime, Studytime, Freetime, Goout

There are 10 variables in this section, measured on different scales, so they are divided into four groups. The first group is time-related variables, including traveltime, studytime, freetime, and goout. The second group focuses on alcohol consumption, with Dalc and Walc. The third group contains activities, which is a binary variable (yes/no). The remaining three variables (health, absences, and failures) form the fourth group. We will analyze each group separately.


### Q1: How do studytime, traveltime, freetime, and goout affect students’ grades in Portuguese and Mathematics across G1–G3?

- Null Hypothesis: There is no significant difference in students' letter grades across different levels of traveltime, studytime, freetime, and goout.

**Traveltime, Studytime, Freetime, Goout Distribution**

```{r, message=FALSE}
g1 <- read_csv("port_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G1) |>
  mutate(across(-G1, as.character)) |>
  pivot_longer(-G1, names_to = "Variable", values_to = "Value")

ggplot(g1, aes(x = as.factor(Value), y = G1)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Portuguese G1 vs traveltime, studytime, freetime, goout", x = NULL, y = "First Period Grade") +
  theme_bw()

g2 <- read_csv("port_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G2) |>
  mutate(across(-G2, as.character)) |>
  pivot_longer(-G2, names_to = "Variable", values_to = "Value")

ggplot(g2, aes(x = as.factor(Value), y = G2)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Portuguese G2 vs traveltime, studytime, freetime, goout", x = NULL, y = "Second Period Grade") +
  theme_bw()

g3 <- read_csv("port_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G3) |>
  mutate(across(-G3, as.character)) |>
  pivot_longer(-G3, names_to = "Variable", values_to = "Value")

ggplot(g3, aes(x = as.factor(Value), y = G3)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs( title = "Portuguese G3 vs traveltime, studytime, freetime, goout", x = NULL,  y = "Final Grade") +
  theme_bw()

g1 <- read_csv("math_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G1) |>
  mutate(across(-G1, as.character)) |>
  pivot_longer(-G1, names_to = "Variable", values_to = "Value")

ggplot(g1, aes(x = as.factor(Value), y = G1)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Math G1 vs traveltime, studytime, freetime, goout", x = NULL, y = "First Period Grade") +
  theme_bw()

g2 <- read_csv("math_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G2) |>
  mutate(across(-G2, as.character)) |>
  pivot_longer(-G2, names_to = "Variable", values_to = "Value")

ggplot(g2, aes(x = as.factor(Value), y = G2)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Math G2 vs traveltime, studytime, freetime, goout", x = NULL, y = "Second Period Grade") +
  theme_bw()

g3 <- read_csv("math_data_cleaned.csv") |>
  select(traveltime, studytime, freetime, goout, G3) |>
  mutate(across(-G3, as.character)) |>
  pivot_longer(-G3, names_to = "Variable", values_to = "Value")

ggplot(g3, aes(x = as.factor(Value), y = G3)) +
  geom_jitter(alpha = 0.2, height = NULL) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs( title = "Math G3 vs traveltime, studytime, freetime, goout", x = NULL,  y = "Final Grade") +
  theme_bw()

```


**Visual Observation**

studytime: Students who study more tend to have higher scores, especially those at level 3 or level 4. This trend is more clearly observed in G1 and G3.

traveltime: Travel time appears to be slightly related to grades.

freetime: Free time also seems to have a slight relationship with grades.

goout: Students who go out more tend to have lower grades. In particular, those who go out very frequently often score below 10.

```{r, message=FALSE,warning=FALSE}
grade_group <- function(score) {
  case_when(
    score >= 16 ~ "A",
    score >= 14 ~ "B",
    score >= 12 ~ "C",
    score >= 10 ~ "D",
    TRUE        ~ "F"
  )
}

prepare_data <- function(file) {
  read_csv(file) %>%
    pivot_longer(cols = G1:G3, names_to = "Period", values_to = "Grade") %>%
    mutate(
      GradeGroup = grade_group(Grade),
      traveltime = as.character(traveltime),
      studytime = as.character(studytime),
      freetime = as.character(freetime),
      goout = as.character(goout)
    )
}

run_tests <- function(df, subject) {
  cat(paste0("======== ", subject, " ========\n"))
  for (v in c("traveltime", "studytime", "freetime", "goout")) {
    cat("\n==========", v, "==========\n")
    for (p in c("G1", "G2", "G3")) {
      temp <- df %>% filter(Period == p)
      tbl <- table(temp[[v]], temp$GradeGroup)
      test <- chisq.test(tbl)
      cat(v, "vs", p, ": p-value =", signif(test$p.value, 4), "\n")
    }
  }
}

port <- prepare_data("port_data_cleaned.csv")
math <- prepare_data("math_data_cleaned.csv")

run_tests(port, "Portuguese")
run_tests(math, "Math")


```
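
The chunk above sets `warning=FALSE`. If any of the silenced warnings are chi-squared approximation warnings (an assumption, since the output is not shown here), some expected cell counts are probably below 5. The sketch below shows how one might check this and fall back on a Monte Carlo p-value, using a made-up sparse table (`tbl`, hypothetical counts).

```r
# Made-up sparse table: rows = level of a time variable, columns = letter grade
tbl <- matrix(
  c(12, 9, 14, 10, 20,
     8, 7, 10,  9, 15,
     2, 1,  3,  2,  4,
     1, 0,  2,  1,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(level = c("1", "2", "3", "4"),
                  grade = c("A", "B", "C", "D", "F"))
)

# Expected counts under independence; cells below 5 make the approximation shaky
expected <- suppressWarnings(chisq.test(tbl))$expected
sum(expected < 5)

# Monte Carlo p-value that does not rely on the large-sample approximation
set.seed(123)
sim <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
sim$p.value
```

Inside `run_tests()`, the same `simulate.p.value = TRUE` argument could be passed to `chisq.test(tbl)` if sparse cells turn out to be a concern.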

**Statistical Test (Chi-Square Test) — Mathematics**

studytime

- G1: p = 0.1214 > 0.05 -> fail to reject H0
- G2: p = 0.1640 > 0.05 -> fail to reject H0
- G3: p = 0.1413 > 0.05 -> fail to reject H0

traveltime

- G1: p = 0.1132 > 0.05 -> fail to reject H0
- G2: p = 0.1916 > 0.05 -> fail to reject H0
- G3: p = 0.6713 > 0.05 -> fail to reject H0

freetime

- G1: p = 0.1173 > 0.05 -> fail to reject H0
- G2: p = 0.4534 > 0.05 -> fail to reject H0
- G3: p = 0.4283 > 0.05 -> fail to reject H0

goout

- G1: p = 0.2411 > 0.05 -> fail to reject H0
- G2: p = 0.0899 > 0.05 -> fail to reject H0
- G3: p = 0.0231 < 0.05 -> reject H0

**Statistical Test (Chi-Square Test) — Portuguese**

studytime

- G1: p = 4.08e-08 < 0.05 -> reject H0
- G2: p = 6.42e-06 < 0.05 -> reject H0
- G3: p = 1.42e-06 < 0.05 -> reject H0

traveltime

- G1: p = 0.1630 > 0.05 -> fail to reject H0
- G2: p = 0.0039 < 0.05 -> reject H0
- G3: p = 0.0123 < 0.05 -> reject H0