GHS-Data-Collection/code/Overall_Data.py at main · GGC-DSA/GHS-Data-Collection · GitHub

# -*- coding: utf-8 -*-
"""Project GHS Itec4230 .ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1BIvrXFGp6rFO2Ys5EcRHZP3MPoTO6eol

Project Title: Second Iteration
# Data Collection for Green & Healthy Schools (GHS)

##### Project Description: The "Data Collection for Green & Healthy Schools (GHS)" project aims to revolutionize environmental data management within Gwinnett County Public Schools by developing a sophisticated online system. This system will comprise an intuitive data submission form and a dynamic dashboard, empowering teachers, principals, and program managers to comprehensively assess environmental practices and initiatives across classrooms. The project's primary objectives include facilitating easy data collection, enabling visualization of environmental impact data, and fostering informed decision-making to promote sustainability. Key features encompass secure user authentication, user-friendly data submission, searchable project database, and robust data security measures. By adhering to stringent requirements and leveraging modern technologies, the project aspires to enhance environmental education and instill a culture of sustainability within the educational ecosystem of Gwinnett County.
"""


"""### Nhat Minh Vu Section, Data Set 1

Our clients provided us with 2 datasets, so there was no need for us to search for one. However, the first dataset was disorganized, containing a mix of numeric and character data. For instance, in the 'Number of Students Enrolled in School' column, entries included variations such as 'approximate 900', '~900', '900-950', and 'None'. To clean this up, I manually removed the non-numeric characters, leaving only the numbers. For instances where there was a range, like '900-950', I replaced it with the average, in this case, 925. Additionally, other columns contained values like 'NA', 'N/A', 'None', and 'No'. I removed these values, leaving the fields empty.

# My hypothesis: "The number of students enrolled in school is the most important factor influencing the various aspects of environmental education activities."
"""

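"""The manual cleanup described above could also be scripted. The block below is a minimal sketch, assuming the raw entries look like the examples quoted above; `parse_enrollment` is a hypothetical helper and was not part of the original cleaning.

```python
import re

def parse_enrollment(raw):
    # Return a float for entries such as '~900' or '900-950', or None when
    # the entry holds no number (for example 'None' or 'N/A').
    if raw is None:
        return None
    # Assumption: commas only appear as thousands separators, so drop them.
    text = str(raw).replace(",", "")
    numbers = [float(n) for n in re.findall(r"[0-9]+(?:[.][0-9]+)?", text)]
    if not numbers:
        return None
    # A range such as '900-950' yields two numbers; use their average.
    return sum(numbers) / len(numbers)

samples = ["approximate 900", "~900", "900-950", "None"]
print([parse_enrollment(s) for s in samples])
```

Applying it with `df[column].map(parse_enrollment)` would leave unparseable entries missing, which is in line with how 'NA' and 'None' were handled by hand.
"""
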
import pandas as pd
import numpy as np

# Read the CSV file into a DataFrame with a different encoding
df = pd.read_csv("UPDATE TABLE GGC_2023 Green and Healthy Schools Profile_Data Capture.csv",encoding='ISO-8859-1')

# Display the DataFrame
print(df.head())

# Display summary statistics for numerical columns
print("\nSummary Statistics for Numerical Columns:")
print(df.describe())

# Drop exact duplicate rows only; the table is small, so keep as many rows and columns as possible.
df.drop_duplicates(inplace=True)


# Display information after cleaning
print("\nDataFrame after cleaning:")
print(df.head())

import matplotlib.pyplot as plt

# Convert 'Number of Students Enrolled in School' column to numeric
df['Number of Students Enrolled in School'] = pd.to_numeric(df['Number of Students Enrolled in School'], errors='coerce')

# Plotting
plt.figure(figsize=(12, 6))
plt.bar(df['School me'], df['Number of Students Enrolled in School'])
plt.title('Number of Students Enrolled in Each School')
plt.xlabel('School Name')
plt.ylabel('Number of Students Enrolled')
plt.xticks(rotation=45, ha='right')

# Scaling up the y-axis
plt.ylim(bottom=0, top=df['Number of Students Enrolled in School'].max() * 1.1)

plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt

# Assuming your DataFrame is already loaded and named df
# If not, load your DataFrame using pd.read_csv() first

# Plotting
plt.figure(figsize=(12, 6))
plt.bar(df['School me'], df['Number of Teachers and Support Staff within School'])
plt.title('Number of Teachers and Support Staff within Each School')
plt.xlabel('School Name')
plt.ylabel('Number of Teachers and Support Staff')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Convert 'Timestamp' column to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Plotting
plt.figure(figsize=(12, 6))
df['Timestamp'].dt.date.value_counts().sort_index().plot(kind='bar')
plt.title('Frequency of Timestamps')
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
df['If you described a PBL in the question above, please tell us how many students were involved.'].value_counts().plot(kind='bar')
plt.title('Frequency of Students Involved in PBL')
plt.xlabel('Number of Students Involved')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Calculate the correlation coefficient
correlation_coefficient = df['Number of Students Enrolled in School'].corr(df['Number of Teachers and Support Staff within School'])

# Print the correlation coefficient
print("Correlation Coefficient:", correlation_coefficient)

# Verify hypothesis
if correlation_coefficient > 0:
    print("The correlation coefficient is positive, indicating a positive linear relationship.")
    print("Therefore, the school that has more students enrolled tends to have more teachers and support staff participating in the project.")
elif correlation_coefficient < 0:
    print("The correlation coefficient is negative, indicating a negative linear relationship.")
    print("Therefore, the school that has more students enrolled tends to have fewer teachers and support staff participating in the project.")
else:
    print("There is no linear relationship between the number of students enrolled and the number of teachers and support staff participating in the project.")
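
"""Note on the sign check above: its final branch is reached for any value that is neither positive nor negative, and that includes NaN, which `Series.corr` returns when the correlation cannot be computed (for example, when no paired values remain). The block below is a minimal sketch of a check that reports that case separately; `describe_correlation` is a hypothetical helper, not part of the original analysis.

```python
import math

def describe_correlation(r):
    # Classify a correlation coefficient by sign, treating NaN as undefined
    if r is None or math.isnan(r):
        return "undefined"
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

# Illustrative values only, not results from the GHS data
for value in (0.62, -0.15, 0.0, float("nan")):
    print(value, "->", describe_correlation(value))
```

The sign alone says nothing about the strength or statistical significance of the relationship.
"""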

# Sort DataFrame by 'Number of Students Enrolled in School'
df_sorted = df.sort_values(by='Number of Students Enrolled in School')

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(df_sorted['Number of Students Enrolled in School'], df_sorted['Number of Teachers and Support Staff within School'], marker='o', linestyle='-')
plt.title('Relationship between Number of Students and Teachers/Staff')
plt.xlabel('Number of Students Enrolled in School')
plt.ylabel('Number of Teachers and Support Staff within School')
plt.grid(True)
plt.tight_layout()
plt.show()

# Convert columns with numeric data to numeric types
numeric_columns = ['Number of Teachers and Support Staff within School',
                   'If you answered the question above, please tell us how many students were involved with the community partnership. ',
                   'If you answered the question above, please tell us how many students participated.',
                   'If you answered the question above, please tell us how many students utilize the area(s).',
                   'If you answered the question above, please tell us how many students utilize the area(s)..1']

# Iterate through numeric columns and convert them to numeric types
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Check the data types again to ensure conversion
print(df.dtypes)

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Load the dataset
# Assuming your DataFrame is named df

# Step 1: Explore the distribution of the number of students enrolled in school
plt.figure(figsize=(8, 6))
sns.histplot(df['Number of Students Enrolled in School'], bins=20, kde=True)
plt.title('Distribution of Number of Students Enrolled in School')
plt.xlabel('Number of Students')
plt.ylabel('Frequency')
plt.show()

import numpy as np

# Remove non-numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=np.number)

# Remove rows with missing or infinite values
numeric_df = numeric_df.dropna()
numeric_df = numeric_df.replace([np.inf, -np.inf], np.nan).dropna()

# Select the dependent variable
y = numeric_df['Number of Students Enrolled in School']

# Select the independent variables
X = numeric_df.drop(columns=['Number of Students Enrolled in School'])


# Add a constant to the independent variables
X = sm.add_constant(X)

# Fit the OLS model. OLS (Ordinary Least Squares) estimates the parameters of a
# linear regression model by minimizing the sum of the squared differences
# between the observed values of the dependent variable and the values
# predicted by the regression equation.
model = sm.OLS(y, X).fit()

# Print the summary of the regression model
print(model.summary())

"""## My hypothesis: "The number of students enrolled in school is the most important factor influencing the various aspects of environmental education activities."

 Reject Hypothesis. The regression results do not directly address this hypothesis because they focus on the relationship between the number of students enrolled in school and other variables, rather than directly assessing the importance of the number of students enrolled.

1. The coefficients for the number of students enrolled in school and other variables are not statistically significant (except for the constant term).

The p-values for the coefficients associated with the number of students enrolled in school and other variables are not statistically significant, indicating that these variables may not have a significant effect on the number of students enrolled in school.

This can be observed from the p-values associated with each coefficient in the regression summary table. If the p-value is greater than 0.05, then the corresponding variable is not considered statistically significant in predicting the dependent variable.

This can be seen from the p-values (0.007) reported for each coefficient in the regression summary table. If a p-value is greater than the chosen significance level (commonly 0.05), the coefficient is not considered statistically significant. In this case, it is not significant.

2. The overall explanatory power of the model (R-squared) is low (0.315); R-squared normally ranges from 0 to 1. This suggests that the variables included in the model may not effectively explain the variation in the number of students enrolled in school.

The R-squared value measures the proportion of the variance in the dependent variable (number of students enrolled in school) that is explained by the independent variables in the model. A low R-squared value indicates that the model does not explain much of the variability in the dependent variable.


"""

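"""To make the R-squared definition above concrete, the block below is a minimal sketch that computes it by hand for a single predictor. The numbers are hypothetical and `r_squared` is an illustrative helper; the analysis above uses statsmodels with several predictors.

```python
def r_squared(x, y):
    # Fit y = intercept + slope * x by ordinary least squares
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation the fitted line does not explain
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its mean
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical values for illustration only
staff = [40, 55, 60, 72, 90]
students = [500, 800, 650, 1000, 950]
print(round(r_squared(staff, students), 3))
```

A value near 1 means the fitted line accounts for most of the variation in the dependent variable; a value near 0 means it accounts for very little.
"""
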
df.columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assuming df contains your DataFrame with the provided numeric columns

# Select numeric columns
numeric_df = df.select_dtypes(include=np.number)

# Drop rows with missing values
numeric_df.dropna(inplace=True)

# Standardize the data
scaler = StandardScaler()
numeric_df_scaled = scaler.fit_transform(numeric_df)

# Perform PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(numeric_df_scaled)

# Plot the data before PCA
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Before PCA')
for column in numeric_df.columns:
    plt.scatter(range(len(numeric_df)), numeric_df[column], label=column)
plt.xlabel('Data Point Index')
plt.ylabel('Value')
plt.legend()
plt.ylim(bottom=0)  # Adjust y-axis limit


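# Do not call plt.show() here: showing the figure now would put the
# 'After PCA' panel on a separate figure instead of beside this one.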

# Plot the data after PCA
plt.subplot(1, 2, 2)
plt.title('After PCA')
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.tight_layout()
plt.show()

# Add 'Number of Students Enrolled in School' to numeric columns list
numeric_columns = ['Number of Students Enrolled in School',
                   'Number of Teachers and Support Staff within School',
                   'If you answered the question above, please tell us how many students were involved with the community partnership. ',
                   'If you answered the question above, please tell us how many students participated.',
                   'If you answered the question above, please tell us how many students utilize the area(s).',
                   'If you answered the question above, please tell us how many students utilize the area(s)..1']

# Convert columns with numeric data to numeric types
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')


# Select every numeric column in the DataFrame
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
# Plotting scatter plot for each numeric column on the same graph
plt.figure(figsize=(12, 8))
for i, column in enumerate(numeric_columns):
    plt.scatter(df.index, df[column], label=column, alpha=0.7)

plt.title('Scatter Plot of Numeric Columns')
plt.xlabel('Index')
plt.ylabel('Values')
plt.legend()
plt.grid(True)
plt.show()

# Plotting scatter plot for each numeric column on the same graph
plt.figure(figsize=(12, 8))
for column in numeric_columns:
    sns.regplot(data=df, x=df.index, y=column, label=column)

plt.title('Scatter Plot of Numeric Columns with Regression Lines')
plt.xlabel('Index')
plt.ylabel('Values')
plt.legend()
plt.grid(True)
plt.show()

# Convert columns with numeric data to numeric types
numeric_columns = ['Number of Students Enrolled in School',
                   'Number of Teachers and Support Staff within School',
                   'If you answered the question above, please tell us how many students were involved with the community partnership. ',
                   'If you answered the question above, please tell us how many students participated.',
                   'If you answered the question above, please tell us how many students utilize the area(s).',
                   'If you answered the question above, please tell us how many students utilize the area(s)..1']

df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Define colors for each column
colors = ['red', 'black', 'blue', 'pink', 'purple', 'green']

# Plotting scatter plot for each numeric column on the same graph
plt.figure(figsize=(12, 8))
for i, column in enumerate(numeric_columns):
    if column in ["Number of Students Enrolled in School", "Number of Teachers and Support Staff within School"]:
        sns.regplot(data=df, x=df.index, y=column, label=column, scatter_kws={'color': colors[i]}, line_kws={'color': colors[i]})
    else:
        plt.scatter(df.index, df[column], label=column, alpha=0.7, color=colors[i])

plt.title('Scatter Plot of Numeric Columns with Regression Lines')
plt.xlabel('Index')
plt.ylabel('Values')
plt.legend()
plt.grid(True)
plt.show()

"""## Shantel Parrish

Hypothesis: There is a correlation between the presence of
green spaces utilized for instruction and participating in composting waste
"""

import pandas as pd

df = pd.read_csv("UPDATE TABLE GGC_2023 Green and Healthy Schools Profile_Data Capture - UPDATE TABLE GGC_2023 Green and Healthy Schools Profile_Data Capture.csv", encoding='ISO-8859-1')

from scipy.stats import chi2_contingency

contingency_df = df[["Does your school have outdoor green spaces or gardens utilized for instruction?  Examples include: edible gardens, pollitor gardens, etc.", "Does your school actively compost waste?"]]

contingency_table = pd.crosstab(df["Does your school have outdoor green spaces or gardens utilized for instruction?  Examples include: edible gardens, pollitor gardens, etc."], df["Does your school actively compost waste?"])

chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Contingency Table:")
print(contingency_table)
print("\n")

print("Chi-square test results:")
print(f"Chi-square: {chi2}")
print(f"P-value: {p}")
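
"""The chi-square test above compares the observed counts with the counts expected if the two answers were independent. The block below is a minimal sketch of that calculation with hypothetical counts; `expected_counts` and `chi_square_statistic` are illustrative helpers. Note that scipy's `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom (a 2x2 table), so its statistic can be smaller than the uncorrected value computed here.

```python
def expected_counts(table):
    # Expected count per cell = row total * column total / grand total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    # Uncorrected Pearson chi-square: sum of (observed - expected)^2 / expected
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Hypothetical 2x2 counts, not taken from the GHS data
observed = [[10, 20], [30, 40]]
print(expected_counts(observed))
print(round(chi_square_statistic(observed), 4))
```

With few schools in the table, some expected counts may fall below 5, in which case the chi-square approximation should be read with caution.
"""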

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(contingency_table, annot=True, fmt="d", cmap="Blues")
plt.title("Correlation between Outdoor Spaces and Composting Waste")
plt.xlabel("Composting Waste")
plt.ylabel("Outdoor Spaces")
plt.show()

sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))

categories = ["No Composting, No Green Space", "Composting, No Green Space", "No Composting, Green Space", "Composting, Green Space"]

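# contingency_table is indexed as .loc[green space answer, composting answer];
# the counts are ordered to match the category labels above.
counts = [contingency_table.loc["No", "No"], contingency_table.loc["No", "Yes"], contingency_table.loc["Yes", "No"], contingency_table.loc["Yes", "Yes"]]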

sns.barplot(x="Categories", y="Counts", data=pd.DataFrame({"Categories": categories, "Counts": counts}), palette=['#8cbf3f','#8cbf3f','#8cbf3f','#8cbf3f'], hue="Categories", dodge=False, legend=False)

plt.title("Schools by Green Space and Composting Status")
plt.xlabel("Outdoor Spaces and Composting Waste")
plt.ylabel("Count")

plt.xticks(size=8)
plt.savefig('association_graph.png')

"""The contingency table shows a clear connection between outdoor spaces and composting practices in schools: most schools lack composting initiatives despite having green areas, with the exception of one school. This indicates a potential target for focused intervention and further research.

Extra: Top Schools with the Most Student Participation
"""

column_indices = [8,11,14,19,22]

def bin_score(value):
    if value == '1-100':
        return 1
    elif value == '101-300':
        return 2
    elif value == '301-600':
        return 3
    elif value == '601-900':
        return 4
    elif value == '901-950':
        return 5
    else:
        return None

# Replace each participation range in the selected columns with its ordinal score
for index in column_indices:
    df.iloc[:, index] = df.iloc[:, index].apply(bin_score)

df['Total_Score'] = df.iloc[:, column_indices].sum(axis=1)

top_schools = df.sort_values(by='Total_Score', ascending=False).head(10)

top_schools.reset_index(drop=True, inplace=True)
top_schools.index += 1


print("Top 10 Schools with the Most Participation:")
print(top_schools[['School me']].to_string(header=False))