diff --git a/projects-appendix/modules/spring2026/images/spotify_project_image.jpg b/projects-appendix/modules/spring2026/images/spotify_project_image.jpg
new file mode 100644
index 000000000..ca2832b6e
Binary files /dev/null and b/projects-appendix/modules/spring2026/images/spotify_project_image.jpg differ
diff --git a/projects-appendix/modules/spring2026/pages/40200/Lasso_Ridge_Project.adoc b/projects-appendix/modules/spring2026/pages/40200/Lasso_Ridge_Project.adoc
new file mode 100644
index 000000000..c8dba4a80
--- /dev/null
+++ b/projects-appendix/modules/spring2026/pages/40200/Lasso_Ridge_Project.adoc
@@ -0,0 +1,626 @@
+:page-mathjax: true
+= TDM 40200: Project Lasso and Ridge
+
+== Project Objectives
+
+The objective of this project is to learn regularization. Students will learn the difference between lasso and ridge regression.
+
+.Learning Objectives
+****
+- Students will review how a linear regression line is fit by minimizing the sum of squared residuals.
+
+- Students will recall what a residual represents and how the sum of squared residuals (SSR) is used to evaluate model fit.
+
+- Students will learn how regression coefficients are estimated using ridge regression.
+
+- Students will understand the role of the penalty term in ridge regression and how it influences coefficient shrinkage.
+
+- Students will learn how lasso regression estimates coefficients, including how the L1 penalty affects the objective function.
+
+- Students will be able to distinguish how linear regression, ridge regression, and lasso regression differ in estimating coefficients for a regression problem.
+****
+
+
+== Introduction
+
+image::spotify_project_image.jpg[width=600, height=450, caption="Several Guitars Image"]
+
+In this project, you will work with a Spotify dataset containing song-level audio features such as danceability, energy, loudness, tempo, and other characteristics extracted from each track. The response variable of interest is song popularity, a numerical score intended to capture how widely a song is listened to. Our goal is to understand how these audio features relate to popularity and how different regression techniques estimate that relationship.
+
+**Recall Linear Regression**
+
+We will recall linear regression, which models the relationship between a response variable $y$ and one or more predictors $x_1, x_2, \dots, x_p$ as a linear combination. The linear regression model can be written as
+
+$$
+\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p
+$$
+
+where $\hat{y}$ represents the predicted response value, $\hat{\beta}_0$ is the estimated intercept, and $\hat{\beta}_1, \dots, \hat{\beta}_p$ are the estimated coefficients associated with each predictor.
+
+
+The coefficients in linear regression are estimated using the least squares approach. For an observed response value $y_i$ and corresponding prediction $\hat{y}_i$, the residual is defined as
+
+$e_i = y_i - \hat{y}_i$
+
+The sum of squared residuals (SSR) measures the overall discrepancy between the observed values and the model predictions and is defined as
+
+$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
+
+The least squares approach chooses coefficient values $\beta_0, \beta_1, \dots, \beta_p$ that minimize the SSR. This results in a regression line (or hyperplane) that best fits the data in the sense of minimizing squared prediction errors.
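+
+As a quick illustration (a minimal sketch with made-up numbers, not part of the graded work), the residuals and SSR for a candidate line can be computed directly with NumPy:
+
+[source,python]
+----
+import numpy as np
+
+# Hypothetical observed responses and predictions from some candidate line
+y_obs = np.array([50.0, 62.0, 47.0, 71.0])
+y_hat = np.array([53.0, 60.0, 45.0, 68.0])
+
+residuals = y_obs - y_hat     # e_i = y_i - y_hat_i
+ssr = np.sum(residuals ** 2)  # SSR = sum of squared residuals
+
+print(residuals)  # [-3.  2.  2.  3.]
+print(ssr)        # 26.0
+----
+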
+**Regularization in Linear Models**
+
+Ridge regression and lasso regression are both examples of regularization methods in linear modeling. Regularization refers to the practice of modifying the ordinary least squares objective function by adding a penalty term that constrains the size of the regression coefficients. Ridge regression and lasso regression extend linear regression by adding a penalty term to the SSR objective function. This penalty discourages overly large coefficient values, intentionally trading a small amount of bias for a meaningful reduction in variance and improved model stability. Both methods preserve the familiar linear structure of the model but differ in how they manage coefficient complexity. While ridge and lasso are both regularization techniques, they serve slightly different purposes. Ridge regression primarily stabilizes coefficient estimates by shrinking them toward zero, which is especially helpful when predictors are highly correlated. Lasso regression, on the other hand, goes a step further by allowing some coefficients to be exactly zero, effectively performing feature selection alongside coefficient estimation.
+
+
+**Ridge Regression**
+
+Ridge regression estimates coefficients by minimizing the following objective function:
+
+$\text{SSR} + \lambda \sum_{j=1}^{p} \beta_j^2$
+
+where $\lambda \ge 0$ is a tuning parameter that controls the strength of the penalty. The ridge penalty shrinks coefficient values toward zero, reducing variance and improving stability, but it does not set coefficients exactly to zero.
+
+**Lasso Regression**
+
+Lasso regression uses a different penalty and minimizes the objective
+
+$\text{SSR} + \lambda \sum_{j=1}^{p} |\beta_j|$
+
+
+Because of the absolute value penalty, lasso regression can shrink some coefficients exactly to zero. As a result, lasso performs both regularization and feature selection, producing models that are often easier to interpret. In this project, you will compare linear regression, ridge regression, and lasso regression using the same Spotify dataset. You will explore how coefficient estimates change across methods, how regularization affects model performance, and how these techniques balance prediction accuracy, stability, and interpretability.
+
+
+== Questions
+
+=== Question 1 - Preparing the Data (2 points)
+
+.Deliverables
+====
+**1a. Read in the `spotify_popularity_data`, then drop the columns in the `drop_cols` list.**
+
+[source,python]
+----
+import pandas as pd
+
+spotify_popularity_data = pd.read_csv("/anvil/projects/tdm/data/spotify/linear_regression_popularity.csv")
+
+drop_cols = [
+    "Unnamed: 0", "Unnamed: 0.1", "track_id", "track_name", "available_markets", "href",
+    "album_id", "album_name", "album_release_date", "album_type",
+    "artists_names", "artists_ids", "principal_artist_id",
+    "principal_artist_name", "artist_genres", "analysis_url", "duration_min", "principal_artist_followers"]
+
+# For YOU to do: drop the columns in the `drop_cols` list from spotify_popularity_data
+----
+
+**1b. Separate the target variable (popularity) into y and place all remaining variables into the feature matrix X using the code below. Then, in 1–2 sentences, explain why the target variable is isolated from the predictor variables when training a machine-learning model.**
+
+[source,python]
+----
+# Target and features
+y = spotify_popularity_data["popularity"].copy()
+X = spotify_popularity_data.drop(columns=["popularity"]).copy()
+----
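+
+As an optional sanity check (a small sketch, assuming the cells above have run), you can confirm the target no longer appears among the features:
+
+[source,python]
+----
+# The target should not be a column of the feature matrix
+print(X.shape, y.shape)
+assert "popularity" not in X.columns
+----
+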
+**1c. Use the code below to create an 80/20 train–test split with random_state = 42. Then, in 1–2 sentences, explain why the data is split into training and testing sets and why we fix a random seed.**
+
+[source,python]
+----
+from sklearn.model_selection import train_test_split
+
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+----
+
+====
+
+=== Question 2 - Linear Regression and Least Squares (2 points)
+
+In the previous question, you prepared the Spotify dataset and created training and testing splits. We now turn to fitting linear regression models and interpreting their outputs. The goal of this section is to help you recall how linear regression estimates coefficients and how model fit is evaluated using residuals. Recall that linear regression estimates coefficients using the least squares approach, which selects coefficient values that minimize the sum of squared residuals (SSR), defined as
+
+$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
+
+
+
+.Deliverables
+====
+**2a. Recall that the linear regression model is written as**
+
+$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$
+
+
+**In 2–3 sentences of your own words, explain what $\hat{y}$ represents, what $\beta_0$ represents, and how the coefficients $\beta_1, \dots, \beta_p$ relate to predicting Spotify song popularity.**
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression
+
+# Fit standard linear regression
+lr = LinearRegression()
+lr.fit(X_train, y_train)
+
+# Evaluate performance
+train_score = lr.score(X_train, y_train)
+test_score = lr.score(X_test, y_test)
+
+print("Training R^2:", train_score)
+print("Testing R^2:", test_score)
+----
+
+**2b. Use the code below to fit a simple linear regression model to the training data using only the danceability feature to predict popularity and develop a plot. On the plot, let's overlay the fitted model against an alternative candidate (Line 2) and calculate the Sum of Squared Residuals (SSR) for both. In 2–3 sentences, explain which line would be selected by the least squares approach and why.**
+
+[source,python]
+----
+import warnings
+warnings.filterwarnings("ignore")
+from sklearn.linear_model import LinearRegression
+import numpy as np
+import matplotlib.pyplot as plt
+
+# Fit simple linear regression
+X_dance = X_train[["danceability"]]
+lr = LinearRegression().fit(X_dance, y_train)
+x_line = np.linspace(X_dance.min()[0], X_dance.max()[0], 100)
+y_line_1 = lr.predict(x_line.reshape(-1, 1))
+y_line_2 = y_line_1 + 5
+
+ssr_1 = np.sum((y_train - lr.predict(X_dance))**2)
+ssr_2 = np.sum((y_train - (lr.predict(X_dance) + 5))**2)
+
+plt.figure(figsize=(10, 6))
+plt.scatter(X_dance, y_train, alpha=0.4, label="Actual Data", color="gray")
+
+# Plotting the lines
+plt.plot(x_line, y_line_1, color="blue", linewidth=2, label=f"Line 1 (Fitted): SSR={ssr_1:.0f}")
+plt.plot(x_line, y_line_2, "--", color="red", linewidth=2, label=f"Line 2 (Shifted): SSR={ssr_2:.0f}")
+
+# Adding the SSR Equation in a box
+ssr_formula = r'$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$'
+plt.text(0.05, 0.95, ssr_formula, transform=plt.gca().transAxes, fontsize=14,
+         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.5))
+
+plt.title("______")   # For YOU to fill in
+plt.xlabel("______")  # For YOU to fill in
+plt.ylabel("______")  # For YOU to fill in
+plt.legend(loc="lower right")
+plt.show()
+----
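+
+If you want to confirm numerically that the fitted line minimizes the SSR (an optional sketch, assuming the cell above has already run), evaluate the SSR for several vertical shifts of Line 1. For an OLS fit with an intercept, the minimum occurs at shift = 0:
+
+[source,python]
+----
+# SSR as a function of a vertical shift applied to the fitted line;
+# any nonzero shift can only increase the SSR
+for shift in [-10, -5, 0, 5, 10]:
+    ssr = np.sum((y_train - (lr.predict(X_dance) + shift))**2)
+    print(f"shift = {shift:+d}, SSR = {ssr:.0f}")
+----
+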
+**2c. Now let's fit a multivariable linear regression model using all features and display actual values, predicted values, and residuals side by side for the test set. Let's compute the Sum of Squared Residuals (SSR) from the residual column. Fill in the formula for residuals and then, in 2–3 sentences, explain what a residual measures and why the residuals are squared rather than summed directly.**
+
+[source,python]
+----
+# Fit full linear regression
+lr = LinearRegression()
+lr.fit(X_train, y_train)
+
+# Generate predictions
+y_pred = lr.predict(X_test)
+
+comparison_df = pd.DataFrame({
+    "Actual (y)": y_test.values,
+    "Predicted (ŷ)": y_pred,
+    "Residual (y − ŷ)": ______  # For YOU to fill in
+})
+
+# Compute the SSR from the residual column
+ssr = (comparison_df["Residual (y − ŷ)"]**2).sum()
+print("SSR:", ssr)
+
+comparison_df.head()
+----
+
+**2d. Use the code below to display the coefficients estimated by the multivariable regression model. In 2–3 sentences, explain how the coefficient values shown in the table are computed.**
+
+[source,python]
+----
+print("Intercept (β₀):", round(lr.intercept_, 3))
+
+pd.DataFrame({
+    "Feature": X.columns,
+    "Estimated Coefficient": lr.coef_
+}).round(4)
+----
+
+====
+
+=== Question 3 - Feature Selection and Subset Selection (2 points)
+
+Recall that in multiple linear regression, we model the relationship between a response variable $Y$ and a set of predictors $X_1, X_2, \dots, X_p$ using the linear model:
+
+$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$
+
+
+This model is typically fit using **least squares**. However, sometimes only a subset of the available predictors may be truly associated with the response, and including irrelevant variables adds unnecessary complexity to the model. Feature selection methods aim to address these issues by identifying a subset of predictors that are most relevant to the response, resulting in models that are simpler, more interpretable, and potentially better at generalizing to new data.
+
+**Subset Selection Methods**
+
+Subset selection methods aim to identify a smaller set of predictors that are most strongly related to the response. Rather than fitting a single full model with all predictors, these methods search over models of varying sizes to balance model fit, complexity, and interpretability.
+
+**Forward Stepwise Selection**
+
+Forward stepwise selection begins with the null model, which contains no predictors and predicts the sample mean of $Y$. Predictors are then added to the model one at a time. At each step, the predictor that provides the greatest improvement according to a specified criterion is included, conditional on the predictors already in the model.
+This procedure continues until no remaining predictor improves the model sufficiently, or until a stopping rule is met.
+
+Different criteria can be used to guide forward selection. In earlier work, we explored information-criterion-based approaches such as AIC, which balance model fit and complexity and are motivated by predictive performance.
+
+Alternatively, forward selection can be based on p-values. In this approach, a predictor is added only if its associated p-value is below a chosen significance level (for example, $\alpha = 0.05$), conditional on the predictors already in the model. This emphasizes inferential reasoning, as the p-value measures the strength of evidence against the null hypothesis that a coefficient is equal to zero; a small illustration is shown below.
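+
+To make the p-value criterion concrete, here is a small sketch (assuming `X_train` and `y_train` from Question 1, with `danceability` chosen only as an example feature) that fits an OLS model with a single candidate predictor and inspects the p-value on its coefficient:
+
+[source,python]
+----
+import statsmodels.api as sm
+
+# p-value for one candidate feature against the null hypothesis
+# that its coefficient is zero
+X_cand = sm.add_constant(X_train[["danceability"]].astype(float))
+fit = sm.OLS(y_train.astype(float), X_cand).fit()
+print(fit.pvalues)
+----
+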
+**Backward Stepwise Selection**
+
+Backward stepwise selection takes the opposite approach. The procedure begins with the full model, which includes all $p$ predictors. At each step, the predictor whose removal leads to the smallest decrease in model performance is removed, according to a specified criterion.
+As with forward selection, the criterion may be based on an information criterion such as AIC or on p-values. In a p-value-based approach, the predictor with the largest p-value (above a chosen threshold) is removed at each step.
+Backward stepwise selection continues until all remaining predictors satisfy the stopping rule. Unlike forward selection, backward selection requires that $n > p$, since the full least squares model must be fit at the start.
+
+**Stepwise Selection**
+
+Stepwise selection is a hybrid approach that combines elements of both forward and backward selection. The procedure typically begins like forward selection by adding predictors one at a time. However, after each new predictor is added, the method checks whether any previously included predictors should be removed based on the chosen criterion.
+This allows predictors to enter and leave the model dynamically as the selection process proceeds. While stepwise selection can be more flexible than purely forward or backward procedures, it remains a greedy algorithm and does not guarantee identification of the model with the lowest test error.
+
+.Deliverables
+====
+
+**3a. Use the provided function to perform forward stepwise selection based on p-values. In 2–3 sentences, explain how forward selection works and describe the role of the p-value threshold.**
+
+[source,python]
+----
+import pandas as pd
+import statsmodels.api as sm
+import numpy as np
+
+# Keep only numeric columns and drop missing values
+X_num = X_train.select_dtypes(include=[np.number]).dropna()
+y_num = pd.to_numeric(y_train, errors="coerce").loc[X_num.index]
+
+def forward_selection_pvalues(X, y, alpha=0.05):
+    remaining = list(X.columns)
+    selected = []
+
+    while remaining:
+        pvals = []
+
+        for feature in remaining:
+            X_model = sm.add_constant(X[selected + [feature]].values.astype(float))
+            model = sm.OLS(y.values.astype(float), X_model).fit()
+            pvals.append((model.pvalues[-1], feature))
+
+        best_pval, best_feature = min(pvals)
+
+        if best_pval < alpha:
+            selected.append(best_feature)
+            remaining.remove(best_feature)
+        else:
+            break
+
+    return selected
+
+selected_features = forward_selection_pvalues(X_num, y_num)
+print("Forward-selected features:")
+print(selected_features)
+----
+
+**3b. Use the code below to compute the test-set $R^2$ for the full multivariable linear regression model and for the reduced model selected using forward selection based on p-values, and display the results. Then, in 1–2 sentences, comment on how the two test-set $R^2$ values compare.**
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import r2_score
+
+# Full model
+lr_full = LinearRegression()
+lr_full.fit(X_train, y_train)
+y_pred_full = lr_full.predict(X_test)
+r2_full = r2_score(y_test, y_pred_full)
+
+# P-value selected model
+lr_pval = LinearRegression()
+lr_pval.fit(X_train[selected_features], y_train)
+y_pred_pval = lr_pval.predict(X_test[selected_features])
+r2_pval = r2_score(y_test, y_pred_pval)
+
+# Display results
+print("Test-set R^2 comparison")
+print("-----------------------")
+print(f"Full model R^2: {r2_full:.3f}")
+print(f"P-value model R^2: {r2_pval:.3f}")
+----
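+
+If you would like the comparison as an actual table rather than printed lines (a small optional sketch, assuming the cell above ran), a one-row-per-model DataFrame works:
+
+[source,python]
+----
+# Tabular view of the two test-set R^2 values
+r2_table = pd.DataFrame({
+    "Model": ["Full model", "P-value selected model"],
+    "Test-set R^2": [round(r2_full, 3), round(r2_pval, 3)]
+})
+r2_table
+----
+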
+**3c. In 2–3 sentences, discuss the tradeoff between using a simpler model with fewer predictors and a more complex model with many predictors in terms of interpretability and prediction accuracy.**
+
+====
+
+=== Question 4 - Ridge Regression (2 points)
+
+Ridge regression closely mirrors ordinary least squares, with one important modification: when estimating the coefficients, we add a penalty that discourages large coefficient values. Rather than focusing solely on fitting the data as closely as possible, Ridge regression balances model fit with coefficient size.
+
+Ridge regression estimates the coefficients by minimizing
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
+
+
+The first term, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, is the sum of squared residuals (SSR) and measures how closely the model's predictions match the observed data. In the Spotify popularity dataset, this corresponds to how far the predicted popularity values are from the actual popularity scores based on features such as danceability, energy, loudness, and tempo.
+
+The second term, $\lambda \sum_{j=1}^{p} \beta_j^2$, is the Ridge penalty. The tuning parameter $\lambda \ge 0$ determines how strongly we penalize large slope coefficients.
+
+When $\lambda = 0$, the penalty term drops out entirely and the objective reduces to
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,
+which is exactly the ordinary least squares criterion. In this case, Ridge regression yields the same coefficient estimates as OLS.
+
+As $\lambda$ increases, large coefficients become increasingly costly in the objective function. To keep the overall objective small, the model is forced to reduce the magnitude of the slope coefficients, shrinking them toward zero. This has the effect of stabilizing the model and making it less sensitive to noise in the training data, which helps reduce overfitting.
+
+In a multivariable Spotify popularity model, Ridge regression shrinks all feature coefficients toward zero, but not necessarily by the same amount. This occurs because the coefficients must balance two competing goals in the objective function: reducing the sum of squared residuals and keeping the penalty term small. Features that are strongly associated with popularity, such as danceability or energy, contribute substantially to reducing the SSR, so shrinking their coefficients too much would noticeably worsen the model fit. As a result, these coefficients tend to remain relatively larger, while weaker predictors, such as disc number or track number, can be shrunk much more with little impact on the SSR.
+
+
+.Deliverables
+====
+**4a. Ridge regression is similar to ordinary least squares, but the coefficients are estimated by minimizing a function that includes a penalty on the size of the coefficients. Specifically, Ridge regression estimates the coefficients by minimizing:**
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
+
+
+**In 1–2 sentences of your own words, describe what happens to the Ridge regression objective and the estimated coefficients when $\lambda = 0$. What equation does it remind you of?**
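+
+To see this numerically (an optional sketch, assuming the training split from Question 1), you can fit `Ridge` with the penalty turned off and compare a few coefficients with OLS. Note that scikit-learn may warn for `alpha=0`, since `LinearRegression` is the preferred solver in that case:
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression, Ridge
+
+ols = LinearRegression().fit(X_train, y_train)
+ridge0 = Ridge(alpha=0).fit(X_train, y_train)  # penalty term drops out
+
+# The two sets of coefficients should agree (up to numerical precision)
+print(ols.coef_[:5])
+print(ridge0.coef_[:5])
+----
+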
+**4b. Now focus on the penalty term in the objective function for Ridge:**
+
+$\lambda \sum_{j=1}^{p} \beta_j^2$
+
+
+**As $\lambda$ increases, the contribution of this penalty term to the objective function becomes larger. In 1–2 sentences, explain how the slope coefficients would need to change in order to minimize the overall objective function as $\lambda$ gets bigger, and why this behavior could help reduce overfitting.**
+
+**4c. To better understand how Ridge regularization affects the estimated coefficient, we will focus on a single predictor variable. Use the code below to plot Spotify popularity vs. danceability and overlay the regression lines from Ordinary Least Squares (OLS) and Ridge regression with alpha = 25. In 1–2 sentences, describe how the Ridge regression line compares to the OLS line. What change do you observe in the slope, and how does this reflect the effect of regularization?**
+
+[source,python]
+----
+import warnings
+from scipy.linalg import LinAlgWarning
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.linear_model import LinearRegression, Ridge
+
+warnings.filterwarnings("ignore", category=LinAlgWarning)
+
+# Select a single predictor
+X_train_single = X_train[["danceability"]]
+
+# Create grid for plotting regression lines (with feature name)
+x_vals = pd.DataFrame(
+    np.linspace(
+        X_train_single.min().values[0],
+        X_train_single.max().values[0],
+        100),
+    columns=["danceability"])
+
+# Fit OLS and Ridge
+ols = LinearRegression()
+ols.fit(X_train_single, y_train)
+y_ols = ols.predict(x_vals)
+ridge = Ridge(alpha=25)
+ridge.fit(X_train_single, y_train)
+y_ridge = ridge.predict(x_vals)
+
+# Plot data and regression lines
+plt.scatter(X_train_single["danceability"], y_train, alpha=0.3, label="Training data")
+plt.plot(x_vals["danceability"], y_ols, label="OLS (alpha = 0)", linewidth=2)
+plt.plot(x_vals["danceability"], y_ridge, linestyle="--", label="Ridge (alpha = 25)", linewidth=2)
+
+plt.xlabel("______")
+plt.ylabel("______")
+plt.title("______")
+plt.legend()
+plt.show()
+----
+
+**4d. So far, you examined how Ridge regression affects the slope of a single predictor. Now, consider a multivariable setting with all predictors included. Use the code below to fit both a standard multiple linear regression model and a multiple Ridge regression model using the same predictors and a fixed value of $\lambda$ (alpha). The code also creates a table that displays the coefficients from both models side by side. In 2–3 sentences, describe how the Ridge regression coefficients compare to the ordinary least squares coefficients and explain how this comparison illustrates the idea of coefficient shrinkage.**
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression, Ridge
+import pandas as pd
+
+# Fit OLS model
+ols = LinearRegression()
+ols.fit(X_train, y_train)
+
+# Fit Ridge model
+ridge = Ridge(alpha=25)
+ridge.fit(X_train, y_train)
+
+# Compare coefficients
+coef_compare = pd.DataFrame({
+    "Feature": X_train.columns,
+    "OLS Coefficient": ols.coef_,
+    "Ridge Coefficient (alpha = 25)": ridge.coef_
+}).round(4)
+
+coef_compare
+----
+
+
+====
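+
+Before moving on, it can help to see shrinkage across a whole range of penalties (an optional sketch, assuming the training split from Question 1):
+
+[source,python]
+----
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.linear_model import Ridge
+
+# Total coefficient magnitude as the Ridge penalty grows
+alphas = np.logspace(-2, 4, 20)
+norms = []
+for a in alphas:
+    r = Ridge(alpha=a).fit(X_train, y_train)
+    norms.append(np.sum(r.coef_ ** 2))
+
+plt.plot(alphas, norms)
+plt.xscale("log")
+plt.xlabel("alpha (penalty strength)")
+plt.ylabel("sum of squared coefficients")
+plt.title("Ridge shrinkage: coefficients shrink as alpha grows")
+plt.show()
+----
+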
+=== Question 5 - Lasso Regression (2 points)
+
+Lasso regression is closely related to Ridge regression, but it introduces an important difference in how the model penalizes large coefficients. While both methods apply regularization to reduce overfitting, they do so in different ways, which leads to meaningful differences in model behavior and interpretability.
+
+One key distinction is that Lasso regression has the ability to exclude variables entirely from the model. This means that some predictors can be removed from the final equation, resulting in a simpler and more interpretable model. In contrast, Ridge regression shrinks coefficients toward zero but does not eliminate predictors.
+
+This difference arises from the form of the penalty used by each method. Ridge regression penalizes squared coefficients, while Lasso regression penalizes the absolute value of the coefficients.
+
+Instead of minimizing a penalty involving squared slopes, Lasso regression estimates coefficients by minimizing
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
+
+
+Here, the first term is the sum of squared residuals, which measures how well the model fits the data, and the second term penalizes large coefficient values using absolute values rather than squares. The tuning parameter $\lambda \ge 0$ controls the strength of this penalty and is typically chosen using cross-validation.
+
+The most important practical difference between Ridge and Lasso regression is how coefficients behave as $\lambda$ increases. Ridge regression can shrink coefficients very close to zero, but they generally remain nonzero. Lasso regression, on the other hand, can force some coefficients to become exactly zero when $\lambda$ is sufficiently large.
+
+As a result, increasing $\lambda$ in a Lasso model can cause some predictors to drop out of the model entirely. This makes Lasso particularly useful in settings where the dataset contains many predictors that contribute little to explaining the response.
+
+Because Lasso can remove insignificant variables from the equation, it can be more effective than Ridge regression at reducing variance in models that contain a large number of weak or irrelevant predictors. Like other regularization methods, this comes at the cost of introducing some bias, but the overall bias–variance tradeoff often leads to improved predictive performance.
+
+
+.Deliverables
+====
+**5a. Lasso regression is closely related to Ridge regression, but it uses a different penalty on the coefficients. Specifically, the Lasso estimates the coefficients by minimizing**
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
+
+**In 1–2 sentences, explain how the Lasso objective function differs from the Ridge regression objective. How does replacing the squared penalty with an absolute value penalty change the behavior of the estimated coefficients?**
+
+**5b. Now focus on the Lasso penalty term**
+
+$\lambda \sum_{j=1}^{p} |\beta_j|$
+
+**As $\lambda$ increases, this penalty increasingly discourages large coefficient values. In 1–2 sentences, explain why the Lasso can force some coefficients to become exactly zero, and how this property makes Lasso different from Ridge regression in terms of variable selection and model interpretability.**
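+
+To see the zeroing effect directly (an optional sketch, assuming the training split from Question 1; very small penalties may emit convergence warnings), you can count how many Lasso coefficients are exactly zero at a few penalty strengths:
+
+[source,python]
+----
+from sklearn.linear_model import Lasso
+
+# More coefficients are driven to exactly zero as alpha grows
+for a in [0.01, 0.1, 1, 10]:
+    lasso = Lasso(alpha=a, max_iter=10000).fit(X_train, y_train)
+    n_zero = (lasso.coef_ == 0).sum()
+    print(f"alpha = {a}: {n_zero} of {len(lasso.coef_)} coefficients are exactly zero")
+----
+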
+**5c. Use the code below to fit three models using the same set of predictors: a standard multiple linear regression model, a Ridge regression model, and a Lasso regression model. Display the table that shows the estimated coefficients from all three models side by side. In 2–3 sentences, compare the coefficients across the three methods. Describe how Ridge and Lasso differ from ordinary least squares in terms of coefficient magnitude, and explain what additional behavior you observe for Lasso that does not occur with Ridge.**
+
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression, Ridge, Lasso
+import pandas as pd
+
+# Fit models
+ols = LinearRegression().fit(X_train, y_train)
+ridge = Ridge(alpha=25).fit(X_train, y_train)
+lasso = Lasso(alpha=1, max_iter=10000).fit(X_train, y_train)
+
+# Compare coefficients
+coef_table = pd.DataFrame({
+    "Feature": X_train.columns,
+    "OLS": ols.coef_,
+    "Ridge (alpha = 25)": ridge.coef_,
+    "Lasso (alpha = 1)": lasso.coef_
+}).round(4)
+
+coef_table
+----
+
+**5d. Suppose you are building a model to predict Spotify song popularity using many audio features, some of which may be weakly related or irrelevant. In 1–2 sentences, explain when you would prefer using Lasso regression over Ridge regression.**
+
+====
+
+
+=== Question 6 - Cross-Validation (2 points)
+
+In the previous questions, you explored how regularization methods such as Ridge and Lasso control model complexity by shrinking coefficient estimates and, in the case of Lasso, performing variable selection. While these methods can help reduce overfitting, their effectiveness depends critically on the choice of the tuning parameter (for example, $\lambda$ or $\alpha$), which determines the strength of the penalty applied to the coefficients.
+
+Rather than selecting this parameter arbitrarily, we typically rely on cross-validation to guide this choice. Cross-validation allows us to evaluate how a model performs on multiple subsets of the data, providing a more stable and reliable estimate of predictive performance than a single train–test split. By repeatedly training the model on different portions of the data and validating it on held-out folds, cross-validation helps balance the tradeoff between bias and variance.
+
+In this question, you will use k-fold cross-validation to assess the performance of Lasso regression across a range of $\alpha$ values. This process will help you identify a level of regularization that generalizes well to unseen data, rather than one that simply performs well on a single split of the dataset.
+
+
+.Deliverables
+====
+**6a. In 2–3 sentences, explain what k-fold cross validation is and how it differs from using a single train–test split.**
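+
+As a concrete picture of what the folds look like (an optional sketch on a toy array, not part of the graded work):
+
+[source,python]
+----
+import numpy as np
+from sklearn.model_selection import KFold
+
+toy = np.arange(10)  # ten pretend observations
+for i, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=False).split(toy)):
+    print(f"fold {i}: train on {train_idx}, validate on {val_idx}")
+----
+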
+**6b. Use the code below to perform 5-fold cross-validation for Lasso across a grid of $\alpha$ values. Based on the cross-validated $R^2$ scores, identify the value of $\alpha$ that yields the highest average performance and report the corresponding $R^2$.**
+
+_The `KFold` function controls how the data are split into training and validation folds during cross-validation. You can learn more about how `KFold` works, including the role of `n_splits`, `shuffle`, and `random_state`, here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html_
+
+
+[source,python]
+----
+from sklearn.linear_model import Lasso
+from sklearn.model_selection import KFold, cross_val_score
+import numpy as np
+import pandas as pd
+
+alphas = np.logspace(-3, 1, 10)
+
+cv_results = []
+
+kf = KFold(n_splits=___, shuffle=True, random_state=42)  # For YOU to fill in
+
+for a in alphas:
+    lasso = Lasso(alpha=a, max_iter=10000)
+
+    scores = cross_val_score(
+        lasso,
+        X_train,
+        y_train,
+        cv=kf,
+        scoring="r2"
+    )
+
+    cv_results.append({
+        "alpha": a,
+        "mean_cv_r2": scores.mean()
+    })
+
+cv_df = pd.DataFrame(cv_results).round(4)
+cv_df
+----
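+
+To pull out the winning value programmatically (an optional sketch, assuming `cv_df` from the cell above):
+
+[source,python]
+----
+# Row of cv_df with the highest mean cross-validated R^2
+best = cv_df.loc[cv_df["mean_cv_r2"].idxmax()]
+print("Best alpha:", best["alpha"], "with mean CV R^2:", best["mean_cv_r2"])
+----
+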
+**6c. In 2–3 sentences, explain how cross-validation helps guide the choice of $\lambda$ for regularized models such as Lasso. Discuss why selecting hyperparameters using cross-validation is more reliable than evaluating model performance on a single test set.**
+
+====
+
+
+== Dataset
+
+- /anvil/projects/tdm/data/spotify/linear_regression_popularity.csv
+
+Learn more about the source: https://the-examples-book.com/projects/data-sets/Spotify[Spotify Dataset 1986–2023]. Feature names and their descriptions are below for your reference:
+
+[cols="1,3", options="header"]
+|===
+| Feature | Description
+
+| track_id | Unique identifier for the track
+| track_name | Name of the track
+| popularity | Popularity score (0–100) based on Spotify plays
+| available_markets | Markets/countries where the track is available
+| disc_number | Disc number (for albums with multiple discs)
+| duration_ms | Track duration in milliseconds
+| explicit | Whether the track contains explicit content (True/False)
+| track_number | Position of the track within the album
+| href | Spotify API endpoint URL for the track
+| album_id | Unique identifier for the album
+| album_name | Name of the album
+| album_release_date | Release date of the album
+| album_type | Album type (album, single, compilation)
+| album_total_tracks | Total number of tracks in the album
+| artists_names | Names of the artists on the track
+| artists_ids | Unique identifiers of the artists
+| principal_artist_id | ID of the principal/primary artist
+| principal_artist_name | Name of the principal/primary artist
+| artist_genres | Genres associated with the principal artist
+| principal_artist_followers | Number of Spotify followers of the principal artist
+| acousticness | Confidence measure of whether the track is acoustic (0–1)
+| analysis_url | Spotify API URL for detailed track analysis
+| danceability | How suitable a track is for dancing (0–1)
+| energy | Intensity and activity measure of the track (0–1)
+| instrumentalness | Predicts whether a track contains vocals (0–1)
+| key | Estimated key of the track (integer, e.g., 0=C, 1=C#/Db)
+| liveness | Presence of an audience in the recording (0–1)
+| loudness | Overall loudness of the track in decibels (dB)
+| mode | Modality of the track (1=major, 0=minor)
+| speechiness | Presence of spoken words (0–1)
+| tempo | Estimated tempo in beats per minute (BPM)
+| time_signature | Estimated overall time signature
+| valence | Musical positivity/happiness of the track (0–1)
+| year | Year the track was released
+| duration_min | Track duration in minutes
+|===
+
+
+== References
+
+Some explanations, examples, and terminology presented in this section were adapted from the following sources for educational purposes:
+
+* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). _An Introduction to Statistical Learning: with Applications in Python_. Springer Texts in Statistics. Springer.
+
+
+== Submitting your Work
+
+Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
+
+.Items to submit
+====
+- firstname_lastname_project1.ipynb
+====
+
+[WARNING]
+====
+You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
+
+You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
+====
\ No newline at end of file
+ + +The coefficients in linear regression are estimated using the least squares approach. For an observed response value $y_i$ and corresponding prediction $\hat{y}_i$, the residual is defined as + +$e_i = y_i - \hat{y}_i$ + +The residual sum of squares (SSR) measures the overall discrepancy between the observed values and the model predictions and is defined as + +$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ + +The least squares approach chooses coefficient values $\beta_0, \beta_1, \dots, \beta_p$ that minimize the SSR. This results in a regression line (or hyperplane) that best fits the data in the sense of minimizing squared prediction errors. + +**Regularization in Linear Models** + +Ridge regression and lasso regression are both examples of regularization methods in linear modeling. Regularization refers to the practice of modifying the ordinary least squares objective function by adding a penalty term that constrains the size of the regression coefficients. Ridge regression and lasso regression extend linear regression by adding a penalty term to the SSR objective function. This penalty discourages overly large coefficient values, intentionally trading a small amount of bias for a meaningful reduction in variance and improved model stability. Both methods preserve the familiar linear structure of the model but differ in how they manage coefficient complexity. While ridge and lasso are both regularization techniques, they serve slightly different purposes. Ridge regression primarily stabilizes coefficient estimates by shrinking them toward zero, which is especially helpful when predictors are highly correlated. Lasso regression, on the other hand, goes a step further by allowing some coefficients to be exactly zero, effectively performing feature selection alongside coefficient estimation. + + +**Ridge Regression** + +Ridge regression estimates coefficients by minimizing the following objective function: + +$\text{SSR} + \lambda \sum_{j=1}^{p} \beta_j^2$ + +where $\lambda \ge 0$ is a tuning parameter that controls the strength of the penalty. The ridge penalty shrinks coefficient values toward zero, reducing variance and improving stability, but it does not set coefficients exactly to zero. + +**Lasso Regression** + +Lasso regression uses a different penalty and minimizes the objective + +$\text{SSR} + \lambda \sum_{j=1}^{p} |\beta_j|$ + + +Because of the absolute value penalty, lasso regression can shrink some coefficients exactly to zero. As a result, lasso performs both regularization and feature selection, producing models that are often easier to interpret. In this project, you will compare linear regression, ridge regression, and lasso regression using the same Spotify dataset. You will explore how coefficient estimates change across methods, how regularization affects model performance, and how these techniques balance prediction accuracy, stability, and interpretability. + + +== Questions + +=== Question 1 - Preparing the Data (2 points) + +.Deliverables +==== +**1a. 
Read in the `spotify_popularity_data` then make sure to drop the columns in the `drop_cols` list.** + +[source,python] +---- +import pandas as pd + +spotify_popularity_data = pd.read_csv("/anvil/projects/tdm/data/spotify/linear_regression_popularity.csv") + +drop_cols = [ + "Unnamed: 0", "Unnamed: 0.1", "track_id", "track_name", "available_markets", "href", + "album_id", "album_name", "album_release_date", "album_type", + "artists_names", "artists_ids", "principal_artist_id", + "principal_artist_name", "artist_genres", "analysis_url", "duration_min", "principal_artist_followers"] + +# For YOU to do: drop the columns in the `drop_cols` list from spotify_popularity_data +---- + +**1b. Separate the target variable (popularity) into y and place all remaining variables into the feature matrix X using the code below. Then, in 1–2 sentences, explain why the target variable is isolated from the predictor variables when training a machine-learning model.** + +[source,python] +---- +# Target and features +y = spotify_popularity_data["popularity"].copy() +X = spotify_popularity_data.drop(columns=["popularity"]).copy() +---- + +**1c. Use the code below to create an 80/20 train–test split with random_state = 42. Then, in 1–2 sentences, explain why the data is split into training and testing sets along with why we fix a random seed.** + +[source,python] +---- +from sklearn.model_selection import train_test_split + +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) +---- + +==== + +=== Question 2 - Linear Regression and Least Squares (2 points) + +In the previous question, you prepared the Spotify dataset and created training and testing splits. We now turn to fitting linear regression models and interpreting their outputs. The goal of this section is to help you recall how linear regression estimates coefficients and how model fit is evaluated using residuals. Recall that linear regression estimates coefficients using the least squares approach, which selects coefficient values that minimize the residual sum of squares (SSR)), defined as + +$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ + + + +.Deliverables +==== +**2a. Recall that the linear regression model is written as** + +$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$ + + +**In 2–3 sentences of your own words, explain what $\hat{y}$ represents, what $\beta_0$ represents, and how the coefficients $\beta_1, \dots, \beta_p$ relate to predicting Spotify song popularity.** + +[source,python] +---- +from sklearn.linear_model import LinearRegression + +# Fit standard linear regression +lr = LinearRegression() +lr.fit(X_train, y_train) + +# Evaluate performance +train_score = lr.score(X_train, y_train) +test_score = lr.score(X_test, y_test) + +print("Training R^2:", train_score) +print("Testing R^2:", test_score) +---- + +**2b. Use the code below to fit a simple linear regression model to the training data using only the danceability feature to predict popularity and develop a plot. On the plot, let's overlay the fitted model against an alternative candidate (Line 2) and calculate the Sum of Squared Residuals (SSR) for both. 
In 2-3 sentences, explain which line would be selected by the least squares approach and why.** + +[source,python] +---- +import warnings +warnings.filterwarnings("ignore") +from sklearn.linear_model import LinearRegression +import numpy as np +import matplotlib.pyplot as plt + +# Fit simple linear regression +X_dance = X_train[["danceability"]] +lr = LinearRegression().fit(X_dance, y_train) +x_line = np.linspace(X_dance.min()[0], X_dance.max()[0], 100) +y_line_1 = lr.predict(x_line.reshape(-1, 1)) +y_line_2 = y_line_1 + 5 + +ssr_1 = np.sum((y_train - lr.predict(X_dance))**2) +ssr_2 = np.sum((y_train - (lr.predict(X_dance) + 5))**2) + +plt.figure(figsize=(10, 6)) +plt.scatter(X_dance, y_train, alpha=0.4, label="Actual Data", color="gray") + +# Plotting the lines +plt.plot(x_line, y_line_1, color="blue", linewidth=2, label=f"Line 1 (Fitted): SSR={ssr_1:.0f}") +plt.plot(x_line, y_line_2, "--", color="red", linewidth=2, label=f"Line 2 (Shifted): SSR={ssr_2:.0f}") + +# Adding the SSR Equation in a box +ssr_formula = r'$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$' +plt.text(0.05, 0.95, ssr_formula, transform=plt.gca().transAxes, fontsize=14, + verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.5)) + +plt.title("______") # For YOU to fill in +plt.xlabel("______") # For YOU to fill in +plt.ylabel("______") # For YOU to fill in +plt.legend(loc="lower right") +plt.show() +---- + +**2c. Now let's fit a multivariable linear regression model using all features and display actual values, predicted values, and residuals side by side for the test set. Let's compute the Sum of Squared Residuals (SSR) from the residual column. Fill in the formula for residuals and then in 2–3 sentences, explain what a residual measures and why the residuals are squared rather than summed directly.** + +[source,python] +---- +# Fit full linear regression +lr = LinearRegression() +lr.fit(X_train, y_train) + +# Generate predictions +y_pred = lr.predict(X_test) + +comparison_df = pd.DataFrame({ + "Actual (y)": y_test.values, + "Predicted (ŷ)": y_pred, + "Residual (y − ŷ)": ______ # For YOU to fill in +}) + +comparison_df.head() +---- + +**2d. Use the code below to display the coefficients estimated by the multivariable regression model. In 2–3 sentences, explain how we the coefficient values shown in the table are computed.** + +[source,python] +---- +print("Intercept (β₀):", round(lr.intercept_, 3)) + +pd.DataFrame({ + "Feature": X.columns, + "Estimated Coefficient": lr.coef_ +}).round(4) +---- + +==== + +=== Question 3 - Feature Selection and Subset Selection(2 points) + +Let's recall how in multiple linear regression, we model the relationship between a response variable $Y$ and a set of predictors $X_1, X_2, \dots, X_p$ using the linear model: + +$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$ + + +This model is typically fit using **least squares**. However, sometimes only a subset of the available predictors may be truly associated with the response, and including irrelevant variables adds unnecessary complexity to the model. Feature selection methods aim to address these issues by identifying a subset of predictors that are most relevant to the response, resulting in models that are simpler, more interpretable, and potentially better at generalizing to new data. + +**Subset Selection Methods** + +Subset selection methods aim to identify a smaller set of predictors that are most strongly related to the response. 
Rather than fitting a single full model with all predictors, these methods search over models of varying sizes to balance model fit, complexity, and interpretability. + +**Forward Stepwise Selection** + +Forward stepwise selection begins with the null model, which contains no predictors and predicts the sample mean of $Y$. Predictors are then added to the model one at a time. At each step, the predictor that provides the greatest improvement according to a specified criterion is included, conditional on the predictors already in the model. +This procedure continues until no remaining predictor improves the model sufficiently, or until a stopping rule is met. + +Different criteria can be used to guide forward selection. In earlier work, we explored information-criterion-based approaches such as AIC, which balance model fit and complexity and are motivated by predictive performance. + +Alternatively, forward selection can be based on p-values. In this approach, a predictor is added only if its associated p-value is below a chosen significance level (for example, $\alpha = 0.05$), conditional on the predictors already in the model. This emphasizes inferential reasoning, as the p-value measures the strength of evidence against the null hypothesis that a coefficient is equal to zero. + +**Backward Stepwise Selection** + +Backward stepwise selection takes the opposite approach. The procedure begins with the full model, which includes all $p$ predictors. At each step, the predictor whose removal leads to the smallest decrease in model performance is removed, according to a specified criterion. +As with forward selection, the criterion may be based on an information criterion such as AIC or on p-values. In a p-value-based approach, the predictor with the largest p-value (above a chosen threshold) is removed at each step. +Backward stepwise selection continues until all remaining predictors satisfy the stopping rule. Unlike forward selection, backward selection requires that $n > p$, since the full least squares model must be fit at the start. + +**Stepwise Selection** + +Stepwise selection is a hybrid approach that combines elements of both forward and backward selection. The procedure typically begins like forward selection by adding predictors one at a time. However, after each new predictor is added, the method checks whether any previously included predictors should be removed based on the chosen criterion. +This allows predictors to enter and leave the model dynamically as the selection process proceeds. While stepwise selection can be more flexible than purely forward or backward procedures, it remains a greedy algorithm and does not guarantee identification of the model with the lowest test error. + +.Deliverables +==== + +**3a. Use the provided function to perform forward stepwise selection based on p-values. 
In 2–3 sentences, explain how forward selection works and describe the role of the p-value threshold.** + +[source,python] +---- +import pandas as pd +import statsmodels.api as sm +import numpy as np + +# Keep only numeric columns and drop missing values +X_num = X_train.select_dtypes(include=[np.number]).dropna() +y_num = pd.to_numeric(y_train, errors="coerce").loc[X_num.index] + +def forward_selection_pvalues(X, y, alpha=0.05): + remaining = list(X.columns) + selected = [] + + while remaining: + pvals = [] + + for feature in remaining: + X_model = sm.add_constant(X[selected + [feature]].values.astype(float)) + model = sm.OLS(y.values.astype(float), X_model).fit() + pvals.append((model.pvalues[-1], feature)) + + best_pval, best_feature = min(pvals) + + if best_pval < alpha: + selected.append(best_feature) + remaining.remove(best_feature) + else: + break + + return selected + +selected_features = forward_selection_pvalues(X_num, y_num) +print("Forward-selected features:") +print(selected_features) +---- + +**3b. Use the code below to compute the test-set $R^2$ for the full multivariable linear regression model and for the reduced model selected using forward selection based on p-values and display a table. Then in 2–3 sentences, discuss the tradeoff between using a simpler model with fewer predictors and a more complex model with many predictors in terms of interpretability and prediction accuracy.** + +[source,python] +---- +from sklearn.linear_model import LinearRegression +from sklearn.metrics import r2_score + +# Full model +lr_full = LinearRegression() +lr_full.fit(X_train, y_train) +y_pred_full = lr_full.predict(X_test) +r2_full = r2_score(y_test, y_pred_full) + +# P-value selected model +lr_pval = LinearRegression() +lr_pval.fit(X_train[selected_features], y_train) +y_pred_pval = lr_pval.predict(X_test[selected_features]) +r2_pval = r2_score(y_test, y_pred_pval) + +# Display results +print("Test-set R^2 comparison") +print("-----------------------") +print(f"Full model R^2: {r2_full:.3f}") +print(f"P-value model R^2: {r2_pval:.3f}") +---- +**3c. In 2–3 sentences, discuss the tradeoff between using a simpler model with fewer predictors and a more complex model with many predictors in terms of interpretability and prediction accuracy.** + +==== + +=== Question 4 - Ridge Regression (2 points) + +Ridge regression closely mirrors ordinary least squares, with one important modification: when estimating the coefficients, we add a penalty that discourages large coefficient values. Rather than focusing solely on fitting the data as closely as possible, Ridge regression balances model fit with coefficient size. + +Ridge regression estimates the coefficients by minimizing + +$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$ + + +The first term, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, is the sum of squared residuals (SSR) and measures how closely the model’s predictions match the observed data. In the Spotify popularity dataset, this corresponds to how far the predicted popularity values are from the actual popularity scores based on features such as danceability, energy, loudness, and tempo. + +The second term, $\lambda \sum_{j=1}^{p} \beta_j^2$, is the Ridge penalty. The tuning parameter $\lambda \ge 0$ determines how strongly we penalize large slope coefficients. + +When $\lambda = 0$, the penalty term drops out entirely and the objective reduces to +$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, +which is exactly the ordinary least squares criterion. 
In this case, Ridge regression yields the same coefficient estimates as OLS. + +As $\lambda$ increases, large coefficients become increasingly costly in the objective function. To keep the overall objective small, the model is forced to reduce the magnitude of the slope coefficients, shrinking them toward zero. This has the effect of stabilizing the model and making it less sensitive to noise in the training data, which helps reduce overfitting. + +In a multivariable Spotify popularity model, Ridge regression shrinks all feature coefficients toward zero, but not necessarily by the same amount. This occurs because the coefficients must balance two competing goals in the objective function: reducing the sum of squared residuals and keeping the penalty term small. Features that are strongly associated with popularity, such as danceability or energy, contribute substantially to reducing the SSR, so shrinking their coefficients too much would noticeably worsen the model fit. As a result, these coefficients tend to remain relatively larger, while weaker predictors, such as disc number or track number, can be shrunk much more with little impact on the SSR. + + +.Deliverables +==== +**4a. Ridge regression is similar to ordinary least squares, but the coefficients are estimated by minimizing a function that includes a penalty on the size of the coefficients. Specifically, Ridge regression estimates the coefficients by minimizing:** + +$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$ + + +**In 1–2 sentences of your own words, describe what happens to the Ridge regression objective and the estimated coefficients when $\lambda = 0$. What equation does it remind you of?** + +**4b. Now focus on the penalty term in the objective function for Ridge:** + +$\lambda \sum_{j=1}^{p} \beta_j^2$ + + +**As $\lambda$ increases, the contribution of this penalty term to the objective function becomes larger. In 1–2 sentences, explain how the slope coefficients would need to change in order to minimize the overall objective function as $\lambda$ gets bigger, and why this behavior could help reduce overfitting.** + +**4c. To better understand how Ridge regularization affects the estimated coefficient, we will focus on a single predictor variable. Use the code below to plot Spotify popularity vs. danceability and overlay the regression lines from Ordinary Least Squares (OLS) and Ridge regression with alpha = 25. In 1–2 sentences, describe how the Ridge regression line compares to the OLS line. 
What change do you observe in the slope, and how does this reflect the effect of regularization?** + +[source,python] +---- +import warnings +from scipy.linalg import LinAlgWarning +import numpy as np +import matplotlib.pyplot as plt +from sklearn.linear_model import LinearRegression, Ridge + +warnings.filterwarnings("ignore", category=LinAlgWarning) + +# Select a single predictor +X_train_single = X_train[["danceability"]] + +# Create grid for plotting regression lines (with feature name) +x_vals = pd.DataFrame( + np.linspace( + X_train_single.min().values[0], + X_train_single.max().values[0], + 100), + columns=["danceability"]) + +# Fit OLS and Ridge +ols = LinearRegression() +ols.fit(X_train_single, y_train) +y_ols = ols.predict(x_vals) +ridge = Ridge(alpha=25) +ridge.fit(X_train_single, y_train) +y_ridge = ridge.predict(x_vals) + +# Plot data and regression lines +plt.scatter(X_train_single["danceability"], y_train, alpha=0.3, label="Training data") +plt.plot(x_vals["danceability"], y_ols, label="OLS (alpha = 0)", linewidth=2) +plt.plot(x_vals["danceability"], y_ridge, linestyle="--", label="Ridge (alpha = 25)", linewidth=2) + +plt.xlabel("______") +plt.ylabel("______") +plt.title("______") +plt.legend() +plt.show() +---- + +**4d. So far, you examined how Ridge regression affects the slope of a single predictor. Now, consider a multivariable setting with all predictors included. Use the code below to fit both a standard multiple linear regression model and a multiple Ridge regression model using the same predictors and a fixed value of $\lambda$ (alpha). Use the code below to create a table that displays the coefficients from both models side by side. In 2–3 sentences, describe how the Ridge regression coefficients compare to the ordinary least squares coefficients and explain how this comparison illustrates the idea of coefficient shrinkage.** + +[source,python] +---- +from sklearn.linear_model import LinearRegression, Ridge +import pandas as pd + +# Fit OLS model +ols = LinearRegression() +ols.fit(X_train, y_train) + +# Fit Ridge model +ridge = Ridge(alpha=25) +ridge.fit(X_train, y_train) + +# Compare coefficients +coef_compare = pd.DataFrame({ + "Feature": X_train.columns, + "OLS Coefficient": ols.coef_, + "Ridge Coefficient (alpha = 25)": ridge.coef_ +}).round(4) + +coef_compare +---- + + +==== + +=== Question 5 - Lasso Regression (2 points) + +Lasso regression is closely related to Ridge regression, but it introduces an important difference in how the model penalizes large coefficients. While both methods apply regularization to reduce overfitting, they do so in different ways, which leads to meaningful differences in model behavior and interpretability. + +One key distinction is that Lasso regression has the ability to exclude variables entirely from the model. This means that some predictors can be removed from the final equation, resulting in a simpler and more interpretable model. In contrast, Ridge regression shrinks coefficients toward zero but does not eliminate predictors. + +This difference arises from the form of the penalty used by each method. Ridge regression penalizes squared coefficients, while Lasso regression penalizes the absolute value of the coefficients. 
+
+Instead of minimizing a penalty involving squared slopes, Lasso regression estimates coefficients by minimizing
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
+
+Here, the first term is the sum of squared residuals, which measures how well the model fits the data, and the second term penalizes large coefficient values using absolute values rather than squares. The tuning parameter $\lambda \ge 0$ controls the strength of this penalty and is typically chosen using cross-validation.
+
+The most important practical difference between Ridge and Lasso regression is how coefficients behave as $\lambda$ increases. Ridge regression can shrink coefficients very close to zero, but they generally remain nonzero. Lasso regression, on the other hand, can force some coefficients to become exactly zero when $\lambda$ is sufficiently large.
+
+As a result, increasing $\lambda$ in a Lasso model can cause some predictors to drop out of the model entirely. This makes Lasso particularly useful in settings where the dataset contains many predictors that contribute little to explaining the response.
+
+Because Lasso can remove uninformative variables from the equation, it can be more effective than Ridge regression at reducing variance in models that contain a large number of weak or irrelevant predictors. Like other regularization methods, this comes at the cost of introducing some bias, but the overall bias–variance tradeoff often leads to improved predictive performance.
+
+
+.Deliverables
+====
+**5a. Lasso regression is closely related to Ridge regression, but it uses a different penalty on the coefficients. Specifically, the Lasso estimates the coefficients by minimizing**
+
+$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
+
+**In 1–2 sentences, explain how the Lasso objective function differs from the Ridge regression objective. How does replacing the squared penalty with an absolute value penalty change the behavior of the estimated coefficients?**
+
+**5b. Now focus on the Lasso penalty term**
+
+$\lambda \sum_{j=1}^{p} |\beta_j|$
+
+**As $\lambda$ increases, this penalty increasingly discourages large coefficient values. In 1–2 sentences, explain why the Lasso can force some coefficients to become exactly zero, and how this property makes Lasso different from Ridge regression in terms of variable selection and model interpretability.**
+
+**5c. Use the code below to fit three models using the same set of predictors: a standard multiple linear regression model, a Ridge regression model, and a Lasso regression model. Display the table that shows the estimated coefficients from all three models side by side. In 2–3 sentences, compare the coefficients across the three methods. Describe how Ridge and Lasso differ from ordinary least squares in terms of coefficient magnitude, and explain what additional behavior you observe for Lasso that does not occur with Ridge.**
+
+[source,python]
+----
+from sklearn.linear_model import LinearRegression, Ridge, Lasso
+import pandas as pd
+
+# Fit models
+ols = LinearRegression().fit(X_train, y_train)
+ridge = Ridge(alpha=25).fit(X_train, y_train)
+lasso = Lasso(alpha=1, max_iter=10000).fit(X_train, y_train)
+
+# Compare coefficients (column labels match each model's alpha value)
+coef_table = pd.DataFrame({
+    "Feature": X_train.columns,
+    "OLS": ols.coef_,
+    "Ridge (alpha = 25)": ridge.coef_,
+    "Lasso (alpha = 1)": lasso.coef_
+}).round(4)
+
+coef_table
+----
+
+**5d. Suppose you are building a model to predict Spotify song popularity using many audio features, some of which may be weakly related or irrelevant. In 1–2 sentences, explain when you would prefer using Lasso regression over Ridge regression.**
+
+====
+
+
+=== Question 6 - Cross-Validation (2 points)
+
+In the previous questions, you explored how regularization methods such as Ridge and Lasso control model complexity by shrinking coefficient estimates and, in the case of Lasso, performing variable selection. While these methods can help reduce overfitting, their effectiveness depends critically on the choice of the tuning parameter (for example, $\lambda$ or $\alpha$), which determines the strength of the penalty applied to the coefficients.
+
+Rather than selecting this parameter arbitrarily, we typically rely on cross-validation to guide this choice. Cross-validation allows us to evaluate how a model performs on multiple subsets of the data, providing a more stable and reliable estimate of predictive performance than a single train–test split. By repeatedly training the model on different portions of the data and validating it on held-out folds, cross-validation helps balance the tradeoff between bias and variance.
+
+In this question, you will use k-fold cross-validation to assess the performance of Lasso regression across a range of $\alpha$ values. This process will help you identify a level of regularization that generalizes well to unseen data, rather than one that simply performs well on a single split of the dataset.
+
+
+.Deliverables
+====
+**6a. In 2–3 sentences, explain what k-fold cross-validation is and how it differs from using a single train–test split.**
+
+**6b. Use the code below to perform 5-fold cross-validation for Lasso across a grid of $\alpha$ values. Based on the cross-validated $R^2$ scores, identify the value of $\alpha$ that yields the highest average performance and report the corresponding $R^2$.**
+
+_The `KFold` function controls how the data are split into training and validation folds during cross-validation. You can learn more about how `KFold` works, including the role of `n_splits`, `shuffle`, and `random_state`, here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html_
+
+[source,python]
+----
+from sklearn.linear_model import Lasso
+from sklearn.model_selection import KFold, cross_val_score
+import numpy as np
+import pandas as pd
+
+alphas = np.logspace(-3, 1, 10)
+
+cv_results = []
+
+kf = KFold(n_splits=___, shuffle=True, random_state=42) # For YOU to fill in
+
+for a in alphas:
+    lasso = Lasso(alpha=a, max_iter=10000)
+
+    scores = cross_val_score(
+        lasso,
+        X_train,
+        y_train,
+        cv=kf,
+        scoring="r2"
+    )
+
+    cv_results.append({
+        "alpha": a,
+        "mean_cv_r2": scores.mean()
+    })
+
+cv_df = pd.DataFrame(cv_results).round(4)
+cv_df
+----
+
+**6c. In 2–3 sentences, explain how cross-validation helps guide the choice of $\lambda$ for regularized models such as Lasso. Discuss why selecting hyperparameters using cross-validation is more reliable than evaluating model performance on a single test set.**
+====
+
+
+== Dataset
+
+- /anvil/projects/tdm/data/spotify/linear_regression_popularity.csv
+
+Learn more about the source: https://the-examples-book.com/projects/data-sets/Spotify[Spotify Dataset 1986–2023].
Feature names and their descriptions are below for your reference: + +[cols="1,3", options="header"] +|=== +| Feature | Description + +| track_id | Unique identifier for the track +| track_name | Name of the track +| popularity | Popularity score (0–100) based on Spotify plays +| available_markets | Markets/countries where the track is available +| disc_number | Disc number (for albums with multiple discs) +| duration_ms | Track duration in milliseconds +| explicit | Whether the track contains explicit content (True/False) +| track_number | Position of the track within the album +| href | Spotify API endpoint URL for the track +| album_id | Unique identifier for the album +| album_name | Name of the album +| album_release_date | Release date of the album +| album_type | Album type (album, single, compilation) +| album_total_tracks | Total number of tracks in the album +| artists_names | Names of the artists on the track +| artists_ids | Unique identifiers of the artists +| principal_artist_id | ID of the principal/primary artist +| principal_artist_name | Name of the principal/primary artist +| artist_genres | Genres associated with the principal artist +| principal_artist_followers | Number of Spotify followers of the principal artist +| acousticness | Confidence measure of whether the track is acoustic (0–1) +| analysis_url | Spotify API URL for detailed track analysis +| danceability | How suitable a track is for dancing (0–1) +| energy | Intensity and activity measure of the track (0–1) +| instrumentalness | Predicts whether a track contains vocals (0–1) +| key | Estimated key of the track (integer, e.g., 0=C, 1=C#/Db) +| liveness | Presence of an audience in the recording (0–1) +| loudness | Overall loudness of the track in decibels (dB) +| mode | Modality of the track (1=major, 0=minor) +| speechiness | Presence of spoken words (0–1) +| tempo | Estimated tempo in beats per minute (BPM) +| time_signature | Estimated overall time signature +| valence | Musical positivity/happiness of the track (0–1) +| year | Year the track was released +| duration_min | Track duration in minutes +|=== + + +== References + +Some explanations, examples, and terminology presented in this section were adapted from the following sources for educational purposes: + +* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). _An Introduction to Statistical Learning: with Applications in Python_. Springer Texts in Statistics. Springer. + + +== Submitting your Work + +Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope. + +.Items to submit +==== +- firstname_lastname_project1.ipynb +==== + +[WARNING] +==== +You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. + +You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
+====
\ No newline at end of file
diff --git a/projects-appendix/modules/spring2026/pages/40200/project_survival_analysis.adoc b/projects-appendix/modules/spring2026/pages/40200/project_survival_analysis.adoc
new file mode 100644
index 000000000..03f31f810
--- /dev/null
+++ b/projects-appendix/modules/spring2026/pages/40200/project_survival_analysis.adoc
@@ -0,0 +1,1007 @@
+:page-mathjax: true
+
+= TDM 40200: Project Survival Analysis
+
+== Project Objectives
+
+This project introduces survival analysis and Cox proportional hazards regression, guiding you through building and interpreting models of patient risk over time using a dataset from a study of lung cancer patients.
+
+.Learning Objectives
+****
+- Prepare and clean a real survival dataset by loading data, recoding event indicators, evaluating class balance, and imputing missing values using appropriate methods.
+
+- Understand censoring and survival time concepts through descriptive statistics and visualizations that distinguish observed events from right-censored observations.
+
+- Estimate and compare survival functions using Kaplan–Meier curves and learn about the log-rank test.
+
+- Build and interpret a multivariable Cox proportional hazards model, including calculation of hazard ratios, statistical significance, and percent change in hazard for clinical predictors.
+
+****
+
+== Dataset
+- `/anvil/projects/tdm/data/survival/SurvivalData.csv`
+
+[IMPORTANT]
+====
+Use 4 cores for this project.
+====
+
+== Introduction
+
+image::StethoscopeImage.jpg[width=600, height=450, title="A close-up view of a stethoscope on a medical background."]
+
+
+In many studies, we are interested in whether an event occurs, such as whether a patient survives surgery.
+In such cases, a binary modeling approach such as logistic regression may seem sufficient. However, when the **timing of the event matters**, survival analysis is more appropriate. For example, a patient who dies during a medical procedure and a patient who dies two months later both experience death, yet a simple yes or no model would give them the same label. Ignoring how long it took for the event to occur discards valuable information and limits our ability to see important patterns.
+
+This is where **survival analysis** becomes useful.
+
+Instead of modeling only whether an event occurs, survival analysis models **the time until an event occurs.** This time is called the **survival time** or **event time**. The survival time is measured from a clearly defined starting point until the event happens.
+
+Common examples of events include:
+
+* Death of a patient
+* Progression of a disease
+* Failure of a machine part
+* Cancellation of a customer subscription
+* Conception, in time-to-pregnancy studies
+
+In all of these situations, the main response variable is the **time from a defined beginning until an event happens**.
+
+**True Survival Time and True Censoring Time**
+
+In survival analysis, we imagine that each individual has two underlying time values:
+
+* **True survival time** $T$: the time when the event actually occurs
+* **True censoring time** $C$: the time when observation of that individual stops
+
+We do not usually observe both of these. Instead, we only see whichever occurs first.
+
+The **observed time** for each individual is
+
+$Y = \min(T, C)$
+
+So:
+
+* If the event occurs before observation stops, then $Y = T$.
+* If observation stops before the event occurs, then $Y = C$.
+
+We will use this same notation when we describe censored data in more detail.
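+
+To make the observed-time idea concrete, here is a minimal sketch using made-up survival and censoring times for five hypothetical patients (the values below are illustrative, not taken from our dataset):
+
+[source,python]
+----
+import numpy as np
+
+# Hypothetical true survival times T and true censoring times C, in days
+T = np.array([300, 450, 1000, 210, 880])
+C = np.array([365, 365, 365, 365, 365])   # e.g., observation stops at day 365
+
+# The observed time is whichever comes first
+Y = np.minimum(T, C)
+print(Y)   # [300 365 365 210 365]
+----
+
+For the first and fourth patients the event occurs before observation stops, so $Y = T$; for the other three we only learn that their true survival times exceed 365 days.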
+
+
+**Our Example Data Set**
+
+Consider our dataset, which comes from a study of lung cancer patients.
+
+In this data set:
+
+* **Survival time** is the **number of days from enrollment into the study until the outcome occurs**.
+* The outcome is either **Death**, when the event is observed, or **Censoring**, when the event is not observed during the period of observation.
+
+For each patient:
+
+* Time begins at **enrollment into the study**.
+* Time ends at **death**, if the patient dies during the study, or at the **last recorded contact**, if the patient is still alive or no longer in contact with the study team. Those patients are treated as censored observations, which we describe next.
+
+The data set does not record calendar dates for diagnosis or death.
+Instead, it records **elapsed time in days** between study start and the outcome. Each recorded survival time therefore represents how many days a patient lived after enrollment.
+
+
+== Questions
+
+=== Question 1 - Exploring and Cleaning the Data (2 points)
+
+++++
+
+++++
+
+.Deliverables
+====
+**1a. Load the `survival_data` dataset using the file path provided and display the first 5 rows along with the size of the data.**
+
+[source,python]
+----
+# Load data
+import pandas as pd
+survival_data = pd.read_csv('/anvil/projects/tdm/data/survival/SurvivalData.csv')
+
+# For YOU to do: display the first 5 rows and the size (shape) of the data
+----
+
+**1b. Recode the status variable so that death is labeled as 1 and censored cases are labeled as 0.**
+
+_Hint: Be careful about the order of the two assignments so that values you have just recoded are not overwritten._
+
+[source,python]
+----
+survival_data.loc[survival_data.status==_, '____']=_ # For YOU to fill in
+survival_data.loc[survival_data.status==_, '____']=_ # For YOU to fill in
+----
+
+
+**1c. Apply linear interpolation to fill missing values in the meal.cal and wt.loss columns, then use forward fill and backward fill to handle any remaining NaNs in these two columns. Fill the missing values in the inst column using the column's mode.**
+
+Hint: Use `.ffill()` and `.bfill()` for forward and backward fill.
+
+[source,python]
+----
+survival_data[['meal.cal', 'wt.loss']] = (
+    survival_data[['meal.cal', 'wt.loss']].interpolate(method='___')) # For YOU to fill in
+
+# Fill any remaining NaNs using forward and backward fill
+survival_data[['meal.cal', 'wt.loss']] = _____ # For YOU to fill in
+
+# Fill in inst using mode
+survival_data['inst'] = # For YOU to fill in
+----
+
+**1d. For the remaining variables with missing data (ph.ecog, ph.karno, and pat.karno), fill the NaN values using the median of each respective column. Finally, display the updated missing-value counts for all columns to confirm that no missing values remain in the dataset.**
+
+[source,python]
+----
+# Fill any missing values using median
+survival_data[['ph.ecog','ph.karno','pat.karno']] = _________ # For YOU to fill in
+
+# For YOU to do: confirm no missing values remain
+----
+====
+
+=== Question 2 - Understanding Time, Censoring, and Survival (2 points)
+
+++++
+
+++++
+
+**Censored Observations**
+
+As we mentioned, although this setup may appear similar to a regression problem, an important complication exists: many patients survive past the end of the study, so their exact event times are unknown.
+
+Those patients are considered **censored**.
+
+For censored individuals:
+
+* The true event (death) has not been observed.
+* They may experience the event days, months, or years later.
+* We only know that their true event time occurs **after** their last follow-up time.
+
+Censored data is still informative because each censored observation tells us
+the patient survived at least as long as their recorded follow-up time.
+
+Using the notation above, we still observe
+
+$Y = \min(T, C)$
+
+but now we also record a **status indicator** to tell us whether the event was actually seen.
+
+$$
+\delta =
+\begin{cases}
+1 & \text{if } T \le C \quad \text{(event observed)} \\
+0 & \text{if } T > C \quad \text{(censored)}
+\end{cases}
+$$
+
+So for each individual we have the pair $(Y, \delta)$.
+
+* If $\delta = 1$, then the event occurred at time $Y$.
+* If $\delta = 0$, then the observation is censored and the true event time is greater than $Y$.
+
+Even though censored observations do not give us the exact event time, they still provide important information. Each censored patient tells us that the event did not occur before their last observed time.
+
+
+**Types of Censoring**
+
+There are several ways in which censoring can occur. The most common types are **right censoring**, **left censoring**, and **interval censoring**.
+
+**Right Censoring**
+
+Right censoring occurs when we know that the event time is at least as large as the observed time. In notation
+
+$T \ge Y$
+
+Examples
+
+* The study ends while the patient is still alive.
+* The patient leaves the study or is lost to contact before the event occurs.
+
+In both cases, we do not know the exact value of $T$, but we do know that $T$ is larger than the final observed time.
+
+
+**Left Censoring**
+
+Left censoring occurs when the event happens before the time that observation begins, but the exact event time is unknown.
+
+In this case
+
+$T \le Y$
+
+Example
+
+Suppose we survey a group of pregnant women at one visit late in pregnancy. Some of them have already given birth by the time of the survey. For those women, we only know that the birth occurred some time before the survey date, which means their pregnancy duration is less than the time since conception that corresponds to the survey.
+
+**Interval Censoring**
+
+Interval censoring occurs when we only know that the event happened between two observation times.
+
+In this setting we know that
+
+$T \in (L, U)$
+
+where $L$ and $U$ are the last time the event was known not to have happened and the first time it was known to have happened.
+
+Example
+
+* Patients are checked once per week.
+* At one visit, the event has not occurred.
+* At the next visit, the event has occurred.
+
+We do not know the exact day of the event, only that it occurred at some time between two visits.
+
+**Censored Data in Our Example Dataset**
+
+image::censoreddata.png[width=600, height=450, title="Example Survival Timelines (5 Patients from Dataset)"]
+
+*Figure interpretation:*
+
+In the figure above, each horizontal line represents one patient’s follow-up time starting at study entry (time = 0). An X marks when a death occurs (event observed), and an open circle marks a censored observation (patient still alive at last follow-up). The dashed vertical line shows a **chosen reference (landmark) time at 365 days**, which we use as a snapshot of follow-up rather than the true end of the study.
+
+Notice that some patients die before this reference time (e.g., Patients 1 and 2), while others die after it (e.g., Patient 4). Patients who remain alive beyond the reference time (e.g., Patient 5) have survival times that extend past 365 days.
From the perspective of a snapshot taken at day 365, patients who die after this point and patients who remain alive after this point **are indistinguishable at that moment**, because neither outcome has occurred yet. This demonstrates why survival analysis must explicitly account for censoring and time-to-event information rather than relying on simple yes/no outcome models.
+
+
+**Assumption About Censoring**
+
+In order for standard survival analysis methods to give valid results, we usually assume that
+
+* The censoring process is independent of the event time, after we account for the observed covariates in the model.
+
+Informally, this means that whether or when a subject is censored should not depend on their unobserved survival time in a way that is not already explained by the variables in the model. When this assumption is reasonable, censored observations still contribute correct information about the survival experience of the population.
+
+
+.Deliverables
+====
+**2a. Using the `status` event indicator variable, calculate how many observations are events (status = 1) and how many are right-censored (status = 0). Print the counts and create a bar plot showing the proportion of censored versus observed events.**
+
+[source,python]
+----
+# For YOU to do: print how many observations are deaths (status = 1) and how many are censored (status = 0)
+
+# Plot the proportions (sort_index keeps status 0 at the first tick so the labels line up)
+import matplotlib.pyplot as plt
+survival_data['status'].value_counts(normalize=True).sort_index().plot(kind='bar')
+plt.xticks([0,1], ['Censored (0)', 'Death (1)'])
+plt.ylabel('______')
+plt.title('______')
+plt.show()
+----
+
+**2b. Use the code below to create a scatter plot that displays each patient’s follow-up time on the x-axis and the patient index (after sorting by time) on the y-axis. Then, write 1–2 sentences interpreting the results.**
+
+_Note: Follow-up time is the length of time each patient is observed in the study, from study entry until either death occurs (event) or the patient is censored (e.g., study ends or they are lost to follow-up). In this dataset, follow-up time is stored in the time column and is measured in days._
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+# Sort by follow-up time for better visualization
+df_sorted = survival_data.sort_values(by='time')
+
+# Color mapping: red = death (event), blue = censored
+colors = df_sorted['status'].map({1.0: 'red', 0.0: 'blue'})
+
+plt.figure(figsize=(10, 6))
+plt.scatter(
+    df_sorted['time'],
+    range(len(df_sorted)),
+    c=colors,
+    alpha=0.7
+)
+
+plt.title("______") # For YOU to fill in
+plt.xlabel("_______") # For YOU to fill in
+plt.ylabel("________") # For YOU to fill in
+
+# Custom legend
+plt.legend(handles=[
+    plt.Line2D([], [], marker='o', color='red', linestyle='None',
+               label='Death (event observed)'),
+    plt.Line2D([], [], marker='o', color='blue', linestyle='None',
+               label='Censored (event not observed during study)')
+])
+
+plt.grid(alpha=0.3)
+plt.show()
+----
+
+**2c. Use the code below to generate a survival timeline plot for the first 10 patients in `survival_data`, including a dashed vertical reference line at 365 days to represent a snapshot of follow-up.
Write 1–2 sentences interpreting the plot and stating how many patients experienced an observed death at our point of reference versus how many were a censored outcome.** + +[source,python] +---- +import matplotlib.pyplot as plt + +# Select the first 10 patients from the dataset +plot_df = survival_data.head(10).reset_index(drop=True) + +# Create plot +plt.figure(figsize=(10, 5)) +plt.axvline(x=365, linestyle='--', label='Study cutoff (365 days)') + +death_labeled = False +censor_labeled = False + +# Plot each patient's survival timeline +for i, row in plot_df.iterrows(): + plt.hlines(y=i+1, xmin=0, xmax=row['time']) + + if row['status'] == 1: + plt.scatter( + row['time'], i+1, + marker='x', s=100, + label='Death (event observed)' if not death_labeled else None + ) + death_labeled = True + else: + plt.scatter( + row['time'], i+1, + facecolors='none', edgecolors='black', s=100, + label='Censored' if not censor_labeled else None + ) + censor_labeled = True + +plt.yticks(range(1, 11), [f'Patient {i}' for i in range(1, 11)]) +plt.xlabel('_________') +plt.ylabel('_________') +plt.title('_________') +plt.legend() +plt.show() + +---- + +==== + +=== Question 3 - Kaplan Meier (2 points) + +++++ + +++++ + +The survival function can be written using the Kaplan Meier product formula: + +$S(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$ + + + +In this form + +* $t_i$ represents each observed event time +* $d_i$ is the number of deaths that occur at time $t_i$ +* $n_i$ is the number of individuals at risk immediately before time $t_i$ + + +At each event time, survival is updated by multiplying the previous survival probability by a new fraction that accounts for how many deaths occurred out of the group still being observed. + +**Data Used for Illustration** + +We demonstrate this process using our lung cancer dataset. Each row of the data includes + +* **time**: the number of days observed +* **event**: + * 1 indicates death occurred + * 0 indicates the observation is censored + +Example rows from the dataset are shown below. + +[source] +---- + time event +0 306 1 +1 455 1 +2 1010 0 +3 210 1 +4 883 1 +---- + +**Identify Event Times** + +Kaplan Meier updates the survival probability only at time points where deaths occur. These are called **event times**. + +From our dataset, the first two event times are: + +[source] +---- +[5, 11] +---- + +This means + +* At least one patient died at day 5. +* At least one patient died at day 11. + +We now calculate survival probabilities at each of these two event times. + +**Step 1 Calculate Survival at Time 5** + +At the first event time: + +* $t_1 = 5$ +* Number at risk just before day 5 + $n_1 = 228$ +* Number of deaths on day 5 + $d_1 = 1$ + +Apply the Kaplan–Meier step formula: + +$S(5) = 1 \times \left(1 - \frac{1}{228}\right)$ + +This gives + +$$ +S(5) = 0.995614 +$$ + + +**Interpretation** + +After day 5, approximately **99.56 percent** of patients are estimated to survive beyond that time. + +**Step 2 Calculate Survival at Time 11** + +Move to the second event time. + +* $t_2 = 11$ +* Number at risk just before day 11 + $n_2 = 227$ +* Number of deaths on day 11 + $d_2 = 3$ + + +The survival probability is updated by multiplying the previous estimate by a new fraction: + +$S(11) = S(5) \times \left(1 - \frac{3}{227}\right)$ + +This gives + +$S(11) = 0.982456$ + + +**Interpretation** + +After day 11, approximately **98.25 percent** of patients are estimated to survive beyond this time. 
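+
+Before moving on, here is a minimal sketch that verifies the two hand calculations above directly in Python, using the counts stated in Steps 1 and 2:
+
+[source,python]
+----
+# Kaplan Meier steps at the first two event times
+S5 = 1 * (1 - 1/228)      # day 5: 1 death among 228 at risk
+S11 = S5 * (1 - 3/227)    # day 11: 3 deaths among 227 still at risk
+
+print(round(S5, 6))    # 0.995614
+print(round(S11, 6))   # 0.982456
+----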
+ +**Summary Table** + +The survival probabilities calculated for our first two event times are summarized below. + +[cols="1,1,1,1"] +|=== +| Time (days) | Deaths \(d_i\) | At risk \(n_i\) | Survival probability + +| 5 +| 1 +| 228 +| 0.995614 + +| 11 +| 3 +| 227 +| 0.982456 +|=== + + +**Kaplan Meier Pattern** + +Each Kaplan Meier step follows the same process: + +1. Identify the next event time. +2. Count how many individuals are still at risk $n_i$. +3. Count how many deaths occur at that time $d_i$. +4. Compute the fraction + +$1 - \frac{d_i}{n_i}$ + +5. Multiply that fraction by the current survival probability + +$S(t_i) = S(t_{i-1}) \times \left(1 - \frac{d_i}{n_i}\right)$ + + +Repeating this procedure at all event times builds the full Kaplan Meier survival curve. + +image::KaplanMeierCurve.png[width=600, height=450, title="Kaplan Meier Curve."] + +.Deliverables +==== +**3a. Use the code below to identify the first two event times in the dataset and explain what an “event time” means in the context of Kaplan Meier estimation. Then, in 1–2 sentences, explain what the survival function $S(t)$ represents in plain language.** + +[source,python] +---- +# Identify event times +event_times = ( + survival_data + .loc[survival_data["status"] == 1, "time"] + .sort_values() + .unique() +) + +# For YOU to do: Identify the first two event times +---- + +**3b. Use the code below to calculate the first survival probability $S(5)$. Then write 1 sentence explaining what $S(5)$ means for patients in this study.** + +[source,python] +---- +# First event time +t1 = event_times[0] + +# Count patients at risk and deaths +n1 = (survival_data["time"] >= t1).sum() +d1 = ((survival_data["time"] == t1) & (survival_data["status"] == 1)).sum() + +# Kaplan Meier Equation +S1 = ______ # For YOU to fill in + +print("Time =", t1) +print("At risk n1 =", n1) +print("Deaths d1 =", d1) +print("S(t1) =", S1) +---- + + +**3c. Using the value of $S(5)$ from Question 3b, calculate the updated survival probability $S(11)$. Then write 1 sentence explaining what $S(11)$ means for patients in this study.** + + + +[source,python] +---- +# Second event time +t2 = event_times[1] + +# Count patients at risk and deaths +n2 = (survival_data["time"] >= t2).sum() +d2 = ((survival_data["time"] == t2) & (survival_data["status"] == 1)).sum() + +# Update survival probability +S2 = ______ # For YOU to fill in + +print("Time =", t2) +print("At risk n2 =", n2) +print("Deaths d2 =", d2) +print("S(t2) =", S2) +---- + +==== + +=== Question 4 - Survival Curves and Log Rank Test (2 points) + +**Comparing Survival Curves: The Log-Rank Test** + +Once survival curves are estimated using the Kaplan–Meier method, we often want to compare **survival between two or more groups** (for example, male vs female patients). + +While visual inspection of Kaplan–Meier curves can suggest differences between groups, overlap or small separations can be difficult to interpret by eye alone. To formally evaluate whether any observed differences are statistically meaningful, we use the **log-rank test**. + +The **log-rank test** compares: + +* The **observed number of deaths** in each group at every event time +* To the **number of deaths expected** if both groups truly had the **same underlying survival experience** + +These comparisons are summed across all event times to produce a **test statistic** and corresponding **p-value**. 
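+
+As a small illustration of this observed-versus-expected idea, suppose (with made-up numbers, not from our dataset) that just before one event time there are 40 patients at risk in group A, 60 in group B, and 5 deaths occur. Under the null hypothesis of identical survival, the expected deaths split in proportion to group size:
+
+[source,python]
+----
+n_A, n_B = 40, 60   # at risk in each group just before the event time
+d = 5               # total deaths observed at this event time
+
+expected_A = d * n_A / (n_A + n_B)   # 5 * 40/100 = 2.0 expected deaths in A
+expected_B = d * n_B / (n_A + n_B)   # 5 * 60/100 = 3.0 expected deaths in B
+print(expected_A, expected_B)
+----
+
+The log-rank statistic accumulates these observed-minus-expected differences over every event time, so consistent excess deaths in one group produce a large test statistic and a small p-value.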
+ +**Interpretation** + +* A **small p-value (typically < 0.05)** provides evidence that the survival curves differ between groups, suggesting **different hazard rates**. +* A **large p-value** indicates that any visible differences in the curves are likely due to random variation rather than a true difference in survival. + +The log-rank test therefore provides a **formal statistical complement** to Kaplan–Meier plots, moving beyond visual comparison to hypothesis testing. + + + +.Deliverables +==== +**4a. Using the code below, fit and plot the overall Kaplan Meier survival curve using all patients. Then write 1–2 sentences explaining what the curve communicates about survival probabilities over time.** + +[source,python] +---- +from lifelines import KaplanMeierFitter +import matplotlib.pyplot as plt + +kmf = KaplanMeierFitter() + +kmf.fit( + durations=survival_data["time"], + event_observed=(survival_data["status"] == 1) +) + +kmf.plot(ci_show=True) +plt.xlabel("_______") # For YOU to fill in +plt.ylabel("_______") # For YOU to fill in +plt.title("______") # For YOU to fill in +plt.grid(alpha=0.3) +plt.show() +---- + +**4b. Create Kaplan Meier survival curves stratified by sex (sex = 1 for male, sex = 2 for female). Then write 1–2 sentences comparing survival between male and female patients and noting any visible separation between the curves.** + +[source,python] +---- +from lifelines import KaplanMeierFitter +import matplotlib.pyplot as plt + +kmf = KaplanMeierFitter() + +plt.figure(figsize=(6, 4)) + +for sex_value, label in [(1, "Male"), (2, "Female")]: + group = survival_data[survival_data["sex"] == sex_value] + + if group.empty: + continue + + kmf.fit( + durations=group["time"], + event_observed=(group["status"] == 1), + label=label + ) + kmf.plot(ci_show=False) + +plt.xlabel("______") # For YOU to fill in +plt.ylabel("______") # For YOU to fill in +plt.title("______") # For YOU to fill in +plt.grid(alpha=0.3) +plt.legend() +plt.show() +---- + +**4c. Using the code below, perform a log rank test to compare survival between male patients (sex = 1) and female patients (sex = 2). Then, in 1–2 sentences, state whether the p-value from the log rank test provides evidence that the log hazards differ between male and female patients.** + +[source,python] +---- +from lifelines.statistics import logrank_test + +# Separate data by sex +male = survival_data[survival_data["sex"] == 1] +female = survival_data[survival_data["sex"] == 2] + +# Run log rank test +results = logrank_test( + male["time"], + female["time"], + event_observed_A=(male["status"] == 1), + event_observed_B=(female["status"] == 1) +) + +# Print results +results.print_summary() +---- + +==== + +=== Question 5 - The Cox Proportional Hazards Model (2 points) + +**The Cox Proportional Hazards Model** + +The **Cox Proportional Hazards model** describes how patient characteristics (**covariates**) affect the risk of death over time using the dataset `survival_data`. + +The hazard function is: + +$h(t) = h_0(t)\exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$ + +where: + +* $h(t)$ is the instantaneous risk of death at time $t$ +* $h_0(t)$ is the **baseline hazard** (when all covariates equal 0) +* $X_1,\ldots,X_p$ are covariates from `survival_data`, including + `age`, `sex`, `ph.ecog`, `ph.karno`, `pat.karno`, `meal.cal`, and `wt.loss` +* $\beta_1,\ldots,\beta_p$ are the estimated model coefficients + +Each covariate receives one coefficient. 
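+
+Because the baseline hazard $h_0(t)$ is the same for every patient, it cancels when we compare two patients, leaving only the covariates and coefficients. The following minimal sketch, using made-up coefficients for `age`, `sex`, and `ph.ecog` (illustrative values, not fitted results), shows how a relative hazard between two hypothetical patients could be computed:
+
+[source,python]
+----
+import numpy as np
+
+# Made-up illustrative coefficients for age, sex, ph.ecog
+beta = np.array([0.02, -0.60, 0.90])
+
+patient_a = np.array([74, 1, 1])   # age 74, male, ph.ecog = 1
+patient_b = np.array([56, 1, 0])   # age 56, male, ph.ecog = 0
+
+# h_0(t) cancels in the ratio, leaving exp(beta . (x_a - x_b))
+relative_hazard = np.exp(beta @ (patient_a - patient_b))
+print(round(relative_hazard, 2))   # ~3.53, i.e., patient A's hazard is about 3.5 times patient B's
+----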
+
+**Example model using our dataset**
+
+Suppose our Cox model includes `age`, `sex`, and `ph.ecog`.
+The fitted model is written as:
+
+$h(t) = h_0(t)\exp(\beta_1(\text{age}) + \beta_2(\text{sex}) + \beta_3(\text{ph.ecog}))$
+
+Each coefficient represents the **log hazard ratio** of that variable while holding all others constant.
+
+**Log hazard coefficients**
+
+Example coefficient output:
+
+[cols="3,3,4"]
+|===
+| *Covariate* | *Log hazard coefficient ($\beta$)* | *Interpretation*
+
+| Age
+| 0.02
+| Higher age increases hazard
+
+| Sex (female vs male)
+| -0.60
+| Lower hazard for females
+
+| ph.ecog
+| 0.90
+| Worse ECOG score increases hazard
+|===
+
+**From coefficients to hazard ratios**
+
+Coefficients are converted to hazard ratios using:
+
+$HR = e^{\beta}$
+
+**Why do we take the exponent of the log hazard coefficient?**
+
+Because the model uses an exponential function:
+
+* Coefficients are estimated on the **log scale**.
+
+* A one-unit increase in a predictor changes the **log of the hazard**, not the hazard directly.
+
+* Log values are more difficult to interpret in a real-world context.
+
+* This transformation produces a **hazard ratio (HR)**, which represents the **relative change in risk** associated with a one-unit increase in a predictor.
+
+* Exponentiating the coefficients allows us to express model results in intuitive terms, such as percent increases or decreases in risk, making them easier to understand and communicate.
+
+
+**Hazard ratios**
+
+[cols="3,3,4"]
+|===
+| *Covariate* | *Hazard ratio (HR)* | *Interpretation*
+
+| Age
+| 1.02
+| Each additional year of age is associated with a ~2% increase in hazard.
+
+| Sex (female)
+| 0.55
+| Females have a ~45% lower hazard compared to males.
+
+| ph.ecog
+| 2.46
+| Each one-unit increase in ECOG score is associated with a ~146% increase in hazard (more than double the risk).
+|===
+
+
+**How to interpret hazard ratios**
+
+* $HR = 1$ → no difference in hazard
+* $HR > 1$ → increased hazard
+* $HR < 1$ → decreased hazard
+
+
+**Applying the model to real patients**
+
+Using rows from `survival_data`:
+
+* Patient 0: age = 74, sex = 1 (male), ph.ecog = 1
+* Patient 2: age = 56, sex = 1 (male), ph.ecog = 0
+
+Because Patient 0 is older and has worse ECOG performance, the Cox model predicts:
+
+*Patient 0 has a higher hazard of death than Patient 2.*
+
+**Sample Cox output**
+
+[cols="3,2,2,2,3"]
+|===
+| *Covariate* | *Coefficient ($\beta$)* | *Std Error* | *p-value* | *HR (95% CI)*
+
+| Age
+| 0.020
+| 0.008
+| 0.010
+| 1.02 (1.01 – 1.03)
+
+| Sex
+| -0.60
+| 0.25
+| 0.014
+| 0.55 (0.34 – 0.88)
+
+| ph.ecog
+| 0.90
+| 0.28
+| 0.002
+| 2.46 (1.42 – 4.28)
+|===
+
+**Statistical significance**
+
+* p-value < 0.05 → statistically significant
+* Confidence intervals:
+
+  - CI includes 1 → not significant
+  - CI above 1 → increased hazard
+  - CI below 1 → decreased hazard
+
+Example:
+
+$HR(\text{ph.ecog}) = 2.46$, CI $(1.42,4.28)$
+
+Because the CI does not include 1 and the p-value is < 0.05, ECOG performance status is a **significant predictor of mortality**.
+
+**Key takeaways**
+
+* Cox coefficients are **log hazard ratios**.
+* Convert coefficients with $HR = e^{\beta}$.
+* Interpretation:
+
+  - $HR > 1$ → higher hazard
+  - $HR < 1$ → lower hazard
+  - $HR = 1$ → no effect
+
+
+.Deliverables
+====
+
+**5a. Use the following covariates (age, sex, ph.ecog, ph.karno, pat.karno, meal.cal, and wt.loss) to fit a Cox regression model and display the coefficient summary table.**
+
+See documentation on `CoxPHFitter` https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html[here.]
+
+[source,python]
+----
+from lifelines import CoxPHFitter
+
+# Prepare modeling dataset
+cox_data = survival_data[['time', 'status', 'age', 'sex',
+                          'ph.ecog', 'ph.karno',
+                          'pat.karno', 'meal.cal', 'wt.loss']]
+
+# Fit Cox model
+cph = CoxPHFitter()
+cph.fit(cox_data, duration_col='time', event_col='status')
+
+# For YOU to do: display the coefficient summary table
+----
+
+**5b. Create a table that summarizes the hazard ratios, p-values, percent change in hazard, and direction of effect for each predictor in the Cox model by completing and running the code below. Then display the generated table.**
+
+[source,python]
+----
+import pandas as pd
+import numpy as np
+
+# Extract coefficients and p-values
+hr_table = cph.summary[['coef', 'p']].reset_index()
+hr_table.columns = ['Predictor', 'Log_Hazard_Ratio (coef)', 'p_value']
+
+# Compute hazard ratio from the log hazard coefficient
+hr_table['Hazard_Ratio (HR)'] = np.___(hr_table['Log_Hazard_Ratio (coef)']) # For YOU to fill in
+
+# Percent change in hazard implied by each hazard ratio
+hr_table['Percent_Change_in_Hazard'] = (hr_table['Hazard_Ratio (HR)'] - 1) * 100
+
+# Fill in the Effect column:
+# * If the hazard ratio is greater than 1, fill in whether it increases or decreases hazard.
+# * If the hazard ratio is less than 1, fill in whether it increases or decreases hazard.
+hr_table['Effect'] = np.where(
+    hr_table['Hazard_Ratio (HR)'] >= ____, # For YOU to fill in
+    '________', # For YOU to fill in
+    '_______' # For YOU to fill in
+)
+
+# Round values for display
+hr_table[['Log_Hazard_Ratio (coef)',
+          'Hazard_Ratio (HR)',
+          'Percent_Change_in_Hazard',
+          'p_value']] = (
+    hr_table[['Log_Hazard_Ratio (coef)',
+              'Hazard_Ratio (HR)',
+              'Percent_Change_in_Hazard',
+              'p_value']]
+    .round(3)
+)
+
+hr_table
+----
+
+**5c. Write 1–2 sentences explaining which variables significantly increase or decrease patient hazard and what these results suggest about survival risk in the lung cancer dataset.**
+
+====
+
+=== Question 6 - Cox Proportional Hazards Model Assumptions (2 points)
+
+The Cox proportional hazards model relies on several important assumptions.
+These assumptions help ensure that the hazard ratios estimated by the model are meaningful and valid.
+
+
+**Assumption 1 – Proportional Hazards**
+
+The **proportional hazards assumption** states that the hazard ratios for predictors are **constant over time**.
+
+This means the effect of a predictor on risk is assumed to be stable across the entire follow-up period. The proportional hazards assumption can be tested statistically using tests such as the **Schoenfeld residual test**, which produces a **p-value** for each predictor.
+
+* A **large p-value (typically > 0.05)** suggests that the assumption holds — there is **no evidence** that the effect of the predictor changes over time.
+* A **small p-value (typically < 0.05)** suggests that the assumption may be violated — the predictor’s effect **may vary across time**, meaning the reported hazard ratio may not be valid.
+
+
+**Assumption 2 – Linearity of Continuous Predictors**
+
+The Cox model assumes that **continuous predictors have a linear relationship with the log hazard**.
+
+This means:
+
+* A one-unit increase in a continuous variable corresponds to a constant proportional change in hazard.
+* The effect of a predictor increases or decreases smoothly rather than abruptly changing at different values.
+
+If a variable has a highly non-linear effect on risk, transformations or modeling adjustments may be needed.
+
+
+**Assumption 3 – Independent (Non-Informative) Censoring**
+
+We assume that **censoring is independent of survival**, after accounting for observed covariates.
+
+This means:
+
+* Whether a patient is censored should not depend on their unobserved death time.
+* Censored patients are assumed to have the **same survival prospects as non-censored patients with similar covariate values**.
+
+For a detailed discussion of the Cox model assumptions and how to assess them, see standard survival analysis references such as Klein & Moeschberger, _Survival Analysis: Techniques for Censored and Truncated Data_; students are encouraged to consult this text for deeper explanation and examples.
+
+
+.Deliverables
+====
+
+**6a. One key assumption of the Cox model is that the effect of each predictor on patient risk stays constant over time. To evaluate this for the variable age, use the Schoenfeld residual test below and examine the resulting p-value. After viewing the output, write 1 sentence stating whether the proportional hazards assumption seems reasonable for age.**
+
+[source,python]
+----
+from lifelines.statistics import proportional_hazard_test
+
+# Test proportional hazards for age
+results = proportional_hazard_test(cph, cox_data, time_transform='rank')
+print(results.summary.loc[['age']])
+----
+
+**6b. Using the scatter plot below for the continuous predictor *age*, briefly explain whether assuming a linear relationship between age and the *log hazard* appears reasonable. Support your answer with 1–2 sentences describing any visible patterns in the plot.**
+
+_Note: The plot below shows follow-up time rather than the log hazard itself, so treat it only as a rough visual check of how risk varies with age._
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+# Example using age
+plt.scatter(survival_data["age"], survival_data["time"], alpha=0.5)
+plt.xlabel("____") # For YOU to fill in
+plt.ylabel("____") # For YOU to fill in
+plt.title("_____") # For YOU to fill in
+plt.grid(alpha=0.3)
+plt.show()
+----
+
+**6c. In 2–3 sentences, describe what a violation of the Cox proportional hazards model assumptions would imply for interpreting the hazard ratio.**
+
+====
+
+
+== References
+
+* Klein, J. P., & Moeschberger, M. L. (2003). _Survival Analysis: Techniques for Censored and Truncated Data_. Springer.
+
+* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). _An Introduction to Statistical Learning: with Applications in Python_. Springer Texts in Statistics. Springer. (See the chapter on survival analysis and censored data.)
+
+
+== Submitting your Work
+
+Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
+
+
+.Items to submit
+====
+- firstname_lastname_project1.ipynb
+====
+
+[WARNING]
+====
+You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
+
+You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
+====
\ No newline at end of file