cmu-mlip-model-testing-lab

Lab 4: Model Testing with Weights & Biases and LLMs

In this lab, you'll gain hands-on experience using Weights & Biases (W&B) for interactive model evaluation and large language models (LLMs) for generating targeted test cases. You will run a candidate sentiment model alongside a baseline, slice the predictions to uncover failure modes, log everything to W&B, and then stress-test a weak slice with synthetic prompts generated by an LLM.

Deliverables

Your goal is to act like an ML engineer preparing a model for deployment: justify your slices, inspect slice performance in W&B, and validate a weakness with synthetic data. To receive credit, you must:

Run Steps 1–4 and define at least five hypothesis-driven slices. Each slice should capture a specific property of the tweets (hashtags, negation, emoji density, unusual length, presence of mentions, etc.), and you should be able to explain why that slice matters to model behavior.
Log to W&B and walk the TA through your analysis. Ensure df_long, slice_metrics, regression_metrics, and df_eval are logged, build comparative visualizations of your choice for the slices, and use the notebook to answer “Why can accuracy be misleading?” and “What did slicing reveal?” during your discussion.
Complete the targeted stress test (Step 7) and discuss it with the TA. Paste your hypothesis and 10 LLM-generated tweets in the notebook, run the helper that scores them, interpret any repeated or new failures, and explain whether that changes your confidence in deploying the candidate model.
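
The slicing idea behind the first two deliverables can be sketched in a few lines. This is a minimal, standard-library-only illustration, not the notebook's own helper: the slice names, the regular expressions, the length thresholds, and the six toy rows are all made up for the example, and the notebook works with pandas DataFrames instead of plain dictionaries. It shows two models with identical overall accuracy that behave very differently on a negation slice, which is one reason accuracy alone can be misleading.

```python
import re

# Hypothetical slice predicates (assumptions, not taken from the notebook).
# Each one maps a tweet's text to True/False.
SLICES = {
    "has_hashtag": lambda t: "#" in t,
    "has_mention": lambda t: re.search(r"@\w+", t) is not None,
    "has_negation": lambda t: re.search(r"\b(not|no|never)\b|n't\b", t.lower()) is not None,
    "very_short": lambda t: len(t.split()) <= 3,
    "very_long": lambda t: len(t.split()) >= 30,
}


def accuracy(rows, pred_key):
    """Fraction of rows where the prediction under pred_key matches the label."""
    if not rows:
        return None  # an empty slice has no defined accuracy
    return sum(r[pred_key] == r["label"] for r in rows) / len(rows)


def slice_report(rows):
    """Per-slice size and accuracy for the baseline and candidate predictions."""
    report = {}
    for name, predicate in SLICES.items():
        members = [r for r in rows if predicate(r["text"])]
        report[name] = {
            "n": len(members),
            "baseline": accuracy(members, "baseline"),
            "candidate": accuracy(members, "candidate"),
        }
    return report


# Made-up toy rows, for illustration only (not drawn from tweets.csv).
rows = [
    {"text": "I love this phone", "label": "positive", "baseline": "positive", "candidate": "positive"},
    {"text": "Great service today #happy", "label": "positive", "baseline": "positive", "candidate": "positive"},
    {"text": "This is not good at all", "label": "negative", "baseline": "negative", "candidate": "positive"},
    {"text": "I don't like the update", "label": "negative", "baseline": "positive", "candidate": "positive"},
    {"text": "@support thanks for the quick fix", "label": "positive", "baseline": "negative", "candidate": "positive"},
    {"text": "Worst day ever", "label": "negative", "baseline": "negative", "candidate": "negative"},
]

overall = {m: accuracy(rows, m) for m in ("baseline", "candidate")}
report = slice_report(rows)

print("overall:", overall)
for name, stats in report.items():
    print(name, stats)
```

On this toy data both models get four of six tweets right overall, yet the candidate misses both negated tweets while the baseline gets one of them. A gap like that is the kind of hypothesis worth carrying into the Step 7 stress test.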

For every slice you log, keep a short note in the notebook (e.g., the saved_slice_notes list) so the TA can see your takeaways without rerunning the code.
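
One way to keep notes next to the numbers is to log them in the same W&B table as the slice metrics. The sketch below is an assumption-laden illustration and may not match how lab4.ipynb structures its data: the nested-dictionary shape of the metrics, the project name lab4-model-testing, the column names, and the example accuracies are all hypothetical. It uses wandb.init, wandb.Table, run.log, and run.finish, and defaults to offline mode so nothing is uploaded until you choose to sync.

```python
def metrics_to_rows(slice_metrics, notes):
    """Flatten {slice: {model: accuracy}} into table rows with a note per slice."""
    rows = []
    for slice_name, per_model in slice_metrics.items():
        for model, acc in per_model.items():
            rows.append([slice_name, model, acc, notes.get(slice_name, "")])
    return rows


def log_slice_table(rows, project="lab4-model-testing", mode="offline"):
    """Log the rows as a W&B table. The project name is a placeholder."""
    import wandb  # imported lazily so metrics_to_rows works without wandb installed

    run = wandb.init(project=project, mode=mode)
    table = wandb.Table(columns=["slice", "model", "accuracy", "note"], data=rows)
    run.log({"slice_metrics": table})
    run.finish()


# Hypothetical values for illustration only.
example_metrics = {
    "has_negation": {"baseline": 0.5, "candidate": 0.0},
    "has_mention": {"baseline": 0.0, "candidate": 1.0},
}
example_notes = {
    "has_negation": "Candidate misses negated sentiment; stress-test this slice.",
    "has_mention": "Candidate handles @mentions better than the baseline.",
}

example_rows = metrics_to_rows(example_metrics, example_notes)
print(example_rows)

# Uncomment after running `wandb login` (or keep mode="offline"):
# log_slice_table(example_rows)
```

Keeping the note in the same row as the metric means the takeaway travels with the table when the TA opens the run, without rerunning any cells.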

Getting started

Clone this repository: https://github.com/nikitachaudharicodes/cmu-mlip-model-testing-lab/tree/main
Open lab4.ipynb in your preferred notebook environment (Jupyter, Colab, VS Code, etc.).
Run the cells sequentially. The notebook is split into seven steps that mirror the deliverables above.

Installation instructions

Recommended Python version: 3.10+ (the notebook also works with Python ≥ 3.7).

Install the dependencies:

pip install --upgrade wandb datasets transformers evaluate tqdm emoji regex pandas pyarrow scikit-learn nbformat torch

Log in to W&B

Create a free account at https://wandb.ai using your CMU email.
Copy the API key from https://wandb.ai/authorize.
Run wandb login in the terminal (outside the notebook) and paste the key when prompted.

References

W&B slicing and tables guide: https://docs.wandb.ai/guides/app/features/panels/
