cmu-mlip-model-testing-lab

Lab 4: Model Testing with Weights & Biases and LLMs

In this lab, you'll gain hands-on experience using Weights & Biases (W&B) for interactive model evaluation and LLMs to generate targeted test cases. You will run a candidate sentiment model alongside a baseline, slice the predictions to uncover failure modes, log everything to W&B, and then stress-test a weak slice with synthetic prompts generated by a Large Language Model.

Deliverables

Your goal is to act like an ML engineer preparing a model for deployment: justify your slices, inspect slice performance in W&B, and validate a weakness with synthetic data. To receive credit, you must:

  1. Run Steps 1–4 and define at least five hypothesis-driven slices. Each slice should capture a specific property of the tweets (hashtags, negation, emoji density, unusual length, presence of mentions, etc.), and you should be able to explain why that slice matters to model behavior.
  2. Log to W&B and walk the TA through your analysis. Ensure df_long, slice_metrics, regression_metrics, and df_eval are logged, build comparative visualizations of your choice for the slices, and use the notebook to answer “Why can accuracy be misleading?” and “What did slicing reveal?” during your discussion.
  3. Complete the targeted stress test (Step 7) and discuss it with the TA. Paste your hypothesis and 10 LLM-generated tweets in the notebook, run the helper that scores them, interpret any repeated or new failures, and explain whether that changes your confidence in deploying the candidate model.
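The hypothesis-driven slices in item 1 can be expressed as simple predicates over the tweet text. A minimal sketch of the idea (the function names, the negation word list, and the length threshold are illustrative assumptions, not the lab's starter code):

```python
import re

# Hypothetical slice predicates for tweet text; a real slice definition
# might use richer word lists or tokenization.
def has_hashtag(text):
    return "#" in text

def has_negation(text):
    # crude negation check over a small, assumed word list
    return bool(re.search(r"\b(not|no|never)\b|n't", text.lower()))

def is_unusually_long(text, threshold=200):
    return len(text) > threshold

def slice_accuracy(rows, predicate):
    """Accuracy restricted to (text, label, prediction) rows in the slice."""
    hits = [(label == pred) for text, label, pred in rows if predicate(text)]
    return sum(hits) / len(hits) if hits else None

rows = [
    ("I love this #movie", 1, 1),
    ("This is not good at all", 0, 1),  # negation often trips sentiment models
    ("never buying again", 0, 0),
]
print(slice_accuracy(rows, has_negation))  # accuracy on the negation slice only
```

Comparing a slice's accuracy against the overall accuracy is what surfaces the failure modes item 2 asks you to discuss.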

For every slice you log, keep a short note in the notebook (e.g., the saved_slice_notes list) so the TA can see your takeaways without rerunning the code.
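The Step 7 scoring helper lives in the notebook; the core idea can be sketched as a small self-contained scorer. Everything here is hypothetical: `naive_predict` stands in for the candidate model, and the tweets mimic LLM-generated examples targeting a negation weakness.

```python
# Hypothetical stress-test scorer for Step 7.
def score_stress_set(tweets, expected, predict):
    """Return (accuracy, failing tweets) for a synthetic slice."""
    failures = [t for t, y in zip(tweets, expected) if predict(t) != y]
    accuracy = 1 - len(failures) / len(tweets)
    return accuracy, failures

# Toy "model" that keys on positive words and ignores negation,
# mimicking a common sentiment-model failure mode.
def naive_predict(text):
    return 1 if "good" in text.lower() or "love" in text.lower() else 0

tweets = ["not good at all #fail", "I love this", "never good, honestly"]
expected = [0, 1, 0]
acc, failures = score_stress_set(tweets, expected, naive_predict)
print(acc, failures)
```

If most synthetic tweets in the targeted slice fail, that confirms the hypothesis and should lower your confidence in deploying the candidate model.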

Getting started

Installation instructions

  • Recommended Python version: 3.10+ (the notebook also works with Python ≥ 3.7).
  • Install the dependencies:
    pip install --upgrade wandb datasets transformers evaluate tqdm emoji regex pandas pyarrow scikit-learn nbformat torch

Login to W&B

  1. Create a free account at https://wandb.ai using your CMU email.
  2. Copy the API key from https://wandb.ai/authorize.
  3. Run wandb login in the terminal (outside the notebook) and paste the key when prompted.

