In this lab, you'll gain hands-on experience using Weights & Biases (W&B) for interactive model evaluation and using large language models (LLMs) to generate targeted test cases. You will run a candidate sentiment model alongside a baseline, slice the predictions to uncover failure modes, log everything to W&B, and then stress-test a weak slice with synthetic tweets generated by an LLM.
Your goal is to act like an ML engineer preparing a model for deployment: justify your slices, inspect slice performance in W&B, and validate a weakness with synthetic data. To receive credit, you must:
- Run Steps 1–4 and define at least five hypothesis-driven slices. Each slice should capture a specific property of the tweets (hashtags, negation, emoji density, unusual length, presence of mentions, etc.), and you should be able to explain why that slice matters to model behavior.
- Log to W&B and walk the TA through your analysis. Ensure df_long, slice_metrics, regression_metrics, and df_eval are logged, build comparative visualizations of your choice for the slices, and use the notebook to answer “Why can accuracy be misleading?” and “What did slicing reveal?” during your discussion.
- Complete the targeted stress test (Step 7) and discuss it with the TA. Paste your hypothesis and 10 LLM-generated tweets in the notebook, run the helper that scores them, interpret any repeated or new failures, and explain whether that changes your confidence in deploying the candidate model.
For every slice you log, keep a short note in the notebook (e.g., the saved_slice_notes list) so the TA can see your takeaways without rerunning the code.
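To make the slicing deliverable concrete, here is a minimal sketch of hypothesis-driven slices and per-slice accuracy on a toy set of tweets. It is not the notebook's code: the slice names, the regular expressions, the toy rows, and the format of the saved_slice_notes entry are all illustrative assumptions, and it uses only the standard library so it runs anywhere.

```python
import re
from collections import defaultdict

# Hypothetical slice predicates: each maps a tweet's text to True/False.
NEGATION = re.compile(r"\b(?:not|no|never|cannot)\b|n't\b", re.IGNORECASE)
# Rough emoji approximation (assumption); the emoji package is more thorough.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

SLICES = {
    "has_hashtag": lambda t: "#" in t,
    "has_mention": lambda t: re.search(r"@\w+", t) is not None,
    "has_negation": lambda t: NEGATION.search(t) is not None,
    "has_emoji": lambda t: EMOJI.search(t) is not None,
    "very_short": lambda t: len(t.split()) <= 3,
}


def slice_accuracy(rows):
    """Return size and accuracy for the full set and for each non-empty slice."""
    hits = defaultdict(int)
    sizes = defaultdict(int)
    for row in rows:
        correct = row["pred"] == row["label"]
        members = ["overall"] + [n for n, f in SLICES.items() if f(row["text"])]
        for name in members:
            sizes[name] += 1
            hits[name] += correct  # True counts as 1, False as 0
    return {n: {"n": sizes[n], "accuracy": hits[n] / sizes[n]} for n in sizes}


# Toy predictions (made up for illustration, not real model output).
rows = [
    {"text": "I love this phone 😍", "label": "pos", "pred": "pos"},
    {"text": "Great day at the park #sunny", "label": "pos", "pred": "pos"},
    {"text": "Thanks @anna for the help", "label": "pos", "pred": "pos"},
    {"text": "This is not good at all", "label": "neg", "pred": "pos"},
    {"text": "I can't say I hated it", "label": "pos", "pred": "neg"},
    {"text": "Worst service ever", "label": "neg", "pred": "neg"},
]

metrics = slice_accuracy(rows)
for name, m in metrics.items():
    print(f"{name:13s} n={m['n']}  accuracy={m['accuracy']:.2f}")

# One possible shape for a note entry (assumed, adapt to the notebook's list).
saved_slice_notes = [
    "has_negation: both toy examples are wrong while overall accuracy looks fine; "
    "negation may flip the prediction."
]
```

In this toy data the overall accuracy is 4 out of 6, yet the negation slice gets every example wrong, which is the kind of gap an aggregate metric hides. With real data the slice sizes matter too: a slice with only a handful of rows gives a noisy estimate.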
- Clone this repository: https://github.com/nikitachaudharicodes/cmu-mlip-model-testing-lab/tree/main
- Open lab4.ipynb in your preferred notebook environment (Jupyter, Colab, VS Code, etc.).
- Run the cells sequentially. The notebook is split into seven steps that mirror the deliverables above.
- Recommended Python version: 3.10+ (the notebook also works with Python ≥ 3.7).
- Install the dependencies:
pip install --upgrade wandb datasets transformers evaluate tqdm emoji regex pandas pyarrow scikit-learn nbformat torch
- Create a free account at https://wandb.ai using your CMU email.
- Copy the API key from https://wandb.ai/authorize.
- Run wandb login in the terminal (outside the notebook) and paste the key when prompted.
- W&B slicing and tables guide: https://docs.wandb.ai/guides/app/features/panels/
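
Once you are logged in, logging the four required tables could look roughly like the sketch below. It assumes df_long, slice_metrics, regression_metrics, and df_eval are pandas DataFrames already built in the notebook; the helper names and the project name are placeholders, and the notebook may already provide its own logging cells.

```python
REQUIRED_TABLES = ("df_long", "slice_metrics", "regression_metrics", "df_eval")


def missing_tables(frames):
    """Return the required table names that are absent from `frames`."""
    return [name for name in REQUIRED_TABLES if name not in frames]


def log_required_tables(frames, project="mlip-model-testing-lab", mode="online"):
    """Log each required DataFrame to W&B as a table.

    `frames` maps table name -> pandas DataFrame. The project name is a
    placeholder (assumption); use whatever project your lab run belongs to.
    """
    missing = missing_tables(frames)
    if missing:
        raise ValueError(f"Not ready to log, missing tables: {missing}")

    # Imported here so the check above can run even where wandb is not installed.
    import wandb

    run = wandb.init(project=project, mode=mode)
    for name in REQUIRED_TABLES:
        # wandb.Table accepts a DataFrame directly via the `dataframe` argument.
        run.log({name: wandb.Table(dataframe=frames[name])})
    run.finish()
```

In the notebook you would call it as log_required_tables({"df_long": df_long, "slice_metrics": slice_metrics, "regression_metrics": regression_metrics, "df_eval": df_eval}). Passing mode="offline" records the run locally without contacting the W&B servers, which can be handy for a dry run before the TA walkthrough.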