feat: updating ml bits #36

Merged
eschmidt42 merged 5 commits into main from feat/2025-09-13-updating-ml-bits on Sep 14, 2025

@eschmidt42
Owner

This pull request refactors the data processing and modeling workflows in the 04_poll_clustering.ipynb and 05_predicting_votes.ipynb notebooks. The main changes are: switching from pandas to polars for data manipulation, standardizing file loading via helper functions, updating the feature engineering and modeling pipelines to use polars-native operations, and adding a grid search over num_topics to the topic modeling workflow for poll clustering. The changes are intended to improve the performance, consistency, and maintainability of the notebooks.

Data Loading and Preprocessing Improvements

  • Switched from pandas to polars for all dataframe operations, including reading parquet files and data transformations, for improved performance and consistency. [1] [2] [3] [4] [5]
  • Standardized file path resolution using helper functions like get_polls_parquet_path, get_votes_parquet_path, and get_mandates_parquet_path for all relevant datasets. [1] [2] [3] [4]

Feature Engineering and Data Transformation

  • Replaced pandas-style .pipe and .apply feature engineering with polars-native .with_columns and .map_elements for NLP preprocessing and topic modeling, using functools.partial to bind the arguments of the text-cleaning function. [1] [2]
  • Updated party name normalization and unique party extraction to use polars conditional logic. [1] [2]

Poll Clustering and Topic Modeling Workflow

  • Enhanced LDA topic modeling workflow: introduced a grid search for optimal num_topics using coherence and perplexity metrics, visualized results with plotnine, and updated downstream transformations to use polars. [1] [2] [3]
  • Refactored topic feature extraction and aggregation for visualization of topic weights over time, using polars group_by and aggregation methods.

Vote Prediction Pipeline Updates

  • Updated all modeling, embedding, and plotting function calls to use polars dataframes and refactored function imports for clarity. [1] [2] [3] [4] [5] [6] [7] [8] [9]
  • Improved embedding extraction and visualization by converting embeddings to polars dataframes and standardizing party color mappings for plots.

These changes collectively modernize and optimize the data science workflows in both notebooks, making them faster, more maintainable, and easier to extend.

@eschmidt42 merged commit dcfed21 into main on Sep 14, 2025
4 checks passed
@codecov

codecov Bot commented Sep 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.88%. Comparing base (d50a8ed) to head (f696e87).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
- Coverage   94.90%   94.88%   -0.03%     
==========================================
  Files          27       27              
  Lines        1433     1426       -7     
==========================================
- Hits         1360     1353       -7     
  Misses         73       73              

@eschmidt42 deleted the feat/2025-09-13-updating-ml-bits branch on September 14, 2025, 09:21