Merged
14 changes: 14 additions & 0 deletions .github/copilot-instructions.md
@@ -81,6 +81,20 @@ rsync -a /tmp/lahmans-git-work/.git/objects/ $PROJ/.git/objects/

`git checkout` in the project dir will fail — files are correct but the local branch pointer may lag.

## Interactive R Sessions (Analysis Development)

When developing analysis scripts or iterating on charts, use an **interactive R session** instead of re-running the full script each time:

1. Start R in async mode: `bash mode="async" command="R --no-save"`
2. Source shared setup (DB connection, libraries) once
3. Send individual code blocks via `write_bash` to iterate on specific charts or queries
4. Use the `view` tool on saved PNG files to inspect chart output visually
5. Only assemble the final `.R` script once the individual pieces are working

This avoids the 60-90 second penalty of re-running a full analysis script on every change and enables tight visual feedback loops.
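
Steps 3 to 5 might look like the following once the session is running. This is a minimal sketch, not the project's code: `save_chart()` and the temp-file path are hypothetical names, and the built-in `mtcars` data stands in for a real query result.

```r
# Hypothetical helper: render one chart to a PNG so it can be inspected
# with the `view` tool, without re-running the whole analysis script.
save_chart <- function(path, draw, width = 800, height = 600) {
  grDevices::png(path, width = width, height = height)
  on.exit(grDevices::dev.off(), add = TRUE)  # close the device even if draw() fails
  draw()
  invisible(path)
}

# Iterate on a single chart; `mtcars` stands in for a query result.
chart_path <- file.path(tempdir(), "scratch_chart.png")
if (capabilities("png")[[1]]) {
  save_chart(chart_path, function() {
    plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "MPG")
  })
}
```

Re-sending only the `save_chart()` call after each tweak keeps the DB connection and loaded libraries alive between iterations.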

**DuckDB CLI for ad-hoc queries:** Use `duckdb ~/Documents/Data/baseball/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code.

## R CMD Check

- Non-ASCII characters (em-dashes, box-drawing) in R source cause WARNING — use ASCII `--`.
1 change: 1 addition & 0 deletions .gitignore
@@ -33,6 +33,7 @@ tests/testthat/_snaps/
# Copilot CLI config (environment-specific, not for contributors)
.github/lsp.json
.copilot/mcp-config.json
.mcp.json

# Old scratch notebooks (superseded by analysis/)
inst/notebooks/
14 changes: 14 additions & 0 deletions AGENTS.md
@@ -84,6 +84,20 @@ rsync -a /tmp/lahmans-git-work/.git/objects/ $PROJ/.git/objects/

`git checkout` in the project dir will fail — files are correct but the local branch pointer may lag.

## Interactive R Sessions (Analysis Development)

When developing analysis scripts or iterating on charts, use an **interactive R session** instead of re-running the full script each time:

1. Start R in async mode: `bash mode="async" command="R --no-save"`
2. Source shared setup (DB connection, libraries) once
3. Send individual code blocks via `write_bash` to iterate on specific charts or queries
4. Use the `view` tool on saved PNG files to inspect chart output visually
5. Only assemble the final `.R` script once the individual pieces are working

This avoids the 60-90 second penalty of re-running a full analysis script on every change and enables tight visual feedback loops.

**DuckDB CLI for ad-hoc queries:** Use `duckdb ~/Documents/Data/baseball/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code.

## R CMD Check

- Non-ASCII characters (em-dashes, box-drawing) in R source cause WARNING — use ASCII `--`.
13 changes: 9 additions & 4 deletions DESCRIPTION
@@ -1,11 +1,16 @@
Package: lahmanTools
Title: Baseball Analytics with Lahman and DuckDB
-Version: 0.1.0
+Version: 0.2.0
Authors@R: person("David", "Lucey", role = c("aut", "cre"),
email = "david@example.com")
-Description: Provides a persistent DuckDB database populated with all Lahman
-baseball tables and supplemental MLB salary data scraped from USA Today
-(2017+). Includes helpers to connect, rebuild, and extend the database.
+Description: Loads all Sean Lahman baseball tables (1871-2025) into a persistent
+DuckDB database and exposes pre-built sabermetric SQL views (BattingStats,
+PitchingStats, SalaryPerWAR, etc.). Optionally extends salary coverage to
+2017-2025 from USA Today and Spotrac, and supplements with FanGraphs WAR
+(1985+) and Chadwick Bureau player ID crosswalk via the baseballr package.
+Includes write_mcp_config() to connect the database to GitHub Copilot CLI
+or Claude via a local DuckDB MCP server. No third-party data is bundled;
+all supplemental data is fetched at runtime.
License: MIT + file LICENSE
Encoding: UTF-8
Depends: R (>= 4.1.0)
7 changes: 7 additions & 0 deletions NAMESPACE
@@ -5,8 +5,15 @@ export(connect_baseball_db)
export(create_stats_views)
export(db_query)
export(dt_factors_to_char)
export(load_chadwick_ids)
export(load_fangraphs_war)
export(load_statcast)
export(match_player_ids)
export(normalise_player_name)
export(scrape_salaries)
export(setup_baseball_db)
export(team_name_map)
export(write_mcp_config)
importFrom(data.table,":=")
importFrom(data.table,.SD)
importFrom(data.table,as.data.table)
52 changes: 51 additions & 1 deletion NEWS.md
@@ -1,4 +1,54 @@
-# lahmanTools (development version)
+# lahmanTools 0.2.0

## New features

* Three new runtime data loaders in `R/loaders.R`:
- `load_chadwick_ids(con)` -- downloads the Chadwick Bureau player ID
crosswalk via `baseballr` and writes it as `ChadwickIDs` to DuckDB.
Creates `PlayerIDs` view joining Lahman `playerID` to MLBAM, FanGraphs,
Retrosheet and Baseball Reference IDs. Licensed ODC-BY 1.0 (attribution
required).
- `load_fangraphs_war(con, years)` -- fetches FanGraphs batter and pitcher
WAR leaderboards (batting 1871+, pitching 1985+) and creates `PlayerWAR`
and `SalaryPerWAR` views. Requires `ChadwickIDs` for the FanGraphs-to-Lahman
join. `SalaryPerWAR` includes a `war_reliable` flag (TRUE for all rows in
the salary era 1985+; retained for backward compatibility).
- `load_statcast(con, years)` -- fetches Baseball Savant pitch-level data
(2015+ only, ~700 MB/season) and creates `StatcastSeason` batter aggregates
(exit velocity, launch angle, hard-hit rate, xBA, xwOBA).

* `setup_baseball_db()` gains three new parameters:
- `load_chadwick = FALSE` -- pass `TRUE` to load the Chadwick crosswalk
during initial database build.
- `load_war = FALSE` -- pass `TRUE` to also fetch FanGraphs WAR (implies
`load_chadwick`).
- `war_years = 1985:2025` -- seasons to fetch for WAR data.

* `baseballr` added to `Suggests`; required only by the three new loaders.

* `write_mcp_config()` -- generates the JSON config entry needed to connect
GitHub Copilot CLI or Claude Code to `baseball.duckdb` via a local DuckDB
MCP server. Resolves `~` to an absolute path (required by Python-based MCP
servers), merges into an existing config without clobbering other server
entries, and always enforces `--readonly`. Defaults to `dry_run = TRUE` so
nothing is written until the user opts in.

* Three new analytical views created by `create_stats_views()` / `setup_baseball_db()`:
- `PlayerAcquisitionType` -- one row per player-team; `acq_type` column
classifies as `homegrown` (debut year = first year with team),
`young_acq` (arrived post-debut, age < 26), or `veteran_acq`.
Eliminates the repeated 3-CTE acquisition-classification pattern in
analysis queries.
- `LeagueMedianSalary` -- `med_sal`, `avg_sal`, `n_players` by season from
`SalariesAll`. Use `salary / med_sal` for relative-salary normalisation.
- `TeamPayroll` -- `total_salary`, `n_players`, `median_salary`, `max_salary`
by team-season from `SalariesAll`. Was documented in README but missing
from the code; now implemented.

* `era_label(yr)` SQL macro registered by `create_stats_views()`. Replaces
the repeated `CASE WHEN yearID <= 2002 THEN 'Pre-Moneyball' ...` block in
every analysis query. Returns `'Pre-Moneyball'`, `'Moneyball'`, `'Big Data'`,
or `NULL` for years outside 1998-present.
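
The documented behaviour of `era_label()` can be mirrored in plain R for quick checks outside the database. The sketch below is an illustration under stated assumptions: only the `<= 2002` cut-off and the 1998 lower bound come from the entry above; the function name `era_label_r()` and the `big_data_start` boundary (default 2012) are hypothetical and may not match the SQL macro.

```r
# Plain-R mirror of the era_label(yr) SQL macro (illustrative only).
# ASSUMPTION: the Moneyball / Big Data boundary (`big_data_start`) is a
# placeholder; it is not stated in the release notes.
era_label_r <- function(yr, big_data_start = 2012L) {
  out <- rep(NA_character_, length(yr))          # NA plays the role of SQL NULL
  out[which(yr >= 1998L & yr <= 2002L)] <- "Pre-Moneyball"
  out[which(yr > 2002L & yr < big_data_start)] <- "Moneyball"
  out[which(yr >= big_data_start)] <- "Big Data"
  out
}

era_label_r(c(1990L, 2000L, 2005L, 2020L))
# NA "Pre-Moneyball" "Moneyball" "Big Data"
```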

## New features

5 changes: 4 additions & 1 deletion R/globals.R
@@ -6,8 +6,11 @@ utils::globalVariables(c(
# scrape.R
"salary", "average_annual", "player", "playerID", "yearID",
# setup_db.R (also average_annual, playerID)
-# People columns used in scrape.R
+# People columns used in scrape.R / match_player_ids
"nameLast", "nameFirst",
+# match_player_ids internal columns
+"player_exact", "player_norm", "debut_year", "final_year",
+".row_idx", "n_matches", ".match_teamID", "last_norm", "first_init", "n",
# utils.R (dt_factors_to_char)
"factor_cols",
# loaders.R -- Chadwick register column references
15 changes: 5 additions & 10 deletions R/loaders.R
@@ -75,13 +75,8 @@ create_war_views_ <- function(con) {

# SalaryPerWAR: dollars per WAR by player-season.
#
-# war_reliable flag:
-# FanGraphs pitching WAR is only available from 2002 onward. A pitcher
-# with salary data before 2002 will have near-zero total_war (batting
-# contribution only), making dollars_per_war badly wrong for that row.
-# war_reliable = FALSE when the player had pitching appearances AND
-# yearID < 2002. Filter WHERE war_reliable = TRUE for clean analysis.
-# Batting WAR is reliable for all seasons 1985+.
+# war_reliable flag: kept for backward compatibility; now always TRUE
+# since FanGraphs pitching WAR covers the full salary era (1985+).
DBI::dbExecute(con, "
CREATE OR REPLACE VIEW SalaryPerWAR AS
WITH pitcher_seasons AS (
@@ -100,7 +95,7 @@ create_war_views_ <- function(con) {
w.total_war,
s.salary / NULLIF(w.total_war, 0) AS dollars_per_war,
era_label(s.yearID) AS era,
-NOT (ps.playerID IS NOT NULL AND s.yearID < 2002) AS war_reliable
+NOT (ps.playerID IS NOT NULL AND s.yearID < 1985) AS war_reliable
FROM SalariesAll s
JOIN PlayerWAR w USING (playerID, yearID)
LEFT JOIN pitcher_seasons ps USING (playerID, yearID)
@@ -265,7 +260,7 @@ load_fangraphs_war <- function(con, years = 1985:2025, overwrite = FALSE) {
bat <- data.table::rbindlist(Filter(Negate(is.null), bat_list), fill = TRUE)

message(sprintf("Fetching FanGraphs pitching WAR %d-%d...", start_yr, end_yr))
-pit_list <- lapply(years[years >= 2002L], function(yr) {
+pit_list <- lapply(years, function(yr) {
tryCatch({
d <- data.table::as.data.table(
baseballr::fg_pitch_leaders(startseason = yr, endseason = yr, qual = 0)
@@ -280,7 +275,7 @@ load_fangraphs_war <- function(con, years = 1985:2025, overwrite = FALSE) {
pit <- data.table::rbindlist(Filter(Negate(is.null), pit_list), fill = TRUE)

if (nrow(bat) == 0L) stop("No FanGraphs batting WAR data retrieved.")
-if (nrow(pit) == 0L) warning("No FanGraphs pitching WAR data retrieved (pitching WAR only available 2002+).")
+if (nrow(pit) == 0L) warning("No FanGraphs pitching WAR data retrieved.")

DBI::dbWriteTable(con, "FangraphsBattingWAR", bat, overwrite = overwrite)
DBI::dbWriteTable(con, "FangraphsPitchingWAR", pit, overwrite = overwrite)
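
To see why `war_reliable` is now always TRUE for salary-era rows, the view's flag and `dollars_per_war` expressions can be reproduced in base R. This is a minimal sketch on invented rows: the data frame, the helper name `war_reliable_flag()`, and the `is_pitcher` column (standing in for `ps.playerID IS NOT NULL`) are all hypothetical.

```r
# Toy player-seasons (made-up values, for illustration only).
rows <- data.frame(
  yearID     = c(1985L, 1999L, 2010L),
  is_pitcher = c(TRUE, TRUE, FALSE),   # stands in for ps.playerID IS NOT NULL
  salary     = c(5e5, 2e6, 4e6),
  total_war  = c(2.0, 0.0, 4.0)
)

# Same shape as the SQL: NOT (pitcher AND yearID < cutoff)
war_reliable_flag <- function(yearID, is_pitcher, cutoff) {
  !(is_pitcher & yearID < cutoff)
}

rows$reliable_old <- war_reliable_flag(rows$yearID, rows$is_pitcher, 2002L)
rows$reliable_new <- war_reliable_flag(rows$yearID, rows$is_pitcher, 1985L)

# salary / NULLIF(total_war, 0): a zero WAR yields NA instead of Inf
rows$dollars_per_war <- rows$salary /
  ifelse(rows$total_war == 0, NA, rows$total_war)
```

With the 2002 cut-off the two pitcher rows are flagged unreliable; with the 1985 cut-off no row from 1985 onward can be.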
15 changes: 6 additions & 9 deletions R/scrape.R
@@ -91,21 +91,18 @@ scrape_salaries <- function(years = 2017:2025,

# -- Join to Lahman playerID --------------------------------------------------
people <- data.table::as.data.table(Lahman::People)
-people[, player := paste0(nameLast, ", ", nameFirst)]
+match_player_ids(all_salaries, people)

-sal_linked <- merge(all_salaries, people[, .(playerID, player)],
-by = "player", all.x = TRUE)
+match_pct <- mean(!is.na(all_salaries$playerID)) * 100
+message(sprintf("Final match rate: %.1f%% of %d rows", match_pct, nrow(all_salaries)))

-match_pct <- mean(!is.na(sal_linked$playerID)) * 100
-message(sprintf("Matched: %.1f%% of %d rows", match_pct, nrow(sal_linked)))
-
-yr_range <- range(sal_linked$yearID, na.rm = TRUE)
+yr_range <- range(all_salaries$yearID, na.rm = TRUE)
out_combined <- file.path(
output_dir,
sprintf("salaries_%d_%d_with_playerID.csv", yr_range[1], yr_range[2])
)
-data.table::fwrite(sal_linked, out_combined)
-data.table::fwrite(unique(sal_linked[is.na(playerID), .(player)]),
+data.table::fwrite(all_salaries, out_combined)
+data.table::fwrite(unique(all_salaries[is.na(playerID), .(player)]),
file.path(output_dir, "unmatched_players.csv"))

message("Done. Combined file: ", out_combined)
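
The reporting step at the end of `scrape_salaries()` reduces to a few lines. The following is a minimal base-R sketch, assuming `playerID` is `NA` for rows that could not be matched; the player names and IDs are invented for illustration, and the package itself works on a data.table rather than a data frame.

```r
# Invented rows: one matched, three unmatched (two of them the same player).
all_salaries <- data.frame(
  player   = c("Doe, John", "Roe, Rick", "Poe, Pat", "Poe, Pat"),
  playerID = c("doejo01", NA, NA, NA),
  yearID   = c(2017L, 2018L, 2018L, 2019L)
)

# Share of rows with a non-missing playerID, as a percentage.
match_pct <- mean(!is.na(all_salaries$playerID)) * 100
message(sprintf("Final match rate: %.1f%% of %d rows",
                match_pct, nrow(all_salaries)))

# Distinct unmatched names, the content of unmatched_players.csv.
unmatched <- unique(
  all_salaries[is.na(all_salaries$playerID), "player", drop = FALSE]
)
```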
8 changes: 3 additions & 5 deletions R/setup_db.R
@@ -12,9 +12,8 @@
#' and creates the `PlayerIDs` view (ODC-BY 1.0 licensed; safe to use locally).
#' - `load_war = TRUE` additionally fetches FanGraphs WAR leaderboards and
#' creates the `PlayerWAR` and `SalaryPerWAR` views. Implies
-#' `load_chadwick = TRUE`. Pitching WAR is available from FanGraphs from
-#' 2002 onward only; `SalaryPerWAR` includes a `war_reliable` flag to mark
-#' pre-2002 pitcher rows where WAR values are incomplete.
+#' `load_chadwick = TRUE`. Both batting and pitching WAR are available
+#' from FanGraphs for the full salary era (1985+).
#'
#' @param dbdir Path for the output `baseball.duckdb` file. Defaults to the
#' value of the `LAHMANS_DBDIR` environment variable if set, otherwise
@@ -38,8 +37,7 @@
#' `PlayerWAR` and `SalaryPerWAR` views. Implies `load_chadwick = TRUE`.
#' Requires an internet connection and \pkg{baseballr}. Default `FALSE`.
#' @param war_years Integer vector of seasons to fetch for WAR data.
-#' Defaults to `1985:2025` (full salary era). Pitching WAR before 2002 is
-#' not available from FanGraphs; see `SalaryPerWAR.war_reliable`.
+#' Defaults to `1985:2025` (full salary era).
#'
#' @return Invisibly returns `dbdir`.
#' @export
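
The documented rule that `load_war = TRUE` implies `load_chadwick = TRUE` can be expressed as a small flag-resolution step. This is a sketch of one way to do it, not the package's implementation: `resolve_loader_flags()` is a hypothetical helper, and only the parameter names and defaults come from the roxygen block above.

```r
# Hypothetical helper illustrating the "load_war implies load_chadwick" rule.
resolve_loader_flags <- function(load_chadwick = FALSE, load_war = FALSE,
                                 war_years = 1985:2025) {
  # The WAR loader needs the Chadwick crosswalk for the
  # FanGraphs-to-Lahman join, so force it on when WAR is requested.
  if (isTRUE(load_war)) load_chadwick <- TRUE
  list(
    load_chadwick = isTRUE(load_chadwick),
    load_war      = isTRUE(load_war),
    war_years     = as.integer(war_years)
  )
}

flags <- resolve_loader_flags(load_war = TRUE)
flags$load_chadwick
# TRUE
```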