diff --git a/.Rbuildignore b/.Rbuildignore index 49c0220..b8c7312 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -14,3 +14,4 @@ ^AGENTS\.md$ ^CONTRIBUTING\.md$ ^analysis$ +^.*\.pdf$ diff --git a/.copilot/mcp-config.json b/.copilot/mcp-config.json new file mode 100644 index 0000000..277e3b4 --- /dev/null +++ b/.copilot/mcp-config.json @@ -0,0 +1,17 @@ +{ + "mcpServers": { + "baseball": { + "type": "stdio", + "command": "python3", + "args": [ + "-c", + "import os, subprocess; db=os.path.join(os.environ.get('LAHMANS_DBDIR', os.path.expanduser('~/Documents/Data/baseball')), 'baseball.duckdb'); subprocess.run(['duckdb-mcp-server', '--db-path', db, '--readonly'])" + ] + }, + "r-btw": { + "type": "stdio", + "command": "Rscript", + "args": ["-e", "btw::btw_mcp_server()"] + } + } +} \ No newline at end of file diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS new file mode 100644 index 0000000..58f8835 --- /dev/null +++ b/.github/CODEOWNERS @@ -0,0 +1,13 @@ +# CODEOWNERS — these files require owner review on every PR. +# Protects against prompt injection via AI instruction files and +# unauthorized expansion of the MCP attack surface. + +# AI agent instructions and MCP server config are sensitive: +# a malicious change could redirect agent behaviour or expose new +# execution vectors. Require explicit owner sign-off. +.github/copilot-instructions.md @davidlucey +.copilot/mcp-config.json @davidlucey + +# DESCRIPTION and NAMESPACE control package identity and imports +DESCRIPTION @davidlucey +NAMESPACE @davidlucey diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 2a17962..f347a75 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -97,15 +97,73 @@ cd $PROJ && git reset When developing analysis scripts or iterating on charts, use an **interactive R session** instead of re-running the full script each time: -1. Start R in async mode: `bash mode="async" command="R --no-save"` -2. Source shared setup (DB connection, libraries) once -3. Send individual code blocks via `write_bash` to iterate on specific charts or queries -4. Use the `view` tool on saved PNG files to inspect chart output visually -5. Only assemble the final `.R` script once the individual pieces are working +1. Start R in async mode: `bash mode="async" command="R --no-save" shellId="r-session"` +2. Manually connect to DuckDB and load libraries (do NOT source the full script — see gotcha below) +3. Source only the SQL/data-processing block once (lines after `dbConnect`, before chart code) +4. Send only the chart code block via `write_bash` to iterate on specific charts or queries +5. Use the `view` tool on saved PNG files to inspect chart output visually +6. Only assemble the final `.R` script once the individual pieces are working + +**`on.exit` gotcha in sourced scripts:** Analysis scripts use `on.exit(dbDisconnect(con, shutdown = TRUE))` at the top level. When you `source()` such a file interactively, R fires the `on.exit` handler when `source()` returns, closing the connection immediately. **Workaround:** connect manually in the session first, then source only the data-processing and chart sections (not the preamble). + +```r +# Step 1 — run once to set up the session +suppressPackageStartupMessages({ + library(data.table); library(ggplot2); library(DBI); library(duckdb) +}) +db_path <- file.path(path.expand(Sys.getenv("LAHMANS_DBDIR", "~/Documents/Data/baseball")), "baseball.duckdb") +con <- dbConnect(duckdb(), db_path, read_only = TRUE) + +# Step 2 — source just the SQL + data wrangling (skip preamble lines) +source("/tmp/roi_data.R") # or whichever temp file has only the data block + +# Step 3 — iterate: edit chart file, source, view +source("/tmp/roi_chart.R") +``` This avoids the 60-90 second penalty of re-running a full analysis script on every change and enables tight visual feedback loops. -**DuckDB CLI for ad-hoc queries:** Use `duckdb ~/Documents/Data/baseball/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code. +**DuckDB CLI for ad-hoc queries:** Use `duckdb $LAHMANS_DBDIR/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code. + +## MCP Servers + +Two MCP servers are configured in `.copilot/mcp-config.json`: + +| Server | Command | Purpose | +|--------|---------|---------| +| `baseball` | `duckdb-mcp-server --readonly` | Read-only SQL access to `baseball.duckdb`; path resolved via `$LAHMANS_DBDIR` using Python (no shell expansion — avoids injection) | +| `r-btw` | `btw::btw_mcp_server()` | R package dev tools: test, document, check, coverage, help | + +**Prerequisites:** +- `LAHMANS_DBDIR` env var should be set (defaults to `~/Documents/Data/baseball`). +- `duckdb-mcp-server` binary on `PATH` (installed at `~/.local/bin/duckdb-mcp-server`). +- `btw` R package installed in the system library (not renv). + +**Using `r-btw` tools:** prefer them over bash for package tasks — `btw_tool_pkg_test`, `btw_tool_pkg_check`, `btw_tool_pkg_coverage`, `btw_tool_pkg_document` all run in-process and are faster than shell invocations. + +**`mcptools` is intentionally NOT configured** as an MCP server. `mcptools::mcp_server(session_tools = TRUE)` would expose `list_r_sessions` / `select_r_session`, giving the AI arbitrary R code execution in any session that has called `mcp_session()`. Use the bash async session approach instead (see Interactive R Sessions above). + +## Security + +The following files are high-value targets for prompt injection and are protected by CODEOWNERS (owner review required on every PR): + +- `.github/copilot-instructions.md` — controls AI agent behaviour for all sessions +- `.copilot/mcp-config.json` — controls which MCP servers (execution surfaces) are available + +**What prompt injection means here:** a malicious PR that modifies `copilot-instructions.md` could redirect the agent to exfiltrate data, weaken commit checks, or perform unintended operations. The CODEOWNERS rule ensures a human must explicitly approve any change to these files before merge. + +**MCP surface area (in order of privilege):** +1. `baseball` (DuckDB, read-only) — SQL queries only; no writes; path constructed programmatically to avoid shell injection. +2. `r-btw` — can read all package source files and run tests/checks. Cannot write files or execute arbitrary shell commands. + +**Never add to MCP config without security review:** +- Any server that exposes `eval`, `system()`, `shell()`, or arbitrary R/Python execution. +- `mcptools::mcp_server(session_tools = TRUE)` — see above. +- Any server that takes user-supplied input as a shell argument. + +## Tests + +Run with `devtools::test()`. The suite has ~71 `test_that()` blocks (202 assertions) across 6 files. All must pass before committing. The full-DB smoke test uses `skip_on_ci()` and `skip_if_not_installed("Lahman")`. ## R CMD Check diff --git a/.gitignore b/.gitignore index 9fb0da2..7db0d8a 100644 --- a/.gitignore +++ b/.gitignore @@ -32,7 +32,6 @@ tests/testthat/_snaps/ # Copilot CLI config (environment-specific, not for contributors) .github/lsp.json -.copilot/mcp-config.json .mcp.json # Old scratch notebooks (superseded by analysis/) diff --git a/2006-an-economic-evaluation-of-the-moneyball-hypothesis.pdf b/2006-an-economic-evaluation-of-the-moneyball-hypothesis.pdf new file mode 100644 index 0000000..a8a85e7 Binary files /dev/null and b/2006-an-economic-evaluation-of-the-moneyball-hypothesis.pdf differ diff --git a/DESCRIPTION b/DESCRIPTION index d67844b..f0e159b 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,13 +1,14 @@ Package: lahmanTools Title: Baseball Analytics with Lahman and DuckDB -Version: 0.3.0 +Version: 0.4.0 Authors@R: person("David", "Lucey", role = c("aut", "cre"), email = "david@example.com") Description: Loads all Sean Lahman baseball tables (1871-2025) into a persistent DuckDB database and exposes pre-built sabermetric SQL views (BattingStats, PitchingStats, SalaryPerWAR, etc.). Optionally extends salary coverage to - 2017-2025 from USA Today and Spotrac, and supplements with FanGraphs WAR - (1985+) and Chadwick Bureau player ID crosswalk via the baseballr package. + 2017-2025 from USA Today and Spotrac, supplements with FanGraphs WAR + (1985+) and Retrosheet postseason data (2022-2025), and provides Chadwick + Bureau player ID crosswalk via the baseballr package. No third-party data is bundled; all supplemental data is fetched at runtime. License: MIT + file LICENSE Encoding: UTF-8 @@ -27,6 +28,12 @@ Suggests: dm, jsonlite, re2, + zip, ggplot2, + ggrepel, + scales, + quarto, + knitr, testthat (>= 3.0.0) +VignetteBuilder: quarto Config/testthat/edition: 3 diff --git a/NAMESPACE b/NAMESPACE index f4e0000..7a4d337 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -7,6 +7,7 @@ export(db_query) export(dt_factors_to_char) export(load_chadwick_ids) export(load_fangraphs_war) +export(load_retrosheet_post) export(load_statcast) export(match_player_ids) export(normalise_player_name) diff --git a/NEWS.md b/NEWS.md index ca0c400..e41be51 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,18 @@ +# lahmanTools 0.4.0 + +## Attribution fix + +* **README** and **DESCRIPTION** now include Retrosheet as a credited data + source. `load_retrosheet_post()` has always carried the required attribution + in its roxygen docs and `inst/RETROSHEET_NOTICE`; the top-level docs now + match. + +## Code quality + +* Three call sites that bypassed `db_query()` now use it consistently: + `scrape.R` (people lookup) and `utils.R` (Batting/Pitching roster tables in + `match_player_ids()` Pass 4). No behaviour change -- purely DRY cleanup. + # lahmanTools 0.3.0 ## Breaking changes diff --git a/R/loaders.R b/R/loaders.R index 62b4612..066e077 100644 --- a/R/loaders.R +++ b/R/loaders.R @@ -287,6 +287,452 @@ load_fangraphs_war <- function(con, years = 1985:2025, overwrite = FALSE) { } +#' Load Retrosheet postseason data (2022+) +#' +#' Downloads Retrosheet simplified CSV files and appends postseason player +#' statistics for the requested seasons to the `BattingPost`, `PitchingPost`, +#' and `SeriesPost` tables in the database. +#' +#' The Lahman `BattingPost`, `PitchingPost`, and `SeriesPost` tables stop at +#' 2021. This function extends them using Retrosheet data (available through +#' 2025) by: +#' +#' \enumerate{ +#' \item Downloading the Retrosheet simplified CSV archive +#' (`basiccsvs.zip`) to a local path. +#' \item Filtering to the requested \code{years} and postseason game types +#' (wildcard, divisionseries, lcs, worldseries). +#' \item Mapping Retrosheet player IDs to Lahman \code{playerID} via +#' \code{People$retroID}. +#' \item Mapping Retrosheet team codes to Lahman \code{teamID} and +#' \code{lgID} via the \code{Teams} table. +#' \item Deriving Lahman-style \code{round} codes (e.g. \code{ALCS}, +#' \code{ALDS1}, \code{ALWC1}, \code{WS}). +#' \item Aggregating per-game rows into per-series player totals. +#' \item Inserting new rows into \code{BattingPost}, \code{PitchingPost}, +#' and \code{SeriesPost}. +#' } +#' +#' **Attribution:** Data are from Retrosheet. You are free to use, sell, or +#' build products from Retrosheet data provided the following notice appears +#' prominently: *"The information used here was obtained free of charge from +#' and is copyrighted by Retrosheet. Interested parties may contact Retrosheet +#' at \url{https://www.retrosheet.org}"*. +#' +#' @param con A writable \code{DBIConnection} to the baseball DuckDB database. +#' The database must already contain \code{People} and \code{Teams} tables +#' (loaded by \code{\link{setup_baseball_db}}). +#' @param years Integer vector of seasons to load. Defaults to all seasons +#' after the current maximum \code{yearID} in \code{BattingPost} (typically +#' 2022:2025 when Lahman is current through 2021). +#' @param zip_path Path to a pre-downloaded \code{basiccsvs.zip} file. When +#' \code{NULL} (default) the file is downloaded to \code{tempdir()} if not +#' already cached there. +#' @param overwrite Logical. When \code{TRUE}, delete any existing rows for +#' the requested \code{years} in all three tables before inserting. Default +#' \code{FALSE} skips years already present. +#' +#' @return Invisibly returns \code{con}. +#' @export +#' +#' @examples +#' \dontrun{ +#' con <- connect_baseball_db(read_only = FALSE) +#' load_retrosheet_post(con) # extend through latest available year +#' load_retrosheet_post(con, years = 2024) # single season +#' DBI::dbDisconnect(con, shutdown = TRUE) +#' } +load_retrosheet_post <- function(con, + years = NULL, + zip_path = NULL, + overwrite = FALSE) { + needed <- c("People", "Teams", "BattingPost", "PitchingPost", "SeriesPost") + missing_tbls <- setdiff(needed, DBI::dbListTables(con)) + if (length(missing_tbls)) { + stop("Required tables missing from database: ", + paste(missing_tbls, collapse = ", "), + "\n Run setup_baseball_db() first.", call. = FALSE) + } + + # Determine which years to load ----------------------------------------------- + existing_max <- DBI::dbGetQuery( + con, "SELECT COALESCE(MAX(yearID), 2021) AS m FROM BattingPost" + )$m + if (is.null(years)) years <- seq.int(existing_max + 1L, 2025L) + years <- as.integer(years) + if (!length(years)) { + message(" No new postseason years to load.") + return(invisible(con)) + } + + if (!overwrite) { + already <- DBI::dbGetQuery( + con, + paste0("SELECT DISTINCT yearID FROM BattingPost WHERE yearID IN (", + paste(years, collapse = ","), ")") + )$yearID + years <- setdiff(years, already) + if (!length(years)) { + message(" BattingPost already contains all requested years. ", + "Use overwrite = TRUE to reload.") + return(invisible(con)) + } + } + + yr_min <- min(years) + yr_max <- max(years) + message(sprintf("Loading Retrosheet postseason data for %d-%d...", yr_min, yr_max)) + + # Find / download zip --------------------------------------------------------- + if (is.null(zip_path)) { + zip_path <- file.path(tempdir(), "retrosheet_basiccsvs.zip") + if (!file.exists(zip_path)) { + message(" Downloading Retrosheet basiccsvs.zip ...") + utils::download.file( + "https://www.retrosheet.org/downloads/basiccsvs.zip", + zip_path, mode = "wb", quiet = FALSE + ) + } + } + if (!file.exists(zip_path)) + stop("zip_path does not exist: ", zip_path, call. = FALSE) + + # Unzip ----------------------------------------------------------------------- + extract_dir <- file.path(tempdir(), "retrosheet_csv") + dir.create(extract_dir, showWarnings = FALSE, recursive = TRUE) + utils::unzip(zip_path, files = c("batting.csv", "pitching.csv"), + exdir = extract_dir, overwrite = TRUE) + + bat_csv <- file.path(extract_dir, "batting.csv") + pit_csv <- file.path(extract_dir, "pitching.csv") + + if (!file.exists(bat_csv) || !file.exists(pit_csv)) + stop("Expected batting.csv and pitching.csv not found after unzip.", call. = FALSE) + + # Shared constants ------------------------------------------------------------ + # Retrosheet gametype values for postseason rounds. + post_types <- "('worldseries','lcs','divisionseries','wildcard')" + # Year filter uses integer division (//) -- DuckDB's / on BIGINT returns DOUBLE. + yr_filter <- paste0("date // 10000 IN (", paste(years, collapse = ","), ")") + + # Round-code SQL snippet (used in all three table queries) -------------------- + # Maps Retrosheet gametype + league -> Lahman round code. + # Non-WS series within each (year, league, gametype) are numbered 1, 2, 3 + # by the alphabetically ordered canonical pair (LEAST(team,opp), GREATEST(...)). + round_cte_sql <- function() { + # Teams only covers through 2021 in the Lahman release. Use each team's + # most-recent available lgID as a stable proxy (teams very rarely change + # leagues). No year join needed. + " + team_lg AS ( + SELECT teamID, lgID + FROM ( + SELECT teamID, lgID, + ROW_NUMBER() OVER (PARTITION BY teamID ORDER BY yearID DESC) AS rn + FROM Teams + ) + WHERE rn = 1 + ), + -- Number series within (year, gametype, league) for ALDS1/ALDS2 etc. + -- pair_low is always within one league for non-WS series. + series_num AS ( + SELECT DISTINCT yearID, gametype, pair_low, pair_high, + DENSE_RANK() OVER ( + PARTITION BY yearID, gametype, tl.lgID + ORDER BY pair_low, pair_high + ) AS sn + FROM src + JOIN team_lg tl ON tl.teamID = src.pair_low + WHERE gametype <> 'worldseries' + UNION ALL + SELECT DISTINCT yearID, gametype, pair_low, pair_high, 1 AS sn + FROM src + WHERE gametype = 'worldseries' + ), + -- Attach round code + lgID to every source row. + augmented AS ( + SELECT + src.*, + tl.lgID, + CASE src.gametype + WHEN 'worldseries' THEN 'WS' + WHEN 'lcs' THEN + CASE tl.lgID WHEN 'AL' THEN 'ALCS' ELSE 'NLCS' END + WHEN 'divisionseries' THEN + CASE tl.lgID + WHEN 'AL' THEN 'ALDS' || CAST(sn.sn AS VARCHAR) + ELSE 'NLDS' || CAST(sn.sn AS VARCHAR) + END + WHEN 'wildcard' THEN + CASE tl.lgID + WHEN 'AL' THEN 'ALWC' || CAST(sn.sn AS VARCHAR) + ELSE 'NLWC' || CAST(sn.sn AS VARCHAR) + END + END AS round + FROM src + JOIN team_lg tl ON tl.teamID = src.team + JOIN series_num sn + ON sn.yearID = src.yearID + AND sn.gametype = src.gametype + AND sn.pair_low = src.pair_low + AND sn.pair_high = src.pair_high + )" + } + + # ── BattingPost ────────────────────────────────────────────────────────────── + message(" Building BattingPost supplement...") + bat_sql <- paste0(" + WITH + src AS ( + SELECT gid, id, team, opp, + (date // 10000)::INTEGER AS yearID, + gametype, + LEAST(team, opp) AS pair_low, + GREATEST(team, opp) AS pair_high, + COALESCE(b_ab, 0) AS b_ab, + COALESCE(b_r, 0) AS b_r, + COALESCE(b_h, 0) AS b_h, + COALESCE(b_d, 0) AS b_d, + COALESCE(b_t, 0) AS b_t, + COALESCE(b_hr, 0) AS b_hr, + COALESCE(b_rbi, 0) AS b_rbi, + COALESCE(b_sh, 0) AS b_sh, + COALESCE(b_sf, 0) AS b_sf, + COALESCE(b_hbp, 0) AS b_hbp, + COALESCE(b_w, 0) AS b_w, + COALESCE(b_iw, 0) AS b_iw, + COALESCE(b_k, 0) AS b_k, + COALESCE(b_sb, 0) AS b_sb, + COALESCE(b_cs, 0) AS b_cs, + COALESCE(b_gdp, 0) AS b_gdp + FROM read_csv_auto('", bat_csv, "', sample_size = -1, ignore_errors = TRUE) + WHERE gametype IN ", post_types, " + AND ", yr_filter, " + ),", + round_cte_sql(), " + SELECT + p.playerID AS playerID, + a.yearID AS yearID, + a.round AS round, + a.team AS teamID, + a.lgID AS lgID, + COUNT(DISTINCT a.gid) AS G, + SUM(a.b_ab) AS AB, + SUM(a.b_r) AS R, + SUM(a.b_h) AS H, + SUM(a.b_hr) AS HR, + SUM(a.b_rbi) AS RBI, + SUM(a.b_sb) AS SB, + SUM(a.b_cs) AS CS, + SUM(a.b_w) AS BB, + SUM(a.b_k) AS SO, + SUM(a.b_iw) AS IBB, + SUM(a.b_hbp) AS HBP, + SUM(a.b_sh) AS SH, + SUM(a.b_sf) AS SF, + SUM(a.b_gdp) AS GIDP, + SUM(a.b_d) AS X2B, + SUM(a.b_t) AS X3B + FROM augmented a + JOIN People p ON p.retroID = a.id + GROUP BY p.playerID, a.yearID, a.round, a.team, a.lgID + ") + + if (overwrite) { + DBI::dbExecute(con, + paste0("DELETE FROM BattingPost WHERE yearID IN (", + paste(years, collapse = ","), ")")) + } + bat_new <- DBI::dbGetQuery(con, bat_sql) + DBI::dbAppendTable(con, "BattingPost", bat_new) + message(sprintf(" %-25s +%d rows", "BattingPost", nrow(bat_new))) + + # ── PitchingPost ───────────────────────────────────────────────────────────── + message(" Building PitchingPost supplement...") + pit_sql <- paste0(" + WITH + src AS ( + SELECT gid, id, team, opp, + (date // 10000)::INTEGER AS yearID, + gametype, + LEAST(team, opp) AS pair_low, + GREATEST(team, opp) AS pair_high, + COALESCE(wp, 0) AS wp, + COALESCE(lp, 0) AS lp, + COALESCE(save, 0) AS p_sv, + COALESCE(p_gs, 0) AS p_gs, + COALESCE(p_gf, 0) AS p_gf, + COALESCE(p_cg, 0) AS p_cg, + COALESCE(p_ipouts,0) AS p_ipouts, + COALESCE(p_h, 0) AS p_h, + COALESCE(p_er, 0) AS p_er, + COALESCE(p_hr, 0) AS p_hr, + COALESCE(p_w, 0) AS p_w, + COALESCE(p_iw, 0) AS p_iw, + COALESCE(p_k, 0) AS p_k, + COALESCE(p_hbp, 0) AS p_hbp, + COALESCE(p_wp, 0) AS p_wp, + COALESCE(p_bk, 0) AS p_bk, + COALESCE(p_bfp, 0) AS p_bfp, + COALESCE(p_r, 0) AS p_r, + COALESCE(p_sh, 0) AS p_sh, + COALESCE(p_sf, 0) AS p_sf + FROM read_csv_auto('", pit_csv, "', sample_size = -1, ignore_errors = TRUE) + WHERE gametype IN ", post_types, " + AND ", yr_filter, " + ),", + round_cte_sql(), " + SELECT + pp.playerID AS playerID, + a.yearID AS yearID, + a.round AS round, + a.team AS teamID, + a.lgID AS lgID, + SUM(a.wp) AS W, + SUM(a.lp) AS L, + COUNT(DISTINCT a.gid) AS G, + SUM(a.p_gs) AS GS, + SUM(a.p_cg) AS CG, + 0::BIGINT AS SHO, + SUM(a.p_sv) AS SV, + SUM(a.p_ipouts) AS IPouts, + SUM(a.p_h) AS H, + SUM(a.p_er) AS ER, + SUM(a.p_hr) AS HR, + SUM(a.p_w) AS BB, + SUM(a.p_k) AS SO, + -- BAOpp: H / (BFP - BB - IBB - HBP - SH - SF) + SUM(a.p_h)::DOUBLE / + NULLIF(SUM(a.p_bfp) - SUM(a.p_w) - SUM(a.p_iw) + - SUM(a.p_hbp) - SUM(a.p_sh) - SUM(a.p_sf), + 0) AS BAOpp, + -- ERA: ER * 27 / IPouts + SUM(a.p_er)::DOUBLE * 27.0 / + NULLIF(SUM(a.p_ipouts), 0) AS ERA, + SUM(a.p_iw) AS IBB, + SUM(a.p_wp) AS WP, + SUM(a.p_hbp) AS HBP, + SUM(a.p_bk) AS BK, + SUM(a.p_bfp) AS BFP, + SUM(a.p_gf) AS GF, + SUM(a.p_r) AS R, + SUM(a.p_sh) AS SH, + SUM(a.p_sf) AS SF, + 0::BIGINT AS GIDP + FROM augmented a + JOIN People pp ON pp.retroID = a.id + GROUP BY pp.playerID, a.yearID, a.round, a.team, a.lgID + ") + + if (overwrite) { + DBI::dbExecute(con, + paste0("DELETE FROM PitchingPost WHERE yearID IN (", + paste(years, collapse = ","), ")")) + } + pit_new <- DBI::dbGetQuery(con, pit_sql) + DBI::dbAppendTable(con, "PitchingPost", pit_new) + message(sprintf(" %-25s +%d rows", "PitchingPost", nrow(pit_new))) + + # ── SeriesPost ─────────────────────────────────────────────────────────────── + # Derive series outcomes from game-level data. Home-team rows (vishome='h') + # give exactly one win/loss record per game without double-counting. + message(" Building SeriesPost supplement...") + ser_sql <- paste0(" + WITH + src AS ( + SELECT DISTINCT gid, team, opp, vishome, win, + (date // 10000)::INTEGER AS yearID, + gametype, + LEAST(team, opp) AS pair_low, + GREATEST(team, opp) AS pair_high + FROM read_csv_auto('", bat_csv, "', sample_size = -1, ignore_errors = TRUE) + WHERE gametype IN ", post_types, " + AND ", yr_filter, " + AND vishome = 'h' + ),", + round_cte_sql(), ", + -- Count wins per team per series using home-team perspective. + -- Home team's win=1 means visitor lost; invert for visitor. + home_side AS ( + SELECT yearID, round, team, opp, + SUM(win) AS wins, + SUM(1 - win) AS losses + FROM augmented + GROUP BY yearID, round, team, opp + ), + vis_side AS ( + SELECT yearID, round, opp AS team, team AS opp, + SUM(1 - win) AS wins, + SUM(win) AS losses + FROM augmented + GROUP BY yearID, round, opp, team + ), + totals AS ( + SELECT yearID, round, team, + SUM(wins) AS wins, + SUM(losses) AS losses + FROM (SELECT yearID, round, team, wins, losses FROM home_side + UNION ALL + SELECT yearID, round, team, wins, losses FROM vis_side) + GROUP BY yearID, round, team + ), + -- Pair teams; winner = higher wins (ties go to the larger-code team as + -- a tie-break but real ties are very rare in postseason play). + paired AS ( + SELECT + t1.yearID, t1.round, + CASE WHEN t1.wins >= t2.wins THEN t1.team ELSE t2.team END AS teamIDwinner, + CASE WHEN t1.wins >= t2.wins THEN t2.team ELSE t1.team END AS teamIDloser, + GREATEST(t1.wins, t2.wins)::BIGINT AS wins, + LEAST(t1.wins, t2.wins)::BIGINT AS losses, + 0::BIGINT AS ties + FROM totals t1 + JOIN totals t2 + ON t1.yearID = t2.yearID AND t1.round = t2.round + AND t1.team < t2.team + ), + lg AS ( + SELECT teamID, lgID + FROM ( + SELECT teamID, lgID, + ROW_NUMBER() OVER (PARTITION BY teamID ORDER BY yearID DESC) AS rn + FROM Teams + ) + WHERE rn = 1 + ) + SELECT + p.yearID, + p.round, + p.teamIDwinner, + w.lgID AS lgIDwinner, + p.teamIDloser, + l.lgID AS lgIDloser, + p.wins, + p.losses, + p.ties + FROM paired p + JOIN lg w ON w.teamID = p.teamIDwinner + JOIN lg l ON l.teamID = p.teamIDloser + ") + + if (overwrite) { + DBI::dbExecute(con, + paste0("DELETE FROM SeriesPost WHERE yearID IN (", + paste(years, collapse = ","), ")")) + } + ser_new <- DBI::dbGetQuery(con, ser_sql) + DBI::dbAppendTable(con, "SeriesPost", ser_new) + message(sprintf(" %-25s +%d rows", "SeriesPost", nrow(ser_new))) + + # Cleanup extracted CSVs (zip kept for reuse) --------------------------------- + unlink(c(bat_csv, pit_csv)) + + message(sprintf("\nRetrosheet postseason data loaded for years: %s", + paste(sort(years), collapse = ", "))) + invisible(con) +} + + #' Load Statcast pitch-level data #' #' Fetches Baseball Savant pitch-level data via \pkg{baseballr} for each diff --git a/R/scrape.R b/R/scrape.R index 1816207..7486d8a 100644 --- a/R/scrape.R +++ b/R/scrape.R @@ -96,7 +96,7 @@ scrape_salaries <- function(years = 2017:2025, # -- Join to playerID --------------------------------------------------------- if (!is.null(con)) { - people <- data.table::as.data.table(DBI::dbGetQuery(con, "SELECT playerID, nameLast, nameFirst, nameGiven FROM People")) + people <- db_query(con, "SELECT playerID, nameLast, nameFirst, nameGiven FROM People") } else { stop("A DuckDB connection (con=) is required for player ID matching. ", "Run setup_baseball_db() and pass con = connect_baseball_db().") diff --git a/R/setup_db.R b/R/setup_db.R index 62f097d..a6cf919 100644 --- a/R/setup_db.R +++ b/R/setup_db.R @@ -44,6 +44,14 @@ #' Requires an internet connection and \pkg{baseballr}. Default `FALSE`. #' @param war_years Integer vector of seasons to fetch for WAR data. #' Defaults to `1985:2025` (full salary era). +#' @param load_retrosheet If `TRUE`, download Retrosheet postseason CSVs and +#' extend `BattingPost`, `PitchingPost`, and `SeriesPost` through the latest +#' available year (currently 2025). The Lahman tables stop at 2021; this +#' fills the gap. Requires an internet connection. Defaults to `TRUE` when +#' `load_war = TRUE`, `FALSE` otherwise. +#' @param retrosheet_zip Optional path to a pre-downloaded Retrosheet +#' `basiccsvs.zip`. When `NULL` (default), the file is downloaded from +#' \url{https://www.retrosheet.org/downloads/basiccsvs.zip}. #' #' @return Invisibly returns `dbdir`. #' @export @@ -53,8 +61,9 @@ #' # Download all tables from Chadwick Bureau and build database #' setup_baseball_db() #' -#' # With full WAR coverage (requires baseballr and internet) +#' # With full WAR and postseason coverage through 2025 #' setup_baseball_db(load_war = TRUE, overwrite = TRUE) +#' # load_war = TRUE implies load_retrosheet = TRUE automatically #' } setup_baseball_db <- function(dbdir = NULL, sal_file = NULL, @@ -62,7 +71,9 @@ setup_baseball_db <- function(dbdir = NULL, overwrite = FALSE, load_chadwick = FALSE, load_war = FALSE, - war_years = 1985:2025) { + war_years = 1985:2025, + load_retrosheet = load_war, + retrosheet_zip = NULL) { if (is.null(dbdir)) { dbdir <- Sys.getenv( "LAHMANS_DBDIR", @@ -374,6 +385,8 @@ setup_baseball_db <- function(dbdir = NULL, if (load_war && !load_chadwick) load_chadwick <- TRUE if (load_chadwick) load_chadwick_ids(con, overwrite = overwrite) if (load_war) load_fangraphs_war(con, years = war_years, overwrite = overwrite) + if (load_retrosheet) load_retrosheet_post(con, zip_path = retrosheet_zip, + overwrite = overwrite) n <- length(DBI::dbListTables(con)) message(sprintf("\nDone. %d tables/views written to %s", n, dbdir)) diff --git a/R/stats_views.R b/R/stats_views.R index 5d5457c..be418d5 100644 --- a/R/stats_views.R +++ b/R/stats_views.R @@ -62,7 +62,46 @@ create_stats_views <- function(con) { # X2B, X3B -- stored with R's prefix because "2B"/"3B" are invalid identifiers # HBP, SF, SH -- nullable in early seasons; COALESCE to 0 # OBP denominator excludes SH (sacrifice bunts) per official MLB definition - DBI::dbExecute(con, " + # + # Extension: when FangraphsBattingWAR and ChadwickIDs are present, seasons + # after the last Lahman year (>2021) are filled from FanGraphs via UNION ALL. + # FG pre-computes AVG/OBP/SLG/ISO/BABIP/BB_pct/K_pct as proportions, matching + # the Lahman-derived values. teamID and lgID are NULL for FG rows (FG uses + # internal team codes that don't map cleanly to Lahman identifiers). + tbls <- DBI::dbListTables(con) + has_fg <- all(c("FangraphsBattingWAR", "ChadwickIDs", "People") %in% tbls) + + fg_union <- if (has_fg) " + UNION ALL + -- FanGraphs extension for seasons not covered by Lahman CSVs (>2021) + SELECT + p.playerID, + fw.Season::INTEGER AS yearID, + 1 AS stint, + NULL::VARCHAR AS teamID, + NULL::VARCHAR AS lgID, + fw.G, fw.AB, fw.R, fw.H, + fw.\"2B\"::INTEGER AS X2B, + fw.\"3B\"::INTEGER AS X3B, + fw.HR, fw.RBI, fw.SB, fw.CS, fw.BB, fw.SO, + COALESCE(fw.IBB, 0) AS IBB, + COALESCE(fw.HBP, 0) AS HBP, + COALESCE(fw.SH, 0) AS SH, + COALESCE(fw.SF, 0) AS SF, + COALESCE(fw.GDP, 0) AS GIDP, + fw.PA, + fw.AVG, fw.OBP, fw.SLG, + fw.OBP + fw.SLG AS OPS, + fw.ISO, fw.BABIP, + fw.BB_pct, fw.K_pct + FROM FangraphsBattingWAR fw + JOIN ChadwickIDs c ON fw.playerid::VARCHAR = c.key_fangraphs::VARCHAR + JOIN People p ON c.key_bbref = p.bbrefID + WHERE fw.Season > 2021 + AND fw.AB > 0 + " else "" + + DBI::dbExecute(con, paste0(" CREATE OR REPLACE VIEW BattingStats AS SELECT playerID, yearID, stint, teamID, lgID, @@ -112,8 +151,9 @@ create_stats_views <- function(con) { / NULLIF(AB + BB + COALESCE(HBP,0) + COALESCE(SF,0) + COALESCE(SH,0), 0) AS K_pct FROM Batting - ") - message(sprintf(" %-25s (view)", "BattingStats")) + ", fg_union)) + src_note <- if (has_fg) " + FanGraphs (2022+)" else "" + message(sprintf(" %-25s (view, Lahman%s)", "BattingStats", src_note)) # ── PitchingStats ──────────────────────────────────────────────────────────── # IPouts = total outs recorded (IP * 3); use throughout to avoid /3 /3 chains. diff --git a/R/utils.R b/R/utils.R index 46b84b9..9d69d70 100644 --- a/R/utils.R +++ b/R/utils.R @@ -294,8 +294,8 @@ match_player_ids <- function(sal_dt, people_dt, roster_dt = NULL, con = NULL) { "Pass con = connect_baseball_db() for best match rates.") } else { roster_dt <- tryCatch({ - bat <- data.table::as.data.table(DBI::dbGetQuery(con, "SELECT playerID, yearID, teamID FROM Batting")) - pit <- data.table::as.data.table(DBI::dbGetQuery(con, "SELECT playerID, yearID, teamID FROM Pitching")) + bat <- db_query(con, "SELECT playerID, yearID, teamID FROM Batting") + pit <- db_query(con, "SELECT playerID, yearID, teamID FROM Pitching") unique(rbind( bat[, .(playerID, yearID, teamID)], pit[, .(playerID, yearID, teamID)] diff --git a/README.md b/README.md index 7890097..0077142 100644 --- a/README.md +++ b/README.md @@ -301,9 +301,10 @@ the source license. |--------|---------|------------| | [Sean Lahman Baseball Database](http://www.seanlahman.com/) via [cbwinslow/baseballdatabank](https://github.com/cbwinslow/baseballdatabank) | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) | Credit Sean Lahman and carry the same license in any derivative work. | | [Chadwick Baseball Bureau Register](https://github.com/chadwickbureau/register) | [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/) | Credit the Chadwick Baseball Bureau when publishing work that uses the player ID crosswalk. | +| [Retrosheet](https://www.retrosheet.org) | Copyright Retrosheet | "The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at https://www.retrosheet.org" — must appear prominently in any work using this data. | | [FanGraphs WAR Leaderboards](https://www.fangraphs.com) | Copyright FanGraphs | Do not redistribute the fetched data. | | [Baseball Savant / Statcast](https://baseballsavant.mlb.com/) | Copyright MLB Advanced Media | Do not redistribute the fetched data. | -| USA Today / Spotrac salary data | Proprietary — ToS applies | See [`data-raw/README.md`](data-raw/README.md). Do not redistribute. | +| USA Today / Spotrac salary data | Proprietary -- ToS applies | See [`data-raw/README.md`](data-raw/README.md). Do not redistribute. | FanGraphs, Chadwick, and Statcast data are fetched via the [`baseballr`](https://billpetti.github.io/baseballr/) package diff --git a/inst/RETROSHEET_NOTICE b/inst/RETROSHEET_NOTICE new file mode 100644 index 0000000..cd4882a --- /dev/null +++ b/inst/RETROSHEET_NOTICE @@ -0,0 +1,8 @@ +The information used here was obtained free of charge from and is copyrighted +by Retrosheet. Interested parties may contact Retrosheet at: + + https://www.retrosheet.org + +Users of Retrosheet data are free to use, sell, or build products from it +provided this notice appears prominently in any work that includes Retrosheet +data. diff --git a/man/load_retrosheet_post.Rd b/man/load_retrosheet_post.Rd new file mode 100644 index 0000000..c11bf33 --- /dev/null +++ b/man/load_retrosheet_post.Rd @@ -0,0 +1,68 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/loaders.R +\name{load_retrosheet_post} +\alias{load_retrosheet_post} +\title{Load Retrosheet postseason data (2022+)} +\usage{ +load_retrosheet_post(con, years = NULL, zip_path = NULL, overwrite = FALSE) +} +\arguments{ +\item{con}{A writable \code{DBIConnection} to the baseball DuckDB database. +The database must already contain \code{People} and \code{Teams} tables +(loaded by \code{\link{setup_baseball_db}}).} + +\item{years}{Integer vector of seasons to load. Defaults to all seasons +after the current maximum \code{yearID} in \code{BattingPost} (typically +2022:2025 when Lahman is current through 2021).} + +\item{zip_path}{Path to a pre-downloaded \code{basiccsvs.zip} file. When +\code{NULL} (default) the file is downloaded to \code{tempdir()} if not +already cached there.} + +\item{overwrite}{Logical. When \code{TRUE}, delete any existing rows for +the requested \code{years} in all three tables before inserting. Default +\code{FALSE} skips years already present.} +} +\value{ +Invisibly returns \code{con}. +} +\description{ +Downloads Retrosheet simplified CSV files and appends postseason player +statistics for the requested seasons to the \code{BattingPost}, \code{PitchingPost}, +and \code{SeriesPost} tables in the database. +} +\details{ +The Lahman \code{BattingPost}, \code{PitchingPost}, and \code{SeriesPost} tables stop at +2021. This function extends them using Retrosheet data (available through +2025) by: + +\enumerate{ +\item Downloading the Retrosheet simplified CSV archive +(\code{basiccsvs.zip}) to a local path. +\item Filtering to the requested \code{years} and postseason game types +(wildcard, divisionseries, lcs, worldseries). +\item Mapping Retrosheet player IDs to Lahman \code{playerID} via +\code{People$retroID}. +\item Mapping Retrosheet team codes to Lahman \code{teamID} and +\code{lgID} via the \code{Teams} table. +\item Deriving Lahman-style \code{round} codes (e.g. \code{ALCS}, +\code{ALDS1}, \code{ALWC1}, \code{WS}). +\item Aggregating per-game rows into per-series player totals. +\item Inserting new rows into \code{BattingPost}, \code{PitchingPost}, +and \code{SeriesPost}. +} + +\strong{Attribution:} Data are from Retrosheet. You are free to use, sell, or +build products from Retrosheet data provided the following notice appears +prominently: \emph{"The information used here was obtained free of charge from +and is copyrighted by Retrosheet. Interested parties may contact Retrosheet +at \url{https://www.retrosheet.org}"}. +} +\examples{ +\dontrun{ +con <- connect_baseball_db(read_only = FALSE) +load_retrosheet_post(con) # extend through latest available year +load_retrosheet_post(con, years = 2024) # single season +DBI::dbDisconnect(con, shutdown = TRUE) +} +} diff --git a/man/setup_baseball_db.Rd b/man/setup_baseball_db.Rd index 17b0860..af5294a 100644 --- a/man/setup_baseball_db.Rd +++ b/man/setup_baseball_db.Rd @@ -11,7 +11,9 @@ setup_baseball_db( overwrite = FALSE, load_chadwick = FALSE, load_war = FALSE, - war_years = 1985:2025 + war_years = 1985:2025, + load_retrosheet = load_war, + retrosheet_zip = NULL ) } \arguments{ @@ -44,6 +46,16 @@ Requires an internet connection and \pkg{baseballr}. Default \code{FALSE}.} \item{war_years}{Integer vector of seasons to fetch for WAR data. Defaults to \code{1985:2025} (full salary era).} + +\item{load_retrosheet}{If \code{TRUE}, download Retrosheet postseason CSVs and +extend \code{BattingPost}, \code{PitchingPost}, and \code{SeriesPost} through the latest +available year (currently 2025). The Lahman tables stop at 2021; this +fills the gap. Requires an internet connection. Defaults to \code{TRUE} when +\code{load_war = TRUE}, \code{FALSE} otherwise.} + +\item{retrosheet_zip}{Optional path to a pre-downloaded Retrosheet +\code{basiccsvs.zip}. When \code{NULL} (default), the file is downloaded from +\url{https://www.retrosheet.org/downloads/basiccsvs.zip}.} } \value{ Invisibly returns \code{dbdir}. @@ -80,7 +92,8 @@ from FanGraphs for the full salary era (1985+). # Download all tables from Chadwick Bureau and build database setup_baseball_db() -# With full WAR coverage (requires baseballr and internet) +# With full WAR and postseason coverage through 2025 setup_baseball_db(load_war = TRUE, overwrite = TRUE) +# load_war = TRUE implies load_retrosheet = TRUE automatically } } diff --git a/renv.lock b/renv.lock index 127fa87..4a94792 100644 --- a/renv.lock +++ b/renv.lock @@ -114,10 +114,10 @@ }, "cli": { "Package": "cli", - "Version": "3.6.5", + "Version": "3.6.6", "Source": "Repository", "Title": "Helpers for Developing Command Line Interfaces", - "Authors@R": "c( person(\"Gábor\", \"Csárdi\", , \"gabor@posit.co\", role = c(\"aut\", \"cre\")), person(\"Hadley\", \"Wickham\", role = \"ctb\"), person(\"Kirill\", \"Müller\", role = \"ctb\"), person(\"Salim\", \"Brüggemann\", , \"salim-b@pm.me\", role = \"ctb\", comment = c(ORCID = \"0000-0002-5329-5987\")), person(\"Posit Software, PBC\", role = c(\"cph\", \"fnd\")) )", + "Authors@R": "c( person(\"Gábor\", \"Csárdi\", , \"gabor@posit.co\", role = c(\"aut\", \"cre\")), person(\"Hadley\", \"Wickham\", role = \"ctb\"), person(\"Kirill\", \"Müller\", role = \"ctb\"), person(\"Salim\", \"Brüggemann\", , \"salim-b@pm.me\", role = \"ctb\", comment = c(ORCID = \"0000-0002-5329-5987\")), person(\"Posit Software, PBC\", role = c(\"cph\", \"fnd\"), comment = c(ROR = \"03wc8by49\")) )", "Description": "A suite of tools to build attractive command line interfaces ('CLIs'), from semantic elements: headings, lists, alerts, paragraphs, etc. Supports custom themes via a 'CSS'-like language. It also contains a number of lower level 'CLI' elements: rules, boxes, trees, and 'Unicode' symbols with 'ASCII' alternatives. It support ANSI colors and text styles as well.", "License": "MIT + file LICENSE", "URL": "https://cli.r-lib.org, https://github.com/r-lib/cli", @@ -152,10 +152,11 @@ ], "Config/Needs/website": "r-lib/asciicast, bench, brio, cpp11, decor, desc, fansi, prettyunits, sessioninfo, tidyverse/tidytemplate, usethis, vctrs", "Config/testthat/edition": "3", + "Config/usethis/last-upkeep": "2025-04-25", "Encoding": "UTF-8", - "RoxygenNote": "7.3.2", + "RoxygenNote": "7.3.2.9000", "NeedsCompilation": "yes", - "Author": "Gábor Csárdi [aut, cre], Hadley Wickham [ctb], Kirill Müller [ctb], Salim Brüggemann [ctb] (), Posit Software, PBC [cph, fnd]", + "Author": "Gábor Csárdi [aut, cre], Hadley Wickham [ctb], Kirill Müller [ctb], Salim Brüggemann [ctb] (ORCID: ), Posit Software, PBC [cph, fnd] (ROR: )", "Maintainer": "Gábor Csárdi ", "Repository": "CRAN" }, @@ -491,7 +492,7 @@ }, "magrittr": { "Package": "magrittr", - "Version": "2.0.4", + "Version": "2.0.5", "Source": "Repository", "Type": "Package", "Title": "A Forward-Pipe Operator for R", @@ -743,7 +744,7 @@ }, "rlang": { "Package": "rlang", - "Version": "1.1.7", + "Version": "1.2.0", "Source": "Repository", "Title": "Functions for Base Types and Core R and 'Tidyverse' Features", "Description": "A toolbox for working with base types, core R features like the condition system, and core 'Tidyverse' features like tidy evaluation.", @@ -771,7 +772,7 @@ "pkgload", "rmarkdown", "stats", - "testthat (>= 3.2.0)", + "testthat (>= 3.3.2)", "tibble", "usethis", "vctrs (>= 0.2.3)", diff --git a/tests/testthat/test-loaders.R b/tests/testthat/test-loaders.R index bddb045..d428810 100644 --- a/tests/testthat/test-loaders.R +++ b/tests/testthat/test-loaders.R @@ -248,3 +248,148 @@ test_that("load_statcast rejects years before 2015", { on.exit(DBI::dbDisconnect(con, shutdown = TRUE)) expect_error(load_statcast(con, years = 2014L), "2015") }) + +# ── load_retrosheet_post ────────────────────────────────────────────────────── + +# Build a minimal in-memory DB with the five tables load_retrosheet_post needs. +stub_retrosheet_tables <- function(con) { + DBI::dbExecute(con, " + CREATE TABLE People ( + playerID VARCHAR, retroID VARCHAR, bbrefID VARCHAR, + nameFirst VARCHAR, nameLast VARCHAR + )") + DBI::dbExecute(con, " + CREATE TABLE Teams ( + teamID VARCHAR, yearID INTEGER, lgID VARCHAR, franchID VARCHAR + )") + DBI::dbExecute(con, " + CREATE TABLE BattingPost ( + playerID VARCHAR, yearID INTEGER, round VARCHAR, + teamID VARCHAR, lgID VARCHAR, + G INTEGER, AB INTEGER, R INTEGER, H INTEGER, X2B INTEGER, X3B INTEGER, + HR INTEGER, RBI INTEGER, SB INTEGER, CS INTEGER, + BB INTEGER, SO INTEGER, IBB INTEGER, HBP INTEGER, + SH INTEGER, SF INTEGER, GIDP INTEGER + )") + DBI::dbExecute(con, " + CREATE TABLE PitchingPost ( + playerID VARCHAR, yearID INTEGER, round VARCHAR, + teamID VARCHAR, lgID VARCHAR, + W INTEGER, L INTEGER, G INTEGER, GS INTEGER, + CG INTEGER, SHO INTEGER, SV INTEGER, IPouts INTEGER, + H INTEGER, ER INTEGER, HR INTEGER, BB INTEGER, SO INTEGER, + BAOpp DOUBLE, ERA DOUBLE, IBB INTEGER, WP INTEGER, HBP INTEGER, + BK INTEGER, BFP INTEGER, GF INTEGER, R INTEGER, SH INTEGER, + SF INTEGER, GIDP INTEGER + )") + DBI::dbExecute(con, " + CREATE TABLE SeriesPost ( + yearID INTEGER, round VARCHAR, + teamIDwinner VARCHAR, lgIDwinner VARCHAR, + teamIDloser VARCHAR, lgIDloser VARCHAR, + wins INTEGER, losses INTEGER, ties INTEGER + )") + # Two teams for 2022 WS (HOU vs PHI) + DBI::dbExecute(con, "INSERT INTO Teams VALUES ('HOU', 2021, 'AL', 'HOU')") + DBI::dbExecute(con, "INSERT INTO Teams VALUES ('PHI', 2021, 'NL', 'PHI')") +} + +# Create a minimal Retrosheet-format zip from vectors of game records. +make_retro_zip <- function(zip_path, bat_rows, pit_rows) { + extract_dir <- dirname(zip_path) + bat_file <- file.path(extract_dir, "batting.csv") + pit_file <- file.path(extract_dir, "pitching.csv") + + bat_header <- paste( + "gid,id,team,opp,b_lp,b_seq,stattype,b_pa,b_ab,b_r,b_h,b_d,b_t,b_hr,", + "b_rbi,b_sh,b_sf,b_hbp,b_w,b_iw,b_k,b_sb,b_cs,b_gdp,b_xi,b_roe,", + "dh,ph,pr,date,number,site,vishome,win,loss,tie,gametype,box,pbp", + sep = "") + writeLines(c(bat_header, bat_rows), bat_file) + + pit_header <- paste( + "gid,id,team,opp,date,number,site,vishome,win,loss,tie,gametype,box,pbp,", + "wp,lp,save,p_gs,p_gf,p_cg,p_ipouts,p_h,p_er,p_hr,p_w,p_iw,p_k,", + "p_hbp,p_wp,p_bk,p_bfp,p_r,p_sh,p_sf", + sep = "") + writeLines(c(pit_header, pit_rows), pit_file) + + zip::zip(zip_path, files = c("batting.csv", "pitching.csv"), + root = extract_dir, mode = "cherry-pick") + unlink(c(bat_file, pit_file)) +} + +test_that("load_retrosheet_post errors when required tables are missing", { + con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") + on.exit(DBI::dbDisconnect(con, shutdown = TRUE)) + expect_error(load_retrosheet_post(con), "Required tables missing") +}) + +test_that("load_retrosheet_post errors when zip_path does not exist", { + con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") + on.exit(DBI::dbDisconnect(con, shutdown = TRUE)) + stub_retrosheet_tables(con) + expect_error( + load_retrosheet_post(con, years = 2022L, zip_path = "/no/such/file.zip"), + "zip_path does not exist" + ) +}) + +test_that("load_retrosheet_post skips when all years already loaded", { + con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") + on.exit(DBI::dbDisconnect(con, shutdown = TRUE)) + stub_retrosheet_tables(con) + # Seed BattingPost with 2022 already present + DBI::dbExecute(con, + "INSERT INTO BattingPost (playerID,yearID,round,teamID,lgID,G,AB,R,H, + X2B,X3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP) + VALUES ('testpl01',2022,'WS','HOU','AL',1,4,1,2,0,0,1,2,0,0,0,1,0,0,0,0,0)") + expect_message( + load_retrosheet_post(con, years = 2022L, zip_path = "unused"), + "already contains" + ) +}) + +test_that("load_retrosheet_post appends rows to BattingPost, PitchingPost, SeriesPost", { + skip_if_not_installed("zip") + con <- DBI::dbConnect(duckdb::duckdb(), ":memory:") + on.exit(DBI::dbDisconnect(con, shutdown = TRUE)) + stub_retrosheet_tables(con) + + # Seed People with retroIDs for HOU batter + pitcher + DBI::dbExecute(con, + "INSERT INTO People VALUES ('altuvjo01','altuvj001','altuvjo01','Jose','Altuve')") + DBI::dbExecute(con, + "INSERT INTO People VALUES ('verlaju01','verlaj001','verlaju01','Justin','Verlander')") + DBI::dbExecute(con, + "INSERT INTO People VALUES ('harpbr01', 'harpebr01','harpbr01', 'Bryce','Harper')") + + td <- tempdir() + zip_path <- file.path(td, "test_retro.zip") + + # One WS game: HOU home, PHI visitor, HOU wins + bat_rows <- c( + "WS2022HOU10200220,altuvj001,HOU,PHI,1,1,batter,4,4,1,2,0,0,1,2,0,0,0,0,0,1,0,0,0,0,,0,0,0,20221101,1,MINU,h,1,0,0,worldseries,,", + "WS2022HOU10200220,harpebr01,PHI,HOU,2,1,batter,4,3,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,,0,0,0,20221101,1,MINU,v,1,0,0,worldseries,," + ) + pit_rows <- c( + "WS2022HOU10200220,verlaj001,HOU,PHI,20221101,1,MINU,h,1,0,0,worldseries,,,1,0,0,1,0,0,24,5,1,0,1,0,5,0,0,0,28,1,0,0" + ) + + make_retro_zip(zip_path, bat_rows, pit_rows) + + load_retrosheet_post(con, years = 2022L, zip_path = zip_path) + + bp <- DBI::dbGetQuery(con, "SELECT * FROM BattingPost WHERE yearID = 2022") + pp <- DBI::dbGetQuery(con, "SELECT * FROM PitchingPost WHERE yearID = 2022") + sp <- DBI::dbGetQuery(con, "SELECT * FROM SeriesPost WHERE yearID = 2022") + + expect_gt(nrow(bp), 0L) + expect_gt(nrow(pp), 0L) + expect_gt(nrow(sp), 0L) + expect_true(all(bp$yearID == 2022L)) + expect_true("WS" %in% bp$round) + expect_true("WS" %in% sp$round) + # HOU won the single game, so should be winner + expect_equal(sp$teamIDwinner, "HOU") +}) diff --git a/vignettes/franchise-efficiency.qmd b/vignettes/franchise-efficiency.qmd new file mode 100644 index 0000000..6c13308 --- /dev/null +++ b/vignettes/franchise-efficiency.qmd @@ -0,0 +1,621 @@ +--- +title: "MLB Franchise Management Efficiency" +author: "David Lucey" +date: last-modified +format: + html: + toc: true + toc-depth: 3 + code-fold: true + fig-width: 10 + fig-height: 7 + theme: flatly +vignette: > + %\VignetteIndexEntry{MLB Franchise Management Efficiency} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.align = "center" +) +``` + +## Background: What is WAR? + +**Wins Above Replacement (WAR)** measures how many more wins a player contributed +than a freely available minor-league call-up would have. A WAR of 5 means the +player was worth 5 extra wins to his team. FanGraphs publishes separate WAR for +batters (fWAR: hitting + baserunning + fielding) and pitchers (FIP-based). We +use FanGraphs WAR here because it extends back to 1985 and is available via the +`baseballr` package. + +| WAR range | Interpretation | +|----------:|----------------| +| < 0 | Below replacement level | +| 0–2 | Bench / back-of-rotation player | +| 2–5 | Solid regular or mid-rotation starter | +| 5–8 | All-Star calibre | +| > 8 | MVP / Cy Young calibre | + +## The Question We Are Answering + +Every team can buy payroll. The question is: **who converts dollars into wins, +playoff appearances, and ultimately World Series rings — and why do some +franchises consistently fail to do so despite spending heavily?** + +We score each franchise across five management dimensions using percentile ranks: + +1. **FA efficiency** — median $/WAR paid to veteran free agents +2. **Dead money** — salary committed to players who produced zero or negative WAR +3. **Homegrown pipeline** — share of WAR from players developed internally +4. **October WAR retention** — how much of a team's regular-season talent showed + up in the postseason (proxy for health and roster depth) +5. **Playoff achievement** — how deep teams went when they made the postseason + +## Data Coverage + +- **Lahman Baseball Database** — team stats, salary data through 2016, and + postseason tables (`SeriesPost`, `BattingPost`, `PitchingPost`) through 2021 +- **Retrosheet bulk CSVs** — postseason batting, pitching, and series results + extended to 2025 via `load_retrosheet_post()` (WC/DS/LCS/WS rounds) +- **FanGraphs WAR** via `baseballr` (1985–2025) +- **Spotrac / USA Today salary data** (2017–2025, combined in `SalariesAll` view) + +Playoff achievement scores and series outcomes extend to **1995–2025** (excl. 2020). +WAR retention proxy is capped at **2021** because regular-season `Batting` and +`Pitching` tables (used to compute RS PA/IPouts) stop at 2021 in the Lahman release. +Call `lahmanTools::load_retrosheet_post()` after `setup_baseball_db()` to add +2022–2025 postseason records before running this analysis. + +```{r prereqs, eval=FALSE} +# Prerequisites: database must be built first +# lahmanTools::setup_baseball_db() # builds baseball.duckdb +``` + +```{r libs} +#| message: false +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +suppressPackageStartupMessages({ + library(data.table) + library(ggplot2) + library(ggrepel) + library(DBI) + library(duckdb) + library(scales) +}) + +n_distinct <- function(x) length(unique(x[!is.na(x)])) + +db_path <- file.path( + path.expand(Sys.getenv("LAHMANS_DBDIR", "~/Documents/Data/baseball")), + "baseball.duckdb" +) +con <- dbConnect(duckdb(), db_path, read_only = TRUE) +qry <- function(sql) setDT(dbGetQuery(con, sql)) +``` + +--- + +## Franchise map + +We use `franchID` throughout so that relocated/renamed franchises are tracked +consistently (e.g., FLO → MIA, MON → WAS). + +```{r fran-map} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +fran_map <- qry(" + SELECT DISTINCT t.teamID, t.franchID, tf.franchName + FROM Teams t + JOIN TeamsFranchises tf ON t.franchID = tf.franchID + WHERE t.yearID BETWEEN 1995 AND 2021 +") +short_names <- c( + NYY = "Yankees", NYM = "Mets", LAD = "Dodgers", + BOS = "Red Sox", OAK = "Athletics", TBD = "Rays", + HOU = "Astros", ATL = "Braves", ARI = "D-Backs", + CHC = "Cubs", SFG = "Giants", STL = "Cardinals", + PHI = "Phillies", CLE = "Guardians", MIN = "Twins", + DET = "Tigers", SEA = "Mariners", TEX = "Rangers", + BAL = "Orioles", TOR = "Blue Jays", KCR = "Royals", + MIL = "Brewers", FLA = "Marlins", CIN = "Reds", + SDP = "Padres", COL = "Rockies", ANA = "Angels", + WSN = "Nationals", CHW = "White Sox", PIT = "Pirates" +) +fran_map[, short := short_names[franchID]] +fran_map[is.na(short), short := franchID] +fran_key <- unique(fran_map[, .(franchID, franchName, short)]) +fran_key <- fran_key[!duplicated(franchID)] +``` + +--- + +## Playoff achievement scores + +Each playoff appearance is scored by how far the team advanced: + +| Round | Points | +|-------|-------:| +| Wild Card win | 1 | +| Division Series win | 2 | +| LCS win | 4 | +| World Series win | 8 | + +Maximum possible = 15 (WC + DS + LCS + WS wins for a champion). + +```{r playoff-scores} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +playoff_seasons <- qry(" + WITH all_teams AS ( + SELECT teamIDwinner AS teamID, yearID FROM SeriesPost + WHERE yearID BETWEEN 1995 AND 2025 AND yearID != 2020 + UNION + SELECT teamIDloser AS teamID, yearID FROM SeriesPost + WHERE yearID BETWEEN 1995 AND 2025 AND yearID != 2020 + ) + SELECT a.teamID, a.yearID, + COALESCE(SUM(CASE + WHEN sw.round = 'WS' THEN 8 + WHEN sw.round IN ('ALCS','NLCS') THEN 4 + WHEN sw.round IN ('ALDS1','ALDS2','NLDS1','NLDS2') THEN 2 + WHEN sw.round IN ('ALWC','NLWC') + OR sw.round LIKE 'ALWC%' + OR sw.round LIKE 'NLWC%' THEN 1 + ELSE 0 END), 0) AS achievement_score + FROM all_teams a + LEFT JOIN SeriesPost sw + ON a.teamID = sw.teamIDwinner + AND a.yearID = sw.yearID + AND sw.yearID BETWEEN 1995 AND 2025 + AND sw.yearID != 2020 + GROUP BY a.teamID, a.yearID +") +playoff_seasons <- merge(playoff_seasons, + fran_map[, .(teamID, franchID)], + by = "teamID", all.x = TRUE) +``` + +--- + +## Postseason WAR proxy + +No published postseason WAR exists from FanGraphs or any other source for this +era. The industry-standard approach scales each player's regular-season WAR by +their postseason usage rate: + +$$\text{PS WAR proxy} = \text{RS WAR} \times \frac{\text{PS PA (or IP)}}{\text{RS PA (or IP)}}$$ + +Players who appear on a playoff roster but record zero postseason PA/IP receive a +proxy of 0. This directly captures two important signals: + +1. **Injury/decline** — a high-salary veteran who is hurt in October contributes 0 +2. **Roster construction** — teams that carry their best regular-season producers + into the playoffs score higher + +```{r war-proxy} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +playoff_ty <- unique(playoff_seasons[, .(teamID, yearID)]) + +rs_pa <- qry(" + SELECT playerID, yearID, teamID, + SUM(AB + COALESCE(BB,0) + COALESCE(HBP,0) + + COALESCE(SF,0) + COALESCE(SH,0)) AS rs_pa + FROM Batting + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID +") +rs_ip <- qry(" + SELECT playerID, yearID, teamID, SUM(IPouts) AS rs_ipouts + FROM Pitching + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID +") +ps_pa <- qry(" + SELECT playerID, yearID, teamID, + SUM(AB + COALESCE(BB,0) + COALESCE(HBP,0) + + COALESCE(SF,0) + COALESCE(SH,0)) AS ps_pa + FROM BattingPost + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID +") +ps_ip <- qry(" + SELECT playerID, yearID, teamID, SUM(IPouts) AS ps_ipouts + FROM PitchingPost + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID +") +war_data <- qry(" + SELECT playerID, yearID, bat_war, pit_war, total_war + FROM PlayerWAR WHERE total_war IS NOT NULL +") + +# Batter proxy +bat_proxy <- rs_pa[playoff_ty, on = c("teamID","yearID"), nomatch = 0] +bat_proxy <- bat_proxy[war_data[, .(playerID, yearID, bat_war)], + on = c("playerID","yearID"), nomatch = 0] +bat_proxy <- merge(bat_proxy, ps_pa, by = c("playerID","yearID","teamID"), all.x = TRUE) +bat_proxy[is.na(ps_pa), ps_pa := 0L] +bat_proxy[, bat_war_proxy := fifelse( + rs_pa > 0 & is.finite(bat_war) & bat_war > 0, + bat_war * (ps_pa / rs_pa), 0 +)] +bat_team <- bat_proxy[, .( + rs_bat_war = sum(bat_war, na.rm = TRUE), + ps_bat_war = sum(bat_war_proxy, na.rm = TRUE) +), by = .(teamID, yearID)] + +# Pitcher proxy +pit_proxy <- rs_ip[playoff_ty, on = c("teamID","yearID"), nomatch = 0] +pit_proxy <- pit_proxy[war_data[, .(playerID, yearID, pit_war)], + on = c("playerID","yearID"), nomatch = 0] +pit_proxy <- merge(pit_proxy, ps_ip, by = c("playerID","yearID","teamID"), all.x = TRUE) +pit_proxy[is.na(ps_ipouts), ps_ipouts := 0L] +pit_proxy[, pit_war_proxy := fifelse( + rs_ipouts > 0 & is.finite(pit_war) & pit_war > 0, + pit_war * (ps_ipouts / rs_ipouts), 0 +)] +pit_team <- pit_proxy[, .( + rs_pit_war = sum(pit_war, na.rm = TRUE), + ps_pit_war = sum(pit_war_proxy, na.rm = TRUE) +), by = .(teamID, yearID)] + +war_ret <- merge(bat_team, pit_team, by = c("teamID","yearID"), all = TRUE) +for (col in c("rs_bat_war","rs_pit_war","ps_bat_war","ps_pit_war")) + set(war_ret, which(is.na(war_ret[[col]])), col, 0) +war_ret[, `:=`( + rs_war_total = rs_bat_war + rs_pit_war, + ps_war_proxy = ps_bat_war + ps_pit_war +)] +war_ret[, war_retention := fifelse(rs_war_total > 0, + ps_war_proxy / rs_war_total, NA_real_)] +``` + +--- + +## Stage 1: Getting to October + +```{r stage1-data} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +fa_cost <- qry(" + SELECT t.franchID, + MEDIAN(sp.dollars_per_war / 1e6) AS fa_m_per_war, + COUNT(*) AS n_fa_seasons + FROM SalaryPerWAR sp + JOIN PlayerAcquisitionType pat + ON sp.playerID = pat.playerID AND sp.teamID = pat.teamID + JOIN Teams t ON sp.teamID = t.teamID AND sp.yearID = t.yearID + WHERE pat.acq_type = 'veteran_acq' + AND sp.yearID BETWEEN 1995 AND 2021 AND sp.yearID != 2020 + GROUP BY t.franchID +") + +playoff_rate <- qry(" + WITH all_seasons AS ( + SELECT DISTINCT t.franchID, t.yearID + FROM Teams t + WHERE t.yearID BETWEEN 1995 AND 2021 AND t.yearID != 2020 + ), + playoff_fran AS ( + SELECT DISTINCT t2.franchID, ps.yearID + FROM ( + SELECT teamIDwinner AS teamID, yearID FROM SeriesPost + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + UNION + SELECT teamIDloser AS teamID, yearID FROM SeriesPost + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + ) ps + JOIN Teams t2 ON ps.teamID = t2.teamID AND ps.yearID = t2.yearID + ) + SELECT a.franchID, + COUNT(DISTINCT a.yearID) AS total_seasons, + COUNT(DISTINCT pf.yearID) AS n_playoff_seasons, + COUNT(DISTINCT pf.yearID)::DOUBLE / COUNT(DISTINCT a.yearID) AS playoff_rate + FROM all_seasons a + LEFT JOIN playoff_fran pf ON a.franchID = pf.franchID AND a.yearID = pf.yearID + GROUP BY a.franchID +") + +dead_pct <- qry(" + WITH payroll AS ( + SELECT t.franchID, SUM(s.salary) / 1e6 AS total_payroll_M + FROM SalariesAll s + JOIN Teams t ON s.teamID = t.teamID AND s.yearID = t.yearID + WHERE s.is_actual = TRUE AND s.salary >= 1e6 + AND s.yearID BETWEEN 1995 AND 2021 AND s.yearID != 2020 + GROUP BY t.franchID + ), + dead AS ( + SELECT t.franchID, SUM(s.salary) / 1e6 AS dead_M + FROM SalariesAll s + JOIN Teams t ON s.teamID = t.teamID AND s.yearID = t.yearID + LEFT JOIN PlayerWAR w ON s.playerID = w.playerID AND s.yearID = w.yearID + WHERE s.is_actual = TRUE AND s.salary >= 1e6 + AND s.yearID BETWEEN 1995 AND 2021 AND s.yearID != 2020 + AND (w.total_war IS NULL OR w.total_war <= 0) + GROUP BY t.franchID + ) + SELECT p.franchID, + p.total_payroll_M, + d.dead_M, + d.dead_M / p.total_payroll_M * 100 AS dead_pct + FROM payroll p JOIN dead d ON p.franchID = d.franchID +") + +hg_war <- qry(" + SELECT t.franchID, + SUM(CASE WHEN pat.acq_type = 'homegrown' THEN sp.total_war ELSE 0 END) + ::DOUBLE / NULLIF(SUM(sp.total_war), 0) * 100 AS hg_war_pct + FROM SalaryPerWAR sp + JOIN PlayerAcquisitionType pat + ON sp.playerID = pat.playerID AND sp.teamID = pat.teamID + JOIN Teams t ON sp.teamID = t.teamID AND sp.yearID = t.yearID + WHERE sp.yearID BETWEEN 1995 AND 2021 AND sp.yearID != 2020 + AND sp.total_war IS NOT NULL + GROUP BY t.franchID +") + +avg_payroll <- qry(" + SELECT t.franchID, + SUM(s.salary) / 1e6 / COUNT(DISTINCT s.yearID) AS avg_payroll_M + FROM SalariesAll s + JOIN Teams t ON s.teamID = t.teamID AND s.yearID = t.yearID + WHERE s.is_actual = TRUE AND s.salary >= 1e6 + AND s.yearID BETWEEN 1995 AND 2021 AND s.yearID != 2020 + GROUP BY t.franchID +") + +stage1 <- Reduce( + function(a, b) merge(a, b, by = "franchID", all.x = TRUE), + list(fa_cost, playoff_rate, dead_pct[, .(franchID, dead_pct)], + hg_war, avg_payroll) +) +stage1 <- merge(stage1, fran_key, by = "franchID", all.x = TRUE) +stage1 <- stage1[!is.na(short)] +``` + +```{r chart-stage1} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +#| fig-cap: "Stage 1: efficient franchises appear in the upper-left (low $/WAR, high playoff rate). Bubble size = dead-money burden; blue fill = strong homegrown pipeline." +ggplot(stage1, aes(fa_m_per_war, playoff_rate * 100)) + + geom_point(aes(size = dead_pct, fill = hg_war_pct), + shape = 21, colour = "grey30", alpha = 0.9) + + geom_label_repel(aes(label = short), size = 3.0, fontface = "bold", + box.padding = 0.4, max.overlaps = 30, seed = 42, + min.segment.length = 0.2) + + scale_fill_gradient(low = "#D6604D", high = "#2166AC", + name = "Homegrown WAR (%)") + + scale_size_continuous(range = c(2, 9), name = "Dead money (% payroll)") + + scale_x_continuous(labels = dollar_format(suffix = "M"), + name = "Median $/WAR for veteran FAs (lower = more efficient)") + + scale_y_continuous(name = "Playoff appearance rate (%)", + labels = function(x) paste0(x, "%")) + + labs( + title = "Stage 1: Who Gets to October Efficiently?", + caption = "Sources: Lahman, Retrosheet, FanGraphs, Spotrac, USA Today -- 1995-2025 (achievement); WAR retention 1995-2021" + ) + + theme_minimal(base_size = 13) + + theme(plot.title = element_text(face = "bold"), + legend.position = "bottom", + panel.grid.minor = element_blank()) +``` + +--- + +## Stage 2: Going Deep + +```{r stage2-data} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +team_rs_war <- qry(" + WITH player_pa AS ( + SELECT playerID, yearID, teamID, + SUM(AB + COALESCE(BB,0) + COALESCE(HBP,0) + + COALESCE(SF,0) + COALESCE(SH,0)) AS pa + FROM Batting + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID + UNION ALL + SELECT playerID, yearID, teamID, SUM(IPouts) AS pa + FROM Pitching + WHERE yearID BETWEEN 1995 AND 2021 AND yearID != 2020 + GROUP BY playerID, yearID, teamID + ), + primary_team AS ( + SELECT DISTINCT ON (playerID, yearID) playerID, yearID, teamID + FROM player_pa + ORDER BY playerID, yearID, pa DESC + ) + SELECT pt.teamID, pt.yearID, SUM(pw.total_war) AS rs_war + FROM primary_team pt + JOIN PlayerWAR pw ON pt.playerID = pw.playerID AND pt.yearID = pw.yearID + WHERE pw.total_war IS NOT NULL + GROUP BY pt.teamID, pt.yearID +") +rs_war_playoff <- team_rs_war[playoff_ty, on = c("teamID","yearID"), nomatch = 0] +rs_war_playoff[, rs_war_rank := frank(-rs_war, ties.method = "average"), by = yearID] + +stage2_raw <- Reduce( + function(a, b) merge(a, b, by = c("teamID","yearID"), all.x = TRUE), + list( + playoff_seasons, + rs_war_playoff[, .(teamID, yearID, rs_war, rs_war_rank)], + war_ret[, .(teamID, yearID, rs_war_total, ps_war_proxy, war_retention)] + ) +) +stage2_raw <- merge(stage2_raw, fran_key, by = "franchID", all.x = TRUE) + +stage2 <- stage2_raw[!is.na(franchID), .( + n_playoffs = .N, + avg_achievement = mean(achievement_score, na.rm = TRUE), + avg_rs_war_rank = mean(rs_war_rank, na.rm = TRUE), + avg_war_retention = mean(war_retention, na.rm = TRUE) +), by = .(franchID, short, franchName)] +setorder(stage2, -avg_achievement) +``` + +```{r chart-stage2} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +#| fig-cap: "Stage 2: x-axis reversed so stronger RS WAR teams are on the right. Colour shows WAR retention -- how much of that regular-season talent materialised in October." +stage2_plot <- stage2[n_playoffs >= 5] +stage2_plot[, label := short] + +ggplot(stage2_plot, aes(avg_rs_war_rank, avg_achievement)) + + geom_point(aes(size = n_playoffs, colour = avg_war_retention * 100), alpha = 0.9) + + geom_label_repel(aes(label = label), size = 3.2, fontface = "bold", + box.padding = 0.5, seed = 42) + + scale_colour_gradient2( + low = "#D6604D", mid = "gold", high = "#2166AC", midpoint = 50, + name = "WAR retention (%)", + labels = function(x) paste0(round(x), "%") + ) + + scale_size_continuous(range = c(2, 9), name = "# playoff appearances") + + scale_x_reverse( + name = "Avg RS WAR rank among playoff teams that year (1 = most WAR)", + n.breaks = 8 + ) + + scale_y_continuous(name = "Avg playoff achievement score per appearance") + + labs( + title = "Stage 2: Who Converts October Appearances to Deep Runs?", + caption = paste0("WAR retention proxy: RS WAR x (PS PA/RS PA). ", + "Absent players = 0. Sources: Lahman, FanGraphs") + ) + + theme_minimal(base_size = 13) + + theme(plot.title = element_text(face = "bold"), + legend.position = "bottom", + panel.grid.minor = element_blank()) +``` + +--- + +## Synthesis Scorecard + +All five dimensions combined into a single percentile-rank heatmap. Rows are +sorted by the **Overall** column — the simple mean of all five percentile ranks. + +::: callout-note +**How to read this chart:** A score of 100 means that franchise ranked best +among all qualifying franchises on that dimension. Scores of 0 are worst. +The rightmost column (**OVERALL**) is the simple mean across all five +dimensions and determines the row sort order. + +**Why the Red Sox appear mid-table despite a 95th-percentile Playoff Achievement +score:** They also rank near the bottom on FA Efficiency and Dead Money, dragging +their Overall mean to a middling position. High achievement came at high cost. +::: + +```{r scorecard} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +#| fig-height: 9 +#| fig-cap: "Rows sorted by OVERALL = mean of 5 dimension percentile ranks. The rightmost column shows that overall rank. Red = bottom of league; Blue = top." +syn <- Reduce( + function(a, b) merge(a, b, by = "franchID", all.x = TRUE), + list( + stage1[, .(franchID, short, fa_m_per_war, playoff_rate, dead_pct, hg_war_pct)], + stage2[, .(franchID, avg_achievement, avg_war_ret_pct, n_playoffs)], + avg_payroll + ) +) +syn <- syn[n_playoffs >= 5 & !is.na(fa_m_per_war)] + +pct_rank <- function(x, higher_better = TRUE) { + r <- rank(x, na.last = "keep", ties.method = "average") + p <- (r - 1) / (sum(!is.na(x)) - 1) * 100 + if (!higher_better) p <- 100 - p + round(p, 1) +} +syn[, `:=`( + pct_fa_cost = pct_rank(fa_m_per_war, higher_better = FALSE), + pct_dead = pct_rank(dead_pct, higher_better = FALSE), + pct_hg = pct_rank(hg_war_pct, higher_better = TRUE), + pct_retention = pct_rank(avg_war_ret_pct, higher_better = TRUE), + pct_achievement = pct_rank(avg_achievement, higher_better = TRUE) +)] +syn[, overall := rowMeans(.SD, na.rm = TRUE), + .SDcols = c("pct_fa_cost","pct_dead","pct_hg","pct_retention","pct_achievement")] +syn[, overall_pct := pct_rank(overall, TRUE)] +setorder(syn, -overall) +syn[, short_f := factor(short, levels = rev(short))] + +metrics_labels <- c( + pct_fa_cost = "FA Efficiency\n($/WAR)", + pct_dead = "Low Dead\nMoney", + pct_hg = "Homegrown\nWAR %", + pct_retention = "Oct WAR\nRetention", + pct_achievement = "Playoff\nAchievement", + overall_pct = "OVERALL\n(mean rank)" +) +syn_long <- melt( + syn[, c("franchID","short_f", names(metrics_labels)), with = FALSE], + id.vars = c("franchID","short_f"), + variable.name = "metric", + value.name = "pct" +) +syn_long[, metric_label := factor(metrics_labels[as.character(metric)], + levels = metrics_labels)] +syn_long[, is_overall := metric == "overall_pct"] + +ggplot(syn_long, aes(metric_label, short_f, fill = pct)) + + geom_tile(aes(colour = is_overall, linewidth = is_overall)) + + geom_text(aes(label = round(pct)), size = 2.8, + colour = "white", fontface = "bold") + + scale_fill_gradient2(low = "#D6604D", mid = "gold", high = "#2166AC", + midpoint = 50, name = "Percentile", + limits = c(0, 100)) + + scale_colour_manual(values = c(`FALSE` = "white", `TRUE` = "#333333"), + guide = "none") + + scale_linewidth_manual(values = c(`FALSE` = 0.5, `TRUE` = 1.5), + guide = "none") + + scale_x_discrete(position = "top") + + labs( + title = "Franchise Management Efficiency Scorecard", + subtitle = paste0( + "Sorted by OVERALL = simple mean of all 5 dimension percentile ranks.\n", + "100 = best franchise on that dimension; 0 = worst. ", + "Franchises with \u22655 playoff appearances, 1995-2025 (excl. 2020)." + ), + caption = "Sources: Lahman, FanGraphs, Spotrac, USA Today" + ) + + theme_minimal(base_size = 12) + + theme( + plot.title = element_text(face = "bold"), + axis.title = element_blank(), + axis.text.y = element_text(size = 10, face = "bold"), + legend.position = "right", + panel.grid = element_blank() + ) +``` + +--- + +## Cleanup + +```{r cleanup} +#| eval: !expr !is.na(Sys.getenv("LAHMANS_DBDIR", unset = NA)) +DBI::dbDisconnect(con, shutdown = TRUE) +``` + +--- + +## Key Takeaways + +1. **High payroll does not buy rings.** The Pearson correlation between average + seasonal payroll and playoff achievement score is r ≈ 0.00 (R² < 0.001) + across 1995--2025 (excl. 2020). In fact, the slope is slightly *negative* — not because + spending hurts, but because the signal is genuinely absent. + +2. **Dead money disproportionately hurts small-market teams.** A single bad + long-term contract can consume 25--30% of a small-market team's entire + budget, crowding out productive roster depth. + +3. **Homegrown pipelines are the most efficient path to sustained WAR.** Teams + that develop talent internally pay pre-arbitration and arbitration rates -- + a fraction of open-market costs. + +4. **WAR retention in October separates contenders from early exits.** Aging + veteran rosters assembled through expensive free-agent signings are more + likely to show depleted WAR in the postseason due to injury and decline. + +5. **The best franchises score well across all five dimensions.** No single + lever -- payroll, development, or health -- is sufficient on its own.