Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,4 @@
^AGENTS\.md$
^CONTRIBUTING\.md$
^analysis$
^.*\.pdf$
17 changes: 17 additions & 0 deletions .copilot/mcp-config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"mcpServers": {
"baseball": {
"type": "stdio",
"command": "python3",
"args": [
"-c",
"import os, subprocess; db=os.path.join(os.environ.get('LAHMANS_DBDIR', os.path.expanduser('~/Documents/Data/baseball')), 'baseball.duckdb'); subprocess.run(['duckdb-mcp-server', '--db-path', db, '--readonly'])"
]
},
"r-btw": {
"type": "stdio",
"command": "Rscript",
"args": ["-e", "btw::btw_mcp_server()"]
}
}
}
13 changes: 13 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# CODEOWNERS — these files require owner review on every PR.
# Protects against prompt injection via AI instruction files and
# unauthorized expansion of the MCP attack surface.

# AI agent instructions and MCP server config are sensitive:
# a malicious change could redirect agent behaviour or expose new
# execution vectors. Require explicit owner sign-off.
.github/copilot-instructions.md @davidlucey
.copilot/mcp-config.json @davidlucey

# DESCRIPTION and NAMESPACE control package identity and imports
DESCRIPTION @davidlucey
NAMESPACE @davidlucey
70 changes: 64 additions & 6 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,15 +97,73 @@ cd $PROJ && git reset

When developing analysis scripts or iterating on charts, use an **interactive R session** instead of re-running the full script each time:

1. Start R in async mode: `bash mode="async" command="R --no-save"`
2. Source shared setup (DB connection, libraries) once
3. Send individual code blocks via `write_bash` to iterate on specific charts or queries
4. Use the `view` tool on saved PNG files to inspect chart output visually
5. Only assemble the final `.R` script once the individual pieces are working
1. Start R in async mode: `bash mode="async" command="R --no-save" shellId="r-session"`
2. Manually connect to DuckDB and load libraries (do NOT source the full script — see gotcha below)
3. Source only the SQL/data-processing block once (lines after `dbConnect`, before chart code)
4. Send only the chart code block via `write_bash` to iterate on specific charts or queries
5. Use the `view` tool on saved PNG files to inspect chart output visually
6. Only assemble the final `.R` script once the individual pieces are working

**`on.exit` gotcha in sourced scripts:** Analysis scripts use `on.exit(dbDisconnect(con, shutdown = TRUE))` at the top level. When you `source()` such a file interactively, R fires the `on.exit` handler when `source()` returns, closing the connection immediately. **Workaround:** connect manually in the session first, then source only the data-processing and chart sections (not the preamble).

```r
# Step 1 — run once to set up the session
suppressPackageStartupMessages({
library(data.table); library(ggplot2); library(DBI); library(duckdb)
})
db_path <- file.path(path.expand(Sys.getenv("LAHMANS_DBDIR", "~/Documents/Data/baseball")), "baseball.duckdb")
con <- dbConnect(duckdb(), db_path, read_only = TRUE)

# Step 2 — source just the SQL + data wrangling (skip preamble lines)
source("/tmp/roi_data.R") # or whichever temp file has only the data block

# Step 3 — iterate: edit chart file, source, view
source("/tmp/roi_chart.R")
```

This avoids the 60-90 second penalty of re-running a full analysis script on every change and enables tight visual feedback loops.

**DuckDB CLI for ad-hoc queries:** Use `duckdb ~/Documents/Data/baseball/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code.
**DuckDB CLI for ad-hoc queries:** Use `duckdb $LAHMANS_DBDIR/baseball.duckdb` for quick schema checks (`DESCRIBE`, `SUMMARIZE`) rather than writing throwaway R code.

## MCP Servers

Two MCP servers are configured in `.copilot/mcp-config.json`:

| Server | Command | Purpose |
|--------|---------|---------|
| `baseball` | `duckdb-mcp-server --readonly` | Read-only SQL access to `baseball.duckdb`; path resolved via `$LAHMANS_DBDIR` using Python (no shell expansion — avoids injection) |
| `r-btw` | `btw::btw_mcp_server()` | R package dev tools: test, document, check, coverage, help |

**Prerequisites:**
- `LAHMANS_DBDIR` env var should be set (defaults to `~/Documents/Data/baseball`).
- `duckdb-mcp-server` binary on `PATH` (installed at `~/.local/bin/duckdb-mcp-server`).
- `btw` R package installed in the system library (not renv).

**Using `r-btw` tools:** prefer them over bash for package tasks — `btw_tool_pkg_test`, `btw_tool_pkg_check`, `btw_tool_pkg_coverage`, `btw_tool_pkg_document` all run in-process and are faster than shell invocations.

**`mcptools` is intentionally NOT configured** as an MCP server. `mcptools::mcp_server(session_tools = TRUE)` would expose `list_r_sessions` / `select_r_session`, giving the AI arbitrary R code execution in any session that has called `mcp_session()`. Use the bash async session approach instead (see Interactive R Sessions above).

## Security

The following files are high-value targets for prompt injection and are protected by CODEOWNERS (owner review required on every PR):

- `.github/copilot-instructions.md` — controls AI agent behaviour for all sessions
- `.copilot/mcp-config.json` — controls which MCP servers (execution surfaces) are available

**What prompt injection means here:** a malicious PR that modifies `copilot-instructions.md` could redirect the agent to exfiltrate data, weaken commit checks, or perform unintended operations. The CODEOWNERS rule ensures a human must explicitly approve any change to these files before merge.

**MCP surface area (in order of privilege):**
1. `baseball` (DuckDB, read-only) — SQL queries only; no writes; path constructed programmatically to avoid shell injection.
2. `r-btw` — can read all package source files and run tests/checks. Cannot write files or execute arbitrary shell commands.

**Never add to MCP config without security review:**
- Any server that exposes `eval`, `system()`, `shell()`, or arbitrary R/Python execution.
- `mcptools::mcp_server(session_tools = TRUE)` — see above.
- Any server that takes user-supplied input as a shell argument.

## Tests

Run with `devtools::test()`. The suite has ~71 `test_that()` blocks (202 assertions) across 6 files. All must pass before committing. The full-DB smoke test uses `skip_on_ci()` and `skip_if_not_installed("Lahman")`.

## R CMD Check

Expand Down
1 change: 0 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,6 @@ tests/testthat/_snaps/

# Copilot CLI config (environment-specific, not for contributors)
.github/lsp.json
.copilot/mcp-config.json
.mcp.json

# Old scratch notebooks (superseded by analysis/)
Expand Down
Binary file not shown.
13 changes: 10 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,13 +1,14 @@
Package: lahmanTools
Title: Baseball Analytics with Lahman and DuckDB
Version: 0.3.0
Version: 0.4.0
Authors@R: person("David", "Lucey", role = c("aut", "cre"),
email = "david@example.com")
Description: Loads all Sean Lahman baseball tables (1871-2025) into a persistent
DuckDB database and exposes pre-built sabermetric SQL views (BattingStats,
PitchingStats, SalaryPerWAR, etc.). Optionally extends salary coverage to
2017-2025 from USA Today and Spotrac, and supplements with FanGraphs WAR
(1985+) and Chadwick Bureau player ID crosswalk via the baseballr package.
2017-2025 from USA Today and Spotrac, supplements with FanGraphs WAR
(1985+) and Retrosheet postseason data (2022-2025), and provides Chadwick
Bureau player ID crosswalk via the baseballr package.
No third-party data is bundled; all supplemental data is fetched at runtime.
License: MIT + file LICENSE
Encoding: UTF-8
Expand All @@ -27,6 +28,12 @@ Suggests:
dm,
jsonlite,
re2,
zip,
ggplot2,
ggrepel,
scales,
quarto,
knitr,
testthat (>= 3.0.0)
VignetteBuilder: quarto
Config/testthat/edition: 3
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ export(db_query)
export(dt_factors_to_char)
export(load_chadwick_ids)
export(load_fangraphs_war)
export(load_retrosheet_post)
export(load_statcast)
export(match_player_ids)
export(normalise_player_name)
Expand Down
15 changes: 15 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,18 @@
# lahmanTools 0.4.0

## Attribution fix

* **README** and **DESCRIPTION** now include Retrosheet as a credited data
source. `load_retrosheet_post()` has always carried the required attribution
in its roxygen docs and `inst/RETROSHEET_NOTICE`; the top-level docs now
match.

## Code quality

* Three call sites that bypassed `db_query()` now use it consistently:
`scrape.R` (people lookup) and `utils.R` (Batting/Pitching roster tables in
`match_player_ids()` Pass 4). No behaviour change -- purely DRY cleanup.

# lahmanTools 0.3.0

## Breaking changes
Expand Down
Loading
Loading