Feature/data management#466

Open: Shashankss1205 wants to merge 3 commits into main from feature/data_management

Shashankss1205 (Collaborator) commented Jun 1, 2026

PR: Add Data Management Tools

Reference Issues/PRs

Fixes #465. Stacked on top of #463.

What does this implement/fix? Explain your changes.

Adds 4 new MCP tools that complete the Data Management group of the ideal architecture:

| # | New Tool | Purpose | sktime Coverage |
|---|----------|---------|-----------------|
| 4 | inspect_data | Rich metadata inspection of loaded data handles | mtype(), check_is_mtype(), get_cutoff(), scitype detection |
| 5 | split_data | Temporal train/test splitting with test_size (fraction) or fh (horizon count) | temporal_train_test_split() |
| 6 | transform_data | Unified data transformation: action="format" (auto-fix freq/dupes/NaN) or action="convert" (mtype conversion) | convert_to(), frequency inference, fill NaN, dedup |
| 7 | save_data | Persist data handles to CSV, Parquet, or JSON files | File export for predictions/transforms |

inspect_data returns: mtype, scitype, shape, columns, dtypes, index_names, freq, cutoff, n_missing, head (first 5 rows), and summary_stats.
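As a rough illustration of the pandas-level fields in that list, the sketch below collects them for a Series or DataFrame. The helper name `inspect_frame` is hypothetical and is not taken from this PR; the mtype, scitype, and cutoff fields, which the tool obtains through sktime, are deliberately left out.

```python
import pandas as pd


def inspect_frame(data):
    """Collect pandas-level metadata for a Series or DataFrame (hypothetical helper)."""
    frame = data.to_frame() if isinstance(data, pd.Series) else data
    index = frame.index
    # freqstr only exists on datetime-like indexes; try inference as a fallback.
    freq = getattr(index, "freqstr", None)
    if freq is None and isinstance(index, pd.DatetimeIndex) and len(index) >= 3:
        freq = pd.infer_freq(index)
    return {
        "shape": frame.shape,
        "columns": [str(c) for c in frame.columns],
        "dtypes": {str(c): str(t) for c, t in frame.dtypes.items()},
        "index_names": list(index.names),
        "freq": freq,
        "n_missing": int(frame.isna().sum().sum()),
        "head": frame.head(5),
        "summary_stats": frame.describe(),
    }
```

The real tool presumably serializes `head` and `summary_stats` before returning them over MCP; that step is omitted here.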

split_data creates two new data handles (train + test) and reports the cutoff timestamp. Supports both fractional (test_size=0.2) and horizon-based (fh=12) splitting.
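The two splitting modes can be sketched without sktime as follows. This is only an approximation of the behaviour described here: the helper name `temporal_split`, the round-up rule for fractional sizes, and the choice to hold out `max(fh)` steps for a list-valued `fh` (taken from the later commit note in this PR) are assumptions, and the tool itself delegates to `temporal_train_test_split()`.

```python
import math

import pandas as pd


def temporal_split(y, test_size=None, fh=None):
    """Split y into train/test without shuffling (hypothetical helper)."""
    if (test_size is None) == (fh is None):
        raise ValueError("pass exactly one of test_size or fh")
    n = len(y)
    if fh is not None:
        steps = [fh] if isinstance(fh, int) else list(fh)
        if not steps or any(not isinstance(s, int) or s < 1 for s in steps):
            raise ValueError("fh must contain positive integers")
        n_test = max(steps)  # hold out enough points to cover the furthest step
    else:
        if not 0 < test_size < 1:
            raise ValueError("test_size must be a fraction in (0, 1)")
        n_test = math.ceil(test_size * n)  # rounding up is an assumption
    if n_test >= n:
        raise ValueError("test set would leave no training data")
    train, test = y.iloc[: n - n_test], y.iloc[n - n_test :]
    # The cutoff is the last timestamp seen during training.
    return train, test, {"cutoff": train.index[-1], "n_test": n_test}
```

Because the split is positional, the training set always precedes the test set in time, which is the property that distinguishes it from a shuffled split.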

transform_data subsumes the existing format_time_series tool as action="format" and adds a new action="convert" mode that calls sktime.datatypes.convert_to() for mtype conversion.
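A minimal sketch of that two-way dispatch is shown below. The function name `transform_sketch`, the `to_mtype` parameter, and the forward-then-backward fill strategy are assumptions made for illustration; only `sktime.datatypes.convert_to()` is named in this PR, and it is imported lazily so that the format path runs without sktime installed.

```python
import pandas as pd


def transform_sketch(y, action="format", to_mtype=None):
    """Dispatch on action (hypothetical signature, not the PR's actual one)."""
    if action == "format":
        # Drop duplicate timestamps (keeping the first) and restore time order.
        y = y[~y.index.duplicated(keep="first")].sort_index()
        # Attach an inferred frequency when the index is regular enough.
        if isinstance(y.index, pd.DatetimeIndex) and len(y) >= 3:
            freq = pd.infer_freq(y.index)
            if freq is not None:
                y = y.asfreq(freq)
        # Fill strategy is an assumption: forward fill, then backward fill.
        return y.ffill().bfill()
    if action == "convert":
        if to_mtype is None:
            raise ValueError("action='convert' requires to_mtype")
        from sktime.datatypes import convert_to

        return convert_to(y, to_type=to_mtype)
    raise ValueError(f"unknown action: {action!r}")
```

Keeping both modes behind one `action` argument mirrors the single-tool interface described above, at the cost of parameters that only apply to one branch.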

save_data combines y and X into a single DataFrame and writes to disk in CSV (default), Parquet, or JSON format.
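The combine-and-write step might look roughly like this. The helper name `save_frame`, its parameters, and the JSON orientation are assumptions rather than the PR's actual code; the Parquet branch additionally needs a Parquet engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd


def save_frame(y, X=None, path="output.csv", fmt="csv"):
    """Join y and X column-wise and write them to disk (hypothetical helper)."""
    frame = y.to_frame() if isinstance(y, pd.Series) else y
    if X is not None:
        # Align on the shared index so each row holds y and its exogenous values.
        frame = pd.concat([frame, X], axis=1)
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    writers = {
        "csv": lambda: frame.to_csv(target),
        "parquet": lambda: frame.to_parquet(target),  # requires pyarrow or fastparquet
        "json": lambda: frame.to_json(target, orient="split", date_format="iso"),
    }
    if fmt not in writers:
        raise ValueError(f"unsupported format: {fmt!r}")
    writers[fmt]()
    return target
```

A dictionary of writers keeps the format dispatch in one place, so supporting another format means adding a single entry.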

Files created:

  • src/sktime_mcp/tools/inspect_data.py
  • src/sktime_mcp/tools/split_data.py
  • src/sktime_mcp/tools/transform_data.py
  • src/sktime_mcp/tools/save_data.py
  • tests/test_data_management.py (15 unit tests)

Files modified:

  • src/sktime_mcp/server.py — Tool schemas + call_tool dispatcher routing
  • src/sktime_mcp/tools/__init__.py — Updated exports

Does your contribution introduce a new dependency? If yes, which one?

No. All tools use existing dependencies (pandas, sktime).

What should a reviewer concentrate their feedback on?

  • inspect_data — verify the mtype/scitype detection fallback logic is robust
  • split_data — review the temporal splitting strategy and handle registration
  • transform_data — confirm the convert_to() integration handles edge cases
  • save_data — verify the pathlib.Path usage and format dispatch
  • Ensure the existing format_time_series tool remains available for backward compatibility (it does — transform_data(action="format") delegates to the same executor method)

Any other comments?

All 189 tests pass cleanly under make check (format + lint + pytest). The existing format_time_series tool is preserved for backward compatibility — transform_data wraps the same executor logic with a cleaner interface.

PR checklist

For all contributions
  • I've added unit tests and made sure they pass locally (make check).

  • I've added the tool to the online documentation in docs/source/.

  • I've updated the existing example scripts or provided a new one to showcase how my tool works in examples/.

Shashankss1205 self-assigned this Jun 1, 2026
Shashankss1205 force-pushed the feature/data_management branch from b550e24 to 9a11d65 on June 4, 2026 19:49
…save_data

- inspect_data: rich metadata (mtype, scitype, shape, freq, cutoff, missing values, head, summary_stats)
- split_data: temporal train/test split with test_size (fraction) or fh (horizon count)
- transform_data: unified action='format' (auto-fix freq/dupes/NaN) or action='convert' (mtype conversion)
- save_data: persist data handles to CSV/Parquet/JSON files
- Added 15 unit tests covering all tools and edge cases
- Wired all 4 tools into server.py (Tool schemas + call_tool dispatcher)
- Updated tools/__init__.py exports
Shashankss1205 force-pushed the feature/data_management branch from 9a11d65 to 5d5afe9 on June 4, 2026 19:58
Shashankss1205 and others added 2 commits June 5, 2026 01:32
Add accurate LLM-facing descriptions for inspect_data, split_data,
transform_data, and save_data. Remove format_time_series MCP tool now
subsumed by transform_data(action='format'). Fix fh list splitting to
use max(fh) steps, rename return test_size to n_test, and add fh validation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Development

Successfully merging this pull request may close these issues.

Phase 5: Data Management Tools