Skip to content

Conversation

@ryankert01
Copy link
Contributor

@ryankert01 ryankert01 commented Dec 31, 2025

Purpose of PR

ran some benchmark + test(maybe) on colab.
https://colab.research.google.com/drive/1NuAyAFsko3uD8MMBTDQpob4zSxg_GeiC?usp=sharing

Related Issues or PRs

Closes #722

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@ryankert01 ryankert01 changed the base branch from main to dev-qdp December 31, 2025 16:45
@rich7420
Copy link
Contributor

rich7420 commented Jan 3, 2026

@ryankert01 thanks for the patch

@rich7420
Copy link
Contributor

rich7420 commented Jan 3, 2026

Should we add a file size check like other format to prevent OOM?

let path = path.as_ref();

// Verify file exists
if !path.exists() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add this path.exists() in other format.


// Flatten to Vec<f64>
// Handle both C-contiguous (row-major) and Fortran-contiguous (column-major)
let data = if array.is_standard_layout() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one it's great!

@rich7420
Copy link
Contributor

rich7420 commented Jan 3, 2026

overall LGTM

@400Ping
Copy link

400Ping commented Jan 4, 2026

Should we add a file size check like other format to prevent OOM?

I agree with this one.

@400Ping
Copy link

400Ping commented Jan 4, 2026

I have thought of 2 more memory-friendly alternatives to the current read_npy -> Array2 -> flatten Vec flow:

  • streaming/iterator reading: we parse the .npy header (dtype/shape/order) and then iterate the flat data from the file in small chunks (so the file doesn’t need to fit in RAM).
  • memory-mapping (mmap): we map the .npy file into memory (OS loads pages on demand), parse the header to locate the data region, and avoid an extra “read + flatten copy” peak while also enabling easy slicing/random access.

We can do this in a follow up if needed.

Copy link

@400Ping 400Ping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

@guan404ming
Copy link
Member

Should we add a file size check like other format to prevent OOM?

This one is a nice suggestion!

@guan404ming guan404ming merged commit 57f90e6 into apache:dev-qdp Jan 4, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[QDP] Add NumPy batch input support and memory optimizations

4 participants