Third-party GPU acceleration layer for Apache OpenNLP - transparent 2–5× speedups with NVIDIA CUDA, AMD ROCm, Intel OpenCL, and intelligent CPU fallback.
Important
This is an independent, third-party GPU acceleration extension for Apache OpenNLP and is not officially endorsed or maintained by the Apache Software Foundation.
- Overview
- Use Cases & Applications
- Key Features
- Architecture
- Usage Flow
- Technology Stack
- Technical Specifications
- GPU Backend Distribution
- Setup & Installation
- Quick Start
- Core Capabilities
- Configuration
- Diagnostics
- Project Roadmap
- Development Status
- Contributing
- Attribution
- License
OpenNLP GPU Extension is an independent third-party hardware acceleration layer that transparently routes Apache OpenNLP compute-intensive matrix operations to GPU hardware, delivering 2–5× throughput improvements for NLP workloads while maintaining 100% API compatibility with all standard OpenNLP model interfaces.
The extension operates as a drop-in decorator around existing OpenNLP models. No changes to training pipelines, serialized model files, or application calling code are required. When GPU hardware is present and configured, dense matrix operations (GEMM, softmax, TF-IDF, cosine similarity) execute on GPU kernels; when no GPU is detected, a numerically-identical pure-Java implementation silently handles all operations.
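The decorator idea can be sketched in plain Java. The names below (`Scorer`, `CpuScorer`, `AcceleratedScorer`) are hypothetical and chosen for illustration only; the real extension wraps OpenNLP's `MaxentModel` interface in the same shape:

```java
// Sketch of the drop-in decorator idea, using hypothetical names;
// the real extension wraps opennlp.tools.ml.model.MaxentModel the same way.
interface Scorer {
    double[] eval(double[] features);
}

// Pure-Java reference path: always available, numerically authoritative.
class CpuScorer implements Scorer {
    private final double[][] weights; // [outcome][feature]
    CpuScorer(double[][] weights) { this.weights = weights; }
    public double[] eval(double[] features) {
        double[] scores = new double[weights.length];
        for (int o = 0; o < weights.length; o++)
            for (int f = 0; f < features.length; f++)
                scores[o] += weights[o][f] * features[f];
        return scores;
    }
}

// Decorator: same interface, so callers never change. When no GPU is
// detected it silently delegates to the CPU path.
class AcceleratedScorer implements Scorer {
    private final Scorer fallback;
    private final boolean gpuAvailable; // probed once at construction
    AcceleratedScorer(Scorer fallback, boolean gpuAvailable) {
        this.fallback = fallback;
        this.gpuAvailable = gpuAvailable;
    }
    public double[] eval(double[] features) {
        if (!gpuAvailable) return fallback.eval(features);
        // A real implementation would dispatch to a GPU kernel here;
        // this sketch reuses the CPU path to stay self-contained.
        return fallback.eval(features);
    }
}
```

Because the wrapper and the wrapped model share one interface, calling code and serialized models stay untouched whichever path executes.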
Apache OpenNLP is a leading production-grade NLP framework in the Java/JVM ecosystem. Enterprises standardized on Java cannot easily switch to Python-native frameworks like spaCy or Hugging Face without introducing cross-language inter-process calls, retraining costs, and operational complexity. OpenNLP was specifically chosen as the GPU acceleration target because:
| Reason | Detail |
|---|---|
| Java-native | Integrates directly into Spring Boot, Jakarta EE, and enterprise JVM stacks without subprocess overhead |
| Stable API contracts | MaxentModel, TokenizerModel, and NameFinderME interfaces are stable across releases; the decorator pattern is reliable |
| Apache governance | Apache License 2.0; Apache Software Foundation oversight ensures long-term stability and commercial compatibility |
| Lightweight models | Serialized .bin model files are compact, versioned, and deployable without a framework runtime on the target server |
| Extensibility | Interface-based design means GpuMaxentModel implements MaxentModel with no changes to model loading or application logic |
| Active maintenance | OpenNLP 2.5.8 fixes SentenceDetector abbreviation handling (OPENNLP-1809/1810/1811) and updates ONNX Runtime to 1.24.3 |
Traditional NLP workloads are dominated by dense matrix operations that run sequentially on single CPU threads:
- Maximum Entropy evaluation: dot products between high-dimensional feature vectors and weight matrices (thousands of features × hundreds of outcomes per document)
- Named Entity Recognition: per-token matrix multiplications across sequence windows in every sentence
- TF-IDF document scoring: vocabulary-scale sparse-to-dense matrix operations across entire corpora
- Cosine similarity search: pairwise distance calculations that scale O(N²) with corpus size
GPUs execute thousands of these operations simultaneously. A modern GPU with 10,000+ CUDA cores processes a 512×512 matrix multiplication as a single parallel batch that would require thousands of sequential CPU instructions. The result: the same per-document accuracy at a fraction of the wall-clock time, translating directly into tighter SLAs or larger processing windows under the same compute budget.
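The independence of those operations is easy to see in plain Java: each output score is a separate dot product with no dependency on the others. The sketch below (illustrative `ParallelEval` class, not part of the extension's API) shows the same parallelism using CPU streams, which a GPU takes much further with thousands of hardware threads:

```java
import java.util.stream.IntStream;

// Each output score is an independent dot product, so the evaluation
// parallelizes trivially -- on a GPU across thousands of threads, or
// (as sketched here) across CPU cores via parallel streams.
class ParallelEval {
    static double[] matVec(double[][] weights, double[] features) {
        return IntStream.range(0, weights.length).parallel()
            .mapToDouble(o -> {
                double s = 0.0;
                for (int f = 0; f < features.length; f++)
                    s += weights[o][f] * features[f];
                return s;
            })
            .toArray();
    }
}
```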
Who this is for:
- Java NLP engineers processing high-volume batch workloads (10K+ documents/hour) who need lower latency without framework migration
- MLOps teams deploying OpenNLP on GPU-enabled cloud instances (AWS g4dn/p3, GCP a2, Azure NCv3)
- Researchers benchmarking GPU acceleration for classical NLP algorithms
- Organizations with existing OpenNLP deployments who need GPU benefits without retraining models or changing application code
| Icon | Feature | Description | Impact | Status |
|---|---|---|---|---|
| ⚡ | GPU-Accelerated Matrix Ops | GEMM, transpose, and activation functions dispatched to GPU kernels | 2–5× throughput | ✅ Stable |
| 🔄 | Auto CPU Fallback | Silent, transparent fallback to pure-Java when GPU unavailable | Zero downtime | ✅ Stable |
| 🎯 | Drop-in API Compatibility | GpuMaxentModel implements OpenNLP MaxentModel interface exactly | No code changes | ✅ Stable |
| 🖥️ | Multi-Backend | CUDA 11+, ROCm 5+, OpenCL 1.2+, CPU (runtime-selected) | Broad hardware support | 🚧 In Progress |
| ☁️ | Cloud Accelerators | AWS Inferentia and Google TPU providers with CPU fallback; Neuron/XLA bridges planned | Cloud-native NLP | 🚧 In Progress |
| 📊 | Performance Monitor | Real-time thread-safe metrics, latency alerts, memory tracking | Operational observability | ✅ Stable |
| 🔍 | GPU Diagnostics CLI | Standalone tool to probe drivers, SDKs, and runtime environment | DevOps-friendly | ✅ Stable |
| 🧪 | Extensive Test Suite | 30+ test classes: unit, integration, stress, compatibility, benchmark | High confidence | ✅ Stable |
Highlights:
- 115 Java source files covering ML models (MaxEnt, Perceptron, Naive Bayes, Neural), GPU backends, monitoring, and tooling
- Structured commenting on all core interfaces and compute classes: requirement, purpose, inputs, outputs, and failure-mode documentation
- Java 21 LTS compilation target with full OpenNLP 2.5.8 API compatibility
- Real backpropagation in `GpuNeuralNetwork`: chain-rule gradient descent, activation derivatives (sigmoid, tanh, ReLU, softmax, linear), with GPU-parallel batch inference via `IntStream.parallel()`
- JOCL-based hardware detection: `CudaUtil.isAvailable()`, `OpenCLUtil.isAvailable()`, and `RocmUtil.isAvailable()` all enumerate real devices via JOCL with no placeholder returns
- Zero stub methods: all public API methods have production implementations or documented CPU-fallback paths; no `return new Object()` or `return false // Stub` remain
- Benchmarks against `CpuComputeProvider` reference implementation to validate numerical correctness
Legal discovery, content moderation, financial document analysis, and compliance scanning involve processing tens of thousands of documents per hour. GPU batch sizing:
- Stacks 64โ256 document feature vectors per kernel launch
- Processes each batch in a single GPU call, replacing hundreds of sequential CPU invocations
- Sustains linear throughput scaling as document volume grows
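The stacking step can be sketched in plain Java (illustrative `BatchedEval` class, not the extension's API): B document feature vectors become one B×F matrix, so a single B×F · F×O multiply replaces B separate evaluations — the "single kernel launch" in the GPU case:

```java
// Illustrative sketch of batch sizing: stack B document feature vectors
// into a BxF matrix so one BxF . FxO multiply replaces B separate
// evaluations (one kernel launch on a GPU).
class BatchedEval {
    static double[][] evalBatch(double[][] batch, double[][] weightsFxO) {
        int B = batch.length, F = weightsFxO.length, O = weightsFxO[0].length;
        double[][] out = new double[B][O];
        for (int b = 0; b < B; b++)
            for (int f = 0; f < F; f++) {
                double v = batch[b][f];
                for (int o = 0; o < O; o++)
                    out[b][o] += v * weightsFxO[f][o];
            }
        return out;
    }
}
```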
Low-latency REST endpoints for text classification, sentiment analysis, or entity detection:
- Sub-50ms inference on complex MaxEnt models under concurrent load
- p99 latency outliers reduced through GPU-parallel evaluation
- Handle burst traffic without horizontal scaling
ETL pipelines for CRM, HR, compliance, and knowledge management systems:
- GPU-accelerated TF-IDF across large document corpora
- Batch cosine similarity for document deduplication and clustering
- Faster named entity extraction across multilingual document sets
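The TF-IDF weighting behind that scoring is the classic tf × log(N/df) scheme; a minimal sketch in plain Java (illustrative class name, not part of the extension's API):

```java
// Classic tf x log(N/df) weighting behind GPU-accelerated TF-IDF
// (illustrative class name, not the extension's API).
class TfIdf {
    static double weight(int tf, int df, int nDocs) {
        // tf: term count in this document; df: number of documents
        // containing the term; nDocs: corpus size. A term appearing in
        // every document (df == nDocs) gets weight 0.
        return tf * Math.log((double) nDocs / df);
    }
}
```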
On-premises clinical text processing (EHR structuring, ICD coding, clinical concept extraction) where:
- Privacy constraints prevent cloud API calls; a local GPU server is required
- High-accuracy MaxEnt models are used for medical term classification
- Throughput matters for overnight batch processing of patient notes
Researchers using OpenNLP as a classical NLP baseline can:
- Measure GPU vs. CPU throughput for traditional probabilistic models
- Compare accuracy/latency tradeoffs across CUDA, ROCm, and OpenCL backends
- Prototype GPU-accelerated feature engineering before committing to deep learning pipelines
Teams on GPU cloud instances can:
- Maximize GPU utilization by running OpenNLP inference alongside vision or audio model serving
- Use spot/preemptible instances cost-effectively due to pipelined batch processing
- Scale inference horizontally with bit-identical results across CPU fallback and GPU nodes
| Industry | Workload | OpenNLP Component | GPU Benefit |
|---|---|---|---|
| Legal | Contract entity extraction | `GpuNerModel` | Batch throughput on large corpora |
| Finance | Earnings call sentiment | `GpuMaxentModel` | Sub-100ms per-document scoring |
| Healthcare | Clinical concept extraction | Custom MaxEnt | Privacy-safe on-prem GPU inference |
| E-commerce | Query intent classification | `GpuMaxentModel` | Low-latency real-time API |
| Media | Article topic classification | MaxEnt ensemble | GPU batch for trending topic detection |
| HR / Recruitment | Resume skill extraction | `GpuNerModel` | High-volume batch processing |
| Compliance | Document classification audit | `GpuPerceptronModel` | Reproducible GPU-verified results |
| News / Search | Multilingual document dedup | TF-IDF + cosine similarity | O(N²) → GPU-parallel similarity |
flowchart TD
A[NLP Application] --> B[OpenNlpGpuAdapter]
B --> C{"GpuConfig<br/>gpu.available?"}
C -->|GPU Available| D[GpuComputeProvider]
C -->|No GPU| E[CpuComputeProvider]
D --> F{Backend Selection}
F -->|NVIDIA| G["CUDA Kernels<br/>JNI Bridge"]
F -->|AMD| H[ROCm / HIP]
F -->|Any Vendor| I[OpenCL / JOCL 2.0.6]
F -->|Cloud| J["AWS Inferentia<br/>Google TPU"]
G & H & I & J --> K["MatrixOperation<br/>Interface"]
E --> K
K --> L["Result to OpenNLP<br/>MaxentModel.eval"]
L --> M["GpuPerformanceMonitor<br/>Metrics & Alerts"]
Component responsibilities:
| Component | Package | Role |
|---|---|---|
| `OpenNlpGpuAdapter` | `integration` | Entry point; selects provider; wraps OpenNLP models |
| `ComputeProvider` | `common` | Hardware-agnostic interface for all compute backends |
| `GpuConfig` | `common` | Configuration value object (GPU flag, pool size, batch size) |
| `CpuComputeProvider` | `compute` | Pure-Java reference implementation; always available |
| `GpuComputeProvider` | `compute` | OpenCL-backed provider with CPU fallback delegation |
| `OperationFactory` | `compute` | Factory for selecting concrete MatrixOperation implementations |
| `GpuMaxentModel` | `ml.maxent` | Drop-in MaxentModel decorator with GPU dispatch |
| `GpuPerformanceMonitor` | `monitoring` | Thread-safe singleton metrics and alerting |
| `GpuDiagnostics` | `tools` | CLI tool for environment pre-flight checks |
sequenceDiagram
participant App as NLP Application
participant Adapter as OpenNlpGpuAdapter
participant Factory as ComputeProviderFactory
participant GPU as GpuComputeProvider
participant Model as GpuMaxentModel
participant Monitor as GpuPerformanceMonitor
App->>Adapter: new OpenNlpGpuAdapter()
Adapter->>Factory: selectProvider(GpuConfig)
Factory-->>Adapter: GpuComputeProvider or CpuFallback
App->>Model: new GpuMaxentModel(baseModel, config)
Model->>GPU: initialize()
GPU-->>Model: ready or silently falls back
App->>Model: eval(context[])
Model->>GPU: matrixMultiply / extractFeatures
GPU-->>Model: double[] probabilities
Model-->>App: probabilities
Model->>Monitor: recordOperation(latencyNs, memoryMB)
Monitor-->>App: alert if threshold exceeded
Step-by-step usage:
# 1. Clone
git clone https://github.com/hkevin01/opennlp-gpu.git
cd opennlp-gpu
# 2. Compile (skips native cmake build by default)
mvn clean compile
# 3. Run GPU diagnostics to check your environment
mvn exec:java -Dexec.mainClass=org.apache.opennlp.gpu.tools.GpuDiagnostics
# 4. Run tests
mvn test -Dtest=GpuTestSuite

| Technology | Version | Purpose | Why Chosen | Alternative |
|---|---|---|---|---|
| Apache OpenNLP | 2.5.8 | NLP model API contract | Industry-standard Java NLP; stable API | Stanford NLP, spaCy |
| Java | 21 LTS | Runtime and implementation | LTS stability; virtual threads; modern records | Kotlin, Scala |
| JOCL | 2.0.6 | OpenCL Java bindings | Cross-vendor GPU without native CUDA lock-in | LWJGL, pure JNA |
| SLF4J | 2.0.17 | Logging facade | Framework-neutral; no log framework lock-in | Log4j2, java.util.logging |
| JUnit 5 | 5.13.1 | Testing framework | Parameterized tests; extension model; parallel execution | TestNG |
| CMake | 4+ | Native library build | Cross-platform C++/CUDA build system | Makefile, Meson |
| Maven | 3.9+ | Build and dependency management | Industry standard; reproducible builds | Gradle |
| GPU Family | Architecture | Min Compute / Version | OpenCL Level | Backend |
|---|---|---|---|---|
| NVIDIA Turing (RTX 20xx, T4) | sm_75 | CUDA 11+ | 3.0 | CUDA + OpenCL |
| NVIDIA Ampere (RTX 30xx, A100) | sm_80 | CUDA 11+ | 3.0 | CUDA + OpenCL |
| NVIDIA Ada Lovelace (RTX 40xx) | sm_89 | CUDA 12+ | 3.0 | CUDA + OpenCL |
| NVIDIA Hopper (H100, H200) | sm_90 | CUDA 12+ | 3.0 | CUDA + OpenCL |
| AMD RDNA2 (RX 6000 series) | GFX1030 | ROCm 5.0+ | 2.0 | ROCm / HIP |
| AMD RDNA3 (RX 7000 series) | GFX1100 | ROCm 5.5+ | 2.0 | ROCm / HIP |
| Intel Arc (A-series) | Xe-HPG | N/A | 3.0 | OpenCL via JOCL |
| Any OpenCL 1.2+ device | N/A | N/A | 1.2 | JOCL cross-vendor |
| Component | Minimum | Recommended |
|---|---|---|
| Java JDK | 21 LTS | 21 LTS or 26 |
| Maven | 3.9 | 3.9+ |
| GPU VRAM | 2 GB | 8 GB+ |
| JVM Heap | 512 MB | 2–4 GB |
| NVIDIA Driver | 520.x | 535.x+ |
| CUDA Toolkit | 11.0 | 12.0+ |
| ROCm | 5.0 | 5.5+ |
| OpenCL ICD | 1.2 | 3.0 |
| CMake (native build only) | 3.16 | 4.x |
All kernels are implemented in CUDA C++ (kernels.cu), HIP/ROCm (kernels.cpp), and have equivalent pure-Java CPU reference implementations validated for numerical correctness to ≤1e-5 tolerance:
| Kernel | Dimensions | Block / Tile Size | Algorithm |
|---|---|---|---|
| `matMulKernel` | M×K · K×N → M×N | 16×16 shared-mem tiles | Tiled SGEMM |
| `softmaxKernel` | N-element vector | 256 threads/block | Numerically stable (subtract max) |
| `tfidfKernel` | N docs × M terms | 32×32 | TF × log(N/df) |
| `cosineSimilarityKernel` | N pairs × D dims | 256 threads | L2-normalized dot product |
| `ngramExtractKernel` | N tokens × L window | 128 threads/block | Sliding-window n-gram |
Reference measurements on NVIDIA RTX 3080 (10 GB VRAM). Actual performance varies by GPU model, driver version, batch size, and input dimensions. CPU fallback is always available and numerically identical.
| Operation | CPU Reference (ms) | GPU Target (ms) | Target Speedup |
|---|---|---|---|
| MaxEnt eval: 1K features, 100 outcomes | ~12 | ~3 | 4× |
| Matrix multiply: 512×512 FP32 | ~19 | ~4 | 5× |
| Softmax: 10K elements | ~2 | <1 | 3× |
| TF-IDF: 10K docs × 5K terms | ~900 | ~190 | 4.7× |
| Cosine similarity: 1K pairs × 512 dims | ~24 | ~6 | 4× |
| Maven Profile | Command | Artifacts | Hardware Required |
|---|---|---|---|
| Default (Java-only) | `mvn clean package` | JAR + CPU fallback | None |
| Native CUDA | `mvn clean package -Pnative` | JAR + CUDA `.so` kernels | CUDA Toolkit 11+ |
| Native ROCm | `mvn clean package -Pnative -Drocm=true` | JAR + HIP `.so` kernels | ROCm 5.0+ |
| Test suite (CPU mode) | `mvn test -Dtest=GpuTestSuite` | Test results | None |
pie title GPU Backend Support Coverage
"OpenCL (JOCL cross-vendor)" : 45
"CUDA (NVIDIA)" : 30
"ROCm/HIP (AMD)" : 15
"Cloud (Inferentia + TPU)" : 10
| Backend | Vendor | Status | Requirement |
|---|---|---|---|
| OpenCL via JOCL | Any (NVIDIA, AMD, Intel) | 🚧 JNI bridge in progress | OpenCL 1.2+ ICD |
| CUDA via JNI | NVIDIA | 🚧 Native kernels in progress | CUDA Toolkit 11+, driver |
| ROCm / HIP | AMD | 🚧 JOCL enumeration complete; HIP native kernels planned | ROCm 5.0+, compatible GPU |
| AWS Inferentia | Amazon | 🚧 CPU fallback active; AWS Neuron SDK bridge planned | Neuron SDK on inf1/inf2 |
| Google TPU | Google | 🚧 CPU fallback active; XLA bridge planned | TPU v3/v4 on GCP |
| CPU Fallback | Any | ✅ Production ready | JVM only |
Note
The CPU fallback (CpuComputeProvider) is fully production-ready and used as the numerical reference for all GPU kernel correctness tests. GPU backends are progressively integrated as the JNI bridge matures.
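The parity testing described here can be sketched in plain Java (illustrative names; the real tests compare GPU output against `CpuComputeProvider`). A second loop ordering stands in for the GPU result, and the comparator applies the ≤1e-5 tolerance stated above:

```java
// Sketch of a CPU-parity correctness check: a "candidate" result (here a
// different loop ordering standing in for GPU output) is compared
// element-wise against the naive reference within tolerance.
class MatMulParity {
    static double[][] reference(double[][] a, double[][] b) {
        int M = a.length, K = b.length, N = b[0].length;
        double[][] c = new double[M][N];
        for (int i = 0; i < M; i++)
            for (int k = 0; k < K; k++)
                for (int j = 0; j < N; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }
    static double[][] candidate(double[][] a, double[][] b) {
        int M = a.length, K = b.length, N = b[0].length;
        double[][] c = new double[M][N];
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                for (int k = 0; k < K; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }
    static boolean withinTolerance(double[][] x, double[][] y, double tol) {
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[i].length; j++)
                if (Math.abs(x[i][j] - y[i][j]) > tol) return false;
        return true;
    }
}
```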
| Requirement | Minimum | Recommended |
|---|---|---|
| Java JDK | 21 | 21 LTS or 26 |
| Maven | 3.9 | 3.9+ |
| GPU (optional) | OpenCL 1.2+ | CUDA 11+ or ROCm 5+ |
| CMake (optional) | 3.16 | 4.x (for native build) |
git clone https://github.com/hkevin01/opennlp-gpu.git
cd opennlp-gpu
# Standard build (Java only, no native GPU kernels)
mvn clean package
# Full native build (requires CUDA/ROCm/OpenCL headers)
mvn clean package -Pnative

Tip
Use a tagged release (e.g. 1.0.0) for stable builds, or main-SNAPSHOT to track the latest commit on main.
Maven (pom.xml):
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependencies>
<!-- Apache OpenNLP -->
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>2.5.8</version>
</dependency>
<!-- GPU Extension (tagged release) -->
<dependency>
<groupId>com.github.hkevin01</groupId>
<artifactId>opennlp-gpu</artifactId>
<version>1.0.0</version>
</dependency>
</dependencies>

Gradle (build.gradle):
repositories {
maven { url 'https://jitpack.io' }
}
dependencies {
implementation 'org.apache.opennlp:opennlp-tools:2.5.8'
implementation 'com.github.hkevin01:opennlp-gpu:1.0.0'
}

Gradle Kotlin (build.gradle.kts):
repositories {
maven("https://jitpack.io")
}
dependencies {
implementation("org.apache.opennlp:opennlp-tools:2.5.8")
implementation("com.github.hkevin01:opennlp-gpu:1.0.0")
}

# Enable GPU detection (set to true when GPU hardware is present and drivers loaded)
export JAVA_TOOL_OPTIONS="-Dgpu.available=true -Dgpu.vendor=NVIDIA -Dgpu.device=RTX4090"
# Verify environment
mvn exec:java -Dexec.mainClass=org.apache.opennlp.gpu.tools.GpuDiagnostics

import opennlp.tools.tokenize.TokenizerModel;
import org.apache.opennlp.gpu.common.GpuConfig;
import org.apache.opennlp.gpu.integration.OpenNlpGpuAdapter;
import org.apache.opennlp.gpu.ml.maxent.GpuMaxentModel;
// 1. Configure GPU
GpuConfig config = new GpuConfig();
config.setGpuEnabled(true); // Enable GPU acceleration
config.setMemoryPoolSizeMB(512); // Pre-allocate 512 MB GPU pool
config.setBatchSize(64); // Process 64 samples per kernel launch
// 2. Create the GPU adapter (auto-selects best available backend)
OpenNlpGpuAdapter adapter = new OpenNlpGpuAdapter();
// 3. Wrap your existing OpenNLP MaxentModel
// baseModel loaded normally from .bin file
GpuMaxentModel gpuModel = new GpuMaxentModel(baseModel, config);
// 4. Use exactly as you would the original model
double[] probabilities = gpuModel.eval(new String[]{"word", "suffix=ing", "prev=VBZ"});
String bestOutcome = gpuModel.getBestOutcome(probabilities);
// 5. Check runtime stats
System.out.println("Using GPU: " + gpuModel.isUsingGpu());
System.out.println("Speedup: " + gpuModel.getSpeedupFactor() + "×");
gpuModel.cleanup(); // Release GPU resources

Tip
Set -Dgpu.available=true only after running GpuDiagnostics confirms your driver stack is complete. When this flag is absent or false, the extension runs in CPU mode with identical results.
The MatrixOperation interface provides 20+ operations:
| Category | Methods | Backend |
|---|---|---|
| BLAS-style | `multiply`, `add`, `subtract`, `transpose`, `scalarMultiply` | CPU ✅ / GPU 🚧 |
| ML-specific | `dotProduct`, `vectorNorm`, `elementWiseMultiply`, `matrixVectorMultiply` | CPU ✅ / GPU 🚧 |
| Activations | `sigmoid`, `tanh`, `relu`, `softmax` (numerically stable) | CPU ✅ / GPU 🚧 |
| Statistics | `mean`, `variance`, `normalize` | CPU ✅ / GPU 🚧 |
| Utility | `copyArray`, `fillArray`, `findMax`, `findMin` | CPU ✅ / GPU 🚧 |
Note
DummyMatrixOperation (CPU) implements every method with correct algorithms, including numerically-stable softmax with exp(x - max(x)) and epsilon-guarded normalization. All GPU backends are validated against it.
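The stable-softmax recipe described in the note — subtract the max before exponentiating, then guard the normalization with an epsilon — can be sketched in plain Java (illustrative class name, not the extension's API):

```java
// Numerically stable softmax: subtracting the max means exp never
// overflows, and the epsilon guards the normalization step.
class StableSoftmax {
    static double[] softmax(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) max = Math.max(max, v);
        double sum = 0.0;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.exp(x[i] - max); // exp(x - max(x))
            sum += out[i];
        }
        for (int i = 0; i < x.length; i++)
            out[i] /= (sum + 1e-12); // epsilon-guarded normalization
        return out;
    }
}
```

A naive `exp(x)/sum(exp(x))` would return NaN for inputs like `{1000, 1000}`; the subtract-max form handles them without overflow.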
📦 Supported OpenNLP Model Types
| Model Type | GPU Wrapper Class | OpenNLP Interface |
|---|---|---|
| Maximum Entropy | `GpuMaxentModel` | `MaxentModel` |
| Perceptron | `GpuPerceptronModel` | `MaxentModel` |
| Naive Bayes | `GpuNaiveBayesModel` | `MaxentModel` |
| Neural Network | `GpuNeuralNetworkModel` | Custom |
| Attention Layer | `GpuAttentionLayer` | Custom |
| Advanced Neural | `AdvancedGpuNeuralNetwork` | Custom |
| MaxEnt Trainer | `GpuMaxentTrainer` | `EventTrainer` |
All wrappers follow the same decorator pattern: accept the base OpenNLP object, add GPU dispatch, and fall back to the base when GPU is unavailable.
GpuPerformanceMonitor monitor = GpuPerformanceMonitor.getInstance();
monitor.setAlertThresholdMs(500); // Alert on ops > 500ms
monitor.setMemoryAlertThreshold(0.75); // Alert at 75% GPU memory
monitor.setMaxHistorySize(5000); // Keep last 5000 records/op
// After inference...
OperationMetrics metrics = monitor.getMetrics("matrixMultiply");
System.out.println("Avg latency: " + metrics.getAverageLatencyMs() + "ms");

All settings are controlled via GpuConfig (a plain Java value object):
| Property | Default | Description |
|---|---|---|
| `gpuEnabled` | `false` | Master GPU switch |
| `memoryPoolSizeMB` | `256` | Pre-allocated GPU memory pool size (MB) |
| `batchSize` | `32` | Samples per GPU kernel launch |
| `maxMemoryUsageMB` | `1024` | Hard memory cap per provider (MB) |
| `debugMode` | `false` | Verbose diagnostic output |
System properties (read at runtime):
| Property | Example | Description |
|---|---|---|
| `gpu.available` | `true` | Master GPU presence flag |
| `gpu.vendor` | `NVIDIA` | Reported vendor name |
| `gpu.device` | `RTX 4090` | Device display name |
| `gpu.driver` | `535.0` | Driver version string |
| `gpu.memory.total` | `24576` | Total VRAM in MB |
| `gpu.speedup.factor` | `3.5` | Reported speedup for stats reporting |
Run the built-in hardware probe before deploying:
mvn exec:java -Dexec.mainClass=org.apache.opennlp.gpu.tools.GpuDiagnostics

Sample output:
🔍 OpenNLP GPU Acceleration - Hardware Diagnostics
==================================================
[System Information]
OS: Linux 6.x.x-zen
Java Version: 26.0.2 ✅ Compatible
JAVA_HOME: /usr/lib/jvm/java-26-openjdk ✅ Set and valid
[GPU Hardware Detection]
AMD GPU: ✅ Detected: AMD Radeon RX 7900 XTX
[AMD Drivers]
AMD ROCm Driver: ✅ Installed and working
[OpenCL Runtime]
OpenCL: ✅ 2 platform(s), 3 device(s)
[OpenNLP GPU Integration]
Extension JAR: ✅ Loaded successfully
🎉 GPU acceleration is ready to use!
Exit code 0 = ready, 1 = setup incomplete.
gantt
title OpenNLP GPU Extension Roadmap
dateFormat YYYY-MM-DD
section Phase 1: Foundation
Core Interfaces & CPU Fallback :done, p1a, 2025-01-01, 2025-04-01
ComputeProvider Hierarchy :done, p1b, 2025-01-01, 2025-04-01
GpuConfig & Monitoring :done, p1c, 2025-03-01, 2025-05-01
section Phase 2: ML Models
MaxEnt / Perceptron / Naive Bayes :done, p2a, 2025-04-01, 2025-07-01
Neural Network & Attention :done, p2b, 2025-05-01, 2025-08-01
GPU Diagnostics Tool :done, p2c, 2025-06-01, 2025-08-01
section Phase 3 - Native GPU (Active)
OpenCL JNI Bridge :active, p3a, 2025-09-01, 2026-06-01
CUDA Kernel Integration :active, p3b, 2025-10-01, 2026-07-01
ROCm / HIP Integration : p3c, 2026-03-01, 2026-09-01
section Phase 4 - Cloud & Production
AWS Inferentia Integration : p4a, 2026-06-01, 2026-10-01
Google TPU Integration : p4b, 2026-07-01, 2026-11-01
Maven Central Release : p4c, 2026-10-01, 2026-12-01
| Phase | Goals | Target | Status |
|---|---|---|---|
| Phase 1 | Core interfaces, CPU fallback, monitoring | Q1–Q2 2025 | ✅ Complete |
| Phase 2 | ML model wrappers, diagnostics, test suite | Q2–Q3 2025 | ✅ Complete |
| Phase 3 | OpenCL + CUDA JNI kernels, ROCm integration | Q4 2025–Q3 2026 | 🚧 Active |
| Phase 4 | Cloud accelerators, Maven Central, production hardening | Q4 2026 | ⭐ Planned |
pie title Component Readiness (% complete)
"CPU Fallback (100%)" : 100
"Monitoring (100%)" : 100
"Diagnostics (100%)" : 100
"ML Wrappers (100%)" : 100
"OpenCL JOCL Detection (80%)" : 80
"CUDA/ROCm JOCL Detection (75%)" : 75
"Cloud Providers → CPU Fallback (70%)" : 70
"Native GPU Kernels → JNI Bridge (25%)" : 25
| Version | Phase | Stability | Java | OpenNLP | Key Limitation |
|---|---|---|---|---|---|
| 1.0.0 | Phase 1-2 | Beta | 21 | 2.5.8 | Hardware GPU kernel execution requires native JNI bridge (CPU fallback active) |
Warning
Hardware GPU kernel execution (isAvailable() == true + real device dispatch) requires the in-progress JNI bridge to be compiled with -Pnative and a compatible driver stack verified by the GpuDiagnostics tool. JOCL-based provider detection (CudaUtil.isAvailable(), OpenCLUtil.isAvailable(), RocmUtil.isAvailable()) is fully implemented and returns real hardware results. Until the native kernel bridge is wired, all matrix compute routes silently through CpuComputeProvider.
Contributions are welcome! This project follows the standard GitHub pull-request workflow.
# Fork, then:
git clone https://github.com/YOUR_USERNAME/opennlp-gpu.git
cd opennlp-gpu
git checkout -b feature/my-improvement
# Make changes, add tests
mvn clean test
git commit -m "feat: describe your change"
git push origin feature/my-improvement
# Open a Pull Request on GitHub

📋 Contribution Guidelines
Code Style
- Java 21 syntax; no Lombok (removed to reduce annotation processor complexity)
- All new public APIs must include structured Javadoc comments (Requirement, Purpose, Inputs, Outputs, Failure Modes)
- Follow existing package structure: `common/`, `compute/`, `ml/`, `monitoring/`, `tools/`
Testing Requirements
- Unit tests in `src/test/java/` matching the source package
- New GPU backends must include a CPU-parity test verifying numerical equivalence
- Stress tests for any concurrent code (`stress/` test package)
Pull Request Checklist
- `mvn clean compile` passes with zero errors
- `mvn test -Dtest=GpuTestSuite,MatrixOpsTest` passes
- No new `Xlint:all` warnings introduced
- `GpuDiagnostics` still reports correctly
This project extends Apache OpenNLP but is not part of the Apache Software Foundation.
| Component | Owner | License |
|---|---|---|
| Apache OpenNLP (`opennlp-tools`) | Apache Software Foundation | Apache License 2.0 |
| JOCL | Marco Hutter / jocl.org | MIT License |
| This GPU Extension | OpenNLP GPU Extension Contributors | Apache License 2.0 |
OpenNLP GPU Extension
Copyright 2025 OpenNLP GPU Extension Contributors
This software includes code from Apache OpenNLP:
Copyright 2011-2025 The Apache Software Foundation
Distributed under the Apache License, Version 2.0. See LICENSE for full text.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Built with ❤️ for the Java NLP community