
added RFC on how to create a living knowledge base of owasp things #734

Open

northdpole wants to merge 1 commit into main from owasp-graph

Conversation

@northdpole (Collaborator)

No description provided.

@PRAteek-singHWY (Contributor) commented Feb 1, 2026

@northdpole
Thanks a lot for sharing this, sir. It is extremely helpful and very well structured.

I've gone through the RFC and it gives a clear architectural and experimental framework to build the proposal around. I'll spend some time digesting it in detail and start aligning my work proposal with this design and the pre-code experiments outlined here.

@PRAteek-singHWY (Contributor)

@northdpole

Thanks for putting this together, sir. The experimental framework is really clear.

I’m particularly interested in Module C (The Librarian) and want to start with the suggested pre-code experiments before proposing any concrete design or implementation.

The negation problem stands out — I’ve worked on gap analysis features before (#716) and have seen how basic similarity metrics can struggle with logical inversions in requirements (e.g., “Use X” vs “Do NOT use X”).
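
To make that concrete, here is a minimal sketch of the failure mode. The all-MiniLM-L6-v2 checkpoint and the TLS wording are my own illustrative choices, not taken from the RFC:

```python
# Minimal repro of the failure mode: a bi-encoder embeds a requirement and its
# logical inversion almost identically, so cosine similarity stays high.
# Model choice and example wording are mine, not from the RFC.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
req = "Verify that the application uses TLS for all connections."
neg = "Verify that the application does NOT use TLS for all connections."

emb = model.encode([req, neg])
print(util.cos_sim(emb[0], emb[1]).item())  # typically very high despite opposite meaning
```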

Plan:
I’ll start with the ASVS re-classification experiment (a sketch of the comparison harness follows the list):

  • Extract 50 ASVS requirements and strip metadata
  • Baseline: vector search with cosine similarity
  • Comparison: cross-encoder re-ranking (ms-marco-MiniLM-L-6-v2)
  • Target: >20% accuracy improvement on negative requirements
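
A sketch of how I'd wire the harness, assuming sentence-transformers for both models; posing the task as pick-the-chapter-label is my assumption about how the re-classification will be framed:

```python
# Compare the cosine baseline against cross-encoder re-ranking on one
# requirement. chapter_labels would come from the 50-requirement ASVS extract.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # baseline retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def classify(requirement: str, chapter_labels: list[str]) -> tuple[str, str]:
    """Return (baseline_prediction, reranked_prediction) for one requirement."""
    # Baseline: nearest chapter label by cosine similarity.
    scores = util.cos_sim(bi_encoder.encode(requirement),
                          bi_encoder.encode(chapter_labels))[0]
    baseline = chapter_labels[int(scores.argmax())]

    # Comparison: the cross-encoder reads each (requirement, label) pair
    # jointly, which gives it a chance to catch negations the baseline misses.
    ce_scores = cross_encoder.predict([(requirement, c) for c in chapter_labels])
    reranked = chapter_labels[int(ce_scores.argmax())]
    return baseline, reranked
```

Scoring both columns against the true chapters over the 50 stripped requirements then yields the >20% comparison directly.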

If the experiment is successful, I’m also interested in exploring hybrid search (vector + BM25), especially for cases like CVE identifiers where pure vector search often underperforms.
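
If it comes to that, the blend can start very simple. The rank_bm25 package and the 50/50 weighting are my assumptions here, not settled choices:

```python
# Hybrid retrieval sketch: min-max-normalized BM25 blended with cosine scores.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "CVE-2021-44228 Log4Shell remote code execution in log4j",
    "Use parameterized queries to prevent SQL injection",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])  # lexical index
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = encoder.encode(docs)

def hybrid_scores(query: str, alpha: float = 0.5):
    """alpha weights the lexical side, where exact tokens like CVE IDs win."""
    lexical = bm25.get_scores(query.lower().split())
    semantic = util.cos_sim(encoder.encode(query), doc_embs)[0].numpy()
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * norm(lexical) + (1 - alpha) * norm(semantic)

print(hybrid_scores("CVE-2021-44228"))  # the CVE document should rank first
```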

I'll take this up step by step.

I’ll share experiment results and observations before proposing any implementation.

I’m using AI tools (similar to Cursor/Windsurf) and have read Section 3.

Thank you.

@manshusainishab

Hi @northdpole,

Thanks for putting together this RFC — the structure, pre-code experiments, and CI-first mindset make this exactly the kind of system I enjoy working on.

I’d like to formally express my interest in owning Module B: Noise / Relevance Filter as my primary contribution, and I’m also happy to assist with adjacent modules where needed.

Why Module B

The framing of Module B as a cheap, high-signal gate before expensive downstream processing resonates strongly with me. Getting this layer right feels critical to the quality, cost, and trustworthiness of the entire pipeline, especially given the planned regression dataset and CI enforcement.

Proposed Plan of Action (Aligned with the RFC)
I plan to follow the RFC strictly and start with experiments before any production code:

  1. Human Benchmark (Pre-Code Experiment)
    Manually collect a sample of real diffs and label each as:
    • Security Knowledge
    • Noise (formatting, admin, linting, meta updates)
    This dataset will be versioned and reusable as an early “golden slice.”

  2. Prompt Iteration & Evaluation
    Start with a simple binary JSON-output prompt:
    “Is this content introducing or modifying security-relevant knowledge?”
    Evaluate against the human benchmark and iterate until accuracy consistently exceeds 97%, with special attention to known failure modes (e.g., Code of Conduct updates, formatting-only diffs). A sketch of the evaluation loop follows this list.

  3. Regex + LLM Cost Control
    Design the regex filter to aggressively eliminate obvious noise first (lockfiles, CSS, tests, config); a sketch of such a gate also follows the list.
    Ensure the LLM is only invoked on borderline or content-heavy diffs.
    Document false positives / negatives clearly for future contributors.

  4. CI & Dataset Readiness
    Structure outputs so they can plug cleanly into the planned golden_dataset.json (an illustrative record shape is sketched below).
    Ensure behavior is deterministic and testable for CI regression checks.
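
For item 2, the evaluation loop I have in mind looks roughly like this. call_llm is a hypothetical stand-in for whatever model client the project standardizes on, and the benchmark field names are assumed:

```python
# Prompt-evaluation loop against the hand-labeled benchmark (item 2 above).
import json

# Braces in the JSON example are doubled so str.format leaves them intact.
PROMPT_TEMPLATE = (
    "Is this content introducing or modifying security-relevant knowledge? "
    'Reply with JSON only: {{"security_relevant": true or false}}\n\n'
    "DIFF:\n{diff}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in; wire up the project's model client here."""
    raise NotImplementedError

def accuracy(benchmark: list[dict]) -> float:
    """benchmark items look like {'diff': str, 'label': bool} (assumed shape)."""
    correct = 0
    for item in benchmark:
        reply = json.loads(call_llm(PROMPT_TEMPLATE.format(diff=item["diff"])))
        correct += reply["security_relevant"] == item["label"]
    return correct / len(benchmark)  # iterate on the prompt until > 0.97
```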
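
For item 3, a sketch of the cheap gate. The pattern list is purely illustrative and should grow out of the documented false positives/negatives:

```python
# Regex pre-filter: reject obvious noise paths before any LLM call (item 3 above).
import re

NOISE_PATH = re.compile(
    r"package-lock\.json$|yarn\.lock$|Pipfile\.lock$"  # lockfiles
    r"|\.(css|scss|svg|png)$"                          # styling and assets
    r"|(^|/)tests?/|\.test\.|_test\."                  # test code
    r"|\.(ini|toml|ya?ml)$"                            # config files
)

def needs_llm(changed_paths: list[str]) -> bool:
    """Call the LLM only when at least one changed file is not obvious noise."""
    return any(not NOISE_PATH.search(path) for path in changed_paths)

# Example: a lockfile-only PR never reaches the LLM.
assert needs_llm(["package-lock.json"]) is False
assert needs_llm(["cheatsheets/SQL_Injection.md", "package-lock.json"]) is True
```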
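
And for item 4, the kind of record I'd emit. The real schema belongs to the RFC's golden_dataset.json, so every field name below is an assumption:

```python
# Illustrative golden_dataset.json record (item 4 above); field names assumed.
import json

record = {
    "source": "owasp/asvs",           # repository the diff came from
    "diff_id": "<commit-or-pr-ref>",  # placeholder identifier
    "label": "noise",                 # "security_knowledge" or "noise"
    "reason": "formatting-only change",
    "labeled_by": "human",            # provenance for CI regression checks
}
print(json.dumps(record, sort_keys=True, indent=2))  # sorted keys keep output deterministic
```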

Cross-Module Contributions

While Module B would be my ownership area, I can also help with:
  • Module A: defining shared interfaces and assumptions between diff harvesting and filtering.
  • CI / Evaluation: contributing test cases and failure examples derived from Module B experiments.

I’ve read and understood Section 3 (Agent-Ready CI & AI-generated PR constraints) and I’m comfortable working within those boundaries.

Looking forward to collaborating — this project feels like a rare opportunity to build something both technically rigorous and genuinely useful.

Best,
Manshu
