JByteScanner Software Design Document (SDD)

1. Overview

Project Name: JByteScanner (Java Bytecode Security Scanner)
Core Engine: Soot 4.5+ (Java Bytecode Optimization and Analysis Framework)
Target Audience: Security Auditors, Security Researchers
Key Value Proposition: Single JAR execution, low memory footprint, database-free, highly configurable, and standardized SARIF output.

2. Architecture

The tool adopts a "Dual-Engine Microkernel" architecture, decoupling lightweight information extraction (API scanning) from heavyweight data flow analysis (vulnerability scanning) to address memory consumption issues found in previous tools.

graph TD
    User[User/Auditor] --> Launcher["Launcher (CLI)"]
    Launcher --> ConfigMgr[Config Manager]
    Launcher --> DiscoveryEngine["A. Asset Discovery Engine (Lightweight)"]
    Launcher --> SecretScanner["B. Secret Scanner (Tactical)"]
    Launcher --> TaintEngine["C. Taint Analysis Engine (Heavyweight)"]
    
    ConfigMgr --> |Load/Gen| Rules["Rules (yaml)"]
    
    DiscoveryEngine --> |"Soot (Structure)"| JARs[Target JARs]
    DiscoveryEngine --> |Extract| APIDict["api.txt (Route Dict)"]
    DiscoveryEngine --> |Extract| ComponentDict["components.txt (SCA)"]

    SecretScanner --> |"ASM/Regex"| JARs
    SecretScanner --> |"Scan Configs"| JARs
    SecretScanner --> |Export| Secrets["secrets.txt"]
    
    TaintEngine --> |Input| APIDict
    TaintEngine --> |Input| ComponentDict
    TaintEngine --> |"Soot (SPARK/Jimple)"| JARs
    TaintEngine --> |Analyze| Vulnerabilities[Vulnerabilities]
    
    Vulnerabilities --> Scorer[Vulnerability Scorer]
    Scorer --> |"R-S-A-C Model"| ScoredVulns[Scored Vulnerabilities]

    ScoredVulns --> ReportGen[Report Generator]
    ReportGen --> |Export| SARIF["result.sarif"]

Core Modules

Loader Module
- Responsibility: Handles input directories, identifying all .jar, .war files.
- Optimization: Automatically unpacks nested structures in SpringBoot/FatJARs, extracting BOOT-INF/classes and lib to build the necessary ClassPath for Soot. (Implemented in Phase 2).
- Fat JAR Support: Detects BOOT-INF/classes or WEB-INF/classes inside a JAR. Extracts them to a temporary directory along with dependent libraries (BOOT-INF/lib/*.jar) to reconstruct a valid classpath for static analysis.
Configuration Manager
- Improvement: Replaces custom .conf formats with YAML.
- Logic: Checks for rules.yaml in the current directory on startup. If missing, extracts a default template from the JAR resources; otherwise, loads the existing one. This addresses user pain points regarding configuration persistence and editability.
- Project Workspace: Prioritizes reading configuration from the project-specific workspace (.jbytescanner/rules.yaml), enabling isolated configurations per scan target.
Discovery Engine (Lightweight)
- Goal: Address Pain Point 1 (API Extraction) and Pain Point 4 (Memory Optimization).
- Technology: Runs only Soot's jb (Jimple Body) phase. Does not build a global Call Graph.
- Function: Rapidly traverses class annotations and inheritance hierarchies to extract Controller/Servlet definitions, outputting api.txt.
- Output Strategy: Writes results to the project workspace (.jbytescanner/api.txt), preventing file conflicts between different projects.
- SCA Support (Planned Phase 2.5): Will identify versions of third-party libraries (e.g., Fastjson, Log4j) to prune unnecessary taint analysis rules.
Taint Engine (Heavyweight)
- Technology: Builds Pointer Analysis and Call Graph using Soot's SPARK or CHA.
- Strategy: Uses "Demand-Driven Analysis". Instead of analyzing the entire universe, it uses entry points from api.txt to build relevant call subgraphs, significantly reducing memory usage.
- Engine Update: Now uses a Worklist-based Engine (Phase 7) in place of recursive analysis, preventing StackOverflowError on deep call chains.
- Optimization:
  - Leaf Summaries: Caches summaries for leaf methods to avoid redundant analysis.
  - Utilizes components.txt to skip analysis for safe library versions.
Report Generator
- Goal: Address Pain Point 5.
- Format: Supports SARIF (Static Analysis Results Interchange Format) v2.1.0, enabling direct integration with VSCode, GitHub Security, etc.
- Enhancement: Includes risk levels (CRITICAL, HIGH, etc.) and numerical scores derived from the Vulnerability Scorer.
Secret Scanner (Tactical)
- Goal: Provide immediate value by identifying hardcoded credentials (Phase 8.1).
- Technology: Uses ASM for bytecode string extraction and Regex/Entropy analysis. Does not require heavy Soot analysis.
- Capabilities:
  - Config Scan: Parses application.properties/yml inside JARs.
  - String Scan: Detects keys (AWS, JDBC) in constant pools.
  - Entropy: Identifies high-entropy strings (potential secrets).
  - Base64: Decodes and recursively scans Base64 strings.
  - Context-Aware: Detects hash usage (e.g., token.equals("md5")).
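
The entropy capability above can be illustrated with a short sketch. It computes Shannon entropy in bits per character and flags long, high-entropy strings. The class name, the 4.0-bit threshold, and the 20-character minimum are assumptions chosen for illustration, not values taken from JByteScanner.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: flags strings whose character distribution looks random. */
final class EntropySketch {
    // Assumed cut-offs; a real scanner would tune these against sample data.
    static final double THRESHOLD_BITS = 4.0;
    static final int MIN_LENGTH = 20;

    /** Shannon entropy of the string, in bits per character. */
    static double shannonEntropy(String s) {
        if (s.isEmpty()) return 0.0;
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) counts.merge(c, 1, Integer::sum);
        double entropy = 0.0;
        for (int n : counts.values()) {
            double p = (double) n / s.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** Heuristic only: long strings with many distinct characters are candidates. */
    static boolean looksLikeSecret(String s) {
        return s.length() >= MIN_LENGTH && shannonEntropy(s) >= THRESHOLD_BITS;
    }
}
```

A check like this is prone to false positives (hashes, UUIDs, encoded resources), which is presumably why the scanner combines it with regex and context-aware rules.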
Vulnerability Scorer
- Goal: Prioritize findings for security auditors (Phase 8.2).
- Model: R-S-A-C (Reachability * Severity * Auth * Confidence).
- Auth Detection: Heuristically identifies @PreAuthorize, @Secured, etc., to determine if a vulnerability is behind an authentication barrier.
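
A minimal sketch of the R-S-A-C product follows. The factor ranges (severity on 0 to 10, the other three on 0 to 1), the level thresholds, and the class name are assumptions for illustration; the document only states that the four factors are multiplied.

```java
/** Hypothetical scorer: multiplies the four R-S-A-C factors into one number. */
final class RsacScorerSketch {
    /**
     * Assumed ranges: severity in 0..10; reachability, auth and confidence in 0..1.
     * An auth factor below 1.0 would represent a finding behind an authentication barrier.
     */
    static double score(double reachability, double severity, double auth, double confidence) {
        return reachability * severity * auth * confidence;
    }

    /** Maps a score to a risk level; the cut-offs are illustrative only. */
    static String level(double score) {
        if (score >= 9.0) return "CRITICAL";
        if (score >= 7.0) return "HIGH";
        if (score >= 4.0) return "MEDIUM";
        return "LOW";
    }
}
```

Under these assumptions, a fully reachable severity-10 finding scores 10.0 when unauthenticated and 5.0 when an auth factor of 0.5 is applied for a detected @PreAuthorize.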

3. Technology Stack & Principles

3.1 Technology Selection

Static Analysis Framework: Soot (4.5+)
- Rationale: De facto standard in academic and industrial Java research. Operates on bytecode (no source code required), crucial for the "audit deployment artifacts" use case. Its Jimple IR (Intermediate Representation) simplifies complex Java bytecode instructions into 3-address code, making analysis implementation significantly easier.
CLI Framework: Picocli
- Rationale: Modern, type-safe command-line parsing with built-in help generation and sub-command support.
Configuration: YAML (Jackson)
- Rationale: Human-readable, widespread adoption, and hierarchical structure suitable for nested rules (Sources/Sinks).
Reporting: SARIF
- Rationale: OASIS standard for static analysis tools, enabling seamless integration with CI/CD pipelines (GitHub Actions, GitLab CI) and IDEs (VSCode).

3.2 Technical Principles & Algorithms

The core analysis relies on Inter-procedural Data Flow Analysis.

A. Intermediate Representation (IR)

We utilize Jimple, Soot's primary IR. It is a typed, stack-less, 3-address code representation.

Benefit: Transforms stack-based bytecode (e.g., iload_0, iload_1, iadd) into variable-based statements (e.g., a = b + c), simplifying def-use chain construction.
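
To make the stack-to-variable idea concrete, here is a toy lowering pass over three integer opcodes. It is not Soot's algorithm and supports only iload_N, iadd, and istore_N; the class name and the i0/$t0 naming scheme are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy converter from stack-based instructions to three-address statements. */
final class ThreeAddressSketch {
    static List<String> lower(List<String> bytecode) {
        Deque<String> stack = new ArrayDeque<>(); // symbolic operand stack
        List<String> out = new ArrayList<>();
        int temp = 0;
        for (String insn : bytecode) {
            if (insn.startsWith("iload_")) {
                // Pushing local N becomes a reference to variable iN.
                stack.push("i" + insn.substring("iload_".length()));
            } else if (insn.equals("iadd")) {
                // Pop two operands and name the result with a fresh temporary.
                String right = stack.pop();
                String left = stack.pop();
                String t = "$t" + temp++;
                out.add(t + " = " + left + " + " + right);
                stack.push(t);
            } else if (insn.startsWith("istore_")) {
                out.add("i" + insn.substring("istore_".length()) + " = " + stack.pop());
            } else {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return out;
    }
}
```

Every value now has a name, so a def-use chain is a matter of matching variable names rather than simulating stack depth.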

B. Call Graph Construction

The tool supports two modes to balance precision and performance:

CHA (Class Hierarchy Analysis):
- Concept: Resolves each virtual call conservatively: every subtype of the receiver's declared type that implements or overrides the target method is treated as a possible callee.
- Pros/Cons: Extremely fast, low memory, but can introduce false positives (edges to methods that are never called at runtime).
SPARK (Soot Pointer Analysis Research Kit):
- Concept: Performs points-to analysis to determine which objects a variable can actually point to, filtering out impossible targets.
- Pros/Cons: More precise, fewer false positives, but computationally expensive.
- Reference: Lhoták, O., & Hendren, L. (2003). Scaling Java points-to analysis using Spark.
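
The CHA mode can be sketched without Soot over a hand-built class table. The class and method names below are hypothetical, and the model ignores interfaces and method signatures for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy Class Hierarchy Analysis over single-inheritance classes. */
final class ChaSketch {
    private final Map<String, String> superOf = new HashMap<>();        // class -> superclass
    private final Map<String, Set<String>> declared = new HashMap<>();  // class -> declared methods

    void addClass(String name, String superName, String... methods) {
        if (superName != null) superOf.put(name, superName);
        declared.put(name, new TreeSet<>(Arrays.asList(methods)));
    }

    private boolean isSubtypeOf(String cls, String ancestor) {
        for (String c = cls; c != null; c = superOf.get(c)) {
            if (c.equals(ancestor)) return true;
        }
        return false;
    }

    /** Method that would run if the runtime type were cls: nearest declaration going up. */
    private String dispatch(String cls, String method) {
        for (String c = cls; c != null; c = superOf.get(c)) {
            if (declared.getOrDefault(c, Collections.<String>emptySet()).contains(method)) {
                return c + "." + method;
            }
        }
        return null;
    }

    /** CHA: union of dispatch targets over every subtype of the declared receiver type. */
    Set<String> resolve(String declaredType, String method) {
        Set<String> targets = new TreeSet<>();
        for (String cls : declared.keySet()) {
            if (isSubtypeOf(cls, declaredType)) {
                String target = dispatch(cls, method);
                if (target != null) targets.add(target);
            }
        }
        return targets;
    }
}
```

A call on a receiver declared as Animal yields edges to every override in the hierarchy, whether or not such an object is ever created. SPARK's points-to sets are what prune those extra edges.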

C. Taint Analysis (Vulnerability Detection)

The engine implements a Forward Taint Propagation algorithm combining intra- and inter-procedural analysis.

Source Identification: Based on api.txt and rules.yaml, all parameters of API entry-point methods are marked as "Tainted" at method entry.
Intra-procedural Propagation (IntraTaintAnalysis — ForwardBranchedFlowAnalysis<FlowSet<Value>>):
- Direct assignment: y = x → y tainted if x tainted.
- Binary/cast: y = x + z, y = (T) x → y tainted if operand tainted.
- Instance field read: y = obj.f → y tainted if obj tainted.
- Static field read: y = Cls.f → y tainted if Cls.f was previously written with tainted data (tracked in taintedStaticFields).
- Array read: y = arr[i] → y tainted if arr tainted.
- Instance field write: obj.f = x → obj tainted if x tainted.
- Static field write: Cls.f = x → Cls.f added to taintedStaticFields if x tainted.
- Method return (instance): y = obj.m(...) → y tainted if obj tainted; arg → return is additionally applied only to setter-like instance methods to reduce taint explosion.
- Method return (static/any): y = Cls.m(...) → y tainted if any arg tainted.
- Setter/constructor receiver: obj.set(x) or new Obj(x) → obj tainted if any arg tainted, but this receiver-tainting heuristic is restricted to setter-like methods and constructors. This enables the setter → field → getter → sink chain without broadly tainting service objects.
- Path sensitivity: Null-check branches (if x == null) kill taint on the null path.
Inter-procedural Propagation (WorklistEngine):
- Tainted arguments are mapped to callee parameter locals before scheduling.
- Tainted receiver (obj in obj.m(...)) is mapped to callee this local.
- AnalysisState (method + tainted-param-bitset + thisTainted) is used for memoization to avoid redundant re-analysis.
Sink Matching: A vulnerability is flagged when:
- Any argument of a sink method call is tainted, OR
- The receiver of an instance sink call is tainted for sink categories that enable receiver-based triggering. This receiver-based check is intentionally disabled for sqli to avoid false positives on tainted Statement / Connection objects.
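
A few of the intra-procedural rules and the argument-based sink check can be sketched over a toy statement model. This is a straight-line simplification, not the ForwardBranchedFlowAnalysis implementation: it has no branches, never removes taint on overwrite, and the Stmt type and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward taint propagation over a straight-line method body. */
final class IntraTaintSketch {
    enum Kind { ASSIGN, FIELD_WRITE, SINK_CALL }

    static final class Stmt {
        final Kind kind;
        final String target;         // assigned local, written base object, or sink name
        final List<String> operands; // locals read by the statement

        Stmt(Kind kind, String target, String... operands) {
            this.kind = kind;
            this.target = target;
            this.operands = Arrays.asList(operands);
        }
    }

    /** Returns the names of sinks that receive at least one tainted argument. */
    static List<String> analyze(Set<String> taintedParams, List<Stmt> body) {
        Set<String> tainted = new HashSet<>(taintedParams);
        List<String> hits = new ArrayList<>();
        for (Stmt s : body) {
            boolean anyTainted = false;
            for (String op : s.operands) anyTainted |= tainted.contains(op);
            if (!anyTainted) continue;
            switch (s.kind) {
                case ASSIGN:      // y = x, y = x + z, y = (T) x: y becomes tainted
                    tainted.add(s.target);
                    break;
                case FIELD_WRITE: // obj.f = x: the base object obj becomes tainted
                    tainted.add(s.target);
                    break;
                case SINK_CALL:   // sink(arg): flag when any argument is tainted
                    hits.add(s.target);
                    break;
            }
        }
        return hits;
    }
}
```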

Reference: Vallée-Rai et al. (1999). Soot - a Java bytecode optimization framework.

3.3 Internal Process Flows

Discovery Engine Flow (Phase 2)

sequenceDiagram
    participant CLI as Launcher
    participant Ldr as JarLoader
    participant Soot as Soot Framework
    participant Ext as RouteExtractor
    participant Out as File (api.txt)

    CLI->>Ldr: Load JARs
    Ldr-->>Ldr: Check FatJAR (BOOT-INF)
    opt Is FatJAR
        Ldr->>Ldr: Unpack classes & lib to temp
    end
    Ldr-->>CLI: ClassPath List (Jars + Temp Dirs)
    CLI->>Soot: Initialize (jb phase only)
    Soot->>Soot: Load Classes (Phantom Refs)
    CLI->>Ext: Run Extraction
    loop Every Class
        Ext->>Soot: Get Annotations (@Controller, @Path)
        Ext->>Soot: Check Hierarchy (extends HttpServlet)
        opt Match Found
            Ext->>Out: Write Route Info
        end
    end
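
The FatJAR check in the flow above can be sketched with the standard java.util.jar API. The class and method names are hypothetical, and the sketch only classifies entries; unpacking to a temporary directory is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper that inspects archive entry names for nested layouts. */
final class FatJarSketch {
    /** True if any entry sits under BOOT-INF/classes/ or WEB-INF/classes/. */
    static boolean isFatJar(Collection<String> entryNames) {
        for (String name : entryNames) {
            if (name.startsWith("BOOT-INF/classes/") || name.startsWith("WEB-INF/classes/")) {
                return true;
            }
        }
        return false;
    }

    /** Nested library JARs that would have to be extracted to rebuild the classpath. */
    static List<String> nestedLibs(Collection<String> entryNames) {
        List<String> libs = new ArrayList<>();
        for (String name : entryNames) {
            if (name.startsWith("BOOT-INF/lib/") && name.endsWith(".jar")) libs.add(name);
        }
        return libs;
    }

    /** Reads all entry names from an archive on disk. */
    static List<String> entryNames(Path jar) {
        try (JarFile jf = new JarFile(jar.toFile())) {
            List<String> names = new ArrayList<>();
            jf.stream().map(JarEntry::getName).forEach(names::add);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the classification separate from file access makes the detection rule easy to test against plain lists of entry names.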

Taint Analysis Flow (Phase 3/4)

flowchart TD
    A[Start Analysis] --> B{Load Config}
    B --> C["Initialize Soot (Whole Program)"]
    C --> D["Build Call Graph (CHA/SPARK)"]
    D --> E["Identify EntryPoints (from api.txt)"]
    E --> F["Initialize Worklist (Sources)"]
    
    F --> G{Worklist Empty?}
    G -- Yes --> H[Generate Report]
    G -- No --> I["Pop Method/Variable"]
    
    I --> J[Intra-procedural Propagation]
    J --> K{Reaches Sink?}
    K -- Yes --> L[Record Vulnerability]
    K -- No --> M["Find Callers/Callees"]
    
    M --> N["Map Taint to Args/Returns"]
    N --> O[Push to Worklist]
    O --> G
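
The worklist loop in the flowchart, together with the memoization key described in section 3.2 (method, tainted-parameter bitset, thisTainted), can be sketched as follows. The State and Call types are hypothetical stand-ins, the intra-procedural step is reduced to "a tainted parameter passed as an argument taints the callee parameter", and receiver taint is carried in the key but not modeled.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Toy worklist engine that analyzes each (method, taint pattern) state once. */
final class WorklistSketch {
    static final class State {
        final String method;
        final BitSet taintedParams;
        final boolean thisTainted;

        State(String method, BitSet taintedParams, boolean thisTainted) {
            this.method = method;
            this.taintedParams = (BitSet) taintedParams.clone();
            this.thisTainted = thisTainted;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof State)) return false;
            State s = (State) o;
            return method.equals(s.method) && taintedParams.equals(s.taintedParams)
                    && thisTainted == s.thisTainted;
        }

        @Override public int hashCode() {
            return Objects.hash(method, taintedParams, thisTainted);
        }
    }

    /** Call edge: argFromParam[j] is the caller parameter passed as argument j, or -1. */
    static final class Call {
        final String callee;
        final int[] argFromParam;

        Call(String callee, int... argFromParam) {
            this.callee = callee;
            this.argFromParam = argFromParam;
        }
    }

    /** Runs to a fixed point and returns how many distinct states were analyzed. */
    static int run(State entry, Map<String, List<Call>> calls) {
        Deque<State> worklist = new ArrayDeque<>();
        Set<State> seen = new HashSet<>();
        worklist.add(entry);
        seen.add(entry);
        int analyzed = 0;
        while (!worklist.isEmpty()) {
            State s = worklist.poll();
            analyzed++;
            for (Call c : calls.getOrDefault(s.method, Collections.<Call>emptyList())) {
                BitSet calleeTaint = new BitSet();
                for (int j = 0; j < c.argFromParam.length; j++) {
                    int p = c.argFromParam[j];
                    if (p >= 0 && s.taintedParams.get(p)) calleeTaint.set(j);
                }
                if (calleeTaint.isEmpty()) continue;          // no taint flows into the callee
                State next = new State(c.callee, calleeTaint, false);
                if (seen.add(next)) worklist.add(next);       // memoization: skip known states
            }
        }
        return analyzed;
    }
}
```

Because the queue lives on the heap and repeated states are skipped, recursive or mutually recursive methods terminate without deep Java call stacks.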

4. Detailed Design & Solutions

4.1 Pain Point 1: API Route Extraction (api.txt)

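Instead of Regex or ASM, we read annotations and inheritance information from the classes Soot has already loaded, so route extraction shares one class model with the rest of the analysis. A RouteExtractor will be implemented for this.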

Recognition Logic:
- Spring Boot: Scan @RestController, @Controller on classes and @RequestMapping, @GetMapping, @PostMapping on methods. Parse value or path attributes.
- Servlet: Scan classes inheriting javax.servlet.http.HttpServlet and parse web.xml (if present) or @WebServlet.
- JAX-RS: Scan @Path, @GET, @POST, @PUT, @DELETE, @HEAD, @OPTIONS, @PATCH and other JAX-RS annotations.
Output Format: METHOD /full/url/path class.method(params)
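
As an illustration of the output format, the sketch below joins a class-level and a method-level mapping into one path and renders a single api.txt line. The class name, the slash normalization, and the comma separator between parameter types are assumptions; the document does not specify them.

```java
/** Hypothetical formatter for one api.txt line. */
final class RouteLineSketch {
    /** Joins class-level and method-level mappings, collapsing duplicate slashes. */
    static String joinPaths(String classPath, String methodPath) {
        String joined = ("/" + classPath + "/" + methodPath).replaceAll("/{2,}", "/");
        if (joined.length() > 1 && joined.endsWith("/")) {
            joined = joined.substring(0, joined.length() - 1);
        }
        return joined;
    }

    /** Renders: METHOD /full/url/path class.method(params) */
    static String format(String httpMethod, String classPath, String methodPath,
                         String className, String methodName, String... paramTypes) {
        return httpMethod + " " + joinPaths(classPath, methodPath) + " "
                + className + "." + methodName + "(" + String.join(",", paramTypes) + ")";
    }
}
```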

4.2 Pain Points 2 & 3: Configurable Source/Sink (YAML)

We will use Jackson or SnakeYAML.

Configuration Structure (rules.yaml):

config:
  max_depth: 10
  scan_packages: ["com.example", "cn.service"] # Limit scan scope

sources:
  - type: "annotation"
    value: "org.springframework.web.bind.annotation.RequestParam"
  - type: "method"
    signature: "<javax.servlet.http.HttpServletRequest: java.lang.String getParameter(java.lang.String)>"

sinks:
  - type: "method"
    vuln_type: "RCE"
    signature: "<java.lang.Runtime: java.lang.Process exec(java.lang.String)>"
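
The signature values above use Soot's method signature notation, `<DeclaringClass: ReturnType name(ParamTypes)>`. A rule loader could split such a string into its parts before matching. The following is a minimal sketch with a hypothetical class name; it uses a regular expression rather than any Soot API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for signatures of the form <Class: Return name(Params)>. */
final class SignatureSketch {
    private static final Pattern SIG =
            Pattern.compile("^<([^:]+): (\\S+) ([^(\\s]+)\\(([^)]*)\\)>$");

    final String declaringClass;
    final String returnType;
    final String name;
    final List<String> paramTypes;

    private SignatureSketch(String declaringClass, String returnType, String name,
                            List<String> paramTypes) {
        this.declaringClass = declaringClass;
        this.returnType = returnType;
        this.name = name;
        this.paramTypes = paramTypes;
    }

    static SignatureSketch parse(String signature) {
        Matcher m = SIG.matcher(signature.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a method signature: " + signature);
        }
        String params = m.group(4);
        List<String> types = params.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(params.split(","));
        return new SignatureSketch(m.group(1), m.group(2), m.group(3), types);
    }
}
```

Validating signatures at load time lets a malformed rule fail fast with a clear message instead of silently never matching.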

Startup Logic:

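// Look for rules.yaml in the current working directory.
File configFile = new File("rules.yaml");
if (!configFile.exists()) {
    // First run: extract the bundled default template from the JAR resources.
    ResourceUtil.extract("/default_rules.yaml", ".");
    Logger.info("Created default rules.yaml.");
}
// Load the configuration from disk.
Config config = ConfigLoader.load(configFile);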

4.3 Pain Point 4: Memory Optimization

High memory usage in previous tools often stems from loading the entire JRE rt.jar and building an excessive Call Graph.

Optimization Strategies:

Phantom Refs: Enable Options.v().set_allow_phantom_refs(true). Do not load implementations of third-party libraries unless necessary.
Exclusion List: Aggressively exclude java.*, javax.*, sun.*, org.slf4j.* and other non-business logic packages from CallGraph construction.
CHA vs SPARK: Default to CHA (Class Hierarchy Analysis) for the base Call Graph as it is faster and memory-efficient. Enable SPARK only with a --deep flag.
Iterative Analysis: Implement a "Batch Mode" where JARs are processed individually or in small groups, calling G.reset() between batches to release Soot's global state, if inter-service calls are not the focus.
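
The exclusion list can be sketched as a prefix check on class names. The class name is hypothetical and the list holds only the packages named above. In a Soot-based tool the same patterns would typically be passed to Soot's own exclude option rather than checked by hand.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical filter for packages that should stay out of call graph construction. */
final class ExclusionSketch {
    static final List<String> EXCLUDED =
            Arrays.asList("java.*", "javax.*", "sun.*", "org.slf4j.*");

    /** True if the fully qualified class name falls under an excluded package pattern. */
    static boolean isExcluded(String className) {
        for (String pattern : EXCLUDED) {
            // "java.*" -> "java." so that "javassist.CtClass" is not caught by accident.
            String prefix = pattern.substring(0, pattern.length() - 1);
            if (className.startsWith(prefix)) return true;
        }
        return false;
    }
}
```

Keeping the trailing dot in the prefix matters: a bare "java" prefix would also exclude unrelated packages whose names merely start with those letters.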

4.4 Pain Point 5: SARIF Report

Use sarif-java-sdk or manually construct the JSON structure.

{
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": { "name": "JByteScanner" } },
      "results": [
        {
          "ruleId": "RCE",
          "message": { "text": "Detected RCE flow from Controller to Runtime.exec" },
          "locations": [ ... ]
        }
      ]
    }
  ]
}
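
For the "manually construct" route, the sketch below builds the same minimal structure with string concatenation and JSON escaping. The class name is hypothetical, and the locations array is omitted because the example above elides it; a real report would also carry one result per finding.

```java
/** Hypothetical minimal SARIF 2.1.0 writer for a single result. */
final class SarifSketch {
    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns a compact SARIF document with one run and one result. */
    static String report(String ruleId, String message) {
        return "{\"version\":\"2.1.0\",\"runs\":[{"
                + "\"tool\":{\"driver\":{\"name\":\"JByteScanner\"}},"
                + "\"results\":[{\"ruleId\":\"" + escape(ruleId) + "\","
                + "\"message\":{\"text\":\"" + escape(message) + "\"}}]}]}";
    }
}
```

Hand-built JSON is workable at this size, but escaping and nesting mistakes become likely as the report grows, which is the main argument for a JSON library or a SARIF SDK.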
