- Project Name: JByteScanner (Java Bytecode Security Scanner)
- Core Engine: Soot 4.5+ (Java Bytecode Optimization and Analysis Framework)
- Target Audience: Security Auditors, Security Researchers
- Key Value Proposition: Single JAR execution, low memory footprint, database-free, highly configurable, and standardized SARIF output.
The tool adopts a "Dual-Engine Microkernel" architecture, decoupling lightweight information extraction (API scanning) from heavyweight data flow analysis (vulnerability scanning) to address memory consumption issues found in previous tools.
graph TD
User[User/Auditor] --> Launcher["Launcher (CLI)"]
Launcher --> ConfigMgr[Config Manager]
Launcher --> DiscoveryEngine["A. Asset Discovery Engine (Lightweight)"]
Launcher --> SecretScanner["B. Secret Scanner (Tactical)"]
Launcher --> TaintEngine["C. Taint Analysis Engine (Heavyweight)"]
ConfigMgr --> |Load/Gen| Rules["Rules (yaml)"]
DiscoveryEngine --> |"Soot (Structure)"| JARs[Target JARs]
DiscoveryEngine --> |Extract| APIDict["api.txt (Route Dict)"]
DiscoveryEngine --> |Extract| ComponentDict["components.txt (SCA)"]
SecretScanner --> |"ASM/Regex"| JARs
SecretScanner --> |"Scan Configs"| JARs
SecretScanner --> |Export| Secrets["secrets.txt"]
TaintEngine --> |Input| APIDict
TaintEngine --> |Input| ComponentDict
TaintEngine --> |"Soot (SPARK/Jimple)"| JARs
TaintEngine --> |Analyze| Vulnerabilities[Vulnerabilities]
Vulnerabilities --> Scorer[Vulnerability Scorer]
Scorer --> |"R-S-A-C Model"| ScoredVulns[Scored Vulnerabilities]
ScoredVulns --> ReportGen[Report Generator]
ReportGen --> |Export| SARIF["result.sarif"]
- Loader Module
  - Responsibility: Handles input directories, identifying all `.jar` and `.war` files.
  - Optimization: Automatically unpacks nested structures in Spring Boot fat JARs, extracting `BOOT-INF/classes` and `lib` to build the classpath that Soot needs (implemented in Phase 2).
  - Fat JAR Support: Detects `BOOT-INF/classes` or `WEB-INF/classes` inside a JAR and extracts them to a temporary directory along with dependent libraries (`BOOT-INF/lib/*.jar`) to reconstruct a valid classpath for static analysis.
- Configuration Manager
  - Improvement: Replaces custom `.conf` formats with YAML.
  - Logic: Checks for `rules.yaml` in the current directory on startup. If missing, extracts a default template from the JAR resources; otherwise, loads the existing one. This addresses user pain points regarding configuration persistence and editability.
  - Project Workspace: Prioritizes reading configuration from the project-specific workspace (`.jbytescanner/rules.yaml`), enabling isolated configurations per scan target.
- Discovery Engine (Lightweight)
  - Goal: Address Pain Point 1 (API Extraction) and Pain Point 4 (Memory Optimization).
  - Technology: Runs only Soot's `jb` (Jimple Body) phase. Does not build a global Call Graph.
  - Function: Rapidly traverses class annotations and inheritance hierarchies to extract Controller/Servlet definitions, outputting `api.txt`.
  - Output Strategy: Writes results to the project workspace (`.jbytescanner/api.txt`), preventing file conflicts between different projects.
  - SCA Support (Planned Phase 2.5): Will identify versions of third-party libraries (e.g., Fastjson, Log4j) to prune unnecessary taint analysis rules.
- Taint Engine (Heavyweight)
  - Technology: Builds the Call Graph using Soot's `SPARK` (points-to analysis) or `CHA` (class hierarchy only).
  - Strategy: Uses "Demand-Driven Analysis". Instead of analyzing the whole program, it uses entry points from `api.txt` to build only the relevant call subgraphs, significantly reducing memory usage.
  - Engine Update: Now uses a Worklist-based Engine (Phase 7) to replace recursive analysis, preventing `StackOverflowError` on deep call chains.
  - Optimization:
    - Leaf Summaries: Caches summaries for leaf methods to avoid redundant analysis.
    - Utilizes `components.txt` to skip analysis for safe library versions.
- Report Generator
  - Goal: Address Pain Point 5.
  - Format: Supports SARIF (Static Analysis Results Interchange Format) v2.1.0, enabling direct integration with VSCode, GitHub Security, etc.
  - Enhancement: Includes risk levels (CRITICAL, HIGH, etc.) and numerical scores derived from the Vulnerability Scorer.
- Secret Scanner (Tactical)
  - Goal: Provide immediate value by identifying hardcoded credentials (Phase 8.1).
  - Technology: Uses ASM for bytecode string extraction and Regex/Entropy analysis. Does not require heavy Soot analysis.
  - Capabilities:
    - Config Scan: Parses `application.properties/yml` inside JARs.
    - String Scan: Detects keys (AWS, JDBC) in constant pools.
    - Entropy: Identifies high-entropy strings (potential secrets).
    - Base64: Decodes and recursively scans Base64 strings.
    - Context-Aware: Detects hash usage (e.g., `token.equals("md5")`).
- Vulnerability Scorer
  - Goal: Prioritize findings for security auditors (Phase 8.2).
  - Model: R-S-A-C (Reachability * Severity * Auth * Confidence).
  - Auth Detection: Heuristically identifies `@PreAuthorize`, `@Secured`, etc., to determine if a vulnerability is behind an authentication barrier.
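The R-S-A-C product used by the Vulnerability Scorer can be sketched as below. This is a minimal illustration, not the project's scorer: the class name `RsacScoreSketch`, the normalization of each factor to the range 0.0 to 1.0, the 0 to 100 scale, and the level thresholds are all assumptions made for the example.

```java
/** Hypothetical sketch of the R-S-A-C score: Reachability * Severity * Auth * Confidence. */
final class RsacScoreSketch {

    /** Rejects factors outside [0.0, 1.0]; the normalization itself is an assumption. */
    private static void check(double factor) {
        if (factor < 0.0 || factor > 1.0) {
            throw new IllegalArgumentException("factor out of range: " + factor);
        }
    }

    /** Multiplies the four factors and scales the product to 0..100 (assumed scale). */
    static double score(double reachability, double severity, double auth, double confidence) {
        check(reachability);
        check(severity);
        check(auth);
        check(confidence);
        return 100.0 * reachability * severity * auth * confidence;
    }

    /** Maps a score to a risk level; the thresholds are illustrative only. */
    static String level(double score) {
        if (score >= 75.0) return "CRITICAL";
        if (score >= 50.0) return "HIGH";
        if (score >= 25.0) return "MEDIUM";
        return "LOW";
    }

    public static void main(String[] args) {
        // Same finding twice: once on an open endpoint (auth = 1.0),
        // once behind an assumed @PreAuthorize barrier (auth = 0.5).
        double open = score(1.0, 1.0, 1.0, 0.8);
        double guarded = score(1.0, 1.0, 0.5, 0.8);
        System.out.println(level(open) + " " + open);
        System.out.println(level(guarded) + " " + guarded);
    }
}
```

Because the model is a product, any single low factor (for example a detected authentication barrier) pulls the whole score down, which is the prioritization effect the scorer is after.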
- Static Analysis Framework: Soot (4.5+)
  - Rationale: De facto standard in academic and industrial Java research. Operates on bytecode (no source code required), crucial for the "audit deployment artifacts" use case. Its Jimple IR (Intermediate Representation) simplifies complex Java bytecode instructions into 3-address code, making analysis implementation significantly easier.
- CLI Framework: Picocli
  - Rationale: Modern, type-safe command-line parsing with built-in help generation and sub-command support.
- Configuration: YAML (Jackson)
  - Rationale: Human-readable, widespread adoption, and hierarchical structure suitable for nested rules (Sources/Sinks).
- Reporting: SARIF
  - Rationale: OASIS standard for static analysis tools, enabling seamless integration with CI/CD pipelines (GitHub Actions, GitLab CI) and IDEs (VSCode).
The core analysis relies on Inter-procedural Data Flow Analysis.
We utilize Jimple, Soot's primary IR. It is a typed, stack-less, 3-address code representation.
- Benefit: Transforms stack-based bytecode (e.g., `iload_1`, `iload_2`, `iadd`) into variable-based statements (e.g., `a = b + c`), simplifying def-use chain construction.
The tool supports two modes to balance precision and performance:
- CHA (Class Hierarchy Analysis):
  - Concept: Conservatively assumes that a virtual call may dispatch to the declared target or to any override in a subtype of the receiver's declared type.
  - Pros/Cons: Extremely fast, low memory, but can introduce false positives (call edges to methods that are never invoked at runtime).
- SPARK (Soot Pointer Analysis Research Kit):
  - Concept: Performs points-to analysis to determine which objects a variable can actually point to, filtering out impossible targets.
  - Pros/Cons: More precise, fewer false positives, but computationally expensive.
  - Reference: Lhoták, O., & Hendren, L. (2003). Scaling Java points-to analysis using Spark.
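To make the CHA trade-off concrete, here is a minimal sketch of CHA-style call resolution over a toy class hierarchy. It does not use Soot; the class `ChaSketch`, its methods, and the `Handler` hierarchy are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy model of CHA call resolution (single inheritance, no interfaces). */
final class ChaSketch {
    private final Map<String, String> superOf = new HashMap<>();
    private final Map<String, Set<String>> declared = new HashMap<>();

    /** Registers a class, its superclass (null for a root) and the methods it declares. */
    void addClass(String name, String superName, String... methods) {
        superOf.put(name, superName);
        declared.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    /** True if cls equals ancestor or transitively extends it. */
    boolean isSubtype(String cls, String ancestor) {
        for (String c = cls; c != null; c = superOf.get(c)) {
            if (c.equals(ancestor)) return true;
        }
        return false;
    }

    /** Walks up from cls to the first class that declares the method. */
    String dispatch(String cls, String method) {
        for (String c = cls; c != null; c = superOf.get(c)) {
            if (declared.getOrDefault(c, Set.of()).contains(method)) return c + "." + method;
        }
        return null;
    }

    /** CHA: every subtype of the declared receiver type counts as a possible runtime type. */
    Set<String> resolve(String declaredType, String method) {
        Set<String> targets = new TreeSet<>();
        for (String cls : superOf.keySet()) {
            if (isSubtype(cls, declaredType)) {
                String target = dispatch(cls, method);
                if (target != null) targets.add(target);
            }
        }
        return targets;
    }

    /** Small demo hierarchy: two overriding subclasses and one that inherits handle(). */
    static ChaSketch demo() {
        ChaSketch h = new ChaSketch();
        h.addClass("Handler", null, "handle");
        h.addClass("SafeHandler", "Handler", "handle");
        h.addClass("ExecHandler", "Handler", "handle");
        h.addClass("LoggingHandler", "Handler");
        return h;
    }

    public static void main(String[] args) {
        // A call through a Handler-typed variable gets an edge to every override.
        System.out.println(demo().resolve("Handler", "handle"));
    }
}
```

In this toy model a call on a `Handler`-typed receiver yields edges to all three `handle` implementations, even if only `SafeHandler` is ever instantiated. A points-to analysis such as SPARK is what would prune the extra edges, at the cost of computing which objects each variable may reference.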
The engine implements a Forward Taint Propagation algorithm combining intra- and inter-procedural analysis.
- Source Identification: Based on `api.txt` and `rules.yaml`, all parameters of API entry-point methods are marked as "Tainted" at method entry.
- Intra-procedural Propagation (`IntraTaintAnalysis`, a `ForwardBranchedFlowAnalysis<FlowSet<Value>>`):
  - Direct assignment: `y = x` → `y` tainted if `x` tainted.
  - Binary/cast: `y = x + z`, `y = (T) x` → `y` tainted if operand tainted.
  - Instance field read: `y = obj.f` → `y` tainted if `obj` tainted.
  - Static field read: `y = Cls.f` → `y` tainted if `Cls.f` was previously written with tainted data (tracked in `taintedStaticFields`).
  - Array read: `y = arr[i]` → `y` tainted if `arr` tainted.
  - Instance field write: `obj.f = x` → `obj` tainted if `x` tainted.
  - Static field write: `Cls.f = x` → `Cls.f` added to `taintedStaticFields` if `x` tainted.
  - Method return (instance): `y = obj.m(...)` → `y` tainted if `obj` tainted; `arg → return` is additionally applied only to setter-like instance methods to reduce taint explosion.
  - Method return (static/any): `y = Cls.m(...)` → `y` tainted if any arg tainted.
  - Setter/constructor receiver: `obj.set(x)` or `new Obj(x)` → `obj` tainted if any arg tainted, but this receiver-tainting heuristic is restricted to setter-like methods and constructors. This enables the `setter → field → getter → sink` chain without broadly tainting service objects.
  - Path sensitivity: Null-check branches (`if x == null`) kill taint on the null path.
- Inter-procedural Propagation (`WorklistEngine`):
  - Tainted arguments are mapped to callee parameter locals before scheduling.
  - Tainted receiver (`obj` in `obj.m(...)`) is mapped to callee `this` local.
  - `AnalysisState` (method + tainted-param-bitset + `thisTainted`) is used for memoization to avoid redundant re-analysis.
- Sink Matching: A vulnerability is flagged when:
  - Any argument of a sink method call is tainted, OR
  - The receiver of an instance sink call is tainted, for sink categories that enable receiver-based triggering. This receiver-based check is intentionally disabled for `sqli` to avoid false positives on tainted `Statement`/`Connection` objects.
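A few of the intra-procedural rules above (assignment, instance field read/write, static field read/write) can be sketched on a toy statement list. This is a simplified illustration and not the `IntraTaintAnalysis` class itself: it handles straight-line code only, has no kill rules or branch handling, and the `Stmt` model and helper names are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical straight-line model of a subset of the taint transfer rules. */
final class IntraTaintSketch {
    enum Kind { COPY, STATIC_READ, STATIC_WRITE }

    /** One simplified statement; target and source are local names or "Cls.f" field names. */
    static final class Stmt {
        final Kind kind;
        final String target;
        final String source;

        Stmt(Kind kind, String target, String source) {
            this.kind = kind;
            this.target = target;
            this.source = source;
        }
    }

    // y = x, y = obj.f and obj.f = x all reduce to "target tainted if source tainted".
    static Stmt assign(String y, String x) { return new Stmt(Kind.COPY, y, x); }
    static Stmt fieldRead(String y, String obj) { return new Stmt(Kind.COPY, y, obj); }
    static Stmt fieldWrite(String obj, String x) { return new Stmt(Kind.COPY, obj, x); }
    static Stmt staticRead(String y, String field) { return new Stmt(Kind.STATIC_READ, y, field); }
    static Stmt staticWrite(String field, String x) { return new Stmt(Kind.STATIC_WRITE, field, x); }

    /** Applies the rules in program order; taintedStaticFields is updated in place. */
    static Set<String> propagate(List<Stmt> body, Set<String> taintedParams,
                                 Set<String> taintedStaticFields) {
        Set<String> tainted = new HashSet<>(taintedParams);
        for (Stmt s : body) {
            switch (s.kind) {
                case COPY:
                    if (tainted.contains(s.source)) tainted.add(s.target);
                    break;
                case STATIC_READ:
                    if (taintedStaticFields.contains(s.source)) tainted.add(s.target);
                    break;
                case STATIC_WRITE:
                    if (tainted.contains(s.source)) taintedStaticFields.add(s.target);
                    break;
            }
        }
        return tainted;
    }

    public static void main(String[] args) {
        List<Stmt> body = List.of(
                fieldWrite("dto", "p0"),          // dto.name = p0
                fieldRead("s", "dto"),            // s = dto.name
                staticWrite("Config.last", "s"),  // Config.last = s
                staticRead("t", "Config.last"),   // t = Config.last
                assign("u", "clean"));            // u = clean (never tainted)
        Set<String> statics = new HashSet<>();
        System.out.println(propagate(body, Set.of("p0"), statics) + " " + statics);
    }
}
```

Tainting the whole object on a field write is coarse (field-insensitive), which is why the rules above restrict receiver tainting on calls to setter-like methods and constructors.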
Reference: Vallée-Rai et al. (1999). Soot - a Java optimization framework.
sequenceDiagram
participant CLI as Launcher
participant Ldr as JarLoader
participant Soot as Soot Framework
participant Ext as RouteExtractor
participant Out as File (api.txt)
CLI->>Ldr: Load JARs
Ldr-->>Ldr: Check FatJAR (BOOT-INF)
opt Is FatJAR
Ldr->>Ldr: Unpack classes & lib to temp
end
Ldr-->>CLI: ClassPath List (Jars + Temp Dirs)
CLI->>Soot: Initialize (jb phase only)
Soot->>Soot: Load Classes (Phantom Refs)
CLI->>Ext: Run Extraction
loop Every Class
Ext->>Soot: Get Annotations (@Controller, @Path)
Ext->>Soot: Check Hierarchy (extends HttpServlet)
opt Match Found
Ext->>Out: Write Route Info
end
end
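The "Check FatJAR" step in the sequence above can be approximated with the standard `java.util.jar` API. The sketch below is an assumption about how such a check might look, not the project's `JarLoader`: `FatJarSketch` and its method names are hypothetical, and only the `BOOT-INF/classes`, `WEB-INF/classes` and `BOOT-INF/lib/*.jar` paths come from the design.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Hypothetical fat JAR detection based on entry names only. */
final class FatJarSketch {

    /** True if the entries contain a Spring Boot or WAR-style classes directory. */
    static boolean isFatJar(Collection<String> entryNames) {
        return entryNames.stream().anyMatch(n ->
                n.startsWith("BOOT-INF/classes/") || n.startsWith("WEB-INF/classes/"));
    }

    /** Nested dependency JARs that would be unpacked alongside the classes directory. */
    static List<String> nestedLibs(Collection<String> entryNames) {
        return entryNames.stream()
                .filter(n -> n.startsWith("BOOT-INF/lib/") && n.endsWith(".jar"))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Reads entry names from a JAR on disk, rethrowing I/O failures as unchecked. */
    static List<String> entryNames(Path jar) {
        try (JarFile jf = new JarFile(jar.toFile())) {
            return jf.stream().map(JarEntry::getName).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With no argument, fall back to a made-up entry list for demonstration.
        List<String> names = args.length == 1
                ? entryNames(Path.of(args[0]))
                : List.of("BOOT-INF/classes/com/example/App.class",
                          "BOOT-INF/lib/fastjson-1.2.24.jar");
        System.out.println("fat JAR: " + isFatJar(names) + ", nested libs: " + nestedLibs(names));
    }
}
```

Keeping the detection separate from the file access makes the rule easy to unit-test with plain string lists before any unpacking to a temporary directory takes place.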
flowchart TD
A[Start Analysis] --> B{Load Config}
B --> C["Initialize Soot (Whole Program)"]
C --> D["Build Call Graph (CHA/SPARK)"]
D --> E["Identify EntryPoints (from api.txt)"]
E --> F["Initialize Worklist (Sources)"]
F --> G{Worklist Empty?}
G -- Yes --> H[Generate Report]
G -- No --> I["Pop Method/Variable"]
I --> J[Intra-procedural Propagation]
J --> K{Reaches Sink?}
K -- Yes --> L[Record Vulnerability]
K -- No --> M["Find Callers/Callees"]
M --> N["Map Taint to Args/Returns"]
N --> O[Push to Worklist]
O --> G
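The worklist loop in the flowchart can be sketched over a toy call graph, with an `AnalysisState`-style key for memoization. This is an illustrative model under simplifying assumptions, not the `WorklistEngine` itself: the intra-procedural step is reduced to "a parameter passed straight through stays tainted", receiver taint is carried in the key but always false here, and all method names in the demo graph are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical worklist-driven taint propagation over a toy call graph. */
final class WorklistSketch {

    /** A call site: callee plus, per argument, the caller parameter index passed (-1 = constant). */
    static final class Call {
        final String callee;
        final int[] argFromParam;
        Call(String callee, int... argFromParam) {
            this.callee = callee;
            this.argFromParam = argFromParam;
        }
    }

    /** Memoization key: method + tainted-parameter bitset + receiver taint. */
    static final class AnalysisState {
        final String method;
        final BitSet taintedParams;
        final boolean thisTainted;
        AnalysisState(String method, BitSet taintedParams, boolean thisTainted) {
            this.method = method;
            this.taintedParams = taintedParams;
            this.thisTainted = thisTainted;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof AnalysisState)) return false;
            AnalysisState s = (AnalysisState) o;
            return method.equals(s.method) && taintedParams.equals(s.taintedParams)
                    && thisTainted == s.thisTainted;
        }
        @Override public int hashCode() {
            return Objects.hash(method, taintedParams, thisTainted);
        }
    }

    /** Starts with all entry parameters tainted and reports "caller -> sink" flows. */
    static List<String> analyze(Map<String, List<Call>> callsOf, Set<String> sinks,
                                String entry, int entryParamCount) {
        Deque<AnalysisState> worklist = new ArrayDeque<>();
        Set<AnalysisState> seen = new HashSet<>();
        List<String> findings = new ArrayList<>();
        BitSet all = new BitSet();
        all.set(0, entryParamCount);
        AnalysisState start = new AnalysisState(entry, all, false);
        seen.add(start);
        worklist.push(start);
        while (!worklist.isEmpty()) {
            AnalysisState state = worklist.pop();
            for (Call call : callsOf.getOrDefault(state.method, List.of())) {
                // Map tainted caller parameters onto callee argument positions.
                BitSet calleeTaint = new BitSet();
                for (int i = 0; i < call.argFromParam.length; i++) {
                    int p = call.argFromParam[i];
                    if (p >= 0 && state.taintedParams.get(p)) calleeTaint.set(i);
                }
                if (calleeTaint.isEmpty()) continue;           // nothing tainted reaches the callee
                if (sinks.contains(call.callee)) {
                    findings.add(state.method + " -> " + call.callee);
                    continue;
                }
                AnalysisState next = new AnalysisState(call.callee, calleeTaint, false);
                if (seen.add(next)) worklist.push(next);       // skip states already analyzed
            }
        }
        return findings;
    }

    /** Demo graph with a recursive edge and one call that only passes a constant. */
    static Map<String, List<Call>> demoGraph() {
        Map<String, List<Call>> g = new HashMap<>();
        g.put("Controller.run", List.of(new Call("Service.exec", 0), new Call("Service.log", -1)));
        g.put("Service.exec", List.of(new Call("Runtime.exec", 0), new Call("Controller.run", 0)));
        g.put("Service.log", List.of(new Call("Runtime.exec", 0)));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(analyze(demoGraph(), Set.of("Runtime.exec"), "Controller.run", 1));
    }
}
```

The explicit deque replaces recursion, so deep or cyclic call chains cost heap rather than stack, and the `seen` set is what guarantees termination on the recursive `Service.exec -> Controller.run` edge.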
Instead of Regex or ASM, we leverage Soot's superior annotation support. A RouteExtractor will be implemented.
- Recognition Logic:
  - Spring Boot: Scan `@RestController`, `@Controller` on classes and `@RequestMapping`, `@GetMapping`, `@PostMapping` on methods. Parse `value` or `path` attributes.
  - Servlet: Scan classes inheriting `javax.servlet.http.HttpServlet` and parse `web.xml` (if present) or `@WebServlet`.
  - JAX-RS: Scan `@Path`, `@GET`, `@POST`, `@PUT`, `@DELETE`, `@HEAD`, `@OPTIONS`, `@PATCH` and other JAX-RS annotations.
- Output Format: `METHOD /full/url/path class.method(params)`
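As an illustration of the output format, the sketch below joins a class-level and a method-level mapping path and renders one `api.txt` line. The class `RouteLineSketch`, the path normalization rules, and the comma separator between parameter types are assumptions; only the overall `METHOD /full/url/path class.method(params)` shape comes from the design.

```java
import java.util.List;

/** Hypothetical formatter for one api.txt route line. */
final class RouteLineSketch {

    /** Joins class-level and method-level paths with single slashes and no trailing slash. */
    static String joinPaths(String classPath, String methodPath) {
        String joined = ("/" + classPath + "/" + methodPath).replaceAll("/{2,}", "/");
        if (joined.length() > 1 && joined.endsWith("/")) {
            joined = joined.substring(0, joined.length() - 1);
        }
        return joined;
    }

    /** Renders "METHOD /full/url/path class.method(params)". */
    static String format(String httpMethod, String classPath, String methodPath,
                         String className, String methodName, List<String> paramTypes) {
        return httpMethod + " " + joinPaths(classPath, methodPath) + " "
                + className + "." + methodName + "(" + String.join(",", paramTypes) + ")";
    }

    public static void main(String[] args) {
        // Example values are made up; a real extractor would read them from annotations.
        System.out.println(format("GET", "/api", "users/{id}",
                "com.example.UserController", "getUser", List.of("java.lang.Long")));
    }
}
```

Normalizing slashes in one place avoids duplicate routes such as `/api//users` when a class-level mapping ends with a slash and the method-level one starts with another.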
We will use Jackson or SnakeYAML.
Configuration Structure (rules.yaml):
config:
  max_depth: 10
  scan_packages: ["com.example", "cn.service"] # Limit scan scope
sources:
  - type: "annotation"
    value: "org.springframework.web.bind.annotation.RequestParam"
  - type: "method"
    signature: "<javax.servlet.http.HttpServletRequest: java.lang.String getParameter(java.lang.String)>"
sinks:
  - type: "method"
    vuln_type: "RCE"
    signature: "<java.lang.Runtime: java.lang.Process exec(java.lang.String)>"

Startup Logic:
// Reuse an existing rules.yaml; otherwise extract the bundled default first.
File configFile = new File("rules.yaml");
if (!configFile.exists()) {
    ResourceUtil.extract("/default_rules.yaml", ".");
    Logger.info("Created default rules.yaml.");
}
Config config = ConfigLoader.load(configFile);

High memory usage in previous tools often stems from loading the entire JRE `rt.jar` and building an excessive Call Graph.
Optimization Strategies:
- Phantom Refs: Enable `Options.v().set_allow_phantom_refs(true)`. Do not load implementations of third-party libraries unless necessary.
- Exclusion List: Aggressively exclude `java.*`, `javax.*`, `sun.*`, `org.slf4j.*` and other non-business-logic packages from Call Graph construction.
- CHA vs SPARK: Default to CHA (Class Hierarchy Analysis) for the base Call Graph, as it is faster and more memory-efficient. Enable SPARK only with a `--deep` flag.
- Iterative Analysis: Implement a "Batch Mode" where JARs are processed individually or in small groups (calling `G.reset()` between batches) if inter-service calls are not the focus.
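The exclusion list above relies on package-prefix patterns such as `java.*`. A minimal sketch of that matching rule follows; it is a stand-alone illustration with a hypothetical class `ExclusionSketch`, and it assumes a trailing `.*` means "this package and all subpackages" while any other pattern must match the class name exactly.

```java
import java.util.List;

/** Hypothetical matcher for package exclusion patterns such as "java.*". */
final class ExclusionSketch {
    static final List<String> DEFAULT_EXCLUDES =
            List.of("java.*", "javax.*", "sun.*", "org.slf4j.*");

    /** True if className falls under any pattern; "pkg.*" is treated as a package prefix. */
    static boolean isExcluded(String className, List<String> patterns) {
        for (String pattern : patterns) {
            if (pattern.endsWith(".*")) {
                // Keep the trailing dot so "java.*" does not match "javassist.Foo".
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (className.startsWith(prefix)) return true;
            } else if (className.equals(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String cls : List.of("java.lang.String", "javassist.CtClass", "com.example.UserService")) {
            System.out.println(cls + " excluded: " + isExcluded(cls, DEFAULT_EXCLUDES));
        }
    }
}
```

Matching on the prefix including its dot is the detail that matters: a bare `startsWith("java")` would also drop unrelated packages like `javassist`.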
Use sarif-java-sdk or manually construct the JSON structure.
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": { "name": "JByteScanner" } },
      "results": [
        {
          "ruleId": "RCE",
          "message": { "text": "Detected RCE flow from Controller to Runtime.exec" },
          "locations": [ ... ]
        }
      ]
    }
  ]
}
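For the "manually construct the JSON structure" option, a dependency-free sketch might look like the following. `SarifSketch` and its methods are hypothetical, the output covers only the fields shown in the sample above, and `locations` is omitted because its contents are not specified here. A production report would more likely be built with a JSON library such as Jackson.

```java
import java.util.List;

/** Hypothetical hand-rolled writer for a minimal SARIF 2.1.0 log. */
final class SarifSketch {

    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** One result object with ruleId and message.text. */
    static String result(String ruleId, String message) {
        return "{\"ruleId\":\"" + escape(ruleId) + "\",\"message\":{\"text\":\""
                + escape(message) + "\"}}";
    }

    /** Wraps already-serialized results in a single run for the given tool name. */
    static String report(String toolName, List<String> results) {
        return "{\"version\":\"2.1.0\",\"runs\":[{\"tool\":{\"driver\":{\"name\":\""
                + escape(toolName) + "\"}},\"results\":[" + String.join(",", results) + "]}]}";
    }

    public static void main(String[] args) {
        System.out.println(report("JByteScanner", List.of(
                result("RCE", "Detected RCE flow from Controller to Runtime.exec"))));
    }
}
```

Escaping is the part that is easy to get wrong by hand, since method signatures and messages can contain quotes or backslashes; that is the main argument for delegating serialization to a library.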