Commit 36a9657

refactor: implement structured event models and improve trace parsing

- Add Pydantic models for syscall events (ExecveEvent, ForkEvent, CloneEvent, ConnectEvent)
- Introduce UnparsedEvent model for raw trace lines that cannot be parsed
- Update TraceReader to yield structured event models instead of raw strings
- Refactor tests to handle new event model structure
- Update documentation to reflect new event model architecture
- Improve README with examples of structured event data
- Clean up project structure and remove legacy code markers

This change improves type safety and validation of event data while maintaining backward compatibility through the UnparsedEvent model for unparseable lines.

1 parent 2f27fdd commit 36a9657

8 files changed

Lines changed: 133 additions & 38 deletions

README.md

Lines changed: 30 additions & 8 deletions
@@ -66,7 +66,7 @@ This is useful for troubleshooting or understanding what data is being collected
 
 ## Data Structure
 
-Linux EDR groups execve events by process name and maintains the full command line for context:
+Linux EDR captures various syscall events and represents them using structured Pydantic models. Here's an example of an `ExecveEvent` within a report:
 
 ```json
 {
@@ -75,14 +75,30 @@ Linux EDR groups execve events by process name and maintains the full command li
   "window_end": "2023-01-01T12:00:00+00:00",
   "total": 150,
   "command_counts": {"ls": 50, "cat": 30, "bash": 70},
-  "process_events": {
-    "ls": ["ls -la /tmp", "ls /home", "ls -l /var/log"],
-    "cat": ["cat /etc/passwd", "cat /var/log/syslog"],
-    "bash": ["bash -c 'whoami'", "bash /tmp/script.sh"]
-  }
+  "events": [
+    {
+      "timestamp": "12345.67890",
+      "pid": 1001,
+      "command": "ls",
+      "args": ["-la", "/tmp"]
+    },
+    {
+      "timestamp": "12346.00000",
+      "pid": 1002,
+      "child_pid": 1003
+    },
+    {
+      "timestamp": "12347.11111",
+      "pid": 1004,
+      "fd": 3,
+      "address": "1.1.1.1:443"
+    }
+  ]
 }
 ```
 
+The previous grouping by process name (`process_events`) in Cell reports might change based on how these structured events are aggregated. The core reporting hierarchy remains.
+
 ## Configuration
 
 Linux EDR can be configured using a `config.ini` file with the following options:
@@ -229,11 +245,17 @@ linux-edr/
 │   ├── cli.py              # Typer-based CLI interface
 │   ├── app.py              # Core application logic
 │   ├── config.py           # Configuration management
-│   ├── trace.py            # Non-blocking trace reader
+│   ├── trace.py            # Non-blocking trace reader & parser
 │   ├── aggregator.py       # Thread-safe event buffering
-│   ├── summary.py          # Report generation
 │   ├── reporter.py         # OpenAI integration and output
 │   ├── report_manager.py   # Hierarchical report handling
+│   ├── domain/
+│   │   └── models/
+│   │       └── events/     # Pydantic models for syscall events
+│   │           ├── base.py
+│   │           ├── execve.py
+│   │           ├── fork.py
+│   │           └── ...     # other event types
 │   └── models.py           # Pydantic data models
 ├── tests/                  # Comprehensive test suite
 ├── docs/                   # Documentation
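
As a rough illustration of how the `command_counts` field could relate to the `events` list in the README example, the sketch below counts events that carry a `command` field. The `count_commands` helper is hypothetical and is not taken from the Linux EDR code base; the sample data mirrors the three events in the JSON example, and how the project actually computes its counts is not shown in this commit.

```python
from collections import Counter
from typing import Any


def count_commands(events: list[dict[str, Any]]) -> dict[str, int]:
    """Count events per command, skipping events that have no "command" field."""
    counts = Counter(evt["command"] for evt in events if "command" in evt)
    return dict(counts)


# Sample data mirroring the README example (one execve-like event,
# one fork-like event, one connect-like event).
sample_events = [
    {"timestamp": "12345.67890", "pid": 1001, "command": "ls", "args": ["-la", "/tmp"]},
    {"timestamp": "12346.00000", "pid": 1002, "child_pid": 1003},
    {"timestamp": "12347.11111", "pid": 1004, "fd": 3, "address": "1.1.1.1:443"},
]

print(count_commands(sample_events))
```

Only events with a `command` key contribute to the counts here, so fork and connect entries are ignored in this sketch.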

docs/api/domain/models.md

Lines changed: 49 additions & 6 deletions
@@ -6,19 +6,62 @@ The domain layer contains the core business logic and entities of the Linux EDR
 
 ### Event Models
 
-Domain models for system events captured from the Linux kernel.
+Domain models for system events captured from the Linux kernel. These models are based on Pydantic for validation and type safety.
+
+All syscall events inherit from a `BaseSyscallEvent`:
 
 ```python
-# Example usage
-from linux_edr.domain.models import Event, EventType, ProcessEvent
+from linux_edr.domain.models.events import BaseSyscallEvent
+
+# Base class structure (simplified)
+class BaseSyscallEvent(BaseModel):
+    timestamp: str
+    pid: int
+```
+
+Specific syscall events extend this base class, adding relevant fields.
+
+#### Execve Event
 
-# Create an event instance
-event = ProcessEvent(
+```python
+from linux_edr.domain.models.events import ExecveEvent
+
+# Example usage
+event = ExecveEvent(
+    timestamp="12345.67890",
     pid=1234,
     command="ls",
     args=["-la", "/home"],
-    timestamp=datetime.now()
 )
+print(event.model_dump())
+```
+
+#### Fork/Clone Events
+
+```python
+from linux_edr.domain.models.events import ForkEvent, CloneEvent
+
+# Example usage
+fork_evt = ForkEvent(timestamp="12346.00000", pid=100, child_pid=101)
+clone_evt = CloneEvent(timestamp="12347.00000", pid=200, child_pid=201, flags="CLONE_FS")
+
+print(fork_evt)
+print(clone_evt)
+```
+
+#### Connect Event
+
+```python
+from linux_edr.domain.models.events import ConnectEvent
+
+# Example usage
+connect_evt = ConnectEvent(
+    timestamp="12348.00000",
+    pid=500,
+    fd=3,
+    address="192.168.1.1:80"
+)
+print(connect_evt)
 ```
 
 ### Report Models

docs/architecture/overview.md

Lines changed: 16 additions & 14 deletions
@@ -17,7 +17,7 @@ For more details, see the [Clean Architecture](clean-architecture.md) page.
 
 ### Domain Layer
 
-- **Models**: Defines the structure of events and reports using Pydantic, ensuring data consistency and validation.
+- **Models**: Defines the structure of events and reports using Pydantic, ensuring data consistency and validation. Includes a base `BaseSyscallEvent` and specific models for traced syscalls (e.g., `ExecveEvent`, `ForkEvent`).
 
 ### Application Layer
 
@@ -45,15 +45,16 @@ For more details, see the [Clean Architecture](clean-architecture.md) page.
 
 ## Data Flow
 
-1. The `TraceReader` continuously reads `execve` events from the kernel trace pipe.
-2. Events are passed to the `Aggregator`, which buffers them in a thread-safe manner.
-3. A background scheduler triggers the appropriate use case at the configured interval.
-4. The use case retrieves a snapshot of events from the `Aggregator`.
-5. The service creates a Level 1 `Cell` report from the event snapshot.
-6. The `Cell` is passed to the `ReportManager`.
-7. The `ReportManager` saves the `Cell` and checks if enough Cells exist to create a Level 2 `Block`. This process continues up the hierarchy (Daily, Weekly, Monthly).
-8. The `Reporter` can optionally save the initial `Cell` report to a JSON file and send it to OpenAI for analysis.
-9. Higher-level reports (Blocks, etc.) can also be configured for AI analysis via the `ReportManager` interacting with the `Reporter`.
+1. The `TraceReader` continuously reads raw event strings from the kernel trace pipe.
+2. The `TraceReader` attempts to parse known syscall event lines (e.g., execve, fork, connect) into corresponding Pydantic models (`ExecveEvent`, `ForkEvent`, etc.). Unparsed lines are yielded as raw strings.
+3. Parsed event models (or raw strings if parsing fails) are passed to the `Aggregator`, which buffers them.
+4. A background scheduler triggers the report generation process at the configured interval.
+5. The reporting process retrieves a snapshot of buffered events (now structured models) from the `Aggregator`.
+6. A Level 1 `Cell` report is created from the event snapshot.
+7. The `Cell` is passed to the `ReportManager`.
+8. The `ReportManager` saves the `Cell` and checks if enough Cells exist to create a Level 2 `Block`. This process continues up the hierarchy (Daily, Weekly, Monthly).
+9. The `Reporter` can optionally save the initial `Cell` report to a JSON file and send it to OpenAI for analysis.
+10. Higher-level reports (Blocks, etc.) can also be configured for AI analysis via the `ReportManager` interacting with the `Reporter`.
 
 ## Project Structure
 
@@ -62,6 +63,7 @@ linux-edr/
 ├── linux_edr/                  # Main source code package
 │   ├── domain/                 # Core business logic
 │   │   └── models/             # Domain entities and value objects
+│   │       └── events/         # Pydantic models for specific syscall events
 │   ├── application/            # Application-specific business rules
 │   │   ├── services/           # Stateless operations
 │   │   └── use_cases/          # Business processes
@@ -71,12 +73,12 @@ linux-edr/
 │   │   └── controllers/        # Input adapters (CLI, API controllers)
 │   ├── app.py                  # Core application logic (legacy)
 │   ├── config.py               # Configuration management (legacy)
-│   ├── trace.py                # Non-blocking trace reader (legacy)
-│   ├── aggregator.py           # Thread-safe event buffering (legacy)
+│   ├── trace.py                # Non-blocking trace reader & parser
+│   ├── aggregator.py           # Thread-safe event buffering
 │   ├── summary.py              # Initial report generation (legacy)
-│   ├── reporter.py             # OpenAI integration (legacy)
+│   ├── reporter.py             # OpenAI integration & report output
 │   ├── report_manager.py       # Report management (legacy)
-│   ├── models.py               # Pydantic data models (legacy)
+│   ├── models.py               # Pydantic data models (legacy - to be removed/merged)
 │   └── cli.py                  # CLI interface (legacy)
 ├── tests/                      # Comprehensive test suite
 ├── docs/                       # Documentation source files
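
The buffering and snapshot steps of the data flow described in `docs/architecture/overview.md` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `Aggregator`: the class body, the method names `add` and `snapshot`, and the choice to clear the buffer on snapshot are all hypothetical, since the commit does not show `aggregator.py`.

```python
import threading
from typing import Any


class Aggregator:
    """Hypothetical thread-safe event buffer (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: list[Any] = []

    def add(self, event: Any) -> None:
        """Append one event; safe to call from several threads."""
        with self._lock:
            self._events.append(event)

    def snapshot(self) -> list[Any]:
        """Return the buffered events and start a fresh buffer (an assumption)."""
        with self._lock:
            events, self._events = self._events, []
        return events


agg = Aggregator()


def producer(start: int) -> None:
    # Each producer adds 100 placeholder events with distinct pids.
    for i in range(start, start + 100):
        agg.add({"pid": i})


threads = [threading.Thread(target=producer, args=(n * 100,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

first = agg.snapshot()
second = agg.snapshot()
print(len(first), len(second))
```

Swapping the list under the lock keeps the critical section short, so the reader thread is blocked only for the duration of an assignment while a report is being built from the snapshot.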

linux_edr/domain/models/events/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -3,11 +3,13 @@
 from .fork import ForkEvent
 from .clone import CloneEvent
 from .connect import ConnectEvent
+from .unparsed import UnparsedEvent
 
 __all__ = [
     "BaseSyscallEvent",
     "ExecveEvent",
     "ForkEvent",
     "CloneEvent",
     "ConnectEvent",
+    "UnparsedEvent",
 ]

linux_edr/domain/models/events/unparsed.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+from pydantic import BaseModel
+
+class UnparsedEvent(BaseModel):
+    """Represents a raw line from the trace pipe that could not be parsed."""
+    raw_line: str
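
Because unparseable lines now arrive wrapped in `UnparsedEvent` instead of as bare strings, a consumer can branch on the event type. The sketch below shows one way this might look. To keep it runnable without third-party packages, plain dataclasses stand in for the Pydantic models, and the `split_events` helper is hypothetical rather than part of Linux EDR.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass
class ExecveEvent:
    """Dataclass stand-in for the Pydantic ExecveEvent model."""
    timestamp: str
    pid: int
    command: str
    args: list[str]


@dataclass
class UnparsedEvent:
    """Dataclass stand-in for the Pydantic UnparsedEvent model."""
    raw_line: str


Event = Union[ExecveEvent, UnparsedEvent]


def split_events(events: Iterable[Event]) -> tuple[list[ExecveEvent], list[str]]:
    """Separate structured events from the raw text of unparsed ones."""
    parsed: list[ExecveEvent] = []
    raw: list[str] = []
    for evt in events:
        if isinstance(evt, UnparsedEvent):
            raw.append(evt.raw_line)
        else:
            parsed.append(evt)
    return parsed, raw


stream: list[Event] = [
    ExecveEvent(timestamp="12345.67890", pid=1001, command="ls", args=["-la", "/tmp"]),
    UnparsedEvent(raw_line="line1"),
]
parsed, raw = split_events(stream)
print(parsed, raw)
```

Code that previously checked `isinstance(item, str)` would need a comparable change to read `raw_line` instead.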

linux_edr/trace.py

Lines changed: 16 additions & 6 deletions
@@ -20,7 +20,9 @@
 CLONE_PATTERN = re.compile(r"(\S+)\s+\[(\d+)\]\s+.*clone.*child_pid=(\d+)\s+flags=(\S+)")
 CONNECT_PATTERN = re.compile(r"(\S+)\s+\[(\d+)\]\s+.*connect.*fd=(\d+)\s+addr=(.+)")
 
-from .domain.models.events import ExecveEvent, ForkEvent, CloneEvent, ConnectEvent, BaseSyscallEvent
+from .domain.models.events import (
+    ExecveEvent, ForkEvent, CloneEvent, ConnectEvent, BaseSyscallEvent, UnparsedEvent
+)
 
 class TraceReader:
     """
@@ -156,12 +158,15 @@ def _parse_line(self, line: str) -> Optional[BaseSyscallEvent]:
 
         return None
 
-    def __iter__(self) -> Generator[Union[str, BaseSyscallEvent], None, None]:
+    def __iter__(self) -> Generator[Union[BaseSyscallEvent, UnparsedEvent], None, None]:
         """
-        Iterate over lines from the trace pipe.
+        Iterate over events from the trace pipe.
+
+        Attempts to parse known syscall events into Pydantic models.
+        If a line cannot be parsed, it yields an `UnparsedEvent` model containing the raw line.
 
         Yields:
-            Lines from the trace pipe, one at a time
+            A `BaseSyscallEvent` subclass if parsed successfully, otherwise an `UnparsedEvent`.
         """
         if self.fd is None:
             logger.error("Cannot iterate: file descriptor is not open")
@@ -207,8 +212,13 @@ def __iter__(self) -> Generator[Union[str, BaseSyscallEvent], None, None]:
                         continue
 
                     parsed_evt = self._parse_line(line)
-                    # Yield the parsed object if recognized, else the raw line for backward-compat.
-                    yield parsed_evt if parsed_evt else line
+                    # Yield the parsed object if recognized, else yield an UnparsedEvent
+                    if parsed_evt:
+                        yield parsed_evt
+                    else:
+                        # Optionally log the unparsed line here if needed
+                        # logger.debug(f"Unparsed trace line: {line}")
+                        yield UnparsedEvent(raw_line=line)
             except OSError as e:
                 if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                     continue
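
The parse-or-wrap behaviour introduced in `__iter__` can be illustrated in isolation. The sketch below reuses the `CONNECT_PATTERN` regular expression from the diff, but everything else is an assumption made for illustration: the dataclasses stand in for the Pydantic models, the helpers `parse_line` and `parse_or_wrap` are hypothetical, the sample trace line is invented, and the mapping of the first two capture groups to `timestamp` and `pid` is a guess, since the body of `_parse_line` is not shown in this commit.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union

# Pattern copied from linux_edr/trace.py as shown in the diff.
CONNECT_PATTERN = re.compile(r"(\S+)\s+\[(\d+)\]\s+.*connect.*fd=(\d+)\s+addr=(.+)")


@dataclass
class ConnectEvent:
    """Dataclass stand-in for the Pydantic ConnectEvent model."""
    timestamp: str
    pid: int
    fd: int
    address: str


@dataclass
class UnparsedEvent:
    """Dataclass stand-in for the Pydantic UnparsedEvent model."""
    raw_line: str


def parse_line(line: str) -> Optional[ConnectEvent]:
    """Return a ConnectEvent if the line matches, otherwise None."""
    match = CONNECT_PATTERN.match(line)
    if match is None:
        return None
    # Assumption: group 1 is the timestamp and group 2 is the pid.
    timestamp, pid, fd, address = match.groups()
    return ConnectEvent(timestamp=timestamp, pid=int(pid), fd=int(fd), address=address.strip())


def parse_or_wrap(line: str) -> Union[ConnectEvent, UnparsedEvent]:
    """Mirror the new __iter__ logic: parsed model, else UnparsedEvent."""
    evt = parse_line(line)
    return evt if evt is not None else UnparsedEvent(raw_line=line)


# Invented sample lines; real trace pipe output may look different.
good = parse_or_wrap("12348.00000 [500] connect fd=3 addr=192.168.1.1:80")
bad = parse_or_wrap("line1")
print(good)
print(bad)
```

With this shape every yielded item is a model instance, which is why the updated tests compare against `UnparsedEvent(raw_line=...)` values instead of bare strings.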

tests/test_trace.py

Lines changed: 13 additions & 3 deletions
@@ -6,6 +6,7 @@
 import errno
 from unittest.mock import patch, MagicMock, mock_open
 from linux_edr.trace import TraceReader
+from linux_edr.domain.models.events import UnparsedEvent
 
 
 class TestTraceReader(unittest.TestCase):
@@ -129,7 +130,14 @@ def test_iteration(self, mock_selector, mock_read, mock_open):
                 break
 
         # Verify lines were read correctly
-        self.assertEqual(lines, ["line1", "line2", "line3"])
+        self.assertEqual(
+            lines,
+            [
+                UnparsedEvent(raw_line="line1"),
+                UnparsedEvent(raw_line="line2"),
+                UnparsedEvent(raw_line="line3"),
+            ],
+        )
 
     @patch("os.open")
     @patch("os.read")
@@ -163,7 +171,9 @@ def test_unicode_decode_error(self, mock_selector, mock_read, mock_open):
 
         # Verify line was read and invalid characters were replaced
         self.assertEqual(len(lines), 1)
-        self.assertIn("Invalid UTF-8", lines[0])
+        # Check the raw_line attribute of the UnparsedEvent
+        self.assertIsInstance(lines[0], UnparsedEvent)
+        self.assertIn("Invalid UTF-8", lines[0].raw_line)
 
     @patch("os.open")
     @patch("os.read")
@@ -201,7 +211,7 @@ def test_eagain_handling(self, mock_selector, mock_read, mock_open):
                 break
 
         # Verify line was read after EAGAIN
-        self.assertEqual(lines, ["line1"])
+        self.assertEqual(lines, [UnparsedEvent(raw_line="line1")])
 
     @patch("os.open")
     @patch("os.read")

tests/test_trace_errors.py

Lines changed: 2 additions & 1 deletion
@@ -6,6 +6,7 @@
 from linux_edr.trace import TraceReader
 from itertools import islice
 import logging
+from linux_edr.domain.models.events import UnparsedEvent
 
 
 class TestTraceReaderErrors(unittest.TestCase):
@@ -62,7 +63,7 @@ def test_open_retry_on_ebadf(self, mock_logger, mock_selector, mock_open):
         mock_reopen.assert_called_once()
 
         # Verify data after reopen was read
-        self.assertEqual(lines, ["data after reopen"])
+        self.assertEqual(lines, [UnparsedEvent(raw_line="data after reopen")])
 
         # Verify warning was logged about bad descriptor
         mock_logger.warning.assert_any_call("Bad file descriptor, reopening trace pipe")
