
Conversation

@Molter73 (Collaborator) commented Dec 3, 2025

Description

This is a very basic implementation of inode tracking meant to be used
in the upcoming release and improved upon in the near future.

The current implementation will perform a scan of the paths that are
configured to be monitored, using the inode and device numbers as the
key into two maps:

  • A BPF hash map for kernelspace to know when an inode triggers an
    event.
  • A HashMap in userspace that maps the inode to a path that we found the
    inode at.

With these two maps we are able to confidently emit events for files
that are mounted into containers with paths that don't necessarily match
the prefixes configured for monitoring and, at a later stage in
userspace, add the path on the host to the event itself.

The implemented approach is far from complete: it will only work as long
as the files found during the initial scan are not moved, deleted, or
replaced by a different file. Future patches will extend the
functionality of both kernel and userspace to catch more corner cases.
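As an illustration of the scan, here is a minimal Rust sketch of how the userspace side could build the inode-to-path map. The `InodeKey` type and `scan` helper are hypothetical names chosen for this example; the real implementation also inserts the same keys into the BPF hash map.

```rust
use std::collections::HashMap;
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

// Hypothetical key mirroring the BPF map key: inode plus device number.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct InodeKey {
    pub inode: u64,
    pub dev: u64,
}

// Recursively walk `root` and record the path every inode was found at.
// In the real implementation each key would also be inserted into the
// kernelspace BPF hash map so events can be matched by inode.
pub fn scan(root: &Path, map: &mut HashMap<InodeKey, PathBuf>) -> std::io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            scan(&entry.path(), map)?;
        } else {
            map.insert(
                InodeKey { inode: meta.ino(), dev: meta.dev() },
                entry.path(),
            );
        }
    }
    Ok(())
}
```

When an event arrives with a matching (inode, dev) pair, a lookup in this map yields the host path to attach to the event in userspace.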

Checklist

  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

The added tests validate that an event generated on an overlayfs file
properly reports both the event on the upper layer and the access to the
underlying FS. They also validate that a path mounted into a container
resolves to the correct host path.

While developing these tests, it became painfully obvious that getting
the information of the process running inside the container is not
straightforward. Because containers tend to be fairly static, we should
be able to create the information statically in the test and still have
everything work correctly. To minimize the amount of changes to existing
tests, the default Process constructor now takes fields directly and
there is a from_proc class method that builds a new Process object from
/proc. Additionally, getting the pid of a process in a container is
virtually impossible, so we make the pid check optional.

@Molter73 Molter73 marked this pull request as draft December 3, 2025 15:26
@Molter73 Molter73 force-pushed the mauro/ROX-30437/host-scanner-base branch 2 times, most recently from 221a623 to 7774aea on December 4, 2025 09:23
@Molter73 Molter73 force-pushed the mauro/ROX-30437/host-scanner-base branch from 7774aea to ce0b5fc on December 4, 2025 09:39
@Molter73 Molter73 marked this pull request as ready for review December 4, 2025 09:39
@erthalion (Contributor) left a comment


Thanks, a few comments below. On top of that, I'm testing it on OCP, and the device information seems to be off for the container case:

# container file
$ stat /etc/test
  File: /etc/test
  Size: 4               Blocks: 8          IO Block: 4096   regular file
Device: 0,651   Inode: 94375102    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-12-10 14:49:19.215500519 +0000
Modify: 2025-12-10 14:50:53.170938879 +0000
Change: 2025-12-10 14:50:53.170938879 +0000
 Birth: 2025-12-10 14:49:19.215500519 +0000

# container event
{"timestamp":1765381528692290158,"hostname":"xxxxxxx","process":{"comm":"bash","args":["bash"],"exe_path":"/usr/bin/bash","container_id":"4b5a83a0f180","uid":0,"username":"root","gid":0,"login_uid":4294967295,"pid":1404238,"in_root_mount_ns":false,"lineage":[{"uid":0,"exe_path":"/usr/bin/conmon"},{"uid":0,"exe_path":"/usr/lib/systemd/systemd"}]},"file":{"Open":{"filename":"/etc/test","host_file":"","inode":{"inode":94375102,"dev":2097291}}}}

# overlay event
{"timestamp":1765381528692350071,"hostname":"xxxxxxx","process":{"comm":"bash","args":["bash"],"exe_path":"/usr/bin/bash","container_id":"4b5a83a0f180","uid":0,"username":"root","gid":0,"login_uid":4294967295,"pid":1404238,"in_root_mount_ns":false,"lineage":[{"uid":0,"exe_path":"/usr/bin/conmon"},{"uid":0,"exe_path":"/usr/lib/systemd/systemd"}]},"file":{"Open":{"filename":"/etc/test","host_file":"","inode":{"inode":94375102,"dev":2052}}}}

The latter event seems to have a correct dev id of the host overlay, but the former is weird.


unsigned long magic = inode->i_sb->s_magic;
switch (magic) {
case BTRFS_SUPER_MAGIC:
@erthalion (Contributor):

From what I understand, btrfs support is mostly useful for development on Fedora, but OCP doesn't use it. In that case, what do you think about turning it into a set of if/else conditions, wrapped in likely/unlikely macros as needed, to make sure that we account for the most anticipated result?

@Molter73 (Collaborator, author):

With the check right below this line:

if (bpf_core_type_exists(struct btrfs_inode)) {

My understanding is that the verifier will remove the block entirely when btrfs is not used on the system, because it is dead code. In this case, we should never go into this branch of the switch-case, since no event will have the btrfs magic number, but there will also not be any additional code because it was removed by the verifier. Do we need to complicate it further than this?

If this was a kernel module, I would 100% agree with you, but I believe letting the verifier remove the code is the way to go here.

@erthalion (Contributor):

The verifier will drop the btrfs part, but the switch condition with its overhead will stay.

But in this particular case it probably doesn't matter indeed. The likely/unlikely is mostly a signal to the compiler to adapt the code layout for locality, and since the current layout is rather simple, it looks like clang doesn't change much:

// original
; switch (magic) {
 18c:	cmpq	%rsi, %rdi
 18f:	jne	0x1e2
 191:	movl	$1, %edi
; if (bpf_core_type_exists(struct btrfs_inode)) {
 196:	testl	%edi, %edi
 198:	je	0x1e2

// with the macros
; if (unlikely(magic == BTRFS_SUPER_MAGIC)) {
 18c:	cmpq	%rsi, %rdi
 18f:	jne	0x1e8
 191:	xorl	%edi, %edi
 193:	movl	$1, %esi
; if (bpf_core_type_exists(struct btrfs_inode)) {
 198:	testl	%esi, %esi
 19a:	je	0x21e

@Molter73 (Collaborator, author):

We can twist the condition a bit more to look something like this:

  if (bpf_core_type_exists(struct btrfs_inode) && unlikely(magic == BTRFS_SUPER_MAGIC)) {
    struct btrfs_inode* btrfs_inode = container_of(inode, struct btrfs_inode, vfs_inode);
    key.inode = inode->i_ino;
    key.dev = BPF_CORE_READ(btrfs_inode, root, anon_dev);
  } else {
    key.inode = inode->i_ino;
    key.dev = inode->i_sb->s_dev;
  }

I'm not 100% sure how to get the same output you got, but this might make the compiler output something a bit better while still letting the verifier remove the first check when btrfs is configured or the entire block when it is not?

}

__always_inline static long inode_remove(struct inode_key_t* inode) {
return bpf_map_delete_elem(&inode_map, inode);
@erthalion (Contributor):

No need for null check?

@Molter73 (Collaborator, author):

This method is currently only used like this:

      inode_remove(&inode_key);

That means the key will never be null, but you are right; I will add the null check.

}

fn update_entry(&self, path: &Path) -> anyhow::Result<()> {
if !path.exists() {
@erthalion (Contributor):

Looks like there are no debug outputs on this path, so there is no way to see what was actually added. Does it make sense to log it for debugging?

@Molter73 (Collaborator, author):

I had them for testing, but the output can get quite verbose very fast. I will add them back in, and if it becomes a problem we can knock them down to trace in the future.

MONITORED,
} inode_monitored_t;

__always_inline static inode_monitored_t inode_is_monitored(const inode_value_t* inode) {
@erthalion (Contributor):

That looks overly verbose for what seems to be a null check under the hood. Any reason why this is needed?

@Molter73 (Collaborator, author):

The current state is very minimal, but the next step to properly do inode tracking will require us to keep track of directories: when a file is accessed, we will need to know whether the parent directory is monitored, in case the file is being created and should be tracked. This function is in preparation for extending the inode_monitored_t enum with a PARENT_MONITORED value, returned when the current inode is not monitored but the parent directory is (the function will also need to take in the inode of the parent to check this condition).

@Molter73 (Collaborator, author):

If you feel this is too much for the current implementation, I'm happy to roll back to a simple null check and add everything back in the next stage of the implementation.

@erthalion (Contributor):

No, it's fine by me, just add a short doc string to this function explaining this, please.

unsigned long long dev;
} inode_key_t;

typedef char inode_value_t;
@erthalion (Contributor):

Any particular reason for char instead of bool?

@Molter73 (Collaborator, author):

I believe the only reason I chose this is because the conversion on the rust side was easier this way, I'll double check and come back to this.

@Molter73 (Collaborator, author):

I remember now: using bool requires us to include vmlinux.h. If we include vmlinux.h in this file, the bindings generated by our build script get massive, since everything in vmlinux.h gets a binding.

There's probably a better way to go about this; for now I've gone with just using char.

} process_t;

typedef struct inode_key_t {
unsigned long long inode;
@erthalion (Contributor):

Should they be unsigned long long? From what I see both inode and dev are 32 bits wide.

@Molter73 (Collaborator, author):

I can knock them down to unsigned long; I think it will require some casting on the Rust side because the MetadataExt trait gives u64 values for these for whatever reason.

@Molter73 (Collaborator, author):

Actually, it just worked 🤷🏻‍♂️

use tokio::{
io::unix::AsyncFd,
sync::{broadcast, watch},
sync::{mpsc, watch},
@erthalion (Contributor):

Can you describe what this is and how it changes the synchronization primitives used for event processing?

@Molter73 (Collaborator, author):

Sure. mpsc stands for multi-producer single-consumer, so we are going from a broadcast channel (which is essentially multi-producer multi-consumer) to this type of channel. The logic behind this change comes from how the data will flow through the components:

  • The bpf module will grab a reference to an event coming from the kernel in
    the ringbuffer and turn it into an Event value by copying trivial types
    like integers and allocating heap memory for things like strings. Once the
    Event is constructed, it gets moved into an mpsc channel via its Sender.
    Because this is done via a move operation it should be fast, and because
    we have a single producer writing to the channel (the bpf module), there
    should be no contention when inserting the value (unless the channel
    becomes full).
  • The Receiver end of this mpsc channel is held by the host_scanner module.
    This will receive the Event values and check if they need to have their
    host path added (in future updates, this module may need to react in other
    ways, like scanning a directory after a rename puts it under a monitored
    path). Once this is done, the Event gets wrapped in an Arc (atomic
    reference counter) and forwarded via a broadcast channel to the output
    module.
  • The output module is essentially two tasks at the moment: one printing
    events in JSON format to stdout and another sending the events via gRPC.
    The reason an Arc is used is that receiving from a broadcast channel
    performs a clone operation on the value, since this is the only way
    multiple consumers can receive it; wrapping the value in an Arc prevents
    all the underlying heap-allocated values from being copied on receive.

More information on these channels can be found in the tokio docs.

I also have a draft on how we could extend this approach a bit further to do pipelining in a similar fashion: #169

Hopefully this answers the question; I'm happy to provide more clarifications if needed.
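For illustration, the flow described above can be sketched with a std-only analogue: threads and std::sync::mpsc stand in for tokio's async channels, two plain channels stand in for the broadcast channel, and the Event fields and the /host prefix are made up for this example.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Hypothetical event type standing in for the real `Event`.
#[derive(Debug)]
pub struct Event {
    pub filename: String,
    pub host_file: String,
}

// Emulates the pipeline: a single producer moves an Event into an mpsc
// channel; the consumer (the host_scanner stand-in) enriches it with the
// host path, wraps it in an Arc, and fans it out to two outputs (stdout
// JSON and gRPC in the real code).
pub fn pipeline(ev: Event) -> (Arc<Event>, Arc<Event>) {
    let (bpf_tx, scanner_rx) = mpsc::channel::<Event>();
    let (out1_tx, out1_rx) = mpsc::channel::<Arc<Event>>();
    let (out2_tx, out2_rx) = mpsc::channel::<Arc<Event>>();

    let scanner = thread::spawn(move || {
        for mut ev in scanner_rx {
            // Enrich the event with the (made-up) host path in userspace.
            ev.host_file = format!("/host{}", ev.filename);
            let ev = Arc::new(ev);
            // Cheap fan-out: clones the Arc, not the heap-allocated Event.
            out1_tx.send(Arc::clone(&ev)).unwrap();
            out2_tx.send(ev).unwrap();
        }
    });

    bpf_tx.send(ev).unwrap();
    drop(bpf_tx); // close the channel so the consumer loop ends
    let res = (out1_rx.recv().unwrap(), out2_rx.recv().unwrap());
    scanner.join().unwrap();
    res
}
```

Both consumers end up holding the same allocation, which is the point of wrapping the value in an Arc before the broadcast stage.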

@@ -0,0 +1,143 @@
use std::{
@erthalion (Contributor):

I think it's a good idea to start any new module with a short description of what it does and a few words about its internals and architecture.

@Molter73 (Collaborator, author):

Yeah, I've been dropping the ball hard on documentation. I will add it to the new module here and get to work on updating docs across the project when I get a minute.

dev: metadata.st_dev(),
};

self.kernel_inode_map.borrow_mut().insert(inode, 0, 0)?;
@erthalion (Contributor):

In the future we might want to use bulk insert; does aya support it?

@Molter73 (Collaborator, author):

I looked into this for the implementation; it'd be nice at start up to be able to populate the regular map and then bulk-update the kernel one. Unfortunately, aya doesn't seem to support this, but we can probably do something with aya-obj and calling the BPF syscall directly.

@Molter73 (Collaborator, author):


Overlayfs has some pretty fancy logic when handling device numbers because it needs to handle the upper/lower layers and merged paths for all files. Since there is no logic for handling overlayfs specifically implemented yet, the device number will be wrong there. I haven't bothered implementing it yet because we are not doing inode tracking on container files that I know of.

* Added a missing null check.
* Added missing doc strings.
* Downgraded inode and dev numbers to 32 bits.
* Added some logging statements.
@Molter73 Molter73 requested a review from erthalion December 15, 2025 11:45
@Molter73 Molter73 merged commit 087a210 into main Dec 15, 2025
23 checks passed
@Molter73 Molter73 deleted the mauro/ROX-30437/host-scanner-base branch December 15, 2025 14:42
