ROX-30437: basic inode tracking for host path resolution #166
Conversation
Force-pushed 221a623 to 7774aea.
This is a very basic implementation of inode tracking meant to be used in the upcoming release and improved upon in the near future. The current implementation performs a scan of the paths that are configured to be monitored, using the inode and device numbers as a key to two maps:

- A BPF hash map for kernelspace, so we know when an inode triggers an event.
- A HashMap in userspace that maps the inode to the path we found the inode at.

With these two maps we are able to confidently emit events for files that are mounted into containers under paths that don't necessarily match the prefixes configured for monitoring and, at a later stage in userspace, add the host path to the event itself. The implemented approach is far from complete: it only works as long as the files found during the initial scan are not moved, deleted, or replaced by a different file. Future patches will extend the functionality of both kernel and userspace to catch more corner cases.
Added tests validate that events generated on an overlayfs file properly show the event on the upper layer and the access to the underlying FS. They also validate that a mounted path in a container resolves to the correct host path. While developing these tests, it became painfully obvious that getting the information of the process running inside the container is not straightforward. Because containers tend to be fairly static, we should be able to create the information statically in the test and still have everything work correctly. In order to minimize the amount of changes to existing tests, the default `Process` constructor now takes fields directly, and a `from_proc` class method builds a new `Process` object from /proc. Additionally, getting the pid of a process in a container is virtually impossible, so we make the pid check optional.
Force-pushed 7774aea to ce0b5fc.
Thanks, a few comments below. On top of that I'm testing it on OCP, and the device information seems to be off for the container case:
```
# container file
$ stat /etc/test
  File: /etc/test
  Size: 4           Blocks: 8          IO Block: 4096   regular file
Device: 0,651       Inode: 94375102    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-12-10 14:49:19.215500519 +0000
Modify: 2025-12-10 14:50:53.170938879 +0000
Change: 2025-12-10 14:50:53.170938879 +0000
 Birth: 2025-12-10 14:49:19.215500519 +0000

# container event
{"timestamp":1765381528692290158,"hostname":"xxxxxxx","process":{"comm":"bash","args":["bash"],"exe_path":"/usr/bin/bash","container_id":"4b5a83a0f180","uid":0,"username":"root","gid":0,"login_uid":4294967295,"pid":1404238,"in_root_mount_ns":false,"lineage":[{"uid":0,"exe_path":"/usr/bin/conmon"},{"uid":0,"exe_path":"/usr/lib/systemd/systemd"}]},"file":{"Open":{"filename":"/etc/test","host_file":"","inode":{"inode":94375102,"dev":2097291}}}}

# overlay event
{"timestamp":1765381528692350071,"hostname":"xxxxxxx","process":{"comm":"bash","args":["bash"],"exe_path":"/usr/bin/bash","container_id":"4b5a83a0f180","uid":0,"username":"root","gid":0,"login_uid":4294967295,"pid":1404238,"in_root_mount_ns":false,"lineage":[{"uid":0,"exe_path":"/usr/bin/conmon"},{"uid":0,"exe_path":"/usr/lib/systemd/systemd"}]},"file":{"Open":{"filename":"/etc/test","host_file":"","inode":{"inode":94375102,"dev":2052}}}}
```
The latter event seems to have a correct dev id of the host overlay, but the former is weird.
```c
unsigned long magic = inode->i_sb->s_magic;
switch (magic) {
case BTRFS_SUPER_MAGIC:
```
From what I understand, btrfs support is mostly useful for development on Fedora, but OCP doesn't use it. In that case, what do you think about turning it into a set of if/else conditions, wrapped in likely/unlikely macros as needed, to make sure that we account for the most anticipated result?
With the check right below this line:

```c
if (bpf_core_type_exists(struct btrfs_inode)) {
```

My understanding is that the verifier will remove the block entirely when btrfs is not used on the system, because it is dead code. In that case we should never enter this branch of the switch-case, since no event will have the btrfs magic number, and there will also be no additional code because the verifier removed it. Do we need to complicate it further than this?
If this was a kernel module, I would 100% agree with you, but I believe letting the verifier remove the code is the way to go here.
The verifier will drop the btrfs part, but the switch condition with its overhead will stay.
But in this particular case it probably doesn't matter, indeed. The likely/unlikely is mostly a signal to the compiler to adapt the code layout for locality -- and since the current layout is rather simple, it looks like clang doesn't change anything that much:
```asm
// original
; switch (magic) {
     18c:  cmpq   %rsi, %rdi
     18f:  jne    0x1e2
     191:  movl   $1, %edi
; if (bpf_core_type_exists(struct btrfs_inode)) {
     196:  testl  %edi, %edi
     198:  je     0x1e2

// with the macros
; if (unlikely(magic == BTRFS_SUPER_MAGIC)) {
     18c:  cmpq   %rsi, %rdi
     18f:  jne    0x1e8
     191:  xorl   %edi, %edi
     193:  movl   $1, %esi
; if (bpf_core_type_exists(struct btrfs_inode)) {
     198:  testl  %esi, %esi
     19a:  je     0x21e
```
We can twist the condition a bit more to look something like this:

```c
if (bpf_core_type_exists(struct btrfs_inode) && unlikely(magic == BTRFS_SUPER_MAGIC)) {
    struct btrfs_inode* btrfs_inode = container_of(inode, struct btrfs_inode, vfs_inode);
    key.inode = inode->i_ino;
    key.dev = BPF_CORE_READ(btrfs_inode, root, anon_dev);
} else {
    key.inode = inode->i_ino;
    key.dev = inode->i_sb->s_dev;
}
```

I'm not 100% sure how to get the same output you got, but this might make the compiler output something a bit better while still letting the verifier remove the first check when btrfs is configured, or the entire block when it is not?
```c
__always_inline static long inode_remove(struct inode_key_t* inode) {
    return bpf_map_delete_elem(&inode_map, inode);
```
No need for null check?
This method is currently only used like this:

```c
inode_remove(&inode_key);
```

which means the key will never be null. But you are right, I will add the null check.
```rust
fn update_entry(&self, path: &Path) -> anyhow::Result<()> {
    if !path.exists() {
```
Looks like there are no debug outputs on this path, so there is no way to see what was actually added. Does it make sense to log it for debugging?
I had them for testing and it can get quite verbose very fast. I will add them back in, and if it becomes a problem we can knock them down to trace in the future.
```c
    MONITORED,
} inode_monitored_t;

__always_inline static inode_monitored_t inode_is_monitored(const inode_value_t* inode) {
```
That looks overly verbose for what seems to be a null check under the hood. Any reason why this is needed?
The current state is very minimal, but the next step to properly do inode tracking will require us to keep track of directories: when a file is accessed, we will need to know whether the parent directory is monitored, in case the file is being created and should be tracked. This function is in preparation for extending the inode_monitored_t enum with a PARENT_MONITORED value that will be returned when the current inode is not monitored but the parent directory is (the function will also need to take in the inode of the parent to check this condition).
If you feel this is too much for the current implementation, I'm happy to roll back to a simple null check and add everything back in the next stage of the implementation.
No, it's fine by me, just add a short doc string to this function explaining this, please.
```c
    unsigned long long dev;
} inode_key_t;

typedef char inode_value_t;
```
Any particular reason for char instead of bool?
I believe the only reason I chose this is because the conversion on the rust side was easier this way, I'll double check and come back to this.
I remember now: using bool requires us to include vmlinux.h. If we include vmlinux.h in this file, then the bindings generated by our build script get massive, since everything in vmlinux.h gets a binding.
There's probably a better way to go about this; for now I've gone with just using char.
fact-ebpf/src/bpf/types.h (outdated)
```c
} process_t;

typedef struct inode_key_t {
    unsigned long long inode;
```
Should they be unsigned long long? From what I see both inode and dev are 32 bits wide.
I can knock them down to unsigned long; I think it will require some casting on the Rust side because the MetadataExt trait gives u64 values for these for whatever reason.
Actually, it just worked 🤷🏻♂️
```diff
 use tokio::{
     io::unix::AsyncFd,
-    sync::{broadcast, watch},
+    sync::{mpsc, watch},
```
Can you describe what this is and how it changes the synchronization primitives used for event processing?
Sure. mpsc stands for multi-producer single-consumer, so we are going from a broadcast channel (which is essentially multi-producer multi-consumer) to this type of channel. The logic behind this change comes from how the data will flow through the components:

- The `bpf` module will grab a reference to an event coming from the kernel in the ringbuffer and turn it into an `Event` value by copying trivial types like integers and allocating heap memory for things like strings. Once the `Event` is constructed, it gets moved into an mpsc channel via its `Sender`. Because this is done via a move operation it should be fast, and because we have a single producer writing to the channel (the bpf module), there should be no contention to insert the value (unless the channel becomes full).
- The `Receiver` end of this `mpsc` channel is held by the `host_scanner` module. This will receive the `Event` values and check whether they need to have their host path added to them (in future updates, this module may need to react in other ways, like scanning a directory after a rename puts it under a monitored path). Once this is done, the `Event` gets wrapped in an `Arc` (atomic reference counter) and forwarded via a `broadcast` channel to the `output` module.
- The `output` module is essentially two tasks at the moment, one printing events in JSON format to stdout and another sending the events via gRPC. An `Arc` is used because receiving from a broadcast channel performs a `clone` of the value, since this is the only way multiple consumers can receive it; wrapping the value in an `Arc` prevents all the underlying heap-allocated values from being copied on receive.

More information on these channels can be found in the tokio docs:
- https://docs.rs/tokio/latest/tokio/sync/broadcast/index.html
- https://docs.rs/tokio/latest/tokio/sync/mpsc/index.html
I also have a draft on how we could extend this approach a bit further to do pipelining in a similar fashion: #169
Hopefully this answers the question, I'm happy to provide more clarifications if needed.
```diff
@@ -0,0 +1,143 @@
+use std::{
```
I think it's a good idea to start any new module with a short description of what it does and a few words about its internals and architecture.
Yeah, I've been dropping the ball hard on documentation. I will add them to the new module here and get to working on updating docs across the project when I get a minute.
fact/src/host_scanner.rs (outdated)
```rust
    dev: metadata.st_dev(),
};

self.kernel_inode_map.borrow_mut().insert(inode, 0, 0)?;
```
In the future we might want to use bulk insert, does aya support it?
I looked into this for the implementation; it'd be nice at startup to be able to populate the regular map and then bulk-update the kernel one. Unfortunately, aya doesn't seem to support this, but we can probably do something with aya-obj and calling the BPF syscall directly.
Overlayfs has some pretty fancy logic for handling device numbers because it needs to handle the upper/lower layers and merged paths for all files. Since there is no logic for handling overlayfs specifically implemented yet, the device number will be wrong there. I haven't bothered implementing it yet because we are not doing inode tracking on container files that I know of.
* Added a missing null check.
* Added missing doc strings.
* Downgraded inode and dev numbers to 32 bits.
* Added some logging statements.