feat: add marufs kernel module — CXL shared-memory filesystem by moonchan-park · Pull Request #41 · xcena-dev/maru

moonchan-park · 2026-04-10T02:45:22Z

Summary

Add marufs_kernel/ — Linux kernel filesystem module for CXL shared memory, enabling cross-node file sharing via DAX-mapped CXL memory pools
CAS-based lock-free global index (4-shard hash table), RAT (2KB CL-aligned per-file metadata), NRHT (application-level name→offset mapping), 3-stage ACL (owner→default→delegation), 4-phase background GC
Architecture docs (6 detailed + 1 overview), test suite (ioctl, mmap, cross-process, chown race, multinode), build/install scripts
GPL-2.0-only (kernel module requirement)

Structure

marufs_kernel/
├── src/           # Kernel module source (super, dir, inode, file, index, region, nrht, acl, gc, sysfs)
├── include/       # Userspace API header (marufs_uapi.h)
├── docs/          # Architecture docs (metadata layout, entry lifecycle, GC, NRHT, ACL, mount/IO)
├── tests/         # Test suite (C test programs + shell harness)
├── Makefile       # Kernel module build
└── install.sh     # Build + insmod + mount helper
docs/source/design_doc/
└── marufs_kernel_module_architecture.md  # Architecture overview with mermaid diagrams

Test plan

make builds marufs.ko without errors on target kernel
sudo ./install.sh loads module and mounts filesystem
sudo ./tests/test_local_multinode.sh passes all multinode tests

github-actions · 2026-04-10T02:46:11Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	6	0	100%
__main__.py	3	3	0%	5, 7–8
allocation_manager.py	102	10	90%	30–31, 44–45, 52, 207, 211–214
client.py	184	8	95%	88, 130, 147–148, 311–313, 319
config.py	46	1	97%	72
constants.py	9	0	100%
device_scanner.py	94	31	67%	25, 100, 102–104, 106, 114–123, 125, 127, 129, 134–143, 145, 147
handler.py	524	79	84%	125, 140, 151–158, 166–168, 176, 187–192, 197, 222, 249–250, 254, 294, 298–299, 304, 326, 333–334, 338–344, 347, 351–353, 369, 371–372, 429, 450, 460–461, 567–569, 576, 698–701, 704, 715–718, 724, 1055–1059, 1065, 1076–1080, 1086, 1165, 1170, 1212
ipc.py	275	2	99%	365, 441
kv_manager.py	102	0	100%
logging_setup.py	19	0	100%
protocol.py	216	0	100%
resource_manager_installer.py	103	13	87%	80–86, 167, 169–172, 187
rpc_async_client.py	190	0	100%
rpc_async_server.py	111	0	100%
rpc_client.py	66	0	100%
rpc_client_base.py	100	10	90%	183, 219–220, 231–232, 304–305, 309–310, 340
rpc_handler_mixin.py	102	19	81%	153–155, 158–160, 218–221, 226, 230–234, 245–246, 252
rpc_server.py	64	0	100%
serializer.py	81	0	100%
server.py	145	20	86%	44, 54–59, 64–65, 73, 168, 172, 243, 247, 265–266, 284–286, 371
stats_manager.py	95	0	100%
types.py	60	1	98%	145
uds_helpers.py	13	0	100%
memory
__init__.py	5	0	100%
allocator.py	55	0	100%
mapper.py	128	2	98%	229, 296
owned_region_manager.py	101	1	99%	212
types.py	62	0	100%
TOTAL	3081	200	93%

Tests	Skipped	Failures	Errors	Time
660	4 💤	0 ❌	0 🔥	6.478s ⏱️

moonchan-park

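PR #41 review summary

Why this PR?

The Maru project currently supports only TCP/RPC-based userspace memory management. For multiple nodes to access CXL shared memory directly through a filesystem interface, a kernel module is required; without this PR, cross-node file sharing requires network round trips. The marufs kernel module implements a lock-free filesystem on top of a DAX-mapped CXL memory pool, enabling zero-copy data sharing between nodes through the standard VFS interface (open/mmap/read/write).

Design overview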

graph TB
    subgraph UserSpace["Userspace"]
        app["Application"]
        test["Test Suite"]
    end

    subgraph KernelMod["marufs Kernel Module"]
        subgraph VFS["VFS Layer"]
            super["super.c -- mount, umount, format"]
            dir["dir.c -- readdir, lookup, create, unlink"]
            inode["inode.c -- iget, new_inode, evict"]
            file["file.c -- mmap, ftruncate, ioctl"]
        end

        subgraph DataLayer["Data Layer"]
            idx["index.c -- 4-shard CAS hash index"]
            region["region.c -- RAT allocator, 2KB entries"]
            nrht["nrht.c -- Name-Ref Hash Table"]
        end

        subgraph SecLayer["Security Layer"]
            acl["acl.c -- 3-stage ACL check"]
        end

        subgraph MaintLayer["Maintenance"]
            gc["gc.c -- 4-phase background GC"]
            sysfs_mod["sysfs.c -- stats export"]
        end
    end

    subgraph HW["Hardware"]
        cxl["CXL Shared Memory / DAX Device"]
    end

    app -->|"open, read, mmap, ioctl"| file
    test -->|"ioctl, mmap"| file
    super --> dir
    super --> inode
    super --> file
    dir --> idx
    file --> idx
    file --> region
    file --> acl
    file --> nrht
    inode --> idx
    gc -.->|"sweep"| idx
    gc -.->|"reclaim"| region
    gc -.->|"stale cleanup"| nrht
    idx --> cxl
    region --> cxl
    nrht --> cxl

CXL memory layout

block-beta
    columns 4
    sb["Superblock 4KB"]:1
    shards["Global Index Shards x4"]:1
    rat["RAT Header + 256 Entries"]:1
    data["Region Data"]:1

Key data flows

Path	Flow
File creation	create -> index claim EMPTY -> RAT alloc -> link and publish VALID
File read	lookup -> index hash search -> RAT entry -> DAX direct read
mmap	open -> ftruncate(region alloc) -> mmap -> DAX fault handler
GC	Phase1: dead process reclaim -> Phase2: stale INSERTING sweep -> Phase3: orphan sweep -> Phase4: NRHT stale sweep

Review result summary

Overall, the lock-free CAS design, the state machines, and the memory-barrier handling are well structured. The kernel module code quality is high, but the issues below should be reviewed before merging.

Severity	Count	Key issues
Critical	3	claim_entry state not validated, missing FIND_NAME permission check, batch buffer sizing
High	6	i_size_write without holding the lock, mmap file reference leak, force-unlock not using CAS, checksum not implemented, d_revalidate performance, missing compat_ioctl
Medium	3	GC thread stop check, sysfs buffer overflow, CXL 2.0 barriers
Minor	2	hardcoded user path, missing test binaries

youngrok-XCENA

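Review summary

Purpose of this PR

Adds a Linux kernel filesystem module (marufs) that lets multiple nodes access and manage CXL (Compute Express Link) shared memory at file granularity.

Problem without it: the CXL memory pool cannot be handled through a POSIX file interface, so applications must mmap the DAX device directly and implement multi-node metadata synchronization themselves.

Architecture design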

graph TB
    subgraph VFS["VFS Layer"]
        super["super.c: mount, umount, mkfs"]
        dir["dir.c: readdir, lookup, create"]
        inode_m["inode.c: iget, evict, getattr"]
        file_m["file.c: mmap, ftruncate, ioctl"]
    end
    subgraph CXL["CXL Shared Data Layer"]
        index_m["index.c: 4-shard lock-free hash"]
        region["region.c: RAT region allocator"]
        nrht["nrht.c: Name-Ref Hash Table"]
    end
    subgraph SEC["Security"]
        acl["acl.c: 3-stage ACL with delegation"]
    end
    subgraph BG["Background"]
        gc["gc.c: 4-phase GC sweep"]
        sysfs_mod["sysfs.c: stats and tunables"]
    end
    super --> dir
    super --> inode_m
    super --> file_m
    dir --> index_m
    file_m --> index_m
    file_m --> region
    file_m --> acl
    file_m --> nrht
    inode_m --> index_m
    gc -.-> index_m
    gc -.-> region
    gc -.-> nrht

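Key design characteristics

Item	Description
Lock-free concurrency	CAS-based 4-shard hash index, NRHT, delegation ACL
WORM semantics	Region is allocated only once via ftruncate; no reallocation
4-phase GC	orphan detection - timeout wait - CAS reclaim - region sweep
DAX direct mapping	Bypasses the page cache; direct access to CXL memory via mmap

Key findings

Severity	Count	Representative issues
Critical	2	Data loss on reboot (fstab format), NRHT entry hijack via CAS
High	4	DAXHEAP TOCTOU, non-atomic delegation transition, GC race, unprotected i_size write
Medium	3	Wrong errno, alloc_lock timeout unreachable, x86-only cache flush
Low	1	Hardcoded developer path

See the inline comments for details.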

youngrok-XCENA · 2026-04-10T02:59:41Z

+            cat >> /etc/fstab << FSTAB
+
+# MARUFS CXL filesystem (auto-generated by setup-autoload.sh)
+none  ${MOUNT_POINT}  ${MODULE_NAME}  daxdev=${DAX_DEVICE},node_id=${NODE_ID},format,nofail  0  0


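[critical] Filesystem is formatted on every reboot -- data loss

The format mount option is written permanently into fstab. The filesystem is reinitialized on every reboot and all data is lost. The systemd unit at L243 has the same problem.

After the initial format, the format option must be removed. I recommend excluding format by default when registering the fstab/systemd entries, and performing a format only through a separate --format flag.

Fixed in 29243e0

Removed the format option from both fstab and systemd.

CXL is volatile memory, so a format is indeed needed somewhere on each boot, but hardcoding it in the autoload script is dangerous. A format_if_needed mount option is planned as a separate follow-up:

Validate magic + CRC32 at mount time

If valid, use the existing metadata; if invalid, perform an idempotent format

Works safely in multi-node environments as well, based on a CRC commit point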
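
The planned format_if_needed behavior described above could look roughly like the following user-space sketch. Everything here is an assumption for illustration: the `demo_superblock` layout, the `DEMO_MAGIC` value, and the `demo_*` helpers are hypothetical and are not the marufs on-media format or kernel code. The sketch only shows the idea of validating magic + CRC32 and formatting when validation fails; it does not address cross-node races.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAGIC 0x4D415255u /* hypothetical magic, not the real marufs value */

/* Hypothetical superblock: the checksum covers every field before it. */
struct demo_superblock {
    uint32_t magic;
    uint32_t num_shards;
    uint32_t entries_per_shard;
    uint32_t checksum;
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF). */
static uint32_t demo_crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t demo_sb_crc(const struct demo_superblock *sb)
{
    return demo_crc32(sb, offsetof(struct demo_superblock, checksum));
}

static void demo_format(struct demo_superblock *sb)
{
    memset(sb, 0, sizeof(*sb));
    sb->magic = DEMO_MAGIC;
    sb->num_shards = 4;
    sb->entries_per_shard = 1024;
    sb->checksum = demo_sb_crc(sb); /* written last, acting as the commit point */
}

/* Returns 1 if a format was performed, 0 if the existing metadata was kept. */
static int demo_format_if_needed(struct demo_superblock *sb)
{
    if (sb->magic == DEMO_MAGIC && sb->checksum == demo_sb_crc(sb))
        return 0;
    demo_format(sb);
    return 1;
}
```

Calling `demo_format_if_needed()` twice on the same buffer formats at most once, which is the idempotence property the plan asks for.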

youngrok-XCENA · 2026-04-10T02:59:41Z

+				    struct marufs_nrht_entry *e)
+{
+	u32 st = READ_LE32(e->state);
+	if (marufs_le32_cas(&e->state, st, MARUFS_ENTRY_INSERTING) != st)


[critical] CAS succeeds from any state, including INSERTING -- entry can be hijacked

If st is already MARUFS_ENTRY_INSERTING, the CAS succeeds as INSERTING->INSERTING and overwrites the created_at and inserter_node of an entry that another node is in the middle of inserting. This corrupts the original inserter's work.

Check st == MARUFS_ENTRY_EMPTY || st == MARUFS_ENTRY_TOMBSTONE before the CAS.

Fixed in eea4555. Added st == EMPTY || st == TOMBSTONE validation before CAS in both index_claim_entry and nrht_claim_entry.
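
As a rough illustration of the fix, the following user-space sketch uses C11 atomics in place of the kernel's marufs_le32_cas. The `demo_*` names and the numeric state values are assumptions made for this example, not the definitions in marufs_layout.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative state values only. */
enum demo_state {
    DEMO_EMPTY = 0,
    DEMO_INSERTING = 1,
    DEMO_VALID = 2,
    DEMO_TOMBSTONE = 3,
};

struct demo_entry {
    _Atomic uint32_t state;
};

/*
 * Claim an entry for insertion. Only EMPTY or TOMBSTONE entries are
 * claimable; an entry that is already INSERTING or VALID is left alone,
 * so a second claimer cannot overwrite the first inserter's fields.
 */
static bool demo_claim_entry(struct demo_entry *e)
{
    uint32_t st = atomic_load(&e->state);

    if (st != DEMO_EMPTY && st != DEMO_TOMBSTONE)
        return false;
    /* Fails if another claimer changed the state after the load above. */
    return atomic_compare_exchange_strong(&e->state, &st, DEMO_INSERTING);
}
```

Without the state check, a load that observes INSERTING would feed INSERTING into the CAS as the expected value, and the CAS would succeed, which is the hijack described in the comment.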

youngrok-XCENA · 2026-04-10T02:59:41Z

+			       marufs_daxheap_bufid);
+			return -EEXIST;
+		}
+		mutex_unlock(&marufs_daxheap_lock);


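[high] TOCTOU race condition in DAXHEAP primary allocation

After checking marufs_daxheap_bufid == 0 at L452, the lock is released at L459 and daxheap_kern_alloc() is called without the lock at L461. If two threads attempt a primary mount at the same time, the buffer is allocated twice.

Suggestion: hold the lock until the alloc completes, or swap in the bufid with a CAS after allocation.

Fixed in d0ef7e1

Changed to hold the mutex from the check through alloc + recording the bufid. Added mutex_unlock on every error path.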
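
The shape of that fix can be sketched in user space with a pthread mutex. This is a minimal sketch under stated assumptions: `demo_alloc()` is a hypothetical stand-in for daxheap_kern_alloc(), and the error codes on the failure paths are chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_bufid;       /* 0 means "no primary buffer yet" */
static int demo_alloc_calls; /* counts how many allocations happened */

/* Hypothetical allocator: returns a nonzero buffer id on success. */
static int demo_alloc(void)
{
    return ++demo_alloc_calls;
}

/* Check, allocate, and record the id under one critical section. */
static int demo_primary_mount(void)
{
    int id;

    pthread_mutex_lock(&demo_lock);
    if (demo_bufid != 0) {
        pthread_mutex_unlock(&demo_lock); /* unlock on every error path */
        return -EEXIST;
    }
    id = demo_alloc();
    if (id <= 0) {
        pthread_mutex_unlock(&demo_lock);
        return -ENOMEM;
    }
    demo_bufid = id;
    pthread_mutex_unlock(&demo_lock);
    return 0;
}
```

Because the check and the write to `demo_bufid` share one critical section, a second caller can never observe 0 after the first caller has started allocating.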

youngrok-XCENA · 2026-04-10T02:59:41Z

+			sizeof(*de)); /* Ensure all fields visible before state transition */
+
+		/* Publish: GRANTING → ACTIVE (now safe for readers) */
+		WRITE_LE32(de->state, MARUFS_DELEG_ACTIVE);


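[high] GRANTING -> ACTIVE state transition is a plain WRITE, not a CAS

While the entry is in GRANTING state, another node (e.g. GC) may already have changed the state, but the code overwrites it without checking.

Change it to marufs_le32_cas(&de->state, MARUFS_DELEG_GRANTING, MARUFS_DELEG_ACTIVE) so that the transition happens only from the expected state.

Fixed in d0ef7e1

WRITE_LE32(de->state, ACTIVE) → marufs_le32_cas(&de->state, GRANTING, ACTIVE). If the CAS fails (GC has already moved the entry to EMPTY), -EAGAIN is returned so the caller retries.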
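
A user-space analogue of that publish step, using C11 atomics, might look like the sketch below. The `demo_*` names and state values are hypothetical; only the pattern (CAS from the one expected state, report -EAGAIN otherwise) is taken from the discussion above.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative values only. */
enum demo_deleg_state {
    DEMO_DELEG_EMPTY = 0,
    DEMO_DELEG_GRANTING = 1,
    DEMO_DELEG_ACTIVE = 2,
};

/*
 * Publish a delegation entry: GRANTING -> ACTIVE.
 * If a concurrent reclaimer already moved the entry out of GRANTING,
 * the state is left untouched and the caller is told to retry.
 */
static int demo_deleg_publish(_Atomic uint32_t *state)
{
    uint32_t expected = DEMO_DELEG_GRANTING;

    if (!atomic_compare_exchange_strong(state, &expected, DEMO_DELEG_ACTIVE))
        return -EAGAIN;
    return 0;
}
```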

youngrok-XCENA · 2026-04-10T02:59:41Z

+	if (sbi->gc_orphan_count >= MARUFS_GC_ORPHAN_MAX)
+		return;
+
+	i = sbi->gc_orphan_count++;


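[high] Non-atomic increment of gc_orphan_count -- race condition

gc_orphan_count and the gc_orphans array are accessed without a lock. If marufs_gc_track_orphan() can be called concurrently from multiple CPUs, for example from the GC sweep path, a race condition occurs.

Either guarantee that it is only called from a single thread, or protect it with a spinlock/atomic.

Fixed in d0ef7e1

GC runs only as a single kthread, so there is no race. This is now stated in the comment on marufs_gc_track_orphan.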

youngrok-XCENA · 2026-04-10T02:59:41Z

+		struct marufs_rat_entry *rat_e =
+			marufs_rat_entry_get(sbi, xi->rat_entry_id);
+		if (rat_e) {
+			inode->i_size = READ_LE64(rat_e->size);


[high] i_size modified directly without i_rwsem

The VFS requires i_rwsem protection or the use of i_size_write() when i_size changes. Concurrent getattr and read_iter calls can see an inconsistent size.

In the getattr callback it is safer not to update inode->i_size and to write directly to stat->size instead.

Fixed in 99cc329

Changed to write directly to stat->size, as youngrok suggested. fillattr is called first → stat->size/stat->blocks are then overridden with the fresh RAT value. Since inode->i_size is not updated at all, i_rwsem is not needed.

youngrok-XCENA · 2026-04-10T02:59:41Z

+	for (i = 0; i < MARUFS_MAX_RAT_ENTRIES; i++) {
+		struct marufs_rat_entry *entry = marufs_rat_entry_get(sbi, i);
+		if (!entry)
+			return -1;


[medium] Wrong error code returned

return -1 corresponds to -EPERM. The VFS statfs callback should return a meaningful errno. A marufs_rat_entry_get() failure is an I/O error, so return -EIO is appropriate.

Already fixed in an earlier commit (5b5dfb8). return -1 → return -EIO (super.c statfs).

youngrok-XCENA · 2026-04-10T02:59:41Z

+	 *
+	 * Stale lock recovery: if holder crashed, force-unlock after timeout.
+	 */
+	while (retries < MARUFS_REGION_INIT_MAX_RETRIES) {


[medium] alloc_lock retry/timeout logic mismatch -- 5-second timeout never reached

retries++ is incremented only on the cpu_relax() path. cpu_relax() takes on the order of nanoseconds, so reaching MARUFS_REGION_INIT_MAX_RETRIES (10) takes only microseconds, and the 5-second timeout + force-unlock path is effectively unreachable.

Suggestion: count retries as the number of force-unlocks, or use usleep_range() instead of cpu_relax() to secure the wait time.

Fixed in 21ccbb5.

cpu_relax() → usleep_range(500000, 600000) (500ms). 10 retries × 500ms = 5s, so the timeout force-unlock path is now actually reachable. Added #include <linux/delay.h>.

youngrok-XCENA · 2026-04-10T02:59:41Z

+ */
+#ifdef CONFIG_MARUFS_CXL2_COMPAT
+
+static inline void __marufs_cxl_flush_range(const void *addr, size_t len)


[medium] clwb/clflushopt are x86-only -- compile failure on other architectures

clwb(p) (L403) and clflushopt(p) (L413) are x86-only instructions. ARM64/RISC-V builds will fail to compile.

Add an #ifdef CONFIG_X86 guard, or consider using an architecture-independent API such as arch_wb_cache_pmem(). Even if CXL is currently used mainly on x86, it is good to state the supported scope with an #error message.

Fixed in 21ccbb5.

Replaced the direct x86-only clwb/clflushopt calls with the kernel's arch-portable APIs arch_wb_cache_pmem() / arch_invalidate_pmem() (<linux/libnvdimm.h>). On x86 they perform the same clwb+wmb / clflushopt+mb; on non-x86 they are no-ops (assuming a WT mapping). The manual cacheline loop and the #error guard are no longer needed.

youngrok-XCENA · 2026-04-10T02:59:41Z

+MOUNT_POINT=""
+SKIP_BUILD=false
+USE_DAXHEAP=false
+DAXHEAP_DIR="${MARUFS_DAXHEAP_DIR:-/home/mcpark/daxheap}"


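[low] Hardcoded developer-local path

/home/mcpark/daxheap is set as the default. In other environments the build fails when MARUFS_DAXHEAP_DIR is not set. Remove the default and print an error when it is unset, or change it to a project-relative path.

Fixed in 21ccbb5.

DAXHEAP_DIR default removed → empty string. install.sh now errors out immediately if --daxheap is used while DAXHEAP_DIR is unset. The same change is applied to setup_local_multinode.sh / test_local_multinode.sh. The Makefile comment was also changed to a generic path.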

… C3 batch sizing
- C1: Add state validation before CAS in index_claim_entry and nrht_claim_entry; reject VALID/INSERTING states to prevent active entry hijack (index.c, nrht.c)
- C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls, matching other NRHT ioctl permission enforcement (file.c)
- C3: Size batch buffer as max of both request types to prevent heap overflow if MAX or struct sizes diverge (file.c)

…0 deleg CAS, H11 GC doc
- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety

…ortable flush, M16 daxheap path
- M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
- M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
- M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
- M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set

jooho-XCENA

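Review: 8 unresolved HIGH issues

After checking the existing reviews (moonchan-park, youngrok-XCENA) and the fix commits (eea4555, 99cc329, d0ef7e1, 21ccbb5, 8614242, etc.), I have left only the HIGH issues that are still unresolved and that no other reviewer has commented on.

Items confirmed as already resolved (comments omitted)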

✅ GC kthread_should_stop inter-phase → 21ccbb5
✅ i_size_write without i_rwsem → 99cc329
✅ mmap file ref restore → d0ef7e1
✅ alloc_lock CAS → d0ef7e1
✅ GRANTING→ACTIVE CAS → d0ef7e1
✅ DAXHEAP TOCTOU → d0ef7e1
✅ compat_ioctl → 8614242
✅ claim_entry guard → eea4555
✅ FIND_NAME perm → eea4555
✅ superblock CRC32 → 5b5dfb8
✅ fstab format option → 29243e0
✅ arch-portable flush → 21ccbb5

jooho-XCENA · 2026-04-10T07:09:35Z

+		break;
+
+	case MARUFS_IOC_NRHT_INIT:
+		ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id,


[HIGH] NRHT_INIT should require ADMIN permission

MARUFS_IOC_NRHT_INIT is a destructive operation: it runs memset(base, 0, total_needed) over the entire region and reformats it (nrht.c:523).

It currently requires only MARUFS_PERM_IOCTL, the same level as ordinary NRHT name lookup/store. Any process with IOCTL permission can format the entire NRHT.

// Current:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_IOCTL);
// Recommended:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_ADMIN);

Fixed in 27206dc. Changed MARUFS_PERM_IOCTL → MARUFS_PERM_ADMIN.

jooho-XCENA · 2026-04-10T07:09:35Z

+
+	dreq->fd = fd;
+	get_file(sbi->heap_dmabuf->file);
+	fd_install(fd, sbi->heap_dmabuf->file);


[HIGH] DMABUF export exposes the entire CXL device

sbi->heap_dmabuf represents the entire daxheap device. Exporting it here gives the user access to everything, including the superblock, RAT, global index, and other users' region data; this bypasses the ACL completely.

Recommended fix:

At minimum, require MARUFS_PERM_ADMIN permission, or

Create a per-region dma_buf slice that exports only this file's region rather than the whole device

Currently, READ+WRITE permission alone is enough to obtain an fd for the entire device, so single-file permission allows reading the data of every other file.

Fixed in 27206dc. Changed MARUFS_PERM_READ | MARUFS_PERM_WRITE → MARUFS_PERM_ADMIN. This ioctl is an interface that lets other nodes mount marufs using the same daxheap fd, so we do not plan to provide a per-region slice feature.

jooho-XCENA · 2026-04-10T07:09:35Z

+
+static struct kobj_attribute gc_pause_attr =
+	__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);
+


[HIGH] gc_pause is world-writable (0644)

With 0644, every user can pause/resume GC. When GC stops, dead regions and orphan entries accumulate, leading to memory exhaustion and denial of service.

In addition, gc_pause_store, gc_trigger_store, gc_stop_store, and gc_restart_store all lack a capable(CAP_SYS_ADMIN) check, so unprivileged users in container/user-namespace environments can manipulate them.

// Current:
__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);
// Recommended:
__ATTR(gc_pause, 0600, gc_pause_show, gc_pause_store);
// + add to the store functions:
if (!capable(CAP_SYS_ADMIN))
	return -EPERM;

Fixed in 27206dc. Changed the gc_pause permission 0644 → 0600 and added a capable(CAP_SYS_ADMIN) check to all four store functions: gc_trigger/gc_stop/gc_pause/gc_restart.

jooho-XCENA · 2026-04-10T07:09:35Z

+			buckets_per_shard = 1;
+	}
+	buckets_per_shard = roundup_pow_of_two(buckets_per_shard);
+


[HIGH] num_buckets upper bound not validated -- roundup_pow_of_two overflow

num_buckets is controlled by the user through the NRHT_INIT ioctl. If num_buckets is very large (e.g. 0xFFFFFFFF) and num_shards=1:

buckets_per_shard = 0xFFFFFFFF

roundup_pow_of_two(0xFFFFFFFF) → on 64-bit, 1UL << 32 = 0x100000000

bucket_array_size = 0x100000000 * 4 → u64 overflow or a miscomputed total_needed

memset(base, 0, (size_t)total_needed) can then write to CXL memory beyond the region boundary

// Add an upper-bound check before roundup_pow_of_two:
if (buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
	return -EINVAL;
buckets_per_shard = roundup_pow_of_two(buckets_per_shard);
if (buckets_per_shard == 0 || buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
	return -EINVAL;

Fixed in f74049b. Added MARUFS_NRHT_MAX_ENTRIES upper-bound checks before and after roundup_pow_of_two.
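
The bound-then-round pattern can be tried out in user space with the sketch below. It is a sketch under assumptions: `DEMO_MAX_ENTRIES` is a made-up cap (not the real MARUFS_NRHT_MAX_ENTRIES), `demo_roundup_pow2()` is a simple stand-in for the kernel's roundup_pow_of_two(), and deriving the per-shard count by plain division is a guess at the surrounding code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES (1u << 20) /* hypothetical cap */

/* Round v up to a power of two; callers guarantee 1 <= v <= 2^31. */
static uint32_t demo_roundup_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Validate user-controlled sizes before and after rounding. */
static int demo_buckets_per_shard(uint32_t num_buckets, uint32_t num_shards,
                                  uint32_t *out)
{
    uint32_t per_shard;

    if (num_shards == 0)
        return -EINVAL;
    per_shard = num_buckets / num_shards;
    if (per_shard == 0)
        per_shard = 1;
    if (per_shard > DEMO_MAX_ENTRIES)      /* reject before rounding */
        return -EINVAL;
    per_shard = demo_roundup_pow2(per_shard);
    if (per_shard == 0 || per_shard > DEMO_MAX_ENTRIES)
        return -EINVAL;                    /* and again after rounding */
    *out = per_shard;
    return 0;
}
```

With the pre-check in place, the value passed to the rounding helper is already bounded, so the later size arithmetic cannot wrap.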

jooho-XCENA · 2026-04-10T07:09:35Z

+		return ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		struct marufs_inode_info *xi = marufs_inode_get(inode);


[HIGH] setattr ignores changes other than ATTR_SIZE -- chmod/chown/utimes have no effect

After validating with setattr_prepare(), only ATTR_SIZE is handled before returning. ATTR_UID, ATTR_GID, ATTR_MODE, ATTR_ATIME, etc. pass validation but are never applied, so chmod/chown/utimes return success without making any actual change.

By VFS convention, once setattr_prepare() passes, either setattr_copy() should be called or unsupported attributes should be explicitly rejected with -EPERM.

if (attr->ia_valid & ATTR_SIZE) {
	// ... ftruncate handling ...
}
// Needs to be added:
setattr_copy(MARUFS_IDMAP_ARG_COMMA inode, attr);
// Or, if unsupported, reject explicitly:
if (attr->ia_valid & ~ATTR_SIZE)
	return -EPERM;

Fixed in f74049b. Requests to change attrs other than ATTR_SIZE | ATTR_FORCE are now explicitly rejected with -EPERM. The CXL FS does not persist POSIX attrs, so rejection was chosen instead of setattr_copy().
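
The rejection rule is a plain bitmask test, sketched here in user space. The `DEMO_ATTR_*` constants are stand-ins defined for this example; kernel code would use the ATTR_* flags from <linux/fs.h>.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in flag bits for illustration. */
#define DEMO_ATTR_MODE  (1u << 0)
#define DEMO_ATTR_UID   (1u << 1)
#define DEMO_ATTR_SIZE  (1u << 3)
#define DEMO_ATTR_FORCE (1u << 9)

/* Accept only size changes (optionally forced); reject everything else. */
static int demo_check_setattr(unsigned int ia_valid)
{
    if (ia_valid & ~(DEMO_ATTR_SIZE | DEMO_ATTR_FORCE))
        return -EPERM;
    return 0;
}
```

A request that mixes a supported and an unsupported bit, such as size plus uid, is rejected as a whole rather than partially applied.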

jooho-XCENA · 2026-04-10T07:09:35Z

+	buf->f_bsize = PAGE_SIZE;
+	buf->f_blocks = sbi->total_size / PAGE_SIZE;
+	buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
+	buf->f_bavail = buf->f_bfree;


[HIGH] f_bfree unsigned underflow -- abnormal free space reported

sbi->total_size and used_size are both u64, so if metadata corruption or concurrent allocation makes used_size > total_size, the unsigned subtraction wraps around to a very large value.

// Current:
buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
// Recommended:
if (used_size > sbi->total_size)
	buf->f_bfree = 0;
else
	buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
buf->f_bavail = buf->f_bfree;

Fixed in f74049b. Added a used_size > sbi->total_size guard to prevent the unsigned underflow.
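
To make the wraparound concrete, here is a small user-space sketch of the guarded computation. `DEMO_PAGE_SIZE` and `demo_free_blocks()` are names invented for this example, and 4096 is an assumed page size.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u /* assumed page size for the example */

/* Free blocks, clamped to zero when the accounting is inconsistent. */
static uint64_t demo_free_blocks(uint64_t total_size, uint64_t used_size)
{
    if (used_size > total_size)
        return 0;
    return (total_size - used_size) / DEMO_PAGE_SIZE;
}
```

Without the guard, total_size = 4096 and used_size = 8192 would give (2^64 - 4096) / 4096 = 2^52 - 1 blocks, which statfs callers would read as a huge amount of free space.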

jooho-XCENA · 2026-04-10T07:09:35Z

+	struct page *page = &folio->page;
+
+	zero_user_segments(page, 0, PAGE_SIZE, 0, 0);
+	SetPageUptodate(page);


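[HIGH] read_folio always returns a zero-filled page -- data loss on the page cache path

DAX mmap is the primary access path, but sendfile(), splice(), or a non-DAX fallback path calls read_folio through the page cache. Since it currently always returns a zero page, callers read zeros instead of the actual CXL memory data.

Recommended:

Copy the data from CXL memory: memcpy_from_dax(page, sbi->dax_base + phys_offset + page_offset, ...), or

Block page cache use entirely and serve only via read_iter (remove read_folio from address_space_operations and implement a suitable alternative)

Fixed in f74049b. Instead of copying CXL data into the page cache, we chose to block this path with -EIO.

Reasons:

A permission check cannot be inserted in the read_folio interface, so this path would bypass the marufs ACL model

Copies of CXL data left in the DRAM page cache would break cross-node consistency

The read() syscall is still supported normally: read_iter copies directly from CXL and performs the permission check

sendfile/splice are not used in the DAX FS's KV cache workload, so blocking them has no impact.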

jooho-XCENA · 2026-04-10T07:09:35Z

+	}
+
+	set_page_dirty(page);
+	return VM_FAULT_LOCKED;


[HIGH] set_page_dirty() -- removed in kernel 6.8+, build failure

set_page_dirty() was removed in kernel 6.8 (replaced by folio_mark_dirty()). This module's compat.h supports kernels up to 6.17, so the build fails on recent kernels.

// Current:
set_page_dirty(page);
// Recommended:
folio_mark_dirty(page_folio(page));
// Or add a shim to compat.h:
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 8, 0)
	folio_mark_dirty(page_folio(page));
#else
	set_page_dirty(page);
#endif

Fixed in f74049b. Added a marufs_set_page_dirty() inline function to compat.h (6.8+ → folio_mark_dirty(), earlier → set_page_dirty()). file.c now calls the compat function.

Linux kernel filesystem module for CXL shared memory, enabling cross-node file sharing via DAX-mapped CXL memory pools.

Core components:
- VFS layer: mount/umount, directory ops, inode lifecycle, mmap/ioctl
- CAS-based lock-free global index with sharded hash table (4 shards)
- Region Allocation Table (RAT): per-file metadata in 2KB CL-aligned entries
- NRHT (Name-Ref Hash Table): application-level name→(offset, region) mapping
- ACL: 3-stage permission model (owner → default_perms → delegation table)
- 4-phase background GC: dead process reap, stale index sweep, local tracker, NRHT

Includes architecture docs (6 detailed + 1 overview), test suite (ioctl, mmap, cross-process, chown race, multinode), and build/install scripts.

- Implement CRC32 over immutable GSB fields (magic → entries_per_shard)
- Compute + write checksum at format, validate at mount
- Remove active_nodes bitmask (to be replaced by per-node cacheline design)
- Adjust reserved padding 200 → 208 bytes to maintain 256B struct size

Persistent format option causes re-format on every reboot, destroying all data. CXL volatile memory bootstrap will be handled by a separate format_if_needed scheme (magic + CRC32 validation at mount time).

- file.c: wrap i_size_write with inode_lock/unlock in read path
- inode.c: write fresh RAT size to stat->size directly in getattr, avoiding i_size_write without i_rwsem entirely

All ioctl structs use fixed-width types (__u32/__u64/__s32) with identical 32/64-bit layout, so compat_ptr_ioctl suffices.

- Add per-shard CAS spinlock (shard_header->lock) to serialize bucket linking and post-insert dedup, eliminating TOCTOU race
- New TENTATIVE(2) entry state between INSERTING and VALID; VALID is now 3, TOMBSTONE is now 4
- Rewrite post_insert_dedup to walk chain directly with self-skip instead of using nrht_find_chain
- Move region_type=NRHT write after format to prevent false EEXIST on first nrht_init call
- Change double-init check from physical magic probe to RAT region_type check (immune to stale CXL data)
- Extract nrht_shard_lock/unlock inline helpers
- Update 4_arch_nrht.md: state diagram, transition table, insert flowchart, function summary for shard lock semantics

- Add TENTATIVE state to index.c insert protocol (4→9 step): INSERTING → TENTATIVE → lock → link + dedup → VALID → unlock
- Use DRAM spinlock (marufs_shard_cache.insert_lock) instead of CXL lock for node-local thread serialization; cross-node handled by token ring
- Move TOMBSTONE write from post_insert_dedup to caller for consistency with nrht.c pattern
- Add deleg_info sysfs attribute for per-region delegation inspection
- Fix stale enum comments in marufs_layout.h (VALID 2→3, TOMBSTONE 3→4)

- sysfs: add deleg_info read/write for per-region delegation inspection
- sysfs: gc_trigger now iterates ALL registered mounts (not just first sbi)
- tests: integrate test_mmap_notrunc, test_negative, test_nrht_race, test_gc_deleg, test_pid_reuse into test_local_multinode.sh (Sections 25-29)
- tests: show full failure output instead of head -5 truncation
- docs: update entry lifecycle and metadata layout for TENTATIVE state
- .gitignore: add new test binaries

- NRHT_INIT ioctl: PERM_IOCTL → PERM_ADMIN (prevents non-admin format)
- DMABUF export: PERM_READ|PERM_WRITE → PERM_ADMIN (whole-device exposure)
- sysfs gc_pause: 0644 → 0600 (root-only read/write)
- sysfs gc_trigger/gc_stop/gc_pause/gc_restart: add capable(CAP_SYS_ADMIN)

Adds marufs_kernel/docs/0_user_guide.md: a scenario-oriented walk-through covering admin multi-node setup, application region lifecycle, NRHT name-ref publishing, and the delegation-based security model. Complements the existing architecture docs, which focus on implementation internals.

…tegies

Replace per-shard spinlock-based insert serialization with cross-node Mutual Exclusion (ME) framework. Strategy Pattern exposes a common interface backed by two implementations:
- Order-driven (me_order.c): token ring circulating among ACTIVE nodes
- Request-driven (me_request.c): holder scans request slots, grants

Two ME domains:
- Global ME (S=1): serializes Index insert + RAT alloc
- NRHT ME (S=N_shard): per-shard token, opt-in membership via marufs_nrht_join() pre-warm (backup path lazy-init on first insert)

Adds unified poll kthread (me_poll_thread) iterating all registered ME instances (me_list / me_list_lock). marufs_nrht_init() now takes me_strategy parameter selecting the implementation per filesystem. sysfs exposes ME diagnostics; tests/monitor_me.sh + rewritten test_nrht_race.c exercise the new paths. bench_name_ref and setup/test_local_multinode.sh updated for the new membership model.

Replace shared-CB polling with per-(shard, node) doorbell slots to eliminate O(N) cache-line ping-pong on the hot token-pass path.

* Token pass: writer updates CB (holder, generation) then rings the target's slot (from_node, cb_gen_at_write, token_seq++). Reader polls its own slot and cross-checks CB on seq change.
* Heartbeat moved to per-node membership slot (distributed); only the cached successor of the current holder watches it.
* Magic tags on CB / membership / slot: writers verify the cached pointer still addresses the intended record type before mutating, and self-deactivate the instance on mismatch. Prevents stale-layout writes from corrupting a newly reformatted region after a peer nrht_init() raced this sbi's cached ME instance.
* Token-gated order_leave: acquire each shard then pass to leave_successor so the leaving node is sole writer of its own slot during handoff; clear membership status last.
* teardown_reformat detection: skip leave() when format_generation mismatches cached value (CXL area was reformatted under us).
* Skip self-pass in order_poll_cycle when next_active == self.
* wait_for_token: keep last_cb_gen across phantoms so the gen-monotonicity filter still rejects stale passes.
* /sys/fs/marufs/me_info exposes per-shard doorbell slot state for debugging (from_node, token_seq, cb_gen_at_write, request fields).
* Demote order_leave acquire-failed diagnostic to pr_debug.

Stress-tested via sweep bench (order + request, shards 2..64, on 2/4/8 mounts) after each change.

* New mount option `me_strategy=order|request` (default: request).
* `marufs_nrht_init_req.me_strategy` surfaced in the setup example.
* NRHT_JOIN ioctl documented as the optional pre-warm alternative to lazy ring join on the first NAME_OFFSET/FIND_NAME.

Adds per-ME-instance atomic64 counters for CXL RMB traffic (cb, slot, membership), ops->poll_cycle() invocations, and wall-clock ns spent in poll_cycle. Exposed via /sys/fs/marufs/me_poll_stats (aggregated across all mounted sbis); write any value to reset. test_nrht_race.c reads the counters via sysfs around each timed bench run (reset-then-read) and prints a poll-thread cost subtable in the sweep summary — including per-cycle rates (cb/c, slot/c, mem/c) so optimization deltas stay comparable across versions even when cycle count itself shifts.

Adds __le64 pending_shards_mask to the membership slot. Each node flips bit s when it raises a hand on shard s (request_acquire) and clears it after CS (request_clear_own), using a bounded CAS loop with WARN_ONCE fallback to guard against same-node races on different bits of the same word. request_poll_cycle fuses the former per-shard next_active() calls into a single membership pass that collects per-peer masks, OR-reduces them into peers_pending, and picks the round-robin successor. Holder side then gates request_scan_and_grant on (peers_pending & (1<<s)) — idle shards pay zero slot RMBs instead of the baseline N-1-per-shard full scan. Masked scan further filters nodes whose bit is clear. Release path keeps the full scan (primary grant path; staleness-driven skip would stall requests raised between poll and release). Per-cycle benchmark (S=64, N=8): rmb_slot ~33 → ~4 (-87%), ns_avg ~82k → ~60k (-27%). Low-S workloads pay a small membership-read tradeoff (O(N) pass vs. O(S*k) lazy) but NRHT production target is S >= N where the scan-skip wins dominate.
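
The mask handling described in this commit message can be approximated in user space as follows. This is a hedged sketch, not the marufs code: `demo_mask_update()`, `demo_peers_pending()`, and `DEMO_CAS_RETRIES` are hypothetical names, C11 atomics stand in for the CXL-side CAS, and the WARN_ONCE fallback is reduced to returning false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CAS_RETRIES 16 /* hypothetical retry bound */

/* Set or clear one shard bit with a bounded CAS loop. */
static bool demo_mask_update(_Atomic uint64_t *mask, unsigned shard, bool set)
{
    if (shard >= 64)
        return false;

    uint64_t bit = UINT64_C(1) << shard;
    uint64_t old = atomic_load(mask);

    for (int i = 0; i < DEMO_CAS_RETRIES; i++) {
        uint64_t desired = set ? (old | bit) : (old & ~bit);

        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_strong(mask, &old, desired))
            return true;
    }
    return false; /* the real code is described as warning once here */
}

/* OR-reduce the masks of all peers except ourselves. */
static uint64_t demo_peers_pending(_Atomic uint64_t *masks, unsigned n,
                                   unsigned self)
{
    uint64_t acc = 0;

    for (unsigned i = 0; i < n; i++)
        if (i != self)
            acc |= atomic_load(&masks[i]);
    return acc;
}
```

A holder would then scan the request slots of shard s only when `demo_peers_pending(...) & (UINT64_C(1) << s)` is nonzero, which is how idle shards avoid the full scan.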

Baseline's per-shard CB RMB in each node's poll_cycle created a textbook CXL anti-pattern: N hosts continuously polling the same cache line, loading the shared memory controller queue and reducing fabric bandwidth available to the data path. Doorbell's point was to push hot-polling off shared CLs onto single-reader ones.

Replace the CB read with a DRAM `is_holder[S]` boolean, flipped by:
- me_pass_token on our own write (outgoing transition)
- poll_cycle slot-doorbell detection on token_seq bump (incoming)
- wait_for_token / common_try_acquire on CB-read success
- common_join initial seed

Receiver-side detection now polls only our own per-(shard,node) slot: a single-reader CL per node, no multi-host contention. The bump is a sufficient "I became holder" signal because me_pass_token orders the CB WMB before the slot WMB.

Crash detection migrates out of the poll path into wait_for_token's timeout branch: after 5s without progress, check holder's heartbeat_ts; if stalled past MARUFS_ME_TIMEOUT_NS, self-takeover via me_pass_token(self, s, self). Idle shards with no acquirer require no proactive monitoring. Removes marufs_me_check_heartbeat plus the last_heartbeat / last_heartbeat_time DRAM arrays.

order_poll_cycle: drop ghost-alone CB probe; the acquire-timeout takeover covers the same scenario.

Per-cycle CB RMB at S=64 N=8 drops from ~33 to near-zero in steady state; slot-doorbell reads add S per cycle but land on single-reader CLs and don't contend across hosts.

Replace 8 parallel per-shard arrays in marufs_me_instance with a single struct marufs_me_shard *shards allocation. Simplifies alloc/free, keeps per-shard fields cache-adjacent, and removes scattered kcalloc/kfree bookkeeping. Fields consolidated: holding, local_waiters, local_lock, cached_successor, is_holder, poll_last_slot_seq, last_token_seq, last_cb_gen.

Replace repeated me->shards[shard_id].field accesses with a local struct marufs_me_shard *sh bound once per function/loop iteration. No behavior change. Side effect: common_join now hoists marufs_me_next_active() out of the per-shard loop since all shards get the same successor at join time — one scan instead of num_shards scans.

Replace the CB RMB in wait_for_token's fast path with a DRAM is_holder check. Saves one CXL CB round-trip per same-node burst acquire where the token is already held on entry.

Correctness:
- Every is_holder mutation goes through ME_BECOME_HOLDER / ME_LOSE_HOLDER, centralizing the smp_wmb rule. Reader pair ME_IS_HOLDER(sh) wraps smp_rmb + field read.
- me_pass_token self-pass seeds sh->last_cb_gen / last_token_seq / poll_last_slot_seq so the fast path can trust is_holder without re-reading CB, and poll_cycle treats the doorbell as a self-bump without re-flipping state.
- common_try_acquire and wait_for_token's loop-exit path pick up the previously-missing smp_wmb via the macro.
- poll_cycle bump handlers touch only poll_last_slot_seq; the app thread keeps exclusive ownership of last_token_seq / last_cb_gen so its bump-detection signal is never lost to a poll-thread race.
- Deadline-path takeover drops its redundant post-pass CB/slot RMB; me_pass_token self-pass already seeds the same baselines.
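The publish/observe pairing behind ME_BECOME_HOLDER / ME_LOSE_HOLDER / ME_IS_HOLDER can be approximated in userspace with C11 release/acquire atomics. This is a rough analog, not the kernel macros: the struct and function names below are hypothetical, and the kernel version uses smp_wmb / smp_rmb around a plain field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-shard DRAM state. */
struct shard_state {
    uint64_t last_cb_gen;     /* baseline seeded before the flag is published */
    uint64_t last_token_seq;  /* baseline seeded before the flag is published */
    atomic_bool is_holder;    /* fast-path flag read on acquire */
};

/* Writer side: seed the baselines, then publish the flag with release
 * ordering so a reader that sees is_holder == true also sees the baselines. */
static void become_holder(struct shard_state *sh, uint64_t gen, uint64_t seq)
{
    sh->last_cb_gen = gen;
    sh->last_token_seq = seq;
    atomic_store_explicit(&sh->is_holder, true, memory_order_release);
}

/* Writer side: drop the flag when the token is passed away. */
static void lose_holder(struct shard_state *sh)
{
    atomic_store_explicit(&sh->is_holder, false, memory_order_release);
}

/* Reader side: the acquire load pairs with the release stores above. */
static bool is_holder(struct shard_state *sh)
{
    return atomic_load_explicit(&sh->is_holder, memory_order_acquire);
}
```

Routing every flag mutation through two helpers keeps the ordering rule in one place, which is the same motivation the commit gives for the macros.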

…y_acquire
- add me_cb_snapshot() helper in me.h bundling RMB + holder/generation read; callers pass NULL for gen when only holder is needed.
- remove marufs_me_common_try_acquire and .try_acquire op (no users remain).
- remove order_leave_dump diagnostic (covered by existing poll tracing).
- move shard pointer binding closer to first use (C99 style).

Introduces persistent per-CPU instrumentation to diagnose acquire latency, poll-thread cost, and hash-chain quality without measurable hot-path overhead (~0.5% of ns_avg; log2 buckets, non-atomic updates).

New headers (kept separate from me.h/marufs.h to bound their surface):
- me_stats.h: struct marufs_me_stats_pcpu + helpers for wait_for_token (spin/sleep/deadline hit, wall+cpu ns, log2(ns) latency histogram), poll_cycle phase breakdown (membership/doorbell/scan), lock hold time, per-shard acquire distribution, request-mode grant age. cpu_ns is sampled via current->se.sum_exec_runtime (task_sched_runtime is unexported); the tick-granular delta is clamped to wall_ns at each sample so the aggregate cpu_util stays physically valid.
- nrht_stats.h: per-CPU bucket-chain walk count/steps histogram. Lives on sbi, handed to nrht_find_chain via a direct stats pointer on nrht_shard_ctx (no sbi back-ref).

Sysfs (all write-any-reset except chain/poll_thread cumulative):
- me_fine_stats: aggregate ME counters per instance
- me_per_shard_acquire: hotspot detection across shards
- me_poll_thread_cpu: cumulative sum_exec_runtime of poll kthread (diff-based utilization; cannot be reset)
- nrht_chain_depth: find_chain count/steps + depth histogram

Bench (test_nrht_race):
- Reset + read/diff across all four nodes per run.
- Per-run dump: wait hit split, cpu_util, grant count, chain depth, poll kthread util (divided by mount_count for per-thread avg).
- Sweep table gains a fine-grained section with wait_avg/cpu%/spin%/hold_avg/grant/chain/poll_cpu%.
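A log2-bucket latency histogram with a wall-time clamp, as described above, might look roughly like this. The bucket count, struct name, and helper names are assumptions for illustration; the real per-CPU struct is marufs_me_stats_pcpu and its layout is not shown in this PR text.

```c
#include <assert.h>
#include <stdint.h>

#define LAT_BUCKETS 32 /* hypothetical bucket count */

/* Hypothetical histogram: bucket[b] counts samples with floor(log2(ns)) == b. */
struct lat_hist {
    uint64_t bucket[LAT_BUCKETS];
};

/* floor(log2(ns)), with 0 and 1 both mapped to bucket 0 and large values
 * clamped into the last bucket. */
static unsigned log2_bucket(uint64_t ns)
{
    unsigned b = 0;

    while (ns > 1) {
        ns >>= 1;
        b++;
    }
    return b < LAT_BUCKETS ? b : LAT_BUCKETS - 1;
}

/* Non-atomic update: acceptable only when each CPU owns its own histogram. */
static void lat_hist_record(struct lat_hist *h, uint64_t ns)
{
    h->bucket[log2_bucket(ns)]++;
}

/* Clamp a tick-granular CPU-time delta so it never exceeds wall time. */
static uint64_t clamp_cpu_ns(uint64_t cpu_ns, uint64_t wall_ns)
{
    return cpu_ns < wall_ns ? cpu_ns : wall_ns;
}
```

Log2 buckets keep the record path to a shift loop and one increment, which is consistent with the low overhead the commit reports.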

wait_fast_hit: tracks ME_IS_HOLDER early returns in wait_for_token, i.e. the acquires that bypass all token-wait work. Combined with wait_count this exposes the intra-node holder-keep hit rate. Measurement on the bench showed request-mode sustains ~80% fast-hit across shards while order-mode collapses to <10% as shards grow, quantifying why order scales poorly.

Bench sweep table gains four columns:
- fast%: fast_hit / (wait_count + fast_hit)
- mem%: membership pass share of poll_cycle wall
- door%: per-shard slot doorbell RMB share
- scan%: grant/pass phase share (masked for request, baton for order)

The poll-phase split pinpoints where poll_cycle spends time under each strategy: request is membership+doorbell bound (scan <2%), order is scan-bound at high shard counts (>40%).
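The fast% column is a plain ratio; a minimal sketch of the computation follows. The function name is hypothetical, and returning 0 for an empty sample is an assumption about how the bench handles that case.

```c
#include <assert.h>
#include <stdint.h>

/* fast% = fast_hit / (wait_count + fast_hit), expressed as a percentage.
 * Returns 0.0 when no acquires were recorded (assumed convention). */
static double fast_hit_pct(uint64_t fast_hit, uint64_t wait_count)
{
    uint64_t total = fast_hit + wait_count;

    return total ? 100.0 * (double)fast_hit / (double)total : 0.0;
}
```

Note that wait_count is added to fast_hit in the denominator, so it is treated here as counting only the acquires that did not take the fast path.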

Replace cross-node ktime_get_ns() subtraction with counter-based probe in the acquire-deadline path. CXL peers don't share a monotonic clock zero point: per-node boot times differ, so `now - heartbeat_ts` produces meaningless elapsed values and can misclassify alive holders as crashed (or vice versa).

On deadline, me_handle_acquire_deadline:
1. Snapshot holder's heartbeat counter (hb_before).
2. Sleep MARUFS_ME_LIVENESS_PROBE_NS on local clock.
3. Resample counter + CB.
4. late grant on us → enter CS directly, skip takeover.
5. holder changed or hb advanced → -ETIMEDOUT (back off).
6. counter stuck and holder unchanged → self-takeover via me_pass_token.

Probe window: 100ms (10000× default poll interval). Conservative: the takeover path isn't hot and false-positive crash calls would race against a live holder on CB write.

heartbeat_ts kept as observability field only.
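Steps 4 to 6 of the probe reduce to a three-way classification over two samples. The sketch below captures only that decision; the enum, struct, and function names are hypothetical, and the sleep plus the CXL reads that produce the samples are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical verdicts matching the three post-deadline branches. */
enum probe_verdict {
    PROBE_ENTER_CS,  /* late grant landed on us */
    PROBE_BACK_OFF,  /* holder is alive or changed: caller returns -ETIMEDOUT */
    PROBE_TAKEOVER   /* counter stuck and holder unchanged */
};

/* One observation of the control block and the holder's heartbeat counter. */
struct probe_sample {
    uint32_t holder; /* node id currently recorded as holder */
    uint64_t hb;     /* that holder's heartbeat counter */
};

/* Compare the samples taken before and after the local probe sleep.
 * Only counters are compared, never timestamps from another node. */
static enum probe_verdict probe_classify(uint32_t self,
                                         struct probe_sample before,
                                         struct probe_sample after)
{
    if (after.holder == self)
        return PROBE_ENTER_CS;
    if (after.holder != before.holder || after.hb != before.hb)
        return PROBE_BACK_OFF;
    return PROBE_TAKEOVER;
}
```

Because the comparison uses a counter that only the holder advances, the result does not depend on the two nodes agreeing on a clock origin.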

Rewrite docs/7_arch_me.md around protocol mechanics:
- §1 Shared State Layout: access-pattern mermaid + CXL struct table.
- §2 Overview: 2.1 per-node lifecycle + 2.2 per-(shard, node) state machine (rename states to NONE / MEMBER / HOLDER_BUSY / HOLDER_IDLE; token ownership axis made explicit in labels).
- §3 Thread Interaction & Memory Access: merge thread-level sequence with memory-level byte evolution per mode.
  - 3.1 OD (token pass): alt branch for receiver with/without app waiter (Case A fast-path vs Case B poll-thread fallback consumer).
  - 3.2 RD (request+grant): alt branch for grant paths (release vs poll).
  - 3.3 Cacheline snapshot: before/after value tables, symbolic CL labels.
  - 3.4 Acquire Timeout: crash vs busy-holder disambiguation via counter-based liveness probe; three post-deadline branches (late grant on self / holder changed or alive / counter stuck → takeover).
- §4 Stats & Bench Integration: list current sysfs attrs, split poll-cost counters from per-CPU fine-grained stats, align with bench harness output columns.

Drop sections that duplicated code (per-shard DRAM struct, step-by-step prose flows; §3 diagrams cover them).

Break the 1266-line monolithic sysfs.c into focused units while moving manual GC control out of the production attribute surface.

- sysfs.c (1266 -> 265 lines): core: version/region_info/perm_info/daxheap_bufid + group/init
- sysfs_me.{c,h} (new): ME inspection + per-CPU stats (me_info, poll_stats, fine_stats, ...)
- sysfs_gc.{c,h} (new): GC monitoring (deleg_info, gc_status)
- sysfs_nrht.{c,h} (new): NRHT chain-depth histogram
- sysfs_internal.h (new): shared sbi_list/lock + get_sbi/find_by_node helpers
- sysfs_debug.{c,h} (extended): gc_trigger/stop/pause/restart relocated into debug subgroup alongside existing fault injection (me_freeze_heartbeat, me_sync_is_holder)

Helper naming in sysfs_me.c unified to verb-noun form:
- me_state_str -> me_state_name
- me_tag_for -> me_format_tag
- me_info_emit_one -> me_emit_instance
- me_stats_aggregate -> me_aggregate_stats
- me_fine_stats_emit_buckets -> me_emit_buckets

Tests updated to /sys/fs/marufs/debug/gc_* paths:
- test_local_multinode.sh (13 spots)
- test_chown_race.c, test_dupname.c, test_overlap.c, test_gc_deleg.c

Also includes pre-staged ME crash-detection scaffolding consumed by the debug subgroup: me.h / me_order.c / me_request.c expose debug_freeze_poll hooks; tests/test_me_crash.sh exercises freeze + sync recovery end-to-end.

…eaders

marufs.h was a 1266-line catch-all (sb_info + DAX/RAT/shard helpers + function decls for 10 modules). marufs_layout.h was a 625-line mix of on-disk structs and CXL/CAS primitives. Both now act as umbrella headers over focused per-domain files.

Phase 1: extract reusable primitives + per-module decl headers:
- marufs_endian.h: READ_LE/WRITE_LE/READ_CXL_LE, MARUFS_CXL_WMB/RMB, le16/32/64_cas, cas_inc/dec
- marufs_hash.h: shard_idx, bucket_idx, hash_name, make_ino, ino_to_region, align_up
- gc.h: orphan tracker types + gc.c entry points
- inode.h: marufs_inode_info struct + inode.c entry points + inode_ops externs
- acl.h, cache.h, dir.h, file.h, index.h, nrht.h, region.h, super.h: per-module function declarations

Phase 2: split on-disk structs by subsystem:
- marufs_superblock_layout.h: marufs_superblock + GSB_SIZE
- marufs_index_layout.h: shard_header, index_entry, region defaults, BUCKET_END, state enum
- marufs_rat_layout.h: rat, rat_entry, deleg_entry, RAT/deleg/region_type state enums, capacity
- marufs_nrht_layout.h: nrht_header, nrht_shard_header, nrht_entry, NRHT defaults

Umbrella files now hold only:
- marufs.h: sb_info, DAX/RAT/shard inline accessors, sysfs decls
- marufs_layout.h: magic enum, ME area sizes + me_area_size helper, layout offsets, compile-time size validators

Existing .c files keep including marufs.h alone; the umbrella pulls in all per-module headers, so include patterns are unchanged.

Phase 3 (sb_info field grouping into sub-structs) deferred: too invasive, low ROI.

Line counts:
- marufs.h: 1266 -> 435 (66% reduction)
- marufs_layout.h: 625 -> 130 (79% reduction)
- 16 new headers: ~970 lines (focused, dependency-minimal)

Build clean. Compile-time size validators still pass.

Adds ME crash-detection regression coverage to the local multinode suite by delegating to the standalone test_me_crash.sh (T1-T7). Previously the crash tests had to be run by hand after every kernel change: easy to forget, easy to silently regress.

Section 30 runs last because:
- manipulates dmesg ring (uses dmesg -C between sub-tests)
- T1 saturates CPU with stress-ng (soft dep; self-skips if absent)
- T5 needs ≥3 mounts (self-skips if /mnt/marufs3 absent)
- ~50s total runtime

Gate: requires test_me_crash.sh executable + writable /sys/fs/marufs/debug/me_freeze_heartbeat + root. Otherwise SKIP with a one-line reason. test_me_crash.sh's own `set -euo pipefail` + die behavior maps cleanly to run_test's pass/fail accounting (non-zero exit on first failed T → Section 30 fails).

NRHT --sweep benchmark intentionally NOT integrated here: it's a throughput measurement (no pass/fail), takes minutes, sensitive to machine load. A dedicated bench_nrht.sh is the right home for it.

Replaces mandatory node_id= mount option with bootstrap-elected slot assignment. Each mount CAS-claims a free slot in the on-disk bootstrap table (CLAIMED/FORMATTING states); first claimer formats the FS, rest attach. Stuck-formatter steal path covers crashed formatters.

Changes:
- bootstrap.c/h, marufs_bootstrap_layout.h: slot table + claim/steal
- super.c: bootstrap-elected node_id; legacy explicit node_id= still supported via mount option
- sysfs_debug: bootstrap_dump shows per-mount slot ownership (<mine>)
- setup_local_multinode.sh: --legacy flag for old explicit-node_id style; default is auto-mount via bootstrap
- test_bootstrap_chaos.sh: T1 stuck-formatter recovery, T2 concurrent mount race, T3 slot reuse sanity
- test_local_multinode.sh: Section 31 auto-mount slot table checks, Section 32 delegates to chaos with auto-teardown
- test_me_crash.sh: trim T1 stress 8s->4s, T2 iters 3->2, T3 busy 7s->6s
- gc/file/nrht/sysfs minor adjustments for bootstrap integration
- dax_zero.c: helper to wipe DAX device for chaos preconditions
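The CAS-claim step can be sketched with C11 atomics as below. This is a simplified model under assumptions: the `SLOT_FREE` value, the numeric state encoding, and using the claimed index directly as the node id are hypothetical, and the format election and stuck-formatter steal path are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state encoding; only the CLAIMED and FORMATTING names
 * appear in the PR text. */
enum {
    SLOT_FREE = 0,
    SLOT_CLAIMED = 1,
    SLOT_FORMATTING = 2
};

/* Walk the slot table and CAS the first free slot to CLAIMED.
 * Returns the claimed index, or -1 when every slot is taken. */
static int claim_free_slot(_Atomic uint32_t *slots, int nslots)
{
    for (int i = 0; i < nslots; i++) {
        uint32_t expected = SLOT_FREE;

        /* Succeeds for exactly one claimer per slot, even under races. */
        if (atomic_compare_exchange_strong(&slots[i], &expected, SLOT_CLAIMED))
            return i;
    }
    return -1;
}
```

Since the compare-exchange either installs CLAIMED or fails, two concurrent mounts cannot both win the same slot, which is what removes the need for a manually assigned node_id.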

Without this, request_poll_cycle's stale poll_last_slot_seq baseline re-triggers ME_BECOME_HOLDER after the token has already been passed, leaking is_holder=true cross-handoff. The next acquire's wait_for_token fast path (ME_IS_HOLDER) then enters CS while CB holds a different node — two-holder race observed under concurrent counter-RMW stress.

Add user-managed ref_count and pin_count to each NRHT entry, plus four new ioctls (REF_INC, REF_DEC, PIN_INC, PIN_DEC) for caller-driven RMW under NRHT shard ME. dec-from-zero rejects with -EINVAL, inc-from-UINT32_MAX with -EOVERFLOW.

Layout: counters consume two __le32 in the existing CL0 reserved space (offsets 40-47); 128B entry size unchanged.

Tests:
- test_ioctl.c §3.5 covers single-process semantics (initial value, bounded overflow/underflow, ENOENT on missing entry).
- test_nrht_race.c Test4 stresses balanced concurrent inc/dec across 8 workers and asserts final == 0 + zero ioctl errors. Worker logs first failed op to stderr; harness aborts on first round failure for clean dmesg capture.
- run_bench bundles ref/pin INC/DEC into the per-iter timed loop on the iter's own entry so all 7 ops share a shard, scaling with cfg->num_shards. Sweep summary gains a counter ops section.
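The bounded counter semantics and the offset claim can be illustrated as follows. The struct is a hypothetical stand-in for the 128B entry: only the counter offsets (40-47) and the total size come from the commit text, while the order of the two counters and the surrounding padding fields are assumptions. The helpers assume the caller already holds the shard lock, as the commit describes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout: opaque padding around the two counters. */
struct demo_nrht_entry {
    uint8_t head[40];   /* fields not described in the PR text */
    uint32_t ref_count; /* assumed to sit at offset 40 */
    uint32_t pin_count; /* assumed to sit at offset 44 */
    uint8_t tail[80];   /* remainder of the 128B entry */
};

_Static_assert(offsetof(struct demo_nrht_entry, ref_count) == 40, "ref offset");
_Static_assert(offsetof(struct demo_nrht_entry, pin_count) == 44, "pin offset");
_Static_assert(sizeof(struct demo_nrht_entry) == 128, "entry stays 128B");

/* Increment, rejecting wrap-around from UINT32_MAX. */
static int counter_inc(uint32_t *c)
{
    if (*c == UINT32_MAX)
        return -EOVERFLOW;
    (*c)++;
    return 0;
}

/* Decrement, rejecting underflow from zero. */
static int counter_dec(uint32_t *c)
{
    if (*c == 0)
        return -EINVAL;
    (*c)--;
    return 0;
}
```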

Cover the four NRHT_REF/PIN_INC/DEC ioctls with usage examples and the overflow/underflow semantics. Note that FIND_NAME returns the counters alongside the offset.

bootstrap_dump_slots() used PAGE_SIZE as its scnprintf bound while sysfs_debug's bootstrap_dump_show() called it with `buf + n` after writing a per-mount header. Each scnprintf could thus write up to PAGE_SIZE bytes past the caller's offset, overrunning the sysfs page into adjacent slab objects. Symptom: GPF in fdget/filp_flush after reading bootstrap_dump, with non-canonical addresses decoding to ASCII fragments emitted by this helper ("=CLAIMED", " node_id", "slot[N] stat..."). Add a bufsize parameter and pass PAGE_SIZE - n from the show callback; guard the loop against n >= bufsize.
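The fix pattern, passing the remaining space rather than the full page size, can be demonstrated in userspace. The sketch below emulates scnprintf with vsnprintf (scnprintf itself is kernel-only); the helper names and the exact line format are hypothetical, and a 64-byte buffer stands in for the sysfs page.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Userspace emulation of scnprintf: returns the number of characters
 * actually stored, excluding the terminating NUL, never more than size - 1. */
static size_t bounded_append(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int r;

    if (size == 0)
        return 0;
    va_start(ap, fmt);
    r = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    if (r < 0)
        return 0;
    return (size_t)r < size ? (size_t)r : size - 1;
}

/* Emit one line per slot, bounded by the space the caller really has left.
 * The loop guard mirrors the n >= bufsize check added by the fix. */
static size_t dump_slots(char *buf, size_t bufsize, int nslots)
{
    size_t n = 0;

    for (int i = 0; i < nslots && n < bufsize; i++)
        n += bounded_append(buf + n, bufsize - n, "slot[%d] state=CLAIMED\n", i);
    return n;
}
```

A caller that has already written a header of n bytes would invoke `dump_slots(page + n, sizeof(page) - n, nslots)`, so the helper's bound always matches the space actually left after the header.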

Decompose me.h (~700 LOC) into three focused headers:
- me.h: public API + DRAM types only
- me_inline.h: inline helpers needing instance struct visibility
- me_layout.h: on-disk CXL layout (header/CB/membership/slot)

Move cold-path helpers (me_leave_successor, me_membership_tick_heartbeat) out of inline header into me.c. Consolidate per-shard arrays into struct marufs_me_shard and DRAM is_holder fast path. Wire callers (bootstrap, me_order, me_request, nrht, sysfs) to new layout.

Split marufs_check_permission into two layers:
- marufs_check_permission_any(candidate, *out_granted): returns the granted subset of candidate bits, letting callers branch on which rights matched. Replaces ADMIN-then-GRANT two-call patterns.
- marufs_check_permission: thin AND-semantics wrapper.

Inline deleg matching into _any (drops marufs_deleg_matches) and bound the loop by deleg_num_entries instead of MAX_ENTRIES.

Centralize ioctl perm precheck via marufs_ioctl_required_perm(cmd) table at dispatcher entry, removing per-case marufs_check_permission calls from NAME_OFFSET / BATCH_* / FIND_NAME / CLEAR_NAME / NRHT_INIT and from DMABUF_EXPORT / CHOWN. PERM_GRANT keeps self-check inside its ME critical section, now using _any to evaluate ADMIN|GRANT in one call.

Move nrht_refcnt_op_t typedef from file.c to nrht.h. Extend test_nrht_race with run_test5 (new race scenario) and tighten test3/test4 coverage.
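The two-layer split can be modeled on plain bitmasks. This is a deliberately reduced sketch: the real functions take filesystem context and walk owner, default, and delegation entries, whereas here the caller's effective rights are passed in as a single `held` mask. The bit values and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits for illustration only. */
#define PERM_READ  0x1u
#define PERM_WRITE 0x2u
#define PERM_ADMIN 0x4u
#define PERM_GRANT 0x8u

/* OR semantics: report which of the candidate bits are held.
 * Returns true if at least one candidate bit matched. */
static bool check_permission_any(uint32_t held, uint32_t candidate,
                                 uint32_t *out_granted)
{
    uint32_t granted = held & candidate;

    if (out_granted)
        *out_granted = granted;
    return granted != 0;
}

/* AND semantics: thin wrapper requiring every requested bit. */
static bool check_permission(uint32_t held, uint32_t required)
{
    uint32_t granted = 0;

    check_permission_any(held, required, &granted);
    return granted == required;
}
```

A PERM_GRANT-style caller can pass `PERM_ADMIN | PERM_GRANT` once and then branch on the returned subset instead of issuing two separate checks.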

Concurrent CHOWN race: precheck ran before me->acquire(), so two callers with default_perms ADMIN could both pass and serialize on the lock. The first chown stripped ADMIN (default_perms=0, deleg cleared), but the second never re-checked and still won, letting ownership transfer twice.

Fix:
- Add marufs_check_permission(ADMIN) inside __marufs_ioctl_chown_locked before the ALLOCATED→ALLOCATING CAS.
- Drop CHOWN from the lock-free precheck table (handler self-checks), matching the PERM_GRANT pattern.
- Same in-lock recheck added to perm_set_default for symmetry: a caller that relied on default_perms ADMIN can be demoted by a concurrent perm_set_default/chown writing default_perms=0.

Add per-sbi vm_ops wrapper that copies underlying device_dax ops and overrides .open/.close/.mprotect to enforce RAT delegation on mprotect. mmap-time RAT check is no longer the sole gate.

Wrapper details:
- sbi-embedded vm_ops, lazy-seeded at first mmap under vm_ops_lock
- xi pointer stashed in vma->vm_private_data; igrab on attach, iput in .close, igrab on .open for vma split/clone refcount balance
- container_of(vma->vm_ops, sbi, vm_ops) recovers sbi at hook time (vma->vm_file = dax_filp after device_dax delegation)

Hardening flags applied to every marufs vma:
- VM_DONTCOPY: fork() drops the mapping; child re-mmap forces RAT recheck
- VM_DONTEXPAND: mremap() cannot grow past original mmap size
- VM_DONTDUMP: KV-cache contents excluded from coredumps

Lock split: revert the prior sb_lock merge that caused soft lockups when me_poll_thread held the unified lock for full poll cycles.
- me_list_lock: poll thread + register/unregister
- nrht_me_lock: nrht_me[] creation
- vm_ops_lock: lazy seed (and future hot-path use)

Remove daxheap support entirely:
- Drop CONFIG_DAXHEAP and DAXHEAP_DIR from Makefile / install.sh
- Remove daxheap= and daxheap_import_id= mount options
- Drop MARUFS_IOC_DMABUF_EXPORT ioctl and dmabuf_req struct
- Remove enum marufs_dax_mode (DEV_DAX is the only mode)
- Drop sbi->heap_dmabuf, marufs_dax_acquire_daxheap, /sys/fs/marufs/daxheap_bufid

Tests (tests/test_mmap.c):
- run_vm_protect: mprotect basics, RDONLY-fd escalation block, VM_DONTCOPY fork SIGSEGV, VM_DONTEXPAND mremap reject (with and without MAYMOVE), partial mprotect vma split, mremap MOVE-only success, 200-iter split+merge stress for igrab balance
- run_vm_protect_cross: cross-node escalation block; owner grants READ-only to peer, peer mprotect(PROT_RW) rejected by RAT WRITE check; after additional WRITE grant, mprotect succeeds

Add two-layer defense against post-exec fd reuse / hostile re-execve:
1. RAT exe_inode binding (acl.c + region.c): owner check now compares current task's exe inode/dev against owner_exe_inode_ino/dev stored in the RAT entry at create time. Catches execve into different binary.
2. FD_CLOEXEC enforcement at data access (file.c): mmap/read/ioctl reject with -EACCES when the calling fd is not close_on_exec. Catches same-binary re-execve (hostile argv) which exe_inode binding alone cannot detect.

Cannot enforce O_CLOEXEC at .open: VFS strips O_CLOEXEC from f_flags (it's stored in fdtable.close_on_exec) and the fd is not yet installed when ->open runs. Check moves to mmap/read/ioctl entry where fdtable lookup is possible.

Tests:
- test_postexec_attack: integrated into test_local_multinode.sh as Section 32 (Bootstrap Chaos shifts to Section 33). Two modes: no cloexec (parent mmap blocked) and --cloexec (execve closes fd).
- test_negative: new Section 0 verifying mmap without FD_CLOEXEC returns EACCES.
- 13 existing test sources updated to pass O_CLOEXEC on open().

Docs: 0_user_guide.md gains an O_CLOEXEC requirement bullet and a Security section paragraph explaining the fd-level check.
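From the application side, satisfying the FD_CLOEXEC requirement means opening with O_CLOEXEC (or setting the flag with fcntl) before any mmap/read/ioctl. The userspace sketch below shows how a caller could open that way and verify the flag; the helper names are hypothetical and no marufs-specific call is made.

```c
#define _POSIX_C_SOURCE 200809L /* exposes O_CLOEXEC under strict -std=c11 */

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open read-only with close-on-exec set atomically at open time. */
static int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Returns 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error.
 * The flag is per file descriptor, so it is read with F_GETFD,
 * not F_GETFL. */
static int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Note that dup() returns a descriptor with FD_CLOEXEC cleared, so a duplicated fd would need the flag set again (for example with `fcntl(fd, F_SETFD, FD_CLOEXEC)`) before it could pass a check of this kind.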

moonchan-park commented Apr 10, 2026

moonchan-park requested a review from a team April 10, 2026 02:56

youngrok-XCENA reviewed Apr 10, 2026

moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from a4b2829 to cd1e9ed Compare April 10, 2026 05:31

jooho-XCENA requested changes Apr 10, 2026

moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from f74049b to d2be093 Compare April 14, 2026 03:13

moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from 050e08b to 528e090 Compare April 17, 2026 07:06

moonchan-park added 13 commits April 22, 2026 10:42

fix: remove format mount option from fstab/systemd autoload

cfcc845

Persistent format option causes re-format on every reboot, destroying all data. CXL volatile memory bootstrap will be handled by a separate format_if_needed scheme (magic + CRC32 validation at mount time).

fix: i_size_write without i_rwsem in read_iter and getattr

b20c9e9

- file.c: wrap i_size_write with inode_lock/unlock in read path
- inode.c: write fresh RAT size to stat->size directly in getattr, avoiding i_size_write without i_rwsem entirely

fix: add compat_ioctl for 32-bit userspace support

2b724f4

All ioctl structs use fixed-width types (__u32/__u64/__s32) with identical 32/64-bit layout, so compat_ptr_ioctl suffices.
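The layout claim can be checked at compile time. The struct below is a made-up example, not one of the marufs ioctl structs; it only illustrates the pattern of fixed-width fields with the 64-bit member naturally aligned, so that no padding differs between 32-bit and 64-bit ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request struct using only fixed-width fields. Placing the
 * 64-bit member first keeps it 8-byte aligned even on ABIs (such as i386)
 * where uint64_t has only 4-byte alignment inside structs. */
struct demo_ioc_req {
    uint64_t offset;   /* 8 bytes at offset 0 */
    uint32_t name_len; /* 4 bytes at offset 8 */
    int32_t status;    /* 4 bytes at offset 12 */
};

/* If these hold on every target ABI, a 32-bit caller and a 64-bit kernel
 * see the same bytes and no translation layer is needed. */
_Static_assert(sizeof(struct demo_ioc_req) == 16, "no ABI-dependent padding");
_Static_assert(offsetof(struct demo_ioc_req, name_len) == 8, "name_len offset");
_Static_assert(offsetof(struct demo_ioc_req, status) == 12, "status offset");
```

A struct containing a pointer or a `long` would not pass this kind of check across ABIs, which is the case where a hand-written compat handler becomes necessary.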

docs: add architecture documentation links to README

defc2ca

moonchan-park added 29 commits April 22, 2026 01:43

refactor(layout): add uuid field to superblock with proper alignment

4b2123c

docs(user_guide): add §4.3 ref/pin counter ioctls

29fef35

Cover the four NRHT_REF/PIN_INC/DEC ioctls with usage examples and the overflow/underflow semantics. Note that FIND_NAME returns the counters alongside the offset.

moonchan-park closed this May 18, 2026


		static struct kobj_attribute gc_pause_attr =
		__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);

Conversation

moonchan-park commented Apr 10, 2026 (edited)

Summary

Structure

Test plan

github-actions Bot commented Apr 10, 2026 (edited)

moonchan-park left a comment

PR #41 Review Summary

Why is this PR being submitted?

Design Overview

CXL Memory Layout

Core Data Flow

Review Results Summary

youngrok-XCENA left a comment

Review Summary

Purpose of This PR

Architecture Design

Key Design Characteristics

Key Findings

jooho-XCENA left a comment

Review: 8 unresolved HIGH issues