feat: add marufs kernel module — CXL shared-memory filesystem#41
moonchan-park wants to merge 45 commits into
moonchan-park left a comment
PR #41 review summary
Why is this PR being submitted?
The Maru project currently supports only TCP/RPC-based userspace memory management. A kernel module is needed for multiple nodes to access CXL shared memory directly through a filesystem interface; without this PR, cross-node file sharing requires a network round trip. The marufs kernel module implements a lock-free filesystem on top of a DAX-mapped CXL memory pool, enabling zero-copy data sharing between nodes through the standard VFS interface (open/mmap/read/write).
Design overview
graph TB
subgraph UserSpace["Userspace"]
app["Application"]
test["Test Suite"]
end
subgraph KernelMod["marufs Kernel Module"]
subgraph VFS["VFS Layer"]
super["super.c -- mount, umount, format"]
dir["dir.c -- readdir, lookup, create, unlink"]
inode["inode.c -- iget, new_inode, evict"]
file["file.c -- mmap, ftruncate, ioctl"]
end
subgraph DataLayer["Data Layer"]
idx["index.c -- 4-shard CAS hash index"]
region["region.c -- RAT allocator, 2KB entries"]
nrht["nrht.c -- Name-Ref Hash Table"]
end
subgraph SecLayer["Security Layer"]
acl["acl.c -- 3-stage ACL check"]
end
subgraph MaintLayer["Maintenance"]
gc["gc.c -- 4-phase background GC"]
sysfs_mod["sysfs.c -- stats export"]
end
end
subgraph HW["Hardware"]
cxl["CXL Shared Memory / DAX Device"]
end
app -->|"open, read, mmap, ioctl"| file
test -->|"ioctl, mmap"| file
super --> dir
super --> inode
super --> file
dir --> idx
file --> idx
file --> region
file --> acl
file --> nrht
inode --> idx
gc -.->|"sweep"| idx
gc -.->|"reclaim"| region
gc -.->|"stale cleanup"| nrht
idx --> cxl
region --> cxl
nrht --> cxl
CXL memory layout
block-beta
columns 4
sb["Superblock 4KB"]:1
shards["Global Index Shards x4"]:1
rat["RAT Header + 256 Entries"]:1
data["Region Data"]:1
Key data flows
| Path | Flow |
|---|---|
| File creation | create -> index claim EMPTY -> RAT alloc -> link and publish VALID |
| File read | lookup -> index hash search -> RAT entry -> DAX direct read |
| mmap | open -> ftruncate(region alloc) -> mmap -> DAX fault handler |
| GC | Phase1: dead process reclaim -> Phase2: stale INSERTING sweep -> Phase3: orphan sweep -> Phase4: NRHT stale sweep |
Review result summary
Overall, the lock-free CAS design, the state machines, and the memory-barrier handling are well structured. The kernel module code quality is high, but the issues below should be reviewed before merging.
| Severity | Count | Main issues |
|---|---|---|
| Critical | 3 | claim_entry state not validated, FIND_NAME permission check missing, batch buffer sizing |
| High | 6 | i_size_write without holding the lock, mmap file reference leak, force-unlock not using CAS, checksum not implemented, d_revalidate performance, compat_ioctl missing |
| Medium | 3 | GC thread stop check, sysfs buffer overflow, CXL 2.0 barriers |
| Minor | 2 | hardcoded user path, missing test binaries |
youngrok-XCENA left a comment
Review summary
Purpose of this PR
Adds a Linux kernel filesystem module (marufs) that lets multiple nodes access and manage CXL (Compute Express Link) shared memory at file granularity.
Problem if not applied: the CXL memory pool cannot be handled through a POSIX file interface, so applications must mmap the DAX device directly and implement multi-node metadata synchronization themselves.
Architecture design
graph TB
subgraph VFS["VFS Layer"]
super["super.c: mount, umount, mkfs"]
dir["dir.c: readdir, lookup, create"]
inode_m["inode.c: iget, evict, getattr"]
file_m["file.c: mmap, ftruncate, ioctl"]
end
subgraph CXL["CXL Shared Data Layer"]
index_m["index.c: 4-shard lock-free hash"]
region["region.c: RAT region allocator"]
nrht["nrht.c: Name-Ref Hash Table"]
end
subgraph SEC["Security"]
acl["acl.c: 3-stage ACL with delegation"]
end
subgraph BG["Background"]
gc["gc.c: 4-phase GC sweep"]
sysfs_mod["sysfs.c: stats and tunables"]
end
super --> dir
super --> inode_m
super --> file_m
dir --> index_m
file_m --> index_m
file_m --> region
file_m --> acl
file_m --> nrht
inode_m --> index_m
gc -.-> index_m
gc -.-> region
gc -.-> nrht
Key design characteristics
| Item | Description |
|---|---|
| Lock-free concurrency | CAS-based 4-shard hash index, NRHT, delegation ACL |
| WORM semantics | Region is allocated only once via ftruncate; reallocation is not possible |
| 4-phase GC | orphan detection - timeout wait - CAS reclaim - region sweep |
| DAX direct mapping | Bypasses the page cache; CXL memory is accessed directly via mmap |
Key findings
| Severity | Count | Representative issues |
|---|---|---|
| Critical | 2 | Data loss on reboot (fstab format), NRHT entry CAS hijack |
| High | 4 | DAXHEAP TOCTOU, non-atomic delegation transition, GC race, unprotected i_size write |
| Medium | 3 | Wrong errno, alloc_lock timeout unreachable, x86-only cache flush |
| Low | 1 | Hardcoded developer path |
See the inline comments for details.
| cat >> /etc/fstab << FSTAB | ||
|
|
||
| # MARUFS CXL filesystem (auto-generated by setup-autoload.sh) | ||
| none ${MOUNT_POINT} ${MODULE_NAME} daxdev=${DAX_DEVICE},node_id=${NODE_ID},format,nofail 0 0 |
[critical] Filesystem is formatted on every reboot -- data loss
The format mount option is written permanently to fstab. The filesystem is reinitialized on every reboot and all data is lost. The systemd unit at L243 has the same problem.
The format option must be removed after the initial format. I recommend excluding format by default when registering the fstab/systemd entries and performing a format only through a separate --format flag.
Fixed in 29243e0
Removed the format option from both fstab and systemd.
CXL is volatile memory, so a format is indeed needed somewhere on every boot, but hardcoding it in the autoload script is dangerous. A separate format_if_needed mount option is planned:
- Validate magic + CRC32 at mount time
- If valid, use the existing metadata; if invalid, perform an idempotent format
- Works safely in multi-node environments as well, based on the CRC commit point
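The planned format_if_needed behavior can be sketched in user-space C. This is only an illustration under assumptions: the struct layout, the magic value, and the names `sb_sketch`, `sb_format`, and `sb_mount_or_format` are hypothetical and not taken from the marufs sources, and the CRC32 is a plain bitwise implementation rather than the kernel's helpers. Cross-node races during format are not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SB_MAGIC 0x4D415255u /* hypothetical magic value */

/* Hypothetical superblock: immutable fields first, checksum last. */
struct sb_sketch {
    uint32_t magic;
    uint32_t version;
    uint64_t total_size;
    uint32_t crc; /* CRC32 over all bytes before this field */
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_sketch(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t sb_crc(const struct sb_sketch *sb)
{
    return crc32_sketch(sb, offsetof(struct sb_sketch, crc));
}

static int sb_is_valid(const struct sb_sketch *sb)
{
    return sb->magic == SB_MAGIC && sb->crc == sb_crc(sb);
}

/* Write all fields first, then the checksum as the commit point. */
static void sb_format(struct sb_sketch *sb, uint64_t total_size)
{
    memset(sb, 0, sizeof(*sb));
    sb->magic = SB_MAGIC;
    sb->version = 1;
    sb->total_size = total_size;
    sb->crc = sb_crc(sb);
}

/* Returns 1 if a format was performed, 0 if existing metadata was reused. */
static int sb_mount_or_format(struct sb_sketch *sb, uint64_t total_size)
{
    if (sb_is_valid(sb))
        return 0;
    sb_format(sb, total_size);
    return 1;
}
```

Calling `sb_mount_or_format` twice on the same memory formats only the first time, which is the idempotence the reply describes.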
| struct marufs_nrht_entry *e) | ||
| { | ||
| u32 st = READ_LE32(e->state); | ||
| if (marufs_le32_cas(&e->state, st, MARUFS_ENTRY_INSERTING) != st) |
[critical] CAS succeeds from any state, including INSERTING -- entry hijack possible
If st is already MARUFS_ENTRY_INSERTING, the CAS succeeds as INSERTING->INSERTING and overwrites the created_at and inserter_node of an entry that another node is in the middle of inserting. This breaks the original inserter's work.
Check st == MARUFS_ENTRY_EMPTY || st == MARUFS_ENTRY_TOMBSTONE before the CAS.
Fixed in eea4555. Added st == EMPTY || st == TOMBSTONE validation before CAS in both index_claim_entry and nrht_claim_entry.
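A minimal user-space sketch of the guarded claim, using C11 atomics in place of the module's `marufs_le32_cas`. The enum values, `entry_sketch`, and `claim_entry_sketch` are hypothetical names chosen for illustration; the real layout lives in the marufs headers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state values for illustration only. */
enum entry_state {
    ENTRY_EMPTY = 0,
    ENTRY_INSERTING = 1,
    ENTRY_VALID = 2,
    ENTRY_TOMBSTONE = 3
};

struct entry_sketch {
    _Atomic uint32_t state;
    uint32_t inserter_node;
};

/* Claim a slot only if it is free (EMPTY or TOMBSTONE). */
static bool claim_entry_sketch(struct entry_sketch *e, uint32_t node_id)
{
    uint32_t st = atomic_load(&e->state);

    if (st != ENTRY_EMPTY && st != ENTRY_TOMBSTONE)
        return false; /* VALID or INSERTING: owned by someone else */

    if (!atomic_compare_exchange_strong(&e->state, &st, ENTRY_INSERTING))
        return false; /* lost the race to another claimer */

    e->inserter_node = node_id; /* safe: this caller owns the slot now */
    return true;
}
```

Without the state check, a second caller that reads INSERTING would "win" an INSERTING-to-INSERTING exchange and overwrite the first inserter's fields.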
| marufs_daxheap_bufid); | ||
| return -EEXIST; | ||
| } | ||
| mutex_unlock(&marufs_daxheap_lock); |
[high] TOCTOU race condition in DAXHEAP primary allocation
marufs_daxheap_bufid == 0 is checked at L452, the lock is released at L459, and daxheap_kern_alloc() is then called without the lock at L461. If two threads attempt a primary mount at the same time, the buffer is allocated twice.
Suggestion: hold the lock until the alloc completes, or swap in bufid with a CAS after allocation.
Fixed in d0ef7e1
The mutex is now held from the check through the alloc and the bufid write. Added mutex_unlock on every error path.
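The locking pattern described in the fix can be sketched in user space with a pthread mutex. `fake_alloc`, `heap_bufid`, and `primary_alloc_sketch` are hypothetical stand-ins; the kernel code uses `mutex_lock`/`mutex_unlock` and `daxheap_kern_alloc()` instead.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static int heap_bufid;  /* 0 means "not allocated yet" */
static int alloc_calls; /* counts how often the allocator ran */

/* Stand-in allocator: returns a positive buffer id or a negative errno. */
static int fake_alloc(void)
{
    return ++alloc_calls;
}

/* Check, allocate, and record the id under one lock hold. */
static int primary_alloc_sketch(void)
{
    int id;

    pthread_mutex_lock(&heap_lock);
    if (heap_bufid != 0) {
        pthread_mutex_unlock(&heap_lock);
        return -EEXIST;
    }
    id = fake_alloc();
    if (id < 0) {
        pthread_mutex_unlock(&heap_lock); /* every error path unlocks */
        return id;
    }
    heap_bufid = id;
    pthread_mutex_unlock(&heap_lock);
    return 0;
}
```

Because the check and the write to `heap_bufid` happen inside the same critical section, a second caller can only ever observe the recorded id and returns `-EEXIST`.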
| sizeof(*de)); /* Ensure all fields visible before state transition */ | ||
|
|
||
| /* Publish: GRANTING → ACTIVE (now safe for readers) */ | ||
| WRITE_LE32(de->state, MARUFS_DELEG_ACTIVE); |
[high] GRANTING -> ACTIVE state transition is a plain WRITE, not a CAS
Another node (e.g. GC) may already have changed the state while it was GRANTING, but the code overwrites it without checking.
Change it to marufs_le32_cas(&de->state, MARUFS_DELEG_GRANTING, MARUFS_DELEG_ACTIVE) so the transition happens only from the expected state.
Fixed in d0ef7e1
WRITE_LE32(de->state, ACTIVE) → marufs_le32_cas(&de->state, GRANTING, ACTIVE). On CAS failure (GC has already moved the entry to EMPTY), -EAGAIN is returned so the caller retries.
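The publish step can be illustrated with C11 atomics. The state values and `deleg_publish_sketch` are hypothetical; the module itself uses `marufs_le32_cas` on little-endian fields in CXL memory.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state values for illustration only. */
enum deleg_state { DELEG_EMPTY = 0, DELEG_GRANTING = 1, DELEG_ACTIVE = 2 };

/* Publish GRANTING -> ACTIVE only if nobody changed the state meanwhile. */
static int deleg_publish_sketch(_Atomic uint32_t *state)
{
    uint32_t expected = DELEG_GRANTING;

    if (!atomic_compare_exchange_strong(state, &expected, DELEG_ACTIVE))
        return -EAGAIN; /* e.g. GC already reclaimed the entry */
    return 0;
}
```

A plain store would resurrect an entry that GC had already reset; the CAS leaves the reclaimed state untouched and lets the caller retry.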
| if (sbi->gc_orphan_count >= MARUFS_GC_ORPHAN_MAX) | ||
| return; | ||
|
|
||
| i = sbi->gc_orphan_count++; |
[high] Non-atomic increment of gc_orphan_count -- race condition
gc_orphan_count and the gc_orphans array are accessed without a lock. If marufs_gc_track_orphan() can be called concurrently from multiple CPUs, for example from the GC sweep path, a race condition occurs.
Either guarantee that it is called from a single thread only, or protect it with a spinlock/atomic.
Fixed in d0ef7e1
GC runs only as a single kthread, so there is no race. This is now stated in the comment on marufs_gc_track_orphan.
| struct marufs_rat_entry *rat_e = | ||
| marufs_rat_entry_get(sbi, xi->rat_entry_id); | ||
| if (rat_e) { | ||
| inode->i_size = READ_LE64(rat_e->size); |
[high] i_size modified directly without i_rwsem
VFS requires i_rwsem protection or the use of i_size_write() when changing i_size. Concurrent getattr and read_iter calls can see an inconsistent size.
In the getattr callback it is safer not to update inode->i_size and to write stat->size directly instead.
Fixed in 99cc329
Changed to write stat->size directly, as youngrok suggested. fillattr is called first, then stat->size/stat->blocks are overridden with the fresh RAT value. Since inode->i_size is no longer updated at all, i_rwsem is not needed.
| for (i = 0; i < MARUFS_MAX_RAT_ENTRIES; i++) { | ||
| struct marufs_rat_entry *entry = marufs_rat_entry_get(sbi, i); | ||
| if (!entry) | ||
| return -1; |
[medium] Wrong error code returned
return -1 corresponds to -EPERM. The VFS statfs callback must return a meaningful errno. A marufs_rat_entry_get() failure is an I/O error, so return -EIO is appropriate.
Already fixed in an earlier commit (5b5dfb8). return -1 → return -EIO (super.c statfs).
| * | ||
| * Stale lock recovery: if holder crashed, force-unlock after timeout. | ||
| */ | ||
| while (retries < MARUFS_REGION_INIT_MAX_RETRIES) { |
[medium] alloc_lock retry/timeout logic mismatch -- 5-second timeout never reached
retries++ is incremented only on the cpu_relax() path. cpu_relax() takes on the order of nanoseconds, so reaching MARUFS_REGION_INIT_MAX_RETRIES (10) takes only microseconds, and the 5-second timeout + force-unlock path is effectively unreachable.
Suggestion: count retries as the number of force-unlocks, or use usleep_range() instead of cpu_relax to guarantee the wait time.
Fixed in 21ccbb5.
cpu_relax() → usleep_range(500000, 600000) (500ms). 10 retries × 500ms = 5s, so the timeout force-unlock path is now actually reachable. Added #include <linux/delay.h>.
| */ | ||
| #ifdef CONFIG_MARUFS_CXL2_COMPAT | ||
|
|
||
| static inline void __marufs_cxl_flush_range(const void *addr, size_t len) |
[medium] clwb/clflushopt are x86-only -- compile failure on other architectures
clwb(p) (L403) and clflushopt(p) (L413) are x86-only instructions. ARM64/RISC-V builds will fail to compile.
Add an #ifdef CONFIG_X86 guard, or consider an architecture-independent API such as arch_wb_cache_pmem(). Even if CXL is currently used mainly on x86, it is better to state the supported scope with an #error message.
Fixed in 21ccbb5.
Replaced the direct x86-only clwb/clflushopt calls with the kernel's arch-portable APIs arch_wb_cache_pmem() / arch_invalidate_pmem() (<linux/libnvdimm.h>). On x86 they perform the same clwb+wmb / clflushopt+mb; on non-x86 they are no-ops (assuming a WT mapping). The manual cacheline loop and the #error guard are no longer needed.
| MOUNT_POINT="" | ||
| SKIP_BUILD=false | ||
| USE_DAXHEAP=false | ||
| DAXHEAP_DIR="${MARUFS_DAXHEAP_DIR:-/home/mcpark/daxheap}" |
[low] Developer's local path hardcoded
/home/mcpark/daxheap is set as the default. In other environments the build fails if MARUFS_DAXHEAP_DIR is not set. Remove the default and print an error when it is unset, or change it to a project-relative path.
Fixed in 21ccbb5.
Removed the DAXHEAP_DIR default → empty string. install.sh now errors out immediately if --daxheap is used while DAXHEAP_DIR is unset. The same applies to setup_local_multinode.sh / test_local_multinode.sh. The Makefile comment was also changed to a generic path.
… C3 batch sizing
C1: Add state validation before CAS in index_claim_entry and
nrht_claim_entry — reject VALID/INSERTING states to prevent
active entry hijack (index.c, nrht.c)
C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
matching other NRHT ioctl permission enforcement (file.c)
C3: Size batch buffer as max of both request types to prevent
heap overflow if MAX or struct sizes diverge (file.c)
…0 deleg CAS, H11 GC doc
- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
…ortable flush, M16 daxheap path
M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
a4b2829 to cd1e9ed
jooho-XCENA left a comment
Review: 8 unresolved HIGH issues
After checking the existing reviews (moonchan-park, youngrok-XCENA) and the fix commits (eea4555, 99cc329, d0ef7e1, 21ccbb5, 8614242, etc.), I left only the HIGH issues that are still unresolved and that no other reviewer has commented on.
Items confirmed as already resolved (comments omitted)
- ✅ GC kthread_should_stop inter-phase → 21ccbb5
- ✅ i_size_write without i_rwsem → 99cc329
- ✅ mmap file ref restore → d0ef7e1
- ✅ alloc_lock CAS → d0ef7e1
- ✅ GRANTING→ACTIVE CAS → d0ef7e1
- ✅ DAXHEAP TOCTOU → d0ef7e1
- ✅ compat_ioctl → 8614242
- ✅ claim_entry guard → eea4555
- ✅ FIND_NAME perm → eea4555
- ✅ superblock CRC32 → 5b5dfb8
- ✅ fstab format option → 29243e0
- ✅ arch-portable flush → 21ccbb5
| break; | ||
|
|
||
| case MARUFS_IOC_NRHT_INIT: | ||
| ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, |
[HIGH] NRHT_INIT should require ADMIN permission
MARUFS_IOC_NRHT_INIT is a destructive operation that runs memset(base, 0, total_needed) over the entire region and reformats it (nrht.c:523).
It currently requires only MARUFS_PERM_IOCTL, the same level as ordinary NRHT name lookup/store. Any process with IOCTL permission can format the entire NRHT.
// Current:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_IOCTL);
// Recommended:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_ADMIN);
Fixed in 27206dc. Changed MARUFS_PERM_IOCTL → MARUFS_PERM_ADMIN.
|
|
||
| dreq->fd = fd; | ||
| get_file(sbi->heap_dmabuf->file); | ||
| fd_install(fd, sbi->heap_dmabuf->file); |
[HIGH] DMABUF export exposes the entire CXL device
sbi->heap_dmabuf represents the entire daxheap device. Exporting it here gives the user access to the superblock, the RAT, the global index, and even other users' region data, which bypasses the ACL completely.
Recommended fix:
- At minimum, require MARUFS_PERM_ADMIN, or
- Create a per-region dma_buf slice that exports only the file's own region rather than the entire device
Currently READ+WRITE permission alone is enough to obtain an fd for the entire device, so single-file permission allows reading the data of every other file.
Fixed in 27206dc. Changed MARUFS_PERM_READ | MARUFS_PERM_WRITE → MARUFS_PERM_ADMIN. This ioctl exists so that other nodes can mount marufs using the same daxheap fd, so we do not plan to provide per-region slices.
|
|
||
| static struct kobj_attribute gc_pause_attr = | ||
| __ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store); | ||
|
|
[HIGH] gc_pause permissions are too permissive (0644) and stores lack a capability check
With 0644 the attribute is readable by every user, and writes are gated only by the file mode. If GC is paused, dead regions and orphan entries accumulate, leading to memory exhaustion and denial of service.
In addition, gc_pause_store, gc_trigger_store, gc_stop_store, and gc_restart_store all lack a capable(CAP_SYS_ADMIN) check, so they can be manipulated by unprivileged users in container/user-namespace environments.
// Current:
__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);
// Recommended:
__ATTR(gc_pause, 0600, gc_pause_show, gc_pause_store);
// + add to the store functions:
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
Fixed in 27206dc. Changed the gc_pause permissions 0644 → 0600 and added a capable(CAP_SYS_ADMIN) check to all four store functions (gc_trigger/gc_stop/gc_pause/gc_restart).
| buckets_per_shard = 1; | ||
| } | ||
| buckets_per_shard = roundup_pow_of_two(buckets_per_shard); | ||
|
|
[HIGH] No upper-bound check on num_buckets: roundup_pow_of_two overflow
num_buckets is controlled by the user through the NRHT_INIT ioctl. If num_buckets is very large (e.g. 0xFFFFFFFF) and num_shards=1:
- buckets_per_shard = 0xFFFFFFFF
- roundup_pow_of_two(0xFFFFFFFF) → on 64-bit, 1UL << 32 = 0x100000000
- bucket_array_size = 0x100000000 * 4 → u64 overflow or a miscomputed total_needed
- memset(base, 0, (size_t)total_needed): can write to CXL memory beyond the region boundary
// Add an upper-bound check before roundup_pow_of_two:
if (buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
return -EINVAL;
buckets_per_shard = roundup_pow_of_two(buckets_per_shard);
if (buckets_per_shard == 0 || buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
return -EINVAL;
Fixed in f74049b. Added MARUFS_NRHT_MAX_ENTRIES upper-bound checks before and after roundup_pow_of_two.
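A user-space sketch of the before-and-after bound check. The cap value, the division by `num_shards`, and the names `NRHT_MAX_ENTRIES_SKETCH`, `roundup_pow2_u64`, and `buckets_per_shard_sketch` are assumptions made for illustration; only the clamp to 1 and the two checks around the rounding come from the thread.

```c
#include <errno.h>
#include <stdint.h>

#define NRHT_MAX_ENTRIES_SKETCH (1u << 20) /* hypothetical cap */

/* Round up to the next power of two. The caller keeps v within the cap,
 * so the shift cannot overflow. */
static uint64_t roundup_pow2_u64(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

static int buckets_per_shard_sketch(uint32_t num_buckets, uint32_t num_shards,
                                    uint32_t *out)
{
    uint64_t per_shard;

    if (num_shards == 0)
        return -EINVAL;
    per_shard = num_buckets / num_shards;
    if (per_shard == 0)
        per_shard = 1;
    if (per_shard > NRHT_MAX_ENTRIES_SKETCH) /* check before rounding */
        return -EINVAL;
    per_shard = roundup_pow2_u64(per_shard);
    if (per_shard > NRHT_MAX_ENTRIES_SKETCH) /* and again after rounding */
        return -EINVAL;
    *out = (uint32_t)per_shard;
    return 0;
}
```

Doing the arithmetic in 64 bits and rejecting oversized inputs before rounding keeps the later size computation well inside the region.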
| return ret; | ||
|
|
||
| if (attr->ia_valid & ATTR_SIZE) { | ||
| struct marufs_inode_info *xi = marufs_inode_get(inode); |
[HIGH] setattr ignores changes other than ATTR_SIZE: chmod/chown/utimes have no effect
After validating with setattr_prepare(), only ATTR_SIZE is handled before returning. ATTR_UID, ATTR_GID, ATTR_MODE, ATTR_ATIME, etc. pass validation but are never applied, so chmod/chown/utimes return success without actually changing anything.
By VFS convention, after setattr_prepare() succeeds the filesystem should either call setattr_copy() or explicitly reject unsupported attributes with -EPERM.
if (attr->ia_valid & ATTR_SIZE) {
// ... ftruncate handling ...
}
// Needs to be added:
setattr_copy(MARUFS_IDMAP_ARG_COMMA inode, attr);
// Or: reject explicitly if unsupported
if (attr->ia_valid & ~ATTR_SIZE)
return -EPERM;
Fixed in f74049b. Requests to change any attr other than ATTR_SIZE | ATTR_FORCE are now explicitly rejected with -EPERM. The CXL FS does not persist POSIX attrs, so rejection was chosen over setattr_copy().
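The rejection rule reduces to a single mask test, sketched here in user-space C. The `SK_ATTR_*` constants are illustrative stand-ins; in the kernel the real `ATTR_*` flags come from `<linux/fs.h>`, and `setattr_check_sketch` is a hypothetical helper name.

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's ATTR_* flags. */
#define SK_ATTR_MODE  (1u << 0)
#define SK_ATTR_UID   (1u << 1)
#define SK_ATTR_SIZE  (1u << 3)
#define SK_ATTR_FORCE (1u << 9)

/* Accept only size changes (optionally forced); reject everything else. */
static int setattr_check_sketch(unsigned int ia_valid)
{
    if (ia_valid & ~(SK_ATTR_SIZE | SK_ATTR_FORCE))
        return -EPERM;
    return 0;
}
```

With this check a chmod or chown fails visibly instead of reporting success while changing nothing.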
| buf->f_bsize = PAGE_SIZE; | ||
| buf->f_blocks = sbi->total_size / PAGE_SIZE; | ||
| buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE; | ||
| buf->f_bavail = buf->f_bfree; |
[HIGH] f_bfree unsigned underflow: abnormal free space reported
Both sbi->total_size and used_size are u64, so if used_size > total_size because of metadata corruption or concurrent allocation, the unsigned subtraction wraps to a very large value.
// Current:
buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
// Recommended:
if (used_size > sbi->total_size)
buf->f_bfree = 0;
else
buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
buf->f_bavail = buf->f_bfree;
Fixed in f74049b. Added a used_size > sbi->total_size guard to prevent the unsigned underflow.
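The guard is easy to check in isolation. This sketch assumes a 4096-byte page and uses the hypothetical names `SK_PAGE_SIZE` and `free_blocks_sketch`; it is not the module's statfs code.

```c
#include <stdint.h>

#define SK_PAGE_SIZE 4096u /* assumed page size */

/* Free blocks, clamped to 0 when the usage counter exceeds the total. */
static uint64_t free_blocks_sketch(uint64_t total_size, uint64_t used_size)
{
    if (used_size > total_size)
        return 0;
    return (total_size - used_size) / SK_PAGE_SIZE;
}
```

Without the clamp, `total_size = 4096` and `used_size = 8192` would wrap around and report a free-block count close to 2^64 / 4096.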
| struct page *page = &folio->page; | ||
|
|
||
| zero_user_segments(page, 0, PAGE_SIZE, 0, 0); | ||
| SetPageUptodate(page); |
[HIGH] read_folio always returns a zero-filled page: data loss on the page cache path
DAX mmap is the primary access path, but sendfile(), splice(), or a non-DAX fallback path calls read_folio through the page cache. Because it currently always returns a zero page, callers read zeros instead of the actual CXL memory data.
Recommended:
- Copy the data from CXL memory: memcpy_from_dax(page, sbi->dax_base + phys_offset + page_offset, ...), or
- Block page cache use entirely and serve reads through read_iter only (remove read_folio from address_space_operations and implement a suitable alternative)
Fixed in f74049b. Instead of copying CXL data into the page cache, we chose to block this path with -EIO.
Reasons:
- The read_folio interface has no place to insert a permission check, so it would bypass the marufs ACL model
- A copy of the CXL data would remain in the DRAM page cache, breaking cross-node consistency
- The read() syscall is still supported normally through read_iter, which copies directly from CXL and checks permissions
sendfile/splice are not used in the DAX FS KV cache workload, so blocking them has no impact.
| } | ||
|
|
||
| set_page_dirty(page); | ||
| return VM_FAULT_LOCKED; |
[HIGH] set_page_dirty() removed in kernel 6.8+, build failure
set_page_dirty() was removed in kernel 6.8 (replaced by folio_mark_dirty()). This module's compat.h supports kernels up to 6.17, so the build fails on recent kernels.
// Current:
set_page_dirty(page);
// Recommended:
folio_mark_dirty(page_folio(page));
// Or add a shim to compat.h:
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 8, 0)
folio_mark_dirty(page_folio(page));
#else
set_page_dirty(page);
#endif
Fixed in f74049b. Added a marufs_set_page_dirty() inline function to compat.h (6.8+ → folio_mark_dirty(), earlier → set_page_dirty()). file.c now calls the compat function.
f74049b to d2be093
050e08b to 528e090
Linux kernel filesystem module for CXL shared memory, enabling cross-node file sharing via DAX-mapped CXL memory pools.
Core components:
- VFS layer: mount/umount, directory ops, inode lifecycle, mmap/ioctl
- CAS-based lock-free global index with sharded hash table (4 shards)
- Region Allocation Table (RAT): per-file metadata in 2KB CL-aligned entries
- NRHT (Name-Ref Hash Table): application-level name→(offset, region) mapping
- ACL: 3-stage permission model (owner → default_perms → delegation table)
- 4-phase background GC: dead process reap, stale index sweep, local tracker, NRHT
Includes architecture docs (6 detailed + 1 overview), test suite (ioctl, mmap, cross-process, chown race, multinode), and build/install scripts.
… C3 batch sizing
C1: Add state validation before CAS in index_claim_entry and
nrht_claim_entry — reject VALID/INSERTING states to prevent
active entry hijack (index.c, nrht.c)
C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
matching other NRHT ioctl permission enforcement (file.c)
C3: Size batch buffer as max of both request types to prevent
heap overflow if MAX or struct sizes diverge (file.c)
- Implement CRC32 over immutable GSB fields (magic → entries_per_shard)
- Compute + write checksum at format, validate at mount
- Remove active_nodes bitmask (to be replaced by per-node cacheline design)
- Adjust reserved padding 200 → 208 bytes to maintain 256B struct size
Persistent format option causes re-format on every reboot, destroying all data. CXL volatile memory bootstrap will be handled by a separate format_if_needed scheme (magic + CRC32 validation at mount time).
- file.c: wrap i_size_write with inode_lock/unlock in read path
- inode.c: write fresh RAT size to stat->size directly in getattr, avoiding i_size_write without i_rwsem entirely
…0 deleg CAS, H11 GC doc
- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
…ortable flush, M16 daxheap path
M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
All ioctl structs use fixed-width types (__u32/__u64/__s32) with identical 32/64-bit layout, so compat_ptr_ioctl suffices.
- Add per-shard CAS spinlock (shard_header->lock) to serialize bucket linking and post-insert dedup, eliminating TOCTOU race
- New TENTATIVE(2) entry state between INSERTING and VALID; VALID is now 3, TOMBSTONE is now 4
- Rewrite post_insert_dedup to walk chain directly with self-skip instead of using nrht_find_chain
- Move region_type=NRHT write after format to prevent false EEXIST on first nrht_init call
- Change double-init check from physical magic probe to RAT region_type check (immune to stale CXL data)
- Extract nrht_shard_lock/unlock inline helpers
- Update 4_arch_nrht.md: state diagram, transition table, insert flowchart, function summary for shard lock semantics
- Add TENTATIVE state to index.c insert protocol (4→9 step): INSERTING → TENTATIVE → lock → link + dedup → VALID → unlock
- Use DRAM spinlock (marufs_shard_cache.insert_lock) instead of CXL lock for node-local thread serialization; cross-node handled by token ring
- Move TOMBSTONE write from post_insert_dedup to caller for consistency with nrht.c pattern
- Add deleg_info sysfs attribute for per-region delegation inspection
- Fix stale enum comments in marufs_layout.h (VALID 2→3, TOMBSTONE 3→4)
- sysfs: add deleg_info read/write for per-region delegation inspection
- sysfs: gc_trigger now iterates ALL registered mounts (not just first sbi)
- tests: integrate test_mmap_notrunc, test_negative, test_nrht_race, test_gc_deleg, test_pid_reuse into test_local_multinode.sh (Sections 25-29)
- tests: show full failure output instead of head -5 truncation
- docs: update entry lifecycle and metadata layout for TENTATIVE state
- .gitignore: add new test binaries
- NRHT_INIT ioctl: PERM_IOCTL → PERM_ADMIN (prevents non-admin format)
- DMABUF export: PERM_READ|PERM_WRITE → PERM_ADMIN (whole-device exposure)
- sysfs gc_pause: 0644 → 0600 (root-only read/write)
- sysfs gc_trigger/gc_stop/gc_pause/gc_restart: add capable(CAP_SYS_ADMIN)
Adds marufs_kernel/docs/0_user_guide.md: a scenario-oriented walk-through covering admin multi-node setup, application region lifecycle, NRHT name-ref publishing, and the delegation-based security model. Complements the existing architecture docs, which focus on implementation internals.
…tegies
Replace per-shard spinlock-based insert serialization with cross-node
Mutual Exclusion (ME) framework. Strategy Pattern exposes a common
interface backed by two implementations:
- Order-driven (me_order.c): token ring circulating among ACTIVE nodes
- Request-driven (me_request.c): holder scans request slots, grants
Two ME domains:
- Global ME (S=1): serializes Index insert + RAT alloc
- NRHT ME (S=N_shard): per-shard token, opt-in membership via
marufs_nrht_join() pre-warm (backup path lazy-init on first insert)
Adds unified poll kthread (me_poll_thread) iterating all registered
ME instances (me_list / me_list_lock). marufs_nrht_init() now takes
me_strategy parameter selecting the implementation per filesystem.
sysfs exposes ME diagnostics; tests/monitor_me.sh + rewritten
test_nrht_race.c exercise the new paths. bench_name_ref and
setup/test_local_multinode.sh updated for the new membership model.
Replace shared-CB polling with per-(shard, node) doorbell slots to eliminate O(N) cache-line ping-pong on the hot token-pass path.
* Token pass: writer updates CB (holder, generation) then rings the target's slot (from_node, cb_gen_at_write, token_seq++). Reader polls its own slot and cross-checks CB on seq change.
* Heartbeat moved to per-node membership slot (distributed); only the cached successor of the current holder watches it.
* Magic tags on CB / membership / slot: writers verify the cached pointer still addresses the intended record type before mutating, and self-deactivate the instance on mismatch. Prevents stale-layout writes from corrupting a newly reformatted region after a peer nrht_init() raced this sbi's cached ME instance.
* Token-gated order_leave: acquire each shard then pass to leave_successor so the leaving node is sole writer of its own slot during handoff; clear membership status last.
* teardown_reformat detection: skip leave() when format_generation mismatches cached value (CXL area was reformatted under us).
* Skip self-pass in order_poll_cycle when next_active == self.
* wait_for_token: keep last_cb_gen across phantoms so the gen-monotonicity filter still rejects stale passes.
* /sys/fs/marufs/me_info exposes per-shard doorbell slot state for debugging (from_node, token_seq, cb_gen_at_write, request fields).
* Demote order_leave acquire-failed diagnostic to pr_debug.
Stress-tested via sweep bench (order + request, shards 2..64, on 2/4/8 mounts) after each change.
* New mount option `me_strategy=order|request` (default: request).
* `marufs_nrht_init_req.me_strategy` surfaced in the setup example.
* NRHT_JOIN ioctl documented as the optional pre-warm alternative to lazy ring join on the first NAME_OFFSET/FIND_NAME.
Adds per-ME-instance atomic64 counters for CXL RMB traffic (cb, slot, membership), ops->poll_cycle() invocations, and wall-clock ns spent in poll_cycle. Exposed via /sys/fs/marufs/me_poll_stats (aggregated across all mounted sbis); write any value to reset. test_nrht_race.c reads the counters via sysfs around each timed bench run (reset-then-read) and prints a poll-thread cost subtable in the sweep summary — including per-cycle rates (cb/c, slot/c, mem/c) so optimization deltas stay comparable across versions even when cycle count itself shifts.
Adds __le64 pending_shards_mask to the membership slot. Each node flips bit s when it raises a hand on shard s (request_acquire) and clears it after CS (request_clear_own), using a bounded CAS loop with WARN_ONCE fallback to guard against same-node races on different bits of the same word.
request_poll_cycle fuses the former per-shard next_active() calls into a single membership pass that collects per-peer masks, OR-reduces them into peers_pending, and picks the round-robin successor. Holder side then gates request_scan_and_grant on (peers_pending & (1<<s)): idle shards pay zero slot RMBs instead of the baseline N-1-per-shard full scan. Masked scan further filters nodes whose bit is clear. Release path keeps the full scan (primary grant path; staleness-driven skip would stall requests raised between poll and release).
Per-cycle benchmark (S=64, N=8): rmb_slot ~33 → ~4 (-87%), ns_avg ~82k → ~60k (-27%). Low-S workloads pay a small membership-read tradeoff (O(N) pass vs. O(S*k) lazy) but NRHT production target is S >= N where the scan-skip wins dominate.
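The mask handling described in this commit message can be sketched with C11 atomics. This is an illustration only: `MASK_CAS_MAX_TRIES`, `mask_update_sketch`, and `peers_pending_sketch` are hypothetical names, the retry bound is arbitrary, and the real code works on a `__le64` field in CXL memory with a WARN_ONCE fallback.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_CAS_MAX_TRIES 64 /* hypothetical retry bound */

/* Set or clear bit `shard` (must be < 64) with a bounded CAS loop. */
static bool mask_update_sketch(_Atomic uint64_t *mask, unsigned int shard,
                               bool set)
{
    uint64_t bit = UINT64_C(1) << shard;
    uint64_t old = atomic_load(mask);

    for (int i = 0; i < MASK_CAS_MAX_TRIES; i++) {
        uint64_t want = set ? (old | bit) : (old & ~bit);

        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_strong(mask, &old, want))
            return true;
    }
    return false; /* the caller would warn and fall back here */
}

/* OR-reduce every peer's mask, skipping our own slot. */
static uint64_t peers_pending_sketch(_Atomic uint64_t *masks, unsigned int n,
                                     unsigned int self)
{
    uint64_t acc = 0;

    for (unsigned int i = 0; i < n; i++)
        if (i != self)
            acc |= atomic_load(&masks[i]);
    return acc;
}
```

A holder would then test `peers_pending_sketch(...) & (UINT64_C(1) << s)` before scanning the request slots of shard `s`, so idle shards are skipped.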
Baseline's per-shard CB RMB in each node's poll_cycle created a textbook CXL anti-pattern: N hosts continuously polling the same cache line, loading the shared memory controller queue and reducing fabric bandwidth available to the data path. Doorbell's point was to push hot-polling off shared CLs onto single-reader ones.
Replace the CB read with a DRAM `is_holder[S]` boolean, flipped by:
- me_pass_token on our own write (outgoing transition)
- poll_cycle slot-doorbell detection on token_seq bump (incoming)
- wait_for_token / common_try_acquire on CB-read success
- common_join initial seed
Receiver-side detection now polls only our own per-(shard,node) slot: a single-reader CL per node, no multi-host contention. The bump is a sufficient "I became holder" signal because me_pass_token orders the CB WMB before the slot WMB.
Crash detection migrates out of the poll path into wait_for_token's timeout branch: after 5s without progress, check holder's heartbeat_ts; if stalled past MARUFS_ME_TIMEOUT_NS, self-takeover via me_pass_token(self, s, self). Idle shards with no acquirer require no proactive monitoring. Removes marufs_me_check_heartbeat plus the last_heartbeat / last_heartbeat_time DRAM arrays.
order_poll_cycle: drop ghost-alone CB probe; the acquire-timeout takeover covers the same scenario.
Per-cycle CB RMB at S=64 N=8 drops from ~33 to near-zero in steady state; slot-doorbell reads add S per cycle but land on single-reader CLs and don't contend across hosts.
Replace 8 parallel per-shard arrays in marufs_me_instance with a single struct marufs_me_shard *shards allocation. Simplifies alloc/free, keeps per-shard fields cache-adjacent, and removes scattered kcalloc/kfree bookkeeping. Fields consolidated: holding, local_waiters, local_lock, cached_successor, is_holder, poll_last_slot_seq, last_token_seq, last_cb_gen.
Replace repeated me->shards[shard_id].field accesses with a local struct marufs_me_shard *sh bound once per function/loop iteration. No behavior change. Side effect: common_join now hoists marufs_me_next_active() out of the per-shard loop since all shards get the same successor at join time — one scan instead of num_shards scans.
Replace the CB RMB in wait_for_token's fast path with a DRAM is_holder check. Saves one CXL CB round-trip per same-node burst acquire where the token is already held on entry.
Correctness:
- Every is_holder mutation goes through ME_BECOME_HOLDER / ME_LOSE_HOLDER, centralizing the smp_wmb rule. Reader pair ME_IS_HOLDER(sh) wraps smp_rmb + field read.
- me_pass_token self-pass seeds sh->last_cb_gen / last_token_seq / poll_last_slot_seq so the fast path can trust is_holder without re-reading CB, and poll_cycle treats the doorbell as a self-bump without re-flipping state.
- common_try_acquire and wait_for_token's loop-exit path pick up the previously-missing smp_wmb via the macro.
- poll_cycle bump handlers touch only poll_last_slot_seq; the app thread keeps exclusive ownership of last_token_seq / last_cb_gen so its bump-detection signal is never lost to a poll-thread race.
- Deadline-path takeover drops its redundant post-pass CB/slot RMB; me_pass_token self-pass already seeds the same baselines.
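The publish/observe pairing described above can be sketched in userspace C11, assuming release/acquire atomics as a stand-in for the kernel's smp_wmb/smp_rmb pairing. The struct and function names below (`me_shard_sketch`, `me_become_holder`, and so on) are hypothetical illustrations, not the module's actual macros.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-shard DRAM state. */
struct me_shard_sketch {
    uint64_t last_cb_gen;    /* baseline seeded before the flag flips */
    uint64_t last_token_seq; /* baseline seeded before the flag flips */
    atomic_bool is_holder;   /* DRAM flag consulted by the fast path */
};

/* Seed the baselines first, then publish the flag with release ordering
 * (the analogue of "write fields; smp_wmb; set is_holder"). */
static inline void me_become_holder(struct me_shard_sketch *sh,
                                    uint64_t cb_gen, uint64_t token_seq)
{
    sh->last_cb_gen = cb_gen;
    sh->last_token_seq = token_seq;
    atomic_store_explicit(&sh->is_holder, true, memory_order_release);
}

static inline void me_lose_holder(struct me_shard_sketch *sh)
{
    atomic_store_explicit(&sh->is_holder, false, memory_order_release);
}

/* Acquire load: the analogue of "read is_holder; smp_rmb". */
static inline bool me_is_holder(struct me_shard_sketch *sh)
{
    return atomic_load_explicit(&sh->is_holder, memory_order_acquire);
}

/* Fast path: true means the acquire may skip the CB read entirely. */
static inline bool wait_for_token_fast_path(struct me_shard_sketch *sh)
{
    return me_is_holder(sh);
}
```

Routing every mutation through the two setters keeps the ordering rule in one place, which is the property the commit relies on.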
…y_acquire
- add me_cb_snapshot() helper in me.h bundling RMB + holder/generation read; callers pass NULL for gen when only holder is needed.
- remove marufs_me_common_try_acquire and .try_acquire op (no users remain).
- remove order_leave_dump diagnostic (covered by existing poll tracing).
- move shard pointer binding closer to first use (C99 style).
Introduces persistent per-CPU instrumentation to diagnose acquire
latency, poll-thread cost, and hash-chain quality without measurable
hot-path overhead (~0.5% of ns_avg; log2 buckets, non-atomic updates).
New headers (kept separate from me.h/marufs.h to bound their surface):
me_stats.h - struct marufs_me_stats_pcpu + helpers for
wait_for_token (spin/sleep/deadline hit, wall+cpu ns,
log2(ns) latency histogram), poll_cycle phase
breakdown (membership/doorbell/scan), lock hold time,
per-shard acquire distribution, request-mode grant age.
cpu_ns is sampled via current->se.sum_exec_runtime
(task_sched_runtime is unexported); the tick-granular
delta is clamped to wall_ns at each sample so the
aggregate cpu_util stays physically valid.
nrht_stats.h - per-CPU bucket-chain walk count/steps histogram.
Lives on sbi, handed to nrht_find_chain via a direct
stats pointer on nrht_shard_ctx (no sbi back-ref).
Sysfs (all write-any-reset except chain/poll_thread cumulative):
me_fine_stats - aggregate ME counters per instance
me_per_shard_acquire - hotspot detection across shards
me_poll_thread_cpu - cumulative sum_exec_runtime of poll kthread
(diff-based utilization; cannot be reset)
nrht_chain_depth - find_chain count/steps + depth histogram
Bench (test_nrht_race):
- Reset + read/diff across all four nodes per run.
- Per-run dump: wait hit split, cpu_util, grant count, chain depth,
poll kthread util (divided by mount_count for per-thread avg).
- Sweep table gains a fine-grained section with wait_avg/cpu%/spin%/
hold_avg/grant/chain/poll_cpu%.
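A log2(ns) latency histogram of the kind mentioned above can be sketched as follows. This is a minimal illustration, assuming 32 buckets and plain (non-atomic) increments on a structure owned by one CPU; `LAT_BUCKETS`, `lat_hist`, `lat_bucket`, and `lat_record` are hypothetical names, not the contents of me_stats.h.

```c
#include <stdint.h>

#define LAT_BUCKETS 32 /* hypothetical bucket count */

/* One histogram per CPU: increments need no atomics when only the
 * owning CPU writes to it. */
struct lat_hist {
    uint64_t bucket[LAT_BUCKETS];
};

/* floor(log2(ns)), with 0 and 1 mapped to bucket 0 and large values
 * clamped to the last bucket. */
static unsigned int lat_bucket(uint64_t ns)
{
    unsigned int b = 0;

    while (ns > 1 && b < LAT_BUCKETS - 1) {
        ns >>= 1;
        b++;
    }
    return b;
}

static void lat_record(struct lat_hist *h, uint64_t ns)
{
    h->bucket[lat_bucket(ns)]++;
}
```

Each sample costs a short shift loop plus one increment, which is in line with the low hot-path overhead the commit message reports.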
wait_fast_hit: tracks ME_IS_HOLDER early returns in wait_for_token, i.e. the acquires that bypass all token-wait work. Combined with wait_count this exposes the intra-node holder-keep hit rate. Measurement on the bench showed request-mode sustains ~80% fast-hit across shards while order-mode collapses to <10% as shards grow, quantifying why order scales poorly.
Bench sweep table gains four columns:
fast% - fast_hit / (wait_count + fast_hit)
mem% - membership pass share of poll_cycle wall
door% - per-shard slot doorbell RMB share
scan% - grant/pass phase share (masked for request, baton for order)
The poll-phase split pinpoints where poll_cycle spends time under each strategy: request is membership+doorbell bound (scan <2%), order is scan-bound at high shard counts (>40%).
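The fast% column follows directly from the two counters. A small sketch of the formula, assuming (as the formula implies) that wait_count excludes fast hits; the function name is hypothetical:

```c
#include <stdint.h>

/* fast% = fast_hit / (wait_count + fast_hit), as a percentage.
 * Returns 0 when no acquires were recorded at all. */
static double fast_hit_pct(uint64_t fast_hit, uint64_t wait_count)
{
    uint64_t total = wait_count + fast_hit;

    if (total == 0)
        return 0.0;
    return 100.0 * (double)fast_hit / (double)total;
}
```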
Replace cross-node ktime_get_ns() subtraction with counter-based probe in the acquire-deadline path. CXL peers don't share a monotonic clock zero point: per-node boot times differ, so now - heartbeat_ts produces meaningless elapsed values and can misclassify alive holders as crashed (or vice versa).
On deadline, me_handle_acquire_deadline:
1. Snapshot holder's heartbeat counter (hb_before).
2. Sleep MARUFS_ME_LIVENESS_PROBE_NS on local clock.
3. Resample counter + CB.
4. late grant on us → enter CS directly, skip takeover.
5. holder changed or hb advanced → -ETIMEDOUT (back off).
6. counter stuck and holder unchanged → self-takeover via me_pass_token.
Probe window: 100ms (10000× default poll interval). Conservative: the takeover path isn't hot and false-positive crash calls would race against a live holder on CB write. heartbeat_ts kept as observability field only.
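The three post-probe branches (steps 4 to 6) reduce to a pure decision over the two snapshots. The sketch below illustrates only that classification; the enum, function name, and parameter types are hypothetical, and the sleep and CXL reads are left to the caller.

```c
#include <stdint.h>

enum probe_verdict {
    PROBE_LATE_GRANT, /* CB now names us: enter CS, skip takeover */
    PROBE_BACK_OFF,   /* holder changed or heartbeat advanced: -ETIMEDOUT */
    PROBE_TAKEOVER,   /* same holder, counter stuck: self-takeover */
};

/* Classify the result of one liveness probe. Only counters sampled from
 * the same node are compared, so no cross-node clock arithmetic is needed. */
static enum probe_verdict probe_classify(uint32_t self,
                                         uint32_t holder_before,
                                         uint64_t hb_before,
                                         uint32_t holder_after,
                                         uint64_t hb_after)
{
    if (holder_after == self)
        return PROBE_LATE_GRANT;
    if (holder_after != holder_before || hb_after != hb_before)
        return PROBE_BACK_OFF;
    return PROBE_TAKEOVER;
}
```

Checking the late-grant case first matters: a grant that lands during the probe window must not be mistaken for a holder change and turned into a spurious timeout.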
Rewrite docs/7_arch_me.md around protocol mechanics:
- §1 Shared State Layout: access-pattern mermaid + CXL struct table.
- §2 Overview: 2.1 per-node lifecycle + 2.2 per-(shard, node) state machine
(rename states to NONE / MEMBER / HOLDER_BUSY / HOLDER_IDLE — token
ownership axis made explicit in labels).
- §3 Thread Interaction & Memory Access: merge thread-level sequence with
memory-level byte evolution per mode.
- 3.1 OD (token pass): alt branch for receiver with/without app waiter
(Case A fast-path vs Case B poll-thread fallback consumer).
- 3.2 RD (request+grant): alt branch for grant paths (release vs poll).
- 3.3 Cacheline snapshot: before/after value tables, symbolic CL labels.
- 3.4 Acquire Timeout: crash vs busy-holder disambiguation via
counter-based liveness probe; three post-deadline branches (late grant
on self / holder changed or alive / counter stuck → takeover).
- §4 Stats & Bench Integration: list current sysfs attrs, split poll-cost
counters from per-CPU fine-grained stats, align with bench harness
output columns.
Drop sections that duplicated code (per-shard DRAM struct, step-by-step
prose flows — §3 diagrams cover them).
Break the 1266-line monolithic sysfs.c into focused units while moving
manual GC control out of the production attribute surface.
sysfs.c 1266 -> 265 core: version/region_info/perm_info/daxheap_bufid + group/init
sysfs_me.{c,h} new ME inspection + per-CPU stats (me_info, poll_stats, fine_stats, ...)
sysfs_gc.{c,h} new GC monitoring (deleg_info, gc_status)
sysfs_nrht.{c,h} new NRHT chain-depth histogram
sysfs_internal.h new shared sbi_list/lock + get_sbi/find_by_node helpers
sysfs_debug.{c,h} extended gc_trigger/stop/pause/restart relocated into debug subgroup
alongside existing fault injection (me_freeze_heartbeat,
me_sync_is_holder)
Helper naming in sysfs_me.c unified to verb-noun form:
me_state_str -> me_state_name
me_tag_for -> me_format_tag
me_info_emit_one -> me_emit_instance
me_stats_aggregate -> me_aggregate_stats
me_fine_stats_emit_buckets -> me_emit_buckets
Tests updated to /sys/fs/marufs/debug/gc_* paths:
test_local_multinode.sh (13 spots)
test_chown_race.c, test_dupname.c, test_overlap.c, test_gc_deleg.c
Also includes pre-staged ME crash-detection scaffolding consumed by the
debug subgroup: me.h / me_order.c / me_request.c expose debug_freeze_poll
hooks; tests/test_me_crash.sh exercises freeze + sync recovery end-to-end.
…eaders
marufs.h was a 1266-line catch-all (sb_info + DAX/RAT/shard helpers +
function decls for 10 modules). marufs_layout.h was a 625-line mix of
on-disk structs and CXL/CAS primitives. Both now act as umbrella
headers over focused per-domain files.
Phase 1 — extract reusable primitives + per-module decl headers:
marufs_endian.h READ_LE/WRITE_LE/READ_CXL_LE, MARUFS_CXL_WMB/RMB,
le16/32/64_cas, cas_inc/dec
marufs_hash.h shard_idx, bucket_idx, hash_name, make_ino,
ino_to_region, align_up
gc.h orphan tracker types + gc.c entry points
inode.h marufs_inode_info struct + inode.c entry points
+ inode_ops externs
acl.h, cache.h, dir.h, file.h, index.h, nrht.h, region.h, super.h
per-module function declarations
Phase 2 — split on-disk structs by subsystem:
marufs_superblock_layout.h marufs_superblock + GSB_SIZE
marufs_index_layout.h shard_header, index_entry, region
defaults, BUCKET_END, state enum
marufs_rat_layout.h rat, rat_entry, deleg_entry, RAT/deleg/
region_type state enums, capacity
marufs_nrht_layout.h nrht_header, nrht_shard_header,
nrht_entry, NRHT defaults
Umbrella files now hold only:
marufs.h sb_info, DAX/RAT/shard inline accessors, sysfs decls
marufs_layout.h magic enum, ME area sizes + me_area_size helper,
layout offsets, compile-time size validators
Existing .c files keep including marufs.h alone — umbrella pulls in
all per-module headers, so include patterns are unchanged. Phase 3
(sb_info field grouping into sub-structs) deferred — too invasive,
low ROI.
Line counts:
marufs.h 1266 -> 435 (66% reduction)
marufs_layout.h 625 -> 130 (79% reduction)
16 new headers ~970 lines (focused, dependency-minimal)
Build clean. Compile-time size validators still pass.
Adds ME crash-detection regression coverage to the local multinode suite by delegating to the standalone test_me_crash.sh (T1-T7). Previously the crash tests had to be run by hand after every kernel change: easy to forget, easy to silently regress.
Section 30 runs last because:
- manipulates dmesg ring (uses dmesg -C between sub-tests)
- T1 saturates CPU with stress-ng (soft dep, self-skips if absent)
- T5 needs ≥3 mounts (self-skips if /mnt/marufs3 absent)
- ~50s total runtime
Gate: requires test_me_crash.sh executable + writable /sys/fs/marufs/debug/me_freeze_heartbeat + root. Otherwise SKIP with a one-line reason.
test_me_crash.sh's own `set -euo pipefail` + die behavior maps cleanly to run_test's pass/fail accounting (non-zero exit on first failed T → Section 30 fails).
NRHT --sweep benchmark intentionally NOT integrated here: it's a throughput measurement (no pass/fail), takes minutes, sensitive to machine load. A dedicated bench_nrht.sh is the right home for it.
Replaces mandatory node_id= mount option with bootstrap-elected slot assignment. Each mount CAS-claims a free slot in the on-disk bootstrap table (CLAIMED/FORMATTING states); first claimer formats the FS, rest attach. Stuck-formatter steal path covers crashed formatters.
Changes:
- bootstrap.c/h, marufs_bootstrap_layout.h: slot table + claim/steal
- super.c: bootstrap-elected node_id; legacy explicit node_id= still supported via mount option
- sysfs_debug: bootstrap_dump shows per-mount slot ownership (<mine>)
- setup_local_multinode.sh: --legacy flag for old explicit-node_id style; default is auto-mount via bootstrap
- test_bootstrap_chaos.sh: T1 stuck-formatter recovery, T2 concurrent mount race, T3 slot reuse sanity
- test_local_multinode.sh: Section 31 auto-mount slot table checks, Section 32 delegates to chaos with auto-teardown
- test_me_crash.sh: trim T1 stress 8s->4s, T2 iters 3->2, T3 busy 7s->6s
- gc/file/nrht/sysfs minor adjustments for bootstrap integration
- dax_zero.c: helper to wipe DAX device for chaos preconditions
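The CAS-claim step can be illustrated with C11 atomics in userspace. This is a sketch under stated assumptions: the state encoding (including a FREE value of 0) and the function name are hypothetical, and the formatter election and steal path are not modeled.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state encoding; only CLAIMED and FORMATTING are named
 * in the commit message. */
enum {
    SLOT_FREE = 0,
    SLOT_FORMATTING = 1,
    SLOT_CLAIMED = 2,
};

/* Scan the table and claim the first free slot with a compare-and-swap.
 * Returns the claimed slot index (usable as node_id), or -1 if full.
 * Two concurrent mounts can never win the same slot: the CAS succeeds
 * for exactly one of them and the loser moves on to the next slot. */
static int bootstrap_claim_slot(_Atomic uint32_t *slots, int nslots)
{
    for (int i = 0; i < nslots; i++) {
        uint32_t expected = SLOT_FREE;

        if (atomic_compare_exchange_strong(&slots[i], &expected,
                                           SLOT_CLAIMED))
            return i;
    }
    return -1;
}
```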
Without this, request_poll_cycle's stale poll_last_slot_seq baseline re-triggers ME_BECOME_HOLDER after the token has already been passed, leaking is_holder=true cross-handoff. The next acquire's wait_for_token fast path (ME_IS_HOLDER) then enters CS while CB holds a different node — two-holder race observed under concurrent counter-RMW stress.
Add user-managed ref_count and pin_count to each NRHT entry, plus four new ioctls (REF_INC, REF_DEC, PIN_INC, PIN_DEC) for caller-driven RMW under NRHT shard ME. dec-from-zero rejects with -EINVAL, inc-from-UINT32_MAX with -EOVERFLOW.
Layout: counters consume two __le32 in the existing CL0 reserved space (offsets 40-47); 128B entry size unchanged.
Tests:
- test_ioctl.c §3.5 covers single-process semantics (initial value, bounded overflow/underflow, ENOENT on missing entry).
- test_nrht_race.c Test4 stresses balanced concurrent inc/dec across 8 workers and asserts final == 0 + zero ioctl errors. Worker logs first failed op to stderr; harness aborts on first round failure for clean dmesg capture.
- run_bench bundles ref/pin INC/DEC into the per-iter timed loop on the iter's own entry so all 7 ops share a shard, scaling with cfg->num_shards. Sweep summary gains a counter ops section.
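The bounded semantics of the four counter ioctls can be sketched as two small helpers. This assumes the caller already holds the shard's mutual exclusion, so a plain read-modify-write is enough; the function names are hypothetical, and endianness conversion of the on-media `__le32` field is omitted.

```c
#include <errno.h>
#include <stdint.h>

/* Increment a 32-bit counter; refuse to wrap past UINT32_MAX. */
static int counter_inc(uint32_t *counter)
{
    if (*counter == UINT32_MAX)
        return -EOVERFLOW;
    (*counter)++;
    return 0;
}

/* Decrement a 32-bit counter; refuse to go below zero. */
static int counter_dec(uint32_t *counter)
{
    if (*counter == 0)
        return -EINVAL;
    (*counter)--;
    return 0;
}
```

Rejecting instead of wrapping means an unbalanced caller gets an error at the faulty call rather than silently corrupting the count.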
Cover the four NRHT_REF/PIN_INC/DEC ioctls with usage examples and the overflow/underflow semantics. Note that FIND_NAME returns the counters alongside the offset.
bootstrap_dump_slots() used PAGE_SIZE as its scnprintf bound while
sysfs_debug's bootstrap_dump_show() called it with `buf + n` after
writing a per-mount header. Each scnprintf could thus write up to
PAGE_SIZE bytes past the caller's offset, overrunning the sysfs page
into adjacent slab objects.
Symptom: GPF in fdget/filp_flush after reading bootstrap_dump, with
non-canonical addresses decoding to ASCII fragments emitted by this
helper ("=CLAIMED", " node_id", "slot[N] stat...").
Add a bufsize parameter and pass PAGE_SIZE - n from the show callback;
guard the loop against n >= bufsize.
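The fix pattern is to pass the remaining space, not the full page size, to every append. Below is a userspace sketch, assuming a small `scn()` helper that mimics the kernel's scnprintf() return convention (bytes actually written, excluding the NUL); `scn`, `dump_slots`, and `dump_show` are hypothetical names for illustration.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for scnprintf(): returns the number of bytes
 * actually stored in buf, never more than size - 1. */
static size_t scn(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (size == 0)
        return 0;
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    if (n < 0)
        return 0;
    return (size_t)n >= size ? size - 1 : (size_t)n;
}

/* Helper takes the space that is really left at buf, and bounds every
 * append by bufsize - n. */
static size_t dump_slots(char *buf, size_t bufsize, int nslots)
{
    size_t n = 0;

    for (int i = 0; i < nslots && n < bufsize; i++)
        n += scn(buf + n, bufsize - n, "slot[%d] state=CLAIMED\n", i);
    return n;
}

/* Show callback: after writing its header, it hands the helper only
 * the remainder of the page. */
static size_t dump_show(char *page, size_t page_size, int nslots)
{
    size_t n = scn(page, page_size, "mount header\n");

    n += dump_slots(page + n, page_size - n, nslots);
    return n;
}
```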
Decompose me.h (~700 LOC) into three focused headers:
- me.h: public API + DRAM types only
- me_inline.h: inline helpers needing instance struct visibility
- me_layout.h: on-disk CXL layout (header/CB/membership/slot)
Move cold-path helpers (me_leave_successor, me_membership_tick_heartbeat) out of inline header into me.c. Consolidate per-shard arrays into struct marufs_me_shard and DRAM is_holder fast path. Wire callers (bootstrap, me_order, me_request, nrht, sysfs) to new layout.
Split marufs_check_permission into two layers:
- marufs_check_permission_any(candidate, *out_granted): returns the granted subset of candidate bits, letting callers branch on which rights matched. Replaces ADMIN-then-GRANT two-call patterns.
- marufs_check_permission: thin AND-semantics wrapper.
Inline deleg matching into _any (drops marufs_deleg_matches) and bound the loop by deleg_num_entries instead of MAX_ENTRIES.
Centralize ioctl perm precheck via marufs_ioctl_required_perm(cmd) table at dispatcher entry, removing per-case marufs_check_permission calls from NAME_OFFSET / BATCH_* / FIND_NAME / CLEAR_NAME / NRHT_INIT and from DMABUF_EXPORT / CHOWN. PERM_GRANT keeps self-check inside its ME critical section, now using _any to evaluate ADMIN|GRANT in one call.
Move nrht_refcnt_op_t typedef from file.c to nrht.h. Extend test_nrht_race with run_test5 (new race scenario) and tighten test3/test4 coverage.
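The two-layer split can be sketched over a plain bitmask. The permission bit values, the `held` parameter, and the bool return types below are assumptions for illustration; the real functions resolve the held rights from the RAT entry and delegation table instead of taking them as an argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits. */
#define PERM_READ  0x1u
#define PERM_WRITE 0x2u
#define PERM_ADMIN 0x4u
#define PERM_GRANT 0x8u

/* "Any" layer: report which of the candidate bits are held, so the
 * caller can branch on the matched subset after a single lookup. */
static bool check_permission_any(uint32_t held, uint32_t candidate,
                                 uint32_t *out_granted)
{
    uint32_t granted = held & candidate;

    if (out_granted)
        *out_granted = granted;
    return granted != 0;
}

/* AND-semantics wrapper: every required bit must be held. */
static bool check_permission(uint32_t held, uint32_t required)
{
    uint32_t granted;

    check_permission_any(held, required, &granted);
    return granted == required;
}
```

A caller that accepts either ADMIN or GRANT can pass `PERM_ADMIN | PERM_GRANT` once and inspect the granted subset, instead of issuing two separate checks.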
Concurrent CHOWN race: precheck ran before me->acquire(), so two callers with default_perms ADMIN could both pass and serialize on the lock. The first chown stripped ADMIN (default_perms=0, deleg cleared), but the second never re-checked and still won, letting ownership transfer twice.
Fix:
- Add marufs_check_permission(ADMIN) inside __marufs_ioctl_chown_locked before the ALLOCATED→ALLOCATING CAS.
- Drop CHOWN from the lock-free precheck table (handler self-checks), matching the PERM_GRANT pattern.
- Same in-lock recheck added to perm_set_default for symmetry: a caller that relied on default_perms ADMIN can be demoted by a concurrent perm_set_default/chown writing default_perms=0.
Add per-sbi vm_ops wrapper that copies underlying device_dax ops and overrides .open/.close/.mprotect to enforce RAT delegation on mprotect. mmap-time RAT check is no longer the sole gate.
Wrapper details:
- sbi-embedded vm_ops, lazy-seeded at first mmap under vm_ops_lock
- xi pointer stashed in vma->vm_private_data; igrab on attach, iput in .close, igrab on .open for vma split/clone refcount balance
- container_of(vma->vm_ops, sbi, vm_ops) recovers sbi at hook time (vma->vm_file = dax_filp after device_dax delegation)
Hardening flags applied to every marufs vma:
- VM_DONTCOPY: fork() drops the mapping; child re-mmap forces RAT recheck
- VM_DONTEXPAND: mremap() cannot grow past original mmap size
- VM_DONTDUMP: KV-cache contents excluded from coredumps
Lock split: revert the prior sb_lock merge that caused soft lockups when me_poll_thread held the unified lock for full poll cycles.
- me_list_lock: poll thread + register/unregister
- nrht_me_lock: nrht_me[] creation
- vm_ops_lock: lazy seed (and future hot-path use)
Remove daxheap support entirely:
- Drop CONFIG_DAXHEAP and DAXHEAP_DIR from Makefile / install.sh
- Remove daxheap= and daxheap_import_id= mount options
- Drop MARUFS_IOC_DMABUF_EXPORT ioctl and dmabuf_req struct
- Remove enum marufs_dax_mode (DEV_DAX is the only mode)
- Drop sbi->heap_dmabuf, marufs_dax_acquire_daxheap, /sys/fs/marufs/daxheap_bufid
Tests (tests/test_mmap.c):
- run_vm_protect: mprotect basics, RDONLY-fd escalation block, VM_DONTCOPY fork SIGSEGV, VM_DONTEXPAND mremap reject (with and without MAYMOVE), partial mprotect vma split, mremap MOVE-only success, 200-iter split+merge stress for igrab balance
- run_vm_protect_cross: cross-node escalation block. Owner grants READ-only to peer, peer mprotect(PROT_RW) rejected by RAT WRITE check; after additional WRITE grant, mprotect succeeds
Add two-layer defense against post-exec fd reuse / hostile re-execve:
1. RAT exe_inode binding (acl.c + region.c)
   - owner check now compares current task's exe inode/dev against owner_exe_inode_ino/dev stored in the RAT entry at create time. Catches execve into different binary.
2. FD_CLOEXEC enforcement at data access (file.c)
   - mmap/read/ioctl reject with -EACCES when the calling fd is not close_on_exec. Catches same-binary re-execve (hostile argv) which exe_inode binding alone cannot detect.
Cannot enforce O_CLOEXEC at .open: VFS strips O_CLOEXEC from f_flags (it's stored in fdtable.close_on_exec) and the fd is not yet installed when ->open runs. Check moves to mmap/read/ioctl entry where fdtable lookup is possible.
Tests:
- test_postexec_attack: integrated into test_local_multinode.sh as Section 32 (Bootstrap Chaos shifts to Section 33). Two modes: no cloexec (parent mmap blocked) and --cloexec (execve closes fd).
- test_negative: new Section 0 verifying mmap without FD_CLOEXEC returns EACCES.
- 13 existing test sources updated to pass O_CLOEXEC on open().
Docs: 0_user_guide.md gains an O_CLOEXEC requirement bullet and a Security section paragraph explaining the fd-level check.
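From the application side, meeting the FD_CLOEXEC requirement means opening with O_CLOEXEC, or setting the flag before the first data access. The sketch below uses only standard POSIX calls; the helper names are hypothetical, and the path passed in would be a file under a marufs mount in real use.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Open with close-on-exec set atomically at open time. */
static int open_cloexec(const char *path, int flags)
{
    return open(path, flags | O_CLOEXEC);
}

/* Userspace view of the per-fd flag that the data-access check is
 * described as consulting. */
static bool fd_is_cloexec(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);

    return fdflags >= 0 && (fdflags & FD_CLOEXEC) != 0;
}
```

An fd inherited without the flag can be repaired with `fcntl(fd, F_SETFD, FD_CLOEXEC)`, though setting it at `open()` avoids a window in which a concurrent fork and exec could leak the descriptor.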
Summary
`marufs_kernel/`: Linux kernel filesystem module for CXL shared memory, enabling cross-node file sharing via DAX-mapped CXL memory pools
Structure
Test plan
- `make` builds `marufs.ko` without errors on target kernel
- `sudo ./install.sh` loads module and mounts filesystem
- `sudo ./tests/test_local_multinode.sh` passes all multinode tests