feat: add marufs kernel module — CXL shared-memory filesystem#41

Closed
moonchan-park wants to merge 45 commits into main from mcpark/feat/marufs-kernel
@moonchan-park moonchan-park commented Apr 10, 2026

Collaborator

Summary

  • Add marufs_kernel/ — Linux kernel filesystem module for CXL shared memory, enabling cross-node file sharing via DAX-mapped CXL memory pools
  • CAS-based lock-free global index (4-shard hash table), RAT (2KB CL-aligned per-file metadata), NRHT (application-level name→offset mapping), 3-stage ACL (owner→default→delegation), 4-phase background GC
  • Architecture docs (6 detailed + 1 overview), test suite (ioctl, mmap, cross-process, chown race, multinode), build/install scripts
  • GPL-2.0-only (kernel module requirement)

Structure

marufs_kernel/
├── src/           # Kernel module source (super, dir, inode, file, index, region, nrht, acl, gc, sysfs)
├── include/       # Userspace API header (marufs_uapi.h)
├── docs/          # Architecture docs (metadata layout, entry lifecycle, GC, NRHT, ACL, mount/IO)
├── tests/         # Test suite (C test programs + shell harness)
├── Makefile       # Kernel module build
└── install.sh     # Build + insmod + mount helper
docs/source/design_doc/
└── marufs_kernel_module_architecture.md  # Architecture overview with mermaid diagrams

Test plan

  • make builds marufs.ko without errors on target kernel
  • sudo ./install.sh loads module and mounts filesystem
  • sudo ./tests/test_local_multinode.sh passes all multinode tests

@github-actions

github-actions Bot commented Apr 10, 2026


@moonchan-park moonchan-park left a comment

Collaborator Author

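PR #41 review summary

Why this PR?

The Maru project currently supports only TCP/RPC-based userspace memory management. Letting multiple nodes access CXL shared memory directly through a filesystem interface requires a kernel module; without this PR, cross-node file sharing needs network round trips. The marufs kernel module implements a lock-free filesystem on top of a DAX-mapped CXL memory pool, enabling zero-copy data sharing between nodes through the standard VFS interface (open/mmap/read/write).

Design overview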

graph TB
    subgraph UserSpace["Userspace"]
        app["Application"]
        test["Test Suite"]
    end

    subgraph KernelMod["marufs Kernel Module"]
        subgraph VFS["VFS Layer"]
            super["super.c -- mount, umount, format"]
            dir["dir.c -- readdir, lookup, create, unlink"]
            inode["inode.c -- iget, new_inode, evict"]
            file["file.c -- mmap, ftruncate, ioctl"]
        end

        subgraph DataLayer["Data Layer"]
            idx["index.c -- 4-shard CAS hash index"]
            region["region.c -- RAT allocator, 2KB entries"]
            nrht["nrht.c -- Name-Ref Hash Table"]
        end

        subgraph SecLayer["Security Layer"]
            acl["acl.c -- 3-stage ACL check"]
        end

        subgraph MaintLayer["Maintenance"]
            gc["gc.c -- 4-phase background GC"]
            sysfs_mod["sysfs.c -- stats export"]
        end
    end

    subgraph HW["Hardware"]
        cxl["CXL Shared Memory / DAX Device"]
    end

    app -->|"open, read, mmap, ioctl"| file
    test -->|"ioctl, mmap"| file
    super --> dir
    super --> inode
    super --> file
    dir --> idx
    file --> idx
    file --> region
    file --> acl
    file --> nrht
    inode --> idx
    gc -.->|"sweep"| idx
    gc -.->|"reclaim"| region
    gc -.->|"stale cleanup"| nrht
    idx --> cxl
    region --> cxl
    nrht --> cxl

CXL memory layout

block-beta
    columns 4
    sb["Superblock 4KB"]:1
    shards["Global Index Shards x4"]:1
    rat["RAT Header + 256 Entries"]:1
    data["Region Data"]:1

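Key data flows

| Path | Flow |
| --- | --- |
| File creation | create -> index claim EMPTY -> RAT alloc -> link and publish VALID |
| File read | lookup -> index hash search -> RAT entry -> DAX direct read |
| mmap | open -> ftruncate(region alloc) -> mmap -> DAX fault handler |
| GC | Phase1: dead process reclaim -> Phase2: stale INSERTING sweep -> Phase3: orphan sweep -> Phase4: NRHT stale sweep |

Review result summary

Overall, the lock-free CAS design, state machine, and memory-barrier handling are well structured. The kernel module code quality is high, but the issues below should be reviewed before merge.

| Severity | Count | Key issues |
| --- | --- | --- |
| Critical | 3 | claim_entry state not validated, FIND_NAME missing permission check, batch buffer sizing |
| High | 6 | i_size_write without holding the lock, mmap file reference leak, force-unlock not using CAS, checksum not implemented, d_revalidate performance, compat_ioctl missing |
| Medium | 3 | GC thread stop check, sysfs buffer overflow, CXL 2.0 barrier |
| Minor | 2 | hardcoded user path, missing test binaries |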

Comment thread marufs_kernel/src/index.c
Comment thread marufs_kernel/src/file.c
Comment thread marufs_kernel/src/file.c Outdated
Comment thread marufs_kernel/src/file.c
Comment thread marufs_kernel/src/file.c Outdated
Comment thread marufs_kernel/src/super.c
Comment thread marufs_kernel/src/file.c
Comment thread marufs_kernel/src/gc.c Outdated
Comment thread marufs_kernel/install.sh Outdated
Comment thread marufs_kernel/src/nrht.c
@moonchan-park moonchan-park requested a review from a team April 10, 2026 02:56

@youngrok-XCENA youngrok-XCENA left a comment

Collaborator

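Review summary

Purpose of this PR

Adds marufs, a Linux kernel filesystem module that lets multiple nodes access and manage CXL (Compute Express Link) shared memory at file granularity.

Problem without it: the CXL memory pool cannot be handled through a POSIX file interface, so applications must mmap the DAX device directly and implement multi-node metadata synchronization themselves.

Architecture design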

graph TB
    subgraph VFS["VFS Layer"]
        super["super.c: mount, umount, mkfs"]
        dir["dir.c: readdir, lookup, create"]
        inode_m["inode.c: iget, evict, getattr"]
        file_m["file.c: mmap, ftruncate, ioctl"]
    end
    subgraph CXL["CXL Shared Data Layer"]
        index_m["index.c: 4-shard lock-free hash"]
        region["region.c: RAT region allocator"]
        nrht["nrht.c: Name-Ref Hash Table"]
    end
    subgraph SEC["Security"]
        acl["acl.c: 3-stage ACL with delegation"]
    end
    subgraph BG["Background"]
        gc["gc.c: 4-phase GC sweep"]
        sysfs_mod["sysfs.c: stats and tunables"]
    end
    super --> dir
    super --> inode_m
    super --> file_m
    dir --> index_m
    file_m --> index_m
    file_m --> region
    file_m --> acl
    file_m --> nrht
    inode_m --> index_m
    gc -.-> index_m
    gc -.-> region
    gc -.-> nrht

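Key design characteristics

| Item | Description |
| --- | --- |
| Lock-free concurrency | CAS-based 4-shard hash index, NRHT, delegation ACL |
| WORM semantics | Region is allocated only once via ftruncate; no reallocation |
| 4-phase GC | orphan detection - timeout wait - CAS reclaim - region sweep |
| DAX direct mapping | Bypasses the page cache; CXL memory is accessed directly via mmap |

Key findings

| Severity | Count | Representative issues |
| --- | --- | --- |
| Critical | 2 | Data loss on reboot (fstab format), NRHT entry CAS hijack |
| High | 4 | DAXHEAP TOCTOU, non-atomic delegation transition, GC race, unprotected i_size write |
| Medium | 3 | Wrong errno, alloc_lock timeout unreachable, x86-only cache flush |
| Low | 1 | Hardcoded developer path |

See the inline comments for details.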

Comment thread marufs_kernel/setup-autoload.sh Outdated
cat >> /etc/fstab << FSTAB

# MARUFS CXL filesystem (auto-generated by setup-autoload.sh)
none ${MOUNT_POINT} ${MODULE_NAME} daxdev=${DAX_DEVICE},node_id=${NODE_ID},format,nofail 0 0

Collaborator

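[critical] Filesystem is formatted on every reboot -- data loss

The format mount option is written permanently to fstab. The filesystem is reinitialized on every reboot and all data is lost. The systemd unit at L243 has the same problem.

The format option must be removed after the initial format. I recommend excluding format by default when registering with fstab/systemd, and formatting only through a separate --format flag.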

Collaborator Author

Fixed in 29243e0

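Removed the format option from both fstab and systemd.

CXL is volatile memory, so a format is indeed needed somewhere on each boot, but hardcoding it in the autoload script is dangerous. A separate format_if_needed mount option is planned:

  • Validate magic + CRC32 at mount time
  • If valid, reuse the existing metadata; if invalid, perform an idempotent format
  • Works safely in multi-node environments as well, based on a CRC commit point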
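
The planned format_if_needed behavior can be pictured with a small userspace sketch. Everything here is an assumption made for illustration: `struct sketch_sb`, `SKETCH_MAGIC`, and the field layout are hypothetical and are not the module's real superblock, and the sketch ignores the cache flushes and cross-node races a real CXL implementation would have to handle. It only shows the idea of validating magic + CRC32 and writing the CRC last as the commit point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC 0x4D415255u /* hypothetical magic value */

/* Hypothetical superblock: the CRC covers every field placed before it. */
struct sketch_sb {
    uint32_t magic;
    uint32_t version;
    uint64_t total_size;
    uint32_t crc;
    uint32_t pad;
};

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
static uint32_t crc32_ieee(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static bool sb_is_valid(const struct sketch_sb *s)
{
    return s->magic == SKETCH_MAGIC &&
           s->crc == crc32_ieee(s, offsetof(struct sketch_sb, crc));
}

/* Idempotent format: the same inputs always produce the same bytes. */
static void sb_format(struct sketch_sb *s, uint64_t total_size)
{
    memset(s, 0, sizeof(*s));
    s->magic = SKETCH_MAGIC;
    s->version = 1;
    s->total_size = total_size;
    /* Commit point: the CRC is written last, after all covered fields. */
    s->crc = crc32_ieee(s, offsetof(struct sketch_sb, crc));
}

/* Returns true if a format was performed, false if metadata was reused. */
static bool format_if_needed(struct sketch_sb *s, uint64_t total_size)
{
    if (sb_is_valid(s))
        return false;
    sb_format(s, total_size);
    return true;
}
```

Calling `format_if_needed()` on an uninitialized buffer formats it once; later calls leave valid metadata untouched, which is the property the autoload script could not provide with a hardcoded format option.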

Comment thread marufs_kernel/src/nrht.c
struct marufs_nrht_entry *e)
{
u32 st = READ_LE32(e->state);
if (marufs_le32_cas(&e->state, st, MARUFS_ENTRY_INSERTING) != st)

Collaborator

[critical] CAS succeeds from any state, including INSERTING -- entry can be hijacked

If st is already MARUFS_ENTRY_INSERTING, the CAS succeeds as INSERTING->INSERTING and overwrites created_at and inserter_node of an entry that another node is still inserting. This breaks the original inserter's work.

Check st == MARUFS_ENTRY_EMPTY || st == MARUFS_ENTRY_TOMBSTONE before the CAS.

Collaborator Author

Fixed in eea4555. Added st == EMPTY || st == TOMBSTONE validation before CAS in both index_claim_entry and nrht_claim_entry.
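
For readers following the thread, here is a minimal userspace sketch of the guarded claim, written with C11 atomics rather than the module's marufs_le32_cas helper. The state values, `struct sketch_entry`, and `sketch_claim_entry` are hypothetical names for illustration; the actual fix in eea4555 may be structured differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state values; the real MARUFS_ENTRY_* constants may differ. */
enum sketch_state {
    SK_EMPTY = 0,
    SK_INSERTING = 1,
    SK_VALID = 2,
    SK_TOMBSTONE = 3,
};

struct sketch_entry {
    _Atomic uint32_t state;
    uint32_t inserter_node;
};

/* Returns true only if this caller moved the slot EMPTY/TOMBSTONE -> INSERTING. */
static bool sketch_claim_entry(struct sketch_entry *e, uint32_t node_id)
{
    uint32_t st = atomic_load_explicit(&e->state, memory_order_acquire);

    /* Guard: never CAS from INSERTING or VALID, which would hijack a live slot. */
    if (st != SK_EMPTY && st != SK_TOMBSTONE)
        return false;

    if (!atomic_compare_exchange_strong_explicit(&e->state, &st, SK_INSERTING,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        return false; /* another claimer changed the state first */

    e->inserter_node = node_id; /* safe: this caller owns the slot now */
    return true;
}
```

Without the guard, a second caller that reads INSERTING would issue an INSERTING -> INSERTING CAS that succeeds and then overwrite the first inserter's fields, which is the hijack described in the review comment.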

Comment thread marufs_kernel/src/super.c Outdated
marufs_daxheap_bufid);
return -EEXIST;
}
mutex_unlock(&marufs_daxheap_lock);

Collaborator

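[high] TOCTOU race condition in DAXHEAP primary allocation

L452 checks marufs_daxheap_bufid == 0, L459 releases the lock, and L461 calls daxheap_kern_alloc() without the lock. If two threads attempt a primary mount at the same time, the buffer is allocated twice.

Suggestion: hold the lock until the alloc completes, or swap in bufid with a CAS after allocating.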

Collaborator Author

Fixed in d0ef7e1

Changed to hold the mutex from the check through alloc and the bufid write. Added mutex_unlock on every error path.

Comment thread marufs_kernel/src/acl.c Outdated
sizeof(*de)); /* Ensure all fields visible before state transition */

/* Publish: GRANTING → ACTIVE (now safe for readers) */
WRITE_LE32(de->state, MARUFS_DELEG_ACTIVE);

Collaborator

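[high] GRANTING -> ACTIVE state transition is a plain WRITE, not a CAS

While the entry is in GRANTING state, another node (e.g. GC) may already have changed the state, but the code overwrites it without checking.

Change this to marufs_le32_cas(&de->state, MARUFS_DELEG_GRANTING, MARUFS_DELEG_ACTIVE) so the transition happens only from the expected state.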

Collaborator Author

Fixed in d0ef7e1

WRITE_LE32(de->state, ACTIVE) → marufs_le32_cas(&de->state, GRANTING, ACTIVE). On CAS failure (GC has already moved the entry to EMPTY), return -EAGAIN so the caller retries.
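
The publish step described in this reply can be sketched in userspace with C11 atomics. The enum values and `sketch_deleg_publish` are hypothetical stand-ins for the module's MARUFS_DELEG_* constants and helper; the sketch only illustrates "transition from the expected state or report -EAGAIN".

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical values standing in for MARUFS_DELEG_* states. */
enum sketch_deleg_state {
    SK_DELEG_EMPTY = 0,
    SK_DELEG_GRANTING = 1,
    SK_DELEG_ACTIVE = 2,
};

/*
 * Publish GRANTING -> ACTIVE only if the state is still GRANTING.
 * Release ordering on success makes earlier field writes visible to
 * readers that observe ACTIVE with acquire ordering.
 */
static int sketch_deleg_publish(_Atomic uint32_t *state)
{
    uint32_t expected = SK_DELEG_GRANTING;

    if (atomic_compare_exchange_strong_explicit(state, &expected,
                                                SK_DELEG_ACTIVE,
                                                memory_order_release,
                                                memory_order_relaxed))
        return 0;

    /* Someone (for example GC) changed the state first; let the caller retry. */
    return -EAGAIN;
}
```

A plain store would resurrect an entry that GC had already reclaimed; the CAS leaves the reclaimed state untouched and hands the decision back to the caller.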

Comment thread marufs_kernel/src/gc.c
if (sbi->gc_orphan_count >= MARUFS_GC_ORPHAN_MAX)
return;

i = sbi->gc_orphan_count++;

Collaborator

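[high] Non-atomic increment of gc_orphan_count -- race condition

gc_orphan_count and the gc_orphans array are accessed without a lock. If marufs_gc_track_orphan() can be called concurrently from multiple CPUs, for example from the GC sweep path, a race condition occurs.

Either guarantee that it is called from a single thread only, or protect it with a spinlock/atomic.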

Collaborator Author

Fixed in d0ef7e1

GC runs only as a single kthread, so there is no race. This is now stated in the marufs_gc_track_orphan comment.

Comment thread marufs_kernel/src/inode.c Outdated
struct marufs_rat_entry *rat_e =
marufs_rat_entry_get(sbi, xi->rat_entry_id);
if (rat_e) {
inode->i_size = READ_LE64(rat_e->size);

Collaborator

[high] i_size is modified directly without i_rwsem

VFS requires i_rwsem protection or the use of i_size_write() when changing i_size. Concurrent getattr and read_iter calls may see an inconsistent size.

In the getattr callback, it is safer to write stat->size directly instead of updating inode->i_size.

Collaborator Author

Fixed in 99cc329

Changed to write directly to stat->size, as youngrok suggested. fillattr is called first, then stat->size/stat->blocks are overridden with the fresh RAT value. inode->i_size is no longer updated at all, so i_rwsem is not needed.

Comment thread marufs_kernel/src/super.c Outdated
for (i = 0; i < MARUFS_MAX_RAT_ENTRIES; i++) {
struct marufs_rat_entry *entry = marufs_rat_entry_get(sbi, i);
if (!entry)
return -1;

Collaborator

[medium] Wrong error code returned

return -1 corresponds to -EPERM. A VFS statfs callback should return a meaningful errno. A marufs_rat_entry_get() failure is an I/O error, so return -EIO is appropriate.

Collaborator Author

Already fixed in an earlier commit (5b5dfb8). return -1 → return -EIO (super.c statfs).

Comment thread marufs_kernel/src/region.c Outdated
*
* Stale lock recovery: if holder crashed, force-unlock after timeout.
*/
while (retries < MARUFS_REGION_INIT_MAX_RETRIES) {

Collaborator

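[medium] alloc_lock retry/timeout logic mismatch -- 5-second timeout is unreachable

retries++ increments only on the cpu_relax() path. cpu_relax() takes nanoseconds, so reaching MARUFS_REGION_INIT_MAX_RETRIES (10) takes only microseconds, and the 5-second timeout + force-unlock path is effectively unreachable.

Suggestion: count retries as the number of force-unlocks, or use usleep_range() instead of cpu_relax to guarantee real wait time.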

Collaborator Author

Fixed in 21ccbb5.

cpu_relax() → usleep_range(500000, 600000) (500ms). 10 retries × 500ms = 5s, so the timeout force-unlock path is now actually reachable. Added #include <linux/delay.h>.
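
The timing argument can be checked with a small simulation. This is a hedged sketch, not the code in region.c: `fake_now_us`, `fake_sleep`, and `sketch_lock_acquire` are hypothetical, the clock is simulated so nothing really sleeps, and the loop structure is only one plausible way to combine a retry budget with a stale-holder timeout and a CAS-based force-unlock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_RETRIES      10u       /* mirrors MARUFS_REGION_INIT_MAX_RETRIES (10) */
#define SK_STALE_TIMEOUT_US 5000000u  /* the 5 s force-unlock timeout */

/* Simulated clock so the sketch finishes instantly; a kernel loop would sleep. */
static uint64_t fake_now_us;

static void fake_sleep(uint32_t us)
{
    fake_now_us += us;
}

/*
 * Lock word: 0 = free, otherwise the holder's id.
 * held_since_us is when the current holder took the lock; a real lock
 * would keep that timestamp next to the lock word.
 */
static bool sketch_lock_acquire(_Atomic uint32_t *lock, uint32_t me,
                                uint64_t held_since_us, uint32_t sleep_us)
{
    uint32_t retries = 0;

    for (;;) {
        uint32_t cur = 0;

        if (atomic_compare_exchange_strong(lock, &cur, me))
            return true; /* lock was free */

        if (fake_now_us - held_since_us >= SK_STALE_TIMEOUT_US) {
            /* Force-unlock with CAS: clear only the holder we observed. */
            atomic_compare_exchange_strong(lock, &cur, 0u);
            cur = 0;
            return atomic_compare_exchange_strong(lock, &cur, me);
        }

        if (retries++ >= SK_MAX_RETRIES)
            return false; /* retry budget spent before the timeout elapsed */

        fake_sleep(sleep_us);
    }
}
```

With `sleep_us = 500000`, ten waits add up to the 5 s timeout and the force-unlock branch runs; with a wait of about 1 µs (a rough stand-in for spinning on cpu_relax()), the budget runs out after roughly 10 µs and the stale lock is never recovered.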

Comment thread marufs_kernel/src/marufs_layout.h Outdated
*/
#ifdef CONFIG_MARUFS_CXL2_COMPAT

static inline void __marufs_cxl_flush_range(const void *addr, size_t len)

Collaborator

[medium] clwb/clflushopt are x86-only -- compile failure on other architectures

clwb(p) (L403) and clflushopt(p) (L413) are x86-only instructions. ARM64/RISC-V builds fail to compile.

Add an #ifdef CONFIG_X86 guard, or consider an architecture-independent API such as arch_wb_cache_pmem(). Even though CXL is currently used mainly on x86, it is worth stating the supported scope with an #error message.

Collaborator Author

Fixed in 21ccbb5.

Replaced the direct x86-only clwb/clflushopt calls with the kernel's arch-portable APIs arch_wb_cache_pmem() / arch_invalidate_pmem() (<linux/libnvdimm.h>). On x86 they perform the same clwb+wmb / clflushopt+mb; on non-x86 they are no-ops (assuming a WT mapping). The manual cacheline loop and the #error guard are no longer needed.

Comment thread marufs_kernel/install.sh Outdated
MOUNT_POINT=""
SKIP_BUILD=false
USE_DAXHEAP=false
DAXHEAP_DIR="${MARUFS_DAXHEAP_DIR:-/home/mcpark/daxheap}"

Collaborator

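[low] Developer-local path is hardcoded

/home/mcpark/daxheap is set as the default. In other environments the build fails when MARUFS_DAXHEAP_DIR is not set. Remove the default and print an error when it is unset, or change it to a project-relative path.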

Collaborator Author

Fixed in 21ccbb5.

Removed the DAXHEAP_DIR default → empty string. install.sh now fails immediately if --daxheap is used while DAXHEAP_DIR is unset. The same change is applied to setup_local_multinode.sh / test_local_multinode.sh. The Makefile comment also uses a generic path now.

moonchan-park added a commit that referenced this pull request Apr 10, 2026
… C3 batch sizing

C1: Add state validation before CAS in index_claim_entry and
    nrht_claim_entry — reject VALID/INSERTING states to prevent
    active entry hijack (index.c, nrht.c)

C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
    matching other NRHT ioctl permission enforcement (file.c)

C3: Size batch buffer as max of both request types to prevent
    heap overflow if MAX or struct sizes diverge (file.c)
moonchan-park added a commit that referenced this pull request Apr 10, 2026
…0 deleg CAS, H11 GC doc

- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
moonchan-park added a commit that referenced this pull request Apr 10, 2026
…ortable flush, M16 daxheap path

M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
@moonchan-park moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from a4b2829 to cd1e9ed Compare April 10, 2026 05:31

@jooho-XCENA jooho-XCENA left a comment

Contributor

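Review: 8 unresolved HIGH issues

After checking the existing reviews (moonchan-park, youngrok-XCENA) and the fix commits (eea4555, 99cc329, d0ef7e1, 21ccbb5, 8614242, etc.), I left only HIGH issues that are still unresolved and that no other reviewer has commented on.

Items confirmed as resolved (comments omitted)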

  • ✅ GC kthread_should_stop inter-phase → 21ccbb5
  • ✅ i_size_write without i_rwsem → 99cc329
  • ✅ mmap file ref restore → d0ef7e1
  • ✅ alloc_lock CAS → d0ef7e1
  • ✅ GRANTING→ACTIVE CAS → d0ef7e1
  • ✅ DAXHEAP TOCTOU → d0ef7e1
  • ✅ compat_ioctl → 8614242
  • ✅ claim_entry guard → eea4555
  • ✅ FIND_NAME perm → eea4555
  • ✅ superblock CRC32 → 5b5dfb8
  • ✅ fstab format option → 29243e0
  • ✅ arch-portable flush → 21ccbb5

Comment thread marufs_kernel/src/file.c Outdated
break;

case MARUFS_IOC_NRHT_INIT:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id,

Contributor

[HIGH] NRHT_INIT should require ADMIN permission

MARUFS_IOC_NRHT_INIT is a destructive operation: it runs memset(base, 0, total_needed) over the entire region and reformats it (nrht.c:523).

It currently requires only MARUFS_PERM_IOCTL, the same level as ordinary NRHT name lookup/store. Any process with IOCTL permission can format the entire NRHT.

// Current:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_IOCTL);

// Recommended:
ret = marufs_check_permission(fc.sbi, fc.xi->rat_entry_id, MARUFS_PERM_ADMIN);

Collaborator Author

Fixed in 27206dc. Changed MARUFS_PERM_IOCTL → MARUFS_PERM_ADMIN.

Comment thread marufs_kernel/src/file.c Outdated

dreq->fd = fd;
get_file(sbi->heap_dmabuf->file);
fd_install(fd, sbi->heap_dmabuf->file);

Contributor

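[HIGH] DMABUF export exposes the entire CXL device

sbi->heap_dmabuf represents the entire daxheap device. Exporting it here gives the user access to the superblock, RAT, global index, and even other users' region data, which bypasses the ACL completely.

Recommended fix:

  1. Require at least MARUFS_PERM_ADMIN, or
  2. Create a per-region dma_buf slice that exports only this file's region instead of the entire device

Currently READ+WRITE permission alone yields an fd for the entire device, so permission on a single file is enough to read the data of every other file.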

@moonchan-park moonchan-park Apr 14, 2026

Collaborator Author

Fixed in 27206dc. Changed MARUFS_PERM_READ | MARUFS_PERM_WRITE → MARUFS_PERM_ADMIN. This ioctl exists so that other nodes can mount marufs using the same daxheap fd, so we do not plan to provide a per-region slice feature.

Comment thread marufs_kernel/src/sysfs.c Outdated

static struct kobj_attribute gc_pause_attr =
__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);

Contributor

[HIGH] gc_pause is world-writable (0644)

With 0644, any user can pause/resume GC. When GC stops, dead regions and orphan entries accumulate, leading to memory exhaustion and denial of service.

In addition, gc_pause_store, gc_trigger_store, gc_stop_store, and gc_restart_store all lack a capable(CAP_SYS_ADMIN) check, so unprivileged users can operate them in container/user-namespace environments.

// Current:
__ATTR(gc_pause, 0644, gc_pause_show, gc_pause_store);

// Recommended:
__ATTR(gc_pause, 0600, gc_pause_show, gc_pause_store);
// + add to the store functions:
if (!capable(CAP_SYS_ADMIN))
    return -EPERM;

Collaborator Author

Fixed in 27206dc. Changed the gc_pause permission 0644 → 0600 and added a capable(CAP_SYS_ADMIN) check to all four store functions (gc_trigger/gc_stop/gc_pause/gc_restart).

Comment thread marufs_kernel/src/nrht.c
buckets_per_shard = 1;
}
buckets_per_shard = roundup_pow_of_two(buckets_per_shard);

Contributor

[HIGH] num_buckets upper bound is not validated: roundup_pow_of_two overflow

num_buckets is user-controlled through the NRHT_INIT ioctl. If num_buckets is very large (e.g. 0xFFFFFFFF) and num_shards=1:

  1. buckets_per_shard = 0xFFFFFFFF
  2. roundup_pow_of_two(0xFFFFFFFF) → on 64-bit, 1UL << 32 = 0x100000000
  3. bucket_array_size = 0x100000000 * 4 → u64 overflow or a broken total_needed calculation
  4. memset(base, 0, (size_t)total_needed) can write to CXL memory beyond the region boundary

// Add an upper-bound check before roundup_pow_of_two:
if (buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
    return -EINVAL;
buckets_per_shard = roundup_pow_of_two(buckets_per_shard);
if (buckets_per_shard == 0 || buckets_per_shard > MARUFS_NRHT_MAX_ENTRIES)
    return -EINVAL;

Collaborator Author

Fixed in f74049b. Added MARUFS_NRHT_MAX_ENTRIES upper-bound checks before and after roundup_pow_of_two.
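
The bound-then-round pattern can be tried out in userspace. This is a sketch under stated assumptions: `SK_NRHT_MAX_ENTRIES` is a made-up cap, `sketch_roundup_pow2` is a plain loop standing in for the kernel's roundup_pow_of_two(), and dividing num_buckets by num_shards is an assumption about how buckets_per_shard is derived.

```c
#include <errno.h>
#include <stdint.h>

#define SK_NRHT_MAX_ENTRIES (1u << 20) /* hypothetical cap, not the module's value */

/* Next power of two >= v. The caller must bound v first (see the check below). */
static uint64_t sketch_roundup_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Validates user-controlled sizes before they feed any size calculation. */
static int sketch_buckets_per_shard(uint32_t num_buckets, uint32_t num_shards,
                                    uint32_t *out)
{
    uint64_t per_shard;

    if (num_shards == 0)
        return -EINVAL;

    per_shard = num_buckets / num_shards;
    if (per_shard == 0)
        per_shard = 1;

    /* Reject oversized requests before rounding, so rounding cannot overflow. */
    if (per_shard > SK_NRHT_MAX_ENTRIES)
        return -EINVAL;

    per_shard = sketch_roundup_pow2(per_shard);

    /* Re-check: rounding can exceed a cap that is not itself a power of two. */
    if (per_shard > SK_NRHT_MAX_ENTRIES)
        return -EINVAL;

    *out = (uint32_t)per_shard;
    return 0;
}
```

For example, 1000 buckets over 4 shards yields 256 buckets per shard, while a request of 0xFFFFFFFF buckets in one shard is rejected with -EINVAL before any memset size is computed.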

Comment thread marufs_kernel/src/inode.c
return ret;

if (attr->ia_valid & ATTR_SIZE) {
struct marufs_inode_info *xi = marufs_inode_get(inode);

Contributor

[HIGH] setattr ignores changes other than ATTR_SIZE: chmod/chown/utimes have no effect

After validating with setattr_prepare(), the function handles only ATTR_SIZE and returns. ATTR_UID, ATTR_GID, ATTR_MODE, ATTR_ATIME, etc. pass validation but are never applied, so chmod/chown/utimes return success without changing anything.

By VFS convention, after setattr_prepare() passes, the filesystem should either call setattr_copy() or explicitly reject unsupported attributes with -EPERM.

if (attr->ia_valid & ATTR_SIZE) {
    // ... ftruncate handling ...
}

// Needs to be added:
setattr_copy(MARUFS_IDMAP_ARG_COMMA inode, attr);
// Or: reject explicitly if unsupported
if (attr->ia_valid & ~ATTR_SIZE)
    return -EPERM;

Collaborator Author

Fixed in f74049b. Attribute change requests other than ATTR_SIZE | ATTR_FORCE are now explicitly rejected with -EPERM. The CXL FS does not persist POSIX attrs, so rejection was chosen over setattr_copy().
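
The rejection rule amounts to a single mask test. Below is a minimal userspace sketch; the `SK_ATTR_*` bits are local stand-ins for the kernel's ATTR_* flags with illustrative values, and `sketch_check_setattr_mask` is a hypothetical helper, not a function from the module.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ATTR_* bits (illustrative values). */
#define SK_ATTR_MODE  (1u << 0)
#define SK_ATTR_UID   (1u << 1)
#define SK_ATTR_SIZE  (1u << 3)
#define SK_ATTR_FORCE (1u << 9)

/* Accept only size changes (optionally forced); reject everything else. */
static int sketch_check_setattr_mask(uint32_t ia_valid)
{
    if (ia_valid & ~(SK_ATTR_SIZE | SK_ATTR_FORCE))
        return -EPERM;
    return 0;
}
```

With this check a chmod-style request (mode bit set) fails with -EPERM instead of returning success without effect, while an ftruncate-style request passes through.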

Comment thread marufs_kernel/src/super.c
buf->f_bsize = PAGE_SIZE;
buf->f_blocks = sbi->total_size / PAGE_SIZE;
buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
buf->f_bavail = buf->f_bfree;

Contributor

[HIGH] f_bfree unsigned underflow: abnormal free space reported

sbi->total_size and used_size are both u64, so if used_size > total_size because of metadata corruption or concurrent allocation, the unsigned subtraction wraps around to a very large value.

// Current:
buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;

// Recommended:
if (used_size > sbi->total_size)
    buf->f_bfree = 0;
else
    buf->f_bfree = (sbi->total_size - used_size) / PAGE_SIZE;
buf->f_bavail = buf->f_bfree;

Collaborator Author

Fixed in f74049b. Added a used_size > sbi->total_size guard to prevent the unsigned underflow.
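
The guard is easy to verify in isolation. This is a small userspace sketch: `SK_BLOCK_SIZE` stands in for PAGE_SIZE and `sketch_free_blocks` is a hypothetical helper mirroring the f_bfree computation, not the module's statfs code.

```c
#include <stdint.h>

#define SK_BLOCK_SIZE 4096u /* stands in for PAGE_SIZE */

/* Saturating free-block count: never wraps when used_size exceeds total_size. */
static uint64_t sketch_free_blocks(uint64_t total_size, uint64_t used_size)
{
    if (used_size > total_size)
        return 0;
    return (total_size - used_size) / SK_BLOCK_SIZE;
}
```

Without the guard, total_size = 4096 and used_size = 8192 would compute (2^64 - 4096) / 4096 blocks, which tools such as df would report as an enormous amount of free space.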

Comment thread marufs_kernel/src/file.c Outdated
struct page *page = &folio->page;

zero_user_segments(page, 0, PAGE_SIZE, 0, 0);
SetPageUptodate(page);

Contributor

[HIGH] read_folio always returns a zero-filled page: data loss on the page cache path

DAX mmap is the primary access path, but sendfile(), splice(), or a non-DAX fallback path call read_folio through the page cache. Because it currently always returns a zero page, readers get zeros instead of the actual CXL memory data.

Recommended:

  1. Copy the data from CXL memory: memcpy_from_dax(page, sbi->dax_base + phys_offset + page_offset, ...), or
  2. Block page cache use entirely and serve reads only through read_iter (remove read_folio from address_space_operations and implement a suitable alternative)

Collaborator Author

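Fixed in f74049b. Instead of copying CXL data into the page cache, we chose to block this path with -EIO.

Reasons:

  • A permission check cannot be inserted at the read_folio interface, so the marufs ACL model would be bypassed
  • A copy of CXL data left in the DRAM page cache breaks cross-node consistency
  • The read() syscall is still supported: read_iter copies directly from CXL and performs the permission check

sendfile/splice are not used by the DAX FS's KV cache workloads, so blocking them has no impact.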

Comment thread marufs_kernel/src/file.c Outdated
}

set_page_dirty(page);
return VM_FAULT_LOCKED;

Contributor

[HIGH] set_page_dirty(): removed in kernel 6.8+, build failure

set_page_dirty() was removed in kernel 6.8 (replaced by folio_mark_dirty()). This module's compat.h supports kernels up to 6.17, so the build fails on recent kernels.

// Current:
set_page_dirty(page);

// Recommended:
folio_mark_dirty(page_folio(page));
// Or add a shim to compat.h:
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 8, 0)
    folio_mark_dirty(page_folio(page));
#else
    set_page_dirty(page);
#endif

Collaborator Author

Fixed in f74049b. Added a marufs_set_page_dirty() inline function to compat.h (6.8+ → folio_mark_dirty(), earlier → set_page_dirty()). file.c now calls the compat function.

moonchan-park added a commit that referenced this pull request Apr 14, 2026
… C3 batch sizing

C1: Add state validation before CAS in index_claim_entry and
    nrht_claim_entry — reject VALID/INSERTING states to prevent
    active entry hijack (index.c, nrht.c)

C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
    matching other NRHT ioctl permission enforcement (file.c)

C3: Size batch buffer as max of both request types to prevent
    heap overflow if MAX or struct sizes diverge (file.c)
moonchan-park added a commit that referenced this pull request Apr 14, 2026
…0 deleg CAS, H11 GC doc

- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
moonchan-park added a commit that referenced this pull request Apr 14, 2026
…ortable flush, M16 daxheap path

M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
@moonchan-park moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from f74049b to d2be093 Compare April 14, 2026 03:13
moonchan-park added a commit that referenced this pull request Apr 17, 2026
… C3 batch sizing

C1: Add state validation before CAS in index_claim_entry and
    nrht_claim_entry — reject VALID/INSERTING states to prevent
    active entry hijack (index.c, nrht.c)

C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
    matching other NRHT ioctl permission enforcement (file.c)

C3: Size batch buffer as max of both request types to prevent
    heap overflow if MAX or struct sizes diverge (file.c)
moonchan-park added a commit that referenced this pull request Apr 17, 2026
…0 deleg CAS, H11 GC doc

- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
moonchan-park added a commit that referenced this pull request Apr 17, 2026
…ortable flush, M16 daxheap path

M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
@moonchan-park moonchan-park force-pushed the mcpark/feat/marufs-kernel branch from 050e08b to 528e090 Compare April 17, 2026 07:06
Linux kernel filesystem module for CXL shared memory, enabling
cross-node file sharing via DAX-mapped CXL memory pools.

Core components:
- VFS layer: mount/umount, directory ops, inode lifecycle, mmap/ioctl
- CAS-based lock-free global index with sharded hash table (4 shards)
- Region Allocation Table (RAT): per-file metadata in 2KB CL-aligned entries
- NRHT (Name-Ref Hash Table): application-level name→(offset, region) mapping
- ACL: 3-stage permission model (owner → default_perms → delegation table)
- 4-phase background GC: dead process reap, stale index sweep, local tracker, NRHT

Includes architecture docs (6 detailed + 1 overview), test suite
(ioctl, mmap, cross-process, chown race, multinode), and build/install scripts.
… C3 batch sizing

C1: Add state validation before CAS in index_claim_entry and
    nrht_claim_entry — reject VALID/INSERTING states to prevent
    active entry hijack (index.c, nrht.c)

C2: Add PERM_IOCTL check to FIND_NAME and BATCH_FIND_NAME ioctls,
    matching other NRHT ioctl permission enforcement (file.c)

C3: Size batch buffer as max of both request types to prevent
    heap overflow if MAX or struct sizes diverge (file.c)
- Implement CRC32 over immutable GSB fields (magic → entries_per_shard)
- Compute + write checksum at format, validate at mount
- Remove active_nodes bitmask (to be replaced by per-node cacheline design)
- Adjust reserved padding 200 → 208 bytes to maintain 256B struct size
Persistent format option causes re-format on every reboot, destroying
all data. CXL volatile memory bootstrap will be handled by a separate
format_if_needed scheme (magic + CRC32 validation at mount time).
- file.c: wrap i_size_write with inode_lock/unlock in read path
- inode.c: write fresh RAT size to stat->size directly in getattr,
  avoiding i_size_write without i_rwsem entirely
…0 deleg CAS, H11 GC doc

- H6 file.c: restore file reference on mmap delegation failure
- H7 region.c: use CAS instead of WRITE for alloc_lock force-unlock
- H9 super.c: hold mutex through daxheap alloc to prevent TOCTOU
- H10 acl.c: CAS GRANTING→ACTIVE to guard against GC reclaim
- H11 gc.c: document gc_orphans single-thread safety
…ortable flush, M16 daxheap path

M12: GC kthread inter-phase kthread_should_stop() checks for clean shutdown
M14: alloc_lock retry uses usleep_range(500ms) so timeout path is reachable
M15: replace x86-only clwb/clflushopt with arch_wb_cache_pmem/arch_invalidate_pmem
M16: remove hardcoded /home/mcpark/daxheap, require MARUFS_DAXHEAP_DIR to be set
All ioctl structs use fixed-width types (__u32/__u64/__s32) with
identical 32/64-bit layout, so compat_ptr_ioctl suffices.
- Add per-shard CAS spinlock (shard_header->lock) to serialize
  bucket linking and post-insert dedup, eliminating TOCTOU race
- New TENTATIVE(2) entry state between INSERTING and VALID;
  VALID is now 3, TOMBSTONE is now 4
- Rewrite post_insert_dedup to walk chain directly with self-skip
  instead of using nrht_find_chain
- Move region_type=NRHT write after format to prevent false EEXIST
  on first nrht_init call
- Change double-init check from physical magic probe to RAT
  region_type check (immune to stale CXL data)
- Extract nrht_shard_lock/unlock inline helpers
- Update 4_arch_nrht.md: state diagram, transition table, insert
  flowchart, function summary for shard lock semantics
- Add TENTATIVE state to index.c insert protocol (4 → 9 steps):
  INSERTING → TENTATIVE → lock → link + dedup → VALID → unlock
- Use DRAM spinlock (marufs_shard_cache.insert_lock) instead of CXL lock
  for node-local thread serialization; cross-node handled by token ring
- Move TOMBSTONE write from post_insert_dedup to caller for consistency
  with nrht.c pattern
- Add deleg_info sysfs attribute for per-region delegation inspection
- Fix stale enum comments in marufs_layout.h (VALID 2→3, TOMBSTONE 3→4)
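
The insert sequence above (INSERTING → TENTATIVE → lock → dedup → VALID → unlock) can be sketched in user space as follows. This is a simplified model under stated assumptions: a flat slot array stands in for the bucket chain, an `atomic_flag` stands in for the DRAM insert lock, and the EMPTY=0 / INSERTING=1 values are guesses (the commits only give TENTATIVE=2, VALID=3, TOMBSTONE=4). All function and type names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

enum { ST_EMPTY = 0, ST_INSERTING = 1, ST_TENTATIVE = 2, ST_VALID = 3, ST_TOMBSTONE = 4 };
#define NSLOTS 8

struct slot {
	_Atomic uint32_t state;
	uint64_t key;
};

struct shard {
	atomic_flag insert_lock;	/* stand-in for the node-local insert lock */
	struct slot slots[NSLOTS];
};

/* Walk all slots, skipping ourselves, looking for a VALID entry with our key. */
static int dedup_finds_other(struct shard *sh, int self)
{
	for (int i = 0; i < NSLOTS; i++)
		if (i != self &&
		    atomic_load(&sh->slots[i].state) == ST_VALID &&
		    sh->slots[i].key == sh->slots[self].key)
			return 1;
	return 0;
}

/* Returns the slot index on success, -EEXIST on duplicate, -ENOSPC if full. */
static int shard_insert(struct shard *sh, uint64_t key)
{
	int self = -1;

	for (int i = 0; i < NSLOTS && self < 0; i++) {
		uint32_t old = atomic_load(&sh->slots[i].state);

		if ((old == ST_EMPTY || old == ST_TOMBSTONE) &&
		    atomic_compare_exchange_strong(&sh->slots[i].state, &old, ST_INSERTING))
			self = i;
	}
	if (self < 0)
		return -ENOSPC;
	assert(self < NSLOTS);

	sh->slots[self].key = key;
	atomic_store(&sh->slots[self].state, ST_TENTATIVE);	/* filled, not yet visible */

	while (atomic_flag_test_and_set_explicit(&sh->insert_lock, memory_order_acquire))
		;	/* spin */
	int dup = dedup_finds_other(sh, self);
	/* The caller, not the dedup helper, writes the final state. */
	atomic_store(&sh->slots[self].state, dup ? ST_TOMBSTONE : ST_VALID);
	atomic_flag_clear_explicit(&sh->insert_lock, memory_order_release);

	return dup ? -EEXIST : self;
}
```

Because lookups in this model only treat VALID entries as matches, a TENTATIVE entry is never observed as a duplicate by a concurrent inserter; the decision is made once, under the lock.
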
- sysfs: add deleg_info read/write for per-region delegation inspection
- sysfs: gc_trigger now iterates ALL registered mounts (not just first sbi)
- tests: integrate test_mmap_notrunc, test_negative, test_nrht_race,
  test_gc_deleg, test_pid_reuse into test_local_multinode.sh (Sections 25-29)
- tests: show full failure output instead of head -5 truncation
- docs: update entry lifecycle and metadata layout for TENTATIVE state
- .gitignore: add new test binaries
- NRHT_INIT ioctl: PERM_IOCTL → PERM_ADMIN (prevents non-admin format)
- DMABUF export: PERM_READ|PERM_WRITE → PERM_ADMIN (whole-device exposure)
- sysfs gc_pause: 0644 → 0600 (root-only read/write)
- sysfs gc_trigger/gc_stop/gc_pause/gc_restart: add capable(CAP_SYS_ADMIN)
Adds marufs_kernel/docs/0_user_guide.md: a scenario-oriented walk-through
covering admin multi-node setup, application region lifecycle, NRHT
name-ref publishing, and the delegation-based security model. Complements
the existing architecture docs, which focus on implementation internals.
…tegies

Replace per-shard spinlock-based insert serialization with a cross-node
Mutual Exclusion (ME) framework. A Strategy pattern exposes a common
interface backed by two implementations:

  - Order-driven (me_order.c): token ring circulating among ACTIVE nodes
  - Request-driven (me_request.c): holder scans request slots, grants

Two ME domains:
  - Global ME (S=1): serializes Index insert + RAT alloc
  - NRHT ME (S=N_shard): per-shard token, opt-in membership via
    marufs_nrht_join() pre-warm (backup path lazy-init on first insert)

Adds unified poll kthread (me_poll_thread) iterating all registered
ME instances (me_list / me_list_lock). marufs_nrht_init() now takes
me_strategy parameter selecting the implementation per filesystem.

sysfs exposes ME diagnostics; tests/monitor_me.sh + rewritten
test_nrht_race.c exercise the new paths. bench_name_ref and
setup/test_local_multinode.sh updated for the new membership model.
Replace shared-CB polling with per-(shard, node) doorbell slots to
eliminate O(N) cache-line ping-pong on the hot token-pass path.

* Token pass: writer updates CB (holder, generation) then rings the
  target's slot (from_node, cb_gen_at_write, token_seq++). Reader
  polls its own slot and cross-checks CB on seq change.
* Heartbeat moved to per-node membership slot (distributed);
  only the cached successor of the current holder watches it.
* Magic tags on CB / membership / slot — writers verify the cached
  pointer still addresses the intended record type before mutating,
  and self-deactivate the instance on mismatch. Prevents stale-layout
  writes from corrupting a newly reformatted region after a peer
  nrht_init() raced this sbi's cached ME instance.
* Token-gated order_leave: acquire each shard then pass to
  leave_successor so the leaving node is sole writer of its own slot
  during handoff; clear membership status last.
* teardown_reformat detection: skip leave() when format_generation
  mismatches cached value (CXL area was reformatted under us).
* Skip self-pass in order_poll_cycle when next_active == self.
* wait_for_token: keep last_cb_gen across phantoms so the
  gen-monotonicity filter still rejects stale passes.
* /sys/fs/marufs/me_info exposes per-shard doorbell slot state for
  debugging (from_node, token_seq, cb_gen_at_write, request fields).
* Demote order_leave acquire-failed diagnostic to pr_debug.
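
The doorbell token pass described in the first bullet can be modeled roughly like this. It is a user-space sketch with C11 atomics standing in for CXL writes and barriers; `struct me_cb`, `struct me_slot`, `pass_token`, and `poll_own_slot` are hypothetical names, and the real layout and ordering primitives may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct me_cb {			/* shared control block, one per shard */
	_Atomic uint32_t holder;
	_Atomic uint64_t generation;
};

struct me_slot {		/* per-(shard, node) doorbell, single reader */
	_Atomic uint32_t from_node;
	_Atomic uint64_t cb_gen_at_write;
	_Atomic uint64_t token_seq;
};

/* Writer: publish the CB first, then ring the target's doorbell. */
static void pass_token(struct me_cb *cb, struct me_slot *target_slot,
		       uint32_t self, uint32_t target)
{
	assert(cb && target_slot);
	uint64_t gen = atomic_load(&cb->generation) + 1;

	atomic_store_explicit(&cb->holder, target, memory_order_relaxed);
	atomic_store_explicit(&cb->generation, gen, memory_order_release);

	atomic_store_explicit(&target_slot->from_node, self, memory_order_relaxed);
	atomic_store_explicit(&target_slot->cb_gen_at_write, gen, memory_order_relaxed);
	atomic_fetch_add_explicit(&target_slot->token_seq, 1, memory_order_release);
}

/*
 * Reader: poll only our own slot; touch the CB only when token_seq moved.
 * Returns 1 if we became holder, 0 otherwise. last_gen is kept across
 * phantom rings so a stale pass cannot look new.
 */
static int poll_own_slot(struct me_cb *cb, struct me_slot *mine, uint32_t self,
			 uint64_t *last_seq, uint64_t *last_gen)
{
	uint64_t seq = atomic_load_explicit(&mine->token_seq, memory_order_acquire);

	if (seq == *last_seq)
		return 0;		/* no ring: no shared-line read at all */
	*last_seq = seq;

	uint64_t gen = atomic_load_explicit(&cb->generation, memory_order_acquire);
	if (atomic_load(&cb->holder) != self || gen <= *last_gen)
		return 0;		/* phantom or stale doorbell */
	*last_gen = gen;
	return 1;
}
```

In the idle case the reader touches only its own slot, which is the property the commit relies on to avoid multi-host polling of one cache line.
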

Stress-tested via sweep bench (order + request, shards 2..64, on
2/4/8 mounts) after each change.
* New mount option `me_strategy=order|request` (default: request).
* `marufs_nrht_init_req.me_strategy` surfaced in the setup example.
* NRHT_JOIN ioctl documented as the optional pre-warm alternative to
  lazy ring join on the first NAME_OFFSET/FIND_NAME.
Adds per-ME-instance atomic64 counters for CXL RMB traffic (cb, slot,
membership), ops->poll_cycle() invocations, and wall-clock ns spent in
poll_cycle. Exposed via /sys/fs/marufs/me_poll_stats (aggregated across
all mounted sbis); write any value to reset.

test_nrht_race.c reads the counters via sysfs around each timed bench
run (reset-then-read) and prints a poll-thread cost subtable in the
sweep summary — including per-cycle rates (cb/c, slot/c, mem/c) so
optimization deltas stay comparable across versions even when cycle
count itself shifts.
Adds __le64 pending_shards_mask to the membership slot. Each node flips
bit s when it raises a hand on shard s (request_acquire) and clears it
after CS (request_clear_own), using a bounded CAS loop with WARN_ONCE
fallback to guard against same-node races on different bits of the
same word.
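
A bounded CAS loop of the kind described might look like the sketch below. `mask_update` and `MASK_CAS_RETRIES` are hypothetical names, the retry bound is arbitrary, and the error return stands in for the WARN_ONCE fallback mentioned in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MASK_CAS_RETRIES 16	/* hypothetical bound */

/*
 * Set (set != 0) or clear (set == 0) bit `shard` in a 64-bit pending mask.
 * Other threads on the same node may flip different bits of the same word,
 * so a plain read-modify-write could lose their update; CAS retries instead.
 */
static int mask_update(_Atomic uint64_t *mask, unsigned shard, int set)
{
	assert(shard < 64);
	uint64_t bit = UINT64_C(1) << shard;
	uint64_t old = atomic_load_explicit(mask, memory_order_relaxed);

	for (int i = 0; i < MASK_CAS_RETRIES; i++) {
		uint64_t next = set ? (old | bit) : (old & ~bit);

		/* On failure `old` is refreshed with the current value. */
		if (atomic_compare_exchange_strong_explicit(mask, &old, next,
							    memory_order_acq_rel,
							    memory_order_relaxed))
			return 0;
	}
	return -EAGAIN;	/* the kernel code would warn once here */
}
```
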

request_poll_cycle fuses the former per-shard next_active() calls into
a single membership pass that collects per-peer masks, OR-reduces them
into peers_pending, and picks the round-robin successor. Holder side
then gates request_scan_and_grant on (peers_pending & (1<<s)) — idle
shards pay zero slot RMBs instead of the baseline N-1-per-shard full
scan. Masked scan further filters nodes whose bit is clear.
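
The mask-gated scan can be illustrated with plain arrays. The helper names below are hypothetical and the masks are passed in as already-read values; in the real code they would come from the per-node membership slots. Note the 64-bit shift: with up to 64 shards the bit test needs a 64-bit one, not an int.

```c
#include <assert.h>
#include <stdint.h>

/* OR-reduce every peer's pending mask in a single membership pass. */
static uint64_t peers_pending(const uint64_t *masks, unsigned n_nodes, unsigned self)
{
	uint64_t acc = 0;

	for (unsigned n = 0; n < n_nodes; n++)
		if (n != self)
			acc |= masks[n];
	return acc;
}

/* Holder-side gate: an idle shard needs no slot reads at all. */
static int shard_needs_scan(uint64_t pending, unsigned shard)
{
	assert(shard < 64);
	return (pending >> shard) & 1u;
}

/* Masked scan: count only the peers whose bit for this shard is set. */
static unsigned slots_to_read(const uint64_t *masks, unsigned n_nodes,
			      unsigned self, unsigned shard)
{
	unsigned reads = 0;

	assert(shard < 64);
	for (unsigned n = 0; n < n_nodes; n++)
		if (n != self && ((masks[n] >> shard) & 1u))
			reads++;
	return reads;
}
```

With three nodes and masks {0x1, 0x4, 0x5} seen from node 0, only shards 0 and 2 are scanned, and shard 0 costs one slot read instead of two.
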

Release path keeps the full scan (primary grant path; staleness-driven
skip would stall requests raised between poll and release).

Per-cycle benchmark (S=64, N=8): rmb_slot ~33 → ~4 (-87%), ns_avg
~82k → ~60k (-27%). Low-S workloads pay a small membership-read cost
(O(N) pass vs. O(S*k) lazy), but the NRHT production target is S >= N,
where the scan-skip savings dominate.
Baseline's per-shard CB RMB in each node's poll_cycle created a
textbook CXL anti-pattern: N hosts continuously polling the same
cache line, loading the shared memory controller queue and reducing
fabric bandwidth available to the data path. Doorbell's point was to
push hot-polling off shared CLs onto single-reader ones.

Replace the CB read with a DRAM `is_holder[S]` boolean, flipped by:
  - me_pass_token on our own write (outgoing transition)
  - poll_cycle slot-doorbell detection on token_seq bump (incoming)
  - wait_for_token / common_try_acquire on CB-read success
  - common_join initial seed

Receiver-side detection now polls only our own per-(shard,node) slot
— a single-reader CL per node, no multi-host contention. The bump is
a sufficient "I became holder" signal because me_pass_token orders
the CB WMB before the slot WMB.

Crash detection migrates out of the poll path into wait_for_token's
timeout branch: after 5s without progress, check holder's
heartbeat_ts; if stalled past MARUFS_ME_TIMEOUT_NS, self-takeover via
me_pass_token(self, s, self). Idle shards with no acquirer require
no proactive monitoring. Removes marufs_me_check_heartbeat plus the
last_heartbeat / last_heartbeat_time DRAM arrays.

order_poll_cycle: drop ghost-alone CB probe — the acquire-timeout
takeover covers the same scenario.

Per-cycle CB RMB at S=64 N=8 drops from ~33 to near-zero in steady
state; slot-doorbell reads add S per cycle but land on
single-reader CLs and don't contend across hosts.
Replace 8 parallel per-shard arrays in marufs_me_instance with a single
struct marufs_me_shard *shards allocation. Simplifies alloc/free, keeps
per-shard fields cache-adjacent, and removes scattered kcalloc/kfree
bookkeeping.

Fields consolidated: holding, local_waiters, local_lock, cached_successor,
is_holder, poll_last_slot_seq, last_token_seq, last_cb_gen.
Replace repeated me->shards[shard_id].field accesses with a local
struct marufs_me_shard *sh bound once per function/loop iteration.
No behavior change.

Side effect: common_join now hoists marufs_me_next_active() out of
the per-shard loop since all shards get the same successor at join
time — one scan instead of num_shards scans.
Replace the CB RMB in wait_for_token's fast path with a DRAM
is_holder check. Saves one CXL CB round-trip per same-node burst
acquire where the token is already held on entry.

Correctness:
- Every is_holder mutation goes through ME_BECOME_HOLDER /
  ME_LOSE_HOLDER, centralizing the smp_wmb rule. Reader pair
  ME_IS_HOLDER(sh) wraps smp_rmb + field read.
- me_pass_token self-pass seeds sh->last_cb_gen / last_token_seq /
  poll_last_slot_seq so the fast path can trust is_holder without
  re-reading CB, and poll_cycle treats the doorbell as a self-bump
  without re-flipping state.
- common_try_acquire and wait_for_token's loop-exit path pick up
  the previously-missing smp_wmb via the macro.
- poll_cycle bump handlers touch only poll_last_slot_seq; the app
  thread keeps exclusive ownership of last_token_seq / last_cb_gen
  so its bump-detection signal is never lost to a poll-thread race.
- Deadline-path takeover drops its redundant post-pass CB/slot RMB;
  me_pass_token self-pass already seeds the same baselines.
…y_acquire

- add me_cb_snapshot() helper in me.h bundling RMB + holder/generation read;
  callers pass NULL for gen when only holder is needed.
- remove marufs_me_common_try_acquire and .try_acquire op (no users remain).
- remove order_leave_dump diagnostic (covered by existing poll tracing).
- move shard pointer binding closer to first use (C99 style).
Introduces persistent per-CPU instrumentation to diagnose acquire
latency, poll-thread cost, and hash-chain quality without measurable
hot-path overhead (~0.5% of ns_avg; log2 buckets, non-atomic updates).

New headers (kept separate from me.h/marufs.h to bound their surface):
  me_stats.h   - struct marufs_me_stats_pcpu + helpers for
                 wait_for_token (spin/sleep/deadline hit, wall+cpu ns,
                 log2(ns) latency histogram), poll_cycle phase
                 breakdown (membership/doorbell/scan), lock hold time,
                 per-shard acquire distribution, request-mode grant age.
                 cpu_ns is sampled via current->se.sum_exec_runtime
                 (task_sched_runtime is unexported); the tick-granular
                 delta is clamped to wall_ns at each sample so the
                 aggregate cpu_util stays physically valid.
  nrht_stats.h - per-CPU bucket-chain walk count/steps histogram.
                 Lives on sbi, handed to nrht_find_chain via a direct
                 stats pointer on nrht_shard_ctx (no sbi back-ref).

Sysfs (writing any value resets an attribute, except the cumulative chain and poll_thread ones):
  me_fine_stats          - aggregate ME counters per instance
  me_per_shard_acquire   - hotspot detection across shards
  me_poll_thread_cpu     - cumulative sum_exec_runtime of poll kthread
                           (diff-based utilization; cannot be reset)
  nrht_chain_depth       - find_chain count/steps + depth histogram

Bench (test_nrht_race):
  - Reset + read/diff across all four nodes per run.
  - Per-run dump: wait hit split, cpu_util, grant count, chain depth,
    poll kthread util (divided by mount_count for per-thread avg).
  - Sweep table gains a fine-grained section with wait_avg/cpu%/spin%/
    hold_avg/grant/chain/poll_cpu%.
wait_fast_hit: tracks ME_IS_HOLDER early returns in wait_for_token —
the acquires that bypass all token-wait work. Combined with wait_count
this exposes the intra-node holder-keep hit rate. Measurement on the
bench showed request-mode sustains ~80% fast-hit across shards while
order-mode collapses to <10% as shards grow, quantifying why order
scales poorly.

Bench sweep table gains four columns:
  fast%   - fast_hit / (wait_count + fast_hit)
  mem%    - membership pass share of poll_cycle wall
  door%   - per-shard slot doorbell RMB share
  scan%   - grant/pass phase share (masked for request, baton for order)

The poll-phase split pinpoints where poll_cycle spends time under each
strategy: request is membership+doorbell bound (scan <2%), order is
scan-bound at high shard counts (>40%).
Replace cross-node ktime_get_ns() subtraction with counter-based probe
in the acquire-deadline path. CXL peers don't share a monotonic clock
zero point — per-node boot times differ, so now - heartbeat_ts produces
meaningless elapsed values and can misclassify alive holders as crashed
(or vice versa).

On deadline, me_handle_acquire_deadline:
  1. Snapshot holder's heartbeat counter (hb_before).
  2. Sleep MARUFS_ME_LIVENESS_PROBE_NS on local clock.
  3. Resample counter + CB.
  4. Late grant on us → enter CS directly, skip takeover.
  5. Holder changed or hb advanced → -ETIMEDOUT (back off).
  6. Counter stuck and holder unchanged → self-takeover via me_pass_token.
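
The three post-deadline branches reduce to a pure decision over two samples, sketched below. The sleep and the CXL reads are left out; `struct me_sample`, `probe_decide`, and the verdict names are made up for this example. Because only the local node's clock is used for the sleep, no cross-node timestamp arithmetic is involved.

```c
#include <assert.h>
#include <stdint.h>

enum probe_verdict {
	PROBE_ENTER_CS,		/* late grant landed on us */
	PROBE_BACK_OFF,		/* holder alive or changed: -ETIMEDOUT */
	PROBE_TAKEOVER,		/* holder looks dead: self-takeover */
};

struct me_sample {
	uint32_t holder;	/* CB holder at sample time */
	uint64_t heartbeat;	/* holder's heartbeat counter */
};

/* `before` is taken at the deadline, `after` once the probe window elapsed. */
static enum probe_verdict probe_decide(struct me_sample before,
				       struct me_sample after, uint32_t self)
{
	assert(before.holder != self);	/* we were waiting, so not the holder */

	if (after.holder == self)
		return PROBE_ENTER_CS;
	/* Inequality rather than '>' so a wrapped counter still counts as alive. */
	if (after.holder != before.holder || after.heartbeat != before.heartbeat)
		return PROBE_BACK_OFF;
	return PROBE_TAKEOVER;
}
```
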

Probe window: 100ms (10000× default poll interval). Conservative — the
takeover path isn't hot and false-positive crash calls would race
against a live holder on CB write.

heartbeat_ts kept as observability field only.
Rewrite docs/7_arch_me.md around protocol mechanics:
  - §1 Shared State Layout: access-pattern mermaid + CXL struct table.
  - §2 Overview: 2.1 per-node lifecycle + 2.2 per-(shard, node) state machine
    (rename states to NONE / MEMBER / HOLDER_BUSY / HOLDER_IDLE — token
    ownership axis made explicit in labels).
  - §3 Thread Interaction & Memory Access: merge thread-level sequence with
    memory-level byte evolution per mode.
    - 3.1 OD (token pass): alt branch for receiver with/without app waiter
      (Case A fast-path vs Case B poll-thread fallback consumer).
    - 3.2 RD (request+grant): alt branch for grant paths (release vs poll).
    - 3.3 Cacheline snapshot: before/after value tables, symbolic CL labels.
    - 3.4 Acquire Timeout: crash vs busy-holder disambiguation via
      counter-based liveness probe; three post-deadline branches (late grant
      on self / holder changed or alive / counter stuck → takeover).
  - §4 Stats & Bench Integration: list current sysfs attrs, split poll-cost
    counters from per-CPU fine-grained stats, align with bench harness
    output columns.

Drop sections that duplicated code (per-shard DRAM struct, step-by-step
prose flows — §3 diagrams cover them).
Break the 1266-line monolithic sysfs.c into focused units while moving
manual GC control out of the production attribute surface.

  sysfs.c           1266 -> 265   core: version/region_info/perm_info/daxheap_bufid + group/init
  sysfs_me.{c,h}    new          ME inspection + per-CPU stats (me_info, poll_stats, fine_stats, ...)
  sysfs_gc.{c,h}    new          GC monitoring (deleg_info, gc_status)
  sysfs_nrht.{c,h}  new          NRHT chain-depth histogram
  sysfs_internal.h  new          shared sbi_list/lock + get_sbi/find_by_node helpers
  sysfs_debug.{c,h} extended     gc_trigger/stop/pause/restart relocated into debug subgroup
                                 alongside existing fault injection (me_freeze_heartbeat,
                                 me_sync_is_holder)

Helper naming in sysfs_me.c unified to verb-noun form:
  me_state_str             -> me_state_name
  me_tag_for               -> me_format_tag
  me_info_emit_one         -> me_emit_instance
  me_stats_aggregate       -> me_aggregate_stats
  me_fine_stats_emit_buckets -> me_emit_buckets

Tests updated to /sys/fs/marufs/debug/gc_* paths:
  test_local_multinode.sh  (13 spots)
  test_chown_race.c, test_dupname.c, test_overlap.c, test_gc_deleg.c

Also includes pre-staged ME crash-detection scaffolding consumed by the
debug subgroup: me.h / me_order.c / me_request.c expose debug_freeze_poll
hooks; tests/test_me_crash.sh exercises freeze + sync recovery end-to-end.
…eaders

marufs.h was a 1266-line catch-all (sb_info + DAX/RAT/shard helpers +
function decls for 10 modules). marufs_layout.h was a 625-line mix of
on-disk structs and CXL/CAS primitives. Both now act as umbrella
headers over focused per-domain files.

Phase 1 — extract reusable primitives + per-module decl headers:
  marufs_endian.h  READ_LE/WRITE_LE/READ_CXL_LE, MARUFS_CXL_WMB/RMB,
                   le16/32/64_cas, cas_inc/dec
  marufs_hash.h    shard_idx, bucket_idx, hash_name, make_ino,
                   ino_to_region, align_up
  gc.h             orphan tracker types + gc.c entry points
  inode.h          marufs_inode_info struct + inode.c entry points
                   + inode_ops externs
  acl.h, cache.h, dir.h, file.h, index.h, nrht.h, region.h, super.h
                   per-module function declarations

Phase 2 — split on-disk structs by subsystem:
  marufs_superblock_layout.h   marufs_superblock + GSB_SIZE
  marufs_index_layout.h        shard_header, index_entry, region
                               defaults, BUCKET_END, state enum
  marufs_rat_layout.h          rat, rat_entry, deleg_entry, RAT/deleg/
                               region_type state enums, capacity
  marufs_nrht_layout.h         nrht_header, nrht_shard_header,
                               nrht_entry, NRHT defaults

Umbrella files now hold only:
  marufs.h          sb_info, DAX/RAT/shard inline accessors, sysfs decls
  marufs_layout.h   magic enum, ME area sizes + me_area_size helper,
                    layout offsets, compile-time size validators

Existing .c files keep including marufs.h alone — umbrella pulls in
all per-module headers, so include patterns are unchanged. Phase 3
(sb_info field grouping into sub-structs) deferred — too invasive,
low ROI.

Line counts:
  marufs.h         1266 -> 435  (66% reduction)
  marufs_layout.h   625 -> 130  (79% reduction)
  16 new headers    ~970 lines  (focused, dependency-minimal)

Build clean. Compile-time size validators still pass.
Adds ME crash-detection regression coverage to the local multinode
suite by delegating to the standalone test_me_crash.sh (T1-T7).
Previously the crash tests had to be run by hand after every kernel
change — easy to forget, easy to silently regress.

Section 30 runs last because:
  - it manipulates the dmesg ring (dmesg -C between sub-tests)
  - T1 saturates the CPU with stress-ng (soft dependency; self-skips if absent)
  - T5 needs ≥3 mounts (self-skips if /mnt/marufs3 is absent)
  - total runtime is ~50s

Gate: requires test_me_crash.sh executable + writable
/sys/fs/marufs/debug/me_freeze_heartbeat + root. Otherwise SKIP with
a one-line reason. test_me_crash.sh's own `set -euo pipefail` + die
behavior maps cleanly to run_test's pass/fail accounting (non-zero
exit on first failed T → Section 30 fails).

The NRHT --sweep benchmark is intentionally NOT integrated here: it is a
throughput measurement (no pass/fail), takes minutes, and is sensitive to
machine load. A dedicated bench_nrht.sh is the right home for it.
Replaces mandatory node_id= mount option with bootstrap-elected slot
assignment. Each mount CAS-claims a free slot in the on-disk bootstrap
table (CLAIMED/FORMATTING states); first claimer formats the FS, rest
attach. Stuck-formatter steal path covers crashed formatters.

Changes:
- bootstrap.c/h, marufs_bootstrap_layout.h: slot table + claim/steal
- super.c: bootstrap-elected node_id; legacy explicit node_id= still
  supported via mount option
- sysfs_debug: bootstrap_dump shows per-mount slot ownership (<mine>)
- setup_local_multinode.sh: --legacy flag for old explicit-node_id
  style; default is auto-mount via bootstrap
- test_bootstrap_chaos.sh: T1 stuck-formatter recovery, T2 concurrent
  mount race, T3 slot reuse sanity
- test_local_multinode.sh: Section 31 auto-mount slot table checks,
  Section 32 delegates to chaos with auto-teardown
- test_me_crash.sh: trim T1 stress 8s->4s, T2 iters 3->2, T3 busy 7s->6s
- gc/file/nrht/sysfs minor adjustments for bootstrap integration
- dax_zero.c: helper to wipe DAX device for chaos preconditions
Without this, request_poll_cycle's stale poll_last_slot_seq baseline
re-triggers ME_BECOME_HOLDER after the token has already been passed,
leaking is_holder=true cross-handoff. The next acquire's wait_for_token
fast path (ME_IS_HOLDER) then enters CS while CB holds a different
node — two-holder race observed under concurrent counter-RMW stress.
Add user-managed ref_count and pin_count to each NRHT entry, plus four
new ioctls (REF_INC, REF_DEC, PIN_INC, PIN_DEC) for caller-driven RMW
under NRHT shard ME. dec-from-zero rejects with -EINVAL,
inc-from-UINT32_MAX with -EOVERFLOW.

Layout: counters consume two __le32 in the existing CL0 reserved space
(offsets 40-47); 128B entry size unchanged.
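
The bounded counter semantics and the layout claim can be sketched as follows. The struct is a placeholder that only reproduces the stated offsets and size; which of the two counters sits at offset 40 versus 44 is an assumption, as are the function names. The caller is assumed to hold the shard's mutual exclusion, so plain loads and stores suffice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder entry: only the counter offsets and total size are modeled. */
struct nrht_entry_sketch {
	uint8_t  head[40];
	uint32_t ref_count;	/* offset 40 (order of the two is assumed) */
	uint32_t pin_count;	/* offset 44 */
	uint8_t  rest[80];
};

static_assert(offsetof(struct nrht_entry_sketch, ref_count) == 40, "ref at 40");
static_assert(offsetof(struct nrht_entry_sketch, pin_count) == 44, "pin at 44");
static_assert(sizeof(struct nrht_entry_sketch) == 128, "entry stays 128B");

/* Caller holds the shard lock; the counter is left untouched on error. */
static int counter_inc(uint32_t *c)
{
	if (*c == UINT32_MAX)
		return -EOVERFLOW;
	(*c)++;
	return 0;
}

static int counter_dec(uint32_t *c)
{
	if (*c == 0)
		return -EINVAL;
	(*c)--;
	return 0;
}
```
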

Tests:
- test_ioctl.c §3.5 covers single-process semantics (initial value,
  bounded overflow/underflow, ENOENT on missing entry).
- test_nrht_race.c Test4 stresses balanced concurrent inc/dec across
  8 workers and asserts final == 0 + zero ioctl errors. Worker logs
  first failed op to stderr; harness aborts on first round failure
  for clean dmesg capture.
- run_bench bundles ref/pin INC/DEC into the per-iter timed loop on
  the iter's own entry so all 7 ops share a shard, scaling with
  cfg->num_shards. Sweep summary gains a counter ops section.
Cover the four NRHT_REF/PIN_INC/DEC ioctls with usage examples and the
overflow/underflow semantics. Note that FIND_NAME returns the counters
alongside the offset.
bootstrap_dump_slots() used PAGE_SIZE as its scnprintf bound while
sysfs_debug's bootstrap_dump_show() called it with `buf + n` after
writing a per-mount header. Each scnprintf could thus write up to
PAGE_SIZE bytes past the caller's offset, overrunning the sysfs page
into adjacent slab objects.

Symptom: GPF in fdget/filp_flush after reading bootstrap_dump, with
non-canonical addresses decoding to ASCII fragments emitted by this
helper ("=CLAIMED", " node_id", "slot[N] stat...").

Add a bufsize parameter and pass PAGE_SIZE - n from the show callback;
guard the loop against n >= bufsize.
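
The shape of the fix, in user-space terms, is sketched below. `scn` is a stand-in that mimics the kernel scnprintf return convention (characters actually written, excluding the NUL); `dump_slots` and `dump_show` are simplified, hypothetical versions of the helper and the show callback.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Like scnprintf: returns the number of characters actually stored. */
static size_t scn(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int r;

	if (size == 0)
		return 0;
	va_start(ap, fmt);
	r = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	if (r < 0)
		return 0;
	return (size_t)r >= size ? size - 1 : (size_t)r;
}

/* The helper takes the space that is really left, not a fixed page size. */
static size_t dump_slots(char *buf, size_t bufsize, unsigned nslots)
{
	size_t n = 0;

	for (unsigned i = 0; i < nslots && n < bufsize; i++)
		n += scn(buf + n, bufsize - n, "slot[%u] status=CLAIMED\n", i);
	return n;
}

/* The caller has already consumed n bytes, so it passes page_size - n. */
static size_t dump_show(char *page, size_t page_size, unsigned nslots)
{
	size_t n = scn(page, page_size, "mount header\n");

	assert(n < page_size);
	n += dump_slots(page + n, page_size - n, nslots);
	return n;
}
```

The buggy variant corresponds to calling `dump_slots(page + n, page_size, ...)`: every bound is then relative to the wrong base and can reach up to n bytes past the end of the page.
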
Decompose me.h (~700 LOC) into three focused headers:
- me.h: public API + DRAM types only
- me_inline.h: inline helpers needing instance struct visibility
- me_layout.h: on-disk CXL layout (header/CB/membership/slot)

Move cold-path helpers (me_leave_successor, me_membership_tick_heartbeat)
out of inline header into me.c. Consolidate per-shard arrays into
struct marufs_me_shard and DRAM is_holder fast path. Wire callers
(bootstrap, me_order, me_request, nrht, sysfs) to new layout.
Split marufs_check_permission into two layers:
- marufs_check_permission_any(candidate, *out_granted): returns the
  granted subset of candidate bits, letting callers branch on which
  rights matched. Replaces ADMIN-then-GRANT two-call patterns.
- marufs_check_permission: thin AND-semantics wrapper.
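
A minimal model of the two layers is sketched below. The permission bit values, `struct region_acl`, and the caller-identity fields are assumptions for illustration; only the owner → default_perms → delegation order and the any/all split come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit assignments. */
#define PERM_READ  0x01u
#define PERM_WRITE 0x02u
#define PERM_ADMIN 0x04u
#define PERM_GRANT 0x08u

#define MAX_DELEG 4

struct deleg { uint32_t node, pid, perms; };

struct region_acl {
	uint32_t owner_node, owner_pid;
	uint32_t default_perms;
	uint32_t deleg_num_entries;	/* loop bound, not MAX_DELEG */
	struct deleg deleg[MAX_DELEG];
};

/* Returns 0 if any candidate bit is granted; *out_granted gets the subset. */
static int check_permission_any(const struct region_acl *r, uint32_t node,
				uint32_t pid, uint32_t candidate,
				uint32_t *out_granted)
{
	uint32_t granted;

	assert(r->deleg_num_entries <= MAX_DELEG);
	if (r->owner_node == node && r->owner_pid == pid) {
		granted = candidate;				/* stage 1: owner */
	} else {
		granted = candidate & r->default_perms;		/* stage 2 */
		for (uint32_t i = 0; i < r->deleg_num_entries; i++)	/* stage 3 */
			if (r->deleg[i].node == node && r->deleg[i].pid == pid)
				granted |= candidate & r->deleg[i].perms;
	}
	if (out_granted)
		*out_granted = granted;
	return granted ? 0 : -EACCES;
}

/* Thin AND-semantics wrapper: every required bit must be granted. */
static int check_permission(const struct region_acl *r, uint32_t node,
			    uint32_t pid, uint32_t required)
{
	uint32_t granted = 0;

	check_permission_any(r, node, pid, required, &granted);
	return granted == required ? 0 : -EACCES;
}
```

A caller that previously tested ADMIN and then GRANT in two calls can pass `PERM_ADMIN | PERM_GRANT` once and branch on the returned subset.
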

Inline deleg matching into _any (drops marufs_deleg_matches) and bound
the loop by deleg_num_entries instead of MAX_ENTRIES.

Centralize ioctl perm precheck via marufs_ioctl_required_perm(cmd)
table at dispatcher entry, removing per-case marufs_check_permission
calls from NAME_OFFSET / BATCH_* / FIND_NAME / CLEAR_NAME / NRHT_INIT
and from DMABUF_EXPORT / CHOWN. PERM_GRANT keeps self-check inside its
ME critical section, now using _any to evaluate ADMIN|GRANT in one call.

Move nrht_refcnt_op_t typedef from file.c to nrht.h.

Extend test_nrht_race with run_test5 (new race scenario) and tighten
test3/test4 coverage.
Concurrent CHOWN race: precheck ran before me->acquire(), so two
callers with default_perms ADMIN could both pass and serialize on
the lock. The first chown stripped ADMIN (default_perms=0, deleg
cleared), but the second never re-checked and still won, letting
ownership transfer twice.

Fix:
- Add marufs_check_permission(ADMIN) inside __marufs_ioctl_chown_locked
  before the ALLOCATED→ALLOCATING CAS.
- Drop CHOWN from the lock-free precheck table (handler self-checks),
  matching the PERM_GRANT pattern.
- Same in-lock recheck added to perm_set_default for symmetry: a
  caller that relied on default_perms ADMIN can be demoted by a
  concurrent perm_set_default/chown writing default_perms=0.
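
The race and its fix can be replayed deterministically in a few lines. This is a single-threaded model: `struct region_sketch`, `has_admin`, and `chown_locked` are hypothetical, the lock itself is implied (each `chown_locked` call represents one turn inside the critical section), and the PERM_ADMIN value is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PERM_ADMIN 0x04u	/* hypothetical bit value */

struct region_sketch {
	uint32_t owner;
	uint32_t default_perms;
};

static int has_admin(const struct region_sketch *r, uint32_t caller)
{
	return r->owner == caller || (r->default_perms & PERM_ADMIN) != 0;
}

/*
 * Body of the critical section. The permission is re-evaluated here,
 * after serialization, because an earlier holder of the lock may have
 * stripped the rights the lock-free precheck relied on.
 */
static int chown_locked(struct region_sketch *r, uint32_t caller, uint32_t new_owner)
{
	assert(r);
	if (!has_admin(r, caller))
		return -EACCES;
	r->owner = new_owner;
	r->default_perms = 0;	/* chown strips ADMIN from default_perms */
	return 0;
}
```

Replaying the scenario: callers 2 and 3 both pass a precheck while default_perms still carries ADMIN; caller 2 wins the lock and takes ownership; caller 3 then fails the in-lock recheck instead of transferring ownership a second time.
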
Add per-sbi vm_ops wrapper that copies underlying device_dax ops and
overrides .open/.close/.mprotect to enforce RAT delegation on mprotect.
mmap-time RAT check is no longer the sole gate.

Wrapper details:
- sbi-embedded vm_ops, lazy-seeded at first mmap under vm_ops_lock
- xi pointer stashed in vma->vm_private_data; igrab on attach,
  iput in .close, igrab on .open for vma split/clone refcount balance
- container_of(vma->vm_ops, sbi, vm_ops) recovers sbi at hook time
  (vma->vm_file = dax_filp after device_dax delegation)

Hardening flags applied to every marufs vma:
- VM_DONTCOPY: fork() drops the mapping; child re-mmap forces RAT recheck
- VM_DONTEXPAND: mremap() cannot grow past original mmap size
- VM_DONTDUMP: KV-cache contents excluded from coredumps

Lock split: revert the prior sb_lock merge that caused soft lockups
when me_poll_thread held the unified lock for full poll cycles.
- me_list_lock: poll thread + register/unregister
- nrht_me_lock: nrht_me[] creation
- vm_ops_lock: lazy seed (and future hot-path use)

Remove daxheap support entirely:
- Drop CONFIG_DAXHEAP and DAXHEAP_DIR from Makefile / install.sh
- Remove daxheap= and daxheap_import_id= mount options
- Drop MARUFS_IOC_DMABUF_EXPORT ioctl and dmabuf_req struct
- Remove enum marufs_dax_mode (DEV_DAX is the only mode)
- Drop sbi->heap_dmabuf, marufs_dax_acquire_daxheap,
  /sys/fs/marufs/daxheap_bufid

Tests (tests/test_mmap.c):
- run_vm_protect: mprotect basics, RDONLY-fd escalation block,
  VM_DONTCOPY fork SIGSEGV, VM_DONTEXPAND mremap reject (with and
  without MAYMOVE), partial mprotect vma split, mremap MOVE-only
  success, 200-iter split+merge stress for igrab balance
- run_vm_protect_cross: cross-node escalation block — owner grants
  READ-only to peer, peer mprotect(PROT_RW) rejected by RAT WRITE
  check; after additional WRITE grant, mprotect succeeds
Add two-layer defense against post-exec fd reuse / hostile re-execve:

1. RAT exe_inode binding (acl.c + region.c) - owner check now compares
   current task's exe inode/dev against owner_exe_inode_ino/dev stored
   in the RAT entry at create time. Catches execve into different
   binary.
2. FD_CLOEXEC enforcement at data access (file.c) - mmap/read/ioctl
   reject with -EACCES when the calling fd is not close_on_exec.
   Catches same-binary re-execve (hostile argv) which exe_inode binding
   alone cannot detect.

O_CLOEXEC cannot be enforced at .open: the VFS strips O_CLOEXEC from
f_flags (it is stored in fdtable.close_on_exec), and the fd is not yet
installed when ->open runs. The check therefore moves to the
mmap/read/ioctl entry points, where an fdtable lookup is possible.
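
From the application side, the requirement amounts to opening with O_CLOEXEC (or setting FD_CLOEXEC before the first data access). The helpers below are a user-space sketch using only standard POSIX calls; their names are made up, and the kernel-side check itself is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>

/* 1 if the descriptor is close-on-exec, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int fl = fcntl(fd, F_GETFD);

	if (fl < 0)
		return -1;
	return (fl & FD_CLOEXEC) != 0;
}

/* Open with the flag set atomically, avoiding a window before fcntl(). */
static int open_cloexec(const char *path, int flags)
{
	assert(path);
	return open(path, flags | O_CLOEXEC);
}

/* Retrofit an existing descriptor; returns 0 on success, -1 on error. */
static int fd_set_cloexec(int fd)
{
	int fl = fcntl(fd, F_GETFD);

	if (fl < 0)
		return -1;
	return fcntl(fd, F_SETFD, fl | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

A descriptor opened without the flag would, per the commit, see mmap/read/ioctl fail with EACCES until `fd_set_cloexec()` or an equivalent call is made.
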

Tests:
- test_postexec_attack: integrated into test_local_multinode.sh as
  Section 32 (Bootstrap Chaos shifts to Section 33). Two modes: no
  cloexec (parent mmap blocked) and --cloexec (execve closes fd).
- test_negative: new Section 0 verifying mmap without FD_CLOEXEC
  returns EACCES.
- 13 existing test sources updated to pass O_CLOEXEC on open().

Docs: 0_user_guide.md gains an O_CLOEXEC requirement bullet and a
Security section paragraph explaining the fd-level check.