feat: End-to-End Encryption with historical key sharing #5

Open
axel-krapotke wants to merge 39 commits into main from feature/e2ee
@axel-krapotke
Summary

This PR adds comprehensive End-to-End Encryption support to the matrix-client-api, including a novel historical key sharing mechanism that ensures late joiners can decrypt existing content.

What's New

E2EE Core

  • CryptoManager wrapping @matrix-org/matrix-sdk-crypto-wasm OlmMachine
  • Transparent Megolm encryption in CommandAPI (outgoing) and TimelineAPI (incoming)
  • Per-project/per-layer encryption configuration
  • Persistent IndexedDB-backed crypto store support (Electron/browser)
  • Injectable logger replacing all console.* calls

Historical Key Sharing

  • Export and share Megolm session keys with project members when sharing a layer
  • Keys are Olm-encrypted per-device and sent as m.room.encrypted to_device events
  • Server queues to_device for offline recipients — late joiners get keys on next sync
  • Receiving side intercepts decrypted key events in receiveSyncChanges() and imports via importRoomKeys()
  • Safety net: keys are also re-shared when new members join (stream handler)

Tuwunel Compatibility

  • Handle state events delivered only in timeline (not state block) during initial sync
  • Guard against missing timeline object in sync response
  • Tested against both Synapse and Tuwunel (Conduit fork)

Bug Fixes

  • m.room.encrypted power level set to CONTRIBUTOR level (was falling back to events_default = ADMIN)
  • content() no longer filters out own events (not_senders), fixing re-join state reconstruction
  • ProjectList.join() returns encryption status so joiners persist the E2EE flag
  • sendToDevice() generalized to support arbitrary user/device message maps
  • CommandAPI supports async callback functions in the queue

Documentation & Testing

  • Complete README rewrite with architecture diagram, API documentation, E2EE details, and playground CLI reference
  • 3 E2E test suites against Tuwunel Docker container (12 tests)
  • Updated powerlevel unit tests (40 tests total)
  • Interactive playground CLI with E2EE support

Breaking Changes

  • HttpAPI.sendToDevice() signature changed: (eventType, txnId, messages) instead of (deviceId, eventType, content, txnId)
  • content() no longer uses not_senders filter (includes own events for correct re-join state)
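
For callers migrating to the new `sendToDevice()` signature, the `messages` argument presumably mirrors the body of the Matrix `sendToDevice` endpoint: a map of user ID to device ID to content. The sketch below builds such a map; `buildMessageMap` and the shape of its `targets` argument are hypothetical helpers for illustration, not part of this PR.

```javascript
// Hypothetical helper: builds a { userId: { deviceId: content } } map,
// the shape the Matrix sendToDevice endpoint expects under "messages".
function buildMessageMap(targets) {
  const messages = {}
  for (const { userId, deviceId, content } of targets) {
    if (!messages[userId]) messages[userId] = {}
    messages[userId][deviceId] = content
  }
  return messages
}

const messages = buildMessageMap([
  { userId: '@bob:example.org', deviceId: 'DEVICE1', content: { body: 'a' } },
  { userId: '@bob:example.org', deviceId: 'DEVICE2', content: { body: 'b' } },
  { userId: '@carol:example.org', deviceId: 'DEVICE3', content: { body: 'c' } }
])

// Assumed call shape after this PR (httpAPI and txnId come from the caller):
// await httpAPI.sendToDevice('m.room.encrypted', txnId, messages)
```

One call can then address several users and devices at once, which is what per-device key delivery needs.
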

Test Results

npm test        → 40 passing
npm run test:e2e → 12 passing (against Tuwunel Docker)

…dling

- Store device_id in HttpAPI credentials (was missing)
- Attempt token refresh on any M_UNKNOWN_TOKEN 401, not just soft_logout
- Fail clearly if E2EE is enabled but no device_id is available
- Remove unsafe ODIN_DEVICE fallback

- OlmMachine is now initialized once and reused (no duplicate key uploads)
- MatrixClient factory keeps a shared CryptoManager instance
- Playground 'open' command reuses the existing client instead of creating a new one

- Add setRoomEncryption() to CryptoManager (calls OlmMachine.setRoomSettings)
- Parse m.room.encryption in roomStateReducer
- Project.hydrate() registers encrypted rooms with CryptoManager
- Pass encryption state through structure-api hierarchy
- Pass cryptoManager through Project constructor params

Without this, the OlmMachine had no knowledge of which rooms were encrypted,
causing shareRoomKey() to fail silently and leaving Element clients unable to decrypt.

- Use queryKeysForUsers() to explicitly fetch device keys before key sharing
- Fix member filtering: use content.membership and state_key (was wrong path)
- Add debug logging to entire E2EE encrypt flow
- Process outgoing requests after key sharing to ensure delivery

The previous flow relied on outgoingRequests() returning a KeysQuery after
updateTrackedUsers(), which doesn't always happen. Now we explicitly query
device keys, ensuring the OlmMachine knows all devices before shareRoomKey().
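
The member filtering fix above can be illustrated with a minimal sketch. In `m.room.member` events the affected user is identified by `state_key` and the membership value sits in `content.membership`. The function name `joinedMembers` is a hypothetical stand-in, not the name used in this PR.

```javascript
// Returns the user IDs of all joined members found in a list of state events.
// For m.room.member events the affected user is in state_key (not sender),
// and the membership value lives in content.membership.
function joinedMembers(stateEvents) {
  return stateEvents
    .filter(e => e.type === 'm.room.member' && e.content?.membership === 'join')
    .map(e => e.state_key)
}

const members = joinedMembers([
  { type: 'm.room.member', sender: '@alice:example.org', state_key: '@alice:example.org', content: { membership: 'join' } },
  // Alice invited Bob: sender is Alice, but the event is about Bob.
  { type: 'm.room.member', sender: '@alice:example.org', state_key: '@bob:example.org', content: { membership: 'invite' } },
  { type: 'm.room.member', sender: '@carol:example.org', state_key: '@carol:example.org', content: { membership: 'join' } },
  { type: 'm.room.name', sender: '@alice:example.org', state_key: '', content: { name: 'Layer' } }
])
```

The resulting list would then be the input for the explicit device key query.
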
Sends a standard m.room.message event directly through the command queue,
bypassing the ODIN operation wrapper. Useful for testing E2EE with Element.

- New method: initializeWithStore(userId, deviceId, storeName, passphrase)
  Uses StoreHandle.open() + OlmMachine.initFromStore() for persistent
  crypto state (Olm/Megolm sessions survive restarts)
- New method: close() releases store handle and OlmMachine
- New getter: isPersistent
- Original initialize() (in-memory) preserved for backwards compatibility
- Tests: persistent store API surface, close() cleanup, post-close errors
- StoreHandle requires IndexedDB (Electron/browser only, not Node.js)

- Docker Compose setup with jevolk/tuwunel:latest (~27MB)
- Minimal tuwunel.toml (no federation, open registration)
- Full E2EE flow tested against real homeserver:
  - Register users, upload device keys
  - Create encrypted room, join, key exchange
  - Alice encrypts → sends → Bob syncs → decrypts
- npm run test:e2e (skips gracefully if no homeserver running)
- Regular 'npm test' unaffected (unit tests only)

Tests the actual API components against Tuwunel, not raw fetch:
- HttpAPI: processOutgoingCryptoRequests(), sendOutgoingCryptoRequest()
- StructureAPI: room creation with m.room.encryption state
- CommandAPI: encrypt + send via CryptoManager pipeline
- TimelineAPI: sync → receiveSyncChanges → decrypt m.room.encrypted
- Full round-trip: Alice sends 3 encrypted msgs, Bob decrypts all 3

All tests use real HttpAPI with ky, real CryptoManager with OlmMachine,
against a real Tuwunel homeserver via Docker.

Tests now go through the real API stack as ODIN uses it:

Layer 1 - HttpAPI + CryptoManager:
  - processOutgoingCryptoRequests() uploads device keys
  - sendOutgoingCryptoRequest() handles KeysQuery

Layer 2 - StructureAPI:
  - createProject({ encrypted: true }) sets m.room.encryption
  - createLayer({ encrypted: true }) sets m.room.encryption
  - createProject() without encrypted does NOT set encryption

Layer 3 - CommandAPI:
  - schedule() + run() automatically encrypts sendMessageEvent
  - Verifies server sees m.room.encrypted (not plaintext ODIN type)

Layer 4 - TimelineAPI:
  - syncTimeline() transparently decrypts m.room.encrypted back to
    io.syncpoint.odin.operation with decrypted=true flag

Full Stack:
  - Alice creates encrypted layer (StructureAPI)
  - Sends 2 ODIN operations (CommandAPI)
  - Bob receives + decrypts both (TimelineAPI)

When CryptoManager is active, TimelineAPI now automatically:

1. BEFORE sync: Injects 'm.room.encrypted' into the server-side
   filter types. Without this, the server silently drops all encrypted
   events because it only sees the envelope type, not the original
   event type (e.g. io.syncpoint.odin.operation).

2. AFTER decrypt: Re-applies the original type constraint as a
   client-side filter. Since m.room.encrypted is a catch-all, any
   event type could be inside. The post-decrypt filter ensures only
   expected types pass through.

This is fully transparent to ODIN — no filter changes needed in
Project.content() or Project.start() filterProvider.

Affected paths:
- syncTimeline(): sync filter + catch-up filter augmented
- content(): history replay filter augmented + decrypt + post-filter
- Original filter is never mutated (deep clone)
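
The two filter steps described above can be sketched as follows. This is a simplified illustration operating on a single room event filter with a `types` array; the function names are hypothetical, and wildcard patterns in `types` (which Matrix filters allow) are not handled here.

```javascript
const ENCRYPTED = 'm.room.encrypted'

// Step 1 (before sync): return a deep copy of the filter that also lets
// encrypted envelopes through. The original filter object is not mutated.
function augmentFilter(filter) {
  const copy = JSON.parse(JSON.stringify(filter))
  if (Array.isArray(copy.types) && !copy.types.includes(ENCRYPTED)) {
    copy.types.push(ENCRYPTED)
  }
  return copy
}

// Step 2 (after decrypt): re-apply the original type constraint client-side.
// A filter without a types array accepts every event type.
function postDecryptFilter(events, filter) {
  if (!Array.isArray(filter.types)) return events
  return events.filter(e => filter.types.includes(e.type))
}

const original = { types: ['io.syncpoint.odin.operation'], limit: 100 }
const serverFilter = augmentFilter(original)

// Events as they might look after decryption (contents are made up).
const decrypted = [
  { type: 'io.syncpoint.odin.operation', content: {}, decrypted: true },
  { type: 'm.room.message', content: { body: 'hi' }, decrypted: true }
]
const accepted = postDecryptFilter(decrypted, original)
```

Here `serverFilter` goes to the homeserver, while `original` is kept for the client-side pass, so only the ODIN operation survives.
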
MatrixClient encryption options extended:
  encryption: {
    enabled: true,
    storeName: 'crypto-<projectUUID>',  // IndexedDB name
    passphrase: '<decrypted passphrase>'  // encrypts the store
  }

When storeName is provided, uses initializeWithStore() (IndexedDB-backed,
crypto state survives restarts). Without it, falls back to in-memory
(for testing or non-browser environments).

This is the integration point for ODIN: Project-services.js passes
storeName + passphrase from LevelDB/safeStorage, and the API handles
the rest transparently.
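
A minimal sketch of the selection logic described above, under stated assumptions: `initializeWithStore(userId, deviceId, storeName, passphrase)` is the signature given in this PR, while the argument list of `initialize()` and the helper names `cryptoInitMode` and `initCrypto` are assumptions made for illustration.

```javascript
// Decides how crypto should be initialized for the given encryption options.
function cryptoInitMode(encryption = {}) {
  if (!encryption.enabled) return 'disabled'
  return encryption.storeName ? 'persistent' : 'in-memory'
}

// Applies the decision to a CryptoManager-like object.
// Assumption: initialize() takes (userId, deviceId); the PR does not show it.
async function initCrypto(cryptoManager, { userId, deviceId }, encryption = {}) {
  const mode = cryptoInitMode(encryption)
  if (mode === 'disabled') return mode
  if (!deviceId) throw new Error('E2EE is enabled but no device_id is available')
  if (mode === 'persistent') {
    await cryptoManager.initializeWithStore(
      userId, deviceId, encryption.storeName, encryption.passphrase
    )
  } else {
    await cryptoManager.initialize(userId, deviceId)
  }
  return mode
}
```

Keeping the decision in a pure function makes the fallback rule easy to test without IndexedDB.
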
Some homeservers (e.g. Tuwunel) place room creation state events
exclusively in the timeline rather than the state block on initial
sync. We now merge state events with state-bearing timeline events
(those with state_key) before reducing, with timeline taking
precedence per the Matrix spec.
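
The merge described above could look roughly like the sketch below. It operates on one joined-room object from a sync response; `collectStateEvents` is a hypothetical name, and the function also tolerates a missing `state` or `timeline` block.

```javascript
// Merges state-block events with state-bearing timeline events.
// Events are keyed by (type, state_key); later entries win, so timeline
// events take precedence over the state block.
function collectStateEvents(room) {
  const stateBlock = room.state?.events ?? []
  const timeline = room.timeline?.events ?? []
  const latest = new Map()
  for (const event of [...stateBlock, ...timeline]) {
    // Only events carrying a state_key are state events ('' is a valid key).
    if (event.state_key === undefined) continue
    latest.set(JSON.stringify([event.type, event.state_key]), event)
  }
  return [...latest.values()]
}

const merged = collectStateEvents({
  state: { events: [
    { type: 'm.room.name', state_key: '', content: { name: 'old' } }
  ] },
  timeline: { events: [
    { type: 'm.room.name', state_key: '', content: { name: 'new' } },
    { type: 'm.room.encryption', state_key: '', content: { algorithm: 'm.megolm.v1.aes-sha2' } },
    { type: 'm.room.message', content: { body: 'not state' } }
  ] }
})
```

The merged list can then be passed to the state reducer in place of the state block alone.
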
The join result now includes an 'encrypted' flag derived from the
project's encryption state. This allows the caller to persist the
E2EE setting per project when accepting an invitation.

Tuwunel may omit the timeline object entirely for rooms with no
new timeline events, unlike Synapse which always includes it.

When a new user joins an encrypted layer room, the existing member
(who is streaming) detects the m.room.member join event and:

1. Queries the new user's device keys
2. Establishes Olm sessions
3. Exports all historical Megolm session keys for the room
4. Encrypts them per-device using Olm (encryptToDeviceEvent)
5. Sends them as m.room.encrypted to_device messages

On the receiving side, receiveSyncChanges() detects the custom
io.syncpoint.odin.room_keys event type after Olm decryption
and imports the keys via importRoomKeys().

This enables the joining user to decrypt all existing content
during the initial replay/catch-up.

Also:
- Add exportRoomKeys(roomId) and importRoomKeys() to CryptoManager
- Generalize HttpAPI.sendToDevice() to accept arbitrary message maps

Problem: If Alice shares an encrypted layer with content and Bob
joins later (possibly while Alice is offline), Bob cannot decrypt
historical events because keys were only shared on join (requiring
Alice to be online).

Solution: Share keys at TWO points:

1. At share time (shareLayer): Alice sends all Megolm session keys
   to ALL project members via to_device. These are queued server-side,
   so even if Bob is offline he receives them on next sync.

2. At join time (membershipChanged): Safety net that catches any keys
   created between share and join.

Both paths use the new _shareHistoricalKeysWithProjectMembers()
helper which handles device key query, Olm session establishment,
and per-device Olm-encrypted to_device delivery.

The historical key share must happen after content has been encrypted
and sent, not before (otherwise no Megolm session keys exist yet).

Changes:
- shareHistoricalKeys() now schedules a callback in the command queue
  that runs after all preceding content posts
- CommandAPI supports async callback functions in the queue
- Removed premature key sharing from shareLayer() (room is empty there)
- Key sharing still fires on member join as safety net
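
The ordering guarantee described above can be sketched with a small sequential queue. `SketchQueue` is a simplified stand-in for illustration, not the CommandAPI implementation: it only shows why a callback scheduled after content posts runs once those posts have completed.

```javascript
// Minimal sequential queue: entries are either command objects or functions.
// Functions (sync or async) are awaited in order, exactly like commands.
class SketchQueue {
  constructor(send) {
    this.send = send // async function that delivers one command
    this.entries = []
  }

  schedule(entry) {
    this.entries.push(entry)
  }

  async run() {
    while (this.entries.length > 0) {
      const entry = this.entries.shift()
      if (typeof entry === 'function') await entry()
      else await this.send(entry)
    }
  }
}

const log = []
const queue = new SketchQueue(async command => { log.push(command.name) })
queue.schedule({ name: 'post 1' })
queue.schedule({ name: 'post 2' })
// Stands in for the historical key share: runs only after both posts.
queue.schedule(async () => { log.push('share keys') })
const finished = queue.run()
```

Once `finished` resolves, `log` holds the two posts followed by the key share, so the Megolm sessions exist before they are exported.
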
syncTimeline now collects state events (from both state block and
timeline state events) and returns them as stateEvents alongside
timeline events.

project.start() processes state events and emits a 'selfJoined'
event when the current user's own m.room.member join is detected.
This enables reliable content loading after join — the server has
fully processed the join before we attempt to load content.

The Olm-encrypted approach failed because the WASM OlmMachine
zeroizes content of decrypted to_device events it doesn't recognize.

New approach:
- Send exported Megolm session keys as unencrypted custom to_device
  events (type: io.syncpoint.odin.room_keys)
- Intercept these events in receiveSyncChanges() BEFORE passing
  to OlmMachine, import keys via importRoomKeys()
- Keys are the same exported format as server-side key backup

Also fixes receiveSyncChanges() result parsing (WASM objects with
.rawEvent, not plain JSON).

Includes integration test (content-after-join.test.mjs) that
validates the full ODIN flow: create encrypted layer → post content
→ share keys → Bob joins → Bob decrypts all content.

Reverts the unencrypted approach. Keys are now properly:
- Olm-encrypted per-device via device.encryptToDeviceEvent()
- Sent as m.room.encrypted to_device events
- Decrypted by OlmMachine on receiving side
- Extracted from DecryptedToDeviceEvent.rawEvent (JSON string)
- content field is a JSON string that needs double-parse

The previous approach failed because we didn't handle:
1. WASM return objects (need .rawEvent accessor, not JSON.parse)
2. Double-stringified content (encryptToDeviceEvent stringifies,
   rawEvent contains it as string)

Tests verify both encrypted and unencrypted content loading.
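
The double-parse mentioned above can be sketched as follows. The function takes the `rawEvent` JSON string of a decrypted to_device event. The name `parseDecryptedToDevice` and the `keys` field in the sample payload are hypothetical; only the event type `io.syncpoint.odin.room_keys` is taken from this PR.

```javascript
// Extracts type, sender and content from a rawEvent JSON string.
// content may itself be a JSON string (double-stringified), in which case
// it is parsed a second time.
function parseDecryptedToDevice(rawEvent) {
  const event = JSON.parse(rawEvent)
  const content = typeof event.content === 'string'
    ? JSON.parse(event.content)
    : event.content
  return { type: event.type, sender: event.sender, content }
}

// Sample input: content was stringified before the event was stringified.
const rawEvent = JSON.stringify({
  type: 'io.syncpoint.odin.room_keys',
  sender: '@alice:example.org',
  content: JSON.stringify({ keys: [{ session_id: 'abc' }] })
})

const parsed = parseDecryptedToDevice(rawEvent)
```

Checking the type of `content` first keeps the function usable for events whose content is already an object.
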
Content loading after join is handled directly in toolbar.js.
The selfJoined approach via stream didn't work due to filter timing.
stateEvents collection in timeline-api remains (useful for future).

With E2EE, ODIN operations are sent as m.room.encrypted instead of
io.syncpoint.odin.operation. Without an explicit power level for
m.room.encrypted, it falls back to events_default (100 = ADMIN),
causing 403 for CONTRIBUTORs (power level 25).

Set m.room.encrypted to CONTRIBUTOR level in both layer and project
room creation.
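
The fallback described above follows the usual Matrix power level lookup for message events: an explicit entry in `events` wins, otherwise `events_default` applies. A minimal sketch, with the levels 25 and 100 taken from the text above and the function names and user IDs made up for illustration:

```javascript
const ADMIN = 100
const CONTRIBUTOR = 25

// Level required to send a message event of the given type.
function requiredLevel(powerLevels, eventType) {
  return powerLevels.events?.[eventType] ?? powerLevels.events_default ?? 0
}

// Whether the user may send a message event of the given type.
function canSend(powerLevels, userId, eventType) {
  const userLevel = powerLevels.users?.[userId] ?? powerLevels.users_default ?? 0
  return userLevel >= requiredLevel(powerLevels, eventType)
}

const users = { '@alice:example.org': ADMIN, '@bob:example.org': CONTRIBUTOR }

// Before the fix: only the plaintext ODIN type has an explicit level.
const before = {
  users,
  events_default: ADMIN,
  events: { 'io.syncpoint.odin.operation': CONTRIBUTOR }
}

// After the fix: the encrypted envelope type is listed as well.
const after = {
  ...before,
  events: { ...before.events, 'm.room.encrypted': CONTRIBUTOR }
}
```

With `before`, Bob may send the plaintext operation but is rejected for `m.room.encrypted`; with `after`, both are allowed.
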
content() filtered out the current user's events (not_senders).
This is correct for the live stream (own changes are already local),
but on re-join after leave the local store is empty — ALL events
are needed to reconstruct the layer state, including our own.

…layground CLI

Also fix stale powerlevel unit tests to match current role definitions.

Add interactive device verification via Short Authentication String (SAS).
Both users see 7 matching emojis and confirm to verify each other's devices.

CryptoManager methods:
- requestVerification(userId, deviceId) — initiate verification
- getVerificationRequest(userId, flowId) — get pending request
- getVerificationRequests(userId) — list all requests for a user
- acceptVerification(request) — accept incoming request (SAS method)
- startSas(request) — transition to SAS flow
- getSas(request) — get SAS state machine from request
- getEmojis(sas) — get 7 emoji objects {symbol, description}
- confirmSas(sas) — confirm emojis match (marks device verified)
- cancelSas(sas) / cancelVerification(request) — cancel flow
- isDeviceVerified(userId, deviceId) — check trust status
- getDeviceVerificationStatus(userId) — all devices with trust info
- getVerificationPhase(request) — human-readable phase name

Exports: VerificationMethod, VerificationRequestPhase

Test: sas-verification.test.mjs validates the complete flow against Tuwunel.

Also: cleaned up duplicate JSDoc comments in shareHistoricalRoomKeys.