Add FaceTime command flow and auto-approval improvements #236
cameronaaron wants to merge 2334 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR refactors the bridge toward a bridgev2-based architecture with updated CLI tooling and macOS integration, while adding new crypto utilities/tests and introducing a new Rust crate for generating Apple APNs NAC validation data.
Changes:
- Introduces bridgev2 entrypoint/commands and a bundled `bbctl` for Beeper self-host workflows (register/stop/delete/login).
- Adds CardDAV credential encryption (AES-256-GCM) + tests, plus new capability/tapback unit tests.
- Adds macOS-specific wiring (Darwin build tags, chat.db/contacts integration tweaks, setup permissions UX) and introduces a new `nac-validation` Rust crate + C FFI interface.
Reviewed changes
Copilot reviewed 127 out of 237 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pkg/connector/chatdb_darwin.go | Darwin-only side-effect import to register macOS chat.db support |
| pkg/connector/carddav_crypto.go | Adds AES-256-GCM encryption/decryption for CardDAV config secrets |
| pkg/connector/carddav_crypto_test.go | Unit tests for CardDAV key management + encrypt/decrypt behavior |
| pkg/connector/capabilities.go | Defines bridge capability descriptors (room/general) for bridgev2 |
| pkg/connector/capabilities_test.go | Adds tests validating capability invariants |
| pkg/connector/bridgeadapter.go | Adds adapter implementing legacy interface for mac connector reuse |
| nac-validation/src/validation_data.h | Adds C header for NAC validation data generation and 3-step NAC API |
| nac-validation/src/lib.rs | Adds Rust wrapper for NAC generation + context-based step API |
| nac-validation/build.rs | Compiles Objective-C NAC implementation and links Foundation |
| nac-validation/Cargo.toml | Declares nac-validation crate deps and build deps |
| imessage/tapback_test.go | Adds unit tests for tapback parsing/mapping behavior |
| imessage/mutation_test.go | Adds mutation testing harness (build-tagged) |
| imessage/struct.go | Extends Contact/Attachment structs and adjusts identifier parsing |
| imessage/interface.go | Extends API interface with GetMessageGUIDsSince |
| imessage/mac/messages.go | Adds attachment fields, new queries, and message GUID query method |
| imessage/mac/database.go | Stores prepared stmt for message GUIDs since query |
| imessage/mac/send.go | Adds darwin build tag, updates imports, improves cleanup logging |
| imessage/mac/contacts.go | Adds darwin build tag and improves permission detection + CString lifetime |
| imessage/mac/meowContacts.m | Adds native “test query” helper for Contacts permission verification |
| imessage/mac/meowContacts.h | Exposes meowTestContactQuery in header |
| imessage/mac/debug.go | Adds darwin build tag and updates import path |
| imessage/mac/groups.go | Adds darwin build tag |
| imessage/mac/attributedstring.go | Adds darwin build tag and updates import path |
| imessage/mac/sleepdetect.go | Adds darwin build tag |
| cmd/mautrix-imessage/main.go | New bridgev2 main entrypoint + subcommands and permissions repair |
| cmd/mautrix-imessage/login_cli.go | Adds interactive terminal login flow driving bridge login steps |
| cmd/mautrix-imessage/carddav_setup.go | Adds CLI subcommand for CardDAV discovery + password encryption |
| cmd/mautrix-imessage/setup_darwin.go | Adds macOS --setup permission prompts and checks |
| cmd/mautrix-imessage/setup_other.go | Stub setup helpers for non-darwin builds |
| cmd/bbctl/main.go | Adds bbctl CLI entrypoint and command registration |
| cmd/bbctl/auth.go | Adds bbctl auth config handling + login/logout/whoami |
| cmd/bbctl/register.go | Adds bbctl config command to register appservice + generate bridge config |
| cmd/bbctl/stop.go | Adds bbctl stop command to announce stopped bridge state |
| cmd/bbctl/delete.go | Adds bbctl delete command to delete appservice + Beeper API bridge |
| docs/cloudkit-guide.md | Adds CloudKit backfill design/operations documentation |
| .github/workflows/ci.yml | New CI pipeline for lint/test/build (Linux default, macOS on dispatch) |
| .github/workflows/security.yml | Adds govulncheck + cargo-audit security workflows |
| .github/workflows/release.yml | Adds release workflow producing artifacts and GitHub release |
| .github/dependabot.yml | Enables Dependabot for Go/Rust/GitHub Actions |
| AGENTS.md | Adds dev notes for UniFFI binding generation |
| Info.plist | Adds macOS app bundle metadata + Contacts usage description |
| go.mod | Changes module path and updates Go/toolchain + dependency set |
| no-mac.go | Removes legacy non-mac permissions checker stub |
| mac-permissions.go | Removes legacy mac permissions checker (replaced by new setup flow) |
| no-heif.go | Removes legacy HEIF conversion stubs |
| heif.go | Removes libheif-based HEIF conversion implementation |
| mediaviewer.go | Removes legacy media viewer URL generation path |
| findrooms.go | Removes legacy portal discovery implementation |
| commands.go | Removes legacy bridgev1 command handlers |
| config/config.go | Removes legacy config structs (bridgev1) |
| config/bridge.go | Removes legacy bridge config definitions (bridgev1) |
| config/download.go | Removes legacy config download helper |
| config/upgrade.go | Removes legacy config upgrader |
| database/database.go | Removes legacy DB wrapper (bridgev1) |
| database/user.go | Removes legacy user query model |
| database/portal.go | Removes legacy portal query model |
| database/message.go | Removes legacy message query model |
| database/tapback.go | Removes legacy tapback query model |
| database/puppet.go | Removes legacy puppet query model |
| database/mergedchat.go | Removes legacy merged chat query model |
| database/kvstore.go | Removes legacy kv store model |
| database/upgrades/upgrades.go | Removes legacy DB upgrade table registration |
| database/upgrades/00-latest-schema.sql | Removes legacy schema snapshot |
| database/upgrades/02-avatar-optional.go | Removes legacy upgrade step |
| database/upgrades/03-message-part-index.go | Removes legacy upgrade step |
| database/upgrades/04-portal-backfill-start-ts.sql | Removes legacy upgrade step |
| database/upgrades/05-message-on-update-cascade.go | Removes legacy upgrade step |
| database/upgrades/06-crypto-store-last-used.sql | Removes legacy upgrade step |
| database/upgrades/07-tapback-guids.sql | Removes legacy upgrade step |
| database/upgrades/08-remove-management-room.sql | Removes legacy upgrade step |
| database/upgrades/09-add-kv-store.sql | Removes legacy upgrade step |
| database/upgrades/10-personal-filtering-spaces.sql | Removes legacy upgrade step |
| database/upgrades/11-splitcrypto-store-handling-split.sql | Removes legacy upgrade step |
| database/upgrades/12-management-room.sql | Removes legacy upgrade step |
| database/upgrades/13-displayname-override.sql | Removes legacy upgrade step |
| database/upgrades/14-correlation-id.sql | Removes legacy upgrade step |
| database/upgrades/15-thread-id.sql | Removes legacy upgrade step |
| database/upgrades/16-remove-correlation-id.sql | Removes legacy upgrade step |
| database/upgrades/17-batch-send-ids.sql | Removes legacy upgrade step |
| database/upgrades/18-chat-merges.sql | Removes legacy upgrade step |
| database/upgrades/19-add-contact-info.sql | Removes legacy upgrade step |
| database/upgrades/20-thread-id-index.sql | Removes legacy upgrade step |
| database/upgrades/21-prioritized-backfill.sql | Removes legacy upgrade step |
| imessage/ios/requests.go | Removes iOS IPC request types (legacy codepath) |
| imessage/mac-nosip/contactproxy.go | Removes legacy mac-nosip proxy implementation |
| imessage/mac-nosip/nocontactproxy.go | Removes non-darwin mac-nosip stub |
| imessage/bluebubbles/interface.go | Removes legacy BlueBubbles API interface types |
| imessage/bluebubbles/README.md | Removes BlueBubbles docs file |
| docker-run.sh | Removes legacy Docker entrypoint script |
| Dockerfile.ci | Removes legacy CI Dockerfile |
| build.sh | Removes legacy build script |
| clangwrap.sh | Removes legacy iOS clang wrapper |
| bridgeinfo.go | Removes legacy bridge info event mapping |
| chatmerging.go | Removes legacy chat merge/split logic |
| ROADMAP.md | Removes outdated roadmap doc |
| example-registration.yaml | Removes legacy appservice registration example |
| .pre-commit-config.yaml | Removes pre-commit hooks config |
| .gitlab-ci.yml | Removes GitLab CI configuration |
| .github/workflows/go.yml | Removes legacy GitHub Actions Go workflow |
| .github/CODEOWNERS | Removes CODEOWNERS file |
| .github/FUNDING.yml | Removes funding config |
| .github/ISSUE_TEMPLATE/bug.md | Removes bug issue template |
| .github/ISSUE_TEMPLATE/enhancement.md | Removes enhancement issue template |
| .github/ISSUE_TEMPLATE/config.yml | Removes issue template config |

    // Try to load existing key, generate if missing
    key, err := loadCardDAVKey()
    if err != nil {
        key, err = generateCardDAVKey()
        if err != nil {
            return "", err
        }
    }

EncryptCardDAVPassword generates a new key on any loadCardDAVKey() error (including wrong-size key file, permission errors, transient IO errors). That can silently rotate the key and make already-encrypted passwords undecryptable. Prefer generating a new key only when the key file is missing (e.g., errors.Is(err, os.ErrNotExist)), and return the error for other failure modes.

    for i := range newKey {
        newKey[i] = byte(i)
    }
    os.WriteFile(keyPath, newKey, 0600)

These tests ignore the return errors from os.WriteFile/os.MkdirAll. If the writes fail (permissions, disk issues), the test may pass/fail for the wrong reason. Capture and assert the returned errors (e.g., if err := os.WriteFile(...); err != nil { t.Fatal(...) }) to make failures deterministic.
Suggested change:

    - os.WriteFile(keyPath, newKey, 0600)
    + if err := os.WriteFile(keyPath, newKey, 0600); err != nil {
    +     t.Fatalf("WriteFile error: %v", err)
    + }


    os.MkdirAll(dir, 0700)
    os.WriteFile(filepath.Join(dir, cardDAVKeyFileName), []byte("too-short"), 0600)

These tests ignore the return errors from os.WriteFile/os.MkdirAll. If the writes fail (permissions, disk issues), the test may pass/fail for the wrong reason. Capture and assert the returned errors (e.g., if err := os.WriteFile(...); err != nil { t.Fatal(...) }) to make failures deterministic.

        return false
    }
    defer db.Close()
    _, err = db.Query("SELECT 1 FROM message LIMIT 1")

canReadChatDB() uses db.Query(...) but never closes the returned *sql.Rows. Because runSetupPermissions() can call this repeatedly in a loop, this can leak resources and eventually fail the check. Use QueryRow (preferred here) or close the rows handle before returning.
Suggested change:

    - _, err = db.Query("SELECT 1 FROM message LIMIT 1")
    + var probe int
    + err = db.QueryRow("SELECT 1 FROM message LIMIT 1").Scan(&probe)


      if len(parts) < 3 {
          return Identifier{LocalID: guid}
      }
    + localID := parts[2]
    + // Detect groups by the separator character ("+") or by LocalID pattern.
    + // The GUID format is "service;+;localID" for groups and "service;-;localID" for DMs.
    + // Group LocalIDs can be "chat..." (iMessage), hex UUIDs (SMS/RCS), or other formats.
    + isGroup := parts[1] == "+" || strings.HasPrefix(localID, "chat")
      return Identifier{
          Service: parts[0],
    -     IsGroup: parts[1] == "+",
    -     LocalID: parts[2],
    +     IsGroup: isGroup,
    +     LocalID: localID,
      }

Using strings.HasPrefix(localID, "chat") to infer IsGroup can misclassify DMs whose local ID happens to start with chat (e.g., an email address like chat@example.com). Since the comment states the format is service;+;localID for groups and service;-;localID for DMs, relying on parts[1] == "+" is both simpler and more accurate. If there are known real-world GUIDs that violate the +/- separator rule, consider handling those explicitly and add unit tests covering the edge cases.
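The separator-only rule can be sketched as below. This is an illustration under assumptions, not the PR's code: `parseIdentifier` is a hypothetical name, the `Identifier` struct is reduced to the three fields visible in the diff, and splitting with `strings.SplitN` on `;` is assumed from the GUID format described in the code comment.

```go
package main

import "strings"

// Identifier mirrors only the fields visible in the diff above.
type Identifier struct {
	Service string
	IsGroup bool
	LocalID string
}

// parseIdentifier (hypothetical name) splits "service;+;localID" or
// "service;-;localID" and classifies the chat by the separator alone,
// so a DM whose local ID starts with "chat" is not treated as a group.
func parseIdentifier(guid string) Identifier {
	parts := strings.SplitN(guid, ";", 3)
	if len(parts) < 3 {
		return Identifier{LocalID: guid}
	}
	return Identifier{
		Service: parts[0],
		IsGroup: parts[1] == "+",
		LocalID: parts[2],
	}
}
```

Under this rule `iMessage;-;chat@example.com` parses as a DM, which is the edge case the comment raises. Any real-world GUIDs that break the separator convention would need explicit handling and their own test cases.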

    fn main() {
        println!("cargo:rerun-if-changed=src/validation_data.m");
        println!("cargo:rerun-if-changed=src/validation_data.h");

        // Compile the Objective-C file
        cc::Build::new()
            .file("src/validation_data.m")
            .flag("-fobjc-arc")
            .flag("-fmodules") // for @import if needed
            .define("NAC_NO_MAIN", None) // exclude main() when building as a library
            .compile("validation_data");

        // Link with Foundation framework
        println!("cargo:rustc-link-lib=framework=Foundation");
    }

This build script unconditionally compiles Objective-C and links the macOS Foundation framework. That will fail on non-macOS targets if this crate is ever built in CI or as part of a workspace build. Consider gating the build steps with cfg!(target_os = "macos") (and emitting a clear error or doing nothing on other OSes), and similarly gating any tests that require Apple frameworks/network access.

    // The underlying AAAbsintheContext is not Send/Sync by default; upstream
    // rustpush uses it from a single async task so we mirror that pattern.
    unsafe impl Send for NacContext {}

unsafe impl Send for NacContext is a strong guarantee: it allows moving the underlying Apple framework handle across threads, which may be undefined behavior if AAAbsintheContext is not thread-safe. If the intent is to keep it single-threaded, avoid implementing Send (or enforce single-thread use via a non-Send marker). If it truly is safe to send across threads, add a concrete justification (docs/experiments) explaining why the underlying object is thread-safe for cross-thread moves.
Suggested change:

    - // The underlying AAAbsintheContext is not Send/Sync by default; upstream
    - // rustpush uses it from a single async task so we mirror that pattern.
    - unsafe impl Send for NacContext {}
    + // Do not implement Send/Sync for this wrapper: the underlying
    + // AAAbsintheContext is an opaque Apple framework object and we do not have
    + // a documented guarantee that moving it across threads is safe. Keep usage
    + // thread-confined unless and until that guarantee is established.


    func generateSecret(n int) string {
        b := make([]byte, n)
        _, _ = rand.Read(b)

The result of rand.Read is ignored. If the read fails, this will silently return a low-entropy secret (likely all-zero bytes). Handle and propagate the error so config generation fails closed rather than producing a weak provisioning secret.
Suggested change:

    - _, _ = rand.Read(b)
    + if _, err := rand.Read(b); err != nil {
    +     panic(fmt.Errorf("failed to generate secret: %w", err))
    + }

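The suggestion above fails closed by panicking. An alternative that propagates the error to the caller, as the comment also proposes, might look like the sketch below. The changed signature and the hex encoding are assumptions for illustration; the snippet under review does not show how the original function encodes its bytes.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateSecret returns n random bytes encoded as a hex string (the
// encoding is assumed here). A failed read from the system CSPRNG is
// reported to the caller instead of yielding a low-entropy secret.
func generateSecret(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("failed to generate secret: %w", err)
	}
	return hex.EncodeToString(b), nil
}
```

Returning the error lets config generation abort with a normal error message; the panic variant is simpler when every call site would treat the failure as fatal anyway.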

    fmt.Fprintf(os.Stderr, "[permissions] IsConfigured=%v entries=%d\n", configured, len(br.Config.Bridge.Permissions))
    for key := range br.Config.Bridge.Permissions {
        fmt.Fprintf(os.Stderr, "[permissions] %q\n", key)
    }

This emits permissions diagnostics to stderr unconditionally on every start (including listing permission keys). That can be noisy in production logs and may leak configuration details. Prefer using the bridge logger at a debug level (or only printing when an actual repair occurs / when a verbose flag is enabled).
Suggested change:

    - fmt.Fprintf(os.Stderr, "[permissions] IsConfigured=%v entries=%d\n", configured, len(br.Config.Bridge.Permissions))
    - for key := range br.Config.Bridge.Permissions {
    -     fmt.Fprintf(os.Stderr, "[permissions] %q\n", key)
    - }

…'s ring

Stripping the bridge owner's handle from session.members around respond_letmein was meant to suppress the AddMember wire fanout to the owner's own devices on link tap, so the Mac wouldn't ring when the caller joined via web FT. But the strip appeared to also break the initial ring on subsequent !im facetime invocations: wife stopped ringing at all.

The mechanism isn't obvious: respond_letmein only fires after a link tap, well after wife is supposed to have already started ringing from create_session. So either there's a state race I'm not seeing, or something about the mutated session.members is persisting in a way that affects the next outbound call.

Backing out the strip until we can either reproduce the regression cleanly or expose IDSSendMessage from upstream so we can craft a self-only RespondedElsewhere instead of mutating members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BindBridgeLinkToSession sets link.session_link on the persistent "bridge" link immediately after CreateSession, so the letmein approver's linked_group branch matches deterministically on the first tap — no more falling through to member/ringing heuristics that miss under cold-start / stale-state and fabricate an empty session (the "0 people" symptom). Also adds two info logs to diagnose the "wife's phone doesn't ring" side: create_session now logs ring_targets + is_propped + is_ringing_inaccurate after prop_up_conv, and auto_approve logs match_kind (linked | member | ringing | cold-start) so we can tell on the next run whether the pin took and which branch routed the tap.
…irect

82e002a wrapped upstream's create_session with a strip that pulled the caller's handle out of session.members around prop_up_conv, so the wire ring wouldn't fan out to the owner's other Apple devices (Mac, iPad). The original motivation was that a Mac auto-answer sent RespondedElsewhere back to the bridge, cleared is_ringing_inaccurate, and broke auto_approve_bridge_letmein's ringing-group fallback for link taps.

583369d's bind_bridge_link_to_session pins link.session_link to the outgoing session id immediately after create, so the approver's linked_group branch matches deterministically regardless of is_ringing_inaccurate (confirmed on the last test run: match_kind=linked). The strip's original justification is moot.

Empirically the strip also correlated with the callee not ringing on outbound (log showed prop_ok=true, ring_targets=[wife], is_propped=true, but wife's phone never rang; own was absent from update_context.members and fanout_groupmembers in the Invitation wire, which Apple's FT routing appears to reject as malformed). Calling upstream directly sends a well-formed Invitation.

Side effect: the owner's devices will ring too. Acceptable for now; future work is a targeted prop_up_conv(false) nudge once the callee ring is confirmed stable.

Also: inbound-call join link now gets the same &n=<base64-handle> pre-fill that outbound !im facetime applies (client.go:2870-ish), so the user lands on the web FT join page with their display name already populated instead of blank.
Upstream's FTClient::handle() hard-requires decoded_context.message to be Some on command 207 (someone joined) and command 209 (group updated); see facetime.rs:1272 and :1344. Apple has started sending at least some of these with message=None (server-originated state updates after link-tap joins, plus the callee's answer ack), and upstream BadMsg's out. The bridge never records the joiner in session.participants, the local session state diverges from Apple's authoritative copy, and the visible symptom on the callee's device is "this call is not available" when answering.

The fix stays entirely in our wrapper (no upstream source changes; see feedback_no_patch_rustpush):
- Wrap the receive-loop's ft.handle(msg) call with ft_handle_with_join_recovery.
- On any non-BadMsg result (success or other error) return unchanged.
- On BadMsg: re-run identity.receive_message on the cloned msg (it's side-effect-free beyond decryption, so a second call is safe); if cmd is 207 or 209, deserialize the wire plist into a locally-mirrored struct (FTWireMessage's fields are private upstream, but the schema is stable, so we redeclare the fields we need with the same serde rename attrs); insert the joiner into session.participants with sensible defaults; emit a synthetic FTMessage::JoinEvent so the bridge's downstream pipeline still fires.

Skipped: session.unpack_members (private upstream helper). Member-list drift is cosmetic; the load-bearing piece for Apple-side state is the participants map, and that's what we populate.

Pairs with ba96333 (strip removal, wife's phone rings) to close the outbound call loop end-to-end: she rings, she answers, her answer no longer trips BadMsg, session state stays consistent.
Old flow: !im facetime → CreateSession (upstream prop_up_conv(ring=true)) → wife rings immediately → she answers before the caller is in the session → Apple sees no live participant → "call not available" / "request declined." Even when the caller tapped the join link, the race was too tight.

New flow (restored from PR 39's pending-ring design):
1. `!im facetime` calls CreateSessionNoRing, which allocates the session and propagates to Apple's quickrelay, but prop_up_conv(ring=false) + is_ringing_inaccurate=false means no Invitation wire goes out. Nobody rings at this point.
2. RegisterPendingRing queues the callee's handle keyed on the session guid, filtered so the caller's own implicit self-join doesn't fire.
3. The bridge replies with the join link. The caller taps it. The letmein approver adds their web-FT temp pseud as a session member; Apple echoes back a JoinEvent.
4. maybe_fire_pending_ring in the receive loop sees the temp-pseud join (not the caller's own handle → not filtered), pops the queue, and calls ft.ring() against the callee. Her phone rings.
5. She answers. The caller is already a live participant, so Apple's side has a real session to connect her to.

Rust changes (pkg/rustpushgo/src/lib.rs):
- New FFI method WrappedFaceTimeClient::create_session_no_ring mirroring upstream's create_session skeleton but with is_ringing_inaccurate=false and prop_up_conv(ring=false).
- Pending-ring machinery (PendingFTRing, maybe_fire_pending_ring, register_pending_ring) was already in place from an earlier PR; nothing to add there.

Go changes (pkg/connector/facetime.go):
- fnFaceTimeCallInPortal swaps ft.CreateSession → ft.CreateSessionNoRing, then ft.RegisterPendingRing(sessionID, caller_handle, [target], 60s).
- Reply copy updated to match: "Tapping this link will ring <contact>'s phone" instead of "their phone is ringing now."

Regenerated uniffi bindings.
Two follow-ups on ee1ee6f (the pending-ring gate for outbound calls):

1. Restore missed-call detection. create_session_no_ring starts the session with is_ringing_inaccurate=false so prop_up_conv's RespondedElsewhere diversion doesn't fire. That also meant the upstream "no participants active + ringing" branch at facetime.rs:1411 never tripped, so if the callee declined or timed out, the session silently closed instead of marking Missed. maybe_fire_pending_ring now flips is_ringing_inaccurate=true at the moment the Invitation actually leaves, which is the semantically correct point, since that's when the callee's phone starts ringing. Upstream's Missed-marker path now trips normally.
2. Missed-call notice uses the bridge flow instead of facetime://. The old notice gave the user `facetime://<handle>` and `facetime-audio://<handle>` links that only worked on native iOS/macOS; tap on Android/web and nothing happened. Now the notice posts the same bridge link as `!im facetime`:
   - Mint a no-ring session targeting the caller we missed.
   - Queue a pending ring (1-hour TTL since the user may not see the notice immediately).
   - Fetch + pin the persistent bridge link; prefill &n= with the owner's handle.
   - On tap, letmein approve adds the owner to the session; their JoinEvent fires ft.ring() against the original caller.

   Copy mirrors the outbound command: "Tapping this link will ring X's phone … open the link when you're ready to be on camera." Falls back to facetime:// only if the bridge-link arm fails, so native users don't lose functionality on transient errors.

Refactoring: factored the session+link+pending-ring dance into armBridgeFaceTimeCall so fnFaceTimeCallInPortal and handleFaceTimeMissedNotice share one implementation and stay in sync.
facetime:// / facetime-audio:// URL schemes only worked on native iOS/macOS clients — Android/web saw the raw URL and could do nothing with it, and the native path bypassed the bridge entirely anyway. If armBridgeFaceTimeCall fails (session mint, link fetch, etc.), post the missed-call notice without a callback button instead of degrading to the native scheme. User can still `!im facetime` in the portal manually to place the callback, and the notice surfaces the miss either way.
Switch from GetLinkForUsage (persistent bridge link + letmein indirection) to GetSessionLink (session-specific). With the persistent link, tapping routed the caller through auto_approve_bridge_letmein, and the JoinEvent that drives maybe_fire_pending_ring had to match through the linked_group fallback chain. With a session-specific link the caller joins the session directly and the JoinEvent fires cleanly for the pending ring to target wife's phone. Matches the pattern from PR39 which worked end-to-end. BindBridgeLinkToSession is no longer called from Go; the FFI method stays in place as a harmless unused helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The !im facetime setup chains CreateSessionNoRing → RegisterPendingRing → GetSessionLink, and the two APNs-backed calls (create + get_session_link) both surface transient SendTimedOut when APNs drops mid-flight. Our bridge had been hitting this window repeatedly — the APNs reconnect grace is 30s on our side, so a short bounded retry lands on the restored connection instead of returning an error to the user. GetSessionLink's retry is safe: the session.link is persisted before the message_session fanout, so the second call returns via the early-return branch without re-sending. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
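A short bounded retry of this kind can be sketched as follows. The sketch is not the bridge's retryOnAPNsFlap: `retryOnFlap` and the sentinel `errSendTimedOut` are hypothetical stand-ins, and the real code may classify transient errors differently.

```go
package main

import (
	"errors"
	"time"
)

// errSendTimedOut is a hypothetical stand-in for the transient error that
// surfaces when the push connection drops mid-flight.
var errSendTimedOut = errors.New("SendTimedOut")

// retryOnFlap runs op up to attempts times, sleeping delay between tries.
// Only the transient timeout is retried; any other error returns at once.
func retryOnFlap[T any](attempts int, delay time.Duration, op func() (T, error)) (T, error) {
	var zero T
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := op()
		if err == nil {
			return v, nil
		}
		if !errors.Is(err, errSendTimedOut) {
			return zero, err
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // give the connection time to come back
		}
	}
	return zero, lastErr
}
```

A wrapper like this is only appropriate around operations that are safe to repeat, which is the argument the commit makes for GetSessionLink: the link is persisted before the fanout, so a second call returns early without re-sending.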
Two bugs preventing both outbound and inbound FaceTime from connecting:

1. Outbound: GetSessionLink creates links with usage=None (upstream behavior). auto_approve_bridge_letmein gated on usage=="bridge" so session-specific links never got approved: user taps link, LetMeIn fires, bridge ignores it, web FT hangs forever, no JoinEvent, no pending ring, wife never rings. Fix: widen the gate to also accept links where session_link.is_some() (these are bridge-created session-specific links, equally safe to auto-approve).
2. Inbound: handleFaceTimeRingNotice fell back to the persistent bridge link when the caller didn't embed a URL. That link's stale session_link (from a prior auto_approve) routed the user to the wrong session, so "answer" connected to a dead call. Fix: extract the session guid from the marker text and call GetSessionLink(guid) to mint a link that joins the caller's actual session.

Also reorder auto_approve fallback to ringing > linked > member. An actively-ringing session (inbound call) is always the user's immediate concern; a stale linked_group from a prior outbound would otherwise win and route to the wrong session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
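The reordered fallback (ringing, then linked, then member) amounts to a tiered search. The Go sketch below restates that ordering with hypothetical types; the actual approver is Rust code in the wrapper and its session state is richer than three booleans. The "cold-start" label is borrowed from the match_kind values mentioned in an earlier commit message.

```go
package main

// session is a hypothetical, simplified view of one FaceTime session as
// seen by the auto-approver.
type session struct {
	ID      string
	Ringing bool // an inbound call is actively ringing
	Linked  bool // the tapped link is pinned to this session
	Member  bool // the requester is already a session member
}

// pickSession returns the session a link tap should be routed to and the
// kind of match, trying tiers in priority order: ringing, linked, member.
// When nothing matches it reports "cold-start" with an empty ID.
func pickSession(sessions []session) (id string, matchKind string) {
	tiers := []struct {
		kind  string
		match func(session) bool
	}{
		{"ringing", func(s session) bool { return s.Ringing }},
		{"linked", func(s session) bool { return s.Linked }},
		{"member", func(s session) bool { return s.Member }},
	}
	for _, tier := range tiers {
		for _, s := range sessions {
			if tier.match(s) {
				return s.ID, tier.kind
			}
		}
	}
	return "", "cold-start"
}
```

Searching tier by tier, rather than taking the first session that matches any condition, is what keeps a stale linked session from beating a call that is ringing right now.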
…otency

Three retry layers for the FT LetMeIn path that was dying on APNs flaps:

1. FT handle level: upstream's handle_letmein sends a delegation message_session INSIDE handle() before our auto_approve even runs. If APNs flaps there, the entire LetMeIn drops. Now retries once after 2s.
2. respond_letmein level: retries up to 3x with backoff. On retry, strips delegation_uuid so respond_letmein doesn't hit the "Already responded" early-return (first call removed it from delegated_requests but failed at the subsequent send). Duplicate LetMeInResponse is harmless; web client decrypts the first. add_members is idempotent (already-present member triggers ring instead of re-add).
3. Go-side armBridgeFaceTimeCall: retryOnAPNsFlap already covers CreateSessionNoRing and GetSessionLink (prior commit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SMS relays may normalize UUID case, causing delivery receipts to miss their target message in the bridge DB. Fall back to upper/lower case lookup before dropping the receipt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
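The case fallback can be sketched as a lookup that tries the GUID as received, then upper case, then lower case. `findByGUIDAnyCase` and the `lookup` callback are hypothetical; the bridge's real database query is not shown in this commit message.

```go
package main

import "strings"

// findByGUIDAnyCase (hypothetical helper) looks up a message by GUID,
// trying the GUID exactly as received first and then its upper-case and
// lower-case forms. lookup stands in for the bridge's database query.
func findByGUIDAnyCase(lookup func(guid string) (string, bool), guid string) (string, bool) {
	candidates := []string{guid, strings.ToUpper(guid), strings.ToLower(guid)}
	for _, candidate := range candidates {
		if id, ok := lookup(candidate); ok {
			return id, true
		}
	}
	return "", false
}
```

Trying the exact form first keeps the common path to a single query; the two extra lookups only run when a receipt would otherwise be dropped.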
When APNs drops mid-send ("early eof"), the connection reconnects within
seconds but the in-flight send times out. Wrap all outbound send paths
(message, attachment, edit, unsend, tapback, read receipt, typing) with
retrySendOnAPNsFlap — same pattern already used for FaceTime calls.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the raw UUID argument with an interactive numbered list matching the contact-search and restore-chat UX patterns. Users now type `off` and pick from Do Not Disturb, Sleep, Driving, Personal, or Work instead of memorising Apple mode identifiers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ists Prevents the bridge from subscribing to or inviting its own handles in StatusKit operations, which wastes APNs quota and can cause self-loops. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contacts may key back under a different handle form than their ghost ID (e.g. mailto: vs tel:). request_handles does exact string matching against the ghost list, so cross-form keys were silently unsubscribed — no APNs channel, no presence updates, ever. Augment the ghost handle list with every "from" handle persisted in statuskit-state.plist so request_handles matches all available channels regardless of handle form. Also add missing bridge_id filter to the ghost query in subscribeToContactPresence (the other two StatusKit ghost queries already had it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fill allowed_modes with standard iOS Focus mode IDs (DND, Sleep, Driving, Personal, Work) instead of sending an empty list. iOS may silently ignore key-sharing invites with no allowed modes.
- Add per-handle target breakdown logging so we can see which contacts have IDS delivery targets and which don't.
- Log invite_to_channel completion for end-to-end send confirmation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace opaque UUID-based !shared-albums and !shared-photos commands with a 3-step numbered picker: browse albums by name, browse assets by filename/date/dimensions, then download selected assets into a dedicated deletable Matrix room through the bridge's full media pipeline (HEIC→JPEG, video transcoding, thumbnails). Rust FFI additions: list_albums(), get_album_assets(), download_file() with new SharedAlbumInfo and SharedAssetInfo uniffi records. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g level The manual !statuskit-invite-channel command uses WrappedStatusKitClient.invite_to_channel, not Client.invite_to_status_sharing. Previous diagnostic logging only covered the automated path. Add info-level logging to both paths. Also raise rustpush crate log level from warn to info so upstream IDS send/receive diagnostics (target counts, delivery confirmations, key lookups) are visible in the journal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The wrapper's targets_for_handles finds 71 delivery targets, but the upstream invite_to_channel does its own get_participants_targets lookup internally. If that internal lookup returns empty (different cache key path), the IDS send is silently skipped — explaining why invites appear to succeed but contacts never respond. Add a pre-send diagnostic that compares wrapper vs internal target counts and warns on mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dedicated album room was created with IsDirect: true but wasn't registered in the user's m.direct account data, so Beeper treated it as a group room (can't delete, only leave). Now calls MarkAsDM() via the double puppet so the room appears as a true DM that users can delete from Beeper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip per-handle target breakdown, internal mismatch detection, manual invite tracing, and rustpush=info log level bump. Keep the allowed_modes and subscribe augmentation fixes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I believe this pull request was opened accidentally. It is not meant for this repository. Perhaps someone should close it.
The bot-created room couldn't be deleted from Beeper because the user didn't own it. Now the double puppet (user's own Matrix identity) creates the room and invites the bot, so it behaves like a real DM between the user and the bot — deletable from Beeper like any other DM conversation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a read-only dump of the in-memory keychain state per pass to investigate the upstream "PCS master key verification failed" warning that fires once before the StatusKit-CloudKit DONE line. The warning comes from rustpush's PCSPrivateKey::from_dict swallowing a MasterKeyNotFound from verify_with_keychain, and it's unclear without introspection whether the missing master is a benign condition (master genuinely not provisioned for com.apple.statuskit on this account, or orphaned-but-valid service key) or a sign of a sync gap worth fixing. The diagnostic dumps view sizes, labels, and atyp keyids for both ProtectedCloudStorage (where masters live) and LimitedPeersAllowed (where the StatusKit service key lives). One info!() line per pass prefixed STATUSKIT-CLOUDKIT-DIAG. No keychain writes, no Apple network traffic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cause The dump from one pass confirmed the "PCS master key verification failed" warning is from re-registration history: the user's account has five distinct com.apple.statuskit service keys in LimitedPeersAllowed (each created by a separate MBA registration over time) plus two PCS MasterKey entries from rotations. Upstream's get_service_key picks one whose parent reference doesn't match any current PCS atyp, so verification fails — but per-record decryption still succeeds (decode_failed=0) because crypto only needs the service key's private part. Question answered, removing the diagnostic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
decode_invitation_record was running get_zone_encryption_config and pcs_keys_for_record on every record, then checking !has_payload AFTER that work. 10-field-variant records (CD_peerKey/CD_serverKey/CD_channelToken instead of CD_invitationPayload + CD_incomingRatchetState) have no assembly path yet and always returned Ok(None) anyway. Move the !has_payload skip up before any PCS / keychain work. Pure hygiene: the wasted unwrap calls are local in-memory operations and don't generate Apple-visible failed-protection events, but doing less work for records we'll discard removes ambiguity and saves CPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
try_fetch_zone treated records.is_empty() as "this candidate didn't
hit" and fell through to the next zone. When all candidates returned
empty (the steady-state shape on a quiet pass with an up-to-date
since_token), the FFI returned ResolvedZone=None and the Go side
cleared the cached zone row — forcing the next pass into fresh
discovery and a from-scratch over-fetch.
Treat an empty page from the cached zone as a legitimate "no new
changes" response: return Some(DiscoveryHit { records: empty,
next_token }) so the cached zone stays cached and since_token
advances normally. Discovery candidates (no cache, or non-cached
fallbacks) still fall through on empty.
Real recovery paths (ZoneNotFound, explicit fetch errors) still
trigger re-discovery as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
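The rule in this commit can be modelled in a few lines: an empty page counts as a hit only when it comes from the cached zone. The sketch below is a simplified Go model of that decision, not the Rust try_fetch_zone code itself; the `candidate` and `page` types and the `pickZone` name are assumptions made for illustration.

```go
package main

// page is a simplified stand-in for one fetch result.
type page struct {
	records   []string
	nextToken string
}

// candidate is a hypothetical zone candidate together with its fetched page.
type candidate struct {
	zone   string
	cached bool // true when this zone came from the persisted cache
	page   page
}

// pickZone returns the first candidate that counts as a hit. A non-empty page
// always counts. An empty page counts only for the cached zone, where it means
// "no new changes", so the zone stays cached and the token still advances.
// Empty pages from discovery candidates fall through to the next candidate.
func pickZone(cands []candidate) (zone string, p page, ok bool) {
	for _, c := range cands {
		if len(c.page.records) > 0 || c.cached {
			return c.zone, c.page, true
		}
	}
	return "", page{}, false
}
```

Fetch errors such as ZoneNotFound are outside this model; per the commit message they still trigger re-discovery.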
The bootstrap-side runCloudSyncOnce was firing the StatusKit-CloudKit pull at the end of its phase sequence, before createPortalsFromCloudSync ran. subscribeToContactPresence (which the pull triggers when keys are injected) queries the ghost table for handles to subscribe to — at bootstrap time that table was empty, so the call subscribed to nothing and the freshly-injected keys never got wired up. The 12h success floor then blocked subsequent passes from re-trying, leaving the bridge with keys it never used. Defer the bootstrap-side StatusKit pull to the end of runCloudSyncController, after createPortalsFromCloudSync and the post-sync housekeeping steps. By that point ghosts exist and the subscribe call has handles to act on. Steady-state cloud-sync cycles continue to fire the pull from inside runCloudSyncOnce as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OnStatusUpdate was falling back to "now - 1ms" when the target portal had no prior message (freshly-created portal, or initial backfill hadn't loaded its messages yet). Clients that don't fully honor the com.beeper.action_message extension would then treat the notice as a new tip-of-timeline event and bump the room to the top of the room list. During initial backfill, presence broadcasts arriving for many half-backfilled portals at once scrambled chat order in random arrival sequence. Skip the notice entirely when lastMsg is nil. The next presence broadcast from the same peer (after backfill catches up and there's a real anchor message) will create the notice properly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt (cc80fe1) skipped the notice when the target portal had no anchor message — wrong direction; the notice should still be sent. Replace with: stamp the notice at the same timestamp as the last message in the portal (drop the prior -1ms offset). Matching the last message's timestamp keeps room ordering stable on clients that ignore the com.beeper.action_message=presence_update extension. In practice the no-anchor-message case shouldn't occur (no messages means nothing to backfill, so no portal), but the now-1ms fallback is preserved as a defensive default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier attempt placed the deferred StatusKit-CloudKit pull at the end of runCloudSyncController (after runPostSyncHousekeeping). That's still during the bridge sync phase — forward backfill is queued asynchronously by createPortalsFromCloudSync and runs after the controller returns. Firing the pull at controller exit time means subscribeToContactPresence still queries an incomplete ghost table: DM ghosts created during forward backfill aren't there yet. Move the pull to onForwardBackfillDone at counter==0, alongside the existing inviteContactsToStatusSharing trigger that already uses this hook for the same "ghosts now fully exist" reason. Drop the controller-side call I added previously. The runCloudSyncOnce defer-on-bootstrap gate stays — bootstrap flow now routes entirely through this post-backfill hook; steady-state cycles continue to fire from inside runCloudSyncOnce as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggering the StatusKit-CloudKit pull from onForwardBackfillDone at counter==0 still fired too early on initial bootstrap. Forward backfill's last batch can be in flight to Matrix and the bridge DB at that exact moment; presence broadcasts arriving for not-yet-committed portals hit the OnStatusUpdate `lastMsg=nil` fallback and bumped chats to the top of the room list. Warm restart works because prior-session messages are already in the DB to anchor against. Drop the onForwardBackfillDone-triggered pull. Instead, gate syncCloudStatusKitPeers itself: skip when initial forward backfill hasn't completed (apnsBufferFlushedAt == 0) or completed within the last 60s (settle window for any straggler DB writes). The natural runCloudSyncOnce cadence (delayed re-syncs, APNs nudges) will fire the pull as soon as the gate clears. Presence isn't time-critical — taking an extra minute to make sure backfill is fully committed is the right trade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
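The gate can be expressed as a pure predicate over two timestamps. A minimal sketch, assuming the completion time is stored as Unix seconds with 0 meaning "not completed yet"; `backfillSettled` and `settleWindowSeconds` are illustrative names, and whether the boundary at exactly 60s passes is a guess.

```go
package main

// settleWindowSeconds mirrors the 60s settle window from the commit message.
const settleWindowSeconds int64 = 60

// backfillSettled reports whether the StatusKit pull may run. It returns false
// while initial forward backfill has not completed (flushedAtUnix == 0) and
// for the first 60 seconds after completion. Hypothetical helper: the real
// check lives inside syncCloudStatusKitPeers.
func backfillSettled(flushedAtUnix, nowUnix int64) bool {
	if flushedAtUnix == 0 {
		return false
	}
	return nowUnix-flushedAtUnix >= settleWindowSeconds
}
```

Keeping the check free of clocks and side effects makes the skip conditions easy to unit test.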
The original design assumed the outer cloud-sync orchestrator would call syncCloudStatusKitPeers many times across a session, naturally draining via the persisted continuation token (the pattern the chat/message/attachment backfill paths use). For those paths the assumption holds: they get hit hundreds of times per session by APNs nudges and other triggers. For StatusKit the assumption broke down: runCloudSyncOnce only fires from the bootstrap retry loop and the three delayed re-syncs at 15s/60s/3min. After that no further trigger exists until the next bridge restart, so the remaining pages were stranded. Loop the FFI call within a single pass until the response returns no continuation token, returns no records, or the loop hits the 30-page safety cap. Each successful page persists its zone+token before the next iteration so a crash mid-drain resumes correctly. A 1-second pause between pages keeps the per-pass CKKS round-trip rate gentle. Bounded cost per pass: ~5 base round-trips + (N-1) FetchRecordChanges where N is pages drained. For typical accounts that's 2-3 total. Strictly less aggressive than the existing chat/message backfill paths, which have no per-page cap and no inter-page pacing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
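The drain loop can be sketched as follows. This is a simplified model, not the bridge's code: `fetch`, `persist`, and `pause` are injected callbacks (assumptions for illustration) so that the FFI call, the token write, and the 1-second sleep can each be swapped for test doubles.

```go
package main

// maxPagesPerPass mirrors the 30-page safety cap from the commit message.
const maxPagesPerPass = 30

// page is a simplified stand-in for one FFI page result.
type page struct {
	Records   int
	NextToken string
}

// drain fetches pages until a page has no continuation token, a page has no
// records, or the cap is reached. Each returned token is persisted before the
// next fetch so a crash mid-drain resumes from the last completed page.
// It returns the number of pages fetched successfully.
func drain(
	token string,
	fetch func(token string) (page, error),
	persist func(token string) error,
	pause func(),
) (int, error) {
	for pages := 1; pages <= maxPagesPerPass; pages++ {
		p, err := fetch(token)
		if err != nil {
			return pages - 1, err
		}
		if p.NextToken != "" {
			if err := persist(p.NextToken); err != nil {
				return pages, err
			}
			token = p.NextToken
		}
		if p.NextToken == "" || p.Records == 0 {
			return pages, nil
		}
		if pages < maxPagesPerPass {
			pause() // the bridge waits 1 second here
		}
	}
	return maxPagesPerPass, nil
}
```

In production the `pause` callback would be something like `func() { time.Sleep(time.Second) }`; a context-aware wait would be a reasonable refinement.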
The cloud-sync controller's three delayed re-syncs (15s/75s/4m15s after bootstrap) all fire well before forward backfill completes on accounts with substantial history — backfill regularly takes 20+ minutes and can run hours on the heaviest accounts. The internal settle-window gate in syncCloudStatusKitPeers was correctly skipping those premature attempts, but no later trigger existed: after the 4m15s re-sync, runCloudSyncOnce isn't called again until the next bridge restart. So the StatusKit drain only ever ran on restart for those accounts. Add a post-backfill trigger in onForwardBackfillDone at counter==0 that sleeps slightly longer than the gate's settle window (75s vs 60s) and then calls syncCloudStatusKitPeers. The drain now fires after backfill completes, always, regardless of duration. On warm restart with fast backfill the 12h success floor short-circuits this call after the delayed re-sync has already drained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lowing Cached-path DOSYNC and FetchRecordChangesOperation errors were returning Ok(None), which the FFI wrapped as a clean empty page. The Go-side gate then recorded ConsecutiveErrs=0, applied the 12h success floor, and cleared the cached zone — forcing the next pass into fresh discovery (the burstiest pattern) and locking key population behind a 12h gate per cycle on persistent failure. Surface those errors as WrappedError::GenericError so the inter-pass backoff schedule (15m → 30m → 1h → 2h, retry-after honoring) actually fires on the signal it was built to act on. Discovery-mode and non-cached fallback-zone errors keep the prior fall-through behavior so first-pass discovery and the cached-zone-disappeared re-discovery path are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
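The backoff schedule named above (15m → 30m → 1h → 2h) can be read as a lookup indexed by the consecutive-error count. A minimal sketch; `backoffFor` is a hypothetical name, and treating "retry-after honoring" as taking the larger of the two delays is an assumption, not something the commit message states.

```go
package main

import "time"

// backoffSchedule is the inter-pass schedule from the commit message.
var backoffSchedule = []time.Duration{
	15 * time.Minute,
	30 * time.Minute,
	time.Hour,
	2 * time.Hour,
}

// backoffFor returns the delay before the next pass after consecutiveErrs
// failures in a row. The schedule saturates at its last entry. A server
// supplied retryAfter that is longer than the scheduled delay wins
// (assumed behaviour).
func backoffFor(consecutiveErrs int, retryAfter time.Duration) time.Duration {
	if consecutiveErrs <= 0 {
		return 0
	}
	idx := consecutiveErrs - 1
	if idx >= len(backoffSchedule) {
		idx = len(backoffSchedule) - 1
	}
	d := backoffSchedule[idx]
	if retryAfter > d {
		return retryAfter
	}
	return d
}
```

This only works if errors actually reach the counter, which is the point of the commit: a swallowed error looks like `consecutiveErrs == 0` and never enters the schedule.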
Companion to be61a85 — the cached-path DOSYNC and FETCH paths now propagate Apple-side errors, but candidate.init failures were still swallowed via `continue`. With a cached path, candidates_to_try has only one entry, so `continue` exits the loop into an empty success page → same 12h-floor lockout pattern. Discovery mode keeps the prior fall-through behavior so init failures on alternate candidates still let the loop try the next candidate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ence senders
Peer iOS fans every reshare across all of the peer's registered handles —
same channel id, different sender per alias. When a presence update
arrives from an alias that's missing from contacts and that Apple's IDS
refuses to correlate (LookupFailed/6001), the standard chain
(learned-cache → contacts → IDS → mailto: portal) drops the notice with
"no DM portal found".
Add a persistent (channel_id ↔ alias) cluster store that captures the
full reshare alias graph and powers a transitive resolver: for an
unknown handle X, look up the channel_ids X has been observed on, list
sibling handles in those clusters, and resolve through the persistent
alias→portal map (or the live chain on a sibling). The first sibling
that resolves hands X its portal too — and the mapping is persisted
for O(1) future lookups.
Data sources feeding the cluster:
- APNs reshares: on_reshare_sender now carries channel_id (rust trait
+ both call sites updated). Live observations land immediately.
- StatusKit-CloudKit pull: every successfully-decoded
CD_ReceivedInvitation contributes (channel_id, sender) via a new
cluster_observations field on the FFI page return — catches peers
keyed via offline reshares that never fired the live callback.
Persistence:
- statuskit.alias_portal.<handle> → portalID
- statuskit.channel_cluster.<channel> → JSON [handles]
- statuskit.alias_channels.<handle> → JSON [channel_ids]
statusKitPortalCache is now KV-backed via rememberAliasPortal, and the
in-memory map is hydrated from KV on Connect. A second pre-warm pass
scans bridge ghosts and seeds (handle → portal) via the cheap
non-IDS chain so the very first presence update after a restart
resolves known peers without round-trips.
Resolver code lives in its own file (statuskit_alias_resolver.go) so it
survives a future cutover to upstream rustpush's native StatusKit-
CloudKit pull — it consumes from durable callback shapes, not from
the current pull's internals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
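The transitive lookup described in this commit can be illustrated with three in-memory maps that mirror the persisted keys. This is a sketch under assumptions: the real store is KV-backed and can also fall back to the live resolution chain for siblings, neither of which is modelled here, and `clusterStore` and its methods are hypothetical names.

```go
package main

// clusterStore mirrors the three persisted key families in memory.
type clusterStore struct {
	aliasPortal    map[string]string   // handle -> portal ID
	channelCluster map[string][]string // channel ID -> handles seen on it
	aliasChannels  map[string][]string // handle -> channel IDs it was seen on
}

func newClusterStore() *clusterStore {
	return &clusterStore{
		aliasPortal:    map[string]string{},
		channelCluster: map[string][]string{},
		aliasChannels:  map[string][]string{},
	}
}

func contains(list []string, v string) bool {
	for _, x := range list {
		if x == v {
			return true
		}
	}
	return false
}

// observe records that handle was seen as a sender on channelID.
func (s *clusterStore) observe(channelID, handle string) {
	if !contains(s.channelCluster[channelID], handle) {
		s.channelCluster[channelID] = append(s.channelCluster[channelID], handle)
	}
	if !contains(s.aliasChannels[handle], channelID) {
		s.aliasChannels[handle] = append(s.aliasChannels[handle], channelID)
	}
}

// resolve returns the portal for handle. If handle has no direct mapping, it
// walks the channels handle was seen on and borrows the portal of the first
// sibling that has one, caching the result for later direct lookups.
func (s *clusterStore) resolve(handle string) (string, bool) {
	if p, ok := s.aliasPortal[handle]; ok {
		return p, true
	}
	for _, ch := range s.aliasChannels[handle] {
		for _, sib := range s.channelCluster[ch] {
			if sib == handle {
				continue
			}
			if p, ok := s.aliasPortal[sib]; ok {
				s.aliasPortal[handle] = p
				return p, true
			}
		}
	}
	return "", false
}
```

After the first successful transitive resolve, the alias has its own entry in `aliasPortal`, which is what makes later lookups O(1).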
…d filter) Initial hydration query used the wrong table name (kv) and missed the required bridge_id filter, producing "no such table: kv" on every restart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OnStatusUpdate calls resolveViaCluster with the raw user form
("aap724@icloud.com"), but the CloudKit pull stores cluster
observations with the prefixed form ("mailto:aap724@icloud.com") that
Apple's records carry. Lookups missed every time.
Normalize the alias inside recordReshareObservation,
resolveViaCluster, lookupAliasPortal, and rememberAliasPortal so all
paths converge on the canonical prefixed key regardless of caller.
Also promote the cluster-observation log line from Debug to Info so
the pull's contribution to the cluster is visible without flipping
log levels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
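A normalizer of the kind described could look like the sketch below. The function name and the "contains @ means email" heuristic are assumptions for illustration; the bridge may classify handles differently.

```go
package main

import "strings"

// normalizeAlias converts a raw handle to the prefixed form used as the
// canonical key. Already prefixed handles pass through unchanged. The
// email-vs-phone heuristic is an assumption made for this sketch.
func normalizeAlias(handle string) string {
	handle = strings.TrimSpace(handle)
	switch {
	case handle == "":
		return ""
	case strings.HasPrefix(handle, "mailto:"), strings.HasPrefix(handle, "tel:"):
		return handle
	case strings.Contains(handle, "@"):
		return "mailto:" + handle
	default:
		return "tel:" + handle
	}
}
```

Calling the normalizer at every entry point, rather than at call sites, is what lets writers and readers agree on a key regardless of which form the caller holds.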
…ns land Previously the alias→portal mapping for an unknown handle was only materialized when a presence update arrived from that handle (the on-arrival cluster transitive lookup). This means the bridge waited for the peer to publish before binding the alias. Now every observation that grows a cluster runs eagerLinkClusterToPortal: walks the cluster, finds the first sibling that resolves (via the persistent alias-portal map or the live non-IDS chain), and maps every unmapped sibling to that portal. By the time presence arrives from a hidden alias, step 0 of the OnStatusUpdate chain (statusKitPortalCache) already has it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or unknown aliases
When a presence update arrives from a handle that isn't in contacts,
isn't in the bridge's IDS cache, and doesn't share a CloudKit cluster
with a known sibling, the chain dropped the notice. Now the bridge
asks Apple via IDS, with a wider service list and a persistent
negative cooldown so the lookup happens at most once per
six-hour window per unmapped handle.
Changes:
- Extend resolve_handle / resolve_handle_cached SERVICES list to
include com.apple.private.alloy.status.personal and
com.apple.icloud.presence.mode.status. Hidden Apple-ID-linked
aliases publish on these but aren't registered for Madrid or
status.keysharing, so validate_targets returned LookupFailed
(6001). The presence services catch them and surface a correlation
id we can match against known siblings.
- resolveStatusPortalViaIDSCached wraps the existing IDS resolver
with two cache layers: alias_portal KV short-circuits prior
successes (in-memory + persistent), and statuskit.ids_attempt.<h>
records a 6-hour negative cooldown so a stuck handle doesn't
re-trigger an IDS round-trip on every Focus toggle.
- OnStatusUpdate's chain and eagerResolveReshareSender both now go
through the cached wrapper.
Cutover note: when upstream rustpush ships its native correlation
helper, the cache wrapper stays — only the underlying ResolveHandle
target swaps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
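The negative cooldown can be sketched as a timestamp map consulted before each lookup. This is an in-memory model only: the commit persists the stamp under statuskit.ids_attempt.<h>, and the type and method names below are hypothetical.

```go
package main

// negativeCooldownSeconds is the six-hour window from the commit message.
const negativeCooldownSeconds int64 = 6 * 60 * 60

// attemptLog maps a handle to the Unix time (seconds) of its last failed
// IDS lookup.
type attemptLog map[string]int64

// shouldQuery reports whether an IDS lookup for handle is allowed at nowUnix:
// either no failure is on record or the cooldown has fully elapsed.
func (a attemptLog) shouldQuery(handle string, nowUnix int64) bool {
	last, seen := a[handle]
	return !seen || nowUnix-last >= negativeCooldownSeconds
}

// recordFailure stamps a failed lookup so repeats are suppressed.
func (a attemptLog) recordFailure(handle string, nowUnix int64) {
	a[handle] = nowUnix
}

// recordSuccess clears the stamp once the handle resolves.
func (a attemptLog) recordSuccess(handle string) {
	delete(a, handle)
}
```

A successful resolution would normally be served from the alias_portal cache first, so this check only matters for handles that are still unmapped.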
PascalCase Go filenames are non-standard; snake_case lowercase matches the rest of the package.
…ered topic
Two fixes in one commit, both surfaced by the aap724 hidden-alias case:
1. Drop com.apple.icloud.presence.mode.status from the SERVICES list in
resolve_handle / resolve_handle_cached. That topic is APNs-interest-only,
not a registered IDSService — passing it to validate_targets makes
IdentityManager::get_main_service panic via .expect("Topic ... not
found!"), crashing the bridge process. The bridge registers MADRID and
MULTIPLEX (which sub_serves status.keysharing and status.personal); the
trimmed three-topic list covers all valid IDS lookups.
2. Restore the original Madrid batch validate_targets in resolve_handle.
Hidden Apple-ID aliases (e.g. mailto:aap724@icloud.com) return
LookupFailed (6001) when queried alone, but get their correlation_id
populated alongside successful sibling lookups when the batch includes
known ghost handles. The single-handle refactor — intended to cap a 15s
block on bridges with hundreds of handles — broke the aap724 → wife
correlation entirely. Restore [unknown_handle, ...known_ghosts] for
Madrid (15s timeout, ample for typical ghost counts), keep single-handle
for the other services. resolve_handle_cached is unchanged: cache reads
are correct as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a fair-slot rate gate around the alias-resolver's batch
validate_targets path. Per-handle 6h negative cache already prevents
re-querying the same unknown twice; this layer protects against a
*burst* of distinct unknowns (e.g. 50 reshares dropping in within a
minute after a CloudKit pull) firing parallel IDS calls.
Three-layer defense:
- Concurrency cap (1): callers reserve slots serially.
- Min interval (3s): base spacing between batch calls. Quieter than
real iPhone bursts when sending to groups.
- Adaptive multiplier (×N consecutive failures, capped at 8 → 24s):
softens harder when results keep coming back empty. Resets on any
successful resolution.
The gate uses slot-reservation rather than mutex-then-sleep so context
cancellation interrupts cleanly and back-to-back callers fairly receive
distinct future slots. Steady-state cost is zero — typical resolver
runs are minutes apart, the 3s pacing is invisible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
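One way to build a slot-reservation gate with these properties is sketched below. It is an illustration under assumptions, not the bridge's implementation: `rateGate` and its methods are hypothetical names, and a caller that cancels while waiting leaves its reserved slot unused.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// rateGate hands out future time slots one at a time.
type rateGate struct {
	mu       sync.Mutex
	base     time.Duration // minimum spacing between calls (3s in the commit)
	maxMult  int           // cap on the failure multiplier (8 in the commit)
	failures int           // consecutive failed resolutions
	nextSlot time.Time     // earliest time the next caller may proceed
}

func newRateGate(base time.Duration, maxMult int) *rateGate {
	return &rateGate{base: base, maxMult: maxMult}
}

// interval is the current spacing: base times the number of consecutive
// failures, with the multiplier clamped to [1, maxMult].
func (g *rateGate) interval() time.Duration {
	mult := g.failures
	if mult < 1 {
		mult = 1
	}
	if mult > g.maxMult {
		mult = g.maxMult
	}
	return g.base * time.Duration(mult)
}

// reserve claims the next free slot and returns how long the caller must
// wait for it. The lock is held only while reserving, never while waiting.
func (g *rateGate) reserve(now time.Time) time.Duration {
	g.mu.Lock()
	defer g.mu.Unlock()
	slot := g.nextSlot
	if slot.Before(now) {
		slot = now
	}
	g.nextSlot = slot.Add(g.interval())
	return slot.Sub(now)
}

// Wait blocks until the reserved slot arrives or ctx is canceled.
func (g *rateGate) Wait(ctx context.Context) error {
	wait := g.reserve(time.Now())
	if wait <= 0 {
		return ctx.Err()
	}
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// Report updates the multiplier: any success resets it, a failure grows it.
func (g *rateGate) Report(success bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if success {
		g.failures = 0
		return
	}
	g.failures++
}
```

Because waiting happens outside the mutex, cancellation interrupts a waiter immediately, and two back-to-back callers receive distinct slots one interval apart.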
resolve_handle was calling validate_targets, which uses refresh=false in upstream IDS. With refresh=false, a previously-empty IDS result stays "fresh" for EMPTY_REFRESH (1h) and is filtered out of the HTTP fetch, so Apple is never re-queried for that handle within the window. This silently broke hidden-alias resolution: aap724@icloud.com (and similar) hits LookupFailed once, gets cached as empty, then is excluded from every subsequent batch lookup for the next hour even though the batch itself runs fine for sibling handles. Switch the resolver to cache_keys(refresh=true) directly. With refresh=true, is_dirty drops the cutoff to REFRESH_MIN (60s), so the unknown handle is included in the fetch on every resolver pass. The rate gate already in place (3s min interval, exponential backoff on failure) is the safety net against pounding Apple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
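The freshness rule for a previously empty result, as this commit message describes it, reduces to a cutoff comparison. The Go sketch below models only that description; it is not upstream rustpush code, and `needsFetch` and the constant names are assumptions.

```go
package main

const (
	emptyRefreshSeconds int64 = 60 * 60 // EMPTY_REFRESH: 1h, used when refresh=false
	refreshMinSeconds   int64 = 60      // REFRESH_MIN: 60s, used when refresh=true
)

// needsFetch reports whether a cached empty result of the given age would be
// included in the next fetch. With refresh=false the entry stays fresh for an
// hour; with refresh=true the cutoff drops to sixty seconds.
func needsFetch(ageSeconds int64, refresh bool) bool {
	cutoff := emptyRefreshSeconds
	if refresh {
		cutoff = refreshMinSeconds
	}
	return ageSeconds >= cutoff
}
```

For a handle that failed ten minutes ago, the model returns false with refresh=false and true with refresh=true, which is the difference the commit relies on.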
Adds a local-first, Apple-fallback alias-link orchestrator that runs at
the end of every successful StatusKit-CloudKit drain. Walks state.keys,
links each peer handle to a portal using cached/local data first, and
sends only residual unknowns to Apple in a single batched IDS call.
Resolution order per handle:
1. alias_portal cache (in-memory + KV) — already linked, skip
2. Cluster store — sibling on a shared channel id
3. Contacts + direct tel:/mailto: portal lookup
4. Batched IDS — one cache_keys covers every residual unknown plus
every known portal-bearing ghost; siblings matched via correlation_id
Why this shape:
- Idempotent: re-running confirms link state, doesn't re-do work.
- Cheap: most handles resolve from local data; Apple only sees the
residual after that filter.
- Self-healing: each pass picks up handles that became resolvable
(new ghost, new cluster observation, Apple finally publishing a
correlation) since the prior pass.
- Bootstrap-safe: hooks at the END of syncCloudStatusKitPeers, which
already gates on apnsBufferFlushedAt + 60s settle window + 12h
success floor. No separate trigger; bridge start runs the pass via
the cloudkit cycle's natural startup invocation.
Rust side: new batch_resolve_handles uniffi method that vectorizes
resolve_handle. One cache_keys(refresh=true) call per service across
unknowns ∪ known siblings, then walks the cache once to match
correlation_ids. 90s timeout cap.
Go side: new batchLinkStatusKitAliases hooked at the end of
syncCloudStatusKitPeers, plus collectKnownPortalHandles helper that
scans the ghost table for tel:/mailto: ids to feed the IDS batch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: batch link logged "resolved DM portal via IDS correlation" for aap724 (and presumably others), but only some entries actually showed up in the kv_store after a bridge restart. cameronaaron's four aliases landed; aap724 didn't. Root cause: bridgev2's KV.Set silently drops writes when ctx is canceled. It logs via zerolog.Ctx(ctx), which returns a disabled logger when no logger is attached to ctx, so the failure leaves no trace. The cloudkit cycle ctx CAN cancel mid-iteration (orchestrator deadline, shutdown, delayed-resync race), and Go map iteration order is random — so whichever resolved entries happen to land late in the loop are silently lost while earlier ones commit. Fix: persist via context.Background() inside the batch link. The IDS call upstream still respects the cycle ctx (rust-side 90s timeout caps it), so cancellation propagates to the network call but not to the local SQL UPSERT for entries that already returned a result. Also adds: - Read-after-write verification that warns if a write didn't land, so any future regression is immediately visible. - Negative-attempt stamp clearing on batch-link success, mirroring what resolveStatusPortalViaIDSCached already does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
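The fix and the read-after-write check can be sketched together. The `kvStore` type below is a hypothetical stand-in that returns an error on a canceled context (the real store drops the write silently, per the commit message); only the `statuskit.alias_portal.` key prefix comes from this PR.

```go
package main

import (
	"context"
	"fmt"
)

// kvStore is a hypothetical in-memory stand-in for the bridge's KV store.
type kvStore struct{ data map[string]string }

// Set refuses writes once ctx is canceled, modelling the dropped-write
// behaviour described in the commit message.
func (kv *kvStore) Set(ctx context.Context, key, value string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	kv.data[key] = value
	return nil
}

func (kv *kvStore) Get(key string) string { return kv.data[key] }

// persistLinks writes every handle -> portal link using a background context,
// so cancellation of the caller's sync cycle cannot drop a local write. It
// then reads each key back and reports any write that did not land.
func persistLinks(kv *kvStore, links map[string]string) []error {
	var problems []error
	for handle, portal := range links {
		key := "statuskit.alias_portal." + handle
		if err := kv.Set(context.Background(), key, portal); err != nil {
			problems = append(problems, fmt.Errorf("set %s: %w", key, err))
			continue
		}
		if got := kv.Get(key); got != portal {
			problems = append(problems, fmt.Errorf("verify %s: got %q, want %q", key, got, portal))
		}
	}
	return problems
}
```

On Go 1.21 or later, `context.WithoutCancel(ctx)` is an alternative to `context.Background()` that keeps context values such as an attached logger while still ignoring cancellation.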
Symptom: aap724 was correctly mapped to wife's portal in alias_portal, but presence updates from her never reached the matrix portal — log showed "presence unchanged (restored from DB), skipping notice." Root cause: presence dedupe is keyed on the raw (prefix-stripped) handle and persists in `statuskit.presence.<raw>` plus an in-memory sync.Map. When aap724's first presence arrived BEFORE alias_portal had a mapping for her, the notice got dropped (no portal) but the cached presence state was still recorded as "available." Subsequent updates with the same mode then hit the dedupe and got skipped, even after the mapping was created — wife's portal never saw the indicator. Fix: when batch link writes a NEW mapping (alias_portal entry didn't match the new value before the write), clear: - in-memory dedupe via c.statusKitPresence.Delete(raw) - KV `statuskit.presence.<raw>` to "" Then trigger c.subscribeToContactPresence so APNs replays recent presence — the now-routed handle's availability re-delivers and lands in the matrix portal without waiting for a peer-side state change. The "raw" form is the handle without mailto:/tel: prefix, matching the format the presence handler receives directly from rust. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously batch_resolve_handles fed unknowns ∪ known_handles into a
single cache_keys call with refresh=true, which forced Apple to
re-query EVERY handle on every cycle — ~28 lookups per cycle for a
typical bridge, with the matching "IDS returned zero keys" warning
flood for any sibling that's genuinely unregistered.
Split into two calls per service:
1. cache_keys(refresh=true, unknowns) — small set (~4), forces
Apple re-query, bypassing EMPTY_REFRESH on stale empty results.
This is the only path that needs a fresh fetch.
2. cache_keys(refresh=false, known_siblings) — top-up only. Most
siblings have correlation_ids cached from prior message traffic;
refresh=false filters fresh entries out so Apple only sees the
few siblings with genuinely missing/stale cache entries. Steady
state is zero Apple traffic for this call.
Net effect: Apple traffic per cycle drops from O(unknowns + siblings)
to O(unknowns) once siblings are warmed up. The "zero keys" warnings
for known-empty siblings stop firing every cycle.
Sibling-promotion logic unchanged — both lists are queried into the
same cache state, then walked once per service for correlation matches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the FaceTime install prompt to "Enable FaceTime Bridge?" so the default-yes flow stays consistent across all install prompts. Also add a note to the StatusKit notifications prompt that posting a notice unarchives the destination chat — limitation is external to the bridge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CloudKit-pull inject skipped any peer whose channel_id was already in state.keys, counting the record as already_known. That's correct for the steady-state case (CloudKit reading back state we already learned from the APNs reshare path) but wrong when a peer rotates device material under the same channel id while the bridge was offline: the fresh device from CloudKit would be silently dropped and state.keys would stay pinned to the stale key. Compare canonical binary-plist bytes of the existing and incoming StatusKitSharedDevice. Same bytes → already_known, skip. Different bytes → log a peer-key-rotation line and overwrite, counting toward inserted so subscribeToContactPresence re-fires for the channel and the alias-resolver observation callback still runs. Serialize failure on either side forces an overwrite, which is safer than retaining a possibly-stale key (and cannot occur in practice for a value that round-tripped through plist to be constructed in the first place). Upstream's StatusKitSharedDevice does not derive PartialEq, so the bytewise comparison goes through plist::to_writer_binary, which is already the canonical serialization used to persist state.keys to disk, so two devices that compare equal here will also persist identically.
Summary
Notes