Reverse Engineering FaceTime (Part 0): The Link Is Not The Room
The first mistake I made was treating a FaceTime link like a room URL.
That mental model works for a lot of web calling products. There is a room identifier somewhere in the path, a signaling server behind it, and the page becomes the main client. FaceTime links behaved differently. The browser had media APIs, yes, but the browser did not look authoritative. It could present a name, knock, wait, and later become active. It could not decide that it had joined.
Apple publicly documents that FaceTime links let non-Apple users join from a browser, while an Apple user creates/hosts the call from FaceTime.1 That product statement was the first useful constraint. It told me the web guest is intentionally weaker than the native host. The reverse engineering question became: what machinery exists between a browser tab and a native FaceTime conversation?
First observation: WebRTC was only the bottom layer
I started with the obvious assumption: web participant means WebRTC participant. That is true, but it was not explanatory. WebRTC gives the browser APIs for capture, RTP, ICE, codecs, data channels, and peer connections.4 It does not define FaceTime links, host approval, Apple identity routing, push-like signaling, media-key rolling, or the waiting room.
The architecture I started with was too flat:
browser link
-> signaling websocket
-> WebRTC negotiation
-> media
The architecture I ended with was layered:
FaceTime link / conversation handle
-> native host authority
-> IDS-flavoured identity + routing
-> WebCourier delivery channel
-> LetMeIn admission state
-> Quick Relay allocation
-> stream subscription
-> SFrame-protected encoded media
That second diagram is still a simplification, but it preserves the important boundaries. Each layer has different vocabulary and different failure modes.
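One way to keep those boundaries explicit while debugging is to treat the layered diagram as an ordered list and ask which layer is the first one with no evidence of success. This is a minimal sketch under two assumptions: that the layers are exercised roughly in the order drawn, and that "confirmed" means whatever evidence you have collected for that layer. The `Layer` names mirror the diagram; the enum itself and the helper `first_unconfirmed` are illustrative and are not taken from any Apple code.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Layer(IntEnum):
    """Layers from the diagram, ordered from the link down to media."""
    LINK = 0
    NATIVE_HOST_AUTHORITY = 1
    IDS_IDENTITY_ROUTING = 2
    WEBCOURIER_DELIVERY = 3
    LETMEIN_ADMISSION = 4
    QUICK_RELAY_ALLOCATION = 5
    STREAM_SUBSCRIPTION = 6
    SFRAME_MEDIA = 7


def first_unconfirmed(confirmed: Iterable[Layer]) -> Optional[Layer]:
    """Return the first layer, in stack order, with no evidence of success."""
    seen = set(confirmed)
    for layer in Layer:  # Enum iteration follows definition order
        if layer not in seen:
            return layer
    return None  # every layer has been confirmed
```

For example, `first_unconfirmed([Layer.LINK, Layer.NATIVE_HOST_AUTHORITY])` points at `Layer.IDS_IDENTITY_ROUTING`, which is a reminder to look at identity and routing vocabulary before blaming the relay or the media path.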
| Thing I expected | Thing I found instead |
|---|---|
| The link directly identifies a room. | The link resolves into conversation/link material that still requires admission. |
| The browser joins when it opens the URL. | The browser asks to be let in and waits for native approval. |
| Signaling is one app-specific socket. | Delivery is split across IDS-style web registration/query and WebCourier. |
| Relay allocation equals admission. | Admission and Quick Relay allocation are separate stages. |
| Encryption is just WebRTC DTLS/SRTP. | Encoded media also goes through an SFrame worker with MKM recovery semantics. |
This is the point where the project became less about “can I make a call?” and more about “where does each piece of authority live?”
FaceTime link as a capability, not a room
A room URL typically has ambient authority. If you have the URL, you can join, or at least you can join after a simple server-side check. FaceTime links felt closer to a capability to request participation. The browser can present link-derived material to the system, but the native side still controls whether that request becomes a participant.
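The difference between the two models can be sketched as a toy contrast. Nothing here reflects real FaceTime or Apple interfaces: `RoomServer`, `HostedConversation`, `knock`, and `host_decides` are hypothetical names, and the `approve` callback is a stand-in for whatever decision the native host makes.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class RoomServer:
    """Room-URL model: knowing the room id is enough to join."""
    participants: Set[str] = field(default_factory=set)

    def join(self, room_id: str, guest: str) -> bool:
        # Hypothetical: possession of the URL is the whole check.
        self.participants.add(guest)
        return True


@dataclass
class HostedConversation:
    """Capability-to-request model: the link only lets a guest knock."""
    approve: Callable[[str], bool]  # stand-in for the native host's decision
    pending: Set[str] = field(default_factory=set)
    participants: Set[str] = field(default_factory=set)

    def knock(self, link_material: str, guest: str) -> None:
        # Presenting link-derived material creates a request, not a participant.
        self.pending.add(guest)

    def host_decides(self, guest: str) -> bool:
        # Only the host side can turn a pending request into participation.
        if guest not in self.pending:
            return False
        self.pending.discard(guest)
        if self.approve(guest):
            self.participants.add(guest)
            return True
        return False
```

In the second model there is no code path from `knock` to `participants` that does not pass through the host's decision, which is the property the browser-side behavior suggested.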
The web runtime later made this less hand-wavy. Conversation-shaped structures contained fields that made the link part of a larger object graph:
Conversation {
version: uint32
members[]
message
messagesGroupUUIDString
messagesGroupName
video: bool
providerIdentifier: string
lightweightMembers[]
participantAssociation
avMode
}
ConversationMessage {
type: int32
activeParticipants[]
conversationGroupUUIDString
link: ConversationLink
isLetMeInApproved: bool
encryptedMessage
letMeInDelegationHandle
letMeInDelegationUUID
unicastConnectorBlob: bytes
guestModeEnabled: bool
}
The field names are the story. link is embedded in conversation messaging. isLetMeInApproved exists as a separate boolean. activeParticipants is separate from the presence of a link. unicastConnectorBlob hints at separate connector material. A link is not enough to imply active participation.
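To make that separation concrete, here is a small sketch that models a subset of the ConversationMessage fields and keeps link presence, approval, and activity apart. The field names come from the listing; everything else is an assumption, including the Python types for fields whose types were not shown, the reading of isLetMeInApproved as "this request was approved", and the helper `describe`, which is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ConversationMessage:
    # Field names follow the listing; types other than int32/bool/bytes are guesses.
    type: int = 0
    activeParticipants: List[Any] = field(default_factory=list)
    link: Optional[Any] = None  # ConversationLink in the listing
    isLetMeInApproved: bool = False
    unicastConnectorBlob: bytes = b""
    guestModeEnabled: bool = False


def describe(msg: ConversationMessage, participant: Any) -> str:
    """Hypothetical helper: report the strongest status the message supports."""
    if participant in msg.activeParticipants:
        return "active"
    if msg.isLetMeInApproved:
        return "approved, not yet active"
    if msg.link is not None:
        return "link present, not admitted"
    return "no link"
```

A message that carries only a link yields "link present, not admitted", which is the point: the three signals are independent fields, so none of them can be inferred from another.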
The first state machine I wrote down
Before I had the enum names, I wrote the browser behavior as a small state machine based only on UI and host-visible changes.
OPENED_LINK
-> NAME_COLLECTED
-> KNOCK_SENT
-> WAITING_FOR_HOST
-> { APPROVED | DENIED | CANCELLED | FAILED }
-> if APPROVED: MEDIA_SETUP
That model was useful because it forced me to avoid collapsing “waiting” and “joining” into the same state. Later, the runtime gave the actual vocabulary:
LetMeInState:
PENDING = 0
REQUESTING = 1
FAILED = 2
CANCELLED = 3
ALLOWED = 4
DONE = 5
DENIED = 6
LetMeInResult:
CANCELLED = 0
ALLOWED = 1
DENIED = 2
That matched the early model surprisingly well. ALLOWED and DONE being separate is important. An approval decision exists before all downstream media/key/relay work is complete.
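The two enums can be written down directly, which also exposes a detail that is easy to miss: the numeric values do not line up (ALLOWED is 1 as a result but 4 as a state), so any translation has to go by name. In the sketch below the enum members and values are the ones listed; the `RESULT_TO_STATE` mapping, the rule that only a REQUESTING knock can receive a result, and the helpers `apply_result`, `is_admitted`, and `is_finished_joining` are assumptions made for illustration, not confirmed runtime behavior.

```python
from enum import IntEnum


class LetMeInState(IntEnum):
    PENDING = 0
    REQUESTING = 1
    FAILED = 2
    CANCELLED = 3
    ALLOWED = 4
    DONE = 5
    DENIED = 6


class LetMeInResult(IntEnum):
    CANCELLED = 0
    ALLOWED = 1
    DENIED = 2


# Assumed mapping by name; the numeric values differ between the two enums.
RESULT_TO_STATE = {
    LetMeInResult.CANCELLED: LetMeInState.CANCELLED,
    LetMeInResult.ALLOWED: LetMeInState.ALLOWED,
    LetMeInResult.DENIED: LetMeInState.DENIED,
}


def apply_result(state: LetMeInState, result: LetMeInResult) -> LetMeInState:
    """Apply a host decision; assumes only a REQUESTING knock can receive one."""
    if state is not LetMeInState.REQUESTING:
        raise ValueError(f"no result expected in state {state.name}")
    return RESULT_TO_STATE[result]


def is_admitted(state: LetMeInState) -> bool:
    """The approval decision has been made in the guest's favor."""
    return state in (LetMeInState.ALLOWED, LetMeInState.DONE)


def is_finished_joining(state: LetMeInState) -> bool:
    """Downstream media/key/relay work is treated as complete only at DONE."""
    return state is LetMeInState.DONE
```

Keeping `is_admitted` and `is_finished_joining` as separate predicates is the code-level version of not collapsing "approved" into "joined".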
Why FaceTime architecture felt different
Most browser-first video products centralize around a web backend that owns the meeting. FaceTime felt like the browser had to be projected into a native Apple conversation system without becoming a normal native device. That creates weird seams.
| Seam | What it implied while reversing |
|---|---|
| Native host approval | The Mac/iPhone side is authoritative for admission. Browser-only experiments cannot bypass that boundary. |
| IDS-like vocabulary in web code | The browser gets a constrained web projection of Apple identity/routing, not just a generic room socket. |
| WebCourier delivery | The browser needs push-like asynchronous signaling even though it is not an APNs client. |
| Quick Relay | Media allocation and relay state are separated from waiting-room approval. |
| SFrame worker | Media security is implemented at the encoded-frame boundary, not only by treating the relay as trusted. |
This is also why the project was fun for me. I already had background in WebRTC and protocols, so I kept expecting familiar shapes. FaceTime kept giving me familiar pieces in unfamiliar places. WebRTC was there, but wrapped in Apple identity, native call authority, relay allocation, and frame-level encryption machinery.
Working rule for the rest of the series
The rest of the reversing followed one rule: do not trust the UI name for a state until I can map it to a protocol artifact. “Waiting” had to map to LetMeInState. “Joined” had to map to participant activation and relay/media setup. “Bad link” had to map to a result/error family. “No media” had to be split between allocation, downlink subscription, AVC blob availability, and MKM decryption.
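That rule can be kept honest with a lookup table that refuses to answer for any UI label that has not been traced to an artifact yet. This is only a note-keeping sketch: the labels and artifact names are the ones from the paragraph above, while the table `UI_TO_ARTIFACTS` and the helper `artifacts_for` are hypothetical.

```python
from typing import Dict, Tuple

# UI label -> protocol artifacts it must be traced to, as listed in the text.
UI_TO_ARTIFACTS: Dict[str, Tuple[str, ...]] = {
    "Waiting": ("LetMeInState",),
    "Joined": ("participant activation", "relay/media setup"),
    "Bad link": ("result/error family",),
    "No media": (
        "allocation",
        "downlink subscription",
        "AVC blob availability",
        "MKM decryption",
    ),
}


def artifacts_for(ui_label: str) -> Tuple[str, ...]:
    """Refuse to reason about a UI label that has no mapped artifact yet."""
    try:
        return UI_TO_ARTIFACTS[ui_label]
    except KeyError:
        raise LookupError(f"unmapped UI state: {ui_label!r}") from None
```

A label such as "No media" returning four candidates is deliberate: it forces each of the four to be ruled in or out separately instead of being treated as one failure.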
That is why Part 1 moves into the native host boundary and Part 2 moves into the web runtime. The product was clean. The protocol was layered. The only way to understand FaceTime was to keep those layers separate.