Reverse Engineering FaceTime (Part 0): The Link Is Not The Room

The first mistake I made was treating a FaceTime link like a room URL.

That mental model works for a lot of web calling products. There is a room identifier somewhere in the path, a signaling server behind it, and the page becomes the main client. FaceTime links behaved differently. The browser had media APIs, yes, but the browser did not look authoritative. It could present a name, knock, wait, and later become active. It could not decide that it had joined.

Apple publicly documents that FaceTime links let non-Apple users join from a browser, while an Apple user creates/hosts the call from FaceTime.[1] That product statement was the first useful constraint. It told me the web guest is intentionally weaker than the native host. The reverse engineering question became: what machinery exists between a browser tab and a native FaceTime conversation?

First observation: WebRTC was only the bottom layer

I started with the obvious assumption: web participant means WebRTC participant. That is true, but it was not explanatory. WebRTC gives the browser APIs for capture, RTP, ICE, codecs, data channels, and peer connections.[4] It does not define FaceTime links, host approval, Apple identity routing, push-like signaling, media-key rolling, or the waiting room.

The architecture I started with was too flat:

browser link
  -> signaling websocket
  -> WebRTC negotiation
  -> media

The architecture I ended with was layered:

FaceTime link / conversation handle
  -> native host authority
  -> IDS-flavoured identity + routing
  -> WebCourier delivery channel
  -> LetMeIn admission state
  -> Quick Relay allocation
  -> stream subscription
  -> SFrame-protected encoded media

That second diagram is still a simplification, but it preserves the important boundaries. Each layer has different vocabulary and different failure modes.
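A minimal sketch of how that layering could be written down as a debugging aid. The `Layer` names only restate the diagram above, and `first_failure` is a hypothetical helper; nothing here comes from Apple's code, and the strict top-to-bottom ordering is a simplifying assumption.

```python
from enum import IntEnum
from typing import Dict, Optional


class Layer(IntEnum):
    """Layers from the second diagram, in the order a join attempt crosses them."""
    LINK = 0
    HOST_AUTHORITY = 1
    IDS_ROUTING = 2
    WEBCOURIER = 3
    LETMEIN = 4
    QUICK_RELAY = 5
    STREAM_SUBSCRIPTION = 6
    SFRAME_MEDIA = 7


def first_failure(observed: Dict[Layer, bool]) -> Optional[Layer]:
    """Return the earliest layer not observed to succeed, or None if all did.

    Layers missing from `observed` count as not yet reached.
    """
    for layer in Layer:
        if not observed.get(layer, False):
            return layer
    return None


# Hypothetical observation: admitted by the host, but no relay allocation seen.
observed = {layer: True for layer in Layer if layer <= Layer.LETMEIN}
print(first_failure(observed).name)  # QUICK_RELAY
```

The point of the ordering is that a symptom gets attributed to the first layer that has no evidence of success, instead of to "the call" as a whole.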

| Thing I expected | Thing I found instead |
| --- | --- |
| The link directly identifies a room. | The link resolves into conversation/link material that still requires admission. |
| The browser joins when it opens the URL. | The browser asks to be let in and waits for native approval. |
| Signaling is one app-specific socket. | Delivery is split across IDS-style web registration/query and WebCourier. |
| Relay allocation equals admission. | Admission and Quick Relay allocation are separate stages. |
| Encryption is just WebRTC DTLS/SRTP. | Encoded media also goes through an SFrame worker with MKM recovery semantics. |

This is the point where the project became less about “can I make a call?” and more about “where does each piece of authority live?”

A room URL typically has ambient authority. If you have the URL, you can join, or at least you can join after a simple server-side check. FaceTime links felt closer to a capability to request participation. The browser can present link-derived material to the system, but the native side still controls whether that request becomes a participant.

The web runtime later made this less hand-wavy. Conversation-shaped structures contained fields that made the link part of a larger object graph:

Conversation {
  version: uint32
  members[]
  message
  messagesGroupUUIDString
  messagesGroupName
  video: bool
  providerIdentifier: string
  lightweightMembers[]
  participantAssociation
  avMode
}
 
ConversationMessage {
  type: int32
  activeParticipants[]
  conversationGroupUUIDString
  link: ConversationLink
  isLetMeInApproved: bool
  encryptedMessage
  letMeInDelegationHandle
  letMeInDelegationUUID
  unicastConnectorBlob: bytes
  guestModeEnabled: bool
}

The field names are the story. The link field is embedded in conversation messaging. isLetMeInApproved exists as a separate boolean. activeParticipants is separate from the presence of a link. unicastConnectorBlob hints at separate connector material. Holding a link is not enough to imply active participation.
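To make that separation concrete, here is a small sketch that models a subset of the fields above as Python dataclasses. The Python-side names, the use of strings for participants, and the `describe` rule are all assumptions for illustration; the source only gives the field names and a few types, and the contents of `ConversationLink` are not shown, so it is left as an opaque placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConversationLink:
    # Placeholder: the real structure of the link object is not shown here.
    raw: bytes = b""


@dataclass
class ConversationMessage:
    """Subset of the listed fields; participant entries as str is a guess."""
    type: int = 0
    active_participants: List[str] = field(default_factory=list)
    link: Optional[ConversationLink] = None
    is_let_me_in_approved: bool = False
    unicast_connector_blob: bytes = b""
    guest_mode_enabled: bool = False


def describe(msg: ConversationMessage, me: str) -> str:
    """Classify what a message says about participant `me` (illustrative rule)."""
    if me in msg.active_participants:
        return "active"
    if msg.is_let_me_in_approved:
        return "approved-not-active"
    if msg.link is not None:
        return "has-link-only"
    return "unknown"


link = ConversationLink()
print(describe(ConversationMessage(link=link), "guest"))  # has-link-only
```

Three independent fields give three distinguishable situations, which is exactly what a single "room URL" model cannot express.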

The first state machine I wrote down

Before I had the enum names, I wrote the browser behavior as a small state machine based only on UI and host-visible changes.

OPENED_LINK
  -> NAME_COLLECTED
  -> KNOCK_SENT
  -> WAITING_FOR_HOST
  -> { APPROVED | DENIED | CANCELLED | FAILED }
  -> if APPROVED: MEDIA_SETUP

That model was useful because it forced me to avoid collapsing “waiting” and “joining” into the same state. Later, the runtime gave the actual vocabulary:

LetMeInState:
  PENDING    = 0
  REQUESTING = 1
  FAILED     = 2
  CANCELLED  = 3
  ALLOWED    = 4
  DONE       = 5
  DENIED     = 6
 
LetMeInResult:
  CANCELLED = 0
  ALLOWED   = 1
  DENIED    = 2

That matched the early model surprisingly well. ALLOWED and DONE being separate is important. An approval decision exists before all downstream media/key/relay work is complete.
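The two enums can be transcribed directly. The sketch below uses the values listed above; the helper functions are hypothetical and encode my reading of the states (that ALLOWED, DONE, and DENIED all imply a host decision, and that only DONE means downstream work has finished), not confirmed semantics.

```python
from enum import IntEnum


class LetMeInState(IntEnum):
    PENDING = 0
    REQUESTING = 1
    FAILED = 2
    CANCELLED = 3
    ALLOWED = 4
    DONE = 5
    DENIED = 6


class LetMeInResult(IntEnum):
    CANCELLED = 0
    ALLOWED = 1
    DENIED = 2


def state_for_result(result: LetMeInResult) -> LetMeInState:
    """Map a result to the same-named state.

    The numeric values differ between the two enums (ALLOWED is 1 vs 4),
    so the lookup goes by name rather than by casting the integer.
    """
    return LetMeInState[result.name]


def host_has_decided(state: LetMeInState) -> bool:
    # Assumption: ALLOWED, DONE and DENIED all imply a host decision was made.
    return state in (LetMeInState.ALLOWED, LetMeInState.DONE, LetMeInState.DENIED)


def join_finished(state: LetMeInState) -> bool:
    # Assumption: only DONE means the downstream work has completed.
    return state is LetMeInState.DONE


print(state_for_result(LetMeInResult.ALLOWED).name)     # ALLOWED
print(LetMeInState(LetMeInResult.ALLOWED.value).name)   # REQUESTING (cast by value: wrong)
```

The second print shows the trap when reading traces: a raw `1` means ALLOWED in a result but REQUESTING in a state, so the integer alone is ambiguous without knowing which enum it belongs to.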

Why FaceTime architecture felt different

Most browser-first video products centralize around a web backend that owns the meeting. FaceTime felt like the browser had to be projected into a native Apple conversation system without becoming a normal native device. That creates weird seams.

| Seam | What it implied while reversing |
| --- | --- |
| Native host approval | The Mac/iPhone side is authoritative for admission. Browser-only experiments cannot bypass that boundary. |
| IDS-like vocabulary in web code | The browser gets a constrained web projection of Apple identity/routing, not just a generic room socket. |
| WebCourier delivery | The browser needs push-like asynchronous signaling even though it is not an APNs client. |
| Quick Relay | Media allocation and relay state are separated from waiting-room approval. |
| SFrame worker | Media security is implemented at the encoded-frame boundary, not only by treating the relay as trusted. |

This is also why the project was fun for me. I already had background in WebRTC and protocols, so I kept expecting familiar shapes. FaceTime kept giving me familiar pieces in unfamiliar places. WebRTC was there, but wrapped in Apple identity, native call authority, relay allocation, and frame-level encryption machinery.

Working rule for the rest of the series

The rest of the reversing followed one rule: do not trust the UI name for a state until I can map it to a protocol artifact. “Waiting” had to map to LetMeInState. “Joined” had to map to participant activation and relay/media setup. “Bad link” had to map to a result/error family. “No media” had to be split between allocation, downlink subscription, AVC blob availability, and MKM decryption.
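That rule can be kept as a lookup table while reading traces. This is a note-taking sketch only: the dictionary restates the mappings from the paragraph above, and `unexplained` is a hypothetical helper, not part of any FaceTime code.

```python
from typing import Dict, List

# UI label -> protocol artifacts it has to be explained by (from the text above).
UI_TO_ARTIFACTS: Dict[str, List[str]] = {
    "Waiting": ["LetMeInState"],
    "Joined": ["participant activation", "relay/media setup"],
    "Bad link": ["result/error family"],
    "No media": [
        "allocation",
        "downlink subscription",
        "AVC blob availability",
        "MKM decryption",
    ],
}


def unexplained(ui_label: str, confirmed: List[str]) -> List[str]:
    """Return the candidate artifacts for `ui_label` not yet confirmed in a trace."""
    return [a for a in UI_TO_ARTIFACTS[ui_label] if a not in confirmed]


print(unexplained("No media", ["allocation", "downlink subscription"]))
# ['AVC blob availability', 'MKM decryption']
```

A UI symptom is only considered understood once its list of unexplained artifacts is empty.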

That is why Part 1 moves into the native host boundary and Part 2 moves into the web runtime. The product was clean. The protocol was layered. The only way to understand FaceTime was to keep those layers separate.

References