Fingerprinting sites downloaded over HTTPS

Bennett Haselton
Last modified: 12/7/2002

Various programs such as CGIProxy exist that can be installed on a Web server, such that users can connect to the server and use the CGIProxy script to browse third-party Web sites. Because the user never connects to the third-party Web sites, this technique can be used by users on a censored network to access sites that they are blocked from accessing directly, as long as the machine hosting the CGIProxy script is not blocked. However, users in this situation might still be concerned that an eavesdropper could spy on the traffic between them and the Web server hosting the CGIProxy script, to determine what banned pages they're downloading.

It is widely assumed that if a script like CGIProxy is hosted on an HTTPS-enabled server, it is impossible for an eavesdropper to determine what page the user is viewing. As SafeWeb describes their HTTPS-enabled Triangle Boy circumventor server: "Safeweb's Triangle Boy software exploits the encryption capability of browsers and creates a distributed network which allows individual users in China to access the entire Web through an unbreakable encrypted channel." However, it is not entirely correct to call HTTPS "unbreakable", in the sense that an eavesdropper cannot make accurate guesses about what you were looking at. Many pages can be "fingerprinted" based on the relative sizes of the page document and the images and other objects that are loaded from within the page; the rough sizes of the images are preserved when the page is downloaded over an HTTPS connection. This would only be a concern in the most security-sensitive situations, but it could be done. Fortunately, there are countermeasures that a CGIProxy-type script can take to evade this detection.

Demonstration of the problem

I created a 155-byte HTML page that loaded four images, whose sizes were 3,883 bytes, 7,329 bytes, 17,783 bytes, and 32,138 bytes, respectively. I then loaded the page using Netscape Navigator, and used TracePlus/Winsock to monitor the network traffic that the application exchanged with the server. (I used Netscape rather than IE because TracePlus automatically decrypts encrypted traffic that IE exchanges with the server, and I wanted to see the raw encrypted traffic.)

Netscape made four simultaneous connections, from local ports 544, 600, 604 and 644. In total, it downloaded roughly 7,800 bytes over port 544, roughly 5,900 bytes over port 600, roughly 18,000 bytes over port 604, and roughly 32,000 bytes over port 644. These download sizes corresponded roughly to the sizes of the images on the server. (Netscape only had four simultaneous connections open; one of those connections was first used to download the page itself, and then after the page had been downloaded, that connection was re-used to download one of the images, while the other three connections were used to download the other three.) The browser doesn't necessarily download the images in the order in which they appear on the page.

Because an eavesdropper can distinguish the download connections from each other (they were sent to different local ports) and see the total number of bytes sent over each download connection, and because all the downloads happen within a second of each other, an eavesdropper might recognize the "fingerprint" of a page if he already knows that the page contains images of 3,883 bytes, 7,329 bytes, 17,783 bytes, and 32,138 bytes. Of course, it would only be a probabilistic guess, since many pages probably exist on the Internet that contain images of those sizes. But if you're trying to guess what page the user was looking at, you've narrowed the field from potentially billions of pages to maybe only a handful. And if you're monitoring for people who connect to frequently-banned pages such as http://www.falundafa.org/, the home page of the Falun Gong sect currently outlawed in China, you would pay special attention to any downloads that match the "fingerprint" of that page.
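To make the matching step concrete, here is a minimal sketch in Python of how an eavesdropper might compare observed per-connection byte counts against a known fingerprint. The function name and the tolerance window (up to 500 bytes under and 2,500 bytes over each image size) are assumptions, chosen loosely enough to cover the rough figures reported above; a real attacker would have to calibrate them against actual HTTPS overhead.

```python
# Image sizes of the test page described above.
FINGERPRINT = [3883, 7329, 17783, 32138]

def matches_fingerprint(observed, fingerprint, under=500, over=2500):
    """Return True if every size in `fingerprint` can be paired with a
    distinct observed transfer that falls inside the tolerance window.

    `under` and `over` are assumed tolerances in bytes, not measured values.
    Extra observed transfers that match nothing are ignored.
    """
    remaining = sorted(observed)
    i = 0
    for expected in sorted(fingerprint):
        # Skip transfers too small to be this image (or any larger one).
        while i < len(remaining) and remaining[i] < expected - under:
            i += 1
        # Fail if nothing is left, or the next transfer is too large.
        if i == len(remaining) or remaining[i] > expected + over:
            return False
        i += 1  # this transfer is now used up
    return True

# Rough byte counts seen on the four connections in the demonstration.
print(matches_fingerprint([7800, 5900, 18000, 32000], FINGERPRINT))  # True
# An unrelated set of transfers.
print(matches_fingerprint([1200, 5000, 9100, 40000], FINGERPRINT))   # False
```

Because every image gets the same tolerance window, sorting both lists and pairing greedily is enough to find a valid pairing whenever one exists.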

This case is analogous to what cryptographers call a "known plaintext" attack -- the eavesdropper in this case takes the fingerprint of some page that he knows some users in China will try to access, and monitors for pages downloaded over HTTPS that closely match that fingerprint. A "chosen plaintext" attack -- where the attacker creates a page with a highly recognizable fingerprint, then publicizes the page and hopes that users in China will try to download it -- is even more dangerous, because it's easy to create a page whose fingerprint is deliberately more distinctive than the fingerprint of most pages.

Suppose the Chinese government creates a page, http://www.thegovernmentofchinasucks.org/, peppers it with some diatribes against the Chinese government, and spams it or otherwise advertises it to users within China. The government then blocks the site. A user who tries to reach it, sees that it's blocked, and knows about a CGIProxy-type script on an HTTPS server will likely go to that server and try to download the banned page.

The page, however, contains a series of images of varying sizes, even though the HTML on the page sets their dimensions to 1 x 1 pixels so they're barely noticeable. The page silently loads a 1 K image, a 10 K image, a 50 K image, a 100 K image, a 120 K image, and a 150 K image. The user's browser downloads all of those images while connected to the HTTPS server, and the eavesdropper recognizes the fingerprint of the banned page that he created.

Countermeasures

A CGIProxy script could block the fingerprinting of pages by injecting a few 1x1-pixel images of varying sizes into the page. A censor looking for the fingerprint of a given page would now have to examine not just all HTTPS pages that loaded a set of images of given sizes, but all pages where any subset of the downloaded images matched a certain fingerprint -- increasing the amount of monitoring they would have to do. In addition, a CGIProxy script that was smart enough to edit the contents of GIF files could pad the GIF file with comment data, to increase (but not decrease) its size, further blocking efforts to fingerprint a page based on its image sizes.
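As an illustration of the GIF padding idea, the sketch below inserts a GIF89a Comment Extension (introducer 0x21, label 0xFE, data sub-blocks of at most 255 bytes each, then a zero terminator) just before the file's trailer byte. The function name, the space-character filler, and the 35-byte test image are assumptions made for the example; this is not code from CGIProxy.

```python
def pad_gif(gif_bytes, padding_len):
    """Return a copy of a GIF89a file with `padding_len` bytes of comment
    data inserted just before the trailer byte (0x3B).

    Hypothetical helper: it handles only files that start with the GIF89a
    signature and end exactly at the trailer.
    """
    if not gif_bytes.startswith(b"GIF89a"):
        raise ValueError("comment extensions require a GIF89a file")
    if not gif_bytes.endswith(b"\x3b"):
        raise ValueError("file does not end with the GIF trailer byte")

    filler = b" " * padding_len            # comment text; any ASCII filler works
    block = bytearray(b"\x21\xfe")         # extension introducer + comment label
    for start in range(0, padding_len, 255):
        chunk = filler[start:start + 255]  # each sub-block holds at most 255 bytes
        block.append(len(chunk))           # sub-block length prefix
        block += chunk
    block.append(0)                        # block terminator
    return gif_bytes[:-1] + bytes(block) + b"\x3b"

# A 35-byte, 1x1-pixel GIF used only as test input.
TINY_GIF = (
    b"GIF89a"
    b"\x01\x00\x01\x00\x80\x00\x00"              # logical screen descriptor
    b"\x00\x00\x00\xff\xff\xff"                  # two-entry global color table
    b"\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00"  # image descriptor
    b"\x02\x02\x44\x01\x00"                      # LZW-compressed pixel data
    b"\x3b"                                      # trailer
)

padded = pad_gif(TINY_GIF, 600)
print(len(TINY_GIF), len(padded))  # 35 641
```

The file grows by the padding length plus a few bytes of framing (here 600 + 6). A proxy would presumably choose the padding length at random for each download, so that the same image never shows up at the same size twice.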

JPEG files can be padded as well. The JPEG format defines a comment (COM) segment, and one user reported he was able to pad a JPEG file with comments added to the end, and the file still displayed correctly when viewed in a Web browser. JPEG files can also be recompressed at a lower quality setting than they were originally saved at, which reduces their size and further confuses attempts to "fingerprint" a site by the size of the downloaded images. (This would also have other applications besides defeating site-fingerprinting; for example, speeding up download of an image-heavy site over a slow Internet connection.) GIF and PNG images could be converted to JPEGs and then have the same process applied to them (assuming that the images don't have any properties that are unsupported by JPEGs, such as animation or transparent regions).

A site's contents, including images, could also be bundled by the proxy into a single .mht file that could be downloaded by a Web browser in one request. This, too, would have applications other than defeating site-fingerprinting. On a connection with high latency, i.e. long round-trip times between the computer originating the request and the site being contacted (such as overseas users or users with satellite Internet connections), the download time could be reduced by downloading the page and all its images at once, rather than downloading the page first, then waiting for an extra round-trip as requests were sent out to download all the loaded images.
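A rough sketch of the bundling step, using Python's standard email package to build a multipart/related message of the kind .mht archives use, with each part labeled by a Content-Location header. The function name, URLs, and image bytes are placeholders invented for the example, and whether a particular browser will open the result is not something this sketch guarantees.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def bundle_as_mht(page_url, html, images):
    """Pack a page and its images into one multipart/related document.

    `images` maps each image URL to a (subtype, raw_bytes) pair, for
    example ("gif", b"..."). Hypothetical helper for illustration only.
    """
    bundle = MIMEMultipart("related", type="text/html")

    # The HTML page itself is the first (root) part.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    bundle.attach(root)

    # Each image becomes one more part, identified by its original URL.
    for url, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        bundle.attach(part)

    return bundle.as_bytes()

# Placeholder page and image data.
mht = bundle_as_mht(
    "http://example.com/",
    '<html><body><img src="http://example.com/a.gif"></body></html>',
    {
        "http://example.com/a.gif": ("gif", b"GIF89a placeholder bytes"),
        "http://example.com/b.jpeg": ("jpeg", b"placeholder jpeg bytes"),
    },
)
print(len(mht), "bytes in a single download")
```

An eavesdropper watching this transfer sees one connection carrying one total, rather than a separate byte count for each image.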

(Thanks to Brian Ristuccia and Thomas Shaddack for suggestions to the list of countermeasures.)