How to use steganography to securely circumvent network-level censorship

    Bennett Haselton, 2/15/2000



Statement of the problem

Assumptions:

It is important to note that existing censoring proxy servers are not sophisticated enough to do most of the things described in this paper. However, they could be easily modified to do everything described -- especially if that were the only way to stop people from circumventing them. We want to design a lasting protocol for circumventing proxy servers, and not one that will only be secure until their next product cycle.

The theme of this paper is that the proxy server can count the number of bits of information transmitted by a user surfing the Web. (The distinction between "bits of information" and "bits of data" is important, and explained below.) If bits of information are sent out at a higher rate while using the "circumventor" protocol than while surfing the Web normally, then the proxy server can detect this. Most existing methods for getting around proxy servers do not take this into account -- but eventually, they will have to.


Introduction

There is a difference between transmitting bits of data and bits of information. For example, say that you are visiting a page that has eight links on it and one of the links is to the page "/home/guestbook.html". When you click on that link, your browser sends this text to the Web server:

GET /home/guestbook.html HTTP/1.1
which is 33 bytes, or 264 bits, of data that you transmitted (not counting the line terminator). However, because there are eight links on the page, you could have specified which link you wanted to click on by using one of the following eight codes:

Code   Meaning         Code   Meaning
000    first link      100    fifth link
001    second link     101    sixth link
010    third link      110    seventh link
011    fourth link     111    eighth link

In other words, since the Web server knows that you are going to make one of eight choices (clicking on one of the eight links), and three bits are enough to specify one selection out of eight, then when you click on one of the links, you're only sending three bits of information to the Web server.
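The difference between the two measures can be made concrete with a short sketch. The function names below are ours, chosen for illustration, and the calculation assumes that each of the links is equally likely to be clicked:

```python
import math

def data_bits(request_line: str) -> int:
    """Bits of data actually sent: 8 bits for every byte of the request line."""
    return 8 * len(request_line.encode("ascii"))

def information_bits(num_choices: int) -> float:
    """Bits of information needed to pick one of num_choices equally likely options."""
    return math.log2(num_choices)

request = "GET /home/guestbook.html HTTP/1.1"
print(data_bits(request))      # bits of data in the request line
print(information_bits(8))     # 3.0 bits to choose one of eight links
```

However long the link's URL is, the second number depends only on how many links the page offered.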

This is relevant to the problem of circumventing a censoring proxy server without getting detected, because the administrator of the censoring proxy server can measure the number of bits of information that you are transmitting while surfing the Web. The number of bits transmitted during a normal Web-surfing session is small, so if the proxy administrator detects that the number of bits of information transmitted by your computer always goes up when you visit a particular site, the administrator could block that site for enabling suspicious activity.

This is easier to understand after looking at a specific example, given in Section 1 below, "Why using long URL's to transmit information is insecure and too easy to detect". How this relates to the bits-of-information problem is explained in Section 2: Counting the bits of information that you transmit when surfing the Web. Finally, Section 3: Circumventing a proxy without transmitting enough bits of information to raise a red flag, explains how this is relevant to designing a program for circumventing network-level censorship without being detected.

Section 1: Example -- Why using long URL's to transmit information is insecure and too easy to detect

This section explains why it is not a secure solution for the user to use long and garbled-looking URL's like
http://rd.yahoo.com/M=26036.208672.1462854.389526/S=2716149:N/A=167766/?http://messenger.yahoo.com/
to send information to an outside Web server, even though URL's of this form are encountered regularly when surfing the Web. (The URL above was an actual example taken from the Yahoo home page.)

A "suspicious-looking" URL is defined as a URL which is long enough that it is extremely unlikely that the user would have typed it in themselves. Hardly anyone accesses pages like
http://rd.yahoo.com/M=26036.208672.1462854.389526/S=2716149:N/A=167766/?http://messenger.yahoo.com/
by actually typing in the address. A page with an address like that is almost always accessed by clicking on a link. Similarly, when a user's browser requests an image URL like
http://a1.g.a.yimg.com/7/1/31/000/us.yimg.com/a/vi/visa/sm.gif
it's almost always because their browser was viewing a page which included the image tag
<img src="http://a1.g.a.yimg.com/7/1/31/000/us.yimg.com/a/vi/visa/sm.gif">

So, how could the administrator of a censoring proxy server allow users to access images and click on links which have long URL's, while flagging any "suspicious-looking" URL's that could be used to send a secret communication to an outside server? By applying these rules:

The above is not a complete list of how URL's should be added to the "allow list". For example, links are often created by JavaScript functions, so the censoring proxy might have to have some basic ability to parse JavaScript. And a user might have a list of bookmarks or favorites that they had visited previously, so they might jump to those URL's directly, in which case the URL will look "suspicious" if it is long enough, because the user is not getting to it by following a link from a recently-visited page.

But the censoring proxy does not necessarily have to block such pages; it just has to mark them as "suspicious" for later review by an administrator. Since most users never have more than a dozen entries in their Web favorites list, it would be easy for an administrator to audit the list of "suspicious" URL's visited by each student every few weeks.
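One way a proxy might keep such an "allow list" is sketched below. This is only an illustration under our own assumptions, not a description of any existing product: the class and function names are hypothetical, the 40-character cutoff for a URL that "could have been typed by hand" is invented for the example, and only href and src attributes are scanned (no JavaScript parsing, no bookmarks).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

MAX_TYPED_LENGTH = 40  # hypothetical cutoff: longer URLs are assumed not to be typed by hand

class LinkCollector(HTMLParser):
    """Collects absolute URLs found in href and src attributes of a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.add(urljoin(self.base_url, value))

def update_allow_list(allow_list: set, page_url: str, html: str) -> None:
    """Add every link and embedded resource on a just-viewed page to the allow list."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    allow_list.update(collector.urls)

def is_suspicious(url: str, allow_list: set) -> bool:
    """Flag a URL that is too long to have been typed and was not seen on a recent page."""
    return len(url) > MAX_TYPED_LENGTH and url not in allow_list

allow_list = set()
page = '<a href="/articles/69548378684-qlapsififsdtsjn.html">story</a>'
update_allow_list(allow_list, "http://www.somesite.com/", page)
print(is_suspicious("http://www.somesite.com/articles/69548378684-qlapsififsdtsjn.html", allow_list))
print(is_suspicious("http://www.somesite.com/34klhfga4-GPFIDKR/?Q2*&Z", allow_list))
```

In this sketch a flagged URL is only reported, not blocked, which matches the review-later approach described above.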

Section 2: Counting the bits of information that you transmit when surfing the Web

In the scenario above, the problem from the user's point of view is that if they are visiting the page http://www.somesite.com with the following four links on it:
/articles/69548378684-qlapsififsdtsjn.html
/articles/98983377699-qqtdgsfgokjnwef.html
/products.cgi?id=36532529087
/products.cgi?id=32450459958
then if the user clicks on one of those four links (which has been temporarily added to their "allow list"), they are only sending two bits of information to the Web server. But if the user tries to access the URL
http://www.somesite.com/34klhfga4-GPFIDKR/?Q2*&Z
(which is not on their "allow list"), then they are transmitting 24x7 = 168 bits of information (24 bytes for the string "34klhfga4-GPFIDKR/?Q2*&Z", times 7 bits for each byte, if we assume that characters in URL's are limited to the 128 lower-ASCII characters). By only allowing the user to access links on their "allow list" without raising a red flag, the proxy administrator is basically limiting the number of bits of information that the user can transmit without getting flagged for suspicious activity.

If the user fills out a form that includes text fields, then in that situation, the censoring proxy has to allow the user to transmit many more bits of information than are usually transmitted by clicking on a link. There are a few bits that also get transmitted through the non-text fields of the form, but text fields transmit the most information. For example, the user could visit a page with the following form on it:

Band name: [text field]
Show: [menu: cdsonly | tapesonly | cdsandtapes]
Bill me later: [checkbox]

When the censoring proxy scans this page, it adds URL's of the form:
http://www.somesite.com/perlscript.pl?bandname=[any text here]&products=[cdsonly|tapesonly|cdsandtapes]&billmelater=[either "on" or nothing]

to the user's "allow list". The user then enters this data:

Band name: Korn
Show: cdsandtapes
Bill me later: checked

Then when the user submits the form, the following data is transmitted to the Web server:
/perlscript.pl?bandname=Korn&products=cdsandtapes&billmelater=on
The byte sequences controlled by the user ("Korn", "cdsandtapes" and "on") are the ones that can be used to transmit bits of information. But even though the string "cdsandtapes" consists of 11 bytes, it stores less than 2 bits of information, because there were only three choices for that option. The "band name" field could be almost anything, so it can store much more information even though the submitted value in this case, "Korn", has fewer bytes of data than "cdsandtapes". Of course a user could insert some text after products= other than "cdsonly", "tapesonly" or "cdsandtapes":
/perlscript.pl?bandname=Korn&products=45G2qvXI&billmelater=on
but the "45G2qvXI" would be flagged by the censoring proxy, since the "allow list" currently permits only "cdsonly", "tapesonly" or "cdsandtapes" to appear after products= in the URL.
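The check on a submitted form can be sketched as follows. The FORM_SPEC structure and the function names are hypothetical, invented here to stand in for whatever the proxy records when it scans the form; a value of None marks a free-text field, and a set lists the only values the form itself could have produced.

```python
import math
from urllib.parse import parse_qs

# Hypothetical record of the form the proxy saw on the page.
FORM_SPEC = {
    "bandname": None,                                     # free text
    "products": {"cdsonly", "tapesonly", "cdsandtapes"},  # menu with three options
    "billmelater": {"on", ""},                            # checkbox: "on" or nothing
}

def flagged_fields(query: str, spec: dict) -> list:
    """Return the names of fields whose values the form could not have produced."""
    submitted = parse_qs(query, keep_blank_values=True)
    flagged = []
    for name, values in submitted.items():
        if name not in spec:
            flagged.append(name)
            continue
        allowed = spec[name]
        if allowed is not None and any(v not in allowed for v in values):
            flagged.append(name)
    return flagged

def fixed_choice_bits(spec: dict) -> float:
    """Bits of information carried by the non-text fields, assuming equally likely options."""
    return sum(math.log2(len(allowed)) for allowed in spec.values() if allowed is not None)

print(flagged_fields("bandname=Korn&products=cdsandtapes&billmelater=on", FORM_SPEC))
print(flagged_fields("bandname=Korn&products=45G2qvXI&billmelater=on", FORM_SPEC))
print(fixed_choice_bits(FORM_SPEC))
```

The last line shows why the menu and the checkbox are poor channels: together they carry under three bits, however many bytes their values occupy in the URL.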

But even a text field cannot contain arbitrary data without raising a red flag. A well-designed censoring proxy could distinguish between "normal-looking" text and "suspicious" text by, for example, flagging text strings that contain too many numbers ("5DFS06JK4sdf90"), strings that contain a strange mix of upper and lower case ("FdREWIoDP lIEWkXUAR"), or even strings containing an odd proportion of vowels to consonants that is inconsistent with normal English text ("Ersdlxvbmrtj Wdsfq"). Also, if the query consists of several words, then at least one of those should be an actual English word. Text queries often contain non-English words ("eXistenZ"), but a query made up of five words should include some English words to avoid raising the suspicions of the censoring proxy. (Assume the censoring proxy has a built-in dictionary of common words so it can distinguish the English words in queries from the made-up ones.)
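These heuristics are simple enough to sketch in a few lines. Everything specific below is an assumption made for illustration: the thresholds are invented, the tiny COMMON_WORDS set stands in for a real dictionary, and a real proxy would presumably tune all of them against actual traffic.

```python
VOWELS = set("aeiou")
COMMON_WORDS = {"the", "of", "and", "how", "what", "where", "is", "to", "in", "for", "find"}

MAX_DIGIT_RATIO = 0.3                    # hypothetical: more digits than this looks odd
MIN_VOWEL_RATIO, MAX_VOWEL_RATIO = 0.2, 0.6
MAX_CASE_FLIPS = 2                       # lowercase letter directly followed by uppercase
MIN_WORDS_FOR_DICTIONARY_CHECK = 3

def suspicion_reasons(text: str) -> list:
    """Return the list of heuristics that the text trips (empty if it looks normal)."""
    reasons = []
    visible = [c for c in text if not c.isspace()]
    letters = [c for c in visible if c.isalpha()]

    if visible and sum(c.isdigit() for c in visible) / len(visible) > MAX_DIGIT_RATIO:
        reasons.append("digits")

    flips = sum(1 for a, b in zip(text, text[1:]) if a.islower() and b.isupper())
    if flips > MAX_CASE_FLIPS:
        reasons.append("case")

    if letters:
        ratio = sum(c.lower() in VOWELS for c in letters) / len(letters)
        if not MIN_VOWEL_RATIO <= ratio <= MAX_VOWEL_RATIO:
            reasons.append("vowels")

    words = [w.strip(".,?!\"'()").lower() for w in text.split()]
    if len(words) >= MIN_WORDS_FOR_DICTIONARY_CHECK and not any(w in COMMON_WORDS for w in words):
        reasons.append("dictionary")
    return reasons

for sample in ["5DFS06JK4sdf90", "FdREWIoDP lIEWkXUAR", "Ersdlxvbmrtj Wdsfq", "eXistenZ"]:
    print(sample, suspicion_reasons(sample))
```

With these particular thresholds a single made-up word like "eXistenZ" passes, while each of the three garbled strings trips at least one rule.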

These rules are different from the rules for flagging "suspicious URL's" -- a URL is "suspicious" purely because of its length, since a user is almost as unlikely to type in
http://www.somesite.com/pages/users/movie-reviews/starwars.html
as they are to type in
http://www.somesite.com/1999/6-12-1999/0,00,34523,2314.html?ceq
But data submitted in text fields will only look suspicious if the proxy's built-in rules determine that it doesn't "look" like English text. A user can go to http://www.askjeeves.com and enter an ordinary English question without attracting suspicion, but entering a query made up of random-looking characters might cause the query to be flagged in the proxy's log file.

Of course, some legitimate queries will also get flagged, like "3Com (3C562B-3C563B MNP10) EtherLink III", which is an actual Ethernet card made by 3Com. But this will not break the functionality of the censoring proxy, since users will not be blocked from entering these queries; the searches will just be flagged for later review by the administrator.

The forms in all of these examples submitted data via the GET method, but the rules apply just as well to forms submitted by POST. Even though POST data does not appear as part of the submitted URL, it is still visible to the proxy server. So the user cannot submit large amounts of data in a POST string without triggering a red flag, unless the data submitted is in a format permitted by the "allow list", which is constructed in the same way for a POST form as it is for a form submitted via GET.

Section 3: Circumventing a proxy without transmitting enough bits of information to raise a red flag

Say that a user wants to use the site http://ians.978.org, for example, to get around a censoring proxy server and see the contents of the page http://www.theonion.com. The user must communicate a request for "http://www.theonion.com" -- 16 bytes (or 16 x 7 = 112 bits) of information if you leave off the "http://" -- without the transaction looking suspicious to the proxy server. To do this in a reasonable amount of time, you would need to submit some text via a text field in a form.

Obviously the user cannot simply enter the text "http://www.theonion.com". Text data that begins with "http://" is too easy to detect, and almost always indicates the use of a CGI proxy server to circumvent network-level censorship. (Cyber Patrol, for example, won't even let you submit a form that has "http://" as one of the field values.)

What is required is a translation between arbitrary URL's like "http://www.theonion.com" and text search queries that won't raise any red flags with the censoring proxy server. The requirements are:

Note that there is no requirement that strong encryption be used as part of the mapping. This is currently not one of the must-have design requirements for proxy-circumventing software, because in settings where this software is likely to be used, avoiding detection is usually much more important than preventing your communications from being decrypted if you do get caught. If the authorities in China, Saudi Arabia, high school, etc. discover that you circumvented their censorship software, you're likely to be punished regardless of whether the censors can figure out what you were actually looking at.

Note that there is also no requirement that all of the information be submitted in a single query. If a user is trying to access a long URL like
http://www.eff.org/pub/Publications/Declan_McCullagh/cwd.keys.to.the.kingdom.0796.article
which is too long to be encoded into a single query without triggering a red flag, then the encoded URL can be spread across several text queries, which are then submitted in sequence. To the censoring proxy, it will look as if the user is simply entering several consecutive queries into the same form.
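The arithmetic behind splitting a request across queries can be sketched as follows. The mapping itself is not specified here, so this only counts bits: the function names are ours, 7 bits per character follows the lower-ASCII assumption used earlier, and the per-query budget of 40 bits is a made-up figure for how much one innocent-looking query might carry.

```python
import math

BITS_PER_CHAR = 7  # lower-ASCII assumption used earlier in this paper

def url_bits(url: str) -> int:
    """Bits needed to name the URL, leaving off a leading "http://"."""
    if url.startswith("http://"):
        url = url[len("http://"):]
    return BITS_PER_CHAR * len(url)

def queries_needed(url: str, bits_per_query: int) -> int:
    """Number of consecutive queries needed if each one can carry bits_per_query bits."""
    return math.ceil(url_bits(url) / bits_per_query)

print(url_bits("http://www.theonion.com"))            # 112
print(queries_needed("http://www.theonion.com", 40))  # 3, under the assumed 40-bit budget
```

A longer URL simply needs more queries at the same budget, which is why the number of consecutive submissions, and not only the content of each one, is something a design would have to keep plausible.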

The next step, designing a mapping from arbitrary URL strings to non-suspicious-looking text queries, will be the subject of a follow-up paper.


Copyright © 2000 Bennett Haselton