Steganography to circumvent network-level censorship

How to use steganography to securely circumvent network-level censorship

Bennett Haselton, 2/15/2000

Statement of the problem
Introduction
Section 1: Example -- Why using long URL's to transmit information is insecure and too easy to detect
Section 2: Counting the bits of information that you transmit when surfing the Web
Section 3: Circumventing a proxy without transmitting enough bits of information to raise a red flag

Statement of the problem

Assumptions:

User Bob is connected to a network where traffic between the network and the outside Internet is controlled and monitored by Mallory, the network administrator. Mallory does not allow direct traffic between machines inside and outside the firewall, unless the traffic end point inside the firewall is one of the machines "trusted" by Mallory, such as the ISP mail server or the proxy server. So Bob must surf the Web using the proxy server controlled by Mallory.
The proxy server is configured to deny access to certain sites such as http://www.cnn.com/. Sites can be blocked if they are on a built-in list of "bad sites", or if the proxy server dynamically detects "suspicious" content in a document. However, the default for uncategorized sites is "allow".
Alice runs a Web site on the outside Internet. Alice wants to help Bob circumvent Mallory's censoring software, but Mallory doesn't know this, so Mallory's proxy server does not block access to Alice's site. (In real life, this would be accomplished by having so many Alices on the Internet that Mallory cannot keep track of them and block them all.)
The protocol that Bob uses to communicate with Alice's "circumventor" site is publicly documented.
Mallory knows the format of the protocol used to communicate with "circumventor" sites, and his censoring proxy monitors the traffic that Bob generates when he surfs the Web. If the protocol generates traffic between Bob and Alice that is easy to distinguish from normal Web-surfing traffic, then the proxy can detect this and block Alice's site automatically.
Important: The censoring proxy server has large resources, in terms of processing power and storage space, to devote to analyzing Web traffic on a per-user basis. Some of the techniques described in this paper would not be practical for a censoring government to apply to all Internet traffic generated from within their country. But we make the assumption that the censoring proxy has sufficient resources to do everything described here, for these reasons:
- A censoring proxy serving many users does not have to analyze all users' traffic all of the time. Every day, it could randomly audit traffic generated by 1% of the Internet-using population.
- Moore's Law says that computing power doubles every 18 months. The same is not true of the population of users surfing the Web through a given censored connection, so the amount of computer power that a censoring proxy can devote to analyzing each user's behavior is growing exponentially.

It is important to note that existing censoring proxy servers are not sophisticated enough to do most of the things described in this paper. However, they could be easily modified to do everything described -- especially if that were the only way to stop people from circumventing them. We want to design a lasting protocol for circumventing proxy servers, and not one that will only be secure until their next product cycle.

The theme of this paper is that the proxy server can count the number of bits of information transmitted by a user surfing the Web. (The distinction between "bits of information" and "bits of data" is important, and explained below.) If bits of information are sent out at a higher rate while using the "circumventor" protocol than while surfing the Web normally, then the proxy server can detect this. Most existing methods for getting around proxy servers do not take this into account -- but eventually, they will have to.

Introduction

There is a difference between transmitting bits of data and bits of information. For example, say that you are visiting a page that has eight links on it and one of the links is to the page "/home/guestbook.html". When you click on that link, your browser sends this text to the Web server:

GET /home/guestbook.html HTTP/1.1

which is 33 bytes, or 264 bits, of data that you transmitted (counting only this request line). However, because there are eight links on the page, you could have specified which link you wanted to click on by using one of the following eight codes:

code   meaning        code   meaning
000    first link     100    fifth link
001    second link    101    sixth link
010    third link     110    seventh link
011    fourth link    111    eighth link

In other words, since the Web server knows that you are going to make one of eight choices (clicking on one of the eight links), and three bits are enough to specify one selection out of eight, then when you click on one of the links, you're only sending three bits of information to the Web server.
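
The gap between data and information in this example can be checked with a short calculation. The Python sketch below is only an illustration: the function names are invented for this example, the request line is counted without its line terminator or any headers, and the eight links are assumed to be equally likely choices.

```python
import math

def data_bits(request_line: str) -> int:
    """Bits of data on the wire for the request line (8 bits per byte)."""
    return len(request_line.encode("ascii")) * 8

def information_bits(num_choices: int) -> float:
    """Bits needed to pick one of num_choices equally likely options."""
    return math.log2(num_choices)

request = "GET /home/guestbook.html HTTP/1.1"
print(data_bits(request))      # bits of data in the request line
print(information_bits(8))     # 3.0 bits to select one of eight links
```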

This is relevant to the problem of circumventing a censoring proxy server without getting detected, because the administrator of the censoring proxy server can measure the number of bits of information that you are transmitting while surfing the Web. The number of bits transmitted during a normal Web-surfing session is small, so if the proxy administrator detects that the number of bits of information transmitted by your computer always goes up when you visit a particular site, the administrator could block that site for enabling suspicious activity.

This is easier to understand after looking at a specific example, given in Section 1 below, "Why using long URL's to transmit information is insecure and too easy to detect". How this relates to the bits-of-information problem is explained in Section 2: Counting the bits of information that you transmit when surfing the Web. Finally, Section 3: Circumventing a proxy without transmitting enough bits of information to raise a red flag, explains how this is relevant to designing a program for circumventing network-level censorship without being detected.

Section 1: Example -- Why using long URL's to transmit information is insecure and too easy to detect

This section explains why it is not a secure solution for the user to use long and garbled-looking URL's like
http://rd.yahoo.com/M=26036.208672.1462854.389526/S=2716149:N/A=167766/?http://messenger.yahoo.com/
to send information to an outside Web server, even though URL's of this form are encountered regularly when surfing the Web. (The URL above was an actual example taken from the Yahoo home page.)

A "suspicious-looking" URL is defined as a URL which is long enough that it is extremely unlikely that the user would have typed it in themselves. Hardly anyone accesses pages like
http://rd.yahoo.com/M=26036.208672.1462854.389526/S=2716149:N/A=167766/?http://messenger.yahoo.com/
by actually typing in the address. A page with an address like that is almost always accessed by clicking on a link. Similarly, when a user's browser requests an image URL like
http://a1.g.a.yimg.com/7/1/31/000/us.yimg.com/a/vi/visa/sm.gif
it's almost always because their browser was viewing a page which included the image tag
<img src="http://a1.g.a.yimg.com/7/1/31/000/us.yimg.com/a/vi/visa/sm.gif">

So, how could the administrator of a censoring proxy server allow users to access images and click on links which have long URL's, while flagging any "suspicious-looking" URL's that could be used to send a secret communication to an outside server? By applying these rules:

Maintain an "allow list" of URL's such that if a user accesses a URL on the "allow list", it will not be flagged as suspicious, regardless of how long the URL itself is. This list can be maintained on a per-user basis, and each entry on the "allow list" can be set to expire after a few minutes.
If the user visits a page http://www.somesite.com which includes these link and image tags:
<img src="/images/ads/23451345-fktuejedmngbhgidk.gif">
<a href="http://www.othersite.com/articles/00,2343,135124,412.html">
then the following URL's are added to that user's "allow list":
http://www.somesite.com/images/ads/23451345-fktuejedmngbhgidk.gif
http://www.othersite.com/articles/00,2343,135124,412.html
If the user visits a page http://www.somesite.com which includes this <FORM> element:
<FORM action="perlscript.pl" method="GET">
<input type="checkbox" name="checkbox1" checked>
<input type="radio" name="radio1" value="blue">
<input type="radio" name="radio1" value="green" checked>
<input type="submit">
</FORM>
then any URL of the form
http://www.somesite.com/perlscript.pl?checkbox1=on&radio1=green
will be added to the "allow list".
If the user accesses any other "suspicious-looking" URL that is not on the allow list, then the proxy administrator either blocks the URL or flags it as "suspicious" (and if the latter happens too many times, then the administrator will conclude that the user is somehow getting around the proxy server).
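
A minimal sketch of how a proxy might build such a per-user "allow list" from link and image tags is shown below, using Python's standard html.parser and urllib.parse modules. The class names, the five-minute expiry, and the explicit "now" parameter are assumptions made for this illustration; <FORM> handling and JavaScript-generated links are left out.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> and <img src> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page that contained them.
                self.urls.add(urljoin(self.base_url, value))

class AllowList:
    """Per-user allow list whose entries expire after ttl_seconds."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expiry = {}  # url -> time at which the entry expires

    def add_page(self, page_url, html, now=None):
        now = time.time() if now is None else now
        collector = LinkCollector(page_url)
        collector.feed(html)
        for url in collector.urls:
            self.expiry[url] = now + self.ttl

    def is_allowed(self, url, now=None):
        now = time.time() if now is None else now
        return self.expiry.get(url, 0) > now

page = '''<img src="/images/ads/23451345-fktuejedmngbhgidk.gif">
<a href="http://www.othersite.com/articles/00,2343,135124,412.html">'''
allow = AllowList(ttl_seconds=300)
allow.add_page("http://www.somesite.com", page, now=0)
print(allow.is_allowed(
    "http://www.somesite.com/images/ads/23451345-fktuejedmngbhgidk.gif", now=10))
```

A URL that was never seen in a recently visited page, or whose entry has expired, fails the is_allowed check and would then be blocked or flagged.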

The above is not a complete list of how URL's should be added to the "allow list". For example, links are often created by JavaScript functions, so the censoring proxy might have to have some basic ability to parse JavaScript. And a user might have a list of bookmarks or favorites that they had visited previously, so they might jump to those URL's directly, in which case the URL will look "suspicious" if it is long enough, because the user is not getting to it by following a link from a recently-visited page.

But the censoring proxy does not necessarily have to block such pages; it just has to mark them as "suspicious" for later review by an administrator. Since most users never have more than a dozen entries in their Web favorites list, it would be easy for an administrator to audit the list of "suspicious" URL's visited by each user every few weeks.

Section 2: Counting the bits of information that you transmit when surfing the Web

In the scenario above, the problem from the user's point of view is that if they are visiting the page http://www.somesite.com with the following four links on it:
/articles/69548378684-qlapsififsdtsjn.html
/articles/98983377699-qqtdgsfgokjnwef.html
/products.cgi?id=36532529087
/products.cgi?id=32450459958
then if the user clicks on one of those four links (which has been temporarily added to their "allow list"), they are only sending two bits of information to the Web server. But if the user tries to access the URL
http://www.somesite.com/34klhfga4-GPFIDKR/?Q2*&Z
(which is not on their "allow list"), then they are transmitting 24x7 = 168 bits of information (24 bytes for the string "34klhfga4-GPFIDKR/?Q2*&Z", times 7 bits for each byte, if we assume that characters in URL's are limited to the 128 lower-ASCII characters). By only allowing the user to access links on their "allow list" without raising a red flag, the proxy administrator is basically limiting the number of bits of information that the user can transmit without getting flagged for suspicious activity.
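
The two figures in this comparison can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text: each allow-listed link is an equally likely choice, and every character of an unlisted URL counts as 7 bits. The function names are invented for the example.

```python
import math

BITS_PER_CHAR = 7  # assumption from the text: URL characters are lower-ASCII

def link_click_bits(allow_list):
    """Information sent by choosing one URL from the allow list."""
    return math.log2(len(allow_list))

def free_url_bits(user_chosen_part):
    """Information sent when the user controls every character of the string."""
    return len(user_chosen_part) * BITS_PER_CHAR

allow_list = [
    "/articles/69548378684-qlapsififsdtsjn.html",
    "/articles/98983377699-qqtdgsfgokjnwef.html",
    "/products.cgi?id=36532529087",
    "/products.cgi?id=32450459958",
]
print(link_click_bits(allow_list))                # 2.0
print(free_url_bits("34klhfga4-GPFIDKR/?Q2*&Z"))  # 168
```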

If the user fills out a form that includes text fields, then in that situation, the censoring proxy has to allow the user to transmit many more bits of information than are usually transmitted by clicking on a link. There are a few bits that also get transmitted through the non-text fields of the form, but text fields transmit the most information. For example, the user could visit a page with a form on it that has a "Band name" text field, a "Show" selection list (CD's only, tapes only, or CD's and tapes), and a "Bill me later" checkbox.

When the censoring proxy scans this page, it adds URL's of the form:
http://www.somesite.com/perlscript.pl?bandname=[any text here]&products=[cdsonly|tapesonly|cdsandtapes]&billmelater=[either "on" or nothing]

to the user's "allow list". The user then enters "Korn" as the band name, selects CD's and tapes, and checks "Bill me later".

Then when the user submits the form, the following data is transmitted to the Web server:
/perlscript.pl?bandname=Korn&products=cdsandtapes&billmelater=on
The byte sequences controlled by the user ("Korn", "cdsandtapes" and "on") are the ones that can be used to transmit bits of information. But even though the string "cdsandtapes" consists of 11 bytes, it stores less than 2 bits of information, because there were only three choices for that option. The "band name" field could be almost anything, so it can store much more information even though the submitted value in this case, "Korn", has fewer bytes of data than "cdsandtapes". Of course a user could insert some text after products= other than "cdsonly", "tapesonly" or "cdsandtapes":
/perlscript.pl?bandname=Korn&products=45G2qvXI&billmelater=on
but the "45G2qvXI" would be flagged by the censoring proxy, since the "allow list" currently permits only "cdsonly", "tapesonly" or "cdsandtapes" to appear after products= in the URL.
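
The check described here, where the proxy flags a field value that the form could not have produced, might look like the following Python sketch. The FORM_RULES table is a hypothetical way of recording the allow-list pattern for this one form (None means "any text"); a real proxy would derive it from the <FORM> element on the page.

```python
import math
from urllib.parse import urlsplit, parse_qsl

# Hypothetical allow-list pattern for the example form; None means free text.
FORM_RULES = {
    "bandname": None,
    "products": {"cdsonly", "tapesonly", "cdsandtapes"},
    "billmelater": {"on"},
}

def flagged_fields(url, rules):
    """Return the names of query fields whose values the pattern does not permit."""
    flagged = []
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        allowed = rules.get(name, set())  # unknown field: no value is permitted
        if allowed is not None and value not in allowed:
            flagged.append(name)
    return flagged

ok = "/perlscript.pl?bandname=Korn&products=cdsandtapes&billmelater=on"
bad = "/perlscript.pl?bandname=Korn&products=45G2qvXI&billmelater=on"
print(flagged_fields(ok, FORM_RULES))   # []
print(flagged_fields(bad, FORM_RULES))  # ['products']
print(math.log2(3))                     # under 2 bits for a three-way choice
```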

But even a text field cannot contain arbitrary data without raising a red flag. A well-designed censoring proxy could distinguish between "normal-looking" text and "suspicious" text by, for example, flagging text strings that contain too many numbers ("5DFS06JK4sdf90"), strings that contain a strange mix of upper and lower case ("FdREWIoDP lIEWkXUAR"), or even strings containing an odd proportion of vowels to consonants that is inconsistent with normal English text ("Ersdlxvbmrtj Wdsfq"). Also, if the query consists of several words, then at least one of those should be an actual English word. Text queries often contain non-English words ("eXistenZ"), but a query made up of five words should include some English words to avoid raising the suspicions of the censoring proxy. (Assume the censoring proxy has a built-in dictionary of common words so it can distinguish the English words in queries from the made-up ones.)
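
A rough sketch of such a text filter follows. Every threshold in it (30% digits, more than two capitals inside a mixed-case word, a vowel share outside 20% to 70%, three or more words with no dictionary hit) is an assumption chosen so that the example strings above behave as described, and COMMON_WORDS is a tiny stand-in for the proxy's built-in dictionary.

```python
VOWELS = set("aeiou")
COMMON_WORDS = {"star", "wars", "movie", "reviews", "apricot"}  # stand-in dictionary

def suspicion_reasons(text, dictionary=COMMON_WORDS):
    """Return a list of reasons the text looks unlike a normal English query."""
    reasons = []
    words = text.split()
    chars = [c for c in text if not c.isspace()]
    letters = [c for c in chars if c.isalpha()]

    # Too many digits relative to the length of the string.
    if chars and sum(c.isdigit() for c in chars) / len(chars) > 0.3:
        reasons.append("digits")

    # A word mixing lower case with several capitals after its first character.
    for word in words:
        interior_upper = sum(c.isupper() for c in word[1:])
        if interior_upper > 2 and any(c.islower() for c in word):
            reasons.append("case")
            break

    # Vowel share far from what English text usually shows.
    if len(letters) >= 8:
        ratio = sum(c.lower() in VOWELS for c in letters) / len(letters)
        if not 0.2 <= ratio <= 0.7:
            reasons.append("vowels")

    # Multi-word queries should contain at least one dictionary word.
    if len(words) >= 3 and not any(w.lower() in dictionary for w in words):
        reasons.append("no dictionary word")
    return reasons

for sample in ["5DFS06JK4sdf90", "FdREWIoDP lIEWkXUAR", "Ersdlxvbmrtj Wdsfq",
               "eXistenZ", "star wars movie reviews"]:
    print(sample, suspicion_reasons(sample))
```

Because the function returns reasons rather than a yes/no answer, a proxy could log them for later review instead of blocking the request outright.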

These rules are different from the rules for flagging "suspicious URL's" -- a URL is "suspicious" purely because of its length, since a user is almost as unlikely to type in
http://www.somesite.com/pages/users/movie-reviews/starwars.html
as they are to type in
http://www.somesite.com/1999/6-12-1999/0,00,34523,2314.html?ceq
But data submitted in text fields will only look suspicious if the proxy's built-in rules determine that it doesn't "look" like English text. A user can go to http://www.askjeeves.com and enter an ordinary English question without attracting suspicion, but entering a random-looking string of characters might cause the query to be flagged in the proxy's log file.

Of course, some legitimate queries will also get flagged, like "3Com (3C562B-3C563B MNP10) EtherLink III", which is an actual Ethernet card made by 3Com. But this will not break the functionality of the censoring proxy, since users will not be blocked from entering these queries; the searches will just be flagged for later review by the administrator.

The forms in all of these examples submitted data via the GET method, but the rules apply just as well to forms submitted by POST. Even though POST data does not appear as part of the submitted URL, it is still visible to the proxy server. So the user cannot submit large amounts of data in a POST string without triggering a red flag, unless the data submitted is in a format permitted by the "allow list", which is constructed in the same way for a POST form as it is for a form submitted via GET.

Section 3: Circumventing a proxy without transmitting enough bits of information to raise a red flag

Say that a user wants to use the site http://ians.978.org, for example, to get around a censoring proxy server and see the contents of the page http://www.theonion.com. The user must communicate a request for "http://www.theonion.com" -- 16 bytes (or 16 x 7 = 112 bits) of information if you leave off the "http://" -- without the transaction looking suspicious to the proxy server. To do this in a reasonable amount of time, you would need to submit some text via a text field in a form.

Obviously the user cannot simply enter the text "http://www.theonion.com". Text data that begins with "http://" is too easy to detect, and almost always indicates the use of a CGI proxy server to circumvent network-level censorship. (Cyber Patrol, for example, won't even let you submit a form that has "http://" as one of the field values.)

What is required is a translation between arbitrary URL's like "http://www.theonion.com" and text search queries that won't raise any red flags with the censoring proxy server. The requirements are:

The generated text strings must look non-suspicious to an automated scanning program; a search string like "exfserq 783 apricot qdolp" is probably about the most exotic-looking type of string that should be allowed.
The mapping must be reversible, so that if "http://www.theonion.com" maps to "exfserq 783 apricot qdolp", then the software running on ians.978.org must know how to map "exfserq 783 apricot qdolp" back to "http://www.theonion.com".
The mapping must not always map URL's the same way, e.g. "http://www.theonion.com" must not always get encoded as "exfserq 783 apricot qdolp", or else the proxy administrators can easily get their own copy of the software, figure out what "http://www.theonion.com" maps to, and monitor for that particular query (a "known-plaintext attack").
The generated text queries should not be in a format that might get blocked for some other reason, so the algorithm should guard against the small possibility that a query like "frqidfs 45 sex lmdf" would be submitted.
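
To make the second, third and fourth requirements concrete, here is a toy Python sketch. It is not the mapping this paper has in mind, and it does not satisfy the first requirement, since a long run of made-up words would itself look odd. Everything in it is an assumption for illustration: the 256 pseudo-words built from syllables, the random "salt" sent as the first word, the simple masking formula, and the BLOCKED set. The masking is not encryption.

```python
import random

SYLLABLES = [c + v for c in "bdkm" for v in "aeio"]     # 16 syllables
WORDS = [a + b for a in SYLLABLES for b in SYLLABLES]   # 256 pseudo-words, one per byte value
INDEX = {w: i for i, w in enumerate(WORDS)}
BLOCKED = {"dodo"}  # hypothetical stand-in for words a filter might block

def _mask(salt, i):
    """Byte mixed into position i so the same URL encodes differently per salt."""
    return (salt + 31 * i) % 256

def encode(url, rng=random):
    data = url.encode("utf-8")
    salts = list(range(256))
    rng.shuffle(salts)  # try the salts in random order
    for salt in salts:
        words = [WORDS[salt]] + [WORDS[b ^ _mask(salt, i)] for i, b in enumerate(data)]
        if not BLOCKED.intersection(words):  # retry if a blocked word appears
            return " ".join(words)
    raise ValueError("no salt avoids the blocked words")

def decode(query):
    values = [INDEX[w] for w in query.split()]
    salt, body = values[0], values[1:]
    return bytes(v ^ _mask(salt, i) for i, v in enumerate(body)).decode("utf-8")

first = encode("www.theonion.com")
second = encode("www.theonion.com")
print(first)
print(second)         # usually differs from the first encoding
print(decode(first))  # www.theonion.com
```

The random salt is what keeps the same URL from always producing the same query, and re-drawing the salt gives a simple way to avoid output that contains a blocked word.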

Note that there is no requirement that strong encryption be used as part of the mapping. This is currently not one of the must-have design requirements for proxy-circumventing software, because in settings where this software is likely to be used, avoiding detection is usually much more important than preventing your communications from being decrypted if you do get caught. If the authorities in China, Saudi Arabia, high school, etc. discover that you circumvented their censorship software, you're likely to be punished regardless of whether the censors can figure out what you were actually looking at.

Note that there is also no requirement that all of the information be submitted in a single query. If a user is trying to access a long URL like
http://www.eff.org/pub/Publications/Declan_McCullagh/cwd.keys.to.the.kingdom.0796.article
which is too long to be encoded into a single query without triggering a red flag, then the encoded URL can be spread across several text queries, which are then submitted in sequence. To the censoring proxy, it will look as if the user is simply entering several consecutive queries into the same form.
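
Spreading an encoded URL across several queries could be as simple as the Python sketch below. The limit of four words per query is an arbitrary assumption, and the sketch relies on the receiving site seeing the queries in the order they were submitted.

```python
def split_into_queries(words, max_words=4):
    """Group encoded words into consecutive queries of at most max_words each."""
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def join_queries(queries):
    """Rebuild the original word sequence from queries received in order."""
    return [word for query in queries for word in query.split()]

encoded = ["exfserq", "783", "apricot", "qdolp", "frint", "42", "lemon", "plok", "tarn"]
queries = split_into_queries(encoded)
for query in queries:
    print(query)
print(join_queries(queries) == encoded)  # True
```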

The next step, designing a mapping from arbitrary URL strings to non-suspicious-looking text queries, will be the subject of a follow-up paper.
