List of possible weaknesses in systems to circumvent Internet censorship

Bennett Haselton
First draft completed 11/7/2002; continually updated

A wide variety of systems, including programs by the names of Triangle Boy, Peek-A-Booty, Six/Four, and CGIProxy, have been proposed for circumventing Internet censorship in countries such as China and Saudi Arabia, with no clear winner emerging as the single best anti-censorship solution.

One reason is that there hasn't been much discussion about how well these systems would hold up in response to various types of attacks that could be mounted by the censors. The worst thing that could happen would be for an anti-censorship system to be widely deployed, with volunteers all over the world running software to assist in the effort and people in China and other censored countries using the software every day to beat censorship, when suddenly the censors find a flaw that can undermine and block the whole system. If the censors discover a technique to detect circumvention traffic, then not only can the system be blocked and rendered obsolete, but if the traffic can be traced back to individual users in the censored countries, the penalties imposed on them could be severe. (China has jailed dissidents for downloading Internet articles critical of China and executed hackers for committing more serious cyber-crimes.)

Plus, if the traffic is detectable, the censors can also trace it to the sites outside their country which are helping defeat Internet censorship, and add those sites to a permanent blacklist. Even if those sites later upgrade to a more secure, undetectable version of the software, they will still be blacklisted, and it may be prohibitive for them to move to a new location to get around the blacklist.

So, it is a high priority to think of possible attacks against a system before the system is deployed. This page is a collection of common attacks, weaknesses, and fallacies that must be avoided. In order for an anti-Internet-censorship system to be considered safe, the author should describe how it avoids any of the following problems that might apply to it. If you think of any other general categories of attacks, please email me and I'll add them to this list.

General issues

The censorware designers can always make the last move

When designing a circumvention system, it's not enough to design it so that it can get around the existing censorship methods already in place. It should also not be possible for the censors to defeat the circumvention system by making an easy change to their censorship architecture.

As a trivial example, suppose that a server like Anonymizer.com allows users in China to retrieve banned pages. (The real Anonymizer site is blocked, of course, but suppose someone installs similar software on their own server, which the Chinese censors don't know about.) But since the Chinese block users from downloading the pages containing the string "Falun Gong", the software replaces characters in an HTML document with their HTML character code equivalents, so that "Falun Gong" might be replaced with "F&#97;lun G&#111;ng", which a Web browser will display as "Falun Gong". This will temporarily defeat China's keyword filtering. But if this method becomes widely known (as it would have to be, in order to be useful to a significant proportion of Internet users in China), the Chinese censors can modify their software to take HTML character codes into account when scanning for banned strings.
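The following Python sketch illustrates how little effort the censors' counter-move takes. It assumes the circumventor encodes every character as a decimal character code and that the updated filter simply decodes such codes before matching; the function names are hypothetical and are not taken from any real filtering product.

```python
import html

BANNED_STRINGS = ["Falun Gong"]

def encode_as_character_codes(text):
    """Replace every character with its decimal HTML character code."""
    return "".join(f"&#{ord(ch)};" for ch in text)

def naive_filter(page_source):
    """Original censor: scan the raw page source for banned strings."""
    return any(term in page_source for term in BANNED_STRINGS)

def updated_filter(page_source):
    """Updated censor: decode character codes first, then scan."""
    decoded = html.unescape(page_source)
    return any(term in decoded for term in BANNED_STRINGS)

page = "<p>" + encode_as_character_codes("Falun Gong") + "</p>"
print("naive filter blocks page:  ", naive_filter(page))
print("updated filter blocks page:", updated_filter(page))
```

The encoded page slips past the first filter but not the second, and the second differs from the first by a single line.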

The authors of the circumvention system might be prepared to engage in an "arms race" with the censors, where each time the censors find a way to detect and block the last version of their software, the authors release a new version that avoids the old weakness. But if the circumvention scheme depends on volunteers all over the world running the software on their computers or their Web sites (as all of the existing proposed systems do), then each time a new version is released, all volunteers will have to upgrade, which may take more ongoing time and effort than they want to give. If software is required on the end-user's computer inside the censored country, then they will have to upgrade as well -- and if the circumvention system that they were using before is now blocked and made obsolete, they may not even be able to obtain the upgrade without a lot of effort. Worst of all, each time the censors get ahead in the arms race by finding a way to detect and trace the circumvention traffic, they can detect and punish anyone within their country that they catch using the software -- and those are the users who pay the biggest price for the "arms race".

When deploying a circumvention system, you can never be completely certain that there is no way to detect and block it. But for the reasons described above, it's irresponsible to deploy a system where you know of an easy way to detect it and it's only a matter of time before the censors figure it out. The best you can do is release a circumvention system that doesn't have any known fatal weaknesses.

The "human shield" fallacy

This goes something like, "We built an anti-censorship system that hides secret traffic in ICQ messages. The Chinese won't dare to block ICQ -- it's a valuable tool that increases international understanding and friendship among nations, and besides, blocking it would violate RFC 9,234,436." Even if the Chinese censors aren't blocking ICQ now, if ICQ became the most popular means of circumventing their government's censorship, it would likely be blocked very quickly.

The only protocols that the Chinese would probably never block, at least not without rendering the Internet essentially useless for the Chinese, would be Web traffic and email. The Chinese government must believe that Internet access provides some benefit to their country, or they wouldn't allow it at all, and blocking Web traffic or email would have a staggering impact. But any other protocol would probably get blocked very easily if it were widely used as a means of covertly sneaking around the firewall.

Assuming that censors lack the resources to monitor all traffic effectively

First of all, this only applies to "monitoring" algorithms that are processor-intensive or require traffic to be stored somewhere where it can be analyzed. Simply blocking access to a Web site at the Great Chinese Firewall is trivial, and the Chinese can block as many sites as they want. But if they wanted to, for example, block pages that a user visits 90% of the time right after being denied access to a blocked site, that would require some storage of usage history patterns.

Moore's Law says that the amount of computing power you can buy for a fixed cost doubles every 18 months. The number of Internet users in any given country can't grow that fast (it would quickly exceed the total number of people in the country), so the amount of computing power available to monitor any individual user's Internet traffic will essentially double every 18 months as well. In order for a circumvention system to stand the test of time, it should take into account the potential increase in governments' power to monitor traffic.

In the meantime, even if a country can't monitor all traffic effectively, it could decide to only monitor, say, 5% of its users at any given time. If the censors can spot circumvention traffic and trace it back to specific users within their country, then each user would have an unacceptably high 1-in-20 chance of being caught each time they used the circumvention protocol. Even if the act of circumvention could not be traced back to a specific user within the country, the circumvention site outside the country would be permanently blacklisted so that it could not be used in the future.
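To see how quickly a 1-in-20 risk adds up, here is a small Python calculation. It assumes, purely for illustration, that each use of the circumvention protocol is an independent event with a 5% chance of falling inside the monitored sample; real monitoring would not necessarily behave this way.

```python
def chance_of_being_caught(uses, monitored_fraction=0.05):
    """Probability of being observed at least once over `uses` sessions,
    assuming each session is monitored independently."""
    return 1 - (1 - monitored_fraction) ** uses

for uses in (1, 10, 50):
    print(f"{uses:3d} uses: {chance_of_being_caught(uses):.1%}")
```

Under that assumption, a user who relies on the system ten times already has roughly a 40% chance of having been observed at least once.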

Also, different countries have different abilities to monitor and censor the Internet traffic over their networks. China uses centralized filtering at their national border, blocking traffic to specific sites, and only in late 2002 did they begin more fine-grained blocking, such as blocking traffic containing certain keywords. Many Chinese users also access the Internet through public cyber cafes, where violations would be virtually impossible to trace back to a specific individual. (Although as of October 2002, the Chinese are now requiring the use of ID cards to sign on to the Internet in licensed Internet cafes, so that an attempt to access a blocked site can be traced back to the individual user -- but unlicensed cafes still thrive and would not be bound by the new rule.) Saudi Arabia, on the other hand, uses a network of proxy servers supplied by SmartFilter, which allows more flexible blocking of Web access (specific keywords, URLs, search patterns, default blocking of sites accessed by IP address, etc.), and any suspicious activity can be traced back to an individual user's ISP account. The Saudi filtering system is also distributed across multiple proxy servers, allowing for more sophisticated, processor-intensive analysis of users' usage patterns.

So it would seem that since China's filtering is much less sophisticated than Saudi Arabia's, a system could be deployed that would be secure enough to go undetected in China but not in Saudi Arabia. The problem is that once the system is released and gains a reputation for helping to circumvent Internet censorship in China, it would be very hard to stop users in Saudi Arabia from using the same system if it were at all possible for them to obtain it. Once you open the floodgates, users are unlikely to understand the nuances of why a system would be safe to use in one network architecture but not another.

Traffic-flow analysis

If the censors can track the history of Internet accesses by user or by IP address, one thing they can do is watch what site a person usually connects to, immediately after being denied access to a blocked site. Whatever it is that the user does to get around a block -- whether visiting a particular Web site, or sending email to a certain address, or going on a chat network -- it will look suspicious if they always do it right after being blocked from something.
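A censor with access to per-user logs would not need anything sophisticated to do this. The Python sketch below assumes a hypothetical log format of (timestamp, user, host, was_blocked) tuples sorted by time, and counts how often each host is the first thing a user visits within a minute of being blocked.

```python
from collections import Counter

def first_site_after_block(log, window_seconds=60):
    """Count, per (user, host), how often `host` was the first site the
    user reached within `window_seconds` of a blocked request."""
    pending_block = {}   # user -> timestamp of their most recent block
    counts = Counter()
    for timestamp, user, host, was_blocked in log:
        if was_blocked:
            pending_block[user] = timestamp
        elif user in pending_block:
            if timestamp - pending_block[user] <= window_seconds:
                counts[(user, host)] += 1
            del pending_block[user]  # only the first follow-up visit counts
    return counts

# Hypothetical log entries; the hostnames are placeholders.
log = [
    (0,    "user1", "banned.example",       True),
    (5,    "user1", "circumventor.example", False),
    (500,  "user1", "news.example",         False),
    (1000, "user1", "banned.example",       True),
    (1010, "user1", "circumventor.example", False),
]
suspicious = first_site_after_block(log)
print(suspicious.most_common(3))
```

Any host that keeps turning up at the top of this list for many users is a strong candidate for a closer look.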

The safest way to defeat this detection would be to educate users -- tell them not to always go to a circumventor site immediately after being denied access to a blocked site. Unfortunately, it's notoriously difficult to get software users to follow any guidelines that are not actually enforced by the software. It would be better to display a warning that the user always sees when using the circumvention program.

If the circumvention method is to connect to a "circumventor" Web site where the user types in the URL of a page that they want to see, then the page itself could contain a warning: "Do not always visit this page right after being denied access to a blocked site". Of course, by that point it's too late, since the connection to the circumventor site has already been made. If the circumvention method uses software on the user's side (which connects to a server running somewhere outside the censored regime), this is safer because the software itself can be configured to display the message before the connection is initiated.

Also, the more outside circumventor servers that the user knows about (or, the more circumventor servers are being stored by their client software), the fewer times the user will need to visit each one. If the user simply knows about a list of several circumventor Web sites, then unfortunately they will probably connect to the same one each time they want to view a blocked page, just out of habit. If they are using client-side software, though, the software can be configured to properly rotate through the available servers. But in both cases, the real problem with multiple circumventor servers is that if it's hard enough to make sure each user knows about at least one unblocked server, it's even harder to make sure each user knows about several of them. The more users you distribute each server location to, the greater the chance that the censors will find it, block it, and then track down anybody who attempts to connect to it.
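As one possible illustration of what "rotating through the available servers" could mean for client-side software, the Python sketch below visits every known server once, in a shuffled order, before reusing any of them. The class name and the server addresses are made up for the example.

```python
import random

class ServerRotation:
    """Hand out known servers so that each is used once per round."""

    def __init__(self, servers, seed=None):
        self._servers = list(servers)
        self._rng = random.Random(seed)
        self._remaining = []

    def next_server(self):
        if not self._remaining:
            # Start a new round in a fresh random order.
            self._remaining = self._servers[:]
            self._rng.shuffle(self._remaining)
        return self._remaining.pop()

rotation = ServerRotation(["a.example", "b.example", "c.example"], seed=1)
first_round = [rotation.next_server() for _ in range(3)]
print(first_round)
```

This spreads the user's connections evenly, but it does nothing about the distribution problem described above: the client still has to learn several server locations in the first place.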

Using steganography to hide data inside "noise"

In any communication channel, "noise" can be considered the extra data being transmitted that isn't relevant to the information being sent. The most common example is the static "noise" on a radio communication line, but the random graininess in an image could be considered "noise" as well. In fact, steganography is usually discussed in terms of hiding data inside an image by changing the least significant bits representing the color of each pixel. For example, if you have an image measuring 100 x 100 pixels, and the image is saved in 24-bit color so that each pixel has a red, green, and blue value represented by an 8-bit number, then you could alter each of the 30,000 "least significant" bits to store a 3,750-byte message, without drastically changing the appearance of the image.
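The arithmetic and the basic technique can be sketched in a few lines of Python. This is only an illustration working on a flat run of color bytes (a stand-in for a decoded 100 x 100 image, not a real image file), and the function names are invented for the example.

```python
def embed(cover: bytes, message: bytes) -> bytes:
    """Write the bits of `message` into the lowest bit of successive cover bytes."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("message too long for this cover")
    out = bytearray(cover)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)

def extract(stego: bytes, length: int) -> bytes:
    """Read `length` bytes back out of the lowest bits."""
    out = bytearray()
    for n in range(length):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (stego[n * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)

# Stand-in for the raw color values of a 100 x 100 pixel, 24-bit image.
cover = bytes([128]) * (100 * 100 * 3)
capacity_bytes = len(cover) // 8          # one hidden bit per color byte
stego = embed(cover, b"hello")
print(capacity_bytes, extract(stego, 5))
```

Each color value changes by at most 1 out of 255, which is why the altered image looks the same to the eye.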

The problem with using these schemes to transmit information through a censored Web proxy, is that once this method becomes widely known, the censors can simply change random color bits in each downloaded image. (This would probably not be feasible at the level of the Chinese firewall, because the censoring software would have to re-assemble the packets representing each image, then obtain an internal representation of the image in terms of its pixels and colors, change the pixels, convert the image back into raw bytes, and send them out again. But it would be feasible for a censoring proxy such as the SmartFilter proxies used in Saudi Arabia.) The censors wouldn't change enough pixels to annoy normal users, but enough to defeat any encoding scheme that transmitted information using the least-significant color bits. (Many users probably don't care much about crisp image quality unless they're downloading pornography, which most censorious regimes are blocking anyway.)

It's possible that this could be solved using some sort of error-correction algorithm -- if the remote server sends an encoded message and too many bits have been changed, then the client requests it again, and the server re-sends the image but with the information spread more "thinly" throughout the image pixels, hoping that the random bits changed by the censoring proxy will not alter the message. But a less detectable transmission would also be less efficient. You would also have to make sure that the traffic generated during the "error correction" phase would not itself look suspicious. For example, requesting the same image several times in a row, with slightly different data being sent back each time, would certainly look abnormal, so you would want each successive attempt to request a different image from the server. It may be possible to use error correction to get around this problem -- just be sure to take into account that if you hide information in "noise" data, the censors can alter the "noise", and the protocol has to take this into account.
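One very simple way to spread information "thinly" is a repetition code with majority voting, sketched below in Python. This is a technique chosen for illustration, not something any of the systems above is known to use, and the 10% flip rate for the censoring proxy is an arbitrary assumption.

```python
import random

def spread(bits, copies):
    """Repeat every message bit `copies` times."""
    return [bit for bit in bits for _ in range(copies)]

def recover(received, copies):
    """Majority vote over each group of `copies` received bits."""
    groups = (received[i:i + copies] for i in range(0, len(received), copies))
    return [1 if 2 * sum(group) > copies else 0 for group in groups]

def censor_noise(bits, flip_fraction, rng):
    """Model a proxy that flips a random fraction of the low-order bits."""
    return [bit ^ 1 if rng.random() < flip_fraction else bit for bit in bits]

message = [1, 0, 1, 1, 0, 0, 1, 0]
rng = random.Random(2002)
received = censor_noise(spread(message, 31), 0.1, rng)
print(recover(received, 31) == message)
```

The cost is exactly the inefficiency mentioned above: with 31 copies of every bit, the 30,000 usable bits of the example image carry fewer than 1,000 bits of actual message.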

Other examples of hiding data inside "noise" are more subtle. For example, suppose that you send an HTML page to a user's browser, but wherever you would have used spaces in the HTML source code, you use a " " some of the time and "&#32;" (which the browser will render as a space) some of the time. Secretly, you're using the alternate forms of representing a space to send a binary message in code, where " " stands for "0" and "&#32;" stands for "1". The problem is that with most Web pages, it doesn't matter whether a space is represented by a " " or a "&#32;". Thus, the choice between " " or "&#32;" is "noise" -- and the censoring proxy can overwrite it, changing all instances of "&#32;" to " ", defeating the code while hardly having any effect on normal users.
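The Python sketch below makes the point concrete. It assumes the two interchangeable forms are a literal space and the character code "&#32;"; the helper names are hypothetical.

```python
import re

def encode(words, bits):
    """Join words, writing a literal space for 0 and '&#32;' for 1."""
    source = words[0]
    for bit, word in zip(bits, words[1:]):
        source += ("&#32;" if bit else " ") + word
    return source

def decode(source):
    """Read the hidden bits back from the sequence of separators."""
    return [1 if sep == "&#32;" else 0 for sep in re.findall(r"&#32;| ", source)]

def censor_normalize(source):
    """What a censoring proxy could do: rewrite every '&#32;' as a plain space."""
    return source.replace("&#32;", " ")

words = "the quick brown fox jumps".split()
hidden = [1, 0, 1, 1]
page = encode(words, hidden)
print(decode(page))
print(decode(censor_normalize(page)))
```

After normalization every separator decodes as 0, so the hidden message is gone, while a browser renders both versions of the page identically.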

Issues with specific systems

HTTPS traffic

At first, HTTPS -- encrypted Web traffic -- would seem to be an ideal means to help end users defeat Internet censorship. It's widely used on the Web, so censors would be unlikely to block it entirely. Since traffic is encrypted, there would be no way for censors to determine what you were looking at. In fact, because traffic is encrypted, it would seem that there would be no way for the censors to distinguish an HTTPS circumventor site from a regular HTTPS site.

Unfortunately, there are several ways the censors could do this. The main problem is that in order for the circumvention system to be effective, a large number of volunteers outside the censored countries have to be running the circumvention software on their Web sites or on their computers (if the software were running only on a small number of centrally managed sites, then the censors would just block those). And it's prohibitively difficult and expensive for volunteers to make their machines look like "real HTTPS sites", from the point of view of a censor watching the traffic that flows to those machines.

When your browser connects to a typical HTTPS site, the server sends your browser a certificate that is "signed" by a signing authority, where the signing authority has verified that the site really is who it claims to be. (If you're not familiar with digital signatures, this and other concepts are given excellent coverage in books such as Bruce Schneier's Applied Cryptography.) In order for the browser to recognize the signature, the signing authority has to be on a list of recognized signing authorities that is built into the browser. In Internet Explorer 6, you can view the built-in list of signing authorities under Tools->Internet Options->Content->Certificates->Trusted Root Certification Authorities. If you want to run an HTTPS site that users can access without seeing a warning that says "Internet Explorer does not recognize this signing authority", it costs money to get your certificate signed by one of the signing authorities in that list. The cheapest one available is $50, for a certificate from InstantSSL, but even that would eliminate many volunteers who would not want to spend that much to run a circumventor site.

Alternatively, you could install a certificate not signed by any authority recognized by most browsers, or steal a certificate from another HTTPS-enabled site and use that one. These certificates will generate warning messages when viewed with the user's browser, saying either that the browser doesn't recognize the signing authority or that the site name doesn't match the name on the certificate. But in spite of these warnings, a pilfered or untrusted certificate will still encrypt all of your traffic properly, and a censor monitoring your activity cannot see what you're viewing.

The problem is that using such a flawed certificate will cause your site to stick out like a sore thumb, if the censors ever decide to scan through a list of HTTPS sites visited by users in their country (or, if the censors have enough resources, they can even scan the site while the user is attempting to connect to it, and block the connection if it looks suspicious). It's trivial for an automated script to connect to an HTTPS site and determine (a) whether the certificate is signed by a signing authority recognized by most browsers, and (b) whether the name of the site matches the site name on the certificate. Virtually all "real" HTTPS sites meet both of these criteria, because the site owners don't want their visitors to encounter scary warning messages from their browser before being asked to enter their credit card number. A connection to an HTTPS site that failed these tests could, at the very least, be blocked as suspicious, and at worst it could be regarded as an act of attempted circumvention and get the user who tried to visit the site in trouble.
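To give a sense of how trivial such a script is, here is a sketch using Python's standard ssl module. It relies on the default context, which checks both the signing chain and the site name; the function names and the return strings are invented for this example, and a censor's real tooling could of course look quite different.

```python
import socket
import ssl

def strict_context():
    """A TLS context that checks the signing authority and the site name,
    roughly as a browser would."""
    return ssl.create_default_context()

def certificate_status(host, port=443, timeout=5.0):
    """Return 'ok', 'suspicious: <reason>' or 'unreachable' for an HTTPS site."""
    context = strict_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Raised for an unrecognized signing authority and for a name mismatch.
        return f"suspicious: {exc.verify_message}"
    except OSError:
        return "unreachable"

# Example call (needs network access), with a placeholder hostname:
# print(certificate_status("www.example.com"))
```

Run over a log of HTTPS hosts visited from inside the country, a loop around this function would separate the sites with ordinary certificates from the ones that fail either test.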

Finally, note that even an HTTPS connection is potentially breakable, in the sense that an eavesdropper can make accurate guesses about what you were looking at, if they monitor your downloads and look for patterns indicating that you're accessing a "known bad" site. This danger is described in more detail here.

CGI scripts

CGIProxy is an example of a CGI script that a user can install on any machine that they are able to configure as a Web server. When you request a page through CGIProxy, every tag pointing to an image, frame, or other object that the browser requests separately, is altered to point through the CGIProxy script URL, so that the user's browser never connects directly to the site they're trying to access.

A weakness here is that if the censoring proxy detects a large number of accesses to the same CGI script in a short amount of time, that might be indicative of someone using a CGIProxy-like program to circumvent the proxy. There are very few other situations where a user's browser would send several requests to a single CGI script within the space of a few seconds.
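A detector for this pattern could be as simple as the following Python sketch. The log format, the threshold of five requests and the three-second window are all assumptions made for the example, and the script URL is a placeholder.

```python
from collections import defaultdict, deque

def flag_bursts(requests, threshold=5, window_seconds=3.0):
    """requests: (timestamp, user, script_url) tuples sorted by timestamp.
    Returns the (user, script_url) pairs that received at least `threshold`
    requests within any `window_seconds` span."""
    recent = defaultdict(deque)
    flagged = set()
    for timestamp, user, script_url in requests:
        times = recent[(user, script_url)]
        times.append(timestamp)
        while timestamp - times[0] > window_seconds:
            times.popleft()
        if len(times) >= threshold:
            flagged.add((user, script_url))
    return flagged

proxy_url = "http://www.somesite.example/cgi-bin/proxy.cgi"
search_url = "http://search.example/cgi-bin/search.cgi"
burst = [(i * 0.1, "user1", proxy_url) for i in range(6)]
normal = [(i * 10.0, "user2", search_url) for i in range(6)]
flagged = flag_bursts(sorted(burst + normal))
print(flagged)
```

Six requests to one script in half a second are flagged, while the same number of requests spaced ten seconds apart are not.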

One solution could be for CGIProxy not to load images by default, replacing each of them with an IMG tag that preserved the height and width attributes (so as not to break the layout of the page) but pointed nowhere. Most situations in which a browser makes several quick requests in succession, are due to multiple images being loaded on a page, and this would help avoid that problem. However, the user should still have a way to load a specific image if they need to, in order to navigate a page -- similar to how users with slow connections surf the Web with image loading turned off, but if they need to load an image to navigate a page, they can right-click on an image rectangle and select "Show picture". Unfortunately there does not seem to be a method in most browsers -- using JavaScript or any other way -- to create a custom entry on the right-click menu for an image on a page, so that a user could right-click on a blank image and select an option like "Allow CGIProxy to request and load this image". It may be possible to insert JavaScript code so that if a user right-clicks on an image twice, then the IMG url will be altered so that CGIProxy will load it for you. (Due to the non-intuitive nature of this interface, the instructions for loading images would have to be explained to the user in a banner inserted at the top of each page.)

Another possibility would be to install CGIProxy on a Web server in such a way that requests to CGIProxy do not look like requests being sent to a CGI script. On the Apache Web server, for example, it's possible to set up a CGI script in a directory, so that when the user loads the URL "http://www.somesite.com/somedir/sdf0923/8jj09we/", they are triggering a CGI script inside http://www.somesite.com/somedir/ and passing the data "sdf0923/8jj09we/" to that script. In this case, a series of URL requests in rapid succession would not look suspicious, because the browser might simply be loading an HTML page and then loading all the images and other objects referenced on that page. However, installing a CGI script in this manner is difficult, and may not be possible for some users if they don't have the permissions on their hosting machine to change the right Apache settings.

If you run the CGIProxy program on an HTTPS server, then a censor can't see any of the URLs that you're requesting -- only the name of the machine you're connecting to. (So if you were downloading the URL "https://www.somesite.com/foo/bar/", the censoring proxy would be able to see the "https://www.somesite.com/" part but not the "/foo/bar/" part.) In this case, the censoring proxy would only see several requests in a short space of time, which doesn't look suspicious since this happens every time the user loads a page that contains images, frames, or JavaScript files -- the censoring proxy has no way of knowing that all of the requests are going to the same CGI script. However, HTTPS has its own set of problems, discussed earlier.
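The split between what is and isn't visible can be expressed in a few lines of Python. This only restates the description above (machine name visible, path hidden for HTTPS); the function name is made up.

```python
from urllib.parse import urlsplit

def visible_to_censor(url):
    """Return the part of a URL that a censoring proxy can read,
    following the description above."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return f"https://{parts.hostname}/"   # the path is encrypted
    return url                                # plain HTTP: everything is visible

print(visible_to_censor("https://www.somesite.com/foo/bar/"))
print(visible_to_censor("http://www.somesite.com/foo/bar/"))
```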

Connecting to servers running on users' home machines

Since there are more users with high-speed home Internet connections than with personal Web sites where they can host CGI scripts, many proposals rely on volunteers outside the censored regimes installing Web servers or other kinds of server software on their home machines, so that censored users can connect to those machines and download banned content indirectly. In order for this to work, you have to make sure that there's no way for the censors to distinguish between connections to "real" Web servers, and connections to people's home machines.

One way that the censors could do this would be to compile a list of the address blocks assigned to major providers of high-speed Internet access such as AT&T Cable. If these address blocks are distinct from the address blocks assigned to Web sites hosted by the same company, then the censors can simply block connections to those.

The censors can also make guesses about which machines are probably home machines, by doing a reverse lookup on the IP address. The IP address of www.yahoo.com is 66.218.71.81; doing a reverse lookup on this address (in Windows, open a command prompt and type "ping -a 66.218.71.81") obtains the hostname "www.yahoo.akadns.net" -- not the one you started with, but not a "suspicious-looking" one either. However, if you do a reverse lookup on the IP address of my home machine, 216.254.27.44, you get "dsl254-027-044.sea1.dsl.speakeasy.net", which "looks like" the kind of hostname usually assigned to a user's home machine. The censors might use an algorithm that flagged access to any Web site as suspicious, if a reverse lookup on the site's IP address obtained a hostname that "looked like" a home user's machine, i.e. a hostname with lots of numbers and dashes near the beginning. (Doing a reverse lookup is slow, so the censoring proxy might not do this while the user is actually connecting to the site or else it would slow down everyone's Web access considerably, but the censors could keep a log of sites accessed and do the reverse lookups on them later.)
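As a rough sketch of such an algorithm, the Python code below flags hostnames that begin with a few letters followed by three or more groups of digits separated by dashes or dots. The pattern is only one guess at what "looks like a home machine" could mean, and the function names are hypothetical.

```python
import re
import socket

# Optional leading letters, then at least three digit groups joined by - or .
HOME_MACHINE_PATTERN = re.compile(r"^[a-z]*\d+(?:[-.]\d+){2,}")

def looks_like_home_machine(hostname):
    """Heuristic check on a hostname returned by a reverse lookup."""
    return bool(HOME_MACHINE_PATTERN.match(hostname.lower()))

def reverse_lookup(ip_address):
    """Return the hostname for an IP address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return None

for name in ("www.yahoo.akadns.net", "dsl254-027-044.sea1.dsl.speakeasy.net"):
    print(name, looks_like_home_machine(name))
```

Since the lookups are slow, a censor would more likely run reverse_lookup over yesterday's access log than over live traffic, as noted above.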

Fortunately, reverse lookup is not implemented at all hosting providers or at all ISPs. There are many Web sites, including www.peacefire.org, where a reverse lookup on the IP address (209.211.253.169) produces no hostname, and there are also some ISPs that do not provide reverse lookup for the hostnames of their customers' machines. So if you ran an HTTP server on a home machine whose IP address did not resolve to any hostname when doing a reverse lookup, this would not look too suspicious, because there are many Web sites whose IP addresses do not reverse to a hostname, either. However, you'd want to check this at the time that the user installed the Web server software on their home machine -- if the installation program detects that the user's IP address can be reversed to a "suspicious-looking" hostname with lots of numbers and dashes, then it tells the user that running the server software on their machine would not be safe.

Peer-to-peer circumvention systems

The popularity of systems like FreeNet and Gnutella, which host files on a worldwide "distributed cloud" of networked machines so that a file can't be censored unless you shut down every machine in the network, has raised questions about whether the same kind of "distributed cloud" system could be used to defeat Internet filtering. The idea is that a user in China would have a client that connects to a point in the cloud somewhere outside China, and uses that connection to request banned content which is passed back to the client. The client also knows about some other nodes in the cloud, so that if the node that the client is using is suddenly blocked by the Chinese firewall, the client can transparently start using some other node. The client doesn't have to know anything about the node that they're using, so volunteers all over the Internet can help fight censorship by installing the "circumventor" software on their machines and plugging into the "distributed cloud".

A "node" could be a machine at a specific IP address, and a user in China could connect to that node by opening a direct connection to some software running on that remote machine. But even if the circumvention method is tunneled over some other protocol such as email, HTTP or ICQ chat, you can still think of circumvention points as "nodes" in a "cloud". If the circumvention protocol is tunneled over email, then the "nodes" are email addresses outside of China that automatically respond to requests and send data back to the user (and the "firewall" in that scenario would be the Chinese ISP's mail server, which could block mail to addresses that are known to be helping people automatically access banned content). If the circumvention protocol is tunneled over HTTP, then the "nodes" are Web sites that handle the requests and send back banned data. In general, and regardless of the protocol, a node location is a unit of information such that (a) if you know the information about the node location, you can connect to it and request banned content; but (b) if the censor knows the information, they can block access to the node. For example, they can block you from connecting to a specific IP address, or block ICQ traffic to a particular ICQ username by dropping all packets containing the characters that spell that username.

The "node" is where the client connects to in order to send out its original request for data, whether by transmitting a packet, sending an email, or even posting to a forum on eBay -- even if the data is passed back to the client by some other means. (So if you send out a data request by posting a message in a particular forum on eBay, and the data is sent back to you as an ICQ message, then the "node" is the eBay forum, not the ICQ user that sent you the reply.)

The fundamental problem is that if your client software attempts to connect to a circumventor node, and the node is blocked so you switch to another one, you're already in trouble anyway because the censors detected that you connected to a known circumventor node. Even if you then start using some other node as a backup, just the fact that you were trying to use a circumventor node could cause you to lose your Internet access privileges, or worse.

The downside of peer-to-peer circumvention systems is that they make it too easy for the censors to infiltrate the peer-to-peer network, find other circumventor nodes, and block them or make a list of all the users who connect to them. The "upside" of peer-to-peer systems -- that they give you a backup in case the censors block the node that you're using -- is not worth the risk, because that temporary backup has so little value when you're still going to get in trouble for trying to circumvent the system.

The only time such peer-to-peer systems are safe at all, is when attempts to access a particular site cannot be traced back to a specific user, so all is not lost if you attempt to connect to a circumventor node that is blocked by the censors. But there are few situations where this is the case, now that even China is tracking all access by user, at least in the licensed Internet cafes.

The reason that peer-to-peer anti-censorship systems such as FreeNet and Gnutella cannot defeat Chinese censorship is that they're designed to solve a fundamentally different problem: FreeNet and Gnutella defeat censorship on the publishing end by making documents "un-censorable", but Chinese users need a system to defeat censorship on the receiving end. More specifically, systems like FreeNet -- in which a document is transparently distributed across multiple servers so that an adversary would have to shut down every machine in the network in order to make the document inaccessible -- only work for users in regimes where Internet access is uncensored by default. If an adversary were able to map out the majority of nodes in the FreeNet network, it wouldn't help them, because even with the law on their side, the effort required to shut down a particular node (by convincing the hosting ISP to shut it down, or else going to court) is so high. On the other hand, if the Chinese censors could map out a significant portion of the FreeNet network, they could then block every machine they know about -- the effort required for them to block each additional machine is essentially zero.

Soliciting suggestions for more attacks

If you have any other ideas for an attack -- either against an existing system or proposal, or a general type of attack that a circumvention system needs to take into account -- email me at bennett@peacefire.org.