Tab-Tab, Come in! Bypassing Internet blocking to categorize DPI devices

Motivation

During the past two years, we have been following the technical developments in Internet filtering in Uzbekistan and Turkmenistan. Internet users reported that some websites were blocked, that connections were reset, and that in some cases they were redirected to another website.

Apart from our technical curiosity about how Internet filtering is implemented at large scale and how it can be circumvented, we are interested in providing technical facts about how blocking takes place. The lack of technical information in this field has an interesting collateral effect on Internet users: it helps rumours spread and consolidates the idea that governments have unlimited capabilities, that they monitor all communications, and that ultimately nothing can be done to stop their control. Understanding how Internet filtering takes place is an important task, as it allows us to discover mechanisms to bypass it, and to fingerprint and ultimately expose the companies that provide technology and know-how to governments that want to suppress the free flow of information on the Net.

Our findings

The general Internet blocking architecture in Turkmenistan and Uzbekistan works in two basic stages. The first stage identifies which web/HTTP sessions are trying to reach a blacklisted website. Once a positive match has taken place, the second stage implements the blocking or filtering. We will refer to the first stage as “matching” and the second as “blocking”.

Matching

During our technical investigation we found that a key element in the matching stage is the “Host header” of an HTTP session. Once the initial TCP handshake between the user and the destination webserver is complete, the browser sends an HTTP request with content like:

GET / HTTP/1.1
Host: www.youtube.com

This first packet tells the webserver that we are requesting the homepage of the website www.youtube.com.

The Host header was introduced to specify which domain is being requested in the common scenario where a single IP address serves several websites (think of shared/virtual hosting). The Host header is optional in HTTP/1.0 but mandatory in HTTP/1.1. According to the standards, the GET request line and the Host header must each end with a carriage return character followed by a line feed character <CR><LF>.
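To make the line-ending rule concrete, here is a minimal sketch of how such a request could be assembled byte by byte. The helper name build_request is ours and purely illustrative; it is not part of any tool mentioned in this post.

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request where every line ends with CRLF."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line
        f"Host: {host}",         # mandatory in HTTP/1.1
        "",                      # empty line that terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    # repr() makes the CR and LF characters visible
    print(repr(build_request("www.youtube.com")))
```

Printing the repr of the result shows the <CR><LF> pair after the request line and after the Host header, followed by the empty line that ends the header block.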

Regardless of which type of interception technology is used, matching takes place by observing the HTTP Host header field. There are some good technical reasons to implement matching this way: the filtering gear does not need to keep track of IP addresses, it can operate in stealth mode because it does not need to perform any name resolution, and inspecting packets for this header is not as technically challenging as keeping track of the content of full HTTP sessions.

This matching method is simple, and it is even more effective given the current trend of webservers and websites forcing users to use the correct domain name when requesting resources from the server.

Filtering

Once the matching is positive, blocking can take place if desired. We have identified three basic methods used to block a session: Reset, Proxying and Active Redirection.

Reset

In the reset method, two RST packets are sent, one to each end of the communication. By injecting these two packets between the communicating parties, both the Internet user/browser and the blacklisted webserver are instructed to terminate the connection. This blocking method is present in Turkmenistan, where we could reliably trigger RST traffic by placing a blacklisted domain name in the Host header.

To verify whether the traffic between the browser and the blocked website was fully blocked, we decided to ignore RST packets at both ends of the communication. The experiment showed that it was possible to visit the blocked website, so the interception gear was not blocking all traffic. Unfortunately, ignoring all RST packets is not generally possible, which makes this method of bypassing the blocking impractical.

There are some good technical reasons why filtering is implemented using RST packets. The filtering gear does not need to keep track of each session (it is stateless), and the blocking can take place without major changes to an existing network infrastructure. The fact that reset blocking limits itself to injecting bogus traffic without modifying legitimate traffic facilitates its implementation and scalability.

An interesting collateral effect of RST blocking is passive redirection. For backward compatibility with HTTP/1.0, when a browser receives a RST packet instructing it to close the connection, it initiates a new connection to the same website, this time without the Host header. On receiving the RST, the browser believes it is communicating with an old webserver that does not support HTTP/1.1 and hence initiates an HTTP/1.0 session. This session will not be “matched” by the interception gear and will not be blocked. The website receiving the HTTP/1.0 connection will then send a redirection to the default domain hosted at its IP address. For example, an HTTP/1.0 connection to youtube.com is answered with a redirection to google.com. The user might believe that the redirection is implemented by the filtering infrastructure when in fact it is a collateral result of the RST traffic.

Proxying

The second mechanism to block websites is by means of semi- or fully transparent proxies. In both cases the HTTP proxy is placed between the users and the Internet, and all outgoing web connections are intercepted by the proxy. Depending on the technology used, the proxy can hide its presence from the end-user (semi-transparent), from the webserver, or from both (fully transparent). The most common transparent proxy implementation is the semi-transparent one; it intercepts the HTTP requests from the users and places the outbound connection on the users’ behalf but does not hide the web proxy IP address from the webserver, i.e. the webserver can see that the connections do not come from the user directly but from the IP address of the proxy.

A fully transparent web proxy hides both from the user and the webserver that the sessions are intercepted. The webserver cannot determine if the connection comes directly from the end-user or the proxy itself. Semi-transparent proxies are commonly used in combination with caching to speed up connections while fully transparent proxies are typically implemented for surveillance.

In either case, the basic matching is performed by looking at the Host header. Unlike the RST blocking mechanism, the web proxy has full access to the data stream and can perform more complex string matching, for example blocking certain articles or arbitrary pages based on keywords. The transparent web proxy can perform more complex filtering but requires the processing of all web requests in the infrastructure.

This blocking method is present in Uzbekistan and Turkmenistan where we could effectively identify proxy headers when requesting blocked websites.

Active redirection

In the same way that RST packets can be injected into the data stream to tear down a connection, we have seen another type of injected traffic directed at the client. In this scenario the interception gear sends an HTTP 302 redirection packet informing the browser that the website has been “Moved” to another location.

We could determine that this type of redirection is performed without a full web proxy implementation and without the need to keep any session state. As in the case of RST blocking, once the Host header is matched, a packet with a redirection to www.msn.com is sent to the client.

GET / HTTP/1.1
User-Agent: TLT/2.6.4 (linux-gnu)
Accept: */*
Host: www.uznews.net
Connection: Keep-Alive 

HTTP/1.1 302 Found
Location: http://www.msn.com/
Content-Type: text/html; charset=utf-8
Content-Length: 136
Connection: close

<html><head><title>Object moved</title></head><body> <h2>Object moved to <a
href='http://www.msn.com/'>here</a>.</h2> </body></html>

This redirection method is highly scalable, and users are aware that blocking is taking place.
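As a rough illustration of why no session state is needed, a response like the one captured above can be produced by a pure function of the redirect target alone. This is a sketch under our own assumptions: the function name build_redirect is hypothetical, the body whitespace is simplified (so the length will differ from the captured Content-Length), and we do not know how the actual gear builds its packets.

```python
def build_redirect(location: str) -> bytes:
    """Build a 302 response that depends only on the redirect target."""
    # Body modelled on the captured response; whitespace is simplified here.
    body = (
        "<html><head><title>Object moved</title></head><body> "
        f"<h2>Object moved to <a href='{location}'>here</a>.</h2> "
        "</body></html>"
    ).encode("utf-8")
    headers = [
        "HTTP/1.1 302 Found",
        f"Location: {location}",
        "Content-Type: text/html; charset=utf-8",
        f"Content-Length: {len(body)}",  # computed from the body, not hard-coded
        "Connection: close",
    ]
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + body


if __name__ == "__main__":
    print(build_redirect("http://www.msn.com/").decode("utf-8"))
```

Because nothing in the function refers to the client or to earlier packets, the same bytes can be sent to any session in which the Host header matched.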

How did we manage to bypass the blocking?

During our investigation we discovered that it is possible to bypass the filtering by tampering with the Host header so that the matching stage is not triggered.

The basic idea is to use the \t (TAB) and \n (linefeed) characters in the basic HTTP request headers. We tested and verified that webservers sanitize the request headers, so appending a tab character at the end of the Host header has no impact on the webserver side but bypasses the detection/matching phase of the blocking gear.

So instead of sending Host: www.youtube.com\n our requests look like Host: www.youtube.com\t\n.

We also discovered that the Active Redirection method could be bypassed by prepending a linefeed to the GET request line. So instead of sending GET / HTTP/1.1, our requests look like \nGET / HTTP/1.1.
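The two tricks can be sketched with a toy model. Both functions below are hypothetical: strict_match stands in for a matcher that compares bytes exactly (we do not know how the real gear is implemented), and lenient_host stands in for a webserver that skips leading empty lines and trims whitespace around header values.

```python
from typing import Optional

BLOCKED_HOST = b"www.youtube.com"  # example hostname used in this post


def strict_match(payload: bytes) -> bool:
    """Hypothetical matcher: payload must start with 'GET ' and contain an exact Host line."""
    lines = payload.replace(b"\r\n", b"\n").split(b"\n")
    return lines[0].startswith(b"GET ") and (b"Host: " + BLOCKED_HOST) in lines


def lenient_host(payload: bytes) -> Optional[bytes]:
    """Hypothetical webserver parsing: ignore leading blank lines, trim whitespace."""
    for line in payload.lstrip(b"\r\n").splitlines():
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            return value.strip()
    return None


plain = b"GET / HTTP/1.1\r\nHost: www.youtube.com\r\n\r\n"
with_tab = b"GET / HTTP/1.1\r\nHost: www.youtube.com\t\r\n\r\n"  # tab appended to Host
with_newline = b"\n" + plain                                      # linefeed prepended

if __name__ == "__main__":
    for name, req in [("plain", plain), ("tab", with_tab), ("newline", with_newline)]:
        print(name, "matched:", strict_match(req), "server sees:", lenient_host(req))
```

In this model only the untouched request is matched, while the lenient parser recovers the same hostname from all three variants. That gap between strict matching and forgiving parsing is the behaviour the bypass relies on.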

OONI Testing

From inside TM and UZ we ran a set of OONI Probe tests to collect data on how the filtering was happening and to confirm our hypotheses about censorship bypassing strategies.

Inside of Turkmenistan we ran the following tests:

HTTP Header Field Manipulation

https://ooni.org/reports/0.1/TM/http_header_field_manipulation-2013-01-28T195727Z-AS20661.yamloo

Through this test we were able to determine that no extra HTTP headers were being appended to our outgoing requests. As we can see, the response body contains the same headers that were sent by the client.

HTTP Invalid Request Line test

https://ooni.org/reports/0.1/TM/http_invalid_request_line-2013-01-30T214938Z-AS20661.yamloo

Through this test we were able to determine that the DPI device does not misbehave when receiving an invalid request line. This leads us to believe that the device in question is not a Bluecoat device, since such devices usually crash on ‘test_random_invalid_version_number’ and ‘test_random_invalid_field_count’, as is shown in the measurements done in Myanmar (https://ooni.org/reports/0.1/MM/http_invalid_request_line-2012-12-06T162217Z-AS18399.yamloo).

HTTP Host test

https://ooni.org/reports/0.1/TM/http_host-2013-01-30T224041Z-AS20661.yamloo

Through this test we were able to confirm that the filtering bypassing strategies described above do indeed work, as is shown by ‘test_filtering_prepend_newline_to_method’ and ‘test_filtering_add_tab_to_host’. We were also able to determine that filtering also occurs on subdomains of the target hostname, as is shown by ‘test_filtering_of_subdomain’. Note: the ‘response_never_received’ error code is due to a TCP RST.

Inside of Uzbekistan we ran the same tests and obtained very similar results:

HTTP Header Field Manipulation

https://ooni.org/reports/0.1/UZ/http_header_field_manipulation-2013-01-29T222932Z-AS31203.yamloo

HTTP Invalid Request Line test

https://ooni.org/reports/0.1/UZ/http_invalid_request_line-2013-02-02T183110Z-AS31203.yamloo

HTTP Host test

https://ooni.org/reports/0.1/UZ/http_host-2013-02-02T183406Z-AS31203.yamloo

Here, too, censorship of subdomains is occurring, though the filtering strategy is to reply with a 302 redirect to ‘http://www.msn.com’.

Such results are particularly interesting because we are now able to test for these bypassing strategies in the future. If censorship can be circumvented by means of these two strategies, we can infer that the DPI device we are facing may be similar to these ones. By continuing to do this, we will develop a heuristic for categorizing DPI devices.
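A very rough sketch of what such a heuristic could look like is given below. The test names are the ones from the HTTP Host test reports; the function classify_device, its input format and the labels it returns are our own illustrative assumptions, not an existing OONI feature.

```python
from typing import Dict


def classify_device(bypass_works: Dict[str, bool]) -> str:
    """Map bypass test outcomes (True = the bypass worked) to a coarse, hypothetical label."""
    tab = bypass_works.get("test_filtering_add_tab_to_host", False)
    newline = bypass_works.get("test_filtering_prepend_newline_to_method", False)
    if tab and newline:
        return "similar to the devices seen in TM and UZ"
    if tab or newline:
        return "partially similar to the devices seen in TM and UZ"
    return "unknown"


if __name__ == "__main__":
    observed = {
        "test_filtering_add_tab_to_host": True,
        "test_filtering_prepend_newline_to_method": True,
    }
    print(classify_device(observed))
```

A real heuristic would need more signals than these two, for example the error messages triggered by invalid request lines, but the shape stays the same: each bypass that works narrows down the family of devices.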

Something similar was done in the case of Bluecoat in Burma, where we discovered that sending an invalid HTTP method in the request line would cause a Bluecoat device to output an error message. The same happens, though with a different error message, when sending an invalid number of parts in the HTTP request line.

Lessons learned

We learned that censorship software is not perfect and that sometimes it can be trivially bypassed. The fact that it can be bypassed is of little interest to those who deploy such software: they are mainly interested in affecting a large portion of their userbase, and if a select few are able to access restricted content it is generally not a problem for them. This means that they accept an imbalance in knowledge amongst their population, which strengthens social inequality.

What is of interest to the OONI project is that by detecting such imperfections in filtering software we are able to classify the kind of product being used. The next time we see a device that can be bypassed by appending tab characters to the Host header field, we can infer that it may be the same one used in UZ and TM.

You can help us detect censorship by running ooni-probe. For more details, see the install guide.