Web Caching
By Duane Wessels
June 2001

Chapter 5
Interception Proxying and Caching

Contents:

Overview
The IP Layer: Routing
The TCP Layer: Ports and Delivery
The Application Layer: HTTP
Debugging Interception
Issues
To Intercept or Not To Intercept

As we discussed in Chapter 4, "Configuring Cache Clients", one of the most difficult problems you might face in deploying a web caching service is getting users to use your cache. In some cases, the problem is mostly political; users might resist caching because of privacy concerns or fears they will receive stale information. But even if users are convinced to use the cache -- or have no choice -- administrative hurdles may still be a problem. Changing the configuration of thousands of installed clients is a daunting task. For ISPs, the issue is slightly different -- they have little or no control over their customers' browser configurations. An ISP can provide preconfigured browsers to their customers, but that doesn't necessarily ensure that customers will continue to use the caching proxy.

Because of problems such as these, interception caching has become very popular recently. The fundamental idea behind interception caching (or proxying) is to bring traffic to your cache without configuring clients. This is different from a technique such as WPAD (see "Web Proxy Auto-Discovery"), whereby clients automatically locate a nearby proxy cache. Rather, your clients initiate TCP connections directly to origin servers, and a router or switch on your network recognizes HTTP traffic and redirects it to your cache. Web caches require only minor modifications to process requests received in this manner.

As wonderful as this may sound, a number of issues surround interception caching. Interception caching breaks the rules of the Internet Protocol. Routers and switches are supposed to deliver IP packets to their intended destination. Diverting web traffic to a cache is similar to a postal service that opens your mail and reads it before deciding where to send it or whether it needs to be sent at all.[1] The phrase connection hijacking is often used to describe interception caching, as a reminder that it violates the Internet Protocol standards. Interception also leads to problems with HTTP. Clients may not send certain headers, such as Cache-control, when they are unaware of the caching proxy.

[1] Imagine how much work the postal service could avoid by not delivering losing sweepstakes entries. Imagine how upset Publishers Clearing House would be if they did!

Interception proxies are also known as transparent proxies. Even though the word ''transparent'' is very common, it is a poor choice for several reasons. First of all, ''transparent'' doesn't really describe the function. We hope that users remain unaware of interception caches, and all web caches for that matter. However, interception proxies are certainly not transparent to origin servers. Furthermore, interception proxies are known to break both HTTP and IP interoperability. Another reason is that RFC 2616 defines a transparent proxy to mean something different. In particular, it states, "A `transparent proxy' is a proxy that does not modify the request or response beyond what is required for proxy authentication and identification." Thus, to remain consistent with documents produced by the IETF Web Replication and Caching working group, I use the term interception caching.

In this chapter, we'll explore how interception caching works and the issues surrounding it. The technical discussion is broken into three sections, corresponding to different networking layers. We start near the bottom, with the IP layer. As packets traverse the network, a router or switch diverts HTTP packets to a nearby proxy cache. At the TCP layer, we'll see how the diverted packets are accepted, possibly modified, and then sent to the application. Finally, at the application layer, the cache uses some simple tricks to turn the original request into a proxy-HTTP request.

Overview

Figure 5-1. Interception proxying schematic diagram

Figure 5-1 shows a logical diagram of a typical interception proxy setup. A client opens a TCP connection to the origin server. As the packet travels from the client towards the server, it passes through a router or a switch. Normally, the TCP/IP packets for the connection are sent to the origin server, as shown by the dashed line. With interception proxying, however, the router/switch diverts the TCP/IP packets to the cache.

Two techniques are used to deliver the packet to the cache. If the router/switch and cache are on the same subnet, the packet is simply sent to the cache's layer two (i.e., Ethernet) address. If the devices are on different subnets, then the original IP packet gets encapsulated inside another packet that is then routed to the cache. Both of these techniques preserve the destination IP address in the original IP packet, which is necessary because the cache pretends to be the origin server.

The interception cache's TCP stack is configured to accept ''foreign'' packets. In other words, the cache pretends to be the origin server. When the cache sends packets back to the client, the source IP address is that of the origin server. This tricks the client into thinking it's connected to the origin server.

At this point, an interception cache operates much like a standard proxy cache, with one important difference. The client believes it is connected to the origin server rather than a proxy cache, so its HTTP request is a little different. Unfortunately, this difference is enough to cause some interoperability problems. We'll talk more about this in "Issues".

You might wonder how the router or switch decides which packets to divert. How does it know that a particular TCP/IP packet is for an HTTP request? Strictly speaking, nothing in a TCP/IP header identifies a packet as HTTP (or as any other application-layer protocol, for that matter). However, convention dictates that HTTP servers usually run on port 80. This is the best indicator of an HTTP request, so when the device encounters a TCP packet with a source or destination port equal to 80, it assumes that the packet is part of an HTTP session. Indeed, probably 99.99% of all traffic on port 80 is HTTP, but there is no guarantee. Some non-HTTP applications are known to use port 80 because they assume firewalls would allow those packets through. Also, a small number of HTTP servers run on other ports (see Table A-9). Some devices may allow you to divert packets for these ports as well.
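
To make the heuristic concrete, here is a minimal Python sketch (my own illustration, not from any product) of the test a redirection device effectively applies: check that the packet is TCP and that either port number is 80. It assumes a plain IPv4 packet.

import struct

def looks_like_http(ip_packet: bytes) -> bool:
    # The low four bits of the first byte give the IP header length
    # in 32-bit words; the TCP header starts right after it.
    ihl = (ip_packet[0] & 0x0F) * 4
    if ip_packet[9] != 6:          # protocol field: 6 means TCP
        return False
    # Source and destination ports are the first two TCP header fields.
    src_port, dst_port = struct.unpack_from("!HH", ip_packet, ihl)
    return src_port == 80 or dst_port == 80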

Note that interception caching works anywhere that HTTP traffic is found: close to clients, close to servers, and anywhere in between. Before interception, clients had to be configured to use caching proxies. This meant that most caches were located close to clients. Now, however, we can put a cache anywhere and divert traffic to it. Clients don't need to be told about the proxy. Interception makes it possible to put a cache, or surrogate, close to origin servers. Interception caches can also be located on backbone networks, although I and others feel this is a bad idea, for reasons I'll explain later in this chapter. Fortunately, it's not very common.

The IP Layer: Routing

Interception caching begins at the IP (network) layer, where all sorts of IP packets are routed between nodes. Here, a router or switch recognizes HTTP packets and diverts them to a cache instead of forwarding them to their original destination. There are a number of ways to accomplish the interception:

Inline

An inline cache is a device that combines both web caching and routing (or bridging) into a single piece of equipment. Inline caches usually have two or more network interfaces. Products from Cacheflow and Network Appliance can operate in this fashion, as can Unix boxes running Squid.

Layer four switch

Switching is normally a layer two (datalink layer) activity. A layer four switch, however, can make forwarding decisions based on upper layer characteristics, such as IP addresses and TCP port numbers. In addition to HTTP redirection, layer four switches are also often used for server load balancing.

Web Cache Coordination Protocol

WCCP is an encapsulation protocol developed by Cisco Systems that requires implementation in both a router (or maybe even a switch) and the web cache. Cisco has implemented two versions of WCCP in their router products; both are openly documented as Internet Drafts. Even so, use of the protocol in a product may require licensing from Cisco.

Cisco policy routing

Policy routing refers to a router's ability to make forwarding decisions based on more than the destination address. We can use this to divert packets based on destination port numbers.

Inline Caches

An inline cache is a single device that performs both routing (or bridging) and web caching. Such a device is placed directly in the network path so it captures HTTP traffic passing through it. HTTP packets are processed by the caching application, while other packets are simply routed between interfaces.

Not many caching products are designed to operate in this manner. An inline cache is a rather obvious single point of failure. Let's face it, web caches are relatively complicated systems and therefore more likely to fail than a simpler device, such as an Ethernet switch. Most caching vendors recommend using a third-party product, such as a layer four switch, when customers need high reliability.

You can build an inexpensive inline cache with a PC, FreeBSD or Linux, and Squid. Any Unix system can route IP packets between two or more network interfaces. Add to that a web cache plus a little packet redirection (as described in "The TCP Layer: Ports and Delivery"), and you've got an inline interception cache. Note that such a system does not have very good failure-mode characteristics. If the system goes down, it affects all network traffic, not just caching. If Squid goes down, all web traffic is affected. It should be possible, however, to develop some clever scripts that monitor Squid's status and alter the packet redirection rules if necessary.
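
As a rough illustration, here is a hypothetical Python sketch of such a monitor. It probes Squid with a TCP connect and removes or reinstalls the redirection rule accordingly; the rule matches the ipchains example shown later in this chapter, and the addresses and ports are assumptions to adapt to your setup.

import socket
import subprocess
import time

SQUID = ("127.0.0.1", 3128)        # where Squid listens; an assumption

def squid_alive() -> bool:
    # A TCP connect is a crude but effective liveness test.
    try:
        socket.create_connection(SQUID, timeout=5).close()
        return True
    except OSError:
        return False

redirecting = True
while True:
    if redirecting and not squid_alive():
        # Delete the REDIRECT rule, so port 80 traffic is simply
        # routed instead of being diverted to a dead Squid.
        subprocess.run(["/sbin/ipchains", "-D", "input", "-p", "tcp",
                        "-s", "0/0", "-d", "0/0", "80",
                        "-j", "REDIRECT", "3128"])
        redirecting = False
    elif not redirecting and squid_alive():
        # Squid is back; reinstall the rule.
        subprocess.run(["/sbin/ipchains", "-A", "input", "-p", "tcp",
                        "-s", "0/0", "-d", "0/0", "80",
                        "-j", "REDIRECT", "3128"])
        redirecting = True
    time.sleep(10)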

InfoLibria has a product that fits best in the inline category. They actually use two tightly coupled devices to accomplish inline interception caching. The DynaLink is a relatively simple device that you insert into a 100BaseT Ethernet segment. Two of the DynaLink's ports are for the segment you are splitting. The other two deliver packets to and from an attached DynaCache. The DynaLink is a layer one (physical) device. It does not care about Ethernet (layer two) packets or addresses. When the DynaLink is on, electromechanical switches connect the first pair of ports to the second. The DynaCache then skims the HTTP packets off to the cache while bridging all other traffic through to the other side. If the DynaLink loses power or detects a failure of the cache, the electromechanical switches revert to the passthrough position.

If you want to use an inline caching configuration, carefully consider your reliability requirements and the failure characteristics of individual products. If you choose an inexpensive computer system and route all your traffic through it, be prepared for the fact that a failed disk drive, network card, or power supply can totally cut off your Internet traffic.

Layer Four Switches

Recently, a new class of products known as layer four switches[2] has become widely available. The phrase ''layer four'' refers to the transport layer of the OSI reference model; it indicates the switch's ability to forward packets based on more than just IP addresses. Switches generally operate at layer two (datalink) and don't know or care about IP addresses, let alone transport layer port numbers. Layer four switches, on the other hand, peek into TCP/IP headers and make forwarding decisions based on TCP port numbers, IP addresses, etc. These switches also have other intelligent features, such as load balancing and failure detection. When used for interception caching, a layer four switch diverts HTTP packets to a connected web cache. All of the other (non-HTTP) traffic is passed through.

[2] You might also hear about ''layer seven'' or ''content routing'' switches. These products have additional features for looking even deeper into network traffic. Unfortunately, there is no widely accepted term to describe all the smart switching products.

Layer four switches also have very nice failure detection features (a.k.a. health checks). If the web cache fails, the switch simply disables redirection and passes the traffic through normally. Similarly, when the cache comes back to life, the switch once again diverts HTTP packets to it. Layer four switches can monitor a device's health in a number of ways, including the following:

ARP

The switch makes Address Resolution Protocol (ARP) requests for the cache's IP address. If the cache doesn't respond, then it is probably powered off, disconnected, or experiencing some other kind of serious failure.

Of course, this only works for devices that are ''layer two attached'' to the switch. The cache might be on a different subnet, in which case ARP is not used.

ICMP echo

ICMP echo requests, a.k.a. ''pings,'' also test the cache's low-level network configuration. As with ARP, ICMP tells the switch if the cache is on the network. However, ICMP can be used when the switch is on a different subnet.

ICMP round-trip time measurements can sometimes provide additional health information. If the cache is overloaded, or the network is congested, the time between ICMP echo and reply may increase. When the switch notices such an increase, it may send less traffic to the cache.

TCP

ARP and ICMP simply tell the switch that the cache is on the network. They don't, for example, indicate that the application is actually running and servicing requests. To check this, the switch sends connection probes to the cache. If the cache accepts the connection, that's a good indication that the application is running. If, however, the cache's TCP stack generates a reset message, the application cannot handle real traffic.

HTTP

In some cases, even establishing a TCP connection is not sufficient evidence that the cache is healthy. A number of layer four/seven products can send the cache a real HTTP request and analyze the response. For example, unless the HTTP status code is 200 (OK), the switch marks the cache as ''down.'' A sketch of such a probe appears after this list.

SNMP

Layer four switches can query the cache with SNMP. This can provide a variety of information, such as recent load, number of current sessions, service times, and error counts. Furthermore, the switch may be able to receive SNMP traps from the cache when certain events occur.
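
To illustrate, here is a minimal Python sketch of the TCP and HTTP probes just described. The cache address, test URL, and timeouts are assumptions; real switches implement these checks in firmware with far more options.

import http.client
import socket

CACHE = ("10.1.2.3", 3128)         # assumed cache address and port

def tcp_check() -> bool:
    # Layer four probe: does the cache accept connections at all?
    try:
        socket.create_connection(CACHE, timeout=2).close()
        return True
    except OSError:
        return False

def http_check() -> bool:
    # Layer seven probe: send a real request and require a 200 reply.
    try:
        conn = http.client.HTTPConnection(CACHE[0], CACHE[1], timeout=5)
        conn.request("GET", "http://www.ircache.net/",
                     headers={"Host": "www.ircache.net"})
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except (OSError, http.client.HTTPException):
        return False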

Load balancing is another useful feature of layer four switches. When the load placed on your web caches becomes too large, you can add additional caches and the switch will distribute the load between them. Switches often support numerous load balancing techniques, not all of which are necessarily good for web caches:

Round-robin

A counter is kept for each cache and incremented for every connection sent to it. The next request is sent to the cache with the lowest counter.

Least connections

The switch monitors the number of active connections per cache. The next request is sent to the cache with the fewest active connections.

Response time

The switch measures the response time of each cache, perhaps based on the time it takes to respond to a connection request. The next request is sent to the cache with the smallest response time.

Packet load

The switch monitors the number of packets traversing each cache's network port. The next request is sent to the cache with the lowest packet load.

Address hashing

The switch computes a hash function over the client and/or server IP addresses. The hash function returns an integer value, which is then reduced modulo the number of caches. Thus, if I have three caches, each cache receives requests for one-third of the IP address space.

URL hashing

Address hashing doesn't always result in a well-balanced distribution of load. Some addresses may be significantly more popular than others, causing one cache to receive more traffic than the others. URL hashing is more likely to spread the load evenly. However, it also requires more memory and CPU capacity.

With address hashing, the forwarding decision can be made upon receipt of the first TCP packet. With URL hashing, the decision cannot be made until the entire URL has been received. Usually, URLs are quite small, less than 100 bytes, but they can be much larger. If the switch doesn't receive the full URL in the first data packet, it must store the incomplete URL and wait for the remaining piece.

If you have a cluster of caches, URL hashing or destination address hashing are the best choices. Both ensure that the same request always goes to the same cache. This partitioning maximizes your hit ratio and your disk utilization because a given object is stored in only one cache. The other techniques are likely to spread requests around randomly so that, over time, all of the caches come to hold the same objects. We'll talk more about cache clusters in Chapter 9, "Cache Clusters".
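
To make the hashing schemes concrete, here is a minimal Python sketch; the MD5 hash and the three-cache cluster are arbitrary illustrations, not what any particular switch implements.

import hashlib

caches = ["cache1", "cache2", "cache3"]    # an assumed three-cache cluster

def pick_by_address(dst_ip: str) -> str:
    # Destination address hashing: a given origin server always maps
    # to the same cache, so each object is stored only once.
    h = hashlib.md5(dst_ip.encode()).digest()[0]
    return caches[h % len(caches)]

def pick_by_url(url: str) -> str:
    # URL hashing: balances better when a few servers dominate, but
    # the switch must wait until it has seen the whole URL.
    h = hashlib.md5(url.encode()).digest()[0]
    return caches[h % len(caches)]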

Table 5-1 lists switch products and vendors that support layer four redirection. The Linux Virtual Server is an open source solution for turning a Linux box into a redirector and/or load balancer.

Table 5-1. Switches and Products That Support Web Redirection

Vendor                        Product Line           Home Page
Alteon (bought by Nortel)     AceSwitch              http://www.alteonwebsystems.com/
Arrowpoint (bought by Cisco)  Content Smart Switch   http://www.arrowpoint.com/
Cisco                         Local Director         http://www.cisco.com/
F5 Labs                       Big/IP                 http://www.f5.com/
Foundry                       ServerIron             http://www.foundrynet.com/
Linux Virtual Server          LVS                    http://www.linuxvirtualserver.org/
Radware                       Cache Server Director  http://www.radware.com/
Riverstone Networks           Web Switch             http://www.riverstonenet.com/

These smart switching products have many more features than I've mentioned here. For additional information, please visit the products' home pages or http://www.lbdigest.com/.

WCCP

Cisco invented WCCP to support interception caching with their router products. At the time of this writing, Cisco has developed two versions of WCCP. Version 1 has been documented within the IETF as an Internet Draft. The most recent version is dated July 2000. It's difficult to predict whether the IETF will grant any kind of RFC status to Cisco's previously proprietary protocols. Regardless, most of the caching vendors have already licensed and implemented WCCPv1. Some vendors are licensing Version 2, which was also recently documented as an Internet Draft. The remainder of this section refers only to WCCPv1, unless stated otherwise.

WCCP consists of two independent components: the control protocol and traffic redirection. The control protocol is relatively simple, with just three message types: HERE_I_AM, I_SEE_YOU, and ASSIGN_BUCKETS. A proxy cache advertises itself to its home router with the HERE_I_AM message. The router responds with an I_SEE_YOU message. The two devices continue exchanging these messages periodically to monitor the health of the connection between them. Once the router knows the cache is running, it can begin diverting traffic.

As with layer four switches, WCCP does not require that the proxy cache be connected directly to the home router. Since there may be additional routers between the proxy and the home router, diverted packets are encapsulated with GRE (Generic Routing Encapsulation, RFC 2784). WCCP is hardcoded to divert only TCP packets with destination port 80. The encapsulated packet is sent to the proxy cache. Upon receipt of the GRE packet, the cache strips off the encapsulation headers and pretends the TCP packet arrived there normally. Packets flowing in the reverse direction, from the cache to the client, are not GRE-encapsulated and don't necessarily flow through the home router.

WCCP supports cache clusters and load balancing. A WCCP-enabled router can divert traffic to many different caches.[3] In its I_SEE_YOU messages, the router tells each cache about all the other caches. The one with the lowest numbered IP address nominates itself as the designated cache. The designated cache is responsible for coming up with a partitioning scheme and sending it to the router with an ASSIGN_BUCKETS message. The buckets -- really a lookup table with 256 entries -- map hash values to particular caches. In other words, the value for each bucket specifies the cache that receives requests for the corresponding hash value. The router calculates a hash function over the destination IP address, looks up the cache index in the bucket table, and sends an encapsulated packet to that cache. The WCCP documentation is vague on a number of points. It does not specify the hash function, nor how the designated cache should divide up the load. WCCPv1 can support up to 32 caches associated with one router.

[3] Each cache can have only one home router, however.
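
The bucket lookup is easy to picture in code. Here is a Python sketch; since WCCPv1 specifies neither the hash function nor the assignment policy, the MD5 hash and the even split below are stand-ins of my own.

import hashlib

caches = ["10.1.2.3", "10.1.2.4"]      # learned from HERE_I_AM messages

# The designated cache fills this 256-entry table and sends it to the
# router in an ASSIGN_BUCKETS message; an even split is assumed here.
buckets = [i % len(caches) for i in range(256)]

def divert_target(dst_ip: str) -> str:
    # Hash the destination address into one of 256 buckets.
    bucket = hashlib.md5(dst_ip.encode()).digest()[0]
    return caches[buckets[bucket]]

# The matching packet is then GRE-encapsulated and sent to this cache.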

WCCP also supports failure detection. The cache sends HERE_I_AM messages every 10 seconds. If the router does not receive at least one HERE_I_AM message in a 30-second period, the cache is marked as unusable. Requests are not diverted to unusable caches. Instead, they are sent along the normal routing path towards the origin server. The designated cache can choose to reassign the unusable cache's buckets in a future ASSIGN_BUCKETS message.
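
A sketch of the router's bookkeeping for this timeout, using the 10- and 30-second intervals just described (the data structure is my own illustration, not anything specified by WCCP):

import time

last_here_i_am = {}        # cache address -> time of last HERE_I_AM

def receive_here_i_am(cache: str) -> None:
    last_here_i_am[cache] = time.monotonic()

def usable(cache: str) -> bool:
    # No HERE_I_AM for 30 seconds marks the cache unusable; its
    # buckets may then be reassigned by the designated cache.
    return time.monotonic() - last_here_i_am.get(cache, float("-inf")) < 30.0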

WCCPv1 is supported in Cisco's IOS versions 11.1(19)CA, 11.1(19)CC, 11.2(14)P, and later. WCCPv2 is supported in all 12.0 and later versions. Most IOS 12.x versions also support WCCPv1, but 12.0(4)T and earlier do not. Be sure to check whether your Cisco hardware supports any of these IOS versions.

When configuring WCCP in your router, you should refer to your Cisco documentation. Here are the basic commands for IOS 11.x:

ip wccp enable
!
interface fastethernet0/0
ip wccp web-cache redirect

Use the following commands for IOS 12.x:

ip wccp version 1
ip wccp web-cache
!
interface fastethernet0/0
ip wccp web-cache redirect out

Notice that with IOS 12.x you need to specify which WCCP version to use. This command is only available in IOS releases that support both WCCP versions, however. The fastethernet0/0 interface may not be correct for your installation; use the name of the router interface that connects to the outside Internet. Note that packets are redirected on their way out of an interface. IOS does not yet support redirecting packets on their way in to the router. If needed, you can use access lists to prevent redirecting requests for some origin server or client addresses. Consult the WCCP documentation for full details.

Cisco Policy Routing

Interception caching can also be accomplished with Cisco policy routing. The next-hop for an IP packet is normally determined by looking up the destination address in the IP routing table. Policy routing allows you to set a different next-hop for packets that match a certain pattern, specified as an IP access list. For interception caching, we want to match packets destined for port 80, but we do not want to change the next-hop for packets originating from the cache. Thus, we have to be a little bit careful when writing the access list. The following example does what we want:

access-list 110 deny   tcp host 10.1.2.3 any eq www
access-list 110 permit tcp any any eq www

10.1.2.3 is the address of the cache. The first line excludes packets with a source address 10.1.2.3 and a destination port of 80 (www). The second line matches all other packets destined for port 80. Once the access list has been defined, you can use it in a route-map statement as follows:

route-map proxy-redirect permit 10
match ip address 110
set ip next-hop 10.1.2.3

Again, 10.1.2.3 is the cache's address. This is where we want the packets to be diverted. The final step is to apply the policy route to specific interfaces:

interface Ethernet0
ip policy route-map proxy-redirect

This instructs the router to check the policy route we specified for packets received on interface Ethernet0.

On some Cisco routers, policy routing may degrade overall performance of the router. In some versions of the Cisco IOS, policy routing requires main CPU processing and does not take advantage of the ''fast path'' architecture. If your router is moderately busy in its normal mode, policy routing may impact the router so much that it becomes a bottleneck in your network.

Some amount of load balancing can be achieved with policy routing. For example, you can apply a different next-hop policy to each of your interfaces. It might even be possible to write a set of complicated access lists that make creative use of IP address netmasks.

Note that policy routing does not support failure detection. If the cache goes down or stops accepting connections for some reason, the router blindly continues to divert packets to it. Policy routing is only a mediocre replacement for sophisticated layer four switching products. If your production environment requires high availability, policy routing is probably not for you.

The TCP Layer: Ports and Delivery

Now that we have fiddled with the routing, the diverted HTTP packets are arriving at the cache's network interface. Usually, an Internet host rejects received packets if the destination address does not match the host's own IP address. For interception caching to work, the cache must accept the diverted packet and give it to the TCP layer for processing.

In this section, we'll discuss how to configure a Unix host for interception caching. If you use a caching appliance, where the vendor supplies both hardware and software, this section may not be of interest to you.

The features necessary to support interception caching on Unix rely heavily on software originally developed for Internet firewalls. In particular, interception caching makes use of the software for packet filtering and, in some cases, network address translation. This software does two important things. First, it tells the kernel to accept a diverted packet and give it to the TCP layer. Second, it gives us the option to change the destination port number. The diverted packets are destined for port 80, but our cache might be listening on a different port. If so, the filtering software changes the port number to that of the cache before giving the packet to the TCP layer.

I'm going to show you three ways to configure interception caching: first with Linux, then with FreeBSD, and finally with the IP Filter package, which runs on numerous Unix flavors. In all the examples, 10.1.2.3 is the IP address of the cache (and the machine that we are configuring). The examples also assume that the cache is running on port 3128, and that an HTTP server, on port 80, is also running on the system.

Linux

Linux has a number of different ways to make interception caching work. Mostly, it depends on your kernel version number. For Linux-2.2 kernels, you'll probably want to use ipchains; for 2.4 kernels, you'll want to use iptables (a.k.a. Netfilter).

Most likely, the first thing you'll need to do is compile a kernel with certain options enabled. If you don't already know how to build a new kernel, you'll need to go figure that out, and then come back here. A good book to help you with this is Linux in a Nutshell. Also check out the "Linux Kernel HOWTO" from the Linux Documentation Project at http://www.linuxdoc.org/.

ipchains

To begin, make sure that your kernel has the necessary options enabled. On most systems, go to the kernel source directory (/usr/src/linux) and type:

# make menuconfig

Under Networking options, make sure the following are set:

[*] Network firewalls
[*] Unix domain sockets
[*] TCP/IP networking
[*] IP: firewalling
[*] IP: always defragment (required for masquerading)
[*] IP: transparent proxy support

Once the kernel has been configured, you need to actually build it, install it, and then reboot your system.

After you have a running kernel with the required options set, you need to familiarize yourself with the ipchains program. ipchains is used to configure IP firewall rules in Linux. Firewall rules can be complicated and unintuitive to many people. If you are not already familiar with ipchains, you should probably locate another reference that describes in detail how to use it.

The Linux IP firewall has four rule sets: input, output, forwarding, and accounting. For interception caching, you need to configure only the input rules. Rules are evaluated in order, so you need to list any special cases first. This example assumes we have an HTTP server on the Linux host, and we don't want to redirect packets destined for that server:

/sbin/ipchains -A input -p tcp -s 0/0 -d 10.1.2.3/32 80 -j ACCEPT
/sbin/ipchains -A input -p tcp -s 0/0 -d 0/0 80 -j REDIRECT 3128

The -A input option means we are appending to the set of input rules. The -p tcp option means the rule matches only TCP packets. The -s and -d options specify source and destination IP addresses with optional port numbers. Using 0/0 matches any IP address. The first rule accepts all packets destined for the local HTTP server. The second rule matches all other packets destined for port 80 and redirects them to port 3128 on this system, which is where the cache accepts connections.

Finally, you need to enable routing on your system. The easiest way to do this is with the following command:

echo 1 > /proc/sys/net/ipv4/ip_forward

After you get the ipchains rules figured out, be sure to save them to a script that gets executed every time your machine boots up.

iptables

In the Linux-2.4 kernel, iptables has replaced ipchains. Unfortunately, I don't have operational experience with this new software. Setting up iptables is similar to ipchains. You'll probably need to build a new kernel and make sure the iptables features are enabled. According to Daniel Kiracofe's ''Transparent Proxy with Squid mini-HOWTO,'' the only command you need to intercept connections is:

/sbin/iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 \
-j REDIRECT --to-port 3128

You'll also need to enable routing, as described previously.

Since iptables is relatively new, the information here may be incomplete. Search for Daniel's mini-HOWTO or see the Squid FAQ (http://www.squid-cache.org/Doc/FAQ/) for further information.

FreeBSD

Configuring FreeBSD for interception caching is very similar to configuring Linux. The examples here are known to work for FreeBSD Versions 3.x and 4.x. These versions have all the necessary software in the kernel source code, although you need to specifically enable it. If you are stuck using an older version (like 2.2.x), you should consider upgrading, or have a look at the IP Filter software described in the following section.

First, you probably need to generate a new kernel with the IP firewall code enabled. If you are unfamiliar with building kernels, read the config(8) manual page. The kernel configuration files can usually be found in /usr/src/sys/i386/conf. Edit your configuration file and make sure these options are enabled:

options         IPFIREWALL
options         IPFIREWALL_FORWARD

Next, configure and compile your new kernel as described in the config(8) manual page. After your new kernel is built, install it and reboot your system.

You need to use the ipfw command to configure the IP firewall rules. The following rules should get you started:

/sbin/ipfw add allow tcp from any to 10.1.2.3 80 in
/sbin/ipfw add fwd 127.0.0.1,3128 tcp from any to any 80 in
/sbin/ipfw add allow ip from any to any

The first rule allows incoming packets destined for the HTTP server on this machine. The second line causes all remaining incoming packets destined for port 80 to be redirected (forwarded, in the ipfw terminology) to our web cache on port 3128. The final rule allows all remaining packets that didn't match one of the first two. The final rule is shown here because FreeBSD denies remaining packets by default.

A better approach is to write additional allow rules just for the services running on your system. Once all the rules and services are working, you can have FreeBSD deny all remaining packets. If you do that, you'll need some special rules so the interception proxy works:

/sbin/ipfw add allow tcp from any 80 to any out
/sbin/ipfw add allow tcp from 10.1.2.3 to any 80 out
/sbin/ipfw add allow tcp from any 80 to 10.1.2.3 in established
/sbin/ipfw add deny ip from any to any

The first rule here matches TCP packets for intercepted connections sent from the proxy back to the clients. The second rule matches packets for connections that the proxy opens to origin servers. The third rule matches packets from the origin servers coming back to the proxy. The final rule denies all other packets. Note that this configuration is incomplete. It's likely that you'll need to add additional rules for services such as DNS, NTP, and SSH.

Once you have the firewall rules configured to your liking, be sure to save the commands to a script that is executed when your system boots up.

Other Operating Systems

If you don't use Linux or FreeBSD, you might still be able to use interception caching. The IP Filter package runs on a wide range of Unix systems. According to their home page (http://cheops.anu.edu.au/~avalon/ip-filter.html), IP Filter works with FreeBSD, NetBSD, OpenBSD, BSD/OS, Linux, Irix, SunOS, Solaris, and Solaris-x86.

As with the previous Linux and FreeBSD instructions, IP Filter also requires kernel modifications. Some operating systems support loadable modules, so you might not actually need to build a new kernel. Configuring the kernels of all the different platforms is too complicated to cover here; see the IP Filter documentation regarding your particular system.

Once you have made the necessary kernel modifications, you can write an IP Filter configuration file. This file contains the redirection rules for interception caching:

rdr ed0 10.1.2.3/32 port 80 -> 127.0.0.1 port 80 tcp
rdr ed0 0.0.0.0/0 port 80 -> 127.0.0.1 port 3128 tcp

Note that the second field is a network interface name; the name ed0 may not be appropriate for your system.

To install the rules, you must use the ipnat program. Assuming that you saved the rules in a file named /etc/ipnat.rules, you can use this command:

/sbin/ipnat -f /etc/ipnat.rules

The IP Filter package works a little differently than the Linux and FreeBSD firewalls. In particular, the caching application needs to access /dev/ipnat to determine the proper destination IP address. Thus, your startup script should also make sure the caching application has read permission on the device:

chgrp nobody /dev/ipnat
chmod 644 /dev/ipnat

If you are using Squid, you need to tell it to compile in IP Filter support with the --enable-ipf-transparent configure option.

The Application Layer: HTTP

Recall that standard HTTP requests and proxy-HTTP requests are slightly different (see Section "HTTP Requests"). The first line of a standard request normally includes only an absolute pathname. Proxy-HTTP requests, on the other hand, use the full URL. Because interception proxying does not require browser configuration, and the browser thinks it is connected directly to an origin server, it sends only the URL-path in the HTTP request line. The URL-path does not include the origin server hostname, so the cache must determine the origin server hostname by some other means.

The most reliable way to determine the origin server is from the HTTP/1.1 Host header. Fortunately, all of the recent browser products do send the Host header, even if they use "HTTP/1.0" in the request line. Thus, it is a relatively simple matter for the cache to transform this standard request:

GET /index.html HTTP/1.0
Host: www.ircache.net

into a proxy-HTTP request, such as:

GET http://www.ircache.net/index.html HTTP/1.0
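
Here is a minimal Python sketch of this transformation. Real caches must handle malformed requests and other edge cases, so treat it as an illustration only.

def to_proxy_request(request: str) -> str:
    lines = request.splitlines()
    method, path, version = lines[0].split()
    # Recent browsers always send a Host header; the fallback when
    # it is missing is discussed below.
    host = next(h.split(":", 1)[1].strip()
                for h in lines[1:] if h.lower().startswith("host:"))
    # Rebuild the request line with the full URL, proxy-style.
    lines[0] = "%s http://%s%s %s" % (method, host, path, version)
    return "\r\n".join(lines) + "\r\n"

# to_proxy_request("GET /index.html HTTP/1.0\r\nHost: www.ircache.net")
# yields "GET http://www.ircache.net/index.html HTTP/1.0" plus headers.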

In the absence of the Host header, the cache might be able to use the socket interface to get the IP address for which the packet was originally destined. The Unix sockets interface allows an application to retrieve the local address of a connected socket with the getsockname() function. In this case, the local address is the origin server that the proxy pretends to be. Whether this actually works depends on how the operating system implements the packet redirection. The native Linux and FreeBSD firewall software preserves the destination IP address, so getsockname() does work. The IP Filter package does not preserve destination addresses, so applications need to access /dev/ipnat to get the origin server's IP address.

If the cache uses getsockname() or /dev/ipnat, the resulting request looks something like this:

GET http://192.52.106.29/index.html HTTP/1.0
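
In code, the getsockname() approach looks roughly like this Python sketch. It assumes the firewall preserves the original destination address (as the native Linux and FreeBSD software does) and that the path came from the request line.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 3128))          # the port the firewall redirects to
s.listen(5)

conn, client = s.accept()
# Because the firewall preserved the destination address, the socket's
# "local" address is really the origin server's address.
origin_ip, origin_port = conn.getsockname()
url = "http://%s/index.html" % origin_ip    # path taken from the request line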

While either a hostname or an IP address can be used to build a complete URL, hostnames are highly preferable. The primary reason for this is that URLs typically use hostnames instead of IP addresses. In most cases, a cache cannot recognize that both forms of a URL are equivalent. If you first request a URL with a hostname, and then again with its IP address, the second request is a cache miss, and the cache now stores two copies of the same object. This problem is made worse because some hostnames have many different IP addresses.

Debugging Interception

Many people seem to have trouble configuring interception caching on their networks. This is not too surprising, because configuration requires a certain level of familiarity with switches and routers. The rules and access lists these devices use to match certain packets are particularly difficult. If you set up interception caching and it doesn't seem to be working, these hints may help you isolate the problem.

First of all, does the caching proxy receive redirected connections? The best way to determine this is with tcpdump. For example, you can use:

tcpdump -n port 80

You should see a fair amount of output if the switch or router is actually diverting connections to the proxy. Note that if you have an HTTP server running on the same machine, it is difficult to visually differentiate the proxy traffic from the server traffic. You can use additional tcpdump parameters to filter out the HTTP server traffic:

tcpdump -n port 80 and not dst 10.1.2.3

If you don't see any output from tcpdump, then it's likely your router/switch is incorrectly configured.

If your browser requests just hang, then it's likely that the switch is redirecting traffic, but the cache cannot forward misses. Running tcpdump in this case shows a lot of TCP SYN packets sent out but no packets coming back in. You can also check for this condition by running netstat -n. If you see a lot of connections in the SYN_SENT state, it is likely that the firewall/nat rules deny incoming packets from origin servers. Turn on firewall/nat debugging if you can.

You may also find that your browser works fine, but the caching proxy doesn't log any of the requests. In this case, the proxy machine is probably simply routing the packets. This could happen if you forget, or mistype, the redirect/forward rule in the ipchains/ipfw configuration.

Issues

Interception caching is still somewhat controversial. Even though it sounds like a great idea initially, you should carefully consider the following issues before deploying it on your network.

It's Difficult for Users to Bypass

If, for some reason, one of your users encounters a problem with interception caching, he or she is going to have a difficult time getting around the cache. Possible problems include stale pages, servers that are incompatible with your cache (but work without it), and IP-based access controls. The only way to get around an interception cache is to configure a different proxy cache manually. Then the TCP packets are not sent to port 80 and thus are not diverted to the cache. Most likely, the user is not savvy enough to configure a proxy manually, let alone realize what the problem is. And even if he does know what to do, he most likely does not have access to another proxy on the Internet.

In my role as a Squid developer, I've received a number of email messages asking for help bypassing ISP settings. The following message is real; only the names have been changed to protect the guilty:

Duane,

I am Zach Ariah, a subscriber of XXX Internet - an ISP who has
recently installed... Squid/1.NOVM.21. All of my HTTP requests are
now being forced through the proxy(proxy-03-real.xxxxx.net). I really
don't like this, and am wondering if there is anyway around this. Can
I do some hack on my client machine, or put something special into the
browser, which will make me bypass the proxy??? I know the proxy looks
at the headers. This is why old browsers don't work.

Anyway... Please let me know what's going on with this.
Thank you and Best regards,

Zach Ariah

This is a more serious issue for ISPs than it is for corporations. Users in a corporate environment are more likely to find someone who can help them. Also, corporate users probably expect their web traffic to be filtered and cached. ISP customers have more to be angry about, since they pay for the service themselves.

All layer four switching and routing products have the ability to bypass the cache for special cases. For example, you can tell the switch to forward packets normally if the origin server is www.hotmail.com or if the request is from the user at 172.16.4.3. However, only the administrator can change the configuration. Users who experience problems need to ask the administrator for assistance. Getting help may take hours, or even days. In some cases, users may not understand the situation well enough to ask for help. It's also likely that such a request will be misinterpreted or perhaps even ignored. ISPs and other organizations that deploy interception caching must be extremely sensitive to problem reports from users trying to surf the Web.

Packet Transport Service

What exactly is the service that one gets from an Internet service provider? Certainly, we can list many services that a typical ISP offers, among them email accounts, domain name service, web hosting, and access to Usenet newsgroups. The primary service, however, is the transportation of TCP/IP packets to and from our systems. This is, after all, the fundamental function that enables all the other services.

As someone who understands a little about the Internet, I have certain expectations about the way in which my ISP handles my packets. When my computer sends a TCP/IP packet to my ISP, I expect my ISP to forward that packet towards its destination address. If my ISP does something different, such as divert my packets to a proxy cache, I might feel as though I'm not getting the service that I pay for.

But what difference does it make? If I still get the information I requested, what's wrong with that? One problem is related to the issues raised in Chapter 3, "Politics of Web Caching". Users might assume that their web requests cannot be logged because they have not configured a proxy.

Another, more subtle point to be made is that some users of the network expect the network to behave predictably. The standards that define TCP connections and IP routing do not allow for connections to be diverted and accepted under false pretense. When I send a TCP/IP packet, I expect the Internet infrastructure to handle that packet as described in the standards documents. Predictability also means that a TCP/IP packet destined for port 80 should be treated just like a packet for ports 77, 145, and 8333.

Routing Changes

Recall that most interception caching systems expose the cache's IP address when forwarding requests to origin servers. This might alter the network path (routing) for the HTTP packets coming from the origin server to your client. In some cases, the change can be very minor; in others, it might be significant. It's more likely to affect ISPs than corporations and other organizations.

Some origin servers expect all requests from a client to come from the same IP address. This can really be a problem if the server uses HTTP/TLS and unencrypted HTTP. The unencrypted (port 80) traffic may be intercepted and sent through a caching proxy; the encrypted traffic is not intercepted. Thus, the two types of requests come from two different IP addresses. Imagine that the server creates some session information and associates the session with the IP address for unencrypted traffic. If the server instructs the client to make an HTTP/TLS request using the same session, it may refuse the request because the IP address doesn't match what it expects. Given the high proliferation of caching proxies today, it is unrealistic for an origin server to make this requirement. The session key alone should be sufficient, and the server shouldn't really care about the client's IP address.

When an interception cache is located on a different subnet from the clients using the cache, a particularly confusing situation may arise. The cache may be unable to reach an origin server for whatever reason, perhaps because of a routing glitch. However, the client is able to ping the server directly or perhaps even telnet to it and see that it is alive and well. This can happen, of course, because the ping (ICMP) and telnet packets take a different route than HTTP packets. Most likely, the redirection device is unaware that the cache cannot reach the origin server, so it continues to divert packets for that server to the cache.

It Affects More Than Browsers and Users

Web caches are deployed primarily for the benefit of humans sitting at their computers, surfing the Internet. However, a significant amount of HTTP traffic does not originate from browsers. The client might instead be a so-called web robot, or a program that mirrors entire web sites, or any number of other things. Should these clients also use proxy caches? Perhaps, but the important thing is that with interception proxying, they have no choice.

This problem manifested itself in a sudden and very significant way in June of 1998, when Digex decided to deploy interception caching on their backbone network. The story also involves Cybercash, a company that handles credit card payments on the Internet. The Cybercash service is built behind an HTTP server and thus uses port 80. Furthermore, Cybercash uses IP-based authentication for its services. That is, Cybercash requires transaction requests to come from the known IP addresses of its customers. Perhaps you can see where this is leading.

A number of other companies that sell merchandise on the Internet are connected through Digex's network. When a purchase is made at one of these sites, the merchant's server connects to Cybercash for the credit card transaction. However, with interception caching in place on the Digex network, Cybercash received these transaction connections from a cache IP address instead of the merchant's IP address. As a result, many purchases were denied until people finally realized what was happening.

The incident generated a significant amount of discussion on the North American Network Operators Group (NANOG) mailing list. Not everyone was against interception caching; many applauded Digex for being forward-thinking. However, this message from Jon Lewis illustrates the feelings of people who are negatively impacted by interception caching:

My main gripe with Digex is that they did this (forced our traffic into a transparent proxy) without authorization or notification. I wasted an afternoon, and a customer wasted several days worth of time over a 2-3 week period trying to figure out why their cybercash suddenly stopped working. This customer then had to scan their web server logs, figure out which sales had been "lost" due to proxy breakage, and see to it that products got shipped out. This introduced unusual delays in their distribution, and had their site shut down for several days between their realization of a problem and resolution yesterday when we got Digex to exempt certain IP's from the proxy.

Others took an even stronger stance against interception caching. For example, Karl Denninger wrote:

Well, I'd love to know where they think they get the authority to do this from in the first place.... that is, absent active consent. I'd be looking over contracts and talking to counsel if someone tried this with transit connections that I was involved in. Hijacking a connection without knowledge and consent might even run afoul of some kind of tampering or wiretapping statute (read: big trouble).....

No-Intercept Lists

Given that interception caching does not work with some servers, how can we fix it? Currently, the only thing we can do is configure the switch or router not to divert certain connections to the cache. This must be a part of the switch/router configuration because, if the packets are diverted to the cache, there is absolutely nothing the cache can do to ''undivert'' them. Every interception technique allows you to specify special addresses that should not be diverted.

The maintenance of a no-intercept list is a significant administrative headache. Proxy cache operators cannot really be expected to know of every origin server that breaks with interception caching. At the same time, discovering the list of servers the hard way makes the lives of users and technical support staff unnecessarily difficult. A centrally maintained list has certain appeal, but it would require a standard format to work with products from different vendors.

One downside to a no-intercept list is that it may also prevent useful caching of some objects. Routers and switches check only the destination IP address when deciding whether to divert a connection. Any given server might have a large amount of cachable content but only a small subset of URLs that do not work through caches. Unfortunately, in such cases the entire site must be exempted from diversion.

Are Port 80 Packets Always HTTP?

I've already made the point that packets destined for port 80 may not necessarily be HTTP. The implied association between protocols and port numbers is very strong for low-numbered ports. Everyone knows that port 23 is telnet, port 21 is FTP, and port 80 is HTTP. However, these associations are merely conventions that have been established to maximize interoperation.

Nothing really stops me from running a telnet server on port 80 on my own system. The telnet program has the option to connect to any port, so I just need to type telnet myhostname 80. However, this won't work if there is an interception proxy between my telnet client and the server. The router or switch assumes the port 80 connection is for an HTTP request and diverts it to the cache.

This issue is likely to be of little concern to most people, especially in corporate networks. Only a very small percentage of port 80 traffic is not really HTTP. In fact, some administrators see it as a positive effect, because it can prevent non-HTTP traffic from entering their network.

HTTP Interoperation Problems

Interception caching is known to impair HTTP interoperability. Perhaps the worst instance is with Microsoft Internet Explorer. When you click on Reload, and Explorer thinks it's connecting to the origin server, it omits the Cache-control: no-cache directive. The interception cache doesn't know the user clicked on Reload, so it serves a cache hit instead of forwarding the request to the origin server.[4]

[4]See Microsoft Knowledgebase article Q266121, http://support.microsoft.com/support/kb/articles/Q266/1/21.ASP.

Interception proxies also pose problems for maintaining backwards compatibility. HTTP allows clients and servers to utilize new, custom request methods and headers. Ideally, proxy caches should be able to pass unknown methods and headers between the two sides. However, in practice, many caching products cannot process new request methods. A smart client can bypass the proxy cache for the unknown methods, unless interception caching is used.

IP Interoperation Problems

There are a number of ways that interception proxies impact IP interoperability. For example, consider path MTU[5] discovery. Internet hosts use the IP don't fragment option and ICMP feedback messages to discover the smallest MTU of all links between them. This technique is almost worthless when connection hijacking creates two network paths for a single pair of IP addresses.

[5] The Maximum Transmission Unit is the largest packet size that can be sent in a single datalink-layer frame or cell.

Another problem arises when attempting to measure network proximity. One way to estimate how close you are to another server is to time how long it takes to open a TCP connection. Using this technique with an interception proxy in the way produces misleading results. Connections to port 80 are established quickly and almost uniformly. Connections to other ports, however, take significantly longer and vary greatly. A similar measurement tactic times how long it takes to complete a simple HTTP request. Imagine that you've developed a service that rates content providers based on how quickly their origin servers respond to your requests. Everything is working fine, until one day your ISP installs an interception cache. Now you're measuring the proxy cache rather than the origin servers.
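
For instance, a measurement tool along the lines of this Python sketch (the host and ports are arbitrary) would report suspiciously fast, uniform times on port 80 when an interception proxy completes the handshake locally:

import socket
import time

def connect_time(host: str, port: int) -> float:
    # Time only the TCP three-way handshake.
    start = time.monotonic()
    socket.create_connection((host, port), timeout=10).close()
    return time.monotonic() - start

for port in (80, 443):
    print(port, "%.3f seconds" % connect_time("www.ircache.net", port))
# With an interception proxy in the path, the port 80 handshake ends at
# the nearby cache, while port 443 travels to the origin server.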

I imagine that as IP security (RFC 2401) becomes more widely deployed, many people will discover problems caused by interception proxies. The IP security protocols and architecture are designed to ensure that packets are delivered end-to-end without modification. Indeed, connection hijacking is precisely one of the reasons to use IP security.

To Intercept or Not To Intercept

Interception caching, a.k.a. connection hijacking, is extremely attractive to proxy and network administrators because it eliminates client configuration headaches. Users no longer need to know how to configure proxies in their browser. Furthermore, it works with all web clients; administrators don't need specific instructions for Lynx, Internet Explorer, Netscape Navigator, and their different versions. With interception caching, administrators have greater control over the traffic sent to each cache. It becomes very easy to add or remove caches from a cluster or to disable caching altogether.

A related benefit is the sheer number of users using the cache. When users are given a choice to use proxies, most choose not to. With interception caching, however, they have no choice. The larger user base drives up hit ratios and saves more wide-area Internet bandwidth.

The most significant drawback to interception caching is that users lose some control over their web traffic. When problems occur, they can't fix the problem themselves, assuming they even know how. Another important consequence of connection hijacking is that it affects more than just end users and web browsers. This is clearly evident in the case of Digex and Cybercash.

Certainly, interception caching was in use long before Digex decided to use it on their network. Why, then, did the issue with Cybercash never come up until then? Mostly because Digex was the first to deploy interception caching in a backbone network. Previously, interception caching had been installed close to web clients, not web servers. There seems to be growing consensus in the Internet community that interception caching is acceptable at the edges of the network, where its effects are highly localized. When used in the network core (i.e., backbones), its effects are widely distributed and difficult to isolate, and thus unacceptable.

Many people feel that WPAD (or something similar) is a better way to ''force'' clients to use a caching proxy. With WPAD, clients at least understand that they are talking to a proxy rather than the origin server. Of course, there's no reason you can't use both. If you use interception proxying, you can still use WPAD to configure those clients that support it.
