Web Caching Exposed

by Jeff Connelly, 2003-04-20

What is a "web cache" or "HTTP cache"?

Main Entry: ¹cache
Pronunciation: 'kash
Function: noun
Etymology: French, from cacher to press, hide, from (assumed) Vulgar Latin coacticare to press together, from Latin coactare to compel, frequentative of cogere to compel -- more at COGENT
Date: 1797
1 a : a hiding place especially for concealing and preserving provisions or implements b : a secure place of storage
2 : something hidden or stored in a cache
3 : a computer memory with very short access time used for storage of frequently used instructions or data -- called also cache memory

-- Merriam-Webster

A web cache is a server which acts as a proxy, caching (storing) requested files so it can serve them to other clients without downloading them again from the origin server. This caching speeds up web browsing and saves a significant amount of bandwidth, because the origin server does not have to be contacted repeatedly unless the file changes. HTTP caching is a complicated issue; details can be found in RFC2616: Hypertext Transfer Protocol -- HTTP/1.1. Corporations and ISPs use web caches to speed up web browsing.

How do I find a web cache to use?

You can either set up your own using software such as Squid Cache, or search for an existing web cache. Many caches on the Internet are free for public use, and can be found on lists of open proxies. Not all proxies cache, but most do.

How can I quickly check if a web cache works?

Either use checkp.pl (slow) or the web-based http://www.checker.freeproxy.ru/checker/. Use a low timeout (say, 1 minute) to speed things up. Although some caches are permanent, others may exist only temporarily - it all depends on who is running them.
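A quick check can also be scripted. The following is a minimal Python sketch, not a port of checkp.pl: the function names are made up for illustration, and the proxy address and test URL are whatever you supply. It only reports whether a plain HTTP request sent through the proxy completes within the timeout.

```python
import http.client
import urllib.error
import urllib.request


def build_proxy_opener(proxy_host, proxy_port):
    """Return an opener that sends plain HTTP requests through the given proxy."""
    proxy_url = "http://%s:%d" % (proxy_host, proxy_port)
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def proxy_works(proxy_host, proxy_port, test_url, timeout=60):
    """Return True if test_url can be fetched through the proxy within timeout seconds."""
    opener = build_proxy_opener(proxy_host, proxy_port)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, HTTP errors and garbage replies all count as failure.
        return False
```

For example, proxy_works("192.0.2.10", 3128, "http://example.com/") would test a proxy at the placeholder address 192.0.2.10, port 3128, with the 1 minute timeout suggested above.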

Where can I find a list of open proxies?

Google
DNSRBL (DNS real-time blacklists) sites, open proxy blacklists
Roswell Instrument - nice site, regularly finds new proxies, can sort by speed
WInfoSec OpenProxies.com - provides 10 slow proxies every day, and 10 fastest proxies via email
AiS Alive Proxy List - offers real-time checking, sent HTTP headers
Open Directory Computers: Internet: Proxies: Free
Check Your Proxy

Some companies sell proxy lists, and there are also commercial proxies. As always, be sure to check that a proxy works before using it (see previous question).

How do I use a web cache?

One way is to set your web browser to make all connections through the proxy. This varies depending on your browser:

On Internet Explorer:

Go to Tools -> Internet Options -> Connections -> LAN Settings
Check "Use a proxy server for your LAN" under "Proxy server"
Enter the specified address and port for your web cache

On Mozilla:

Go to Edit -> Preferences -> Advanced -> Proxies
Check "Manual proxy configuration"
Enter the specified address and port for your web cache under "HTTP proxy"

Eventually, software may become available that is written specifically to use multiple web caches.

How do I cache a file in a web cache?

Download the file completely through the cache. Often, it will be cached automatically. However, the origin server can send several HTTP headers to improve caching:

Content-Type: application/octet-stream - to avoid conversion
Expires - set this to a reasonable date in the future
Last-Modified - set this to a date in the past
ETag - set this to a unique value, or a short description
Cache-Control: public, no-transform - to share cache between users

Some transparent proxies, in an attempt to further improve speeds, will compress or otherwise convert certain types of content. For example, images may be compressed to save bandwidth (see RFC2616 14.9.5). For this reason, the application/octet-stream MIME type is supplied, as well as the no-transform Cache-Control directive. Most caches don't mess with user data, but it's better to be safe.

The Expires field can be set to keep the file in the cache for longer periods of time. The HTTP spec (RFC2616 14.21) says not to use a date more than one year in the future.
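As a sketch of how such a date might be computed, the Python function below formats an Expires value and caps it at one year from now. The function name and the 365-day constant are illustrative choices, not anything the spec prescribes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

ONE_YEAR = timedelta(days=365)  # assumed reading of "one year"


def expires_header(lifetime, now=None):
    """Return an HTTP-date string for Expires, capped at one year from now."""
    if now is None:
        now = datetime.now(timezone.utc)
    lifetime = min(lifetime, ONE_YEAR)
    # usegmt=True yields the "GMT" suffix that HTTP dates use.
    return format_datetime(now + lifetime, usegmt=True)
```

Calling expires_header(timedelta(days=30)) gives a date 30 days ahead; asking for a longer lifetime than one year silently yields the one-year value.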

Last-Modified and ETag are required for some caches to operate correctly. The "entity tag" allows different files to be told apart: give each file its own string, and send that same string every time the file is served.

Finally, Cache-Control: public makes sure the file is stored in the user-wide (shared) cache instead of a private cache. This is often the default.

Avoid question marks in URLs, as well as .cgi. RFC2616 13.9 says caches must not treat responses to URLs containing "?" as fresh unless the server supplies an explicit expiration time, and some proxies may avoid caching .cgi URLs for fear that they are dynamic.
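The URL advice above can be turned into a quick check. This is a rough Python sketch with a made-up function name; it tests only for the two patterns mentioned here and says nothing about what a particular cache will actually do.

```python
from urllib.parse import urlsplit


def looks_cache_friendly(url):
    """Return False if the URL contains "?" or a .cgi path component."""
    parts = urlsplit(url)
    if "?" in url:
        return False  # query URLs are often not cached
    if ".cgi" in parts.path.lower():
        return False  # may be mistaken for dynamic content
    return True
```

For example, http://example.com/files/archive.zip passes, while http://example.com/get.cgi/archive.zip and http://example.com/get?file=archive.zip do not.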

The host which originally hosted the file is known as the origin. Once the file is cached in a web cache, the origin server need not be contacted.

Hereinafter, the process of caching a file in a web cache is referred to as seeding.

How can I make my web server return those extra headers?

Use a CGI script.
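A minimal sketch of such a script in Python follows. The file path, the 30-day lifetime and the way the ETag is built from the file's size and modification time are all assumptions chosen for illustration; adjust them to your setup. It also sends Content-Length, which CGI scripts do not send unless written to.

```python
#!/usr/bin/env python3
import os
import shutil
import sys
import time
from email.utils import formatdate

FILE_PATH = "/var/www/files/example.bin"  # hypothetical path to the file being seeded
EXPIRES_IN_SECONDS = 30 * 24 * 3600       # assumed lifetime: 30 days


def build_headers(size, mtime, now):
    """Return the caching headers as a list of (name, value) pairs."""
    etag = '"%x-%x"' % (size, int(mtime))  # stable per file; one possible scheme
    return [
        ("Content-Type", "application/octet-stream"),
        ("Content-Length", str(size)),
        ("Expires", formatdate(now + EXPIRES_IN_SECONDS, usegmt=True)),
        ("Last-Modified", formatdate(mtime, usegmt=True)),
        ("ETag", etag),
        ("Cache-Control", "public, no-transform"),
    ]


def serve(path, out):
    """Write the headers, a blank line, then the file contents to out."""
    info = os.stat(path)
    for name, value in build_headers(info.st_size, info.st_mtime, time.time()):
        out.write(("%s: %s\r\n" % (name, value)).encode("ascii"))
    out.write(b"\r\n")
    with open(path, "rb") as source:
        shutil.copyfileobj(source, out)
    out.flush()


# Web servers set GATEWAY_INTERFACE when running a CGI script.
if __name__ == "__main__" and "GATEWAY_INTERFACE" in os.environ:
    serve(FILE_PATH, sys.stdout.buffer)
```

If possible, map the script to a URL that contains neither "?" nor ".cgi".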

How can I download a file from a web cache?

Set the cache as your web proxy, and download the file. Sending this HTTP request header helps:

Cache-Control: only-if-cached, public, max-stale

only-if-cached causes the cache to return 504 Gateway Timeout if the file is not cached, or the file if it is. This prevents unnecessary hitting of the origin server.

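public is meant to ensure the user-wide (shared) cache is used. Note that RFC2616 defines public only as a response directive, so a cache may ignore it in a request.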

max-stale does this, according to RFC2616 14.9.3:

Indicates that the client is willing to accept a response that has exceeded its expiration time. If max-stale is assigned a value, then the client is willing to accept a response that has exceeded its expiration time by no more than the specified number of seconds. If no value is assigned to max-stale, then the client is willing to accept a stale response of any age.

Assuming the correct file is cached, this may help improve reliability. A Warning 110 (Response is Stale) header will be returned if stale. A 111 warning (Revalidation failed) is returned if the origin server is down, and a stale response was returned as a result.
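The request header and the possible outcomes described above can be sketched in Python as follows. The function names and the returned strings are illustrative only, and the Warning parsing is deliberately simple.

```python
import re
import urllib.request


def build_cache_only_request(url):
    """Return a GET request that asks the cache not to contact the origin."""
    return urllib.request.Request(
        url, headers={"Cache-Control": "only-if-cached, public, max-stale"})


def warning_codes(warning_header):
    """Extract the three-digit warn-codes from a Warning header value."""
    if not warning_header:
        return []
    return [int(code) for code in re.findall(r"(?:^|,)\s*(\d{3})\s", warning_header)]


def describe(status, warning_header=None):
    """Summarize a cache's reply from its status code and Warning header."""
    if status == 504:
        return "not cached"
    codes = warning_codes(warning_header)
    if 111 in codes:
        return "stale, revalidation failed"
    if 110 in codes:
        return "stale"
    return "no staleness warning"
```

Send the request through an opener configured to use the cache as its proxy. Python's urllib raises urllib.error.HTTPError for a 504; that exception's code and headers attributes can be passed to describe().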

How can I check if a cache still holds a file without downloading it?

Send a HEAD request instead of a GET, but otherwise proceed as in the previous question. If the file is not cached, expect a 504 Gateway Timeout and a Server header containing the name of the web cache software. If successful, several headers may be returned:

Age: seconds since response was fetched from origin
Accept-Ranges: if contains "bytes", then you can specify byte ranges to obtain parts of the file
Date: current date
Content-Length: size in bytes
Expires: expiration date from origin server
Cache-Control: public or private
Server: origin server
ETag: entity tag sent by origin server
Last-Modified: last modified date from origin server
Via: cache server software

These headers are also returned for GET (HEAD works like GET but doesn't return the entity body).
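A HEAD check might be sketched in Python like this. The helper names are made up for illustration; summarize() simply picks the headers listed above out of whatever the cache returned.

```python
import urllib.request

INTERESTING = ("Age", "Accept-Ranges", "Date", "Content-Length", "Expires",
               "Cache-Control", "Server", "ETag", "Last-Modified", "Via")


def build_head_request(url):
    """Return a HEAD request that asks the cache not to contact the origin."""
    return urllib.request.Request(
        url,
        headers={"Cache-Control": "only-if-cached, public, max-stale"},
        method="HEAD")


def summarize(headers):
    """Pick the cache-related headers out of a name -> value mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name.lower()]
            for name in INTERESTING if name.lower() in lowered}


def supports_byte_ranges(headers):
    """Return True if Accept-Ranges contains "bytes"."""
    return "bytes" in summarize(headers).get("Accept-Ranges", "").lower()
```

After opening the request through the cache, pass the response's headers to summarize().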

How should I select the origin host?

The origin server can be any host you wish that has the file you want to cache. If only you have the file, your IP address can be used. However, it is preferable to use DNS names - specifically, the so-called "dynamic DNS" services which offer free DNS name to IP address mappings. Not only will using a dynamic DNS name ensure the origin server can be found (if required), it will also allow the DNS name to be changed if needed, after seeding is complete.

How can I hide the origin host?

See previous question - use dynamic DNS.

How long should I keep the origin host up?

This is really up to you. Once one or more caches are properly seeded, the origin host does not need to serve anything more, and if clients send the only-if-cached header and the cache honors it, the origin will not receive any HTTP requests. If you're still worried about bandwidth being wasted on the origin server, see the previous question.

How can I find the speed of a web cache?

Download a large file from, say, Bandwidth Place (this will be your origin server). The speed at which you download will be the lowest of the following:

Upload speed of the origin server, unless the file is cached
Speed at which the cache server can send data to you
Download speed of your Internet connection
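The "lowest of" rule above is a simple minimum. As a small Python sketch (the function and parameter names are illustrative, and all speeds must use the same unit):

```python
def expected_speed(origin_upload, cache_speed, client_download, cached):
    """Return the bottleneck speed for a download through a cache."""
    limits = [cache_speed, client_download]
    if not cached:
        # The origin only matters while the cache still has to fetch the file.
        limits.append(origin_upload)
    return min(limits)
```

With an origin at 50, a cache at 2000 and a client line at 300, the expected speed is 50 before the file is cached and 300 afterwards, at which point your own line is the limit.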

Some caches are extremely fast, in order to serve the large user base they may have. These may easily overwhelm a residential Internet connection, and cause the download speed to be capped by your line rather than the cache. In that case, download from a faster Internet connection to achieve accurate results.

The speed at which a file is seeded into a web cache is less important, as a file needs to be cached only once.

In Internet Explorer, why does the download speed start out very fast but continuously decline to a steady value?

This is a result of how HTTP works and how Microsoft misunderstands it. Once you click a link or type the URL of a file you want to download, Internet Explorer sends a request to the server for that file. You are then presented with a file save dialog while Internet Explorer keeps downloading in the background. However, the timer for calculating download speed starts only once you have chosen a filename to save as, and by then part of the file has already been downloaded. Since speed is calculated as bytes per second, and more bytes and fewer seconds are counted than should be, the download speed is inflated. As more time passes, the bytes downloaded before the timer started make up a smaller share of the total, and the displayed speed sinks. Eventually, given enough time, the speed will plateau at the true limit.
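The effect can be illustrated with a little arithmetic. The Python sketch below is a simplified model, not Internet Explorer's actual code: it assumes the file arrives at a constant true rate, both while the save dialog is open and afterwards.

```python
def displayed_speed(true_rate, dialog_seconds, elapsed):
    """Speed shown when the timer starts dialog_seconds after the download does.

    elapsed is the time since the timer started and must be greater than zero.
    """
    prefetched = true_rate * dialog_seconds  # bytes received before the timer started
    return (prefetched + true_rate * elapsed) / elapsed
```

With a true rate of 100 (say, KB/s) and 10 seconds spent in the save dialog, the displayed speed is 1100 after 1 second, 200 after 10 seconds and 110 after 100 seconds, approaching 100 from above.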

This bug isn't unique to proxies, but is worth knowing about.

Why is the "estimated time left" sometimes not available when downloading a file?

To estimate the time left, the size of the file must be known beforehand. If the estimate is not available, the server you're downloading from hasn't sent a Content-Length header. Proxies and servers usually send it for you, but CGI scripts will not unless they are specifically written to do so.
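The dependency on Content-Length is easy to see in a sketch of the estimate itself. The Python function below is illustrative only; browsers may compute it differently.

```python
def estimated_seconds_left(content_length, bytes_received, bytes_per_second):
    """Return the estimated time left, or None if it cannot be computed."""
    if content_length is None or bytes_per_second <= 0:
        return None  # no Content-Length (or no progress): nothing to estimate
    remaining = max(content_length - bytes_received, 0)
    return remaining / bytes_per_second
```

Without a Content-Length value the function has nothing to subtract from, which is why no estimate can be shown.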


Modified Sun Mar 25 08:48:47 2007 generated Sun Mar 25 08:56:33 2007
http://jeff.tk/caching/