Going on a bandwidth diet

Tea

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
3,749
Location
27a No Fixed Address, Oz.
Website
www.redhill.net.au
I'm trying to cut down on the bandwidth my site sucks. Er ... sorry ... Tannin's site. I don't want to take bits of the site down, and it would be easy enough to just buy more bandwidth, but why should I? Hey, it's a free service I provide, and I'm already paying for 6GB a month; that ought to be enough. Well, it is enough, most months, but only just barely.

I could just take the risk that the site will run over and suck itself dry, and go off-line for a few days at the end of each month, except that there are five other sites I administer, all using the same shared server, and it would take the other five sites down too, which is not acceptable. (The other five only use about one-quarter as much bandwidth; they are not the problem children.) Also, it would probably kill my email, if that matters.

Last month it got slashdotted again and I had to switch the images folder off-line and have the pages generate blank-space 404s for most of the pictures for a couple of days. (The slashdotting gave me one day with about three times the usual amount of traffic. Yes, that's how close to the limit the site runs. Lucky it was a 30-day month!)

So then I got busy cutting down on wastage.
  • I've switched off hotlinking (more or less - it's possible to get around it, but that would be too much trouble for most people, which is close enough for my purposes).
  • I've asked Tannin to split up one of the three main sections (the CPUs one) into smaller files, so that the average random visitor who only visits one page gets half as much text and half as many jpegs. (The interested reader only has to click one link to see the next page as well, so this does not impact on usability.)
  • I've moved all the images around to new locations (which breaks existing links to them). (I also moved most of the text, as part of the general reorganisation, but there are redirects for the text file moves, so a good deal of that will be user-transparent.)
  • I've updated robots.txt to cover the new image folders, so that search engines won't index the images (a sketch follows this list).
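For reference, the robots.txt side of it is nothing fancy. A minimal sketch, assuming the images now live in folders called /images1/ and /images2/ (made-up names); note that crawlers only fetch robots.txt from the site root:

# robots.txt, in the site root (folder names are hypothetical)
User-agent: *
Disallow: /images1/
Disallow: /images2/
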
Not enough. I'm still running too close to my limit. And that last one will take a while to show much effect, as the search engines are slow to update their links. So what's next?

Looking at the server stats, the pageviews split up into three groups:
  • Direct address or bookmark: 40 - 45%
  • Search engines: 40 - 45%
  • Static links: 10 - 20%

It's probably fair to make the following generalisations: anyone who has bothered to bookmark the site probably wants to read it, so they are "good" visitors, and very welcome. The same applies, in the main, to people following static links. The search engine crowd, however, are different. Here, some study of the server logs soon tells us, we have a very mixed bag. Roughly, they fall into the following groups:
  • People looking for information about particular things, like a certain old motherboard model. These are "good" visitors, in the sense that they are likely to be interested in the site content, and the people Tannin says he wants to visit. (I dunno why, they are probably all horrible geeks like him.)
  • People looking for a specific bit of information about a specific item: motherboard manuals, hard drive jumper settings, and so on. These are "bad" visitors, insofar as they are (a) not likely to be interested in the site, and (b) unlikely to get any benefit from it anyway - it doesn't try to provide that level of detail.
  • People looking for stupid specific things. I mean, what do you make of a search term like "jumpers to set on Intel chipset motherboard to overclock Celeron 366"? Some of them are really dumb!
  • People looking for vague but probably not very intelligent things: search terms like "best hard drive" or "differences between Pentium 4 and Athlon Thunderbird 1333", or just "CPU", seem unlikely to bring the site interested readers, or to bring the searcher useful content - but you'd be amazed how many people search for really, really vague stuff.
  • Image searches. It seems likely that 90%+ of image searches are just a prelude to copyright violation and/or bandwidth theft.
I can't think of any good reasons to encourage image searches. All on its own, images.google accounts for about 25% of the search engine hits, or 10% of the total visitors. (Actually, it's quite a lot more than 10%, as I've taken steps to get rid of a lot of other junk visits that inflate the stats, such as hotlinks and referrer spam, so of the remainder, 15 to 20% is more probable.) A random sampling of the logs shows quite clearly that people arriving via that route practically never follow links to other pages (which, as I see it, would be an indication of genuine interest in the site content).

(continued on next post)
 

Tea

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
3,749
Location
27a No Fixed Address, Oz.
Website
www.redhill.net.au
To begin with, I cut out a lot of the hotlinking by making a list of trusted sites (including SF, of course) that the server will provide images to on request. Requests from anywhere else I redirected to a hotlink error page, which doesn't work - the server provides the HTML, but the client never sees it, as it's expecting an image. It just wastes bandwidth and leaves a blank hole in the offending site.

So I redirected to an error image instead. This works fine, except that an image can't carry much actual information about the nature of the error: the more text you include in it, the more bandwidth it sucks. Second, it can't distinguish between a genuine hotlink (cheapskate scumbags selling crap on eBay are common offenders) and various other circumstances — so you wind up serving too nasty an image to people who haven't done all that much wrong, or else being too polite to scumbags. And even a small image still wastes bandwidth.
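
In .htaccess terms, that stage looked roughly like this - a sketch only, assuming Apache with mod_rewrite; the trusted-site address and image names are made up:

# .htaccess sketch of the error-image stage (Apache + mod_rewrite)
RewriteEngine On
# Blank referrers were still allowed at this stage
RewriteCond %{HTTP_REFERER} !^$
# Let our own pages and trusted sites through
RewriteCond %{HTTP_REFERER} !^http://(www\.)?redhill\.net\.au/ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?trusted-site\.example/ [NC]
# Don't rewrite the error image itself, or we loop forever
RewriteCond %{REQUEST_URI} !hotlink-error\.jpg
# Everyone else asking for a JPEG gets the error image instead
RewriteRule \.jpe?g$ /hotlink-error.jpg [R,L]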

I buggerised about with various versions of this theme for a while, none of them really appropriate.

The biggest problem was images.google. Google hotlinks to your images as a matter of routine. You can serve it an image that says "stolen" or something, but then the reader might think that you have stolen the image! Worse, the image stays in the reader's browser cache, and if they visit your page they will see "stolen" instead of the actual illustration until such time as they hit "refresh" or "shift-refresh".

It gets worse: once Google decides that your server isn't going to provide it with hotlinked images, it does something clever and, instead of showing the usual thumbnail plus framed sample to the person searching for an image, redirects (unframed) to the whole HTML page on your site, which includes not just the original image in question, but maybe 6 or 12 other images as well. By trying to save a little bandwidth, you wind up getting hammered!

In the end, I threw my toys out of the pram and did three things (sketched in the .htaccess snippet below).

(1) I blocked blank referrers as well as non-trusted ones. (Nothing to do with Google, this one, just part of the general tightening up. Only for images, of course, not HTML, CSS or JS files.)

(2) Blocked hotlinked images from Google Images with a 403. Now, if you try to access one of my images from Google's image search, you'll get a flat "403 Forbidden: you don't have permission to access XYZ on this server".

(3) Blocked HTML access from Google Images with a 403. If you find my page through Google's image search, you don't get nuffin, not even if you click on the "Below is the image in its original context on the page: (whatever address)" link.
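
All three blocks boil down to a few mod_rewrite lines. A minimal sketch, assuming Apache with mod_rewrite enabled; the exact patterns would need adjusting to taste:

# .htaccess sketch of the three blocks above (Apache + mod_rewrite)
RewriteEngine On
# (1) Blank referrers are refused, but only for images
RewriteCond %{HTTP_REFERER} ^$
RewriteRule \.(jpe?g|gif|png)$ - [F]
# (2) Image requests referred by Google Images get a flat 403
RewriteCond %{HTTP_REFERER} images\.google\. [NC]
RewriteRule \.(jpe?g|gif|png)$ - [F]
# (3) So does HTML access referred by Google Images
RewriteCond %{HTTP_REFERER} images\.google\. [NC]
RewriteRule \.html?$ - [F]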

There will be the odd genuine visitor who gets lost in that process, but only a very few.

Comments, gentlemen? Have I gone too far? How do you feel about blocking blank HTTP referrers? What moron would misconfigure his browser/firewall software to strip referrers out anyway?
 

Tea

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
3,749
Location
27a No Fixed Address, Oz.
Website
www.redhill.net.au
Oh, by the way ... if you've noticed any missing images here on SF lately, that has nothing to do with the above. That was entirely Tannin's fault. When we switched to the new hosting people (who are excellent) a couple of months ago, Tannin forgot to copy the SF images over to the new server before he switched the old one off. We could hunt around and repost them, but it seems like quite a lot of work and probably won't happen.

Also by the way, why this sudden interest in bandwidth? Well, the old hosting company was located in Australia, and most of the traffic comes from America or Europe. They don't have to pay for traffic in that direction, so they didn't care how much we used.

On the other hand, they charged $350 per year (which was way too much), didn't provide anything much in the way of site administration tools, and had various broken and/or out-of-date things, such as open site statistics access to the public (= masses of referrer spam), and broken domain name arrangements — you had to type the "www" or the site didn't work.
 

blakerwry

Storage? I am Storage!
Joined
Oct 12, 2002
Messages
4,203
Location
Kansas City, USA
Website
justblake.com
I'm not sure if you have control over this or not, but I just implemented mod_gzip compression on our internal website and it now uses roughly one-third the bandwidth.

With Apache you can implement this at the virtual server level or with a .htaccess file for specific directories.

It took only a couple of seconds, works with all browsers, and can make a tremendous difference without affecting the visitor's experience one bit.
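
If mod_gzip is available, the .htaccess version is only a few lines. A sketch (directive details vary a little between mod_gzip versions), compressing text while leaving images alone:

<IfModule mod_gzip.c>
mod_gzip_on Yes
mod_gzip_dechunk Yes
# Compress HTML/CSS/JS and anything served as text
mod_gzip_item_include file \.(html?|css|js)$
mod_gzip_item_include mime ^text/.*
# JPEGs and GIFs are already compressed; skip them
mod_gzip_item_exclude mime ^image/.*
</IfModule>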
 

Tea

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
3,749
Location
27a No Fixed Address, Oz.
Website
www.redhill.net.au
Good idea! Thank you, Blakerwry.

Is it on already? Don't know.
If not, can I enable it? Don't know that either.
Finally, how much difference would it make? Most of the bandwidth is images, which don't compress well anyway.

May stats (by bandwidth, not number)
jpg: 87.1 %
html: 8.7 %
css: 1 %
gif: 2.9 %

June
jpg: 85.8 %
html: 10 %
css: 1.1 %
gif: 2.8 %

July
jpg: 84.5 %
html: 10.7 %
css: 1.3 %
gif: 3.2 %

You can see that I'm gradually cutting down on the graphics component — not by changing the content, at least not to speak of, but by getting rid of parasitic loads like hotlinking. Starting point was exactly 90% images, 10% everything else. Month to date is 87.7% images, estimated current load is maybe 80 or 85% images (this won't show up until next month's stats).

So gzip would only act effectively on the remaining (non-image) content, about 15% of the total. What compression ratio can I expect on HTML? Maybe a two-thirds saving? If so, that works out to something like 10% of the total bandwidth — which is worth having in anybody's language.

Thanks Blakerwry, I'll look into it.
 

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,926
Location
USA
You can check to see if mod_gzip is enabled by using this tool by Leknor.

Keep in mind that this does increase the server's load a little.
 

Tea

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
3,749
Location
27a No Fixed Address, Oz.
Website
www.redhill.net.au
Thanks Doug! Cool tool!

Not gzipped.

On the other hand, now that I think about it, I'd be asking the site admin to go to some trouble to install or enable something that will place extra load on his server, simply so I can avoid paying him a little bit extra ... and I have to say that Arvand is a nice guy who doesn't charge much and has provided superb service. It seems a bit rude. Maybe I'll hold off on the gzip idea till I see how my program of cutting out the dead-wood bandwidth (hotlinks, stupid Google image searches, etc.) goes.
 

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,926
Location
USA
It might be worth asking your web host admin about mod_gzip. What you may lose on performance (which may be insignificant), you'll gain in your wallet. If you're in a shared environment using Apache's VirtualHost configuration, he might be able to add the parameter just for your domain... I haven't tried this, but it may be possible.
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,728
Location
Horsens, Denmark
Have you considered switching to PNG image files? I noticed a 30% drop in the size of the images without noticeable image degradation.
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
PNG files are a cool alternative to GIF; they're best suited to drawings and screenshots. Unfortunately, they're not even on the same page as JPEG when it comes to photos.

Tea, you are right, tools like gzip can't compress JPEG images at all. Actually, even the best compression algorithms in existence can only manage a couple of percent on JPEG.

I'm really impressed with the techniques you've worked out to deal with the search engines etc.
 

P5-133XL

Xmas '97
Joined
Jan 15, 2002
Messages
3,173
Location
Salem, Or
No, you have a perfect right to filter what people see off your own web site. If you want to discriminate between visitors and search engines, that seems fine with me: it is your data, not theirs or the public's...
 

blakerwry

Storage? I am Storage!
Joined
Oct 12, 2002
Messages
4,203
Location
Kansas City, USA
Website
justblake.com
Actually, with most Linux distributions you can use compressed HTTP via Apache by default.

I'm not sure how all that goes since I have never compiled Apache from source, but with most distributions you simply add the line
AddOutputFilterByType DEFLATE text/html
to the specific virtual server container (each site hosted on the server has a separate VirtualHost section in the Apache config file that specifies its properties and separates it from the other sites).
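
In context, that would look something like this (a sketch; the domain and paths are placeholders):

# httpd.conf sketch: compression for one virtual server only
# (domain and paths are hypothetical)
<VirtualHost *:80>
ServerName www.example.com
DocumentRoot /var/www/example
# Compress HTML on the fly; images are left untouched
AddOutputFilterByType DEFLATE text/html
</VirtualHost>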


Most likely you'll only be inconveniencing your webmaster for about 10 seconds.

Or... looking at the Apache 2.0 docs for the AddOutputFilter directive
http://httpd.apache.org/docs-2.0/mod/mod_mime.html#addoutputfilter

you can add this directive to an .htaccess file and make the change yourself without ever informing your administrator.
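
Something like this in the site's top-level .htaccess should do it - a sketch, assuming the host's Apache 2.0 has mod_deflate loaded:

<IfModule mod_deflate.c>
# Compress HTML and CSS; JPEGs won't benefit anyway
AddOutputFilterByType DEFLATE text/html text/css
</IfModule>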


Since you already mentioned that you've played around with Apache directives and mod_rewrite, I assume you're familiar with .htaccess files.
 

blakerwry

Storage? I am Storage!
Joined
Oct 12, 2002
Messages
4,203
Location
Kansas City, USA
Website
justblake.com
Forgot to mention that I've read claims that using compressed HTTP is actually faster and causes less load on a server than not using it. There are a few specifics involved in when this is true, but in most cases it is faster, and the load difference is negligible one way or the other.
 