Practical benchmarking tips

Tannin · Apr 5, 2002

Once upon a time, it used to be a simple and uncontroversial matter to speed-rate CPUs. Sure, you could argue round the edges and quibble over details, but if you picked the best motherboard you could find, or something that was reasonably close to it, and then ran Business Winstone 97, you'd end up with a number that was in the same ballpark as anyone else was going to get.

In those days you could go to a major web site, Tom's say, or Anand's - Anand was small and friendly but he was around then - and easily compare, say, the Pentium MMX and the K6 Classic. Not too many people would have disagreed with your results.

But now, there is no generally accepted single-figure benchmark. Cruise on over to Ace's (or wherever else you like) and they will cite ten or twenty or even thirty different numbers, showing that the Athlon XP excels in CAD-Bench XYZ but that the P-4 is better in Serious Sam. And so on.

Now in many ways this is a good thing. Performance has always been a multifaceted and complex thing, and it's good to see that people have become much more aware of this than they ever used to be. But there is a downside too: it has become all but impossible to do simple things like rank-order CPUs.

My problem is this: for years now I have kept my lists of historical CPUs in one particular order: speed order.

Yup, there are a million different ways to define speed, and most of them are clearly only applicable to one or another specialist application, or else poorly defined. If you only play Serious Sam, then it is easy enough to work out which are the best CPUs and in what order.

I do at least have a pretty clear idea of what I mean by "fast". To me, "fast" is quite simple. A "fast" CPU is one that copes well with ordinary mundane tasks. And to me, that means navigating around the desktop. In a word: snap.

Tannin · Apr 5, 2002

The obvious objection to this is that we are not running 486SX-33s anymore and that any modern system, if properly set up and tuned, can deliver a reasonable amount of desktop snap. That's true, but in reality, the vast majority of real-world systems don't!

We see a really good variety of everyday working systems come into the workshop. People bring their computers in to us for a host of different reasons: the video card has slipped out of its slot, or the motherboard has died, or they want virus scanning, a new modem added, a tired CD-ROM replaced, a RAM upgrade, a new main board and CPU, a bigger hard drive, or the kids have crashed it, or they just want Windows stripped off and put on again. And the point of this is that we mostly work with systems that are NOT properly set up, that are NOT well-tuned, that are NOT defragged every week, with systems that have a lot of junk in their startup, systems that have every stupid desktop "enhancement" known to man or boy, systems that have had a CPU upgrade not too long ago but are, for some incomprehensible reason, still running 32MB of RAM ...

Very often, much as you try to take out a few of the most obvious and stupid performance-sappers, you don't have the time or the budget to do all the things that really ought to be done (usually starting with "format c:") - hey, they've just brought it in to have a scanner installed, OK? You don't have the right to tinker too much with it, and even if you do you won't get paid.

Guys like Buck and CougTek and Mercutio see this sort of stuff every working day, no doubt, same as I do. And I have not the slightest doubt that all the rest of you guys that don't work on computers for a living get to see it too - Clocker has his dad, for example, and I bet Tim helps people out with computer stuff quite often. And so on. We all see it.

Now my point here is that badly configured computers are normal out there in the real world. Judging by the random samples I see landing on my workbench every day, not to mention the systems I see at friends' houses, the typical working computer operates at maybe one-third of its actual capacity. It takes three times as long to boot as it could do, just opening "My Computer" involves a visible delay, and so on.

On a scale of 1 to 10, where 10 means "perfectly tuned and running at 100% of the theoretical maximum for this CPU and hard drive and RAM configuration", and 1 means "this thing has so much crap on it that if I open Word I have to make a cuppa while it loads and then type slow as well, it's amazing it ever works at all", I guess most real-world systems are in the middle of the range, between 3 and 7, say.

Tannin · Apr 5, 2002

And this is why I think office application performance still matters - because although a perfectly-configured Celeron 800 is damn near as fast as an XP 1700 when you first build it, by the time it's been out there in the real world with real non-nerds using it for a while, there will be a difference. A big difference.

So, for me, "fast" means this: "can stand up to abuse and still somehow manage to perform decently in ordinary, everyday tasks." By "abuse" I mean the sorts of horrible things that users do to their systems as routine. By "ordinary, everyday tasks" I mean booting, opening Explorer, using several web browser windows at the same time, using the actual common software that home and small business users use all the time: IE 5.0, Word 97, Winfax, Netscape, Outlook, Excel, Word Perfect 7, Quicken, all that basic stuff.

So there is my question, or list of questions, really:

1: How do the CPUs of the last few years line up?

If you had to list all the CPUs on the market, starting with, say, the Celeron 800 and working your way up to the fastest CPU on the planet today, what order would you put them in? (Remember, I'm asking you to do that in speed order using "fast" to mean what I set out above - I'm well aware that there are lots of other, equally valid, ways to define "fast", this just happens to be the one that I use and like.)

2: How the hell do you measure it?

I've some thoughts on question (2) which I'm too tired of typing to set down tonight - they might take even longer than that tome above - but which essentially boil down to two things:

(a) A measurement method has to be practical - i.e., I'll accept some loss of strict accuracy if need be, but I can't and won't accept that the only way to do this is to use the truly heroic and quite impractical sort of dedication that we see Eugene over at Storage Review putting into hard drive testing. Even running an old Winstone is not really something you can do too easily, not if you want to rank-order 20 or 30 different CPUs.

(b) The current popular general-purpose benchmark test suites, Winbench and BAPCO, are NOT suitable. In the real world (at least in my real world) no-one uses Drag-n-Dictate, for example. ZDBOP and BAPCO have corrupted their office application performance suites with a whole stack of floating-point and SSE/3DNow dependent stuff that real people just don't use. Out there in the real world of homes and smaller offices, people are more likely to own Office 97 than Office 2000 (and almost no-one has Office XP yet), more likely to have IE 5 than IE 6, probably less than half have the latest Media Player, and so on. BAPCO and WinBench 200X just don't measure the sorts of things that most people do. In fact, WinStone 97 or 98 is probably a better representation of the actual real-world application mix. But I worry that testing with one of those will give me silly results, given that they were designed for 32MB and at most 64MB systems, and that hardware designers have had ample time to tune for them.

Tannin · Apr 5, 2002

Oh, and I forgot to mention one more thing. So far as I am concerned, it makes no sense to consider CPUs in isolation. You have to take the main board and RAM into consideration too.

Now the common way to do this is to take just one particular mainboard, usually the best one you can lay your hands on, and test Duron 1000 vs Duron 1200 vs Athlon XP 1700 vs Athlon XP 2000 on this same main board. Then you take an Intel RDRAM board and test the P-4 chips on it (as it is the nearest match you can get to that KT266 or Nforce board you used for the AMD chips).

But in reality, this is never the way systems ship. You just don't take a $300 main board and put a Celeron 1000 on it. (At least not often.) Now I won't stoop to testing CPUs on an integrated board - I wouldn't inflict integrated video even on a Celeron 733 - but to me it makes more sense to rate each CPU on a "good typical board". That means I'd use a KT-133A to test an Athlon Thunderbird 1200C or a Duron 1100, but I'd use a DDR-equipped KT266A to test an XP 1700.

time · Apr 5, 2002

Hear, hear!

:council:

My feelings exactly. Current benchmarks are measuring the wrong things. Which is also why drive seek time should have more emphasis in real life. In benchmarking, only the first few cylinders of a drive get tested. In real life, full seek time really matters, as apps and data get scattered to the four corners of the disk.

Drives with fast full seek times 'age' better than ones with sluggish seeks. Worst case scenarios actually happen under the disorganized Windows environment that most people run.

Mercutio · Apr 5, 2002

Fortunately, although I certainly consider myself a tech, I don't see as much awfulness as Tannin because I usually have the good luck to be working with business machines rather than home machines.

Here's my view: Your CPU is utterly irrelevant these days. Those of us with DVD-ripping habits might see things differently, but in general, there's no practical difference between a Tualatin Celeron and an XP2100+, from a user's perspective. Heck, I'd put a good bit of money that a "normal" user couldn't even tell the difference between a dog like a Via C3 and a hopped-up P4, if all the other components were the same.

So what matters?
Memory technology doesn't matter, so long as you have "enough" of it - 64MB+ for 9x or 128MB+ for anything else. Reasonable enough. Most PCs sold in the last year meet that spec as well. DDR, SDRAM? I still sell both. People are happy either way, and my machines sure as hell are "snappy". I know exactly what Tannin's talking about, even if I can't quantify it, either.
DDR (and RAMBUS, if you're into that sort of thing) is something that, again, matters only to those people needing serious memory bandwidth (divx again). No one else cares, and it's not "snap".

Operating System DOES matter. 9x loses "snap" pretty quickly, OSes like Linux and 2000 have a "snap retaining" quality. Sophistication of the OS? Or users smart enough not to gum up the works? Either way, a strong determining factor in being snappy. I'd also say that virus-scanners are the anti-snap factor. They actually work in such a way as to preclude snap.

I know motherboard performance has been proven time and time again to be largely identical within generations; there's not much difference between KT266 and AMD760 etc, but I really can't remember seeing a SiS or ALi based system that had the same qualities of responsiveness as the Via and AMD stuff I'm most used to. That could be my opinion coloring reality, of course, but I knew the machines in the training place I'm working at were SiS-based boards even before I knew what processors were being used. (Yup, Duron 750s on SiS-based quality Amptron motherboards with integrated SiS graphics, SiS sound, and SiS NICs. A P54C-166 I was planning to use for a test machine actually manages to run Windows 98 with about the same amount of responsiveness.)

The big determining factor I see in overall system responsiveness is almost certainly hard disk performance, and probably a bit more access time than STR. Program/module load times are strongly impacted, of course, but if the rest of the machine is even marginally tuned, having a reasonably fast disk is the key to having things "just happen".

In the end "snap" isn't system responsiveness. That's part of it, yes, but there's a big difference between "responsive" and "I don't have to wait for anything, ever", which is often the feeling you get with a "snappy" system.

As an extreme example, in the era of P2-450s, I ran into a 24MB IBM 486/33 with an ancient, 1GB 5400rpm SCSI hard disk. Someone had put Win98 on the poor thing. Molasses, you say? Nope. That thing was absolutely as fast for the basic things I had to do with it (mostly run Excel) as the P2s I had at home. Unreal, incomprehensible, mind-boggling and true. Whatever "snap" is, that crummy little 486 had it.

Buck · Apr 5, 2002

As Mercutio and Tannin illustrate, speed is more dependent upon perception for the average home user than anything else. Perception causes a home user to complain that their system is “loud”, but after they put it into a computer cabinet, all is well. Does that make sense? To them it does. You can assemble a Socket7 233MHz system with 256 MB PC-100 RAM, a 7,200-rpm drive, 17” Monitor, Canon S630 color printer, Windows XP, and they view it as a dream machine. Why? Because the printer prints beautiful colors, the monitor is crisp, clear, and vibrant; Windows XP is new and the reasonable memory and HDD load everything quickly enough. Do they realize that the fastest processor today has a clock speed 10 times as fast? No. Do they care? No, not unless you charge them the 2.4 GHz price, of course.

Home users love downloading anything free that will change a color on their desktop, put an icon on the task bar, or play some annoying sound. They don’t need the fastest CPU; they need plenty of memory, a quick HDD, and an average video card. If you can make their system look attractive for a few extra dollars, they’ll take it.

What we need is a benchmark that runs a whole series of home-user-memory-wasting software (eg Webshots, Real Player, Gator, ICQ, AIM, Norton everything, etc) plus the additional loads of Word, IE (about six sessions simultaneously should do), some greeting card creator, solitaire, freecell, McAfee Virus Scan Scanning (they inevitably load more than one anti-virus software), Outlook, and Windows Media Player (they inevitably also have more than one media player and don’t know how to uninstall Real Player).
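A rough idea of what such a "cluttered machine" benchmark could look like, as a minimal Python sketch. Everything here is an assumption for illustration: the busy-loop threads only stand in for the background clutter (they hog CPU time, not memory, and are nothing like a real Gator or Norton install), and `foreground_task` is a made-up placeholder for an everyday job. The point is just the shape: time the same task with and without the clutter running, and compare.

```python
import statistics
import threading
import time


def busy_worker(stop_event):
    """Stand-in for one piece of background clutter: spin until told to stop."""
    counter = 0
    while not stop_event.is_set():
        counter += 1


def time_under_load(task, background_workers=4, repeats=5):
    """Return the median wall-clock seconds `task` takes while the hogs run."""
    stop_event = threading.Event()
    hogs = [
        threading.Thread(target=busy_worker, args=(stop_event,), daemon=True)
        for _ in range(background_workers)
    ]
    for hog in hogs:
        hog.start()
    try:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            task()
            samples.append(time.perf_counter() - start)
    finally:
        # Always shut the background threads down, even if the task fails.
        stop_event.set()
        for hog in hogs:
            hog.join()
    return statistics.median(samples)


def foreground_task():
    """Hypothetical 'everyday' job: build and sort a modest list."""
    sorted(range(50_000, 0, -1))


if __name__ == "__main__":
    clean = time_under_load(foreground_task, background_workers=0)
    cluttered = time_under_load(foreground_task, background_workers=4)
    print(f"clean: {clean:.4f}s  cluttered: {cluttered:.4f}s")
```

The median is used rather than the mean so that one stray slow run does not skew the figure. A real version would launch the actual programs and time actual application tasks; this only shows the measuring loop.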

The ultimate step that Tannin is attempting to take, by rating CPUs, requires two levels of organization. The first level must follow the technical specification of an individual component (in this case, megahertz). The next level requires a great deal of time and testing: CPUs should be organized into a group with all other pertinent computer components, and tested as a whole. My first inclination would be to categorize these by operating system, and within each of these groups, separate out different platforms.

I’ll have to revisit this subject again, because there are a lot of possibilities.

BR

CougTek · Apr 5, 2002

My current computers are built exactly the opposite way as Buck described they should. They all have only 256MB of RAM (which is scarce by today's standards), but fast processors. However, I use them for crunching the Genome@home and therefore they are optimized for it. I agree though that an average user benefits more from memory than raw CPU power. For gamers, CPU power counts though, as well as the graphic card power. Both well before RAM size.

Different user types require different configuration types.

Buck · Apr 5, 2002

CougTek said:
My current computers are built exactly the opposite way as Buck described they should. They all have only 256MB of RAM (which is scarce by today's standards), but fast processors. However, I use them for crunching the Genome@home and therefore they are optimized for it. I agree though that an average user benefits more from memory than raw CPU power. For gamers, CPU power counts though, as well as the graphic card power. Both well before RAM size.

Different user types require different configuration types.

I totally agree, Coug. Four 1 GHz Duron systems running DOS on 540 MB drives, with 8 MB of RAM, would be perfect just for crunching numbers. The systems I described do not run at my house or in my shop, because I use mine for different things.

hmmm ... DOS, Genome, 540 MB drives ... hmmm

Pradeep · Apr 5, 2002

Norton SI. The finest benchmark there ever was. I ran that baby on my Pentium 66, I was walking on air for days.

Buck · Apr 5, 2002

How to handle the end user:

Tannin · Apr 6, 2002

Quite a can of worms already! Many an interesting comment to ponder and respond to also. Let's see if I can simplify matters by focussing on my actual present need, then return to the broader issue.

At the moment, all I need is a rank-ordering of the more recent CPUs. I'd prefer that it have a properly thought through methodology, of course, but I'll settle for a plain seat of the pants list if I have to.

I'll illustrate: in the early part of my CPU guide I listed some of the 486-era processors in this order:

486SX-25
386DX-40
486DLC-33
486SX-33
486DX-33
486DLC-40
486DX-40
486DX/2-50
486DX-50
486SLC/2-66
486DX/2-66

Now that order I'm perfectly happy with. In its own twisted way, it's logical. Under most circumstances, the 386DX-40 was faster than a 486SX-25, and though the difference was narrow, an SX-33 was faster than a DLC-33, but slower than a DLC-40. And so on.

But what about the SX-33 and the DX-33? Wasn't the DX-33 vastly more powerful at floating-point based stuff? Of course it was, but most people didn't do anything that used the floating-point power - these were the days of DOS 5.0, XTree Gold, Word Perfect 5.1, Paradox, Q&A Write, and maybe Windows 3.1 now and then. So the FPU power of the DX-33 was all but irrelevant, whereas the extra 7MHz of the DLC-40 meant visible extra snap. But the DX-33's FPU power was there, waiting to be called on, so seeing as the SX-33 and the DX-33 were identical in all other respects, I listed the DX-33 higher than the SX-33.

(You may or may not agree with my priorities here, but I hope I've spelled them out clearly enough to make the above list show that it is consistent with them.)

LiamC · Apr 6, 2002

Can of worms indeed! Disclaimer: I sometimes write for Realworldtech - I'm saying this upfront lest I be accused of bias.

Tannin, I would have to say (on the face of it) that Business Winstone 2001 might be a better benchmark than you think (within its limitations).

I say this because of the way ZD determined how to put it together. They simply surveyed their readership and asked them what they ran - and based the tests on the 18 500 responses! Business WS is the Office productivity stuff - ignore Content Creation.

The trouble is, this test is more determined by the speed of the hard disk and the chipset IDE implementation than CPU.

<boast>
I discovered this - not Anand, not Tom, not Kyle. Six weeks after I emailed Anand about the article I posted, he first mentioned it in one of his reviews. Now every second review on the web says it because he said it but no one wonders where the information came from. He never acknowledged the article by posting a link either. As a good Victorian (Aus) Irishman once said, "Such is life"
</boast>

But this also means that the "snappiness factor" is more pronounced in, or better reflected by, this benchmark than others such as SysMark, which try to avoid having the hard drive play a role in determining the score. BTW, the Winstone benches are lousy for just trying to determine CPU scores; they are designed more for complete system tests.

It is old enough to be using Office 2000 rather than XP, but not Office 97.

Why ignore the Content Creation stuff? How many people do you know that do flash animations or small video clips and post them on their web sites? Thought so - a very small percentage.

Why not SysMark? Try to find out how they arrive at their methodology. You can't. If they won't disclose how and why they made their decisions, how can you determine if it is valid? Dean Kent (RealWorldTech) did an article on SysMark because we were having a rather intense argument about whether it was a "good" benchmark - he now agrees with me - but not for the reasons I believed.

A couple of people are using the new MadOnion bench, but IMHO they are worse than BAPCo (SysMark). It's too early to determine what it is measuring...

Tannin · Apr 6, 2002

Now another example, from more recent times:

6x86-200 Classic
Pentium MMX 166
K6-166
6x86MX-166
Pentium MMX 200
Celeron 266
C6-200
K6-200
Celeron 300
Pentium MMX 233
6x86MX-200
C6-225
6x86MX-233
K6-233
Pentium II 233

Here again there is no particular difficulty about the rankings. By the time you get to this sort of performance level, it becomes difficult to make fine distinctions by the seat of your pants, so these were benchmark-ranked. I forget which one now, but one of the Winstones, '97 or '98 probably. Once again, the focus is on simple desktop performance ("snap") first, with ties being decided by other factors. (It wasn't needed, but imagine that the K6-233 and the Pentium II 233 had scored exactly the same on Business Winstone 98 - then I'd have ranked the P-II higher because of its stronger FPU. But a stronger FPU alone isn't enough to jump a chip up out of its proper category.)

Remember, while you may or may not agree with my choice of rules for this, the rules themselves are clear: more desktop snap = higher ranking. Other performance factors are generally ignored, except to decide ties. Notice that these rankings are based on representative main boards. Most contemporary benchmarks used the same board for every chip, usually an HX - and HX boards were (a) not representative of real-world buying patterns, and (b) incapable of providing decent performance with the Cyrix parts. The VIA chipsets, on the other hand (at least as implemented in the FIC boards I used a lot of) were excellent performers with the Cyrix chips but incapable of showing the Pentiums to their best advantage.
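The ranking rule (snap first, everything else only as a tie-breaker) can be sketched as a two-part sort key. This is a minimal Python illustration, not the method actually used for the lists above: the `Chip` type and every number in it are invented, with the K6-233 and Pentium II 233 deliberately given the same snap score to act out the imagined tie.

```python
from typing import NamedTuple


class Chip(NamedTuple):
    name: str
    snap: float  # desktop score, e.g. a Business Winstone result (made up here)
    fpu: float   # secondary strength, only consulted on ties (made up here)


def rank_order(chips):
    """Slowest first: sort on snap, and let FPU strength break exact ties only."""
    return sorted(chips, key=lambda chip: (chip.snap, chip.fpu))


chips = [
    Chip("Pentium II 233", snap=20.0, fpu=9.0),
    Chip("K6-233", snap=20.0, fpu=6.0),
    Chip("6x86MX-233", snap=19.5, fpu=4.0),
]

for chip in rank_order(chips):
    print(chip.name)
```

Because the FPU figure is the second element of the key, it is only ever compared when two snap scores are identical, so a strong FPU on its own cannot lift a chip past one with more desktop snap.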

In reality, we usually shipped these combinations: (again, this is memory talking)

6x86-200 Classic, 6x86MX-166: FIC PA-2005 (VIA chipset) or Gigabyte 586S (SiS chipset, 256k cache)
6x86MX 200 and 233: FIC VA-502 (VIA chipset)
Pentium MMX 200: Gigabyte or FIC HX, or Gigabyte 586S
Pentium MMX 233 and K6-233: an HX or FIC PA-2007 (VIA chipset, 1MB cache)
Celerons and P-II: Asus Intel LX boards

Or something like that. The SiS-based 586S boards, by the way, were amazing. These cheap little things consistently produced good scores with any brand of CPU, even though they had less cache. In contrast, the HX-based boards gave excellent scores with the AMD and Intel chips, but terrible scores with the Cyrix ones, while the VIA-based boards like the VA-502 were fantastic performers with the Cyrix chips, but sub-par with AMD and Intel.

So, in other words, it seemed to me most sensible to rank CPUs using the best "normal" combination that this particular CPU shipped with. Luckily for me, in the cases where we often used two or three different boards with a particular CPU, their performance was similar (e.g., MMX 200 with HX or 586S), so there were no real difficulties with this batch.
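Put as a rule: a CPU's rating is its best score across the boards it normally shipped on, and scores on boards it never really shipped with are ignored. A minimal Python sketch of that selection follows; the function name and all scores are hypothetical, and only the board pairings come from the shipping list above.

```python
def cpu_rating(scores_by_board, typical_boards):
    """Best score among the boards this CPU normally shipped on."""
    candidates = [
        scores_by_board[board]
        for board in typical_boards
        if board in scores_by_board
    ]
    if not candidates:
        raise ValueError("no score recorded on any typical board")
    return max(candidates)


# Made-up scores, purely to show the selection rule.
results = {
    "6x86MX-200": {"FIC VA-502": 18.0, "HX board": 14.0},
    "Pentium MMX 200": {"HX board": 17.5, "Gigabyte 586S": 17.3, "FIC VA-502": 16.0},
}

# Which boards count as "normal" for each chip (from the shipping list).
typical = {
    "6x86MX-200": ["FIC VA-502"],
    "Pentium MMX 200": ["HX board", "Gigabyte 586S"],
}

ratings = {cpu: cpu_rating(results[cpu], typical[cpu]) for cpu in results}
for cpu, rating in sorted(ratings.items(), key=lambda item: item[1]):
    print(cpu, rating)
```

The same-board-for-everything approach would instead read one column of `results`, which is exactly what penalises the Cyrix part on an HX board in this made-up data.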

The next generation was not too difficult either - K6-2/450 and Celeron 400s and such like were straightforward, as was (for example) P-III 600EB vs Athlon Classic 600.

But how the hell do I rank order the current CPUs?

Which, under these rules above, is faster?

Celeron 1200 or Duron 1200?

Pentium 4 1900 or XP 1700?

And which P 4 - there have been at least three different ones.

Tannin · Apr 6, 2002

(Tannin slips on over to RWT, where he has not visited for far too long, to do some in-depth reading.)

Tannin · Apr 6, 2002

" What makes this interesting is that ..... there is absolutely no gain in performance when adding memory."

(From Dean's article on Sysmark.)

And this on a benchmark that purports to be measuring multi-tasking activity. Ouch!

LiamC · Apr 6, 2002

How many people actually multitask? The closest most come is by running a virus scanner and audio in the background. Who recalculates massive spreadsheets in the background on a regular basis? Shit, with a top-end Athlon or P4 it wouldn't matter anyway it'd happen so fast

Audio ripping is the only common task that might happen...

CougTek · Apr 6, 2002

Great arguments about Business Winstone, Liam. Nope, it's not perfect, but if it tests the applications that most ZDNet readers use, then it's better than many other benchmarking tools. I wouldn't believe its ranking blindly though.

I'm not informed enough about BAPCo's SysMark to state an opinion about it.

However, the few benchmarks I saw of PCMark convinced me that it was utter junk. It ranks systems with the P4 on the Willamette core in front of the Northwood, the VIA P4X266 in front of the P4X266A, 5400rpm drives in front of 7200rpm of the same generation, etc. It's a pile of crap, a shame to the reviewing industry. MadOnion has to revise their benchmark from one end to the other in order to make it even remotely credible in my book. In its current form, it is, at best, a waste of web space within a hardware article.

Thinking about PCMark and its credibility, it has all the requirements to become Tom's favorite benchmark.

CougTek · Apr 6, 2002

LiamC said:
How many people actually multitask? The closest most come is by running a virus scanner and audio in the background. Who recalculates massive spreadsheets in the background on a regular basis? Shit, with a top-end Athlon or P4 it wouldn't matter anyway it'd happen so fast

Audio ripping is the only common task that might happen...

I commonly run the Genome@home client + firewall software + advanced 3D game at the same time. Does it qualify as multitasking?

Buck · Apr 6, 2002

LiamC said:
How many people actually multitask? The closest most come is by running a virus scanner and audio in the background. Who recalculates massive spreadsheets in the background on a regular basis? Shit, with a top-end Athlon or P4 it wouldn't matter anyway it'd happen so fast

Audio ripping is the only common task that might happen...

My experience is that home users multitask more often than not. They don't like closing windows, they like opening them. They'll open three instances of anything they can, especially IE.

Another thing that end users love is icons. Oh, they plaster their desktop with icons, and half of them are either files or programs, not shortcuts.

BR

LiamC · Apr 6, 2002

CougTek - yep, that qualifies

Do I blindly believe Bus WS 2001's ranking? Nope. Using benchmarks is about finding the one that most closely resembles what you do with your PC - and then judging on that ranking.

I don't do Photoshop, AVID cinema, POVRay or Dreamweaver. If I listened to MP3 audio, it wouldn't be with Windows Media Player. So that just about rules out all of the "Content Creation" benchmarks for me - be they Ziff-Davis, SysMark, MadOnion or whatever - the scores from these benchmarks have precisely jack to do with what I do on my PCs.

My wife on the other hand, is a Graphic Artist/Designer so she would be interested in them because they reflect what she does - but given a choice, she'd use a Mac.

Running a series of benches and choosing an overall "winner" is fraught with difficulty too. Ace's usually present a fairly wide range of benches, but some favour the Athlon and some favour the P4, and the Athlon wins more than it loses (which keeps the Ace's crowd happy), so they usually pick the Athlon as best/fastest overall. Now this would be a valid conclusion if (and only if) you happened to use that particular selection of software - but if you only use a subset, then just concentrate on the results of the subset. This is a message I find gets lost in the race to provide tons of graphs and an audience that wants their favourite chip to be declared the winner (not picking on any one website here).
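To make the subset point concrete, here is a small Python sketch that counts wins over whichever benchmarks you actually care about. The benchmark names and every score are invented for illustration (higher is better); the only thing borrowed from the thread is the general pattern of one chip winning most tests while the other wins a few.

```python
def count_wins(results, benchmarks):
    """Tally which chip posts the best score on each benchmark in the list."""
    tally = {}
    for bench in benchmarks:
        scores = results[bench]
        winner = max(scores, key=scores.get)
        tally[winner] = tally.get(winner, 0) + 1
    return tally


# Invented scores, higher is better.
results = {
    "office": {"Athlon XP": 52, "P4": 49},
    "cad": {"Athlon XP": 88, "P4": 80},
    "render": {"Athlon XP": 71, "P4": 69},
    "game": {"Athlon XP": 95, "P4": 104},
    "encode": {"Athlon XP": 60, "P4": 75},
}

print(count_wins(results, list(results)))        # the whole review
print(count_wins(results, ["game", "encode"]))   # only what this user runs
```

With these made-up numbers the full set favours one chip three to two, while the two-benchmark subset goes entirely the other way, which is why the "overall winner" only means something if you run the same mix.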

Unfortunately, I don't know of any benchmark that runs SETI/Folding/Genome in the background.

There was also the controversy about SysMark using a flavour of WME that didn't detect the SSE capabilities of the Athlon XP/MP, which made it appear weaker than the P4 in SysMark 2001. A lot of people boo hooed it, saying it was biased - me included. But Dean correctly pointed out to me that if people are using that version of WME, with older applications (as Tony pointed out in an earlier post, most people don't upgrade software, or patch it, on a regular basis) - then it is a perfectly valid measure of performance because that is the performance people can expect with those particular versions of software.

We are enthusiasts and know how to extract the most performance - sometimes we forget that.
