Benchmarking....Again

Buck · Jun 7, 2002

I bring this illusory subject into our forum because of what I have read at Storage Review and in certain threads at our forum. I use the term illusory because the idea of benchmarking is rather deceptive from one set of standards, realistic for some, and confusing for many individuals; confusing, because although people like seeing some numerical or graphical representation of performance, very few people can properly explain why their benchmark is worth considering. Oh, they all have marketing pitches involving low-level checks of the drive, high-level checks, consistent methodologies, etc. Granted, there are people with very honest intentions, but they are few. There is a great deal of debate about benchmarks being unrealistically affected by cache. Some portend the excellence of seek times, while others don’t really care.

Obviously many in this forum have opinions about this subject, and so I thought it would be good to discuss some of that information here. Yes, there are discussions about this elsewhere, but we have some pretty intelligent folk here. If, of course, you wish to hold this discussion somewhere else, I can understand.

CougTek · Jun 7, 2002

Why do you ask that so late, in the middle of the night? You would get far more eloquent answers in the morning when everyone is fresh.

I'm too lazy to search the thread where I wrote my opinion about hard drive benchmarking at SR, but the main lines were that people shouldn't attack IPEAK, but the applications monitored by IPEAK. WinTrace32 creates a perfect copy of the read/write accesses performed on the drive while the reviewer launches and uses various applications. Blaming WinTrace 32 and the other analysing tools within IPEAK for controversial results (to some) is futile IMO, because it's not the fault of IPEAK itself if the end-results aren't those that are expected, but the fault of the applications that are monitored by IPEAK.

You could always argue with Eugene on the monitored applications choice, but arguing against IPEAK in itself is vain.

That was, at large, my position on the matter. Please wait one hour before replying to this post if you completly disagree with myself, so that I'll be sleeping and not conscious of your attempt at demolishing my arguments. Or else, I will a) not sleep well if I someone find enough will to refrain myself from posting so late or b) reply a foreseeable impulsive and rushed defense, predictably full of holes mainly due to my sleep deprivation. Please, be nice.

Tea · Jun 7, 2002

The key to the problem, as I see it, is not the benchmarking program itself, but the things that the benchmarking program measures.

Now here I am not talking about the tasks that are taken as representative tasks for the Ipeak trace or the Winstone run. I don't have any particular argument with those, though I have not examined them closely. I suspect that they are excessively oriented toward the single-tasking user, but even if this is so (and I have no evidence to support or refute it) I doubt that it is more than a minor issue.

No, the problem with Ipeak and Winstone and other benchmarks of that ilk is that they measure everything without discrimination, where the human being (who is, after all, the thing we are mainly interested in) does not.

P5-133XL · Jun 7, 2002

The first question one needs to ask about benchmarking is what do people want from the benchmark. After knowing what they want then the benchmark is produced to reflect that want. I believe what people want out of a benchmark is a reliable indication of how the HW will perform in their machines with their applications.

The problem is that there are alot of people with different machines and different applications. How is a benchmark going to accomplish the goal considering the variablity of the variables. for example internet users have different demands than office users than gameers, than developers... To further complicate the issue is that everyone has different CPU's, different memory capabilities, different OS's, different controllers, ...

This results in chaos and the total inability to model how a specific HW being benchmarked will run on anyones machine with anyones applications except the benchmark on the tester's machine. resulting in the general total uselessness of benchmarks as a model of anything. With the above in mind, as long as the benchmark is doing known activities and known HW then the astute user may be able to extrapolate how a specific device may be able to deal with their own HW. Further if the tester was excruciatingly consistant while testing different devices and tried mightily to eliminate as many variables as possible from his testing, then a user may be able to come to some conclusion about the differences between devices.

The end result is that benchmarks are totally useless without strict controls, and total knowledge concerning the HW and SW being used, and the benchmarks themselves. Maybe, if you are very bright and with perfect testing, you may be able to extapolate the performance, but keep in mind the error factor may be very high in the best of circumstances; There are just too many variables that can't be controlled.

Clocker · Jun 7, 2002

Mark points out an obvious fact that many of us forget about....benchmarking can be very complicated.

It makes me think that, sometimes, the only reliable and consistent benchmarks we will ever be able to get that eliminate most of the variables present from system to system (for HDDs) are the lowest level measurements possible. Things like seek time and STR. Of course, that would ignore some of the advantages different drives have (i.e. the increased capability of the SCSI interface, IDE drives with big buffers, etc.).

The reason I say that is that it seems here is some pretty good correlation between (seek time & STR) and overall performance It seems fairly obvious that each generation of HDDs gets faster at the benchmarks (i.e. iPeak) each year and it is equally obvious that the low level benchmarks (for the most part) have improved too. Based on past history, I think we can assume that HDDs with better low level performance will also perform better at the higher level benchmarks (or just perform better in general).

We'll never really be able to quantify the effects of different firmware optimizations, interfaces, buffer sizes etc for all systems and (especially) all different softwares. So, maybe we should just ignore those effects and focus on something we can understand and quantify.

C

Tannin · Jun 7, 2002

That, Clocker, is my approach in a nutshell. Tea will doubtless continue her theoretical ponderings a little later on (maybe late tonight, our time) but seeing as I am the practical one around here - Tea just wants to crunch proteins, leaves it up to me to earn enough to feed the two of us - I have to make purchasing decisions with the information we have here and now. Sure, I'd love to see a better all-round high-level benchmark than any so far discovered, but here and now, I trust the low-level stuff the most.

In fact, I have yet to see any high level hard drive benchmark that manages to correlate better observed drive performance than a simple combination of access time and DTR, with the emphasis on access time. Indeed, I don't think it would be too hard to advance the argument that this produces better results than any high-level test that I'm aware of.

P5-133XL · Jun 7, 2002

I have to admit, that I only pay attention to low-level benchmarks too. I do expand my list of characteristics to include things like Vseek and DTR graphs, comparetive tempaturess, noise readings, and I will include the general consenous of peoples opinions as to the drive. But that is the only type of benchmarks I will pay attention to. The application simulations, the Winmarks and stuff like that don't tell me enough to do anything with. I also note that the very poor reliability of modeling using benchmarks means that I only pay attention to the gross differences: Small subtle differences are all in the noise.

Tannin · Jun 8, 2002

My approach exactly, Mark. Average access time, DTR, a Vseek chart, general opinion - these I pay attention to. High-level tests I only glance at. And, as you say, the differences have to be major to be worth bothering with.

time · Jun 8, 2002

Sorry, I don't believe Vseek is able to accurately measure newer drives with tricky seek algorithms. A quick comparison between results from Vseek and say, Winbench 99 reveals that the former is sometimes nowhere near a manufacturer's specification. Yet higher level tests may tend to support the manufacturer's claims.

I think Vseek assumes the drive is dumb and will perform repetitive large stroke seeks without delay, but that's by no means guaranteed with smarter firmware. Issues such as noise, heat and reliability may be more important to the drive manufacturer, for example.

BTW, where did this version of Vseek come from? It appears to be a ripoff of a tool from Golden Bow.

Tea · Jun 8, 2002

Sulo Kallas wrote it.

Tea · Jun 8, 2002

Could you expand a little on that, please Time? I'm not clear on your meaning.

time · Jun 8, 2002

I first encountered a program called Vseek in 1988. I bought the Vopt suite that included it in 1992 or thereabouts. It looks very similar to and performs the same function as Skallas' version.

What I'm struggling to say about seek behaviour is that you can no longer be certain that a drive will perform seeks as you might expect. Witness the varying behaviour of the Barracuda IV in particular. Have you wondered how a firmware change could dramatically affect even STR as is apparently happening with their variant for RAID?

If one wanted to cut down noise, for example, the firmware might check for rapid repetitive behaviour and reschedule with random intervals. This would have little impact on real world performance, but could cause certain benchmarks to nearly collapse.

Or Vseek may be assuming that the drive immediately seeks when requested. But the drive may decide to wait and try to guess what the requests are trying to do.

Tea · Jun 8, 2002

Sulo wrote to me that he had called it Vseek in honour of an old utility of the same name, Time. Doubtless that's the one.

Your other thoughts I'll leave till the morning when I am a little more awake.

Buck · Jun 10, 2002

I must say, the participation in this thread has been quite good. Not necessarily in the amount of posts, but in the clarity of opinions about this subject. For some reason, when this subject is discussed in other forums, the conversation becomes a debate and pride takes over. When this happens, responses tend to lose their positive communication.

Thank you all for you thoughts, and I look forward to more.

James · Jun 11, 2002

Actually, I'm very interested in this topic at the moment for selfish reasons - sorting out the optical reviewing methodology at SR.

At the moment I've tested a couple of drives using Tim's methodology -yes, I now have the infamous High Heat Baseball 2000 CD in my possession, ph33r me! - and while it delivers results in line with what you'd expect given previous reviews, I think it's worth revisiting the issue of what people actually want out of the optical reviews. I'd appreciate input from people here.

At the moment my thinking goes something like this :

CDR/RWs + DVD writers :
I think what people want to know is how fast does it burn a given CDR or CDRW disk. So probably the tests of imaging a CD like HHBB2k and an audio CD will remain. The Liteon 40x burns a data CD in 2:31 - quite a step up from my 6x burner! - but interestingly takes 20 seconds longer for an audio CD (only 10 seconds slower than the 32X unit). This is all good stuff to know.

What about reading speed? Do people still have seperate CDROM readers and CD writers? Dunno, but perhaps it's worth including general performance statistics given the heavier head mass in a CDR/CDRW unit. Nice graphs of read and write speed and DAE speed have a place here most likely. Note we don't do write speed graphs at the moment.

Mount Rainier conformity is important to test, does anyone have any ideas how to do this? I haven't seen any testing programs out there.

Time to format a CDRW disk is of interest, but less so post Mt. Rainier where it can do background formatting.

What about C1 and C2 error levels, like CDRInfo measure? I see less utility here. As long as 99.99% of CD readers out there will successfully read the newly burned disk, is this such a big deal?

How about reading of copy protected CDs (again like CDRInfo)? This is probably an important consideration for many, but I don't see a lot of feedback asking for it. Scratched CD reading (over and above what CD-Check tests, that is, which is audio-based)?

What other characteristics are important to measure? Ability to burn at various speeds with different media? This I see as being less important because it's a moving target - CDR blank compatibility changes from firmware to firmware and may change from batch to batch of blanks. Plus, I'm a bit defensive of my rapidly disappearing free time.

Anything else?

Note that the new testbed includes USB2 and Firewire ports to test external burners.

CD and DVD readers:
DAE speed is obviously important. Transfer rate graphs (any big dropoffs?). DVD playback noise level is important for me. Test DVD playback skipping but I'd be surprised if any fail. Is ability for region hacking of firmware important to people, and should SR condone such activity? (Probably not.) What's the most stressful thing that a CDROM reader do? I agree with the idea of testing a device under the most unfavourable conditions where such actvity takes place in normal usage of the unit.

Thoughts anyone?

Fushigi · Jun 11, 2002

James, here's what I am most interested in when it comes to optical drives:

CDR/RWs + DVD writers :
Writing: Burn speed for data discs & media compatability would be my top criteria. And I really mean burn speed, not disc duplication speed using a single drive. Most of the time my source data is on my hard drive so combined read/write speed doesn't matter to me. Next on the list would be error handling / SafeBurn / SmartBurn type stuff; does it work and how well? Mt. Rainer performance testing would be nice, although I've yet to find a personal use for it.

Reading: For data, I don't care too much as long as it's 32X or greater. Unless it has dramatically better than average seek times or really dramatically better read performance (like a TrueX drive), I likely won't notice the extra performance. DAE is different and I would be very interested in DAE speed & quality.

DVD writers, I don't know. It's not a technology I'm ready to buy in to just yet.

CD and DVD readers:
For CD reading, see above. For DVD readers, disc compatability is paramount. The drive must be able to handle any DVD I put into it. Region changing/disabling is be very important to me but maybe less so to others. I view region changing as an enabler for viewing content that we could not otherwise view. I buy R2 (Japanese) DVDs and require R2 capablity. But I still need R1 for the 400+ R1 discs I own. The R2 discs I purchase are not available in the US at any price, so it's not an issue of avoiding one region's release vs. another's. Not to mention Japanese R2 discs cost 3x the US equivalent when they are available (and mostly don't have an English audio track). Of course the reverse is also true; a Japanese person could want a region-free system so he could buy US discs at 1/3rd the Japanese price. But this is still not illegal or even unethical; it's simply a matter of adjusting for a region's economics. It's the fault of the Japanese distributors that their discs cost so much.

- Fushigi

CougTek · Jun 18, 2002

James,

Sorry not to have replied to your message sooner (about CD-RW review tips/requests/opinions). You probably already read it, but if you didn't, this article at CDR-Info regarding the writing quality might be a good read (I only browsed it so far). I know these guys spammed us, but nonetheless, when they talk about CD writers, they know their stuff.

I will come back on this subject later and I encourage others to participate too. It would be a shame if no one would help you with ideas for your reviewing methodology. We are supposed to be storage experts afterall.

CougTek · Jun 18, 2002

And I also encourage you to orfer the CD Winbench CD-ROM. E-testing Labs' softwares are relatively well regarded as benchmarking tools in the industry, it would add an "official" dimension to your reviews.

I didn't use it in my review of the OptoRite burner because I didn't had the time to order the CD (I only had the burner for two/three days), but I think you must have enough time.

Regarding the importance of measuring the C1/C2 errors, I think it can be helpful. If you read the latest comparative article about 48X burners at Spammers Info (oops), you'll see that it can help to visualize why some CD-R medias were unreadble or slower to read when they were burned at maximum speed.

James · Jun 19, 2002

Coug,

Thanks very much, I'll have a look at the article when things are a bit less frantic at work.

I have the full set of tools from Tim, which means I have 5 copies of CD Winbench, among other things. I also have CDTach, DVDTach, etc.

The problem I have with measuring C1/C2 errors is that I don't know of a tool to do it. CDR-Info sends all their stuff out to a third party to do the testing and I don't have the time (sending things to and from the US) or the money to cover that sort of activity. If someone can find a tool that will do the measurement for me, I'd be very happy to consider it.

Benchmarking....Again

Buck

Storage? I am Storage!

CougTek

Hairy Aussie

Tea

Storage? I am Storage!

P5-133XL

Xmas '97

Clocker

Storage? I am Storage!

Tannin

Storage? I am Storage!

P5-133XL

Xmas '97

Tannin

Storage? I am Storage!

time

Storage? I am Storage!

Tea

Storage? I am Storage!

Tea

Storage? I am Storage!

time

Storage? I am Storage!

Tea

Storage? I am Storage!

Buck

Storage? I am Storage!

James

Storage is cool

Fushigi

Storage Is My Life

CougTek

Hairy Aussie

CougTek

Hairy Aussie

James

Storage is cool