Cache size and RAM performance

Tannin · Jun 21, 2002

In another thread, I mentioned the excellent performance of the old K6-III CPUs on integer tasks. Then Cas said:

...my 400MHz K6-III on a P5A outperformed my 500MHz Xeon box handily with one of its processors disabled. Of course, even the K6-III couldn’t keep up with the dual configuration. Fortunately the datasets for compilation tend to fit nicely into the K6-III’s cache. I suspect that the device would not have fared so well in a database server. Despite its impressive cache hierarchy, performance dropped off quickly for applications with larger datasets. The Xeon offered almost twice the memory bandwidth.

Now this raises an interesting question. My main workstation/server was a K6-III 500 for a long, long time. (Actually a 450+ overclocked to about 560 - 5x multiplier, 112MHz FSB.) I replaced the K6-III several times but could never actually make the thing go faster in all circumstances than the old K6-III. I slipped in a Duron 700, an Athlon 1000, an 1100, played with P-IIIs, even a Thunderbird 1200 and a 1333 or 1400, but I always kept getting dissatisfied with the performance and downgrading it back to the old K6-III again. The bigger Athlons were faster at some tasks, of course, but it wasn't until I went XP 1700 and DDR that I finally had a machine that was faster in all circumstances.

Now in explaining this to someone (a customer, I think it was), I remember being asked how it was that the XP could be faster when it only had 384k of cache, as against the K6-III's 1.3MB. And I remember oversimplifying it by saying that, given the vastly greater clock speed of the 266MHz DDR as compared to the 100 (or 112) MHz tertiary cache RAM on the K6 board, it wouldn't be too far off the mark to think of the Athlon's RAM as being "all cache".
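The clock-speed half of that comparison can be put into rough numbers. The sketch below is only an illustration: the function name is made up, and the 8-byte (64-bit) data path for both the DDR and the motherboard cache is an assumption rather than something stated in this thread. It gives peak theoretical transfer rate only and says nothing about latency.

```c
#include <assert.h>

/* Peak theoretical transfer rate in MB/s (1 MB = 10^6 bytes) for a bus
   that moves bus_bytes per transfer, mega_transfers million times a second.
   Hypothetical helper for illustration; real sustained rates are lower. */
double peak_mb_per_s(double mega_transfers, unsigned bus_bytes)
{
    assert(mega_transfers > 0.0 && bus_bytes > 0);
    return mega_transfers * (double)bus_bytes;
}
```

Under those assumptions, peak_mb_per_s(266.0, 8) comes to 2128 MB/s for the DDR, against 800 MB/s at 100MHz or 896 MB/s at 112MHz for the board-level cache.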

Was I oversimplifying? I know the clock speeds, of course, but what about the latency? I've seen figures for bits and pieces of this here and there around the traps - notably in an excellent article over at Ace's comparing the caching strategies of the Athlon Classic, the Coppermine P-III, and the then-unreleased Athlon Thunderbird - but never managed to gather it all together in one place.

How many CPU wait states are involved with a secondary cache miss for an XP 1800 running 266MHz DDR, for example? And how does that compare with a secondary cache miss (i.e., a tertiary cache access) for a K6-III? And so on.
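For anyone who does have latency figures to hand, converting them into CPU cycles is simple arithmetic: cycles = latency in ns x core clock in MHz / 1000. A minimal sketch follows; the function name is hypothetical, and no measured latencies are implied.

```c
#include <assert.h>

/* Core clock cycles that pass while the CPU waits latency_ns nanoseconds
   for a memory access. Hypothetical helper; inputs are supplied by the
   caller and are not measurements from this thread. */
double stall_cycles(double latency_ns, double core_mhz)
{
    assert(latency_ns >= 0.0 && core_mhz > 0.0);
    return latency_ns * core_mhz / 1000.0;
}
```

With a purely illustrative placeholder of 100 ns, a 400MHz core would sit idle for 40 cycles, while a 1533MHz core (the actual clock of an XP 1800+) would wait about 153. The same miss costs a faster CPU more cycles, which is why latency matters as much as clock speed here.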

Anyone care to write a mini-thesis on this?

cas · Jun 22, 2002

There is no question that the K6-III was a fast processor for its day. At a time when its chief competition used an off chip L2 cache at one half the core speed, the K6-III was equipped with 256K of full speed, on die L2 cache.

Even when compared with Xeons, featuring very expensive SRAMs on the backside bus, the lower latency and shorter pipeline of the K6-III prevailed. Or at least it did, at first. Many will remember that AMD had a hard time manufacturing large quantities of K6-IIIs, due largely to the lack of redundancy in the L2 cache array. The K6-2 remained the volume leader, and ultimately scaled to high clock speeds.

Although I agree that the K6-III was competitive at introduction, I believe that you are greatly overstating your case when comparing it to the Tbird Athlon. For example, while both processors have 256KB of L2 cache, the Athlon’s cache is exclusive and running at a higher frequency. Of course, the Athlon offers twice as much L1 cache as well.

I performed a simple test using traditional MOV (32-bit) and newer MMX MOVQ (64-bit) instructions. The K6-III was clocked at 400MHz, and mounted on an Asus P5A with a standard 100MHz FSB. The P5A was configured with 512K of L3 cache. The Athlon was clocked at 800MHz, and mounted on an Asus A7V with a double-pumped 100MHz FSB. Both systems were equipped with SDR SDRAM.

No prefetching or other highly specialized instructions were used. The MOV results in particular are pretty close to what you might expect to see from a standard (non-media) C program.
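A test of this general shape can be sketched in portable C. This is not the code used for the results in this post: the function names are hypothetical, plain 32-bit and 64-bit integer loads stand in for the hand-written MOV and MOVQ loops, and a modern optimizing compiler may vectorize either loop, so the two paths will not necessarily map onto those instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static volatile uint64_t sink; /* stops the compiler discarding the reads */

/* Sum a buffer using 32-bit loads. */
uint64_t sum32(const uint32_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Sum a buffer using 64-bit loads. */
uint64_t sum64(const uint64_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Read a buffer of `bytes` bytes `passes` times and return MB/s
   (1 MB = 10^6 bytes). Returns -1.0 if allocation fails and 0.0 if the
   run was too short for clock() to register any elapsed time. */
double read_mb_per_s(size_t bytes, int passes, int use64)
{
    assert(bytes >= 8 && bytes % 8 == 0 && passes > 0);
    void *buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;
    memset(buf, 1, bytes); /* touch every page before timing starts */

    clock_t start = clock();
    for (int r = 0; r < passes; r++)
        sink += use64 ? sum64(buf, bytes / 8) : sum32(buf, bytes / 4);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    free(buf);
    return secs > 0.0 ? (double)bytes * passes / 1e6 / secs : 0.0;
}
```

Calling read_mb_per_s for buffer sizes doubling from a few KB up to several MB, with enough passes at each size to run for a measurable time, produces the kind of bandwidth-versus-dataset-size curve discussed here, with steps where the data stops fitting in each cache level.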

A number of things are immediately apparent from the graph.

The most interesting is that, despite the K6’s total cache advantage (64+256+512=832K vs 128+256=384K), there is no data set size for which the K6 is faster than the Athlon. In fact, it is difficult to identify the effect of the L3 cache at all. If you look very closely, you will see that the Athlon drops to a steady 610MB/s or 910MB/s rate, when reading directly from SDRAM. The K6 on the other hand degrades gradually from roughly 400MB/s to 300MB/s as the dataset exceeds the size of the L2 cache.

Also interesting is the difference between standard and MMX instructions on each processor. Both processors see exactly a 2x improvement when using 64bit loads from L1 cache. When reading from L2 cache this difference drops to 1.5x on the K6. Finally, the Athlon sees essentially the same performance when reading from the L2 cache, regardless of the specific instruction used. When reading directly from SDRAM, these roles reverse. The Athlon’s difference grows to about 1.5x, while the K6’s delta collapses to almost zero.

The difference in the size and organization of the caches should be apparent as well. The L1 cache of the Athlon is twice as large as that of the K6. More interestingly, while the L2 caches of both CPUs are 256K, the curve of the Athlon extends farther because of its exclusive design.
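The effect of an exclusive design on usable capacity can be shown with a little arithmetic. In the sketch below the function name is hypothetical, and treating the non-exclusive case as one where L2 simply duplicates whatever is in L1 is a simplifying assumption, not a statement about how the K6-III actually behaves.

```c
#include <assert.h>

/* Distinct data (in KB) that an L1+L2 pair can hold at once.
   Exclusive: a line lives in L1 or L2 but never both, so the sizes add.
   Otherwise (assumed fully duplicating): L2 repeats L1's contents, so the
   larger of the two levels is the limit. */
unsigned effective_kb(unsigned l1_kb, unsigned l2_kb, int exclusive)
{
    assert(l1_kb > 0 && l2_kb > 0);
    if (exclusive)
        return l1_kb + l2_kb;
    return l2_kb > l1_kb ? l2_kb : l1_kb;
}
```

With the figures quoted earlier, effective_kb(128, 256, 1) gives 384K for the Athlon, while a 64K L1 in front of a duplicating 256K L2 would give only 256K.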

For all datasets, the Athlon is substantially faster than the K6. In some cases, this difference is dramatic. The Athlon is up to 4x faster in the range of the L1 cache, 2x-5x faster in the range of the L2 cache, and 2x-3x faster from SDR SDRAM.

While the K6 does have a shorter pipeline than the Athlon, and will do well on branch-heavy code, I am afraid that this is not sufficient to make up for the 2x difference in clock speed. Certainly, the K6 will not be saved by its much vaunted tri-level cache.

I am happy to hear that you finally upgraded from your K6-III machine. Now if only I could get you to dump OS/2.

cas · Jun 22, 2002

I noticed you used Opera 6.03 on Windows 2000 to read my post.

I am feeling better already.

BTW, the P5A is running sync with a 100MHz SDRAM clock. The A7V is async with a 133MHz SDRAM clock. Both modules are CAS 2.

Tannin · Jun 22, 2002

Thank you for a detailed and very interesting reply, Cas.

This is my home machine, the one that I run Opera on. Nothing too vital happens on this machine, but I like it to be reasonably stable. Win 2000 is just right for the job. My other five main machines all run different things for different jobs.

The main office machine (an XP 1800), which is vital, runs ECS (aka OS/2 5.0) because it offers superior task switching, better support for the applications I use, vastly better security, and superb stability.

The main workshop box (a K6-III) runs NT 4.0 because it's reasonably stable and supports the only two applications that that machine needs: Nero and Easy CD Creator. One day, when I get around to it, I'll get hold of some CD burning programs that are less fuss-prone and switch it over to either OS/2 or (more likely) Linux. That thing used to run W98 and despite numerous hardware changes and clean installs, it was never, ever trustworthy. NT was a vast improvement.

The little workshop box (a Pentium 200 Classic) runs DOS 6.3. It's surprising how much work it still does: mostly formatting floppies and copying drivers and the like, but also playing audio CDs to help while away the hours. It's the only machine in the whole office that has a sound card in it - a Sound Blaster AWE 64, which in combination with a freeware TSR, does just fine for playing music through.

The showroom machine (another K6-III) runs OS/2 4.5 and does double duty as the reserve server and backup repository, and as the machine we use to look up prices, print invoices, and so on. With this one too, reliability and application support are the main criteria, so either OS/2 or ECS is a requirement. It is also the loudest machine I own, as it has four hard drives attached to it: two 1st gen 10,000s and a pair of 9GB IBM 7200s. The drive enclosure is heavily padded but it is still loud enough to discourage people from sitting at the front desk all day. This is my variation of the "make your customer comfortable enough to buy but not so comfortable that they take all day doing it" theory! It's a little more creative than just having a hard chair. Besides, the old 10,000s have sentimental value for me, and though I couldn't bear the noise any longer at home, or in the back office where they migrated, nor in the workshop where they migrated after that, I like still having them do something useful.

The second back office machine (a K6-III) is dual boot: OS/2 for doing paperwork (for cross-referencing between several different accounting apps on the main 21 inch Mitsubishi and the 19 inch Velta while I fake ... er I mean prepare ... my tax returns) and Windows 98SE for dealing with those very few emails people send me that are in proprietary formats that I actually want to read. Nothing on that machine is critical, so if it gets virus infested I can just FDISK it and reinstall.

--------------------

But I'm not kidding about the Athlons being inferior to the K6-III. There were definitely some tasks at which even the Athlon "C"s did not measure up. Bear in mind that I'm talking about a very specific workload here: almost 100% integer, a reasonable proportion of 16-bit code (I still use quite a number of DOS apps because they are (a) faster, and (b) paid for), and heaps of task switching. I don't think that there is a single app on that machine that uses any significant amount of CPU power, but it's rare for me to have fewer than 15 or 20 windows open at any one time, and I flick between them incessantly.

I had always assumed that it was the large cache on the K6-III that explained the underwhelming performance of an Athlon 1000 (and the like) in this role, but perhaps your comment about pipeline depth is something I should consider. But if that were true, then a K6-2 should do almost as well - which was not the case.

Now: time for me to sit down and look more carefully at those graphs of yours, see if they can help me puzzle this out. (Not that it matters for any practical purpose now, of course, but mysteries are always interesting.)
