CougTek said:
I didn't know Barton was rumored to be cancelled. Thanks for bringing this info.
Well, I've seen this tidbit more than once on the web, and it simply makes sense given AMD's current roadmap : if T-Bred is to ship in June and AMD swears CHammer will both ship this year (i.e. December? I don't suppose it'll be any earlier than that) and be positioned for the desktop (as opposed to servers - it's not even MP-capable), I hardly see any meaningful role Barton could play. Either it'll cut T-Bred's useful life to a mere 3-4 months, which is laughable for a core, or it'll interfere with CHammer's introduction, or both...
I don't think the problem with the Athlon's L2 cache is that it's slow.
Now, this is a matter of semantics. What is "slow" anyway? Is a cache with 4.6ns latency "slow"? And what about 20ns, is that "slow"? There's no universal measure you can tie speed to; speeds can only be compared relative to one another to obtain qualifiers such as "slower" or "faster", and as far as CPU cache speeds go - relative to the respective CPUs. For a 100MHz CPU, a cache with 20ns latency is blazing fast at just 2 clock cycles of latency, but for a 1.5GHz CPU it's a snail at 30 cycles. It gets worse, though: if you compare those latencies in terms of their respective clock cycles, you'll realize that a 4.6ns cache is "slower" for a 1.5GHz CPU than the 20ns cache is for the 100MHz CPU, since 7 cycles is over three times the latency of 2 cycles.
Now, to the situation at hand : Athlon XP's cache vs P4's vs PIII's. You say that you don't see the Athlon XP's L2 cache problem in the fact that it's slow, but it's a given that the faster a CPU's caches are, the faster it works. Now let's compare :
- P4's L1 cache is indeed blazing fast - just 2 cycles of latency. But there's a price to pay: it's only 8kB large.
- P4's L2 cache has an access latency of 7 cycles (i.e. 9 cycles load-to-use after an L1 miss) and it's 512kB large.
- P4 Xeon's L3 cache has a latency of 14 cycles (a total of about 23 cycles after L1 & L2 misses) and is either 512kB or 1MB in size.
- PIII's L1 cache has a latency of 3 cycles.
- PIII's L2 cache has an access latency of 4 cycles (for a total of 7 cycles after an L1 miss).
- Athlon XP's L1 cache has a latency of 3 cycles, and it's an enormous 128kB (split into a 64kB I-cache and a 64kB D-cache!).
- Athlon XP's L2 cache is 256kB in size. And here's where things get tricky - it has a total latency (L1 miss + L2 hit) of 11 cycles.
Sort of. Kind of. Almost. Well, it does, except its latency is 20-21 cycles.
What???
How can it be both 11 and 20-21 cycles at the same time, you ask? Quite easy - it's either 11 or 20-21 cycles, depending on the circumstances. Let me explain. The Athlon XP has an exclusive cache architecture, i.e. L2 acts as a copy-back buffer for L1 once data must be evicted from L1. To save time on this copy-back, L1 has a victim buffer: if data must be evicted, it can be copied into this victim buffer at the same time as the new data is fetched from L2 (if there was an L2 hit at all!), and then copied back from the victim buffer into L2 for later propagation to memory. Sounds brilliant, no? Just 3 cycles for the L1 miss + 8 cycles for the L2 hit (while also copying stuff into the victim buffer) = 11 cycles.
But there's surely a catch : the victim buffer is only 8x64 bytes (8 cache lines) large. And if the victim buffer happens to be full, you get the full cache latency : 3 cycles for the L1 miss + 8 cycles to copy the victim buffer's contents into L2 + a 2-cycle L2 turnaround + 8 cycles to get the first data word out of L2 = 21 cycles. So why the heck was I mentioning 20-21 cycles? Because on the web I keep seeing the 20-cycle figure, while my own math - as given above, using the figures in AMD's manuals - gives 21 cycles. Cachemem also indicates 20 cycles, so either there's one cycle AMD managed to somehow shave off since their docs were published, or the docs were off by one cycle (perhaps in the turnaround?). Either way that's not critical: both 20 and 21 cycles are more than twice the P4's L2 latency and come within 2-3 cycles of the Xeon's L3 latency.
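(To lay that arithmetic out in one place, here's a little Python sketch of the two cases - the cycle counts are the ones quoted above from AMD's docs, the split into named steps is my own bookkeeping:)

[code]
# Athlon XP L2 load-to-use latency, per the cycle counts discussed above.
L1_MISS    = 3  # cycles to detect the L1 miss
L2_HIT     = 8  # cycles to get the first data word from L2
WRITEBACK  = 8  # cycles to drain the victim buffer's contents into L2
TURNAROUND = 2  # cycles of L2 bus turnaround

def l2_latency(victim_buffer_full):
    if not victim_buffer_full:
        # Eviction goes into the victim buffer in parallel with the L2 fetch.
        return L1_MISS + L2_HIT                       # = 11 cycles
    # Buffer full: drain it into L2 first, turn the bus around, then fetch.
    return L1_MISS + WRITEBACK + TURNAROUND + L2_HIT  # = 21 cycles

print(l2_latency(False), l2_latency(True))  # 11 21
[/code]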
Now comes the question - how often does the victim buffer get filled (i.e. "how often do 11 cycles turn into 20")? The rule of thumb suggests that when there's an oddball struct that must be accessed in memory, the latency will be 11 cycles as advertised. But on large streaming accesses (as with various multimedia loads etc., I suspect) you're down to the good old 20 cycles, just like the first Athlon models...
So, yes, I maintain that compared to the 256kB of 7-cycle L2 cache on the PIII and the 512kB of 9-cycle L2 cache on the P4, the 11/20-cycle L2 of the Athlon XP is relatively slow. And relating it to other caches is the only way I know of deciding whether a particular cache is "fast" or "slow"...
It is fast, but it is too narrow (accessed only 64 bits at a time, compared to Intel's 256-bit bus). I agree that enlarging the bus to the L2 cache should be a high priority for AMD in their future processors.
While widening the bus to L2 certainly wouldn't hurt the Athlon's performance (provided L2's latency didn't get any higher!), AMD keeps maintaining that thanks to the large L1 the performance benefits would be negligible. And on this one I'm rather inclined to believe them. An increase in the size of the Athlon's L2 and/or bringing its latency down to the PIII's, or at least the P4's (and making it a uniform latency!!!), would be very nice though...
<sigh, as the chances are slim, at least until CHammer - IF even then>
AMD knows that it will need ClawHammer ASAP, especially if they have to rely solely on Thoroughbred until their 64-bit CPU arrives.
Knowing you're in deep doo-doo doesn't necessarily mean you have a way out. CHammer was supposed to be out late last year or early this year at the latest, yet AMD will be lucky to push it out the door by the end of the year. I can't blame them for their pace of advancement - core development and validation are a PITA - but I hate overly optimistic statements... I do wish them luck, though, as I'd hate it if we were back to the years of Intel being the performance leader with everybody else playing catch-up in the low end...
Compared to the situation prior to the introduction of the first Athlon processor and the need to make an all-new chipset to support the EV6 bus, there seems to be a lot less improvisation now with ClawHammer than back then.
I beg to differ. CHammer won't be using the EV6 bus, so VIA can forget about farting out another KT266 re-touch. The good thing is that long gone will be the days of relying on chipset manufacturers to come up with fast memory controllers. The bad thing is that AMD will still have to rely on other manufacturers to provide high-performing AGP and PCI (Uh-oh! :eekers: ) controller implementations.
Unless AMD pull their collective head out of their collective arse and start supporting their CPUs with reliable (yes, including the USB!), feature-rich in-house chipsets.
I hope Hyper-threading is something that AMD will eventually try to mimic in their future designs, although I don't think it will happen any time soon.
It's not so much about mimicking, as Intel aren't pioneers in this department and HyperThreading is not the ideal implementation of on-die multithreading; rather, it's something all CPU manufacturers are likely to implement in one form or another (spare EPIC-style architectures, I guess).
Generally speaking, there are three methodologies for increasing a CPU's speed in today's multi-tasking world (as opposed to DOS, for example) :
1. Increase the CPU's clock speed. Pretty much self-explanatory; everybody has been doing this forever. But there appears to be a problem with this approach: your CPU begins to turn into a perfectly viable home heating appliance, the die shrinks become harder to make, and their benefits are diminishing.
2. Increase the CPU's raw computational power. In a generic form this approach has also been around forever : the 80286 executed instructions in fewer clock cycles on average than the 8086, and most other CPU manufacturers have done about the same. Later on came the idea of making CPUs super-scalar - an attempt to throw raw silicon at the problem. In even more recent times the idea behind vector processors has found its way into scalar (by now super-scalar) CPUs : SIMD. MMX, SSE, SSE2, AltiVec, VIS and other SIMD implementations are just another way of throwing more silicon at the problem - but in a more elegant way. Those solutions (spare SIMD, perhaps) hit another brick wall almost immediately after their introduction - you can't properly utilize those extra EUs given the existing code.
3. Increase the CPU's efficiency. In other words, raise the effective (as opposed to theoretical) IPC of the CPU. Making a CPU super-scalar turned out to be not enough : the dumb programmers (and compiler writers) refuse to line up their instructions in ways suitable for filling all the execution units and have a bad habit of using branches in their code. Thus came about caches, register renaming, branch prediction, out-of-order execution, trace caches, TLBs and others - their sole goal being to find ways to keep the execution units working as much of the time as possible.
Contrary to #1 and #2, solutions of type #3 are somewhat difficult to come up with and implement, as they require a lot of brainstorming and a lot of testing, not to mention the silicon. The majority of those solutions are already incorporated into pretty much all CPU architectures today - some shine on RISCs, some do better on CISCs - but in general they're already used up. Yet the average effective IPC for the x86 lines of CPUs is still somewhere between 2 and 3, far below the maximum theoretical rates.
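(To put a rough number behind "far below the maximum theoretical rates", a back-of-the-envelope sketch - the 9-execution-unit figure is the one usually quoted for the Athlon, and the IPC values are just the 2-3 range mentioned above, so treat it as illustration only:)

[code]
# Rough measure: instructions retired per cycle relative to the number of execution units.
def eu_utilization(effective_ipc, execution_units):
    return effective_ipc / execution_units

for ipc in (2.0, 3.0):
    print("effective IPC %.1f -> ~%.0f%% of the units busy" % (ipc, 100 * eu_utilization(ipc, 9)))
[/code]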
On-chip multithreading is just another elegant solution of type #3 : an attempt to improve the CPU's efficiency. It is already present in slightly different forms in two mass-produced CPU lines that I'm aware of, and was being implemented in a much more elegant way in a third one which sadly will never see the light of day thanks to Intel's move to eliminate competition ("Cry, the beloved country!"), not to mention some more exotic designs! Sooner or later a good portion of the CPU architectures that survive Intel's/Wintel's attack will probably implement multi-threading in one form or another. So, yes, I presume that if AMD lives long enough, there's a good chance one of the later Hammers will sport some SMT capabilities. If the ex-Digital folks are still around by then, my crude guess is they'll go with 4-way SMT...
My point in the original post was that the dual-CPU platform from Intel wasn't very impressive compared to the current dual-CPU platform from AMD, despite their much-hyped Hyper-threading novelty and their new dual-channel DDR E7500 chipset.
"Ye ain't seen nofin' yet, boy, ye hear me? NOFIN'!" I'd love to watch dualie Athlon MPs with their puny 384kB of L1+L2 try to put up a good fight against a pair of 512kB L2 + 1-2MB L3 equipped Xeons running at 2.5-3GHz on a 533MHz bus, preferably with some multi-channel (2 or better yet - 4) Hastings RDRAM platform... Unfortunately I have some doubts about the RDRAM part, but the same on a dual-channel DDR chipset should be about as entertaining...
Intel simply matched AMD, they didn't beat them sharp and clear. Future tech from Intel matches current tech from AMD = problem for Intel when future tech from AMD arrives.
I hope the days of beating "sharp and clear" (by either side) are over - those were the days of expensive CPUs and no alternatives. I hope they never return, even if the side taking the beating is Intel. And those aren't "future tech" anymore - they're here, which makes them "present tech".
Worry not, Intel always has something up its sleeve...
As far as AMD's future tech is concerned... Off the top of my head I can't think of any [published] dramatic improvements in CHammer's core as opposed to Palomino/T-Bred, spare the on-die memory controller. To counter this single lower-latency DDR channel, by year's end Intel will have the 533MHz bus, dual-channel DDR chipsets and (if luck would have it) some 3rd-party PC1066 RDRAM chipsets. Perhaps some tweaks to the core as well (according to some info on the web, the P4's instruction latencies keep decreasing compared to the original P4 as released back in 2000, i.e. Intel keeps actively improving the core itself). So, as usual, only time will tell whose "future" is better...
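(Just to give a feel for the peak-bandwidth side of that matchup - the memory speeds below are my guesses purely for illustration, and latency, which is CHammer's real advantage, isn't captured here at all:)

[code]
# Peak theoretical memory bandwidth, in GB/s, for a few configurations.
def bandwidth_gbs(bus_bits, mega_transfers_per_sec, channels=1):
    return bus_bits / 8 * mega_transfers_per_sec * 1e6 * channels / 1e9

print(bandwidth_gbs(64, 333))              # single-channel DDR333 (a CHammer-style on-die controller?) ~2.7
print(bandwidth_gbs(64, 266, channels=2))  # dual-channel DDR266 chipset                                ~4.3
print(bandwidth_gbs(16, 1066, channels=2)) # dual-channel PC1066 RDRAM                                  ~4.3
print(bandwidth_gbs(64, 533))              # P4's 533MHz FSB - conveniently the same                    ~4.3
[/code]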
Come to think of it, if x86-64 actually catches on (and that's a big "if"), AMD may in fact do the industry much more harm than good. The x86 ISA (aka IA-32 (R)(TM)) is a brain-dead architecture that's been kept alive for the last two decades for the sake of application compatibility - at the expense of CPU performance. Look at all the impossible tricks CPU designers have to pull in order to remain x86 "compatible" yet at the same time create high-performance designs! Quite often I wish IA-32 would just die an agonizing death; we'd all switch over to something more suitable and be done with it. Mac users lived through it once, and nothing horrible happened.
(As a side note, I keep reading rumors that Apple may switch horses once more, to x86-64. Picture this : x86-64 doesn't catch on in the PC mainstream, PCs move on to EPIC-like Itanic successors, and AMD becomes Apple's source of CPUs, leaving the PC stage for good due to financial pressure from Intel and an inability to come up with a VLIW design that would fly with Itanic compilers. A horror movie, I tell ya, A HORROR MOVIE!!! :lol: Though I've seen stranger things happen to the industry in my lifetime...)
My guess was that Barton would be the first part of AMD's upcoming goodies.
Could you give me some links to some Barton info? I haven't looked too thoroughly, I must admit, but except for SOI and the rumor that it's cancelled I haven't found anything that would be up to date...
...if ClawHammer is on time and is as good as the rumors say.
Ah, yes, the hype. Well, I'm personally past the age of believing in elves, so...
I do hope CHammer will give Intel a run for its money, even though I have my misgivings about x86-64 in general, as I mentioned above.
Yep, that's about all I had to say at this time. :roll: No animals or CPUs were harmed in the making of this post.
P.S. Don't know why, but somehow the words "long winded" keep ringing in my head... :lol: