Details on quad-core AMDs

Pradeep

Storage? I am Storage!
Joined
Jan 21, 2002
Messages
3,845
Location
Runny glass
Looks like they have made quite a few changes, in many areas. I might skip the dual core gen and go straight to the quads in 2007.

http://theinquirer.net/default.aspx?article=35011

Barcelona is the first native quad-core AMD CPU, commonly but wrongly referred to as K8L. It is the first real new core from the chipmaker in quite a while, not merely a massaged version of what came before. You should be seeing it in Q2/07, although some whispers are now saying Q3.

Other than four cores, the most obvious difference is the newly widened SSE units. On the pre-Barcelona parts, SSE was done in 64-bit chunks, so a 128-bit operation needed two passes, possibly more. Widening SSE to 128 bits should immediately double throughput on SSE instructions. Obviously media operations will benefit, but HPC and FP-heavy ops will get a solid kick in the pants too.
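As a rough illustration of the "two passes" point, here is a minimal Python sketch. It assumes an idealized model in which an operation is simply split into datapath-sized chunks; the function name `passes_needed` is made up for this example and says nothing about how the hardware actually schedules the work.

```python
import math

def passes_needed(op_bits: int, datapath_bits: int) -> int:
    """Idealized count of passes to push one operation through a datapath.

    Assumption: the operation is cut into datapath-sized chunks and each
    chunk costs exactly one pass. Real hardware has more going on.
    """
    return math.ceil(op_bits / datapath_bits)

# A 128-bit SSE op on a 64-bit datapath versus a 128-bit datapath.
old_passes = passes_needed(128, 64)
new_passes = passes_needed(128, 128)
print(old_passes, new_passes)
```

In this toy model the widened datapath halves the pass count, which is where the "double throughput" figure comes from.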

In addition to the obvious width change, several less noticeable changes were made to support it. Instruction fetch was upped from 16B/cycle to 32B/cycle, support for unaligned load ops was added, and cache bandwidth was doubled. Last but not least, the FP scheduler was widened from 64 to 128 bits.

The enhanced IPC is more a case of little improvements adding up to a bigger bang than of any single thing standing out. The SSE improvements add a bit here and there to start with, and a better branch predictor adds to this. The predictor is bigger and better, with a larger history and a dedicated 512-entry indirect predictor, and the return stack is doubled in size.

On the catching-up-with-Intel side of things, AMD added a sideband stack optimizer, out-of-order load execution and data-dependent divide latency. They also upped the TLB to support 1G pages and 48-bit physical addressing, and improved the ITLB and DTLBs. There are also more fastpath instructions, a few more bit manipulation instructions and some SSE extensions.

For RAM efficiency, one of the main things they did was make the two memory controllers on the chip act independently; up to Rev G they could only act in lockstep. This lets AMD hit two memory locations at once, potentially a big win for server-type apps, but for single users its benefit is less clear.

On top of this, they changed the northbridge a lot, increasing buffers and adding support for new DRAM types. The Barcelona controller will do FBD if necessary, but the chances of you seeing that are something less than zero. AMD also updated the way paging is done and modified the way write bursting happens.

Additionally, one of the big complaints I hear is about prefetch, and that has been comprehensively addressed. The prefetchers now track positive, negative and non-unit strides, and have a dedicated prefetch buffer. On top of that, Barcelona is much more aggressive in how it fills idle RAM cycles, and AMD updated the core prefetchers on both the L1D and L1I side.
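To make "positive, negative and non-unit strides" concrete, here is a small Python sketch of a textbook stride detector. It is not AMD's design: the class name `StrideDetector` and the rule "predict once the same non-zero stride is seen twice in a row" are assumptions chosen for illustration.

```python
class StrideDetector:
    """Toy stride detector (illustrative only, not Barcelona's prefetcher).

    Assumption: a prefetch is predicted once the same non-zero address
    delta has been observed on two consecutive accesses.
    """

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the predicted next address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # The sign and size of the stride do not matter, only that it repeats.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


forward = StrideDetector()
forward_preds = [forward.access(a) for a in (100, 164, 228)]   # stride +64

backward = StrideDetector()
backward_preds = [backward.access(a) for a in (1000, 992, 984)]  # stride -8
```

The same few lines handle ascending, descending and non-unit strides because the detector only compares deltas, never assumes a direction or a step of one line.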

The new L3 cache is a 'victim cache': it sits on top of the existing discrete L2 caches and is shared among the cores. The big thing here is that if the caches are empty, a request fills into the L1 cache. When L1 fills, the data line is evicted to L2, and when L2 fills, it goes to L3. If a core reads from L3, one of two things can happen. If it is data, the line is moved to L1D, but if it is instructions, the line may be moved to L1I, or it may be copied to L1I and left in L3. The L3 is not exclusive, and the point of this is that code is often shared among the cores while data is far less often. The cache lines have performance hints associated with them that clue the cores in on the whole copy vs move debate.
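The eviction chain and the move-versus-copy behaviour described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: one core, a single L1 standing in for both L1D and L1I, LRU replacement, tiny capacities, and a boolean `shared_hint` standing in for the article's "performance hints". None of the names come from AMD.

```python
from collections import OrderedDict

class VictimHierarchy:
    """Toy L1 -> L2 -> L3 victim chain (illustrative, not the real design)."""

    def __init__(self, l1_lines=2, l2_lines=2, l3_lines=4):
        self.levels = [OrderedDict(), OrderedDict(), OrderedDict()]
        self.caps = [l1_lines, l2_lines, l3_lines]

    def _insert(self, level, addr, meta):
        if level >= len(self.levels):
            return  # evicted from L3: the line leaves the hierarchy
        cache = self.levels[level]
        cache[addr] = meta
        cache.move_to_end(addr)  # mark as most recently used
        if len(cache) > self.caps[level]:
            victim, victim_meta = cache.popitem(last=False)  # evict LRU line
            self._insert(level + 1, victim, victim_meta)     # push it down

    def access(self, addr, is_code=False, shared_hint=False):
        """Return where the line was found: 'L1', 'L2', 'L3' or 'MEM'."""
        for i, cache in enumerate(self.levels):
            if addr in cache:
                if i == 0:
                    cache.move_to_end(addr)
                    return "L1"
                meta = cache[addr]
                # Assumed policy: only hinted code lines are copied out of L3;
                # everything else is moved (removed from the lower level).
                keep_copy = i == 2 and meta["is_code"] and meta["shared_hint"]
                if not keep_copy:
                    del cache[addr]
                self._insert(0, addr, meta)
                return f"L{i + 1}"
        # Miss everywhere: the fill goes straight into L1, not into L2 or L3.
        self._insert(0, addr, {"is_code": is_code, "shared_hint": shared_hint})
        return "MEM"
```

With capacities of one line each for L1 and L2, touching three different addresses pushes the first one down to L3. Reading it back as data removes it from L3, while reading it back as hinted code leaves a copy there for other cores.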

That brings us to virtualization, a topic we have covered a lot in our Pacifica articles (1, 2, 3 and 4). The main advance Barcelona brings is that it turns on Nested Page Table support and a few other things that did not make the cut on the Rev F parts. It is also said to reduce world switch time going to the hypervisor and back by 25%.

One of the last things they did was to break the CPU and northbridge into separate power planes. This will allow the CPUs to be clocked up and down, and volted up and down, independently of the northbridge. This was a major sticking point in the ramp of the previous iterations of the chip, so I expect big dividends here, and it also saves a lot of power. I am also told that they can change the voltage on cores independently, but that is more of a motherboard issue. Since it is not really supported anywhere on the current platforms, don't look for a BIOS option to turn it on, but if it came back in later platform revisions, I would not be overly surprised.
 

Gilbo

Storage is cool
Joined
Aug 19, 2004
Messages
742
Location
Ottawa, ON
I'm very interested in the effect of the 32-byte instruction fetch. The Core 2 Duo, despite its 4-issue design and wide execution core, is limited to only 2-3 instructions per cycle in situations involving a substantial presence of 64-bit and/or SSE instructions because of its 16-byte instruction fetch. So, while it can issue and execute 4 instructions per cycle, it can't actually fetch 4 instructions' worth of data per cycle when larger instructions are used.
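A back-of-the-envelope way to see the fetch limit, as a Python sketch. The 6-byte average instruction length is an assumption picked to represent long 64-bit/SSE encodings, not a measured figure, and the function name is made up for this example.

```python
def fetch_limited_ipc(issue_width, fetch_bytes, avg_instr_bytes):
    """Upper bound on instructions per cycle when fetch is the bottleneck.

    Assumption: every instruction has the same average length and nothing
    else (decode, dependencies, cache misses) gets in the way.
    """
    return min(issue_width, fetch_bytes / avg_instr_bytes)

# 4-issue core, assumed 6-byte average instruction length.
narrow_fetch = fetch_limited_ipc(4, 16, 6)  # 16 bytes fetched per cycle
wide_fetch = fetch_limited_ipc(4, 32, 6)    # 32 bytes fetched per cycle
print(round(narrow_fetch, 2), wide_fetch)
```

Under that assumption the 16-byte fetch lands in the 2-3 instructions per cycle range, while a 32-byte fetch stops being the limit and the issue width takes over.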

I only really care about this because the productivity-limiting code paths in many RAW decoders and image processing applications involve exactly these types of instruction mixes. Ironically, highly Intel-optimized (i.e. SSE-optimized) image processing applications may end up performing even better on these AMD Barcelona chips on a per core basis.

Many image processing algorithms, particularly interpolation (RAW processing) and convolution/deconvolution (sharpening & noise) algorithms, the post-processing operations nearly every photograph goes through, are also highly bandwidth intensive. AMD's quad-core chips will certainly have a massive per-core memory bandwidth advantage over Intel's Core 2 Quads.

I'm interested. Not that I'll end up buying one for years because they'll probably cost an arm & a leg.
 

Gilbo

Storage is cool
Joined
Aug 19, 2004
Messages
742
Location
Ottawa, ON
Oh ya, I have a bunch of virtualized servers running on one of my home servers. Virtualization stuff is great to see.

It looks to me like this will only improve host-level virtualization as opposed to the OS-level virtualization I've decided to use. OS-level virtualization never incurred these types of context-switching performance penalties (or really any of the performance penalties that affect VMware or Xen).
 

CityK

Storage Freak Apprentice
Joined
Sep 2, 2002
Messages
1,719
Gilbo said: "...the OS-level virtualization I've decided to use. OS-level virtualization never incurred these types of context-switching performance penalties (or really any of the performance penalties that affect VMware or Xen)."
Which one are you using, Gilbo? OpenVZ?
 

Gilbo

Storage is cool
Joined
Aug 19, 2004
Messages
742
Location
Ottawa, ON
Ya, I'm using OpenVZ at the moment. Letting the kernel scheduler balance CPU time and memory usage depending on current need is just such a huge advantage. Most of my VMs don't need to do anything at all most of the time and can be entirely paged out in favour of more pressing concerns.

I'm probably going to use Xen for a few tasks, like getting rid of my independent firewall box. OpenVZ won't give you the full security Xen offers if the functionality needs kernel modules. The proper way to do a firewall, for example, is either as an independent box, or with full hardware-emulation-based virtualization (like VMware) or paravirtualization (like Xen). When your VM doesn't need kernel functionality, OpenVZ is just as good, with fewer hassles and better performance.
 

Gilbo

Storage is cool
Joined
Aug 19, 2004
Messages
742
Location
Ottawa, ON
"pricing [on 4x4] has been 're-constructed' severely downwards, with the fastest CPU pack now matching the price of Intel's Kentsfield."
--At the Inquirer.

If that is true, 'Quadfather' --what a dumb name-- could be extremely interesting to me. The boards will be upgradeable to the Barcelona quad-cores. They use standard, cheap, unregistered DDR2 (most reports indicate only 4 DIMM sockets, unfortunately --necessary to avoid cannibalizing the server market, I suppose).

The boards will probably cost a fortune, but they're pretty loaded. In the last couple of years I've never come close to paying as much for a system as I would have to in order to build the cheapest possible version of one of these, but I have to say I'm interested in using one for photo work.
 

sechs

Storage? I am Storage!
Joined
Feb 1, 2003
Messages
4,709
Location
Left Coast
If this ends up being a cheap not-quite-workstation-grade quad solution, I'd certainly be interested.
 

LiamC

Storage Is My Life
Joined
Feb 7, 2002
Messages
2,016
Location
Canberra
Gilbo said: "'Quadfather' --what a dumb name"

Quadfather isn't the name. It's a moniker/nickname applied by the Inq (that was where I first saw it) or someone else. The official moniker is 4x4, which I think is equally dumb.

Did I mention that I hate marketers/marketing?
 

Sol

Storage is cool
Joined
Feb 10, 2002
Messages
960
Location
Cardiff (Wales)
I could find a use for 12 SATA ports, but I'd rather have them in a box with a 35-watt single-core Athlon 64... It's just much cheaper to download with a low-power box and only turn on the power-sucking toy when you want to play...
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,742
Location
Horsens, Denmark
12 SATA ports is a bit much for me. With 750GB drives, 5-7TB would be enough. I agree that a low-powered CPU would be lovely; my 65W CPU makes me happy.
 