By Jon "Hannibal" Stokes
Wednesday, April 05, 2006
From the perspective of Apple fans who were vexed by the loss of the much-loved AltiVec, one of Core's most significant improvements over its predecessors is in the area of vector processing, or SIMD.
As noted above, 128-bit floating-point arithmetic operations go into the two FADD/VFADD and FMUL/VFMUL pipelines. So these two units handle both vector and scalar floating-point operations. Both of these pipelines are also capable of doing floating-point and vector register moves. Finally, I'm guessing that the FMUL/VFMUL pipeline also does vector square root operations.
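If you want to see what feeds these pipelines, here's a minimal C sketch using SSE intrinsics. The unit assignments in the comments just restate the (partly guessed) mapping above, and the function name is mine:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustrative only: the pipeline assignments in the comments follow
   the (partly guessed) mapping described above. */
__m128 fp_demo(__m128 a, __m128 b)
{
    __m128 s = _mm_add_ps(a, b);          /* ADDPS -> FADD/VFADD pipeline */
    __m128 p = _mm_mul_ps(a, b);          /* MULPS -> FMUL/VFMUL pipeline */
    return _mm_sqrt_ps(_mm_add_ps(s, p)); /* SQRTPS, presumably also FMUL/VFMUL */
}
```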
For integer vector operations (64-bit MMX instructions, and 128-bit SSE integer instructions), the picture is a bit murkier. From what I've been able to gather, the vector integer units on ports 0 and 1 appear to have been retained and widened to 128 bits for the purposes of single-cycle 128-bit vector integer computation. I'm currently assuming that, as with the PIII, one unit is a 128-bit VALU/shift unit and the other is a 128-bit VALU/multiply unit.
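To put faces on those two unit types, here's a hedged C sketch (names mine) with one 128-bit integer instruction from each class; which unit executes which class follows my VALU/shift vs. VALU/multiply assumption:

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* One 128-bit integer operation from each class discussed above.
   The unit assignments reflect my assumption, not Intel documentation. */
__m128i int_demo(__m128i a, __m128i b)
{
    __m128i sum   = _mm_add_epi32(a, b);    /* PADDD:  vector ALU      */
    __m128i shift = _mm_slli_epi32(sum, 2); /* PSLLD:  vector shift    */
    return _mm_mullo_epi16(shift, b);       /* PMULLW: vector multiply */
}
```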
There's a fifth 128-bit vector pipeline on port 2, about which little is known except that it does vector register moves. I suspect that it also handles SSE shuffle operations (hence the name VSHUF I've assigned it) and vector reciprocal and reciprocal square root operations. This unit would be the rough equivalent of the AltiVec vector permute unit that exists on the PowerPC G4 and 970. (For a handy discussion of AltiVec vector permute and SSE shuffle instruction equivalences, see this Apple reference page.)
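For the curious, here's what an SSE shuffle looks like from the programmer's side. This is a complete, runnable C example, though which unit actually executes PSHUFD on Core is, as I said, my guess:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* Pack four 32-bit integers into one 128-bit XMM register.
       Lane 0 holds 0, lane 1 holds 1, and so on. */
    __m128i v = _mm_set_epi32(3, 2, 1, 0);

    /* PSHUFD: rearrange the four lanes according to an immediate
       control field; here we simply reverse their order. This is the
       kind of operation a shuffle/permute unit would execute. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));

    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 3 2 1 0 */
    return 0;
}
```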
Now that you're familiar with Core's vector hardware, let's take a look at one of the most important improvements that Core brings to SSE/SSE2/SSE3: a true 128-bit datapath for all vector units.
When Intel finally got around to adding 128-bit vector support to the Pentium line with the introduction of Streaming SIMD Extensions (SSE), the results weren't quite as pretty as programmers and users might've hoped. SSE and its successors (SSE2 and SSE3) have two disadvantages on the P6 and Banias. On the ISA side, SSE's main drawback is its lack of support for three-operand instructions, support that makes AltiVec a superior vector ISA for some applications. On the hardware implementation side, 128-bit SSE operations suffer from a limitation that's the result of Intel shoehorning 128-bit operations onto the P6 core's 64-bit internal datapaths.
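To see what the two-operand limitation costs in practice, consider this small C function. The assembly in the comments is an illustration of what a compiler might plausibly emit, not a dump from any particular compiler:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes a*b + a. Because SSE instructions name only two operands,
   the destination is also a source: MULPS computes xmm0 = xmm0 * xmm1,
   clobbering one input. Since a is needed again for the add, the
   compiler must first copy it with an extra MOVAPS:
       movaps xmm2, xmm0      ; preserve a
       mulps  xmm2, xmm1      ; xmm2 = a * b
       addps  xmm2, xmm0      ; xmm2 = a*b + a
   AltiVec's three-operand forms (e.g. vaddfp vD, vA, vB) leave both
   sources intact, so no such copy is needed. */
__m128 mul_then_add(__m128 a, __m128 b)
{
    return _mm_add_ps(_mm_mul_ps(a, b), a);
}
```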
The P6 core's internal data buses for floating-point arithmetic and MMX are only 64 bits wide, so the data input ports on the SSE execution units could only be 64 bits wide as well. In order to execute a 128-bit instruction using its 64-bit SSE units, the P6 must first break that instruction down into a pair of 64-bit micro-ops that can be executed on successive cycles.
To see how this works, take a look at the diagram below, which shows in a very abstract way what happens when the P6 decodes and executes a 128-bit SSE instruction. The decoder first splits the instruction into two 64-bit micro-ops, one for the upper 64 bits of the vector and another for the lower 64 bits. Then this pair of micro-ops is passed to the appropriate SSE unit for execution.
The result of this hack is that all 128-bit vector operations take a minimum of two cycles to execute on the P6: one cycle for the top half and another for the bottom half. Compare this to the single-cycle throughput and latency of simple 128-bit AltiVec operations on the PowerPC G4.
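For reference, here's the sort of single instruction we're talking about, expressed as a C intrinsic (the function name is mine):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* A single 128-bit packed add, which compiles to one ADDPS
   instruction. On the P6, the decoder cracks that one instruction
   into two 64-bit micro-ops, one for each half of the vector, so it
   ties up the 64-bit unit for two successive cycles. */
__m128 packed_add(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```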
Unfortunately, the Pentium 4's Netburst architecture suffered from the same drawback, as did the Pentium M.
The new Core architecture finally gives programmers a single-cycle latency for 128-bit vector operations. Intel did this by making the floating-point and vector internal data buses 128 bits wide, a feature that also means only a single micro-op needs to be generated, dispatched, scheduled, and issued for each 128-bit vector operation. Therefore not only does the new design eliminate the latency disadvantage, but it also improves decode, dispatch, and scheduling bandwidth because half as many micro-ops are generated for 128-bit vector instructions.
I went ahead and tried to represent Core's new configuration in terms of the diagram above, so take a look:
As you can see, the vector ALU's data ports, both input and output, are twice as large in order to accommodate 128 bits of data at a time.
When you combine these critical improvements with Core's increased amount of vector execution hardware and its expanded decode, dispatch, issue, and retire bandwidth, you get a beast of a vector processing machine. (Of course, SSE's unfortunate two-operand limitation still applies, but there's no helping that.) Intel's literature states that Core can, for example, execute a 128-bit packed multiply, 128-bit packed add, 128-bit packed load, 128-bit packed store, and a macro-fused cmpjcc (a compare + a jump on condition code) all in the same cycle. That's essentially six instructions in one cycle, quite a boost from any previous Intel processor.
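To make that concrete, here's a hypothetical loop (names mine) whose inner iteration maps onto roughly that instruction mix:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* dst[i] += a[i] * b[i], four floats at a time. Each iteration maps
   onto roughly the mix Intel cites: packed loads (MOVAPS), a packed
   multiply (MULPS), a packed add (ADDPS), a packed store (MOVAPS),
   and a loop-closing compare + conditional jump that Core can
   macro-fuse. All three pointers must be 16-byte aligned for the
   aligned load/store intrinsics used here. */
void muladd_loop(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);          /* packed load  */
        __m128 vb = _mm_load_ps(&b[i]);          /* packed load  */
        __m128 vd = _mm_load_ps(&dst[i]);        /* packed load  */
        vd = _mm_add_ps(vd, _mm_mul_ps(va, vb)); /* multiply, add */
        _mm_store_ps(&dst[i], vd);               /* packed store */
    } /* the bounds check compiles to a CMP + Jcc pair */
}
```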