Oracle Optimization
Lecture 2

Processors and cache subsystem

The central processing unit (CPU or processor) is the key component of any computer system. In this chapter, we cover several different CPU architectures from Intel (IA32, Intel 64 Technology, IA64) and AMD (AMD64) and outline their main performance characteristics.

 Processor technology

The central processing unit has actually outperformed all other computer subsystems in its evolution. Thus, most of the time, other subsystems such as disk or memory will impose a bottleneck upon your application (unless pure number crunching or complex application processing is the desired task). Understanding the functioning of a processor in itself is already quite a difficult task, but today IT professionals are faced with multiple and often very different CPU architectures.

Comparing different processors is no longer a matter of looking at the CPU clock rate but one of understanding what CPU architecture is best suited for what kind of workload. Also, 64-bit computing has finally made its move from the high-end UNIX® and mainframe systems into the Intel compatible arena and has become yet another new technology to be understood.

The Intel compatible microprocessor has evolved from the first 4004 4-bit CPU, produced in 1971, to the current line of Xeon® and Core processors. AMD, on the other hand, has stepped out of Intel's shadow with the world's first IA32 compatible 64-bit processor. Our overview of processors begins with the current line of Intel Xeon CPUs, followed by the AMD Opteron and Intel's Core Architecture. For the sake of simplicity, we will not explore earlier processors.

Hyper-Threading

Hyper-Threading technology effectively enables a single physical processor to execute two separate code streams (threads) concurrently. To the operating system, a processor with Hyper-Threading appears as two logical processors, each of which has its own architectural state, that is, its own data, segment, and control registers and its own advanced programmable interrupt controller (APIC).

Each logical processor can be individually halted, interrupted, or directed to execute a specified thread, independently of the other logical processor on the chip. However, unlike a traditional 2-way SMP configuration that uses two separate physical processors, the logical processors share the execution resources of the processor core, which include the execution engine, the caches, the system bus interface, and firmware. Figure 4-2 illustrates the basic layout of a Hyper-Threading-enabled CPU. As shown in the figure, only the components for the architectural state of the CPU have doubled.

Hyper-Threading technology is designed to improve server performance by exploiting the multi-threading capability of operating systems, such as Windows Server 2003 and Linux, and server applications, in such a way as to increase the use of the on-chip execution resources that are available on these processors.

Figure 4-2 Hyper-Threading processor versus a non-Hyper-Threading processor

Having fewer or slower processors usually yields the best gains in performance when comparing Hyper-Threading on versus Hyper-Threading off, because with fewer processors there is a greater likelihood that the software can spawn sufficient numbers of threads to keep both paths busy. The performance gains from Hyper-Threading on slower processors are usually greater than the gains obtained on high-speed processors, because on the slower processors there are longer periods of time between the serialization points that nearly all software must use. Whenever two threads must serialize, performance is reduced.

The performance gains that are obtained in a highly parallel threaded database environment from enabling Hyper-Threading are as follows:

  Two physical processors: up to about 35% performance gain
  Four physical processors: up to about 33% performance gain
  Eight physical processors: up to about 30% performance gain

Over time, these gains in performance will change because software developers will introduce improved threading, which makes more efficient use of Hyper-Threading. Much of the currently available software still limits SMP scalability, but we can expect improved results as software matures. Best-case multi-threaded applications today are: databases, SAP, Web servers, PeopleSoft, VMware, and 64-bit Terminal Services.

Description                               Hyper-Threading
Windows 2000 Server                       Yes (see notes)
Windows 2000 Advanced Server              Yes (see notes)
Windows 2000 Datacenter Server            Yes (see notes)
Windows Server 2003, Standard Edition     Optimized
Windows Server 2003, Enterprise Edition   Optimized
Windows Server 2003, Datacenter Edition   Optimized
Linux kernel 2.4.18+                      Yes
Linux kernel 2.6                          Optimized
VMware ESX Server 2.x                     Yes
VMware ESX Server 3                       Optimized

Notes:

Intel BIOS Programmers Guide recommends disabling Hyper-Threading on systems that are running Windows 2000 Server or earlier or Linux kernels earlier than 2.4.18.

Linux kernels 2.4.18 and later, Windows 2000 SP3, and Windows Server 2003 operating systems understand the concept of physical processors versus logical processors.

Important: Full scheduler support for Hyper-Threading is available only with the 2.6 series of Linux kernels.
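The distinction between logical and physical processors can be illustrated with a small sketch. It assumes /proc/cpuinfo-style text in which each logical processor is a block of "key : value" lines separated by a blank line, with the fields processor, physical id, and core id. Those field names are an assumption about what the kernel reports, and older kernels may omit some of them. The sample text is made up for illustration.

```python
def count_processors(cpuinfo_text):
    """Return (logical, physical_cores, sockets) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    sockets = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a per-processor block
        logical += 1
        socket = fields.get("physical id", "0")
        # Without a core id, fall back to treating each entry as its own core.
        core = fields.get("core id", fields["processor"])
        sockets.add(socket)
        cores.add((socket, core))
    return logical, len(cores), len(sockets)


# Hypothetical sample: two single-core sockets with Hyper-Threading enabled,
# so each socket reports two logical processors that share one core.
SAMPLE = "\n\n".join(
    f"processor : {n}\nphysical id : {n // 2}\ncore id : 0" for n in range(4)
)

if __name__ == "__main__":
    print(count_processors(SAMPLE))  # (4, 2, 2)
```

On a real Linux system the same function could be given the contents of /proc/cpuinfo, provided the kernel exposes the assumed fields.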

Dual core Intel Xeon processors

Moore's Law states that the number of transistors on a chip doubles about every two years. Similarly, as the transistors have become smaller, the frequency of the processors has increased, which is generally equated with performance.

However, around 2003, physics started to limit advances in obtainable clock frequency. Transistor sizes have become so small that electron leakage through transistors has started to occur. Those electron leaks result in large power consumption and substantial extra heat and could even result in data loss. In addition, cooling the processor at higher frequencies has become too expensive with the traditional air cooling methods.

This is why the material that comprises the dielectric in the transistors has become a major limiting factor in the frequencies that are obtainable. Manufacturing advances have continued to enable a higher per-die transistor count but have only been able to obtain about a 10% frequency improvement per year. For that reason, processor vendors are now placing more processors on the die to offset the inability to increase frequency. Multi-core processors provide the ability to increase performance with lower power consumption.

Intel released their first dual core Xeon processors in October 2005. Dual core processors are two separate physical processors combined onto a single processor socket. Dual core processors consist of twice as many cores, but each core runs at a lower clock speed than an equivalent single core chip in order to reduce waste heat. Waste heat is the heat that is produced from electron leaks in the transistors.

There are five dual core Xeon processor models available in IBM System x servers. The most important is the Woodcrest family:

Xeon 5100 Series DP processor (Woodcrest)

The Woodcrest processor is the first Xeon DP processor that uses the Intel Core microarchitecture instead of the Netburst microarchitecture. See 4.1.4, "Intel Core microarchitecture" for details.

Frequencies of 1.6-3.0 GHz are supported with an L2 cache of 4 MB. The front-side bus runs at a frequency of either 1066 or 1333 MHz as shown in Table 4-2. None of these processors support Hyper-Threading.

Woodcrest uses a low power model incorporated in the Core microarchitecture. The Woodcrest processor provides substantial improvements in random memory access.

Table 4-2

 
Processor model   Speed      L2 cache   Front-side bus   Power (TDP)
Xeon 5110         1.6 GHz    4 MB       1066 MHz         65 W
Xeon 5120         1.86 GHz   4 MB       1066 MHz         65 W
Xeon 5130         2.00 GHz   4 MB       1333 MHz         65 W
Xeon 5140         2.33 GHz   4 MB       1333 MHz         65 W
Xeon 5148 LV      2.33 GHz   4 MB       1333 MHz         40 W
Xeon 5150         2.66 GHz   4 MB       1333 MHz         65 W
Xeon 5160         3.0 GHz    4 MB       1333 MHz         80 W

Quad core Intel Xeon processors

Quad-core processors differ from single-core and dual-core processors by providing four independent execution cores. While some execution resources are shared, each logical processor has its own architecture state with its own set of general-purpose registers and control registers to provide increased system responsiveness. Each core runs at the same clock speed.

Intel quad-core processors include the following:

Xeon 5300 Series DP processor (Clovertown)

The Clovertown processor is a quad-core design that is actually made up of two Woodcrest dies in a single package. Each Woodcrest die has 4 MB of L2 cache so the total L2 cache in Clovertown is 8 MB.

The Clovertown processors are also based on the Intel Core microarchitecture as described in 4.1.4, "Intel Core microarchitecture".

Processor models available include the E5310, E5320, E5335, E5345 and E5355. The processor front-side bus operates at either 1066 MHz (processor models ending in 0) or 1333 MHz (processor models ending in 5). For specifics, see Table 4-3. None of these processors support Hyper-Threading.

In addition to the features of the Intel Core microarchitecture, the features of the Clovertown processor include:

Intel Virtualization Technology - processor hardware enhancements that support software-based virtualization.
Intel 64 Architecture (EM64T) - support for both 64-bit and 32-bit applications.
Demand-Based Switching (DBS) - technology that enables hardware and software power management features to lower average power consumption of the processor while maintaining application performance.
Intel I/O Acceleration Technology (I/OAT) - reduces processor bottlenecks by offloading network-related work from the processor.

Table 4-3

 
Processor model   Speed      L2 cache   Front-side bus   Power (TDP)   Demand-Based Switching
E5310             1.6 GHz    8 MB       1066 MHz         80 W          No
E5320             1.86 GHz   8 MB       1066 MHz         80 W          Yes
E5335             2.0 GHz    8 MB       1333 MHz         80 W          No
E5345             2.33 GHz   8 MB       1333 MHz         80 W          Yes
E5355             2.66 GHz   8 MB       1333 MHz         120 W         Yes

64-bit computing

As discussed in 4.1, Processor technology, there are three 64-bit implementations in the Intel-compatible processor marketplace:

Intel IA64, as implemented on the Itanium 2 processor
Intel 64 Technology, as implemented on the 64-bit Xeon DP and Xeon MP processors
AMD AMD64, as implemented on the Opteron processor

There exists some uncertainty as to the definition of a 64-bit processor and, even more importantly, the benefit of 64-bit computing.

Definition of 64-bit: A 64-bit processor is a processor that is able to address 64 bits of virtual address space. A 64-bit processor can store data in 64-bit format and perform arithmetic operations on 64-bit operands. In addition, a 64-bit processor has general purpose registers (GPRs) and arithmetic logical units (ALUs) that are 64 bits wide.

The Itanium 2 has both 64-bit addressability and GPRs and 64-bit ALUs. So, it is by definition a 64-bit processor.

Intel 64 Technology extends the IA32 instruction set to support 64-bit instructions and addressing, but are Intel 64 Technology and AMD64 processors real 64-bit chips? The answer is yes. Where these processors operate in 64-bit mode, the addresses are 64-bit, the GPRs are 64 bits wide, and the ALUs are able to process data in 64-bit chunks. Therefore, these processors are full-fledged, 64-bit processors in this mode.

Note that while IA64, Intel 64 Technology, and AMD64 are all 64-bit, they are not all compatible with one another:

Intel 64 Technology and AMD64 are, with exception of a few instructions such as 3DNOW, binary compatible with each other. Applications written and compiled for one will usually run at full speed on the other.
IA64 uses a completely different instruction set to the other two. 64-bit applications written for the Itanium 2 will not run on the Intel 64 Technology or AMD64 processors, and vice versa.

64-bit extensions: AMD64 and Intel 64 Technology

Both AMD's AMD64 and Intel 64 Technology (formerly known as EM64T) architectures extend the well-established IA32 instruction set with 64-bit addressing, wider and additional general purpose registers (GPRs), and the ability to process data in 64-bit chunks.

Even though the names of these extensions suggest that the improvements are simply in memory addressability, both AMD64 and Intel 64 Technology are in fact fully functional 64-bit processors.

There are three distinct operation modes available in AMD64 and Intel 64 Technology:

32-bit legacy mode

The first and, in the near future, probably most widely used mode is the 32-bit legacy mode. In this mode, both AMD64 and Intel 64 Technology processors will act just like any other IA32 compatible processor. You can install your 32-bit OS on such a system and run 32-bit applications, but you will not be able to make use of the new features such as the flat memory addressing above 4 GB or the additional General Purpose Registers (GPRs). 32-bit applications will run just as fast as they would on any current 32-bit processor.

Most of the time, IA32 applications will run even faster because there are numerous other improvements that boost performance regardless of the maximum address size. For applications that share large amounts of data there might be performance impacts related to the NUMA-like architecture of multi-processor Opteron configurations since remote memory access might slow your application down.

Compatibility mode

The second mode supported by AMD64 and Intel 64 Technology is compatibility mode, which is an intermediate step toward the full 64-bit mode described below. In order to run in compatibility mode, you will need to install a 64-bit operating system and 64-bit drivers. With a 64-bit OS and drivers installed, both Opteron and Xeon processors can run the 64-bit operating system together with either 32-bit or 64-bit applications.

Compatibility mode gives you the ability to run a 64-bit operating system while still being able to run unmodified 32-bit applications. Each 32-bit application will still be limited to a maximum of 4 GB of memory. However, the 4 GB limit is now imposed on a per-process level, not at a system-wide level. This means that every 32-bit process on this system gets its very own 4 GB address space (assuming sufficient physical memory is installed). This is already a huge improvement compared to IA32, where the operating system kernel and the application had to share 4 GB of address space.

Additionally, compatibility mode does not support the virtual 8086 mode, so real-mode legacy applications are not supported. 16-bit protected mode applications are however supported.

Full 64-bit mode (Long Mode)

The final mode is the full 64-bit mode. AMD refer to this as long mode and Intel refer to it as IA-32e mode. This mode applies when a 64-bit operating system and 64-bit applications are used. In the full 64-bit operating mode, an application can have a virtual address space of up to 40 bits (which equates to 1 TB of addressable memory). The amount of physical memory will be determined by how many DIMM slots the server has and the maximum DIMM capacity supported and available at the time.

Applications that run in full 64-bit mode will get access to the full physical memory range (depending on the operating system) and will also get access to the new GPRs as well as to the expanded GPRs. However it is important to understand that this mode of operation requires not only a 64-bit operating system (and of course 64-bit drivers) but also requires a 64-bit application that has been recompiled to take full advantage of the various enhancements of the 64-bit addressing architecture.
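Whether a given program is itself running as a 64-bit process can be checked from inside the program. The following minimal sketch does this for the Python interpreter using only the standard library. The function names are ours, chosen for illustration. A 32-bit interpreter on a 64-bit operating system (that is, one running in compatibility mode) would report 32.

```python
import struct
import sys


def pointer_bits():
    """Width in bits of a native pointer in the running interpreter."""
    return struct.calcsize("P") * 8


def is_64bit_process():
    # sys.maxsize is 2**63 - 1 on 64-bit builds and 2**31 - 1 on 32-bit builds.
    return sys.maxsize > 2**32


if __name__ == "__main__":
    print(pointer_bits(), is_64bit_process())
```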

For more information about the AMD64 architecture, see:

http://www.x86-64.org/ 

For more information about Intel 64 Technology, see:

http://www.intel.com/technology/64bitextensions/ 

The benefit of 64-bit (AMD64, Intel 64 Technology) computing

The Benefits of a 64-Bit Architecture

Companies that produce computer hardware and software often make a point of mentioning the size of their systems' address space (typically 32 or 64 bits). In the last five years, the shift from 32-bit to 64-bit microprocessors and operating systems has caused a great deal of hype to be generated by various marketing departments. The truth is that only in certain cases do 64-bit architectures run significantly faster than 32-bit architectures. That is not because they have faster CPU performance, but because they can support more RAM and a wider data path to external devices.

What does it mean to be 64-bit?

The number of "bits" refers to the width of a data path. However, what this actually means is subject to its context. For example, we might refer to a 16-bit data path (for example, UltraSCSI). This means that the interconnect can transfer 16 bits of information at a time. With all other things held constant, it would be twice as fast as an interconnect with an 8-bit data path.

The "bitness" of a memory system refers to how many wires are used to transfer a memory address. For example, if we had an 8-bit path to the memory address, and we wanted the 19th location in memory, we would turn on the appropriate wires (1, 2, and 5; we derive this from writing 19 in binary, which gives 00010011; everywhere there is a one, we turn on that wire). Note, however, that since we only have 8 bits worth of addressing, we are limited to 256 (2^8) addresses in memory. 32-bit systems are, therefore, limited to 4,294,967,296 (2^32) locations in memory. Since memory is typically accessible in 1-byte blocks, this means that the system can't directly access more than 4 GB of memory. The shift to 64-bit operating systems and hardware means that the maximum amount of addressable memory is about 16 exabytes (16,777,216 TB), which is probably sufficient for the foreseeable future.
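The wire example and the address-count arithmetic can be checked with a few lines of code. This is only an illustrative sketch. The function names are ours, and wires are numbered from 1 starting at the least significant bit, as in the example.

```python
def active_wires(address):
    """1-based positions of the address wires that carry a 1, lowest bit first."""
    return [bit + 1 for bit in range(address.bit_length()) if (address >> bit) & 1]


def addressable_locations(address_bits):
    """Number of distinct addresses that fit in the given number of address bits."""
    return 2 ** address_bits


if __name__ == "__main__":
    print(format(19, "08b"), active_wires(19))   # 00010011 [1, 2, 5]
    print(addressable_locations(8))              # 256
    print(addressable_locations(32))             # 4294967296
    print(addressable_locations(64) // 2**60)    # 16 (exabytes, with 1 EB = 2**60 bytes)
```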

Unfortunately, it's often not quite this simple in practice. A 32-bit SPARC system is actually capable of having more than 4 GB of memory installed, but usually no single process can use more than 4 GB.

Performance ramifications

The change from 32-bit to 64-bit architectures, then, expanded the size of main memory and the amount of memory a single process can have. An obvious question is how applications benefited from this. In general, the biggest winners from 64-bit systems are corporate database engines. For the average desktop workstation, 32 bits is plenty.

In the same way that 16-bit processors and 16-bit applications are no longer used in this space, it is likely that at some point in the future, 64-bit processors and applications will fully replace their 32-bit counterparts.

Processors using the Intel 64 Technology and AMD64 architectures are making this transition very smooth by offering 32-bit and 64-bit modes. This means that the hardware support for 64-bit will be in place before you upgrade or replace your software applications with 64-bit versions. IBM System x already has many models available with the Intel 64 Technology-based Xeon and AMD64 Opteron processors.

The question you should be asking is whether the benefit of 64-bit processing is worth the effort of upgrading or replacing your 32-bit software applications. The answer is that it depends on the application. Here are examples of applications that will benefit from 64-bit computing:

Encryption applications

Most encryption algorithms are based on very large integers and would benefit greatly from the use of 64-bit GPRs and ALUs. While modern high-level languages allow you to specify integers above the 2^32 limit, in a 32-bit system, this is achieved by using two 32-bit operands, thereby causing a significant overhead while moving those operands through the CPU pipelines. A 64-bit processor will allow you to perform a 64-bit integer operation with one instruction.
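To make the two-operand overhead concrete, the sketch below adds two unsigned 64-bit values using only 32-bit-wide steps: a low-half add, a carry, and a high-half add. It models the idea in Python and is not actual compiler output. On a 32-bit x86 processor the same work is typically done with an add followed by an add-with-carry instruction, whereas a 64-bit processor needs a single add.

```python
MASK32 = 0xFFFFFFFF


def add64_with_32bit_ops(a, b):
    """Add two unsigned 64-bit values using only 32-bit-wide steps."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo_sum = a_lo + b_lo                 # step 1: add the low halves
    carry = lo_sum >> 32                 # 1 if the low add overflowed 32 bits
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32  # step 2: add the high halves plus carry
    return (hi << 32) | lo               # result wraps modulo 2**64


if __name__ == "__main__":
    print(hex(add64_with_32bit_ops(0x1FFFFFFFF, 1)))  # 0x200000000
```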

Scientific applications

Scientific applications are another example of workloads that need 64-bit data operations. However, floating-point operations do not benefit from the larger integer size because floating-point registers are already 80 or 128 bits wide even in 32-bit processors.

Software applications requiring more than 4 GB of memory

The biggest advantage of 64-bit computing for commercial applications is the flat, potentially massive, address space.

32-bit enterprise applications such as databases are currently implementing Physical Address Extension (PAE) and Address Windowing Extensions (AWE) addressing schemes to access memory above the 4 GB limit imposed by 32-bit address limited processors. With Intel 64 Technology and AMD64, these 32-bit addressing extension schemes support access to memory up to 128 GB in size.

One constraint with PAE and AWE, however, is that memory above 4 GB can only be used to store data. It cannot be used to store or execute code. So, these addressing schemes only make sense for applications such as databases, where large data caches are needed.

In contrast, a 64-bit virtual address space provides for direct access to up to 16 Exabytes (EB). Even though we call these processors 64-bit, none of the current 64-bit processors actually supports full 64 bits of physical memory addressing, simply because this is such an enormous amount of memory.

In addition, 32-bit applications might also get a performance boost from a 64-bit Intel 64 Technology or AMD64 system running a 64-bit operating system. When the processor runs in Compatibility mode, every process has its own 4 GB memory space, not the 2 GB or 3 GB memory space each gets on a 32-bit platform. This is already a huge improvement compared to IA32 where the OS and the application had to share those 4 GB of memory.

When the application is designed to take advantage of more memory, the availability of the additional 1 or 2 GB of physical memory can create a significant performance improvement. Not all applications take advantage of the global memory available. APIs in code need to be used to recognize the availability of more than 2 GB of memory.

Furthermore, some applications will not benefit at all from 64-bit computing and might even experience degraded performance. If an application does not require greater memory capacity or does not perform high-precision integer or floating-point operations, then 64-bit will not provide any improvement.

In fact, because 64-bit computing generally requires instructions and some data to be stored as 64-bit objects, these objects consume more physical memory than the same object in a 32-bit operating environment. The memory capacity inflation of 64-bit can only be offset by an application taking advantage of the capabilities of 64-bit (greater addressing or increased calculation performance for high-precision operations), but when an application does not make use of the 64-bit operating environment features, it often experiences the overhead without the benefit.

In this case, the overhead is increased memory consumption, leaving less physical memory for operating system buffers and caches. The resulting reduction in effective memory can decrease performance.
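The memory inflation can be estimated with simple arithmetic. The sketch below uses a hypothetical record that holds a 4-byte value and two pointers (for example, a node in a doubly linked list) and assumes that pointers must be aligned to their own size. The record layout and the function names are illustrative assumptions and are not taken from any particular application.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def node_size(pointer_bytes, payload_bytes=4, pointers=2):
    """Size of a hypothetical record: a payload followed by pointer-aligned pointers."""
    payload = align_up(payload_bytes, pointer_bytes)
    return payload + pointers * pointer_bytes


if __name__ == "__main__":
    print(node_size(4))  # 12 bytes with 32-bit pointers
    print(node_size(8))  # 24 bytes with 64-bit pointers
```

Under these assumptions the same record takes twice the memory in a 64-bit environment even though it holds the same 4 bytes of payload.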

Software driver support in general is lacking for 64-bit operating systems compared to the 32-bit counterparts. General software drivers such as disk controllers or network adapters or application tools might not have 64-bit code in place for x64 operating systems. Prior to moving to an x64 environment it might be wise to ensure that all third-party vendors and software tools support drivers for the specific 64-bit operating system that you are planning to use.

64-bit memory addressing

The width of a memory address dictates how much memory the processor can address. A 32-bit processor can address up to 2^32 bytes or 4 GB. A 64-bit processor can theoretically address up to 2^64 bytes or 16 Exabytes (or 16777216 Terabytes), although current implementations address a smaller limit, as shown in Table 4-4.

Note: These values are the limits imposed by the processors. Memory addressing can be limited further by the chipset implemented in the server. For example, the XA-64e chipset used in the x3950 Xeon based server addresses up to 512 GB of memory.

Table 4-4 Memory supported by processors

Processor                                                           Flat addressing   Addressing with PAE
Intel 32-bit Xeon MP processors, including Foster MP and Gallatin   4 GB (32-bit)     128 GB
Intel 64-bit Xeon DP Nocona                                         64 GB (36-bit)    128 GB in compatibility mode
Intel 64-bit Xeon MP Cranford                                       64 GB (36-bit)    128 GB in compatibility mode
Intel 64-bit Xeon MP Potomac                                        1 TB (40-bit)     128 GB in compatibility mode
Intel 64-bit Dual Core MP, including Paxville, Woodcrest, and Tulsa 1 TB (40-bit)     128 GB in compatibility mode
AMD Opteron (64-bit)                                                256 TB (48-bit)   128 GB in compatibility mode
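The flat addressing limits in Table 4-4 follow directly from the number of address bits. As a quick check, the sketch below converts an address width into a capacity, using binary units (1 KB = 1024 bytes) as the table does. The function and constant names are ours.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable(bits):
    """Human-readable capacity addressable with the given number of address bits."""
    size = 2 ** bits
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size} {UNITS[unit]}"


if __name__ == "__main__":
    for bits in (32, 36, 40, 48, 64):
        print(bits, addressable(bits))
    # 32 4 GB, 36 64 GB, 40 1 TB, 48 256 TB, 64 16 EB
```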

The 64-bit extensions in the processor architectures Intel 64 Technology and AMD64 provide better performance for both 32-bit and 64-bit applications on the same system. These architectures are based on 64-bit extensions to the industry-standard x86 instruction set and provide support for existing 32-bit applications.

Processor performance

Processor performance is a complex topic because the effective CPU performance is affected by system architecture, operating system, application, and workload. This is even more so with the choice of three different CPU architectures, IA32, IA64, and AMD64/EM64T.

In general, server CPUs execute workloads that have very random address characteristics. This is expected because most servers perform many unrelated functions for many different users. So, core clock speed and L1 cache attributes have a lesser effect on processor performance compared to desktop environments. This is because with many concurrently executing threads that cannot fit into the L1 and L2 caches, the processor core is constantly waiting for L3 cache or memory for data and instructions to execute.

 

Comparing CPU architectures

Every CPU we have discussed so far has similar attributes. Every CPU has two or more pipelines, an internal clock speed, L1 cache, and L2 cache (some also have L3 cache). The various caches are organized in different ways: some are 2-way associative, and some go up to 16-way associativity. Some have an 800 MHz FSB while others have no FSB at all (Opteron).

Which is fastest? Is the Xeon DP the fastest CPU of them all because it is clocked at up to 3.6 GHz? Or is the Itanium 2 the fastest because it features up to 9 MB of L3 cache? Or is it perhaps the Opteron because its L2 cache features a 16-way associativity?

As is so often the case, there is never one simple answer. When comparing processors, clock frequency is only comparable when comparing processors of the same architectural family. You should never compare isolated processor subsystems across different CPU architectures and think you can make a simple performance statement. Comparing different CPU architectures is therefore a very difficult task and has to take into account available application and operating system support.

As a result, we do not compare different CPU architectures in this section, but we do compare the features of the different models of one CPU architecture.

4.3.2 Cache associativity

Cache associativity is necessary to reduce the lookup time to find any memory address stored in the cache. The purpose of the cache is to provide fast lookup for the CPU, because if the cache controller had to search the entire memory for each address, the lookup would be slow and performance would suffer.

To provide fast lookup, some compromises must be made with respect to how data can be stored in the cache. Obviously, the entire amount of memory would be unable to fit into the cache because the cache size is only a small fraction of the overall memory size (see 4.3.3, Cache size). The methodology of how the physical memory is mapped to the smaller cache is known as set associativity (or just associativity).

First, some definitions. Looking at Figure 4-11, main memory is divided up into pages. Cache is also divided up into pages, and a memory page is the same size as a cache page. Pages are divided up into lines or cache lines. Generally cache lines are 64 bytes wide.

For each page in memory or in cache, the first line is labeled cache line 1, the second line is labeled cache line 2, and so on. When data in memory is to be copied to cache, the line that this data is in is copied to the equivalent slot in cache.
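The labeling scheme can be expressed as simple integer arithmetic on a byte address. The sketch below assumes 64-byte cache lines, as described above, and a hypothetical page of only four lines to keep the numbers small. Real caches have far more lines per page. The names are ours.

```python
LINE_SIZE = 64       # bytes per cache line (the typical value given above)
LINES_PER_PAGE = 4   # hypothetical; chosen small for illustration


def locate(address):
    """Return (page, cache_line) for a byte address; cache lines are numbered from 1."""
    line_number = address // LINE_SIZE
    page = line_number // LINES_PER_PAGE
    cache_line = line_number % LINES_PER_PAGE + 1
    return page, cache_line


if __name__ == "__main__":
    print(locate(0))    # (0, 1)
    print(locate(64))   # (0, 2)
    print(locate(256))  # (1, 1)  same cache line slot as address 0, different page
```

Addresses 0 and 256 map to cache line 1 of different pages, so in a one-way associative cache they compete for the same slot.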

Looking at Figure 4-11, when copying cache line 1 from memory page 0 to cache, it is stored in cache line 1 in the cache. This is the only slot it can be stored in cache. This is a one-way associative cache, because for any given cache line in memory, there is only one position in cache where it can be stored. This is also known as direct mapped, because the data can only go into one place in the cache.

 

Figure 4-11 One-way associative (direct mapped) cache

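With a one-way associative cache, if cache line 1 in another memory page needs to be copied to cache, it too can only be stored in cache line 1 in cache, so it must replace the line that is already there. If both lines are in active use, each access to one evicts the other and every access becomes a cache miss. You can see from this that you would get a greater cache hit rate if you use greater associativity.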

Figure 4-12 shows the 2-way set associative cache implementation. Here there are two locations in which to store the first cache line for any memory page. As the figure illustrates, the cache will be able to store up to two cache line 1 entries from main memory (on the right hand side) concurrently. Cache line 1 for page 0 of main memory could be located in way-a of the cache while cache line 1 for page n of main memory could be located in way-b of the cache simultaneously.

Figure 4-12 A 2-way set associative cache

Expanding on a one-way and two-way set associative cache, a 3-way set associative cache (Figure 4-13) provides three locations, a 4-way set associative cache provides four locations, and an 8-way set associative cache provides eight possible locations in which to store the first cache line from up to eight different memory pages.

Figure 4-13 3-way set associative cache

Set associativity greatly minimizes the cache address decode logic necessary to locate a memory address in the cache. The cache controller simply uses the requested address to generate a pointer into the correct cache page. A hit occurs when the requested address matches the address stored in one of the fixed number of cache locations associated with that address. If the particular address is not there, a cache miss occurs.
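The effect of associativity on the hit rate can be seen in a toy simulation. The sketch below models an N-way set associative cache that only counts hits and misses. It assumes 64-byte lines, a hypothetical four lines per page, and least-recently-used replacement. The replacement policy is our assumption, because the text does not name one. The class and function names are illustrative.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache that tracks hits and misses only."""

    def __init__(self, ways, lines_per_page=4, line_size=64):
        self.ways = ways
        self.lines_per_page = lines_per_page
        self.line_size = line_size
        # One ordered dict per cache line slot; keys are memory page numbers.
        self.sets = [OrderedDict() for _ in range(lines_per_page)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line_number = address // self.line_size
        index = line_number % self.lines_per_page   # which cache line slot
        page = line_number // self.lines_per_page   # which memory page (the tag)
        slot = self.sets[index]
        if page in slot:
            slot.move_to_end(page)      # mark as most recently used
            self.hits += 1
            return True
        if len(slot) >= self.ways:
            slot.popitem(last=False)    # evict least recently used (assumed policy)
        slot[page] = True
        self.misses += 1
        return False


def run(ways, addresses):
    """Replay the addresses through a fresh cache and return (hits, misses)."""
    cache = SetAssociativeCache(ways)
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses


if __name__ == "__main__":
    # Alternate between cache line 1 of page 0 and cache line 1 of page 1.
    pattern = [0, 256] * 4
    print(run(1, pattern))  # (0, 8): direct mapped, every access misses
    print(run(2, pattern))  # (6, 2): both lines stay resident after the first misses
```

With one way, the two lines keep evicting each other. With two ways, both fit in the same slot and only the first two accesses miss.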

Notice that as the associativity increases, the lookup time to find an address within the cache could also increase because more pages of cache must be searched. To avoid longer cache lookup times as associativity increases, the lookups are performed in parallel. However, as the associativity increases, so do the complexity and cost of the cache controller.

For the high performance X3 Architecture systems such as the System x3950, lab measurements determined that the optimal configuration for cache was 9-way set associativity, taking into account performance, complexity and cost.

A fully associative cache, in which any memory cache line could be stored in any cache location, could be implemented, but this is almost never done because of the expensive (in both cost and die area) parallel lookup circuits required.

Large servers generally have random memory access patterns as opposed to sequential memory access patterns. Higher associativity favors random memory workloads due to its ability to cache more distributed locations of memory.

4.3.3 Cache size

Faster, larger caches usually result in improved processor performance for server workloads. Performance gains obtained from larger caches increase as the number of processors within the server increases. When a single CPU is installed in a four-socket SMP server, there is little competition for memory access. Consequently, when a CPU has a cache miss, memory can respond, and with the deep pipeline architecture of modern processors, the memory subsystem usually responds before the CPU stalls. This allows one processor to run fast almost independently of the cache hit rate.

On the other hand, if there are four processors installed in the same server, each queuing multiple requests for memory access, the time to access memory is greatly increased, increasing the potential for one or more CPUs to stall. In this case, a fast L2 hit saves a significant amount of time and greatly improves processor performance.

Of course, there are diminishing returns as the size of the cache increases; these are simply rules of thumb for the maximum expected performance gain.

4.3.4 CPU clock speed

Processor clock speed affects CPU performance because it is the speed at which the CPU executes instructions. Measured system performance improvements from an increase in clock speed are usually not directly proportional to the clock speed increase. For example, when comparing a 3.0 GHz CPU to an older 1.6 GHz CPU, you should not expect to see an 87% improvement. In most cases, the performance improvement from a clock speed increase will be about 30% to 50% of the percentage increase in clock speed. So for the example above, you could expect about 26% to 44% system performance improvement when upgrading a 1.6 GHz CPU to a 3.0 GHz CPU.
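The rule of thumb above is simple arithmetic, sketched here for the 1.6 GHz to 3.0 GHz example. The function name expected_gain_range is made up for illustration; the 30% and 50% factors are the ones quoted in the text.

```python
def expected_gain_range(old_ghz, new_ghz, low=0.30, high=0.50):
    """Return (clock_increase, low_gain, high_gain) as fractions."""
    clock_increase = new_ghz / old_ghz - 1.0
    return clock_increase, clock_increase * low, clock_increase * high


clock, low_gain, high_gain = expected_gain_range(1.6, 3.0)
print(f"clock increase: {clock:.1%}")
print(f"expected system gain: {low_gain:.0%} to {high_gain:.0%}")
```

The clock increase works out to 87.5%, and 30% to 50% of that gives the 26% to 44% range stated above.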

4.3.5 Scaling versus the number of processor cores

In general, the performance gains shown in Figure 4-14 can be obtained by adding CPUs when the server application is capable of efficiently utilizing additional processors, and of course, there are no other bottlenecks occurring in the system.

Figure 4-14 Typical performance gains when adding processors

These scaling factors can be used to approximate the achievable performance gains that can be obtained when adding CPUs and memory to a scalable Intel IA-32 server.

For example, begin with a 1-way 3.0 GHz Xeon MP processor and add another Xeon MP processor. Server throughput performance will improve up to about 1.7 times. Increase the number of Xeon processors to four and server performance can improve to almost three times greater throughput than the single processor configuration.

At eight processors, the system has a bit over four times the throughput of the single-processor configuration, and at 16 processors the throughput increases to over sixfold that of the single-CPU configuration.
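One way to read these figures is as per-CPU efficiency, that is, relative throughput divided by the number of processors. The sketch below uses rounded values taken from the wording above ("about 1.7", "almost three", "a bit over four", "over six"); the rounding is an assumption for illustration and these are not measured data points.

```python
# Approximate throughput relative to one CPU, rounded from the text above.
# These rounded values are assumptions for illustration, not measurements.
RELATIVE_THROUGHPUT = {1: 1.0, 2: 1.7, 4: 3.0, 8: 4.0, 16: 6.0}


def scaling_efficiency(cpus):
    """Throughput per CPU relative to the single-CPU configuration."""
    return RELATIVE_THROUGHPUT[cpus] / cpus


for n in sorted(RELATIVE_THROUGHPUT):
    print(f"{n:2d} CPUs: {RELATIVE_THROUGHPUT[n]:.1f}x throughput, "
          f"{scaling_efficiency(n):.0%} per-CPU efficiency")
```

With these rounded inputs, each added processor contributes less than the one before it, which is why the text stresses that the gains assume no other bottleneck in the system.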

High-performing chipsets such as the XA-64e are generally designed to provide higher scalability than the average chipset. Figure 4-15 shows the performance gain of a high-performing chipset such as the X3 Hurricane chipset in the x3850 as processors are added, assuming no other bottlenecks occur in the system. Performance gains of 1.9 and 1.8 are possible in certain business workloads.

Figure 4-15 System x3850 Performance Scaling when adding processors

Database applications such as IBM DB2, Oracle, and Microsoft SQL Server usually provide the greatest performance improvement with increasing numbers of CPUs. These applications have been painstakingly optimized to take advantage of multiple CPUs. This effort has been driven by the database vendors' desire to post #1 transaction processing benchmark scores. High-profile industry-standard benchmarks do not exist for many applications, so the motivation to obtain optimal scalability has not been as great. As a result, most non-database applications have significantly lower scalability. In fact, many do not scale beyond two to four CPUs.

 

4.3.6 Processor features in BIOS

BIOS levels permit various settings for performance in certain IBM System x servers.

Processor Adjacent Sector Prefetch

When this setting is enabled (enabled is the default for most systems), the processor retrieves both sectors of a cache line when it requires data that is not currently in its cache. When it is disabled, the processor fetches only the sector of the cache line that includes the requested data. For instance, only one 64-byte line from the 128-byte sector will be prefetched with this setting disabled.

This setting can affect performance, depending on the application running on the server and memory bandwidth utilization. Typically, it affects certain benchmarks by a few percent, although in most real applications it will be negligible. This control is provided for benchmark users who want to fine-tune configurations and settings.
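The effect of the adjacent sector prefetch setting on a single miss can be sketched as follows. The 64-byte and 128-byte sizes come from the text; the assumption that sectors are aligned on 128-byte boundaries, and the function name lines_fetched, are illustrative only.

```python
LINE_SIZE = 64      # bytes per cache line, as in the text
SECTOR_SIZE = 128   # bytes per sector (two adjacent lines), assumed 128-byte aligned


def lines_fetched(address, adjacent_sector_prefetch=True):
    """Return the start addresses of the cache lines fetched on a miss."""
    line_start = address - (address % LINE_SIZE)
    if not adjacent_sector_prefetch:
        return [line_start]            # only the line holding the requested data
    sector_start = address - (address % SECTOR_SIZE)
    return [sector_start, sector_start + LINE_SIZE]   # both lines of the sector


print([hex(a) for a in lines_fetched(0x1050, adjacent_sector_prefetch=True)])
print([hex(a) for a in lines_fetched(0x1050, adjacent_sector_prefetch=False)])
```

With the setting enabled, every miss moves twice as much data across the bus, which helps when the neighboring line is used soon afterwards and costs bandwidth when it is not.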

Processor Hardware Prefetcher

When this setting is enabled (disabled is the default for most systems), the processor is able to prefetch extra cache lines for every memory request. Recent tests in the performance lab have shown that you will get the best performance for most commercial application types if you disable this feature. The performance gain can be as much as 20%, depending on the application.

For high-performance computing (HPC) applications, we recommend that you enable HW Prefetch. For database workloads, we recommend that you leave HW Prefetch disabled.

Both prefetch settings do decrease the miss rate for the L2/L3 cache when they are enabled but they consume bandwidth on the front-side bus which can reach capacity under heavy load. By disabling both prefetch settings, multi-core setups achieve generally higher performance and scalability.

In single-core processor setups, it is generally better to enable the adjacent sector prefetch. Figure 4-16 shows the gains that were measured with the various settings on a tuned single-core x3850 server running an online transaction processing workload.

 

 

Figure 4-16 Prefetch settings on x3850 (System x366) with single core processors

For dual-core processor configurations, it is generally better to disable the prefetch settings. Figure 4-17 shows the gains that were measured with the various settings on a tuned dual-core x3850 server running an online transaction processing workload.

 

Figure 4-17 Prefetch settings on x3850 with dual core processors

PCI bus subsystem

The Peripheral Component Interconnect (PCI) bus is the predominant bus technology that is used in most Intel architecture servers. The PCI bus is designed to allow peripheral devices, such as LAN adapters and disk array controllers, independent access to main memory. PCI adapters that have the ability to gain direct access to system memory are called bus master devices.

Bus master devices are also called direct memory access (DMA) devices.

This chapter discusses the following topics:

6.1, “PCI and PCI-X”

6.2, “PCI-X”

6.3, “PCI Express”

6.4, “Bridges and buses”

To simplify this chapter, we have combined our discussion of PCI and PCI-X into one section and have outlined any differences between the two standards.

PCI and PCI-X

The PCI bus is designed as a synchronous bus, meaning that every event must occur at a particular clock tick or edge. The standard PCI bus uses a 33 MHz or 66 MHz clock with a bus width of either 32 or 64 bits. With the introduction of PCI-X, the speeds have been increased to include 66 MHz, 133 MHz, 133 MHz DDR, and 133 MHz QDR. This increase has raised the maximum transfer rate in burst mode from 276 MBps to 4.2 GBps.

PCI uses a multi-drop parallel bus that is a multiplexed address and data bus, meaning that the address and data lines are physically the same wires. Thus, fewer signal wires are required, resulting in a simpler, smaller connector. The downside to this design is that PCI transactions must include a turnaround phase to allow the address lines to be switched from address mode to data mode. The PCI bus also has a data-pacing mechanism that enables fast devices to communicate with slower devices that are unable to respond to a data transfer request on each clock edge. The generic name for any PCI device is the agent.

A basic data transfer operation on the PCI bus is called a PCI transaction, which usually involves request, arbitration, grant, address, turnaround, and data transfer phases. PCI agents that initiate a bus transfer are called initiators, while the responding agents are called targets. All PCI operations are referenced from memory. For example, a PCI read operation is a PCI agent reading from system memory. A PCI write operation is a PCI agent writing to system memory. PCI transactions do not use any CPU cycles to perform the transfer.

The language of PCI defines the initiator as the PCI bus master adapter that initiates the data transfer (for example, a LAN adapter or SCSI adapter) and the target as the PCI device that is being accessed. The target is usually the PCI bridge device or memory controller.

6.2 PCI-X

PCI-X 2.0 is the latest version of PCI and is built upon the same architecture, protocols, signals, and connectors as traditional PCI. This architecture has resulted in maintaining hardware and software compatibility with the previous generations of PCI. This design means that devices and adapters that are compliant with PCI-X 1.0 are fully supported in PCI-X 2.0.

When supporting previous PCI devices, it is important to note that the clock must scale to a frequency that is acceptable to the lowest speed device on the bus. This results in all devices on that bus being restricted to operating at that slower speed.

PCI-X was developed to satisfy the increased requirements of today's I/O adapters, such as Gigabit Ethernet, Fibre Channel, and Ultra320 SCSI. PCI-X is fully compatible with standard PCI devices. It is an enhancement to the conventional PCI specification V2.2 and enables a data throughput of over 4 GBps at 533 MHz/64-bit in burst mode.

Adapters with high I/O traffic, such as Fibre Channel and storage adapters, benefit significantly from PCI-X. These adapters provide a huge amount of data to the PCI bus and, therefore, need PCI-X to move the data to main memory.

Tip: Simply migrating to a newer PCI bus might not alleviate the bottleneck in a system.

Although the peak throughput has increased from PCI to PCI-X, this is not the only reason why PCI-X shows increased performance over PCI adapters. The following are changes made in PCI-X that provide higher efficiency and, therefore, a performance benefit when compared to standard PCI:

Attribute phase

The attribute phase takes one clock cycle and provides further information about the transaction. PCI-X sends new information with each transaction performed within the attribute phase which enables more efficient buffer management. The attribute phase can be split into several parts:

Sequence information: Each transaction in a sequence identifies the total number of bytes remaining to be read or written. If a transaction is disconnected, the new transaction that continues the sequence includes an updated byte count. Furthermore, each transaction includes the identity of the initiator (bus number, device number, and function number).
Relaxed order structure: Relaxed ordering is a technique that allows PCI-PCI bridges to rearrange the transactions on the bus. Therefore, more important data is transmitted before less important data, and the efficiency of the system improves.
Transaction byte count: With this information, the PCI-PCI bridge gets information about how long a transaction will take. For every transaction, the byte count holds a count of how much data is remaining and, therefore, enables the PCI-PCI bridge to use its internal cache more efficiently.

Split transactions

Delayed transactions in conventional PCI are replaced by split transactions in PCI-X. All transactions except memory-write transactions are allowed to be executed as split transactions. If a target on the PCI bus cannot complete a transaction within the target initial latency limit, the target must complete the transaction as a split transaction. Thus, the target sends a split response message to the initiator telling it that the data will be delivered later on. This then frees the bus for other communications.

When the data is available for transmission, the target requests access to the bus and completes the transaction with a split completion transaction. For example, a SCSI controller that is waiting for data from a disk and is blocking the PCI bus for other devices is forced to complete the transaction as a split transaction.

If the target meets the target initial latency limits, it optionally completes the transaction immediately (for example, the requested data was immediately available because it was found in the buffer of the SCSI adapter and it is immediately sent back to the initiator). The split transaction design replaces the similar, but less efficient delayed transactions used by older PCI specifications. The transactions are tagged and queued, and the specifications also allow for a relaxed ordering scheme that makes out-of-order execution possible.

Allowable disconnect boundary

To prevent a single process from monopolizing the bus with a single large transfer (bursts can be up to 4096 bytes), PCI-X gives initiators and targets the chance to place interruptions once a burst transaction has been initiated. The interruptions are not placed randomly (which might compromise the efficiency of the buffers and cache operations) but are fixed on 128-byte boundaries, a figure big enough to facilitate complete cache line transmissions.
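As a small illustration of the fixed boundaries, the sketch below lists the addresses at which a burst could be interrupted. The 128-byte spacing is from the text; the function name disconnect_points and the byte-addressed model are assumptions made for the example.

```python
ADB = 128  # allowable disconnect boundary spacing in bytes, from the text


def disconnect_points(start, length):
    """Addresses inside (start, start + length) where a burst may be interrupted."""
    first = (start // ADB + 1) * ADB   # first boundary strictly after the start address
    return list(range(first, start + length, ADB))


# A 300-byte burst starting at byte address 100 crosses three boundaries.
print(disconnect_points(100, 300))
```

Because the boundaries are fixed multiples of 128 bytes rather than chosen freely, buffers on both sides can be sized and aligned in whole blocks.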

The reliability of the PCI-X bus is improved by differentiating between peripheral and system errors. PCI-X has no possibility of recovering from system errors, but the device generating the error can be held in reset status, keeping the rest of the system up and running.

The benefit of adopting the PCI-X standard is the increase in supported throughputs, evident with the 533 MHz implementation. When running at higher frequencies (133 MHz and higher), only one device can be on a PCI-X bus, making PCI-X a high-bandwidth point-to-point I/O channel. At lower speeds (less than 133 MHz), multiple devices can be connected on a single bus.

Note that the 66 MHz implementation of PCI-X doubles the number of slots supported on a current PCI 2.2 66 MHz bus. Table 6-1 shows the possible combinations of PCI modes and speeds.

Table 6-1 PCI and PCI-X modes

Mode | PCI voltage (V) | 64-bit max slots | 64-bit MBps | 32-bit max slots | 32-bit MBps | 16-bit MBps
PCI 33 | 5 or 3.3 | 4 | 266 | 4 | 133 | Not applicable
PCI 66 | 3.3 | 2 | 533 | 2 | 266 | Not applicable
PCI-X 66 | 3.3 | 4 | 533 | 4 | 266 | Not applicable
PCI-X 133 (1) | 3.3 | 2 | 800 | 2 | 400 | Not applicable
PCI-X 133 | 3.3 | 1 | 1066 | 1 | 533 | Not applicable
PCI-X 266 | 3.3 or 1.5 | 1 | 2133 | 1 | 1066 | 533
PCI-X 533 | 3.3 or 1.5 | 1 | 4266 | 1 | 2133 | 1066

(1) Operating at 100 MHz
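The MBps figures in Table 6-1 follow from the effective clock rate multiplied by the bus width in bytes. The sketch below reproduces them under the assumption that the nominal 33, 66, 133, 266, and 533 MHz rates are exact multiples of 33.33 MHz and that the table truncates fractions; the names MODES, peak_mbps, and table_value are made up for the example.

```python
def peak_mbps(clock_mhz, width_bits):
    """Peak burst rate in MBps: one transfer of width_bits per effective clock."""
    return clock_mhz * width_bits / 8


def table_value(clock_mhz, width_bits):
    """Peak rate truncated to a whole number (assumed rounding of Table 6-1)."""
    return int(peak_mbps(clock_mhz, width_bits))


# Effective clock rates in MHz (assumed exact multiples of 33.33 MHz).
MODES = {
    "PCI 33": 100 / 3,
    "PCI 66": 200 / 3,
    "PCI-X 66": 200 / 3,
    "PCI-X 133 at 100 MHz": 100,
    "PCI-X 133": 400 / 3,
    "PCI-X 266": 800 / 3,
    "PCI-X 533": 1600 / 3,
}

for mode, clock in MODES.items():
    print(f"{mode:22s} 64-bit: {table_value(clock, 64):5d} MBps   "
          f"32-bit: {table_value(clock, 32):5d} MBps")
```

These are theoretical burst rates only; they say nothing about how many slots a bus can drive or what throughput an adapter can sustain.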

PCI-X devices use 3.3V I/O signalling when operating in PCI-X mode. They also support the 5V I/O signalling levels when operating in 33 MHz conventional mode, which results in cards either designed specifically for 3.3V PCI-X or universally keyed.

 

 

Figure 6-1 Adapter keying

PCI-X cards are designed to run at either 66 MHz or 133 MHz. PCI-X cards are not usually designed to run at 100 MHz. However, the number of loads on the bus can force a 133 MHz adapter to operate at 100 MHz.

 Performance

It is rare for the PCI bus to be able to sustain the maximum theoretical throughput rates that are shown in Table 6-1. In most servers, the sustainable PCI throughput is only about 75% of the maximum theoretical rate. Of course, if the width of the PCI bus doubles or if the peak speed of the bus doubles, then the maximum throughput will increase accordingly.

The PCI adapter, the adapter device driver, and the system PCI chipset all limit the maximum sustainable throughput. The device driver and adapter firmware play a role in how the adapter is programmed to transfer data over the PCI bus. In most cases, 75% bus efficiency is typical.

Every PCI transaction requires request, arbitration, grant, address, turnaround, and data transfer cycles. The non-data-transfer cycles are request, arbitration, grant, address, and turnaround. As described in 6.1, PCI and PCI-X, a turnaround cycle is required because the PCI bus shares the same signal lines for both data and address.

In most cases, these cycles are overhead. During these cycles, the bus does not transfer data. Obviously, overhead cycles affect the sustainable data transfer rate. An adapter bursting small amounts of data for each transaction will have a higher percentage of overhead than an adapter bursting large amounts of data during each transaction and will, therefore, have a lower data throughput.
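The relationship between burst size and overhead can be sketched as a simple ratio. The count of four overhead cycles per transaction is a hypothetical value chosen for illustration; the real number depends on the adapter, its driver, and the chipset.

```python
def bus_efficiency(data_cycles, overhead_cycles):
    """Fraction of the bus cycles in one transaction that carry data."""
    return data_cycles / (data_cycles + overhead_cycles)


# Hypothetical overhead of 4 non-data cycles per transaction (an assumption).
OVERHEAD = 4
for burst in (1, 4, 12, 64):
    print(f"{burst:3d} data cycles: {bus_efficiency(burst, OVERHEAD):.0%} efficient")
```

Under this assumed overhead, a 12-cycle burst reaches the 75% efficiency mentioned above, while very short bursts spend most of their bus time on overhead.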

6.3 PCI Express

PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that it can provide chip-to-chip, board-to-board, and adapter connectivity.

PCI Express uses a serial interface and allows for point-to-point interconnections between devices using directly wired interfaces between these connection points. This design differs from previous PCI bus architectures which used a shared, parallel bus architecture.

A single PCI Express serial link is a dual-simplex connection that uses two pairs of wires, one pair for transmit and one pair for receive, and that transmits only one bit per cycle. Although this design sounds limiting, it can transmit at the extremely high speed of 2.5 Gbps, which equates to a burst mode of 320 MBps on a single connection. These two pairs of wires are called a lane.

A PCI Express link comprises one or more lanes. In such configurations, the connection is labeled as x1, x2, x4, x8, x12, x16, or x32, where the number is the number of lanes. So, where PCI Express x1 requires four wires to connect, an x16 implementation requires 16 times that amount, or 64 wires. This results in physically different sized slots.

 
Tip: When you refer to lane nomenclature, you use the word by, as in by 8 for x8.

Figure 6-2 shows the slots for a 32-bit PCI 2.0, a PCI Express x1, and a PCI Express x16. From this figure, it is clear that the PCI Express x16 adapter will not fit physically in the PCI Express x1 slot.

 

 

Figure 6-2 PCI 2.0 and PCI Express edge connectors

You can install PCI Express adapters in larger slots but not in smaller ones. For example, you can install a PCI Express x8 adapter into an x16 slot (although it will still operate at the x8 speed), but you cannot insert an x8 adapter into an x4 slot. Table 6-2 shows this compatibility.
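The wire count and the slot compatibility rule described above can be written down directly. This is a minimal sketch that models only what the text states (four signal wires per lane, and an adapter fits any slot of equal or greater width); the function names are made up for the example.

```python
WIRES_PER_LANE = 4  # two pairs of wires per lane: one transmit pair, one receive pair


def wires(lanes):
    """Number of signal wires for a link of the given width."""
    return lanes * WIRES_PER_LANE


def fits(adapter_lanes, slot_lanes):
    """An adapter fits a slot of equal or greater width."""
    return adapter_lanes <= slot_lanes


def operating_width(adapter_lanes, slot_lanes):
    """Width the adapter runs at: its own width, even in a wider slot."""
    if not fits(adapter_lanes, slot_lanes):
        raise ValueError(f"x{adapter_lanes} adapter does not fit in an x{slot_lanes} slot")
    return adapter_lanes


print(wires(1), wires(16))        # x1 and x16 wire counts
print(operating_width(8, 16))     # x8 adapter in an x16 slot
print(fits(8, 4))                 # x8 adapter in an x4 slot
```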

Table 6-2

 PCI Express performance

PCI Express currently runs at 2.5 Gbps or 200 MBps per lane in each direction, providing a total bandwidth of 80 Gbps in a 32-lane configuration and up to 160 Gbps in a full duplex x32 configuration. Future frequency increases will scale up total bandwidth to the limits of copper (which is 12.5 Gbps per wire) and significantly beyond that through other media without impacting any layers above the physical layer in the protocol stack.

Table 6-3 shows the throughput of PCI Express at different lane widths.

Table 6-3 PCI Express maximum transfer rate

 
Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1 | 2.5 GHz | 5 Gbps | 400 MBps | Slots, Gigabit Ethernet
x2 | 5 GHz | 10 Gbps | 800 MBps | None
x4 | 10 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 | 20 GHz | 40 Gbps | 3.2 GBps | Slots, Infiniband adapters, Myrinet adapters
x16 | 40 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

 

PCI Express uses an embedded clocking technique that uses 8b/10b encoding. The clock information is encoded directly into the data stream, rather than having the clock as a separate signal. The 8b/10b encoding essentially requires 10 bits per character or about 20% channel overhead. This encoding explains differences in the published specification speeds of 250 MBps (with the embedded clock overhead) and 200 MBps (data only, without the overhead). For ease of comparison, Table 6-3 shows throughput in both bps and Bps.
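The duplex figures in Table 6-3 can be reproduced from the per-lane numbers quoted in this section. The sketch below simply follows the text's own arithmetic (2.5 Gbps per lane per direction, 250 MBps published per lane, 8 data bits in every 10-bit character); the constant and function names are assumptions for the example.

```python
LINE_RATE_GBPS = 2.5   # per lane, per direction (from the text)
RAW_MBPS = 250         # published per-lane figure quoted in the text
DATA_BITS = 8          # data bits carried in each encoded character
CHAR_BITS = 10         # bits transmitted per encoded character


def duplex_throughput(lanes):
    """Return (Gbps, MBps) for both directions combined, as in Table 6-3."""
    gbps = lanes * LINE_RATE_GBPS * 2
    mbps = lanes * RAW_MBPS * DATA_BITS // CHAR_BITS * 2
    return gbps, mbps


for lanes in (1, 2, 4, 8, 16):
    gbps, mbps = duplex_throughput(lanes)
    print(f"x{lanes:<2d} {gbps:5.1f} Gbps  {mbps:5d} MBps")
```

Applying the 8-in-10 ratio to the 250 MBps figure gives the 200 MBps per lane per direction used in the text, and doubling it for the two directions gives the 400 MBps shown for x1.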

When compared to the current version of a PCI-X 2.0 adapter running at 133 MHz QDR (quad data rate, effectively 533 MHz), the potential sustained throughput of PCI Express x16 is over double the throughput as shown in Figure 6-3.

 

 

Figure 6-3 PCI Express and PCI-X comparison (in Gbps)

7.3.5 Intel E8500 Chipset

The Intel E8500 chipset, code named Twin Castle, is used in many of our competitors' high-end systems. The E8500 is Intel's high-end chipset that was created for Intel EM64T MP-based processors: the single-core Cranford and Potomac as well as the dual-core Paxville and Tulsa.

This chipset is a continuation or upgrade of the previous line of high-end Intel chipsets and has the following features:

Two front-side buses per node, similar to the X3-64e, instead of a single front-side bus per node. The front-side bus runs at up to 800 MHz.
Memory support for DDR-266, DDR-333, and DDR2-400.
EM64T and support for single-core Cranford or Potomac and dual-core Paxville and Tulsa processors.
PCI Express support for three x8 links and one x4 link.

The E8500 does not incorporate a snoop filter. Figure 7-6 shows the higher front-side bus utilization of a high-end chipset, such as the E8500, that does not incorporate a snoop filter. Processor commands sourced from one front-side bus generate traffic on the second front-side bus to ensure that the processors on that bus do not have the requested data in their caches. This increase in traffic drives up the front-side bus utilization and slows down the overall performance of the system.

Despite being released in March 2005, there have not been many performance benchmarks that demonstrate the performance of the E8500 "Twin Castle" chipset.

The TPC-C benchmark is generally used to benchmark high-end servers by running an online transaction processing workload. Figure 7-7 shows the performance difference between the only published four-socket Intel E8500 chipset result and four-socket X3-64e systems running the TPC-C benchmark. All three benchmark configurations used 3.0 GHz Paxville processors with 2x2 MB L2 caches and 64 GB of memory in the server.

 

Figure 7-7 Four-socket TPC-C chipset performance

The results illustrate that the single-node X3-64e chipset (xSeries 366) outperforms the Intel E8500 chipset by 17% for an online transaction processing workload, despite the fact that the Intel E8500 chipset was running an 800 MHz front-side bus compared to the 667 MHz front-side bus of the X3-64e configuration.

The X3-64e chipset was also tested with four CPUs arranged as two nodes, with two processors per node and a total of 128 GB of memory; this configuration is represented by the xSeries 460 data point. Because the number of processors is held constant, the x460 result illustrates the scalability of multiple nodes.

For more information, see:

http://www.intel.com/products/chipsets/e8500/