More cores per node is not necessarily better

Many think that using more cores to run a job will somehow deliver a better customer experience.
The simple answer is no.

Modern server & CPU designers can make various trade-offs on complexity, performance and number of threads & cores. The best way to address these trade-offs is to look at the integrated system design.

Let’s start with a quick review of NUMA. This is taken from Wikipedia:

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

This means that in a physical server with two or more sockets on an Intel Nehalem or AMD Opteron platform, we very often find memory that is local to one socket and memory that is local to the other. A socket, its local memory and the bus connecting the two components is called a NUMA node. Each socket is also connected to the other socket's memory, allowing remote access.

Please be aware that an additional socket in a system does NOT necessarily mean an additional NUMA node! Two or more sockets can be connected to memory with no distinction between local and remote. In this case, and in the case where we have only a single socket, we have a UMA (uniform memory access) architecture.

UMA system: one or more sockets connected to the same RAM.

Scheduling – The Complete Picture

Whenever we virtualize complete operating systems, scheduling takes place at two levels: a VM is provided with vCPUs (virtual CPUs) for execution, and the hypervisor has to schedule those vCPUs across pCPUs (physical CPUs). On top of this, the guest scheduler distributes execution time on vCPUs to processes and threads.

So, we have to take a look at scheduling at two different levels to understand what is going on there. But before we go into more detail we have to take a look at a problem that might arise in NUMA systems.

The Locality Problem

Each NUMA node has its own computing power (the cores on the socket) and a dedicated amount of memory assigned to that node. You can often see this just by looking at your mainboard: there are two sockets and two separate groups of memory slots.

Those two sockets are connected to their local memory through a memory bus, but they can also access the other socket's memory via an interconnect. AMD calls that interconnect HyperTransport, which is the equivalent of Intel's QPI (QuickPath Interconnect) technology. Both names suggest very high throughput and low latency. That is true, but compared to the local memory bus connection they are still far behind.

What does this mean for us? A process or virtual machine that was started on either of the two nodes should not be moved to a different node by the scheduler. If that happens, and it can happen if the scheduler is NUMA-unaware, the process or VM has to access its memory through the NUMA node interconnect, resulting in higher memory latency. For memory-intensive workloads, this can seriously degrade application performance! Keeping a workload on the node that holds its memory is referred to as “NUMA locality”.
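To get a feel for how much locality matters, here is a minimal sketch of the average memory latency a workload sees as a growing share of its accesses cross the interconnect. The latency figures (80 ns local, 130 ns remote) and the function names are illustrative assumptions, not measurements from the article or from any specific platform.

```python
def average_latency_ns(remote_fraction, local_ns=80.0, remote_ns=130.0):
    """Weighted average memory latency for a given share of remote accesses.

    local_ns and remote_ns are assumed, illustrative values.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def slowdown(remote_fraction, local_ns=80.0, remote_ns=130.0):
    """Latency relative to the all-local case (1.0 means no penalty)."""
    return average_latency_ns(remote_fraction, local_ns, remote_ns) / local_ns


if __name__ == "__main__":
    # A VM migrated to the other node without its memory is the 100% case.
    for share in (0.0, 0.25, 0.5, 1.0):
        print(f"{share:.0%} remote: {average_latency_ns(share):.1f} ns, "
              f"{slowdown(share):.2f}x the local latency")
```

The model ignores caches, bandwidth limits and interconnect contention, so it only shows the direction of the effect, not its real size.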

Old News ;-)

July 22, 2013 Written by Marc-Andre

This post takes a look at performance variability issues when scaling up the number of processors assigned to the Gurobi MIP solver (I did the same study for CPLEX in this post a few months ago). I summarize results from a few computational experiments we've made. I show that while increasing the number of processor cores results in quicker runs on average, the effect on individual instances is quite difficult to predict. The general conclusions are:

In general, more processor cores available leads to reduced solution times, up to 16 cores;

On average, using 8 cores instead of a single core approximately cuts solution times in half;

On a single instance, using more cores may actually lead to considerably longer (or shorter) run times;

The run with the most cores (16, in our test) was the fastest about 66% of the time, while 7.8% of our models ran faster on a single core than on 16 cores!
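To put the "8 cores cut solution times in half" figure in perspective, the sketch below turns it into a speedup, a parallel efficiency, and the parallel fraction that Amdahl's law would imply. Reading the result through Amdahl's law is my assumption, not something the post does, and a MIP solver's per-instance variability is precisely what such a simple model cannot capture. The function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores


def amdahl_parallel_fraction(speedup, cores):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p."""
    if cores < 2:
        raise ValueError("need at least 2 cores to infer a parallel fraction")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    observed_speedup = 2.0  # "approximately cuts solution times in half"
    cores = 8
    p = amdahl_parallel_fraction(observed_speedup, cores)
    print(f"efficiency on {cores} cores: "
          f"{parallel_efficiency(observed_speedup, cores):.0%}")
    print(f"implied parallel fraction: {p:.3f}")
    print(f"predicted speedup on 16 cores: {amdahl_speedup(p, 16):.2f}")
```

Under this reading, only a little over half of the work scales with cores, which is consistent with the post's point that doubling from 8 to 16 cores buys comparatively little on average.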

Skapare (16644)

May 27, 2011 @07:43AM (#36261614)

More cores is not necessarily better. Some software, and even some algorithms, can do poorly with such contention for memory access. OTOH, others can do much better.

You have to understand the software you intend to run on it very well.

Is it Embarrassingly parallel [wikipedia.org]?

September 10, 2008

Hello, I am a new user of a Windows HPC cluster consisting of 3 double-dual nodes, i.e. each node has 4 cores. Up to now, I have only worked with Linux clusters, so I have to get used to the job manager first.

At the moment I intend to compare the Linux and Windows clusters with the Intel MPI benchmark and want to test different constellations of nodes and cores. But I am wondering how to start only ONE process per node, i.e. use only ONE core per node, so that a 3-process job uses all 3 nodes and not just 3 cores of one node.

The mpiexec man page specifies the -pernode option, but this option doesn't work on the cluster if written on the command line (after mpiexec) in the Task List. I also played around with different settings for maximum cores when creating a new job template, but without the desired success.

I would be very happy if anybody gave me a hint on how to solve this problem. Many thanks, parsus

Hello, Parsus.
Yes, the job scheduler on HPC Server (or CCS) is a bit different than running MPI on a set of Linux nodes. The key difference is the primary role the job scheduler takes in assigning resources as opposed to mpiexec. However, there are cases where the job scheduler and mpiexec arguments can be used together to better control process placement of your MPI application.

The simplest means of running a single process on each of N nodes is with node-based scheduling (/numnodes:) like this:

job submit /numnodes:3 mpiexec imb.exe

which would run the MPI application named "imb.exe" on 3 nodes with one process (MPI rank) per node. The HPCS scheduler allows you to schedule by node, socket, or core.

You can also combine job scheduler and mpiexec arguments to more closely control the process placement of your application. For example:
job submit /numnodes:3 mpiexec -cores 2 imb.exe
will run 2 MPI processes on each of 3 nodes for a total of 6 MPI ranks for the job.
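The rank arithmetic above can be sketched in a few lines of Python. The helpers `total_ranks` and `submit_command` are hypothetical names introduced for illustration; they only reassemble the two command forms quoted in this reply (`/numnodes:` and `-cores`) and do not call the scheduler.

```python
def total_ranks(num_nodes, procs_per_node=1):
    """Total MPI ranks when each node runs the same number of processes."""
    if num_nodes < 1 or procs_per_node < 1:
        raise ValueError("node and process counts must be positive")
    return num_nodes * procs_per_node


def submit_command(num_nodes, app, procs_per_node=None):
    """Build the command string in the form quoted above (not executed)."""
    parts = ["job", "submit", f"/numnodes:{num_nodes}", "mpiexec"]
    if procs_per_node is not None:
        # Per the reply, -cores sets the number of processes per node.
        parts += ["-cores", str(procs_per_node)]
    parts.append(app)
    return " ".join(parts)


if __name__ == "__main__":
    print(submit_command(3, "imb.exe"), "->", total_ranks(3), "ranks")
    print(submit_command(3, "imb.exe", 2), "->", total_ranks(3, 2), "ranks")
```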

Note that mpiexec's "-affinity" argument can be used to separate processes on a node to avoid contention (and the resultant memory swapping and poor performance). The -affinity option will cause mpiexec to place the processes in such a way as to avoid any 2 processes sharing the same: L1 cache, L2 cache, Lx cache, physical package, NUMA node (listed in order of precedence).

IMPORTANT NOTE: You mentioned "the -pernode option..." by which I believe you intended the "/corespernode" argument. Please be aware /corespernode is a requirement and not a resource request. For example,
job submit /corespernode:4 /numnodes:2 app.exe
will run a single process on each of 2 nodes, where each of those nodes must have at least 4 cores.

Hope this helps.
Eric
Eric Lantz (Microsoft)

February 1, 2012 | ExtremeTech

It's been nearly eight years since Intel canceled Tejas and announced its plans for a new multi-core architecture. The press wasted little time in declaring conventional CPU scaling dead - and while the media has a tendency to bury products, trends, and occasionally people well before their expiration date, this is one declaration that's stood the test of time.

To understand the magnitude of what happened in 2004 it may help to consult the following chart. It shows transistor counts, clock speeds, power consumption, and instruction-level parallelism (ILP). The doubling of transistor counts every two years is known as Moore's law, but over time, assumptions about performance and power consumption were also made and shown to advance along similar lines.

Moore got all the credit, but he wasn't the only visionary at work. For decades, microprocessors followed what's known as Dennard scaling. Dennard predicted that oxide thickness, transistor length, and transistor width could all be scaled by a constant factor. Dennard scaling is what gave Moore's law its teeth; it's the reason the general-purpose microprocessor was able to overtake and dominate other types of computers.

CPU scaling showing transistor density, power consumption, and efficiency. Chart originally from The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

The original 8086 drew ~1.84W and the P3 1GHz drew 33W, meaning that CPU power consumption increased by 17.9x while CPU frequency improved by 125x. Note that this doesn't include the other advances that occurred over the same time period, such as the adoption of L1/L2 caches, the invention of out-of-order execution, or the use of superscalar execution and pipelining to improve processor efficiency. It's for this reason that the 1990s are sometimes referred to as the golden age of scaling. This expanded version of Moore's law held true into the mid-2000s, at which point the power consumption and clock speed improvements collapsed. The problem at 90nm was that transistor gates became too thin to prevent current from leaking out into the substrate.

Intel and other semiconductor manufacturers have fought back with innovations like strained silicon, hi-k metal gate, FinFET, and FD-SOI - but none of these has re-enabled anything like the scaling we once enjoyed. From 2007 to 2011, maximum CPU clock speed (with Turbo Mode enabled) rose from 2.93GHz to 3.9GHz, an increase of 33%. From 1994 to 1998, CPU clock speeds rose by 300%.
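The ratios quoted in the last two paragraphs are easy to re-derive. The sketch below uses only the figures given in the text; the 8086 clock speed is not stated there, so it is inferred from the 125x claim, and the helper names are hypothetical.

```python
def ratio(new, old):
    """How many times larger new is than old."""
    return new / old


def percent_increase(new, old):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures quoted in the article.
power_ratio = ratio(33.0, 1.84)                # P3 1GHz vs. 8086, in watts
frequency_ratio = 125.0                        # as stated in the text
implied_8086_mhz = 1000.0 / frequency_ratio    # inferred, not stated
turbo_increase = percent_increase(3.9, 2.93)   # 2007 to 2011, in GHz

if __name__ == "__main__":
    print(f"power grew {power_ratio:.1f}x")
    print(f"implied 8086 clock: {implied_8086_mhz:.0f} MHz")
    print(f"frequency gained per unit of power growth: "
          f"{frequency_ratio / power_ratio:.1f}x")
    print(f"2007-2011 clock increase: {turbo_increase:.0f}%")
```

Frequency grew roughly seven times faster than power over that period, which is the "golden age" the article describes; the 33% gain over 2007 to 2011 shows how far that trend has flattened.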

Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

NEWS CONTENTS

Old News ;-)

Using more cores does not necessarily lead to reduced run times on Gurobi 5.0

Re:Funny how 128 cores used to seem like a lot (Score:2)

How to use only one core of a node

The death of CPU scaling From one core to many - and why we're still stuck
