|Contents||Bulletin||Scripting in shell and Perl||Network troubleshooting||History||Humor|
|High Performance Computing (HPC)||Compiling openmpi with SGE suppo||Testing openMPI environment on SGE||Running jobs under SGE|
|InfiniBand||Tuning the run-time characteristics of Infiniband communications||MCA Parameters||Specifying ethernet interface instead of infiniband||Installing Mellanox InfiniBand Driver on RHEL 6.5||SGE Parallel Environment||VASP Performance optimization|
|Troubleshooting||ulimit - locked memory limit||InfiniBand||HPC cluster Architecture||Admin Horror Stories||Humor||Etc|
Based on Wikipedia Message Passing Interface and IBM Redbook RS-6000 SP Practical MPI Programming
MPI (Message Passing Interface) is a specification for the developers and users of message passing libraries. By itself, it is NOT a library - but rather the specification of what such a library should be for MPP (Massively Parallel Processors) architecture. MPP clusters consists of nodes connected by a network that is usually high-speed. Each node has its own processor, memory, and I/O subsystem. The operating system is running on each node, so each node can be considered a workstation. Despite the term massively, the number of nodes is not necessarily large. What makes the situation more complex is that each node is usually an SMP node with two or four CPUs and 16 or 32 cores.
MPI uses the notion of process rather than processor. Program copies are mapped to processors by the MPI runtime. For maximum parallel speedup, more physical processors can be used. MPI is not sanctioned by any major standards body; nevertheless, it has become a de facto standard for communication among processes that model a parallel program running on a distributed memory system.
Nonetheless, MPI programs are regularly run on shared memory computers.
NUMA (Non-Uniform Memory Access) architecture machines are built on a similar hardware model as MPP, but it typically provides a shared address space to applications. The memory latency varies according to whether you access local memory directly or remote memory through the interconnect. Thus the name non-uniform memory
Message Passing Interface (MPI) is typically implemented as a library of a standardized and portable message-passing functions useful for writing portable message-passing programs in Fortran 77, C or C++ programming languages. There are several well-tested and efficient implementations of MPI, including some that are free or in the public domain. Most MPI libraries implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++, Fortran and any language able to interface with such libraries, including C#, Java or Python. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs).
The runtime environment for the MPI implementation used (often called mpirun or mpiexec) spawns multiple copies of the program, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, which is an opaque descriptor for communication between the set of processes. Typically, for maximum performance, each CPU (or core in a multi-core machine) will be assigned just a single process. This assignment happens at runtime through the agent that starts the MPI program, normally called mpirun or mpiexec.
Although MPI belongs in layers 5 and higher of the OSI Reference Model, implementations may cover most layers, with sockets and Transmission Control Protocol (TCP) used in the transport layer.
At present, the standard has several popular versions:
MPI library functions include, but are not limited to, point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gather and reduce operations), synchronizing nodes (barrier operation) as well as obtaining network-related information such as the number of processes in the computing session, current processor identity that a process is mapped to, neighboring processes accessible in a logical topology, and so on. Point-to-point operations come in synchronous, asynchronous, buffered, and ready forms, to allow both relatively stronger and weaker semantics for the synchronization aspects of a rendezvous-send. Many outstanding operations are possible in asynchronous mode, in most implementations.
MPI provides a rich range of abilities. The following concepts help in understanding and providing context for all of those abilities and help the programmer to decide what functionality to use in their application programs.
Many MPI functions require that you specify the type of data which is sent between processors.
MPI provides an interface to write portable message-passing programs in FORTRAN 77 and C designed for running on parallel machines. It is a specification for message-passing, proposed as a standard by a broadly based committee of vendors, implementers, and users. For technical computing, MPI has displaced most other message-passing systems.
The MPI standardization effort involved many organizations mainly from the United States and Europe. The major vendors of high-end computers along with researchers from universities, government laboratories, and the software industry were involved in designing MPI. MPI has been influenced by work at the IBM T. J. Watson Research Center, Intel, and nCUBE with important contributions from Chimp, Parallel Virtual Machine, Portable Instrumented Communication Library and others.
MPI is a standard for writing message-passing programs that provides portability. It has a clearly defined base set of routines. MPI provides processes the ability to communicate via messages. Message passing is a paradigm used mostly on parallel machines, especially those with distributed memory. Message passing systems can be efficient and portable as demonstrated by many vendors’ implementations. System memory is used to buffer and store internal representation of various MPI objects such as groups, communicators, data types, and so on.
MPI has the following characteristics:
The standard features include the following:
Because of time constraints imposed in finishing the standard, the standard does not specify the following:
Features not included can be offered as extensions by specific implementations.
|Bulletin||Latest||Past week||Past month||
So not only has AMD priced aggressively to try to regain market share from Intel, they’ve eliminated the money decision that customers used to have to make in pondering whether to build their clusters from 4- or 2-socket motherboards.
July 19, 2013
Jeff SquyresIf you’re a bargain basement HPC user, you might well scoff at the idea of having more than one network interface for your MPI traffic.Brice July 19, 2013 at 7:14 am
“I’ve got (insert your favorite high bandwidth network name here)! That’s plenty to serve all my cores! Why would I need more than that?”
I can think of (at least) three reasons off the top of my head.
I’ll disclaim this whole blog entry by outright admitting that I’m a vendor with an obvious bias for selling more hardware. But bear with me; there is an actual engineering issue here.
Here’s three reasons for more network resources in a server:
- Processors are getting faster
- Core counts are rising
- NUMA effects and congestion within a single server
Think of it this way: MPI applications tend to be bursty with communication. They compute for a while, and then they communicate.
Since processors are getting faster, the length of computation time between communications can be decreasing. As a direct result, that same MPI application you’ve been running for years is now communicating more frequently, simply because it’s now running on faster processors.
Add to that the fact that you now have more and more MPI processes in a single server. Remember when four MPI processes per server seemed like a lot? 16 MPI processes per server is now commonplace. And that number is increasing.
And then add to that the fact that MPI applications have been adapted over the years to assume the availability of high-bandwidth networks. ”That same MPI application you’ve been running for years” isn’t really the same — you’ve upgraded it over time to newer versions that are network-hungry.
Consider this inequality in the context of MPI processes running on a single server:
num_MPI_processes * network_resources_per_MPI_process ?=
Are the applications running in your HPC clusters on the left or right hand side of that inequality? Note that the inequality refers to overall network resources — not just bandwidth. This includes queue depths, completion queue separation, ingress routing capability, etc.
And then add in another complication: NUMA effects. If you’ve only got one network uplink from your fat server, it’s likely NUMA-local to some of your MPI processes and NUMA-remote from other MPI processes on that server.
Remember that all MPI traffic from that remote NUMA node will need to traverse inter-processor links before it can hit the PCI bus to get to the network interface used for MPI. On Intel E5-2690-based machines (“Sandy Bridge”), traversing QPI links can add anywhere from hundreds of nanoseconds to a microsecond of short message half-roundtrip latency, for example. And we haven’t even mentioned the congestion/NUNA effects inside the server, which can further degrade performance.
My point is that you need to take a hard look at the applications you run in your HPC clusters and see if you’re artificially capping your performance by:
- Not having enough network resources (bandwidth is the easiest to discuss, but others exist, too!) on each server for the total number of MPI processes on that server
- Not distributing network resources among each NUMA locality in each server
Tags: HPC, mpi, NUMA, NUNA, process affinity“traversing QPI links can add about a microsecond of short message half-roundtrip latency”....
Looks exaggerated to me. I usually see several hundreds nanoseconds instead.It’s what we’ve seen within our own testing, but admittedly we didn’t rigorously check this number. So perhaps I should rephrase this: somewhere between hundreds of nanoseconds and a microsecond.
Mar 19, 2004
I have a question on the performance of MPI (MPICH, in this case) on an SMP (shared memory) system. I have a code that I, MPI parallelized myself .. and it scales near perfectly (on small sized distributed clusters). More explicitly, say I run this code on 2 single processor Macs, in parallel, I nearly get twice the speed compared with one.
However, if I run the same code on a dual processor Mac (G4 or G5) by configuring MPI to treat the machine as 2 computers .. I get much poorer performance (i.e. gain in speed over single processor). Moreover, the larger (in terms of memory) simulation I attempt, the worse the problem gets. On a run using approximately 300MB of RAM, I'm down to getting a factor of 1.5 speed-up using a dual processor over a single processor.
I even tried to reconfigure and recompile MPICH for shared memory communication (using -comm=shared) but no improvement.
I tried a totally different and unrelated code (that is also known to scale well) and I'm getting pretty much the same deal. I even (very briefly though) tried LAM-MPI with no significant difference.
Am I missing something? Has anyone noticed this as well? Note that the problem becomes significant only for *large* simulations .. say 300MB or more. Any advice would be appreciated. Maybe this is a generic occurance when you use MPI on an SMP machine .. instead of OpenMP?
I'll try a similar test on an IBM Power4 system (p690) that I have access to ..
... ... ....
Thanks for the nice and detailed responses to my
question. To answer a few questions that people
asked. I do not seem to get any significant difference
on a dual G4 compared with a dual G5. Also, both
these machines have a healthy amount of memory
(2.0 GB on the G4 and 1.5 GB on the G5). I/O was
pretty much turned off.
The current two points of view appear to be
1) memory bandwith problem (hardware)
2) "Kernel lock" while message-passing (Mac OS X)
For (1) to hold true, it would seem that the G4
would do significantly worse compared to the G5.
According to my tests .. it does not. They compare
well (in the speed-up factor on a dual over single).
To test this further what I did was the following.
I ran a SERIAL (unparallelized) version of one of my
codes in 2 different ways:
Case A) Run two copies of the code (residing in different
directories) on the Mac simultaneously.
Case B) Run simply one copy of the code and let half
the system stay idle.
If I compare the timings of the runs in Case A and Case
B, I get that for the exact same run Case A runs slower
by about 18%. This means that in Case A, both processors
run at about 82% "efficiency" compared to Case B. So, the
net efficiency of the dual proc machine is about 82%
compared with 2 separate processors. This explains, a
good part of the loss I talked about in my first message.
So, this would seem to imply this issue is not related to
MPI at all. So, maybe the memory bandwidth theory is
correct. However, I seem to get very similar numbers (even
for this test) on a G5 or G4! So what is going on?
I urge someone else to try this. Please run two copies of
a heavy duty code simultaneously but independently on a
dualie .. and see how the performance "efficiency" drops.
In this manner it will clear that I am not doing something
The global memory is used by the processors to communicate. The communication operation, in the existence of a global memory, is a memory reference operation; the sender saves data to a predefined memory location where it is read by the receiving processor. Since the global memory is accessible by both the processors, a synchronization method is required to prevent concurrent memory access to any given location; the main cause for the synchronization method is the need to protect two or more write operations to the same memory location. The performance of an SMP machine is determined by the CPU speeds, memory characteristics (caches, amount, network), and software (operating system, compilers). In addition to these factors, the communication between the processors on an SMP has to be considered. Even though the communication occurs transparently through memory references, it still affects the performance of a parallel application. Therefore, as part of this study measurements were carried out to evaluate the latency and bandwidth of the network connecting two processors within an SMP workstation.
I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download an install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses mvapich on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.
I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... I coded up some C code in graduate school to do some large scale modeling of electrophysiologic models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I thought of to tackle the problem.
1) I could use MPI alone, treating every processor as it's own "node" even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I tested two formulations first on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 for a dual-core machine as opposed to 1.3-1.4 for the threaded implementation. For the MPI code, I was able to organize operations so that processors were rarely idle, staying busy while messages were passed between them and masking much of the delay from transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that forced threads to often sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help this fact.
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.
Softpanorama hot topic of the month
Message Passing Interface
IBM Redbooks RS-6000 SP Practical MPI Programming by Yukiya Aoyama and Jun Nakano (IBM Japan), IBM Redbook.
Chapter 1. Introduction to Parallel Programming
Chapter 2. Basic Concepts of MPI
Chapter 3. How to Parallelize Your Program
Chapter 4. Advanced MPI Programming
Appendix A. How to Run Parallel Jobs on RS/6000 SP
Appendix B. Frequently Used MPI Subroutines Illustrated
Message Passing Interface
MPI Portable Parallel Programming for Scientific ComputingInformation on MPI tutorials is available from:
A multi-part quick start on LAM/MPI is available here. There is a shorter version below, but it has less explanations included.
Some tutorials developed at Argonne National Laboratory are also available:
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes. If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner.
ABUSE: IPs or network segments from which we detect a stream of probes might be blocked for no less then 90 days. Multiple types of probes increase this period.
Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers : Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism : The Iron Law of Oligarchy : Libertarian Philosophy
War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda : SE quotes : Language Design and Programming Quotes : Random IT-related quotes : Somerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose Bierce : Bernard Shaw : Mark Twain Quotes
Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 : Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law
Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds : Larry Wall : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOS : Programming Languages History : PL/1 : Simula 67 : C : History of GCC development : Scripting Languages : Perl history : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history
The Peter Principle : Parkinson Law : 1984 : The Mythical Man-Month : How to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite
Most popular humor pages:
Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor
The Last but not Least
Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.
Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...
|You can use PayPal to make a contribution, supporting development of this site and speed up access. In case softpanorama.org is down you can use the at softpanorama.info|
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.
Last modified: November, 14, 2014