Adapted from InfiniBand - Wikipedia
The Infiniband technology is a de facto standard on the HPC scene. It provides high-bandwidth, low-latency communications over high-speed serial connections. The technology was originally invented by a consortium including Microsoft, IBM, Intel, Hewlett-Packard, Compaq Computer, Dell Computer, and Sun Microsystems as a replacement for the current PCI standard for peripheral I/O. For various reasons, Infiniband has not yet lived up to the original expectations of the consortium, and both Microsoft and Intel appear to have backed away from the technology.
Infiniband is based on high-speed, switched serial links that may be combined in parallel to increase bandwidth. The Infiniband architecture supports a switched fabric for the traffic, in addition to the channel-based host interconnects. A single "1x" Infiniband link is rated at 2.5 Gbps, but interface and switching hardware is currently available in 10 Gbps (4x) and 30 Gbps (12x) full-duplex port configurations (20-Gbps and 60-Gbps total bandwidth, respectively).
Infiniband physical cable lengths for copper are limited to 17 m, and for fiber are allowed up to 17 km. The Infiniband specification includes physical, electrical, and software elements of the architecture. An interesting feature of the Infiniband hardware and software architecture is the allowance for RDMA between the interface cards and a user process' address space, so that operating system kernels can avoid data copying.
The Infiniband architecture is based on a reliable transport implemented on the interface card. This allows the bypass of the kernel's TCP/IP stack by use of the socket direct protocol (SDP), which interfaces directly between an application that uses the socket API and the Infiniband hardware, providing a TCP/IP-compatible transport. With this architecture, network transport tasks that would normally occur in the software TCP/IP stack are off-loaded to the Infiniband interface card, saving CPU overhead.
The Infiniband standard allows simultaneous transport of multiple high-level protocols through the switching fabric. Additional information on the Linux implementation of the Infiniband architecture is available at http://sourceforge.net/projects/infiniband.
InfiniBand is based on a switched fabric architecture of serial point-to-point links. Like Fibre Channel, PCI Express, Serial ATA, and many other modern interconnects, InfiniBand offers point-to-point bidirectional serial links intended for the connection of processors with high-speed peripherals such as disks. On top of the point to point capabilities, InfiniBand also offers multicast operations. It supports several signaling rates and, as with PCI Express, links can be bonded together for additional throughput.
An InfiniBand link is a serial link operating at one of five data rates: single data rate (SDR), double data rate (DDR), quad data rate (QDR), fourteen data rate (FDR), and enhanced data rate (EDR).
The SDR connection's signaling rate is 2.5 gigabit per second (Gbit/s) in each direction per connection.
For SDR, DDR and QDR, links use 8b/10b encoding: every 10 bits sent carry 8 bits of data, making the effective data transmission rate four-fifths of the raw rate. Thus single, double, and quad data rates carry 2, 4, or 8 Gbit/s of useful data, respectively. For FDR-10, FDR and EDR, links use 64b/66b encoding: every 66 bits sent carry 64 bits of data. (Neither of these calculations takes into account the additional physical layer overhead for comma characters or protocol requirements such as StartOfFrame and EndOfFrame.)
Implementers can aggregate links in units of 4 or 12, called 4X or 12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s of useful data. As of 2009 most systems use a 4X aggregate, implying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) connection. Larger systems with 12X links are typically used for cluster and supercomputer interconnects and for inter-switch connections.
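The rate arithmetic above can be checked with a short sketch. It covers only the three generations whose per-lane rates follow directly from the figures in the text (2.5, 5 and 10 Gbit/s per lane with 8b/10b encoding); the table and function names are illustrative, not part of any InfiniBand tool.

```python
# Raw signaling rate per lane in Gbit/s and the line encoding
# (data bits, transmitted bits), taken from the figures in the text.
LANE_RATES = {
    "SDR": (2.5, (8, 10)),
    "DDR": (5.0, (8, 10)),
    "QDR": (10.0, (8, 10)),
}


def link_rates(generation, lanes):
    """Return (raw, useful) throughput in Gbit/s for one direction of a link."""
    raw_per_lane, (data_bits, total_bits) = LANE_RATES[generation]
    raw = raw_per_lane * lanes
    # Only data_bits out of every total_bits on the wire carry payload.
    useful = raw * data_bits / total_bits
    return raw, useful


if __name__ == "__main__":
    for generation in ("SDR", "DDR", "QDR"):
        for lanes in (1, 4, 12):
            raw, useful = link_rates(generation, lanes)
            print(f"{lanes:>2}X {generation}: {raw:6.1f} Gbit/s raw, "
                  f"{useful:6.1f} Gbit/s useful")
```

Like the text, this ignores physical layer and protocol framing overhead, so real throughput is somewhat lower than the "useful" figure.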
The InfiniBand future roadmap also has "HDR" (High Data rate), due in 2014, and "NDR" (Next Data Rate), due "some time later", but as of June 2010, these data rates were not yet tied to specific speeds.
The single data rate switch chips have a latency of 200 nanoseconds, DDR switch chips have a latency of 140 nanoseconds and QDR switch chips have a latency of 100 nanoseconds. The end-to-end latency range spans from 1.07 microseconds MPI latency (Mellanox ConnectX QDR HCAs) to 1.29 microseconds MPI latency (Qlogic InfiniPath HCAs) to 2.6 microseconds (Mellanox InfiniHost DDR III HCAs).
As of 2009 various InfiniBand host channel adapters (HCA) exist in the market, each with different latency and bandwidth characteristics. InfiniBand also provides RDMA capabilities for low CPU overhead. The latency for RDMA operations is less than 1 microsecond (Mellanox ConnectX HCAs).
InfiniBand uses a switched fabric topology, as opposed to a hierarchical switched network like traditional Ethernet architectures. All transmissions begin or end at a "channel adapter." Each processor contains a host channel adapter (HCA) and each peripheral has a target channel adapter (TCA). These adapters can also exchange information for security or quality of service (QoS).
InfiniBand transmits data in packets of up to 4 KB that are taken together to form a message. A message can be: a direct memory access read from, or write to, a remote node (RDMA); a channel send or receive; a transaction-based operation (that can be reversed); a multicast transmission; or an atomic operation.
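As a rough illustration of how a message maps onto packets, the sketch below counts the packets needed for a message of a given size. It assumes the full 4 KB is available as payload in every packet and that an empty message still occupies one packet; both are simplifications for illustration, and the names are hypothetical.

```python
import math

# Assumption: the "up to 4 KB" packet size from the text, used entirely as payload.
MAX_PACKET_PAYLOAD = 4096


def packets_for_message(message_bytes, payload=MAX_PACKET_PAYLOAD):
    """Number of packets needed to carry one message of message_bytes bytes."""
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    # Simplified model: a zero-length message still takes one packet.
    return max(1, math.ceil(message_bytes / payload))


if __name__ == "__main__":
    for size in (100, 4096, 4097, 1024 * 1024):
        print(f"{size:>8} bytes -> {packets_for_message(size)} packet(s)")
```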
InfiniBand has been adopted in enterprise datacenters (for example Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud and Oracle SPARC SuperCluster), financial sectors, cloud computing (an InfiniBand based system won the best of VMWorld for Cloud Computing) and more. InfiniBand has been mostly used for high-performance computer cluster applications. A number of the TOP500 supercomputers have used InfiniBand, including the former reigning fastest supercomputer, the IBM Roadrunner.
SGI, LSI, DDN, Netapp, Oracle, Nimbus Data, Rorke Data among others, have also released storage utilizing InfiniBand "target adapters". These products compete with architectures such as Fibre Channel, SCSI, and other more traditional connectivity-methods. Such target adapter-based discs can become a part of the fabric of a given network, in a fashion similar to DEC VMS clustering. The advantage to this configuration is lower latency and higher availability to nodes on the network (because of the fabric nature of the network). In 2009, the Oak-Ridge National Lab Spider storage system used this type of InfiniBand attached storage to deliver over 240 gigabytes per second of bandwidth.
Military applications such as UAV, UUV and electronic warfare are taking this technology into the rugged application space to enhance capabilities. InfiniBand is used in high performance embedded computing systems such as RADAR, Sonar and SIGINT applications. Companies such as GE Intelligent Platforms and Mercury Computer Systems produce military grade Single Board Computers that are InfiniBand capable.
Early InfiniBand used copper CX4 cable for SDR and DDR rates with 4x ports - also commonly used to connect SAS (Serial Attached SCSI) HBAs to external (SAS) disk arrays. With SAS, this is known as an SFF-8470 connector, and is referred to as an "InfiniBand-style" Connector. For 12x ports, SFF-8470 12x is used.
The latest connectors used with up to QDR and FDR speeds 4x ports are QSFP (Quad SFP) and can be copper or fiber, depending on the length required.
For 12x ports, the CXP (SFF-8642) can be used up to QDR speed.
InfiniBand has no standard programming API within the specification. The standard only lists a set of "verbs" - functions that must exist. The syntax of these functions is left to the vendors. The de-facto standard has been the syntax developed by the OpenFabrics Alliance, which was adopted by most of the InfiniBand vendors, for GNU/Linux, FreeBSD, and MS Windows. The InfiniBand software stack developed by OpenFabrics Alliance is released as "OpenFabrics Enterprise Distribution (OFED)", under a choice of two licenses GPL2 or BSD license for Linux and FreeBSD, and as "WinOF" under a choice of BSD license for Windows.
Upper-level protocols such as IP over InfiniBand (IPoIB), Socket Direct Protocol (SDP), SCSI RDMA Protocol (SRP), iSCSI Extensions for RDMA (iSER) and so on allow standard data networking, storage, and file system applications to operate over InfiniBand. Except for IPoIB, which provides a simple encapsulation of TCP/IP data streams over InfiniBand, the other upper-level protocols transparently enable higher bandwidth, lower latency, lower CPU utilization, and end-to-end service using field-proven RDMA and hardware-based transport technologies available with InfiniBand. Configuring GPFS to exploit InfiniBand helps make the network design simple because InfiniBand integrates the network and the storage together (each server uses a single adapter utilizing different protocols), as in the following examples:
IPoIB running over high-bandwidth InfiniBand adapters can provide an instant performance boost to any IP-based application. IPoIB supports tunneling of IP packets over InfiniBand hardware. This method of enabling IP applications over InfiniBand is effective for management, configuration, setup or control plane related data where bandwidth and latency are not critical. Because the application continues to run over the standard TCP/IP networking stack, it is completely unaware of the underlying I/O hardware.
Socket Direct Protocol (SDP): For applications that use TCP sockets, SDP delivers a significant boost to performance and reduces CPU utilization and application latency. The SDP driver provides a high-performance interface for standard socket applications and a boost in performance by bypassing the software TCP/IP stack, implementing zero copy and asynchronous I/O, and transferring data using efficient RDMA and hardware-based transport mechanisms.
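The point that an IP application is unaware of the underlying hardware can be illustrated with an ordinary TCP echo round trip. The sketch below runs over the loopback address so that it works anywhere. On a cluster, bind_addr would instead be the IP address assigned to the IPoIB interface (often named ib0, which is an assumption about the local setup). Nothing in the code refers to InfiniBand, and the function names are illustrative.

```python
import socket
import threading


def echo_once(server_sock):
    """Accept one connection and echo back whatever arrives first."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def round_trip(bind_addr="127.0.0.1", payload=b"ping"):
    """Send payload to a local echo server over plain TCP and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((bind_addr, 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection((bind_addr, port), timeout=5) as client:
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
    return reply


if __name__ == "__main__":
    print(round_trip())
```

The same socket calls are what SDP is described as accelerating. How SDP is enabled for an existing application depends on the installed stack and is not shown here.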
August 1st, 2011 | infiniband Ins & Outs
Gamess is an electronic structure calculation package. Its installation is easy if you just want to use "sockets" communication mode. Just emerge it as you regularly do. Then use "rungms" to submit your job. The default rungms is okay to run the serial code. For the parallel computation, you still need to tune the script slightly. But since our cluster has Infiniband installed, it is better to go with the "mpi" communication mode. It took me quite some time to figure out how to install it correctly and make it run with mpiexec.hydra alone or with OpenPBS (Torque). Here is how I did it.
Software packages related:
1. gamess-20101001.3 (Download it beforehand from its developer's website)
2. mvapich2-1.7rc1. (Previous versions should be okay and I installed it under /usr/local/)
3. OFED-126.96.36.199. (Userspace libraries for Infiniband. See my previous post. Only updated kernel modules installed. Userspace libraries should be the same as in OFED-188.8.131.52)
4. torque-2.4.14 (OpenPBS)
1. Update the gamess-20101001.3.ebuild with this one and manifest it.
2. Unmask the mpi USE flag for gamess in ...
3. Run emerge -av gamess.
4. Replace rungms with this one;
5. Create a new script pbsgms as this one;
6. Set kernel.shmmax=XXXXX in /etc/sysctl.conf, where XXXXX is a large enough integer for shared memory (the default value of 32MB is too small for DDI). Run /sbin/sysctl -w kernel.shmmax=XXXXX to update the setting on the fly.
Added on Sept. 9, 2011. It seems that kernel.shmall=XXXXX should be modified as well. Please bear in mind that the unit for kernel.shmall is pages, while the unit for kernel.shmmax is bytes. A page is usually 4096 bytes (use getconf PAGE_SIZE to verify).
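Because the two settings use different units, it may help to derive one from the other. The sketch below computes the smallest kernel.shmall (in pages) that covers a chosen kernel.shmmax (in bytes). The 1 GiB value is purely hypothetical, not a recommendation from the original post, and the helper name is illustrative.

```python
import os


def shmall_pages(shmmax_bytes, page_size=None):
    """Smallest kernel.shmall value (in pages) that covers shmmax_bytes."""
    if page_size is None:
        # Same value that `getconf PAGE_SIZE` reports.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return -(-shmmax_bytes // page_size)  # ceiling division


if __name__ == "__main__":
    shmmax = 1024 ** 3  # hypothetical target: 1 GiB
    print(f"kernel.shmmax={shmmax}")
    print(f"kernel.shmall={shmall_pages(shmmax)}")
```

Since kernel.shmall limits the total shared memory on the system rather than a single segment, the computed value is a lower bound. A larger value may be needed when several jobs share a node.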
7. Environment setting. Create a file
Then update your profile.
8. Create a hostfile,
This file is only needed by invoking
9. Test your installation: copy a test job input file from /usr/share/gamess/tests/; submit the job using pbsgms exam20 (other settings will be prompted), or using rungms exam20 00 4.
1. Two changes were made to the ebuild file.
(a) The installation suggestions given in the Gamess documentation are not enough. Libraries other than mpich also need to be passed to lked, the linker program for Gamess.
(b) MPI environment constants need to be exported to the installation program, compddi, through a temporary file.
2. Many changes were made to the script rungms. I cannot remember all of them. Some are as follows.
(a) For parallel computation, the scratch file will be put under /tmp on each node by default.
(b) The script will be working with
(c) System-wide setting for Gamess can be put under /etc/env.d.
(d) A host file is needed if not using PBS. By default, it should be at ~/.hosts. If it is not found, the job runs on the local host only.
3. The script pbsgms is based on sge-pbs, which is shipped with the Gamess installation package. I have made it work with Torque. Numerous changes were made.
OCZ's SAS SSDs in InfiniBand benchmark configuration
Editor:- June 12, 2013 - Mellanox today announced details of a benchmark demonstration it did this week showing its FDR 56Gb/s InfiniBand running on Windows Server 2012 in a system which uses OCZ's Talos 2R SSDs (2.5" SAS SSDs) working with LSI's Fast Path I/O acceleration software and RAID controllers - getting over 10GB/s throughput to a remote file system while consuming under 5% of CPU overhead.
ISC 2012 If you want to try to choke a PCI-Express 3.0 peripheral slot, you have to bring a fire hose. And that is precisely what InfiniBand and Ethernet switch and adapter card maker Mellanox Technology has done with a new Connect-IB server adapter.
Mellanox was on hand at the International Super Computing event in Hamburg, Germany this week, showing off its latest 56Gb/sec FDR InfiniBand wares and boasting of the uptake in InfiniBand technology in the Top 500 rankings of supercomputers and its general uptake in database cluster, data analytics, clustered storage arrays, and other segments of the systems racket.
Mellanox is the dominant supplier now that QLogic has sold off its InfiniBand biz to Intel, and it is milking the fact that it has FDR switches and adapters in the field when QLogic is still at 40Gb/sec QDR InfiniBand. (QLogic, if it had not been eaten by Intel, would counter that it can get the same or better performance from its QDR gear than Mellanox delivers with its FDR gear.) These are good days for Mellanox, which ate rival Voltaire to get into the Ethernet racket and which is enjoying the benefits of the rise of high-speed clusters.
At least until Intel comes back at Mellanox in a big way, pursuing all of its own OEM partners with the Xeon-QLogic-Fulcrum-Cray Aries quadruple whammy. Intel did not buy QLogic, Fulcrum Microsystems, and the Cray supercomputer interconnect business to sit on these assets, like some kind of knickknacks sitting on a shelf.
Intel is going to try to become a supplier of supercomputing interconnects that do all kinds of things and that hook into its Xeon processors and chipsets tightly and seamlessly, and that will eventually make it very tough for Mellanox.
But not so at ISC this year. As El Reg previously reported, for the first time in the history of the Top 500 rankings of supercomputers, InfiniBand has edged out Ethernet, with 208 machines using InfiniBand and 207 using Ethernet. Drilling down into the data a bit, there were 195 machines that used Gigabit Ethernet switches and adapters to link server nodes together, and another 12 that used 10 Gigabit Ethernet.
There are still 78 machines on the list that use earlier InfiniBand gear, but there are 110 machines using QDR InfiniBand, and 20 machines that use FDR InfiniBand. There are a few hybrid interconnects as well on the list that mix InfiniBand with some other network.
The remainder are a mix of custom interconnects like the Cray "SeaStar" XT and "Gemini" XE routers, the Silicon Graphics NUMAlink, IBM's BlueGene/Q, Fujitsu's "Tofu," and a few others. Gigabit Ethernet is by far the most popular of any single speed or type, of course, but it is dramatic how InfiniBand has really blunted the uptake of 10GE networks at the top end of supercomputer clusters. The idea seems to be that if you are going to spend money on anything faster than Gigabit Ethernet, then you might as well skip 10GE or even 40GE and get the benefits of QDR or FDR InfiniBand.
This is certainly what Mellanox is hoping customers do, and that is why it is bragging about a new server adapter card called Connect-IB that can push two full-speed FDR ports.
Figure: The Connect-IB dual-port InfiniBand FDR adapter card
This new Connect-IB card, which is sampling now, will be available for both PCI-Express 3.0 and PCI-Express 2.0 slots, and eats an x16 slot. Up until now, network adapter cards have generally been x8 slots, with half as many lanes of traffic and therefore a lot less theoretical and realized bandwidth available to let the network chat up the servers. By moving to servers that support PCI-Express 3.0 slots, you can put two FDR ports on each adapter using an x16 slot and still run them at up to 100Gb/sec aggregate across the two ports.
If your server is using older PCI-Express 2.0 slots – and at this point, that means anything that is not using an Intel Xeon E5-2400, E5-2600, E5-4600, or E3-1200 v2 processor since no other server processor maker is supporting PCI-Express 3.0 yet – then there is an x16 Connect-IB card that has one port that you can try to push all the way up to 56Gb/sec speeds.
These new cards have a single microsecond MPI ping latency and support Remote Direct Memory Access (RDMA), which is one of the core technologies that gives InfiniBand its performance edge over Ethernet and which allows for servers to reach across the network directly into each other's main memory without going through that pesky operating system stack. Mellanox says the new two-port Connect-IB card can push 130 million messages per second – four times that of its competitor. (That presumably means you, QLogic, er, Intel.)
There is also a single and dual-port option on the Connect-IB cards that slide into x8 slots. It is not clear how much data these x8 slots can really push, and until they are tested in the field, Mellanox is probably not even sure.
In theory, an x8 slot running at PCI-Express 3.0 speeds should be able to do 8GB/sec (that's bytes, not bits) of bandwidth in both directions, for a total of 16GB/sec of total bandwidth across that x8 link. This should not saturate the x8 link.
What is certain is that an x8 slot running at PCI-Express 2.0 speeds could not really handle FDR InfiniBand, with only 32Gb/sec of bandwidth (4GB/sec) each way available. That is well below the 56Gb/sec an FDR port can carry.
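The slot arithmetic can be sketched as follows, using the per-lane transfer rates and line encodings of PCI-Express 2.0 (5 GT/s, 8b/10b) and 3.0 (8 GT/s, 128b/130b). It ignores packet and protocol overhead on the bus, so the figures are upper bounds, and the names are illustrative only.

```python
# Per-lane transfer rate in GT/s and line encoding (data bits, transmitted bits).
PCIE_GEN = {
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
}

FDR_PORT_GBIT = 56.0  # nominal rate of one 4X FDR port, as quoted in the text


def pcie_gbytes_per_sec(gen, lanes):
    """Upper bound on slot bandwidth in GB/s, one direction."""
    gt_per_lane, data_bits, total_bits = PCIE_GEN[gen]
    gbit = gt_per_lane * lanes * data_bits / total_bits
    return gbit / 8  # bits to bytes


if __name__ == "__main__":
    fdr_gbytes = FDR_PORT_GBIT / 8
    for gen, lanes in (("2.0", 8), ("2.0", 16), ("3.0", 8), ("3.0", 16)):
        slot = pcie_gbytes_per_sec(gen, lanes)
        print(f"PCIe {gen} x{lanes}: {slot:5.2f} GB/s each way, "
              f"room for {slot / fdr_gbytes:.2f} FDR port(s) at nominal rate")
```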
Now, the ConnectX chips on the Mellanox adapters as well as the SwitchX ASICs at the heart of its switches swing both ways, Ethernet and InfiniBand, so don't jump to the wrong conclusion and think Mellanox doesn't love Ethernet.
The company was peddling its 40GE adapters and switches, which support RDMA over Converged Ethernet (RoCE) and which give many of the benefits of InfiniBand to customers who don't want to build mixed InfiniBand-Ethernet networks. (Or, perhaps more precisely, they want Mellanox to do it inside of the switch and inside of the adapter cards and mask the transformation from the network.) Mellanox says that it is showing up to an 80 per cent application performance boost using its 40GE end-to-end compared to 10GE networks on clusters.
In addition, Mellanox announced that the latest FDR InfiniBand adapters will also support Nvidia's GPUDirect protocol, which is a kind of RDMA for the Tesla GPU coprocessors that allows GPUs inside of a single machine to access each other's memory without going through the CPU and OS stack to do it.
With the current Tesla K10 and future Tesla K20 GPU coprocessors, GPUDirect will allow for coprocessors anywhere in a cluster to access the memory of any other coprocessor, fulfilling Nvidia's dream of not really needing the CPU for much at all. This GPUDirect support will be fully enabled in Mellanox FDR adapters. ®
Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: April, 03, 2014