Exploring Clustered Parallel File Systems and Object Storage Intel® Developer Zone
This paper discusses recent research and testing of clustered, parallel file systems and object storage technology. Also included is an overview of product announcements from HP, IBM and Panasas in these areas.
The leading data access protocol for batch computing is currently Network File System (NFS), but even with bigger, faster, more expensive Network Attached Storage (NAS) hardware available, batch processing seems to have an insatiable appetite for I/O operations per second. The national labs have seen this NFS bottleneck in their high-performance computing (HPC) clusters and have abandoned NFS in favor of Lustre* at Pacific Northwest National Lab (PNNL) and Lawrence Livermore National Lab (LLNL). LLNL and Los Alamos National Lab (LANL) have adopted Panasas hardware-based object storage. This paper details the research and testing of clustered, parallel file systems with applications to batch pool HPC methodologies and discusses object storage technology. It also discusses recent product announcements from HP, IBM, and Panasas in these areas.
During the research, the following file systems were investigated:
- Global File System* (GFS*) from Sistina (now Red Hat)
- General Parallel File System* (GPFS*) from IBM
- iSCSI aggregation* from Terrascale
- Parallel Virtual File System* (PVFS*) and PVFS2* from Clemson University
- Lustre from Cluster File Systems
Out of this investigation, a lab test of PVFS2 and Lustre based on scalability and access criteria was performed. This paper details those selection criteria and test results, plus features of each file system explored.
Two interesting emerging technologies are object storage devices and iSCSI. The Lustre parallel file system and Panasas both use object storage devices (called targets in Lustre). This paper describes object storage devices as well as storage aggregation using PVFS and iSCSI. What if by using either object storage targets in Lustre or storage aggregation, one could make use of all the excess storage in each compute node, simultaneously and in parallel? In a normal e-commerce (EC) environment, the available (free) storage would be on the order of 30 to 100 terabytes, dependent on disk size and number of compute nodes.
Why the Need for High-Performance Parallel File Systems?
As ubiquitous as NFS is, it is still a high-overhead protocol. Even with the newly available NFS V4 and its ability to combine operations into one request, benchmarks have shown that the performance is still substantially less than that available via other protocols. NFS is the current industry standard for NAS, sharable storage on UNIX* and Linux* servers. Its lack of scalability and coherency can be limitations for some high-bandwidth, I/O intensive applications, causing an imposing I/O bottleneck. However, if scalability and coherency are not issues, then NFS is a viable solution. The main problem with NFS is the single point of access to a server; any particular file is still only available on one server via one network interface. We can make bigger and faster servers with bigger and faster network pipes, but we still cannot grow to the volume of data that is necessary to support I/O intensive applications on a 1000-node cluster. Parallel access to multiple servers is the current solution to the problem of growing throughput and overall storage performance. Two approaches to the problem are parallel virtual file systems and object storage devices.
What is a Parallel File System?
In general, a parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. This is similar to network link aggregation in which the I/O is spread across several network connections in parallel, each packet taking a different link path from the previous. Parallel file systems place data blocks from files on more than one server and more than one storage device. As the available servers and available storage devices are increased, throughput can easily be doubled or tripled, given enough network connections on the clients. As a client application requests data I/O, each sequential block request potentially can be going to an entirely different server or storage device. In essence, there is a linear increase in performance up to the total capacity of the network. PVFS is one product that provides this kind of clustered parallel access to data; another product is Lustre. These products provide performance and capacity scalability using different unique protocols. Lustre adds the concept of object storage devices to provide another layer of abstraction with the capability of later providing media redundancy on a file-by-file basis. See the Web links in the References section for more information on these software applications.
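The round-robin placement described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual layout algorithm of PVFS or Lustre; the function names `locate_block` and `servers_for_range`, and the stripe size and server count used in the note below, are assumptions made for the example.

```python
def locate_block(offset, stripe_size, num_servers):
    """Map a byte offset in a file to (server index, offset on that server).

    Assumes plain round-robin striping: stripe unit 0 goes to server 0,
    unit 1 to server 1, and so on, wrapping around after the last server.
    """
    stripe_unit = offset // stripe_size      # which stripe unit holds this byte
    server = stripe_unit % num_servers       # round-robin choice of server
    # Each server stores every num_servers-th unit back to back.
    local_offset = (stripe_unit // num_servers) * stripe_size + offset % stripe_size
    return server, local_offset


def servers_for_range(offset, length, stripe_size, num_servers):
    """Return the sorted list of servers touched by one contiguous request."""
    if length <= 0:
        return []
    first_unit = offset // stripe_size
    last_unit = (offset + length - 1) // stripe_size
    return sorted({unit % num_servers for unit in range(first_unit, last_unit + 1)})
```

With a hypothetical 64 KiB stripe unit and four servers, `servers_for_range(0, 1024 * 1024, 64 * 1024, 4)` reports that a 1 MiB sequential read touches all four servers, which is why adding servers can raise throughput as long as the client's network links are not the limit.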
Clustered versus Parallel File Systems
Clustered file systems generally fall into the category of shared storage across multiple servers. Red Hat GFS is one of these. The product really isn’t designed for performance but for brokering access to shared storage and providing many-to-one access to data. When combined with high-availability (HA) software, GFS provides for a very resilient server configuration that scales up to 255 nodes. This means shared access to a single storage node, not performance scaling by striping data. IBM GPFS also provides simultaneous shared access to storage from multiple nodes, and adds a virtualization layer that provides transparent access to multiple storage sources via the IBM SAN File System. Neither of these products was considered in this evaluation due to scalability and/or cost issues. OpenGFS and Enterprise Volume Management System (EVMS) on SuSE Linux Enterprise Server* 9 (SLES9*) could provide exciting possibilities for low-cost high-performance Linux file servers utilizing storage virtualization.
Object Storage Devices
Each file or directory can be thought of as an object with attributes. Each attribute can be assigned a value such as file type, file location, whether the data is striped, ownership, and permissions. An object storage device allows us to specify for each file where to store the blocks allocated to the file, via a metadata server and object storage targets. Extending the storage attribute further, we can specify not only how many targets to stripe onto, but also what level of redundancy we want. Some implementations allow us to specify RAID0, RAID1, and RAID5 on a per-file basis. Panasas has taken the concept of object storage devices and implemented it entirely in hardware. Using a lightweight client on Linux, Panasas is able to provide highly scalable multi-protocol file servers, and they have implemented per-file level RAID (0 and 1 currently).
Figure 1. Object Storage Model
Figure 2. Data Striping in Object Storage
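The idea of a file as an object whose attributes include its own striping and redundancy policy can be sketched as a small data structure. This is a hedged illustration only: the class `FileObject`, its field names, and the `place_units` helper are hypothetical and are not taken from Lustre or Panasas, and only RAID0 and RAID1 are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileObject:
    """A file viewed as an object with attributes (all field names are illustrative)."""
    name: str
    owner: str
    permissions: int                      # e.g. 0o644
    raid_level: int = 0                   # per-file redundancy: 0 = stripe, 1 = mirror
    targets: List[str] = field(default_factory=list)  # object storage targets to use


def place_units(obj: FileObject, num_units: int) -> Dict[int, List[str]]:
    """Return, for each stripe unit, the targets that hold a copy of it."""
    if not obj.targets:
        raise ValueError("object has no storage targets assigned")
    if obj.raid_level == 0:
        # Striping: each unit lives on exactly one target, chosen round-robin.
        return {u: [obj.targets[u % len(obj.targets)]] for u in range(num_units)}
    if obj.raid_level == 1:
        # Mirroring: every unit is written to every assigned target.
        return {u: list(obj.targets) for u in range(num_units)}
    raise NotImplementedError("only RAID0 and RAID1 are sketched here")
```

Because the policy travels with the object, a scratch file can be striped for speed while a results file in the same directory is mirrored for safety.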
Lustre
Many Linux clusters use slow shared I/O protocols, such as Network File System (NFS), the current de facto standard for sharing files. The resulting slow I/O can limit the speed and throughput of the Linux cluster. Lustre provides significant advantages over other distributed file systems. It runs on commodity hardware and uses object-based disks for storage and metadata servers for file system metadata (inodes). This design provides a substantially more efficient division of labor between computing and storage resources. Replicated, failover metadata servers (MDSs) maintain a transactional record of high-level file and file system changes. One or many object storage targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. File operations bypass the metadata server completely and fully utilize the parallel data paths to all OSTs in the cluster. This unique approach - separating metadata operations from data operations - results in significantly enhanced performance. This division of function leads to a truly scalable file system and more recoverability from failure conditions by providing the advantages of both journaling and distributed file systems.
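The separation of metadata operations from data operations can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Lustre's protocol or API: the classes `MetadataServer`, `ObjectStorageTarget`, and `Client` are hypothetical, every file is striped over all targets, and the client is assumed to contact the metadata server once per open and then move data directly to and from the targets.

```python
class MetadataServer:
    """Toy MDS: records which targets a file is striped over, and nothing else."""

    def __init__(self, num_targets, stripe_size):
        self.num_targets = num_targets
        self.stripe_size = stripe_size
        self.layouts = {}
        self.requests = 0

    def open(self, path):
        self.requests += 1
        # Every file is striped over all targets in this sketch.
        return self.layouts.setdefault(path, list(range(self.num_targets)))


class ObjectStorageTarget:
    """Toy OST: stores stripe units keyed by (path, unit number)."""

    def __init__(self):
        self.units = {}
        self.requests = 0

    def write(self, path, unit, data):
        self.requests += 1
        self.units[(path, unit)] = data

    def read(self, path, unit):
        self.requests += 1
        return self.units[(path, unit)]


class Client:
    def __init__(self, mds, osts):
        self.mds, self.osts = mds, osts

    def write_file(self, path, data):
        layout = self.mds.open(path)             # one metadata round trip
        size = self.mds.stripe_size
        for unit, start in enumerate(range(0, len(data), size)):
            ost = self.osts[layout[unit % len(layout)]]
            ost.write(path, unit, data[start:start + size])  # data goes straight to the OST
        return -(-len(data) // size)             # number of stripe units written

    def read_file(self, path, num_units):
        layout = self.mds.open(path)             # one metadata round trip
        return b"".join(
            self.osts[layout[u % len(layout)]].read(path, u) for u in range(num_units)
        )
```

In this model the request counter on the metadata server grows with the number of opens, while the counters on the targets grow with the amount of data moved, so data traffic never funnels through a single server.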
Lustre supports strong file and metadata locking semantics to maintain total coherency of the file systems even under a high volume of concurrent access. File locking is distributed across the storage targets (OSTs) that constitute the file system, with each OST handling locks for the objects that it stores. Lustre technology is designed to scale while maintaining resiliency. As servers are added to a typical cluster environment, failures become more likely due to the increasing number of physical components. Lustre’s support for resilient, redundant hardware provides protection from inevitable hardware failures through transparent failover and recovery. Lustre has not yet been ported to support UNIX and Windows operating systems. Lustre clients can and probably will be implemented on non-Linux platforms, but as of this writing, Lustre is available only on Linux.
Currently, one additional drawback to Lustre is that a Lustre client cannot run on a server that is providing OSTs. A fix is being worked on and may be available soon; until then, this restriction limits the utility of Lustre for storage aggregation (see the discussion of Storage Aggregation below). Using Lustre, combined with a low-latency high-throughput cluster interconnect, you can achieve throughput numbers of well over 500 MB/sec by striping data across hundreds of object storage targets.
Figure 3. Typical Lustre client/server configuration.
Lustre is an open, standards-based technology that is well funded and backed by the U.S. Department of Energy (DOE), the greater open source Linux community, Cluster File Systems, Inc. (Cluster FS), and Hewlett Packard (HP). Cluster FS provides commercial support for Lustre, and provides Lustre as an open source project. HP has combined Lustre, ProLiant* file servers running Linux, and HP StorageWorks* EVA disk arrays into a hardware/software product called HP Scalable File Share (SFS).
HP and PNNL have partnered on the design, installation, integration and support of one of the top 10 fastest computing clusters in the world. The HP Linux super cluster, with more than 1,800 Itanium® processors, is rated at more than 11 TFLOPS. PNNL has run Lustre for more than a year and currently sustains over 3.2 GB/s of bandwidth running production loads on a 53-terabyte Lustre-based file share. Individual Linux clients are able to write data to the parallel Lustre servers at more than 650 MB/s.
Parallel Virtual File System (PVFS)
Parallel Virtual File System (PVFS) is an open source project from Clemson University whose lightweight server daemon gives hundreds to thousands of clients simultaneous access to storage devices. Each node in the cluster can be a server, a client, or both. At the time PVFS2 was installed and tested, there were no considerations in the product for redundancy, and Lustre provided more features and flexibility. Now that PVFS2 has progressed beyond version 1.0 and enterprise Linux (SLES9) has been deployed, PVFS2 should be considered for further testing and evaluation. Since storage servers can also be clients, PVFS2 supports striping data across all available storage devices in the cluster (storage aggregation, see below). PVFS2 is best suited for providing large, fast temporary scratch space.
Storage Aggregation
Rather than providing scalable performance by striping data across dedicated storage devices, storage aggregation provides scalable capacity by utilizing available storage blocks on each compute node. Each compute node runs a server daemon that provides access to free space on the local disks. Additional software runs on each client node that combines those available blocks into a virtual device and provides locking and concurrent access to the other compute nodes. Each compute node could potentially be a server of blocks and a client. Using storage aggregation on a 1000-node compute batch pool, 36 TB of free storage could potentially be gained for high-performance temporary space.
Two products in this area are Terrascale TerraGrid* and Ibrix Fusion*. Both of these products deserve a closer look in the future. There are obvious issues of reliability, since the mean time between failures (MTBF) of the pool is roughly that of a single node divided by the number of nodes. TerraGrid solves this problem by using the Linux native meta-device driver to provide mirroring of aggregated devices. Another issue needs consideration when the compute nodes are also serving storage blocks: how much of the compute resources are used serving blocks to other compute nodes?
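The capacity and reliability trade-off of storage aggregation comes down to simple arithmetic, sketched below. The 36 TB figure for a 1000-node pool implies about 36 GB of free disk per node; that per-node value, the 100,000-hour node MTBF used in the note, and both function names are assumptions made for illustration. The MTBF estimate also assumes independent node failures, which is the usual simplification behind "divide by the number of nodes".

```python
def pooled_capacity_tb(num_nodes, free_gb_per_node, copies=1):
    """Usable pooled capacity in decimal TB; copies=2 models mirrored devices."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return num_nodes * free_gb_per_node / copies / 1000


def pool_mtbf_hours(node_mtbf_hours, num_nodes):
    """Rough MTBF of the whole pool, assuming independent node failures."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes
```

For example, `pooled_capacity_tb(1000, 36)` gives 36.0 TB unmirrored and `pooled_capacity_tb(1000, 36, copies=2)` gives 18.0 TB when every block is mirrored once. With a hypothetical node MTBF of 100,000 hours, `pool_mtbf_hours(100_000, 1000)` is 100 hours, which shows why some form of mirroring is needed before aggregated space holds anything but scratch data.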
High-Performance Computing and Cluster Technologies
This is outside the scope of this paper, but other technologies to investigate are the following:
- High-bandwidth, low-latency interconnects such as InfiniBand*, where sustained data rates of over 800 MB/sec can be obtained with data I/O intensive processes and computing.
- Single System Image clusters in order to wring the most performance out of computing resources.
New Work in pNFS
The IETF NFS v4 working group has introduced a parallel NFS (pNFS) protocol extension derived from work by Panasas. Simply put, this protocol extension allows object-storage-like access to parallel data sources using out-of-band metadata servers. See Gibson, IETF, and pNFS for details.
In summary, clustered, parallel file systems provide the highest performance and lowest overall cost for access to temporary design data storage in batch processing pools. Parallel cluster file systems remove our dependency on centralized, monolithic NFS and very expensive file servers for delivering data to batch processing nodes. Parallel cluster file systems provide storage aggregation over thousands of compute nodes. Parallel file systems can take advantage of low-latency, high-bandwidth interconnects, thus relieving file access of TCP/IP overhead and the latency of shared Ethernet networks.
There are drawbacks to most of the parallel file system offerings, specifically in media redundancy, so currently the best application for clustered parallel file systems would be for high-performance scratch storage on batch pools or tape-out where source data is copied and simulation results are written from thousands of cycles simultaneously.
Additional References & Resources
- Threading Developer Center
- Parallel and Multi-Core Developer Community
- Open Source Developer Community
- HP StorageWorks Scalable File Share*
- Panasas: http://www.panasas.com/products/*
Images used with permission
- InfiniBand: http://www.intel.com/technology/infiniband/index.htm
- Sun Microsystems Lustre wiki: http://wiki.lustre.org/index.php?title=Main_Page* (PDF 190KB)
- NFS Extensions for Parallel Storage: Panasas Position: http://www.citi.umich.edu/NEPS/positions/gibson.pdf* (PDF 148KB)
- pNFS http://www.pdl.cmu.edu/pNFS/*
By Robin Harris
February 05, 2007 4:48 PM EST
If you're retiring in the next five years you can skip this article. Otherwise, listen up.
RAID arrays were great in their day. But that day is drawing to a close. Managing LUNs and volumes, paying 20x the cost of the raw capacity for protection, poor scale-out: RAID arrays are just not competitive for large-scale infrastructures.
Storage clusters are now a proven commodity, with support from companies such as Oracle, IBM and NetApp. Highly resilient, simplified management, much lower cost. What's not to like?
Here are some examples:
- The world's largest data centers, Google, Amazon, Yahoo and Microsoft's MSN, all use storage clusters for 7x24 availability in their advertising operations
- At least a dozen more firms are selling cluster storage, including NetApp, the fastest growing large storage company
- Polyserve and Red Hat's GFS focus on storage clusters for Oracle and DB2 databases, with Oracle and IBM support
- Omneon, a company specializing in storage and multi-media support for broadcasters, is selling "Media Grid" storage clusters. TV stations are a real-time 7x24 production environment: if the stuff doesn't work, the TV station doesn't get paid. It works.
- A storage cluster company, Isilon, just went public with a $1.4 billion market cap
Arrays aren't going away tomorrow, or ever. It took 10 years from the publication of the Berkeley RAID paper before RAID arrays took 50% of the external storage market. Yet storage cluster use has been growing rapidly in some major niches: internet data centers, video and broadcasting and web services. The 85% of enterprise data that is unstructured is the next big market for storage clusters.
Even if you work for an IBM-only shop you should know that IBM Global Services is installing and supporting storage clusters today. If your CIO is more aggressive, start getting informed about this technology today. You'll be the more valuable for it.
Some brief intros to storage cluster tech from my site StorageMojo include Google File System, Microsoft's Boxwood, Google's BigTable storage system, and Isilon's Cluster Technology. Google's stuff isn't for sale, nor is it architected for enterprise use, but they've done a good job of distilling storage clusters down to their bare essentials and exposing the issues.
I also recommend Kevin Closson's Oracle blog for deep Oracle insight.
In five years you could be managing petabytes with fewer headaches than terabytes give you today. Storage clusters are a new day for enterprise storage and data management pros.
Comments welcome, of course.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusively for research and educational purposes. If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner.
Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.
Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: November, 14, 2014