|
Softpanorama |
May the source be with you, but remember the KISS principle ;-)
Softpanorama Search
|
| Recommended Links | Papers, ebooks tutorials | Reference | ||||||
| Unix Internals | Filesystems | Unix System Calls | Solaris vs Linux | Whitepapers | History | Tips | Humor | Etc |
Solaris Filesystem Choicesposted by John Finigan on Mon 21st Apr 2008 19:00 UTCWhen it comes to dealing with storage, Solaris 10 provides admins with more choices than any other operating system. Right out of the box, it offers two filesystems, two volume managers, an iscsi target and initiator, and, naturally, an NFS server. Add a couple of Sun packages and you have volume replication, a cluster filesystem, and a hierarchical storage manager. Trust your data to the still-in-development features found in OpenSolaris, and you can have a fibre channel target and an in-kernel CIFS server, among other things. True, some of these features can be found in any enterprise-ready UNIX OS. But Solaris 10 integrates all of them into one well-tested package. Editor's note: This is the first of our published submissions for the 2008 Article Contest.
The details of the whole Solaris storage stack could fill a book, so in this article I will focus only on filesystems. There are four common on-disk filesystems for Solaris, and my goal is to familiarize the reader with each of them, and to mention a few deployment scenarios where each is appropriate.
UFS
UFS in its various forms has been with us since the days of BSD on VAXen the size of refrigerators. The basic UFS concepts thus date back to the early 1980s and represent the second pass at a workable UNIX filesystem, after the very slow and simple filesystem that shipped with the truly ancient Version 7 UNIX. Almost all commercial UNIX OSs have had a UFS, and ext3 in Linux is similar to UFS in design. Solaris inherited UFS through SunOS, and SunOS in turn got it from BSD.
Until recently, UFS was the only filesystem that shipped with Solaris. Unlike HP, IBM, SGI, and DEC, Sun did not develop a next-generation filesystem during the 1990s. There are probably at least two reasons for this: most competitors developed their new filesystems using third party code which required per-system royalties, and the availability of VxFS from Veritas. Considering that a lot of the other vendors' filesystem IP was licensed from Veritas anyway, this seems like a reasonable decision.
Solaris 10 can only boot from a UFS root filesystem. In the future, ZFS boot will be available, as it already is in OpenSolaris. But for now, every Solaris system must have at least one UFS filesystem.
UFS is old technology but it is a stable and fast filesystem. Sun has continuously tuned and improved the code over the last decade and has probably squeezed as much performance out of this type of FS as is possible. Journaling support was added in Solaris 7 at the turn of the century and has been enabled by default since Solaris 9. Before that, volume level journaling was available. In this older scheme, changes to the raw device are journaled, and the filesystem is not journaling-aware. This is a simple but inefficient scheme, and it worked with a small performance penalty. Volume level journaling is now end-of-lifed, but interestingly, the same sort of system seems to have been added to FreeBSD recently. What is old is new again.
UFS is accompanied by the Solaris Volume Manager, which provides perfectly servicible software RAID.
Where does UFS fit in in 2008? Besides booting, it provides a filesystem which is stable and predictable and better integrated into the OS than anything else. ZFS will probably replace it eventually, but for now, it is a good choice for databases, which have usually been tuned for a traditional filesystem's access characteristics. It is also a good choice for the pathologically conservative administrator, who may not have an exciting job, but who rarely has his nap time interrupted.
ZFS
ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.
ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous "rampant layering violation" quote which has been repeated so many times. Remember, though, that this is just one developer's aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.
Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can't fill up the whole pool, and reservations may be changed at will. However, if you don't want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn't just painless, it is irrelevant.
ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem's root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet - it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.
SAM and QFS
SAM and QFS are different things but are closely coupled. QFS is Sun's cluster filesystem, meaning that the same filesystem may be simultaneously mounted by multiple systems. SAM is a hierarchical storage manager; it allows a set of disks to be used as a cache for a tape library. SAM and QFS are designed to work together, but each may be used separately.
QFS has some interesting features. A QFS filesystem may span multiple disks with no extra LVM needed to do striping or concatenation. When multiple disks are used, data may be striped or round-robined. Round-robin allocation means that each file is written to one or two disks in the set. This is useful since, unlike striping, participation by all disks is not needed to fetch a file - each disk may seek totally independently. QFS also allows metadata to be separated from data. In this way, a few disks may serve the random metadata workload while the rest serve a sequential data workload. Finally, as mentioned before, QFS is an asymmetric cluster filesystem.
QFS cannot manage its own RAID, besides striping. For this, you need a hardware controller, a traditional volume manager, or a raw ZFS volume.
SAM makes a much larger backing store (typically a tape library) look like a regular UNIX filesystem. This is accomplished by storing metadata and often-referenced data on disk, and migrating infrequently used data in and out of the disk cache as needed. SAM can be configured so that all data is staged out to tape, so that if the disk cache fails, the tapes may be used like a backup. Files staged off of the disk cache are stored in tar-like archives, so that potentially random access of small files can become sequential. This can make further backups much faster.
QFS may be used as a local or cluster filesystem for large-file intensive workloads like Oracle. SAM and QFS are often used for huge data sets such as those encountered in supercomputing. SAM and QFS are optional products and are not cheap, but they have recently been released into OpenSolaris.
VxFS
The Veritas filesystem and volume manager have their roots in a fault-tolerant proprietary minicomputer built by Veritas in the 1980s. They have been available for Solaris since at least 1993 and have been ported to AIX and Linux. They are integrated into HP-UX and SCO UNIX, and Veritas Volume Manager code has been used (and extensively modified) in Tru64 UNIX and even in Windows. Over the years, Veritas has made a lot of money licensing their tech, and not because it is cheap, but because it works.
VxFS has never been part of Solaris but, when UFS was the only option, it was a popular addition. VxVM and VxFS are tightly integrated. Through vxassist, one may shrink and grow filesystems and their underlying volumes with minimal trouble. VxVM provides online RAID relayout. If you have a RAID5 and want to turn it into a RAID10, no problem, no downtime. If you need more space, just convert it back to a RAID5. VxVM has a reputation for being cryptic, and to some extent it is, but it's not so bad and the flexibility is impressive.
VxFS is a fast, extent based, journaled, clusterable filesystem. In fact, it essentially introduced these features to the world, along with direct IO. Newer versions of VxFS and VxVM have the ability to do cross-platform disk sharing. If you ever wanted to unmount a volume from your AIX box and mount it on Linux or Solaris, now you can.
VxFS and VxVM are still closed source. A version is available from Symantec that is free on small servers, with limitations, but I imagine that most users still pay. Pricing starts around $2500 and can be shocking for larger machines. VxFS and VxVM are solid choices for critical infrastructure workloads, including databases.
Conclusion
These are the four major choices in the Solaris on-disk filesystem world. Other filesystems, such as ext2, have some degree of support in OpenSolaris, and FUSE is also being worked on. But if you are deploying a Solaris server, you are going to be using one or more of these four. I hope that you enjoyed this overview, and if you have any corrections or tales of UNIX filesystem history, please let me know.
About the Author
John Finigan is a Computer Science graduate student and IT professional specializing in backup and recovery technology. He is especially interested in the history of enterprise computing and in Cold War technology.
To give an example of using DTrace on Linux applications, I needed an application to examine. I wanted a well known program that either didn't run on Solaris or operated sufficiently differently such examining the Linux version rather than the Solaris port made sense. I decided on /usr/bin/top partly because of the dramatic differences between how it operates on Linux vs. Solaris (due to the differences in /proc), but mostly because of what I've heard my colleague, Bryan, refer to as the "top problem": your system is slow, so you run top. What's the top process? Top!
Running top in the Linux branded zone, I opened a shell in the global (Solaris) zone to use DTrace. I started as I do on Solaris applications: I looked at system calls. I was interested to see which system calls were being executed most frequently which is easily expressed in DTrace:
bash-3.00# dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }' dtrace: description 'lx-syscall:::entry' matched 272 probes ^C
[Jan 22, 2006] Supporting Multiple Page Sizes in the Solaris Operating System
Beginning with the Solaris 9 OS, multiple page sizes can be supported on UltraSPARC processors so administrators can optimize performance by changing the page size on behalf of an application. Typical performance measurement tools do not provide sufficient detail for evaluating the impact of page size and do not provide the needed support to make optimal page size choices.
This article explains how to use new tools to determine the potential performance gain. In addition, it explains how to configure larger page sizes using the multiple page size support (MPSS) feature of the Solaris 9 OS. The article addresses the following topics:
- "Understanding Why Virtual-to-Physical Address Translation Affects Performance"
- "Working With Multiple Page Sizes in the Solaris OS"
- "Configuring for Multiple Page Sizes"
[Jan 16, 2006] Solaris 8 Memory Architecture
A common question pre-Solaris 8 users ask is "Where has all my memory gone"? Thevmstatcommand, used to report virtual memory statistics, often reports that free memory (measured in Kbytes in the free column) is zero or close to zero on a pre-Solaris 8 system that has been up and running for a while.Most likely, memory is being used to cache file system data, since the virtual memory system is shared by applications, data, the kernel, and file system data. By default, any free memory is used to cache data read from or written to the file system (including NFS). The size of the file system cache is dynamic -- it grows or shrinks depending on free memory.
The idea of this memory allocation scheme is to simultaneously enhance file system performance and optimize the use of an important system resource -- virtual memory. The two computing tasks of running applications and reading and writing data compete equally for system memory.
Generally, sharing a pool of memory is not an issue on small memory systems with low compute power, but with today's powerful desktop systems and servers, the file system cache can overwhelm the memory pool and make application performance suffer. Another drawback is that file system performance is tied to how quickly the virtual memory system can free memory.
Even worse, it is difficult to measure memory usage amongst the consumers of memory on the system. The
vmstatcommand is often the first tool users run to examine virtual memory usage, but pre-Solaris 8 versions do a poor job of indicating why a system ispaging (running an algorithm that moves data out from physical memory to disk, and back into physical memory from disk).So, the question becomes: is it because the system is caching file system data, or is it because memory is a bottleneck and the system is struggling to keep up?
[Mar 25, 2005] http://www.sun.com/bigadmin/features/articles/selfheal.html
Traditionally, when a hardware or software fault occurred on a Solaris system, a message would usually be logged to the appropriate device specified in /etc/syslog.conf, and the rest of the diagnosis and repair was left to the administrator. Predictive Self-Healing technology is introduced in the Solaris 10 OS, which is available for preview through the Software Express for Solaris program.
Predictive Self-Healing is a newly designed cohesive architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions.
This new technology lessens the time required to debug a hardware or software problem and provides the administrator and Sun Technical Support with detailed data about each fault. The architecture consists of an event management protocol, the fault manager, and the software fault-handling software, the Solaris Service Manager.
[Mar 17, 2005] An interesting option in telnetd for Solaris 10
It looks like it now provides a simple "not in DNS, no access" defense via option -U:
-U
Refuses connections that cannot be mapped to a name through the getnameinfo(3SOCKET) function.
Anatomy of a Read and Write Call - 21k By Pat Shuff Linux Journal 2002-09-20 23:00
A few years ago I was tasked with making the Spec96 benchmark suite produce the fastest numbers possible using the Solaris Intel operating system and Compaq Proliant servers. We were given all the resources that Sun Microsystems and Compaq Computer Corporation could muster to help take both companies to the next level in Unix computing on the Intel architecture. Sun had just announced its flagship operating system on the Intel platform and Compaq was in a heated race with Dell for the best departmental servers. Unixware and SCO were the primary challengers since Windows NT 3.5 was not very stable at the time and no one had ever heard of an upstart graduate student from overseas who thought that he could build a kernel that rivaled those of multi-billion dollar corporations.
Now many years later, Linux has gained considerable market share and is the De facto Unix for all the major hardware manufacturers on the Intel architecture. In this article, I will attempt to take the lessons learned from this tuning exercise and show how they can be applied to the Linux operating system.
As it turned out, the gcc benchmark was the one that everyone seemed to be improving on the most. As we analyzed what the benchmark was doing, we found out that basically it opened a file, read its contents, created a new file, wrote new contents, then closed both files. It did this over and over and over. File operations proved to be the bottleneck in performance. We tried faster processors with insignificant improvement. We tried processors with huge (at the time) level 1 and level 2 cache and still found no significant improvement. We tried using a gigabyte of memory and found little or no improvement. By using the vmstat command, we found that the processor was relatively idle, little memory was being used, but we were getting a significant amount of reads and writes to the root disk. Using the same hardware and same test programs, Unixware was 25% faster than Solaris Intel. Initially, we decided that Solaris was just really slow. Unfortunately, I was working for Sun at the time and this was not the answer that we could take to my management. We had to figure out why it was slow and make recommendations on how to improve the performance. The target was 25% faster than Unixware, not slower.
The first thing that we did was to look at the configurations. It turns out that the two systems were identical hardware,. We just booted a different disk to boot the other operating system. The Unixware system was configured with /tmp as a tmpfs whereas the Solaris system had /tmp on the root file system. We changed the Solaris configuration to use tmpfs but it did not significantly improve performance. Later, we found that this was due to a bug in the tmpfs implementation on Solaris Intel. By braking down the file operation, we decided to focus on three areas; the libc interface, the node/dentry layer, and the device drivers managing the disk. In this article, we will look at the three different layers and talk about how to improve performance and how they specifically apply to Linux.
LISA 2001 Paper LISA 2001 Paper about RUF
This paper describes a utility named ruf that reads files from an unmounted file system. The files are accessed by reading disk structures directly so the program is peculiar to the specific file system employed. The current implementation supports the *BSD FFS, SunOS/Solaris UFS, HP-UX HFS, and Linux ext2fs file systems. All these file systems derive from the original FFS, but have peculiar differences in their specific implementations.
The utility can read files from a damaged file system. Since the utility attempts to read only those structures it requires, damaged areas of the disk can be avoided. Files can be accessed by their inode number alone, bypassing damage to structures above it in the directory hierarchy.
The functions of the utility is available in a library named libruf. The utility and library is available under the BSD license.
Introduction
There are many important reasons for being able to access unmounted file systems, the prime example being a damaged disk. This paper describes a utility that can be used to read a disk file without mounting the file system. The utility behaves similar to the regular cat utility, and was originally named dog, but was renamed to ruf for reading unmounted filesystems to avoid a name conflict with an older utility.
In order to access an unmounted file system, the utility must read the disk structures directly and perform all the tasks normally performed by the operating system; this requires a detailed understanding of how the file system is implemented. Implementing this utility for a particular file system is an interesting academic exercise and a good way to learn about the file system. The original work on this utility was in fact done in Evi Nemeth's system administration class.
Getting to know the Solaris filesystem, Part 1 - SunWorld - May 1999
Richard starts this journey into the Solaris filesystem by looking at the fundamental reasons for needing a filesystem and at the functionality various filesystems provide. In this first part of the series, you'll examine the evolution of the Solaris filesystem framework, moving into a study of major filesystem features. You'll focus on filesystems that store data on physical storage devices -- commonly called regular or on-disk filesystems. In future articles, you'll begin to explore the performance characteristics of each filesystem, and how to configure filesystems to provide the required levels of functionality and performance. Richard will also delve into the interaction between Solaris filesystems and the Solaris virtual memory system, and how it all affects performance.
Getting to know the Solaris filesystem, Part 3 - SunWorld - July ...
One of the most important features of a filesystem is its ability to cache file data. Ironically, however, the filesystem cache isn't implemented in the filesystem. In Solaris, the filesystem cache is implemented in the virtual memory system. In Part 3 of this series on the Solaris filesystem, Richard explains how Solaris file caching works and explores the interactions between the filesystem cache and the virtual memory system.
www.solarisinternals.com/si/reading/Getting to know the Solaris filesystem,
Part 1 - SunWorld - May 1999
[Mar 7, 2005]
CacheKit is a collection of freeware perl and shell programs to
report on cache activity on a Solaris 8 SPARC server. Tools for older Solaris
and Solaris x86 are also included in the kit, as well as some SE Toolkit
programs and extra Solaris 10 DTrace programs. The caches the kit reports
on are: I$, D$, E$, DNLC, inode cache, ufs buffer cache, segmap cache
and segvn cache. This kit assists performance tuning.
These programs have been written for a Solaris 8 (or newer) sparc
server. Also included in the kit are programs for older Solaris, Solaris
x86, and Solaris 10.
[Mar 7, 2005] docs.sun.com: Solaris Tunable Parameters Reference Manual
The Process Model of Linux Application Development
One of Unix's hallmarks is its process model. It is the key to understanding access rights, the relationships among open files, signals, job control, and most other low-level topics in this book. Linux adopted most of Unix's process model and added new ideas of its own to allow a truly lightweight threads implementation.
10.1 Defining a Process
What exactly is a process? In the original Unix implementations, a process was any executing program. For each program, the kernel kept track of
- The current location of execution (such as waiting for a system call to return from the kernel), often called the program's context
- Which files the program had access to
- The program's credentials (which user and group owned the process, for example)
- The program's current directory
- Which memory space the program had access to and how it was laid out
A process was also the basic scheduling unit for the operating system. Only processes were allowed to run on the CPU.
10.1.1 Complicating Things with Threads
Although the definition of a process may seem obvious, the concept of threads makes all of this less clear-cut. A thread allows a single program to run in multiple places at the same time. All the threads created (or spun off) by a single program share most of the characteristics that differentiate processes from each other. For example, multiple threads that originate from the same program share information on open files, credentials, current directory, and memory image. As soon as one of the threads modifies a global variable, all the threads see the new value rather than the old one.
Many Unix implementations (including AT&T's canonical System V release) were redesigned to make threads the fundamental scheduling unit for the kernel, and a process became a collection of threads that shared resources. As so many resources were shared among threads, the kernel could switch between threads in the same process more quickly than it could perform a full context switch between processes. This resulted in most Unix kernels having a two-tiered process model that differentiates between threads and processes.
10.1.2 The Linux Approach
Linux took another route, however. Linux context switches had always been extremely fast (on the same order of magnitude as the new "thread switches" introduced in the two-tiered approach), suggesting to the kernel developers that rather than change the scheduling approach Linux uses, they should allow processes to share resources more liberally.
Under Linux, a process is defined solely as a scheduling entity and the only thing unique to a process is its current execution context. It does not imply anything about shared resources, because a process creating a new child process has full control over which resources the two processes share (see the clone() system call described on page 153 for details on this). This model allows the traditional Unix process management approach to be retained while allowing a traditional thread interface to be built outside the kernel.
Luckily, the differences between the Linux process model and the two-tiered approach surface only rarely. In this book, we use the term process to refer to a set of (normally one) scheduling entities which share fundamental resources, and a thread is each of those individual scheduling entities. When a process consists of a single thread, we often use the terms interchangeably. To keep things simple, most of this chapter ignores threads completely. Toward the end, we discuss the clone() system call, which is used to create threads (and can also create normal processes).
This site provides information supporting the Solaris Internals book published by Jim Mauro and Richard McDougall. Our aim is to provide links to pertinent reference material and tools discussed in the book, plus any new and relevant information about the Solaris operating system since publication.We hope you find this site useful - we have provided contact information for any questions you may have. We also welcome and encourage feedback.
10/21/2004 - mdb's ::memstat ported to Solaris 8!
Roadmap to Sun Developer Documentation
This article provides a roadmap to information sources for Sun products for developers, with links that readers can use to bookmark the sources.
Contents:
- Introduction
- Roadmap to Documentation for Developers Working With the Solaris Operating System
- Roadmap to the Man Page Collection
- Roadmap to Documentation for Developers Working With Java Technology
- Roadmap to Documentation for Developers Working With Sun Java Enterprise System
- Roadmap to Documentation for Programming With Other Sun Products
- Other Information Sources for Sun Products
- About the Author
Introduction
A wealth of information is available to developers working with Sun software products. This article, which is an expanded version of the new manual Introduction to the Solaris Development Environment, serves as a general roadmap to the developer documentation for these products, that is, manuals, specifications, and documentation web pages, as well as other sources of information for developers.
Note that the links provided here connect to the current versions of the collections and manuals as of the publishing of this article. You should always check that you are accessing the version of the manual that is appropriate for the release of the Sun product you are using.
System Programming for Solaris 2.3, by Steven Hayes Royal Melbourne Institute of Technology
The Solaris process model, Part 2
The Solaris process model, Part 4: More on the kernel
... When Solaris boots, a linked list of CPU
structures is created for every ... queues is
rooted in a disp data structure, which is embedded
in each CPU ...
www.sunworld.com/sunworldonline/swol-12-1998/swol-12-insidesolaris.html
- 48k -
Cached -
Similar pages
The dynamic Solaris kernel
... a module. The primary data structures used in ...
module structure; the structure definitions
for modctl ... kernel maintains a linked list of
modctl structures ...
www.sunworld.com/sunworldonline/swol-02-2000/swol-02-insidesolaris.html
- 50k -
Cached -
Similar pages
|
|||||||
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Disclaimer:
Last modified: August 15, 2009