Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)


Disk and Filesystems Management in Solaris


Note: The discussion below is based on the article by Brian Wong, "Design, Features, and Applicability of Solaris File Systems".

Like any modern OS, Solaris includes many file systems, and more are available as add-ons. A file system stores named data sets and attributes about those data sets for subsequent data access and interpretation of the attributes. Attributes include things like ownership, access rights, date of last access, and physical location. More advanced attributes might include extended attributes (as in OS/2 HPFS), encryption keys, and so on. We can distinguish five categories of file systems available in Solaris.

Every Solaris system includes UFS. While it is definitely old and lacking some features, it is suitable for a wide variety of applications. The UFS design center handles typical files found in office and business automation systems. The basic I/O characteristics are huge numbers of small, cacheable files, accessed randomly by individual processes; bandwidth demand is low. This profile is common in most workloads, such as software development and network services (for example, in name services, web sites, and ftp sites).

When designing server file systems with UFS, pay attention to the role each partition will play for your particular application. Mapping partitions to separate pairs of physical disks (in the case of hardware mirroring) spreads the load across the pairs and improves performance.

If you are running a web server, for example, performance benefits from a separate pair of drives dedicated to website storage. You might mount the web server partition with the "noatime" and "logging" options along with the "nosuid" option. This offloads requests to a separate drive and possibly a separate SCSI controller channel.
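As a rough illustration of the mount options just mentioned, the following Python sketch assembles one /etc/vfstab line for a dedicated web content file system. The device names and the /web mount point are hypothetical placeholders, not values from the article; only the seven-field vfstab layout and the three option names come from Solaris itself.

```python
def vfstab_entry(block_dev, raw_dev, mount_point, options,
                 fs_type="ufs", fsck_pass=2, mount_at_boot="yes"):
    """Return one /etc/vfstab line.

    Field order: device to mount, device to fsck, mount point, FS type,
    fsck pass, mount at boot, mount options.
    """
    fields = [block_dev, raw_dev, mount_point, fs_type,
              str(fsck_pass), mount_at_boot, ",".join(options)]
    return "\t".join(fields)


# Hypothetical device and mount point for a dedicated web-content disk.
line = vfstab_entry("/dev/dsk/c1t0d0s0", "/dev/rdsk/c1t0d0s0", "/web",
                    ["noatime", "logging", "nosuid"])
print(line)
```

Generating the line from a helper is only a convenience for keeping several servers consistent; the same entry can of course be typed by hand.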

Web servers have a mostly read-request load and the volume of data is not that big, so RAID 10 can be used.

Software mirroring adds overhead. You should never mirror partitions on the same drive, except for training purposes: you will seriously degrade performance because you have effectively doubled your seeks.

For small web sites (say, up to 4 GB) it makes sense to use /tmp for the site content, because /tmp is mapped to memory. That means your pages are always cached. The drawback is that you might need to order more memory for the server, increasing the cost, but it is a better (and cheaper) deal than using a SAN. You just need to reload the content when the server reboots. The problem is that beyond 4 GB the reload makes a reboot take rather long, but few websites are that big. In any case, it makes sense to dedicate an entire drive to the web server file system. New USB flash storage has no seek latency for reads and might offer read performance comparable to the best hard drives, so it might also be an option.

Logs from the web server can be written to the system drive, as the volume is rather small.

You can also tune the UFS file system for a web server with the noatime option (saves some writes) and with the high-water and low-water marks set through the "ufs_HW" and "ufs_LW" variables in /etc/system. See the Sun Performance and Tuning book (p. 172-173) and Sun's Solaris performance tuning course.

 

In addition to the basic UFS, there are two variants: logging UFS (LUFS) and the older metatrans UFS that was used in Solaris 7 and earlier. All three versions share the same basic code for block allocation, directory management, and data organization. In particular, older versions of Solaris up to Solaris 9 have a nominal maximum UFS size of 1 terabyte. This limit was raised to 16 terabytes in the Solaris 10 OS. Obviously, a single file stored in any of them must fit inside a file system, so the maximum file size is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte file system. There is no reasonable limit to the number of file systems that can be built on a single system; systems have been run with over 2880 UFS file systems.

The major differences between the three UFS variants are in how they handle metadata. Metadata is information that the file system stores about the data, such as the name of the file, ownership and access rights, last modified date, file size, and other similar details. Other, less obvious, but possibly more important metadata describe the location of the data on the disk, such as data blocks and the indirect blocks that indicate where data blocks reside on the disk.

Getting this metadata wrong would not only mean that the affected file might be lost, but could lead to serious file system-wide problems or even a system crash in the event that live data found itself in the free space list, or worse, that free blocks somehow appeared in the middle of a file. UFS takes the simplest approach to assuring metadata integrity: it writes metadata synchronously and requires an extensive fsck on recovery from a system crash. The time and expense of the fsck operation are proportional to the number of files in the file system being checked.

Large file systems with millions of small files can take tens of hours to check. Logging file systems were developed to avoid both the ongoing performance issues associated with synchronous writes and excessive time for recovery. Logging uses the two-phase commit technique to ensure that metadata updates are either fully updated on disk, or that they will be fully updated on disk upon crash recovery. Logging implementations store pending metadata in a reserved area, and then update the master file system based on the content of the reserved area or log.

In the event of a crash, metadata integrity is assured by inspecting the log and applying any pending metadata updates to the master file system before accepting any new I/O operations from applications. The size of the log depends on the amount of changing metadata, not the size of the file system. The amount of pending metadata is quite small, usually on the order of a few hundred kilobytes for typical file systems and several tens of megabytes for very busy file systems, so replaying the log against the master is a very fast operation.

Once the metadata integrity is guaranteed, the fsck operation becomes a null operation and crash recovery becomes trivial. Note that for performance reasons, only metadata is logged; user data is not logged.
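The log-then-apply sequence described above can be sketched as a toy model in Python. This is not how UFS is implemented; the class name, the in-memory dictionaries, and the use of None to mark a deletion are all assumptions made only to show why replaying the log after a crash restores consistent metadata.

```python
class ToyLoggingFS:
    """Toy model: metadata updates go to a log first, then to the master."""

    def __init__(self):
        self.master = {}   # "on-disk" metadata: name -> attributes
        self.log = []      # reserved log area: [name, attrs, committed]

    def log_update(self, name, attrs):
        """Phase 1: record the pending change in the log only."""
        self.log.append([name, attrs, False])

    def apply_log(self):
        """Phase 2: apply pending entries to the master, then mark them committed."""
        for entry in self.log:
            name, attrs, committed = entry
            if committed:
                continue
            if attrs is None:
                self.master.pop(name, None)   # a logged deletion
            else:
                self.master[name] = attrs
            entry[2] = True

    def recover(self):
        """After a crash: replay the log before accepting new I/O."""
        self.apply_log()


fs = ToyLoggingFS()
fs.log_update("a.txt", {"size": 10})
fs.apply_log()                        # a.txt reaches the master
fs.log_update("b.txt", {"size": 20})
fs.log_update("a.txt", None)          # pretend the crash happens here
fs.recover()                          # pending entries are replayed
print(sorted(fs.master))
```

Recovery cost in this model depends only on the number of uncommitted log entries, which mirrors the point that log size follows metadata activity rather than file system size.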

The metatrans implementation was the first version of UFS to implement logging. It was built into Solstice DiskSuite or Solaris Volume Manager software (the name of the product depends on the version of the code, but otherwise, they are the same). The metatrans implementation is limited to Solaris 7 and earlier and it can be recommended only for very old releases (the Solaris 2.5.1 and Solaris 2.6 OSs) in which logging UFS (LUFS) is not available.

Logging UFS was introduced in the Solaris 8 OS but unfortunately was not enabled by default. The reason was a performance degradation that typically appeared only at artificially high load levels; almost no cases have been seen in practical applications.

So in reality logging came into use in typical installations only with Solaris 10, where it is enabled by default. Sun recommends logging any time fast crash recovery is required, and it can be used starting from Solaris 8, but this recommendation is largely ignored. This is particularly sad in the case of root file systems, which usually do not have any significant I/O at all.

Performance Impact of Logging

One of the most confusing issues associated with logging file systems (and particularly with logging UFS, for some reason) is the effect that the log has on performance. First, and most importantly, logging has absolutely no impact on user data operations; this is because only metadata operations are logged.

The performance of metadata operations is another story, and it is not as easy to describe. The log works by writing pending changes to the log, then actually applying the changes to the master file system. When the master is safely updated, the log entry is marked as committed, meaning that it does not need to be reapplied to the master in the event of a crash. This algorithm means that metadata changes that are accomplished primarily when creating or deleting files might actually require twice as many physical I/O operations as a non-logging implementation. The net impact of this aspect of logging performance is that there are more I/O operations going to storage. Typically, this has no real impact on overall performance, but in the case where the underlying storage was already nearly 100 percent busy, the extra operations associated with logging can tip the balance and produce significantly lower file system throughput. (In this case, throughput is not measured in megabytes per second, but rather in file creations and deletions per second.) If the utilization of the underlying storage is less than approximately 90 percent, the logging overhead is inconsequential.

On the positive side of the ledger, the most common impact on performance has to do with the cancellation of some physical metadata operations. These cases occur only when metadata updates are issued very rapidly, such as when doing a tar(1) extract operation or when removing the entire contents of a directory ("rm -f *"). Without logging, the system is required to force the directory to disk after every file is processed (this is the definition of the phrase "writing metadata synchronously"); the effect is to write 512 or 2048 bytes every time 14 bytes are changed. When the file system is logging, the log record is pushed to disk when the log record fills, often when the 512-byte block is completed. This results in a reduction in physical I/O of roughly 512/14, or about 36 times, and obvious performance improvements result.
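The arithmetic behind that reduction can be checked with a few lines of Python. This is a deliberately simplified model: it assumes every file changes exactly 14 bytes of directory data and that the log packs those changes into full 512-byte blocks, which real UFS logging does not guarantee. The file count is the one used by the tar benchmark discussed in this section.

```python
ENTRY_BYTES = 14    # bytes changed per file, per the text
BLOCK_BYTES = 512   # bytes pushed to disk per physical write


def physical_writes(n_files, logging):
    """Count 512-byte writes needed to record n_files directory changes."""
    if not logging:
        return n_files                    # one forced write per file
    total = n_files * ENTRY_BYTES
    return -(-total // BLOCK_BYTES)       # ceiling division: one write per filled block


n = 7092  # file count of the tar benchmark discussed in this section
sync_writes = physical_writes(n, logging=False)
log_writes = physical_writes(n, logging=True)
print(sync_writes, log_writes, round(sync_writes / log_writes, 1))
```

Under these assumptions the ratio comes out close to 512/14, the same order as the speedup reported for the rm test.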

The following table illustrates these results. The times are given in seconds, and lower scores are better. Times are the average of five runs, and are intended to show relative differences rather than the fastest possible absolute time. These tests were run on Solaris 8 7/01 using a single disk drive.

The tar test consists of extracting 7092 files from a 175 megabyte archive (the contents of /usr/openwin). Although a significant amount of data is moved, this test is dominated by metadata updates for creating the files. Logging is five times faster. The rm test removes the 7092 extracted files. It is also dominated by metadata updates and is an astonishing 37 times faster than the non-logging case.

On the other hand, the dd write test creates a single 1 gigabyte file in the file system, and the difference between logging and non-logging is a measurable, but insignificant, three percent. Reading the created file from the file system shows no performance impact from logging. Both tests use large block sizes (1 megabyte per I/O) to optimize throughput of the underlying storage.

Another feature present in most of the local file systems is the use of direct I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended to avoid the overhead associated with managing cache buffers for large I/O. At first glance, it might seem that caching is a good thing and that it would improve I/O performance.

There is a great deal of reality underlying these expectations. All of the local file systems perform buffer caching by default. The expected improvements occur for typical workloads that are dominated by metadata manipulation and data sets that are very small when compared to main memory sizes. Metadata, in particular, is very small, amounting to less than one kilobyte per file in most UFS applications, and only slightly more in other file systems. Typical user data sets are also quite small; they average about 70 kilobytes. Even the larger files used in everyday work, such as presentations created using StarOffice software, JPEG images, and audio clips, are generally less than 2 megabytes. Compared to typical main memory sizes of 256-2048 megabytes, it is reasonable to expect that these data sets and their attributes can be cached for substantial periods of time. They are reasonably likely to still be in memory when they are accessed again, even if that access comes an hour later.

The situation is quite different with bulk data. Systems that process bulk data tend to have larger memories, up to perhaps 16 gigabytes (for example, 8-64 times larger than typical), but the data sets in these application spaces often exceed 1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even if the file literally fits into memory and could theoretically be cached, these data sets are substantially larger than memory that is consistently available for I/O caching. As a result, the likelihood that the data will still be in cache when the data is referenced again is quite low. In practice, cache reuse in these environments is nil.
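The contrast between small and bulk data sets can be illustrated with a small LRU cache simulation in Python. The block counts are arbitrary illustration values, and the real Solaris page cache is not a plain LRU, so treat this only as a sketch of why repeated sequential reads of a data set larger than the cache get no reuse.

```python
from collections import OrderedDict


def hit_rate(cache_blocks, dataset_blocks, passes=3):
    """Sequentially read a data set several times through an LRU cache."""
    cache = OrderedDict()
    hits = reads = 0
    for _ in range(passes):
        for block in range(dataset_blocks):
            reads += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)        # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)   # evict least recently used
    return hits / reads


small = hit_rate(cache_blocks=1000, dataset_blocks=100)    # data set fits in cache
bulk = hit_rate(cache_blocks=1000, dataset_blocks=4000)    # data set is 4x the cache
print(small, bulk)
```

In the bulk case every block is evicted before it is read again, so the cache does work on every read and never pays it back.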

Direct I/O Performance

Caching data anyway would be fine, except that the process requires effort on the part of the OS and processors. For small files, this overhead is insignificant. However, the overhead becomes not only significant, but excessive when "tidal waves" of data flow through the system. When reading 1 gigabyte of data from a disk in large blocks, throughput is similar for both direct and buffered cases; the buffered case delivers 13 percent greater throughput. The big difference between these two cases is that the buffered process consumes five times as much CPU effort. Because there is so little practical value to caching large data sets, Sun recommends using the forcedirectio option on file systems that operate on large files. In this context, large generally means more than about 15-20 megabytes. Note that the direct I/O recommendation is especially true when the server in question is exporting large files through NFS.

If direct I/O is so much more efficient, why not use direct I/O all the time? Direct I/O means that caching is disabled. The impact of standard caching becomes obvious when using a UFS file system in direct I/O mode while doing small file operations. The same tar extraction benchmark used in the logging section above takes over 51 minutes, even with logging enabled, about 24 times as long as when using regular caching (2:08)! The benchmark results are summarized in Table 2.

In Table 2, throughput is represented by elapsed times in seconds, and smaller numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive. The differences in throughput are mainly attributable to how the file system makes use of the capabilities of the underlying hardware.
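For readers who want to compare the two modes directly, this short Python sketch recomputes elapsed-time ratios from the Table 2 figures (direct and buffered seconds and CPU percentages as given in the source article). The dictionary layout and function name are illustrative choices, not part of the original benchmark.

```python
# (direct seconds, direct CPU %, buffered seconds, buffered CPU %) from Table 2
table2 = {
    "create 1 GB file": (36, 5.0, 31, 25.0),
    "read 1 GB file":   (30, 0.0, 22, 22.0),
    "tar extract":      (3062, 0.0, 128, 6.0),
    "rm -rf *":         (76, 1.2, 65, 1.0),
}


def slowdown(test):
    """Elapsed-time ratio of direct I/O to buffered I/O for one test."""
    direct_s, _, buffered_s, _ = table2[test]
    return direct_s / buffered_s


for name in table2:
    print(f"{name:18s} direct/buffered = {slowdown(name):.2f}x")
```

The large-file tests differ by a modest factor, while the small-file tar extraction is where disabling the cache hurts by more than an order of magnitude.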

Supercaching and 32-Bit Binaries

A discussion of buffered and direct I/O methodology is incomplete without addressing one particular attribute of the cached I/O strategy. Because file systems are part of the operating system, they can access the entire capability of the hardware. Of particular relevance is that file systems are able to address all of the physical memory, which now regularly exceeds the ability of 32-bit addressing. As a result, the file system is able to function as a kind of memory management unit (MMU) that permits applications that are strictly 32-bit aware to make direct use of physical memories that are far larger than their address pointers. This technique, known as supercaching, can be particularly useful to provide extended caching for applications that are not 64-bit aware. The best examples of this are the open-source databases, MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct addressing capabilities limited to 4 gigabytes [1]. However, when their data tables are hosted on a file system operating in buffered mode, they benefit from cached I/O. This is not as efficient as simply using a 64-bit pointer because the application must run I/O system calls instead of merely dereferencing a 64-bit pointer, but the advantages gained by avoiding I/O outweigh these considerations by a wide margin.

TABLE 2 Analyzing the Performance of Direct I/O and Buffered I/O

                    Direct I/O                      Buffered I/O
Test                Throughput (seconds)   CPU %    Throughput (seconds)   CPU %
Create 1 GB file    36                     5.0%     31                     25.0%
Read 1 GB file      30                     0.0%     22                     22.0%
tar extract         3062                   0.0%     128                    6.0%
rm -rf *            76                     1.2%     65                     1.0%

[1] They're limited to 4 gigabytes of memory. They obviously can address far more disk space because disk addresses are 63-bit quantities of 512-byte blocks.
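The 4-gigabyte memory limit and the disk-address figure in the footnote both follow from simple arithmetic, shown here in Python as a quick check. The variable names are illustrative only.

```python
GIB = 2**30   # bytes in one gibibyte
EIB = 2**60   # bytes in one exbibyte

addr_32bit = 2**32           # bytes reachable with a 32-bit pointer
disk_limit = 2**63 * 512     # 63-bit block numbers times 512-byte blocks

print(addr_32bit // GIB, "GiB of directly addressable memory")
print(disk_limit // EIB, "EiB of addressable disk space")
```

The gap between the two numbers is why a 32-bit database can sit on top of very large storage while still depending on the file system cache for memory beyond its own address space.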

Sharing Data With NFS

To Solaris users, NFS is by far the most familiar file system. It is an explicit over-the-wire file sharing protocol that has been a part of Solaris since 1986. Its manifest purpose is to permit safe, deterministic access to files located on a server with reasonable security. Although NFS is media independent, it is most commonly seen operating over TCP/IP networks. NFS is specifically designed to operate in multiclient environments and to provide a reasonable tradeoff between performance, consistency, and ease of administration. Although NFS has historically been neither particularly fast nor particularly secure, recent enhancements address both of these areas. Performance improved by 50-60 percent between the Solaris 8 and Solaris 9 OSs, primarily due to greatly increased efficiency in processing attribute-oriented operations [5]. Data-intensive operations don't improve by the same margin because they are dominated by data transfer times rather than attribute operations. Security, particularly authentication, has been addressed through the use of much stronger authentication mechanisms such as those available using Kerberos. NFS clients now need to trust only their servers, rather than their servers and their client peers.

[5] A 2 x 900 MHz SF280R yielded 7200 NFS operations per second on Solaris 8 2/02. The same system yielded 1717 NFS operations per second on Solaris 9 FCS.

Understanding the Sharing Limitations of UFS

UFS is not a shared file system. Despite a fairly widespread interest in a limited-use configuration (specifically, mounted for read/write operation on one system, while mounted read-only on one or more "secondary" systems), UFS is not sharable without the use of an explicit file sharing protocol such as NFS. Although read-only sharing seems as though it should work, it doesn't. This is due to fairly fundamental decisions made in the UFS implementation many years ago, specifically in the caching of metadata.

UFS was designed with only a single system in mind and it also has a relatively complex data structure for files, notably including "indirect blocks," which are blocks of metadata that contain the addresses of real user data. To maintain reasonable performance, UFS caches metadata in memory, even though it writes metadata to disk synchronously. This way, it is not required to re-read inodes, indirect-blocks, and double-indirect blocks to follow an advancing file pointer. In a single-system environment, this is a safe assumption. However, when another system has access to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic at worst. A writable UFS file system can change the metadata and write it to disk.

Meanwhile, a read-only UFS file system on another node holds a cached copy of that metadata. If the writable system creates a new file or removes or extends an existing file, the metadata changes to reflect the request. Unfortunately, the read-only system does not see these changes and, therefore, has a stale view of the system. This is nearly always a serious problem, with the consequences ranging from corrupted data to a system crash. For example, if the writable system removes a file, its blocks are placed in the free list. The read-only system isn't provided with this information; therefore, a read of the same file will cause the read-only system to follow the original data pointers and read blocks that are now on the free list!
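The stale-pointer scenario just described can be reproduced with a toy model in Python. Everything here (the dictionary standing in for the shared disk, the file names, the block numbers) is invented for illustration; it only shows how a cached copy of block pointers goes wrong once the writer frees and reuses a block.

```python
# Shared disk: block number -> contents.
disk = {}

# Writable node creates report.txt in blocks 10 and 11.
disk[10] = "report part 1"
disk[11] = "report part 2"
writer_metadata = {"report.txt": [10, 11]}

# Read-only node caches the block pointers once and never refreshes them.
reader_cached_metadata = {name: list(ptrs)
                          for name, ptrs in writer_metadata.items()}

# Writable node removes the file; a freed block is then reused for another file.
freed = writer_metadata.pop("report.txt")
disk[freed[0]] = "payroll part 1"
writer_metadata["payroll.txt"] = [freed[0]]

# Read-only node follows its stale pointers and reads another file's data.
stale_read = [disk[b] for b in reader_cached_metadata["report.txt"]]
print(stale_read)
```

The read-only node receives no error at all, which is what makes this failure mode dangerous: it silently returns data belonging to a different file.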

Rather than risk such extreme consequences, it is better to use one of the many other options that exist. The choice of option is driven by a combination of how often updated data must be made available to the other systems, and the size of the data sets involved. If the data is not updated too often, the most logical option is to make a copy of the file system and to provide the copy to other nodes. With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and EMC TimeFinder, copying a file system does not need to be an expensive operation.

It is entirely reasonable to export a point-in-time copy of a UFS file system from storage to another node (for example, for backup) without risk because neither the original nor the copy is being shared. If the data changes frequently, the most practical alternative is to use NFS.

Although performance is usually cited as a reason not to do this, the requirements are usually not demanding enough to warrant other solutions. NFS is far faster than most users realize, especially in environments that involve typical files smaller than 5-10 megabytes. 

There are a couple of tricks you can use under Solaris to gain a little extra performance from your file systems and also increase their data reliability. When designing a file system, pay attention to the role it will play for your particular application. Depending on your needs, map partitions to physical disks so as to minimize the load on each pair of disks (in the case of mirroring) and improve performance.


Web servers have a mostly read-request load. RAID 10 can be used, but RAID 5 can be used too, as both provide a high read transaction rate and redundancy in case of a drive failure.



Old News ;-)

[Sep 8, 2008] BigAdmin Description - Less Known Solaris Features - CacheFS

Description: A tutorial about one of the really hidden features of Solaris - CacheFS.

CacheFS is something similar to a caching proxy.

But this proxy doesn't cache web pages; it caches files from another filesystem.

Contact: joerg.moellenkamp [ at ] sun.com

[Aug 7, 2007] Linux Replacing atime

August 7, 2007 | KernelTrap Submitted by Jeremy on August 7, 2007 - 9:26am.

In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way." He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads." The discussion then looked at using the relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified." Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve relatime, and noted about atime:

"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"

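The relatime rule quoted above is simple enough to state as a predicate. The following Python sketch implements only the rule as worded in the quote (update atime when the previous atime is older than mtime or ctime); it is not taken from the Linux kernel source, and the function name and the use of plain numeric timestamps are assumptions for illustration.

```python
def relatime_should_update(atime, mtime, ctime):
    """Return True if a read should update atime under the quoted relatime rule.

    Timestamps are plain numbers (for example, seconds since the epoch).
    """
    return atime < mtime or atime < ctime


# File modified after it was last read: the next read updates atime.
print(relatime_should_update(atime=100, mtime=200, ctime=150))
# File already read since its last change: the read causes no atime write.
print(relatime_should_update(atime=300, mtime=200, ctime=150))
```

A mail reader such as mutt only needs to know whether a file was read after its last modification, and this predicate preserves exactly that information while skipping most of the writes.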
[Apr 5, 2007] SATA Framework Overview By Pawel Wojcik

Feb 22, 2007 (blogs.sun.com)

As many of you noticed, Solaris now supports SATA controllers and devices. To simplify writing SATA HBA drivers the new module and a set of interfaces was created, referred to as either SATA Framework or SATA module. I was a principal architect of SATA framework, but several other Sun engineers were participating in the conceptual design and the shaping of the interfaces.
It is not a small piece of software: the source, sata.c, is over 300k in size. Reading this code, with the associated header files, may be a little confusing. So, I created an overview of the sata module, explaining what it is, how it fits in the Solaris kernel, what it does, what the interfaces are, and how sample operations are performed. Hopefully, it will be useful for all who want to improve and expand SATA support in Solaris. A similar overview was presented about a year ago at the Silicon Valley Open Solaris User Group meeting in Santa Clara and on various occasions internally at Sun. The overview that I plan to present here will have several parts. Here is the first one...

[Sep 14, 2006] BigAdmin - Submitted Tech Tip Setting Up a Disk Partition Table to Be the Same As on Another Disk by Phillip Wu, August 2006

Often it is necessary to set up a partition table for a disk to be the same as on another disk, for example, where the disks are mirrored. This can be achieved by using the format utility. In the instructions that follow, the original disk is called disk a and the second disk (which will have the same partition table) is called disk b.

format
<select disk a>  (Select disk a from the list displayed.)
partition
print   (Print out the partition table.)
name
rootdisk   (Pick a name of your choice.)
quit   (Go back to the format menu.)
disk   (Go to the menu that allows you to select disk b.)
<select disk b>   (Select disk b from the list displayed.)
partition   (Print out the partition table before changing.)
select
--Pick rootdisk   (Pick from the menu the name you gave above.)
label   (Write out the partition table to disk.)
quit   (Go back to the format menu.)
quit   (Exit format.)

Recommended Links



[PDF] File System Performance: The Solaris, UFS, Linux ext3, and ReiserFS

[PDF] Design, Features, and Applicability of Solaris File Systems

You too can understand device numbers and mapping in Solaris ...

High Availability: Configuring Boot, Root and Swap (PDF)

Scrubbing Disk Using the Solaris Operating Environment Format Program (June 2000) -by Rob Snevely Rob explains how to effectively scrub disks on a Solaris Operating Environment system, using the format utility.

Mirroring Root Filesystem

Configuring Boot Disks With Solaris Volume Manager Software (October 2002)
-by Erik Vanden Meersch and Kristien Hens
This article is an update to the April 2002 Sun BluePrints OnLine article, Configuring Boot Disks With Solstice DiskSuite Software. This article focuses on the Solaris 9 Operating Environment, Solaris Volume Manager software, and VERITAS Volume Manager .2 software. It describes how to partition and mirror the system disk, and how to create and maintain a backup system disk. In addition, this article presents technical arguments for the choices made, and includes detailed runbooks.

Solstice DiskSuite (SDS) disk mirroring

Tips

Solaris Volume Manager Performance Best Practices (November 2003)
-by Glenn Fawcett
Compelling new features such as soft partitioning and automatic device relocation make the Solaris Volume Manager software a viable candidate for storage management needs. Solaris Volume Manager software features enhance storage management capabilities beyond what is handled by intelligent storage arrays with hardware RAID. Now Solaris Volume Manager software is integrated with the Solaris Operating Environment (Solaris OE) and does not require additional license fees. This article provides specific Solaris Volume Manager tips for system, storage, and database administrators who want to get the most out of Solaris Volume Manager software in their data centers. This article targets an intermediate audience.


forcedirectio optimization (please read Solaris Volume Manager Performance Best Practices before using it). Mount option: forcedirectio | noforcedirectio. If forcedirectio is specified and supported by the file system, then for the duration of the mount, forced direct I/O will be used. If the filesystem is mounted using forcedirectio, data is transferred directly between user address space and the disk. If the filesystem is mounted using noforcedirectio, data is buffered in kernel address space when data is transferred between user address space and the disk. forcedirectio is a performance option that is of benefit only in large sequential data transfers. The default behavior is noforcedirectio.

Sync buffer size optimization. RAID 1 volumes can benefit from increasing the sync buffer size to 1 MB (2048 512-byte blocks). To experiment, use the metasync -r 2048 command. For permanent changes:


Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). The site uses AdSense, so you need to be aware of the Google privacy policy. Copyright of original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Last modified: September 02, 2009