Softpanorama
(slightly skeptical)
Open Source Software Educational Society
May the source be with you, but remember the KISS principle ;-)
Disk and Filesystems Management in Solaris
Note: Discussion below is based on
the article by Brian Wong,
Design, Features, and Applicability of Solaris File Systems
Like any modern OS, Solaris includes many file systems, and more are available
as add-ons. A file system stores named data sets and attributes about those data
sets for subsequent data access and interpretation of the attributes. Attributes
include things like ownership, access rights, date of last access, and physical
location. More advanced attributes might include extended attributes (like those
in OS/2 HPFS), encryption keys, etc. We can distinguish five categories of file
systems available in Solaris.
- Local file systems. They are usually based on disks or other media
installed on a particular system. Local file systems are the most common and provide
naming, data access, and attribute interpretation services to the system on
which they run, and no other. Examples of local file systems on Solaris are
UFS, ZFS, and VERITAS File System (VxFS).
- Shared file systems. The classic shared file system is NFS. Most shared
file systems require a supporting local file system on the file server or network
attached storage (NAS) appliance.
- Special file systems. The most important in Solaris is tmpfs, a file system
similar to a RAM drive in DOS, which hosts files in virtual memory. CacheFS
is another example.
- Media-specific file systems. They are associated in some way with the
media on which they are typically found. The most common example is universal
disk format (UDF), the file system format found on most DVDs. (It is certainly
possible to use a DVD, or especially a DVD-RW/DVD+RW, to host some other type
of file system, such as UFS.) Other media-specific file systems include the
ISO 9660 CD-ROM file system format. PCFS, an implementation of the Microsoft FAT-32
file system, is technically a local file system (especially on a Windows machine),
but in Solaris it is effectively a media-specific file system associated with
diskettes.
- Pseudo file systems. These are actually abstractions of data that are
merely presented to users or administrators in file system form. Because file systems
are a convenient and powerful mechanism for representing data, Solaris is
built on a surprising number of pseudo file systems. Probably the best known
of these is procfs, which provides access to details of a running process, but
devfs is an abstraction that categorizes device data in a convenient fashion,
and xmemfs aggregates information about physical memory usage.
Every Solaris system includes UFS. While it is definitely old and lacking some
features, it is suitable for a wide variety of applications. The UFS design center
handles typical files found in office and business automation systems. The basic
I/O characteristics are huge numbers of small, cachable files, accessed randomly
by individual processes; bandwidth demand is low. This profile is common in most
workloads, such as software development and network services (for example, in name
services, web sites, and ftp sites).
In addition to the basic UFS, there are two variants: logging UFS (LUFS) and the
older metatrans UFS that was used in Solaris 7. All three versions share the same
basic code for block allocation, directory management, and data organization.
In particular, older versions of Solaris up to Solaris 9 have a nominal maximum UFS
size of 1 terabyte. This limit was raised to 16 terabytes in the Solaris 10 OS.
Obviously, a single file stored in any of them must fit inside a file system, so
the maximum file size is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte
file system. There is no reasonable limit to the number of file systems that can
be built on a single system; systems have been run with over 2880 UFS file systems.
The major differences between the three UFS variants are in how they handle metadata.
Metadata is information that the file system stores about the data, such as the
name of the file, ownership and access rights, last modified date, file size, and
other similar details. Other, less obvious, but possibly more important metadata
are the location of the data on the disk, such as data blocks and the indirect blocks
that indicate where data blocks reside on the disk.
Getting this metadata wrong would not only mean that the affected file might
be lost, but could lead to serious file system-wide problems or even a system crash
in the event that live data found itself in the free space list, or worse, that
free blocks somehow appeared in the middle of a file. UFS takes the simplest approach
to assuring metadata integrity: it writes metadata synchronously and requires an
extensive fsck on recovery from a system crash. The time and expense of the fsck
operation is proportional to the number of files in the file system being checked.
Large file systems with millions of small files can take tens of hours to check.
Logging file systems were developed to avoid both the ongoing performance issues
associated with synchronous writes and excessive time for recovery. Logging uses
the two-phase commit technique to ensure that metadata updates are either fully
updated on disk, or that they will be fully updated on disk upon crash recovery.
Logging implementations store pending metadata in a reserved area, and then update
the master file system based on the content of the reserved area or log.
In the event of a crash, metadata integrity is assured by inspecting the log
and applying any pending metadata updates to the master file system before accepting
any new I/O operations from applications. The size of the log is dependent on the
amount of changing metadata, not the size of the file system. The amount
of pending metadata is quite small, usually on the order of a few hundred kilobytes
for typical file systems and several tens of megabytes for very busy file systems,
so replaying the log against the master is a very fast operation. Once
the metadata integrity is guaranteed, the fsck operation becomes a null operation
and crash recovery becomes trivial. Note that for performance reasons, only metadata
is logged; user data is not logged.
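The log-then-apply cycle described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the UFS implementation: the class name LoggingFileSystem, the dictionary-based "master", and the crash_before_apply flag are all hypothetical names invented for this example.

```python
class LoggingFileSystem:
    """Toy model of metadata logging (hypothetical, not real UFS code)."""

    def __init__(self):
        self.master = {}  # metadata in the master file system: name -> attributes
        self.log = []     # reserved log area holding pending metadata records

    def update(self, name, attrs, crash_before_apply=False):
        # Phase 1: write the pending change to the log.
        record = {"name": name, "attrs": attrs, "committed": False}
        self.log.append(record)
        if crash_before_apply:
            return  # simulate a crash before the master is updated
        # Phase 2: apply the change to the master and mark it committed.
        self._apply(record)

    def _apply(self, record):
        if record["attrs"] is None:
            self.master.pop(record["name"], None)  # None means "file removed"
        else:
            self.master[record["name"]] = record["attrs"]
        record["committed"] = True

    def recover(self):
        """Replay uncommitted log records, then empty the log."""
        replayed = 0
        for record in self.log:
            if not record["committed"]:
                self._apply(record)
                replayed += 1
        self.log.clear()
        return replayed


fs = LoggingFileSystem()
fs.update("a.txt", {"size": 1})
fs.update("b.txt", {"size": 2}, crash_before_apply=True)  # crash mid-update
replayed = fs.recover()  # replays only the one pending record
```

Recovery cost in this sketch depends only on the number of pending records in the log, not on how many files the master holds, which is the property that makes log replay fast compared with a full fsck.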
The metatrans implementation was the first version of UFS to implement logging.
It was built into Solstice DiskSuite or Solaris Volume Manager software (the name
of the product depends on the version of the code, but otherwise, they are the same).
The metatrans implementation is limited to Solaris 7 and earlier and it can be recommended
only for very old releases (the Solaris 2.5.1 and Solaris 2.6 OSs) in which logging
UFS (LUFS) is not available.
Logging UFS was introduced into the Solaris 8 OS but unfortunately was not enabled by
default. The reason was a performance degradation that typically appears
only at artificially high load levels; almost no cases have been seen in practical
applications.
So in reality logging started to be used in typical installations only with Solaris
10, where it is enabled by default. Sun recommends using logging any time that fast
crash recovery is required, and it can be used starting from Solaris 8, but this recommendation
is largely ignored. This is particularly sad in the case of root file systems,
which usually do not have any significant I/O at all.
Performance Impact of Logging
One of the most confusing issues associated with logging file systems (and particularly
with logging UFS, for some reason) is the effect that the log has on performance.
First, and most importantly, logging has absolutely no impact on user data operations;
this is because only metadata operations are logged.
The performance of metadata operations is another story, and it is not as easy
to describe. The log works by writing pending changes to the log, then actually
applying the changes to the master file system. When the master is safely updated,
the log entry is marked as committed, meaning that it does not need to be reapplied
to the master in the event of a crash. This algorithm means that metadata changes
that are accomplished primarily when creating or deleting files might actually require
twice as many physical I/O operations as a non-logging implementation. The net impact
of this aspect of logging performance is that there are more I/O operations going
to storage. Typically, this has no real impact on overall performance, but in the
case where the underlying storage was already nearly 100 percent busy, the extra
operations associated with logging can tip the balance and produce significantly
lower file system throughput. (In this case, throughput is not measured in megabytes
per second, but rather in file creations and deletions per second.) If the utilization
of the underlying storage is less than approximately 90 percent, the logging overhead
is inconsequential.
On the positive side of the ledger, the most common impact on performance has
to do with the cancellation of some physical metadata operations. These cases occur
only when metadata updates are issued very rapidly, such as when doing a tar(1)
extract operation or when removing the entire contents of a directory ("rm -f *").
Without logging, the system is required to force the directory to disk after every
file is processed (this is the definition of the phrase "writing metadata synchronously");
the effect is to write 512 or 2048 bytes every time 14 bytes is changed. When the
file system is logging, the log record is pushed to disk when the log record fills,
often when the 512 byte block is completed. This results in a reduction in physical
I/O of roughly 512/14, or about 36 times, and obvious performance improvements result.
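The arithmetic behind that reduction can be checked with a short sketch. It assumes an idealized model in which every metadata update changes exactly 14 bytes and a log record is flushed only when a full 512-byte block has accumulated; the function names are hypothetical, and the file count of 7092 is borrowed from the tar test described in this section.

```python
BLOCK_BYTES = 512  # size of one log block, from the text
ENTRY_BYTES = 14   # bytes changed per metadata update, from the text


def entries_per_block(block=BLOCK_BYTES, entry=ENTRY_BYTES):
    """How many whole 14-byte updates fit into one 512-byte block."""
    return block // entry


def physical_writes(n_updates, logging):
    """Physical writes needed for n_updates under the idealized model."""
    if not logging:
        return n_updates  # synchronous: one forced write per update
    per_block = entries_per_block()
    return -(-n_updates // per_block)  # ceiling division: one write per full block


sync_writes = physical_writes(7092, logging=False)
logged_writes = physical_writes(7092, logging=True)
reduction = sync_writes / logged_writes
```

Real workloads will not hit this ideal exactly, since log records can be pushed out early, but the model shows where a factor in the mid-thirties comes from.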
The following table illustrates these results. The times are given in seconds,
and lower scores are better. Times are the average of five runs, and are intended
to show relative differences rather than the fastest possible absolute time. These
tests were run on Solaris 8 7/01 using a single disk drive.
The tar test consists of extracting 7092 files from a 175 megabyte archive (the
contents of /usr/openwin). Although a significant amount of data is moved, this
test is dominated by metadata updates for creating the files. Logging is five times
faster. The rm test removes the 7092 extracted files. It is also dominated by metadata
updates and is an astonishing 37 times faster than the non-logging case.
On the other hand, the dd write test creates a single 1 gigabyte file in the
file system, and the difference between logging and non-logging is a measurable,
but insignificant, three percent. Reading the created file from the file system
shows no performance impact from logging. Both tests use large block sizes (1 megabyte
per I/O) to optimize throughput of the underlying storage.
Another feature present in most of the local file systems is the use of direct
I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended
to avoid the overhead associated with managing cache buffers for large I/O. At first
glance, it might seem that caching is a good thing and that it would improve I/O
performance.
There is a great deal of reality underlying these expectations. All of the local
file systems perform buffer caching by default. The expected improvements occur
for typical workloads that are dominated by metadata manipulation and data sets
that are very small when compared to main memory sizes. Metadata, in particular,
is very small, amounting to less than one kilobyte per file in most UFS applications,
and only slightly more in other file systems. Typical user data sets are also quite
small; they average about 70 kilobytes. Even the larger files used in everyday
work such as presentations created using StarOffice software, JPEG images, and
audio clips are generally less than 2 megabytes. Compared to typical main memory
sizes of 256-2048 megabytes, it is reasonable to expect that these data sets and
their attributes can be cached for substantial periods of time. They are reasonably
likely to still be in memory when they are accessed again, even if that access comes
an hour later.
The situation is quite different with bulk data. Systems that process bulk data
tend to have larger memories, up to perhaps 16 gigabytes (for example, 8-64 times
larger than typical), but the data sets in these application spaces often exceed
1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even
if the file literally fits into memory and could theoretically be cached, these
data sets are substantially larger than memory that is consistently available for
I/O caching. As a result, the likelihood that the data will still be in cache when
the data is referenced again is quite low. In practice, cache reuse in these environments
is nil.
Direct I/O Performance
Caching data anyway would be fine, except that the process requires effort on
the part of the OS and processors. For small files, this overhead is insignificant.
However, the overhead becomes not only significant, but excessive when "tidal waves"
of data flow through the system. When reading 1 gigabyte of data from a disk in
large blocks, throughput is similar for both direct and buffered cases; the buffered
case delivers 13 percent greater throughput. The big difference between these two
cases is that the buffered process consumes five times as much CPU effort. Because
there is so little practical value to caching large data sets, Sun recommends using
the forcedirectio option on file systems that operate on large files. In this context,
large generally means more than about 15-20 megabytes. Note that the direct I/O
recommendation is especially true when the server in question is exporting large
files through NFS.
If direct I/O is so much more efficient, why not use direct I/O all
the time? Direct I/O means that caching is disabled. The impact of standard caching
becomes obvious when using a UFS file system in direct I/O mode while doing small
file operations. The same tar extraction benchmark used in the logging section above
takes over 51 minutes, even with logging enabled, nearly 24 times as long as
when using regular caching (2:08)! The benchmark results are summarized in Table 2.
In Table 2, throughput is represented by elapsed times in seconds, and smaller
numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz
processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive.
The differences in throughput are mainly attributable to how the file system makes
use of the capabilities of the underlying hardware.
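To make the comparison concrete, the elapsed times and CPU percentages from Table 2 can be turned into ratios. The sketch below only restates the published figures; the dictionary layout and the helper names slowdown and mmss are assumptions made for this example.

```python
# Elapsed seconds and CPU % copied from Table 2 of the source article.
TABLE_2 = {
    "create 1 GB file": {"direct": (36, 5.0), "buffered": (31, 25.0)},
    "read 1 GB file": {"direct": (30, 0.0), "buffered": (22, 22.0)},
    "tar extract": {"direct": (3062, 0.0), "buffered": (128, 6.0)},
    "rm -rf *": {"direct": (76, 1.2), "buffered": (65, 1.0)},
}


def slowdown(test):
    """How many times longer the direct I/O run took than the buffered run."""
    direct_seconds = TABLE_2[test]["direct"][0]
    buffered_seconds = TABLE_2[test]["buffered"][0]
    return direct_seconds / buffered_seconds


def mmss(seconds):
    """Format a whole number of seconds as minutes:seconds."""
    return f"{seconds // 60}:{seconds % 60:02d}"


for name in TABLE_2:
    print(f"{name:18s} direct/buffered = {slowdown(name):5.1f}x")
```

The large-file tests differ by well under a factor of two in elapsed time, while the small-file tar extraction is the one case where disabling the cache is ruinous.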
Supercaching and 32-Bit Binaries
A discussion of buffered and direct I/O methodology is incomplete without addressing
one particular attribute of the cached I/O strategy. Because file systems are part
of the operating system, they can access the entire capability of the hardware.
Of particular relevance is that file systems are able to address all of the physical
memory, which now regularly exceeds the ability of 32-bit addressing. As a result,
the file system is able to function as a kind of memory management unit (MMU) that
permits applications that are strictly 32-bit aware to make direct use of physical
memories that are far larger than their address pointers. This technique, known
as supercaching, can be particularly useful to provide extended caching for applications
that are not 64-bit aware. The best examples of this are the open-source databases,
MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct
addressing capabilities limited to 4 gigabytes.[1] However, when their data tables
are hosted on a file system operating in buffered mode, they benefit from cached
I/O. This is not as efficient as simply using a 64-bit pointer because
the application must run I/O system calls instead of merely dereferencing a 64-bit
pointer, but the advantages gained by avoiding I/O outweigh these considerations
by a wide margin.
[1] They're limited to 4 gigabytes of memory. They obviously can address far more disk
space because disk addresses are 63-bit quantities of 512-byte blocks.
TABLE 2 Analyzing the Performance of Direct I/O and Buffered I/O
Test | Direct I/O throughput (seconds) | Direct I/O CPU % | Buffered I/O throughput (seconds) | Buffered I/O CPU %
Create 1 GB file | 36 | 5.0% | 31 | 25.00%
Read 1 GB file | 30 | 0.0% | 22 | 22.00%
tar extract | 3062 | 0.0% | 128 | 6.0%
rm -rf * | 76 | 1.2% | 65 | 1.0%
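The two address-space limits mentioned here are simple powers of two, and a short calculation makes the gap visible. This is a sketch of the arithmetic only; the function names are hypothetical.

```python
GIB = 2**30  # bytes in one gigabyte (binary)
EIB = 2**60  # bytes in one exabyte (binary)


def addressable_bytes(pointer_bits):
    """Bytes reachable by a flat pointer of the given width."""
    return 2**pointer_bits


def disk_capacity_bytes(block_address_bits=63, block_size=512):
    """Bytes reachable when addresses count fixed-size disk blocks."""
    return 2**block_address_bits * block_size


memory_limit_gib = addressable_bytes(32) // GIB  # limit for a 32-bit process
disk_limit_eib = disk_capacity_bytes() // EIB    # limit for 63-bit block addresses
```

A 32-bit process tops out at 4 gigabytes of directly addressable memory, while 63-bit block addresses of 512-byte blocks reach 2^72 bytes, which is why the file system cache can usefully stand in for the memory the application cannot address itself.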
Sharing Data With NFS
To Solaris users, NFS is by far the most familiar file system. It is an explicit
over-the-wire file sharing protocol that has been a part of Solaris since
1986. Its manifest purpose is to permit safe, deterministic access to files located
on a server with reasonable security. Although NFS is media independent, it is most
commonly seen operating over TCP/IP networks. NFS is specifically designed to operate
in multiclient environments and to provide a reasonable tradeoff between performance,
consistency, and ease-of-administration. Although NFS has historically been neither
particularly fast nor particularly secure, recent enhancements address both of these
areas. Performance improved by 50-60 percent between the Solaris 8 and Solaris 9
OSs, primarily due to greatly increased efficiency in processing attribute-oriented
operations.[5] Data-intensive operations don't improve by the same margin because
they are dominated by data transfer times rather than attribute operations. Security,
particularly authentication, has been addressed through the use of much stronger
authentication mechanisms such as those available using Kerberos. NFS clients now
need to trust only their servers, rather than their servers and their client peers.
[5] A 2 x 900 MHz SF280R yielded 7200 NFS operations per second on Solaris
8 2/02. The same system yielded 1717 NFS operations per second on Solaris 9 FCS.
Understanding the Sharing Limitations of UFS
UFS is not a shared file system. Despite a fairly widespread interest in a limited-use
configuration (specifically, mounted for read/write operation on one system, while
mounted read-only on one or more "secondary" systems), UFS is not sharable without
the use of an explicit file sharing protocol such as NFS. Although read-only sharing
seems as though it should work, it doesn't. This is due to fairly fundamental decisions
made in the UFS implementation many years ago, specifically in the caching of metadata.
UFS was designed with only a single system in mind and it also has a relatively
complex data structure for files, notably including "indirect blocks," which are
blocks of metadata that contain the addresses of real user data. To maintain reasonable
performance, UFS caches metadata in memory, even though it writes metadata to disk
synchronously. This way, it is not required to re-read inodes, indirect-blocks,
and double-indirect blocks to follow an advancing file pointer. In a single-system
environment, this is a safe assumption. However, when another system has access
to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic
at worst. A writable UFS file system can change the metadata and write it to disk.
Meanwhile, a read-only UFS file system on another node holds a cached copy of
that metadata. If the writable system creates a new file or removes or extends an
existing file, the metadata changes to reflect the request. Unfortunately, the read-only
system does not see these changes and, therefore, has a stale view of the system.
This is nearly always a serious problem, with the consequences ranging from corrupted
data to a system crash. For example, if the writable system removes a file, its
blocks are placed in the free list. The read-only system isn't provided with this
information; therefore, a read of the same file will cause the read-only system to follow
the original data pointers and read blocks that are now on the free list!
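The failure mode can be reproduced in a toy simulation. The model below is purely illustrative and makes no claim about UFS internals: SharedDisk, WritableMount, ReadOnlyMount, and the file names are hypothetical, and block allocation is reduced to "reuse the lowest free block first".

```python
class SharedDisk:
    """Toy disk shared by two hosts (hypothetical model, not UFS)."""

    def __init__(self):
        self.blocks = {}     # block number -> contents
        self.free = set()    # free list
        self.inodes = {}     # on-disk metadata: file name -> block numbers
        self.next_block = 0  # next never-used block


class WritableMount:
    def __init__(self, disk):
        self.disk = disk

    def _allocate(self):
        # Prefer blocks from the free list, as a real allocator may.
        if self.disk.free:
            block = min(self.disk.free)
            self.disk.free.remove(block)
            return block
        block = self.disk.next_block
        self.disk.next_block += 1
        return block

    def create(self, name, chunks):
        numbers = []
        for chunk in chunks:
            block = self._allocate()
            self.disk.blocks[block] = chunk
            numbers.append(block)
        self.disk.inodes[name] = numbers

    def remove(self, name):
        # Removing a file returns its blocks to the free list.
        self.disk.free.update(self.disk.inodes.pop(name))


class ReadOnlyMount:
    def __init__(self, disk):
        self.disk = disk
        self.cached_inodes = {}  # metadata cached in memory, never refreshed

    def read(self, name):
        if name not in self.cached_inodes:
            self.cached_inodes[name] = list(self.disk.inodes[name])
        return [self.disk.blocks[b] for b in self.cached_inodes[name]]


def demo():
    disk = SharedDisk()
    writer, reader = WritableMount(disk), ReadOnlyMount(disk)
    writer.create("report", ["r1", "r2"])
    before = reader.read("report")         # caches the block pointers
    writer.remove("report")                # blocks go to the free list
    writer.create("secrets", ["s1", "s2"])  # free blocks are reused
    after = reader.read("report")          # follows stale pointers
    return before, after


before, after = demo()
```

In this sketch the second read of "report" silently returns the contents of a different file, because the reader never learns that its cached block pointers now refer to reallocated blocks.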
Rather than risk such extreme consequences, it is better to use one of the many
other options that exist. The selection of which option is driven by a combination
of how often updated data must be made available to the other systems, and the size
of the data sets involved. If the data is not updated too often, the most logical
option is to make a copy of the file system and to provide the copy to other nodes.
With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and
EMC TimeFinder, copying a file system does not need to be an expensive operation.
It is entirely reasonable to export a point-in-time copy of a UFS file system
from storage to another node (for example, for backup) without risk because neither
the original nor the copy is being shared. If the data changes frequently, the most
practical alternative is to use NFS.
Although performance is usually cited as a reason not to do this, the requirements
are usually not demanding enough to warrant other solutions. NFS is far faster than
most users realize, especially in environments that involve typical files smaller
than 5-10 megabytes.
There are a couple of tricks you can use under Solaris to gain a little extra
performance from your filesystems and also increase their data reliability.
When designing your filesystem, pay attention to the role it will play for your
particular application. Depending on your needs, you can map partitions to physical
disks so as to minimize the load on each pair of disks (in the case of mirroring) and
improve performance.
If you're running a webserver, for example, it would benefit performance to
have a separate pair of drives dedicated to website storage. You might configure it with
both the "noatime" and "logging" options mentioned below along with the "nosuid"
option. This
would offload requests to a separate drive and possibly a separate SCSI controller
channel.
A webserver has a mostly read-request load. RAID 10 can be used, but RAID 5
can be used too, as both provide a high read transaction rate and
redundancy in case of a drive failure.
Never mirror partitions on the same drive. Otherwise, you'll
seriously degrade your performance since you've effectively doubled your seeks.
For small web sites (let's say up to 4G) it makes sense to use /tmp
for websites as it is mapped to memory. That also means that your pages will be
cached. The drawback is that you might need to order more memory for the server,
increasing the costs. But it is a better deal than using SANs. You just need to
load the content when the server reboots, and beyond 4G the time to reboot the server
becomes annoyingly long. In any case it makes sense to use an entire drive for your
webserver filesystem. New USB storage might have read performance
comparable with hard drives, and it is also an option.
Logs from the Web server can be written to the system drive, as the volume is rather
small.
You can also tweak the UFS filesystem for a webserver by using the
noatime option (saves some writes) and the "highwater" and "lowwater" marks, set with the
"ufs_HW" and "ufs_LW" options in /etc/system. See the
Sun Performance and Tuning book (p. 172-173) and Sun's Solaris
performance tuning course.
Submitted by Jeremy on August 7, 2007 - 9:26am.
In a recent lkml thread, Linus Torvalds was involved in a discussion about
mounting filesystems with the noatime option for better performance,
"'noatime,data=writeback' will quite likely be *quite* noticeable (with different
effects for different loads), but almost nobody actually runs that way." He noted
that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver
for the case of not having 'noatime' in the mount options. Certainly more than your
estimated 10% under some loads." The discussion then looked at using the relatime
mount option to improve the situation, "relative atime only updates the atime if
the previous atime is older than the mtime or ctime. Like noatime, but useful for
applications like mutt that need to know when a file has been read since it was
last modified." Ingo Molnar stressed the significance of fixing this performance
issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates
are by far the biggest IO performance deficiency that Linux has today. Getting rid
of atime updates would give us more everyday Linux performance than all the
pagecache speedups of the past 10 years, _combined_." He submitted some patches to
improve relatime, and noted about atime:
"It's also perhaps the most stupid Unix design idea of all times. Unix is really
nice and well done, but think about this a bit: 'For every file that is read from
the disk, lets do a ... write to the disk! And, for every file that is already
cached and which we read from the cache ... do a write to the disk!'"
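The relatime rule quoted above reduces to a small predicate. The sketch below follows only the wording of the quote (update when the stored atime is older than mtime or ctime); the function name is hypothetical, timestamps are plain numbers, and an actual kernel implementation may apply additional conditions that are not modeled here.

```python
def relatime_should_update(atime, mtime, ctime):
    """Return True when a read should rewrite atime under the quoted rule.

    The access time is rewritten only if it is older than the last
    modification (mtime) or the last status change (ctime).
    """
    return atime < mtime or atime < ctime


# File modified after it was last read: the next read updates atime.
modified_since_read = relatime_should_update(atime=100, mtime=200, ctime=200)

# File read again with no change in between: no write is needed.
unchanged_since_read = relatime_should_update(atime=300, mtime=200, ctime=250)
```

This keeps the one piece of information that programs such as mutt rely on, whether a file has been read since it last changed, while skipping the write on every repeated read.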
Feb 22, 2007 (blogs.sun.com)
As many of you noticed, Solaris now supports SATA
controllers and devices. To simplify writing SATA HBA drivers, a new module
and a set of interfaces were created, referred to as either SATA Framework or
SATA module. I was a principal architect of SATA framework, but several
other Sun engineers were participating in the conceptual design and the
shaping of the interfaces.
It is not a small piece of software - the source, sata.c, is over 300k in
size. Reading this code, with associated header files, may be a little
confusing. So, I created an overview of the sata module, explaining what it
is, how it fits in the Solaris kernel, what it does, what the interfaces are and
how sample operations are performed. Hopefully, it will be useful for all
that want to improve and expand SATA support in Solaris. A similar overview was
presented about a year ago at the Silicon Valley Open Solaris User Group meeting
in Santa Clara and on various occasions internally in the Sun organization. The
overview that I plan to present here will have several parts. Here is the
first one...
Often it is necessary to set up a partition table for a disk to be the same
as on another disk, for example, where the disks are mirrored. This can be achieved
by using the format utility. In the instructions
that follow, the original disk is called disk a and the second disk (which
will have the same partition table) is called disk b.
format
<select disk a> (Select disk a from the list displayed.)
partition (Enter the partition menu.)
print (Print out the partition table.)
name
rootdisk (Pick a name of your choice.)
quit (Go back to the format menu.)
disk (Go to the menu that allows you to select disk b.)
<select disk b> (Select disk b from the list displayed.)
partition (Enter the partition menu for disk b.)
select
--Pick rootdisk (Pick from the menu the name you gave above.)
label (Write out the partition table to disk.)
quit (Go back to the format menu.)
quit (Exit format.)
- File System Performance: The Solaris, UFS, Linux ext3, and ReiserFS [PDF]
- Design, Features, and Applicability of Solaris File Systems [PDF]
- You too can understand device numbers and mapping in Solaris ...
- High Availability: Configuring Boot, Root and Swap (PDF)
- Scrubbing Disk Using the Solaris Operating Environment Format Program (June 2000),
by Rob Snevely. Rob explains how to effectively scrub disks on a Solaris
Operating Environment system, using the format utility.