The key rule of Solaris optimization is "do no harm". Learning the ropes is an important first step: see the Sun Performance and Tuning book and consider taking Sun's Solaris performance tuning course.
Don't apply recipes without understanding the mechanisms behind them and without obtaining objective performance metrics that point to a particular bottleneck. In complex cases use DTrace:
DTrace is a new tracing facility built into Solaris 10. It improves the ability to identify system problems and bottlenecks. It is a huge system, and the Solaris kernel instrumentation behind it is an achievement in itself (it has a 41-chapter manual). The tool is controlled by scripts written in D, the DTrace language, which is similar to AWK. No other OS is currently even close to Solaris 10 in this area.
Introducing new problems through misguided optimization is too frequent to ignore. Remember that a couple of hours of additional downtime caused by such actions will wipe out months or even years of performance gains.
In most cases an "educated guess" about performance bottlenecks is wrong. Only careful measurement can reveal the real cause; without it, tuning the system is an exercise in futility. A useful first step in understanding the behavior of the system and the source of bottlenecks is enabling accounting. Collecting and checkpointing the accounting data puts a negligible additional load on the system but can provide valuable insight into how the system behaves under typical load. The summary scripts that run once a day or once a week, however, can have a noticeable effect, so schedule them to run outside business hours. See Processing Accounting Data into Workloads (October 1999) by Adrian Cockcroft.
In many cases performance issues are resolved by using better (and often cheaper) hardware. Running Solaris on 5-7 year old servers often just does not make sense -- hardware technology is moving too fast. Moving from SPARC to Intel should also be viewed as an optimization option, as the price/performance of Intel boxes is better: for $7K you can buy a box that beats more expensive UltraSPARC boxes on many SPEC metrics. If the disk subsystem is the problem, adding disks and/or moving some filesystems to a SAN can improve I/O performance quite dramatically.
While bottlenecks can occur in practically any component of the server, the typical suspects are I/O, memory and CPU. Keep in mind that bottlenecks can be caused by malfunctioning components as well.
Typically overload periods are brief and limited to "rush hours". In such cases limiting or offloading other activities on the server might improve performance without affecting stability.
Solaris blueprints contain several very good papers about performance tuning. I especially recommend the blueprints written by Adrian Cockcroft.
-by Jon Hill and Kemer Thomson
This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."
Like any modern OS, Solaris includes several types of filesystems. The primary local filesystems are UFS, ZFS and VxFS (the Veritas filesystem). NFS is typically used as the network filesystem, and tmpfs (which is a memory-mapped filesystem) is a class of its own.
A file system stores named data sets and attributes about those data sets for subsequent data access and interpretation of the attributes. Attributes include things like ownership, access rights, date of last access, and physical location. Advanced filesystems, along with faster access, often provide an extended set of attributes; for example, OS/2 HPFS provides user-defined attributes.
We can distinguish five categories of file systems available in Solaris.
Discussion below is based on the article by Brian Wong, Design, Features, and Applicability of Solaris File Systems
Every Solaris system includes UFS. While it is definitely old and lacking some features, it is suitable for a wide variety of applications. The UFS design center handles typical files found in office and business automation systems. The basic I/O characteristics favor huge numbers of small, cachable files, accessed randomly by individual processes; bandwidth demand is low. This profile is common in most workloads, such as software development and network services (for example, mail servers, DNS servers, web sites, and to a certain extent FTP sites, unless you are distributing photos or images from them).
When designing your server with UFS filesystems, pay attention to the role each partition will play for your particular application. Mapping partitions to separate pairs of physical disks, to minimize the load on each pair (in the case of hardware mirroring), improves performance.
If you're running a webserver, for example, it would benefit performance to have a separate pair of drives dedicated to website storage. You might configure the webserver partition with both the "noatime" and "logging" options along with a "nosuid" option. This would offload requests to a separate drive and possibly a separate SCSI controller channel.
Webservers have a mostly read-request load and the volume of data is not that big, so RAID 10 can be used.
Software mirroring adds overhead. You should never mirror partitions on the same drive, except for training purposes: you'll seriously degrade performance since you've effectively doubled your seeks.
For small web sites (let's say up to 4G) it makes sense to use /tmp for the site content, as it is mapped to memory. That also means that your pages will be cached. The drawback is that you might need to order more memory for the server, increasing the cost, but it is a better (and cheaper) deal than using a SAN. You just need to reload the content when the server reboots. The problem is that beyond 4G the reload after a reboot becomes somewhat long, but few websites are that big. In any case it makes sense to use an entire drive for your webserver filesystem. New USB storage might have read performance comparable with the best hard drives, has no seek latency for reading, and might also be an option.
Logs from the Web server can be written to the system drive, as the volume is rather slim.
You can also tweak the UFS filesystem for a webserver by using the noatime option (saves some writes) and the "highwater" and "lowwater" marks set with the "ufs_HW" and "ufs_LW" options in /etc/system. See the Sun Performance and Tuning book (p. 172-173) and Sun's Solaris performance tuning course.
In addition to the basic UFS, there are two variants: logging UFS (LUFS) and the older metatrans UFS that was used in Solaris 7. All three versions share the same basic code for block allocation, directory management, and data organization. Older versions of UFS, up to Solaris 9, have a nominal maximum filesystem size of 1 terabyte. This limit was raised to 16 terabytes in the Solaris 10 OS.
The maximum file size is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte file system. There is no reasonable limit to the number of file systems that can be built on a single system; systems have been run with over 2880 UFS file systems. The major differences between the three UFS variants are in how they handle metadata. Metadata is information that the file system stores about the data, such as the name of the file, ownership and access rights, last modified date, file size, and other similar details. Other, less obvious, but possibly more important metadata describe the location of the data on the disk, such as the data blocks and the indirect blocks that indicate where data blocks reside on the disk.
Getting this metadata wrong would not only mean that the affected file might be lost, but could lead to serious file system-wide problems or even a system crash in the event that live data found itself in the free space list, or worse, that free blocks somehow appeared in the middle of a file. UFS takes the simplest approach to assuring metadata integrity: it writes metadata synchronously and requires an extensive fsck on recovery from a system crash. The time and expense of the fsck operation is proportional to the number of files in the file system being checked.
Large file systems with millions of small files can take tens of hours to check. Logging file systems were developed to avoid both the ongoing performance issues associated with synchronous writes and excessive time for recovery. Logging uses the two-phase commit technique to ensure that metadata updates are either fully updated on disk, or that they will be fully updated on disk upon crash recovery. Logging implementations store pending metadata in a reserved area, and then update the master file system based on the content of the reserved area or log.
In the event of a crash, metadata integrity is assured by inspecting the log and applying any pending metadata updates to the master file system before accepting any new I/O operations from applications. The size of the log is dependent on the amount of changing metadata, not the size of the file system.
Because the amount of pending metadata is quite small, usually on the order of a few hundred kilobytes for typical file systems and several tens of megabytes for very busy file systems, replaying the log against the master is a very fast operation. Once the metadata integrity is guaranteed, the fsck operation becomes a null operation and crash recovery becomes trivial. Note that for performance reasons, only metadata is logged; user data is not logged.
The metatrans implementation was the first version of UFS to implement logging. It was built into Solstice DiskSuite or Solaris Volume Manager software (the name of the product depends on the version of the code, but otherwise, they are the same). The metatrans implementation is limited to Solaris 7 and was replaced by logging UFS (LUFS).
Logging UFS was introduced into the Solaris 8 OS but unfortunately was not enabled by default. The reason for that was performance degradation, found typically only at artificially high-load levels, and almost no cases have been seen in practical applications.
So in reality logging started to be used in typical installations only with Solaris 10, where it is enabled by default. Sun recommends using logging any time that fast crash recovery is required, and it can be used starting from Solaris 8, but this recommendation is largely ignored. This is particularly sad in the case of root file systems, which usually do not have any significant I/O at all.
One of the most confusing issues associated with logging file systems (and particularly with logging UFS, for some reason) is the effect that the log has on performance. First, and most importantly, logging has absolutely no impact on user data operations; this is because only metadata operations are logged.
The performance of metadata operations is another story, and it is not as easy to describe. The log works by writing pending changes to the log, then actually applying the changes to the master file system. When the master is safely updated, the log entry is marked as committed, meaning that it does not need to be reapplied to the master in the event of a crash. This algorithm means that metadata changes that are accomplished primarily when creating or deleting files might actually require twice as many physical I/O operations as a non-logging implementation. The net impact of this aspect of logging performance is that there are more I/O operations going to storage. Typically, this has no real impact on overall performance, but in the case where the underlying storage was already nearly 100 percent busy, the extra operations associated with logging can tip the balance and produce significantly lower file system throughput. (In this case, throughput is not measured in megabytes per second, but rather in file creations and deletions per second.) If the utilization of the underlying storage is less than approximately 90 percent, the logging overhead is inconsequential.
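The log-then-apply-then-commit sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class `LoggingFS`, its method names and the one-I/O-per-step accounting are hypothetical and are not taken from the UFS implementation.

```python
"""Toy model of metadata logging: log first, apply to master, then commit."""


class LoggingFS:
    def __init__(self):
        self.master = {}        # "on-disk" metadata: file name -> attributes
        self.log = []           # reserved log area holding pending records
        self.physical_ios = 0   # assumed cost: one I/O per log write or apply

    def log_update(self, name, attrs):
        """Step 1: record the intended metadata change in the log."""
        self.log.append({"name": name, "attrs": attrs, "committed": False})
        self.physical_ios += 1

    def apply_pending(self):
        """Step 2: apply uncommitted records to the master, then mark them."""
        for rec in self.log:
            if rec["committed"]:
                continue
            if rec["attrs"] is None:
                self.master.pop(rec["name"], None)   # a deletion
            else:
                self.master[rec["name"]] = rec["attrs"]
            self.physical_ios += 1
            rec["committed"] = True

    def recover(self):
        """Crash recovery: replay whatever was logged but never committed."""
        self.apply_pending()


fs = LoggingFS()
fs.log_update("a.txt", {"size": 10})
fs.apply_pending()
fs.log_update("b.txt", {"size": 20})   # pretend the system crashes here
fs.recover()                           # replay brings the master up to date
print(sorted(fs.master), fs.physical_ios)
```

In this model two metadata changes cost four I/Os, which is the "twice as many physical I/O operations" worst case the text mentions; a non-logging model would need two.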
On the positive side of the ledger, the most common impact on performance has to do with the cancellation of some physical metadata operations. These cases occur only when metadata updates are issued very rapidly, such as when doing a tar(1) extract operation or when removing the entire contents of a directory ("rm -f *"). Without logging, the system is required to force the directory to disk after every file is processed (this is the definition of the phrase "writing metadata synchronously"); the effect is to write 512 or 2048 bytes every time 14 bytes is changed. When the file system is logging, the log record is pushed to disk when the log record fills, often when the 512 byte block is completed. This results in about a 35-fold reduction in physical I/O (512/14 ≈ 36), and obvious performance improvements result.
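The arithmetic behind that reduction can be checked with a short calculation. This is a simplified sketch: the 14-byte change and the 512-byte block come from the text, while the file count (the same 7092 used in the tar benchmark in this article) and the assumption that every log block fills completely before it is written are illustrative.

```python
import math

DIR_ENTRY_BYTES = 14    # bytes changed per processed file (from the text)
LOG_BLOCK_BYTES = 512   # log record is pushed when this block fills
FILES = 7092            # assumed workload size


def sync_writes(files):
    """Without logging: one forced directory write per file."""
    return files


def logged_writes(files, entry=DIR_ENTRY_BYTES, block=LOG_BLOCK_BYTES):
    """With logging: changes accumulate and go out one full block at a time."""
    return math.ceil(files * entry / block)


s = sync_writes(FILES)
l = logged_writes(FILES)
print(s, l, round(s / l, 1))   # writes without log, with log, and the ratio
```

Under these assumptions the ratio comes out a little above 36, in line with the roughly 35-fold figure quoted above.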
The following table illustrates these results. The times are given in seconds, and lower scores are better. Times are the average of five runs, and are intended to show relative differences rather than the fastest possible absolute time. These tests were run on Solaris 8 7/01 using a single disk drive.
The tar test consists of extracting 7092 files from a 175 megabyte archive (the contents of /usr/openwin). Although a significant amount of data is moved, this test is dominated by metadata updates for creating the files. Logging is five times faster. The rm test removes the 7092 extracted files. It is also dominated by metadata updates and is an astonishing 37 times faster than the non-logging case.
On the other hand, the dd write test creates a single 1 gigabyte file in the file system, and the difference between logging and non-logging is a measurable, but insignificant, three percent. Reading the created file from the file system shows no performance impact from logging. Both tests use large block sizes (1 megabyte per I/O) to optimize throughput of the underlying storage.
Another feature present in most of the local file systems is the use of direct I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended to avoid the overhead associated with managing cache buffers for large I/O. At first glance, it might seem that caching is a good thing and that it would improve I/O performance.
There is a great deal of reality underlying these expectations. All of the local file systems perform buffer caching by default. The expected improvements occur for typical workloads that are dominated by metadata manipulation and data sets that are very small when compared to main memory sizes. Metadata, in particular, is very small, amounting to less than one kilobyte per file in most UFS applications, and only slightly more in other file systems. Typical user data sets are also quite small; they average about 70 kilobytes. Even the larger files used in everyday work, such as presentations created using StarOffice software, JPEG images, and audio clips, are generally less than 2 megabytes. Compared to typical main memory sizes of 256-2048 megabytes, it is reasonable to expect that these data sets and their attributes can be cached for substantial periods of time. They are reasonably likely to still be in memory when they are accessed again, even if that access comes an hour later.
The situation is quite different with bulk data. Systems that process bulk data tend to have larger memories, up to perhaps 16 gigabytes (for example, 8-64 times larger than typical), but the data sets in these application spaces often exceed 1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even if the file literally fits into memory and could theoretically be cached, these data sets are substantially larger than memory that is consistently available for I/O caching. As a result, the likelihood that the data will still be in cache when the data is referenced again is quite low. In practice, cache reuse in these environments is nil.
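The claim that cache reuse is nil for bulk data can be illustrated with a small simulation. This is a sketch under assumptions: a plain LRU policy stands in for the real page cache, and the block counts are arbitrary.

```python
from collections import OrderedDict


def hit_rate(cache_blocks, accesses):
    """Return the fraction of accesses served from an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used
    return hits / len(accesses)


CACHE = 1000                       # cache capacity in blocks (assumed)
small = list(range(100)) * 20      # small working set, read 20 times
bulk = list(range(4000)) * 2       # data set 4x the cache, read twice

small_rate = hit_rate(CACHE, small)
bulk_rate = hit_rate(CACHE, bulk)
print(small_rate, bulk_rate)
```

The small working set is served almost entirely from cache, while the sequential pass over the oversized data set evicts every block before it is needed again.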
Caching data anyway would be fine, except that the process requires effort on the part of the OS and processors. For small files, this overhead is insignificant. However, the overhead becomes not only significant, but excessive when "tidal waves" of data flow through the system. When reading 1 gigabyte of data from a disk in large blocks, throughput is similar for both direct and buffered cases; the buffered case delivers 13 percent greater throughput. The big difference between these two cases is that the buffered process consumes five times as much CPU effort. Because there is so little practical value to caching large data sets, Sun recommends using the forcedirectio option on file systems that operate on large files. In this context, large generally means more than about 15-20 megabytes. Note that the direct I/O recommendation is especially true when the server in question is exporting large files through NFS.
If direct I/O is so much more efficient, why not use direct I/O all the time? Direct I/O means that caching is disabled. The impact of standard caching becomes obvious when using a UFS file system in direct I/O mode while doing small file operations. The same tar extraction benchmark used in the logging section above takes over 51 minutes, even with logging enabled, roughly 24 times as long as when using regular caching (2:08)! The benchmark results are summarized in the following table.
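Since the forcedirectio recommendation depends on file size, it can help to measure how much of a file system's data actually sits in large files before changing mount options. The following is a minimal sketch: the function name `large_file_share` is hypothetical, and the 15-megabyte threshold is simply the lower end of the range quoted in the text.

```python
import os
import stat

LARGE_BYTES = 15 * 1024 * 1024   # lower end of the 15-20 MB range in the text


def large_file_share(root, threshold=LARGE_BYTES):
    """Return (total_bytes, large_bytes) for regular files under root."""
    total = large = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue                 # file vanished or is unreadable
            if not stat.S_ISREG(info.st_mode):
                continue                 # skip symlinks, devices, and so on
            total += info.st_size
            if info.st_size >= threshold:
                large += info.st_size
    return total, large
```

Calling it on a mount point returns the total bytes and the bytes held in files at or above the threshold. A file system where nearly all bytes are in large files is a candidate for direct I/O; one dominated by small files is better left with regular caching.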
In this table, throughput is represented by elapsed times in seconds, and smaller numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive. The differences in throughput are mainly attributable to how the file system makes use of the capabilities of the underlying hardware.
A discussion of buffered and direct I/O methodology is incomplete without addressing one particular attribute of the cached I/O strategy. Because file systems are part of the operating system, they can access the entire capability of the hardware. Of particular relevance is that file systems are able to address all of the physical memory, which now regularly exceeds the ability of 32-bit addressing. As a result, the file system is able to function as a kind of memory management unit (MMU) that permits applications that are strictly 32-bit aware to make direct use of physical memories that are far larger than their address pointers. This technique, known as supercaching, can be particularly useful to provide extended caching for applications that are not 64-bit aware. The best examples of this are the open-source databases, MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct addressing capabilities limited to 4 gigabytes. (They obviously can address far more disk space because disk addresses are 63-bit quantities of 512-byte blocks.) However, when their data tables are hosted on a file system operating in buffered mode, they benefit from cached I/O. This is not as efficient as simply using a 64-bit pointer because the application must run I/O system calls instead of merely dereferencing a 64-bit pointer, but the advantages gained by avoiding I/O outweigh these considerations by a wide margin.
To Solaris users, NFS is by far the most familiar file system. It is an explicit over-the-wire file sharing protocol that has been a part of Solaris since 1986. Its manifest purpose is to permit safe, deterministic access to files located on a server with reasonable security. Although NFS is media independent, it is most commonly seen operating over TCP/IP networks. NFS is specifically designed to operate in multiclient environments and to provide a reasonable tradeoff between performance, consistency, and ease-of-administration. Although NFS has historically been neither particularly fast nor particularly secure, recent enhancements address both of these areas. Performance improved by 50-60 percent between the Solaris 8 and Solaris 9 OSs, primarily due to greatly increased efficiency in processing attribute-oriented operations. Data-intensive operations don't improve by the same margin because they are dominated by data transfer times rather than attribute operations. Security, particularly authentication, has been addressed through the use of much stronger authentication mechanisms such as those available using Kerberos. NFS clients now need to trust only their servers, rather than their servers and their client peers.
Footnote: A 2 x 900 MHz SF280R yielded 7200 NFS operations per second on Solaris 8 2/02. The same system yielded 1717 NFS operations per second on Solaris 9 FCS.
UFS is not a shared file system. Despite a fairly widespread interest in a limited-use configuration (specifically, mounted for read/write operation on one system, while mounted read-only on one or more "secondary" systems), UFS is not sharable without the use of an explicit file sharing protocol such as NFS. Although read-only sharing seems as though it should work, it doesn't. This is due to fairly fundamental decisions made in the UFS implementation many years ago, specifically in the caching of metadata.
UFS was designed with only a single system in mind and it also has a relatively complex data structure for files, notably including "indirect blocks," which are blocks of metadata that contain the addresses of real user data. To maintain reasonable performance, UFS caches metadata in memory, even though it writes metadata to disk synchronously. This way, it is not required to re-read inodes, indirect-blocks, and double-indirect blocks to follow an advancing file pointer. In a single-system environment, this is a safe assumption. However, when another system has access to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic at worst. A writable UFS file system can change the metadata and write it to disk.
Meanwhile, a read-only UFS file system on another node holds a cached copy of that metadata. If the writable system creates a new file or removes or extends an existing file, the metadata changes to reflect the request. Unfortunately, the read-only system does not see these changes and, therefore, has a stale view of the file system. This is nearly always a serious problem, with the consequences ranging from corrupted data to a system crash. For example, if the writable system removes a file, its blocks are placed in the free list. The read-only system isn't provided with this information, so a read of the same file will cause the read-only system to follow the original data pointers and read blocks that are now on the free list!
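The stale-metadata failure described above can be reproduced in a toy model. Everything here is hypothetical scaffolding: the dictionaries stand in for on-disk blocks, inodes and the free list, and `reader_cache` stands in for the metadata cached by the read-only host.

```python
"""Toy model: a read-only host with cached block pointers goes stale."""

disk_blocks = {10: b"report-v1", 11: b"report-v1-tail"}   # block number -> data
disk_inodes = {"report": [10, 11]}                         # file -> block list
free_list = []

# The read-only host caches the metadata once and never re-reads it.
reader_cache = {name: list(blocks) for name, blocks in disk_inodes.items()}


def writer_remove(name):
    """Writable host removes a file: its blocks go to the free list."""
    free_list.extend(disk_inodes.pop(name))


def writer_create(name, chunks):
    """Writable host creates a file, reusing freed blocks."""
    blocks = []
    for chunk in chunks:
        blk = free_list.pop(0)
        disk_blocks[blk] = chunk
        blocks.append(blk)
    disk_inodes[name] = blocks


def reader_read(name):
    """Read-only host follows its cached (possibly stale) pointers."""
    return b"".join(disk_blocks[b] for b in reader_cache[name])


writer_remove("report")
writer_create("payroll", [b"secret-", b"salaries"])
stale = reader_read("report")
print(stale)   # the reader gets another file's data under the old name
```

The read-only host still believes "report" exists and returns the contents of a different file, which is the corrupted-data outcome the text warns about.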
Rather than risk such extreme consequences, it is better to use one of the many other options that exist. The selection of which option is driven by a combination of how often updated data must be made available to the other systems, and the size of the data sets involved. If the data is not updated too often, the most logical option is to make a copy of the file system and to provide the copy to other nodes. With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and EMC TimeFinder, copying a file system does not need to be an expensive operation.
It is entirely reasonable to export a point-in-time copy of a UFS file system from storage to another node (for example, for backup) without risk because neither the original nor the copy is being shared. If the data changes frequently, the most practical alternative is to use NFS.
Although performance is usually cited as a reason not to do this, the requirements are usually not demanding enough to warrant other solutions. NFS is far faster than most users realize, especially in environments that involve typical files smaller than 5-10 megabytes.
There are a couple of tricks you can use under Solaris to gain a little extra performance from your filesystems and also increase their data reliability. When designing your filesystem, pay attention to the role it will play for your particular application. Depending on your needs, map partitions to physical disks so as to minimize the load on each pair of disks (in the case of mirroring) and improve performance.
If you're running a webserver, for example, it would benefit performance to have a separate pair of drives dedicated to website storage. You might configure it with both the "noatime" and "logging" options along with a "nosuid" option. This would offload requests to a separate drive and possibly a separate SCSI controller channel.
Webservers have a mostly read-request load. RAID 10 can be used, but RAID 5 can be used too, as both provide a high read transaction rate and redundancy in case of a drive failure.
The kernel summit was two weeks ago, and at the end of that I got one of the new 80GB solid state disks from Intel. Since then, I've been wanting to talk to people about it because I'm so impressed with it, but at the same time I don't much like using the kernel mailing list as some kind of odd public publishing place that isn't really kernel-related, so since I'm testing this whole blogging thing, I might as well vent about it here.
That thing absolutely rocks.
I've been impressed by Intel before (Core 2), but they've had their share of total mistakes and idiotic screw-ups too (Itanic), but the things Intel tends to have done well are the things where they do incremental improvements. So it's a nice thing to be able to say that they can do new things very well too. And while I often tend to get early access to technology, seldom have I looked forward to it so much, and seldom have things lived up to my expectations so well.
In fact, I can't recall the last time that a new tech toy I got made such a dramatic difference in performance and just plain usability of a machine of mine.
So what's so special about that Intel SSD, you ask? Sure, it gets up to 250MB/s reads and 70MB/s writes, but fancy disk arrays can certainly do as well or better. Why am I not gushing about some nice NAS box? I didn't even put the thing into a laptop, after all, it's actually in Tove's Mac Mini (running Linux, in case anybody was confused ;), so a RAID NAS box would certainly have been a lot bigger and probably have more features.
But no, forget about the throughput figures. Others can match - or at least come close - to the throughput, but what that Intel SSD does so well is random reads and writes. You can do small random accesses to it and still get great performance, and quite frankly, that's the whole point of not having some stupid mechanical latencies as far as I'm concerned.
And the sad part is that other SSD's generally absolutely suck when it comes to especially random write performance. And small random writes is what you get when you update various filesystem meta-data on any normal filesystem, so it really does matter. For example, a vendor who shall remain nameless has an SSD disk out there that they were also hawking at the Kernel Summit, and while they get fine throughput (something like 50+MB/s on big contiguous writes), they benchmark a pitiful 10 (yes, that's ten, as in "how many fingers do you have") small random writes per second. That is slower than a rotational disk.
In contrast, the Intel SSD does about 8,500 4kB random writes per second. Yeah, that's over eight thousand IOps on random write accesses with a relevant block size, rather than some silly and unrealistic contiguous write test. That's what I call solid-state media.
The whole thing just rocks. Everything performs well. You can put that disk in a machine, and suddenly you almost don't even need to care whether things were in your page cache or not. Firefox starts up pretty much as snappily in the cold-cache case as it does hot-cache. You can do package installation and big untars, and you don't even notice it, because your desktop doesn't get laggy or anything.
So here's the deal: right now, don't buy any other SSD than the Intel ones, because as far as I can tell, all the other ones are pretty much inferior to the much cheaper traditional disks, unless you never do any writes at all (and turn off 'atime', for that matter).
So people - ignore the manufacturer write throughput numbers. They don't mean squat. The fact that you may be able to push 50MB/s to the SSD is meaningless if that can only happen when you do big, aligned, writes.
If anybody knows of any reasonable SSDs that work as well as Intel's, let me know.
**** docs.sun.com Solaris Tunable Parameters Reference Manual
**** Solaris - Tuning Your TCP-IP Stack
**** The SE Performance Toolkit - Release 3.2 -- not the latest version -- see SE Toolkit.com for 3.2.1
ITworld.com - Performance Q&A
Adrian Cockcroft's Performance Q&A column in Unix Insider generates more reader email than any other item. We've started to see some repetition to your letters, so to lighten Adrian's email load and offer you quicker access to Adrian's wisdom, we've compiled the more frequently asked questions here.
Solaris Kernel Tuning -- Princeton CIT guide by Scott Cromar
Solaris tuning information - outdated
Successful Solaris Performance Tuning
The performance problems described in this article are common, so much so that I improve my chances of success by making the parameter changes every time I set up a server without waiting for a problem to surface. If you suspect network problems, refer to Jens Voeckler's Web site: http://www.sean.de/Solaris
If you have memory, CPU, or disk problems, a good resource is Adrian Cockcroft's book, Sun Performance and Tuning: SPARC and Solaris, Second edition (Sun Microsystems Press, ISBN 0-13-149642-5). It's not easy reading if you're just grazing, but if you're investigating a specific problem, this book is the place to go. In October 2000, Sun published a manual describing all the kernel-tunable parameters, which is available at: http://docs.sun.com
Fine tuning proxy servers SunWorld - Letters to the Editor - April 1997
SQUID Frequently Asked Questions System-Dependent Weirdnesses
select(3c) won't handle more than 1024 file descriptors. The configure script should enable poll() by default for Solaris. poll() allows you to use many more file descriptors, probably 8192 or more.
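For readers unfamiliar with the poll() style of interface, here is a minimal sketch using Python's standard `select.poll` wrapper. It is not Squid code; it only shows that descriptors are registered individually with a poll object instead of being packed into a fixed-size set, which is why the 1024-descriptor ceiling of select() does not apply. The pipe is just a convenient descriptor for the demonstration.

```python
import os
import select

# A pipe gives us a descriptor that becomes readable on demand.
read_fd, write_fd = os.pipe()

poller = select.poll()                    # Unix only; not available on Windows
poller.register(read_fd, select.POLLIN)   # watch for "data available to read"

os.write(write_fd, b"ping")
events = poller.poll(1000)                # timeout is given in milliseconds

# poll() returns (fd, event mask) pairs for the descriptors that are ready.
ready = [fd for fd, mask in events if mask & select.POLLIN]
data = os.read(read_fd, 4) if read_fd in ready else b""
print(data)

os.close(read_fd)
os.close(write_fd)
```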
For older Squid versions you can enable poll() manually by changing HAVE_POLL in include/autoconf.h, or by adding -DUSE_POLL=1 to the DEFINES in src/Makefile.malloc libmalloc.a is leaky. Squid's configure does not use -lmalloc on Solaris.
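As a rough illustration of the select()/poll() point above, the following Python sketch waits on a socket with poll() instead of select(). It is not Squid code: the helper name wait_readable and the socket pair are made up for the example, and it assumes a platform where Python's select.poll is available (most Unix-like systems).

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets that are readable, using poll() rather than select().

    poll() takes a list of descriptors instead of a fixed-size bit set,
    so it is not tied to the 1024-descriptor limit of select().
    """
    poller = select.poll()
    by_fd = {}
    for sock in socks:
        poller.register(sock.fileno(), select.POLLIN)
        by_fd[sock.fileno()] = sock
    events = poller.poll(timeout_ms)
    return [by_fd[fd] for fd, event in events if event & select.POLLIN]


# Hypothetical usage: one end of a socket pair becomes readable
# as soon as the other end has written a byte.
a, b = socket.socketpair()
b.sendall(b"x")
ready = wait_readable([a])
```

The same helper works unchanged whether it is given ten sockets or several thousand, which is the property that matters for a busy proxy.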
DNS lookups and nscd by David J N Begley
DNS lookups can be slow because of some mysterious thing called nscd. You should edit /etc/nscd.conf and make it say:
    enable-cache hosts no
Apparently nscd serializes DNS queries thus slowing everything down when an application (such as Squid) hits the resolver hard. You may notice something similar if you run a log processor executing many DNS resolver queries - the resolver starts to slow.. right.. down..
According to Andres Kroonmaa, users of Solaris 2.6 and later should NOT completely disable the nscd daemon. nscd should be running and caching passwd and group files, although it is suggested to disable hosts caching as it may interfere with DNS lookups.
Several library calls rely on a free file descriptor numbered below 256 being available. Systems running without nscd may fail on such calls if the first 256 descriptors are all in use.
Since Solaris 2.6, Sun has changed the way some system calls work and uses the nscd daemon to implement them. To communicate with nscd, Solaris uses undocumented door calls. Basically, nscd is used to reduce the memory usage of user-space system libraries that use the passwd and group files. Before 2.6, Solaris cached the full passwd file in library memory on first use, but because this was considered to use too much RAM on large multiuser systems, Sun decided to move the implementation of these calls out of the libraries and into a single dedicated daemon.
DNS lookups and /etc/nsswitch.conf by Jason Armistead
The /etc/nsswitch.conf file determines the order of searches for lookups (amongst other things). You might only have it set up to allow NIS and HOSTS files to work. You definitely want the "hosts:" line to include the word dns, e.g.:
    hosts: nis dns [NOTFOUND=return] files
DNS lookups and NIS by Chris Tilbury
Our site cache is running on a Solaris 2.6 machine. We use NIS to distribute authentication and local hosts information around and in common with our multiuser systems, we run a slave NIS server on it to help the response of NIS queries.
We were seeing very high name-ip lookup times (avg ~2sec) and ip->name lookup times (avg ~8 sec), although there didn't seem to be that much of a problem with response times for valid sites until the cache was being placed under high load. Then, performance went down the toilet.
After some time, and a bit of detective work, we found the problem. On Solaris 2.6, if you have a local NIS server running (ypserv) and you have NIS in your /etc/nsswitch.conf hosts entry, then check the flags it is being started with. The 2.6 ypstart script checks to see if there is a resolv.conf file present when it starts ypserv. If there is, then it starts it with the -d option.
This has the same effect as putting the YP_INTERDOMAIN key in the hosts table -- namely, that failed NIS host lookups are tried against the DNS by the NIS server.
This is a bad thing(tm)! If NIS itself tries to resolve names using the DNS, then the requests are serialised through the NIS server, creating a bottleneck (This is the same basic problem that is seen with nscd). Thus, one failing or slow lookup can, if you have NIS before DNS in the service switch file (which is the most common setup), hold up every other lookup taking place.
If you're running in this kind of setup, then you will want to make sure that
- ypserv doesn't start with the -d flag.
- you don't have the YP_INTERDOMAIN key in the hosts table (find the B=-b line in the yp Makefile and change it to B=)
We changed these here, and saw our average lookup times drop by up to an order of magnitude (~150msec for name-ip queries and ~1.5sec for ip-name queries, the latter still so high, I suspect, because more of these fail and timeout since they are not made so often and the entries are frequently non-existent anyway).
Tuning Solaris 2.x - tuning your TCP/IP stack and more by Jens-S. Vöckler
disk write error: (28) No space left on device
You might get this error even if your disk is not full, and is not out of inodes. Check your syslog logs (/var/adm/messages, normally) for messages like either of these:
    NOTICE: realloccg /proxy/cache: file system full
    NOTICE: alloc: /proxy/cache: file system full
In a nutshell, the UFS filesystem used by Solaris can't cope with the workload squid presents to it very well. The filesystem will end up becoming highly fragmented, until it reaches a point where there are insufficient free blocks left to create files with, and only fragments available. At this point, you'll get this error and squid will revise its idea of how much space is actually available to it. You can do a "fsck -n raw_device" (no need to unmount, this checks in read only mode) to look at the fragmentation level of the filesystem. It will probably be quite high (>15%).
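The fragmentation check described above can be scripted. The sketch below, in Python, pulls the fragmentation percentage out of an fsck summary line and compares it to the 15% figure mentioned above. The sample line is an assumption about what the UFS fsck summary looks like (the numbers are made up), and the function names are hypothetical; check the output on your own system before relying on the pattern.

```python
import re

# Assumed shape of the UFS fsck summary line; the numbers are made up.
SAMPLE_SUMMARY = (
    "52310 files, 3890122 used, 210455 free "
    "(61823 frags, 18579 blocks, 17.4% fragmentation)"
)

FRAG_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)% fragmentation")


def fragmentation_percent(fsck_output):
    """Return the fragmentation percentage found in fsck output, or None."""
    match = FRAG_RE.search(fsck_output)
    return float(match.group(1)) if match else None


def is_highly_fragmented(fsck_output, threshold=15.0):
    """True when the reported fragmentation exceeds the threshold."""
    percent = fragmentation_percent(fsck_output)
    return percent is not None and percent > threshold


print(fragmentation_percent(SAMPLE_SUMMARY))
print(is_highly_fragmented(SAMPLE_SUMMARY))
```

Running this periodically against the cache filesystems gives an early warning before the "file system full" notices start to appear.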
Sun suggest two solutions to this problem. One costs money, the other is free but may result in a loss of performance (although Sun do claim it shouldn't, given the already highly random nature of squid disk access).
The first is to buy a copy of VxFS, the Veritas Filesystem. This is an extent-based filesystem and it's capable of having online defragmentation performed on mounted filesystems. This costs money, however (VxFS is not very cheap!)
The second is to change certain parameters of the UFS filesystem. Unmount your cache filesystems and use tunefs to change optimization to "space" and to reduce the "minfree" value to 3-5% (under Solaris 2.6 and higher, very large filesystems will almost certainly have a minfree of 2% already and you shouldn't increase this). You should be able to get fragmentation down to around 3% by doing this, with an accompanied increase in the amount of space available.
Thanks to Chris Tilbury.
Changing the directory lookup cache size by Mike Batchelor
On Solaris, the kernel variable for the directory name lookup cache size is ncsize. In /etc/system, you might want to try
    set ncsize = 8192
or even higher. The kernel variable ufs_ninode - which is the size of the inode cache itself - scales with ncsize in Solaris 2.5.1 and later. Previous versions of Solaris required both to be adjusted independently, but now, it is not recommended to adjust ufs_ninode directly on 2.5.1 and later. You can set ncsize quite high, but at some point - dependent on the application - a too-large ncsize will increase the latency of lookups.
Defaults are:
    Solaris 2.5.1: (max_nprocs + 16 + maxusers) + 64
    Solaris 2.6/Solaris 7: 4 * (max_nprocs + maxusers) + 320
The priority_paging algorithm by Mike Batchelor
Another new tuneable (actually a toggle) in Solaris 2.5.1, 2.6 or Solaris 7 is the priority_paging algorithm. This is actually a complete rewrite of the virtual memory system on Solaris. It will page out application data last, and filesystem pages first, if you turn it on (set priority_paging = 1 in /etc/system). As you may know, the Solaris buffer cache grows to fill available pages, and under the old VM system, applications could get paged out to make way for the buffer cache, which can lead to swap thrashing and degraded application performance. The new priority_paging helps keep application and shared library pages in memory, preventing the buffer cache from paging them out, until memory gets REALLY short. Solaris 2.5.1 requires patch 103640-25 or higher and Solaris 2.6 requires 105181-10 or higher to get priority_paging. Solaris 7 needs no patch, but all versions have it turned off by default.
The DNLC stores directory lookup information for files whose names are shorter than 30 characters. (The restriction on file name length was lifted in Solaris 7 and 8.)
sar -a reports on the activity of this cache. In this output, namei/s reports the name lookup rate and iget/s reports the number of directory lookups per second. Note that an iget is issued for each component of a file's path, so the hit rate cannot be calculated directly from the sar -a output. The sar -a output is useful, however, when looking at cache efficiency in a more holistic sense.
For our purposes, the most important number is the total name lookups line in the vmstat -s output. This line reports a cache hit percentage. If this percentage is not above 90%, the DNLC should be resized.
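A minimal Python sketch of the check described above: it looks for the total name lookups line in vmstat -s output and flags a hit rate that is not above 90%. The sample text is an assumption about the line's layout (with made-up numbers), and the function names are hypothetical.

```python
import re

# Assumed layout of the relevant vmstat -s line; the numbers are made up.
SAMPLE_VMSTAT = """\
  1843201 total name lookups (cache hits 87%)
   120334 user   cpu
"""

LOOKUP_RE = re.compile(
    r"^\s*(\d+) total name lookups \(cache hits (\d+)%\)", re.MULTILINE
)


def dnlc_hit_rate(vmstat_output):
    """Return (lookups, hit_percent) from vmstat -s output, or None if absent."""
    match = LOOKUP_RE.search(vmstat_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def dnlc_needs_resizing(vmstat_output, minimum_hit_percent=90):
    """True when the hit rate is not above the 90% rule of thumb."""
    result = dnlc_hit_rate(vmstat_output)
    return result is not None and result[1] <= minimum_hit_percent


print(dnlc_hit_rate(SAMPLE_VMSTAT))
print(dnlc_needs_resizing(SAMPLE_VMSTAT))
```

Because the counter is cumulative since boot, it is worth sampling it under a representative load rather than right after a restart.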
DNLC size is determined by the ncsize kernel parameter. By default, this is set to (17 x maxusers) + 90 (Solaris 2.5.1) or 4 x (maxusers + max_nprocs) + 320 (Solaris 2.6-8). It is not recommended that it be set any higher than a value which corresponds to a maxusers value of 2048.
(Note that the AnswerBooks and Cockcroft report the incorrect algorithm for ufs_ninode. The above formula comes from Sun's kernel support group.)
ncsize, add a line to the
The DNLC can be disabled by setting ncsize to a negative number (Solaris 2.5.1-7) or a non-positive number (Solaris 8).
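The default ncsize formulas quoted above can be worked through in a few lines of Python. This sketch assumes the usual default max_nprocs = 10 + 16 * maxusers, which is not stated in the text above; under that assumption the Solaris 2.5.1 expression (max_nprocs + 16 + maxusers) + 64 reduces to 17 * maxusers + 90, so the two forms quoted for 2.5.1 agree. The function names are hypothetical.

```python
def default_max_nprocs(maxusers):
    """Assumed default (not from the text above): 10 + 16 * maxusers."""
    return 10 + 16 * maxusers


def default_ncsize(maxusers, release):
    """Default ncsize according to the formulas quoted in the text."""
    if release == "2.5.1":
        return 17 * maxusers + 90
    # Solaris 2.6 through 8
    return 4 * (maxusers + default_max_nprocs(maxusers)) + 320


for maxusers in (64, 1024, 2048):
    print(maxusers,
          default_ncsize(maxusers, "2.5.1"),
          default_ncsize(maxusers, "2.6"))
```

With maxusers at the suggested ceiling of 2048, this gives 34906 for Solaris 2.5.1 and 139624 for Solaris 2.6-8, which can serve as a sanity bound before writing a larger value into /etc/system.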
Dio is a device I/O analysis tool. It can analyse a partition, a disk, an entire file system or any other kind of I/O device. It provides realtime output of maximum read rate, total bytes read and other useful stats.
[Aug 8, 2001] http://www.sysadminmag.com/newsletters/feature/ What I Did Instead of Buying a SAN by Adam Anderson
In situations where SAN technology is not cost-effective, Anderson has used two alternative storage technologies that deliver SAN functionality at reasonable cost.
[July 27, 2001] Solaris partitioning: Partitioning in itself doesn't give efficiency, and can actually be a hindrance, since you cannot easily expand a partition unless you use LVM (Logical Volume Manager).
It depends on your disk subsystem: how many disks, software RAID or hardware RAID (1, 0+1, 5), SCSI or IDE.
Generally, I think of my hard disk content as divided into 3 categories: data, configuration files, and binaries/applications/OS.
Efficiency can be gained by distributing I/O load between different disk "subsystems".
E.g., let's say the webserver generates lots of logging info on every request, and that every request generates database I/O activity too. It would then make sense to place the webserver logging data and the DB on different disks (and therefore on different partitions). This is especially true for SCSI, but IDE disks should benefit too.
General rules of thumb:
/home should be on its own partition and ideally on its own disk. Of course, this depends on whether your server has local users or uses .maildir (qmail).
If you have users and user data in /home this is very convenient, especially when performing dangerous upgrades (unmount it), restoring the system after a disk crash or compromise, or if users need more disk space (see IBM's excellent article on moving /home, on their developer network). Size? Depends entirely, but _a lot_, since you can't just clean up in the users' home dirs if size becomes a problem.
/var should be on its own partition. This may give a little extra security and stability, since /var is used for dynamic data and log files. If a process runs amok (or a DoS attack occurs) and generates ever-expanding log files, the damage is constrained to a single partition. This may prevent the system from crashing. A couple of GB is not too little.
Some like a separate /boot partition of e.g. 50MB. (I don't use that.)
/usr may be a candidate for its own partition. If so, then allocate it lots of free space, since /usr tends to grow a lot with time, and the extra free space may be needed during distribution upgrades. A couple of GB will do fine for many.
swap: The official guideline for swap space with kernel 2.4 is swap space = 2*RAM.
So if the server has 256MB RAM, use 512MB for swap. Again, check out IBM's Linux section on their developer network. They have a nice article on swap usage; e.g. if you have 2 disks, make a 256MB swap partition on each. Then swapping would be parallelized, which means that it would have the same speed advantage as RAID 0.
Always allocate much more space on a partition than you need.
Don't make too many partitions
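The swap rule of thumb in the list above is simple arithmetic. A small Python sketch, with hypothetical function names, that applies swap = 2 * RAM and splits the total evenly across the available disks:

```python
def recommended_swap_mb(ram_mb):
    """Rule of thumb quoted above for kernel 2.4: swap = 2 * RAM."""
    return 2 * ram_mb


def swap_per_disk_mb(ram_mb, disks):
    """Split the recommended swap evenly over the disks (rounded down)."""
    if disks < 1:
        raise ValueError("at least one disk is required")
    return recommended_swap_mb(ram_mb) // disks


# The example from the text: 256MB of RAM and two disks.
print(recommended_swap_mb(256))
print(swap_per_disk_mb(256, 2))
```

For the 256MB example this yields 512MB in total, or 256MB on each of two disks, matching the figures in the text.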
Tuning the ACL User Cache
Solaris File System Tuning
The Last but not Least
Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~Archibald Putt, Ph.D.
Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author's free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright of original materials belongs to the respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: March 12, 2019