Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)


Performance Monitoring


In its simplest form, a performance monitor, or system monitor, is a utility that tracks running processes and gives a real-time (often graphical) display of resource utilization. The Unix top utility is a classic example of such a tool. A monitor can assist with planning upgrades, tracking processes that need to be optimized, monitoring the results of tuning and configuration changes, and understanding a workload and its effect on resource usage in order to identify bottlenecks.

Bottlenecks can occur in practically any component of the server, with the typical suspects being I/O, memory, and CPU. A bottleneck can be caused by a malfunctioning resource, by the system not having enough of a resource, or by a program that dominates a particular resource.

The Solaris blueprints include several very good papers on performance tuning. I especially recommend the blueprints written by Adrian Cockcroft.

Old News ;-)

[Oct 9, 2008] .. so I got one of the new Intel SSD's

The kernel summit was two weeks ago, and at the end of that I got one of the new 80GB solid state disks from Intel. Since then, I've been wanting to talk to people about it because I'm so impressed with it, but at the same time I don't much like using the kernel mailing list as some kind of odd public publishing place that isn't really kernel-related, so since I'm testing this whole blogging thing, I might as well vent about it here.

That thing absolutely rocks.

I've been impressed by Intel before (Core 2), but they've had their share of total mistakes and idiotic screw-ups too (Itanic), but the things Intel tends to have done well are the things where they do incremental improvements. So it's a nice thing to be able to say that they can do new things very well too. And while I often tend to get early access to technology, seldom have I looked forward to it so much, and seldom have things lived up to my expectations so well.

In fact, I can't recall the last time that a new tech toy I got made such a dramatic difference in performance and just plain usability of a machine of mine.

So what's so special about that Intel SSD, you ask? Sure, it gets up to 250MB/s reads and 70MB/s writes, but fancy disk arrays can certainly do as well or better. Why am I not gushing about some nice NAS box? I didn't even put the thing into a laptop, after all, it's actually in Tove's Mac Mini (running Linux, in case anybody was confused ;), so a RAID NAS box would certainly have been a lot bigger and probably have more features.

But no, forget about the throughput figures. Others can match - or at least come close - to the throughput, but what that Intel SSD does so well is random reads and writes. You can do small random accesses to it and still get great performance, and quite frankly, that's the whole point of not having some stupid mechanical latencies as far as I'm concerned.

And the sad part is that other SSD's generally absolutely suck when it comes to especially random write performance. And small random writes is what you get when you update various filesystem meta-data on any normal filesystem, so it really does matter. For example, a vendor who shall remain nameless has an SSD disk out there that they were also hawking at the Kernel Summit, and while they get fine throughput (something like 50+MB/s on big contiguous writes), they benchmark a pitiful 10 (yes, that's ten, as in "how many fingers do you have") small random writes per second. That is slower than a rotational disk.

In contrast, the Intel SSD does about 8,500 4kB random writes per second. Yeah, that's over eight thousand IOps on random write accesses with a relevant block size, rather than some silly and unrealistic contiguous write test. That's what I call solid-state media.
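To set those IOPS figures next to the throughput numbers, a rough conversion helps. The sketch below is illustrative only: the function name is made up, and it assumes that 4kB means 4096 bytes and that 1 MB is 10**6 bytes.

```python
def random_write_throughput_mb_s(iops, block_bytes=4096):
    """Convert small random writes per second into MB/s (1 MB = 10**6 bytes)."""
    return iops * block_bytes / 1_000_000

# IOPS figures quoted in the post above.
intel = random_write_throughput_mb_s(8500)
other = random_write_throughput_mb_s(10)

print(f"Intel SSD:    {intel:.1f} MB/s of 4kB random writes")
print(f"Other vendor: {other:.3f} MB/s of 4kB random writes")
print(f"Ratio:        {intel / other:.0f}x")
```

Seen this way, a drive that advertises 50MB/s on big contiguous writes can still deliver only a few tens of kilobytes per second when the writes are small and random.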

The whole thing just rocks. Everything performs well. You can put that disk in a machine, and suddenly you almost don't even need to care whether things were in your page cache or not. Firefox starts up pretty much as snappily in the cold-cache case as it does hot-cache. You can do package installation and big untars, and you don't even notice it, because your desktop doesn't get laggy or anything.

So here's the deal: right now, don't buy any other SSD than the Intel ones, because as far as I can tell, all the other ones are pretty much inferior to the much cheaper traditional disks, unless you never do any writes at all (and turn off 'atime', for that matter).

So people - ignore the manufacturer write throughput numbers. They don't mean squat. The fact that you may be able to push 50MB/s to the SSD is meaningless if that can only happen when you do big, aligned, writes.

If anybody knows of any reasonable SSDs that work as well as Intel's, let me know.

[Jul 08, 2008] dim_STAT 8.2  by Dimitri

About: dim_STAT is a performance analysis and monitoring tool for Solaris and Linux (as well as all other UNIX) systems. Its main features are a Web-based interface, data storage in a SQL database, several data views, interactive (Java) or static (PNG) graphs, real-time monitoring, multi-host monitoring, post analyzing, statistics integration, professional reporting with automated features, and more.

Changes: A major performance update.

[Apr 09, 2008] Linux.com Inspecting disk IO performance with fio By Ben Martin

Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. The potential for your hard drive to be your system's performance bottleneck makes knowing how fast your disks and filesystems are and getting quantitative measurements on any improvements you can make to the disk subsystem important. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration. To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.

The hdparm tool isn't the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.

fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.

fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile from source with the make; sudo make install combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks can be issued in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files with an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don't specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation is shown below.

			$ cat random-read-test.fio
			; random read of 128mb of data

			[random-read]
			rw=randread
			size=128m
			directory=/tmp/fio-testing/data

			$ fio random-read-test.fio
			random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
			Starting 1 process
			random-read: Laying out IO file(s) (1 file(s) / 128MiB)
			Jobs: 1 (f=1): [r] [100.0% done] [ 3588/ 0 kb/s] [eta 00m:00s]
			random-read: (groupid=0, jobs=1): err= 0: pid=30598
			  read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec
			    clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02
			    bw (KiB/s) : min= 227, max= 5275, per=100.12%, avg=865.00, stdev=362.99
			  cpu : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=32768/0, short=0/0
			     lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05%
			     lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94%
			     lat (msec): 100=0.37%, 250=0.01%

			Run status group 0 (all jobs):
			   READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec

			Disk stats (read/write):
			  dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%
		

fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.

The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and it being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The CPU line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds between when the IO request was issued and when the result of that request was reported. The latencies are reported as intervals, so the 4=12.80%, 10=44.96% section reports that 44.96% of requests took more than 4 (the previous reported value) and up to 10 milliseconds to complete.
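Because the latency figures are per-interval percentages, it can be easier to read them as a running total. The following is a minimal sketch, not part of fio: the bucket values are copied from the lat lines of the first test, and the helper name is hypothetical.

```python
# Latency buckets from the first test's output: upper bound in
# microseconds -> percentage of requests that fell into that interval.
LAT_BUCKETS_USEC = {
    250: 34.92, 500: 0.36, 750: 0.02, 1000: 0.05,
    2_000: 0.41, 4_000: 12.80, 10_000: 44.96, 20_000: 5.16,
    50_000: 0.94, 100_000: 0.37, 250_000: 0.01,
}

def cumulative_percent(buckets):
    """Return, for each upper bound, the share of requests at or below it."""
    running = 0.0
    result = {}
    for bound in sorted(buckets):
        running += buckets[bound]
        result[bound] = round(running, 2)
    return result

cum = cumulative_percent(LAT_BUCKETS_USEC)
for bound, pct in cum.items():
    print(f"<= {bound / 1000:g} ms: {pct:6.2f}%")
```

Read cumulatively, the two large groups stand out: roughly a third of the requests finished within 250 microseconds, and most of the rest needed between 2 and 10 milliseconds.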

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes to all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of the output shows, the single thread has an 864KiB/s aggregate bandwidth (aggrb), which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is not friendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should be getting much closer to the speeds that hdparm reports (typically around 40-60MB/s).
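The headline figures can be cross-checked from the totals in the same output. This is a back-of-the-envelope sketch with a made-up helper name; it assumes the 4K block size shown in the job banner means 4096 bytes.

```python
def fio_rates(total_ios, block_bytes, runtime_msec):
    """Return (iops, bytes_per_second) for a completed job."""
    seconds = runtime_msec / 1000
    return total_ios / seconds, total_ios * block_bytes / seconds

# total=32768 reads, bs=4K, runt=155282msec, all from the first test.
iops, bytes_per_sec = fio_rates(total_ios=32768, block_bytes=4096, runtime_msec=155282)

print(f"iops      = {iops:.0f}")
print(f"bandwidth = {bytes_per_sec / 1000:.0f} thousand bytes/s")
print(f"          = {bytes_per_sec / 1024:.0f} KiB/s when dividing by 1024")
```

The request rate reproduces the reported iops=211. The reported bandwidth of 864 matches the result in thousands of bytes per second; dividing by 1024 instead gives a slightly lower number, so treat the unit label in the output with some care when comparing against other tools.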

I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode with the possibility, based on the iodepth parameter, of eight requests for asynchronous IO being issued and not fulfilled because the system had to wait for disk IO at any point in time. The choice of allowing up to only eight IO requests in the queue was arbitrary, but typically an application will limit the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not returned data to the application during the course of execution. The figures are reported for intervals from the previous figure; for example, the 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while, based on 4=4.0%, 4% of the time there were only three or four requests in the queue.

 
			$ cat random-read-test-aio.fio
			; same as random-read-test.fio
			; ...
			ioengine=libaio
			iodepth=8
			direct=1
			invalidate=1

			$ fio random-read-test-aio.fio
			random-read: (groupid=0, jobs=1): err= 0: pid=31318
			  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
			    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
			    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
			    bw (KiB/s) : min= 0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
			...
			  IO depths : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			...
			Run status group 0 (all jobs):
			   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec
		 

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.
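One way to sanity-check the two runs against each other is Little's law, which says that the average number of requests in flight equals the request rate multiplied by the average time each request takes. The sketch below applies it to the iops and average clat values reported above; the function name is hypothetical, and the law is brought in here as a cross-check rather than something the fio output states.

```python
def outstanding_requests(iops, avg_latency_usec):
    """Little's law: requests in flight = request rate x average latency."""
    return iops * avg_latency_usec / 1_000_000

# iops and average completion latency (clat) from the two tests.
sync_depth = outstanding_requests(iops=211, avg_latency_usec=4736.28)
aio_depth = outstanding_requests(iops=574, avg_latency_usec=13901.91)

# Reported bandwidth of the async run relative to the sync run.
speedup = 2352 / 864

print(f"sync run:  about {sync_depth:.2f} request(s) in flight")
print(f"async run: about {aio_depth:.2f} request(s) in flight")
print(f"bandwidth ratio: {speedup:.2f}x")
```

The results line up with iodepth=1 and iodepth=8 respectively. They also show the trade-off: each individual request waited about three times longer in the async run, yet the disk completed nearly three times as many requests per second because the queue was kept full.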

The HOWTO file from the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and/or writes in many combinations. The ioengine parameter selects how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache to be invalidated for a file before beginning the benchmark. The runtime parameter specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on data that is being read from disk. fsync=n can be used to issue a sync call after every n writes issued. write_iolog and read_iolog cause fio to write or read a log of all the IO requests issued. With these commands you can capture a log of the exact IO commands issued, edit that log to give exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.
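When you want to run the same benchmark with several option combinations, it can be convenient to generate the job files rather than edit them by hand. The helper below is a hypothetical sketch, not something shipped with fio; it simply writes one section in the ini-style layout used by the job files in this article, and the example reproduces the first random-read job.

```python
def render_fio_job(name, options, comment=""):
    """Render a single fio job section as ini-style text."""
    lines = []
    if comment:
        lines.append(f"; {comment}")
    lines.append(f"[{name}]")
    for key, value in options.items():
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"

job = render_fio_job(
    "random-read",
    {
        "rw": "randread",
        "size": "128m",
        "directory": "/tmp/fio-testing/data",
    },
    comment="random read of 128mb of data",
)
print(job, end="")
```

Further options described above, such as ioengine, invalidate, runtime, thinktime, or fsync, would be added as extra entries in the dictionary.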

Simulating servers

You can also specify multiple threads or processes to all submit IO work at the same time to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing their own IO loads to the system, all running at the same time. I've based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread is to simulate writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.

 
			$ cat four-threads-randio.fio
			; Four threads, two query, two writers.

			[global]
			rw=randread
			size=256m
			directory=/tmp/fio-testing/data
			ioengine=libaio
			iodepth=4
			invalidate=1
			direct=1

			[bgwriter]
			rw=randwrite
			iodepth=32

			[queryA]
			iodepth=1
			ioengine=mmap
			direct=0
			thinktime=3

			[queryB]
			iodepth=1
			ioengine=mmap
			direct=0
			thinktime=5

			[bgupdater]
			rw=randrw
			iodepth=16
			thinktime=40
			size=32m

			$ fio four-threads-randio.fio
			bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
			queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
			queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
			bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
			Starting 4 processes

			bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
			  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
			    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
			    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
			    bw (KiB/s) : min= 0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
			  cpu : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
			  IO depths : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
			     issued r/w: total=0/65536, short=0/0
			     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
			     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
			     lat (msec): 100=4.74%, 250=1.86%
			queryA: (groupid=0, jobs=1): err= 0: pid=3242
			  read : io=256MiB, bw=589MiB/s, iops=147K, runt= 445msec
			    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
			  cpu : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=65536/0, short=0/0
			     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
			     lat (usec): 250=0.01%
			queryB: (groupid=0, jobs=1): err= 0: pid=3243
			  read : io=256MiB, bw=455MiB/s, iops=114K, runt= 576msec
			    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
			    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
			  cpu : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=65536/0, short=0/0
			     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
			     lat (usec): 250=0.01%, 500=0.01%
			bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
			  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
			    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
			    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
			    bw (KiB/s) : min= 0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
			  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
			    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
			    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
			    bw (KiB/s) : min= 0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
			  cpu : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
			  IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
			     issued r/w: total=4087/4105, short=0/0
			     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
			     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
			     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
			     lat (msec): 100=19.07%, 250=4.72%

			Run status group 0 (all jobs):
			   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
			  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec

			Disk stats (read/write):
			  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%
		 

As one would expect, the bandwidth the array achieved in the query and writer processes was vastly different. Queries are performed at about 500MB/s, while writing comes in at about 1MB/s or 7.5MB/s depending on whether it is read/write or purely write performance respectively. The IO depths show the number of pending IO requests that are queued when an IO request is issued. For example, for the bgupdater process, nearly 1/4 of the async IO requests are being fulfilled with eight or fewer requests in the queue of a potential 16. In contrast, the bgwriter has more than half of its requests performed with 16 or fewer pending requests in the queue.

To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.

			bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
			  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
			queryA: (groupid=0, jobs=1): err= 0: pid=14964
			  read : io=256MiB, bw=160MiB/s, iops=39,888, runt= 1643msec
			queryB: (groupid=0, jobs=1): err= 0: pid=14965
			  read : io=256MiB, bw=163MiB/s, iops=40,680, runt= 1611msec
			bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
			  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
			  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec
			   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
			  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec
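
The comparison between the two runs can be made explicit by dividing the per-job bandwidth figures. This is a small sketch using only the bw values reported in the two listings, normalised to KiB/s; the dictionary and function names are made up for the example.

```python
# Per-job bandwidth in KiB/s, copied from the two runs (MiB/s values x 1024).
RAID5 = {
    "bgwriter": 7480,
    "queryA": 589 * 1024,
    "queryB": 455 * 1024,
    "bgupdater read": 1014,
    "bgupdater write": 1018,
}
SINGLE_DISK = {
    "bgwriter": 6545,
    "queryA": 160 * 1024,
    "queryB": 163 * 1024,
    "bgupdater read": 422,
    "bgupdater write": 420,
}

def relative_bandwidth(candidate, baseline):
    """Return candidate/baseline bandwidth for every job in the baseline."""
    return {job: candidate[job] / baseline[job] for job in baseline}

ratios = relative_bandwidth(SINGLE_DISK, RAID5)
for job, ratio in ratios.items():
    print(f"{job:16s} {ratio:.2f} of the RAID-5 bandwidth")
```

The ratios bear out the summary: the query jobs land at roughly a third of the RAID-5 figures and bgupdater at a little over 40%, while the pure random writer loses comparatively little. As noted, the two machines differ in more than their disks, so the ratios are indicative at best.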
		 

The vast array of ways that fio can issue its IO requests lends itself to benchmarking IO patterns and the use of various APIs to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level will make to performance.

Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.

Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.

[Apr 27, 2008] sysprof 1.0.10   by Søren Sandmann

About: Sysprof is a sampling CPU profiler that uses a Linux kernel module to profile the entire system, not just a single application. It handles shared libraries, and applications do not need to be recompiled. It profiles all running processes, not just a single application, has a nice graphical interface, shows the time spent in each branch of the call tree, can load and save profiles, and is easy to use.

Changes: Compiles with 2.6.25 and later.

Observability (December 1999) by Adrian Cockcroft

Discusses Capacity Planning and Performance Management techniques.

Processing Accounting Data into Workloads (October 1999) by Adrian Cockcroft

Covers Solaris operating system accounting, including code examples that extract the data in a usable format and pattern-match it into workloads.

Scenario Planning - Part 1 (February 2000) by Adrian Cockcroft

Discusses scenario planning techniques to help predict latent demand during overload periods. Part 1 explains how to simplify your model down to a single bottleneck.

Scenario Planning - Part 2 (March 2000) by Adrian Cockcroft

Presents part two of the Scenario Planning article and explains how to follow up with a simple planning methodology based on a spreadsheet that is used to break down the problem and experiment with alternative future scenarios.

Static Performance Tuning (May 2000) by Richard Elling

Richard discusses a class of problems that can affect system performance which is not dynamic by nature, and cannot be detected by conventional dynamic tuning tools.

System Performance Management: Moving from Chaos to Value (July 2001) by Jon Hill and Kemer Thomson

This article presents the rationale for formal system performance management from a management, systems administrative and vendor perspective. It describes four classes of systems monitoring tools and their uses. The article discusses the issues of tool integration, "best-of-breed versus integrated suite" and the decision to "buy versus build."

Troubleshooting Tips

From the SGI Admin Guide - last I checked the CPU spends most of its time waiting for something to do  
Table 5-3: Indications of an I/O-Bound System

Field                                                       Value             sar Option
%busy (% time disk is busy)                                 >85               sar -d
%rcache (reads in buffer cache)                             low, <85          sar -b
%wcache (writes in buffer cache)                            low, <60%         sar -b
%wio (idle CPU waiting for disk I/O)                        dev. system >30   sar -u
                                                            fileserver >80

Table 5-5: Indications of Excessive Swapping/Paging

Field                                                       Value             sar Option
bswot/s (transfers from memory to disk swap area)           >200              sar -w
bswin/s (transfers to memory)                               >200              sar -w
%swpocc (time swap queue is occupied)                       >10               sar -q
rflt/s (page reference fault)                               >0                sar -t
freemem (average pages for user processes)                  <100              sar -r

Indications of a CPU-Bound System

Field                                                       Value             sar Option
%idle (% of time CPU has no work to do)                     <5                sar -u
runq-sz (processes in memory waiting for CPU)               >2                sar -q
%runocc (% run queue occupied and processes not executing)  >90               sar -q
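
The thresholds in these tables are easy to turn into a quick screening script. The sketch below is only an illustration: it expects the sar values to have been collected already into a dictionary keyed by field name (parsing sar output is not shown), the function and variable names are made up, and the sample values at the end are invented.

```python
import operator

# (field, comparison, threshold, sar option) taken from the tables above.
# For %wio the development-system threshold is used; a fileserver would use 80.
RULES = {
    "I/O-bound": [
        ("%busy", operator.gt, 85, "sar -d"),
        ("%rcache", operator.lt, 85, "sar -b"),
        ("%wcache", operator.lt, 60, "sar -b"),
        ("%wio", operator.gt, 30, "sar -u"),
    ],
    "swapping/paging": [
        ("bswot/s", operator.gt, 200, "sar -w"),
        ("bswin/s", operator.gt, 200, "sar -w"),
        ("%swpocc", operator.gt, 10, "sar -q"),
        ("rflt/s", operator.gt, 0, "sar -t"),
        ("freemem", operator.lt, 100, "sar -r"),
    ],
    "CPU-bound": [
        ("%idle", operator.lt, 5, "sar -u"),
        ("runq-sz", operator.gt, 2, "sar -q"),
        ("%runocc", operator.gt, 90, "sar -q"),
    ],
}

def diagnose(sample):
    """Return {category: [fields past their threshold]} for the fields present."""
    findings = {}
    for category, rules in RULES.items():
        hits = [field for field, compare, limit, _option in rules
                if field in sample and compare(sample[field], limit)]
        if hits:
            findings[category] = hits
    return findings

# Invented sample values, for illustration only.
sample = {"%busy": 92, "%rcache": 70, "%wio": 45, "%idle": 40, "runq-sz": 1}
print(diagnose(sample))
```

A single field past its threshold is only a hint; the tables are meant to be read as a group of indications that point the same way.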

hypermail /usr/local/src/src/hypermail - mailing list to web page converter; grep hypermail /etc/aliases shows which lists use hypermail

pwck and grpck should be run weekly to make sure the password and group files are OK; grpck produces a ton of errors

can use local man pages - text only - see Ch3 User Services
put in /usr/local/manl (try /usr/man/local/manl) suffix .l
long ones pack -> pack program.1;mv program.1.z /usr/man/local/mannl/program.z

freshmeat.net Project details for sysstat

The sysstat package contains the sar, sadf, iostat, mpstat, and pidstat commands for Linux. The sar command collects and reports system activity information. The statistics reported by sar concern I/O transfer rates, paging activity, process-related activities, interrupts, network activity, memory and swap space utilization, CPU utilization, kernel activities, and TTY statistics, among others. The sadf command may be used to display data collected by sar in various formats. The iostat command reports CPU statistics and I/O statistics for tty devices and disks. The pidstat command reports statistics for Linux processes. The mpstat command reports global and per-processor statistics.

Release focus: Minor bugfixes

Changes:
mpstat and sar didn't parse /proc/interrupts correctly when some CPUs had been disabled. This is now fixed. This release also fixes a bug in pidstat which caused confusion between PID and TID, resulting in erroneous statistics values being displayed. The iconfig script has been updated: Help for the --enable-compress-manpg parameter is now available, help for the --enable-install-cron parameter has been updated, and the parameter cron_interval has been added.

ZDNet eWEEK Sprint puts backbone flow under surveillance

Aiming to provide increasingly higher-quality IP and Internet services at lower prices, Sprint Corp. has begun its most comprehensive study to date of traffic behavior on its Internet backbone.

After a year of developing its own test equipment, the carrier late last month began collecting data at its San Jose, Calif., Internet POP (point of presence), the first of many sites slated for testing.

Sprint plans to use the data from the testing, called the Internet Measurement Study, to ensure that its network can handle ever-increasing customer traffic volume and to discover which network monitoring tools will be needed in future network equipment.

"Very little is known about the detailed behavior of Internet backbones," said Bryan Lyles, chief scientist at Sprint, in Kansas City, Mo. "Very fine-grained studies are what we need to make rational decisions on the equipment that goes into the network -- even the standards that go into it."

Sprint hopes the multimillion-dollar, multiyear study will enable it to keep its equipment costs as low as possible and ensure that its network delivers optimal performance.

"The goal is to make sure we make the best use of capital and the other resources we put into the network and to keep our customers happy," Lyles said.

Performance, performance, performance

As the Internet's importance to a company's bottom line increases, users expect ISPs (Internet service providers) or other data carriers to meet increasingly stringent service performance goals.

At Quebecor Printing (USA) Inc., which is installing an IP-based VPN (virtual private network) at its many locations, "class of service will include bandwidth allocation and prioritization for certain applications," said Terry Bush, vice president of data communications, in Greenwich, Conn.

At its bigger printing facilities, the company is installing multiple 1.5M-bps circuits to handle growth in its data traffic because IP bandwidth is more efficient and flexible in a VPN than in more conventional network designs, Bush said. Nevertheless, Quebecor demands service levels that rival private network solutions and has a service-level agreement that specifies zero packet loss and a round-trip, coast-to-coast network delay of less than 75 milliseconds, Bush said.

Sprint isn't alone among carriers and ISPs in its quest to improve Internet service. For example, "2001 will probably be the last year that we will buy narrowband switches," said Fred Briggs, chief technical officer at WorldCom Inc., in Clinton, Miss.

Solaris Developer Connection

Chat Title: Solaris Utilities for Monitoring System Performance
Guest Speakers: James Liu and Karpagam Narayanan

This is a moderated forum

LizA: Welcome to the Solaris Live Chat, "Solaris Utilities for Monitoring System Performance" with James Liu and Karpagam Narayanan. James was our first Solaris Live! guest and we're very happy to have him back. James is ready to answer your questions on software development and benchmark formation strategies and configuration, scaling analysis, processor management, thread libraries, and so on. He is joined by Karpagam Narayanan, who has lots of experience with all the standard tools like Virtual Adrian (aka SE Toolkit), disk partitioning, network bandwidth trunking, and other things that get your app to run faster on Solaris[tm]. Karpagam and James, let's say that I'm new to Solaris and I want to know what CPU a process takes. Is there a command that shows me this?

jamesliu: I'll take this one. A number of commands can show this. You can use prstat which is bundled with Solaris 8 and is probably easiest. If you have the freeware top... you can use this too.

LizA: What does NLWP mean in prstat?

karpagam: NLWP refers to the number of lightweight processes (LWPs) associated with the process.

LizA: How does someone find out which processors are online or off line?

jamesliu: You can find out using the psrinfo command. The -v option gives you a lot of info on the processors.

LizA: I need to increase the file descriptors on my server...I bumped up the ulimit but it still doesn't work. What else do I need to do?

karpagam: Increase the rlim_fd_max and rlim_fd_cur parameters in /etc/system. Remember that these take effect after you reboot.
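After the reboot it is worth confirming what limits a process actually sees. The following is a minimal sketch, not part of the chat, that uses Python's standard resource module to print the soft and hard open-file limits of the current process. The helper names nofile_limits and fmt are made up for this illustration; the soft limit corresponds to rlim_fd_cur and the hard limit to rlim_fd_max.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fmt(limit):
    """Render a limit, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is what ulimit -n reports; a process may raise it
    # itself, but only up to the hard limit.
    print(f"soft limit (compare rlim_fd_cur): {fmt(soft)}")
    print(f"hard limit (compare rlim_fd_max): {fmt(hard)}")
```

If the soft limit is still low after the reboot, the shell or startup script that launches the server may be lowering it again.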

jamesliu: LizA, you can also gain some efficiencies if your problem is related to using network file descriptors (i.e. sockets). You can tune the TCP/IP parameters using the ndd /dev/tcp command to shorten the tcp_time_wait_interval.

tefluid: I'm interested in optimizing application servers in order to run Java[tm] engines such as BEA WebLogic and ATG Dynamo. What advice can you give on profiling the system to best determine where the bottlenecks lie?

karpagam: This is a Java on Solaris question. Java has a profiling tool called hprof that can be included in the command line. Type -Xrunhprof:help for more info on this. The output gives you methods that take more CPU time...

karpagam: tefluid, There is a HAT (Heap Analysis Tool) also available. There are also 3rd party GUI tools available. Optimizeit and JProbe are two of them.

LizA: I heard that in Solaris you can allocate certain processors to work on only one process. Will that help, too?

jamesliu: LizA, you can in fact specify certain processors to a specific process. The command to use is psrset. For folks like Tefluid, binding the JVM PID to a processor set and excluding interrupts can possibly give a boost in performance.

Craki: I have a farm of Sybase database boxes all on Solaris 8. Where can I start in making sure that everything that can be optimized is, for database operations?

karpagam: Craki, I would always start with the db monitoring tools. Once you are sure that you do not have any issues go through the system parameters...

karpagam: Craki, Start by looking into shared memory, semaphores and message queue parameters first in the /etc/system. Then look into disk, network, NFS, swapping/paging, memory, CPU, filesystem, and TCP, one at a time...

karpagam: Craki, do look in http://www.sun.com/sun-on-net/performance/perftools-solaris8.pdf for more info on Solaris tools

Zartaj: I am interested in performance comparisons between Sun Solaris and Wintel. The problem is it is not easy to decide what is the right pair to compare. I have a UE250 450MHz with Solaris 8 and a P3 733 MHz with Windows 2000. I have seen the Wintel box consistently outperform the UE250. But is that a fair comparison? In general, if I have a Sun system, how do I determine what is the equivalent Wintel system to compare? Going by price alone, Wintel seems to have the edge.

jamesliu: Zartaj, it is often a race for more MIPS/MFLOPS, etc. in the hardware area. I don't know which benchmarks you run, but in those apps that are important to Sun's customers, Sun consistently tunes our applications to outscale and outperform anything on the market. It all depends on the use. In your particular case, it may in fact be that Wintel has better price performance. For many of Sun's core customers, our value proposition is reliability, availability and scalability. We've competed well on this philosophy for about 18 years and I predict we'll continue. As for your particulars, perhaps we can communicate offline and discuss how to improve your performance.

alexc: We use some scripts to automate gathering info from ps. We also use sar. We notice that total CPU utilization (by adding up ps info) is usually quite a bit less than what is stated by sar. Why is there a discrepancy?

karpagam: Alexc, I am not sure what ps you are referring to - /usr/ucb/ps? In what version of Solaris? I do not know the time interval that ps uses for data gathering. If you are in Solaris 8, try using prstat. There are a lot of parameters that can come into play here - interval, versions, options for the tools, etc...

LizA: What do I need in order to look at mpstat? What do the columns mutexes and context switching mean?

karpagam: LizA, mutexes occur when a lot of CPUs are trying to grab the same resource lock. Only one CPU will be successful at any time. We do not want this to happen a lot...

jamesliu: LizA, context switching is also something that, done too often, expends resources... What you want to do is to limit these values to certain levels. smtx, for example is best below 500 per CPU per second. Context switches ... you can check at http://www.setoolkit.com.
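The 500-per-CPU guideline for smtx is easy to apply mechanically once the mpstat numbers have been collected. The sketch below is an illustration added here, not something from the chat: it assumes the per-CPU smtx rates have already been read from an interval line of mpstat into a dictionary, and the function name and sample values are invented.

```python
SMTX_LIMIT = 500  # per CPU per second, the guideline quoted in the chat


def cpus_over_smtx_limit(smtx_by_cpu, limit=SMTX_LIMIT):
    """Return the sorted CPU ids whose smtx rate exceeds the limit."""
    return sorted(cpu for cpu, smtx in smtx_by_cpu.items() if smtx > limit)


# Made-up sample: CPU id -> smtx value taken from one mpstat interval.
sample = {0: 120, 1: 640, 2: 505, 3: 80}
print(cpus_over_smtx_limit(sample))  # [1, 2]
```

A CPU that shows up here in one interval is not necessarily a problem; what matters is whether it stays above the limit across many intervals.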

Zartaj: I'd like to know what tools are available for shared library profiling? Shared libraries cannot be instrumented for prof or gprof. And the LD_PROFILE variable can be used only for one shared library at a time. So how do I go about profiling all shared libraries being used by an app?

karpagam: Zartaj, You can try using truss and sotruss. truss gives shared library activity and entry/exit trace of user-level function calls. sotruss is good and has less noise than truss...

dmdebertin: Are there any particular columns in vmstat (or other command) output that could indicate hardware or software problems? What are some things to look for that could indicate problems, and what is harmless?

jamesliu: DMDebertin, if your CPU percentage is high but system usage is low, most of the CPU is consumed by your app. You may want to think about tuning your code in this case. If system time is high, check out more with mpstat and look at context switch and smtx values.

Emory2: Could you please compare the performance of a 24 CPU SunFire 6800 to the performance of a 24 CPU IBM S80 (configured with the same amount of RAM)?

karpagam: Emory2, For what workload? You can consider looking into the TPC-C, TPC-D, or SPEC standard benchmark pages that match your workload.

LizA: How do I monitor the network?

karpagam: LizA, the primary tool you can use is netstat. There are options like -in for cumulative data, -s for TCP/UDP stats, -I for specific interface. I like to put in netstat -in in a while loop...
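Because netstat -in reports cumulative counters, running it in a loop is only useful if consecutive samples are subtracted from each other. The following sketch shows that arithmetic under the assumption that two snapshots have already been parsed into dictionaries; the function name, the key names and the numbers are illustrative, not taken from the chat.

```python
def packet_rates(prev, curr, interval_s):
    """Convert two cumulative counter snapshots into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Only counters present in both snapshots can be differenced.
    return {
        name: (curr[name] - prev[name]) / interval_s
        for name in curr
        if name in prev
    }


# Made-up snapshots taken 5 seconds apart for one interface.
before = {"ipkts": 1000, "opkts": 800}
after = {"ipkts": 1600, "opkts": 1100}
print(packet_rates(before, after, 5))  # {'ipkts': 120.0, 'opkts': 60.0}
```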

jamesliu: LizA, Sun also provides some scripts for tuning your network drivers. http://www.sun.com has these scripts. Search for "network tuning" or "syn flood" and you should see some docs on how to tune your network interface.

karpagam: LizA, netstat -a gives a lot more information on the sockets/ports open. Look for ESTABLISHED and TIME_WAIT.

LizA: netstat -a tells me that I have over 8000 connections. But I have only 3000 sessions open. They have a time_wait status on more than half of them. Is that something to do with my application?

jamesliu: LizA, Regarding netstat output, you'll probably have lots of network sessions still waiting to close. The default setting on Solaris is 240 seconds. You can use ndd /dev/tcp to set the tcp_time_wait_interval to a lower value so that these connections close down more quickly. Say 30 seconds is good. Be careful not to set this too low as slow connections (e.g. modems) might get dropped.
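To see how many of the connections reported by netstat -a are really just waiting to close, the output can be tallied by state. This is a rough sketch added for illustration: it assumes the state is the last whitespace-separated field on each connection line, and the sample text, the host names and the set of states checked are all made up for the example.

```python
from collections import Counter

# States of interest; an illustrative subset, not a complete list.
TCP_STATES = {"ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT", "LISTEN"}


def count_states(netstat_text):
    """Count connection lines by the TCP state in their last field."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if fields and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts


# Hypothetical excerpt in the general shape of netstat -a output.
sample = """\
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
web1.80              10.0.0.5.40112    8760      0 24820      0 TIME_WAIT
web1.80              10.0.0.6.40977    8760      0 24820      0 ESTABLISHED
web1.80              10.0.0.7.41203    8760      0 24820      0 TIME_WAIT
"""
counts = count_states(sample)
print(counts["TIME_WAIT"], counts["ESTABLISHED"])  # 2 1
```

In LizA's case a tally like this would separate the roughly 3000 live sessions from the connections that are only lingering in TIME_WAIT.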

Zartaj: I believe a 32-bit process can only use around 3GB out of a possible 4GB. So is it useful to have more than 4GB physical memory on a system that allows it?

karpagam: Zartaj, What you need to look into is how much your application uses/needs. Are you running 64-bit Oracle and need more than 4GB SGA? Use pmap to tell you the process footprint and calculate on that basis.

Zartaj: In the Solaris Multithreading Guide, it recommends against thread-pooling saying it is cheaper to create threads as needed. Do you agree with that?

jamesliu: Zartaj, in general I would agree that threads are relatively cheaper to create than to pool. Pooling creates many potential opportunities for contention. However, in some cases, such as Java, the threading model may be more amenable to pooling since there is a Java layer there.

jd: The way I understand load average to be calculated, it is incremented by 1 for every CPU's worth of time spent. (Ex. a 10 CPU system with 10% user time as shown by vmstat will report a load avg. of 1). High system time (as shown in vmstat) causes load to jump very high in some cases; I have seen load avg. of 30 on a 10 CPU system with 40% system time/10% user time. I would like to know how the system comes up with that load avg.

jamesliu: jd, I couldn't tell you exactly how the algorithm works. It's been a while since I've touched on it. Karpagam?

karpagam: jd, A high system time of that ratio clearly shows that there is a bottleneck. Did you check to see how your disks are doing? You also might want to see in mpstat/top/prstat/statit what the utilization per processor is.

Craki: I find that whenever a box has fairly high uptime, reported memory usage is higher than it should be. My DBAs see this and start getting worried about the boxes not being big enough. Is this a Solaris behavioral quirk?

jamesliu: Craki, I can't be certain, but our experience shows that in uptimes of 60+ days, the memory footprint remains stable on many of our servers. The most common area of memory growth over time we've seen has perhaps been in memory leaks on the application or windowing side. Many windowing apps or servers or windows managers do in fact leak lots of memory. This may be the cause of growth over time.

jd: I am not asking about a problem in particular, I have just seen the load avg. jump like that and am curious as to how it's calculated.

karpagam: jd, Did you see this on Solaris 8?

Emory2: Does anyone know if there is a working version of "proctool" for Solaris 8? One version that we tested did not work for multiprocessors.

karpagam: Emory2, you can use /usr/proc/bin proc tools - right? pmap, ptree, ptime, pldd, etc...

jd: I have seen it on 2.6 and 8; the most recent was on 8 where a Java programmer had an app. that went crazy with creating/deleting threads.

jamesliu: jd, I guess you're still asking about how the load average is computed. Again, I can't tell you off hand since it's been a while since I've touched the algorithms. But I can imagine that any process that creates/destroys lots of threads is a contrived and somewhat unique situation. Perhaps we can work offline to discuss optimization and development techniques to reduce the CPU utilization.
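The chat never answers jd's question. In the classic Unix formulation, which is given here as background and not as a description of the exact Solaris 8 code, the load average is an exponentially damped average of the number of runnable threads, sampled at a fixed interval. It tracks queue length, not CPU time consumed, so an application that keeps 30 threads runnable can push the load to 30 on a 10 CPU system. The sketch below illustrates that formulation; the 5-second sample interval and 60-second window are assumptions chosen for the example.

```python
import math


def update_load(load, runnable, sample_s=5.0, window_s=60.0):
    """One step of an exponentially damped average of the run-queue length."""
    decay = math.exp(-sample_s / window_s)
    return load * decay + runnable * (1.0 - decay)


# Hypothetical scenario: 30 threads stay runnable for a long time.
load = 0.0
for _ in range(1000):
    load = update_load(load, runnable=30)
print(round(load, 2))  # 30.0
```

Note that the number of CPUs does not appear anywhere in the calculation, which is why the result has to be compared against the CPU count by the reader.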

LizA: Are there any special libraries I can use to improve performance?

jamesliu: LizA, there are a number of libraries that might boost performance. Some are in Solaris 8, some are third party. If you have a thread-intensive application and have high smtx values due to schedlock, you may want to put /usr/lib/lwp, which holds an alternate thread library, at the top of your LD_LIBRARY_PATH. If your app. is memory allocation intensive, there are 3 ISV solutions that replace the bundled malloc on Solaris that improve performance.

alexc: question about threading, etc., ... the way I understand it, some programmers use multiple processes to do threading (spawning child processes) and some use threads within a single process. Clearly, multiple processes can run on multiple processes simultaneously. However, can threads within a single process run on more than one processor simultaneously?

alexc: Rather, multiple processes can use multiple processORs, but can threads within a single process do the same?

jamesliu: Alexc, absolutely. Threads do run on multiple processors on Solaris. As do multiple processes with multiple threads. Solaris supports scheduling that allows a many-to-many relationship between threads or processes and processORs.

Craki: Can you recommend a centralized monitoring/management package? I've done a small deployment of Sun[tm] Management Center and liked it. Would Big Brother be a good solution as well?

karpagam: Sun Management Center is very good. If you want to monitor database statistics also, I know that a lot of folks use Foglight from Quest Software. I do not know about Big Brother - sorry.

LizA: We're about out of time. Thanks to Karpagam and James...and all of you who asked such great questions. Karpagam and James, do you have a few parting words?

jamesliu: It has again been a pleasure. I'd be pleased to field questions in this forum again soon. -JCL

jamesliu: Note to all, if you're running vmstat or mpstat, just make sure you put a time interval like 5 seconds and exclude the first entry in your computations. - jcl
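The reason for dropping the first entry is that the first line printed by vmstat or mpstat summarizes activity since boot, not the current interval. A minimal sketch of the computation, with an invented function name and made-up sample values:

```python
def mean_excluding_first(samples):
    """Average the interval samples, skipping the since-boot first entry."""
    usable = samples[1:]
    if not usable:
        raise ValueError("need at least two samples")
    return sum(usable) / len(usable)


# Made-up user-time percentages; the first value is the since-boot summary.
us_column = [12, 40, 50, 60]
print(mean_excluding_first(us_column))  # 50.0
```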

karpagam: Thanks everyone for all the wonderful questions. It has been a pleasure. Thanks LizA for running this forum smoothly :)

LizA: Be sure to join us again on June 21, at 10 a.m. PDT, when our guest is Rich Teer and the topic is "Secure C Programming."


Books

System Performance Tuning

Oracle and Unix Performance Tuning ~ Usually ships in 24 hours
Ahmed Alomari / Paperback / Published 1997
Amazon price: $35.96 ~ You Save: $8.99 (20%)
AIX Performance Tuning ~ Usually ships in 2-3 days
Frank Waters / Paperback / Published 1996
Amazon price: $63.00
Optimizing Unix for Performance ~ Usually ships in 24 hours
Amir H. Majidimehr / Paperback / Published 1995
Amazon price: $40.00
Solaris Performance Administration : Performance Measurement, Fine Tuning, and Capacity Planning for Releases 2.5.1 and 2.6 ~ Usually ships in 24 hours
H. Frank Cervone / Paperback / Published 1998
Amazon price: $35.96 ~ You Save: $8.99 (20%)
Sun Performance and Tuning : Java and the Internet ~ Usually ships in 24 hours
Adrian Cockcroft, et al / Paperback / Published 1998
Amazon price: $40.80 ~ You Save: $10.20 (20%)
System Performance Tuning (Nutshell Handbooks) ~ Usually ships in 2-3 days
Michael Kosta Loukides, Mike Loukides / Paperback / Published 1991
Amazon price: $23.96 ~ You Save: $5.99 (20%)
UNIX Performance Tuning; Sys Admin-Essential Reference Series ~ Usually ships in 2-3 days
Sys Admin Magazine(Editor) / Paperback / Published 1997
Amazon price: $23.96 ~ You Save: $5.99 (20%)
HP-UX Tuning and Performance : Concepts, Tools and Methods (Hewlett-Packard Professional Books)
Robert F. Sauers, Peter Weygant / Paperback / Published 1999
Amazon price: $45.00 (Not Yet Published -- On Order)
Sun Performance and Tuning : Sparc & Solaris
Adrian Cockcroft / Paperback / Published 1994
(Publisher Out Of Stock)
Taming UNIX : UNIX Performance Management Series
Robert A. Lund / Spiral-bound / Published 1997
Amazon price: $59.95 (Special Order)

Recommended Links


In case of broken links, please try to use Google search. If you find the page, please notify us about the new location.

Tutorials

Papers

Other Cockcroft columns at www.sun.com

Troubleshooting Tips



Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Site uses AdSense so you need to be aware of Google privacy policy. Copyright of original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: August 15, 2009