How to troubleshoot Linux performance bottlenecks

Here is a relevant quote from the IBM Redpaper Tuning Red Hat Enterprise Linux on IBM Eserver xSeries Servers (2005):

Identifying bottlenecks

The following steps are used as our quick tuning strategy:

  1. Know your system.

  2. Back up the system.

  3. Monitor and analyze the system’s performance.

  4. Narrow down the bottleneck and find its cause.

  5. Fix the bottleneck cause by trying only one single change at a time.

  6. Go back to step 3 until you are satisfied with the performance of the system.

4.1.1 Gathering information

Most likely, the only first-hand information you will have access to will be statements such as "There is a problem with the server." It is crucial to use probing questions to clarify and document the problem. Here is a list of questions you should ask to help you get a better picture of the system:

Can you give me a complete description of the server in question?

Can you tell me exactly what the problem is?

The fact that the problem can be reproduced enables you to see and understand it better. Document the sequence of actions that are necessary to reproduce the problem.

Tip: You should document each step, especially the changes you make and their effect on performance.

4.1.2 Analyzing the server’s performance

At this point, you should begin monitoring the server. The simplest way is to run monitoring tools from the server that is being analyzed. (See Chapter 2, “Monitoring tools” on page 15, for information.)

A performance log of the server should be created during its peak time of operation (for example, 9:00 a.m. to 5:00 p.m.); the right window depends on what services are being provided and on who is using these services. When creating the log, include the relevant monitoring objects if they are available.

Before you begin, remember that a methodical approach to performance tuning is important.

Our recommended process for tuning your xSeries server is as follows:

1. Understand the factors affecting server performance. This Redpaper and the redbook Tuning IBM Eserver xSeries Servers for Performance, SG24-5287 can help.

2. Measure the current performance to create a performance baseline to compare with your future measurements and to identify system bottlenecks.

3. Use the monitoring tools to identify a performance bottleneck. By following the instructions in the next sections, you should be able to narrow down the bottleneck to the subsystem level.

4. Work with the component that is causing the bottleneck by performing some actions to improve server performance in response to demands.

5. Measure the new performance. This helps you compare performance before and after the tuning steps.

When attempting to fix a performance problem, remember the following:

Take measurements before you upgrade or modify anything so that you can tell whether the change had any effect. (That is, take baseline measurements.)

Examine the options that involve reconfiguring existing hardware, not just those that involve adding new hardware.

Important: Before taking any troubleshooting actions, back up all data and the configuration information to prevent a partial or complete loss.

Note: It is important to understand that the greatest gains are obtained by upgrading a component that has a bottleneck when the other components in the server have ample “power” left to sustain an elevated level of performance.

4.2 CPU bottlenecks

For servers whose primary role is that of an application or database server, the CPU is a critical resource and can often be a source of performance bottlenecks. It is important to note that high CPU utilization does not always mean that a CPU is busy doing work; it may, in fact, be waiting on another subsystem. When performing proper analysis, it is very important that you look at the system as a whole and at all subsystems because there may be a cascade effect within the subsystems.

4.2.1 Finding CPU bottlenecks

Determining bottlenecks with the CPU can be accomplished in several ways. As discussed in Chapter 2, "Monitoring tools" on page 15, Linux has a variety of tools to help determine this; the question is: which tools to use?

One such tool is uptime. By analyzing the output from uptime, we can get a rough idea of what has been happening in the system for the past 15 minutes. For a more detailed explanation of this tool, see 2.2, “uptime” on page 16.

Example 4-1 uptime output from a CPU strapped system

18:03:16 up 1 day, 2:46, 6 users, load average: 182.53, 92.02, 37.95
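As a rough rule of thumb (not from the Redpaper), the load average is usually judged against the number of logical CPUs; a load that stays well above that count points at a CPU backlog or at processes stuck waiting on I/O. A quick check:

uptime                              # 1-, 5-, and 15-minute load averages
grep -c ^processor /proc/cpuinfo    # number of logical CPUs to compare against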

Using KDE System Guard and the CPU sensors lets you view the current CPU workload. Using top, you can see both CPU utilization and what processes are the biggest contributors to the problem (Example 2-3 on page 18).

If you have set up sar, you are collecting a lot of information, some of which is CPU utilization, over a period of time. Analyzing this information can be difficult, so use isag, which can use sar output to plot a graph. Otherwise, you may wish to parse the information through a script and use a spreadsheet to plot it to see any trends in CPU utilization. You can also use sar from the command line by issuing sar -u or sar -U processornumber. To gain a broader perspective of the system and current utilization of more than just the CPU subsystem, a good tool is vmstat (2.6, "vmstat" on page 21).
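For example, a minimal set of invocations (the intervals and counts are arbitrary; sar comes from the sysstat package):

sar -u 5 3     # overall CPU utilization, 3 samples at 5-second intervals
vmstat 5 5     # broader view: runnable/blocked tasks, memory, swap, block I/O, CPU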

4.2.2 SMP

SMP-based systems can present their own set of interesting problems that can be difficult to detect. In an SMP environment, there is the concept of CPU affinity, which implies that you bind a process to a CPU.

The main reason this is useful is CPU cache optimization, which is achieved by keeping the same process on one CPU rather than moving between processors. When a process moves between CPUs, the cache of the new CPU must be flushed. Therefore, a process that moves between processors causes many cache flushes to occur, which means that an individual process will take longer to finish. This scenario is very hard to detect because, when monitoring it, the CPU load will appear to be very balanced and not necessarily peaking on any CPU. Affinity is also useful in NUMA-based systems such as the xSeries 445 and xSeries 455, where it is important to keep memory, cache, and CPU access local to one another.

Note: There is a common misconception that the CPU is the most important part of the server. This is not always the case, and servers are often overconfigured with CPU and underconfigured with disks, memory, and network subsystems. Only specific applications that are truly CPU-intensive can take advantage of today's high-end processors.

Tip: Be careful not to add to CPU problems by running too many tools at one time. You may find that using a lot of different monitoring tools at one time may be contributing to the high CPU load.

4.2.3 Performance tuning options

The first step is to ensure that the system performance problem is being caused by the CPU and not one of the other subsystems. If the processor is the server bottleneck, then a number of steps can be taken to improve performance. These include:

Ensure that no unnecessary programs are running in the background by using ps -ef. If you find such programs, stop them and use cron to schedule them to run at off-peak hours.

Identify non-critical, CPU-intensive processes by using top and modify their priority using renice (a short example follows this list).

In an SMP-based machine, try using taskset to bind processes to CPUs to make sure that processes are not hopping between processors, causing cache flushes.

Based on the running application, it may be better to scale up (bigger CPUs) than scale out (more CPUs). This depends on whether your application was designed to effectively take advantage of more processors. For example, a single-threaded application would scale better with a faster CPU and not with more CPUs.

General options include making sure you are using the latest drivers and firmware, as this may affect the load they have on the CPU.
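A minimal sketch of the renice and taskset steps above; PID 1234, the nice value, and the CPU list 0,1 are placeholders:

top -b -n 1 | head -20     # one batch-mode snapshot of the heaviest processes
renice -n 10 -p 1234       # lower the priority of PID 1234 (a higher nice value means lower priority)
taskset -pc 0,1 1234       # bind PID 1234 to CPUs 0 and 1 so it stops hopping between processors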

4.3 Memory bottlenecks

On a Linux system, many programs run at the same time; these programs support multiple users, and some processes are used more than others. Some of these programs use a portion of memory while the rest are "sleeping." When an application accesses cache, the performance increases because an in-memory access retrieves data, thereby eliminating the need to access slower disks.

The OS uses an algorithm to control which programs will use physical memory and which are paged out. This is transparent to user programs. Page space is a file created by the OS on a disk partition to store user programs that are not currently in use. Typically, page sizes are 4 KB or 8 KB. In Linux, the page size is defined by using the variable EXEC_PAGESIZE in the include/asm-<architecture>/param.h kernel header file. The process used to page a process out to disk is called pageout.

4.3.1 Finding memory bottlenecks

Start your analysis by listing the applications that are running on the server. Determine how much physical memory and swap each application needs to run. Figure 4-1 on page 75 shows KDE System Guard monitoring memory usage.

Figure 4-1 KDE System Guard memory monitoring

The indicators in Table 4-1 can also help you define a problem with memory.

Table 4-1 Indicator for memory analysis

Memory available – This indicates how much physical memory is available for use. If, after you start your application, this value has decreased significantly, you may have a memory leak. Check the application that is causing it and make the necessary adjustments. Use free -l -t -o for additional information.

Page faults – There are two types of page faults: soft page faults, when the page is found in memory, and hard page faults, when the page is not found in memory and must be fetched from disk. Accessing the disk will slow your application considerably. The sar -B command can provide useful information for analyzing page faults, specifically columns pgpgin/s and pgpgout/s.

File system cache – This is the common memory space used by the file system cache. Use the free -l -t -o command for additional information.

Private memory for process – This represents the memory used by each process running on the server. You can use the pmap command to see how much memory is allocated to a specific process.

Paging and swapping indicators

In Linux, as with all UNIX-based operating systems, there are differences between paging and swapping. Paging moves individual pages to swap space on the disk; swapping is a bigger operation that moves the entire address space of a process to swap space in one operation.

Swapping can have one of two causes:

A process enters sleep mode. This usually happens because the process depends on interactive action, as editors, shells, and data entry applications spend most of their time waiting for user input. During this time, they are inactive.

A process behaves poorly. Paging can be a serious performance problem when the amount of free memory pages falls below the minimum amount specified, because the paging mechanism is not able to handle the requests for physical memory pages and the swap mechanism is called to free more pages. This significantly increases I/O to disk and will quickly degrade a server's performance.

If your server is always paging to disk (a high page-out rate), consider adding more memory. However, for systems with a low page-out rate, it may not affect performance.
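The indicators in Table 4-1 map onto a few simple commands (PID 1234 is a placeholder; sar comes from the sysstat package):

free -l -t -o     # available memory, low/high memory breakdown, and totals
sar -B 5 3        # paging statistics, including pgpgin/s and pgpgout/s
pmap -x 1234      # per-mapping memory usage for process 1234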

4.3.2 Performance tuning options

If you believe there is a memory bottleneck, consider performing one or more of these actions:

Tune the swap space using bigpages, hugetlb, shared memory.

Increase or decrease the size of pages.

Improve the handling of active and inactive memory.

Adjust the page-out rate (see the sketch after this list).

Limit the resources used for each user on the server.

Stop the services that are not needed, as discussed in 3.3, "Daemons" on page 38.

Add memory.
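As one illustration of the last few items, on 2.6 kernels the page-out behavior can be inspected and nudged through the vm sysctls, and per-user resources capped with ulimit; the values below are placeholders, not recommendations:

sysctl vm.swappiness              # how aggressively the kernel pages out to swap (2.6 kernels)
sysctl -w vm.swappiness=20        # example value only; test before using in production
ulimit -v 1048576                 # example: cap the current shell's address space at 1 GB (KB units)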

4.4 Disk bottlenecks

The disk subsystem is often the most important aspect of server performance and is usually the most common bottleneck. However, problems can be hidden by other factors, such as lack of memory. Applications are considered to be I/O-bound when CPU cycles are wasted simply waiting for I/O tasks to finish.

The most common disk bottleneck is having too few disks. Most disk configurations are based on capacity requirements, not performance. The least expensive solution is to purchase the smallest number of the largest-capacity disks possible. However, this places more user data on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to occur.

The second most common problem is having too many logical disks on the same array. This increases seek time and greatly lowers performance.

The disk subsystem is discussed in 3.12, "Tuning the file system" on page 52.

A recommendation is to apply the diskstats-2.4.patch to fix problems with disk statistics counters, which can occasionally report negative values.

4.4.1 Finding disk bottlenecks

A server exhibiting the following symptoms may be suffering from a disk bottleneck (or a hidden memory problem):

Slow disks will result in:
– Memory buffers filling with write data (or waiting for read data), which will delay all requests because free memory buffers are unavailable for write requests (or the response is waiting for read data in the disk queue)
– Insufficient memory, as in the case of not enough memory buffers for network requests, will cause synchronous disk I/O

Disk utilization, controller utilization, or both will typically be very high.

Most LAN transfers will happen only after disk I/O has completed, causing very long response times and low network utilization.

Disk I/O can take a relatively long time and disk queues will become full, so the CPUs will be idle or have low utilization because they wait long periods of time before processing the next request.

The disk subsystem is perhaps the most challenging subsystem to properly configure. Besides looking at raw disk interface speed and disk capacity, it is key to also understand the workload: Is disk access random or sequential? Is there large I/O or small I/O? Answering these questions provides the necessary information to make sure the disk subsystem is adequately tuned.

Disk manufacturers tend to showcase the upper limits of their drive technology's throughput. However, taking the time to understand the throughput of your workload will help you understand what true expectations to have of your underlying disk subsystem.

Table 4-2 Exercise showing true throughput for 8 KB I/Os for different drive speeds

Disk speed    Latency   Seek time   Total random access time (a)   I/Os per second per disk (b)   Throughput given 8 KB I/O
15 000 RPM    2.0 ms    3.8 ms      6.8 ms                          147                            1.15 MBps
10 000 RPM    3.0 ms    4.9 ms      8.9 ms                          112                            900 KBps
7 200 RPM     4.2 ms    9 ms        13.2 ms                         75                             600 KBps

a. Assuming that the handling of the command + data transfer < 1 ms, total random access time = latency + seek time + 1 ms.
b. Calculated as 1/(total random access time).

Random read/write workloads usually require several disks to scale. The bus bandwidths of SCSI or Fibre Channel are of lesser concern. Larger databases with random access workload will benefit from having more disks. Larger SMP servers will scale better with more disks. Given the I/O profile of 70% reads and 30% writes of the average commercial workload, a RAID-10 implementation will perform 50% to 60% better than a RAID-5.

Sequential workloads tend to stress the bus bandwidth of disk subsystems. Pay special attention to the number of SCSI buses and Fibre Channel controllers when maximum throughput is desired. Given the same number of drives in an array, RAID-10, RAID-0, and RAID-5 all have similar streaming read and write throughput.

There are two ways to approach disk bottleneck analysis: real-time monitoring and tracing.

Real-time monitoring must be done while the problem is occurring. This may not be practical in cases where system workload is dynamic and the problem is not repeatable. However, if the problem is repeatable, this method is flexible because of the ability to add objects and counters as the problem becomes well understood.

Tracing is the collecting of performance data over time to diagnose a problem. This is a good way to perform remote performance analysis. Some of the drawbacks include the potential for having to analyze large files when performance problems are not repeatable, and the potential for not having all key objects and parameters in the trace and having to wait for the next time the problem occurs for the additional data.
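The footnote formulas under Table 4-2 can be checked directly. Here is a quick sketch using bc for the 15,000 RPM row (total random access time = 2.0 ms latency + 3.8 ms seek + 1 ms handling = 6.8 ms); the throughput figure matches the table after rounding:

echo 'scale=0; 1/0.0068' | bc -l             # about 147 I/Os per second per disk
echo 'scale=4; (1/0.0068)*8/1024' | bc -l    # about 1.15 MBps at 8 KB per I/O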

vmstat command

One way to track disk usage on a Linux system is by using the vmstat tool. The columns of interest in vmstat with respect to I/O are the bi and bo fields. These fields monitor the movement of blocks in and out of the disk subsystem. Having a baseline is key to being able to identify any changes over time.

Example 4-2 vmstat output

[root@x232 root]# vmstat 2

r b swpd free buff cache si so bi bo in cs us sy id wa

2 1 0 9004 47196 1141672 0 0 0 950 149 74 87 13 0 0

0 2 0 9672 47224 1140924 0 0 12 42392 189 65 88 10 0 1

0 2 0 9276 47224 1141308 0 0 448 0 144 28 0 0 0 100

0 2 0 9160 47224 1141424 0 0 448 1764 149 66 0 1 0 99

0 2 0 9272 47224 1141280 0 0 448 60 155 46 0 1 0 99

0 2 0 9180 47228 1141360 0 0 6208 10730 425 413 0 3 0 97

1 0 0 9200 47228 1141340 0 0 11200 6 631 737 0 6 0 94

1 0 0 9756 47228 1140784 0 0 12224 3632 684 763 0 11 0 89

0 2 0 9448 47228 1141092 0 0 5824 25328 403 373 0 3 0 97

0 2 0 9740 47228 1140832 0 0 640 0 159 31 0 0 0 100

iostat command

Performance problems can be encountered when too many files are opened, being read and written to, then closed repeatedly. This could become apparent as seek times (the time it takes to move to the exact track where the data is stored) start to increase. Using the iostat tool, you can monitor the I/O device loading in real time. Different options enable you to drill down even farther to gather the necessary data.

Example 4-3 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows average wait times (await) of about 2.7 seconds and service times (svctm) of 270 ms.

Example 4-3 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1

[root@x232 root]# iostat 2 -x /dev/sdb1

avg-cpu: %user %nice %sys %idle

11.50 0.00 2.00 86.50

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz

avgqu-sz await svctm %util

/dev/sdb1 441.00 3030.00 7.00 30.50 3584.00 24480.00 1792.00 12240.00 748.37

101.70 2717.33 266.67 100.00

avg-cpu: %user %nice %sys %idle

10.50 0.00 1.00 88.50

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz

avgqu-sz await svctm %util

/dev/sdb1 441.00 3030.00 7.00 30.00 3584.00 24480.00 1792.00 12240.00 758.49

101.65 2739.19 270.27 100.00

avg-cpu: %user %nice %sys %idle

10.95 0.00 1.00 88.06

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz

avgqu-sz await svctm %util

/dev/sdb1 438.81 3165.67 6.97 30.35 3566.17 25576.12 1783.08 12788.06 781.01

101.69 2728.00 268.00 100.00

The iostat -x (for extended statistics) command provides low-level detail of the disk subsystem. Some things to point out:

%util      Percentage of CPU time during which I/O requests were issued to the device (device utilization)
svctm      Average time required to complete a request, in milliseconds
await      Average amount of time an I/O waited to be served, in milliseconds
avgqu-sz   Average queue length
avgrq-sz   Average size of request
rrqm/s     Number of read requests merged per second that were issued to the device
wrqm/s     Number of write requests merged per second that were issued to the device

For a more detailed explanation of the fields, see the man page for iostat(1).

Changes made to the elevator algorithm as described in "Tune the elevator algorithm in kernel 2.4" on page 55 will be seen in avgrq-sz (average size of request) and avgqu-sz (average queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz will decrease. You can also monitor the rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk can manage.

4.4.2 Performance tuning options

After verifying that the disk subsystem is a system bottleneck, several solutions are possible. These solutions include the following:

If the workload is of a sequential nature and it is stressing the controller bandwidth, the solution is to add a faster disk controller. However, if the workload is more random in nature, then the bottleneck is likely to involve the disk drives, and adding more drives will improve performance.

Add more disk drives in a RAID environment. This spreads the data across multiple physical disks and improves performance for both reads and writes. This will increase the number of I/Os per second. Also, use hardware RAID instead of the software implementation provided by Linux. If hardware RAID is being used, the RAID level is hidden from the OS.

Offload processing to another system in the network (users, applications, or services).

Add more RAM. Adding memory increases system memory disk cache, which in effect improves disk response times.

4.5 Network bottlenecks

A performance problem in the network subsystem can be the cause of many problems, such as a kernel panic. To analyze these anomalies to detect network bottlenecks, each Linux distribution includes traffic analyzers.

4.5.1 Finding network bottlenecks

We recommend KDE System Guard because of its graphical interface and ease of use. The tool, which is available on the distribution CDs, is discussed in detail in 2.10, "KDE System Guard" on page 24. Figure 4-2 on page 80 shows it in action.

Figure 4-2 KDE System Guard network monitoring

It is important to remember that there are many possible reasons for these performance problems and that sometimes problems occur simultaneously, making it even more difficult to pinpoint the origin. The indicators in Table 4-3 can help you determine the problem with your network.

Table 4-3 Indicators for network analysis

Packets received, Packets sent – Shows the number of packets that are coming in and going out of the specified network interface. Check both internal and external interfaces.

Collision packets – Collisions occur when there are many systems on the same domain. The use of a hub may be the cause of many collisions.

Dropped packets – Packets may be dropped for a variety of reasons, and the result may affect performance. For example, the server network interface may be configured to run at 100 Mbps full duplex while the network switch is configured to run at 10 Mbps, or a router may have an ACL filter that drops these packets, for example:

iptables -t filter -A FORWARD -p all -i eth2 -o eth1 -s 172.18.0.0/24 -j DROP

Errors – Errors occur if the communications lines (for instance, the phone line) are of poor quality. In these situations, corrupted packets must be resent, thereby decreasing network throughput.

Faulty adapters – Network slowdowns often result from faulty network adapters. When this kind of hardware fails, it may begin to broadcast junk packets on the network.
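Outside of KDE System Guard, the same indicators can be read from the command line; eth0 is an example interface name:

netstat -i                               # per-interface packet, error, and drop counters
ip -s link show eth0                     # RX/TX statistics, including errors and dropped packets
ethtool eth0 | grep -E 'Speed|Duplex'    # confirm the negotiated speed and duplex against the switch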

4.5.2 Performance tuning options

These steps illustrate what you should do to solve problems related to network bottlenecks:

Ensure that the network card configuration matches router and switch configurations (for example, frame size).

Modify how your subnets are organized.

Use faster network cards.

Tune the appropriate IPv4 TCP kernel parameters. (See Chapter 3, "Tuning the operating system" on page 35.) Some security-related parameters can also improve performance, as described in that chapter. An illustrative sysctl sketch follows this list.

If possible, change network cards and recheck performance.

Add network cards and bind them together to form an adapter team, if possible.
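As an illustration of the sysctl-based TCP tuning mentioned above, the keys below are standard IPv4/network parameters, but the values are placeholders rather than recommendations:

sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem    # current TCP receive/send buffer ranges
sysctl -w net.core.rmem_max=262144            # example ceiling for socket receive buffers
sysctl -w net.core.wmem_max=262144            # example ceiling for socket send buffers

To make such changes permanent, add the same keys to /etc/sysctl.conf and run sysctl -p.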

 

 



NEWS CONTENTS

Old News ;-)

[Apr 09, 2008] Linux.com Inspecting disk IO performance with fio By Ben Martin

Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. The potential for your hard drive to be your system's performance bottleneck makes knowing how fast your disks and filesystems are and getting quantitative measurements on any improvements you can make to the disk subsystem important. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration.

To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.
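For reference, a typical invocation looks like this (run as root; /dev/sda is an example device name):

hdparm -T -t /dev/sda     # -T measures cached reads, -t measures buffered sequential reads from the device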

The hdparm tool isn't the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.

fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.

fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile from source with the make; sudo make install combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks can be issued in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files with an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don't specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation is shown below.

			$ cat random-read-test.fio
			; random read of 128mb of data
			[random-read]
			rw=randread
			size=128m
			directory=/tmp/fio-testing/data

			$ fio random-read-test.fio
			random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
			Starting 1 process
			random-read: Laying out IO file(s) (1 file(s) / 128MiB)
			Jobs: 1 (f=1): [r] [100.0% done] [ 3588/ 0 kb/s] [eta 00m:00s]
			random-read: (groupid=0, jobs=1): err= 0: pid=30598
			  read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec
			    clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02
			    bw (KiB/s) : min= 227, max= 5275, per=100.12%, avg=865.00, stdev=362.99
			  cpu : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=32768/0, short=0/0
			     lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05%
			     lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94%
			     lat (msec): 100=0.37%, 250=0.01%
			Run status group 0 (all jobs):
			   READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec
			Disk stats (read/write):
			  dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%


fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.

The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and it being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The CPU line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds between when the IO request was issued and when the result of that request was reported. The latencies are reported as intervals, so the 4=12.80%, 10=44.96% section reports that 44.96% of requests took more than 4 (the previous reported value) and up to 10 milliseconds to complete.

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes to all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of the output shows, the single thread has an 864 KiB/s aggregate bandwidth (aggrb), which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is not friendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should be getting much closer to the speeds that hdparm reports (typically around 40-60 MBps).

I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode with the possibility, based on the iodepth parameter, of eight requests for asynchronous IO being issued and not fulfilled because the system had to wait for disk IO at any point in time. The choice of allowing up to only eight IO requests in the queue was arbitrary, but typically an application will limit the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not returned data to the application during the course of execution. The figures are reported for intervals from the previous figure; for example, the 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while, based on 4=4.0%, 4% of the time there were only three or four requests in the queue.

 
			$ cat random-read-test-aio.fio
			; same as random-read-test.fio
			; ...
			ioengine=libaio
			iodepth=8
			direct=1
			invalidate=1

			$ fio random-read-test-aio.fio
			random-read: (groupid=0, jobs=1): err= 0: pid=31318
			  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
			    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
			    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
			    bw (KiB/s) : min= 0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
			  ...
			  IO depths : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			  ...
			Run status group 0 (all jobs):
			   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec
		 

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.

The HOWTO file from the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and or writes in many combinations. The ioengine parameter can select how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache to be invalidated for a file before beginning the benchmark. The runtime specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on data that is being read from disk. fsync=n can be used to issue a sync call after every n writes issued. write_iolog and read_iolog cause fio to write or read a log of all the IO requests issued. With these commands you can capture a log of the exact IO commands issued, edit that log to give exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.
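As a rough illustration of how a few of these options combine, here is a small hypothetical job file; it is not taken from the article, and the parameter values are arbitrary examples rather than tuning advice:

			; seq-write-with-sync.fio -- illustrative only
			; sequential writes, capped at 64 MB or 30 seconds (whichever comes first),
			; an fsync after every 32 writes, and a 100-microsecond pause between IO requests
			[seq-write-with-sync]
			rw=write
			size=64m
			runtime=30
			fsync=32
			thinktime=100
			directory=/tmp/fio-testing/data

It is run the same way as the earlier examples: fio seq-write-with-sync.fio.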

Simulating servers

You can also specify multiple threads or processes to all submit IO work at the same time to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing their own IO loads to the system, all running at the same time. I've based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread is to simulate writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.

 
			$ cat four-threads-randio.fio
			; Four threads, two query, two writers.
			[global]
			rw=randread
			size=256m
			directory=/tmp/fio-testing/data
			ioengine=libaio
			iodepth=4
			invalidate=1
			direct=1
			[bgwriter]
			rw=randwrite
			iodepth=32
			[queryA]
			iodepth=1
			ioengine=mmap
			direct=0
			thinktime=3
			[queryB]
			iodepth=1
			ioengine=mmap
			direct=0
			thinktime=5
			[bgupdater]
			rw=randrw
			iodepth=16
			thinktime=40
			size=32m

			$ fio four-threads-randio.fio
			bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
			queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
			queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
			bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
			Starting 4 processes
			bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
			  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
			    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
			    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
			    bw (KiB/s) : min= 0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
			  cpu : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
			  IO depths : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
			     issued r/w: total=0/65536, short=0/0
			     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
			     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
			     lat (msec): 100=4.74%, 250=1.86%
			queryA: (groupid=0, jobs=1): err= 0: pid=3242
			  read : io=256MiB, bw=589MiB/s, iops=147K, runt= 445msec
			    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
			  cpu : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=65536/0, short=0/0
			     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
			     lat (usec): 250=0.01%
			queryB: (groupid=0, jobs=1): err= 0: pid=3243
			  read : io=256MiB, bw=455MiB/s, iops=114K, runt= 576msec
			    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
			    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
			  cpu : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
			  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
			     issued r/w: total=65536/0, short=0/0
			     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
			     lat (usec): 250=0.01%, 500=0.01%
			bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
			  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
			    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
			    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
			    bw (KiB/s) : min= 0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
			  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
			    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
			    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
			    bw (KiB/s) : min= 0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
			  cpu : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
			  IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
			     issued r/w: total=4087/4105, short=0/0
			     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
			     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
			     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
			     lat (msec): 100=19.07%, 250=4.72%
			Run status group 0 (all jobs):
			   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
			  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec
			Disk stats (read/write):
			  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%
		 

As one would expect, the bandwidth the array achieved in the query and writer processes was vastly different. Queries are performed at about 500 MBps, while writing comes in at roughly 1 MBps or 7.5 MBps depending on whether it is mixed read/write or purely write performance respectively. The IO depths show the number of pending IO requests that are queued when an IO request is issued. For example, for the bgupdater process, nearly 1/4 of the async IO requests are being fulfilled with eight or fewer requests in the queue of a potential 16. In contrast, the bgwriter has more than half of its requests performed with 16 or fewer pending requests in the queue.

To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.

 			bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
			  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
			queryA: (groupid=0, jobs=1): err= 0: pid=14964
			  read : io=256MiB, bw=160MiB/s, iops=39,888, runt= 1643msec
			queryB: (groupid=0, jobs=1): err= 0: pid=14965
			  read : io=256MiB, bw=163MiB/s, iops=40,680, runt= 1611msec
			bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
			  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
			  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec
			   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
			  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec
		 

The vast array of ways that fio can issue its IO requests lends it to benchmarking IO patterns and the use of various APIs to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level will make to performance.

Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.

Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.

How to troubleshoot Linux performance bottlenecks

Ken Milberg
09.30.2008

You've just had your first cup of coffee and have received that dreaded phone call. The system is slow. What are you going to do? This article will discuss performance bottlenecks and optimization in Red Hat Enterprise Linux (RHEL5).

Before getting into any monitoring or tuning specifics, you should always use some kind of tuning methodology. This is one which I've used successfully through the years:

1. Baseline – The first thing you must do is establish a baseline, which is a snapshot of how the system appears when it's performing well. This baseline should not only compile data, but also document your system's configuration (RAM, CPU and I/O). This is necessary because you need to know what a well-performing system looks like prior to fixing it.
2. Stress testing and monitoring – This is the part where you monitor and stress your systems at peak workloads. It's the monitoring which is key here – as you cannot effectively tune anything without some historic trending data.

3. Bottleneck identification – This is where you come up with the diagnosis for what is ailing your system. The primary objective of step 2 is to determine the bottleneck. I like to use several monitoring tools here. This allows me to cross-reference my data for accuracy.

4. Tune – Only after you've identified the bottleneck can you tune it.

5. Repeat – Once you've tuned it, you can start the cycle again – but this time start from step 2 (monitoring) – as you already have your baseline.

It's important to note that you should only make one change at a time. Otherwise, you'll never know exactly which change produced any effect you observe. It is only by repeating your tests and consistently monitoring your systems that you can determine if your tuning is making an impact.

RHEL monitoring tools
Before we can begin to improve the performance of our system, we need to use the monitoring tools available to us to baseline. Here are some monitoring tools you should consider using:

Oprofile
This tool (made available in RHEL5) uses the processor's performance-monitoring hardware to retrieve information about the kernel and system executables. It allows one to collect samples of performance data every time a counter detects an interrupt. I like the tool also because it carries little overhead, which is very important because you don't want monitoring tools to be causing system bottlenecks. One important limitation is that the tool is very much geared towards finding problems with CPU-limited processes. It does not identify processes which are sleeping or waiting on I/O.

The steps used to start up Oprofile include setting up the profiler, starting it and then dumping the data.

First we'll set up the profile. This option assumes that one wants to monitor the kernel.

# opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux

Then we can start it up.

# opcontrol --start

Finally, we'll dump the data.

# opcontrol --stop/--shutdown/--dump
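After the dump, a hypothetical next step (not covered in this article) is to summarize the samples with opreport, which ships with OProfile:

# opreport               # samples broken down per image (kernel, libraries, binaries)
# opreport --symbols     # per-symbol breakdown, where debug symbols are available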

SystemTap
This tool (introduced in RHEL5) collects data by analyzing the running kernel. It really helps one come up with a correct diagnosis of a performance problem and is tailor-made for developers. SystemTap eliminates the need for the developer to go through the recompile and reinstallation process to collect data.

Frysk
This is another tool which was introduced by Red Hat in RHEL5. What does it do for you? It allows both developers and system administrators to monitor running processes and threads. Frysk differs from Oprofile in that it uses 100% reliable information (similar to SystemTap) - not just a sampling of data. It also runs in user mode and does not require kernel modules or elevated privileges. Allowing one to stop or start running threads or processes is also a very useful feature.

Some more general Linux tools include top and vmstat. While these are considered more basic, often I find them much more useful than more complex tools. Certainly they are easier to use and can help provide information in a much quicker fashion.

Top provides a quick snapshot of what is going on in your system – in a friendly character-based display.

It also provides information on CPU, Memory and Swap Space.

Let's look at vmstat – one of the oldest but more important Unix/Linux tools ever created. Vmstat allows one to get a valuable snapshot of process, memory, swap, I/O and overall CPU utilization.
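A typical invocation, just as an example, samples every five seconds; the first line of output reports averages since boot:

vmstat 5 5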

Now let's define some of the fields:

Memory
swpd – The amount of virtual memory (swap) in use
free – The amount of free memory
buff – Amount of memory used for buffers
cache – Amount of memory used as page cache

Process
r – number of runnable processes
b – number of processes sleeping.
Make sure this number does not exceed the number of runnable processes, because when this condition occurs it usually signifies that there are performance problems.

Swap
si – the amount of memory swapped in from disk
so – the amount of memory swapped out to disk.

This is another important field you should be monitoring – if you are swapping out data, you will likely be having performance problems with virtual memory.

CPU
us – The % of time spent in user-level code.
It is preferable for you to have processes which spend more time in user code rather than system code. Time spent in system-level code usually means that the process is tied up in the kernel rather than processing real data.
sy – The % of time spent in system-level code.
id – The % of time the CPU is idle.
wa – The % of time the system is spending waiting for I/O.

If your system is waiting on I/O, everything tends to come to a halt. I start to get worried when wa is greater than 10.

There is also:

Free – This tool provides memory information, giving you data around the total amount of free and used physical and swap memory.
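For example:

free -m     # totals in megabytes; the -/+ buffers/cache line shows the memory that is really available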

Now that we've analyzed our systems, let's look at what we can do to optimize and tune them.

CPU Overhead – Shutting Running Processes
Linux starts up all sorts of processes which are usually not required. This includes processes such as autofs, cups, xfs, nfslock and sendmail. As a general rule, shut down anything that isn't explicitly required. How do you do this? The best method is to use the chkconfig command.

Here's how we can shut these processes down.
# chkconfig --del xfs

You can also use the GUI tool /usr/bin/system-config-services to shut down daemon processes.
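A hedged sketch of the usual workflow; the service name sendmail is only an example, so verify what your server actually needs before disabling anything:

# chkconfig --list | grep ':on'     # see which services are set to start at boot
# chkconfig sendmail off            # stop the service from starting at the default runlevels
# service sendmail stop             # stop the running instance now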

Tuning the kernel
To tune your kernel for optimal performance, start with:

sysctl – This is the command we use for changing kernel parameters. The parameters themselves are exposed under /proc/sys (kernel parameters under /proc/sys/kernel).

Let's change some of the parameters. We'll start with the msgmax parameter. This parameter specifies the maximum allowable size of a single message in an IPC message queue. Let's view how it currently looks.

# sysctl kernel.msgmax
kernel.msgmax = 65536

There are three ways to make these kinds of kernel changes. One way is to change this using the echo command.

# echo 131072 > /proc/sys/kernel/msgmax
# sysctl kernel.msgmax
kernel.msgmax = 131072

Another parameter that is changed quite frequently is SHMMAX, which is used to define the maximum size (in bytes) for a shared memory segment. In Oracle this should be set large enough for the largest SGA size. Let's look at the default parameter:

# sysctl kernel.shmmax
kernel.shmmax = 268435456

This is in bytes, which translates to 256 MB. Let's change it to 512 MB, using the -w flag.

# sysctl -w kernel.shmmax=536870912
kernel.shmmax = 536870912

The final method for making changes is to use a text editor such as vi – directly editing the /etc/sysctl.conf file to manually make our changes.

To allow the parameters in /etc/sysctl.conf to take effect without a reboot, issue the sysctl command with the -p flag.
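For example, a hypothetical /etc/sysctl.conf fragment that would persist the changes made above:

kernel.msgmax = 131072
kernel.shmmax = 536870912

Then reload it with:

# sysctl -p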

Obviously, there is more to performance tuning and optimization than we can discuss in the context of this small article – entire books have been written on Linux performance tuning. For those of you first getting your hands dirty with tuning, I suggest you tread lightly and spend time working on development, test and/or sandbox environments prior to deploying any changes into production. Ensure that you monitor the effects of any changes that you make immediately; it's imperative to know the effect of your change. Be prepared for the possibility that fixing your bottleneck has created another one. This is actually not a bad thing in itself, as long as your overall performance has improved and you understand fully what is happening.

Performance monitoring and tuning is a dynamic process which does not stop after you have fixed a problem. All you've done is established a new baseline. Don't rest on your laurels, and understand that performance monitoring must be a routine part of your role as a systems administrator.

About the author: Ken Milberg is a systems consultant with two decades of experience working with Unix and Linux systems. He is a SearchEnterpriseLinux.com Ask the Experts advisor and columnist.

Linux.com Inspecting disk IO performance with fio

Storage performance has failed to keep up with that of other major components of computer systems. Hard disks have gotten larger, but their speed has not kept pace with the relative speed improvements in RAM and CPU technology. The potential for your hard drive to be your system's performance bottleneck makes knowing how fast your disks and filesystems are and getting quantitative measurements on any improvements you can make to the disk subsystem important. One way to make disk access faster is to use more disks in combination, as in a RAID-5 configuration.

To get a basic idea of how fast a physical disk can be accessed from Linux you can use the hdparm tool with the -T and -t options. The -T option takes advantage of the Linux disk cache and gives an indication of how much information the system could read from a disk if the disk were fast enough to keep up. The -t option also reads the disk through the cache, but without any precaching of results. Thus -t can give an idea of how fast a disk can deliver information stored sequentially on disk.

The hdparm tool isn't the best indicator of real-world performance. It operates at a very low level; once you place a filesystem onto a disk partition you might get significantly different results. You will also see large differences in speed between sequential access and random access. It would also be good to be able to benchmark a filesystem stored on a group of disks in a RAID configuration.

fio was created to allow benchmarking specific disk IO workloads. It can issue its IO requests using one of many synchronous and asynchronous IO APIs, and can also use various APIs which allow many IO requests to be issued with a single API call. You can also tune how large the files fio uses are, at what offsets in those files IO is to happen at, how much delay if any there is between issuing IO requests, and what if any filesystem sync calls are issued between each IO request. A sync call tells the operating system to make sure that any information that is cached in memory has been saved to disk and can thus introduce a significant delay. The options to fio allow you to issue very precisely defined IO patterns and see how long it takes your disk subsystem to complete these tasks.

fio is packaged in the standard repository for Fedora 8 and is available for openSUSE through the openSUSE Build Service. Users of Debian-based distributions will have to compile from source with the make; sudo make install combination.

The first test you might like to perform is for random read IO performance. This is one of the nastiest IO loads that can be issued to a disk, because it causes the disk head to seek a lot, and disk head seeks are extremely slow operations relative to other hard disk operations. One area where random disk seeks can be issued in real applications is during application startup, when files are requested from all over the hard disk. You specify fio benchmarks using configuration files with an ini file format. You need only a few parameters to get started. rw=randread tells fio to use a random reading access pattern, size=128m specifies that it should transfer a total of 128 megabytes of data before calling the test complete, and the directory parameter explicitly tells fio what filesystem to use for the IO benchmark. On my test machine, the /tmp filesystem is an ext3 filesystem stored on a RAID-5 array consisting of three 500GB Samsung SATA disks. If you don't specify directory, fio uses the current directory that the shell is in, which might not be what you want. The configuration file and invocation is shown below.

$ cat random-read-test.fio ; random read of 128mb of data [random-read] rw=randread size=128m directory=/tmp/fio-testing/data $ fio random-read-test.fio random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 Starting 1 process random-read: Laying out IO file(s) (1 file(s) / 128MiB) Jobs: 1 (f=1): [r] [100.0% done] [ 3588/ 0 kb/s] [eta 00m:00s] random-read: (groupid=0, jobs=1): err= 0: pid=30598 read : io=128MiB, bw=864KiB/s, iops=211, runt=155282msec clat (usec): min=139, max=148K, avg=4736.28, stdev=6001.02 bw (KiB/s) : min= 227, max= 5275, per=100.12%, avg=865.00, stdev=362.99 cpu : usr=0.07%, sys=1.27%, ctx=32783, majf=0, minf=10 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% issued r/w: total=32768/0, short=0/0 lat (usec): 250=34.92%, 500=0.36%, 750=0.02%, 1000=0.05% lat (msec): 2=0.41%, 4=12.80%, 10=44.96%, 20=5.16%, 50=0.94% lat (msec): 100=0.37%, 250=0.01% Run status group 0 (all jobs): READ: io=128MiB, aggrb=864KiB/s, minb=864KiB/s, maxb=864KiB/s, mint=155282msec, maxt=155282msec Disk stats (read/write): dm-6: ios=32768/148, merge=0/0, ticks=154728/12490, in_queue=167218, util=99.59%

fio produces many figures in this test. Overall, higher values for bandwidth and lower values for latency constitute better results.

The bw result shows the average bandwidth achieved by the test. The clat and bw lines show information about the completion latency and bandwidth respectively. The completion latency is the time between submitting a request and it being completed. The min, max, average, and standard deviation for the latency and bandwidth are shown. In this case, the standard deviation for both completion latency and bandwidth is quite large relative to the average value, so some IO requests were served much faster than others. The CPU line shows you how much impact the IO load had on the CPU, so you can tell if the processor in the machine is too slow for the IO you want to perform. The IO depths section is more interesting when you are testing an IO workload where multiple requests for IO can be outstanding at any point in time as is done in the next example. Because the above test only allowed a single IO request to be issued at any time, the IO depths were at 1 for 100% of the time. The latency figures indented under the IO depths section show an overview of how long each IO request took to complete; for these results, almost half the requests took between 4 and 10 milliseconds between when the IO request was issued and when the result of that request was reported. The latencies are reported as intervals, so the 4=12.80%, 10=44.96% section reports that 44.96% of requests took more than 4 (the previous reported value) and up to 10 milliseconds to complete.

The large READ line third from last shows the average, min, and max bandwidth for each execution thread or process. fio lets you define many threads or processes that all submit work at the same time during a benchmark, so you can have many threads, each using synchronous APIs to perform IO, and benchmark the result of all these threads running at once. This lets you test IO workloads that are closer to many server applications, where a new thread or process is spawned to handle each connecting client. In this case we have only one thread. As the READ line near the bottom of the output shows, the single thread achieved an aggregate bandwidth (aggrb) of 864KiB/sec, which tells you that either the disk is slow or the manner in which IO is submitted to the disk system is unfriendly, causing the disk head to perform many expensive seeks and thus producing a lower overall IO bandwidth. If you are submitting IO to the disk in a friendly way you should get much closer to the sequential speeds that hdparm reports (typically around 40-60MB/sec).
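
As a rough sketch of the multi-process feature (this job file is my own illustration, not one from the article, and the sizes are arbitrary), the numjobs parameter clones a job into several processes that run at the same time, and group_reporting folds their results into one set of figures:

; sketch only: four processes doing synchronous random reads simultaneously
[group-random-read]
rw=randread
size=32m
directory=/tmp/fio-testing/data
ioengine=sync
numjobs=4
group_reporting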

I performed the same test again, this time using the Linux asynchronous IO subsystem in direct IO mode, with up to eight asynchronous IO requests allowed to be outstanding (issued but not yet fulfilled because the system was waiting for disk IO) at any point in time, as set by the iodepth parameter. The choice of allowing at most eight IO requests in the queue was arbitrary, but typically an application limits the number of outstanding requests so the system does not become bogged down. In this test, the benchmark reported almost three times the bandwidth. The abridged results are shown below. The IO depths show how many asynchronous IO requests were issued but had not yet returned data to the application during the course of execution. The figures are reported as intervals starting from the previous figure; for example, 8=96.0% tells you that 96% of the time there were five, six, seven, or eight requests in the async IO queue, while 4=4.0% means that 4% of the time there were only three or four requests in the queue.

$ cat random-read-test-aio.fio
; same as random-read-test.fio
; ...
ioengine=libaio
iodepth=8
direct=1
invalidate=1

$ fio random-read-test-aio.fio
random-read: (groupid=0, jobs=1): err= 0: pid=31318
  read : io=128MiB, bw=2,352KiB/s, iops=574, runt= 57061msec
    slat (usec): min=8, max=260, avg=25.90, stdev=23.23
    clat (usec): min=1, max=124K, avg=13901.91, stdev=12193.87
    bw (KiB/s) : min=    0, max= 5603, per=97.59%, avg=2295.43, stdev=590.60
  ...
  IO depths : 1=0.1%, 2=0.1%, 4=4.0%, 8=96.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  ...
Run status group 0 (all jobs):
   READ: io=128MiB, aggrb=2,352KiB/s, minb=2,352KiB/s, maxb=2,352KiB/s, mint=57061msec, maxt=57061msec
 

Random reads are always going to be limited by the seek time of the disk head. Because the async IO test could issue as many as eight IO requests before waiting for any to complete, there was more chance for reads in the same disk area to be completed together, and thus an overall boost in IO bandwidth.

The HOWTO file in the fio distribution gives full details of the options you can use to specify benchmark workloads. One of the more interesting parameters is rw, which can specify sequential or random reads and/or writes in many combinations. The ioengine parameter selects how the IO requests are issued to the kernel. The invalidate option causes the kernel buffer and page cache for a file to be invalidated before the benchmark begins. The runtime parameter specifies that a test should run for a given amount of time and then be considered complete. The thinktime parameter inserts a specified delay between IO requests, which is useful for simulating a real application that would normally perform some work on the data it reads from disk. fsync=n can be used to issue a sync call after every n writes. write_iolog and read_iolog cause fio to write or read a log of all the IO requests it issues. With these options you can capture a log of the exact IO commands issued, edit that log to produce exactly the IO workload you want, and benchmark those exact IO requests. The iolog options are great for importing an IO access pattern from an existing application for use with fio.
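
As a sketch of how several of these options fit together in one job file (my own example, not from the article; the values are arbitrary):

; sketch: sequential writes exercising runtime, thinktime, fsync, and write_iolog
[seq-write-options]
rw=write                      ; sequential writes
size=64m
directory=/tmp/fio-testing/data
ioengine=sync
invalidate=1                  ; drop cached pages for the file before starting
runtime=30                    ; cap the test at 30 seconds
thinktime=100                 ; pause 100 microseconds between IO requests
fsync=32                      ; issue a sync call after every 32 writes
write_iolog=seq-write.iolog   ; record every IO issued, for later editing and replay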

Simulating servers

You can also specify multiple threads or processes that all submit IO work at the same time, to benchmark server-like filesystem interaction. In the following example I have four different processes, each issuing its own IO load to the system, all running at the same time. I've based the example on having two memory-mapped query engines, a background updater thread, and a background writer thread. The difference between the two writing threads is that the writer thread simulates writing a journal, whereas the background updater must read and write (update) data. bgupdater has a thinktime of 40 microseconds, causing the process to sleep for a little while after each completed IO.

$ cat four-threads-randio.fio
; Four threads, two query, two writers.
[global]
rw=randread
size=256m
directory=/tmp/fio-testing/data
ioengine=libaio
iodepth=4
invalidate=1
direct=1

[bgwriter]
rw=randwrite
iodepth=32

[queryA]
iodepth=1
ioengine=mmap
direct=0
thinktime=3

[queryB]
iodepth=1
ioengine=mmap
direct=0
thinktime=5

[bgupdater]
rw=randrw
iodepth=16
thinktime=40
size=32m

$ fio four-threads-randio.fio
bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
queryA: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
queryB: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=mmap, iodepth=1
bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
Starting 4 processes
bgwriter: (groupid=0, jobs=1): err= 0: pid=3241
  write: io=256MiB, bw=7,480KiB/s, iops=1,826, runt= 35886msec
    slat (usec): min=9, max=106K, avg=35.29, stdev=583.45
    clat (usec): min=117, max=224K, avg=17365.99, stdev=24002.00
    bw (KiB/s) : min=    0, max=14636, per=72.30%, avg=5746.62, stdev=5225.44
  cpu : usr=0.40%, sys=4.13%, ctx=18254, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.4%, 8=3.3%, 16=59.7%, 32=36.5%, >=64=0.0%
     issued r/w: total=0/65536, short=0/0
     lat (usec): 250=0.05%, 500=0.33%, 750=0.70%, 1000=1.11%
     lat (msec): 2=7.06%, 4=14.91%, 10=27.10%, 20=21.82%, 50=20.32%
     lat (msec): 100=4.74%, 250=1.86%
queryA: (groupid=0, jobs=1): err= 0: pid=3242
  read : io=256MiB, bw=589MiB/s, iops=147K, runt=   445msec
    clat (usec): min=2, max=165, avg= 3.48, stdev= 2.38
  cpu : usr=70.05%, sys=30.41%, ctx=91, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.20%, 10=22.51%, 20=1.17%, 50=0.05%, 100=0.05%
     lat (usec): 250=0.01%
queryB: (groupid=0, jobs=1): err= 0: pid=3243
  read : io=256MiB, bw=455MiB/s, iops=114K, runt=   576msec
    clat (usec): min=2, max=303, avg= 3.48, stdev= 2.31
    bw (KiB/s) : min=464158, max=464158, per=1383.48%, avg=464158.00, stdev= 0.00
  cpu : usr=73.22%, sys=26.43%, ctx=69, majf=0, minf=65545
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     issued r/w: total=65536/0, short=0/0
     lat (usec): 4=76.81%, 10=21.61%, 20=1.53%, 50=0.02%, 100=0.03%
     lat (usec): 250=0.01%, 500=0.01%
bgupdater: (groupid=0, jobs=1): err= 0: pid=3244
  read : io=16,348KiB, bw=1,014KiB/s, iops=247, runt= 16501msec
    slat (usec): min=7, max=42,515, avg=47.01, stdev=665.19
    clat (usec): min=1, max=137K, avg=14215.23, stdev=20611.53
    bw (KiB/s) : min=    0, max= 1957, per=2.37%, avg=794.90, stdev=495.94
  write: io=16,420KiB, bw=1,018KiB/s, iops=248, runt= 16501msec
    slat (usec): min=9, max=42,510, avg=38.73, stdev=663.37
    clat (usec): min=202, max=229K, avg=49803.02, stdev=34393.32
    bw (KiB/s) : min=    0, max= 1840, per=10.89%, avg=865.54, stdev=411.66
  cpu : usr=0.53%, sys=1.39%, ctx=12089, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=22.8%, 16=76.8%, 32=0.0%, >=64=0.0%
     issued r/w: total=4087/4105, short=0/0
     lat (usec): 2=0.02%, 4=0.04%, 20=0.01%, 50=0.06%, 100=1.44%
     lat (usec): 250=8.81%, 500=4.24%, 750=2.56%, 1000=1.17%
     lat (msec): 2=2.36%, 4=2.62%, 10=9.47%, 20=13.57%, 50=29.82%
     lat (msec): 100=19.07%, 250=4.72%

Run status group 0 (all jobs):
   READ: io=528MiB, aggrb=33,550KiB/s, minb=1,014KiB/s, maxb=589MiB/s, mint=445msec, maxt=16501msec
  WRITE: io=272MiB, aggrb=7,948KiB/s, minb=1,018KiB/s, maxb=7,480KiB/s, mint=16501msec, maxt=35886msec

Disk stats (read/write):
  dm-6: ios=4087/69722, merge=0/0, ticks=58049/1345695, in_queue=1403777, util=99.74%

As one would expect, the bandwidth the array achieved for the query and writer processes was vastly different. Queries run at roughly 500MiB/sec, while writing comes in at about 1MiB/sec for the mixed read/write bgupdater process and about 7.5MiB/sec for the write-only bgwriter process. The IO depths show the number of pending IO requests in the queue at the moment an IO request is issued. For example, for the bgupdater process, nearly a quarter of the async IO requests were fulfilled with eight or fewer requests in a queue of a potential 16. In contrast, the bgwriter had more than half of its requests performed with 16 or fewer pending requests in a queue of a potential 32.

To contrast with the three-disk RAID-5 configuration, I reran the four-threads-randio.fio test on a single Western Digital 750GB drive. The bgupdater process achieved less than half the bandwidth and each of the query processes ran at 1/3 the overall bandwidth. For this test the Western Digital drive was on a different computer with different CPU and RAM specifications as well, so any comparison should be taken with a grain of salt.

bgwriter: (groupid=0, jobs=1): err= 0: pid=14963
  write: io=256MiB, bw=6,545KiB/s, iops=1,597, runt= 41013msec
queryA: (groupid=0, jobs=1): err= 0: pid=14964
  read : io=256MiB, bw=160MiB/s, iops=39,888, runt=  1643msec
queryB: (groupid=0, jobs=1): err= 0: pid=14965
  read : io=256MiB, bw=163MiB/s, iops=40,680, runt=  1611msec
bgupdater: (groupid=0, jobs=1): err= 0: pid=14966
  read : io=16,416KiB, bw=422KiB/s, iops=103, runt= 39788msec
  write: io=16,352KiB, bw=420KiB/s, iops=102, runt= 39788msec

   READ: io=528MiB, aggrb=13,915KiB/s, minb=422KiB/s, maxb=163MiB/s, mint=1611msec, maxt=39788msec
  WRITE: io=272MiB, aggrb=6,953KiB/s, minb=420KiB/s, maxb=6,545KiB/s, mint=39788msec, maxt=41013msec

The vast array of ways in which fio can issue its IO requests makes it well suited to benchmarking specific IO patterns and the various APIs used to perform that IO. You can also run identical fio configurations on different filesystems or underlying hardware to see what difference changes at that level make to performance.

Benchmarking different IO request systems for a particular IO pattern can be handy if you are about to write an IO-intensive application but are not sure which API and design will work best on your hardware. For example, you could keep the disk system and RAM fixed and see how well an IO load would be serviced using memory-mapped IO or the Linux asyncio interface. Of course this requires you to have a very intricate knowledge of the typical IO requests that your application will issue. If you already have a tool that uses something like memory-mapped files, then you can get IO patterns for typical use from the existing tool, feed them into fio using different IO engines, and get a reasonable picture of whether it might be worth porting the application to a different IO API for better performance.
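
A minimal sketch of that iolog-replay idea follows; it is my own example rather than one from the article, the log file name is assumed, and the stonewall option makes the second job wait for the first to finish so the two engines are measured one after the other instead of competing for the disk:

; sketch: replay the same captured IO log under two different IO engines
[global]
directory=/tmp/fio-testing/data
read_iolog=app-io.log     ; log previously captured with write_iolog (name assumed)

[replay-sync]
ioengine=sync

[replay-libaio]
stonewall                 ; wait for the previous job to complete before starting
ioengine=libaio
iodepth=8
direct=1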

Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.


Re: Inspecting disk IO performance with fio

Mark Seger on April 19, 2008 02:32 PM
I think one thing people tend to easily forget is that measuring I/O rates based on end-to-end time is a useful number, but I also find it very important to look at the rates over time, as often as once a second, and to compare them with what's happening on the rest of the system by also looking at the CPU, memory, network, or even interrupts as well as other subsystems. That's why I wrote http://collectl.sourceforge.net/. I find this to be very complementary to load generators, because they can be counted on to deliver a known I/O stream and collectl can tell you what's happening during that load. If you want a real quick look with minimal effort, see http://collectl.sourceforge.net/Examples.html.
-mark

Recommended Links


Commercial_linuxes/Performance_tuning/Performance/Performance%20Tuning%20for%20SUSE%20Linux%20Enterprise%20Server.pdf

Commercial_linuxes/Performance_tuning/Performance/tut303.pdf


