Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Unix Kernel Internals

The Unix kernel is quite old, probably the fourth-oldest surviving kernel in existence (if we count VMS, MVS and VM/CMS). The first version of Unix was developed in 1969 by Ken Thompson, with strong influence from Multics. After several years of internal development, a team at Berkeley led by Bill Joy made important contributions. The first major Berkeley contribution was the addition in 1978 of virtual memory and demand paging. The result is widely known as 3BSD UNIX.

This work convinced DARPA to fund Berkeley to develop a standard Unix system for government use that included the networking protocols now known as TCP/IP. The result was 4BSD, which could communicate uniformly across a diverse set of networks, including LANs (Ethernet and token ring) as well as wide area networks. 4.2BSD was released in 1983 and 4.3BSD in 1986. The quality of those implementations and their free availability were probably among the most important reasons for the popularity of networking and the rapid growth of the Internet. The last Berkeley release was finalized in June 1993. It included the BSD Fast Filesystem (FFS) and NFS (originated by Sun).

That paradoxically coincided with Microsoft's self-imposed withdrawal from the Unix scene: in October 1988 Dave Cutler, the architect of VAX/VMS, was hired by Microsoft and tasked with developing a new OS, which would become the world-famous Windows NT. After Microsoft's withdrawal, Sun Microsystems became the main developer and promoter of Unix. It introduced several important enhancements to the OS, such as the /proc filesystem, the virtual filesystem layer (required for the NFS implementation), RPC and several others.

The commercial part of the Unix story from the early 1980s was dominated by Microsoft, which produced XENIX, and Sun, which produced SunOS (1984). In 1989 the ANSI standard for the C language was approved, and Unix was ported to this new version of C. Microsoft's withdrawal from Unix development also provided a nice opening for Linux, a kernel reimplementation project that started in Finland and then moved to the USA.

While initially a free software project, Linux soon became part of the commercial story of Unix, owing to enterprise distributions such as Red Hat and SUSE as well as peculiarities of the license used (GPL), which permitted "brute-force/largest player survives" commercialization. In essence Linux was a reimplementation of the Unix kernel, and like any reimplementation it helped to polish certain areas; it also served as a stimulus for established Unix players to upgrade their offerings and make them more compatible. Linux soon became the lowest common denominator in the Unix world.

The kernel provides the following key functions:

The file system (covered in the separate page)
CPU scheduling and process management
Memory management (see virtual_memory.shtml)

They are all provided via system calls.

These days the Unix kernel has become complex and has strayed from the original design goals. Neither simplicity nor orientation toward programmers as the main users survived commercial success. Those goals quietly died. Here is an interesting quote from an interview with Solaris kernel developer Andy Tucker:

The nature of OS research has changed over the years. In the 80's and early 90's, there was a lot of "big systems" research; universities and industry labs would start by building an operating system, and then use that as a platform for investigating new ideas. So CMU had Mach, Berkeley had Sprite, Stanford had the V System, etc.. This meant that there was a lot of re-examination of basic OS constructs --- how to best build an OS from the ground up. As a result we had work on distributed systems, microkernels, etc. --- but the systems were all aimed at supporting the same applications, essentially the ones running on the researchers' desktops.

Now most of the research I see is based on existing OS platforms, usually Linux or one of the *BSDs. The focus is often on improving support for new types of applications --- multimedia, mobility, etc.. So we have fewer people looking at the basic structure of operating systems (with some notable exceptions), but more looking at how to make operating systems perform better from a user's point of view. The use of existing OS platforms also removes some of the barriers to entry for OS research --- universities with small OS groups and budgets can do interesting research without having to build an entirely new operating system.

30 years after UNIX was recoded in C, most people still use C (or in some cases a little bit of C++) for the OS kernel. Is C perfectly adequate, or do they see some of the newer languages (C#, Java, or even modern C++ paradigms) being applied to OS design?

Andy Tucker: There have been various experiments in this area; as an example, Sun has developed operating systems in both C++ (SpringOS) and Java (JavaOS). While object-oriented languages offer a number of advantages in terms of ease of development for higher-level programming abstractions, this doesn't always benefit OS kernels as much as it would user applications. Since the kernel is the piece of software that most directly interacts with the hardware, the benefits of having a simple mapping between the language and machine instructions is often more compelling than ease-of-development features like garbage collection and templates. There are also issues like runtime support requirements that can be extensive, depending on the language. What we often wind up doing instead is taking some of the concepts from object-oriented languages, such as polymorphism, and finding creative ways to implement them in non-OO languages like C.

How do you feel Solaris process management technologies like the Fair Share Scheduler will stack up to the Linux O(1) scheduler. Furthermore, has Sun ever attempted to implement an O(1) scheduler for Solaris and if so, what problems/drawbacks they encountered which kept it out of the released kernels.

Andy Tucker: Solaris has actually had an O(1) scheduler for a number of years. The run queues are also per-CPU to maximize scalability. This isn't a secret, but we haven't talked about the technology itself much; we've been mostly focused on the results.

The "fair-share scheduler" is one of several scheduling policies in Solaris, which control how priorities are assigned to individual processes. This is separate from the scheduler, which handles dispatching processes onto processors in priority order.

The fair-share scheduler allows the allocation of CPU in the system to be divided among groups of processes according to proportions defined by an administrator. For example, on a system running both a mail server and a web server, the administrator might decide that if the system is busy, 2/3 of the CPU should go to the mail server, and 1/3 should go the web server. Although in the past the fair-share scheduler was available only as a separate product (Solaris Resource Manager), we decided that it was important enough technology for our customers to bundle in the core operating system.

What does the future hold for Solaris 10? What enhancements are in store at the OS and kernel level? Are there any plans to integrate the Gridengine into Solaris rather than keeping it as a separate application?

Andy Tucker: Solaris 10 will have a number of new features that we think are pretty exciting. One is Solaris Zones --- this takes an idea that was initially developed for FreeBSD (jails) and extends it to address the needs of our customers. It allows administrators to divide up a single system into a number of separate application environments, called zones, where processes in one zone are not able to see or interact with those in other zones. This means that multiple applications can run on the same system without conflicting with each other, but the administrator only has to deal with one OS kernel for backups, patches, etc..

We're also looking at ways to improve system reliability and observability. Solaris 10 will include tools that allow tracing not only what's going on at user level, but also what's going on in the kernel. So a developer trying to understand why their application is performing poorly can get information from the whole software stack and get a much better picture of what's really going on. We're also using these tools internally to improve the performance and reliability of Solaris and other Sun software.

A nice overview of the Linux kernel is provided in "Performance Tuning for Linux: An Introduction to Kernels" (section "Linux Kernel Architecture"). Here is an extended quote from the sample chapter:

Let’s begin this section by discussing the architecture of the Linux kernel, including responsibilities of the kernel, its organization and modules, services of the kernel, and process management.

Kernel Responsibilities

The kernel (also called the operating system) has two major responsibilities:

To interact with and control the system’s hardware components

To provide an environment in which applications can run

Some operating systems allow applications to directly access hardware components, although this capability is very uncommon nowadays. UNIX-like operating systems hide all the low-level hardware details from an application. If an application wants to make use of a hardware resource, it must make a request to the operating system. The operating system then evaluates the request and interacts with the hardware component on behalf of the application, but only if it’s valid. To enforce this kind of scheme, the operating system needs to depend on hardware capabilities that forbid applications to directly interact with them.

Organization and Modules

Like many other UNIX-like operating systems, the Linux kernel is monolithic. This means that even though Linux is divided into subsystems that control various components of the system (such as memory management and process management), all of these subsystems are tightly integrated to form the whole kernel. In contrast, microkernel operating systems provide bare, minimal functionality, and all other operating system layers are performed on top of microkernels as processes. Microkernel operating systems are generally slower due to message passing between the various layers. However, microkernel operating systems can be extended very easily.

Linux kernels can be extended by modules. A module is a kernel feature that provides the benefits of a microkernel without a penalty. A module is an object that can be linked to the kernel at runtime.

Using Kernel Services

The kernel provides a set of interfaces for applications running in user mode to interact with the system. These interfaces, also known as system calls, give applications access to hardware and other kernel resources. System calls not only provide applications with abstracted hardware, but also ensure security and stability.

Most applications do not use system calls directly. Instead, they are programmed to an application programming interface (API). It is important to note that there is no relation between the API and system calls. APIs are provided as part of libraries for applications to make use of. These APIs are generally implemented through the use of one or more system calls.

/proc File System—External Performance View

The /proc file system provides the user with a view of internal kernel data structures. It also lets you look at and change some of the kernel internal data structures, thereby changing the kernel’s behavior. The /proc file system provides an easy way to fine-tune system resources to improve the performance not only of applications but of the overall system.

/proc is a virtual file system that is created dynamically by the kernel to provide data. It is organized into various directories. Each of these directories corresponds to tunables for a given subsystem. Appendix A explains in detail how to use the /proc file system to fine-tune your system.
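As a rough illustration of how a program can look at this kernel data, the Python sketch below parses text in the layout used by /proc/meminfo and reads a single tunable file. The helper names `parse_meminfo` and `read_tunable` are hypothetical, chosen for this example only; /proc/meminfo and /proc/sys/vm/swappiness are standard Linux paths, and the sketch simply returns None on systems where they do not exist.

```python
from pathlib import Path

def parse_meminfo(text):
    """Parse 'Key:   value kB' lines (the /proc/meminfo layout) into a dict of ints."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip lines that do not look like 'Key: value'
        info[key.strip()] = int(rest.split()[0])
    return info

def read_tunable(path):
    """Return the stripped contents of a /proc file, or None if it is unavailable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    raw = read_tunable("/proc/meminfo")
    if raw is not None:
        print(parse_meminfo(raw).get("MemTotal"), "kB total")
    print("swappiness:", read_tunable("/proc/sys/vm/swappiness"))
```

Changing a tunable works the same way in reverse: a sufficiently privileged process writes a new value into the file under /proc/sys.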

Another essential of the Linux system is memory management. In the next section, we’ll cover five aspects of how Linux handles this management.

Memory Management

The various aspects of memory management in Linux include address space, physical memory, memory mapping, paging, and swapping.

Address Space. One of the advantages of virtual memory is that each process thinks it has all the address space it needs. The virtual memory can be many times larger than the physical memory in the system. Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other. A process running one application cannot affect another, and the applications are protected from each other. The virtual address space is mapped to physical memory by the operating system. From an application point of view, this address space is a flat linear address space. The kernel, however, treats the user virtual address space very differently.
The linear address space is divided into two parts: user address space and kernel address space. The user address space can change every time a context switch occurs, while the kernel address space remains constant. How much space is allocated for user space and kernel space depends mainly on whether the system is a 32-bit or 64-bit architecture. For example, x86 is a 32-bit architecture and supports only a 4GB address space. Out of this 4GB, 3GB is reserved for user space and 1GB is reserved for the kernel. The location of the split is determined by the PAGE_OFFSET kernel configuration variable.

Physical Memory. Linux uses an architecture-independent way of describing physical memory in order to support various architectures.
Physical memory can be arranged into banks, with each bank being a particular distance from the processor. This type of memory arrangement is becoming very common, with more machines employing NUMA (Nonuniform Memory Access) technology. Linux VM represents this arrangement as a node. Each node is divided into a number of blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. For example, x86 has the following zones:

ZONE_DMA        First 16MB of memory
ZONE_NORMAL     16MB – 896MB
ZONE_HIGHMEM    896MB – end

Each zone has its own use. Some of the legacy ISA devices have restrictions on where they can perform I/O from and to. ZONE_DMA addresses those requirements.

ZONE_NORMAL is used for all kernel operations and allocations. It is extremely crucial for system performance.

ZONE_HIGHMEM is the rest of the memory in the system. It’s important to note that ZONE_HIGHMEM cannot be used for kernel allocations and data structures—it can only be used for user data.
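The zone boundaries quoted above for 32-bit x86 can be expressed as a small lookup. This is only a sketch of the classification, not kernel code; the function name `zone_for` is hypothetical.

```python
MB = 1024 * 1024

# Zone boundaries for 32-bit x86 as listed above (upper bound exclusive).
ZONES = [
    ("ZONE_DMA", 0, 16 * MB),
    ("ZONE_NORMAL", 16 * MB, 896 * MB),
    ("ZONE_HIGHMEM", 896 * MB, None),  # None = up to the end of memory
]

def zone_for(phys_addr):
    """Return the name of the zone that contains a physical address."""
    for name, start, end in ZONES:
        if phys_addr >= start and (end is None or phys_addr < end):
            return name
    raise ValueError("negative physical address")

print(zone_for(1 * MB))     # ZONE_DMA
print(zone_for(512 * MB))   # ZONE_NORMAL
print(zone_for(2048 * MB))  # ZONE_HIGHMEM
```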

Memory Mapping. While looking at how kernel memory is mapped, we will use x86 as an example for better understanding. As mentioned earlier, the kernel has only 1GB of virtual address space for its use. The other 3GB is reserved for user space. The kernel maps the physical memory in ZONE_DMA and ZONE_NORMAL directly to its address space. This means that the first 896MB of physical memory in the system is mapped to the kernel’s virtual address space, which leaves only 128MB of virtual address space. This 128MB of virtual space is used for operations such as vmalloc and kmap.
This mapping scheme works well as long as physical memory sizes are small (less than 1GB). However, these days, all servers support tens of gigabytes of memory. Intel has added PAE (Physical Address Extension) to its Pentium processors to support up to 64GB of physical memory. Because of the preceding memory mapping, handling physical memories in tens of gigabytes is a major source of problems for x86 Linux. The Linux kernel handles high memory (all memory above 896MB) as follows: When the Linux kernel needs to address a page in high memory, it maps that page into a small virtual address space (kmap) window, operates on that page, and unmaps the page. The 64-bit architectures do not have this problem because their address space is huge.

Paging. Virtual memory is implemented in many ways, but the most effective way is hardware-based. Virtual address space is divided into fixed-size chunks called pages. Virtual memory references are translated into addresses in physical memory using page tables. To support various architectures and page sizes, Linux uses a three-level paging mechanism. The three types of page tables are as follows:

Page Global Directory (PGD)

Page Middle Directory (PMD)

Page Table (PTE)

Address translation provides a way to separate the virtual address space of a process from the physical address space. Each page of virtual memory can be marked "present" or "not present" in the main memory. If a process references an address in virtual memory that is not present, hardware generates a page fault, which is handled by the kernel. The kernel handles the fault and brings the page into main memory. In this process, the system might have to replace an existing page to make room for the new one.
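To make the three-level lookup concrete, the sketch below splits a 32-bit virtual address into the three table indices plus a page offset. The field widths (2, 9, 9 and 12 bits) are an assumption taken from x86 with PAE and 4KB pages; other architectures and configurations use different widths, and the function name `split_vaddr` is hypothetical.

```python
# Field widths (assumption: x86 PAE with 4 KB pages): 2 + 9 + 9 + 12 = 32 bits.
PGD_BITS, PMD_BITS, PTE_BITS, OFFSET_BITS = 2, 9, 9, 12

def split_vaddr(vaddr):
    """Return (pgd_index, pmd_index, pte_index, page_offset) for a 32-bit address."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    pte = (vaddr >> OFFSET_BITS) & ((1 << PTE_BITS) - 1)
    pmd = (vaddr >> (OFFSET_BITS + PTE_BITS)) & ((1 << PMD_BITS) - 1)
    pgd = (vaddr >> (OFFSET_BITS + PTE_BITS + PMD_BITS)) & ((1 << PGD_BITS) - 1)
    return pgd, pmd, pte, offset

print(split_vaddr(0xC0000000))  # (3, 0, 0, 0): start of the 1GB kernel region
print(split_vaddr(0x00201234))  # (0, 1, 1, 564)
```

Each index selects one entry in the corresponding table; the entry found at the last level supplies the physical page frame, to which the offset is added.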

The replacement policy is one of the most critical aspects of the paging system. Linux 2.6 fixed various problems surrounding the page selection and replacement that were present in previous versions of Linux.

Swapping. Swapping is the moving of an entire process to and from secondary storage when the main memory is low. Many modern operating systems, including Linux, do not use this approach, mainly because context switches are very expensive. Instead, they use paging. In Linux, swapping is performed at the page level rather than at the process level. The main advantage of swapping is that it expands the process address space that is usable by a process. As the kernel needs to free up memory to make room for new pages, it may need to discard some of the less frequently used or unused pages. Some of the pages cannot be freed up easily because they are not backed by disks. Instead, they have to be copied to a backing store (swap area) and need to be read back from the backing store when needed. One major disadvantage of swapping is speed. Generally, disks are very slow, so swapping should be eliminated whenever possible.

NEWS CONTENTS

20110304 : Linux Scheduler simulation by M. Tim Jones ( Linux Scheduler simulation, Mar 04, 2011 )
20110304 : Operating Systems Lecture Notes Lecture 6 CPU Scheduling by Martin C. Rinard ( Operating Systems Lecture Notes Lecture 6 CPU Scheduling, )
20110304 : Anatomy of Linux process management by M. Tim Jones ( Dec 20, 2008 , developerWorks )
20110304 : Linux kernel advances ( Linux kernel advances, )
20080909 : Linux.com Kernel tuning with sysctl by Federico Kereki ( Linux.com Kernel tuning with sysctl, Sep 09, 2008 )
20080909 : Getting somewhere? ( )
20080909 : LPI exam 201 prep, Topic 201: Linux kernel ( LPI exam 201 prep, Topic 201: Linux kernel, )
20080909 : Linux Kernel Compiling - Intel® Software Network ( Linux Kernel Compiling - Intel® Software Network, )
20080909 : The Process Model of Linux Application Development ( The Process Model of Linux Application Development, )
210210 : Become a Linux Kernel Hacker and Write Your Own Module ( Become a Linux Kernel Hacker and Write Your Own Module, )

Old News ;-)

[Mar 04, 2011] Linux Scheduler simulation by M. Tim Jones

Operating Systems Lecture Notes Lecture 6 CPU Scheduling by Martin C. Rinard

What is CPU scheduling? Determining which processes run when there are multiple runnable processes. Why is it important? Because it can have a big effect on resource utilization and the overall performance of the system.

By the way, the world went through a long period (late 80's, early 90's) in which the most popular operating systems (DOS, Mac) had NO sophisticated CPU scheduling algorithms. They were single threaded and ran one process at a time until the user directed them to run another process. Why was this true? More recent systems (Windows NT) are back to having sophisticated CPU scheduling algorithms. What drove the change, and what will happen in the future?

Basic assumptions behind most scheduling algorithms:

There is a pool of runnable processes contending for the CPU.

The processes are independent and compete for resources.

The job of the scheduler is to distribute the scarce resource of the CPU to the different processes ``fairly'' (according to some definition of fairness) and in a way that optimizes some performance criteria.

In general, these assumptions are starting to break down. First of all, CPUs are not really that scarce - almost everybody has several, and pretty soon people will be able to afford lots. Second, many applications are starting to be structured as multiple cooperating processes. So, a view of the scheduler as mediating between competing entities may be partially obsolete.

How do processes behave? First, CPU/IO burst cycle. A process will run for a while (the CPU burst), perform some IO (the IO burst), then run for a while more (the next CPU burst). How long between IO operations? Depends on the process.

IO Bound processes: processes that perform lots of IO operations. Each IO operation is followed by a short CPU burst to process the IO, then more IO happens.

CPU bound processes: processes that perform lots of computation and do little IO. Tend to have a few long CPU bursts.

One of the things a scheduler will typically do is switch the CPU to another process when one process does IO. Why? The IO will take a long time, and don't want to leave the CPU idle while wait for the IO to finish.

When look at CPU burst times across the whole system, have the exponential or hyperexponential distribution in Fig. 5.2.

What are possible process states?

Running - process is running on CPU.

Ready - ready to run, but not actually running on the CPU.

Waiting - waiting for some event like IO to happen.

When do scheduling decisions take place? When does CPU choose which process to run? Are a variety of possibilities:

When process switches from running to waiting. Could be because of IO request, because wait for child to terminate, or wait for synchronization operation (like lock acquisition) to complete.

When process switches from running to ready - on completion of interrupt handler, for example. Common example of interrupt handler - timer interrupt in interactive systems. If scheduler switches processes in this case, it has preempted the running process. Another common case interrupt handler is the IO completion handler.

When process switches from waiting to ready state (on completion of IO or acquisition of a lock, for example).

When a process terminates.

How to evaluate scheduling algorithm? There are many possible criteria:

CPU Utilization: Keep CPU utilization as high as possible. (What is utilization, by the way?).

Throughput: number of processes completed per unit time.

Turnaround Time: mean time from submission to completion of process.

Waiting Time: Amount of time spent ready to run but not running.

Response Time: Time between submission of requests and first response to the request.

Scheduler Efficiency: The scheduler doesn't perform any useful work, so any time it takes is pure overhead. So, need to make the scheduler very efficient.

Big difference: Batch and Interactive systems. In batch systems, typically want good throughput or turnaround time. In interactive systems, both of these are still usually important (after all, want some computation to happen), but response time is usually a primary consideration. And, for some systems, throughput or turnaround time is not really relevant - some processes conceptually run forever.

Difference between long and short term scheduling. Long term scheduler is given a set of processes and decides which ones should start to run. Once they start running, they may suspend because of IO or because of preemption. Short term scheduler decides which of the available jobs that long term scheduler has decided are runnable to actually run.

Let's start looking at several vanilla scheduling algorithms.

First-Come, First-Served. One ready queue, OS runs the process at head of queue, new processes come in at the end of the queue. A process does not give up CPU until it either terminates or performs IO.

Consider performance of FCFS algorithm for three compute-bound processes. What if have 3 processes P1 (takes 24 seconds), P2 (takes 3 seconds) and P3 (takes 3 seconds). If arrive in order P1, P2, P3, what is

Waiting Time? (0 + 24 + 27) / 3 = 17

Turnaround Time? (24 + 27 + 30) / 3 = 27.

Throughput? 3 processes in 30 seconds, i.e. one process per 30 / 3 = 10 seconds.

What about if processes come in order P2, P3, P1? What is

Waiting Time? (0 + 3 + 6) / 3 = 3

Turnaround Time? (3 + 6 + 30) / 3 = 13.

Throughput? 3 processes in 30 seconds, i.e. one process per 30 / 3 = 10 seconds.
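The averages for both arrival orders can be checked with a few lines of Python. This is a minimal sketch that assumes all three processes are submitted at time 0 and never block for IO; the function name `fcfs_metrics` is hypothetical.

```python
def fcfs_metrics(bursts):
    """bursts: list of (name, cpu_seconds) in arrival order, all arriving at t=0.

    Returns (average waiting time, average turnaround time).
    """
    clock = 0
    waits, turnarounds = [], []
    for _name, burst in bursts:
        waits.append(clock)          # time spent in the ready queue
        clock += burst
        turnarounds.append(clock)    # submission (t=0) to completion
    n = len(bursts)
    return sum(waits) / n, sum(turnarounds) / n

print(fcfs_metrics([("P1", 24), ("P2", 3), ("P3", 3)]))  # (17.0, 27.0)
print(fcfs_metrics([("P2", 3), ("P3", 3), ("P1", 24)]))  # (3.0, 13.0)
```

The total run time is 30 seconds in both cases; only the order changes, which is why throughput stays the same while the averages differ so much.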

Shortest-Job-First (SJF) can eliminate some of the variance in Waiting and Turnaround time. In fact, it is optimal with respect to average waiting time. Big problem: how does scheduler figure out how long will it take the process to run?

For long term scheduler running on a batch system, user will give an estimate. Usually pretty good - if it is too short, system will cancel job before it finishes. If too long, system will hold off on running the process. So, users give pretty good estimates of overall running time.

For short-term scheduler, must use the past to predict the future. Standard way: use a time-decayed exponentially weighted average of previous CPU bursts for each process. Let T_n be the measured length of the nth CPU burst and s_n the predicted length of that burst. Then, choose a weighting factor w, where 0 <= w <= 1, and compute the prediction for the next burst as s_{n+1} = w T_n + (1 - w) s_n. s_0 is defined as some default constant or system average.

w tells how to weight the most recent observation relative to the older history. If choose w = .5, last observation has as much weight as entire rest of the history. If choose w = 1, only last observation has any weight. Do a quick example.
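One possible quick example, as a sketch: the burst lengths 6, 4, 6, 4 and the starting guess s_0 = 10 are made-up illustration values, and the function name `predict_bursts` is hypothetical.

```python
def predict_bursts(measured, w=0.5, s0=10.0):
    """Return predictions s_0 .. s_n given measured bursts T_0 .. T_{n-1}."""
    predictions = [s0]
    for t in measured:
        # s_{n+1} = w * T_n + (1 - w) * s_n
        predictions.append(w * t + (1 - w) * predictions[-1])
    return predictions

print(predict_bursts([6, 4, 6, 4]))          # [10.0, 8.0, 6.0, 6.0, 5.0]
print(predict_bursts([6, 4, 6, 4], w=1.0))   # [10.0, 6.0, 4.0, 6.0, 4.0]
```

With w = 0.5 the prediction drifts from the initial guess toward the observed bursts; with w = 1 each prediction is simply the previous measurement.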

Preemptive vs. Non-preemptive SJF scheduler. Preemptive scheduler reruns scheduling decision when process becomes ready. If the new process has priority over running process, the CPU preempts the running process and executes the new process. Non-preemptive scheduler only does scheduling decision when running process voluntarily gives up CPU. In effect, it allows every running process to finish its CPU burst.

Consider 4 processes P1 (burst time 8), P2 (burst time 4), P3 (burst time 9) P4 (burst time 5) that arrive one time unit apart in order P1, P2, P3, P4. Assume that after burst happens, process is not reenabled for a long time (at least 100, for example). What does a preemptive SJF scheduler do? What about a non-preemptive scheduler?
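The question can be explored with a small unit-step simulation. This is a sketch under stated assumptions: arrivals at times 0, 1, 2 and 3, ties broken in favor of the earlier arrival, and no context-switch cost. The function name `sjf` is hypothetical.

```python
def sjf(procs, preemptive):
    """procs: list of (name, arrival, burst). Returns {name: completion_time}.

    Simulates one time unit per step; ties go to the earlier arrival.
    """
    remaining = {name: burst for name, _arr, burst in procs}
    arrival = {name: arr for name, arr, _burst in procs}
    done, clock, current = {}, 0, None
    while remaining:
        ready = [n for n in remaining if arrival[n] <= clock]
        if not ready:
            clock += 1                      # CPU idle until the next arrival
            continue
        # Preemptive: re-decide every step. Non-preemptive: only when the CPU is free.
        if preemptive or current not in remaining:
            current = min(ready, key=lambda n: (remaining[n], arrival[n]))
        remaining[current] -= 1
        clock += 1
        if remaining[current] == 0:
            del remaining[current]
            done[current] = clock
    return done

procs = [("P1", 0, 8), ("P2", 1, 4), ("P3", 2, 9), ("P4", 3, 5)]
print(sjf(procs, preemptive=True))    # {'P2': 5, 'P4': 10, 'P1': 17, 'P3': 26}
print(sjf(procs, preemptive=False))   # {'P1': 8, 'P2': 12, 'P4': 17, 'P3': 26}
```

Under these assumptions the preemptive scheduler interrupts P1 as soon as the shorter P2 arrives, giving an average waiting time of 6.5 against 7.75 for the non-preemptive version.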

Priority Scheduling. Each process is given a priority, then CPU executes process with highest priority. If multiple processes with same priority are runnable, use some other criteria - typically FCFS. SJF is an example of a priority-based scheduling algorithm. With the exponential decay algorithm above, the priorities of a given process change over time.

Assume we have 5 processes P1 (burst time 10, priority 3), P2 (burst time 1, priority 1), P3 (burst time 2, priority 3), P4 (burst time 1, priority 4), P5 (burst time 5, priority 2). Lower numbers represent higher priorities. What would a standard priority scheduler do?
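A sketch of the answer, assuming all five processes are ready at time 0 in the order listed, the scheduler is non-preemptive, and ties are broken FCFS (the text does not state arrival times, so these are assumptions). The function name `priority_schedule` is hypothetical.

```python
def priority_schedule(procs):
    """procs: list of (name, burst, priority) in arrival order.

    Lower priority number = higher priority. Returns (name, start, end) slices.
    """
    # sorted() is stable, so equal priorities keep their arrival (FCFS) order.
    order = sorted(procs, key=lambda p: p[2])
    clock, timeline = 0, []
    for name, burst, _prio in order:
        timeline.append((name, clock, clock + burst))
        clock += burst
    return timeline

procs = [("P1", 10, 3), ("P2", 1, 1), ("P3", 2, 3), ("P4", 1, 4), ("P5", 5, 2)]
timeline = priority_schedule(procs)
print(timeline)
# [('P2', 0, 1), ('P5', 1, 6), ('P1', 6, 16), ('P3', 16, 18), ('P4', 18, 19)]
print(sum(start for _n, start, _e in timeline) / len(timeline))  # 8.2
```

The last line is the average waiting time, since with everything ready at time 0 a process waits exactly until its start time.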

Big problem with priority scheduling algorithms: starvation or blocking of low-priority processes. Can use aging to prevent this - make the priority of a process go up the longer it stays runnable but isn't run.

What about interactive systems? Cannot just let any process run on the CPU until it gives it up - must give response to users in a reasonable time. So, use an algorithm called round-robin scheduling. Similar to FCFS but with preemption. Have a time quantum or time slice. Let the first process in the queue run until it expires its quantum (i.e. runs for as long as the time quantum), then run the next process in the queue.

Implementing round-robin requires timer interrupts. When schedule a process, set the timer to go off after the time quantum amount of time expires. If process does IO before timer goes off, no problem - just run next process. But if process expires its quantum, do a context switch. Save the state of the running process and run the next process.

How well does RR work? Well, it gives good response time, but can give bad waiting time. Consider the waiting times under round robin for 3 processes P1 (burst time 24), P2 (burst time 3), and P3 (burst time 4) with time quantum 4. What happens, and what is average waiting time? What gives best waiting time?
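The round-robin example can be traced with the sketch below. It assumes all three processes are ready at time 0 in the order P1, P2, P3 and ignores context-switch cost; the function name `round_robin` is hypothetical.

```python
from collections import deque

def round_robin(procs, quantum):
    """procs: list of (name, burst), all ready at t=0. Returns {name: waiting_time}."""
    queue = deque(procs)
    burst_of = dict(procs)
    clock, waiting = 0, {}
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        clock += run
        if remaining > run:
            queue.append((name, remaining - run))   # quantum expired: back of the queue
        else:
            waiting[name] = clock - burst_of[name]  # completion time minus CPU time used
    return waiting

w = round_robin([("P1", 24), ("P2", 3), ("P3", 4)], quantum=4)
print(w, sum(w.values()) / len(w))  # {'P2': 4, 'P3': 7, 'P1': 7} 6.0
```

P2 and P3 each finish within their first quantum, after which P1 runs alone until time 31.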

What happens with a really small quantum? It looks like you've got a CPU that is 1/n as powerful as the real CPU, where n is the number of processes. Problem with a small quantum - context switch overhead.

What about having a really small quantum supported in hardware? Then, you have something called multithreading. Give the CPU a bunch of registers and heavily pipeline the execution. Feed the processes into the pipe one by one. Treat memory access like IO - suspend the thread until the data comes back from the memory. In the meantime, execute other threads. Use computation to hide the latency of accessing memory.

What about a really big quantum? It turns into FCFS. Rule of thumb - want 80 percent of CPU bursts to be shorter than time quantum.

Multilevel Queue Scheduling - like RR, except have multiple queues. Typically, classify processes into separate categories and give a queue to each category. So, might have system, interactive and batch processes, with the priorities in that order. Could also allocate a percentage of the CPU to each queue.

Multilevel Feedback Queue Scheduling - Like multilevel scheduling, except processes can move between queues as their priority changes. Can be used to give IO bound and interactive processes CPU priority over CPU bound processes. Can also prevent starvation by increasing the priority of processes that have been idle for a long time.

A simple example of a multilevel feedback queue scheduling algorithm. Have 3 queues, numbered 0, 1, 2 with corresponding priority. So, for example, execute a task in queue 2 only when queues 0 and 1 are empty.

A process goes into queue 0 when it becomes ready. When run a process from queue 0, give it a quantum of 8 ms. If it expires its quantum, move to queue 1. When execute a process from queue 1, give it a quantum of 16. If it expires its quantum, move to queue 2. In queue 2, run a RR scheduler with a large quantum if in an interactive system or an FCFS scheduler if in a batch system. Of course, preempt queue 2 processes when a new process becomes ready.
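The three-queue scheme can be sketched as follows. To stay short, the sketch assumes all processes are ready at time 0 (so preemption by newly arriving processes never occurs), treats the quantum of 16 as milliseconds, and runs queue 2 as FCFS. The process names and burst lengths are made-up illustration values and the function name `mlfq` is hypothetical.

```python
from collections import deque

QUANTA = [8, 16, None]  # ms; None = run to completion (FCFS) in queue 2

def mlfq(procs):
    """procs: list of (name, burst_ms), all ready at t=0.

    Returns a list of (name, queue_level, start, end) execution slices.
    """
    queues = [deque(procs), deque(), deque()]
    clock, trace = 0, []
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)  # highest-priority non-empty queue
        name, remaining = queues[level].popleft()
        quantum = QUANTA[level]
        run = remaining if quantum is None else min(quantum, remaining)
        trace.append((name, level, clock, clock + run))
        clock += run
        if remaining > run:
            queues[level + 1].append((name, remaining - run))  # used its whole quantum: demote
    return trace

for piece in mlfq([("A", 5), ("B", 30), ("C", 12)]):
    print(piece)
```

In this run the short process A finishes in queue 0, C finishes in queue 1, and only the long process B sinks to queue 2.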

Another example of a multilevel feedback queue scheduling algorithm: the Unix scheduler. We will go over a simplified version that does not include kernel priorities. The point of the algorithm is to fairly allocate the CPU between processes, with processes that have not recently used a lot of CPU resources given priority over processes that have.

Processes are given a base priority of 60, with lower numbers representing higher priorities. The system clock generates an interrupt between 50 and 100 times a second, so we will assume a value of 60 clock interrupts per second. The clock interrupt handler increments a CPU usage field in the PCB of the interrupted process every time it runs.

The system always runs the highest priority process. If there is a tie, it runs the process that has been ready longest. Every second, it recalculates the priority and CPU usage field for every process according to the following formulas.

CPU usage field = CPU usage field / 2

Priority = CPU usage field / 2 + base priority

So, when a process has not used much CPU recently, its CPU usage field decays and its priority rises (its priority number falls). The priorities of IO bound processes and interactive processes therefore tend to be high and the priorities of CPU bound processes tend to be low (which is what you want).

Unix also allows users to provide a ``nice'' value for each process. Nice values modify the priority calculation as follows:

Priority = CPU usage field / 2 + base priority + nice value

So, you can reduce the priority of your process to be ``nice'' to other processes (which may include your own).
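The once-per-second recalculation can be traced with a few lines of code. This is a minimal sketch, assuming integer division for both formulas and the 60 ticks per second chosen above; the function names recalc and simulate are hypothetical, and the process is assumed to own the CPU for whole seconds at a time.

```python
BASE_PRIORITY = 60        # lower numbers mean higher priority
TICKS_PER_SECOND = 60     # assumed clock interrupt rate from the text

def recalc(cpu_usage, nice=0):
    """Apply the once-per-second recalculation (integer division is an assumption).
    Returns (decayed_cpu_usage, priority)."""
    cpu_usage = cpu_usage // 2
    priority = cpu_usage // 2 + BASE_PRIORITY + nice
    return cpu_usage, priority

def simulate(seconds, busy_seconds, nice=0):
    """Priority after each second for a process that is interrupted by every
    clock tick during its first `busy_seconds` seconds and is idle afterwards."""
    usage, history = 0, []
    for second in range(seconds):
        if second < busy_seconds:
            usage += TICKS_PER_SECOND     # one increment per clock interrupt
        usage, priority = recalc(usage, nice)
        history.append(priority)
    return history
```

For a process that is busy for two seconds and then idle, simulate(5, 2) gives the priority numbers 75, 82, 71, 65, 62: the number climbs while the process burns CPU and drifts back toward the base value of 60 once it stops. A positive nice value simply shifts every number upward.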

In general, multilevel feedback queue schedulers are complex pieces of software that must be tuned to meet requirements.

Anomalies and system effects associated with schedulers.

Priority interacts with synchronization to create a really nasty effect called priority inversion. A priority inversion happens when a low-priority thread acquires a lock, then a high-priority thread tries to acquire the lock and blocks. Any middle-priority threads will prevent the low-priority thread from running and unlocking the lock. In effect, the middle-priority threads block the high-priority thread.

How to prevent priority inversions? Use priority inheritance. Any time a thread holds a lock that other threads are waiting on, give the thread the priority of the highest-priority thread waiting to get the lock. Problem is that priority inheritance makes the scheduling algorithm less efficient and increases the overhead.
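Priority inheritance can be sketched as follows. This is an illustrative model, not a real scheduler or lock: the classes Task and Lock and the function pick_next are hypothetical, and the sketch assumes that a larger number means a more urgent thread (the opposite of the Unix numbering used earlier).

```python
class Task:
    def __init__(self, name, base_priority):
        self.name = name
        self.base_priority = base_priority   # larger number = more urgent (assumption)
        self.held = []                        # locks this task currently holds

    def priority(self):
        """Effective priority: the task's own priority or that of the most
        urgent task waiting on any lock it holds, whichever is higher."""
        waiting = [w.priority() for lock in self.held for w in lock.waiters]
        return max([self.base_priority] + waiting)

class Lock:
    def __init__(self):
        self.holder = None
        self.waiters = []

    def acquire(self, task):
        """Return True if the lock was taken, False if the task must block."""
        if self.holder is None:
            self.holder = task
            task.held.append(self)
            return True
        self.waiters.append(task)
        return False

    def release(self):
        """Release the lock and hand it to the most urgent waiter, if any."""
        self.holder.held.remove(self)
        self.holder = None
        if self.waiters:
            nxt = max(self.waiters, key=lambda t: t.priority())
            self.waiters.remove(nxt)
            self.acquire(nxt)

def pick_next(runnable):
    """The scheduler always runs the runnable task with the highest effective priority."""
    return max(runnable, key=lambda t: t.priority())
```

If a task with priority 1 holds the lock and a task with priority 9 blocks on it, the holder's effective priority becomes 9, so a priority-5 task can no longer keep it off the CPU. Once the holder releases the lock, it drops back to priority 1.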

Preemption can interact with synchronization in a multiprocessor context to create another nasty effect - the convoy effect. One thread acquires the lock, then suspends. Other threads come along, and need to acquire the lock to perform their operations. Everybody suspends until the thread that has the lock wakes up. At this point the threads are synchronized, and will convoy their way through the lock, serializing the computation. This drives down processor utilization.

If you have non-blocking synchronization via operations like LL/SC, you don't get convoy effects caused by suspending a thread competing for access to a resource. Why not? Because no thread holds a resource in a way that prevents other threads from accessing it.

Similar effect when scheduling CPU and IO bound processes. Consider a FCFS algorithm with several IO bound and one CPU bound process. All of the IO bound processes execute their bursts quickly and queue up for access to the IO device. The CPU bound process then executes for a long time. During this time all of the IO bound processes have their IO requests satisfied and move back into the run queue. But they don't run - the CPU bound process is running instead - so the IO device idles. Finally, the CPU bound process gets off the CPU, and all of the IO bound processes run for a short time then queue up again for the IO devices. Result is poor utilization of IO device - it is busy for a time while it processes the IO requests, then idle while the IO bound processes wait in the run queues for their short CPU bursts. In this case an easy solution is to give IO bound processes priority over CPU bound processes.

In general, a convoy effect happens when a set of processes need to use a resource for a short time, and one process holds the resource for a long time, blocking all of the other processes. Causes poor utilization of the other resources in the system.
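The CPU side of the convoy can be illustrated with a small calculation. This is a rough sketch, assuming every job is ready at time 0 and runs to completion in list order under FCFS; the burst lengths are made-up numbers and the function names are hypothetical.

```python
def fcfs_waits(bursts_ms):
    """Waiting time of each job when jobs run to completion in list order (FCFS)."""
    waits, clock = [], 0
    for burst in bursts_ms:
        waits.append(clock)      # a job waits for everything queued ahead of it
        clock += burst
    return waits

def average(values):
    return sum(values) / len(values)

# One CPU bound job (100 ms) and three IO bound jobs with short 2 ms bursts.
long_first = fcfs_waits([100, 2, 2, 2])    # short jobs are stuck behind the long one
short_first = fcfs_waits([2, 2, 2, 100])   # short jobs get the CPU first
```

With the long job at the head of the queue, the average wait is 76.5 ms; with the short bursts served first it is 3 ms. That gap is why giving IO bound processes priority over CPU bound ones keeps the IO device busy.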

Permission is granted to copy and distribute this material for educational purposes only, provided that the following credit line is included: "Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard." Permission is granted to alter and distribute this material provided that the following credit line is included: "Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard."

Anatomy of Linux process management by M. Tim Jones

Dec 20, 2008 | developerWorks

Linux is a very dynamic system with constantly changing computing needs. The representation of the computational needs of Linux centers around the common abstraction of the process. Processes can be short-lived (a command executed from the command line) or long-lived (a network service). For this reason, the general management of processes and their scheduling is very important.

From user-space, processes are represented by process identifiers (PIDs). From the user's perspective, a PID is a numeric value that uniquely identifies the process. A PID doesn't change during the life of a process, but PIDs can be reused after a process dies, so it's not always ideal to cache them.

In user-space, you can create processes in any of several ways. You can execute a program (which results in the creation of a new process) or, within a program, you can invoke a fork or exec system call. The fork call results in the creation of a child process, while an exec call replaces the current process context with the new program. I discuss each of these methods to understand how they work.
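The fork-then-exec pattern described above looks roughly like this when written against Python's os module, which wraps the same system calls. It is a minimal sketch for Unix-like systems; the helper names spawn and demo are hypothetical, and the exit code 127 for a failed exec follows shell convention rather than anything in the article.

```python
import os
import sys

def spawn(argv):
    """Fork a child, replace the child's image with the program in argv,
    and return the child's exit status as seen by the parent."""
    pid = os.fork()                     # returns 0 in the child, the child's PID in the parent
    if pid == 0:
        try:
            os.execvp(argv[0], argv)    # on success this call never returns
        except OSError:
            pass
        os._exit(127)                   # exec failed: leave the child immediately
    _, status = os.waitpid(pid, 0)      # parent: wait for the child to finish
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

def demo():
    """Run a tiny child program that exits with status 3."""
    return spawn([sys.executable, "-c", "raise SystemExit(3)"])
```

The PID returned by fork identifies the child only while it lives (and until it has been waited for), which is why the parent collects the status right away instead of storing the PID for later.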

For this article, I build the description of processes by first showing the kernel representation of processes and how they're managed in the kernel, then review the various means by which pro

Linux kernel advances

What's new in 2.6.28?

Linux kernel 2.6.28 was released on December 24, 2008 (at release 5 as of early February, 2009). This first release of 2.6.28 includes a large number of changes, so many that the change-log text file is itself almost 6MB in size. This release is viewed as so stable that it's the kernel of the next Ubuntu distribution, version 9.04, Jaunty Jackalope.

The fourth extended file system

The fourth extended file system (ext4) was renamed from ext4dev to ext4, which means that it's considered stable enough for regular use. Ext4 is the successor to the standard third extended file system (ext3) available today, but with better performance, features, and reliability. Ext4 permits exabyte file systems that can support larger numbers of files, larger files, and deeper directory structures. It also includes extents with multi-block and delayed block allocation for performance. Ext4 is both forward and backward compatible (meaning that you can mount an ext4 file system on an ext3 disk format and vice versa, depending upon the features used). You can also gradually migrate a file system from ext3 to ext4 online with a mass change. For links to more information about the ext4 file system, see Resources.

And although ext4 will be the new standard Linux file system for some time to come, other file systems are coming that offer even better scalability and features. One such file system, Btrfs, is available in an experimental form in the 2.6.29 kernel. Btrfs is a Linux-compatible file system (read GNU General Public License [GPL]) that competes in features with the well-known ZFS.

Graphics Execution Manager memory management

One of the areas that has seen solid improvements over the past year is the Linux graphics stack. Not surprisingly, it's also an area where graphics processor units (GPUs) provide useful assists for rendering. In many cases, GPUs are more powerful than the central processing units (CPUs) they assist.

To support the GPUs of today and tomorrow, one area of the Linux graphics stack that needed improvement was memory management, including buffer management, page mapping, placement, and caching. This was necessary because graphics applications, particularly three-dimensional applications, can consume a vast amount of memory. The Graphics Execution Manager (GEM) helps here by providing ways to manage graphics data that blends into the kernel using the existing kernel subsystems (such as using the shared memory file system, or shmfs, to manage graphic objects).

Boot tracer

Although the time required to boot Linux has shrunk over time, expectations are still that it takes too long. For that reason, boot times remain under scrutiny. This kernel includes a new feature to measure and record the timings of init calls. The timings can be used later to visualize the flow and performance of the boot process. This process is configurable (it requires enabling to collect the data), but once collected, the data can be analyzed using offline scripts (including graphical depictions), which will ultimately lead to better boot times and a more optimized boot process. This update incorporates the process identifier (PID) of the calling thread so that the parallelism of the boot process can be viewed.

Freezer

Based conceptually on the idea of suspending an operating system for the purpose of migrating it to a new host (for example, virtual machine, or VM, migration), a new capability called freezing (and thawing) has been committed. This new feature allows either a group of tasks or a file system to be frozen and kept in its freeze-time state, later to be thawed to reintroduce the task group or file system.

You freeze tasks in the context of a container, which is a scheme that virtualizes operating systems at the user-space level (a single kernel supports multiple user spaces). This new functionality is a step in the direction of migrating a set of processes between hosts, which can be very useful for load balancing. You can also freeze file systems to support snapshots for file system backup. Currently, file system freezing is achieved through an ioctl with an argument of FIFREEZE or FITHAW.

Outside of containers, this new freeze/thaw scheme can find uses in checkpointing. In this application, you could freeze a collection of related processes at specific intervals (checkpoints), then thaw a particular epoch as a way to roll back to a known good state.

Improved virtual memory scalability

As Linux finds increasing use in virtualized systems, particularly those with many processors and vast amounts of memory, the ability to scale memory usage becomes critical to performance. Kernel 2.6.28 includes a number of scalability enhancements related to memory. For example, this kernel maintains separate Least Recently Used (LRU) lists, one for pages backed by files and another for pages backed by swap. This allows the kernel to focus on swap-backed pages, which are more likely to be written to disk, and pay less attention to file-backed pages.

Another change separates the evictable pages from the unevictable pages (such as those that were locked through mlock). In this way, the pageout code does not need to iterate unevictable pages in the LRU list, leading to improved performance in systems with very large numbers of pages.

[Sep 09, 2008] Linux.com Kernel tuning with sysctl by Federico Kereki

The Linux kernel is flexible, and you can even modify the way it works on the fly by dynamically changing some of its parameters, thanks to the sysctl command. Sysctl provides an interface that allows you to examine and change several hundred kernel parameters in Linux or BSD. Changes take effect immediately, and there's even a way to make them persist after a reboot. By using sysctl judiciously, you can optimize your box without having to recompile your kernel, and get the results immediately.

To start getting a taste of what sysctl can modify, run sysctl -a and you will see all the possible parameters. The list can be quite long: in my current box there are 712 possible settings.
$ sysctl -a
kernel.panic = 0
kernel.core_uses_pid = 0
kernel.core_pattern = core
kernel.tainted = 129
...many lines snipped...
If you want to get the value of just a single variable, use something like sysctl vm.swappiness, or just sysctl vm to list all variables that start with "vm." Add the -n option to output just the variable values, without the names; -N has the opposite effect, and produces the names but not the values.

You can change any variable by using the -w option with the syntax sysctl -w variable=value. For example, sysctl -w net.ipv6.conf.all.forwarding=1 sets the corresponding variable to true (0 equals "no" or "false"; 1 means "yes" or "true") thus allowing IPv6 forwarding. You may not even need the -w option -- it seems to be deprecated. Do some experimenting on your own to confirm that.

For more information, run man sysctl to display the standard documentation.
sysctl and the /proc directory
The /proc/sys virtual directory also provides an interface to the sysctl parameters, allowing you to examine and change them. For example, the /proc/sys/vm/swappiness file is equivalent to the vm.swappiness parameter in sysctl.conf; just forget the initial "/proc/sys/" part, substitute dots for the slashes, and you get the corresponding sysctl parameter. (By the way, the substitution is not actually required; slashes are also accepted, though it seems everybody goes for the notation with the dots instead.) Thus, echo 10 >/proc/sys/vm/swappiness is exactly the same as sysctl -w vm.swappiness=10. But as a rule of thumb, if a /proc/sys file is read-only, you cannot set it with sysctl either.
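The name mapping between /proc/sys files and sysctl parameters is mechanical enough to write down. The following is a minimal sketch with hypothetical helper names; it only rewrites strings and does not touch the running kernel. It also assumes that no path component contains a dot itself, which is not always true (some network interface names do).

```python
PROC_SYS = "/proc/sys/"

def proc_path_to_sysctl(path):
    """Drop the leading /proc/sys/ and substitute dots for the slashes."""
    if not path.startswith(PROC_SYS):
        raise ValueError("not a /proc/sys path: " + path)
    return path[len(PROC_SYS):].replace("/", ".")

def sysctl_to_proc_path(name):
    """Inverse mapping: dots become slashes under /proc/sys/."""
    return PROC_SYS + name.replace(".", "/")
```

For example, proc_path_to_sysctl("/proc/sys/vm/swappiness") yields "vm.swappiness", and sysctl_to_proc_path("vm.swappiness") gives back the file name.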

sysctl values are loaded at boot time from the /etc/sysctl.conf file. This file can have blank lines, comments (lines starting either with a "#" character or a semicolon), and lines in the "variable=value" format. For example, my own sysctl.conf file is listed below. If you want to apply it at any time, you can do so with the command sysctl -p.
# Disable response to broadcasts.
net.ipv4.icmp_echo_ignore_broadcasts = 1
# enable route verification on all interfaces
net.ipv4.conf.all.rp_filter = 1
# enable ipV6 forwarding
net.ipv6.conf.all.forwarding = 1
# increase the number of possible inotify(7) watches
fs.inotify.max_user_watches = 65536
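The file format described above (blank lines, comments starting with "#" or a semicolon, and "variable=value" lines) is simple enough to parse by hand. The sketch below is an illustration of that format only, not the parser sysctl itself uses; the function name parse_sysctl_conf is hypothetical, and lines without an "=" are silently skipped as an assumption.

```python
def parse_sysctl_conf(text):
    """Return a dict of variable -> value from sysctl.conf-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":      # skip blank lines and comments
            continue
        name, sep, value = line.partition("=")
        if not sep:                          # no '=' on the line: ignore it (assumption)
            continue
        settings[name.strip()] = value.strip()
    return settings
```

Feeding it the listing above returns four entries, for instance "fs.inotify.max_user_watches" mapped to the string "65536". Values are kept as strings because sysctl variables are not all numeric.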
 
Getting somewhere?

With so many tunable parameters, how do you decide what to do? Alas, this is a sore point with sysctl: most of the relevant documentation is hidden in the many source files of the Linux kernel, and isn't easily available, and it doesn't help that the explanations given are sometimes arcane and difficult to understand. You may find something in the /usr/src/linux/Documentation/sysctl directory, but most (if not all) files there refer to kernel 2.2, and seemingly haven't been updated in the last several years.

Looking around for books on the subject probably won't help much. I found hack #71 in O'Reilly's Linux Server Hacks, Volume 2, from 2005, but that was about it. Several other books include references to sysctl, but as to specific parameters or hints, you are on your own.

As an experiment, I tried looking for information on the swappiness parameter, which can optimize virtual memory management. The /usr/src/linux/Documentation/sysctl/vm.txt file didn't even refer to it, probably because this parameter appeared around version 2.6 of the kernel. Doing a general search in the complete /usr/src/linux directory turned up five files that mention "swappiness": three "include" (.h) files in include/linux, plus kernel/sysctl.c and mm/vmscan.c. The latter file included the information:
 
/*
 * From 0 .. 100.  Higher means more swappy.
 */
int vm_swappiness = 60;
 
That was it! You can see the default value (60) and a minimal reference to the field meaning. How helpful is that?

My suggestion would be to use sysctl -a to learn the available parameters, then Google around for extra help. You may find, say, an example of changing the shared memory allocation to solve a video program problem, or an explanation on vm.swappiness, or even more suggestions for optimizing IPv4 network traffic.

sysctl shows yet another aspect of the great flexibility of Linux systems. While documentation for it is not widely available, learning its features and capabilities on your own can help you get even more performance out of your box. That's system administration at its highest (or lowest?) level.

Read in the original layout at: http://www.linux.com/feature/146599

LPI exam 201 prep, Topic 201: Linux kernel

In this tutorial, David Mertz begins preparing you to take the Linux Professional Institute Intermediate Level Administration (LPIC-2) Exam 201. In this first of a series of eight tutorials, you will learn to understand, compile, and customize a Linux kernel.

Linux Kernel Compiling - Intel® Software Network

Linux* kernel compilation presents a workload that represents a common software development task, and is included in standard benchmark suites by trade publications to test CPU and system performance.
The purpose of this document is two-fold: to demonstrate parallel build of the Linux kernel; and to evaluate the Intel® Extended Memory 64 Technology (Intel EM64T) performance benefit on the Intel processors. This study is based on 3.6 GHz Intel Xeon® processor with Intel EM64T.
Intel EM64T is an enhancement to Intel IA-32 architecture. An IA-32 processor equipped with this technology is compatible with the existing IA-32 software. This enables the software to access more memory address space, and allows for the co-existence of software written for the 32-bit linear address space with software capable of accessing the 64-bit linear address space.
A minor configuration change on the Intel EM64T platforms, enabling Hyper-Threading Technology (HT Technology) and building the Linux kernel in multistream mode (by adding a single parameter to the build process), delivers significant performance benefit over the default configuration and build process. Several key results indicate a performance benefit with HT Technology turned on, and from Intel EM64T.
Linux kernel 2.6.4*, which is freely available, is evaluated in this study. Red Hat EL 3.0 distribution is used on all hosts. All Intel platforms considered in this study are enabled with the HT Technology and include DP 3.6GHz Nocona, and 3.2GHz Intel Xeon platforms.
Following are the key objectives of this paper:

To evaluate the HT Technology benefit with Intel processors for multistream Linux kernel build.

To review Linux kernel build performance on Intel processors with Intel EM64T.

The Process Model of Linux Application Development

One of Unix's hallmarks is its process model. It is the key to understanding access rights, the relationships among open files, signals, job control, and most other low-level topics in this book. Linux adopted most of Unix's process model and added new ideas of its own to allow a truly lightweight threads implementation.

Become a Linux Kernel Hacker and Write Your Own Module

Soulskill :
M-Saunders (706738) writes "It might sound daunting, but kernel hacking isn't a mysterious black art reserved for the geekiest of programmers. With a bit of background knowledge, anyone with a grounding in C can implement a new kernel module and understand how the kernel works internally. Linux Voice explains how to write a module that creates a new device node, /dev/reverse, that reverses a string when it's written to it. Sure, it's not the most practical example in the world, but it's a good starting point for your own projects, and gives you an insight into how it all fits together."

MindPrison (864299) | yesterday | (#47102547)

Very true... (4, Interesting)

...I remember my first meeting with Slackware, it was a Linux distro that provoked any user to learn stuff from scratch, and you HAD to use the command line (bash/shell) to install it if you wanted to use it. This forced me to learn Linux. (At least some of the basics)
It also came with a Kernel compilation system + all the needed libraries and packages, so compiling to your own computer was a few commands and worked right out of the box. And then my curiosity got piqued and this drove me to go into the configuration and find out how I could optimize my kernel to fit my needs. In the beginning it was a lot of trial and error, and it looked real daunting, but after a few tries - it wasn't nearly as scary. Before you knew it, I was coding my first stuff in C++. A lot of fun, actually.
So yeah, by all means - if you guys have the time, the curiosity, do go ahead and code something, but do yourself a favor - start off easy.

ADRA (37398) | yesterday | (#47102567)

Umm (4, Insightful)
Well yes, any C developer (already a minority in the umbrella of 'programmers' these days) can write code for the kernel, but just because one can write software for the kernel doesn't mean they can write anything meaningful to be done in kernel space vs. anywhere else. If you're expecting a slew of new driver hackers reverse engineering chipsets, and implementing better drivers, testing all corner cases (because devs LOVE testing) I think you're barking up a very small tree, but all the luck to you, because what's good for Linux is good for me, you, us all.

shoor (33382) | yesterday | (#47103871)

Re:First Tutorial I've seen with Goto... (2)
I got my intro to programming in the mid 1960s with 'the college computer' a PDP-8 that we programmed in Fortran using punched cards. In those days, just getting access to a computer was a pretty big deal, but things were changing, so 'programming paradigms' started appearing, and the first one that I remember was 'structured programming'. This is where I first heard the mantra of 'goto-less' programming. (Before that, the mantra was not to write self-modifying code, which was something you almost had to be writing assembly language code to be able to do, though COBOL had an 'alters' statement as I recall.)
I remember being somewhat startled by the idea of excluding gotos. How could you write non trivial code without any goto statements? I actually thought of it almost as a challenge to figure out how to do so. The opposite of structured code was 'spaghetti code'. Anyway, it's become a conventional bit of wisdom that I suppose is just automatically passed down to each generation of students without anyone ever seriously questioning it, except those who find they really need it sometimes. At some point I started defiantly putting an occasional goto in my code again, but not often.

Eravnrekaree (467752) | yesterday | (#47103379)

Writing modules near impossible (3, Interesting)
While the article shows a cute little example on how to write a useless module, it does not show anyone how to actually write a serious kernel module. The Linux kernel has never been known for documenting kernel internals, such documentation is scant at best and simply not sufficient to write a module.
It is safe to say that due to the poor practices of Kernel developers who consistently ignore good practice by not Documenting Their Crap, the kernel is an elite club of developers with knowledge that is secret. The practices of the Linux kernel development is just sheer sloppiness, horribly bad practice.
They could have easily set up a Wiki and documented the interfaces and their architecture. What we see with the kernel developers is that they do not care about anyone else, not users, and not even outside techies, so why would they care about whether or not an outsider can understand the kernel, just as why would they care if a user can upgrade kernel versions without having all of their device drivers blow up.
As anyone well versed in computer science knows, computer code is rarely self documenting, especially the kernel, and trying to reverse document a large software project is an outrageous waste of time and can be enough of a problem that it keeps even seasoned programmers away from the project. A huge piece of undocumented code is just not worth the effort to learn.

10.1 Defining a Process

What exactly is a process? In the original Unix implementations, a process was any executing program. For each program, the kernel kept track of

The current location of execution (such as waiting for a system call to return from the kernel), often called the program's context

Which files the program had access to

The program's credentials (which user and group owned the process, for example)

The program's current directory

Which memory space the program had access to and how it was laid out

A process was also the basic scheduling unit for the operating system. Only processes were allowed to run on the CPU.

10.1.1 Complicating Things with Threads

Although the definition of a process may seem obvious, the concept of threads makes all of this less clear-cut. A thread allows a single program to run in multiple places at the same time. All the threads created (or spun off) by a single program share most of the characteristics that differentiate processes from each other. For example, multiple threads that originate from the same program share information on open files, credentials, current directory, and memory image. As soon as one of the threads modifies a global variable, all the threads see the new value rather than the old one.
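The shared-memory behavior of threads is easy to observe. The following is a minimal sketch using Python's threading module; the names shared, guard, worker and run_threads are hypothetical, and the lock is there only so that the concurrent increments do not get lost.

```python
import threading

shared = {"count": 0}        # a single global object, visible to every thread in the process
guard = threading.Lock()     # serializes the read-modify-write on the shared counter

def worker(increments):
    """Each thread updates the same global counter, not a private copy."""
    for _ in range(increments):
        with guard:
            shared["count"] += 1

def run_threads(num_threads=4, increments=1000):
    """Start the threads, wait for them, and return the final shared value."""
    shared["count"] = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Four threads doing 1000 increments each leave the counter at 4000, because all of them wrote to the same memory. Separate processes created with fork would each have incremented their own copy instead.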

Many Unix implementations (including AT&T's canonical System V release) were redesigned to make threads the fundamental scheduling unit for the kernel, and a process became a collection of threads that shared resources. As so many resources were shared among threads, the kernel could switch between threads in the same process more quickly than it could perform a full context switch between processes. This resulted in most Unix kernels having a two-tiered process model that differentiates between threads and processes.

10.1.2 The Linux Approach

Linux took another route, however. Linux context switches had always been extremely fast (on the same order of magnitude as the new "thread switches" introduced in the two-tiered approach), suggesting to the kernel developers that rather than change the scheduling approach Linux uses, they should allow processes to share resources more liberally.

Under Linux, a process is defined solely as a scheduling entity and the only thing unique to a process is its current execution context. It does not imply anything about shared resources, because a process creating a new child process has full control over which resources the two processes share (see the clone() system call described on page 153 for details on this). This model allows the traditional Unix process management approach to be retained while allowing a traditional thread interface to be built outside the kernel.

Luckily, the differences between the Linux process model and the two-tiered approach surface only rarely. In this book, we use the term process to refer to a set of (normally one) scheduling entities which share fundamental resources, and a thread is each of those individual scheduling entities. When a process consists of a single thread, we often use the terms interchangeably. To keep things simple, most of this chapter ignores threads completely. Toward the end, we discuss the clone() system call, which is used to create threads (and can also create normal processes).

Sys Admin Magazine DTrace -- Most Exposing Solaris Tool Ever Peter Baer Galvin

DTrace is a powerful new tool that's part of the Solaris 10 release and is available in pre-release via the Software Express for Solaris mechanism discussed in the April 2004 Solaris Companion. Because it is unique, DTrace is a bit difficult to describe. In this column, I'll summarize the features of DTrace, but I'll leave it to the Solaris kernel engineers who wrote DTrace to explore it with me in a series of questions and answers. I think that by the time you are finished hearing the engineers talk about DTrace, and once you experience it yourself, you'll agree with me that it's a brilliant piece of work that adds greatly to the ability to understand the workings of Solaris.

Interview with Solaris Kernel Engineer Andy Tucker - OSNews.com

1. Why have the other commercial Unixes all pretty much bitten the dust? Is Solaris that much better, or is it just more important to Sun than HP-UX was to HP, AIX to IBM or IRIX to SGI?

Andy Tucker: I think the most important thing Sun has done to ensure the success of Solaris is simply to remain committed to it. Even in the early days of Solaris, when most Sun customers were still running SunOS 4.x and other companies with UNIX implementations were starting to look at NT, Sun stayed focused on Solaris.

You can also look at some of the "big bets" that were made early in Solaris development. One of the most significant was that of designing in support for multithreading and multiprocessing from the ground up. Doing this work up front allowed Solaris to easily scale on large multiprocessors, and to handle the multithreaded workloads that are increasingly common.

2. Do you think that the proprietary, company-supported development effort that you're a part in has any specific benefits over the Linux kernel's Linus-and-his-henchmen method?

Andy Tucker: The main advantage Sun has is that we can make sure our efforts are well integrated and are focused on the needs of Sun's customers. There's a lot of great stuff available for Linux, but the decentralized development model means that someone who's looking for, say, both a fair-share CPU scheduler and network QoS support has to pull the pieces out of different places, build them into a kernel, and hope they work together. Solaris has these as built-in, integrated components that just need to be switched on.

3. Technically speaking, what do you think of the Linux kernel and the Mach kernel? Also, how does FreeBSD 5.x compare to Solaris?

Andy Tucker: I think they're all fine operating systems, each with their strengths and weaknesses. Mach broke a lot of new ground: it was the first microkernel OS to get widespread use and introduced some basic concepts (such as processor sets) that we've since borrowed in Solaris. Linux obviously has a huge developer base, and as a result there's a tremendous amount of activity and energy around it. FreeBSD (and the other *BSD implementations) are inheritors of the BSD legacy and have been the source of a lot of interesting ideas.

I don't really like to do head-to-head comparisons, since I like to think of OS development as a collaborative exercise. We're all working to improve the state of the art and to make life easier for our users. The open source operating systems are often a source of new and interesting ideas; I hope the developers of those operating systems see Solaris similarly.

4. Solaris has some very complex algorithms. STREAMS, page coloring, and multi-level scheduling are all more complex than what is usually implemented in UNIX kernels. In retrospect, which Solaris features have really paid off, despite their complexity, and which ones have not?

Andy Tucker: I'll note that most modern operating systems incorporate some sort of page coloring and multi-level scheduling algorithms; Solaris is hardly unique in this regard. I think that in most cases the significant work we've done has paid off; the complexity (if any) is usually required to meet the customer requirements. We're also happy to rewrite things if we find a better or simpler way to do something.

On the other hand, there are obviously some features that haven't really succeeded in the customer base, such as NIS+. And there are also some cases where we took a direction with the underlying technology that turned out to be a mistake. An example is the two-level thread scheduling model, where thread scheduling happens both at user level and in the kernel. Although this approach had some theoretical advantages in terms of thread creation and context switch time, it turned out to be enormously complicated, particularly when dealing with traditional Unix process semantics like signals. In Solaris 8, we made an "alternate" version of the threads library available that relied solely on kernel-based scheduling; it turned out to be not only much simpler and easier to maintain, but also faster in almost every case. It particularly sped up Java code, which is obviously important to us. In Solaris 9 (and later) we switched over to the single-level library as the only one available.

5. What do you think about the Cathedral vs. Bazaar idea when applied to OS kernels, where the programming model is rather different from that of regular application programs?

Andy Tucker: In some ways the Cathedral vs. Bazaar distinction seems a bit artificial. I don't know about other OS companies, but within Sun we have hundreds of engineers from all over the company working on different parts of the operating system. Many of these people aren't actually part of the Solaris engineering organization; they work on different hardware platforms, or on storage devices, or in the research labs, or on some other product that touches on Solaris in some way. We continuously release the latest code for internal use throughout the development cycle, and do beta tests to get feedback from customers. So in a way we're doing "Bazaar" style development, even though it's a commercial product and all developers are Sun employees.

The difficulty with this type of development, particularly on a large complex piece of software like an OS kernel, is ensuring that changes are architecturally consistent, well integrated, and of appropriate quality. This doesn't mean there can't be a large development community; it just means there needs to be some person or persons checking proposed changes to make sure they're not going to cause a problem. In Linux, this role is filled by Linus and some of the other folks working with him, who review the changes going into the official kernel base. Within Sun, we have groups of senior engineers who similarly review proposed changes for quality, appropriateness, completeness, etc.

Solaris Kernel Tuning

sysdef -i reports on several system resource limits. Other parameters can be checked on a running system using adb -k:

adb -k /dev/ksyms /dev/mem

parameter-name/D

^D (to exit)

More information on kernel tuning is available in Sun's online documentation.

maxusers

The maxusers kernel parameter is the one most often tuned. By default, it is set to the number of MB of physical memory or 1024, whichever is lower. It cannot be set higher than 2048.

Several kernel parameters are set when maxusers is set unless otherwise overridden by the /etc/system file. Some of these formulas differ between different versions of Solaris:

max_nprocs: Number of processes = 10 + (16 x maxusers)

ufs_ninode: Inode cache size = (17 x maxusers) + 90 (Solaris 2.5.1) or 4 x (maxusers + max_nprocs) + 320 (Solaris 2.6-8). See the Disk I/O page for more information.

ncsize: Name lookup cache size = (17 x maxusers) + 90 (Solaris 2.5.1) or 4 x (maxusers + max_nprocs) + 320 (Solaris 2.6-8). See the Disk I/O page for more information.

ndquot: Quota table size = (maxusers x 10) + max_nprocs

maxuproc: User process limit = max_nprocs - 5
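The formulas above are easy to check by hand for a given maxusers value. The following is a minimal sketch in Python that only restates them; the function name derived_limits and the release labels "2.5.1" and "2.6-8" are hypothetical names chosen for this illustration, and the results are defaults only, since any of these parameters can be overridden in /etc/system.

```python
def derived_limits(maxusers, solaris_release="2.6-8"):
    """Return the defaults that follow from maxusers, per the formulas above.

    The function name and the release labels are illustrative assumptions,
    not part of any Solaris tool.
    """
    # The text states that maxusers cannot be set higher than 2048.
    if maxusers < 1 or maxusers > 2048:
        raise ValueError("maxusers must be between 1 and 2048")

    max_nprocs = 10 + 16 * maxusers

    # ufs_ninode and ncsize share one formula, which changed after 2.5.1.
    if solaris_release == "2.5.1":
        cache_size = 17 * maxusers + 90
    elif solaris_release == "2.6-8":
        cache_size = 4 * (maxusers + max_nprocs) + 320
    else:
        raise ValueError("unknown release label: %r" % (solaris_release,))

    return {
        "max_nprocs": max_nprocs,
        "ufs_ninode": cache_size,
        "ncsize": cache_size,
        "ndquot": maxusers * 10 + max_nprocs,
        "maxuproc": max_nprocs - 5,
    }


if __name__ == "__main__":
    for name, value in derived_limits(256).items():
        print("%-11s %d" % (name, value))
```

For example, a machine with maxusers of 256 on Solaris 2.6-8 would get max_nprocs of 4106 and an inode cache of 17768 entries under these formulas.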

ptys

Solaris 8 dynamically sizes the number of ptys available to a system, so you are less likely to run into pty starvation than was the case under Solaris 2.5.1-7. There are still hard system limits that are set based upon hardware configuration, and it may be necessary to increase the number of ptys manually as in Solaris 2.5.1-7.

If the system is suffering from pty starvation, the number of ptys available can be increased by increasing pt_cnt above the default of 48. Solaris 2.5.1 and 2.6 systems should not have pt_cnt set higher than 3844 due to limitations with the telnet and rlogin daemons. Solaris 7 does not have this restriction, but there may be other system issues that prevent setting pt_cnt arbitrarily high. Once pt_cnt is increased, a reconfiguration boot (boot -r) is required to build the ptys.

If pt_cnt is increased, some sources recommend that other variables be set at the same time. Other sources (such as the Solaris2 FAQ) suggest that this advice is spurious and results in a needless consumption of resources. See the notes below before making any of these changes; setting the values too high may result in wasted memory. In any case, one form of these recommendations is:

npty: Set to pt_cnt (see the note below)

nautopush: Set to twice the value of pt_cnt

sadcnt: Set to same value as pt_cnt

npty limits the number of BSD ptys. These are not usually used by applications, but may need to be increased on a system running a special service. In addition to setting npty in the /etc/system file, the /etc/iu.ap file will need to be edited to substitute the value npty-1 in the third field of the ptsl line. After both changes are made, a boot -r is required for the changes to take effect. Note that Solaris does not support any more than 176 BSD ptys in any case.

sadcnt sets the number of STREAMS addressable devices and nautopush sets the number of STREAMS autopush entries. nautopush should be set to twice sadcnt. Whether or not these values need to be increased as above depends on the types of activity on the system.
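The relationships between these pty-related values can be sketched as a small calculation. This Python sketch only encodes the rules quoted above; the names pty_settings and etc_system_lines are hypothetical, capping npty at 176 is one reading of the note about BSD ptys, and the output assumes the usual "set name=value" form of /etc/system entries. As the text warns, whether to set anything beyond pt_cnt is disputed.

```python
BSD_PTY_MAX = 176         # stated above: no more than 176 BSD ptys
LEGACY_PT_CNT_MAX = 3844  # stated above for Solaris 2.5.1 and 2.6


def pty_settings(pt_cnt, solaris_release="7"):
    """Return one form of the recommended values for a given pt_cnt.

    Function name and release labels are illustrative assumptions.
    """
    if pt_cnt < 1:
        raise ValueError("pt_cnt must be positive")
    if solaris_release in ("2.5.1", "2.6") and pt_cnt > LEGACY_PT_CNT_MAX:
        raise ValueError("pt_cnt above 3844 is not advised on this release")

    sadcnt = pt_cnt
    return {
        "pt_cnt": pt_cnt,
        # Assumption: follow pt_cnt but never exceed the BSD pty ceiling.
        "npty": min(pt_cnt, BSD_PTY_MAX),
        "nautopush": 2 * sadcnt,
        "sadcnt": sadcnt,
    }


def etc_system_lines(settings):
    """Render the settings as candidate /etc/system lines (assumed syntax)."""
    return ["set %s=%d" % (name, value) for name, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(etc_system_lines(pty_settings(256))))
```

The sketch does not touch /etc/iu.ap; the ptsl line still has to be edited by hand as described above, followed by a boot -r.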

RAM Tuneables

See the Memory/Swapping page for a discussion of parameters related to RAM and paging.

Disk I/O Tuneables

See the Disk I/O page for a full discussion of disk I/O-related tuneables.

IPC Tuneables

Check the IPC Tuning page for InterProcess Communication-related resource parameters.

File Descriptors

See the File Descriptors page for more discussion regarding tuning issues.

File descriptors are retired when the file is closed or the process terminates. Opens always choose the lowest-numbered file descriptor available. The number of file descriptors available to a process is governed by two parameters:

rlim_fd_cur: The soft (default) limit. It is dangerous to set this value higher than 256 due to limitations with the stdio library. If programs require more file descriptors, they should use setrlimit directly.

rlim_fd_max: The hard limit. It is dangerous to set this value higher than 1024 due to limitations with select. If programs require more file descriptors, they should use setrlimit directly.
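As an illustration of a program raising its own limit with setrlimit instead of changing the system-wide defaults, here is a minimal sketch using Python's standard resource module, which wraps getrlimit and setrlimit. The function names are hypothetical, and the second function simply demonstrates the lowest-numbered-descriptor rule mentioned above in a single-threaded process.

```python
import os
import resource


def raise_fd_soft_limit(wanted):
    """Raise this process's soft descriptor limit toward `wanted`.

    Returns the soft limit in effect afterwards. The soft limit is never
    lowered and the hard limit is left untouched.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    # An unprivileged process may raise its soft limit only up to the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def reopened_fd_matches_closed_fd():
    """Check that open() hands back the lowest free descriptor number."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)
    # `first` is now the lowest free number, so the next open reuses it.
    third = os.open(os.devnull, os.O_RDONLY)
    os.close(second)
    os.close(third)
    return third == first


if __name__ == "__main__":
    print("soft limit now:", raise_fd_soft_limit(512))
    print("lowest descriptor reused:", reopened_fd_matches_closed_fd())
```

Raising the limit per process keeps the system defaults low for programs that rely on stdio or select while letting the few programs that need many descriptors ask for them explicitly.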

Misc Tuneables

dump_cnt: Size of dumps.

rstchown: POSIX/restricted chown enabled (default=1)

ngroups_max: Maximum number of supplementary groups per user (default=32).

A quick guide for repairing your kernel from a live CD

OSNews.com

Posted by special contributor Ben Hughes on 2004-10-05 19:16:46 UTC

GNU/Linux, like all other operating systems, is based around a kernel, which controls hardware access and maximizes CPU and RAM efficiency by controlling when programs get to use those resources and how much of them they get. The difference between Linux and most other operating systems (closed source ones at least; you can also do this with BSD and other open source OSes) is that you can compile the kernel to meet your needs.

Step 1. Basics of the kernel.

I will most likely never have to use an old serial modem, so I would not compile in the drivers for it. Linux also supports modules, which are drivers that don't load until you tell them to. Modules are useful for things that you rarely use. For example, I don't use ReiserFS personally, but if a friend who does needs me to retrieve data from a hard drive, I don't want to have to recompile my kernel to help; instead I just type modprobe reiserfs. Compiling a kernel in Linux is fairly easy if you know basically what you are doing, and that is what this article hopes to explain.

If you have a working system and just want a new kernel to improve performance, to get up to date, or for bragging rights, skip to Step 3.

If you f00barred your system and need to install a new kernel from a live CD, keep on reading.

Step 2. Chrooting from Knoppix

Okay, this step is very easy: it involves opening a Konsole and typing, as root:

mount /dev/ -rw /mnt/linux

mount /dev/ -rw /mnt/linux/

chroot /mnt/linux

Well, that concludes that step. Basically, you just mount all your required Linux partitions and then chroot into the result. (Yes, you have to know which partitions those are. If you feel like you are going to b0rk your install soon and still have normal access to the computer, print out your /etc/fstab.)
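If you did keep a copy of /etc/fstab, working out what to mount and in which order can be sketched in a few lines. This is a minimal illustration in Python, not part of the original guide: the function name chroot_mount_plan, the list of filesystem types to skip, and the sample device names in the usage note are all assumptions. It only builds a plan; it does not run mount.

```python
# Filesystem types that are not disk partitions worth mounting for a chroot.
# This list is an assumption for the sketch and is certainly incomplete.
SKIP_TYPES = {"swap", "proc", "sysfs", "tmpfs", "devpts", "iso9660", "udf"}


def chroot_mount_plan(fstab_text, chroot_root="/mnt/linux"):
    """Return (device, target) pairs, parents first, for mounting under chroot_root."""
    plan = []
    for raw_line in fstab_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fs_type = fields[:3]
        if fs_type in SKIP_TYPES:
            continue
        if not device.startswith("/dev/") or not mount_point.startswith("/"):
            continue
        target = chroot_root if mount_point == "/" else chroot_root + mount_point
        plan.append((device, target))
    # Shallower paths first, so / is mounted before /boot, /usr/local, etc.
    plan.sort(key=lambda entry: entry[1].count("/"))
    return plan


if __name__ == "__main__":
    with open("/etc/fstab") as fstab:
        for device, target in chroot_mount_plan(fstab.read()):
            print("mount %s %s" % (device, target))
```

Given an fstab that lists a root partition, a /boot partition, swap, and /proc, the plan contains only the root and /boot entries, with root first; the printed lines are suggestions to review before typing them as root.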

Step 3. Configuring and Compiling the Kernel

Configuring the kernel is the hardest part of this. Before going into this, know your hardware. That said, download the sources for the latest kernel version from www.kernel.org or, if you are using Gentoo (if you are, you should have read the manual, but anyway...), emerge the version of kernel sources you want (such as gentoo-dev-sources, gentoo-gaming-sources, or whatever). Once they are downloaded, decompress and untar them to /usr/src and then create a linux symlink.

tar -xvjf .bz2 -C /usr/src

cd /usr/src

rm linux

ln -s linux

cd linux

Now you are in your kernel source directory, and it's time for the magic to happen. Type

make menuconfig

This will launch a rather nice interface for configuring the kernel. I will tell you what every system *needs* to function. First off, go under file systems and select all the ones you use, and under pseudo-filesystems select all of them. (NOTE: DO NOT set any of the file systems that you use constantly to modules; the computer will not be able to boot.) Now go into processor type and features and select the applicable options. Next, explore the device drivers. These are rather important, so go crazy here: make sure you include support for your network cards, block devices, sound cards, whatever. For the most part that should be it, but look through the other categories to make sure everything is happy. Once you are satisfied with your config, save and exit. Now it is time to actually compile the beast. Depending on your system this could take a while, so call the pizza guy if you must. Type

make && make modules_install

Now wait for it. While you are waiting, let's go over the next step: actually installing the kernel. What you have to do is copy the bzImage into your /boot directory. You do not have to call it bzImage; you can call it Bob or John or Alice or whatever. I usually just call it gentoo. Okay, the code to install is

cp arch/i386/boot/bzImage /boot/

cp System.map /boot/System.map

cp .config /boot/.config

Once that is done, all you have left to do is edit /etc/lilo.conf (or grub.conf, but I don't know much about GRUB; there is some good information online about it). For LILO, simply update lilo.conf. (Mine looks like this because I do some fancy things with it.)

boot=/dev/sda # Install LILO in the MBR

prompt # Give the user the chance to select another section

timeout=500 # Wait 50 seconds (the value is in tenths of a second) before booting the default section

default=gentoo # When the timeout has passed, boot the "gentoo" section

install=/boot/boot-bmp.b # means you will use the graphical version

bitmap=/boot/handy_128.bmp # background path

bmp-colors=38,68,53,112,38,25 # text color

bmp-table=114p,347p,2,7 # label position on the screen p=pixel

bmp-timer=470p,336p,25,0,11 # timer position on the screen p=pixel

#This is where you put kernel information for linux

image=/boot/gentoo #image name (what you named the bzImage)

label=gentoo # Name we give to this section

read-only # Start with a read-only root. Do not alter!

root=/dev/sda7 # Location of the root filesystem

# The next two lines are only if you dualboot with a Windows system.

# In this case, Windows is hosted on /dev/sda1.

other=/dev/sda1

label=windows

Once that is edited to include the latest information, simply run as root

lilo

Then everything should be happy if you did everything right. Now boot into your normal system and see if it works; if the kernel panics, try again. This takes a bit of practice, but once you understand it, it becomes easy.

About the Author
I am SchleyFox and I use Gentoo GNU/Linux. I go to www.usalug.org to get Linux help and so should you.

[Mar 22, 2000] LinuxWorld: Customizing the FreeBSD Kernel - FreeBSD for the Linux administrator

"This step-by-step guide includes a discussion of some of the core differences between the FreeBSD kernel and the Linux kernel; descriptions of the kernel configuration, build processes, and common kernel options; ways you can gather more information; and steps to take if you have trouble."


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of the Google privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 29, 2020
