Softpanorama

May the source be with you, but remember the KISS principle ;-)
Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

NFS version 3 Design and Operation

News Recommended Links NFS version 3 Design and Operation NFS4 Linux NFS implementation Solaris NFS implementation
nfsstat showmount Mounting NFS directory owned by root NFS performance tuning Mounting NFS Resources AutoFS and automountd daemon
Troubleshooting NFS Security History Tips  Humor Etc

Contents

Introduction

Like NIS, NFS was implemented as a set of RPC procedures that use eXternal Data Representation (XDR) encoding to pass arguments between client and server. A filesystem mounted using NFS provides two levels of transparency:

NFS achieves the first level of transparency by defining a generic set of filesystem operations that are performed on a Virtual File System (VFS). The second level comes from the definition of virtual nodes, which are related to the more familiar Unix filesystem inode structures but hide the actual structure of the physical filesystem beneath them. The set of all procedures that can be performed on files is the vnode interface definition. The vnode and VFS specifications together define the NFS protocol.

Virtual filesystems and virtual nodes

The Virtual File System allows a client system to access many different types of filesystems as if they were all attached locally. VFS hides the differences in implementations under a consistent interface. On a Unix NFS client, the VFS interface makes all NFS filesystems look like Unix filesystems, even if they are exported from IBM MVS or Windows NT servers. The VFS interface is really nothing more than a switchboard for filesystem- and file-oriented operations.

Actions that operate on entire filesystems, such as getting the amount of free space left in the filesystem, are called VFS operations; calls that operate on files or directories are vnode operations. On the server side, implementing a VFS entails taking the generic VFS and vnode operations and converting them into the appropriate actions on the real, underlying filesystem. This conversion happens invisibly to the NFS client process. It made a straightforward system call, which the client-side VFS turned into a vnode operation, and the server then converted into an equivalent operation on its filesystem.

For example, the chown( ) system call has an analogous operator in the vnode interface that sets the attributes of a file, as does the stat( ) system call that retrieves these attributes. There is not a strict one-to-one relationship of Unix system calls to vnode operations. The write( ) system call uses several filesystem calls to get a file's attributes, and append or modify blocks in the file. Some vnode operations are not defined on certain types of filesystems. The FAT filesystem, for example, doesn't have an equivalent of symbolic links, so an NFS file server running on an Windows NT machine rejects any attempts to use the vnode operation to create a symbolic link.

So far we have defined an interface to some filesystem objects, but not the mechanism used to "name" objects in the system. In a local Unix system call, these object names are file descriptors, which uniquely identify a file within the scope of a process. The counterparts of file descriptors in NFS are filehandles, which are opaque "pointers" to files on the remote system. An opaque handle is of no value to the client because it can only be interpreted in the context of the remote filesystem. When you want to make a system call on a file, you first get a file descriptor for it. To make an NFS call (in the kernel) you must get a filehandle for the vnode. It is up to the virtual filesystem layer to translate user-level file descriptors into kernel-level filehandles. Filehandles and their creation will be covered in more depth in the next section.

 NFS protocol and implementation

NFS is an RPC-based protocol, with a client-server relationship between the machine having the filesystem to be distributed and the machine wanting access to that filesystem. NFS kernel server threads run on the server and accept RPC calls from clients. These server threads are initiated by an nfsd daemon. NFS servers also run the mountd daemon to handle filesystem mount requests and some pathname translation. On an NFS client, asynchronous I/O threads (async threads) are usually run to improve NFS performance, but they are not required.

On the client, each process using NFS files is a client of the server. The client's system calls that access NFS-mounted files make RPC calls to the NFS servers from which these files were mounted. The virtual filesystem really just extends the operation of basic system calls like read( ) and write( ), similar to the way that NIS extends the operation of library calls like getpwuid( ). In NIS, the getpwuid( ) routine knows how to use the NIS RPC protocol to locate user information that isn't in the local /etc/passwd file. Within the virtual filesystem, the basic file- and filesystem-oriented system calls were modified to "know" how to operate on non-local filesystems.

Let's look at this with an example. On an NFS client, a user process executes a chmod( ) system call on an NFS-mounted file. The virtual filesystem passes this system call to NFS, which then executes a remote procedure call to set the permissions on the file, as specified in the process's system call. When the RPC completes, the system call returns to the user process. This example is fairly simple, because it doesn't involve any block I/O to get file data to or from the NFS server. When blocks of files are moved around, the async threads get involved to improve NFS performance. This section covers the protocols used by NFS and features of its implementation that were driven by performance or transparency goals.

NFS components

Each version of the NFS RPC protocol contains several procedures, each of which operates on either a file or a filesystem object. The basic procedures performed on an NFS server can be grouped into directory operations, file operations, link operations, and filesystem operations. Directory operations include mkdir and rmdir, which create and destroy directories like their Unix system call equivalents. readdir reads a directory, using an opaque directory pointer to perform sequential reads of the same directory. Other directory-oriented procedures are rename and remove, which operate on entries in a directory the same way the mv and rm commands do. create makes a new directory entry for a file.

The lookup operation is the heart of the pathname-to-filehandle translation mechanism. lookup finds a named directory entry and returns a filehandle pointing to it. The open( ) system call uses lookup( ) extensively: it breaks a pathname down into its components and locates each component in its parent directory. For example, open( ) would handle the pathname /home/thud/stern by performing three operations:

File operations are very closely associated with Unix system calls: read and write move data to and from the NFS client, and getattr and setattr get or modify the file's attributes. In a local filesystem, such as UFS, these attributes are stored in the file's inode, but file attributes are mapped to whatever system is used by the NFS server. Link operations include link, which creates a hard link on the server, and symlink and readlink which create and read the values of symbolic links, respectively. Finally, statfs is a filesystem operation that returns information about the mounted filesystem that might be needed by df, for example.

Other filesystem operations include mounting and unmounting a filesystem, but these are handled through the NFS mountd server rather than the server threads. Mount operations are separated from the NFS protocol because mount points revolve around pathnames, and pathname syntax is peculiar to each operating system. Unix and VMS, for example, do not use the same syntax to specify the path to a file. The mount protocol is responsible for turning the server's file pathname into information that NFS can use to locate the file in future operations.

From the preceding descriptions, it is fairly clear how the basic Unix system calls map into NFS RPC calls. It is important to note that the NFS RPC protocol and the vnode interface are two different things. The vnode interface defines a set of operating system services that are used to access all filesystems, NFS or local. Vnodes simply generalize the interface to file objects. There are many routines in the vnode interface that correspond directly to procedures in the NFS protocol, but the vnode interface also contains implementations of operating system services such as mapping file blocks and buffer cache management.

The NFS RPC protocol is a specific realization of one of these vnode interfaces. It is used to perform specific vnode operations on remote files. Using the vnode interface, new filesystem types may be plugged into the operating system by adding kernel routines that perform the necessary vnode operations on objects in that filesystem.

Statelessness and crash recovery

Older versions of the NFS protocol were stateless, meaning that there is no need to maintain information about the protocol on the server. The client keeps track of all information required to send requests to the server, but the server has no information about previous NFS requests, or how various NFS requests relate to each other. Remember the differences between the TCP and UDP protocols: UDP is a stateless protocol that can lose packets or deliver them out of order; TCP is a stateful protocol that guarantees that packets arrive and are delivered in order. The hosts using TCP must remember connection state information to recognize when part of a transmission was lost.

The choice of a stateless protocol has two implications for the design and implementation of NFS:

The primary motivation for choosing a stateless protocol was to minimize the burden of crash recovery. Unlike a database system, which must verify transaction logs and look for incomplete operations, NFS has no explicit crash recovery mechanism. Because no state is maintained, the server may reboot and begin accepting client NFS requests again as if nothing had happened. Similarly, when clients reboot, the server does not need to know anything about them. Each NFS request contains enough information to be completed without any reference to state on the client or server.

Request retransmission

NFS RPC requests are sent from a client to the server one at a time. A single client process will not issue another RPC call until the call in progress completes and has been acknowledged by the NFS server. In this respect NFS RPC calls are like system calls — a process cannot continue with the next system call until the current one completes. A single client host may have several RPC calls in progress at any time, coming from several processes, but each process ensures that its file operations are well ordered by waiting for their acknowledgements. Using the NFS async threads makes this a little more complicated, but for now it's helpful to think of each process sending a stream of NFS requests, one at a time.

When a client makes an RPC request, it sets a timeout period during which the server must service and acknowledge it. If the server doesn't get the request because it was lost along the way, or because the server is too overloaded to complete the request within the timeout period, the client retransmits the request. Requests are idempotent (if the server has a duplicate request cache), so no harm is done if the server executes the same request twice — when the NFS client gets a second confirmation from the RPC request, the client discards it.

NFS clients continue to retransmit requests until the request completes, either with an acknowledgement from the server or an error from the RPC layer. If an NFS server crashes, clients continue to repeat the call to the RPC layer (if the NFS filesystem is hard-mounted, otherwise the RPC timeout error is returned to the application) until the server reboots and can service them again. When the server is up again, NFS clients continue as if nothing happened. NFS clients cannot tell the difference between a server that has crashed and one that is very slow. This raises some important issues for tuning NFS servers and networks, which will be visited in Section 18.1.

The duplicate request cache on NFS servers usually contains a few hundred entries — the last few seconds (at most) of NFS requests on a busy server. This cache is limited in size to establish a "window" in which non-idempotent NFS requests are considered duplicates caused by retransmission rather than distinct requests. For example, if you execute:

% rm foo

on an NFS client, the client may need to send two or more remove requests to the NFS server before it receives an acknowledgment. It's up to the NFS server to weed out the duplicate remove requests, even if they are a second or so apart. However, if you execute rm foo on Monday, and then on Tuesday you execute the same command in the same directory (where the file has already been removed), you would be very surprised if rm did not return an error. Executing this "duplicate request" a day later should produce this familiar error:

% rm foo 
rm: foo: No such file or directory

To distinguish between duplicates generated due to an RPC timeout and retry and duplicates due to you repeating a command (whether it be a day later or a second later), NFS servers record a 32-bit RPC transaction identifier (xid ) with each entry in the duplicate request cache. The xid is part of every RPC request's header, and it is expected that the NFS client will generate unique xids.

Preserving Unix filesystem semantics

The VFS makes all filesystems appear homogeneous to user processes. There is a single Unix system call interface that operates on files, and the VFS and underlying vnode interface translate semantics of these system calls into actions appropriate for each type of underlying filesystem. It's important to stress the difference between syntax and semantics of system calls. Consistent syntax means that the system calls take the same arguments independent of the underlying filesystem. Semantics refers to what the system calls actually do: preserving semantics across different filesystem types means that a system call will have the same net effect on the files in each filesystem type. Unix filesystem semantics collectively refers to the way in which Unix files behave when various sequences of system calls are made. For example, opening a file and then unlinking it doesn't cause the file's data blocks to be released until the close( ) system call is made. A new filesystem that wants to maintain Unix filesystem semantics must support this behavior.

The VFS definition makes it possible to ensure that semantics are preserved for all filesystems, so they all behave in the same manner when Unix system calls are made on their files. It is easy to use VFS to implement a filesystem with non-Unix semantics. It's also possible to integrate a filesystem into the VFS interface without supporting all of the Unix semantics; for example, you can put FAT (a filesystem used in MS-DOS, Windows, and NT operating systems) filesystems under VFS, but you can't create Unix-like symbolic links on them because the native FAT filesystem doesn't support symbolic links.

In this section, we'll look at how NFS deals with Unix filesystem semantics, including some of the operations that aren't exactly the same under NFS. NFS has slightly different semantics than the local Unix filesystem, but it tries to preserve the Unix semantics. An application that works with a local filesystem works equally well with an NFS-mounted filesystem and will not be able to distinguish between the two.

Consistency at the vnode interface level makes NFS a powerful tool for creating filesystem hierarchies using many different NFS servers. The mount command requires that a filesystem be mounted on a directory; but directories are vnodes themselves. An NFS filesystem can be mounted on any vnode, which means that NFS filesystems can be mounted on top of other NFS filesystems or local filesystems. This is completely consistent with the way in which local disks are mounted on local filesystems. /net may be on the root filesystem, and /net/host is mounted on top of it. A workstation configured using NFS can create a view of the filesystems on the network that best meets its requirements by mounting these filesystems with a directory naming scheme of its choice.

Maintaining other Unix filesystem semantics is not quite as easy. Locking operations, for example, introduce state into a system that was meant to be stateless. This problem is addressed by a separate lock manager daemon. Another bit of Unix lore that had be preserved was the retention of an open file's data blocks, even when the file's directory entry was removed. Many Unix utilities including shells and mailers, use this "delayed unlink" feature to create temporary files that have no name in the filesystem, and are therefore invisible to probing users.

A complete solution to the problem would require that the server keep open file reference counts for each file and not free the file's data blocks until the reference count decreased to zero. However, this is precisely the kind of state information that makes crash recovery difficult, so NFS was implemented with a client-side solution that handles the common applications of this feature. When a remove operation is performed on an open file, the client issues a rename NFS RPC instead. The file is renamed to .nfsXXXX, where XXXX is a suffix to make the filename unique. When the file is eventually closed, the client issues the remove operation on the previously unlinked file. Note that there is no need for an "open" or "close" NFS RPC procedure, since "opened" and "closed" are states that are maintained on the client. It is still possible to confuse two clients that attempt to unlink a shared, open NFS-mounted file, since one client will not know that the other has the file open, but it emulates the behavior of a local filesystem sufficiently to eliminate the need to change utilities that rely on it.

Pathnames and filehandles

All NFS operations use filehandles to designate the files or directories on which they will be performed. Filehandles are created on the server and contain information that uniquely identifies the file or directory on the server. The client's NFS mount and lookup requests retrieve these filehandles for existing files. A side effect of making all vnodes homogeneous is that file pathname lookup must be done one component at a time. Each directory in the pathname might be a mount point for another filesystem, so each name look-up request cannot include multiple components. For example, let's look at Client A that NFS-mounts the /usr/local filesystem and also NFS-mounts a filesystem on /usr/local/bin:

clientA# mount server1:/usr/local /usr/local 
clientA# mount server2:/usr/local/bin.mips /usr/local/bin

When the NFS client reaches the bin component in the pathname, it realizes that there is an NFS filesystem mounted on this directory, and it sends its lookup requests to server2 instead of server1. If the NFS client passed the whole pathname to server1, it might get the wrong answer on its lookup: server1 has its own /usr/local/bin directory that may or may not be the same directory that Client A has mounted. While this may seem to be a very expensive series of operations, the kernel keeps a directory name lookup cache (DNLC) that prevents every look-up request from going to an NFS server.

The lookup operation takes a filename and a filehandle for a directory, and returns a filehandle pointing to the named file on the server. How then does the pathname traversal get started, if every lookup requires a filehandle from a previous pathname resolution? The mount operation seeds the lookup process by providing a filehandle for the root of the mounted filesystem. Within NFS, the only procedure that accepts full pathnames is the mount RPC, which turns the pathname into a filehandle for the mounted filesystem.

Let's look at how NFS turns the pathname /usr/local/bin/emacs into an NFS filehandle, assuming that it's on a filesystem mounted on /usr/local from server wahoo:

Filehandles are opaque to the client. In most NFS implementations on Unix machines, they are an encoding of the file's inode number, disk device number, and inode generation number. Other implementations, particularly non-Unix NFS servers that do not have inodes, encode their own native filesystem information in the filehandle. In any system, the filehandle is in a form that can be disassembled only on the NFS server. The structures contained in the filehandle are kept hidden from the client, the same way the structures in an object-oriented system are hidden in the object's implementation routines. In the case of NFS filehandles, the data described by the structure doesn't even exist on the client — it's all on the server, where the filehandle can be converted into a pointer to local file.

Filehandles become invalid, or stale, when the inodes to which they point (on the server) are freed or re-used. NFS clients have no way of knowing what other operations may be affecting objects pointed to by their filehandles, so there is no way to warn a client in advance that a filehandle is invalid. If an RPC call is made with a filehandle that is stale, the NFS server returns a stale filehandle error to the caller. Say that a user on one client removes an NFS-mounted directory and its contents using rm -rf test, while another client has a process using test as its current working directory. The next time the other process tries to read its working directory, it gets a stale filehandle error back from the NFS server:

Client A

Client B

cd /mnt/test

cd /mnt

 

rm -rf test

stat(.)-->Stale file handle

 

If one client removes a file and then creates a new file that re-uses the freed inode, other filehandles (on other clients) that point to the re-used inode must be marked stale. Inode generation numbers were added to the basic Unix filesystem to add a time history to an inode. In addition to the inode number, the filehandle must match the current generation number of the inode, or it is marked stale. When the inode is re-used for a new file, its generation number is incremented. Stale filehandles become a problem when one user's work tramples on an area in use by another, or when a filesystem on a server is rebuilt from a backup tape. When restoring from a dump tape onto a fresh filesystem, all of the inode generation numbers in the filesystem are set to random numbers. This causes every filehandle in use for that filesystem to become stale — every inode pointed to by a pre-restore filehandle now probably points to a completely different file on the disk.

Therefore, a quick way to cripple an NFS network is to restore a fileserver from a dump tape without rebooting the NFS clients. When you rebuild the server's filesystems, all of the inode generation numbers are reset; when you load the tape, files end up with different inode numbers and different inode generation numbers than they had on the original filesystem. All NFS client filehandles are now invalid because of the new generation numbers and the (random) renumbering of each file's inode. Any attempt to use an open filehandle results in stale filehandle errors. If you are going to restore an NFS-exported filesystem from tape, unmount it from its clients or reboot the clients.

 NFS Version 3

There are four versions of the NFS protocol: Versions 2, 3, and 4. Version 1 did exist, but it was only a prototype, and neither an implementation nor specification was ever released.  Version 3 has three major differences from Version 2:

Large file support

Version 2 supported files up to four gigabytes in length, though most implementations are limited to up to two-gigabyte files. Version 3 supports files up to and including 264 - 1 bytes in length. Large file support was the primary driver for a protocol revision.

Writes to unstable storage

Version 2 of the NFS protocol specified that NFS servers could not reply successfully to a write request until the data had been committed to stable storage, usually magnetic disk, but non-volatile RAM was permissible as well. This limited the write throughput of NFS clients, and so Version 3 of the protocol permits the client to indicate that the write need not be committed to stable storage. This allows NFS servers to respond quickly to write requests. Of course, clients are still interested in committing their data to stable storage, and so Version 3 has a new procedure called commit, which tells the NFS server to write the uncommitted data to stable storage before returning success.

The theory behind this, supported by experimental measurement, is that faster throughput is gained by the NFS server committing data to stable storage in parallel with the client doing something else (such as generating more NFS requests), before the client issues the commit. Typically, the NFS Version 3 client will issue a commit when it is about to close a file, or when buffer space is tight.

Large transfer sizes

NFS Version 2 had a limit of 8192 bytes per NFS read and write request. NFS Version 3 lets the client and server negotiate a mutually acceptable limit.

If packets are larger than the medium's MTU they must be fragmented. Fragmentation of output packets is easy, but the other direction, reassembly of input fragments, is harder if the fragments arrive out of order, or if a fragment is dropped or delayed. With larger NFS transfer sizes, the risk of a reassembly problem is higher, and if there is a problem, the entire datagram must be retransmitted, including all the fragments. NFS Version 2 was designed to be gentler to the network during the days when operating systems, routers, and network hardware were less capable. Nowadays, these components are much more effective, and so NFS Version 3 removes the artificial limits to transfer size.

NFS over TCP

Both NFS Version 2 and Version 3 operate over UDP and TCP. Since TCP is stateful, and NFS is stateless, it would seem to be a contradiction, if not an impossibility for NFS to operate over TCP. However, the layer between NFS and TCP is RPC, and RPC is implemented to hide state issues of TCP from NFS.

The first time an NFS client contacts a server over TCP, the RPC layer takes care of establishing a connection. If a server crashes, the client won't know that immediately, but the next time it sends a request over the connection, the connection will break due to a connection reset from the server, or a connection timeout. In either case, the RPC layer simply re-establishes a connection.

Some NFS/TCP implementations, such as that in Solaris, maintain a single connection between the NFS client and server, such that all traffic—for all users and mount points—is multiplexed between the client and server. Other implementations, such as those in the BSD releases, have one connection per mountpoint. Aside from a user-level NFS client like a web browser, or a Java application linked to NFS classes, you are not likely to encounter an NFS client that creates a connection per user.

If the client crashes, the server will periodically close connections that haven't been used in a while. On a Solaris NFS server, this connection idle timer defaults to six minutes.

NFS components

NFS is similar to other RPC services in its use of a server-side daemon (nfsd ) to process incoming requests. It differs from the typical client-server model in that processes on NFS clients make some RPC calls themselves, and other RPC calls are made by the clients' async threads. All of the NFS client and server code is contained in the kernel, instead of in the server daemon executable—a decision driven by performance requirements.

nfsd and NFS server threads

With all of the NFS code in the kernel, why bother with user processes for the server? Why not make NFS a purely kernel-to-kernel service, without any user processes? On systems that have an nfsd daemon, nfsd does the following:

What the aforementioned system call does varies among implementations. Two common variations are:

The alternative to multiple daemons or kernel thread support is that an NFS server is forced to handle one NFS request at a time. Running multiple daemons or kernel threads allows the server to have multiple, independent threads of execution, so the server can handle several NFS requests at once. Daemons or threads service requests in a pseudo-round robin fashion—whenever a daemon or thread is done with a request it goes to the end of the queue waiting for a new request. Using this scheduling algorithm, a server is always able to accept a new NFS request as long as at least one daemon or thread is waiting in queue. Running multiple daemons or threads lets a server start multiple disk operations at the same time and handle quick turnaround requests such as getattr and lookup while disk-bound requests are in progress.

Still, why do systems that have kernel server thread support need a running nfsd daemon process? With an NFS server that supported just UDP, it would be possible for it to simply exit once the endpoint was sent to the kernel. With the introduction of NFS/TCP implementations, transport endpoints get created and closed down continuously. Thus nfsd is needed to listen for, accept, and tell the kernel about new connections. Similarly, when the connections are broken, nfsd takes care of telling the kernel that the endpoint is about to be closed, and then closes it.

Client I/O system

On the client side, each process accessing an NFS-mounted filesystem makes its own RPC calls to NFS servers. A single process will be a client of many NFS servers if it is accessing several filesystems on the client. For operations that do not involve block I/O, such as getting the attributes of a file or performing a name lookup, having each process make its own RPC calls provides adequate performance. However, when file blocks are moved between client and server, NFS needs to use the Unix file buffer cache mechanism to provide throughput similar to that achieved with a local disk. On many implementations, the client-side async threadsare the parts of NFS that interact with the buffer cache.

Before looking at async threadsin detail, some explanation of buffer cache and file cache management is required. The traditional Unix buffer cache is a portion of the system's memory that is reserved for file blocks that have been recently referenced. When a process is reading from a file, the operating system performs read-ahead on the file and fills the buffer cache with blocks that the process will need in future operations. The result of this "pre-fetch" activity is that not all read( ) system calls require a disk operation: some can be satisfied with data in the buffer cache. Similarly, data that is written to disk is written into the cache first; when the cache fills up, file blocks are flushed out to disk. Again, the buffer cache allows the operating system to bunch up disk requests, instead of making every system call wait for a disk transfer.

SunOS 4.x, System V Release 4, and Solaris replace the buffer cache with a page mapping system. Instead of transferring files into and out of the buffer cache, the virtual memory management system directly maps files into a process's address space, and treats file accesses as page faults. Any page that is not being used by the system can be taken to cache file pages. The net effect is the same as that of a buffer cache, but the size of the cache is not fixed. The file page cache could be a large percentage of the system's memory if only one or two processes are doing file I/O operations. For this discussion, we'll refer to the in-memory copies of file blocks as the "buffer cache," whether it is implemented as a cache of file pages or as a traditional Unix buffer cache.

The client-side async threads improve NFS performance by filling and draining the buffer cache on behalf of NFS clients. When a process reads from an NFS-mounted file, it performs the read RPC itself. To pre-fetch data for the buffer cache, the kernel has the async threads send more read RPC requests to the server, as if the reading process had requested this data. NFS functions properly without any async threadson a client — but no read-ahead is done without them, limiting the throughput of the NFS filesystem. When the async threads are running, the client's kernel can initiate several RPC calls at the same time. If restricted to a single RPC call per process, NFS client performance suffers — sometimes dramatically.

When a client writes to a file, the data is put into the buffer cache. After a complete buffer is filled, the operating system writes out the data in the cache to the filesystem. If the data needs to be written to an NFS server, the kernel makes an RPC call to perform the write operation. If there are async threads available, they make the write RPC requests for the client, draining the buffer cache when the cache management system dictates. If no async threads can make the RPC call, the process calling write( ) performs the RPC call itself. Again, without any async threads, the kernel can still write to NFS files, but it must do so by forcing each client process to make its own RPC calls. The async threads allow the client to execute multiple RPC requests at the same time, performing write-behind on behalf of the processes using NFS files.

NFS read and write requests are performed in NFS buffer sizes. The buffer size used for disk I/O requests is independent of the network's MTU and the server or client filesystem block size. It is chosen based on the most efficient size handled by the network transport protocol, and is usually 8 kilobytes for NFS Version 2, and 32 kilobytes for NFS Version 3. The NFS client implements this buffering scheme, so that all disk operations are done in larger (and usually more efficient) chunks. When reading from a file, an NFS Version 2 read RPC requests an entire 8 kilobyte NFS buffer. The client process may only request a small portion of the buffer, but the buffer cache saves the entire buffer to satisfy future references.

For write requests, the buffer cache batches them until a full NFS buffer has been written. Once a full buffer is ready to be sent to the server, an async thread picks up the buffer and performs the write RPC request. The size of a buffer in the cache and the size of an NFS buffer may not be the same; if the machine has 2 kilobyte buffers then four buffers are needed to make up a complete 8 kilobyte NFS Version 2 buffer. The async thread attempts to combine buffers from consecutive parts of a file in a single RPC call. It groups smaller buffers together to form a single NFS buffer, if it can. If a process is performing sequential write operations on a file, then the async threads will be able to group buffers together and perform write operations with NFS buffer-sized requests. If the process is writing random data, it is likely that NFS writes will occur in buffer cache-sized pieces.

On systems that use page mapping (SunOS 4.x, System V Release 4, and Solaris), there is no buffer cache, so the notion of "filling a buffer" isn't quite as clear. Instead, the async threads are given file pages whenever a write operation crosses a page boundary. The async threads group consecutive pages together to form a single NFS buffer. This process is called dirty page clustering.

If no async threads are running, or if all of them are busy handling other RPC requests, then the client process performing the write( ) system call executes the RPC itself (as if there were no async threads at all). A process that is writing large numbers of file blocks enjoys the benefits of having multiple write RPC requests performed in parallel: one by each of the async threads and one that it does itself.

Smaller write requests that do not force an RPC call return to the client right away.

Doing the read-ahead and write-behind in NFS buffer-sized chunks imposes a logical block size on the NFS server, but again, the logical block size has nothing to do with the actual filesystem implementation on either the NFS client or server. We'll look at the buffering done by NFS clients when we discuss data caching and NFS write errors. The next section discusses the interaction of the async threads and Unix system calls in more detail.

 

The async threads exist in Solaris. Other NFS implementations use multiple block I/O daemons (biod daemons) to achieve the same result as async threads.

NFS kernel code

The functions performed by the parallel async threads and kernel server threads provide only part of the boost required to make NFS performance acceptable. The nfsd is a user-level process, but contains no code to process NFS requests. The nfsd issues a system call that gives the kernel a transport endpoint. All the code that sends NFS requests from the client and processes NFS requests on the server is in the kernel.

It is possible to put the NFS client and server code entirely in user processes. Unfortunately, making system calls is relatively expensive in terms of operating system overhead, and moving data to and from user space is also a drain on the system. Implementing NFS code outside the kernel, at the user level, would require every NFS RPC to go through a very convoluted sequence of kernel and user process transitions, moving data into and out of the kernel whenever it was received or sent by a machine.

The kernel implementation of the NFS RPC client and server code eliminates most copying except for the final move of data from the client's kernel back to the user process requesting it, and it eliminates extra transitions out of and into the kernel. To see how the NFS daemons, buffer (or page) cache, and system calls fit together, we'll trace a read( ) system call through the client and server kernels:

Obviously, changing the numbers of async threads and server threads, and the NFS buffer sizes impacts the behavior of the read-ahead (and write-behind) algorithms.

Caching

Caching involves keeping frequently used data "close" to where it is needed, or preloading data in anticipation of future operations. Data read from disks may be cached until a subsequent write makes it invalid, and data written to disk is usually cached so that many consecutive changes to the same file may be written out in a single operation. In NFS, data caching means not having to send an RPC request over the network to a server: the data is cached on the NFS client and can be read out of local memory instead of from a remote disk. Depending upon the filesystem structure and usage, some cache schemes may be prohibited for certain operations to guarantee data integrity or consistency with multiple processes reading or writing the same file. Cache policies in NFS ensure that performance is acceptable while also preventing the introduction of state into the client-server relationship.

File attribute caching

Not all filesystem operations touch the data in files; many of them either get or set the attributes of the file such as its length, owner, modification time, and inode number. Because these attribute-only operations are frequent and do not affect the data in a file, they are prime candidates for using cached data. Think of ls -l as a classic example of an attribute-only operation: it gets information about directories and files, but doesn't look at the contents of the files.

NFS caches file attributes on the client side so that every getattr operation does not have to go all the way to the NFS server. When a file's attributes are read, they remain valid on the client for some minimum period of time, typically three seconds. If the file's attributes remain static for some maximum period, normally 60 seconds, they are flushed from the cache. When an application on the NFS client modifies an NFS attribute, the attribute is immediately written back to the server. The only exceptions are implicit changes to the file's size as a result of writing to the file. As we will see in the next section, data written by the application is not immediately written to the server, so neither is the file's size attribute.

The same mechanism is used for directory attributes, although they are given a longer minimum lifespan. The usual defaults for directory attributes are a minimum cache time of 30 seconds and a maximum of 60 seconds. The longer minimum cache period reflects the typical behavior of periods of intense filesystem activity — files themselves are modified almost continuously but directory updates (adding or removing files) happen much less frequently.

The attribute cache can get updated by NFS operations that include attributes in the results. Nearly all of NFS Version 3's RPC procedures include attributes in the results.

Attribute caching allows a client to make a steady stream of access to a file without having to constantly get attributes from the server. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call.

In the previous section, we saw how the async thread fills and drains the NFS client's buffer or page cache. This presents a cache consistency problem: if an async thread performs read-ahead on a file, and the client accesses that information at some later time, how does the client know that the cached copy of the data is valid? What guarantees are there that another client hasn't changed the file, making the copy of the file's data in the buffer cache invalid?

An NFS client needs to maintain cache consistency with the copy of the file on the NFS server. It uses file attributes to perform the consistency check. The file's modification time is used as a cache validity check; if the cached data is newer than the modification time then it remains valid. As soon as the file's modification time is newer than the time at which the async thread read data, the cached data must be flushed. In page-mapped systems, the modification time becomes a "valid bit" for cached pages. If a client reads a file that never gets modified, it can cache the file's pages for as long as needed.

This feature explains the "accelerated make" phenomenon seen on NFS clients when compiling code. The second and successive times that a software module (located on an NFS fileserver) is compiled, the make process is faster than the first build. The reason is that the first make reads in header files and causes them to be cached. Subsequent builds of the same modules or other files using the same headers pick up the cached pages instead of having to read them from the NFS server. As long as the header files are not modified, the client's cached pages remain valid. The first compilation requires many more RPC requests to be sent to the server; the second and successive compilations only send RPC requests to read those files that have changed.

The cache consistency checks themselves are by the file attribute cache. When a cache validity check is done, the kernel compares the modification time of the file to the timestamp on its cached pages; normally this would require reading the file's attributes from the NFS server. Since file attributes are kept in the file's inode (which is itself cached on the NFS server), reading file attributes is much less "expensive" than going to disk to read part of the file. However, if the file attributes are not changing frequently, there is no reason to re-read them from the server on every cache validity check. The data cache algorithms use the file attribute cache to speed modification time comparisons.

Keeping previously read data blocks cached on the client does not introduce state into the NFS system, since nothing is being modified on the client caching the data. Long-lived cache data introduces consistency problems if one or more other clients have the file open for writing, which is one of the motivations for limiting the attribute cache validity period. If the attribute cache data never expired, clients that opened files for reading only would never have reason to check the server for possible modifications by other clients. Stateless NFS operation requires each client to be oblivious to all others and to rely on its attribute cache only for ensuring consistency. Of course, if clients are using different attribute cache aging schemes, then machines with longer cache attribute lifetimes will have stale data.

Client data caching

In the previous section, we looked at the async thread's management of an NFS client's buffer cache. The async threads perform read-ahead and write-behind for the NFS client processes. We also saw how NFS moves data in NFS buffers, rather than in page- or buffer cache-sized chunks. The use of NFS buffers allows NFS operations to utilize some of the sequential disk I/O optimizations of Unix disk device drivers.

Reading in buffers that are multiples of the local filesystem block size allows NFS to reduce the cost of getting file blocks from a server. The overhead of performing an RPC call to read just a few bytes from a file is significant compared to the cost of reading that data from the server's disk, so it is to the client's and server's advantage to spread the RPC cost over as many data bytes as possible. If an application sequentially reads data from a file in 128-byte buffers, the first read operation brings over a full (8 kilobytes for NFS Version 2, usually more for NFS Version 3) buffer from the filesystem. If the file is less than the buffer size, the entire file is read from the NFS server. The next read( ) picks up data that is in the buffer (or page) cache, and following reads walk through the entire buffer. When the application reads data that is not cached, another full NFS buffer is read from the server. If there are async threads performing read-ahead on the client, the next buffer may already be present on the NFS client by the time the process needs data from it. Performing reads in NFS buffer-sized operations improves NFS performance significantly by decoupling the client application's system call buffer size and the VFS implementation's buffer size.

Going the other way, small write operations to the same file are buffered until they fill a complete page or buffer. When a full buffer is written, the operating system gives it to an async thread, and async threads try to cluster write buffers together so they can be sent in NFS buffer-sized requests. The eventual write RPC call is performed synchronous to the async thread; that is, the async thread does not continue execution (and start another write or read operation) until the RPC call completes. What happens on the server depends on what version of NFS is being used.

There are elements of a write-back cache in the async threads. Queueing small write operations until they can be done in buffer-sized RPC calls leaves the client with data that is not present on a disk, and a client failure before the data is written to the server would leave the server with an old copy of the file. This behavior is similar to that of the Unix buffer cache or the page cache in memory-mapped systems. If a client is writing to a local file, blocks of the file are cached in memory and are not flushed to disk until the operating system schedules them. If the machine crashes between the time the data is updated in a file cache page and the time that page is flushed to disk, the file on disk is not changed by the write. This is also expected of systems with local disks — applications running at the time of the crash may not leave disk files in well-known states.

Having file blocks cached on the server during writes poses a problem if the server crashes. The client cannot determine which RPC write operations completed before the crash, violating the stateless nature of NFS. Writes cannot be cached on the server side, as this would allow the client to think that the data was properly written when the server is still exposed to losing the cached request during a reboot.

Ensuring that writes are completed before they are acknowledged introduces a major bottleneck for NFS write operations, especially for NFS Version 2. A single Version 2 file write operation may require up to three disk writes on the server to update the file's inode, an indirect block pointer, and the data block being written. Each of these server write operations must complete before the NFS write RPC returns to the client. Some vendors eliminate most of this bottleneck by committing the data to nonvolatile, nondisk storage at memory speeds, and then moving data from the NFS write buffer memory to disk in large (64 kilobyte) buffers. Even when using NFS Version 3, the introduction of nonvolatile, nondisk storage can improve performance, though much less dramatically than with NFS Version 2.

Using the buffer cache and allowing async threads to cluster multiple buffers introduces some problems when several machines are reading from and writing to the same file. To prevent file inconsistency with multiple readers and writers of the same file, NFS institutes a flush-on-close policy:

This ensures that a process on another NFS client sees all changes to a file that it is opening for reading:

 

Client A

Client B

open( )

 

write( )

 

NFS Version 3 only: commit

 

close( )

 
 

open( )

 

read( )

The read( ) system call on Client B will see all of the data in a file just written by Client A, because Client A flushed out all of its buffers for that file when the close( ) system call was made. Note that file consistency is less certain if Client B opens the file before Client A has closed it. If overlapping read and write operations will be performed on a single file, file locking must be used to prevent cache consistency problems. When a file has been locked, the use of the buffer cache is disabled for that file, making it more of a write-through than a write-back cache. Instead of bundling small NFS requests together, each NFS write request for a locked file is sent to the NFS server immediately.

Server-side caching

The client-side caching mechanisms — file attribute and buffer caching — reduce the number of requests that need to be sent to an NFS server. On the server, additional cache policies reduce the time required to service these requests. NFS servers have three caches:

Cache mechanisms on NFS clients and servers provide acceptable NFS performance while preserving many — but not all — of the semantics of a local filesystem. If you need finer consistency control when multiple clients are accessing the same files, you need to use file locking.

 File locking

File locking allows one process to gain exclusive access to a file or part of a file, and forces other processes requiring access to the file to wait for the lock to be released. Locking is a stateful operation and does not mesh well with the stateless design of NFS. One of NFS's design goals is to maintain Unix filesystem semantics on all files, which includes supporting record locks on files.

Unix locks come in two flavors: BSD-style file locks and System V-style record locks. The BSD locking mechanism implemented in the flock( ) system call exists for whole file locking only, and on Solaris is implemented in terms of the more general System V-style locks. The System V-style locks are implemented through the fcntl( ) system call and the lockf( ) library routine, which uses fcntl( ). System V locking operations are separated from the NFS protocol and handled by an RPC lock daemon and a status monitoring daemon that recreate and verify state information when either a client or server reboot.

Lock and status daemons

The RPC lock daemon, lockd, runs on both the client and server. When a lock request is made for an NFS-mounted file, lockd forwards the request to the server's lockd. The lock daemon asks the status monitor daemon, statd, to note that the client has requested a lock and to begin monitoring the client.

The file locking daemon and status monitor daemon keep two directories with lock "reminders" in them: /var/statmom/sm and /var/statmon/sm.bak. (On some systems, these directories are /etc/sm and /etc/sm.bak.) The first directory is used by the status monitor on an NFS server to track the names of hosts that have locked one or more of its files. The files in /var/statmon/sm are empty and are used primarily as pointers for lock renegotiation after a server or client crash. When statd is asked to monitor a system, it creates a file with that system's name in /etc/statmon/sm.

If the system making the lock request must be notified of a server reboot, then an entry is made in /var/statmon/sm.bak as well. When the status monitor daemon starts up, it calls the status daemon on all of the systems whose names appear in /var/statmon/sm.bak to notify them that the NFS server has rebooted. Each client's status daemon tells its lock daemon that locks may have been lost due to a server crash. The client-side lock daemons resubmit all outstanding lock requests, recreating the file lock state (on the server) that existed before the server crashed.

Client lock recovery

If the server's statd cannot reach a client's status daemon to inform it of the crash recovery, it begins printing annoying messages on the server's console:

statd: cannot talk to statd at client, RPC: Timed out(5)

These messages indicate that the local statd process could not find the portmapper on the client to make an RPC call to its status daemon. If the client has also rebooted and is not quite back on the air, the server's status monitor should eventually find the client and update the file lock state. However, if the client was taken down, had its named changed, or was removed from the network altogether, these messages continue until statd is told to stop looking for the missing client.

To silence statd, kill the status daemon process, remove the appropriate file in /var/statmon/sm.bak, and restart statd. For example, if server onaga cannot find the statd daemon on client noreaster, remove that client's entry in /var/statmon/sm.bak :

onaga# ps -eaf | fgrep statd 
root   133     1  0   Jan 16 ?        0:00 /usr/lib/nfs/statd
root  8364  6300  0 06:10:27 pts/13   0:00 fgrep statd
onaga# kill -9 133 
onaga# cd /var/statmon/sm.bak 
onaga# ls 
noreaster 
onaga# rm noreaster 
onaga# cd / 
onaga# /usr/lib/nfs/statd


 

Error messages from statd should be expected whenever an NFS client is removed from the network, or when clients and servers boot at the same time.

Recreating state information

Because permanent state (state that survives crashes) is maintained on the server host owning the locked file, the server is given the job of asking clients to re-establish their locks when state is lost. Only a server crash removes state from the system, and it is missing state that is impossible to regenerate without some external help.

When a client reboots, it by definition has given up all of its locks, but there is no state lost. Some state information may remain on the server and be out-of-date, but this "excess" state is flushed by the server's status monitor. After a client reboot, the server's status daemon notices the inconsistency between the locks held by the server and those the client thinks it holds. It informs the server lockd that locks from the rebooted client need reclaiming. The server's lockd sets a grace period — 45 seconds by default — during which the locks must be reclaimed or be lost. When a client reboots, it will not reclaim any locks, because there is no record of the locks in its local lockd. The server releases all of them, removing the old state from the client-server system.

Think of this server-side responsibility as dealing with your checkbook and your local bank branch. You keep one set of records, tracking what your balance is, and the bank maintains its own information about your account. The bank's information is the "truth," no matter how good or bad your recording keeping is. If you vanish from the earth or stop contacting the bank, then the bank tries to contact you for some finite grace period. After that, the bank releases its records and your money. On the other hand, if the bank were to lose its computer records in a disaster, it could ask you to submit checks and deposit slips to recreate the records of your account.



Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright © 1996-2020 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 12, 2019