Softpanorama: May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Linux filesystems


The file system is one of the most important parts of an operating system. The file system stores and manages user data on disk drives, and ensures that what's read from storage is identical to what was originally written. In addition to storing user data in files, the file system also creates and manages information about files and about itself. Besides guaranteeing the integrity of all that data, file systems are also expected to be extremely reliable and have very good performance.

File systems update their structural information (called metadata) with synchronous writes. Each metadata update may require many separate writes, and if the system crashes partway through the sequence, the metadata may be left in an inconsistent state. At the next boot the file system check utility (fsck) must walk through the metadata structures, examining and repairing them. On large file systems this takes a very long time, and the disk may not contain enough information to correct the structure, which results in misplaced or removed files.

A journaling file system uses a separate area called a log or journal. Before metadata changes are actually performed, they are logged to this separate area. The operation is then performed. If the system crashes during the operation, there is enough information in the log to "replay" the log record and complete the operation. This approach does not require a full scan of the file system, so the check is very quick even on large file systems, generally a few seconds for a multiple-gigabyte file system. In addition, because all information for the pending operation is saved, no removals or lost-and-found moves are required. The disadvantage of journaling file systems is that the extra log writes can make them slower than non-journaled file systems. Some journaling file systems: BeFS, HTFS, JFS, NSS, Ext3, VxFS and XFS.

Fortunately, a number of other Linux file systems take up where Ext2 leaves off. Indeed, Linux now offers four alternatives to Ext2:

Ext3,
ReiserFS,
XFS,
JFS.

In addition to meeting some or all of the requirements listed above, each of these alternative file systems also supports journaling, a feature certainly demanded by enterprises, but beneficial to anyone running Linux. A journaling file system can simplify restarts, reduce fragmentation, and accelerate I/O. Better yet, journaling file systems make fscks a thing of the past.

If you maintain a system of fair complexity or require high availability, you should seriously consider a journaling file system. Let's find out how journaling file systems work, look at the four journaling file systems available for Linux, and walk through the steps of installing one of the newer systems, JFS. Switching to a journaling file system is easier than you might think, and once you switch, well, you'll be glad you did.

Fun with File Systems

To better appreciate the benefits of journaling file systems, let's start by looking at how files are saved in a non-journaled file system like Ext2. To do that, it's helpful to speak the vernacular of file systems.

A logical block is the smallest unit of storage that can be allocated by the file system. A logical block is measured in bytes, and it may take several blocks to store a single file.
A logical volume can be a physical disk or some subset of the physical disk space. A logical volume is also known as a disk partition.
Block allocation is a method of allocating blocks where the file system allocates one block at a time. In this method, a pointer to every block in a file is maintained and recorded.
Internal fragmentation occurs when a file does not fill a block completely. For example, if a file is 10K and a block is 8K, the file system allocates two blocks to hold the file, but 6K is wasted. Notice that as blocks get bigger, so does the potential for waste.
External fragmentation occurs when the logical blocks that make up a file are scattered all over the disk. External fragmentation can cause poor performance.
An extent is a large number of contiguous blocks. Each extent is described by a triple, consisting of (file offset, starting block number, length), where file offset is the offset of the extent's first block from the beginning of the file, starting block number is the first block in the extent, and length is the number of blocks in the extent. Extents are allocated and tracked as a single unit, meaning that a single pointer tracks a group of blocks. For large files, extent allocation is a much more efficient technique than block allocation. Figure One shows how extents are used.
File system meta-data is the file system's internal data structures: everything concerning a file except the actual data inside the file. Meta-data includes date and time stamps, ownership information, file access permissions, other security information such as access control lists (if they exist), the file's size and the storage location or locations on disk.
An inode stores all of the information about a file except the data itself. You can think of an inode as a "bookkeeping" file for a file (indeed, an inode is a file that consumes blocks, too). An inode contains file permissions, file types, and the number of links to the file. It can also contain some direct pointers to file data blocks; pointers to blocks that contain pointers to file data blocks (so-called indirect pointers); and even double- and triple-indirect pointers. Every inode has a unique inode number that distinguishes it from every other inode.
A directory is a special kind of file that simply contains pointers to other files. Specifically, the inode for a directory file simply contains the inode numbers of its contents, plus permissions, etc.
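
The internal fragmentation arithmetic above (a 10K file in 8K blocks) can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example and do not correspond to any file system API.

```python
K = 1024  # the article's "K" is taken to mean 1024 bytes


def blocks_needed(file_size: int, block_size: int) -> int:
    """Whole blocks the file system must allocate to hold file_size bytes."""
    return (file_size + block_size - 1) // block_size  # integer ceiling


def internal_waste(file_size: int, block_size: int) -> int:
    """Bytes allocated to the file but not occupied by its data."""
    return blocks_needed(file_size, block_size) * block_size - file_size


# The example from the text: a 10K file stored in 8K blocks.
print(blocks_needed(10 * K, 8 * K))        # 2 blocks
print(internal_waste(10 * K, 8 * K) // K)  # 6 (K wasted)

# The same file in 4K blocks wastes less, which is the point about block size.
print(internal_waste(10 * K, 4 * K) // K)  # 2 (K wasted)
```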

Figure One: How file extents work

An extent is described by its block offset in the file, the location of the first block in the extent, and the length of the extent.

If file sample.txt requires 18 blocks, and the file system is able to allocate one extent of length 8, a second extent of length 5, and a third extent of length 5, the file system would look something like the drawing below. The first extent has offset 0 (block A in the file), location 0, and length 8. The second extent has offset 8 (block I), location 20, and length 5. The last extent has offset 13, location 35, and length 5.
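
As a rough sketch of how such an extent list is used, the Python below translates a logical block number within sample.txt into a block location on disk, using the three extents just described. The Extent type and the physical_block function are hypothetical names chosen for this illustration, not part of any real file system.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    file_offset: int  # first logical block of the file covered by the extent
    start_block: int  # location of the extent's first block on disk
    length: int       # number of contiguous blocks in the extent


# The three extents of sample.txt from Figure One.
SAMPLE_TXT = [Extent(0, 0, 8), Extent(8, 20, 5), Extent(13, 35, 5)]


def physical_block(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Return the disk block holding logical_block, or None if past end of file."""
    for ext in extents:
        if ext.file_offset <= logical_block < ext.file_offset + ext.length:
            return ext.start_block + (logical_block - ext.file_offset)
    return None


print(physical_block(SAMPLE_TXT, 8))   # 20, the first block of the second extent
print(physical_block(SAMPLE_TXT, 17))  # 39, the last block of the file
```

Three triples describe all 18 blocks, whereas block allocation would need 18 separate pointers.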

Figure Two illustrates blocks, inodes (with a number of meta-data attributes), directories, and their relationships.

Figure Two: Blocks, inodes, directories, files, and their relationships

When Good File Systems Go Bad

With those concepts in mind, here's what happens when a three-block file is modified and grows to be a five-block file:

First, two new blocks are allocated to hold the new data.
Next, the file's inode is updated to record the two new block pointers and the new size of the file.
Finally, the actual data is written into the blocks.

As you can see, while writing data to a file appears to be a single atomic operation, the actual process involves a number of steps (even more steps than shown here if you consider all of the accounting required to remove free blocks from a list of free blocks, among other possible metadata changes).

If all the steps to write a file are completed perfectly (and this happens most of the time), the file is saved successfully. However, if the process is interrupted at any time (perhaps due to power failure or other systemic failure), a non-journaled file system can end up in an inconsistent state. Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O, and the entire operation may not be totally reflected on the media at any given point in time.

If the meta-data or the file data is left in an inconsistent state, the file system will no longer function properly.
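
To make the failure mode concrete, here is a toy Python model of the three steps above with a simulated crash between them. Everything in it (the ToyFS class, the Crash exception, the crash_after_step parameter) is invented for illustration and greatly simplifies what a real file system does.

```python
class Crash(Exception):
    """Stands in for a power failure or system crash."""


class ToyFS:
    """A one-file, non-journaled file system kept entirely in memory."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.free = set(range(total_blocks))       # free block list
        self.blocks = {}                            # block number -> data
        self.inode = {"pointers": [], "size": 0}    # size counted in blocks

    def append(self, chunks, crash_after_step=None):
        # Step 1: allocate new blocks from the free list.
        new = sorted(self.free)[:len(chunks)]
        self.free -= set(new)
        if crash_after_step == 1:
            raise Crash
        # Step 2: record the new block pointers and size in the inode.
        self.inode["pointers"] += new
        self.inode["size"] += len(chunks)
        if crash_after_step == 2:
            raise Crash
        # Step 3: write the actual data into the blocks.
        for block, data in zip(new, chunks):
            self.blocks[block] = data

    def inconsistencies(self):
        """A miniature fsck: scan all metadata and report what does not add up."""
        problems = []
        for block in self.inode["pointers"]:
            if block not in self.blocks:
                problems.append(f"inode points to unwritten block {block}")
        allocated = set(range(self.total_blocks)) - self.free
        for block in sorted(allocated - set(self.inode["pointers"])):
            problems.append(f"block {block} is allocated but not referenced")
        return problems


fs = ToyFS(total_blocks=8)
fs.append(["a", "b", "c"])                       # the three-block file
try:
    fs.append(["d", "e"], crash_after_step=2)    # crash before the data is written
except Crash:
    pass
print(fs.inconsistencies())
```

After the simulated crash the inode claims five blocks, but two of them were never written, which is the kind of damage fsck has to find by scanning everything.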

Non-journaled file systems rely on fsck to examine all of the file system's metadata and detect and repair structural integrity problems before restarting. If Linux shuts down smoothly, fsck will typically return a clean bill of health. However, after a power failure or crash, fsck is likely to find some kind of error in meta-data.

A file system has a lot of meta-data, and fsck can be very time consuming. After all, fsck has to scan a file system's entire repository of meta-data to ensure consistency and error-free operation. As you may have experienced, the time fsck takes on a disk partition grows with the size of the partition, the number of directories, and the number of files in each directory.

For large file systems, journaling becomes crucial. A journaling file system provides improved structural consistency, better recovery, and faster restart times than non-journaled file systems. In most cases, a journaled file system can restart in less than a second.

Dear Journal...

The magic of journaling file systems lies in transactions. Just like a database transaction, a journaling file system transaction treats a sequence of changes as a single, atomic operation, but instead of tracking updates to tables, the journaling file system tracks changes to file system meta-data and/or user data. The transaction guarantees that either all or none of the file system updates are done.

For example, the process of creating a new file modifies several meta-data structures (inodes, free lists, directory entries, etc.). Before the file system makes those changes, it creates a transaction that describes what it's about to do. Once the transaction has been recorded (on disk), the file system goes ahead and modifies the meta-data. The journal in a journaling file system is simply a list of transactions.

In the event of a system failure, the file system is restored to a consistent state by replaying the journal. Rather than examine all meta-data (the fsck way), the file system inspects only those portions of the meta-data that have recently changed. Recovery is much faster, usually only a matter of seconds. Better yet, recovery time is not dependent on the size of the partition.
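
The record-then-apply-then-replay cycle can be sketched in a few lines of Python. This is a toy redo-only journal written for this article's explanation, not the on-disk format or algorithm of Ext3, ReiserFS, XFS, or JFS; the class, method, and parameter names are all hypothetical.

```python
class Crash(Exception):
    """Stands in for a power failure or system crash."""


class JournaledStore:
    def __init__(self):
        self.metadata = {}  # the "on-disk" meta-data structures
        self.journal = []   # the "on-disk" log: a list of transactions

    def update(self, changes, crash_before_commit=False, crash_after_applying=None):
        # 1. Describe the intended changes in the journal.
        txn = {"changes": dict(changes), "committed": False}
        self.journal.append(txn)
        if crash_before_commit:
            raise Crash
        # 2. Mark the transaction as fully recorded.
        txn["committed"] = True
        # 3. Only now modify the meta-data in place.
        for applied, (key, value) in enumerate(txn["changes"].items()):
            if crash_after_applying == applied:
                raise Crash  # crash after `applied` changes reached the meta-data
            self.metadata[key] = value
        # 4. The transaction is complete and can leave the journal.
        self.journal.remove(txn)

    def recover(self):
        """Replay the journal: only recent transactions are examined."""
        for txn in self.journal:
            if txn["committed"]:
                self.metadata.update(txn["changes"])  # redoing is harmless
        # Transactions that never committed are dropped: none of their changes apply.
        self.journal.clear()


store = JournaledStore()
try:
    store.update({"inode 7": "size=5", "free list": "blocks 3,4 in use"},
                 crash_after_applying=1)
except Crash:
    pass
print(store.metadata)  # only the inode change made it before the crash
store.recover()
print(store.metadata)  # both changes are present after replay
```

Recovery touches only what is in the journal, so its cost depends on how much work was in flight rather than on the size of the file system.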

In addition to faster restart times, most journaling file systems also address another significant problem: scalability. If you combine even a few large-capacity disks, you can assemble some massive (certainly by early-90s standards) file systems. Features of modern file systems include:

Faster allocation of free blocks. Extents (as described above) and B+ trees are used individually or together to find and allocate several free blocks, either by size or location, quickly.
Large (or very large) numbers of files in a directory. A directory is a special file that contains a list of files. If you want a directory to contain thousands or tens of thousands of files, something better than a linked list of (name, inode) pairs is needed. Again, advanced file systems use B+ trees to store directory entries. In some cases, a single B+ tree is used for the entire system.
Large files. The old technique of storing direct, indirect, double-indirect, and even triple indirect pointers to blocks does not scale well. For very large files, the number of disk accesses needed to retrieve a block in the data file would be prohibitively expensive.
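
To see why the pointer scheme scales poorly, the sketch below works out how many pointer blocks must be read before reaching a given data block. It assumes a classic layout of 12 direct pointers, 4K blocks, and 4-byte block pointers; those numbers are assumptions for the example (they are not stated in this article), and the function names are made up.

```python
def indirection_level(block_index, block_size=4096, pointer_size=4, direct=12):
    """Number of pointer blocks to read before the data block itself.

    0 = direct pointer in the inode, 1 = single indirect,
    2 = double indirect, 3 = triple indirect.
    """
    per_block = block_size // pointer_size  # pointers that fit in one block
    if block_index < direct:
        return 0
    block_index -= direct
    for level in (1, 2, 3):
        span = per_block ** level  # data blocks reachable at this level
        if block_index < span:
            return level
        block_index -= span
    raise ValueError("block index is beyond the triple-indirect range")


def max_file_blocks(block_size=4096, pointer_size=4, direct=12):
    """Largest file, in blocks, that this pointer scheme can address."""
    per_block = block_size // pointer_size
    return direct + per_block + per_block ** 2 + per_block ** 3


print(indirection_level(5))        # 0: found directly from the inode
print(indirection_level(500))      # 1: one extra block read
print(indirection_level(500_000))  # 2: two extra block reads
print(max_file_blocks())           # 1074791436
```

Under these assumptions, everything past the first 4 MB or so of a file (12 + 1024 blocks) costs at least two extra reads per uncached lookup, which is the overhead that extents avoid.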

More advanced file systems also manage sparse files, internal fragmentation, and the allocation of inodes better than Ext2.

A Wealth of Options

While advanced file systems are tailored primarily for the high throughput and high uptime requirements of servers (from single processor systems to clusters), these file systems can also benefit client machines where performance and reliability are wanted or needed.

As mentioned in the introduction, recent releases of Linux include not one, but four journaling file systems. JFS from IBM, XFS from SGI, and ReiserFS from Namesys have all been "open sourced" and subsequently included in the Linux kernel. In addition, Ext3 was developed as a journaling add-on to Ext2.

Figure Three shows where the file systems fit in Linux. You'll note that JFS, XFS, ReiserFS, and Ext3 are independent "peers." It's possible for a single Linux machine to use all of those file systems at the same time. A system administrator could configure a system to use XFS on one partition, and ReiserFS on another.

Figure Three: Where file systems fit in the operating system

What are the features and benefits of each system? Let's take a quick look at Ext3, ReiserFS, and XFS, and then an in-depth look at JFS.

EXT3

As mentioned above, Ext2 is the de facto file system for Linux. While it lacks some of the advanced features (extremely large files, extent-mapped files, etc.) of XFS and ReiserFS and others, it's reliable, stable, and still the default "out of the box" file system for all Linux distributions. Ext2's real weakness is fsck: the bigger the Ext2 file system, the longer it takes to fsck. Longer fsck times mean longer down times.

The Ext3 file system was designed to provide higher availability without impacting the robustness (at least the simplicity and reliability) of Ext2. Ext3 is a minimal extension to Ext2 to add support for journaling. Ext3 uses the same disk layout and data structures as Ext2, and it's forward- and backward-compatible with Ext2. Migration from Ext2 to Ext3 (and vice versa) is quite easy, and can even be done in-place in the same partition. The other three journaling file systems require the partition to be formatted with their mkfs utility.

If you want to adopt a journaling file system, but don't have free partitions on your system, Ext3 could be the journaling file system to use. See "Switching to Ext3" for information on how to switch to Ext3 on your Linux machine.

Switching to Ext3

If you want to switch to Ext3, it's a good idea to make a backup of your file systems. Once you've done that, run the tune2fs program with the -j option to add a journal file to an existing Ext2 file system. You can run tune2fs on a mounted or unmounted Ext2 file system. For instance, if /dev/hdb3 is an Ext2 file system, the command

# tune2fs -j /dev/hdb3

creates the log. If the file system is mounted, a journal file named .journal will be placed in the root directory of the file system. If the file system is not mounted, the journal file will be hidden. (When you mount an Ext3 file system, the .journal file will appear. The .journal file is just an indicator to show that the file system is indeed Ext3.)

Next, the entry for /dev/hdb3 in /etc/fstab needs to be changed from ext2 to ext3. The final step is to reboot and verify that the /dev/hdb3 partition has type ext3. Type mount; the output should include an entry like this one:

% mount

/dev/hdb3 on /test type ext3 (rw)

Ext3 provides three data journaling modes that can be set at mount time: data=journal, data=writeback, and data=ordered. The data=journal mode provides both meta-data and data journaling. data=writeback mode provides only meta-data journaling. data=ordered mode, which is the default mode, provides meta-data journaling with increased integrity. With three modes, a system administrator can make a trade off between performance and file data consistency.

If for some reason you'd like to change the Ext3 partition back to Ext2, the process is very simple: umount the file system, and re-mount it using Ext2.

# mount -t ext2 /dev/hdb3 /test

If you want the file system to mount as Ext2 at boot time, you'll also have to change its entry in /etc/fstab.

The downside of Ext3? It's an add-on to Ext2, so it still has the same limitations that Ext2 has. The fixed internal structures of Ext2 are simply too small (too few bits) to capture large file sizes, extremely large partition sizes, and enormous numbers of files in a single directory. Moreover, the bookkeeping techniques of Ext2, such as its linked-list directory implementation, do not scale well to large file systems (there is an upper limit of roughly 32,000 subdirectories in a single directory, and a "soft" upper limit of 10,000-15,000 files in a single directory). To make radical improvements to Ext2, you'd have to make radical changes. Radical change was not the intent of Ext3.

However, newer file systems do not have to be backward-compatible with Ext2. ReiserFS, XFS, and JFS offer scalability, high performance, very large file systems, and of course, journaling. "Why Four Journaling File Systems is a Good Thing" presents an overview of the capabilities of the four journaling file systems.

Why Four Journaling File Systems is Good

One of the great things about open source is that choice is looked upon favorably. Linux is the only operating system with four journaling file systems in production: ReiserFS, Ext3, JFS, and XFS.

All four file systems have the GPL license, and source code is available at http://www.kernel.org or on each project's home page. Each of the journaling file system teams follows a community model and welcomes users and contributors. In fact, the teams share their best ideas, and competitive benchmarking encourages constant improvement of all of the systems.

The table below summarizes the features and limits of the four Linux journaling file systems. The first section provides some history of when the journaling file systems were accepted into the kernel.org source trees. The next section lists some of the features of the file systems. The final section lists some of the distributions that are currently shipping the journaling file systems. If the distribution is shipping the file system that you want to use, you can use that file system right "out-of-the-box."

For complete feature lists of each journaling file system, see the respective project Web pages.


A comparison of journaling file systems
Kernel support	Ext3	ReiserFS	XFS	JFS
Kernel prerequisites	No	No	Yes	No
In kernel.org source tree 2.4.x	2.4.15	2.4.1	-	-
In kernel.org source tree 2.5.x	2.5.0	2.5.0	-	2.5.6
License	GPL	GPL	GPL	GPL

Features	Ext3	ReiserFS	XFS	JFS
Largest block size supported on ia32	4 KB	4 KB	4 KB	4 KB
File system size maximum	16384 GB	17592 GB	18,000 PB (+)	32 PB
File size maximum	2048 GB	1 EB (*)	9,000 PB	4 PB
Growing the file system size	Patch	Yes	Yes	Yes
Access Control Lists	Patch	No	Yes	WIP
Dynamic disk inode allocation	No	Yes	Yes	Yes
Data logging	Yes	No	No	No
Place log on an external device	Yes	Yes	Yes	Yes

Distros with journaling file systems	Ext3	ReiserFS	XFS	JFS
Red Hat 7.3	Yes	Yes	No	Yes
SuSE 8.0	Yes	Yes	Yes	Yes
Mandrake Linux 8.2	Yes	Yes	Yes	Yes
Slackware Linux 8.1	Yes	Yes	Yes	Yes

(+) PB is petabyte, or 10¹⁵ bytes

(*) EB is exabyte, or 10¹⁸ bytes

By the way, the 2.4 kernel has a limit of 2048 GB for a single block device, so no file system larger than that can be created at this time (without patching the standard kernel). This restriction could be removed in the 2.5.x development kernel, and there are patches available to remove this limit, but as of 2.5.29, the patches haven't been officially included yet.
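
One plausible way to arrive at these round numbers is simple addressing arithmetic. The sketch below assumes that the 2048 GB block device limit comes from a 32-bit count of 512-byte sectors, and that the 16384 GB Ext3 figure in the table comes from a 32-bit block number with 4 KB blocks; both are assumptions made for illustration, not statements from the article.

```python
GIB = 2 ** 30  # the table's "GB" is read here as 2**30 bytes


def capacity_gib(address_bits: int, unit_bytes: int) -> int:
    """Capacity reachable with address_bits-wide unit numbers, in GiB."""
    return (2 ** address_bits * unit_bytes) // GIB


# 32-bit sector numbers, 512-byte sectors (assumed source of the 2.4 limit).
print(capacity_gib(32, 512))   # 2048

# 32-bit block numbers, 4 KB blocks (assumed source of the Ext3 maximum).
print(capacity_gib(32, 4096))  # 16384
```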

REISERFS

ReiserFS is designed and developed by Hans Reiser and his team of developers at Namesys. Like the other journaling file systems, it's open source, is available in most Linux distributions, and supports meta-data journaling.

One of the unique advantages of ReiserFS is support for small files, lots and lots of small files. Reiser's philosophy is simple: small files encourage coding simplicity. Rather than use a database or create your own file caching scheme, use the filesystem to handle lots of small pieces of information.

ReiserFS is about eight to fifteen times faster than Ext2 at handling files smaller than 1K.

Even more impressive, (when properly configured) ReiserFS can actually store about 6% more data than Ext2 on the same physical file system. Rather than allocate space in fixed 4K blocks, ReiserFS can allocate the exact space that's needed. A B* tree manages all file system meta-data, and stores and compresses tails, portions of files smaller than a block.
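
The effect of allocating exact space instead of whole blocks is easy to estimate. The Python below compares the two policies for a handful of made-up file sizes; it ignores all per-file bookkeeping overhead and is not a model of how ReiserFS actually lays out its tree, so it does not reproduce the 6% figure.

```python
def fixed_block_usage(sizes, block_size=4096):
    """Bytes consumed when every file occupies a whole number of blocks."""
    return sum(-(-size // block_size) * block_size for size in sizes)


def packed_usage(sizes):
    """Bytes consumed when each file takes only the space it needs
    (bookkeeping overhead is ignored in this sketch)."""
    return sum(sizes)


def tail_bytes(size, block_size=4096):
    """Size of the file's tail: the part that does not fill a whole block."""
    return size % block_size


sizes = [100, 700, 1500, 5000]  # hypothetical small files, in bytes
print(fixed_block_usage(sizes))                        # 20480
print(packed_usage(sizes))                             # 7300
print(fixed_block_usage(sizes) - packed_usage(sizes))  # 13180 bytes saved
print(tail_bytes(5000))                                # 904
```

The smaller the files relative to the block size, the larger the share of space that whole-block allocation gives away.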

Of course, ReiserFS also has excellent performance for large files, but it's especially adept at managing small files.

For a more in-depth discussion of ReiserFS and instructions on how to install it, see "Journaling File Systems" in the August 2000 issue, available online at http://www.linux-mag.com/2000-08/journaling_01.html.

JFS

JFS for Linux is based on IBM's successful JFS file system for OS/2 Warp. Donated to open source in early 2000 and ported to Linux soon after, JFS is well-suited to enterprise environments. JFS uses many advanced techniques to boost performance, provide for very large file systems, and of course, journal changes to the file system. SGI's XFS (described next) has many similar features. Some of the features of JFS include:

Extent-based addressing structures. JFS uses extent-based addressing structures, along with aggressive block allocation policies to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. This feature yields excellent performance.
Dynamic inode allocation. JFS dynamically allocates space for disk inodes as required, freeing the space when it is no longer required. This is a radical improvement over Ext2, which reserves a fixed amount of space for disk inodes at file system creation time. With dynamic inode allocation, users do not have to estimate the maximum number of files and directories that a file system will contain. Additionally, this feature decouples disk inodes from fixed disk locations.
Directory organization. Two different directory organizations are provided: one is used for small directories and the other for large directories. The contents of a small directory (up to 8 entries, excluding the self (. or "dot") and parent (.. or "dot dot") entries) are stored within the directory's inode. This eliminates the need for separate directory block I/O and the need to allocate separate storage. The contents of larger directories are organized in a B+ tree keyed on name. B+ trees provide faster directory lookup, insertion, and deletion capabilities when compared to traditional unsorted directory organizations.
64 bits. JFS is a full 64-bit file system. All of the appropriate file system structure fields are 64 bits in size. This allows JFS to support large files and partitions.

There are other advanced features in JFS such as allocation groups (which speed file access times by maximizing locality), and various block sizes ranging from 512 bytes to 4096 bytes (which can be tuned to avoid internal and external fragmentation). You can read about all of them at the JFS Web site at http://www-124.ibm.com/developerworks/oss/jfs.

XFS

A little more than a year ago, SGI released a version of its high-end XFS file system for Linux. Based on SGI's Irix XFS file system technology, XFS supports meta-data journaling and extremely large disk farms. How large? A single XFS file system can be 18,000 petabytes (a petabyte is 10¹⁵ bytes) and a single file can be 9,000 petabytes. XFS is also capable of delivering excellent I/O performance.

In addition to truly amazing scale and speed, XFS uses many of the same techniques found in JFS.

Installing JFS

For the rest of the article, let's look at how to install and use IBM's JFS system. If you have the latest release of Turbolinux, Mandrake, SuSE, Red Hat, or Slackware, you can probably skip ahead to the section "Creating a JFS Partition." If you want to include the latest JFS source code drop into your kernel, the next few sections show you what to do.

THE LATEST AND GREATEST

JFS has been incorporated into the 2.5.6 Linux kernel, and is also included in Alan Cox's 2.4.X-ac kernels beginning with 2.4.18-pre9-ac4, which was released on February 14, 2002. Alan's patches for the 2.4.x series are available from http://www.kernel.org. You can also download a 2.4 kernel source tree and add the JFS patches to this tree. JFS comes as a patch for several of the 2.4.x kernels, so first of all, get the latest kernel from http://www.kernel.org.

At the time of writing, the latest kernel was 2.4.18 and the latest release of JFS was 1.0.20. We'll be using those in the instructions below. The JFS patch is available from the JFS web site. You need three files: the utilities (jfsutils-1.0.20.tar.gz), the kernel patch (jfs-2.4.18-patch), and the file system source (jfs-2.4-1.0.20.tar.gz).

If you're using any of the latest distros, you probably won't have to patch the kernel for the JFS code. Instead, you'll only need to compile the kernel to update to the latest release of JFS (you can build JFS either as built-in or as a module). (To determine what version of JFS was shipped in the distribution you're running, you can open the JFS file super.c and look for a printk() that has the JFS development version number string.)

PATCHING THE KERNEL TO SUPPORT JFS

In the example below, we'll use the 2.4.18 kernel source tree to show how to patch JFS into the kernel.

First, download the Linux kernel archive, linux-2.4.18.tar.gz, and save it under /usr/src. If /usr/src already has a linux subdirectory, move it to linux-org so it won't be replaced by the linux-2.4.18 source tree. Then expand the kernel source tree:

% cd /usr/src
% mv linux linux-org
% tar zxvf linux-2.4.18.tar.gz

This operation will create a directory named /usr/src/linux.

The next step is to get the JFS utilities and the appropriate patch for kernel 2.4.18. Before you do that, you need to create a directory for JFS source, /usr/src/jfs1020, and download (to that directory) the JFS kernel patch and the JFS file system source files. Once you have those files, you have everything you need to patch the kernel.

Next, change to the directory of the kernel 2.4.18 source tree and apply the JFS kernel patch:

% cd /usr/src/linux
% patch -p1 < /usr/src/jfs1020/jfs-2.4.18-patch
% cp /usr/src/jfs1020/jfs-2.4-1.0.20.tar.gz .
% tar zxvf jfs-2.4-1.0.20.tar.gz

Now, you need to configure the kernel and enable JFS by going to the File systems section of the configuration menu and enabling JFS file system support (CONFIG_JFS_FS=y). You also have the option to configure JFS as a module, in which case you only need to recompile and reinstall kernel modules by typing:

% make modules && make modules_install

Otherwise, if you configured the JFS option as a kernel built-in, you need to:

1. Recompile the kernel (in /usr/src/linux). Run the command

% make dep && make clean && make bzImage

2. Recompile and install modules (only if you added other options as modules)

% make modules && make modules_install

3. Install the kernel.

# cp arch/i386/boot/bzImage /boot/jfs-bzImage
# cp System.map /boot/jfs-System.map
# ln -s /boot/jfs-System.map /boot/System.map

Next, update /etc/lilo.conf with the new kernel. Add an entry like the one that follows and a jfs1020 entry should appear at the lilo boot prompt:

image=/boot/jfs-bzImage
label=jfs1020
read-only
root=/dev/hda5  # Change to your partition

Be sure to specify the correct root partition. Then run

# lilo

to make the system aware of the new kernel. Reboot and select the jfs1020 kernel to boot from the new image.

After you compile and install the kernel, you should compile and install the JFS utilities. Save the jfsutils-1.0.20.tar.gz file into the /usr/src/jfs1020 directory, expand it, run configure, and then install the utilities.

  % tar zxvf jfsutils-1.0.20.tar.gz
  % cd jfsutils-1.0.20
  % ./configure
  % make && make install

Creating a JFS partition

Having built and installed the JFS utilities, the next step is to create a JFS partition. In this example, we'll demonstrate the process using a spare partition.

(If there's unpartitioned space on your disk, you can create a partition using fdisk. After you create the partition, reboot the system to make sure that the new partition is available to create a JFS file system on it. In our test system, we had /dev/hdb3 as a spare partition.)

To create the JFS file system with the log inside the JFS partition, run the following command:

# mkfs.jfs /dev/hdb3

After the file system has been created, you need to mount it, which requires a mount point. Create a new empty directory such as /jfs and mount the file system on it with the following commands:

# mkdir /jfs
# mount -t jfs /dev/hdb3 /jfs

After the file system is mounted, you are ready to try out JFS. To unmount the JFS file system, you simply use the umount command with the same mount point as the argument:

# umount /jfs

A Performance Tweak for All File Systems

Linux records an atime, or access time, whenever a file is read. However, access time isn't very useful, and can be quite costly to track.

To get a quick performance boost on any kind of Linux file system, simply disable access time updates with the mount option noatime. For example, to disable access times on a JFS partition, do something like this in /etc/fstab:

/dev/hda6 /jfs jfs noatime 1 2
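
If you want to check entries like this one from a script, the six whitespace-separated fields are easy to pick apart. The following Python is a minimal sketch that handles only complete six-field lines such as the one above; FstabEntry and parse_fstab_line are names invented for this example, not part of any standard library.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: List[str]
    dump: int
    fsck_pass: int


def parse_fstab_line(line: str) -> FstabEntry:
    """Split one complete /etc/fstab entry into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected six whitespace-separated fields")
    device, mount_point, fs_type, options, dump, fsck_pass = fields
    return FstabEntry(device, mount_point, fs_type,
                      options.split(","), int(dump), int(fsck_pass))


entry = parse_fstab_line("/dev/hda6 /jfs jfs noatime 1 2")
print(entry.fs_type)               # jfs
print("noatime" in entry.options)  # True
```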

Go Faster with An External Log

An external log improves performance since the log updates are saved to a different partition than its corresponding file system.

To create the JFS file system with the log on an external device, your system will need to have 2 unused partitions. Our test system had /dev/hda6 and /dev/hdb1 as spare partitions.

# mkfs.jfs -j /dev/hdb1 /dev/hda6
mkfs.jfs version: 1.0.20 21-Jun-2002
Warning! All data on device /dev/hda6 will be lost!
Warning! All data on device /dev/hdb1 will be lost!
Continue? (Y/N) y
Format completed successfully.
10249438 kilobytes total disk space.

To mount the file system use the following mount command:

# mount -t jfs /dev/hda6 /jfs

So that you don't have to mount this file system by hand every time you boot, you can add it to /etc/fstab. Make a backup of /etc/fstab and edit it with your favorite editor. Add the /dev/hda6 device. For example, add:

/dev/hda6 /jfs jfs defaults 1 2

Not Just for Reboots Anymore

Some people have the impression that journaling file systems only provide fast restart times. As you've seen, this isn't true. Considerable coding efforts have made journaling file systems scalable, reliable, and fast.

Whether you're running an enterprise server, a cluster supercomputer, or a small Web site, XFS, JFS, and ReiserFS add credibility and oomph to Linux. Need a better reason to switch to a journaling file system? Just imagine yourself in a world without fsck. What will you do with all that extra time?

Steve Best works in the Linux Technology Center of IBM in Austin, Texas. He is currently working on the Journaled File System (JFS) for Linux project. Steve has done extensive work in operating system development with a focus in the areas of file systems, internationalization, and security. He can be reached at [email protected].

Old News ;-)

20.2. Major File Systems in Linux

Unlike two or three years ago, choosing a file system for a Linux system is no longer a matter of a few seconds (Ext2 or ReiserFS?). Kernels starting from 2.4 offer a variety of file systems from which to choose. The following is an overview of how these file systems basically work and which advantages they offer.

It is very important to bear in mind that there may be no file system that best suits all kinds of applications. Each file system has its particular strengths and weaknesses, which must be taken into account. Even the most sophisticated file system cannot substitute for a reasonable backup strategy, however.

The terms data integrity and data consistency, when used in this chapter, do not refer to the consistency of the user space data (the data your application writes to its files). Whether this data is consistent must be controlled by the application itself.

Setting up File Systems

Unless stated otherwise in this chapter, all the steps required to set up or change partitions and file systems can be performed using the YaST module.

20.2.1. Ext2

The origins of Ext2 go back to the early days of Linux history. Its predecessor, the Extended File System, was implemented in April 1992 and integrated in Linux 0.96c. The Extended File System underwent a number of modifications and, as Ext2, became the most popular Linux file system for years. With the creation of journaling file systems and their astonishingly short recovery times, Ext2 became less important.

A brief summary of Ext2's strengths might help understand why it was (and in some areas still is) the favorite Linux file system of many Linux users.

Solidity

Being quite an "old-timer," Ext2 underwent many improvements and was heavily tested. This may be the reason why people often refer to it as rock-solid. After a system outage when the file system could not be cleanly unmounted, e2fsck starts to analyze the file system data. Metadata is brought into a consistent state and pending files or data blocks are written to a designated directory (called lost+found). In contrast to journaling file systems, e2fsck analyzes the entire file system and not just the recently modified bits of metadata. This takes significantly longer than checking the log data of a journaling file system. Depending on file system size, this procedure can take half an hour or more. Therefore, it is not desirable to choose Ext2 for any server that needs high availability. However, because Ext2 does not maintain a journal and uses significantly less memory, it is sometimes faster than other file systems.

Easy Upgradability

The code for Ext2 is the strong foundation on which Ext3 could become a highly-acclaimed next-generation file system. Its reliability and solidity were elegantly combined with the advantages of a journaling file system.

20.2.2. Ext3

Ext3 was designed by Stephen Tweedie. Unlike all other next-generation file systems, Ext3 does not follow a completely new design principle. It is based on Ext2. These two file systems are very closely related to each other. An Ext3 file system can be easily built on top of an Ext2 file system. The most important difference between Ext2 and Ext3 is that Ext3 supports journaling. In summary, Ext3 has three major advantages to offer:

Easy and Highly Reliable Upgrades from Ext2

Because Ext3 is based on the Ext2 code and shares its on-disk format as well as its metadata format, upgrades from Ext2 to Ext3 are incredibly easy. Unlike transitions to other journaling file systems, such as ReiserFS, JFS, or XFS, which can be quite tedious (making backups of the entire file system and recreating it from scratch), a transition to Ext3 is a matter of minutes. It is also very safe, because recreating an entire file system from scratch might not work flawlessly. Considering the number of existing Ext2 systems that await an upgrade to a journaling file system, you can easily figure out why Ext3 might be of some importance to many system administrators. Downgrading from Ext3 to Ext2 is as easy as the upgrade. Just perform a clean unmount of the Ext3 file system and remount it as an Ext2 file system.

Reliability and Performance

Other journaling file systems follow the "metadata-only" journaling approach. This means your metadata is always kept in a consistent state but the same cannot be automatically guaranteed for the file system data itself. Ext3 is designed to take care of both metadata and data. The degree of "care" can be customized. Enabling Ext3 in the data=journal mode offers maximum security (data integrity), but can slow down the system because both metadata and data are journaled. A relatively new approach is to use the data=ordered mode, which ensures both data and metadata integrity, but uses journaling only for metadata. The file system driver collects all data blocks that correspond to one metadata update. These blocks are grouped as a "transaction" and written to disk before the metadata is updated. As a result, consistency is achieved for metadata and data without sacrificing performance. A third option to use is data=writeback, which allows data to be written into the main file system after its metadata has been committed to the journal. This option is often considered the best in performance. It can, however, allow old data to reappear in files after crash and recovery while internal file system integrity is maintained. Unless you specify something else, Ext3 is run with the data=ordered default.
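
The difference between the three modes comes down to what goes through the journal and in which order writes reach the disk. The following Python sketch is a deliberately simplified model of that ordering for a single file update. It is not how the kernel driver is implemented, and the function name write_sequence and the step descriptions are made up for illustration.

```python
from typing import List


def write_sequence(mode: str) -> List[str]:
    """Return a simplified ordering of disk writes for one file update."""
    if mode == "journal":
        # Both data and metadata pass through the journal first.
        return [
            "write data and metadata to journal",
            "commit journal transaction",
            "copy data and metadata to main file system",
        ]
    if mode == "ordered":
        # Data goes straight to its final place, but always before
        # the metadata that refers to it is committed.
        return [
            "write data blocks to main file system",
            "write metadata to journal",
            "commit journal transaction",
            "copy metadata to main file system",
        ]
    if mode == "writeback":
        # Only metadata is journaled; data may land before or after the
        # commit. It is listed last here to show the risky case.
        return [
            "write metadata to journal",
            "commit journal transaction",
            "copy metadata to main file system",
            "write data blocks to main file system (no ordering guarantee)",
        ]
    raise ValueError(f"unknown ext3 data mode: {mode!r}")


for mode in ("journal", "ordered", "writeback"):
    print(mode, "->", write_sequence(mode))
```

In the writeback case a crash between the commit and the data write leaves metadata pointing at blocks that still hold old contents, which is how stale data can reappear after recovery.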

20.2.3. Converting an Ext2 File System into Ext3

Converting from Ext2 to Ext3 involves two separate steps:

Creating the Journal

Log in as root and run tune2fs -j followed by the name of the device that holds the file system. This creates an Ext3 journal with the default parameters. To decide yourself how large the journal should be and on which device it should reside, run tune2fs -J instead, together with the desired journal options size= and device=. More information about the tune2fs program is available in its manual page (man 8 tune2fs).

Specifying the File System Type in /etc/fstab

To ensure that the Ext3 file system is recognized as such, edit the file /etc/fstab, changing the file system type specified for the corresponding partition from ext2 to ext3. The change takes effect after the next reboot.
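
The two steps can be sketched in Python as follows. This is only an illustration: journal_command and convert_fstab_entry are hypothetical helper names, the sample fstab content and the device /dev/hda6 are made up, and the sketch only builds the tune2fs command line without executing it.

```python
from typing import List


def journal_command(device: str) -> List[str]:
    """Step 1: the command that adds a default-sized journal (run as root)."""
    return ["tune2fs", "-j", device]


def convert_fstab_entry(fstab_text: str, device: str) -> str:
    """Step 2: return fstab text with `device` switched from ext2 to ext3."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines, and other devices untouched.
        if (len(fields) >= 3 and not line.lstrip().startswith("#")
                and fields[0] == device and fields[2] == "ext2"):
            fields[2] = "ext3"
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "/dev/hda1 / ext3 defaults 1 1\n"
    "/dev/hda6 /data ext2 defaults 1 2\n"
)
print(" ".join(journal_command("/dev/hda6")))
print(convert_fstab_entry(sample, "/dev/hda6"), end="")
```

Writing the result back to /etc/fstab is left out on purpose. Keep a backup copy of the original file before changing it.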

Using ext3 for the Root Directory

To boot a root file system set up as an ext3 partition, include the modules ext3 and jbd in the initrd. To do so, edit the file /etc/sysconfig/kernel to include the two modules under INITRD_MODULES then execute the command mk_initrd.

20.2.4. ReiserFS

Officially one of the key features of the 2.4 kernel release, ReiserFS has been available as a kernel patch for 2.2.x SUSE kernels since SUSE LINUX version 6.4. ReiserFS was designed by Hans Reiser and the Namesys development team. ReiserFS has proven to be a powerful alternative to the old Ext2. Its key assets are better disk space utilization, better disk access performance, and faster crash recovery. However, there is a minor drawback: ReiserFS takes great care of metadata but not of the data itself. Future generations of ReiserFS will include data journaling (both metadata and actual data are written to the journal) as well as ordered writes.

ReiserFS's strengths, in more detail, are:

Better Disk Space Utilization

In ReiserFS, all data is organized in a structure called a B*-balanced tree. The tree structure contributes to better disk space utilization because small files can be stored directly in the B* tree leaf nodes instead of being stored elsewhere and just maintaining a pointer to the actual disk location. In addition to that, storage is not allocated in chunks of 1 or 4 kB, but in portions of the exact size needed. Another benefit lies in the dynamic allocation of inodes. This keeps the file system more flexible than traditional file systems, like Ext2, where the inode density must be specified at file system creation time.
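
A short calculation shows why exact-size allocation matters for small files. The Python sketch below compares whole-block allocation with 4 kB blocks against storing exactly the bytes needed. The file sizes are made-up sample values, and the comparison ignores the bookkeeping overhead that any real file system adds.

```python
def allocated_bytes(file_size: int, block_size: int) -> int:
    """Space consumed when storage is handed out in whole blocks."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    return blocks * block_size


sizes = [100, 700, 1500, 5000]  # hypothetical small files, in bytes
block_alloc = sum(allocated_bytes(s, 4096) for s in sizes)
exact_alloc = sum(sizes)

print("4 kB blocks:", block_alloc, "bytes")
print("exact size: ", exact_alloc, "bytes")
print("slack space:", block_alloc - exact_alloc, "bytes")
```

With these four files, block allocation uses 20480 bytes to hold 7300 bytes of content, so well over half of the allocated space is slack.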

Better Disk Access Performance

For small files, you often find that both file data and "stat_data" (inode) information are stored next to each other. They can be read with a single disk I/O operation, meaning that only one access to disk is required to retrieve all the information needed.

Fast Crash Recovery

Using a journal to keep track of recent metadata changes makes a file system check a matter of seconds, even for huge file systems.

20.2.5. JFS

JFS, the Journaling File System, was developed by IBM. The first beta version of the JFS Linux port reached the Linux community in the summer of 2000. Version 1.0.0 was released in 2001. JFS is tailored to suit the needs of high throughput server environments where performance is the ultimate goal. Being a full 64-bit file system, JFS supports both large files and partitions, which is another reason for its use in server environments.

A closer look at JFS shows why this file system might prove a good choice for your Linux server:

Efficient Journaling

JFS follows a "metadata-only" approach like ReiserFS. Instead of an extensive check, only metadata changes generated by recent file system activity are checked, which saves a great amount of time in recovery. Concurrent operations requiring multiple concurrent log entries can be combined into one group commit, greatly reducing performance loss of the file system through multiple write operations.

Efficient Directory Organization

JFS holds two different directory organizations. For small directories, it allows the directory's content to be stored directly into its inode. For larger directories, it uses B⁺-trees, which greatly facilitate directory management.

Better Space Usage through Dynamic inode Allocation

For Ext2, you must define the inode density in advance (the space occupied by management information), which restricts the maximum number of files or directories of your file system. JFS spares you these considerations: it dynamically allocates inode space and frees it when it is no longer needed.

20.2.6. XFS

Originally intended as the file system for their IRIX OS, SGI started XFS development in the early 1990s. The idea behind XFS was to create a high-performance 64-bit journaling file system to meet the extreme computing challenges of today. XFS is very good at manipulating large files and performs well on high-end hardware. However, even XFS has a drawback. Like ReiserFS, XFS takes great care of metadata integrity, but less of data integrity.

A quick review of XFS's key features explains why it may prove a strong competitor for other journaling file systems in high-end computing.

High Scalability through the Use of Allocation Groups

At the creation time of an XFS file system, the block device underlying the file system is divided into eight or more linear regions of equal size. Those are referred to as allocation groups. Each allocation group manages its own inodes and free disk space. Practically, allocation groups can be seen as file systems in a file system. Because allocation groups are rather independent of each other, more than one of them can be addressed by the kernel simultaneously. This feature is the key to XFS's great scalability. Naturally, the concept of independent allocation groups suits the needs of multiprocessor systems.
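
As a rough picture of this layout, the following Python sketch divides a device into eight equal regions, each with its own free-space counter. It is a toy model under simplifying assumptions (the device size divides evenly, and only free space is tracked). The class and function names are invented for illustration and do not mirror XFS internals.

```python
from typing import List


class AllocationGroup:
    """One linear region of the device with its own free-space count."""

    def __init__(self, first_block: int, size: int) -> None:
        self.first_block = first_block
        self.free_blocks = size

    def allocate(self, count: int) -> None:
        if count > self.free_blocks:
            raise OSError("allocation group is full")
        self.free_blocks -= count


def make_groups(total_blocks: int, ag_count: int = 8) -> List[AllocationGroup]:
    """Split the device into ag_count regions of equal size."""
    if total_blocks % ag_count:
        raise ValueError("sketch assumes the device divides evenly")
    size = total_blocks // ag_count
    return [AllocationGroup(i * size, size) for i in range(ag_count)]


def group_index(block: int, total_blocks: int, ag_count: int = 8) -> int:
    """Which allocation group a given block number belongs to."""
    return block // (total_blocks // ag_count)


groups = make_groups(8000)
groups[2].allocate(100)  # touches only group 2's bookkeeping
print([g.free_blocks for g in groups])
print(group_index(2500, 8000))
```

Because an allocation in one group changes only that group's counters, work in different groups does not need to be serialized, which is the property the text credits for scalability.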

High Performance through Efficient Management of Disk Space

Free space and inodes are handled by B⁺-trees inside the allocation groups. The use of B⁺-trees greatly contributes to XFS's performance and scalability. A feature truly unique to XFS is delayed allocation. XFS handles allocation by breaking the process into two pieces. A pending transaction is stored in RAM and the appropriate amount of space is reserved. XFS still does not decide where exactly (speaking of file system blocks) the data should be stored. This decision is delayed until the last possible moment. Some short-lived temporary data may never make its way to disk, because it may be obsolete at the time XFS decides where actually to save it. Thus XFS increases write performance and reduces file system fragmentation. Because delayed allocation results in less frequent write events than in other file systems, it is likely that data loss after a crash during a write is more severe.
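
The two-phase idea (reserve space at write time, pick block locations only at flush time) can be illustrated with a small Python toy. This is a sketch under strong simplifications and not XFS code: DelayedAllocator and its methods are invented names, placement is a simple running counter, and deleting files that are already on disk is out of scope.

```python
from typing import Dict, Tuple


class DelayedAllocator:
    """Toy model: reserve space on write, choose placement on flush."""

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks              # blocks not yet reserved
        self.pending: Dict[str, int] = {}           # file -> blocks held in RAM
        self.on_disk: Dict[str, Tuple[int, int]] = {}  # file -> (start, length)
        self.next_block = 0

    def write(self, name: str, blocks: int) -> None:
        """Buffer a write: reserve the space now, decide location later."""
        if blocks > self.free_blocks:
            raise OSError("no space left on device")
        self.free_blocks -= blocks
        self.pending[name] = self.pending.get(name, 0) + blocks

    def delete(self, name: str) -> None:
        """Drop a file whose data is still pending; nothing hits the disk."""
        self.free_blocks += self.pending.pop(name, 0)

    def flush(self) -> None:
        """Place each pending file as one contiguous extent."""
        for name, blocks in self.pending.items():
            self.on_disk[name] = (self.next_block, blocks)
            self.next_block += blocks
        self.pending.clear()


alloc = DelayedAllocator(100)
alloc.write("log", 3)
alloc.write("log", 2)    # merged with the earlier write before placement
alloc.write("tmp", 10)
alloc.delete("tmp")      # short-lived file never reaches the disk
alloc.flush()
print(alloc.on_disk, alloc.free_blocks)
```

The two writes to "log" end up in a single five-block extent, and "tmp" is never placed at all, which mirrors the fragmentation and write-volume benefits described above. Everything in the pending table would be lost in a crash, which is the trade-off the text mentions.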

Preallocation to Avoid File System Fragmentation

Before writing the data to the file system, XFS reserves (preallocates) the free space needed for a file. Thus, file system fragmentation is greatly reduced. Performance is increased because the contents of a file are not distributed all over the file system.

JLS2009 A Btrfs update [LWN.net]

Nov 1, 2009 | lwn.net
An fsync() that really synchronously writes to the disk is always going to be slow, because it lets the program wait for the disk(s). And with a good file system it's completely unnecessary for an application like an editor; editors just call it as a workaround for bad file systems.
nix (subscriber, #2304)
So, er, you're suggesting that a good filesystem, what, calls sync() every
second? I can't see any way in which you could get the guarantees fsync()
does for files you really care about without paying some kind of price for
it in latency for those files.

And I don't really like the idea of calling sync() every second (or every
five, thank you ext3).

Being able to fsync() the important stuff *without* forcing everything
else to disk, like btrfs promises, seems very nice. Now my editor files
can be fsync()ed without also requiring me to wait for a few hundred Mb of
who-knows-what breadcrumb crud from FF to also be synced.
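
For reference, this is roughly what an editor-style "save and wait for the disk" looks like from user space. The Python sketch below uses the standard os.fsync call on a single file. The save_durably name and the demo file name are illustrative, and how much other data gets flushed along with the file depends on the file system, which is exactly the point under discussion.

```python
import os
import tempfile


def save_durably(path: str, text: str) -> None:
    """Write text to path and wait until this one file reaches the disk."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write this file to disk


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "notes.txt")
save_durably(demo_path, "important edit\n")

with open(demo_path, encoding="utf-8") as f:
    saved = f.read()
print(repr(saved))
```

By contrast, os.sync() asks the kernel to flush everything that is dirty, which is the heavier operation the comment objects to.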

[May 17, 2010] btrfs Wiki

Btrfs is a new copy on write filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Initially developed by Oracle, Btrfs is licensed under the GPL and open for contribution from anyone.

Linux has a wealth of filesystems to choose from, but we are facing a number of challenges with scaling to the large storage subsystems that are becoming common in today's data centers. Filesystems need to scale in their ability to address and manage large storage, and also in their ability to detect, repair and tolerate errors in the data stored on disk.

Btrfs is under heavy development, but every effort is being made to keep the filesystem stable and fast. As of 2.6.31, we only plan to make forward compatible disk format changes, and many users have been experimenting with Btrfs on their systems with good results. Please email the Btrfs mailing list if you have any problems or questions while using Btrfs.

[Apr 16, 2009] Freezing filesystems and containers [LWN.net]

By Jake Edge
June 25, 2008
Freezing seems to be on the minds of some kernel hackers these days, whether it is the northern summer or southern winter that is causing it is unclear. Two recent patches posted to linux-kernel look at freezing, suspending essentially, two different pieces of the kernel: filesystems and containers. For containers, it is a step along the path to being able to migrate running processes elsewhere, whereas for filesystems it will allow backup systems to snapshot a consistent filesystem state. Other than conceptually, the patches have little to do with each other, but each is fairly small and self-contained so a combined look seemed in order.

Takashi Sato proposes taking an XFS-specific feature and moving it into the filesystem code. The patch would provide an ioctl() for suspending write access to a filesystem, freezing, along with a thawing option to resume writes. For backups that snapshot the state of a filesystem or otherwise operate directly on the block device, this can ensure that the filesystem is in a consistent state.

Essentially the patch just exports the freeze_bdev() kernel function in a user accessible way. freeze_bdev() locks a file system into a consistent state by flushing the superblock and syncing the device. The patch also adds tracking of the frozen state to the struct block_device state field. In its simplest form, freezing or thawing a filesystem would be done as follows:
    ioctl(fd, FIFREEZE, 0);

    ioctl(fd, FITHAW, 0);
Where fd is a file descriptor of the mount point and the argument is ignored.
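
The same pair of calls can be issued from Python with fcntl.ioctl. The sketch below is an illustration under stated assumptions: the request numbers are derived from the definitions in linux/fs.h (_IOWR('X', 119, int) and _IOWR('X', 120, int)) using the generic ioctl encoding found on x86 and most other architectures, the frozen helper is a made-up name, and actually freezing a file system requires root privileges and kernel support.

```python
import fcntl
import os
from contextlib import contextmanager


def _iowr(type_char: str, nr: int, size: int) -> int:
    """Encode a request number like the generic Linux _IOWR macro."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


FIFREEZE = _iowr("X", 119, 4)  # 4 = sizeof(int)
FITHAW = _iowr("X", 120, 4)


@contextmanager
def frozen(mount_point: str):
    """Freeze the file system at mount_point, thaw it again on exit."""
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        try:
            yield
        finally:
            fcntl.ioctl(fd, FITHAW, 0)  # always thaw, even on error
    finally:
        os.close(fd)


print(hex(FIFREEZE), hex(FITHAW))
```

A backup tool could wrap its snapshot step in a with frozen(...) block so that the thaw happens even if the snapshot fails. The sketch never calls frozen() itself, so running it only prints the two request numbers.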

In another part of the patchset, Sato adds a timeout value as the argument to the ioctl(). For XFS compatibility (though courtesy of a patch by David Chinner, the XFS-specific ioctl() is removed) a value of 1 for the pointer argument means that the timeout is not set. A value of 0 for the argument also means there is no timeout, but any other value is treated as a pointer to a timeout value in seconds. It would seem that removing the XFS-specific ioctl() would break any applications that currently use it anyway, so keeping the compatibility of the argument value 1 is somewhat dubious.

If the timeout occurs, the filesystem will be automatically thawed. This is to protect against some kind of problem with the backup system. Another ioctl() flag, FIFREEZE_RESET_TIMEOUT, has been added so that an application can periodically reset its timeout while it is working. If it deadlocks, or otherwise fails to reset the timeout, the filesystem will be thawed. Another FIFREEZE_RESET_TIMEOUT after that occurs will return EINVAL so that the application can recognize that it has happened.

Moving on to containers, Matt Helsley posted a patch which reuses the software suspend (swsusp) infrastructure to implement freezing of all the processes in a control group (i.e. cgroup). This could be used now to checkpoint and restart tasks, but eventually could be used to migrate tasks elsewhere entirely for load balancing or other reasons. Helsley's patch set is a forward port of work originally done by Cedric Le Goater.

The first step is to make the freeze option, in the form of the TIF_FREEZE flag, available to all architectures. Once that is done, moving two functions, refrigerator() and freeze_task(), from the power management subsystem to the new kernel/freezer.c file makes freezing tasks available even to architectures that don't support power management.

As is usual for cgroups, controlling the freezing and thawing is done through the cgroup filesystem. Adding the freezer option when mounting will allow access to each container's freezer.state file. This can be read to get the current freezer state or written to change it as follows:
    # cat /containers/0/freezer.state
    RUNNING
    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FROZEN
It should be noted that it is possible for tasks in a cgroup to be busy doing something that will not allow them to be frozen. In that case, the state would be FREEZING. Freezing can then be retried by writing FROZEN again, or canceled by writing RUNNING. Moving the offending tasks out of the cgroup will also allow the cgroup to be frozen. If the state does reach FROZEN, the cgroup can be thawed by writing RUNNING.
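
The retry behavior described above can be wrapped in a small helper. The Python sketch below uses the state names and the freezer.state file layout exactly as the article gives them. The function name set_freezer_state, the retry count, and the delay are illustrative choices, and released kernels may use different state names. The demonstration writes to an ordinary temporary file that stands in for a real cgroup file.

```python
import os
import tempfile
import time


def set_freezer_state(state_file: str, wanted: str,
                      retries: int = 5, delay: float = 0.1) -> str:
    """Write the wanted state, retrying while the cgroup is in transition.

    Returns the last state read back from the file.
    """
    state = ""
    for _ in range(retries):
        with open(state_file, "w") as f:
            f.write(wanted)
        with open(state_file) as f:
            state = f.read().strip()
        if state == wanted:
            break
        time.sleep(delay)  # e.g. still FREEZING: wait, then write again
    return state


# Stand-in for something like /containers/0/freezer.state
demo_file = os.path.join(tempfile.mkdtemp(), "freezer.state")
with open(demo_file, "w") as f:
    f.write("RUNNING")

final_state = set_freezer_state(demo_file, "FROZEN")
print(final_state)
```

If the helper returns FREEZING after all retries, the caller can either cancel by requesting RUNNING or move the offending tasks out of the cgroup, as the text explains.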

In order for swsusp and cgroups to share the refrigerator() it is necessary to ensure that frozen cgroups do not get thawed when swsusp is waking up the system after a suspend. The last patch in the set ensures that thaw_tasks() checks for a frozen cgroup before thawing, skipping over any that it finds.

There has not been much in the way of discussion about the patches on linux-kernel, but an ACK from Pavel Machek would seem to be a good sign. Some comments by Paul Menage, who developed cgroups, also indicate interest in seeing this feature merged.


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: July 18, 2014
