JFS/JFS2 Filesystem


IBM's JFS was developed in the mid-1990s for AIX, and later found its way to OS/2 and then to Linux. It was open-sourced by IBM in 1999 and has been available in the Linux kernel sources since 2002. It is therefore well tested, although the Linux version is rarely used.

IBM introduced JFS with the initial release of AIX 3.1; in May 2001, IBM introduced JFS2. Both filesystem types link their file and directory data to the structures used by the AIX LVM for storage and retrieval. JFS2 is optimized for a 64-bit environment. It is architected for filesystems up to four petabytes, but it has currently been tested only up to 16-terabyte filesystems. File sizes are likewise limited to 16 terabytes. The number of inodes that can be created in a filesystem is dynamic and is limited only by the amount of free space in the filesystem.

JFS2 supports buffered I/O, synchronous I/O (the file is opened with the O_SYNC or O_DSYNC flags), kernel asynchronous I/O (through the Async I/O system calls), direct I/O (on a per-file basis if the file is opened with O_DIRECT, or on a per-filesystem basis when the filesystem is mounted with the dio mount option), and concurrent I/O (on a per-file basis if the file is opened with O_CIO, or when the filesystem is mounted with the cio mount option).

With AIX, you can use either JFS or JFS2, as they are both linked to the LVM. Both are journaled, and no third-party filesystems are necessary. In AIX 5L Version 5.1, every filesystem corresponds to a logical volume. To create a journaled filesystem, use the SMIT fastpath smitty crfs, or crfs from the command line. To increase the size of a filesystem, use the chfs command, in addition to using SMIT.
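As a quick sketch of the commands just mentioned (AIX-only; the volume group name, size, and mount point are illustrative, not taken from the text), creating and later growing a JFS2 filesystem from the command line might look like this:

```shell
# Create a 1 GB JFS2 filesystem on volume group datavg, mounted at /data,
# and set it to mount automatically at boot (-A yes).
# crfs creates the underlying logical volume for you.
crfs -v jfs2 -g datavg -a size=1G -m /data -A yes

# Mount it and check the result.
mount /data
df -g /data

# Grow the filesystem by one gigabyte; the underlying
# logical volume is extended as needed.
chfs -a size=+1G /data
```

Note that the size=1G shorthand is accepted on recent AIX levels; older releases expect the size in 512-byte blocks.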

JFS is a fully 64-bit filesystem. With a default block size of 4KB, it supports a maximum filesystem size of 4 petabytes (less if you use smaller block sizes). The minimum filesystem size supported is 16MB. The JFS transaction log has a default size of 0.4% of the aggregate size, rounded up to a megabyte boundary. The maximum size of the log is 32MB. One interesting aspect of the layout on disk is the fsck working space, a small area allocated within the filesystem for keeping track of block allocation if there is not enough RAM to track a large filesystem at boot time. Here is a comparison table reproduced from Notes about Linux file systems:
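The log-sizing rule can be expressed as a small calculation (my own restatement of the rule as described, not code from the JFS utilities):

```python
MB = 1024 * 1024

def jfs_log_size(aggregate_bytes):
    """Default JFS log size: 0.4% of the aggregate size, rounded up
    to a megabyte boundary, with a 32 MB maximum."""
    raw = aggregate_bytes * 4 // 1000      # 0.4% of the aggregate
    rounded = -(-raw // MB) * MB           # round up to a 1 MB boundary
    return min(rounded, 32 * MB)

print(jfs_log_size(2 * 1024**3) // MB)    # 2 GiB aggregate -> 9 MB log
print(jfs_log_size(16 * 1024**3) // MB)   # 16 GiB aggregate -> capped at 32 MB
```

With these defaults, any filesystem larger than about 8 GB hits the 32 MB cap.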

Desktop filesystem features

Feature                        | ext3              | JFS                 | XFS
Block sizes                    | 1024-4096         | 4096                | 512-4096
Max fs size                    | 8 TiB (2^43 B)    | 32 PiB (2^55 B)     | 8 EiB (2^63 B); 16 TiB (2^44 B) on 32-bit systems
Max file size                  | 1 TiB (2^40 B)    | 4 PiB (2^52 B)      | 8 EiB (2^63 B); 16 TiB (2^44 B) on 32-bit systems
Max files/fs                   | 2^32              | 2^32                | 2^32
Max files/dir                  | 2^32              | 2^31                | 2^32
Max subdirs/dir                | 2^15              | 2^16                | 2^32
Number of inodes               | fixed             | dynamic             | dynamic
Indexed dirs                   | option            | auto                | yes
Small data in inodes           | no                | some                | auto
fsck speed                     | slow              | fast                | fast
fsck space                     | ?                 | 32 B per inode      | 2 GiB RAM per 1 TiB + 200 B per inode (half on 32-bit CPU)
Redundant metadata             | yes               | yes                 | ?
Bad block handling             | yes               | mkfs only           | no
Tunable commit interval        | yes               | no                  | no
Supports VFS lock              | yes               | yes                 | yes
Has own lock/snapshot          | no                | no                  | yes
Names                          | 8 bit             | UTF-16 or 8 bit     | 8 bit
noatime                        | yes               | yes                 | yes
O_DIRECT                       | allegedly         | allegedly           | yes
barrier                        | yes               | no                  | yes (checked)
commit interval                | yes               | no                  | no
EA/ACLs                        | both              | both                | both
Quotas                         | both              | patch               | both
DMAPI                          | no                | patch               | option
Case insensitive               | no                | mkfs only           | no
Supported by GRUB              | yes               | yes                 | mostly
Can grow                       | online            | online only         | online only
Can shrink                     | offline           | no                  | no
Journals data                  | option            | no                  | no
Journals what                  | blocks            | operations          | operations
Journal disabling              | yes               | yes                 | no
Journal size                   | fixed             | fixed               | grow/shrink
Resize journal                 | offline           | maybe               | offline
Journal on another partition   | yes               | yes                 | yes
Special features/misfeatures   | In-place convert from ext2; MS Windows drivers | Case-insensitive option; low CPU usage; DCE DFS compatible; OS/2 compatible | Real-time (streaming) section; IRIX compatible; very large write-behind; superblock on block 0

Support for JFS was added to the 2.4.20 and 2.5.6 Linux kernels. IBM itself no longer uses JFS; on AIX it uses JFS2.

JFS volume structure

JFS is organized like a traditional Unix-ish file system: it presents a logical view of files and directories linked together to form a tree-like structure. This is the concept that spread from the Unix world to pretty much everywhere else and that we all know. JFS is created on top of a logical volume. To maintain information about files and directories, it uses the following important internal structures.

The superblock lies at the heart of JFS (and many other file systems). It contains essential information such as the size of the file system, the number of blocks it contains, and the state of the file system (clean, dirty, etc.).

The entire file system space is divided into logical blocks that contain file or directory data. For JFS, the logical blocks are always 4096 bytes (4K) in size, but can be optionally subdivided into smaller fragments (512, 1024 or 2048 bytes).

An i-node is a logical entity that contains information about a file or directory. There is a 1:1 relationship between i-nodes and files/directories. An i-node contains file type, access permissions, user/group ID (UID/GID - unused on OS/2), access times and points to actual logical blocks where file contents are stored. The maximum file size allowed in JFS is 2TB. It should be noted that the number of i-nodes is fixed. It is determined at file system creation time and depends on fragment size (which is user selectable). Users could run out of i-nodes, meaning that they would be unable to create more files even if there was enough free space. In practice this is extremely rare.

Fragments were already briefly mentioned in the discussion of logical blocks. The JFS logical block size is fixed at 4K. This is a reasonable default, but it means that the file system cannot allocate less than 4K for file storage. If a file system stores large amounts of small files (< 2K), the wasted disk space becomes significant. We've all got to know and hate this problem from FAT (a cluster size of 32K leads to massive waste of space, in some cases over 50%). JFS attacks this by allowing fragmentation of logical blocks into smaller units, as small as 512 bytes (this is the sector size of hard drives, and it is not possible to read or write less than 512 bytes from/to disk). However, users should be careful, because fragmentation incurs additional overhead and hence slows down disk access. I would recommend using fragments smaller than 4K only when the users know for sure that they will store very large amounts of small files on the file system.
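The space-waste argument is easy to quantify. This sketch (illustrative file sizes, not JFS code) computes the slack space for a pile of 1.5 KB files under different allocation unit sizes:

```python
def wasted_bytes(file_sizes, alloc_unit):
    """Slack space when each file occupies whole allocation units."""
    total = 0
    for size in file_sizes:
        units = -(-size // alloc_unit)           # ceiling division
        total += units * alloc_unit - size       # allocated minus actually used
    return total

# 10,000 small files of 1.5 KB each
files = [1536] * 10_000
for unit in (512, 1024, 2048, 4096):
    print(f"{unit:5d}-byte units: {wasted_bytes(files, unit) / 1024:.0f} KiB wasted")
```

With 512-byte fragments the 1.5 KB files fit exactly and waste nothing, while 4K blocks waste 2.5 KB per file.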

The entire JFS volume space is subdivided into allocation groups. Each allocation group contains i-nodes and data blocks. This enables the file system to store i-nodes and their associated data in physical proximity (HPFS uses a very similar technique). The allocation group size varies from 8MB to 64MB and depends on fragment size and number of fragments it contains.

Journaling

As the name of JFS implies, journaling is a very important feature of this file system. It should be noted that journaling is actually independent of JFS's structure described above. The journaling technique has its roots in database systems and it is employed to ensure maximum consistency of the file system, hence minimizing the risk of data loss - a very important feature for servers, but even home/SOHO users hate to lose data.

JFS uses a special log device to implement a circular journal. On AIX, several JFS volumes can share a single log device. I'm not sure this is possible on OS/2; I believe each JFS volume (corresponding to a drive letter) has its own 'inline' log located inside the JFS volume, whose size is selectable at FORMAT time.

It is important to note that JFS does not log (or journal) everything. It only logs changes to file system meta-data. Simply speaking, the log contains a record of changes to everything in the file system except actual file data, i.e. changes to the superblock, i-nodes, directories and allocation structures. There must obviously be some overhead here, and indeed, performance may suffer when applications are doing lots of synchronous (uncached) I/O or creating and/or deleting many files in a short amount of time. The performance loss is, however, not noticeable in most cases and is well worth the increased safety.

The log (or journal) occupies a dedicated area on disk and is written to immediately when any meta-data change occurs. When the disk becomes idle, the actual file system structure is updated according to the log. After a crash, all it usually takes to restore the file system to full consistency is replaying the log, i.e. performing the recorded transactions. Of course, if a process was in the middle of writing a file when the system crashed or the power died, the file could be inconsistent (the application might not be able to read it again), but you will not lose this file or other files, as is often the case with other file systems.
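To make the replay idea concrete, here is a deliberately toy model of a metadata journal (my own sketch; it bears no resemblance to JFS's on-disk record format): committed transactions are reapplied, and a transaction cut off by the crash is simply dropped.

```python
def replay(log):
    """Rebuild metadata state from a journal, ignoring any transaction
    that never reached its 'commit' record (i.e. was cut off by a crash)."""
    state, pending = {}, []
    for record in log:
        if record == ("commit",):
            for key, value in pending:   # transaction complete: apply it
                state[key] = value
            pending = []
        else:
            pending.append(record)       # buffered until commit
    return state                         # the uncommitted tail is discarded

log = [
    ("inode/7", "allocated"), ("dir/home", "entry:file.txt"), ("commit",),
    ("inode/9", "allocated"),            # crash before commit -> dropped on replay
]
print(replay(log))   # {'inode/7': 'allocated', 'dir/home': 'entry:file.txt'}
```

The file being written when the crash hit may be incomplete, but the metadata state after replay is always self-consistent.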

JFS

OS/2 users often ask what exactly the difference is between the various file systems available on OS/2. The following table, taken almost verbatim from WSeB's Quick Beginnings book, summarizes the most important differences between the file systems available for WSeB from IBM.
 
Characteristic | Journaled File System (JFS) | 386 High Performance File System (386HPFS) | High Performance File System (HPFS) | FAT File System
Max volume size | 2TB (terabytes) | 64GB (gigabytes) | 64GB (gigabytes) | 2GB (gigabytes)
Max file size | 2TB (terabytes) | 2GB (gigabytes) | 2GB (gigabytes) | 2GB (gigabytes)
Allows spaces and periods in file names | Yes | Yes | Yes | No (8.3 format)
Standard directory and file attributes | Within file system | Within file system | Within file system | Within file system
Extended Attributes (64KB text or binary data with keywords) | Within file system | Within file system | Within file system | In separate file
Max path length | 260 characters 1) | 260 characters | 260 characters | 64 characters
Bootable | No 2) | Yes | Yes | Yes
Allows dynamic volume expansion | Yes | No | No | No
Scales with SMP | Yes | No | No | No
Local security support | No | Yes | No | No
Average wasted space per file | 256 to 2048 bytes | 256 bytes | 256 bytes | 1/2 cluster (1KB to 16KB)
Allocation information for files | Near each file in its i-node | Near each file in its FNODE | Near each file in its FNODE | Centralized near volume beginning
Directory structure | Sorted B+tree | Sorted B-tree | Sorted B-tree | Unsorted linear, must be searched exhaustively
Directory location | Close to files it contains | Near seek center of volume | Near seek center of volume | Root directory at beginning of volume; others scattered
Write-behind (lazy write) | Optional | Optional | Optional | Optional
Maximum cache size | Physical memory available | Physical memory available | 2MB | 14MB
Caching program | None (parameters set in CONFIG.SYS) | CACHE386.EXE | CACHE.EXE | None (parameters set in CONFIG.SYS)
LAN Server access control lists | Within file system | Within file system | In separate file (NET.ACC) | In separate file

1) JFS stores file and directory names in Unicode. This allows JFS to always maintain proper sort order, regardless of active codepage.
2) This is not a permanent limitation; it is only that no one has written a JFS micro- and mini-IFS yet.

It might perhaps interest some users that JFS also seems to have built-in support for DASD limits, though I have never tried to use this feature. DASD limits, a.k.a. the Directory Limits feature of LAN Server, allow administrators to control how much space a directory can take, effectively enabling them to limit the disk space usage of users. Previously this feature only worked on HPFS386 volumes. Obviously this is of no use to home users, who have all their disk space to themselves, but it can be very useful for system administrators.

JFS Utilities

WSeB comes with several new JFS-specific utilities, in addition to the usual ones like CHKDSK and FORMAT. I'll only give a quick overview of them here; the important ones are documented in the Command Reference. In addition to the utilities supplied with WSeB, I also managed to build several extra utilities from the OpenJFS sources, thanks to invaluable help from several friends. These are not publicly available in binary form to my knowledge, though I could probably e-mail them to interested readers - but beware, they are for experts only and not guaranteed to work!


Storage Management in AIX 5L Version 5.3 AIX IBM Systems Magazine

June 2006 | by Shiv Dutta

Note: This article can also be found on the IBM Developerworks Web site (www-128.ibm.com/developerworks/eserver/library/es-aix5l-lvm.html).

When this article was first published in April 2005 under the title Logical Volume Manager in AIX 5L Version 5.3, it discussed a number of features that were introduced in AIX 5L* Version 5.3 to enhance the scope, functionality, and performance of the Logical Volume Manager (LVM). The next major enhancements to AIX 5L were introduced in the 5300-03 maintenance level, which was released in September 2005. This article is an updated and expanded version of the April 2005 publication. While the original content has been retained almost in its entirety, the article has been augmented by including a discussion of some of the LVM enhancements introduced in the 5300-03 maintenance level. Also, its scope has been broadened to cover a number of improvements, introduced both in the original release of the AIX 5L Version 5.3 and the 5300-03 maintenance level, to the Enhanced Journal File System (JFS2). In the following discussions, I use the expression (5300-03) to indicate that the referenced feature is available only for the 5300-03 maintenance level and beyond.

LVM command enhancements
In AIX 5L Version 5.3, changes have been made to the following LVM commands to enhance their performance, so that they require less execution time than their counterparts in prior releases of AIX*:

Concurrent mode (classical and enhanced)
The classical concurrent mode volume groups (VGs) only supported Serial DASD and SSA disks in conjunction with the 32-bit kernel. AIX 5L Version 5.1 overcame the restriction of supported disk types by introducing the so-called enhanced concurrent mode VG, which extended the concurrent mode support to all other disk types. While AIX 5L Version 5.2 did not allow creation of classical concurrent mode VGs, it did support them. The support for classical concurrent mode VGs has been completely removed from AIX 5L Version 5.3. When trying to import a classical concurrent mode VG in AIX 5L Version 5.3, an error message informs the user to convert the VG to enhanced concurrent mode.

VGs (normal, big, and scalable)
The VG type commonly known as standard or normal allows a maximum of 32 physical volumes (PVs). A standard or normal VG supports no more than 1016 physical partitions (PPs) per PV and has an upper limit of 256 logical volumes (LVs) per VG. Subsequently, a new VG type was introduced, referred to as big VG. A big VG allows up to 128 PVs and a maximum of 512 LVs.

AIX 5L Version 5.3 has introduced a new VG type called scalable volume group (scalable VG). A scalable VG allows a maximum of 1024 PVs and 4096 LVs. The maximum number of PPs applies to the entire VG and is no longer defined on a per-disk basis. This opens up the prospect of configuring VGs with a relatively small number of disks and fine-grained storage allocation options through a large number of PPs, which are small in size. The scalable VG can hold up to 2,097,152 (2048 K) PPs. As with the older VG types, the size is specified in units of megabytes and the size variable must be equal to a power of 2. The range of PP sizes starts at 1 (1 MB) and goes up to 131,072 (128 GB). This is more than two orders of magnitude above the 1024 (1 GB) maximum for both normal and big VG types in AIX 5L Version 5.2. The new maximum PP size provides architectural support for 256 petabyte disks. Table 1 below shows the variation of configuration limits with different VG types. Note that the maximum number of user-definable LVs is given by the maximum number of LVs per VG minus 1, because one LV is reserved for system use. Consequently, system administrators can configure 255 LVs in normal VGs, 511 in big VGs, and 4095 in scalable VGs.
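The arithmetic behind the 256-petabyte figure checks out: 2,097,152 PPs of 128 GB each is 2^21 x 2^37 bytes = 2^58 bytes, i.e. 256 PB (assuming binary gigabytes):

```python
PP_COUNT_MAX = 2 * 1024**2       # 2,097,152 PPs per scalable VG (2^21)
PP_SIZE_MAX = 128 * 1024**3      # 128 GB maximum PP size, binary units (2^37 B)

total = PP_COUNT_MAX * PP_SIZE_MAX
print(total == 2**58, total // 1024**5)   # True 256 -> 256 PB of addressable space
```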

The scalable VG implementation in AIX 5L Version 5.3 provides configuration flexibility with respect to the number of PVs and LVs that can be accommodated by a given instance of the new VG type. The configuration options allow any scalable VG to contain 32, 64, 128, 256, 512, 768, or 1024 disks and 256, 512, 1024, 2048, or 4096 LVs. You do not need to configure the maximum values of 1024 PVs and 4096 LVs at the time of VG creation to account for potential future growth. You can always increase the initial settings at a later date as required.

The System Management Interface Tool (SMIT) and the Web-based System Manager graphical user interface fully support the scalable VG. Existing SMIT panels, which are related to VG management tasks, have been changed and many new panels added to account for the scalable VG type. For example, you can use the new SMIT fast path _mksvg to directly access the Add a Scalable VG SMIT menu. The user commands mkvg, chvg, and lsvg have been enhanced in support of the scalable VG type.

Striped column support for LVs
AIX 5L Version 5.3 provides striped columns support for LVs. This new feature allows extension of a striped LV, even if one of the PVs in the disk array becomes full. In previous AIX releases, you could enlarge the size of a striped LV with the extendlv command, as long as enough PPs were available within the group of disks which defined the redundant array of independent disks (RAID) disk array. Rebuilding the entire LV was the only way to expand a striped LV beyond the hard limits imposed by the disk capacities. You needed to back up and delete the striped LV, and then recreate the LV with a larger stripe width followed by a restore operation of the LV data. To overcome the disadvantages of this time-consuming procedure, AIX 5L Version 5.3 has introduced the concept of striped columns for LVs.

Prior to AIX 5L Version 5.3, the stripe width of a striped LV was determined at the time of LV creation by either of the following two methods:

Prior versions of AIX 5L do not allow you to configure a striped LV with an upper bound larger than the stripe width. In AIX 5L Version 5.3, the upper bound can be a multiple of the stripe width. One set of disks, as determined by the stripe width, is considered as one striped column. Note that the upper bound value is not related to the number of mirror copies in case you are using a RAID 10 configuration.

If you use the extendlv command to extend a striped LV beyond the physical limits of the first striped column, AIX uses an entire new set of disks to fulfill the allocation request for additional logical partitions. If you further expand the LV, more striped columns might get added as required, as long as you stay within the upper bound limit. The chlv -u command allows you to increase the upper bound limit to provide additional headroom for striped LV expansion. You can also use the -u flag of the enhanced extendlv command to raise the upper bound and extend the LV all in one operation.
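As an illustrative sketch (AIX-only commands; disk names, sizes, and flags are from memory of the AIX 5L man pages, so verify before use), building a striped LV and later adding a second striped column might look like:

```shell
# Striped LV across 4 disks (one striped column), 64 KB stripe size,
# 100 logical partitions, with an upper bound of 4 PVs.
mklv -y stripedlv -C 4 -S 64K -u 4 datavg 100 hdisk1 hdisk2 hdisk3 hdisk4

# Raise the upper bound to allow a second striped column,
# then extend onto a fresh set of 4 disks.
chlv -u 8 stripedlv
extendlv stripedlv 100 hdisk5 hdisk6 hdisk7 hdisk8

# Or raise the bound and extend in one step with the enhanced extendlv:
# extendlv -u 8 stripedlv 100 hdisk5 hdisk6 hdisk7 hdisk8
```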

The user commands mklv, chlv, extendlv, and mklvcopy have been enhanced to support the introduction of the striped column feature in AIX 5L Version 5.3.

Volume group pbuf pools
The LVM uses a construct named pbuf to control a pending disk I/O. A pbuf is a pinned memory buffer. The LVM always uses one pbuf for each individual I/O request, regardless of the amount of data that is transferred. AIX creates extra pbufs when adding a new PV to a VG. In previous AIX releases, the pbuf pool was a system-wide resource, but the LVM assigns and manages one pbuf pool per VG with AIX 5L Version 5.3. This enhancement supports advanced scalability and performance for systems with a large number of VGs and applies to all VG types. As a consequence of the new pbuf pool implementation, AIX displays and manages additional LVM statistics and tuning parameters.

AIX 5L Version 5.3 now includes the lvmo command. It provides support for new pbuf pool-related administrative tasks. You can use the lvmo command to display pbuf and blocked I/O statistics and settings for pbuf tunables, regardless of whether the scope of the entity is system-wide or VG-specific. However, the lvmo command only allows changing the settings of the LVM pbuf tunables that are dedicated to specific VGs. The ioo command continues to manage the sole pbuf tunable with system-wide scope. Also, the vmstat -v command still displays the system-wide number of I/Os that were blocked due to a lack of free pbufs, as in prior releases of AIX.

Variable logical track group
When the LVM receives a request for an I/O, it breaks the I/O down into what is called logical track group (LTG) sizes before it passes the request down to the device driver of the underlying disks. The LTG is the maximum transfer size of an LV and is common to all the LVs in the VG. AIX 5L Version 5.2 accepted LTG values of 128 KB, 256 KB, 512 KB, and 1024 KB. However, many disks now support transfer sizes larger than 1 MB. To take advantage of these larger transfer sizes and get better disk I/O performance, AIX 5L Version 5.3 accepts values of 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, and 16 MB for the LTG size.

In contrast to previous releases, AIX 5L Version 5.3 also allows the stripe size of an LV to be larger than the LTG size in use and expands the range of valid stripe sizes significantly. Version 5.3 adds support for 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, and 128 MB stripe sizes to complement the 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, and 1 MB stripe size options available in prior releases of AIX. In AIX 5L Version 5.2, the LTG size was set by the -L flag on the chvg or mkvg command. In AIX 5L Version 5.3, it is set by the varyonvg command, using the flag -M. The LTG size thus created is called the variable LTG size.

The following command sets the LTG size of the tmpvg VG at 512 KB:

# varyonvg -M512K tmpvg

The LTG size is specified either in K or M units, implying KB or MB respectively. When the LTG size is set using the -M flag, the varyonvg and extendvg commands might fail if an underlying disk has a maximum transfer size that is smaller than the LTG size. To find out the maximum supported LTG size of your hard disk, you can use the lquerypv command with the -M flag. The output gives the LTG size in KB, as shown in example below.

# /usr/sbin/lquerypv -M hdisk0 256

The lspv command displays the same value as MAX REQUEST, as shown in Listing 1.

You can list the value of the LTG in use with the lsvg command, as shown in Listing 2.

Note that the LTG size for a VG created in AIX 5L Version 5.3 will be displayed as dynamic in the lsvg command output, as shown in Listing 2. By default, AIX 5L Version 5.3 creates VGs with a variable LTG size. If you want to import it to a previous release of AIX, you first need to disable the variable LTG by using the -I option for mkvg or chvg and then do a varyoffvg followed by exportvg, otherwise the importvg command on the previous release fails.

Geographic Logical Volume Manager (GLVM) (5300-03)
It extends the LVM mirroring function to support a copy of a logical volume on a remote AIX system connected over a TCP/IP network. A complete copy of application data can be quickly and easily brought back online on a remote system.

The mirscan command (5300-03)
This command searches for and corrects physical partitions that are stale or unable to perform I/O operations. This is useful for the following types of situations:

  1. A physical partition on the underlying storage is incapable of performing I/O operations but, for a long time, no I/O operations have been attempted for that physical partition. The customer needs a way to detect and correct this condition.
  2. A disk is about to be replaced. The customer needs to make sure they are not about to remove the last good copy of their data from the system.

Multiple instances of AIX on a single root volume group (multibos) (5300-03)
This feature allows the user to create a new instance of the AIX Base Operating System (BOS) within the running rootvg. This new instance, based on the running rootvg, contains private and shared data. A similar offering already available is Alternate Disk Installation. While somewhat similar, multibos differs in a few very important aspects:

Rollback function (available for JFS2 file system only) (5300-03)
Restores an entire file system to a valid point-in-time snapshot (target snapshot). Rollback attempts to restore the snapshots present at the time of the target snapshot. Snapshots taken after the target snapshot are lost.

Disk quotas support for JFS2
AIX 5L Version 5.3 extends the JFS2 functionality by implementing disk usage quotas to control usage of persistent storage.

Disk quotas might be set for individual users or groups on a per file system basis. Version 5.3 also introduces the concept of Limit Classes. It allows the configuration of per file system limits, provides a method to remove old or stale quota records, and offers comprehensive support through dedicated SMIT panels. It also provides a method to define a set of hard and soft disk block and file allocation limits and the grace periods before the soft limit becomes enforced as the hard limit.
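The soft/hard limit and grace-period semantics can be sketched as follows (a toy model of the general quota mechanism, not AIX's implementation; all names and numbers are illustrative):

```python
import time

def quota_check(usage, soft, hard, over_since, grace, now=None):
    """Toy model of soft/hard quota semantics: writes are refused above
    the hard limit, or above the soft limit once the grace period expires."""
    now = now if now is not None else time.time()
    if usage > hard:
        return "denied: hard limit"
    if usage > soft:
        if over_since is not None and now - over_since > grace:
            return "denied: soft limit grace period expired"
        return "warning: over soft limit"   # still within the grace period
    return "ok"

DAY = 86_400
print(quota_check(900, soft=1000, hard=2000, over_since=None, grace=7 * DAY))
print(quota_check(1500, soft=1000, hard=2000, over_since=0, grace=7 * DAY, now=8 * DAY))
print(quota_check(2500, soft=1000, hard=2000, over_since=0, grace=7 * DAY, now=DAY))
```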

The quota support for JFS2 and JFS can be used on the same system.

Shrink a file system
AIX 5L Version 5.3 supports shrinking a JFS2 file system dynamically. When the size of the file system is decreased, the LV on which the file system resides is also decreased.
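A minimal sketch (the mount point and sizes are illustrative): shrinking uses the same chfs syntax as growing, with a negative delta:

```shell
# Shrink /data by 512 MB; the logical volume beneath it shrinks too.
chfs -a size=-512M /data

# An absolute target size also works:
chfs -a size=4G /data
```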

JFS2 logredo scalability
AIX 5L Version 5.3 provides the following enhancements in the area of logredo to improve performance and to support large numbers of file systems:

JFS2 file system check scalability
AIX 5L Version 5.3 enhanced the implementation of the helper that performs the file system check for JFS2 file systems. The new code makes better use of system resources and includes algorithms that improve scalability and performance.

JFS2 ACL support for NFS V4
Starting with AIX 5L Version 5.3, the Enhanced Journaled File System now supports ACLs for NFS version 4. This allows you to establish fine-grained access control for file system objects and support inheritance features.

Conclusion
AIX 5L Version 5.3 has many more features than have been discussed here. I hope this article has given you a flavor of the type of enhancements you can expect in the latest release of AIX.

IBM Systems Magazine is a trademark of International Business Machines Corporation. The editorial content of IBM Systems Magazine is placed on this website by MSP TechMedia under license from International Business Machines Corporation.

©2009 MSP Communications, Inc. All rights reserved.

Linux.com: 30 days with JFS, by Keith Winston


JFS dynamically allocates space for disk inodes, freeing the space when it is no longer required. This eliminates the possibility of running out of inodes due to a large number of small files. As far as I can tell, JFS is the only filesystem in the kernel with this feature. For performance and efficiency, the contents of small directories are stored within the directory's inode. Up to eight entries are stored in-line within the inode, excluding the self (.) and parent (..) entries. Larger directories use a B+ tree keyed on name for faster retrieval. Internally, JFS uses extents to allocate blocks to files, leading to efficient use of space even as files grow in size. This is also available in XFS, and is a major new feature in ext4.

JFS supports both sparse and dense files. Sparse files allow data to be written to random locations within a file without writing intervening file blocks. JFS reports the file size as the largest used block, while only allocating actually used blocks. Sparse files are useful for applications that require a large logical space but use only a portion of the space. With dense files, blocks are allocated to fill the entire file size, whether data is written to them or not.
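The sparse-file behavior described above is easy to demonstrate with generic POSIX calls (shown here on Linux; this is not JFS-specific code): seek past a hole, write a few bytes, and compare the logical size with the blocks actually allocated.

```python
import os
import tempfile

# Create a sparse file: seek past a 10 MiB hole, then write a few bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(10 * 1024 * 1024)    # skip 10 MiB without writing the intervening blocks
    f.write(b"tail")
    path = f.name

st = os.stat(path)
print("logical size:", st.st_size)         # 10485764 bytes: hole + data
print("allocated:", st.st_blocks * 512)    # far less: only the data (plus metadata)
os.remove(path)
```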

In addition to the standard permissions, JFS supports basic extended attributes, such as the immutable (i) and append-only (a) attributes. I was able to successfully set and test them with the lsattr and chattr programs. I could not find definitive information on JFS access control list support under Linux.

Logging

The main design goal of JFS was to provide fast crash recovery for large filesystems, avoiding the long filesystem check (fsck) times of older Unix filesystems. That was also the primary goal of filesystems like ext3 and ReiserFS. Unlike ext3, journaling was not an add-on to JFS, but baked into the design from the start. For high-performance applications, the JFS transaction log file can be created on an external volume if one is specified when the filesystem is first created.

JFS only logs operations on meta-data, maintaining the consistency of the filesystem structure, but not necessarily the data. A crash might result in stale data, but the files should remain consistent and usable.

Here is a list of the filesystem operations logged by JFS:

Utilities

JFS provides a suite of utilities to manage its filesystems. You must be the root user to use them.

Utility Description
jfs_debugfs Shell-based JFS filesystem editor. Allows changes to the ACL, uid/gid, mode, time, etc. You can also alter data on disk, but only by entering hex strings -- not the most efficient way to edit a file.
jfs_fsck Replay the JFS transaction log, check and repair a JFS device. Should be run only on an unmounted or read-only filesystem. Run automatically at boot.
jfs_fscklog Extract a JFS fsck service log into a file. jfs_fscklog -e /dev/hda6 extracts the binary log to file fscklog.new. To view, use jfs_fscklog -d fscklog.new.
jfs_logdump Dump the journal log to a plain text file that shows data on each transaction in the log file.
jfs_mkfs Create a JFS formatted partition. Use the -j journal_device option to create an external journal (1.0.18 or later).
jfs_tune Adjust tunable filesystem parameters on JFS. I didn't find options that looked like they might improve performance. The -l option lists the superblock info.

Here is what a dump of the superblock information looks like:

root@slackt41:~# jfs_tune -l /dev/hda6
jfs_tune version 1.1.11, 05-Jun-2006

JFS filesystem superblock:

JFS magic number:       'JFS1'
JFS version:            1
JFS state:              mounted
JFS flags:              JFS_LINUX  JFS_COMMIT  JFS_GROUPCOMMIT  JFS_INLINELOG 
Aggregate block size:   4096 bytes
Aggregate size:         12239720 blocks
Physical block size:    512 bytes
Allocation group size:  16384 aggregate blocks
Log device number:      0x306
Filesystem creation:    Wed Jul 11 01:52:42 2007
Volume label:           ''

Crash testing

White papers and man pages are no substitute for the harsh reality of a server room. To test the recovery capabilities of JFS, I started crashing my system (forced power off) with increasing workloads. I repeated each crash twice to see if my results were consistent.

  • Workload: console (no X) running a text editor with one open file.
    Recovery: about 2 seconds to replay the journal log. Changes I had not saved in the editor were missing, but the file was intact.
  • Workload: X window system with KDE, GIMP, Nvu, and a text editor in an xterm, all with open files.
    Recovery: about 2 seconds to replay the journal log. All open files were intact; unsaved changes were missing.
  • Workload: X window system with KDE, GIMP, Nvu, and a text editor, all with open files, plus a shell script that inserted records into a MySQL (ISAM) table. The script I wrote was an infinite loop, and I let it run for a couple of minutes to make sure some records were flushed to disk.
    Recovery: about 3 seconds to replay the journal log. All open files intact, database intact with a few thousand records inserted, but the timestamp on the table file had been rolled back one minute.

In all cases, these boot messages appeared:

**Phase 0 - Replay Journal Log
-|----- (spinner appeared for a couple of seconds, then went away)
Filesystem is clean

Throughout the crash testing, I saw no filesystem corruption, and the longest log replay time I experienced was about 3 seconds.

Conclusion

While my improvised crash tests were not a good simulation of a busy server, JFS held up well, and recovery time was fast. All file-level applications I tested, such as tar and rsync, worked flawlessly, and lower-level programs like Truecrypt also worked as expected.

After 30 days of kicking and prodding, I have a high level of confidence in JFS, and I am content trusting my data to it. JFS may not have been marketed as effectively as other alternatives, but it is a solid choice in the long list of quality Linux filesystems.

JFS structure summary (051031)

This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.

Basic entities
Partition
A partition is a container: it has merely a size and a sector size, also called the partition block size, which defines I/O granularity (and is usually the same for all partitions on the physical medium). A partition contains exactly one aggregate.
Extent
A contiguous sequence of blocks, wholly contained in one allocation group. The maximum size of an extent is 2^24 - 1 blocks, or just under 64GiB. There are a few types of extents; one of them, ABNR, describes an extent containing only zero bytes.
Map
A map is a collection of extents that contains a B+-tree index rooted in the first extent of the collection. For example, it can be an index of extents for a file body, in which case it is an allocation map, or an index of inode names for a directory, in which case it is called a directory map; the extents in a map are described in the map itself. The root extent of the map is called a btree; the leaf extents are called xtrees (and contain an array of entries called xads) if they belong to an allocation map, and dtrees if they belong to a directory map.
File body
A file body is a sequence of one or more extents, the extents being listed in an allocation map. The extents may be from different allocation groups.
Inode
An inode is a 512-byte descriptor for the attributes of a file or directory; it also contains the root of a file body's allocation map, or of a directory map.
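The extent size limit quoted above is easy to sanity-check with a few lines of Python arithmetic; the 4096-byte block size is the value current JFS requires for aggregates, and the 24-bit extent length field is what yields the 2^24 - 1 figure:

```python
# Arithmetic check of the extent limit described above: the extent
# length field is 24 bits wide, so an extent can span at most
# 2^24 - 1 aggregate blocks of 4 KiB each.

BLOCK_SIZE = 4096              # mandatory aggregate block size (bytes)
MAX_EXTENT_BLOCKS = 2**24 - 1  # 24-bit length field

max_extent_bytes = MAX_EXTENT_BLOCKS * BLOCK_SIZE
print(max_extent_bytes)               # 68719472640
print(max_extent_bytes < 64 * 2**30)  # True: just under 64 GiB
```

So the largest single extent falls one block short of a full 64GiB.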
Aggregates
Aggregate
An aggregate is about allocating space, and has a size and an aggregate block size, which defines the granularity of allocation of space to files, and currently must be 4096.
  • Aggregates have a primary and a backup superblock.
  • Aggregates contain one or more allocation groups.
  • Aggregates have primary and backup aggregate inode tables, which must be exactly 32 inodes long.
  • Aggregates may contain one or more filesets, but currently only one is allowed.
  • Aggregates also have some space reserved for use by jfs_fsck.
Allocation group
An allocation group, also known as an AG, is merely a section of an aggregate. There is no data structure associated with an allocation group; all data structures belong either to the aggregate or to a fileset.
  • There can be up to 128 AGs in an aggregate, and each must be at least 8192 blocks or 32MiB.
  • Each allocation group must contain a number of blocks that is a power-of-2 multiple of the number of block descriptors in a dmap page.
  • If multiple files are growing, each allocates extents from a different allocation group if possible.
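These constraints can be checked against the "Allocation group size: 16384 aggregate blocks" value shown in the jfs_tune -l dump earlier on this page. A small Python sketch (the 2^13 figure is the number of block bits tracked by one dmap page, described under the block allocation map below):

```python
# Check the allocation-group constraints using the AG size reported
# by jfs_tune -l for the example filesystem (16384 aggregate blocks).

BLOCKS_PER_DMAP = 2**13   # block bits tracked by one dmap page
MIN_AG_BLOCKS = 8192      # minimum allocation group size, in blocks
BLOCK_SIZE = 4096         # aggregate block size (bytes)

ag_blocks = 16384

big_enough = ag_blocks >= MIN_AG_BLOCKS
multiple = ag_blocks // BLOCKS_PER_DMAP
power_of_two = multiple > 0 and (multiple & (multiple - 1)) == 0

print(big_enough and power_of_two)      # True
print(ag_blocks * BLOCK_SIZE // 2**20)  # 64 (MiB per AG)
```

The example filesystem thus uses 64MiB allocation groups, twice the 32MiB minimum.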
Aggregate inode table
The aggregate inode table is an inode allocation map for the inodes that are used internally by the aggregate and are not user visible (that is, not part of any fileset); these inodes describe structures such as the block allocation map, the inline log, and the bad block file. Since the aggregate inode table file refers to itself, the first extent of its inode allocation map has a well-known constant address (just after the superblock).
Block allocation map
The block allocation map, also called the bmap, is a file (not a B+-tree, despite being called a map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
  • Two arrays of 2^13 bits, where each bit corresponds to a block of the aggregate and is 1 if the block is in use. Because of the limit of three levels of dmap control pages, there can be at most 2^30 dmap pages, and thus at most 2^43 blocks in an aggregate.
  • Some metadata, including a buddy tree that defines a buddy system of the free and allocated blocks. The buddy tree also extends upwards into the dmap control pages.
The block allocation map contains information that is redundant with that of the inode allocation maps, so it can be fully reconstructed, but only with a full scan of the aggregate and fileset inode tables.
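The dmap numbers tie together with a few lines of Python arithmetic. The 1024-entry fanout per dmap control page is my assumption here, chosen because it reproduces the stated limit of 2^30 dmap pages over three levels:

```python
# Derive the aggregate size limit from the dmap layout sketched above.

BITS_PER_DMAP = 2**13   # blocks tracked by one dmap page
FANOUT = 1024           # entries per dmap control page (assumed)
LEVELS = 3              # at most three levels of dmap control pages

max_dmap_pages = FANOUT ** LEVELS
max_blocks = max_dmap_pages * BITS_PER_DMAP

print(max_dmap_pages == 2**30)     # True
print(max_blocks == 2**43)         # True
# With 4 KiB blocks that is 2^55 bytes, i.e. the 32 PiB maximum
# filesystem size quoted for JFS elsewhere on this page.
print(max_blocks * 4096 == 2**55)  # True
```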
Inline log
A sequence of blocks towards the end of an aggregate that is used to record intended modifications to aggregate or fileset metadata.
Bad blocks
This is a file whose extents cover all the bad blocks discovered by jfs_fsck if any.
Inode allocation maps
Inode allocation map
An inode allocation map is the file body of an inode table file, not a map. This file body contains, as its first 4KiB block, a control page called the dinomap, followed by a number of extents called inode allocation groups.
The dinomap contains:
  • The AG free inode lists array.
  • The AG free inode extents lists array.
  • The IAG free list.
  • The IAG free next.
These structures partition the information held in the inode allocation map by allocation group.
AG free inode lists array
The AG free inode lists array contains a list header for each AG. Each list threads together all the IAGs in that AG that have some free inode entries.
AG free inode extents lists array
The AG free inode extents lists array contains a list header for each AG, and each list threads together all the IAGs in an AG that have some free inode extents.
IAG free list
The IAG free list array contains a list header for each AG, and each list threads together those IAGs in the AG whose inodes are all free.
IAG free next
The IAG free next is the number of the next IAG to append (if required) to an inode allocation map, or equivalently the number of IAGs in an inode allocation map plus 1.
Inode allocation group
An inode allocation group, also called IAG, is a 4KiB block that describes up to 128 inode table extents, for a total of up to 4096 inode table entries.
An inode allocation group can be in any allocation group, but all the inode table extents it describes must be in the same allocation group as the first one, unlike the extents of a general-purpose file body, which can be in any allocation group. As soon as its first inode table extent is allocated in an allocation group, the inode allocation group is tied to that group until all such extents are freed.
Once allocated, inode allocation groups are never freed, but their inode table extents may be freed.
Inode table extent
Inode table extents are pointed to by inode allocation groups, and each must be 16KiB in length, and contains 32 inode table entries.
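The figures above fit together as simple arithmetic: each 16KiB inode table extent holds 32 of the 512-byte inodes, and 128 such extents per IAG give 4096 inodes. A quick check in Python:

```python
# Verify the per-IAG inode accounting described above.

INODE_SIZE = 512            # bytes per on-disk inode
EXTENT_SIZE = 16 * 1024     # bytes per inode table extent
EXTENTS_PER_IAG = 128       # inode table extents described by one IAG

inodes_per_extent = EXTENT_SIZE // INODE_SIZE
inodes_per_iag = EXTENTS_PER_IAG * inodes_per_extent

print(inodes_per_extent)    # 32
print(inodes_per_iag)       # 4096
```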
Filesets
Fileset
A fileset is a collection of named inodes. A fileset is defined by its fileset inode table, which is an inode allocation map file. It contains these inodes:
  • Number 0 is reserved.
  • Number 1 is a file containing extended fileset information.
  • Number 2 is a directory which is the root of the fileset naming tree.
  • Number 3 is a file containing the ACL for the fileset.
  • Number 4 and following are used for the other files or directories in the fileset, all must be reachable from the directory at number 2.
File
A file is an inode with an attached (optional) allocation map describing a file body that contains data; a particular case of a file is a symbolic link, where the data in the file is a path name.
Directory
A directory is an inode with a list of names and corresponding inode numbers; the list is either contained entirely within the inode, if it is small, or is an attached directory map containing dtree entries.

Recommended Links


Softpanorama Recommended

Top articles

Sites

JFS Log

Open source : JFS project Web site

jfs.sourceforge.net

This white paper gives an overview of the changes to meta-data structures that JFS logs.

[PDF] Using JFS ACLs -- File System Security, HP-UX System Administrator's Guide: Security Management, HP Part Number 5991-6482, Publication Date E0207


[PDF] JFS Tuning and Performance

JFS is an advanced journaling filesystem. It has been designed to provide excellent ...


ploug.eu.org/doc/jfs-a4.pdf
usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/best/best_html/index.html

Notes about Linux file systems

Updated: 2006-10-22
Created: 2005-10-31

  • Section menu

    File system references (061022)

    Older references are not entirely accurate: things in kernel 2.6 are considerably better than in kernel 2.4, and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. The references below are therefore ordered most recent first.

    General
    Descriptions (061022)
    Benchmarks
    Warnings: many of these benchmarks are somewhat naively designed, and some omit truly essential aspects of the context, such as the elevator or the filesystem readahead. Benchmarks under Linux 2.6 can give very different results from those under Linux 2.4, and SCSI and ATA/IDE disc drives have very, very different performance profiles, including sync reporting.
    Online discussions
    Warning: some of these discussions are listed here because I think that they are notably wrong. Some pointers are to single articles, some to threads.

    File system features (060801)

    Desktop filesystem features
    Feature                        ext3               JFS                XFS
    Block sizes                    1024-4096          4096               512-4096
    Max fs size                    8TiB (2^43 B)      32PiB (2^55 B)     8EiB (2^63 B)
      (16TiB (2^44 B) on 32-bit systems)
    Max file size                  1TiB (2^40 B)      4PiB (2^52 B)      8EiB (2^63 B)
      (16TiB (2^44 B) on 32-bit systems)
    Max files/fs                   2^32               2^32               2^32
    Max files/dir                  2^32               2^31               2^32
    Max subdirs/dir                2^15               2^16               2^32
    Number of inodes               fixed              dynamic            dynamic
    Indexed dirs                   option             auto               yes
    Small data in inodes           no                 some               auto
    fsck speed                     slow               fast               fast
    fsck space                     ?                  32B per inode      2GiB RAM per 1TiB
                                                                        + 200B per inode
                                                                        (half on 32b CPU)
    Redundant metadata             yes                yes                ?
    Bad block handling             yes                mkfs only          no
    Tunable commit interval        yes                no                 no
    Supports VFS lock              yes                yes                yes
    Has own lock/snapshot          no                 no                 yes
    Names                          8 bit              UTF-16 or 8 bit    8 bit
    noatime                        yes                yes                yes
    O_DIRECT                       allegedly          allegedly          yes
    barrier                        yes                no                 yes (checked)
    commit interval                yes                no                 no
    EA/ACLs                        both               both               both
    Quotas                         both               patch              both
    DMAPI                          no                 patch              option
    Case insensitive               no                 mkfs only          no
    Supported by GRUB              yes                yes                mostly
    Can grow                       online             online only        online only
    Can shrink                     offline            no                 no
    Journals data                  option             no                 no
    Journals what                  blocks             operations         operations
    Journal disabling              yes                yes                no
    Journal size                   fixed              fixed              grow/shrink
    Resize journal                 offline            maybe              offline
    Journal on another partition   yes                yes                yes
    Special features               In-place convert   Case insensitive   Real time (streaming)
                                   from ext2.         option.            section.
                                   MS Windows         Low CPU usage.     IRIX compatible.
                                   drivers.           DCE DFS            Very large write
                                                      compatible.        behind.
                                                      OS2 compatible.    Superblock on block 0.

    Some of my notes on filesystems (061022)

    These are pointers to some of the entries in my technical blog where file systems are discussed:





    Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author's free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright for original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

    FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.


    Disclaimer:

    The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP, or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.


    Last modified: September 12, 2017