JFS/JFS2 Filesystem


IBM's JFS was developed in the mid-1990s for AIX, and later found its way to OS/2 and then to Linux. It was open-sourced by IBM in 1999 and has been available in the Linux kernel sources since 2002. It is therefore well tested, although the Linux version is rarely used.

IBM introduced JFS with the initial release of AIX 3.1; in May 2001, IBM introduced JFS2. Both filesystem types link their file and directory data to the structures used by the AIX LVM for storage and retrieval. JFS2 is optimized for a 64-bit environment. It is architected for filesystems up to four petabytes, but it has currently been tested only up to 16-terabyte filesystems. File sizes are likewise limited to 16 terabytes. The number of inodes that can be created in a filesystem is dynamic and is limited only by the amount of free space in the filesystem.

JFS2 supports buffered I/O, synchronous I/O (the file is opened with the O_SYNC or O_DSYNC flags), kernel asynchronous I/O (through the Async I/O system calls), direct I/O (on a per-file basis if the file is opened with O_DIRECT, or on a per-filesystem basis when the filesystem is mounted with the dio mount option), and concurrent I/O (on a per-file basis if the file is opened with O_CIO, or when the filesystem is mounted with the cio mount option).

With AIX, you can use either JFS or JFS2, as they are both linked to the LVM. Both are journaled, and no third-party filesystems are necessary. In AIX 5L Version 5.1, every filesystem corresponds to a logical volume. To create a journaled filesystem, use the SMIT fastpath smitty crfs, or crfs from the command line. To increase the size of a filesystem, use the chfs command, in addition to using SMIT.
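As a quick sketch of the commands just mentioned (AIX-only; the volume group name, size, and mount point are illustrative, not taken from the text), creating and later growing a JFS2 filesystem from the command line might look like this:

```shell
# Create a 1 GB JFS2 filesystem on volume group datavg, mounted at /data,
# and set it to mount automatically at boot (-A yes).
# crfs creates the underlying logical volume for you.
crfs -v jfs2 -g datavg -a size=1G -m /data -A yes

# Mount it and check the result.
mount /data
df -g /data

# Grow the filesystem by one gigabyte; the underlying
# logical volume is extended as needed.
chfs -a size=+1G /data
```

Note that the size=1G shorthand is accepted on recent AIX levels; older releases expect the size in 512-byte blocks.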

JFS is a fully 64-bit filesystem. With a default block size of 4KB, it supports a maximum filesystem size of 4 petabytes (less if you use smaller block sizes). The minimum filesystem size supported is 16MB. The JFS transaction log has a default size of 0.4% of the aggregate size, rounded up to a megabyte boundary. The maximum size of the log is 32MB. One interesting aspect of the layout on disk is the fsck working space, a small area allocated within the filesystem for keeping track of block allocation if there is not enough RAM to track a large filesystem at boot time. Here is a comparison table reproduced from Notes about Linux file systems:
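The log-sizing rule can be expressed as a small calculation (my own restatement of the rule as described, not code from the JFS utilities):

```python
MB = 1024 * 1024

def jfs_log_size(aggregate_bytes):
    """Default JFS log size: 0.4% of the aggregate size, rounded up
    to a megabyte boundary, with a 32 MB maximum."""
    raw = aggregate_bytes * 4 // 1000      # 0.4% of the aggregate
    rounded = -(-raw // MB) * MB           # round up to a 1 MB boundary
    return min(rounded, 32 * MB)

print(jfs_log_size(2 * 1024**3) // MB)    # 2 GiB aggregate -> 9 MB log
print(jfs_log_size(16 * 1024**3) // MB)   # 16 GiB aggregate -> capped at 32 MB
```

With these defaults, any filesystem larger than about 8 GB hits the 32 MB cap.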

Desktop filesystem features

Feature                        | ext3              | JFS                 | XFS
Block sizes                    | 1024-4096         | 4096                | 512-4096
Max fs size                    | 8 TiB (2^43 B)    | 32 PiB (2^55 B)     | 8 EiB (2^63 B); 16 TiB (2^44 B) on 32-bit systems
Max file size                  | 1 TiB (2^40 B)    | 4 PiB (2^52 B)      | 8 EiB (2^63 B); 16 TiB (2^44 B) on 32-bit systems
Max files/fs                   | 2^32              | 2^32                | 2^32
Max files/dir                  | 2^32              | 2^31                | 2^32
Max subdirs/dir                | 2^15              | 2^16                | 2^32
Number of inodes               | fixed             | dynamic             | dynamic
Indexed dirs                   | option            | auto                | yes
Small data in inodes           | no                | some                | auto
fsck speed                     | slow              | fast                | fast
fsck space                     | ?                 | 32 B per inode      | 2 GiB RAM per 1 TiB + 200 B per inode (half on 32-bit CPU)
Redundant metadata             | yes               | yes                 | ?
Bad block handling             | yes               | mkfs only           | no
Tunable commit interval        | yes               | no                  | no
Supports VFS lock              | yes               | yes                 | yes
Has own lock/snapshot          | no                | no                  | yes
Names                          | 8 bit             | UTF-16 or 8 bit     | 8 bit
noatime                        | yes               | yes                 | yes
O_DIRECT                       | allegedly         | allegedly           | yes
barrier                        | yes               | no                  | yes (checked)
commit interval                | yes               | no                  | no
EA/ACLs                        | both              | both                | both
Quotas                         | both              | patch               | both
DMAPI                          | no                | patch               | option
Case insensitive               | no                | mkfs only           | no
Supported by GRUB              | yes               | yes                 | mostly
Can grow                       | online            | online only         | online only
Can shrink                     | offline           | no                  | no
Journals data                  | option            | no                  | no
Journals what                  | blocks            | operations          | operations
Journal disabling              | yes               | yes                 | no
Journal size                   | fixed             | fixed               | grow/shrink
Resize journal                 | offline           | maybe               | offline
Journal on another partition   | yes               | yes                 | yes
Special features/misfeatures   | In-place convert from ext2; MS Windows drivers | Case-insensitive option; low CPU usage; DCE DFS compatible; OS/2 compatible | Real-time (streaming) section; IRIX compatible; very large write-behind; superblock on block 0

Support for JFS was added to the 2.4.20 and 2.5.6 Linux kernels. IBM itself no longer uses JFS; on AIX it uses JFS2.

JFS volume structure

JFS is organized like a traditional Unix-ish file system: it presents a logical view of files and directories linked together to form a tree-like structure. This is the concept that spread from the Unix world to pretty much everywhere else and that we all know. JFS is created on top of a logical volume. To maintain information about files and directories, it uses the following important internal structures.

The superblock lies at the heart of JFS (and many other file systems). It contains essential information such as the size of the file system, the number of blocks it contains, and the state of the file system (clean, dirty, etc.).

The entire file system space is divided into logical blocks that contain file or directory data. For JFS, the logical blocks are always 4096 bytes (4K) in size, but can be optionally subdivided into smaller fragments (512, 1024 or 2048 bytes).

An i-node is a logical entity that contains information about a file or directory. There is a 1:1 relationship between i-nodes and files/directories. An i-node contains file type, access permissions, user/group ID (UID/GID - unused on OS/2), access times and points to actual logical blocks where file contents are stored. The maximum file size allowed in JFS is 2TB. It should be noted that the number of i-nodes is fixed. It is determined at file system creation time and depends on fragment size (which is user selectable). Users could run out of i-nodes, meaning that they would be unable to create more files even if there was enough free space. In practice this is extremely rare.

Fragments were already briefly mentioned in the discussion of logical blocks. The JFS logical block size is fixed at 4K. This is a reasonable default, but it means that the file system cannot allocate less than 4K for file storage. If a file system stores large amounts of small files (< 2K), the wasted disk space becomes significant. We've all got to know and hate this problem from FAT (a cluster size of 32K leads to massive waste of space, in some cases over 50%). JFS attacks this by allowing fragmentation of logical blocks into smaller units, as small as 512 bytes (this is the sector size of hard drives, and it is not possible to read or write less than 512 bytes from/to disk). However, users should be careful, because fragmentation incurs additional overhead and hence slows down disk access. I would recommend using fragments smaller than 4K only when the users know for sure that they will store very large amounts of small files on the file system.
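The space-waste argument is easy to quantify. This sketch (illustrative file sizes, not JFS code) computes the slack space for a pile of 1.5 KB files under different allocation unit sizes:

```python
def wasted_bytes(file_sizes, alloc_unit):
    """Slack space when each file occupies whole allocation units."""
    total = 0
    for size in file_sizes:
        units = -(-size // alloc_unit)           # ceiling division
        total += units * alloc_unit - size       # allocated minus actually used
    return total

# 10,000 small files of 1.5 KB each
files = [1536] * 10_000
for unit in (512, 1024, 2048, 4096):
    print(f"{unit:5d}-byte units: {wasted_bytes(files, unit) / 1024:.0f} KiB wasted")
```

With 512-byte fragments the 1.5 KB files fit exactly and waste nothing, while 4K blocks waste 2.5 KB per file.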

The entire JFS volume space is subdivided into allocation groups. Each allocation group contains i-nodes and data blocks. This enables the file system to store i-nodes and their associated data in physical proximity (HPFS uses a very similar technique). The allocation group size varies from 8MB to 64MB and depends on fragment size and number of fragments it contains.

Journaling

As the name of JFS implies, journaling is a very important feature of this file system. It should be noted that journaling is actually independent of JFS's structure described above. The journaling technique has its roots in database systems and it is employed to ensure maximum consistency of the file system, hence minimizing the risk of data loss - a very important feature for servers, but even home/SOHO users hate to lose data.

JFS uses a special log device to implement a circular journal. On AIX, several JFS volumes can share a single log device. I'm not sure this is possible on OS/2; I believe each JFS volume (corresponding to a drive letter) has its own 'inline' log located inside the JFS volume, whose size is selectable at FORMAT time.

It is important to note that JFS does not log (or journal) everything. It only logs changes to file system meta-data. Simply speaking, the log contains a record of changes to everything in the file system except actual file data, i.e. changes to the superblock, i-nodes, directories and allocation structures. There must obviously be some overhead here, and indeed, performance may suffer when applications are doing lots of synchronous (uncached) I/O or creating and/or deleting many files in a short amount of time. The performance loss is, however, not noticeable in most cases and is well worth the increased safety.

The log (or journal) occupies a dedicated area on disk and is written to immediately when any meta-data change occurs. When the disk becomes idle, the actual file system structure is updated according to the log. After a crash, all it usually takes to restore the file system to full consistency is replaying the log, i.e. performing the recorded transactions. Of course, if a process was in the middle of writing a file when the system crashed or the power died, the file could be inconsistent (the application might not be able to read it again), but you will not lose this file or other files, as is often the case with other file systems.
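To make the replay idea concrete, here is a deliberately toy model of a metadata journal (my own sketch; it bears no resemblance to JFS's on-disk record format): committed transactions are reapplied, and a transaction cut off by the crash is simply dropped.

```python
def replay(log):
    """Rebuild metadata state from a journal, ignoring any transaction
    that never reached its 'commit' record (i.e. was cut off by a crash)."""
    state, pending = {}, []
    for record in log:
        if record == ("commit",):
            for key, value in pending:   # transaction complete: apply it
                state[key] = value
            pending = []
        else:
            pending.append(record)       # buffered until commit
    return state                         # the uncommitted tail is discarded

log = [
    ("inode/7", "allocated"), ("dir/home", "entry:file.txt"), ("commit",),
    ("inode/9", "allocated"),            # crash before commit -> dropped on replay
]
print(replay(log))   # {'inode/7': 'allocated', 'dir/home': 'entry:file.txt'}
```

The file being written when the crash hit may be incomplete, but the metadata state after replay is always self-consistent.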

JFS

OS/2 users often ask what exactly the difference is between the various file systems available on OS/2. The following table, taken almost verbatim from WSeB's Quick Beginnings book, summarizes the most important differences between the file systems available for WSeB from IBM.
 
Characteristic | Journaled File System (JFS) | 386 High Performance File System (386HPFS) | High Performance File System (HPFS) | FAT File System
Max volume size | 2TB (terabytes) | 64GB (gigabytes) | 64GB (gigabytes) | 2GB (gigabytes)
Max file size | 2TB (terabytes) | 2GB (gigabytes) | 2GB (gigabytes) | 2GB (gigabytes)
Allows spaces and periods in file names | Yes | Yes | Yes | No (8.3 format)
Standard directory and file attributes | Within file system | Within file system | Within file system | Within file system
Extended Attributes (64KB text or binary data with keywords) | Within file system | Within file system | Within file system | In separate file
Max path length | 260 characters 1) | 260 characters | 260 characters | 64 characters
Bootable | No 2) | Yes | Yes | Yes
Allows dynamic volume expansion | Yes | No | No | No
Scales with SMP | Yes | No | No | No
Local security support | No | Yes | No | No
Average wasted space per file | 256 to 2048 bytes | 256 bytes | 256 bytes | 1/2 cluster (1KB to 16KB)
Allocation information for files | Near each file in its i-node | Near each file in its FNODE | Near each file in its FNODE | Centralized near volume beginning
Directory structure | Sorted B+tree | Sorted B-tree | Sorted B-tree | Unsorted linear, must be searched exhaustively
Directory location | Close to files it contains | Near seek center of volume | Near seek center of volume | Root directory at beginning of volume; others scattered
Write-behind (lazy write) | Optional | Optional | Optional | Optional
Maximum cache size | Physical memory available | Physical memory available | 2MB | 14MB
Caching program | None (parameters set in CONFIG.SYS) | CACHE386.EXE | CACHE.EXE | None (parameters set in CONFIG.SYS)
LAN Server access control lists | Within file system | Within file system | In separate file (NET.ACC) | In separate file

1) JFS stores file and directory names in Unicode. This allows JFS to always maintain proper sort order, regardless of active codepage.
2) This is not a permanent limitation; it is only that no one has written a JFS micro- and mini-IFS yet.

It might perhaps interest some users that JFS also seems to have built-in support for DASD limits, though I have never tried to use this feature. DASD limits, a.k.a. the Directory Limits feature of LAN Server, allow administrators to control how much space a directory can take, effectively enabling them to limit the disk space usage of users. Previously this feature only worked on HPFS386 volumes. Obviously this is of no use to home users, who have all their disk space to themselves, but it can be very useful for system administrators.

JFS Utilities

WSeB comes with several new JFS-specific utilities, in addition to the usual ones like CHKDSK and FORMAT. I'll only give a quick overview of them here; the important ones are documented in the Command Reference. In addition to the utilities supplied with WSeB, I also managed to build several extra utilities from the OpenJFS sources, thanks to invaluable help from several friends. These are not publicly available in binary form to my knowledge, though I could probably e-mail them to interested readers - but beware, they are for experts only and not guaranteed to work!


Storage Management in AIX 5L Version 5.3 AIX IBM Systems Magazine

June 2006 | by Shiv Dutta

Note: This article can also be found on the IBM Developerworks Web site (www-128.ibm.com/developerworks/eserver/library/es-aix5l-lvm.html).

When this article was first published in April 2005 under the title Logical Volume Manager in AIX 5L Version 5.3, it discussed a number of features that were introduced in AIX 5L* Version 5.3 to enhance the scope, functionality, and performance of the Logical Volume Manager (LVM). The next major enhancements to AIX 5L were introduced in the 5300-03 maintenance level, which was released in September 2005. This article is an updated and expanded version of the April 2005 publication. While the original content has been retained almost in its entirety, the article has been augmented by including a discussion of some of the LVM enhancements introduced in the 5300-03 maintenance level. Also, its scope has been broadened to cover a number of improvements, introduced both in the original release of the AIX 5L Version 5.3 and the 5300-03 maintenance level, to the Enhanced Journal File System (JFS2). In the following discussions, I use the expression (5300-03) to indicate that the referenced feature is available only for the 5300-03 maintenance level and beyond.

LVM command enhancements
In AIX 5L Version 5.3, changes have been made to the following LVM commands to enhance their performance, so that they require less execution time than their counterparts in prior releases of AIX*:

Concurrent mode (classical and enhanced)
The classical concurrent mode volume groups (VGs) only supported Serial DASD and SSA disks in conjunction with the 32-bit kernel. AIX 5L Version 5.1 overcame the restriction of supported disk types by introducing the so-called enhanced concurrent mode VG, which extended the concurrent mode support to all other disk types. While AIX 5L Version 5.2 did not allow creation of classical concurrent mode VGs, it did support them. The support for classical concurrent mode VGs has been completely removed from AIX 5L Version 5.3. When trying to import a classical concurrent mode VG in AIX 5L Version 5.3, an error message informs the user to convert the VG to enhanced concurrent mode.

VGs (normal, big, and scalable)
The VG type commonly known as standard or normal allows a maximum of 32 physical volumes (PVs). A standard or normal VG supports no more than 1016 physical partitions (PPs) per PV and has an upper limit of 256 logical volumes (LVs) per VG. Subsequently, a new VG type was introduced, referred to as big VG. A big VG allows up to 128 PVs and a maximum of 512 LVs.

AIX 5L Version 5.3 has introduced a new VG type called scalable volume group (scalable VG). A scalable VG allows a maximum of 1024 PVs and 4096 LVs. The maximum number of PPs applies to the entire VG and is no longer defined on a per-disk basis. This opens up the prospect of configuring VGs with a relatively small number of disks and fine-grained storage allocation options through a large number of PPs, which are small in size. The scalable VG can hold up to 2,097,152 (2048 K) PPs. As with the older VG types, the size is specified in units of megabytes and the size variable must be equal to a power of 2. The range of PP sizes starts at 1 (1 MB) and goes up to 131,072 (128 GB). This is more than two orders of magnitude above the 1024 (1 GB) maximum for both normal and big VG types in AIX 5L Version 5.2. The new maximum PP size provides architectural support for 256 petabyte disks. Table 1 below shows the variation of configuration limits with different VG types. Note that the maximum number of user-definable LVs is given by the maximum number of LVs per VG minus 1, because one LV is reserved for system use. Consequently, system administrators can configure 255 LVs in normal VGs, 511 in big VGs, and 4095 in scalable VGs.
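The arithmetic behind the 256-petabyte figure checks out: 2,097,152 PPs of 128 GB each is 2^21 x 2^37 bytes = 2^58 bytes, i.e. 256 PB (assuming binary gigabytes):

```python
PP_COUNT_MAX = 2 * 1024**2       # 2,097,152 PPs per scalable VG (2^21)
PP_SIZE_MAX = 128 * 1024**3      # 128 GB maximum PP size, binary units (2^37 B)

total = PP_COUNT_MAX * PP_SIZE_MAX
print(total == 2**58, total // 1024**5)   # True 256 -> 256 PB of addressable space
```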

The scalable VG implementation in AIX 5L Version 5.3 provides configuration flexibility with respect to the number of PVs and LVs that can be accommodated by a given instance of the new VG type. The configuration options allow any scalable VG to contain 32, 64, 128, 256, 512, 768, or 1024 disks and 256, 512, 1024, 2048, or 4096 LVs. You do not need to configure the maximum values of 1024 PVs and 4096 LVs at the time of VG creation to account for potential future growth. You can always increase the initial settings at a later date as required.

The System Management Interface Tool (SMIT) and the Web-based System Manager graphical user interface fully support the scalable VG. Existing SMIT panels, which are related to VG management tasks, have been changed and many new panels added to account for the scalable VG type. For example, you can use the new SMIT fast path _mksvg to directly access the Add a Scalable VG SMIT menu. The user commands mkvg, chvg, and lsvg have been enhanced in support of the scalable VG type.

Striped column support for LVs
AIX 5L Version 5.3 provides striped columns support for LVs. This new feature allows extension of a striped LV, even if one of the PVs in the disk array becomes full. In previous AIX releases, you could enlarge the size of a striped LV with the extendlv command, as long as enough PPs were available within the group of disks which defined the redundant array of independent disks (RAID) disk array. Rebuilding the entire LV was the only way to expand a striped LV beyond the hard limits imposed by the disk capacities. You needed to back up and delete the striped LV, and then recreate the LV with a larger stripe width followed by a restore operation of the LV data. To overcome the disadvantages of this time-consuming procedure, AIX 5L Version 5.3 has introduced the concept of striped columns for LVs.

Prior to AIX 5L Version 5.3, the stripe width of a striped LV was determined at the time of LV creation by either of the following two methods:

Prior versions of AIX 5L do not allow you to configure a striped LV with an upper bound larger than the stripe width. In AIX 5L Version 5.3, the upper bound can be a multiple of the stripe width. One set of disks, as determined by the stripe width, is considered as one striped column. Note that the upper bound value is not related to the number of mirror copies in case you are using a RAID 10 configuration.

If you use the extendlv command to extend a striped LV beyond the physical limits of the first striped column, AIX uses an entire new set of disks to fulfill the allocation request for additional logical partitions. If you further expand the LV, more striped columns might get added as required, as long as you stay within the upper bound limit. The chlv -u command allows you to increase the upper bound limit to provide additional headroom for striped LV expansion. You can also use the -u flag of the enhanced extendlv command to raise the upper bound and extend the LV all in one operation.
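As an illustrative sketch (AIX-only commands; disk names, sizes, and flags are from memory of the AIX 5L man pages, so verify before use), building a striped LV and later adding a second striped column might look like:

```shell
# Striped LV across 4 disks (one striped column), 64 KB stripe size,
# 100 logical partitions, with an upper bound of 4 PVs.
mklv -y stripedlv -C 4 -S 64K -u 4 datavg 100 hdisk1 hdisk2 hdisk3 hdisk4

# Raise the upper bound to allow a second striped column,
# then extend onto a fresh set of 4 disks.
chlv -u 8 stripedlv
extendlv stripedlv 100 hdisk5 hdisk6 hdisk7 hdisk8

# Or raise the bound and extend in one step with the enhanced extendlv:
# extendlv -u 8 stripedlv 100 hdisk5 hdisk6 hdisk7 hdisk8
```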

The user commands mklv, chlv, extendlv, and mklvcopy have been enhanced to support the introduction of the striped column feature in AIX 5L Version 5.3.

Volume group pbuf pools
The LVM uses a construct named pbuf to control a pending disk I/O. A pbuf is a pinned memory buffer. The LVM always uses one pbuf for each individual I/O request, regardless of the amount of data that is transferred. AIX creates extra pbufs when adding a new PV to a VG. In previous AIX releases, the pbuf pool was a system-wide resource, but the LVM assigns and manages one pbuf pool per VG with AIX 5L Version 5.3. This enhancement supports advanced scalability and performance for systems with a large number of VGs and applies to all VG types. As a consequence of the new pbuf pool implementation, AIX displays and manages additional LVM statistics and tuning parameters.

AIX 5L Version 5.3 now includes the lvmo command. It provides support for new pbuf pool-related administrative tasks. You can use the lvmo command to display pbuf and blocked I/O statistics and settings for pbuf tunables, regardless of whether the scope of the entity is system-wide or VG-specific. However, the lvmo command only allows changing the settings of the LVM pbuf tunables that are dedicated to specific VGs. The ioo command continues to manage the sole pbuf tunable with system-wide scope. Also, the vmstat -v command still displays the system-wide number of I/Os that were blocked due to a lack of free pbufs, as in prior releases of AIX.

Variable logical track group
When the LVM receives a request for an I/O, it breaks the I/O down into what is called logical track group (LTG) sizes before it passes the request down to the device driver of the underlying disks. The LTG is the maximum transfer size of an LV and is common to all the LVs in the VG. AIX 5L Version 5.2 accepted LTG values of 128 KB, 256 KB, 512 KB, and 1024 KB. However, many disks now support transfer sizes larger than 1 MB. To take advantage of these larger transfer sizes and get better disk I/O performance, AIX 5L Version 5.3 accepts values of 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, and 16 MB for the LTG size.

In contrast to previous releases, AIX 5L Version 5.3 also allows the stripe size of an LV to be larger than the LTG size in use and expands the range of valid stripe sizes significantly. Version 5.3 adds support for 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, and 128 MB stripe sizes to complement the 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, and 1 MB stripe size options available in prior releases of AIX. In AIX 5L Version 5.2, the LTG size was set by the -L flag on the chvg or mkvg command. In AIX 5L Version 5.3, it is set by the varyonvg command, using the flag -M. The LTG size thus created is called the variable LTG size.

The following command sets the LTG size of the tmpvg VG at 512 KB:

# varyonvg -M512K tmpvg

The LTG size is specified either in K or M units, implying KB or MB respectively. When the LTG size is set using the -M flag, the varyonvg and extendvg commands might fail if an underlying disk has a maximum transfer size that is smaller than the LTG size. To find out the maximum supported LTG size of your hard disk, you can use the lquerypv command with the -M flag. The output gives the LTG size in KB, as shown in example below.

# /usr/sbin/lquerypv -M hdisk0 256

The lspv command displays the same value as MAX REQUEST, as shown in Listing 1.

You can list the value of the LTG in use with the lsvg command, as shown in Listing 2.

Note that the LTG size for a VG created in AIX 5L Version 5.3 will be displayed as dynamic in the lsvg command output, as shown in Listing 2. By default, AIX 5L Version 5.3 creates VGs with a variable LTG size. If you want to import it to a previous release of AIX, you first need to disable the variable LTG by using the -I option for mkvg or chvg and then do a varyoffvg followed by exportvg, otherwise the importvg command on the previous release fails.

Geographic Logical Volume Manager (GLVM) (5300-03)
It extends the LVM mirroring function to support a copy of a logical volume on a remote AIX system connected over a TCP/IP network. A complete copy of application data can be quickly and easily brought back online on a remote system.

The mirscan command (5300-03)
This command searches for and corrects physical partitions that are stale or unable to perform I/O operations. This is useful for the following types of situations:

  1. A physical partition on the underlying storage is incapable of performing I/O operations but, for a long time, no I/O operations have been attempted for that physical partition. The customer needs a way to detect and correct this condition.
  2. A disk is about to be replaced. The customer needs to make sure they are not about to remove the last good copy of their data from the system.

Multiple instances of AIX on a single root volume group (multibos) (5300-03)
This feature allows the user to create a new instance of the AIX Base Operating System (BOS) within the running rootvg. This new instance, based on the running rootvg, contains private and shared data. A similar offering already available is Alternate Disk Installation. While somewhat similar, multibos differs in a few very important aspects:

Rollback function (available for JFS2 file system only) (5300-03)
Restores an entire file system to a valid point-in-time snapshot (target snapshot). Rollback attempts to restore the snapshots present at the time of the target snapshot. Snapshots taken after the target snapshot are lost.

Disk quotas support for JFS2
AIX 5L Version 5.3 extends the JFS2 functionality by implementing disk usage quotas to control usage of persistent storage.

Disk quotas might be set for individual users or groups on a per file system basis. Version 5.3 also introduces the concept of Limit Classes. It allows the configuration of per file system limits, provides a method to remove old or stale quota records, and offers comprehensive support through dedicated SMIT panels. It also provides a method to define a set of hard and soft disk block and file allocation limits and the grace periods before the soft limit becomes enforced as the hard limit.
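The soft/hard limit and grace-period semantics can be sketched as follows (a toy model of the general quota mechanism, not AIX's implementation; all names and numbers are illustrative):

```python
import time

def quota_check(usage, soft, hard, over_since, grace, now=None):
    """Toy model of soft/hard quota semantics: writes are refused above
    the hard limit, or above the soft limit once the grace period expires."""
    now = now if now is not None else time.time()
    if usage > hard:
        return "denied: hard limit"
    if usage > soft:
        if over_since is not None and now - over_since > grace:
            return "denied: soft limit grace period expired"
        return "warning: over soft limit"   # still within the grace period
    return "ok"

DAY = 86_400
print(quota_check(900, soft=1000, hard=2000, over_since=None, grace=7 * DAY))
print(quota_check(1500, soft=1000, hard=2000, over_since=0, grace=7 * DAY, now=8 * DAY))
print(quota_check(2500, soft=1000, hard=2000, over_since=0, grace=7 * DAY, now=DAY))
```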

The quota support for JFS2 and JFS can be used on the same system.

Shrink a file system
AIX 5L Version 5.3 supports shrinking a JFS2 file system dynamically. When the size of the file system is decreased, the LV on which the file system resides is also decreased.
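A minimal sketch (the mount point and sizes are illustrative): shrinking uses the same chfs syntax as growing, with a negative delta:

```shell
# Shrink /data by 512 MB; the logical volume beneath it shrinks too.
chfs -a size=-512M /data

# An absolute target size also works:
chfs -a size=4G /data
```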

JFS2 logredo scalability
AIX 5L Version 5.3 provides the following enhancements in the area of logredo to improve performance and to support large numbers of file systems:

JFS2 file system check scalability
AIX 5L Version 5.3 enhanced the implementation of the helper that performs the file system check for JFS2 file systems. The new code makes better use of system resources and includes algorithms that improve scalability and performance.

JFS2 ACL support for NFS V4
Starting with AIX 5L Version 5.3, the Enhanced Journaled File System now supports ACLs for NFS version 4. This allows you to establish fine-grained access control for file system objects and support inheritance features.

Conclusion
AIX 5L Version 5.3 has many more features than have been discussed here. I hope this article has given you a flavor of the type of enhancements you can expect in the latest release of AIX.

IBM Systems Magazine is a trademark of International Business Machines Corporation. The editorial content of IBM Systems Magazine is placed on this website by MSP TechMedia under license from International Business Machines Corporation.

©2009 MSP Communications, Inc. All rights reserved.

Linux.com: 30 days with JFS, by Keith Winston


JFS dynamically allocates space for disk inodes, freeing the space when it is no longer required. This eliminates the possibility of running out of inodes due to a large number of small files. As far as I can tell, JFS is the only filesystem in the kernel with this feature. For performance and efficiency, the contents of small directories are stored within the directory's inode. Up to eight entries are stored in-line within the inode, excluding the self (.) and parent (..) entries. Larger directories use a B+ tree keyed on name for faster retrieval. Internally, JFS uses extents to allocate blocks to files, leading to efficient use of space even as files grow in size. This is also available in XFS, and is a major new feature in ext4.

JFS supports both sparse and dense files. Sparse files allow data to be written to random locations within a file without writing intervening file blocks. JFS reports the file size as the largest used block, while only allocating actually used blocks. Sparse files are useful for applications that require a large logical space but use only a portion of the space. With dense files, blocks are allocated to fill the entire file size, whether data is written to them or not.
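The sparse-file behavior described above is easy to demonstrate with generic POSIX calls (shown here on Linux; this is not JFS-specific code): seek past a hole, write a few bytes, and compare the logical size with the blocks actually allocated.

```python
import os
import tempfile

# Create a sparse file: seek past a 10 MiB hole, then write a few bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(10 * 1024 * 1024)    # skip 10 MiB without writing the intervening blocks
    f.write(b"tail")
    path = f.name

st = os.stat(path)
print("logical size:", st.st_size)         # 10485764 bytes: hole + data
print("allocated:", st.st_blocks * 512)    # far less: only the data (plus metadata)
os.remove(path)
```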

In addition to the standard permissions, JFS supports basic extended attributes, such as the immutable (i) and append-only (a) attributes. I was able to successfully set and test them with the lsattr and chattr programs. I could not find definitive information on JFS access control list support under Linux.

Logging

The main design goal of JFS was to provide fast crash recovery for large filesystems, avoiding the long filesystem check (fsck) times of older Unix filesystems. That was also the primary goal of filesystems like ext3 and ReiserFS. Unlike ext3, journaling was not an add-on to JFS, but baked into the design from the start. For high-performance applications, the JFS transaction log file can be created on an external volume if one is specified when the filesystem is first created.

JFS only logs operations on meta-data, maintaining the consistency of the filesystem structure, but not necessarily the data. A crash might result in stale data, but the files should remain consistent and usable.

Here is a list of the filesystem operations logged by JFS:

Utilities

JFS provides a suite of utilities to manage its filesystems. You must be the root user to use them.

Utility Description
jfs_debugfs Shell-based JFS filesystem editor. Allows changes to the ACL, uid/gid, mode, time, etc. You can also alter data on disk, but only by entering hex strings -- not the most efficient way to edit a file.
jfs_fsck Replay the JFS transaction log, check and repair a JFS device. Should be run only on an unmounted or read-only filesystem. Run automatically at boot.
jfs_fscklog Extract a JFS fsck service log into a file. jfs_fscklog -e /dev/hda6 extracts the binary log to file fscklog.new. To view, use jfs_fscklog -d fscklog.new.
jfs_logdump Dump the journal log to a plain text file that shows data on each transaction in the log file.
jfs_mkfs Create a JFS formatted partition. Use the -j journal_device option to create an external journal (1.0.18 or later).
jfs_tune Adjust tunable filesystem parameters on JFS. I didn't find options that looked like they might improve performance. The -l option lists the superblock info.

Here is what a dump of the superblock information looks like:

root@slackt41:~# jfs_tune -l /dev/hda6
jfs_tune version 1.1.11, 05-Jun-2006

JFS filesystem superblock:

JFS magic number:       'JFS1'
JFS version:            1
JFS state:              mounted
JFS flags:              JFS_LINUX  JFS_COMMIT  JFS_GROUPCOMMIT  JFS_INLINELOG 
Aggregate block size:   4096 bytes
Aggregate size:         12239720 blocks
Physical block size:    512 bytes
Allocation group size:  16384 aggregate blocks
Log device number:      0x306
Filesystem creation:    Wed Jul 11 01:52:42 2007
Volume label:           ''

Crash testing

White papers and man pages are no substitute for the harsh reality of a server room. To test the recovery capabilities of JFS, I started crashing my system (forced power off) with increasing workloads. I repeated each crash twice to see if my results were consistent.

  • Workload: console (no X) running a text editor with one open file.
    Recovery: about 2 seconds to replay the journal log. Changes I had not saved in the editor were missing, but the file was intact.
  • Workload: X window system with KDE, GIMP, Nvu, and a text editor in an xterm, all with open files.
    Recovery: about 2 seconds to replay the journal log. All open files were intact; unsaved changes were missing.
  • Workload: X window system with KDE, GIMP, Nvu, and a text editor, all with open files, plus a shell script that inserted records into a MySQL (ISAM) table. The script I wrote was an infinite loop, and I let it run for a couple of minutes to make sure some records were flushed to disk.
    Recovery: about 3 seconds to replay the journal log. All open files intact, database intact with a few thousand records inserted, but the timestamp on the table file had been rolled back one minute.

In all cases, these boot messages appeared:

**Phase 0 - Replay Journal Log
-|----- (spinner appeared for a couple of seconds, then went away)
Filesystem is clean

Throughout the crash testing, I saw no filesystem corruption, and the longest log replay time I experienced was about 3 seconds.

Conclusion

While my improvised crash tests were not a good simulation of a busy server, JFS held up well, and recovery time was fast. All file-level applications I tested, such as tar and rsync, worked flawlessly, and lower-level programs like Truecrypt also worked as expected.

After 30 days of kicking and prodding, I have a high level of confidence in JFS, and I am content trusting my data to it. JFS may not have been marketed as effectively as other alternatives, but it is a solid choice in the long list of quality Linux filesystems.

JFS structure summary (051031)

This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.

Basic entities
Partition
A partition is a container: it has merely a size and a sector size, also called the partition block size, which defines I/O granularity (and is usually the same for all partitions on the physical medium). A partition contains exactly one aggregate.
Extent
A contiguous sequence of blocks, wholly contained in one allocation group. The maximum size of an extent is 2^24 - 1 blocks, or just under 64GiB. There are a few types of extents; one of them, ABNR, describes an extent containing only zero bytes.
Map
A map is a collection of extents that contains a B+-tree index rooted in the first extent of the collection. For example, it can be an index of extents for a file body, in which case it is an allocation map, or an index of inode names for a directory, in which case it is called a directory map; the extents in a map are described in the map itself. The root extent of the map is called a btree; the leaf extents are called xtrees (and contain an array of entries called xads) if they belong to an allocation map, and dtrees if they belong to a directory map.
File body
A file body is a sequence of one or more extents, the extents being listed in an allocation map. The extents may be from different allocation groups.
Inode
An inode is a 512-byte descriptor for the attributes of a file or directory; it also contains the root of a file body's allocation map, or of a directory map.
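The extent size limit quoted above is easy to sanity-check with a few lines of Python arithmetic; the 4096-byte block size is the value current JFS requires for aggregates, and the 24-bit extent length field is what yields the 2^24 - 1 figure:

```python
# Arithmetic check of the extent limit described above: the extent
# length field is 24 bits wide, so an extent can span at most
# 2^24 - 1 aggregate blocks of 4 KiB each.

BLOCK_SIZE = 4096              # mandatory aggregate block size (bytes)
MAX_EXTENT_BLOCKS = 2**24 - 1  # 24-bit length field

max_extent_bytes = MAX_EXTENT_BLOCKS * BLOCK_SIZE
print(max_extent_bytes)               # 68719472640
print(max_extent_bytes < 64 * 2**30)  # True: just under 64 GiB
```

So the largest single extent falls one block short of a full 64GiB.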
Aggregates
Aggregate
An aggregate is about allocating space, and has a size and an aggregate block size, which defines the granularity of allocation of space to files, and currently must be 4096.
  • Aggregates have a primary and a backup superblock.
  • Aggregates contain one or more allocation groups.
  • Aggregates have primary and backup aggregate inode tables, which must be exactly 32 inodes long.
  • Aggregates may contain one or more filesets, but currently only one is allowed.
  • Aggregates also have some space reserved for use by jfs_fsck.
Allocation group
An allocation group, also known as an AG, is merely a section of an aggregate. There is no data structure associated with an allocation group; all data structures belong either to the aggregate or to a fileset.
  • There can be up to 128 AGs in an aggregate, and each must be at least 8192 blocks or 32MiB.
  • Each allocation group must contain a number of blocks that is a power-of-2 multiple of the number of block descriptors in a dmap page.
  • If multiple files are growing, each allocates extents from a different allocation group if possible.
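These constraints can be checked against the "Allocation group size: 16384 aggregate blocks" value shown in the jfs_tune -l dump earlier on this page. A small Python sketch (the 2^13 figure is the number of block bits tracked by one dmap page, described under the block allocation map below):

```python
# Check the allocation-group constraints using the AG size reported
# by jfs_tune -l for the example filesystem (16384 aggregate blocks).

BLOCKS_PER_DMAP = 2**13   # block bits tracked by one dmap page
MIN_AG_BLOCKS = 8192      # minimum allocation group size, in blocks
BLOCK_SIZE = 4096         # aggregate block size (bytes)

ag_blocks = 16384

big_enough = ag_blocks >= MIN_AG_BLOCKS
multiple = ag_blocks // BLOCKS_PER_DMAP
power_of_two = multiple > 0 and (multiple & (multiple - 1)) == 0

print(big_enough and power_of_two)      # True
print(ag_blocks * BLOCK_SIZE // 2**20)  # 64 (MiB per AG)
```

The example filesystem thus uses 64MiB allocation groups, twice the 32MiB minimum.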
Aggregate inode table
The aggregate inode table is an inode allocation map for the inodes that are used internally by the aggregate and are not user visible (that is, not part of any fileset); these inodes describe structures such as the block allocation map, the inline log, and the bad block file. Since the aggregate inode table file refers to itself, the first extent of its inode allocation map has a well-known constant address (just after the superblock).
Block allocation map
The block allocation map, also called the bmap, is a file (not a B+-tree, despite being called a map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
  • Two arrays of 2^13 bits, where each bit corresponds to a block of the aggregate and is 1 if the block is in use. Because of the limit of three levels of dmap control pages, there can be at most 2^30 dmap pages, and thus at most 2^43 blocks in an aggregate.
  • Some metadata, including a buddy tree that defines a buddy system of the free and allocated blocks. The buddy tree also extends upwards into the dmap control pages.
The block allocation map contains information that is redundant with that of the inode allocation maps, so it can be fully reconstructed, but only with a full scan of the aggregate and fileset inode tables.
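The dmap numbers tie together with a few lines of Python arithmetic. The 1024-entry fanout per dmap control page is my assumption here, chosen because it reproduces the stated limit of 2^30 dmap pages over three levels:

```python
# Derive the aggregate size limit from the dmap layout sketched above.

BITS_PER_DMAP = 2**13   # blocks tracked by one dmap page
FANOUT = 1024           # entries per dmap control page (assumed)
LEVELS = 3              # at most three levels of dmap control pages

max_dmap_pages = FANOUT ** LEVELS
max_blocks = max_dmap_pages * BITS_PER_DMAP

print(max_dmap_pages == 2**30)     # True
print(max_blocks == 2**43)         # True
# With 4 KiB blocks that is 2^55 bytes, i.e. the 32 PiB maximum
# filesystem size quoted for JFS elsewhere on this page.
print(max_blocks * 4096 == 2**55)  # True
```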
Inline log
A sequence of blocks towards the end of an aggregate that is used to record intended modifications to aggregate or fileset metadata.
Bad blocks
This is a file whose extents cover all the bad blocks discovered by jfs_fsck if any.
Inode allocation maps
Inode allocation map
An inode allocation map is the file body of an inode table file, not a map. This file body contains, as its first 4KiB block, a control page called the dinomap, followed by a number of extents called inode allocation groups.
The dinomap contains:
  • The AG free inode lists array.
  • The AG free inode extents lists array.
  • The IAG free list.
  • The IAG free next.
These structures partition the information held in the inode allocation map by allocation group.
AG free inode lists array
The AG free inode lists array contains a list header for each AG. Each list threads together all the IAGs in that AG that have some free inode entries.
AG free inode extents lists array
The AG free inode extents lists array contains a list header for each AG, and each list threads together all the IAGs in an AG that have some free inode extents.
IAG free list
The IAG free list array contains a list header for each AG, and each list threads together those IAGs in the AG whose inodes are all free.
IAG free next
The IAG free next is the number of the next IAG to append (if required) to an inode allocation map, or equivalently the number of IAGs in an inode allocation map plus 1.
Inode allocation group
An inode allocation group, also called IAG, is a 4KiB block that describes up to 128 inode table extents, for a total of up to 4096 inode table entries.
An inode allocation group can be in any allocation group, but all the inode table extents it describes must be in the same allocation group as the first one, unlike the extents of a general-purpose file body, which can be in any allocation group. As soon as its first inode table extent is allocated in an allocation group, the inode allocation group is tied to that group until all such extents are freed.
Once allocated, inode allocation groups are never freed, but their inode table extents may be freed.
Inode table extent
Inode table extents are pointed to by inode allocation groups, and each must be 16KiB in length, and contains 32 inode table entries.
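The figures above fit together as simple arithmetic: each 16KiB inode table extent holds 32 of the 512-byte inodes, and 128 such extents per IAG give 4096 inodes. A quick check in Python:

```python
# Verify the per-IAG inode accounting described above.

INODE_SIZE = 512            # bytes per on-disk inode
EXTENT_SIZE = 16 * 1024     # bytes per inode table extent
EXTENTS_PER_IAG = 128       # inode table extents described by one IAG

inodes_per_extent = EXTENT_SIZE // INODE_SIZE
inodes_per_iag = EXTENTS_PER_IAG * inodes_per_extent

print(inodes_per_extent)    # 32
print(inodes_per_iag)       # 4096
```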
Filesets
Fileset
A fileset is a collection of named inodes. A fileset is defined by its fileset inode table, which is an inode allocation map file. It contains these inodes:
  • Number 0 is reserved.
  • Number 1 is a file containing extended fileset information.
  • Number 2 is a directory which is the root of the fileset naming tree.
  • Number 3 is a file containing the ACL for the fileset.
  • Number 4 and following are used for the other files or directories in the fileset, all must be reachable from the directory at number 2.
File
A file is an inode with an attached (optional) allocation map describing a file body that contains data; a particular case of a file is a symbolic link, where the data in the file is a path name.
Directory
A directory is an inode with a list of names and corresponding inode numbers; the list is either contained entirely within the inode, if it is small, or is an attached directory map containing dtree entries.

Recommended Links


Softpanorama Recommended

Top articles

Sites

JFS Log

Open source : JFS project Web site

jfs.sourceforge.net

This white paper gives an overview of the changes to meta-data structures that JFS logs.

[PDF] Using JFS ACLs -- File System Security, HP-UX System Administrator's Guide: Security Management, HP Part Number 5991-6482, Publication Date E0207


[PDF] JFS Tuning and Performance

JFS is an advanced journaling filesystem. It has been designed to provide excellent ...


ploug.eu.org/doc/jfs-a4.pdf
usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/best/best_html/index.html

Notes about Linux file systems

Updated: 2006-10-22
Created: 2005-10-31

  • Section menu

    File system references (061022)

    Older references are not entirely accurate: things in kernel 2.6 are considerably better than in kernel 2.4, and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. The references below are therefore ordered most recent first.

    General
    Descriptions (061022)
    Benchmarks
    Warnings: many of these benchmarks are somewhat naively designed, and some omit truly essential aspects of the context, such as the elevator or the filesystem readahead. Benchmarks under Linux 2.6 can give very different results from those under Linux 2.4, and SCSI and ATA/IDE disc drives have very, very different performance profiles, including sync reporting.
    Online discussions
    Warning: some of these discussions are listed here because I think that they are notably wrong. Some pointers are to single articles, some to threads.

    File system features (060801)

    Desktop filesystem features
    Feature                        ext3               JFS                XFS
    Block sizes                    1024-4096          4096               512-4096
    Max fs size                    8TiB (2^43 B)      32PiB (2^55 B)     8EiB (2^63 B)
      (16TiB (2^44 B) on 32-bit systems)
    Max file size                  1TiB (2^40 B)      4PiB (2^52 B)      8EiB (2^63 B)
      (16TiB (2^44 B) on 32-bit systems)
    Max files/fs                   2^32               2^32               2^32
    Max files/dir                  2^32               2^31               2^32
    Max subdirs/dir                2^15               2^16               2^32
    Number of inodes               fixed              dynamic            dynamic
    Indexed dirs                   option             auto               yes
    Small data in inodes           no                 some               auto
    fsck speed                     slow               fast               fast
    fsck space                     ?                  32B per inode      2GiB RAM per 1TiB
                                                                        + 200B per inode
                                                                        (half on 32b CPU)
    Redundant metadata             yes                yes                ?
    Bad block handling             yes                mkfs only          no
    Tunable commit interval        yes                no                 no
    Supports VFS lock              yes                yes                yes
    Has own lock/snapshot          no                 no                 yes
    Names                          8 bit              UTF-16 or 8 bit    8 bit
    noatime                        yes                yes                yes
    O_DIRECT                       allegedly          allegedly          yes
    barrier                        yes                no                 yes (checked)
    commit interval                yes                no                 no
    EA/ACLs                        both               both               both
    Quotas                         both               patch              both
    DMAPI                          no                 patch              option
    Case insensitive               no                 mkfs only          no
    Supported by GRUB              yes                yes                mostly
    Can grow                       online             online only        online only
    Can shrink                     offline            no                 no
    Journals data                  option             no                 no
    Journals what                  blocks             operations         operations
    Journal disabling              yes                yes                no
    Journal size                   fixed              fixed              grow/shrink
    Resize journal                 offline            maybe              offline
    Journal on another partition   yes                yes                yes
    Special features               In-place convert   Case insensitive   Real time (streaming)
                                   from ext2.         option.            section.
                                   MS Windows         Low CPU usage.     IRIX compatible.
                                   drivers.           DCE DFS            Very large write
                                                      compatible.        behind.
                                                      OS2 compatible.    Superblock on block 0.

    Some of my notes on filesystems (061022)

    These are pointers to some of the entries in my technical blog where file systems are discussed:





    Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author's free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright for original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

    FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.


    Disclaimer:

    The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP, or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.


    Last modified: September 12, 2017