Softpanorama (slightly skeptical) Open Source Software Educational Society
May the source be with you, but remember the KISS principle ;-)
Softpanorama Filesystems Webliography
Science is facts; just as houses are made of stones, so is science made
of facts; but a pile of stones is not a house and a collection of facts
is not necessarily science.
-- Henri Poincaré
Filesystems are a very interesting area, one of the few areas in Unix where new
algorithms can still make a huge difference in performance.
Often the historical view of filesystems is a bit too Unix-centric and states
that the Berkeley Fast File System is the ancestor of most modern file systems.
This view ignores competing and earlier implementations from IBM (HPFS), DEC (VAX
VMS), Microsoft (NTFS) and others.
Still, Unix filesystems became a classic, and the concepts introduced in them dominate
all modern filesystems. They also introduced many interesting features and algorithms
into the area. For example, the very interesting concept of extended attributes introduced
in the 4.4 BSD filesystem has recently been added to Ext2fs:
Immutable files can only be read: nobody can write or delete them. This
can be used to protect sensitive configuration files.
Append-only files can be opened in write mode but data is always appended
at the end of the file. Like immutable files, they cannot be deleted or renamed.
This is especially useful for log files which can only grow. All in all, the
following attributes are available on ext2fs:
- A (no Access time): if a file or directory has this attribute set, whenever it is accessed,
either for reading or for writing, its last access time will not be updated.
This can be useful, for example, on files or directories which are very
often accessed for reading, especially since this parameter is the only
one which changes on an inode when it's open read-only.
- a (append only): if a file has this attribute set and is open for writing, the only operation
possible will be to append data to its previous contents. For a directory,
this means that you can only add files to it, but not rename or delete any
existing file. Only root can set or clear this attribute.
- d (no dump): dump(8) is the standard UNIX
utility for backups. It dumps any filesystem for which the dump counter
is 1 in /etc/fstab (see chapter "Filesystems and Mount Points"). But if a file or directory has
this attribute set, unlike others, it will not be taken into account when
a dump is in progress. Note that for directories, this also includes all
subdirectories and files under it.
- i (immutable): a file or directory with this attribute set simply cannot be modified at
all: it cannot be renamed, no further link can be created to it
[1] and it cannot be removed. Only root
can set or clear this attribute. Note that this also prevents changes to
access time, therefore you do not need to set the A attribute when i is set.
- s (secure deletion): when a file or directory with this attribute set is deleted, the blocks
it was occupying on disk are overwritten with zeroes.
- S (Synchronous mode): when a file or directory has this attribute set, all modifications on it
are synchronous and written back to disk immediately.
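On Linux these attributes are normally set with chattr and listed with lsattr. As a minimal sketch of how the letters above map to bits in an inode's flag word, the following decodes a flag value into its attribute letters. The bit values are the FS_*_FL constants from the kernel header linux/fs.h; the helper name describe_flags and the output ordering are assumptions made for this illustration and do not reproduce lsattr's exact output format.

```python
# Flag bits as defined in the Linux kernel header <linux/fs.h>.
FS_SECRM_FL = 0x00000001      # s: secure deletion
FS_SYNC_FL = 0x00000008       # S: synchronous updates
FS_IMMUTABLE_FL = 0x00000010  # i: immutable
FS_APPEND_FL = 0x00000020     # a: append only
FS_NODUMP_FL = 0x00000040     # d: no dump
FS_NOATIME_FL = 0x00000080    # A: no access-time updates

# (bit, letter) pairs in the order used by the list above.
# This ordering is our own choice for the sketch, not lsattr's.
ATTRIBUTE_LETTERS = [
    (FS_NOATIME_FL, "A"),
    (FS_APPEND_FL, "a"),
    (FS_NODUMP_FL, "d"),
    (FS_IMMUTABLE_FL, "i"),
    (FS_SECRM_FL, "s"),
    (FS_SYNC_FL, "S"),
]


def describe_flags(flags: int) -> str:
    """Return the attribute letters whose bits are set in `flags`."""
    return "".join(letter for bit, letter in ATTRIBUTE_LETTERS if flags & bit)
```

For example, a flag word with the immutable and no-atime bits set decodes to the letters A and i.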
The Unix filesystem is a classic, but a classic has its own problems: it is actually
an old and largely outdated filesystem that has outlived its usefulness. Later
ideas implemented in HPFS, BFS and several other more modern filesystems are absent
from plain-vanilla implementations of Unix file systems. Balanced trees now serve as the
basis of most modern filesystems, including ReiserFS (which started as an NTFS clone
but acquired some unique features in the process of development):
The Reiser Filesystems
by Hans Reiser [and Moscow University researchers], a very ambitious project
to not only improve performance and add journaling, but to redefine the
filesystem as a storage repository for arbitrarily complex objects.
Reiserfs can be faster than ext2/3 for directory operations because it uses balanced trees for its directory structures.
It was used by Suse and Gentoo.
Unfortunately, the novel feature introduced in HPFS called extended attributes
never gained traction in other filesystems. Of course, the fundamental
decision to make attributes indexable deserves closer examination, given the costs
of indexing, but the fixed set of attributes (as in UFS) created too many
problems for this issue to be ignored. I still think that extended attributes should be
present in a filesystem, and they could replace such kludges as the #! notation in UNIX
for specifying the default interpreter of executable files.
The Small Computer Systems Interface (SCSI) is a
collection of standards that define the interface and
protocols for communicating with a large number of
devices (predominantly storage related). Linux® provides
a SCSI subsystem to permit communication with these
devices. Linux is a great example of a layered
architecture that joins high-level drivers, such as disk
or CD-ROM drivers, to a physical interface such as Fibre
Channel or Serial Attached SCSI (SAS). This article
introduces you to the Linux SCSI subsystem and discusses
where this subsystem is going in the future.
When it comes to file systems, Linux® is the Swiss Army
knife of operating systems. Linux supports a large
number of file systems, from journaling to clustering to
cryptographic. Linux is a wonderful platform for using
standard and more exotic file systems and also for
developing file systems. This article explores the
virtual file system (VFS)—sometimes called the virtual
filesystem switch—in the Linux kernel and then reviews
some of the major structures that tie file systems
together.
data=writeback
While the writeback option provides lower data consistency
guarantees than the journal or ordered modes, some applications show very
significant speed improvement when it is used. For example, speed improvements can
be seen when heavy synchronous writes are performed, or when applications
create and delete large volumes of small files, such as delivering a large flow of
short email messages. The results of the testing effort described in Chapter 3
illustrate this topic.
When the writeback option is used, data consistency is similar
to that provided by the ext2 file system. However, file system integrity is maintained
continuously during normal operation in the ext3 file system.
In the event of a power failure or system crash, the file system may not be
recoverable if a significant portion of data was held only in system memory and
not on permanent storage. In this case, the filesystem must be recreated from
backups, and changes made since the file system was last backed up are lost.
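An application that cannot accept the weaker guarantees described here can ask for durability explicitly, whatever journaling mode the file system is mounted with. The following is a minimal sketch of that idea, not something taken from the quoted document: the helper name durable_write is hypothetical, and the cost of each fsync call depends on the file system and hardware.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and ask the kernel to flush it to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # request that the kernel write the file out
```

Calling fsync on every write trades throughput for safety, which is the same trade-off the mount option makes for the whole file system.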
Submitted by
Jeremy on August 7, 2007 - 9:26am.
In a recent lkml thread, Linus Torvalds was involved in
a discussion about mounting filesystems with the
noatime option for better performance,
"'noatime,data=writeback'
will quite likely be *quite* noticeable (with different
effects for different loads), but almost nobody actually
runs that way."
He noted that he set O_NOATIME when
writing git, "and it was an absolutely huge
time-saver for the case of not having 'noatime' in the
mount options. Certainly more than your estimated 10%
under some loads."
The discussion then looked at
using the
relatime
mount option to improve the situation, "relative
atime only updates the atime if the previous atime is
older than the mtime or ctime. Like noatime, but useful
for applications like mutt that need to know when a file
has been read since it was last modified."
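The relatime rule quoted above can be written as a single comparison. This is a sketch of the rule exactly as it is worded in the quote, not the kernel's implementation; the function name and the use of plain numeric timestamps are assumptions for illustration.

```python
def relatime_should_update(atime: float, mtime: float, ctime: float) -> bool:
    """Return True if a read should refresh atime under the quoted rule.

    The access time is updated only when the previous atime is older
    than the modification time or the inode change time.
    """
    return atime < mtime or atime < ctime
```

A file last read before it was last modified gets one atime update on the next read. Later reads leave the inode untouched until the file changes again, which is what lets a mail reader such as mutt still detect unread mail.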
Ingo
Molnar stressed the significance of fixing this
performance issue, "I cannot over-emphasize how much
of a deal it is in practice. Atime updates are by far
the biggest IO performance deficiency that Linux has
today. Getting rid of atime updates would give us more
everyday Linux performance than all the pagecache
speedups of the past 10 years, _combined_."
He
submitted some patches to improve relatime,
and noted about atime:
"It's also perhaps the most stupid Unix design
idea of all times. Unix is really nice and well
done, but think about this a bit: 'For every file
that is read from the disk, lets do a ... write to
the disk! And, for every file that is already cached
and which we read from the cache ... do a write to
the disk!'"
This series was originally called "Advanced Filesystem Implementor's Guide"
and was published on IBM developerWorks.
This document describes a methodology for configuring a fast file system
that handles many small files on the Solaris Operating System. This could
be used for building a Java technology-based product or for handling many operations
on a large number of small files. This methodology utilizes a tmpfs volume,
and it can speed up operations approximately three times.
The requirements are as follows:
- Solaris 7 OS through Solaris 10 OS Update 1
- Some experience with Solaris system administration. This procedure is
not recommended for UNIX users who are uncomfortable with using
mount, maintaining /etc/vfstab,
or modifying their kernel parameters.
Warning: Do not develop on a tmpfs volume. A tmpfs volume is only
persistent while the system is powered up, so a power loss or system problem
will cause you to lose any changes to that volume.
Procedure
Solaris tmpfs volumes are easy to create, but require a significant amount
of RAM and swap space. It is recommended that you have at least 1 Gbyte of RAM,
but there have also been major performance gains on systems with 512 Mbytes
of RAM. In addition, you should add twice as much swap space as the tmpfs volume
you are creating. That is, for a 2-Gbyte tmpfs volume, add 4 Gbytes of swap
space to the system. Feel free to experiment with these values.
The following examples are for a 2-Gbyte tmpfs volume, which is approximately
what is needed to do a developer build. Replace <swapfilename>
with the absolute path to a swapfile (such as
/disk1/swapfile), and <mountpoint>
with the absolute path to where you want the tmpfs volume mounted (such as
/ramdisk).
Add swap space to your workstation:
root# /usr/sbin/mkfile 2000m <swapfilename>
Create a mount point for the tmpfs volume:
root# mkdir <mountpoint>
Edit your /etc/vfstab file to use the swap and
create the tmpfs volume at boot time. Add the following two lines:
<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m
Note that on the Solaris 7 OS you may not make a single tmpfs volume larger
than 2 Gbytes.
Edit your kernel parameters to increase the number of files you can create
in the tmpfs volume. Add the following line to your /etc/system
file. (We've had the most success using this value.)
set tmpfs:tmpfs_maxkmem=250000000
Reboot your workstation. Then verify that the tmpfs volume exists at the
size you specified:
% df -k <mountpoint>
Make the tmpfs volume writable. Note: This step is necessary after
each reboot of the workstation.
root# chmod 777 <mountpoint>
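The two /etc/vfstab lines from the procedure above can be produced from the swapfile and mount point paths. The sketch below only builds the text of those lines in the same layout as shown in the procedure; it does not edit any system file. The function name vfstab_entries and the size_mb parameter are hypothetical names introduced for this illustration.

```python
from typing import List


def vfstab_entries(swapfile: str, mountpoint: str, size_mb: int = 2000) -> List[str]:
    """Build the swap and tmpfs lines in the layout shown in the procedure."""
    # The procedure asks for absolute paths for both placeholders.
    for path in (swapfile, mountpoint):
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got {path!r}")
    swap_line = f"{swapfile} - - swap - no -"
    tmpfs_line = f"RAMDISK - {mountpoint} tmpfs - yes size={size_mb}m"
    return [swap_line, tmpfs_line]
```

With the example paths from the text, vfstab_entries("/disk1/swapfile", "/ramdisk") returns the two lines for a 2000m volume.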
There are a lot of Linux filesystem comparisons available, but
most of them are anecdotal, based on artificial tasks, or completed
under older kernels. This benchmark essay is based on 11 real-world
tasks appropriate for a file server with older-generation hardware
(Pentium II/III, EIDE hard drive).
Since its initial publication, this article has generated
a lot of questions, comments and suggestions to improve it.
Consequently, I'm currently working hard on a new batch of tests
to answer as many questions as possible (within the original scope
of the article).
Results will be available in about two weeks (May 8, 2006)
Many thanks for your interest and keep in touch with
Debian-Administration.org!
Hans
Why another benchmark test?
I found two quantitative and reproducible benchmark testing
studies using the 2.6.x kernel (see References). Benoit (2003) implemented
12 tests using large files (1+ GB) on a Pentium II 500 server with
512MB RAM. This test was quite informative, but its results are beginning
to age (kernel 2.6.0) and mostly apply to settings which manipulate
exclusively large files (e.g., multimedia, scientific, databases).
Piszcz (2006) implemented 21 tasks simulating a variety of file
operations on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard
disk. To date, this testing appears to be the most comprehensive
work on the 2.6 kernel. However, since many tasks were "artificial"
(e.g., copying and removing 10 000 empty directories, touching
10 000 files, splitting files recursively), it may be difficult to
transfer some conclusions to real-world settings.
Thus, the objective of the present benchmark testing is to complement
some of Piszcz's (2006) conclusions by focusing exclusively on real-world
operations found in small-business file servers (see Tasks
description).
Test settings
Hardware
- Processor : Intel Celeron 533
- RAM : 512MB RAM PC100
- Motherboard : ASUS P2B
- Hard drive : WD Caviar SE 160GB (EIDE 100, 7200 RPM, 8MB
Cache)
- Controller : ATA/133 PCI (Silicon Image)
OS
- Debian Etch (kernel 2.6.15), distribution upgraded on April
18, 2006
- All optional daemons killed (cron, ssh, Samba, etc.)
Filesystems
- Ext3 (e2fsprogs 1.38)
- ReiserFS (reiserfsprogs 1.3.6.19)
- JFS (jfsutils 1.1.8)
- XFS (xfsprogs 2.7.14)
Description of selected tasks
Operations on a large file (ISO image, 700MB)
- Copy ISO from a second disk to the test disk
- Recopy ISO in another location on the test disk
- Remove both copies of ISO
Operations on a file tree (7500 files, 900 directories, 1.9GB)
- Copy file tree from a second disk to the test disk
- Recopy file tree in another location on the test disk
- Remove both copies of file tree
Operations within the file tree
- List recursively all contents of the file tree and save
it on the test disk
- Find files matching a specific wildcard in the file tree
Operations on the file system
- Creation of the filesystem (mkfs) (all FS were created with
default values)
- Mount filesystem
- Umount filesystem
The sequence of 11 tasks (from creation of FS to umounting FS)
was run as a Bash script which was completed three times (the average
is reported). Each sequence takes about 7 min. Time to complete
task (in secs), percentage of CPU dedicated to task and number of
major/minor page faults during task were computed by the GNU time
utility (version 1.7).
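The "run three times and report the average" procedure can be sketched as follows. This is not the author's Bash script: it measures wall-clock time only, whereas GNU time also reported CPU percentage and page faults. The function name average_runtime is an assumption for this illustration.

```python
import time
from statistics import mean
from typing import Callable, List


def average_runtime(task: Callable[[], None], runs: int = 3) -> float:
    """Run `task` `runs` times and return the mean wall-clock time in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Averaging a small number of runs smooths out one-off disturbances, but it does not remove systematic effects such as a warm page cache on the second and third run.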
RESULTS
Partition capacity
Initial (after filesystem creation) and residual (after removal
of all files) partition capacity was computed as the ratio of the number
of available blocks to the number of blocks on the partition. Ext3 has
the worst initial capacity (92.77%), while the other FS preserve
almost full partition capacity (ReiserFS = 99.83%, JFS = 99.82%,
XFS = 99.95%). Interestingly, the residual capacity of Ext3
and ReiserFS was identical to the initial, while JFS and XFS lost
about 0.02% of their partition capacity, suggesting that these FS
can dynamically grow but do not completely return to their initial
state (and size) after file removal.
Conclusion : To use the maximum of your partition capacity,
choose ReiserFS, JFS or XFS.
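The capacity figure used in this section is a ratio of block counts, which can be read from a mounted file system with statvfs. The sketch below is one way to compute it. The article does not say which counter the author used for "available blocks", so the choice of f_bavail (blocks available to unprivileged users) is an assumption, as are the function names.

```python
import os


def capacity_percent(available_blocks: int, total_blocks: int) -> float:
    """Return available blocks as a percentage of all blocks on the partition."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return 100.0 * available_blocks / total_blocks


def mounted_capacity_percent(path: str) -> float:
    """Compute the same ratio for the file system that contains `path`."""
    st = os.statvfs(path)
    # Assumption: f_bavail stands in for the article's "available blocks".
    return capacity_percent(st.f_bavail, st.f_blocks)
```

Measuring once right after mkfs and once after removing all files would give the initial and residual values compared above.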
File system creation, mounting and unmounting
The creation of FS on the 20GB test partition took 14.7 secs
for Ext3, compared to about 2 secs or less for other FS (ReiserFS = 2.2,
JFS = 1.3, XFS = 0.7). However, ReiserFS took 5 to 15 times
longer to mount the FS (2.3 secs) when compared to other
FS (Ext3 = 0.2, JFS = 0.2, XFS = 0.5), and also 2 times longer to
umount the FS (0.4 sec). All FS took comparable amounts of
CPU to create FS (between 59% - ReiserFS and 74% - JFS) and to mount
FS (between 6 and 9%). However, Ext3 and XFS took about 2 times more
CPU to umount (37% and 45%), compared to ReiserFS and JFS (14% and
27%).
Conclusion : For quick FS creation and mounting/unmounting,
choose JFS or XFS.
Operations on a large file (ISO image, 700MB)
The initial copy of the large file took longer on Ext3 (38.2
secs) and ReiserFS (41.8) when compared to JFS and XFS (35.1 and
34.8). The recopy on the same disk favored XFS (33.1 secs)
when compared to other FS (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9).
The ISO removal was about 100 times faster on JFS and XFS (0.02
sec for both), compared to 1.5 sec for ReiserFS and 2.5 sec for
Ext3! All FS took comparable amounts of CPU to copy (between 46
and 51%) and to recopy ISO (between 38% and 50%). ReiserFS used
49% of CPU to remove ISO, whereas other FS used about 10%. There was
a clear trend of JFS to use less CPU than any other FS (about 5
to 10% less). The number of minor page faults was quite similar
between FS (ranging from 600 - XFS to 661 - ReiserFS).
Conclusion : For quick operations on large files, choose
JFS or XFS. If you need to minimize CPU usage, prefer JFS.
Operations on a file tree (7500 files, 900 directories, 1.9GB)
The initial copy of the tree was quicker for Ext3 (158.3 secs)
and XFS (166.1) when compared to ReiserFS and JFS (172.1 and 180.1).
Similar results were observed during the recopy on the same disk,
which favored Ext3 (120 secs) compared to other FS (XFS =
135.2, ReiserFS = 136.9 and JFS = 151). However, the tree removal
was about 2 times longer for Ext3 (22 secs) when compared to ReiserFS
(8.2 secs), XFS (10.5 secs) and JFS (12.5 secs)! All FS took comparable
amounts of CPU to copy (between 27 and 36%) and to recopy the file
tree (between 29% - JFS and 45% - ReiserFS). Surprisingly, ReiserFS
and XFS used significantly more CPU to remove the file tree (86%
and 65%), whereas other FS used about 15% (Ext3 and JFS). Again, there
was a clear trend of JFS to use less CPU than any other FS. The
number of minor page faults was significantly higher for ReiserFS
(total = 5843) when compared to other FS (1400 to 1490). This difference
appears to come from a higher rate (5 to 20 times) of page faults
for ReiserFS in recopy and removal of the file tree.
Conclusion : For quick operations on a large file tree, choose
Ext3 or XFS. Benchmarks from other authors have supported the use
of ReiserFS for operations on a large number of small files. However,
the present results on a tree comprising thousands of files of various
sizes (10KB to 5MB) suggest that Ext3 or XFS may be more appropriate
for real-world file server operations. Even if JFS minimizes CPU
usage, it should be noted that this FS comes with significantly
higher latency for large file tree operations.
Directory listing and file search in the previous file tree
The complete (recursive) directory listing of the tree was quicker
for ReiserFS (1.4 secs) and XFS (1.8) when compared to Ext3 and
JFS (2.5 and 3.1). Similar results were observed during the file
search, where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results
compared to Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took
comparable amounts of CPU for directory listing (35%) and file search
(6%). XFS took more CPU for directory listing (70%) but comparable
amount for file search (10%). ReiserFS appears to be the most CPU-intensive
FS, with 71% for directory listing and 36% for file search. Again,
the number of minor page faults was 3 times higher for ReiserFS
(total = 1991) when compared to other FS (704 to 712).
Conclusion : Results suggest that, for these tasks, filesystems
can be regrouped as (a) quick and more CPU-intensive (ReiserFS and
XFS) or (b) slower but less CPU-intensive (ext3 and JFS). XFS appears
as a good compromise, with relatively quick results, moderate usage
of CPU and acceptable rate of page faults.
OVERALL CONCLUSION
These results replicate previous observations from Piszcz (2006)
about reduced disk capacity of Ext3, longer mount time of ReiserFS
and longer FS creation of Ext3. Moreover, both reviews have observed
that JFS is the lowest CPU-usage FS. Finally,
this report appears to be the first to show the high page fault
activity of ReiserFS on most usual file operations.
While recognizing the relative merits of each filesystem, only
one filesystem can be installed on each partition/disk. Based on
all testing done for this benchmark essay, XFS appears to be
the most appropriate filesystem to install on a file server
for home or small-business needs:
- It uses the maximum capacity of your server hard disk(s)
- It is the quickest FS to create, mount and unmount
- It is the quickest FS for operations on large files (>500MB)
- This FS gets a good second place for operations on a large
number of small to moderate-size files and directories
- It constitutes a good CPU vs time compromise for large directory
listing or file search
- It is not the least CPU-demanding FS, but its use of system
resources is quite acceptable for older-generation hardware
While Piszcz (2006) did not explicitly recommend XFS, he concludes
that "Personally, I still choose XFS for filesystem performance
and scalability". I can only support this conclusion.
References
Benoit, M. (2003). Linux File System Benchmarks.
Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).
2002-09-20 (Linux Journal)
We look at three different tactics for optimizing read and
write performance under Linux.
A few years ago I was tasked with
making the Spec96 benchmark suite produce the fastest numbers possible
using the Solaris Intel operating system and Compaq Proliant servers.
We were given all the resources that Sun Microsystems and Compaq
Computer Corporation could muster to help take both companies to
the next level in Unix computing on the Intel architecture. Sun
had just announced its flagship operating system on the Intel platform
and Compaq was in a heated race with Dell for the best departmental
servers. Unixware and SCO were the primary challengers since Windows
NT 3.5 was not very stable at the time and no one had ever heard
of an upstart graduate student from overseas who thought that he
could build a kernel that rivaled those of multi-billion dollar
corporations.
Now, many years later, Linux has gained
considerable market share and is the de facto Unix for all the major
hardware manufacturers on the Intel architecture. In this article,
I will attempt to take the lessons learned from this tuning exercise
and show how they can be applied to the Linux operating system.
As it turned out, the gcc benchmark
was the one that everyone seemed to be improving on the most. As
we analyzed what the benchmark was doing, we found out that basically
it opened a file, read its contents, created a new file, wrote new
contents, then closed both files. It did this over and over and
over. File operations proved to be the bottleneck in performance.
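The access pattern described above (open a file, read it, create a new file, write new contents, close both, repeat) is easy to reproduce as a small timing loop. The sketch below is a stand-in for that pattern, not the Spec96 gcc benchmark itself; the function name, file names, payload size and the byte-reversal used as "new contents" are all assumptions made for illustration.

```python
import os
import time


def churn_files(directory: str, iterations: int = 100,
                payload: bytes = b"x" * 4096) -> float:
    """Repeat the read-one-file, write-a-new-file cycle and return elapsed seconds."""
    src = os.path.join(directory, "input.dat")
    with open(src, "wb") as f:
        f.write(payload)

    start = time.perf_counter()
    for i in range(iterations):
        dst = os.path.join(directory, f"output_{i}.dat")
        # Open the source, read it, create a new file, write, close both.
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            data = fin.read()
            fout.write(data[::-1])  # placeholder for the "new contents"
    return time.perf_counter() - start
```

Running the same loop against a directory on disk and against one on a memory-backed file system gives a rough feel for how much of the time is spent in file operations rather than computation.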
We tried faster processors with insignificant improvement. We tried
processors with huge (at the time) level 1 and level 2 cache and
still found no significant improvement. We tried using a gigabyte
of memory and found little or no improvement. By using the vmstat
command, we found that the processor was relatively idle, little
memory was being used, but we were getting a significant amount
of reads and writes to the root disk. Using the same hardware and
same test programs, Unixware was 25% faster than Solaris Intel.
Initially, we decided that Solaris was just really slow. Unfortunately,
I was working for Sun at the time and this was not the answer that
we could take to my management. We had to figure out why it was
slow and make recommendations on how to improve the performance.
The target was 25% faster than Unixware, not slower.
The first thing that we did was to
look at the configurations. It turns out that the two systems used
identical hardware. We just booted a different disk to boot the
other operating system. The Unixware system was configured with
/tmp as a tmpfs whereas the Solaris system had /tmp on the root
file system. We changed the Solaris configuration to use tmpfs but
it did not significantly improve performance. Later, we found that
this was due to a bug in the tmpfs implementation on Solaris Intel.
By breaking down the file operation, we decided to focus on three
areas: the libc interface, the node/dentry layer, and the device
drivers managing the disk. In this article, we will look at the
three different layers and talk about how to improve performance
and how they specifically apply to Linux.
This paper describes a utility named
ruf that reads
files from an unmounted file system. The files are accessed by reading disk
structures directly so the program is peculiar to the specific file system employed.
The current implementation supports the *BSD FFS, SunOS/Solaris UFS, HP-UX HFS,
and Linux ext2fs file systems. All these file systems derive from the original
FFS, but have peculiar differences in their specific implementations.
The utility can read files from a damaged file
system. Since the utility attempts to read only those structures it requires,
damaged areas of the disk can be avoided. Files can be accessed by their inode
number alone, bypassing damage to structures above it in the directory hierarchy.
The functions of the utility are available
in a library named libruf.
The utility and library are available under the BSD license.
Introduction
There are many important reasons for being
able to access unmounted file systems, the prime example being a damaged disk.
This paper describes a utility that can be used to read a disk file without
mounting the file system. The utility behaves similarly to the regular
cat utility, and was originally
named dog, but
was renamed to ruf
for reading unmounted filesystems to avoid a name conflict
with an older utility.
In order to access an unmounted file system,
the utility must read the disk structures directly and perform all the tasks
normally performed by the operating system; this requires a detailed understanding
of how the file system is implemented. Implementing this utility for a particular
file system is an interesting academic exercise and a good way to learn about
the file system. The original work on this utility was in fact done in Evi Nemeth's
system administration class.
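Reading disk structures directly starts with the superblock. As a minimal sketch of the idea for ext2 only, the following parses a few superblock fields from a raw image held in memory. It is not code from ruf or libruf: the function name and the returned dictionary are assumptions, while the byte offsets follow the published ext2 on-disk layout (superblock at byte 1024, little-endian fields, magic number 0xEF53 at offset 56).

```python
import struct

SUPERBLOCK_OFFSET = 1024   # the ext2 superblock starts 1024 bytes into the volume
EXT2_MAGIC = 0xEF53


def parse_ext2_superblock(image: bytes) -> dict:
    """Extract inode count, block count and block size from a raw ext2 image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 58:
        raise ValueError("image too small to contain a superblock")
    inodes_count, blocks_count = struct.unpack_from("<II", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("ext2 magic number not found")
    return {
        "inodes": inodes_count,
        "blocks": blocks_count,
        "block_size": 1024 << log_block_size,
    }
```

A real reader would continue from here to the block group descriptors and the inode table in order to locate a file by inode number, and it would have to tolerate damaged or missing structures along the way.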
Richard starts this journey into the Solaris filesystem by looking at the
fundamental reasons for needing a filesystem and at the functionality various
filesystems provide. In this first part of the series, you'll examine the evolution
of the Solaris filesystem framework, moving into a study of major filesystem
features. You'll focus on filesystems that store data on physical storage devices
-- commonly called regular or on-disk filesystems. In future
articles, you'll begin to explore the performance characteristics of each filesystem,
and how to configure filesystems to provide the required levels of functionality
and performance. Richard will also delve into the interaction between Solaris
filesystems and the Solaris virtual memory system, and how it all affects performance.
One of the most important features of a filesystem is its ability to cache
file data. Ironically, however, the filesystem cache isn't implemented in the
filesystem. In Solaris, the filesystem cache is implemented in the virtual memory
system. In Part 3 of this series on the Solaris filesystem, Richard explains
how Solaris file caching works and explores the interactions between the filesystem
cache and the virtual memory system.
CacheKit is a collection of freeware perl and shell programs to report
on cache activity on a Solaris 8 sparc server. Tools for older Solaris and Solaris
x86 are also included in the kit, as well as some SE Toolkit programs and extra
Solaris 10 DTrace programs. The caches the kit reports on are: I$, D$, E$,
DNLC, inode cache, ufs buffer cache, segmap cache and segvn cache.
This kit assists performance tuning.
Here's a remarkable tale about a company that
replaced an Oracle database cluster with a few Linux servers.
The great thing about this story is that the
Linux servers did not run database software. The Oracle database had been converted
and stored on the Linux hard disk as a collection of some 100,000 files. The
work was part of a major application upgrade that involved redesigning all the
components of a busy web site.
It's a stunning tale, but it's obviously an exceptional
case. There's absolutely no suggestion that Oracle databases are old, slow,
poor or anything else of the sort.
But this story shows how design decisions taken
a few years ago can rapidly be undermined by new technologies. In this case,
the original site served pages from an Oracle back-end via J2EE middleware.
The new system uses a Linux back-end and Java XML middleware. First time round,
Oracle and the J2EE framework were the best choice, but a few years later Java
and XML had matured. The redesign enabled some 40,000 lines of J2EE code to
be replaced by 5,000 lines of Java/XML.
The Oracle replacement was another spin-off from
redesigning the middleware. This time it was enabled by the Reiser File System
(ReiserFS) - a relatively new development that is already the default in Suse,
Lindows and Gentoo Linux, largely because it's a journaling file system, so
it doesn't lose data following "unplanned outages". Linux servers don't crash
very much, but power fails sometimes, so a robust file system is a definite
advantage.
ReiserFS uses an improved version of the same
basic tree indexing scheme as some database engines. Thus ReiserFS is often
very fast and efficient compared with the traditional file systems of Linux
and Windows. Of course, for most datasets and applications it's probably not
as fast as Oracle's sophisticated database cluster. But in this case a bit less
performance was acceptable, and the new option of ReiserFS contributed to the
demise of one underused Oracle database.
One might imagine switching from an Oracle cluster
to a Linux file system produced huge cost savings, but it's not that simple.
The money saved was in fact used to pay for two new staff - software developers
who also contribute to the open-source application server used in the new Java
XML architecture.
So the decision to move away from Oracle and
J2EE was not driven simply by costs. Here, the web site is the firm's core business,
and fixing the feature set and technical agenda of its core business to one
supplier seemed a poor choice. Now the firm has influence over the features
and direction of the application server.
None of this is to disparage other Oracle databases,
and this is not a tale about open source versus commercial software. Rather
it's about choosing the best technologies available, and re-examining those
choices from time to time.
ReiserFS is unlikely to be the ultimate file
system. It will probably soon seem old hat compared with the next big thing.
My vote would be for something like Coda - a research implementation of a fault-tolerant,
distributed file system for long-latency IP networks. Meanwhile, keep building
and rebuilding - cut costs and prosper.
- Understanding What a File System Is
- Understanding File System Taxonomy
- Understanding Local File System Functionality
- Understanding Differences Between Types of Shared File Systems
- Understanding How Applications Interact With Different Types of File Systems
- Conclusions
- About the Author
Ext2 compatibility (Score:5, Informative)
by Wise Dragon (71071)
on Thursday August 14, @02:42PM (#6698412)
(http://slashdot.org/)
Dude, there are papers published about Ext2fs which describe the data structures
in exquisite detail. You don't need to look at the code to write an ext2fs clone.
I have written proprietary utilities to access ext2fs data structures. I know what
I am talking about.
http://e2fsprogs.sourceforge.net/ext2intro.html
http://uranus.it.swin.edu.au/~jn/explore2fs/es2fs.htm
In addition, there are various commercial tools that read and write ext2, such as
Ext2fs Anywhere [partition-manager.com].
So in that case, you're full of crap. I don't know if I am really qualified to comment
on the other case, but doesn't BSD have linux compatibility? And isn't BSD available
under a much less restrictive license? They could just adapt that code.
Series of interesting papers
AVFS is a system, which enables all programs to look inside gzip, tar, zip,
etc. files or view remote (ftp, http, dav, etc.) files, without recompiling
the programs.
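To make the idea concrete, here is a minimal Python sketch of resolving a path that reaches inside a tar archive. The `archive#member` path syntax and the `open_virtual` helper are assumptions made for this illustration only; AVFS itself works underneath unmodified programs rather than through a wrapper function like this one.

```python
import io
import tarfile

def open_virtual(path):
    """Open a plain file, or a member stored inside a tar archive.

    Hypothetical syntax: "archive.tar#member/name" reaches inside the archive.
    """
    if "#" not in path:
        return open(path, "rb")               # ordinary file: nothing special to do
    archive, member = path.split("#", 1)
    with tarfile.open(archive) as tar:        # tarfile detects gzip/bzip2 compression itself
        extracted = tar.extractfile(member)   # raises KeyError if the member is missing
        if extracted is None:                 # directories and special files carry no data
            raise IsADirectoryError(path)
        return io.BytesIO(extracted.read())   # copy out so the archive can be closed
```

A caller would write `open_virtual("backup.tar.gz#etc/fstab").read()` and get the member's bytes without unpacking the archive by hand.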
As Linux grows up, it aims to satisfy different
users and potential situations' needs. During recent years, we have seen Linux
acquire different capabilities and be used in many heterogeneous situations.
We have Linux inside micro-controllers, Linux router projects, one floppy Linux
distribution, partial 3-D hardware speedup support, multi-head Xfree support,
Linux games and a bunch of new window managers as well. Those are important
features for end users. There has also been a huge step forward for Linux server
needs — mainly as a result of the 2.2.x Linux kernel switch. Furthermore, sometimes
as a consequence of industry support and others leveraged by Open Source community
efforts, Linux is being provided with the most important commercial UNIX and
large server's features. One of these features is the support of new file systems
able to deal with large hard-disk partitions, scale up easily with thousands
of files, recover quickly from crash, increase I/O performance, behave well
with both small and large files, decrease the internal and external fragmentation
and even implement new file system abilities not supported yet by the former
ones.
This article is the first in a series of two,
where the reader will be introduced to the Journal File Systems: JFS,
XFS, Ext3, and ReiserFs. Also we will explain different features and concepts
related to the new file systems above. The second article is intended to review
the Journal File Systems behaviour and performance through the use of tests
and benchmarks.
FreeBSD uses the UFS (Unix File System), which is a little
more complex than Linux's ext2. It offers a better way to ensure filesystem
data integrity, mainly with the "softupdates" option. This option decreases synchronous
I/O and increases asynchronous I/O because writes to a UFS filesystem aren't
synced on a sector basis but according to the filesystem structure. This ensures
that the filesystem is always coherent between two updates. In my informal performance
tests, softupdates showed significant improvement.
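The phrase "according to the filesystem structure" can be illustrated with a small sketch of dependency-ordered writes. This is a simplification under stated assumptions, not FreeBSD's implementation: the block names and the `safe_write_order` helper are hypothetical, and the only rule modelled is that a structure must reach disk before anything that points to it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def safe_write_order(dependencies):
    """Return a write order in which each block follows the blocks it depends on.

    `dependencies` maps a block name to the set of blocks that must be on disk first.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical example: creating a file with one data block.
create_file = {
    "data block": set(),               # nothing has to precede the data itself
    "inode": {"data block"},           # the inode may only point at initialised data
    "directory entry": {"inode"},      # the name is published last
}
order = safe_write_order(create_file)
```

If the machine crashes part-way through this order, the disk may hold an unused inode or data block, but never a directory entry that points at garbage, which is why the filesystem stays coherent between updates.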
I used two identical boxes, one with Linux and the other with
FreeBSD 4.0-RELEASE. I moved a 1.2GB file between two mount points, back and
forth. I found that FreeBSD, without the softupdates, performs a little slower
than Linux. This speed changed after I added the softupdates to the FreeBSD
kernel and then updated the mount point (via tunefs). Only then did I notice
that FreeBSD's performance was marginally better (10 percent, or so).
These performance tests aren't perfect or anywhere near conclusive.
The Linux filesystem can be tweaked for performance; however, currently ext2
gets its performance from having an asynchronous mount. This is great for speed,
but if your system crashes it could take out the filesystem, its data, and its
current state. Often, a hard crash permanently damages a mount. FreeBSD with
softupdates can sustain a very hard crash with only minor data loss, and the
filesystem will be remountable with few problems.
Besides performance, FreeBSD UFS also has one major advantage
over Linux in security. FreeBSD supports file flags, which can stop a simple
script kiddie dead in his tracks. There are several flags that you can add to
a file, such as the immutable flag. The immutable (schg) flag
won't allow any alteration to the file or directory unless you remove it. Other
very handy flags are append only (sappnd), cannot delete
(sunlnk), and archive (arch). When you combine these with the
kernel security level covered below, you have a very impenetrable system.
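As a rough illustration of how these flags appear to a program, the sketch below decodes a file's flag word into the names used above. The name-to-constant mapping follows Python's `stat` module; the `decode_flags` and `file_flags` helpers are hypothetical, and `st_flags` is only populated on BSD-derived systems, so elsewhere the result is simply empty.

```python
import os
import stat

# chflags(1) names used in the text, mapped to the stat module's constants.
FLAG_NAMES = [
    ("arch", stat.SF_ARCHIVED),
    ("schg", stat.SF_IMMUTABLE),
    ("sappnd", stat.SF_APPEND),
    ("sunlnk", stat.SF_NOUNLINK),
]

def decode_flags(st_flags):
    """Translate a raw flag word into the list of system flag names set in it."""
    return [name for name, bit in FLAG_NAMES if st_flags & bit]

def file_flags(path):
    """Return the system flags on `path`; platforms without st_flags report none."""
    return decode_flags(getattr(os.stat(path), "st_flags", 0))
```

On FreeBSD, `file_flags("/etc/rc.conf")` would be expected to include `schg` once the immutable flag has been set with `chflags`.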
libferris.so
libferris is a virtual filesystem that exposes hierarchical data of all kinds
through a common C++ interface.
Access to data is performed using C++ IOStreams and Extended
Attributes (EA) can be attached to each datum to present metadata. Ferris uses
a plugin API to read various data sources and expose them as contexts and to
generate interesting EA. Current implementations include Native (kernel disk
IO with event updates using fam), xml (mount an xml file as a filesystem), edb
(mount a berkeley database), ffilter (mount an LDAP filter string) and mbox
(mount your mailbox). EA generators include image, audio, and animation decoders.
About: translucency is a Linux
kernel module that virtually merges two directories, making it possible to overwrite
files on read-only media and compile projects (such as the Linux kernel) with
different options without copying sources each time. No user-space tools have
to be changed. The process is also known as inheriting (ifs), stacking, translucency
(tfs), loopback (lofs), and overlay (ovlfs).
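The merging behaviour can be sketched in user space as follows. This only imitates the lookup and copy-on-write logic; the real module does this inside the kernel so that no program has to change, and the `resolve` and `open_for_write` names are assumptions made for the example.

```python
import os
import shutil

def resolve(path, upper, lower):
    """Return the real location of `path`: the writable upper copy wins."""
    for layer in (upper, lower):
        candidate = os.path.join(layer, path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(path)

def open_for_write(path, upper, lower):
    """Copy the file up into the writable layer on first write, then open it there."""
    target = os.path.join(upper, path)
    if not os.path.exists(target):
        os.makedirs(os.path.dirname(target), exist_ok=True)
        source = os.path.join(lower, path)
        if os.path.exists(source):
            shutil.copy2(source, target)      # the read-only original is never touched
    return open(target, "r+b" if os.path.exists(target) else "w+b")
```

With `lower` pointing at a CD-ROM mount and `upper` at a scratch directory, writes land in the scratch copy while reads of untouched files still come from the read-only media.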
Changes: This version has enabled
".." handling and improves behavior on existing files.
To achieve the long-elusive goal of easily finding information hidden in computer
files, Microsoft is returning to a decade-old idea.
The company is building new file organization software that
will begin to form the underpinnings of the next major version of its Windows
operating system. The complex data software is meant to address a conundrum
as old as the computer industry itself: how to quickly find and work with a
piece of information, no matter what its format, from any location.
For those using Windows, this will mean easier, faster and
more reliable searches for information. Replacing its antiquated file system
with modern database technology should also mean a more reliable Windows that's
less likely to break and easier to fix when it does, said analysts and software
developers familiar with the company's plans.
In the process, the plan could boost Microsoft's high-profile
.Net Web services plan and pave the way to enter new markets for document management
and portal software, while simultaneously dealing a blow to competitors.
But success won't come overnight. Building a new data store
is a massive undertaking, one that will touch virtually every piece of software
Microsoft sells. The company plans to include the first pieces of the new data
store in next release of Windows, code-named Longhorn, which is scheduled to
debut in test form next year.
"We're going to have to redo the Windows shell; we're going
to have to redo Office, and Outlook particularly, to take advantage" of the
new data store, Microsoft CEO Steve Ballmer said in a recent interview with
CNET News.com. "We're working hard on it. It's tough stuff."
Tough indeed. The development of the new file system technology
is so difficult that Microsoft may have to market two distinctly different product
lines while it completes the work--a move Ballmer concedes would be a huge step
backward in the company's long-sought plan to unify its operating systems with
Windows XP and Windows .Net Server, which has been delayed until year's end.
For years, Microsoft has sold two operating systems: a consumer
version based on the 20-year-old technology DOS, and a corporate version based
on the company's newer, built-from-scratch Windows NT kernel. The dual-OS track
has frustrated software developers, who needed to support two different operating
systems, and has confused customers, who often didn't understand the difference
between them.
"Will we have two parallel tracks in the market at once? Not
desirable. There are a lot of reasons why that was really a pain in the neck
for everybody, and I hope we can avoid that here," Ballmer said. "But it's conceivable
that we will wind up with something that will be put on a dual track."
Still, Ballmer and his executive team believe it's a risk
well worth taking. Right now, each Windows program includes its own method for
storing data, such as the vastly different formats used by Microsoft's Outlook
e-mail program and Word document software. Despite advances in Windows' design
and networking technology, it's still impossible to search across a corporate
network for all e-mails, documents and spreadsheets related to a specific project,
for instance. Searching through video, audio and image files is kludgy at best.
Likewise, it's tricky--if not impossible--to build new programs
that tap into those files. "If I'm looking for anything where I interacted with
one customer in the last 12 months, I need to search for e-mail, Word documents
or information in my database," said Chris Pels, president of iDev Technologies,
a software consulting and design firm in East Greenwich, R.I. "That kind of
stuff is a nightmare from a programming perspective these days."
Other software makers have attempted to solve the same problem.
Nearly two years ago, Oracle introduced something called Internet File System,
which works with its database server to make storage and retrieval of
data--including Microsoft Word and Excel documents--easier and more reliable.
"This hasn't been done in a commercial operating system, but it has been done
with Oracle's database," said Rob Helm, editor in chief of Directions on
Microsoft, crediting Oracle CEO Larry Ellison as an early proponent of the idea.
Oracle continues to challenge Microsoft on this front. Last fall, the company
announced an e-mail server option for its 9i database management software along
with a migration program to move companies from Microsoft Exchange to Oracle's
database.
Yet Oracle's efforts amount to more of a jab between long-time
adversaries than a serious competitive challenge. Given Windows' enormous market
clout, Microsoft's plan could change the competitive landscape of the software
business and affect millions of computer users and technology buyers.
"It's a huge risk for Microsoft," Helm said. "They have so
much riding on this. If this is late and doesn't work as advertised, it will
have effects that will ripple through the entire company and the industry. But
the benefits, if they succeed, will be huge."
Microsoft's first--and perhaps largest--challenges will be
internal: how to overcome the technical and organizational obstacles it encountered
when it set out to solve the very same problem in the early 1990s. At that time,
the company launched an ambitious development project to design and build a
new technology called the Object File System, or OFS, which was slated to become
part of an operating system project code-named Cairo.
"We've been working hard on the next file system for years,
and--not that we've made the progress that we've wanted to--we're at it again,"
Ballmer said.
While the Cairo project eventually resulted in Microsoft's
Windows 2000 operating system, the file system work was abandoned because of
complexity, market forces and internal bickering. "It never went away. We just
had other things that needed to be done," Jim Allchin, the group vice president
in charge of Windows development, told News.com.
Those other things most likely included battling "Netscape
and Java and the challenge of the Internet and the Department of Justice," Gartner
Group analyst David Smith said--issues that continue to persist today.
Microsoft executives say the company plans to resurrect the
OFS idea with the Longhorn release of Windows. "This will impact Longhorn deeply,
and we will create a new API for applications to take advantage of it," Allchin said.
He said bringing the plan back now makes sense because new
technologies such as XML (Extensible Markup Language) will make it much easier
to put in place. XML is already a standard for exchanging information between
programs and a cornerstone of Microsoft's Web services effort, which is still
under development.
Longhorn and the new data store are the "next frontier" of software design,
Allchin said.
In addition, Microsoft has already developed the database
technology it needs for a new file system. A future release of its SQL Server
database, code-named Yukon, is being designed to store and manage both standard
business data, such as columns of numbers and letters, and unstructured
data, such as images. Yukon will
also form the data storage core of Microsoft's Exchange Server and other future
products.
The more important reasons for the renewed development effort,
however, are strategic. If the plan succeeds, it will give Microsoft a huge
technological advantage over the competition by making its products more attractive
to buyers and giving large companies another reason to install Windows-based
servers.
"Having multiple data stores makes life harder for the enterprise
customer," Helm said. "Search will become much easier, and this should make
it cheaper to build new systems because customers only have to learn one database."
Helm said the database capability in Windows will make it
a snap to add document management and more advanced portal development tools.
Those applications will in essence be built into the operating system, making
it more likely that customers will use them.
Moreover, industry veterans note that the new data store will
benefit from Microsoft's tried-and-true strategy for entering new markets--leveraging
the overwhelming market share of Windows. Because Microsoft needs the new data
store to make its .Net services plan work, analysts say the company is likely
to pressure customers to make the move to the Longhorn release of Windows through
licensing incentives or other means.
Nevertheless, widespread acceptance is not a foregone conclusion.
For big companies not yet ready to install Microsoft's 3-year-old Windows 2000
operating system--much less Windows XP, released last October--the Longhorn
plan may be too much to contemplate right now.
"That's the real issue that I see in the trenches: the rate
of change--for programmers, for businesses, in terms of making infrastructure
technology decisions," Pels said. "People can't keep up with it, and if they
want to keep up with it, is it worthwhile for their business?"
Mike Gilpin, an analyst with Giga Information Group, agrees.
"It's a great dream," he said. "But it could be hard to make real."
"Alan Cox replied tersely, "Which means an ext3
volume cannot be recovered on a hard disk error." And Stephen replied: Depends
on the error. If the disk has gone hard-readonly, then we need to recover in
core, and that's something which is not yet implemented but is a known todo
item."
This document describes Sun's implementation of the Large File Summit's standard
for 64 bit file access... including the User level experience of converting
existing applications to the new standard.
IBM's journaled file system technology, currently used in
IBM enterprise servers, is designed for high-throughput server environments,
key to running intranet and other high-performance e-business file servers.
IBM is contributing this technology to the Linux open source community with
the hope that some or all of it will be useful in bringing the best of journaling
capabilities to the Linux operating system. Work is currently underway to complete
the port of this technology to Linux.
Developing JFS
JFS is licensed under the GNU General Public License. If there's a feature that
you'd like to see added to JFS, consider becoming a part of the JFS development
process. Since JFS is an open source project, it's easy to get involved.
Get the Source
A CVS repository contains the latest stable version of the JFS source code and
documentation. All JFS core team members and JFS contributors have read-write
access to CVS and WebCVS.
CVS is a system that lets groups of people work simultaneously on groups of
files. CERN has a Web site with general information on CVS, as does cyclic.com.
For convenience the latest source may be downloaded as jfs-0.0.1.tar.gz.
For details on building the source and a list of ToDo items, examine the README.
Report bugs
Jitterbug is the system for tracking JFS bugs and feature requests. The core
team and contributors have read-write access to this database. The community at
large has read access through a Web interface.
Jitterbug is a Web-based bug tracking system. It handles bug tracking, problem
reports, and queries and is available under the GNU General Public License.
JitterBug has a Web site for general information on JitterBug.
ReiserFS article. Interestingly, this project is being funded by Suse and Mp3.com.
The FS basically seems to be boasting much more efficient algorithms and handles
small file space better.
A great way to follow kernel development is to read the excellent kernel
mailing list synopses written by Zack Brown at:
http://kt.linuxcare.com
ext3fs is a journaled version of ext2fs written by Stephen
Tweedie. It's in beta form right now but works pretty well. Stephen and Ted
Ts'o talked about ext3fs at our Linux Storage Management Workshop in Darmstadt,
Germany (you can get the slides for this workshop at
ftp://linux.msede.com/lsmws_talks/)
The ext3 filesystem, of which early alphas are ready (version 0.0.2c, the
excitement!!). Development is on the linux-fsdevel mailing list, archived here.
Hello, I've been running ext3 on my laptop computer for about two months now.
It works great. Just sync the disks and turn it off. No shutdown. No data loss
either.
If you look at e.g. Solaris disk-suite you are able to control where you should
store your metadata. Say that you want to journal file data as well; this
normally slows the system down. But if you can specify that all file metadata
should be on a separate solid-state disk (naturally mirrored for safety), then
journaling of file data will be quick and swift. This is in my view quite
important. If I understand everything correctly you can do that with ext3.
One of the major problems with ext2fs (IMHO) is that it doesn't resize well.
This is because there is a copy of every group descriptor in every group [a
g.d. contains metadata for a group of blocks/inodes, typically 8M in size].
Therefore enlarging or shrinking the drive causes a major reshuffle of ALL the
data; so far, the only utility I know that can do this is resize2fs, which comes
with Partition Magic (there are no doubt others now).
This redundancy is good in theory (backups), but keeping a copy of a constant
number of group descriptors (perhaps the previous and next 32) in a given group
would still give you a lot of redundancy plus make resizing simpler.
Granted, resizing isn't something you do a lot, but having had my system lock
up and die while resizing and having to recover using Turbo C++ and the ext2fs
spec (code and info on my ext2fs page), it would be nice if ext3fs (or XFS) made
this easier.
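The resizing complaint in the comment above comes down to simple arithmetic, sketched here. The 32-byte descriptor size and the rule that one bitmap block covers a whole group come from the classic ext2 on-disk layout rather than from the comment itself, and the 1 KiB block size is an assumption chosen to match the "8M" group size it mentions. The function name is hypothetical.

```python
GROUP_DESC_SIZE = 32  # bytes per ext2 group descriptor

def descriptor_overhead(fs_bytes, block_size=1024):
    """Return (groups, blocks per descriptor-table copy, total descriptor blocks)."""
    blocks_per_group = 8 * block_size          # one bitmap block has this many bits
    total_blocks = fs_bytes // block_size
    groups = -(-total_blocks // blocks_per_group)               # ceiling division
    table_blocks = -(-groups * GROUP_DESC_SIZE // block_size)   # size of one table copy
    return groups, table_blocks, groups * table_blocks          # one copy in every group

GIB = 1024 ** 3
before = descriptor_overhead(4 * GIB)
after = descriptor_overhead(8 * GIB)
```

Doubling the filesystem doubles the size of the descriptor table, and because a copy sits at the front of every group, the data that follows it in each group has to move to make room.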
The Reiser
Filesystems by Hans Reiser, a very ambitious project to not only improve
performance and add journaling, but to redefine the filesystem as a storage
repository for arbitrarily complex objects.
reiserfs.
Reiserfs is faster than ext2/3 because it uses balanced trees for its directory structures.
The project is now released for 2.2.11 - 2.2.13. Mailing list archive
here.
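A rough feel for why a tree-structured directory helps can be had from the sketch below. It uses binary search over a sorted list as a stand-in for a balanced tree; that is not ReiserFS's actual data structure, only an illustration of logarithmic versus linear lookup cost. The helper names and the generated file names are made up for the example.

```python
from bisect import bisect_left

def linear_lookup(entries, name):
    """Scan a directory entry by entry; returns (found, comparisons made)."""
    for count, entry in enumerate(entries, start=1):
        if entry == name:
            return True, count
    return False, len(entries)

def sorted_lookup(sorted_entries, name):
    """Binary search over sorted names; needs about log2(n) comparisons."""
    i = bisect_left(sorted_entries, name)
    return i < len(sorted_entries) and sorted_entries[i] == name

# Hypothetical directory with 10,000 entries, already in sorted order.
names = [f"file{i:05d}" for i in range(10000)]
found, comparisons = linear_lookup(names, "file09999")
```

Finding the last entry by scanning touches every one of the 10,000 names, while the sorted lookup needs on the order of 14 comparisons; the gap widens as directories grow.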
The Xfs site has some docs. The work to unencumber the code is accelerating, and
February is the target date for source code release.
XFS is the one that I think has the most potential. It's a full logging
filesystem from the ground up, not an extension (not that EXT3 or DTFS are bad
or misguided efforts). I'm betting it will be the highest performance filesystem
for Linux when it goes gold. I think the tight integration of the log could be a
huge plus. It's been a while since filesystem 101 but I would think that there
are a ton of ways to optimize performance with log write back tricks and usage
optimizations. You could include a hit counter in metadata and have an optimizer
that moves higher hit files closer to the log in the center of the disk, making
your more frequently used files closer to where the head is supposed to be.
Those kinds of optimizations (if practical, maybe I'm full of it) wouldn't be
nearly as easy with ext3 since the FS doesn't have any knowledge of the log.
Plus xfs has ACLs and big file support already.
Hi, ext3fs is a journaled version of ext2fs written by Stephen Tweedie. It's
in beta form right now but works pretty well. Stephen and Ted Ts'o talked about
ext3fs at our Linux Storage Management Workshop in Darmstadt, Germany (you can
get the slides for this workshop at ftp://linux.msede.com/lsmws_talks/)
Stephen also gave a talk on ext3fs at the Linux Kongress in Augsburg, Germany.
He is predicting Summer 2000 for production use of ext3fs. Nice features include
the fact that ext3fs is backwards compatible with older versions of ext2. In
addition, ext3fs uses asynchronous journaling, which means the performance will
be as good or better than ext2fs.
I am involved with the SGI effort to port XFS to Linux. The work to unencumber
the code is accelerating, and February is the target date for source code release.
The read path is working at this time. More work remains however, so stay tuned
to http://oss.sgi.com
[August 17, 1999]
read-2_23.zip
Size: 72kb. LREAD v2.3 - Program to read Linux Extended2 filesystems on PCs from
within DOS
Alan put 2.2.11pre2 up on
ftp://ftp.*.kernel.org/pub/linux/kernel/alan/proposed-2.2.11pre2.gz and
posted a changelog against 2.2.10. Linus replied,
"Looks good, except aic7xxx is wrong version ;) Tssk,
tssk."
One of Alan's changes was "FAT now uses cluster numbering
for inode info", which Alexander Viro took exception to. Alexander replied to
the announcement:
"It doesn't. It
generates inumbers on the fly. Cluster numbering is unusable for that -
truncate() *shouldn't* change inumber. FAT *has* no file invariants that
would survive (a) rename, (b) truncate(), (c) write and (d) umount. Of all
those umount give the least pain wrt races. New code guarantees constant
inumbers for opened files.
The bottom line - inumbers on FAT will suck anyway. There
is no inodes in normal sense. And inumbers changing after reboot are *much*
better than exploitable races. On FAT usage of (old) inumbers for any backup
stuff was broken - rename() would go unnoticed."
Another one of Alan's changes was to remove the COMA workaround,
and recommend people just use set6x86 if they have that Cyrix CPU bug. Zoltan
Boszormenyi said sarcastically that in that case, they might as well remove
the f00f bugfix as well. Alan defended the change, and there was a discussion
about which fix was enabled in which version and then switched for which other
fix.
See also
OSRC
File Systems
There are several necessary extensions of the Unix filesystem. The
two most often mentioned are journaling and file system indexing.
File System Indexing, and Backup by Jerome H. Saltzer Laboratory
for Computer Science Massachusetts Institute of Technology M.I.T. Room NE43-513
Cambridge, Massachusetts 02139 U.S.A.
This paper briefly proposes two operating system ideas: indexing
for file systems, and backup by replication rather than tape copy. Both of these
ideas have been implemented in various non-operating system contexts; the proposal
here is that they become operating system functions.
A UNIX APPROACH TO DATABASE SOFTWARE. Thomas Lord, Berkeley CA.
Sprite papers
- A Trace-Driven
Analysis of the UNIX BSD File System. John K. Ousterhout, Herve' Da Costa,
David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson
- Measured Performance of Caching in the Sprite Network File System. Brent B. Welch
- Caching
in the Sprite Network File System. Michael N. Nelson, Brent B. Welch,
and John K. Ousterhout
- Sprite
Position Statement: Use Distributed State for Failure Recovery. Brent
Welch, Mary Baker, Fred Douglis, John Hartman, Mendel Rosenblum, John Ousterhout
- The
File System Belongs in the Kernel. Brent Welch
- The Sprite
Internet Protocol Server. Andrew Cherenson
- The Jaquith
Archive Server. James W. Mott-Smith
- Beating
the I/O Bottleneck: A Case for Log-Structured File Systems. John Ousterhout
and Fred Douglis
- The
LFS Storage Manager. Mendel Rosenblum and John K. Ousterhout
- The Design
and Implementation of a Log-Structured File System. Mendel Rosenblum
and John K. Ousterhout
- Measurements
of a Distributed File System. Mary G. Baker, John H. Hartman, Michael
D. Kupfer, Ken W. Shirriff, and John K. Ousterhout
- A Trace-Driven Analysis of Name and Attribute Caching in a Distributed System.
Ken W. Shirriff and John K. Ousterhout
- Non-Volatile Memory for Fast, Reliable File Systems. Mary Baker, Satoshi
Asami, Etienne Deprit, John Ousterhout, Margo Seltzer
- Why
Aren't Operating Systems Getting Faster As Fast as Hardware? John K.
Ousterhout
- Pseudo Devices:
User-Level Extensions to the Sprite File System. Brent B. Welch and John
K. Ousterhout
- Pseudo-File-Systems.
Brent B. Welch and John K. Ousterhout
- The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment.
Mary Baker and Mark Sullivan
- Availability
in the Sprite Distributed File System. Mary Baker and John Ousterhout
- The Sawmill
logging file system. Ken Shirriff
- Slides
from a work-in-progress talk on the Sawmill logging file system. Ken Shirriff
- An Implementation
of Memory Sharing and File Mapping. Ken Shirriff
- The Sprite
Network Operating System. John K. Ousterhout, Andrew R. Cherenson, Frederick
Douglis, Michael N. Nelson, Brent B. Welch
- Sprite
on Mach. Michael D. Kupfer
- The Role
of Distributed State. John K. Ousterhout
- Transparent Process Migration: Design Alternatives and the Sprite Implementation.
Fred Douglis and John Ousterhout
- Virtual
Memory vs. The File System. Michael N. Nelson
- Virtual Memory
for the Sprite Operating System. Michael N. Nelson
- A Comparison of the Vnode and Sprite File System Architectures. Brent Welch
- "Zebra:
A Striped Network File System". Appeared in the Proceedings of the USENIX
Workshop on File Systems. Also as UCB/CSD Tech Report 92/683 John Hartman
and John Ousterhout
- The Zebra
Striped Network File System. John Hartman and John Ousterhout
Extending the Operating System at the User Level: the Ufo Global File System
by Albert D. Alexandrov, Maximilian Ibel, Klaus E. Schauser, and Chris J.
Scheiman, Proceedings of the USENIX 1997 Annual Technical Conference, Anaheim,
California, January 1997.
In this paper we show how to extend the functionality
of standard operating systems completely at the user level. Our approach works
by intercepting selected system calls at the user level, using tracing facilities
such as the /proc file system provided by many Unix operating systems. The behavior
of some intercepted system calls is then modified to implement new functionality.
This approach does not require any re-linking or re-compilation of existing
applications. In fact, the extensions can even be dynamically ``installed''
into already running processes. The extensions work completely at the user level
and install without system administrator assistance.
We used this approach to implement a global file system, called
Ufo, which allows users to treat remote files exactly as if they were local.
Currently, Ufo supports file access through the FTP and HTTP protocols and allows
new protocols to be plugged in. While several other projects have implemented
global file system abstractions, they all require either changes to the operating
system or modifications to standard libraries. The paper gives a detailed performance
analysis of our approach to extending the OS and establishes that Ufo introduces
acceptable overhead for common applications even though intercepting system
calls incurs a high cost.
Keywords: operating systems, user-level extensions,
/proc file system, global file system, global name space, file caching
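The abstract's idea of treating remote files as local, with pluggable protocols and caching, can be imitated at library level as in the sketch below. This is not how Ufo works internally (the paper intercepts system calls through tracing facilities such as /proc, so applications need no changes); the `GlobalOpener` class, its method names, and the cache layout are all assumptions made for illustration.

```python
import os
import tempfile

class GlobalOpener:
    """Open remote names as if they were local files, via pluggable handlers."""

    def __init__(self):
        self.handlers = {}        # scheme -> function(url) returning bytes
        self.cache = {}           # url -> path of the locally cached copy
        self.cache_dir = tempfile.mkdtemp(prefix="ufo-sketch-")

    def register(self, scheme, fetch):
        """Plug in a new protocol, e.g. register("ftp", my_ftp_fetch)."""
        self.handlers[scheme] = fetch

    def open(self, name, mode="rb"):
        scheme, sep, _ = name.partition("://")
        if not sep or scheme not in self.handlers:
            return open(name, mode)           # ordinary local path: pass through
        if name not in self.cache:            # first access: fetch the whole file
            local = os.path.join(self.cache_dir, str(len(self.cache)))
            with open(local, "wb") as f:
                f.write(self.handlers[scheme](name))
            self.cache[name] = local
        return open(self.cache[name], mode)   # later accesses hit the local copy
```

A program would call `opener.open("ftp://host/pub/README")` exactly as it would open a local path; the handler runs once and later opens are served from the cache.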
See also GSCHWIND, M. K. 1994. FTP---Access as a user-defined
file system. ACM SIGOPS Oper. Syst. Rev. 28, 2 (Apr.), 73--80.
SunWorld: Security basics, Part 1 - Understanding file attribute bits and modes (Oct 29, 2000)
Linux Magazine: A Tour of the Linux Filesystem: Part II (Sep 23, 2000)
Linux Magazine: A Tour of the Linux Filesystem: Part I (Aug 27, 2000)
ShowMeLinux.com: Ask Alex - Linking Files, Finding Files, Dual Modem Setups, and More (Aug 26, 2000)
FirstLinux.net: Linux Directory Structure (Aug 02, 2000)
Linux Magazine: Guru Guidance: Managing Filesystems: Beyond the Basics (Jul 23, 2000)
LinuxNovice.org The Linux filesystem
There is no doubt that one of the most confusing things about
Linux (at least to the novice user) is its filesystem. Since most of us grew
accustomed to the way Windows does things, thinking about the filesystem in
terms of the A or C drive seems almost natural, but understanding the differences
between /etc and /var takes us to a whole different world. The present article
tries to make it easier for new Linux users to understand the filesystem.
If you need more information, feel free to visit the
Filesystem Hierarchy
Standard (FHS) site. This is the organization that tries to lay out a filesystem
standard not only for the different Linux distributions, but also for UNIX.
GNULinux.com
- Filesystem review for Complete Newbies
The Linux filesystem is rather confusing to
new users. We'll try to remove a little of the mystery, showing you the logic
of Linux and help you become more accustomed to accessing files and mount points.
This document will be updated from time to time, and I will keep a running list
of FAQ's at the bottom of the page for easy reference.
A few conventions before we go on:
Everything in Linux is considered either
a file or a directory. Linux sees everything as a text file that can be opened
as such--generally, it means you get a screen full of garbage, but many of the
files are human-readable and you can edit them as you see fit. You can check
into a few examples of the file theme in the /dev (devices) and the /proc (process)
directories. The /dev is the location for devices attached to your system (i.e.,
hardware) and /proc is literally what is in your system's memory, from IRQ's
and PCI channels, to temp files. It's never a good idea to edit the things in
/dev or /proc, but looking in those directories will help with the Linux concept
that everything is a file in one way or another.
It's been this way for 30 years. Yep, the Unices
(UN*X-like operating systems, which includes Linux) have a long and deep history,
stretching back to the dawn of computing, and the filesystem has stayed pretty
much intact. Sure, there have been changes along the way, but as a whole, it's
the same structure. A working knowledge of the filesystem makes you a good user
and a good administrator, able to circumvent problems by going to the source
every time.
(Apr 2, 2000, 17:13 UTC) (Posted by
marty) (0 talkbacks posted) (1071 reads)
"This article will unravel the mysteries of the Unix and GNU/Linux filesystem to
the new user: where you can find files, where you should put files, and how to avoid
getting lost."
[June 25, 1999]
Gregor N. Purdy - Project Ideas - CVFS -- Concurrent Versioning File System
(CVFS)
SCO OpenServer Release 5 Filesystem Technology - White Paper
Filesystem Administration
[Aug 19, 2000]
Opensource.html
IBM announces AFS as an open source product under the IBM Public License
Re:you probably don't want AFS
(Score:1)
by jlrobins_uncc
(jlrobins@uncc.delete.edu) on Thursday August
17, @05:55PM EDT (#113)
http://www.cs.uncc.edu/~jlrobins/
|
| AFS makes great sense for Web server farms and/or
mirrors of the same site across a WAN such as the Internet (think an east
coast site and a west coast site). Just edit the file and pow, a server
-> client callback notifies any clients caching the file that they need
to refetch.
Couple this with having the content in a read-only replicated volume,
then go ahead and update many files, get your new site look-and-feel redone,
then once you're happy with it, release the read/write volume for replication,
and pow -- one atomic transaction to all of the mirroring servers on the
WAN!
Maybe this is why AFS is a major component of IBM's WebSphere platform.
All of this, currently working like a champ, and it'll be free and
open source!
---------- Hail Ants!
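The callback idea described in this comment can be sketched as a toy model. This is not AFS's actual protocol or API; the classes FileServer and Client and all their methods are hypothetical names chosen for illustration. The server remembers which clients cache a file and notifies them when the file changes, so an unchanged file is fetched only once:

```python
class FileServer:
    """Toy server that remembers which clients cache each file."""

    def __init__(self):
        self.files = {}      # path -> contents
        self.callbacks = {}  # path -> set of clients promised a notification

    def fetch(self, client, path):
        # Hand out the data and promise to tell this client when it changes.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data):
        self.files[path] = data
        # Break the callbacks: every caching client learns its copy is stale.
        for client in self.callbacks.pop(path, set()):
            client.invalidate(path)


class Client:
    """Toy client that serves reads from its cache until told otherwise."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.fetches = 0  # number of trips "down the wire"

    def read(self, path):
        if path not in self.cache:
            self.fetches += 1
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def invalidate(self, path):
        # Called by the server; the next read will refetch.
        self.cache.pop(path, None)
```

In this model, repeated reads cost one fetch, and a store on the server forces exactly one refetch per caching client.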
|
Why this has so much potential for
good. (Score:4, Insightful)
by jlrobins_uncc
(jlrobins@uncc.delete.edu) on Thursday August
17, @05:47PM EDT (#111)
http://www.cs.uncc.edu/~jlrobins/
|
AFS is a very stable, tested, enterprise filesystem.
It offers the following features:
- Cross platform: Many UNIXen as well as NT as either client or server.
- Secure: Uses Kerberos IV for user authentication.
- Client-side caching: client machines use disk or virtual memory
to cache most recently used files, greatly reducing the number of trips down the wire on reads.
- Unified naming scheme: names of files don't indicate what file server
they're on. Makes moving of volumes from one fileserver (or drive on
the same fileserver) to another a cinch, since no client-side changes
need to happen.
- Read-only replication: Make your application install directories
replicated in each building on campus.
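The unified naming scheme in the list above can be illustrated with a small sketch. It is a simplified model, not AFS's real volume location service: the class VolumeLocationDB, its methods, and the example cell name example.org are all assumptions made for illustration. Clients name files by path only, and a server-side table maps the path's volume to whichever file server currently holds it:

```python
class VolumeLocationDB:
    """Toy map from volume name to the file server currently holding it."""

    def __init__(self):
        self.mounts = {}            # path prefix -> volume name
        self.volume_to_server = {}  # volume name -> server name

    def mount(self, prefix, volume, server):
        self.mounts[prefix] = volume
        self.volume_to_server[volume] = server

    def move(self, volume, new_server):
        # Only the server-side table changes; client paths stay the same.
        self.volume_to_server[volume] = new_server

    def locate(self, path):
        # The longest mount prefix that contains the path wins.
        matches = [
            p for p in self.mounts
            if path == p or path.startswith(p.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(path)
        volume = self.mounts[max(matches, key=len)]
        return self.volume_to_server[volume]
```

Moving a volume is a single table update, which is why no client-side change is needed in this model.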
Now, it's not a perfect product, but it is way cooler than vanilla
NFSv2 or NFSv3, especially on the server-side management side of things.
It doesn't do disconnected operation (which CODA strives to do), byte-range
locking, strict UNIX file semantics (data most recently written == data
viewable by all file handles to that file), or Kerberos 5, but it is a
far simpler system to get running than DCE, which does address some of those
issues.
One would hope to see the following things from this open sourcing:
- *BSD client / servers.
- MacOS X client (at least!)
- Millennium / Win2K clients (NT clients exist currently).
If the MacOS X client happens, then there will be a secure, scalable
enterprise filesystem for the three major computer platforms -- Wintel,
UNIX, and Mac, and it'll even be freely available! I don't believe
that there are any products available today that offer secure,
robust support for all three platforms (and no, I don't consider
protocol translators, such as Samba or CAP, which require you to set up
the clients to use cleartext passwords over the wire to authenticate (not
to downplay in any way the role of either technology -- it's not their fault
that you've got to set up the clients in that fashion to interoperate with
AFS as it is now), or using NFSv2 or v3 on the UNIX end to talk to something
like Novell 5 (which, AFAIK, doesn't talk at all to Macs anymore)).
This will give us one protocol on the wire, multiple server-side implementations
(interoperable in the same cell!), multiple client-side implementations,
WAN scalability, and secure authentication. A good day for the world!
|
As one of the architects designing
DFS in IBM (Score:4, Interesting)
by gelfling on Thursday August 17, @08:06PM EDT (#128)
|
We've always had a hard time selling DFS internally.
In fact we've stopped trying to do that because there weren't enough internal
customers. The hurdle costs were too high, the skills were hard to find and
expensive, and customers still wanted SMB shares via Samba, which drove the
cost even higher. The client side DCE licence costs drove Samba since the
per client cost was $65/seat in bulk. AFS as open source can only be a good
thing since we can always find someone to pick up the development and maintenance
and foregoing DCE-Kerberos is really not that big a deal from an internal
perspective. In our environment the challenge was to collapse hundreds of
LanServer domains. DFS or AFS fit the bill and the cost dynamics work very
well compared to staffing 1 headcount/25-35 servers in the LanServer world.
The problem anyone will find though is backup and storage management. butc
or buta just don't scale very well even with multiple replicas of the fldb
core so whoever tries to manage this, as we did, will be forced to write
extensions to their storage management code, as we did with ADSM. Also you
will find that Samba doesn't scale nearly as well as you want with only
a few hundred accounts on a Samba server even if it sits on a huge Unix
machine. This leaves you with a few hundred or more SMB gateways if you
try to scale up to the huge numbers we did.
Once again AFS open source can only be a good thing - it will propagate
a great technology into large sites where they would have shied away from it
previously.
|
Articles IBM Open Sourcing AFS
AFS semantics are very different from UNIX file
system semantics: permissions are associated with directories only; access is
determined only by the containing directory; if multiple clients modify the
same file, updates are lost; you can't have any special files in an AFS file
system; etc. AFS uses its own authentication, it doesn't work well for big files,
it always requires extra work to get it to work with daemons, and it has severe
problems for scientific compute clusters. IBM long ago moved on to DFS (unrelated
to Microsoft DFS), which fixes many of the problems of AFS (but is itself big,
even more complex than AFS, and hard to administer). Many places are trying
to get rid of AFS because it's just too much of a hassle to run it (and converting
back to a UNIX file system isn't easy because AFS encourages permissions and
ACLs to mushroom unnecessarily).
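To make the directory-only permission model concrete, here is a deliberately simplified sketch. The class DirectoryACLs and the example paths are hypothetical; the rights letters follow the usual AFS set (r read, l lookup, i insert, d delete, w write, k lock, a administer), and details such as groups and negative rights are left out. The point is that access to a file is decided by the ACL of its containing directory, never by the file itself:

```python
import posixpath


class DirectoryACLs:
    """Toy model: ACLs live on directories, never on individual files."""

    def __init__(self):
        self.acls = {}  # directory path -> {user: set of rights letters}

    def set_acl(self, directory, user, rights):
        self.acls.setdefault(directory, {})[user] = set(rights)

    def allowed(self, user, path, right):
        # The file carries no ACL of its own; only its parent is consulted.
        directory = posixpath.dirname(path)
        return right in self.acls.get(directory, {}).get(user, set())
```

Under this model, giving two files in the same directory different permissions means splitting them into separate directories, which is one way ACLs end up multiplying.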
AFS may be acceptable for specific applications
(in fact, what it was designed for originally): a large untrusted user population,
dedicated system management staff, and smallish files and problems (text file
editing, small programming jobs). But for many environments where Linux is used--big
software development projects, web servers, scientific computing, home networking--it
just doesn't seem like a good fit.
If it's the security you care about, NFSv4 might
be for you, although it clearly also has some problems. If you want something
AFS-like, Coda might be an option (but I don't know how mature it is yet). MFS
and GFS are options for compute clusters. Maybe we can get 9P or Styx up on
Linux.
Re:you probably don't want AFS
(Score:2, Informative)
http://www.cae.wisc.edu/~gerdts
The problem you mention with "not working well
with daemons" is likely related to the fact that AFS uses Kerberos IV. If the
daemon needs to have more access to AFS directories than you are willing to
give to any other user on the system, there is a lot of work to do.
Specifically, you need to stash a password away
such that the daemon can authenticate and periodically reauthenticate so that
it does not lose the rights that it has.
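The periodic re-authentication described here might be structured roughly as follows. This is only a sketch: TokenKeeper is a hypothetical helper, and the authenticate callable stands in for whatever site-specific step actually obtains a fresh token from the stashed password. The keeper renews shortly before the assumed token lifetime runs out:

```python
import time


class TokenKeeper:
    """Re-run an authentication step before the current token expires."""

    def __init__(self, authenticate, lifetime, margin, clock=time.monotonic):
        self.authenticate = authenticate  # hypothetical: obtains a fresh token
        self.lifetime = lifetime          # assumed token lifetime in seconds
        self.margin = margin              # renew this many seconds early
        self.clock = clock
        self.expires_at = None

    def ensure_fresh(self):
        """Re-authenticate if no token is held or it is about to expire.

        Returns True when a renewal happened, False otherwise.
        """
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at - self.margin:
            self.authenticate()
            self.expires_at = now + self.lifetime
            return True
        return False
```

A daemon would call ensure_fresh() before each batch of file operations, or from a timer, so that it never works with an expired token.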
AFS does allow you to have ACLs based on IP
address. As such, if you are running a daemon on a machine that only system
administrators have access to, it may not be a big deal to allow everyone on
that machine to write to a directory. Other machines, though, may have read-only
or no access to the directory.
NFS 4 will have the same problem, as a requirement
for it is that Kerberos V is supported as an authentication mechanism. If you
don't give world write to a file/directory, then you cannot write to it without
a Kerberos V ticket.
Too little, too late?
(Score:4, Insightful)
As someone who has worked with AFS for the past
8 years, I have to say that I greet this announcement with a somewhat more pessimistic
vie