Softpanorama (slightly skeptical) Open Source Software Educational Society
May the source be with you, but remember the KISS principle ;-)
Filesystems
Science is facts; just as houses are made
of stones, so is science made of facts; but a pile of stones
is not a house and a collection of facts is not necessarily
science.
-- Henri Poincaré
Filesystems are a very interesting area, one of the few areas in Unix
where new algorithms can still make a huge difference in performance.
Often the historical view of filesystems is a bit too Unix-centric and
states that the Berkeley Fast File System is the ancestor of most modern
file systems. This view ignores competing and earlier implementations
from IBM (HPFS), DEC (VAX VMS), Microsoft (NTFS) and others.
Still, the Unix filesystem became a classic, and the concepts introduced in it
dominate all modern filesystems. It also introduced many interesting features
and algorithms into the area. For example, the very interesting concept of
extended attributes introduced in the 4.4 BSD filesystem has recently been
added to Ext2fs:
- Immutable files can only be read: nobody can write or delete them.
  This can be used to protect sensitive configuration files.
- Append-only files can be opened in write mode, but data is always
  appended at the end of the file. Like immutable files, they cannot be deleted
  or renamed. This is especially useful for log files, which can only grow.
All in all, the following attributes are available in ext2fs:
- A (no Access time):
  if a file or directory has this attribute set, whenever it is accessed,
  either for reading or for writing, its last access time will not be
  updated. This can be useful, for example, on files or directories which
  are very often accessed for reading, especially since this parameter
  is the only one which changes on an inode when it's open read-only.
- a (append only):
  if a file has this attribute set and is open for writing, the only operation
  possible will be to append data to its previous contents. For a directory,
  this means that you can only add files to it, but not rename or delete
  any existing file. Only root can set or clear this attribute.
- d (no dump):
  dump (8) is the standard UNIX utility for backups. It dumps any filesystem
  for which the dump counter is 1 in /etc/fstab
  (see chapter "Filesystems and Mount Points"). But if a file or directory
  has this attribute set, unlike others, it will not be taken into account
  when a dump is in progress. Note that for directories, this also includes
  all subdirectories and files under it.
- i (immutable):
  a file or directory with this attribute set simply cannot be modified
  at all: it cannot be renamed, no further link can be created to it
  and it cannot be removed. Only root can set or clear this attribute.
  Note that this also prevents changes to access time, therefore you do
  not need to set the A attribute when i is set.
- s (secure deletion):
  when a file or directory with this attribute set is deleted, the
  blocks it was occupying on disk are overwritten with zeroes.
- S (Synchronous mode):
  when a file or directory has this attribute set, all modifications on
  it are synchronous and written back to disk immediately.
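On Linux these attributes are normally set with chattr (for example `chattr +i file`) and listed with lsattr. As a small illustration, the sketch below decodes one line of lsattr output into the attribute letters described above. The sample line, the file name and the helper names (`parse_lsattr_line`, `describe`) are assumptions made for this example, and only the six letters from the list are decoded.

```python
# Descriptions follow the attribute list above; only these six letters are covered.
ATTRIBUTES = {
    "A": "no access-time updates",
    "a": "append only",
    "d": "skipped by dump",
    "i": "immutable",
    "s": "secure deletion",
    "S": "synchronous writes",
}


def parse_lsattr_line(line):
    """Split one line of lsattr output into (set of flag letters, path).

    lsattr prints a field of letters and dashes, a space, then the file name.
    """
    flags_field, _, path = line.strip().partition(" ")
    flags = {ch for ch in flags_field if ch != "-"}
    return flags, path


def describe(flags):
    """Return descriptions for the flags documented above, sorted by letter."""
    return [ATTRIBUTES[f] for f in sorted(flags) if f in ATTRIBUTES]


if __name__ == "__main__":
    # Hypothetical sample line; real output has more columns of dashes.
    flags, path = parse_lsattr_line("----ia------- /var/log/example.log")
    print(path, describe(flags))
```

Letters that are not in the list above (newer filesystems report several more) are kept in the returned set but are simply not described.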
The Unix filesystem is a classic, but a classic has its own problems: it is
actually an old and largely outdated filesystem that has outlived its usefulness.
Later ideas implemented in HPFS, BFS and several other more modern filesystems
are absent in plain-vanilla implementations of Unix file systems. Balanced
trees now serve as the basis of most modern filesystems, including ReiserFS (which
started as an NTFS clone but acquired some unique features in the process of
development):
The Reiser Filesystem by Hans Reiser [and Moscow University researchers] is
a very ambitious project to not only improve performance and add journaling,
but to redefine the filesystem as a storage repository for arbitrarily
complex objects.
Reiserfs can be faster than ext2/3 on directory-heavy workloads because it uses
balanced trees for its directory structures. It was used by Suse and Gentoo.
Unfortunately, the novel feature introduced in HPFS called extended
attributes never got traction in other filesystems. Of
course, the fundamental decision to make attributes indexable deserves closer
examination, given the costs of indexing, but a fixed set of attributes
(as in UFS) created too many problems for this issue to be ignored. I still think
that extended attributes should be present in a filesystem, and that they can
replace such kludges as the #! notation in UNIX for specifying the default
interpreter of an executable file.
Interesting, albeit outdated discussion
It also doesn't tell how messy Solaris's VFS is ... and nothing about
Sun's marketing bullshit like ZFS.
www.softpanorama.org/Articles/Linux_vs_Solaris/comparison_of_internal_architecture.shtml
But it is a useful guide for those of us who recognize that there
are applications where Solaris is a better fit and there are other
applications where a GNU/Linux dist is a better fit. I haven't met
anyone who has used ZFS, understood it and still shares your opinion.
In my opinion, the only thing Sun's marketing might be guilty of
is focusing on catchy and confusing names for technology and not
enough on explaining how unique and useful the technology is. As
I write this, I'm transferring my DV video and photos from an HFS+
volume to a ZFS pool. I'm looking forward to the day when Apple,
Linux and Microsoft have a filesystem which can raid, resilver and
compress as easily as ZFS and which can validate that what I read
from disk is what I wrote. Regardless of what name marketing comes
up with for these features, I doubt I'm the only one who finds them useful.
Jeff Layton is an Enterprise Technologist for HPC at Dell.
Nine-Year Review of Benchmarks
Recently there was a paper
published by Avishay Traeger and Erez Zadok from Stony Brook University
and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research
Center entitled, “A Nine Year Study of File System and Storage Benchmarking”
(Note: a summary of the paper can be found at
this link). The paper examines 415 file system and storage benchmarks
from 106 recent papers. Based on this examination the paper makes some
very interesting observations and conclusions that are, in many ways,
very critical of the way “research” papers have been written about storage
and file systems. These results are important to good benchmarking.
And, stepping back from that, they make recommendations on how to perform
good benchmarks (or at the very minimum, “better” benchmarks).
The research included papers from the Symposium on Operating Systems
Principles (SOSP), the Symposium on Operating Systems Design and Implementation
(OSDI), the USENIX Conference on File and Storage Technologies (FAST),
and the USENIX Annual Technical Conference (USENIX). The conferences
range from 1999 through 2007. The criteria for the selection of papers
were fairly involved but focused on papers of good quality that covered
benchmarks focusing on performance, not on correctness or capacity. Of
the 106 papers surveyed, the researchers included 8 of their own.
When selecting the papers, they used two underlying themes or guidelines
for evaluation:
- Looking to see if the authors explained exactly what was done,
  providing details on the benchmarking process.
- Finding out if the authors didn’t just explain what was done,
  but also justified why it was done in that particular fashion. For example,
  explaining why comparing file systems is fair or why a particular
  benchmark was run.
Breaking Down Good Benchmarks
Repetition: One of the simplest things that can be
done for a benchmark is to run the benchmark a number of times and report
the median or average. In addition, it would be extremely easy (and
helpful) to report some measure of the spread of the data such as a
standard deviation. This allows the reader to get an idea of what kind
of variation they could see if they tried to reproduce the results and
it also allows readers to understand the overall performance over a
period of time.
The paper examined the 106 benchmark papers for the number of times
the benchmark was run. The table below is from the review paper for
all 388 benchmarks examined and is broken down by conference. Since
most of the time the data was unclear, it was assumed that each benchmark
was run only once.
Table 1 - Statistics of Number of Runs by Conference
| Conference | Mean | Standard Deviation | Median |
| SOSP | 2.1 | 2.4 | 1 |
| FAST | 3.6 | 3.6 | 1 |
| OSDI | 3.8 | 4.3 | 2 |
| USENIX | 4.7 | 6.2 | 3 |
It is fairly obvious that the dispersion in the data is quite large.
For every conference in the table the standard deviation is as large as
or larger than the mean value.
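The reporting recommended under Repetition (several runs, a central value, and a measure of spread) takes only a few lines. The following is a minimal sketch using Python's standard statistics module; the function name `summarize_runs` and the sample timings are made up for illustration and do not come from the paper.

```python
import statistics


def summarize_runs(times):
    """Return run count, mean, median and sample standard deviation."""
    if len(times) < 2:
        # A single run gives no information about the spread.
        raise ValueError("need at least two runs to estimate the spread")
    return {
        "runs": len(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),
    }


if __name__ == "__main__":
    elapsed = [41.2, 39.8, 40.5, 44.1, 40.0]  # hypothetical seconds per run
    summary = summarize_runs(elapsed)
    print("runs={runs} mean={mean:.2f}s median={median:.2f}s "
          "stdev={stdev:.2f}s".format(**summary))
```

Reporting the number of runs alongside the mean and standard deviation lets a reader judge how much variation to expect when reproducing the result.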
Runtime: The next topic examined is the runtime of
the benchmark. Of the 388 benchmarks examined, only 198 (51%) specified
the elapsed time of the benchmark. From this data, it was found:
- 28.6% of the benchmarks ran for less than one minute
- 58.3% ran for less than 5 minutes
- 70.9% ran for less than 10 minutes
Typically, run times that are short (less than one minute) are too
brief to reach any sort of steady-state value.
With 49% of the benchmarks having no known runtime and another 28.6%
running for less than a minute, easily three-quarters of these results
should cause some of your warning bells to start ringing. If there’s
no data, it’s not a benchmark; it’s an advertisement.
Linux Magazine
In a previous article,
the case was made for how low file system benchmarks have fallen. Benchmarks
have become the tool of marketing to the point where they are mere numbers
and do not prove of much use. The article reviewed a paper that examined
nine years of storage and file system benchmarking and made some excellent
observations. The paper also made some recommendations about how to
improve benchmarks.
This article isn’t so much about benchmarks as a product, but rather
it is an exploration looking for interesting observations or trends
or the lack thereof. In particular this article examines the metadata
performance of several Linux file systems using a specific micro-benchmark.
Fundamentally this article is really an exploration to understand if
there are any metadata performance differences between four Linux file systems
(ext3, ext4, btrfs, and nilfs) using a metadata benchmark called fdtree.
So now it’s time to eat our own dog food and do benchmarking with the recommendations
previously mentioned.
Start at the Beginning - Why?
The previous article made several observations about benchmarking, one of which
is that storage and file system benchmarks seldom, if ever, explain
why they are performing a benchmark. This is a point that is not to
be underestimated. Specifically, if the reason why the benchmark was
performed can not be adequately explained, then the benchmark itself
becomes suspect (it may just be pure marketing material).
Given this point, the reason the benchmark in this article is being
performed is to examine or explore if, and possibly how much, difference
there is between the metadata performance of four Linux file systems
using a single metadata benchmark. The search is not to find which file
system is the best because it is a single benchmark, fdtree.
Rather it is to search for differences and contrast the metadata performance
of the file systems.
Why is examining the metadata performance a worthwhile exploration?
Glad that you asked. There are a number of applications, workloads,
and classes of applications that are metadata intensive. Mail servers
can be very metadata intensive applications because of the need to read
and write very small files. Sometimes databases have workloads that
do a great deal of reading and writing small files. In the world of
technical computing, many bioinformatics applications, such as gene sequencing
applications, do a great deal of small reads and writes.
The metadata benchmark used in this article is called fdtree.
It is a simple bash script that stresses the metadata aspects of the
file system using standard *nix commands. While it is not the most well
known benchmark in the storage and file system world, it is a bit better
known in the HPC (High Performance Computing) world.
More performance: We add five file systems to our previous benchmark
results to create an “uber” article on metadata file system performance.
We follow the “good” benchmarking guidelines presented in a previous
article and examine the good, the bad and the interesting.
Nice intro...
by Sheryl Calish
What is a "filesystem," anyway? Sheryl
Calish explains the concept as well as its practical application
Published November 2004
Although the kernel is the heart of Linux, files
are the main vehicles through which users interact with the operating
system. This is especially true of Linux, because in the UNIX tradition,
it uses the file I/O mechanism to manage hardware devices as well as
data files.
Unfortunately, the terminology used to discuss
Linux filesystem concepts is a bit confusing for newcomers. The terms
filesystem and file system
are used interchangeably in the Linux documentation to refer to several
different but related concepts. They refer to the data structures as
well as the methods that manage the files within the partitions, in
addition to specific instances of a disk partition.
To further confuse the uninitiated, these
terms are also used to refer to the overall organization of files in
a system: the directory tree. Then again, they can refer to each of
the subdirectories within the directory tree, as in the /home filesystem.
Some hold that these directories and subdirectories cannot truly be
called a filesystem unless they each reside on their own disk partition.
Nevertheless, others do refer to them as filesystems, contributing to
the confusion.
Linux veterans understand, from context, the
sense in which these terms are used. Newcomers, however, have an understandably
harder time discerning the context.
The overriding objective of this article is
to provide enough background to help you discern the context of this
terminology for yourself. In the process of untangling the subtleties
of the filesystem terminology, however, you will also acquire the knowledge
to move beyond the theoretical to the practical application of some
very useful related tools.
The article focuses on the Linux disk partitions
and file management system features in version 2.4 of the Linux kernel.
It also reviews new features available in version 2.6 of the kernel.
About: GNU ddrescue is a data recovery tool. It copies data
from one file or block device (hard disc, cdrom, etc) to another, trying
hard to rescue data in case of read errors. GNU ddrescue does not truncate
the output file if not asked to. So, every time you run it on the same
output file, it tries to fill in the gaps. The basic operation of GNU
ddrescue is fully automatic. That is, you don't have to wait for an
error, stop the program, read the log, run it in reverse mode, etc.
If you use the logfile feature of GNU ddrescue, the data is rescued
very efficiently (only the needed blocks are read). Also you can interrupt
the rescue at any time and resume it later at the same point.
Changes: The new option "--domain-logfile" has been added.
This release is also available in lzip format. To download the lzip
version, just replace ".bz2" with ".lz" in the tar.bz2 package name.
The Small Computer Systems Interface (SCSI) is a collection of standards
that define the interface and protocols for communicating with a large
number of devices (predominantly storage related). Linux® provides a
SCSI subsystem to permit communication with these devices. Linux is
a great example of a layered architecture that joins high-level drivers,
such as disk or CD-ROM drivers, to a physical interface such as Fibre
Channel or Serial Attached SCSI (SAS). This article introduces you to
the Linux SCSI subsystem and discusses where this subsystem is going
in the future.
When it comes to file systems, Linux® is the Swiss Army knife of
operating systems. Linux supports a large number of file systems, from
journaling to clustering to cryptographic. Linux is a wonderful platform
for using standard and more exotic file systems and also for developing
file systems. This article explores the virtual file system (VFS)—sometimes
called the virtual filesystem switch—in the Linux kernel and then reviews
some of the major structures that tie file systems together.
data=writeback: While the writeback
option provides lower data consistency guarantees than the journal or
ordered modes, some applications show very
significant speed improvement when it is used. For example,
speed improvements can be seen when heavy synchronous writes are performed,
or when applications create and delete large volumes of small files,
such as delivering a large flow of short email messages. The results
of the testing effort described in Chapter 3 illustrate this topic.
When the writeback option is used, data consistency is similar to
that provided by the ext2 file system. However, file system integrity
is maintained continuously during normal operation in the ext3 file
system.
In the event of a power failure or system crash, the file system
may not be recoverable if a significant portion of data was held only
in system memory and not on permanent storage. In this case, the filesystem
must be recreated from backups. Often, changes made since the file system
was last backed up are inevitably lost.
Submitted by
Jeremy on August 7, 2007 - 9:26am.
In a recent lkml thread, Linus Torvalds was involved
in a discussion about mounting filesystems with
the noatime option for better performance,
"'noatime,data=writeback'
will quite likely be *quite* noticeable (with different
effects for different loads), but almost nobody
actually runs that way."
He noted that he set O_NOATIME when writing git,
"and it was an absolutely
huge time-saver for the case of not having 'noatime'
in the mount options. Certainly more than your
estimated 10% under some loads."
The discussion then looked at using the
relatime mount option to improve the
situation, "relative atime only updates the atime
if the previous atime is older than the mtime or
ctime. Like noatime, but useful for applications
like mutt that need to know when a file has been
read since it was last modified."
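The relatime rule quoted above fits in a single comparison. The sketch below only restates that quoted rule; the function name and the integer timestamps are hypothetical, and the kernel's actual logic may apply further conditions that are not described here.

```python
def relatime_should_update(atime, mtime, ctime):
    """Apply the quoted rule: update atime only if the previous atime
    is older than the mtime or the ctime.

    All three arguments are timestamps in the same unit (e.g. seconds).
    """
    return atime < mtime or atime < ctime


if __name__ == "__main__":
    # File modified (mtime=200) after it was last read (atime=100):
    # the next read should refresh atime.
    print(relatime_should_update(atime=100, mtime=200, ctime=200))
    # File read (atime=300) after its last change: atime is left alone,
    # so repeated reads cause no further writes.
    print(relatime_should_update(atime=300, mtime=200, ctime=200))
```

This is why a mail reader such as mutt keeps working: it can still tell whether a file was read since it was last modified, while repeated reads of an unchanged file stop generating atime writes.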
Ingo Molnar stressed the significance of fixing
this performance issue, "I cannot over-emphasize
how much of a deal it is in practice.
Atime updates are by far
the biggest IO performance deficiency that Linux
has today. Getting rid of atime updates would give
us more everyday Linux performance than all the
pagecache speedups of the past 10 years, _combined_."
He submitted some patches to improve
relatime, and noted about atime:
"It's also perhaps
the most stupid Unix design idea of all times.
Unix is really nice and well done, but think
about this a bit: 'For every file that is read
from the disk, lets do a ... write to the disk!
And, for every file that is already cached and
which we read from the cache ... do a write
to the disk!'"
This series was originally called "Advanced Filesystem Implementor's
Guide" and was published on IBM developerWorks.
This document describes a methodology for configuring a fast file
system that handles many small files on the Solaris Operating System.
This could be used for building a Java technology-based product or for
handling many operations on a large number of small files. This
methodology utilizes a tmpfs volume, and it can speed up operations
approximately three times.
The requirements are as follows:
- Solaris 7 OS through Solaris 10 OS Update 1
- Some experience with Solaris system administration. This procedure
  is not recommended for UNIX users who are uncomfortable with using
  mount, maintaining /etc/vfstab, or modifying their kernel parameters.
Warning: Do not develop on a tmpfs volume. A tmpfs volume
is only persistent while the system is powered up, so a power loss or
system problem will cause you to lose any changes to that volume.
Procedure
Solaris tmpfs volumes are easy to create, but require a significant
amount of RAM and swap space. It is recommended that you have at least
1 Gbyte of RAM, but there have also been major performance gains on
systems with 512 Mbytes of RAM. In addition, you should add twice as
much swap space as the tmpfs volume you are creating. That is, for a
2-Gbyte tmpfs volume, add 4 Gbytes of swap space to the system. Feel
free to experiment with these values.
The following examples are for a 2-Gbyte tmpfs volume, which is approximately
what is needed to do a developer build. Replace <swapfilename> with the
absolute path to a swapfile (such as /disk1/swapfile), and <mountpoint> with
the absolute path to where you want the tmpfs volume mounted (such as /ramdisk).
Add swap space to your workstation:
root# /usr/sbin/mkfile 2000m <swapfilename>
Create a mount point for the tmpfs volume:
root# mkdir <mountpoint>
Edit your /etc/vfstab file to use the
swap and create the tmpfs volume at boot time. Add the following two
lines:
<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m
Note that on the Solaris 7 OS you may not make a single tmpfs volume
larger than 2 Gbytes.
Edit your kernel parameters to increase the number of files you can
create in the tmpfs volume. Add the following line to your /etc/system file.
(We've had the most success using this value.)
set tmpfs:tmpfs_maxkmem=250000000
Reboot your workstation. Then verify that the tmpfs volume exists
at the size you specified:
% df -k <mountpoint>
Make the tmpfs volume writable. Note: This step is necessary
after each reboot of the workstation.
root# chmod 777 <mountpoint>
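When the procedure is repeated for other sizes or paths, the two /etc/vfstab lines and the swap rule of thumb can be generated rather than typed by hand. This is only a sketch: the helper names are invented for this example, the line layout is copied from the procedure above, and nothing here edits any system file.

```python
def recommended_swap_mb(tmpfs_mb):
    """Rule of thumb from the text: twice as much swap as tmpfs."""
    return 2 * tmpfs_mb


def vfstab_lines(swapfile, mountpoint, size_mb):
    """Return the two /etc/vfstab lines used in the procedure above."""
    return [
        f"{swapfile} - - swap - no -",
        f"RAMDISK - {mountpoint} tmpfs - yes size={size_mb}m",
    ]


if __name__ == "__main__":
    # Example values taken from the text: /disk1/swapfile and /ramdisk.
    for line in vfstab_lines("/disk1/swapfile", "/ramdisk", 2000):
        print(line)
    print("suggested swap (MB):", recommended_swap_mb(2000))
```

The output is meant to be reviewed and pasted into /etc/vfstab by an administrator, not written there automatically.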
There are a lot of Linux filesystem comparisons available but most
of them are anecdotal, based on artificial tasks or completed under
older kernels. This benchmark essay is based on 11 real-world tasks
appropriate for a file server with older generation hardware (Pentium
II/III, EIDE hard-drive).
Since its initial publication, this article has generated
a lot of questions, comments and suggestions to improve it.
Consequently, I'm currently working hard on a new batch of tests
to answer as many questions as possible (within the original scope
of the article).
Results will be available in about two weeks (May 8, 2006)
Many thanks for your interest and keep in touch with
Debian-Administration.org!
Hans
Why another benchmark test?
I found two quantitative and reproducible benchmark testing studies
using the 2.6.x kernel (see References). Benoit (2003) implemented 12
tests using large files (1+ GB) on a Pentium II 500 server with 512MB
RAM. This test was quite informative, but its results are beginning to age
(kernel 2.6.0) and mostly apply to settings which manipulate exclusively
large files (e.g., multimedia, scientific, databases).
Piszcz (2006) implemented 21 tasks simulating a variety of file operations
on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard disk. To date,
this testing appears to be the most comprehensive work on the 2.6 kernel.
However, since many tasks were "artificial" (e.g., copying and removing
10 000 empty directories, touching 10 000 files, splitting files recursively),
it may be difficult to transfer some conclusions to real-world settings.
Thus, the objective of the present benchmark testing is to complement
some of Piszcz's (2006) conclusions by focusing exclusively on real-world
operations found in small-business file servers (see Tasks description).
Test settings
Hardware
- Processor : Intel Celeron 533
- RAM : 512MB RAM PC100
- Motherboard : ASUS P2B
- Hard drive : WD Caviar SE 160GB (EIDE 100, 7200 RPM, 8MB Cache)
- Controller : ATA/133 PCI (Silicon Image)
OS
- Debian Etch (kernel 2.6.15), distribution upgraded on April
18, 2006
- All optional daemons killed (cron, ssh, Samba, etc.)
Filesystems
- Ext3 (e2fsprogs 1.38)
- ReiserFS (reiserfsprogs 1.3.6.19)
- JFS (jfsutils 1.1.8)
- XFS (xfsprogs 2.7.14)
Description of selected tasks
Operations on a large file (ISO image, 700MB)
- Copy ISO from a second disk to the test disk
- Recopy ISO in another location on the test disk
- Remove both copies of ISO
Operations on a file tree (7500 files, 900 directories, 1.9GB)
- Copy file tree from a second disk to the test disk
- Recopy file tree in another location on the test disk
- Remove both copies of file tree
Operations into the file tree
- List recursively all contents of the file tree and save it on
the test disk
- Find files matching a specific wildcard into the file tree
Operations on the file system
- Creation of the filesystem (mkfs) (all FS were created with
default values)
- Mount filesystem
- Umount filesystem
The sequence of 11 tasks (from creation of FS to umounting FS) was
run as a Bash script which was completed three times (the average is
reported). Each sequence takes about 7 min. Time to complete task (in
secs), percentage of CPU dedicated to task and number of major/minor
page faults during task were computed by the GNU time utility (version
1.7).
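The author's Bash script and the GNU time invocation are not shown, so the following is only a rough stand-in: it runs a command several times and averages elapsed time and page faults, using Python's resource module (Unix only) instead of GNU time. The function name `time_command` and the placeholder command are assumptions for this example.

```python
import resource
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times; return mean elapsed seconds and
    mean minor/major page faults of the child processes."""
    elapsed, minor, major = [], [], []
    for _ in range(repeats):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        elapsed.append(time.perf_counter() - start)
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        # RUSAGE_CHILDREN is cumulative, so take the difference per run.
        minor.append(after.ru_minflt - before.ru_minflt)
        major.append(after.ru_majflt - before.ru_majflt)
    n = float(repeats)
    return {
        "elapsed_s": sum(elapsed) / n,
        "minor_faults": sum(minor) / n,
        "major_faults": sum(major) / n,
    }


if __name__ == "__main__":
    # Placeholder task; a real test would copy a file tree, run find, etc.
    print(time_command([sys.executable, "-c", "pass"], repeats=3))
```

For disk benchmarks the page cache matters as much as the timer: results from repeated runs are only comparable if each run starts from the same cache state, for example by recreating and remounting the filesystem as the task sequence above does.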
RESULTS
Partition capacity
Initial (after filesystem creation) and residual (after removal of
all files) partition capacity was computed as the ratio of the number of
available blocks to the number of blocks on the partition. Ext3 has the
worst initial capacity (92.77%), while the other FS preserve almost
full partition capacity (ReiserFS = 99.83%, JFS = 99.82%, XFS = 99.95%).
Interestingly, the residual capacity of Ext3 and ReiserFS was
identical to the initial, while JFS and XFS lost about 0.02% of their
partition capacity, suggesting that these FS can dynamically grow but
do not completely return to their initial state (and size) after file
removal.
Conclusion : To use the maximum of your partition capacity, choose
ReiserFS, JFS or XFS.
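The capacity figure described above is a simple ratio, which can be sketched as follows. The article does not say which tool produced its numbers, so reading the block counts with os.statvfs, and using f_bavail (blocks available to unprivileged users) rather than f_bfree, are assumptions of this example.

```python
import os


def capacity_percent(available_blocks, total_blocks):
    """Available blocks as a percentage of all blocks on the partition."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return 100.0 * available_blocks / total_blocks


def partition_capacity(path):
    """Capacity of the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    # Assumption: count blocks available to unprivileged users.
    return capacity_percent(st.f_bavail, st.f_blocks)


if __name__ == "__main__":
    print(f"{partition_capacity('/'):.2f}% of blocks available on /")
```

Calling `partition_capacity` right after mkfs and again after removing all test files would give the initial and residual values compared in this section.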
File system creation, mounting and unmounting
The creation of FS on the 20GB test partition took 14.7 secs for
Ext3, compared to 2 secs or less for other FS (ReiserFS = 2.2, JFS =
1.3, XFS = 0.7). However, the ReiserFS took 5 to 15 times longer to
mount the FS (2.3 secs) when compared to other FS (Ext3 = 0.2,
JFS = 0.2, XFS = 0.5), and also 2 times longer to umount the
FS (0.4 sec). All FS took comparable amounts of CPU to create FS (between
59% - ReiserFS and 74% - JFS) and to mount FS (between 6 and 9%). However,
Ext3 and XFS took about 2 times more CPU to umount (37% and 45%), compared
to ReiserFS and JFS (14% and 27%).
Conclusion : For quick FS creation and mounting/unmounting, choose
JFS or XFS.
Operations on a large file (ISO image, 700MB)
The initial copy of the large file took longer on Ext3 (38.2 secs)
and ReiserFS (41.8) when compared to JFS and XFS (35.1 and 34.8). The
recopy on the same disk advantaged the XFS (33.1 secs), when compared
to other FS (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9). The ISO removal
was about 100 times faster on JFS and XFS (0.02 sec for both), compared
to 1.5 sec for ReiserFS and 2.5 sec for Ext3! All FS took comparable
amounts of CPU to copy (between 46 and 51%) and to recopy ISO (between
38% to 50%). The ReiserFS used 49% of CPU to remove ISO, when other
FS used about 10%. There was a clear trend of JFS to use less CPU than
any other FS (about 5 to 10% less). The number of minor page faults
was quite similar between FS (ranging from 600 - XFS to 661 - ReiserFS).
Conclusion : For quick operations on large files, choose JFS
or XFS. If you need to minimize CPU usage, prefer JFS.
Operations on a file tree (7500 files, 900 directories, 1.9GB)
The initial copy of the tree was quicker for Ext3 (158.3 secs) and
XFS (166.1) when compared to ReiserFS and JFS (172.1 and 180.1). Similar
results were observed during the recopy on the same disk, which advantaged
the Ext3 (120 secs) compared to other FS (XFS = 135.2, ReiserFS = 136.9
and JFS = 151). However, the tree removal was about 2 times longer for
Ext3 (22 secs) when compared to ReiserFS (8.2 secs), XFS (10.5 secs)
and JFS (12.5 secs)! All FS took comparable amounts of CPU to copy (between
27 and 36%) and to recopy the file tree (between 29% - JFS and 45% -
ReiserFS). Surprisingly, the ReiserFS and the XFS used significantly
more CPU to remove file tree (86% and 65%) when other FS used about
15% (Ext3 and JFS). Again, there was a clear trend of JFS to use less
CPU than any other FS. The number of minor page faults was significantly
higher for ReiserFS (total = 5843) when compared to other FS (1400 to
1490). This difference appears to come from a higher rate (5 to 20 times)
of page faults for ReiserFS in recopy and removal of file tree.
Conclusion : For quick operations on a large file tree, choose
Ext3 or XFS. Benchmarks from other authors have supported the use of
ReiserFS for operations on large numbers of small files. However, the
present results on a tree comprising thousands of files of various sizes
(10KB to 5MB) suggest that Ext3 or XFS may be more appropriate for real-world
file server operations. Even if JFS minimizes CPU usage, it should be
noted that this FS comes with significantly higher latency for large
file tree operations.
Directory listing and file search into the previous file tree
The complete (recursive) directory listing of the tree was quicker
for ReiserFS (1.4 secs) and XFS (1.8) when compared to Ext3 and JFS
(2.5 and 3.1). Similar results were observed during the file search,
where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results compared
to Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took comparable amounts
of CPU for directory listing (35%) and file search (6%). XFS took more
CPU for directory listing (70%) but comparable amount for file search
(10%). ReiserFS appears to be the most CPU-intensive FS, with 71% for
directory listing and 36% for file search. Again, the number of minor
page faults was 3 times higher for ReiserFS (total = 1991) when compared
to other FS (704 to 712).
Conclusion : Results suggest that, for these tasks, filesystems
can be grouped as (a) quick and more CPU-intensive (ReiserFS and XFS)
or (b) slower but less CPU-intensive (ext3 and JFS). XFS appears as
a good compromise, with relatively quick results, moderate usage of
CPU and acceptable rate of page faults.
OVERALL CONCLUSION
These results replicate previous observations from Piszcz (2006)
about the reduced disk capacity of Ext3, the longer mount time of ReiserFS and
the longer FS creation time of Ext3. Moreover, both earlier reviews,
like this report, observed that JFS is the
lowest CPU-usage FS. Finally, this report appears to
be the first to show the high page-fault
activity of ReiserFS on most usual file operations.
While recognizing the relative merits of each filesystem, only one
filesystem can be installed on each partition/disk. Based on all testing
done for this benchmark essay, XFS appears to be the most appropriate
filesystem to install on a file server for home or small-business
needs :
- It uses the maximum capacity of your server hard disk(s)
- It is the quickest FS to create, mount and unmount
- It is the quickest FS for operations on large files (>500MB)
- This FS gets a good second place for operations on a large number
of small to moderate-size files and directories
- It constitutes a good CPU vs time compromise for large directory
listing or file search
- It is not the least CPU-demanding FS but its use of system resources
  is quite acceptable for older generation hardware
While Piszcz (2006) did not explicitly recommend XFS, he concludes
that "Personally, I still choose XFS for filesystem performance and
scalability". I can only support this conclusion.
References
Benoit, M. (2003). Linux File System Benchmarks.
Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).
2002-09-20 (Linux Journal)
We look at three different tactics for optimizing
read and write performance under Linux.
A few years ago I was tasked
with making the Spec96 benchmark suite produce the fastest
numbers possible using the Solaris Intel operating system
and Compaq Proliant servers. We were given all the resources
that Sun Microsystems and Compaq Computer Corporation could
muster to help take both companies to the next level in
Unix computing on the Intel architecture. Sun had just announced
its flagship operating system on the Intel platform and
Compaq was in a heated race with Dell for the best departmental
servers. Unixware and SCO were the primary challengers since
Windows NT 3.5 was not very stable at the time and no one
had ever heard of an upstart graduate student from overseas
who thought that he could build a kernel that rivaled those
of multi-billion dollar corporations.
Now, many years later, Linux
has gained considerable market share and is the de facto
Unix for all the major hardware manufacturers on the Intel
architecture. In this article, I will attempt to take the
lessons learned from this tuning exercise and show how they
can be applied to the Linux operating system.
As it turned out, the gcc
benchmark was the one that everyone seemed to be improving
on the most. As we analyzed what the benchmark was doing,
we found out that basically it opened a file, read its contents,
created a new file, wrote new contents, then closed both
files. It did this over and over and over. File operations
proved to be the bottleneck in performance. We tried faster
processors with insignificant improvement. We tried processors
with huge (at the time) level 1 and level 2 cache and still
found no significant improvement. We tried using a gigabyte
of memory and found little or no improvement. By using the
vmstat command, we found that the processor was relatively
idle, little memory was being used, but we were getting
a significant amount of reads and writes to the root disk.
Using the same hardware and same test programs, Unixware
was 25% faster than Solaris Intel. Initially, we decided
that Solaris was just really slow. Unfortunately, I was
working for Sun at the time and this was not the answer
that we could take to my management. We had to figure out
why it was slow and make recommendations on how to improve
the performance. The target was 25% faster than Unixware,
not slower.
The first thing that we did
was to look at the configurations. It turns out that the
two systems were identical hardware; we just booted a different
disk to run the other operating system. The Unixware system
was configured with /tmp as a tmpfs whereas the Solaris
system had /tmp on the root file system. We changed the
Solaris configuration to use tmpfs but it did not significantly
improve performance. Later, we found that this was due to
a bug in the tmpfs implementation on Solaris Intel. By breaking
down the file operation, we decided to focus on three areas:
the libc interface, the node/dentry layer, and the device
drivers managing the disk. In this article, we will look
at the three different layers and talk about how to improve
performance and how they specifically apply to Linux.
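The access pattern described for the gcc benchmark (open a file, read it, create a new file, write it, close both, repeat) can be approximated with a small script. This is only a minimal sketch, not the Spec96 code: the function name churn, the file names and the sizes are made up for illustration. Pointing the scratch directory once at a disk-backed file system and once at tmpfs gives a rough feel for how much time goes into file operations.

```python
import os
import tempfile
import time


def churn(workdir, iterations=200, size=64 * 1024):
    """Repeat the open/read/create/write/close cycle; return elapsed seconds."""
    src = os.path.join(workdir, "input.dat")
    with open(src, "wb") as f:
        f.write(os.urandom(size))

    start = time.perf_counter()
    for i in range(iterations):
        dst = os.path.join(workdir, "output_%d.dat" % i)
        # Open the source, read it, create a new file, write it, close both.
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            fout.write(fin.read())
        os.remove(dst)  # keep the scratch directory small
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("elapsed: %.3f s" % churn(scratch))
```

Watching vmstat while this runs should show the same picture the author describes: little CPU use and a steady stream of reads and writes, at least when the page cache does not absorb everything.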
This paper describes a utility
named ruf
that reads files from an unmounted file system. The files are accessed
by reading disk structures directly so the program is peculiar to the
specific file system employed. The current implementation supports the
*BSD FFS, SunOS/Solaris UFS, HP-UX HFS, and Linux ext2fs file systems.
All these file systems derive from the original FFS, but have peculiar
differences in their specific implementations.
The utility can read files from a damaged
file system. Since the utility attempts to read only those structures
it requires, damaged areas of the disk can be avoided. Files can be
accessed by their inode number alone, bypassing damage to structures
above it in the directory hierarchy.
The functions of the utility are available
in a library named libruf.
The utility and library are available under the BSD license.
Introduction
There are many important reasons
for being able to access unmounted file systems, the prime example being
a damaged disk. This paper describes a utility that can be used to read
a disk file without mounting the file system. The utility behaves similarly
to the regular cat utility, and was originally named dog,
but was renamed to ruf (for reading unmounted filesystems) to avoid a name
conflict with an older utility.
In order to access an unmounted file
system, the utility must read the disk structures directly and perform
all the tasks normally performed by the operating system; this requires
a detailed understanding of how the file system is implemented. Implementing
this utility for a particular file system is an interesting academic
exercise and a good way to learn about the file system. The original
work on this utility was in fact done in Evi Nemeth's system administration
class.
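To give a flavour of what "reading the disk structures directly" involves, here is a minimal sketch for ext2 only. It is not ruf or libruf code, and the function name read_ext2_superblock is hypothetical. It relies on the published ext2 layout: the superblock starts 1024 bytes into the volume, its fields are little-endian, and the magic number 0xEF53 sits at offset 56 within it.

```python
import struct

EXT2_SUPERBLOCK_OFFSET = 1024  # the superblock starts 1024 bytes into the volume
EXT2_MAGIC = 0xEF53


def read_ext2_superblock(image):
    """Parse a few superblock fields from a binary file-like object."""
    image.seek(EXT2_SUPERBLOCK_OFFSET)
    raw = image.read(1024)
    if len(raw) < 58:
        raise ValueError("image too short to hold an ext2 superblock")
    inodes_count, blocks_count = struct.unpack_from("<II", raw, 0)
    (log_block_size,) = struct.unpack_from("<I", raw, 24)
    (magic,) = struct.unpack_from("<H", raw, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2 file system (magic 0x%04x)" % magic)
    return {
        "inodes": inodes_count,
        "blocks": blocks_count,
        "block_size": 1024 << log_block_size,  # stored as a shift count
    }
```

The function takes any seekable binary object, so it can be handed an unmounted partition or an image file opened with open(path, "rb"). A real tool would go on to read the group descriptors and inode tables in the same manner.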
Richard starts this journey into the Solaris filesystem by looking
at the fundamental reasons for needing a filesystem and at the functionality
various filesystems provide. In this first part of the series, you'll
examine the evolution of the Solaris filesystem framework, moving into
a study of major filesystem features. You'll focus on filesystems that
store data on physical storage devices -- commonly called regular
or on-disk filesystems. In future articles, you'll begin to
explore the performance characteristics of each filesystem, and how
to configure filesystems to provide the required levels of functionality
and performance. Richard will also delve into the interaction between
Solaris filesystems and the Solaris virtual memory system, and how it
all affects performance.
One of the most important features of a filesystem is its ability
to cache file data. Ironically, however, the filesystem cache isn't
implemented in the filesystem. In Solaris, the filesystem cache is implemented
in the virtual memory system. In Part 3 of this series on the Solaris
filesystem, Richard explains how Solaris file caching works and explores
the interactions between the filesystem cache and the virtual memory
system.
CacheKit is a collection of freeware Perl and shell programs
to report on cache activity on a Solaris 8 SPARC server. Tools for older
Solaris and Solaris x86 are also included in the kit, as well as some
SE Toolkit programs and extra Solaris 10 DTrace programs. The caches
the kit reports on are: I$, D$, E$, DNLC, inode cache, ufs buffer
cache, segmap cache and segvn cache. This kit assists performance
tuning.
Here's a remarkable tale about a company
that replaced an Oracle database cluster with a few Linux servers.
The great thing about this story is that
the Linux servers did not run database software. The Oracle database
had been converted and stored on the Linux hard disk as a collection
of some 100,000 files. The work was part of a major application upgrade
that involved redesigning all the components of a busy web site.
It's a stunning tale, but it's obviously
an exceptional case. There's absolutely no suggestion that Oracle databases
are old, slow, poor or anything else of the sort.
But this story shows how design decisions
taken a few years ago can rapidly be undermined by new technologies.
In this case, the original site served pages from an Oracle back-end
via J2EE middleware. The new system uses a Linux back-end and Java XML
middleware. First time round, Oracle and the J2EE framework were the
best choice, but a few years later Java and XML had matured. The redesign
enabled some 40,000 lines of J2EE code to be replaced by 5,000 lines
of Java/XML.
The Oracle replacement was another spin-off
from redesigning the middleware. This time it was enabled by the Reiser
File System (ReiserFS) - a relatively new development that is already
the default in Suse, Lindows and Gentoo Linux, largely because it's
a journaling file system, so it doesn't lose data following "unplanned
outages". Linux servers don't crash very much, but power fails sometimes,
so a robust file system is a definite advantage.
ReiserFS uses an improved version of
the same basic tree indexing scheme as some database engines. Thus ReiserFS
is often very fast and efficient compared with the traditional file
systems of Linux and Windows. Of course, for most datasets and applications
it's probably not as fast as Oracle's sophisticated database cluster.
But in this case a bit less performance was acceptable, and the new
option of ReiserFS contributed to the demise of one underused Oracle
database.
One might imagine switching from an Oracle
cluster to a Linux file system produced huge cost savings, but it's
not that simple. The money saved was in fact used to pay for two new
staff - software developers who also contribute to the open-source application
server used in the new Java XML architecture.
So the decision to move away from Oracle
and J2EE was not driven simply by costs. Here, the web site is the firm's
core business, and fixing the feature set and technical agenda of its
core business to one supplier seemed a poor choice. Now the firm has
influence over the features and direction of the application server.
None of this is to disparage other Oracle
databases, and this is not a tale about open source versus commercial
software. Rather it's about choosing the best technologies available,
and re-examining those choices from time to time.
ReiserFS is unlikely to be the ultimate
file system. It will probably soon seem old hat compared with the next
big thing. My vote would be for something like Coda - a research implementation
of a fault-tolerant, distributed file system for long-latency IP networks.
Meanwhile, keep building and rebuilding - cut costs and prosper.
- Understanding What a File System Is
- Understanding File System Taxonomy
- Understanding Local File System Functionality
- Understanding Differences Between Types of Shared File Systems
- Understanding How Applications Interact With Different Types of
File Systems
- Conclusions
- About the Author
Ext2 compatibility (Score:5, Informative)
by Wise
Dragon (71071) on Thursday August 14, @02:42PM (#6698412)
(http://slashdot.org/)
Dude, there are papers published about Ext2fs which describe the
data structures in exquisite detail. You don't need to look at the code
to write an ext2fs clone. I have written proprietary utilities to access
ext2fs data structures. I know what I am talking about.
http://e2fsprogs.sourceforge.net/ext2intro.html
http://uranus.it.swin.edu.au/~jn/explore2fs/es2fs.htm
In addition, there are various commercial tools that read and write
ext2, such as
Ext2fs Anywhere [partition-manager.com].
So in that case, you're full of crap. I don't know if I am really qualified
to comment on the other case, but doesn't BSD have linux compatibility?
And isn't BSD available under a much less restrictive license? They
could just adapt that code.
Series of interesting papers
AVFS is a system that enables all programs to look inside gzip,
tar, zip, etc. files or view remote (ftp, http, dav, etc.) files, without
recompiling the programs.
As Linux grows up, it aims to satisfy the needs of
different users and situations. During recent years,
we have seen Linux acquire different capabilities and be used in many
heterogeneous situations. We have Linux inside micro-controllers, Linux
router projects, one-floppy Linux distributions, partial 3-D hardware
speedup support, multi-head XFree86 support, Linux games and a bunch of
new window managers as well. Those are important features for end users.
There has also been a huge step forward for Linux server needs, mainly
as a result of the 2.2.x Linux kernel switch. Furthermore, sometimes
as a consequence of industry support and sometimes leveraged by Open Source
community efforts, Linux is being provided with the most important features
of commercial UNIX and large servers. One of these features is the support
of new file systems able to deal with large hard-disk partitions, scale
up easily with thousands of files, recover quickly from a crash, increase
I/O performance, behave well with both small and large files, decrease
the internal and external fragmentation and even implement new file
system abilities not supported yet by the former ones.
This article is the first in a series
of two, where the reader will be introduced to the Journal File Systems:
JFS, XFS, Ext3, and ReiserFS. We will also explain different features
and concepts related to the new file systems above. The second article
is intended to review the Journal File Systems' behaviour and performance
through the use of tests and benchmarks.
FreeBSD uses the UFS (Unix File System), which is
a little more complex than Linux's ext2. It offers a better way to ensure
filesystem data integrity, mainly with the "softupdates" option. This
option decreases synchronous I/O and increases asynchronous I/O because
writes to a UFS filesystem aren't synced on a sector basis but according
to the filesystem structure. This ensures that the filesystem is always
coherent between two updates. In my informal performance tests, softupdates
showed significant improvement.
I used two identical boxes, one with Linux and the
other with FreeBSD 4.0-RELEASE. I moved a 1.2GB file between two mount
points, back and forth. I found that FreeBSD, without softupdates,
performs a little slower than Linux. This speed changed after I added
the softupdates to the FreeBSD kernel and then updated the mount point
(via tunefs). Only then did I notice that FreeBSD's performance was
marginally better (10 percent, or so).
These performance tests aren't perfect or anywhere
near conclusive. The Linux filesystem can be tweaked for performance;
however, currently ext2 gets its performance from having an asynchronous
mount. This is great for speed, but if your system crashes it could
take out the filesystem, its data, and its current state. Often, a hard
crash permanently damages a mount. FreeBSD with softupdates can sustain
a very hard crash with only minor data loss, and the filesystem will
be remountable with few problems.
Besides performance, FreeBSD UFS also has one major
advantage over Linux in security. FreeBSD supports file flags, which
can stop a simple script kiddie dead in his tracks. There are several
flags that you can add to a file, such as the immutable flag.
The immutable (schg) flag won't allow any alteration to the file or
directory unless you remove it. Other very handy flags are append
only (sappnd), cannot delete (sunlnk), and
archive (arch). When you combine these with the kernel security
level covered below, you have a very impenetrable system.
libferris.so
libferris is a virtual filesystem that exposes hierarchical data of
all kinds through a common C++ interface.
Access to data is performed using C++ IOStreams and
Extended Attributes (EA) can be attached to each datum to present metadata.
Ferris uses a plugin API to read various data sources and expose them
as contexts and to generate interesting EA. Current implementations
include Native (kernel disk IO with event updates using fam), xml (mount
an XML file as a filesystem), edb (mount a Berkeley database), ffilter
(mount an LDAP filter string) and mbox (mount your mailbox). EA generators
include image, audio, and animation decoders.
About: translucency is
a Linux kernel module that virtually merges two directories, making
it possible to overwrite files on read-only media and compile projects
(such as the Linux kernel) with different options without copying sources
each time. No user-space tools have to be changed. The process is also
known as inheriting (ifs), stacking, translucency (tfs), loopback (lofs),
and overlay (ovlfs).
Changes: This version
has enabled ".." handling and improves behavior on existing files.
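The idea of virtually merging two directories can be sketched in user space. The class below is not the translucency module, which does this inside the kernel so that unmodified programs see the merged tree; MergedView and its method names are hypothetical. It only shows the lookup rule such overlays rely on: reads try the writable upper directory first and fall back to the read-only lower one, while writes always land in the upper directory.

```python
import os
import tempfile


class MergedView:
    """Read-through view of two directories; writes always go to the upper one."""

    def __init__(self, upper, lower):
        self.upper = upper
        self.lower = lower

    def resolve(self, name):
        """Return the path that wins for `name`: upper first, then lower."""
        for layer in (self.upper, self.lower):
            candidate = os.path.join(layer, name)
            if os.path.exists(candidate):
                return candidate
        raise FileNotFoundError(name)

    def read(self, name):
        with open(self.resolve(name), "rb") as f:
            return f.read()

    def write(self, name, data):
        # The lower layer is treated as read-only media and is never modified.
        with open(os.path.join(self.upper, name), "wb") as f:
            f.write(data)

    def listdir(self):
        """Union of both layers, each name listed once."""
        return sorted(set(os.listdir(self.upper)) | set(os.listdir(self.lower)))
```

With the lower directory on a CD-ROM or a pristine source tree, "overwriting" a file simply shadows it in the upper directory, which is what makes building the same sources with different options possible without copying them.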
To achieve the long-elusive goal of easily finding information hidden
in computer files, Microsoft is returning to a decade-old idea.
The company is building new file organization software
that will begin to form the underpinnings of the next major version
of its Windows operating system. The complex data software is meant
to address a conundrum as old as the computer industry itself: how to
quickly find and work with a piece of information, no matter what its
format, from any location.
For those using Windows, this will mean easier, faster
and more reliable searches for information. Replacing its antiquated
file system with modern database technology should also mean a more
reliable Windows that's less likely to break and easier to fix when
it does, said analysts and software developers familiar with the company's
plans.
In the process, the plan could boost Microsoft's high-profile
.Net Web services plan and pave the way to enter new markets for document
management and portal software, while simultaneously dealing a blow
to competitors.
But success won't come overnight. Building a new data
store is a massive undertaking, one that will touch virtually every
piece of software Microsoft sells. The company plans to include the
first pieces of the new data store in the next release of Windows, code-named
Longhorn, which is scheduled to debut in test form next year.
"We're going to have to redo the Windows shell; we're
going to have to redo Office, and Outlook particularly, to take advantage"
of the new data store, Microsoft CEO Steve Ballmer said in a recent
interview with CNET News.com. "We're working hard on it. It's tough
stuff."
Tough indeed. The development of the new file system
technology is so difficult that Microsoft may have to market two distinctly
different product lines while it completes the work--a move Ballmer
concedes would be a huge step backward in the company's long-sought
plan to unify its operating systems with Windows XP and Windows .Net
Server, which has been delayed until year's end.
For years, Microsoft has sold two operating systems:
a consumer version based on the 20-year-old technology DOS, and a corporate
version based on the company's newer, built-from-scratch Windows NT
kernel. The dual-OS track has frustrated software developers, who needed
to support two different operating systems, and has confused customers,
who often didn't understand the difference between them.
"Will we have two parallel tracks in the market at
once? Not desirable. There are a lot of reasons why that was really
a pain in the neck for everybody, and I hope we can avoid that here,"
Ballmer said. "But it's conceivable that we will wind up with something
that will be put on a dual track."
Still, Ballmer and his executive team believe it's
a risk well worth taking. Right now, each Windows program includes its
own method for storing data, such as the vastly different formats used
by Microsoft's Outlook e-mail program and Word document software. Despite
advances in Windows' design and networking technology, it's still impossible
to search across a corporate network for all e-mails, documents and
spreadsheets related to a specific project, for instance. Searching
through video, audio and image files is kludgy at best.
Likewise, it's tricky--if not impossible--to build
new programs that tap into those files. "If I'm looking for anything
where I interacted with one customer in the last 12 months, I need to
search for e-mail, Word documents or information in my database," said
Chris Pels, president of iDev Technologies, a software consulting and
design firm in East Greenwich, R.I. "That kind of stuff is a nightmare
from a programming perspective these days."
Other software makers have attempted to solve the
same problem. Nearly two years ago, Oracle
introduced something called Internet File System, which works with
its database server to make storage and retrieval of data--including
Microsoft Word and Excel documents--easier and more reliable. "This
hasn't been done in a commercial operating system, but it has been done
with Oracle's database," said Rob Helm, editor in chief of Directions
on Microsoft, crediting Oracle CEO Larry Ellison as an early proponent
of the idea.
Oracle continues to challenge Microsoft on this front.
Last fall, the company announced an e-mail server option for its 9i
database management software along with a migration program to move
companies from Microsoft Exchange to Oracle's database.
Yet Oracle's efforts amount to more of a jab between
long-time adversaries than a serious competitive challenge. Given Windows'
enormous market clout, Microsoft's plan could change the competitive
landscape of the software business and affect millions of computer users
and technology buyers.
"It's a huge risk for Microsoft," Helm said. "They
have so much riding on this. If this is late and doesn't work as advertised,
it will have effects that will ripple through the entire company and
the industry. But the benefits, if they succeed, will be huge."
Microsoft's first--and perhaps largest--challenges
will be internal: how to overcome the technical and organizational obstacles
it encountered when it set out to solve the very same problem in the
early 1990s. At that time, the company launched an ambitious development
project to design and build a new technology called the Object File
System, or OFS, which was slated to become part of an operating system
project code-named Cairo.
"We've been working hard on the next file system for
years, and--not that we've made the progress that we've wanted to--we're
at it again," Ballmer said.
While the Cairo project eventually resulted in Microsoft's
Windows 2000 operating system, the file system work was abandoned because
of complexity, market forces and internal bickering. "It never went
away. We just had other things that needed to be done," Jim Allchin,
the group vice president in charge of Windows development, told News.com.
Those other things most likely included battling "Netscape
and Java and the challenge of the Internet and the Department of Justice,"
Gartner Group analyst David Smith said--issues that continue to persist
today.
Microsoft executives say the company plans to resurrect
the OFS idea with the Longhorn release of Windows. "This will impact
Longhorn deeply, and we will create a new
API for applications to take advantage of it," Allchin said.
He said bringing the plan back now makes sense because
new technologies such as XML (Extensible Markup Language) will make
it much easier to put in place. XML is already a standard for exchanging
information between programs and a cornerstone of Microsoft's Web services
effort, which is still under development. Longhorn and the new data
store are the "next frontier" of software design, Allchin said.
In addition, Microsoft has already developed the database
technology it needs for a new file system. A future release of its SQL
Server database, code-named Yukon, is being designed to store and manage
both standard business data, such as columns of numbers and letters,
and unstructured data, such as images. Yukon
will also form the data storage core of Microsoft's Exchange Server
and other future products.
The more important reasons for the renewed development
effort, however, are strategic. If the plan succeeds, it will give Microsoft
a huge technological advantage over the competition by making its products
more attractive to buyers and giving large companies another reason
to install Windows-based servers.
"Having multiple data stores makes life harder for
the enterprise customer," Helm said. "Search will become much easier,
and this should make it cheaper to build new systems because customers
only have to learn one database."
Helm said the database capability in Windows will
make it a snap to add document management and more advanced portal development
tools. Those applications will in essence be built into the operating
system, making it more likely that customers will use them.
Moreover, industry veterans note that the new data
store will benefit from Microsoft's tried-and-true strategy for entering
new markets--leveraging the overwhelming market share of Windows. Because
Microsoft needs the new data store to make its .Net services plan work,
analysts say the company is likely to pressure customers to make the
move to the Longhorn release of Windows through licensing incentives
or other means.
Nevertheless, widespread acceptance is not a foregone
conclusion. For big companies not yet ready to install Microsoft's 3-year-old
Windows 2000 operating system--much less Windows XP, released last October--the
Longhorn plan may be too much to contemplate right now.
"That's the real issue that I see in the trenches:
the rate of change--for programmers, for businesses, in terms of making
infrastructure technology decisions," Pels said. "People can't keep
up with it, and if they want to keep up with it, is it worthwhile for
their business?"
Mike Gilpin, an analyst with Giga Information Group,
agrees. "It's a great dream," he said. "But it could be hard to make
real."
Alan Cox replied tersely, "Which means
an ext3 volume cannot be recovered on a hard disk error." And Stephen
replied: "Depends on the error. If the disk has gone hard-readonly, then
we need to recover in core, and that's something which is not yet implemented
but is a known todo item."
This document describes Sun's implementation of the Large File Summit's
standard for 64-bit file access, including the user-level experience
of converting existing applications to the new standard.
File System Indexing, and Backup, by Jerome H. Saltzer, Laboratory
for Computer Science, Massachusetts Institute of Technology, M.I.T. Room NE43-513,
Cambridge, Massachusetts 02139, U.S.A.
This paper briefly proposes two operating system ideas:
indexing for file systems, and backup by replication rather than tape
copy. Both of these ideas have been implemented in various non-operating
system contexts; the proposal here is that they become operating system
functions.
IBM's journaled file system technology, currently
used in IBM enterprise servers, is designed for high-throughput server
environments, key to running intranet and other high-performance e-business
file servers. IBM is contributing this technology to the Linux open
source community with the hope that some or all of it will be useful
in bringing the best of journaling capabilities to the Linux operating
system. Work is currently underway to complete the port of this technology
to Linux.
Developing JFS
JFS is licensed under the
GNU General
Public License. If there's a feature that you'd like to see added
to JFS, consider becoming a part of the JFS development process. Since
JFS is an open source project, it's easy to get involved.
Get the Source
A CVS repository contains the
latest stable version of the JFS source code and documentation.
All JFS core team members and JFS contributors have read-write access
to CVS and WebCVS.
CVS is a system that lets groups of people work simultaneously
on groups of files. CERN has a Web site with general information
on CVS, as does cyclic.com.
For convenience the latest source may be downloaded
as jfs-0.0.1.tar.gz.
For details on building the source and a list of ToDo
items, examine the README.
Report bugs
JitterBug is the system for tracking JFS bugs and feature requests.
The core team and contributors have read-write access to this database.
The community at large has read access through a Web interface.
JitterBug is a Web-based bug tracking system. It handles
bug tracking, problem reports, and queries and is available under the
GNU General Public License. JitterBug has a Web site for
general information on JitterBug.
ReiserFS article. Interestingly, this project is being funded by Suse
and MP3.com. The FS appears to use much more efficient algorithms
and to make better use of disk space for small files.
A great way to follow kernel development is to read the excellent
kernel mailing list synopses written by Zack Brown at:
http://kt.linuxcare.com
ext3fs is a journaled version of ext2fs written by
Stephen Tweedie. It's in beta form right now but works pretty well.
Stephen and Ted Ts'o talked about ext3fs at our Linux Storage Management
Workshop in Darmstadt, Germany (you can get the slides for this workshop
at ftp://linux.msede.com/lsmws_talks/)
The ext3 filesystem, of which early alphas are ready (version 0.0.2c,
the excitement!). Development is on the linux-fsdevel mailing list,
archived here.
Hello, I've been running ext3 on my laptop computer for about
two months now. It works great. Just sync the disks and turn it off.
No shutdown. No data loss either.
If you look at e.g. Solaris disk-suite,
you are able to control where your metadata is stored. Say that
you want to journal file data as well; this normally slows
the system down. But if you can specify that all file metadata should
be on a separate solid-state disk (naturally mirrored for safety), then
journaling of file data will be quick. This is in my view
quite important. If I understand everything correctly you can do that
with ext3.
One of the major problems with ext2fs (IMHO) is that it doesn't
resize well. This is because there is a copy of every group descriptor
in every group [a g.d. contains metadata for a group of blocks/inodes,
typically 8M in size]. Therefore enlarging or shrinking the drive causes
a major reshuffle of ALL the data; so far, the only utility I know that
can do this is resize2fs, which comes with Partition Magic (there are
no doubt others now).
This redundancy is good in theory (backups), but keeping a copy of a
constant number of group descriptors (perhaps the previous and next
32) in a given group would still give you a lot of redundancy
plus make resizing simpler.
Granted, resizing isn't something you do a lot, but having had my system
lock up and die while resizing and having to recover using Turbo C++
and the ext2fs spec (code and info on my
ext2fs page), it would be nice if ext3fs (or XFS) made this easier.
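The cost of keeping a full copy of the group descriptor table in every group can be estimated with a little arithmetic. The sketch below assumes the classic ext2 layout the comment has in mind: 1 KiB blocks, one bitmap block per group (so 8 x block_size blocks, i.e. 8 MiB, per group), 32-byte group descriptors, and a complete copy of the table in every group. The function name descriptor_overhead is made up for illustration.

```python
import math

GROUP_DESC_SIZE = 32  # bytes per group descriptor in classic ext2


def descriptor_overhead(fs_bytes, block_size=1024):
    """Estimate group count and descriptor-table overhead for an ext2 volume."""
    blocks = fs_bytes // block_size
    blocks_per_group = 8 * block_size  # one bitmap block tracks 8 * block_size blocks
    groups = math.ceil(blocks / blocks_per_group)
    table_blocks = math.ceil(groups * GROUP_DESC_SIZE / block_size)
    return {
        "groups": groups,
        "table_blocks": table_blocks,
        # assumes a full copy of the table is stored in every group
        "total_copy_bytes": groups * table_blocks * block_size,
    }
```

For an 8 GiB volume with 1 KiB blocks this gives 1024 groups and a 32-block table, so the copies add up to 32 MiB. Because the table grows with the number of groups, enlarging the volume lengthens every copy, which is the reshuffling the comment complains about.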
The Reiser Filesystems by Hans Reiser, a very ambitious project to not
only improve performance and add journaling, but to redefine the
filesystem as a storage repository for arbitrarily complex objects.
reiserfs.
Reiserfs is faster than ext2/3 because it uses balanced trees for its
directory structures.
The project is now released for 2.2.11 - 2.2.13. Mailing list archive
here.
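The claim about balanced trees comes down to lookup cost: an unsorted directory has to be scanned entry by entry, while an ordered index can be searched in roughly log2(n) steps. The sketch below is not ReiserFS code; it uses a sorted Python list with binary search as a stand-in for a balanced tree, and both function names are hypothetical.

```python
import bisect


def linear_lookup(entries, name):
    """Scan an unsorted directory list; return (found, comparisons made)."""
    for steps, entry in enumerate(entries, start=1):
        if entry == name:
            return True, steps
    return False, len(entries)


def indexed_lookup(sorted_entries, name):
    """Binary search over a sorted list of names; return True if present."""
    i = bisect.bisect_left(sorted_entries, name)
    return i < len(sorted_entries) and sorted_entries[i] == name
```

For a directory of 10,000 entries the linear scan may need up to 10,000 comparisons, while the binary search needs about 14. A real balanced tree additionally keeps inserts and deletes cheap, which a sorted list does not.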
The XFS site
has some docs. The work to unencumber the code is accelerating, and
February is the target date for source code release. XFS is the one
that I think has the most potential. It's a full logging filesystem
from the ground up, not an extension (not that EXT3 or DTFS are bad
or misguided efforts). I'm betting it will be the highest performance
filesystem for Linux when it goes gold. I think the tight integration
of the log could be a huge plus. It's been a while since filesystem
101 but I would think that there are a ton of ways to optimize performance
with log write back tricks and usage optimizations. You could include
a hit counter in metadata and have an optimizer that moves higher hit
files closer to the log in the center of the disk, putting your more frequently
used files closer to where the head is supposed to be. Those kinds of
optimizations (if practical, maybe I'm full of it) wouldn't be nearly
as easy with ext3 since the FS doesn't have any knowledge of the log.
Plus XFS has ACLs and big file support already.
Hi, ext3fs is a journaled version of ext2fs written by Stephen Tweedie.
It's in beta form right now but works pretty well. Stephen and Ted Ts'o
talked about ext3fs at our Linux Storage Management Workshop in Darmstadt,
Germany (you can get the slides for this workshop at ftp://linux.msede.com/lsmws_talks/)
Stephen also gave a talk on ext3fs at the Linux Kongress in Augsburg,
Germany. He is predicting Summer 2000 for production use of ext3fs.
Nice features include the fact that ext3fs is backwards compatible with
older versions of ext2. In addition, ext3fs uses asynchronous journaling,
which means the performance will be as good as or better than ext2fs.
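A rough sketch of the journaling idea mentioned above may help. It is a toy in-memory model, not ext3's actual journal format; `ToyJournaledStore` and its methods are hypothetical names. Updates are first appended to a log, and the main storage area is brought up to date later (the asynchronous part). After a crash, only transactions that reached their commit record are replayed.

```python
class ToyJournaledStore:
    """In-memory stand-in for a disk with a write-ahead journal (illustrative only)."""

    def __init__(self):
        self.journal = []    # append-only list of (txid, kind, payload)
        self.blocks = {}     # main storage area: block number -> data
        self._next_txid = 1

    def log_transaction(self, updates, commit=True):
        """Append updates to the journal only.

        commit=False simulates a crash before the commit record is written.
        """
        txid = self._next_txid
        self._next_txid += 1
        for block_no, data in updates.items():
            self.journal.append((txid, "write", (block_no, data)))
        if commit:
            self.journal.append((txid, "commit", None))
        return txid

    def replay(self):
        """Apply committed transactions to main storage in log order, then empty the log."""
        committed = {txid for txid, kind, _ in self.journal if kind == "commit"}
        for txid, kind, payload in self.journal:
            if kind == "write" and txid in committed:
                block_no, data = payload
                self.blocks[block_no] = data
        self.journal.clear()
```

An uncommitted transaction leaves no trace in `blocks` after `replay()`, which is why recovery can skip a full consistency check of the whole disk.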
I am involved with the SGI effort to port XFS to Linux. The work
to unencumber the code is accelerating, and February is the target date
for source code release. The read path is working at this time. More
work remains however, so stay tuned to
http://oss.sgi.com
[August 17, 1999]
read-2_23.zip
Size: 72 KB. LREAD v2.3 - Program to read Linux Extended2 (ext2) filesystems on PCs
from within DOS
Alan put 2.2.11pre2 up on
ftp://ftp.*.kernel.org/pub/linux/kernel/alan/proposed-2.2.11pre2.gz
and posted a changelog against 2.2.10. Linus replied,
"Looks good, except aic7xxx is wrong version
;) Tssk, tssk."
One of Alan's changes was "FAT now uses cluster numbering
for inode info", which Alexander Viro took exception to. Alexander replied
to the announcement:
"It doesn't.
It generates inumbers on the fly. Cluster numbering is unusable
for that - truncate() *shouldn't* change inumber. FAT *has* no file
invariants that would survive (a) rename, (b) truncate(), (c) write
and (d) umount. Of all those umount give the least pain wrt races.
New code guarantees constant inumbers for opened files.
The bottom line - inumbers on FAT will suck anyway.
There is no inodes in normal sense. And inumbers changing after
reboot are *much* better than exploitable races. On FAT usage of
(old) inumbers for any backup stuff was broken - rename() would
go unnoticed."
Another one of Alan's changes was to remove the COMA
workaround, and recommend people just use set6x86 if they have that
Cyrix CPU bug. Zoltan Boszormenyi said sarcastically that in that case,
they might as well remove the f00f bugfix as well. Alan defended the
change, and there was a discussion about which fix was enabled in which
version and then switched for which other fix.
See also
OSRC File Systems
There are several necessary extensions of the Unix filesystem.
The two most often mentioned are journaling and file system indexing.
File System Indexing, and Backup by Jerome H. Saltzer, Laboratory
for Computer Science, Massachusetts Institute of Technology, M.I.T. Room NE43-513,
Cambridge, Massachusetts 02139, U.S.A.
This paper briefly proposes two operating system ideas:
indexing for file systems, and backup by replication rather than tape
copy. Both of these ideas have been implemented in various non-operating
system contexts; the proposal here is that they become operating system
functions.
A UNIX
APPROACH TO DATABASE SOFTWARE Thomas Lord, Berkeley CA.
Sprite papers
- A Trace-Driven Analysis of the UNIX BSD File System. John K. Ousterhout, Hervé Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson
- Measured Performance of Caching in the Sprite Network File System. Brent B. Welch
- Caching in the Sprite Network File System. Michael N. Nelson, Brent B. Welch, and John K. Ousterhout
- Sprite Position Statement: Use Distributed State for Failure Recovery. Brent Welch, Mary Baker, Fred Douglis, John Hartman, Mendel Rosenblum, John Ousterhout
- The File System Belongs in the Kernel. Brent Welch
- The Sprite Internet Protocol Server. Andrew Cherenson
- The Jaquith Archive Server. James W. Mott-Smith
- Beating the I/O Bottleneck: A Case for Log-Structured File Systems. John Ousterhout and Fred Douglis
- The LFS Storage Manager. Mendel Rosenblum and John K. Ousterhout
- The Design and Implementation of a Log-Structured File System. Mendel Rosenblum and John K. Ousterhout
- Measurements of a Distributed File System. Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John K. Ousterhout
- A Trace-Driven Analysis of Name and Attribute Caching in a Distributed System. Ken W. Shirriff and John K. Ousterhout
- Non-Volatile Memory for Fast, Reliable File Systems. Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, Margo Seltzer
- Why Aren't Operating Systems Getting Faster As Fast as Hardware? John K. Ousterhout
- Pseudo Devices: User-Level Extensions to the Sprite File System. Brent B. Welch and John K. Ousterhout
- Pseudo-File-Systems. Brent B. Welch and John K. Ousterhout
- The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment. Mary Baker and Mark Sullivan
- Availability in the Sprite Distributed File System. Mary Baker and John Ousterhout
- The Sawmill logging file system. Ken Shirriff
- Slides from a work-in-progress talk on the Sawmill logging file system. Ken Shirriff
- An Implementation of Memory Sharing and File Mapping. Ken Shirriff
- The Sprite Network Operating System. John K. Ousterhout, Andrew R. Cherenson, Frederick Douglis, Michael N. Nelson, Brent B. Welch
- Sprite on Mach. Michael D. Kupfer
- The Role of Distributed State. John K. Ousterhout
- Transparent Process Migration: Design Alternatives and the Sprite Implementation. Fred Douglis and John Ousterhout
- Virtual Memory vs. The File System. Michael N. Nelson
- Virtual Memory for the Sprite Operating System. Michael N. Nelson
- A Comparison of the Vnode and Sprite File System Architectures. Brent Welch
- "Zebra: A Striped Network File System". Appeared in the Proceedings of the USENIX Workshop on File Systems. Also as UCB/CSD Tech Report 92/683. John Hartman and John Ousterhout
- The Zebra Striped Network File System. John Hartman and John Ousterhout
Extending the Operating System at the User Level: the Ufo Global File System,
by Albert D. Alexandrov, Maximilian Ibel, Klaus E. Schauser, and
Chris J. Scheiman, Proceedings of the USENIX 1997
Annual Technical Conference, Anaheim, California, January 1997.
In this paper we show how to extend
the functionality of standard operating systems completely at the user
level. Our approach works by intercepting selected system calls at the
user level, using tracing facilities such as the /proc file system provided
by many Unix operating systems. The behavior of some intercepted system
calls is then modified to implement new functionality. This approach
does not require any re-linking or re-compilation of existing applications.
In fact, the extensions can even be dynamically ``installed'' into already
running processes. The extensions work completely at the user level
and install without system administrator assistance.
We used this approach to implement a global file system,
called Ufo, which allows users to treat remote files exactly as if they
were local. Currently, Ufo supports file access through the FTP and
HTTP protocols and allows new protocols to be plugged in. While several
other projects have implemented global file system abstractions, they
all require either changes to the operating system or modifications
to standard libraries. The paper gives a detailed performance analysis
of our approach to extending the OS and establishes that Ufo introduces
acceptable overhead for common applications even though intercepting
system calls incurs a high cost.
Keywords: operating systems, user-level extensions,
/proc file system, global file system, global name space, file caching
See also GSCHWIND, M. K. 1994. FTP---Access
as a user-defined file system. ACM SIGOPS Oper. Syst. Rev. 28, 2
(Apr.), 73--80.
SunWorld: Security basics, Part 1 - Understanding file attribute bits and
modes (Oct 29, 2000)
Linux Magazine: A Tour of the Linux Filesystem: Part II (Sep 23, 2000)
Linux Magazine: A Tour of the Linux Filesystem: Part I (Aug 27, 2000)
ShowMeLinux.com: Ask Alex - Linking Files, Finding Files, Dual Modem Setups,
and More (Aug 26, 2000)
FirstLinux.net: Linux Directory Structure (Aug 02, 2000)
Linux Magazine: Guru Guidance: Managing Filesystems: Beyond the Basics (Jul
23, 2000)
LinuxNovice.org The Linux filesystem
There is no doubt that one of the most confusing things
about Linux (at least to the novice user) is its filesystem. Since most
of us grew accustomed to the way Windows does things, thinking about
the filesystem in terms of the A or C drive seems almost natural, but
understanding the differences between /etc and /var takes us to a whole
different world. The present article tries to make it easier for new
Linux users to understand the filesystem.
If you need more information, feel free to visit the
Filesystem Hierarchy
Standard (FHS) site. This is the organization that tries to lay
out a filesystem standard not only for the different Linux distributions,
but also for UNIX.
GNULinux.com
- Filesystem review for Complete Newbies
The Linux filesystem is rather confusing
to new users. We'll try to remove a little of the mystery, showing you
the logic of Linux and help you become more accustomed to accessing
files and mount points. This document will be updated from time to time,
and I will keep a running list of FAQ's at the bottom of the page for
easy reference.
A few conventions before we go
on:
Everything in Linux is considered
either a file or a directory. Linux presents everything as a file that
can be opened as such. Generally, that means you get a screen full of
garbage, but many of the files are human-readable and you can edit them
as you see fit. You can check into a few examples of the file theme
in the /dev (devices) and the /proc (process) directories. The /dev
directory is the location for devices attached to your system (i.e., hardware)
and /proc exposes kernel state held in your system's memory, from IRQs and
PCI channels to per-process information. It's never a good idea to edit the things
in /dev or /proc, but looking in those directories will help with the Linux
concept that everything is a file in one way or another.
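To see the "everything is a file" idea for yourself, a short script can ask the system what kind of object sits behind a path. This is a minimal sketch using Python's standard `os` and `stat` modules; the helper name `file_kind` and the example paths are assumptions for illustration, and which paths exist depends on your system.

```python
import os
import stat


def file_kind(path):
    """Return a human-readable description of what kind of object `path` names."""
    mode = os.lstat(path).st_mode   # lstat: describe a symlink itself, do not follow it
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "other"


if __name__ == "__main__":
    # Example paths only; entries that do not exist on this machine are skipped.
    for p in ("/etc", "/dev/null", "/proc/cpuinfo"):
        if os.path.lexists(p):
            print(p, "->", file_kind(p))
```

Looking at paths this way is read-only, so it is a safe way to explore /dev and /proc.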
It's been this way for 30 years. Yep,
the Unices (UN*X-like operating systems, which includes Linux) have
a long and deep history, stretching back to the dawn of computing, and
the filesystem has stayed pretty much intact. Sure, there have been
changes along the way, but as a whole, it's the same structure. A working
knowledge of the filesystem makes you a good user and a good administrator,
able to circumvent problems by going to the source every time.
(Apr 2, 2000, 17:13 UTC) (Posted by marty)
"This article will unravel the mysteries of the Unix and GNU/Linux filesystem
to the new user: where you can find files, where you should put files, and
how to avoid getting lost."
[June 25, 1999]
Gregor N. Purdy - Project Ideas - CVFS -- Concurrent Versioning File
System (CVFS)
SCO OpenServer Release 5 Filesystem Technology - White Paper
Filesystem Administration
[Aug 19, 2000]
Opensource.html IBM announces AFS as an open source product under the IBM
Public License
Re:you probably don't want AFS (Score:1)
by jlrobins_uncc (jlrobins@uncc.delete.edu) on Thursday August 17, @05:55PM EDT (#113)
http://www.cs.uncc.edu/~jlrobins/
AFS makes great sense for Web server farms
and/or mirrors of the same site across a WAN such as the Internet
(think an east coast site and a west coast site). Just edit the
file and pow, a server -> client callback notifies any clients caching
the file that they need to refetch.
Couple this with having the content in a read-only replicated
volume, then go ahead and update many files, get your new site look-and-feel
redone, then once you're happy with it, release the read/write volume
for replication, and pow -- one atomic transaction to all of the
mirroring servers on the WAN!
Maybe this is why AFS is a major component of IBM's Websphere
platform. All of this, currently working like a champ, and it'll
be free and open source!
---------- Hail Ants!
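The server-to-client callback described in the comment above can be sketched roughly as follows. This is a toy model, not the AFS protocol or API: `Server`, `Client`, `fetch`, `store` and `break_callback` are hypothetical names. The server remembers which clients cache a file and notifies them when the file changes, so a client only goes back over the wire after an update.

```python
class Server:
    """Toy file server that tracks which clients hold a cached copy."""

    def __init__(self):
        self.files = {}
        self.callbacks = {}   # path -> set of clients promised a notification

    def fetch(self, client, path):
        # Hand out the data and remember that this client now caches it.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data):
        # Update the file, then tell every caching client its copy is stale.
        self.files[path] = data
        for client in self.callbacks.pop(path, set()):
            client.break_callback(path)


class Client:
    """Toy client with a local cache that trusts the server's callbacks."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.fetches = 0      # number of trips "down the wire"

    def read(self, path):
        if path not in self.cache:
            self.fetches += 1
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def break_callback(self, path):
        self.cache.pop(path, None)
```

Repeated reads of an unchanged file cost one fetch in this model; a second fetch happens only after the server stores a new version.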
Why this has so much potential for good. (Score:4, Insightful)
by jlrobins_uncc (jlrobins@uncc.delete.edu) on Thursday August 17, @05:47PM EDT (#111)
http://www.cs.uncc.edu/~jlrobins/
AFS is a very stable, tested, enterprise filesystem.
It offers the following features:
- Cross platform: Many UNIXen as well as NT as either client
or server.
- Secure: Uses Kerberos IV for user authentication.
- Client-side caching: client machines use disk or virtual
memory to cache MRU files, greatly reducing # trips down the
wire on reads.
- Unified naming scheme: names of files don't indicate what
file server they're on. Makes moving of volumes from one fileserver
(or drive on the same fileserver) to another a cinch, since
no client-side changes need to happen.
- Read-only replication: Make your application install directories
replicated in each building on campus.
Now, it's not a perfect product, but it is way cooler
than vanilla NFSv2 or NFSv3, especially on the server-side management
side of things. It doesn't do disconnected operation (which CODA
strives to do), byte-range locking, strict UNIX file semantics (data
most recently written == data viewable by all file handles to that
file), or Kerberos 5, but it is a far simpler system to get running
than DCE, which does address some of those issues.
One would hope to see the following things from this open sourcing:
- *BSD client / servers.
- MacOS X client (at least!)
- Millennium / Win2K clients (NT clients exist currently).
If the MacOS X client happens, then there will be a secure, scalable
enterprise filesystem for the three major computer platforms --
Wintel, UNIX, and Mac, and it'll even be freely available!
I don't believe that there are any products available today
that offer secure, robust support for all three platforms
(and no, I don't consider protocol translators, such as Samba or
CAP, which require you to set up the clients to use cleartext passwords
over the wire to authenticate (not to downplay in any way the role
of either technology -- it's not their fault that you've got to
set up the clients in that fashion to interoperate with AFS as it
is now), or using NFSv2 or v3 on the UNIX end to talk to something
like Novell 5 (which, AFAIK, doesn't talk at all to Macs anymore)).
This will give us one protocol on the wire, multiple server-side
implementations (interoperable in the same cell!), multiple client-side
implementations, WAN scalability, and secure authentication. A good
day for the world!
As one of the architects designing DFS in IBM (Score:4, Interesting)
by gelfling on Thursday August 17, @08:06PM EDT (#128)
We've always had a hard time selling DFS internally.
In fact we've stopped trying to do that because there weren't enough
internal customers. The hurdle costs were too high, the skills were
hard to find and expensive, and customers still wanted SMB shares
via Samba, which drove the cost even higher. The client side DCE
licence costs drove Samba since the per client cost was $65/seat
in bulk. AFS as open source can only be a good thing since we can
always find someone to pick up the development and maintenance and
foregoing DCE-Kerberos is really not that big a deal from an internal
perspective. In our environment the challenge was to collapse hundreds
of LanServer domains. DFS or AFS fit the bill and the cost dynamics
work very well compared to staffing 1 headcount/25-35 servers in
the LanServer world. The problem anyone will find though is backup
and storage management. butc or buta just don't scale very well
even with multiple replicas of the fldb core so whoever tries to
manage this, as we did, will be forced to write extensions to their
storage management code, as we did with ADSM. Also you will find
that Samba doesn't scale nearly as well as you want with only a
few hundred accounts on a Samba server even if it sits on a huge
Unix machine. This leaves you with a few hundred or more SMB gateways
if you try to scale up to the huge numbers we did.
Once again, AFS open source can only be a good thing - it will propagate
a great technology into large sites that would have shied away
from it previously.
Articles IBM Open Sourcing AFS
AFS semantics are very different from
UNIX file system semantics: permissions are associated with directories
only, access is determined only by the containing directory, if multiple
clients modify the same file, updates are lost, you can't have any special
files in an AFS file system, etc. AFS uses its own authentication, it
doesn't work well for big files, it always requires extra work to get
it to work with daemons, and it has severe problems for scientific compute
clusters. IBM has long ago moved onto DFS (unrelated to Microsoft DFS),
which fixes many of the problems of AFS (but is itself big, even more
complex than AFS, and hard to administer). Many places are trying to
get rid of AFS because it's just too much of a hassle to run it (and
converting back to a UNIX file system isn't easy because AFS encourages
permissions and ACLs to mushroom unnecessarily).
AFS may be acceptable for specific applications
(in fact, what it was designed for originally): a large untrusted user
population, dedicated system management staff, and smallish files and
problems (text file editing, small programming jobs). But for many environments
where Linux is used--big software development projects, web servers,
scientific computing, home networking--it just doesn't seem like a good
fit.
If it's the security you care about,
NFSv4 might be for you, although it clearly also has some problems.
If you want something AFS-like, Coda might be an option (but I don't
know how mature it is yet). MFS and GFS are options for compute clusters.
Maybe we can get 9P or Styx up on Linux.
Re:you probably don't want AFS (Score:2, Informative)
http://www.cae.wisc.edu/~gerdts
The problem you mention with "not working
well with daemons" is likely related to the fact that it uses Kerberos
IV. If the daemon needs to have more access to AFS directories than
you are willing to give to any other user on the system, there is a
lot of work to do.
Specifically, you need to stash a password
away such that the daemon can authenticate and periodically reauthenticate
so that it does not lose the rights that it has.
AFS does allow you to have ACL's based
on IP address. As such, if you are running a daemon on a machine that
only system administrators have access to, it may not be a big deal
to allow everyone on that machine to write to a directory. Other machines,
though, may have read-only or no access to the directory.
NFS 4 will have the same problem, as
a requirement for it is that Kerberos V is supported as an authentication
mechanism. If you don't give world write to a file/directory, then you
cannot write to it without a kerberos V ticket.
Too little, too late? (Score:4, Insightful)
As someone who has worked with AFS for
the past 8 years, I have to say that I greet this announcement with
a somewhat more pessimistic view.
Namely: AFS is now officially dead.
I say "officially" because, IMO, AFS
is already dead, and has been for years (ever
since Transarc (now IBM Transarc Labs, but I'll refer to them as Transarc
for brevity) came out with DCE/DFS, really).
Oh, there were bouts of heavy maintenance
and limited development. These periods were inevitably precipitated
by Transarc's AFS customers becoming vocal and complaining. But when
the complaints died down, so did Transarc's commitment.
Transarc has never treated AFS like a
real product. Their "development" efforts have been limited to ports
to new versions of the same operating systems, a few ports to new architectures,
bugfixes, and very limited feature additions (mostly backports
from DFS).
In fact, this year has seen Transarc's
AFS support sink to a new low. From what I've been able to garner, all
AFS development is being outsourced to India. Responses from Transarc's
AFS hotline support (a support service which customers purchase!) have
been inept. There was no Decorum (Transarc's yearly AFS conference)
this year, nor even an announcement concerning it. It's been ages since
anyone from Transarc has posted on the AFS mailing list.
So, why is Transarc (now IBM Transarc
labs) open-sourcing AFS? For one simple reason: AFS is IBM's red-headed
stepchild, and they don't know what else to do with it.
If you read the announcement at
http://www.transarc.com/News/press/opensource.html, you'll note
this entry in the FAQ:
Is IBM still investing in AFS?
Yes. IBM recognizes that many of
our customers will still want a commercially-supported version of
AFS IBM AFS. IBM/Transarc will still sell, maintain, port (to new
versions of currently-supported OS), support, and provide minor
enhancements to "IBM AFS".
Good software grows or dies. AFS died
a long time ago. I, personally, think this is tragic, because AFS had
great potential. But Transarc never made a long-term commitment to anything
other than keeping it on life support. Perhaps it can be resuscitated
back to health, but I can't help but wonder if the Open Source community's
effort would be better spent towards other distributed filesystems efforts,
such as CODA
(which I admittedly haven't investigated, but plan to).
Re:you probably don't want AFS (Score:1)
by Tower (/dev/whoop-ass) on Thursday August 17, @04:55PM EDT (#89)
Actually, both AFS and DFS are in use
here at IBM (and at every other site I've visited). There's no AFS on the
Windows boxen, but everyone using the RS/6ks seems to prefer AFS...
Personally, I prefer the ACLs of AFS to traditional permission structures,
and they are really rather flexible. You can still set rwx on the files,
so it doesn't take a whole lot away...
I agree that AFS isn't meant for clustering, but it works well from
a security standpoint, especially with Kerberos.
Re:you probably don't want AFS
(Score:3, Informative)
by Anonymous Coward on Thursday August 17, @05:41PM EDT (#109)
> AFS semantics are very different from
UNIX file system semantics: permissions are associated with
> directories only, access is determined only by the containing directory,
Think about hard links: that's why it works this way.
> if multiple clients modify the same file, updates are lost
That's not entirely true but I agree it's stupid. Anyway, it doesn't
matter, if you don't use file locking you should expect corruption anyway.
> you can't have any special files in an AFS file system
I hope you don't expect your users to be able to create /dev/mem nodes
in their home directories...
> AFS uses its own authentication
Yes, it's called Kerberos... ever heard of it?
> it doesn't work well for big files
It works reasonably well with big files, unlike Coda which unfortunately
doesn't work at all with them. Anyway for huge amounts of data you shouldn't
be creating massive files anyway, look into databases or streaming software.
> it always requires extra work to get it to work with daemons
You mean you want root on a given machine to have "root" in your whole
enterprise?
> and it has severe problems for scientific compute clusters
What, rsh doesn't work? Just patch it and it works fine. Otherwise what's
the problem?
> IBM has long ago moved onto DFS
No they haven't
> (unrelated to Microsoft DFS)
Thank god. But I'm glad Microsoft has finally invented the automounter.
> which fixes many of the problems of AFS (but is itself big, even more
complex than AFS, and hard
> to administer).
And nobody uses it...
> Many places are trying to get rid of AFS because it's just too much
of a hassle to run it
There really is no better alternative, though.
> (and converting back to a UNIX file system isn't easy because AFS
encourages permissions and ACLs
> to mushroom unnecessarily).
You mean it encourages security? :)
> AFS may be acceptable for specific applications (in fact, what it
was designed for originally): a
> large untrusted user population, dedicated system management staff,
and smallish files and
> problems (text file editing, small programming jobs).
It lets you solve problems on a big scale. I hope the open source release
will make it even better and more available for everyone to use.
> But for many environments where Linux is used--big software development
projects, web
> servers, scientific computing, home networking--it just doesn't seem
like a good fit.
Big software development is one of the first things AFS was used for.
It's only recently, ironically, that local disks+Linux have outperformed
network file systems so much.
AFS makes sense on web servers for replicating site data and allowing
many people to "upload" without the insecurity of FTP.
And I don't see why anyone wouldn't want to use AFS at home. Again,
I hope the open source release will allow as many people to have real
security in network filesystems as possible.
> If it's the security you care about, NFSv4 might be for you
Whenever that will be available...
> If you want something AFS-like, Coda might be an option (but I don't
know how mature it is yet)
Coda is nice but not packaged well enough for everyone to start using
it. It also chokes on big files much worse than AFS, unfortunately.
> MFS and GFS are options for compute clusters.
They're nice for high bandwidth to big files. But they give you no security...
do you really want a root exploit on one machine in a cluster to destroy
all data in the entire site?
Why? CODA
(Score:2, Interesting)
by Anonymous Coward on Thursday August 17, @03:49PM EDT (#26)
Why open source it? Because Coda is about
to replace it. CODA (http://www.coda.cs.cmu.edu/) is a Free (free software),
scalable, distributed file system. It covers every feature of AFS, and
goes quite a bit further.
Coda is reaching a point of stability and availability where it's nearly
ready for widespread production deployment.
UKUUG Linux 2000 Conference - Timetable
At UKUUG this year,
Owen LeBlanc, a Coda expert if there ever was one, said "if you
have a small number of users and a relatively small amount of data,
then Coda may be just what you need". I also seem to recall him saying
he thought AFS is pretty darn nice. He'd be the one to know.
AFS Frequently Asked Questions
O'Reilly Network Exploring the /proc/net Directory [Nov 22, 2000]
The /proc/ filesystem is a trick the
Linux kernel uses to make certain internal information available to
user-space processes. The kernel presents the information in virtual
files in virtual directories. The files and directories of the
/proc/ filesystems are virtual because the data is not actually
stored on any sort of permanent storage like a hard disk; instead, the
directories, files, and data within them are created dynamically in
memory from raw kernel data whenever you attempt to read from them.
A variety of network information and data is available in the
/proc/net/ directory. In this column we'll take a look at some
of the more useful files available in the /proc/net/ subdirectory
and how you might use them in administration of your network.
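As a hedged illustration of using these virtual files, the sketch below parses per-interface counters in the layout Linux uses for /proc/net/dev: two header lines, then one `name: counters...` line per interface, with receive bytes and packets in the first two columns and transmit bytes and packets in the ninth and tenth. The excerpt above does not say which files the column covers, so the choice of /proc/net/dev and the function name `parse_net_dev` are assumptions for this example.

```python
import os


def parse_net_dev(text):
    """Parse text in /proc/net/dev layout into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:       # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue                          # not a layout this sketch understands
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
        }
    return stats


if __name__ == "__main__":
    path = "/proc/net/dev"
    if os.path.exists(path):                  # only present on Linux
        with open(path) as f:
            for iface, counters in parse_net_dev(f.read()).items():
                print(iface, counters)
```

Because the kernel generates the file on every read, running the script twice a few seconds apart gives two snapshots whose difference is the traffic in between.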
[August 3, 1999] How
to use a Ramdisk for Linux, by Mark Nielsen, LG #44 -- good and important
How-to
The Linux Virtual File-system Layer Neil Brown neilb@cse.unsw.edu.au
and others. 29 December 1999 - v1.6
The Linux operating system supports multiple
different file-systems, including ext2 (the Second Extended file-system),
nfs (the Network File-system), FAT (The MS-DOS File Allocation Table
file system), and others. To enable the upper levels of the kernel to
deal equally with all of these and other file-systems, Linux defines
an abstract layer, known as the Virtual File-system, or vfs. Each lower
level file-system must present an interface which conforms to this Virtual
file-system. This document describes the vfs interface (as present in
Linux 2.3.29). NOTE this document is incomplete.
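The idea of one abstract interface in front of many concrete file systems can be sketched in a few lines. This is not the kernel's vfs interface (which is written in C and has far more operations); `FileSystem`, `MemoryFS`, `GeneratedFS` and `VFS` are hypothetical names, and only a single `read` operation is modelled. The upper layer picks a file system by mount point and calls the same method on whichever one it finds.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Interface every lower-level file system must present (one operation only)."""

    @abstractmethod
    def read(self, path):
        """Return the contents of the file at `path` (relative to the mount)."""


class MemoryFS(FileSystem):
    """Stores file contents directly, like a disk-backed file system would."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, path):
        return self.files[path]


class GeneratedFS(FileSystem):
    """Produces contents on demand, in the spirit of /proc."""

    def __init__(self, generators):
        self.generators = dict(generators)

    def read(self, path):
        return self.generators[path]()


class VFS:
    """Dispatches each path to the file system mounted at the longest matching prefix."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def read(self, path):
        matches = [m for m in self.mounts
                   if path == m or path.startswith(m.rstrip("/") + "/")]
        best = max(matches, key=len)          # raises ValueError if nothing is mounted
        inner = path[len(best):]
        if not inner.startswith("/"):
            inner = "/" + inner
        return self.mounts[best].read(inner)
```

Callers of `VFS.read` never need to know which kind of file system answered, which is the property the vfs layer provides to the upper levels of the kernel.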
Etc
- SFS - Port to Linux? (Score:3)
by Anonymous Coward on Thursday May 20, @10:11AM EDT (#1435)
Just a thought - it hasn't crashed yet
on my Amiga, and I'm using an early beta from ages ago, and
I deliberately tested it by power cycling in the middle of lots
of writes on several occasions.
It's great to never have to use l:disk-validator (Amiga fsck)
again (OK, OK, it's in ROM, not l: on all Amigas above 1.3,
but hey...)
The website has an exceptionally clear description of how the
filesystem has been implemented. It's 64-bit, using the NSD
(New Style Device) API.
It's also free.
Here's the site:
www.xs4all.nl/~hjohn/SFS/
- EDM-2 - Inside the High Performance File System - Part 1-6
- CODA: An advanced networked file system.
- e2compr: Transparent compression for ext2 file systems - a bit like Stacker/DriveSpace on DOS.
- FSDEXT2: Read-only support for accessing ext2fs Linux drives in Win95.
- HFS: Support for accessing Mac's Hierarchical File System.
- reiserfs: A tree-balanced filesystem for Linux.
- smbfs: Mount volumes exported by NT and Win95 over the network.
- TCFS: Transparent Cryptographic File System - a secure NFS.
The BeOS
filesystem (BFS) is a 64-bit journalled filesystem with support for arbitrary
file attributes on any node (i.e., you can apply them to directories and symlinks
as well as regular files).
http://www-classic.be.com/documentation/be_book/The%20Storage%20Kit/index.html
Slashdot Tux2 The Filesystem That Would Be King
http://innominate.org/~phillips/tux2/
Practical File System Design with the Be File System
by
Dominic Giampaolo
Paperback - 256 pages (November 1998)
Morgan Kaufmann Publishers; ISBN: 1558604979 ; Dimensions (in inches):
0.62 x 8.97 x 7.04
Publisher page:
Practical File System Design with the Be File System
This is the new guide to the design and implementation
of file systems in general, and the Be File System (BFS) in particular.
This book covers all topics related to file systems, going into considerable
depth where traditional operating systems books often stop. Advanced
topics are covered in detail such as journaling, attributes, indexing
and query processing. Built from scratch as a modern 64 bit, journaled
file system, BFS is the primary file system for the Be Operating System
(BeOS), which was designed for high performance multimedia applications.
You do not have to be a kernel architect or file system engineer to
use Practical File System Design. Neither do you have to be a BeOS developer
or user. Only basic knowledge of C is required. If you have ever wondered
about how file systems work, how to implement one, or want to learn
more about the Be File System, this book is all you will need.
Features:
- Review of other file systems, including Linux
ext2, BSD FFS, Macintosh HFS, NTFS and SGI's XFS.
- Allocation policies for placing data on disks
and discussion of on-disk data structures used by BFS
- How to implement journaling
- How a disk cache works, including cache interactions
with the file system journal
- File system performance tuning and benchmarks
comparing BFS, NTFS, XFS, and ext2
- A file system construction kit that allows the
user to experiment and create their own file systems
Dominic Giampaolo has a Masters degree in Computer
Science from Worcester Polytechnic and is one of the principal kernel
engineers for Be Inc. His responsibilities include the file system and
various other parts of the kernel.
the Big Picture and the specifics (April 25, 2000)
Reviewer: gseven
If you are worried
that this will only talk about Be file system design, worry
no more. It has overviews of several other major file systems
and their pros and cons before wading into the Be decisions
for a file system and how they are implemented. So, I thought
it was nicely organized and broadly applicable.
I wish every
technical writer were this good. |
March
21, 2000
|
|
Reviewer: A
reader from Texas, USA |
|
I had wanted to
buy this book for some time, but as a Unix Admin, I couldn't
justify the money nor the study time. Well, now that I've bought
it, I'm kicking myself for not doing so earlier. I have gained
a much greater understanding of hashes, trees, filesystems,
and databases. The book is an epitome of clarity of thought
and presentation. It's not often (never?) that I find a technical
book that I want to read cover to cover in one sitting! I only
wish that the author had more time to revisit the BeFS shortcomings
that he mentions, and then GPL the end result.
Worth reading, but not the last word in file system design
January 18, 1999
Reviewer: billtodd@foo.mv.com from rural New Hampshire
This book may be
slightly over-sold on its jacket ("guide to the design and implementation
of file systems in general ... covers all topics related to
file systems") but that's likely not the author's fault. It
does provide intermediate levels of detail regarding many, perhaps
most, areas of concern to file system designers and deserves
a place in the library of anyone embarking on such a project
- though people expecting a cookbook rather than a source of
detailed ideas will be disappointed.
The ideas are in general sound
and representative of the current state of file system practice.
The historical view is a bit Unix-centric - to state that the
Berkeley Fast File System is the ancestor of most modern file
systems is to ignore arguably superior and significantly earlier
implementations from IBM, DEC, and others. This bias carries
over into aspects of implementation as well. One example is the use of
the Unix direct/indirect/double-indirect mapping mechanism to
manage contiguous 'block runs' without adding file address information
to the mapping blocks, which would eliminate the need to scan them
sequentially when positioning within the file (the double-indirect
blocks avoid the scan only by establishing a standard run-length of
4 blocks - arrgh!). The unbalanced Unix-style tree itself would almost
certainly be better implemented as a b-tree variant (with its root
in-line in the i-node) indexed on file address. And the text
occasionally blurs the distinction
between what the BFS chose to implement (a journal system that
forced meta-data update transactions to be serialized) and what
is possible (a multi-threaded journal supporting concurrent
transactions simply by allowing each transaction to submit a
log record for each individual change it makes - which would
also support staged execution of extremely large transactions
eliminating the log size as a constraint on them).
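The block-run mapping point made earlier in this paragraph can be illustrated with a small sketch. This is not BFS code: the `Run` record, its `file_offset` field, and the function names are hypothetical, invented here for illustration. A run list that carries no file addresses has to be walked from the start, summing run lengths, to find the run that covers a given file block. If each run also records the first file block it covers, the same lookup becomes a binary search.

```python
from typing import List, NamedTuple, Optional, Tuple


class Run(NamedTuple):
    """A contiguous extent of `length` blocks starting at disk block `start`."""
    start: int        # first disk block of the run
    length: int       # number of blocks in the run
    file_offset: int  # first file block covered (hypothetical extra field)


def build_runs(extents: List[Tuple[int, int]]) -> List[Run]:
    """Build Run records from (start, length) pairs, filling in file offsets."""
    runs, offset = [], 0
    for start, length in extents:
        runs.append(Run(start, length, offset))
        offset += length
    return runs


def lookup_scan(runs: List[Run], file_block: int) -> Optional[int]:
    """Map a file block to a disk block by summing run lengths: O(n)."""
    if file_block < 0:
        return None
    covered = 0
    for run in runs:
        if file_block < covered + run.length:
            return run.start + (file_block - covered)
        covered += run.length
    return None  # past the end of the file


def lookup_indexed(runs: List[Run], file_block: int) -> Optional[int]:
    """Map a file block to a disk block by binary search on file_offset: O(log n)."""
    lo, hi = 0, len(runs)
    while lo < hi:
        mid = (lo + hi) // 2
        run = runs[mid]
        if file_block < run.file_offset:
            hi = mid
        elif file_block >= run.file_offset + run.length:
            lo = mid + 1
        else:
            return run.start + (file_block - run.file_offset)
    return None  # not covered by any run
```

Both functions return the same answers. The difference is only in how many runs must be examined, which matters once a fragmented file has many runs.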
Some of the choices made in BFS
can be questioned, even in its particular use context. The 'allocation
group' mechanism interacts in subtle ways with the basic file
system block size, and given the relative and on-going improvement
of disk seek time vs. rotational latency the value of locating
related structures relatively near each other (though not actually
adjacent) on disk may no longer justify the added complexity
(though the effort to place file inodes immediately following
the parent directory inode is likely worthwhile if a read-ahead
facility exists to take advantage of it). The discussion of
on-disk placement also ignores 'disks' that may in fact be composed
of multiple striped units, which would further dilute the benefits
of allocation groups; note that this would also complicate the
read-ahead facility just mentioned, as would a shared-disk environment
unless the disk unit itself performed the read-ahead and replication
if present was taken into account (as in the Veritas file system,
as I remember).
Even the fundamental decision
to make attributes indexable deserves closer examination, given
the costs of indexing. Current hardware can perform a complete
inode scan on a single-user workstation fast enough to satisfy
the occasional random query and can scan the inodes for files
within some limited sub-tree of the directory structure (e.g.,
a cluster of e-mail directories) relatively quickly for more
common queries, and in a multi-user environment indexing individual
attributes across all users is frequently not the behavior desired.
Placing index management under explicit application control
may be a better approach, perhaps by allowing the application
to specify on attribute creation the index, if any, in which
its value should be entered (thus preserving the ability to
encapsulate the operation within a system-controlled transaction
without the need for user-level transaction support) - and storing
the index (perhaps by its inode) with the attribute for later
change or deletion.
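A rough sketch of the alternative described in the paragraph above, under loud assumptions: `TinyFS` and all of its methods are hypothetical, it is an in-memory toy, and it does not model transactions at all. It only shows the two ideas in the text. A query can be answered by scanning inodes (optionally limited to a chosen subset), and an attribute enters an index only when the application names one at creation time, with the index name stored next to the value so that a later change or deletion can clean up.

```python
from collections import defaultdict


class TinyFS:
    """Toy model: inodes carry attributes, and indexes are opt-in per attribute."""

    def __init__(self):
        self.attrs = {}  # inode -> {name: (value, index name or None)}
        self.indexes = defaultdict(lambda: defaultdict(set))  # index -> value -> inodes

    def set_attr(self, inode, name, value, index=None):
        """Set an attribute; enter it into `index` only if the caller names one."""
        self.remove_attr(inode, name)  # drop any old value and its index entry
        self.attrs.setdefault(inode, {})[name] = (value, index)
        if index is not None:
            self.indexes[index][value].add(inode)

    def remove_attr(self, inode, name):
        """Delete an attribute, using the stored index name to clean up."""
        old = self.attrs.get(inode, {}).pop(name, None)
        if old is not None and old[1] is not None:
            value, index = old
            self.indexes[index][value].discard(inode)

    def query_scan(self, name, value, inodes=None):
        """Answer a query by scanning all inodes, or only the given subset."""
        candidates = self.attrs.keys() if inodes is None else inodes
        hits = set()
        for inode in candidates:
            entry = self.attrs.get(inode, {}).get(name)
            if entry is not None and entry[0] == value:
                hits.add(inode)
        return hits

    def query_index(self, index, value):
        """Answer a query from an application-chosen index."""
        if index not in self.indexes:
            return set()
        return set(self.indexes[index].get(value, ()))
```

In this sketch a mail program could index its own "from" attributes under an index named "mail", while another user's files with the same attribute name stay out of that index and are still reachable by a scan.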
Conspicuous by their omission
are any mentions of how to manage very large allocation bit-maps
(which one really must expect when other parts of the system
are carefully crafted to handle 2**58-byte files) or the impact
of a shared-disk environment (if BFS was intended to be limited
to desk-top use this may be more understandable, but even desk-tops
may soon have high-availability configurations). Security is
mentioned briefly as a concern to be addressed later - but BFS's
dynamic allocation of inodes from the general space pool makes
this impossible, given that directory inode addresses can apparently
be fed in from user-mode (the author does note this near the
book's end, but fails to discuss possible remedies).
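To give a sense of scale for the bitmap concern raised at the start of the paragraph above, here is a back-of-the-envelope calculation. It assumes one bit per block and a volume as large as the 2**58-byte file size the review mentions. The block sizes used are illustrative assumptions, and `bitmap_bytes` is a hypothetical helper, not anything from the book.

```python
def bitmap_bytes(volume_bytes: int, block_size: int) -> int:
    """Bytes needed for an allocation bitmap with one bit per block."""
    blocks = -(-volume_bytes // block_size)  # ceiling division
    return -(-blocks // 8)                   # eight blocks tracked per bitmap byte


TIB = 2**40

# Assumed 1024-byte blocks: 2**48 blocks, so 2**45 bytes of bitmap.
small_blocks = bitmap_bytes(2**58, 1024)

# Assumed 8192-byte blocks: 2**45 blocks, so 2**42 bytes of bitmap.
large_blocks = bitmap_bytes(2**58, 8192)

print(small_blocks // TIB, "TiB of bitmap with 1 KiB blocks")
print(large_blocks // TIB, "TiB of bitmap with 8 KiB blocks")
```

Under these assumptions the bitmap alone runs to 32 TiB with 1 KiB blocks and 4 TiB with 8 KiB blocks, which is why a flat bitmap that is scanned linearly becomes a design problem in its own right at that scale.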
The author also expresses regret
in the introduction at not having had time to include more comparative
information on other file systems, both current and historical.
Perhaps he is leaving himself room to write a second book. I
hope so: despite my comments above, this one was worthwhile
- both on its own merits, and because of the lack of competition
in this subject area.
Copyright © 1996-2009 by Dr. Nikolai Bezroukov.
www.softpanorama.org was
created as a service to the UN Sustainable Development Networking Programme (SDNP)
in the author's free time.
This document is an industrial compilation designed and created
exclusively for educational use and is placed under the copyright of the
Open Content License (OPL).
The site uses AdSense, so you need to be aware of Google's privacy policy. Copyright of original materials belongs to the respective owners. Quotes are made
for educational purposes only, in compliance with the fair use doctrine.
Disclaimer:
- The statements, views and opinions presented on
this web page are those of the author and are not endorsed by, nor do they necessarily
reflect, the opinions of the author's present and former employers, SDNP or any other
organization the author may be associated with.
- We do not warrant the correctness of the information provided or its
fitness for any purpose.
- This site is in no way associated with, and does not endorse, cybersquatters
using the term "softpanorama" with other main or country domains (e.g. softpanorama.com) with
bad-faith intent to profit from the goodwill belonging to
someone else.
Last modified:
October 25, 2009