Solaris UFS File System
A file system consists of blocks of data. The number of bytes constituting
a block varies depending on the OS. The internal physical structure of a
hard disk consists of cylinders. The hard disk is divided into groups of
cylinders known as cylinder groups, further divided into blocks.
The file system comprises five main components (boot block, super block,
inode block, data block, and vnodes):
- Boot block. The boot block is part of the disk label that
contains a loader used to boot the operating system.
- Super block. All partitions within the Unix filing system
usually contain a special block called the
super block. The super block contains the basic information
about the entire file system. It stores the following details about
the file system:
- The size of the file system
- The status of the file system
- The date and time of the last update
- The pathname of the last mount point
- Cylinder group size
- The name of the partition
- The modification time of the file system
- The number of data blocks
- The list of free and allocated blocks
A super block plays
an important role during the system boot up and shutdown process.
When the system boots, the details in the super block are loaded
into the memory to improve the speed of processing. The super block
is then updated at regular time intervals from the data in the memory.
During system shutdown, a program called sync writes the
updated data in the memory back to the super block. This process
is very crucial because an inaccurate super block might even lead
to an unusable file system. This is precisely why the proper shutdown
of a Solaris system is essential.
Because of the critical nature of the super block, it is replicated
at the beginning of every cylinder group. These blocks are known
as surrogate super blocks. A damaged or corrupted super block is
recovered from one of the surrogate super blocks.
- Inode block. An inode is a kernel structure
that contains pointers to the disk blocks that store data. It also
holds information such as file type, permissions, owner and
group information, file size, file modification time, and so on. Note
that the inode does not contain the filename as part of the
information. The filename is listed in a directory that contains a list
of filenames and related inodes associated with the file. When
a user attempts to access a given file by name, the name is looked up
in the directory where the corresponding inode is found.
An inode stores the following information about every file:
- The type of the file
- The owner
- The group
- The size of the file
- The time and date of the last inode change
- The time and date of last modification
- The time and date of last access
- An array of 15 disk block addresses
Each inode has a unique number associated with it, called
the inode number. The -li option of the ls
command displays the inode number of a file:
# ls -li
When a user creates a file in the directory or modifies it, the following
events occur:
- The Inode of the file is stored in the Inode
block of the file system.
- The file contents are stored in the allocated data blocks referenced
by the Inode.
- The Inode number is stored in the directory.
- Data block
The data block is the storage unit of data in the Solaris file system.
The default size of a data block in the Solaris file system is 8192
bytes. After a block is full, the file is allotted another block. The
addresses of these blocks are stored as an array in the Inode.
The first 12 pointers in the array are direct addresses of the file;
that is, they point to the first 12 data blocks where the file contents
are stored. If the file grows larger than these 12 blocks, then a 13th
block is added, which does not contain data. This block, called an indirect
block, contains pointers to the addresses of the next set of direct
blocks.
If the file grows still larger, then a 14th block is added, which
contains pointers to the addresses of a set of indirect blocks. This
block is called the double indirect block. If the file grows still larger,
then a 15th block is added, which contains pointers to the addresses
of a set of double indirect blocks. This block is called the triple
indirect block.
- Vnodes. A Virtual Node or vnode is a data structure
that represents an open file, directory, or device that appears in the
file system namespace. A vnode does not expose the type of physical
file system it implements. The vnode interface allows high-level
operating system modules to perform uniform operations on vnodes.
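The direct and indirect addressing scheme described under the data block item can be checked with a little arithmetic. The following is a minimal sketch, not a statement of the actual on-disk limits: it uses the 8192-byte block size and the 12 direct pointers given in the text, and it assumes that each block address occupies 4 bytes, which the text does not state.

```python
# Sketch: how much file data each tier of inode block pointers can address.
# ASSUMPTION: each block address occupies 4 bytes; the text does not say so.

BLOCK_SIZE = 8192      # default data block size given in the text
ADDRESS_SIZE = 4       # assumed size of one block address, in bytes
DIRECT_POINTERS = 12   # first 12 entries of the 15-entry address array


def addressable_bytes(block_size=BLOCK_SIZE, address_size=ADDRESS_SIZE):
    """Return the bytes reachable through each kind of pointer in one inode."""
    per_block = block_size // address_size  # addresses held by one indirect block
    return {
        "direct": DIRECT_POINTERS * block_size,
        "single indirect": per_block * block_size,
        "double indirect": per_block ** 2 * block_size,
        "triple indirect": per_block ** 3 * block_size,
    }


if __name__ == "__main__":
    for tier, size in addressable_bytes().items():
        print(f"{tier:>16}: {size:,} bytes")
```

These are theoretical figures for the pointer tree alone. The maximum file size a real file system allows may be lower because of other limits.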
Links
Hard and soft links are a useful feature of Unix. A link is a reference in
a directory to a file stored in another directory. A soft link
can also be a reference to a directory. There might be multiple links to a
file. Links eliminate redundancy because you do not need to store multiple
copies of a file.
Links are of two types: hard and soft (also known as symbolic).
- A hard link is a pointer to a file and is indistinguishable
from the original directory entry. Any changes to a file are independent
of the name used to reference the file. Hard links may not span file
systems and may not refer to directories. In other words hard links
are "synonyms" for a file and technically structured as a real directory
entry. All hard links are equal. There is no way to tell which is primary
and which is secondary. Every hard link must reside on the same
mounted filesystem (usually a disk or a part of a disk). You cannot
make a new hard link to a file that is on a different mounted filesystem.
Hard links cannot be made for directories (actually you can make them
if you are root, but all the consequences are yours: what is ".."
for children of such a "multiple personalities" directory?). The
ln command by default creates hard links.
- A symbolic link (sometimes called a soft link) is a special
file that contains the path to another file (the target), much like a shortcut
in Windows. Unlike a hard link, a symbolic link is asymmetrical, and
it is easy to tell which file is the link and which is the actual file.
This difference gives symbolic links certain qualities that hard links
do not have, such as the ability to link to directories, or to files
on remote computers networked through NFS. Also, when you delete a target
file, symbolic links to that file become unusable, whereas hard links
preserve the contents of the file.
To create a symbolic link, you must use the -s option with the
ln command. Files that are soft linked show an l as the
first character of the access permission string displayed by the ls -l
command, whereas those that are hard linked do not show the l
symbol. A directory can be symbolically linked. However, it cannot
be hard linked.
It is obvious that no file exists with a link count less than one.
Relative pathnames . or .. are nothing but links for the
current directory and its parent directory. These are present in every
directory: any directory stores the two links ., .. and
the Inode numbers of the files. They can be listed with the ls
-lia command. A directory must have a minimum of two links. The number
of links increases as the number of subdirectories increases. Whenever you
issue a command to list the file attributes, it refers to the Inode
block with the Inode number and the corresponding data is retrieved.
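The differences between hard and symbolic links described above can be observed directly. The following is a minimal sketch using Python's standard os module on a Unix-like system. The file names are hypothetical, and the script works in a temporary directory rather than on a real Solaris file system.

```python
import os
import tempfile


def link_demo(workdir):
    """Create a file, a hard link and a symbolic link; report what happens."""
    target = os.path.join(workdir, "original.txt")  # hypothetical file names
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")

    with open(target, "w") as f:
        f.write("hello\n")

    os.link(target, hard)      # comparable to: ln original.txt hard.txt
    os.symlink(target, soft)   # comparable to: ln -s original.txt soft.txt

    result = {
        # A hard link is another directory entry for the same inode.
        "same_inode": os.stat(target).st_ino == os.stat(hard).st_ino,
        "link_count": os.stat(target).st_nlink,
        "soft_is_link": os.path.islink(soft),
        "hard_is_link": os.path.islink(hard),
    }

    os.remove(target)          # delete the original name
    with open(hard) as f:      # the data is still reachable via the hard link
        result["hard_after_delete"] = f.read()
    # os.path.exists follows the symbolic link, which now points nowhere.
    result["soft_usable_after_delete"] = os.path.exists(soft)
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for key, value in link_demo(d).items():
            print(key, "=", repr(value))
```

After the original name is removed, the hard link still returns the file contents, while the symbolic link dangles.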
Solaris File Systems and Their Functions
Each file system used in Solaris is intended for a specific purpose.
The root file system is at the top of an inverted tree structure. It
is the first file system that the kernel mounts during booting. It contains
the kernel and device drivers. The / directory is also called the
mount point directory of the file system. All references in the
file system are relative to this directory. The entire file system structure
is attached to the main system tree at the root directory during the process
of mounting, and hence the name. During the creation of the file system,
a lost+found directory is created within the mount point directory.
This directory holds any unreferenced files
that are found during the customary file system check, which you do with
the fsck command.
/ (root)
The directory located at the top of the Unix file system. It is represented
by the "/" (forward slash) character.
/usr Contains commands and programs for system-level usage and
administration.
/var Contains system log files and spooling files, which grow
in size with system usage.
/home Contains user home directories.
/opt Contains optional third-party software and applications.
/tmp Contains temporary files, which are cleared each time the
system is booted.
/proc Contains information about all active processes.
You create file systems with the newfs command. The newfs
command accepts only logical raw device names. The syntax is as follows:
newfs [ -v ] [ mkfs-options ] raw-special-device
For example, to create a file system on the disk slice c0t3d0s4,
the following command is used:
# newfs -v /dev/rdsk/c0t3d0s4
The -v option prints the actions in verbose mode. The newfs
command calls the mkfs command to create a file system. You can
invoke the mkfs command directly by specifying a -F option
followed by the type of file system.
Mounting File Systems
Mounting file systems is the next logical step after creating file systems.
Mounting refers to naming the file system and attaching it to the inverted
tree structure. This enables access from any point in the structure. A file
system can be mounted during booting, manually from the command line, or
automatically if you have enabled the automount feature.
With remote file systems, the server shares the file system over the
network and the client mounts it.
The / and /usr file systems, as mentioned earlier,
are mounted during booting. To mount a file system, attach it to a directory
anywhere in the main inverted tree structure. This directory is known as
the mount point. The syntax of the mount command is as follows:
# mount <logical block device name> <mount point>
The following steps mount a file system c0t2d0s7 on the
/export/home directory:
# mkdir /export/home
# mount /dev/dsk/c0t2d0s7 /export/home
You can verify the mounting by using the mount command, which
lists all the mounted file systems.
Note: If the mount point directory has any content prior to the
mounting operation, it is hidden and remains inaccessible until the file
system is unmounted.
Data is stored and retrieved from the physical disk where the file system
is mounted. Although there are no defined specifications for creating
the file systems on the physical disk, slices are usually allocated as follows:
0. Root or / Files and directories of the OS.
1. Swap Virtual memory space.
2. Refers to the entire disk.
3. /export Different OS versions.
4. /export/swap Unused. Left to user's choice.
5. /opt Application software added to a system.
6. /usr OS commands by users.
7. /home Files created by users.
The slices shown above are all allocated on a single disk. However,
there is no restriction that all file systems need to be located on a single
disk. They can also span across multiple disks. Slice 2 refers to the entire
disk. Hence, if you want to allocate an entire disk for a file system, you
can do so by creating it on slice 2. The mount command supports
a variety of useful options.
| Option | Description |
| -o largefiles | Files larger than 2GB are supported in the file system. |
| -o nolargefiles | Does not mount file systems with files larger than 2GB. |
| -o rw | File system is mounted with read and write permissions. |
| -o ro | File system is mounted with read-only permission. |
| -o bg | Repeats mount attempts in the background. Used with non-critical file systems. |
| -o fg | Repeats mount attempts in the foreground. Used with critical file systems. |
| -p | Prints the list of mounted file systems in /etc/vfstab format. |
| -m | Mounts without making an entry in the /etc/mnttab file. |
| -O | Performs an overlay mount. Mounts over an existing mount point. |
The mountall command mounts all file systems that have the
mount at boot field in the /etc/vfstab file set to yes.
It can also be used anytime after booting.
Unmounting File Systems
A file system can be unmounted with the umount command. The
following is the syntax for umount:
umount <mount-point or logical block device name>
File systems cannot be unmounted when they are in use or when the umount command is issued from any subdirectory within the file system mount point.
Note: A file system can be unmounted forcibly if you use the
-f option of the umount command. Please refer to the man
page to learn about the use of these options.
The umountall command is used to unmount a group of file systems.
The umountall command unmounts all file systems in the /etc/mnttab
file except the /, /usr, /var, and /proc
file systems. If you want to unmount all the file systems from a specified
host, use the -h option. If you want to unmount all the file systems
mounted from remote hosts, use the -r option.
/etc/vfstab File
The /etc/vfstab (Virtual File System Table) file plays a very
important role in system operations. This file contains one record for every
device that has to be automatically mounted when the system enters run level
2.
| Column Name | Description |
| device to mount | The logical block name of the device to be mounted. It can also be a remote resource name for NFS. |
| device to fsck | The logical raw device name to be subjected to the fsck check during booting. It is not applicable for read-only file systems, such as High Sierra File System (HSFS), and network file systems, such as NFS. |
| Mount point | The mount point directory. |
| FS type | The type of the file system. |
| fsck pass | The number used by fsck to decide whether the file system is to be checked. 0: file system is not checked. 1: file system is checked sequentially. 2: file system is checked simultaneously along with other file systems where this field is set to 2. |
| Mount at boot | Determines whether the file system is mounted by the mountall command at boot time. The options are either yes or no. |
| Mount options | The mount options to be supported by the mount command while the particular file system is mounted. |
Note the no values in the mount at boot field for the root,
/usr, and /var file systems. These are mounted by default.
The fd entry refers to the file descriptor file system mounted on /dev/fd, and the swap entry
refers to the tmpfs in the /tmp directory.
A sample vfstab file looks like:
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
#
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/dsk/c0t0d0s4 - - swap - no -
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s6 /dev/rdsk/c0t0d0s6 /usr ufs 1 no -
/dev/dsk/c0t0d0s3 /dev/rdsk/c0t0d0s3 /var ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
/dev/dsk/c0t0d0s5 /dev/rdsk/c0t0d0s5 /opt ufs 2 yes -
/dev/dsk/c0t0d0s1 /dev/rdsk/c0t0d0s1 /usr/openwin ufs 2 yes -
swap - /tmp tmpfs - yes -
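To make the seven-column layout concrete, here is a minimal sketch of a vfstab reader. The names VfstabEntry and parse_vfstab are hypothetical helpers, not part of Solaris. The sketch assumes whitespace-separated fields and treats lines starting with # as comments, as in the sample.

```python
from typing import List, NamedTuple


class VfstabEntry(NamedTuple):
    """One record of /etc/vfstab, in column order."""
    device_to_mount: str
    device_to_fsck: str
    mount_point: str
    fs_type: str
    fsck_pass: str
    mount_at_boot: str
    mount_options: str


def parse_vfstab(text: str) -> List[VfstabEntry]:
    """Parse vfstab-style text, skipping blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        entries.append(VfstabEntry(*fields))
    return entries


# A few records taken from the sample file.
SAMPLE = """\
#device device mount FS fsck mount mount
/proc - /proc proc - no -
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
swap - /tmp tmpfs - yes -
"""

if __name__ == "__main__":
    for entry in parse_vfstab(SAMPLE):
        if entry.mount_at_boot == "yes":
            print(entry.mount_point, entry.fs_type)
```

Filtering on the mount at boot column gives the set of file systems that mountall would handle.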
Finding Information About the Mounted File Systems
The /etc/mnttab file comprises a table that defines which partitions
and/or disks are currently mounted by the system.
The /etc/mnttab file contains the following details about each
mounted file system:
- The file system name
- The mount point directory
- The file system type
- The mount command options
- A number denoting the time at which the file system was mounted
A sample mnttab file:
/dev/dsk/c0t0d0s0 / ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934
/dev/dsk/c0t0d0s6 /usr ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200006 1014366934
/proc /proc proc dev=4300000 1014366933
mnttab /etc/mnttab mntfs dev=43c0000 1014366933
fd /dev/fd fd rw,suid,dev=4400000 1014366935
/dev/dsk/c0t0d0s3 /var ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200003 1014366937
swap /var/run tmpfs xattr,dev=1 1014366937
swap /tmp tmpfs xattr,dev=2 1014366939
/dev/dsk/c0t0d0s5 /opt ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200005 1014366939
/dev/dsk/c0t0d0s7 /export/home ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200007 1014366939
/dev/dsk/c0t0d0s1 /usr/openwin ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200001 1014366939
-hosts /net autofs indirect,nosuid,ignore,nobrowse,dev=4580001 1014366944
auto_home /home autofs indirect,ignore,nobrowse,dev=4580002 1014366944
-xfn /xfn autofs indirect,ignore,dev=4580003 1014366944
sun:vold(pid295) /vol nfs ignore,dev=4540001 1014366950
#
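The five mnttab fields listed above can be pulled apart with a few lines of code. This is a minimal sketch in which parse_mnttab_line is a hypothetical helper. It assumes the last field is a count of seconds since the Unix epoch, which the text describes only as a number denoting the mount time.

```python
from datetime import datetime, timezone


def parse_mnttab_line(line):
    """Split one mnttab record into its five whitespace-separated fields."""
    special, mount_point, fs_type, options, mount_time = line.split()
    return {
        "special": special,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options.split(","),
        # ASSUMPTION: the number is seconds since the Unix epoch (UTC).
        "mounted_at": datetime.fromtimestamp(int(mount_time), tz=timezone.utc),
    }


if __name__ == "__main__":
    record = parse_mnttab_line(
        "/dev/dsk/c0t0d0s0 / ufs "
        "rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934"
    )
    print(record["mount_point"], record["fs_type"], record["mounted_at"].isoformat())
```

Under that assumption, the root file system in the sample was mounted in February 2002.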
Restricting the File Size
Some applications and processes create temporary files that occupy a
lot of hard disk space. As a result, it is necessary to impose a restriction
on the size of the files that are created.
Solaris provides tools to control the storage. They are:
- The ulimit command
- Disk quotas
ulimit Command
The ulimit command is a built-in shell command, which displays
the current file size limit. The default value for the maximum file size,
set inside the kernel, is 1500 blocks. The following syntax displays the
current limit:
$ ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
If the limit is not set, it reports as unlimited.
The system administrator and the individual users can change this value to
set the file size at the system level and at the user level, respectively.
The following is the syntax of the ulimit command:
ulimit <value>
For example, the following command sets the file size limit to 1600 blocks:
# ulimit 1600
# ulimit -a
time(seconds) unlimited
file(blocks) 1600
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
#
The file size can be limited at the system level or the user level. To
set it at the system level, change the value of the ulimit variable
in the /etc/profile file. To set it at the user level, change the
value in the .profile file present in the user's home directory.
The user-level setting always takes precedence over the system-level setting.
It is the user's profile file that sets the working environment.
Note: The ulimit values set at the user level and system level
cannot exceed the default ulimit value set in the kernel.
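A process can also inspect its own file size limit programmatically. The following is a minimal sketch using Python's standard resource module on a Unix-like system, not the shell built-in itself. The 512-byte block size used to convert the 1600-block example is an assumption about the unit ulimit reports, not something stated in the text.

```python
import resource

BLOCK_BYTES = 512  # ASSUMPTION: ulimit counts file size in 512-byte blocks


def blocks_to_bytes(blocks, block_bytes=BLOCK_BYTES):
    """Convert a ulimit-style block count into bytes."""
    return blocks * block_bytes


def describe_limit(value):
    """Render a limit the way ulimit -a does when no limit is set."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def file_size_limits():
    """Return the (soft, hard) file size limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_FSIZE)


if __name__ == "__main__":
    soft, hard = file_size_limits()
    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
    print("1600 blocks =", blocks_to_bytes(1600), "bytes")
```

The sketch only reads the limits. Lowering them from inside a running process would affect every file that process writes afterwards.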
Blogged by matty under Solaris Storage, Sat 29 Jan 2005 12:14 am
I recently needed to grow a Solaris UFS
file system, and accomplished this with
the growfs(1m)
utility. The growfs(1m) utility takes two
arguments. The first argument to
growfs
( the value passed to -M ) is the mount
point of the file system to grow. The second
argument is the raw device that backs this
mount point. The following example will
grow /test to the maximum size available
on the meta device d100:
$ growfs -M /test /dev/md/rdsk/d100
To see how many sectors will be available
on d100 after the grow operation completes,
you can run newfs with the -N option,
and compare that with the current value
of df (1m):
$ newfs -N /dev/md/dsk/d100
/dev/md/rdsk/d0: 232331520 sectors in 56944
cylinders of 16 tracks, 255 sectors
113443.1MB in 2191 cyl groups (26 c/g, 51.80MB/g,
6400 i/g)
This will report the number of sectors,
cylinders and MBs that would be allocated
if a new file system was created on meta
device d100. As always, test everything
on a non critical system prior to making
changes to critical boxen.
Recently, I wanted to create a UFS file system on a Maxtor OneTouch
II external hard drive I have. I wanted to use the external hard drive
for storing some large files and I was going to use the drive exclusively
with one of my Solaris systems. Now, I didn't find much information
on the web about how to perform this with Solaris (maybe I wasn't searching
very well or something) so I thought I would post the procedure I followed
here so I'll know how to do it again if I need to.
After plugging the hard drive into my system via one of the USB ports,
we can verify that the disk was recognized by the OS by examining the
/var/adm/messages file. With the hard drive I was using,
I saw entries like the following:
Mar 2 13:10:33 solaris-filer usba: [ID 912658 kern.info] USB 2.0 device (usbd49,7100) operating at hi speed (USB 2.x) on USB 2.0 root hub: storage@3, scsa2usb0 at bus address 2
Mar 2 13:10:33 solaris-filer usba: [ID 349649 kern.info] Maxtor OneTouch II L60LHYQG
Mar 2 13:10:33 solaris-filer genunix: [ID 936769 kern.info] scsa2usb0 is /pci@0,0/pci1028,11d@1d,7/storage@3
Mar 2 13:10:33 solaris-filer genunix: [ID 408114 kern.info] /pci@0,0/pci1028,11d@1d,7/storage@3 (scsa2usb0) online
Mar 2 13:10:33 solaris-filer scsi: [ID 193665 kern.info] sd1 at scsa2usb0: target 0 lun 0
The dmesg command could also be
used to see similar information. Also, we could use the rmformat command
(this lists removable media) to see this information in a much nicer
format like so:
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: QSI CDRW/DVD SBW242U UD25
Device Type: DVD Reader
2. Logical Node: /dev/rdsk/c2t0d0p0
Physical Node: /pci@0,0/pci1028,11d@1d,7/storage@3/disk@0,0
Connected Device: Maxtor OneTouch II 023g
Device Type: Removable
#
Now that we know the drive has been identified by Solaris (as /dev/rdsk/c2t0d0p0)
we need to create one Solaris partition (this is Solaris 10 running
on the x86 architecture) that uses the whole disk. This is accomplished
by passing the -B flag to the fdisk command,
like so:
# fdisk -B /dev/rdsk/c2t0d0p0
Now we will print the disk table to standard out like so:
# fdisk -W - /dev/rdsk/c2t0d0p0
This will output the following information to the screen for the hard
drive I am using:
* /dev/rdsk/c2t0d0p0 default fdisk table
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 36483 cylinders
*
* systid:
* 1: DOSOS12
* 2: PCIXOS
* 4: DOSOS16
* 5: EXTDOS
* 6: DOSBIG
* 7: FDISK_IFS
* 8: FDISK_AIXBOOT
* 9: FDISK_AIXDATA
* 10: FDISK_0S2BOOT
* 11: FDISK_WINDOWS
* 12: FDISK_EXT_WIN
* 14: FDISK_FAT95
* 15: FDISK_EXTLBA
* 18: DIAGPART
* 65: FDISK_LINUX
* 82: FDISK_CPM
* 86: DOSDATA
* 98: OTHEROS
* 99: UNIXOS
* 101: FDISK_NOVELL3
* 119: FDISK_QNX4
* 120: FDISK_QNX42
* 121: FDISK_QNX43
* 130: SUNIXOS
* 131: FDISK_LINUXNAT
* 134: FDISK_NTFSVOL1
* 135: FDISK_NTFSVOL2
* 165: FDISK_BSD
* 167: FDISK_NEXTSTEP
* 183: FDISK_BSDIFS
* 184: FDISK_BSDISWAP
* 190: X86BOOT
* 191: SUNIXOS2
* 238: EFI_PMBR
* 239: EFI_FS
*
* Id Act Bhead Bsect Bcyl Ehead Esect Ecyl Rsect Numsect
191 128 0 1 1 254 63 1023 16065 586083330
We now need to calculate the maximum amount of usable storage. This
is done by multiplying bytes/sector (512 in my case) by the number
of sectors listed at the bottom of the output shown above. We then divide
this number by 1024*1024 to yield MBs.
So in my case, this will work out as 286173.5009765625 MB.
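The calculation above can be reproduced in a couple of lines. This is a small sketch that takes the 512 bytes/sector value and the Numsect figure from the fdisk output. The function name sectors_to_mb is a hypothetical helper.

```python
BYTES_PER_SECTOR = 512   # from the "Dimensions" header of the fdisk output
NUM_SECTORS = 586083330  # Numsect column of the partition entry


def sectors_to_mb(num_sectors, bytes_per_sector=BYTES_PER_SECTOR):
    """Convert a sector count to megabytes, where 1 MB = 1024 * 1024 bytes."""
    return num_sectors * bytes_per_sector / (1024 * 1024)


if __name__ == "__main__":
    print(sectors_to_mb(NUM_SECTORS), "MB")
```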
Now, we need to set up a partition table file. This will be a regular
text file and you can name it whatever you like. For the sake of this
post, I will name it disk_slices.txt. The contents of this file are:
slices: 0 = 2MB, 286170MB, "wm", "root" :
1 = 0, 1MB, "wu", "boot" :
2 = 0, 286172MB, "wm", "backup"
To create these slices on the disk, we run:
# rmformat -s disk_slices.txt /dev/rdsk/c2t0d0p0
# devfsadm
# devfsadm -C
To create the UFS file system on the newly created slice, I run the
following and the output from running this command is also shown:
# newfs /dev/rdsk/c2t0d0s0
newfs: construct a new file system /dev/rdsk/c2t0d0s0: (y/n)? y
/dev/rdsk/c2t0d0s0: 586076160 sectors in 95390 cylinders of 48 tracks, 128 sectors
286170.0MB in 5962 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
...............................................................................
........................................
super-block backups for last 10 cylinder groups at:
585105440, 585203872, 585302304, 585400736, 585499168, 585597600, 585696032,
585794464, 585892896, 585991328
#
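The figures in the newfs summary line are consistent with one another, which can be verified with simple arithmetic. The sketch below assumes 512-byte sectors (as in the fdisk output) and assumes that a partial final cylinder group is counted as a whole group. The function name geometry is a hypothetical helper.

```python
SECTOR_BYTES = 512  # ASSUMPTION: 512-byte sectors, as reported by fdisk
MB = 1024 * 1024


def geometry(total_sectors, tracks_per_cyl, sectors_per_track, cyls_per_group):
    """Derive cylinders, cylinder groups, MB per group and total MB."""
    sectors_per_cyl = tracks_per_cyl * sectors_per_track
    cylinders = total_sectors // sectors_per_cyl
    # Ceiling division: a partial last group is counted as a group (assumption).
    groups = -(-cylinders // cyls_per_group)
    group_mb = cyls_per_group * sectors_per_cyl * SECTOR_BYTES / MB
    total_mb = total_sectors * SECTOR_BYTES / MB
    return cylinders, groups, group_mb, total_mb


if __name__ == "__main__":
    # Values copied from the newfs output: 586076160 sectors,
    # 48 tracks, 128 sectors per track, 16 cylinders per group.
    cylinders, groups, group_mb, total_mb = geometry(586076160, 48, 128, 16)
    print(f"{cylinders} cylinders, {groups} cyl groups, "
          f"{group_mb:.2f}MB/g, {total_mb:.1f}MB total")
```

With the values from the output, the sketch arrives at the same cylinder count, group count, group size, and total size that newfs printed.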
And now I'm finished. I now have a UFS file system created on my USB
hard drive which can be mounted by my Solaris system. To mount this
file system, I can just:
# mount -F ufs /dev/dsk/c2t0d0s0 /u01
This document describes a methodology for configuring a fast file
system that handles many small files on the Solaris Operating System.
This could be used for building a Java technology-based product or for
handling many operations on a large number of small files. This
methodology utilizes a tmpfs volume, and it can speed up operations
approximately three times.
The requirements are as follows:
- Solaris 7 OS through Solaris 10 OS Update 1
- Some experience with Solaris system administration. This procedure
is not recommended for UNIX users who are uncomfortable with using
mount, maintaining
/etc/vfstab, or modifying their kernel
parameters.
Warning: Do not develop on a tmpfs volume. A tmpfs volume
is only persistent while the system is powered up, so a power loss or
system problem will cause you to lose any changes to that volume.
Procedure
Solaris tmpfs volumes are easy to create, but require a significant
amount of RAM and swap space. It is recommended that you have at least
1 Gbyte of RAM, but there have also been major performance gains on
systems with 512 Mbytes of RAM. In addition, you should add twice as
much swap space as the tmpfs volume you are creating. That is, for a
2-Gbyte tmpfs volume, add 4 Gbytes of swap space to the system. Feel
free to experiment with these values.
The following examples are for a 2-Gbyte tmpfs volume, which is approximately
what is needed to do a developer build. Replace
<swapfilename> with the absolute path to a
swapfile (such as /disk1/swapfile),
and <mountpoint> with the absolute path to
where you want the tmpfs volume mounted (such as
/ramdisk).
Add swap space to your workstation:
root# /usr/sbin/mkfile 2000m <swapfilename>
Create a mount point for the tmpfs volume:
root# mkdir <mountpoint>
Edit your /etc/vfstab file to use the
swap and create the tmpfs volume at boot time. Add the following two
lines:
<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m
Note that on the Solaris 7 OS you may not make a single tmpfs volume
larger than 2 Gbytes.
Edit your kernel parameters to increase the number of files you can
create in the tmpfs volume. Add the following line to your
/etc/system file. (We've had the most success
using this value.)
set tmpfs:tmpfs_maxkmem=250000000
Reboot your workstation. Then verify that the tmpfs volume exists
at the size you specified:
% df -k <mountpoint>
Make the tmpfs volume writable. Note: This step is necessary
after each reboot of the workstation.
root# chmod 777 <mountpoint>
More UFS technical tidbits in
anticipation of OpenSolaris.
Today's talk is about UFS I/O. It is a complicated beast and has
many different parts and paths it can take.
Overview of file system I/O in Solaris:

The interaction of UFS and the VM subsystem has been the cause of
numerous bugs, and hard to find problems. Today's blog is an overview
of the UFS I/O, with particular attention paid to the VM subsystem
interaction. Details on the paths taken when a read() system
call is initiated are to show the interaction of UFS and the VM
subsystem. I am making some assumptions here that the readers of
this blog will have some basic Solaris file system knowledge, or
at a minimum some of the basic Solaris file system terminology is
understood.
Basic Solaris VM facts
Solaris virtual memory is demand
paged, and globally managed. There is integrated file caching and
it is layered to allow VM to describe multiple memory types. The
paging vnode cache is the unification of file and memory management
by use of a vnode object. 1 page of memory == <vnode, offset>
tuple. The UFS file system uses this relationship to implement
caching for vnodes. The paging vnode cache provides a set of functions
for cache management and I/O for vnodes.
The paging vnode cache functions
are named with a pvn_<xxx> prefix. The source code for this
is located at: xxxx. Some of the more important paging vnode functions
are listed below, with basic function descriptions. Also shown are
pointers to the code so you can get more detailed data about each
of these.
Some important paging vnode
cache functions:
pvn_read_kluster():
pvn_write_kluster():
-
Finds dirty pages within the
offset and length. Returns a list of locked pages ready to be
written.
-
Caller then sets up write call
with pageio_setup().
-
Write is initiated via a call
to bdev_strategy().
-
Synchronous writes require the
caller to call pvn_write_done(). Otherwise io_done()
will call this when write is complete.
pvn_vplist_dirty():
What is a seg_map and why do you care?
The
seg_map segment maintains mappings of pieces of files into
kernel address space. It is only used by file systems and it allows
copying of data to or from user to kernel address space. At any
given time, seg_map segment has some portion of total file
system cache mapped into the kernel address space. The seg_map
segment driver divides the segment into file system block sized
slots.
Some important seg_map functions:
segmap_getmap() && segmap_getmapflt():
-
Retrieves or creates mapping
-
getmapflt allows for creation
of segment if not found, calls ufs_getpage()
segmap_release():
segmap_pagecreate():
Important
in the mapping and getting data from the segmap driver is the fbuf
structure. It is defined as follows:
struct fbuf {
caddr_t fb_addr;
uint_t fb_count;
};
This structure is used to get a mapping to part of a file via the segkmap
interfaces. It is also used by the pseudo bio functions (shown below)
for reading and writing of data. fbuf is used by directory
reading to get UFS on-disk contents via a call to blkatoff().
seg_vn and UFS and memory
mapped I/O:
Memory mapping allows for a file
to be mapped into a process's address space. This mapping is done
via the VOP_MAP call and the seg_vn memory driver.
File pages are read when a fault occurs in the address space. The
seg_vn driver enables I/O without process-initiated system
calls. I/O is performed, in units of pages, upon reference to the
pages mapped into the address space. Reads are initiated by a memory
access; writes are initiated as the VM subsystem finds dirty pages
in the mapped address space.
So, why not use the seg_vn
driver for non-mmap'd I/O as well? It could be used for mapping
the file into the kernel's address space, but seg_vn is
a complex segment driver that manages the mapping of protections,
copy-on-write fault handling, shared memory, and so on. This is too heavy
weight for what is needed for read and write system calls, so the
seg_map driver was developed. Read and write system calls
only require a few basic mapping functions since they do not map
files into a process's address space. seg_map reduces locking
complexity and gives better performance.
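From user space, the two I/O styles contrasted above look like this. The following is a minimal sketch using Python's standard mmap module on a temporary file. It only illustrates the difference between an explicit read call and access through a mapping, and it says nothing about which kernel segment driver services either path on a given system.

```python
import mmap
import os
import tempfile


def read_with_syscall(path):
    """Read the file through ordinary read calls."""
    with open(path, "rb") as f:
        return f.read()


def read_with_mmap(path):
    """Map the file into the address space and copy the bytes out of memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:]  # slicing touches the mapped pages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"UFS" * 4096)
        same = read_with_syscall(path) == read_with_mmap(path)
        print("identical contents:", same, "size:", os.path.getsize(path))
```

In the mapped version no read call is issued for the data itself. The pages are brought in when the slice touches them.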
Pseudo bio functions:
Solaris has a set of interfaces
which are considered buffered I/O interfaces, but that are used
to read and write buffers containing directory entries only. These
interfaces all use the seg_map driver for mapping to address
file data. The functions are fbread(), fbwrite(), fbrelse(),
fbdwrite(), fbiwrite(), fbzero(). Although these are not directly
shown in the picture above, they are important enough to be worth
mentioning.
A UFS/VM example, read()
system call - non mmap'd:
Note:
In general UFS caches the pages for write, but will also cache pages
for reads if they are frequently reusable.
read()->ufs_read()->rdip():
The history of "Solaris UFS bug"
-
Alan Hargreaves' Weblog the opinion of a specialist not a
clueless security junkie or press lemming
Just noticed that Solaris has
an entry in
Month of Kernel bugs. While I agree that we have an issue
that needs looking at, I also believe that the contributor is making
much more of it than it really deserves.
First off, to paraphrase the issue:
If I give you a specially massaged filesystem and can convince
someone with the appropriate privilege to mount it, it will
crash the system.
I'd hardly call this a "denial of service", let alone exploitable.
First off, in order to perform a mount operation of a ufs filesystem,
you need sys_mount privilege. In Solaris, we currently
are running under the concept of "least privilege". That is, a process
is given the least amount of privilege that it needs to run. So,
in order to exploit this you need to convince someone with the appropriate
level of privilege to mount your filesystem. This would also involve
a bit of social engineering which went unmentioned.
That being said, the system should not panic off this filesystem
and I will log a bug to this effect. It is a shame that the contributor
did not make the crashdump files available as it would certainly
speed up any analysis.
One other thing that I should add is that anyone who tries
to mount an unknown ufs filesystem without at least running
"fsck -n" over it probably deserves what they get.
Copyright © 1996-2012 by Dr. Nikolai Bezroukov.
www.softpanorama.org was
created as a service to the UN Sustainable Development Networking Programme (SDNP)
in the author free time. This document is an industrial compilation designed and created
exclusively for educational use and is distributed under the
Softpanorama Content License.
Last modified:
August 14, 2011