Softpanorama (slightly skeptical) Open Source Software Educational Society
May the source be with you, but remember the KISS principle ;-)
Solaris UFS File System
A file system consists of blocks of data. The number of bytes constituting a
block varies depending on the OS. The internal physical structure of a hard disk
consists of cylinders. The hard disk is divided into groups of cylinders known as
cylinder groups, further divided into blocks.
The file system comprises five main structures (boot block, super block, inode
block, data block, and vnodes):
- Boot block. The boot block is part of the disk label that contains
a loader used to boot the operating system.
- Super block. All partitions within the Unix filing system usually
contain a special block called the super block.
The super block contains the basic information about the entire file system.
It stores the following details about the file system:
- The size of the file system
- The status of the file system
- The date and time of the last update
- The pathname of the last mount point
- Cylinder group size
- The name of the partition
- The modification time of the file system
- The number of data blocks
- The list of free and allocated blocks
A super block plays an important
role during the system boot up and shutdown process. When the system boots,
the details in the super block are loaded into the memory to improve the
speed of processing. The super block is then updated at regular time intervals
from the data in the memory. During system shutdown, a program called
sync writes the updated data in the memory back to the super block.
This process is very crucial because an inaccurate super block might even
lead to an unusable file system. This is precisely why the proper shutdown
of a Solaris system is essential.
Because of the critical nature of the super block, it is replicated at
the beginning of every cylinder group. These blocks are known as surrogate
super blocks. A damaged or corrupted super block is recovered from one of
the surrogate super blocks.
- Inode block. An inode is a structure that holds a file's metadata,
such as file type, permissions, owner and group information, file size,
and file modification time, together with pointers to the disk blocks that
store the file's data. Note that the inode does not contain
the filename. The filename is listed in a directory,
which contains a list of filenames and the inode numbers associated with
them. When a user attempts to access a given file by name, the name is looked
up in the directory, where the corresponding inode number is found. The inode
stores the following information about every file:
- The type of the file
- The owner
- The group
- The size of the file
- The time and date of the last inode change
- The time and date of last modification
- The time and date of last access
- An array of 15 disk block addresses
Each inode has a unique number associated with it, called the
inode number. The -li option of the ls command displays
the inode number of a file:
# ls -li
When a user creates a file in the directory or modifies it, the following
events occur:
- The Inode of the file is stored in the Inode block
of the file system.
- The file contents are stored in the allocated data blocks referenced
by the Inode.
- The Inode number is stored in the directory.
- Data block. The data block is the storage unit of data in the Solaris file system. The
default size of a data block in the Solaris file system is 8192 bytes. After
a block is full, the file is allotted another block. The addresses of these
blocks are stored as an array in the Inode.
The first 12 pointers in the array are direct addresses of the file; that
is, they point to the first 12 data blocks where the file contents are stored.
If the file grows larger than these 12 blocks, then a 13th block is added, which
does not contain data. This block, called an indirect block, contains pointers
to the addresses of the next set of direct blocks.
If the file grows still larger, then a 14th block is added, which contains
pointers to the addresses of a set of indirect blocks. This block is called
the double indirect block. If the file grows still larger, then a 15th block
is added, which contains pointers to the addresses of a set of double indirect
blocks. This block is called the triple indirect block.
- Vnodes. A virtual node, or vnode, is a data structure that
represents an open file, directory, or device that appears in the file system
namespace. A vnode hides the details of the physical file system that implements it.
The vnode interface allows high-level operating system modules to perform
uniform operations on vnodes.
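The 15-entry block address array described under "Data block" above puts an upper bound on how many data blocks one inode can reach. The following is a minimal sketch of that arithmetic, not UFS source code. It takes the 8192-byte block size and the 12 direct pointers from the text, and it assumes 4-byte disk block addresses, which the text does not state.

```python
# Sketch: how many data blocks 12 direct + 3 indirect pointers can address.
BLOCK_SIZE = 8192      # default data block size, taken from the text
ADDR_SIZE = 4          # ASSUMPTION: each disk block address occupies 4 bytes
DIRECT_POINTERS = 12   # first 12 entries of the inode's 15-entry array


def addressable_blocks(block_size=BLOCK_SIZE, addr_size=ADDR_SIZE):
    """Return the data blocks reachable through each level of the array."""
    per_block = block_size // addr_size  # addresses held by one indirect block
    return {
        "direct": DIRECT_POINTERS,
        "single_indirect": per_block,
        "double_indirect": per_block ** 2,
        "triple_indirect": per_block ** 3,
    }


levels = addressable_blocks()
total_blocks = sum(levels.values())
for name, count in levels.items():
    print(f"{name:16s} {count:>14,d} blocks")
print(f"total            {total_blocks:>14,d} blocks "
      f"({total_blocks * BLOCK_SIZE / 2**40:.1f} TiB addressable)")
```

This is only the theoretical reach of the pointer scheme. The file size a real file system supports is also limited by other factors, such as the largefiles mount option discussed later.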
Links
Hard and soft links are a great feature of Unix. A link is a reference in a directory
to a file stored in another directory. A soft link can also be a reference
to a directory. There might be multiple links to a file. Links eliminate redundancy
because you do not need to store multiple copies of a file.
Links are of two types: hard and soft (also known as symbolic).
- A hard link is a directory entry that points to a file's inode and is indistinguishable from
the original directory entry. Any changes to a file are independent of the name
used to reference the file. In other words, hard links are "synonyms" for a file and
each is technically a real directory entry. All hard links are equal:
there is no way to tell which is primary and which is secondary. Every
hard link must reside on the same mounted filesystem (usually a disk or a part
of a disk) as the file, so you cannot make a hard link to a file that is on a different
mounted filesystem. Hard links cannot be made to directories (actually you
can make them if you are root, but all the consequences are yours: what
is ".." for children of such a "multiple personalities" directory?).
The ln command by default creates hard links.
- A symbolic link (sometimes called a soft link) is a special file
that contains the path to another file (the target), much like a shortcut in Windows.
Unlike a hard link, a symbolic link is asymmetrical, and it is easy to tell
which file is the link and which is the actual file. This difference gives symbolic
links certain qualities that hard links do not have, such as the ability to
link to directories, or to files on remote computers networked through NFS.
Also, when you delete a target file, symbolic links to that file become unusable,
whereas hard links preserve the contents of the file.
To create a symbolic link, you must use the -s option with the ln
command. Symbolic links show an l as the first character
of the mode string displayed by the ls -l command, whereas
hard links do not. A directory can be the target of a symbolic
link. However, it cannot be hard linked.
No file exists with a link count of less than one. The relative
pathnames . and .. are nothing but links to the current directory
and its parent directory. They are present in every directory: any directory
stores the two links . and .. along with the names and inode numbers of the
files it contains. They can be listed with the ls -lia command. A directory must have
a minimum of two links. The number of links increases as the number of subdirectories
increases. Whenever you issue a command to list the file attributes, it refers to
the inode block with the inode number and the corresponding data
is retrieved.
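The differences between hard and symbolic links described above can be observed directly. The sketch below uses Python's standard os module in a temporary directory. The file names are hypothetical, and it assumes a file system that supports both kinds of links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.txt")  # hypothetical names
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")

    with open(original, "w") as handle:
        handle.write("hello\n")

    os.link(original, hard)     # equivalent of: ln original.txt hard.txt
    os.symlink(original, soft)  # equivalent of: ln -s original.txt soft.txt

    # A hard link is a second name for the same inode.
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink

    # A symbolic link is a separate file with its own inode.
    soft_is_link = os.path.islink(soft)
    soft_own_inode = os.lstat(soft).st_ino != os.stat(original).st_ino

    os.remove(original)         # remove the first name only
    hard_survives = os.path.exists(hard)
    soft_dangles = os.path.islink(soft) and not os.path.exists(soft)

print(same_inode, link_count, soft_is_link, soft_own_inode,
      hard_survives, soft_dangles)
```

After the original name is removed, the data remains reachable through the hard link, while the symbolic link still exists but points at nothing.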
Solaris File Systems and Their Functions
Each file system used in Solaris is intended for a specific purpose.
The root file system is at the top of an inverted tree structure. It is the first
file system that the kernel mounts during booting. It contains the kernel and device
drivers. The / directory is also called the mount point directory
of the file system. All references in the file system are relative to this directory.
The entire file system structure is attached to the main system tree at the root
directory during the process of mounting, and hence the name. During the creation
of a file system, a lost+found directory is created within the mount
point directory. This directory holds any unreferenced
files that are found during the customary file system check, which you do with
the fsck command.
/ (root)
The directory located at the top of the Unix file system. It is represented by
the "/" (forward slash) character.
/usr Contains commands and programs for system-level usage and administration.
/var Contains system log files and spooling files, which grow in size
with system usage.
/home Contains user home directories.
/opt Contains optional third-party software and applications.
/tmp Contains temporary files, which are cleared each time the system
is booted.
/proc Contains information about all active processes.
You create file systems with the newfs command. The newfs command
accepts only logical raw device names. The syntax is as follows:
newfs [ -v ] [ mkfs-options ] raw-special-device
For example, to create a file system on the disk slice c0t3d0s4, the
following command is used:
# newfs -v /dev/rdsk/c0t3d0s4
The -v option prints the actions in verbose mode. The newfs
command calls the mkfs command to create a file system. You can invoke
the mkfs command directly by specifying a -F option followed by
the type of file system.
Mounting File Systems
Mounting file systems is the next logical step to creating file systems. Mounting
refers to naming the file system and attaching it to the inverted tree structure.
This enables access from any point in the structure. A file system can be mounted
during booting, manually from the command line, or automatically if you have enabled
the automount feature.
With remote file systems, the server shares the file system over the network
and the client mounts it.
The / and /usr file systems, as mentioned earlier, are mounted
during booting. To mount a file system, attach it to a directory anywhere in the
main inverted tree structure. This directory is known as the mount point. The syntax
of the mount command is as follows:
# mount <logical block device name> <mount point>
The following steps mount a file system c0t2d0s7 on the /export/home
directory:
# mkdir /export/home
# mount /dev/dsk/c0t2d0s7 /export/home
You can verify the mounting by using the mount command, which lists
all the mounted file systems.
Note: If the mount point directory has any content prior to the mounting
operation, it is hidden and remains inaccessible until the file system is unmounted.
Data is stored and retrieved from the physical disk where the file system is mounted.
Although there are no defined specifications for creating the file systems on the
physical disk, slices are usually allocated as follows:
0. Root or / Files and directories of the OS.
1. Swap Virtual memory space.
2. Refers to the entire disk.
3. /export Different OS versions.
4. /export/swap Unused. Left to user's choice.
5. /opt Application software added to a system.
6. /usr OS commands by users.
7. /home Files created by users.
The slices shown above are all allocated on a single disk. However, there
is no restriction that all file systems need to be located on a single disk. They
can also span across multiple disks. Slice 2 refers to the entire disk. Hence, if
you want to allocate an entire disk for a file system, you can do so by creating
it on slice 2. The mount command supports a variety of useful options.
| Option | Description |
|---|---|
| -o largefiles | Files larger than 2GB are supported in the file system. |
| -o nolargefiles | Does not mount file systems with files larger than 2GB. |
| -o rw | File system is mounted with read and write permissions. |
| -o ro | File system is mounted with read-only permission. |
| -o bg | Repeats mount attempts in the background. Used with non-critical file systems. |
| -o fg | Repeats mount attempts in the foreground. Used with critical file systems. |
| -p | Prints the list of mounted file systems in /etc/vfstab format. |
| -m | Mounts without making an entry in the /etc/mnttab file. |
| -O | Performs an overlay mount. Mounts over an existing mount point. |
The mountall command mounts all file systems that have the mount at
boot field in the /etc/vfstab file set to yes. It can also
be used anytime after booting.
Unmounting File Systems
A file system can be unmounted with the umount command. The following
is the syntax for umount:
umount <mount-point or logical block device name >
File systems cannot be unmounted when they are in use or when the umount command is issued from any subdirectory within the file system mount point.
Note: A file system can be unmounted forcibly if you use the -f
option of the umount command. Please refer to the man page to learn
about the use of these options.
The umountall command is used to unmount a group of file systems. The
umountall command unmounts all file systems in the /etc/mnttab
file except the /, /usr, /var, and /proc file
systems. If you want to unmount all the file systems from a specified host, use
the -h option. If you want to unmount all the file systems mounted from
remote hosts, use the -r option.
/etc/vfstab File
The /etc/vfstab (Virtual File System Table) file plays a very important
role in system operations. This file contains one record for every device that has
to be automatically mounted when the system enters run level 2.
| Column Name | Description |
|---|---|
| device to mount | The logical block name of the device to be mounted. It can also be a remote resource name for NFS. |
| device to fsck | The logical raw device name to be subjected to the fsck check during booting. It is not applicable for read-only file systems, such as High Sierra File System (HSFS), and network file systems such as NFS. |
| Mount point | The mount point directory. |
| FS type | The type of the file system. |
| fsck pass | The number used by fsck to decide whether the file system is to be checked. 0: file system is not checked. 1: file system is checked sequentially. 2: file system is checked simultaneously along with other file systems where this field is set to 2. |
| Mount at boot | Determines whether the file system is mounted by the mountall command at boot time. The options are either yes or no. |
| Mount options | The mount options passed to the mount command when the particular file system is mounted. |
Note the no values in the mount at boot field for the root, /usr,
and /var file systems in the sample below. These are mounted by default. The fd entry
refers to the file descriptor file system mounted on /dev/fd, and the swap entry for /tmp
refers to the tmpfs file system mounted on the /tmp directory.
A sample vfstab file looks like:
#device            device              mount         FS     fsck  mount    mount
#to mount          to fsck             point         type   pass  at boot  options
#
fd                 -                   /dev/fd       fd     -     no       -
/proc              -                   /proc         proc   -     no       -
/dev/dsk/c0t0d0s4  -                   -             swap   -     no       -
/dev/dsk/c0t0d0s0  /dev/rdsk/c0t0d0s0  /             ufs    1     no       -
/dev/dsk/c0t0d0s6  /dev/rdsk/c0t0d0s6  /usr          ufs    1     no       -
/dev/dsk/c0t0d0s3  /dev/rdsk/c0t0d0s3  /var          ufs    1     no       -
/dev/dsk/c0t0d0s7  /dev/rdsk/c0t0d0s7  /export/home  ufs    2     yes      -
/dev/dsk/c0t0d0s5  /dev/rdsk/c0t0d0s5  /opt          ufs    2     yes      -
/dev/dsk/c0t0d0s1  /dev/rdsk/c0t0d0s1  /usr/openwin  ufs    2     yes      -
swap               -                   /tmp          tmpfs  -     yes      -
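Because every vfstab record consists of the seven whitespace-separated fields described above, the file is easy to process with a script. The following is a minimal sketch, not a Solaris tool. The VfstabEntry type and parse_vfstab function are hypothetical names introduced for illustration, and the embedded sample is a subset of the entries shown above.

```python
from collections import namedtuple

# Field names follow the column names of /etc/vfstab.
VfstabEntry = namedtuple(
    "VfstabEntry",
    "device_to_mount device_to_fsck mount_point fs_type "
    "fsck_pass mount_at_boot mount_options",
)


def parse_vfstab(text):
    """Parse vfstab-formatted text into a list of VfstabEntry records."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
        entries.append(VfstabEntry(*fields))
    return entries


SAMPLE = """\
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
swap - /tmp tmpfs - yes -
"""

entries = parse_vfstab(SAMPLE)
# Mount points that have the "mount at boot" field set to yes.
boot_mounted = [e.mount_point for e in entries if e.mount_at_boot == "yes"]
print(boot_mounted)
```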
Finding Information About the Mounted File Systems
The /etc/mnttab file comprises a table that defines which partitions
and/or disks are currently mounted by the system.
The /etc/mnttab file contains the following details about each mounted
file system:
- The file system name
- The mount point directory
- The file system type
- The mount command options
- A number denoting the time at which the file system was mounted
A sample mnttab file:
/dev/dsk/c0t0d0s0 / ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934
/dev/dsk/c0t0d0s6 /usr ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200006 1014366934
/proc /proc proc dev=4300000 1014366933
mnttab /etc/mnttab mntfs dev=43c0000 1014366933
fd /dev/fd fd rw,suid,dev=4400000 1014366935
/dev/dsk/c0t0d0s3 /var ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200003 1014366937
swap /var/run tmpfs xattr,dev=1 1014366937
swap /tmp tmpfs xattr,dev=2 1014366939
/dev/dsk/c0t0d0s5 /opt ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200005 1014366939
/dev/dsk/c0t0d0s7 /export/home ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200007 1014366939
/dev/dsk/c0t0d0s1 /usr/openwin ufs rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200001 1014366939
-hosts /net autofs indirect,nosuid,ignore,nobrowse,dev=4580001 1014366944
auto_home /home autofs indirect,ignore,nobrowse,dev=4580002 1014366944
-xfn /xfn autofs indirect,ignore,dev=4580003 1014366944
sun:vold(pid295) /vol nfs ignore,dev=4540001 1014366950
#
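The last field of each mnttab record is easier to read once it is converted to a date. The sketch below splits one record into its five fields and decodes the time, assuming that the number is a count of seconds since the Unix epoch. The parse_mnttab_line helper is a hypothetical name, and the record is the root file system entry from the sample, written on a single line.

```python
from datetime import datetime, timezone


def parse_mnttab_line(line):
    """Split one mnttab record into its five whitespace-separated fields."""
    special, mount_point, fs_type, options, mount_time = line.split()
    return {
        "special": special,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options.split(","),
        # ASSUMPTION: the time field is seconds since the Unix epoch.
        "mounted_at": datetime.fromtimestamp(int(mount_time), tz=timezone.utc),
    }


entry = parse_mnttab_line(
    "/dev/dsk/c0t0d0s0 / ufs "
    "rw,intr,largefiles,xattr,onerror=panic,suid,dev=2200000 1014366934"
)
print(entry["mount_point"], entry["fs_type"], entry["mounted_at"].isoformat())
```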
Restricting the File Size
Some applications and processes create temporary files that occupy a lot of hard
disk space. As a result, it is necessary to impose a restriction on the size of
the files that are created.
Solaris provides tools to control the storage. They are:
- The ulimit command
- Disk quotas
ulimit Command
The ulimit command is a built-in shell command, which displays the current
file size limit. The default value for the maximum file size, set inside the kernel,
is 1500 blocks. The following syntax displays the current limit:
$ ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
If the limit is not set, it reports as unlimited.
The system administrator and the individual users change this value to set the
file size at the system level and at the user level, respectively. The following
is the syntax of the ulimit command:
ulimit <value>
For example, the following syntax sets the file size limit to 1600 blocks:
# ulimit 1600
# ulimit -a
time(seconds) unlimited
file(blocks) 1600
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
memory(kbytes) unlimited
#
The file size can be limited at the system level or the user level. To set it
at the system level, change the value of the ulimit variable in the
/etc/profile file. To set it at the user level, change the value in the
.profile file present in the user's home directory. The user-level setting
always takes precedence over the system-level setting. It is the user's profile
file that sets the working environment.
Note: The ulimit values set at the user level and system level cannot
exceed the default ulimit value set in the kernel.
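A process can also inspect its own file size limit programmatically. The sketch below uses the resource module from the Python standard library (Unix only) to read the limit, and converts a block count such as the 1600 used above into bytes. The 512-byte block size is an assumption that the text does not state, and blocks_to_bytes and describe are hypothetical helper names.

```python
import resource

BLOCK_BYTES = 512  # ASSUMPTION: ulimit counts file size in 512-byte blocks


def blocks_to_bytes(blocks):
    """Convert a ulimit-style block count into bytes."""
    return blocks * BLOCK_BYTES


def describe(limit):
    """Render an rlimit value the way ulimit -a reports it."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


# Read (but do not change) the file size limit of the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
print("file size limit: soft =", describe(soft), "hard =", describe(hard))
print("1600 blocks =", blocks_to_bytes(1600), "bytes")
```

The sketch only reads the limit. Lowering it with resource.setrlimit would affect the calling process and its children in the same way that the ulimit command affects the shell.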
Blogged by matty as
Solaris Storage matty Sat 29 Jan 2005 12:14 am
I recently needed to grow a Solaris UFS file
system, and accomplished this with the
growfs(1m) utility.
The growfs(1m) utility takes two arguments. The
first argument to growfs
( the value passed to -M ) is the mount point
of the file system to grow. The second argument
is the raw device that backs this mount point. The
following example will grow /test to the maximum
size available on the meta device d100:
$ growfs -M /test /dev/md/rdsk/d100
To see how many sectors will be available on
d100 after the grow operation completes, you can
run newfs with the -N option, and compare that
with the current value of df (1m):
$ newfs -N /dev/md/dsk/d100
/dev/md/rdsk/d0: 232331520 sectors in 56944 cylinders
of 16 tracks, 255 sectors
113443.1MB in 2191 cyl groups (26 c/g, 51.80MB/g,
6400 i/g)
This will report the number of sectors, cylinders
and MBs that would be allocated if a new file system
was created on meta device d100. As always, test
everything on a non critical system prior to making
changes to critical boxen.
Recently, I wanted to create a UFS file system on a Maxtor OneTouch II external
hard drive I have. I wanted to use the external hard drive for storing some
large files and I was going to use the drive exclusively with one of my Solaris
systems. Now, I didn't find much information on the web about how to perform
this with Solaris (maybe I wasn't searching very well or something) so I thought
I would post the procedure I followed here so I'll know how to do it again if
I need to.
After plugging the hard drive into my system via one of the USB ports, we can
verify that the disk was recognized by the OS by examining the /var/adm/messages
file. With the hard drive I was using, I saw entries like the following:
Mar 2 13:10:33 solaris-filer usba: [ID 912658 kern.info] USB 2.0 device (usbd49,7100) operating at hi speed (USB 2.x) on USB 2.0 root hub: storage@3, scsa2usb0 at bus address 2
Mar 2 13:10:33 solaris-filer usba: [ID 349649 kern.info] Maxtor OneTouch II L60LHYQG
Mar 2 13:10:33 solaris-filer genunix: [ID 936769 kern.info] scsa2usb0 is /pci@0,0/pci1028,11d@1d,7/storage@3
Mar 2 13:10:33 solaris-filer genunix: [ID 408114 kern.info] /pci@0,0/pci1028,11d@1d,7/storage@3 (scsa2usb0) online
Mar 2 13:10:33 solaris-filer scsi: [ID 193665 kern.info] sd1 at scsa2usb0: target 0 lun 0
The dmesg command could also be used to
see similar information. Also, we could use the rmformat command (this lists
removable media) to see this information in a much nicer format like so:
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: QSI CDRW/DVD SBW242U UD25
Device Type: DVD Reader
2. Logical Node: /dev/rdsk/c2t0d0p0
Physical Node: /pci@0,0/pci1028,11d@1d,7/storage@3/disk@0,0
Connected Device: Maxtor OneTouch II 023g
Device Type: Removable
#
Now that we know the drive has been identified by Solaris (as /dev/rdsk/c2t0d0p0)
we need to create one Solaris partition (this is Solaris 10 running on the x86
architecture) that uses the whole disk. This is accomplished by passing the
-B flag to the fdisk command, like so:
# fdisk -B /dev/rdsk/c2t0d0p0
Now we will print the disk table to standard out like so:
# fdisk -W - /dev/rdsk/c2t0d0p0
This will output the following information to the screen for the hard drive
I am using:
* /dev/rdsk/c2t0d0p0 default fdisk table
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 36483 cylinders
*
* systid:
* 1: DOSOS12
* 2: PCIXOS
* 4: DOSOS16
* 5: EXTDOS
* 6: DOSBIG
* 7: FDISK_IFS
* 8: FDISK_AIXBOOT
* 9: FDISK_AIXDATA
* 10: FDISK_0S2BOOT
* 11: FDISK_WINDOWS
* 12: FDISK_EXT_WIN
* 14: FDISK_FAT95
* 15: FDISK_EXTLBA
* 18: DIAGPART
* 65: FDISK_LINUX
* 82: FDISK_CPM
* 86: DOSDATA
* 98: OTHEROS
* 99: UNIXOS
* 101: FDISK_NOVELL3
* 119: FDISK_QNX4
* 120: FDISK_QNX42
* 121: FDISK_QNX43
* 130: SUNIXOS
* 131: FDISK_LINUXNAT
* 134: FDISK_NTFSVOL1
* 135: FDISK_NTFSVOL2
* 165: FDISK_BSD
* 167: FDISK_NEXTSTEP
* 183: FDISK_BSDIFS
* 184: FDISK_BSDISWAP
* 190: X86BOOT
* 191: SUNIXOS2
* 238: EFI_PMBR
* 239: EFI_FS
*
* Id Act Bhead Bsect Bcyl Ehead Esect Ecyl Rsect Numsect
191 128 0 1 1 254 63 1023 16065 586083330
We now need to calculate the maximum amount of usable storage. This is done
by multiplying bytes/sector (512 in my case) by the number of sectors (Numsect) listed
at the bottom of the output shown above. We then divide this number by 1024*1024
to yield MBs.
So in my case, this works out to 286173.5009765625 MB.
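The calculation above can be written as a short sketch. The sectors_to_mb function is a hypothetical helper, and the sector count and sector size are the values from the fdisk output shown earlier.

```python
def sectors_to_mb(num_sectors, bytes_per_sector=512):
    """Convert a sector count to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_sectors * bytes_per_sector / (1024 * 1024)


# Numsect value from the last line of the fdisk table.
usable_mb = sectors_to_mb(586083330)
print(usable_mb)
```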
Now, we need to setup a partition table file. This will be a regular text file
and you can name it whatever you like. For the sake of this post, I will name
it disk_slices.txt. The contents of this file are:
slices: 0 = 2MB, 286170MB, "wm", "root" :
1 = 0, 1MB, "wu", "boot" :
2 = 0, 286172MB, "wm", "backup"
To create these slices on the disk, we run:
# rmformat -s disk_slices.txt /dev/rdsk/c2t0d0p0
# devfsadm
# devfsadm -C
To create the UFS file system on the newly created slice, I run the following
command; its output is also shown:
# newfs /dev/rdsk/c2t0d0s0
newfs: construct a new file system /dev/rdsk/c2t0d0s0: (y/n)? y
/dev/rdsk/c2t0d0s0: 586076160 sectors in 95390 cylinders of 48 tracks, 128 sectors
286170.0MB in 5962 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
...............................................................................
........................................
super-block backups for last 10 cylinder groups at:
585105440, 585203872, 585302304, 585400736, 585499168, 585597600, 585696032,
585794464, 585892896, 585991328
#
And now I'm finished. I have a UFS file system on my USB hard drive
which can be mounted by my Solaris system. To mount this file system, I can
just run:
# mount -F ufs /dev/dsk/c2t0d0s0 /u01
This document describes a methodology for configuring a fast file system
that handles many small files on the Solaris Operating System. This could
be used for building a Java technology-based product or for handling many operations
on a large number of small files. This methodology utilizes a tmpfs volume,
and it can speed up operations approximately three times.
The requirements are as follows:
- Solaris 7 OS through Solaris 10 OS Update 1
- Some experience with Solaris system administration. This procedure is
not recommended for UNIX users who are uncomfortable with using
mount, maintaining /etc/vfstab,
or modifying their kernel parameters.
Warning: Do not develop on a tmpfs volume. A tmpfs volume is only
persistent while the system is powered up, so a power loss or system problem
will cause you to lose any changes to that volume.
Procedure
Solaris tmpfs volumes are easy to create, but require a significant amount
of RAM and swap space. It is recommended that you have at least 1 Gbyte of RAM,
but there have also been major performance gains on systems with 512 Mbytes
of RAM. In addition, you should add twice as much swap space as the tmpfs volume
you are creating. That is, for a 2-Gbyte tmpfs volume, add 4 Gbytes of swap
space to the system. Feel free to experiment with these values.
The following examples are for a 2-Gbyte tmpfs volume, which is approximately
what is needed to do a developer build. Replace <swapfilename>
with the absolute path to a swapfile (such as
/disk1/swapfile), and <mountpoint>
with the absolute path to where you want the tmpfs volume mounted (such as
/ramdisk).
Add swap space to your workstation:
root# /usr/sbin/mkfile 2000m <swapfilename>
Create a mount point for the tmpfs volume:
root# mkdir <mountpoint>
Edit your /etc/vfstab file to use the swap and
create the tmpfs volume at boot time. Add the following two lines:
<swapfilename> - - swap - no -
RAMDISK - <mountpoint> tmpfs - yes size=2000m
Note that on the Solaris 7 OS you may not make a single tmpfs volume larger
than 2 Gbytes.
Edit your kernel parameters to increase the number of files you can create
in the tmpfs volume. Add the following line to your /etc/system
file. (We've had the most success using this value.)
set tmpfs:tmpfs_maxkmem=250000000
Reboot your workstation. Then verify that the tmpfs volume exists at the
size you specified:
% df -k <mountpoint>
Make the tmpfs volume writable. Note: This step is necessary after
each reboot of the workstation.
root# chmod 777 <mountpoint>
More UFS technical tidbits in anticipation
of OpenSolaris. Today's talk is about
UFS I/O. It is a complicated beast and has many different parts and paths
it can take.
Overview of file system I/O in Solaris:

The interaction of UFS and the VM subsystem has been the cause of numerous
bugs, and hard to find problems. Today's blog is an overview of the UFS
I/O, with particular attention paid to the VM subsystem interaction. Details
on the paths taken when a read() system call is initiated are to
show the interaction of UFS and the VM subsystem. I am making some assumptions
here that the readers of this blog will have some basic Solaris file system
knowledge, or at a minimum some of the basic Solaris file system terminology
is understood.
Basic Solaris VM facts
Solaris virtual memory is demand paged,
and globally managed. There is integrated file caching and it is layered
to allow VM to describe multiple memory types. The paging vnode cache is
the unification of file and memory management by use of a vnode object.
1 page of memory == <vnode, offset> tuple. The UFS file system uses
this relationship to implement caching for vnodes. The paging vnode cache
provides a set of functions for cache management and I/O for vnodes.
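The idea that one page of memory is identified by a <vnode, offset> tuple can be illustrated with a toy model. The sketch below is not Solaris kernel code. ToyPageCache is a hypothetical class, the 8192-byte page size is an assumption for illustration, and vnodes are represented by plain strings.

```python
PAGE_SIZE = 8192  # ASSUMPTION: page size chosen for illustration only


class ToyPageCache:
    """Toy cache in which each page is named by a (vnode, offset) pair."""

    def __init__(self):
        self._pages = {}

    @staticmethod
    def _key(vnode_id, offset):
        # Round the byte offset down to the start of its page.
        return (vnode_id, offset - offset % PAGE_SIZE)

    def insert(self, vnode_id, offset, data):
        self._pages[self._key(vnode_id, offset)] = data

    def lookup(self, vnode_id, offset):
        # Returns None on a miss; a real file system would start I/O here.
        return self._pages.get(self._key(vnode_id, offset))


cache = ToyPageCache()
cache.insert("vnode-A", 0, b"first page")
hit = cache.lookup("vnode-A", 100)         # same page as offset 0
miss = cache.lookup("vnode-A", PAGE_SIZE)  # next page, not cached
print(hit, miss)
```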
The paging vnode cache functions are
named with a pvn_<xxx> prefix. The source code for this is located at:
xxxx. Some of the more important paging vnode functions are listed below,
with basic function descriptions. Also shown are pointers to the code so
you can get more detailed data about each of these.
Some important paging vnode cache functions:
pvn_read_kluster():
pvn_write_kluster():
- Finds dirty pages within the offset and length. Returns a list of locked pages ready to be written.
- Caller then sets up write call with pageio_setup().
- Write is initiated via a call to bdev_strategy().
- Synchronous writes require the caller to call pvn_write_done(). Otherwise io_done() will call this when write is complete.
pvn_vplist_dirty():
What is a seg_map and why do you care?
The seg_map
segment maintains mappings of pieces of files into kernel address space.
It is only used by file systems and it allows copying of data to or from
user to kernel address space. At any given time, seg_map segment
has some portion of total file system cache mapped in to the kernel address
space. The seg_map segment driver divides the segment in to file
system block sized slots.
Some important seg_map functions:
segmap_getmap() && segmap_getmapflt():
- Retrieves or creates mapping
- getmapflt allows for creation of segment if not found, calls ufs_getpage()
segmap_release():
segmap_pagecreate():
Important in the mapping and getting data from the segmap driver is the fbuf structure.
It is defined as follows:
struct fbuf {
	caddr_t	fb_addr;	/* pointer to the mapped buffer */
	uint_t	fb_count;	/* number of bytes in the buffer */
};
This structure
is used to get a mapping to part of a file via the segkmap interfaces. It
is also used by the pseudo bio functions (shown below) for reading and writing
of data. Directory reading uses fbuf to get at the on-disk UFS
contents via a call to blkatoff().
seg_vn and UFS and memory mapped I/O:
Memory mapping allows a file to be
mapped into a process's address space. This mapping is done via the
VOP_MAP call and the seg_vn memory driver. File pages are read
when a fault occurs in the address space. The seg_vn driver enables
I/O without process-initiated system calls. I/O is performed in units
of pages, upon reference to the pages mapped into the address space. Reads
are initiated by a memory access; writes are initiated as the VM subsystem
finds dirty pages in the mapped address space.
So, why not use the seg_vn driver
for non-mmap'd I/O as well? It could be used for mapping the file into
the kernel's address space, but seg_vn is a complex segment driver
that manages the mapping of protections, copy-on-write fault handling, shared
memory, and so on. This is too heavyweight for what is needed for read and write
system calls, so the seg_map driver was developed. Read and write
system calls only require a few basic mapping functions since they do not
map files into a process's address space. seg_map reduces locking
complexity and gives better performance.
Pseudo bio functions:
Solaris has a set of interfaces which
are considered buffered I/O interfaces, but that are used to read and write
buffers containing directory entries only. These interfaces all use the
seg_map driver for mapping to address file data. The functions are
fbread(), fbwrite(), fbrelse(), fbdwrite(), fbiwrite(), and fbzero().
Although these are not directly shown in the picture above, they are
important enough to be worth mentioning.
A UFS/VM example, read() system call
- non mmap'd:
Note: In
general UFS caches the pages for write, but will also cache pages for reads
if they are frequently reusable.
read()->ufs_read()->rdip():
The history of "Solaris UFS bug"
- Alan
Hargreaves' Weblog the opinion of a specialist not a clueless security
junkie or press lemming
Just noticed that Solaris has
an entry in
Month of Kernel bugs. While I agree that we have an issue that needs
looking at, I also believe that the contributor is making much more of it
than it really deserves.
First off, to paraphrase the issue:
If I give you a specially massaged filesystem and can convince someone
with the appropriate privilege to mount it, it will crash the system.
I'd hardly call this a "denial of service", let alone exploitable.
First off, in order to perform a mount operation of a ufs filesystem,
you need sys_mount privilege. In Solaris, we currently are running
under the concept of "least privilege". That is, a process is given the
least amount of privilege that it needs to run. So, in order to exploit
this you need to convince someone with the appropriate level of privilege
to mount your filesystem. This would also involve a bit of social engineering
which went unmentioned.
That being said, the system should not panic off this filesystem and
I will log a bug to this effect. It is a shame that the contributor did
not make the crashdump files available as it would certainly speed up any
analysis.
One other thing that I should add is that anyone who tries to
mount an unknown ufs filesystem without at least running "fsck -n"
over it probably deserves what they get.
Copyright © 1996-2009 by Dr. Nikolai Bezroukov.
www.softpanorama.org was
created as a service to the UN Sustainable Development Networking Programme (SDNP)
in the author free time.
This document is an industrial compilation designed and created
exclusively for educational use and is placed under the copyright of the
Open Content License (OPL).
Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made
for educational purposes only in compliance with the fair use doctrine.
Disclaimer:
- The statements, views and opinions presented on
this web page are those of the author and are not endorsed by, nor do they necessarily
reflect, the opinions of the author present and former employers, SDNP or any other
organization the author may be associated with.
- We do not warrant the correctness of the information provided or its
fitness for any purpose
- In no way this site is associated with or endorse cybersquatters
using
the term "softpanorama" with other main or country domains (e.g. softpanorama.com) with
bad faith intent to profit from the goodwill belonging to
someone else.
Last modified:
August 13, 2009