Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Solaris ZFS

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004. For a humorous introduction to ZFS features, see the presentation given by Pawel at EuroBSDCon 2007: http://youtube.com/watch?v=o3TGM0T1CvE.

The innovative Zettabyte File System broke records in scalability, reliability, and flexibility. ZFS is based in part on NetApp's Write Anywhere File Layout (WAFL) system, but it has evolved far from WAFL and now differs from it in many ways. This table lists some of the differences (please read the blog replies, which correct some errors in the table).

Although performance is not usually cited as a ZFS advantage, ZFS is far faster than most users realize, especially in environments where typical files are smaller than 5-10 megabytes. The native volume manager in ZFS is also interesting: together with copy-on-write semantics, it provides snapshots, which are important for some applications and for security.

ZFS is one of the few Unix filesystems that can go neck and neck with Microsoft NTFS for performance. Among important features:

ZFS is also supported in FreeBSD and Mac OS X (Leopard). There are rumors that Apple may be preparing to adopt it as the default filesystem, replacing the aging HFS+ in the future.

A good overview is available in the BigAdmin feature article ZFS Overview and Guide.

ZFS organizes physical devices into logical pools called storage pools. Both individual disks and array logical unit numbers (LUNs) visible to the operating system may be included in a ZFS pool.

...Storage pools can be sets of disks striped together with no redundancy (RAID 0), mirrored disks (RAID 1), striped mirror sets (RAID 1 + 0), or striped with parity (RAID Z). Additional disks can be added to pools at any time, but they must be added with the same RAID level. For example, if a pool is configured with RAID 1, disks may be added to the pool only in mirrored sets of the same number of disks as was used when the pool was created. As disks are added to pools, the additional storage is automatically used from that point forward.

Note: Adding disks to a pool causes data to be written to the new disks as writes are performed on the pool. Existing data is not redistributed automatically, but is redistributed when modified.

When organizing disks into pools, the following issues should be considered:

Note: RAID-Z is a special implementation of RAID-5 for ZFS allowing stripe sets to be more easily expanded with higher performance and availability.

Storage pools perform better as more disks are included. Include as many disks in each pool as possible and build multiple file systems on each pool.

ZFS File System

ZFS offers a POSIX-compliant file system interface to the operating system. In short, a ZFS file system looks and acts exactly like a UFS file system except that ZFS files can be much larger, ZFS file systems can be much larger, and ZFS will perform much better when configured properly.

Note: It is not necessary to know how big a file system needs to be to create it.

ZFS file systems will grow to the size of their storage pools automatically.

ZFS file systems must be built in one and only one storage pool, but a storage pool may have more than one defined file system. Each file system in a storage pool has access to all the unused space in the storage pool. As any one file system uses space, that space is reserved for that file system until the space is released back to the pool by removing the file(s) occupying the space. During this time, the available free space on all the file systems based on the same pool will decrease.

ZFS file systems are not necessarily managed in the /etc/vfstab file. Special, logical device files can be constructed on ZFS pools and mounted using the vfstab file, but that is outside the scope of this guide. The common way to mount a ZFS file system is to simply define it against a pool. All defined ZFS file systems automatically mount at boot time unless otherwise configured.

Finally, the default mount point for a ZFS file system is based on the name of the pool and the name of the file system. For example, a file system named data1 in pool indexes would mount as /indexes/data1 by default. This default can be overridden either when the file system is created or later if desired.
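
As a minimal sketch of the default naming rule described above, the helper below derives the default mount point from a dataset name. The function name default_mountpoint is hypothetical and does only string manipulation; it does not query ZFS, and an explicitly set mount point would override the result. The optional second argument models an alternate root (the -R option of zpool), under the assumption that the alternate root is simply prefixed to the default path.

```shell
#!/bin/sh
# Hypothetical helper: derive the default mount point from "pool/filesystem".
# Pure string manipulation; it does not call zfs or touch any pool.
default_mountpoint() {   # usage: default_mountpoint DATASET [ALTROOT]
    # ${2%/} drops one trailing slash from the alternate root, if any
    printf '%s/%s\n' "${2%/}" "$1"
}

default_mountpoint indexes/data1        # prints /indexes/data1
default_mountpoint indexes/data1 /mnt   # prints /mnt/indexes/data1
```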

Command-Line Interface

The command-line interface consists primarily of the zfs and zpool commands. Using these commands, all the storage devices in any system can be configured and made available. A graphical interface is available through the Sun Management Center. Please see the SMC documentation at docs.sun.com for more information.

For example, assume that a new server named proddb.mydomain.com is being configured for use as a database server. Tables and indexes must be on separate disks but the disks must be configured for highly available service resulting in the maximum possible usable space. On a traditional system, at least two arrays would be configured on separate storage controllers, made available to the server by means of hardware RAID or logical volume management (such as Solaris Volume Manager) and UFS file systems built on the device files offered from the RAID or logical volume manager. This section describes how this same task would be done with ZFS.

Planning for ZFS

Tip 2: Use the format command to determine the list of available devices and to address configuration problems with those devices.

The following steps must be performed prior to configuring ZFS on a new system. All commands must be issued by root or by a user with root authority:

Additional planning information can be found at docs.sun.com.

In the running example, two bodies of JBOD ("just a bunch of disks" or non-RAID managed storage) are attached to the server. Though there is no reason to avoid hardware RAID systems when using ZFS, this example is clearer without hardware RAID systems. The following table lists the physical devices presented from attached storage.

Controller 2   Controller 4   Controller 3   Controller 5
c2t0d0         c4t0d0         c3t0d0         c5t0d0
c2t1d0         c4t1d0         c3t1d0         c5t1d0
c2t2d0         c4t2d0         c3t2d0         c5t2d0
c2t3d0         c4t3d0         c3t3d0         c5t3d0

Based on the need to separate indexes from data, it is decided to use two pools named indexes and tables, respectively. In order to avoid controller contention, all the disks from controllers 2 and 4 will be in the indexes pool and those from controllers 3 and 5 will be in the tables pool. Both pools will be configured using RAID-Z for maximum usable capacity.
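
The capacity argument for RAID-Z can be sketched as simple arithmetic. The estimate below assumes single-parity RAID-Z, which gives up roughly one disk's worth of space to parity, and compares it with two-way mirroring, which gives up half of the raw space. The function names and the 30 MB disk size are hypothetical, and the figures ignore file system metadata overhead, so treat the results as rough upper bounds rather than what zfs list will report.

```shell
#!/bin/sh
# Rough usable-capacity estimates; disk count and per-disk size are hypothetical.
# Single-parity RAID-Z gives up about one disk's worth of space to parity.
raidz_usable() {    # usage: raidz_usable DISKS SIZE_PER_DISK
    echo $(( ($1 - 1) * $2 ))
}
# Two-way mirrors give up half of the raw space.
mirror_usable() {   # usage: mirror_usable DISKS SIZE_PER_DISK
    echo $(( $1 / 2 * $2 ))
}

raidz_usable 8 30    # prints 210
mirror_usable 8 30   # prints 120
```

With eight disks per pool, RAID-Z leaves seven disks' worth of usable space where mirroring would leave four, which is why it is the choice when capacity matters most.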

Creating a Storage Pool

Storage pools are created with the zpool command. Please see the man page, zpool(1M), for information on all the command options. The following command syntax builds a new ZFS pool:

#  zpool create <pool_name> [<configuration>] <device_files>

The command requires the user to supply a name for the new pool and the disk device file names without path (c#t#d# as opposed to /dev/dsk/c#t#d#). In addition, if a configuration flag, such as mirror or raidz, is used, the list of devices will be configured using the requested configuration. Otherwise, all disks named are striped together with no parity or other highly available features.
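
The syntax rules above can be sketched as a dry-run helper that only prints the command line it would build. The function zpool_create_cmd is hypothetical (it is not part of ZFS), as are the pool name scratch and the device c1t0d0. It strips a leading /dev/dsk/ so that devices are always passed as c#t#d#, and it omits the configuration keyword when none is given, which corresponds to a plain stripe.

```shell
#!/bin/sh
# Dry-run sketch: prints the zpool command line instead of running it.
# zpool_create_cmd is a hypothetical helper, not part of ZFS.
zpool_create_cmd() {   # usage: zpool_create_cmd POOL CONFIG DEVICE...
    pool=$1
    config=$2          # "mirror", "raidz", or "" for a plain stripe
    shift 2
    cmd="zpool create $pool"
    if [ -n "$config" ]; then
        cmd="$cmd $config"
    fi
    for dev in "$@"; do
        cmd="$cmd ${dev#/dev/dsk/}"   # drop the path, keep c#t#d#
    done
    printf '%s\n' "$cmd"
}

zpool_create_cmd tables raidz /dev/dsk/c3t0d0 /dev/dsk/c3t1d0 c5t0d0
zpool_create_cmd scratch "" c1t0d0
```

Printing the command first makes it easy to review the device list before anything is written to disk.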

Tip 3: Check out the -m option for defining a specific mount point or the -R option for redefining the relative root path for the default mount point.

Continuing the example, the zpool commands to build two RAID-Z storage pools of eight disks, each with minimum controller contention, would be as follows:

# zpool create indexes raidz c2t0d0 c2t1d0 c2t2d0 \
  c2t3d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0
# zpool create tables raidz c3t0d0 c3t1d0 c3t2d0 \
  c3t3d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0

The effect of these commands is to create two pools named indexes and tables, respectively, each with RAID-Z striping and data redundancy. A ZFS pool name can be anything that starts with a letter, except the strings mirror, raidz, and spare, or any string starting with c# where # is any digit 0 through 9. Pool names can include only letters, digits, dashes, underscores, and periods.
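
The naming rules can be expressed as a small check. This is a sketch that follows the rules exactly as stated in the paragraph above; it is not the validator that zpool itself uses, and the function name valid_pool_name is hypothetical.

```shell
#!/bin/sh
# Hypothetical check that mirrors the naming rules stated in the text.
# Returns 0 for an acceptable name and 1 otherwise.
valid_pool_name() {
    case $1 in
        mirror|raidz|spare) return 1 ;;   # reserved words named in the text
        c[0-9]*) return 1 ;;              # looks like a device name (c#...)
    esac
    case $1 in
        [A-Za-z]*) ;;                     # must start with a letter
        *) return 1 ;;
    esac
    case $1 in
        *[!A-Za-z0-9_.-]*) return 1 ;;    # a character outside the allowed set
    esac
    return 0
}

for name in indexes test_pool raidz c2pool 9lives; do
    if valid_pool_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```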

Creating File Systems

If the default file system that is created is not adequate to suit the needs of the system, additional file systems can be created using the zfs command. Please see the man page, zfs(1M), for detailed information on the command's options.

Suppose, in the running example, two databases were to be configured on the new storage and for management purposes, each database needed to have its own mount points in the indexes and tables pools. Use the zfs command to create the desired file systems as follows:

# zfs create indexes/db1
# zfs create indexes/db2
# zfs create tables/db1
# zfs create tables/db2

Note: Be careful when naming file systems. It is possible to reuse the same name for different file systems in different pools, which might be confusing.

The effect is to add a separate mount point for db1 and db2 under each of /indexes and /tables. Each of the four new file systems then appears as its own entry in the mount output.

The space available to /indexes, /indexes/db1, and /indexes/db2 is all of the space defined in the indexes pool. Likewise, the space available to /tables, /tables/db1, and /tables/db2 is all of the space defined in the tables pool. The file systems db1 and db2 in each pool are mounted as separate file systems in order to provide distinct control and management interfaces for each defined file system.

Tip 4: Check out the set options of the zfs command to manipulate the mount point and other properties of each file system.

Displaying Information

Information on the pools and file systems can be displayed using the list commands for zpool and zfs. Other commands exist as well. Please read the man pages for zfs and zpool for the complete list.

# zpool list

NAME            SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
indexes         240M    110K    240M     0%  ONLINE     -
tables          240M    110K    240M     0%  ONLINE     -

# zfs list

NAME            USED    AVAIL  REFER  MOUNTPOINT
indexes         107K    208M   25.5K  /indexes
indexes/db1     24.5K   208M   24.5K  /indexes/db1
indexes/db2     24.5K   208M   24.5K  /indexes/db2
tables          107K    208M   25.5K  /tables
tables/db1      24.5K   208M   24.5K  /tables/db1
tables/db2      24.5K   208M   24.5K  /tables/db2

Monitoring

Though a detailed discussion of monitoring is out of this document's scope, this overview would be incomplete without some mention of the ZFS built-in monitoring. As with management, the command to monitor the system is simple:

# zpool iostat <pool_name> <interval> <count>

This command works very much like the iostat command found in the operating system. If the pool name is not specified, the command reports on all defined pools. If no count is specified, the command reports until stopped. A separate command was needed as the iostat command in the operating system cannot see the true reads and writes performed by ZFS; it can see only those submitted to and requested from file systems.

The command output is as follows:

# zpool iostat test_pool 5 10

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test_pool     80K  1.52G      0      7      0   153K
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0

Other commands can be used to contribute to an administrator's understanding of the status, performance, options, and configuration of running ZFS pools and file systems. Please read the man pages for zfs and zpool for more information.

There is ZFS-centered litigation between NetApp and Sun. In 2007 NetApp alleged that Sun violated seven of its patents and demanded that Sun remove its ZFS file system from the open-source community and storage products, and limit its use to computing devices. Please note that Sun indemnifies all its customers against IP claims.

In October 2007 Sun counter-sued, saying NetApp infringed 22 of its patents, which put NetApp in the crossfire typical of such suits. Sun requested the removal of all NetApp products from the marketplace. That was a big present to EMC, as the letter below suggests:

To NetApp Employees and Customers on Sun’s Lawsuit

[Note: This is an e-mail that I sent internally to our employees, with the expectation that they might also share it with customers. Some of it repeats previous posts, but other parts are different. In the spirit of openness, I decided to post here as well.]

To: everyone-at-netapp
Subject: Sun's Lawsuit Against NetApp

This morning, Sun filed suit seeking a “permanent injunction against NetApp” to remove almost all of our products from the market place. That’s some pretty scary language! It seems designed to make NetApp employees wonder, Do I still have a job? And customers to wonder, Is it safe to buy NetApp products?

I’d like to reassure you. Your job is safe. Our products are all still for sale.

Can you ever remember a Fortune 1000 company being shut down by patents? It just doesn’t happen! Even for the RIM/Blackberry case, which is the closest I can think of to a big company being shut down, it took years and years to get to that point, and was still averted in the end. I think it’s safe to say the odds of Sun fulfilling their threat are near zero.

If you are a customer, you can be confident buying NetApp products.

If you are an employee, just keep doing your job! Even if your job is to partner with Sun, keep doing your job. Here’s a ironic story. When James and I received the IEEE Storage Systems Award for our work in WAFL and appliances “which has revolutionized storage”, it was a Sun employee who organized the session where the award was presented. He was friendly, we were friendly, and we didn’t talk about the lawsuit. You can do it too. The first minute or two might feel odd, but then you’ll get over it. We have many joint customers to take care of.

NetApp also landed on the wrong side of the open source debate, which will cost it a lot in both goodwill and actual customers. The old proverb "Those who live in glass houses should not throw stones" is very relevant here. The shadow of SCO over NetApp is a very real threat to its viability in the marketplace. As a powerful movement, the open source community is a factor that weighs into all deliberations.

I think NetApp made a mistake here: for companies of approximately equal size, the courtroom does not work well as a venue for battles in the storage industry. Lawsuits over software patent infringement claims (usually with broad, absurd generalizations included in the patent) are extremely risky ventures that can backfire. Among near equals, the costly patent enforcement game is essentially a variant of MAD (mutually assured destruction).
Biotech companies learned this a long time ago, when they realized that it makes little sense to sue each other over drug-enabling tech before FDA approval, which is the true gating function that confers the desired monopoly and knocks the other out of the ring. In the case of software, a prior art defense can work wonders against most so-called patents.

After Oracle's acquisition of Sun, NetApp's claims should be reviewed, as politically NetApp cannot go after Oracle, the main database that uses NetApp storage appliances. Any attempt to extract money from Oracle means a lot of lost revenue for the company. Unless Oracle does not care about the existence of open source ZFS, which is also a possibility.


Notes:
  • This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • The site contains some broken links as it develops like a living tree... Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate it if you could mail us a correct link.

Old News ;-)

Understanding and managing NFSv4 ACLs

April 16, 2009 E O N

Using EON/OpenSolaris and ZFS for storage will at some point cause you to cross paths with NFSv4 Access Control Lists. The control available through ACLs is really granular and powerful, but ACLs are also hard to manage and a bit confusing. Here I'll share my methods of handling ACLs, which require some prerequisite reading to help understand the compact access codes:
add_file w, add_subdirectory p, append_data p, delete d, delete_child D, execute x, list_directory r, read_acl c, read_attributes a, read_data r, read_xattr R, write_xattr W, write_data w, write_attributes A, write_acl C, write_owner o
Inheritance compact codes (remember that i on a directory causes a recursive inheritance):
file_inherit f, dir_inherit d, inherit_only i, no_propagate n
ACL set codes:
full_set = rwxpdDaARWcCos = all permissions
modify_set = rwxpdDaARWc--s = all permissions except write_acl, write_owner
read_set = r-----a-R-c--- = read_data, read_attributes, read_xattr, read_acl
write_set = -w-p---A-W---- = write_data, append_data, write_attributes, write_xattr
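
As a minimal sketch, the set names above can be kept in a small lookup function so that scripts can refer to them by name when composing ACL entries. The function acl_set and the group name family are hypothetical; the compact strings are copied from the list above. The example only prints an entry and does not run chmod.

```shell
#!/bin/sh
# Hypothetical lookup helper: expand an ACL set name to its compact string,
# using the values listed above. Unknown names print nothing and return 1.
acl_set() {
    case $1 in
        full_set)   printf '%s\n' 'rwxpdDaARWcCos' ;;
        modify_set) printf '%s\n' 'rwxpdDaARWc--s' ;;
        read_set)   printf '%s\n' 'r-----a-R-c---' ;;
        write_set)  printf '%s\n' '-w-p---A-W----' ;;
        *) return 1 ;;
    esac
}

# Build (but do not apply) an ACL entry for a hypothetical group "family".
printf '%s\n' "group:family:$(acl_set read_set):-------:allow"
```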
If I create a file/folder (foo) via a Windows client on an SMB/CIFS share, the permissions typically resemble the following:
eon:/deep/tank#ls -Vd foo
d---------+  2 admin    stor           2 Apr 20 14:12 foo
         user:admin:rwxpdDaARWcCos:-------:allow
   group:2147483648:rwxpdDaARWcCos:-------:allow
This works fine for the owner (admin), but where multiple people (family) use the storage, adding user access and more control over sharing is usually required. So how do I simply add the capability needed? If I wish to modify the ACL above, I always start by going back to the default values:
eon:/deep/tank#chmod A- foo
eon:/deep/tank#ls -Vd foo
d---------   2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:rwxp----------:-------:deny
             group@:--------------:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
I then copy and paste them directly into a terminal or script (vi /tmp/bar) for trial and error and simply flip the bits I wish to test on or off. Note I'm using A= which will wipe and replace with whatever I define. With A+ or A-, it adds or removes the matched values. So my script will look like this after the above is copied
chmod -R A=\
owner@:rwxp----------:-------:deny,\
owner@:-------A-W-Co-:-------:allow,\
group@:rwxp----------:-------:deny,\
group@:--------------:-------:allow,\
everyone@:rwxp---A-W-Co-:-------:deny,\
everyone@:------a-R-c--s:-------:allow \
foo
Let's modify group:allow to have write_set = -w-p---A-W----
chmod -R A=\
owner@:rwxp----------:-------:deny,\
owner@:-------A-W-Co-:-------:allow,\
group@:--------------:-------:deny,\
group@:-w-p---A-W----:-------:allow,\
everyone@:rwxp---A-W-Co-:-------:deny,\
everyone@:------a-R-c--s:-------:allow \
foo
Running the above
eon:/deep/tank#sh -x /tmp/bar
+ chmod -R A=owner@:rwxp----------:-------:deny,owner@:-------A-W-Co-:-------:allow,group@:--------------:-------:deny,group@:-w-p---A-W----:-------:allow,everyone@:rwxp---A-W-Co-:-------:deny,everyone@:------a-R-c--s:-------:allow foo
eon:/deep/tank#ls -Vd foo/
d----w----+  2 admin    stor           2 Apr 20 14:12 foo/
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
Adding a user (webservd) at layer 5, 6 with full_set permissions
eon:/deep/tank#chmod A+user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
      user:webservd:rwxpdDaARWcCos:-d-----:allow
      user:webservd:rwxpdDaARWcCos:f------:allow
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
Oops, that's level 1, 2, so let's undo this by simply repeating the command with A- instead of A+. Then let's fix it by repeating the command with A5+ instead of A-:
eon:/deep/tank#chmod A-user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
eon:/deep/tank#chmod A5+user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
      user:webservd:rwxpdDaARWcCos:-d-----:allow
      user:webservd:rwxpdDaARWcCos:f------:allow
          everyone@:------a-R-c--s:-------:allow
This covers adding, deleting, modifying and replacing NFSv4 ACLs. Hope that provides some guidance in case you have to tangle with NFSv4 ACLs. The more exercise you get with NFSv4 ACLs the more familiar you'll be with getting it to do what you want.
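
The positioning behaviour seen in the transcripts (a plain A+ puts new entries at the top, while A5+ places them at index 5) can be modelled as list insertion. The sketch below is only a model of the entry list: acl_insert is a hypothetical helper that reads entries from standard input, counts from index 0, and never calls chmod. The shortened entry strings are placeholders, not valid ACL syntax.

```shell
#!/bin/sh
# Sketch of how an index such as the 5 in "A5+" positions a new entry.
# Reads existing entries on stdin (one per line, first line is index 0)
# and prints the list with ENTRY inserted at INDEX.
acl_insert() {   # usage: acl_insert INDEX ENTRY
    awk -v idx="$1" -v entry="$2" '
        NR - 1 == idx { print entry }        # new entry goes in front of the old occupant
        { print }
        END { if (NR <= idx) print entry }   # index at or past the end: append
    '
}

# Index 0 behaves like a plain A+: the new entry lands at the top.
printf '%s\n' 'owner@:deny' 'owner@:allow' 'everyone@:allow' |
    acl_insert 0 'user:webservd:allow'
```

Because entries are evaluated in order, where an entry lands matters as much as what it grants, which is why the index form is worth the extra typing.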

ZFS ACLs  by Mark Shellenbaum

Nov 16, 2005 | Mark Shellenbaum's Weblog

The ZFS file system uses a pure ACL model that is compliant with the NFSv4 ACL model. A pure ACL model means that every file always has an ACL, unlike file systems such as UFS, where a file has either an ACL or permission bits. All access control decisions are governed by a file's ACL. All files still have permission bits, but they are constructed by analyzing the file's ACL.
 

NFSv4 ACL Overview
The ACL model in NFSv4 is similar to the Windows ACL model. The NFSv4 ACL model supports a rich set of access permissions and inheritance controls. An ACL in this model is composed of an array of access control entries (ACEs). Each ACE specifies the permissions, access type, inheritance flags, and to whom the entry applies. In the NFSv4 model the "who" argument of each ACE may be either a username or a groupname. There is also a set of commonly known names, such as "owner@", "group@", and "everyone@". These abstractions are used by UNIX variant operating systems to indicate whether the ACE is for the file owner, the file group owner, or the world. The everyone@ entry is not equivalent to the POSIX "other" class; it really is everyone. The complete description of the NFSv4 ACL model is available in Section 5.11 of the NFSv4 protocol specification.
 
NFSv4 Access Permissions
Permission        Description
read_data         Permission to read the data of the file
list_directory    Permission to list the contents of a directory
write_data        Permission to modify the file's data anywhere in the file's offset range. This includes the ability to grow the file or write to an arbitrary offset.
add_file          Permission to add a new file to a directory
append_data       The ability to modify the data, but only starting at EOF.
add_subdirectory  Permission to create a subdirectory in a directory
read_xattr        The ability to read the extended attributes of a file or to do a lookup in the extended attributes directory.
write_xattr       The ability to create extended attributes or write to the extended attributes directory.
execute           Permission to execute a file
delete_child      Permission to delete a file within a directory
read_attributes   The ability to read basic attributes (non-ACLs) of a file. Basic attributes are considered the stat(2) level attributes.
write_attributes  Permission to change the times associated with a file or directory to an arbitrary value
delete            Permission to delete a file
read_acl          Permission to read the ACL
write_acl         Permission to write a file's ACL
write_owner       Permission to change the owner or the ability to execute chown(1) or chgrp(1)
synchronize       Permission to access a file locally at the server with synchronous reads and writes.
 

 
NFSv4 Inheritance flags
Inheritance Flag  Description
file_inherit      Can be placed on a directory and indicates that this ACE should be added to each new non-directory file created.
dir_inherit       Can be placed on a directory and indicates that this ACE should be added to each new directory created.
inherit_only      Placed on a directory, but does not apply to the directory itself, only to newly created files and directories. This flag requires file_inherit and/or dir_inherit to indicate what to inherit.
no_propagate      Placed on directories and indicates that ACL entries should only be inherited to one level of the tree. This flag requires file_inherit and/or dir_inherit to indicate what to inherit.
 
   
 

NFSv4 ACLs vs POSIX
 

The difficult part of using the NFSv4 ACL model was preserving POSIX compliance in the file system. POSIX allows for what it calls "additional" and "alternate" access methods. An additional access method is layered upon the file permission bits and can only further restrict the standard access control mechanism. An alternate file access control mechanism is independent of the file permission bits and, if enabled on a file, may either restrict or extend the permissions of a given user. Another major distinction between the two is that any alternate file access control mechanism must be disabled after the file permission bits are changed with chmod(2). Additional mechanisms do not need to be disabled when a chmod is done.

Most vendors that have implemented NFSv4 ACLs have taken the approach of "discarding" ACLs during a chmod(2). This is a bit heavy-handed, since a user goes to the trouble of crafting a bunch of ACLs, only to have chmod(2) come through and destroy all of that hard work. This single issue was the biggest hurdle to POSIX compliance for ZFS in implementing NFSv4 ACLs. Sam, Lisa and I spent far too long trying to come up with a model that would preserve as much of the original ACL as possible while still being useful. What we came up with is a model that retains additional access methods and disables, but doesn't delete, alternate access controls. Sam and Lisa have filed an internet draft which has the details about the chmod(2) algorithm and how to make NFSv4 ACLs POSIX compliant.

 

So what's cool about this?

Let's assume we have the following directory, /sandbox/test.dir.
Its initial ACL looks like:

    % ls -dv test.dir
    drwxr-xr-x   2 ongk     bin            2 Nov 15 14:11 test.dir
         0:owner@::deny
         1:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
             /append_data/write_xattr/execute/write_attributes/write_acl
             /write_owner:allow
         2:group@:add_file/write_data/add_subdirectory/append_data:deny
         3:group@:list_directory/read_data/execute:allow
         4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
             /write_attributes/write_acl/write_owner:deny
         5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
             /read_acl/synchronize:allow



Now if I want to give "marks" the ability to create files, but not subdirectories in this
directory then the following ACL would achieve this.

    First let's make sure "marks" can't currently create files/directories

    $ mkdir /sandbox/test.dir/dir.1
    mkdir: Failed to make directory "/sandbox/test.dir/dir.1"; Permission denied


    $ touch /sandbox/test.dir/file.1
    touch: /sandbox/test.dir/file.1 cannot create


    Now let's give marks add_file permission

    % chmod A+user:marks:add_file:allow /sandbox/test.dir
    % ls -dv test.dir
    drwxr-xr-x+  2 ongk     bin            2 Nov 15 14:11 test.dir
         0:user:marks:add_file/write_data:allow
         1:owner@::deny
         2:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
             /append_data/write_xattr/execute/write_attributes/write_acl
             /write_owner:allow
         3:group@:add_file/write_data/add_subdirectory/append_data:deny
         4:group@:list_directory/read_data/execute:allow
         5:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
             /write_attributes/write_acl/write_owner:deny
         6:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
             /read_acl/synchronize:allow


    Now let's see if it works for user "marks"

    $ id
    uid=76928(marks) gid=10(staff)

    $ touch file.1
    $ ls -v file.1
    -rw-r--r--   1 marks    staff          0 Nov 15 10:12 file.1
         0:owner@:execute:deny
         1:owner@:read_data/write_data/append_data/write_xattr/write_attributes
             /write_acl/write_owner:allow
         2:group@:write_data/append_data/execute:deny
         3:group@:read_data:allow
         4:everyone@:write_data/append_data/write_xattr/execute/write_attributes
             /write_acl/write_owner:deny
         5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
             :allow


    Now let's make sure "marks" can't create directories.

    $ mkdir dir.1
     mkdir: Failed to make directory "dir.1"; Permission denied
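
The outcome above follows from ordered evaluation: NFSv4 ACL entries are checked from the top, and the first entry that matches the user and mentions the requested permission decides the result. The sketch below is a deliberately simplified model of that rule for a single permission. The function acl_check and its one-line entry format are hypothetical; it matches only a literal user name and everyone@, it compares permission names as plain strings, and it leaves out owner@ and group@ resolution and the bitmask handling of a real implementation.

```shell
#!/bin/sh
# Simplified model of ordered ACL evaluation for ONE requested permission.
# Entries arrive on stdin as "who:perm1/perm2:allow|deny", top entry first.
# Only a literal user name and everyone@ are matched; owner@ and group@
# resolution is left out to keep the sketch short.
acl_check() {   # usage: acl_check USER PERMISSION
    awk -F: -v who="$1" -v perm="$2" '
        $1 == who || $1 == "everyone@" {
            n = split($2, p, "/")
            for (i = 1; i <= n; i++)
                if (p[i] == perm) { print $3; found = 1; exit }
        }
        END { if (!found) print "deny" }   # nothing granted it, so it is denied
    '
}

acl='marks:add_file/write_data:allow
everyone@:add_file/write_data/add_subdirectory/append_data:deny
everyone@:list_directory/read_data/execute:allow'

printf '%s\n' "$acl" | acl_check marks add_file           # prints allow
printf '%s\n' "$acl" | acl_check marks add_subdirectory   # prints deny
```

The allow entry for marks sits above the everyone@ deny entry, so add_file is granted before the deny is reached, while add_subdirectory falls through to the deny.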

The write_owner permission is handled in a special way. It allows a user to "take" ownership of a file. With write_owner, a user can only chown(2) the file to himself or to a group that he is a member of. The following example illustrates this.

    We will start out with the following file.

    % ls -v file.test
    -rw-r--r--   1 ongk     staff          0 Nov 15 14:22 file.test
         0:owner@:execute:deny
         1:owner@:read_data/write_data/append_data/write_xattr/write_attributes
             /write_acl/write_owner:allow
         2:group@:write_data/append_data/execute:deny
         3:group@:read_data:allow
         4:everyone@:write_data/append_data/write_xattr/execute/write_attributes
             /write_acl/write_owner:deny
         5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
             :allow


    Now if user "marks" tries to chown(2) the file to himself he will get an error.

     $ chown marks file.test
     chown: file.test: Not owner

     $ chgrp staff file.test
     chgrp: file.test: Not owner


     Now let's give "marks" explicit write_owner permission.

     % chmod A+user:marks:write_owner:allow file.test
     % ls -v file.test
     -rw-r--r--+  1 ongk     staff          0 Nov 15 14:22 file.test
          0:user:marks:write_owner:allow
          1:owner@:execute:deny
          2:owner@:read_data/write_data/append_data/write_xattr/write_attributes
              /write_acl/write_owner:allow
          3:group@:write_data/append_data/execute:deny
          4:group@:read_data:allow
          5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
              /write_acl/write_owner:deny
          6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
              :allow

    Now let's see who "marks" can chown the file to.
   
    $ id
    uid=76928(marks) gid=10(staff)
    $ groups
    staff storage
    $ chown bin file.test
    chown: file.test: Not owner


    So "marks" can't give the file away.

    $ chown marks:staff file.test
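
The rule this example demonstrates can be summarized in a few lines of Python. This is only a sketch of the check as the text describes it, not the ZFS implementation: the function name `may_chown` and its parameters are made up for illustration, and privileged users are ignored.

```python
def may_chown(requester, requester_groups, has_write_owner,
              new_owner=None, new_group=None):
    """Return True if the ownership change is allowed under the rule above.

    Hypothetical helper: models only the write_owner rule from the text
    (chown to yourself, chgrp to a group you belong to).
    """
    if not has_write_owner:
        return False                      # no write_owner: "Not owner"
    if new_owner is not None and new_owner != requester:
        return False                      # cannot give the file away
    if new_group is not None and new_group not in requester_groups:
        return False                      # only groups the user is a member of
    return True


marks_groups = {"staff", "storage"}
print(may_chown("marks", marks_groups, True, new_owner="bin"))
print(may_chown("marks", marks_groups, True, new_owner="marks", new_group="staff"))
```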
   
Now let's look at an example that shows how a user can be granted special delete permissions.  ZFS doesn't create any delete permissions when a file is created; instead, it uses write_data/execute for permission to write to a directory and execute for permission to search the directory.

    Let's first create a read-only directory and then give "marks" the ability to delete files.

    % ls -dv test.dir
    dr-xr-xr-x   2 ongk     bin            2 Nov 15 14:11 test.dir
         0:owner@:add_file/write_data/add_subdirectory/append_data:deny
         1:owner@:list_directory/read_data/write_xattr/execute/write_attributes
             /write_acl/write_owner:allow
         2:group@:add_file/write_data/add_subdirectory/append_data:deny
         3:group@:list_directory/read_data/execute:allow
         4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
             /write_attributes/write_acl/write_owner:deny
         5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
             /read_acl/synchronize:allow


    Now the directory has the following files:

    % ls -l
    total 3
    -r--r--r--   1 ongk     bin            0 Nov 15 14:28 file.1
    -r--r--r--   1 ongk     bin            0 Nov 15 14:28 file.2
    -r--r--r--   1 ongk     bin            0 Nov 15 14:28 file.3


    Now let's see if "marks" can delete any of the files.

    $  rm file.1
    rm: file.1: override protection 444 (yes/no)? y
    rm: file.1 not removed: Permission denied

    Now let's give "marks" delete permission on just file.1.

    % chmod A+user:marks:delete:allow file.1
    % ls -v file.1
    -r--r--r--+  1 ongk     bin            0 Nov 15 14:28 file.1
         0:user:marks:delete:allow
         1:owner@:write_data/append_data/execute:deny
         2:owner@:read_data/write_xattr/write_attributes/write_acl/write_owner
             :allow
         3:group@:write_data/append_data/execute:deny
         4:group@:read_data:allow
         5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
             /write_acl/write_owner:deny
         6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
             :allow


    $ rm file.1
    rm: file.1: override protection 444 (yes/no)? y


Let's see what a chmod(1) that changes the mode does to a file with a ZFS ACL.
We will start out with the following ACL which gives user bin read_data and write_data permission.

    $ ls -v file.1
    -rw-r--r--+  1 marks    staff          0 Nov 15 10:12 file.1
         0:user:bin:read_data/write_data:allow
         1:owner@:execute:deny
         2:owner@:read_data/write_data/append_data/write_xattr/write_attributes
             /write_acl/write_owner:allow
         3:group@:write_data/append_data/execute:deny
         4:group@:read_data:allow
         5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
             /write_acl/write_owner:deny
         6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
             :allow


    $ chmod 640 file.1
    $ ls -v file.1
    -rw-r-----+  1 marks    staff          0 Nov 15 10:12 file.1
         0:user:bin:write_data:deny
         1:user:bin:read_data/write_data:allow
         2:owner@:execute:deny
         3:owner@:read_data/write_data/append_data/write_xattr/write_attributes
             /write_acl/write_owner:allow
         4:group@:write_data/append_data/execute:deny
         5:group@:read_data:allow
         6:everyone@:read_data/write_data/append_data/write_xattr/execute
             /write_attributes/write_acl/write_owner:deny
         7:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow


    In this example ZFS has prepended a deny ACE to take away write_data permission.  This
    is an example of disabling "alternate" access methods.  More details about
    how ACEs are disabled are given in the Internet Draft.
 
The ZFS admin guide and the chmod(1) manpages have many more examples of setting ACLs and how the inheritance model works.

With the ZFS ACL model access control is no longer limited to the simple "rwx" model that UNIX has used since its inception.

Lisa Week's Weblog

Over the last several months, I've been doing a lot of work with NFSv4 ACLs.  First, I worked with Sam to get NFSv4 ACL support into Solaris 10.  The major portion of this work involved implementing the pieces to be able to pass ACLs over-the-wire as defined by section 5.11 of the NFSv4 specification (RFC3530) and the translators (code to translate from UFS ACLs, also referred to as POSIX-draft ACLs, to NFSv4 ACLs and back).  At that point, Solaris was further along with regard to ACLs than it ever had been, but was still not able to support the full semantics of NFSv4 ACLs.  So...here comes ZFS!

After getting the support for NFSv4 ACLs into Solaris 10, I started working on the ZFS ACL model with Mark and Sam.  So, you might wonder why a couple of NFS people (Sam and I) would be working with ZFS (Mark) on the ZFS ACL model...well that is a good question.   The reason for that is because ZFS has implemented native NFSv4 ACLs.  This is really exciting because it is the first time that Solaris is able to support the full semantics of NFSv4 ACLs as defined by RFC3530.

In order to implement native NFSv4 ACLs in ZFS, there were a lot of problems we had to overcome.  Some of the biggest struggles were ambiguities in the NFSv4 specification and the requirement for ZFS to be POSIX compliant.  These problems have been captured in an Internet Draft submitted by Sam and me on October 14, 2005.

ACLs in the Computer Industry:

What makes NFSv4 ACLs so special...so special to have the shiny, new ZFS implement them?  No previous attempt to specify a standard for ACLs has succeeded, therefore, we've seen a lot of different (non-standard) ACL models in the industry.  With NFS Version 4, we now have an IETF approved standard for ACLs.

As well as being a standard, the NFSv4 ACL model is very powerful.  It has a rich set of inheritance properties as well as a rich set of permission bits beyond just read, write and execute (as explained in the Access mask bits section below).  And for the Solaris NFSv4 implementation this means better interoperability with other vendors' NFSv4 implementations.

ACLs in Solaris:

Like I said before, ZFS has native NFSv4 ACLs!  This means that ZFS can fully support the semantics as defined by the NFSv4 specification (with the exception of a couple things, but that will be mentioned later).

What makes up an ACL?

ACLs are made up of zero or more Access Control Entries (ACEs).  Each ACE has multiple components and they are as follows:

1.) Type component:
        The type component of the ACE defines the type of ACE.  There
        are four types of ACEs: ALLOW, DENY, AUDIT, ALARM.


        The ALLOW type ACEs permit access.
        The DENY type ACEs restrict access.
        The AUDIT type ACEs audit accesses.
        The ALARM type ACEs alarm accesses.

        The ALLOW and DENY type of ACEs are implemented in ZFS.
        AUDIT and ALARM type of ACEs are not yet implemented in ZFS.

        The possibilities of the AUDIT and ALARM type ACEs are described below.  I
        wanted to explain the flags that need to be used in conjunction with them before
        going into any detail on what they do, therefore, I gave this description its own
        section.

2.) Access mask bits component:
        The access mask bit component of the ACE defines the accesses
        that are controlled by the ACE.

        There are two categories of access mask bits:
        1.) The bits that control the access to the file
                i.e. write_data, read_data, write_attributes, read_attributes
        2.) The bits that control the management of the file
                i.e. write_acl, write_owner

        For an explanation of what each of the access mask bits actually control in ZFS,
        check out Mark's blog.

3.) Flags component:
        There are three categories of flags:
        1.) The bits that define inheritance properties of an ACE.
                i.e. file_inherit, directory_inherit, inherit_only,
                      no_propagate_inherit
                Again, for an explanation of these flags, check out Mark's blog.
        2.) The bits that define whether or not the ACE applies to a user or group
                i.e. identifier_group
        3.) The bits that work in conjunction with the AUDIT and ALARM type ACEs
                i.e. successful_access_flag, failed_access_flag.
                ZFS doesn't support these flags since it doesn't support AUDIT and
                ALARM type ACEs.

4.) who component:
        The who component defines the entity that the ACE applies to.

        For NFSv4, this component is a string identifier and it can be a user, group or
        special identifier (OWNER@, GROUP@, EVERYONE@).  An important thing to
        note about the EVERYONE@ special identifier is that it literally means everyone
        including the file's owner and owning group.  EVERYONE@ is not equivalent to
        the UNIX other entity.  (If you are curious as to why NFSv4 uses strings rather
        than integers (uids/gids), check out Eric's blog.)

        For ZFS, this component is an integer (uid/gid).
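
To make the four components concrete, here is a minimal Python sketch of an ACE as a data structure. It is an illustration only: the class and field names (`Ace`, `AceType`, `supported_by_zfs`) are invented here, permissions and flags are kept as plain strings, and nothing about the actual on-disk or over-the-wire encoding is implied.

```python
from dataclasses import dataclass
from enum import Enum


class AceType(Enum):
    ALLOW = "allow"
    DENY = "deny"
    AUDIT = "audit"
    ALARM = "alarm"


@dataclass(frozen=True)
class Ace:
    ace_type: AceType          # 1.) type component
    access_mask: frozenset     # 2.) access mask bits, e.g. {"read_data"}
    flags: frozenset           # 3.) flags, e.g. {"file_inherit"}
    who: str                   # 4.) who: a user, group, or OWNER@/GROUP@/EVERYONE@


# Per the text, only ALLOW and DENY are implemented in ZFS so far.
ZFS_IMPLEMENTED_TYPES = {AceType.ALLOW, AceType.DENY}


def supported_by_zfs(ace):
    """Hypothetical helper: True if the ACE type is one the text says ZFS implements."""
    return ace.ace_type in ZFS_IMPLEMENTED_TYPES


deny_write = Ace(AceType.DENY, frozenset({"write_data"}), frozenset(), "lisagab")
alarm_write = Ace(AceType.ALARM, frozenset({"write_data"}),
                  frozenset({"failed_access_flag"}), "lisagab")
print(supported_by_zfs(deny_write), supported_by_zfs(alarm_write))
```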

What do AUDIT and ALARM ACE types do?

The AUDIT and ALARM type ACEs trigger an audit or alarm event upon successful or failed accesses, depending on the presence of the successful/failed access flags (described above), as defined in the access mask bits of the ACE.  The ACEs of type AUDIT and ALARM don't play a role when doing access checks on a file.  They only define an action to happen in the event that a certain access is attempted.

For example, let's say we have the following ACL:
 
lisagab:write_data::deny
lisagab:write_data:failed_access_flag:alarm
The first ACE affects the access that user "lisagab" has to the file.  The second ACE says that if user "lisagab" attempts to access this file for writing and fails, an alarm event is triggered.

One important thing to remember is that what we do in the event of auditing or alarming is still undefined.  You can think of it like this, though: when the access in question happens, auditing could be the logging of the event to a file, and alarming could be the sending of an email to an administrator.

How is access checking done?

To quote the NFSv4 specification:
 To determine if a request succeeds, each nfsace4 entry is processed
 in order by the server. Only ACEs which have a "who" that matches
 the requester are considered. Each ACE is processed until all of the
 bits of the requester's access have been ALLOWED. Once a bit (see
 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
 considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
 is encountered where the requester's access still has unALLOWED bits
 in common with the "access_mask" of the ACE, the request is denied.
What this means is:

The most important thing to note about access checking with NFSv4 ACLs is that it is very order dependent.  If a request for access is made, each ACE in the ACL is traversed in order.  The first ACE that matches the who of the requester and defines the access that is being requested is honored.

For example, let's say user "lisagab" is requesting the ability to read the data of file "foo", and "foo" has the following ACL:

 

everyone@:read_data::allow
lisagab:write_data::deny

lisagab would be allowed the ability to read_data because lisagab is covered by "everyone@".

Another thing that is important to know is that the access determined is cumulative.

For example, let's say user "lisagab" is requesting the ability to read and write the data of file "bar", and "bar" has the following ACL:

 
lisagab:read_data::allow
lisagab:write_data::allow

lisagab would be allowed the ability to read_data and write_data.
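
The ordered, cumulative evaluation quoted from the specification can be sketched in Python as follows. This is a simplified model, not the Solaris code: ACEs are plain `(who, permissions, type)` tuples, the requester is represented by the set of "who" strings that match it, and denying a request whose bits are never explicitly allowed is an assumption made here for the sketch.

```python
def check_access(acl, requester_ids, requested):
    """Walk the ACL in order and decide whether `requested` is granted.

    acl           -- list of (who, set_of_permissions, "allow" | "deny")
    requester_ids -- who-strings that match the requester,
                     e.g. {"lisagab", "everyone@"}
    requested     -- set of permission names being asked for
    """
    remaining = set(requested)            # bits not yet ALLOWED
    if not remaining:
        return True
    for who, perms, ace_type in acl:
        if who not in requester_ids:
            continue                      # ACE does not apply to this requester
        if ace_type == "allow":
            remaining -= perms            # allowed bits are no longer considered
            if not remaining:
                return True
        elif ace_type == "deny" and remaining & perms:
            return False                  # a still-unallowed bit is denied
    return False                          # assumption: leftover bits mean denial


lisagab = {"lisagab", "everyone@"}
foo = [("everyone@", {"read_data"}, "allow"), ("lisagab", {"write_data"}, "deny")]
bar = [("lisagab", {"read_data"}, "allow"), ("lisagab", {"write_data"}, "allow")]

print(check_access(foo, lisagab, {"read_data"}))
print(check_access(bar, lisagab, {"read_data", "write_data"}))
```

Swapping the order of an allow and a deny entry for the same permission changes the result, which is the order dependence described above.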

How to use ZFS/NFSv4 ACLs on Solaris:

Many of you may remember the setfacl(1) and getfacl(1) commands.  Well, those are still around, but won't help you much with manipulating ZFS or pure NFSv4 ACLs.  Those commands are only capable of manipulating the POSIX-draft ACLs as implemented by UFS.

As a part of the ZFS putback, Mark has modified the chmod(1) and ls(1) command line utilities in order to manipulate ACLs on Solaris.

chmod(1) and ls(1) now give us the ability to manipulate ZFS/NFSv4 ACLs.  Interestingly enough, these utilities can also manipulate POSIX-draft ACLs so, now there is a one stop shop for all your ACL needs.

[Sep 3, 2009] Working with filesystems using NFSV4 ACLs

The NFSv4 (Network File System – Version 4) protocol introduces a new ACL (Access Control List) format that extends other existing ACL formats. NFSv4 ACL is easy to work with and introduces more detailed file security attributes, making NFSv4 ACLs more secure. Several operating systems like IBM® AIX®, Sun Solaris, and Linux® have implemented NFSv4 ACL in their filesystems.

Currently, the filesystems that support NFSv4 ACL in IBM AIX 5L version 5.3 and above are NFSv4, JFS2 with EAv2 (Extended Journaled Filesystem with Extended Attributes format version 2), and General Parallel Filesystem (GPFS). In Sun Solaris, this ACL model is supported by ZFS. In RedHat Linux, NFSv4 supports NFSv4 ACLs.

...ZFS supports the NFSv4 ACL model, and has implemented the commands in the form of new options to the existing ls and chmod commands. Thus, the ACLs can be set and displayed using the chmod and ls commands; no new command has been introduced. Because of this, it is very easy to work with ACLs in ZFS.

ZFS ACL format

ZFS ACLs follow a well-defined format. The format and the entities involved in this format are:


Syntax A
 
                
ACL_entry_type:Access_permissions/…/[:Inheritance_flags]:deny or allow
      

 

ACL_entry_type includes "owner@", "group@", or "everyone@".

For example:

group@:write_data/append_data/execute:deny


Syntax B
 
                
ACL_entry_type: ACL_entry_ID:Access_permissions/…/[:Inheritance_flags]:deny or allow
      

 

ACL_entry_type includes "user", or "group".

ACL_entry_ID includes "user_name", or "group_name".

For example:

user:samy:list_directory/read_data/execute:allow


Inheritance flags
 
          
f : FILE_INHERIT
d : DIRECTORY_INHERIT
i : INHERIT_ONLY
n : NO_PROPAGATE_INHERIT
S : SUCCESSFUL_ACCESS_ACE_FLAG
F : FAILED_ACCESS_ACE_FLAG
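
As an illustration of the two syntaxes, the following Python sketch splits an ACL entry string into its parts. The function name `parse_ace` and the dictionary layout are made up for this example; it handles only the forms shown above and does not validate permission or flag names.

```python
def parse_ace(text):
    """Parse 'owner@:perms[:flags]:type' (Syntax A) or
    'user:name:perms[:flags]:type' (Syntax B) into a dictionary."""
    fields = text.split(":")
    if fields[0] in ("user", "group"):        # Syntax B carries an entry ID
        entry_type, entry_id, rest = fields[0], fields[1], fields[2:]
    else:                                     # Syntax A: owner@, group@, everyone@
        entry_type, entry_id, rest = fields[0], None, fields[1:]
    if len(rest) == 2:                        # no inheritance flags given
        perms, flags, ace_type = rest[0], "", rest[1]
    elif len(rest) == 3:
        perms, flags, ace_type = rest
    else:
        raise ValueError("unrecognized ACL entry: %r" % text)
    if ace_type not in ("allow", "deny"):
        raise ValueError("entry must end in allow or deny: %r" % text)
    return {
        "entry_type": entry_type,
        "entry_id": entry_id,
        "permissions": [p for p in perms.split("/") if p],
        "flags": [f for f in flags.split("/") if f],
        "type": ace_type,
    }


print(parse_ace("group@:write_data/append_data/execute:deny"))
print(parse_ace("user:samy:list_directory/read_data/execute:allow"))
```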

 

Listing ACLs of ZFS files and directories

ACLs can be listed using the ls command using the -v and -V options. For listing directory ACLs, use the -d option.

Operation                       Command
Listing ACL entries of files    ls -[v | V] <file_name>
Listing ACL entries of dirs     ls -d[v | V] <dir_name>

Example for listing ACLs of a file
 
ls -v file.1
-rw-r--r-- 1 root root 2703 Nov 4 12:37 file.1
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
       write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
       write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow


Example for listing ACLs of a directory
 
        
# ls -dv dir.1
drwxr-xr-x 2 root root 2 Nov 1 14:51 dir.1
0:owner@::deny
1:owner@:list_directory/read_data/add_file/write_data/add_subdirectory/
    append_data/write_xattr/execute/write_attributes/write_acl/write_owner:allow
2:group@:add_file/write_data/add_subdirectory/append_data:deny
3:group@:list_directory/read_data/execute:allow
4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr/
    write_attributes/write_acl/write_owner:deny
5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes/
    read_acl/synchronize:allow


Example for listing ACLs in a compact format
 
# ls -Vd dir.1
drwxr-xr-x   2 root     root           2 Sep  1 05:46 d
   owner@:--------------:------:deny
   owner@:rwxp---A-W-Co-:------:allow
   group@:-w-p----------:------:deny
   group@:r-x-----------:------:allow
everyone@:-w-p---A-W-Co-:------:deny
everyone@:r-xp--a-R-c--s:------:allow

 

In the above example, ACLs are displayed in a compact format, in which access permissions and inheritance flags are shown as masks. One ACL entry is displayed on each line, making the view easier to understand.
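
The 14-character permission mask in the compact listing can be expanded back into permission names. The Python sketch below assumes the letter order `rwxpdDaARWcCos` as documented for Solaris chmod(1); the function name `decode_compact` is invented for the example. On directories, the r, w and p positions correspond to list_directory, add_file and add_subdirectory.

```python
# Letter-to-name mapping in display order (assumed from the chmod(1) man page).
COMPACT_ORDER = [
    ("r", "read_data"), ("w", "write_data"), ("x", "execute"),
    ("p", "append_data"), ("d", "delete"), ("D", "delete_child"),
    ("a", "read_attributes"), ("A", "write_attributes"),
    ("R", "read_xattr"), ("W", "write_xattr"),
    ("c", "read_acl"), ("C", "write_acl"),
    ("o", "write_owner"), ("s", "synchronize"),
]


def decode_compact(mask):
    """Expand a compact mask such as 'r-x-----------' into permission names."""
    if len(mask) != len(COMPACT_ORDER):
        raise ValueError("expected %d characters" % len(COMPACT_ORDER))
    names = []
    for ch, (letter, name) in zip(mask, COMPACT_ORDER):
        if ch == letter:
            names.append(name)
        elif ch != "-":
            raise ValueError("unexpected character %r in mask" % ch)
    return names


print(decode_compact("r-x-----------"))
print(decode_compact("-w-p----------"))
```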


Modifying ACLs of ZFS files and directories
 

ACLs can be set or modified using the chmod command. The chmod command uses the ACL-specification, which includes the ACL-format (Syntax A or B), listed earlier.

Operation                             Command
Adding an ACL entry by index_ID       # chmod Aindex_ID+acl_specification filename
Adding an ACL entry for a user        # chmod A+acl_specification filename
Removing an ACL entry by index_ID     # chmod Aindex_ID- filename
Removing an ACL entry by user         # chmod A-acl_specification filename
Removing an ACL from a file           # chmod A- filename
Replacing an ACL entry at index_ID    # chmod Aindex_ID=acl_specification filename
Replacing an ACL of a file            # chmod A=acl_specification filename
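
Because an ACL is an ordered list, the index-based operations in the table behave like ordinary list edits. The Python sketch below models that with hypothetical helper functions; it does not call chmod and does not reproduce what `chmod A-` does when it rebuilds the default entries from the file mode.

```python
def acl_prepend(acl, entry):
    """Model of 'chmod A+entry': the new entry becomes index 0."""
    return [entry] + acl


def acl_insert(acl, index, entry):
    """Model of 'chmod Aindex+entry': insert before the entry at index."""
    if not 0 <= index <= len(acl):
        raise IndexError("no such ACL index: %d" % index)
    return acl[:index] + [entry] + acl[index:]


def acl_remove_at(acl, index):
    """Model of 'chmod Aindex-': drop the entry at index."""
    if not 0 <= index < len(acl):
        raise IndexError("no such ACL index: %d" % index)
    return acl[:index] + acl[index + 1:]


def acl_replace_at(acl, index, entry):
    """Model of 'chmod Aindex=entry': overwrite the entry at index."""
    if not 0 <= index < len(acl):
        raise IndexError("no such ACL index: %d" % index)
    return acl[:index] + [entry] + acl[index + 1:]


# A shortened stand-in for the six default entries of a new file.
acl = ["owner@:execute:deny", "owner@:read_data/write_data:allow",
       "group@:write_data/execute:deny", "group@:read_data:allow",
       "everyone@:write_data/execute:deny", "everyone@:read_data:allow"]
acl = acl_prepend(acl, "user:samy:read_data:allow")
acl = acl_insert(acl, 1, "user:samy:execute:deny")
acl = acl_replace_at(acl, 0, "user:samy:read_data/write_data:allow")
for index, entry in enumerate(acl):
    print("%d:%s" % (index, entry))
```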

Examples of ZFS ACLs modifications


List ACL entries
 
# ls -v a
-rw-r--r--   1 root     root           0 Sep  1 04:25 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow




Add ACL entries
 
# chmod A+user:samy:read_data:allow a
# ls -v a
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A1+user:samy:execute:deny a
# ls -v a
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data:allow
1:user:samy:execute:deny
2:owner@:execute:deny
3:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
4:group@:write_data/append_data/execute:deny
5:group@:read_data:allow
6:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
7:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow



Replace ACL entries
 
# chmod A0=user:samy:read_data/write_data:allow a
# ls -v
total 2
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data/write_data:allow
1:user:samy:execute:deny
2:owner@:execute:deny
3:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
4:group@:write_data/append_data/execute:deny
5:group@:read_data:allow
6:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
7:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow


# chmod A=user:samy:read_data/write_data/append_data:allow a
# ls -v a
----------+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data/write_data/append_data:allow

 

ACLs can also be modified using the masks instead of specifying complete names.


Modifying ACL entries using masks
 
# ls -V a
-rw-r--r--+  1 root     root           0 Sep  5 01:50 a
user:samy:--------------:------:deny
user:samy:rwx-----------:------:allow
   owner@:--x-----------:------:deny
   owner@:rw-p---A-W-Co-:------:allow
   group@:-wxp----------:------:deny
   group@:r-------------:------:allow
everyone@:-wxp---A-W-Co-:------:deny
everyone@:r-----a-R-c--s:------:allow

# chmod A1=user:samy:rwxp:allow a

# ls -V a
-rw-r--r--+  1 root     root           0 Sep  5 01:50 a
user:samy:--------------:------:deny
user:samy:rwxp----------:------:allow
   owner@:--x-----------:------:deny
   owner@:rw-p---A-W-Co-:------:allow
   group@:-wxp----------:------:deny
   group@:r-------------:------:allow
everyone@:-wxp---A-W-Co-:------:deny
everyone@:r-----a-R-c--s:------:allow



Remove ACL entries
 
# ls -v a
-rw-r-----+  1 root     root           0 Sep  5 01:50 a
0:user:samy:read_data/write_data/execute:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny
6:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A- a
# ls -v a
-rw-r-----   1 root     root           0 Sep  5 01:50 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny
5:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A5- a
# ls -v a
-rw-r-----   1 root     root           0 Sep  5 01:50 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny

[Sep 2, 2009] Working With ZFS Snapshots (pdf)

Helps to understand the capabilities of ZFS snapshots, a read-only copy of a Solaris ZFS file system. ZFS snapshots can be created almost instantly and are a valuable tool for system administrators needing to perform backups.

You will learn:

After reading this guide, you will have a basic understanding of how snapshots can be integrated into your system administration procedures.

See also

ZFS vs. Linux Raid + LVM

Comparison of ZFS and Linux RAID +LVM

i. ZFS doesn't support RAID 5 but does support RAID-Z, which has better features and fewer limitations.

ii. RaidZ - A variation on RAID-5 which allows for better distribution of parity and eliminates the "RAID-5 write hole" (in which data and parity become inconsistent after a power loss). Data and parity are striped across all disks within a raidz group. A raidz group with N disks of size X can hold approximately (N-1)*X bytes and can withstand one device failing before data integrity is compromised. The minimum number of devices in a raidz group is 2; the recommended number is between 3 and 9.

iv. A clone is a writable volume or file system whose initial contents are the same as another dataset. As with snapshots, creating a clone is nearly instantaneous, and initially consumes no additional space.

v. [Linux] RAID (be it hardware or software) assumes that if a write to a disk doesn't return an error, then the write was successful. Therefore, if your disk corrupts data without returning an error, your data will become corrupted. This is of course very unlikely to happen, but it is possible, and it would result in a corrupt filesystem. http://www.tldp.org/HOWTO/Software-RAID-HOWTO-6.html
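
The raidz capacity rule quoted in the notes above, approximately (N-1)*X bytes for N disks of size X, is easy to check with a short Python function. The function name is hypothetical, and the result is the approximation from the text, ignoring metadata and allocation overhead.

```python
def raidz_usable_bytes(n_disks, disk_bytes):
    """Approximate usable capacity of a single-parity raidz group."""
    if n_disks < 2:
        raise ValueError("a raidz group needs at least 2 devices")
    return (n_disks - 1) * disk_bytes   # one disk's worth of space holds parity


TB = 10 ** 12
print(raidz_usable_bytes(5, 1 * TB) // TB, "TB usable from five 1 TB disks")
```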

[May 12, 2008] ZFS what the ultimate file system really means for your desktop -- in plain English!

Ashton Mills, 21 June 2007
So, Sun's ZFS file system has garnered publicity recently with the announcement of its inclusion in Mac OS X and, more recently, as a module for the Linux kernel. But if you don't read Filesystems Weekly, what is it and what does it mean for you?

Now I may just be showing my geek side a bit here, but file systems are awesome. Aside from the fact our machines would be nothing without them, the science behind them is frequently ingenious.

And ZFS (the Zettabyte File System) is no different. It has quite an extensive feature set just like its peers, but builds on this by adding a new layer of simplicity. According to the official site, ZFS key features are (my summary):
 

All up, as a geek, it's an exciting file system I'd love to play with. Currently, however, ZFS is part of Sun's Solaris and is licensed under the CDDL (Common Development and Distribution License), which is based on the MPL (Mozilla Public License). As this is incompatible with the GPLv2, the code can't be ported to the Linux kernel. This has recently been worked around by porting it as a FUSE module, which, being in userspace, is slow, though there is hope this will improve. Looks like it's time to enable FUSE support in my kernel!

Of course, (in a few months time) you could also go for Mac OS X where, in Leopard, ZFS is already supported and there are rumours Apple may be preparing to adopt it as the default filesystem replacing the aging HFS+ in the future (but probably not in 10.5).

[Jun 27, 2007] Solaris ZFS and Microsoft Server 2003 NTFS File System Performance - BigAdmin Description

Description: This white paper explores the performance characteristics and differences of ZFS in the Solaris 10 OS and the Microsoft Windows Server 2003 NTFS file system.

[Jun 12, 2007] Apple's Leopard will use ZFS, but not exclusively | Tech news blog ...

Jun 12, 2007

... Apple confirmed statements by Sun's Jonathan Schwartz that Leopard will use ZFS, correcting an executive who Monday suggested otherwise.

[Apr 6, 2007] ZFS committed to the FreeBSD base.

Pawel Jakub Dawidek pjd at FreeBSD.org
Fri Apr 6 02:58:34 UTC 2007
Hi.

I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.

Commit log:

  Please welcome ZFS - The last word in file systems.
  
  ZFS file system was ported from OpenSolaris operating system. The code
  is under CDDL license.
  
  I'd like to thank all SUN developers that created this great piece of
  software.
  
  Supported by:	Wheel LTD (http://www.wheel.pl/)
  Supported by:	The FreeBSD Foundation (http://www.freebsdfoundation.org/)
  Supported by:	Sentex (http://www.sentex.net/)

Limitations.

  Currently ZFS is only compiled as kernel module and is only available
  for i386 architecture. Amd64 should be available very soon, the other
  archs will come later, as we implement needed atomic operations.

Missing functionality.

  - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
    iSCSI is also not supported at this point. This should be fixed in
    the future, we may also add support for sharing ZVOLs over ggate.
  - There is no support for ACLs and extended attributes.
  - There is no support for booting off of ZFS file system.

Other than that, ZFS should be fully-functional.

Enjoy!

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

[Apr 2, 2007] ZFS Overview and Guide - Features

[Aug 14, 2006] Techworld.com - ZFS - the future of file systems ? By Chris Mellor

Techworld

ZFS - the Zettabyte File System - is an enormous advance in capability on existing file systems. It provides greater space for files, hugely improved administration and greatly improved data security.

It is available in Sun's Solaris 10 and has been made open source. The advantages of ZFS look so great that its use may well spread to other UNIX distributions and even, possibly and eventually, to Windows.

Techworld has mentioned ZFS before. Here we provide a slightly wider and more detailed look at it. If you want to have even more information then the best resource is Sun's own website.

Why is ZFS a good thing?
It possesses advantages compared to existing file systems in these areas:-

- Scale
- Administration
- Data security and integrity

The key area is file system administration, followed by data security and file system size. ZFS started from a realisation that the existing file system concepts were hardly changed at all from the early days of computing. Then a computer knew about a disk which had files on it. A file system related to a single disk. On today's PCs the file systems are still disk-based with the Windows C: drive - A: and B: being floppy drives - and subsequent drives being D:, E:, etc.

To provide more space and bandwidth a software abstraction was added between the file system and the disks. It was called a volume manager and virtualised several disks into a volume.

Each volume has to be administered and growing volumes and file systems takes effort. Volume Manager software products became popular. The storage in a volume is specific to a server and application and can't be shared. Utilisation of storage is poor with any unused blocks on disks in volumes being unusable anywhere else.

ZFS starts from the concept that desktop and servers have many disks and that a good place to start abstracting this is at the operating system:file system interface. Consequently ZFS delivers, in effect, just one volume to the operating system. We might imagine it as disk:. From that point ZFS delivers scale, administration and data security features that other file systems do not.

ZFS has a layered stack with a POSIX-compliant operating system interface, then data management functions and, below that, increasingly device-specific functions. We might characterise ZFS as being a file system with a volume manager included within it, the data management function.

Data security
Data protection through RAID is clever but only goes so far. When data is written to disk it overwrites the current version of the data. There are instances of stray or phantom writes, mis-directed writes, DMA parity errors, disk driver bugs and accidental overwrites according to ZFS people, that the standard checksum approach won't detect.

The checksum is stored with the data block and is valid for that data block, but the data block shouldn't be there in the first place. The checksum is a disk-only checksum and doesn't cover against faults in the I/O path before that data gets written to disk.

If disks are mirrored then a block is simultaneously written to each mirror. If one drive or controller suffers a power failure then that mirror is out of synchronisation and needs re-synchronising with its twin.

With RAID if there is a loss of power between data and parity writes then disk contents are corrupted.

ZFS does things differently.

First of all it uses copy-on-write technology so that existing data blocks are not over-written. Instead new data blocks are written and their checksum stored with the pointer to them.

When a file write has been completed then the pointers to the previous blocks are changed so as to point to the new blocks. In other words the file write is treated as a transaction, an event that is atomic and has to be completed before it is confirmed or committed.

Secondly ZFS checks the disk contents looking for checksum/data mismatches. This process is called scrubbing. Any faults are corrected and a ZFS system exhibits what IBM calls autonomic computing capacity; it is self-healing.
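
The two ideas above, copy-on-write with the checksum kept alongside the pointer, and scrubbing for mismatches, can be illustrated with a toy model in Python. Everything here is an assumption made for the sketch: the class names, the single-block "file", and the use of SHA-256 as the checksum. Repairing a bad block from a redundant copy is not modeled.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Toy disk: blocks are never overwritten, only newly allocated."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def allocate(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return (addr, checksum(data))     # pointer = address + checksum


class CowFile:
    def __init__(self, store):
        self.store = store
        self.pointer = None

    def write(self, data):
        new_pointer = self.store.allocate(data)   # old block stays intact
        self.pointer = new_pointer                # one pointer swap commits

    def read(self):
        addr, expected = self.pointer
        data = self.store.blocks[addr]
        if checksum(data) != expected:
            raise IOError("checksum mismatch at block %d" % addr)
        return data


def scrub(store, pointers):
    """Return the addresses whose contents no longer match their checksum."""
    return [addr for addr, expected in pointers
            if checksum(store.blocks[addr]) != expected]


store = BlockStore()
f = CowFile(store)
f.write(b"version 1")
old_pointer = f.pointer
f.write(b"version 2")
print(store.blocks[old_pointer[0]], f.read())
```

Because the checksum lives with the pointer rather than with the block, a block that was silently overwritten or misdirected no longer matches what its parent expects.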

Scale
ZFS uses a 128-bit addressing scheme and can store 256 quadrillion zettabytes. A zettabyte is 2 to the power 70 bytes or roughly a billion TB. ZFS capacity limits are so far away as to be unimaginable. This is eye-catching stuff, but 64-bit file system capacity limits are unlikely to become a practical constraint for decades, so the extra headroom is not solving a near-term problem.
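The headline figure can be checked with a few lines of integer arithmetic. The sketch assumes, as the text does, that a zettabyte means 2 to the power 70 bytes, and it assumes "quadrillion" is being used in the binary sense of 2 to the power 50, which is one way to arrive at exactly 256.

```python
ZETTABYTE = 2 ** 70      # as defined in the text
POOL_BYTES = 2 ** 128    # 128-bit addressing, counted in bytes

zettabytes = POOL_BYTES // ZETTABYTE
print(zettabytes == 2 ** 58)    # True
print(zettabytes)               # 288230376151711744
print(zettabytes // 2 ** 50)    # 256
```

Read with decimal prefixes the same number is about 288 quadrillion, so the quoted "256" is best taken as a power-of-two figure.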

Administration
With ZFS all storage enters a common pool, called a zpool. Every disk or array added to ZFS disappears into this common pool. ZFS people characterise this storage pool as being akin to a computer's virtual memory.

A hierarchy of ZFS file systems can use that pool. Each can have its own attributes set, such as compression, a growth-limiting quota, or a set amount of space.

I/O characteristics
ZFS has its own I/O system. I/Os have a priority with read I/Os having a higher priority than write I/Os. That means that reads get executed even if writes are queued up.

Write I/Os have both a priority and a deadline. The higher the priority, the sooner the deadline. Writes with the same deadline are executed in logical block address order so that, in effect, they form a sequential series of writes across a disk, which reduces head movement to a single sweep across the disk surface. What's happening is that random write I/Os are getting transformed into sets of sequential I/Os to make the overall write I/O rate faster.
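The ordering rule for writes can be sketched as a two-part sort key: earliest deadline first, then ascending logical block address. Everything concrete here is an assumption for illustration: the `WriteIO` type, the convention that a larger number means higher priority, and the way a deadline is derived from priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteIO:
    lba: int        # logical block address
    priority: int   # larger = more urgent (assumed convention)


def deadline(io: WriteIO, now: int = 0, window: int = 10) -> int:
    # Assumed mapping: higher priority gives an earlier deadline.
    return now + window - io.priority


def issue_order(writes):
    # Earliest deadline first; equal deadlines go out in LBA order,
    # which turns scattered writes into one sweep across the disk.
    return sorted(writes, key=lambda io: (deadline(io), io.lba))


queue = [WriteIO(900, 1), WriteIO(20, 1), WriteIO(500, 5),
         WriteIO(310, 1), WriteIO(40, 5)]
print([io.lba for io in issue_order(queue)])   # [40, 500, 20, 310, 900]
```

The two urgent writes go first, and within each deadline group the addresses only ever increase.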

Striping and blocksizes
ZFS stripes files automatically. Block sizes are dynamically set. Blocks are allocated from disks based on an algorithm that takes into account space available and I/O counts. When blocks are being written, the copy-on-write concept means that a sequential set of blocks can be used, speeding up write I/O.
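The text does not spell out the allocation algorithm, so the following is only a guess at its general shape: a hypothetical scoring rule that prefers the device with the most free space and, on a tie, the one with the fewest I/Os so far. The `pick_device` and `allocate` helpers and the device names are invented for the example.

```python
def pick_device(devices):
    """devices maps a name to {"free": free blocks, "ios": I/Os issued}.
    Hypothetical rule: most free space wins, fewer I/Os breaks ties."""
    return max(devices,
               key=lambda name: (devices[name]["free"], -devices[name]["ios"]))


def allocate(devices, n_blocks):
    """Place n_blocks one at a time, updating the counters as we go."""
    placement = []
    for _ in range(n_blocks):
        name = pick_device(devices)
        devices[name]["free"] -= 1
        devices[name]["ios"] += 1
        placement.append(name)
    return placement


devices = {
    "disk0": {"free": 3, "ios": 0},
    "disk1": {"free": 3, "ios": 0},
    "disk2": {"free": 1, "ios": 0},
}
print(allocate(devices, 4))   # ['disk0', 'disk1', 'disk0', 'disk1']
```

Even this crude rule spreads consecutive blocks across the emptier devices, which is the striping effect the paragraph describes.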

ZFS and NetApp's WAFL
ZFS is based in part on NetApp's Write Anywhere File Layout (WAFL) system. It has moved on from WAFL and now has many differences. This table lists some of them. But do read the blog replies which correct some table errors.

There is more on the ZFS and WAFL similarities and differences here.

Snapshots unlimited and more
ZFS can take a virtually unlimited number of snapshots and these can be used to restore lost (deleted) files. However, they can't protect against disk crashes. For that, RAID and backup to external devices are needed.

ZFS offers compression, encryption is being developed, and an initiative is under way to make it bootable. The compression is applied before data is written meaning that the write I/O burden is reduced and hence effective write speed increased further.

We may see Sun offering storage arrays with ZFS. For example we might see a Sun NAS box based on ZFS. This is purely speculative, as is the idea that we might see Sun offering clustered NAS ZFS systems to take on Isilon and others in the high-performance, clustered, virtualised NAS area.

So what?
There is a lot of software engineering enthusiasm for ZFS and the engineers at Sun say that ZFS outperforms other file systems, for example the Solaris file system. It is faster at file operations and, other things being equal, a ZFS Solaris system will out-perform a non-ZFS Solaris system. Great, but will it out-perform other UNIX servers and Windows servers, again with other things being equal?

We don't know. We suspect it might but don't know by how much. Even then the popularity of ZFS will depend upon how it is taken up by Sun Solaris 10 customers and whether ports to Apple and to Linux result in wide use. For us storage people the ports that really matter are to mainstream Unix versions such as AIX, HP-UX and Red Hat Linux, also SUSE Linux I suppose.

There is no news of a ZFS port to Windows and Vista's own advanced file system plans have quite recently been downgraded with its file system changes.

If Sun storage systems using ZFS, such as its X4500 'Thumper' server, with ZFS-enhanced direct-attached storage (DAS), and Honeycomb, become very popular and are as market-defining as EMC's Centera product then we may well see ZFS spreading. But their advantages have to be solid and substantial with users getting far, far better file-based application performance and a far, far lower storage system management burden. Such things need proving in practice.

To find out for yourself try these systems out or wait for others to do so.

How to reformat all of your systems and use ZFS.

1. So easy your mom could administer it

ZFS is administered by two commands, zpool and zfs. Most tasks typically require a single command to accomplish. And the commands are designed to make sense. For example, check out the commands to create a RAID 1 mirrored filesystem and place a quota on its size.
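The commands the author refers to did not survive in this copy. As a stand-in, here is a sketch that assembles the kind of command sequence meant and prints it without running anything. The pool name `tank`, the Solaris-style device names, the filesystem name and the quota size are all placeholders; the `mirrored_fs_with_quota` helper is invented for the example.

```python
def mirrored_fs_with_quota(pool, disks, fs, quota):
    """Build the commands as argument lists; nothing is executed here."""
    return [
        ["zpool", "create", pool, "mirror", *disks],      # mirrored pool
        ["zfs", "create", f"{pool}/{fs}"],                # filesystem in it
        ["zfs", "set", f"quota={quota}", f"{pool}/{fs}"], # cap its size
    ]


for argv in mirrored_fs_with_quota("tank", ["c0t0d0", "c1t0d0"], "home", "10g"):
    print(" ".join(argv))
# zpool create tank mirror c0t0d0 c1t0d0
# zfs create tank/home
# zfs set quota=10g tank/home
```

On a real system these would be run as root against real device names, and `zpool create` destroys whatever is on the disks it is given, so treat the printed lines as a template to read rather than something to paste.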

2. Honkin' big filesystems

How big do filesystems need to be? In a world where 640KB is certainly not enough for computer memory, current filesystems have reached or are reaching the end of their usefulness. A 64-bit filesystem would meet today's need, but the estimated lifetime of a 64-bit filesystem is only about 10 years. Extending to 128 bits gives ZFS an expected lifetime of 30 years (UFS, for comparison, is about 20 years old). So how much data can you squeeze into a 128-bit filesystem? 16 exabytes or 18 million terabytes. (Note that 16 exabytes is 2 to the power 64 bytes, so this figure appears to describe the limit on a single file or filesystem rather than on the 128-bit pool as a whole.) How many files can you cram into a ZFS filesystem? 200 million million.

Could anyone use a filesystem that large? No, not really. The topic has roused discussions about boiling the oceans if a real life storage unit that size was powered on. It may not be necessary to have 128 bits, but it doesn't hurt and we won't have to worry about running out of addressable space.

3. Filesystem, heal thyself

ZFS employs 256-bit checksums end-to-end to validate data stored under its protection. Most filesystems (and you know who you are) depend on the underlying hardware to detect corrupt data and then can only nag about it if they get such a message. Every block in a ZFS filesystem has a checksum associated with it. If ZFS detects a checksum mismatch on a raidz or mirrored filesystem, it will actively reconstruct the block from the available redundancy and go on about its job.

4. fsck off, fsck

fsck has been voted out of the house. We don't need it anymore. Because ZFS data are always consistent on disk, don't be afraid to yank out those power cords if you feel like it. Your ZFS filesystems will never require you to enter the superuser password for maintenance mode.

5. Compress to your heart's content

I've always been a proponent of optional and appropriate compression in filesystems. There are some data that are well suited to compression such as server logs. Many people get ruffled up over this topic, although I suspect that they were once burned by doublespace munching up an important document. When thoughtfully used, ZFS compression can improve disk I/O which is a common bottleneck. ZFS compression can be turned on for individual filesystems or hierarchies with a very easy single command.
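Why log-like data is a good fit can be seen with any general-purpose compressor. The sketch below uses Python's `zlib` purely as a stand-in (it is not a claim about which algorithm ZFS applies), and the log lines are synthetic.

```python
import zlib

# Synthetic, repetitive log data of the kind a server produces all day.
log_lines = "".join(
    f"2009-09-03 12:00:{i % 60:02d} host sshd[{1000 + i}]: "
    f"session opened for user admin\n"
    for i in range(2000)
).encode()

compressed = zlib.compress(log_lines)

# Fewer bytes to push through the disk interface for the same content.
print(len(log_lines), len(compressed))

# Compression is lossless: the original comes back byte for byte.
assert zlib.decompress(compressed) == log_lines
```

If the disk is the bottleneck and the CPU has cycles to spare, writing the smaller buffer is the win the author is describing; already-compressed data such as media files would show little or no gain.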

6. Unconstrained architecture

UFS and other filesystems use a constrained model of fixed partitions or volumes, each filesystem having a set amount of available disk space. ZFS uses a pooled storage model. This is a significant departure from the traditional concept of filesystems. Many current production systems may have a single digit number of filesystems and adding or manipulating existing filesystems in such an environment is difficult.

In ZFS, pools are created from physical storage. Mirroring or the new RAID-Z redundancy exists at the pool level. Instead of breaking pools apart into filesystems, each newly created filesystem shares the available space in the pool, although a minimum amount of space can be reserved for it. ZFS filesystems exist in their own hierarchy, children filesystems inherit the properties of their parents, and each ZFS filesystem in the ZFS hierarchy can easily be mounted in different places in the system filesystem.
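Property inheritance in a filesystem hierarchy can be modelled as a walk up the tree until some ancestor has the property set locally. This is a simplified sketch: the `Dataset` class is invented here, and it does not try to capture which real ZFS properties are inheritable and which are not.

```python
class Dataset:
    """Toy node in a filesystem hierarchy with locally set properties."""

    def __init__(self, name, parent=None, **local_props):
        self.name = name
        self.parent = parent
        self.local = dict(local_props)

    def get(self, prop, default=None):
        # Nearest ancestor (including self) with a local value wins.
        node = self
        while node is not None:
            if prop in node.local:
                return node.local[prop]
            node = node.parent
        return default


pool = Dataset("tank", compression="off", atime="on")
home = Dataset("tank/home", parent=pool, compression="on")
alice = Dataset("tank/home/alice", parent=home)
logs = Dataset("tank/logs", parent=pool)

print(alice.get("compression"))   # on   (inherited from tank/home)
print(logs.get("compression"))    # off  (inherited from tank)
print(alice.get("atime"))         # on   (inherited from tank)
```

Setting a property once on a parent and letting the children pick it up is what makes managing many small filesystems practical.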
 

7. Grow filesystems without green thumb

If your pool becomes overcrowded, you can grow it. With one command. On a live production system. Enough said.

8. Dynamic striping

On by default, dynamic striping automatically includes all devices in a pool in writes simultaneously (stripe width spans all the available media). This will speed up the I/O on systems with multiple paths to storage by load balancing the I/O on all of the paths.

9. The term "raidz" sounds so l33t
The new RAID-Z redundant storage model replaces RAID-5 and improves upon it. RAID-Z does not suffer from the "write hole" in which a stripe of data becomes corrupt because of a loss of power during the vulnerable period between writing the data and the parity. RAID-Z, like RAID-5, can survive the loss of one disk. A future release is planned using the keyword raidz2 which can tolerate the loss of two disks. Perhaps the best feature is that creating a raidz pool is crazy simple.
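How a single-parity scheme survives the loss of one disk can be shown with XOR. This sketch illustrates the general idea only; it does not model the RAID-Z on-disk layout or how RAID-Z avoids the write hole, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]   # one block per data disk
parity = xor_blocks(data)            # stored on a further disk

lost = 1                             # pretend the second disk died
survivors = [blk for i, blk in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)   # b'BBBB'
```

XOR-ing the surviving blocks with the parity cancels everything except the missing block, which is why one lost disk is recoverable and two are not.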

10. Clones with no ethical issues

The simple creation of snapshots and clones of filesystems makes living with ZFS so much more enjoyable. A snapshot is a read-only point-in-time copy of a filesystem which takes practically no time to create and uses no additional space at the beginning. Any snapshot can be cloned to make a read-write filesystem and any snapshot of a filesystem can be restored to the original filesystem to return to the previous state. Snapshots can be written to other storage (disk, tape), transferred to another system, and converted back into a filesystem.
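The reason a snapshot is nearly free at creation time is that it only records pointers to blocks that already exist. The toy model below makes that concrete; the `Filesystem` class, the tag names and the dict-of-blocks representation are assumptions for illustration, not how ZFS stores anything.

```python
from types import MappingProxyType


class Filesystem:
    """Toy filesystem: a map of names to immutable blocks."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})
        self.snapshots = {}

    def write(self, name, data):
        # A new pointer is installed; the old block is left untouched
        # and stays reachable from any snapshot that references it.
        self.blocks[name] = data

    def snapshot(self, tag):
        # Copies pointers only, and exposes them read-only.
        self.snapshots[tag] = MappingProxyType(dict(self.blocks))

    def clone(self, tag):
        # A clone is a writable filesystem seeded from a snapshot.
        return Filesystem(self.snapshots[tag])

    def rollback(self, tag):
        self.blocks = dict(self.snapshots[tag])


fs = Filesystem()
fs.write("report.txt", b"draft")
fs.snapshot("monday")
fs.write("report.txt", b"ruined")

clone = fs.clone("monday")
clone.write("notes.txt", b"scratch")

fs.rollback("monday")
print(fs.blocks["report.txt"])   # b'draft'
print(sorted(clone.blocks))      # ['notes.txt', 'report.txt']
# Snapshot and clone share the very same block object, not a copy.
print(fs.snapshots["monday"]["report.txt"] is clone.blocks["report.txt"])  # True
```

Space is consumed only as the live filesystem or a clone diverges from the snapshot, which matches the "no additional space at the beginning" behaviour described above.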

More information

For more information, check out Sun's official ZFS page and the detailed OpenSolaris community ZFS information. If you want to take ZFS out for a test drive, the latest version of Solaris Express has it built in and ready to go. Download it here.

Recommended Links



NFSv4 ACLs

WAFL

Reference

If you want to learn more about the theory behind ZFS and find reference material, have a look at the ZFS Administration Guide, OpenSolaris ZFS, ZFS BigAdmin and ZFS Best Practices.

zfs-cheatsheet

ZFS Evil Tuning Guide - Siwiki

Recommended Papers

The Musings of Chris Samuel » Blog Archive » ZFS versus XFS with Bonnie++ patched to use random data

ZFS Tutorial Part 1

managing ZFS filesystems

Humor

 For a humorous introduction to ZFS' features, see presentation given by Pawel at EuroBSDCon 2007: http://youtube.com/watch?v=o3TGM0T1CvE.



Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). The site uses AdSense, so you need to be aware of the Google privacy policy. Copyright of original materials belongs to the respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: September 03, 2009