Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)


Solaris ZFS


The innovative Zettabyte File System broke records in scalability, reliability, and flexibility.

Although performance is not usually cited as a ZFS advantage, ZFS is far faster than most users realize, especially in environments where typical files are smaller than 5-10 megabytes. The native volume manager built into ZFS is also interesting. Together with copy-on-write semantics it provides snapshots, which are really important for some applications.

When designing your filesystem layout, pay attention to the role each partition will play for your particular application. Depending on your needs, map partitions to physical disks so as to minimize the load on each pair of disks (in the case of mirroring) and improve performance.

If you're running a webserver, for example, performance would benefit from a separate pair of drives dedicated to website storage. You might configure it with both the "noatime" and "logging" options mentioned below, along with the "nosuid" option. This offloads requests to a separate drive and possibly a separate SCSI controller channel.

Webservers have a mostly read-request load. RAID 10 can be used, but RAID 5 can be used too, as both provide a high read transaction rate and redundancy in case of a drive failure.

You should never mirror partitions on the same drive, except for training purposes: you'll seriously degrade performance, since you've effectively doubled your seeks.

For small web sites (let's say up to 4G) it makes sense to use /tmp for the site content, as it is mapped to memory. That also means your pages will be cached. The drawback is that you might need to order more memory for the server, increasing the cost, but it is a better (and cheaper) deal than using a SAN. You just need to reload the content when the server reboots. The problem is that beyond 4G the time to reboot the server becomes somewhat long, but few websites are that big. In any case it makes sense to use an entire drive for your webserver filesystem. New USB storage may offer read performance comparable with the best hard drives, with no seek latency for reads, and might also be an option.

Logs from the web server can be written to the system drive, as their volume is rather small.

You can also tune the UFS filesystem for a webserver by using the noatime option (which saves some writes) and by setting the high-water and low-water marks with the "ufs_HW" and "ufs_LW" options in /etc/system. See the Sun Performance and Tuning book (p. 172-173) and Sun's Solaris performance tuning course.


Notes:
  • This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • The site contains some broken links as it develops like a living tree... Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate it if you could mail us a correct link.


Old News ;-)

ZFS vs. Linux Raid + LVM

Comparison of ZFS and Linux RAID +LVM

i. ZFS doesn't support RAID 5 but does support RAID-Z, which has better features and fewer limitations.

ii. RAID-Z - A variation on RAID-5 which allows for better distribution of parity and eliminates the "RAID-5 write hole" (in which data and parity become inconsistent after a power loss). Data and parity are striped across all disks within a raidz group. A raidz group with N disks of size X can hold approximately (N-1)*X bytes and can withstand one device failing before data integrity is compromised. The minimum number of devices in a raidz group is 2; the recommended number is between 3 and 9.

iv. A clone is a writable volume or file system whose initial contents are the same as another dataset. As with snapshots, creating a clone is nearly instantaneous, and initially consumes no additional space.

v. [Linux] RAID (be it hardware or software) assumes that if a write to a disk doesn't return an error, then the write was successful. Therefore, if your disk corrupts data without returning an error, your data will become corrupted. This is of course very unlikely to happen, but it is possible, and it would result in a corrupt filesystem. http://www.tldp.org/HOWTO/Software-RAID-HOWTO-6.html
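The capacity rule quoted in the RAID-Z note can be checked with a few lines of Python. This is only a sketch of the approximate (N-1)*X formula from the text; the function name and the disk sizes are made up for illustration, and real pools lose some additional space to metadata.

```python
GIB = 2**30


def raidz_usable_bytes(num_disks: int, disk_bytes: int) -> int:
    """Approximate usable space of a single-parity raidz group: (N-1)*X."""
    if num_disks < 2:
        # The text gives 2 as the minimum number of devices in a group.
        raise ValueError("a raidz group needs at least 2 devices")
    return (num_disks - 1) * disk_bytes


# Five hypothetical 500 GiB disks: one disk's worth of space goes to parity.
print(raidz_usable_bytes(5, 500 * GIB) // GIB)  # 2000
```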

[May 12, 2008] ZFS what the ultimate file system really means for your desktop -- in plain English!

Ashton Mills, 21 June 2007
So, Sun's ZFS file system has garnered publicity recently with the announcement of its inclusion in Mac OS X and, more recently, as a module for the Linux kernel. But if you don't read Filesystems Weekly, what is it and what does it mean for you?

Now I may just be showing my geek side a bit here, but file systems are awesome. Aside from the fact our machines would be nothing without them, the science behind them is frequently ingenious.

And ZFS (the Zettabyte File System) is no different. It has quite an extensive feature set just like its peers, but builds on this by adding a new layer of simplicity. According to the official site, ZFS key features are (my summary):
 

All up, as a geek, it's an exciting file system I'd love to play with. Currently, however, ZFS is part of Sun's Solaris and is under the CDDL (Common Development and Distribution License), which is based on the MPL (Mozilla Public License). As this is incompatible with the GPLv2, the code can't be ported to the Linux kernel. This has recently been worked around by porting it across as a FUSE module, which, being userspace, is slow, though there is hope this will improve. Looks like it's time to enable FUSE support in my kernel!

Of course, (in a few months time) you could also go for Mac OS X where, in Leopard, ZFS is already supported and there are rumours Apple may be preparing to adopt it as the default filesystem replacing the aging HFS+ in the future (but probably not in 10.5).
 

[Jun 27, 2007] Solaris ZFS and Microsoft Server 2003 NTFS File System Performance - BigAdmin Description

Description: This white paper explores the performance characteristics and differences of ZFS in the Solaris 10 OS and the Microsoft Windows Server 2003 NTFS file system.

[Jun 12, 2007] Apple's Leopard will use ZFS, but not exclusively | Tech news blog ...

Jun 12, 2007

... Apple confirmed statements by Sun's Jonathan Schwartz that Leopard will use ZFS, correcting an executive who Monday suggested otherwise.

[Apr 6, 2007] ZFS committed to the FreeBSD base.

Pawel Jakub Dawidek pjd at FreeBSD.org
Fri Apr 6 02:58:34 UTC 2007
Hi.

I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.

Commit log:

  Please welcome ZFS - The last word in file systems.
  
  ZFS file system was ported from OpenSolaris operating system. The code
  is under CDDL license.
  
  I'd like to thank all SUN developers that created this great piece of
  software.
  
  Supported by:	Wheel LTD (http://www.wheel.pl/)
  Supported by:	The FreeBSD Foundation (http://www.freebsdfoundation.org/)
  Supported by:	Sentex (http://www.sentex.net/)

Limitations.

  Currently ZFS is only compiled as kernel module and is only available
  for i386 architecture. Amd64 should be available very soon, the other
  archs will come later, as we implement needed atomic operations.

Missing functionality.

  - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
    iSCSI is also not supported at this point. This should be fixed in
    the future, we may also add support for sharing ZVOLs over ggate.
  - There is no support for ACLs and extended attributes.
  - There is no support for booting off of ZFS file system.

Other than that, ZFS should be fully-functional.

Enjoy!

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

[Apr 2, 2007] ZFS Overview and Guide - Features

[Aug 14, 2006] Techworld.com - ZFS - the future of file systems ?

By Chris Mellor, Techworld

ZFS - the Zettabyte File System - is an enormous advance in capability on existing file systems. It provides greater space for files, hugely improved administration and greatly improved data security.

It is available in Sun's Solaris 10 and has been made open source. The advantages of ZFS look so great that its use may well spread to other UNIX distributions and even, possibly and eventually, to Windows.

Techworld has mentioned ZFS before. Here we provide a slightly wider and more detailed look at it. If you want to have even more information then the best resource is Sun's own website.

Why is ZFS a good thing?
It possesses advantages compared to existing file systems in these areas:-

- Scale
- Administration
- Data security and integrity

The key area is file system administration, followed by data security and file system size. ZFS started from a realisation that the existing file system concepts were hardly changed at all from the early days of computing. Then a computer knew about a disk which had files on it. A file system related to a single disk. On today's PCs the file systems are still disk-based with the Windows C: drive - A: and B: being floppy drives - and subsequent drives being D:, E:, etc.

To provide more space and bandwidth a software abstraction was added between the file system and the disks. It was called a volume manager and virtualised several disks into a volume.

Each volume has to be administered and growing volumes and file systems takes effort. Volume Manager software products became popular. The storage in a volume is specific to a server and application and can't be shared. Utilisation of storage is poor with any unused blocks on disks in volumes being unusable anywhere else.

ZFS starts from the concept that desktop and servers have many disks and that a good place to start abstracting this is at the operating system:file system interface. Consequently ZFS delivers, in effect, just one volume to the operating system. We might imagine it as disk:. From that point ZFS delivers scale, administration and data security features that other file systems do not.

ZFS has a layered stack with a POSIX-compliant operating system interface, then data management functions and, below that, increasingly device-specific functions. We might characterise ZFS as being a file system with a volume manager included within it, the data management function.

Data security
Data protection through RAID is clever but only goes so far. When data is written to disk it overwrites the current version of the data. According to ZFS people, there are instances of stray or phantom writes, mis-directed writes, DMA parity errors, disk driver bugs and accidental overwrites that the standard checksum approach won't detect.

The checksum is stored with the data block and is valid for that data block, but the data block shouldn't be there in the first place. The checksum is a disk-only checksum and doesn't cover against faults in the I/O path before that data gets written to disk.

If disks are mirrored then a block is simultaneously written to each mirror. If one drive or controller suffers a power failure then that mirror is out of synchronisation and needs re-synchronising with its twin.

With RAID if there is a loss of power between data and parity writes then disk contents are corrupted.

ZFS does things differently.

First of all it uses copy-on-write technology so that existing data blocks are not over-written. Instead new data blocks are written and their checksum stored with the pointer to them.

When a file write has been completed then the pointers to the previous blocks are changed so as to point to the new blocks. In other words the file write is treated as a transaction, an event that is atomic and has to be completed before it is confirmed or committed.
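The copy-on-write transaction described above can be illustrated with a toy model. This is a minimal sketch, not ZFS's on-disk format: the `CowStore` class, its in-memory block table, and the use of SHA-256 as the checksum are all assumptions made for the example. New data always goes to a fresh block, the checksum lives in the pointer rather than next to the data, and the write becomes visible only when the root pointer list is swapped.

```python
import hashlib


class CowStore:
    """Toy copy-on-write store; blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}     # address -> bytes
        self.next_addr = 0
        self.root = []       # committed list of (address, checksum) pointers

    def _alloc(self, data: bytes):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        # The checksum is kept with the pointer, not with the data block.
        return addr, hashlib.sha256(data).hexdigest()

    def write(self, index: int, data: bytes) -> None:
        if not 0 <= index <= len(self.root):
            raise IndexError("block index out of range")
        new_root = list(self.root)       # old pointer list stays untouched
        pointer = self._alloc(data)      # new data goes to a new block
        if index == len(new_root):
            new_root.append(pointer)
        else:
            new_root[index] = pointer
        self.root = new_root             # single pointer swap = commit

    def read(self, root=None) -> bytes:
        root = self.root if root is None else root
        parts = []
        for addr, expected in root:
            data = self.blocks[addr]
            if hashlib.sha256(data).hexdigest() != expected:
                raise IOError(f"checksum mismatch in block {addr}")
            parts.append(data)
        return b"".join(parts)


store = CowStore()
store.write(0, b"hello ")
store.write(1, b"world")
snapshot = store.root            # keeping the old root is a free snapshot
store.write(1, b"ZFS")
print(store.read())              # b'hello ZFS'
print(store.read(snapshot))      # b'hello world'
```

Because the old blocks and the old pointer list are never modified, holding on to a previous root gives a consistent earlier view at no extra copying cost, which is the reason snapshots come almost for free with this design.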

Secondly ZFS checks the disk contents looking for checksum/data mismatches. This process is called scrubbing. Any faults are corrected and a ZFS system exhibits what IBM calls autonomic computing capacity; it is self-healing.

Scale
ZFS uses a 128-bit addressing scheme and can store 256 quadrillion zettabytes. A zettabyte is 2 to the power 70 bytes or a billion TB. ZFS capacity limits are so far away as to be unimaginable. This is eye-catching stuff, but 64-bit file system capacity limits are unlikely to become a practical problem for decades.
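The figures in this paragraph can be reproduced with integer arithmetic. The sketch below assumes the 128-bit limit is counted in bytes and uses the article's binary definition of a zettabyte (2 to the power 70); under those assumptions "256 quadrillion" works out if a quadrillion is also taken in its binary sense, 2 to the power 50.

```python
ZETTABYTE = 2**70    # the article's definition
TERABYTE = 2**40     # binary terabyte, assumed here

# "A zettabyte is ... a billion TB"
print(ZETTABYTE // TERABYTE)       # 1073741824

# 128-bit address space expressed in zettabytes
zettabytes = 2**128 // ZETTABYTE
print(zettabytes)                  # 288230376151711744
print(zettabytes == 256 * 2**50)   # True
```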

Administration
With ZFS all storage enters a common pool, called a zpool. Every disk or array added to ZFS disappears into this common pool. ZFS people characterise this storage pool as being akin to a computer's virtual memory.

A hierarchy of ZFS file systems can use that pool. Each can have its own attributes set, such as compression, a growth-limiting quota, or a set amount of space.

I/O characteristics
ZFS has its own I/O system. I/Os have a priority with read I/Os having a higher priority than write I/Os. That means that reads get executed even if writes are queued up.

Write I/Os have both a priority and a deadline. The deadline is sooner the higher the priority. Writes with the same deadline are executed in logical block address order so that, in effect, they form a sequential series of writes across a disk, which reduces head movement to a single sweep across the disk surface. What's happening is that random write I/Os are getting transformed into sets of sequential I/Os to make the overall write I/O rate faster.
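The ordering rule for writes can be sketched as a two-level sort. This is an illustration of the idea only, not the actual ZFS I/O scheduler; the `WriteIO` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteIO:
    deadline: int   # hypothetical tick by which the write should be issued
    lba: int        # logical block address on the disk


def issue_order(pending):
    """Earliest deadline first; equal deadlines go out in ascending LBA order."""
    return sorted(pending, key=lambda io: (io.deadline, io.lba))


pending = [WriteIO(2, 900), WriteIO(1, 500), WriteIO(1, 20), WriteIO(2, 10)]
for io in issue_order(pending):
    print(io.deadline, io.lba)
# 1 20
# 1 500
# 2 10
# 2 900
```

Within each deadline group the addresses only increase, so the head makes one sweep per group instead of seeking back and forth.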

Striping and blocksizes
ZFS stripes files automatically. Block sizes are dynamically set. Blocks are allocated from disks based on an algorithm that takes into account space available and I/O counts. When blocks are being written, the copy-on-write concept means that a sequential set of blocks can be used, speeding up write I/O.

ZFS and NetApp's WAFL
ZFS has been based in part on NetApp's Write Anywhere File Layout (WAFL) system. It has moved on from WAFL and now has many differences. This table lists some of them. But do read the blog replies which correct some table errors.

There is more on the ZFS and WAFL similarities and differences here.

Snapshots unlimited and more
ZFS can take a virtually unlimited number of snapshots and these can be used to restore lost (deleted) files. However, they can't protect against disk crashes. For that, RAID and backup to external devices are needed.

ZFS offers compression, encryption is being developed, and an initiative is under way to make it bootable. The compression is applied before data is written meaning that the write I/O burden is reduced and hence effective write speed increased further.

We may see Sun offering storage arrays with ZFS. For example we might see a Sun NAS box based on ZFS. This is purely speculative, as is the idea that we might see Sun offering clustered NAS ZFS systems to take on Isilon and others in the high-performance, clustered, virtualised NAS area.

So what?
There is a lot of software engineering enthusiasm for ZFS and the engineers at Sun say that ZFS outperforms other file systems, for example the Solaris file system. It is faster at file operations and, other things being equal, a ZFS Solaris system will out-perform a non-ZFS Solaris system. Great, but will it out-perform other UNIX servers and Windows servers, again with other things being equal?

We don't know. We suspect it might but don't know by how much. Even then the popularity of ZFS will depend upon how it is taken up by Sun Solaris 10 customers and whether ports to Apple and to Linux result in wide use. For us storage people the ports that really matter are to mainstream Unix versions such as AIX, HP-UX and Red Hat Linux, also SuSE Linux I suppose.

There is no news of a ZFS port to Windows and Vista's own advanced file system plans have quite recently been downgraded with its file system changes.

If Sun storage systems using ZFS, such as its X4500 'Thumper' server, with ZFS-enhanced direct-attached storage (DAS), and Honeycomb, become very popular and are as market-defining as EMC's Centera product then we may well see ZFS spreading. But their advantages have to be solid and substantial with users getting far, far better file-based application performance and a far, far lower storage system management burden. Such things need proving in practice.

To find out for yourself try these systems out or wait for others to do so.

How to reformat all of your systems and use ZFS.

1. So easy your mom could administer it

ZFS is administered by two commands, zpool and zfs. Most tasks typically require a single command to accomplish. And the commands are designed to make sense. For example, check out the commands to create a RAID 1 mirrored filesystem and place a quota on its size.
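The commands referred to here did not survive in this copy of the article. As a stand-in, the sketch below composes the usual `zpool create ... mirror`, `zfs create` and `zfs set quota=` invocations and only prints them; nothing is executed. The pool name `tank`, the Solaris-style device names and the 10G quota are placeholders, not values from the original text.

```python
import shlex


def mirror_with_quota(pool, disks, filesystem, quota):
    """Return the command lines for a mirrored pool with one quota-limited filesystem."""
    dataset = f"{pool}/{filesystem}"
    return [
        ["zpool", "create", pool, "mirror", *disks],   # RAID 1 style mirror
        ["zfs", "create", dataset],                     # filesystem inside the pool
        ["zfs", "set", f"quota={quota}", dataset],      # cap its size
    ]


# Dry run with placeholder names: print the commands instead of running them.
for argv in mirror_with_quota("tank", ["c0t0d0", "c1t0d0"], "home", "10G"):
    print(shlex.join(argv))
# zpool create tank mirror c0t0d0 c1t0d0
# zfs create tank/home
# zfs set quota=10G tank/home
```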

2. Honkin' big filesystems


How big do filesystems need to be? In a world where 640KB is certainly not enough for computer memory, current filesystems have reached or are reaching the end of their usefulness. A 64-bit filesystem would meet today's needs, but the estimated lifetime of a 64-bit filesystem is about 10 years. Extending to 128 bits gives ZFS an expected lifetime of 30 years (UFS, for comparison, is about 20 years old). So how much data can you squeeze into a 128-bit filesystem? 16 exabytes or 18 million terabytes. How many files can you cram into a ZFS filesystem? 200 million million.

Could anyone use a filesystem that large? No, not really. The topic has roused discussions about boiling the oceans if a real life storage unit that size was powered on. It may not be necessary to have 128 bits, but it doesn't hurt and we won't have to worry about running out of addressable space.

3. Filesystem, heal thyself


ZFS employs 256-bit checksums end-to-end to validate data stored under its protection. Most filesystems (and you know who you are) depend on the underlying hardware to detect corrupt data and then can only nag about it if they get such a message. Every block in a ZFS filesystem has a checksum associated with it. If ZFS detects a checksum mismatch on a raidz or mirrored filesystem, it will actively reconstruct the block from the available redundancy and go on about its job.
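The self-healing read can be sketched for the mirrored case. This is a toy model under stated assumptions: SHA-256 stands in as a 256-bit checksum (the text does not say which algorithm is used), the mirror is a plain Python list of block copies, and the function name is invented for the example.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()   # 256-bit digest


def self_healing_read(copies, expected):
    """Return a copy that matches the checksum and repair the ones that don't."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no mirror copy matches the checksum")
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good               # rewrite the damaged side
    return good


mirror = [b"dXta", b"data"]                # first side silently corrupted
print(self_healing_read(mirror, checksum(b"data")))  # b'data'
print(mirror)                              # [b'data', b'data']
```

The point of keeping the expected checksum outside the block is that a disk returning wrong data without an error is still caught, which a plain mirror cannot do on its own.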

4. fsck off, fsck


fsck has been voted out of the house. We don't need it anymore. Because ZFS data are always consistent on disk, don't be afraid to yank out those power cords if you feel like it. Your ZFS filesystems will never require you to enter the superuser password for maintenance mode.
 

5. Compress to your heart's content


I've always been a proponent of optional and appropriate compression in filesystems. There are some data that are well suited to compression such as server logs. Many people get ruffled up over this topic, although I suspect that they were once burned by doublespace munching up an important document. When thoughtfully used, ZFS compression can improve disk I/O which is a common bottleneck. ZFS compression can be turned on for individual filesystems or hierarchies with a very easy single command.

6. Unconstrained architecture

UFS and other filesystems use a constrained model of fixed partitions or volumes, each filesystem having a set amount of available disk space. ZFS uses a pooled storage model. This is a significant departure from the traditional concept of filesystems. Many current production systems may have a single digit number of filesystems and adding or manipulating existing filesystems in such an environment is difficult.

In ZFS, pools are created from physical storage. Mirroring or the new RAID-Z redundancy exists at the pool level. Instead of breaking pools apart into filesystems, each newly created filesystem shares the available space in the pool, although a minimum amount of space can be reserved for it. ZFS filesystems exist in their own hierarchy, children filesystems inherit the properties of their parents, and each ZFS filesystem in the ZFS hierarchy can easily be mounted in different places in the system filesystem.
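Property inheritance in the filesystem hierarchy can be modelled with a parent pointer and a lookup that walks upward. This is a sketch of the idea only; the `Dataset` class is hypothetical, the dataset names are placeholders, and `compression` is used simply as an example of a property a child may inherit or override.

```python
class Dataset:
    """Toy model of a filesystem in a pool hierarchy."""

    def __init__(self, name, parent=None, **local_props):
        self.name = name
        self.parent = parent
        self.local = local_props      # properties set directly on this dataset

    def get(self, prop, default=None):
        node = self
        while node is not None:       # walk up until some ancestor sets it
            if prop in node.local:
                return node.local[prop]
            node = node.parent
        return default


pool = Dataset("tank", compression="on")
home = Dataset("tank/home", parent=pool)
alice = Dataset("tank/home/alice", parent=home, compression="off")

print(home.get("compression"))    # on   (inherited from tank)
print(alice.get("compression"))   # off  (local override)
```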
 

7. Grow filesystems without green thumb

If your pool becomes overcrowded, you can grow it. With one command. On a live production system. Enough said.
 

8. Dynamic striping

On by default, dynamic striping automatically includes all devices in a pool in writes simultaneously (stripe width spans all the available media). This will speed up the I/O on systems with multiple paths to storage by load balancing the I/O on all of the paths.

9. The term "raidz" sounds so l33t


The new RAID-Z redundant storage model replaces RAID-5 and improves upon it. RAID-Z does not suffer from the "write hole" in which a stripe of data becomes corrupt because of a loss of power during the vulnerable period between writing the data and the parity. RAID-Z, like RAID-5, can survive the loss of one disk. A future release is planned using the keyword raidz2 which can tolerate the loss of two disks. Perhaps the best feature is that creating a raidz pool is crazy simple.

10. Clones with no ethical issues


The simple creation of snapshots and clones of filesystems makes living with ZFS so much more enjoyable. A snapshot is a read-only point-in-time copy of a filesystem which takes practically no time to create and uses no additional space at the beginning. Any snapshot can be cloned to make a read-write filesystem and any snapshot of a filesystem can be restored to the original filesystem to return to the previous state. Snapshots can be written to other storage (disk, tape), transferred to another system, and converted back into a filesystem.
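The snapshot, clone and restore steps map onto the `zfs snapshot`, `zfs clone` and `zfs rollback` subcommands. The sketch below only builds and prints the command lines as a dry run; the dataset `tank/home`, the snapshot label `monday` and the clone name are placeholders chosen for the example.

```python
import shlex


def snapshot_workflow(dataset, label, clone_name):
    """Command lines for: take a snapshot, clone it read-write, roll back to it."""
    snap = f"{dataset}@{label}"
    return [
        ["zfs", "snapshot", snap],            # read-only point-in-time copy
        ["zfs", "clone", snap, clone_name],   # writable filesystem from the snapshot
        ["zfs", "rollback", snap],            # return the original to that state
    ]


for argv in snapshot_workflow("tank/home", "monday", "tank/home_test"):
    print(shlex.join(argv))
# zfs snapshot tank/home@monday
# zfs clone tank/home@monday tank/home_test
# zfs rollback tank/home@monday
```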

More information

For more information, check out Sun's official ZFS page and the detailed OpenSolaris community ZFS information. If you want to take ZFS out for a test drive, the latest version of Solaris Express has it built in and ready to go. Download it here.

Recommended Links


In case of broken links please try to use Google search. If you find the page, please notify us about the new location.

Reference

If you want to learn more about the theory behind ZFS and find reference material have a look at ZFS Administration Guide, OpenSolaris ZFS, ZFS BigAdmin and ZFS Best Practices.

zfs-cheatsheet

ZFS Evil Tuning Guide - Siwiki

Recommended Papers

The Musings of Chris Samuel » Blog Archive » ZFS versus XFS with Bonnie++ patched to use random data

ZFS Tutorial Part 1

managing ZFS filesystems



Copyright © 1996-2008 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: November 08, 2008