Traversal control

September 22nd, 2008 | www.linux-mag.com
Since find(1) came into being decades ago, programmers have been adding new features. Here’s the fourth of a series about some of those.
By Jerry Peek

A few months ago, we finished the third of a series about features added to longstanding utility programs. This month we’ll look at the new features that GNU programmers and others have added to all of the other features that find(1) already had. (You can “find” an introduction to find here: A Very Valuable Find.)

There are lots of versions of find. We’ll cover the GNU version 4.1.20 (the latest, as of this writing, from the Debian stable distribution).

Filename Matching

Older versions of find had one way to check the name of an entry: the -name test. The argument to -name is a case-sensitive filename or shell wildcard pattern.

Shell wildcards are simpler than grep-like regular expressions, but they limit the matching -name can do. For instance, matching a file named with all uppercase characters is tough with shell wildcards (but simple with a regular expression, as we’ll see soon with -regex).

The string or wildcard pattern after -name is compared to the name of the entry currently being scanned, not that entry’s pathname. So, for instance, it’s easy to know whether the current filename ends in .c, but it’s a lot harder to know whether that file is in a directory named src.

The -path test, which was added fairly early to many find versions, is a shell wildcard-type pattern match against the entire current pathname. So, the test -path '*src/*.c' gets close to what we want here: it matches any pathname containing src/, followed by any number of characters, and ending in a literal .c. That could be a file ./src/foo.c, but it could also be a file ./src/subdir/bar.c, or ./TeXsrc/foo.c, or something even messier. The wide-open matching of * meaning “zero or more of any character” (slashes included) can cause trouble when you need a specific pathname match.
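The difference between -name and -path is easy to see on a small throwaway tree. The following is only a sketch: the directory and file names are made up for illustration, and it assumes GNU find and mktemp are available.

```shell
# Build a scratch tree; every name below is hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/src/subdir" "$tree/TeXsrc" "$tree/doc"
touch "$tree/src/foo.c" "$tree/src/subdir/bar.c" \
      "$tree/TeXsrc/foo.c" "$tree/doc/notes.c"

# -name looks only at the last pathname component, so every .c file matches.
by_name=$(cd "$tree" && find . -name '*.c' | LC_ALL=C sort)

# -path looks at the whole pathname; '*src/*.c' also catches TeXsrc/ and subdir/.
by_path=$(cd "$tree" && find . -path '*src/*.c' | LC_ALL=C sort)

# Spelling out the leading './src/' excludes TeXsrc/, but * still crosses slashes.
anchored=$(cd "$tree" && find . -path './src/*.c' | LC_ALL=C sort)

printf '%s\n' "-name:" "$by_name" "-path '*src/*.c':" "$by_path" \
              "-path './src/*.c':" "$anchored"
```

Note that even the anchored pattern still reports ./src/subdir/bar.c, because * in a -path pattern is free to match slashes.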

The GNU find has several other name tests:

-iname matches a case-insensitive shell wildcard pattern.

-regex compares names to a case-sensitive regular expression. Like -path, the regular expression is tested against the entire current pathname. To test the name of the current entry, use a regular expression that starts by matching everything up to the final slash (^.*/). For instance, to match filenames that are all uppercase ASCII letters, try -regex '^.*/[A-Z]+$'.

-iregex does case-insensitive regexp matching of the pathname.
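Here is a small sketch of the three tests side by side, again on a made-up scratch tree. It assumes GNU find with its default regular expression syntax, and it forces the C locale so that [A-Z] means only the ASCII uppercase letters.

```shell
# Scratch tree; all names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/sub"
touch "$tree/README" "$tree/Makefile" "$tree/sub/TODO" "$tree/sub/notes.TXT"

# -iname: case-insensitive wildcard match against the entry's own name.
txt=$(cd "$tree" && find . -iname '*.txt')

# -regex must match the whole pathname: '.*/' consumes everything up to
# the final slash, leaving [A-Z]+ to match the entry's name.
upper=$(cd "$tree" && LC_ALL=C find . -regex '.*/[A-Z]+' | LC_ALL=C sort)

# -iregex: the same kind of whole-pathname match, ignoring case.
make=$(cd "$tree" && find . -iregex '.*/makefile')

printf '%s\n' "$txt" "$upper" "$make"
```

Because the expression has to match the entire pathname, leaving off the leading '.*/' would make the -regex test fail for every entry found under ".".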

Another new test is -lname, which matches the target of a symbolic link. (Using other name tests, like -name, matches the name of the symlink itself.) The corresponding -ilname test does case-insensitive matching of the symlink target.

There are two other new tests and options for symbolic links:

-follow dereferences symlinks: it follows them to the files they point to.

-xtype is like the opposite of -type for symbolic links. If -follow isn’t given, -xtype checks the file that the symlink points to; otherwise, -xtype checks the symlink itself.
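The symlink tests can be tried out on a few links in a scratch directory. This is a minimal sketch, assuming GNU find; the link names and targets are invented for the example.

```shell
# Scratch directory with three symlinks; all names are hypothetical.
tree=$(mktemp -d)
mkdir "$tree/realdir"
touch "$tree/target.txt"
ln -s target.txt "$tree/link-to-file"
ln -s realdir "$tree/link-to-dir"
ln -s /nonexistent/Target.TXT "$tree/dangling"

# -lname matches the text stored in the symlink, not the symlink's own name.
lname=$(cd "$tree" && find . -lname '*.txt')

# -ilname does the same without regard to case.
ilname=$(cd "$tree" && find . -ilname '*.txt' | LC_ALL=C sort)

# Without -follow, -type l selects the links themselves and -xtype d
# keeps only those whose target is a directory.
links_to_dirs=$(cd "$tree" && find . -type l -xtype d)

printf '%s\n' "$lname" "$ilname" "$links_to_dirs"
```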

Timestamp Matching

Older versions of find matched timestamps only in 24-hour intervals. For instance, the test -mtime 2 is true for files modified between 48 and 72 hours ago, and -mtime -3 is true for files modified less than 72 hours ago, so both are true for a file modified 60 hours ago. Besides being a bit hard to understand at first, the three timestamp tests (-atime, -ctime and -mtime) are limited to that 24-hour granularity. If you needed more accuracy, you’d have to use -newer or ! -newer to match a timestamp file, often one created by touch(1). (Worse yet, many versions of find would silently ignore more than one -newer test in the same expression!)

The new -amin, -cmin and -mmin tests check timestamps a certain number of minutes ago. For instance, to find files accessed within the past hour, use -amin -60. (Note that it’s hard to test last-access times for directories. That’s because, when find searches through a file tree, it accesses all of the directories — which updates all directories’ last-access timestamps.)

Another new option, -daystart, tells find to measure times from the beginning of today instead of in 24-hour multiples counted back from the present moment. This frees you from dependence on the time of day at which you run find.
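The minute-based tests are easy to check with files whose timestamps have been set by hand. The sketch below assumes GNU find and GNU touch (for the -d 'N minutes ago' date syntax); the file names are made up.

```shell
# Scratch directory with three files of different ages; names are hypothetical.
tree=$(mktemp -d)
touch "$tree/fresh"
touch -d '90 minutes ago' "$tree/stale"    # GNU touch date syntax
touch -d '3 days ago' "$tree/ancient"

# Modified less than 60 minutes ago.
recent=$(cd "$tree" && find . -type f -mmin -60)

# Modified more than 60 minutes ago.
older=$(cd "$tree" && find . -type f -mmin +60 | LC_ALL=C sort)

# Modified at some point during the current calendar day. The option has
# to come before the time test it is meant to affect.
today=$(cd "$tree" && find . -daystart -type f -mtime 0)

printf '%s\n' "recent:" "$recent" "older:" "$older" "today:" "$today"
```

Whether the 90-minute-old file shows up in the last list depends on when the commands are run: shortly after midnight it was modified "yesterday," which is exactly the distinction this option is meant to draw.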

Directory Control

Early versions didn’t give you much control over which directories find visited. Once -prune was added, you could write an expression to keep find from descending into certain directories. For instance, to keep from descending into the ./src subdirectory, you can do something like this:
find . -path ./src -prune -o -etc…
And to skip all directories named lib (and all of their subdirectories):
find . -name lib -prune -o -etc…
The -prune action is good for avoiding certain directories, but — without the regular expression tests added later, at least — it’s not so good for limiting searches to a particular depth. In particular, it may not be obvious how to process only the entries in the current directory without any recursion. (The answer with -prune is:
find . \( -type d ! -name . -prune \) \
   -o -etc…
which “prunes” all directories except the current directory “.”.)
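To see both -prune idioms do their work, here is a sketch on a scratch tree, with -type f -print standing in for the rest of the expression. The names are hypothetical, and GNU find is assumed.

```shell
# Scratch tree; all names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/src/lib" "$tree/lib/deep" "$tree/doc"
touch "$tree/top.c" "$tree/src/main.c" "$tree/src/lib/util.c" \
      "$tree/lib/deep/hidden.c" "$tree/doc/guide.txt"

# Skip every directory named lib, and everything below it; print other files.
no_lib=$(cd "$tree" && find . -name lib -prune -o -type f -print | LC_ALL=C sort)

# Prune every directory except "." so only the top level is processed.
top_only=$(cd "$tree" && find . \( -type d ! -name . -prune \) -o -type f -print)

printf '%s\n' "$no_lib" "--" "$top_only"
```

The explicit -print on the right-hand side of -o matters: without it, find would apply its default -print to the whole expression and list the pruned directories too.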

The new -mindepth and -maxdepth options make this a lot easier. Use -maxdepth n to descend no more than n levels below the command-line arguments. The option -maxdepth 0 tells find to evaluate only the command-line arguments.

In the same way, -mindepth n tells find to ignore the first n levels of subdirectories. Also, -mindepth 1 processes all files except the command-line arguments. For instance, find subdir -mindepth 1 -ls will descend into subdir and list each of its contents, but won’t list subdir itself.
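A quick way to get a feel for the depth options is to run them against a small nested tree. This sketch assumes GNU find; the directory layout is invented for the example.

```shell
# Scratch tree three levels deep; all names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/subdir/a/b"
touch "$tree/subdir/one" "$tree/subdir/a/two" "$tree/subdir/a/b/three"

# Only the command-line argument itself.
depth0=$(cd "$tree" && find subdir -maxdepth 0)

# The argument plus its immediate contents.
depth1=$(cd "$tree" && find subdir -maxdepth 1 | LC_ALL=C sort)

# Everything at least two levels below the argument.
below=$(cd "$tree" && find subdir -mindepth 2 | LC_ALL=C sort)

printf '%s\n' "$depth0" "--" "$depth1" "--" "$below"
```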

The -depth option has been in quite a few versions of find; it’s not as “new” as some of the other features we cover. It’s not related to -maxdepth or -mindepth, though. The sidebar “-depth explained” has more information about how this option is used.
-depth explained
Because find is often used to give filenames to archive programs like tar, it’s worth understanding -depth and that part of its purpose.

A tar archive is a stream of bytes that contains header information for each file (including its name and access permissions) followed by that file’s data. The archive is extracted in order from first byte to last.

Let’s say that you archive an unwritable directory. When you later extract that directory from the archive, its permissions will be set at the time it’s extracted:

If you didn’t use -depth to create the archive, the directory will be extracted before its contents. So the entries in that directory can’t be extracted — unless root is extracting the archive — because their directory isn’t writable at that point.

If you used -depth, though, the problem is solved. That’s because, when tar extracts a file before its directory, it temporarily creates a writable directory to hold the file. Then, later, when tar extracts the directory itself, the directory’s contents will already have been extracted, so all tar has to do is set the directory’s new (unwritable) permissions.
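The change in ordering that -depth makes is easy to see on a single chain of directories. This is a sketch with made-up names, assuming GNU find; it only prints the pathnames and doesn't run an archiver.

```shell
# One directory chain; all names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/dir/sub"
touch "$tree/dir/sub/file"

# Default order: each directory is listed before its contents.
default_order=$(cd "$tree" && find dir)

# With -depth: the contents are listed first, then the directory itself.
depth_order=$(cd "$tree" && find dir -depth)

printf '%s\n' "$default_order" "--" "$depth_order"
```

An archive program that reads its list of pathnames from find would store the entries in that second order, with each directory following the files inside it.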

One “new” addition — which is actually in a lot of find versions — is -xdev or -mount. (GNU find understands both of those.) It tells find not to descend into directories mounted from other filesystems. This is handy, for example, to avoid network-mounted filesystems.

A more specific test is -fstype, which tests true if a file is on a certain type of filesystem. For instance, ! -fstype nfs is true for a file that’s not on an NFS-type filesystem. Different systems have different filesystem names and types, though. To get a listing of what’s on your system, use the new -printf action with its %F format directive to display the filesystems from the second field of each /etc/mtab entry:
% B
/                    type ext3
/proc                type proc
/dev/pts             type devpts
/dev/shm             type tmpfs
…
(You’ll probably find that same data in the second and third fields of each entry in /proc/mounts.)
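One way to produce a listing in that format is sketched below. It is an illustration under several assumptions, not necessarily the command used for the output above: the function name list_fstypes is made up, the mount table is assumed to have the mount point in its second whitespace-separated field, and mount points containing spaces aren't handled.

```shell
# list_fstypes [FILE]: for each mount point named in the second field of
# FILE (default /proc/mounts), print the mount point and the filesystem
# type that find reports for it. The function name is hypothetical.
list_fstypes() {
  awk '{ print $2 }' "${1:-/proc/mounts}" |
    while IFS= read -r mountpoint; do
      # -maxdepth 0 examines only the mount point itself; %F is its fs type.
      find "$mountpoint" -maxdepth 0 -printf '%-20p type %F\n' 2>/dev/null
    done
}
```

Running list_fstypes /etc/mtab reads that file instead of the default. The type names it prints are the ones to give to -fstype on that system.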

Text Output

Early versions of find had basically one choice for outputting a pathname: print it to the standard output. Later, -ls was added; it gives an output format similar to ls -l. The new -printf action lets you use a C-like printf format. This has the usual format specifiers like the filename and the last-modification date, but it has others specific to find. For instance, %H tells you which command-line argument find was processing when it found this entry. One simple use for this is to make your own version of ls that gives just the information you want. As an example, the following bash function, findc, searches the command-line arguments (or, if there are no arguments, the current directory . instead) and prints information about all filenames ending with .c:
findc()
{
  find "${@-.}" -name '*.c' -printf \
    'DEPTH %2d  GROUP %-10g  NAME %f\n'
}
(Note that the stat(1) utility might be simpler to use if you don’t need a recursive listing and if stat’s format specifiers give the information you want.)
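The %H directive is most useful when find is given more than one starting point. Here is a short sketch, assuming GNU find, that pairs it with %P (the pathname below the starting point) and %d (the depth); the tree is made up for the example.

```shell
# Two starting points, a and b; all names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/a/x" "$tree/b"
touch "$tree/a/x/one.c" "$tree/b/two.c"

# %H = command-line argument being searched, %P = path below it, %d = depth.
report=$(cd "$tree" && find a b -name '*.c' -printf '%H: %P (depth %d)\n')

printf '%s\n' "$report"
# a: x/one.c (depth 2)
# b: two.c (depth 1)
```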

The longstanding -print action writes a pathname to the standard output, followed by a newline character. If that pathname happens to contain a newline, you get two newlines. (A newline is legal in a filename.) Most shells also break command-line arguments into words at whitespace (tabs, spaces and newlines); this means that command substitution (the backquote operators) could fail if, say, a filename contained spaces. It wasn’t too long before programmers fixed this problem by adding the -print0 action; it outputs a pathname followed by NUL (a zero byte). Because NUL isn’t legal in a filename, this pathname delimiter solved the problem — when find’s output was piped to the command xargs -0, which accepts NUL as an argument separator.
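The following sketch shows the problem and the fix on three awkwardly named files. It assumes GNU find and an xargs that understands -0; the file names are invented, and one of them contains a newline.

```shell
# Three files: an ordinary name, one with a space, one with a newline.
tree=$(mktemp -d)
touch "$tree/plain.log" "$tree/with space.log" "$tree/$(printf 'two\nlines').log"

# Newline-delimited output: three files produce four lines.
print_lines=$(find "$tree" -name '*.log' -print | wc -l)

# NUL-delimited output: xargs -0 hands rm each pathname as one argument.
find "$tree" -name '*.log' -print0 | xargs -0 rm -f

# Nothing should be left behind.
remaining=$(find "$tree" -name '*.log' | wc -l)

echo "lines printed: $print_lines, files remaining: $remaining"
```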

Because find can do many different tests as it traverses a filesystem, it’s good to be able to choose what should be done in each individual case. For instance, if you run a nightly cron job to clean up various files and directories from all of your disks, it’s nice to do all of the tests in a single pass through the filesystem — instead of making another complete pass for each of your tests. But it’s also good to avoid the overhead of running utilities like rm and rmdir over and over, once per file, in a find job like this one using -exec:
find /var/tmp -mtime +3 \( \
  \( -type f -exec rm -f {} \; \) -o \
  \( -type d -exec ..... {} \; \) \
\)
This inefficiency could be solved by replacing -exec with -print or -print0, then piping find’s output to xargs. xargs collects arguments and passes them to another program each time it has collected “enough.” But all the text from -print or -print0 goes to find’s standard output, so there’s been no easy way to tell which pathnames were from which test (which are files, which are directories…).

The new -fprintf and -fprint0 actions can solve this problem. They write their output (a formatted string for -fprintf, a NUL-terminated pathname for -fprint0) to a file you specify. For instance, the following example writes a NUL-separated list of the files from /var/tmp into the file named by $files and a list of directories into the file named by $dirs:
dirs=`mktemp`
files=`mktemp`
find /var/tmp \( \
  \( -type f -fprint0 "$files" \) -o \
  \( -type d -fprint0 "$dirs" \) \
\)
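Once the two lists exist, each can be fed to its own command with xargs -0. The sketch below runs the whole cycle against a scratch directory instead of /var/tmp. The -mindepth 1 and -depth options, the -r flag to xargs, and all of the file names are assumptions added for this example.

```shell
# Scratch stand-in for /var/tmp; all names are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/old/empty"
touch "$work/old/a.tmp" "$work/b.tmp"
files=$(mktemp)
dirs=$(mktemp)

# One pass over the tree. -mindepth 1 keeps $work itself off the lists;
# -depth lists each directory after its contents, deepest first.
find "$work" -mindepth 1 -depth \( \
  \( -type f -fprint0 "$files" \) -o \
  \( -type d -fprint0 "$dirs" \) \
\)

# Hand each list to its own command, many pathnames per invocation.
# -r (GNU xargs) skips the command entirely if a list is empty.
xargs -0 -r rm -f < "$files"
xargs -0 -r rmdir < "$dirs"
rm -f "$files" "$dirs"
```

Because the directory list is written deepest-first, rmdir reaches each directory only after its subdirectories are gone.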
Other New Tests

The -empty test is true for an empty file or directory. (An empty file has no bytes; an empty directory has no entries.) One place this is handy is for removing empty directories while you’re cleaning a filesystem. If you also use -depth, all of the files in a directory should be removed before find examines the directory itself. Then you can use an expression like the following:
find /tmp -depth \( \
  \( -mtime +3 -type f -exec rm -f {} \; \) \
  -o \( -type d -empty -exec rmdir {} \; \) \
\)
The -false “test” is always false, and -true is always true. These are a lot more efficient than the old methods (-exec false and -exec true) that execute the external Linux commands false(1) and true(1).

The -perm test has long accepted arguments like -perm 222 (which means “exactly mode 222” — that is, write-only) and -perm -222 (which means “all of the write (2) mode bits are set”). Now -perm also accepts arguments starting with a plus sign. It means “any of these bits are set.” For instance, -perm +222 is true when any write bit is set.
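The three forms of -perm can be compared on files with known modes. One caution about this sketch: later GNU find releases spell the “any of these bits” test with a slash (-perm /222) and no longer accept the plus-sign form, so the example assumes such a release and uses the slash. The file names are made up.

```shell
# Three files with known modes; all names are hypothetical.
tree=$(mktemp -d)
touch "$tree/owner-only" "$tree/read-only" "$tree/all-write"
chmod 600 "$tree/owner-only"
chmod 444 "$tree/read-only"
chmod 666 "$tree/all-write"

# Exact match: the mode must be 444 and nothing else.
exact=$(cd "$tree" && find . -type f -perm 444)

# -222: all three write bits (user, group, other) must be set.
all_bits=$(cd "$tree" && find . -type f -perm -222)

# /222: at least one write bit must be set (newer spelling of +222).
any_bit=$(cd "$tree" && find . -type f -perm /222 | LC_ALL=C sort)

printf '%s\n' "$exact" "--" "$all_bits" "--" "$any_bit"
```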