Introduction to text file processing in bash

(with an overview of classic Unix text processing utilities)

Adapted from Chapters 11 and 12 of Linux Shell Scripting with Bash by Ken O. Burtch. Outdated material has been removed, and some examples and explanations that were wrong have been corrected.

Unix was initially created for the AT&T legal department for processing patent applications. This stroke of luck helped to tilt its design toward text processing needs. As a result, everything in Unix is a file, and text files play a very important, central role. Essentially, Unix can be viewed as a merger of an OS and a document processing system. It contains many utilities designed specifically for processing text files. That was not the case for earlier operating systems such as OS/360. Of course, some of these utilities are now hopelessly outdated and some look underpowered (cut), while others, such as find, grep, head, tail, and several more, have stood the test of time.

A text file is a file containing lines. Each line ends with an EOL symbol, which in Unix is "\n" and in MS-DOS and Windows is "\r\n". That means that text files from Windows need to be converted into Unix format. The most popular tool for such a conversion is the dos2unix utility, but any scripting language or the shell tr command can be used for such a conversion.
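As a minimal sketch of the tr approach, the following deletes every carriage return from a file. The names windows.txt and unix.txt are placeholders, and the example assumes the file contains no carriage returns that need to be kept.

```shell
#!/bin/bash
# Create a small sample file with Windows (CRLF) line endings.
# windows.txt and unix.txt are hypothetical names used for illustration.
workdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$workdir/windows.txt"

# tr cannot edit a file in place, so read from one file and write to another.
tr -d '\r' < "$workdir/windows.txt" > "$workdir/unix.txt"

# Count the carriage returns left in each file.
cr_before=$(tr -cd '\r' < "$workdir/windows.txt" | wc -c)
cr_after=$(tr -cd '\r' < "$workdir/unix.txt" | wc -c)
printf 'carriage returns: before=%d after=%d\n' "$cr_before" "$cr_after"
```

Writing to a second file and renaming it afterwards keeps the original intact if the conversion is interrupted.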

A further complication in exchanging files between Windows and Unix is filenames that contain blanks. Those should generally be avoided. There is no standard utility for converting blanks to underscores in such filenames, but the tr command, along with the basename and dirname commands, can be used for this purpose (see below).

Many Unix text processing utilities can act as filters, processing the standard input generated by the previous stage of a pipeline and producing output that is passed to the next stage. The idea of pipelines was a revolutionary innovation introduced by Unix.

When a text file is passed through a pipeline, it is called a text stream, that is, a stream of text characters.

Working with Pathnames

The absolute path to any Unix file consists of two components: the path and the basename. Linux has commands to parse an absolute path name into the path (the directory in which the file resides) and the basename (the name of the file in that directory).

The basename command examines a path and displays the filename. It doesn't check to see whether the file exists.

$ basename /home/joeuser/www/robots.txt
robots.txt

If a suffix is included as a second parameter, basename deletes the suffix if it matches the file's suffix.

$ basename /home/joeuser/www/robots.txt .txt
robots

The corresponding program for extracting the path to the file is dirname.

$ dirname /home/joeuser/www/robots.txt
/home/joeuser/www

There is no trailing slash after the final directory in the path.
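The same split can also be done inside Bash without starting another process, using the parameter expansions ${VAR##*/} and ${VAR%/*}. This is only a sketch: it assumes the path contains at least one slash and does not end with one. basename and dirname handle those edge cases, the expansions do not.

```shell
#!/bin/bash
path="/home/joeuser/www/robots.txt"

file_name=${path##*/}     # strip the longest prefix ending in "/"
dir_name=${path%/*}       # strip the shortest suffix starting with "/"
stem=${file_name%.txt}    # strip a known suffix, like basename's second parameter

printf '%s\n' "$file_name" "$dir_name" "$stem"
```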

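Using the tr command together with these two commands, you can replace blanks with underscores in the names of the files located by find in the following way:

find /home/joeuser/ -name "*.txt" -type f -exec /home/joeuser/bin/rename_windows_file.sh {} \;

where rename_windows_file.sh is something like

#!/bin/bash
# Rename the file given as $1, replacing blanks in its name with underscores
f=$(basename "$1")
d=$(dirname "$1")
new=$(tr ' ' '_' <<< "$f")
# Skip files whose names contain no blanks
[ "$f" = "$new" ] || mv -- "$1" "$d/$new"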

To verify that the resulting pathname is a correct Linux pathname, you can use the pathchk command. This command verifies that the directories in the path (if they already exist) are accessible and that the names of the directories and file are not too long. If there is a problem with the path, pathchk reports the problem and returns an error code of 1.

$ pathchk "~/x" && echo "Acceptable path"
Acceptable path
$ mkdir a
$ chmod 400 a
$ pathchk "a/test.txt"

With the --portability (-p) switch, pathchk enforces stricter portability checks for all POSIX-compliant Unix systems. This identifies characters not allowed in a pathname, such as spaces.

$ pathchk "new file.txt"
$ pathchk -p "new file.txt"
pathchk: path 'new file.txt' contains nonportable character ' '

pathchk is useful for checking pathnames supplied from an outside source, such as pathnames from another script or those typed in by a user.
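A sketch of such a check in a script follows. The function name is_portable_path and the sample pathnames are made up for illustration. Note that with -p the POSIX minimum limits apply, so a single name component longer than 14 characters is rejected as well.

```shell
#!/bin/bash
# is_portable_path is a hypothetical helper name used for illustration.
is_portable_path() {
  # -p applies the POSIX portability checks; error messages are discarded
  # because only the exit status is needed here.
  pathchk -p "$1" 2>/dev/null
}

for CANDIDATE in "logs/results.txt" "new file.txt" ; do
  if is_portable_path "$CANDIDATE" ; then
    printf '%s\n' "accepted: $CANDIDATE"
  else
    printf '%s\n' "rejected: $CANDIDATE" >&2
  fi
done
```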

File Truncation

A particular feature of Unix-based operating systems, including the Linux ext3 file system, is the way space on a disk is reserved for a file. For directories, which are a special type of file in Unix, space is never released: if a directory became very big because many thousands of names were put into it, it stays that big even after most of them are deleted.

If a program removes all 5,000 files from a large directory, and puts a single file in that directory, the directory will still have space reserved for 5,000 file entries. The only way to release this space is to remove and re-create the directory.

Identifying type of file using file command

The built-in type command identifies whether a command is built-in or not, and where the command is located if it is a Linux command.

To test files other than commands, the Linux file command performs a series of tests to determine the type of a file. First, file determines whether the file is a regular file or is empty. If the file is regular, file command consults the /usr/share/magic file, checking the first few bytes of the file in an attempt to determine what the file contains. If the file is an ASCII text file, it performs a check of common words to try to determine the language of the text.

$ file empty_file.txt
empty_file.txt: empty
$ file robots.txt
robots.txt: ASCII text

file also works with programs. If rename_windows_file.sh is a Bash script, file identifies it as a shell script.

$ file rename_windows_file.sh
rename_windows_file.sh: Bourne-Again shell script text

It also detects whether a file is a binary executable.

$ file /usr/bin/time
/usr/bin/time: ELF 32-bit LSB executable, Intel 80386, version 1,
dynamically linked (uses shared libs), stripped

For script programming, file's -b (brief) switch hides the name of the file and returns only the assessment of the file.

$ file -b robots.txt
ASCII text

Other useful switches include -f (file) to read filenames from a specific file. The -i switch returns the description as MIME type suitable for Web programming. With the -z (compressed) switch, file attempts to determine the type of files stored inside a compressed file. The -L switch follows symbolic links.

$ file -b -i robots.txt
text/plain, ASCII

Creating and Deleting Files

Files are deleted with the rm (remove) command. The -f (force) switch removes a file even when the file permissions indicate the script cannot write to the file, but rm never removes a file from a directory that the script cannot write to. (When the sticky bit is set on the directory, only the owner of the file or of the directory can remove it.)

As always when dealing with files, check that the file exists before you attempt to remove it.

#!/bin/bash
#
# rm_demo.sh: deleting a file with rm
shopt -s -o nounset

declare -rx SCRIPT=${0##*/}
declare -rx FILE2REMOVE="robots.bak"
declare -x  STATUS

if [ ! -f "$FILE2REMOVE" ] ; then
   printf "%s\n" "$SCRIPT: $FILE2REMOVE does not exist" >&2
   exit 192
else
   rm "$FILE2REMOVE" >&2
   STATUS=$?
   if [ $STATUS -ne 0 ] ; then
      printf "%s\n" "$SCRIPT: Failed to remove file $FILE2REMOVE" >&2
      exit $STATUS
   fi
fi

exit 0

When removing multiple files, avoid using the -r (recursive) switch or filename globbing. Instead, get a list of the files to delete (using a command such as find, discussed next) and test each individual file before attempting to remove any of them. This is slower than the alternatives but if a problem occurs no files are removed and you can safely check for the cause of the problem.
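The two-pass approach can be sketched as follows. The directory layout and the *.bak pattern are illustrative, and the -maxdepth and -print0 switches assume GNU (or BSD) find.

```shell
#!/bin/bash
# Sketch: verify every candidate before removing any of them.
# The sample files are created in a scratch directory for the demonstration.
workdir=$(mktemp -d)
touch "$workdir/a.bak" "$workdir/b.bak" "$workdir/keep.txt"

# Pass 1: collect the candidates and check each one.
declare -a TO_REMOVE=()
declare -i PROBLEMS=0
while IFS= read -r -d '' FILE ; do
  if [ -f "$FILE" ] && [ -w "$workdir" ] ; then
    TO_REMOVE+=("$FILE")
  else
    printf '%s\n' "cannot remove $FILE" >&2
    PROBLEMS=$(( PROBLEMS + 1 ))
  fi
done < <(find "$workdir" -maxdepth 1 -type f -name '*.bak' -print0)

# Pass 2: remove the files only if every check passed.
if [ "$PROBLEMS" -eq 0 ] && [ "${#TO_REMOVE[@]}" -gt 0 ] ; then
  rm -- "${TO_REMOVE[@]}"
fi
```

Reading null-terminated names (-print0 with read -d '') keeps filenames with blanks or newlines intact.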

New, empty files are created with the touch command. The command is called touch because, when it's used on an existing file, it changes the modification time even though it makes no changes to the file.

touch is often combined with rm to create new, empty files for a script. Appending output with >> does not result in an error if the file exists, eliminating the need to remember whether a file exists.

For example, if a script is to produce a summary file called run_results.txt, a fresh file can be created:

#!/bin/bash
#
# touch_demo.sh: using touch to create a new, empty file

shopt -s -o nounset

declare -rx RUN_RESULTS="./run_results.txt"

if [ -f "$RUN_RESULTS" ] ; then
   rm -f "$RUN_RESULTS"
   if [ $? -ne 0 ] ; then
      printf "%s\n" "Error: unable to replace $RUN_RESULTS" >&2
   fi
fi
# Create the new, empty file whether or not an old one existed
touch "$RUN_RESULTS"

printf "Run started %s\n" "$(date)" >> "$RUN_RESULTS"

The -f switch forces the creation of a new file every time.

Moving and Copying Files

Files are renamed or moved to new directories using the mv (move) command. If -f (force) is used, mv overwrites an existing file instead of reporting an error. Use -f only when it is safe to overwrite the file.

You can combine touch with mv to back up an old file under a different name before starting a new file. The Linux convention for backup files is to rename them with a trailing tilde (~).

#!/bin/bash
#
# backup_demo.sh

shopt -s -o nounset

declare -rx RUN_RESULTS="./run_results.txt"

if [ -f "$RUN_RESULTS" ] ; then
   mv -f "$RUN_RESULTS" "$RUN_RESULTS~"
   if [ $? -ne 0 ] ; then
      printf "%s\n" "Error: unable to backup $RUN_RESULTS" >&2
   fi
   touch "$RUN_RESULTS"
fi

printf "Run started %s\n" "$(date)" >> "$RUN_RESULTS"

Because it is always safe to overwrite the backup, the move is forced with the -f switch. Archiving files is usually better than outright deleting because there is no way to “undelete” a file in Linux.

Similar to mv is the cp (copy) command. cp makes copies of a file and does not delete the original file. cp can also be used to make links instead of copies using the --link switch.

More Information About Files

There are two Linux commands that display information about a file that cannot be easily discovered with the test command.

The Linux stat command shows general information about the file, including the owner, the size, and the time of the last access.

$ stat ken.txt
  File: "ken.txt"
  Size: 84         Blocks: 8         Regular File
Access: (0664/-rw-rw-r--)         Uid: (  503/ joeuser)  Gid: (  503/ joeuser)
Device: 303        Inode: 131093     Links: 1
Access: Tue Feb 20 16:34:11 2001
Modify: Tue Feb 20 16:34:08 2001
Change: Tue Feb 20 16:34:08 2001

To make the information more readable from a script, use the -t (terse) switch. Each stat item is separated by a space.

$ stat -t robots.txt
robots.txt 21704 48 81fd 503 503 303 114674 1 6f 89 989439402
981490652 989436657

The Linux statftime command has similar capabilities to stat, but has a wider range of formatting options. statftime is similar to the date command: It has a string argument describing how the status information should be displayed. The argument is specified with the -f (format) switch.

The examples below use the most common statftime format codes.

By default, any formatting codes referring to time are based on the file's modification time.

$ statftime -f "%c" robots.txt
Tue Feb  6 15:17:32 2001

Other types of time can be selected by using a time code. The format argument is read left to right, which means different time codes can be combined in one format string. Using %_C, for example, changes the format codes to the inode change time (usually the time the file was created). Using %_L (local time) or %_U (UTC time) makes statftime behave like the date command.

$ statftime -f "modified time = %c current time = %_L%c" robots.txt
modified time = Tue Feb  6 15:17:32 2001 current time = Wed May
  9 15:49:01 2001
$ date
Wed May  9 15:49:01 2001

statftime can create meaningful archive filenames. Often files are sent with a name such as robots.txt and the script wants to save the robots with the date as part of the name.

$ statftime -f "%_a_%_L%m%d.txt" robots.txt
robots_0509.txt

Besides generating new filenames, statftime can be used to save information about a file to a variable.

$ BYTES=$(statftime -f "%_s" robots.txt)
$ printf "The file size is %d bytes\n" "$BYTES"
The file size is 21704 bytes

When a list of files is supplied on standard input, the command processes each file in turn. The %_z code provides the position of the filename in the list, starting at 1.
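Where statftime is not installed, similar results can usually be obtained with other tools. The following sketch assumes GNU coreutils: it uses stat's --format switch and date's --reference switch, neither of which is taken from the text above. The sample file is created only for the demonstration.

```shell
#!/bin/bash
# Assumes GNU stat and GNU date; other systems use different switches.
workdir=$(mktemp -d)
FILE="$workdir/robots.txt"
printf 'User-agent: *\n' > "$FILE"

BYTES=$(stat --format=%s "$FILE")          # size in bytes
MTIME=$(stat --format=%Y "$FILE")          # modification time, seconds since the epoch
STAMP=$(date --reference="$FILE" +%m%d)    # month and day of the modification time

# Build an archive name from the file's base name and its date stamp.
ARCHIVE_NAME="$(basename "$FILE" .txt)_$STAMP.txt"

printf 'The file size is %d bytes\n' "$BYTES"
printf 'Archive name: %s\n' "$ARCHIVE_NAME"
```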

Downloading files (wget)

Linux has a convenient tool for downloading files from other computers or Web sites on the Internet. For downloading files from websites, wget (web get) is usually used. It retrieves files using either the FTP or the HTTP protocol. wget is designed specifically to retrieve multiple files. If a connection is broken, wget tries to reconnect and continue to download the file.

The wget program uses the same form of address as a Web browser, supporting ftp:// and http:// URLs. Login information is added to a URL by placing user: and password@ prior to the hostname. FTP URLs can end with an optional ;type=a or ;type=i for ASCII or IMAGE FTP downloads. For example, to download the info.txt file from the joeuser login with the password jabber12 on the current computer, you use:

$ wget 'ftp://joeuser:jabber12@localhost/info.txt;type=i'

By default, wget uses --verbose message reporting. To report only errors, use the --quiet switch. To log what happened, append the results to a log file using --append-output and a log name and log the server responses with the --server-response switch.

$ wget --server-response --append-output wget.log 'ftp://joeuser:jabber12@localhost/info.txt;type=i'

Whole accounts can be copied using the --mirror switch.

$ wget --mirror 'ftp://joeuser:jabber12@localhost;type=i'

To make it easier to copy a set of files, the --glob switch can enable file pattern matching. --glob=on causes wget to pattern match any special characters in the filename. For example, to retrieve all text files:

$ wget --glob=on 'ftp://joeuser:jabber12@localhost/*.txt'

There are many special-purpose switches not covered here. A complete list of switches is in the reference section. Documentation is available on the wget home page at http://www.gnu.org/software/wget/wget.html.

Verifying Files

Files sent by FTP or wget can be further checked by computing a checksum. The Linux cksum command counts the number of bytes in a file and prints a cyclic redundancy check (CRC) checksum, which can be used to verify that the file arrived complete and intact. The command uses a POSIX-compliant algorithm.

$ cksum robots.txt
491404265 21799 robots.txt

There is also a Linux sum command that provides compatibility with older Unix systems, but be aware that cksum is incompatible with sum.

For greater checksum security, some distributions include a md5sum command to compute an MD5 checksum. The --status switch quietly tests the file. The --binary (or -b) switch treats the file as binary data as opposed to text. The --warn switch prints warnings about bad MD5 formatting. --check (or -c) checks the sum on a file.

$ md5sum robots.txt
945eecc13707d4a23e27730a44774004  robots.txt
$ md5sum robots.txt > robotssum.txt
$ md5sum --check robotssum.txt
robots.txt: OK

Differences between two files can be pinpointed with the Linux cmp command.

$ cmp robots.txt robots2.txt
robots.txt robots2.txt differ: char 179, line 6

If two files don't differ, cmp prints nothing.
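In a script, the --check and --status switches can be combined so that only the exit status is used. A minimal sketch with a sample file created for the demonstration:

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf 'User-agent: *\nDisallow:\n' > "$workdir/robots.txt"

# Record the checksum, as the sender of the file would.
md5sum "$workdir/robots.txt" > "$workdir/robotssum.txt"

# --status prints nothing; the exit status is 0 only if every file matches.
if md5sum --check --status "$workdir/robotssum.txt" ; then
  RESULT_INTACT="ok"
else
  RESULT_INTACT="changed"
fi

# Simulate a damaged transfer and check again.
printf 'extra line\n' >> "$workdir/robots.txt"
if md5sum --check --status "$workdir/robotssum.txt" ; then
  RESULT_MODIFIED="ok"
else
  RESULT_MODIFIED="changed"
fi

printf 'first check: %s, second check: %s\n' "$RESULT_INTACT" "$RESULT_MODIFIED"
```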

Tabs and Spaces

The Linux expand command converts Tab characters into spaces. The default is eight spaces, although you can change this with --tabs=n (or -t n) to n spaces. The --tabs switch can also use a comma-separated list of Tab stops.

$ printf "\tA\tTEST\n" > test.txt
$ wc test.txt
      1       2       8 test.txt
$ expand test.txt | wc
      1       2      21

The --initial (or -i) switch converts only leading Tabs on a line.

$ expand --initial test.txt | wc
      1       2      15

The corresponding unexpand command converts multiple spaces back into Tab characters. The default is eight spaces to a Tab, but you can use the --tabs=n switch to change this. By default, only leading spaces on a line are converted. Use the --all (or -a) switch to consider all spaces on a line.

Use expand to remove tabs from a file before processing it.

Temporary Files

Temporary files, files that exist only for the duration of a script's execution, are traditionally named using the $$ special parameter. It expands to the process ID number of the current script. Including this number in the name of a temporary file makes the name different for each run of the script.

$ TMP="/tmp/reports.$$"
$ printf "%s\n" "$TMP"
/tmp/reports.20629
$ touch "$TMP"

The drawback to this traditional approach lies in the fact that the name of a temporary file is predictable. A hostile program can see the process ID of your scripts when it runs and use that information to identify which temporary files your scripts are using. The temporary file could be deleted or the data replaced in order to alter the behavior of your script.

For better security, or to create multiple files with unique names, Linux has the mktemp command. This command creates a temporary file and prints the name to standard output so it can be stored in a variable. Each time mktemp creates a new file, the file is given a unique name. The name is created from a filename template the program supplies, which ends in the letter X six times. mktemp replaces the six letters with a unique, random code to create a new filename.

$ TMP=$(mktemp /tmp/reports.XXXXXX)
$ printf "%s\n" "$TMP"
/tmp/reports.3LnWVw
$ ls -l "$TMP"
-rw-------    1 joeuser  joeuser         0 Aug  1 14:34 reports.3LnWVw

In this case, the letters XXXXXX are replaced with the code 3LnWVw.

mktemp creates temporary directories with the -d (directories) switch. You can suppress error messages with the -q (quiet) switch.
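Temporary files should also be removed when the script ends, including when it ends early. One common pattern, sketched here, is to register the cleanup with trap as soon as the file is created. The trap is an addition for illustration and is not part of the examples above.

```shell
#!/bin/bash
# Create the temporary file and stop if mktemp fails.
TMP=$(mktemp /tmp/reports.XXXXXX) || exit 1

# Remove the file when the script exits, whether normally or on an error.
trap 'rm -f "$TMP"' EXIT

printf '%s\n' "intermediate data" > "$TMP"
printf 'Working file: %s\n' "$TMP"
```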

Lock Files

When many scripts share the same files, there needs to be a way for one script to indicate to another that it has finished its work. This typically happens when scripts overseen by two different development teams need to share files, or when a shared file can be used by only one script at a time.

A simple method for synchronizing scripts is the use of lock files. A lock file is like a flag variable: The existence of the file indicates a certain condition, in this case, that the file is being used by another program and should not be altered.

Most Linux distributions include a directory called /var/lock, a standard location to place lock files.

Suppose the invoicing files can be accessed by only one script at a time. A lock file called file_convesion_lock can be created to ensure only one script has access.

declare -r my_lockfile="/var/lock/file_convesion_lock"
# Wait for as long as another script holds the lock
while test -f "$my_lockfile" ; do
  printf "Waiting for conversion of files to finish...\n"
  sleep 10
done
touch "$my_lockfile"

This script fragment checks every 10 seconds for the presence of file_convesion_lock. When the file disappears, the loop completes and the script creates a new lock file and proceeds to do its work. When the work is complete, the script should remove the lock file to allow other scripts to proceed.

If a lock file is not removed when one script is finished, it causes the next script to loop indefinitely. The while loop can be modified to use a timeout so that the script stops with an error if the invoice files are not accessible after a certain period of time.

declare -r my_lockfile="/var/lock/file_convesion_lock"
declare -ir lock_timeout=1800    # 30 minutes
declare -i TIME=0
TIME_STARTED=$(date +%s)
while test -f "$my_lockfile" ; do
  printf "Waiting for the conversion of transferred files from Windows to Unix format...\n"
  sleep 10
  TIME=$(date +%s)
  TIME=TIME-TIME_STARTED
  if [ $TIME -gt $lock_timeout ] ; then
     printf "Timed out waiting for files to be converted to Unix format\n" >&2
     exit 1
  fi
done

The date command's %s code returns the current clock time in seconds. When two executions of date are subtracted from each other, the result is the number of seconds since the first date command was executed. In this case, the timeout period is 1800 seconds, or 30 minutes.
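Testing for the lock file and then creating it with touch are two separate steps, so two scripts can both pass the test before either one creates the file. One common way to close that window is to create the lock with the noclobber option set, so that the creation itself fails if the file already exists. The sketch below uses hypothetical function names (acquire_lock, release_lock) and a scratch directory instead of /var/lock so that it runs without special permissions.

```shell
#!/bin/bash
# acquire_lock and release_lock are hypothetical helper names.
acquire_lock() {
  # Run in a subshell so that noclobber does not leak into the caller.
  # With noclobber set, ">" fails if the file already exists.
  ( set -o noclobber ; printf '%s\n' "$$" > "$1" ) 2>/dev/null
}

release_lock() {
  rm -f "$1"
}

lockdir=$(mktemp -d)
my_lockfile="$lockdir/file_conversion_lock"

# First attempt: the lock is free.
if acquire_lock "$my_lockfile" ; then FIRST_TRY="acquired" ; else FIRST_TRY="busy" ; fi

# Second attempt while the lock is still held.
if acquire_lock "$my_lockfile" ; then SECOND_TRY="acquired" ; else SECOND_TRY="busy" ; fi

# After releasing, the lock can be taken again.
release_lock "$my_lockfile"
if acquire_lock "$my_lockfile" ; then THIRD_TRY="acquired" ; else THIRD_TRY="busy" ; fi
release_lock "$my_lockfile"

printf '%s %s %s\n' "$FIRST_TRY" "$SECOND_TRY" "$THIRD_TRY"
```

Writing the process ID into the lock file is optional, but it lets an administrator see which script left a stale lock behind.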

Process Substitution

Sometimes the vertical bar pipe operators cannot be used to link a series of commands together. When a command in the pipeline does not use standard input, or when it uses two sources of input, a pipeline cannot be formed. To create pipes when normal pipelines do not work, Bash uses a special feature called process substitution.

When a command is enclosed in <(...), Bash runs the command separately in a subshell, redirecting the results to a temporary named pipe instead of standard output. In place of the command, Bash substitutes the name of the named pipe from which the results of the command can be read.

Process substitution can be used anywhere a filename is normally used. For example, the Linux grep command, a file-searching command, can search a file for a list of strings. A temporary file can be used to search a log file for references to the files in the current directory.

$ ls -1 > temp.txt
$ grep -f temp.txt /var/log/nightrun_log.txt
Wed Aug 29 14:18:38 EDT 2001 invoice_error.txt deleted
$ rm temp.txt

A pipeline cannot be used to combine these commands because the list of files is being read from temp.txt, not standard input. However, these two commands can be rewritten as a single command using process substitution in place of the temporary filename.

$ grep -f <(ls -1) /var/log/nightrun_log.txt
Wed Aug 29 14:18:38 EDT 2001 invoice_error.txt deleted

In this case, the results of ls -1 are written to a temporary pipe. grep reads the list of files from the pipe and matches them against the contents of the nightrun_log.txt file. The fact that Bash replaces the ls command with the name of a temporary pipe can be checked with a printf statement.

$ printf "%s\n" <(ls -1)
/dev/fd/63

Bash replaces -f <(ls -1) with -f /dev/fd/63. In this case, the pipe is opened as file descriptor 63. The left angle bracket (<) indicates that the temporary file is read by the command using it. Likewise, a right angle bracket (>) indicates that the temporary pipe is written to instead of read.
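Process substitution is also useful for the second case mentioned earlier, a command that needs two sources of input. The following sketch compares two unsorted lists with comm, which expects two sorted files. The file names and contents are invented for the example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$workdir/old.txt"
printf '%s\n' fig apple kiwi > "$workdir/new.txt"

# Each <(...) is replaced by the name of a pipe carrying the sorted list.
# comm -13 suppresses columns 1 and 3, leaving lines found only in the second input.
ONLY_IN_NEW=$(comm -13 <(sort "$workdir/old.txt") <(sort "$workdir/new.txt"))

printf 'Only in new.txt: %s\n' "$ONLY_IN_NEW"
```

Without process substitution, both sorted lists would have to be written to temporary files first.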

Using head and tail

The Linux head command returns the first lines contained in a file. By default, head prints the first 10 lines. You can specify a specific number of lines with the --lines=n (or -n n) switch. Similarly, tail prints the last 10 lines by default.

$ tail -n 50 /var/log/messages

You can abbreviate the -n switch to a minus sign and the number of lines.

$ tail -5 /var/log/messages

Combining tail and head in a pipeline, you can display any line or range of lines. The following prints lines 4901 through 5000, assuming the file has at least 5000 lines.

$ head -5000 /var/log/messages | tail -100

If the number of lines is prefixed with a plus sign instead of a minus sign, tail starts at that line number, counted from the start of the file, and prints the remainder. This is a feature of tail, not the head command.

$ tail -n +17 /var/log/messages

When using head or tail on arbitrary files in a script, always check to make sure that the file is a regular file to avoid unpleasant surprises.
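The range technique and the regular-file check can be wrapped in a small function. print_lines is a hypothetical name, and the numbers file exists only for the demonstration.

```shell
#!/bin/bash
# print_lines FILE FIRST LAST: print lines FIRST through LAST of FILE.
# print_lines is a hypothetical helper name used for illustration.
print_lines() {
  local file=$1 first=$2 last=$3
  # Refuse anything that is not a regular file (directories, devices, pipes).
  [ -f "$file" ] || return 1
  # Skip to line FIRST, then keep LAST-FIRST+1 lines.
  tail -n +"$first" "$file" | head -n "$(( last - first + 1 ))"
}

workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

RANGE=$(print_lines "$workdir/numbers.txt" 5 7)
printf '%s\n' "$RANGE"
```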

File Statistics

The Linux wc (word count) command provides statistics about a file. By default, wc shows the size of the file in lines, words, and bytes. To make wc useful in scripts, switches must be used to return a single statistic.

The --bytes (or -c) switch returns the file size in bytes, the same value as the file size returned by statftime. The --chars (or -m) switch counts characters instead, which gives a different result for files containing multibyte characters.

$ wc --bytes invoices.txt
  20411 invoices.txt

To use wc in a script, direct the file through standard input so that the filename is suppressed.

$ wc --bytes < status_log.txt
  57496

The --lines (or -l) switch returns the number of lines in the file. That is, it counts the number of line feed characters.

$ wc --lines < status_log.txt
   1569

The --max-line-length (or -L) switch returns the length of the longest line. The --words (or -w) switch counts the number of words in the file.

wc can be used with variables when their values are printed into a pipeline.

$ declare -r TITLE="Annual Grain Yield Report"
$ printf "%s\n" "$TITLE" | wc --words
4

Cutting

The Linux cut command extracts substrings from every line contained in a file.

The --fields (or -f) switch prints a section of a line marked by a specific character. The --delimiter (or -d) switch chooses the character. To use a space as a delimiter, it must be escaped with a backslash or enclosed in quotes.

$ declare -r TITLE="Annual Grain Yield Report"
$ printf "%s\n" "$TITLE" | cut -d' ' -f2
Grain

In this example, the delimiter is a space and the second field marked by a space is Grain. When cutting with printf, always make sure a line feed character is printed; otherwise, cut will return an empty string.

Multiple fields are indicated with commas and ranges as two numbers separated by a minus sign (-).

$ printf "%s\n" "$TITLE" | cut -d' ' -f 2,4
Grain Report

cut separates multiple output fields with the delimiter character. To use a different delimiter character when displaying the results, use the --output-delimiter switch.

The --characters (or -c) switch prints the characters at the specified positions. This is similar to the ${VAR:offset:length} substring expansion, but any character or range of characters can be specified. The --bytes (or -b) switch works identically but is provided for future support of multi-byte international characters.

$ printf "%s\n" "$TITLE" | cut --characters 1,3,6-8
Anl G

The --only-delimited (or -s) switch ignores lines in which the delimiter character doesn't appear. This is an easy way to skip a title or other notes at the beginning of a data file.

When used on multiple lines, cut cuts each line.

$ cut -d, -f1 < robots.txt | head -3
Birchwood China Hutch
Bookcase Oak Veneer
Small Bookcase Oak Veneer

The script below adds up the quantity fields in robots.txt.

#!/bin/bash
#
# cut_demo.sh: compute the total quantity from robots.txt

shopt -o -s nounset

declare -i QTY
declare -ix TOTAL_QTY=0

# The braces keep the loop and the final printf in the same subshell,
# so TOTAL_QTY is still set when it is printed.
cut -d, -f3 robots.txt | {
  while read -r QTY ; do
    TOTAL_QTY=TOTAL_QTY+QTY
  done
  printf "The total quantity is %d\n" "$TOTAL_QTY"
}
exit 0
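When several fields of the same line are needed, an alternative to cut is to let read split the line itself by setting IFS to the delimiter. The sketch below assumes comma-separated records with the quantity in the third field; the file name orders.txt, the variable names, and the sample records are invented for the example.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Sample records: name,price,quantity,code (illustrative data only).
cat > "$workdir/orders.txt" <<'EOF'
Bar Stool,45.99,2,756
Lawn Chair,55.99,3,756
Garden Bench,149.99,1,757
EOF

declare -i TOTAL_QTY=0

# IFS=, applies only to read, which splits each line at the commas.
while IFS=, read -r NAME PRICE QTY CODE ; do
  TOTAL_QTY=$(( TOTAL_QTY + QTY ))
done < "$workdir/orders.txt"

printf 'The total quantity is %d\n' "$TOTAL_QTY"
```

Because the loop reads from a redirection rather than a pipe, it runs in the current shell and the total is still available after the loop ends.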

Columns

The Linux column command creates fixed-width columns. The columns are fitted to the size of the screen as determined by the COLUMNS environment variable, or to a specific row width using the -c switch.

$ column < robots.txt
Birchwood China Hutch,475.99,1,756      Bar Stool,45.99,1,756
Bookcase Oak Veneer,205.99,1,756        Lawn Chair,55.99,1,756
Small Bookcase Oak Veneer,205.99,1,756  Rocking Chair,287.99,1,757
Reclining Chair,1599.99,1,757           Cedar Armoire,825.99,1,757
Bunk Bed,705.99,1,757                   Mahogany Writing Desk,463.99,1,756
Queen Bed,925.99,1,757                  Garden Bench,149.99,1,757
Two-drawer Nightstand,125.99,1,756      Walnut TV Stand,388.99,1,756
Cedar Toy Chest,65.99,1,757             Victorian-style Sofa,1225.99,1,757
Six-drawer Dresser,525.99,1,757         Chair - Rocking,287.99,1,757
Pine Round Table,375.99,1,757           Grandfather Clock,2045.99,1,756

The -t switch creates a table from items delimited by a character specified by the -s switch.

$ column -s ',' -t < robots.txt | head -5
Birchwood China Hutch      475.99   1  756
Bookcase Oak Veneer        205.99   1  756
Small Bookcase Oak Veneer  205.99   1  756
Reclining Chair            1599.99  1  757
Bunk Bed                   705.99   1  757

The table fill-order can be swapped with the -x switch.

Finding Lines

The Linux grep command searches a file for lines matching a pattern.

On classic Unix systems there are two other grep commands, egrep and fgrep. The GNU implementation combines these variations into one command: the egrep command runs grep with the --extended-regexp (or -E) switch, and the fgrep command runs grep with the --fixed-strings (or -F) switch.

The strange name grep originates in the early days of Unix, whereby one of the line-editor commands was g/re/p (globally search for a regular expression and print the matching lines). Because this editor command was used so often, a separate grep command was created to search files without first starting the line editor.

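egrep mode (activated by the -E switch or by using egrep as the name of the command) uses AWK-style (extended) regular expressions. The basic symbols are as follows:

.        any single character
*        zero or more of the preceding item
+        one or more of the preceding item
?        zero or one of the preceding item
{n,m}    between n and m of the preceding item
[abc]    any one of the characters a, b, or c
^        the beginning of a line
$        the end of a line
a|b      either a or b
( )      grouping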

Notice that the symbols are not exactly the same as the globbing symbols used for file matching. For example, on the command line a question mark represents any character, whereas in egrep, the period has this effect.

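In basic mode, the characters ?, +, {, |, (, and ) are special only when preceded by a backslash; in extended mode they are special without it. In either mode, enclose the pattern in quotes to prevent Bash from treating these characters as its own special characters.

In normal mode, only basic regular expressions are supported. In a basic regular expression, the asterisk (*) matches zero or more occurrences of the preceding character. The pattern M*Desk therefore matches any line containing Desk, with or without a preceding M.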

$ grep "M*Desk" robots.txt
Mahogany Writing Desk,463.99,1,756

The --fixed-strings ( -F) switch suppresses the meaning of the pattern-matching characters. When used with M*Desk, grep searches for the exact string, including the asterisk, which does not appear anywhere in the file.

$ grep --fixed-strings "M*Desk" robots.txt

The --ignore-case (or -i) switch makes the search case insensitive. Searching for W shows all lines containing W and w.

$ grep --ignore-case "W" robots.txt
Birchwood China Hutch,475.99,1,756
Two-drawer Nightstand,125.99,1,756
Six-drawer Dresser,525.99,1,757
Lawn Chair,55.99,1,756
Mahogany Writing Desk,463.99,1,756
Walnut TV Stand,388.99,1,756

The --invert-match (or -v) switch shows the lines that do not match. Lines that match are not shown.

$ grep --invert-match "r" robots.txt
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
Pine Round Table,375.99,1,757
Walnut TV Stand,388.99,1,756

Regular expressions can be joined together with a vertical bar (|). This has the same effect as combining the results of two separate grep commands.

$ grep "Stool" robots.txt
Bar Stool,45.99,1,756
$ grep "Chair" robots.txt
Reclining Chair,1599.99,1,757
Lawn Chair,55.99,1,756
Rocking Chair,287.99,1,757
Chair - Rocking,287.99,1,757
$ grep "Stool\|Chair" robots.txt
Reclining Chair,1599.99,1,757
Bar Stool,45.99,1,756
Lawn Chair,55.99,1,756
Rocking Chair,287.99,1,757
Chair - Rocking,287.99,1,757

To identify the matching line, the --line-number (or -n) switch displays both the line number and the line. Using cut, head, and tail, the first line number can be saved in a variable. The number of bytes into the file can be shown with --byte-offset (or -b).

$ grep --line-number "Chair - Rock" robots.txt
19:Chair - Rocking,287.99,1,757
$ FIRST=$(grep --line-number "Chair - Rock" robots.txt | cut -d: -f1 | head -1)
$ printf "First occurrence at line %d\n" "$FIRST"
First occurrence at line 19
					  

The --count (or -c) switch counts the number of matching lines and displays the total.

$ CNT=$(grep --count "Chair" robots.txt)
$ printf "There are %d chair(s).\n" "$CNT"
There are 4 chair(s).

grep recognizes the standard character classes as well.

$ grep "[[:cntrl:]]" robots.txt

A complete list of Linux grep switches appears in the reference section at the end of the chapter.

Locating Files

The Linux locate command consults a database and returns a list of all pathnames containing a certain group of characters, much like a fixed-string grep.

$ locate /robots.txt
/home/joeuser/www/robots.txt
/home/joeuser/robots.txt
$ locate robots.txt
/home/joeuser/www/robots.txt
/home/joeuser/test/advocacy/old_robots.txt
/home/joeuser/robots.txt

Older versions of locate show any file on the system, even files you normally don't have access to. Newer versions only show files that you have permission to see.

The locate database is maintained by a command called updatedb. It is usually executed once a day by Linux distributions. For this reason, locate is very fast, but it cannot find files created since the database was last updated, which usually means files less than a day old.

Finding Files

The Linux find command searches for files that meet specific conditions such as files with a certain name or files greater than a certain size. find is similar to the following loop where MATCH is the matching criteria:

ls --recursive | while read FILE ; do
     # test file for a match
    if [ $MATCH ] ; then
       printf "%s\n" "$FILE"
    fi
done

This script recursively searches directories under the current directory, looking for a filename that matches some condition called MATCH.

find is much more powerful than this script fragment. Like the built-in test command, find switches create expressions describing the qualities of the files to find. There are also switches to change the overall behavior of find and other switches to indicate actions to perform when a match is made.

The basic matching switch is -name, which indicates the name of the file to find. Name can be a specific filename or it can contain shell path wildcard globbing characters like * and ?. If pattern matching is used, the pattern must be enclosed in quotation marks to prevent the shell from expanding it before the find command examines it.

$ find . -name "*.txt"
./robots.txt
./advocacy/linux.txt
./advocacy/old_robots.txt

The first parameter is the directory to start searching in. In this case, it's the current directory.

The previous find command matches any type of file, including files such as pipes or directories, which is not usually the intention of a user. The -type switch limits the files to a certain type of file. The -type f switch matches only regular files, the most common kind of search. The type can also be b (block device), c (character device), d (directory), p (pipe), l (symbolic link), or s (socket).

$ find . -name "*.txt" -type f
./robots.txt
./advocacy/linux.txt
./archive/old_robots.txt

The switch -name "*.txt" -type f is an example of a find expression. These switches match a file that meets both of these conditions (implicitly, a logical “and”). Other operators combine conditions into logical expressions: -and (or -a) for an explicit “and”, -or (or -o) for “or”, ! (or -not) for negation, and parentheses for grouping.

For example, to count the number of regular files and directories, do this:

$ find . -type d -or -type f | wc -l
    224

The number of files without a .txt suffix can be counted as well.

$ find . ! -name "*.txt" -type f | wc -l
    185

Parentheses must be escaped by a backslash or quotes to prevent Bash from interpreting them as a subshell. Using parentheses, the number of files ending in .txt or .sh can be expressed as

$ find . "(" -name "*.txt" -or -name "*.sh" ")" -type f | wc -l
     11
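
The operators can be tried safely in a scratch directory. This is a minimal sketch; the directory layout and file names are invented for the example, and the portable short forms -o and \( \) are used.

```shell
# Build a small, hypothetical directory tree.
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/a.txt" "$dir/b.sh" "$dir/sub/c.txt" "$dir/sub/d.log"

# Regular files ending in .txt or .sh (a.txt, b.sh, sub/c.txt).
txt_or_sh=$(find "$dir" \( -name '*.txt' -o -name '*.sh' \) -type f | wc -l)

# Regular files that do NOT end in .txt (b.sh, sub/d.log).
not_txt=$(find "$dir" ! -name '*.txt' -type f | wc -l)

printf 'txt or sh: %d, not txt: %d\n' "$txt_or_sh" "$not_txt"
rm -rf "$dir"
```

The parentheses matter: the implicit “and” binds more tightly than “or”, so without them -type f would apply only to the second -name test.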

Some expression switches refer to measurements of time. Historically, find times were measured in days, but the GNU version adds min switches for minutes. When the amount of time is given without a sign, find looks for an exact match.

To search for files older than an amount of time, include a plus or minus sign. If a plus sign (+) precedes the amount of time, find searches for times greater than this amount. If a minus sign (-) precedes the time measurement, find searches for times less than this amount. The plus and minus zero days designations are not the same: +0 in days means “older than no days,” or in other words, files one or more days old. Likewise, -5 in minutes means “younger than 5 minutes” or “zero to four minutes old”.

There are several switches used to test the access time, which is the time a file was last read. The -anewer switch checks whether a file was accessed more recently than a specified reference file was modified. -atime tests the number of days ago a file was accessed. -amin checks the access time in minutes.

Likewise, you can check the inode change time with -cnewer, -ctime, and -cmin. The inode change time is updated whenever the file's contents or its attributes (such as permissions, owner, or link count) change; it is not the time the file was created. You can check the modified time, which is the time the file's contents were last written, by using -newer, -mtime, and -mmin.

To find files that were last modified at least one day ago:

$ find . -name "*.txt"  -type f -mtime +0
./archive/old_robots.txt

To find files that have been accessed in the last 10 to 60 minutes:

$ find . -name "*.txt"  -type f -amin +9 -amin -61
./robots.txt
./advocacy/linux.txt
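
The plus and minus signs are easier to follow with files of known age. The sketch below assumes GNU touch (for the '3 days ago' date syntax) and a find that supports -mmin; the file names are hypothetical.

```shell
# Create one fresh file and one file back-dated by three days.
dir=$(mktemp -d)
touch "$dir/new.txt"
touch -d '3 days ago' "$dir/old.txt"   # GNU touch date syntax (assumption)

# +0 days: modified at least one full day ago.
old=$(find "$dir" -type f -mtime +0)

# -5 minutes: modified less than five minutes ago.
recent=$(find "$dir" -type f -mmin -5)

printf 'old: %s\nrecent: %s\n' "$old" "$recent"
rm -rf "$dir"
```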

The -size switch tests the size of a file. The default measurement is 512-byte blocks, which is counterintuitive to many users and a common source of errors. Unlike the time-measurement switches, which have different switches for different measurements of time, to change the unit of measurement for size you must follow the amount with b (512-byte blocks, the default), c (bytes), k (kilobytes of 1,024 bytes), or w (two-byte words). Current versions of GNU find also accept M (megabytes) and G (gigabytes), in uppercase only. Sizes are rounded up to the next whole unit before the comparison is made. Like the time measurements, the amount can have a minus sign (-) to test for files smaller than the specified size, or a plus sign (+) to test for larger files.

For example, use this to find log files greater than 1MB:

$ find . -type f -name "*.log" -size +1024k
./logs/giant.log
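
Because the default unit is easy to get wrong, it can help to test a size expression against files of known length first. This is an illustrative sketch with made-up file names; the c suffix is used so that the comparison is in plain bytes.

```shell
# Two hypothetical files of known size.
dir=$(mktemp -d)
head -c 2048 /dev/zero > "$dir/big.log"     # 2048 bytes
head -c 100  /dev/zero > "$dir/small.log"   # 100 bytes

# Larger than 1000 bytes.
over=$(find "$dir" -type f -size +1000c)

# Smaller than 1000 bytes.
under=$(find "$dir" -type f -size -1000c)

# Larger than 1 kilobyte, after rounding each size up to whole kilobytes.
over_1k=$(find "$dir" -type f -size +1k)

printf 'over: %s\nunder: %s\n' "$over" "$under"
rm -rf "$dir"
```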

find shows the matching paths on standard output. Historically, the -print switch had to be used. Printing the paths is now the default behavior for most Unix-like operating systems, including Linux. If compatibility is a concern, add -print to the end of the find parameters.

To perform a different action on a successful match, use -exec. The -exec switch runs a program on each matching file. This is often combined with rm to delete matching files, or grep to further test the files to see whether they contain a certain pattern. The name of the file is inserted into the command by a pair of curly braces ({}) and the command ends with an escaped semicolon. (If the semicolon is not escaped, the shell interprets it as the end of the find command instead.)

$ find . -type f -name "*.txt" -exec grep Table {} \;
Pine Round Table,375.99,1,757
Pine Round Table,375.99,1,757

More than one action can be specified. To show the filename after a grep match, include -print.

$ find . -type f -name "*.txt" -exec grep Table {} \; -print
Pine Round Table,375.99,1,757
./robots.txt
Pine Round Table,375.99,1,757
./archive/old_robots.txt

For portability, {} should appear by itself (that is, surrounded by whitespace). GNU find also replaces {} when it is embedded in a longer argument, but other versions of find may not, so don't rely on combining it with other characters to form a new pathname.

The -exec switch can be slow for a large number of files: The command must be executed for each match. When you have the option of piping the results to a second command, the execution speed is significantly faster than when using -exec. A pipe generates the results with two commands instead of hundreds or thousands of commands.
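
Two common ways to avoid one process per file are the plus-terminated form of -exec and a pipe into xargs. The sketch below compares them on invented sample files; -print0 and xargs -0 are assumed to be available, as they are in the GNU and BSD tools.

```shell
# Hypothetical sample files, one with a space in its name.
dir=$(mktemp -d)
printf 'Pine Round Table\n' > "$dir/a.txt"
printf 'Queen Bed\n'        > "$dir/b.txt"
printf 'Oak Table\n'        > "$dir/name with space.txt"

# One grep process per matching file.
per_file=$(find "$dir" -type f -name '*.txt' -exec grep -l Table {} \; | wc -l)

# One grep process for many files: terminate -exec with + instead of \;
batched=$(find "$dir" -type f -name '*.txt' -exec grep -l Table {} + | wc -l)

# Pipe null-delimited names to xargs, which is safe for unusual file names.
piped=$(find "$dir" -type f -name '*.txt' -print0 | xargs -0 grep -l Table | wc -l)

printf '%d %d %d\n' "$per_file" "$batched" "$piped"
rm -rf "$dir"
```

All three commands should report the same files; only the number of grep processes differs.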

The -ok switch works the same way as -exec except that it interactively verifies whether the command should run.

$ find . -type f -name "*.txt" -ok rm {} \;
< rm ... ./robots.txt > ? n
< rm ... ./advocacy/linux.txt > ? n
< rm ... ./advocacy/old_robots.txt > ? n
				

The -ls action switch lists the matching files with more detail, in the same format as ls -dils.

$ find . -type f -name "*.txt" -ls
243300    4 -rw-rw-r--   1 joeuser  joeuser       592 May 17 14:41 ./robots.txt
114683    0 -rw-rw-r--   1 joeuser  joeuser         0 May 17 14:41 ./advocacy/linux.txt
114684    4 -rw-rw-r--   1 joeuser  joeuser       592 May 17 14:41 ./advocacy/old_robots.txt
					  

The -printf switch makes find act like a searching version of the statftime command. The % format codes indicate what kind of information about the file to print. Many of these provide the same functions as statftime, but use a different code.

A complete list appears in the reference section.

The time codes also differ from statftime: statftime remembers the last type of time selected, whereas find requires the type of time for each time element printed.

$ find . -type f -name "*.txt" -printf "%f access time is %a\n"
robots.txt access time is Thu May 17 16:47:08 2001
linux.txt access time is Thu May 17 16:47:08 2001
old_robots.txt access time is Thu May 17 16:47:08 2001
$ find . -type f -name "*.txt" -printf "%f modified time as hours:minutes is %TH:%TM\n"
robots.txt modified time as hours:minutes is 14:41
linux.txt modified time as hours:minutes is 14:41
old_robots.txt modified time as hours:minutes is 14:41
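
Format codes can be mixed freely with literal text. As a small sketch, assuming GNU find (-printf is a GNU extension) and a made-up file, %f prints the file name without its directory and %s prints the size in bytes:

```shell
# A hypothetical five-byte file.
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"

# %f = base name, %s = size in bytes.
report=$(find "$dir" -type f -name '*.txt' -printf '%f is %s bytes\n')

printf '%s\n' "$report"
rm -rf "$dir"
```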

A complete list of find switches appears in the reference section.

Sorting

The Linux sort command sorts a file or a set of files. A file can be named explicitly or redirected to sort on standard input. The switches for sort are completely different from commands such as grep or cut. sort is one of the last commands to support long versions of switches: As a result, the short switches are used here. Even so, the switches for common options are not the same as other Linux commands.

To sort a file correctly, the sort command needs to know the sort key, the characters on each line that determine the order of the lines. Anything that isn't in the key is ignored for sorting purposes. By default, the entire line is considered the key.

The -f (fold character cases together) switch performs a case-insensitive sort. (sort does not use the -i switch for this, as many other Linux commands do.)

$ sort -f robots.txt
Bar Stool,45.99,1,756
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Bunk Bed,705.99,1,757
Cedar Armoire,825.99,1,757
Cedar Toy Chest,65.99,1,757
Chair - Rocking,287.99,1,757
Garden Bench,149.99,1,757
Grandfather Clock,2045.99,1,756
Lawn Chair,55.99,1,756
Mahogany Writing Desk,463.99,1,756
Pine Round Table,375.99,1,757
Queen Bed,925.99,1,757
Reclining Chair,1599.99,1,757
Rocking Chair,287.99,1,757
Six-drawer Dresser,525.99,1,757
Small Bookcase Oak Veneer,205.99,1,756
Two-drawer Nightstand,125.99,1,756
Victorian-style Sofa,1225.99,1,757
Walnut TV Stand,388.99,1,756

The -r (reverse) switch reverses the sorting order.

$ head robots.txt | sort -f -r
Two-drawer Nightstand,125.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Six-drawer Dresser,525.99,1,757
Reclining Chair,1599.99,1,757
Queen Bed,925.99,1,757
Pine Round Table,375.99,1,757
Cedar Toy Chest,65.99,1,757
Bunk Bed,705.99,1,757
Bookcase Oak Veneer,205.99,1,756
Birchwood China Hutch,475.99,1,756

If only part of the line is to be used as a key, the -k (key) switch determines which characters to use. The field delimiter is any group of space or Tab characters, but you can change this with the -t switch.

To sort the first 10 lines of the robots file on the second and subsequent fields, use this

$ head robots.txt | sort -f -t, -k2
Two-drawer Nightstand,125.99,1,756
Reclining Chair,1599.99,1,757
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Pine Round Table,375.99,1,757
Birchwood China Hutch,475.99,1,756
Six-drawer Dresser,525.99,1,757
Cedar Toy Chest,65.99,1,757
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757

The key position can be followed by the ending position, separated by a comma. For example, to sort only on the second field, use a key of -k 2,2.

If the field number has a decimal part, it represents the character of the field where the key begins. The first character in the field is 1. The first field always starts at the beginning of the line. For example, to sort by ignoring the first character, indicate that the key begins with the second character of the first field.

$ head robots.txt | sort -f -k1.2
Reclining Chair,1599.99,1,757
Cedar Toy Chest,65.99,1,757
Pine Round Table,375.99,1,757
Birchwood China Hutch,475.99,1,756
Six-drawer Dresser,525.99,1,757
Small Bookcase Oak Veneer,205.99,1,756
Bookcase Oak Veneer,205.99,1,756
Queen Bed,925.99,1,757
Bunk Bed,705.99,1,757
Two-drawer Nightstand,125.99,1,756

There are many switches that affect how a key is interpreted. The -b (blanks) switch indicates the key is a string with leading blanks that should be ignored. The -n (numeric) switch treats the key as a number. This switch recognizes minus signs and decimal portions, but not plus signs. The -g (general number) switch treats the key as a C floating-point number notation, allowing infinities, NaNs, and scientific notation. This option is slower than -n. Number switches always imply a -b. The -d (dictionary order) switch only uses blanks and alphanumeric characters in the sorting key, ignoring periods, hyphens, and other punctuation. The -i (ignore unprintable) switch only uses printable characters in the sorting key. The -M (months) switch sorts by month name abbreviations.

There can be more than one sorting key. The key interpretation switches can be applied to individual keys by adding the character to the end of the key amount, such as -k4,4M, which means “sort on the fourth field that contains month names”. The -r and -f switches can also be used this way.

For a more complex example, the following sort command sorts on the account number, in reverse order, and then by the product name. The sort is case insensitive and skips leading blanks:

$ head robots.txt | sort -t, -k4,4rn -k1,1fb
Bunk Bed,705.99,1,757
Cedar Toy Chest,65.99,1,757
Pine Round Table,375.99,1,757
Queen Bed,925.99,1,757
Reclining Chair,1599.99,1,757
Six-drawer Dresser,525.99,1,757
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Two-drawer Nightstand,125.99,1,756

For long sorts, the -c (check only) switch checks the files to make sure they need sorting before you attempt to sort them. This switch returns a status code of 0 if the files are sorted.
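
The sketch below combines a two-key sort with the -c check. The three sample records and the variable names are hypothetical; LC_ALL=C is set so that the result does not depend on the local collation rules.

```shell
export LC_ALL=C   # byte-order collation and "." as the decimal point

# Hypothetical sample records in the robots.txt layout.
tmp=$(mktemp)
printf '%s\n' 'Lawn Chair,55.99,1,756' 'Bunk Bed,705.99,1,757' 'Bar Stool,45.99,1,756' > "$tmp"

# Account number (field 4) numerically, highest first, then product name.
sorted=$(sort -t, -k4,4nr -k1,1f "$tmp")
first_line=$(printf '%s\n' "$sorted" | head -n 1)

# sort -c returns 0 only when the input is already in the requested order.
if sort -c -t, -k4,4nr -k1,1f "$tmp" 2>/dev/null; then before=sorted; else before=unsorted; fi
if printf '%s\n' "$sorted" | sort -c -t, -k4,4nr -k1,1f 2>/dev/null; then after=sorted; else after=unsorted; fi

printf 'before: %s, after: %s\n' "$before" "$after"
rm -f "$tmp"
```

The same key switches must be given to sort -c as to the sort itself, because "sorted" is only meaningful relative to a particular key.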

A complete list of sort switches appears in the reference section.

Character Editing (tr)

The Linux tr (translate) command substitutes or deletes characters on standard input, writing the results to standard output.

The -d (delete) switch deletes a specific character.

$ printf "%s\n" 'The total is $234.45 US'
The total is $234.45 US
$ printf "%s\n" 'The total is $234.45 US' | tr -d '$'
The total is 234.45 US

Ranges of characters are represented as the first character, a hyphen, and the last character.

$ printf "%s\n" 'The total is $234.45 US' | tr -d 'A-Z'
he total is $234.45

tr supports POSIX character classes.

$ printf "%s\n" 'The total is $234.45 US' | tr -d '[:upper:]'
he total is $234.45

Without any options, tr maps one set of characters to another. The first character in the first parameter is changed to the first character in the second parameter. The second character in the first parameter is changed to the second character in the second parameter. (And so on.)

$ printf "%s\n" "The cow jumped over the moon" | tr 'aeiou' 'AEIOU'
ThE cOw jUmpEd OvEr thE mOOn

tr supports character equivalence. To translate any e-like characters in a variable named FOREIGN_STRING to a plain e, for example, you use

$ printf "%s\n" "$FOREIGN_STRING" | tr "[=e=]" "e"

The --truncate-set1 (or -t) ignores any characters in the first parameter that don't have a matching character in the second parameter.

The --complement (or -c) switch reverses the sense of matching. The characters in the first parameter are not mapped into the second, but characters that aren't in the first parameter are changed to the indicated character.

$ printf "%s\n" "The cow jumped over the moon" | tr --complement 'aeiou' '?'
??e??o???u??e??o?e????e??oo??

The --squeeze-repeats (or -s) switch reduces multiple occurrences of a letter to a single character for each of the letters you specify.

$ printf "%s\n" "aaabbbccc" | tr --squeeze-repeats 'c'
aaabbbc

By far the most common use of tr is to translate MS-DOS text files to Unix text files. DOS text files have carriage returns and line feed characters, whereas Linux uses only line feeds to mark the end of a line. The extra carriage returns need to be deleted.

$ tr -d '\r' < dos.txt > linux.txt
				

Text files from classic Mac OS (before OS X) have carriage returns instead of line feeds. tr can take care of that as well by replacing the carriage returns.

$ tr '\r' '\n' < apple.txt > linux.txt
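
A conversion like this can be verified by counting the carriage returns before and after. The following is a minimal sketch using temporary files with invented contents; tr -cd deletes everything except the listed characters, which leaves only the characters to be counted.

```shell
# A hypothetical DOS-style file with CR LF line endings.
dos=$(mktemp)
unix=$(mktemp)
printf 'line one\r\nline two\r\n' > "$dos"

# Delete the carriage returns.
tr -d '\r' < "$dos" > "$unix"

# Count the carriage returns that remain in each file.
cr_before=$(tr -cd '\r' < "$dos" | wc -c)
cr_after=$(tr -cd '\r' < "$unix" | wc -c)

# Squeeze runs of spaces down to a single space.
squeezed=$(printf 'a    b   c\n' | tr -s ' ')

printf 'CRs before: %d, after: %d\n' "$cr_before" "$cr_after"
rm -f "$dos" "$unix"
```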
				

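tr also recognizes other escaped characters: \a (bell), \b (backspace), \f (form feed), \n (line feed), \r (carriage return), \t (Tab), \v (vertical tab), \\ (backslash), and \NNN (the character with octal value NNN).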

You can perform more complicated file editing with the sed command, discussed next.

File Editing (sed)

The Linux sed (stream editor) command makes changes to a text file on a line-by-line basis. Although the name contains the word “editor,” it's not a text editor in the usual sense. You can't use it to interactively make changes to a file. Whereas the grep command locates regular expression patterns in a file, the sed command locates patterns and then makes alterations where the patterns are found.

sed's main argument is a complex four-part string, separated by slashes.

$ sed "s/dog/canine/g" animals.txt
				

The first part indicates the kind of editing sed will do. The second part is the pattern of characters that sed is looking for. The third part is the pattern of characters to apply with the command. The fourth part is the range of the editing (if there are multiple occurrences of the target pattern). In this example, in the sed expression "s/dog/canine/g", the edit command is s, the pattern to match is dog, the pattern to apply is canine, and the range is g. Using this expression, sed will substitute all occurrences of the string dog with canine in the file animals.txt.

The use of quotation marks around the sed expression is very important. Many characters with a special meaning to the shell also have a special meaning to sed. To prevent the shell from interpreting these characters before sed has a chance to analyze the expression, the expression must be quoted.

Like grep, sed uses regular expressions to describe the patterns. Also, there is no limit to the line lengths that can be processed by the Linux version of sed.

Some sed commands can operate on a specific line by including a line number. A line number can also be specified with an initial line and a stepping factor. 1~2 searches all lines, starting at line 1, and stepping by 2. That is, it picks all the odd lines in a file. A range of addresses can be specified with the first line, a comma, and the last line. 1,10 searches the first 10 lines. A trailing exclamation point reverses the sense of the search. 1,10! searches all lines except the first 10. If no lines are specified, all lines are searched.
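
The address forms can be compared side by side on a five-line input. This is a sketch with made-up input; the first~step form is a GNU sed extension, so GNU sed is assumed.

```shell
# Hypothetical five-line input.
input=$(printf '%s\n' one two three four five)

# Lines 1 through 2 only (-n suppresses the default output, p prints).
first_two=$(printf '%s\n' "$input" | sed -n '1,2p')

# Every line except lines 1 through 3.
not_first_three=$(printf '%s\n' "$input" | sed -n '1,3!p')

# Odd-numbered lines: start at line 1 and step by 2 (GNU extension).
odd=$(printf '%s\n' "$input" | sed -n '1~2p')

printf '%s\n' "$odd"
```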

The sed s (substitute) command replaces any matching pattern with new text.

To replace the word Pine with Cedar in the first 10 lines of the order file, use this

$ head robots.txt | sed 's/Pine/Cedar/g'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
Two-drawer Nightstand,125.99,1,756
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Cedar Round Table,375.99,1,757

Pine Round Table becomes Cedar Round Table.

If the replacement string is empty, the occurrence of the pattern is deleted.

$ head robots.txt | sed 's/757//g'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,
Bunk Bed,705.99,1,
Queen Bed,925.99,1,
Two-drawer Nightstand,125.99,1,756
Cedar Toy Chest,65.99,1,
Six-drawer Dresser,525.99,1,
Pine Round Table,375.99,1,

The caret (^) represents the start of a line.

$ head robots.txt | sed 's/^Bunk/DISCONTINUED - Bunk/g'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
DISCONTINUED - Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
Two-drawer Nightstand,125.99,1,756
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Pine Round Table,375.99,1,757

You can perform case-insensitive tests with the I (insensitive) modifier.

$ head robots.txt | sed 's/BED/BED/Ig'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
Bunk BED,705.99,1,757
Queen BED,925.99,1,757
Two-drawer Nightstand,125.99,1,756
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Pine Round Table,375.99,1,757

sed supports POSIX character classes. To hide the prices, replace all the digits with underscores.

$ head robots.txt | sed 's/[[:digit:]]/_/g'
Birchwood China Hutch,___.__,_,___
Bookcase Oak Veneer,___.__,_,___
Small Bookcase Oak Veneer,___.__,_,___
Reclining Chair,____.__,_,___
Bunk Bed,___.__,_,___
Queen Bed,___.__,_,___
Two-drawer Nightstand,___.__,_,___
Cedar Toy Chest,__.__,_,___
Six-drawer Dresser,___.__,_,___
Pine Round Table,___.__,_,___

The d (delete) command deletes a matching line. You can delete blank lines with the pattern ^$ (that is, a blank line is the start of line, end of line, with nothing between).

$ head robots.txt | sed '/^$/d'
				

Without a pattern, you can delete particular lines by placing the line number before the d. For example, '1d' deletes the first line.

$ head robots.txt | sed '1d'
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
Two-drawer Nightstand,125.99,1,756
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Pine Round Table,375.99,1,757

A d by itself deletes all lines.

There are several line-oriented commands. The a (append) command inserts new text after a matching line. The i (insert) command inserts text before a matching line. The c (change) command replaces a group of lines.

To insert the title DISCOUNTED ITEMS: prior to Cedar Toy Chest, you do this

$ head robots.txt | sed '/Cedar Toy Chest/i\
DISCOUNTED ITEMS:'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
Two-drawer Nightstand,125.99,1,756
DISCOUNTED ITEMS:
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Pine Round Table,375.99,1,757

To replace Bunk Bed, Queen Bed, and Two-drawer Nightstand with an Items deleted message, you can use

$ head robots.txt | sed '/^Bunk Bed/,/^Two-drawer/c\
<Items deleted>'
Birchwood China Hutch,475.99,1,756
Bookcase Oak Veneer,205.99,1,756
Small Bookcase Oak Veneer,205.99,1,756
Reclining Chair,1599.99,1,757
<Items deleted>
Cedar Toy Chest,65.99,1,757
Six-drawer Dresser,525.99,1,757
Pine Round Table,375.99,1,757

You must follow the insert, append, and change commands by an escaped end of line.

The l (list) command is used to display unprintable characters. It displays characters as ASCII codes or backslash sequences.

$ printf "%s\015\t\004\n" "ABC" | sed -n "l"
ABC\r\t\004$

In this case, \015 (a carriage return) is displayed as \r, a \t Tab character is displayed as \t, and a \n line feed is displayed as a $ and a line feed. The character \004, which has no backslash equivalent, is displayed as \004. A, B, and C are displayed as themselves.

The y (transform) command is a specialized short form for the substitution command. It performs one-to-one character replacements. It is essentially equivalent to a group of single character substitutions.

For example, y/,/;/ is the same as s/,/;/g:

$ head robots.txt | sed 'y/,/;/'
Birchwood China Hutch;475.99;1;756
Bookcase Oak Veneer;205.99;1;756
Small Bookcase Oak Veneer;205.99;1;756
Reclining Chair;1599.99;1;757
Bunk Bed;705.99;1;757
Queen Bed;925.99;1;757
Two-drawer Nightstand;125.99;1;756
Cedar Toy Chest;65.99;1;757
Six-drawer Dresser;525.99;1;757
Pine Round Table;375.99;1;757

However, with patterns of more than one character, transform replaces any occurrence of the first character with the first character in the second pattern, the second character with the second character in the second pattern, and so on. This works like the tr command.

$ printf "%s\n" "order code B priority 1" | sed 'y/B1/C2/'
order code C priority 2

Lines unaffected by sed can be hidden with the --quiet (or -n or --silent) switch.

Like the transform command, there are other sed commands that mimic Linux commands. The p (print) command imitates the grep command by printing a matching line. This is useful only when the --quiet switch is used. The = (line number) command prints the line number of matching lines. The q (quit) command makes sed act like the head command, displaying lines until a certain line is encountered.

$ head robots.txt | sed --quiet '/Bed/p'
Bunk Bed,705.99,1,757
Queen Bed,925.99,1,757
$ head robots.txt | sed --quiet '/Bed/='
5
6

The remaining sed commands represent specialized actions. The n (next) command prints the current line (unless --quiet is used) and reads the next line of input. Files can be read with r or written with w. N (append next) joins the next line to the current one for matching purposes. D deletes the first line of such a multiple-line group, and P prints it. h, H, g, G, and x enable you to save lines to a temporary buffer so that you can make changes, display the results, and then restore the original text for further analysis. This works like an electronic calculator's memory. Complicated sed expressions can feature branches to labels embedded in the expressions using the b command. The t (test) command branches to a label only if a substitution has succeeded, which makes it possible to attempt a series of operations until one succeeds, much like a shell elif or switch statement. Subcommands can be grouped in sed with curly brackets. More documentation on these commands can be found using info sed.

Long sed scripts can be stored in a file. You can read the sed script from a file with the --file= (or -f) switch. You can include comments with a # character, like a shell script.
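
As an illustration, the sketch below stores three of the commands described earlier in a script file and applies them with -f. The script contents, file name, and sample input are made up for the example.

```shell
# Write a small sed script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
# Comments start with a hash mark.
s/Pine/Cedar/g
/^$/d
y/,/;/
EOF

# Apply the script: substitute, drop blank lines, then change commas to semicolons.
result=$(printf '%s\n' 'Pine Round Table,375.99' '' 'Queen Bed,925.99' | sed -f "$script")

printf '%s\n' "$result"
rm -f "$script"
```

The commands run in order on each line, so the blank line is deleted before the y command is ever reached for it.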

sed expressions can also be specified using the --expression= (or -e) switch, or can be read from standard input when a - filename is used.

Older versions of sed cannot use ASCII value escape sequences in patterns. GNU sed accepts \dNNN (decimal), \oNNN (octal), and \xHH (hexadecimal).

Compressing Files

Most Linux programs differentiate between archiving and compression. Archiving is the storage of a number of files into a single file. Compression is a reduction of file size by encoding the file. In general, an archive file takes up more space than the original files, so most archive files are also compressed.

The Linux bzip2 command compresses files with the Burrows-Wheeler transform followed by Huffman coding. It usually produces smaller files than gzip, at the cost of speed. gzip (GNU zip) compresses with the LZ77-based DEFLATE algorithm and remains very widely used. compress is an older Lempel-Ziv compression program available on most versions of Unix. zip is the Linux version of the DOS pkzip program. hexbin decompresses certain Macintosh archives.

The Linux tar (tape archive) command is the most commonly used archiving command, and it automatically compresses while archiving when the right command-line options are used. Although the command was originally used to collect files for storage on tape drives, it can also create disk files.

Originally, the tar command didn't use command-line switches: A series of single characters were used. The Linux version supports command-line switches as well as the older single character syntax for backward compatibility.

To use tar on files, the --file F (or -f F) switch indicates the filename to act on. At least one action switch must be specified to indicate what tar will do with the file. Remote files can be specified with a preceding hostname and a colon.

The --create (-c) switch creates a new tar file.

$ ls -l robots.txt
-rw-rw-r--    1 joeuser  joeuser       592 May 11 14:45 robots.txt
$ tar --create --file robots.tar robots.txt
$ ls -l robots.tar
-rw-rw-r--    1 joeuser  joeuser     10240 Oct  3 12:06 robots.tar

The archive file is significantly larger than the original file. To apply compression, choose the type of compression using --bzip2 (or -j), --gzip (or -z), --compress (or -Z), or --use-compress-program to specify a particular compression program. GNU tar lets long switches be abbreviated, so --bzip, used in these examples, also works.

$ tar --create --file robots.tbz --bzip robots.txt
$ ls -l robots.tbz
-rw-rw-r--    1 joeuser  joeuser       421 Oct  3 12:12 robots.tbz
$ tar --create --file robots.tgz --gzip robots.txt
$ ls -l robots.tgz
-rw-rw-r--    1 joeuser  joeuser       430 Oct  3 12:11 robots.tgz

More than one file can be archived at once.

$ tar --create --file robots.tbz --bzip robots.txt robots2.txt
$ ls -l robots.tbz
-rw-rw-r--    1 joeuser  joeuser       502 Oct  3 12:14 robots.tbz

The new archive overwrites an existing one.

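To restore the original files, use the --extract switch. Use --verbose to see the filenames. Older versions of tar cannot auto-detect the compression format, so you must specify the proper compression switch to avoid an error such as the one shown here. Current versions of GNU tar recognize the compression format automatically when extracting from a file.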

$ tar --extract --file robots.tbz
tar: 502 garbage bytes ignored at end of archive
tar: Error exit delayed from previous errors
$ tar --extract --bzip --file robots.tbz
$ tar --extract --verbose --bzip --file robots.tbz
robots.txt
robots2.txt

The --extract switch also restores any subdirectories in the pathname of the file. It's important to extract the files in the same directory where they were originally compressed to ensure they are restored to their proper places.

The tar command can also append files to an archive with --append (or -r), append the contents of other archives with --concatenate (or -A), compare an archive against the files on disk with --compare (or --diff or -d), remove files from the archive with --delete, list the contents with --list (or -t), and add files that are newer than the copies in the archive with --update (or -u). tar silently performs these functions unless --verbose is used.
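
A few of these actions can be tried in a scratch directory. The sketch below uses the short switches and hypothetical file names; -C, which changes directory before the file names are read or written, is assumed to be available as it is in GNU tar.

```shell
# Two hypothetical files in a scratch directory.
dir=$(mktemp -d)
printf 'one\n' > "$dir/a.txt"
printf 'two\n' > "$dir/b.txt"

# Create an uncompressed archive, then append a second file to it.
tar -c -f "$dir/demo.tar" -C "$dir" a.txt
tar -r -f "$dir/demo.tar" -C "$dir" b.txt

# List the contents.
listing=$(tar -t -f "$dir/demo.tar")

# Create a gzip-compressed archive and extract it into another directory.
tar -c -z -f "$dir/demo.tgz" -C "$dir" a.txt b.txt
mkdir "$dir/out"
tar -x -z -f "$dir/demo.tgz" -C "$dir/out"
restored=$(cat "$dir/out/b.txt")

printf '%s\n' "$listing"
rm -rf "$dir"
```

Appending works only on uncompressed archives, which is why the .tar file is used for that step.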

A complete list of tar switches appears in the reference section.

Another archiving program, cpio (copy in/out), is provided for compatibility with other flavors of Unix. The rpm package manager stores the files of a package as a cpio archive.


Old News ;-)

[Feb 21, 2019] How to prompt and read user input in a Bash shell script

Feb 21, 2019 | alvinalexander.com

By Alvin Alexander. Last updated: June 22 2017

Unix/Linux bash shell script FAQ: How do I prompt a user for input from a shell script (Bash shell script), and then read the input the user provides?

Answer: I usually use the shell script read function to read input from a shell script. Here are two slightly different versions of the same shell script. This first version prompts the user for input only once, and then dies if the user doesn't give a correct Y/N answer:

# (1) prompt user, and read command line argument
read -p "Run the cron script now? " answer

# (2) handle the command line argument we were given
while true
do
  case $answer in
   [yY]* ) /usr/bin/wget -O - -q -t 1 http://www.example.com/cron.php
           echo "Okay, just ran the cron script."
           break;;

   [nN]* ) exit;;

   * )     echo "Dude, just enter Y or N, please."; break ;;
  esac
done

This second version stays in a loop until the user supplies a Y/N answer:

while true
do
  # (1) prompt user, and read command line argument
  read -p "Run the cron script now? " answer

  # (2) handle the input we were given
  case $answer in
   [yY]* ) /usr/bin/wget -O - -q -t 1 http://www.example.com/cron.php
           echo "Okay, just ran the cron script."
           break;;

   [nN]* ) exit;;

   * )     echo "Dude, just enter Y or N, please.";;
  esac
done

I prefer the second approach, but I thought I'd share both of them here. They are subtly different, so note the extra break in the first script.

The Bash read builtin is nice because it does both things: it prompts the user for input and then reads the input. The other nice thing it does is leave the cursor at the end of your prompt, as shown here:

Run the cron script now? _

(This is so much nicer than what I had to do years ago.)
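The looping version can also be wrapped in a reusable function. The following is only a sketch: the name ask_yes_no and its return codes are assumptions made for illustration, not part of the original article.

```shell
#!/bin/bash
# ask_yes_no PROMPT -- hypothetical helper, not from the original article.
# Returns 0 for yes, 1 for no, 2 if input ends before a valid answer.
ask_yes_no() {
  local answer
  while true; do
    # -r keeps backslashes literal; -p shows the prompt when reading a terminal
    read -r -p "$1 [y/n] " answer || return 2
    case $answer in
      [yY]*) return 0 ;;
      [nN]*) return 1 ;;
      *)     echo "Please answer y or n." >&2 ;;
    esac
  done
}

# Usage: only act on an explicit yes.
# if ask_yes_no "Run the cron script now?"; then echo "running"; fi
```

Returning a status instead of calling exit lets the caller decide what a "no" means, and the end-of-input check keeps the loop from spinning forever when standard input is closed.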

[Jan 14, 2018] Linux Filesystem Events with inotify by Charles Fisher

Notable quotes:
"... Lukas Jelinek is the author of the incron package that allows users to specify tables of inotify events that are executed by the master incrond process. Despite the reference to "cron", the package does not schedule events at regular intervals -- it is a tool for filesystem events, and the cron reference is slightly misleading. ..."
"... The incron package is available from EPEL ..."
Jan 08, 2018 | www.linuxjournal.com

Triggering scripts with incron and systemd.

It is, at times, important to know when things change in the Linux OS. The uses to which systems are placed often include high-priority data that must be processed as soon as it is seen. The conventional method of finding and processing new file data is to poll for it, usually with cron. This is inefficient, and it can tax performance unreasonably if too many polling events are forked too often.

Linux has an efficient method for alerting user-space processes to changes impacting files of interest. The inotify Linux system calls were first discussed here in Linux Journal in a 2005 article by Robert Love, who primarily addressed the behavior of the new features from the perspective of C.

However, there also are stable shell-level utilities and new classes of monitoring dæmons for registering filesystem watches and reporting events. Linux installations using systemd also can access basic inotify functionality with path units. The inotify interface does have limitations -- it can't monitor remote, network-mounted filesystems (that is, NFS); it does not report the userid involved in the event; it does not work with /proc or other pseudo-filesystems; and mmap() operations do not trigger it, among other concerns. Even with these limitations, it is a tremendously useful feature.

This article completes the work begun by Love and gives everyone who can write a Bourne shell script or set a crontab the ability to react to filesystem changes.

The inotifywait Utility

Working under Oracle Linux 7 (or similar versions of Red Hat/CentOS/Scientific Linux), the inotify shell tools are not installed by default, but you can load them with yum:

 # yum install inotify-tools
Loaded plugins: langpacks, ulninfo
ol7_UEKR4                                      | 1.2 kB   00:00
ol7_latest                                     | 1.4 kB   00:00
Resolving Dependencies
--> Running transaction check
---> Package inotify-tools.x86_64 0:3.14-8.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================
Package         Arch       Version        Repository     Size
==============================================================
Installing:
inotify-tools   x86_64     3.14-8.el7     ol7_latest     50 k

Transaction Summary
==============================================================
Install  1 Package

Total download size: 50 k
Installed size: 111 k
Is this ok [y/d/N]: y
Downloading packages:
inotify-tools-3.14-8.el7.x86_64.rpm               |  50 kB   00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Warning: RPMDB altered outside of yum.
  Installing : inotify-tools-3.14-8.el7.x86_64                 1/1
  Verifying  : inotify-tools-3.14-8.el7.x86_64                 1/1

Installed:
  inotify-tools.x86_64 0:3.14-8.el7

Complete!

The package will include two utilities (inotifywait and inotifywatch), documentation and a number of libraries. The inotifywait program is of primary interest.

Some derivatives of Red Hat 7 may not include inotify-tools in their base repositories. If you find it missing, you can obtain it from Fedora's EPEL repository, either by downloading the inotify-tools RPM for manual installation or by adding the EPEL repository to yum.

Any user on the system who can launch a shell may register watches -- no special privileges are required to use the interface. This example watches the /tmp directory:

$ inotifywait -m /tmp
Setting up watches.
Watches established.

If another session on the system performs a few operations on the files in /tmp:

$ touch /tmp/hello
$ cp /etc/passwd /tmp
$ rm /tmp/passwd
$ touch /tmp/goodbye
$ rm /tmp/hello /tmp/goodbye

those changes are immediately visible to the user running inotifywait:

/tmp/ CREATE hello
/tmp/ OPEN hello
/tmp/ ATTRIB hello
/tmp/ CLOSE_WRITE,CLOSE hello
/tmp/ CREATE passwd
/tmp/ OPEN passwd
/tmp/ MODIFY passwd
/tmp/ CLOSE_WRITE,CLOSE passwd
/tmp/ DELETE passwd
/tmp/ CREATE goodbye
/tmp/ OPEN goodbye
/tmp/ ATTRIB goodbye
/tmp/ CLOSE_WRITE,CLOSE goodbye
/tmp/ DELETE hello
/tmp/ DELETE goodbye

A few relevant sections of the manual page explain what is happening:

$ man inotifywait | col -b | sed -n '/diagnostic/,/helpful/p'
  inotifywait will output diagnostic information on standard error and
  event information on standard output. The event output can be config-
  ured, but by default it consists of lines of the following form:

  watched_filename EVENT_NAMES event_filename


  watched_filename
    is the name of the file on which the event occurred. If the
    file is a directory, a trailing slash is output.

  EVENT_NAMES
    are the names of the inotify events which occurred, separated by
    commas.

  event_filename
    is output only when the event occurred on a directory, and in
    this case the name of the file within the directory which caused
    this event is output.

    By default, any special characters in filenames are not escaped
    in any way. This can make the output of inotifywait difficult
    to parse in awk scripts or similar. The --csv and --format
    options will be helpful in this case.
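To make the parsing concern concrete, here is a sketch of a reader for output produced with the --format option. The '|' delimiter and the function name classify_events are choices made for this example; the inotifywait invocation appears only in a comment so that the block runs without inotify-tools installed.

```shell
#!/bin/bash
# classify_events reads "dir|events|file" lines on stdin, one per event.
# The '|' delimiter and the function name are choices made for this sketch.
classify_events() {
  local dir events file
  while IFS='|' read -r dir events file; do
    case $events in
      *CLOSE_WRITE*) printf 'ready: %s%s\n' "$dir" "$file" ;;
      *)             printf 'ignored: %s%s (%s)\n' "$dir" "$file" "$events" ;;
    esac
  done
}

# Intended wiring (needs inotify-tools; not run here):
#   inotifywait -m -e close_write --format '%w|%e|%f' /tmp | classify_events

# Simulated input, including a filename with a space in it:
printf '%s\n' '/tmp/|CLOSE_WRITE,CLOSE|my report.txt' '/tmp/|OPEN|x' | classify_events
```

With an explicit delimiter, a filename containing spaces stays in one field. A watched directory whose own name contains '|' would still confuse this reader, so pick a delimiter that cannot occur in your paths.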

It also is possible to filter the output by registering particular events of interest with the -e option, the list of which is shown here:

access create move_self
attrib delete moved_to
close_write delete_self moved_from
close_nowrite modify open
close move unmount

A common application is testing for the arrival of new files. Since inotify must be given the name of an existing filesystem object to watch, the directory containing the new files is provided. A trigger of interest is also easy to provide -- new files should be complete and ready for processing when the close_write trigger fires. Below is an example script to watch for these events:

#!/bin/sh
unset IFS                                 # default of space, tab and nl
                                          # Wait for filesystem events
inotifywait -m -e close_write \
   /tmp /var/tmp /home/oracle/arch-orcl/ |
while read dir op file
do [[ "${dir}" == '/tmp/' && "${file}" == *.txt ]] &&
      echo "Import job should start on $file ($dir $op)."

   [[ "${dir}" == '/var/tmp/' && "${file}" == CLOSE_WEEK*.txt ]] &&
      echo Weekly backup is ready.

   [[ "${dir}" == '/home/oracle/arch-orcl/' && "${file}" == *.ARC ]] &&
      su - oracle -c 'ORACLE_SID=orcl ~oracle/bin/log_shipper' &

   [[ "${dir}" == '/tmp/' && "${file}" == SHUT ]] && break

   ((step+=1))
done

echo We processed $step events.

There are a few problems with the script as presented. Of the shells compared in this article, only ksh93 (that is, the AT&T Korn shell) will report the "step" variable correctly at the end of the script. The others will report this variable as null.

The reason for this behavior can be found in a brief explanation on the manual page for Bash: "Each command in a pipeline is executed as a separate process (i.e., in a subshell)." The MirBSD clone of the Korn shell has a slightly longer explanation:

# man mksh | col -b | sed -n '/The parts/,/do so/p'
  The parts of a pipeline, like below, are executed in subshells. Thus,
  variable assignments inside them fail. Use co-processes instead.

  foo | bar | read baz          # will not change $baz
  foo | bar |& read -p baz      # will, however, do so

And, the pdksh documentation in Oracle Linux 5 (from which MirBSD mksh emerged) has several more mentions of the subject:

General features of at&t ksh88 that are not (yet) in pdksh:
  - the last command of a pipeline is not run in the parent shell
  - `echo foo | read bar; echo $bar' prints foo in at&t ksh, nothing
    in pdksh (ie, the read is done in a separate process in pdksh).
  - in pdksh, if the last command of a pipeline is a shell builtin, it
    is not executed in the parent shell, so "echo a b | read foo bar"
    does not set foo and bar in the parent shell (at&t ksh will).
    This may get fixed in the future, but it may take a while.

$ man pdksh | col -b | sed -n '/BTW, the/,/aware/p'
  BTW, the most frequently reported bug is
    echo hi | read a; echo $a   # Does not print hi
  I'm aware of this and there is no need to report it.

This behavior is easy enough to demonstrate -- running the script above with the default bash shell and providing a sequence of example events:

$ cp /etc/passwd /tmp/newdata.txt
$ cp /etc/group /var/tmp/CLOSE_WEEK20170407.txt
$ cp /etc/passwd /tmp/SHUT

gives the following script output:

# ./inotify.sh
Setting up watches.
Watches established.
Import job should start on newdata.txt (/tmp/ CLOSE_WRITE,CLOSE).
Weekly backup is ready.
We processed events.

Examining the process list while the script is running, you'll also see two shells, one forked for the control structure:

$ function pps { typeset a IFS=\| ; ps ax | while read a
do case $a in *$1*|+([!0-9])) echo $a;; esac; done }


$ pps inot
  PID TTY      STAT   TIME COMMAND
 3394 pts/1    S+     0:00 /bin/sh ./inotify.sh
 3395 pts/1    S+     0:00 inotifywait -m -e close_write /tmp /var/tmp
 3396 pts/1    S+     0:00 /bin/sh ./inotify.sh

As it was manipulated in a subshell, the "step" variable above was null when control flow reached the echo. Switching the first line from #!/bin/sh to #!/bin/ksh93 will correct the problem, and only one shell process will be seen:

# ./inotify.ksh93
Setting up watches.
Watches established.
Import job should start on newdata.txt (/tmp/ CLOSE_WRITE,CLOSE).
Weekly backup is ready.
We processed 2 events.


$ pps inot
  PID TTY      STAT   TIME COMMAND
 3583 pts/1    S+     0:00 /bin/ksh93 ./inotify.sh
 3584 pts/1    S+     0:00 inotifywait -m -e close_write /tmp /var/tmp

Although ksh93 behaves properly and in general handles scripts far more gracefully than all of the other Linux shells, it is rather large:

$ ll /bin/[bkm]+([aksh93]) /etc/alternatives/ksh
-rwxr-xr-x. 1 root root  960456 Dec  6 11:11 /bin/bash
lrwxrwxrwx. 1 root root      21 Apr  3 21:01 /bin/ksh ->
                                               /etc/alternatives/ksh
-rwxr-xr-x. 1 root root 1518944 Aug 31  2016 /bin/ksh93
-rwxr-xr-x. 1 root root  296208 May  3  2014 /bin/mksh
lrwxrwxrwx. 1 root root      10 Apr  3 21:01 /etc/alternatives/ksh ->
                                                    /bin/ksh93

The mksh binary is the smallest of the Bourne implementations above (some of these shells may be missing on your system, but you can install them with yum). For a long-term monitoring process, mksh is likely the best choice for reducing both processing and memory footprint, and it does not launch multiple copies of itself when idle assuming that a coprocess is used. Converting the script to use a Korn coprocess that is friendly to mksh is not difficult:

#!/bin/mksh
unset IFS                              # default of space, tab and nl
                                       # Wait for filesystem events
inotifywait -m -e close_write \
   /tmp/ /var/tmp/ /home/oracle/arch-orcl/ \
   2>/dev/null |&                      # Launch as Korn coprocess

while read -p dir op file              # Read from Korn coprocess
do [[ "${dir}" == '/tmp/' && "${file}" == *.txt ]] &&
      print "Import job should start on $file ($dir $op)."

   [[ "${dir}" == '/var/tmp/' && "${file}" == CLOSE_WEEK*.txt ]] &&
      print Weekly backup is ready.

   [[ "${dir}" == '/home/oracle/arch-orcl/' && "${file}" == *.ARC ]] &&
      su - oracle -c 'ORACLE_SID=orcl ~oracle/bin/log_shipper' &

   [[ "${dir}" == '/tmp/' && "${file}" == SHUT ]] && break

   ((step+=1))
done

echo We processed $step events.

Note that the Korn and Bolsky reference on the Korn shell outlines the following requirements in a program operating as a coprocess:

Caution: The co-process must:

An fflush(NULL) is found in the main processing loop of the inotifywait source, and these requirements appear to be met.

The mksh version of the script is the most reasonable compromise for efficient use and correct behavior, and I have explained it at some length here to save readers trouble and frustration -- it is important to avoid control structures executing in subshells in most of the Bourne family. Hopefully, all of these ersatz shells will someday fix this basic flaw and implement the Korn behavior correctly.
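For readers who stay with bash, one commonly used way around the subshell problem is to feed the loop through process substitution instead of a pipe. This is a sketch rather than the article's method: produce_events is a made-up stand-in for the inotifywait command so that the example runs anywhere.

```shell
#!/bin/bash
# produce_events stands in for "inotifywait -m ..." so the sketch runs anywhere.
produce_events() {
  printf '%s\n' '/tmp/ CLOSE_WRITE,CLOSE a.txt' \
                '/tmp/ CLOSE_WRITE,CLOSE b.txt' \
                '/tmp/ CLOSE_WRITE,CLOSE SHUT'
}

# Pipeline form: in bash's default configuration the loop runs in a subshell,
# so the increment is lost when the loop ends.
piped=0
produce_events | while read -r dir op file; do ((piped+=1)); done

# Process substitution form: the loop runs in the current shell.
step=0
while read -r dir op file; do
  [[ $file == SHUT ]] && break
  ((step+=1))
done < <(produce_events)

echo "pipeline count: $piped; process substitution count: $step"
```

The producer still runs as a separate process, but the loop and its variables stay in the parent shell, so the counter survives.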

A Practical Application -- Oracle Log Shipping

Oracle databases that are configured for hot backups produce a stream of "archived redo log files" that are used for database recovery. These are the most critical backup files that are produced in an Oracle database.

These files are numbered sequentially and are written to a log directory configured by the DBA. An inotify watch can trigger activities to compress, encrypt and/or distribute the archived logs to backup and disaster recovery servers for safekeeping. You can configure Oracle RMAN to do most of these functions, but the OS tools are more capable, flexible and simpler to use.

There are a number of important design parameters for a script handling archived logs:

Given these design parameters, this is an implementation:

# cat ~oracle/archutils/process_logs

#!/bin/ksh93

set -euo pipefail
IFS=$'\n\t'  # http://redsymbol.net/articles/unofficial-bash-strict-mode/

(
 flock -n 9 || exit 1          # Critical section-allow only one process.

 ARCHDIR=~oracle/arch-${ORACLE_SID}

 APREFIX=${ORACLE_SID}_1_

 ASUFFIX=.ARC

 CURLOG=$(<~oracle/.curlog-$ORACLE_SID)

 File="${ARCHDIR}/${APREFIX}${CURLOG}${ASUFFIX}"

 [[ ! -f "$File" ]] && exit

 while [[ -f "$File" ]]
 do ((NEXTCURLOG=CURLOG+1))

    NextFile="${ARCHDIR}/${APREFIX}${NEXTCURLOG}${ASUFFIX}"

    [[ ! -f "$NextFile" ]] && sleep 60  # Ensure ARCH has finished

    nice /usr/local/bin/lzip -9q "$File"

    until scp "${File}.lz" "yourcompany.com:~oracle/arch-$ORACLE_SID"
    do sleep 5
    done

    CURLOG=$NEXTCURLOG

    File="$NextFile"
 done

 echo $CURLOG > ~oracle/.curlog-$ORACLE_SID

) 9>~oracle/.processing_logs-$ORACLE_SID

The above script can be executed manually for testing even while the inotify handler is running, as the flock protects it.
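The locking pattern can be tried in isolation. The sketch below assumes the flock utility from util-linux is installed; LOCKFILE, critical_section and try_nested are placeholder names invented for the demonstration.

```shell
#!/bin/bash
# LOCKFILE and the function names are placeholders for this sketch.
LOCKFILE=$(mktemp)

critical_section() {
  (
    flock -n 9 || exit 1        # give up at once if the lock is already held
    "$@"                        # run the protected command while holding fd 9
  ) 9>"$LOCKFILE"
}

# A second attempt made while the lock is held fails without blocking.
try_nested() { critical_section true && echo "got lock" || echo "busy"; }

critical_section try_nested     # inner attempt is refused
critical_section echo "done"    # lock was released when the subshell exited
```

The lock belongs to the open file descriptor, so it is dropped automatically when the subshell exits, even if the protected command fails.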

A standby server, or a DataGuard server in primitive standby mode, can apply the archived logs at regular intervals. The script below forces a 12-hour delay in log application for the recovery of dropped or damaged objects, so inotify cannot be easily used in this case -- cron is a more reasonable approach for delayed file processing, and a run every 20 minutes will keep the standby at the desired recovery point:

# cat ~oracle/archutils/delay-lock.sh

#!/bin/ksh93

(
 flock -n 9 || exit 1              # Critical section-only one process.

 WINDOW=43200                      # 12 hours

 LOG_DEST=~oracle/arch-$ORACLE_SID

 OLDLOG_DEST=$LOG_DEST-applied

 function fage { print $(( $(date +%s) - $(stat -c %Y "$1") ))
  } # File age in seconds - Requires GNU extended date & stat

 cd $LOG_DEST

 of=$(ls -t | tail -1)             # Oldest file in directory

 [[ -z "$of" || $(fage "$of") -lt $WINDOW ]] && exit

 for x in $(ls -rt)                    # Order by ascending file mtime
 do if [[ $(fage "$x") -ge $WINDOW ]]
    then y=$(basename $x .lz)          # lzip compression is optional

         [[ "$y" != "$x" ]] && /usr/local/bin/lzip -dkq "$x"

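         # Note: <<- strips leading tabs only, so the here-document body and
         # its EOF terminator must be indented with tab characters.
         $ORACLE_HOME/bin/sqlplus '/ as sysdba' > /dev/null 2>&1 <<-EOF
		recover standby database;
		$LOG_DEST/$y
		cancel
		quit
		EOF

         [[ "$y" != "$x" ]] && rm "$y"

         mv "$x" $OLDLOG_DEST
    fi
 done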
) 9> ~oracle/.recovering-$ORACLE_SID

I've covered these specific examples here because they introduce tools to control concurrency, which is a common issue when using inotify, and they advance a few features that increase reliability and minimize storage requirements. Hopefully enthusiastic readers will introduce many improvements to these approaches.
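The age test that gates log application can be exercised on its own. This sketch restates the idea in bash and, like the script above, assumes GNU date, stat and touch; the function names are illustrative.

```shell
#!/bin/bash
# file_age_seconds FILE -- seconds since FILE was last modified (GNU stat/date).
file_age_seconds() {
  echo $(( $(date +%s) - $(stat -c %Y "$1") ))
}

# older_than FILE SECONDS -- succeed if FILE is at least SECONDS old.
older_than() {
  (( $(file_age_seconds "$1") >= $2 ))
}

f=$(mktemp)
touch -d '13 hours ago' "$f"      # GNU touch: backdate the modification time
older_than "$f" 43200 && echo "apply log" || echo "too fresh"
```

Keeping the comparison in a function that returns a status makes the 12-hour window easy to change and to test with backdated files.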

The incron System

Lukas Jelinek is the author of the incron package that allows users to specify tables of inotify events that are executed by the master incrond process. Despite the reference to "cron", the package does not schedule events at regular intervals -- it is a tool for filesystem events, and the cron reference is slightly misleading.

The incron package is available from EPEL. If you have installed the repository, you can load it with yum:

# yum install incron
Loaded plugins: langpacks, ulninfo
Resolving Dependencies
--> Running transaction check
---> Package incron.x86_64 0:0.5.10-8.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=================================================================
 Package       Arch       Version           Repository    Size
=================================================================
Installing:
 incron        x86_64     0.5.10-8.el7      epel          92 k

Transaction Summary
==================================================================
Install  1 Package

Total download size: 92 k
Installed size: 249 k
Is this ok [y/d/N]: y
Downloading packages:
incron-0.5.10-8.el7.x86_64.rpm                      |  92 kB   00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : incron-0.5.10-8.el7.x86_64                          1/1
  Verifying  : incron-0.5.10-8.el7.x86_64                          1/1

Installed:
  incron.x86_64 0:0.5.10-8.el7

Complete!

On a systemd distribution with the appropriate service units, you can start and enable incron at boot with the following commands:

# systemctl start incrond
# systemctl enable incrond
Created symlink from
   /etc/systemd/system/multi-user.target.wants/incrond.service
to /usr/lib/systemd/system/incrond.service.

In the default configuration, any user can establish incron schedules. The incrontab format uses three fields:

<path> <mask> <command>

Below is an example entry that was set with the -e option:

$ incrontab -e        #vi session follows

$ incrontab -l
/tmp/ IN_ALL_EVENTS /home/luser/myincron.sh $@ $% $#

You can record a simple script and mark it with execute permission:

$ cat myincron.sh
#!/bin/sh

echo -e "path: $1 op: $2 \t file: $3" >> ~/op

$ chmod 755 myincron.sh

Then, if you repeat the original /tmp file manipulations at the start of this article, the script will record the following output:

$ cat ~/op

path: /tmp/ op: IN_ATTRIB        file: hello
path: /tmp/ op: IN_CREATE        file: hello
path: /tmp/ op: IN_OPEN          file: hello
path: /tmp/ op: IN_CLOSE_WRITE   file: hello
path: /tmp/ op: IN_OPEN          file: passwd
path: /tmp/ op: IN_CLOSE_WRITE   file: passwd
path: /tmp/ op: IN_MODIFY        file: passwd
path: /tmp/ op: IN_CREATE        file: passwd
path: /tmp/ op: IN_DELETE        file: passwd
path: /tmp/ op: IN_CREATE        file: goodbye
path: /tmp/ op: IN_ATTRIB        file: goodbye
path: /tmp/ op: IN_OPEN          file: goodbye
path: /tmp/ op: IN_CLOSE_WRITE   file: goodbye
path: /tmp/ op: IN_DELETE        file: hello
path: /tmp/ op: IN_DELETE        file: goodbye

While the IN_CLOSE_WRITE event on a directory object is usually of greatest interest, most of the standard inotify events are available within incron, which also offers several unique amalgams:

$ man 5 incrontab | col -b | sed -n '/EVENT SYMBOLS/,/child process/p'

EVENT SYMBOLS

These basic event mask symbols are defined:

IN_ACCESS          File was accessed (read) (*)
IN_ATTRIB          Metadata changed (permissions, timestamps, extended
                   attributes, etc.) (*)
IN_CLOSE_WRITE     File opened for writing was closed (*)
IN_CLOSE_NOWRITE   File not opened for writing was closed (*)
IN_CREATE          File/directory created in watched directory (*)
IN_DELETE          File/directory deleted from watched directory (*)
IN_DELETE_SELF     Watched file/directory was itself deleted
IN_MODIFY          File was modified (*)
IN_MOVE_SELF       Watched file/directory was itself moved
IN_MOVED_FROM      File moved out of watched directory (*)
IN_MOVED_TO        File moved into watched directory (*)
IN_OPEN            File was opened (*)

When monitoring a directory, the events marked with an asterisk (*)
above can occur for files in the directory, in which case the name
field in the returned event data identifies the name of the file within
the directory.

The IN_ALL_EVENTS symbol is defined as a bit mask of all of the above
events. Two additional convenience symbols are IN_MOVE, which is a com-
bination of IN_MOVED_FROM and IN_MOVED_TO, and IN_CLOSE, which combines
IN_CLOSE_WRITE and IN_CLOSE_NOWRITE.

The following further symbols can be specified in the mask:

IN_DONT_FOLLOW     Don't dereference pathname if it is a symbolic link
IN_ONESHOT         Monitor pathname for only one event
IN_ONLYDIR         Only watch pathname if it is a directory

Additionally, there is a symbol which doesn't appear in the inotify sym-
bol set. It is IN_NO_LOOP. This symbol disables monitoring events until
the current one is completely handled (until its child process exits).

The incron system likely presents the most comprehensive interface to inotify of all the tools researched and listed here. Additional configuration options can be set in /etc/incron.conf to tweak incron's behavior for those that require a non-standard configuration.
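A handler invoked by incron usually wants to act on only some of the events it is handed. The sketch below takes its arguments in the same order as the "$@ $% $#" incrontab entry shown earlier (watched path, event names, file name); the function name and messages are assumptions for illustration.

```shell
#!/bin/bash
# handle_incron_event PATH EVENTS FILE
# Mirrors the "$@ $% $#" argument order of the incrontab entry shown earlier.
# The function name and the messages are illustrative assumptions.
handle_incron_event() {
  local path=$1 events=$2 file=$3
  case ",$events," in
    *,IN_CLOSE_WRITE,*) echo "process ${path%/}/$file" ;;
    *)                  echo "skip $file ($events)" ;;
  esac
}

handle_incron_event /tmp/ IN_CLOSE_WRITE hello
handle_incron_event /tmp/ IN_OPEN hello
```

Wrapping the event list in commas before matching keeps IN_CLOSE_WRITE from being confused with IN_CLOSE_NOWRITE or any other symbol that merely shares a prefix.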

Path Units under systemd

When your Linux installation is running systemd as PID 1, limited inotify functionality is available through "path units" as is discussed in a lighthearted article by Paul Brown at OCS-Mag.

The relevant manual page has useful information on the subject:

$ man systemd.path | col -b | sed -n '/Internally,/,/systems./p'

Internally, path units use the inotify(7) API to monitor file systems.
Due to that, it suffers by the same limitations as inotify, and for
example cannot be used to monitor files or directories changed by other
machines on remote NFS file systems.

Note that when a systemd path unit spawns a shell script, the $HOME variable and the tilde (~) operator for the owner's home directory may not be defined. Using the tilde operator to reference another user's home directory (for example, ~nobody/) does work, even when applied to the self-same user running the script. The Oracle script above was explicit and did not reference ~ without specifying the target user, so I'm using it as an example here.

Using inotify triggers with systemd path units requires two files. The first file specifies the filesystem location of interest:

$ cat /etc/systemd/system/oralog.path

[Unit]
Description=Oracle Archivelog Monitoring
Documentation=http://docs.yourserver.com

[Path]
PathChanged=/home/oracle/arch-orcl/

[Install]
WantedBy=multi-user.target

The PathChanged parameter above roughly corresponds to the close-write event used in my previous direct inotify calls. The full collection of inotify events is not (currently) supported by systemd -- it is limited to PathExists, PathExistsGlob, PathChanged, PathModified and DirectoryNotEmpty, which are described in man systemd.path.

The second file is a service unit describing a program to be executed. It must have the same name, but a different extension, as the path unit:

$ cat /etc/systemd/system/oralog.service

[Unit]
Description=Oracle Archivelog Monitoring
Documentation=http://docs.yourserver.com

[Service]
Type=oneshot
Environment=ORACLE_SID=orcl
ExecStart=/bin/sh -c '/root/process_logs >> /tmp/plog.txt 2>&1'

The oneshot parameter above alerts systemd that the program that it forks is expected to exit and should not be respawned automatically -- the restarts are limited to triggers from the path unit. The above service configuration will provide the best options for logging -- divert them to /dev/null if they are not needed.

Use systemctl start on the path unit to begin monitoring -- a common error is using it on the service unit, which will directly run the handler only once. Enable the path unit if the monitoring should survive a reboot.

Although this limited functionality may be enough for some casual uses of inotify, it is a shame that the full functionality of inotifywait and incron is not represented here. Perhaps it will come in time.

Conclusion

Although the inotify tools are powerful, they do have limitations. To repeat them, inotify cannot monitor remote (NFS) filesystems; it cannot report the userid involved in a triggering event; it does not work with /proc or other pseudo-filesystems; mmap() operations do not trigger it; and the inotify queue can overflow resulting in lost events, among other concerns.

Even with these weaknesses, the efficiency of inotify is superior to most other approaches for immediate notifications of filesystem activity. It also is quite flexible, and although the close-write directory trigger should suffice for most usage, it has ample tools for covering special use cases.

In any event, it is productive to replace polling activity with inotify watches, and system administrators should be liberal in educating the user community that the classic crontab is not an appropriate place to check for new files. Recalcitrant users should be confined to Ultrix on a VAX until they develop sufficient appreciation for modern tools and approaches, which should result in more efficient Linux systems and happier administrators.

Sidenote: Archiving /etc/passwd

Tracking changes to the password file involves many different types of inotify triggering events. The vipw utility commonly will make changes to a temporary file, then clobber the original with it. This can be seen when the inode number changes:

# ll -i /etc/passwd
199720973 -rw-r--r-- 1 root root 3928 Jul  7 12:24 /etc/passwd

# vipw
[ make changes ]
You are using shadow passwords on this system.
Would you like to edit /etc/shadow now [y/n]? n

# ll -i /etc/passwd
203784208 -rw-r--r-- 1 root root 3956 Jul  7 12:24 /etc/passwd

The destruction and replacement of /etc/passwd even occurs with setuid binaries called by unprivileged users:

$ ll -i /etc/passwd
203784196 -rw-r--r-- 1 root root 3928 Jun 29 14:55 /etc/passwd

$ chsh
Changing shell for fishecj.
Password:
New shell [/bin/bash]: /bin/csh
Shell changed.

$ ll -i /etc/passwd
199720970 -rw-r--r-- 1 root root 3927 Jul  7 12:23 /etc/passwd

For this reason, all inotify triggering events should be considered when tracking this file. If there is concern with an inotify queue overflow (in which events are lost), then the OPEN, ACCESS and CLOSE_NOWRITE,CLOSE triggers likely can be immediately ignored.

All other inotify events on /etc/passwd might run the following script to version the changes into an RCS archive and mail them to an administrator:

#!/bin/sh

# This script tracks changes to the /etc/passwd file from inotify.
# Uses RCS for archiving. Watch for UID zero.

PWMAILS=Charlie.Root@openbsd.org

TPDIR=~/track_passwd

cd $TPDIR

if diff -q /etc/passwd $TPDIR/passwd
then exit                                         # they are the same
else sleep 5                                      # let passwd settle
     diff /etc/passwd $TPDIR/passwd 2>&1 |        # they are DIFFERENT
     mail -s "/etc/passwd changes $(hostname -s)" "$PWMAILS"
     cp -f /etc/passwd $TPDIR                     # copy for checkin

#    "SCCS, the source motel! Programs check in and never check out!"
#     -- Ken Thompson

     rcs -q -l passwd                            # lock the archive
     ci -q -m_ passwd                            # check in new ver
     co -q passwd                                # drop the new copy
fi > /dev/null 2>&1

Here is an example email from the script for the above chsh operation:

-----Original Message-----
From: root [mailto:root@myhost.com]
Sent: Thursday, July 06, 2017 2:35 PM
To: Fisher, Charles J. <Charles.Fisher@myhost.com>;
Subject: /etc/passwd changes myhost

57c57
< fishecj:x:123:456:Fisher, Charles J.:/home/fishecj:/bin/bash
---
> fishecj:x:123:456:Fisher, Charles J.:/home/fishecj:/bin/csh

Further processing on the third column of /etc/passwd might detect UID zero (a root user) or other important user classes for emergency action. This might include a rollback of the file from RCS to /etc and/or SMS messages to security contacts.
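The UID check on the third column can be sketched with awk. The function name and the sample file contents below are made up for the example; in practice the argument would be /etc/passwd or the tracked copy.

```shell
#!/bin/bash
# uid_zero_accounts FILE -- print login names whose UID field (column 3) is 0.
# The function name and the sample file are made up for this sketch.
uid_zero_accounts() {
  awk -F: '$3 == 0 { print $1 }' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
fishecj:x:123:456:Fisher, Charles J.:/home/fishecj:/bin/csh
toor:x:0:0:backdoor:/root:/bin/sh
EOF

uid_zero_accounts "$sample"
```

Any name other than the expected root account in this output is a candidate for the emergency actions described above.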

Charles Fisher has an electrical engineering degree from the University of Iowa and works as a systems and database administrator for a Fortune 500 mining and manufacturing corporation.

[Dec 02, 2017] BASH Shell How To Redirect stderr To stdout ( redirect stderr to a File )

Dec 02, 2017 | www.cyberciti.biz

Posted on March 12, 2008; last updated March 12, 2008.

Q. How do I redirect stderr to stdout? How do I redirect stderr to a file?

A. Bash and other modern shells provide an I/O redirection facility. There are 3 default standard files (standard streams) open:

[a] stdin – Used to get input (keyboard), i.e. data going into a program.

[b] stdout – Used to write information (screen)

[c] stderr – Used to write error messages (screen)

Understanding I/O streams numbers

The Unix / Linux standard I/O streams with numbers:

Handle  Name    Description
0       stdin   Standard input
1       stdout  Standard output
2       stderr  Standard error

Redirecting the standard error stream to a file

The following will redirect a program's error messages to a file called error.log:
$ program-name 2> error.log
$ command1 2> error.log

Redirecting the standard error (stderr) and stdout to file

Use the following syntax:
$ command-name &>file
OR
$ command > file-name 2>&1
Another useful example:
# find /usr/home -name .profile 2>&1 | more

Redirect stderr to stdout

Use the command as follows:
$ command-name 2>&1
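The order of the redirections matters, which is easy to get wrong. The sketch below uses a made-up function, noisy, that writes one line to each stream, and compares the two orderings.

```shell
#!/bin/bash
# noisy is a stand-in command that writes one line to each stream.
noisy() { echo "to stdout"; echo "to stderr" >&2; }

log=$(mktemp)

# Correct order: stdout goes to the file first, then stderr is pointed at it.
noisy > "$log" 2>&1
both=$(cat "$log")

# Reversed order: stderr is duplicated from the *current* stdout (here, the
# command substitution), and only then is stdout sent to the file.
leaked=$(noisy 2>&1 > "$log")
only_stdout=$(cat "$log")

printf 'file (correct order):\n%s\n' "$both"
printf 'file (reversed order): %s; escaped to caller: %s\n' "$only_stdout" "$leaked"
```

Redirections are processed left to right, so 2>&1 copies whatever stdout points to at that moment.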

[Oct 31, 2017] Bash process substitution by Tom Ryder

Notable quotes:
"... Thanks to Reddit user Rhomboid for pointing out an incorrect assertion about this syntax necessarily abstracting ..."
"... calls, which I've since removed. ..."
Feb 27, 2012 | sanctum.geek.nz

For tools like diff that work with multiple files as parameters, it can be useful to work with not just files on the filesystem, but also potentially with the output of arbitrary commands. Say, for example, you wanted to compare the output of ps and ps -e with diff -u. An obvious way to do this is to write files to compare the output:

$ ps > ps.out
$ ps -e > pse.out
$ diff -u ps.out pse.out

This works just fine, but Bash provides a shortcut in the form of process substitution, allowing you to treat the standard output of commands as files. This is done with the <() and >() operators. In our case, we want to direct the standard output of two commands into place as files:

$ diff -u <(ps) <(ps -e)

This is functionally equivalent, except it's a little tidier because it doesn't leave files lying around. This is also very handy for elegantly comparing files across servers, using ssh:

$ diff -u .bashrc <(ssh remote cat .bashrc)
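
To see what the shell actually hands to the command, here is a small sketch that needs no remote host. The sample data and variable names are made up for illustration, and comm is used instead of diff only to keep the output short. The path printed for <(...) is system-dependent (commonly an entry under /dev/fd).

```shell
#!/usr/bin/env bash
# Process substitution expands to a filename that the command can open.
path=$(echo <(true))
echo "substituted path: $path"

# Compare two unsorted lists without creating temporary files.
# comm -12 prints only the lines common to both (sorted) inputs.
common=$(comm -12 <(printf '%s\n' pear apple fig | sort) \
                  <(printf '%s\n' fig kiwi apple | sort))
echo "$common"
```

Because the command only ever sees a filename, this works with any tool that accepts file arguments, whether or not it can read from standard input.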

Conversely, you can also use the >() operator to direct from a filename context to the standard input of a command. This is handy for setting up in-place filters for things like logs. In the following example, I'm making a call to rsync , specifying that it should make a log of its actions in log.txt , but filter it through grep -vF .tmp first to remove anything matching the fixed string .tmp :

$ rsync -arv --log-file=>(grep -vF .tmp >log.txt) src/ host::dst/

Combined with tee this syntax is a way of simulating multiple filters for a stdout stream, transforming output from a command in as many ways as you see fit:

$ ps -ef | tee >(awk '$1=="tom"' >toms-procs.txt) \
               >(awk '$1=="root"' >roots-procs.txt) \
               >(awk '$1!="httpd"' >not-apache-procs.txt) \
               >(awk 'NR>1{print $2}' >pids-only.txt)

In general, the idea is that wherever on the command line you could specify a file to be read from or written to, you can instead use this syntax to make an implicit named pipe for the text stream.

Thanks to Reddit user Rhomboid for pointing out an incorrect assertion about this syntax necessarily abstracting mkfifo calls, which I've since removed.

[Oct 31, 2017] Temporary files by Tom Ryder

Mar 05, 2012 | sanctum.geek.nz

With judicious use of tricks like pipes, redirects, and process substitution in modern shells, it's very often possible to avoid using temporary files, doing everything inline and keeping things quite neat. However, when manipulating a lot of data into various formats you do find yourself occasionally needing a temporary file, just to hold data temporarily.

A common way to deal with this is to create a temporary file in your home directory, with some arbitrary name, something like test or working :

$ ps -ef >~/test

If you want to save the information indefinitely for later use, this makes sense, although it would be better to give it a slightly more instructive name than just test .

If you really only need the data temporarily, however, you're much better off using the temporary files directory. This is usually /tmp , but for good practice's sake it's better to check the value of TMPDIR first, and only use /tmp as a default:

$ ps -ef >"${TMPDIR:-/tmp}"/test
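
The ${TMPDIR:-/tmp} expansion used here falls back to /tmp whenever TMPDIR is unset or empty. A quick sketch of the three cases, run in subshells so the real environment is left alone (the /var/tmp value is just an example):

```shell
#!/usr/bin/env bash
# ${VAR:-default} expands to default when VAR is unset or empty.
unset_case=$(unset TMPDIR; printf '%s' "${TMPDIR:-/tmp}")
empty_case=$(TMPDIR=''; printf '%s' "${TMPDIR:-/tmp}")
set_case=$(TMPDIR=/var/tmp; printf '%s' "${TMPDIR:-/tmp}")

echo "unset: $unset_case"
echo "empty: $empty_case"
echo "set:   $set_case"
```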

This is getting better, but there is still a significant problem: there's no built-in check that the test file doesn't already exist, perhaps being used by some other user or program, particularly another running instance of the same script.

To that end, we have the mktemp program, which creates an empty temporary file in the appropriate directory for you without overwriting anything, and prints the filename it created. This allows you to use the file inline in both shell scripts and one-liners, and is much safer than specifying hardcoded paths:

$ mktemp
/tmp/tmp.yezXn0evDf
$ procsfile=$(mktemp)
$ printf '%s\n' "$procsfile"
/tmp/tmp.9rBjzWYaSU
$ ps -ef >"$procsfile"
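
mktemp creates the file but does not remove it, so scripts commonly pair it with a cleanup handler. A minimal sketch, assuming a script that needs the file only for its own lifetime; the variable name follows the example above and the sample data is illustrative:

```shell
#!/usr/bin/env bash
# Create the scratch file; stop if mktemp fails for any reason.
procsfile=$(mktemp) || exit 1

# Remove the file when the script exits, normally or on error.
trap 'rm -f "$procsfile"' EXIT

# Use the file like any other; here we store three lines and count them.
printf '%s\n' alpha beta gamma > "$procsfile"
count=$(wc -l < "$procsfile")
echo "lines stored: $((count))"
```

Registering the trap immediately after mktemp keeps the window in which a failure could leave the file behind as small as possible.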

If you're going to create several such files for related purposes, you could also create a directory in which to put them using the -d option:

$ procsdir=$(mktemp -d)
$ printf '%s\n' "$procsdir"
/tmp/tmp.HMAhM2RBSO
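
The same cleanup pattern extends to a scratch directory. The sketch below assumes two hypothetical file names, raw and sorted, inside the directory and removes the whole tree on exit:

```shell
#!/usr/bin/env bash
# Create a private scratch directory; stop if mktemp fails.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT

# Related scratch files live side by side under the one directory.
printf '%s\n' 3 1 2 > "$workdir/raw"
sort -n "$workdir/raw" > "$workdir/sorted"

first=$(head -n 1 "$workdir/sorted")
echo "smallest value: $first"
```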

On GNU/Linux systems, files of a sufficient age in TMPDIR are cleared on boot (controlled in /etc/default/rcS on Debian-derived systems, /etc/cron.daily/tmpwatch on Red Hat ones), making /tmp useful as a general scratchpad as well as for a kind of relatively reliable inter-process communication without cluttering up users' home directories.

In some cases, there may be additional advantages in using /tmp for its designed purpose as some administrators choose to mount it as a tmpfs filesystem, so it operates in RAM and works very quickly. It's also common practice to set the noexec flag on the mount to prevent malicious users from executing any code they manage to find or save in the directory.

Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author's free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright in original materials belongs to the respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Last modified: January 15, 2018