Softpanorama
May the source be with you, but remember the KISS principle ;-)

Contents Bulletin Scripting in shell and Perl Network troubleshooting History Humor

Softpanorama classification of sysadmin horror stories

Learning from other sysadmin's blunders and mishaps

News Enterprise Unix System Administration Recommended Links Unix System Monitoring Simple Unix Backup Tools Unix Configuration Management Tools Perl Admin Tools and Scripts Baseliners
Missing backup Rush/absence of testing of complex commands Creative uses of rm Working on the wrong computer Executing command in a wrong directory Reboot Blunders Locking yourself out Side effects of patching
Blunders due differences between Unix flavors Multiple sysadmin working on the same box Ownership changing blunders Premature or misguided optimization Pure stupidity LVM related mishaps Dot-star-errors and regular expressions blunders Abuse of privileges
Side effects of performing operations on home or application directories Typos in the commands with disastrous consequences Unintended consequences of automatic system maintenance scripts          
Safe-rm Workaholism and Burnout Coping with the toxic stress in IT environment The Unix Hater’s Handbook Tips Horror stories History Humor Etc

"More systems have been wiped out by admins than any hacker could do in a lifetime"

Rick Furniss

“Experience fails to teach where there is no desire to learn.”
"Everything happens to everybody sooner or later if there is time enough."
George Bernard Shaw

“Experience is the the most expensive teacher, but a fool will learn from no other.”

Benjamin Franklin

Unix system administration is an interesting and complex craft. It's good, if your work demands use of your technical skills, creativity and judgment. If it doesn't, then you're in absurd world of Dilbertized cubicle farms.. And unfortunately that happens too.

There is a lot of deep elegance in Unix, and a talented sesame like any talented craftsman is able to expose this hidden beauty by the masterful manipulation of complex objects  using classic Unix and command line. Which often amazes observers, who have Windows background.  In Unix administration you need to improvise on the job to get things done, create your own tools; is you want to be on advanced level you can't go "by manual". Unfortunately some of such improvisations produce unexpected side effects ;-)

In a way not only magic execution of complex sequences of commands is a part of this craftsmanship. Blunders and folklore about them are also the legitimate part of the craft. It's human to err after all. And if you are working as root, such an error can easily wipe a vital part of the system. In you are unlucky this is a production system. If you are especially unlucky there is no backup.

Sh*t happens, but there is a system in any madness ;-). That's why it is important to try to classify typical sysadmin mistakes. Regardless of the reason, every mistake should be documented as  it constitutes as an important lesson pointing to the class of similar errors possible. As saying goes "never waist a good crisis".  For example in many case when a simple mistake cause serious problem we observe absence of backups and absence of baseline, often both.

Learning from your own mistakes as well as mistakes of others is an important part of learning the craft. that's why it is important periodically reread such pages: they can prevent some horrible blunders.

Unix sysadmin who does not understand or don't remember about danger of rm -rf  .* is a bad sysadmin.  In addition keeping a personal journal of your SNAFU ( bas SNAFU like traffic incident is a confluence of several mistakes/simultaneous maneuvers /misunderstanding, etc; also like in the army incompetent bosses often play prominent role in such incidents)

Periodically browsing this personal log is really important as each of this incidents can easily be as they politically correctly say it "career limiting move".  While bad incidents stimulate learning and stimulates personal growth of a system administrator, there are less painful way to grow your knowledge. Including knowledge of bad incidents. That's why this set of pages was created. Reading it and other similar pages is a must for any aspiring sysadmin.

There are several fundamental reasons for blunders syadmins commit:

Having a classification system for such errors in one way to increase situation awareness. Although people usually badly learn on blunders committed b others. They prefer to make their own...  Still periodic reviews of them, like safety training can help to avoid some of the situations described below. Spectacular blunder is often too valuable to be forgotten as it tends to repeat itself ;-).  In some case reading about somebody else blunder does not convey the gravity of the situation in which you can find yourself by repeating it. For example, the understanding that dealing with files and directories staring with dot in Unix requires extreme caution probably can be acquired only by committing one (just one) such blunder.

The typical case of the loss of situational awareness is performing some critical operation on the wrong server. If you use Windows desktop to connect to Unix servers use MSVDM to create multiple desktop and change background for each to make the typing a command in a wrong terminal window less likely. If you prefer to work as root switch to root only on the server that you are working. Use your regular ID and sudo on the others. Other related reasons include:

In this page we will present "Softpanorama classification of sysadmin horror stories". It is not the first and hopefully not the last one. The author is indebted to Anatoly Ivasyuk who created original " The Unofficial Unix Administration Horror Story Summary. This list exists several versions:

There is a very true saying that experience keeps the most expensive school, but fools are unable to learn in any other ;-). In order to be able to learn from others we need to classify classic sysadmin horror stories. One such classification created by the author is presented below.

There is a very true saying that experience keeps the most expensive school but fools are unable to learn in any other ;-). In order to be able to learn from others we need to classify classic sysadmin horror stories. One such classification created by the author is presented below.

The main source is an old, but still quite relevant list of horror stories created by Anatoly Ivasyuk; There are two versions available:

Here is the author attempt to reorganize the categories and enhance the material by add more modern stories.

Softpanorama classification

  1. Missing backup. Please remember that backup is the last change for you to restore the system if something went terribly wrong. That means that before any dangerous steps you need to locate and check the existence of backup. Making another backup is also a good idea to that you have two or more recent copies. Attempt at least to brose the backup and see if data are intact is a must.
  2. Missing baseline and losing initial confirmation in the stream of changes. This is the most typical mistake in network troubleshooting and optimization is losing your initial configuration. This also might mean lack of preparation and lack of  situational awareness.  You need to take several steps to prevent this blunder from occurring and most important of them are baselines and backups.
  3. Locking yourself out
  4. Performing operation on a wrong computer. The naming schemes used by large corporation usually do not have enough distance between them to avoid such blunders. For example you can type XYZ300 instead of XYZ200. Another common situation is when yuou have several terminal windows open and in a hurry start working on a wrong server. That's why it's important that shell prompt shows the name of the host. Often, if you both have production computer and quality server for some application is wise never have two terminals opened simultaneously. Reopening it is not a big deal but can save you from some very unpleasant situations.
  5. Forgetting in which directory you are and executing command in a wrong directory. This is common mistake if work under severe time pressure or are very tired.
  6. Regular expressions related blunders. Novice sysadmins usually do not realize that '.*' also matches '..' often with disastrous consequences if commands like chmod, chown, rm are used recursively or in find command.
  7. Find filesystem traversal errors and other errors related to find. This is very common class of errors and it is covered in a separate page Typical Errors In Using Find
  8. Side effects of performing operations on home or application directories due to links to system directories. This is a pretty common mistake and I had committed it myself several time with various, but always unpleasant consequences.
  9. Misunderstanding of syntax of important command and/or not testing complex command before execution of production box. Such errors are often made under time pressure. One such case is using recursive rm, chown, chmod or find commands. Each of them deserves category of its own.
  10. Ownership changing blunders Those are common when using chown with find so you need to test the command first.
  11. Excessive zeal in improving security of the system ;-). A lot of current security recommendation are either stupid or counterproductive. In the hands of overly enthusiastic and semi-competent administrator it becomes a weapon that no hacker can ever match. I think more systems were destroyed by idiotic security measures that by hackers.
  12. Mistakes done under time pressure. Some of them were discussed above, but generally time pressure serves as a powerful catalyst for the most devastating mistakes.
  13. Patching horrors
  14. Unintended consequences of automatic system maintenance scripts
  15. Side effects/unintended consequences of multiple sysadmin working on the same box
  16. Premature or misguided optimization and/or cleanup of the system. Changing settings without full understanding consequences of such changes. Misguided attempts to get rid of unwanted file or directories (cleaning the system).
  17. Mistakes made because of the differences between various Unix/Linux flavors For example in Solaris run level 5 means reboot while in Linux run level 5 is running system with networking and X11.
  18. Stupid or preventable mistakes

Some personal experience

Reboots of wrong server

Such commands as reboot or mkinitrd can be pretty devastating when applied to wrong server. That mishap happens with a lot of administrators including myself, so it is prudent to take special measures to make it less probable.

This situation often is made more probable due to not fault-tolerant name scheme employed in many corporations where names of the servers differ by one symbol. For example, scheme serv01, serv02 serv03 and so on is a pretty dangerous name scheme as server names are different by only single digit and thus errors like working on the wrong server are much more probable.

The typical case of the loss of situational awareness is performing some critical operation on the wrong server. If you use Windows desktop to connect to Unix servers use MSVDM to create multiple desktop and change background for each to make the typing command in a wrong terminal window less likely

Even more complex scheme like Bsn01dls9, Nyc02sns10 were first three letter encode the location, then numeric suffix and then vendor of the hardware and OS installed are prone to such errors. My impression that unless first letters differ, there is a substantial chance of working on wrong server. Using favorite sport teams names is a better strategy and those "formal" name can be used as aliases.

Inadequate backup

If you try to distill the essence of horror stories most of them were upgraded from errors to horror stories due to inadequate backups.

Having a good recent backup is the key feature that distinguishes mere nuisance from full blown disaster. This point is very difficult to understand by novice enterprise administrators. Rephrasing Bernard Show we can say "Experience keeps the most expensive school, but most sysadmins are unable to learn anywhere else". Please remember that in enterprise environment you will almost never be rewarded for innovations and contributions but in many cases you will be severely punished for blunders. In other words typical enterprise IT is a risk averse environment and you better understand that sooner rather then later...

If you try to distill the essence of horror stories most of them are about inadequate backups. Having a good recent backup is the key feature that distinguishes mere nuisance from full blown disaster.

Rush and absence of planning are probably the second most important reason. In many cases sysadmin is stressed and that impair judgment.

Forgetting to chroot affected subtree

Another typical reason is abuse of privileges. If you have access to root that does not mean that you need to perform all operations as root. For example such simple operations' as

cd /home/joeuser
chown -R joeuser:joeuser .* 

performed as root cause substantial problems and time lost in recovery of ownership of system file. Computers are really fast now and of modern server such an opertaion can take a second or two :-(. 

Even with user privileges there will be some damage: it will affect all world writable files and directories. 

This is the case where chroot can provide tremendous help:

cd /home/joeuser 
chroot /home/joeuser
chown -R joeuser:joeuser .* 

Abuse of root privileges

Another typical reason is abuse of root privileges. Using sudo or RBAC (on Solaris) you can avoid some unpleasant surprises. Another good practice if to use screen with one screen for root operations and another for operations that can be performed under your on id.

Many Unix sysadmin horror stories are related to unintended consequences, unanticipated side effects of a particular Unix commands such as find and rm performed with root privileges. Unix is a complex OS and many intricate details (like behavior of commands like rm -r .* or chown -R a:a .*) can easily be forgotten from one encounter to another, especially if sysadmin works with several flavors of Unix or Unix and Windows servers.

For example recursive deletion of files either via rm -r or via find -exec rm {} \; has a lot of pitfalls that can destroy the server pretty nicely in less then a minute, if run without testing.

Some of those pitfalls can be viewed as a deficiency of rm implementation (it should automatically block * deletion of system directories like /, /etc/ and so on unless -f flag is specified, but Unix lacks system attributes for files although sticky bit on files can be used instead ). It is wise to use wrappers for rm. There are several more or less usable approach to writing such a wrapper:

Another important source of blunders is time pressure. Trying to do something quickly often lead to substantial downtime. Hurry slowly is one of the saying that are very true for sysadmin. But unfortunately very difficult to follow.

Sometimes your emotional state contribute to the problems: you didn’t have much sleep or your mind was distracted by your personal life problems. In such days it is important to slow down and be extra cautious.

Typos are another common source of serious, some time disastrous errors. One rule that should be followed (but as the memory of the last incident fades, this rule like any safety rules, usually is forgotten :-): if you are working as root and perform dangerous operations never type the directory path, copy it from the screen.

I once automatically typed /etc instead of etc trying to delete directory to free space on a backup directory on a production server (/etc probably in engraved in sysadmin head as it is typed so often and can be substituted for etc subconsciously). I realized that it was mistake and cancelled the command, but it was a fast server and one third of /etc was gone. The rest of the day was spoiled... Actually not completely: I learned quite a bit about the behavior of AIX in this situation and the structure of AIX /etc directory this day so each such disaster is actually a great learning experience, almost like one day training course ;-). But it's much less nerve wracking to get this knowledge from the course...

Another interesting thing is having backup was not enough is this case -- backup software stopped working. The same was true for telnet and ssh. And this was a remote server is a datacenter across the country. I restored the directory on the other non-production server (overwriting its /etc directory in this second box with the help of operations, tell me about cascading errors and Murphy law :-). Then netcat helped to transfer the tar file.

If you are working as root and perform dangerous operations never type a path of the command, copy it from the screen. If you can copy command from history instead of typing, do it !

In such cases network services with authentication stop working and the only way to transfer files is using CD/DVD, USB drive or netcat. That's why it is useful to have netcat on servers: netcat is the last resort file transfer program when services with authentication like ftp or scp stop working. It is especially useful to have it if the datacenter is remote.

netcat is the last resort file transfer program when services with authentication like ftp or scp stop working. It is especially useful to have it, if the datacenter is remote.

Dr. Nikolai Bezroukov


Top updates

Shop Amazon Cyber Monday Deals Week
Google Search


NEWS CONTENTS

Old News ;-)

The Unix Hater’s Handbook

wayback.archive.org

“rm” Is Forever

The principles above combine into real-life horror stories. A series of

exchanges on the Usenet news group alt.folklore.computers illustrates

our case:

Date: Wed, 10 Jan 90

From: djones@megatest.uucp (Dave Jones)

Subject: rm *

Newsgroups: alt.folklore.computers2

Anybody else ever intend to type:

% rm *.o

And type this by accident:

% rm *>o

Now you’ve got one new empty file called “o”, but plenty of room

for it!

Actually, you might not even get a file named “o” since the shell documentation

doesn’t specify if the output file “o” gets created before or after the

wildcard expansion takes place. The shell may be a programming language,

but it isn’t a very precise one.

Date: Wed, 10 Jan 90 15:51 CST

From: ram@attcan.uucp

Subject: Re: rm *

Newsgroups: alt.folklore.computers

I too have had a similar disaster using rm. Once I was removing a file

system from my disk which was something like /usr/foo/bin. I was in /

usr/foo and had removed several parts of the system by:

% rm -r ./etc

% rm -r ./adm

…and so on. But when it came time to do ./bin, I missed the period.

System didn’t like that too much.

Unix wasn’t designed to live after the mortal blow of losing its /bin directory.

An intelligent operating system would have given the user a chance to

recover (or at least confirm whether he really wanted to render the operating

system inoperable).

Unix aficionados accept occasional file deletion as normal. For example,

consider following excerpt from the comp.unix.questions FAQ:3

6) How do I “undelete” a file?

Someday, you are going to accidentally type something like:

% rm * .foo

and find you just deleted “*” instead of “*.foo”. Consider it a

rite of passage.

Of course, any decent systems administrator should be doing

regular backups. Check with your sysadmin to see if a recent

backup copy of your file is available.

“A rite of passage”? In no other industry could a manufacturer take such a

cavalier attitude toward a faulty product. “But your honor, the exploding

gas tank was just a rite of passage.” “Ladies and gentlemen of the jury, we

will prove that the damage caused by the failure of the safety catch on our

3comp.unix.questions is an international bulletin-board where users new to the

Unix Gulag ask questions of others who have been there so long that they don’t

know of any other world. The FAQ is a list of Frequently Asked Questions garnered

Changing rm’s Behavior Is Not an Option

After being bitten by rm a few times, the impulse rises to alias the rm command

so that it does an “rm -i” or, better yet, to replace the rm command

with a program that moves the files to be deleted to a special hidden directory,

such as ~/.deleted. These tricks lull innocent users into a false sense

of security.

Date: Mon, 16 Apr 90 18:46:33 199

From: Phil Agre <agre@gargoyle.uchicago.edu>

To: UNIX-HATERS

Subject: deletion

On our system, “rm” doesn’t delete the file, rather it renames in some

obscure way the file so that something called “undelete” (not

“unrm”) can get it back.

This has made me somewhat incautious about deleting files, since of

course I can always undelete them. Well, no I can’t. The Delete File

command in Emacs doesn’t work this way, nor does the D command

in Dired. This, of course, is because the undeletion protocol is not

part of the operating system’s model of files but simply part of a

kludge someone put in a shell command that happens to be called

“rm.”

As a result, I have to keep two separate concepts in my head, “deleting”

a file and “rm’ing” it, and remind myself of which of the two of

them I am actually performing when my head says to my hands

“delete it.”

Some Unix experts follow Phil’s argument to its logical absurdity and

maintain that it is better not to make commands like rm even a slight bit

friendly. They argue, though not quite in the terms we use, that trying to

make Unix friendlier, to give it basic amenities, will actually make it

worse. Unfortunately, they are right.

 

 

 

[Sep 04, 2014] Blunders with expansion of tar files, structure of which you do not understand

if you try to expand tar file in some production directory you accidentally can overwrite and change ownership of such directories and then spend a lot of type restored status quo. It is safer to expand such tar files in /tmp first and only after that seeing the results then decide whether to copy some directories of re-expand the tar file. Now in production directory.

[Sep 03, 2014]  Doing operation in a wrong directory among several similar directories

Sometimes directories are very similar, for example numbered directoriess created by some application such as task0001, task0002, ... task0256. In this case you can well perform operation on a wrong directory. For example send to tech support a tar file with the directory that instead of test data contain production run. 

[Oct 17, 2013]  Crontab file - The UNIX and Linux Forums

The loss of crontab is a serious trouble. This is one of a typical sysadmin blunders (Crontab file - The UNIX and Linux Forums)

mradsus

Hi All,
I created a crontab entry in a cron.txt file accidentally entered

crontab cron.txt.

Now my previous crontab -l entries are not showing up, that means i removed the scheduling of the previous jobs by running this command "crontab cron.txt"

How do I revert back to previously schedule jobs.
Please help. this is urgent.,
Thanks.

In this case, if you do not have a backup,  you only remedy is to try to extract cron commands from /var/log/messages.

[Jul 17, 2012] My 10 UNIX Command Line Mistakes

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I end up deleting entire backup (note -c switch instead of -x):
cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly I end up running rsync command and deleted all new files by overwriting files from backup set (now I’ve switched to rsnapshot)
rsync -av -delete /dest /src
Again, I had no backup.

Deleted Apache DocumentRoot

I had sym links for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about symlink issue. To save disk space, I ran rm -rf on http directory. Luckily, I had full working backup set.

Public Network Interface Shutdown

I wanted to shutdown VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:
ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update firewall rules. After a quick kernel upgrade, I had rebooted the box. I had to call remote data center tech to reset firewall settings. (now I use firewall reset script to avoid lockdowns).

Typing UNIX Commands on Wrong Box

I wanted to shutdown my local Fedora desktop system, but I issued halt on remote server (I was logged into remote box via SSH):
halt
service httpd stop

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I’ve learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is only tool that guaranties recovery under all conditions. (see Torture-testing Backup and Archive Programs paper).
  3. Never use rsync with single backup directory. Create a snapshots using rsync or rsnapshots.
  4. Use CVS to store configuration files.
  5. Wait and read command line again before hitting the dam [Enter] key.
  6. Use your well tested perl / shell scripts and open source configuration management software such as puppet, Cfengine or Chef to configure all servers. This also applies to day today jobs such as creating the users and so on.

Mistakes are the inevitable, so did you made any mistakes that have caused some sort of downtime? Please add them into the comments below.

[May 17, 2012] Pixar’s The Movie Vanishes, How Toy Story 2 Was Nearly Lost

In the 2010 animated short titled Studio Stories: The Movie Vanishes, we learn from Pixar’s Oren Jacob and Galyn Susman how a big chunk of Toy Story 2 movie files were nearly lost due to the accidental use of a Linux rm command (and a poor backup system). This short was included on the Toy Story 2 DVD extras.

Pixar studio stories - The movie vanishes (full) - YouTube

[Mar 16, 2012] Using right command in a wrong place

From email to Editor of Softpanorama...

This happened with Open view.  It has a command for agent reinstallation. opc-inst -r. The problem is that it needs to be run on the node not on the server and does not accept any arguments

In this case it was run on the server with predictable results. This was a production server of a large corporation so you can imagine the level of stress in putting down this fire...

[Oct 14, 2011]  Nasty surprise with the command cd joeuser; chown -R joeuser:joeuser .*

This is classic case of side effect of dot .*  along with -R  flag which cause complete tree traversal in Unix. The key issue here is not panic. The recovery is possible even if you do not have the map of all files permissions (and you better do it on regular basis). The first step is to use
for p in $(rpm -qa); do rpm --setugids $p; done
The second is to copy remaining ownership info from some similar system.  Especially important is to restore ownership in /dev directory.
Similar approach can be used for resoring permissions:
for p in $(rpm -qa); do rpm --setperms $p; done
Please note that the rpm --setperms command actaully resets setuid, setgid, and sticky bits. These must be set manually using some existing system as a baseline.

[Jul 22, 2011] Mailbag by Marcello Romani

Feb 02, 2011 | LG #186

Hi, I had a horror story similar to Ben's one, about two years ago. I backed up a PC and reinstalled the OS with the backup usb disk still attached. The OS I was reinstalling was a version of Windows (2000 or XP, I don't remember right now). When the partition creation screen appeared, the list items looked a bit different from what I was expecting, but as soon as I realized why, my fingers had already pressed the keys, deleting the existing partitions and creating a new ntfs one. Luckily, I stopped just before the "quick format" command... Searching the 'net for data recovery software, I came across TestDisk, which is target at partition table recovery. I was lucky enough to have wiped out only that portion of the usb disk, so in less than an hour I was able to regain access to the all of my data. Since then I always "safely remove" usb disks from the machine before doing anything potentially dangerous, and check "fdisk -l" at least three times before deciding that the arguments to "dd" are written correctly...

Marcello Romani TAG mailing list TAG@lists.linuxgazette.net http://lists.linuxgazette.net/listinfo.cgi/tag-linuxgazette.net

[Jul 03, 2011] Be careful with naming servers

Some application like Oracle products are sensitive to DNS names you use, especially hostname. They store them in multiple places and there is no easy way to change it in all those places after Oracle product is installed. They also accept only long hostname (i.e. box.location.firm.com) instead of short.

If you mess with your hostname and DBA installed Oracle product you usually need to reinstall the box.

Such errors can happen if you copy files form ne servers to another to speed up the installation and forgot to modify /etc/hosts file or modified it incorrectly.

[Jun 03, 2011] Sysadmin Tales of Terror by Carla Schroder

February 19, 2003 | Enterprise Networking Planet

Cover One's Behind With Glory

Now let's be honest, documentation is boring and no fun. I don't care; just do it. Keep a project diary. Record everything you find. You don't want to shoulder the blame for someone else's mistakes or malfeasance. It is unlikely you'll get into legal trouble, but the possibility always exists. Record progress and milestones as well. Those in management tend to have short memories and limited attention spans when it comes to technical matters, so put everything in writing and make a point of reviewing your progress periodically. No need to put on long, windy presentations -- take ten minutes once a week to hit the high points. Emphasize the good news; after all, as the ace sysadmin, it is your job to make things work. Any dork can make a mess; it takes a real star to deliver the goods.

Be sure to couch your progress in terms meaningful to the person(s) you're talking to. A non-technical manager doesn't want to hear how many scripts you rewrote or how many routers you re-programmed. She wants to hear "Group A's email works flawlessly now, and I fixed their database server so it doesn't crash anymore. No more downtime for Group A." That kind of talk is music to a manager's ears.

Managing Users

In every business there are certain key people who wield great influence. They can make or break you. Don't focus exclusively on management -- the people who really run the show are the secretaries and administrative assistants. They know more than anyone about how things work, what's really important, and who is really important. Consult them. Listen to them. Suck up to them. Trust me, this will pay off handsomely. Also worth cultivating are relationships with the cleaning and maintenance people -- they see things no one else even knows about.

When you're new on the job and still figuring things out, the last thing you need is to field endless phone calls from users with problems. Make them put it in writing -- email, yellow pad, elaborate trouble-ticket system, whatever suits you. This gives you useful information and time to do some triage.

Managing Remote Users

If you have remote offices under your care, the phone can save a lot of travel. There's almost always one computer-savvy person in every office; make this person your ally and helper. At very least, this person will be able to give you coherent, understandable explanations. At best, they will be your remote hands and eyes, and will save you much trouble.

Such a person may be a candidate for training and possibly transferring to IT. Some people are afraid of helping someone like this for fear of losing out to them in some way. The truth, though, is that you never lose by helping people, so don't let that idea scare you off from giving a boost to a worthy person.

Getting Help

We all know how to use Google, Usenet, and other online resources to get assistance. By all means, don't be too proud -- ask! And by all means, don't be stupide either -- use a fake name and don't mention the company you work for. There's absolutely no upside to making such information public; there are, however, many downsides to doing so, like inviting security breaches, giving away too much information, making your company look bad, and besmirching your own reputation.

As I said at the beginning, these are strategies that have served me well. Feel free to send me your own ideas; I especially love to hear about true-life horror stories that have happy endings.

Resources

Life in the Trenches: A Sysadmin Speaks
10 Tips for Getting Along with People at Work
Linux Administration Books

[Jun 20, 2010] IT Resource Center forums - greatest blunders

Bill McNAMARA

I've done this with people looking over my shoulder (while in single user):

echo "/dev/vg00/lvol6 /tmp vxfs delaylog 0 2" > /etc/fstab
reboot!!

Other good ones:
mv /dev/ /Dev
(try it - and don't ask why!!)

Later,
Bill

Christian Gebhardt

Hi
As a newby in UNIX I had an Oracle Testinstallion on a production system
productiv directory: /u01/...
test directory: /test/u01/...

deleting the test installation:
cd /test
rm /u01

OOPS ...

After several bdf commands I noticed that the wrong lvol shrinks and stops the delete command with Ctrl'C

The database still worked without the most binaries and libraries and after a restore from tape without stopping and starting the database all was ok.

I love oracle ;-)

Chris

harry d brown jr

Learning hpux? Naw, that's not it....maybe it was learning to spell aix?? sco?? osf?? Nope, none of those.

The biggest blunder:

One morning I came in at my usual time of 6am, and had an operator ask me what was wrong with one of our production servers (servicing 6 banks). Well nothing worked at the console (it was already logged in as root). Even a "cat *" produced nothing but another shell prompt. I stopped and restarted the machine and when it attempted to come back up it didn't have any OS to run. Major issue, but we got our backup tapes from that night and restored the machine back to normal. I was clueless (sort of like today)

The next morning, the same operator caught me again, and this time I was getting angry (imagine that). Same crap, different day. Nothing was on any disk. This of course was before we had raid availble (not that that would have helped). So we restored the system from that nights backups and by 8am the banks have their systems up.

So now I have to fix this issue, but where the hell to start? I knew that production batch processing was done by 9PM, and that the backups started right after that. The backups completed around 1am, which were good backups, because we never lost a single transaction. But around 6am the stuff hit the fan. So I had a time frame: 1am-6am, something was clobbering the system. I went though the crons, but nothing really stood out, so I had to really dive into them. This is the code (well almost) I found in the script:

cd /tmp/uniplex/trash/garbage
rm -rf *

As soon as I saw those two lines, I realized that I was the one that had caused the system to crap out every morning. See, I needed some disk space, and I was doing some house cleaning, and I deleted the sub-directory "garbage" from the /tmp/uniplex/trash" directory. Of course the script is run by root, which attempted to "CD" to a non-existent directory, which failed, and cron was still cd'd to "/", it then proceeded to "rm -rf *" my system!

live free or die
harry

Bill Hassell

I guess my blunder sets the record for "most clobbered machines" in one day:

I created an inventory script to be used in the Response Center to track all the systems across the United States (about 320 systems). These are all test and problem replication machines but necessary for the R/C engineers to replicate customer problems.

The script was written about 1992 to handle version 7.0 and higher. About 1995, I had a number of useful scripts that it seemed reasonable to drop these into all 300 machines as a part of the inventory process (so far, so good). Then about that time, 10.01 was released and I made a few changes to the script. One was to change the useful script location from /usr/local/bin to /usr/contrib/bin because of bad directory permissions. I considered 'fixing' the bad permissions but since these systems must represent the customer environment, I decided to move everything.

Enter the shell option -u. I did not use that option in my scripts and due to a spelling error, an environment variable was used in rm -r which was null, thus removing the entire /usr/local directory on 320 machines overnight.

Needless to say, I never write scripts without set -u at the top of the script.

John Poff

We were doing a disaster recovery drill. I was busy Igniting a V-class server for our database server. I had finally gotten the OS on it after about three hours and I was running a slick little script I had written to recreate all the volume groups and filesystems. My script takes a list of available PVs and does a 'pvcreate -f' on them. Well, we started our drill at midnight &#91;not our idea but we had little choice&#93;, so around about 3:30am I was trying to run this script. It was chugging along just fine, pvcreating disks, and then the system hung. Not completely, but pretty much dead. After trying to reboot it, I eventually figured out that when I went through the interactive Ignite, I hadn't paid close attention to which disk Ignite had selected to load the OS on, and it had chosen one of the disks in the EMC array instead of one of the local Jamaica disks. My slick script came along and had pvcreated the disk that had the OS on it. Oops. There went a few more hours of work.

The good news is that after that mess they decided that we would never start a DR drill at midnight!

JP

Dave Johnson

Here is my worst.
We us BC's on our XP512. We stop the application, resync the BC, split the BC, start the application, mount the BC on same server, start backup to tape from BC. Well I had to add a LUN to the primary and BC. I recreated the BC. I forgot to change the script that mounts the BC to include the new LUN. The error message vgimport when you do not include all the LUN's is just a warning and it makes the volume group available. The backups seemed to be working just fine.
Well 2 months go by. I did not have enough available disk space to test my backups. (That has been changed). Then I decided to be proactive about deleted old files. So I wrote a script:
cd /the/directory/I/want/to/thin/out
find . -mtime +30 -exec rm {} \;

Well that was scheduled on cron to run just before backups one night. The next morning I get the call the system is not responding. (I guessed later the cd command had failed and the find ran from /).
After a reboot I find lots of files are missing from /etc /var /usr /stand and so on. No problem, just rebuild from the make_recovery tape created 2 nights before then restore the rest from backup.
Well step 1 was fine, but the backup tape was bad. The database was incomplete. It took 3 days (that is 24 hours per day) to find the most recient tape with a valid database. Then we had to reload all the data. After the 3rd day I was able to turn over recovery to the developers. It took about a week to get the application back on-line.
I have sent a request to HP to have the vgimport command changed so a vgimport that does not specify all the LUN's will fail unless some new command line param is used. They have not yet provided this "enhancement" as of the last time I checked a couple of months ago. I now test for this condition and send mail to root as well as fail the BC mount if it does.

Life lesson: TEST YOUR BACKUPS!!

Dave Unverhau

This is probably not too uncommon...needed to shutdown a server for service (one of several lined up along the floor...no...not racked). Grabbed the keyboard sitting on that box and quickly typed the shutdown string (with a -hy 0, of course) and got ready to service the box.

...ALWAYS make sure the keyboard is sitting on the box to which it is connected!

Deepak Extross

We had this developer who claimed that when he runs his program, it complains about /usr/bin/ld. (This was because of a missing shared library, he later discovered) It was decided to backup /usr/bin/ld and replace it with 'ld' from another machine on which his program worked.
No sooner was ld moved, than all hell breaks loose.
Users get coredumps in response to even simple commands like "ls", pwd", "cd"... New users cannot telnet into the system and those who are logged in are frozen in their tracks.

Both the developer and admin are still working with us...

RAC

Well I was very very new to HP-UX. Wanted to set up PPP connection with a password borrowed from a friend so that I could browse the net.

Did not worry that the remote support modem can not dial out from remote support port.
Went through all documents available, created device files dozen times, but never worked. In anguish did rm -fr `ltr|tail -4|Awk '{print $9}'
(That to pacify myself that I know complex commands)

But alas, I was /sbin/rc3.d.

Thought this is not going to work and left that.
Other colleage not aware of this rebooted the system for Veritas netbackup problem.

Within next two hours HP engineer was on-site. Was called by colleague.

Was watching whole recovery process, repeatedly saying "I want to learn, I want to learn"

Then came to know that can not be done.

Dave Johnson

Hey Bill,

When I reinstalled the OS from the make_recovery tape it wiped out the scirpt I wrote and the item on the cron. There is no evidance of what happened or who was responsible. I did however go straight to my boss to confess and take the blame. That above all is probably the strongest reason next to being able to recover at least some of the data why I was not terminated for it.

Did I mention in the first post this happened Feb of 2002????

Simon Hargrave

1. On a live Sun Enterprise server, you turn the key one way for maintenance, and one way for off. I wanted to turn it to maintenance but wasn't sure which way to turn it. Guess which way I chose...

2. On an XP512 I accidentally business-copied a new LUN over the top of a live LUN, because I put the wrong LUN ID in!!! Luckily the live datas backup had finished a full 3 minutes earlier...phew!

3. I can't take credit for this one my ex-boss did it, but I had to include it. On Solaris he added a filesystem in the vfstab file, but put the wrong device in the raw-device field. Concequently all backups backed up the wrong device, so when the data got trashed and required restoring, it...um...didn't exist on tape! Luckily for him he'd left the company 2 months before and I was left to explain what a halfwit he was ;)

Dave Chamberlin

I have stepped in TAR on a couple occasions. I Moved a tar file from production box to development box but I had tarred it with an absolute path. When I untarred it - it overwrote the existing directory - destroying all the developers updates! I have also been burned by the fact that xvf and cvf are very close on the keyboard - so my command to extract a tar came out once as tar -cvf - which of course erased the tar file.
Only other bad blunder was doing an lvreduce on a mounted file system - thought I was recovering space without affecting the other files on the volume. Luckily - they were backup up...

Martin Johnson

One of my coworkers decided to set up a pseudo root (UID=0) account for himself. He used useradd to create the account and made / his home directory. He was unaware that useradd does a "chown -R" to the home directory. So he became the owner of all files on the system. This was a pop3 mail server system, and the mail services did not like the change.

My coworker left for the day, leaving me with angry VPs looking over my shoulder demanding to know when email services will be back.

Marty <the coworker is now known as "chown boy">

fg

Greatest GAFF: Taking the word of someone who I thought knew what they were doing and had taken the proper precautions to ensure a recovery method for a rebuild of filesystems,

to make a long story short, no backup, no make_recovery, and then rebuilt filesystems. Data lost and had to rebuild. Recovered most of the data except for previous 24hrs.

MORAL of the story: Always have backups and make_recovery tapes done.

Richard Darling

When I upgraded from 10.20 to 11.0 I finished the system installed, and then used cpio to copy my user applications. One of the vendors had originally had their app installed in /usr (before my time), and I copied the app up one directory and wiped out /usr. By the way, I didn???t back up the installation before the cpio copy. It was a Friday night and I wanted to get out...figured I could backup after getting the apps copied over...learnt an important lesson regarding backups that night...
RD

Belinda Dermody

writing a script to chmod -R to r/w for the world on a dir. Not doing a check to see if I was in the proper directory and all of a sudden my bin directory files were all 666. Lucky enough I had multiple windows and it hadn't gotten to the sbin directory yet. Had a few inquiries why certain commands wouldnt work before I got it all back correctly. From then on, I do $? and check the return status before I issue any remove or chmod commands.

Ian Kidd

I was going to vi a script that performs a cold-backup of an oracle database. Since we prefer not to be root all the time, we use sudo.

So I typed, "sudo", but then was interrupted by someone. I then typed the name of the script when that person left. Nothing appeared on the screen immediately, so I got a coffee.

When I came back, I saw " sudo {script}" and realized - 1 minute the DBAs started screaming that their database was down - that I started a cold backup in the middle of a production day.

Duncan Edmonstone

My worst two:

Installing a server in a major call centre of a US bank...

I built the OS as required by our apps team in the US, and following our build standards put the system into trusted mode.

They installed the app, and realised they'd forgotten to ask me to put the system into NIS (system could be used by any of the call centre reps in over 40 call centers - a total of 15,000 NIS based accounts!) It's the middle of the night in the UK, so the apps team get a US admin to set up the system as a NIS client. (yes it shouldn't work when the box is trusted, but it does!)

Next day, the apps team is complaining about some stuff not working - can I take the system out of trusted mode so we can discount that? Sure course I can - I run tsconvert and wait.... and wait.... and wait.... hmmm - this usually takes about 30 seconds - what gives?

Try to open another window to check whats happening - can't log in as root, the password that worked two minutes ago no longer works!

Next root file system full messages start to scroll up the screen!

It turns out that tsconvert is busy taking ALL the NIS accounts and putting them in the /etc/passwd file (yes all 15,000 of them) and guess what? There's a root account in NIS!

All I can say is thank god for good backups!

The other one was a typical junior admin mistake which comes from not understanding shell file name generation fully:

A user can't log in, I go take a look at his home directory and note the permissions on his .profile are incorrect. I also note that the other '.' files are incorrect, so I do this:

cd /home/user
chmod 400 .*

I call the user and tell him to try again - he says he still can't log in! Huh?

So I go back and carry on looking for the problem, but before I know it the phone is ringing off the hook! No-one can log in now!

And then it dawns on me

I type the following:

cd /home/user
echo .*

and that returns (of course)

. .. .cshrc .exrc .login .profile .sh_history

Oops I didn't just change the permissions on the users '.' files - I also changed the permissions on the users directory, and (crucially!) the users parent directory /home!

These days I always use echo to check my file name pattern matching logic when doing this kind of thing...

We live and learn


Duncan

Vincent Fleming

I have been way too fortunate not to have really blundered all that bad (I've mostly done development), but one I've seen was a real good one...

The "security auditor", who apparently knew absolutely nothing about UNIX, was reviewing our development system, and decided that /tmp having world read/write permissions was not a good thing for security - so, in the middle of the day, he chmod 744 /tmp ... suddenly, 200+ developers (including myself) on the machine (it was a *very* large machine back in 1990) were unable to save their editor sessions!

So, of course, I use the "wall" command to point our their error so they can fix it quickly and I can save my 2+ hours of edits:

$ wall
who's the moron who changed the permissions on /tmp????
.
$

The funny thing was that I was the one they escorted out of the building that day...

The hazards of being a contractor and publically humiliating an employee...

Jerry Jordak

This one wasn't my fault, but is still funny.

One time, we had to add disk space to one of our servers. My manager at time also was in charge of the EMC disk environment, so he allocated an extra disk to the server. I configured the disk into the OS, did a pvcreate on it, and proceeded to add it to the volume group, extend the filesystem, etc...

At about that same time, another one of our servers started going absolutely nuts. It turns out that he accidentally gave me a disk that was already allocated that other system. That drive had held the binaries for that server's application. Oops...

Tom Danzig

As root:

find / -u 201 -chown dansmith

Did this afeter changing a user ID to another number. Should have user "-user" and not -u (I had usermod on my mind). System gladly ignored the -u and started changing all files to user dansmith (/etc/passwd, /etc/hosts, etc). Needless to say, system was hosed.

Was able to recover fine from make_recovery tape. Fortunately this was also a test box and not production.

Oh well ... live and learn! Mistakes are only bad if you don't learn from them.

Mark Fenton

Back in '92 on a NIS network, meant to wipe out a particular user's directory, but was one level up from same when issued rm -r *. Took three hours to back up all home directories on network....

Last year, I discovered that new is not necessarily better. Updating Db software I blithly stopped the Db, copied new software in, and restarted. Users couldn't get any processing done that day -- seems that there was a conversion program that was *supposed* to run that didn't. But that wasn't the blunder -- the blunder is that the most recent backup had been two days previous, so all the previous day's processing was gone... (and that had been an overtime day, too!)

Keely Jackson

My greatest blunder:

The guy who set up the live database had done it as himself rather than aa a separate dba user. He left the company and his user id was re-allocated to somebody in HR. The guy in HR subsequently left as well.

One day I decided to tidy up the system and remvoe the this user. I did this via sam, selected the option to delete all the users files thinking that nobody who was in HR could possibly own any important files.

Unfortunately I was somewhat mistaken. Of course the guy in HR now owned all the database files. The first thing I knew was when the users started to complain that the database was no longer available. I got the db back from restore but everybody had lost half a days work.

Needless to say, I now do not delete old users files but re-allocate them to a special 'leavers' user and check them all before deleting anything.

A good HP blunder.

HP were moving the live server - a K420 - between sites and the remvoal men managed to drop it down a flight of stairs. It landed on one of them who then had to be taken to hospital. Fortunately he was only bruised while the machine had a huge dent in it. Anyway, it got moved to the other site and booted up straight away with no problems. That is what I call resiliant hardware. As a precaution disks etc were changed but it is still running quite happily today.

Cheers
Keely

Michael Steele

    When I was first starting out I worked for a Telecom as an 'Application Administrator' and I sat in a small room with a half a dozen other admins and together we took calls from users as their calls escalated up from tier I support. We were tier II in a three tier organization.

    A month earlier someone from tier I confused a production server with a test server and rebooted it in the middle of the day. These servers were remotely connected over a large distance so it can be confusing. Care is needed before rebooting.

    The tier I culprit took a great deal of abuse for this mistake and soon became a victim of several jokes. An outage had been caused in a high availability environment which meant management, interviews, reports; It went on and on and was pretty brutal.

    And I was just as brutal as anyone.

    Their entire organization soon became victimize by everyone from our organization. The abuse traveled right up the management tree and all participated.

    It was hilarious, for us.

    Until I did the same thing a month later.

    There is nothing more humbling then 2000 people all knowing who you are for the wrong reason and I have never longed for anonymity more.

    Now I alway do a 'uname' or 'hostname' before a reboot, even when I'm right in front of it.

Geoff Wild

Problem Exists Between Keyboard And Chair:

Just did this yesterday:

tar cvf - /sapmnt/XXX | tar xvf -

Meant to do:

tar cvf - /sapmnt/XXX | (cd /sapmnttest/XXX ;tar xvf -)

Needless to say, I corrupted most of the files in /sapmnt/XXX

Rgds....Geoff

Suhas

1. Imagine what would have happened when, on a Solaris box, while taking backup of ld.so.1, instead of doing "cp", "mv" was done !!! As most of you would be aware, ld.so.1 is the library file that is accesses by every system call. The next 1 hour was sheer chaos .. and worst hour ever experienced!!!!
Lesson Learnt: "Look before you leap !!!"

2. Was responsible for changing the date on the back-up master server by nearly a year . That night was a horrifying night of my life.
Lesson Learnt : "A typo-error can cost you any-thing between $0 to infinity."

Keep forumming !!!!
Suhas

[Jun 12, 2010] Sysadmin Blunder (3rd Response) - Toolbox for IT Groups

chrisz

Did one also.

I was in a directory off of root and performed the following command:

chown -R someuser.somegroup .*

I didn't think much of the command, just wanted to change the owner and
group for all files with a . in the front of them and subdirs. Went well
for the files in the current directory until it reached the .. file
(previous directory). All the files and subdir's off of root changed to the
owner and group specified. I was wondering why the command was taking so
long to complete. BTW, it changed the owner and group for all NFS files
too! That's when the real fun started.
Some days you're the windshield, other days you're the bug!

Dan Wright

It didn't really cause any significant damage, but about 10 years ago, I had
recently become an admin of a network of mostly NeXT machines which were new
to me and the default shell was c-shell, which I also wasn't very familiar
with.

I had dialed in from home on nite to play around and become more familiar
with how things worked on NeXTStep.

In an attempt to kill a job, I typed in "kill 1" instead of "kill %1" - and
it probably was actually a "kill -9 1" and of course I was root.

And of course, 1 was the init process. I immediately lost connection and
had to do a hard reboot on that machine the next day before that user got in
(for some reason, the machine with the modem wasn't in my office, it was in
someone elses).

Fortunately, that wasn't a critical machine outside of normal business
hours. No harm, no foul, eh?


If you like this kind of story, there are a bunch here:

http://www2.hunter.com/~skh/humor/admin-horror.html

User123731

I have in the past touched a file called "-i" in important directories. This
will cause rm to see the "-i" and make the rm interactive before it acts on
other files/dirs if you do not specify a particular directory.

User451715:

Ha!

That's an easy one.. My first position as a Junior Admin in HPUX working in First line support about eight years ago..

I was working on a server, moving some files around, and mistakenly moved all of the files in the /etc directory to a lower level direcory (about 10 sub-dirs down)..

I sat there at the console wide-eyed, my heart dropped, and I turned and looked out the window, and saw my job sailing out of it, since this was a server that was being prepared to be deployed and that a month's worth of work would have been wasted.

Luckily, a Senior Admin who later became my greatest mentor (Phil Gifford), took pity on my situation, and we sat there and recovered the /etc directory before anyone knew what had happened.. The key here was, he walked me through the necessary steps to recover files from an ignite tape, and voila!

Needless to say, I learned all about why seasoned UNIX admins protect root privilege as it it was the 'Family Jewels'.. <chuckle>

Mike E.-

Bryan Irvine :

My biggest blunder wasn't an an AIX box but applies to the thread. I
once made an access list for a cisco box, and forgot that there is an
implicit "deny all" rule at the end. So I made my nifty access list and
enabled it, tested the traffic to see if it was blocked and lo and
behold it seemed to be working. Great! I went on with my life and
figured I go read some news or something. uhhhh that didn't work. tried
email..that didn't work. tried traceroutes, they all died at the router
I was jsut working on...then the phones started ringing.

*click*

lightbulb went on in my head and I ran as fast as I could to the router
to reboot it (lucky I hadn't written the nvram)

The phone didn't stop ringing for 45 minutes even though the problem
only existed for about 4 minutes.

But then, what do ya expect when you kill internet traffic at 5
locations across 2 states?

the guys on the cisco list said that if you haven't done similar you are
lying about the 5 years experience on your resume ;-)


--Bryan

jxtmills:

I needed to clear a directory full of ".directories" and I issued:

rm -r .*

It seemed to find ".." awfully fast. That was a bad day, but, we restored most of the files in about an hour.

John T. Mills

bryanwun:

I thought I was in an application dir but instead was in /usr
and did chown -R to a low level user
on top of that I did not have a mksysb backup,
and the machine was in production.
It continued to function for the users ok
but most shell commands returned nothing
I had to find another machine with the same OS and Maint level
write a script to gather ownership permissions
then write another script to apply the permissions to the
damaged machine. this returned most functionality
but I still cant install new software with installp
it goes through the motions then nothing is changed.

alain:

hi every one

for me , i remember 2

1 - 'rm *;old' in / directory note ';' instead of '.'

2 - kill #pid of the informix process (oninit) and delete it (i dreamed)

jturner:

Variation on a theme ... the 'rm -r theme'

as a junior admin on AIX3.2.5, I had been left to my own devices to create
some housekeeping scripts - all my draft scripts being created and tested
in my home directory with a '.jt' suffix. After completing the scripts I
found that I had inadvertantly placed some copies in / with a .jt suffix.
Easy job then to issue a 'rm *.jt' in / and all would be well. Well it
would have been if I hadn't put a space between the * and the .jt. And the
worst thing of all, not being a touch typist and looking at the keys, I
glanced at the screen before hitting the enter key and realised with horror
what was going to happen and STILL my little finger continued to proceed
toward the enter key.

Talk about 'Gone in 60 seconds' - my life was at an end - over - finished
- perhaps I could park cars or pump gas for a living. Like other
correspondents a helpful senior admin was on hand to smile kindly and show
me how to restore from mksysb-fortunately taken daily on these production
systems. (Thanks Pete :))) )

To this day, rm -i is my first choice with multiple rm's just as a
test!!!!!!

Happy rm-ing :)

daguenet:

I know that one. Does any body remember when the rm man page had a
warning not do rm -rf / as root? How may systems were rebuild due that
blunder. Not that I have ever done something like that, nor will ever
admit to it:).

Aaron

cal.staples:

That is a no brainer!

First a little background. I cooked up a script called "killme" which
would ask for a search string then parse the process table and return a
list of all matches. If the list contained the processes you wanted to
kill then you would answer "Yes", not once, but twice just to be sure
you thought about it. This was very handy at times so I put it out on
all of our servers.

Some time had passed and I had not used it for a while when I had a need
to kill a group of processes. So I typed the command not realizing that
I had forgotten the scripts name. Of course I was on our biggest
production system at that time and everything stopped in it's tracks!

Unknown to me was that there is an AIX command called "killall" which is
what I typed.

From the MAN page: "The killall command cancels all processes that you
started, except those producing the killall process. This command
provides a convenient means of canceling all processes created by the
shell that you control. When started by a root user, the killall command
cancels all cancellable processes except those processes that started
it."

And it doesn't ask for confirmation or anything! Fortunately the
database didn't get corrupted and we were able to bring everything back
on line fairly quickly. Needless to say we changed the name of this
command so it couldn't be run so easily.

"Killall" is a great command for a programmer developing an application
which goes wild and he/she needs to kill the processes and retain
control of the system, but it is very dangerous in the real world!

Jeff Scott:

The silliest mistake? That had to be a permissions change on /bin. I got a
call from an Oracle DBA that the $ORACLE_HOME/bin no longer belonged to
oracle:dba. We never found out how that happened. I logged in to change the
permissions. I accidentally typed cd /oracle.... /bin (note the space
before /bin), then cheerfully entered the following command:

#chown -R oracle:dba ./*

The command did not climb up to root fortunately, but it really made a mess.
We ended up restoring /bin from a backup taken the previous evening.


Jeff Scott
Darwin Partners/EMC


Cal S.

tzhou:

crontab -r when I wanted to do crontab -e. See letters e and r are side by
side on the keyboard. I had 2 pages of crontab and had no backup on the
machine !

Jeff Scott:

I've seen the rm -fr effect before. There were no entries in any crontab.
Before switching to sudo, the company used a homegrown utility to grant
things root access. The server accepted ETLs from other databases, acting as
a data warehouse. This utility logged locally, instead of logging via
syslogd with local and remote logging. So, when the system was erased, there
really was no way to determine the actual cause. Most of the interactive
users shared a set of group accounts, instead of enforcing individual
accounts and using su - or sudo. The outage cost the company $8 million USD
due to lost labor that had to be repeated. Clearly, it was caused by a
cleanup script, but it is anyone's guess which one.

Technically, this was not a sysadmin blunder, but it underscores the need
for individual accounts and for remote logging of ALL uses of group
accounts, including those performed by scripts.

It also underscores the absolute requirement that all scripts have error
trapping mechanisms. In this case, the rm -fr was likely preceded by a cd to
a nonexistent directory. Since this was performed as root, the script cd'd
to / instead. The rm -fr then removed everything. The other possibility is
that it applied itself to another directory, but, again, root privileges
allowed it to climb up the directory tree to root.

Aneesh Mohan

Hi Siv,

The greatest blunder happened frm myside is created a lvol named /dev/vg00/lvol11 and did newfs on /dev/vg00/rlvol1 :)


The second greatest blunder from my side is corrupting the root filesystem by the below 2 steps :)


#lvchange -C n /dev/vg00/lvol03

#lvextend -L 100 /dev/vg00/lvol03

Cheers,
Aneesh

[Jun 09, 2010] Halloween - IT Admin Horror Stories

Zimbra Forums

Well ... I was working for a large multi-national running HP-UX systems and Oracle/SAP, and one day the clock struck twelve and the OS just started to disappear Down went SAP and Oracle like a sack of spuds!

Mayhem broke with the IT manager standing over my shoulder wanting to know what had happened ... I did not have a clue, and I could not even get onto the system as it was completely hosed! So the task of restoring the server began and after 30 minutes I had everything backup and running again. phewww

Until 1pm! The system disappearing again What the hell is going off, panic set it, this time I managed to keep a couple of sessions open to allow me to check the system.

And then it clicked .... I wonder .... Yep indeed, somebody had setup a cronjob AS ROOT, that attempted to 'cd' to a directory which then proceeded with a 'rm -rf *'

Though the ******* other admin did not verify that the directory existed before performing the remove! Well once we had restored the system again, the cronjob was removed, and we were all running fine again.

Morale of the story is to always protect root access and ensure you have adequate backups!!!

[Jun 06, 2010] NFS-export as a poor man backdoor

You can't log-in to the box if /etc/passwd or /etc/shadow are gone...

Ric Werme: Oct 10, 2007 18:05:52 -0700

Bill McGonigle once learned:

> rm lets you remove libc too.  DAMHINT.

I managed to salvage one system because I had NFS-exported / and
could gain write access from another system.

After that I often did the export before replacing humorless files
like libc.so and sometimes did the update with NFS.  It was a
struggle to remember to type the /mnt before the /etc/passwd so
I tried to cd to the target directory copy files in.

  -Ric Werme

[Jun 06, 2010] Security zeal ;-)

Good judgment comes from experience, experience comes from poor judgment. So do new jobs...Sometimes, even entirely new careers!

> On 10/9/07, John Abreau <[EMAIL PROTECTED]> wrote:
>> ... I looked in /bin for suspicious files, and that was the
>> first time I ever noticed the file [ . It looked suspicious, so
>> of course I deleted it. :-/

[Jun 05, 2010] Directory formerly known as /etc ;-)

Tom Buskey
Thu, 11 Oct 2007 06:18:27 -0700

On 10/10/07, Bill McGonigle <[EMAIL PROTECTED]> wrote:
>
>
> On Oct 9, 2007, at 17:31, Ben Scott wrote:
>
> >   Did you know 'rpm' will let you remove every package from the
> > system?
>
> rm lets you remove libc too.  DAMHINT.


I had a user call about a user supported system that was having issues.  We
explicitly do not support it and the users only use the root account.

He gave me the root account to login and I couldn't.  I went to his system &
looked around.  /etc was empty.  I told him he was fsked and he should ftp
any files he wanted to elsewhere & that he wouldn't be able to login again
or reboot.  In any event, we were not supporting it.

Sure enough, a help desk ticket came in for another admin, claiming that the
system got corrupted during bootup.  Why do users lie so often?  All it does
is obscure the problem...
Tom Buskey wrote:
> On 10/10/07, Bill McGonigle <[EMAIL PROTECTED]> wrote:
>   
>> On Oct 9, 2007, at 17:31, Ben Scott wrote:
>>
>>     
>>>   Did you know 'rpm' will let you remove every package from the
>>> system?
>>>       
>> rm lets you remove libc too.  DAMHINT.
>>     
>
>
> I had a user call about a user supported system that was having issues.  We
> explicitly do not support it and the users only use the root account.
>
> He gave me the root account to login and I couldn't.  I went to his system &
> looked around.  /etc was empty.  I told him he was fsked and he should ftp
> any files he wanted to elsewhere & that he wouldn't be able to login again
> or reboot.  In any event, we were not supporting it.
>
> Sure enough, a help desk ticket came in for another admin, claiming that the
> system got corrupted during bootup.  Why do users lie so often?  All it does
> is obscure the problem...
>   
Hmmm. Did you check lost+found? I've had similar symptoms only to
discover that there was indeed a bad sector that remapped all of /etc/
and some of /var and /usr. fsck didn't help much until I moved the drive
to another system and ran fsck there.

But you're right - if its not supported, then they'll have to go elsewhere to get this done.

BTW: My point is: the user may not have lied, but was just calling the shot as s/he saw them.

--Bruce

[May 26, 2010] Never ever play lose with /boot partition.

Here is the recent story connected with the upgrade of the OS (in this case Suse 10) to a new service pack (SP3)

After the upgrade sysadmin discovered that instead of /boot partition mounted there is none but there is a /boot directory directory on the boot partition populated by the update. This is so called "split kernel" situation when one (older) version of kernel boots and then it finds different (more recent) modules in /lib/modules and complains. There reason of this strange behavior of Suse update was convoluted and connected with LVM upgrade it contained after which LVM blocked mounting of /boot partition.

Easy, he thought. Let's boot from DVD, mount boot partition to say /boot2 and copy all files from the /boot directory to the boot partition.

And he did exactly that. To make things "clean" he first wiped the "old" boot partition and then copied the directory.

After rebooting the server he see GRUB prompt; it never goes to the menu. This is a production server and the time slot for the upgrade was 30 min. Investigation that involves now other sysadmins and that took three hours as server needed to be rebooted, backups retrieved to other server from the tape, etc, reveals that /boot directory did not contain a couple of critical files such as /boot/message and /boot/grub/menu.lst. Remember /boot partition was wiped clean.

BTWs /boot/message is an executable and grub stops execution of stpped /boot/grub/menu.lst.when it encounted instruction

gfxmenu (hd0,1)/message

Here is an actual /boot/grub/menu.lst.

# Modified by YaST2. Last modification on Thu May 13 13:43:35 EDT 2010
default 0
timeout 8
gfxmenu (hd0,1)/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 splash=silent showopts
initrd /initrd-2.6.16.60-0.54.5-smp

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 showopts ide=nodma apm=off acpi=off noresume edd=off 3
initrd /initrd-2.6.16.60-0.54.5-smp

Luckily there was a backup done before this "fix". Four hours later server was bootable again.

Sysadmin Stories Moral of these stories

October 19, 2009 | UnixNewbie.org

From: jarocki@dvorak.amd.com (John Jarocki)
Organization: Advanced Micro Devices, Inc.; Austin, Texas

- Never hand out directions on “how to” do some sysadmin task until the directions have been tested thoroughly.

– Corollary: Just because it works one one flavor on *nix says nothing about the others. ‘-}

– Corollary: This goes for changes to rc.local (and other such “vital” scripties.

2

From: ericw@hobbes.amd.com (Eric Wedaa)
Organization: Advanced Micro Devices, Inc.

-NEVER use ‘rm ‘, use rm -i ‘ instead.
-Do backups more often than you go to church.
-Read the backup media at least as often as you go to church.
-Set up your prompt to do a `pwd` everytime you cd.
-Always do a `cd .` before doing anything.
-DOCUMENT all your changes to the system (We use a text file
called /Changes)
-Don’t nuke stuff you are not sure about.
-Do major changes to the system on Saturday morning so you will
have all weekend to fix it.
-Have a shadow watching you when you do anything major.
-Don’t do systems work on a Friday afternoon. (or any other time
when you are tired and not paying attention.)

3

From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501

1) The “man” pages don’t tell you everything you need to know.
2) Don’t do backups to floppies.
3) Test your backups to make sure they are readable.
4) Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
5) Strenuously avoid systems with inadequate backup and restore programs wherever possible (thank goodness for “restore” with an “e”!).
6) If you’ve never done sysadmin work before, take a formal training class.
7) You get what you pay for. There’s no substitute for experience.
9) It’s a lot less painful to learn from someone else’s experience than your own (that’s what this thread is about, I guess

4

From: jimh@pacdata.uucp (Jim Harkins)
Organization: Pacific Data Products

If you appoint someone to admin your machine you better be willing to train them. If they’ve never had a hard disk crash on them you might want to ensure they understand hardware does stuff like that.

5

From: dvsc-a@minster.york.ac.uk
Organization: Department of Computer Science, University of York, England

Beware anything recursive when logged in as root!

6

From: matthews@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

*NEVER* move something important. Copy, VERIFY, and THEN delete.

7

From: almquist@chopin.udel.edu (Squish)
Organization: Human Interface Technology Lab (on vacation)

When you are doing some BIG type the command reread what you’ve typed about 100 times to make sure its sunk in (:

8

From: Nick Sayer

If / is full, du /dev.

9

From: TRIEMER@EAGLE.WESLEYAN.EDU
Organization: Wesleyan College

Never ever assume that some prepackaged script that you are running does anything right.

Admin Stories UnixNewbie.org

This is a modified list from “The Unofficial Unix Administration Horror Story Summary” by Anatoly Ivasyuk.

» Creative uses of rm
» How not to free up space on your drive
» Dealing with /dev files
» Making backups
» Blaming it on the hardware
» Partitioning the drives
» Configuring the system
» Upgrading the system
» All about file permissions
» Machine dependencies
» Miscellaneous stories (a.k.a. ‘oops’)
» What we have learned

My 10 UNIX Command Line Mistakes by Vivek Gite

with 90 comments

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory (it was done by previous sys admin and it was my first day at work) and mail spool of the user to be removed. I just wanted to remove the user account and I end up deleting everything (note -r was activated via deluser.conf):
userdel foo

Rebooted Solaris Box

On Linux killall command kill processes by name (killall httpd). On Solaris it kill all active processes. As root I killed all process, this was our main Oracle db box:
killall process-name

Destroyed named.conf

I wanted to append a new zone to /var/named/chroot/etc/named.conf file., but end up running:
./mkzone example.com > /var/named/chroot/etc/named.conf

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I end up deleting entire backup (note -c switch instead of -x):

cd /mnt/bacupusbharddisk

tar -zcvf project.tar.gz functions

I had no backup. Similarly I end up running rsync command and deleted all new files by overwriting files from backup set (now I’ve switched to rsnapshot)
rsync -av -delete /dest /src
Again, I had no backup.

Deleted Apache DocumentRoot

I had sym links for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about symlink issue. To save disk space, I ran rm -rf on http directory. Luckily, I had full working backup set.

Accidentally Changed Hostname and Triggered False Alarm

Accidentally changed the current hostname (I wanted to see current hostname settings) for one of our cluster node. Within minutes I received an alert message on both mobile and email.
hostname foo.example.com

Public Network Interface Shutdown

I wanted to shutdown VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:
ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update firewall rules. After a quick kernel upgrade, I had rebooted the box. I had to call remote data center tech to reset firewall settings. (now I use firewall reset script to avoid lockdowns).

Typing UNIX Commands on Wrong Box

I wanted to shutdown my local Fedora desktop system, but I issued halt on remote server (I was logged into remote box via SSH):
halt
service httpd stop

Wrong CNAME DNS Entry

Created a wrong DNS CNAME entry in example.com zone file. The end result - a few visitors went to /dev/null:
echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I’ve learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is only tool that guaranties recovery under all conditions. (see Torture-testing Backup and Archive Programs paper).
  3. Never use rsync with single backup directory. Create a snapshots using rsync or rsnapshots.
  4. Use CVS to store configuration files.
  5. Wait and read command line again before hitting the dam [Enter] key.
  6. Use your well tested perl / shell scripts and open source configuration management software such as puppet, Cfengine or Chef to configure all servers. This also applies to day today jobs such as creating the users and so on.

Mistakes are the inevitable, so did you made any mistakes that have caused some sort of downtime? Please add them into the comments below.

Jon 06.21.09 at 2:42 am
My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed
rm -rf /tmp/foo/bar/ *
instead of
rm -rf /tmp/foo/bar/*
The system doesn’t run very will without all of it’s libraries……
georgesdev 06.21.09 at 9:15 am
never type anything such as:
rm -rf /usr/tmp/whatever
maybe you are going to type enter by mistake before the end of the line. You would then for example erase all your disk starting on /.

if you want to use -rf option, add it at the end on the line:
rm /usr/tmp/whatever -rf
and even this way, read your line twice before adding -rf

3ToKoJ 06.21.09 at 9:26 am
public network interface shutdown … done
typing unix command on wrong box … done
Delete apache DocumentRoot … done
Firewall lockdone … done with a NAT rule redirecting the configuration interface of the firewall to another box, serial connection saved me

I can add, being trapped by aptitude keeping tracks of previously planned — but not executed — actions, like “remove slapd from the master directory server”

UnixEagle 06.21.09 at 11:03 am

Rebooted the wrong box
While adding alias to main network interface I ended up changing the main IP address, the system froze right away and I had to call for a reboot
Instead of appending text to Apache config file, I overwritten it’s contents
Firewall lockdown while changing the ssh port
Wrongfully run a script contained recursive chmod and chown as root on / caused me a downtime of about 12 hours and a complete re-install

Some mistakes are really silly, and when they happen, you don’t believe your self you did that, but every mistake, regardless of it’s silliness, should be a learned lesson.
If you did a trivial mistake, you should not just overlook it, you have to think of the reasons that made you did it, like: you didn’t have much sleep or your mind was confused about personal life or …..etc.

I like Einstein’s quote, you really have to do mistakes to learn.

Selected Comments
7 smaramba 06.21.09 at 11:31 am
typing unix command on wrong box and firewall lockdown are all time classics: been there, done that.
but for me the absolute worst, on linux, was checking a mounted filesystem on a production server…

fsck /dev/sda2

the root filesystem was rendered unreadable. system down. dead. users really pissed off.
fortunately there was a full backup and the machine rebooted within an hour.

8 od 06.21.09 at 12:50 pm
“Typing UNIX Commands on Wrong Box”

Yea, I did that one too. Wanted to shut down my own vm but I issued init 0 on a remote server which I accessed via ssh. And oh yes, it was the production server.

10 sims 06.22.09 at 2:23 am
Funny thing, I don’t remember typing typing in the wrong console. I think that’s because I usually have the hostname right there. Fortunately, I don’t do the same things over and over again very much. Which means I don’t remember command syntax for all but most used commands.

Locking myself out while configuring the firewall – done – more than once. It wasn’t really a CLI mistake though. Just being a n00b.

georgesdev, good one. I usually:

ls -a /path/to/files
to double check the contents
then up arrowkey homekey hit del a few times and type rm. I always get nervous with rm sitting at the prompt. I’ll have to remember that -rf at the end of the line.

I always make mistakes making links. I can never remember the syntax. :/

Here’s to less CLI mistakes… (beer)

Grant D. Vallance 06.22.09 at 7:56 am
A couple of days ago I typed and executed (as root): rm -rf /* on my home development server. Not good. Thankfully, the server at the time had nothing important on it, which is why I had no backups …

I am still not sure *why* I did when I have read about all the warnings about using this command. (A dyslexic moment with the syntax?)

Ah well, a good lesson learned. At least it was not the disaster it could of been. I shall be *very* paranoid about this command in the future.

Joren 06.22.09 at 9:30 am
I wanted to remove the subfolder etc from the /usr/local/matlab/ directory. So I accidentally added the ‘/’ symbol in a force of habit when going to the /etc folder and I typed from the /usr/local/matlab directory:

sudo rm /etc

instead of

sudo rm etc

Without the entire /etc folder the computer didn't work anymore (which was to be expected ofcourse) and I ended up reinstalling my computer.

Robsteranium 06.22.09 at 11:05 am
Aza Rashkin explains how habituation can lead to stupid errors – confirming “yes I’m sure/ overwrite file etc” automatically without realising it. Perhaps rm and the > command need an undo/ built-in backup…
Ramaswamy 06.22.09 at 10:47 am
Deleted the files
I used place some files in /tmp/rama and some conf files at /home//httpd/conf file
I used to swap between these two directories by “cd -”
Executed the command rm -fr ./*
supposed to remove the files at /tmp/rama/*, but ended up by removing the file at /home//httpd/conf/*, with out any backup
Yonitg 06.23.09 at 8:06 am
Great post !
I did my share of system mishaps,
killing servers in production, etc.
the most emberassing one was sending 70K users the wrong message.
or beter yet, telling the CEO we have a major crysis, gathering up many people to solve it, and finding that it is nothing at all while all the management is standing in my cube.
Solaris 06.23.09 at 8:37 pm
Firewall lock out: done.
Command on wrong server: done.

And the worst: update and upgrade while some important applications were running, of
course on a production server.. as someone mentioned the system doesn’t run very well
without all of its original libraries :)

Peko 06.30.09 at 8:46 am
I invented a new one today.

Just assuming that a [-v] option stands for –verbose

Yep, most of the time. But not on a [pkill] command.
[pkill -v myprocess] will kill _any_ process you can kill — except those whose name contains “myprocess”. Ooooops. :-!
(I just wanted pkill to display “verbose” information when killing processes)

Yes, I know. Pretty dumb thing. Lesson learned ?

I would suggest adding another critical rule to your list:
” Read The Fantastic Manual — First” ;-)

Jai Prakash 07.03.09 at 1:43 pm
Mistake 1:

My Friend tried to see last reboot time and mistakenly executed command “last | reboot” instead of “last | grep reboot”

It made a outage on Production DB server.

Mistake 2:

Another guy, wants to see the FQDN on solaris box and executed “hostname -f”
It changed the hostname name to “-f” and clients faced lot of connectivity issues due to this mistake.
[ hostname -f is used in Linux to see FQDN name but it solaris its usage is different ]

32 Name 07.04.09 at 5:20 pm
Worse thing i’ve done so far, It accidentally dropped a MySQL database containing 13k accounts for a gameserver :D

Luckily i had backups but took a little while to restore,

33 Vince Stevenson 07.06.09 at 6:23 pm
I was dragged into a meeting one day and forgot to secure my Solaris session. A colleague and former friend did this: alias ll=’/usr/sbin/shutdown -g5 -i5 “Bye bye Vince”‘ He must have thought that I was logged into my personal host machine, not the company’s cashcow server. What happens when it all goes wrong. Secure your session… Rgds Vince
Bjarne Rasmussen 07.07.09 at 7:56 pm
well, tried many times, the crontab fast typing failure…

crontab -r instead of -e
e for edit
r for remove..

now i always use -l for list before editing…

35 Ian 07.08.09 at 4:15 am
Made a script that automatically removes all files from a directory. Now, rather than making it logically (this was early on) I did it stupidly.

cd /tmp/files
rm ./*

Of course, eventually someone removed /tmp/files..

36 shlomi 07.12.09 at 9:21 am
Hi

On My RHEL 5 sever I create /tmp mount point to my storage and tmpwatch script that run under cron.daily removes files which have not been accessed 12 hours !!!

52 foo 09.25.09 at 9:41 pm
wanted to kill all the instances of a service on HP-UX (pkill like util not available)…

# ps -aef | grep -v foo | awk {print’$2′} | xargs kill -9

Typed “grep -v” instead of “grep -i” and u can guess what happened :(

LinAdmin 09.29.09 at 2:38 pm
Typing rm -Rf /var/* in the wrong box. Recovered in few minutes by doing scp root@healty_box:/var . – the ssh session on the just broken box was still open . This saved my life :-P
Deltaray 10.03.09 at 4:37 am
Like Peko above, I too once ran pkill with the -v option and ended up killing everything else. This was on a very important enterprise production machine and I reminded myself the hard lesson of making sure you check man pages before trying some new option.

I understand where pkill gets its -v functionality from (pgrep and thus from grep), but honestly I don’t see what use of -v would be for pkill. When do you really need to say something like kill all processes except this one? Seems reckless. Maybe 1 in a million times you’d use it properly, but probably most of the time people just get burned by it. I wrote to the author of pkill about this but never heard anything back. Oh well.

Guntram 10.05.09 at 7:51 pm
This is why i never use pkill; always use something like “ps ….| grep …” and, when it’s ok, type a ” | awk ‘{print $2}’ | xargs kill” behind it. But, as a normal user, something like “pkill -v bash” might make perfect sense if you’re sitting at the console (so you can’t just switch to a different window or something) and have a background program rapidly filling your screen.

Worst thing that ever happened to me:
Our oracle database runs some rdbms jobs at midnight to clean out very old rows from various tables, along the line of “delete from XXXX where last_access < sysdate-3650". One sunday i installed ntp to all machines, made a start script that does an ntpdate first, then runs ntpd. Tested it:
$ date 010100002030; /etc/init.d/ntpd start; date
Worked great, current time was ok.
$ date 010100002030; reboot
After the machine was back up i noticed i had forgotten the /etc/rc*.d symlinks. But i never thought of the database until a lot of people were very angry monday morning. Fortunately, there's an automated backup every saturday.

sqn 10.07.09 at 6:05 pm
tried to lockout a folder by removing it’s attributes (chmod 000) as a beginner and wanted to impress myself, did:

# cd /folder
# chmod 000 .. -R
used two points instead of one, and of course the system used the upper folder witch is / for modifying attributes
ended up getting out of my home and go the the server to reset the permissions back to normal. I got lucky because i just did a dd to move the system from one HDD to another and I haven’t deleted the old one yet :)
And of course the classical configuring the wrong box, firewall lockout :)

dev 10.15.09 at 10:15 am
while I was working on many ssh window:

rm -rf *

I intended to remove all files under a site, after changing the current working
directory, then replacing with the stable one

wrong window, wrong server, and I did it on production server xx((
just aware the mistakes 1.5 after typing [ENTER]
no backup. maybe luckily, the site was keep running smooth..

it seems that the deleted files were such images, or media contents
1-2 secs incidental removal in heavy machine gave me loss approx. 20 MB

58 LMatt 10.17.09 at 3:36 pm
In a hurry to get a db back up for a user, I had to parse through nearly a several terabyte .tar.gz for the correct SQL dumpfile. So, being the good sysadmin, I locate it within an hour, and in my hurry to get db up for the client who was on the phone the entire time:
mysql > dbdump.sql
Fortunately I didn’t sit and wait all that long before checking to make sure that the database size was increasing, and the client was on hold when I realized my error.
mysql > dbdump.sql — SHOULD be —
mysql < dbdump.sql
I had just sent stdout of the mysql CLI interface to a file named dbdump.sql. I had to re-retrieve the damn sqldump file and start over!
BAH! FOILED AGAIN!
Mr Z 10.18.09 at 5:13 am
After 10+ years I’ve made a lot of mistakes. Early on I got myself in the habit of testing commands before using them. For instance:

ls ~usr/tar/foo/bar then rm -f ~usr/tar/foo/bar – make sure you know what you will delete

When working with SSH, always make sure what system you are on. Modifying system prompts generally eliminates all confusion there.

It’s all just creating a habit of doing things safely… at least for me.

60 chris 10.22.09 at 11:15 pm
cd /var/opt/sysadmin/etc
rm -f /etc

note the /etc. It was supposed to be rm -rf etc

Jonix 10.23.09 at 11:18 am
The deadline were coming too close to comfort, I’d worked for too looong hours for months.

We were developing a website, and I was in charge of developing the CGI scripts which generated a lot of temporary files, so on pure routine i worked in “/var/www/web/” and entered “rm temp/*” which i misspelled at some point as “rm tmp/ *”. I kind of wondered, in my overtired brain, what took so long for the delete to finish, it should only be 20 small files that is should delete.

The very next morning the paying client was to fly in and pay us a visit, and get a demonstration of the project.

P.S Thanks to Subversion and opened files in Emacs buffers I managed to get almost all files back, and I had rewritten the missing files before the morning.

Cougar 10.29.09 at 3:00 pm
rm * in one of my project directory (no backup). I planned to do rm *~ to delete backup files but used international keyboard where space was required after ~ (dead key for letters like õ)..
BattleHardened 10.30.09 at 1:33 am
Some of my more choice moments:

postsuper -d ALL (instead of -r ALL, adjacent keys – 80k spooled mails gone). No recovery possible – ramfs :/

Had a .pl script to delete mails in .Spam directories older than X days, didn’t put in enough error checking, some helpdesk guy provisioned a domain with a leading space in it and script rm’d (rm -rf /mailstore/ domain.com/.Spam/*) the whole mailstore. (250k users – 500GB used) – Hooray for 1 day old backup

chown -R named:named /var/named when there was a proc filesystem under /var/named/proc. Every running process on system got chown.. /bin/bash, /usr/sbin/sshd and so on. Took hours of manual find’s to fix.

.. and pretty much all the ones everyone else listed :)

You break it, you fix it.

Shantanu Oak 11.03.09 at 11:20 am
scp overwrites an existing file if exists on the destination server. I just used the following command and soon realised that it has replaced the “somefile” of that server!!
scp somefile root@192.168.0.1:/root/
thatguy 11.04.09 at 3:37 pm
Hmm, most of these mistakes I have done – but my personal favourite.

# cd /usr/local/bin
# ls -l -> that displayed some binaries that I didn’t need / want.
# cd ..
# rm -Rf /bin
– Yeah, you guessed it – smoked the bin folder ! The system wasn’t happy after that. This is what happens when you are root and do something without reading the command before hitting [enter] late at night. First and last time …

Gurudatt 11.06.09 at 12:05 am
chmod 777 /

never try this, if u do so even root will not be able to login

69 richard 11.09.09 at 6:59 pm
so in recovering a binary backup of a large mysql database, produced by copying and tarballing ‘/var/lib/mysql’, I untarred it in tmp, and did the recovery without incident. (at 2am in the morning, when it went down). Feeling rather pleased with myself for suck a quick and successful recovery, I went to deltete the ‘var’ directory in ‘/tmp’ . I wanted to type:
rm -rf var/

instead I typed :
rm -rf /var

unfortunatley I didnt spot it for a while, and not until after did I realize that my on-site backups were stored in /var/backups …
IT was a truly miserable few days that followed while I pieced together the box from SVN and various other sources …

Derek 11.12.09 at 10:26 pm
Heh,
These were great.
I have many above.. my first was
reboot
….Connection reset by peer. Unfortunately, I thought I was rebooting my desktop. Luckily, the performance test server I was on hadn’t been running tests(normally they can take 24-72 hours to run)..

symlinks… ack! I was cleaning up space and thought weird.. I don’t remember having a bunch of databases in this location.. rm -f * unfortunately, it was a symlink to my /db slice, that DID have my databases, friday afternoon fun.

I did a similar with being in the wrong directory… deleted all my mysql binaries.

This was also after we had acquired a company and the same happened on one of their servers months before.. we never realized that, and the server had an issue one dady… so we rebooted. Mysql had been running in memory for months, and upon reboot there was no more mysql. Took us a while to figure that out because no one had thought that the mysql binaries were GONE! Luckily I wasn’t the one who had deleted the binaries, just got to witness the aftermath.

jason 11.18.09 at 4:19 pm
The best ones are when you f*ck up and take down the production server and are then asked to investigate what happened and report on it to management….
Mr Z 11.19.09 at 3:02 pm
@jason
That sort of situation leads to this tee-shirt
http://www.rfcafe.com/business/images/Engineer%27s%20Troubleshooting%20Flow%20Chart.jpg
M.S. Babaei 08.01.09 at 3:39 am
once upon a time mkfs is killing me on ext3 partition I want
instead of
mkfs.ext3 /dev/sda1
I did this
mkfs.ext3 /dev/sdb1

I never forget what I lost??

Simon B 08.07.09 at 2:47 pm
Whilst a colleague was away from their keyboard I entered :


rm -rf *

… but did not press enter on the last line (as a joke). I expected them to come back and see it as a joke and rofl….back space… The unthinkable happened, the screen went to sleep and they banged the enter key to wake it up a couple of times. We lost 3 days worth of business and some new clients. estimated cost $50,000+

ginzero 08.17.09 at 5:10 am
tar cvf /dev/sda1 blah blah…
47 Kevin 08.25.09 at 10:50 am
tar cvf my_dir/* dir.tar
and your write your archive in the first file of the directory …
48 ST 09.17.09 at 10:14 am
I’ve done the wrong server thing. SSH’d into the mailserver to archive some old messages and clear up space.
Mistake #1: I didn’t logoff when I was done, but simply minimized the terminal and kept working
Mistake#2: At the end of the day I opened what I thought was a local terminal and typed:
/sbin/shutdown -h now
thinking I was bringing down my laptop. The angry phone calls started less than a minute later. Thankfully, I just had to run to the server room and press power.

I never thought about using CVS to backup config files. After doing some really dumb things to files in /etc (deleting, stupid edits, etc), I started creating a directory to hold original config files, and renaming those files things like httpd.conf.orig or httpd.conf.091709

As always, the best way to learn this operating system is to break it…however unintentionally.

49 Wolf Halton 09.21.09 at 3:16 pm
Attempting to update a Fedora box over the wire from Fedora8 to Fedora9
I updated the repositories to the Fedora9 repos, and ran
# yum -y upgrade
I have now tested this on a couple of boxes and without exception the upgrades failed with many loose older-version packages and dozens of missing dependencies, as well as some fun circular dependencies which cannot be resolved. By the time it is done, eth0 is disabled and a reboot will not get to the kernel-choice stage.

Oddly, this kind of update works great in Ubuntu.

50 Ruben 09.24.09 at 8:23 pm
while cleaning the backup hdd late the night, a ‘/’ can change everything…

“rm -fr /home” instead of “rm -fr home/”

It was a sleepless night, but thanks to Carlo Wood and his ext3grep I rescued about 95% of data ;-)

51 foo 09.25.09 at 9:36 pm
# svn add foo
—-> Added 5 extra files that were not to be commited, so I decided to undo the change,delete the files and add to svn again…..
# svn rm foo –force

and it deleted the foo directory from disk :(…lost all my code just before the dead line :(

52 foo 09.25.09 at 9:41 pm
wanted to kill all the instances of a service on HP-UX (pkill like util not available)…

# ps -aef | grep -v foo | awk {print’$2′} | xargs kill -9

Typed “grep -v” instead of “grep -i” and u can guess what happened :(

53 LinAdmin 09.29.09 at 2:38 pm
Typing rm -Rf /var/* in the wrong box. Recovered in few minutes by doing scp root@healty_box:/var . – the ssh session on the just broken box was still open . This saved my life :-P
54 Deltaray 10.03.09 at 4:37 am
Like Peko above, I too once ran pkill with the -v option and ended up killing everything else. This was on a very important enterprise production machine and I reminded myself the hard lesson of making sure you check man pages before trying some new option.

I understand where pkill gets its -v functionality from (pgrep and thus from grep), but honestly I don’t see what use of -v would be for pkill. When do you really need to say something like kill all processes except this one? Seems reckless. Maybe 1 in a million times you’d use it properly, but probably most of the time people just get burned by it. I wrote to the author of pkill about this but never heard anything back. Oh well.

55 Guntram 10.05.09 at 7:51 pm
This is why i never use pkill; always use something like “ps ….| grep …” and, when it’s ok, type a ” | awk ‘{print $2}’ | xargs kill” behind it. But, as a normal user, something like “pkill -v bash” might make perfect sense if you’re sitting at the console (so you can’t just switch to a different window or something) and have a background program rapidly filling your screen.

Worst thing that ever happened to me:
Our oracle database runs some rdbms jobs at midnight to clean out very old rows from various tables, along the line of “delete from XXXX where last_access < sysdate-3650". One sunday i installed ntp to all machines, made a start script that does an ntpdate first, then runs ntpd. Tested it:
$ date 010100002030; /etc/init.d/ntpd start; date
Worked great, current time was ok.
$ date 010100002030; reboot
After the machine was back up i noticed i had forgotten the /etc/rc*.d symlinks. But i never thought of the database until a lot of people were very angry monday morning. Fortunately, there's an automated backup every saturday.

56 sqn 10.07.09 at 6:05 pm
tried to lockout a folder by removing it’s attributes (chmod 000) as a beginner and wanted to impress myself, did:

# cd /folder
# chmod 000 .. -R
used two points instead of one, and of course the system used the upper folder witch is / for modifying attributes
ended up getting out of my home and go the the server to reset the permissions back to normal. I got lucky because i just did a dd to move the system from one HDD to another and I haven’t deleted the old one yet :)
And of course the classical configuring the wrong box, firewall lockout :)

57 dev 10.15.09 at 10:15 am
while I was working on many ssh window:

rm -rf *

I intended to remove all files under a site, after changing the current working
directory, then replacing with the stable one

wrong window, wrong server, and I did it on production server xx((
just aware the mistakes 1.5 after typing [ENTER]
no backup. maybe luckily, the site was keep running smooth..

it seems that the deleted files were such images, or media contents
1-2 secs incidental removal in heavy machine gave me loss approx. 20 MB

58 LMatt 10.17.09 at 3:36 pm
In a hurry to get a db back up for a user, I had to parse through nearly a several terabyte .tar.gz for the correct SQL dumpfile. So, being the good sysadmin, I locate it within an hour, and in my hurry to get db up for the client who was on the phone the entire time:
mysql > dbdump.sql
Fortunately I didn’t sit and wait all that long before checking to make sure that the database size was increasing, and the client was on hold when I realized my error.
mysql > dbdump.sql — SHOULD be —
mysql < dbdump.sql
I had just sent stdout of the mysql CLI interface to a file named dbdump.sql. I had to re-retrieve the damn sqldump file and start over!
BAH! FOILED AGAIN!
59 Mr Z 10.18.09 at 5:13 am
After 10+ years I’ve made a lot of mistakes. Early on I got myself in the habit of testing commands before using them. For instance:

ls ~usr/tar/foo/bar then rm -f ~usr/tar/foo/bar – make sure you know what you will delete

When working with SSH, always make sure what system you are on. Modifying system prompts generally eliminates all confusion there.

It’s all just creating a habit of doing things safely… at least for me.

60 chris 10.22.09 at 11:15 pm
cd /var/opt/sysadmin/etc
rm -f /etc

note the /etc. It was supposed to be rm -rf etc

61 Jonix 10.23.09 at 11:18 am
The deadline were coming too close to comfort, I’d worked for too looong hours for months.

We were developing a website, and I was in charge of developing the CGI scripts which generated a lot of temporary files, so on pure routine i worked in “/var/www/web/” and entered “rm temp/*” which i misspelled at some point as “rm tmp/ *”. I kind of wondered, in my overtired brain, what took so long for the delete to finish, it should only be 20 small files that is should delete.

The very next morning the paying client was to fly in and pay us a visit, and get a demonstration of the project.

P.S Thanks to Subversion and opened files in Emacs buffers I managed to get almost all files back, and I had rewritten the missing files before the morning.

62 Cougar 10.29.09 at 3:00 pm
rm * in one of my project directory (no backup). I planned to do rm *~ to delete backup files but used international keyboard where space was required after ~ (dead key for letters like õ)..
63 BattleHardened 10.30.09 at 1:33 am
Some of my more choice moments:

postsuper -d ALL (instead of -r ALL, adjacent keys – 80k spooled mails gone). No recovery possible – ramfs :/

Had a .pl script to delete mails in .Spam directories older than X days, didn’t put in enough error checking, some helpdesk guy provisioned a domain with a leading space in it and script rm’d (rm -rf /mailstore/ domain.com/.Spam/*) the whole mailstore. (250k users – 500GB used) – Hooray for 1 day old backup

chown -R named:named /var/named when there was a proc filesystem under /var/named/proc. Every running process on system got chown.. /bin/bash, /usr/sbin/sshd and so on. Took hours of manual find’s to fix.

.. and pretty much all the ones everyone else listed :)

You break it, you fix it.

64 PowerPeeCee 11.02.09 at 1:01 am
As an Ubuntu user for a while, Y’all are giving me nightmares, I will make extra discs and keep them handy. Eek! I am sure that I will break it somehow rather spectacularly at some point.
65 mahelious 11.02.09 at 10:44 pm
second day on the job i rebooted apache on the live web server, forgetting to first check the cert password. i was finally able to find it in an obscure doc file after about 30 minutes. the resulting firestorm of angry clients would have made Nero proud. I was very, very surprised to find out I still had a job after that debacle.

lesson learned: keep your passwords secure, but handy

66 Shantanu Oak 11.03.09 at 11:20 am
scp overwrites an existing file if exists on the destination server. I just used the following command and soon realised that it has replaced the “somefile” of that server!!
scp somefile root@192.168.0.1:/root/
67 thatguy 11.04.09 at 3:37 pm
Hmm, most of these mistakes I have done – but my personal favourite.

# cd /usr/local/bin
# ls -l -> that displayed some binaries that I didn’t need / want.
# cd ..
# rm -Rf /bin
– Yeah, you guessed it – smoked the bin folder ! The system wasn’t happy after that. This is what happens when you are root and do something without reading the command before hitting [enter] late at night. First and last time …

68 Gurudatt 11.06.09 at 12:05 am
chmod 777 /

never try this, if u do so even root will not be able to login

69 richard 11.09.09 at 6:59 pm
so in recovering a binary backup of a large mysql database, produced by copying and tarballing ‘/var/lib/mysql’, I untarred it in tmp, and did the recovery without incident. (at 2am in the morning, when it went down). Feeling rather pleased with myself for suck a quick and successful recovery, I went to deltete the ‘var’ directory in ‘/tmp’ . I wanted to type:
rm -rf var/

instead I typed :
rm -rf /var

unfortunatley I didnt spot it for a while, and not until after did I realize that my on-site backups were stored in /var/backups …
IT was a truly miserable few days that followed while I pieced together the box from SVN and various other sources …

70 Henry 11.10.09 at 6:00 pm
Nice post and familiar with the classic mistakes.

My all time classic:
- rm -rf /foo/bar/ * [space between / and *]

Be carefull with clamscan’s:
–detect-pua=yes –detect-structured=yes –remove=no –move=DIRECTORY

I chose to scan / instead of /home/user and I ended with a screwed apt, libs, and missing files from allover the place :D I luckily had –log=/home/user/scan.log and not console output, so I could restore the moved files one by one
next time I use –copy instead of move and never start with /

these 2 happened at home, while working I’ve learned a long time ago (SCO Unix times) to backup files before rm :D

71 Derek 11.12.09 at 10:26 pm
Heh,
These were great.
I have many above.. my first was
reboot
….Connection reset by peer. Unfortunately, I thought I was rebooting my desktop. Luckily, the performance test server I was on hadn’t been running tests(normally they can take 24-72 hours to run)..

symlinks… ack! I was cleaning up space and thought weird.. I don’t remember having a bunch of databases in this location.. rm -f * unfortunately, it was a symlink to my /db slice, that DID have my databases, friday afternoon fun.

I did a similar with being in the wrong directory… deleted all my mysql binaries.

This was also after we had acquired a company and the same happened on one of their servers months before.. we never realized that, and the server had an issue one dady… so we rebooted. Mysql had been running in memory for months, and upon reboot there was no more mysql. Took us a while to figure that out because no one had thought that the mysql binaries were GONE! Luckily I wasn’t the one who had deleted the binaries, just got to witness the aftermath.

72 Ahmad Abubakr 11.13.09 at 2:23 pm
My favourite :)

sudo chmod 777 /
73 jason 11.18.09 at 4:19 pm
The best ones are when you f*ck up and take down the production server and are then asked to investigate what happened and report on it to management….
74 Mr Z 11.19.09 at 3:02 pm
@jason
That sort of situation leads to this tee-shirt
http://www.rfcafe.com/business/images/Engineer%27s%20Troubleshooting%20Flow%20Chart.jpg
75 John 11.20.09 at 2:29 am
Clearing up space used by no-longer-needed archive files:

# du -sh /home/myuser/oldserver/var
32G /home/myuser/oldserver/var
# cd /home/myuser/oldserver
# rm -rf /var

The box ran for 6 months after doing this, by the way, until I had to shut it down to upgrade the RAM…although of course all the mail, Web content, and cron jobs were gone. *sigh*

76 Erick Mendes 11.24.09 at 7:55 pm
Yesterday I’ve locked my self outside of a switch I was setting up. lol
I was setting up a VLAN on it and my PC was directly connected to it thru one of the ports I messed up.

Had to get thru serial to undo vlan config.

Oh, the funny thing is that some hours later my boss just made the same mistake lol

77 John Kennedy 11.25.09 at 2:09 pm
Remotely logged into a (Solaris) box at 3am. Made some changes that required a reboot. Being too lazy to even try and remember the difference between Solaris and Linux shutdown commands I decided to use init. I typed init 0…No one at work to hit the power switch for me so I had to make the 30 minute drive into work.
This one I chalked up to being a noob…I was on an XTerminal which was connected to a Solaris machine. I wanted to reboot the terminal due to display problems…Instead of just powering off the terminal I typed reboot on the commandline. I was logged in as root…
78 bram 11.27.09 at 8:45 pm
on a remote freebsd box:

[root@localhost ~]# pkg_delete bash

The next time i tried to log in, it kept on telling me access denied… hmmmm… ow sh#t

(since my default shell in /etc/passwd was still pointing to a non-existent /usr/local/bin/bash, i would never be able to log in)

79 Li Tai Fang 11.29.09 at 8:02 am
On a number of occasions, I typed “rm” when I wanted to type “mv,” i.e., I wanted to rename a file, but instead I deleted it.
80 vmware 11.30.09 at 4:59 am
last | reboot
instead
last | grep reboot
81 ColtonCat 12.02.09 at 4:21 am
I have a habit of renaming config files I work on to the same file with a “~” at the end for a backup, so that I can roll back if I make a mistake, and then once all is well I just do a rm *~. Trouble happened to me when I accidentally typed rm * ~ and as Murphy would have it a production asterisk telephony server.
82 bye bye box 12.02.09 at 7:54 pm
Slicked the wrong box in a production data center at my old job.

In all fairness it was labeled wrong on the box and kvm ID.

Now I’ve learned to check hostname before decom’ing anything.

83 Murphy's Red 12.02.09 at 9:11 pm
Running out of diskspace while updating a kernel on FreeBSD.

Not fully inserting a memory module on my home machine which shortcircuited my motherboard.

On several occasions i had to use a rdesktop session to windows machine and use putty to connect to a machine (yep.. i know it sounds weird ;-) ) Anyway.. text copied in windows is stored differently than text copied in the shell. Why changing a root passwd on a box, (password copied using putty) i just control v-ed it and logged off. I had to go to the datacenter to boot into single user mode to acces the box again.

Using the same crappy setup, i copied some text in windows, accidently hit control-v in the putty screen of the box i was logged into as root, the first word was halt, the last character an enter.

Configuring nat on the wrong interface while connected through ssh

Adding a new interface on a machine, filled in the details of a home network in kudzu which changed the default gateway to 192.168.1.1 on the main interface. Only checking the output of ifconfig but not the traffic or gateway and dns settings.

fsck -y on filesystem without unmounting it

84 ehrichweiss 12.03.09 at 6:55 pm
I’ve definitely rebooted the wrong box, locked myself out with firewall rules, rm -rf’ed a huge portion of my system. I had my infant son bang on the keyboard for my SGI Indigo2 and somehow hit the right key combo to undo a couple of symlinks I had created for /usr(I had to delete them a couple of times in the process of creating them) AND cleared the terminal/history so I had no idea what was going on when I started getting errors. I had created the symlink a week prior so it took me a while to figure out what I had to do to get the system operational again.

My best and most recent FUBAR was when I was backing up my system(I have horrible, HORRIBLE luck with backups to the point I don’t bother doing them any more for the most part); I was using mondorescue and backing the files up to an NTFS partition I had mounted under /mondo and had done a backup that wouldn’t restore anything because of an apostrophe or single quote in one of the file names was backing up, so I had to remove the files causing the problem which wasn’t really a biggie and did the backup, then formatted the drive as I had been planning………..only to discover that I hadn’t remounted the NTFS partition under /mondo as I had thought and all 30+ GB of data was gone. I attempted recovery several times but it was just gone.

85 fly 12.04.09 at 3:55 pm
my personal favorite, a script somehow created few dozens file in /etc dit … all named ??somestrings so i promplty did rm -rf ??* … (at the point when i hit [enter] i remebered that ? is a wildchar … Too late :)) luckily that was my home box … but reinstall was imminent :)
86 bips 12.06.09 at 9:56 am
il m’est arrivé de farie :
crontab -r

au lieu de :
crontab -e

ce qui a eu pour effet de vider la liste crontab…

87 bips 12.06.09 at 9:59 am
also i’ve done

shutdown -n
(i thaught -n meant “now”)

which had for consequence to reboot the server without networking…

88 Deltaray 12.06.09 at 4:51 pm
bips: What does shutdown -n do? Its not in the shutdown man page.
89 miss 12.14.09 at 8:42 am
crontab -e vs crontab -r is the best :)
90 marty 12.18.09 at 12:21 am
the extra space before a * is one I’ve done before only the root cause was tab completion.

#rm /some/directory/FilesToBeDele[TAB]*

Thinking there were multiple files that began with FilesToBeDele. Instead, there was only one and pressing tab put in the extra space. Luckily I was in my home dir, and there was a file with write only permission so rm paused to ask if I was sure. I ^C and wiped my brow. Of course the [TAB] is totally unneccesary in this instance, but my pinky is faster than my brain.

Copy Your Linux Install to a Different Partition or Drive

Jul 9, 2009
If you need to move your Linux installation to a different hard drive or partition (and keep it working) and your distro uses grub this tech tip is what you need.

To start, get a live CD and boot into it. I prefer Ubuntu for things like this. It has Gparted. Now follow the steps outlined below.

Copying

Configuration

Install Grub

That’s it! You should now have a bootable working copy of your source drive on your destination drive! You can use this to move to a different drive, partition, or filesystem.

Related Stories:
Linux - Compare two directories(Feb 18, 2009)
Cloning Linux Systems With CloneZilla Server Edition (CloneZilla SE)(Jan 22, 2009)
Copying a Filesystem between Computers(Oct 28, 2008)
rsnapshot: rsync-Based Filesystem Snapshot(Aug 26, 2008)
K9Copy Helps Make DVD Backups Easy(Aug 23, 2008)

Hosing Your Root Account By S. Lee Henry

If you manage your own Unix system, you might be interested in hearing how easy it is to make your root account completely inaccessible -- and then how to fix the problem. I have landed in this situation twice in my career and, each time, ended up having to boot my Solaris box off a CD-ROM in order to gain control of it.

The first time I ran into this problem, someone else had made a typing mistake in the root user's shell in the /etc/passwd file. Instead of saying "/bin/sh", the field was made to say "/bin/sch", suggesting to me that the intent had been to switch to /bin/csh. Due to the typing mistake, however, not only could root not log in but no one could su to the root account. Instead, we got error messages like these:

    login: root
    Password:
    Login incorrect

    boson% su -
    Password:
    su: cannot run /bin/sch: No such file or directory

The second time, I rdist'ed a new set of /etc files to a new Solaris box I was setting up without realizing that the root shell on the source system had been set to /bin/tcsh. Because this offspring of the C shell is not available on most Unix boxes (and certainly isn't delivered with Solaris), I found myself facing the same situation that I had run into many years before.

I couldn't log in as root. I couldn't su to the root account. I couldn't use rcp (even from a trusted host) -- because it checks the shell. I could ftp a copy of tcsh, but could not make it executable. I couldnt boot the system in single user mode (it also looked for a valid shell). The only option at my disposal was to boot the system from a CD ROM. Once I had done this, I had two choices: 1) I could mount my root partition on /a, cd to /a/etc, replace the shell in the /etc/passwd file, unmount /a, and then reboot. 2) I could mount my root partition on /a, cd to /a/bin, chmod 755 the copy of tcsh that I had previously ftped there, unmount /a, and then reboot.

I fixed root's entry in the /etc/passwd file and made my new tcsh file executable to prevent any possible recurrence of the problem. To avoid these problems, I usually don't allow the root shell to be set to anything other than /bin/sh (or /bin/csh if I'm pressured into it). The Bourne shell (or bash) is generally the best shell for root because it's on every system and the system start/stop scripts (in the /etc/rc?.d or /etc/rc.d/rc?.d directories) are almost exclusively written in sh syntax. Hence, should one of these files fail to include the #!/bin/sh designator, they will still run properly.

Surprised by how easily and completely I had made my system unusable, I was left running around the office looking for the secret stash of Solaris CD-ROMs to repair the damage. By the way, changing the file on the rdist source host and rdist'ing the files a second time would not have worked because even rdist requires the root account on the system be working properly. The rdist tool is based on rcp.

Recommended Links

Unix Admin. Horror Story Summary, version 1.0 compiled by: Anatoly Ivasyuk (anatoly@nick.csh.rit.edu)

The Unofficial Unix Administration Horror Story Summary, version 1.1

developerWorks Linux Technical library view

More 2-Cent Tips

More 2-Cent Tips

More 2-Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

More 2 Cent Tips

Two Cent BASH Shell Script Tips

Lots More 2 Cent Tips...

Some great 2¢ Tips...




Etc

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner.

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least


Copyright © 1996-2014 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting hosting of this site with different providers to distribute and speed up access. Currently there are two functional mirrors: softpanorama.info (the fastest) and softpanorama.net.

Disclaimer:

The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: October 20, 2014