
Softpanorama classification of sysadmin horror stories

Data loss is a calamity of the technological age that affects all of us.
But only a few can contribute to it to the extent a system administrator can ;-)

10-15 minutes spent re-reading this page once a month can help you avoid some of the situations described below. A spectacular blunder is too valuable to be forgotten, as it tends to repeat itself in a year or two ;-).

Version 2.5 (Nov 17, 2020)


Introduction

"More systems have been wiped out by admins
than any hacker could do in a lifetime"

Rick Furniss

“Experience fails to teach where there is no desire to learn.”
"Everything happens to everybody sooner or later if there is time enough."

George Bernard Shaw

“Experience is the most expensive teacher, but a fool will learn from no other.”

Benjamin Franklin

Unix system administration is an interesting and complex craft. It's good if your work demands the use of all your technical skills, creativity, and judgment. If it doesn't, then you're in the absurd world of Dilbertized cubicle farms and bureaucratic stupidity. Unfortunately, that happens too.

There is a lot of deep elegance in Unix, and a talented sysadmin, like any talented craftsman, is able to expose this hidden beauty by masterful manipulation of complex objects using classic Unix utilities, pipes, and the command line. This often amazes observers with a Windows background. In Unix administration you need to improvise on the job to get things done, create your own tools, and master the command line environment; to work at an advanced level you can't go strictly "by the manual", you need to improvise. Unfortunately, some of these improvisations produce unexpected side effects ;-)

In a way, this craftsmanship is not limited to the execution of complex sequences of commands. Blunders, and the folklore about them, are also a legitimate part of the craft. It's human to err, after all. And if you are working as root, such an error can easily wipe out a vital part of the system. If you are unlucky, this is a production system. If you are especially unlucky, there is no backup. It is the presence or absence of a recent backup that often distinguishes a horror story from a minor nuisance. That's why many veteran sysadmins create a personal backup of /etc before doing anything complex and/or potentially risky. I think a sysadmin should back up /etc on the first login as root each day. That can be done from /root/.bash_profile or a similar dot file, and it does not take any time, because the task can be launched in the background and the login can continue without waiting for its completion.
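A minimal sketch of such a backup, suitable for /root/.bash_profile (the destination directory and the 30-day retention are assumptions for the example, not a prescription):

    # Back up /etc once per day, in the background, on the first root login.
    # The destination directory and retention period are illustrative only.
    BACKUP_DIR=/root/etc-backups
    TODAY=$(date +%F)
    if [ ! -e "$BACKUP_DIR/etc-$TODAY.tar.gz" ]; then
        mkdir -p "$BACKUP_DIR"
        ( tar -czf "$BACKUP_DIR/etc-$TODAY.tar.gz" -C / etc && \
          find "$BACKUP_DIR" -name 'etc-*.tar.gz' -mtime +30 -delete ) >/dev/null 2>&1 &
    fi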

In a sysadmin job the conditions are often far from perfect. Some are overworked, some lack the necessary knowledge (or lost the skill in an avalanche of routine work that does not require it), some are just lazy and try to cut corners. Cutting corners is not necessarily bad per se, as in real work it is often necessary to disobey established rules and act boldly and decisively. But, like handling a sharp blade, that entails risks, and sometimes people get burned. As Larry Wall put it: "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise."

Blunders, often called SNAFUs, happen in the professional life of any sysadmin, but there is a system in any madness ;-). To a certain extent they are similar to car crashes. Both are rare and are caused not by one factor but by a unique combination of several negative factors. Both cause a lot of grief, up to depression. Most are preventable.

There are generally two types of horror stories.

  1. The extremely rare and extremely horrible stories, which follow a Poisson distribution. A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898, when he was given the task of investigating the number of soldiers in the Prussian army accidentally killed by horse kicks (it is also applicable to occurrences of rare diseases, birth defects and genetic mutations, and fires and floods in datacenters). It is difficult to avoid this type of SNAFU, although some policies make them less likely to happen and, if they do happen, limit the scope of the damage.
  2. Repeatable blunders rooted in deficiencies of the OS, in the implementation of key utilities (rm, find, chown), or in "human factors" (so-called "pilot errors"). These usually follow a certain pattern, and they happen more often. For each of these patterns several Internet stories usually exist and can be collected and analyzed. Classic examples of this type of horror story are Creative uses of rm and Performing the operation on a wrong server. These horror stories can be classified, and measures can be developed to avoid them or at least reduce their frequency to that of the Poisson-distribution type. Several classes of such tools exist, although none is in widespread use.

    The key point is that the behavior of such utilities should be different in interactive sessions than in scripts, and some attempts to compensate for known and already studied blunders should be implemented via wrappers or other means. For example, even such a simple measure as the alias

    alias reboot='echo $HOSTNAME; sleep 10; reboot'

     can help to prevent accidental reboots of the wrong server, which is a pretty common horror story in large datacenters.

    Among the known wrappers and measures that have already proved to be somewhat effective in preventing "known SNAFUs" we can mention:

    1. Dynamic backup to a FIT-type USB drive (up to 512GB) inserted into a USB slot on the server (or a Class 10 memory card), using, for example, Relax-and-Recover on RHEL or Rsnapshot -- a very elegant approach to incremental backups.
    2. molly-guard type scripts, which prevent accidental reboots.
    3. saferm type utilities, which try to prevent mass deletions and deletion of system directories (such as /etc or its subdirectories), as well as utilities that organize a Windows-style trash can for deleted files (trashy - Trashy · GitLab).
    4. TIO (Think It Over) type scripts, like my think utility (see the sketch after this list). Such utilities use the fact that aliases are expanded only in interactive sessions to provide the context of the command (server name and current directory) and a small delay (3-5 sec) before the execution of certain potentially dangerous commands (find . -exec, rm -r, chmod -R, reboot, shutdown, halt) -- enough to realize that you are making a blunder. They are useful in situations when the price of any serious error on the sysadmin's part is way too high. One example is making substantial changes on a remote server that has no remote control capabilities, or whose remote control is dead. In this case you should never rush and submit a command with destructive potential without inspection. But this is easier said than done: often such changes are made while you are very tired, or under time pressure, or management pressure, or you are not quite familiar with the system you are working with. In this case you can alias dangerous commands in such a way that they are not executed the first time they are entered, or are executed with a small delay (the mkfs command has a built-in 10-second delay, allowing you to cancel it within this period if you realize that you entered the wrong device). In the simplest form these can be aliases like
      alias find='sleep 7; find' 

      In the more complex form used in think, the first time you submit the command the utility just provides you with the context and some "tips" instead of executing it. Only after you inspect this first "draft" do you retrieve it from history and resubmit it, prefixing it with a backslash to suppress the alias. This is not a panacea, but it might be helpful in situations when the cost of any typo or small mistake is way too high.

    5. dirhist type scripts, which automatically track changes in /etc and other vital system directories and inform the sysadmin if any file defined as "critical" in the config file changes. You will be surprised how helpful it is to have the history of changes in the /etc directory for troubleshooting, even if the server is administered by a single sysadmin.
    6. Colorizers of the terminal session background (easy to implement in Teraterm), which provide a different color for each server, making the very common blunder of executing a command on the wrong server far less likely.
    7. Command history enhancers (first of all, adding timestamps to the history helps a lot in tracking mishaps and conflicting actions of different sysadmins), including various kinds of directory favorites (usually built on the existing pushd/popd/dirs mechanism).
    8. Visual shells like Midnight Commander, which provide better visibility and safety for typical file operations, especially copying of system files.
    9. find -exec blunder preventers, which on first execution replace the -exec or -delete option with the -ls option (usually part of the TIO scripts mentioned above).
    10. File recovery utilities for ext filesystems (Midnight Commander provides this functionality).
    11. NCD type utilities for quicker and safer navigation of the directory tree. These might help with the problem of issuing a command in the wrong directory.
    12. Utilities that back up the boot sector, MBR, and similar information on newer systems, making disk recovery easier in case of accidental formatting.

Other types might exist too. This list is probably not exhaustive, although I tried ;-)
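As an illustration of the TIO idea mentioned in item 4, here is a minimal sketch (this is not the actual think utility; the function name and the 5-second delay are arbitrary choices for the example):

    # Minimal "think it over" wrapper: show the context, pause, then run.
    tio () {
        echo "Host: $HOSTNAME   Directory: $PWD"
        echo "About to run: $*   (Ctrl-C within 5 seconds to abort)"
        sleep 5
        "$@"
    }
    # Aliases are expanded only in interactive shells, so scripts are unaffected.
    alias reboot='tio reboot'
    alias shutdown='tio shutdown'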

That's why it is important to try to classify typical sysadmin mistakes. People learn from experience, and that's why each sysadmin should maintain his own lab journal. Regardless of the reason, every mistake should be documented, as it constitutes an important lesson pointing to the class of similar errors possible. As the saying goes, "never waste a good crisis". It is an opportunity to make your system safer, change your work routines for the better, and hopefully prevent similar occurrences in the future. For example, in many cases when a simple mistake causes serious problems, we observe the absence of backups and the absence of a baseline, often both. Even a simple routine that delays an action for 3-5 seconds can prevent many horror stories, because such blunders are usually recognized instantly.


Again, the most common sysadmin blunder in Unix/Linux is probably wiping out useful data with a wrong rm command. This class of errors is often called Creative uses of rm. The danger can't be completely eliminated via a wrapper such as saferm, but the frequency of such blunders can be drastically diminished. In such cases nothing can replace an up-to-date backup, so doing a backup before any large-scale deletion or file reorganization is a must and is probably more important than the use of saferm. And doing a backup of the /etc directory when you start working on a server is a must too.

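A minimal sketch of the idea behind such a wrapper (this is not the saferm utility itself; the protected paths and the trash location are assumptions for the example):

    # Refuse to touch critical system directories; move everything else
    # to a trash directory instead of deleting it outright.
    TRASH=/root/.trash
    saferm () {
        mkdir -p "$TRASH"
        for target in "$@"; do
            case "$target" in
                -*) ;;                                   # ignore rm-style options
                /|/etc|/etc/*|/usr|/var|/boot|/root)
                    echo "saferm: refusing to remove '$target'" >&2; return 1 ;;
                *)  mv -- "$target" "$TRASH"/ || return 1 ;;
            esac
        done
    }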

What is really bad is that after the disaster has happened, sysadmins tend to react impulsively, trying to save the situation. And those steps can dramatically increase the damage. So Rule Number One is: after the disaster happens, do not rush into action. Take time to analyze the "crime scene" and, like a real detective, diligently preserve all the evidence. Document everything. Even minor details might be crucial to lessening the scope of the damage inflicted.


Then do your research. I had a recent case when a small (40TB) RAID 5 disk array was lost due to the failure of two drives that went unnoticed by operators. Investigation showed that it actually consisted of two RAID 5 virtual drives, with the second mostly unused. The partition was configured as a logical volume in Linux LVM consisting of two PVs (physical volumes), and there was documentation on the Internet on how to recover from this situation (see Recovery of LVM partitions). The best case is when you have one missing PV (preferably the second, which was the case in this SNAFU). This information allowed me to recover most of the files. I created a lab from an unused server, experimented with the method for a day, and then managed to recover most of the files on the production server. If I had panicked and simply reinitialized everything the same day and reformatted the partition, the data would have been lost.

If user data is lost, you need to look for places where a copy might exist. Often, if a user works on multiple servers, he/she copies files to additional locations, and at least part of them can be restored from such a copy. The same is true for sysadmin home directories and /root, although a sysadmin who does not use a FIT USB drive for "on-the-fly" backups of /etc, /root, /boot and other system directories (using, say, Rsnapshot or dirhist) is playing with fire in any case.


There are several steps that you can take to mitigate the damage:

  1. Pay attention to the creation of a solid backup infrastructure. If necessary, a private one. A missing backup is the root of all evil.
  2. Block destructive operations on all level-2 system directories and /etc config files via a wrapper.
  3. Many such blunders occur because you type a potentially destructive command directly on the command line. Use TIO (think it over) type scripts, or at least an editor that allows execution of a line in the buffer (for example, vim has this capability). When you operate on a backup copy of a system directory, it is easy to automatically type the name of the directory with a slash in front of it (rm -r /etc instead of rm -r etc), because your brain is conditioned to type it this way. So it is always prudent to rename the directory to something else first (see the sketch after this list). Use absolute paths for rm in all cases.
  4. You should never attempt a large-scale deletion of files in a hurry, or while distracted and doing other tasks simultaneously. Major blunders are often made while trying to urgently "free space" while concentrating on some other task. Mass deletion is like a surgical operation and requires your full attention; if you do it as a "side task" you can erase important files or databases. Again, view it as a surgical operation which requires a clean environment, patience, and a cool head. Take a one-minute break before it. Also, remember that the ability to resist user pressure is a virtue in a sysadmin.
  5. In any case, re-reading this or similar pages periodically, as a kind of "sysadmin safety training", helps to avoid typical rm blunders (especially including ".." in the list of directories to be deleted, as in the classic rm .*). Please note that after you read about them the awareness lasts just a couple of weeks, or a month at best. After that you firmly forget about those things and the danger returns. So maintaining a proper level of awareness is an important part of the art of Unix system administration. You probably need to schedule reading Creative uses of rm in your calendar. I do.
     

    The 13th of each month is a very appropriate day for sysadmin safety training and self-study on this topic. I use it for updating this set of pages ;-)
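    A minimal sketch of the "rename first, delete later" pattern mentioned in item 3 (the directory names are just an example):

        cd /backup/serv01
        mv etc etc.to-delete-2020-11-17       # a typo at this stage is recoverable
        # ... verify that nothing depends on the renamed directory ...
        rm -rf /backup/serv01/etc.to-delete-2020-11-17   # absolute path, as advised above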

Another common and disastrous blunder for any Unix sysadmin who juggles many dozens of servers is performing an operation on the wrong server. Such blunders range from rebooting the production server instead of the quality (or testing) server, to removing a file on the original filesystem instead of the backup filesystem (while this file does not exist on the backup), and many others. See Performing the operation on a wrong server.

A similar blunder is an accidental reboot. The cause can be as simple as disconnecting the wrong power cord, or executing the reboot command in the wrong terminal window. Using different background colors can help in the latter case.
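A minimal sketch for .bashrc that makes each server's prompt visually distinct (the color mapping is an arbitrary assumption; Teraterm and other terminal emulators can do the same on the client side):

    # Derive a background color (ANSI 41-46) from a hash of the hostname,
    # so every server gets a different-looking prompt.
    _host_color=$(( 41 + $(hostname | cksum | cut -d' ' -f1) % 6 ))
    PS1="\[\e[${_host_color}m\]\u@\h\[\e[0m\]:\w\$ "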


Learning from your own mistakes, as well as from the mistakes of others, is an important part of learning the craft. But the awareness does not last long. That's why it is important to periodically reread such pages: they can prevent some horrible blunders.

SNAFU as a classic career-limiting move for a Unix/Linux sysadmin ;-)

The term SNAFU and the phrase "Houston, we have a problem" often mean the same thing

SNAFU is an acronym widely used to stand for the sarcastic expression Situation Normal: All Fucked Up. The term is often used as a synonym for "sysadmin blunder", and implicitly includes some effort to cover up an embarrassing situation.

So efforts to avoid them are well justified, and have existed since the early 90s, if not earlier. Hence this page and similar pages that exist on the web (including the Original Anatoly Ivasyuk collection of Sysadmin Horror Stories, created in the early 90s). All of them, while far from perfect, still represent useful study material, similar to any course in shell programming, and should be treated as such: studied and periodically refreshed. The latter, as I already mentioned, is especially important, as the awareness fades in a month or two. Eventually, in a year or so, the information is wiped out of your memory by new information and the flow of new problems that are typical for any sysadmin job. So the person inevitably regresses to the old ("dangerous") way of doing things, for example abandoning the use of a "protective" wrapper for the rm command. This is especially typical for the Creative uses of rm, Missing backup horror stories, and Performing the operation on a wrong server types of SNAFU.

Usually the memory fades quickly, and in several months or a year most of us are quite ready to repeat them ;-)

IMHO, periodic rereading of such pages is the only realistic way to keep awareness at a proper level. This is similar to how large enterprises conduct monthly "security awareness training". And while the latter often degenerates into a useless exercise, it might be very useful to incorporate a few slides about typical blunders into it.

After 50 years or so of Unix's existence (which is also the duration of the problem with the expansion of the ".*" pattern on the command line), many Unix/Linux sysadmins still do not understand, or don't remember, the danger of rm -rf .*


Reading those pages can help. In addition, keep a personal journal of your SNAFUs (a typical SNAFU, like a traffic incident, is a confluence of several mistakes, simultaneous maneuvers, misunderstandings, etc.; also, as in the army, incompetent bosses often play a prominent role in such incidents, creating unnecessary and harmful pressure in an already very stressful situation).

Periodically browsing this personal log is really important, as each of these incidents can easily represent, to put it politically correctly, a "career limiting move", sometimes resulting in termination of employment.

But there is always a silver lining in every dark cloud. When handled properly, incidents stimulate learning and the personal growth of a system administrator. Although in many cases there are less painful ways to grow your knowledge, including knowledge of bad incidents. That's why this set of pages was created. Reading it and other similar pages might help an aspiring sysadmin avoid blunders for which many people have already paid the price, which in certain cases includes termination of employment...

Fundamental reasons for blunders sysadmins commit

There are several fundamental reasons for blunders sysadmins commit:

  1. Absence of backup. This is the No. 1 reason why a mistake becomes a disaster. See Missing backup horror stories for more information.

    One thing that distinguishes a professional Unix sysadmin from an amateur is the attitude to backups and the level of knowledge of backup technologies. A professional sysadmin knows all too well that the difference between a major SNAFU and a nuisance is often the availability of an up-to-date backup of the data.


    Here is a pretty telling poem from an unknown source (well, originally Paul McCartney :-) on the subject:

    Yesterday,
    All those backups seemed a waste of pay.
    Now my database has gone away.
    Oh I believe in yesterday.

  2. "Overconfidence and a false sense of security", which often manifests itself in the absence of testing of complex commands due to complacency and arrogance. Mishaps and blunders are rare events. Typically everything goes OK, and that sense of complacency eventually backfires. It is so important that it deserves a safety training session each month and a separate page.

    A false sense of security invites performing dangerous actions without proper preparation and checking. All of us think that we are great on the command line. In most cases (say 99.999%) this is true. But there is the other 0.001%, where it is not. The fact that you have used the command line for a decade or more does not actually shield you from committing horrible blunders if you are not careful, especially if you prefer, as many sysadmins do, to work as root. Using a TIO (think it over) script that delays the execution of potentially destructive commands for 3-6 seconds, and/or composing the command in an editor first and only then copying it to the command line, are not just good but necessary practices, especially if the server on which you are working is hundreds of miles away. It is so easy one day to absolutely automatically type something like

    rm * 171206.log

    instead   of

    rm *171206.log

    Our brains sometimes tend to play jokes on us.

    Another aspect of the same problem is that the complexity of the environment and hidden interactions between components are ignored, and you jump into action without investigating the possible consequences of the move. For example, even a trivial operation like fixing the way the calendar year is represented (the so-called "year 2000 problem") proved to be a very complex mess. Similarly, even a simple upgrade of the version of a compiler or interpreter, done at the request of one user, can disrupt the work of others. This is typical of both hardware and software operations. For example, sometimes a sysadmin shuts himself out of a remote box by performing a network reconfiguration that does not take into account what type of network connection he/she is using. Of course, these days in most cases you have a server remote control unit (DRAC/iLO, etc.), but the problem is that it might not work. Such things can crash, as was typical for certain versions of DELL iDRAC 7 (Resetting frozen iDRAC without unplugging the server) and HP iLO. For several years, iLO on the HP ProLiant DL580 G7 (which otherwise was a pretty decent and very reliable four-socket 4U server) did not last more than a week, and to reboot it you needed to disconnect the power cables from the server (which is a pretty idiotic solution for rack servers; but this is HP, with its very complex and capricious hardware).

    If the server is remote and those two mishaps happen simultaneously, you have a problem.
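    One common safety net for risky remote network changes is to schedule an automatic revert in advance and cancel it only after confirming you still have access; a minimal sketch (the config path, service name, and 10-minute window are assumptions):

        cp /etc/sysconfig/network-scripts/ifcfg-eth0 /root/ifcfg-eth0.good
        ( sleep 600 && cp /root/ifcfg-eth0.good /etc/sysconfig/network-scripts/ifcfg-eth0 \
            && systemctl restart network ) &     # the revert fires unless cancelled
        REVERT_PID=$!
        # ... apply and test the change; if you still have access, cancel the revert:
        kill $REVERT_PID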
     

  3. Excessive zeal. As Talleyrand advised young diplomats: "First and foremost, try to avoid excessive zeal." That very wise recommendation is fully applicable to sysadmin activities, especially regarding efforts to "improve security", which often lead to horrible SNAFUs and more often than not do not improve overall security. Often doing nothing NOW is the optimal course of action. It gives you time to think about the situation and understand it better.

  4. Inability to resist requests to violate established procedures when you are pressed. First of all, this is related to the violation of Rule No. 1: create a backup before starting any activity that can screw up the OS or important components.
     
  5. Believing the user's version of the situation without checking the gory details. First of all, users often do not understand what they want, or what exactly happened and why. So blindly following their instructions is a sure recipe for disaster. Always ask yourself: does this particular user know what they want? Do they really understand the situation, or is this an illusion? Often, if you check, the answer is: no, no, and again no. For example, if a user writes you an email requesting that a newer version of the R interpreter be installed on your servers ASAP, because the previous version is too old (which is true), then without checking you might miss the real meaning of the message, which can be quite different from the requested action (and that means that following the request leads to a rather big SNAFU):

    1. The user is a typical luser (idiot/novice/incompetent) who knows neither Linux nor R well (and does not try to improve his/her knowledge) and tried to install some R package (or a group of packages). When the installation failed, he/she just jumped to the conclusion that the problem is the R interpreter version, because he heard that there is a newer one.

    2. The user inherited some code which he does not understand, and it does not run under the currently installed interpreter. In his infinite wisdom the user decided that the problem is not his/her lack of knowledge of R, but the R interpreter.

    3. A combination of (1) and (2)

    4. Some other reason with incompetence as the root cause

    If in this case you jump into action and update the interpreter, you may now face several more serious problems:

    1. You might need to restore everything from backup after another user complains (and that means, for example, on all 16 or more servers that you just updated, spending a good part of your weekend ;-). At this point you might discover that not all servers have a recent backup, which spells trouble.

    2. The problem the user faces has become much worse, and now you are in the loop to help him/her, because it is you who made it worse. In other words, you now own the problem.

    Another, more humiliating, story of the same type comes from Opensource.com:

    The accidental spammer (An anonymous story)

    It's a pretty common story that new sys admins have to tell: They set up an email server and don't restrict access as a relay, and months later they discover they've been sending millions of spam email across the world. That's not what happened to me.

    I set up a Postfix and Dovecot email server, it was running fine, it had all the right permissions and all the right restrictions. It worked brilliantly for years. Then one morning, I was given a file of a few hundred email addresses. I was told it was an art organization list, and there was an urgent announcement that must be made to the list as soon as possible. So, I got right on it. I set up an email list, I wrote a quick sed command to pull out the addresses from the file, and I imported all the addresses. Then, I activated everything.

    Within ten minutes, my server nearly falls over. It turns out I had been asked to set up a mailing list for people we'd never met, never contacted before, and who had no idea they were being added to a mailing list. I had unknowingly set up a way for us to spam hundreds of people at arts organizations and universities. Our address got blacklisted by a few places, and it took a week for the angry emails to stop. Lesson: Ask for more information, especially if someone is asking you to import hundreds of addresses.

  6. Loss of situational awareness. The latter is the ability to identify, process, and comprehend the critical elements of information about what is happening; the state of being alert to any, often subtle, clues. When you are tired, that often means that you have lost part or all of your situational awareness and are inclined to perform reckless actions. So the most horrible SNAFUs often happen when you are tired, exhausted, or sleepy.

    Another source of the loss of situational awareness is lack of preparedness, when a person has already forgotten important details about a particular procedure or subsystem because problems with it are very infrequent, but fails to RTFM and jumps into action.


    In this sense, while a long troubleshooting session can be beneficial, as only this way do you build a "mental picture" (like an air traffic controller) of what is happening, extremely long troubleshooting sessions (all-nighters) are counterproductive (and even dangerous) precisely because of this factor: in such a condition you can accidentally destroy a vital part of the OS with one stroke, or find some other creative way to make the situation worse. Working too long a shift while dealing with a SNAFU often creates a much bigger problem than the one you were dealing with.

    Avoiding any complex or potentially destructive operation when you are tired is prudent advice, but due to the specifics of sysadmin work, with its unpredictable load peaks, it is very difficult to follow.

  7. Skipping the "cool down period" after a disaster. Often the most damage is caused not by the SNAFU itself but by subsequent hasty and badly thought-out "recovery" actions. People tend to react to a disaster on an emotional basis, with feelings overriding logic, and rush into action trying to save the situation while making it worse. Humans are "wired" biologically to fear first and think later. So, after experiencing the first, often relatively minor, problem, a sysadmin often overreacts and commits a huge blunder trying to correct the error without full understanding of the situation. At this point a minor problem becomes a real SNAFU.

    The key in facing any serious problem is to give yourself a "cool down" period. Just a couple of minutes of thinking about the problem can save you from a misguided action that makes the situation tremendously worse, sometimes irreparably. In any case, creating a backup is not a step you can skip. This is the step that differentiates an amateur from a professional.
     

  8. Misperception of the complexity of the environment and the associated risks. Modern hardware and system software are way too complex, and sometimes dealing with the components of a modern server becomes a minefield (especially if this happens rarely and previous lessons and knowledge are long forgotten), which can also lead to disasters. For example, the HP P410 RAID controller has the interesting property of "forgetting" its configuration in certain circumstances if you remove a drive that is not used by the controller while the server is up. In this case, on reboot you get something like
    <4>cciss 0000:05:00.0: cciss: Trying to put board into performant mode
    <4>cciss 0000:05:00.0: Placing controller into performant mode
    <6> cciss/c0d0: unknown partition table    

    Formally it should allow hot swapping and removal of an inactive drive. But here your jaw drops, especially if you realize that you have no recent backup.
     

  9. Reckless driving. The desire to "cut corners" is often connected with being tired, personal problems, excessive hubris, bravado, being over-caffeinated, etc. It is very similar to reckless driving. The absence of testing of complex commands, mentioned above, can also be classified as an example of "reckless driving". Linux is such a complex OS and has so many important commands that it is impossible to remember all the gory details. You need to refresh your memory by consulting your notes, the man pages, and the web first. If this is not done and you rely on intuition when using some feature, you can be badly disappointed :-(. For example, people often forget that ".*" matches "." and "..", and run rm or another "destructive" command on a production server without first testing which set of files is affected (a quick demonstration follows at the end of this list).
     

    An “Ohnosecond” is defined as the period of time between when you hit the Enter key and you realize what you just did.

    There is a difference between a test server and a production server in the sense that any action on a production server should be verified prior to execution. As with traffic incidents involving reckless drivers, a reckless sysadmin is aware of the risk and consciously disregards it.

    State laws usually define reckless driving as “driving with a willful or a wanton disregard for the safety of persons or property,” or in similar terms. Courts consider alcohol and drug use as a factor in deciding whether the driver’s actions were reckless.
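    A quick way to see what a dot pattern will actually expand to before feeding it to a destructive command (the safer pattern shown is a common idiom, not a universal cure):

        echo .*                  # in bash this expansion includes "." and ".."
        echo .[!.]* ..?*         # a common safer idiom that skips "." and ".."
        find . -maxdepth 1 -name '.*' ! -name . ! -name ..   # or let find do the matching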

Raising situation awareness by doing self-safety training

"Those Who Forget History Are Doomed to Repeat It"
Multiple authors

"Those who cannot remember the past are condemned to repeat it."

George Santayana

Having even a primitive record of your blunders in the form of, say, a plain text file, an HTML page, a Word document, or a special logbook is a good way to increase situational awareness. Some blunders are repetitive.

People usually are unable to learn from blunders committed by others. They prefer to make their own... And even in this case, after a year or two the lesson is typically completely forgotten.

Re-reading the description of your own blunder typically provokes a strong emotional reaction and reinforces the understanding of the dangers related to it. This kind of "emotional memory" is very important in helping to avoid a similar blunder in the future. That means that periodic review of descriptions of your own blunders is a necessary part of the sysadmin arsenal. Re-reading those descriptions should be periodic (for example, once a quarter) self-safety training, much like safety training in large corporations.

I can attest that 10-15 minutes spent re-reading and enhancing this material once a month can help you avoid some of the situations described below. A spectacular blunder is often too valuable to be forgotten, as it tends to repeat itself ;-). And people tend to commit them again and again. If you read some of the stories from the late 90s, they often sound as if they were written yesterday.


Reading about somebody else's blunder does not fully convey the gravity of the situation in which you can find yourself by repeating it, but it can serve as a weaker substitute for a log of your own blunders. For example, the understanding that dealing with files and directories starting with a dot in Unix requires extreme caution can probably be acquired only by committing one (just one) such blunder.

Dealing with RAID controllers is another area that requires extreme caution, good planning, and the availability of a verified backup. Sometimes even a routine firmware update turns into an unmitigated disaster. This is also an area where the difference between a minor nuisance and a major disaster is the presence of a recent backup.

Some typical cases of loss of situational awareness

Here is a list, constructed by myself, of typical cases of the loss of situational awareness.

History of this effort

On this page we present the "Softpanorama classification of sysadmin horror stories". It is not the first such effort and hopefully not the last one. And we need to pay proper tribute to the pioneer in this area -- Anatoly Ivasyuk.

The author is indebted to Anatoly Ivasyuk, who created the original "The Unofficial Unix Administration Horror Story Summary", which exists in two major versions.

One thing that we need in this area is a good classification. While Anatoly Ivasyuk took the first, most difficult step, more can be done. One such classification, created by the author, is presented below.

This page and related subpages can be viewed as an attempt to create a more relevant classification of sysadmin blunders, reorganize the existing material, and enhance the content by adding more modern stories.

The issues connected with ego and hubris

  hubris: Overbearing pride or presumption; arrogance:
"There is no safety in unlimited technological hubris”

( McGeorge Bundy)

All the world's a stage,
And all the men and women merely players;
They have their exits and their entrances,
And one man in his time plays many parts,

Shakespeare, As You Like It Act 2, scene 7, 139–143

I think there's a lot of naivete and hubris within our mix of personalities.

- Ian Williams

 

Hubris (/ˈhjuːbrɪs/, also hybris, from ancient Greek ὕβρις) describes a personality quality of extreme or foolish pride or dangerous over-confidence.[1] In its ancient Greek context, it typically describes behavior that defies the norms of behavior or challenges the gods, and which in turn brings about the downfall, or nemesis, of the perpetrator of hubris (Hubris - Wikipedia)

Larry Wall once said that "The three chief virtues of a programmer are: Laziness, Impatience and Hubris." I assume that this was a joke, and it is not really true even for programmers. But for system administrators those three qualities are mortal sins, especially the last two. Hubris alone will never let you be a good system administrator. That's what distinguishes system administrators from artists.

We're all victims of our own hubris at times. Success usually breeds a degree of hubris. But some people are more affected than others. The problems start when people are too shy to ask more experienced colleagues for advice or information, because they are afraid to demonstrate that they do not know something which others assume they know. Sometimes this is the reason that leads to disasters.

If a senior, more experienced sysadmin looks at you like you're an idiot, ask him why. It's better to be thought an idiot for asking than proven to be an idiot by not asking!

Softpanorama classification of sysadmin blunders

Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )

Vivek Gite

  1. Creative uses of rm with unintended consequences. This is an intrinsic, unavoidable danger in Linux, like using a sharp blade or a chainsaw. Blunders happen very infrequently, but even a single one can be devastating, and if it happens on a production server it can cost you your job. That means that the level of knowledge of the intricacies of the rm command directly correlates with the level of qualification of a Linux sysadmin. Please read the recommendations in Creative uses of rm with unintended consequences; they were created as a generalization of unfortunate episodes (usually called SNAFUs) of many sysadmins, including myself.
  2. Missing backup. Please remember that the backup is your last chance to restore the system if something goes terribly wrong. That means that before any dangerous step you need to locate the backup and check that it exists. Making another backup is also a good idea, so that you have two or more recent copies. At least attempting to browse the backup to see whether the data is intact is a must.
  3. Missing baseline and losing the initial configuration in the stream of changes. The most typical mistake in network troubleshooting and optimization is losing your initial configuration. This may also indicate lack of preparation and lack of situational awareness. You need to take several steps to prevent this blunder from occurring, and the most important of them are baselines and backups.
  4. Locking yourself out
  5. Performing an operation on the wrong computer. The naming schemes used by large corporations usually do not put enough distance between names to avoid such blunders. Also, if you work in multiple terminals and do not distinguish them by color, you can easily make such a blunder. For example, you can type XYZ300 instead of XYZ200 and log in to the wrong box. If you are in a hurry and do not check the name, you proceed with an operation intended for a different box. Another common situation is when you have several terminal windows open and in a hurry start working on the wrong server. That's why it's important that the shell prompt shows the name of the host (but that is not enough; the color of the terminal background is also important, probably more so). If you have both a production server and a quality server for some application, it is often wise never to have two terminals open simultaneously while you are doing something tricky and potentially disastrous (if done on the wrong box). Reopening a terminal is not a big deal, but this habit can save you from some very unpleasant situations.
  6. Forgetting which directory you are in and executing a command in the wrong directory. This is a common mistake if you work under severe time pressure or are very tired.
  7. Regular expression and globbing blunders. Novice sysadmins usually do not realize that '.*' also matches '..', often with disastrous consequences if commands like chmod, chown, or rm are used recursively or in a find command.
  8. find filesystem traversal errors and other errors related to find. This is a very common class of errors and it is covered on a separate page: Typical Errors In Using Find.
  9. Side effects of performing operations on home or application directories due to links to system directories. This is a pretty common mistake and I have committed it myself several times, with various, but always unpleasant, consequences.
  10. Misunderstanding the syntax of an important command and/or not testing a complex command before execution on a production box. Such errors are often made under time pressure. One such case is using recursive rm, chown, chmod or find commands. Each of them deserves a category of its own.
  11. Ownership changing blunders. These are common when using chown with find, so you need to test the command first (see the dry-run sketch after this list).
  12. Excessive zeal in improving the security of the system ;-). A lot of current security recommendations are either stupid or counterproductive. In the hands of an overly enthusiastic and semi-competent administrator they become a weapon that no hacker can ever match. I think more systems have been destroyed by idiotic security measures than by hackers.
  13. Mistakes made under time pressure. Some of them were discussed above, but generally time pressure serves as a powerful catalyst for the most devastating mistakes.
  14. Patching horrors
  15. Unintended consequences of automatic system maintenance scripts
  16. Side effects/unintended consequences of multiple sysadmins working on the same box
  17. Premature or misguided optimization and/or cleanup of the system. Changing settings without fully understanding the consequences of such changes. Misguided attempts to get rid of unwanted files or directories ("cleaning" the system).
  18. Mistakes made because of the differences between various Unix/Linux flavors. For example, in Solaris run level 5 means power off, while in Linux run level 5 is a running system with networking and X11.
  19. Stupid or preventable mistakes, including those made while dealing with complex server hardware.
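A minimal illustration of the "test the command first" advice from the ownership item above (the path and the selection criteria are just an example):

    # First run find with -ls only, to see exactly which files would be touched:
    find /home/joeuser -maxdepth 3 -user 0 -ls
    # Only after reviewing that list, rerun it with the destructive action:
    find /home/joeuser -maxdepth 3 -user 0 -exec chown joeuser:joeuser {} +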

Some personal experience

Cleaning NFS mounted home directory to save space

To speed up the installation of the server I mounted my home directory from another server. Then I forgot about it and it remained mounted. CentOS 6.9 was installed on the server. Later a researcher asked to reinstall RHEL on it, as one of his applications was supported only on RHEL, and I started by backing up all critical directories "just in case". Thinking that I already had a copy of my home directory elsewhere, I decided to shrink space on the /home filesystem and, not realizing that it was NFS-mounted, deleted it.

Reboot of wrong server

Commands such as reboot or mkinitrd can be pretty devastating when applied to the wrong server. This mishap happens to a lot of administrators, including myself, so it is prudent to take special measures to make it less probable.

This situation is often made more probable by the non-fault-tolerant naming schemes employed in many corporations, where names of servers differ by one symbol. For example, the scheme serv01, serv02, serv03 and so on is a pretty dangerous naming scheme, as server names differ by only a single digit and thus errors like working on the wrong server are much more probable.

The typical case of the loss of situational awareness is performing some critical operation on the wrong server. If you use a Windows desktop to connect to Unix servers, use MSVDM to create multiple desktops and change the background of each, to make typing a command in the wrong terminal window less likely.

Even a more complex scheme like Bsn01dls9 or Nyc02sns10, where the first three letters encode the location, followed by a numeric suffix and then the vendor of the hardware and the OS installed, is prone to such errors. My impression is that unless the first letters differ, there is a substantial chance of working on the wrong server. Using favorite sport team names is a better strategy, and the "formal" names can be used as aliases.

Inadequate backup

If you try to distill the essence of the horror stories, most of them were upgraded from errors to horror stories due to inadequate backups.

Having a good recent backup is the key factor that distinguishes a mere nuisance from a full-blown disaster. This point is very difficult for novice enterprise administrators to understand. Paraphrasing Bernard Shaw, we can say "Experience keeps the most expensive school, but most sysadmins are unable to learn anywhere else". Please remember that in an enterprise environment you will almost never be rewarded for innovations and contributions, but in many cases you will be severely punished for blunders. In other words, typical enterprise IT is a risk-averse environment, and you had better understand that sooner rather than later...


Rush and absence of planning are probably the second most important reason. In many cases the sysadmin is stressed, and that impairs judgment.

Forgetting to chroot affected subtree

Another typical reason is abuse of privileges. If you have access to root, that does not mean that you need to perform all operations as root. For example, a simple operation such as

cd /home/joeuser
chown -R joeuser:joeuser .* 

performed as root causes substantial problems and time lost in recovering the ownership of system files. Computers are really fast now, and on a modern server such an operation takes only a second or two :-(.

Even with user privileges there will be some damage: it will affect all world writable files and directories.

This is the case where chroot can provide tremendous help:

cd /home/joeuser 
chroot /home/joeuser
chown -R joeuser:joeuser .* 
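Note that chroot /home/joeuser will only give you a working shell if a shell binary exists under that directory; an equivalent and simpler protection is to avoid the dot glob entirely, for example:

    # Recurse from the directory itself, so ".." never enters the picture.
    chown -R joeuser:joeuser /home/joeuser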

Abuse of root privileges

Another typical reason is abuse of root privileges. Using sudo or RBAC (on Solaris) you can avoid some unpleasant surprises. Another good practice is to use screen, with one window for root operations and another for operations that can be performed under your own ID or under the privileges of the wheel group (or another group to which all sysadmins belong).

Many Unix sysadmin horror stories are related to the unintended consequences and unanticipated effects of particular Unix commands, such as find and rm, performed with root privileges. Unix is a complex OS, and many intricate details (like the behavior of commands such as rm -r .* or chown -R a:a .*) can easily be forgotten between one encounter and the next, especially if the sysadmin works with several flavors of Unix, or with Unix and Windows servers.

For example, recursive deletion of files, either via rm -r or via find -exec rm {} \;, has a lot of pitfalls that can destroy the server pretty thoroughly in less than a minute if run without testing.

Some of those pitfalls can be viewed as deficiencies of the rm implementation (it should automatically block deletion of system directories like /, /etc and so on unless the -f flag is specified, but Unix lacks a system attribute for files, although in some cases the sticky bit on directories (like /tmp) can help).

That means that it is wise to use wrappers for rm. There are several more or less usable approaches to writing such a wrapper (see Saferm -- wrapper for rm command and the sketch earlier on this page).

Another important source of blunders is time pressure. Trying to do something quickly and cutting corners (such as skipping the creation of a verified backup) often leads to substantial downtime. "Hurry slowly" is a saying that is very true for sysadmins, but unfortunately very difficult to follow. In any case, always back up the /etc directory on your login (this should be done from a profile or bashrc script).


Sometimes your emotional state contributes to the problems: you didn't have much sleep, or your mind was distracted by personal problems. On such days it is important to slow down and be extra cautious. Doing nothing in such cases is much better than creating another SNAFU.

Typos are another common source of serious, sometimes disastrous, errors. One rule that should be followed (but, as the memory of the last incident fades, this rule, like any safety rule, is usually forgotten :-): if you are working as root and performing dangerous operations, never type the directory path; always copy it from the history if possible, or list it via the ls command and copy it from the screen.


I once automatically typed /etc instead of etc while trying to delete a directory to free space in a backup directory on a production server (/etc is probably engraved in a sysadmin's head, as it is typed so often, and can be substituted for etc subconsciously). I realized that it was a mistake and cancelled the command, but it was a fast server and one third of /etc was gone. The rest of the day was spoiled... Actually not completely: I learned quite a bit about the behavior of AIX in this situation and about the structure of the AIX /etc directory that day, so each such disaster is actually a great learning experience, almost like a one-day or even one-week training course ;-). But it's much less nerve-wracking to get this knowledge from a regular course...

Another interesting thing is that having a backup was not enough in this case -- backup software sometimes stops working, and the server has only the illusion of a backup, not an actual backup. That happens with HP Data Protector, which is too complex a piece of software to operate reliably. The same can be true for ssh- and rsync-based backups -- something in the configuration changes, and that goes unnoticed until it is too late. And this was a remote server in a datacenter across the country. I restored the directory from another non-production server (overwriting the /etc directory of that second box with the help of operations; tell me about cascading errors and Murphy's law :-). Then netcat helped to transfer the tar file.


In such cases, network services with authentication stop working and the only way to transfer files is using a CD/DVD, a USB drive, or netcat. That's why it is useful to have netcat on servers: netcat is the last-resort file transfer program when services with authentication like ftp or scp stop working. It is especially useful to have it if the datacenter is remote.

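A minimal reminder of how such a last-resort transfer looks (the host name and port are arbitrary; note that the listen syntax varies between netcat implementations -- some want nc -l -p 9999 instead of nc -l 9999):

    # On the receiving (damaged) server: listen on an arbitrary port
    nc -l 9999 > etc-backup.tar.gz
    # On the sending server: push the file to the receiver
    nc damaged-server 9999 < etc-backup.tar.gz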

What other authors are saying

Linux Server Hacks, Volume Two: Tips & Tools for Connecting, Monitoring, and Troubleshooting, by William von Hagen and Brian K. Jones

Avoid Common Junior Mistakes

Get over the junior admin hump and land in guru territory.

No matter how "senior" you become, and no matter how omnipotent you feel in your current role, you will eventually make mistakes. Some of them may be quite large. Some will wipe entire weekends right off the calendar. However, the key to success in administering servers is to mitigate risk, have an exit plan, and try to make sure that the damage caused by potential mistakes is limited. Here are some common mistakes to avoid on your road to senior-level guru status.

Don't Take the root Name in Vain

Try really hard to forget about root. Here's a quick comparison of the usage of root by a seasoned vet versus by a junior administrator.

Solid, experienced administrators will occasionally forget that they need to be root to perform some function. Of course they know they need to be root as soon as they see their terminal filling with errors, but running su - root occasionally slips their mind. No big deal. They switch to root, they run the command, and they exit the root shell. If they need to run only a single command, such as a make install, they probably just run it like this:

	$ su -c 'make install'

This will prompt you for the root password and, if the password is correct, will run the command and dump you back to your lowly user shell.

A junior-level admin, on the other hand, is likely to have five terminals open on the same box, all logged in as root. Junior admins don't consider keeping a terminal that isn't logged in as root open on a production machine, because "you need root to do anything anyway." This is horribly bad form, and it can lead to some really horrid results. Don't become root if you don't have to be root!

Building software is a good example. After you download a source package, unzip it in a place you have access to as a user. Then, as a normal user, run your ./configure and make commands. If you're installing the package to your ~/bin directory, you can run make install as yourself. You only need root access if the program will be installed into directories to which only root has write access, such as /usr/local.
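As a sketch of that workflow (the package name example-1.0 is hypothetical), the whole build can be done as an ordinary user, and root is needed only when the install prefix is a system directory:

# unpack and build entirely as an unprivileged user
tar xzf example-1.0.tar.gz
cd example-1.0
./configure --prefix=$HOME          # installing under your home directory never needs root
make

# only if the prefix is a system directory such as /usr/local does the install step need root
su -c 'make install'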

My mind was blown one day when I was introduced to an entirely new meaning of "taking the root name in vain." It doesn't just apply to running commands as root unnecessarily. It also applies to becoming root specifically to grant unprivileged access to things that should only be accessible by root!

I was logged into a client's machine (as a normal user, of course), poking around because the user had reported seeing some odd log messages. One of my favorite commands for tracking down issues like this is ls -lahrt /etc, which does a long listing of everything in the directory, reverse sorted by modification time. In this case, the last thing listed (and hence, the last thing modified) was /etc/shadow. Not too odd if someone had added a user to the local machine recently, but it so happened that this company used NIS+, and the permissions had been changed on the file!

I called the number they'd told me to call if I found anything, and a junior administrator admitted that he had done that himself because he was writing a script that needed to access that file. Ugh.

Don't Get Too Comfortable

Junior admins tend to get really into customizing their environments. They like to show off all the cool things they've recently learned, so they have custom window manager setups, custom logging setups, custom email configurations, custom tunneling scripts to do work from their home machines, and, of course, custom shells and shell initializations.

That last one can cause a bit of a headache. If you have a million aliases set up on your local machine and some other set of machines that mount your home directory (thereby making your shell initialization accessible), things will probably work out for that set of machines. More likely, however, is that you're working in a mixed environment with Linux and some other Unix variant. Furthermore, the powers that be may have standard aliases and system-wide shell profiles that were there long before you were.

At the very least, if you modify the shell you have to test that everything you're doing works as expected on all the platforms you administer. Better is just to keep a relatively bare-bones administrative shell. Sure, set the proper environment variables, create three or four aliases, and certainly customize the command prompt if you like, but don't fly off into the wild blue yonder sourcing all kinds of bash completion commands, printing the system load to your terminal window, and using shell functions to create your shell prompt. Why not?

Well, because you can't assume that the same version of your shell is running everywhere, or that the shell was built with the same options across multiple versions of multiple platforms! Furthermore, you might not always be logging in from your desktop. Ever see what happens if you mistakenly set up your initialization file to print stuff to your terminal's titlebar without checking where you're coming from? The first time you log in from a dumb terminal, you'll realize it wasn't the best of ideas. Your prompt can wind up being longer than the screen!
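One common defensive pattern, shown below as a bash sketch, is to guard titlebar escape sequences with a check of $TERM so that dumb terminals and serial consoles are left alone (the exact prompt string is just an example):

# in ~/.bashrc: only emit the xterm titlebar escape when the terminal understands it
case "$TERM" in
    xterm*|rxvt*|screen*)
        PROMPT_COMMAND='printf "\033]0;%s@%s:%s\007" "$USER" "${HOSTNAME%%.*}" "$PWD"'
        ;;
    *)
        ;;   # dumb terminals, serial consoles, etc.: no titlebar tricks
esac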

Just as versions and build options for your shell can vary across machines, so too can "standard" commands -- drastically! Running chown -R has wildly different effects on Solaris than it does on Linux machines, for example. Solaris will follow symbolic links and keep on truckin', happily skipping about your directory hierarchy and recursively changing ownership of files in places you forgot existed. This doesn't happen under Linux. To get Linux to behave the same way, you need to use the -H flag explicitly. There are lots of commands that exhibit different behavior on different operating systems, so be on your toes!
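A sketch of one portable way to sidestep that particular surprise (the path /opt/app and the owner appuser:appgroup are placeholders; -print0 and xargs -0 assume a GNU or BSD userland):

# find does not follow symbolic links by default, and chown -h changes the link itself
# rather than whatever it points to, so nothing outside /opt/app gets touched
find /opt/app -print0 | xargs -0 chown -h appuser:appgroup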

Also, test your shell scripts across platforms to make sure that the commands you call from within the scripts act as expected in any environments they may wind up in.

Don't Perform Production Commands "Off the Cuff"

Many environments have strict rules about how software gets installed, how new machines are built and pushed into production, and so on. However, there are also thousands of sites that don't enforce any such rules, which quite frankly can be a bit scary.

Not having the funds to come up with a proper testing and development environment is one thing. Having a blatant disregard for the availability of production services is quite another. When performing software installations, configuration changes, mass data migrations, and the like, do yourself a huge favor (actually, a couple of favors):

Script the procedure!
Script it and include checks to make sure that everything in the script runs without making any assumptions. Check to make sure each step has succeeded before moving on.
Script a backout procedure.
If you've moved all the data, changed the configuration, added a user for an application to run as, and installed the application, and something blows up, you really will not want to spend another 40 minutes cleaning things up so that you can get things back to normal. In addition, if things blow up in production, you could panic, causing you to misjudge, mistype, and possibly make things worse. Script it!

The process of scripting these procedures also forces you to think about the consequences of what you're doing, which can have surprising results. I once got a quarter of the way through a script before realizing that there was an unmet dependency that nobody had considered. This realization saved us a lot of time and some cleanup as well.
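A minimal skeleton of such a scripted change, with a per-step check and a snapshot that a backout script could restore from (the paths and the service name myapp are illustrative only):

#!/bin/bash
# stop on the first unexpected failure instead of plowing ahead
set -euo pipefail

backup_dir=/root/change-backup-$(date +%Y%m%d-%H%M%S)

step() {    # run a command, log it, and abort with a clear message if it fails
    echo ">> $*"
    "$@" || { echo "FAILED at step: $* -- run the backout script" >&2; exit 1; }
}

step mkdir -p "$backup_dir"
step cp -a /etc/myapp "$backup_dir/"        # snapshot the config being changed
step systemctl stop myapp
# ... the actual change goes here ...
step systemctl start myapp
echo "Change complete; backup kept in $backup_dir"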

Ask Questions

The best tip any administrator can give is to be conscious of your own ignorance. Don't assume you know every conceivable side effect

Dr. Nikolai Bezroukov



Old News ;-)

"Those Who Forget History Are Doomed to Repeat It"

Multiple authors

"Those who cannot remember the past are condemned to repeat it."

George Santayana

An "Ohnosecond" is defined as the period of time between when you hit enter and you realize what you just did.

[Nov 18, 2020] Why the lone wolf mentality is a sysadmin mistake by Scott McBrien

Jul 10, 2019 | www.redhat.com

If you have worked in system administration for a while, you've probably run into a system administrator who doesn't write anything down and keeps their work a closely-guarded secret. When I've run into administrators like this, I often ask why they do this, and the response is usually a joking, "Job security." Which, may not actually be all that joking.

Don't be that person. I've worked in several shops, and I have yet to see someone "work themselves out of a job." What I have seen, however, is someone that can't take a week off without being called by the team repeatedly. Or, after this person left, I have seen a team struggle to detangle the mystery of what that person was doing, or how they were managing systems under their control.

[Nov 05, 2020] Wiping out RHEL 7 auth files (passwd, shadow, groups) with RHEL6 auth files due to misunderstanding about what version of RHEL is on target server

The story was sent to me by email. In the case discussed it was not a human error per se. An old version of the script that had this operation as a part of its functionality was restored accidentally, and the error remained unnoticed. Then, when the script ran to create a new user and propagated the wrong auth files to RHEL 7 servers, ssh stopped working. The disaster happened on around two dozen servers at once (the power of Ansible, pdsh and friends ;-)

While propagating changes in passwd/shadow/group by copying them from a reference server is a widely used shortcut, it easily becomes a blunder when files from the wrong version are pushed (for example, auth files from RHEL 6 pushed to RHEL 7 servers).

In this case RHEL 6 auth files were pushed onto RHEL 7 servers. As a result ssh instantly stopped working, and access to the remote servers was possible only via DRAC/iLO. Having a ticker that runs from cron and writes information about the current status of the server to NFS helps in such a case, but only if the ticker itself is launched from NFS, because then it can be modified to restore the files from a backup.
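Below is a minimal sketch of the kind of guard that would have caught this: before pushing anything, compare the major release of each target with the reference box and skip mismatches. The host list, the use of scp, and the grep on /etc/redhat-release are illustrative assumptions, not a recommendation to distribute shadow files this way:

#!/bin/bash
# refuse to push auth files to any host whose major RHEL release differs from this box
ref=$(grep -oE 'release [0-9]+' /etc/redhat-release | awk '{print $2}')

for host in "$@"; do
    target=$(ssh "$host" "grep -oE 'release [0-9]+' /etc/redhat-release | awk '{print \$2}'")
    if [ "$target" != "$ref" ]; then
        echo "SKIPPING $host: it runs RHEL $target, the reference box runs RHEL $ref" >&2
        continue
    fi
    scp -p /etc/passwd /etc/shadow /etc/group "root@$host:/etc/"
done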

[Nov 02, 2020] Wiping files due to misguided attempt to save space.

One installation implemented the seemingly bright idea of having a shared disk space that is monitored by a cron script, which first warns about and then, a week later, removes files that are more than two months old.

Guess what happened when a user tried to download files to this space from another server using a download method that preserved timestamps.
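A sketch of the trap, with /shared/scratch standing in for the shared area: the cleanup job selects files by modification time, so a transfer that preserves timestamps (rsync -a, for instance) makes freshly copied files look two months old the moment they land:

# the cleanup policy described above, roughly as it is typically implemented
find /shared/scratch -type f -mtime +60 -print    # week 1: list candidates and warn their owners
find /shared/scratch -type f -mtime +67 -delete   # week 2: actually remove them

# the trap: -a preserves the original modification times, so these "new" files
# are already far past the 60-day cutoff on arrival
rsync -a user@otherhost:/data/archive/ /shared/scratch/archive/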

[Jul 14, 2020] Sysadmin tales- How to keep calm and not panic when things break by Glen Newell

Jul 10, 2020 | www.redhat.com

When an incident occurs, resist the urge to freak out. Instead, use these tips to help you keep your cool and find a solution.

I was working on several projects simultaneously for a small company that had been carved out of a larger one that had gone out of business. The smaller company had inherited some of the bigger company's infrastructure, and all the headaches along with it. That day, I had some additional consultants working with me on a project to migrate email service from a large proprietary onsite cluster to a cloud provider, while at the same time, I was working on reconfiguring a massive storage array.

At some point, I clicked the wrong button. All of a sudden, I started getting calls. The CIO and the consultants were standing in front of my desk. The email servers were completely offline -- they responded, but could not access the backing storage. I didn't know it yet, but I had deleted the storage pool for the active email servers.

My vision blurred into a tunnel, and my stomach fell into a bottomless pit. I struggled to breathe. I did my best to maintain a poker face as the executives and consultants watched impatiently. I scanned logs and messages looking for clues. I ran tests on all the components to find the source of the issue and came up with nothing. The data seemed to be gone, and panic was setting in.

I pushed back from the desk and excused myself to use the restroom. Closing and latching the door behind me, I contemplated my fate for a moment, then splashed cold water on my face and took a deep breath. Then it dawned on me: earlier, I had set up an active mirror of that storage pool. The data was all there; I just needed to reconnect it.

I returned to my desk and couldn't help a bit of a smirk. A couple of commands, a couple of clicks, and a sip of coffee. About five minutes of testing, and I could say, "Sorry, guys. Should be good now." The whole thing had happened in about 30 minutes.

We've all been there

Everyone makes mistakes, even the most senior and venerable engineers and systems administrators. We're all human. It just so happens that, as a sysadmin, a small mistake in a moment can cause very visible problems, and, PANIC. This is normal, though. What separates the hero from the unemployed in that moment, can be just a few simple things.

When an incident occurs, focusing on who's at fault can be tempting; blame is something we know how to do and can do something about, and it can even offer some relief if we can tell ourselves it's not our fault. But in fact, blame accomplishes nothing and can be counterproductive in a moment of crisis -- it can distract us from finding a solution to the problem, and create even more stress.

Backups, backups, backups

This is just one of the times when having a backup saved the day for me, and for a client. Every sysadmin I've ever worked with will tell you the same thing -- always have a backup. Do regular backups. Make backups of configurations you are working on. Make a habit of creating a backup as the first step in any project. There are some great articles here on Enable Sysadmin about the various things you can do to protect yourself.

Another good practice is to never work on production systems until you have tested the change. This may not always be possible, but if it is, the extra effort and time will be well worth it for the rare occasions when you have an unexpected result, so you can avoid the panic of wondering where you might have saved your most recent resume. Having a plan and being prepared can go a long way to avoiding those very stressful situations.

Breathe in, breathe out

The panic response in humans is related to the "fight or flight" reflex, which served our ancestors so well. It's a really useful resource for avoiding saber tooth tigers (and angry CFOs), but not so much for understanding and solving complex technical problems. Understanding that it's normal but not really helpful, we can recognize it and find a way to overcome it in the moment.

The simplest way we can tame the impulse to blackout and flee is to take a deep breath (or several). Studies have shown that simple breathing exercises and meditation can improve our general outlook and ability to focus on a specific task. There is also evidence that temperature changes can make a difference; something as simple as a splash of water on the face or an ice-cold beverage can calm a panic. These things work for me.

Walk the path of troubleshooting, one step at a time

Once we have convinced ourselves that the world is not going to end immediately, we can focus on solving the problem. Take the situation one element, one step at a time to find what went wrong, then take that and apply the solution(s) systematically. Again, it's important to focus on the problem and solution in front of you rather than worrying about things you can't do anything about right now or what might happen later. Remember, blame is not helpful, and that includes blaming yourself.

Most often, when I focus on the problem, I find that I forget to panic, and I can do even better work on the solution. Many times, I have found solutions I wouldn't have seen or thought of otherwise in this state.

Take five

Another thing that's easy to forget is that, when you've been working on a problem, it's important to give yourself a break. Drink some water. Take a short walk. Rest your brain for a couple of minutes. Hunger, thirst, and fatigue can lead to less clear thinking and, you guessed it, panic.

Time to face the music

My last piece of advice -- though certainly not the least important -- is, if you are responsible for an incident, be honest about what happened. This will benefit you for both the short and long term.

During the early years of the space program, the directors and engineers at NASA established a routine of getting together and going over what went wrong and what and how to improve for the next time. The same thing happens in the military, emergency management, and healthcare fields. It's also considered good agile/DevOps practice. Some of the smartest, highest-strung engineers, administrators, and managers I've known and worked with -- people with millions of dollars and thousands of lives in their area of responsibility -- have insisted on the importance of learning lessons from mistakes and incidents. It's a mark of a true professional to own up to mistakes and work to improve.

It's hard to lose face, but not only will your colleagues appreciate you taking responsibility and working to improve the team, but I promise you will rest better and be able to manage the next problem better if you look at these situations as learning opportunities.

Accidents and mistakes can't ever be avoided entirely, but hopefully, you will find some of this advice useful the next time you face an unexpected challenge.


[Nov 08, 2019] What breaks our systems: A taxonomy of black swans by Laura Nolan

Oct 25, 2018 | opensource.com
Find and fix outlier events that create issues before they trigger severe production problems.

Black swans are a metaphor for outlier events that are severe in impact (like the 2008 financial crash). In production systems, these are the incidents that trigger problems that you didn't know you had, cause major visible impact, and can't be fixed quickly and easily by a rollback or some other standard response from your on-call playbook. They are the events you tell new engineers about years after the fact.

Black swans, by definition, can't be predicted, but sometimes there are patterns we can find and use to create defenses against categories of related problems.

For example, a large proportion of failures are a direct result of changes (code, environment, or configuration). Each bug triggered in this way is distinctive and unpredictable, but the common practice of canarying all changes is somewhat effective against this class of problems, and automated rollbacks have become a standard mitigation.

As our profession continues to mature, other kinds of problems are becoming well-understood classes of hazards with generalized prevention strategies.

Black swans observed in the wild

All technology organizations have production problems, but not all of them share their analyses. The organizations that publicly discuss incidents are doing us all a service. The following incidents describe one class of a problem and are by no means isolated instances. We all have black swans lurking in our systems; it's just some of us don't know it yet.

Hitting limits

Programming and development

Running headlong into any sort of limit can produce very severe incidents. A canonical example of this was Instapaper's outage in February 2017 . I challenge any engineer who has carried a pager to read the outage report without a chill running up their spine. Instapaper's production database was on a filesystem that, unknown to the team running the service, had a 2TB limit. With no warning, it stopped accepting writes. Full recovery took days and required migrating its database.

Limits can strike in various ways. Sentry hit limits on maximum transaction IDs in Postgres. Platform.sh hit size limits on a pipe buffer. SparkPost triggered AWS's DDoS protection. Foursquare hit a performance cliff when one of its datastores ran out of RAM.

One way to get advance knowledge of system limits is to test periodically. Good load testing (on a production replica) ought to involve write transactions and should involve growing each datastore beyond its current production size. It's easy to forget to test things that aren't your main datastores (such as Zookeeper). If you hit limits during testing, you have time to fix the problems. Given that resolution of limits-related issues can involve major changes (like splitting a datastore), time is invaluable.

When it comes to cloud services, if your service generates unusual loads or uses less widely used products or features (such as older or newer ones), you may be more at risk of hitting limits. It's worth load testing these, too. But warn your cloud provider first.

Finally, where limits are known, add monitoring (with associated documentation) so you will know when your systems are approaching those ceilings. Don't rely on people still being around to remember.
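As a crude sketch of that kind of guard rail (the mount point, threshold, and mail address are placeholders; a real setup would feed a proper monitoring system rather than cron mail):

#!/bin/bash
# warn when a filesystem with a known hard limit passes 80% of its capacity
threshold=80
mountpoint=/var/lib/pgsql
usage=$(df --output=pcent "$mountpoint" | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: $mountpoint is at ${usage}% of capacity" \
        | mail -s "approaching filesystem limit on $(hostname)" oncall@example.com
fi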

Spreading slowness
"The world is much more correlated than we give credit to. And so we see more of what Nassim Taleb calls 'black swan events' -- rare events happen more often than they should because the world is more correlated."
-- Richard Thaler

HostedGraphite's postmortem on how an AWS outage took down its load balancers (which are not hosted on AWS) is a good example of just how much correlation exists in distributed computing systems. In this case, the load-balancer connection pools were saturated by slow connections from customers that were hosted in AWS. The same kinds of saturation can happen with application threads, locks, and database connections -- any kind of resource monopolized by slow operations.

HostedGraphite's incident is an example of externally imposed slowness, but often slowness can result from saturation somewhere in your own system creating a cascade and causing other parts of your system to slow down. An incident at Spotify demonstrates such spread -- the streaming service's frontends became unhealthy due to saturation in a different microservice. Enforcing deadlines for all requests, as well as limiting the length of request queues, can prevent such spread. Your service will serve at least some traffic, and recovery will be easier because fewer parts of your system will be broken.

Retries should be limited with exponential backoff and some jitter. An outage at Square, in which its Redis datastore became overloaded due to a piece of code that retried failed transactions up to 500 times with no backoff, demonstrates the potential severity of excessive retries. The Circuit Breaker design pattern can be helpful here, too.
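A sketch of a bounded retry loop with exponential backoff and jitter, written in shell for illustration; some_flaky_command stands in for whatever call fails transiently:

#!/bin/bash
# retry a transiently failing command at most five times, backing off exponentially with jitter
max_attempts=5
delay=1
for attempt in $(seq 1 "$max_attempts"); do
    if some_flaky_command; then
        exit 0
    fi
    sleep_for=$(( RANDOM % delay + 1 ))     # jitter: sleep a random 1..delay seconds
    echo "attempt $attempt failed, sleeping ${sleep_for}s" >&2
    sleep "$sleep_for"
    delay=$(( delay * 2 ))                  # exponential backoff
done
echo "giving up after $max_attempts attempts" >&2
exit 1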

Dashboards should be designed to clearly show utilization, saturation, and errors for all resources so problems can be found quickly.

Thundering herds

Often, failure scenarios arise when a system is under unusually heavy load. This can arise organically from users, but often it arises from systems. A surge of cron jobs that starts at midnight is a venerable example. Mobile clients can also be a source of coordinated demand if they are programmed to fetch updates at the same time (of course, it is much better to jitter such requests).
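A common illustration of such jitter is to put a random delay in front of the job itself; the crontab fragment below assumes bash as the cron shell (because $RANDOM is a bash feature) and a placeholder job path:

# crontab fragment: spread the midnight herd over a random hour instead of hitting 00:00 exactly
# (the % must be backslash-escaped because cron treats an unescaped % specially)
SHELL=/bin/bash
0 0 * * *  sleep $((RANDOM \% 3600)) && /usr/local/bin/nightly-job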

Events occurring at pre-configured times aren't the only source of thundering herds. Slack experienced multiple outages over a short time due to large numbers of clients being disconnected and immediately reconnecting, causing large spikes of load. CircleCI saw a severe outage when a GitLab outage ended, leading to a surge of builds queued in its database, which became saturated and very slow.

Almost any service can be the target of a thundering herd. Planning for such eventualities -- and testing that your plan works as intended -- is therefore a must. Client backoff and load shedding are often core to such approaches.

If your systems must constantly ingest data that can't be dropped, it's key to have a scalable way to buffer this data in a queue for later processing.

Automation systems are complex systems
"Complex systems are intrinsically hazardous systems."
-- Richard Cook, MD

The trend for the past several years has been strongly towards more automation of software operations. Automation of anything that can reduce your system's capacity (e.g., erasing disks, decommissioning devices, taking down serving jobs) needs to be done with care. Accidents (due to bugs or incorrect invocations) with this kind of automation can take down your system very efficiently, potentially in ways that are hard to recover from.

Christina Schulman and Etienne Perot of Google describe some examples in their talk Help Protect Your Data Centers with Safety Constraints . One incident sent Google's entire in-house content delivery network (CDN) to disk-erase.

Schulman and Perot suggest using a central service to manage constraints, which limits the pace at which destructive automation can operate, and being aware of system conditions (for example, avoiding destructive operations if the service has recently had an alert).

Automation systems can also cause havoc when they interact with operators (or with other automated systems). Reddit experienced a major outage when its automation restarted a system that operators had stopped for maintenance. Once you have multiple automation systems, their potential interactions become extremely complex and impossible to predict.

It will help to deal with the inevitable surprises if all this automation writes logs to an easily searchable, central place. Automation systems should always have a mechanism to allow them to be quickly turned off (fully or only for a subset of operations or targets).
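One very simple form of such an off switch, sketched below with an arbitrary flag-file path, is to have every destructive automation job check for an operator-controlled file before doing anything:

# at the top of every destructive automation script: honour a global operator kill switch
if [ -e /etc/automation/DISABLED ]; then
    echo "automation disabled by operator: $(cat /etc/automation/DISABLED 2>/dev/null)" >&2
    exit 0
fi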

Defense against the dark swans

These are not the only black swans that might be waiting to strike your systems. There are many other kinds of severe problem that can be avoided using techniques such as canarying, load testing, chaos engineering, disaster testing, and fuzz testing -- and of course designing for redundancy and resiliency. Even with all that, at some point your system will fail.

To ensure your organization can respond effectively, make sure your key technical staff and your leadership have a way to coordinate during an outage. For example, one unpleasant issue you might have to deal with is a complete outage of your network. It's important to have a fail-safe communications channel completely independent of your own infrastructure and its dependencies. For instance, if you run on AWS, using a service that also runs on AWS as your fail-safe communication method is not a good idea. A phone bridge or an IRC server that runs somewhere separate from your main systems is good. Make sure everyone knows what the communications platform is and practices using it.

Another principle is to ensure that your monitoring and your operational tools rely on your production systems as little as possible. Separate your control and your data planes so you can make changes even when systems are not healthy. Don't use a single message queue for both data processing and config changes or monitoring, for example -- use separate instances. In SparkPost: The Day the DNS Died , Jeremy Blosser presents an example where critical tools relied on the production DNS setup, which failed.

The psychology of battling the black swan

Dealing with major incidents in production can be stressful. It really helps to have a structured incident-management process in place for these situations. Many technology organizations (including Google) successfully use a version of FEMA's Incident Command System. There should be a clear way for any on-call individual to call for assistance in the event of a major problem they can't resolve alone.

For long-running incidents, it's important to make sure people don't work for unreasonable lengths of time and get breaks to eat and sleep (uninterrupted by a pager). It's easy for exhausted engineers to make a mistake or overlook something that might resolve the incident faster.

Learn more

There are many other things that could be said about black (or formerly black) swans and strategies for dealing with them. If you'd like to learn more, I highly recommend these two books dealing with resilience and stability in production: Susan Fowler's Production-Ready Microservices and Michael T. Nygard's Release It! .


Laura Nolan will present What Breaks Our Systems: A Taxonomy of Black Swans at LISA18 , October 29-31 in Nashville, Tennessee, USA.

[Nov 08, 2019] How to prevent and recover from accidental file deletion in Linux Enable Sysadmin

Trashy (trashy · GitLab) might make sense in simple cases. But often massive file deletions are about attempts to get free space.
Nov 08, 2019 | www.redhat.com
Back up

You knew this would come first. Data recovery is a time-intensive process and rarely produces 100% correct results. If you don't have a backup plan in place, start one now.

Better yet, implement two. First, provide users with local backups with a tool like rsnapshot . This utility creates snapshots of each user's data in a ~/.snapshots directory, making it trivial for them to recover their own data quickly.

There are a great many other open source backup applications that permit your users to manage their own backup schedules.

Second, while these local backups are convenient, also set up a remote backup plan for your organization. Tools like AMANDA or BackupPC are solid choices for this task. You can run them as a daemon so that backups happen automatically.

Backup planning and preparation pay for themselves in both time, and peace of mind. There's nothing like not needing emergency response procedures in the first place.

Ban rm

On modern operating systems, there is a Trash or Bin folder where users drag the files they don't want out of sight without deleting them just yet. Traditionally, the Linux terminal has no such holding area, so many terminal power users have the bad habit of permanently deleting data they believe they no longer need. Since there is no "undelete" command, this habit can be quite problematic should a power user (or administrator) accidentally delete a directory full of important data.

Many users say they favor the absolute deletion of files, claiming that they prefer their computers to do exactly what they tell them to do. Few of those users, though, forego their rm command for the more complete shred , which really removes their data. In other words, most terminal users invoke the rm command because it removes data, but take comfort in knowing that file recovery tools exist as a hacker's un- rm . Still, using those tools takes up their administrator's precious time. Don't let your users -- or yourself -- fall prey to this breach of logic.

If you really want to remove data, then rm is not sufficient. Use the shred -u command instead, which overwrites and then thoroughly deletes the specified data.

However, if you don't want to actually remove data, don't use rm . This command is not feature-complete, in that it has no undo feature, but has the capacity to be undone. Instead, use trashy or trash-cli to "delete" files into a trash bin while using your terminal, like so:

$ trash ~/example.txt
$ trash --list
example.txt

One advantage of these commands is that the trash bin they use is the same as your desktop's trash bin. With them, you can recover your trashed files either by opening your desktop Trash folder or through the terminal.

If you've already developed a bad rm habit and find the trash command difficult to remember, create an alias for yourself:

$ echo "alias rm='trash'"

Even better, create this alias for everyone. Your time as a system administrator is too valuable to spend hours struggling with file recovery tools just because someone mis-typed an rm command.
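One way to do that on a typical Linux box, assuming trash-cli (or trashy) is already installed and that interactive shells source /etc/profile.d:

# make the alias available to every interactive bash user on the machine
echo "alias rm='trash'" > /etc/profile.d/trash-alias.sh
chmod 644 /etc/profile.d/trash-alias.sh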

Respond efficiently

Unfortunately, it can't be helped. At some point, you'll have to recover lost files, or worse. Let's take a look at emergency response best practices to make the job easier. Before you even start, understanding what caused the data to be lost in the first place can save you a lot of time:

No matter how the problem began, start your rescue mission with a few best practices:

Once you have a sense of what went wrong, it's time to choose the right tool to fix the problem. Two such tools are Scalpel and TestDisk , both of which operate just as well on a disk image as on a physical drive.

Practice (or, go break stuff)

At some point in your career, you'll have to recover data. The smart practices discussed above can minimize how often this happens, but there's no avoiding this problem. Don't wait until disaster strikes to get familiar with data recovery tools. After you set up your local and remote backups, implement command-line trash bins, and limit the rm command, it's time to practice your data recovery techniques.

Download and practice using Scalpel, TestDisk, or whatever other tools you feel might be useful. Be sure to practice data recovery safely, though. Find an old computer, install Linux onto it, and then generate, destroy, and recover. If nothing else, doing so teaches you to respect data structures, filesystems, and a good backup plan. And when the time comes and you have to put those skills to real use, you'll appreciate knowing what to do.

[Nov 08, 2019] My first sysadmin mistake by Jim Hall

Wiping out the /etc directory is one thing that sysadmins accidentally do. This often happens when some other directory is named etc, for example /Backup/etc. In such cases you automatically put a slash in front of etc because it is ingrained in your mind, and you do it subconsciously, not realizing what you are doing. And then you face the consequences. If you do not use saferm, the results are pretty devastating. In most cases the server does not die, but new logins become impossible; existing SSH sessions survive. That's why it is important to back up /etc at the first login to the server. On modern servers it takes a couple of seconds.
If subdirectories are intact, you can still copy their content from another server. But the content of the sysconfig subdirectory in Linux is unique to the server, and you need a backup to restore it.
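A minimal sketch of that once-a-day safety net, placed in root's shell startup file; the backup location is arbitrary, and the archive is created in the background so the login is not delayed:

# in /root/.bash_profile: keep one tarball of /etc per day, created at the first root login
backup_dir=/root/etc-backups
today=$(date +%Y%m%d)
if [ ! -e "$backup_dir/etc-$today.tar.gz" ]; then
    mkdir -p "$backup_dir"
    ( tar czf "$backup_dir/etc-$today.tar.gz" /etc 2>/dev/null & )
fi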
Notable quotes:
"... As root. I thought I was deleting some stale cache files for one of our programs. Instead, I wiped out all files in the /etc directory by mistake. Ouch. ..."
"... I put together a simple strategy: Don't reboot the server. Use an identical system as a template, and re-create the ..."
Nov 08, 2019 | opensource.com
I ran the rm command in the wrong directory. As root. I thought I was deleting some stale cache files for one of our programs. Instead, I wiped out all files in the /etc directory by mistake. Ouch.

My clue that I'd done something wrong was an error message that rm couldn't delete certain subdirectories. But the cache directory should contain only files! I immediately stopped the rm command and looked at what I'd done. And then I panicked. All at once, a million thoughts ran through my head. Did I just destroy an important server? What was going to happen to the system? Would I get fired?

Fortunately, I'd run rm * and not rm -rf * so I'd deleted only files. The subdirectories were still there. But that didn't make me feel any better.

Immediately, I went to my supervisor and told her what I'd done. She saw that I felt really dumb about my mistake, but I owned it. Despite the urgency, she took a few minutes to do some coaching with me. "You're not the first person to do this," she said. "What would someone else do in your situation?" That helped me calm down and focus. I started to think less about the stupid thing I had just done, and more about what I was going to do next.

I put together a simple strategy: Don't reboot the server. Use an identical system as a template, and re-create the /etc directory.

Once I had my plan of action, the rest was easy. It was just a matter of running the right commands to copy the /etc files from another server and edit the configuration so it matched the system. Thanks to my practice of documenting everything, I used my existing documentation to make any final adjustments. I avoided having to completely restore the server, which would have meant a huge disruption.

To be sure, I learned from that mistake. For the rest of my years as a systems administrator, I always confirmed what directory I was in before running any command.

I also learned the value of building a "mistake strategy." When things go wrong, it's natural to panic and think about all the bad things that might happen next. That's human nature. But creating a "mistake strategy" helps me stop worrying about what just went wrong and focus on making things better. I may still think about it, but knowing my next steps allows me to "get over it."

[Nov 08, 2019] How to use Sanoid to recover from data disasters Opensource.com

Nov 08, 2019 | opensource.com

Syncoid uses ZFS filesystem-level snapshot replication to move data from one machine to another, fast. For enormous blobs like virtual machine images, we're talking several orders of magnitude faster than rsync.

If that isn't cool enough already, you don't even necessarily need to restore from backup if you lost the production hardware; you can just boot up the VM directly on the local hotspare hardware, or the remote disaster recovery hardware, as appropriate. So even in case of catastrophic hardware failure , you're still looking at that 59m RPO, <1m RTO.


Backups -- and recoveries -- don't get much easier than this.

The syntax is dead simple:

root@box1:~# syncoid pool/images/vmname root@box2:poolname/images/vmname

Or if you have lots of VMs, like I usually do... recursion!

root@box1:~# syncoid -r pool/images/vmname root@box2:poolname/images/vmname

This makes it not only possible, but easy to replicate multiple-terabyte VM images hourly over a local network, and daily over a VPN. We're not talking enterprise 100mbps symmetrical fiber, either. Most of my clients have 5mbps or less available for upload, which doesn't keep them from automated, nightly over-the-air backups, usually to a machine sitting quietly in an owner's house.

Preventing your own Humpty Level Events

Sanoid is open source software, and so are all its dependencies. You can run Sanoid and Syncoid themselves on pretty much anything with ZFS. I developed it and use it on Linux myself, but people are using it (and I support it) on OpenIndiana, FreeBSD, and FreeNAS too.

You can find the GPLv3 licensed code on the website (which actually just redirects to Sanoid's GitHub project page), and there's also a Chef Cookbook and an Arch AUR repo available from third parties.

[Nov 06, 2019] Sysadmin 101 Leveling Up by Kyle Rankin

Nov 06, 2019 | www.linuxjournal.com

This is the fourth in a series of articles on systems administrator fundamentals. These days, DevOps has made even the job title "systems administrator" seem a bit archaic, like the "systems analyst" title it replaced. These DevOps positions are rather different from sysadmin jobs in the past, with a much larger emphasis on software development far beyond basic shell scripting, and as a result they are often filled by people with software development backgrounds but without much prior sysadmin experience.

In the past, a sysadmin would enter the role at a junior level and be mentored by a senior sysadmin on the team, but in many cases these days, companies go quite a while with cloud outsourcing before their first DevOps hire. As a result, the DevOps engineer might be thrust into the role at a junior level with no mentor around apart from search engines and Stack Overflow posts.

In the first article in this series, I explained how to approach alerting and on-call rotations as a sysadmin. In the second article , I discussed how to automate yourself out of a job. In the third , I covered why and how you should use tickets. In this article, I describe the overall sysadmin career path and what I consider the attributes that might make you a "senior sysadmin" instead of a "sysadmin" or "junior sysadmin", along with some tips on how to level up.

Keep in mind that titles are pretty fluid and loose things, and that they mean different things to different people. Also, it will take different people different amounts of time to "level up" depending on their innate sysadmin skills, their work ethic and the opportunities they get to gain more experience. That said, be suspicious of anyone who leveled up to a senior level in any field in only a year or two -- it takes time in a career to make the kinds of mistakes and learn the kinds of lessons you need to learn before you can move up to the next level.

Kyle Rankin is a Tech Editor and columnist at Linux Journal and the Chief Security Officer at Purism. He is the author of Linux Hardening in Hostile Networks , DevOps Troubleshooting , The Official Ubuntu Server Book , Knoppix Hacks , Knoppix Pocket Reference , Linux Multimedia Hacks and Ubuntu Hacks , and also a contributor to a number of other O'Reilly books. Rankin speaks frequently on security and open-source software including at BsidesLV, O'Reilly Security Conference, OSCON, SCALE, CactusCon, Linux World Expo and Penguicon. You can follow him at @kylerankin.

[Nov 06, 2019] 7 Ways to Make Fewer Mistakes at Work by Carey-Lee Dixon

May 31, 2015 | www.linkedin.com
Carey-Lee Dixon, Digital Marketing Executive at LASCO Financial Services

Though mistakes are not intentional and are inevitable, that doesn't mean we should take a carefree approach to getting things done. There are some mistakes we make in the workplace which could be easily avoided if we paid a little more attention to what we were doing. Agree? We've all made them and possibly mulled over a few silly mistakes we have made in the past. But I am here to tell you that mistakes don't make you a 'bad' person; they are more of a great learning experience -- of what you can do better and how you can get it right the next time. And having made a few silly mistakes in my own work life, I am pretty sure that if you adopt a few of the approaches I have been applying, you too will make fewer mistakes at work.

1. Give your full attention to what you are doing

...dedicate uninterrupted times to accomplish that [important] task. Do whatever it takes, to get it done with your full attention, so if it means eliminating distractions, taking breaks in between and working with a to-do list, do it. But trying to send emails, editing that blog post and doing whatever else, may lead to you making a few unwanted mistakes.

Tip: Eliminate distractions.

2. Ask Questions

Often, we make mistakes because we didn't ask that one question. Either we were too proud to, or we thought we had it 'covered.' Unsure about the next step to take or how to undertake a task? Do your homework and ask someone who is more knowledgeable than you are, someone who can guide you accordingly. Worried about what others will think? Who cares? Asking questions only makes you smarter, not dumb. And so what if others think you are dumb? Their opinion doesn't matter anyway; asking questions helps you make fewer mistakes, and as my mom would say, 'Put on the mask and ask'. Each task usually comes with a challenge and requires you to learn something new, so use the resources available to you, like more experienced colleagues, to get all the information you need to make fewer mistakes.

Tip: Do your homework. Ask for help.

3. Use checklists

Checklists can be used to help you structure what needs to be done before you publish that article or submit that project. They are quite useful, especially when you have a million things to do. Since I am responsible for getting multiple tasks done, I often use checklists/to-do lists to keep me structured and to ensure I don't leave anything undone. In general, lists are great, and using one to detail the things to do, or the steps required to move to the next stage, will help minimize errors, especially when you have a number of things on your plate. And did I mention, Richard Branson is also big on lists. That's how he gets a lot of things done.

4. Review, review, review

Carefully review your work. I must admit, I get a little paranoid about delivering error-free work. Like, seriously, I don't like making mistakes and often beat myself up if I send an email with some silly grammatical errors. And that's why reviewing your work before you click send is a must-do. Often, we submit our work with errors because we are working against a tight deadline and didn't give ourselves enough time to review what was done. The last thing you really need is your boss breathing down your neck for the document that was due last week, which you just completed without much time to review it. So, if you have spent endless hours working on a project, are proud of your work and are ready to show it to the team - take a break and come back to review it. Taking a break and then getting back to review what was done will allow you to find those mistakes before others can. And yes, the checklist is quite useful in the review process - so use it.

Tip: Get a second eye.

5. Get a second eye

Even when you have done a careful review, chances are there will still be mistakes. It happens. So getting a second eye, especially from a more experienced person, can catch that one error you overlooked. Sometimes we overlook the details because we are in a hurry or not 100% focused on the task at hand; getting another set of eyes to check for errors, or for an important point you missed, is always useful.

Tip: Get a second eye from someone more experienced or knowledgeable.

6. Allow enough time

In making mistakes at work, I realise I am more prone to them when I am working against a tight deadline. Failure to allow enough time for a project or for review can lead to missed requirements and incompleteness, which results in failure to meet expectations. That's why it is essential to be smart in estimating the time needed to accomplish a task, which should include time for review. Ideally, you want to give yourself enough time to do research, complete the document/project, review what was done and ask for a second eye, so setting realistic schedules is most important in making fewer mistakes.

Tip: Limit working against tight deadlines.

7. Learn from others' mistakes

No matter how much you know or think you know, it is always important to learn from the mistakes of others. What silly mistake did a co-worker make that caused a big stir in the office? Make a note of it and intentionally try not to make the same mistake yourself. Some of the greatest lessons are those we learn from others. So pay attention to the mistakes made in the past, what they did right, what they didn't nail, and how they got out of the rut.

Tip: Pay close attention to the mistakes others make.

No matter how much you know or think you know, it is always important to learn from the mistakes of others. Remember, mistakes are meant to teach you, not break you. So if you make mistakes, it only shows that sometimes we need to take a different approach to getting things done.

Mistakes are meant to teach you not break you

No one wants to make mistakes; I sure don't. But that does not mean we should be afraid of them. I have made quite a few mistakes in my work life, which has only proven that I need to be more attentive and that I need to ask for help more than I usually do. So, take the necessary steps to make fewer mistakes, but at the same time, don't beat yourself up over the ones you make.

A great resource on mistakes in the workplace, Mistakes I Made at Work . A great resource on focusing on less and increasing productivity, One Thing .

____________________________________________________

For more musings, career lessons and tips that you can apply to your personal and professional life, visit my personal blog, www.careyleedixon.com . I enjoy working on being the best version of myself, helping others to grow in their personal and professional lives while doing what matters. For questions or to book me for writing/speaking engagements on career and personal development, email me at [email protected]

[Nov 06, 2019] 10+ mistakes Linux newbies make - TechRepublic

Nov 06, 2019 | www.techrepublic.com


7: Giving up too quickly

Here's another issue I see all too often. After a few hours (or a couple of days) working with Linux, new users will give up for one reason or another. I understand giving up when they realize something simply doesn't work (such as when they MUST use a proprietary application or file format). But seeing Linux not work under average demands is rare these days. If you see new Linux users getting frustrated, try to give them a little extra guidance. Sometimes getting over that initial hump is the biggest challenge they will face.

[Nov 06, 2019] Destroying multiple production databases by Jan Gerrit Kootstra

Aug 08, 2019 | www.redhat.com
In my 22-year career as an IT specialist, I encountered two major incidents where -- due to my mistakes -- important production databases were blown apart. Here are my stories.

Freshman mistake

The first time was in the late 1990s when I started working at a service provider for my local municipality's social benefit agency. I got an assignment as a newbie system administrator to remove retired databases from the server where databases for different departments were consolidated.

Due to a typing error on a top-level directory, I removed two live database files instead of the one retired database. What was worse, due to the complexity of the database consolidation, other databases were hit during the restore, too. Repairing all the databases took approximately 22 hours.

What helped

A good backup, which was tested each night by recovering an empty file at the end of the tar archive catalog after the backup was made.

Future-looking statement

It's important to learn from our mistakes. What I learned is this:

Senior sysadmin mistake

In a period when partly offshoring IT activities was common practice in order to reduce costs, I had to take over a database filesystem extension on a Red Hat 5 cluster. Since I had set up this system a couple of years before, I did not check the current situation.

I assumed the offshore team was familiar with the need to attach all shared LUNs to both nodes of the two-node cluster. My bad; never assume. As an Australian tourist once put it when a friend and I were on vacation in Ireland after my Latin grammar school graduation: "Do not make an arse out of you and me." Or, as another phrase goes: "Assuming is the mother of all mistakes."

Well, I fell for my own trap. I went for the filesystem extension on the active node, and without checking the passive node's ( node2 ) status, tested a failover. Because we had agreed to run the database on node2 until the next update window, I had put myself in trouble.

As the databases started to fail, we brought the database cluster down. No issues yet, but all hell broke loose when I ran a filesystem check on an LVM-based system with missing physical volumes.

Looking back

I would say to myself: you were stupid. Running pvs, lvs, or vgs would have alerted me that LVM had detected issues. Also, comparing the multipath configuration files would have revealed probable issues.

So, next time, I would first check whether LVM reports issues before going for the last resort: a filesystem check and trying to fix the millions of errors, which most of the time destroys files anyway.
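
A hedged sketch of those pre-flight checks (nothing here is specific to the cluster in the story; device and volume names are whatever your site uses):

# Missing or failed physical volumes show up in the standard LVM reports
pvs
vgs
lvs -a -o +devices

# On a multipath SAN cluster, also compare the multipath maps on both nodes
multipath -ll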

What saved my day

What saved my day back then was:

Future-looking statement

I definitely learned some things. For example, always check the environment you're about to work on before any change. Never assume that you know how an environment looks -- change is a constant in IT.

Also, share what you learned from your mistakes. Train offshore colleagues instead of blaming them. Also, inform them about the impact the issue had on the customer's business. A continent's major transport hub cannot be put on hold due to a sysadmin's mistake.

A shutdown of the transport hub might have been needed if we had failed to solve the issue and the backup site, in a data centre of another service provider, had been hurt too. Part of the hub is a harbour, and we could have blown up a part of the harbour next to a village of about 10,000 people if both a cotton ship and an oil tanker had been lost on the harbour master's map and collided.

General lessons learned

I learned some important lessons overall from these and other mistakes:

I cannot stress this enough: Learn from your mistakes to avoid them in the future, rather than learning how to make them on a weekly basis. Jan Gerrit Kootstra, Solution Designer (for Telco network services). Red Hat Accelerator.

[Nov 06, 2019] My 10 Linux and UNIX Command Line Mistakes by Vivek Gite

May 20, 2018 | www.cyberciti.biz

I had only one backup copy of my QT project, and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):
cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly, I ended up running an rsync command and deleted all new files by overwriting them from the backup set (now I have switched to rsnapshot):
rsync -av -delete /dest /src
Again, I had no backup.

... ... ...

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill .
From all those mistakes I have learned that:
  1. You must keep a good set of backups. Test your backups regularly too.
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions. (See the Torture-testing Backup and Archive Programs paper.)
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot (see the sketch after this list).
  4. Use CVS/git to store configuration files.
  5. Wait and read the command line twice before hitting the damn [Enter] key.
  6. Use your well-tested perl/shell scripts and open source configuration management software such as Puppet, Ansible, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and more.
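
A minimal sketch of the snapshot idea from point 3, with hypothetical paths: each run produces a full directory tree, but unchanged files are hard-linked against the previous snapshot, so only changed files consume extra space.

# Date-stamped snapshot directory, e.g. /backup/2018-05-20
TODAY=$(date +%F)
# Hard-link unchanged files against the previous snapshot
rsync -a --delete --link-dest=/backup/latest /data/ /backup/$TODAY/
# Point "latest" at the snapshot we just made
ln -sfn /backup/$TODAY /backup/latest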

Mistakes are inevitable, so have you made any that caused some sort of downtime? Please add them in the comments section below.

[Oct 25, 2019] Get inode number of a file on linux - Fibrevillage

Oct 25, 2019 | www.fibrevillage.com

Get inode number of a file on linux

An inode is a data structure in UNIX operating systems that contains important information pertaining to files within a file system. When a file system is created in UNIX, a set amount of inodes is created, as well. Usually, about 1 percent of the total file system disk space is allocated to the inode table.

How do we find a file's inode ?

ls -i Command: display inode
$ls -i /etc/bashrc
131094 /etc/bashrc
131094 is the inode of /etc/bashrc.
Stat Command: display Inode
$stat /etc/bashrc
  File: `/etc/bashrc'
  Size: 1386          Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 131094      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-12-10 10:01:29.509908811 -0800
Modify: 2013-06-06 11:31:51.792356252 -0700
Change: 2013-06-06 11:31:51.792356252 -0700
find command: display inode
$find ./ -iname sysfs_fc_tools.tar -printf '%p %i\n'
./sysfs_fc_tools.tar 28311964

Notes :

    %p stands for file path
    %i stands for inode number
tree command: display inode under a directory
#tree -a -L 1 --inodes /etc
/etc
├── [ 132896]  a2ps
├── [ 132898]  a2ps.cfg
├── [ 132897]  a2ps-site.cfg
├── [ 133315]  acpi
├── [ 131864]  adjtime
├── [ 132340]  akonadi
...
Use case: finding a file by its inode number
Use find / -inum XXXXXX -print to find the full path of each file pointing to inode XXXXXX.

Though you can combine this with an rm action, I discourage doing so, both because of the security concerns around find's delete actions and because on another filesystem the same inode number refers to a completely different file.

filesystem repair

If you are unlucky with your filesystem, most of the time running fsck will fix it. It helps if you have the filesystem's inode information at hand.
This is another big topic; I'll cover it in a separate article.
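
As a hedged illustration (the device name is an example), it is usually worth a read-only pass before letting fsck change anything:

# Report problems only; answer "no" to every repair prompt
fsck -n /dev/sdb1

# Repair interactively only once you have a backup (or nothing left to lose)
fsck /dev/sdb1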

[Oct 25, 2019] Howto Delete files by inode number by Erik

Feb 10, 2011 | erikimh.com
linux administration - tips, notes and projects


Ever mistakenly pipe output to a file with special characters that you couldn't remove?

-rw-r--r-- 1 eriks eriks 4 2011-02-10 22:37 --fooface

Good luck. Anytime you pass any sort of command to this file, it's going to be interpreted as a flag. You can't fool rm, echo, sed, or anything else into actually deeming this a file at this point. You do, however, have an inode for every file.

Traditional methods fail:

[eriks@jaded: ~]$ rm -f --fooface
rm: unrecognized option '--fooface'
Try `rm ./--fooface' to remove the file `--fooface'.
Try `rm --help' for more information.
[eriks@jaded: ~]$ rm -f '--fooface'
rm: unrecognized option '--fooface'
Try `rm ./--fooface' to remove the file `--fooface'.
Try `rm --help' for more information.

So now what, do you live forever with this annoyance of a file sitting inside your filesystem, never to be removed or touched again? Nah.

We can remove a file, simply by an inode number, but first we must find out the file inode number:

$ ls -il | grep foo

Output:

[eriks@jaded: ~]$ ls -il | grep foo
508160 drwxr-xr-x 3 eriks eriks 4096 2010-10-27 18:13 foo3
500724 -rw-r--r-- 1 eriks eriks 4 2011-02-10 22:37 --fooface
589907 drwxr-xr-x 2 eriks eriks 4096 2010-11-22 18:52 tempfoo
589905 drwxr-xr-x 2 eriks eriks 4096 2010-11-22 18:48 tmpfoo

The number you see prior to the file permission set is actually the inode # of the file itself.

Hint: 500724 is inode number we want removed.

Now use find command to delete file by inode:

# find . -inum 500724 -exec rm -i {} \;

There she is.

[eriks@jaded: ~]$ find . -inum 500724 -exec rm -i {} \;
rm: remove regular file `./--fooface'? y

[Oct 25, 2019] unix - Remove a file on Linux using the inode number - Super User

Oct 25, 2019 | superuser.com


Some other methods include:

escaping the special chars:

[~]$rm \"la\*

use the find command and only search the current directory. The find command can search for inode numbers, and has a handy -delete switch:

[~]$ls -i
7404301 "la*

[~]$find . -maxdepth 1 -type f -inum 7404301
./"la*

[~]$find . -maxdepth 1 -type f -inum 7404301 -delete
[~]$ls -i
[~]$


Maybe I'm missing something, but...
rm '"la*'

Anyways, filenames don't have inodes, files do. Trying to remove a file without removing all filenames that point to it will damage your filesystem.

[Oct 25, 2019] Linux - Unix Find Inode Of a File Command

Jun 21, 2012 | www.cyberciti.biz
... ... ..

stat Command: Display Inode

You can also use the stat command as follows:
$ stat fileName-Here
$ stat /etc/passwd

Sample outputs:

  File: `/etc/passwd'
  Size: 1644            Blocks: 8          IO Block: 4096   regular file
Device: fe01h/65025d    Inode: 25766495    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-05 16:29:42.000000000 +0530
Modify: 2012-05-05 16:29:20.000000000 +0530
Change: 2012-05-05 16:29:21.000000000 +0530


Posted by: Vivek Gite

The author is the creator of nixCraft and a seasoned sysadmin, DevOps engineer, and a trainer for the Linux operating system/Unix shell scripting. Get the latest tutorials on SysAdmin, Linux/Unix and open source topics via RSS/XML feed or weekly email newsletter .

[Sep 29, 2019] IPTABLES makes corporate security scans go away

Sep 29, 2019 | www.reddit.com

r/ShittySysadmin, posted by u/TBoneJeeper:

In a remote office location, corporate's network security scans can cause many false alarms and even take down services if they are tickled the wrong way. Dropping all traffic from the scanner's IP is a great time/resource-saver. No vulnerability reports, no follow-ups with corporate. No time for that.

name_censored_: Seems a bit like a bandaid to me.

TBoneJeeper: Good ideas, but sounds like a lot of work. Just dropping their packets had the desired effect and took 30 seconds.

name_censored_: No-one ever said being lazy was supposed to be easy.

spyingwind: To be serious, closing unused ports is good practice. Even better if used services can only be accessed from known sources, such as the DB only allowing access from the app server. A jump box, like a guacd server for remote access to things like RDP and SSH, would help reduce the threat surface. Or go further and set up Ansible/Chef/etc. to allow only authorized changes.
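
For the legitimate version of that advice, here is a minimal iptables sketch (the address and port are hypothetical): allow the database port only from the application server and drop everything else aimed at it.

# Allow MySQL only from the app server at 192.0.2.10, drop the rest
iptables -A INPUT -p tcp --dport 3306 -s 192.0.2.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j DROP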

gortonsfiJr: Except, seriously, in my experience the security teams demand that you make big security holes for them in your boxes, so that they can hammer away at them looking for security holes.

asmiggs: Security teams will always invoke the worst case scenario: 'what if your firewall is borked?', 'what if your jumpbox is hacked?', etc. You can usually give their scanner exclusive access to get past these things, but surprise surprise, the only worst case scenario I've faced is 'what if your security scanner goes rogue?'.

gortonsfiJr: What if you lose control of your AD domain and some rogue agent gets domain admin rights? Also, we're going to need domain admin rights.

...Is this a test?

spyingwind: What if an attacker was pretending to be a security company? No DA access! You can plug in anywhere, but if port security blocks your scanner, then I can't help. Also, only 80 and 443 are allowed into our network.

TBoneJeeper: Agree. But in rare cases, the ports/services are still used (maybe rarely), yet have "vulnerabilities" that are difficult to address. Some of these scanners hammer services so hard, trying every CGI/PHP/Java exploit known to man in rapid succession, that older hardware/services cannot keep up and get wedged. I remember that every Tuesday night I would have to go restart services, because this is when they were scanned. Either vendor support for this software version was no longer available, or it would simply take too much time to open vendor support cases to report the issues, argue with first-level support, escalate, work with engineering, test fixes, etc.

rumplestripeskin: Yes... and use Ansible to update iptables on each of your Linux VMs.

rumplestripeskin: I know somebody who actually did this.

TBoneJeeper: Maybe we worked together :-)

[Sep 04, 2019] Basic Trap for File Cleanup

Sep 04, 2019 | www.putorius.net

Basic Trap for File Cleanup

Using a trap for cleanup is simple enough. Here is an example of using trap to clean up a temporary file on exit of the script.

#!/bin/bash
trap "rm -f /tmp/output.txt" EXIT
yum -y update > /tmp/output.txt
if grep -qi "kernel" /tmp/output.txt; then
     mail -s "KERNEL UPDATED" [email protected] < /tmp/output.txt
fi

NOTE: It is important that the trap statement be placed at the beginning of the script to function properly. Any command that runs before the trap is set can cause the script to exit without the trap being triggered.

Now if the script exits for any reason, it will still run the rm command to delete the file. Here is an example of me sending SIGINT (CTRL+C) while the script was running.

# ./test.sh
 ^Cremoved '/tmp/output.txt'

NOTE: I added verbose ( -v ) output to the rm command so it prints "removed". The ^C signifies where I hit CTRL+C to send SIGINT.

This is a much cleaner and safer way to ensure the cleanup occurs when the script exits. Using EXIT (0) instead of a single defined signal (e.g., SIGINT, signal 2) ensures the cleanup happens on any exit, even successful completion of the script.
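
A slightly more defensive variant of the same idea, sketched on the assumption that mktemp is available (the mail address is a placeholder, as above): let mktemp pick a unique file name and remove it on any exit.

#!/bin/bash
# Create a unique temporary file; abort if mktemp fails
OUTPUT=$(mktemp /tmp/output.XXXXXX) || exit 1
# Remove it no matter how the script exits
trap 'rm -f "$OUTPUT"' EXIT

yum -y update > "$OUTPUT"
if grep -qi "kernel" "$OUTPUT"; then
     mail -s "KERNEL UPDATED" [email protected] < "$OUTPUT"
fi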

[Aug 26, 2019] linux - Avoiding accidental 'rm' disasters - Super User

Aug 26, 2019 | superuser.com

Avoiding accidental 'rm' disasters

Mr_Spock ,May 26, 2013 at 11:30

Today, using sudo -s, I wanted to run rm -R ./lib/, but I actually ran rm -R /lib/.

I had to reinstall my OS (Mint 15) and re-download and re-configure all my packages. Not fun.

How can I avoid similar mistakes in the future?

Vittorio Romeo ,May 26, 2013 at 11:55

First of all, stop executing everything as root . You never really need to do this. Only run individual commands with sudo if you need to. If a normal command doesn't work without sudo, just call sudo !! to execute it again.

If you're paranoid about rm , mv and other operations while running as root, you can add the following aliases to your shell's configuration file:

[ $UID = 0 ] && \
  alias rm='rm -i' && \
  alias mv='mv -i' && \
  alias cp='cp -i'

These will all prompt you for confirmation ( -i ) before removing a file or overwriting an existing file, respectively, but only if you're root (the user with ID 0).

Don't get too used to that though. If you ever find yourself working on a system that doesn't prompt you for everything, you might end up deleting stuff without noticing it. The best way to avoid mistakes is to never run as root and think about what exactly you're doing when you use sudo .

[Aug 26, 2019] bash - How to prevent rm from reporting that a file was not found

Aug 26, 2019 | stackoverflow.com

How to prevent rm from reporting that a file was not found?


pizza ,Apr 20, 2012 at 21:29

I am using rm within a BASH script to delete many files. Sometimes the files are not present, so it reports many errors. I do not need this message. I have searched the man page for a command to make rm quiet, but the only option I found is -f , which from the description, "ignore nonexistent files, never prompt", seems to be the right choice, but the name does not seem to fit, so I am concerned it might have unintended consequences.

Keith Thompson ,Dec 19, 2018 at 13:05

The main use of -f is to force the removal of files that would not be removed using rm by itself (as a special case, it "removes" non-existent files, thus suppressing the error message).

You can also just redirect the error message using

$ rm file.txt 2> /dev/null

(or your operating system's equivalent). You can check the value of $? immediately after calling rm to see if a file was actually removed or not.
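
For example, a minimal sketch of that check:

rm file.txt 2> /dev/null
if [ $? -eq 0 ]; then
    echo "file.txt was removed"
else
    echo "nothing was removed (missing file or permission problem)"
fi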

vimdude ,May 28, 2014 at 18:10

Yes, -f is the most suitable option for this.

tripleee ,Jan 11 at 4:50

-f is the correct flag, but for the test operator, not rm
[ -f "$THEFILE" ] && rm "$THEFILE"

this ensures that the file exists and is a regular file (not a directory, device node etc...)

mahemoff ,Jan 11 at 4:41

\rm -f file will never report not found.

Idelic ,Apr 20, 2012 at 16:51

As far as rm -f doing "anything else", it does force ( -f is shorthand for --force ) silent removal in situations where rm would otherwise ask you for confirmation. For example, when trying to remove a file not writable by you from a directory that is writable by you.

Keith Thompson ,May 28, 2014 at 18:09

I had the same issue with csh. The only solution I had was to create a dummy file that matched the pattern before running "rm" in my script.

[Aug 26, 2019] shell - rm -rf return codes

Aug 26, 2019 | superuser.com

rm -rf return codes


SheetJS ,Aug 15, 2013 at 2:50

Can anyone let me know the possible return codes for the command rm -rf other than zero, i.e., the possible return codes for failure cases? I want to know the detailed reason for a failure, not just that the command failed (returned a non-zero code).

Adrian Frühwirth ,Aug 14, 2013 at 7:00

To see the return code, you can use echo $? in bash.

To see the actual meaning, some platforms (like Debian Linux) have the perror binary available, which can be used as follows:

$ rm -rf something/; perror $?
rm: cannot remove `something/': Permission denied
OS error code   1:  Operation not permitted

rm -rf automatically suppresses most errors, since -f intentionally suppresses them. The most likely error you will see is 1 (Operation not permitted), which will happen if you don't have permission to remove the file.

Adrian Frühwirth ,Aug 14, 2013 at 7:21

grabbed coreutils from git....

looking at exit we see...

openfly@linux-host:~/coreutils/src $ cat rm.c | grep -i exit
  if (status != EXIT_SUCCESS)
  exit (status);
  /* Since this program exits immediately after calling 'rm', rm need not
  atexit (close_stdin);
          usage (EXIT_FAILURE);
        exit (EXIT_SUCCESS);
          usage (EXIT_FAILURE);
        error (EXIT_FAILURE, errno, _("failed to get attributes of %s"),
        exit (EXIT_SUCCESS);
  exit (status == RM_ERROR ? EXIT_FAILURE : EXIT_SUCCESS);

Now looking at the status variable....

openfly@linux-host:~/coreutils/src $ cat rm.c | grep -i status
usage (int status)
  if (status != EXIT_SUCCESS)
  exit (status);
  enum RM_status status = rm (file, &x);
  assert (VALID_STATUS (status));
  exit (status == RM_ERROR ? EXIT_FAILURE : EXIT_SUCCESS);

looks like there isn't much going on there with the exit status.

I see EXIT_FAILURE and EXIT_SUCCESS and not anything else.

so basically 0 and 1 / -1

To see specific exit() syscalls and how they occur in a process flow try this

openfly@linux-host:~/ $ strace rm -rf $whatever

fairly simple.

ref:

http://www.unix.com/man-page/Linux/EXIT_FAILURE/exit/

[Jul 26, 2019] The day the virtual machine manager died by Nathan Lager

"Dangerous" commands like dd should probably be always typed first in the editor and only when you verity that you did not make a blunder , executed...
A good decision was to go home and think the situation over, not to aggravate it with impulsive attempts to correct the situation, which typically only make it worse.
The lack of checking of the health of backups suggests that this guy is an arrogant sucker, despite his 20 years of sysadmin experience.
Notable quotes:
"... I started dd as root , over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! ..."
"... Since my VMs were still running, and I'd already done enough damage for one night, I stopped touching things and went home. ..."
Jul 26, 2019 | www.redhat.com

... ... ...

See, my RHEV manager was a VM running on a stand-alone Kernel-based Virtual Machine (KVM) host, separate from the cluster it manages. I had been running RHEV since version 3.0, before hosted engines were a thing, and I hadn't gone through the effort of migrating. I was already in the process of building a new set of clusters with a new manager, but this older manager was still controlling most of our production VMs. It had filled its disk again, and the underlying database had stopped itself to avoid corruption.

See, for whatever reason, we had never set up disk space monitoring on this system. It's not like it was an important box, right?

So, I logged into the KVM host that ran the VM, and started the well-known procedure of creating a new empty disk file, and then attaching it via virsh . The procedure goes something like this: Become root , use dd to write a stream of zeros to a new file, of the proper size, in the proper location, then use virsh to attach the new disk to the already running VM. Then, of course, log into the VM and do your disk expansion.
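
In outline, and with purely hypothetical names and sizes (the article elides the exact dd arguments), the procedure looks something like this:

# Create a new, empty 40 GiB disk image (name and size are examples)
dd if=/dev/zero of=/var/lib/libvirt/images/vmname-disk3.img bs=1M count=40960

# Attach it to the running VM as an additional virtio disk
virsh attach-disk vmname /var/lib/libvirt/images/vmname-disk3.img vdb --persistent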

I logged in, ran sudo -i , and started my work. I ran cd /var/lib/libvirt/images , ran ls -l to find the existing disk images, and then started carefully crafting my dd command:

dd ... bs=1k count=40000000 if=/dev/zero ... of=./vmname-disk ...

Which was the next disk again? <Tab> of=vmname-disk2.img <Back arrow, Back arrow, Back arrow, Back arrow, Backspace> Don't want to dd over the existing disk, that'd be bad. Let's change that 2 to a 3 , and Enter . OH CRAP, I CHANGED THE 2 TO A 2 NOT A 3 ! <Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C>

I still get sick thinking about this. I'd done the stupidest thing I possibly could have done, I started dd as root , over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! (The kind that's at work late, trying to get this one little thing done before he heads off to see his friend. The kind that thinks he knows better, and thought he was careful enough to not make such a newbie mistake. Gah.)

So, how fast does dd start writing zeros? Faster than I can move my fingers from the Enter key to the Ctrl+C keys. I tried a number of things to recover the running disk from memory, but all I did was make things worse, I think. The system was still up, but still broken like it was before I touched it, so it was useless.

Since my VMs were still running, and I'd already done enough damage for one night, I stopped touching things and went home. The next day I owned up to the boss and co-workers pretty much the moment I walked in the door. We started taking an inventory of what we had, and what was lost. I had taken the precaution of setting up backups ages ago. So, we thought we had that to fall back on.

I opened a ticket with Red Hat support and filled them in on how dumb I'd been. I can only imagine the reaction of the support person when they read my ticket. I worked a help desk for years, I know how this usually goes. They probably gathered their closest coworkers to mourn for my loss, or get some entertainment out of the guy who'd been so foolish. (I say this in jest. Red Hat's support was awesome through this whole ordeal, and I'll tell you how soon. )

So, I figured the next thing I would need from my broken server, which was still running, was the backups I'd diligently been collecting. They were on the VM but on a separate virtual disk, so I figured they were safe. The disk I'd overwritten was the last disk I'd made to expand the volume the database was on, so that logical volume was toast, but I've always set up my servers such that the main mounts -- / , /var , /home , /tmp , and /root -- were all separate logical volumes.

In this case, /backup was an entirely separate virtual disk. So, I scp -r 'd the entire /backup mount to my laptop. It copied, and I felt a little sigh of relief. All of my production systems were still running, and I had my backup. My hope was that these factors would mean a relatively simple recovery: Build a new VM, install RHEV-M, and restore my backup. Simple right?

By now, my boss had involved the rest of the directors, and let them know that we were looking down the barrel of a possibly bad time. We started organizing a team meeting to discuss how we were going to get through this. I returned to my desk and looked through the backups I had copied from the broken server. All the files were there, but they were tiny. Like, a couple hundred kilobytes each, instead of the hundreds of megabytes or even gigabytes that they should have been.

Happy feeling, gone.

Turns out, my backups were running, but at some point after an RHEV upgrade, the database backup utility had changed. Remember how I said this system had existed since version 3.0? Well, 3.0 didn't have an engine-backup utility, so in my RHEV training, we'd learned how to make our own. Mine broke when the tools changed, and for who knows how long, it had been getting an incomplete backup -- just some files from /etc .

No database. Ohhhh ... Fudge. (I didn't say "Fudge.")

I updated my support case with the bad news and started wondering what it would take to break through one of these 4th-floor windows right next to my desk. (Ok, not really.)

At this point, we basically had three RHEV clusters with no manager. One of those was for development work, but the other two were all production. We started using these team meetings to discuss how to recover from this mess. I don't know what the rest of my team was thinking about me, but I can say that everyone was surprisingly supportive and un-accusatory. I mean, with one typo I'd thrown off the entire department. Projects were put on hold and workflows were disrupted, but at least we had time: We couldn't reboot machines, we couldn't change configurations, and couldn't get to VM consoles, but at least everything was still up and operating.

Red Hat support had escalated my SNAFU to an RHEV engineer, a guy I'd worked with in the past. I don't know if he remembered me, but I remembered him, and he came through yet again. About a week in, for some unknown reason (we never figured out why), our Windows VMs started dropping offline. They were still running as far as we could tell, but they dropped off the network, Just boom. Offline. In the course of a workday, we lost about a dozen windows systems. All of our RHEL machines were working fine, so it was just some Windows machines, and not even every Windows machine -- about a dozen of them.

Well great, how could this get worse? Oh right, add a ticking time bomb. Why were the Windows servers dropping off? Would they all eventually drop off? Would the RHEL systems eventually drop off? I made a panicked call back to support, emailed my account rep, and called in every favor I'd ever collected from contacts I had within Red Hat to get help as quickly as possible.

I ended up on a conference call with two support engineers, and we got to work. After about 30 minutes on the phone, we'd worked out the most insane recovery method. We had the newer RHEV manager I mentioned earlier, that was still up and running, and had two new clusters attached to it. Our recovery goal was to get all of our workloads moved from the broken clusters to these two new clusters.

Want to know how we ended up doing it? Well, as our Windows VMs were dropping like flies, the engineers and I came up with this plan. My clusters used a Fibre Channel Storage Area Network (SAN) as their storage domains. We took a machine that was not in use, but had a Fibre Channel host bus adapter (HBA) in it, and attached the logical unit numbers (LUNs) for both the old cluster's storage domains and the new cluster's storage domains to it. The plan there was to make a new VM on the new clusters, attach blank disks of the proper size to the new VM, and then use dd (the irony is not lost on me) to block-for-block copy the old broken VM disk over to the newly created empty VM disk.

I don't know if you've ever delved deeply into an RHEV storage domain, but under the covers it's all Logical Volume Manager (LVM). The problem is, the LV's aren't human-readable. They're just universally-unique identifiers (UUIDs) that the RHEV manager's database links from VM to disk. These VMs are running, but we don't have the database to reference. So how do you get this data?

virsh ...

Luckily, I managed KVM and Xen clusters long before RHEV was a thing that was viable. I was no stranger to libvirt 's virsh utility. With the proper authentication -- which the engineers gave to me -- I was able to virsh dumpxml on a source VM while it was running, get all the info I needed about its memory, disk, CPUs, and even MAC address, and then create an empty clone of it on the new clusters.
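
The virsh calls involved look roughly like this (the domain name is hypothetical; dumpxml is the command the author mentions, while the two listing commands are added here for illustration):

# Capture the full XML definition of a running domain
virsh dumpxml prod-app01 > prod-app01.xml

# List the disks and network interfaces (with MAC addresses) it references
virsh domblklist prod-app01
virsh domiflist prod-app01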

Once I felt everything was perfect, I would shut down the VM on the broken cluster with either virsh shutdown , or by logging into the VM and shutting it down. The catch here is that if I missed something and shut down that VM, there was no way I'd be able to power it back on. Once the data was no longer in memory, the config would be completely lost, since that information is all in the database -- and I'd hosed that. Once I had everything, I'd log into my migration host (the one that was connected to both storage domains) and use dd to copy, bit-for-bit, the source storage domain disk over to the destination storage domain disk. Talk about nerve-wracking, but it worked! We picked one of the broken windows VMs and followed this process, and within about half an hour we'd completed all of the steps and brought it back online.

We did hit one snag, though. See, we'd used snapshots here and there. RHEV snapshots are lvm snapshots. Consolidating them without the RHEV manager was a bit of a chore, and took even more leg work and research before we could dd the disks. I had to mimic the snapshot tree by creating symbolic links in the right places, and then start the dd process. I worked that one out late that evening after the engineers were off, probably enjoying time with their families. They asked me to write the process up in detail later. I suspect that it turned into some internal Red Hat documentation, never to be given to a customer because of the chance of royally hosing your storage domain.

Somehow, over the course of 3 months and probably a dozen scheduled maintenance windows, I managed to migrate every single VM (of about 100 VMs) from the old zombie clusters to the working clusters. This migration included our Zimbra collaboration system (10 VMs in itself), our file servers (another dozen VMs), our Enterprise Resource Planning (ERP) platform, and even Oracle databases.

We didn't lose a single VM and had no more unplanned outages. The Red Hat Enterprise Linux (RHEL) systems, and even some Windows systems, never fell to the mysterious drop-off that those dozen or so Windows servers did early on. During this ordeal, though, I had trouble sleeping. I was stressed out and felt so guilty for creating all this work for my co-workers, I even had trouble eating. No exaggeration, I lost 10lbs.

So, don't be like Nate. Monitor your important systems, check your backups, and for all that's holy, double-check your dd output file. That way, you won't have drama, and can truly enjoy Sysadmin Appreciation Day!
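
One small habit that makes that double-check harder to skip, assuming GNU dd (the file name is an example): conv=excl makes dd refuse to run if the output file already exists, so a mistyped of= pointing at a live disk image fails instead of overwriting it.

dd if=/dev/zero of=/var/lib/libvirt/images/vmname-disk3.img bs=1M count=40960 conv=excl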

Nathan Lager is an experienced sysadmin, with 20 years in the industry. He runs his own blog at undrground.org, and hosts the Iron Sysadmin Podcast.

[Apr 29, 2019] When disaster hits, you need to resolve things quickly and efficiently, with panic being the worst enemy. The amount of training and previous experience become crucial factors in such situations

It is rarely just one thing that causes an “accident”. There are multiple contributors here.
Notable quotes:
"... Panic in my experience stems from a number of things here, but two crucial ones are: ..."
"... not knowing what to do, or learned actions not having any effect ..."
Apr 29, 2019 | www.nakedcapitalism.com

vlade , April 29, 2019 at 11:04 am

...I suspect that for both of those, when they hit, you need to resolve things quickly and efficiently, with panic being the worst enemy.

Panic in my experience stems from a number of things here, but two crucial ones are:
input overload
not knowing what to do, or learned actions not having any effect

Both of them can be, to a very large extent, overcome with training, training, and more training (of actually practising the emergency situation, not just reading about it and filling in questionnaires).

... ... ...

[Mar 26, 2019] I wiped out a call center by mistyping the user profile expiration purge parameters in a script before leaving for the day.

Mar 26, 2019 | twitter.com

SwiftOnSecurity ‏ 7:07 PM - 25 Mar 2019

I wiped out a call center by mistyping the user profile expiration purge parameters in a script before leaving for the day.

https://twitter.com/soniagupta504/status/1109979183352942592

SwiftOnSecurity ‏ 7:08 PM - 25 Mar 2019

Luckily most of it was backed up with a custom-built user profile roaming system, but still it was down for an hour and a half and degraded for more...

[Mar 01, 2019] Molly-guard for CentOS 7 UoB Unix by dg12158

Sep 21, 2015 | bris.ac.uk

Since I was looking at this already and had a few things to investigate and fix in our systemd-using hosts, I checked how plausible it is to insert a molly-guard-like password prompt as part of the reboot/shutdown process on CentOS 7 (i.e. using systemd).

Problems encountered include:

So for now this is shelved. It would be nice to have a solution though, so any hints from systemd experts are gratefully received!

(Note that CentOS 7 uses systemd 208, so new features in later versions which help won't be available to us.)

[Mar 01, 2019] molly-guard protects machines from accidental shutdowns-reboots by ruchi

Nov 28, 2009 | www.ubuntugeek.com
molly-guard installs a shell script that overrides the existing shutdown/reboot/halt/poweroff commands and first runs a set of scripts, which all have to exit successfully, before molly-guard invokes the real command.

One of the scripts checks for existing SSH sessions. If any of the four commands are called interactively over an SSH session, the shell script prompts you to enter the name of the host you wish to shut down. This should adequately prevent you from accidental shutdowns and reboots.

This shell script passes through the commands to the respective binaries in /sbin and should thus not get in the way if called non-interactively, or locally.

The tool is basically a replacement for halt, reboot and shutdown to prevent such accidents.

Install molly-guard in ubuntu

sudo apt-get install molly-guard

or click on the following link

apt://molly-guard

Now that it's installed, try it out (on a non production box). Here you can see it save me from rebooting the box Ubuntu-test

Ubuntu-test:~$ sudo reboot
W: molly-guard: SSH session detected!
Please type in hostname of the machine to reboot: ruchi
Good thing I asked; I won't reboot Ubuntu-test ...
W: aborting reboot due to 30-query-hostname exiting with code 1.
Ubuntu-Test:~$

By default you're only protected on sessions that look like SSH sessions (those that have $SSH_CONNECTION set). If, like us, you use a lot of virtual machines and RILOE cards, edit /etc/molly-guard/rc and uncomment ALWAYS_QUERY_HOSTNAME=true. Now you should be prompted for any interactive session.
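
As a hedged one-liner (this assumes the option ships commented out in /etc/molly-guard/rc, which may differ between package versions):

sudo sed -i 's/^#\s*ALWAYS_QUERY_HOSTNAME=true/ALWAYS_QUERY_HOSTNAME=true/' /etc/molly-guard/rc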

[Mar 01, 2019] Confirm before executing shutdown-reboot command on linux by Ilija Matoski

Notable quotes:
"... rushing to leave and was still logged into a server so I wanted to shutdown my laptop, but what I didn't notice is that I was still connected to the remote server. ..."
Oct 23, 2017 | matoski.com
I was rushing to leave and wanted to shut down my laptop, but what I didn't notice is that I was still logged into a remote server. Luckily, before pressing Enter, I noticed I was not on my machine but on the remote server. So I was thinking there should be a very easy way to prevent this from happening again, to me or to anyone else.

So the first thing we need to do is create a new bash script at /usr/local/bin/confirm with the contents below, and give it execute permissions.

#!/usr/bin/env bash
echo "About to execute $1 command"
echo -n "Would you like to proceed y/n? "
read reply

if [ "$reply" = y -o "$reply" = Y ]
then
   $1 "${@:2}"
else
   echo "$1 ${@:2} cancelled"
fi

Now the only thing left to do is to set up the aliases so they go through this confirm command instead of calling the real command directly.

So I create the following files

/etc/profile.d/confirm-shutdown.sh

alias shutdown="/usr/local/bin/confirm /sbin/shutdown"

/etc/profile.d/confirm-reboot.sh

alias reboot="/usr/local/bin/confirm /sbin/reboot"

Now when I actually try to do a shutdown/reboot it will prompt me like so.

ilijamt@x1 ~ $ reboot 
Before proceeding to perform /sbin/reboot, please ensure you have approval to perform this task
Would you like to proceed y/n? n
/sbin/reboot  cancelled

[Feb 21, 2019] https://github.com/MikeDacre/careful_rm

Feb 21, 2019 | github.com

rm is a powerful *nix tool that simply drops a file from the drive index. It doesn't delete it or put it in a Trash can, it just de-indexes it which makes the file hard to recover unless you want to put in the work, and pretty easy to recover if you are willing to spend a few hours trying (use shred to actually secure erase files).

careful_rm.py is inspired by the -I interactive mode of rm and by safe-rm . safe-rm adds a recycle bin mode to rm, and the -I interactive mode adds a prompt if you delete more than a handful of files or recursively delete a directory. ZSH also has an option to warn you if you recursively rm a directory.

These are all great, but I found them unsatisfying. What I want is for rm to be quick and not bother me for single file deletions (so rm -i is out), but to let me know when I am deleting a lot of files, and to actually print a list of files that are about to be deleted . I also want it to have the option to trash/recycle my files instead of just straight deleting them.... like safe-rm , but not so intrusive (safe-rm defaults to recycle, and doesn't warn).

careful_rm.py is fundamentally a simple rm wrapper, that accepts all of the same commands as rm , but with a few additional options features. In the source code CUTOFF is set to 3 , so deleting more files than that will prompt the user. Also, deleting a directory will prompt the user separately with a count of all files and subdirectories within the folders to be deleted.

Furthermore, careful_rm.py implements a fully integrated trash mode that can be toggled on with -c . It can also be forced on by adding a file at ~/.rm_recycle , or toggled on only for $HOME (the best idea), by ~/.rm_recycle_home . The mode can be disabled on the fly by passing --direct , which forces off recycle mode.

The recycle mode tries to find the best location to recycle to on MacOS or Linux, on MacOS it also tries to use Apple Script to trash files, which means the original location is preserved (note Applescript can be slow, you can disable it by adding a ~/.no_apple_rm file, but Put Back won't work). The best location for trashes goes in this order:

  1. $HOME/.Trash on Mac or $HOME/.local/share/Trash on Linux
  2. <mountpoint>/.Trashes on Mac or <mountpoint>/.Trash-$UID on Linux
  3. /tmp/$USER_trash

Always the best trash can to avoid Volume hopping is favored, as moving across file systems is slow. If the trash does not exist, the user is prompted to create it, they then also have the option to fall back to the root trash ( /tmp/$USER_trash ) or just rm the files.

/tmp/$USER_trash is almost always used for deleting system/root files, but note that you most likely do not want to save those files, and straight rm is generally better.

[Feb 21, 2019] https://github.com/lagerspetz/linux-stuff/blob/master/scripts/saferm.sh by Eemil Lagerspetz

Shell script that tries to implement the trash can idea
Feb 21, 2019 | github.com
#!/bin/bash
##
## saferm.sh
## Safely remove files, moving them to GNOME/KDE trash instead of deleting.
## Made by Eemil Lagerspetz
## Login <vermind@drache>
##
## Started on Mon Aug 11 22:00:58 2008 Eemil Lagerspetz
## Last update Sat Aug 16 23:49:18 2008 Eemil Lagerspetz
##
version="1.16";

... ... ...

[Feb 21, 2019] The rm='rm -i' alias is an horror

Feb 21, 2019 | superuser.com

The rm='rm -i' alias is a horror because, after using it for a while, you will expect rm to prompt you by default before removing files. Of course, one day you'll run it with an account that doesn't have that alias set, and before you understand what's going on, it is too late.

... ... ...

If you want safe aliases, but don't want to risk getting used to the commands working differently on your system than on others, you can disable rm like this:
alias rm='echo "rm is disabled, use remove or trash or /bin/rm instead."'

Then you can create your own safe alias, e.g.

alias remove='/bin/rm -irv'

or use trash instead.

[Feb 21, 2019] Ubuntu Manpage trash - Command line trash utility.

Feb 21, 2019 | manpages.ubuntu.com

xenial ( 1 ) trash.1.gz

Provided by: trash-cli_0.12.9.14-2_all

NAME

       trash - Command line trash utility.
SYNOPSIS 
       trash [arguments] ...
DESCRIPTION 
       Trash-cli  package  provides  a command line interface trashcan utility compliant with the
       FreeDesktop.org Trash Specification.  It remembers the name, original path, deletion date,
       and permissions of each trashed file.

ARGUMENTS 
       Names of files or directory to move in the trashcan.
EXAMPLES
       $ cd /home/andrea/
       $ touch foo bar
       $ trash foo bar
BUGS 
       Report bugs to http://code.google.com/p/trash-cli/issues
AUTHORS
       Trash  was  written  by Andrea Francia <[email protected]> and Einar Orn
       Olason <[email protected]>.  This manual page was written by  Steve  Stalcup  <[email protected]>.
       Changes made by Massimo Cavalleri <[email protected]>.

SEE ALSO 
       trash-list(1),   trash-restore(1),   trash-empty(1),   and   the   FreeDesktop.org   Trash
       Specification at http://www.ramendik.ru/docs/trashspec.html.

       Both are released under the GNU General Public License, version 2 or later.

[Jan 29, 2019] hardware - Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition expected behavior

Dec 04, 2012 | serverfault.com

My company makes an embedded Debian Linux device that boots from an ext3 partition on an internal SSD drive. Because the device is an embedded "black box", it is usually shut down the rude way, by simply cutting power to the device via an external switch.

This is normally okay, as ext3's journalling keeps things in order, so other than the occasional loss of part of a log file, things keep chugging along fine.

However, we've recently seen a number of units where after a number of hard-power-cycles the ext3 partition starts to develop structural issues -- in particular, we run e2fsck on the ext3 partition and it finds a number of issues like those shown in the output listing at the bottom of this Question. Running e2fsck until it stops reporting errors (or reformatting the partition) clears the issues.

My question is... what are the implications of seeing problems like this on an ext3/SSD system that has been subjected to lots of sudden/unexpected shutdowns?

My feeling is that this might be a sign of a software or hardware problem in our system, since my understanding is that (barring a bug or hardware problem) ext3's journalling feature is supposed to prevent these sorts of filesystem-integrity errors. (Note: I understand that user-data is not journalled and so munged/missing/truncated user-files can happen; I'm specifically talking here about filesystem-metadata errors like those shown below)

My co-worker, on the other hand, says that this is known/expected behavior because SSD controllers sometimes re-order write commands and that can cause the ext3 journal to get confused. In particular, he believes that even given normally functioning hardware and bug-free software, the ext3 journal only makes filesystem corruption less likely, not impossible, so we should not be surprised to see problems like this from time to time.

Which of us is right?

Embedded-PC-failsafe:~# ls
Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 46948.
Fix<y>? yes

Directory inode 46948, block 0, offset 12: directory corrupted
Salvage<y>? yes

Entry 'status_2012-11-26_14h13m41.csv' in /var/log/status_logs (46956) has deleted/unused inode 47075.  Clear<y>? yes
Entry 'status_2012-11-26_10h42m58.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47076.  Clear<y>? yes
Entry 'status_2012-11-26_11h29m41.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47080.  Clear<y>? yes
Entry 'status_2012-11-26_11h42m13.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47081.  Clear<y>? yes
Entry 'status_2012-11-26_12h07m17.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47083.  Clear<y>? yes
Entry 'status_2012-11-26_12h14m53.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47085.  Clear<y>? yes
Entry 'status_2012-11-26_15h06m49.csv' in /var/log/status_logs (46956) has deleted/unused inode 47088.  Clear<y>? yes
Entry 'status_2012-11-20_14h50m09.csv' in /var/log/status_logs (46956) has deleted/unused inode 47073.  Clear<y>? yes
Entry 'status_2012-11-20_14h55m32.csv' in /var/log/status_logs (46956) has deleted/unused inode 47074.  Clear<y>? yes
Entry 'status_2012-11-26_11h04m36.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47078.  Clear<y>? yes
Entry 'status_2012-11-26_11h54m45.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47082.  Clear<y>? yes
Entry 'status_2012-11-26_12h12m20.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47084.  Clear<y>? yes
Entry 'status_2012-11-26_12h33m52.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47086.  Clear<y>? yes
Entry 'status_2012-11-26_10h51m59.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47077.  Clear<y>? yes
Entry 'status_2012-11-26_11h17m09.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47079.  Clear<y>? yes
Entry 'status_2012-11-26_12h54m11.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47087.  Clear<y>? yes

Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Couldn't fix parent of inode 46948: Couldn't find parent directory entry

Pass 4: Checking reference counts
Unattached inode 46945
Connect to /lost+found<y>? yes

Inode 46945 ref count is 2, should be 1.  Fix<y>? yes
Inode 46953 ref count is 5, should be 4.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(208264--208266) -(210062--210068) -(211343--211491) -(213241--213250) -(213344--213393) -213397 -(213457--213463) -(213516--213521) -(213628--213655) -(213683--213688) -(213709--213728) -(215265--215300) -(215346--215365) -(221541--221551) -(221696--221704) -227517
Fix<y>? yes

Free blocks count wrong for group #6 (17247, counted=17611).
Fix<y>? yes

Free blocks count wrong (161691, counted=162055).
Fix<y>? yes

Inode bitmap differences:  +(47089--47090) +47093 +47095 +(47097--47099) +(47101--47104) -(47219--47220) -47222 -47224 -47228 -47231 -(47347--47348) -47350 -47352 -47356 -47359 -(47457--47488) -47985 -47996 -(47999--48000) -48017 -(48027--48028) -(48030--48032) -48049 -(48059--48060) -(48062--48064) -48081 -(48091--48092) -(48094--48096)
Fix<y>? yes

Free inodes count wrong for group #6 (7608, counted=7624).
Fix<y>? yes

Free inodes count wrong (61919, counted=61935).
Fix<y>? yes


embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****

embeddedrootwrite: ********** WARNING: Filesystem still has errors **********

embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks

Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory entry for '.' in ... (46948) is big.
Split<y>? yes

Missing '..' in directory inode 46948.
Fix<y>? yes

Setting filetype for entry '..' in ... (46948) to 2.
Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Pass 4: Checking reference counts
Inode 2 ref count is 12, should be 13.  Fix<y>? yes

Pass 5: Checking group summary information

embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****
embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks
Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite: clean, 657/62592 files, 87882/249937 blocks
(Question asked Dec 4 '12 by Jeremy Friesner; answered by ewwhite.) You're both wrong (maybe?)... ext3 is coping the best it can with having its underlying storage removed so abruptly.

Your SSD probably has some type of onboard cache. You don't mention the make/model of SSD in use, but this sounds like a consumer-level SSD versus an enterprise or industrial-grade model.

Either way, the cache is used to help coalesce writes and prolong the life of the drive. If there are writes in-transit, the sudden loss of power is definitely the source of your corruption. True enterprise and industrial SSDs have supercapacitors that maintain power long enough to move data from cache to nonvolatile storage, much in the same way battery-backed and flash-backed RAID controller caches work.

If your drive doesn't have a supercap, the in-flight transactions are being lost, hence the filesystem corruption. ext3 is probably being told that everything is on stable storage, but that's just a function of the cache.

psusi answered (Dec 5 '12): You are right and your coworker is wrong. Barring something going wrong, the journal makes sure you never have inconsistent fs metadata. You might check with hdparm to see if the drive's write cache is enabled. If it is, and you have not enabled IO barriers (off by default on ext3, on by default in ext4), then that would be the cause of the problem.

The barriers are needed to force the drive write cache to flush at the correct time to maintain consistency, but some drives are badly behaved and either report that their write cache is disabled when it is not, or silently ignore the flush commands. This prevents the journal from doing its job.
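A minimal sketch of how one might check both of those settings, assuming the SSD shows up as /dev/sda and the root filesystem is ext3 on /dev/sda3 as in the e2fsck output above (device names are illustrative only):

# Query the drive's volatile write cache (requires root); -W0 disables it
# if the drive cannot be trusted to honor flush commands.
hdparm -W /dev/sda
hdparm -W0 /dev/sda

# Enable write barriers for ext3 (off by default there, on by default in ext4),
# either for the running system or permanently via /etc/fstab:
#   /dev/sda3  /  ext3  defaults,barrier=1  0  1
mount -o remount,barrier=1 /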

[Jan 29, 2019] xfs corrupted after power failure

Highly recommended!
Oct 15, 2013 | www.linuxquestions.org

katmai90210

hi guys,

i have a problem. yesterday there was a power outage at one of my datacenters, where i have a relatively large fileserver. 2 arrays, 1 x 14 tb and 1 x 18 tb both in raid6, with a 3ware card.

after the outage, the server came back online, the xfs partitions were mounted, and everything looked okay. i could access the data and everything seemed just fine.

today i woke up to lots of i/o errors, and when i rebooted the server, the partitions would not mount:

Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN a<ffffffff80056933>] pdflush+0x0/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff80056a84>] pdflush+0x151/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff800cd931>] wb_kupdate+0x0/0x16a
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032c2b>] kthread+0xfe/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
Oct 14 04:09:17 kp4 kernel: [<ffffffff800a3ab7>] keventd_create_kthread+0x0/0xc4
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032b2d>] kthread+0x0/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN at line 279 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff88342331
Oct 14 04:09:17 kp4 kernel:

got a bunch of these in dmesg.

The array is fine:

[root@kp4 ~]# tw_cli
//kp4> focus c6
s
//kp4/c6> how

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 OK - - 256K 13969.8 RiW ON
u1 RAID-6 OK - - 256K 16763.7 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u1 2.73 TB SATA 0 - Hitachi HDS723030AL
p1 OK u1 2.73 TB SATA 1 - Hitachi HDS723030AL
p2 OK u1 2.73 TB SATA 2 - Hitachi HDS723030AL
p3 OK u1 2.73 TB SATA 3 - Hitachi HDS723030AL
p4 OK u1 2.73 TB SATA 4 - Hitachi HDS723030AL
p5 OK u1 2.73 TB SATA 5 - Hitachi HDS723030AL
p6 OK u1 2.73 TB SATA 6 - Hitachi HDS723030AL
p7 OK u1 2.73 TB SATA 7 - Hitachi HDS723030AL
p8 OK u0 2.73 TB SATA 8 - Hitachi HDS723030AL
p9 OK u0 2.73 TB SATA 9 - Hitachi HDS723030AL
p10 OK u0 2.73 TB SATA 10 - Hitachi HDS723030AL
p11 OK u0 2.73 TB SATA 11 - Hitachi HDS723030AL
p12 OK u0 2.73 TB SATA 12 - Hitachi HDS723030AL
p13 OK u0 2.73 TB SATA 13 - Hitachi HDS723030AL
p14 OK u0 2.73 TB SATA 14 - Hitachi HDS723030AL

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest
---------------------------------------------------------------------------
bbu On Yes OK OK OK 0 xx-xxx-xxxx

i googled for solutions and i think i jumped the horse by doing

xfs_repair -L /dev/sdc

it would not clean it with xfs_repair /dev/sdc, and everybody pretty much says the same thing.

this is what i was getting when trying to mount the array.

Filesystem Corruption of in-memory data detected. Shutting down filesystem xfs_check

Did i jump the gun by using the -L switch :/ ?

jefro

Here is the RH data on that.

https://docs.fedoraproject.org/en-US...xfsrepair.html
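For reference, the usual order of operations before reaching for -L looks roughly like the sketch below; the device and mount point follow the thread's /dev/sdc and are only illustrative:

# 1. Try a plain mount first: a successful mount replays the XFS log cleanly.
mount /dev/sdc /mnt/array

# 2. If the mount fails, run a no-modify check to see what a repair would touch.
umount /mnt/array 2>/dev/null
xfs_repair -n /dev/sdc

# 3. Only then run the real repair, ideally without -L.
xfs_repair /dev/sdc

# 4. xfs_repair -L zeroes the log and throws away uncommitted metadata updates;
#    it is the last resort, not the first command to try.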

[Jan 29, 2019] an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF.

Jan 29, 2019 | thwack.solarwinds.com

George Sutherland Jul 8, 2015 9:58 AM (in response to RandyBrown): Had a similar thing happen with an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF. A clear plastic cover was installed within 24 hours... after 3 hours of recovery!

PS... He told his boss that he did not do it.... the camera that focused on the door told a much different story. He was persona non grata at our site after that.

[Jan 29, 2019] HVAC units greatly help to increase reliability

Jan 29, 2019 | thwack.solarwinds.com

sleeper_777 Jul 15, 2015 1:07 PM

Worked at a bank. 6" raised floor. Liebert cooling units on floor with all network equipment. Two units developed a water drain issue over a weekend.

About an hour into Monday morning, devices, servers, and routers, in a domino effect, started shorting out and shutting down or blowing up, literally.

Opened the floor tiles to find three inches of water.

We did not have water alarms on the floor at the time.

Shortly after the incident, we did.

But the mistake was very costly and multiple 24 hour shifts of IT people made it a week of pure h3ll.

[Jan 29, 2019] In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning

Jan 29, 2019 | thwack.solarwinds.com

[Jan 29, 2019] [SOLVED] Unable to mount root file system after a power failure

Jan 29, 2019 | www.linuxquestions.org
damateem (LQ Newbie), 07-01-2012, 12:56 PM:
Unable to mount root file system after a power failure

We had a storm yesterday and the power dropped out, causing my Ubuntu server to shut off. Now, when booting, I get

[ 0.564310] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

It looks like a file system corruption, but I'm having a hard time fixing the problem. I'm using Rescue Remix 12-04 to boot from USB and get access to the system.

Using

sudo fdisk -l

Shows the hard drive as

/dev/sda1: Linux
/dev/sda2: Extended
/dev/sda5: Linux LVM

Using

sudo lvdisplay

Shows LV Names as

/dev/server1/root
/dev/server1/swap_1

Using

sudo blkid

Shows types as

/dev/sda1: ext2
/dev/sda5: LVM2_member
/dev/mapper/server1-root: ext4
/dev/mapper/server1-swap_1: swap

I can mount sda1 and server1/root and all the files appear normal, although I'm not really sure what issues I should be looking for. On sda1, I see a grub folder and several other files. On root, I see the file system as it was before I started having trouble.

I've ran the following fsck commands and none of them report any errors

sudo fsck -f /dev/sda1
sudo fsck -f /dev/server1/root
sudo fsck.ext2 -f /dev/sda1
sudo fsck.ext4 -f /dev/server1/root

and I still get the same error when the system boots.

I've hit a brick wall.

What should I try next?

What can I look at to give me a better understanding of what the problem is?

Thanks,
David

syg00 (LQ Veteran), 07-02-2012, 05:58 AM:
Might depend a bit on what messages we aren't seeing.

Normally I'd reckon that means that either the filesystem or disk controller support isn't available. But with something like Ubuntu you'd expect that to all be in place from the initrd. And that is on the /boot partition, and shouldn't be subject to update activity in a normal environment. Unless maybe you're real unlucky and an update was in flight.

Can you chroot into the server (disk) install and run from there successfully ?.

damateem (original poster), 07-02-2012, 06:08 PM:
I had a very hard time getting the Grub menu to appear. There must be a very small window for detecting the shift key. Holding it down through the boot didn't work. Repeatedly hitting it at about twice per second didn't work. Increasing the rate to about 4 hits per second got me into it.

Once there, I was able to select an older kernel (2.6.32-39-server). The non-booting kernel was 2.6.32-40-server. 39 booted without any problems.

When I initially setup this system, I couldn't send email from it. It wasn't important to me at the time, so I planned to come back and fix it later. Last week (before the power drop), email suddenly started working on its own. I was surprised because I haven't specifically performed any updates. However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.

Next, I'm going to try updating to the latest kernel and see if it has the same problem.

Thanks,
David

frieza (Senior Member), 07-02-2012, 06:24 PM:
IMHO auto updates are dangerous. If you want my opinion, make sure auto updates are off and only have the system tell you there are updates; that way you can choose not to install them during a power failure.

As for a possible future solution for what you went through: unlike other keys, holding the Shift key doesn't register as a stuck key (to the best of my knowledge), so you can hold Shift to get into GRUB. After that, edit the recovery line (the 'e' key) to say init=/bin/bash at the end, then boot the system using the keys specified at the bottom of the screen. Once booted to a prompt, you would run
Code:

fsck -f {root partition}
(in this state, the root partition should be either not mounted or mounted read-only, so you can safely run an fsck on the drive)

note: the -f flag forces a check even when the filesystem is marked clean, so it does a more thorough scan than merely a standard run of fsck.

then reboot, and hopefully that fixes things

glad things seem to be working for the moment though.
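Condensed, the recovery path frieza describes looks like the sketch below. The device names follow the original poster's layout (ext2 /boot on /dev/sda1, ext4 root on LVM) and are only an example:

# At the GRUB menu (hold Shift while booting), press 'e' on the entry and
# append init=/bin/bash to the line that loads the kernel, then boot it.

# At the resulting root shell the root filesystem is still read-only,
# so forcing a check on it is safe:
fsck -f /dev/mapper/server1-root
fsck -f /dev/sda1          # the separate /boot partition

# Then reboot cleanly:
reboot -f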

suicidaleggroll (LQ Guru), 07-02-2012, 06:32 PM:
Quote (originally posted by damateem): However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.
I think this is very likely. Delayed reboots after performing an update can make tracking down errors impossibly difficult. I had a system a while back that wouldn't boot, turns out it was caused by an update I had done 6 MONTHS earlier, and the system had simply never been restarted afterward.
damateem (original poster), 07-04-2012, 10:18 AM:
I discovered the root cause of the problem. When I attempted the update, I found that the boot partition was full. So I suspect that caused issues for the auto update, but they went undetected until the reboot.

I next tried to purge old kernels using the instructions at

http://www.liberiangeek.net/2011/11/...neiric-ocelot/

but that failed because a previous install had not completed, and it couldn't complete because of the full partition. So I had no choice but to manually rm the oldest kernel and its associated files. With that done, the command

apt-get -f install

got far enough that I could then purge the unwanted kernels. Finally,

sudo apt-get update
sudo apt-get upgrade

brought everything up to date.

I will be deactivating the auto updates.

Thanks for all the help!

David
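The same cleanup on a Debian/Ubuntu box boils down to a few commands. A rough sketch follows; the kernel version used here is only an example, and you must never remove the kernel reported by uname -r:

df -h /boot                          # how full is the boot partition?
dpkg -l 'linux-image-*' | grep ^ii   # which kernels are installed?
uname -r                             # the one kernel you must keep

# If apt is wedged because /boot filled up mid-install, free space by hand:
rm /boot/*-2.6.32-38-server          # oldest, unused kernel (example version)

apt-get -f install                   # let dpkg finish the interrupted install
apt-get purge linux-image-2.6.32-38-server
apt-get update && apt-get upgrade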

[Jan 29, 2019] A new term PEBKAC

Jan 29, 2019 | thwack.solarwinds.com

dtreloar Jul 30, 2015 8:51 PM PEBKAC

P roblem

E xists

B etween

K eyboard

A nd

C hair

or the most common fault is the id ten t or ID10T

[Jan 29, 2019] Are you sure?

Jan 29, 2019 | thwack.solarwinds.com

RichardLetts

Jul 13, 2015 8:13 PM Dealing with my ISP:

Me: There is a problem with your head-end router, you need to get an engineer to troubleshoot it

Them: no the problem is with your cable modem and router, we can see it fine on our network

Me: That's interesting because I powered it off and disconnected it from the wall before we started this conversation.

Them: Are you sure?

Me: I'm pretty sure that the lack of blinky lights means it's got no power but if you think it's still working fine then I'd suggest the problem at your end of this phone conversation and not at my end.

[Jan 29, 2019] Your tax dollars at government IT work

Jan 29, 2019 | thwack.solarwinds.com

pzjones Jul 8, 2015 10:34 AM

My story is about required processes...Need to add DHCP entries to the DHCP server. Here is the process. Receive request. Write 5 page document (no exaggeration) detailing who submitted the request, why the request was submitted, what the solution would be, the detailed steps of the solution including spreadsheet showing how each field would be completed and backup procedures. Produce second document to include pre execution test plan, and post execution test plan in minute detail. Submit to CAB board for review, submit to higher level advisory board for review; attend CAB meeting for formal approval; attend additional approval board meeting if data center is in freeze; attend post implementation board for lessons learned...Lesson learned: now I know where our tax dollars go...

[Jan 29, 2019] Your worst sysadmin horror story

Notable quotes:
"... Disk Array not found. ..."
"... Disk Array not found. ..."
"... Windows 2003 is now loading. ..."
Jan 29, 2019 | www.reddit.com

highlord_fox (Moderator, /r/sysadmin):

9-10 year old Poweredge 2950. Four drives, 250GB ea, RAID 5. Not even sure the fourth drive was even part of the array at this point. Backups consist of cloud file-level backup of most of the server's files. I was working on the server, updating the OS, rebooting it to solve whatever was ailing it at the time, and it was probably about 7-8PM on a Friday. I powered it off, and went to power it back on.

Disk Array not found.

SHIT SHIT SHIT SHIT SHIT SHIT SHIT . Power it back off. Power it back on.

Disk Array not found.

I stared at it, and hope I don't have to call for emergency support on the thing. Power it off and back on a third time.

Windows 2003 is now loading.

OhThankTheGods

I didn't power it off again until I replaced it, some 4-6 months later. And then it stayed off for a good few weeks, before I had to buy a Perc 5i card off eBay to get it running again. Long story short, most of the speed issues I was having were due to the card dying. AH WELL.

EDIT: Formatting.

[Jan 29, 2019] Extra security can be a dangerous thing

Viewing backup logs is vital. Often it only looks like the backup is going fine...
Notable quotes:
"... Things looked fine until someone noticed that a directory with critically important and sensitive data was missing. Turned out that some manager had decided to 'secure' the directory by doing 'chmod 000 dir' to protect the data from inquisitive eyes when the data was not being used. ..."
"... Of course, tar complained about the situation and returned with non-null status, but since the backup procedure had seemed to work fine, no one thought it necessary to view the logs... ..."
Jul 20, 2017 | www.linuxjournal.com

Anonymous, 11/08/2002

At an unnamed location it happened thus... The customer had been using a home built 'tar' -based backup system for a long time. They were informed enough to have even tested and verified that recovery would work also.

Everything had been working fine, and they even had to do a recovery which went fine. Well, one day something evil happened to a disk and they had to replace the unit and do a full recovery.

Things looked fine until someone noticed that a directory with critically important and sensitive data was missing. Turned out that some manager had decided to 'secure' the directory by doing 'chmod 000 dir' to protect the data from inquisitive eyes when the data was not being used.

Of course, tar complained about the situation and returned with non-null status, but since the backup procedure had seemed to work fine, no one thought it necessary to view the logs...
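The general fix is to make the backup script itself treat a non-zero tar exit status as a failure that somebody hears about, instead of burying it in a log nobody reads. A minimal sketch; the paths and the mail recipient are hypothetical:

#!/bin/bash
# Nightly tar backup that refuses to fail silently.
SRC=/data
DEST=/backup/data-$(date +%Y%m%d).tar.gz
LOG=/var/log/backup-errors.log

if ! tar -czf "$DEST" "$SRC" 2>"$LOG"; then
    # Non-zero status: unreadable files (e.g. a chmod 000 directory), I/O errors, etc.
    mail -s "BACKUP FAILED on $(hostname)" root < "$LOG"
    exit 1
fi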

[Jan 29, 2019] Backing things up with rsync

Notable quotes:
"... I RECURSIVELY DELETED ALL THE LIVE CORPORATE WEBSITES ON FRIDAY AFTERNOON AT 4PM! ..."
"... This is why it's ALWAYS A GOOD IDEA to use Midnight Commander or something similar to delete directories!! ..."
"... rsync with ssh as the transport mechanism works very well with my nightly LAN backups. I've found this page to be very helpful: http://www.mikerubel.org/computers/rsync_snapshots/ ..."
Jul 20, 2017 | www.linuxjournal.com

Anonymous on Fri, 11/08/2002 - 03:00.

The Subject, not the content, really brings back memories.

Imagine this: you're tasked with complete control over the network in a multi-million dollar company. You've had some experience in the real world of network maintenance, but mostly you've learned from breaking things at home.

Time comes to implement a backup routine (yes, this was a startup company). You carefully consider the best way to do it and decide that copying data to a holding disk before the tape run would be perfect in this situation: faster restores if the holding disk is still alive.

So off you go configuring all your servers for ssh pass through, and create the rsync scripts. Then before the trial run you think it would be a good idea to create a local backup of all the websites.

You log on to the web server, create a temp directory and start testing your newly advanced rsync skills. After a couple of goes, you think you're ready for the real thing, but you decide to run the test one more time.

Everything seems fine so you delete the temp directory. You pause for a second and your mouth drops open wider than it has ever opened before, and a feeling of terror overcomes you. You want to hide in a hole and hope you didn't see what you saw.

I RECURSIVELY DELETED ALL THE LIVE CORPORATE WEBSITES ON FRIDAY AFTERNOON AT 4PM!

Anonymous on Sun, 11/10/2002 - 03:00.

This is why it's ALWAYS A GOOD IDEA to use Midnight Commander or something similar to delete directories!!

...Root for (5) years and never trashed a filesystem yet (knockwoody)...

Anonymous on Fri, 11/08/2002 - 03:00.

rsync with ssh as the transport mechanism works very well with my nightly LAN backups. I've found this page to be very helpful: http://www.mikerubel.org/computers/rsync_snapshots/
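The approach behind that link comes down to hard-linked snapshots. A minimal sketch, assuming backups of a host's /var/www land under /backup/webserver and yesterday's snapshot already exists; names, dates, and paths are illustrative:

#!/bin/bash
# Rotating hard-linked snapshots over ssh, in the spirit of mikerubel's rsync page.
HOST=webserver
TODAY=/backup/$HOST/$(date +%Y-%m-%d)
YESTERDAY=/backup/$HOST/$(date -d yesterday +%Y-%m-%d)

mkdir -p "$TODAY"
# Unchanged files become hard links to yesterday's copy, so every snapshot
# looks like a full tree but only changed files consume new disk space.
rsync -a --delete -e ssh --link-dest="$YESTERDAY" "$HOST:/var/www/" "$TODAY/"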

[Jan 29, 2019] It helps if somebody checks whether the equipment really has power, but often this step is skipped.

Notable quotes:
"... On closer inspection, noticed this power lead was only half in the socket... I connected this back to the original switch, grabbed the "I.T manager" and asked him to "just push the power lead"... his face? Looked like Casper the friendly ghost. ..."
Jan 29, 2019 | thwack.solarwinds.com

nantwiched Jul 13, 2015 11:18 AM

I've had a few horrors, heres a few...

Had to travel from Cheshire to Glasgow (4+hours) at 3am to get to a major high street store for 8am, an hour before opening. A switch had failed and taken out a whole floor of the store. So I prepped the new switch, using the same power lead from the failed switch as that was the only available lead / socket. No power. Initially thought the replacement switch was faulty and I would be in trouble for not testing this prior to attending site...

On closer inspection, noticed this power lead was only half in the socket... I connected this back to the original switch, grabbed the "I.T manager" and asked him to "just push the power lead"... his face? Looked like Casper the friendly ghost.

Problem solved at a massive expense to the company due to the out of hours charges. Surely that would be the first thing to check? Obviously not...

The same thing happened in Aberdeen, a 13 hour round trip to resolve a fault on a "failed router". The router looked dead at first glance, but after taking the side panel off the cabinet, I discovered it always helps if the router is actually plugged in...

Yet the customer clearly said everything is plugged in as it should be and it "must be faulty"... It does tend to appear faulty when not supplied with any power...

[Jan 29, 2019] It can be hot inside the rack

Jan 29, 2019 | thwack.solarwinds.com

jemertz Mar 28, 2016 12:16 PM

Shortly after I started my first remote server-monitoring job, I started receiving, one by one, traps for servers that had gone heartbeat missing/no-ping at a remote site. I looked up the site, and there were 16 total servers there, of which about 4 or 5 (and counting) were already down. Clearly not network issues. I remoted into one of the ones that was still up, and found in the Windows event viewer that it was beginning to overheat.

I contacted my front-line team and asked them to call the site to find out if the data center air conditioner had gone out, or if there was something blocking the servers' fans or something. He called, the client at the site checked and said the data center was fine, so I dispatched IBM (our remote hands) to go to the site and check out the servers. They got there and called in laughing.

There was construction in the data center, and the contractors, being thoughtful, had draped a painter's dropcloth over the server racks to keep off saw dust. Of COURSE this caused the servers to overheat. Somehow the client had failed to mention this.

...so after all this went down, the client had the gall to ask us to replace the servers "just in case" there was any damage, despite the fact that each of them had shut itself down in order to prevent thermal damage. We went ahead and replaced them anyway. (I'm sure they were rebuilt and sent to other clients, but installing these servers takes about 2-3 hours of IBM's time on site and 60-90 minutes of my remote team's time, not counting the rebuild before recycling.)
Oh well. My employer paid me for my time, so no skin off my back.

[Jan 29, 2019] "Sure, I get out my laptop, plug in the network cable, get on the internet from home. I start the VPN client, take out this paper with the code on it, and type it in..." Yup. He wrote down the RSA token's code before he went home.

Jan 29, 2019 | thwack.solarwinds.com

jm_sysadmin Expert Jul 8, 2015 7:04 AM

I was just starting my IT career, and I was told a VIP user couldn't VPN in, and I was asked to help. Everything checked out with the computer, so I asked the user to try it in front of me. He took out his RSA token, knew what to do with it, and it worked.

I also knew this user had been complaining of this issue for some time, and I wasn't the first person to try to fix this. Something wasn't right.

I asked him to walk me through every step he took from when it failed the night before.

"Sure, I get out my laptop, plug in the network cable, get on the internet from home. I start the VPN client, take out this paper with the code on it, and type it in..." Yup. He wrote down the RSA token's code before he went home. See that little thing was expensive, and he didn't want to lose it. I explained that the number changes all time, and that he needed to have it with him. VPN issue resolved.

[Jan 29, 2019] How electricians can help to improve server uptime

Notable quotes:
"... "Oh my God, the server room is full of smoke!" Somehow they hooked up things wrong and fed 220v instead of 110v to all the circuits. Every single UPS was dead. Several of the server power supplies were fried. ..."
Jan 29, 2019 | thwack.solarwinds.com

wfordham Jul 13, 2015 1:09 PM

This happened back when we had an individual APC UPS for each server. Most of the servers were really just whitebox PCs in a rack mount case running a server OS.

The facilities department was doing some planned maintenance on the electrical panel in the server room over the weekend. They assured me that they were not going to touch any of the circuits for the server room, just for the rooms across the hallway. Well, they disconnected power to the entire panel. Then they called me to let me know what they did. I was able to remotely verify that everything was running on battery just fine. I let them know that they had about 20 minutes to restore power or I would need to start shutting down servers. They called me again and said,

"Oh my God, the server room is full of smoke!" Somehow they hooked up things wrong and fed 220v instead of 110v to all the circuits. Every single UPS was dead. Several of the server power supplies were fried.

And a few motherboards didn't make it either. It took me the rest of the weekend kludging things together to get the critical systems back online.

[Jan 28, 2019] Testing the backup power system as the main source of power outages

Highly recommended!
Jan 28, 2019 | thwack.solarwinds.com

gcp Jul 8, 2015 10:33 PM

Many years ago I worked at an IBM Mainframe site. To make systems more robust they installed a UPS system for the mainframe with battery bank and a honkin' great diesel generator in the yard.

During the commissioning of the system, they decided to test the UPS cutover one afternoon - everything goes *dark* in seconds. Frantic running around to get power back on and MF restarted and databases recovered (afternoon, remember? during the work day...). Oh! The UPS batteries were not charged! Oops.

Over the next few weeks, they did two more 'tests' during the working day, with everything going *dark* in seconds for various reasons. Oops.

Then they decided - perhaps we should test this outside of office hours. (YAY!)

Still took a few more efforts to get everything working - diesel generator wouldn't start automatically, fixed that and forgot to fill up the diesel tank so cutover was fine until the fuel ran out.

Many, many lessons learned from this episode.

[Jan 28, 2019] False alarm: bad smell in machine room caused by an electrical light, not a server

Jan 28, 2019 | www.reddit.com

radiomix (Jack of All Trades):

I was in my main network facility, for a municipal fiber optic ring. Outside were two technicians replacing our backup air conditioning unit. I walk inside after talking with the two technicians, turn on the lights and begin walking around just visually checking things around the room. All of a sudden I started smelling that dreaded electric hot/burning smell. In this place I have my core switch, primary router, a handful of servers, some customer equipment and a couple of racks for my service provider. I start running around the place like a mad man sniffing all the equipment. I even called in the AC technicians to help me sniff.

After 15 minutes we could not narrow down where it was coming from. Finally I noticed that one of the florescent lights had not come on. I grabbed a ladder and opened it up.

The ballast had burned out on the light and it just so happen to be the light right in front of the AC vent blowing the smell all over the room.

The last time I had smelled that smell in that room a major piece of equipment went belly up and there was nothing I could do about it.

benjunmun:
The exact same thing has happened to me. Nothing quite as terrifying as the sudden smell of ozone as you're surrounded by critical computers and electrical gear.

[Jan 28, 2019] Loss of power problems: Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

Jan 28, 2019 | www.reddit.com

eraser_6776 (VP IT/Sec, "a damn suit"):

May 22, 2004. There was a rather massive storm here that spurred one of the biggest tornadoes recorded in Nebraska ( www.tornadochaser.net/hallam.html ) and I was a sysadmin for a small company. It was a Saturday, aka beer day, and as all hell was breaking loose my friends' and roommates' pagers and phones were all going off. "Ha ha!" I said, looking at a silent cellphone, "sucks to be you!"

Next morning around 10 my phone rings, and I groggily answer it because it's the owner of the company. "You'd better come in here, none of the computers will turn on" he says. Slight panic, but I hadn't received any emails. So it must have been breakers, and I can get that fixed. No problem.

I get into the office and something strikes me. That eerie sound of silence. Not a single machine is on... why not? Still shaking off too much beer from the night before, I go into the server room and find out why I didn't get paged. Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

I start walking around the office trying to turn on machines and.. dead. All of them. Every last desktop won't power on. That's when panic REALLY set in.

In the aftermath I found out two things - one, when the building was built, it was built with a steel roof and steel trusses. Two, when my predecessor had the network cabling wired he hired an idiot who didn't know fire code and ran the network cabling, conveniently, along the trusses into the ceiling. Thus, when lightning hit the building it had a perfect ground path to every workstation in the company. Some servers that weren't in the primary cabinet had been wired to a wall jack (which, in turn, went up into the ceiling then back down into the cabinet because you know, wire management!). Thankfully they were all "legacy" servers.

The only thing that saved the main servers was that Cisco 2924 XL-EN's are some badass mofo's that would die before they let that voltage pass through to the servers in the cabinet. At least that's what I told myself.

All in all, it ended up being one of the longest work weeks ever as I first had to source a bunch of switches, fast to get things like mail and the core network back up. Next up was feeding my buddies a bunch of beer and pizza after we raided every box store in town for spools of Cat 5 and threw wire along the floor.

Finally I found out that CDW can and would get you a whole lot of desktops delivered to your door with your software pre-installed in less than 24 hours if you have an open checkbook. Thanks to a great insurance policy, we did. Shipping and "handling" for those were more than the cost of the machines (again, this was back in 2004 and they were business desktops so you can imagine).

Still, for weeks after I had non-stop user complaints that generally involved "..I think this is related to the lightning ". I drank a lot that summer.

[Jan 28, 2019] Format of the wrong partition initiated during a RHEL install

Notable quotes:
"... Look at the screen, check out what it is doing, realize that the installer had grabbed the backend and he said yeah format all(we are not sure exactly how he did it). ..."
Jan 28, 2019 | www.reddit.com

kitched:

~10 years ago. 100GB drives on a node attached to an 8TB SAN. Cabling is all hooked up as we are adding this new node to manage the existing data on the SAN. A guy that is training up to help, we let him install RedHat and go through the GUI setup. Did not pay attention to him, and after a while wonder what is taking so long. Walk over to him and he is still staring at the install screen and says, "Hey guys, this format sure is taking a while".

Look at the screen, check out what it is doing, and realize that the installer had grabbed the backend and he had said yeah, format all (we are not sure exactly how he did it).

Middle of the day, better kick off the tape restore for 8TB of data.

[Jan 28, 2019] I still went to work that day, tired, grumpy and hyped on caffeine teetering between consciousness and a comatose state

Big mistake. This is a perfect state to commit some big SNAFU
Jan 28, 2019 | thwack.solarwinds.com

porterseceng Jul 9, 2015 9:44 AM

I was the on-call technician for the security team supporting a Fortune 500 logistics company, in fact it was my first time being on-call. My phone rings at about 2:00 AM and the help desk agent says that the Citrix portal is down for everyone. This is a big deal because it's a 24/7 shop with people remoting in all around the world. While not strictly a security appliance, my team was responsible for the Citrix Access Gateway that was run on a NetScaler. Also on the line are the systems engineers responsible for the Citrix presentation/application servers.

I log in, check the appliance, look at all of the monitors, everything is reporting up. After about 4 hours of troubleshooting and trying everything within my limited knowledge of this system we get my boss on the line to help.

It came down to this: the Citrix team didn't troubleshoot anything and it was the StoreFront and broker servers that were having the troubles; but since the CAG wouldn't let people see any applications they instantly pointed the finger at the security team and blamed us.

I still went to work that day, tired, grumpy and hyped on caffeine teetering between consciousness and a comatose state because of two reasons: the Citrix team doesn't know how to do their job and I was too tired to ask the investigating questions like "when did it stop working? has anything changed? what have you looked at so far?".

[Jan 28, 2019] Any horror stories about tired sysadmins...

Long story short, don't drink soda late at night, especially near your laptop! Soda spills are not easy to clean up.
Jan 28, 2019 | thwack.solarwinds.com

mickyred:

I initially read this as "Any horror stories about tired sysadmins..."
cpbills (Sr. Linux Admin):
They exist. This is why 'good' employers provide coffee.

[Jan 28, 2019] Something about the meaning of the word space

Jul 13, 2015 | thwack.solarwinds.com

Jul 13, 2015 7:44 AM

Trying to walk a tech through some switch config.

me: type config space t

them: it doesn't work

me: <sigh> <spells out config> space the single letter t

them: it still doesn't work

--- try some other rudimentary things ---

me: uh, are you typing in the word 'space'?

them: you said to

[Jan 28, 2019] Happy Sysadmin Appreciation Day 2016

Jan 28, 2019 | opensource.com

dale.sykora on 29 Jul 2016 Permalink

I have a horror story from another IT person. One day they were tasked with adding a new server to a rack in their data center. They added the server... being careful not to bump a cable to the nearby production servers, SAN, and network switch. The physical install went well. But when they powered on the server, the ENTIRE RACK went dark. Customers were not happy :( It turns out that the power circuit they attached the server to was already at max capacity, and thus they caused the breaker to trip. Lessons learned... use redundant power and monitor power consumption.

Another issue was being a newbie on a Cisco switch and making a few changes and thinking the innocent sounding "reload" command would work like Linux does when you restart a daemon. Watching 48 link activity LEDs go dark on your vmware cluster switch... Priceless

[Jan 28, 2019] The ghost of the failed restore

Notable quotes:
"... "Of course! You told me that I had to stay a couple of extra hours to perform that task," I answered. "Exactly! But you preferred to leave early without finishing that task," he said. "Oh my! I thought it was optional!" I exclaimed. ..."
"... "It was, it was " ..."
"... Even with the best solution that promises to make the most thorough backups, the ghost of the failed restoration can appear, darkening our job skills, if we don't make a habit of validating the backup every time. ..."
Nov 01, 2018 | opensource.com

In a well-known data center (whose name I do not want to remember), one cold October night we had a production outage in which thousands of web servers stopped responding due to downtime in the main database. The database administrator asked me, the rookie sysadmin, to recover the database's last full backup and restore it to bring the service back online.

But, at the end of the process, the database was still broken. I didn't worry, because there were other full backup files in stock. However, even after doing the process several times, the result didn't change.

With great fear, I asked the senior sysadmin what to do to fix this behavior.

"You remember when I showed you, a few days ago, how the full backup script was running? Something about how important it was to validate the backup?" responded the sysadmin.

"Of course! You told me that I had to stay a couple of extra hours to perform that task," I answered. "Exactly! But you preferred to leave early without finishing that task," he said. "Oh my! I thought it was optional!" I exclaimed.

"It was, it was "

Moral of the story: Even with the best solution that promises to make the most thorough backups, the ghost of the failed restoration can appear, darkening our job skills, if we don't make a habit of validating the backup every time.
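Validation does not have to be elaborate; even a crude automated spot check after every full backup beats discovering the problem during a 3 a.m. restore. A minimal sketch, where the archive path and the sample file are hypothetical and the member name depends on how the archive was created:

#!/bin/bash
# Crude validation pass for a freshly written full backup.
BACKUP=/backup/full-$(date +%Y%m%d).tar.gz

# 1. The archive must be readable end to end.
gzip -t "$BACKUP" || { echo "backup archive is corrupt" >&2; exit 1; }

# 2. Restore a sample file into a scratch directory and compare it to the source.
TMP=$(mktemp -d)
tar -xzf "$BACKUP" -C "$TMP" etc/fstab
diff -q "$TMP/etc/fstab" /etc/fstab || { echo "restore mismatch" >&2; exit 1; }
rm -rf "$TMP"
echo "backup $BACKUP validated"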

[Jan 28, 2019] The danger of a single backup hard drive (USB or not)

The most typical danger is dropping the hard drive on the floor.
Notable quotes:
"... Also, backing up to another disk in the same computer will probably not save you when lighting strikes, as the backup disk is just as likely to be fried as the main disk. ..."
"... In real life, the backup strategy and hardware/software choices to support it is (as most other things) a balancing act. The important thing is that you have a strategy, and that you test it regularly to make sure it works as intended (as the main point is in the article). Also, realizing that achieving 100% backup security is impossible might save a lot of time in setting up the strategy. ..."
Nov 08, 2002 | www.linuxjournal.com

Anonymous on Fri, 11/08/2002

Why don't you just buy an extra hard disk and have a copy of your important data there. With today's prices it doesn't cost anything.

Anonymous on Fri, 11/08/2002 - 03:00. A lot of people seem to have this idea, and in many situations it should work fine.

However, there is the human factor. Sometimes simple things go wrong (as simple as copying a file), and it takes a while before anybody notices that the contents of this file are not what is expected. This means you have to have many "generations" of backup of the file in order to be able to restore it, and, in order not to put all the "eggs in the same basket," each of the file backups should be on a separate physical device.

Also, backing up to another disk in the same computer will probably not save you when lighting strikes, as the backup disk is just as likely to be fried as the main disk.

In real life, the backup strategy and hardware/software choices to support it is (as most other things) a balancing act. The important thing is that you have a strategy, and that you test it regularly to make sure it works as intended (as the main point is in the article). Also, realizing that achieving 100% backup security is impossible might save a lot of time in setting up the strategy.

(I.e. you have to say that this strategy has certain specified limits, like not being able to restore a file to its intermediate state sometime during a workday, only to the state it had when it was last backed up, which should be a maximum of xxx hours ago and so on...)

Hallvard P

[Jan 28, 2019] Those power cables ;-)

Jan 28, 2019 | opensource.com

John Fano on 31 Jul 2016

I was reaching down to power up the new UPS as my guy was stepping out from behind the rack and the whole rack went dark. His foot caught the power cord of the working UPS and pulled it just enough to break the contacts and since the battery was failed it couldn't provide power and shut off. It took about 30 minutes to bring everything back up..

Things went much better with the second UPS replacement. :-)

[Jan 28, 2019] "Right," I said. "Time to get the backup." I knew I had to leave when I saw his face start twitching and he whispered: "Backup ...?"

Jan 28, 2019 | opensource.com

SemperOSS on 13 Sep 2016 Permalink This one seems to be a classic too:

Working for a large UK-based international IT company, I had a call from newest guy in the internal IT department: "The main server, you know ..."

"Yes?"

"I was cleaning out somebody's homedir ..."

"Yes?"

"Well, the server stopped running properly ..."

"Yes?"

"... and I can't seem to get it to boot now ..."

"Oh-kayyyy. I'll just totter down to you and give it an eye."

I went down to the basement where the IT department was located and had a look at his terminal screen on his workstation. Going back through the terminal history, just before a hefty amount of error messages, I found his last command: 'rm -rf /home/johndoe /*'. And I probably do not have to say that he was root at the time (it was them there days before sudo, not that that would have helped in his situation).

"Right," I said. "Time to get the backup." I knew I had to leave when I saw his face start twitching and he whispered: "Backup ...?"

==========

Bonus entries from same company:

It was the days of the 5.25" floppy disks (Wikipedia is your friend, if you belong to the younger generation). I sometimes had to ask people to send a copy of a floppy to check why things weren't working properly. Once I got a nice photocopy and another time, the disk came with a polite note attached ... stapled through the disk, to be more precise!

[Jan 28, 2019] regex - Safe rm -rf function in shell script

Jan 28, 2019 | stackoverflow.com


This question is similar to What is the safest way to empty a directory in *nix?

I'm writing bash script which defines several path constants and will use them for file and directory manipulation (copying, renaming and deleting). Often it will be necessary to do something like:

rm -rf "/${PATH1}"
rm -rf "${PATH2}/"*

While developing this script I'd want to protect myself from mistyping names like PATH1 and PATH2 and avoid situations where they are expanded to empty string, thus resulting in wiping whole disk. I decided to create special wrapper:

rmrf() {
    if [[ $1 =~ "regex" ]]; then
        echo "Ignoring possibly unsafe path ${1}"
        exit 1
    fi

    shopt -s dotglob
    rm -rf -- $1
    shopt -u dotglob
}

Which will be called as:

rmrf "/${PATH1}"
rmrf "${PATH2}/"*

Regex (or sed expression) should catch paths like "*", "/*", "/**/", "///*" etc. but allow paths like "dir", "/dir", "/dir1/dir2/", "/dir1/dir2/*". Also I don't know how to enable shell globbing in case like "/dir with space/*". Any ideas?

EDIT: this is what I came up with so far:

rmrf() {
    local RES
    local RMPATH="${1}"
    SAFE=$(echo "${RMPATH}" | sed -r 's:^((\.?\*+/+)+.*|(/+\.?\*+)+.*|[\.\*/]+|.*/\.\*+)$::g')
    if [ -z "${SAFE}" ]; then
        echo "ERROR! Unsafe deletion of ${RMPATH}"
        return 1
    fi

    shopt -s dotglob
    if [ '*' == "${RMPATH: -1}" ]; then
        echo rm -rf -- "${RMPATH/%\*/}"*
        RES=$?
    else
        echo rm -rf -- "${RMPATH}"
        RES=$?
    fi
    shopt -u dotglob

    return $RES
}

Intended use is (note an asterisk inside quotes):

rmrf "${SOMEPATH}"
rmrf "${SOMEPATH}/*"

where $SOMEPATH is not system or /home directory (in my case all such operations are performed on filesystem mounted under /scratch directory).

CAVEATS:

SpliFF, Jun 14, 2009 at 13:45

I've found a big danger with rm in bash is that bash usually doesn't stop for errors. That means that:
cd $SOMEPATH
rm -rf *

Is a very dangerous combination if the change directory fails. A safer way would be:

cd $SOMEPATH && rm -rf *

Which will ensure the rf won't run unless you are really in $SOMEPATH. This doesn't protect you from a bad $SOMEPATH but it can be combined with the advice given by others to help make your script safer.

EDIT: @placeybordeaux makes a good point that if $SOMEPATH is undefined or empty, cd doesn't treat it as an error and returns 0. In light of that, this answer should be considered unsafe unless $SOMEPATH is validated as existing and non-empty first. I believe cd with no args should be an illegal command, since at best it performs a no-op and at worst it can lead to unexpected behaviour, but it is what it is.

nice trick, I am one stupid victim. – Sazzad Hissain Khan, Jul 6 '17 at 11:45

If $SOMEPATH is empty won't this rm -rf the user's home directory? – placeybordeaux, Jun 21 '18 at 22:59

@placeybordeaux The && only runs the second command if the first succeeds - so if cd fails rm never runs – SpliFF, Jun 27 '18 at 4:10

@SpliFF at least in ZSH the return value of cd $NONEXISTANTVAR is 0 – placeybordeaux, Jul 3 '18 at 18:46

Instead of cd $SOMEPATH, you should write cd "${SOMEPATH?}". The ${varname?} notation ensures that the expansion fails with a warning message if the variable is unset or empty (such that the && ... part is never run); the double-quotes ensure that special characters in $SOMEPATH, such as whitespace, don't have undesired effects. – ruakh, Jul 13 '18 at 6:46

There is a set -u bash directive that will cause the script to exit when an uninitialized variable is used. I read about it here, with rm -rf as an example. I think that's what you're looking for. And here is set's manual.
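A short illustration of what set -u (or the ${var:?} expansion mentioned in the comments above) buys you: the classic empty-variable rm -rf becomes a hard error instead of a wiped disk. The variable name is hypothetical:

#!/bin/bash
set -u                        # abort on any use of an unset variable

# BACKUP_DIR was never assigned.  Without set -u, "${BACKUP_DIR}/"* expands
# to "/"* and rm is handed every top-level directory on the system.
rm -rf "${BACKUP_DIR}/"*      # with set -u: "BACKUP_DIR: unbound variable", script exits

# The per-expansion form gives the same protection without set -u:
rm -rf "${BACKUP_DIR:?is not set}/"*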

Jun 14, 2009 at 12:38

I think the "rm" command has a parameter to avoid deleting "/". Check it out.

Thanks! I didn't know about such an option. Actually it is named --preserve-root and is not mentioned in the manpage. – Max, Jun 14 '09 at 12:56

On my system this option is on by default, but it can't help in a case like rm -ri /* – Max, Jun 14 '09 at 13:18

ynimous, Jun 14, 2009 at 12:42

I would recommend using realpath(1) rather than the command argument directly, so that you can avoid things like /A/B/../ or symbolic links.

Useful but non-standard command. I've found a possible bash replacement: archlinux.org/pipermail/pacman-dev/2009-February/008130.html – Max, Jun 14 '09 at 13:30

Jonathan Leffler, Jun 14, 2009 at 12:47

Generally, when I'm developing a command with operations such as ' rm -fr ' in it, I will neutralize the remove during development. One way of doing that is:
RMRF="echo rm -rf"
...
$RMRF "/${PATH1}"

This shows me what should be deleted - but does not delete it. I will do a manual clean up while things are under development - it is a small price to pay for not running the risk of screwing up everything.

The notation ' "/${PATH1}" ' is a little unusual; normally, you would ensure that PATH1 simply contains an absolute pathname.

Using the metacharacter with ' "${PATH2}/"* ' is unwise and unnecessary. The only difference between using that and using just ' "${PATH2}" ' is that if the directory specified by PATH2 contains any files or directories with names starting with dot, then those files or directories will not be removed. Such a design is unlikely and is rather fragile. It would be much simpler just to pass PATH2 and let the recursive remove do its job. Adding the trailing slash is not necessarily a bad idea; the system would have to ensure that $PATH2 contains a directory name, not just a file name, but the extra protection is rather minimal.

Using globbing with ' rm -fr ' is usually a bad idea. You want to be precise and restrictive and limiting in what it does - to prevent accidents. Of course, you'd never run the command (shell script you are developing) as root while it is under development - that would be suicidal. Or, if root privileges are absolutely necessary, you neutralize the remove operation until you are confident it is bullet-proof.

To delete subdirectories and files starting with dot I use "shopt -s dotglob". Using rm -rf "${PATH2}" is not appropriate because in my case PATH2 can only be removed by the superuser, and this results in error status for the "rm" command (and I verify it to track other errors). – Max, Jun 14 '09 at 13:09

Then, with due respect, you should use a private sub-directory under $PATH2 that you can remove. Avoid glob expansion with commands like 'rm -rf' like you would avoid the plague (or should that be A/H1N1?). – Jonathan Leffler, Jun 14 '09 at 13:37

Max, Jun 14, 2009 at 14:10

Meanwhile I've found this perl project: http://code.google.com/p/safe-rm/

too much php, Jun 15, 2009 at 1:55

If it is possible, you should try and put everything into a folder with a hard-coded name which is unlikely to be found anywhere else on the filesystem, such as ' foofolder '. Then you can write your rmrf() function as:
rmrf() {
    rm -rf "foofolder/$PATH1"
    # or
    rm -rf "$PATH1/foofolder"
}

There is no way that function can delete anything but the files you want it to.

Actually there is a way: if PATH1 is something like ../../someotherdir – vadipp, Jan 13 '17 at 11:37

btop, Jun 15, 2009 at 6:34

You may use
set -f    # cf. help set

to disable filename generation (*).

Howard Hong, Oct 28, 2009 at 19:56

You don't need to use regular expressions.
Just assign the directories you want to protect to a variable and then iterate over the variable. eg:
protected_dirs="/ /bin /usr/bin /home $HOME"
for d in $protected_dirs; do
    if [ "$1" = "$d" ]; then
        rm=0
        break;
    fi
done
if [ ${rm:-1} -eq 1 ]; then
    rm -rf "$1"
fi

Add the following code to your ~/.bashrc:
# safe delete
move_to_trash () { now="$(date +%Y%m%d_%H%M%S)"; mv "$@" ~/.local/share/Trash/files/"$@_$now"; }
alias del='move_to_trash'

# safe rm
alias rmi='rm -i'

Every time you need to rm something, first consider del; you can change the trash folder. If you really do need to rm something, you can go to the trash folder and use rmi.

One small bug with del is that when you del a folder, for example my_folder, it should be del my_folder and not del my_folder/, since, to allow a possible later restore, the time information is attached at the end ("$@_$now"). For files, it works fine.

[Jan 28, 2019] That's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem

Jan 28, 2019 | www.reddit.com

VexingRaven:

Not really a horror story but definitely one of my first "Oh shit" moments. I was the FNG helpdesk/sysadmin at a company of 150 people. I start getting calls that something (I think it was Outlook) wasn't working in Citrix, apparently something broken on one of the Citrix servers. I'm 100% positive it will be fixed with a reboot (I've seen this before on individual PCs), so I diligently start working to get people off that Citrix server (one of three) so I can reboot it.

I get it cleared out, hit Reboot... And almost immediately get a call from the call center manager saying every single person just got kicked off Citrix. Oh shit. But there was nobody on that server! Apparently that server also housed the Secure Gateway server which my senior hadn't bothered to tell me or simply didn't know (Set up by a consulting firm). Whoops. Thankfully the servers were pretty fast and people's sessions reconnected a few minutes later, no harm no foul. And on the plus side, it did indeed fix the problem.

And that's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem.

[Jan 14, 2019] Safe rm stops you accidentally wiping the system! @ New Zealand Linux

Jan 14, 2019 | www.nzlinux.com
  1. Francois Marier October 21, 2009 at 10:34 am

    Another related tool, to prevent accidental reboots of servers this time, is molly-guard:

    http://packages.debian.org/sid/molly-guard

    It asks you to type the hostname of the machine you want to reboot as an extra confirmation step.
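Where molly-guard itself is not packaged, a rough approximation can be had with shell functions in root's profile. This is a sketch only, not the real tool, and it only protects interactive shells that source it:

# Ask for the hostname before allowing a reboot or shutdown, molly-guard style.
confirm_host() {
    local target
    read -r -p "Type the hostname of the machine you want to $1: " target
    [ "$target" = "$(hostname -s)" ] || { echo "Good thing I asked; aborting."; return 1; }
}
reboot()   { confirm_host reboot   && command reboot   "$@"; }
shutdown() { confirm_host shutdown && command shutdown "$@"; }
poweroff() { confirm_host poweroff && command poweroff "$@"; }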

[Jan 10, 2019] When idiots are offloaded to the security department, interesting things eventually happen to the network

Highly recommended!
The security department often does more damage to the network than any sophisticated hacker can, especially if it is populated with morons, as it usually is. One of the most blatant examples is below... Those idiots decided to disable traceroute (which means blocking ICMP) in order to increase security.
Notable quotes:
"... Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems. ..."
"... Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this. ..."
"... Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting? Yes another example of how ignorant opinions can become common sense. ..."
"... Disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply. ..."
"... You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it. Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes. ..."
"... You have no idea what you're talking about, at any level. "disabled ICMP" - state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass. ..."
"... In short, he's a moron. I have reason to suspect you might be, too. ..."
"... No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours. ..."
"... It's another example of security by stupidity which seldom provides security, but always buys added cost. ..."
"... A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net] ..."
"... Linux has one of the few IP stacks that isn't derived from the BSD stack, which in the industry is considered the reference design. Instead for linux, a new stack with it's own bugs and peculiarities was cobbled up. ..."
"... Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually. ..."
May 27, 2018 | linux.slashdot.org

jfdavis668 ( 1414919 ) , Sunday May 27, 2018 @11:09AM ( #56682996 )

Re:So ( Score: 5 , Interesting)

Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems.

Anonymous Coward writes:
Re: ( Score: 2 , Insightful)

What is the point? If an intruder is already there couldn't they just upload their own binary?

Hylandr ( 813770 ) , Sunday May 27, 2018 @05:57PM ( #56685274 )
Re: So ( Score: 5 , Interesting)

They can easily. And often time will compile their own tools, versions of Apache, etc..

At best it slows down incident response and resolution while doing nothing to prevent discovery of their networks. If you only use Vlans to segregate your architecture you're boned.

gweihir ( 88907 ) , Sunday May 27, 2018 @12:19PM ( #56683422 )
Re: So ( Score: 5 , Interesting)

Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this.

bferrell ( 253291 ) , Sunday May 27, 2018 @12:20PM ( #56683430 ) Homepage Journal
Re: So ( Score: 4 , Interesting)

Except it DOESN'T secure anything, simply renders things a little more obscure... Since when is obscurity security?

fluffernutter ( 1411889 ) writes:
Re: ( Score: 3 )

Doing something to make things more difficult for a hacker is better than doing nothing to make things more difficult for a hacker. Unless you're lazy, as many of these things should be done as possible.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:37PM ( #56684878 )
Re:So ( Score: 5 , Insightful)

No.

Things like this don't slow down "hackers" with even a modicum of network knowledge inside of a functioning network. What they do slow down is your ability to troubleshoot network problems.

Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting? Yes another example of how ignorant opinions can become common sense.

mSparks43 ( 757109 ) writes:
Re: So ( Score: 2 )

Pretty much my reaction. like WTF? OTON, redhat flavors all still on glibc2 starting to become a regular p.i.t.a. so the chances of this actually becoming a thing to be concerned about seem very low.

Kinda like gdpr, same kind of groupthink that anyone actually cares or concerns themselves with policy these days.

ruir ( 2709173 ) writes:
Re: ( Score: 3 )

Disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply.

Or they can do reverse traceroute at least until the border edge of your firewall via an external site.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:32PM ( #56684858 )
Re:So ( Score: 4 , Insightful)

You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it. Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

That's hilarious... I am *the guy* who runs the network. I am our senior network engineer. Every line in every router -- mine.

You have no idea what you're talking about, at any level. "disabled ICMP" - state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

Nonsense. I conceded that morons may actually go through the work to totally break their PMTUD, IP error signaling channels, and make their nodes "invisible"

I understand "networking" at a level I'm pretty sure you only have a foggy understanding of. I write applications that require layer-2 packet building all the way up to layer-4.

In short, he's a moron. I have reason to suspect you might be, too.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

A CDS is MAC. Turning off ICMP toward people who aren't allowed to access your node/network is understandable. They can't get anything else though, why bother supporting the IP control channel? CDS does *not* say turn off ICMP globally. I deal with CDS, SSAE16 SOC 2, and PCI compliance daily. If your CDS solution only operates with a layer-4 ACL, it's a pretty simple model, or You're Doing It Wrong (TM)

nyet ( 19118 ) writes:
Re: ( Score: 3 )

> I'm not a network person

IOW, nothing you say about networking should be taken seriously.

kevmeister ( 979231 ) , Sunday May 27, 2018 @05:47PM ( #56685234 ) Homepage
Re:So ( Score: 4 , Insightful)

No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours.

The problem is that doing this requires things like packet fragmentation, which greatly increases router CPU load and reduces the maximum PPS of your network, as well as resulting in dropped packets requiring re-transmission, and may also result in window collapse followed by slow-start; though rapid recovery mitigates much of this, it's still not free.

It's another example of security by stupidity which seldom provides security, but always buys added cost.

Hylandr ( 813770 ) writes:
Re: ( Score: 3 )

As a server engineer I am experiencing this with our network team right now.

Do you have some reading that I might be able to further educate myself? I would like to be able to prove to the directors why disabling ICMP on the network may be the cause of our issues.

Zaelath ( 2588189 ) , Sunday May 27, 2018 @07:51PM ( #56685758 )
Re:So ( Score: 4 , Informative)

A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net]

Bing Tsher E ( 943915 ) , Sunday May 27, 2018 @01:22PM ( #56683792 ) Journal
Re: Denying ICMP echo @ server/workstation level t ( Score: 5 , Insightful)

Linux has one of the few IP stacks that isn't derived from the BSD stack, which in the industry is considered the reference design. Instead for linux, a new stack with it's own bugs and peculiarities was cobbled up.

Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually.
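
The practical takeaway from the thread: if you must filter ICMP at all, filter selectively rather than wholesale. A sketch of what that might look like with iptables on an endpoint (chain layout and rate limits are assumptions, not anyone's quoted config):

# Keep the error/control messages TCP/IP depends on, rate-limit echo, drop the rest.
iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT  # includes fragmentation-needed (PMTUD)
iptables -A INPUT -p icmp --icmp-type time-exceeded -j ACCEPT
iptables -A INPUT -p icmp --icmp-type parameter-problem -j ACCEPT
iptables -A INPUT -p icmp --icmp-type echo-request -m limit --limit 5/second -j ACCEPT
iptables -A INPUT -p icmp -j DROP

The point is not that this particular ruleset is right for any given network, only that ICMP types can be filtered individually instead of being thrown away wholesale.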

[Jan 10, 2019] saferm Safely remove files, moving them to GNOME/KDE trash instead of deleting by Eemil Lagerspetz

Jan 10, 2019 | github.com
#!/bin/bash
##
## saferm.sh
## Safely remove files, moving them to GNOME/KDE trash instead of deleting.
## Made by Eemil Lagerspetz
## Login   <vermind@drache>
## 
## Started on  Mon Aug 11 22:00:58 2008 Eemil Lagerspetz
## Last update Sat Aug 16 23:49:18 2008 Eemil Lagerspetz
##

version="1.16";

## flags (change these to change default behaviour)
recursive="" # do not recurse into directories by default
verbose="true" # set verbose by default for inexperienced users.
force="" #disallow deleting special files by default
unsafe="" # do not behave like regular rm by default

## possible flags (recursive, verbose, force, unsafe)
# don't touch this unless you want to create/destroy flags
flaglist="r v f u q"

# Colours
blue='\e[1;34m'
red='\e[1;31m'
norm='\e[0m'

## trashbin definitions
# this is the same for newer KDE and GNOME:
trash_desktops="$HOME/.local/share/Trash/files"
# if neither is running:
trash_fallback="$HOME/Trash"

# use .local/share/Trash?
use_desktop=$( ps -U $USER | grep -E "gnome-settings|startkde|mate-session|mate-settings|mate-panel|gnome-shell|lxsession|unity" )

# mounted filesystems, for avoiding cross-device move on safe delete
filesystems=$( mount | awk '{print $3; }' )

if [ -n "$use_desktop" ]; then
    trash="${trash_desktops}"
    infodir="${trash}/../info";
    for k in "${trash}" "${infodir}"; do
        if [ ! -d "${k}" ]; then mkdir -p "${k}"; fi
    done
else
    trash="${trash_fallback}"
fi

usagemessage() {
        echo -e "This is ${blue}saferm.sh$norm $version. LXDE and Gnome3 detection.
    Will ask to unsafe-delete instead of cross-fs move. Allows unsafe (regular rm) delete (ignores trashinfo).
    Creates trash and trashinfo directories if they do not exist. Handles symbolic link deletion.
    Does not complain about different user any more.\n";
        echo -e "Usage: ${blue}/path/to/saferm.sh$norm [${blue}OPTIONS$norm] [$blue--$norm] ${blue}files and dirs to safely remove$norm"
        echo -e "${blue}OPTIONS$norm:"
        echo -e "$blue-r$norm      allows recursively removing directories."
        echo -e "$blue-f$norm      Allow deleting special files (devices, ...)."
  echo -e "$blue-u$norm      Unsafe mode, bypass trash and delete files permanently."
        echo -e "$blue-v$norm      Verbose, prints more messages. Default in this version."
  echo -e "$blue-q$norm      Quiet mode. Opposite of verbose."
        echo "";
}

detect() {
    if [ ! -e "$1" ]; then fs=""; return; fi
    path=$(readlink -f "$1")
    for det in $filesystems; do
        match=$( echo "$path" | grep -oE "^$det" )
        if [ -n "$match" ]; then
            if [ ${#det} -gt ${#fs} ]; then
                fs="$det"
            fi
        fi
    done
}


trashinfo() {
#gnome: generate trashinfo:
        bname=$( basename -- "$1" )
    fname="${trash}/../info/${bname}.trashinfo"
    cat << EOF > "${fname}"
[Trash Info]
Path=$PWD/${1}
DeletionDate=$( date +%Y-%m-%dT%H:%M:%S )
EOF
}

setflags() {
    for k in $flaglist; do
        reduced=$( echo "$1" | sed "s/$k//" )
        if [ "$reduced" != "$1" ]; then
            flags_set="$flags_set $k"
        fi
    done
  for k in $flags_set; do
        if [ "$k" == "v" ]; then
            verbose="true"
        elif [ "$k" == "r" ]; then 
            recursive="true"
        elif [ "$k" == "f" ]; then 
            force="true"
        elif [ "$k" == "u" ]; then 
            unsafe="true"
        elif [ "$k" == "q" ]; then 
    unset verbose
        fi
  done
}

performdelete() {
                        # "delete" = move to trash
                        if [ -n "$unsafe" ]
                        then
                          if [ -n "$verbose" ];then echo -e "Deleting $red$1$norm"; fi
                    #UNSAFE: permanently remove files.
                    rm -rf -- "$1"
                        else
                          if [ -n "$verbose" ];then echo -e "Moving $blue$k$norm to $red${trash}$norm"; fi
                    mv -b -- "$1" "${trash}" # moves and backs up old files
                        fi
}

askfs() {
  detect "$1"
  if [ "${fs}" != "${tfs}" ]; then
    unset answer;
    until [ "$answer" == "y" -o "$answer" == "n" ]; do
      echo -e "$blue$1$norm is on $blue${fs}$norm. Unsafe delete (y/n)?"
      read -n 1 answer;
    done
    if [ "$answer" == "y" ]; then
      unsafe="yes"
    fi
  fi
}

complain() {
  msg=""
  if [ ! -e "$1" -a ! -L "$1" ]; then # does not exist
    msg="File does not exist:"
        elif [ ! -w "$1" -a ! -L "$1" ]; then # not writable
    msg="File is not writable:"
        elif [ ! -f "$1" -a ! -d "$1" -a -z "$force" ]; then # Special or sth else.
        msg="Is not a regular file or directory (and -f not specified):"
        elif [ -f "$1" ]; then # is a file
    act="true" # operate on files by default
        elif [ -d "$1" -a -n "$recursive" ]; then # is a directory and recursive is enabled
    act="true"
        elif [ -d "$1" -a -z "${recursive}" ]; then
                msg="Is a directory (and -r not specified):"
        else
                # not file or dir. This branch should not be reached.
                msg="No such file or directory:"
        fi
}

asknobackup() {
  unset answer
        until [ "$answer" == "y" -o "$answer" == "n" ]; do
          echo -e "$blue$k$norm could not be moved to trash. Unsafe delete (y/n)?"
          read -n 1 answer
        done
        if [ "$answer" == "y" ]
        then
          unsafe="yes"
          performdelete "${k}"
          ret=$?
                # Reset temporary unsafe flag
          unset unsafe
          unset answer
        else
          unset answer
        fi
}

deletefiles() {
  for k in "$@"; do
          fdesc="$blue$k$norm";
          complain "${k}"
          if [ -n "$msg" ]
          then
                  echo -e "$msg $fdesc."
    else
        #actual action:
        if [ -z "$unsafe" ]; then
          askfs "${k}"
        fi
                  performdelete "${k}"
                  ret=$?
                  # Reset temporary unsafe flag
                  if [ "$answer" == "y" ]; then unset unsafe; unset answer; fi
      #echo "MV exit status: $ret"
      if [ ! "$ret" -eq 0 ]
      then 
        asknobackup "${k}"
      fi
      if [ -n "$use_desktop" ]; then
          # generate trashinfo for desktop environments
        trashinfo "${k}"
      fi
    fi
        done
}

# Make trash if it doesn't exist
if [ ! -d "${trash}" ]; then
    mkdir "${trash}";
fi

# find out which flags were given
afteropts=""; # boolean for end-of-options reached
for k in "$@"; do
        # if starts with dash and before end of options marker (--)
        if [ "${k:0:1}" == "-" -a -z "$afteropts" ]; then
                if [ "${k:1:2}" == "-" ]; then # if end of options marker
                        afteropts="true"
                else # option(s)
                    setflags "$k" # set flags
                fi
        else # not starting with dash, or after end-of-opts
                files[++i]="$k"
        fi
done

if [ -z "${files[1]}" ]; then # no parameters?
        usagemessage # tell them how to use this
        exit 0;
fi

# Which fs is trash on?
detect "${trash}"
tfs="$fs"

# do the work
deletefiles "${files[@]}"
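
A few ways the script might be invoked (assuming it is saved as saferm.sh and made executable):

chmod +x saferm.sh
./saferm.sh notes.txt              # moves notes.txt to the trash
./saferm.sh -r old_project         # recursively trash a directory
./saferm.sh -u -r scratch_build    # unsafe mode: permanent rm -rf, no trash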



[Oct 22, 2018] linux - If I rm -rf a symlink will the data the link points to get erased, too?

Notable quotes:
"... Put it in another words, those symlink-files will be deleted. The files they "point"/"link" to will not be touch. ..."
Oct 22, 2018 | unix.stackexchange.com

user4951 ,Jan 25, 2013 at 2:40

This is the contents of the /home3 directory on my system:
./   backup/    hearsttr@  lost+found/  randomvi@  sexsmovi@
../  freemark@  investgr@  nudenude@    romanced@  wallpape@

I want to clean this up but I am worried because of the symlinks, which point to another drive.

If I say rm -rf /home3 will it delete the other drive?

John Sui

rm -rf /home3 will delete all files and directories within home3, and home3 itself, including symlink files, but it will not "follow" (dereference) those symlinks.

Put it in other words, those symlink-files will be deleted. The files they "point"/"link" to will not be touched.

[Oct 22, 2018] Does rm -rf follow symbolic links?

Jan 25, 2012 | superuser.com
I have a directory like this:
$ ls -l
total 899166
drwxr-xr-x 12 me scicomp       324 Jan 24 13:47 data
-rw-r--r--  1 me scicomp     84188 Jan 24 13:47 lod-thin-1.000000-0.010000-0.030000.rda
drwxr-xr-x  2 me scicomp       808 Jan 24 13:47 log
lrwxrwxrwx  1 me scicomp        17 Jan 25 09:41 msg -> /home/me/msg

And I want to remove it using rm -r .

However I'm scared rm -r will follow the symlink and delete everything in that directory (which is very bad).

I can't find anything about this in the man pages. What would be the exact behavior of running rm -rf from a directory above this one?

LordDoskias, Jan 25, 2012 at 16:43

How hard it is to create a dummy dir with a symlink pointing to a dummy file and execute the scenario? Then you will know for sure how it works! –

hakre ,Feb 4, 2015 at 13:09

X-Ref: If I rm -rf a symlink will the data the link points to get erased, too? ; Deleting a folder that contains symlinks – hakre Feb 4 '15 at 13:09

Susam Pal ,Jan 25, 2012 at 16:47

Example 1: Deleting a directory containing a soft link to another directory.
susam@nifty:~/so$ mkdir foo bar
susam@nifty:~/so$ touch bar/a.txt
susam@nifty:~/so$ ln -s /home/susam/so/bar/ foo/baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── foo
    └── baz -> /home/susam/so/bar/

3 directories, 1 file
susam@nifty:~/so$ rm -r foo
susam@nifty:~/so$ tree
.
└── bar
    └── a.txt

1 directory, 1 file
susam@nifty:~/so$

So, we see that the target of the soft-link survives.

Example 2: Deleting a soft link to a directory

susam@nifty:~/so$ ln -s /home/susam/so/bar baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── baz -> /home/susam/so/bar

2 directories, 1 file
susam@nifty:~/so$ rm -r baz
susam@nifty:~/so$ tree
.
└── bar
    └── a.txt

1 directory, 1 file
susam@nifty:~/so$

Only, the soft link is deleted. The target of the soft-link survives.

Example 3: Attempting to delete the target of a soft-link

susam@nifty:~/so$ ln -s /home/susam/so/bar baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── baz -> /home/susam/so/bar

2 directories, 1 file
susam@nifty:~/so$ rm -r baz/
rm: cannot remove 'baz/': Not a directory
susam@nifty:~/so$ tree
.
├── bar
└── baz -> /home/susam/so/bar

2 directories, 0 files

The file in the target of the symbolic link does not survive.

The above experiments were done on a Debian GNU/Linux 9.0 (stretch) system.

Wyrmwood ,Oct 30, 2014 at 20:36

rm -rf baz/* will remove the contents – Wyrmwood Oct 30 '14 at 20:36

Buttle Butkus ,Jan 12, 2016 at 0:35

Yes, if you do rm -rf [symlink], then the contents of the original directory will be obliterated! Be very careful. – Buttle Butkus Jan 12 '16 at 0:35

frnknstn ,Sep 11, 2017 at 10:22

Your example 3 is incorrect! On each system I have tried, the file a.txt will be removed in that scenario. – frnknstn Sep 11 '17 at 10:22

Susam Pal ,Sep 11, 2017 at 15:20

@frnknstn You are right. I see the same behaviour you mention on my latest Debian system. I don't remember on which version of Debian I performed the earlier experiments. In my earlier experiments on an older version of Debian, either a.txt must have survived in the third example or I must have made an error in my experiment. I have updated the answer with the current behaviour I observe on Debian 9 and this behaviour is consistent with what you mention. – Susam Pal Sep 11 '17 at 15:20

Ken Simon ,Jan 25, 2012 at 16:43

Your /home/me/msg directory will be safe if you rm -rf the directory from which you ran ls. Only the symlink itself will be removed, not the directory it points to.

The only thing I would be cautious of, would be if you called something like "rm -rf msg/" (with the trailing slash.) Do not do that because it will remove the directory that msg points to, rather than the msg symlink itself.

> ,Jan 25, 2012 at 16:54

"The only thing I would be cautious of, would be if you called something like "rm -rf msg/" (with the trailing slash.) Do not do that because it will remove the directory that msg points to, rather than the msg symlink itself." - I don't find this to be true. See the third example in my response below. – Susam Pal Jan 25 '12 at 16:54

Andrew Crabb ,Nov 26, 2013 at 21:52

I get the same result as @Susam ('rm -r symlink/' does not delete the target of symlink), which I am pleased about as it would be a very easy mistake to make. – Andrew Crabb Nov 26 '13 at 21:52

,

rm should remove files and directories. If the file is a symbolic link, the link is removed, not the target; rm does not interpret the symbolic link. Consider, for example, the behavior when deleting 'broken links': rm exits with 0, not with a non-zero status indicating failure.
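
If in doubt, it is cheap to reproduce the behaviour in a throwaway sandbox before touching anything real; a sketch:

# Build a disposable sandbox and see what rm actually does to a symlinked dir.
tmp=$(mktemp -d)
mkdir -p "$tmp/real"
touch "$tmp/real/a.txt"
ln -s "$tmp/real" "$tmp/link"

rm -r "$tmp/link"          # removes only the symlink; real/a.txt survives
ls "$tmp/real"             # a.txt is still there

ln -s "$tmp/real" "$tmp/link"
rm -r "$tmp/link/"         # with a trailing slash, behaviour varies: on some systems the contents go
ls "$tmp/real"             # check what survived on *your* system
rm -rf "$tmp"              # throw the sandbox away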

[Oct 05, 2018] Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk

Oct 05, 2018 | cam.ac.uk

From: [email protected] (Marc Fraioli)
Organization: Grebyn Timesharing

Well, here's a good one for you:

I was happily churning along developing something on a Sun workstation, and was getting a number of annoying permission denieds from trying to write into a directory hierarchy that I didn't own. Getting tired of that, I decided to set the permissions on that subtree to 777 while I was working, so I wouldn't have to worry about it.

Someone had recently told me that rather than using plain "su", it was good to use "su -", but the implications had not yet sunk in. (You can probably see where this is going already, but I'll go to the bitter end.)

Anyway, I cd'd to where I wanted to be, the top of my subtree, and did su -. Then I did chmod -R 777. I then started to wonder why it was taking so damn long when there were only about 45 files in 20 directories under where I (thought) I was. Well, needless to say, su - simulates a real login, and had put me into root's home directory, /, so I was proceeding to set file permissions for the whole system to wide open.

I aborted it before it finished, realizing that something was wrong, but this took quite a while to straighten out.

Marc Fraioli

[Oct 05, 2018] One wrong find command can create one week of frantic recovery efforts

This is a classic SNAFU, known and described for more than 30 years. It is still repeated in various forms a thousand times over by different system administrators. You can get the permissions of files installed via RPM back rather quickly and without problems. For all other files you need a backup or an educated guess.
Ahh, the hazards of working with sysadmins who are not ready to be sysadmins in the first place
Oct 05, 2018 | cam.ac.uk

From: [email protected] (Jerry Rocteur)
Organization: InCC.com Perwez Belgium

Horror story,

I sent one of my support guys to do an Oracle update in Madrid.

As instructed he created a new user called esf and changed the files
in /u/appl to owner esf, however in doing so he *must* have cocked up
his find command, the command was:

find /u/appl -user appl -exec chown esf {} \;

He rang me up to tell me there was a problem, I logged in via x25 and
about 75% of files on system belonged to owner esf.

VERY little worked on system.

What a mess, it took me a while and I came up with a brain wave to
fix it but it really screwed up the system.

Moral: be *very* careful of find execs, get the syntax right!!!!
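
As the editorial note above says, ownership and permissions of files that came from packages can usually be pulled back out of the RPM database. A rough sketch for RPM-based systems (the convenience aliases below ship with most rpm versions; anything not owned by a package still needs a backup or an educated guess):

# Reset owners/groups and file modes to what the installed packages recorded.
rpm --setugids -a
rpm --setperms -a

# Or, for a single damaged package:
rpm --setugids coreutils
rpm --setperms coreutils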

[Oct 05, 2018] When some filenames are etched in your brain, you can retype them out of habit, repeating the same blunder again and again, by Anatoly Ivasyuk

Notable quotes:
"... I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops. ..."
Oct 05, 2018 | cam.ac.uk

From Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk
From: [email protected] (Tim Smith)

Organization: University of Washington, Seattle

I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only
one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops.

I called up the person who handled backups, and he restored the password file.

A couple of days later, I did it again! This time, after he restored it, he made a link, /etc/safe_from_tim.

About a week later, I overwrote /etc/passwd, rather than removing it. After he restored it again, he installed a daemon that kept a copy of /etc/passwd, on another file system, and automatically restored it if it appeared to have been damaged.

Fortunately, I finished my work on /etc/lpspl around this time, so we didn't have to see if I could find a way to wipe out a couple of filesystems...

--Tim Smith
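
The watchdog the backup admin installed is easy enough to approximate with a small cron script; a sketch (paths are assumptions, not the original daemon):

#!/bin/bash
# Keep a copy of /etc/passwd elsewhere and restore it if it goes missing or empty.
SAFE=/var/backups/passwd.safe

if [ -s /etc/passwd ]; then
    # passwd looks sane: refresh the safety copy
    cp -p /etc/passwd "$SAFE"
elif [ -s "$SAFE" ]; then
    logger "passwd-guard: /etc/passwd missing or empty, restoring from $SAFE"
    cp -p "$SAFE" /etc/passwd
fi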

[Oct 05, 2018] Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators' servers at once

Oct 05, 2018 | www.reddit.com

ardwin 5 years ago (9 children)

Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once.
cobra10101010 5 years ago (1 child)
Oh God..that is scary in true sense..hope everything was okay
ardwin 5 years ago (0 children)
I quickly learned that the 911 operators, are trained to do their jobs without any kind of computer support. It made me feel better.
reebzor 5 years ago (1 child)
I did this too!

edit: except I was the one that deployed the software that rebooted the machines

vocatus 5 years ago (0 children)
Hey, maybe you should go apologize to ardwin. I bet he was pissed.

[Oct 05, 2018] sudo yum -y remove krb5 (this removes coreutils)

Oct 05, 2018 | www.reddit.com

DrGirlfriend Systems Architect, 5 years ago

2960G, 5 years ago
+1 for the "yum -y". Had the 'pleasure' of fixing a box on which one of my colleagues did "yum -y remove openssl". Through utter magic managed to recover it without reinstalling :-)
chriscowley DevOps, 5 years ago
Do I need to explain? I would probably have curled the RPMs off the repo into cpio and put them into place manually (been there).
vocatus NSA/DOD/USAR/USAP/AEXP [ S ], 5 years ago
That last one gave me the shivers.

[Oct 05, 2018] Trying to preserve the connection after a networking change while working on the core switch remotely backfired, as the sysadmin forgot to cancel the scheduled reload after testing the change

Notable quotes:
"... "All monitoring for customer is showing down except the edge firewalls". ..."
"... as soon as they said it I knew I forgot to cancel the reload. ..."
"... That was a fun day.... What's worse is I was following a change plan, I just missed the "reload cancel". Stupid, stupid, stupid, stupid. ..."
Oct 05, 2018 | www.reddit.com

Making some network changes in a core switch, I used 'reload in 5' as I wasn't 100% certain the changes wouldn't kill my remote connection.

Changes go in, everything stays up, no apparent issues. Save changes, log out.

"All monitoring for customer is showing down except the edge firewalls".

... as soon as they said it I knew I forgot to cancel the reload.

0xD6 5 years ago

This one hit pretty close to home having spent the last month at a small Service Provider with some serious redundancy issues. We're working through them one by one, but there is one outage in particular that was caused by the same situation... Only the scope was pretty "large".

Performed change, was distracted by phone call. Had an SMS notifying me of problems with a legacy border that I had just performed my changes on. See my PuTTY terminal and my blood starts to run cold. "Reload requested by 0xd6".

...Fuck I'm thinking, but everything should be back soon, not much I can do now.

However, not only did our primary transit terminate on this legacy device, our old non-HSRP L3 gateways and BGP nail down routes for one of our /20s and a /24... So, because of my forgotten reload I withdrew the majority of our network from all peers and the internet at large.

That was a fun day.... What's worse is I was following a change plan, I just missed the "reload cancel". Stupid, stupid, stupid, stupid.

[Oct 05, 2018] I learned a valuable lesson about pressing buttons without first fully understanding what they do.

Oct 05, 2018 | www.reddit.com

WorkOfOz (0 children)

This is actually one of my standard interview questions since I believe any sys admin that's worth a crap has made a mistake they'll never forget.

Here's mine, circa 2001. In response to a security audit, I had to track down which version of the Symantec Antivirus was running and what definition was installed on every machine in the company. I had been working through this for awhile and got a bit reckless.

There was a button in the console that read 'Virus Sweep'. Thinking it'd get the info from each machine and give me the details, I pressed it.. I was wrong..

Very Wrong. Instead it proceeded to initiate a virus scan on every machine including all of the servers.

Less than 5 minutes later, many of our older servers and most importantly our file servers froze. In the process, I took down a trade floor for about 45 minutes while we got things back up. I learned a valuable lesson about pressing buttons without first fully understanding what they do.

[Oct 05, 2018] A newbie turned a production server off to replace a monitor

Oct 05, 2018 | www.reddit.com

just_call_in_sick 5 years ago (1 child)

A friend of the family was an IT guy and he gave me the usual high school unpaid intern job. My first day, he told me that a computer needed the monitor replaced. He gave me this 13" CRT and sent me on my way. I found the room (a wiring closet) with a tiny desk and a large desktop tower on it.

TURNED OFF THE COMPUTER and went about replacing the monitor. I think it took about 5 minutes for people to start wondering why they could no longer use the file server or save the files they had been working on all day.

It turns out that you don't have to turn off computers to replace the monitor.

[Oct 05, 2018] Sometimes one extra space makes a big difference

Oct 05, 2018 | cam.ac.uk

From: [email protected] (Richard H. E. Eiger)
Organization: Olivetti (Schweiz) AG, Branch Office Berne

In article <[email protected]> [email protected]
(Tim Smith) writes:
> I was working on a line printer spooler, which lived in /etc. I wanted
> to remove it, and so issued the command "rm /etc/lpspl." There was only
> one problem. Out of habit, I typed "passwd" after "/etc/" and removed
> the password file. Oops.
>
[deleted to save space[
>
> --Tim Smith

Here's another story. Just imagine having the sendmail.cf file in /etc. Now, I was working on the sendmail stuff and had come up with lots of sendmail.cf.xxx files which I wanted to get rid of, so I typed "rm -f sendmail.cf. *". At first I was surprised about how much time it took to remove some 10 files or so. Hitting the interrupt key when I finally saw what had happened was way too late, though.

Fortune has it that I'm a very lazy person. That's why I never bothered to just back up directories with data that changes often. Therefore I managed to restore /etc successfully before rebooting... :-) Happy end, after all. Of course I had lost the only well working version of my sendmail.cf...

Richard

[Oct 05, 2018] Deletion of files whose purpose you do not understand sometimes backfires, by Anatoly Ivasyuk

Oct 05, 2018 | cam.ac.uk

Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk

From: [email protected] (Philip Enteles)
Organization: Haas School of Business, Berkeley

As a new system administrator of a Unix machine with limited space I thought I was doing myself a favor by keeping things neat and clean. One
day as I was 'cleaning up' I removed a file called 'bzero'.

Strange things started to happen, like vi didn't work; then the complaints started coming in. Mail didn't work. The compilers didn't work. About this time the REAL system administrator poked his head in and asked what I had done.

Further examination showed that bzero is the zeroed memory without which the OS had no operating space so anything using temporary memory was non-functional.

The repair? Well, things are tough to do when most of the utilities don't work. Eventually the REAL system administrator took the system to single user and rebuilt the system, including full restores from a tape system. The moral is: don't be too anal about things you don't understand.

Take the time to learn what those strange files are before removing them and screwing yourself.

Philip Enteles

[Oct 05, 2018] Danger of hidden symlinks

Oct 05, 2018 | cam.ac.uk

From: [email protected] (Chris Calabrese)
Organization: AT&T Bell Labs, Murray Hill, NJ, USA

In article <[email protected]> [email protected] writes:
>On a old decstation 3100

I was deleting last semester's users to try to dig up some disk space; I also deleted some test users at the same time.

One user took longer than usual, so I hit control-c and tried ls. "ls: command not found"

Turns out that the test user had / as the home directory and the remove user script in Ultrix just happily blew away the whole disk.


[Oct 05, 2018] Hidden symlinks and recursive deletion of the directories

Notable quotes:
"... Fucking asshole ex-sysadmin taught me a good lesson about checking for symlink bombs. ..."
Oct 05, 2018 | www.reddit.com

mavantix Jack of All Trades, Master of Some; 5 years ago (4 children)

I was cleaning up old temp folders of junk on Windows 2003 server, and C:\temp was full of shit. Most of it junk. Rooted deep in the junk, some asshole admin had apparently symlink'd sysvol to a folder in there. Deleting wiped sysvol.

There were no usable backups - well, there were, but ArcServe was screwed by lack of maintenance.

Spent days rebuilding policies.

Fucking asshole ex-sysadmin taught me a good lesson about checking for symlink bombs.

...and no I didn't tell this story to teach any of your little princesses to do the same when you leave your company.

[Oct 05, 2018] Automatically putting a slash in front of a directory named like a system directory (bin, etc, usr, var) -- names which are all etched in sysadmin memory

This is why you should never type an rm command directly on the command line. Type it in an editor first.
Oct 05, 2018 | www.reddit.com

aultl Senior DevOps Engineer

rm -rf /var

I was trying to delete /var/named/var

nekoeth0 Linux Admin, 5 years ago
Haha, that happened to me too. I had to use a live distro, chroot, copy, what not. It was fun!

[Oct 05, 2018] I corrupted a 400TB data warehouse.

Oct 05, 2018 | www.reddit.com

I corrupted a 400TB data warehouse.

Took 6 days to restore from tape.

mcowger VCDX | DevOps Guy, 5 years ago

Meh - happened a long time ago.

Had a big Solaris box (E6900) running Oracle 10 for the DW. Was going to add some new LUNs to the box and also change some of the fiber pathing to go through a new set of faster switches. Had the MDS changes prebuilt, confirmed in with another admin, through change control, etc.

Did fabric A, which went through fine, and then did fabric B without pausing or checking that the new paths came up on side A before I knocked over side B (in violation of my own approved plan). For the briefest of instants there were no paths to the devices, and Oracle was configured in full async write mode :(. Instant corruption of the tables that were active. Tried to use archivelogs to bring it back, but no dice (and this was before Flashbacks, etc). So we were hosed.

Had to have my DBA babysit the RMAN restore for the entire weekend :(. 1GBe links to backup infrastructure.

RCA resulted in MANY MANY changes to the design of that system, and me just barely keeping my job.

invisibo DevOps, 5 years ago
You just made me say "holy shit!" out loud. You win.
FooHentai, 5 years ago
Ouch.

I dropped a 500Gb RAID set. There were 2 identical servers in the rack right next to each other. Both OpenFiler, both unlabeled. Didn't know about the other one and was told to 'wipe the OpenFiler'. Got a call half an hour later from a team wondering where all their test VMs had gone.

vocatus NSA/DOD/USAR/USAP/AEXP [ S ], 5 years ago
I have to hear the story.

[Oct 02, 2018] Rookie almost wipes customer's entire inventory unbeknownst to sysadmin

Notable quotes:
"... At that moment, everything from / and onward began deleting forcefully and Reginald described his subsequent actions as being akin to "flying flat like a dart in the air, arms stretched out, pointy finger fully extended" towards the power switch on the mini computer. ..."
Oct 02, 2018 | theregister.co.uk

I was going to type rm -rf /*.old* – which would have forcibly removed all /dgux.old stuff, including any sub-directories I may have created with that name," he said.

But – as regular readers will no doubt have guessed – he didn't.

"I fat fingered and typed rm -rf /* – and then I accidentally hit enter instead of the "." key."

At that moment, everything from / and onward began deleting forcefully and Reginald described his subsequent actions as being akin to "flying flat like a dart in the air, arms stretched out, pointy finger fully extended" towards the power switch on the mini computer.

"Everything got quiet."

Reginald tried to boot up the system, but it wouldn't. So instead he booted up off a tape drive to run the mini Unix installer and mounted the boot "/" file system as if he were upgrading – and then checked out the damage.

"Everything down to /dev was deleted, but I was so relieved I hadn't deleted the customer's database and only system files."

Reginald did what all the best accident-prone people do – kept the cock-up to himself, hoped no one would notice and started covering his tracks, by recreating all the system files.

Over the next three hours, he "painstakingly recreated the entire device tree by hand", at which point he could boot the machine properly – "and even the application worked out".

Jubilant at having managed the task, Reginald tried to keep a lid on the heart that was no doubt in his throat by this point and closed off his work, said goodbye to the sysadmin and went home to calm down. Luckily no one was any the wiser.

"If the admins read this message, this would be the first time they hear about it," he said.

"At the time they didn't come in to check what I was doing, and the system was inaccessible to the users due to planned maintenance anyway."

Did you feel the urge to confess to errors no one else at your work knew about? Do you know someone who kept something under their hat for years? Spill the beans to Who, Me? by emailing us here. ®

Re: If rm -rf /* doesn't delete anything valuable

Eh? As I read it, Reginald kicked off the rm -rf /*, then hit the power switch before it deleted too much. The tape rescue revealed that "everything down to /dev" had been deleted, i.e. everything in / beginning with a, b, c and some d. On a modern system that might include /boot and /bin, but evidently it was not a total disaster on Reg's server.


Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> I remember discovering the hard way that when you delete an email account in Thunderbird and it asks if you want to delete all the files associated with it it actually means do you want to delete the entire directory tree below where the account is stored .... so, as I discovered, saying "yes" when the reason you are deleting the account is because you'd just created it in the wrong place in the the directory tree is not a good idea - instead of just deleting the new account I nuked all the data associated with all our family email accounts!

big_D Monday 1st October 2018 10:05 GMT

bpfh
Re: .cobol

"Delete is right above Rename in the bloody menu"

Probably designed by the same person who designed the crontab app then, with the command line options -e to edit and -r to remove immediately without confirmation. Misstype at your peril...

I found this out - to my peril - about 3 seconds before I realised that it was a good idea for a server's crontab to include a daily executed crontab -l > /foo/bar/crontab-backup.txt ...

Jason Bloomberg
Re: .cobol

I went to delete the original files, but I only got as far as "del *.COB" before hitting return.

I managed a similar thing but more deliberately; belatedly finding "DEL FOOBAR.???" included files with no extensions when it didn't on a previous version (Win3.1?).

That wasn't the disaster it could have been but I've had my share of all-nighters making it look like I hadn't accidentally scrubbed a system clean.

Down not across
Re: .cobol

Probably designed by the same person who designed the crontab app then, with the command line options -e to edit and -r to remove immediately without confirmation. Misstype at your peril...

Using crontab -e is asking for trouble even without mistypes. I've seen too many corrupted or truncated crontabs after someone has edited them with crontab -e. crontab -l > crontab.txt; vi crontab.txt; crontab crontab.txt is a much better way.

You mean not everyone has crontab entry that backs up crontab at least daily?
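
A sample crontab entry along those lines (the backup path is an assumption):

# back up this account's crontab once a day at 00:15
15 0 * * * crontab -l > "$HOME/.crontab.backup" 2>/dev/null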

MrBanana
Re: .cobol

"WAH! I copied the .COBOL back to .COB and started over again. As I knew what I wanted to do this time, it only took about a day to re-do what I had deleted."

When this has happened to me, I end up with better code than I had before. Re-doing the work gives you a better perspective. Even if functionally no different it will be cleaner, well commented, and laid out more consistently. I sometimes now do it deliberately (although just saving the first new version, not deleting it) to clean up the code.

big_D
Re: .cobol

I totally agree, the resultant code was better than what I had previously written, because some of the mistakes and assumptions I'd made the first time round and worked around didn't make it into the new code.

Woza
Reminds me of the classic

https://www.ee.ryerson.ca/~elf/hack/recovery.html

Anonymous South African Coward
Re: Reminds me of the classic

https://www.ee.ryerson.ca/~elf/hack/recovery.html

Was about to post the same. It is a legendary classic by now.

Chairman of the Bored
One simple trick...

...depending on your shell and its configuration, a zero-size file in each directory you care about called '-i' will force the rampaging recursive rm, mv, or whatever back into interactive mode. By and large it won't defend you against mistakes in a script, but it's definitely saved me from myself when running an interactive shell.

It's proven useful enough to earn its own cronjob that runs once a week and features a 'find -type d' and touch '-i' combo on systems I like.

Glad the OP's mad dive for the power switch saved him, I wasn't so speedy once. Total bustification. Hence this one simple trick...

Now if I could ever fdisk the right f$cking disk, I'd be set!
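
A sketch of that weekly cron job (the list of trees is an assumption):

#!/bin/bash
# Seed a zero-size file literally named "-i" in every directory of the trees
# you care about; an accidental "rm -rf *" expands it first and turns interactive.
for tree in /home /etc /srv; do      # assumption: choose your own trees
    find "$tree" -type d -exec sh -c 'touch -- "$1/-i"' _ {} \;
done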

PickledAardvark
Re: One simple trick...

"Can't you enter a command to abort the wipe?"

Maybe. But you still have to work out what got deleted.

On the first Unix system I used, an admin configured the rm command with a system alias so that rm required a confirmation. Annoying after a while but handy when learning.

When you are reconfiguring a system, delete/rm is not the only option. Move/mv protects you from your errors. If the OS has no move/mv, then copy, verify before delete.

Doctor Syntax
Re: One simple trick...

"Move/mv protects you from your errors."

Not entirely. I had a similar experience with mv. I was left with a running shell, so I could cd through the remains of the file system and list files with echo *, but not repair it.

Although we had the CDs (SCO), rebooting the system required a specific driver which wasn't included on the CDs and hadn't been provided by the vendor. It took most of a day before they emailed the correct driver to put on a floppy before I could reboot. After that it only took a few minutes to put everything back in place.

Chairman of the Bored
Re: One simple trick...

@Chris Evans,

Yes, there are a number of things you can do. Just like in Windows, a quick Ctrl-C will abort an rm operation taking place in an interactive shell. Destroying the window in which the interactive shell running rm is running will work too (Alt-F4 in most window managers, or 'x' out of the window).

If you know the process id of the rm process you can 'kill $pid' or do a 'killall -KILL rm'

Couple of problems:

(1) law of maximum perversity says that the most important bits will be destroyed first in any accident sequence

(2) by the time you realize the mistake there is no time to kill rm before law 1 is satisfied

The OP's mad dive for the power button is probably the very best move... provided you are right there at the console. And provided the big red switch is actually connected to anything

Colin Bull 1
cp can also be dangerous

After several years working in a DOS environment I got a job as project Manager / Sys admin on a Unix based customer site for a six month stint. On my second day I wanted to use a test system to learn the software more, so decided to copy the live order files to the test system.

Unfortunately I forgot the trailing full stop, as it was not needed in DOS - so the live order index file overwrote the live data file. And the company only took orders for next day delivery, so it wiped all current orders.

Luckily it printed a sales acknowledgement every time an order was placed, so I escaped death and learned to never miss the second parameter of the cp command.

Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> i'd written a script to deploy the latest changes to the live environment. worked great. except one day i'd entered a typo and it was now deploying the same files to the remote directory, over and again.

it did that for 2 whole years with around 7 code releases. not a single person realised the production system was running the same code after each release with no change in functionality. all the customer cared about was 'was the site up?'

not a single person realised. not the developers. not the support staff. not me. not the testers. not the customer. just made you think... wtf had we been doing for 2 years???

Yet Another Anonymous coward
Look on the bright side, any bugs your team had introduced in those 2 years had been blocked by your intrinsically secure script
Prst. V.Jeltz
not a single person realised. not the developers. not the support staff. not me. not the testers. not the customer. just made you think... wtf had we been doing for 2 years???

That is Classic! not surprised about the AC!

Bet some of the beancounters were less than impressed , probly on customer side :)

Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> Re: ...then there's backup stories...

Many years ago (pre internet times) a client phones at 5:30 Friday afternoon. It was the IT guy wanting to run through the steps involved in recovering from a backup. Their US headquarters had a hard disk fail on their accounting system. He was talking the Financial Controller through a recovery and while he knew his stuff he just wanted to double check everything.

8pm the same night the phone rang again - how soon could I fly to the states? Only one of the backup tapes was good. The financial controller had put the sole remaining good backup tape in the drive, then popped out to get a bite to eat at 7pm because it was going to be a late night. At 7:30pm the scheduled backup process copied the corrupted database over the only remaining backup.

Saturday was spent on the phone trying to talk them through everything I could think of.

Sunday afternoon I was sitting in a private jet winging it's way to their US HQ. Three days of very hard work later we'd managed to recreate the accounting database from pieces of corrupted databases and log files. Another private jet ride home - this time the pilot was kind enough to tell me there was a cooler full of beer behind my seat. Olivier2553

div Re: Welcome to the club!

"Lesson learned: NEVER decide to "clean up some old files" at 4:30 on a Friday afternoon. You WILL look for shortcuts and it WILL bite you on the ass."

Do not do anything of significance on a Friday. At all. Any major change, big operation, etc. must be made by Thursday at the latest, so in case of a cock-up you have the Friday (plus the weekend days) to repair it.

JQW
I once wiped a large portion of a hard drive after using find with exec rm -rf {} - due to not taking into account the fact that some directories on the system had spaces in them.
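
The usual way find-plus-delete bites on names with spaces is word-splitting when the output goes through a shell loop or xargs; a null-delimited pipeline, or letting find call rm itself, sidesteps it (a sketch):

# Names with spaces or newlines survive a null-delimited pipeline.
find /some/tree -name '*.tmp' -type f -print0 | xargs -0 rm -f --

# Or let find hand the names straight to rm, batched:
find /some/tree -name '*.tmp' -type f -exec rm -f -- {} +
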
Will Godfrey
Defensive typing

I've long been in the habit of entering dangerous commands partially in reverse, so in the case of the OP's one I'd have done:

' -rf /*.old* '

then gone back to the start of the line and entered the ' rm ' bit.

sisk
A couple months ago on my home computer (which has several Linux distros installed and which all share a common /home because I apparently like to make life difficult for myself - and yes, that's as close to a logical reason as I have for having multiple distros installed on one machine) I was going to get rid of one of the extraneous Linux installs and use the space to expand the root partition for one of the other distros. I realized I'd typed /dev/sdc2 instead of /dev/sdc3 at the same time that I verified that, yes, I wanted to delete the partition. And sdc2 is where the above mentioned shared /home lives. Doh.

Fortunately I have a good file server and a cron job running rsync every night, so I didn't actually lose any data, but I think my heart stopped for a few seconds before I realized that.

Kevin Fairhurst
Came in to work one Monday to find that the Unix system was borked... on investigation it appeared that a large number of files & folders had gone missing, probably by someone doing an incorrect rm.

Our systems were shared with our US office who supported the UK outside of our core hours (we were in from 7am to ensure trading was ready for 8am, they were available to field staff until 10pm UK time) so we suspected it was one of our US counterparts who had done it, but had no way to prove it.

Rather than try and fix anything, they'd gone through and deleted all logs and history entries so we could never find the evidence we needed!

Restoring the system from a recent backup brought everything back online again, as one would expect!

DavidRa
Sure they did, but the universe invented better idiots

Of course. However, the incompletely-experienced often choose to force bypass that configuration. For example, a lot of systems aliased rm to "rm -i" by default, which would force interactive confirmations. People would then say "UGH, I hate having to do this" and add their own customisations to their shells/profiles etc:

unalias rm

alias rm='rm -f'

Lo and behold, now no silly confirmations, regardless of stupidity/typos/etc.

[Jul 30, 2018] Sudo related horror story

Jul 30, 2018 | www.sott.net

A new sysadmin decided to scratch his itch in the sudoers file and in the standard definition of additional sysadmins via the wheel group:

## Allows people in group wheel to run all commands
# %wheel        ALL=(ALL)       ALL
he replaced ALL with localhost
## Allows people in group wheel to run all commands
# %wheel        localhost=(ALL)       ALL
then, without testing, he distributed this file to all servers in the datacenter. Sysadmins who worked after him discovered that the sudo su - command no longer worked and they couldn't get root using their tried and true method ;-)
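
A syntax and sanity check before rollout would have caught this; sudo ships one. A sketch (the file name and the test account someadmin are assumptions):

# Validate the candidate file before it goes anywhere near production.
visudo -c -f /etc/sudoers.new || exit 1

# On one test host, confirm a wheel member can still escalate -- from a
# second session, while a root shell stays open as a lifeline.
sudo -l -U someadmin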

[Apr 22, 2018] Unix/Linux Horror Stories: The good thing about Unix is, when it screws up, it does so very quickly

Notable quotes:
"... And then I realized I had thrashed the server. Completely. ..."
"... There must be a way to fix this , I thought. HP-UX has a package installer like any modern Linux/Unix distribution, that is swinstall . That utility has a repair command, swrepair . ..."
"... you probably don't want that user owning /bin/nologin. ..."
Aug 04, 2011 | unixhorrorstories.blogspot.com

Unix Horror Stories: The good thing about Unix is, when it screws up, it does so very quickly

The project to deploy a new, multi-million-dollar commercial system on two big, brand-new HP-UX servers at a brewing company that shall not be named, had been running on time and within budgets for several months. Just a few steps remained, among them, the migration of users from the old servers to the new ones.

The task was going to be simple: just copy the home directories of each user from the old server to the new ones, and a simple script to change the owner so as to make sure that each home directory was owned by the correct user. The script went something like this:

#!/bin/bash

cat /etc/passwd|while read line
      do
         USER=$(echo $line|cut -d: -f1)
         HOME=$(echo $line|cut -d: -f6)
         chown -R $USER $HOME
      done

[NOTE: the script does not filter out system IDs from user IDs and that's a grave mistake. Also, it was run before it was tested ;-) -- NNB]

As you see, this script is pretty simple: obtain the user and the home directory from the password file, and then execute the chown command recursively on the home directory. I copied the files, executed the script, and thought, great, just 10 minutes and all is done.

That's when the calls started.

It turns out that while I was executing those seemingly harmless commands, the server was under acceptance test. You see, we were just one week away from going live and the final touches were everything that was required. So the users in the brewing company started testing if everything they needed was working like in the old servers. And suddenly, the users noticed that their system was malfunctioning and started making furious phone calls to my boss and then my boss started to call me.

And then I realized I had thrashed the server. Completely. My console was still open and I could see that the processes started failing, one by one, reporting very strange messages to the console, that didn't look any good. I started to panic. My workmate Ayelen and I (who just copied my script and executed it in the mirror server) realized only too late that the home directory of the root user was / -the root filesystem- so we changed the owner of every single file in the filesystem to root!!! That's what I love about Unix: when it screws up, it does so very quickly, and thoroughly.

There must be a way to fix this , I thought. HP-UX has a package installer like any modern Linux/Unix distribution, that is swinstall . That utility has a repair command, swrepair . So the following command put the system files back in order, needing a few permission changes on the application directories that weren't installed with the package manager:

swrepair -F

But the story doesn't end here. The next week, we were going live, and I knew that the migration of the users would be for real this time, not just a test. My boss and I were going to the brewing company, and he receives a phone call. Then he turns to me and asks me, "What was the command that you used last week?". I told him and I noticed that he was dictating it very carefully. When we arrived, we saw why: before the final deployment, a Unix administrator from the company did the same mistake I did, but this time, people from the whole country were connecting to the system, and he received phone calls from a lot of angry users. Luckily, the mistake could be fixed, and we all, young and old, went back to reading the HP-UX manual. Those things can come handy sometimes!

Moral of this story: before doing something to users' directories, take the time to check the user IDs of actual users - they usually start at 500, but it's configuration-dependent - because system users' IDs are lower than that.

Send in your Unix horror story, and it will be featured here in the blog!

Greetings,
Agustin

Colin McD, 16 March 2017, 15:02

This script is so dangerous. You are giving home directories to say the apache user and you probably don't want that user owning /bin/nologin.
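
For comparison, a safer variant of that migration script, filtering out system accounts by UID and not clobbering the shell's HOME variable (the UID cutoff is an assumption; check login.defs or the local convention):

#!/bin/bash
# Re-own only real users' home directories; skip system accounts and anything
# whose home is "/" or empty.
UID_MIN=500   # assumption: old convention; modern Linux often uses 1000

while IFS=: read -r user _ uid _ _ home _; do
    [ "$uid" -ge "$UID_MIN" ] 2>/dev/null || continue
    case "$home" in
        ""|"/") continue ;;
    esac
    chown -R -- "$user" "$home"
done < /etc/passwd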

[Apr 22, 2018] Unix Horror story script question Unix Linux Forums Shell Programming and Scripting

Apr 22, 2018 | www.unix.com

scottsiddharth Registered User

Unix Horror story script question


This text and script is borrowed from the "Unix Horror Stories" document.

It states as follows:

"Management told us to email a security notice to every user on our system (at that time, around 3000 users). A certain novice administrator on our system wanted to do it, so I instructed them to extract a list of users from /etc/passwd, write a simple shell loop to do the job, and throw it in the background.
Here's what they wrote (bourne shell)...

for USER in `cat user.list`;
do mail $USER <message.text &
done

Have you ever seen a load average of over 300???" END

My question is this- What is wrong with the script above? Why did it find a place in the Horror stories? It worked well when I tried it.

Maybe he intended to throw the whole script in the background and not just the mail part. But even so it works just as well... So?

Thunderbolt

RE:Unix Horror story script question
I think it well deserves its place in the horror stories.

Whether or not the server has an SMTP service role, this script tries to run 3000 mail commands in parallel to send the text to its 3000 respective recipients.

Have you ever tried it with 3000 valid e-mail IDs? You can feel the heat of the CPU (sar 1 100).

P.S.: I did not test it, but theoretically this is what would happen.

Best Regards.

Thunderbolt, 11-24-2008 - Original Discussion by scottsiddharth

Quote:

Originally Posted by scottsiddharth

Thank you for the reply. But isn't that exactly what the real admin asked the novice admin to do?

Is there a better script or solution ?

Well, let me try to make it sequential to reduce the CPU load, but it will take (number of users) * SLP_INT (default = 1) seconds to execute...

# Interval between consecutive mail command executions in seconds, minimum 1 second.

      SLP_INT=1
      for USER in `cat user.list`
      do
         mail $USER < message.text
         [ -z "${SLP_INT}" ] && sleep 1 || sleep "${SLP_INT}"
      done

[Apr 22, 2018] THE classic Unix horror story programming

Looks like not much has changed since 1986. I am amazed how little Unix has changed over the years. rm remains a danger, although zsh and the -I option of GNU rm are improvements. I think every sysadmin has wiped out important data with rm at least once in his career, so more work on this problem is needed.
Notable quotes:
"... Because we are creatures of habit. If you ALWAYS have to type 'yes' for every single deletion, it will become habitual, and you will start doing it without conscious thought. ..."
"... Amazing what kind of damage you can recover from given enough motivation. ..."
"... " in "rm -rf ~/ ..."
Apr 22, 2008 | www.reddit.com

probablycorey 10 years ago (35 children)

A little trick I use to ensure I never delete the root or home dir... Put a file called -i in / and ~

If you ever call rm -rf *, -i (the request confirmation option) will be the first path expanded. So your command becomes...

rm -rf -i

Catastrophe Averted!
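If you want to set the trick up, the decoy files can be created like this (creating the one in / needs root):

   touch /-i ~/-i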

mshade 10 years ago (0 children)
That's a pretty good trick! Unfortunately it doesn't work if you specify the path of course, but will keep you from doing it with a PWD of ~ or /.

Thanks!

aythun 10 years ago (2 children)
Or just use zsh. It's awesome in every possible way.
brian@xenon:~/tmp/test% rm -rf *
zsh: sure you want to delete all the files in /home/brian/tmp/test [yn]?
rex5249 10 years ago (1 child)
I keep a daily clone of my laptop and I usually do some backups in the middle of the day, so if I lose a disk it isn't a big deal other than the time wasted copying files.
MyrddinE 10 years ago (1 child)
Because we are creatures of habit. If you ALWAYS have to type 'yes' for every single deletion, it will become habitual, and you will start doing it without conscious thought.

Warnings must only pop up when there is actual danger, or you will become acclimated to, and cease to see, the warning.

This is exactly the problem with Windows Vista, and why so many people harbor such ill-will towards its 'security' system.

zakk 10 years ago (3 children)
and if I want to delete that file?!? ;-)
alanpost 10 years ago (0 children)
I use the same trick, so either of:

$ rm -- -i

or

$ rm ./-i

will work.

emag 10 years ago (0 children)
rm /-i ~/-i
nasorenga 10 years ago * (2 children)
The part that made me the most nostalgic was his email address: mcvax!ukc!man.cs.ux!miw

Gee whiz, those were the days... (Edit: typo)

floweryleatherboy 10 years ago (6 children)
One of my engineering managers wiped out an important server with rm -rf. Later it turned out he had a giant stock of kiddy porn on company servers.
monstermunch 10 years ago (16 children)
Whenever I use rm -rf, I always make sure to type the full path name in (never just use *) and put the -rf at the end, not after the rm. This means you don't have to worry about hitting "enter" in the middle of typing the path name (it won't delete the directory because the -rf is at the end) and you don't have to worry as much about data deletion from accidentally copy/pasting the command somewhere with middle click or if you redo the command while looking in your bash history.

Hmm, couldn't you alias "rm -rf" to mv the directory/files to a temp directory to be on the safe side?

branston 10 years ago (8 children)
Aliasing 'rm' is fairly common practice in some circles. It can have its own set of problems however (filling up partitions, running out of inodes...)
amnezia 10 years ago (5 children)
you could alias it with a script that prevents rm -rf * being run in certain directories.
jemminger 10 years ago (4 children)
you could also alias it to 'ls' :)
derefr 10 years ago * (1 child)
One could write a daemon that lets the oldest files in that directory be "garbage collected" when those conditions are approaching. I think this is, in a roundabout way, how Windows' shadow copy works.
branston 10 years ago (0 children)
Could do. Think we might be walking into the over-complexity trap however. The only time I've ever had an rm related disaster was when accidentally repeating an rm that was already in my command buffer. I looked at trying to exclude patterns from the command history but csh doesn't seem to support doing that so I gave up.

A decent solution just occurred to me for when the underlying file system supports snapshots (UFS2 for example). Just snap the fs on which the to-be-deleted items live prior to the delete. That needs barely any IO to do, and you can set the snapshots to expire after 10 minutes.

Hmm... Might look at implementing that..
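A sketch of the same idea using ZFS in place of the UFS2 snapshots mentioned above (the dataset and path names are hypothetical):

   zfs snapshot tank/home@pre-delete          # cheap snapshot just before the risky delete
   rm -rf /tank/home/scratch/old-build        # the intended delete
   zfs rollback tank/home@pre-delete          # if the wrong thing went away, roll back
   zfs destroy tank/home@pre-delete           # otherwise discard the snapshot later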

mbm 10 years ago (0 children)
Most of the original UNIX tools took the arguments in strict order, requiring that the options came first; you can even see this on some modern *BSD systems.
shadowsurge 10 years ago (1 child)
I just always format the command with ls first just to make sure everything is in working order. Then my neurosis kicks in and I do it again... and a couple more times just to make sure nothing bad happens.
Jonathan_the_Nerd 10 years ago (0 children)
If you're unsure about your wildcards, you can use echo to see exactly how the shell will expand your arguments.
splidge 10 years ago (0 children)
A better trick IMO is to use ls on the directory first.. then when you are sure that's what you meant type rm -rf !$ to delete it.
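For example (the directory name is made up):

   ls /var/tmp/old-builds     # inspect first
   rm -rf !$                  # !$ expands to the last argument of the previous command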
earthboundkid 10 years ago * (0 children)
Ever since I got burned by letting my pinky slip on the enter key years ago, I've been typing echo path first, then going back and adding the rm after the fact.
zerokey 10 years ago * (2 children)
Great story. Halfway through reading, I had a major wtf moment. I wasn't surprised by the use of a VAX, as my old department just retired their last VAX a year ago. The whole time, I'm thinking, "hello..mount the tape hardware on another system and, worst case scenario, boot from a live cd!"

Then I got to, "The next idea was to write a program to make a device descriptor for the tape deck" and looked back at the title and realized that it was from 1986 and realized, "oh..oh yeah...that's pretty fucked."

iluvatar 10 years ago (0 children)

Great story

Yeah, but really, he had way too much of a working system to qualify for true geek godhood. That title belongs to Al Viro . Even though I've read it several times, I'm still in awe every time I see that story...

cdesignproponentsist 10 years ago (0 children)
FreeBSD has backup statically-linked copies of essential system recovery tools in /rescue, just in case you toast /bin, /sbin, /lib, ld-elf.so.1, etc.

It won't protect against a rm -rf / though (and is not intended to), although you could chflags -R schg /rescue to make them immune to rm -rf.

clytle374 10 years ago * (9 children)
It happens, I tried a few months back to rm -rf bin to delete a directory and did a rm -rf /bin instead.

First thought: That took a long time.

Second thought: What do you mean ls not found.

I was amazed that the desktop survived for nearly an hour before crashing.

earthboundkid 10 years ago (8 children)
This really is a situation where GUIs are better than CLIs. There's nothing like the visual confirmation of seeing what you're obliterating to set your heart into the pit of your stomach.
jib 10 years ago (0 children)
If you're using a GUI, you probably already have that. If you're using a command line, use mv instead of rm.

In general, if you want the computer to do something, tell it what you want it to do, rather than telling it to do something you don't want and then complaining when it does what you say.

earthboundkid 10 years ago (3 children)
Yes, but trash cans aren't manly enough for vi and emacs users to take seriously. If it made sense and kept you from shooting yourself in the foot, it wouldn't be in the Unix tradition.
earthboundkid 10 years ago (1 child)
  1. Are you so low on disk space that it's important for your trash can to be empty at all times?
  2. Why should we humans have to adapt our directory names to route around the busted-ass-ness of our tools? The tools should be made to work with capital letters and spaces. Or better, use a GUI for deleting so that you don't have to worry about OMG, I forgot to put a slash in front of my space!

Seriously, I use the command line multiple times every day, but there are some tasks for which it is just not well suited compared to a GUI, and (bizarrely considering it's one thing the CLI is most used for) one of them is moving around and deleting files.

easytiger 10 years ago (0 children)
That's a very simple bash/ksh/python/etc script (a rough sketch follows the list).
  1. script a move op to a hidden dir on the /current/ partition.
  2. alias this to rm
  3. wrap rm as an alias to delete the contents of the hidden folder with confirmation
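A rough sketch of that recipe in bash (everything here is illustrative, not a hardened solution):

   TRASH="$HOME/.trash"                  # 1. hidden dir on the same partition as $HOME
   mkdir -p "$TRASH"

   trash() { mv -- "$@" "$TRASH"/; }     # move instead of delete
   alias rm=trash                        # 2. make "rm" do the move

   empty_trash() {                       # 3. the real delete, with confirmation
       ls -A "$TRASH"
       read -r -p "Really empty $TRASH? [y/N] " ans
       [ "$ans" = y ] && command rm -rf -- "${TRASH:?}"/* "${TRASH:?}"/.[!.]*
   }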
mattucf 10 years ago (3 children)
I'd like to think that most systems these days don't have / set as root's home directory, but I've seen a few that do. :/
dsfox 10 years ago (0 children)
This is a good approach in 1986. Today I would just pop in a bootable CDROM.
fjhqjv 10 years ago * (5 children)
That's why I always keep stringent file permissions and never act as the root user.

I'd have to try to rm -rf, get a permission denied error, then retype sudo rm -rf and then type in my password to ever have a mistake like that happen.

But I'm not a systems administrator, so maybe it's not the same thing.

toast_and_oj 10 years ago (2 children)
I aliased "rm -rf" to "omnomnom" and got myself into the habit of using that. I can barely type "omnomnom" when I really want to, let alone when I'm not really paying attention. It's saved one of my projects once already.
shen 10 years ago (0 children)
I've aliased "rm -rf" to "rmrf". Maybe I'm just a sucker for punishment.

I haven't been bit by it yet, the defining word being yet.

robreim 10 years ago (0 children)
I would have thought tab completion would have made omnomnom potentially easier to type than rm -rf (since the -rf part needs to be typed explicitly)
immure 10 years ago (0 children)
It's not.
lespea 10 years ago (0 children)
before I ever do something like that I make sure I don't have permissions so I get an error, then I press up, home, and type sudo <space> <enter> and it works as expected :)
kirun 10 years ago (0 children)
And I was pleased the other day how easy it was to fix the system after I accidentally removed kdm, konqueror and kdesktop... but these guys are hardcore.
austin_k 10 years ago (0 children)
I actually started to feel sick reading that. I've been in an IT disaster before where we almost lost a huge database. Ugh.. I still have nightmares.
umilmi81 10 years ago (4 children)
Task number 1 with a UNIX system. Alias rm to rm -i. Call the explicit path when you want to avoid the -i (ie: /bin/rm -f). Nobody is too cool to skip this basic protection.
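In shell terms (the alias belongs in your shell startup file; the log file name is made up):

   alias rm='rm -i'          # prompt before every deletion by default
   /bin/rm -f old.log        # calling the binary by its full path bypasses the alias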
flinchn 10 years ago (0 children)
i did an application install at an LE agency last fall - stupid me: I typed mv /etc /etcbk instead of mv ./etc ./etcbk

ahh that damned period

DrunkenAsshole 10 years ago (0 children)
Were the "*"s really needed for a story that has plagued, at one point or another, all OS users?
xelfer 10 years ago (0 children)
Is the home directory for root / for some unix systems? i thought 'cd' then 'rm -rf *' would have deleted whatever's in his home directory (or whatever $HOME points to)
srparish 10 years ago (0 children)
Couldn't he just have used the editor to create the etc files he wanted, and used cpio as root to copy that over as an /etc?

sRp

stox 10 years ago (1 child)
Been there, done that. Have the soiled underwear to prove it. Amazing what kind of damage you can recover from given enough motivation.
sheepskin 10 years ago * (0 children)
I had a customer do this; he killed it at about the same point. I told him he was screwed and that I'd charge him a bunch of money to take down his server, rebuild it from a working one, and put it back up. But the customer happened to have a root ftp session up, and was able to upload what he needed to bring the system back. By the time he was done I rebooted it to make sure it was cool, and it booted all the way back up.

Of course I've also had a lot of customers that have done it, and they were screwed, and I got to charge them a bunch of money.

jemminger 10 years ago (0 children)
pfft. that's why lusers don't get root access.
supersan 10 years ago (2 children)
i had the same thing happen to me once.. my c:\ drive was running ntfs and i accidentally deleted the "ntldr" system file in the c:\ root (because the name didn't mean much to me).. then later, i couldn't even boot in safe mode! and my bootable disk didn't recognize the c:\ drive because it was ntfs!! so sadly, i had to reinstall everything :( wasted a whole day over it..
b100dian 10 years ago (0 children)
Yes, but that's a single file. I suppose anyone can write hex into mbr to copy ntldr from a samba share!
bobcat 10 years ago (0 children)
http://en.wikipedia.org/wiki/Emergency_Repair_Disk
boredzo 10 years ago (0 children)
Neither one is the original source. The original source is Usenet, and I can't find it with Google Groups. So either of these webpages is as good as the other.
docgnome 10 years ago (0 children)
In 1986? On a VAX?
MarlonBain 10 years ago (0 children)

This classic article from Mario Wolczko first appeared on Usenet in 1986 .

amoore 10 years ago (0 children)
I got sidetracked trying to figure out why the fictional antagonist would type the extra "/ " in "rm -rf ~/ ".
Zombine 10 years ago (2 children)

...it's amazing how much of the system you can delete without it falling apart completely. Apart from the fact that nobody could login (/bin/login?), and most of the useful commands had gone, everything else seemed normal.

Yeah. So apart from the fact that no one could get any work done or really do anything, things were working great!

I think a more rational reaction would be "Why on Earth is this big, important system on which many people rely designed in such a way that a simple easy-to-make human error can screw it up so comprehensively?" or perhaps "Why on Earth don't we have a proper backup system?"

daniels220 10 years ago (1 child)
The problem wasn't the backup system, it was the restore system, which relied on the machine having a "copy" command. Perfectly reasonable assumption that happened not to be true.
Zombine 10 years ago * (0 children)
Neither backup nor restoration serves any purpose in isolation. Most people would group those operations together under the heading "backup;" certainly you win only a semantic victory by doing otherwise. Their fail-safe data-protection system, call it what you will, turned out not to work, and had to be re-engineered on-the-fly.

I generally figure that the assumptions I make that turn out to be entirely wrong were not "perfectly reasonable" assumptions in the first place. Call me a traditionalist.

[Apr 22, 2018] rm and Its Dangers (Unix Power Tools, 3rd Edition)

Apr 22, 2018 | docstore.mik.ua
14.3. rm and Its Dangers

Under Unix, you use the rm command to delete files. The command is simple enough; you just type rm followed by a list of files. If anything, rm is too simple. It's easy to delete more than you want, and once something is gone, it's permanently gone. There are a few hacks that make rm somewhat safer, and we'll get to those momentarily. But first, here's a quick look at some of the dangers.

To understand why it's impossible to reclaim deleted files, you need to know a bit about how the Unix filesystem works. The system contains a "free list," which is a list of disk blocks that aren't used. When you delete a file, its directory entry (which gives it its name) is removed. If there are no more links ( Section 10.3 ) to the file (i.e., if the file only had one name), its inode ( Section 14.2 ) is added to the list of free inodes, and its datablocks are added to the free list.

Well, why can't you get the file back from the free list? After all, there are DOS utilities that can reclaim deleted files by doing something similar. Remember, though, Unix is a multitasking operating system. Even if you think your system is a single-user system, there are a lot of things going on "behind your back": daemons are writing to log files, handling network connections, processing electronic mail, and so on. You could theoretically reclaim a file if you could "freeze" the filesystem the instant your file was deleted -- but that's not possible. With Unix, everything is always active. By the time you realize you made a mistake, your file's data blocks may well have been reused for something else.

When you're deleting files, it's important to use wildcards carefully. Simple typing errors can have disastrous consequences. Let's say you want to delete all your object ( .o ) files. You want to type:

% rm *.o

But because of a nervous twitch, you add an extra space and type:

% rm * .o

It looks right, and you might not even notice the error. But before you know it, all the files in the current directory will be gone, irretrievably.

If you don't think this can happen to you, here's something that actually did happen to me. At one point, when I was a relatively new Unix user, I was working on my company's business plan. The executives thought, so as to be "secure," that they'd set a business plan's permissions so you had to be root ( Section 1.18 ) to modify it. (A mistake in its own right, but that's another story.) I was using a terminal I wasn't familiar with and accidentally created a bunch of files with four control characters at the beginning of their name. To get rid of these, I typed (as root ):

# rm ????*

This command took a long time to execute. When about two-thirds of the directory was gone, I realized (with horror) what was happening: I was deleting all files with four or more characters in the filename.

The story got worse. They hadn't made a backup in about five months. (By the way, this article should give you plenty of reasons for making regular backups ( Section 38.3 ).) By the time I had restored the files I had deleted (a several-hour process in itself; this was on an ancient version of Unix with a horrible backup utility) and checked (by hand) all the files against our printed copy of the business plan, I had resolved to be very careful with my rm commands.

[Some shells have safeguards that work against Mike's first disastrous example -- but not the second one. Automatic safeguards like these can become a crutch, though . . . when you use another shell temporarily and don't have them, or when you type an expression like Mike's very destructive second example. I agree with his simple advice: check your rm commands carefully! -- JP ]

-- ML

[Apr 22, 2018] How to prevent a mistaken rm -rf for specific folders?

Notable quotes:
"... There's nothing more on a traditional Linux, but you can set Apparmor/SELinux/ rules that prevent rm from accessing certain directories. ..."
"... Probably your best bet with it would be to alias rm -ri into something memorable like kill_it_with_fire . This way whenever you feel like removing something, go ahead and kill it with fire. ..."
Jan 20, 2013 | unix.stackexchange.com


amyassin, Jan 20, 2013 at 17:26

I think pretty much everyone here has mistakenly 'rm -rf'ed the wrong directory, and hopefully it did not cause huge damage.. Is there any way to prevent users from doing a similar unix horror story ?? Someone mentioned (in the comments section of the previous link ) that

... I am pretty sure now every unix course or company using unix sets rm -fr to disable accounts of people trying to run it or stop them from running it ...

Is there any implementation of that in any current Unix or Linux distro? And what is the common practice to prevent that error even from a sysadmin (with root access)?

It seems that there has been some protection for the root directory ( / ) in Solaris (since 2005) and GNU (since 2006). Is there any way to implement the same protection for some other folders as well??

To give it more clarity, I was not asking about general advice on rm usage (and I've updated the title to indicate that); I want something more like the root folder protection: in order to rm -rf / you have to pass a specific parameter: rm -rf --no-preserve-root / .. Are there similar implementations for a customized set of directories? Or can I specify files in addition to / to be protected by the preserve-root option?

mattdm, Jan 20, 2013 at 17:33

1) Change management 2) Backups. – mattdm Jan 20 '13 at 17:33

Keith, Jan 20, 2013 at 17:40

probably the only way would be to replace the rm command with one that doesn't have that feature. – Keith Jan 20 '13 at 17:40

sr_, Jan 20, 2013 at 18:28

safe-rm maybe – sr_ Jan 20 '13 at 18:28

Bananguin, Jan 20, 2013 at 21:07

most distros do `alias rm='rm -i'`, which makes rm ask you if you are sure.

Besides that: know what you are doing. Only become root if necessary. For any user with root privileges, security of any kind must be implemented in and by the user. Hire somebody if you can't do it yourself. Over time any countermeasure becomes equivalent to the alias line above if you can't wrap your own head around the problem. – Bananguin Jan 20 '13 at 21:07

midnightsteel, Jan 22, 2013 at 14:21

@amyassin using rm -rf can be a resume generating event. Check and triple check before executing it – midnightsteel Jan 22 '13 at 14:21

Gilles, Jan 22, 2013 at 0:18

To avoid a mistaken rm -rf, do not type rm -rf .

If you need to delete a directory tree, I recommend the following workflow:

Never call rm -rf with an argument other than DELETE . Doing the deletion in several stages gives you an opportunity to verify that you aren't deleting the wrong thing, either because of a typo (as in rm -rf /foo /bar instead of rm -rf /foo/bar ) or because of a braino (oops, no, I meant to delete foo.old and keep foo.new ).
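A minimal sketch of that staged workflow (the directory name is hypothetical):

   mv /srv/data/old-exports DELETE    # first move the tree aside under a scratch name
   ls DELETE                          # verify that this really is what you meant to remove
   rm -rf DELETE                      # only ever run rm -rf against DELETE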

If your problem is that you can't trust others not to type rm -rf, consider removing their admin privileges. There's a lot more that can go wrong than rm .


Always make backups .

Periodically verify that your backups are working and up-to-date.

Keep everything that can't be easily downloaded from somewhere under version control.


With a basic unix system, if you really want to make some directories undeletable by rm, replace (or better shadow) rm by a custom script that rejects certain arguments. Or by hg rm .
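A toy sketch of such a shadowing wrapper, in the spirit of safe-rm (the protected list is illustrative); install it earlier in PATH than /bin/rm:

   #!/bin/bash
   # refuse to act on a short list of critical paths, otherwise hand off to the real rm
   PROTECTED="/ /etc /bin /usr /var /home"

   for arg in "$@"; do
       for p in $PROTECTED; do
           if [ "$arg" = "$p" ] || [ "$arg" = "$p/" ]; then
               echo "rm: refusing to remove protected path '$arg'" >&2
               exit 1
           fi
       done
   done
   exec /bin/rm "$@"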

Some unix variants offer more possibilities.

amyassin, Jan 22, 2013 at 9:41

Yeah, backing up is the most amazing solution, but I was thinking of something like the --no-preserve-root option for other important folders.. And that apparently does not exist even as a practice... – amyassin Jan 22 '13 at 9:41

Gilles, Jan 22, 2013 at 20:32

@amyassin I'm afraid there's nothing more (at least not on Linux). rm -rf already means "delete this, yes I'm sure I know what I'm doing". If you want more, replace rm by a script that refuses to delete certain directories. – Gilles Jan 22 '13 at 20:32

Gilles, Jan 22, 2013 at 22:17

@amyassin Actually, I take this back. There's nothing more on a traditional Linux, but you can set Apparmor/SELinux/ rules that prevent rm from accessing certain directories. Also, since your question isn't only about Linux, I should have mentioned OSX, which has something a bit like what you want. – Gilles Jan 22 '13 at 22:17

qbi, Jan 22, 2013 at 21:29

If you are using rm * and zsh, you can set the option rmstarwait :
setopt rmstarwait

Now the shell warns when you're using the * :

> zsh -f
> setopt rmstarwait
> touch a b c
> rm *
zsh: sure you want to delete all the files in /home/unixuser [yn]? _

When you reject it ( n ), nothing happens. Otherwise all files will be deleted.

Drake Clarris, Jan 22, 2013 at 14:11

EDIT as suggested by comment:

You can change the attribute of the file or directory to immutable, and then it cannot be deleted even by root until the attribute is removed.

chattr +i /some/important/file

This also means that the file cannot be written to or changed in any way, even by root . Another attribute apparently available that I haven't used myself is the append attribute ( chattr +a /some/important/file ). Then the file can only be opened in append mode, meaning no deletion as well, but you can add to it (say a log file). This means you won't be able to edit it in vim for example, but you can do echo 'this adds a line' >> /some/important/file . Using > instead of >> will fail.

These attributes can be unset using a minus sign, i.e. chattr -i file
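A short illustration of the append-only behaviour described above (the file name is made up; run as root):

   chattr +a /var/local/audit.log
   echo 'new entry' >> /var/local/audit.log    # succeeds: appending is allowed
   echo 'overwrite' > /var/local/audit.log     # fails: Operation not permitted
   rm /var/local/audit.log                     # also fails while +a is set
   chattr -a /var/local/audit.log              # lift the restriction again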

Otherwise, if this is not suitable, one thing I practice is to always ls /some/dir first, and then instead of retyping the command, press up arrow CTL-A, then delete the ls and type in my rm -rf if I need it. Not perfect, but by looking at the results of ls, you know before hand if it is what you wanted.

NlightNFotis, Jan 22, 2013 at 8:27

One possible choice is to stop using rm -rf and start using rm -ri . The extra i parameter there is to make sure that it asks if you are sure you want to delete the file.

Probably your best bet with it would be to alias rm -ri into something memorable like kill_it_with_fire . This way whenever you feel like removing something, go ahead and kill it with fire.

amyassin, Jan 22, 2013 at 14:24

I like the name, but isn't f the exact opposite of the i option?? I tried it and it worked though... – amyassin Jan 22 '13 at 14:24

NlightNFotis, Jan 22, 2013 at 16:09

@amyassin Yes it is. For some strange kind of fashion, I thought I only had r in there. Just fixed it. – NlightNFotis Jan 22 '13 at 16:09

Silverrocker, Jan 22, 2013 at 14:46

To protect against an accidental rm -rf * in a directory, create a file called "-i" (you can do this with emacs or some other program) in that directory. The shell will try to interpret -i and will cause it to go into interactive mode.

For example: You have a directory called rmtest with the file named -i inside. If you try to rm everything inside the directory, rm will first get -i passed to it and will go into interactive mode. If you put such a file inside the directories you would like to have some protection on, it might help.

Note that this is ineffective against rm -rf rmtest .

ValeriRangelov, Dec 21, 2014 at 3:03

If you understand the C programming language, I think it is possible to rewrite the rm source code or make a little patch for the kernel. I saw this on one server: it was impossible to delete some important directories, and when you typed 'rm -rf /directory' it sent an email to the sysadmin.

[Apr 21, 2018] Any alias of rm is a very stupid idea

Option -I is more modern and more useful than the old option -i, and it is highly recommended. It also makes sense to use an alias with it, contrary to what this author states (he probably does not realize that aliases do not work for non-interactive sessions).
The point the author makes is that when you automatically expect rm to be aliased to rm -i, you get into trouble on machines where this is not the case. And that's completely true.
But the alias does not really solve the problem because, as respondents stated, answering the prompt soon becomes automatic. Writing your own wrapper is a better deal. One such wrapper -- safe-rm -- already exists and, while not perfect, is useful.
Notable quotes:
"... A co-worker had such an alias. Imagine the disaster when, visiting a customer site, he did "rm *" in the customer's work directory and all he got was the prompt for the next command after rm had done what it was told to do. ..."
"... If you want a safety net, do "alias del='rm -I --preserve-root'", ..."
Feb 14, 2017 | www.cyberciti.biz
Art Protin June 12, 2012, 9:53 pm

Any alias of rm is a very stupid idea (except maybe alias rm=echo fool).

A co-worker had such an alias. Imagine the disaster when, visiting a customer site, he did "rm *" in the customer's work directory and all he got was the prompt for the next command after rm had done what it was told to do.

If you want a safety net, do "alias del='rm -I --preserve-root'",

Drew Hammond March 26, 2014, 7:41 pm
^ This x10000.

I've made the same mistake before and it's horrible.

[Mar 28, 2018] Sysadmin wiped two servers, left the country to escape the shame by Simon Sharwood

Mar 26, 2018 | theregister.co.uk
"This revolutionary product allowed you to basically 'mirror' two file servers," Graham told The Register . "It was clever stuff back then with a high speed 100mb FDDI link doing the mirroring and the 10Mb LAN doing business as usual."

Graham was called upon to install said software at a British insurance company, which involved a 300km trip on Britain's famously brilliant motorways with a pair of servers in the back of a company car.

Maybe that drive was why Graham made a mistake after the first part of the job: getting the servers set up and talking.

"Sadly the software didn't make identifying the location of each disk easy," Graham told us. "And – ummm - I mirrored it the wrong way."

"The net result was two empty but beautifully-mirrored servers."

Oops.

Graham tried to find someone to blame, but as he was the only one on the job that wouldn't work.

His next instinct was to run, but as the site had a stack of Quarter Inch Cartridge backup tapes, he quickly learned that "incremental back-ups are the work of the devil."

Happily, all was well in the end.

[Dec 07, 2017] First Rule of Usability Don't Listen to Users

Notable quotes:
"... So, do users know what they want? No, no, and no. Three times no. ..."
Dec 07, 2017 | www.nngroup.com

But ultimately, the way to get user data boils down to the basic rules of usability

... ... ...

So, do users know what they want? No, no, and no. Three times no.

Finally, you must consider how and when to solicit feedback. Although it might be tempting to simply post a survey online, you're unlikely to get reliable input (if you get any at all). Users who see the survey and fill it out before they've used the site will offer irrelevant answers. Users who see the survey after they've used the site will most likely leave without answering the questions. One question that does work well in a website survey is "Why are you visiting our site today?" This question goes to users' motivation and they can answer it as soon as they arrive.

[Dec 07, 2017] The rogue DHCP server

Notable quotes:
"... from Don Watkins ..."
Dec 07, 2017 | opensource.com

from Don Watkins

I am a liberal arts person who wound up being a technology director. With the exception of 15 credit hours earned on my way to a Cisco Certified Network Associate credential, all of the rest of my learning came on the job. I believe that learning what not to do from real experiences is often the best teacher. However, those experiences can frequently come at the expense of emotional pain. Prior to my Cisco experience, I had very little experience with TCP/IP networking and the kinds of havoc I could create, albeit innocently, due to my lack of understanding of the nuances of routing and DHCP.

At the time our school network was an Active Directory domain with DHCP and DNS provided by a Windows 2000 server. All of our staff access to email, the Internet, and network shares was provided this way. I had been researching the use of the K12 Linux Terminal Server ( K12LTSP ) project and had built a Fedora Core box with a single network card in it. I wanted to see how well my new project worked, so without talking to my network support specialists I connected it to our main LAN segment. In a very short period of time our help desk phones were ringing with principals, teachers, and other staff who could no longer access their email, printers, shared directories, and more. I had no idea that the Windows clients would see another DHCP server on our network -- my test computer -- and pick up an IP address and DNS information from it.

I had unwittingly created a "rogue" DHCP server and was oblivious to the havoc that it would create. I shared with the support specialist what had happened and I can still see him making a bee-line for that rogue computer, disconnecting it from the network. All of our client computers had to be rebooted along with many of our switches which resulted in a lot of confusion and lost time due to my ignorance. That's when I learned that it is best to test new products on their own subnet.

[Jul 20, 2017] The ULTIMATE Horrors story with recovery!

Notable quotes:
"... Have you ever left your terminal logged in, only to find when you came back to it that a (supposed) friend had typed "rm -rf ~/*" and was hovering over the keyboard with threats along the lines of "lend me a fiver 'til Thursday, or I hit return"? Undoubtedly the person in question would not have had the nerve to inflict such a trauma upon you, and was doing it in jest. So you've probably never experienced the worst of such disasters.... ..."
"... I can't remember what happened in the succeeding minutes; my memory is just a blur. ..."
"... (We take dumps of the user files every Thursday; by Murphy's Law this had to happen on a Wednesday). ..."
"... By yet another miracle of good fortune, the terminal from which the damage had been done was still su'd to root (su is in /bin, remember?), so at least we stood a chance of all this working. ..."
Nov 08, 2002 | www.linuxjournal.com

Anonymous on Fri, 11/08/2002 - 03:00.

It's here... Unbelievable...

[I had intended to leave the discussion of "rm -r *" behind after the compendium I sent earlier, but I couldn't resist this one.

I also received a response from rutgers!seismo!hadron!jsdy (Joseph S. D. Yao) that described building a list of "dangerous" commands into a shell and dropping into a query when a glob turns up. They built it in so it couldn't be removed, like an alias. Anyway, on to the story! RWH.] I didn't see the message that opened up the discussion on rm, but thought you might like to read this sorry tale about the perils of rm....

(It was posted to net.unix some time ago, but I think our postnews didn't send it as far as it should have!)

----------------------------------------------------------------

Have you ever left your terminal logged in, only to find when you came back to it that a (supposed) friend had typed "rm -rf ~/*" and was hovering over the keyboard with threats along the lines of "lend me a fiver 'til Thursday, or I hit return"? Undoubtedly the person in question would not have had the nerve to inflict such a trauma upon you, and was doing it in jest. So you've probably never experienced the worst of such disasters....

It was a quiet Wednesday afternoon. Wednesday, 1st October, 15:15 BST, to be precise, when Peter, an office-mate of mine, leaned away from his terminal and said to me, "Mario, I'm having a little trouble sending mail." Knowing that msg was capable of confusing even the most capable of people, I sauntered over to his terminal to see what was wrong. A strange error message of the form (I forget the exact details) "cannot access /foo/bar for userid 147" had been issued by msg.

My first thought was "Who's userid 147?; the sender of the message, the destination, or what?" So I leant over to another terminal, already logged in, and typed

grep 147 /etc/passwd

only to receive the response

/etc/passwd: No such file or directory.

Instantly, I guessed that something was amiss. This was confirmed when in response to

ls /etc

I got

ls: not found.

I suggested to Peter that it would be a good idea not to try anything for a while, and went off to find our system manager. When I arrived at his office, his door was ajar, and within ten seconds I realised what the problem was. James, our manager, was sat down, head in hands, hands between knees, as one whose world has just come to an end. Our newly-appointed system programmer, Neil, was beside him, gazing listlessly at the screen of his terminal. And at the top of the screen I spied the following lines:

# cd 
# rm -rf * 

Oh, *****, I thought. That would just about explain it.

I can't remember what happened in the succeeding minutes; my memory is just a blur. I do remember trying ls (again), ps, who and maybe a few other commands beside, all to no avail. The next thing I remember was being at my terminal again (a multi-window graphics terminal), and typing

cd / 
echo * 

I owe a debt of thanks to David Korn for making echo a built-in of his shell; needless to say, /bin, together with /bin/echo, had been deleted. What transpired in the next few minutes was that /dev, /etc and /lib had also gone in their entirety; fortunately Neil had interrupted rm while it was somewhere down below /news, and /tmp, /usr and /users were all untouched.

Meanwhile James had made for our tape cupboard and had retrieved what claimed to be a dump tape of the root filesystem, taken four weeks earlier. The pressing question was, "How do we recover the contents of the tape?". Not only had we lost /etc/restore, but all of the device entries for the tape deck had vanished. And where does mknod live?

You guessed it, /etc.

How about recovery across Ethernet of any of this from another VAX? Well, /bin/tar had gone, and thoughtfully the Berkeley people had put rcp in /bin in the 4.3 distribution. What's more, none of the Ether stuff wanted to know without /etc/hosts at least. We found a version of cpio in /usr/local, but that was unlikely to do us any good without a tape deck.

Alternatively, we could get the boot tape out and rebuild the root filesystem, but neither James nor Neil had done that before, and we weren't sure that the first thing to happen would be that the whole disk would be re-formatted, losing all our user files. (We take dumps of the user files every Thursday; by Murphy's Law this had to happen on a Wednesday).

Another solution might be to borrow a disk from another VAX, boot off that, and tidy up later, but that would have entailed calling the DEC engineer out, at the very least. We had a number of users in the final throes of writing up PhD theses and the loss of a maybe a weeks' work (not to mention the machine down time) was unthinkable.

So, what to do? The next idea was to write a program to make a device descriptor for the tape deck, but we all know where cc, as and ld live. Or maybe make skeletal entries for /etc/passwd, /etc/hosts and so on, so that /usr/bin/ftp would work. By sheer luck, I had a gnuemacs still running in one of my windows, which we could use to create passwd, etc., but the first step was to create a directory to put them in.

Of course /bin/mkdir had gone, and so had /bin/mv, so we couldn't rename /tmp to /etc. However, this looked like a reasonable line of attack.

By now we had been joined by Alasdair, our resident UNIX guru, and as luck would have it, someone who knows VAX assembler. So our plan became this: write a program in assembler which would either rename /tmp to /etc, or make /etc, assemble it on another VAX, uuencode it, type in the uuencoded file using my gnu, uudecode it (some bright spark had thought to put uudecode in /usr/bin), run it, and hey presto, it would all be plain sailing from there. By yet another miracle of good fortune, the terminal from which the damage had been done was still su'd to root (su is in /bin, remember?), so at least we stood a chance of all this working.

Off we set on our merry way, and within only an hour we had managed to concoct the dozen or so lines of assembler to create /etc. The stripped binary was only 76 bytes long, so we converted it to hex (slightly more readable than the output of uuencode), and typed it in using my editor. If any of you ever have the same problem, here's the hex for future reference:

070100002c000000000000000000000000000000000000000000000000000000 
0000dd8fff010000dd8f27000000fb02ef07000000fb01ef070000000000bc8f 
8800040000bc012f65746300 

I had a handy program around (doesn't everybody?) for converting ASCII hex to binary, and the output of /usr/bin/sum tallied with our original binary. But hang on---how do you set execute permission without /bin/chmod? A few seconds thought (which as usual, lasted a couple of minutes) suggested that we write the binary on top of an already existing binary, owned by me...problem solved.

So along we trotted to the terminal with the root login, carefully remembered to set the umask to 0 (so that I could create files in it using my gnu), and ran the binary. So now we had a /etc, writable by all.

From there it was but a few easy steps to creating passwd, hosts, services, protocols, (etc), and then ftp was willing to play ball. Then we recovered the contents of /bin across the ether (it's amazing how much you come to miss ls after just a few, short hours), and selected files from /etc. The key file was /etc/rrestore, with which we recovered /dev from the dump tape, and the rest is history.

Now, you're asking yourself (as I am), what's the moral of this story? Well, for one thing, you must always remember the immortal words, DON'T PANIC. Our initial reaction was to reboot the machine and try everything as single user, but it's unlikely it would have come up without /etc/init and /bin/sh. Rational thought saved us from this one.

The next thing to remember is that UNIX tools really can be put to unusual purposes. Even without my gnuemacs, we could have survived by using, say, /usr/bin/grep as a substitute for /bin/cat. And the final thing is, it's amazing how much of the system you can delete without it falling apart completely. Apart from the fact that nobody could login (/bin/login?), and most of the useful commands had gone, everything else seemed normal. Of course, some things can't stand life without say /etc/termcap, or /dev/kmem, or /etc/utmp, but by and large it all hangs together.

I shall leave you with this question: if you were placed in the same situation, and had the presence of mind that always comes with hindsight, could you have got out of it in a simpler or easier way?

Answers on a postage stamp to:

Mario Wolczko

------------------------------------------------------------------------

Dept. of Computer Science ARPA: miw%[email protected]

The University USENET: mcvax!ukc!man.cs.ux!miw

Manchester M13 9PL JANET: [email protected]

U.K. 061-273 7121 x 5699

[Jul 20, 2017] These Guys Didn't Back Up Their Files, Now Look What Happened

Notable quotes:
"... Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information. ..."
"... "I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you" ..."
"... Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else. ..."
Jul 20, 2017 | www.makeuseof.com
Back in college, I used to work just about every day as a computer cluster consultant. I remember a month after getting promoted to a supervisor, I was in the process of training a new consultant in the library computer cluster. Suddenly, someone tapped me on the shoulder, and when I turned around I was confronted with a frantic graduate student – a 30-something year old man who I believe was Eastern European based on his accent – who was nearly in tears.

"Please need help – my document is all gone and disk stuck!" he said as he frantically pointed to his PC.

Now, right off the bat I could have told you three facts about the guy. One glance at the blue screen of the archaic DOS-based version of Wordperfect told me that – like most of the other graduate students at the time – he had not yet decided to upgrade to the newer, point-and-click style word processing software. For some reason, graduate students had become so accustomed to all of the keyboard hot-keys associated with typing in a DOS-like environment that they all refused to evolve into point-and-click users.

The second fact, gathered from a quick glance at his blank document screen and the sweat on his brow told me that he had not saved his document as he worked. The last fact, based on his thick accent, was that communicating the gravity of his situation wouldn't be easy. In fact, it was made even worse by his answer to my question when I asked him when he last saved.

"I wrote 30 pages."

Calculated out at about 600 words a page, that's 18000 words. Ouch.

Then he pointed at the disk drive. The floppy disk was stuck, and from the marks on the drive he had clearly tried to get it out with something like a paper clip. By the time I had carefully fished the torn and destroyed disk out of the drive, it was clear he'd never recover anything off of it. I asked him what was on it.

"My thesis."

I gulped. I asked him if he was serious. He was. I asked him if he'd made any backups. He hadn't.

Making Backups of Backups

If there is anything I learned during those early years of working with computers (and the people that use them), it was how critical it is to not only save important stuff, but also to save it in different places. I would back up floppy drives to those cool new zip drives as well as the local PC hard drive. Never, ever had a single copy of anything.

Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information.

To drive that lesson home, I wanted to share a collection of stories that I found around the Internet about some recent cases were people suffered that horrible fate – from thousands of files to entire drives worth of data completely lost. These are people where the only remaining option is to start running recovery software and praying, or in other cases paying thousands of dollars to a data recovery firm and hoping there's something to find.

Not Backing Up Projects

The first example comes from Yahoo Answers , where a user that only provided a "?" for a user name (out of embarrassment probably), posted:

"I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you"

The folks answering immediately dove into suggesting that the person run recovery software, and one person suggested that the person run a search on the computer for *.ppt.

... ... ...

Doing Backups Wrong

Then, there's a scenario of actually trying to do a backup and doing it wrong, losing all of the files on the original drive. That was the case for the person who posted on Tech Support Forum that, after purchasing a brand new Toshiba laptop and attempting to transfer old files from an external hard drive, they inadvertently wiped the files on the original drive.

Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else.

While the description of the problem is a little broken, from the sound of it, the person thought they were backing up from one direction, while they were actually backing up in the other direction. At least in this case not all of the original files were deleted, but a majority were.

[Jul 20, 2017] How Toy Story 2 Almost Got Deleted... Except That One Person Made A Home Backup

Notable quotes:
"... as a general observation, large organizations/corporations tend to opt for incredibly expensive, incredibly complex, incredibly overblown backup "solutions" sold to them by vendors rather than using the stock, well-tested, reliable tools that they already have. ..."
"... in over 30 years of working in the field, the second-worst product I have ever had the misfortune to deal with is Legato (now EMC) NetWorker. ..."
"... Panic can lead to further problems ..."
May 01, 2018 | Techdirt

Here's a random story, found via Kottke , highlighting how Pixar came very close to losing a very large portion of Toy Story 2 , because someone did an rm * (non geek: "remove all" command). And that's when they realized that their backups hadn't been working for a month. Then, the technical director of the film noted that, because she wanted to see her family and kids, she had been making copies of the entire film and transferring it to her home computer. After a careful trip from the Pixar offices to her home and back, they discovered that, indeed, most of the film was saved:

http://www.youtube.com/embed/EL_g0tyaIeE?rel=0

Now, mostly, this is just an amusing little anecdote, but two things struck me:

How in the world do they not have more "official" backups of something as major as Toy Story 2 ? In the clip they admit that potentially 20 to 30 man-years of work may have been lost. It makes no sense to me that this would rely on a single backup system. I wonder if the copy, made by technical director Galyn Susman, was outside of corporate policy. You would have to imagine that at a place like Pixar, there were significant concerns about things "getting out," and so the policy likely wouldn't have looked all that kindly on copies being used on home computers.

The Mythbusters folks wonder if this story was a little over-dramatized , and others have wondered how the technical director would have "multiple terabytes of source material" on her home computer back in 1999. That resulted in an explanation from someone who was there that what was deleted was actually the database containing the master copies of the characters, sets, animation, etc. rather than the movie itself. Of course, once again, that makes you wonder how it is that no one else had a simple backup. You'd think such a thing would be backed up in dozens of places around the globe for safe keeping...
Hans B PUFAL, 18 May 2012 @ 5:53am
Reminds me of ....

Some decades ago I was called to a customer site, a bank, to diagnose a computer problem. On my arrival early in the morning I noted a certain panic in the air. On querying my hosts I was told that there had been an "issue" the previous night and that they were trying, unsuccessfully, to recover data from backup tapes. The process was failing and panic ensued.

Though this was not the problem I had been called on to investigate, I asked some probing questions, made a short phone call, and provided the answer, much to the customer's relief.

What I found was that for months if not years the customer had been performing backups of indexed sequential files, that is data files with associated index files, without once verifying that the backed-up data could be recovered. On the first occasion of a problem requiring such a recovery they discovered that they just did not work.

The answer? Simply recreate the index files from the data. For efficiency reasons (this was a LONG time ago) the index files referenced the data files by physical disk addresses. When the backup tapes were restored the data was of course no longer at the original place on the disk and the index files were useless. A simple procedure to recreate the index files solved the problem.

Clearly whoever had designed that system had never tested a recovery, nor read the documentation which clearly stated the issue and its simple solution.

So here is a case of making backups, but then finding them flawed when needed.

Anonymous Coward , 18 May 2012 @ 6:00am
Re: Reminds me of ....

That's why, in the IT world, you ALWAYS do a "dry run" when you want to deploy something, and you monitor the heck out of critical systems.
Rich Kulawiec , 18 May 2012 @ 6:30am
Two notes on backups

1. Everyone who has worked in computing for any period of time has their own backup horror story. I'll spare you mine, but note that, as a general observation, large organizations/corporations tend to opt for incredibly expensive, incredibly complex, incredibly overblown backup "solutions" sold to them by vendors rather than using the stock, well-tested, reliable tools that they already have (e.g., "why should we use dump, which is open-source/reliable/portable/tested/proven/efficient/etc., when we could drop $40K on closed-source/proprietary/non-portable/slow/bulky software from a vendor?").

Okay, okay, one comment: in over 30 years of working in the field, the second-worst product I have ever had the misfortune to deal with is Legato (now EMC) NetWorker.
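For reference, the stock tools the comment alludes to are about this simple to use (device and filesystem names are assumptions):

   dump -0u -f /dev/nst0 /home     # full (level 0) dump of /home to tape, recorded in /etc/dumpdates
   restore -i -f /dev/nst0         # later: interactive restore from the same tape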

2. Hollywood has a massive backup and archiving problem. How do we know? Because they keep telling us about it. There are a series of self-promoting commercials that they run in theaters before movies, in which they talk about all of the old films that are slowly decaying in their canisters in vast warehouses, and how terrible this is, and how badly they need charitable contributions from the public to save these treasures of cinema before they erode into dust, etc.

Let's skip the irony of Hollywood begging for money while they're paying professional liar Chris Dodd millions and get to the technical point: the easiest and cheapest way to preserve all of these would be to back them up to the Internet. Yes, there's a one-time expense of cleaning up the analog versions and then digitizing them at high resolution, but once that's done, all the copies are free. There's no need for a data center or elaborate IT infrastructure: put 'em on BitTorrent and let the world do the work. Or give copies to the Internet Archive. Whatever -- the point is that once we get past the analog issues, the only reason that this is a problem is that they made it a problem by refusing to surrender control.
saulgoode ( profile ), 18 May 2012 @ 6:38am
Re: Two notes on backups "Real Men don't make backups. They upload it via ftp and let the world mirror it." - Linus Torvalds
Anonymous Coward , 18 May 2012 @ 7:02am
What I suspect is that she was copying the rendered footage. If the footage was rendered at a resolution and rate fitting the DVD spec, that'd put the raw footage at around 3GB to 4GB for a full 90 minutes, which just might fit on the 10GB HDDs that were available back then on laptop computers (remember how small OSes were back then).

Even losing just the rendered raw footage (or even processed footage) would be a massive setback. It takes a long time across a lot of very powerful computers to render film-quality footage. If it was processed footage, then it's even more valuable, as that takes a lot of man-hours of post FX to make raw footage presentable to a consumer audience.

aldestrawk ( profile ), 18 May 2012 @ 8:34am
a retelling by Oren Jacob Oren Jacob, the Pixar director featured in the animation, has made a comment on the Quora post that explains things in much more detail. The narration and animation were telling a story, as in storytelling. Despite the "99% true" caption at the end, a lot of details were left out, which misrepresented what had happened. Still, it was a fun tale for anyone who has dealt with backup problems. Oren Jacob's retelling in the comment makes it much more realistic and believable.
The claim of terabytes of data came from whoever posted the video on Quora. The video itself never mentions the actual amount of data lost or the total amount the raw files represent. Oren says, vaguely, that it was much less than a terabyte. There were backups! The last one was from two days previous to the delete event. The backup was flawed in that it produced files that, when tested by rendering, exhibited errors.

They ended up patching a two-month old backup together with the home computer version (two weeks old). This was labor intensive as some 30k files had to be individually checked.

The moral of the story.

Deleting files, under Linux as well as just about any OS, only involves deleting the directory entries. There is software which can recover those files as long as further use of the computer system doesn't end up overwriting what is now free space.
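A minimal sketch of that point (any scratch file name will do): the data outlives the name for as long as something still holds the file open, and only afterwards do the blocks become reusable free space.

echo "important data" > demo.txt
exec 3< demo.txt   # keep an open file descriptor on the file
rm demo.txt        # the directory entry is gone...
cat <&3            # ...but the data is still readable through the descriptor
exec 3<&-          # once the last reference is dropped, the blocks become free space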

Mason Wheeler , 18 May 2012 @ 10:01am
Re: a retelling by Oren Jacob
Panic can lead to further problems. They could well have introduced corruption in files by abruptly unplugging the computer.

What's worse? Corrupting some files or deleting all files?

aldestrawk ( profile ), 18 May 2012 @ 10:38am
Re: Re: a retelling by Oren Jacob

In this case they were not dealing with unknown malware that was steadily erasing the system as they watched. There was, apparently, a delete event at a single point in time that had repercussions that made things disappear while people worked on the movie.

I'll bet things disappeared when whatever editing was being done required a file to be refreshed.

A refresh operation would make the related object disappear when the underlying file was no longer available.

Apart from the set of files that had already been deleted, more files could have been corrupted when the computer was unplugged.

Having said that, this occurred in 1999 when they were probably using the Ext2 filesystem under Linux. These days most everyone uses a filesystem that includes journaling which protects against corruption that may occur when a computer loses power. Ext3 is a journaling filesystem and was introduced in 2001.

In 1998 I had to rebuild my entire home computer system. A power glitch introduced corruption in a Windows 95 system file and use of a Norton recovery tool rendered the entire disk into a handful of unusable files. It took me ten hours to rebuild the OS and re-install all the added hardware, software, and copy personal files from backup floppies. The next day I went out and bought a UPS. Nowadays, sometimes the UPS for one of my computers will fail during one of the three dozen power outages a year I get here. I no longer have problems with that because of journaling.

Danny ( profile ), 18 May 2012 @ 10:49am
I've gotta story like this too I've posted in the past on Techdirt that I used to work for Ticketmaster. There is an interesting TM story that I don't think ever made it into the public, so I will tell it now.

Back in the 1980s each TM city was on an independent computer system (PDP Unibus systems with RM05 or CDC 9766 disk drives). The drives were fixed boxes about the size of a washing machine, the removable disk platters about the size of the proverbial breadbox. Each platter held 256MB formatted.

Each city had its own operations policies, but generally the systems ran with mirrored drives, the database was backed up every night, and archival copies were made monthly. In Chicago, where I worked, we did not have offsite backup in the 1980s. The Bay Area had the most interesting system for offsite backup.

The Bay Area BASS operation, bought by TM in the mid 1980s, had a deal with a taxi driver. They would make their nightly backup copies in house, and make an extra copy on a spare disk platter. This cabbie would come by the office about 2am each morning, and they'd put the spare disk platter in his trunk, swapping it for the previous day's copy that had been in his trunk for 24 hours. So, for the cost of about two platters ($700 at the time) and whatever cash they'd pay the cabbie, they had a mobile offsite copy of their database circulating the Bay Area at all times.

When the World Series earthquake hit in October 1989, the TM office in downtown Oakland was badly damaged. The only copy of the database that survived was the copy in the taxi cab.

That incident led TM corporate to establish much more sophisticated data redundancy policies.

aldestrawk ( profile ), 18 May 2012 @ 11:30am
Re: I've gotta story like this too I like that story. Not that it matters anymore, but taxi cab storage was probably a bad idea. The disks were undoubtedly the "Winchester" type, and when powered down the head would be parked on a "landing strip". Still, subjecting these drives to jolts from a taxi riding over bumps in the road could damage the head or cause it to be misaligned. You would have known, though, if that actually turned out to be a problem. Also, I wouldn't trust a taxi driver with the company database. Although, that is probably due to an unreasonable bias towards cab drivers. I won't mention the numerous arguments with them (not in the U.S.) over fares and the one physical fight with a driver who nearly ran me down while I was walking.
Huw Davies , 19 May 2012 @ 1:20am
Re: Re: I've gotta story like this too RM05s are removable pack drives. The heads stay in the washing machine size unit - all you remove are the platters.
That One Guy ( profile ), 18 May 2012 @ 5:00pm
What I want to know is this... She copied bits of a movie to her home system... how hard did they have to pull in the leashes to keep Disney's lawyers from suing her to infinity and beyond after she admitted she'd done so (never mind the fact that her doing so saved them apparently years of work...)?
Lance , 3 May 2014 @ 8:53am

http://thenextweb.com/media/2012/05/21/how-pixars-toy-story-2-was-deleted-twice-once-by-technology-and-again-for-its-own-good/

Evidently, the film data only took up 10 GB in those days. Nowhere near TB...

[Jul 20, 2017] Scary Backup Stories by Paul Barry

A good backup is a backup that has been checked using the actual restore procedure. Anything else is just an approximation, as the devil is often in the details.
Notable quotes:
"... All the tapes were then checked, and they were all ..."
Nov 07, 2002 | Linux Journal

The dangers of not testing your backup procedures and some common pitfalls to avoid.

Backups. We all know the importance of making a backup of our most important systems. Unfortunately, some of us also know that realizing the importance of performing backups often is a lesson learned the hard way. Everyone has their scary backup stories. Here are mine. Scary Story #1

Like a lot of people, my professional career started out in technical support. In my case, I was part of a help-desk team for a large professional practice. Among other things, we were responsible for performing PC LAN backups for a number of systems used by other departments. For one especially important system, we acquired fancy new tape-backup equipment and a large collection of tapes. A procedure was put in place, and before-you-go-home-at-night backups became a standard. Some months later, a crash brought down the system, and all the data was lost. Shortly thereafter, a call came in for the latest backup tape. It was located and dispatched, and a recovery was attempted. The recovery failed, however, as the tape was blank . A call came in for the next-to-last backup tape. Nervously, it was located and dispatched, and a recovery was attempted. It also failed because this tape also was blank. Amid long silences and pink-slip glares, panic started to set in as the tape from three nights prior was called up. This attempt resulted in a lot of shouting.

All the tapes were then checked, and they were all blank. To add insult to injury, the problem wasn't only that the tapes were blank--they weren't even formatted! The fancy new backup equipment wasn't smart enough to realize the tapes were not formatted, so it allowed them to be used. Note: writing good data to an unformatted tape is never a good idea.

Now, don't get me wrong, the backup procedures themselves were good. The problem was that no one had ever tested the whole process--no one had ever attempted a recovery. Is it any wonder, then, that each recovery failed?

For backups to work, you need to do two things: (1) define and implement a good procedure and (2) test that it works.

To this day, I can't fathom how my boss (who had overall responsibility for the backup procedures) managed not to get fired over this incident. And what happened there has always stayed with me.

A Good Solution

When it comes to doing backups on Linux systems, a number of standard tools can help avoid the problems discussed above. Marcel Gagné's excellent book (see Resources) contains a simple yet useful script that not only performs the backup but verifies that things went well. Then, after each backup, the script sends an e-mail to root detailing what occurred.

I'll run through the guts of a modified version of Marcel's script here, to show you how easy this process actually is. This bash script starts by defining the location of a log and an error file. Two mv commands then rotate the previous log and error files to allow for the examination of the next-to-last backup (if required):

#! /bin/bash
backup_log=/usr/local/.Backups/backup.log
backup_err=/usr/local/.Backups/backup.err
mv $backup_log $backup_log.old
mv $backup_err $backup_err.old

With the log and error files ready, a few echo commands append messages (note the use of >>) to each of the files. The messages include the current date and time (which is accessed using the back-ticked date command). The cd command then changes to the location of the directory to be backed up. In this example, that directory is /mnt/data, but it could be any location:

echo "Starting backup of /mnt/data: `date`." >> $backup_log
echo "Errors reported for backup/verify: `date`." >> $backup_err
cd /mnt/data

The backup then starts, using the tried and true tar command. The -cvf options request the creation of a new archive (c), verbose mode (v) and the name of the file/device to back up to (f). In this example, we back up to /dev/st0, the location of an attached SCSI tape drive:

    tar -cvf /dev/st0 . 2>>$backup_err

Any errors produced by this command are sent to STDERR (standard error). The above command exploits this behaviour by appending anything sent to STDERR to the error file as well (using the 2>> directive).

When the backup completes, the script then rewinds the tape using the mt command, before listing the files on the tape with another tar command (the -t option lists the files in the named archive). This is a simple way of verifying the contents of the tape. As before, we append any errors reported during this tar command to the error file. Additionally, informational messages are added to the log file at appropriate times:

mt -f /dev/st0 rewind
echo "Verifying this backup: `date`" >>$backup_log
tar -tvf /dev/st0 2>>$backup_err
echo "Backup complete: `date`" >>$backup_log

To conclude the script, we concatenate the error file to the log file (with cat ), then e-mail the log file to root (where the -s option to the mail command allows the specification of an appropriate subject line):

    cat $backup_err >> $backup_log
    mail -s "Backup status report for /mnt/data" root < $backup_log

And there you have it, Marcel's deceptively simple solution to performing a verified backup and e-mailing the results to an interested party. If only we'd had something similar all those years ago.
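One further step worth bolting onto such a script is an actual restore test; a minimal sketch, assuming the same tape device and log variables as above (the path under /mnt/data is a placeholder for some file you know should be on the tape):

mt -f /dev/st0 rewind
mkdir -p /tmp/restore-test && cd /tmp/restore-test
tar -xvf /dev/st0 ./some/known/file 2>>$backup_err       # extract one known member from the tape
cmp /mnt/data/some/known/file ./some/known/file \
  && echo "Restore test passed: `date`" >>$backup_log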

... ... ...

[May 07, 2017] centos - Do not play those dangerous games with resizing of partitions unless absolutely necessary

Copying to an additional drive (which can be USB), repartitioning, and then copying everything back is a safer bet
www.softpanorama.org

In theory, you could reduce the size of sda1, increase the size of the extended partition, shift the contents of the extended partition down, then increase the size of the PV on the extended partition and you'd have the extra room. However, the number of possible things that can go wrong there is just astronomical, so I'd recommend either buying a second hard drive (and possibly transferring everything onto it in a more sensible layout, then repartitioning your current drive better) or just making some bind mounts of various bits and pieces out of /home into / to free up a bit more space.

--womble
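A sketch of that safer path, with hypothetical device and mount-point names: copy the data off, repartition cleanly, then copy it back.

rsync -aHAXx /home/ /mnt/usbdisk/home-copy/      # copy everything off, preserving hard links, ACLs and xattrs
# ... repartition the internal drive with fdisk or parted, recreate and remount the filesystems ...
rsync -aHAXx /mnt/usbdisk/home-copy/ /home/      # copy everything back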

[May 05, 2017] As Unix has no dedicated rename command, the use of mv for renaming can lead to a SNAFU

www.softpanorama.org

If the destination does not exist, mv behaves like a rename command; but if the destination exists and is a directory, mv moves the source inside it, one level down.

For example, if you have directories /home and /home2, want to move all subdirectories from /home2 to /home, and the directory /home is empty, you can't simply use

mv home2 home

If you forget to remove the directory /home first, mv will silently create a /home/home2 directory, and you have a problem if these are user home directories.
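With GNU coreutils there is a defensive variant: the -T (--no-target-directory) option refuses to descend into an existing destination directory, so the silent /home/home2 surprise cannot happen. A sketch:

mv -T /home2 /home    # renames /home2 to /home; fails loudly ("Directory not empty")
                      # if /home already exists and is not empty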

[May 05, 2017] The key problem with the cp utility is that it does not preserve the timestamps of files.

Windows users expect a copy command to preserve file attributes, but this is not true for the Unix cp command.
Using the -r option without the -p option resets all timestamps on the copies.
www.vanityfair.com

-p -- Preserve the characteristics of the source_file. Copy the contents, modification times, and permission modes of the source_file to the destination files.

You might wish to create an alias

alias cp='cp -p'

as I can't imagine a case where the regular Unix behaviour is desirable.
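A quick demonstration of the difference (assumes GNU touch and ls):

touch -d '2020-01-01' original.txt
cp original.txt copy1.txt        # copy1.txt gets the current time as its mtime
cp -p original.txt copy2.txt     # copy2.txt keeps the 2020-01-01 timestamp
ls -l --time-style=long-iso original.txt copy1.txt copy2.txt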

[Feb 14, 2017] My 10 UNIX Command Line Mistakes

Feb 14, 2017 | www.cyberciti.biz
Destroyed named.conf

I wanted to append a new zone to the /var/named/chroot/etc/named.conf file, but ended up running:
./mkzone example.com > /var/named/chroot/etc/named.conf
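A hedged aside: two small habits would have caught this. Appending needs >>, not >, and bash's noclobber option makes an accidental > fail instead of truncating an existing file:

set -o noclobber                                            # '>' now refuses to overwrite existing files
./mkzone example.com >> /var/named/chroot/etc/named.conf    # append instead of truncate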

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):
cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly, I ended up running an rsync command and deleted all new files by overwriting them with files from the backup set (now I've switched to rsnapshot)
rsync -av -delete /dest /src
Again, I had no backup.

[Feb 12, 2017] Vendor support vs. local support

Jonathan.White Jul 9, 2015 10:14 AM (in response to nickzourdos)

We had a client that said their IBM application was running slow because of the "network". (The mysterious place that packets vanish into like a black hole...lol) I explained to them that the application spans two data centers in separate states across several different pieces of equipment. They said they didn't feel like going through the process of opening another ticket with IBM since IBM would require them to gather a bunch of logs and do a lot of investigation work on their side. Instead they decided to punt it over to the networking team by opening a ticket/incident that read something along the lines that their application was slow due to network related issues.

To help get things moving along I set up a weekly call to get a status on where we were with the troubleshooting process. The first thing I would do was a roll call. I would ask who was on the line and then very specifically ask if IBM was on the call. Every time, they informed us that IBM wasn't on the call and hadn't been engaged. We were at a standstill, and the calls would end very quickly after roll call because IBM was the missing piece. We needed someone with enough knowledge of the application to tell us what exactly was slow so we could track it down across the network. Based on the client's initial thought process of punting it over to networking, you can imagine how well they knew their application.

Needless to say, after a few weeks of roll call they asked me to cancel the meetings, since they had contacted IBM and tweaked a few application settings that corrected the problems. The issue was resolved on our end by a simple roll call, which was strategically done to get this problem routed to the proper group despite the client's laziness....

[Feb 12, 2017] Stupidity of the manager effect

nickzourdos Jul 9, 2015 10:13 AM

So the Exchange server had a bit of a hiccup one day, back when I was on the help desk. There was an hour-long window where one of the databases got behind and the queues had to catch up. This caused ~200 users to have slow or unresponsive Outlook clients. I got an angry call from someone in accounting after about 20 minutes of downtime, and she proceeded to assume the role of tech support manager:

Her: So is the email down?

Me: Yes, we've notified our system administrators and they have already fixed the issue. We're waiting for the server to return to normal, but for now it's playing catchup.

Her: So how are you going to prevent the help desk from getting swamped with calls? Don't you think it would be a good idea to help deflect the calls you're getting?

Me: We're actually not that swamped. The outage only applies to 205 users in the company that are on that specific database.

Her: Ok but what are you going to do about it? What about those 205 people who are having problems? Shouldn't you notify them? How hard is it to send a mass email letting them know that the server is down?

Me: I... don't think they would get the email if the email server is down for them.

Her: Well I'm going to send a mass email to the accounting department, I suggest you do the same for the rest of the company.

[Feb 12, 2017] Just the push of a button in an open datacenter

atreides71 Jul 15, 2015 6:49 PM
My first job was at a Hewlett-Packard reseller company. The small datacenter was in plain sight from the lobby, so our sales executives could tell visitors about the infrastructure we were using to run the company systems (ERP, email, BI, etc.), and they had the bad habit of letting people in so they could see the different solutions up close.

One day one of those executives must have left the door open. It was summer holiday time, so we had a visit from a reseller accompanied by his young son, who quickly found that the door was open, came into the datacenter and pushed a single button: the on/off button of the Progress database server that held the ERP information. He did so and left the datacenter without being noticed.

In just a couple of minutes we had a lot of calls from all the branch offices asking about the ERP service; it took us one or two hours to find the failure, check the RAID status and the database integrity, and put it online again. We had a meeting looking for the root cause of the outage, until someone had the idea to check the video from the security cameras; then we found who was really responsible for the failure of the system.

Since then, the Datacenter remained closed.

[Feb 11, 2017] Being way too lazy is not always beneficial

When a customer gets a replacement disk for their SAN and doesn't replace it for a week saying "I just couldn't bring myself to care about the SAN this week." and then another disk goes bad the next day.
mleon Jul 15, 2015 9:13 AM

When a customer gets a replacement disk for their SAN and doesn't replace it for a week saying "I just couldn't bring myself to care about the SAN this week." and then another disk goes bad the next day.

[Feb 10, 2017] An inventive idea of reusing the socket into which the switch was plugged

jimtech18 Jul 31, 2015 1:34 PM

No hazard pay:

Replacing a failing switch in a high-pressure test lab (one with signs that warn of the danger of pinhole leaks being able to KILL you). I was up near the top of the stupid-tall 20' step ladder when the lab tech holding the ladder told me about the guy who fell off this same ladder last year and broke his hip (now he tells me). (Who puts a switch in the ceiling supports anyway? Apparently there used to be a wall there that the switch was mounted to. The construction guys removed the wall, so the switch and wires just got moved up and mounted at the ceiling; duh, what else would you do? Long before my time.) Back to the challenge at hand. As I was messing around with the switch, the hydrogen alarm mounted near the ceiling started wailing, and the guy on the floor said "That's not good!" and left the room. Remember him? He was steadying the ladder that someone fell off of last year, and I was still at the top of it. He soon returned and held the ladder as I climbed all the way down; it seemed like twice as far as when I climbed up.

By this time the hydrogen alarm had stopped, and both techs said that there was nothing to worry about and that I should finish the switch replacement so they could get back to work. Of course, as a SYSADMIN, I went back up the crappy 20' step ladder and finished swapping out the failed switch for a PoE-powered one; problem resolved. I took the failed switch back to my office and it worked fine. What? How could that be? It turns out the extension cord that the switch (mounted at the ceiling) was plugged into had been unplugged by the first tech who was HELPING me, because he needed an outlet. Once I pointed out the cause of the whole issue he said, "Oh yeah, that's where that cord goes. Oh well, it's fixed now and I get to keep using the outlet, thanks".

[Feb 08, 2017] A side effect of simultaneous changes on many boxes can be a network storm when the boxes all start communicating at once

Deltona Jul 31, 2015 4:50 AM

I was supposed to do some routine redundancy tests at a remote site in another country. After implementing and testing everything successfully, I enabled EnergyWise on a couple hundred switches in one go. The broadcast storm that followed brought everything in the DC down to a halt.

It took me two hours to figure out why this happened, and I missed my flight back home. A couple of months later, a dozen-plus firmware releases came out to address this issue.

[Feb 07, 2017] Troubleshooting method for networking problems: work up the OSI model - layer 1 - check the cabling

Troubleshooting method - work up the OSI model - layer 1 - check the cabling. After checking the cabling, check the cabling again. Before you're ready to escalate, ask for help, check the cabling again.

pseudocyber Jul 28, 2015 9:20 AM (in response to adcast)

Troubleshooting method - work up the OSI model - layer 1 - check the cabling. After checking the cabling, check the cabling again. Before you're ready to escalate, ask for help, check the cabling again.

[Feb 06, 2017] The way to keep senior management informed

rbrickler Jul 17, 2015 11:52 AM

I was working for Network Operations in a company several years back. It was a small company and we had a VP that was not tech savvy. We were having an issue one day, and he came running into the Network Operations Center asking what was going on.

One of our coworkers looked at him and said, "Relax, it is no big deal, we have everything under control." He asked what the problem was.

Our coworker said, "the flux capacitor stopped working, but we got it restarted." The VP said OK, turned around and left the room to go report to the execs about our Flux Capacitor issue....

[Feb 05, 2017] Cutting yourself off from a networked server by taking the eth0 interface down and then up

jemertz Mar 30, 2016 10:26 AM
When working in a remote lab, on a Linux server which you're connecting to through eth0:

use: ifdown eth0; ifup eth0

not:

ifdown eth0 
ifup eth0

Doing it on one line means it comes back up right after it goes down. Doing it on two lines means you lose connection before you can type the second line. I figured this out the hard way, and haven't made the same mistake a second time.
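If the two-step form is unavoidable, a detached one-liner limits the damage; a sketch for classic ifupdown systems (nohup keeps the second half running even if the SSH session drops):

nohup sh -c 'ifdown eth0; sleep 2; ifup eth0' >/dev/null 2>&1 &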

[Feb 04, 2017] How do I fix the mess created by accidentally untarring files into the current dir, aka a tar bomb

Highly recommended!
In such cases the UID of the files is often different from the UID of "legitimate" files in the polluted directories, and you can probably use this fact for quick elimination of the tar bomb. But the idea of using the list of files from the tar bomb to eliminate the offending files also works if you observe some precautions -- some directories that were created can have the same names as existing directories. Never do rm in -exec or via xargs without testing.
Notable quotes:
"... You don't want to just rm -r everything that tar tf tells you, since it might include directories that were not empty before unpacking! ..."
"... Another nice trick by @glennjackman, which preserves the order of files, starting from the deepest ones. Again, remove echo when done. ..."
"... One other thing: you may need to use the tar option --numeric-owner if the user names and/or group names in the tar listing make the names start in an unpredictable column. ..."
"... That kind of (antisocial) archive is called a tar bomb because of what it does. Once one of these "explodes" on you, the solutions in the other answers are way better than what I would have suggested. ..."
"... The easiest (laziest) way to do that is to always unpack a tar archive into an empty directory. ..."
"... The t option also comes in handy if you want to inspect the contents of an archive just to see if it has something you're looking for in it. If it does, you can, optionally, just extract the file(s) you want. ..."
Feb 04, 2017 | superuser.com

linux - Undo tar file extraction mess - Super User

First, try to issue

tar tf archive
tar will list the contents line by line.

This can be piped to xargs directly, but beware : do the deletion very carefully. You don't want to just rm -r everything that tar tf tells you, since it might include directories that were not empty before unpacking!

You could do

tar tf archive.tar | xargs -d'\n' rm -v
tar tf archive.tar | sort -r | xargs -d'\n' rmdir -v

to first remove all files that were in the archive, and then the directories that are left empty.

sort -r (glennjackman suggested tac instead of sort -r in the comments to the accepted answer, which also works since tar's output is regular enough) is needed to delete the deepest directories first; otherwise a case where dir1 contains a single empty directory dir2 will leave dir1 after the rmdir pass, since it was not empty before dir2 was removed.

This will generate a lot of

rm: cannot remove `dir/': Is a directory

and

rmdir: failed to remove `dir/': Directory not empty
rmdir: failed to remove `file': Not a directory

Shut this up with 2>/dev/null if it annoys you, but I'd prefer to keep as much information on the process as possible.

And don't do it until you are sure that you match the right files. And perhaps try rm -i to confirm everything. And have backups, eat your breakfast, brush your teeth, etc.

===

List the contents of the tar file like so:

tar tzf myarchive.tar.gz

Then, delete those file names by iterating over that list:

while IFS= read -r file; do echo "$file"; done < <(tar tzf myarchive.tar.gz)

This will still just list the files that would be deleted. Replace echo with rm if you're really sure these are the ones you want to remove. And maybe make a backup to be sure.

In a second pass, remove the directories that are left over:

while IFS= read -r file; do rmdir "$file"; done < <(tar tzf myarchive.tar.gz)

Because rmdir only removes empty directories, this prevents directories that already existed (and still contain other files) from being deleted.

Another nice trick by @glennjackman, which preserves the order of files, starting from the deepest ones. Again, remove echo when done.

tar tzf myarchive.tar.gz | tac | xargs -d'\n' echo rm

This could then be followed by the normal rmdir cleanup.


Here's a possibility that will take the extracted files and move them to a subdirectory, cleaning up your main folder.
#!/usr/bin/perl -w

use strict;
use Getopt::Long;

my $clean_folder = "clean";
my $DRY_RUN;
die "Usage: $0 [--dry] [--clean=dir-name]\n"
    if ( !GetOptions("dry!"    => \$DRY_RUN,
                     "clean=s" => \$clean_folder) );

# Protect the 'clean_folder' string from shell substitution
$clean_folder =~ s/'/'\\''/g;

# Process the "tar tv" listing and output a shell script.
print "#!/bin/sh\n" if ( !$DRY_RUN );
while (<>)
{
    chomp;

    # Strip out the permissions string and the directory entry from the 'tar' list
    my $perms  = substr($_, 0, 10);
    my $dirent = substr($_, 48);

    # Drop entries that are in subdirectories
    next if ( $dirent =~ m:/.: );

    # If we're in "dry run" mode, just list the permissions and the directory
    # entries.
    if ( $DRY_RUN )
    {
        print "$perms|$dirent\n";
        next;
    }

    # Emit the shell code to clean up the folder
    $dirent =~ s/'/'\\''/g;
    print "mv -i '$dirent' '$clean_folder'/.\n";
}

Save this to the file fix-tar.pl and then execute it like this:

$ tar tvf myarchive.tar | perl fix-tar.pl --dry

This will confirm that your tar list is like mine. You should get output like:

-rw-rw-r--|batch
-rw-rw-r--|book-report.png
-rwx------|CaseReports.png
-rw-rw-r--|caseTree.png
-rw-rw-r--|tree.png
drwxrwxr-x|sample/

If that looks good, then run it again like this:

$ mkdir cleanup
$ tar tvf myarchive.tar | perl fix-tar.pl --clean=cleanup > fixup.sh

The fixup.sh script will be the shell commands that will move the top-level files and directories into a "clean" folder (in this instance, the folder called cleanup). Have a peek through this script to confirm that it's all kosher. If it is, you can now clean up your mess with:

$ sh fixup.sh

I prefer this kind of cleanup because it doesn't destroy anything that isn't already destroyed by being overwritten by that initial tar xv.

Note: if that initial dry run output doesn't look right, you should be able to fiddle with the numbers in the two substr function calls until they look proper. The $perms variable is used only for the dry run so really only the $dirent substring needs to be proper.

One other thing: you may need to use the tar option --numeric-owner if the user names and/or group names in the tar listing make the names start in an unpredictable column.

===

That kind of (antisocial) archive is called a tar bomb because of what it does. Once one of these "explodes" on you, the solutions in the other answers are way better than what I would have suggested.

The best "solution", however, is to prevent the problem in the first place.

The easiest (laziest) way to do that is to always unpack a tar archive into an empty directory. If it includes a top level directory, then you just move that to the desired destination. If not, then just rename your working directory (the one that was empty) and move that to the desired location.
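A sketch of that habit (the archive name is a placeholder):

mkdir scratch
tar -C scratch -xvf archive.tar    # everything lands under ./scratch, whatever the archive layout
ls -la scratch                     # then move the contents wherever they belong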

If you just want to get it right the first time, you can run tar -tvf archive-file.tar | less and it will list the contents of the archive so you can see how it is structured and then do what is necessary to extract it to the desired location to start with.

The t option also comes in handy if you want to inspect the contents of an archive just to see if it has something you're looking for in it. If it does, you can, optionally, just extract the file(s) you want.

[Feb 04, 2017] Restoring deleted /tmp folder

Jan 13, 2015 | cyberciti.biz

As my journey continues with the Linux and Unix shell, I have made a few mistakes. I accidentally deleted the /tmp folder. To restore it, all you have to do is:

mkdir /tmp
chmod 1777 /tmp
chown root:root /tmp
ls -ld /tmp
 

[Feb 04, 2017] Use CDPATH to access frequent directories in bash - Mac OS X Hints

Feb 04, 2017 | hints.macworld.com
The variable CDPATH defines the search path for the cd command, i.e., the list of directories in which to look for the target directory, so it serves much like a "home for directories". The danger is in creating too complex a CDPATH; often a single directory works best. For example, export CDPATH=/srv/www/public_html . Now, instead of typing cd /srv/www/public_html/CSS I can simply type: cd CSS
Use CDPATH to access frequent directories in bash
Mar 21, '05 10:01:00AM • Contributed by: jonbauman

I often find myself wanting to cd to the various directories beneath my home directory (i.e. ~/Library, ~/Music, etc.), but being lazy, I find it painful to have to type the ~/ if I'm not in my home directory already. Enter CDPATH, as described in man bash:

The search path for the cd command. This is a colon-separated list of directories in which the shell looks for destination directories specified by the cd command. A sample value is ".:~:/usr".
Personally, I use the following command (either on the command line for use in just that session, or in .bash_profile for permanent use):
CDPATH=".:~:~/Library"

This way, no matter where I am in the directory tree, I can just cd dirname, and it will take me to the directory that is a subdirectory of any of the ones in the list. For example:
$ cd
$ cd Documents 
/Users/baumanj/Documents
$ cd Pictures
/Users/username/Pictures
$ cd Preferences
/Users/username/Library/Preferences
etc...
[ robg adds: No, this isn't some deeply buried treasure of OS X, but I'd never heard of the CDPATH variable, so I'm assuming it will be of interest to some other readers as well.]

cdable_vars is also nice
Authored by: clh on Mar 21, '05 08:16:26PM

Check out the bash command shopt -s cdable_vars

From the man bash page:

cdable_vars

If set, an argument to the cd builtin command that is not a directory is assumed to be the name of a variable whose value is the directory to change to.

With this set, if I give the following bash command:

export d="/Users/chap/Desktop"

I can then simply type

cd d

to change to my Desktop directory.

I put the shopt command and the various export commands in my .bashrc file.

[Aug 04, 2015] My 10 UNIX Command Line Mistakes by Vivek Gite

The thread of comments after the article is very educational. We reproduce only a small fraction.
June 21, 2009

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at the UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory (this had been done by the previous sysadmin, and it was my first day at work) and the mail spool of the user to be removed. I just wanted to remove the user account and ended up deleting everything (note that -r was activated via deluser.conf):

userdel foo

... ... ...

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):

cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly, I ended up running an rsync command and deleted all new files by overwriting them with files from the backup set (now I've switched to rsnapshot)

rsync -av -delete /dest /src
Again, I had no backup.
Deleted Apache DocumentRoot

I had symlinks for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about the symlink issue. To save disk space, I ran rm -rf on the http directory. Luckily, I had a full working backup set.

... ... ...

Public Network Interface Shutdown

I wanted to shutdown VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:

ifconfig eth1 down
Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update firewall rules. After a quick kernel upgrade, I had rebooted the box. I had to call remote data center tech to reset firewall settings. (now I use firewall reset script to avoid lockdowns).

Typing UNIX Commands on Wrong Box

I wanted to shutdown my local Fedora desktop system, but I issued halt on remote server (I was logged into remote box via SSH):

halt
service httpd stop
Wrong CNAME DNS Entry

Created a wrong DNS CNAME entry in example.com zone file. The end result - a few visitors went to /dev/null:

echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload
Failed To Update Postfix RBL Configuration

In 2006 ORDB went out of operation. But, I failed to update my Postfix RBL settings. One day ORDB was re-activated and it was returning every IP address queried as being on its blacklist. The end result was a disaster.

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I've learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions. (see Torture-testing Backup and Archive Programs paper).
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot (see the sketch after this list).
  4. Use CVS to store configuration files.
  5. Wait and read the command line again before hitting the damn [Enter] key.
  6. Use your well-tested perl / shell scripts and open source configuration management software such as Puppet, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.
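A minimal rotating-snapshot sketch in the spirit of rsnapshot, using rsync hard links (the /srv/data and /backup paths are hypothetical):

today=$(date +%F)
rsync -a --delete --link-dest=/backup/latest /srv/data/ "/backup/$today/"   # first run just copies; later runs hard-link unchanged files
ln -sfn "$today" /backup/latest                                             # each dated snapshot stays cheap and browsable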

Mistakes are inevitable, so did you make any mistakes that have caused some sort of downtime? Please add them in the comments below.

Jon June 21, 2009, 2:42 am

My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed

rm -rf /tmp/foo/bar/ *

instead of

rm -rf /tmp/foo/bar/*

The system doesn't run very well without all of its libraries…

Vinicius August 21, 2010, 5:42 pm

I did something similar on a remote server.
I was going to type 'chmod -R 755 ./' but I typed 'chmod -R 755 /' instead. :|

Daniel December 30, 2013, 9:40 pm

I typed 'chmod -R 777', to allow all files to have rwx permissions for all users (on an RPi).

Doesn't work that well without sudo!

robert wlaschin May 1, 2012, 9:57 pm

Hm… I was trying to format a USB flash drive

dd if=big_null_file of=/dev/sdb

unfortunately /dev/sdb was my local secondary drive, sdc was the usb … shucks.

I discovered this after I rebooted.

Jeff April 21, 2011, 10:46 pm

I did something similar on my first day as a junior admin. As root, I copied my buddy's dot files (.profile, etc.) from his home directory to mine because he had some cool customizations. He also had some scripts in a directory called .scripts/ that he wanted me to copy. I gave myself ownership of the dot files and the contents of the .scripts directory with this command:

cd ~jeff; chown -R jeff .*

It was only later that I realized that ".*" matched "." and "..", so my userid owned the entire machine… which happened to be our production Oracle database.

That was 15 years ago and we've both changed jobs a few times, but that friend reminds me of that mistake every time I see him.

Garry April 11, 2014, 8:02 pm

I once had a bunch of dot files I wanted to remove. So I did:

rm -r .*

This, of course, includes ".." – recursively.

I had taken over SysAdmin of a server. The server had a cron job that ran, as root, that cd'ed into a directory and did a find, removing any files older than 3 days. It was to clean up the log files of some program they had. They quit using the program. About a year later, someone removed the directory. The cron job ran. The cd into the log file directory didn't work, but the cron job kept going. It was still in / – removing any files older than 3 days! I restored the filesystems and went home to get some sleep, thinking I would investigate the root cause after I had some rest. As soon as my head hit the pillow, the phone rang. "It did it again". The cron job had run again.

Lastly, I once had an accidental copy & paste, which renamed (mv) /usr/lib. Did you know the "mv" command uses libraries in /usr/lib? I found that out the hard way when I discovered I could not move it back to its original pathname. Nor could I copy it (cp uses /usr/lib).

An "Ohnosecond" is defined as the period of time between when you hit enter and you realize what you just did.

Michael Shigorin April 12, 2014, 8:14 am

That's why set -e or #!/bin/sh -e (in this particular case I'd just tell find that_dir … though). [The -e flag's long name is errexit, causing the script to exit immediately on the first error; see the sketch after this comment. -- NNB]

My ".." incident has taught me to hit Tab just in case, to see what actually gets removed; BTW zsh is very helpful in that regard, it has some safety-net measures for the usual * and ~ cases. But then again, touching nothing with destructive tools when tired, especially as root, is a bitter but prudent decision.

Regarding /usr/lib: ALT Linux coreutils are built properly ;-) (although there are some leftovers as we've found when looking with some Gentoo guys at LVEE conference)
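Returning to the errexit note above, this is exactly the failure mode in Garry's cron story; a sketch of that cleanup job written so that a failed cd aborts the run instead of purging / (the log path is hypothetical):

#!/bin/sh -e
# if the directory is gone, the cd fails and -e stops the script right here
cd /var/log/someapp
find . -type f -mtime +3 -delete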

georgesdev June 21, 2009, 9:15 am

never type anything such as:

rm -rf /usr/tmp/whatever

Maybe you are going to hit Enter by mistake before the end of the line. You would then, for example, erase your whole disk starting at /.

If you want to use the -rf option, add it at the end of the line:

rm /usr/tmp/whatever -rf

and even this way, read your line twice before adding -rf (note that options placed after the file names work with GNU rm, but not with every Unix variant)

Daniel Hoherd May 4, 2012, 4:58 pm

Another good test is to first do "echo rm -rf /dir/whatever/*" to see the expansion of the glob and what will be deleted. I especially do this when writing loops, then just pipe to bash when I know I've got it right.

Denis November 23, 2010, 9:27 am

I think it is good practice to use the -i parameter with -rf:

rm -rfi /usr/tmp/whatever

-i will ask you whether you are sure you want to delete all that stuff.

John February 25, 2011, 11:11 am

I worked with a guy who always used "rm -rf" to delete anything. And he always logged in as root. Another worker set the stage for him by creating a file called "~" in a visible location (that would be a file entered as "\~", so as not to expand to the user's home directory). User one then dealt with that file with "rm -rf ~". This was when the root home directory was / and not something like /root. You got it.

Cody March 22, 2011, 1:33 pm

(Note to mod: put this in wrong place initially; sorry about that. here is the correct place).

This reminds me of when I told a friend a way to auto-logout on login (there are many ways, but this one would be more obscure). He then told someone who was "annoying" him to try it in his shell. The end result was that this person was furious. Quite so. And although I don't find it so funny now (keyword: not as funny – I still think it's amusing), I found it hilarious then (hey, I was young and as obnoxious as can be!).

The command, for what its worth :

echo 'PS1=`kill -9 0`' >> ~/.bash_profile

Yes, that's setting the prompt to run the command kill -9 0 upon sourcing of ~/.bash_profile, which means killing that shell. Bad idea! [The single quotes matter here: with double quotes the command substitution would run immediately in your own shell instead of being written into the file. -- NNB]

I don't even remember what inspired me to think of that command as this was years and years ago. However, it does bring up an important point :

Word to the wise: if you do not know what a command does, don't run it! Amazing how many fail that one…

Peter Odding January 7, 2012, 6:40 pm

I once read a nasty trick that's also fun in a very sadistic kind of way:

echo 'echo sleep 1 >> ~/.profile' >> /home/unhappy-user/.profile

The idea is that every time the user logs in it will take a second longer than the previous time… This stacks up quickly and gets reallllly annoying :-)

Daniel April 23, 2015, 10:53 am

What about echo "PS1=$PS1 ; `sleep 1`" >> ~/.bash_profile
I'm not sure if it works, but it's pretty cool.


3ToKoJ June 21, 2009, 9:26 am

public network interface shutdown … done

typing unix command on wrong box … done

Delete apache DocumentRoot … done

Firewall lockdown … done, with a NAT rule redirecting the configuration interface of the firewall to another box; a serial connection saved me

I can add being trapped by aptitude keeping track of previously planned, but not executed, actions, like "remove slapd from the master directory server"

UnixEagle June 21, 2009, 11:03 am

Rebooted the wrong box.
While adding an alias to the main network interface I ended up changing the main IP address; the system froze right away and I had to call for a reboot.

Instead of appending text to the Apache config file, I overwrote its contents

Firewall lockdown while changing the ssh port

I wrongly ran a script containing recursive chmod and chown as root on /, which caused me about 12 hours of downtime and a complete re-install

Some mistakes are really silly, and when they happen, you can't believe you did that; but every mistake, regardless of its silliness, should be a lesson learned.
If you made a trivial mistake, you should not just overlook it; you have to think of the reasons that made you do it, like not having had enough sleep, or your mind being preoccupied with personal life, etc.

I like Einstein's quote; you really have to make mistakes to learn.

smaramba June 21, 2009, 11:31 am

Typing a unix command on the wrong box and the firewall lockdown are all-time classics: been there, done that. But for me the absolute worst, on Linux, was checking a mounted filesystem on a production server…

fsck /dev/sda2

The root filesystem was rendered unreadable. System down. Dead. Users really pissed off. Fortunately there was a full backup and the machine was rebooted within an hour.

Don May 10, 2011, 4:14 pm

I know this thread is a couple of years old but …

Using lpr from the command line, forgetting that I was logged in to a remote machine in another state. My print job contained sensitive information which was now on a printer several hundred miles away! Fortunately, a friend intercepted the message and emailed me while I was trying to figure out what was wrong with my printer :-)

od June 21, 2009, 12:50 pm

"Typing UNIX Commands on Wrong Box"

Yea, I did that one too. Wanted to shut down my own vm but I issued init 0 on a remote server which I accessed via ssh. And oh yes, it was the production server.

Adi June 21, 2009, 10:24 pm

tar -czvf /path/to/file file_archive.tgz

instead of

tar -czvf file_archive.tgz /path/to/file

I ended up destroying that file and had no backup, as this command was intended to provide the first backup – it was on the DHCP Linux production server and the file was dhcpd.conf!

The Unix Hater's Handbook

wayback.archive.org

"rm" Is Forever

The principles above combine into real-life horror stories. A series of exchanges on the Usenet news group alt.folklore.computers illustrates our case:

Date: Wed, 10 Jan 90
From: [email protected] (Dave Jones)
Subject: rm *
Newsgroups: alt.folklore.computers

Anybody else ever intend to type:

% rm *.o

And type this by accident:

% rm *>o

Now you've got one new empty file called "o", but plenty of room for it!

Actually, you might not even get a file named "o" since the shell documentation doesn't specify if the output file "o" gets created before or after the wildcard expansion takes place. The shell may be a programming language, but it isn't a very precise one.

Date: Wed, 10 Jan 90 15:51 CST
From: [email protected]
Subject: Re: rm *
Newsgroups: alt.folklore.computers

I too have had a similar disaster using rm. Once I was removing a file system from my disk which was something like /usr/foo/bin. I was in /usr/foo and had removed several parts of the system by:

% rm -r ./etc
% rm -r ./adm

…and so on. But when it came time to do ./bin, I missed the period. System didn't like that too much.

Unix wasn't designed to live after the mortal blow of losing its /bin directory. An intelligent operating system would have given the user a chance to recover (or at least confirm whether he really wanted to render the operating system inoperable).

Unix aficionados accept occasional file deletion as normal. For example, consider the following excerpt from the comp.unix.questions FAQ (comp.unix.questions is an international bulletin-board where users new to the Unix Gulag ask questions of others who have been there so long that they don't know of any other world; the FAQ is a list of Frequently Asked Questions garnered from it):

6) How do I "undelete" a file?

Someday, you are going to accidentally type something like:

% rm * .foo

and find you just deleted "*" instead of "*.foo". Consider it a rite of passage.

Of course, any decent systems administrator should be doing regular backups. Check with your sysadmin to see if a recent backup copy of your file is available.

"A rite of passage"? In no other industry could a manufacturer take such a cavalier attitude toward a faulty product. "But your honor, the exploding gas tank was just a rite of passage." "Ladies and gentlemen of the jury, we will prove that the damage caused by the failure of the safety catch on our ...
Changing rm's Behavior Is Not an Option

After being bitten by rm a few times, the impulse rises to alias the rm command so that it does an "rm -i" or, better yet, to replace the rm command with a program that moves the files to be deleted to a special hidden directory, such as ~/.deleted. These tricks lull innocent users into a false sense of security.

Date: Mon, 16 Apr 90 18:46:33 199
From: Phil Agre <[email protected]>
To: UNIX-HATERS
Subject: deletion

On our system, "rm" doesn't delete the file, rather it renames in some obscure way the file so that something called "undelete" (not "unrm") can get it back.

This has made me somewhat incautious about deleting files, since of course I can always undelete them. Well, no I can't. The Delete File command in Emacs doesn't work this way, nor does the D command in Dired. This, of course, is because the undeletion protocol is not part of the operating system's model of files but simply part of a kludge someone put in a shell command that happens to be called "rm."

As a result, I have to keep two separate concepts in my head, "deleting" a file and "rm'ing" it, and remind myself of which of the two of them I am actually performing when my head says to my hands "delete it."

Some Unix experts follow Phil's argument to its logical absurdity and maintain that it is better not to make commands like rm even a slight bit friendly. They argue, though not quite in the terms we use, that trying to make Unix friendlier, to give it basic amenities, will actually make it worse. Unfortunately, they are right.

[Sep 04, 2014] Blunders with expansion of tar files whose structure you do not understand

If you try to expand a tar file in some production directory, you can accidentally overwrite files and change the ownership of directories, and then spend a lot of time restoring the status quo. It is safer to expand such tar files in /tmp first, and only after seeing the results decide whether to copy some directories over or re-expand the tar file, this time in the production directory.
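A minimal sketch of that habit (vendor.tar is a placeholder name):

mkdir /tmp/unpack.$$                      # a throwaway directory unique to this shell
tar -C /tmp/unpack.$$ -xvpf vendor.tar
ls -lR /tmp/unpack.$$ | less              # inspect ownership and layout before touching production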

[Sep 03, 2014] Doing an operation in the wrong directory among several similar directories

Sometimes directories are very similar, for example numbered directories created by some application, such as task0001, task0002, ... task0256. In this case you can easily perform an operation on the wrong directory. For example, you might send tech support a tar file of a directory that contains a production run instead of test data.

[Oct 17, 2013] Crontab file - The UNIX and Linux Forums

The loss of a crontab is serious trouble. This is one of the typical sysadmin blunders (Crontab file - The UNIX and Linux Forums)

mradsus

Hi All,
I created a crontab entry in a cron.txt file and accidentally entered

crontab cron.txt.

Now my previous crontab -l entries are not showing up, which means I removed the scheduling of the previous jobs by running this command "crontab cron.txt".

How do I revert back to the previously scheduled jobs?
Please help, this is urgent.
Thanks.

In this case, if you do not have a backup, your only remedy is to try to extract the cron command lines from /var/log/messages.
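Two small habits help here; a sketch (the backup file name is arbitrary):

crontab -l > ~/crontab.$(date +%F).bak   # keep a dated copy before replacing the table
crontab cron.txt
# if the table is already gone, job command lines can often be gleaned from syslog:
grep -i cron /var/log/messages | less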

[May 17, 2012] Pixar's The Movie Vanishes, How Toy Story 2 Was Nearly Lost

In the 2010 animated short titled Studio Stories: The Movie Vanishes, we learn from Pixar's Oren Jacob and Galyn Susman how a big chunk of Toy Story 2 movie files were nearly lost due to the accidental use of a Linux rm command (and a poor backup system). This short was included on the Toy Story 2 DVD extras.

Pixar studio stories - The movie vanishes (full) - YouTube

[Mar 16, 2012] Using right command in a wrong place

From email to Editor of Softpanorama...

This happened with HP OpenView. It has a command for agent reinstallation, opc-inst -r. The problem is that it needs to be run on the managed node, not on the management server, and it does not accept any arguments.

In this case it was run on the server, with predictable results. This was a production server of a large corporation, so you can imagine the level of stress involved in putting out this fire...

[Oct 14, 2011] Nasty surprise with the command cd joeuser; chown -R joeuser:joeuser .*

This is a classic case of a side effect of the .* glob combined with the -R flag, which causes a complete tree traversal in Unix (because .* also matches ..). The key here is not to panic. Recovery is possible even if you do not have a map of all file ownership and permissions (and you had better collect one on a regular basis). The first step is to use
for p in $(rpm -qa); do rpm --setugids $p; done
The second is to copy the remaining ownership info from some similar system. It is especially important to restore ownership in the /dev directory.
A similar approach can be used for restoring permissions:
for p in $(rpm -qa); do rpm --setperms $p; done
Please note that the rpm --setperms command actually resets the setuid, setgid, and sticky bits. These must be set manually, using some existing system as a baseline.
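Note that rpm --setugids and rpm --setperms only cover files that belong to packages; home directories, application data and /usr/local are not covered. It pays to capture a baseline ownership/permissions map while the system is healthy. A minimal sketch, assuming GNU find (the output file name is arbitrary):

# Sketch: record owner, group, mode and path for every local file, so the map
# can be compared against (or replayed) after an accidental recursive chown/chmod
find / -xdev -printf '%u %g %m %p\n' > /root/perm-baseline.$(date +%Y%m%d)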

[Jul 22, 2011] Mailbag by Marcello Romani

Feb 02, 2011 | LG #186

Hi, I had a horror story similar to Ben's one, about two years ago. I backed up a PC and reinstalled the OS with the backup usb disk still attached. The OS I was reinstalling was a version of Windows (2000 or XP, I don't remember right now). When the partition creation screen appeared, the list items looked a bit different from what I was expecting, but as soon as I realized why, my fingers had already pressed the keys, deleting the existing partitions and creating a new ntfs one. Luckily, I stopped just before the "quick format" command... Searching the 'net for data recovery software, I came across TestDisk, which is targeted at partition table recovery. I was lucky enough to have wiped out only that portion of the usb disk, so in less than an hour I was able to regain access to all of my data. Since then I always "safely remove" usb disks from the machine before doing anything potentially dangerous, and check "fdisk -l" at least three times before deciding that the arguments to "dd" are written correctly...

Marcello Romani, TAG mailing list (http://lists.linuxgazette.net/listinfo.cgi/tag-linuxgazette.net)

[Jul 03, 2011] Be careful with naming servers

Some applications, like Oracle products, are sensitive to the DNS names you use, especially the hostname. They store them in multiple places and there is no easy way to change them in all those places after the Oracle product is installed. They also accept only the long hostname (i.e. box.location.firm.com) instead of the short one.

If you mess up your hostname after a DBA has installed an Oracle product, you usually need to reinstall the box.

Such errors can happen if you copy files from one server to another to speed up the installation and forget to modify the /etc/hosts file, or modify it incorrectly.
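Before a DBA installs anything, it is worth a quick sanity check that the short name, the FQDN and /etc/hosts all agree. A sketch of such a check on Linux (on Solaris, as one of the stories below shows, hostname -f has a very different effect):

hostname                          # short name
hostname -f                       # long form, e.g. box.location.firm.com (Linux only)
getent hosts "$(hostname -f)"     # should resolve to this box's own address
grep -i "$(hostname)" /etc/hosts  # an /etc/hosts entry copied from another box is a common culprit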

[Jun 03, 2011] Sysadmin Tales of Terror by Carla Schroder

February 19, 2003 | Enterprise Networking Planet

Cover One's Behind With Glory

Now let's be honest, documentation is boring and no fun. I don't care; just do it. Keep a project diary. Record everything you find. You don't want to shoulder the blame for someone else's mistakes or malfeasance. It is unlikely you'll get into legal trouble, but the possibility always exists. Record progress and milestones as well. Those in management tend to have short memories and limited attention spans when it comes to technical matters, so put everything in writing and make a point of reviewing your progress periodically. No need to put on long, windy presentations -- take ten minutes once a week to hit the high points. Emphasize the good news; after all, as the ace sysadmin, it is your job to make things work. Any dork can make a mess; it takes a real star to deliver the goods.

Be sure to couch your progress in terms meaningful to the person(s) you're talking to. A non-technical manager doesn't want to hear how many scripts you rewrote or how many routers you re-programmed. She wants to hear "Group A's email works flawlessly now, and I fixed their database server so it doesn't crash anymore. No more downtime for Group A." That kind of talk is music to a manager's ears.

Managing Users

In every business there are certain key people who wield great influence. They can make or break you. Don't focus exclusively on management -- the people who really run the show are the secretaries and administrative assistants. They know more than anyone about how things work, what's really important, and who is really important. Consult them. Listen to them. Suck up to them. Trust me, this will pay off handsomely. Also worth cultivating are relationships with the cleaning and maintenance people -- they see things no one else even knows about.

When you're new on the job and still figuring things out, the last thing you need is to field endless phone calls from users with problems. Make them put it in writing -- email, yellow pad, elaborate trouble-ticket system, whatever suits you. This gives you useful information and time to do some triage.

Managing Remote Users

If you have remote offices under your care, the phone can save a lot of travel. There's almost always one computer-savvy person in every office; make this person your ally and helper. At very least, this person will be able to give you coherent, understandable explanations. At best, they will be your remote hands and eyes, and will save you much trouble.

Such a person may be a candidate for training and possibly transferring to IT. Some people are afraid of helping someone like this for fear of losing out to them in some way. The truth, though, is that you never lose by helping people, so don't let that idea scare you off from giving a boost to a worthy person.

Getting Help

We all know how to use Google, Usenet, and other online resources to get assistance. By all means, don't be too proud -- ask! And by all means, don't be stupid either -- use a fake name and don't mention the company you work for. There's absolutely no upside to making such information public; there are, however, many downsides to doing so, like inviting security breaches, giving away too much information, making your company look bad, and besmirching your own reputation.

As I said at the beginning, these are strategies that have served me well. Feel free to send me your own ideas; I especially love to hear about true-life horror stories that have happy endings.

Resources

Life in the Trenches: A Sysadmin Speaks
10 Tips for Getting Along with People at Work
Linux Administration Books

[Jun 20, 2010] IT Resource Center forums - greatest blunders

Bill McNAMARA

I've done this with people looking over my shoulder (while in single user):

echo "/dev/vg00/lvol6 /tmp vxfs delaylog 0 2" > /etc/fstab
reboot!!

Other good ones:
mv /dev/ /Dev
(try it - and don't ask why!!)

Later,
Bill

Christian Gebhardt

Hi
As a newbie in UNIX I had an Oracle test installation on a production system
production directory: /u01/...
test directory: /test/u01/...

deleting the test installation:
cd /test
rm /u01

OOPS ...

After several bdf commands I noticed that the wrong lvol was shrinking and stopped the delete command with Ctrl-C.

The database still worked without most of the binaries and libraries, and after a restore from tape, without stopping and starting the database, all was ok.

I love oracle ;-)

Chris

harry d brown jr

Learning hpux? Naw, that's not it....maybe it was learning to spell aix?? sco?? osf?? Nope, none of those.

The biggest blunder:

One morning I came in at my usual time of 6am, and had an operator ask me what was wrong with one of our production servers (servicing 6 banks). Well nothing worked at the console (it was already logged in as root). Even a "cat *" produced nothing but another shell prompt. I stopped and restarted the machine and when it attempted to come back up it didn't have any OS to run. Major issue, but we got our backup tapes from that night and restored the machine back to normal. I was clueless (sort of like today)

The next morning, the same operator caught me again, and this time I was getting angry (imagine that). Same crap, different day. Nothing was on any disk. This of course was before we had RAID available (not that that would have helped). So we restored the system from that night's backups and by 8am the banks had their systems up.

So now I have to fix this issue, but where the hell to start? I knew that production batch processing was done by 9PM, and that the backups started right after that. The backups completed around 1am, which were good backups, because we never lost a single transaction. But around 6am the stuff hit the fan. So I had a time frame: 1am-6am, something was clobbering the system. I went through the crons, but nothing really stood out, so I had to really dive into them. This is the code (well almost) I found in the script:

cd /tmp/uniplex/trash/garbage
rm -rf *

As soon as I saw those two lines, I realized that I was the one that had caused the system to crap out every morning. See, I needed some disk space, and I was doing some house cleaning, and I deleted the sub-directory "garbage" from the /tmp/uniplex/trash directory. Of course the script was run by root; the attempt to cd to a non-existent directory failed, cron was still sitting in "/", and it then proceeded to "rm -rf *" my system!

live free or die
harry
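The pattern above, a cd that silently fails followed by rm -rf * running from /, reappears in several stories below. A minimal defensive sketch (the directory name is taken from the story above):

#!/bin/bash
set -e                          # abort on any command failure
dir=/tmp/uniplex/trash/garbage
cd "$dir" || exit 1             # never fall through to the rm if the cd failed
rm -rf -- ./*                   # "./*" and "--" also guard against odd file names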

Bill Hassell

I guess my blunder sets the record for "most clobbered machines" in one day:

I created an inventory script to be used in the Response Center to track all the systems across the United States (about 320 systems). These are all test and problem replication machines but necessary for the R/C engineers to replicate customer problems.

The script was written about 1992 to handle version 7.0 and higher. About 1995, I had a number of useful scripts, and it seemed reasonable to drop these onto all 300 machines as a part of the inventory process (so far, so good). Then about that time, 10.01 was released and I made a few changes to the script. One was to change the useful script location from /usr/local/bin to /usr/contrib/bin because of bad directory permissions. I considered 'fixing' the bad permissions but since these systems must represent the customer environment, I decided to move everything.

Enter the shell option -u. I did not use that option in my scripts, and due to a spelling error, a null environment variable was used in an rm -r, thus removing the entire /usr/local directory on 320 machines overnight.

Needless to say, I never write scripts without set -u at the top of the script.
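A two-line illustration of what set -u buys you (the variable names and paths are made up, and echo is used so the sketch is harmless to run):

#!/bin/bash
set -u                        # abort on any reference to an unset variable
target_dir=/some/scratch/dir
echo rm -r "$targt_dir"/*     # note the typo: with set -u bash aborts here with
                              # "targt_dir: unbound variable" instead of letting
                              # the pattern expand to "/*"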

John Poff

We were doing a disaster recovery drill. I was busy Igniting a V-class server for our database server. I had finally gotten the OS on it after about three hours and I was running a slick little script I had written to recreate all the volume groups and filesystems. My script takes a list of available PVs and does a 'pvcreate -f' on them. Well, we started our drill at midnight [not our idea but we had little choice], so around about 3:30am I was trying to run this script. It was chugging along just fine, pvcreating disks, and then the system hung. Not completely, but pretty much dead. After trying to reboot it, I eventually figured out that when I went through the interactive Ignite, I hadn't paid close attention to which disk Ignite had selected to load the OS on, and it had chosen one of the disks in the EMC array instead of one of the local Jamaica disks. My slick script came along and had pvcreated the disk that had the OS on it. Oops. There went a few more hours of work.

The good news is that after that mess they decided that we would never start a DR drill at midnight!

JP

Dave Johnson

Here is my worst.
We use BCs on our XP512. We stop the application, resync the BC, split the BC, start the application, mount the BC on the same server, start the backup to tape from the BC. Well, I had to add a LUN to the primary and the BC. I recreated the BC. I forgot to change the script that mounts the BC to include the new LUN. The error message from vgimport when you do not include all the LUNs is just a warning, and it makes the volume group available. The backups seemed to be working just fine.
Well, 2 months go by. I did not have enough available disk space to test my backups. (That has been changed.) Then I decided to be proactive about deleting old files. So I wrote a script:
cd /the/directory/I/want/to/thin/out
find . -mtime +30 -exec rm {} \;

Well that was scheduled on cron to run just before backups one night. The next morning I get the call the system is not responding. (I guessed later the cd command had failed and the find ran from /).
After a reboot I find lots of files are missing from /etc /var /usr /stand and so on. No problem, just rebuild from the make_recovery tape created 2 nights before then restore the rest from backup.
Well, step 1 was fine, but the backup tape was bad. The database was incomplete. It took 3 days (that is, 24 hours per day) to find the most recent tape with a valid database. Then we had to reload all the data. After the 3rd day I was able to turn over recovery to the developers. It took about a week to get the application back on-line.
I have sent a request to HP to have the vgimport command changed so a vgimport that does not specify all the LUN's will fail unless some new command line param is used. They have not yet provided this "enhancement" as of the last time I checked a couple of months ago. I now test for this condition and send mail to root as well as fail the BC mount if it does.

Life lesson: TEST YOUR BACKUPS!!

Dave Unverhau

This is probably not too uncommon...needed to shutdown a server for service (one of several lined up along the floor...no...not racked). Grabbed the keyboard sitting on that box and quickly typed the shutdown string (with a -hy 0, of course) and got ready to service the box.

...ALWAYS make sure the keyboard is sitting on the box to which it is connected!

Deepak Extross

We had this developer who claimed that when he runs his program, it complains about /usr/bin/ld. (This was because of a missing shared library, he later discovered) It was decided to backup /usr/bin/ld and replace it with 'ld' from another machine on which his program worked.
No sooner was ld moved, than all hell broke loose.
Users got coredumps in response to even simple commands like "ls", "pwd", "cd"... New users could not telnet into the system and those who were logged in were frozen in their tracks.

Both the developer and admin are still working with us...

RAC

Well I was very very new to HP-UX. Wanted to set up PPP connection with a password borrowed from a friend so that I could browse the net.

Did not worry that the remote support modem can not dial out from remote support port.
Went through all documents available, created device files a dozen times, but it never worked. In anguish did rm -fr `ltr|tail -4|awk '{print $9}'`
(That to pacify myself that I know complex commands)

But alas, I was in /sbin/rc3.d.

Thought this is not going to work and left that.
Another colleague, not aware of this, rebooted the system for a Veritas NetBackup problem.

Within next two hours HP engineer was on-site. Was called by colleague.

Was watching whole recovery process, repeatedly saying "I want to learn, I want to learn"

Then came to know that can not be done.

Dave Johnson

Hey Bill,

When I reinstalled the OS from the make_recovery tape it wiped out the script I wrote and the entry in the cron. There is no evidence of what happened or who was responsible. I did, however, go straight to my boss to confess and take the blame. That, above all, is probably the strongest reason, next to being able to recover at least some of the data, why I was not terminated for it.

Did I mention in the first post this happened Feb of 2002????

Simon Hargrave

1. On a live Sun Enterprise server, you turn the key one way for maintenance, and one way for off. I wanted to turn it to maintenance but wasn't sure which way to turn it. Guess which way I chose...

2. On an XP512 I accidentally business-copied a new LUN over the top of a live LUN, because I put the wrong LUN ID in!!! Luckily the live data's backup had finished a full 3 minutes earlier...phew!

3. I can't take credit for this one, my ex-boss did it, but I had to include it. On Solaris he added a filesystem in the vfstab file, but put the wrong device in the raw-device field. Consequently all backups backed up the wrong device, so when the data got trashed and required restoring, it...um...didn't exist on tape! Luckily for him he'd left the company 2 months before and I was left to explain what a halfwit he was ;)

Dave Chamberlin

I have stepped in it with tar on a couple of occasions. I moved a tar file from a production box to a development box, but I had tarred it with an absolute path. When I untarred it, it overwrote the existing directory, destroying all the developers' updates! I have also been burned by the fact that xvf and cvf are very close on the keyboard, so my command to extract a tar came out once as tar -cvf, which of course erased the tar file.
The only other bad blunder was doing an lvreduce on a mounted file system, thinking I was recovering space without affecting the other files on the volume. Luckily, they were backed up...

Martin Johnson

One of my coworkers decided to set up a pseudo root (UID=0) account for himself. He used useradd to create the account and made / his home directory. He was unaware that useradd does a "chown -R" to the home directory. So he became the owner of all files on the system. This was a pop3 mail server system, and the mail services did not like the change.

My coworker left for the day, leaving me with angry VPs looking over my shoulder demanding to know when email services will be back.

Marty <the coworker is now known as "chown boy">

fg

Greatest gaffe: Taking the word of someone who I thought knew what they were doing and had taken the proper precautions to ensure a recovery method for a rebuild of filesystems,

to make a long story short, no backup, no make_recovery, and then rebuilt filesystems. Data lost and had to rebuild. Recovered most of the data except for previous 24hrs.

MORAL of the story: Always have backups and make_recovery tapes done.

Richard Darling

When I upgraded from 10.20 to 11.0 I finished the system install, and then used cpio to copy my user applications. One of the vendors had originally had their app installed in /usr (before my time), and I copied the app up one directory and wiped out /usr. By the way, I didn't back up the installation before the cpio copy. It was a Friday night and I wanted to get out...figured I could backup after getting the apps copied over...learnt an important lesson regarding backups that night...
RD

Belinda Dermody

Writing a script to chmod -R to r/w for the world on a dir. Not doing a check to see if I was in the proper directory, and all of a sudden my bin directory files were all 666. Luckily I had multiple windows open and it hadn't gotten to the sbin directory yet. Had a few inquiries why certain commands wouldn't work before I got it all back correctly. From then on, I do $? and check the return status before I issue any remove or chmod commands.

Ian Kidd

I was going to vi a script that performs a cold-backup of an oracle database. Since we prefer not to be root all the time, we use sudo.

So I typed, "sudo", but then was interrupted by someone. I then typed the name of the script when that person left. Nothing appeared on the screen immediately, so I got a coffee.

When I came back, I saw "sudo {script}" and realized, about a minute before the DBAs started screaming that their database was down, that I had started a cold backup in the middle of a production day.

Duncan Edmonstone

My worst two:

Installing a server in a major call centre of a US bank...

I built the OS as required by our apps team in the US, and following our build standards put the system into trusted mode.

They installed the app, and realised they'd forgotten to ask me to put the system into NIS (system could be used by any of the call centre reps in over 40 call centers - a total of 15,000 NIS based accounts!) It's the middle of the night in the UK, so the apps team get a US admin to set up the system as a NIS client. (yes it shouldn't work when the box is trusted, but it does!)

Next day, the apps team is complaining about some stuff not working - can I take the system out of trusted mode so we can discount that? Sure course I can - I run tsconvert and wait.... and wait.... and wait.... hmmm - this usually takes about 30 seconds - what gives?

Try to open another window to check whats happening - can't log in as root, the password that worked two minutes ago no longer works!

Next root file system full messages start to scroll up the screen!

It turns out that tsconvert is busy taking ALL the NIS accounts and putting them in the /etc/passwd file (yes all 15,000 of them) and guess what? There's a root account in NIS!

All I can say is thank god for good backups!

The other one was a typical junior admin mistake which comes from not understanding shell file name generation fully:

A user can't log in, I go take a look at his home directory and note the permissions on his .profile are incorrect. I also note that the other '.' files are incorrect, so I do this:

cd /home/user
chmod 400 .*

I call the user and tell him to try again - he says he still can't log in! Huh?

So I go back and carry on looking for the problem, but before I know it the phone is ringing off the hook! No-one can log in now!

And then it dawns on me

I type the following:

cd /home/user
echo .*

and that returns (of course)

. .. .cshrc .exrc .login .profile .sh_history

Oops, I didn't just change the permissions on the user's '.' files - I also changed the permissions on the user's directory, and (crucially!) the user's parent directory /home!

These days I always use echo to check my file name pattern matching logic when doing this kind of thing...

We live and learn


Duncan
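The usual way around this trap is a glob that cannot expand to "." or "..", combined with previewing the expansion with echo. A small sketch:

cd /home/user
echo .[!.]* ..?*        # preview: matches dotfiles but never "." or ".."
chmod 400 .[!.]* ..?*   # a pattern that matches nothing is passed literally,
                        # so chmod may complain about it, but nothing is harmed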

Vincent Fleming

I have been way too fortunate not to have really blundered all that bad (I've mostly done development), but one I've seen was a real good one...

The "security auditor", who apparently knew absolutely nothing about UNIX, was reviewing our development system, and decided that /tmp having world read/write permissions was not a good thing for security - so, in the middle of the day, he chmod 744 /tmp ... suddenly, 200+ developers (including myself) on the machine (it was a *very* large machine back in 1990) were unable to save their editor sessions!

So, of course, I use the "wall" command to point out their error so they can fix it quickly and I can save my 2+ hours of edits:

$ wall
who's the moron who changed the permissions on /tmp????
.
$

The funny thing was that I was the one they escorted out of the building that day...

The hazards of being a contractor and publicly humiliating an employee...

Jerry Jordak

This one wasn't my fault, but is still funny.

One time, we had to add disk space to one of our servers. My manager at the time was also in charge of the EMC disk environment, so he allocated an extra disk to the server. I configured the disk into the OS, did a pvcreate on it, and proceeded to add it to the volume group, extend the filesystem, etc...

At about that same time, another one of our servers started going absolutely nuts. It turned out that he had accidentally given me a disk that was already allocated to that other system. That drive had held the binaries for that server's application. Oops...

Tom Danzig

As root:

find / -u 201 -chown dansmith

Did this after changing a user ID to another number. Should have used "-user" and not -u (I had usermod on my mind). The system gladly ignored the -u and started changing all files to user dansmith (/etc/passwd, /etc/hosts, etc). Needless to say, the system was hosed.

Was able to recover fine from make_recovery tape. Fortunately this was also a test box and not production.

Oh well ... live and learn! Mistakes are only bad if you don't learn from them.
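For the record, the intended operation (re-owning only the files that still belong to the old numeric UID) would look something like the sketch below; the UID and user name are taken from the story.

# Sketch: change ownership only of local files still owned by UID 201
find / -xdev -user 201 -exec chown dansmith {} +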

Mark Fenton

Back in '92 on a NIS network, I meant to wipe out a particular user's directory, but was one level up from it when I issued rm -r *. Took three hours to back up all home directories on the network....

Last year, I discovered that new is not necessarily better. Updating DB software, I blithely stopped the DB, copied the new software in, and restarted. Users couldn't get any processing done that day -- it seems that there was a conversion program that was *supposed* to run that didn't. But that wasn't the blunder -- the blunder is that the most recent backup had been two days previous, so all the previous day's processing was gone... (and that had been an overtime day, too!)

Keely Jackson

My greatest blunder:

The guy who set up the live database had done it as himself rather than as a separate dba user. He left the company and his user id was re-allocated to somebody in HR. The guy in HR subsequently left as well.

One day I decided to tidy up the system and remove this user. I did this via sam, and selected the option to delete all the user's files, thinking that nobody who was in HR could possibly own any important files.

Unfortunately I was somewhat mistaken. Of course the guy in HR now owned all the database files. The first thing I knew was when the users started to complain that the database was no longer available. I got the db back from restore but everybody had lost half a day's work.

Needless to say, I now do not delete old users' files but re-allocate them to a special 'leavers' user and check them all before deleting anything.

A good HP blunder.

HP were moving the live server - a K420 - between sites and the removal men managed to drop it down a flight of stairs. It landed on one of them, who then had to be taken to hospital. Fortunately he was only bruised, while the machine had a huge dent in it. Anyway, it got moved to the other site and booted up straight away with no problems. That is what I call resilient hardware. As a precaution disks etc were changed, but it is still running quite happily today.

Cheers
Keely

Michael Steele

    When I was first starting out I worked for a Telecom as an 'Application Administrator' and I sat in a small room with a half a dozen other admins and together we took calls from users as their calls escalated up from tier I support. We were tier II in a three tier organization.

    A month earlier someone from tier I confused a production server with a test server and rebooted it in the middle of the day. These servers were remotely connected over a large distance so it can be confusing. Care is needed before rebooting.

    The tier I culprit took a great deal of abuse for this mistake and soon became a victim of several jokes. An outage had been caused in a high availability environment which meant management, interviews, reports; It went on and on and was pretty brutal.

    And I was just as brutal as anyone.

    Their entire organization soon became victimized by everyone from our organization. The abuse traveled right up the management tree and all participated.

    It was hilarious, for us.

    Until I did the same thing a month later.

    There is nothing more humbling than 2000 people all knowing who you are for the wrong reason, and I have never longed for anonymity more.

    Now I always do a 'uname' or 'hostname' before a reboot, even when I'm right in front of it.

Geoff Wild

Problem Exists Between Keyboard And Chair:

Just did this yesterday:

tar cvf - /sapmnt/XXX | tar xvf -

Meant to do:

tar cvf - /sapmnt/XXX | (cd /sapmnttest/XXX ;tar xvf -)

Needless to say, I corrupted most of the files in /sapmnt/XXX

Rgds....Geoff

Suhas

1. Imagine what would have happened when, on a Solaris box, while taking a backup of ld.so.1, instead of doing "cp", "mv" was done !!! As most of you would be aware, ld.so.1 is the library file that is accessed by every dynamically linked program. The next 1 hour was sheer chaos .. and the worst hour ever experienced!!!!
Lesson Learnt: "Look before you leap !!!"

2. Was responsible for changing the date on the back-up master server by nearly a year. That night was the most horrifying night of my life.
Lesson Learnt: "A typo error can cost you anything between $0 and infinity."

Keep forumming !!!!
Suhas

[Jun 12, 2010] Sysadmin Blunder (3rd Response) - Toolbox for IT Groups

chrisz

Did one also.

I was in a directory off of root and performed the following command:

chown -R someuser.somegroup .*

I didn't think much of the command, just wanted to change the owner and
group for all files with a . in the front of them and subdirs. Went well
for the files in the current directory until it reached the .. file
(previous directory). All the files and subdir's off of root changed to the
owner and group specified. I was wondering why the command was taking so
long to complete. BTW, it changed the owner and group for all NFS files
too! That's when the real fun started.
Some days you're the windshield, other days you're the bug!

Dan Wright

It didn't really cause any significant damage, but about 10 years ago, I had
recently become an admin of a network of mostly NeXT machines which were new
to me and the default shell was c-shell, which I also wasn't very familiar
with.

I had dialed in from home one night to play around and become more familiar
with how things worked on NeXTStep.

In an attempt to kill a job, I typed in "kill 1" instead of "kill %1" - and
it probably was actually a "kill -9 1" and of course I was root.

And of course, 1 was the init process. I immediately lost connection and
had to do a hard reboot on that machine the next day before that user got in
(for some reason, the machine with the modem wasn't in my office, it was in
someone else's).

Fortunately, that wasn't a critical machine outside of normal business
hours. No harm, no foul, eh?


If you like this kind of story, there are a bunch here:

http://www2.hunter.com/~skh/humor/admin-horror.html

User123731

I have in the past touched a file called "-i" in important directories. This
will cause rm to see the "-i" and make the rm interactive before it acts on
other files/dirs if you do not specify a particular directory.
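Creating, and later removing, a file whose name starts with a dash takes a little care, since most commands would otherwise parse it as an option. A small sketch:

touch ./-i     # create the decoy; a bare "touch -i" would be read as an option
ls -- -i       # "--" ends option processing, so the name is taken literally
rm -- -i       # remove the decoy the same way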

User451715:

Ha!

That's an easy one.. My first position as a Junior Admin in HPUX working in First line support about eight years ago..

I was working on a server, moving some files around, and mistakenly moved all of the files in the /etc directory to a lower level directory (about 10 sub-dirs down)..

I sat there at the console wide-eyed, my heart dropped, and I turned and looked out the window, and saw my job sailing out of it, since this was a server that was being prepared to be deployed and that a month's worth of work would have been wasted.

Luckily, a Senior Admin who later became my greatest mentor (Phil Gifford), took pity on my situation, and we sat there and recovered the /etc directory before anyone knew what had happened.. The key here was, he walked me through the necessary steps to recover files from an ignite tape, and voila!

Needless to say, I learned all about why seasoned UNIX admins protect root privilege as if it were the 'Family Jewels'.. <chuckle>

Mike E.-

Bryan Irvine :

My biggest blunder wasn't on an AIX box but applies to the thread. I
once made an access list for a cisco box, and forgot that there is an
implicit "deny all" rule at the end. So I made my nifty access list and
enabled it, tested the traffic to see if it was blocked and lo and
behold it seemed to be working. Great! I went on with my life and
figured I'd go read some news or something. uhhhh that didn't work. tried
email..that didn't work. tried traceroutes, they all died at the router
I was just working on...then the phones started ringing.

*click*

lightbulb went on in my head and I ran as fast as I could to the router
to reboot it (lucky I hadn't written the nvram)

The phone didn't stop ringing for 45 minutes even though the problem
only existed for about 4 minutes.

But then, what do ya expect when you kill internet traffic at 5
locations across 2 states?

the guys on the cisco list said that if you haven't done similar you are
lying about the 5 years experience on your resume ;-)


--Bryan

jxtmills:

I needed to clear a directory full of ".directories" and I issued:

rm -r .*

It seemed to find ".." awfully fast. That was a bad day, but, we restored most of the files in about an hour.

John T. Mills

bryanwun:

I thought I was in an application dir but instead was in /usr
and did chown -R to a low level user
on top of that I did not have a mksysb backup,
and the machine was in production.
It continued to function for the users ok
but most shell commands returned nothing
I had to find another machine with the same OS and Maint level
write a script to gather ownership permissions
then write another script to apply the permissions to the
damaged machine. this returned most functionality
but I still can't install new software with installp;
it goes through the motions then nothing is changed.

alain:

hi every one

for me , i remember 2

1 - 'rm *;old' in the / directory (note the ';' instead of '.')

2 - kill #pid of the informix process (oninit) and delete it (i dreamed)

jturner:

Variation on a theme ... the 'rm -r theme'

as a junior admin on AIX3.2.5, I had been left to my own devices to create
some housekeeping scripts - all my draft scripts being created and tested
in my home directory with a '.jt' suffix. After completing the scripts I
found that I had inadvertently placed some copies in / with a .jt suffix.
Easy job then to issue a 'rm *.jt' in / and all would be well. Well it
would have been if I hadn't put a space between the * and the .jt. And the
worst thing of all, not being a touch typist and looking at the keys, I
glanced at the screen before hitting the enter key and realised with horror
what was going to happen and STILL my little finger continued to proceed
toward the enter key.

Talk about 'Gone in 60 seconds' - my life was at an end - over - finished
- perhaps I could park cars or pump gas for a living. Like other
correspondents a helpful senior admin was on hand to smile kindly and show
me how to restore from mksysb-fortunately taken daily on these production
systems. (Thanks Pete :))) )

To this day, rm -i is my first choice with multiple rm's just as a
test!!!!!!

Happy rm-ing :)

daguenet:

I know that one. Does anybody remember when the rm man page had a
warning not to do rm -rf / as root? How many systems were rebuilt due to that
blunder? Not that I have ever done something like that, nor will I ever
admit to it :).

Aaron

cal.staples:

That is a no brainer!

First a little background. I cooked up a script called "killme" which
would ask for a search string then parse the process table and return a
list of all matches. If the list contained the processes you wanted to
kill then you would answer "Yes", not once, but twice just to be sure
you thought about it. This was very handy at times so I put it out on
all of our servers.

Some time had passed and I had not used it for a while when I had a need
to kill a group of processes. So I typed the command not realizing that
I had forgotten the scripts name. Of course I was on our biggest
production system at that time and everything stopped in its tracks!

Unknown to me was that there is an AIX command called "killall" which is
what I typed.

From the MAN page: "The killall command cancels all processes that you
started, except those producing the killall process. This command
provides a convenient means of canceling all processes created by the
shell that you control. When started by a root user, the killall command
cancels all cancellable processes except those processes that started
it."

And it doesn't ask for confirmation or anything! Fortunately the
database didn't get corrupted and we were able to bring everything back
on line fairly quickly. Needless to say we changed the name of this
command so it couldn't be run so easily.

"Killall" is a great command for a programmer developing an application
which goes wild and he/she needs to kill the processes and retain
control of the system, but it is very dangerous in the real world!

Jeff Scott:

The silliest mistake? That had to be a permissions change on /bin. I got a
call from an Oracle DBA that the $ORACLE_HOME/bin no longer belonged to
oracle:dba. We never found out how that happened. I logged in to change the
permissions. I accidentally typed cd /oracle.... /bin (note the space
before /bin), then cheerfully entered the following command:

#chown -R oracle:dba ./*

The command did not climb up to root fortunately, but it really made a mess.
We ended up restoring /bin from a backup taken the previous evening.


Jeff Scott
Darwin Partners/EMC



tzhou:

crontab -r when I wanted to do crontab -e. The letters e and r are side by
side on the keyboard. I had 2 pages of crontab and had no backup on the
machine!

Jeff Scott:

I've seen the rm -fr effect before. There were no entries in any crontab.
Before switching to sudo, the company used a homegrown utility to grant
things root access. The server accepted ETLs from other databases, acting as
a data warehouse. This utility logged locally, instead of logging via
syslogd with local and remote logging. So, when the system was erased, there
really was no way to determine the actual cause. Most of the interactive
users shared a set of group accounts, instead of enforcing individual
accounts and using su - or sudo. The outage cost the company $8 million USD
due to lost labor that had to be repeated. Clearly, it was caused by a
cleanup script, but it is anyone's guess which one.

Technically, this was not a sysadmin blunder, but it underscores the need
for individual accounts and for remote logging of ALL uses of group
accounts, including those performed by scripts.

It also underscores the absolute requirement that all scripts have error
trapping mechanisms. In this case, the rm -fr was likely preceded by a cd to
a nonexistent directory. Since this was performed as root, the script cd'd
to / instead. The rm -fr then removed everything. The other possibility is
that it applied itself to another directory, but, again, root privileges
allowed it to climb up the directory tree to root.

Aneesh Mohan

Hi Siv,

The greatest blunder from my side: I created an lvol named /dev/vg00/lvol11 and did newfs on /dev/vg00/rlvol1 :)


The second greatest blunder from my side was corrupting the root filesystem with the 2 steps below :)


#lvchange -C n /dev/vg00/lvol03

#lvextend -L 100 /dev/vg00/lvol03

Cheers,
Aneesh

[Jun 09, 2010] Halloween - IT Admin Horror Stories

Zimbra Forums

Well ... I was working for a large multi-national running HP-UX systems and Oracle/SAP, and one day the clock struck twelve and the OS just started to disappear. Down went SAP and Oracle like a sack of spuds!

Mayhem broke out, with the IT manager standing over my shoulder wanting to know what had happened ... I did not have a clue, and I could not even get onto the system as it was completely hosed! So the task of restoring the server began, and after 30 minutes I had everything back up and running again. phewww

Until 1pm! The system started disappearing again. What the hell is going on? Panic set in; this time I managed to keep a couple of sessions open to allow me to check the system.

And then it clicked .... I wonder .... Yep indeed, somebody had setup a cronjob AS ROOT, that attempted to 'cd' to a directory which then proceeded with a 'rm -rf *'

Though the ******* other admin did not verify that the directory existed before performing the remove! Well, once we had restored the system again, the cronjob was removed, and we were all running fine again.

Moral of the story is to always protect root access and ensure you have adequate backups!!!

[Jun 06, 2010] NFS-export as a poor man backdoor

You can't log-in to the box if /etc/passwd or /etc/shadow are gone...

Ric Werme: Oct 10, 2007 18:05:52 -0700

Bill McGonigle once learned:

> rm lets you remove libc too.  DAMHINT.

I managed to salvage one system because I had NFS-exported / and
could gain write access from another system.

After that I often did the export before replacing humorless files
like libc.so and sometimes did the update with NFS.  It was a
struggle to remember to type the /mnt before the /etc/passwd, so
I tried to cd to the target directory before copying files in.

  -Ric Werme

[Jun 06, 2010] Security zeal ;-)

Good judgment comes from experience, experience comes from poor judgment. So do new jobs...Sometimes, even entirely new careers!

> On 10/9/07, John Abreau <[EMAIL PROTECTED]> wrote:
>> ... I looked in /bin for suspicious files, and that was the
>> first time I ever noticed the file [ . It looked suspicious, so
>> of course I deleted it. :-/

[Jun 05, 2010] Directory formerly known as /etc ;-)

Tom Buskey
Thu, 11 Oct 2007 06:18:27 -0700

On 10/10/07, Bill McGonigle <[EMAIL PROTECTED]> wrote:
>
>
> On Oct 9, 2007, at 17:31, Ben Scott wrote:
>
> >   Did you know 'rpm' will let you remove every package from the
> > system?
>
> rm lets you remove libc too.  DAMHINT.


I had a user call about a user supported system that was having issues.  We
explicitly do not support it and the users only use the root account.

He gave me the root account to login and I couldn't.  I went to his system &
looked around.  /etc was empty.  I told him he was fsked and he should ftp
any files he wanted to elsewhere & that he wouldn't be able to login again
or reboot.  In any event, we were not supporting it.

Sure enough, a help desk ticket came in for another admin, claiming that the
system got corrupted during bootup.  Why do users lie so often?  All it does
is obscure the problem...
Hmmm. Did you check lost+found? I've had similar symptoms only to
discover that there was indeed a bad sector that remapped all of /etc/
and some of /var and /usr. fsck didn't help much until I moved the drive
to another system and ran fsck there.

But you're right - if its not supported, then they'll have to go elsewhere to get this done.

BTW: My point is: the user may not have lied, but was just calling the shots as s/he saw them.

--Bruce

[May 26, 2010] Never ever play fast and loose with the /boot partition.

Here is a recent story connected with the upgrade of the OS (in this case Suse 10) to a new service pack (SP3).

After the upgrade the sysadmin discovered that the /boot partition was no longer mounted; instead there was a /boot directory on the root partition, populated by the update. This is the so-called "split kernel" situation, when one (older) version of the kernel boots and then finds different (more recent) modules in /lib/modules and complains. The reason for this strange behavior of the Suse update was convoluted and connected with the LVM upgrade it contained, after which LVM blocked mounting of the /boot partition.

Easy, he thought. Let's boot from the DVD, mount the boot partition on, say, /boot2 and copy all files from the /boot directory to the boot partition.

And he did exactly that. To make things "clean" he first wiped the "old" boot partition and then copied the directory.

After rebooting the server he saw the GRUB prompt; it never got to the menu. This was a production server and the time slot for the upgrade was 30 min. The investigation, which now involved other sysadmins and took three hours (the server needed to be rebooted, backups retrieved from tape to another server, etc.), revealed that the /boot directory did not contain a couple of critical files, such as /boot/message and /boot/grub/menu.lst. Remember, the /boot partition had been wiped clean.

BTW, /boot/message is the file used by the graphical boot menu, and GRUB stops processing /boot/grub/menu.lst when it encounters the instruction

gfxmenu (hd0,1)/message

Here is an actual /boot/grub/menu.lst.

# Modified by YaST2. Last modification on Thu May 13 13:43:35 EDT 2010
default 0
timeout 8
gfxmenu (hd0,1)/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 splash=silent showopts
initrd /initrd-2.6.16.60-0.54.5-smp

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 showopts ide=nodma apm=off acpi=off noresume edd=off 3
initrd /initrd-2.6.16.60-0.54.5-smp

Luckily there was a backup done before this "fix". Four hours later the server was bootable again.
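Whatever the recovery plan, archiving the current contents of /boot, and the partition about to be reused, costs seconds and would have turned this into a non-event. A sketch, with the device name as a placeholder:

# Sketch: keep throwaway copies before wiping or repopulating /boot
tar -czf /root/boot-before-fix.$(date +%Y%m%d).tar.gz /boot
dd if=/dev/sda2 of=/root/boot-partition.img bs=1M   # raw copy of the small boot partition (example device)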

Sysadmin Stories: Moral of these stories

October 19, 2009 | UnixNewbie.org

From: [email protected] (John Jarocki)
Organization: Advanced Micro Devices, Inc.; Austin, Texas

- Never hand out directions on "how to" do some sysadmin task until the directions have been tested thoroughly.

- Corollary: Just because it works on one flavor of *nix says nothing about the others. '-}

- Corollary: This goes for changes to rc.local (and other such "vital" scripties).

2

From: [email protected] (Eric Wedaa)
Organization: Advanced Micro Devices, Inc.

-NEVER use 'rm', use 'rm -i' instead.
-Do backups more often than you go to church.
-Read the backup media at least as often as you go to church.
-Set up your prompt to do a `pwd` every time you cd (see the sketch after this list).
-Always do a `cd .` before doing anything.
-DOCUMENT all your changes to the system (we use a text file
called /Changes).
-Don't nuke stuff you are not sure about.
-Do major changes to the system on Saturday morning so you will
have all weekend to fix it.
-Have a shadow watching you when you do anything major.
-Don't do systems work on a Friday afternoon (or any other time
when you are tired and not paying attention).
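For bash users, the two prompt-related tips above can be approximated with a line or two in a dot file; a sketch (format and colors are a matter of taste):

# Sketch for ~/.bashrc: keep the host and current directory in front of your eyes,
# which helps with both "wrong box" and "wrong directory" blunders
PS1='[\u@\h \w]\$ '
# or recompute and print them before every prompt:
PROMPT_COMMAND='printf "%s:%s\n" "$(hostname -s)" "$PWD"'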

3

From: [email protected] (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501

1) The "man" pages don't tell you everything you need to know.
2) Don't do backups to floppies.
3) Test your backups to make sure they are readable.
4) Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
5) Strenuously avoid systems with inadequate backup and restore programs wherever possible (thank goodness for "restore" with an "e"!).
6) If you've never done sysadmin work before, take a formal training class.
7) You get what you pay for. There's no substitute for experience.
9) It's a lot less painful to learn from someone else's experience than your own (that's what this thread is about, I guess).

4

From: [email protected] (Jim Harkins)
Organization: Pacific Data Products

If you appoint someone to admin your machine you better be willing to train them. If they've never had a hard disk crash on them you might want to ensure they understand hardware does stuff like that.

5

From: [email protected]
Organization: Department of Computer Science, University of York, England

Beware anything recursive when logged in as root!

6

From: [email protected] (Mike Matthews)
Organization: /etc/organization

*NEVER* move something important. Copy, VERIFY, and THEN delete.

7

From: [email protected] (Squish)
Organization: Human Interface Technology Lab (on vacation)

When you are typing some BIG command, reread what you've typed about 100 times to make sure it's sunk in (:

8

From: Nick Sayer

If / is full, du /dev.

9

From: [email protected]
Organization: Wesleyan College

Never ever assume that some prepackaged script that you are running does anything right.

Admin Stories UnixNewbie.org

This is a modified list from "The Unofficial Unix Administration Horror Story Summary" by Anatoly Ivasyuk.

" Creative uses of rm
" How not to free up space on your drive
" Dealing with /dev files
" Making backups
" Blaming it on the hardware
" Partitioning the drives
" Configuring the system
" Upgrading the system
" All about file permissions
" Machine dependencies
" Miscellaneous stories (a.k.a. 'oops')
" What we have learned

My 10 UNIX Command Line Mistakes by Vivek Gite


Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory (it was done by the previous sysadmin and it was my first day at work) and mail spool of the user to be removed. I just wanted to remove the user account and I ended up deleting everything (note: -r was activated via deluser.conf):
userdel foo

Rebooted Solaris Box

On Linux the killall command kills processes by name (killall httpd). On Solaris it kills all active processes. As root I killed all processes; this was our main Oracle db box:
killall process-name

Destroyed named.conf

I wanted to append a new zone to the /var/named/chroot/etc/named.conf file, but ended up running:
./mkzone example.com > /var/named/chroot/etc/named.conf

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I end up deleting entire backup (note -c switch instead of -x):

cd /mnt/bacupusbharddisk

tar -zcvf project.tar.gz functions

I had no backup. Similarly I end up running rsync command and deleted all new files by overwriting files from backup set (now I've switched to rsnapshot)
rsync -av -delete /dest /src
Again, I had no backup.

Deleted Apache DocumentRoot

I had sym links for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about symlink issue. To save disk space, I ran rm -rf on http directory. Luckily, I had full working backup set.

Accidentally Changed Hostname and Triggered False Alarm

Accidentally changed the current hostname (I wanted to see current hostname settings) for one of our cluster node. Within minutes I received an alert message on both mobile and email.
hostname foo.example.com

Public Network Interface Shutdown

I wanted to shutdown VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:
ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update firewall rules. After a quick kernel upgrade, I had rebooted the box. I had to call remote data center tech to reset firewall settings. (now I use firewall reset script to avoid lockdowns).

Typing UNIX Commands on Wrong Box

I wanted to shutdown my local Fedora desktop system, but I issued halt on remote server (I was logged into remote box via SSH):
halt
service httpd stop

Wrong CNAME DNS Entry

Created a wrong DNS CNAME entry in example.com zone file. The end result - a few visitors went to /dev/null:
echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I've learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions (see the Torture-testing Backup and Archive Programs paper).
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot.
  4. Use CVS to store configuration files.
  5. Wait and read the command line again before hitting the damn [Enter] key.
  6. Use your well tested perl / shell scripts and open source configuration management software such as Puppet, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.

Mistakes are inevitable, so have you made any mistakes that have caused some sort of downtime? Please add them in the comments below.
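Item 3 above deserves a concrete illustration: keeping dated, hard-linked snapshots instead of rsync-ing into a single directory means that a botched source can no longer destroy the only copy. A minimal sketch (paths are examples; rsnapshot automates the same idea):

#!/bin/bash
# Sketch: poor man's snapshot backup with rsync --link-dest
src=/home
dest=/backup/home
today=$(date +%Y-%m-%d)
rsync -a --delete --link-dest="$dest/latest" "$src/" "$dest/$today/"
ln -sfn "$today" "$dest/latest"   # unchanged files become hard links, so snapshots stay cheap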

Jon 06.21.09 at 2:42 am
My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed
rm -rf /tmp/foo/bar/ *
instead of
rm -rf /tmp/foo/bar/*
The system doesn't run very well without all of its libraries……
georgesdev 06.21.09 at 9:15 am
never type anything such as:
rm -rf /usr/tmp/whatever
maybe you are going to type enter by mistake before the end of the line. You would then for example erase all your disk starting on /.

if you want to use -rf option, add it at the end on the line:
rm /usr/tmp/whatever -rf
and even this way, read your line twice before adding -rf

3ToKoJ 06.21.09 at 9:26 am
public network interface shutdown … done
typing unix command on wrong box … done
Delete apache DocumentRoot … done
Firewall lockdown … done, with a NAT rule redirecting the configuration interface of the firewall to another box; a serial connection saved me

I can add, being trapped by aptitude keeping tracks of previously planned - but not executed - actions, like "remove slapd from the master directory server"

UnixEagle 06.21.09 at 11:03 am

Rebooted the wrong box
While adding alias to main network interface I ended up changing the main IP address, the system froze right away and I had to call for a reboot
Instead of appending text to an Apache config file, I overwrote its contents
Firewall lockdown while changing the ssh port
Wrongly ran a script containing recursive chmod and chown as root on /; it caused me a downtime of about 12 hours and a complete re-install

Some mistakes are really silly, and when they happen, you don't believe yourself that you did that, but every mistake, regardless of its silliness, should be a learned lesson.
If you made a trivial mistake, you should not just overlook it; you have to think of the reasons that made you do it, like: you didn't have much sleep, or your mind was confused about personal life, or …..etc.

I like Einstein's quote, you really have to do mistakes to learn.

Selected Comments
7 smaramba 06.21.09 at 11:31 am
typing unix command on wrong box and firewall lockdown are all time classics: been there, done that.
but for me the absolute worst, on linux, was checking a mounted filesystem on a production server…

fsck /dev/sda2

the root filesystem was rendered unreadable. system down. dead. users really pissed off.
fortunately there was a full backup and the machine rebooted within an hour.

8 od 06.21.09 at 12:50 pm
"Typing UNIX Commands on Wrong Box"

Yea, I did that one too. Wanted to shut down my own vm but I issued init 0 on a remote server which I accessed via ssh. And oh yes, it was the production server.

10 sims 06.22.09 at 2:23 am
Funny thing, I don't remember typing in the wrong console. I think that's because I usually have the hostname right there. Fortunately, I don't do the same things over and over again very much. Which means I don't remember command syntax for all but the most used commands.

Locking myself out while configuring the firewall – done – more than once. It wasn't really a CLI mistake though. Just being a n00b.

georgesdev, good one. I usually:

ls -a /path/to/files
to double check the contents
then up arrowkey homekey hit del a few times and type rm. I always get nervous with rm sitting at the prompt. I'll have to remember that -rf at the end of the line.

I always make mistakes making links. I can never remember the syntax. :/

Here's to less CLI mistakes… (beer)

Grant D. Vallance 06.22.09 at 7:56 am
A couple of days ago I typed and executed (as root): rm -rf /* on my home development server. Not good. Thankfully, the server at the time had nothing important on it, which is why I had no backups …

I am still not sure *why* I did it, when I have read all the warnings about using this command. (A dyslexic moment with the syntax?)

Ah well, a good lesson learned. At least it was not the disaster it could have been. I shall be *very* paranoid about this command in the future.

Joren 06.22.09 at 9:30 am
I wanted to remove the subfolder etc from the /usr/local/matlab/ directory. Out of force of habit (from going to the /etc folder) I accidentally added the '/' and, from the /usr/local/matlab directory, typed:

sudo rm /etc

instead of

sudo rm etc

Without the entire /etc folder the computer didn't work anymore (which was to be expected, of course) and I ended up reinstalling my computer.

Robsteranium 06.22.09 at 11:05 am
Aza Raskin explains how habituation can lead to stupid errors: confirming "yes I'm sure / overwrite file, etc." automatically without realizing it. Perhaps rm and the > operator need an undo / built-in backup…
Ramaswamy 06.22.09 at 10:47 am
Deleted the files
I used to place some files in /tmp/rama and some conf files in /home//httpd/conf
I used to swap between these two directories by "cd -"
Executed the command rm -fr ./*
It was supposed to remove the files in /tmp/rama/*, but ended up removing the files in /home//httpd/conf/*, without any backup
Yonitg 06.23.09 at 8:06 am
Great post !
I did my share of system mishaps,
killing servers in production, etc.
the most embarrassing one was sending 70K users the wrong message.
Or better yet, telling the CEO we had a major crisis, gathering up many people to solve it, and finding out it was nothing at all while all the management was standing in my cube.
Solaris 06.23.09 at 8:37 pm
Firewall lock out: done.
Command on wrong server: done.

And the worst: update and upgrade while some important applications were running, of
course on a production server.. as someone mentioned the system doesn't run very well
without all of its original libraries :)
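
Several comments above mention firewall lockouts. A common safety net when changing rules over ssh (a sketch, assuming an iptables box and that /root/iptables.good is a snapshot you made yourself) is to schedule an automatic rollback before touching anything, and cancel it only after you confirm you can still log in:

iptables-save > /root/iptables.good                                  # snapshot the working ruleset
echo 'iptables-restore < /root/iptables.good' | at now + 5 minutes   # schedule a rollback
# ... apply the new rules, then open a second ssh session to verify access ...
atq                                                                  # find the rollback job number
atrm 42                                                              # cancel it (42 is whatever atq reported)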

Peko 06.30.09 at 8:46 am
I invented a new one today.

Just assuming that a [-v] option stands for --verbose

Yep, most of the time. But not on a [pkill] command.
[pkill -v myprocess] will kill _any_ process you can kill - except those whose name contains "myprocess". Ooooops. :-!
(I just wanted pkill to display "verbose" information when killing processes)

Yes, I know. Pretty dumb thing. Lesson learned ?

I would suggest adding another critical rule to your list:
" Read The Fantastic Manual - First" ;-)

Jai Prakash 07.03.09 at 1:43 pm
Mistake 1:

My Friend tried to see last reboot time and mistakenly executed command "last | reboot" instead of "last | grep reboot"

It caused an outage on a production DB server.

Mistake 2:

Another guy wanted to see the FQDN on a Solaris box and executed "hostname -f".
It changed the hostname to "-f", and clients faced a lot of connectivity issues due to this mistake.
[ hostname -f shows the FQDN on Linux, but on Solaris its usage is different ]

32 Name 07.04.09 at 5:20 pm
Worst thing I've done so far: I accidentally dropped a MySQL database containing 13k accounts for a gameserver :D

Luckily I had backups, but it took a little while to restore.

33 Vince Stevenson 07.06.09 at 6:23 pm
I was dragged into a meeting one day and forgot to secure my Solaris session. A colleague and former friend did this: alias ll='/usr/sbin/shutdown -g5 -i5 "Bye bye Vince"' He must have thought that I was logged into my personal host machine, not the company's cashcow server. What happens when it all goes wrong. Secure your session… Rgds Vince
Bjarne Rasmussen 07.07.09 at 7:56 pm
well, tried many times, the crontab fast typing failure…

crontab -r instead of -e
e for edit
r for remove..

now i always use -l for list before editing…
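
Since crontab -r sits one key away from crontab -e and asks no questions, a cheap insurance policy (just one possible habit; the file names below are made up) is to dump the current table to a dated file before every edit:

crontab -l > ~/crontab.$(date +%Y%m%d-%H%M%S)    # keep a dated copy before touching anything
crontab -e                                       # now an accidental -r costs you nothing
crontab ~/crontab.20091207-0930                  # restore after a slip (hypothetical backup name)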

35 Ian 07.08.09 at 4:15 am
Made a script that automatically removes all files from a directory. Now, rather than making it logically (this was early on) I did it stupidly.

cd /tmp/files
rm ./*

Of course, eventually someone removed /tmp/files..

36 shlomi 07.12.09 at 9:21 am
Hi

On my RHEL 5 server I made /tmp a mount point on my storage, and the tmpwatch script that runs under cron.daily removes files which have not been accessed for 12 hours !!!

M.S. Babaei 08.01.09 at 3:39 am
once upon a time mkfs killed an ext3 partition I wanted to keep:
instead of
mkfs.ext3 /dev/sda1
I did this
mkfs.ext3 /dev/sdb1

I'll never forget what I lost…

Simon B 08.07.09 at 2:47 pm
Whilst a colleague was away from their keyboard I entered :


rm -rf *

… but did not press Enter on the last line (as a joke). I expected them to come back, see the joke, laugh… and backspace over it. The unthinkable happened: the screen went to sleep and they banged the Enter key a couple of times to wake it up. We lost 3 days' worth of business and some new clients. Estimated cost: $50,000+.

ginzero 08.17.09 at 5:10 am
tar cvf /dev/sda1 blah blah…
47 Kevin 08.25.09 at 10:50 am
tar cvf my_dir/* dir.tar
and you write your archive over the first file in the directory …
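
Both tar slips above come down to argument order: with the f flag, the very next argument is the archive to be written, and everything after it is input. A minimal illustration (the names are made up):

tar cvf dir.tar my_dir/     # correct: archive name first, then what to pack into it
tar cvf my_dir/* dir.tar    # wrong: the archive is written over the first file the glob matches
tar cvf /dev/sda1 blah      # wrong target: the archive is written straight over the raw partition
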
48 ST 09.17.09 at 10:14 am
I've done the wrong server thing. SSH'd into the mailserver to archive some old messages and clear up space.
Mistake #1: I didn't logoff when I was done, but simply minimized the terminal and kept working
Mistake#2: At the end of the day I opened what I thought was a local terminal and typed:
/sbin/shutdown -h now
thinking I was bringing down my laptop. The angry phone calls started less than a minute later. Thankfully, I just had to run to the server room and press power.

I never thought about using CVS to backup config files. After doing some really dumb things to files in /etc (deleting, stupid edits, etc), I started creating a directory to hold original config files, and renaming those files things like httpd.conf.orig or httpd.conf.091709

As always, the best way to learn this operating system is to break it…however unintentionally.
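
Keeping /etc under version control, as hinted above, is cheap. A hedged sketch using git rather than CVS (git alone does not preserve ownership and permissions; the dedicated tool etckeeper automates that part):

cd /etc
git init                         # one-time: turn /etc into a repository
git add . && git commit -m "baseline before changes"
git diff                         # after a risky edit: see exactly what changed
git checkout -- httpd.conf       # roll one botched file back (the file name is just an example)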

49 Wolf Halton 09.21.09 at 3:16 pm
Attempting to update a Fedora box over the wire from Fedora8 to Fedora9
I updated the repositories to the Fedora9 repos, and ran
# yum -y upgrade
I have now tested this on a couple of boxes, and without exception the upgrades failed with many loose older-version packages and dozens of missing dependencies, as well as some fun circular dependencies which could not be resolved. By the time it was done, eth0 was disabled and a reboot would not even get to the kernel-choice stage.

Oddly, this kind of update works great in Ubuntu.

50 Ruben 09.24.09 at 8:23 pm
while cleaning the backup hdd late at night, a '/' can change everything…

"rm -fr /home" instead of "rm -fr home/"

It was a sleepless night, but thanks to Carlo Wood and his ext3grep I rescued about 95% of data ;-)

51 foo 09.25.09 at 9:36 pm
# svn add foo
--> Added 5 extra files that were not to be committed, so I decided to undo the change, delete the files and add them to svn again…
# svn rm foo --force

and it deleted the foo directory from disk :( …lost all my code just before the deadline :(

52 foo 09.25.09 at 9:41 pm
wanted to kill all the instances of a service on HP-UX (pkill like util not available)…

# ps -aef | grep -v foo | awk '{print $2}' | xargs kill -9

Typed "grep -v" instead of "grep -i" and you can guess what happened :(

53 LinAdmin 09.29.09 at 2:38 pm
Typing rm -Rf /var/* on the wrong box. Recovered in a few minutes by doing scp root@healthy_box:/var . – the ssh session on the just-broken box was still open. This saved my life :-P
54 Deltaray 10.03.09 at 4:37 am
Like Peko above, I too once ran pkill with the -v option and ended up killing everything else. This was on a very important enterprise production machine and I reminded myself the hard lesson of making sure you check man pages before trying some new option.

I understand where pkill gets its -v functionality from (pgrep and thus from grep), but honestly I don't see what use of -v would be for pkill. When do you really need to say something like kill all processes except this one? Seems reckless. Maybe 1 in a million times you'd use it properly, but probably most of the time people just get burned by it. I wrote to the author of pkill about this but never heard anything back. Oh well.
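
One habit that would have caught both of the pkill -v surprises above (a sketch, nothing more): pgrep accepts the same matching options as pkill, so you can preview exactly which processes a pkill line would hit before running it:

pgrep -l -v myprocess    # what "pkill -v myprocess" would kill: everything EXCEPT myprocess
pgrep -l myprocess       # what was actually intended; only then reach for pkill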

55 Guntram 10.05.09 at 7:51 pm
This is why i never use pkill; always use something like "ps ….| grep …" and, when it's ok, type a " | awk '{print $2}' | xargs kill" behind it. But, as a normal user, something like "pkill -v bash" might make perfect sense if you're sitting at the console (so you can't just switch to a different window or something) and have a background program rapidly filling your screen.

Worst thing that ever happened to me:
Our Oracle database runs some rdbms jobs at midnight to clean out very old rows from various tables, along the lines of "delete from XXXX where last_access < sysdate-3650". One Sunday I installed ntp on all machines and made a start script that does an ntpdate first, then runs ntpd. Tested it:
$ date 010100002030; /etc/init.d/ntpd start; date
Worked great, current time was ok.
$ date 010100002030; reboot
After the machine was back up I noticed I had forgotten the /etc/rc*.d symlinks. But I never thought of the database until a lot of people were very angry on Monday morning. Fortunately, there's an automated backup every Saturday.

56 sqn 10.07.09 at 6:05 pm
As a beginner wanting to impress myself, I tried to lock down a folder by removing its permissions (chmod 000) and did:

# cd /folder
# chmod 000 .. -R
I used two dots instead of one, so of course the system applied the change to the parent folder, which is /.
I ended up leaving home and going to the server to reset the permissions back to normal. I got lucky because I had just done a dd to move the system from one HDD to another and hadn't deleted the old one yet :)
And of course the classics: configuring the wrong box, firewall lockout :)

57 dev 10.15.09 at 10:15 am
while I was working in many ssh windows:

rm -rf *

I intended to remove all files under a site after changing the current working
directory, then replace them with the stable version.

Wrong window, wrong server, and I did it on a production server xx((
I only became aware of the mistake about 1.5 seconds after hitting [ENTER].
No backup. Luckily, perhaps, the site kept running smoothly…

It seems the deleted files were images or media content;
1-2 seconds of accidental removal on a fast machine cost me roughly 20 MB.

58 LMatt 10.17.09 at 3:36 pm
In a hurry to get a db back up for a user, I had to parse through a nearly-several-terabyte .tar.gz for the correct SQL dumpfile. So, being the good sysadmin, I located it within an hour, and in my hurry to get the db up for the client, who was on the phone the entire time, I typed:
mysql > dbdump.sql
Fortunately I didn't sit and wait all that long before checking to make sure that the database size was increasing, and the client was on hold when I realized my error.
mysql > dbdump.sql - SHOULD be -
mysql < dbdump.sql
I had just sent stdout of the mysql CLI interface to a file named dbdump.sql. I had to re-retrieve the damn sqldump file and start over!
BAH! FOILED AGAIN!
59 Mr Z 10.18.09 at 5:13 am
After 10+ years I've made a lot of mistakes. Early on I got myself in the habit of testing commands before using them. For instance:

ls ~usr/tar/foo/bar then rm -f ~usr/tar/foo/bar – make sure you know what you will delete

When working with SSH, always make sure what system you are on. Modifying system prompts generally eliminates all confusion there.

It's all just creating a habit of doing things safely… at least for me.
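
The "modify your prompt" advice is easy to implement. A sketch for bash (format and colors are a matter of taste) that keeps user, host and working directory in front of you, so wrong-box mistakes are harder to make:

# in ~/.bashrc or /etc/profile.d/prompt.sh
PS1='\[\e[1;31m\]\u@\h\[\e[0m\]:\w\$ '    # user@host in bold red, then the working directory
# some admins pick a different color per environment so production boxes stand out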

60 chris 10.22.09 at 11:15 pm
cd /var/opt/sysadmin/etc
rm -f /etc

note the /etc. It was supposed to be rm -rf etc

61 Jonix 10.23.09 at 11:18 am
The deadline was coming too close for comfort; I'd been working too looong hours for months.

We were developing a website, and I was in charge of the CGI scripts, which generated a lot of temporary files, so on pure routine I worked in "/var/www/web/" and entered "rm temp/*", which at some point I mistyped as "rm tmp/ *". In my overtired brain I kind of wondered why the delete took so long to finish; it should only have been about 20 small files.

The very next morning the paying client was to fly in and pay us a visit, and get a demonstration of the project.

P.S Thanks to Subversion and opened files in Emacs buffers I managed to get almost all files back, and I had rewritten the missing files before the morning.

62 Cougar 10.29.09 at 3:00 pm
rm * in one of my project directories (no backup). I planned to do rm *~ to delete backup files, but I was using an international keyboard where a space is required after ~ (it's a dead key for letters like ő)…
63 BattleHardened 10.30.09 at 1:33 am
Some of my more choice moments:

postsuper -d ALL (instead of -r ALL, adjacent keys – 80k spooled mails gone). No recovery possible – ramfs :/

Had a .pl script to delete mails in .Spam directories older than X days, didn't put in enough error checking, some helpdesk guy provisioned a domain with a leading space in it and script rm'd (rm -rf /mailstore/ domain.com/.Spam/*) the whole mailstore. (250k users – 500GB used) – Hooray for 1 day old backup

chown -R named:named /var/named when there was a proc filesystem mounted under /var/named/proc. Every running process on the system got chowned: /bin/bash, /usr/sbin/sshd and so on. Took hours of manual finds to fix.

.. and pretty much all the ones everyone else listed :)

You break it, you fix it.

64 PowerPeeCee 11.02.09 at 1:01 am
As an Ubuntu user for a while, Y'all are giving me nightmares, I will make extra discs and keep them handy. Eek! I am sure that I will break it somehow rather spectacularly at some point.
65 mahelious 11.02.09 at 10:44 pm
Second day on the job I restarted Apache on the live web server, forgetting to first check the cert password. I was finally able to find it in an obscure doc file after about 30 minutes. The resulting firestorm of angry clients would have made Nero proud. I was very, very surprised to find out I still had a job after that debacle.

lesson learned: keep your passwords secure, but handy

66 Shantanu Oak 11.03.09 at 11:20 am
scp overwrites an existing file if it exists on the destination server. I just used the following command and soon realized that it had replaced the "somefile" already on that server!!
scp somefile [email protected].0.1:/root/
67 thatguy 11.04.09 at 3:37 pm
Hmm, most of these mistakes I have done – but my personal favourite.

# cd /usr/local/bin
# ls -l -> that displayed some binaries that I didn't need / want.
# cd ..
# rm -Rf /bin
– Yeah, you guessed it – smoked the bin folder ! The system wasn't happy after that. This is what happens when you are root and do something without reading the command before hitting [enter] late at night. First and last time …

68 Gurudatt 11.06.09 at 12:05 am
chmod 777 /

Never try this; if you do, even root will not be able to log in.

69 richard 11.09.09 at 6:59 pm
So, in recovering a binary backup of a large mysql database, produced by copying and tarballing '/var/lib/mysql', I untarred it in /tmp and did the recovery without incident (at 2 am, when it went down). Feeling rather pleased with myself for such a quick and successful recovery, I went to delete the 'var' directory in '/tmp'. I wanted to type:
rm -rf var/

instead I typed :
rm -rf /var

Unfortunately I didn't spot it for a while, and only afterwards did I realize that my on-site backups were stored in /var/backups …
It was a truly miserable few days that followed while I pieced the box back together from SVN and various other sources …

70 Henry 11.10.09 at 6:00 pm
Nice post and familiar with the classic mistakes.

My all time classic:
- rm -rf /foo/bar/ * [space between / and *]

Be careful with clamscan's:
--detect-pua=yes --detect-structured=yes --remove=no --move=DIRECTORY

I chose to scan / instead of /home/user and I ended up with a screwed apt, libs, and files missing from all over the place :D Luckily I had --log=/home/user/scan.log rather than console output, so I could restore the moved files one by one.
Next time I'll use --copy instead of --move, and never start with /

These two happened at home; at work I learned a long time ago (back in SCO Unix times) to back up files before rm :D

71 Derek 11.12.09 at 10:26 pm
Heh,
These were great.
I have many above.. my first was
reboot
….Connection reset by peer. Unfortunately, I thought I was rebooting my desktop. Luckily, the performance test server I was on hadn't been running tests(normally they can take 24-72 hours to run)..

Symlinks… ack! I was cleaning up space and thought, weird, I don't remember having a bunch of databases in this location… rm -f *. Unfortunately, it was a symlink to my /db slice, which DID have my databases. Friday afternoon fun.

I did something similar by being in the wrong directory… deleted all my mysql binaries.

This was also after we had acquired a company, and the same thing had happened on one of their servers months before. We never realized that, and the server had an issue one day… so we rebooted. Mysql had been running purely in memory for months, and upon reboot there was no more mysql. Took us a while to figure that out, because no one had thought that the mysql binaries were GONE! Luckily I wasn't the one who had deleted the binaries; I just got to witness the aftermath.

72 Ahmad Abubakr 11.13.09 at 2:23 pm
My favourite :)

sudo chmod 777 /
73 jason 11.18.09 at 4:19 pm
The best ones are when you f*ck up and take down the production server and are then asked to investigate what happened and report on it to management….
74 Mr Z 11.19.09 at 3:02 pm
@jason
That sort of situation leads to this tee-shirt
http://www.rfcafe.com/business/images/Engineer%27s%20Troubleshooting%20Flow%20Chart.jpg
75 John 11.20.09 at 2:29 am
Clearing up space used by no-longer-needed archive files:

# du -sh /home/myuser/oldserver/var
32G /home/myuser/oldserver/var
# cd /home/myuser/oldserver
# rm -rf /var

The box ran for 6 months after doing this, by the way, until I had to shut it down to upgrade the RAM…although of course all the mail, Web content, and cron jobs were gone. *sigh*

76 Erick Mendes 11.24.09 at 7:55 pm
Yesterday I locked myself out of a switch I was setting up. lol
I was setting up a VLAN on it, and my PC was directly connected through one of the ports I messed up.

Had to get in through serial to undo the VLAN config.

Oh, the funny thing is that some hours later my boss just made the same mistake lol

77 John Kennedy 11.25.09 at 2:09 pm
Remotely logged into a (Solaris) box at 3am. Made some changes that required a reboot. Being too lazy to even try and remember the difference between Solaris and Linux shutdown commands I decided to use init. I typed init 0…No one at work to hit the power switch for me so I had to make the 30 minute drive into work.
This one I chalked up to being a noob…I was on an XTerminal which was connected to a Solaris machine. I wanted to reboot the terminal due to display problems…Instead of just powering off the terminal I typed reboot on the commandline. I was logged in as root…
78 bram 11.27.09 at 8:45 pm
on a remote freebsd box:

[root@localhost ~]# pkg_delete bash

The next time i tried to log in, it kept on telling me access denied… hmmmm… ow sh#t

(since my default shell in /etc/passwd was still pointing to a non-existent /usr/local/bin/bash, i would never be able to log in)

79 Li Tai Fang 11.29.09 at 8:02 am
On a number of occasions, I typed "rm" when I wanted to type "mv," i.e., I wanted to rename a file, but instead I deleted it.
80 vmware 11.30.09 at 4:59 am
last | reboot
instead of
last | grep reboot
81 ColtonCat 12.02.09 at 4:21 am
I have a habit of renaming config files I work on to the same file name with a "~" at the end as a backup, so that I can roll back if I make a mistake, and once all is well I just do rm *~. Trouble happened when I accidentally typed rm * ~, and as Murphy would have it, it was on a production Asterisk telephony server.
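
Two comments here lost data to the one-space difference between rm *~ and rm * ~. A less fragile way to clean editor backup files (a sketch, assuming GNU find): preview the match first, then delete with a quoted pattern that a stray space cannot split:

find . -maxdepth 1 -name '*~'            # list exactly which backup files would match
find . -maxdepth 1 -name '*~' -delete    # then remove them
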
82 bye bye box 12.02.09 at 7:54 pm
Slicked the wrong box in a production data center at my old job.

In all fairness it was labeled wrong on the box and kvm ID.

Now I've learned to check hostname before decom'ing anything.

83 Murphy's Red 12.02.09 at 9:11 pm
Running out of diskspace while updating a kernel on FreeBSD.

Not fully inserting a memory module on my home machine which shortcircuited my motherboard.

On several occasions I had to use an rdesktop session to a Windows machine and use putty from there to connect to a box (yep, I know it sounds weird ;-) ). Anyway, text copied in Windows is stored differently than text copied in the shell. While changing a root passwd on a box (password copied using putty) I just Ctrl-V'ed it and logged off. I had to go to the datacenter and boot into single user mode to access the box again.

Using the same crappy setup, I copied some text in Windows and accidentally hit Ctrl-V in the putty screen of the box I was logged into as root; the first word was halt, the last character an Enter.

Configuring nat on the wrong interface while connected through ssh

Adding a new interface on a machine, I filled in the details of a home network in kudzu, which changed the default gateway on the main interface to 192.168.1.1. I only checked the output of ifconfig, not the traffic or the gateway and dns settings.

fsck -y on filesystem without unmounting it

84 ehrichweiss 12.03.09 at 6:55 pm
I've definitely rebooted the wrong box, locked myself out with firewall rules, and rm -rf'ed a huge portion of my system. I had my infant son bang on the keyboard of my SGI Indigo2, and somehow he hit the right key combo to undo a couple of symlinks I had created for /usr (I had had to delete them a couple of times in the process of creating them) AND cleared the terminal/history, so I had no idea what was going on when I started getting errors. I had created the symlink a week prior, so it took me a while to figure out what I had to do to get the system operational again.

My best and most recent FUBAR was when I was backing up my system (I have horrible, HORRIBLE luck with backups, to the point that I mostly don't bother doing them any more). I was using mondorescue, backing the files up to an NTFS partition I had mounted under /mondo. One backup wouldn't restore anything because of an apostrophe or single quote in one of the file names being backed up, so I removed the files causing the problem (not really a biggie), did the backup, then formatted the drive as I had been planning… only to discover that I hadn't remounted the NTFS partition under /mondo as I had thought, and all 30+ GB of data was gone. I attempted recovery several times, but it was just gone.

85 fly 12.04.09 at 3:55 pm
My personal favorite: a script somehow created a few dozen files in the /etc dir… all named ??somestring, so I promptly did rm -rf ??* … (at the point when I hit [enter] I remembered that ? is a wildcard… too late :)). Luckily that was my home box… but a reinstall was imminent :)
86 bips 12.06.09 at 9:56 am
I once managed to do:
crontab -r

instead of:
crontab -e

which had the effect of emptying the crontab list…

87 bips 12.06.09 at 9:59 am
also i've done

shutdown -n
(I thought -n meant "now")

which had the consequence of rebooting the server without networking…

88 Deltaray 12.06.09 at 4:51 pm
bips: What does shutdown -n do? It's not in the shutdown man page.
89 miss 12.14.09 at 8:42 am
crontab -e vs crontab -r is the best :)
90 marty 12.18.09 at 12:21 am
The extra space before a * is one I've done before, only the root cause was tab completion.

#rm /some/directory/FilesToBeDele[TAB]*

I was thinking there were multiple files that began with FilesToBeDele. Instead, there was only one, and pressing tab put in the extra space. Luckily I was in my home dir, and there was a file with write-only permission, so rm paused to ask if I was sure. I hit ^C and wiped my brow. Of course the [TAB] was totally unnecessary in this instance, but my pinky is faster than my brain.

Copy Your Linux Install to a Different Partition or Drive

Jul 9, 2009
If you need to move your Linux installation to a different hard drive or partition (and keep it working), and your distro uses grub, this tech tip is what you need.

To start, get a live CD and boot into it. I prefer Ubuntu for things like this. It has Gparted. Now follow the steps outlined below.

The original tip broke the job into three steps; the detailed commands did not survive the copy, but a sketch of the usual procedure follows the tip:

Copying

Configuration

Install Grub

That's it! You should now have a bootable working copy of your source drive on your destination drive! You can use this to move to a different drive, partition, or filesystem.
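
Here is a hedged sketch of the usual procedure the tip outlines, run from the live CD (device names /dev/sda1 and /dev/sdb1 are placeholders; adjust to your layout):

mkdir -p /mnt/src /mnt/dst
mount /dev/sda1 /mnt/src                            # the existing installation
mount /dev/sdb1 /mnt/dst                            # the freshly formatted target
cp -ax /mnt/src/. /mnt/dst/                         # copy everything, preserving attributes, one filesystem only
blkid /dev/sdb1                                     # note the new partition's UUID ...
vi /mnt/dst/etc/fstab                               # ... and point the root entry at it
grub-install --root-directory=/mnt/dst /dev/sdb     # legacy grub; with grub2, chroot into /mnt/dst and run grub-install /dev/sdb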


Hosing Your Root Account By S. Lee Henry

If you manage your own Unix system, you might be interested in hearing how easy it is to make your root account completely inaccessible -- and then how to fix the problem. I have landed in this situation twice in my career and, each time, ended up having to boot my Solaris box off a CD-ROM in order to gain control of it.

The first time I ran into this problem, someone else had made a typing mistake in the root user's shell in the /etc/passwd file. Instead of saying "/bin/sh", the field was made to say "/bin/sch", suggesting to me that the intent had been to switch to /bin/csh. Due to the typing mistake, however, not only could root not log in but no one could su to the root account. Instead, we got error messages like these:

    login: root
    Password:
    Login incorrect

    boson% su -
    Password:
    su: cannot run /bin/sch: No such file or directory

The second time, I rdist'ed a new set of /etc files to a new Solaris box I was setting up without realizing that the root shell on the source system had been set to /bin/tcsh. Because this offspring of the C shell is not available on most Unix boxes (and certainly isn't delivered with Solaris), I found myself facing the same situation that I had run into many years before.

I couldn't log in as root. I couldn't su to the root account. I couldn't use rcp (even from a trusted host) -- because it checks the shell. I could ftp a copy of tcsh, but could not make it executable. I couldn't boot the system in single user mode (it also looked for a valid shell). The only option at my disposal was to boot the system from a CD-ROM. Once I had done this, I had two choices: 1) I could mount my root partition on /a, cd to /a/etc, replace the shell in the /etc/passwd file, unmount /a, and then reboot. 2) I could mount my root partition on /a, cd to /a/bin, chmod 755 the copy of tcsh that I had previously ftped there, unmount /a, and then reboot.
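
For the record, the first option boils down to a handful of commands once the box is booted from the Solaris CD (a sketch; c0t0d0s0 is just a typical root slice name):

boot cdrom -s                        # from the OpenBoot prompt: single-user mode off the CD
mount /dev/dsk/c0t0d0s0 /a           # mount the damaged root filesystem on /a
vi /a/etc/passwd                     # put root's shell back to /bin/sh
umount /a
reboot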

I fixed root's entry in the /etc/passwd file and made my new tcsh file executable to prevent any possible recurrence of the problem. To avoid these problems, I usually don't allow the root shell to be set to anything other than /bin/sh (or /bin/csh if I'm pressured into it). The Bourne shell (or bash) is generally the best shell for root because it's on every system and the system start/stop scripts (in the /etc/rc?.d or /etc/rc.d/rc?.d directories) are almost exclusively written in sh syntax. Hence, should one of these files fail to include the #!/bin/sh designator, they will still run properly.

Surprised by how easily and completely I had made my system unusable, I was left running around the office looking for the secret stash of Solaris CD-ROMs to repair the damage. By the way, changing the file on the rdist source host and rdist'ing the files a second time would not have worked because even rdist requires the root account on the system be working properly. The rdist tool is based on rcp.

Recommended Links

Unix Admin. Horror Story Summary, version 1.0 compiled by: Anatoly Ivasyuk ([email protected])
The Unofficial Unix Administration Horror Story Summary, version 1.1

Stupid user tricks: Eleven IT horror stories (InfoWorld)

Any horror stories about fired sysadmins sysadmin

developerWorks Linux Technical library view

More 2 Cent Tips

Two Cent BASH Shell Script Tips

Lots More 2 Cent Tips...

Some great 2¢ Tips...





Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for the authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of the Google privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: April 27, 2021