Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)



Large Enterprise Unix administration

“it is better to solve the right problem the wrong way than the wrong problem the right way”.

Since we could find little prior art, we set out to create it. Over the course of several years of deploying, reworking, and administering large mission-critical infrastructures, we developed a certain methodology and toolset. We began thinking of an entire infrastructure as one large enterprise cluster, rather than as a collection of individual hosts. This change of perspective, and the decisions it invoked, made a world of difference in cost and ease of administration. The standard functionality includes:

There is relatively little prior art in print which addresses the problems of large Unix infrastructures in any holistic sense. Thanks to the work of many dedicated people we now see extensive coverage of individual tools, techniques, and policies [nemeth] [frisch] [stern] [dns] [evard] [limoncelli] [anderson]. But it is difficult in practice to find the best way

We recognize that there really is no "standard" way to assemble or manage large infrastructures of UNIX machines. While the components that make up a typical infrastructure are generally well-known, professional infrastructure architects tend to use those components in radically different ways to accomplish the same ends. In the process, we usually write a great deal of code to glue those components together, duplicating each other's work in incompatible ways.

Because infrastructures are usually ad hoc, setting up a new infrastructure or attempting to harness an existing unruly infrastructure can be bewildering for new sysadmins. The sequence of steps needed to develop a comprehensive infrastructure is relatively straightforward, but the discovery of that sequence can be time-consuming and fraught with error. Moreover, mistakes made in the early stages of setup or migration can be difficult to remove for the lifetime of the infrastructure.

We will discuss the sequence that we developed and offer a brief glimpse into a few of the many tools and techniques this perspective generated. If nothing else, we hope to provide a lightning rod for future discussion. We operate a web site (www.infrastructures.org) and mailing list for collaborative evolution of infrastructure designs. Many of the details missing from this document should show up on the web site.

In our search for answers, we were heavily influenced by the MIT Athena project [athena], the OSF Distributed Computing Environment [dce], and by work done at Carnegie Mellon University [sup] [afs] and the National Institute of Standards and Technology [depot].

This is a fitting definition of a system administrator - making repairs, dealing with people, and trying to anticipate as well as prevent problems.

An experienced admin is not necessarily a good admin

The best admin is the person with the right mindset, not the person with the most time on the job. In fact, the more "indispensable" an admin seems to be, the more likely they are a bad admin. The dead giveaway is if the person needs to be around to answer questions all the time, which strongly suggests two things:

  1. That the documentation is incomplete

    Complete network, host, and application documentation is a must for every site. An admin who is constantly called while on vacation, or while others are on call, is probably not documenting their systems properly. Everything needed to understand and fix problems at their site should be clearly documented in a central location.
     

  2. That the systems aren't discoverable

    The systems should be as self-documenting as possible. This means that start/stop/reload scripts should be in conventional locations (like /etc/init.d/ on SysV and on most Linuxes), MOTD messages should give helpful info, scripts should have comments explaining their usage and purpose, and automated alerts should send useful information about the error condition(s) found. Leave nothing to be rediscovered every time someone new has to work on the systems; let them spend their time working on the actual issue(s) at hand.

    An admin unfamiliar with the machine(s) in question should be able to find their way around the system with a minimum of trouble.
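As an illustration of an automated alert that "sends useful information about the error condition", here is a minimal Python sketch. It is not taken from the original text: the function names `format_disk_alert` and `check_mount`, the 90% threshold, and the suggested `du` follow-up command are all assumptions chosen for the example. The point is only that the message names the host, the filesystem, how bad the condition is, and where to start looking.

```python
import shutil
import socket


def format_disk_alert(host, mount, used, total, threshold_pct):
    """Build an alert that says what is wrong, where, and how bad it is.

    All names and the message layout are hypothetical, for illustration.
    """
    pct = 100.0 * used / total
    free_mb = (total - used) // (1024 * 1024)
    return (
        f"[{host}] filesystem {mount} is {pct:.1f}% full "
        f"(threshold {threshold_pct}%, {free_mb} MB free). "
        f"Start with: du -xk {mount} | sort -n | tail"
    )


def check_mount(mount="/", threshold_pct=90):
    """Return an alert string if `mount` is above the threshold, else None."""
    usage = shutil.disk_usage(mount)
    if 100.0 * usage.used / usage.total < threshold_pct:
        return None
    return format_disk_alert(
        socket.gethostname(), mount, usage.used, usage.total, threshold_pct
    )


if __name__ == "__main__":
    alert = check_mount("/")
    if alert:
        print(alert)
```

An admin who has never seen the machine can act on such a message without first working out which host and filesystem it refers to.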

 

Many admins become mired in their site's problems, and stop trying to improve their situation. They accept that their disks keep filling up, that their applications keep dying, and that mundane tasks take up all their time.

If they were to write cron jobs to trim files that grow until filesystems fill, restart dead applications with init, cfengine, cron scripts, or daemontools, and automate repetitive tasks from cron or cfengine, they would have a smoothly running network. Once things run smoothly, they can spend their time updating software, improving security, or any of the many projects that improve overall conditions. Such projects get little effort spent on them at sites without a proactive attitude.
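A job that trims a growing file can be sketched in a few lines of Python. This is an illustrative assumption, not a tool named in the text: the function `trim_file` and its size parameters are hypothetical, and rewriting a file in place is only safe when the writing application opens it in append mode. Sites that already run a log rotation tool should prefer that.

```python
import os


def trim_file(path, max_bytes, keep_bytes):
    """If `path` is larger than max_bytes, keep roughly its last keep_bytes.

    Returns True if the file was trimmed, False otherwise.
    Hypothetical helper, intended to be called from a cron job.
    """
    size = os.path.getsize(path)
    if size <= max_bytes:
        return False
    start = max(0, size - keep_bytes)
    with open(path, "rb+") as f:
        f.seek(start)
        tail = f.read()
        if start > 0:
            # Drop the leading partial line so the file starts on a line boundary.
            newline = tail.find(b"\n")
            if newline != -1:
                tail = tail[newline + 1:]
        f.seek(0)
        f.write(tail)
        f.truncate()
    return True
```

Run hourly from cron against the handful of files known to grow without bound, a script like this removes one recurring cause of full filesystems and leaves the most recent entries available for troubleshooting.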

Recommended Links

 

MRINetwork.com Warning: Social Networking Can Be Hazardous to Your Job Search



Copyright © 1996-2008 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Copyright of original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Standard disclaimer: The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: March 15, 2008