
Optimizing usage of NFS in Grid Engine


Above all, not too much zeal

Talleyrand

Adapted from liv.ac.uk: Reducing and Eliminating NFS usage by Grid Engine

The default installation of Grid Engine assumes that the $SGE_ROOT  directory is on a shared filesystem accessible by all hosts in the cluster. (This is referred to as NFS here, but might be another sort of networked filesystem. Local caching, e.g. with AFS or NFS+FS-Cache, might change the performance considerations below.)

For a large cluster or a high-throughput workload, this can generate significant NFS traffic, but it is unlikely to be a problem on a low-throughput cluster with fewer than, say, 32 nodes.

Since application output files are usually written to NFS anyway, you can eliminate only part of the NFS traffic, so it is not always clear that the game is worth the candle.

At the same time, RPM-based open source distributions of SGE impose many dependencies on the computational nodes that need to be resolved anyway, so it is unclear what exactly you gain by sharing the whole $SGE_ROOT tree. SGE is usually not upgraded but completely reinstalled, so ease of upgrade is largely a non-issue.

There are various ways to reduce NFS traffic, including a way to eliminate entirely the requirement that Grid Engine operate using shared files. However, for each alternative there is a corresponding loss of convenience and, in some cases, functionality. So measurement is essential. As Donald Knuth once remarked, "Premature optimization is the root of all evil."

This HOWTO explains how to implement the different alternatives.

Levels of Grid Engine NFS dependencies

The levels below are ordered by decreasing NFS dependency; at each level, an additional part of the $SGE_ROOT file structure is moved out of NFS sharing.

1. $SGE_ROOT is shared via NFS

   Description: executables, configuration files, and spool directories are all shared.

   Advantages: simple to install, but only with a tar-file based installation (with an RPM-based installation this is a mixed blessing); easier to upgrade via tar files, although this is a questionable advantage; easy to debug, as all the necessary files can be viewed from the master host.

   Disadvantages: potentially significant NFS traffic. You need to jump through hoops to resolve RPM dependencies on each execution host anyway, so what exactly are you fighting for?

2. $SGE_ROOT/$SGE_CELL is shared via NFS

   Description: executables are local to each compute host; configuration files and spool directories are shared.

   Advantages: still convenient to debug. Probably the best option with an RPM-based installation, as dependencies are resolved during local installation of the RPMs.

   Disadvantages: no clear disadvantages for an RPM-based installation.

3. Local spool directories are used, for example /var/spool/sge

   Description: this option can be combined with any of the others; it is a completely independent choice. Executables and configuration files can be either shared or not; spool directories are local to each compute host.

   Advantages: simple to install; easy to upgrade; reduction in NFS traffic, though it is unclear how significant the reduction is without measurements.

   Disadvantages: less convenient to debug (you must go to the individual host to see the execd messages file). It does not solve the problem of output files for MPI jobs.

4. Only $SGE_ROOT/$SGE_CELL/common is shared via NFS

   Description: configuration files are shared; executables and spool directories are local to each compute host.

   Advantages: reduction of NFS traffic, up to its complete elimination for single-node jobs with local output files. (The difference is especially visible when running massively parallel jobs across many nodes.)

   Disadvantages: none. It does not make sense not to share this directory; you may need to share more, but never less.

5. Everything is installed locally; no NFS is used

   Description: executables, configuration files, and spool directories are all local to each compute host.

   Advantages: elimination of the NFS requirement.

   Disadvantages: less convenient to install and upgrade; less convenient to debug; less convenient to change configuration parameters (you must modify files on every host); loss of shadow master functionality; partial loss of qacct -j capability.
Local Spool Directories

The spool directory for each execd is the greatest source of NFS traffic for Grid Engine. When jobs are dispatched to an exec host, the job script gets transferred via the qmaster and then written to the spool directory. Each job gets its own subdirectory, into which additional information is written by both the execd and the job shepherd process. Logfiles are also written into the spool directory, for both the execd as well as the individual jobs.

By configuring local spool directories, all of that traffic can be redirected to the local disk on each compute host, isolating it from the rest of the network and reducing I/O latency. One disadvantage is that, in order to view the logfiles for a particular job, you need to log onto the system where the job ran instead of simply looking in the shared directory. This matters when debugging a job problem: the messages file contains information that isn't in the qacct output and that can be useful to users as well as administrators.

The path to the spool directory is controlled by the parameter execd_spool_dir; it should be set to a directory on the local compute host which is owned by the admin user and which ideally can handle intensive reading and writing (e.g., /var/spool/sge). The execd_spool_dir parameter can be specified when running the install_qmaster script; however, the directory must already exist and be owned by the admin user, or else the script will complain and the execd will not function properly. The spool directory must also have proper root permissions, or files written by the shepherd will be world-writable. Alternatively, the execd_spool_dir parameter can be changed in the cluster configuration (man sge_conf); the execds need to be halted before this change can be made. Please make sure you read sge_conf(5).
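The preparation step can be sketched as follows. The demo path and the fallback to the current user are stand-ins for the real /var/spool/sge and your SGE admin user (e.g., sgeadmin); the qconf invocation at the end is shown only as a comment.

```shell
# Prepare a local spool directory on an exec host.
# SPOOL_DIR and ADMIN_USER defaults are demo assumptions; in production
# you would use /var/spool/sge owned by the SGE admin user.
SPOOL_DIR=${SPOOL_DIR:-/tmp/sge-spool-demo}
ADMIN_USER=${ADMIN_USER:-$(id -un)}    # normally the SGE admin user
mkdir -p "$SPOOL_DIR"
chown "$ADMIN_USER" "$SPOOL_DIR"       # must be owned by the admin user
chmod 755 "$SPOOL_DIR"
echo "spool directory ready: $SPOOL_DIR"

# Afterwards, point Grid Engine at it (stop the execds first):
#   qconf -mconf        # set: execd_spool_dir  /var/spool/sge
```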

Local Executables

In the default setup, all hosts in a cluster read the binary files for daemons and commands off the shared directory. For daemons, this only occurs once, when they start up. When jobs run, other processes are invoked, such as the shepherd and the rshd (for interactive jobs). In a high-throughput cluster, or when invoking a massively-parallel job across many nodes, there is a possibility that many simultaneous NFS read accesses to these other executables could occur. To counter this, you could make all executables be local to the compute hosts.

In this configuration, rather than sharing $SGE_ROOT over NFS to the compute hosts, you would only share $SGE_ROOT/$SGE_CELL/common (you would also implement local spool directories as described above). On each compute host, you would need to install both the "common" and the architecture-specific binary packages. Then, you would mount the shared $SGE_ROOT/$SGE_CELL/common directory before invoking the install_execd script. In order to prevent confusion, make sure that the path to $SGE_ROOT is identical on the master host and compute hosts, e.g., SGE_ROOT=/opt/sge on all hosts.
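The mount-then-install sequence can be sketched as a configuration fragment. The NFS server name "sgemaster" and SGE_ROOT=/opt/sge with the default cell are assumptions; substitute your own values.

```shell
# Sketch only -- commands are shown as comments because they require a
# live cluster. Assumptions: master/NFS server "sgemaster",
# SGE_ROOT=/opt/sge, cell name "default".
#
# /etc/fstab entry sharing only the cell's common directory:
#   sgemaster:/opt/sge/default/common  /opt/sge/default/common  nfs  defaults  0 0
#
# On each compute host:
#   mkdir -p /opt/sge/default/common
#   mount /opt/sge/default/common
#   # install the "common" and architecture-specific binary packages
#   # locally, then run the exec host installer:
#   cd /opt/sge && ./install_execd
```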

For submit and admin hosts, you could choose to either install the executables locally, or else mount them from some shared version of $SGE_ROOT, since it is unlikely that NFS traffic on these types of hosts would be a cause for concern in terms of performance.

Local Configuration Files

Although the above two setups describe ways to reduce NFS traffic to almost nil, there might be other reasons why NFS is not desired. For example, the only available version of NFS for your operating environment might not be considered reliable enough for production use. In this case, you can choose not to share the configuration directory $SGE_ROOT/$SGE_CELL/common, but instead have it be local to each compute host. This would result in no files being shared via NFS. However, because you are no longer using a common set of files shared by all systems, there is some functionality which requires some extra effort to use, and other functionality which no longer works.

1) When you modify certain configuration files, the modification needs to be made manually across all hosts in the cluster; these files are located in the $SGE_ROOT/$SGE_CELL/common directory.

2) Another consequence is that the qacct command will only work if executed on the master host. This is because the accounting file, where all historical information is stored, is only updated on the master host. Because qacct will by default read information from the file $SGE_ROOT/$SGE_CELL/common/accounting, it will only be accurate on the master host. qacct can be directed to read information from any file, using the -f flag, so one alternative is to manually copy the accounting file periodically to another system, where the analysis can take place.

3) Finally, if you do not share the $SGE_ROOT/$SGE_CELL/common directory, you cannot use the Shadow Master facility. The Shadow Master feature relies upon a shared filesystem to keep track of the active master, so without NFS, Shadow Mastering does not work.
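The accounting-file copy described in (2) can be sketched as follows. The paths are demo stand-ins; in production the local cp would be an scp or rsync from the master host.

```shell
# Copy the master's accounting file so qacct -f can analyze it elsewhere.
# All paths here are demo assumptions.
SGE_ROOT=${SGE_ROOT:-/tmp/demo-sge-root}
SGE_CELL=${SGE_CELL:-default}
ACCT="$SGE_ROOT/$SGE_CELL/common/accounting"
DEST=${DEST:-/tmp/demo-acct-copy}

mkdir -p "$(dirname "$ACCT")" "$DEST"
touch "$ACCT"                       # stand-in for the real accounting file
cp "$ACCT" "$DEST/accounting"       # production: scp master:$ACCT "$DEST/"

# On the analysis host, qacct can then read the copy, e.g.:
#   qacct -f /path/to/copied/accounting -o
```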

To install with this type of setup, proceed as follows:

  1. unpack/untar the Grid Engine distribution on each system (common and architecture-specific packages) to the same pathname on each system
  2. install the master host completely
  3. modify all the configuration files mentioned above to suit the requirements of your site
  4. on the master host, make an archive of the directory $SGE_ROOT/$SGE_CELL/common
  5. on each exec host, unpack the archive created above
  6. on each exec host, run the install_execd script. It should automatically read in the configurations from the directory which was unpacked.
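Steps 4 and 5 above can be sketched as follows. The demo paths are assumptions; on a real cluster the archive would be copied to each exec host, which uses the same $SGE_ROOT path as the master.

```shell
# Step 4: on the master host, archive the cell's common directory.
SGE_ROOT=${SGE_ROOT:-/tmp/demo-master-sge}
SGE_CELL=${SGE_CELL:-default}
mkdir -p "$SGE_ROOT/$SGE_CELL/common"
echo "demo-config" > "$SGE_ROOT/$SGE_CELL/common/act_qmaster"
tar -C "$SGE_ROOT" -czf /tmp/common.tar.gz "$SGE_CELL/common"

# Step 5: on each exec host (same $SGE_ROOT path in production),
# unpack the archive before running install_execd.
EXEC_ROOT=/tmp/demo-exec-sge
mkdir -p "$EXEC_ROOT"
tar -C "$EXEC_ROOT" -xzf /tmp/common.tar.gz
cat "$EXEC_ROOT/$SGE_CELL/common/act_qmaster"
```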

Other Considerations

Even though Grid Engine can function perfectly well without NFS (except the noted functionality), there are other considerations which might lead to unexpected behavior.

Home directories

Unless otherwise specified, Grid Engine runs jobs in the user's home directory. If home directories are not shared, then whatever files a job creates will be placed in the home directory on the host where the job executes. Likewise, any configuration given in dot-files, such as .cshrc and .sge_request, will be read from the home directory on the host where the job executes. If the user's home directory does not exist at all on the compute host, the job will go into an error state. You therefore need to make sure that for every user, on every compute host, a home directory is present and contains all the desired dot-file configurations. Also, for jobs run with the -cwd flag, the current path is recorded, and when the job executes on the compute host, unless the exact same path is accessible to the user running the job, the job will go into an error state.
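A minimal pre-flight check for the home-directory requirement might look like this. The helper function and the idea of running it per user on each compute host are suggestions, not part of Grid Engine.

```shell
# Check that a user's home directory exists on this host; a missing home
# directory causes the user's jobs to enter an error state.
check_home() {
    user="$1"
    home=$(getent passwd "$user" | cut -d: -f6)
    if [ -n "$home" ] && [ -d "$home" ]; then
        echo "OK: $user -> $home"
    else
        echo "MISSING: home for $user"
        return 1
    fi
}

# Demo invocation for the current user; in practice you would loop over
# all cluster users on every compute host.
check_home "$(id -un)"
```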

Application and data files

Obviously, without NFS there needs to be a way to stage data files in and out, and the application files (binaries, libraries, config files, databases, etc.) would also need to be either already present on each compute host or also staged in. The prolog and epilog script feature of Grid Engine provides a generic mechanism for implementing a site-specific stage-in/stage-out facility. Alternatively, these steps could be embedded into jobs scripts directly.
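A prolog/epilog stage-in/stage-out pair can be sketched as follows. All paths are demo assumptions, and JOB_TMP stands in for the per-job scratch directory ($TMPDIR) that Grid Engine provides; the hooks themselves are configured per queue.

```shell
# Demo stage-in/stage-out. STAGE_SRC/STAGE_DST stand in for a site's
# staging area; JOB_TMP stands in for the job's scratch directory.
STAGE_SRC=${STAGE_SRC:-/tmp/demo-stage-src}
STAGE_DST=${STAGE_DST:-/tmp/demo-stage-dst}
JOB_TMP=${JOB_TMP:-/tmp/demo-job-scratch}
mkdir -p "$STAGE_SRC" "$STAGE_DST" "$JOB_TMP"
echo "input data" > "$STAGE_SRC/input.dat"

# prolog: copy inputs into the job's scratch space
cp "$STAGE_SRC"/*.dat "$JOB_TMP"/

# (the job itself would run here and write its results)
echo "result" > "$JOB_TMP/result.out"

# epilog: copy results back out to the staging area
cp "$JOB_TMP"/*.out "$STAGE_DST"/
ls "$STAGE_DST"
```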

User virtualization

If application availability and data file staging are accounted for, one could in principle run Grid Engine without NFS over a WAN. However, part of Grid Engine's built-in authentication is that the username of the user submitting a job must be recognized on the compute host where the job runs. If running across administrative domains, the username might not exist on the target exec host, and some mechanism for mapping or provisioning the account on the execution side is then required.



Old News ;-)

[Jul 12, 2012] Reducing and Eliminating NFS usage by Grid Engine

This is an updated version compared with the one at gridscheduler.sourceforge.net

liv.ac.uk

[Jun 08, 2012] Reducing and Eliminating NFS usage by Grid Engine

gridscheduler.sourceforge.net


[gridengine users] SGE and NFS

If you go 100% local my recommendation would just be to put the whole $SGE_ROOT out on the local nodes. The time it would take to winnow down to the minimal file set is not worth it relative to the size of the whole thing.

Chris Dagdigian dag at sonsorol.org

Wed Nov 12 16:31:44 UTC 2014


my $.02

SGE can run 100% local without NFS - the main thing (in my experience) that you lose in this config is the easy troubleshooting ability of going into a central $SGE_ROOT/$SGE_CELL/ and seeing all of the various node spool and message files. It's annoying but not a dealbreaker, especially after seeing what you are experiencing.

That said, I do a ton of SGE work with classic spooling on EMC Isilon storage - some environments that do close to 1 million jobs/month in throughput and we've never seen a catastrophic loss of jobs or spool data. Most are without Bright although I know of at least one group running Bright on 1000 cores sitting on top of Isilon storage and they've not seen anything like this either.

If you go 100% local my recommendation would just be to put the whole $SGE_ROOT out on the local nodes. The time it would take to winnow down to the minimal file set is not worth it relative to the size of the whole thing.

-Chris


Skylar Thompson skylar2 at u.washington.edu

Wed Nov 12 16:33:56 UTC 2014


Hi Eric,

We produce our own RPMs using FPM, just so we don't have to have the
executables on NFS. When the NFS storage is busy, it can make GE unusable
and sometimes unstable (if you hit protocol timeouts) if the executables
and/or job spool are on NFS.

On Wed, Nov 12, 2014 at 04:26:51PM +0000, Peskin, Eric wrote:
> All,
>
> Does SGE have to use NFS or can it work locally on each node?
> If parts of it have to be on NFS, what is the minimal subset?
> How much of this changes if you want redundant masters?
>
> We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE 2011.11. Specifically, SGE is provided by a Bright package: sge-2011.11-360_cm6.0.x86_64
>
> Twice, we have lost all the running SGE jobs when the cluster failed over from one head node to the other. =( Not supposed to happen.
> Since then, we have also had many individual jobs get lost. The later situation correlates with messages in the system logs saying
>
> > abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd' seems to be deleted
>
> That file lives on an NFS mount on our Isilon storage.
> Surely, the executables don't have to be on NFS?
> Interestingly, we are using local spooling; the spool directory on each node is /cm/local/apps/sge/var/spool, which is indeed local.
> But the $SGE_ROOT , /cm/shared/apps/sge/2011.11 lives on NFS.
> Does any of it need to?
> Maybe just the var part would need to: /cm/shared/apps/sge/var ?
>
> Thanks,
> Eric

Reuti reuti at staff.uni-marburg.de
Wed Nov 12 16:49:14 UTC 2014

Am 12.11.2014 um 17:26 schrieb Peskin, Eric:

> All,
>
> Does SGE have to use NFS or can it work locally on each node?
> If parts of it have to be on NFS, what is the minimal subset?

Usually it's sufficient to have the spool directories local. We never have had problems with sge_execd not being accessible.

At one point I played around with staging the complete /usr/sge to the node while it boots. IIRC I had /usr/sge as symbolic link to some NFS mount of SGE, and after the copy process I replaced the symbolic link with one to a local directory. So the overall path stayed the same all the time.

-- Reuti

Feng Zhang prod.feng at gmail.com
Wed Nov 12 16:53:15 UTC 2014

Bright sets the spool to be local on each node, while the config and
executables are on NFS if you have an HA configuration on your head servers.
I think in theory, if the active head fails, you can bring it offline
and make the passive head active manually, and your jobs will not be
lost.

From the error message, it looks like the NFS server failed too, so that
the node cannot mount it. Is the NFS server installed on the failed
head server? I remember that Bright recommends using a separate NFS
server.

Bill Bryce bbryce at univa.com
Wed Nov 12 16:55:09 UTC 2014

more suggestions....

Since you are using Bright Cluster Manager to manage the configuration of Grid Engine, you should talk to Bright support and make sure there are no unwanted side effects caused by changing the configuration. Several of our customers forget that Bright is managing the Grid Engine cluster and modify the Grid Engine configuration directly - then 10 minutes later Bright 'rewrites' the config.

If everything is local you can't have a shadow master since the spool directory is on the local grid engine master. The accounting files will also be written on the master - so accessing them from other machines won't work unless you copy things around.

Peskin, Eric Eric.Peskin at nyumc.org
Wed Nov 12 17:32:48 UTC 2014

The NFS server is separate -- our Isilon storage. We are working with EMC to determine whether there are issues there. But in the meantime, I am trying to figure out how independent we can get from NFS to limit vulnerability to that sort of problem.





Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: September 12, 2017