Softpanorama

May the source be with you, but remember the KISS principle ;-)

Migrating to Son of Grid Engine 8.1.8


Introduction

Son of Grid Engine is the only actively maintained free version of SGE; the other versions are abandonware. The current version is 8.1.8.

There are scripts to save and load the actual configuration to text files:

ls /usr/sge/util/upgrade_modules
inst_upgrade.sh load_sge_config.sh save_sge_config.sh
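
A mailing-list report quoted in the news section at the end of this page notes that load_sge_config.sh may have to be run more than once, because it can try to set parameters of an object before creating it. A minimal sketch of such a retry wrapper follows. The function name load_with_retry is made up for illustration, and the sketch assumes that the loader takes the directory with the saved configuration as its first argument; check the usage message of your copy of the script before relying on that.

```shell
#!/bin/sh
# Sketch: re-run a configuration loader until it succeeds or a limit is hit.
# Assumption: the loader takes the saved-configuration directory as its
# first argument. load_with_retry is an illustrative name, not part of SGE.

load_with_retry() {
    loader=$1            # path to the loader, e.g. load_sge_config.sh
    backup_dir=$2        # directory that holds the saved configuration
    max_tries=${3:-3}    # give up after this many attempts

    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$loader" "$backup_dir"; then
            echo "load succeeded on attempt $try"
            return 0
        fi
        echo "load attempt $try failed" >&2
        try=$((try + 1))
    done
    return 1
}

# Hypothetical usage on the new master (paths are examples only):
#   load_with_retry "$SGE_ROOT/util/upgrade_modules/load_sge_config.sh" /var/SGE_backup/config 3
```

If the loader still fails after the last attempt, read its error output: as the same report points out, some failures are caused by attributes that the new version requires, and no number of retries fixes those.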

Preparation steps

Migration involves the following steps (we assume a simple NFS based SGE installation here):

  1. Disable all queues on the cluster so that no new jobs are dispatched, and wait until all running jobs finish on at least one execution host. You can then proceed with the steps for that execution host; you do not need to wait until every job on the cluster finishes. For example, you can install the RPMs on an execution host even before the new master is running on the master host.

    qmod -d "*.q" 
  2. While you wait for the jobs to finish, create a full backup of the configuration on the master host. See Backup of SGE configuration for details.
    1. Run your custom script
    2. Run standard SGE backup script
       
  3. Shut down the sgeexec.$SGE_CLUSTER_NAME execution daemon on all execution nodes, or just on those that have finished executing their jobs. You can start working on the execution hosts that are free right now. For example, you can install the RPMs (but cannot yet run the installer). See Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host.
     
  4. Shut down the sgemaster.$SGE_CLUSTER_NAME daemon on the head node. Unregister it from all runlevels with the command:

    chkconfig sgemaster.$SGE_CLUSTER_NAME off

  5. Create a tar file of the old $SGE_ROOT tree and the /etc directory. Copy it to some other server just in case.
    tar cvzf /var/SGE_backup/sge`date +"%y%m%d"`.tar.gz $SGE_ROOT /etc
  6. Remove the environment creation script /etc/profile.d/sge.sh if you use one. Do it now, as it is easy to forget later. If you instead set up the SGE environment with a statement in /etc/profile and /etc/bashrc (a less elegant way to achieve the same effect), edit those files.

  7. Rename the old $SGE_ROOT directory to, say, /opt/sge_old. If you install into a different directory, create a soft link from the old location to the new one. This is important because in a sizable installation God knows how many scripts refer to the old directory; if you change the location and leave the old directory in place, they will all pick up the old executables, with pretty interesting consequences.
     

  8. Rename the old startup script sgeexec.$SGE_CLUSTER_NAME on the execution nodes to sgeexec.$SGE_CLUSTER_NAME.old or something like that.
     
  9. Remove the old sgeexec script from all runlevels with chkconfig:

     chkconfig sgeexec.$SGE_CLUSTER_NAME off

  10. Remove the old sgemaster.$SGE_CLUSTER_NAME script from all runlevels with chkconfig.
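
The archive from step 5 can be sketched as a small function. This is only an illustration: backup_tree is a made-up name, the destination and source directories are whatever you pass in, and the date-based file name follows the tar command shown above.

```shell
#!/bin/sh
# Sketch: archive one or more directories into a dated tarball.
# backup_tree and its arguments are illustrative names, not part of SGE.

backup_tree() {
    dest_dir=$1          # where the archive is written, e.g. /var/SGE_backup
    shift                # remaining arguments: directories to archive
    stamp=$(date +"%y%m%d")
    archive="$dest_dir/sge$stamp.tar.gz"

    mkdir -p "$dest_dir" || return 1
    tar czf "$archive" "$@" || return 1
    # List the archive once to confirm it is readable before relying on it.
    tar tzf "$archive" >/dev/null || return 1
    echo "$archive"
}

# Hypothetical usage (run as root so that all of /etc is readable):
#   backup_tree /var/SGE_backup "$SGE_ROOT" /etc
```

The function prints the name of the archive it created, which makes it easy to copy the file to another server in the next command.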

If you change the location of $SGE_ROOT, investigate where old settings of $SGE_ROOT are located and change them

If you have a large cluster with a lot of users, it is usually not a good idea to change the location of your $SGE_ROOT.

Still sometimes this is a necessary operation. Expect a lot of pain.

The problem is that the old setting of SGE_ROOT can pop up in the most unexpected places. You need to find as many of them as possible, including scripts that generate submit files in commercial applications. Not everybody uses the variable $SGE_ROOT as they should, which is why the old location of SGE should be symlinked to the new $SGE_ROOT. Even this might not be enough: some applications and scripts might set SGE_ROOT to the old location.

This is a real investigation and you need to allocate time for it. In other words, this is a very important and time-consuming step that can't be skipped. Plan accordingly.

  1. Use find with grep to scan the whole filesystem for files that mention the old location (we assume the path of the old version of SGE is stored in $SGE_ROOT_OLD). You should run such a scan several times: once for regular files, a second time for dot files and a third time for scripts.
    find /root $SGE_ROOT_OLD /home -type f -exec grep -l "$SGE_ROOT_OLD" {} \;
  2. Verify that all settings of $SGE_ROOT that point to the old location have been eliminated.

  3. Correct this setting in /etc/profile and in the profiles on all execution hosts. If you use the /etc/profile.d/sge.sh include file on execution hosts, it should be overwritten with the new one. You can do that later; for now just delete the old file from the master and execution hosts.
     

  4. Set and export the variables $SGE_ROOT, $SGE_CELL and $SGE_CLUSTER_NAME on the master host:

    1. $SGE_ROOT variable to /opt/sge,

    2. $SGE_CELL to default

    3. $SGE_CLUSTER_NAME to your cluster name, for example dell 

    The installer can use those settings. Check twice that you export the correct value of the $SGE_ROOT variable, both on the master and on several randomly selected execution hosts. Typos happen.
     

  5. Try to proactively modify the submit-script generators if you use commercial applications like Medea. Some applications that generate submit scripts for SGE have a hard-coded path to qsub or similar things, which you need to change. Where to change it depends on the particular application, but often these are Perl or Tcl scripts that generate the submit file.
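
The scan from step 1 can be wrapped into a reusable function, sketched below. The name find_old_refs is made up for illustration. Because find visits dot files like any other regular file, one pass over a directory covers both ordinary files and dot files.

```shell
#!/bin/sh
# Sketch: list files under the given directories that mention the old SGE path.
# find_old_refs is an illustrative name, not part of SGE.

find_old_refs() {
    old_root=$1          # old installation path to search for
    shift                # remaining arguments: directories to scan
    # -type f limits the scan to regular files (dot files included).
    # grep -l prints only the names of matching files; -F treats the path
    # as a fixed string rather than a regular expression.
    find "$@" -type f -exec grep -lF -- "$old_root" {} + 2>/dev/null | sort
}

# Hypothetical usage:
#   find_old_refs "$SGE_ROOT_OLD" /root /home /etc > /tmp/old_sge_refs.txt
```

Saving the list to a file lets you work through it entry by entry and re-run the scan afterwards to confirm that the list has become empty.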

Install master host

  1. Install RPMs. See Installation of Son of Grid Engine 8.1.8 RPMs for Master Host
  2. Install qmaster. See Installation of Master Host

Install execution hosts

  1. Install RPMs and fix discrepancies. See Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host
  2. Install execution host. See SGE Execution Host Installation

Post-install steps

  1. Check that you copied the new environment creation script to /etc/profile.d/sge.sh if you use one. This should be done on all execution nodes too. Alternatively, check that you modified /etc/profile and /etc/bashrc, which is a less elegant way to achieve the same effect.
     
  2. Try to restore the PEs. SoGE creates three PEs by default. Each of them can be compared with your old PEs to see what changed. The mpi PE, which SoGE installs by default, is usually a good target for comparison.
    [root@qmaster pe]# qconf  -sp mpi
    pe_name            mpi
    slots              99999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
    qsort_args         NONE
    [root@qmaster pe]# qconf  -sp mpi > mpi.soge
    [root@qmaster pe]# diff mpi.old mpi.soge
    2c2
    < slots              999
    ---
    > slots              99999
    11c11,12
    < accounting_summary TRUE
    ---
    > accounting_summary FALSE
    > qsort_args         NONE
  3. Try to restore queues

    Often a queue can't be imported directly because the template has some small differences. You need to write a script to convert old queues to the new format. This usually involves trivial changes like deleting certain fields and/or adding fields. In the example below, the attribute jc_list is not recognized by SoGE and should be deleted from all old queues to make the -Aq operation possible:

    [0]root@qmaster: # qconf -Aq blades.q.old
    error: unknown attribute name "jc_list"
    error: error reading file: "blades.q.old"
  4. Restore users. Start with operators.
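
The conversions described in steps 2 and 3 can be sketched with two small text filters. Both function names are made up for illustration. The first removes an unrecognized attribute such as jc_list from a saved queue definition, the second appends the qsort_args line that the diff above shows as missing from the old PE. The sketch assumes the saved files use the layout printed by qconf, with one attribute per line and long values continued with a trailing backslash.

```shell
#!/bin/sh
# Sketch: adapt saved definitions from the old version before loading them
# into SoGE. strip_attr and add_qsort_args are illustrative names.

# Print a queue file without the given attribute (for example jc_list).
strip_attr() {
    attr=$1
    infile=$2
    # Drop the attribute line plus any backslash-continued lines after it.
    awk -v a="$attr" '
        skipping  { skipping = ($0 ~ /\\$/); next }
        $1 == a   { skipping = ($0 ~ /\\$/); next }
                  { print }
    ' "$infile"
}

# Print a PE file, appending "qsort_args NONE" when the attribute is absent.
add_qsort_args() {
    infile=$1
    cat "$infile"
    if ! grep -q '^qsort_args' "$infile"; then
        printf '%-18s %s\n' qsort_args NONE
    fi
}

# Hypothetical usage on the new master:
#   strip_attr jc_list blades.q.old > blades.q.new && qconf -Aq blades.q.new
#   add_qsort_args mpi.old > mpi.new && qconf -Ap mpi.new
```

Both filters write to standard output and leave the saved files untouched, so the original backup stays available if a conversion turns out to be wrong.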

Change some default settings

  1. Change the schedd_job_info setting in sched_conf from false to true. See Enabling scheduling information in qstat -j for details. sched_conf defines the configuration file format for Grid Engine's scheduler. To modify the configuration, use the graphical user interface qmon or the
    qconf -msconf
    command.
     

    A default configuration is provided with the Grid Engine distribution package, but it usually has the setting schedd_job_info=false, which should be changed to schedd_job_info=true.

    Again, the command is:

    qconf -msconf
  2. If you use InfiniBand, increase the value of the H_MEMORYLOCKED parameter. See ulimit problem with infiniband in SGE
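
qconf -msconf opens an editor, which is inconvenient in scripts. A non-interactive variant of step 1 is sketched below: it dumps the scheduler configuration with qconf -ssconf, rewrites the schedd_job_info line, and loads the result back with qconf -Msconf. The function names are made up for illustration, and apply_job_info needs a running qmaster and manager rights.

```shell
#!/bin/sh
# Sketch: switch schedd_job_info to true without opening an editor.
# enable_job_info and apply_job_info are illustrative names.

# Print a saved scheduler configuration with schedd_job_info set to true.
enable_job_info() {
    sed 's/^\(schedd_job_info[[:space:]]\{1,\}\).*$/\1true/' "$1"
}

# Dump, patch and reload the scheduler configuration.
apply_job_info() {
    tmp=$(mktemp) || return 1
    qconf -ssconf > "$tmp" || return 1
    enable_job_info "$tmp" > "$tmp.new" || return 1
    qconf -Msconf "$tmp.new"
    status=$?
    rm -f "$tmp" "$tmp.new"
    return $status
}

# Hypothetical usage on the master host:
#   apply_job_info && qconf -ssconf | grep schedd_job_info
```

Keeping the text rewrite in its own function makes it possible to check the result on a saved copy of the configuration before anything is loaded into the live scheduler.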
     

 



Old News ;-)

[Nov 08, 2014]  Migrating from sge 6.2u5 to 8.1.7

Reuti reuti at staff.uni-marburg.de
Tue Oct 28 22:28:39 UTC 2014


Am 26.10.2014 um 06:16 schrieb Stuart Barkley:

> Picking a somewhat random message from a slightly old thread to follow
> up on...
> 
> We have been successfully running 8.1.7 on a newly built cluster.
> 
> Now it is getting close to time to migrate our older 356 node cluster
> over to SoGE and I'm fairly nervous about doing the conversion.
> 
> What sort of expected compatibility (if any) is there between SGE
> 6.2u5 and SoGE 8.1.7 between execd and qmaster?
> 
> Is it possible to run new execd nodes with the old qmaster?
> 
> Is it possible to run old execd nodes with the new qmaster?
> 
> Either of these would allow for a more graceful transition to SoGE.

No, it's best to start from scratch. There was also a discussion a couple of days ago what fields need to be added to certain entries created by `save_sge_config.sh`.

Hello,

I've compiled and tested SGE 8.1.7. It looks like it works well in our environment (Linux Ubuntu workstations and virtual nodes).

Now, I'd like to upgrade our current 6.2u5 cluster to this 8.1.7 version.

I've seen that there's a "inst_sge -upd" that should do the job but I cannot find documentation on how to use it and from which version to which one it's able to correctly upgrade. I'd like to keep the current configuration (complexes, user groups, host groups, policies, spool etc...).

Can someone point me to a good documentation ?

Best regards

Norbert

===

Hi Norbert:

As Reuti says here: http://gridengine.org/pipermail/users/2012-October/004927.html

There are scripts to save and load the actual configuration to text files:

ls /usr/sge/util/upgrade_modules
inst_upgrade.sh load_sge_config.sh save_sge_config.sh

We've used these a few times over the years to good effect.

Good luck with the upgrade!

Cheers,

-Hugh

Am 10.10.2012 um 16:10 schrieb Esztermann, Ansgar:

> we are currently upgrading our cluster management software. Since this involves setting up a new master, migrating to a new GE version now might be a good idea as well. Is there any "official" procedure for that? (We have 6.2u5p1, the one that comes with our new cluster software is 2011.11.)
>
> Failing that, how would we go about transferring GE configuration from the old to the new machine?

There are scripts to save and load the actual configuration to text files:

$ ls /usr/sge/util/upgrade_modules
inst_upgrade.sh load_sge_config.sh save_sge_config.sh

-- Reuti

In the message dated: Thu, 18 Sep 2014 18:58:02 -0000,
The pithy ruminations from "MacMullan, Hugh" on
<Re: [gridengine users] Migrating from sge 6.2u5 to 8.1.7> were:
=> Hi Norbert:
=>
=> As Reuti says here: http://gridengine.org/pipermail/users/2012-October/004927.html
=>
=> There are scripts to save and load the actual configuration to text files:
=>

Having recently upgraded from 6.2u5 to 8.1.6, I have some observations that may be helpful.

=> $ ls /usr/sge/util/upgrade_modules
=> inst_upgrade.sh load_sge_config.sh save_sge_config.sh
=>

SGE 8.1.x requires some parameters that do not exist in 6.2u5. For us, the issue was that the field "qsort_args" is required in PE definitions under 8.1.x, but is not present in 6.2u5, so loading the previous configurations failed to create the PEs under the new version. There may be other required values with similar issues.

You may need to run "load_sge_config.sh" multiple times due to the order of operations. For example, the script may attempt to set the parameters of a queue parameters before creating the queue, causing an error on the first execution, then success the second time load_sge_config.sh is run.

The output of 'qhost' is different -- 8.1.x adds 'NSOC' and 'NCOR' fields, so programs that parse qhost may need alteration.

Mark

=> We've used these a few times over the years to good effect.

=>
=> Good luck with the upgrade!
=>
=> Cheers,
=> -Hugh
=>



FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusively for research and educational purposes. If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner.

ABUSE: IPs or network segments from which we detect a stream of probes might be blocked for no less than 90 days. Multiple types of probes increase this period.



Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.


Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: November 13, 2014