SGE Troubleshooting


Jobs queried by qstat can be in several different queue states (see SGE queue states and state codes for the full list).

The most typical problem in SGE is a stalled job, or a situation in which a node does not accept any jobs (see also Job or Queue Reported in Error State E).

As with any troubleshooting, the first step is to read the SGE logs (see SGE log files).

There are several reasons such a situation can arise:

  1. Passwordless ssh is misconfigured (for example, a node was reinstalled, but stale ssh host keys remain in known_hosts and block passwordless ssh).
     
  2. A particular queue instance is in the error state (see also Job or Queue Reported in Error State E). If a job fails in a way that makes SGE think the node might be at fault (as is the case with misconfigured ssh), SGE marks the queue instance on that node as being in error.
     'qstat -f' shows any queue instances marked as being in the error state.
     'qstat -g c' gives a quick summary of total slots, slots in use, slots offline/errored, etc.
     'qstat -s p' shows pending jobs, i.e. all those in state "qw" and "hqw".
     'qstat -s h' shows jobs on hold, i.e. all those in state "hqw".
     'qstat -u "*" | grep " qw"' lists waiting jobs for all users.
Using those commands you can also build your own small scripts to help you analyze the situation, starting with something as primitive as a jobstat helper (a bash rewrite of the original csh alias):

jobstat () {
    echo "Running jobs: $(qstat -u '*' | awk '$5 == "r"' | wc -l)"
    echo "Pending jobs: $(qstat -u '*' | awk '$5 == "qw" || $5 == "hqw"' | wc -l)"
    echo "------------------------"
    echo "Total:        $(qstat -u '*' | tail -n +3 | wc -l)"   # skip the two qstat header lines
}

The first thing to do in such a situation is to look at the output of qstat -f (see Queue states for an explanation of the possible states).

Here is an example of qstat -f output:
[1] root@node17: # qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node16                 BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
all.q@node17                 BIP   0/0/12         0.06     lx24-amd64
---------------------------------------------------------------------------------
all.q@node52                 BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
all.q@node53                 BIP   0/0/80         39.72    lx24-amd64
---------------------------------------------------------------------------------
all.q@node54                 BIP   0/0/80         0.02     lx24-amd64
---------------------------------------------------------------------------------
all.q@wx3481-ustc              BIP   0/0/8          -NA-     lx24-amd64    au
---------------------------------------------------------------------------------
c12.q@node52                 BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node16                 BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
c32.q@node53                 BIP   0/0/32         39.72    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node54                 BIP   0/0/32         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
c40.q@node53                 BIP   0/0/40         39.72    lx24-amd64
---------------------------------------------------------------------------------
c40.q@node54                 BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m12a.q@node52                BIP   0/12/12        12.02    lx24-amd64
---------------------------------------------------------------------------------
m32a.q@node16                BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
m40a.q@node54                BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m40b.q@node53                BIP   0/40/40        39.72    lx24-amd64
To clear the situation when a queue instance is in the E state, you can try the qmod -c command (you can issue qmod -c "*" to do a blanket reset of all error states on all nodes).

For example:

qmod -c c40.q c32.q m40a.q
root@node17 changed state of "c40.q@node54" (no error)
Queue instance "c40.q@node53" is already in the specified state: no error
Queue instance "c32.q@node16" is already in the specified state: no error
root@node17 changed state of "c32.q@node54" (no error)
Queue instance "c32.q@node53" is already in the specified state: no error
root@node17 changed state of "m40a.q@node54" (no error)
qhost prints out the execution host configuration and load. For example:
# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node16                   lx24-amd64     32  0.01  126.0G 1018.1M   62.5G     0.0
node17                   lx24-amd64     12  0.09    5.8G  637.4M   11.7G  204.0K
node52                   lx24-amd64     12 12.02   47.2G    2.0G   49.1G     0.0
node53                   lx24-amd64     80 39.56  126.0G    1.4G   62.5G     0.0
node54                   lx24-amd64     80  0.02  126.0G  502.9M   62.5G     0.0
node8                    lx24-amd64      8     -   15.7G       -   15.6G       -

qstat -g c prints out the current queue utilization.

qstat -u <user> shows only the jobs of a specific user.

qstat -j <job id> prints out detailed information about the job with the specified job id. If scheduling information is not shown, you need to enable this type of reporting first (see Enabling scheduling information in qstat -j).
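
A hedged sketch of how scheduling information is typically enabled (requires SGE manager privileges; the relevant parameter in the scheduler configuration is schedd_job_info):

# Open the scheduler configuration in your editor and set:
#   schedd_job_info   true
qconf -msconf

# Pending jobs then show a "scheduling info:" section explaining
# why they are not being dispatched:
qstat -j <job id>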

Common pitfalls

The most typical pitfall is that your job "starves" in the waiting queue.

Problem: Your job "starves" in the waiting queue.
    Possible reason: The farm is full.
        Solution: check the output of "qstat -g c" for available nodes.
    Possible reason: You requested resources which cannot be fulfilled, e.g. -l h_cpu > 48:00:00.
        Solution: simply request a CPU time below 49 hours.

Problem: Only some of a set of identical jobs die.
    Possible reason: You did not specify your requirements correctly, e.g. you did not specify h_cpu. If h_cpu is not specified, your job might run on short queues; if it needs more than 30 minutes of CPU time there, it will be killed.
    Possible reason: Too many jobs access data on the same file server at once.
        Solution: use AFS! Do not submit too many jobs at once; if you really need to, try the qsub "-hold_jid" option.

Problem: All your jobs die at once.
    Possible reason: There are problems writing the log files (the job's STDOUT/STDERR):
        - The log directory (located in AFS) contains too many files. SGE's error mail (qsub parameter '-m a') contains a line saying something like "/afs/naf.desy.de/...: File too large". Do not store more than 1000 output files per directory.
        - The output directory is not writable. SGE's error mail contains a line saying something like "/afs/naf.desy.de/...: permission denied". Check the directory permissions.
        - The log directory does not exist on the execution host. You can only use network-enabled filesystems (AFS, NFS) as the log directory; local directories (e.g. /usr1/scratch) won't work.

Problem: Your job uses threads and works perfectly when not run under the batch system, but under SGE it dies with obscure error messages saying that specific threads cannot be started.
    Possible reason: SGE sets the stack size limit to the same value as h_vmem by default, which should be considered a bug in SGE.
        Solution: specify the h_stack job resource, e.g. -l h_stack=10M; 10M is a good value for most applications.

Problem: qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later'.
    Possible reason: The farm is full and qrsh wants to occupy a slot at once.
        Solution: try "qrsh -now n"; that way your request is put into the waiting queue and no immediate execution is forced.

Problem: qrsh starts normally but finishes unexpectedly during program execution.
    Possible reason: You did not request enough memory (the default is 128M).
        Solution: request more memory, e.g. 1 GB: qrsh -l h_vmem=1G.
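
A hedged submit-script sketch that combines the resource requests mentioned above (the job name, paths, and values are illustrative placeholders, not site defaults):

#!/bin/bash
#$ -N my_job                       # hypothetical job name
#$ -cwd                            # run in the submission directory
#$ -l h_cpu=24:00:00               # stay under the 48-hour CPU limit and off the short queues
#$ -l h_vmem=1G                    # per-slot virtual memory request
#$ -l h_stack=10M                  # decouple the stack size limit from h_vmem for threaded code
#$ -m a                            # send mail if the job is aborted
#$ -o logs/$JOB_NAME.$JOB_ID.out   # the logs/ directory must exist on a network filesystem
#$ -e logs/$JOB_NAME.$JOB_ID.err

./my_program                       # placeholder for the actual executable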


Old News ;-)

[May 07, 2017] Monitoring and Controlling Jobs

biowiki.org

After submitting your job to Grid Engine you may track its status using the qstat command, the QMON graphical interface, or email notification.

Monitoring with qstat

The qstat command provides the status of all jobs and queues in the cluster. The most useful options are -f (full listing of queues and the jobs in them), -u <user> (jobs of a specific user), -j <job id> (detailed information about one job), and -g c (a summary of cluster queue usage).

You can refer to the man pages for a complete description of all the options of the qstat command.

Monitoring Jobs by Electronic Mail

Another way to monitor your jobs is to have Grid Engine notify you by email about the status of the job.

In your batch script or on the command line, use the -m option to request that an email be sent and the -M option to specify the email address where it should be sent. This will look like:

#$ -M myaddress@work
#$ -m beas

The -m option selects the events for which you want to receive email: you can be notified at the beginning (b) or end (e) of the job, or when the job is aborted (a) or suspended (s), as in the sample script lines above.

And from the command line you can use the same options (for example):

qsub -M myaddress@work -m be job.sh

How do I control my jobs

Based on the status of the job displayed, you can control the job with the standard SGE commands: qdel (delete a job), qmod -sj / -usj (suspend / unsuspend a job), and qalter (modify the attributes of a pending job).
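
A few hedged usage examples (the job id 12345 is a placeholder):

qdel 12345                     # delete a pending or running job
qmod -sj 12345                 # suspend a running job
qmod -usj 12345                # unsuspend a suspended job
qalter -l h_vmem=2G 12345      # change the resource request of a pending job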

Monitoring and controlling with QMON

You can also use the QMON GUI, which provides a window dialog specifically designed for monitoring and controlling jobs; the buttons are self-explanatory.


For further information, see the SGE User's Guide (PDF, HTML).


[May 07, 2017] Why Won't My Job Run Correctly? (aka How To Troubleshoot/Diagnose Problems)

May 07, 2017 | biowiki.org

Does your job show "Eqw" or "qw" state when you run qstat , and just sits there refusing to run? Get more info on what's wrong with it using:

$ qstat -j <job number>

Does your job actually get dispatched and run (that is, qstat no longer shows it - because it was sent to an exec host, ran, and exited), but something else isn't working right? Get more info on what's wrong with it using:

$ qacct -j <job number> (especially see the lines "failed" and "exit_status")

If any of the above have an "access denied" message in them, it's probably a permissions problem. Your user account does not have the privileges to read from/write to where you told it (this happens with the -e and -o options to qsub often). So, check to make sure you do. Try, for example, to SSH into the node on which the job is trying to run (or just any node) and make sure that you can actually read from/write to the desired directories from there. While you're at it, just run the job manually from that node, see if it runs - maybe there's some library it needs that the particular node is missing.

To avoid permissions problems, cd into the directory on the NFS where you want your job to run, and submit from there using qsub -cwd to make sure it runs in that same directory on all the nodes.
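
For instance, a minimal sketch of that workflow (the path and script name are placeholders):

# Work from a directory on the shared NFS export so every exec host sees it:
cd /nfs/projects/myproject
# -cwd makes the job run, and write its .o/.e files, in this same directory:
qsub -cwd job.sh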

Not a permissions problem? Well, maybe the nodes or the queues are unreachable. Check with:

qstat -f

or, for even more detail:

qstat -F

If the "state" column in qstat -f has a big E , that host or queue is in an error state due to... well, something. Sometimes an error just occurs and marks the whole queue as "bad", which blocks all jobs from running in that queue, even though there is nothing otherwise wrong with it. Use qmod -c <queue list> to clear the error state for a queue.

Maybe that's not the problem, though. Maybe there is some network problem preventing the SGE master from communicating with the exec hosts, such as routing problems or a firewall misconfiguration. You can troubleshoot these things with qping , which will test whether the SGE processes on the master node and the exec nodes can communicate.

N.B.: remember, the execd process on the exec node is responsible for establishing a TCP/IP connection to the qmaster process on the master node, not the other way around. The execd processes basically "phone home". So you have to run qping from the exec nodes, not the master node!

Syntax example (I am running this on an exec node, and sheridan is the SGE master):

$ qping sheridan 536 qmaster 1

where 536 is the port that qmaster is listening on, and 1 simply means that I am trying to reach a daemon. Can't reach it? Make sure your firewall has a hole on that port, that the routing is correct, that you can ping using the good old ping command, that the qmaster process is actually up, and so on.
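
If you are not sure which port your qmaster listens on, it is normally recorded in the SGE environment, in /etc/services, or in the cell directory; a quick sketch (assuming the default cell name "default"):

echo "$SGE_QMASTER_PORT"                     # usually set by the SGE settings.sh script
grep sge_qmaster /etc/services               # or registered as a service entry
cat "$SGE_ROOT/default/common/act_qmaster"   # the host currently acting as qmaster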

Of course, you could ping the exec nodes from the master node, too, e.g. I can see if I can reach exec node kosh like this:

$ qping kosh 537 execd 1

but why would you do such a crazy thing? execd is responsible for reaching qmaster, not the other way around.

If the above checks out, check the messages log in /var/log/sge_messages on the submit and/or master node (on our Babylon Cluster, they're both the node sheridan):

$ tail /var/log/sge_messages

Personally, I like running:

$ tail -f /var/log/sge_messages

before I submit the job, and then submit a job in a different window. The -f option will update the tail of the file as it grows, so you can see the message log change "live" as your job executes and see what's happening as things take place.

(Note that the above is actually a symbolic link I put in to the messages log in the qmaster spool directory, i.e. /opt/sge/default/spool/qmaster/messages.)

One thing that commonly goes wrong is permissions. Make sure that the user that submitted the job using qsub actually has the permissions to write error, output, and other files to the paths you specified.

For even more precise troubleshooting... maybe the problem is unique to only some node(s) or some queue(s)? To pin it down, try to run the job only on a specific node or queue:

$ qsub -l hostname=<node/host name> <other job params>

$ qsub -l qname=<queue name> <other job params>

Maybe you should also try to SSH into the problem nodes directly and run the job locally from there, as your own user, and see if you can get any more detail on why it fails.

If all else fails...

Sometimes, the SGE master host will become so FUBARed that we have to resort to brute, traumatizing force to fix it. The following solution is equivalent to fixing a wristwatch with a bulldozer, but seems to cause more good than harm (although I can't guarantee that it doesn't cause long-term harm in favor of a short-term solution).

Basically, you wipe the database that keeps track of SGE jobs on the master host, taking any problem "stuck" jobs with it. (At least that's what I think this does...)

I've found this useful when:

The solution:

ssh sheridan                        # log in to the SGE master host
su -
service sgemaster stop              # stop qmaster and the scheduler
cd /opt/sge/default/
mv spooldb spooldb.fubared          # set the old spool database aside
mkdir spooldb
cp spooldb.fubared/sge spooldb/     # keep only the base "sge" database file
chown -R sgeadmin:sgeadmin spooldb
service sgemaster start             # restart qmaster with a fresh spool database

Wipe spooldb.fubared when you are confident that you won't need its contents again.

After a node is reinstalled, passwordless ssh does not work because of stale host keys

This is a typical situation when you reinstall a computational node: the node generates a new host key, but the old key is still cached in known_hosts, so non-interactive ssh to the node fails and the node stops accepting jobs.
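
A minimal sketch of clearing the stale key, assuming the reinstalled node is called node52 (a placeholder name); if your cluster keeps a global /etc/ssh/ssh_known_hosts, the stale key may live there instead:

ssh-keygen -R node52                                               # drop the old key from ~/.ssh/known_hosts
ssh-keyscan node52 >> ~/.ssh/known_hosts                           # optionally re-learn the new key non-interactively
ssh -o BatchMode=yes node52 true && echo "passwordless ssh OK"     # verify non-interactive login works again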

[Nov 09, 2015] Some compute nodes not accepting jobs

Jul 19 , 2010 | Rocks-Discuss
Mike Hanby mhanby at uab.edu
Mon Jul 19 12:27:15 PDT 2010

Sometimes, if a job fails in a way that makes SGE think the node might be at fault, SGE will mark the node in error.

'qstat -f' will show any nodes marked as errored
'qstat -g c' will give you a quick summary of total slots, slots in use, and slots offline/errored, etc...
You may be able to find some information in /opt/gridengine/default/spool/qmaster/messages

It's a good idea to run the 'qstat -g c' command periodically. It can help head off some support calls, especially if you have some users who are node watchers ;-)

Also, check 'qhost' from time to time to make sure none of your nodes are overloaded (i.e. jobs behaving badly) as it will display load, memory used and swap used.


[Nov 09, 2015] Situation when one node does not accept jobs, while all others do accept them

try qalter -w v jobnumber

Troubleshooting

imperial.ac.uk
SGE jobs can fail for a variety of reasons, which may be either scheduling problems preventing the job from being executed or runtime errors due to scripting mistakes, etc.

Scheduling Problems

A job will remain in a queued state if SGE cannot allocate appropriate resources as requested in the job submission. This may be because jobs of other users on the cluster are ahead of the queued job in the queue, or because the requested resources are not available on the cluster, e.g. more memory was requested than is provided by any cluster node, or the job was submitted by a user not registered to use the cluster.

The 'qalter' command provides a method for verifying whether a job is capable of being run, assuming no other jobs are present in the queue. Executing 'qalter -w v [jobid]' will sequentially check each queue and report whether a queue capable of executing the job is present on the system.

Example: Unregistered User Submitting Jobs

A job submitted by a user who is not registered to submit jobs to the cluster will appear in the queue but remain in a queued state. Executing 'qalter -w v [jobid]' in this circumstance will produce output such as that shown below. The job is reported as having no permission to run in any of the queues, since the user is not registered. In this case, contact [email protected] to arrange access to the cluster.

[root@codon /]# qalter -w v 3811734
Job 3811734 has no permission for cluster queue "quick"
Job 3811734 has no permission for cluster queue "emaas"
Job 3811734 has no permission for cluster queue "1day_16"
Job 3811734 has no permission for cluster queue "3day_16"
Job 3811734 has no permission for cluster queue "3day_32"
Job 3811734 has no permission for cluster queue "6hour_16"
Job 3811734 has no permission for cluster queue "7day_16"
Job 3811734 has no permission for cluster queue "7day_32"
Job 3811734 has no permission for cluster queue "infinite_128"
Job 3811734 has no permission for cluster queue "infinite_16"
Job 3811734 has no permission for cluster queue "infinite_32"
verification: no suitable queues

Example: Incorrect memory request

A job submitted with a resource request which can never be fulfilled can also be identified by executing 'qalter -w v [jobid]'. For example, in the following case a job was submitted requesting a slot with 5 GB of memory; however, standard slots on the cluster offer either 2 GB or 4 GB, so this request can never be fulfilled. Jobs requiring more memory than a single slot offers should be submitted as SMP jobs requesting a number of slots (see submitting parallel jobs for details).

[bss-admin@codon ~]$ qalter -w v 3811742
Job 3811742 has no permission for cluster queue "quick"
Job 3811742 has no permission for cluster queue "emaas"
Job 3811742 (-l h_vmem=5G) cannot run in queue "1day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_32" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "6hour_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_32" because of cluster queue
Job 3811742 has no permission for cluster queue "infinite_128"
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_32" because of cluster queue
verification: no suitable queues

[Dec 30, 2013] Use qping to check if host is running OK

qping -info myhost16 6445 execd 1         # check status of execd from master

[Jul 12, 2012] Gridengine diag tool

[ID 1288901.1]

Modified 19-APR-2012 Type DIAGNOSTIC TOOLS Status PUBLISHED

This documentation contains the scripts for data collection and configuration information in order to troubleshoot and resolve HPC Grid Engine issues.
