SGE Troubleshooting


Jobs queried by qstat can be in several different queue states (see SGE queue states and state codes for the full list).

The most typical problem in SGE is a stalled job, or a situation in which a node does not accept any jobs (see also Job or Queue Reported in Error State E).

As with any troubleshooting, the first step is to read the SGE logs (see SGE log files).

There are several reasons such a situation can arise:

  1. Passwordless ssh is misconfigured (for example, a node was reinstalled, but stale ssh host keys remain in known_hosts and block passwordless ssh).
     
  2. A particular queue instance is in the error state (see also Job or Queue Reported in Error State E). If a job fails in a way that makes SGE think the node might be at fault (as is the case with misconfigured ssh), SGE marks the queue instance on that node as being in error.
     'qstat -f' shows any queue instances marked as being in the error state.
     'qstat -g c' gives a quick summary of total slots, slots in use, slots offline/errored, etc.
     'qstat -s p' shows pending jobs, i.e. all those in state "qw" and "hqw".
     'qstat -s h' shows jobs on hold, i.e. all those in state "hqw".
     'qstat -u "*" | grep " qw"' lists waiting jobs for all users.
Using those commands you can also build your own small scripts to help you analyze the situation, starting with something as primitive as a jobstat helper (a bash rewrite of the original csh alias):

jobstat () {
    echo "Running jobs: $(qstat -u '*' | awk '$5 == "r"' | wc -l)"
    echo "Pending jobs: $(qstat -u '*' | awk '$5 == "qw" || $5 == "hqw"' | wc -l)"
    echo "------------------------"
    echo "Total:        $(qstat -u '*' | tail -n +3 | wc -l)"   # skip the two qstat header lines
}

The first thing to do in such a situation is to look at the output of qstat -f (see Queue states for an explanation of the possible states).

Here is an example of qstat -f output:
[1] root@node17: # qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node16                 BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
all.q@node17                 BIP   0/0/12         0.06     lx24-amd64
---------------------------------------------------------------------------------
all.q@node52                 BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
all.q@node53                 BIP   0/0/80         39.72    lx24-amd64
---------------------------------------------------------------------------------
all.q@node54                 BIP   0/0/80         0.02     lx24-amd64
---------------------------------------------------------------------------------
all.q@wx3481-ustc              BIP   0/0/8          -NA-     lx24-amd64    au
---------------------------------------------------------------------------------
c12.q@node52                 BIP   0/0/12         12.02    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node16                 BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
c32.q@node53                 BIP   0/0/32         39.72    lx24-amd64
---------------------------------------------------------------------------------
c32.q@node54                 BIP   0/0/32         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
c40.q@node53                 BIP   0/0/40         39.72    lx24-amd64
---------------------------------------------------------------------------------
c40.q@node54                 BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m12a.q@node52                BIP   0/12/12        12.02    lx24-amd64
---------------------------------------------------------------------------------
m32a.q@node16                BIP   0/0/32         0.00     lx24-amd64
---------------------------------------------------------------------------------
m40a.q@node54                BIP   0/0/40         0.02     lx24-amd64    E
---------------------------------------------------------------------------------
m40b.q@node53                BIP   0/40/40        39.72    lx24-amd64
To clear the situation when a queue instance is in the E state, you can try the qmod -c command (you can issue qmod -c "*" to do a blanket reset of all error states on all nodes).

For example:

qmod -c c40.q c32.q m40a.q
root@node17 changed state of "c40.q@node54" (no error)
Queue instance "c40.q@node53" is already in the specified state: no error
Queue instance "c32.q@node16" is already in the specified state: no error
root@node17 changed state of "c32.q@node54" (no error)
Queue instance "c32.q@node53" is already in the specified state: no error
root@node17 changed state of "m40a.q@node54" (no error)
qhost prints out the execution host configuration and load. For example:
# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node16                   lx24-amd64     32  0.01  126.0G 1018.1M   62.5G     0.0
node17                   lx24-amd64     12  0.09    5.8G  637.4M   11.7G  204.0K
node52                   lx24-amd64     12 12.02   47.2G    2.0G   49.1G     0.0
node53                   lx24-amd64     80 39.56  126.0G    1.4G   62.5G     0.0
node54                   lx24-amd64     80  0.02  126.0G  502.9M   62.5G     0.0
node8                    lx24-amd64      8     -   15.7G       -   15.6G       -

qstat -g c prints out the current queue utilization.

qstat -u <user> shows only the jobs of a specific user.

qstat -j <job id> prints out detailed information about the job with the specified job id. If scheduling information is not shown, you need to enable this type of reporting first (see Enabling scheduling information in qstat -j).
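
A hedged sketch of how scheduling information is typically enabled (requires SGE manager privileges; the relevant parameter in the scheduler configuration is schedd_job_info):

# Open the scheduler configuration in your editor and set:
#   schedd_job_info   true
qconf -msconf

# Pending jobs then show a "scheduling info:" section explaining
# why they are not being dispatched:
qstat -j <job id>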

Common pitfalls

The most typical pitfall is that your job "starves" in the waiting queue.

Problem: Your job "starves" in the waiting queue.
    Possible reason: The farm is full.
        Solution: check the output of "qstat -g c" for available nodes.
    Possible reason: You requested resources which cannot be fulfilled, e.g. -l h_cpu > 48:00:00.
        Solution: simply request a CPU time below 49 hours.

Problem: Only some of a set of identical jobs die.
    Possible reason: You did not specify your requirements correctly, e.g. you did not specify h_cpu. If h_cpu is not specified, your job might run on short queues; if it needs more than 30 minutes of CPU time there, it will be killed.
    Possible reason: Too many jobs access data on the same file server at once.
        Solution: use AFS! Do not submit too many jobs at once; if you really need to, try the qsub "-hold_jid" option.

Problem: All your jobs die at once.
    Possible reason: There are problems writing the log files (the job's STDOUT/STDERR):
        - The log directory (located in AFS) contains too many files. SGE's error mail (qsub parameter '-m a') contains a line saying something like "/afs/naf.desy.de/...: File too large". Do not store more than 1000 output files per directory.
        - The output directory is not writable. SGE's error mail contains a line saying something like "/afs/naf.desy.de/...: permission denied". Check the directory permissions.
        - The log directory does not exist on the execution host. You can only use network-enabled filesystems (AFS, NFS) as the log directory; local directories (e.g. /usr1/scratch) won't work.

Problem: Your job uses threads and works perfectly when not run under the batch system, but under SGE it dies with obscure error messages saying that specific threads cannot be started.
    Possible reason: SGE sets the stack size limit to the same value as h_vmem by default, which should be considered a bug in SGE.
        Solution: specify the h_stack job resource, e.g. -l h_stack=10M; 10M is a good value for most applications.

Problem: qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later'.
    Possible reason: The farm is full and qrsh wants to occupy a slot at once.
        Solution: try "qrsh -now n"; that way your request is put into the waiting queue and no immediate execution is forced.

Problem: qrsh starts normally but finishes unexpectedly during program execution.
    Possible reason: You did not request enough memory (the default is 128M).
        Solution: request more memory, e.g. 1 GB: qrsh -l h_vmem=1G.
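
A hedged submit-script sketch that combines the resource requests mentioned above (the job name, paths, and values are illustrative placeholders, not site defaults):

#!/bin/bash
#$ -N my_job                       # hypothetical job name
#$ -cwd                            # run in the submission directory
#$ -l h_cpu=24:00:00               # stay under the 48-hour CPU limit and off the short queues
#$ -l h_vmem=1G                    # per-slot virtual memory request
#$ -l h_stack=10M                  # decouple the stack size limit from h_vmem for threaded code
#$ -m a                            # send mail if the job is aborted
#$ -o logs/$JOB_NAME.$JOB_ID.out   # the logs/ directory must exist on a network filesystem
#$ -e logs/$JOB_NAME.$JOB_ID.err

./my_program                       # placeholder for the actual executable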


Old News ;-)

[May 07, 2017] Monitoring and Controlling Jobs

biowiki.org

After submitting your job to Grid Engine you may track its status using the qstat command, the QMON graphical interface, or email notification.

Monitoring with qstat

The qstat command provides the status of all jobs and queues in the cluster. The most useful options are -f (full listing of queues and the jobs in them), -u <user> (jobs of a specific user), -j <job id> (detailed information about one job), and -g c (a summary of cluster queue usage).

You can refer to the man pages for a complete description of all the options of the qstat command.

Monitoring Jobs by Electronic Mail

Another way to monitor your jobs is to have Grid Engine notify you by email about the status of the job.

In your batch script or on the command line, use the -m option to request that an email be sent and the -M option to specify the email address where it should be sent. This will look like:

#$ -M myaddress@work
#$ -m beas

The -m option selects the events for which you want to receive email: you can be notified at the beginning (b) or end (e) of the job, or when the job is aborted (a) or suspended (s), as in the sample script lines above.

And from the command line you can use the same options (for example):

qsub -M myaddress@work -m be job.sh

How do I control my jobs

Based on the status of the job displayed, you can control the job with the standard SGE commands: qdel (delete a job), qmod -sj / -usj (suspend / unsuspend a job), and qalter (modify the attributes of a pending job).
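
A few hedged usage examples (the job id 12345 is a placeholder):

qdel 12345                     # delete a pending or running job
qmod -sj 12345                 # suspend a running job
qmod -usj 12345                # unsuspend a suspended job
qalter -l h_vmem=2G 12345      # change the resource request of a pending job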

Monitoring and controlling with QMON

You can also use the QMON GUI, which provides a window dialog specifically designed for monitoring and controlling jobs; the buttons are self-explanatory.


For further information, see the SGE User's Guide (PDF, HTML).


[May 07, 2017] Why Won't My Job Run Correctly? (aka How To Troubleshoot/Diagnose Problems)

May 07, 2017 | biowiki.org

Does your job show "Eqw" or "qw" state when you run qstat , and just sits there refusing to run? Get more info on what's wrong with it using:

$ qstat -j <job number>

Does your job actually get dispatched and run (that is, qstat no longer shows it - because it was sent to an exec host, ran, and exited), but something else isn't working right? Get more info on what's wrong with it using:

$ qacct -j <job number> (especially see the lines "failed" and "exit_status")

If any of the above have an "access denied" message in them, it's probably a permissions problem. Your user account does not have the privileges to read from/write to where you told it (this happens with the -e and -o options to qsub often). So, check to make sure you do. Try, for example, to SSH into the node on which the job is trying to run (or just any node) and make sure that you can actually read from/write to the desired directories from there. While you're at it, just run the job manually from that node, see if it runs - maybe there's some library it needs that the particular node is missing.

To avoid permissions problems, cd into the directory on the NFS where you want your job to run, and submit from there using qsub -cwd to make sure it runs in that same directory on all the nodes.
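
For instance, a minimal sketch of that workflow (the path and script name are placeholders):

# Work from a directory on the shared NFS export so every exec host sees it:
cd /nfs/projects/myproject
# -cwd makes the job run, and write its .o/.e files, in this same directory:
qsub -cwd job.sh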

Not a permissions problem? Well, maybe the nodes or the queues are unreachable. Check with:

qstat -f

or, for even more detail:

qstat -F

If the "state" column in qstat -f has a big E , that host or queue is in an error state due to... well, something. Sometimes an error just occurs and marks the whole queue as "bad", which blocks all jobs from running in that queue, even though there is nothing otherwise wrong with it. Use qmod -c <queue list> to clear the error state for a queue.

Maybe that's not the problem, though. Maybe there is some network problem preventing the SGE master from communicating with the exec hosts, such as routing problems or a firewall misconfiguration. You can troubleshoot these things with qping , which will test whether the SGE processes on the master node and the exec nodes can communicate.

N.B.: remember, the execd process on the exec node is responsible for establishing a TCP/IP connection to the qmaster process on the master node, not the other way around. The execd processes basically "phone home". So you have to run qping from the exec nodes, not the master node!

Syntax example (I am running this on an exec node, and sheridan is the SGE master):

$ qping sheridan 536 qmaster 1

where 536 is the port that qmaster is listening on, and 1 simply means that I am trying to reach a daemon. Can't reach it? Make sure your firewall has a hole on that port, that the routing is correct, that you can ping using the good old ping command, that the qmaster process is actually up, and so on.
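
If you are not sure which port your qmaster listens on, it is normally recorded in the SGE environment, in /etc/services, or in the cell directory; a quick sketch (assuming the default cell name "default"):

echo "$SGE_QMASTER_PORT"                     # usually set by the SGE settings.sh script
grep sge_qmaster /etc/services               # or registered as a service entry
cat "$SGE_ROOT/default/common/act_qmaster"   # the host currently acting as qmaster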

Of course, you could ping the exec nodes from the master node, too, e.g. I can see if I can reach exec node kosh like this:

$ qping kosh 537 execd 1

but why would you do such a crazy thing? execd is responsible for reaching qmaster, not the other way around.

If the above checks out, check the messages log in /var/log/sge_messages on the submit and/or master node (on our Babylon Cluster, they're both the node sheridan):

$ tail /var/log/sge_messages

Personally, I like running:

$ tail -f /var/log/sge_messages

before I submit the job, and then submit a job in a different window. The -f option will update the tail of the file as it grows, so you can see the message log change "live" as your job executes and see what's happening as things take place.

(Note that the above is actually a symbolic link I put in to the messages log in the qmaster spool directory, i.e. /opt/sge/default/spool/qmaster/messages.)

One thing that commonly goes wrong is permissions. Make sure that the user that submitted the job using qsub actually has the permissions to write error, output, and other files to the paths you specified.

For even more precise troubleshooting... maybe the problem is unique to only some node(s) or some queue(s)? To pin it down, try to run the job only on a specific node or queue:

$ qsub -l hostname=<node/host name> <other job params>

$ qsub -l qname=<queue name> <other job params>

Maybe you should also try to SSH into the problem nodes directly and run the job locally from there, as your own user, and see if you can get any more detail on why it fails.

If all else fails...

Sometimes, the SGE master host will become so FUBARed that we have to resort to brute, traumatizing force to fix it. The following solution is equivalent to fixing a wristwatch with a bulldozer, but seems to cause more good than harm (although I can't guarantee that it doesn't cause long-term harm in favor of a short-term solution).

Basically, you wipe the database that keeps track of SGE jobs on the master host, taking any problem "stuck" jobs with it. (At least that's what I think this does...)

I've found this useful when:

The solution:

ssh sheridan                        # log in to the SGE master host
su -
service sgemaster stop              # stop qmaster and the scheduler
cd /opt/sge/default/
mv spooldb spooldb.fubared          # set the old spool database aside
mkdir spooldb
cp spooldb.fubared/sge spooldb/     # keep only the base "sge" database file
chown -R sgeadmin:sgeadmin spooldb
service sgemaster start             # restart qmaster with a fresh spool database

Wipe spooldb.fubared when you are confident that you won't need its contents again.

After a node is reinstalled, passwordless ssh does not work because of stale host keys

This is a typical situation when you reinstall a computational node: the node generates a new host key, but the old key is still cached in known_hosts, so non-interactive ssh to the node fails and the node stops accepting jobs.
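
A minimal sketch of clearing the stale key, assuming the reinstalled node is called node52 (a placeholder name); if your cluster keeps a global /etc/ssh/ssh_known_hosts, the stale key may live there instead:

ssh-keygen -R node52                                               # drop the old key from ~/.ssh/known_hosts
ssh-keyscan node52 >> ~/.ssh/known_hosts                           # optionally re-learn the new key non-interactively
ssh -o BatchMode=yes node52 true && echo "passwordless ssh OK"     # verify non-interactive login works again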

[Nov 09, 2015] Some compute nodes not accepting jobs

Jul 19 , 2010 | Rocks-Discuss
Mike Hanby mhanby at uab.edu
Mon Jul 19 12:27:15 PDT 2010

Sometimes, if a job fails in a way that makes SGE think the node might be at fault, SGE will mark the node in error.

'qstat -f' will show any nodes marked as errored
'qstat -g c' will give you a quick summary of total slots, slots in use, and slots offline/errored, etc...
You may be able to find some information in /opt/gridengine/default/spool/qmaster/messages

It's a good idea to run the 'qstat -g c' command periodically. It can help head off some support calls, especially if you have some users who are node watchers ;-)

Also, check 'qhost' from time to time to make sure none of your nodes are overloaded (i.e. jobs behaving badly) as it will display load, memory used and swap used.


[Nov 09, 2015] Situation when one node does not accept jobs, while all others do accept them

try qalter -w v jobnumber

Troubleshooting

imperial.ac.uk
SGE jobs can fail for a variety of reasons, which may be either scheduling problems preventing the job from being executed or runtime errors due to scripting mistakes, etc.

Scheduling Problems

A job will remain in a queued state if SGE cannot allocate appropriate resources as requested in the job submission. This may be because jobs of other users on the cluster are ahead of the queued job in the queue, or because the requested resources are not available on the cluster, e.g. more memory was requested than is provided by any cluster node, or the job was submitted by a user not registered to use the cluster.

The 'qalter' command provides a method for verifying whether a job is capable of being run, assuming no other jobs are present in the queue. Executing 'qalter -w v [jobid]' will sequentially check each queue and report whether a queue capable of executing the job is present on the system.

Example: Unregistered User Submitting Jobs

A job submitted by a user who is not registered to submit jobs to the cluster will appear in the queue but remain in a queued state. Executing 'qalter -w v [jobid]' in this circumstance will produce output such as that shown below. The job is reported as having no permission to run in any of the queues, since the user is not registered. In this case, contact [email protected] to arrange access to the cluster.

[root@codon /]# qalter -w v 3811734
Job 3811734 has no permission for cluster queue "quick"
Job 3811734 has no permission for cluster queue "emaas"
Job 3811734 has no permission for cluster queue "1day_16"
Job 3811734 has no permission for cluster queue "3day_16"
Job 3811734 has no permission for cluster queue "3day_32"
Job 3811734 has no permission for cluster queue "6hour_16"
Job 3811734 has no permission for cluster queue "7day_16"
Job 3811734 has no permission for cluster queue "7day_32"
Job 3811734 has no permission for cluster queue "infinite_128"
Job 3811734 has no permission for cluster queue "infinite_16"
Job 3811734 has no permission for cluster queue "infinite_32"
verification: no suitable queues

Example: Incorrect memory request

A job submitted with a resource request which can never be fulfilled can also be identified by executing 'qalter -w v [jobid]'. For example, in the following case a job was submitted requesting a slot with 5 GB of memory; however, standard slots on the cluster offer either 2 GB or 4 GB, so this request can never be fulfilled. Jobs requiring more memory than a single slot offers should be submitted as SMP jobs requesting a number of slots (see submitting parallel jobs for details).

[bss-admin@codon ~]$ qalter -w v 3811742
Job 3811742 has no permission for cluster queue "quick"
Job 3811742 has no permission for cluster queue "emaas"
Job 3811742 (-l h_vmem=5G) cannot run in queue "1day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "3day_32" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "6hour_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "7day_32" because of cluster queue
Job 3811742 has no permission for cluster queue "infinite_128"
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_16" because of cluster queue
Job 3811742 (-l h_vmem=5G) cannot run in queue "infinite_32" because of cluster queue
verification: no suitable queues

[Dec 30, 2013] Use qping to check if host is running OK

qping -info myhost16 6445 execd 1         # check status of execd from master

[Jul 12, 2012] Gridengine diag tool

[ID 1288901.1]

Modified 19-APR-2012 Type DIAGNOSTIC TOOLS Status PUBLISHED

This documentation contains the scripts for data collection and configuration information in order to troubleshoot and resolve HPC Grid Engine issues.
