Softpanorama


qb command


The qb program parses SGE queue information (it reads the output of qstat -f -u '*') and prints it in a 'block' format. Each machine is shown as a block of letters, with each letter representing an SGE job ID. This makes it easy to see which machines are loaded down and which are free, which is useful if you must request special nodes or large amounts of memory.

jbp@head1 [ 84 ] % qb
Cluster Status (by sub-cluster)
as of Fri Aug  6 10:19:18 2004

node    1:16 |GGH|HKK|HJ |H  |HKK|HJ |HLL|HII|HLL|HII|GGH|HII|GGH|HJ |HJ |KK |
       17:32 |HII|JJ |KK |HJ |GGH|HJ |HLL|HJ |HLL|HJ |HJ |JJ |JJ |JJ |JJ |JJ |
       33:48 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
       49:64 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |

cbcb    1:16 |AA |FF |BB |DD |AA |BB |BB |EE |BB |BB |CC |DD |CC |AA |CC |EE |
       17:32 |AA |EE |FF |BB |DD |EE |AA |CC |AA |FF |DD |CC |BB |FF |FF |CC |

chg    01:08 |---|---|---|---|---|---|---|---|

'---' --> machine is down
K --> job 25268, 8 cpus, agenda1.q adobra 08/03/2004 11:24:52
D --> job 25269, 8 cpus, agenda2.q adobra 08/03/2004 11:25:09
L --> job 25270, 8 cpus, agenda3.q adobra 08/03/2004 11:25:09
I --> job 25271, 8 cpus, agenda4.q adobra 08/03/2004 11:25:09

This is not a complete picture of the cluster; it shows only the state of the core nodes, the 'cbcb' nodes, and the 'chg' nodes. The first line of data shows the core nodes, 'node1' through 'node16', with the nodes separated by vertical bars. The next line shows 'node17' through 'node32'. The following two lines, nodes 33 through 48 and nodes 49 through 64, are all empty. The next block shows a similar display for the 'cbcb' nodes. The last block shows the 'chg' nodes, which are either down (perhaps for hardware maintenance) or have not yet been initialized for use with SGE, and are therefore shown as dashes.
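The row layout described above can be sketched in a few lines of Perl. This is only an illustration, not the script's own code: format_rows is a hypothetical helper, the cell data is made up, and the width of 16 nodes per row is simply taken from the sample output (the script itself derives the width from an 80-character line or from the -C option).

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of per-node cells into rows of $cols
# nodes and label each row with its first and last node number.
sub format_rows {
    my ($name, $cols, @cells) = @_;
    my @rows;
    for (my $i = 0; $i < @cells; $i += $cols) {
        my $last = $i + $cols - 1;
        $last = $#cells if $last > $#cells;
        # only the first row of a sub-cluster carries the sub-cluster name
        my $label = $i == 0 ? $name : '';
        push @rows, sprintf("%-4s %3d:%-3d |%s|", $label, $i + 1, $last + 1,
                            join('|', @cells[$i .. $last]));
    }
    return @rows;
}

# 32 cells of made-up data in the style of the sample output
my @cells = (('AA ') x 16, ('   ') x 16);
print "$_\n" for format_rows('cbcb', 16, @cells);
```

Each row is independent of the others, so a sub-cluster whose node count is not a multiple of the row width just gets a shorter last row.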

Within each line, the letters represent jobs in the system. The letter-to-job mapping is shown at the end. For example, job 25268 is shown as the letter 'K', and there are 'K' jobs running on node2, node5, node16, and node19. Looking more closely, each of those nodes shows two 'K's, which adds up to the 8 CPUs reported for the job.
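The same count can be done mechanically. The sketch below is an illustration only: count_letter is a hypothetical helper, not part of qb, and it assumes the plain-text row format of the sample output above (cells separated by vertical bars, one letter per CPU slot).

```perl
use strict;
use warnings;

# Hypothetical helper: given one row of the block view and the number of the
# first node in that row, return { node number => count } for one job letter.
sub count_letter {
    my ($row, $first, $letter) = @_;
    my @cells = split /\|/, $row;
    shift @cells;    # the text before the first '|' is the row label
    my %hits;
    my $node = $first;
    for my $cell (@cells) {
        # count occurrences of the letter in this node's cell
        my $n = () = $cell =~ /\Q$letter\E/g;
        $hits{$node} = $n if $n > 0;
        $node++;
    }
    return \%hits;
}

# The two populated 'node' rows from the sample output above
my @rows = (
    [  1, 'node    1:16 |GGH|HKK|HJ |H  |HKK|HJ |HLL|HII|HLL|HII|GGH|HII|GGH|HJ |HJ |KK |' ],
    [ 17, '       17:32 |HII|JJ |KK |HJ |GGH|HJ |HLL|HJ |HLL|HJ |HJ |JJ |JJ |JJ |JJ |JJ |' ],
);

my $total = 0;
for my $r (@rows) {
    my $hits = count_letter($r->[1], $r->[0], 'K');
    for my $node (sort { $a <=> $b } keys %$hits) {
        print "node$node: $hits->{$node} x K\n";
        $total += $hits->{$node};
    }
}
print "total K slots: $total\n";
```

Run against the sample rows, this finds two 'K' slots on each of nodes 2, 5, 16 and 19, for a total of 8, in agreement with the legend line for job 25268.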

(2 Feb 2007) We have added a few new features to the 'qb' script to help pare down some of the data that is shown. You can run 'qb' with these options to show more or less data:

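qb -x        don't show the job list (just the 'block view')
qb -s        show simple output (no colors, limited job info)
qb -p        don't show pending jobs in the job list
qb -C cols   use an alternate number of columns in the output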
 
#!/usr/bin/perl
#
# (C) 2004-2009, John Pormann, Duke University
#      jbp1@duke.edu
#
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# RCSID $Id: qb,v 1.47 2007/02/02 15:48:33 jbp1 Exp jbp1 $
#
# qb - produce a 'block' view of the cluster/queue load

use Getopt::Std;
getopts('hvVpxsC:');

if( defined($opt_h) ) {
  print "usage:  qb [opts]\n"
    . "  -x              don't show job-list (just the 'block view')\n"
    . "  -s              show simple output (no colors, limited job info)\n"
    . "  -p              don't show pending jobs in job-list\n"
    . "  -C cols         use alternate number of columns in output\n"
    . "  -v              verbose\n"
    . "  -V              really verbose\n";
  exit;
}

if( defined($opt_V) ) {
  $opt_v = $opt_V;
}

%qinfo = ();
%jinfo = ();
%jcount = ();
$minjid = 99999;
$maxjid = 0;
$nextsym = 0;
%subclusters = ();

# collect all the sge info
&process_sge_info();

# there is a blank line before the 'PENDING JOBS' line
# which inserts a null-key into %jinfo
delete( $jinfo{''} );

if( defined($opt_V) ) {
	print "qinfo:\n";
	foreach $key ( keys(%qinfo) ) {
		$val = $qinfo{$key};
		print " [$key] [$val]\n";
	}
	print "jinfo:\n";
	foreach $key ( keys(%jinfo) ) {
		$val = $jinfo{$key};
		print " [$key] [$val]\n";
	}
}

@hostlist = sorthosts( keys(%qinfo) );

# # # # # # # # # # #
# print header info #
# # # # # # # # # # #
$z = localtime;
print "Cluster Status (by sub-cluster)\nas of $z\n\n";

# kludge up a better ordering for subclusters
@subclusterlist = sort(keys(%subclusters));

# try to pretty up the output
# : find max jobs per machine
$maxjobs = 0;
foreach $k ( keys(%qinfo) ) {
	if( $k eq 'pending' ) {
		next;
	}
	$qi = $qinfo{$k};
	@fld = split( ':', $qi );
	$n = 0;
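	# NOTE: this loop is a reconstruction; the text between '<' and '>'
	# was lost in this copy of the script. Entries look like "jobid!ncpus";
	# state tags such as 'down' contain no '!' and are skipped.
	for($i=0;$i<scalar(@fld);$i++) {
		if( $fld[$i] =~ m/!/ ) {
			($x,$y) = split( '!', $fld[$i] );
			$n += $y;
		}
	}
	if( $n > $maxjobs ) {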
		$maxjobs = $n;
	}
}
# : find spacing for group-name column
$maxlen = 0;
foreach $k ( @subclusterlist ) {
  $n = length( $k );
  if( $n > $maxlen ) {
    $maxlen = $n;
  }
}
$fmt = "%-${maxlen}s %3s:%-3s |%s\n";
# : figure 80 chars per line --> minus maxlen/3/3/4 = 64 chars in line
$cols = int( (80-$maxlen-3-3-4)/($maxjobs+1) );
if( defined($opt_C) ) {
	$cols = $opt_C;
}

$n = -1;
$f = 0;
foreach $cluster ( @subclusterlist ) {
  $clst = $cluster;
  $strt = -1;
  $fnsh = -1;
  $text = '';
  $n = 0;
  foreach $q ( @hostlist ) {
    if( $q !~ m/^$cluster\-/ ) {
      next;
    }
    if( $n == $cols ) {
      printf( $fmt, $clst, $strt, $fnsh, $text );
      $clst = '';
      $strt = -1;
      $fnsh = -1;
      $text = '';
      $n = 0;
    }
    if( $strt < 0 ) {
      $strt = $q;
      $strt =~ s/(.*?)\-//g;
      $strt =~ s/\D//g;
    }
    $fnsh = $q;
    $fnsh =~ s/(.*?)\-//g;
    $fnsh =~ s/\D//g;
    $w = '';
    $qi = $qinfo{$q};
    @fld = split( ':', $qi );
    $nnn = 0;
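    # NOTE: the opening of this loop is a reconstruction; the text between
    # '<' and '>' was lost in this copy of the script.
    for($ii=0;$ii<scalar(@fld);$ii++) {
     if( $fld[$ii] =~ m/!/ ) {
      # entry is "jobid!ncpus": look up the job's symbol index and priority
      ($x,$y) = split( '!', $fld[$ii] );
      $high = $jinfo{$x};
      $c = $high;
      $c =~ s/!.*//;
      $z = chr( 65 + $c%26 + (($c%52)>26)*32 );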
      if( defined($opt_s) ) {
        if( $high =~ m/!hi$/ ) {
          $z = uc($z);
        } else {
          $z = lc($z);
        }
      } else {
        $idx = int($c/52) + 31;
        if( $high =~ m/!hi$/ ) {
          $clr = "\033[${idx};1;7m";
        } else {
          $clr = "\033[${idx};1;27m";
        }
        $z = "${clr}${z}";
      }
      for($j=0;$j<$y;$j++) {
        $nnn++;
        $w .= $z;
      }
     }
    }
    if( $qi =~ m/down/ ) {
      $xtra = '=' x ($maxjobs-$nnn);
    } elsif( $qi =~ m/susp/ ) {
      $xtra = '-' x ($maxjobs-$nnn);
    } elsif( $qi =~ m/error/ ) {
      $xtra = '~' x ($maxjobs-$nnn);
    } elsif( $qi =~ m/dead/ ) {
      $xtra = ':' x ($maxjobs-$nnn);
    } else {
      $xtra = ' ' x ($maxjobs-$nnn);
	 }
    if( defined($opt_s) ) {
	   $text .= "$w$xtra|";
    } else {
	   $text .= "$w\033[0m$xtra|";
    }
    $n++;
  }
  printf( $fmt, $clst, $strt, $fnsh, $text );
  print "\n";
}

if( defined($opt_s) ) {
  print "'====' --> machine is down\n"
    .   "'::::' --> machine is known-dead\n"
    .   "'----' --> machine is up but queue is disabled\n"
    .   "           (jobs may still be running)\n"
    .   "'~~~~' --> machine is up but queue is in an error state\n"
    .   "UPPER  --> high-priority job\n"
    .   "* NOTE: the job-to-letter mapping is not unique for 'simple' output\n";
} else {
  print "'====' --> machine is down\n"
    .   "'::::' --> machine is known-dead\n"
    .   "'----' --> machine is up but queue is disabled\n"
    .   "           (jobs may still be running)\n"
    .   "'~~~~' --> machine is up but queue is in an error state\n"
    .   "\033[7minverse\033[0m --> high-priority job\n";
}

if( defined($opt_x) ) {
  exit;
}

# do two passes, once for active jobs and once for pending jobs
foreach $j ( sort(keys(%jinfo)) ) {
  $z = $jinfo{$j};
  $z =~ s/(.*?)!(.*)/$1/;
  $c = $1;
  $y = $2;
  $z = chr( 65 + $c%26 + (($c%52)>26)*32 );
  $clr = int($c/52) + 31;
  if( $y =~ m/!hi$/ ) {
    $clr .= ";7";
  } else {
    $clr .= ";1";
  }
  $z = "\033[${clr}m${z}\033[0m";
  $y =~ s/!/ /g;
  $w = $jcount{$j};
  if( $w > 0 ) {
    $w = "$w cpus";
    print "$z --> job $j, $w, $y\n";
  }
}

if( not defined($opt_p) ) {
  foreach $j ( sort(keys(%jinfo)) ) {
    $z = $jinfo{$j};
    $z =~ s/(.*?)!(.*)/$1/;
    $c = $1;
    $y = $2;
    $z = chr( 65 + $c%26 + (($c%52)>26)*32 );
    $clr = int($c/52) + 31;
    if( $y =~ m/!hi$/ ) {
      $clr .= ";7";
    } else {
      $clr .= ";1";
    }
    $z = "\033[${clr}m${z}\033[0m";
    $y =~ s/!/ /g;
    $w = $jcount{$j};
    if( $w < 0 ) {
      print "$z --> job $j, pending, $y\n";
    }
  }
}


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

sub process_sge_info {
  my ($x,$y,$orig,$pending,$qname,$high,$txt,$jid,$ncpus,$nextsym);
  my ($m1,$m2,$n1,$n2,$prev);
  my @fld;
  my %jobs;

  $qname = 'none';
  %jobs = ();
  $pending = 0;
  $high = '';
  $prev = 'foo-n999';

  open( FP, "qstat -f -u '*' |" );
  # skip first two lines
  <FP>;
  <FP>;
  while( <FP> ) {
   chomp( $_ );
   $orig = $_;
   if( $_ =~ m/^\-\-\-\-\-\-\-\-\-\-/ ) {
     # separator line
     $txt = $qinfo{$qname};
     foreach $x ( keys(%jobs) ) {
       $y = $jobs{$x};
       $txt .= "${x}!$y:";
     }
     $qinfo{$qname} = $txt;
     if( defined($opt_V) ) {
     	print "storing queue [$qname] [$txt]\n";
     }
     $prev = $qname;
     $qname = 'none';
     $high = '';
     %jobs = ();
   } elsif( $_ =~ m/PENDING JOBS/ ) {
     # end of active jobs
     $pending = 1;
     $qname = 'pending';
     if( defined($opt_V) ) {
     	print "* End of active jobs\n";
     }
   } elsif( $_ =~ m/^\#\#\#\#\#\#\#\#\#\#/ ) {
     # end of active jobs
     $pending = 1;
     $qname = 'pending';
     if( defined($opt_V) ) {
     	print "* End of active jobs\n";
     }
   } else {
     # this line has queue or job info on it
     @fld = split( m/\s+/, $orig );
     if( $fld[0] eq '' ) {
     	 shift( @fld );
     }
     if( $fld[0] =~ m/^[A-Za-z]/ ) {
       # new queue entry
       $qname = $fld[0];
       if( $qname =~ m/^highprio/ ) {
         $high = '!hi';
       }
       # trim off the 'high/lowprio.q@' garbage from front of string
       $qname =~ s/^(.*?)\@//;
       if( defined($opt_V) ) {
         print "new queue [$qname]\n";
       }
       # check for missing machines (name==prev but num!=prev+1)
       $m1 = $qname;
       $m1 =~ s/\-n(.*)//;
       $n1 = $1 + 0;
       $m2 = $prev;
       $m2 =~ s/\-n(.*)//;
       $n2 = $1 + 0;
       if( (($m1 eq $m2) and ($n1 != ($n2+1)))
           or (($m1 ne $m2) and ($n1 != 1)) ) {
         if( defined($opt_V) ) {
           print "missing machine [$m1|$n1] [$m2|$n2]\n";
         }
         if( $m1 eq $m2 ) {
           $n1 = $n2 + 1;
         } else {
           $n1 = 1;
         }
         if( $n1 < 10 ) {
           $qinfo{"$m1-n0$n1"} = 'dead:';
         } else {
           $qinfo{"$m1-n$n1"} = 'dead:';
         }
       }
       # is host down?
       # : for SGE6, field 5 will have 'u' in it
       # :: u -- unknown/can't contact sge daemon
       # :: aAE -- alarm/Error state
       # :: CsS -- calendar-suspended/suspended/subordinate
       # :: dD -- disabled
       if( $fld[5] =~ m/u/ ) {
  	      $qinfo{$qname} = 'down:' . $qinfo{$qname};
  	      if( defined($opt_v) ) {
  	        print "** host is down [$qname]\n";
  	      }
  	    } elsif( $fld[5] =~ m/[aA]/ ) {
  	      $qinfo{$qname} = 'alarm:' . $qinfo{$qname};
  	      if( defined($opt_v) ) {
           print "** queue is in an alarm condition [$qname]\n";
  	      }
  	    } elsif( $fld[5] =~ m/[eE]/ ) {
  	      $qinfo{$qname} = 'error:' . $qinfo{$qname};
  	      if( defined($opt_v) ) {
           print "** queue is in an error condition [$qname]\n";
  	      }
  	    } elsif( $fld[5] =~ m/[CsSdD]/ ) {
  	      $qinfo{$qname} = 'susp:' . $qinfo{$qname};
  	      if( defined($opt_v) ) {
           print "** queue is suspended [$qname]\n";
  	      }
  	    }
  	    # add to the 'subclusters' list
  	    $cluster = $fld[0];
  	    $cluster =~ s/^(.*?)\@(.*?)\-(.*)/$2/;
  	    $subclusters{$cluster} = 1;
     } else {
  	    if( $pending ) {
  	      # this is a pending job
         # : array tasks may have already begun running
  	      $jid = $fld[0];
         if( not exists($jinfo{$jid}) ) {
           $jobs{$jid} = -1;
           $jcount{$jid} = -1;
           $jinfo{$jid} = "${nextsym}!$fld[2]!$fld[3]!$fld[5]!$fld[6]";
           if( defined($opt_V) ) {
             print "pending job [$jid] [$nextsym]\n";
           }
           $nextsym++;
         }
  	    } else {
         # running job
  	      $jid = $fld[0];
  	      $ncpus = $fld[7];
  	      $jobs{$jid} += $ncpus;
  	      $jcount{$jid} += $ncpus;
  	      if( not exists($jinfo{$jid}) ) {
  	        if( $jid > $maxjid ) {
  	         $maxjid = $jid;
  	        }
  	        if( $jid < $minjid ) {
  	         $minjid = $jid;
  	        }
  	        $jinfo{$jid} = "${nextsym}!$fld[2]!$fld[3]!$fld[5]!$fld[6]$high";
  	        if( defined($opt_V) ) {
  	       	 print "running job [$jid] [$nextsym]\n";
  	        }
  	        $nextsym++;
  	      }
  	    } # endif (pending vs running)
     } # endif (A-Z line)
   } # endif (non-header line)
  } # next line in file
  
  # don't forget the last set of data!
  if( $qname ne 'none' ) {
     $txt = $qinfo{$qname};
     foreach $x ( keys(%jobs) ) {
       $y = $jobs{$x};
       $txt .= "${x}!$y:";
     }
     $qinfo{$qname} = $txt;
  }
  
  close( FP );
}

# sorthosts subroutine by Benny Kjellgren <@staff.spray.se>
#       correctly sorts FQDN as well as non-FQDN
sub sorthosts {
  my @unsorted = @_;
  my $fqdn;
  my $host;
  my $domain;
  my %domain;
  my %caps;
  my %nums;

  for( @unsorted ) {
     $fqdn = $_;
     ( $host, $domain ) = split('\.', $fqdn, 2);
     $domain{$fqdn} = uc($domain) || "";
     ( $caps{$fqdn} = uc($host) ) =~ s/\d*$//;
     ( $nums{$fqdn} ) = ( $host =~ /(\d*)$/ );
     $nums{$fqdn} = 0 unless $nums{$fqdn};
  }

  my @list = sort {
    $domain{$a} cmp $domain{$b}
      ||
    $caps{$a} cmp $caps{$b}
      ||
    $nums{$a} <=> $nums{$b}
  } @unsorted;

  return( @list );
}

sub get_header_info {
  my $aref = shift( @_ );
  my $i    = shift( @_ );
  my ($y,$z,$cluster,$j,$jj);

  $cluster = $aref->[$i];
  $cluster =~ s/(.*?)[\-\d](.*)/$1/g;

  # first node number for this cluster
  $y = $aref->[$i];
  $y =~ s/\D+//g;

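  # last node for this cluster
  # NOTE: the start of this loop is a reconstruction; the text between
  # '<' and '>' was lost in this copy of the script.
  $z = '';
  for( $jj=$i; $jj<=scalar(@$aref); $jj++ ) {
    if( $jj >= scalar(@$aref) ) {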
      $z = $aref->[scalar(@$aref)-1];
      last;
    } elsif( $aref->[$jj] !~ m/^$cluster/ ) {
      $z = $aref->[$jj-1];
      last;
    } else {
    }
  }
  if( $z eq '' ) {
    $z = $aref->[$jj];
  }
  $z =~ s/\D+//g;

  # trim cluster to only 6 letters
  $cluster =~ s/(......)(.*)/$1/;

  return( ($cluster,$y,$z) );
}
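
A note on the letters: the listing turns a job's sequence number $c (the order in which the job was first seen in the qstat output) into a letter with chr( 65 + $c%26 + (($c%52)>26)*32 ), and into an ANSI color code with int($c/52) + 31. The sketch below wraps both in hypothetical helpers, symbol_for and colour_for, which are not part of qb, just to show how the expressions behave. The ternary is equivalent to the multiplication in the listing.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the letter expression used in the listing.
sub symbol_for {
    my ($c) = @_;
    return chr( 65 + $c % 26 + ( ($c % 52) > 26 ? 32 : 0 ) );
}

# Hypothetical wrapper around the color expression used in the listing:
# an ANSI foreground color code (31 = red, 32 = green, ...).
sub colour_for {
    my ($c) = @_;
    return int( $c / 52 ) + 31;
}

for my $c (0, 25, 26, 27, 51, 52) {
    printf "%3d -> %s (colour %d)\n", $c, symbol_for($c), colour_for($c);
}
```

As written, sequence numbers 0 and 26 both map to 'A' and lowercase 'a' is never produced, because the comparison is > 26 rather than >= 26. If two jobs appear to share a letter and color in the block view, this is one possible cause.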