
Grid Engine Config Tips



Most SGE commands follow a common option template: an option letter in uppercase means "read the object definition from a file", while the corresponding lowercase option means "edit it interactively".

The command-line user interface is a set of ancillary programs (commands) for configuring the cluster, submitting and controlling jobs, and monitoring queues and hosts.
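
For example, a queue definition can be edited either interactively or via a file (a sketch; test.q is a hypothetical queue name):

qconf -mq test.q               # lowercase: opens the definition in an editor
qconf -sq test.q > test.q.txt  # dump the current definition to a file
qconf -Mq test.q.txt           # uppercase: load the (edited) definition from the file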

Handy aliases

Here are some handy aliases I find useful in my ~/.bashrc file:
alias qcall='qconf -mq all.q'        # edit the all.q cluster queue definition
alias qerrors='qstat -f -explain E'  # list queue instances in error state, with explanations
alias qsummary='qstat -g c'          # cluster-wide summary by queue
alias qclear='qmod -c "*"'           # clear error states on all queues


Old News ;-)

[Nov 10, 2014] Enabling qstat -j information

See also Enabling scheduling information in qstat -j

schedd_job_info

The default scheduler can keep track of why jobs could not be scheduled during the last scheduler run. This parameter enables or disables that observation: the value true enables the monitoring, false turns it off.

It is also possible to activate the observation only for certain jobs: set the parameter to job_list followed by a comma-separated list of job ids.

If schedd_job_info=true, the user can obtain the collected information with the command

qstat -j <job_number>

One of the important parameters in this file is schedd_job_info, which determines whether qstat -j provides scheduling information about jobs (Chris Dagdigian):

In this case the change is that with 6.2 the parameter "schedd_job_info" now defaults to FALSE, whereas in the past it was TRUE.

I *completely* understand why the change happened since the 6.2 design goal was for massive scalability and schedd_job_info can put a massive load on the SGE system particularly in massive clusters like Ranger where 6.2 was tested out.

But ... are most 6.2 deployments going on to systems where the exechost count or job throughput rates means that setting schedd_job_info=FALSE has a measurable performance gain, significant enough to offset the massive loss of end-user-accessible troubleshooting information? I suspect ... not.

The schedd_job_info output appended in the output of "qstat -j" is the single most effective troubleshooting and "why does my job not get dispatched" resource that is available to non SGE administrators. Taking this tool away from users (in my opinion) has a bigger negative impact than any performance gains realized (at least for the types of systems I work on most often).

So -- just like I recommend and tell people to use classic spooling on smaller systems I also plan on telling people to re-enable schedd_job_info feature on their 6.2 systems (if their system and workflow allows).

I'm bringing this up on the list for two reasons:

- Just to see what others think

Here is a sample scheduler configuration (as displayed by qconf -ssconf) with schedd_job_info enabled:

algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY
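
To re-enable the feature non-interactively, the file-based variant of the same command can be used (a sketch; the temporary file name is arbitrary, and sed -i assumes GNU sed):

qconf -ssconf > /tmp/sched.conf
sed -i 's/^schedd_job_info .*/schedd_job_info                   true/' /tmp/sched.conf
qconf -Msconf /tmp/sched.conf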

[Sep 20, 2014] Slot hacking in SGE

Typically a cluster is fully loaded with jobs almost all of the time, which makes it difficult to submit small (1-2 min) test jobs that run on more than one node.

bioteam.net

1. Same queue structure as before (see bioteam.net)

2. Attach "slots=2" as a host resource on all nodes

3. Submit test jobs to all queues

The wizard solution:

qconf -aattr exechost complex_values slots=2 <host>

What did we do?

• Slot limits "solve" the oversubscription problem
• Still have these problems:
  • FIFO job execution
  • Priority is handled by OS after SGE scheduling
• We can still do better (stay tuned)…
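
To apply the same limit to every execution host, one can loop over the execution host list (a sketch; use qconf -mattr or -rattr instead of -aattr if complex_values is already set on the hosts):

for host in `qconf -sel`; do
    qconf -aattr exechost complex_values slots=2 $host
done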


Grid Engine Config Tips
Here are some of the Grid Engine configuration steps we should take on a new install. I recommend doing all of these from the very beginning, to prevent later changes that may confuse users or break their workflows.

There is one thing we must always do with a new compute cluster, and that is enable hard memory limits. Users are usually not too keen on any kind of limit, because jobs will eventually run into them. Once users realize that limits ensure node stability and uptime, however, they will demand them. Without limits, one bad job can crash a node and bring down many other jobs.

To enable hard memory limits, we modify the complex configuration to make h_vmem requestable.

# qconf -mc
h_vmem              h_vmem     MEMORY      <=    YES         YES        1g       0
Once this complex is set, it is a good idea to define a default option for qsub in the $SGE_ROOT/default/common/sge_request file. When enabling h_vmem, we should also set a default value for h_stack. h_vmem sets a limit on virtual memory, while h_stack sets a limit on stack space for binary execution. Without a sufficient value for h_stack, programs like Python, Matlab or IDL will fail to start. Here, we are also binding each job to a single core.
-binding linear:1
-q all.q
-l h_vmem=1g
-l h_stack=128m
If we want to manually set values for each individual node, like slots and memory, a for-loop is very helpful.
# qconf -rattr exechost complex_values slots=8,num_proc=8,h_vmem=8g node01
# for ((I=1; I <= 16 ; I++)); do
>   NODE=`printf "node%02d" $I`
>   # total physical memory and total swap, in bytes
>   MEM=`ssh $NODE free -b | awk '/^Mem:/ {print $2}'`
>   SWAP=`ssh $NODE free -b | awk '/^Swap:/ {print $2}'`
>   VMEM=`echo $MEM+$SWAP | bc`
>   qconf -rattr exechost complex_values slots=8,num_proc=8,h_vmem=$VMEM $NODE
> done
To submit a job with a 4 gig limit, use the -l command line option.
$ qsub -l h_vmem=4g -l h_stack=256m myjob.sh
To see available memory, use qstat.
$ qstat -F h_vmem
It is also a good idea to place limits on the amount of memory any single process on the login node may allocate, in the /etc/security/limits.conf file. This example will limit any user in the clusterusers group to 4 gigs per process. Anything larger should be run via qlogin. When adding new users, make sure to add them to this group, as sketched below.
# limit any process to 4GB = 1024*1024*4KB = 4194304
@clusterusers      hard    rss             4194304
@clusterusers      hard    as              4194304
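
New users then need to be placed in that group (a sketch; alice is a placeholder username, and usermod -aG adds a supplementary group):

# usermod -aG clusterusers alice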
There should also be a limit on how many jobs a single user can queue at once. If a user needs to submit over 2000 jobs simultaneously, they should consider a more manageable workflow using array jobs (see the sketch after the configuration below).
# qconf -mconf
max_u_jobs 2000
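
For example, a single array job with 2000 tasks is much lighter on the qmaster than 2000 individual jobs; each task selects its work item via $SGE_TASK_ID (a sketch):

# one array job with 2000 tasks instead of 2000 separate submissions
$ qsub -t 1-2000 myjob.sh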
If we want to limit the number of jobs a single user can have in the running state simultaneously:
# qconf -msconf
 max_reservation 128
 maxujobs 128
If the queue will be accepting multi-slot parallel jobs, slot reservation should be enabled to prevent starvation. Otherwise, single-slot jobs will constantly fill in space ahead of the big job. This can be done by submitting multi-slot jobs with the "-R y" option.
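
For example (a sketch, assuming a parallel environment named mpi):

# reserve slots as they free up instead of waiting behind single-slot jobs
$ qsub -pe mpi 64 -R y bigjob.sh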

To enable a simple fairshare policy between all users, there are only three options to check:

# qconf -mconf
enforce_user auto
auto_user_fshare 100
# qconf -msconf
weight_tickets_functional 10000
To be a bit more verbose, we should collect some job scheduler info.
# man sched_conf
# qconf -msconf
 schedd_job_info true
Now we can see why or why not a job is scheduled.
$ qstat -j 427997
$ qacct -j 427997
If we plan to allow graphical GUI programs in the queue, we must setup a qlogin wrapper script with proper X11 forwarding.
# vim /usr/global/sge/qlogin_wrapper
# chmod +x /usr/global/sge/qlogin_wrapper
qlogin_wrapper:
#!/bin/sh
HOST=$1
PORT=$2
shift
shift
echo /usr/bin/ssh -Y -p $PORT $HOST
/usr/bin/ssh -Y -p $PORT $HOST
Set the qlogin wrapper and ssh shell:
# qconf -mconf
 qlogin_command /usr/global/sge/qlogin_wrapper
 qlogin_daemon /usr/sbin/sshd -i
If we have a floating license server with a limited number of seats, we will want to configure a consumable complex resource. When a user submits a job, the qsub option '-l idl=1' must be used. In this example, the number of jobs that specify idl will be limited to 15 at any one time.
# qconf -mc
matlab ml INT <= YES YES 0 0
idl idl INT <= YES YES 0 0
# qconf -me global
complex_values matlab=10,idl=15
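
Submitting against the consumable and checking remaining capacity might look like this (a sketch; analyze.sh is a placeholder script):

$ qsub -l idl=1 analyze.sh     # consumes one of the 15 idl tokens while running
$ qstat -F idl                 # show current availability per queue instance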
If we want to have multiple queues across the same hosts, we can define a policy so that nodes do not become oversubscribed.
# qconf -arqs
{
    name         limit_slots_to_cores_rqs
    description  Prevents core oversubscription across queues. 
    enabled      TRUE    
    limit        hosts {*} to slots=$num_proc
}

Grid Engine Configuration Recipes

Dave Love
2013-08-30

Table of Contents

Script Execution
  • Unix behaviour
  • Modules environment
Parallel Environments
  • Heterogeneous/Isolated Node Groups
  • JSVs
  • Wildcarded PEs
  • Checking for Windows-style line endings in job scripts
Scheduling Policies
  • Host Group Ordering
  • Fill Up Hosts
  • Avoiding Starvation (Resource Reservation/Backfilling)
  • Fair Share
Resource Management
  • Slot Limits
  • Memory Limits
  • Licence Tokens
  • Killing Detached Processes
  • Core Binding
Administration
  • Maintenance Periods
  • Rebooting execution hosts
  • Broken/For-testing Hosts

    This is a somewhat-random collection of commonly-required configuration recipes. It is written mainly from the point of view of high performance computing clusters, and some of the configuration suggestions may not be relevant in other circumstances or for old versions of gridengine. See also the other howto documents. Suggestions for additions/corrections are welcome (to d.love @ liverpool.ac.uk).

    Script Execution

    Unix behaviour

    Set shell_start_mode to unix_behavior in your queue configurations to get the normally-expected behaviour of starting job scripts with the shell specified in the initial #! line.
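
    A non-interactive way to set this (a sketch; all.q is a placeholder queue name):

    $ qconf -mattr queue shell_start_mode unix_behavior all.q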

    Modules environment

    A side-effect of unix_behavior is usually not getting the normal login environment, specifically not having the module command available for those sites that use environment modules. At least for use with the bash shell, add the following to the site sge_request file to avoid users having to source modules.sh etc. in job scripts, assuming sessions from which jobs are submitted have modules available:

    -v module -v MODULESHOME -v MODULEPATH

    This may not work with other shells.

    Parallel Environments

    Heterogeneous/Isolated Node Groups

    Suppose you have various sets of compute nodes over which you want to run parallel jobs, but each job must be confined to a specific set, for example because the sets differ in hardware (such as core counts or architecture) or form isolated network islands.

    Then you'll want to define multiple parallel environments and host groups. There will typically be one PE and one host group (with possibly an ACL) for each node type or island. The PEs will all be the same, unless you want a different fixed allocation_rule for each, but with different names. The names need to be chosen so that you can conveniently use wildcard specifications for them. Normally the names will all have the same name base, e.g. mpi-…. As an example, for different architectures, with different numbers of cores which get exclusive use for any tightly integrated MPI:

    $ qconf -sp mpi-8
    pe_name            mpi-8
    slots              99999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    8
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
    qsort_args         NONE
    $ qconf -sp mpi-12
    pe_name            mpi-12
    slots              99999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    12
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
    qsort_args         NONE

    with corresponding host groups @quadcore and @hexcore for each type of dual-processor box. Those PEs are assigned one-to-one to host groups, ensuring that jobs can't run across the groups (since parallel jobs are always granted a unique PE by the scheduler, whereas they can be split across queues).

    $ qconf -sq parallel
    ...
    seq_no    2,[@quadcore=3],[@hexcore-eth=4],...
    ...
    pe_list   NONE,[@quadcore=make mpi-8 smp],[@hexcore=make mpi-12 smp],...
    ...
    slots     0,[@quadcore=8],[@hexcore=12],...
    ...

    Now the PE naming comes in useful, since you can submit to a wildcarded PE, -pe mpi-*, if you're not fussy about the PE you actually get. See Wildcarded PEs for the next step.

    Suppose you want to retain the possibility of running across all the PEs (assuming they're not isolated). Then you can define an extra PE, say allmpi, which isn't matched by the wildcard.

    Note SGE 8.1.1+ allows general PE wildcards (actually patterns), as documented, fixing the bug which meant that only * was available in older versions. Proper pattern support can be useful with such configurations, e.g. selecting mpi-[1-4].

    JSVs

    See jsv(1) and jsv_script_interface(3) for documentation on job submission verifiers, as well as the examples in $SGE_ROOT/util/resources/jsv/.

    See also the Resource Reservation section.

    Wildcarded PEs

    When you would use a wildcarded PE as above, for convenience and abstraction, you can use a JSV to write the wildcard pattern. This JSV fragment from jsv_on_verify in Bourne shell re-writes -pe mpi to -pe mpi-*:

    if [ $(jsv_is_param pe_name) = true ]; then
        pe=$(jsv_get_param pe_name)
        ...
        case $pe in
            mpi)
                jsv_set_param pe_name "$pe-*"
                pe="$pe-*"
                modified=1
        ...

    Suppose you want to retain the possibility of running across all the PEs (assuming the groups aren't isolated). Then you can define an extra PE, say allmpi, which isn't re-written by the JSV.

    Checking for Windows-style line endings in job scripts

    Users sometimes transfer job scripts from MS Windows systems in binary mode, so that they end up with CRLF line endings, which typically fail, often with a rather obscure failure to execute if shell_start_mode is set to unix_behavior. The following fragment from jsv_on_verify in a shell script JSV prevents submitting such job scripts:

    # Avoid having to use literal ^M
    ctrlM=$(printf "\15")
    ...
    jsv_on_verify () {
    ...
      cmd=$(jsv_get_param CMDNAME)
      case $(jsv_get_param b) in y|yes) binary=y;; esac
      [ "$cmd" != NONE -a "$cmd" != STDIN -a "$binary" != y ] &&
          [ -f "$cmd" ] &&
          grep -q "$ctrlM\$" "$cmd" &&
          # Can't use multi-line messages, unfortunately.
          jsv_reject "\
      Script has Windows-style line endings; transfer in text mode or use dos2unix"
    ...

    Scheduling Policies

    See sched_conf(5) for detailed information on the scheduling configuration.

    Host Group Ordering

    To change scheduling so that hosts in different host groups are preferentially used in some defined order, set queue_sort_method to seqno:

    $ qconf -ssconf
    ...
    queue_sort_method seqno
    ...

    and define the ordering in the relevant queue(s) as required with seq_no:

    $ qconf -sq ...
    ...
    seq_no 10,[@group1=4],[@group2=3],...
    ...

    It is possible to use seqno, for instance, to schedule serial jobs preferentially to one 'end' of the hosts and parallel jobs to the other 'end'.

    Fill Up Hosts

    To schedule preferentially to hosts which are already running a job, as opposed to the default of roughly round robin according to the load level, change the load_formula:

    $ qconf -ssconf
    ...
    queue_sort_method load
    load_formula slots
    ...

    assuming the slots consumable is defined on each node.

    Reasons for compacting jobs onto as few nodes as possible include avoiding fragmentation (so that parallel jobs which require complete nodes have a better chance of being fitted in), or being able to power down complete unused nodes.

    Note Scheduling is done according to load values reported at scheduling time, without lookahead, so that it only takes effect over time.

    Since the load formula is used to determine scheduling when hosts are equal according to queue_sort_method, you can schedule to the preferred host groups by seqno as above, and still compact jobs onto the nodes using slots in the load formula, as above, i.e. with this configuration:

    $ qconf -ssconf
    ...
    queue_sort_method seqno
    load_formula slots
    ...

    Avoiding Starvation (Resource Reservation/Backfilling)

    To avoid "starvation" of larger, higher-priority jobs by smaller, lower-priority ones (i.e. the smaller ones always run in front of the larger ones) enable resource reservation by setting max_reservation to a reasonable value (maybe around 100), and arrange that relevant jobs are submitted with -R y, e.g. using a JSV. Here is a JSV fragment suitable for client side use, to add reservation to jobs over a certain size, assuming that PE slots is the only relevant resource:

        if [ $(jsv_is_param pe_name) = true ]; then
            pe=$(jsv_get_param pe_name)
            pemin=$(jsv_get_param pe_min)
    ...
            # Check for an appropriate pe_min with no existing reservation.
            if [ $(jsv_is_param R) = false ]; then
                if [ $pemin -ge $pe_min_reserve ]; then
                    jsv_set_param R y
                    modified=1
                fi
            fi
    Note For "backfilling" (shorter jobs can fill the gaps before reservations) to work properly with jobs which do not specify an h_rt value at submission, the scheduler default_duration must be set to a value other then the default infinity, e.g. to the longest runtime you allow.

    To monitor reservations, set MONITOR=1 in sched_conf(5) params and use qsched(1) after running process_scheduler_log; qsched -a summarizes all the current reservations.

    Fair Share

    It is often required to provide a fair share of resources in some sense, whereby heavier users get reduced priority. There are two SGE policies for this. The share tree policy assigns priorities based on historical usage with a specified lifetime, and the functional policy only takes current usage into account, i.e. is similar to the share tree with a very short decay time. (It isn't actually possible to use the share tree with a lifetime less than one hour.) Use one or the other, but not both, to avoid confusion.

    With both methods, ensure that the default scheduler parameters are changed so that weight_ticket is a lot larger than weight_urgency and weight_priority, or set the latter two to zero if you don't need them. Otherwise it is possible to defeat the fair share by submitting with a high priority (-p) or with resources with a high urgency attached. See sge_priority(5) for details.
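
    An illustrative set of values (set with qconf -msconf; the exact magnitudes are a matter of taste, these just make tickets dominate):

    weight_ticket     1.000000
    weight_urgency    0.000000
    weight_priority   0.000000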

    You may also want to set ACCT_RESERVED_USAGE in execd_params to use effectively 'wall clock' time in the accounting that determines the shares.

    Functional

    For simple use of the functional policy, add

    weight_tickets_functional 10000

    to the default scheduling parameters (qconf -msconf) and define a non-zero fshare for each user (qconf -muser). If you use enforce_user auto in the configuration,

    auto_user_fshare 1000

    could be used to set up automatically-created users (new ones only).

    Warning enforce_user auto implies not using CSP security, which typically is not wise.

    Share Tree

    See share_tree(5).

    To make a simple tree, use qconf -Astree with a file with contents similar to:

    id=0
    name=Root
    type=0
    shares=1
    childnodes=1
    id=1
    name=default
    type=0
    shares=1000
    childnodes=NONE

    and give the share tree policy a high weight, like (qconf -msconf):

    weight_tickets_share 10000

    If you have auto-creation of users (see the warning above), you probably want to ensure that they are preserved with:

    auto_user_delete_time 0

    The share tree usage decays with a half-life of 7 days by default; modify halftime (specified in hours) to change it.

    Resource Management

    Slot Limits

    You normally want to prevent over-subscription of cores on execution hosts by limiting the slots allocated on a host to its core (or actually processor) count - where "processors" might mean hardware threads. There are multiple ways of doing so, according to taste, administrative convenience, and efficiency.

    If you only have a single queue, you can get away with specifying the slot counts in the queue definition (qconf -mq), e.g. by host group

       slots 0,[@hexcore=12],[@quadcore=8]...

    but with multiple queues on the same hosts, you may need to avoid over-subscription due to contributions from each queue.

    An easy way for an inhomogeneous cluster is with the following RQS (with qconf -arqs), although it may lead to slow scheduling in a large cluster:

    {
       name         host-slots
       description  restrict slots to core count
       enabled      true
       limit        hosts {*} to slots=$num_proc
    }

    This would probably be the best solution if num_proc, the processor count, is variable by turning hardware threads on and off.

    Alternatively, with a host group for each hardware type, you can use a set of limits like

       limit        hosts {@hexcore} to slots=12
       limit        hosts {@quadcore} to slots=8

    which will avoid the possible scheduling inefficiency of the $num_proc dynamic limit.

    Finally, and possibly the most foolproof way in normal situations is to set the complex on each host, e.g.

    $ for n in 8 16; do qconf -mattr exechost complex_values slots=$n \
       `qconf -sobjl exechost load_values "*num_proc=$n*"`; done

    Memory Limits

    Normally it is advisable to prevent jobs swapping. To do so, make the h_vmem complex consumable, and give it a default value that is (probably slightly less than) the lowest memory/core that you have on execution hosts, e.g.:

    $ qconf -sc | grep h_vmem
    h_vmem              h_vmem       MEMORY      <=      YES         YES        2000m    0

    (See complex(5) and the definition of memory_specifier.)

    Also set h_vmem to an appropriate value on each execution host, leaving some head-room for system processes, e.g. (with bash-style expansion):

    $ qconf -mattr exechost complex_values h_vmem=31.3G node{1..32}

    Then single-process jobs can't over-subscribe memory on the hosts (at least the jobs on their own), and multi-process ones can't over-subscribe long term (see below).

    Jobs which need more than the default (2000m per slot above) need to request it at job submission with -l h_vmem=…, and may end up under-subscribing hosts' slots to get enough memory in total.

    Each process is limited by the system to the requested memory (see setrlimit(2)), and attempts to allocate more will fail. If it is a stack allocation, the program will typically die; if it is an attempt to malloc(3) too much, well-written programs should report an allocation failure. Also, the qmaster tracks the total memory accounted to the job, and will kill it if allocated memory exceeds the total requested.

    These mechanisms are not ideal in the case of MPI-style jobs, in particular. The rlimit applied is the h_vmem request multiplied by the slot count for the job on the host, but it applies to each process in the job separately; the limit does not apply to the process tree as a whole. This means that MPI processes, for instance, can over-subscribe in the PDC_INTERVAL before the execd notices, and out-of-memory system errors may still occur. Future use of memory control groups will help address this on Linux.

    Note Killing by qmaster due to the memory limit may occur spuriously, at least under Linux, if the execd over-accounts memory usage. Older SGE versions, and possibly newer ones on old Linux versions, use the value of VmSize that Linux reports (see proc(5)); that includes cached data, and takes no account of sharing. The current SGE uses a more accurate value if possible (see execd_params USE_SMAPS). Also, if a job maps large files into memory (see mmap(2)), that may cause it to fail due to the rlimit, since that counts the memory mapped data, at least under Linux. A future version of SGE is expected to provide control over using the rlimit.
    Note Suspended jobs contribute to the h_vmem consumed on the host, which may need to be taken into account if you allow jobs to preempt others by suspension.
    Note Setting h_vmem can cause trouble with programs using pthreads(7), typically appearing as a segmentation violation. This is apparently because the pthreads runtime (at least on GNU/Linux) defines a per-thread stack from the h_vmem limit. The solution is to specify a reasonable value for h_stack in addition; typically a few 10s to 100 or so of MB is a good value, but may depend on the program.
    Note There is also an issue with recent OpenJDK Java. It allegedly tries to allocate 1/4 of physical memory for the heap initially by default, which will fail with typical h_vmem on recent systems. The (only?) solution is to use java -XmxN explicitly, with N derived from h_vmem.

    Licence Tokens

    For managing Flexlm licence tokens, see Olesen's method. This could be adapted to similar systems, assuming they can be interrogated suitably. There's also the licence juggler for multiple locations.

    Killing Detached Processes

    If any of a job's processes detach themselves from the process tree under the shepherd, they are not killed directly when the job terminates. Use ENABLE_ADDGRP_KILL to turn on finding and killing them at job termination. It will probably be on by default in a future version.
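
    ENABLE_ADDGRP_KILL is set in execd_params in the global configuration; a sketch:

    $ qconf -mconf
    ...
    execd_params ENABLE_ADDGRP_KILL=TRUE
    ...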

    Core Binding

    Binding processes to cores (or 'CPU affinity') is normally important for performance on 'modern' systems (in the mainstream at least since the SGI Origin). Assuming cores are not over-subscribed, a good default (since SGE 8.0.0c) is to set a default in sge_request(5) of

    -binding linear:slots

    The allocated binding is accessible via SGE_BINDING in the job's environment, which can be assigned directly to GOMP_CPU_AFFINITY for the benefit of the GNU OpenMP implementation, for instance. If you happen to use OpenMPI, good defaults matching the SGE -binding are (at least for OpenMPI 1.6):

    rmaps_base_schedule_policy = core
    orte_process_binding = core

    Administration

    Maintenance Periods

    Rejecting Jobs

    In case you want to drain the system, adding $SGE_ROOT/util/resources/jsv/jsv_reject_all.sh as a server JSV will reject all jobs at submission with a suitable message.
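
    That means pointing jsv_url in the global configuration at the script; a sketch (see sge_conf(5) for the exact URL syntax):

    $ qconf -mconf
    ...
    jsv_url script:$SGE_ROOT/util/resources/jsv/jsv_reject_all.sh
    ...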

    Down Time

    If you want jobs to stay queued, there are two approaches to avoid starting ones that might run into a maintenance period, assuming you enforce a runtime limit and the maintenance won't start any sooner than that period: a calendar and an advance reservation.

    Calendar

    You can define a calendar for the shutdown period and attach it to all your queues, e.g.

    # qconf -scal shutdown
    calendar_name    shutdown
    year             6.9.2013-9.9.2013=off
    week             NONE
    # qconf -mattr queue calendar shutdown serial parallel
    root@head modified "serial" in cluster queue list
    root@head modified "parallel" in cluster queue list
    Note To get the scheduler to look ahead to the calendar, you need to enable resource reservation (issue #493); reservation may interact poorly with calendars (issue #722), though it's not clear whether this is still a problem.
    Advance reservation

    Define a fake PE with allocation_rule 1 and access only by the admin ACL, say, and attach it to all your hosts, possibly via a new queue if you already have a complex pe_list setup:

    $ qconf -sp all
    slots              99999
    user_lists         admin
    ...
    allocation_rule    1
    ...
    $ qconf -sq shutdown
    qname                 shutdown
    hostlist              @allhosts
    ...
    pe_list               all
    ...

    Now you can make an advance reservation (assuming max_advance_reservations allows it, and you're in arusers as well as admin):

    $ qrsub -l exclusive -pe all $(qselect -pe all|wc -l) -a 201309061200 -d 201309091200

    Rebooting execution hosts

    To reboot execution hosts, you need to ensure they're empty and avoid races with job submission. Thus, submit a job which requires exclusive access to the host and then does a reboot. Since you want to avoid root being able to run jobs for security reasons, use sudo(1) with appropriate settings to allow password-less executions of the commands by the appropriate users. You want to comment out Defaults requiretty from /etc/sudoers, add !requiretty to the relevant policy line, or use -pty y on the job submission. It is cleanest to shut down the execd before the reboot.

    The job submission parameters will depend on what is allowed to run on the hosts in question, but assuming you can run SMP jobs on all hosts (some might not be allowed serial jobs), a suitable job might be

    qsub -pe smp 1 -R y -N boot-$node -l h=$node,exclusive -p 1024 -l h_rt=60 -j y <<EOF
    /usr/bin/sudo /sbin/service sgeexecd.ulgbc5 softstop
    /usr/bin/sudo /sbin/reboot
    EOF

    where $node is the host in question, and we try to ensure the job runs early by using a resource reservation and a high priority.
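
    A matching sudoers fragment might look like the following (a sketch; opsadmin is a placeholder for the account submitting the reboot jobs, and the service name is site-specific):

    Cmnd_Alias SGE_REBOOT = /sbin/service sgeexecd.ulgbc5 softstop, /sbin/reboot
    Defaults!SGE_REBOOT !requiretty
    opsadmin ALL = (root) NOPASSWD: SGE_REBOOT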

    Broken/For-testing Hosts

    Administrative Control

    A useful tactic for dealing with hosts which are broken, or possibly required for testing and not available to normal users, is to make a host group for them, say @testing (qconf -ahgrp testing), and restrict access to it to admin users only with an RQS rule like

       limit users {!@admin} hosts {@testing} to slots=0

    It can also be useful to have a host-level string-valued complex (say comment or broken) with information on the breakage, say with a URL pointing to your monitoring/ticketing system. A utility script can look after adding to the host group, setting the complex and, for instance, assigning downtime (in Nagios' terms) for the host in your monitoring system.
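
    Such a complex could be defined like any other host-level string complex; a sketch of the line added with qconf -mc:

    #name    shortcut  type    relop  requestable  consumable  default  urgency
    broken   broken    STRING  ==     YES          NO          NONE     0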

    Alternatively the RQS could control access on the basis of the broken complex rather than using host group separately.

    A monitoring system like Nagios (which has hooks for such actions and is allowed admin access to the SGE system) can set the status as above when it detects a problem.

    Using a restricted host group or complex is more flexible than disabling the relevant queues on the host, as sometimes recommended; that stops you running test jobs on them and can cause confusion if queues are disabled for other reasons.

    Using Alarm States

    As an alternative to explicitly restricting access as above, one can put a host into an alarm state to stop it getting jobs. This can be done by defining an appropriate complex and a load formula involving it, along with a suitable load sensor. The sensor executes periodic tests, e.g. using existing frameworks, and sets the load value high via the complex if it detects an error. However, since it takes time for the load to be reported, jobs might still get scheduled for a while after the problem occurs.
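
    A minimal load sensor is simply a script that loops reading its standard input (the word quit terminates it) and reports values between begin/end markers. A sketch, where run_health_checks and the health complex are hypothetical site-specific pieces:

    #!/bin/sh
    # minimal load sensor: report a high 'health' load value when checks fail
    host=`hostname`
    while read line; do
        [ "$line" = quit ] && exit 0
        value=0
        run_health_checks || value=100    # hypothetical site-specific test
        echo begin
        echo "$host:health:$value"
        echo end
    done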

    Tests could also be run in the prolog, potentially setting the queue into an error state before trying to run the job. However, that is queue-specific, and the prolog only runs on parallel jobs' master node.

    Copyright © 2012, 2013, Dave Love, University of Liverpool

    Licence GFDL (text), GPL (code).

    Last updated 2014-02-27 15:36:43 GMT
    [Jun 13, 2014] Other ideas

    HowTo:

    1. We want MPI jobs to eat all of the Infiniband on a node, so that no two MPI jobs can run on the same node. However, we want to be able to have a bunch of instances of the same job on the same node. Solution: It's complicated, but see Daniel Templeton's blog for how to do this.

    [Jun 13, 2014] Brown CS Using GridEngine

    Helpful Hints
    Current Working Directory
    To ensure that your job runs in the directory from which you submit it (and to ensure that its standard output and error files land there) use the -cwd option:
       % qsub -cwd runme
    
    Running Now
    If you want GridEngine to run your job now or else fail, give it the -now option:
       % qsub -now y runme
    
    Embedding Options
    You don't have to remember all the qsub options you need for every job you run. You can embed them in your script:
    % cat runme
    #!/bin/sh
    #
    #  Execute from the current working directory
    #$ -cwd
    #
    #  This is a long-running job
    #$ -l inf
    #
    #  Can use up to 6GB of memory
    #$ -l vf=6G
    #
    ~/project/sim
    
    With all the options in the script, executing it is simple:
       % qsub runme
    
    You can, of course, still use command-line arguments to augment or override embedded options.
    Mail Notification
    To receive email notifications about your job, use the "-m" option:
       % qsub -m as runme
    
    In the example above, you will get mail if the job aborts or is suspended. The mail options are:
       a - abort
       b - begin
       e - exit
       s - suspend
    

    Deleting Your Jobs

    Deleting your submitted jobs can be done with the qdel command:
       % qdel job-id
        The specified job-id is deleted.
    
       % qdel -u username
        All the jobs by username are deleted.
    
    Users can only delete their own jobs.
