The execution host installation creates the directory hierarchy required by sge_execd. Some versions of SGE also start the sge_execd daemon on the execution host; in others you must start it manually.
You can automate the installation of multiple execution hosts with the GUI installer: add as many hosts as you wish and they will be installed one by one in a single batch.
If the prerequisites are met and everything has been checked, this lets you install SGE execution hosts on all or a substantial part of the cluster nodes.
Before installing an execution host, you first need to install and configure the master.
Installation consists of two major steps
On the execution host you need to check the following six preconditions:
Register and patch the server
Configure NTP. Check using ntpdate -u ntp1.firm.com
Share common directory from the master host via NFS
Create passwordless login from the master host to execution host
Add SGE services to /etc/services
Usually Java is already installed, but you still need to verify that. If it is not, install it:
yum install java
Loaded plugins: rhnplugin, security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package java-1.6.0-openjdk.x86_64 1:220.127.116.11-18.104.22.168.8.el5_8 set to be updated
--> Processing Dependency: tzdata-java for package: java-1.6.0-openjdk
--> Processing Dependency: libgif.so.4()(64bit) for package: java-1.6.0-openjdk
--> Running transaction check
---> Package giflib.x86_64 0:4.1.3-7.3.3.el5 set to be updated
---> Package tzdata-java.x86_64 0:2012c-1.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================
 Package                  Arch      Version                                     Repository              Size
========================================================================================================================
Installing:
 java-1.6.0-openjdk       x86_64    1:22.214.171.124-126.96.36.199.8.el5_8      rhel-x86_64-server-5    36 M
Installing for dependencies:
 giflib                   x86_64    4.1.3-7.3.3.el5                             rhel-x86_64-server-5    39 k
 tzdata-java              x86_64    2012c-1.el5                                 rhel-x86_64-server-5    181 k

Transaction Summary
========================================================================================================================
Install 3 Package(s)
Upgrade 0 Package(s)

Total download size: 36 M
Is this ok [y/N]: y
Downloading Packages:
(1/3): giflib-4.1.3-7.3.3.el5.x86_64.rpm | 39 kB 00:00
(2/3): tzdata-java-2012c-1.el5.x86_64.rpm | 181 kB 00:01
(3/3): java-1.6.0-openjdk-188.8.131.52-184.108.40.206.8.el5_8.x86_64.rpm | 36 MB 01:31
------------------------------------------------------------------------------------------------------------------------
Total 339 kB/s | 36 MB 01:50
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : giflib 1/3
Installing : tzdata-java 2/3
Installing : java-1.6.0-openjdk 3/3

Installed:
 java-1.6.0-openjdk.x86_64 1:220.127.116.11-18.104.22.168.8.el5_8
Dependency Installed:
 giflib.x86_64 0:4.1.3-7.3.3.el5 tzdata-java.x86_64 0:2012c-1.el5
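The "verify first, install only if missing" step can be scripted. This is a minimal sketch; the helper name have_cmd is made up for this example, and it only checks that a command is on the PATH, not which Java version it is.

```shell
# have_cmd NAME: succeeds if NAME is found on the PATH (hypothetical helper).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}
```

For example, `have_cmd java || yum install java` runs yum only when no java binary is found.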
Most SGE installations share the whole /sge directory from the master host. It should be mounted under the same name on the execution host. See Usage of NFS in Grid Engine.
For large installations you can share less to improve efficiency. If you fail to share at least the $SGE_ROOT/$SGE_CELL/common directory from the qmaster host, you will not be able to install execution hosts on nodes other than the qmaster host.
Create the directory for shared files (for example /sge) and put an appropriate line in the /etc/fstab file.
# cat /etc/fstab | grep "/sge"
m17:/sge /sge nfs rw,hard,intr,tcp,rsize=32768,wsize=32768 1 2
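If you prepare many nodes, the fstab check is easy to script. The sketch below is only an illustration: has_nfs_entry is a hypothetical helper name, and it assumes the usual whitespace-separated fstab layout (device, mount point, type, options).

```shell
# has_nfs_entry FSTAB MOUNTPOINT
# Succeeds if FSTAB has a non-comment line that mounts MOUNTPOINT
# with filesystem type nfs or nfs4 (hypothetical helper).
has_nfs_entry() {
    awk -v mp="$2" '
        /^[[:space:]]*#/ { next }                  # skip comment lines
        $2 == mp && $3 ~ /^nfs4?$/ { found = 1 }   # field 2: mount point, field 3: type
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

A typical call would be `has_nfs_entry /etc/fstab /sge && echo "fstab entry present"`. It checks only the file; whether the share is actually mounted is a separate question.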
On the qmaster host, export the directory either to a whole subnet (using a netmask) or to individual hosts listed in the exports file, for example:
/sge 10.194.186.254(rw,no_root_squash) 10.194.181.26(rw,no_root_squash)
In the latter case you need to restart the NFS daemon on the qmaster host so that it rereads the exports file:
# service nfs restart
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
Create a passwordless login environment.
Tip: If you have already configured it elsewhere, just copy the authorized_keys file from an already configured execution host.
cd /root/.ssh
scp sge01:/root/.ssh/authorized_keys .
Check ssh access from the master host to the node on which you install the execution host (b5 in the example below):
root@m17: # ssh b5
The authenticity of host 'b5 (10.194.181.46)' can't be established.
RSA key fingerprint is 18:35:6e:96:11:77:27:fc:ac:1c:8e:46:36:2b:ae:2b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'b5,10.194.181.46' (RSA) to the list of known hosts.
Last login: Thu Jul 26 08:29:41 2012 from sge_master.firma.net
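Once the host key has been accepted, the passwordless login can be tested without any prompts. This is a sketch, not part of the original procedure: can_ssh and the SSH_CMD override are names invented here, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# can_ssh HOST: succeeds only if login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
# SSH_CMD is a hypothetical override hook, useful for dry runs.
can_ssh() {
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example: `can_ssh b5 && echo "b5: passwordless login OK"`. In batch mode ssh also refuses hosts whose key is not yet in known_hosts, so the first interactive connection is still needed.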
Update /etc/services. You need to add two ports that are used by SGE.
Add the following lines (typically people use the default ports 6444 and 6445, but your mileage may vary):
sge_qmaster 6444/tcp # Grid Engine Qmaster Service
sge_qmaster 6444/udp # Grid Engine Qmaster Service
sge_execd 6445/tcp # Grid Engine Execution Service
sge_execd 6445/udp # Grid Engine Execution Service
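When the same lines have to go onto many nodes, it helps to add them idempotently so that a rerun does not create duplicates. The helper below is a sketch; add_service is a made-up name, and the file name is passed as an argument so that you can try it on a copy first.

```shell
# add_service FILE NAME PORT/PROTO COMMENT
# Appends a services line unless NAME with the same PORT/PROTO is already there.
add_service() {
    file=$1 name=$2 port=$3 comment=$4
    # Field 1 is the service name, field 2 is port/protocol.
    if awk -v n="$name" -v p="$port" \
        '$1 == n && $2 == p { f = 1 } END { exit (f ? 0 : 1) }' "$file"; then
        return 0
    fi
    printf '%s\t%s\t# %s\n' "$name" "$port" "$comment" >> "$file"
}
```

Usage, for instance: `add_service /etc/services sge_execd 6445/tcp "Grid Engine Execution Service"`, repeated for each of the four lines.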
On the execution host: verify that the SGE directory is NFS-mounted. We assume that you are using an NFS-mounted directory (/sge in this example) and that it is already mounted as required by the prerequisites:
cd /sge && ls
On the execution host and the master host: verify the $SGE_ROOT setting for your shell session.
On the execution host: if the $SGE_ROOT environment variable is not set, set it by typing:
# SGE_ROOT=/sge; export SGE_ROOT
To confirm that you have set the $SGE_ROOT environment variable, type:
# echo $SGE_ROOT
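A slightly stronger check than echo is to confirm that the variable points at a directory that really contains the shared cell files. The sketch below assumes the cell name defaults to "default" when $SGE_CELL is unset; check_sge_root is a hypothetical function name.

```shell
# check_sge_root: verifies that SGE_ROOT is set and that the cell's
# common directory is visible under it (hypothetical helper).
check_sge_root() {
    if [ -z "${SGE_ROOT:-}" ]; then
        echo "SGE_ROOT is not set" >&2
        return 1
    fi
    cell=${SGE_CELL:-default}   # assumption: cell is "default" unless SGE_CELL says otherwise
    if [ ! -d "$SGE_ROOT/$cell/common" ]; then
        echo "$SGE_ROOT/$cell/common not found; is the share mounted?" >&2
        return 1
    fi
    echo "SGE_ROOT=$SGE_ROOT looks usable"
}
```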
On the master host: change directory (cd) to the installation directory, $SGE_ROOT.
On the master host: add the host IP to /etc/hosts (RHEL uses the long name as the host name, which is not very convenient for SGE purposes).
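To see whether a short name is already present in a hosts file, something like the following can be used. It is a sketch only: short_name_known is an invented name, and it looks at the file you pass in, not at DNS or any other name service.

```shell
# short_name_known FILE NAME
# Succeeds if NAME appears as a host name or alias (field 2 onward)
# on a non-comment line of a hosts-style file.
short_name_known() {
    awk -v n="$2" '
        /^[[:space:]]*#/ { next }             # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break          # ignore trailing comments
                if ($i == n) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example: `short_name_known /etc/hosts b5 || echo "b5 is missing from /etc/hosts"`.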
On the master host: add the host to the list of execution hosts (this is not strictly necessary):
qconf -ae
The -ae option (add execution host) displays an editor that contains a configuration template for an execution host. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable.
In this template you specify the hostname, which should be the name of the execution host you want to configure. In the vi screen, change the name and save the template. See the host_conf(5) man page for a detailed description of the template entries to be changed.
hostname              template
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
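If you would rather avoid the interactive editor, the same template can be written to a file and loaded from there. This is a sketch under the assumption that your qconf supports -Ae <file> (add execution host from file), as the SGE 6.x qconf man page describes; write_exec_host_conf is a hypothetical helper and the fields are simply those of the template above.

```shell
# write_exec_host_conf HOSTNAME: prints an execution host definition
# on stdout, using the same fields as the qconf -ae template.
write_exec_host_conf() {
    cat <<EOF
hostname              $1
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
}
```

Possible usage: `write_exec_host_conf b5 > /tmp/b5.conf && qconf -Ae /tmp/b5.conf`.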
export DISPLAY=10.14.17.7:0; echo $DISPLAY
That should start the installer in an X session on your workstation/PC.
Click Next. You will see the host selection screen.
"1 out of 1 reachable hosts have configuration errors... ... ... ... Do you want to continue the installation"
On the execution host: register the execution daemon and start it. Ensure the proper environment after reboot:
NOTE: You can automate the steps listed below by creating a small script:
#!/bin/bash
#
# Post-install operations for an SGE execution host
#
# Load the SGE environment (SGE_ROOT must already be set)
. "$SGE_ROOT/default/common/settings.sh"
# Enable sgeexecd.$SGE_CLUSTER_NAME (or whatever your cluster name is) at boot, including runlevels 3 and 5
chkconfig "sgeexecd.$SGE_CLUSTER_NAME" on
# On the execution host: start the sge_execd service
service "sgeexecd.$SGE_CLUSTER_NAME" start
# Add the necessary command to /etc/profile
echo ". $SGE_ROOT/default/common/settings.sh" >> /etc/profile
# chkconfig sgeexecd.$SGE_CLUSTER_NAME on
sgeexecd.p6444 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# service sgeexecd.$SGE_CLUSTER_NAME start
starting sge_execd
NOTE: The first start takes two to three minutes or more. It's really slow even on a very fast server.
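Because the first start is slow, a script that continues right after the service command may run ahead of the daemon. One way around that is a small polling loop. This is a generic sketch; wait_until is a name invented for this example.

```shell
# wait_until TRIES DELAY CMD...
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts;
# succeeds as soon as CMD succeeds, fails if it never does.
wait_until() {
    tries=$1 delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until 36 5 pgrep -x sge_execd` polls for up to about three minutes (36 attempts, 5 seconds apart) until a process named sge_execd shows up.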
On the master host: specify a queue for this host. You can either add the host to an existing queue or copy an existing queue, rename it, and save it under the new name.
To add a new queue using an existing queue as a template, use the commands
# qconf -sq c32.q > m40a.q
Change the following parameters in the template:
hostlist lus
processors 32
slots 32
shell /bin/bash
pe_list ms
Then add the queue from the edited file:
qconf -Aq m40a.q
root@lus17 added "m40a.q" to cluster queue list
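The copy-and-edit step can also be done with a filter instead of an editor. The sketch below rewrites only qname, hostlist and slots and passes everything else through. retarget_queue is a hypothetical helper, and it assumes each of these parameters sits on a single line (no backslash continuations) in the qconf -sq output.

```shell
# retarget_queue QNAME HOSTLIST SLOTS
# Reads a queue definition on stdin and prints it with the three
# named parameters replaced; all other lines pass through unchanged.
retarget_queue() {
    awk -v q="$1" -v h="$2" -v s="$3" '
        $1 == "qname"    { print "qname                 " q; next }
        $1 == "hostlist" { print "hostlist              " h; next }
        $1 == "slots"    { print "slots                 " s; next }
        { print }
    '
}
```

A possible pipeline, with m40a used here as an example host name: `qconf -sq c32.q | retarget_queue m40a.q m40a 32 > m40a.q`. Review the resulting file before loading it with qconf -Aq.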
See Creating and modifying SGE Queues
Verify that the execution host has been declared with the command
which lists all execution hosts.
You can also use qconf -se <hostname> to see the configured parameters (usually only the hostname is configured). See Configuring Hosts From the Command Line.
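The host list check can be scripted as well. This sketch assumes that qconf -sel (show execution host list) is the listing command on your installation and that it prints one host name per line; host_listed is a made-up helper name.

```shell
# host_listed HOST: reads a host list on stdin (one name per line) and
# succeeds if HOST is on it. -F: fixed string, -x: whole line, -q: quiet.
host_listed() {
    grep -Fxq -- "$1"
}
```

For example: `qconf -sel | host_listed b5 && echo "b5 is a known execution host"`. The match is exact, so if the list shows fully qualified names you need to pass the fully qualified name too.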
On the execution host: reboot the execution host and verify that the NFS share is correctly mounted after the reboot.
Tip: For details about how you can verify that the execution host has been set up correctly, see How to Verify that the Daemons are Running on the Execution Hosts.
- Log in to the execution hosts on which you ran the execution host installation procedure.
- Verify that the daemons are running by typing one of the following commands, depending on the operating system you are running.
- On BSD-based UNIX systems, type the following command:
  % ps -ax | grep sge
- On systems running a UNIX System V-based operating system (such as the Solaris Operating System), type the following command:
  % ps -ef | grep sge
- Verify that the daemons are running by looking for the sge_execd string in the output. Specifically, you should see that the sge_execd daemon is running.
- If you do not see similar output, the daemon required on the execution host is not running. Restart the daemon by hand. For example, on Linux you can use the service command:
  /sbin/service sgeexecd.p6444 start
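The check-and-restart logic above can be combined into one line with a small filter over the ps output. This is a sketch; execd_running is an invented name, and the bracket in the pattern is the usual trick that keeps grep from matching its own command line.

```shell
# execd_running: reads ps output on stdin and succeeds if a line
# mentions sge_execd. The pattern [s]ge_execd matches "sge_execd" but
# not the literal text "[s]ge_execd" in grep's own ps entry.
execd_running() {
    grep -q '[s]ge_execd'
}
```

For example, on Linux: `ps -ef | execd_running || /sbin/service sgeexecd.p6444 start`.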