1. Logging into the
Grid Node for the first time
2. Changing Password
3. Compiling MPI
Program
4. Sun
Grid Engine
4.1 Writing and Submitting
Batch Jobs
4.2 Monitoring and
Controlling Jobs
4.2.a
Monitoring With
qstat
4.2.b
Monitoring Jobs By
Electronic Mail
5. Controlling Jobs
6. Sample Snapshot
of MPI Job Submission Along With Monitoring
Your username is your blazer id, and your initial password is your student id
(for example 999554444). You would need to login to moat first, if not in the
CIS domain
You will then be asked 3 questions shown below.
Press "Enter" key for all the questions (entering no
other input) and your ssh keys will generated:
Generating public/private rsa1 key pair.
It is strongly advised that you change your initial password.
You would be asked to enter your old password and your new password.
Use the mpicc compiler to compile programs at the shell prompt.
Sun Grid Engine has a large set of programs that let the user submit/delete
jobs, check job status, and have information about available queues and
environments. For the normal user the knowledge of the following basic commands
should be sufficient to get started with Grid Engine and have full control of
his jobs:
To run a job with grid engine you have to submit it from the command line.
But first, you have to write a batch script file that contains all the
commands and environment requests that you want for this job. If, for example,
serial.sh is the name of the script file then use the command
``qsub'' to submit the job:
And, if the submission of the job is successful, you will see this message:
After that, you can monitor the status of your job with the command `` qstat
'' .
When the job is finished you will have two output files called "output _
file" and "error _ file" (if there were any output/error messages) .
job -
ID prior name user state submit/start at queue master ja
- task - ID
---------------------------------------------------------------------------------------------
4 0 serial.sh user qw 06/15/2004 21:40:49
In Grid Engine, it is a batch script that contains additionally to normal
UNIX command special comments lines defined by the leading prefix `` #$ '' .
The first line of the batch file starts with
#$ -S /bin/bash
which is default shell interpreter for Grid Engine . To tell GE to run the
job from the current working directory add this script line
#$ -cwd
if you want to pass some environment variable VAR (or a list of variables
separated by commas) use the -v option like this
#$ -v VAR (#$ -V passes all variables listed in env.)
Insert the full path name of the files to which you want to redirect the
standard o utput/ e rror respectively .
#$ -o <path_name>
#$ -e <path_name> The prefix #$ has many options and is used the same way you
use Insert your email address after the "#$ -M", and also insert
the full path name of the files to which you want to redirect the standard
output/error . after the "#$ -o " (the "#$
-e") statement, respectively .
Note that
And after that, to submit the job you simply type
An example of parallel (MPI) job (parallel . sh) that requests 4 processors:
$MPI_DIR/mpirun -nolocal -np $NSLOTS -machinefile $TMPDIR/machines
$EXECUTABLE And after that, to submit the job you simply type
A sample snapshot of job submission, and
monitoring is provided towards the end of this document .
Due to the tight integration of MPI with SGE (via the The first is $NSLOTS, the number of slots (or processors) granted by SGE for
this MPI job, which corresponds to the (range) value given by the user as the
second argument to the - pe option . The second variable is $TMPDIR, a temporary
directory which will contain a file titled machines, itself containing an
automatically - generated list of nodes on which the MPI job will be run . The
temporary directory and its contents will be automatically removed upon
completion of the MPI job .
Note: $MPI _ DIR and $EXECUTABLE are provided for clarity and may be
dispensed with, substituting them with their actual values . However, $NSLOTS
and $TMPDIR are mandatory .
Both these values are passed to mpirun via the specified arguments . The next
argument should be the name of the MPI program itself, followed by any optional
arguments to be sent to that program . The above script should suffice to run
any simple MPI job by changing the name of the program (myprogram) in the mpirun
line .
After submitting your job to Grid Engine you may track its status by using
either the
The
Another way to monitor your jobs is to make Grid Engine notify you by email
on status of the job .
In your batch script or from the command line use the And from the command line you can use the same options (for example): [1]
Based on the status of the job displayed, you can control the job by the
following actions:
Note that the first time qstat was executed, the job was in the queue in
submitted state . The job was in execution state when qstat was executed the
second time, and the third time it had completed.
For further information, see the SGE User's Guide
This process will make the files:
/home/username/.ssh/identity.pub
/home/username/.ssh/identity
/home/username/.ssh/authorized _ keys
Enter file in which to
save the key (/home/username/.ssh/identity):
Created directory
'/home/username/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved
in /home/username/.ssh/identity.
Your public key has been saved in
/home/username/.ssh/identity.pub.
The key fingerprint is:
several 2
digit hex numbers separated by :>
username@everest00.cis.uab.edu
For further information, see the SGE User's Guide
http://www.sun.com/products-n-solutions/hardware/docs/pdf/816-2077-12.pdf
(PDF)
http://docs.sun.com/source/816-4739-11/enterpri.htm
(HTML)
#
#$ -cwd
#$ -j y
#$ -S
/bin/bash
#
#$ -M myemail
#$ -e error_file
#$ -o output_file
date
sleep 10
date
your job 1
("serial.sh") has been submitted . qsub , so check qsub man pages to take a look at
those options .
qsub accepts shell scripts only, not
executable files, and also that shell scripts need to be executable, if it's not
the case run the command
#
#$ -cwd
#$ -j y
#$ -S
/bin/bash
#
#$ -M user@uab.edu
#$ -e /home/user/error_file
#$ -o
/home/user/output_file
#$ -pe mpi 4
MPI_DIR=/opt/mpich/gnu/bin
EXECUTABLE=/home/user/hello
qsub
command), SGE automatically configures a number of environment variables
containing values required by mpirun .
qstat command, or by email .
qstat command provides the status of all jobs and queues in
the cluster . The most useful options are:
You can refer to the man pages for a
complete description of all the options of the qstat: Displays list of all jobs with no queue status
information .
qstat -u hpc1***: Displays list of all jobs belonging to user
hpc1***
qstat -f: gives full information about jobs and queues .
qstat -j [job _ id]: Gives the reason why the pending job (if
any) is not being scheduled . qstat command.
-m option
to request that an email should be send and -M option to precise
the email address where this should be sent . This will look like:
#$ -m beas -m) option can select after which events you want to receive
your email . In particular you can select to be notified at the beginning/end of the job, or
when the job is aborted/suspended (see the sample script lines above) .
qmod. Check the man pages for the options that you are allowed to
use .
kill command, and applies only to running jobs, in practice you
type (where
job_id is given by qstat or
qsub ) .
qdel command like this (where
job_id is given by qstat or
qsub ) . [user@everest00 user]$ qsub parallel . sh
your job 15 ("parallel . sh") has been submitted[user@everest00 user]$ qstat
job - ID prior name user state submit/start at queue master ja - task - ID
---------------------------------------------------------------------------------------------
15 0 parallel . s user qw 06/16/2004 14:51:07
[user@everest00 user]$ qstat
job - ID prior name user state submit/start at queue master ja - task - ID
---------------------------------------------------------------------------------------------
15 0 parallel . s user t 06/16/2004 14:51:07 everest - 0 - SLAVE
0 parallel . s user t 06/16/2004 14:51:07 everest - 0 - SLAVE
15 0 parallel . s user t 06/16/2004 14:51:07 everest - 0 - MASTER
0 parallel . s user t 06/16/2004 14:51:07 everest - 0 - SLAVE
[user@everest00 user]$ qstat
[user@everest00 user]$ more output _ file
/opt/gridengine/default/spool/everest - 0 - 6/active _ jobs/16 . 1/pe _ hostfile
everest - 0 - 6
everest - 0 - 6
everest - 0 - 14
everest - 0 - 14
Warning: Permanently added 'everest - 0 - 6' (RSA1) to the list of known hosts .
Warning: Permanently added 'everest - 0 - 14' (RSA1) to the list of known hosts .
[1]: Hello World, 1 of 4 alive
[3]: Hello World, 3 of 4 alive
[2]: Hello World, 2 of 4 alive
[0]: Hello World, 0 of 4 alive
rm: cannot remove `/tmp/16 . 1 . everest - 0 - 6 . q/rsh': No such file or directory
http://www.sun.com/products-n-solutions/hardware/docs/pdf/816-2077-12.pdf
(PDF)
http://docs.sun.com/source/816-4739-11/enterpri.htm
(HTML)