r30 - 18 Oct 2014 - 07:11:52 - AntonioMari

User Manual

Torque/Maui Portable Batch System (PBS)

Torque is a Portable Batch System (PBS): a workload management system for GNU/Linux farms (clusters). It supplies tools to submit, monitor, and delete jobs. It has the following components:

  • pbs_server - the Job Server, which provides the basic batch services such as receiving/creating a batch job, modifying the job, and running the job.
  • pbs_mom - the Job Executor, a daemon that places the job into execution on the compute machine when it receives a copy of the job from the Job Server.
  • maui - the Job Scheduler, which contains the site's policy controlling which job is run, and where and when it is run.

The steps needed to run a batch job are:

  • Create a Job script containing the following options:
    • PBS options for requesting the resources that will be needed (e.g. walltime, memory)
    • variables and commands to run the task
  • Submit the Job with the qsub command, or create a run script to automate the process
  • Monitor and manage the Job
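The steps above can be sketched as a minimal session. The job name, queue, and walltime here are examples taken from this page; adjust them for your site:

```shell
# myjob.sh - a minimal PBS job script (sketch; resources are examples)
#PBS -N hello                 # job name
#PBS -q short                 # queue (see the queue list on this page)
#PBS -l walltime=00:05:00     # maximum wall-clock time

msg="Hello from $(hostname)"
echo "$msg"
```

Submit it with `qsub myjob.sh`, watch it with `qstat -a`, and delete it with `qdel <jobid>` if needed.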

CREATE A JOB SCRIPT

To create the script we need to know some variables and options included in Torque.

PBS Environment Variables

The following are some of the more useful environment variables available for use in your scripts:

 OPTION           Description
 PBS_O_HOST       The host machine on which the qsub command was run
 PBS_O_LOGNAME    The login name on the machine on which the qsub was run
 PBS_O_HOME       The home directory from which the qsub was run
 PBS_O_WORKDIR    The working directory from which the qsub was run

The following variables relate to the environment where the job is executing:

 OPTION           Description
 PBS_O_QUEUE      The original queue to which the job was submitted
 PBS_JOBID        The identifier that PBS assigns to the job
 PBS_JOBNAME      The name of the job
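A common pattern is to start the script from the submission directory and tag the output with the job ID. This is a sketch; the `${VAR:-fallback}` defaults are only there so the snippet also runs outside PBS:

```shell
# PBS starts jobs in $HOME, so return to where qsub was run
cd "${PBS_O_WORKDIR:-$PWD}"

# Use the job ID to keep logs from different runs apart
logfile="run_${PBS_JOBID:-local}.log"
echo "job ${PBS_JOBNAME:-interactive} started on $(hostname)" > "$logfile"
```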

PBS Options

These are some of the commonly used PBS options that you may use in a job script. Each option starts with "#PBS".

 OPTION                      Description
 #PBS -N myJob               Assigns a job name. The default is the name of the PBS job script
 #PBS -q queuename           Assigns the queue your job will use
 #PBS -l walltime=01:00:00   The maximum wall-clock time during which this job can run
 #PBS -l mem=200mb           The maximum amount of physical memory used by the job
 #PBS -o myPath              The path of the standard output file
 #PBS -e myPath              The path of the standard error file
 #PBS -j oe                  Merges the standard error stream into the standard output stream of the job
 #PBS -M user@ific.uv        Declares the user or list of users to whom mail is sent by the execution server
 #PBS -m b                   Sends mail to the user when the job begins
 #PBS -m e                   Sends mail to the user when the job ends
 #PBS -m a                   Sends mail to the user when the job aborts (with an error)

To use your alias definitions in your PBS scripts you must include this line in your .bashrc:

shopt -s expand_aliases
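For example, with `shopt -s expand_aliases` in effect, an alias defined in `.bashrc` is usable inside the (non-interactive) job script. The alias name here is hypothetical:

```shell
shopt -s expand_aliases   # aliases are off in non-interactive shells by default
alias ll='ls -l'          # hypothetical alias; normally defined in ~/.bashrc

ll "$HOME"                # expands to: ls -l "$HOME"
```

Note the alias must be defined on a line before the one that uses it, because bash expands aliases when a line is read.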

Job Script template

The following job script template should be modified to the needs of the job. A job script may consist of PBS directives, comments, and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command-line options. For example:

## PBS OPTIONS
#PBS -N testing
#PBS -q short
#PBS -l nodes=1:ppn=1,walltime=00:30:00
#PBS -M anmaro@ific.uv.es
#PBS -m bae

for a in {1..5}
do
      sleep $a      
done

Another example runs a process with ana. S1, S2 and S3 are variables that come from an external script: S1 is the first file to process, S2 is the last file to process, and S3 is the run number. This example executes one ana command per input file and creates one output .evt.root file per input.

#PBS -N testing
#PBS -q short
#PBS -l nodes=1:ppn=1,walltime=00:30:00
#PBS -M anmaro@ific.uv.es
#PBS -m bae

# PATHS DEFINITION
INPUT=/data4/NEXT/NEXT1IFIC/Run4/FMWK/FDATA/dataRaw2PMaps/${S3}
OUTPUT=/data4/NEXT/NEXT1IFIC/Run4/FMWK/FDATA/dataRaw2PMaps/test_output/${S3}
LOG_FILES=$PBS_O_WORKDIR/log_files
XML_FILE=$PBS_O_WORKDIR/xml

mkdir -p $OUTPUT/histos                 # also creates $OUTPUT; ana writes histos here

for a in $(seq $S1 $S2)                 # This loop executes one ana command per input file
do
        num=$(printf "%03d" $a)                           # File numbering 000,001,... (3 digits)
        input_file=$(cd ${INPUT} && ls *_${num}.*)        # Find the input data file
        ana -b 1 -x $XML_FILE/RecalibrateAnodeJOB.xml -n -1 -i $INPUT/${input_file} -o $OUTPUT/Run_${num}.evt.root -g $OUTPUT/histos/histo_$num.histo
done
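The zero-padded numbering used above can be checked in isolation. `%03d` pads an integer to three digits, matching file names like `*_007.*` (the `%#03d` form is meant for octal/hex alternate output and should be avoided with decimals):

```shell
# Zero-pad run/file numbers to three digits
for a in 1 7 42
do
    num=$(printf "%03d" "$a")
    echo "$num"               # 001, 007, 042
done
```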
 

A variant of the loop that runs ana on a list of input files is shown in the next example. Here the variable called file will contain the list of input files; it is built up in the loop and ana is run once afterwards with the whole list.

file=''
for a in $(seq $S1 $S2)
do
  num=$(printf "%03d" $a)
  file+=${INPUT}/'Raw2PMaps2_3350_file_'${num}'.evt.root '
done
ana -b 1 -x $XML_FILE -n -1 -i ${file} -g $OUTPUT/${S3}-PMTER_${S1}_${S2}.root

If you use Python scripts, you should use a wrapper bash script to call the Python script. Here you can see an example:

#PBS -N test_python
#PBS -q short
#PBS -l walltime=00:30:00
#PBS -M anmaro@ific.uv.es
#PBS -m bae
#PBS -o log_files
#PBS -e log_files
#PBS -j eo

python $PBS_O_WORKDIR/ReCal.py $S1 $S2

SUBMITTING A JOB

The command to submit a job is qsub. For example, you can submit a new job with:

qsub script.sh

Additionally, you can pass a list of parameters to qsub with the -v option and then use these parameters inside the script. Note that qsub options must come before the script name:

qsub -v S1=$ini,S2=$PATH,S3=20 script.sh
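A run script can automate submission over several file ranges. This sketch stubs `qsub` with a shell function so it can be tried outside the cluster; remove the stub on neutrinos1. The ranges and run number are made-up examples:

```shell
# Stub so the loop can be tested without a PBS server; remove on the cluster
qsub() { echo "qsub $*"; }

run=20                               # example run number (S3)
for ini in 0 100 200; do             # example file ranges
    end=$((ini + 99))
    # qsub options (-v) must precede the script name
    qsub -v "S1=${ini},S2=${end},S3=${run}" script.sh
done
```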

MONITOR AND MANAGE THE JOB

Torque/PBS provides tools to monitor and manage jobs (qstat, qdel, etc.):

 COMMAND        Description
 qstat -a       Check status of jobs, queues, and the PBS server
 qstat -f       Get all the information about a job: resources requested, resource limits, owner, source, destination, queue, etc.
 qdel jobID     Delete a job from the queue
 qhold jobID    Hold a job if it is in the queue
 qrls jobID     Release a job from hold

Starting to use Neutrinos PBS

You must log in to neutrinos1 via SSH and follow the steps in the previous sections (see "The steps needed to run a batch job").

At this moment we have the following queues available (this is not the final configuration; it is only for testing purposes):

  • short - for jobs that require less than 4 hours
  • long - for jobs that require less than 24 hours

If you have any questions, please let me know: anmaro@ific.uv.es

Torque Cheat Sheet

Frequently Used Commands

 COMMAND          Description
 qsub [script]    Submit a PBS job
 qstat [jobid]    Show status of PBS batch jobs
 qdel [jobid]     Delete a PBS batch job
 qhold [jobid]    Hold PBS batch jobs
 qrls [jobid]     Release hold on PBS batch jobs

Check Queue and Job Status

 COMMAND               Description
 qstat -q              List all queues
 qstat -a              List all jobs
 qstat -au [userid]    List jobs for userid
 qstat -r              List running jobs
 qstat -f [jobid]      List full information about jobid
 qstat -Qf [queue]     List full information about queue
 qstat -B              List summary status of the job server

References

http://www.clusterresources.com/torquedocs/commands/qsub.shtml

http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml

http://www.eresearchsa.edu.au/pbs_exitcodes

-- AntonioMari - 05 Mar 2013
