HTCondor Quick Reference
The complete HTCondor manual can be found here.
Job submission
Job description file
Before submitting a job, you have to prepare a Job Description File, which specifies what you want to
execute, where you want to write your output, and what your job requirements are.
A simple job description file might be (test.sub):
universe = vanilla
executable = test.sh
arguments = $(Cluster) $(Process)
log = test.log
output = outfile.$(Cluster).$(Process).out
error = errors.$(Cluster).$(Process).err
request_Cpus = 4
request_Memory = 4000
queue 2
In this file:
- universe: specifies how HTCondor is going to handle the job. Use vanilla here.
- $(Cluster): HTCondor assigns a consecutive number to each group of jobs submitted at the same time.
- $(Process): identifies a specific job inside the group. Process numbers start at 0.
- executable: tells HTCondor what to execute. This can be a binary or a script.
- arguments: the arguments HTCondor will pass to the executable; in this example, the Cluster and Process job identifiers.
- log: the file where HTCondor will log the job processing information.
- output: where the job's standard output is going to be written. It is important that each job has its own output file, since an existing file will be overwritten. In this example, the file name contains the Cluster and Process identifiers, so it is unique per job.
- error: same as output, but for the standard error.
- request_Cpus: how many cores the job needs. It is important to be accurate, since HTCondor uses this requirement to distribute the available resources among all users.
- request_Memory: same as CPUs, but for memory. The unit here is MB!
- queue: the number of jobs to be sent to the system using this job configuration.
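As a sketch of the other side of this configuration, a minimal test.sh that receives the Cluster and Process identifiers as its first and second arguments might look like this (the default values are only for illustration; HTCondor supplies the real ones via the arguments line above):

```shell
#!/bin/bash
# test.sh - sketch of the executable referenced in test.sub.
# HTCondor passes $(Cluster) and $(Process) as the first and second
# positional arguments, so they arrive here as $1 and $2.
cluster=${1:-110}
process=${2:-0}
echo "Starting job ${cluster}.${process}"
# ... the actual work of the job would go here ...
```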
This is a very simple job, but the Job Description File can be richer and other options can be more appropriate to submit other user workloads.
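For instance, a sketch of a slightly richer description file (the script name analyze.sh and the data/*.dat input files are hypothetical) could transfer an input file to each job and queue one job per matching file:

```
universe                = vanilla
executable              = analyze.sh
arguments               = $(infile)
transfer_input_files    = $(infile)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log                     = analyze.log
output                  = analyze.$(Cluster).$(Process).out
error                   = analyze.$(Cluster).$(Process).err
request_Cpus            = 1
request_Memory          = 2000
queue infile matching files data/*.dat
```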
Actual job submission (condor_submit)
When the Job Description File is ready, we can send the jobs to the system:
$ condor_submit test.sub
Submitting job(s)..
2 job(s) submitted to cluster 110.
Here, 110 is the Cluster identifier for this group of jobs.
Get the jobs status (condor_q)
To get the job status use:
$ condor_q
-- Schedd: xxx.ific.uv.es : <xx.xx.xx.xx:9618?... @ 03/12/19 11:44:28
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
fulano ID: 110 3/12 11:44 _ _ 2 2 110.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for fulano: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Here you can see the jobs in the IDLE state, waiting to be considered by HTCondor for execution.
The log file begins to collect information about the HTCondor processing.
When the jobs start running, you can watch the output and error files being written, since at IFIC
the file system (Lustre) is visible from every node in the cluster.
Removing a job (condor_rm)
A job can be removed from the system using its Cluster identifier: condor_rm with the Cluster number alone (e.g. condor_rm 110) removes all the jobs in that cluster. If several jobs have been submitted, a specific job can be removed using its Cluster.Process identifier:
$ condor_rm 110.1
Job 110.1 marked for removal
Checking cluster status (condor_status)
You can get the cluster (pool) status with:
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxx01.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:11:16
slot1@xxx02.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:11:35
slot1@xxx03.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:11:25
slot1@xxx04.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:11:23
slot1@xxx05.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:11:17
slot1@xxx06.ific.uv.es LINUX X86_64 Unclaimed Idle 0.000 32117 19+17:10:57
Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX 6 0 0 6 0 0 0
Total 6 0 0 6 0 0 0
FAQ: Frequently Asked Questions
Job Submission error "Hold reason: Error ... (errno=13: 'Permission denied')"
This error happens when submitting with HTCondor and the 'executable' defined in the .sub file does not have execution permissions (+x).
In this case the job enters the "Hold" state (as can be seen with condor_q) and stays there. You can remove the job with 'condor_rm'.
To solve it, simply set the correct file permissions on your executable before submitting jobs. For example, for a script called 'executable.sh':
chmod +x executable.sh
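As a quick self-contained check (the file name executable.sh is just a stand-in for your own script), you can create the script, set the execute bit, and verify it from the shell:

```shell
# Create a throwaway script (stand-in for your real executable).
cat > executable.sh <<'EOF'
#!/bin/bash
echo "hello"
EOF

# Freshly created files usually lack the execute bit; add it.
chmod +x executable.sh

# 'test -x' succeeds only if the execute permission is set.
test -x executable.sh && echo "executable.sh is executable"
```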
-- AlvaroFernandez - 02 Jul 2019