r3 - 02 Jul 2019 - 15:04:23 - AlvaroFernandez

HTCondor Quick Reference

The complete reference is the official HTCondor manual.

Job submission

Job description file

Before submitting a job, you have to prepare a Job Description File that specifies what you want to execute, where you want to write your output, and what your job requirements are.

A simple job description file might be (test.sub):

universe = vanilla

executable          = test.sh
arguments           = $(Cluster) $(Process)

log                 = test.log
output              = outfile.$(Cluster).$(Process).out
error               = errors.$(Cluster).$(Process).err

request_Cpus        = 4
request_Memory      = 4000

queue 2

In this file:

  • universe: Specifies how HTCondor handles the job. Use vanilla here.
  • $(Cluster): The number HTCondor assigns to each group of jobs submitted together. Cluster numbers are consecutive.
  • $(Process): Identifies a specific job within the group. Process numbers start from 0.
  • executable: Tells HTCondor what to execute. This can be a binary executable or a script.
  • arguments: The arguments HTCondor will pass to the executable. In this example, the cluster and process identifiers of the job.
  • log: The file where HTCondor logs the job processing information.
  • output: Where the job's standard output is written. Each job should have its own output file, because an existing file with the same name is overwritten. In this example, the file name contains the Cluster and Process identifiers, so it is unique.
  • error: Same as output, but for the standard error.
  • request_Cpus: How many cores the job needs. It is important to be accurate, since HTCondor uses this requirement to distribute the available resources among all users.
  • request_Memory: Same as request_Cpus, but for memory. The unit is MB.
  • queue: The number of jobs to submit to the system using this job configuration.

This is a very simple job. The Job Description File supports many more options, and some of them may suit other user workloads better.
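
The job description above names test.sh as its executable, but this page does not show that script. As a minimal sketch only (the script body and the run_job function name are assumptions, not the actual script used at IFIC), a payload that receives the two arguments and writes to both standard output and standard error could look like this:

```shell
#!/bin/bash
# test.sh: hypothetical payload for the test.sub example.
# HTCondor passes $(Cluster) and $(Process) as the first and second argument.

run_job() {
    local cluster="${1:-unknown}"
    local process="${2:-unknown}"

    # Standard output is captured in outfile.<Cluster>.<Process>.out
    echo "cluster=${cluster} process=${process} host=$(uname -n)"

    # Standard error is captured in errors.<Cluster>.<Process>.err
    echo "job ${cluster}.${process}: this line goes to stderr" >&2
}

run_job "$@"
```

Running it by hand, for example ./test.sh 110 0, is a quick way to check the script before submitting it.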

Actual job submission (condor_submit)

When the Job Description File is ready, we can send the jobs to the system:

$ condor_submit test.sub
Submitting job(s)..
2 job(s) submitted to cluster 110.

Here, 110 is the cluster identifier for this batch of jobs.
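
If a script needs the cluster identifier later (for example to query or remove the jobs), it can be extracted from the condor_submit message. The following is a sketch that assumes the message format shown above; parse_cluster_id is a hypothetical helper name, not an HTCondor command:

```shell
#!/bin/bash
# Read condor_submit output on standard input and print the cluster ID.
# Assumes a line of the form "N job(s) submitted to cluster ID."
parse_cluster_id() {
    sed -n 's/^[0-9][0-9]* job(s) submitted to cluster \([0-9][0-9]*\).*$/\1/p'
}

# Usage (needs HTCondor, so it is left commented out here):
#   cluster_id=$(condor_submit test.sub | parse_cluster_id)
#   echo "Submitted cluster ${cluster_id}"
```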

Get the jobs status (condor_q)

To get the job status use:

$ condor_q

-- Schedd: xxx.ific.uv.es : <xx.xx.xx.xx:9618?... @ 03/12/19 11:44:28
fulano ID: 110      3/12 11:44      _      _      2      2 110.0-1

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for fulano: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

Here, you can see the jobs in the Idle state, waiting for HTCondor to consider them for running. At this point the log file starts to receive information about how HTCondor is processing the jobs.

When a job starts running, you can watch its output and error files being written, since at IFIC the file system (Lustre) is visible from everywhere in the cluster.
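
To check from a script how many jobs are in a given state, one option is to read the "Total for query" summary line. This sketch assumes the summary format shown above; jobs_in_state is a hypothetical helper name:

```shell
#!/bin/bash
# Read condor_q output on standard input and print the number of jobs
# in the given state (completed, removed, idle, running, held, suspended),
# taken from the "Total for query" summary line.
jobs_in_state() {
    local state="$1"
    sed -n "s/^Total for query:.* \([0-9][0-9]*\) ${state}.*/\1/p"
}

# Usage (needs HTCondor, so it is left commented out here):
#   condor_q | jobs_in_state idle
```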

Removing a job (condor_rm)

Jobs can be removed from the system using their cluster identifier, which removes every job in that cluster. If several jobs were submitted together, a specific job can be removed by adding its process identifier (cluster.process):

$ condor_rm 110.1
Job 110.1 marked for removal

Checking cluster status (condor_status)

You can get the cluster (pool) status with:

$ condor_status
Name                     OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@xxx01.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:11:16
slot1@xxx02.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:11:35
slot1@xxx03.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:11:25
slot1@xxx04.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:11:23
slot1@xxx05.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:11:17
slot1@xxx06.ific.uv.es LINUX      X86_64 Unclaimed Idle      0.000 32117 19+17:10:57

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX        6     0       0         6       0          0      0

         Total        6     0       0         6       0          0      0
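
The per-slot lines can also be counted in a script, for instance to see how many slots are free before submitting a large batch. This sketch assumes the column layout shown above, with the state in the fourth column; count_slots_in_state is a hypothetical helper name:

```shell
#!/bin/bash
# Read condor_status output on standard input and count the slots whose
# State column (4th field) matches the requested state.
count_slots_in_state() {
    local state="$1"
    awk -v state="$state" '$1 ~ /^slot/ && $4 == state { n++ } END { print n + 0 }'
}

# Usage (needs HTCondor, so it is left commented out here):
#   condor_status | count_slots_in_state Unclaimed
```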

FAQ: Frequently Asked Questions

Job submission error "Hold reason: Error ... (errno=13: 'Permission denied')"

This error occurs when the 'executable' defined in the .sub file does not have execute permission (+x).

In this case the job enters the Held state (as can be seen with condor_q) and stays there. You can remove the job with 'condor_rm'.

To solve it, add execute permission to your executable before submitting jobs. For example, for a script called 'executable.sh':

chmod +x executable.sh
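
To catch this problem before submitting, a small pre-flight check can read the executable line from the job description file and test its permissions. This is a sketch under stated assumptions: check_executable is a hypothetical helper name, the file is expected to contain a single "executable = ..." line, and relative paths are checked against the current directory, so run it from the directory where you submit.

```shell
#!/bin/bash
# Check that the executable named in a job description file has the
# execute bit set. Returns 0 if it does, 1 if it does not, and 2 if no
# "executable" line is found.
check_executable() {
    local sub_file="$1"
    local exe

    # Take the value of the first "executable = ..." line, trimming blanks.
    exe=$(awk -F= 'tolower($1) ~ /^[ \t]*executable[ \t]*$/ {
                       sub(/^[ \t]+/, "", $2); sub(/[ \t]+$/, "", $2)
                       print $2; exit
                   }' "$sub_file")

    if [ -z "$exe" ]; then
        echo "No executable line found in ${sub_file}" >&2
        return 2
    fi

    if [ -x "$exe" ]; then
        echo "OK: ${exe} is executable"
    else
        echo "${exe} is not executable; try: chmod +x ${exe}" >&2
        return 1
    fi
}

# Usage: submit only if the check passes (needs HTCondor):
#   check_executable test.sub && condor_submit test.sub
```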

-- Last update: -- AlvaroFernandez - 02 Jul 2019