CONDOR
A Job Control System


CONDOR is a job control system which can accept jobs from a variety of users, and assign them for execution on one or more computers that it communicates with. In particular, if a user wishes to run a program in parallel, with MPI, CONDOR can find the necessary machines and manage the parallel execution of the job.

A job that is to be controlled by CONDOR must be "noninteractive". If the program would normally read from the keyboard, then a file of input must be prepared beforehand, and CONDOR must be told to use that file for input. If the program would normally display results to the terminal, then CONDOR must be told to save such results in an output file.

The Login Node

A CONDOR job will run on one or more nodes (think of these as individual computers, each of which contains one or more processors) in a cluster. A cluster consists of a collection of cooperating nodes. There is one special node, called the login node, or submit node or master node, from which you can submit jobs to be run on the cluster.

The FSU Research Computing Center (RCC) has a cluster whose login node is condor-login.rcc.fsu.edu. To get interactive access to this node, you log in as follows:


        ssh condor-login.rcc.fsu.edu
      
If you want to login to the FSU RCC CONDOR login node from a remote site, you may need to first sign into the FSU Virtual Private Network (VPN), at http://vpn.fsu.edu.

Your CONDOR job will almost certainly need files (programs, data, job scripts). You may have created these on another system, or you may later wish to copy results back to your home system. In order to transfer files between your laptop, desktop, or home system and the CONDOR cluster, you need to use the SFTP program to establish a link between the systems:


        sftp condor-login.rcc.fsu.edu
      
then use commands such as cd and lcd to change directories on the cluster or your local system, and then use put or get commands to move files to or from the cluster.
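As a sketch, a session that copies a submit file to the cluster and retrieves a result might look like this (the directory and file names here are hypothetical):

```
sftp condor-login.rcc.fsu.edu
sftp> cd project              (change directory on the cluster)
sftp> lcd /home/me/project    (change directory on your local system)
sftp> put foo.txt             (copy foo.txt to the cluster)
sftp> get outputfile          (copy outputfile back to your local system)
sftp> quit
```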

The CONDOR Universe

CONDOR allows the user to choose a universe. You choose a universe based on which features of CONDOR you will need.

In the simplest universe, known as vanilla, CONDOR simply finds a suitable computer, sends the necessary files (the input and the executable program) to that machine, runs the program, and returns the output.

The vanilla universe is a good way to start using CONDOR. For one thing, the other universes require you to recompile your program with a special CONDOR library. By skipping that step you give up the ability to do parallel processing, checkpointing, and remote procedure calls, but for the simplest jobs, none of these features is necessary.

The standard universe allows you to submit a job to be run by CONDOR, but adds checkpointing and remote procedure calls. In order for these features to work, it is necessary that your executable program be compiled with the CONDOR libraries.

The MPI universe allows you to submit a job which is to run an MPI program on a given number of processors. Currently, the RCC cluster does not support the MPI universe.

Submitting a Job

To use CONDOR, the user prepares a submit description file, a text file which specifies the values of certain parameters, such as the name of the program to be run, the location of a file to be associated with standard input, a starting default directory, and so on.
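As a sketch, a minimal submit description file for the vanilla universe might look like the following (the program and file names are hypothetical, and the individual parameters are explained later in this document):

```
universe   = vanilla
executable = myprog
input      = input.txt
output     = output.txt
log        = myprog.log
queue
```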

Once a submit description file is prepared, the user may ask CONDOR to run the job, or more precisely, to manage the process of running the job. This is done by issuing the condor_submit command. For instance, if the submit description file was named foo.txt, then the user would type

condor_submit foo.txt

Of course, the condor_submit command can only be issued from a machine that is running CONDOR; however, the job doesn't have to run there; in fact, the user can request that the job actually be run on any suitable machine in the local collection of machines.

The user can also request that the job be run as an MPI job, in which case CONDOR is responsible for finding a suitable number of available processors on which the job can be executed.

The condor_submit command transfers the responsibility for running the job to CONDOR. But usually your job will not run right away. While you are waiting for results, you may be curious whether the job is still waiting for a suitable machine to be found, or has started, or is well on its way to completion. To find out the status of your jobs, issue the command

condor_q user_name

To find out the status of all jobs, issue the command

condor_q

Here is some of the output from the condor_q command:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 307.0   bleason         7/7  17:31   0+07:24:31 I  0   0.0  toy 1978          
 362.5   manowar         8/18 11:13  46+03:55:59 I  0   152  emigrate
 544.0   ely            10/7  13:46   0+02:25:51 R  20  2.1  sos        
 545.0   burdett        10/7  15:38   0+00:00:13 I  0   0.0  foo.csh           
      
Note, in particular, under the ST (status) column, that I indicates that the job is idle, that is, not running, while R means the job is currently running.

CONDOR Script Parameters:

universe = vanilla | standard | mpi
The universe parameter specifies how CONDOR is to be used.
initialdir = /home/your directory
This command specifies the directory in which your executable is stored. If you use the pwd command on the submit node, you may see a complicated path for your home directory, but the leading portion of that path should be replaced by /home. Thus, to specify that the job's initial directory should be the matlab/fred subdirectory of your home directory, you would issue a command like
          initialdir = /home/matlab/fred
        
executable = myprog
This command specifies the name of the executable program or shell script that is to be run.
log = logfile
This command specifies the name of a file into which CONDOR will write a running commentary of the process by which it set up and ran the job. This is occasionally useful if the job fails.
input = inputfile
This command specifies the name of a file from which standard input is to be read. If no such file is to be used, the command can be omitted, or the value of inputfile can be blank.
output = outputfile
This command specifies the name of the file into which the standard output should go.
requirements = requirement
requirements = (requirement1 && requirement2)
requirements = (requirement1 || requirement2)
This command specifies requirements that must be true of the system on which your job is run. If there is more than one requirement, they can be grouped using external parentheses and combined with logical operators such as && for "and" or || for "or". Sample requirements include (OpSys == "LINUX"), (Arch == "X86_64"), and (Memory >= 1024).
notification = choice
This specifies whether CONDOR should notify you about the job status. Valid options for choice include Always, Complete, Error, and Never.
rank = expression
The rank command tells CONDOR how to choose the appropriate machine when there are several available. In the simplest case, replacing expression by memory means that, given a choice, CONDOR should run your job on the machine with the most memory.
queue number
This command specifies the number of copies of the job to submit. It only makes sense to submit multiple copies if there is a way for each job to carry out a different task, based on an ID number.
should_transfer_files = choice
This specifies whether CONDOR should transfer files to and from the remote machine where the job is actually being run. Valid choices include YES, NO, and IF_NEEDED.
transfer_input_files = file1,file2,...
This command lists one or more input files that should be transferred to the remote machine.
when_to_transfer_output = choice
This specifies when output files should be transferred back to the submit node. Valid choices include ON_EXIT and ON_EXIT_OR_EVICT.
machine_count = number
This command is only needed for MPI jobs, and specifies the number of computers to be used.
arguments = arg1 arg2 arg3
This command allows you to supply the command line arguments you would give if you were running your executable interactively.
queue
This command should be the last command in your submit file. It causes CONDOR to process your job.
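Putting several of these parameters together, a submit description file that requests file transfer and submits four copies of a job might look like the following sketch (the program and file names are hypothetical):

```
universe                = vanilla
executable              = myprog
arguments               = $(Process)
output                  = myprog_$(Process).txt
log                     = myprog.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data1.txt,data2.txt
requirements            = (OpSys == "LINUX")
queue 4
```

Here $(Process) is the ID number of each copy (0 through 3), which gives each job a way to carry out a different task.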

Useful CONDOR commands

condor_compile is used to compile your program with the CONDOR library. This is needed if your program is to run under CONDOR's standard universe. The form of the command depends on the language you are using; typically, you prefix your usual compilation command with condor_compile.
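For instance, assuming the GNU compilers are available on the cluster, compiling a C or FORTRAN program for the standard universe might look like:

```
condor_compile gcc -o myprog myprog.c
condor_compile gfortran -o myprog myprog.f90
```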

condor_submit file_name
is how you submit a job.

condor_q
to find out the status of all jobs.

condor_q username
is how you find out the status of just your jobs.

condor_rm username
removes all jobs submitted by username from the queue, whether they are waiting or executing.

Related Data and Programs:

C_CONDOR, C programs which illustrate how a C program can be run in batch mode using the condor queueing system.

C++_CONDOR, C++ programs which illustrate how a C++ program can be run in batch mode using the condor queueing system.

F90_CONDOR, FORTRAN90 programs which illustrate how a FORTRAN90 program can be run in batch mode using the condor queueing system.

MATLAB_CONDOR, MATLAB programs which illustrate how MATLAB can be run in batch mode using the CONDOR queueing system.


Examples and Tests:

FOO is the simplest example I could think of, demonstrating the most basic use of CONDOR: running a basic shell script in the "vanilla" universe. This only took me a week to get right.

GOO is an example job that uses the "standard" universe. This only took me 10 minutes to get right.

HOO is an example job that uses the "MPI" universe to run a very simple MPI program on 4 processors. This only took me 20 minutes to get right.
The current CONDOR cluster does NOT support the MPI universe, so I can no longer run this job!

MOO is a simple example of how to run an executable compiled program (written in C, C++ or FORTRAN) using CONDOR. We assume that no MPI stuff is going on, and no checkpointing is being done. This is just the FOO example, but with a "more interesting" executable. To make this work, compile the program on the CONDOR submit node, and call the result moo. By default, CONDOR will then run the executable on a machine with the same architecture and operating system as the one from which the job was submitted.



Last revised on 27 August 2013.