CONDOR
A Job Control System


CONDOR is a job control system which can accept jobs from a variety of users, and assign them for execution on one or more computers that it communicates with. In particular, if a user wishes to run a program in parallel, with MPI, CONDOR can find the necessary machines and manage the parallel execution of the job.

A job that is to be controlled by CONDOR must be "noninteractive". If the program would normally read from the keyboard, then a file of input must be prepared beforehand, and CONDOR must be told to use that file for input. If the program would normally display results to the terminal, then CONDOR must be told to save such results in an output file.

The Login Node

A CONDOR job will run on one or more nodes (think of these as individual computers, each of which contains one or more processors) in a cluster. A cluster consists of a collection of cooperating nodes. There is one special node, called the login node, or submit node or master node, from which you can submit jobs to be run on the cluster.

The FSU Research Computing Center (RCC) has a cluster whose login node is condor-login.rcc.fsu.edu. To get interactive access to this node, you log in as follows:


        ssh condor-login.rcc.fsu.edu
      
If you want to login to the FSU RCC CONDOR login node from a remote site, you may need to first sign into the FSU Virtual Private Network (VPN), at http://vpn.fsu.edu.

Your CONDOR job will almost certainly need files (programs, data, job scripts). You may have created these on another system, or you may later wish to copy results back to your home system. In order to transfer files between your laptop, desktop, or home system and the CONDOR cluster, you need to use the SFTP program to establish a link between the systems:


        sftp condor-login.rcc.fsu.edu
      
then use commands such as cd and lcd to change directories on the cluster or your local system, and then use put or get commands to move files to or from the cluster.
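As a sketch, a session that copies a submit file to the cluster and retrieves a result might look like this (the directory and file names here are hypothetical):

```
sftp condor-login.rcc.fsu.edu
sftp> cd project              (change directory on the cluster)
sftp> lcd /home/me/project    (change directory on your local system)
sftp> put foo.txt             (copy foo.txt to the cluster)
sftp> get outputfile          (copy outputfile back to your local system)
sftp> quit
```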

The CONDOR Universe

CONDOR allows the user to choose a universe. You choose a universe based on which features of CONDOR you will need.

In the simplest universe, known as vanilla, CONDOR simply finds a suitable computer, sends the necessary files (the input and the executable program) to that machine, runs the program, and returns the output.

The vanilla universe is a good way to start using CONDOR. For one thing, the other universes require you to recompile your program with a special CONDOR library. By skipping that step you give up the ability to do parallel processing, checkpointing, and remote procedure calls, but for the simplest jobs, none of these features is necessary.

The standard universe allows you to submit a job to be run by CONDOR, but adds checkpointing and remote procedure calls. In order for these features to work, it is necessary that your executable program be compiled with the CONDOR libraries.

The MPI universe allows you to submit a job which is to run an MPI program on a given number of processors. Currently, the RCC cluster does not support the MPI universe.

Submitting a Job

To use CONDOR, the user prepares a submit description file, a text file which specifies the values of certain parameters, such as the name of the program to be run, the location of a file to be associated with standard input, a starting default directory, and so on.
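As a sketch, a minimal submit description file for the vanilla universe might look like the following (the program and file names are hypothetical, and the individual parameters are explained later in this document):

```
universe   = vanilla
executable = myprog
input      = input.txt
output     = output.txt
log        = myprog.log
queue
```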

Once a submit description file is prepared, the user may ask CONDOR to run the job, or more precisely, to manage the process of running the job. This is done by issuing the condor_submit command. For instance, if the submit description file was named foo.txt, then the user would type

condor_submit foo.txt

Of course, the condor_submit command can only be issued from a machine that is running CONDOR; however, the job doesn't have to run there; in fact, the user can request that the job actually be run on any suitable machine in the local collection of machines.

The user can also request that the job be run as an MPI job, in which case CONDOR is responsible for finding a suitable number of available processors on which the job can be executed.

The condor_submit command transfers the responsibility for running the job to CONDOR. But usually your job will not run right away. While you are waiting for results, you may be curious whether the job is still waiting for a suitable machine to be found, or has started, or is well on its way to completion. To find out the status of your jobs, issue the command

condor_q user_name

To find out the status of all jobs, issue the command

condor_q

Here is some of the output from the condor_q command:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 307.0   bleason         7/7  17:31   0+07:24:31 I  0   0.0  toy 1978          
 362.5   manowar         8/18 11:13  46+03:55:59 I  0   152  emigrate
 544.0   ely            10/7  13:46   0+02:25:51 R  20  2.1  sos        
 545.0   burdett        10/7  15:38   0+00:00:13 I  0   0.0  foo.csh           
      
Note, in particular, under the ST (status) column, that I indicates that the job is idle, that is, not running, while R means the job is currently running.

CONDOR Script Parameters:

universe = vanilla | standard | mpi
The universe parameter specifies how CONDOR is to be used.
initialdir = /home/your directory
This command specifies the directory in which your executable is stored. If you use the pwd command on the submit node, you may see a complicated path for your home directory, but the leading portion of that path should be replaced by /home. Thus, to specify that the job's initial directory should be the matlab/fred subdirectory of your home directory, you would issue a command like
          initialdir = /home/matlab/fred
        
executable = myprog
This command specifies the name of the executable program or shell script that is to be run.
log = logfile
This command specifies the name of a file into which CONDOR will write a running commentary of the process by which it set up and ran the job. This is occasionally useful if the job fails.
input = inputfile
This command specifies the name of a file from which standard input is to be read. If no such file is to be used, the command can be omitted, or the value of inputfile can be blank.
output = outputfile
This command specifies the name of the file into which the standard output should go.
requirements = requirement
requirements = (requirement1 && requirement2)
requirements = (requirement1 || requirement2)
This command specifies requirements that must be true of the system on which your job is run. If there is more than one requirement, they can be grouped using external parentheses and combined with logical operators such as && for "and" or || for "or". Sample requirements include (OpSys == "LINUX"), (Arch == "X86_64"), and (Memory >= 1024).
notification = choice
This specifies whether CONDOR should notify you about the job status. Valid options for choice include Always, Complete, Error, and Never.
rank = expression
The rank command tells CONDOR how to choose the appropriate machine when there are several available. In the simplest case, replacing expression by memory means that, given a choice, CONDOR should run your job on the machine with the most memory.
queue number
This command specifies the number of copies of the job to submit. It only makes sense to submit multiple copies if there is a way for each job to carry out a different task, based on an ID number.
should_transfer_files = choice
This specifies whether CONDOR should transfer files to and from the remote machine where the job is actually being run. Valid choices include YES, NO, and IF_NEEDED.
transfer_input_files = file1,file2,...
This command lists one or more input files that should be transferred to the remote machine.
when_to_transfer_output = choice
This specifies when output files should be transferred back to the submit node. Valid choices include ON_EXIT and ON_EXIT_OR_EVICT.
machine_count = number
This command is only needed for MPI jobs, and specifies the number of computers to be used.
arguments = arg1 arg2 arg3
This command allows you to supply the command line arguments you would give if you were running your executable interactively.
queue
This command should be the last command in your submit file. It causes CONDOR to process your job.
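Putting several of these parameters together, a submit description file that requests file transfer and submits four copies of a job might look like the following sketch (the program and file names are hypothetical):

```
universe                = vanilla
executable              = myprog
arguments               = $(Process)
output                  = myprog_$(Process).txt
log                     = myprog.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data1.txt,data2.txt
requirements            = (OpSys == "LINUX")
queue 4
```

Here $(Process) is the ID number of each copy (0 through 3), which gives each job a way to carry out a different task.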

Useful CONDOR commands

condor_compile is used to compile your program with the CONDOR library. This is needed if your program is to run under CONDOR's standard universe. The form of the command depends on the language you are using; typically, you prefix your usual compilation command with condor_compile.
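For instance, assuming the GNU compilers are available on the cluster, compiling a C or FORTRAN program for the standard universe might look like:

```
condor_compile gcc -o myprog myprog.c
condor_compile gfortran -o myprog myprog.f90
```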

condor_submit file_name
is how you submit a job.

condor_q
to find out the status of all jobs.

condor_q username
is how you find out the status of just your jobs.

condor_rm username
removes all jobs submitted by username from the queue, whether they are waiting or executing.

Related Data and Programs:

C_CONDOR, C programs which illustrate how a C program can be run in batch mode using the condor queueing system.

C++_CONDOR, C++ programs which illustrate how a C++ program can be run in batch mode using the condor queueing system.

F90_CONDOR, FORTRAN90 programs which illustrate how a FORTRAN90 program can be run in batch mode using the condor queueing system.

MATLAB_CONDOR, MATLAB programs which illustrate how MATLAB can be run in batch mode using the CONDOR queueing system.


Examples and Tests:

FOO is the simplest example I could think of, demonstrating the most basic use of CONDOR: running a basic shell script in the "vanilla" universe. This only took me a week to get right.

GOO is an example job that uses the "standard" universe. This only took me 10 minutes to get right.

HOO is an example job that uses the "MPI" universe to run a very simple MPI program on 4 processors. This only took me 20 minutes to get right.
The current CONDOR cluster does NOT support the MPI universe, so I can no longer run this job!

MOO is a simple example of how to run an executable compiled program (written in C, C++ or FORTRAN) using CONDOR. We assume that no MPI stuff is going on, and no checkpointing is being done. This is just the FOO example, but with a "more interesting" executable. To make this work, compile the program on the CONDOR submit node, and call the result moo. By default, CONDOR will then run the executable on a machine with the same architecture and operating system as the one from which the job was submitted.



Last revised on 27 August 2013.