CONDOR is a job control system which can accept jobs from a variety of users, and assign them for execution on one or more computers that it communicates with. In particular, if a user wishes to run a program in parallel, with MPI, CONDOR can find the necessary machines and manage the parallel execution of the job.
A job that is to be controlled by CONDOR must be "noninteractive". If the program would normally read from the keyboard, then a file of input must be prepared beforehand, and CONDOR must be told to use that file for input. If the program would normally display results to the terminal, then CONDOR must be told to save such results in an output file.
A CONDOR job will run on one or more nodes (think of these as individual computers, each of which contains one or more processors) in a cluster. A cluster consists of a collection of cooperating nodes. There is one special node, called the login node, or submit node or master node, from which you can submit jobs to be run on the cluster.
The FSU Research Computing Center (RCC) has a cluster whose login node is condor-login.rcc.fsu.edu. To get interactive access to this node, you log in as follows:
ssh condor-login.rcc.fsu.edu
If you want to login to the FSU RCC CONDOR login node from a remote site,
you may need to first sign into the FSU Virtual Private Network (VPN),
at http://vpn.fsu.edu.
Your CONDOR job will almost certainly need files (programs, data, job scripts). You may have created these on another system, or you may later wish to copy results back to your home system. In order to transfer files between your laptop, desktop, or home system and the CONDOR cluster, you need to use the SFTP program to establish a link between the systems:
sftp condor-login.rcc.fsu.edu
then use commands such as cd and lcd to change directories
on the cluster or your local system, and then use put or get
commands to move files to or from the cluster.
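For repeated transfers, the same commands can be collected in a batch file and run noninteractively with sftp's -b option. The file and directory names below are hypothetical:

```
# transfer.txt: a batch file of sftp commands.  Run it with:
#   sftp -b transfer.txt condor-login.rcc.fsu.edu
# cd changes directory on the cluster; lcd on the local system.
cd work
lcd project
# put copies a job script to the cluster; get brings results back.
put myjob.csh
get results.txt
quit
```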
CONDOR allows the user to choose a universe. You choose a universe based on which features of CONDOR you will need.
In the simplest universe, known as vanilla, CONDOR simply finds a suitable computer, sends the necessary files (the input and the executable program) to that machine, runs the program, and returns the output.
The vanilla universe is a good way to start using CONDOR, because the other universes require you to recompile your program with a special CONDOR library. Without that library you give up the ability to do parallel processing, checkpointing, and remote procedure calls, but for the simplest jobs none of these features is necessary.
The standard universe allows you to submit a job to be run by CONDOR, but adds checkpointing and remote procedure calls. In order for these features to work, it is necessary that your executable program be compiled with the CONDOR libraries.
The MPI universe allows you to submit a job which is to run an MPI program on a given number of processors. Currently, the RCC cluster does not support the MPI universe.
To use CONDOR, the user prepares a submit description file, a text file which specifies the values of certain parameters, such as the name of the program to be run, the location of a file to be associated with standard input, a starting default directory, and so on.
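A minimal submit description file for the vanilla universe might look like the following sketch. The executable and file names are hypothetical, and initialdir is just one of many optional parameters:

```
# A hypothetical submit description file, say moo.txt.
universe   = vanilla
executable = moo
input      = input.txt
output     = output.txt
error      = error.txt
log        = moo.log
initialdir = /home/matlab/fred
queue
```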
Once a submit description file is prepared, the user may ask CONDOR to run the job, or more precisely, to manage the process of running the job. This is done by issuing the condor_submit command. For instance, if the submit description file was named foo.txt, then the user would type
condor_submit foo.txt
Of course, the condor_submit command can only be issued from a machine that is running CONDOR. The job itself, however, does not have to run there: the user can request that it be run on any suitable machine in the local collection of machines.
The user can also request that the job be run as an MPI job, in which case CONDOR is responsible for finding a suitable number of available processors on which the job can be executed.
The condor_submit command transfers the responsibility for running the job to CONDOR. But usually your job will not run right away. While you are waiting for results, you may be curious whether the job is still waiting for a suitable machine to be found, or has started, or is well on its way to completion. To find out the status of your jobs, issue the command
condor_q user_name
To find out the status of all jobs, issue the command
condor_q
Here is some of the output from the condor_q command:
ID      OWNER     SUBMITTED     RUN_TIME     ST  PRI  SIZE  CMD
307.0   bleason   7/7   17:31   0+07:24:31   I   0    0.0   toy 1978
362.5   manowar   8/18  11:13   46+03:55:59  I   0    152   emigrate
544.0   ely       10/7  13:46   0+02:25:51   R   20   2.1   sos
545.0   burdett   10/7  15:38   0+00:00:13   I   0    0.0   foo.csh

Note, in particular, the ST (status) column: I indicates that the job is idle, that is, not running, while R means the job is currently running.
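Because the SUBMITTED date and time count as two separate whitespace-delimited columns, the ST field is the sixth column of the report. A small (hypothetical) pipeline to list only the idle jobs might be:

```shell
# List only the idle jobs (ST column "I") from the condor_q report.
# NR > 1 skips the header line; $6 is the ST column, because the
# SUBMITTED date and time count as two separate columns.
condor_q | awk 'NR > 1 && $6 == "I"'
```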
condor_compile is used to compile your program with the CONDOR library. This is needed if your program is to run under CONDOR's standard universe. The form of the command depends on the language you are using:

condor_compile cc myprog.c
or
condor_compile javac myprog.java
or
condor_compile f77 myprog.f

An MPI program, by contrast, is compiled with the MPI compiler wrappers:

mpicc myprog.c
or
mpiCC myprog.C
or
mpif77 myprog.f

A submit description file parameter such as

initialdir = /home/matlab/fred

specifies the starting default directory for the job.
condor_submit file_name
submits a job.
condor_q
reports the status of all jobs.
condor_q username
reports the status of just your jobs.
condor_rm username
removes all jobs submitted by username from the queue, whether they are waiting or executing.
C_CONDOR, C programs which illustrate how a C program can be run in batch mode using the CONDOR queueing system.
C++_CONDOR, C++ programs which illustrate how a C++ program can be run in batch mode using the CONDOR queueing system.
F90_CONDOR, FORTRAN90 programs which illustrate how a FORTRAN90 program can be run in batch mode using the CONDOR queueing system.
MATLAB_CONDOR, MATLAB programs which illustrate how MATLAB can be run in batch mode using the CONDOR queueing system.
FOO is the simplest example I could think of that would demonstrate the simplest use of CONDOR: running a basic shell script in the "vanilla" universe. This only took me a week to get right.
GOO is an example job that uses the "standard" universe. This only took me 10 minutes to get right.
HOO is an example job that uses the "MPI" universe to run a very simple MPI program on 4 processors. This only took me 20 minutes to get right. However, the current CONDOR cluster does NOT support the MPI universe, so I can no longer run this job!
MOO is a simple example of how to run an executable compiled program (written in C, C++ or FORTRAN) using CONDOR. We assume that no MPI stuff is going on, and no checkpointing is being done. This is just the FOO example, but with a "more interesting" executable. To make this work, compile the program on the CONDOR submit node, and call the executable moo. By default, this guarantees that the executable will be run on a machine with the same architecture and operating system as the one from which the CONDOR job was submitted.
You can go up one level to the Examples page.