Saturday, May 28, 2016

Setting up HTCondor on Ubuntu 16.04

References:
Note: To set up a cluster, you will need some form of DNS name service, or your /etc/hosts must map hostnames to their actual IP addresses rather than to loopback addresses with the 127 prefix. (If not, the most likely consequence is that condor_status will show you nothing.)
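
A quick way to check this on each node is to look at what the host name actually resolves to. The following is only a minimal sketch, assuming getent and awk are available (they are on a stock Ubuntu install); the function names is_loopback, resolve_host and check_host are made up for this illustration and are not part of condor.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given address is a loopback address.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print the first address the system resolver returns for a host name.
# getent consults /etc/hosts and DNS in the order set in /etc/nsswitch.conf.
resolve_host() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Report whether a host name resolves, and whether it resolves to loopback.
check_host() {
    addr=$(resolve_host "$1")
    if [ -z "$addr" ]; then
        echo "$1: does not resolve"
    elif is_loopback "$addr"; then
        echo "$1: resolves to loopback ($addr)"
    else
        echo "$1: ok ($addr)"
    fi
}

# Check this machine's own name; repeat for the other hosts in the pool.
check_host "$(hostname)"
```

If this reports a loopback address, the /etc/hosts entry for that name is the first thing to look at; Ubuntu typically maps the machine's own name to 127.0.1.1.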

Condor brings back fond memories: the Condor project was my old workplace, since I was a research assistant with them while I was a graduate student doing my Master's at the University of Wisconsin-Madison. So it was a pleasant surprise to find that condor is available as a standard package on Ubuntu.

To install condor:
sudo apt install htcondor
As part of the installation process, you'll be taken through the condor configuration screens. On these screens:
  • Manage initial HTCondor configuration automatically? Yes
  • Perform a "Personal HTCondor installation"? No (since we're setting up a cluster)
    • Note: If you just want to set up a single-node installation, say Yes here. This will bind condor to the local loopback 127 prefix address, and you will not be asked any of the questions below.
  • Role of this machine in the HTCondor pool:
    • select appropriate roles
  • File system domain label: <leave blank>
  • User directory domain label: <leave blank> (unless you know what you're doing)
  • Address of the central manager: <your central manager>
  • Machines with write access to this host: <comma-separated list>
Note that if you use machine names in the above and you don't have a DNS nameserver, you'll need to set up address resolution for them in your /etc/hosts.

Next, edit /etc/condor/condor_config.local and add the following lines (the file is empty to start with):
CONDOR_ADMIN = <username>@<hostname>
CONDOR_HOST = <hostname>
ALLOW_WRITE = <comma-separated list>
PS: to inspect a particular variable's value and where it is coming from, you can use
condor_config_val -v <VARIABLE>
At this point, you can run condor by starting the service using
sudo service condor start
However, one would rather set it up to autostart. You can do that with
sudo update-rc.d condor enable
All of the above steps are done on all hosts. The CONDOR_HOST variable is what decides which host is your central manager.

Now that you have a running condor pool, issue
condor_status
to see the status of the pool.

To run a sample job, we first need a program to run. We'll use reference 2 listed at the top, and steal the sample C code from there:
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    /* Expect exactly two arguments: a sleep time and an integer. */
    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer> \n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);

        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    /* Exit status: 0 on success, 1 on a usage error. */
    return failure;
}
Save this as simple.c and compile it with:
gcc -o simple simple.c
Create a submit file named submit with the following contents:
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.$(Process).out
Error      = simple.$(Process).error
Queue

Arguments = 4 11
Queue

Arguments = 4 12
Queue
Submit the job using the command:
condor_submit submit
And check the status of the job using the command:
condor_q
If the job gets held, release it by using the command:
condor_release -all
My condor install kept putting some of my jobs into the held state, so I ran
watch -n 10 condor_release -a
to keep releasing jobs every 10 seconds. I haven't figured out the right fix for this one yet.
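
One way to start narrowing this down might be to look at why the jobs were held before releasing them. The sketch below is an untested suggestion rather than a fix: summarize_reasons is a made-up helper that counts identical lines, and the usage line in the comment assumes the HoldReason job attribute and the -constraint and -af options of condor_q behave on your version as they do in current HTCondor releases.

```shell
#!/bin/sh
# Count identical lines read from stdin, most frequent first.
summarize_reasons() {
    sort | uniq -c | sort -rn
}

# Usage, on a machine where condor_q is available
# (JobStatus == 5 selects jobs in the held state):
#   condor_q -constraint 'JobStatus == 5' -af HoldReason | summarize_reasons
```

If one reason dominates the output, that is probably the place to start looking.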