Introduction
Batch processing is a standard way to queue and run large numbers of tasks while sharing resources with other groups. CANFAR provides a batch processing system that users can access either through a minimal HTTP service or, with many more features, through a login portal used to launch and manage processing jobs.
Batch Framework on the CANFAR clouds
The CANFAR scheduling system is orchestrated with the HTCondor high-throughput computing software and the cloud-scheduler, which together allow multi-cluster (multi-cloud) batch processing.
For a typical CANFAR user, HTCondor will feel similar to other scheduling software found on High Performance Computing platforms, such as PBS/Torque, Sun Grid Engine, or Slurm. Most users will not have to care about the cloud-scheduler, since it runs in the background launching Virtual Machines (VMs) while monitoring the job queues, but understanding HTCondor is important in order to submit and manage jobs.
Writing a Batch Job
Running a job with CANFAR means executing a program on a Virtual Machine. The first step is to write a script that HTCondor will transfer to the VM and execute there. It has to be a local executable. For example, to execute the echo command on the VM, the local script could contain:
#!/bin/bash
echo $*
Let’s name the local script myexec.bash. We now have to choose a machine that will execute the transferred script. Let’s assume a VM image called my_vm_image has been created following the tutorial. To execute the command, we need the smallest possible resource flavour that can boot the my_vm_image VM, that is c1-7.5gb-30. Open your favorite editor, and write a job file myjob.jdl.
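Since the script has to be a local executable, a quick local run can confirm that it has the execute bit set and behaves as expected. The following is a minimal sketch, assuming a bash shell on the machine where the script is written; it recreates the example script with a heredoc so that the snippet is self-contained.

```shell
# Recreate the example script; the quoted 'EOF' keeps the shell from
# expanding $* while writing the file.
cat > myexec.bash <<'EOF'
#!/bin/bash
echo $*
EOF

# HTCondor needs a local executable, so set the execute bit.
chmod +x myexec.bash

# Run it locally with the same arguments the job file will pass.
./myexec.bash Hello World
```

If the script prints its arguments back, it is ready to be referenced from the job file.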
A typical job will be like this:
executable = myexec.bash
arguments = "Hello World"
output = hello.out
error = hello.err
log = hello.log
queue 4
arguments = "Bonjour Monde"
output = bonjour.out
error = bonjour.err
log = bonjour.log
queue 4
Here is an explanation of the parameters:
output is the stdout of the job execution, which will be transferred back to the login portal batch.canfar.net at the end of the job.
error is the corresponding stderr of the job execution.
log is a log of HTCondor activities.
arguments contains the arguments we want to pass to the executable.
queue 4 means “send 4 jobs”, all with the arguments previously defined.
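Note that in the job file above, the four jobs of each queue statement share the same output and error file names. One possible way to keep them apart, sketched here, is to use HTCondor's `$(Cluster)` and `$(Process)` submit macros in the file names. The file name `myjob_split.jdl` is hypothetical, and this variant is an illustration rather than a CANFAR requirement.

```shell
# Write a variant of the job file; the quoted 'EOF' stops the shell from
# expanding $(Cluster) and $(Process), which HTCondor fills in per job.
cat > myjob_split.jdl <<'EOF'
executable = myexec.bash
arguments = "Hello World"
output = hello.$(Cluster).$(Process).out
error = hello.$(Cluster).$(Process).err
log = hello.log
queue 4
EOF

# Show the result; with "queue 4", $(Process) takes the values 0 to 3.
cat myjob_split.jdl
```

A single shared log file is kept on purpose, since HTCondor records each job's events in it with the job identifier.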
Multi jobs per VM or Multi-threading
By default, a batch-scheduled VM will spawn as many jobs as there are CPUs in the requested flavour. A c8-30gb-380 flavour launches an 8-core VM, and thus 8 jobs per VM. Most of the time it is more efficient to request fairly large flavours to spawn multiple concurrent jobs per VM. Imagine a bare metal node with 16 cores, and compare the overhead of running two 8-core VMs (2 hypervisors + 2 HTCondor + 16 jobs) with that of running sixteen 1-core VMs (16 hypervisors + 16 HTCondor + 16 jobs).
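The arithmetic behind this comparison can be sketched in a few lines of shell. The helper names `flavour_cores` and `vms_needed` are hypothetical, and reading the core count from the `c<N>-` prefix of the flavour name is an assumption based on the two flavours mentioned in this document.

```shell
# Extract the core count from a flavour name such as c8-30gb-380.
# Assumption: the name starts with "c<cores>-".
flavour_cores() {
    name=${1#c}          # drop the leading "c"
    echo "${name%%-*}"   # keep everything before the first "-"
}

# Number of VMs (and so of hypervisors and HTCondor instances) needed
# to run a given number of single-core jobs on a flavour, rounding up.
vms_needed() {
    jobs=$1
    cores=$(flavour_cores "$2")
    echo $(( (jobs + cores - 1) / cores ))
}

vms_needed 16 c8-30gb-380    # prints 2
vms_needed 16 c1-7.5gb-30    # prints 16
```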
In some cases you may need the whole VM for your job. Multi-threaded programs will usually benefit from it. You will need extra resource requirements in your job submission file. For example, if you need a minimum of 8 cores, 16GB of RAM and 250GB of scratch space for your job, use the following parameters in your job submission file:
request_memory = 16000
request_cpus = 8
request_disk = 250000000
You can also set different request parameters for each job. The VMs will be dynamically partitioned into jobs to make the best use of the VM flavour. See the submission examples for reference on the HTCondor request parameters.
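By default HTCondor interprets `request_memory` in MiB and `request_disk` in KiB, so the values in the example are round-number approximations of 16 GB and 250 GB. As a small sketch, the hypothetical helper `print_requests` prints the three request lines from more readable numbers, using the same decimal factors as the example.

```shell
# Print HTCondor resource request lines for a whole-VM job.
# Arguments: cores, memory in GB, scratch disk in GB.
# Uses the same round decimal factors as the example in the text.
print_requests() {
    echo "request_memory = $(( $2 * 1000 ))"
    echo "request_cpus = $1"
    echo "request_disk = $(( $3 * 1000000 ))"
}

print_requests 8 16 250
```

The output can be pasted into a job submission file before the queue statement.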
Managing Batch Jobs on the submission host
Job Submission
In this case, the submission files and the executable have to reside on the CANFAR batch login node. Connect to the batch node with your CADC username (referred to as [username]):
ssh [username]@batch.canfar.net
Then you will need to source your credentials to access your project’s VMs:
. [project]-openrc.sh
This file is the same as the one you can download by clicking the API Access tab in your project dashboard. Once it is downloaded, replace [project] with your project name.
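A quick way to confirm that the credentials were loaded is to check for the environment variables that such a file usually exports. This is a sketch under the assumption that the file is a standard OpenStack RC file: the variable names `OS_AUTH_URL` and `OS_USERNAME` are the common OpenStack ones and may differ in your file, and `check_openrc` is a hypothetical helper.

```shell
# Report which of the expected credential variables are unset or empty.
# Assumption: the RC file exports OS_AUTH_URL and OS_USERNAME.
check_openrc() {
    missing=0
    for var in OS_AUTH_URL OS_USERNAME; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use after sourcing the RC file:
#   . [project]-openrc.sh && check_openrc && echo "credentials loaded"
```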
To submit the job, there is a special wrapper script that will share your VM with CANFAR, add some boilerplate lines for the cloud-scheduler, then validate and submit the job. Instead of running condor_submit, you would run canfar_submit. For our simple example, you would do:
canfar_submit myjob.jdl my_vm_image c8-30gb-380
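Before handing the job file to canfar_submit, a local sanity check can catch simple mistakes such as a missing or non-executable script. The helper `presubmit_check` is a hypothetical sketch, not part of CANFAR or HTCondor; it only looks for the `executable = ...` line in the job file and checks the file it names.

```shell
# presubmit_check JOBFILE: verify that the job file is readable and that
# the script named on its "executable =" line exists and is executable.
presubmit_check() {
    jobfile=$1
    if [ ! -r "$jobfile" ]; then
        echo "cannot read job file: $jobfile" >&2
        return 1
    fi
    # Take the value after "=" on the first "executable" line.
    exe=$(sed -n 's/^[[:space:]]*executable[[:space:]]*=[[:space:]]*//p' "$jobfile" | head -n 1)
    if [ -z "$exe" ]; then
        echo "no executable line in $jobfile" >&2
        return 1
    fi
    if [ ! -x "$exe" ]; then
        echo "not executable: $exe" >&2
        return 1
    fi
    echo "ok: $exe"
}

# Typical use:
#   presubmit_check myjob.jdl && canfar_submit myjob.jdl my_vm_image c8-30gb-380
```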
Checking Job Status
HTCondor offers a great many command line tools to check the status of the VMs and the jobs. Below is a basic list of typical HTCondor commands for job management. For a more exhaustive list of commands, we refer the reader to the official HTCondor user documentation or a good overview and cheat sheet on SIEpedia.
Check the status of the global queue:
condor_q -all
Check the status of your jobs in a summarized format:
condor_q [username]
Or, look at a bit more detail about each task in the queue:
condor_q [username] -nobatch
and look for the status R: running, I: idle, X: marked for removal.
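For scripted monitoring, the numeric `JobStatus` job attribute can be easier to parse than the single-letter status. The sketch below counts jobs per state from lines of the form `ClusterId ProcId JobStatus`; the helper name `summarize_status` is hypothetical, and the sample input stands in for real condor_q output so that the snippet runs without HTCondor.

```shell
# Count jobs per state from lines of "ClusterId ProcId JobStatus".
# JobStatus codes: 1 = idle, 2 = running, 3 = removed,
# 4 = completed, 5 = held.
summarize_status() {
    awk '
        $3 == 1 { idle++ }
        $3 == 2 { running++ }
        $3 == 3 { removed++ }
        $3 == 5 { held++ }
        END { printf "running=%d idle=%d removed=%d held=%d\n",
                     running, idle, removed, held }
    '
}

# Sample input standing in for the real output of:
#   condor_q [username] -af ClusterId ProcId JobStatus
summarize_status <<'EOF'
11 0 2
11 1 2
11 2 1
11 3 1
12 0 5
EOF
```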
See why your job 11.3 is still idle (job status is “I”):
condor_q -better-analyze 11.3
Check to see if your VMs are joining the pool of execution hosts:
condor_status
Open a shell on the VM that is running your job 11.3:
condor_ssh_to_job 11.3