
The VoxBo scheduler

There are two types of things you do in VoxBo, as in life: interactive things and batch things (or jobs). Interactive things are things where you need to sit by the computer to keep punching buttons, clicking the mouse, watching the screen, etc. Batch jobs are things where you set the computer going and come back later when it’s done, like motion-correcting large datasets and solving large GLMs. As computers get faster, batch things tend to become interactive things – in an ideal world, there would be no batch things at all.

Functional neuroimaging data analysis presents some computational problems that are sufficiently time-consuming to require batch processing. In order to minimize the amount of time these tasks take, VoxBo implements a job scheduling mechanism that keeps track of what jobs need to be done, and ships them off to computers as they become available.

The utility of this is most easily seen in the context of big conceptual jobs (e.g., preprocessing of a subject’s data) that can easily be broken up into several smaller parts (e.g., slice acquisition correction, thresholding, etc.). Some of these pieces have to be executed sequentially (e.g., distortion correcting and smoothing data from a single run), but others are independent (e.g., motion correcting the first and the second runs of a subject’s data). By breaking up our jobs into relatively small pieces, and having a central scheduler decide when to run each piece, we gain several advantages:

  1. Efficiency: A limited but powerful form of parallel, distributed processing. Time-consuming jobs can be run on multiple machines simultaneously.
  2. Automation: Long sequences of jobs from multiple users can be queued up, each to be run in sequence when the resources are available. Instead of starting one process, waiting for it to finish, starting the next, waiting, etc., you can describe a large number at the same time, and have them all run at the right time.
  3. Flexibility: Long, time-consuming preprocessing sequences can be interrupted, resumed, and altered easily if they're broken up into multiple jobs.

In doing its scheduling, VoxBo takes into account the structure and priority of your job, the resources available, and laboratory policies. It's not very smart yet, but it's still very effective. If there are eight CPUs free and you need to motion correct eight scans as quickly as possible, you might get lucky and max out the whole lab for a few hours. Without this type of distributed processing, you’d probably end up doing it all on one machine, and it would take eight times as long. With more machines (a reasonable VoxBo-capable machine can be had for under $1500 these days), improvements of more than an order of magnitude are possible.

This does come at a cost. With the single-user model, if you find a free machine, you can set it working on your analysis, and it will crank away until it's done. With VoxBo's parallelization, a few big sequences can clog up the entire lab (especially after hours). So if there are a few big jobs stuck ahead of you in the queue, your work might not even get started for many hours.

Although it sounds bad, this is generally a good thing, especially if several people each need to run complex sequences that take a long time. Consider the case where three people each need about 24 machine-hours to do some processing, and there are three free machines. Under the old system, each would start their work on one machine, and each would finish about a day later. With parallel processing, the first sequence would run on all available machines first, followed by the second, followed by the third. The last sequence to finish should still finish in about a day. However, the first to start should finish in about eight hours (a third of a day) and the second to start should finish in about sixteen. The third might not get started until sixteen hours after it was queued, but it will run in a third the time, and therefore be finished at about the same time. In other words, 24 hours has gone from being the average case to the worst case.

Of course, if there happens to be a fourth machine available, all three jobs will finish in much less time than without job scheduling, assuming they can be broken down into small enough chunks to benefit.
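The arithmetic in this example can be checked with a short Python sketch. It is only an illustration of the idealized case: it assumes identical machines, sequences that split perfectly across them, and strict first-come-first-served order. The function name is made up for this example and is not part of VoxBo.

```python
def fifo_finish_times(work_hours, machines):
    """Finish time (in hours) of each sequence when the whole pool of
    machines works through the queued sequences one at a time.

    Idealized: identical machines, perfectly divisible work.
    """
    finish_times = []
    clock = 0.0
    for hours in work_hours:
        clock += hours / machines  # every machine works on this sequence
        finish_times.append(clock)
    return finish_times


work = [24, 24, 24]  # machine-hours needed by each of the three users

three_machines = fifo_finish_times(work, machines=3)
four_machines = fifo_finish_times(work, machines=4)

print(three_machines)                             # [8.0, 16.0, 24.0]
print(sum(three_machines) / len(three_machines))  # 16.0 (average finish time)
print(four_machines)                              # [6.0, 12.0, 18.0]
```

With three machines the worst case is 24 hours, the same as everyone running alone, while the average drops to 16. With a fourth machine even the last sequence finishes in 18 hours.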

Now, this is idealized a bit. If the machines are not all the same speed, and the jobs aren't perfectly atomic (we break them down into chunks, but some of them are still pretty chunky), it is possible that under some circumstances a job might actually take slightly longer to finish than if it had had the uninterrupted attention of a single CPU. The greatest risk is to jobs that would ordinarily take only a short amount of time, since they are more likely to run up against a bigger job that's using all the machines. If the big job is especially chunky, and none of its pieces are close to finishing, the little job will suffer in comparison to the single-user system. But we're pretty sure this won't happen often, if ever, especially if users are conscientious about assigning appropriate priorities to their work. And this small downside will be greatly outweighed by the majority of times when things will actually take much less time.

Jobs and Sequences

Whenever you set up some computationally intensive work for VoxBo to do (whether via a graphical interface like voxprep or VoxAnalyze, or via a command line program like gdesign), it gets broken down into a set of logically related jobs, and placed in a queue. We refer to the entire set of those jobs as a sequence. Ideally each job is relatively small, so that your processing stream can be interrupted if necessary and resumed later, or so that an error can be corrected without having to start over from scratch.
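One way to picture a sequence is as a set of jobs in which some must wait for others, while independent jobs can run at the same time. The Python sketch below is a toy model of that idea only. The job names and the dictionary layout are hypothetical and say nothing about how VoxBo stores sequences internally.

```python
# Hypothetical jobs: each entry lists the jobs that must finish first.
depends_on = {
    "motion_correct_run1": [],
    "motion_correct_run2": [],
    "smooth_run1": ["motion_correct_run1"],
    "smooth_run2": ["motion_correct_run2"],
}


def ready_jobs(depends_on, done):
    """Return the jobs that are not yet done and whose prerequisites are all done."""
    return sorted(
        job
        for job, needs in depends_on.items()
        if job not in done and all(need in done for need in needs)
    )


# Nothing has run yet: the two motion-correction jobs are independent,
# so both could be handed to free machines at once.
print(ready_jobs(depends_on, done=set()))

# Once run 1 is motion corrected, its smoothing job becomes ready too.
print(ready_jobs(depends_on, done={"motion_correct_run1"}))
```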

Priorities

Every sequence has a priority associated with it, an integer from 1 to 5, and correspondingly each machine has a priority level that it will accept. VoxBo uses this information (and several other factors) to do its scheduling.

To make the priority system work, you need to set the priority levels for your sequences both appropriately and cooperatively. To make this as easy as possible, we have associated some semantics with various priority levels:

  1. Lowest: Yield to anything, and only run during off hours. This level is good for anything you want to run overnight, like data preprocessing.
  2. Low: Run during the day, but yield to 'normal' and higher jobs.
  3. Normal: Go ahead and run on unreserved machines, but yield to more urgent work. Take precedence over low-priority jobs. The majority of your jobs should be of normal priority.
  4. High: Run as soon as possible on any machine that indicates some level of availability. Use this level when you have a good reason to want to jump the queue, or to use machines that would ordinarily be reserved for interactive work.
  5. Emergency: Run immediately or sooner if at all possible, on any machine that considers itself up.

By default, we assume that all jobs will run at normal priority and that most machines will be set at priority 2 during the day, 1 at night or on weekends. Machines that need to be reserved for interactive use may be set to priority 3 or 4.
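The interaction between sequence priorities and machine levels can be sketched in a few lines of Python. The rule used here, that a machine runs a sequence whose priority is at least the machine's own level, is our reading of the defaults described above and is an assumption, not VoxBo's actual scheduling code (which also weighs several other factors). The names `Priority` and `machine_accepts` are made up for the example.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOWEST = 1
    LOW = 2
    NORMAL = 3
    HIGH = 4
    EMERGENCY = 5


def machine_accepts(machine_level, sequence_priority):
    """Assumed rule: accept a sequence whose priority is at least the machine's level."""
    return sequence_priority >= machine_level


daytime_level = 2   # typical machine during the day
night_level = 1     # typical machine at night or on weekends
reserved_level = 4  # machine reserved for interactive use

print(machine_accepts(daytime_level, Priority.NORMAL))   # True
print(machine_accepts(daytime_level, Priority.LOWEST))   # False: waits for off hours
print(machine_accepts(night_level, Priority.LOWEST))     # True
print(machine_accepts(reserved_level, Priority.NORMAL))  # False
print(machine_accepts(reserved_level, Priority.HIGH))    # True
```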

The priority system is meant to work fine as long as no one gets in the habit of running their jobs at higher priority without good reason. Emergency priority should be reserved for genuine emergencies, times when you don’t mind bringing everyone else’s work to a standstill if that’s what it takes. Remember that everyone will be waiting for their work to complete, and anyone who sees that your job has jumped the queue may want to stop by your office for a chat.

After Your Sequence is Submitted

VoxBo has complex mechanisms for scheduling when the various things you need to do will take place. voxq is your way of interacting with those mechanisms – canceling and rescheduling jobs, seeing what's running, requesting machines for your personal use, and many other functions. If you type voxq (or just 'vq') in a terminal window on a line by itself, you'll see all the available options, as follows:

VoxBo voxq (v1.8.5pre4/Mar 23 2009)
usage:
  voxq -c                 show status of voxbo CPUs
  voxq -l                 show resource info for all CPUs
  voxq -s [seq]           show sequences (or just seq) currently in queue
  voxq -a                 show sequences and CPUs
  voxq -g <hosts> [hrs=1] gimme server (shut down queueing)
  voxq -b <hosts>         giveback server (restart queueing)
  voxq -d <seq>           debug sequence (see bad log files)
  voxq -k <num>           kill sequence (permanent)
  voxq -y <num>           halt a running sequence
  voxq -p <num>           postpone a sequence
  voxq -r <num>           resume a postponed sequence
  voxq -t <num>           retry a sequence (let 'bad' jobs retry)
  voxq -u <num>           mark bad jobs in a sequence as done
  voxq -w <num>           do-over -- mark all jobs as waiting
  voxq -m <seq> <max>     set maxjobs for a sequence
  voxq -x <seq> [job=all] examine sequence in detail
  voxq -# <seqnum>        change sequence priority to 1/2/3/4/5
  voxq -h                 get help (this message)
  voxq -v                 print voxbo version information

<num> can be a single sequence number, a range, or both (e.g., 1-3,7-10)

These are the different things voxq will let you do. Here is an explanation of some of them.
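First, a note on the sequence-number argument in the usage message, which accepts single numbers, ranges, or both (e.g., 1-3,7-10). The Python sketch below shows one plausible way to expand such an argument. It assumes ranges are inclusive at both ends, which the usage message does not state explicitly, and the function name is hypothetical. It is handy if you want to script around voxq, not a description of voxq's own parser.

```python
def expand_sequence_spec(spec):
    """Expand a spec such as "1-3,7-10" into a sorted list of sequence numbers.

    Assumes ranges are inclusive at both ends.
    """
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(piece) for piece in part.split("-", 1))
            numbers.update(range(first, last + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


print(expand_sequence_spec("1-3,7-10"))  # [1, 2, 3, 7, 8, 9, 10]
print(expand_sequence_spec("8182"))      # [8182]
```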

Viewing the Status of all the Machines (voxq -c)

When the VoxBo server is running on multiple machines, figuring out what machines are doing what can be useful. For this we provide the following command:

voxq -c

The output should look something like the following:

Server      Load Pri Job                            Pct Elapsed
blob        3.00  2  vbbatch job (08182-00030)        - 4:54:20 
                     vbbatch job (08182-00033)        - 3:50:54 
                     vbbatch job (08182-00037)        - 1:26:33 
kart        1.18  4  <idle>
pace        1.00  2  vbbatch job (08182-00035)        - 2:35:51 
snax        0.00  2  <idle>
spin        1.00  2  vbbatch job (08190-00001)        - 17:21:07
sumo        0.03  2  <idle>
trax        0.00  2  <idle>
trix        1.99  2  vbbatch job (08182-00039)        - 0:25:21 
                     vbbatch job (08182-00040)        - 0:18:55 

This gives you a snapshot summary of what's going on with every server machine. The Load column indicates how many CPUs are being used on that machine, and the Pri column indicates the minimum priority of jobs the server is accepting. The Job column gives the name of the job running, followed in parentheses by the sequence number and then the job number within that sequence. If the job reports a percent complete, it appears in the Pct column, and the total elapsed time for that job appears in the Elapsed column.
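The identifier in parentheses can be split into its two parts with a small regular expression. This Python sketch is based only on the sample output shown above (for instance, 08182-00030 is job 30 of sequence 8182). The helper name is hypothetical, and the format could differ in other VoxBo versions.

```python
import re

# Matches identifiers such as "(08182-00030)": sequence number, dash, job number.
JOB_ID = re.compile(r"\((\d+)-(\d+)\)")


def parse_job_id(text):
    """Return (sequence_number, job_number) found in text, or None if there is none."""
    match = JOB_ID.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_job_id("vbbatch job (08182-00030)"))  # (8182, 30)
print(parse_job_id("<idle>"))                     # None
```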

Viewing the Job/Sequence Queue (voxq -s)

If you want to know where your jobs stand in the grand scheme of things, or what the current logjam looks like, type

voxq -s

at the command line, and you will see something like

------------------------------------------------------------------------
Sequence Name        Num  Pri Owner    Wait  Run   Bad   Done Total
------------------------------------------------------------------------
biasCorNreg_zheng    8182  3  yuanjiez  64    10    0     31   105   
pdMapT1NT3           8190  3  pcook     1     1     0     2    4     
------------------------------------------------------------------------

This lists all the VoxBo sequences currently in the queue and their status of completion: the number of jobs waiting to run, currently running, gone bad, and completed. It also shows each sequence's number, which you can use with some of the other voxq functions (see below).
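If you save this listing to a file or capture it in a script, the columns are easy to pick apart. The Python sketch below parses the sample shown above and reports how far along each sequence is. It assumes sequence names and owner names contain no spaces, which holds for the sample but is not guaranteed. The function and variable names are ours.

```python
VOXQ_S_SAMPLE = """\
------------------------------------------------------------------------
Sequence Name        Num  Pri Owner    Wait  Run   Bad   Done Total
------------------------------------------------------------------------
biasCorNreg_zheng    8182  3  yuanjiez  64    10    0     31   105
pdMapT1NT3           8190  3  pcook     1     1     0     2    4
------------------------------------------------------------------------
"""

FIELDS = ("name", "num", "pri", "owner", "wait", "run", "bad", "done", "total")


def parse_sequences(text):
    """Turn each data row of a `voxq -s` listing into a dictionary."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly nine fields, the second one numeric;
        # this skips the rule lines and the column header.
        if len(parts) != len(FIELDS) or not parts[1].isdigit():
            continue
        row = dict(zip(FIELDS, parts))
        for key in FIELDS:
            if key not in ("name", "owner"):
                row[key] = int(row[key])
        rows.append(row)
    return rows


for row in parse_sequences(VOXQ_S_SAMPLE):
    percent = 100 * row["done"] / row["total"]
    print(f"{row['num']} ({row['owner']}): {percent:.0f}% done, {row['bad']} bad")
```

In the sample, the Wait, Run, Bad, and Done columns add up to Total for both sequences, which makes a useful sanity check on the parse.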

Getting the Big Picture (voxq -a)

We noticed people like to do voxq -c and voxq -s together a lot. So we wrapped them up together as voxq -a.

Postponing and Resuming Sequences (voxq -p and voxq -r)

Suppose you just submitted a sequence and then decided you're not quite ready to have it run (maybe you realized your data isn't quite finished downloading, or someone just begged you to let their job run first). Just type

voxq -p <num> [<num> …]

to postpone your sequence (replace “<num>” with the actual number of your sequence, which you can get using voxq -s). Your sequence will re-enter the queue at its original spot when you type

voxq -r <num> [<num> …]

Please don't abuse this feature to queue jobs that aren't quite ready to run just so you can get a good spot in the queue. You can only postpone jobs you yourself queued.

Grabbing and Relinquishing a CPU (voxq -g and voxq -b)

Often you will want to prevent VoxBo jobs from running on a CPU because you are using it for some other purpose. Although we are hoping to have a more elaborate mechanism in place shortly, currently anyone can ask any of the servers to go up or down, using voxq. To ask a server to stop accepting jobs, type

voxq -g <servername> [<servername> …]

at the command line, where “<servername>” is something like “fornix” or “pons”. If you then do a voxq -c, you’ll see that you have the machine reserved (assuming nobody beat you to it). Any VoxBo job currently running on that CPU will continue to run, but nothing new will be started.

When you want to return the CPU to active use, just type

voxq -b <servername> [<servername> …]

Killing Sequences (voxq -k)

If you’ve queued a sequence you wish you hadn’t, you can kill it by typing:

voxq -k <num> [<num> …]

Anything already running will still complete, but nothing new will start. When all the running jobs have completed, the sequence will disappear from the queue.

Finding a free machine to work on

One problem that arises when machines are used both interactively and for batch processing is that you might end up logging onto a machine only to find out it’s already busy running something. To minimize the chances of this happening, we may decide to make certain machines unavailable for low-priority processing during the day, so they will always be available for interactive sessions (except under special circumstances). Hopefully, urgent processing of data will be unusual enough that you won’t have to worry about this too often.

However, there's an easy way to check the status of all the machines in the lab, so that you'll have to log into at most two machines. Just type

voxq -c

at the command line, to see which machines are idle, unreserved, and have free IDL licenses (if you need one).
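If you do this often, picking out the idle machines can be scripted. The Python sketch below works on the sample voxq -c output shown earlier. It treats a machine as free when its Job column reads <idle> and its load is low. The load threshold of 0.5 is an arbitrary assumption of ours, and the sample has no column for IDL licenses, so that check is not covered here.

```python
VOXQ_C_SAMPLE = """\
Server      Load Pri Job                            Pct Elapsed
blob        3.00  2  vbbatch job (08182-00030)        - 4:54:20
                     vbbatch job (08182-00033)        - 3:50:54
                     vbbatch job (08182-00037)        - 1:26:33
kart        1.18  4  <idle>
pace        1.00  2  vbbatch job (08182-00035)        - 2:35:51
snax        0.00  2  <idle>
spin        1.00  2  vbbatch job (08190-00001)        - 17:21:07
sumo        0.03  2  <idle>
trax        0.00  2  <idle>
trix        1.99  2  vbbatch job (08182-00039)        - 0:25:21
                     vbbatch job (08182-00040)        - 0:18:55
"""


def idle_servers(text, max_load=0.5):
    """Return names of servers with no VoxBo job and a load of at most max_load."""
    idle = []
    for line in text.splitlines()[1:]:     # skip the column header
        if not line or line[0].isspace():  # blank line or extra job on the same server
            continue
        name, load, _pri, job = line.split(None, 3)
        if job.strip() == "<idle>" and float(load) <= max_load:
            idle.append(name)
    return idle


print(idle_servers(VOXQ_C_SAMPLE))  # ['snax', 'sumo', 'trax']
```

Note that kart is left out even though it shows <idle>: its load of 1.18 suggests someone is already using it for something other than VoxBo.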

public/introduction_to_the_voxbo_scheduler.txt · Last modified: 2009/09/04 20:15 by aguirreg