There are two types of things you do in VoxBo, as in life: interactive things and batch things (or jobs). Interactive things are things where you need to sit by the computer to keep punching buttons, clicking the mouse, watching the screen, etc. Batch jobs are things where you set the computer going and come back later when it’s done, like motion-correcting large datasets and solving large GLMs. As computers get faster, batch things tend to become interactive things – in an ideal world, there would be no batch things at all.
Functional neuroimaging data analysis presents some computational problems that are sufficiently time-consuming to require batch processing. In order to minimize the amount of time these tasks take, VoxBo implements a job scheduling mechanism that keeps track of what jobs need to be done, and ships them off to computers as they become available.
The utility of this is most easily seen in the context of big conceptual jobs (e.g., preprocessing of a subject’s data) that can easily be broken up into several smaller parts (e.g., slice acquisition correction, thresholding, etc.). Some of these pieces have to be executed sequentially (e.g., distortion correcting and smoothing data from a single run), but others are independent (e.g., motion correcting the first and the second runs of a subject’s data). By breaking up our jobs into relatively small pieces, and having a central scheduler decide when to run each piece, we gain several advantages:
In doing its scheduling, VoxBo takes into account the structure and priority of your job, the resources available, and laboratory policies. It's not very smart yet, but it's still very effective. If there are eight CPUs free and you need to motion correct eight scans as quickly as possible, you might get lucky and max out the whole lab for a few hours. Without this type of distributed processing, you’d probably end up doing it all on one machine, and it would take eight times as long. With more machines (a reasonable VoxBo-capable machine can be had for under $1500 these days), improvements of more than an order of magnitude are possible.
With the single-user model, if you find a free machine, you can set it working on your analysis, and it will crank away until it's done. With VoxBo's parallelization, a few big sequences can clog up the entire lab (especially after hours). So if there are a few big jobs stuck ahead of you in the queue, your work might not even get started for many hours.
Although it sounds bad, this is generally a good thing, especially if several people each need to run complex sequences that take a long time. Consider the case where three people each need about 24 machine-hours to do some processing, and there are three free machines. Under the old system, each would start their work on one machine, and each would finish about a day later. With parallel processing, the first sequence would run on all available machines first, followed by the second, followed by the third. The last sequence to finish should still finish in about a day. However, the first to start should finish in about eight hours (a third of a day) and the second to start should finish in about sixteen. The third might not get started until sixteen hours after it was queued, but it will run in a third the time, and therefore be finished at about the same time. In other words, 24 hours has gone from being the average case to the worst case.
Of course, if there happens to be a fourth machine available, all three jobs will finish in much less time than without job scheduling, assuming they can be broken down into small enough chunks to benefit.
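The arithmetic in this example can be checked with a short sketch. It assumes the idealized case described above: every sequence divides perfectly evenly across identical machines, and sequences run strictly one after another. The function names are ours, for illustration only, and are not part of VoxBo.

```python
def finish_times_dedicated(work_hours):
    """Old system: each sequence runs alone on its own machine,
    so it finishes after exactly its own machine-hours."""
    return [float(hours) for hours in work_hours]


def finish_times_pooled(work_hours, machines):
    """Idealized scheduler: sequences run in queue order, each one
    spread evenly over all machines before the next one starts."""
    finished = []
    clock = 0.0
    for hours in work_hours:
        clock += hours / machines
        finished.append(clock)
    return finished


work = [24, 24, 24]  # three people, about 24 machine-hours each

dedicated = finish_times_dedicated(work)
pooled = finish_times_pooled(work, machines=3)

print("one machine each:", dedicated)
print("pooled over 3 machines:", pooled)
print("average finish:", sum(dedicated) / 3, "vs", sum(pooled) / 3)
print("pooled over 4 machines:", finish_times_pooled(work, machines=4))
```

With three machines the pooled finish times are 8, 16, and 24 hours, so the average drops from 24 hours to 16 while the worst case stays at 24. With a fourth machine, even the last sequence finishes in 18 hours.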
Now, this is idealized a bit. If the machines are not all the same speed, and the jobs aren't perfectly atomic (we break them down into chunks, but some of them are still pretty chunky), it is possible that under some circumstances a job might actually take slightly longer to finish than if it had had the uninterrupted attention of a single CPU. The greatest risk is to jobs that would ordinarily take only a short amount of time, since they are more likely to run up against a bigger job that's using all the machines. If the big job is especially chunky, and none of its pieces are close to finishing, the little job will suffer in comparison to the single-user system. But we're pretty sure this won't happen often, if ever, especially if users are conscientious about assigning appropriate priorities to their work. And this small downside will be greatly outweighed by the majority of times when things will actually take much less time.
Whenever you set up some computationally intensive work for VoxBo to do (whether via a graphical interface like voxprep or VoxAnalyze, or via a command line program like gdesign), it gets broken down into a set of logically related jobs and placed in a queue. We refer to the entire set of those jobs as a sequence. Ideally each job is relatively small, so that your processing stream can be interrupted if necessary and resumed later, or so that an error can be corrected without having to start over from scratch.
Every sequence has a priority associated with it, an integer from 1 to 5, and correspondingly each machine has a priority level that it will accept. VoxBo uses this information (and several other factors) to do its scheduling.
To make the priority system work, you need to set the priority levels for your sequences both appropriately and cooperatively. To make this as easy as possible, we have associated some semantics with various priority levels:
By default, we assume that all jobs will run at normal priority and that most machines will be set at priority 2 during the day, 1 at night or on weekends. Machines that need to be reserved for interactive use may be set to priority 3 or 4.
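As a rough illustration of how a machine's priority level acts as a threshold, here is a minimal sketch. It models only the rule that a machine accepts sequences at or above its own level. VoxBo's real scheduler weighs several other factors, and the function names, the server names, and the assumption that machine levels use the same 1 to 5 range as sequences are ours.

```python
def accepts(server_min_priority, sequence_priority):
    """True if a server set to this level would take jobs from a
    sequence with the given priority (threshold rule only)."""
    for value in (server_min_priority, sequence_priority):
        if not 1 <= value <= 5:
            raise ValueError("priorities are integers from 1 to 5")
    return sequence_priority >= server_min_priority


def eligible_servers(servers, sequence_priority):
    """Names of the servers whose level admits this sequence."""
    return sorted(
        name for name, level in servers.items()
        if accepts(level, sequence_priority)
    )


# Hypothetical lab: server name -> minimum priority currently accepted.
servers = {"blob": 2, "kart": 4, "pace": 2}

print(eligible_servers(servers, 3))
print(eligible_servers(servers, 5))
```

In this made-up lab, a priority 3 sequence can use blob and pace but not kart, which is held back at level 4. Only a sequence at priority 4 or 5 would reach all three.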
The priority system is meant to work fine as long as no one gets in the habit of running their jobs at higher priority without good reason. Emergency priority should be reserved for genuine emergencies, times when you don’t mind bringing everyone else’s work to a standstill if that’s what it takes. Remember that everyone will be waiting for their work to complete, and anyone who sees that your job has jumped the queue may want to stop by your office for a chat.
VoxBo has complex mechanisms for scheduling when the various things you need to do will take place. voxq is your way of interacting with those mechanisms – canceling and rescheduling jobs, seeing what's running, requesting machines for your personal use, and many other functions.
If you type voxq (or just 'vq') in a terminal window on a line by itself, you'll see all the available options, as follows:
VoxBo voxq (v1.8.5pre4/Mar 23 2009)
usage:
  voxq -c                   show status of voxbo CPUs
  voxq -l                   show resource info for all CPUs
  voxq -s [seq]             show sequences (or just seq) currently in queue
  voxq -a                   show sequences and CPUs
  voxq -g <hosts> [hrs=1]   gimme server (shut down queueing)
  voxq -b <hosts>           giveback server (restart queueing)
  voxq -d <seq>             debug sequence (see bad log files)
  voxq -k <num>             kill sequence (permanent)
  voxq -y <num>             halt a running sequence
  voxq -p <num>             postpone a sequence
  voxq -r <num>             resume a postponed sequence
  voxq -t <num>             retry a sequence (let 'bad' jobs retry)
  voxq -u <num>             mark bad jobs in a sequence as done
  voxq -w <num>             do-over -- mark all jobs as waiting
  voxq -m <seq> <max>       set maxjobs for a sequence
  voxq -x <seq> [job=all]   examine sequence in detail
  voxq -# <seqnum>          change sequence priority to 1/2/3/4/5
  voxq -h                   get help (this message)
  voxq -v                   print voxbo version information
<num> can be a single sequence number, a range, or both (e.g., 1-3,7-10)
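The last line of the usage message describes the <num> syntax: single sequence numbers, ranges, or a comma-separated mix. The sketch below shows one way such a spec could be expanded into individual sequence numbers. It is our illustration of the syntax, not voxq's own parser, and the function name is hypothetical.

```python
def expand_sequence_spec(spec):
    """Expand a spec such as "1-3,7-10" into a list of sequence numbers.

    Each comma-separated part is either a single number ("7")
    or an inclusive range ("1-3").
    """
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(piece) for piece in part.split("-", 1))
            if first > last:
                raise ValueError("range runs backwards: " + part)
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(part))
    return numbers


print(expand_sequence_spec("1-3,7-10"))
print(expand_sequence_spec("8182"))
```

The first call yields 1, 2, 3, 7, 8, 9, 10, and the second yields the single sequence 8182.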
These are the different things voxq will let you do. Here is an explanation of some of them.
When the VoxBo server is running on multiple machines, figuring out what machines are doing what can be useful. For this we provide the following command:
voxq -c
The output should look something like the following:
Server  Load  Pri  Job                          Pct  Elapsed
blob    3.00  2    vbbatch job (08182-00030)    -    4:54:20
                   vbbatch job (08182-00033)    -    3:50:54
                   vbbatch job (08182-00037)    -    1:26:33
kart    1.18  4    <idle>
pace    1.00  2    vbbatch job (08182-00035)    -    2:35:51
snax    0.00  2    <idle>
spin    1.00  2    vbbatch job (08190-00001)    -    17:21:07
sumo    0.03  2    <idle>
trax    0.00  2    <idle>
trix    1.99  2    vbbatch job (08182-00039)    -    0:25:21
                   vbbatch job (08182-00040)    -    0:18:55
This gives you a snapshot summary of what's going on with every server machine. The Load column indicates how many CPUs are being used on that machine, the Pri column indicates the minimum priority of jobs the server is accepting, and the Job column gives the name of the job running and, in parentheses, the sequence number followed by the job number for that sequence. If the job reports a percent complete, it will be indicated in the Pct column, and the total elapsed time for that job is indicated in the Elapsed column.
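If you want to pick out idle servers from this listing in a script, something like the following sketch would do it. It assumes the output looks like the sample above (one line per server, with extra jobs on continuation lines), which may differ between VoxBo versions. The sample text is pasted in rather than captured from a live voxq -c, and the function name is ours.

```python
import re

SAMPLE = """\
Server  Load  Pri  Job                          Pct  Elapsed
blob    3.00  2    vbbatch job (08182-00030)    -    4:54:20
                   vbbatch job (08182-00033)    -    3:50:54
                   vbbatch job (08182-00037)    -    1:26:33
kart    1.18  4    <idle>
pace    1.00  2    vbbatch job (08182-00035)    -    2:35:51
snax    0.00  2    <idle>
"""

# A server line starts with a name, then a decimal load, then an integer level.
SERVER_LINE = re.compile(r"^(\S+)\s+(\d+\.\d+)\s+(\d+)\s+(.*)$")


def parse_cpu_status(text):
    """Turn voxq -c style text into a list of per-server dicts."""
    servers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = SERVER_LINE.match(line)
        if match:
            name, load, pri, rest = match.groups()
            servers.append(
                {"name": name, "load": float(load), "pri": int(pri), "jobs": []}
            )
            job = rest.strip()
        else:
            # Header line, or a continuation line listing another job.
            job = line.strip()
        if servers and job != "<idle>":
            servers[-1]["jobs"].append(job)
    return servers


servers = parse_cpu_status(SAMPLE)
idle = [s["name"] for s in servers if not s["jobs"]]
print("idle servers:", idle)
```

Note that an idle server can still show a nonzero load, as kart does in the sample, so a script that cares about actual CPU availability should look at the Load column as well.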
If you want to know where your jobs stand in the grand scheme of things, or what the current logjam looks like, type
voxq -s
at the command line, and you will see something like
------------------------------------------------------------------------
Sequence Name          Num   Pri  Owner      Wait  Run  Bad  Done  Total
------------------------------------------------------------------------
biasCorNreg_zheng      8182  3    yuanjiez   64    10   0    31    105
pdMapT1NT3             8190  3    pcook      1     1    0    2     4
------------------------------------------------------------------------
This is a list of all the VoxBo sequences currently in the queue and their status of completion, including the number of jobs waiting to run, currently running, gone bad, and completed. The Num column gives each sequence's sequence number, which you can use with some of the other voxq functions (see below).
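The queue listing is regular enough to summarize in a script. The sketch below computes how far along each sequence is. It assumes the nine-column layout of the sample output shown here, uses pasted sample text instead of calling voxq, and its function name is our own.

```python
SAMPLE = """\
------------------------------------------------------------------------
Sequence Name          Num   Pri  Owner      Wait  Run  Bad  Done  Total
------------------------------------------------------------------------
biasCorNreg_zheng      8182  3    yuanjiez   64    10   0    31    105
pdMapT1NT3             8190  3    pcook      1     1    0    2     4
------------------------------------------------------------------------
"""


def parse_queue(text):
    """Turn voxq -s style text into a list of per-sequence dicts."""
    rows = []
    for line in text.splitlines():
        # Split from the right so the sequence name keeps any spaces.
        fields = line.strip().rsplit(None, 8)
        if len(fields) != 9:
            continue  # separator or blank line
        name, num, pri, owner, wait, run, bad, done, total = fields
        counts = (num, pri, wait, run, bad, done, total)
        if not all(value.isdigit() for value in counts):
            continue  # header line
        rows.append({
            "name": name, "num": int(num), "pri": int(pri), "owner": owner,
            "wait": int(wait), "run": int(run), "bad": int(bad),
            "done": int(done), "total": int(total),
        })
    return rows


for seq in parse_queue(SAMPLE):
    percent = 100.0 * seq["done"] / seq["total"]
    print(seq["num"], seq["name"], "%.1f%% done" % percent)
```

In the sample, the Wait, Run, Bad, and Done counts add up to Total for both sequences, which makes a handy sanity check on the parse.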
We noticed people like to do voxq -c and voxq -s together a lot. So we wrapped them up together as voxq -a.
Suppose you just submitted a sequence and then decided you're not quite ready to have it run (maybe you realized your data isn't quite finished downloading, or someone just begged you to let their job run first). Just type
voxq -p <num> [<num> …]
to postpone your sequence (replace “<num>” with the actual number of your sequence, which you can get using voxq -s). Your sequence will re-enter the queue at its original spot when you type
voxq -r <num> [<num> …]
Please don't abuse this feature to queue jobs that aren't quite ready to run just so you can get a good spot in the queue. You can only postpone jobs you yourself queued.
Often you will want to prevent VoxBo jobs from running on a CPU because you are using it for some other purpose. Although we are hoping to have a more elaborate mechanism in place shortly, currently anyone can ask any of the servers to go up or down, using voxq. To ask a server to stop accepting jobs, type
voxq -g <servername> [<servername> …]
at the command line, where “<servername>” is something like “fornix” or “pons”. If you then do a voxq -c, you’ll see that you have the machine reserved (assuming nobody beat you to it). Any VoxBo job currently running on that CPU will continue to run, but nothing new will be started.
When you want to return the CPU to active use, just type
voxq -b <servername> [<servername> …]
If you’ve queued a sequence you wish you hadn’t, you can kill it by typing:
voxq -k <num> [<num> …]
Anything already running will still complete, but nothing new will start. When all the running jobs have completed, the sequence will disappear from the queue.
One problem that arises when machines are used both interactively and for batch processing is that you might end up logging onto a machine only to find out it’s already busy running something. To minimize the chances of this happening, we may decide to make certain machines unavailable for low-priority processing during the day, so they will always be available for interactive sessions (except under special circumstances). Hopefully, urgent processing of data will be unusual enough that you won’t have to worry about this too often.
However, there's an easy way to check the status of all the machines in the lab, so that at most you'll have to log into two. Just type
voxq -c
at the command line, to see which machines are idle, unreserved, and have free IDL licenses (if you need one).