using_ogs_sge — last modified 2018/03/02 20:35 by mgstauff
The most common way to use SGE is to run batch jobs via ''qsub''.
You run a batch job like so, where ''myjobscript'' is the name of your job script file:

  [mgstauff@chead ~]$ qsub myjobscript
  Your job 27657 ("myjobscript") has been submitted
Here's an example BASH script that could be in the file named ''myjobscript'':

  #!/bin/bash
  echo I am a job running
  ZZZ=5
  echo Sleeping
  sleep $ZZZ
  echo NSLOTS: $NSLOTS
  echo All Done.
===== Output from your job =====
Your script should be set up to save your image or data output files as you normally would, i.e. typically in your ''/data/<...>'' directory somewhere in your project tree.
But what happens to the terminal output of your script? That is, the text or error messages your script normally generates and shows on the screen when you run it from the command line? This output is saved to special files for each job in the job's working directory. They look like this:
  [mgstauff@chead ~]$ ls myjobscript.*
  myjobscript.e27657
  myjobscript.o27657

The ''.o'' file holds your job's standard output, the ''.e'' file holds its error output, and the number is the job ID.
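If you'd rather collect these files somewhere other than the job's working directory, the standard ''qsub'' options ''-o'' and ''-e'' redirect them. A sketch, where the ''logs/'' directory is a made-up path for illustration (it must already exist):

```shell
# Send the job's stdout and stderr files to an existing logs/ directory.
# -o and -e are standard qsub options; "logs/" is a hypothetical path.
qsub -o logs/ -e logs/ myjobscript
```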
==== Checking How Busy the Cluster Is ====
=== cfn-resources ===

The best way to check resources is to run this command:

  cfn-resources

to get a list of the resources available on the default queue. To check a different queue, pass its name as an argument:

  cfn-resources himem.q
=== qstat -g c ===
At a lower level, you can run ''qstat -g c'' to get a summary of how busy each queue is.
A more detailed view of each node, including slot and memory usage, can be seen this way:
  qstat -F h_vmem,... -q all.q

This will show the info for the qsub queue, all.q. Replace all.q with another queue name to see its status.
+ | |||
**NOTE** that the ''h_vmem'' values shown are the memory resources available on each node.
+ | |||
You can add an alias to make this easier:

  alias qsF='qstat -F h_vmem,... -q all.q'
----
__**However, ...**__
=== Java Memory Issues ===

Java likes to allocate lots of RAM. You usually have to limit its memory use explicitly. [[java|Click here for details.]]
==== Jobs on chead ====
If you're running something directly on chead, there are different limits. [[clusterbasics|Click here for details.]]
==== Per-job memory limit ====
There is a limit of 30GB per job at this point for jobs running on the default queue, ''all.q''. See the notes on the himem.q queue on this page if your job uses more memory.
NOTE that if you request this much memory, you might have to wait for a node to become free since this means using most of a node's memory resources, and your job might be slowed along with other jobs on the node because memory swap space will most likely be used.
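Memory is requested through the ''h_vmem'' resource shown in the ''qstat -F'' example earlier on this page. A sketch (the 8G figure and the script name are illustrative values, not site requirements):

```shell
# Request 8 GB of memory for the job via the h_vmem resource.
# -l h_vmem=... is standard SGE syntax; 8G and myjobscript are
# example values for illustration.
qsub -l h_vmem=8G myjobscript
```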
If you have a lot of jobs to run, it's usually better to run them single-threaded. You'll run more of them at once, and in the end all of them will complete sooner. And when the cluster is busy, you'll spend less time waiting for a compute core with the requested number of cores available. So if you've submitted more jobs than you have slots in your quota, you're better off running them single-threaded.
__To modify queued jobs__, you can use the ''qalter'' command.

__The exception__ is when you have just a single job or a small number of jobs to run; then requesting multiple cores can make sense.
And with a single or small number of jobs, you should have a decent idea of whether it will run faster with multiple cores before you ask for them, especially if you run this kind of job periodically. You can run once with 1 core, then once with 4 cores and compare the time it takes (you can add the ''time'' command in front of your command to measure this).
In your ''qsub'' session, the environment variable ''NSLOTS'' is set to the number of slots assigned to your job.

Use this variable in your scripts/commands so they use only as many threads as you have slots.
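A minimal BASH sketch of that pattern (the fallback to 1 is our addition so the script also runs outside a qsub session; ''mytool'' is a hypothetical program name):

```shell
#!/bin/bash
# NSLOTS is set by SGE inside a qsub job; fall back to 1 elsewhere
# (the :-1 default is our addition, not something SGE provides).
NTHREADS=${NSLOTS:-1}
echo "Running with ${NTHREADS} thread(s)"
# Hypothetical multi-threaded tool invocation:
# mytool --threads "${NTHREADS}" input.dat
```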
__NOTE__ Because the -V option will pass your environment variables to your qsub sessions, be careful what value you set for ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS. If it does not match the number of slots you're requesting for qsub, threading will not work properly and performance will decrease.
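One way to keep the two in sync is to derive the ITK setting from ''NSLOTS'' inside the job script itself, rather than inheriting a stale value via -V. A sketch (the fallback to 1 is our addition for use outside a qsub session):

```shell
# Match ITK's thread count to the slots SGE assigned this job.
# ${NSLOTS:-1} falls back to 1 outside a qsub session (our addition).
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=${NSLOTS:-1}
echo "ITK threads: ${ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS}"
```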
+ | |||
=== Limiting threads in OMP-based apps like FSL ===

The default environment is set up to include

  export OMP_NUM_THREADS=${NSLOTS}
  export OMP_THREAD_LIMIT=${NSLOTS}

which limits OMP-based apps (like FSL) to using only as many threads as you have slots.
=== Limiting threads in Matlab ===