using_ogs_sge

Differences

This shows you the differences between two versions of the page.

using_ogs_sge [2017/05/24 15:19]
mgstauff [Parallel Isn't Always Better!]
using_ogs_sge [2018/03/02 20:35] (current)
mgstauff [Per-job memory limit]
Line 56: Line 56:
 The most common way to use SGE is to run batch jobs via ''qsub''. This is the non-interactive method.
  
-''qsub'' allows you to submit a job defined by a script, and the job scheduler will place your job in the **job queue**, to be run either immediately or when resources open up.
+''qsub'' allows you to submit a job defined by a script (a file that holds a series of commands - basically a program), and the job scheduler will place your job in the **job queue**, to be run either immediately or when resources open up.
 
-You run a batch job like so:
+You run a batch job like so, where ''myjobscript'' is the name of a script that holds some commands. For example, it could be a BASH script, or a PERL script.
 
-  [mgstauff@chead ~]$ qsub myjob
-  Your job 27657 ("myjob") has been submitted
+  [mgstauff@chead ~]$ qsub myjobscript
+  Your job 27657 ("myjobscript") has been submitted
      
-//myjob// is any kind of script file that can be run in a bash shell. The bash shell is the default shell for the front end and compute nodes. It's also the default shell in the old CfN cluster.
+Here's an example BASH script that could be in the file named ''myjobscript'' (you can copy and paste it into a text editor on the cluster to try it yourself):
 
-The output says that your job has been submitted to the queue. It's either running right away or waiting for resources to open up so it can run. The output also gives you the job-ID (27657 in this case).
+  #!/bin/bash
+  echo I am a job running now on $HOSTNAME
+  ZZZ=5
+  echo Sleeping for $ZZZ...
+  sleep $ZZZ
+  echo NSLOTS: $NSLOTS
+  echo All Done.
  
 ===== Output from your job =====
  
-Your script should be setup to save your image or data output files as you normally would, i.e. typically in your /jet directory somewhere in your project tree.
+Your script should be set up to save your image or data output files as you normally would, i.e. typically in your /data/<xyz>/<username> directory somewhere in your project tree.
  
 But what happens to the terminal output of your script? That is, the text or error messages your script normally generates and shows on the screen when you run it from the command line? This output is saved to special files for each job in the job's working directory. They look like this:
  
-  [mgstauff@chead ~]$ ls myjob.*
+  [mgstauff@chead ~]$ ls myjobscript.*
   myjob.e27657
   myjob.o27657
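
As a minimal sketch of how to look at these files, assuming the naming pattern in the listing above (script name, then ''.o'' for standard output or ''.e'' for standard error, then the job-ID), the snippet below builds both file names and prints whichever ones exist in the current directory. The variable names and values are illustrative only; take the real job name and job-ID from the message ''qsub'' prints when you submit.

```shell
#!/bin/bash
# Illustrative values: copy these from the "Your job ... has been submitted" message.
job_name="myjobscript"
job_id="27657"

# Assumed pattern from the listing above: <name>.o<job-ID> and <name>.e<job-ID>.
stdout_file="${job_name}.o${job_id}"
stderr_file="${job_name}.e${job_id}"

# Show each file if it exists in the current directory.
for f in "$stdout_file" "$stderr_file"; do
  if [ -f "$f" ]; then
    echo "== $f =="
    cat "$f"
  else
    echo "$f not found (the job may not have started yet, or ran in another directory)"
  fi
done
```

Run it from the job's working directory, since that is where the files are written.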
Line 414: Line 420:
 __**However, you may not see any message that your job was killed because of a memory limitation.**__ There's a glitch in SGE: when you hit a memory limit, SGE doesn't always notice before the operating system does. If the operating system notices first, your job is killed in a way that prevents SGE from getting a message back to you about what happened, and any exception/error handling in the app will most likely not be able to write its message to your output files before the process is terminated. I hope to be able to find a workaround to this in the future.
    
 +=== Java Memory Issues ===
 +
 +Java likes to allocate lots of RAM. You usually have to limit its memory. [[java|Click here for details.]]
 +
 ==== Jobs on chead ====
 If you're running something directly on chead, there are different limits. [[clusterbasics#don_t_generally_run_programs_on_the_front_end_itself|See here for details.]]
Line 466: Line 476:
  
 ==== Per-job memory limit ====
-There is a limit of 62GB per job at this point. This allows a single ''qlogin'' session to run a large memory job on a single compute node.
+There is a limit of 30GB per job at this point for jobs running on the default queue, 'all.q'. See notes on the himem.q queue on this page if your job uses more memory.
  
 NOTE that if you request this much memory, you might have to wait for a node to become free since this means using most of a node's memory resources, and your job might be slowed along with other jobs on the node because memory swap space will most likely be used.
Line 544: Line 554:
 In your ''qlogin'' session or ''qsub'' job, the ''NSLOTS'' environment variable will be set by SGE to the number of slots you've requested (some scripting in your .bash_profile actually handles ''qlogin'' instances).
  
-Use this variable in your scripts/commands if you need to know how many slots are available for threading. Matlab and ITK apps are setup automatically on the cluster by special handling, see below.
+Use this variable in your scripts/commands if you need to know how many slots are available for threading. The [[matlab_usage|Matlab]], [[mrtrix_usage|MRTrix]] and ITK apps are set up automatically on the cluster by special handling, see below.
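
For apps that are not handled automatically, a job script can read ''NSLOTS'' itself. The sketch below is one way to do that, assuming you also want the script to work when ''NSLOTS'' is unset (for example, when testing outside the scheduler). The app name ''my_threaded_app'' and its ''--threads'' option are hypothetical placeholders, not real cluster software.

```shell
#!/bin/bash
# NSLOTS is set by SGE inside a job. Fall back to 1 when it is unset,
# e.g. when testing the script outside the scheduler.
threads="${NSLOTS:-1}"
echo "Using $threads thread(s)"

# Hypothetical app and option, shown only to illustrate passing the value along:
# my_threaded_app --threads "$threads" input.dat
```

The fallback to 1 is a conservative choice for this sketch, so that a script run without a slot request does not start more threads than it was given.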
  
  
Line 583: Line 593:
  
 __NOTE__ Because the -V option will pass your environment variables to your qsub sessions, be careful what value you set for ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS. If it does not match the number of slots you're requesting for qsub, threading will not work properly and performance will decrease.
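
One way to catch such a mismatch early is a small guard at the top of a job script. This is only a sketch: the function name ''check_itk_threads'' is made up for illustration, and treating an unset ITK variable as "no conflict" is an assumption, not cluster policy.

```shell
#!/bin/bash
# Returns 0 when the ITK thread setting agrees with the slot count, 1 otherwise.
# Assumption of this sketch: an unset ITK variable counts as "no conflict",
# and an unset NSLOTS is treated as 1 slot.
check_itk_threads() {
  local slots="${NSLOTS:-1}"
  local itk="${ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS:-$slots}"
  if [ "$itk" != "$slots" ]; then
    echo "Warning: ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=$itk but NSLOTS=$slots" >&2
    return 1
  fi
  return 0
}

# Warn, but keep going, if the two values disagree.
check_itk_threads || echo "Consider matching the ITK thread count to your slot request." >&2
```

The guard only warns; whether a job should stop on a mismatch is left to the script's author.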
 +
 +=== Limiting threads in OMP-based apps like FSL ===
 +The default environment is set up to include
 +
 +  export OMP_NUM_THREADS=${NSLOTS}
 +  export OMP_THREAD_LIMIT=${NSLOTS}
 +
 +which limits OMP-based apps (like FSL) to use only as many threads as you have slots.
  
 === Limiting threads in Matlab ===
using_ogs_sge.1495639199.txt.gz · Last modified: 2017/05/24 15:19 by mgstauff