Changes between Version 14 and Version 15 of discovery
Timestamp: 04/26/23 13:53:19 (22 months ago)
discovery
--- v14
+++ v15
@@ -181,5 +181,22 @@
 
 to have a job of 8 cores on one node.
-See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]]
+See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]].
+
+If you want to use more than one node, the current (temporary) solution is to:\\
+1) start the job\\
+2) log in to Discovery and see which nodes your job is using (see `squeue` usage below)\\
+3) cancel the job (see `scancel` usage below)\\
+4) find the .queue script for your run and manually edit the start of the mpirun command to look like:
+{{{
+#!sh
+mpirun -n 40 --hosts $NODELIST
+}}}
+where `$NODELIST` is the list of nodes separated by commas (e.g., `q03,q09`).\\
+5) restart your run with:
+{{{
+#!sh
+sbatch <filename>.queue
+}}}
+If you do not do this, then your job will run on just one node.
 
 There is no specific time limit on Discovery; however, jobs longer than 10 hours may need credentials for the DartFS system.
@@ -202,5 +219,5 @@
 where `JOBID` is the ID of your job (indicated in the Matlab session). Matlab also indicates the directory of your job, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the information that would appear if you were running your job on your local machine, and the errlog file contains the error information in case the job encounters an error.
 
-To get more information about your job while it's running, from Discovery you can `ssh` into the node given by `squeue -u username` and then run `htop`.
+To get more information about your job while it's running, from Discovery you can `ssh` into the node given by `squeue -u username` and then run `htop`. Once in `htop`, to see information for a specific user, type `u`, start typing the user ID until the correct one is highlighted, and press ENTER.
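Steps 2 and 4 of the multi-node workaround in the first hunk can be partly scripted. The following is only a minimal sketch, not part of the wiki page or of ISSM: `join_nodes` and `add_hosts` are hypothetical helper names, and the sketch assumes that the mpirun line in the .queue script begins with `mpirun -n <N>`, which may not match every generated script.

```sh
#!/bin/sh
# Hypothetical helpers (illustration only, not part of ISSM or Slurm).

# join_nodes: read one hostname per line on stdin and print them
# as a single comma-separated list (e.g. q03,q09).
join_nodes() {
    paste -s -d, -
}

# add_hosts: print the .queue script with "--hosts <list>" inserted
# right after "mpirun -n <N>"; an existing --hosts value is replaced.
#   $1 = comma-separated node list, $2 = path to the .queue script
# Assumption: the mpirun command starts with "mpirun -n <N>".
add_hosts() {
    sed -E "s/^([[:space:]]*mpirun -n [0-9]+)( --hosts [^ ]+)?/\1 --hosts $1/" "$2"
}

# On Discovery (requires Slurm), the node list of a running job could be
# obtained like this, where JOBID is the ID of your job:
#   NODELIST=$(scontrol show hostnames "$(squeue -h -j "$JOBID" -o %N)" | join_nodes)
# and the edited script written to a new file before resubmitting:
#   add_hosts "$NODELIST" run.queue > run_multinode.queue
```

The helper writes the edited script to standard output instead of changing the file in place, so the original .queue script stays available for comparison before running `sbatch`. The file names `run.queue` and `run_multinode.queue` are placeholders.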
 
 If you want to load results from the cluster manually (for example, if you have an error due to an internet interruption), find `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock` in the information Matlab gave you, copy the LAUNCHSTRING, and type in Matlab: