Changes between Version 14 and Version 15 of discovery


Timestamp:
04/26/23 13:53:19 (22 months ago)
Author:
badgeley
Comment:

--

  • discovery

    v14 v15  
    181181
    182182to have a job of 8 cores on one node.
    183 See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]]
     183See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]].
     184
     185If you want to use more than one node, the current (temporary) solution is to:\\
     1861) start the job\\
     1872) go to Discovery and see which nodes your job is using (see `squeue` usage below)\\
     1883) cancel the job (see `scancel` usage below)\\
     1894) find the .queue script for your run and manually edit the start of the mpirun command to look like:
     190{{{
     191#!sh
     192mpirun -n 40 --hosts $NODELIST
     193}}}
     194where `$NODELIST` is the list of nodes separated by commas (e.g., `q03,q09`).\\
     1955) restart your run with:
     196{{{
     197#!sh
     198sbatch <filename>.queue
     199}}}
     200If you do not do this, then your job will run on just one node.
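The node list in step 4 can also be built on the command line instead of typed by hand. The following is only a minimal sketch, assuming Slurm's `squeue` and `scontrol` are available on Discovery; the function names `join_hosts` and `job_nodelist` are hypothetical helpers, not part of ISSM or Slurm.

```shell
#!/bin/sh
# Hypothetical helper: join newline-separated hostnames read from stdin
# into one comma-separated list (e.g. "q03,q09").
join_hosts() {
    paste -sd, -
}

# Hypothetical helper: print the comma-separated node list of a Slurm job.
# $1 is the job ID. squeue's %N field prints the (possibly compressed)
# node list, and "scontrol show hostnames" expands it to one host per line.
job_nodelist() {
    scontrol show hostnames "$(squeue -h -j "$1" -o '%N')" | join_hosts
}

# Example use on the cluster, before cancelling the job
# (replace JOBID with the ID of your job):
#   NODELIST=$(job_nodelist JOBID)
#   echo "$NODELIST"
```

The printed value is what would be substituted for `$NODELIST` in the edited `mpirun` line of the .queue script.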
    184201
    185202There is no specific time limit on Discovery; however, jobs longer than 10 hours may need credentials for the DartFS system.
     
    202219where `JOBID` is the ID of your job (indicated in the Matlab session). Matlab also indicates the directory of your job, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the output that would appear if you were running your job on your local machine, and the errlog file contains the error information if the job encounters an error.
    203220
    204 To get more information about your job while it's running, from Discovery you can `ssh` into the node given by `squeue -u username` and then run `htop`.
     221To get more information about your job while it's running, from Discovery you can `ssh` into the node given by `squeue -u username` and then run `htop`. Once in `htop`, if you want to see information for a specific user, type `u` and then start typing the user ID until the correct one is highlighted and hit ENTER.
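The lookup described above can be scripted. This is a rough sketch, assuming Slurm's `squeue` is on the path; `user_nodes` is a hypothetical helper name, and `NODE` and `USERNAME` in the comments are placeholders to fill in.

```shell
#!/bin/sh
# Hypothetical helper: print the node list(s) of a user's running jobs,
# one entry per line, without duplicates. $1 is the user name.
user_nodes() {
    squeue -h -u "$1" -t RUNNING -o '%N' | sort -u
}

# Example use from a Discovery login shell:
#   user_nodes USERNAME
# Then open htop on one of the printed nodes, already filtered to that
# user (htop's -u option restricts the view to one user's processes):
#   ssh -t NODE htop -u USERNAME
```

Starting `htop` with `-u` has the same effect as typing `u` and selecting the user inside `htop`.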
    205222
    206223If you want to load results from the cluster manually (for example, if you hit an error due to an internet interruption), find `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock` in the information Matlab gave you, copy the LAUNCHSTRING, and type in Matlab: