Changes between Version 16 and Version 17 of discovery


Timestamp:
04/28/23 13:00:00 (2 years ago)
Author:
badgeley
Comment:

--

  • discovery

    v16 v17  
    179179}}}
    180180
    181 
    182181to have a job of 8 cores on one node.
    183182See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]].
     183
     184Each node has its own time limit for jobs run from the queue; these limits tend to be 10 or 30 days.
     185To find the time limit of each node, run the following on Discovery:
     186{{{
     187#!sh
     188sinfo
     189}}}
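The default `sinfo` table can be long. As a minimal sketch, the output can be narrowed to partition names and their time limits using the `%P` and `%l` format fields. The helper name `partition_limits` is hypothetical, and the sketch assumes `sinfo` is on your `PATH` on Discovery:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `sinfo -o "%P %l"` output on stdin and prints
# one "PARTITION TIMELIMIT" pair per partition.
partition_limits() {
    # Skip the header row, strip the "*" that marks the default partition,
    # then drop duplicate rows.
    awk 'NR > 1 { sub(/\*$/, "", $1); print $1, $2 }' | sort -u
}

# Only call sinfo where Slurm is actually installed (e.g. on Discovery).
if command -v sinfo >/dev/null 2>&1; then
    sinfo -o "%P %l" | partition_limits
fi
```

Limits are printed in Slurm's `days-hours:minutes:seconds` form, so a 10-day limit appears as `10-00:00:00`.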
     190If you are running something interactively on Discovery, the DartFS system may impose a credential limit of 10 hours.
     191Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=76691]]
     192
     193To check the status of your job and the node it is using, run the following in a bash shell on Discovery:
     194 {{{
     195#!sh
     196squeue -u username
     197}}}
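If you only need the job ID and the node of your running jobs, the `squeue` output can be filtered. This is a sketch under stated assumptions: the helper name `running_job_nodes` is hypothetical, `$USER` is assumed to match your Discovery username, and the format fields `%i` (job ID), `%T` (state) and `%N` (node list) are combined with `-h` to suppress the header:

```shell
#!/usr/bin/env bash
# Hypothetical helper: expects lines of the form "JOBID STATE NODELIST" on stdin
# and prints "JOBID NODELIST" for every job whose state is RUNNING.
running_job_nodes() {
    awk '$2 == "RUNNING" { print $1, $3 }'
}

# Only call squeue where Slurm is actually installed (e.g. on Discovery).
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%i %T %N" | running_job_nodes
fi
```

Pending jobs have no node assigned yet, so they are left out of the result.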
     198
     199You can delete your job manually by typing:
     200
     201{{{
     202#!sh
     203scancel JOBID
     204}}}
     205
     206where `JOBID` is the ID of your job (shown in the Matlab session). Matlab also reports the job directory, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the output that would appear if you were running the job on your local machine, and the errlog file contains the error messages if the job fails.
     207
     208To load results from the cluster manually (for example, after an error caused by an internet interruption), look in the information Matlab printed for the path `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock`, copy the LAUNCHSTRING, and type in Matlab:
     209
     210{{{
     211#!m
     212md=loadresultsfromcluster(md,'LAUNCHSTRING','JOBNAME');
     213}}}
     214
     215Note: if `md.settings.waitonlock`>0 and you need to load results manually (e.g., after an internet interruption), you must set `md.private.runtimename=LAUNCHSTRING;` before calling `loadresultsfromcluster`.
     216
     217
     218== Other notes about running on Discovery  ==
    184219
    185220If you want to use more than one node (not recommended), the current (temporary) solution is to:\\
     
    200235If you do not do this, then your job will run on just one node.
    201236
    202 There is no specific time limit on Discovery, however, jobs longer than 10 hours may need credential to DartFS system.
    203 Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=76691]]
    204 
    205 
    206 Now if you want to check the status of your job and the node you are using, type in the bash with the Discovery session:
    207  {{{
    208 #!sh
    209 squeue -u username
    210 }}}
    211 
    212 You can delete your job manually by typing:
    213 
    214 {{{
    215 #!sh
    216 scancel JOBID
    217 }}}
    218 
    219 where `JOBID` is the ID of your job (indicated in the Matlab session). Matlab indicates too the directory of your job where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the information that would appear if you were running your job on your local machine and the errlog file contains the error information in case the job encounters an error.
    220 
    221237To get more information about your job while it's running, from Discovery you can `ssh` into the node given by `squeue -u username` and then run `htop`. Once in `htop`, if you want to see information for a specific user, type `u` and then start typing the user ID until the correct one is highlighted and hit ENTER.
    222238You can also get more information about a job by entering:
     
    226242}}}
    227243
    228 If you want to load results from the cluster manually (for example if you have an error due to an internet interruption), you find in the information Matlab gave you `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock`, you copy the LAUNCHSTRING and you type in Matlab:
    229 
    230 {{{
    231 #!m
    232 md=loadresultsfromcluster(md,'LAUNCHSTRING','JOBNAME');
    233 }}}
    234 
    235 Obs.: in the case where `md.settings.waitonlock`>0 and you need to load manually (e.g., internet interruption), it is necessary to set `md.private.runtimename=LAUNCHSTRING;` before calling `loadresultsfromcluster`.
     244If your job stays in the queue for a long time, there may be several reasons. One thing to try is to add the following line to your `<run_name>.queue` file and restart your job:
     245{{{
     246#!sh
     247#SBATCH --partition preemptable
     248}}}
     249This may give you access to some idle nodes, but note that your job can be stopped if a higher priority job wants your resources.
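Slurm only reads `#SBATCH` lines that appear before the first executable command in a batch script, so the position of the new line matters. The following is a minimal sketch of adding it safely; the helper name `add_preemptable` is hypothetical, and it assumes the first line of the queue file is the shebang:

```shell
#!/usr/bin/env bash
# Hypothetical helper: insert the preemptable-partition directive directly
# after the first line (assumed to be the shebang) of a queue script.
add_preemptable() {
    local queue_file="$1"
    local directive='#SBATCH --partition preemptable'
    # Do nothing if the exact directive is already present.
    if grep -qxF -- "$directive" "$queue_file"; then
        return 0
    fi
    local tmp
    tmp="$(mktemp)" || return 1
    # Copy line 1, add the directive, then copy the rest unchanged.
    awk -v d="$directive" 'NR == 1 { print; print d; next } { print }' \
        "$queue_file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the file keeps its original permissions.
    cat "$tmp" > "$queue_file"
    rm -f "$tmp"
}
```

It would be called as `add_preemptable <run_name>.queue` before resubmitting the job. Calling it twice leaves the file unchanged the second time.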
     250
    236251
    237252== slurm ==