Changes between Version 2 and Version 3 of discovery


Timestamp:
01/07/22 13:21:03 (3 years ago)
Author:
Cheng Gong
Comment:

update wiki of Discovery

  • discovery

    v2 v3  
    1212   User USERNAME
    1313}}}
    14 and replace `USERNAME` by your discovery username (which should be your Dartmouth NetID).
     14and replace `USERNAME` by your Discovery username (which should be your Dartmouth NetID).
    1515
    16 Once this is done, you can ssh discovery by simply doing:
     16Once this is done, you can ssh Discovery by simply doing:
    1717
    1818{{{
     
    2222
    2323== Password-less ssh ==
    24 
     24 
    2525Once you have the account, you can set up public key authentication to avoid having to enter your password for each run.
    2626You need an SSH public/private key pair. If you do not have one, you can create it by typing the following command and following the prompts (no passphrase necessary):
     
    3636}}}
    3737
    38 Two files were created: your private key `/Users/username/.ssh/id_rsa`, and the public key `/Users/username/.ssh/id_rsa.pub`. The private key is read-only and only for you, it is used to decrypt all correspondence encrypted with the public key. The contents of the public key need to be copied to `~/.ssh/authorized_keys` on your discovery account:
     38Two files were created: your private key `/Users/username/.ssh/id_rsa` and your public key `/Users/username/.ssh/id_rsa.pub`. The private key is read-only and only for you; it is used to decrypt all correspondence encrypted with the public key. The contents of the public key need to be copied to `~/.ssh/authorized_keys` on your Discovery account:
    3939
    4040{{{
     
    4343}}}
    4444
    45 Now on '''discovery''', copy the content of id_rsa.pub:
     45Now on Discovery, copy the content of id_rsa.pub:
    4646
    4747{{{
     
    5151}}}
    5252
     53**Note**: Discovery officially recommends using GSSAPI for passwordless access; see [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=89203|here]]. However, as long as you have set up the RSA key pair on Discovery and your local machine, you no longer need to enter your NetID password.
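The copy step in this section can be sketched as a small shell helper. This is only a sketch and not the procedure from the wiki page: the function name `install_pubkey` and both of its arguments are hypothetical, and it assumes the public key file has already been transferred to Discovery. It appends the key only if it is not already present and sets the restrictive permissions that SSH servers commonly expect.

```shell
#!/bin/sh
# install_pubkey PUBKEY_FILE AUTH_FILE
# Hypothetical helper (not part of ISSM or Discovery): append a public key
# to an authorized_keys file unless the same line is already there.
install_pubkey() {
    pubkey_file=$1
    auth_file=$2
    auth_dir=$(dirname "$auth_file")

    # Create the directory and file if needed, readable by the owner only.
    mkdir -p "$auth_dir"
    chmod 700 "$auth_dir"
    touch "$auth_file"

    # -F: fixed strings, -x: whole-line match, -f: read patterns from file.
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}
```

On Discovery, this could be called as `install_pubkey ~/id_rsa.pub ~/.ssh/authorized_keys`, where `~/id_rsa.pub` is a placeholder for wherever you copied the key. Running it twice leaves a single copy of the key in the file.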
     54
     55
    5356== Environment ==
    5457
    55 On discovery, add the following lines to `~/.bashrc`:
     58On Discovery, add the following lines to `~/.bashrc`:
    5659{{{
    5760#!sh
     
    6568}}}
    6669
    67 ''Log out and log back in'' to apply this change.
     70Use:
     71{{{
     72#!sh
     73source ~/.bashrc
     74}}}
     75or ''log out and log back in'' to apply this change.
    6876
    69 == Installing ISSM on discovery ==
     77== Installing ISSM on Discovery ==
    7078
    71 discovery will ''only'' be used to run the code, you will use your local machine for pre and post processing, you will never use discovery's MATLAB. You can check out ISSM and install the following packages:
     79Discovery will ''only'' be used to run the code; you will use your local machine for pre- and post-processing and will never use Discovery's MATLAB. You can check out ISSM and install the following packages:
    7280 - PETSc 3.15 (use the discovery script)
    7381 - m1qn3
     
    93101== discovery_settings.m ==
    94102
    95 Discovery staff ask that no "serious work" should be done on your home directory, you should create an execution directory as `/pub/$USERNAME/execution`.
    96 
    97103You have to add a file in `$ISSM_DIR/src/m` entitled `discovery_settings.m` with your personal settings on your local ISSM install:
    98104
    99105{{{
    100106#!m
    101 cluster.login='mmorligh';
    102 cluster.port=8000;
    103 cluster.queue='pub64';
    104 cluster.codepath='/data/users/mmorligh/trunk-jpl/bin/';
    105 cluster.executionpath='/data/users/mmorligh/trunk-jpl/execution/';
     107cluster.login='yourNetID';
     108cluster.codepath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/bin/';
     109cluster.executionpath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/execution/';
    106110}}}
    107111
    108 use your username for the `login` and enter your code path and execution path. These settings will be picked up automatically by matlab when you do `md.cluster= discovery()`
     112Use your NetID for the `login` and enter your code path and execution path. These settings will be picked up automatically by Matlab when you do `md.cluster= discovery()`.
    109113
    110 == Running jobs on discovery  ==
     114The file system on Discovery is called DartFS (or DartFS-hpc). Your home directory on DartFS is limited to 50GB, so it is better to use the lab folder, which has 1TB:
     115{{{
     116#!sh
     117/dartfs/rc/lab/I/ICE/yourpath/
     118}}}
     119Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=64619]]
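A minimal sketch of preparing the execution directory in the lab folder is given here. It assumes the `trunk-jpl/execution` layout used in `discovery_settings.m`; the helper name `make_execution_dir` is hypothetical, and whether the directory must exist before the first run is an assumption rather than something stated on this page.

```shell
#!/bin/sh
# make_execution_dir LAB_DIR
# Hypothetical helper: create LAB_DIR/trunk-jpl/execution (the layout used
# in discovery_settings.m) and print the resulting path.
make_execution_dir() {
    lab_dir=$1
    exec_dir="$lab_dir/trunk-jpl/execution"
    mkdir -p "$exec_dir" || return 1
    printf '%s\n' "$exec_dir"
}
```

For example, `make_execution_dir /dartfs/rc/lab/I/ICE/yourpath` prints the path to use for `cluster.executionpath`. Running `du -sh` on that path afterwards shows how much of the lab quota your runs occupy.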
    111120
    112 On discovery, you can use up to 64 cores per node. The more nodes and the longer the requested time, the more you will have to wait in the queue. So choose your settings wisely:
     121== Running jobs on Discovery  ==
     122
     123On Discovery, you can use up to 64 cores per node. The more nodes and the longer the requested time, the more you will have to wait in the queue. So choose your settings wisely:
    113124
    114125 {{{
     
    117128}}}
    118129
    119 The list of available queues is `'pub64','free64','free48','free*,pub64'` and `'free*'`.
    120130
    121 to have a job of 8 cores on one node. If the run lasts longer than 10 minutes, it will be killed and you will not be able to retrieve your results.
     131to have a job of 8 cores on one node.
     132See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]]
    122133
    123 Now if you want to check the status of your job and the queue you are using, type in the bash with the discovery session:
     134There is no specific time limit on Discovery; however, jobs longer than 10 hours may need credentials for the DartFS system.
     135Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=76691]]
    124136
     137
     138To check the status of your job and the queue you are using, type the following in the bash shell of your Discovery session:
    125139 {{{
    126140#!sh
    127 qstat -u USERNAME
     141squeue -u username
    128142}}}
    129143
     
    132146{{{
    133147#!sh
    134 qdel JOBID
     148scancel JOBID
    135149}}}
    136150
    137 where JOBID is the ID of your job (indicated in the MATLAB session). MATLAB indicates too the directory of your job where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the informations that would appear if you were running your job on your local machine and the errlog file contains the error information in case the job encounters an error.
     151where `JOBID` is the ID of your job (indicated in the Matlab session). Matlab also indicates the directory of your job, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the information that would appear if you were running your job on your local machine, and the errlog file contains the error information in case the job encounters an error.
    138152
    139 If you want to load results from the cluster manually (for example if you have an error due to an internet interruption), you find in the informations Matlab gave you `/home/srebuffi/trunk-jpl/execution//SOMETHING/JOBNAME.lock `, you copy the SOMETHING and you type in Matlab:
     153If you want to load results from the cluster manually (for example, after an error due to an internet interruption), look in the information Matlab gave you for `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock`, copy the LAUNCHSTRING, and type in Matlab:
    140154
    141155{{{
    142156#!m
    143 md=loadresultsfromcluster(md,'SOMETHING');
     157md=loadresultsfromcluster(md,'LAUNCHSTRING','JOBNAME');
    144158}}}
     159
     160Note: when `md.settings.waitonlock`>0 and you need to load results manually (e.g., after an internet interruption), you must set `md.private.runtimename=LAUNCHSTRING;` before calling `loadresultsfromcluster`.
     161
     162== slurm ==
     163
     164A comparison of PBS to slurm commands can be found here: http://slurm.schedmd.com/rosetta.pdf
     165
     166An overview of slurm is found here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=132625]]
     167
     168
     169Useful commands:
     170
     171Graphical overview of Discovery usage:
     172{{{
     173sview
     174}}}
     175
     176Get number of idle nodes:
     177{{{
     178sinfo --states=idle
     179}}}
     180
     181See jobs of <username>:
     182{{{
     183squeue -u <username>
     184}}}
     185
     186Get more information on jobs of user:
     187{{{
     188sacct -u <username> --format=User,JobID,account,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
     189}}}
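The commands above can be combined into a small polling loop. The sketch below uses only `squeue` with its `-h` (no header) and `-j` (job ID) options; the helper name `wait_for_job` and the default 30-second interval are hypothetical choices and not part of Slurm or ISSM.

```shell
#!/bin/sh
# wait_for_job JOBID [INTERVAL_SECONDS]
# Hypothetical helper: block until JOBID no longer appears in squeue.
wait_for_job() {
    jobid=$1
    interval=${2:-30}
    # -h suppresses the header line and -j restricts output to one job,
    # so empty output means the job is no longer listed in the queue.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid has left the queue"
}
```

Once the loop returns, `sacct -j JOBID --format=JobID,State,ExitCode` reports whether the job completed or failed. Keep the polling interval generous so that the scheduler is not queried more often than necessary.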