Changes between Version 2 and Version 3 of discovery
- Timestamp:
- 01/07/22 13:21:03 (3 years ago)
discovery
--- discovery (v2)
+++ discovery (v3)
  User USERNAME
  }}}
- and replace `USERNAME` by your discovery username (which should be your Dartmouth NetID).
+ and replace `USERNAME` by your Discovery username (which should be your Dartmouth NetID).

- Once this is done, you can ssh discovery by simply doing:
+ Once this is done, you can ssh Discovery by simply doing:

  {{{
  …

  == Password-less ssh ==

  Once you have the account, you can set up public key authentication to avoid entering your password for each run.
  You need an SSH public/private key pair. If you do not have one, you can create it by typing the following command and following the prompts (no passphrase necessary):
  …
  }}}

- Two files were created: your private key `/Users/username/.ssh/id_rsa`, and the public key `/Users/username/.ssh/id_rsa.pub`. The private key is read-only and only for you, it is used to decrypt all correspondence encrypted with the public key. The contents of the public key need to be copied to `~/.ssh/authorized_keys` on your discovery account:
+ Two files were created: your private key `/Users/username/.ssh/id_rsa`, and the public key `/Users/username/.ssh/id_rsa.pub`. The private key is read-only and only for you, it is used to decrypt all correspondence encrypted with the public key. The contents of the public key need to be copied to `~/.ssh/authorized_keys` on your Discovery account:

  {{{
  …
  }}}

- Now on '''discovery''', copy the content of id_rsa.pub:
+ Now on Discovery, copy the content of id_rsa.pub:

  {{{
  …
  }}}

+ **Note**: Discovery officially recommends GSSAPI for passwordless access, see [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=89203|here]]. However, as long as you have set up the RSA key pair on Discovery and your local machine, you no longer need to enter your NetID password.
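The key-copy step described above can be sketched as a small shell function. This is a minimal sketch, not the page's own procedure: the function name `install_pubkey` and its two arguments are hypothetical, and it assumes the public key file holds a single line. It appends the key only if it is not already present and sets the restrictive permissions that sshd normally expects.

```sh
#!/bin/sh
# Hypothetical helper: install_pubkey PUBKEY_FILE AUTH_FILE
# Appends the key in PUBKEY_FILE to AUTH_FILE unless that exact line is already there.
install_pubkey() {
    pubkey_file=$1
    auth_file=$2
    auth_dir=$(dirname "$auth_file")

    # The .ssh directory and authorized_keys must not be writable by others
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir"
    touch "$auth_file" && chmod 600 "$auth_file"

    # Assumption: the public key file contains exactly one line
    key=$(cat "$pubkey_file")

    # -F fixed string, -x whole-line match, -q quiet: append only when the key is missing
    if ! grep -qxF -e "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Example call on the cluster, after copying id_rsa.pub there:
#   install_pubkey ~/id_rsa.pub ~/.ssh/authorized_keys
```

Running the function twice leaves a single copy of the key in place. Where it is available, the OpenSSH tool `ssh-copy-id` automates the same step from the local machine.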
+
+
  == Environment ==

- On discovery, add the following lines to `~/.bashrc`:
+ On Discovery, add the following lines to `~/.bashrc`:
  {{{
  #!sh
  …
  }}}

- ''Log out and log back in'' to apply this change.
+ Use:
+ {{{
+ #!sh
+ source ~/.bashrc
+ }}}
+ or ''Log out and log back in'' to apply this change.

- == Installing ISSM on discovery ==
+ == Installing ISSM on Discovery ==

- discovery will ''only'' be used to run the code, you will use your local machine for pre and post processing, you will never use discovery's MATLAB. You can check out ISSM and install the following packages:
+ Discovery will ''only'' be used to run the code, you will use your local machine for pre and post-processing, you will never use Discovery's MATLAB. You can check out ISSM and install the following packages:
   - PETSc 3.15 (use the discovery script)
   - m1qn3
  …
  == discovery_settings.m ==

- Discovery staff ask that no "serious work" should be done on your home directory, you should create an execution directory as `/pub/$USERNAME/execution`.
-
  You have to add a file in `$ISSM_DIR/src/m` entitled `discovery_settings.m` with your personal settings on your local ISSM install:

  {{{
  #!m
- cluster.login='mmorligh';
- cluster.port=8000;
- cluster.queue='pub64';
- cluster.codepath='/data/users/mmorligh/trunk-jpl/bin/';
- cluster.executionpath='/data/users/mmorligh/trunk-jpl/execution/';
+ cluster.login='yourNetID';
+ cluster.codepath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/bin/';
+ cluster.executionpath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/execution/';
  }}}

- use your username for the `login` and enter your code path and execution path. These settings will be picked up automatically by matlab when you do `md.cluster= discovery()`
+ use your NetID for the `login` and enter your code path and execution path.
+ These settings will be picked up automatically by matlab when you do `md.cluster= discovery()`

- == Running jobs on discovery ==
+ The file system on Discovery is called DartFS (or DartFS-hpc). Your home directory on DartFS is only 50GB, so it is better to use the lab folder, which has 1TB:
+ {{{
+ #!sh
+ /dartfs/rc/lab/I/ICE/yourpath/
+ }}}
+ Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=64619]]

- On discovery, you can use up to 64 cores per node. The more nodes and the longer the requested time, the more you will have to wait in the queue. So choose your settings wisely:
+ == Running jobs on Discovery ==
+
+ On Discovery, you can use up to 64 cores per node. The more nodes and the longer the requested time, the more you will have to wait in the queue. So choose your settings wisely:

  {{{
  …
  }}}

- The list of available queues is `'pub64','free64','free48','free*,pub64'` and `'free*'`.

- to have a job of 8 cores on one node. If the run lasts longer than 10 minutes, it will be killed and you will not be able to retrieve your results.
+ to have a job of 8 cores on one node.
+ See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]]

- Now if you want to check the status of your job and the queue you are using, type in the bash with the discovery session:
+ There is no specific time limit on Discovery; however, jobs longer than 10 hours may need credentials for the DartFS system.
+ Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=76691]]

+
+ Now if you want to check the status of your job and the queue you are using, type in the bash with the Discovery session:
  {{{
  #!sh
- qstat -u USERNAME
+ squeue -u username
  }}}

  …
  {{{
  #!sh
- qdel JOBID
+ scancel JOBID
  }}}

- where JOBID is the ID of your job (indicated in the MATLAB session). MATLAB indicates too the directory of your job where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the informations that would appear if you were running your job on your local machine and the errlog file contains the error information in case the job encounters an error.
+ where `JOBID` is the ID of your job (indicated in the Matlab session). Matlab indicates too the directory of your job where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the information that would appear if you were running your job on your local machine and the errlog file contains the error information in case the job encounters an error.
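Checking the two log files described above can be sketched as a shell function. This is only an illustration: the function name `show_job_logs` and the 20-line tail are assumptions, and the job directory is whatever path Matlab reported for your run.

```sh
#!/bin/sh
# Hypothetical helper: show_job_logs JOB_DIR JOBNAME
# Prints the errlog and returns 1 if it is non-empty; otherwise prints the end of the outlog.
show_job_logs() {
    job_dir=$1
    job_name=$2
    errlog="$job_dir/$job_name.errlog"
    outlog="$job_dir/$job_name.outlog"

    # -s is true only when the file exists and has a size greater than zero
    if [ -s "$errlog" ]; then
        echo "== $job_name.errlog (errors reported) =="
        cat "$errlog"
        return 1
    fi

    if [ -f "$outlog" ]; then
        echo "== last lines of $job_name.outlog =="
        tail -n 20 "$outlog"
    else
        echo "no outlog found yet in $job_dir"
    fi
}

# Example call, with the job directory reported by Matlab:
#   show_job_logs /path/reported/by/matlab JOBNAME
```

The non-zero return value on a non-empty errlog makes the function usable in a script that should stop when a run has failed.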

- If you want to load results from the cluster manually (for example if you have an error due to an internet interruption), you find in the informations Matlab gave you `/home/srebuffi/trunk-jpl/execution//SOMETHING/JOBNAME.lock`, you copy the SOMETHING and you type in Matlab:
+ If you want to load results from the cluster manually (for example if you have an error due to an internet interruption), you find in the information Matlab gave you `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock`, you copy the LAUNCHSTRING and you type in Matlab:

  {{{
  #!m
- md=loadresultsfromcluster(md,'SOMETHING');
+ md=loadresultsfromcluster(md,'LAUNCHSTRING','JOBNAME');
  }}}
+
+ Note: in the case where `md.settings.waitonlock`>0 and you need to load manually (e.g., internet interruption), it is necessary to set `md.private.runtimename=LAUNCHSTRING;` before calling `loadresultsfromcluster`.
+
+ == slurm ==
+
+ A comparison of PBS to slurm commands can be found here: http://slurm.schedmd.com/rosetta.pdf
+
+ An overview of slurm is found here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=132625]]
+
+
+ Useful commands:
+
+ Graphical overview of Discovery usage:
+ {{{
+ sview
+ }}}
+
+ Get number of idle nodes:
+ {{{
+ sinfo --states=idle
+ }}}
+
+ See jobs of <username>:
+ {{{
+ squeue -u <username>
+ }}}
+
+ Get more information on jobs of user:
+ {{{
+ sacct -u <username> --format=User,JobID,account,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
+ }}}
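The `squeue` listing above can be condensed into a per-state job count. The sketch below assumes the standard `squeue` options `-h` (no header) and `-o '%T'` (print only the job state); the function name `count_states` is hypothetical. It reads one state per line from standard input, so it can be tried without access to the cluster.

```sh
#!/bin/sh
# Hypothetical helper: count_states
# Reads one job state per line on stdin and prints "STATE COUNT", sorted by state name.
count_states() {
    # NF skips empty lines; n[] accumulates the number of jobs seen per state
    awk 'NF { n[$1]++ } END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

# Example call on the cluster:
#   squeue -h -u <username> -o '%T' | count_states
```

This gives a quick view of how many of your jobs are, for example, pending versus running, without scanning the full `squeue` table.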