Changes between Initial Version and Version 1 of andes


Timestamp:
05/01/24 15:52:39 (18 months ago)
Author:
Mathieu Morlighem
== Getting an account ==

Go to [https://rc.dartmouth.edu/index.php/discoveryhpc/]; you will need a Dartmouth NetID. If applicable, make sure you are added to the ICE DartFS Lab share and the ice Slurm account, and ask for the ice Slurm account to be made your default.

== ssh configuration ==

You can add the following lines to `~/.ssh/config` on your local machine:
{{{
#!sh
Host andes
   Hostname andes8.dartmouth.edu
   User USERNAME
}}}
Replace `USERNAME` with your Andes username (which should be your Dartmouth NetID). Once this is done, you can ssh to Andes by simply entering:

{{{
#!sh
ssh andes
}}}
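If you prefer to script this step, the following is a minimal sketch, not an official procedure. The helper name `add_andes_host` is hypothetical, and the demonstration writes to a scratch file rather than your real `~/.ssh/config`:

```shell
#!/bin/bash
# Sketch: append the "Host andes" block to an ssh config file only if it is
# not already there. add_andes_host is a hypothetical helper name and
# USERNAME is a placeholder for your NetID.
add_andes_host() {
    local config_file="$1" username="$2"
    # Do nothing if a "Host andes" entry is already present.
    if [ -f "$config_file" ] && grep -q '^Host andes$' "$config_file"; then
        return 0
    fi
    cat >> "$config_file" <<EOF
Host andes
   Hostname andes8.dartmouth.edu
   User $username
EOF
}

# Demonstration on a scratch file instead of the real ~/.ssh/config.
demo_config="$(mktemp)"
add_andes_host "$demo_config" USERNAME
add_andes_host "$demo_config" USERNAME   # second call is a no-op
host_entries="$(grep -c '^Host andes$' "$demo_config")"
echo "Host andes entries: $host_entries"
```

To use it for real, you would pass `~/.ssh/config` and your NetID instead of the scratch file and the placeholder.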

== Password-less ssh ==

Discovery **officially** suggests using `GSSAPI` for passwordless access, see [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=89203|here]].

On your local machine, you will need to enter:
{{{
#!sh
kinit -f -l 7d username@KIEWIT.DARTMOUTH.EDU
}}}
replacing `username` with your NetID. Enter your NetID password when prompted to request a ticket valid for 7 days (or any time period you need); you can then use {{{ssh andes}}} without entering a password.
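To check whether a ticket is still valid before connecting, something like the sketch below may help. It assumes MIT Kerberos, where `klist -s` prints nothing and exits with status 0 only when a valid ticket is cached; the function name `ticket_status` is hypothetical:

```shell
#!/bin/bash
# Sketch: report whether a valid Kerberos ticket is cached.
# Assumption: MIT Kerberos klist, where "-s" is silent and the exit status
# tells whether the credentials cache holds a valid (non-expired) ticket.
# ticket_status is a hypothetical helper name.
ticket_status() {
    if ! command -v klist >/dev/null 2>&1; then
        echo "no-klist"
    elif klist -s 2>/dev/null; then
        echo "valid"
    else
        echo "missing"
    fi
}

ticket_state="$(ticket_status)"
case "$ticket_state" in
    valid)    echo "Ticket found: ssh should not ask for a password." ;;
    missing)  echo "No valid ticket: run kinit again." ;;
    no-klist) echo "klist not found: Kerberos tools are not installed." ;;
esac
```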

== Environment ==

On Discovery, add the following lines to `~/.bashrc` (replace `PATHTOTRUNK` with the path to your ISSM checkout):
{{{
#!sh
export ISSM_DIR=PATHTOTRUNK
source $ISSM_DIR/etc/environment.sh

# load modules
module purge
module load gcc/9.3.1
module load cmake/3.10.1
}}}

Use:
{{{
#!sh
source ~/.bashrc
}}}
or ''log out and log back in'' to apply this change.
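If ISSM is not checked out yet, sourcing a missing `environment.sh` prints an error at every login. One possible guard is sketched below; `load_issm_env` and `ISSM_ENV_LOADED` are hypothetical names, and the demonstration uses a scratch directory standing in for `PATHTOTRUNK`:

```shell
#!/bin/bash
# Sketch: only set ISSM_DIR and source the environment file when it exists.
# load_issm_env is a hypothetical helper name.
load_issm_env() {
    local issm_dir="$1"
    if [ -f "$issm_dir/etc/environment.sh" ]; then
        export ISSM_DIR="$issm_dir"
        . "$ISSM_DIR/etc/environment.sh"
    else
        echo "ISSM environment not found under $issm_dir" >&2
        return 1
    fi
}

# Demonstration: a scratch directory stands in for PATHTOTRUNK, and the
# stand-in environment.sh only sets a made-up marker variable.
demo_trunk="$(mktemp -d)"
mkdir -p "$demo_trunk/etc"
echo 'export ISSM_ENV_LOADED=yes' > "$demo_trunk/etc/environment.sh"
load_issm_env "$demo_trunk" && echo "loaded: $ISSM_ENV_LOADED"
load_issm_env "$demo_trunk/missing" || echo "skipped missing checkout"
```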

== Installing ISSM on Andes ==

Andes will ''only'' be used to run the code: you will use your local machine for pre- and post-processing, and you will never use Andes's MATLAB. You can check out ISSM and install the following packages:
 - PETSc 3.21 (use the Andes script, `install-3.21-andes.sh`)
 - m1qn3
Follow the detailed instructions for compiling ISSM: [[https://issm.jpl.nasa.gov/download/unix/]]

Use the following configuration script (adapt to your needs):

{{{
#!sh
export CC=mpicc
export CXX=mpicxx
export FC=mpifort
./configure \
   --prefix=$ISSM_DIR \
   --with-wrappers=no \
   --with-petsc-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-m1qn3-dir="$ISSM_DIR/externalpackages/m1qn3/install" \
   --with-mpi-include="$ISSM_DIR/externalpackages/petsc/install/include" \
   --with-mpi-libflags="-L$ISSM_DIR/externalpackages/petsc/install/lib -lmpi -lmpifort" \
   --with-metis-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-blas-lapack-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-scalapack-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-mumps-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-fortran-lib="-L/usr/lib64/ -lgfortran" \
   --with-cxxoptflags="-g -O3 -std=c++11" \
   --enable-development
}}}
It is highly recommended to use a batch or interactive job to compile ISSM, since the login node has very limited computational resources.

To request resources for an interactive job:
{{{
#!sh
srun --nodes=1 --ntasks-per-node=16 --pty /bin/bash
}}}
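For the batch alternative, a compile job could look roughly like the sketch below. The script name `compile_issm.sh`, the job name, and the one-hour time limit are illustrative choices, and the sketch assumes `./configure` has already been run and that `ISSM_DIR` is set in the submitting shell:

```shell
#!/bin/bash
# Sketch: write a batch script that compiles ISSM on a compute node.
# The file name, job name and time limit are illustrative assumptions.
build_script="$(mktemp -d)/compile_issm.sh"
cat > "$build_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=issm-build
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00

# Assumes ./configure was already run in $ISSM_DIR.
cd "$ISSM_DIR" || exit 1
make -j 16
make install
EOF

# Check the generated script for shell syntax errors before submitting it.
bash -n "$build_script" && echo "wrote $build_script"
# On the cluster, submit it with: sbatch compile_issm.sh
```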
== Installing ISSM with CoDiPack (AD) on Discovery ==

You will need to install the following additional packages:
 * codipack
 * medipack

Use the following configuration script (adapt to your needs, and make sure NOT to include `--with-petsc-dir`):
{{{
#!sh
export CC=mpicc
export CXX=mpicxx
export FC=mpifort
./configure \
   --prefix=$ISSM_DIR \
   --with-wrappers=no \
   --without-kriging \
   --without-kml \
   --without-Love \
   --without-Sealevelchange \
   --with-m1qn3-dir="$ISSM_DIR/externalpackages/m1qn3/install" \
   --with-mpi-include="$ISSM_DIR/externalpackages/petsc/install/include" \
   --with-mpi-libflags="-L$ISSM_DIR/externalpackages/petsc/install/lib -lmpi -lmpifort" \
   --with-metis-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-blas-lapack-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-scalapack-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-mumps-dir="$ISSM_DIR/externalpackages/petsc/install" \
   --with-codipack-dir="$ISSM_DIR/externalpackages/codipack/install" \
   --with-medipack-dir="$ISSM_DIR/externalpackages/medipack/install" \
   --with-fortran-lib="-L/usr/lib64/ -lgfortran" \
   --with-cxxoptflags="-g -O2 -fPIC -std=c++11 -DCODI_ForcedInlines -wd2196" \
   --enable-tape-alloc \
   --enable-development \
   --enable-debugging
}}}

== discovery_settings.m ==

You have to add a file in `$ISSM_DIR/src/m` entitled `discovery_settings.m` with your personal settings on your local ISSM install:

{{{
#!m
cluster.login='yourNetID';
cluster.codepath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/bin/';
cluster.executionpath='/dartfs/rc/lab/I/ICE/yourpath/trunk-jpl/execution/';
}}}

Use your NetID for the `login` and enter your code path and execution path. These settings will be picked up automatically by MATLAB when you do `md.cluster=discovery()`.

The file system on Discovery is called DartFS (or DartFS-hpc). Your home directory on DartFS is limited to 50GB, so it is better to use the lab folder, which has 1TB:
{{{
#!sh
/dartfs/rc/lab/I/ICE/yourpath/
}}}
Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=64619]]

== Running jobs on Andes ==

On Andes, you can use up to 64 cores per node. The more nodes and the longer the requested time, the longer you will have to wait in the queue, so choose your settings wisely. For example:

{{{
#!m
md.cluster=discovery('numnodes',1,'cpuspernode',8);
}}}

requests a job with 8 cores on one node.
See cluster details: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=134058]].

Each node has its own time limit for jobs that are run from the queue, but they tend to be 10 or 30 days.
You can find the time limit of each node by entering on Discovery:
{{{
#!sh
sinfo
}}}
If you are running something interactively on Discovery, there may be a credential limit for the DartFS system of 10 hours.
Read more here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=76691]]

To check the status of your job and the node you are using, type in a shell on Discovery:
{{{
#!sh
squeue -u username
}}}

You can delete your job manually by typing:

{{{
#!sh
scancel JOBID
}}}

where `JOBID` is the ID of your job (indicated in the MATLAB session). MATLAB also indicates the directory of your job, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the information that would appear if you were running your job on your local machine, and the errlog file contains the error information in case the job encounters an error.
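To look at both log files in one go, a small helper along these lines may be convenient. This is only a sketch: `show_job_logs` is a hypothetical name, and the demonstration uses a scratch directory with made-up log contents in place of a real job directory:

```shell
#!/bin/bash
# Sketch: print the tail of JOBNAME.outlog and JOBNAME.errlog for one job.
# show_job_logs is a hypothetical helper name.
show_job_logs() {
    local job_dir="$1" job_name="$2" log
    for log in "$job_dir/$job_name.outlog" "$job_dir/$job_name.errlog"; do
        echo "--- $log ---"
        if [ -s "$log" ]; then
            tail -n 20 "$log"
        else
            echo "(empty or missing)"
        fi
    done
}

# Demonstration with made-up log contents in a scratch directory.
demo_dir="$(mktemp -d)"
printf 'step 1 done\nstep 2 done\n' > "$demo_dir/JOBNAME.outlog"
: > "$demo_dir/JOBNAME.errlog"
log_report="$(show_job_logs "$demo_dir" JOBNAME)"
echo "$log_report"
```

On the cluster, the first argument would be the job directory reported by MATLAB.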

If you want to load results from the cluster manually (for example, if you have an error due to an internet interruption), find `$ISSM_DIR/execution/LAUNCHSTRING/JOBNAME.lock` in the information MATLAB gave you, copy the LAUNCHSTRING, and type in MATLAB:

{{{
#!m
md=loadresultsfromcluster(md,'LAUNCHSTRING','JOBNAME');
}}}

Note: if `md.settings.waitonlock`>0 and you need to load results manually (e.g., after an internet interruption), it is necessary to set `md.private.runtimename=LAUNCHSTRING;` before calling `loadresultsfromcluster`.


== Other notes about running on Andes ==

If you want to use more than one node (not recommended), the current (temporary) solution is to:\\
1) start the job\\
2) go to Discovery and see which nodes your job is using (see `squeue` usage above)\\
3) cancel the job (see `scancel` usage above)\\
4) find the .queue script for your run and manually edit the start of the mpirun command to look like:
{{{
#!sh
mpirun -n 40 --hosts $NODELIST
}}}
where `$NODELIST` is the list of nodes separated by commas (e.g., `q03,q09`).\\
5) restart your run with:
{{{
#!sh
sbatch <filename>.queue
}}}
If you do not do this, then your job will run on just one node.
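The comma-separated list for step 4 can be built rather than typed by hand. The sketch below uses sample node names; on the cluster, one hostname per line could instead come from `scontrol show hostnames`, which expands a compact Slurm node list such as `q[03,09]`:

```shell
#!/bin/bash
# Sketch: join one hostname per line into the comma-separated list that
# the edited mpirun line expects.
# On the cluster the names could come from, for example:
#   scontrol show hostnames 'q[03,09]'
# Here two sample names stand in for that output.
node_names="$(printf 'q03\nq09\n')"
NODELIST="$(printf '%s\n' "$node_names" | paste -s -d, -)"
echo "mpirun -n 40 --hosts $NODELIST"
```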

To get more information about your job while it's running, from Andes you can `ssh` into the node given by `squeue -u username` and then run `htop`. Once in `htop`, if you want to see information for a specific user, type `u`, then start typing the user ID until the correct one is highlighted and hit ENTER.
You can also get more information about a job by entering:
{{{
#!sh
scontrol show job JOBID
}}}

If your job is in the queue for a long time, there may be several reasons for this. First, try reducing the amount of time you're requesting. Another thing to try is to add the following line to your `<run_name>.queue` file and restart your job:
{{{
#!sh
#SBATCH --partition preemptable
}}}
This may give you access to some idle nodes, but note that your job can be stopped if a higher-priority job wants your resources.
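Because `#SBATCH` lines must appear before the first command in the script, one safe place for the new line is directly after the first line. A minimal sketch follows, assuming the queue file starts with a shebang line; `add_partition_line` is a hypothetical helper and the demo queue file is made up, not what ISSM actually generates:

```shell
#!/bin/bash
# Sketch: insert the preemptable partition directive as line 2 of a queue
# script, unless it is already there. add_partition_line is a hypothetical
# helper name; it assumes line 1 of the file is the shebang.
add_partition_line() {
    local queue_file="$1" tmp_file
    grep -q '^#SBATCH --partition preemptable$' "$queue_file" && return 0
    tmp_file="$(mktemp)"
    awk 'NR==1 {print; print "#SBATCH --partition preemptable"; next} {print}' \
        "$queue_file" > "$tmp_file"
    # Copy the contents back so the original file keeps its permissions.
    cat "$tmp_file" > "$queue_file"
    rm -f "$tmp_file"
}

# Demonstration on a made-up queue file.
demo_queue="$(mktemp)"
printf '#!/bin/bash\n#SBATCH --nodes=1\nmpirun -n 8 ./a.out\n' > "$demo_queue"
add_partition_line "$demo_queue"
add_partition_line "$demo_queue"   # second call changes nothing
second_line="$(sed -n '2p' "$demo_queue")"
echo "$second_line"
```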


== slurm ==

A comparison of PBS to slurm commands can be found here: http://slurm.schedmd.com/rosetta.pdf

An overview of slurm is found here: [[https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=132625]]


Useful commands:

List idle nodes:
{{{
#!sh
sinfo --states=idle
}}}

See jobs of <username>:
{{{
#!sh
squeue -u <username>
}}}

Get more information on jobs of <username>:
{{{
#!sh
sacct -u <username> --format=User,JobID,account,Timelimit,elapsed,ReqMem,MaxRss,ExitCode
}}}

== Running jobs with GPU ==
Andes has 12 GPU nodes: g01-g12. To submit a job to these nodes, add the following lines to your job script:
{{{
#!sh
#SBATCH --partition gpuq
#SBATCH --gres=gpu:1
}}}
where `1` means to use 1 GPU.
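A complete but minimal GPU test job might look like the sketch below. The script name, job name, and ten-minute limit are illustrative assumptions, and the job body relies on Slurm usually exporting `CUDA_VISIBLE_DEVICES` for jobs that request a GPU through `--gres`:

```shell
#!/bin/bash
# Sketch: write a tiny GPU test job. File name, job name and time limit
# are illustrative assumptions, not site requirements.
gpu_script="$(mktemp -d)/gpu_job.sh"
cat > "$gpu_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# Slurm usually sets CUDA_VISIBLE_DEVICES for jobs that request a GPU.
echo "Running on $(hostname) with GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
EOF

# Check the generated script for shell syntax errors before submitting it.
bash -n "$gpu_script" && echo "wrote $gpu_script"
# On the cluster, submit it with: sbatch gpu_job.sh
```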