Running on Discover (Reto Ruedy)
+++++++++++++++++++
News:
====

pdE now also works for new-i/o runs, the various *ij*.nc files may
    be displayed using AViewer, hence are left out of the PRT file.
    In this case, "runID" need not be provided, the utility is model
    indepedent and always runs in serial mode on the login node.

runpbs was modified to not resubmit jobs that bombed - touch "I" to
       resubmit such a job anyway.

The automatic restart procedure was changed to try to avoid submitting
runs that got into trouble or submitting the same job repeatedly.

The more efficient (by a factor of 1.5-2) nehalem nodes are available;
you may access them using     "runpbs runID nh #cpus"  (new_highspeed)
These nodes have 8 cpus and do NOT support scaliMPI    (use intelMPI)

Currently, programs compiled with intelmpi may be run on the login nodes
on 1 cpu without using "mpirun -np 1"; this may be a bug and not true
in the future, but at this point it makes serial compilations unnecessary:
CMPE002, qc, and the other aux/auxqflux executables may be treated as if
they were compiled serially.

pdE is by default run as a batch job. However, it still may be run in
serial mode in an emergency or if you run a small version of modelE as
follows:  ln -s runID.exe runID.exe_serial
   then:  pdE ia runID list_of_accfiles

Runs compiled with intelMPI may now be bundled (run ... runID1+runID2+...)

Currently there are  520 slow cpus, 2064 fast cpus on 4-cpu nodes,
   n*2064 cpus of 8-cpu nodes ((n-2)*2064 are super-efficient) with n=6 (and rising).

Generalities:
============
- "discover" is a distributed memory machine
- the compilers provided that we currently use are intel-compilers
- mpi-runs on interactive nodes are not possible; current EXCEPTION:
  intelmpi executables treated like a serial exec will run on 1 cpu.
- dirac's disks are NFS-mounted on discover as /archive/u/userID, but
  they are accessible only from the login nodes, not to batch jobs.

Queues:
======
queues are not specific to groups - however, for special needs, there are
queues that are accessible only to selected groups/users. So far, there
was no need for GISS to use this option.

On the generally accessible queues, the NUMBER of JOBS (not CPUs) a user
or group can run simultaneously is limited on each queue
- we may bundle our small jobs to get our share of resources
As a favor to us, group limits for general and general_small were abolished !

Bundling jobs: (no loss in efficiency)
=============
runpbs job1+job2+... should work fine for any modelE jobs
For non-modelE openMP jobs, create in the run directory a copy E4 of "E"
   which sets:   export OMP_NUM_THREADS=4    (before the *.exe line)
             or  setenv OMP_NUM_THREADS 4    (for csh tcsh users)
Any number of mpi jobs that use the same number of nodes can be bundled
(does NOT work with openMPI)

Unresolved bugs:
===============
- occasionally, jobs tend to "hang" for extended period of times,
  usually during i/o processes (creation of PRT files)
     hint: for long runs use "kdiag=13*9," not "kdiag=13*0,"

- serial and parallel versions do not always give the same result
     hint: see next section (Consistency ...)

- intelmpi: may be used on all current and future nodes

- scali mpi: can only run on the relatively few old 4-cpu nodes

- openmpi: may be used on all nodes but is not recommended (use intelmpi)
   - compile with: EXTRA_FFLAGS="-DMPITYPE_LOOKUP_HACK" to avoid memory leak
   - Bundling openmpi jobs is currently impossible

Compiler issues:
===============
For off-line fortran compilations use:
   ifort -convert big_endian x.f -o x.exe

Before doing so, load the appropriate modules, currently:
   /etc/profile.d/modules.sh             ; # bash or ksh or:
   source /etc/profile.d/modules.csh     ; # tcsh or csh
and either
   module load comp/intel-9.1.042
   module load mpi/scali-5               ; # for scali mpi (on borga-e...)
or
   module load comp/intel-9.1.052        ; # for openmpi
   module load mpi/openmpi-1.2.5/intel-9
or
   module load comp/intel-10.1.017       ; # for intel mpi
   module load mpi/impi-3.2.011

This is to be put into your startup script (e.g. .profile if your login shell
is ksh), so these modules also get loaded when running in batch mode.
"runpbs (=run)" was modified to load the appropriate modules for mpi batch jobs
even if those commands are missing in the login script. But you still need to set
them appropriately during the "setup" step.

Big OpenMP jobs may need a stacksize > default=4m : export KMP_STACKSIZE=64m

Intel mpi requires passwordless login between the nodes (if you want
to run the model on more than one node). To enable it just append your
ssh key to "authorized_keys":
   cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
   chmod 644 ~/.ssh/authorized_keys
(this should give -rw---------  authorized_keys)
Also, ensure that home directory has the correct permissions:
chmod 755 ~

Consistency between serial and parallelized executables:
-------------------------------------------------------
openMP and serial runs give the same results ONLY if using the
   option -mp (machine precision) in both cases, i.e. use
   gmake setup|exe RUN=.... EXTRA_FFLAGS=-mp EXTRA_LFLAGS=-mp
Warning: The current version of modelE no longer supports openMP

mpi and serial runs (without the -mp options) give identical results
   (but different from serial runs and openMP runs with -mp)

Disks and file names:
====================
Each user is given 2 directories /home/userid (backed up, 1 GB cap)
                     and         /discover/nobackup/userid (100 GB cap)

and we also share:  /discover/nobackup/projects/giss
                    to be used for files of general interest only

some subdirectories of /discover/nobackup/projects/giss:    (.modelErc name)
                       --------------------------------
   exec                                                     (EXECDIR)
   prod_decks (where setup should copy *.R files)           (DECKS_REPOSITORY)
   prod_runs  (contains symb. links to run-directories      (CMRUNDIR)
                    and lists of active long runs)
   prod_input_files (init. and boundary conditions)         (GCMSEARCHPATH)
   esmf_2_2_ifort_9.1.042                                   (ESMF_DIR)
   RUNS        (directories from which to start long production runs)
   OBS         (for obs. files - used by modelE RMS-program)
   on_UniTree  (inventories of files archived using to_ut)
   totape      (staging site for mass transfers with to_ut_all)
   restored    (temporary holding place for files restored with from_ut)
   archive     (with symbolic link 'prod_output' to files archived via to_ut)

Long path names - possible work arounds:
---------------------------------------
There are several ways to avoid typing those long path names, creating
environment variables, aliases, symbolic links, etc. My preferred way
currently is to create symbolic links from the home directory to the
most important directories like
   ln -s /discover/nobackup/projects/giss                  $HOME/giss
   ln -s /discover/nobackup/projects/giss/prod_input_files $HOME/IC
   ln -s /discover/nobackup/projects/giss/prod_runs        $HOME/runs
   ln -s /discover/nobackup/my-user-ID                     $HOME/data
"cd ~/giss" will get you to that directory. You may also avoid the "~/" by
setting e.g. " alias ccd='cd ; cd' "; then "ccd giss" will do the job.

general utilities: (see "man name_of_utility" for details)
=================
qsub             to submit batch jobs to PBS   (used in run,runpbs and qsubI)
qdel             to kill a batch job
qstat            to query system               (used in qs, qpbs)
qhold/qrls       to put a queued job on hold/undo the hold
xxdiff -a f1 f2  to compare text files f1 f2
id               to display your userid/account/group(s)

tailored utilities: (on ....../giss/exec)
==================
sftp_help   get password-free access to dirac, other nodes, CVS repository
run,runpbs  submit 1 or several jobs as a single batch job
  "run" has a systems default alias that may be bypassed typing "unalias run"
  "run" and "runpbs" are identical
  Usage: runpbs runID [fast<N>|slow<N>|new<N>|nh<N>|scali<N>] #_of_cpus #_of_wall_clock_hours
         The optional <N> limits the number of cpus per node used; it has to be 1-4 in the
         first 2 cases and 1-8 in the last 2 (fast3 may be abbreviated by f3, new6 by n6)
         new defaults: if no particular type of nodes is specified, then
            scali_mpi jobs may run on any 4-cpu node (slow or fast; i.e. same as "scali")
            other jobs may run on any 8-cpu node (new or nehalem)

  Notes: 1) #_of_wall_clock_hours - always let it default (to the max. 12 hrs) unless
            a) you want the job to go into the background queue - set it to 4 (hours)
               - the background queue is mostly closed, so 4 hrs is NOT recommended
            b) you want the job to go into the pproc queue      - set it to 3 (hours)
               you may choose 3 hrs if systems closed the general queues; remember that
               you are limited to 32 cpus in order to use this option.
         2) runpbs runID new|nh ... sends the job to the new|newest 8-processor nodes
            if the 2nd arg ends in a digit N (N=1-8), at most N cpus per node are used.
qsubI       get an interactive batch session (args: fast|slow|new #_of_cpus wc_time)
ssw         stop a modelE run
qrsf        query how far modelE-run went
qpbs        query how busy the machine is and get info about your jobs
qpbs -a     query how busy the machine is and get info about all jobs
qs runID    query status (# cpus used, time left until timeout)
qnodes      query how many and which types of nodes are free

pdE         post-process acc-files (mix of uncompressed and *.gz files is ok)
              if runID.exe_serial is present, pdE will use it instead of runID.exe
              also works for runs using new-i/o : pdE list_of_acc-files

jE_to_lpl   extract an aplot-file for a selected field from *.jE* (budg.pg)
qdf         display titles for ij-files, etc
qdf1        display titles for rsf-files,acc-files
q4df        display titles,etc for GISS/4D-files (e.g. *.jk* files)
qit         display/check times/multiplicity for subdaily output files
maxmin      find extremes for ij-files
maxmin4     find extremes for GISS/4D-files
maxmin_i    find extremes for subdaily output files
gm          find global means for ij-files (if last 4 bytes = glob.mean)
GM          find global means for ij-files (re-computed from data)
AViewer     plot 72x46 ij-files (color maps)
nmaps       plot many ij-files (color maps - see nmaps.doc)
1map        plot ij-file or GISS-4D-file (automatic color bar election)
aplotX      line plots from text files
showlpl     combines aplotX and aplotX -a

to_ut       archive directories on dirac
from_ut     retrieve directories from dirac
on_ut       query inventory of dirac-archive

Archive:  /archive/u/userID
=======
prod_input_files contains links to dirac's copies of the directories
   BCND (input files used throughout a run)
   AIC  (input files used only at the start of a run, incl. OIC GIC ...)

Rarely used or obsolete input files may still be found there. These disks
are only accessible from the login nodes, i.e. batch jobs cannot see them.

starting modelE runs
====================
The modelE "gmake setup RUN=runID options" makes use of $HOME/.modelErc

Using a single  .modelErc :
-------------------------
To create such a file you may use the command:
   sed "s/rruedy/`whoami`/" < /home/rruedy/.modelErc_uni > $HOME/.modelErc

To start a run (gmake setup) or recreate the exe-file (gmake exe) use the
following options:

   openMP options: MP=YES EXTRA_FFLAGS=-mp EXTRA_LFLAGS=-mp
   mpi    options: ESMF=YES (to run on 4 processors, add  NPES=4 )

For serial runs, no options are needed; however if you want results identical
to that produced by an openMP executable, add EXTRA_FFLAGS=-mp EXTRA_LFLAGS=-mp

NOTE: When changing options, make sure to use     "gmake clean vclean"
      to get rid of old object modules and avoid incompatibilities.

Using multiple versions of .modelErc
------------------------------------
If you need more than the above options, you may use the command "get_modelErc_samples"
to create files named "$HOME/modelErc_version_name". (note: no leading ".")
You may modify them, rename them, add more samples

use "setErc"  to select the appropriate version - it will also load the appropriate
modules. Remember:   use "gmake clean vclean" if you switch to a new .modelErc

Old complicated way to start mpi-runs in serial mode (no longer recommended)
----------------------------------------------------
This method needs 2 compilations (serial and mpi) and multiple modelErc versions.
Use "setErc" to link the selected version to $HOME/.modelErc .
ser2mpi and mpi2ser will automatically create the proper link. ESMF=YES not needed.

With this preparation, starting a run involves the following steps:

0 - gmake clean vclean
1 - Make sure that the serial version is linked to .modelErc
2 - gmake setup RUN=Exyz        - does first hour w/o parallelization
3 - ser2mpi Exyz                - renames Exyz.exe,E (Exyz.exe_serial,E_ser)
                                - creates E for mpi (E_mpi)
                                - hides serial object modules and sets .modelErc to mpi
                                - asks whether to do gmake aux ... (usually: say no)
4 - gmake exe RUN=Exyz ESMF=YES - creates Exyz.exe for mpi (much bigger than Exyz.exe_serial)
    Now you are ready to extend the run using e.g.
5 - runpbs Exyz fast 15         - continues Exyz as batch job on 15 fast processors

To set up a new run in the same directory, do
0 - mpi2ser
repeat steps 1-4 for the new run

To set up the SAME run after fixing something or other, proceed as above, but
first REMOVE ALL OLD FILES in the rundirectory Exyz (in particular Exyz.exe_serial)

How many processors should be used per modelE run ?
===================================================
With openMP, you are limited to 1 node (i.e. 4 cpus on the old nodes,
8 cpus on the new nodes)

With mpi and as long as modelE uses a 1-dim. composition with respect to latitude J
I would recommend at most JM/3 cpus; this assigns to each processor a region consisting
of 3 latitudes (the halo always consists of 2 latitudes). It is highly recommended not
to use more than JM/2 latitudes, since some parts of modelE assume that the processor
whose region starts or ends at a pole also contains at least one other latitude. If
that is not the case, the model will use undefined quantities and will malfunction.

If memory is the problem, limit the number of cpus used per node (see runpbs).

utilities for long jobs: (on ...../giss/exec)
=======================
check_stop_reque.xxx  to stop/requeue the jobs in prod_runs/xxx.runs if needed
bkup.xxx              to backup all new files to dirac
scr_rsf.xxx           keep only the 1JAN and the last 12 rsfs       (x0 see PPD)
                      keep 1 1JAN/decade and the last 12 rsfs       (x  see PPD)
do_ann.xxx            create ann acc/PRT-files and key ann.glob.mean.lists   (a)
                             and monthly global mean series                 (ma)
add_ocn_lpl           add some ocean related ann.glob.mean.lists            (oa)
save_acc.xxx          collect next N years of acc-files in an appropriately (aN)
                           named directory, move it to "totape" directory
qrsfu  xxx            shows status of jobs listed in xxx.runs

The above utilities process all the runs listed in:
/discover/nobackup/projects/giss/prod_runs/xxx.runs
xxx.runs  should contain for each active run a line of the form:
          PPD runID #cpus #wc_hrs #wc-min    (see rar.runs)

PPD is an optional post-processing directive of the form oaNx0
the x-options create a subdirectory rsfRUNID in the run-directory
the a-options create the subdirectories ACC, Means, ann, lpl_data
the oa-option adds some ocean data line plots to lpl_data (coupled model only)
lines starting with "#" are ignored

To use any of these utilities, do in ...../giss/exec
- ln -s utility.rar utility.xxx, where xxx=your_initials or nickname
- create and maintain prod_runs/xxx.runs
- start a cron using the command   "crontab cronxxx"
  where cronxxx is an auxiliary file containing your list of commands
      Suggestions:
  execute check_stop_reque.xxx every hour
  execute bkup.xxx  once a day
  execute scr_rsf.xxx  often enough to stay within your quota (100 GB is not much !!)
  /home/rruedy/cronrar is what I currently use

If staying within 100 GB is a real hardship, you might consider asking support
(support@nccs.nasa.gov) to double your cap.

Monitoring standard-error output of in-progress noninteractive
batch jobs on Discover (as of Dec. 2009):
==============================================================

Messages written to "standard error" do not appear in your_run_name.PRT;
they are usually directed to the file job_number.borgmg.OU .
While a batch job runs, this output can be seen in
/discover/pbs_spool/job_number.borgmg.OU; after completion, the file is
moved to the run directory.

Parallel Compression of Large Files
==============================================================

Compression of multi-GB files takes minutes to hours using
the standard serial gzip utility.  In certain circumstances it
may be preferable to use "pigz",  a multithreaded version of gzip
that can utilize the multiple cores on a compute node (but not
across multiple nodes).  To invoke with compression level 9
using n threads:

pigz -9 -p n file_to_compress

IMPORTANT: on Discover nodes, it is necessary to set
"ulimit -v unlimited" before running pigz.
If "-p n" is omitted, pigz will try to detect the number of cores
available on a node, and uses 8 threads if unsuccessful.
