Changes between Version 17 and Version 18 of lonestar


Timestamp:
07/10/24 10:51:13 (5 months ago)
Author:
Cheng Gong

  • lonestar

    v17 v18  
    168168
    169169where JOBID is the ID of your job (shown in the Matlab session). Matlab also reports the job directory, where you can find the files `JOBNAME.outlog` and `JOBNAME.errlog`. The outlog file contains the output that would appear if you were running the job on your local machine, and the errlog file contains the error messages if the job encounters an error.
     170
     171== Running PINNICLE on Lonestar6 ==
     172Lonestar6 supports containers through `apptainer` [https://apptainer.org]. A prebuilt Docker image with the TensorFlow 2 backend is available at `docker://chenggongdartmouth/pinnicle_ls6:v0.1`
     173
     174You need to build the apptainer image from this Docker image on Lonestar6.
     175First, create an interactive session in LS6's `gpu-a100-dev` or `gpu-a100` queue:
     176{{{
     177idev -t 1:00:00 -N 1 -n 4 -p gpu-a100-dev
     178}}}
     179Then load the `cuda` and `apptainer` modules as follows:
     180{{{
     181module load cuda/11.4 cudnn/8.2.4 nccl/2.11.4
     182module load tacc-apptainer
     183}}}
     184
     185Move to your `<YOUR_WORKING_PATH>` directory on Lonestar6, which has the form `/work/xxxxx/yourname/ls6`.
     186
     187Build the apptainer image from the Docker image **with** `--nv`:
     188{{{
     189apptainer build --nv <YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME> docker://chenggongdartmouth/pinnicle_ls6:v0.1
     190}}}
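
The build step can take a while, so it may help to skip it when the image file is already there. The following is only a sketch: `build_image` is a hypothetical helper name, and the image path passed to it is whatever you chose for `<YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME>`.

```shell
#!/bin/bash
# Sketch: build the apptainer image only if it does not exist yet.
# build_image is a hypothetical helper, not part of apptainer.
build_image() {
    local image="$1"
    if [ -f "$image" ]; then
        # Nothing to do: reuse the image built earlier.
        echo "skip: $image already exists"
        return 0
    fi
    # Same command as above, with the image path taken from the argument.
    apptainer build --nv "$image" docker://chenggongdartmouth/pinnicle_ls6:v0.1
}

# Usage (inside the idev session, after loading the modules):
#   build_image <YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME>
```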
     191
     192After building the image, you can open an interactive shell in the container with:
     193{{{
     194apptainer shell --nv <YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME>
     195}}}
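
Before running a long job, it can be useful to confirm that TensorFlow inside the container actually sees the GPUs. The sketch below assumes the image provides `python` with TensorFlow 2 (as the image description suggests); `check_gpu` and the `DRY_RUN` switch are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Sketch: ask TensorFlow inside the container to list the visible GPUs.
# check_gpu and DRY_RUN are hypothetical; set DRY_RUN=1 to only print the command.
check_gpu() {
    local image="$1"
    local py='import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be executed without touching apptainer.
        echo "apptainer exec --nv $image python -c '$py'"
    else
        apptainer exec --nv "$image" python -c "$py"
    fi
}

# Usage (on a GPU node, after loading the modules):
#   check_gpu <YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME>
```

An empty list in the output would suggest that the `--nv` flag or the `cuda` modules are missing.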
     196
     197You can also submit a batch job to the queue with the following script:
     198{{{
     199#!/bin/bash
     200
     201#SBATCH -J job_name           # job name
     202#SBATCH -o output.%j          # output file named, output.jobID
     203#SBATCH -e error.%j           # error file named, error.jobID
     204#SBATCH -p gpu-a100           # queue name
     205#SBATCH -N 1                  # number of nodes requested
     206#SBATCH --ntasks-per-node 4   # tasks per node
     207#SBATCH -t 10:00:00           # time, hh:mm:ss
     208#SBATCH --mail-user=<EMAIL_ADDRESS>
     209#SBATCH --mail-type=all
     210
     211module load cuda/11.4 cudnn/8.2.4 nccl/2.11.4
     212module load tacc-apptainer
     213apptainer exec --nv <YOUR_WORKING_PATH>/<YOUR_IMAGE_NAME> python test.py
     214}}}
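
Submitting the script and locating its logs could look like the sketch below. It assumes the script above was saved as `job.slurm` (a hypothetical file name) and relies on `sbatch` printing a line of the form `Submitted batch job <ID>`; the helper names are illustrative only.

```shell
#!/bin/bash
# Sketch: submit a job script and report where its log files will appear.
# job_id_from_sbatch and submit_and_report are hypothetical helper names.

# Extract the job ID from sbatch's "Submitted batch job <ID>" line,
# ignoring any other text printed before it.
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

submit_and_report() {
    local script="$1" msg id
    msg=$(sbatch "$script") || return 1
    id=$(job_id_from_sbatch "$msg")
    # File names follow the -o output.%j and -e error.%j settings in the script.
    echo "job $id: stdout in output.$id, stderr in error.$id"
}

# Usage (on a login node):
#   submit_and_report job.slurm
#   squeue -u "$USER"      # check the state of your jobs
```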