Hi all,
I am trying to run the SHAKTI model through ISSM, remotely on the Narval cluster at Compute Canada, following the instructions at https://issm.ess.uci.edu/trac/issm/wiki/computecanada
I ran the following command (on Narval):
module load gcc/9.3.0 openmpi/4.0.3 matlab/2020a metis parmetis imkl petsc
I also installed the triangle, chaco, and m1qn3 external packages using their respective .sh install scripts in the externalpackages directory.
Along with this, I used the following configuration:
./configure \
--prefix=$ISSM_DIR \
--with-numthreads=2 \
--with-mkl-libflags="-lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_openmpi_lp64 -lmkl_scalapack_lp64" \
--with-petsc-dir="$EBROOTPETSC" \
--with-mumps-dir="$EBROOTPETSC" \
--with-m1qn3-dir="$ISSM_DIR/externalpackages/m1qn3/install" \
--with-mpi-include="$EBROOTOPENMPI/include" \
--with-mpi-libflags="-lmpi_cxx -lmpi_mpifh -lmpi" \
--with-fortran-lib="-L$EBROOTGCC/lib64 -lgfortran" \
--with-triangle-dir="${ISSM_DIR}/externalpackages/triangle/install" \
--with-chaco-dir="${ISSM_DIR}/externalpackages/chaco/install" \
--with-matlab-dir="/cvmfs/restricted.computecanada.ca/easybuild/software/2020/Core/matlab/2020a" \
--with-blas-dir="$EBROOTIMKL" \
--with-scalapack-dir="$EBROOTIMKL" \
--with-metis-dir="$EBROOTMETIS" \
--with-parmetis-dir="$EBROOTPARMETIS"
I followed this up with make and make install, both of which ran without errors.
I then opened MATLAB on Narval and tried to run runme.m in examples/shakti. Steps 1 and 2 ran without errors, but for step 3 I received the following error:
runme
Step 3: Solve!
Warning: While loading an object of class 'love':
Unrecognized method, property, or field 'int_steps_per_layers' for class 'love'.
In loadmodel (line 29)
In runme (line 55)
checking model consistency
marshalling file moulin.bin
uploading input file and queuing script
launching solution sequence on remote cluster
Ice-sheet and Sea-level System Model (ISSM) version 4.22
(website: http://issm.jpl.nasa.gov contact: issm@jpl.nasa.gov)
call computational core:
iteration 1/720 time [yr]: 0.00 (time step: 0.00)
Intel MKL ERROR: Parameter 9 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
[...]
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DTRSM .
Intel MKL ERROR: Parameter 5 was incorrect on entry to DGEMM .
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Signal received
[1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[1]PETSC ERROR: /lustre07/scratch/avigupt8/trunk//bin/issm.exe on a named narval4.narval.calcul.quebec by avigupt8 Mon Jul 10 15:28:32 2023
[1]PETSC ERROR: Configure options --prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc/3.17.1 --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-cxx-dialect=C++14 --with-memalign=64 --with-python=no --with-mpi4py=no --download-party=1 --download-superlu_dist=1 --download-SuiteSparse=1 --download-superlu=1 --download-metis=1 --download-ptscotch=1 --download-hypre=1 --download-spooles=1 --download-chaco=1 --download-strumpack=1 --download-spai=1 --download-parmetis=1 --download-slepc=1 --download-hpddm=1 --download-ml=1 --download-prometheus=1 --download-triangle=1 --download-mumps=1 --download-mumps-shared=0 --download-ptscotch-shared=0 --download-superlu-shared=0 --download-superlu_dist-shared=0 --download-parmetis-shared=0 --download-metis-shared=0 --download-ml-shared=0 --download-SuiteSparse-shared=0 --download-hypre-shared=0 --download-prometheus-shared=0 --download-spooles-shared=0 --download-chaco-shared=0 --download-slepc-shared=0 --download-spai-shared=0 --download-party-shared=0 --with-cc=mpicc --with-cxx=mpicxx --with-c++-support --with-fc=mpifort --CFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --CXXFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC -DOMPI_SKIP_MPICXX -DMPICH_SKIP_MPICXX" --FFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --with-mpi=1 --with-build-step-np=8 --with-shared-libraries=1 --with-debugging=0 --with-pic=1 --with-x=0 --with-windows-graphics=0 --with-scalapack=1 --with-scalapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/scalapack/2.1.0/lib/libscalapack.a,libflexiblas.a,libgfortran.a]" --with-blaslapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/flexiblas/3.0.4/lib/libflexiblas.a,libgfortran.a]" --with-hdf5=1 
--with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-fftw=1 --with-fftw-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8
[1]PETSC ERROR: #1 User provided function() at unknown file:0
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[0]PETSC ERROR: /lustre07/scratch/avigupt8/trunk//bin/issm.exe on a named narval4.narval.calcul.quebec by avigupt8 Mon Jul 10 15:28:32 2023
[0]PETSC ERROR: Configure options --prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc/3.17.1 --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-cxx-dialect=C++14 --with-memalign=64 --with-python=no --with-mpi4py=no --download-party=1 --download-superlu_dist=1 --download-SuiteSparse=1 --download-superlu=1 --download-metis=1 --download-ptscotch=1 --download-hypre=1 --download-spooles=1 --download-chaco=1 --download-strumpack=1 --download-spai=1 --download-parmetis=1 --download-slepc=1 --download-hpddm=1 --download-ml=1 --download-prometheus=1 --download-triangle=1 --download-mumps=1 --download-mumps-shared=0 --download-ptscotch-shared=0 --download-superlu-shared=0 --download-superlu_dist-shared=0 --download-parmetis-shared=0 --download-metis-shared=0 --download-ml-shared=0 --download-SuiteSparse-shared=0 --download-hypre-shared=0 --download-prometheus-shared=0 --download-spooles-shared=0 --download-chaco-shared=0 --download-slepc-shared=0 --download-spai-shared=0 --download-party-shared=0 --with-cc=mpicc --with-cxx=mpicxx --with-c++-support --with-fc=mpifort --CFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --CXXFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC -DOMPI_SKIP_MPICXX -DMPICH_SKIP_MPICXX" --FFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --with-mpi=1 --with-build-step-np=8 --with-shared-libraries=1 --with-debugging=0 --with-pic=1 --with-x=0 --with-windows-graphics=0 --with-scalapack=1 --with-scalapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/scalapack/2.1.0/lib/libscalapack.a,libflexiblas.a,libgfortran.a]" --with-blaslapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/flexiblas/3.0.4/lib/libflexiblas.a,libgfortran.a]" --with-hdf5=1 
--with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-fftw=1 --with-fftw-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8
[0]PETSC ERROR: #1 User provided function() at unknown file:0
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
Intel MKL ERROR: Parameter 4 was incorrect on entry to DGEMM .
[narval4.narval.calcul.quebec:1786898] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[narval4.narval.calcul.quebec:1786898] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[narval4.narval.calcul.quebec:1786898] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
loading results from cluster
Warning:
Binary file moulin.outbin not found!
This typically happens when the run crashed.
Please check for error messages above or in the outlog
In loadresultsfromdisk (line 16)
In loadresultsfromcluster (line 50)
In solve (line 180)
In runme (line 81)
It looks like there is an error with IMKL, or IMKL is not compatible with ISSM. I am not sure how to deal with this. Should I try a different version of IMKL (the current version is 2020.1.217)? Should I change the --with-blas-dir and --with-scalapack-dir values in the configure command to something else? There is also a PETSc error on top of this.
I also noticed there is a file named computecanada.m in bin. I am not sure how to use this file to run the simulation on Compute Canada without explicitly accessing the Narval terminal over SSH. Do I set md.cluster = computecanada('np',8,'login','username'), and then, when I run md = solve(md,'Transient');, will it automatically connect to Compute Canada and run? What should username be here? Should it be in the form xyz@narval.computecanada.ca or just xyz@narval?
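To make my question concrete, here is what I would expect the workflow to look like, assuming computecanada.m follows the pattern of ISSM's other cluster classes (the 'np' and 'login' option names and the bare username are my guesses, not confirmed):

```matlab
% Hypothetical sketch -- assumes computecanada.m accepts 'np' and 'login'
% options like other ISSM cluster classes; names/values are guesses.
md.cluster = computecanada('np', 8, 'login', 'avigupt8');

% Does this then upload the input file, queue the job on Narval over SSH,
% and download the results automatically?
md = solve(md, 'Transient');
```

Is this the intended usage, or does the class require additional setup first?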
According to https://issm.ess.uci.edu/trac/issm/wiki/computecanada, it appears that I need to add a file named narval_settings.m in src/m (I am guessing this file name, since the site uses sherlock_settings.m as its example) with the following personal settings:
cluster.login='XXX';
cluster.port=0;
cluster.codepath='XXX';
cluster.executionpath='XXX';
I am not sure whether to create this file on my own computer or on Narval. I am also not sure what codepath and executionpath refer to here. Please help me resolve these queries.
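For concreteness, here is my current guess at narval_settings.m, using the path that appears in the PETSc log above; every value here is an assumption on my part, not something confirmed by the wiki:

```matlab
% narval_settings.m -- hypothetical sketch; all values are my guesses.
cluster.login = 'avigupt8';                 % Compute Canada username (as seen in the log)?
cluster.port = 0;                           % 0 presumably means the default SSH port
% The log shows issm.exe under /lustre07/scratch/avigupt8/trunk/bin,
% so is codepath simply the ISSM bin directory on the cluster?
cluster.codepath = '/lustre07/scratch/avigupt8/trunk/bin';
% And is executionpath just a scratch directory where runs are staged?
cluster.executionpath = '/lustre07/scratch/avigupt8/execution';
```

If someone could confirm whether this is roughly right, and whether the file lives on my machine or on Narval, that would answer most of my questions.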