Hi all,
I've been attempting to debug some issues I'm having with the Python interface on the Compute Canada systems to run GlaDS simulations. I have been working on the Graham system (with ISSM installed according to the instructions here), and everything was working great with the Python interface (Python 3.11.5). After updating some packages to keep my local and cluster environments consistent, I can no longer run ISSM with the Python interface on the cluster. I've since reverted all packages to their previous versions, except netCDF4, which I can't revert because of module conflicts.
Specifically, ISSM marshals and uploads the binary files correctly, but is unable to run them. For my job named ensemble_001, the model fails with these messages:
checking model consistency
marshalling file 'ensemble_001.bin'
uploading input file and queuing script
launching solution sequence on remote cluster
loading results from cluster
============================================================
Binary file ensemble_001.outbin not found
This typically happens when the run crashed.
Please check for error messages above or in the outlog
============================================================
WARNING: ensemble_001.outbin does not exist
However, I'm able to run ./ensemble_001.queue directly on the command line and have the model solve correctly!
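In case it helps anyone debugging something similar, a quick way to see what the crashed run actually reported is to dump its logs from the execution directory. A minimal sketch, assuming the usual ISSM <name>.outlog / <name>.errlog naming (adjust the paths to your setup):

from pathlib import Path

# Dump whatever the run wrote to its stdout/stderr logs
for suffix in ('.outlog', '.errlog'):
    log = Path('ensemble_001' + suffix)
    if log.exists():
        print(f'--- {log} ---')
        print(log.read_text())
    else:
        print(f'{log} not found')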
It seems that the netCDF4 package is somehow affecting my environment and stopping the model execution. If I ask Python to run the queue script directly (i.e., the same way the solver does), it works as long as I don't import netCDF4 (and none of my other imports use netCDF4):
>>> import subprocess
>>> subprocess.run('./ensemble_001.queue')
Ice-sheet and Sea-level System Model (ISSM) version 4.24
(website: http://issm.jpl.nasa.gov forum: https://issm.ess.uci.edu/forum/)
call computational core:
iteration 1/8760 time [yr]: 0.00 (time step: 0.00)
updating effective pressure
saving temporary results
iteration 2/8760 time [yr]: 0.00 (time step: 0.00)
updating effective pressure
...
>>> import netCDF4 as nc
>>> subprocess.run('./ensemble_001.queue')
CompletedProcess(args='./ensemble_001.queue', returncode=1)
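To narrow down what the import changes, one diagnostic I can think of is to diff the process environment around the import and re-run the script with the pre-import environment. This is only a sketch, under the assumption that the problem is an environment variable (e.g. LD_LIBRARY_PATH or HDF5_PLUGIN_PATH) being set at import time; if the run still fails with env=before, the interference must be happening elsewhere (e.g. in already-loaded shared libraries):

import os
import subprocess

before = dict(os.environ)
import netCDF4  # the import itself is what we're testing
after = dict(os.environ)

# Report any environment variables the import added, removed, or changed
for key in sorted(set(before) | set(after)):
    if before.get(key) != after.get(key):
        print(f'{key}: {before.get(key)!r} -> {after.get(key)!r}')

# If an environment variable is the culprit, the run should succeed again
# when launched with the pre-import environment
subprocess.run('./ensemble_001.queue', env=before)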
So I can work around this problem by using another reader for *.nc files (h5netcdf works!). But I thought this might be important to bring up, since the built-in Python I/O uses netCDF4, so other users might run into the same problem. Note that I've reproduced both the problem and the workaround on another Compute Canada cluster (Narval, a newer cluster with the same software environment).
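For reference, the swap is close to a drop-in replacement, since h5netcdf's legacyapi module mirrors the netCDF4.Dataset interface. A minimal sketch (output.nc is just a placeholder file name):

import h5netcdf.legacyapi as nc  # instead of: import netCDF4 as nc

# legacyapi.Dataset mirrors netCDF4.Dataset, including the context manager
with nc.Dataset('output.nc', 'r') as ds:
    print(list(ds.variables))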
Some more details about package and module versions:
Currently Loaded Modules:
1) CCconfig 8) mii/1.1.2 15) java/17.0.6 (t) 22) scalapack/2.2.0 (math)
2) gentoo/2023 (S) 9) hwloc/2.9.1 16) matlab/2023b.2 (t) 23) hdf5-mpi/1.14.2 (io)
3) gcccore/.12.3 (H) 10) ucx/1.14.1 17) metis/5.1.0 (math) 24) petsc/3.20.0 (t)
4) gcc/12.3 (t) 11) libfabric/1.18.0 18) parmetis/4.0.3 (math) 25) python/3.11.5 (t)
5) flexiblas/3.3.1 12) pmix/4.2.4 19) imkl/2023.2.0 (math) 26) mpi4py/3.1.6 (t)
6) blis/0.9.0 13) ucc/1.2.0 20) fftw/3.3.10 (math)
7) StdEnv/2023 (S) 14) openmpi/4.1.5 (m) 21) fftw-mpi/3.3.10 (math)
And a selection of the most relevant Python packages:
h5netcdf 1.3.0+computecanada
h5py 3.10.0
matlabengine 23.2
matplotlib 3.9.0+computecanada
mpi4py 3.1.6
netCDF4 1.7.1+computecanada
numpy 1.25.2+computecanada
petsc4py 3.20.0
pip 23.2.1
rasterio 1.3.9+computecanada
scipy 1.11.2+computecanada
setuptools 68.1.2
setuptools-scm 7.1.0
virtualenv 20.24.3
wheel 0.41.2