Hi all,
I've been attempting to debug some issues I'm having with the Python interface on the Compute Canada systems to run GlaDS simulations. I have been working on the Graham system (with ISSM installed according to the instructions here), and everything was working great with the Python interface (Python 3.11.5). After updating some packages to keep my local and cluster environments consistent, I can no longer run ISSM with the Python interface on the cluster. I've since reverted all packages to their previous versions, except netCDF4, which I can't revert because of module conflicts.
Specifically, ISSM marshals and uploads the binary files correctly, but is unable to run them. For my job named ensemble_001, the model fails with these messages:
checking model consistency
marshalling file 'ensemble_001.bin'
uploading input file and queuing script
launching solution sequence on remote cluster
loading results from cluster
============================================================
Binary file ensemble_001.outbin not found
This typically happens when the run crashed.
Please check for error messages above or in the outlog
============================================================
WARNING: ensemble_001.outbin does not exist
However, I'm able to run ./ensemble_001.queue directly on the command line and have the model solve correctly!
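In case it helps anyone debugging something similar, a quick way to see what the crashed run actually reported is to dump its logs from the execution directory. A minimal sketch, assuming the usual ISSM <name>.outlog / <name>.errlog naming (adjust the paths to your setup):

from pathlib import Path

# Dump whatever the run wrote to its stdout/stderr logs
for suffix in ('.outlog', '.errlog'):
    log = Path('ensemble_001' + suffix)
    if log.exists():
        print(f'--- {log} ---')
        print(log.read_text())
    else:
        print(f'{log} not found')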
It seems that the netCDF4 package is somehow affecting my environment and stopping the model execution. If I ask Python to run the queue script directly (i.e., the same way the solver does), it works as long as I don't import netCDF4 (and none of my other imports use netCDF4):
>>> import subprocess
>>> subprocess.run('./ensemble_001.queue')
Ice-sheet and Sea-level System Model (ISSM) version 4.24
(website: http://issm.jpl.nasa.gov forum: https://issm.ess.uci.edu/forum/)
call computational core:
iteration 1/8760 time [yr]: 0.00 (time step: 0.00)
updating effective pressure
saving temporary results
iteration 2/8760 time [yr]: 0.00 (time step: 0.00)
updating effective pressure
...
>>> import netCDF4 as nc
>>> subprocess.run('./ensemble_001.queue')
CompletedProcess(args='./ensemble_001.queue', returncode=1)
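To narrow down what the import changes, one diagnostic I can think of is to diff the process environment around the import and re-run the script with the pre-import environment. This is only a sketch, under the assumption that the problem is an environment variable (e.g. LD_LIBRARY_PATH or HDF5_PLUGIN_PATH) being set at import time; if the run still fails with env=before, the interference must be happening elsewhere (e.g. in already-loaded shared libraries):

import os
import subprocess

before = dict(os.environ)
import netCDF4  # the import itself is what we're testing
after = dict(os.environ)

# Report any environment variables the import added, removed, or changed
for key in sorted(set(before) | set(after)):
    if before.get(key) != after.get(key):
        print(f'{key}: {before.get(key)!r} -> {after.get(key)!r}')

# If an environment variable is the culprit, the run should succeed again
# when launched with the pre-import environment
subprocess.run('./ensemble_001.queue', env=before)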
So I can work around this problem by using another reader for *.nc files (h5netcdf works!). But I thought this might be important to bring up, since the built-in Python I/O uses netCDF4, so other users might run into the same problem. Note that I've reproduced both the problem and the workaround on another Compute Canada cluster (Narval, a newer cluster with the same software environment).
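For reference, the swap is close to a drop-in replacement, since h5netcdf's legacyapi module mirrors the netCDF4.Dataset interface. A minimal sketch (output.nc is just a placeholder file name):

import h5netcdf.legacyapi as nc  # instead of: import netCDF4 as nc

# legacyapi.Dataset mirrors netCDF4.Dataset, including the context manager
with nc.Dataset('output.nc', 'r') as ds:
    print(list(ds.variables))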
Some more details about package and module versions:
Currently Loaded Modules:
1) CCconfig 8) mii/1.1.2 15) java/17.0.6 (t) 22) scalapack/2.2.0 (math)
2) gentoo/2023 (S) 9) hwloc/2.9.1 16) matlab/2023b.2 (t) 23) hdf5-mpi/1.14.2 (io)
3) gcccore/.12.3 (H) 10) ucx/1.14.1 17) metis/5.1.0 (math) 24) petsc/3.20.0 (t)
4) gcc/12.3 (t) 11) libfabric/1.18.0 18) parmetis/4.0.3 (math) 25) python/3.11.5 (t)
5) flexiblas/3.3.1 12) pmix/4.2.4 19) imkl/2023.2.0 (math) 26) mpi4py/3.1.6 (t)
6) blis/0.9.0 13) ucc/1.2.0 20) fftw/3.3.10 (math)
7) StdEnv/2023 (S) 14) openmpi/4.1.5 (m) 21) fftw-mpi/3.3.10 (math)
And a selection of the most relevant Python packages:
h5netcdf 1.3.0+computecanada
h5py 3.10.0
matlabengine 23.2
matplotlib 3.9.0+computecanada
mpi4py 3.1.6
netCDF4 1.7.1+computecanada
numpy 1.25.2+computecanada
petsc4py 3.20.0
pip 23.2.1
rasterio 1.3.9+computecanada
scipy 1.11.2+computecanada
setuptools 68.1.2
setuptools-scm 7.1.0
virtualenv 20.24.3
wheel 0.41.2