Parallel model solves

nicare

Hi,

I want to compute transient solutions for several different input parameters to eventually be able to do some UQ via sampling. For convenience, let's say these are in total 100 transient runs (how many it will be still depends somewhat on how much I can break down the compute time). I've made some test runs with the np options to see how the ISSM solve scales. For my test problem I recorded (one run each, all with the same parameter):

np = 1: computation time 29 min
np = 2: computation time 18 min
np = 4: computation time 12.6 min
np = 8: computation time 12.3 min

So essentially using more cores initially speeds up the computations but I get diminishing returns. The np value at which the speedup saturates depends on the mesh size, which matches the observations made here https://issm.ess.uci.edu/forum/d/155-parallel-computing back in 2017.
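To put numbers on the diminishing returns, here is a minimal sketch in plain MATLAB (no ISSM calls) that turns the four timings above into speedup and parallel efficiency. The variable names are only illustrative.

```matlab
% Timings from the test runs above (minutes, one run each)
np = [1 2 4 8];
t  = [29 18 12.6 12.3];

speedup    = t(1) ./ t;      % speedup relative to the np = 1 run
efficiency = speedup ./ np;  % fraction of ideal linear scaling

for k = 1:numel(np)
    fprintf('np = %d: speedup %.2f, efficiency %.0f%%\n', ...
            np(k), speedup(k), 100*efficiency(k));
end
```

With these timings the efficiency drops from about 81% at np = 2 to about 29% at np = 8.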

Rather than running 100 solves with np=8 one after another (wait 20.5 h), it would speed things up a lot more if I ran two solves with np=4 at the same time, i.e. 2 x 50 solves (wait 10.5 h). Is there a way for me to do that? I've tried the obvious ways of running two separate Matlab windows or calling Matlab scripts in parallel from the terminal with mpirun. The solves actually terminate without any thrown errors and the solution at the final time seems to be correct. However, whichever process finishes second only records the transient solution starting at the time step where it was when the first process ended. I didn't think this approach would work at all, but based on my results so far it almost does. So, is there a way I can make it work?

Example: If window 1 finishes after 200 time steps while window 2 is at time step 119, then the md.results returned in window 1 will be a 1x200 struct with the transient solution for time steps 1 - 200, but md.results for window 2 will be a 1x82 struct with time steps 119 - 200 only (though with the actual solution for the parameter passed in window 2).
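For reference, the wait times quoted above follow from the measured timings. This is a minimal sketch in plain MATLAB, assuming that two np=4 solves running at the same time do not slow each other down (i.e. at least 8 free cores); the variable names are only illustrative.

```matlab
nRuns = 100;    % total number of transient runs
tNp8  = 12.3;   % minutes per run with np = 8
tNp4  = 12.6;   % minutes per run with np = 4

% Option 1: all runs one after another with np = 8
waitSequential = nRuns * tNp8 / 60;                      % hours

% Option 2: two np = 4 runs at a time
% (assumption: concurrent runs do not compete for resources)
nConcurrent    = 2;
waitConcurrent = ceil(nRuns / nConcurrent) * tNp4 / 60;  % hours

fprintf('np = 8, one at a time: %.1f h\n', waitSequential);
fprintf('np = 4, two at a time: %.1f h\n', waitConcurrent);
```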

Thanks and best wishes,
Nicole

mathieumorlighem

Hi @nicare !

First, you may be able to speed things up a bit by using an iterative solver instead of MUMPS (the default direct solver). Could you try md.toolkits.DefaultAnalysis=bcgslbjacobioptions(); before calling solve to see if things improve?

As for your question, the best approach may be not to wait for one run to finish before launching the other solves. You will have to be careful with job names (and keep track of them), but essentially you can do something like this:

Submit 3 jobs at the same time (you can of course make this larger than 3)

for i=1:3
   %Make changes specific to job "i" now
   %...

   %Now run job on local machine without waiting for run to finish
   md.settings.waitonlock = 0;
   md.cluster.interactive = 0;
   md.miscellaneous.name = ['MyRun_' num2str(i) ]; %Unique name!

   %Submit job or download results, make sure that there is no runtime name (that includes the date)
   md=solve(md,'Transient','runtimename',false);
end

Load results
Once your jobs are done (you can check with top or anything else), you can run the exact same code but with the option 'loadonly',1 added to solve, in order to ask the solver to load existing results instead of submitting a new job:

for i=1:3
   %Make changes specific to job "i" now
   %...

   %Same settings and job name as at submission, so that solve finds the existing results
   md.settings.waitonlock = 0;
   md.cluster.interactive = 0;
   md.miscellaneous.name = ['MyRun_' num2str(i) ]; %Unique name!

   %Load existing results instead of submitting a new job, make sure that there is no runtime name (that includes the date)
   md=solve(md,'Transient','runtimename',false,'loadonly',1);

   %Save model! (save expects the variable name as a string)
   save(['Model_' num2str(i) ], 'md');
end
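
Since the original problem was a truncated transient record, it may be worth checking that each loaded run contains all time steps. This is only a sketch: it assumes that each model was saved to Model_<i>.mat as a variable named md, that the results sit in md.results.TransientSolution, and that 200 time steps are expected (nExpected is a placeholder to adjust).

```matlab
nJobs     = 3;             % same number of jobs as in the loops above
nExpected = 200;           % assumption: expected number of time steps per run
nSteps    = nan(1, nJobs); % NaN marks a model file that was not found

for i = 1:nJobs
    fname = ['Model_' num2str(i) '.mat'];
    if ~exist(fname, 'file')
        continue;          % not saved (yet), leave NaN
    end

    % Load only the variable md from the file
    s = load(fname, 'md');
    nSteps(i) = numel(s.md.results.TransientSolution);

    if nSteps(i) ~= nExpected
        warning('Model_%d has %d of %d time steps', i, nSteps(i), nExpected);
    end
end

disp(nSteps);
```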

You can adjust this to your needs, but don't hesitate to ask if you have any questions!
Mathieu

nicare

Hi Mathieu,

Thank you so much! I tested it and could run it without a problem, that's gonna speed up my workflow immensely. As for the iterative solver, that change helped a lot too. I've posted some timings below just for reference.

Have a nice weekend,
Nicole

-- computation times (200 time steps, HO + thermal + masstransport, mesh with 17460 vertices):

np = 1: 29 min (default solver), 26 min (iterative solver)
np = 2: 18 min (default), 14 min (iterative)
np = 4: 12.6 min (default), 7.5 min (iterative)
np = 8: 12.3 min (default), 4.8 min (iterative)
For more cores the time stagnates again, but that should be expected.
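
As a rough illustration of what these timings mean for a 100-run sampling campaign, here is a minimal sketch in plain MATLAB. It assumes 8 available cores and that two np=4 jobs running at the same time do not slow each other down, which has not been measured here.

```matlab
np       = [1 2 4 8];
tDefault = [29 18 12.6 12.3];  % minutes, default solver
tIter    = [26 14 7.5 4.8];    % minutes, iterative solver

% How much faster the iterative solver is for each np
gain = tDefault ./ tIter;
for k = 1:numel(np)
    fprintf('np = %d: iterative solver is %.2fx faster\n', np(k), gain(k));
end

% Estimated wait for 100 runs with the iterative solver
nRuns      = 100;
waitNp8    = nRuns * tIter(4) / 60;      % one np = 8 job at a time (hours)
waitNp4x2  = (nRuns/2) * tIter(3) / 60;  % two np = 4 jobs at a time (hours)
fprintf('np = 8 sequential: %.2f h, 2 x np = 4: %.2f h\n', waitNp8, waitNp4x2);
```

Under these assumptions the two options come out at about 8 h and 6.25 h respectively, so running two np=4 jobs side by side still pays off, though less than with the default solver.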

mathieumorlighem

Oh that's great! Thanks a lot for the update and have a great weekend as well