Nice!
I tested it for gemver_mpi and it worked perfectly. I only had to add --mem-per-cpu=16G
so that the memory doesn't overflow for few processes
tomw (e0269734) at 04 Jan 09:50
tomw (c9e051b1) at 04 Jan 09:49
tomw (c9e051b1) at 18 Dec 16:07
Merge remote-tracking branch 'origin/main' into tom
... and 31 more commits
All variants perform more or less the same in the tests on Euler; V3 seems to be slightly the best.
Why do we need to remove omp_set_num_threads() on Euler?
I think you could move the multiplication with alpha and beta out of the double loop to reduce the number of multiplications.
And how did you run it on Euler? I.e., how many threads did you try?
Move data initialization in the gemver method inside the timing. This allows data initialization on separate processes, which greatly reduces the Scatter calls that account for the major part of the time in gemver. Therefore gemver_mpi_2_new is the fastest method for input sizes > 5000 without -O3; otherwise the 3 FLOPs are computed faster than sending & receiving data via MPI. OpenMP is slow in this plot since it ran on only one thread...
tomw (95c989a0) at 14 Dec 16:16
add blocking and openmp to best mpi version
tomw (9dac98c4) at 14 Dec 12:51
rm define of NUM_THREADS
No. I can change that; as you once mentioned, it is faster.
Yes, you are right about the number of threads. I have only focused on MPI so far and will now look into the combination with OpenMP.
But isn't the advantage of MPI over OpenMP that we can use it to distribute over multiple nodes? And if all CPUs are on one node, then we could do it with OpenMP only?
Command for the plot:
sbatch -C ib --ntasks=16 --nodes=16 --mem-per-cpu=16G --cpus-per-task=1 --wrap="mpirun ./evaluate_gemver_mpi; ./evaluate_gemver_openmp"
New versions:
TODO: gemver_mpi_3_new with openmp
tomw (df0d4b67) at 12 Dec 08:27
rm executable
tomw (134143ec) at 12 Dec 08:25
add gemver_mpi_3_new, a faster MPI version (faster than the baselin...