Tom
Move data initialization in the gemver methods inside the timed region. This lets each process initialize its own data, which greatly reduces the number of Scatter calls, and those account for most of the time in gemver. As a result, gemver_mpi_2_new is the fastest method for input sizes > 5000, but only without -O3; with -O3, the 3 FLOPS are computed faster than the data can be sent and received via MPI. OpenMP is slow in this plot because it ran on only one thread...
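To illustrate why local initialization can replace the Scatter step, here is a rough sketch in plain Python (no MPI, ranks are simulated). It assumes a row-block distribution and a deterministic initializer. The names `row_range`, `init_entry`, `init_local_block` and `init_full_then_split` are hypothetical and not taken from the gemver code.

```python
def row_range(rank, nprocs, n):
    """Half-open range of rows owned by `rank` under a block distribution."""
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def init_entry(i, j, n):
    """Hypothetical deterministic initializer for A[i][j] (an assumption)."""
    return ((i * j) % n) / n


def init_local_block(rank, nprocs, n):
    """Each rank builds only its own rows, so nothing has to be scattered."""
    start, stop = row_range(rank, nprocs, n)
    return [[init_entry(i, j, n) for j in range(n)] for i in range(start, stop)]


def init_full_then_split(nprocs, n):
    """Baseline: build the whole matrix in one place, then slice it per rank.
    The slicing stands in for the Scatter step."""
    full = [[init_entry(i, j, n) for j in range(n)] for i in range(n)]
    return [full[slice(*row_range(r, nprocs, n))] for r in range(nprocs)]
```

Both paths give every rank the same rows, so as long as the initializer depends only on the indices, the local variant should be a drop-in replacement that avoids the communication.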