Hybrid Trisolv implementations
There are two OpenMP implementations, one built on the mpi_isend version and one on the mpi_onesided version. Both give some speedup for certain problem sizes. For Gao's version I couldn't find any speedup with OpenMP. (I left the omp_set_num_threads() calls in; these need to be removed before running on Euler.)
I also made some slight adjustments to the mpi_isend and mpi_onesided implementations.
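
As a rough illustration of the omp_set_num_threads() point above, here is a minimal sketch of one way to avoid hand-removing the calls before a cluster run. It is not the code in this repo: the names trisolv_configure_threads, trisolv_omp and the TRISOLV_NUM_THREADS macro are hypothetical, the MPI part is left out entirely, and the kernel is assumed to be a plain forward substitution for a lower-triangular system L x = b. The thread count is only pinned when the macro is defined at build time; otherwise the OMP_NUM_THREADS environment variable set by the job script decides.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: pin the thread count only when the build asks for it,
 * e.g. -DTRISOLV_NUM_THREADS=4 for local tests. Without the macro nothing is
 * set here and OMP_NUM_THREADS from the environment is used instead. */
static void trisolv_configure_threads(void) {
#if defined(_OPENMP) && defined(TRISOLV_NUM_THREADS)
    omp_set_num_threads(TRISOLV_NUM_THREADS);
#endif
}

/* Assumed kernel: forward substitution for a lower-triangular n x n matrix L
 * stored row-major. Rows depend on each other, so only the inner dot product
 * is parallelised, using an OpenMP reduction. */
static void trisolv_omp(int n, const double *L, const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < i; j++) {
            sum += L[i * n + j] * x[j];
        }
        assert(L[i * n + i] != 0.0); /* a zero diagonal makes the system singular */
        x[i] = (b[i] - sum) / L[i * n + i];
    }
}
```

Calling trisolv_configure_threads() once at startup and then trisolv_omp(n, L, b, x) is enough to try this out. The sketch also compiles without OpenMP, in which case the pragma is ignored and the loop runs serially. Opening a parallel region for every row has a fixed cost, so a scheme like this would only be expected to pay off once the rows are long enough.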