How I discovered testing is not optional
So, this is a mistake I made in 2018. Three or four months after I started working as an RSE, I was given some 2000 lines of Fortran 77 or even older (our team works on performance optimisation). At the time, the code was serial and it actually had quite good performance on one core - the physicist who was working on it had used ARCHER webinars and that kind of thing to do some optimisation. So it was performing quite well, but they needed it to be parallel: they needed to go to bigger problems and they needed to get results more quickly as they scaled up. They thought a 12x speed-up might be sufficient, and our HPC machines have more cores than that per node, so hopefully we wouldn’t need to go beyond a single node. They gave us an example problem that ran in about an hour. Our initial projection was that it would take maybe three months for one RSE to do this. That wasn’t based on any experience, because we were a very new team at that point and didn’t have that many projects under our belt.
I took this project on myself, and my approach was this. The first thing I did was to benchmark the example problem with a variety of compilers and compiler options. We found that using the auto-parallelisation option in the Intel compiler gave us a 4x speed-up, which was great, a really good quick win. We gave that back to the researcher as an interim improvement while we worked on the proper parallelisation. I profiled the code with Intel VTune and identified where the hotspots were - there were two or three functions that took up about 90% of the runtime. So we knew we needed to work on those, and we’d heard about OpenMP, how it’s really easy and you just put some directives around where your code’s taking its time and it makes everything go faster, right? So we spent some time working on that.
And then things went south. So, this didn’t work. The code started segfaulting all over the place, we couldn’t work out why, and we were getting cryptic messages when the application terminated. And then when we stopped it from segfaulting, the code ran more slowly - I’m using the word “we”, but I mean “I”; I was working on this, this is entirely my mistake. So it was maybe parallelising, but it certainly wasn’t going faster, which was the whole aim of the parallelism. And at the same time it was also giving the wrong answer! I had no idea why, because, you know, this was 2000 lines of monolithic Fortran. So, yeah. The project was delayed and I had to abandon that month of work on OpenMP, which was a waste of time.
The way I recovered from this was, firstly, to write regression tests for each function to figure out where things were going wrong. Then I tidied and refactored the code a little bit so that I could actually understand what each section was doing while I was working on it, rather than blindly adding directives to it. I needed to make sure that I didn’t lose the readability for the researcher, who is a Fortran developer and doesn’t want to be dealing with things that aren’t Fortran. Then I switched to using MPI to parallelise function by function, because I’m a lot more familiar with MPI. The researcher was also saying that perhaps the code would need to scale beyond one node at some point, so being more flexible in that direction was a good idea. Finally, the way this application was structured lent itself quite nicely to MPI parallelism. In the end, because I was leading the team as well as doing this development, I ran out of time to work on it, so I had to hand it over to another team member to finalise. Dr Michele Mesiti gave a talk at the RSE conference in 2019 about the final version of this.
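To make the idea of a per-function regression test a bit more concrete, here is a minimal sketch. It is in Python rather than Fortran purely for illustration, and it is not the actual test code from this project: `serial_kernel`, `refactored_kernel` and `matches_reference` are hypothetical names, and the arithmetic inside them is a stand-in. The point is only the pattern: record the output of the trusted serial function once, then check that any restructured version reproduces it within a tolerance.

```python
import math


def serial_kernel(values):
    """Hypothetical stand-in for one function of the original serial code."""
    return [math.sqrt(v) * 2.0 for v in values]


def refactored_kernel(values, chunks=4):
    """Hypothetical stand-in for the same function after restructuring.

    The work is split into chunks, as it might be before distributing it
    across processes, and the partial results are joined back in order.
    """
    size = max(1, math.ceil(len(values) / chunks))
    result = []
    for start in range(0, len(values), size):
        part = values[start:start + size]
        result.extend(math.sqrt(v) * 2.0 for v in part)
    return result


def matches_reference(candidate, reference, rel_tol=1e-12):
    """True if both outputs have the same length and agree element by element."""
    return len(candidate) == len(reference) and all(
        math.isclose(c, r, rel_tol=rel_tol) for c, r in zip(candidate, reference)
    )


inputs = [float(i) for i in range(10)]
reference = serial_kernel(inputs)  # recorded once from the trusted serial version

# The regression test: the restructured function must reproduce the reference.
assert matches_reference(refactored_kernel(inputs), reference)
```

With one check like this per function, a wrong answer shows up next to the function that caused it, instead of somewhere in the output of an hour-long run. The tolerance is an assumption too: a parallel version that changes the order of floating-point operations may need a looser `rel_tol`.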
So what did I learn?
- Don’t believe the hype around technologies.
- Testing is not optional: you can’t cut back on testing, you can’t just jump in and do stuff because you’re going to waste all of your time, you’re going to have to abandon it and go back and write tests anyway.
- As a more management-related one: it’s hard to lead a team and also keep to a timeline while you’re doing software development. You have to bear that in mind when you’re deciding who’s going to do what work.