Doing good work can make you feel a bit stupid, well thats my mixed bag of feelings for this weekend. Here is why…
Last week was a rollercoaster of a week for software development at Software Verification.
Off by one, again?
First off we found a nasty off-by-one bug in our nifty memory mapped performance tools, specifically the Performance Validator. The off-by-one didn’t cause any crashes or errors or bad data or anything like that. But it did cause us to eat memory like nobodies business. But for various reasons it hadn’t been found as it didn’t trigger any of our tests.
Then along comes a customer with his huge monolithic executable which won’t profile properly. He had already thrown us a curve balled by supplying it as a mixed mode app – half native C++, half C#. That in itself causes problems with profiling – the native profiler has to identify and ignore any functions that are managed (.Net). He was pleased with that turnaround but then surprised we couldn’t handle his app, as we had handled previous (smaller) versions of his app. The main reason he was using our profiler is that he had tried others and they couldn’t handle his app – and now neither could we! Unacceptable – well that was my first thought – I was half resigned to the fact that maybe there wasn’t a bug and this was just a goliath of an app that couldn’t be profiled.
I spent a day adding logging to every place, no matter how insignificant, in our function tree mapping code. This code uses shared memory mapped space exclusively, so you can’t refer to other nodes by addresses as the address in one process won’t be valid in the other processes reading the data. We had previously reorganised this code to give us a significant improvement in handling large data volumes and thus were surprised at the failure presented to us. Then came a long series of tests, each which was very slow (the logging writes to files and its a large executable to process). The logging data was huge. Some of the log files were GBs in size. Its amazing what notepad can open if you give it a chance!
Finally about 10 hours in I found the first failure. Shortly after that I found the root cause. We were using one of our memory mapped APIs for double duty. And as such the second use was incorrect – it was multiplying our correctly specified size by a prefixed size offset by one. This behaviour is correct for a different usage. Main cause of the problem – in my opinion, incorrectly named methods. A quick edit later and we have two more sensibly named methods and a much improved memory performance. A few tests later and a lot of logging disabled and we are back to sensible performance with this huge customer application (and a happy customer).
So chalk up one “how the hell did that happen?” followed by feelings of elation and pleasure as we fixed it so quickly.
I’m always amazed by off-by-one bugs. It doesn’t seem to matter how experienced you are – it does seem that they do reappear from time to time. Maybe that is one of the persils of logic for you, or tiredness.
I guess there is a Ph.D. for someone in studying CVS commits, file modification timestamps and off-by-one bugs and trying to map them to time-of-day/tiredness attributes.
That did eat my Wednesday and Thursday evenings, but it was worth it.
Not to be outdone…
I had always thought .Net Coverage Validator was a bit slow. It was good in GUI interaction tests (which is part of what .Net Coverage Validator is about – realtime code coverage feedback to aid testing) but not good on long running loops (a qsort() for example). I wanted to fix that. So following on from the success with the C++ profiling I went exploring an idea that had been rattling around in my head for some time. The Expert .Net 2.0 IL Assembler book (Serge Lidin, Microsoft Press) was an invaluable aid in this.
What were we doing that was so slow?
The previous (pre V3.00) .Net Coverage Validator implementation calls a method for each line that is visited in a .Net assembly. That method is in a unique DLL and has a unique ID. We were tracing application execution and when we found our specific method we’d walk up the callstack one item and that would be the location of a coverage line visit. This technique works, but it has a high overhead:
- ICorProfiler / ICorProfiler2 callback overhead.
- Callstack walking overhead.
The result is that for GUI operations, code coverage is fast enough that you don’t notice any problems. But for long running functions, or loops code coverage is very slow.
This needed replacing.
What are we doing now that is so fast?
The new implementation doesn’t trace methods or call a method of our choosing. For each line we modify a counter. The location of the counter and modification of it are placed directly into the ilAsm code for each C#./VB.Net method. Our first implementation of .Net Coverage Validator could not do this because our shared memory mapped coverage data architecture did not allow it – the shared memory may have moved during the execution run and thus the embedded counter location would be invalidated. The new architecture allows the pointer to the counter to be fixed.
The implementation and testing for this only took a few hours. Amazing. I thought it was going to fraught with trouble, not having done much serious ilAsm for a year or so.
The new architecture is so lightweight that you barely notice the performance overhead. Less than 1%. Your code runs just about at full speed even with code coverage in place.
As you can imagine, getting that implemented, working and tested in less than a day is an incredible feeling. Especially compared to the previous performance level we had.
So why feel stupid?
Having acheived such good performance (and naturally feeling quite good about yourself for a while afterwards) its hard not to look back on the previous implementation and think “Why did we accept that?, We could have done so much better”. And that is where the feeling stupid comes in. You’ve got to be self critical to improve. Pat yourself on the back for the good times and reflect on the past to try to recognise where you could have done better so that you don’t make the same mistake in the future.
And now for our next trick…
The inspiration for our first .Net Coverage Validator implementation came from our Java Coverage Validator tool. Java opcodes don’t allow you to modify memory directly like .Net ilAsm does, so we had to use the method calling technique for Java. However given our success with .Net we’ve gone back to the JVMTI header files (which didn’t exist when we first wrote the Java tools) and have found there may be a way to improve things. We’ll be looking at that soon.