A coworker asked me to look at a mysterious segfault crash today, in some simple c++ code that didn’t look like it could possibly crash. Further adding to the mystery, it was 100% reproducible on Android and OSX/iOS, but not PC. Here it is (with a whole bunch of irrelevant details removed):
Sprite* sprite0 = /*some known-valid, non-null pointer*/
Sprite* sprite1 = &FindChildRef(someString);
sprite0 = sprite1;
sprite0->SomeTrivialFunction(); // segfault here because sprite0 is null
sprite0 starts out as a valid pointer. We then search our scenegraph for another sprite, and if we find it, assign it to sprite0 instead.
Then we crash because sprite0 is null, despite that we know it started out valid, and we null checked the only case where we might reassign it.
So. Do you see the problem?
If not, watch this very entertaining cppcon talk about optimization and undefined behavior, and come back. I’ll wait.
Right, so the obvious bit of bad code smell here is the FindChildRef call– that’s weird, right? It returns a reference, but then we convert the reference to a pointer, and null-check it. Which means that we’re assuming the function can return a reference that, converted to a pointer, would point to null. This should not be possible–references in c++ cannot be null. What’s going on?
For that matter, what does FindChildRef return if it doesn’t find anything with the requested name?
It turns out the implementation of that function was doing something deeply problematic:
Sprite& FindChildRef(std::string name) const
Now, the FindChild function it calls does return a pointer, but some prior programmer on this project, a big fan of C# or Java perhaps, didn’t much care for that, and wrote a function that returned a reference instead. Which works just fine, until GetChild fails to find something and returns null. Now, we’ve dereferenced a null pointer to get a reference to return, and we’re in the magic land of undefined behavior.
Still, though, we do null check the result, right?
This is where optimization comes in. Since a reference can’t be null, a pointer that was just taken from a reference also cannot be null. So the compiler, very wisely, sees that ‘if (sprite1)’ condition, reasons that it must always be true, and removes it, because it’s redundant.
But of course the reference is null, because we forced it to be. And thus, we reach the seemingly impossible assignment.
(And why does it happen on every platform but PC? Simple–PC is the only platform we don’t use clang on, and clang is more aggressive than the Visual Studio compiler with this kind of optimization.)
I admit I was delighted to discover this–a real-life example of a class of problem I’d previously only read about. And, of course, the fix was simple: replacing FindChildRef with FindChild.
(…and, of course, scrubbing FindChildRef entirely from our code base.)