Tuesday, January 28, 2014

Another 'fun' bug

Time for another bug story - this one was fun because it is a combination of programmer (me) error and a compiler error (specifically a PGO-only error).

The bug in question is bug 959842 (in particular, the second patch).

The thing that made this bug 'fun' was that it only showed up in PGO builds. PGO is profile guided optimisation, but is often used as a shorthand for both that and link time optimisation (LTO). PGO is a further optimisation pass that occurs at link time where the optimiser has global knowledge of the program (and profiles of the running program) so can do sophisticated optimisations without having to worry about 'open world' constraints. PGO is a pain because it take a long time and can be buggy (usually in the manner of a build failing at the PGO stage when it builds fine otherwise, but not this time). However, it gets us significant speed up, so Firefox releases are always built using PGO. Try builds do not use PGO and only occasional pushes to inbound do (I think). I have not got PGO builds running locally. That made this bug difficult to debug, because testing required throwing a build at try (you can change the mozconfig to get a PGO build from try), waiting for five hours, downloading the build and seeing if it crashed. Furthermore, its an optimised build so debugging using a debugger is a real pain.

The bug was also fun because it was (in part) due to me trying to be too clever with my code and falling between the cracks of C++ semantics.

So, we have an object RotatedBuffer which holds on to a DrawTarget. That DrawTarget can be loaned out to other objects so they can draw into it. However, it is important that only one other object has the DrawTarget at a time (because depending on how the DrawTarget will be used, the RotatedBuffer will set a transform on it). To handle this we have methods BorrowDrawTargetFor... and ReturnDrawTarget. The former returns a DrawTarget, the latter is used when the client object is done drawing. It looks like this:

void
BorrowDrawTarget::ReturnDrawTarget(gfx::DrawTarget*& aReturned)
{
  MOZ_ASSERT(aReturned == mLoanedDrawTarget);
  mLoanedDrawTarget->SetTransform(mLoanedTransform);
  mLoanedDrawTarget = nullptr;
  aReturned = nullptr;
}

The method takes a reference to a pointer to a DrawTarget. The last statement sets the referred to pointer to null, so that the client can no longer use the DrawTarget. That is me trying to be too clever.

The problem happens because due to an inheritance tangle we need a sub-class to have a wrapper method which statically dispatches to the base class (BorrowDrawTarget, which is a mixin class which implements the 'loan out a DrawTarget' behaviour). There are a few of these wrappers, which look like:

virtual void ReturnDrawTarget(gfx::DrawTarget* aReturned) MOZ_OVERRIDE
{
  BorrowDrawTarget::ReturnDrawTarget(aReturned);
}


(Now renamed to ReturnDrawTargetToBuffer, but that is not important).

The programmer-error part of the bug was missing the '&' after 'DrawTarget*' in the signature of these wrapper methods. That means that the caller's pointer is copied to the wrapper method, which passes a reference to its argument to BorrowDrawTarget::ReturnDrawTarget. So it is the wrapper's copy of the pointer which is set to null and the clients copy remains in existence, potentially to be (ab-)used. Unfortunately this is all valid C++ so the compiler gives us no warning that something is amiss. Indeed, running Firefox like this works just fine (since the client never actually reuses its returned pointer).

Until that is, PGO works its magic. Then (I assume, I couldn't actually make sense of the disassembly), the wrapper function is inlined and so the wrapper's copy of the pointer no longer exists. Something gets set to null by BorrowDrawTarget::ReturnDrawTarget, but I have no idea what, it certainly isn't the client's pointer. It causes a segfault and Firefox crashes. This part is a compiler error, specifically an error in the PGO(/LTO) optisation pass. That is because optimisation should not change the observable behaviour of code and this optimisation clearly does.

So, the combination of a programmer trying to be too clever, a typo, and a PGO bug caused a crash which was more than a little bit frustrating to track down, but now (hopefully) it is fixed.

3 comments:

RyanVM said...

FYI, PGO builds run every 3 hours on inbound and 6 hours on m-c. And nightlies are always PGO.

Anonymous said...

Craziness like needing a mix-in class BorrowDrawTarget is exactly the kind of thing that makes people avoid C++ like the plague.

mayankleoboy1 said...

so is this compiler bug worth reporting to MS