A Use for Volatile in Multi-threaded Programming
As anyone who has done serious shared-memory parallel coding knows, the volatile keyword in C (and C++) has a pretty bad rap. There is no shortage of people trying to explain why this is the case. Take, for example, this article from the original author of Intel Threading Building Blocks (a man who, I assure you, knows what he's talking about): Volatile: Almost Useless for Multithreaded Programming. There are others out there who decry volatile, and their arguments are right to varying degrees. The heart of the issue is, first, that volatile is EXPLICITLY IGNORABLE in the C specification, and second, that it provides neither ordering guarantees nor atomicity. Let me say that again, because I've had this argument:
VOLATILE DOES NOT ESTABLISH ORDERING REQUIREMENTS
VOLATILE DOES NOT PROVIDE ATOMICITY
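To make the atomicity point concrete, here is a tiny example (mine, not from the article above). Incrementing a volatile int is still an unlocked read-modify-write, so concurrent increments can be lost; all volatile buys you is that the compiler won't cache the value in a register or optimize the accesses away.

volatile int counter = 0;

void hit(void)
{
    /* NOT atomic: even if the compiler emits a single add-to-memory
     * instruction, the read-modify-write has no lock prefix, so two
     * threads can both observe the same value and one increment is
     * lost.  volatile only forbids caching or eliding the accesses. */
    counter++;
}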
But I’m not here to talk about that; I want to talk about a place where I found it to be critical to correctness (as long as it isn’t ignored by the compiler, in which case creating correct code is painful). Honestly, I was quite surprised about this, but it makes sense in retrospect.
I needed a double-precision floating-point atomic increment. Most increments, of the __sync_fetch_and_add() variety, operate exclusively on integers. So, here's my first implementation (just the x86_64 version, without the PGI bug workarounds):
double qthread_dincr(double *operand, double incr)
{
    union {
        double   d;
        uint64_t i;
    } oldval, newval, retval;

    do {
        oldval.d = *operand;
        newval.d = oldval.d + incr;
        /* lock cmpxchgq: if *operand still holds oldval (passed in via
         * %rax), store newval; either way, %rax receives the value that
         * was actually observed in memory. */
        __asm__ __volatile__ ("lock; cmpxchgq %1, (%2)"
                              : "=a" (retval.i)
                              : "r" (newval.i), "r" (operand),
                                "0" (oldval.i)
                              : "memory");
    } while (retval.i != oldval.i);
    return oldval.d;
}
Fairly straightforward, right? But this has a subtle race condition in it. The dereference of operand gets translated to the following assembly:
movsd (%rcx), %xmm0
movd (%rcx), %rax
See the problem? In the assembly, it's actually dereferencing operand TWICE; and under contention, that memory location could change values between those two instructions. When that happens, the cmpxchg can succeed comparing against a value that is not the one the addition used, silently clobbering another thread's update. Now, we might pause to ask: why is it doing that? We only told it to go to memory ONCE; why would it go twice? Memory accesses are usually slow, so you'd think the compiler would try to avoid them. Well, a certain amount of that is unfathomable, but there is a plausible culprit: the value is needed both as a double in an SSE register (for the addition) and as a 64-bit integer in a general-purpose register (for the cmpxchg input), and rather than move it between register files, the compiler apparently just loads it from memory twice. And technically, dereferencing non-volatile memory multiple times is perfectly legal. The point is, this is what happened when compiling with basically every version of gcc 4.x right up through the latest, gcc 4.7.1.
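If you want to watch the race fire, a minimal pthreads harness along these lines (my addition, not from the original post) will usually do it: hammer the racy qthread_dincr() above from two threads, and the final total tends to come up short.

#include <pthread.h>
#include <stdio.h>

double qthread_dincr(double *operand, double incr); /* the racy version above */

static double total = 0.0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        qthread_dincr(&total, 1.0);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expect exactly 2000000.0; with the racy load, it is often less. */
    printf("total = %f (expected 2000000.0)\n", total);
    return 0;
}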
In any event, there are two basic ways to fix this problem. The first would be to code more things in assembly: either the entire loop, or maybe just the dereference. That's not an appealing option, because it requires me to pick which floating-point unit to use (SSE versus 387 versus whatever fancy new stuff comes down the pike), and I'd rather let the compiler do that. The second way to fix it is to use volatile. If I change that dereference to this:
oldval.d = *(volatile double *)operand;
Then the assembly it generates looks like this:
movsd (%rcx), %xmm0
movd %xmm0, %rax
Problem solved! As long as the compiler doesn’t ignore the volatile cast, at least…
So, for those who love copy-and-paste, here’s the fixed function:
#include <stdint.h> /* for uint64_t */

double qthread_dincr(double *operand, double incr)
{
    union {
        double   d;
        uint64_t i;
    } oldval, newval, retval;

    do {
        /* The volatile cast forces exactly one load of *operand. */
        oldval.d = *(volatile double *)operand;
        newval.d = oldval.d + incr;
        __asm__ __volatile__ ("lock; cmpxchgq %1, (%2)"
                              : "=a" (retval.i)
                              : "r" (newval.i), "r" (operand),
                                "0" (oldval.i)
                              : "memory");
    } while (retval.i != oldval.i);
    return oldval.d;
}
(That function will not work in the PGI compiler, due to a compiler bug I’ve talked about previously.)
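As an aside, the same loop can be written without inline assembly using GCC's __sync_val_compare_and_swap builtin; here's a sketch (my variant, not the Qthreads implementation). The volatile cast is still doing the real work; the builtin merely stands in for the hand-written cmpxchgq.

#include <stdint.h>

double dincr_builtin(double *operand, double incr)
{
    union {
        double   d;
        uint64_t i;
    } oldval, newval;
    uint64_t observed;

    do {
        oldval.d = *(volatile double *)operand; /* exactly one load */
        newval.d = oldval.d + incr;
        /* CAS on the 64-bit bit pattern.  Casting double * to uint64_t *
         * bends strict aliasing, but it's the same bit-pattern game the
         * union (and the inline-asm version) already plays. */
        observed = __sync_val_compare_and_swap((uint64_t *)operand,
                                               oldval.i, newval.i);
    } while (observed != oldval.i);
    return oldval.d;
}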