Monday, September 29, 2008

How Do Locks Lock?

When I was 9 or so, I was convinced I could write my own operating system in GW-BASIC. After all, I had written programs that played Mary Had a Little Lamb on the PC speaker while displaying CGA graphics. How much harder could an operating system be?

One thing had me stumped. How would I get my OS to multitask?

After giving it some thought, the best idea I could come up with was to write the programs for my "OS" in such a way that they would occasionally perform a GOSUB or GOTO to parts of other programs. By using a lot of mental handwaving to myself, I was convinced I could somehow make it work.

But the details came back to bite me. I never pulled it off. I put it aside and worked on other things. I just assumed that something magical happened under the covers. My bedtime reading of magazines like PC/Computing convinced me that the magical solution to the problem included buzzwords like "preemptive multitasking" and "multithreading." Both of these ideas were in the upcoming Windows NT 3.1 design. I didn't really understand what these buzzwords meant, but at least they sounded impressive.

Even though I've learned a bit more about writing programs that can do more than one thing at a time since those early days, I still have points where I don't understand exactly what is going on. Recently, I decided that I wanted to stop my handwaving about locks and do whatever it took to find out how they, well... lock.

A View from the Top

Locks let you express your belief that something bad might happen if two threads go through a section of code at the same time. Locks are present in many languages. Here's an example of one in C#:

object syncRoot = new object();

private void MyFunction()
{
    lock (syncRoot)
    {
        // only one thread can enter this locked gate at a time.
    }
}

The above lock creates a "gate" that only allows one thread at a time regardless of what that thread is doing. This is often too pessimistic and inefficient. In a lot of cases, it doesn't matter if multiple threads are reading a piece of data at the same time. However, we want to make sure that there is only one writer at a time. Furthermore, we want to make sure that the readers don't see corrupted garbage.

A lock that is optimized for allowing multiple readers at a time while still being consistent in the face of writers is called a reader/writer lock. Here's an example of some typical code you might see in an application that has to cache objects that take a long time to create:

private readonly Dictionary<string, ReallyExpensiveObject> expensiveObjectCache
    = new Dictionary<string, ReallyExpensiveObject>();

private readonly ReaderWriterLock rwl = new ReaderWriterLock();

public ReallyExpensiveObject GetExpensiveObject(string key)
{
    ReallyExpensiveObject expensiveObject;

    rwl.AcquireReaderLock(Timeout.Infinite);

    try
    {
        if (this.expensiveObjectCache.TryGetValue(key, out expensiveObject))
        {
            // Cache hit
            return expensiveObject;
        }

        // Cache miss
        LockCookie cookie;

        cookie = rwl.UpgradeToWriterLock(Timeout.Infinite);

        try
        {
            expensiveObject = new ReallyExpensiveObject();
            expensiveObjectCache[key] = expensiveObject;
            return expensiveObject;
        }
        finally
        {
            rwl.DowngradeFromWriterLock(ref cookie);
        }
    }
    finally
    {
        rwl.ReleaseReaderLock();
    }
}

This code will allow more than one reader at a time and will handle the case of upgrading the read lock to a write lock if it needs to insert an item into the cache. The code is a bit long and tedious to write. [1] Even worse, it has a very subtle bug in it that we'll look at later.

We first need to understand what is going on inside a lock. Since reader/writer locks are usually more interesting than a normal lock, we'll focus on the details of a reader/writer lock implementation. [2]

Entering an Internal Lock

Locks themselves have internal data that they must protect from corruption by competing threads. This internal data includes the number of active readers and how many threads are waiting to write data. In order to provide this protection, locks must enter an internal lock.

For simplicity, locks usually have a single variable that indicates if a thread has acquired the internal lock. I'll illustrate this and other concepts by posting snippets from Vance Morrison's excellent reader-writer lock sample: [3]

private void EnterMyLock()
{
    if (Interlocked.CompareExchange(ref myLock, 1, 0) != 0)
        EnterMyLockSpin();
}

In Vance's code, a member variable named "myLock" is 1 if the internal lock is held and 0 otherwise. The Interlocked.CompareExchange function is interesting because it does two things at once: it compares "myLock" to 0 and, if it matches, sets it to 1. It also returns the value that "myLock" had before the call, so a non-zero return value tells us that someone else already has the internal lock. The documentation says that "the compare and exchange operations are performed as an atomic operation." I was curious how this actually worked, so I looked at .NET's source code and eventually hit this wall:

[MethodImplAttribute(MethodImplOptions.InternalCall)]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
public static extern int Exchange(ref int location1, int value);

In my earlier .NET days, I would have given up at this point. The "InternalCall" attribute indicates that the CLR handles the call internally. It seemed like I would never quite know what happened inside this mysterious internal call.

And then I met Rotor.

Inside an InternalCall

Rotor lets you look at the CLR's internals, which are mostly written in C++ with a small amount of assembly code. [4] If you navigate through several layers of function calls, you'll see that Interlocked.CompareExchange ultimately reaches this function in clr/src/vm/i386/asmhelpers.asm, written in x86 assembly: [5]

FASTCALL_FUNC CompareExchangeMP,12
_ASSERT_ALIGNED_4_X86 ecx
mov eax, [esp+4] ; Comparand
lock cmpxchg [ecx], edx
retn 4 ; result in EAX
FASTCALL_ENDFUNC CompareExchangeMP

By far, the most interesting aspect of this is the "lock cmpxchg" line. Besides giving gamers the power to play DOOM, the 486 chip introduced a "cmpxchg" instruction that will both compare a value and exchange it with a new value if the comparison succeeds. It does this in a single CPU instruction. This is how the documentation can claim that it is an "atomic" operation. We'll see why it's very important to do this in a single instruction a bit later. The "lock" prefix tells the CPU to assert a physical signal so that no other processor can access that memory location at the same time. [6] This hardware locking is how the "myLock" variable can remain safe from simultaneous access by other cores.
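
To make the semantics concrete, here's a sketch in plain C# of what CompareExchange promises (this is not how it's implemented; the whole point is that the hardware does the compare and the conditional store as one uninterruptible cmpxchg instruction rather than the several separate steps shown here):

static int CompareExchangeSketch(ref int location, int value, int comparand)
{
    int original = location;       // read the current value
    if (original == comparand)     // swap only if it matches what we expected
        location = value;
    return original;               // a caller checks this to see whether the swap happened
}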

But now what happens if another thread already had the internal lock?

Spinlocks

When locks need to wait on the internal lock, they typically use a spinlock. This tells the processor to essentially spin around in circles in order to waste a little bit of time so that the thread that has the internal lock can possibly finish what it is doing and exit the lock.

Here's an example from Vance's code of how a spinlock is performed:

private void EnterMyLockSpin()
{
    for (int i = 0; ; i++)
    {
        if (i < 3 && Environment.ProcessorCount > 1)
            Thread.SpinWait(20);    // Wait a few dozen instructions to let another processor release the lock.
        else
            Thread.Sleep(0);        // Give up my quantum.

        if (Interlocked.CompareExchange(ref myLock, 1, 0) == 0)
            return;
    }
}

It's a basic for loop that has a few interesting bits. The first thing to notice is that this loop will keep going until the CompareExchange succeeds. That is, until we acquire the internal lock.

On a multiprocessor system, it will try Thread.SpinWait up to three times. The documentation for SpinWait is a bit confusing about what it does with the "20" argument:

SpinWait essentially puts the processor into a very tight loop, with the loop count specified by the iterations parameter. The duration of the wait therefore depends on the speed of the processor.

What exactly is it doing? Looking at Rotor's clr/src/vm/comsynchronizable.cpp gives us the reality:

FCIMPL1(void, ThreadNative::SpinWait, int iterations)
{
    WRAPPER_CONTRACT;
    STATIC_CONTRACT_SO_TOLERANT;

    for (int i = 0; i < iterations; i++)
        YieldProcessor();
}
FCIMPLEND

Further diving shows that "YieldProcessor" is this macro:

#define YieldProcessor() __asm { rep nop }

This is a "repeat no-op" assembly instruction. It's also known in the Intel instruction set manual as "PAUSE - Spin Loop Hint." [7] This means that the CPU knows about the spin waiting that we're wanting to accomplish.

As the documentation hinted, the time spent waiting is dependent upon the clock rate of the machine. In my case, I have an E6600 Core 2 Duo where each core runs at 2.4 GHz, and each "YieldProcessor" call takes about 3.4 nanoseconds. [8] This means it takes approximately 70 nanoseconds for 20 iterations. Three unsuccessful passes through this loop will cost about 210 nanoseconds.

I'm sure that if reader/writer locks had cheerleaders inside of them, they would be shouting in hope that the spin wait burned just enough time for the thread holding the internal lock (the one guarded by "cmpxchg") to have released it.

Why does the lock try so hard to get by with only a spin wait? In a word: speed.

If a spin wait can't satisfy the request, the thread gives up and goes to sleep. This requires intervention by the OS to potentially swap it out and run a different thread that has been waiting. By letting another thread run, there is a chance that the CPU will have to toss out its onboard cache only to fetch it back when the thread gets swapped back in. All of this could potentially add up to hundreds of thousands of CPU cycles that aren't being used to get "real" work done. [9]

How Does a Thread Sleep? How Does it Know When to Wake Up?

Windows keeps track of the state of each thread in your system. [10] Here's a simplified thread state chart from the excellent Windows Internals book:



When your thread is "Running" and is interrupted because its time-slice ran out or it voluntarily gave up its time slice by using a Sleep(0) request, the OS will save the state of what it is doing by performing a context switch. The fact that this switch can occur at any time is why it was critical that the CompareExchange code execute in a single CPU instruction that cannot be interrupted. If it didn't, it would be possible to put the lock in an inconsistent state between the compare and the exchange.

When the Sleep(0) call is made, the thread is put in a "Ready" queue meaning that it's "ready" to run and not waiting on anything. [11]

While we're at this chart, it's important to note that one of the possible thread states is "waiting." Windows will keep track of the events that a thread is waiting on.

Events are useful for signaling between threads. For example, a reader/writer lock can "signal" other threads by informing the OS that an event has been "set." What happens next inside of the thread scheduler/dispatcher depends on the type of event that was set: [12]

  • AutoResetEvent: In this case, only a single thread that is waiting for the event will be put into the "ready" queue. Anyone else that is waiting for the event stays in the waiting queue. This is useful for releasing exactly one writer.
  • ManualResetEvent: When signaled, Windows will update all of the threads that are waiting on the event to be moved into the ready queue. Effectively, it releases them all. This is useful for releasing all waiting readers.

What's important to realize is that we have the cooperation of the thread dispatcher in the OS to obey the thread states. By simply being in the waiting queue, the threads aren't even considered to be scheduled to run and thus consume no real CPU time.
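
As a tiny demonstration of the difference between the two event types (this is standalone sample code, not part of any lock implementation), consider what happens when three threads wait on each kind and we call Set() once:

using System;
using System.Threading;

class EventDemo
{
    static void Main()
    {
        AutoResetEvent auto = new AutoResetEvent(false);
        ManualResetEvent manual = new ManualResetEvent(false);

        for (int i = 0; i < 3; i++)
        {
            new Thread(() => { auto.WaitOne(); Console.WriteLine("auto waiter released"); }).Start();
            new Thread(() => { manual.WaitOne(); Console.WriteLine("manual waiter released"); }).Start();
        }

        Thread.Sleep(500);   // give the six threads time to block

        auto.Set();          // exactly one "auto waiter released" line prints
        manual.Set();        // all three "manual waiter released" lines print
    }
}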

Putting It All Together

It's been a long journey, but we now have all the background information we need in order to understand how a reader/writer lock actually locks:

  • Acquiring a Read Lock:
    • Enter the internal lock (spin wait or sleep as necessary)
    • Look to see if there is anyone writing
      • If no one is writing and no writers are waiting, just increase the number of readers. This is the happy path.
      • If a writer is writing, tell the OS that we want to wait until the "ready for readers" event is set.
    • Exit the internal lock.
  • Exiting a read lock:
    • Enter the internal lock
    • Decrement the total number of readers.
    • Check the total number of readers remaining
      • If there are still readers remaining, then make sure that all waiting readers have been released so they can run.
      • If there are no readers left and there is at least one writer waiting, signal the appropriate event so that exactly one of the waiting writers wakes up and will get the lock.
    • Exit the internal lock.
  • Acquiring a write lock:
    • Enter the internal lock
    • Check the number of readers or if there is an active writer in progress. In addition, check to see if there are any in line waiting to do either.
      • If there are no readers or writers, record that we're writing. This is the happy path.
      • Otherwise wait for our turn in line on the write queue by waiting on the "ready for writing" event.
    • Exit the internal lock
  • Exiting a write lock:
    • Enter the internal lock
    • Check how many readers or writers are waiting
      • If there are no waiting writers, immediately signal all readers at once.
      • Alternatively, release one writer that has been waiting.
    • Exit the internal lock.

As it turned out, the locks themselves are fairly simple. They just had a few pieces that came together in interesting ways.
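
To make that more concrete, here's a stripped-down sketch of just the read-lock paths, in the spirit of Vance's code (the field names are mine, and it omits timeouts, event resets, upgrades, and other corner cases a real implementation needs):

private int myLock;              // 0 = internal lock free, 1 = held (see EnterMyLock above)
private int readerCount;         // number of active readers
private bool writerActive;       // is a writer currently writing?
private int waitingWriters;      // writers queued up behind us
private readonly ManualResetEvent readyForReaders = new ManualResetEvent(false);
private readonly AutoResetEvent readyForWriting = new AutoResetEvent(false);

private void ExitMyLock()
{
    myLock = 0;                  // real code makes sure this write becomes visible to other processors
}

public void AcquireReadLock()
{
    while (true)
    {
        EnterMyLock();                      // spin wait or Sleep(0) as necessary
        if (!writerActive && waitingWriters == 0)
        {
            readerCount++;                  // happy path: just count ourselves in
            ExitMyLock();
            return;
        }
        ExitMyLock();
        readyForReaders.WaitOne();          // sleep until the "ready for readers" event is set
    }
}

public void ReleaseReadLock()
{
    EnterMyLock();
    readerCount--;
    if (readerCount == 0 && waitingWriters > 0)
        readyForWriting.Set();              // wake exactly one waiting writer
    ExitMyLock();
}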

The Problem of the Upgrades

There's a subtle problem with reader/writer locks: Imagine that two threads are both in a read lock and then they both decide that they need to upgrade to a write lock. Because a writer must wait for all readers to finish before acquiring a write lock, there seems to be a deadlock scenario since neither one wants to give up its read lock.

The secret trick is that they indeed must give up their read lock if there are any other active readers. This forces a correct implementation to do something like this when performing an upgrade:

// Cache miss
LockCookie cookie;

int writerSeqNum = rwl.WriterSeqNum;

cookie = rwl.UpgradeToWriterLock(Timeout.Infinite);

try
{
    if (rwl.AnyWritersSince(writerSeqNum))
    {
        // Since another writer snuck in, we might no longer need
        // to create a new object. Let's check again to see if we
        // need to do this.

        if (this.expensiveObjectCache.TryGetValue(key, out expensiveObject))
        {
            // The other writer(s) must have created it
            return expensiveObject;
        }
    }

    // Other writers didn't sneak in, so we know we need to create
    // the object.

    expensiveObject = new ReallyExpensiveObject();

    expensiveObjectCache[key] = expensiveObject;

    return expensiveObject;
}
finally
{
    rwl.DowngradeFromWriterLock(ref cookie);
}

Each time a writer lock is acquired, the WriterSeqNum is incremented. The "AnyWritersSince" function will check to see if any writers have occurred since the moment just before we upgraded. If this is true, then it means that another writer might have come in and thus we need to recheck our assumptions.

This is just one way to handle the upgrade problem. [13]

As Joe Duffy explained on his blog, the new ReaderWriterLockSlim gets around the upgrade problem by introducing an "UpgradeableRead." When a thread enters an upgradeable read, the lock will force all subsequent readers to wait until the thread decides if it wants to upgrade. In the meantime, existing readers are allowed to finish. If an upgrade is requested, the lock will wait until no more readers are present. This preserves the integrity of the data.

Here's a full example:

private ReaderWriterLockSlim slimLock = new ReaderWriterLockSlim();

public ReallyExpensiveObject GetExpensiveObjectSlim(string key)
{
    ReallyExpensiveObject expensiveObject;

    this.slimLock.EnterUpgradeableReadLock();

    try
    {
        if (this.expensiveObjectCache.TryGetValue(key, out expensiveObject))
        {
            // Cache hit
            return expensiveObject;
        }

        // Cache miss
        this.slimLock.EnterWriteLock();

        try
        {
            expensiveObject = new ReallyExpensiveObject();
            expensiveObjectCache[key] = expensiveObject;
            return expensiveObject;
        }
        finally
        {
            this.slimLock.ExitWriteLock();
        }
    }
    finally
    {
        slimLock.ExitUpgradeableReadLock();
    }
}

While it shows the full syntax, it's a bad example of using a ReaderWriterLockSlim since it doesn't allow for multiple readers. If reads occur much more frequently than writes, the code might perform better by just entering a read lock. If a write is actually needed, a subsequent write lock can be acquired and a double check can confirm that the write is still necessary. Alternatively, a classic Monitor-style lock can be used.
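
Here's a sketch of that read-then-double-check alternative (the method name is mine; it reuses the slimLock and expensiveObjectCache fields from the examples above):

public ReallyExpensiveObject GetExpensiveObjectReadMostly(string key)
{
    ReallyExpensiveObject expensiveObject;

    // Cheap path first: a plain read lock that any number of threads can share.
    this.slimLock.EnterReadLock();
    try
    {
        if (this.expensiveObjectCache.TryGetValue(key, out expensiveObject))
        {
            // Cache hit
            return expensiveObject;
        }
    }
    finally
    {
        this.slimLock.ExitReadLock();
    }

    // Cache miss: take the write lock and check again, since another thread
    // may have created the object between our two lock acquisitions.
    this.slimLock.EnterWriteLock();
    try
    {
        if (!this.expensiveObjectCache.TryGetValue(key, out expensiveObject))
        {
            expensiveObject = new ReallyExpensiveObject();
            this.expensiveObjectCache[key] = expensiveObject;
        }
        return expensiveObject;
    }
    finally
    {
        this.slimLock.ExitWriteLock();
    }
}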

Conclusion: Why Should I Care About the Details?

Our journey has taken us from the high level of using a reader/writer lock to the low level of a single signal on the processor bus. Knowing this low level "goo" is helpful for a few reasons:

  1. Ideally, locked sections should be kept small so that lock contention rarely happens in the first place and we never need to "spin wait" to acquire the internal lock.
  2. Since waiting readers are released all at once via an event, we can see how a reader/writer lock gets its performance benefit over a normal Monitor lock, which would force waiting readers to line up and be released one by one.
  3. We now can see that locks aren't magic at all. They just have a few moving parts.

With a better understanding of locks, I'll focus my handwaving on other concepts... like building a work-stealing thread pool.

UPDATE 30 Sep 2008: I fixed the footnote links and updated conclusion #1 based off comments.

UPDATE 3 Oct 2008: I updated the paragraph discussing why spin waits can be helpful based off some good discussion in the comments.


Notes

(This article makes some claims in a few places. I'm sure I got some things wrong or misunderstood something in my explorations. I'm putting notes here to give background on what I was thinking when I wrote the post. Please feel free to correct me and keep me honest via comments for the benefit of others.)

[1] If performance isn't insanely critical, you can create your own lock and have helper methods like "EnterScopedRead" that return an IDisposable object that will automatically release the lock when it's disposed. Then, you can use a C# "using" statement with it to reduce some of the code clutter.
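
For example, a helper along these lines would do it (the names here are mine, sketched against the ReaderWriterLock field from the example above):

private sealed class ScopedReadLock : IDisposable
{
    private readonly ReaderWriterLock rwl;

    public ScopedReadLock(ReaderWriterLock rwl)
    {
        this.rwl = rwl;
        this.rwl.AcquireReaderLock(Timeout.Infinite);
    }

    public void Dispose()
    {
        this.rwl.ReleaseReaderLock();   // releases the read lock at the end of the using block
    }
}

private IDisposable EnterScopedRead()
{
    return new ScopedReadLock(this.rwl);
}

// Usage:
// using (EnterScopedRead())
// {
//     // read from the cache here; the lock is released when the block exits
// }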

[2] I still think that reader/writer locks are more interesting than normal Monitor-type locks. However, if you have a relatively balanced number of readers and writers (e.g. no clear reader majority), then Monitors make more sense since they're slightly faster. The code that implements Monitors uses a lot more assembly for performance reasons, though, and is thus often harder to understand.

[3] Even a cursory look at the source code of .NET 3.5's new ReaderWriterLockSlim in Reflector shows that Vance's lock code was the basis for it. ReaderWriterLockSlim is slightly more complex to handle cases like a recursive lock (e.g. acquiring a read lock when you have a write lock), but the core ideas remain the same.

[4] To be technically correct, Rotor isn't exactly what the commercial CLR is. Some of the aspects like the garbage collector and the Just In Time (JIT) compiler are different due to trade secrets. However, I don't think the parts I'll be mentioning here are different between the reference implementation and the actual commercial implementation. For more details on Rotor, I recommend reading Joel Pobar and Ted Neward's new book on Rotor 2.0.

[5] Things aren't quite that simple. The first thing to realize is that the CLR has a Platform Adaptation Layer (PAL) that handles all architecture specific code. Thus, there can be equivalent assembly code for other chips like the PowerPC which is located in clr/src/vm/ppc. Furthermore, the actual x86 code can be one of these two functions:

FASTCALL_FUNC CompareExchangeUP,12
_ASSERT_ALIGNED_4_X86 ecx
mov eax, [esp+4] ; Comparand
cmpxchg [ecx], edx
retn 4 ; result in EAX
FASTCALL_ENDFUNC CompareExchangeUP

or

FASTCALL_FUNC CompareExchangeMP,12
_ASSERT_ALIGNED_4_X86 ecx
mov eax, [esp+4] ; Comparand
lock cmpxchg [ecx], edx
retn 4 ; result in EAX
FASTCALL_ENDFUNC CompareExchangeMP

If you look at cgenx86.cpp, you'll see that CompareExchangeUP is the default and the CompareExchangeMP is used if your system has more than one processor:

CmpXchgOps FastInterlockCompareExchange = (CmpXchgOps)CompareExchangeUP;

// Adjust the generic interlocked operations for any platform specific ones we
// might have.

void InitFastInterlockOps()
{
    ...

    if (g_SystemInfo.dwNumberOfProcessors != 1)
    {
        ...

        FastInterlockCompareExchange = (CmpXchgOps)CompareExchangeMP;

        ...
    }

    ...
}

From this, I was able to determine that "CompareExchangeUP" is for uniprocessor machines and "CompareExchangeMP" is for multiprocessor machines. If you look at the code carefully, you'll notice that the only difference is that the uniprocessor version doesn't bother locking the processor bus. This is safe because there is no chance of simultaneous memory access.

[6] The truth is a little bit more complicated, but more interesting. The Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3A: System Programming Guide, Part 1 tells us:

7.1.4 Effects of a LOCK Operation on Internal Processor Caches.

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow [its] cache coherency mechanism to insure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have the same area of memory from simultaneously modifying data in that area. (emphasis added)

Here we learn that the P6 and newer chips are smart enough to determine if they really have to block off the bus or can just rely on intelligent caching. I think this is a neat optimization.

[7] I think the "PAUSE" instruction is an interesting example of how Intel allows for backwards compatibility. Its opcode of "F3 90" makes it identical to "REP NOP," which causes older processors to simply do nothing; newer chips, however, treat this odd opcode as a clear hint to do more efficient things.

[8] If you're curious about where I came up with the numbers, look at this discussion on the Intel forums. It's a bit hard to measure since context switches might skew the results. My lowest run yielded numbers that indicate a nop took 0.34 clocks per instruction and that "PAUSE" took 8.21 clocks per instruction. I did (2,400,000,000 clocks / second) * (1 PAUSE / 8.21 clocks) to get 292,326,431 PAUSEs / second or 3.42 ns / PAUSE.
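
If you'd like to ballpark the number on your own machine, a rough (and admittedly noisy) way is to time a large batch of SpinWait calls; context switches and power management will skew the result, so treat it only as an estimate:

using System;
using System.Diagnostics;
using System.Threading;

class SpinWaitTimer
{
    static void Main()
    {
        const int iterations = 10000000;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Thread.SpinWait(1);     // one YieldProcessor (PAUSE) per call, plus loop overhead
        sw.Stop();

        double nsPerCall = (sw.Elapsed.TotalMilliseconds * 1000000.0) / iterations;
        Console.WriteLine("~{0:F2} ns per SpinWait(1) call", nsPerCall);
    }
}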

[9] UPDATE: The original version of this note claimed that this was 100,000 times slower. After some good discussion in the comments, I'm not going to claim specifics since it depends a lot on the machine's workload and a thread's working set.

It's interesting to note that Vance's original code gives up after only three tries of around 70 ns each. The ReaderWriterLockSlim that was ultimately derived from it keeps up the fight a little longer. It does something like "Thread.SpinWait(20 * (i + 1))", a linearly growing back-off, for up to 10 iterations. That adds up to the equivalent of 1,100 "YieldProcessor" calls, or approximately 3.76 microseconds, which means it stays in the fight roughly 3.76 µs / 210 ns, or about 18 times, longer.

(See the comments for more discussion on this)
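
As a sketch of what that back-off might look like (my paraphrase of the idea, not the actual ReaderWriterLockSlim source):

private void EnterMyLockWithBackoff()
{
    for (int i = 0; ; i++)
    {
        if (i < 10 && Environment.ProcessorCount > 1)
            Thread.SpinWait(20 * (i + 1));   // spin a little longer each pass: 20, 40, ... 200 iterations
        else
            Thread.Sleep(0);                 // give up the rest of the quantum

        if (Interlocked.CompareExchange(ref myLock, 1, 0) == 0)
            return;                          // we now hold the internal lock
    }
}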

[10] It's important to realize that processes don't run on Windows, but threads do. Although processes provide isolation boundaries, they require a thread to execute code.

[11] The Thread.Sleep(0) call ultimately reaches the SleepEx API that has this documentation:

A[n argument value of] zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution.

It's interesting that ReaderWriterLockSlim does a Sleep(0) five times before finally doing Sleep(1) until it succeeds. I'm assuming it does the Sleep(1) in order to protect users from themselves in the event that the thread holding the lock goes into a state where it's waiting on something (e.g. I/O). Lock-sensitive code should not be doing this, but it doesn't hurt to provide this forced 1 ms minimum delay. Without it, the thread holding the lock would sit on the waiting queue until its I/O finished, while Sleep(0) would keep putting the thread that needs the lock right back on the ready queue, burning CPU without making any progress.

Finally, I'm assuming that SleepEx(0) performs a software interrupt. This is like a GOTO that jumps to a well-known address that handles the scheduling/dispatching of threads in the kernel.

[12] I'm simplifying things a bit here. Threads can wait on multiple events at one time. Therefore, setting one event might still mean that a thread is waiting on other events.

As an aside, it was only while researching this post that I realized why the WaitForMultipleObjects API is so useful. By using it, you can avoid the very costly process of putting your thread back on the ready queue and getting scheduled, only to find out that it still needs to wait and thus go back on the wait queue.
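
In managed code, the analogous calls are WaitHandle.WaitAny and WaitHandle.WaitAll. A minimal sketch (the event names here are hypothetical):

// Block until either event is signaled; WaitAny returns the index of the
// handle that woke us up, so we don't bounce through the ready queue once per event.
WaitHandle[] handles = new WaitHandle[] { readyForReaders, readyForWriting };
int signaledIndex = WaitHandle.WaitAny(handles);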

[13] While the ReaderWriterLockSlim handles things properly, the early version that Vance posted will throw an error on a simultaneous upgrade. This is technically another way to handle the upgrade problem :-).

24 comments:

Anonymous said...

I like your post. It talks about a nice area which I find mentally enjoyable...

However, I would like to point out two things for the sake of being a stickler, and just maybe make you see locks in a different way at the same time.

The first thing is that cmpxchg is the essence of any synchronization object on any system. lock cmpxchg is even more powerful than that in that it locks the entire memory bus (and makes an operation atomic across multiple processors). Obviously, the lock prefix can be a huge performance hit (potentially ruining L2 caches etc), but the cmpxchg instruction really isn't. It's lightning fast.

That being said, locks are not magical mythical beasts. True user mode locks are implemented in kernel mode using a combination of the aforementioned cmpxchg operand, and an active cooperation of the scheduler itself. They are simply features of the scheduler: do the cmpxchg, if the lock is taken, schedule the thread accordingly, otherwise move on. Nothing more.

In a sense, it's all about higher authority: Kernel mode rules User mode, just as service rules client. As a user, you are at the whim of what the kernel decides. Spinlocks in that sense should only ever exist in kernel mode - and yes, there's an exception which I will get to. Why is this? Because in kernel mode, you can legitimately require a lock that is kernel wide, ie: outside the realm of a scheduler. In this case, when there is no higher authority left, all you can do is pit the multiple processors against each other and waste resources. (Note that spinlocks are entirely useless on single processor systems).
The only legitimate use of a spin lock is for Critical Section object which are part of the Win32, or similar implementations of them. Once again, they are only useful on multi processor systems, otherwise they are simply normal locks (read: the critical section is useless). The other good advantage of critical sections is that they can be non blocking: ie you can try and acquire them and return immediately if they're already taken (this without going to kernel).

The reason I'm dwelling so much on a spinlock is this: for a spin lock to become freed (during the spinning motion), another thread has to release the lock. Now, on the off chance that another processor is running your release code at the very same moment that the spinlock is spinning, you might actually get a chance to get a break. However, in any other case (and I do mean most other cases since a 70 nanosecond spin is quasi singular if your time slice is generally no shorter than 20ms on NT/2k/xp), even with a successful spinlock - which from usermode looks like it never called the WaitFor... API, and hence never blocked - there will necessarily have been a context switch by the kernel from your code to a different thread which releases the lock, and then a subsequent rescheduling of your spin lock thread.

Now, if you go back to what I was pointing out in the beginning, a real lock is only a feature of the scheduler anyways. So, if I go back to this statement:
"By not being able to satisfy the request with a spin wait, the thread gives up and goes to sleep. This requires intervention by the OS and could be more than a hundred thousand times slower."

My problem with that statement is simply that if your spinlock succeeded, odds are very high that it succeeded because: a) your thread got pre-empted while spinning (ie. went to sleep), b) the releasing thread got task switched in, c) the scheduler re-context switched your thread back in.

If you don't believe the odds of what I'm saying, try this math out: open any application (such as firefox or whatever) which you've been using for hours in taskmanager. Look at the total CPU time it's used. It will be in the mere seconds unless you are processing video. Seconds that are scheduled at best in 20ms chunks. 20ms chunks that are rarely if ever fully used up. So you've got over several hours of usage, a few seconds of actual CPU time, spread over small spots of 20ms chunks... how high do you think your odds of hitting two concurrent pieces of code on a 70 nano second long cross section are?

CmpXchg is great for quickly checking if a lock is taken. Spin locks are wholly useless unless you are positive (after many a profile and stress test) that you somehow manage to get many CPUs running your threads during the same 20ms timeslice, and lighting up at the same lock during the same 70ns duration.


Anyways, my two cents. It always helps me to remember that even the most mythical beasts like APCs and IO ports are simple simple things in kernel mode masquerading as complex elves and wizards in userland.

Jeff Moser said...

anonymous: Wow! I wish I could thank you by name for your very thoughtful and detailed comment.

I agree that most of the time a thread doesn't use its full quantum.

I also agree that on a single processor system, spin-waits are useless. I didn't emphasize it, but the code I posted only enters a spinwait if there is more than one processor. Therefore, it will (exactly like you said) immediately sleep and get swapped out and allow the releasing thread to run.

I now realize that I was a bit misleading in the post. Spinlocks are only used to protect the internal lock variables and not much else. If the thread needs to wait for anything, it does a WaitFor... call that would have it most likely be swapped out.

When I was imagining the spin-wait, I was assuming that the spin-locked section would be very small (on the order of a few instructions) to update the bookkeeping of the lock variables (e.g. number of writers waiting).

In an ideal case, the code that the lock is being used to protect is really small too and no WaitFor... calls are made because the spin wait used enough CPU to let the thread exit the internal lock, do its real work, and then update the lock. However, I have no data to know how often this occurs.

I suppose I just made assumptions that since the .NET framework ReaderWriterLockSlim performed things a certain way that there is a good reason. I'd assume that on a heavily loaded multiprocessor system, user mode spin-waits are worth it.

However, I'm now thinking that this is a "handwavy" response. What do you think? I'd love to continue the discussion.

Anonymous said...

anonymous: Wow! I wish I could thank you by name for your very thoughtful and detailed comment.

My pleasure. To answer you...

quote: "In an ideal case, the code that the lock is being used to protect is really small too and no WaitFor... calls are made because the spin wait used enough CPU to let the thread exit the internal lock, do its real work, and then update the lock. However, I have no data to know how often this occurs." /quote


There's a subtle gap in that logic. And that gap is that you are talking about two threads in the same sentence. Namely: the thread (a) that does the spinning (waiting) and the thread (b) that "exit[s] the internal lock". These are by the very design of the thing two separate threads.

So then comes the thought experiment of how can we actually make two threads simultaneously execute. The answer is actually deceptively boring: either we have multiple cores/cpus, or we can't. Now, to come back to the original post I made, even if we do have multiple cores, the odds of the two threads running through a section of code which usually takes nanoseconds to execute is excessively small (like 1 in 10^6 small). More on that later.

However, it is much more likely that the threads appear to be running in parallel but only so because of the magical aspect of the scheduler as seen in the usermode. As far as kernel mode is concerned, whether thread (a) goes to sleep willingly by calling WaitFor..., yields willingly by calling Sleep(0), or is simply pre-empted, the end result is the same: a context switch occurs, the thread state is swapped out of the registers and stored somewhere and the scheduler subsequently (perhaps up to tens of timeslices later) resumes this thread. The performance hit is done.

Now, for thread (b) to ever really be able to release the lock while (a) is spinning, it needs to be executing that very same instant. Think about this: the spin is maybe a few hundred cycles long, a CPU executes billions of cycles per second.
What I'm saying is that the odds that both threads were scheduled to run even on separate CPUs during the same 20ms time slice is quite small: ie, odds are much higher that thread (a) gets scheduled 23ms later, or 45ms before. Or even that thread (b)'s time runs out or even yields because of a function call it makes that internally uses some obscure other system call during its lock portion.

The end result is always the same, while in user mode it might appear that thread (a) only spun a few times and never "dove" into kernel mode WaitFor..., the reality is probably that it actually did get context switched by the kernel - just as expensive as willingly calling WaitForXxxObject.

To come back to the above "more on that" statement, if your code somehow does something maybe say 1000 times a second, and you have three simultaneous threads doing this 1000-times-a-second thing, you are now increasing your odds of a collision somewhat and yes, in this case, it is possibly useful to use spin locks. Possibly. Seriously, I can only think of industrial grade applications like SQL Server or some DTC coordinator or something that would need this.

In all of this, I'd like to make it clear that you are certainly not alone in this "handwaving" as you call it, as spin locks are in quite a few places in the wild, even in libraries like C#. When I see that libraries like C# have spin locks implemented in them, all I can think is that either some programmer thought he was being awesome, or some big vendor had a 0.01% percentile case that required a spinlock and so they asked Microsoft for that code. In any case though, it is IMO entirely useless for normal applications for the reasons I've stated above. That being said, spinlocks aren't really a performance issue in themselves to make this a problem for anyone...

PS. I notice a typo in my first post: "Once again, they are only useful on multi processor systems, otherwise they are simply normal locks (read: the critical section is useless)". I meant to say "Once again, they are only useful on multi processor systems, otherwise they are simply normal locks (read: the spinlock is useless)"

PPS. The "GOTO" or similar statement you posit the existence of is in indeed the "int 2E" opcode, which triggers a software interrupt which the OS handles by switching to kernel mode. It is the entry point for the Native API.

Anonymous said...

Just as a parenthesis: a case can be made that maybe some call into your application very quickly needs to acquire a lock, and thus, that it would be inappropriate to have the thread lose its current time slice immediately to a sleep (as opposed to spinning at least a while in the hopes that it could avoid a context switch). That is: some operation causes your application to be summoned, thread (a) gets called, is scheduled, eventually runs, and the first thing it does is go to sleep, only to later resume its actual task.

However, I posit that if your thread is taking a lock just as it starts, then the app isn't properly designed. If semantically, the lock is the first thing that the thread needs to acquire, then the thread is the lock. ie. the thread needs to be queued (via kernel APCs, io completion ports, worker thread queues) or it needs to be waiting on the lock to begin with, or the thread needs to be used as a lock for other performers of the operation.

I readily admit that above paragraph is highly vague and not very rigorous... it's just food for thought.

Perry Hertler said...

Excellent post Jeff!

One comment: this is a great optimization IF NEEDED. Most applications can get away with simply using a regular lock.

private readonly Dictionary<string, NotExpensiveObject> expensiveObjectCache
    = new Dictionary<string, NotExpensiveObject>();

private readonly object notExpensiveObjectCacheLock = new object();

public NotExpensiveObject GetExpensiveObject(string key)
{
    lock (notExpensiveObjectCacheLock)
    {
        NotExpensiveObject notExpensiveObject;

        if (!expensiveObjectCache.TryGetValue(key, out notExpensiveObject))
        {
            notExpensiveObject = new NotExpensiveObject();
            expensiveObjectCache[key] = notExpensiveObject;
        }
        return notExpensiveObject;
    }
}

Thoughts?

I point this out because I don't like to see code prematurely optimized especially if a lot of complexity is involved.

Jeff Moser said...

anonymous:

Thanks for the follow-up! I really appreciate the depth that you give in your response.

One thing that I'm having difficulty following your reasoning is that the spin-wait is only for a very small amount of time to update bookkeeping of lock data. It's not actually used to protect the code that a user of a lock would be running. When I wrote the post, I don't think I made this clear enough. Based off comments like these, I should probably update the post.

I assume the lock bookkeeping can be done within a hundred instructions on the "happy path." (e.g. checking for writers, updating reader counts, etc). Since the bookkeeping is small, it seems like it makes sense to go into a spin-wait if somehow this small window of bookkeeping is holding the lock.

If it's highly improbable that two cores will be executing the same piece of code (and it is on the lightly loaded machine), then that seems like a strong argument that lock contention will also be low and thus there will be no need to spin at all (e.g. assume you're just adding another reader to a pool of readers).

To summarize, the spin-wait will only be used while waiting on the *internal* lock that just protects the reader count, writer count, along with the event handles for signaling readers and writers. That's roughly it (maybe a count and event for upgrades as well).

If you focus on just this area of internal bookkeeping where spin-waiting is occurring, do you still think that it doesn't make sense to use a spin-wait here?

Looking forward to future discussions. I'm definitely realizing how much I still don't know :)

P.S. Thanks for the INT 2Eh tip!

Jeff Moser said...

Perry Hertler:

You're absolutely right that in many applications, a normal Monitor based lock is a great choice that could be faster too (less bookkeeping). This is especially true if the probability of a write is close to the probability of a read or if the chance for lock contention is low.

I also agree that my example might be too complex. I was trying to be complete, but you're right that it might obscure the normal case.

One might argue that with a reader/writer lock that is almost as fast as a Monitor and uses a clean syntax (like note 1 hints at), you could make your code more declarative by expressing what the block of code actually needs (a read or a write).

The reader/writer lock could add more complexity than it is worth and careful thought and perfmon-counter-watching should be given to see if lock contention is a real issue.

I suppose I was trying to focus more on the idea of locks and roughly how they work rather than discussing when one should use a particular type of lock.

Did this help clear anything up?

Thanks for your comment!

Perry Hertler said...

Jeff,

Thanks for the clarification.

Also, thanks for not adding double-checked locking to your example:)

Anonymous said...

Jeff,

the point about the spinlock is that even if the operation you are doing (in this case bookkeeping) is light, the thing that implements it is not: namely threads and their associated context switches.

Think of it this way, you have two parallel train tracks from NYC to SF, with the rail ties representing each quantum or slice of time the processor allocates. The whole distance might be, say, 5 seconds of CPU time.

What I'm saying is that for any sort of spinlock scenario to have any effect at all, both the thread currently doing the work and the thread waiting on to do the work by spinning would have to be on two ties at the same distance on both tracks.

A crude figure:
thread 1: SF ------------------|-----------...--- NYC
thread 2: SF ------------------|-----------...--- NYC

I'm also saying that the odds of that happening are extremely low, given that in 5 seconds of user time your app will likely use 100 ms of CPU time. Chances are that one rail tie will be allocated in NJ, one in Ohio, and another in Vegas.

Like so:

thread 1: SF ------------------|------------NYC
thread 2: SF------------|----------|--------NYC

(keep in mind there's millions of rail ties)
If we agree on this, then you already agree with what I'm saying but you don't realize it yet. The reason is: let's say I'm the first thread and I'm doing my work in the span of the rail tie in NJ. I start my work, and then enter the lock using cmpxchg, and start updating some minor lock variables. No need for spinning yet. Now, as I'm doing this, I get pre-empted. So, even though that thread isn't aware that it has stopped, it in reality has stopped, and the next allocation comes in Vegas. (this is the most likely scenario for a contention to ever occur).

Now, along comes a different thread in Ohio. It starts doing the same work, enters the lock area, does a cmpxchg and realizes that the lock is busy. It thinks, oh, it's busy. But I know what gets done next is only 5 instructions long and then the lock is immediately released, so if I spin for 200 instructions, I'm guaranteed to get it the next time. Right? Well, not so: that other thread/rail tie is scheduled to be in Vegas only. Long long ways ahead. So this rail tie in Ohio, no matter how much it knows that the operation is short, can not possibly hope to spin long enough without itself getting pre-empted and yet live to see the freeing of the lock.

Not unless, as I've been saying, both threads just happen to have been scheduled on the single same rail tie from NYC to SF.

What will likely happen is one of two things: the Ohio operation will spin and spin, and eventually get pre-empted. Get scheduled for say, Irvine, and then when it "wakes up" (of course itself never realizing it went to sleep), all of a sudden the spin lock will be freed. However, to think the spin lock did the job here is naive. The job got done by the scheduler, and there was a context switch involved. The spin lock only wasted time.
The other scenario is that the spin will expire, think "ok, it's time to call a heavy function like WaitFor", and voluntarily go to sleep. Probably to awaken right after Vegas this time.

The thing to keep in mind is that the scheduler is not dumb, so while the first scenario will be seen as a simple preemption, and there will be no cause to schedule it at any particular time (for example, it might be rescheduled multiple times before the lock is freed, or conversely scheduled long after the lock is freed), the second scenario will be much more efficient: the scheduler will probably wake up the thread right after the Vegas operation is done, because it will know that the Ohio thread is specifically waiting for the Vegas one.

Images images, if only I had a way to post an image, this would be way simpler and much less verbose to explain.

Anonymous said...

Just to something which might make it clear:

Kernels don't have this issue with spin locks because they are not time sliced. A kernel runs right now, ie if it is running, the kernel knows that it is running since it itself does the scheduling of things. That is: if I ever check a spinlock and find that it's busy, I know for fact that another CPU currently has it and is running it (from the mere fact that kernel locks are reserved for kernel use). This means that I know without a shadow of a doubt that the other CPU is currently working on the operation itself. I know this because I know that both I and the other "thread" are kernel code executing and that we are not being pre-emptively scheduled.

I put "thread" in quotes there because there are no such things as threads as we are used to them in Userland in the kernel. There's threads only of execution.

This means that if I check the cmpxchg and find the lock's taken, I know that right across from my rail tie, on the other rail, the same tie is doing this thing. This is guaranteed - unless hardware interrupts occur, but no need to get into that.

Notice how many times I insisted on the word "know" in that first paragraph. There are many many assumptions you make when you use a spinlock and they are almost never true in user mode.

Jeff Moser said...

Perry Hertler:

I do the double check on the upgraded reader example. I don't do it on the Monitor based example because it doesn't seem to be necessary so long as any look at a sensitive data structure is inside the lock.

Often, it seems that when I need to double check, it's against a null on the data structure itself and not an item in a data structure.

One interesting thing about this is that the old collections (e.g. ArrayList) had a SyncRoot object (before it was decided in 2.0 to remove it). It's interesting how they got around the double-checking issue, which saved an object allocation if it was never used:

if (this._syncRoot == null)
{
    Interlocked.CompareExchange(ref this._syncRoot, new object(), null);
}

I only bring this up because it touches on some similar concepts as the post itself.

Jeff Moser said...

anonymous:

I think I see where you're coming from, but maybe am not quite there yet with your conclusion :)

First, I think we both agree that we're only talking about multiprocessor machines (e.g. 2 or more parallel train tracks). As has been established, I fully agree with your assessment for a uniproc machine.

I'm thinking that since full quantums might never be used (as you indicate), it seems that there is a good chance that the scheduler would not pre-empt in the middle of a 70-1100 ns spin since that'd occur within the scheduled timeslice (~20ms). If it did pre-empt at this point, I agree with your conclusion.

I also agree that kernels don't have the preemption issues that usermode code does since they are in control.

I think a test run is in order to understand this better. I asked a question on StackOverflow for help on how to measure the context switches while running a piece of code, but I think this will be hard to do. I'll see if I can try several approaches though.

Do you see where I'm at with my point? Am I way off base to assume it's reasonable that a pre-emption might not occur if full quantums are rarely used?

Jeff Moser said...

There have been some other comments on this post.

Some on reddit have commented about ways to make reader-writer locks fairer by implementing a ticket spinlock that mimics the "take a number" and "now serving" numbers you see at a deli. I thought this was a really cool way to make locks more FIFO.

Some on barropunto discuss the post with more references to task scheduling in general.

Anonymous said...

Jeff,

you are right about the quantums rarely expiring if all you do is 70ns long. And this is the correct direction in which you should be thinking about this...

Indeed, in all likelihood, if you regularly use up your entire quantum (like doing short CPU intensive tasks), then your spinlock might be anywhere inside the 20ms.

However, this does not increase in any way your chances of having your other thread on a parallel processor using up the same quantum time slot.

The reason why I was saying that thing about being pre-empted is this: the odds are much higher that if your spinlock does succeed in avoiding a WaitFor call, it's because the thread went to sleep and woke up during a quantum many many quantums down the road.

So, if you did something like a trace/debug printf to a log file, and saw there that the spinlock "worked", i.e. the WaitFor section of your code was not reached, then it would be much more likely that it was because your thread lost its quantum and started a new one.

The thing that I just barely grazed in my previous posts is that calling WaitFor is potentially much more optimal, as the scheduler will now know exactly what this thread is waiting on. This means that it can for example release this thread right away when the object being waited on is released. This is no small advantage. Btw, this will also address the "fairness" issue that you mention, as modern operating system schedulers tend to be quite fair. If however, you need a queuing action like FIFO, why not simply use a queue? I call "handwaving" on that one.

If you want to see context swaps and the likes, open up your "performances" mmc pane in Administrative Tools. In the "System Monitor" node, you can add counters for Thread > Elapsed Time and Thread > Context Switches /sec, and then set the source as "Select instances from list". Your app will need to be already running.

What I would recommend is doing a test like so: launch two threads that do the same thing. In an endless loop put three CPU intensive operations before, after and inside the lock area. At the end of the loop, call Sleep() with an adjustable value simulating frequency of use of your app (be it page hits, user requests etc).
The CPU parts should be simple code (not Sleep(x) and not some function call to a big library). Ideally do for loop with three configurable upper limits, and a series of
float var += make_long_float();

Where make_long_float is a series of floating point operations.

Make sure the loops don't call API calls like ReadFile etc, as these will for sure incur context switches. Just make it a plain vanilla integer/float escalator to nowhere...

Let them run. The smaller the value the Sleep function is called with, the more CPU usage you will get. Then do your tests by varying the length of the three loops, hence the CPU load before, after and during the lock.

See if by enabling a spinlock you can reduce the Context switches per sec, while improving or at least keeping the Elapsed Time value constant. If you actually find a way to do it, please share. But honestly, I'm ready to bet money that you can't unless you make your loop sleep value be 0, and your thread work be very light: meaning that you come back to the scenario of doing a small task thousands of times per second with contention, which is indeed the prime if not sole candidate for taking advantage of a spin lock.

Jeff Moser said...

anonymous [although I feel we're getting to know each other :-) ]:

Please bear with me, I'm still not quite reaching your conclusion, even though I agree with many details of what you say. I do appreciate the depth you're going at to keep me honest and to inform anyone reading these comments.

In order to be clear, I want to provide some definitions I've had in my head while writing the post and these comments:

Contention - The only part that matters from a *spinning* perspective is the time from when a thread acquires the internal lock until it releases it. That is (in Vance's code), setting the "myLock" variable from 0 to 1 and then doing bookkeeping and then setting it back to 0. If a thread can set the "myLock" to 1 on a single "lock cmpxchg" instruction, everyone is happy and there is no contention. If the "lock cmpxchg" failed because the compare value was not 0, but 1 (someone else had the *internal* lock), then we have contention.

Bookkeeping - Here is the complete happy path on an AcquireReaderLock: "Interlocked.CompareExchange(ref myLock, 1, 0); if (owners >= 0 && numWriteWaiters == 0) { owners++; break; } myLock = 0;}" This assumes the CompareExchange works by setting "myLock" to 1 because it was 0 before. In addition, it assumes no writers are active. All that is inside of this bookkeeping is a few compares and sets. Very quick stuff.

-

When I read "if your spinlock does succeed in avoiding a WaitFor call, it's because the thread went to sleep and woke up during a quantum many many quantums down the road," I interpret that as: "Your thread may think that all it is doing is spinning, but what is really happening is that it is being context-switched out while it is doing a run of 'pause' assembly instructions. The scheduler is then deciding other threads should run, and then it will be swapped back in and pick up right where it left off. This will burn a lot more time than the small spin-wait window it thinks it is using."

Is this roughly what you meant?

I fully agree with your WaitFor analysis. The kernel is in a much better position to wake threads up exactly when needed. However, it must be stressed that the only time the lock will ever use a WaitFor call is when readers are waiting for a writer to finish, a writer is waiting for readers or another writer to finish, or a reader is waiting to upgrade to a writer.

Do you agree that the previously mentioned events are the only times WaitFor is called? If you don't agree with this, what other time do you see a WaitFor being called? If you do agree with this, then do you agree with the conclusion that if you have 5 active readers and you go to add another reader then there is no chance a WaitFor call will be issued since all that will happen is the internal lock will be entered and the total read count will be incremented and then the internal lock will be exited?

In the situations where WaitFor is used (e.g. a writer waiting for readers to finish), the scheduler will know exactly when it's reasonable to wake up a thread (as discussed in the post).

I'll concede that "ticket spinlocks" are better suited for a kernel level spinlock than for user mode primitives. However, some things should be considered at least with respect to the decision if readers or writers are favored. Typically writers are favored (checked first) since they're assumed to be infrequent.

At the very least, I think "ticket spinlocks" are cool at the kernel level and enjoyed learning about them :)

--

In my highly unscientific test on my dual-core machine with 2 threads with some lock contention (e.g. fighting for "myLock"), the spin-wait enhanced version of the lock squeezed out a 3% reduction in context switches vs a lock that just did a Sleep(0) on contentions. It's not huge by any means, but it saves some time.

I think one problem that you might be addressing is that in the post I claimed a context switch was disastrous for performance, and you're trying to make sure that I see that it happens much more often and with less impact. Is that a fair statement?

Do you at least see the possibility of a 3% improvement as a reasonable return on doing a spin-wait vs. nothing at all?

Thus, in response to your statement of:

"when I see that libraries like C# have spin locks implemented in them, all I can think is that either some programmer thought he was being awesome, or some big vendor had a 0.01% percentile case that required a spinlock and so they asked Microsoft for that code. "

would you conclude that, in a normal app with some contention, Vance's use of a spin-wait to acquire the internal lock is justified and not just an edge case?

Furthermore, would you agree that as more and more processors get added, there is a greater chance of a performance increase from using spin-waits?

--

Regardless, these comments have shown that I'm still doing quite a bit of "handwaving."

Anonymous said...

Indeed, everything before the "Is this roughly what you meant?" is what I meant, minus one tiny thing: spinning isn't wasteful enough for me to consider it a problem. I'm just saying it's useless at best.

Now, regarding all of your questions about readers and writers: I don't even think about it that way. A lock is a "primitive" (the mutex primitive) and the reader-writer scenario makes use of that primitive. Other primitives include semaphores (locks with counts), Events, and I'm probably forgetting some more.

The point about spinning or not is that it's implemented as part of the lock. The specific library implementation of a mutex lock can choose whether or not to spin before it goes from the failed cmpxchg to the WaitFor.

And in that sense, whether the scenario involves reader/writer queues, or flat queue synchronisation, the implementation of the lock is the same.

So, to answer your question a bit sideways: a lock will only call WaitFor if the cmpxchg fails. So a lock that is successfully acquired will fly like the wind through the lock code without ever diving into kernel. If not, it will call waitFor, which semantically speaking is exactly what we want.
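To put that in code, here is a minimal sketch of the shape I mean (my own sketch, not any real library's implementation; the fast path is one cmpxchg, the spin is optional, and WaitOne is the managed wrapper over WaitForSingleObject):

private int lockTaken;                                   // 0 = free, 1 = held
private readonly AutoResetEvent lockReleased = new AutoResetEvent(false);

private void Enter()
{
    // Fast path: an uncontended acquire is a single interlocked compare-exchange.
    if (Interlocked.CompareExchange(ref lockTaken, 1, 0) == 0)
        return;

    while (true)
    {
        // Optional spin: only pays off if the holder is likely to release
        // the lock within the next few hundred cycles on another core.
        for (int i = 0; i < 100; i++)
        {
            if (Interlocked.CompareExchange(ref lockTaken, 1, 0) == 0)
                return;
            Thread.SpinWait(20);    // issues pause-style instructions
        }

        // Spinning didn't help; block in the kernel until the holder signals.
        lockReleased.WaitOne();     // WaitForSingleObject under the covers
    }
}

private void Exit()
{
    Interlocked.Exchange(ref lockTaken, 0);  // release the lock
    lockReleased.Set();                      // wake one waiter, if any
}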

Lock prioritization is indeed a very important topic (re: ticket spinlocks), and is absolutely necessary in the kernel in certain areas. There are also other things that OSes do to avoid system-wide deadlocks, specifically if a process obtains a lock on an object, be it a file or whatever, and then a higher-priority process starts waiting on that same object... this requires the OS to detect a priority inversion, etc. You can look it up.

Regarding your test, did the elapsed time reduce by the same amount as the context switch count? Because the test pretty much guarantees that the context switch count with no spin lock can only be greater than or equal to that with spinning.

quote "I think one problem that you might be addressing is that in the post I claimed a context switch was disasterous for performance, and you're trying to make sure that I see that it happens much more often and with less impact. Is that a fair statement?"/quote

Bingo. Spinlocks are the smallest form of contribution to multithreaded code while conserving the most wow factor for programmers. Baffles me to this day.

To answer you, 3% is indeed an improvement, if we are indeed talking about elapsed time. But, no human will ever notice a 3% difference. Ever. The only kind of optimization I've ever been paid to do has been 80% and above (more like 10 fold, 200 fold etc).
When you start playing with gains of 3%, you have to understand it's for NASA/LHC like projects.

This being said, I never said spin locks should be removed from system libraries.


quote would you conclude that in a normal app that had some contention, Vance's use of a spinwait is justified for acquiring an internal lock and not just an edge case?
/quote

The above response should be fair enough. I think it's not essential. Nor is it detrimental. It certainly wouldn't be something I would highlight to a group of peer programmers during a review.

And last answer, re: more processors. The answer is: it all depends on your data/thread structure. It might not help at all.

Anonymous said...

Oh, I forgot to say: if you need lock prioritization, you need to make that evident in your semantics. Thus, ticket spinlocks would be inadequate as they are inside a lock which semantically has no intrinsic prioritization.

Jeff Moser said...

anonymous:

Good continued discussion. Minor points:

"a lock will only call WaitFor if the cmpxchg fails. So a lock that is successfully acquired will fly like the wind through the lock code without ever diving into kernel. If not, it will call waitFor, which semantically speaking is exactly what we want."

I could be wrong here, but as I see it, the cmpxchg is only used to protect the internal data about the lock. Remember, I'm speaking only of the user mode code here.

Picture this scenario on a multicore machine where the numbers represent increasing wall clock time:

1. Thread A does an AcquireWriterLock(...). Thread A does this by entering the internal lock (succeeding on its very first call to cmpxchg) and sees that there are no active readers or writers. It then makes a note in the internal data structure that there is an active writer, immediately exits the internal lock, and begins writing to some shared resource.

2. Thread B does an AcquireReaderLock(...). Thread B also does this by entering the internal lock (succeeding on its very first call to cmpxchg) and sees that there *is* an active writer (Thread A). It increments the total number of reader waiters *and* makes sure that there is an OS event handle to wait on. Once these two things are done, it exits the internal lock and does a WaitFor on the OS event handle that is used to notify waiting readers. At this point the scheduler swaps out Thread B, makes a note of the handle that it is waiting on, and puts it in the "Waiting" queue.

3. Thread A finishes writing to the shared resource and calls ReleaseWriterLock(). The lock implementation handles this by entering the internal lock and setting the number of writers to 0. Next, it notices that there is a reader waiting (Thread B). It then exits its internal lock and sets the event for waiting readers. By setting this event, the OS can now move Thread B from the "Waiting" queue to the "Ready" queue since the event that it was WaitFor-ing on has been set.

4. Thread B is eventually scheduled to run again and begins reading the shared resource that the lock implementation is being used to protect.

Do you agree that this is a very normal scenario for a reader-writer lock? Do you see that even though the cmpxchg succeeded the first time in every case, the WaitFor was still needed to notify Thread B when it is ok to start reading?
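Here is a minimal sketch of steps 2 and 3 in code (the field and method names are my own guesses, not the actual implementation); the point is that Thread B ends up in a WaitFor even though every cmpxchg on the internal lock succeeded immediately:

private int myLock;            // guards the fields below (the "internal lock")
private int numActiveWriters;  // 0 or 1
private int numReadWaiters;    // readers blocked waiting for the writer to finish
private readonly ManualResetEvent readersCanGo = new ManualResetEvent(false);

private void EnterMyLock()
{
    // In this scenario the compare-exchange succeeds on the very first try.
    while (Interlocked.CompareExchange(ref myLock, 1, 0) != 0)
        Thread.SpinWait(20);
}

private void ExitMyLock()
{
    myLock = 0;
}

private void AcquireReaderLockSketch()          // Thread B, step 2
{
    EnterMyLock();
    if (numActiveWriters > 0)
    {
        numReadWaiters++;                       // note that a reader is waiting
        ExitMyLock();
        readersCanGo.WaitOne();                 // WaitForSingleObject on the event;
                                                // the scheduler parks Thread B here
        EnterMyLock();
        numReadWaiters--;
    }
    // ... bookkeeping for the now-active reader would go here ...
    ExitMyLock();
}

private void ReleaseWriterLockSketch()          // Thread A, step 3
{
    EnterMyLock();
    numActiveWriters = 0;
    bool wakeReaders = numReadWaiters > 0;
    ExitMyLock();

    if (wakeReaders)
        readersCanGo.Set();                     // moves Thread B from "Waiting" to "Ready"
}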

I think of WaitFor and Sleep differently. Like you mentioned in previous comments, a WaitFor can be highly optimized by the scheduler and be used to wake up a thread at the precise moment when things are ready (see also my note 12 in the post). A Sleep call is just telling the OS that the thread wants to yield its quantum and be put back on the Ready queue.

Throughout our discussion, we've been saying "WaitFor." I'm assuming that we're talking about APIs like WaitForSingleObject, WaitForSingleObjectEx, WaitForMultipleObjects, and WaitForMultipleObjectsEx. In the particular scenario I outlined above, step 2 can use a simple WaitForSingleObject on the kernel handle representing the "waiting readers" event.


"this requires the OS to detect a priority inversion etc. etc. You can look it up."

Indeed, this is a cool part about OSes like Windows. The only thing I know about it is from reading the "Priority Boosts for CPU Starvation" section in Chapter 6 of Windows Internals, 4th Edition:

"Once per second, the balance set manager ... scans the ready queues for any threads that have been in the ready state (that is, haven't run) for approximately 4 seconds. If it finds such a thread, the balance set manager boosts the thread's priority to 15... Once the quantum is expired, the thread's priority decays immediately to its original base priority. If the thread wasn't finished and a higher priority thread is ready to run, the decayed thread will return to the ready queue, where it again becomes eligible for another boost if it remains there for another 4 seconds."

As the book mentions, it's not perfect, but it is a neat way around the problem. Is this roughly what you were referring to in that comment?

As to the elapsed time with and without spinning, it's too close to call given my unscientific test. The results fluctuated +/- 0.1% on who won, so no clear winner came out. However, my test could be a bad example. I encourage anyone reading this to offer up test code, either to my personal email address (found on the right-hand side of my blog) or on the Stack Overflow question mentioned in my previous comments.

I'm not trying to sidestep an error on my part. I misled myself and readers of this post. If I updated my "hundred thousand times slower" comment to read something like "The lock implementation performs a spin wait with the hope that the spin gives a thread running on another core enough time to release the internal lock. By spinning, it might avoid a context switch and therefore save some time," would this be satisfactory from your perspective?

I'll agree that 0.1% to 3% isn't noticeable or worth a lot of effort. At the very least, I'd classify it as a neat trick. While it really matters in kernel mode, its applicability in user mode is marginal. It's more noticeable in highly contended locks, so I'm glad that we both agree that it shouldn't be removed from system libraries since it could have some benefit.

Anonymous said...

quote " If I updated my [post]" /quote

That statement you make sums it up very correctly.

To answer your question about WaitFor usage: my understanding of your code is that the whole purpose of the thing is to avoid the creation of the expensive objects. Thus, the ideal scenario with any level of contention is that the writer lock is never taken because all of the expensive objects have already been created in the warm-up phase of the application.

In this scenario, everyone acquires the read locks, and releases them without ever waiting on anything.

The bottom line of the WaitFor call is that it semantically means that the thread should block or stall. Whether this is done through spinning or a genuine WaitFor call is only an implementation detail. Given this, the ideal and most often used case will be, in my opinion, that any number of threads will get to this area of code, take out the reader lock (using nothing but a cmpxchg), get their references to ExpensiveObjects, and leave the reader lock area without any blocking. After all, this is ideally what multithreaded applications should be doing. This will all be done without any context switching (and no WaitFor calls).

I think you get the point of everything I said so far given that your questions and comments indicate that you grasp the concepts.

There is one last tiny nit to pick though: Vance's code, namely the implementation of EnterMyLockSpin. I actually don't know where it comes from or who this is, and am feeling too lazy to go into it in depth. However, I think it's (poorly) re-implementing what a CriticalSection is.

Given what I've already explained (that sleep and WaitFor incur the same penalty of switching to kernel mode and giving up your quantum), this spin lock is inefficient. It is also not absolutely correct, or rather, not fair.

1) it is inefficient because calling Sleep every 3rd time means that you will incur at least as much of a penalty as calling WaitFor once (see the sketch after this list).
2) it is not fair, because you might lose your turn if some other thread gets in there during the time you are sleeping. It is incorrect because this could possibly happen for long periods of time given the right circumstances (e.g. priority inversion). And because you aren't relying on OS synchronization, this kind of priority inversion would not be remedied automagically.
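Here is a generic sketch of the loop shape I mean (my own code, not Vance's actual EnterMyLockSpin, which I haven't reproduced here; myLock is the same internal-lock field discussed above):

private void EnterMyLockSpinSketch()
{
    for (int attempt = 1; ; attempt++)
    {
        if (Interlocked.CompareExchange(ref myLock, 1, 0) == 0)
            return;                 // got the internal lock

        if (attempt % 3 == 0)
        {
            // Sleep(0) gives up the quantum: a trip into the kernel that costs
            // at least as much as a WaitFor, but with no handle for the OS to
            // use for wake-ups or priority boosts, so the thread can lose its
            // turn indefinitely under the wrong circumstances.
            Thread.Sleep(0);
        }
        else
        {
            Thread.SpinWait(20);    // burn a few cycles and try again
        }
    }
}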

To sum it all up: no, spinlocks aren't to be banished. Yes, re-implementing OS primitives is poor form, risky at best. EnterMyLockSpin is essentially re-implementing what EnterCriticalSection already is. User code shouldn't spin, but we can forgive libraries that do so (like EnterCriticalSection or a C# library).

Jeff Moser said...

anonymous:

As a result of our discussion, I wrote Vance and he did a great job of responding quickly with good points:

1. Vance is very upfront in saying that his spin-lock implementation of a reader-writer lock is not a "gold standard for spin locks." I still commend it though on its simplicity and ease of understanding.

2. Microsoft is currently working on Parallel Extensions for the .NET framework. This represents the culmination of more detailed work in parallel processing. It has SpinWait and SpinLock classes that can be used as primitives. They're still tuning the exact mix of spinning and sleeping/yielding.

3. Agreement that spin waiting is wasted time and is only beneficial if you have a "strong reason to believe the lock will be freed in the short time you spin."

4. By not getting swapped out, you save calls to the kernel, 2 switches (out and in), as well as some cycles ("100K to 1M") to get your working set back in cache. We had talked about this earlier, but I don't think I realized that this can be the biggest issue (e.g. having to "warm up" a "cool" cache). In a contended lock environment that is held for a very short period (e.g. to only update a few variables), the spin can be effective and adds up over time.

I don't think my tests did a good job of creating contention or of having other ready-to-run work that would swap out the cache lines, so I was missing a large benefit of the spins.

Furthermore, only having a dual-core machine limited my ability to effectively show the benefits of spins in a heavily loaded environment with contended locks that are all spinning at once.

They're not perfect in user mode, but I think we've established that spins have their place when used properly.

I've updated the offending paragraph as a result of all of this discussion. It's not perfect, but I do think it is better.

Thank you for helping improve this post and enriching the conversation!

P.S. Thanks to Vance Morrison for posting his lock implementation and for taking the time to answer my questions regarding it.

Anonymous said...

Great!

Thanks to you too for keeping an open mind and conversation.

Rishi said...

Hi Jeff, thanks for your answer on Stack Overflow about a Silverlight ReaderWriterLock. I tested your code and it ran about 9x faster vs. Monitor lock blocks. Also, as per your reference, I got Vance Morrison's implementation working on Silverlight; check it out: http://www.orkpad.com/Blog/post/2009/03/08/Silverlight-ReaderWriterLock-Implementation.aspx

Cheers

Jeff Moser said...

Rishi: Great to hear this was helpful. Congrats on getting it working in Silverlight!

Passer-by said...

I know this is a fairly old discussion but I'd like to thank you Jeff and Anonymous for the amazing discussion. The concepts are pretty simple to grasp yet it's surprising how many people overlook such important aspects of parallel programming.