Sunday, February 13, 2011

Performance Optimization using cache locality

Tip - Take advantage of cache locality and prefetching for optimizing programs

Details - High-level languages such as C, C++, C#, FORTRAN, and Java all do a great job of abstracting the hardware away from the programmer. This means that programmers generally don’t have to worry about how the hardware goes about executing their program. However, in order to get the maximum performance out of your programs, it is necessary to start thinking about how the hardware is actually going to execute them.

Most modern CPUs have at least 1 MB of cache, and some have up to 8 MB of cache or more. Datasets for many programs are easily in the tens of megabytes if not in the hundreds or thousands of megabytes. Therefore, we have to worry about how the cache works.

What is cache locality?
When a program requests data from a memory location, say 0x1000, and then shortly afterwards requests data from a nearby memory location, say 0x1004, the accesses are said to exhibit locality of reference (specifically, spatial locality). In other words, when a program accesses different pieces of data around the same memory location within a short period of time, the accesses are local. (Note that this is distinct from cache coherence, which refers to keeping multiple processors' caches consistent with each other.) When a piece of data is retrieved from main memory, more often than not the processor fetches more data than was actually requested, typically a whole cache line. This is because the processor ‘hopes’ that the program will soon request data from a nearby memory location, which it just so happens to have fetched already behind your back. In fact, we can use this processor behavior to our advantage. This behavior is often referred to as prefetching. Consider the following two pieces of code written in C.

// Shared by both snippets
int nArray[32][32];
int nTotal = 0;

// Code 1: the inner loop walks consecutive addresses
for( int i = 0; i < 32; i++ )
{
    for( int j = 0; j < 32; j++ )
    {
        nTotal += nArray[i][j];
    }
}

// Code 2: the inner loop jumps 32 ints between accesses
for( int i = 0; i < 32; i++ )
{
    for( int j = 0; j < 32; j++ )
    {
        nTotal += nArray[j][i];
    }
}

The two pieces of code above perform exactly the same task. However, the first piece of code is faster than the second! The reason is how two-dimensional arrays are laid out in C and C++: row by row, in row-major order. The first piece of code accesses consecutive memory addresses, which takes advantage of both spatial locality and prefetching. The second piece of code has much poorer locality, because each iteration of the inner loop accesses an address 32 words away from the address accessed on the previous iteration. This access pattern gets no benefit from prefetching at all.

Access memory linearly as much as possible to utilize prefetching. Accesses to main memory can easily cost you hundreds of clock cycles. When you access memory linearly in a predictable fashion, chances are greater that the processor will notice the pattern and prefetch automatically for you.

Try to re-use the data which is already in the cache as much as possible. If you’re going to need to access the same data more than once, it’s better to do so quickly, to diminish the chances of that data being kicked out of the cache by other, unrelated data.
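One common way to apply the re-use advice is to fuse loops that touch the same data. The sketch below (my illustration, with a hypothetical function name `min_max_fused`) finds both the minimum and the maximum of an array in a single pass, so each element is used twice while it is still hot in the cache, instead of running two separate loops that each stream the whole array through the cache.

```c
#include <limits.h>
#include <stddef.h>

/* One pass over the data computes both results. With two separate
 * loops, a large array would be loaded into the cache twice; here
 * each element is consumed for both purposes while it is cached. */
void min_max_fused(const int *p, size_t n, int *pMin, int *pMax)
{
    int lo = INT_MAX;
    int hi = INT_MIN;
    for (size_t i = 0; i < n; i++)
    {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    *pMin = lo;
    *pMax = hi;
}
```

For small arrays the fusion makes little difference; the payoff comes when the data is too large to stay resident in the cache between passes.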

Posted By : Sujith R Mohan
