Wednesday, February 23, 2011

Debugging a blue screen

Title - Debugging a blue screen

Details - Have you ever wondered how to obtain extra information from the infamous Blue Screen of Death (BSOD) that sometimes shows up, gives you a cryptic Stop: 0x00000000 error message, and then flashes off the screen? The error message points to a fatal operating system error that could be caused by any of a number of problems. When the system encounters a hardware problem, data inconsistency, or similar error, it may display a blue screen containing information that can be used to determine the cause of the error. This information includes the STOP code and whether a crash dump file was created. It may also include a list of loaded drivers and a stack trace.
Microsoft’s WinDBG will help you to debug and diagnose the problem and then lead you to the root cause so you can fix it.
Steps for Analysis
  1. Create and capture the memory dump associated with the BSOD you are trying to troubleshoot.
  2. Install WinDBG and configure the Symbols path to point to the correct Symbols folder.
  3. Use WinDBG to debug and analyze the memory dump, and then get to the root cause of the problem.
Minidump
A minidump is a smaller version of a complete (kernel) memory dump. Usually Microsoft will want a kernel memory dump, but the debugger will analyze a minidump and quite possibly give the information needed to resolve the issue. If it's all you have, debug it rather than waiting for the machine to crash again. Open the file in the debugger (see below) just as you would open memory.dmp.
Steps to create a memory dump
Keep in mind that if you are not experiencing a blue screen fatal system error, there will be no memory dump to capture.
1. Press the WinKey + Pause.
2. Click the Advanced tab, and under Startup and Recovery, click Settings.
3. Uncheck Automatically Restart.
4. Click on the dropdown arrow under Write Debugging Information.
5. Select Small Memory Dump (64 KB) and make sure the output is %SystemRoot%\Minidump.
6. Restart the PC normally; when the system next encounters the error and blue screens, it will create the minidump.
The location of the Minidump files can be found here:
C:\WINDOWS\Minidump\Mini000000-01.dmp
To download and install the Windows debugging tools for your version of Windows, visit the Microsoft Debugging Tools Web site.
Follow the prompts, and when you install, take note of your Symbols location if you accept the default settings. The Microsoft Support Knowledge Base explains how to read the small memory dump files that Windows creates for debugging purposes.

Dump Analyze using WinDBG

Open WinDBG, select File > Open Crash Dump, navigate to the minidump file created earlier, highlight it, and select Open.
Click the !analyze -v link (or type !analyze -v at the kd> prompt), as shown in Figure C under Bugcheck Analysis.

Figure C: Output of !analyze -v
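A typical minimal session looks like the sketch below; the module name passed to lmvm is a hypothetical placeholder for whichever driver the bugcheck analysis names:

kd> .symfix C:\Symbols      (point WinDBG at the Microsoft public symbol server)
kd> .reload                 (reload symbols for the loaded modules)
kd> !analyze -v             (verbose bugcheck analysis: STOP code, call stack, suspect module)
kd> lmvm mydriver           (show version and path details for a suspect driver)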

Conclusion

In this case, the BSOD was caused by the installed driver software for a USB modem. The root cause was identified by using the WinDBG tool to debug and analyze the memory dump file.


Posted By : Binu M D 

More on OpenMP

Title - More on OpenMP

Details - OpenMP is an implementation of multithreading, a method of parallelization whereby the master "thread" (a series of instructions executed consecutively) "forks" a specified number of slave "threads" and a task is divided among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
The section of code that is meant to run in parallel is marked accordingly, with a preprocessor directive that will cause the threads to form before the section is executed. Each thread has an "id" attached to it which can be obtained using a function (called omp_get_thread_num()). The thread id is an integer, and the master thread has an id of "0". After the execution of the parallelized code, the threads "join" back into the master thread, which continues onward to the end of the program.
By default, each thread executes the parallelized section of code independently. "Work-sharing constructs" can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way.
The runtime environment allocates threads to processors depending on usage, machine load and other factors. The number of threads can be assigned by the runtime environment based on environment variables or in code using functions. The OpenMP functions are included in a header file labelled "omp.h" in C/C++.
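As a minimal sketch of the fork-join model and a work-sharing construct described above, the following C program forks a team of threads, prints each thread's id, and then divides a loop among the threads (compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main( void )
{
    int nSum = 0;
    int i;

    // Fork a team of threads; each executes this block independently.
    #pragma omp parallel
    {
        int nThreadId = omp_get_thread_num();   // the master thread has id 0
        printf( "Hello from thread %d of %d\n",
                nThreadId, omp_get_num_threads() );
    }   // The threads join back into the master thread here.

    // Work-sharing construct: loop iterations are divided among the
    // threads, and the reduction clause safely combines per-thread sums.
    #pragma omp parallel for reduction( +:nSum )
    for( i = 0; i < 100; i++ )
    {
        nSum += i;
    }
    printf( "Sum = %d\n", nSum );   // prints 4950
    return 0;
}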


Implementations

OpenMP has been implemented in many commercial compilers. For instance, Visual C++ 2005, 2008 and 2010 support it (in their Professional, Team System, Premium and Ultimate editions), as well as Intel Parallel Studio for various processors.

Pros and cons

Pros
  • Data layout and decomposition are handled automatically by directives.
  • Incremental parallelism: you can work on one portion of the program at a time; no dramatic change to the code is needed.
  • Unified code for both serial and parallel applications: OpenMP constructs are treated as comments when sequential compilers are used.
  • Original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs.
  • Both coarse-grained and fine-grained parallelism are possible.
Cons
  • Risk of introducing difficult-to-debug synchronization bugs and race conditions.
  • Currently runs efficiently only on shared-memory multiprocessor platforms.
  • Requires a compiler that supports OpenMP.
  • Scalability is limited by the memory architecture.
  • No support for compare-and-swap.
  • Reliable error handling is missing.
  • Lacks fine-grained mechanisms to control thread-processor mapping.
  • Cannot be used on GPUs.
  • High chance of accidentally writing false-sharing code.

Performance expectations

One might expect an N-times speedup when running a program parallelized using OpenMP on an N-processor platform. However, this is seldom the case, for the following reasons:
  • N processors in an SMP may have N times the computation power, but the memory bandwidth usually does not scale up N times. Quite often, the original memory path is shared by multiple processors, and performance degradation may be observed when they compete for the shared memory bandwidth.
  • Many other common problems affecting the final speedup in parallel computing also apply to OpenMP, like load balancing and synchronization overhead.


Posted By : Binu M D 

Double buffering in computer graphics

Title :- Double buffering in computer graphics

Details :- In computer graphics, double buffering is a technique for drawing graphics that eliminates (or reduces) flicker, tearing, and other artifacts.

It is difficult for a program to draw a display so that no pixel changes more than once. For instance, to update a page of text it is much easier to clear the entire page and then draw the letters than to somehow erase all the pixels that are not in both the old and new letters. However, this intermediate image is seen by the user as flickering. In addition, computer monitors constantly redraw the visible video page (at around 60 times a second), so even a perfect update may be visible momentarily as a horizontal divider between the "new" image and the un-redrawn "old" image, known as tearing.

A software implementation of double buffering has all drawing operations store their results in some region of system RAM; any such region is often called a "back buffer". When all drawing operations are considered complete, the whole region (or only the changed portion) is copied into the video RAM (the "front buffer"); this copying is usually synchronized with the monitor's raster beam in order to avoid tearing. Double buffering necessarily requires more video memory and CPU time than single buffering because of the video memory allocated for the back buffer, the time for the copy operation, and the time waiting for synchronization.
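The sketch below shows this software approach in C; WaitForVerticalBlank() and CopyToVideoRam() are hypothetical stand-ins for whatever the video API actually provides:

#include <stdint.h>
#include <string.h>

#define WIDTH  640
#define HEIGHT 480

static uint32_t g_backBuffer[WIDTH * HEIGHT];   // back buffer in system RAM

extern void WaitForVerticalBlank( void );                    // hypothetical
extern void CopyToVideoRam( const void *pSrc, size_t cb );   // hypothetical

void RenderFrame( void )
{
    // 1. Draw the whole frame into the back buffer; the intermediate
    //    states are never visible to the user.
    memset( g_backBuffer, 0, sizeof( g_backBuffer ) );   // clear the page
    /* ... draw text, sprites, etc. into g_backBuffer ... */

    // 2. Synchronize with the raster beam, then copy the finished frame
    //    into video RAM (the front buffer) to avoid tearing.
    WaitForVerticalBlank();
    CopyToVideoRam( g_backBuffer, sizeof( g_backBuffer ) );
}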

Compositing window managers often combine the "copying" operation with "compositing" used to position windows, transform them with scale or warping effects, and make portions transparent. Thus the "front buffer" may contain only the composite image seen on the screen, while there is a different "back buffer" for every window containing the non-composited image of the entire window contents.
Page Flipping

In this method (sometimes called ping-pong buffering), instead of copying the data, both buffers are capable of being displayed (both are in VRAM). At any one time, one buffer is actively being displayed by the monitor, while the other, background buffer is being drawn. When drawing is complete, the roles of the two are switched. The page-flip is typically accomplished by modifying the value of a pointer to the beginning of the display data in the video memory.

The page-flip is much faster than copying the data and can guarantee that tearing will not be seen, as long as the pages are switched during the monitor's vertical blank period, when no video data is being drawn. The currently active and visible buffer is called the front buffer, while the background page is called the back buffer.
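A matching page-flip sketch, with a hypothetical SetDisplayStartAddress() standing in for the hardware-specific call that reprograms the display pointer:

#include <stdint.h>

static uint32_t *g_pFrontBuffer;   // currently scanned out by the monitor
static uint32_t *g_pBackBuffer;    // currently being drawn into

extern void WaitForVerticalBlank( void );                 // hypothetical
extern void SetDisplayStartAddress( const void *pPage );  // hypothetical

void FlipPages( void )
{
    // Switch during the vertical blank so tearing cannot be seen.
    WaitForVerticalBlank();

    // Swap the roles of the two buffers...
    uint32_t *pTemp = g_pFrontBuffer;
    g_pFrontBuffer  = g_pBackBuffer;
    g_pBackBuffer   = pTemp;

    // ...and point the display hardware at the new front buffer.
    // No pixel data is copied; only a pointer changes.
    SetDisplayStartAddress( g_pFrontBuffer );
}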



Posted By : Binu MD

How to mark specific I/O operations as canceled

Tip - CancelSynchronousIo()

Details - This function marks pending synchronous I/O operations that are issued by the specified thread as canceled.

BOOL WINAPI CancelSynchronousIo(
  __in  HANDLE hThread
);

hThread [in]
    A handle to the thread.
   
  • If the function succeeds, the return value is nonzero.
  • If the function fails, the return value is 0 (zero).
  • If this function cannot find a request to cancel, the return value is 0.
  • The caller must have the THREAD_TERMINATE access right.
  • If there are any pending I/O operations in progress for the specified thread, the CancelSynchronousIo function marks them for cancellation. Most types of operations can be canceled immediately; other operations can continue toward completion before they are actually canceled and the caller is notified. The CancelSynchronousIo function does not wait for all canceled operations to complete.

The operation being canceled is completed with one of three statuses; you must check the completion status to determine the completion state. The three statuses are:

    * The operation completed normally. This can occur even if the operation was canceled, because the cancel request might not have been submitted in time to cancel the operation.
    * The operation was canceled.
    * The operation failed with another error.
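A minimal usage sketch (assuming dwWorkerThreadId identifies a thread blocked in a synchronous call such as ReadFile):

#include <windows.h>
#include <stdio.h>

void CancelWorkerIo( DWORD dwWorkerThreadId )
{
    // CancelSynchronousIo requires the THREAD_TERMINATE access right.
    HANDLE hThread = OpenThread( THREAD_TERMINATE, FALSE, dwWorkerThreadId );
    if( hThread == NULL )
    {
        printf( "OpenThread failed: %lu\n", GetLastError() );
        return;
    }

    if( !CancelSynchronousIo( hThread ) )
    {
        // ERROR_NOT_FOUND means no pending request was found to cancel.
        printf( "CancelSynchronousIo failed: %lu\n", GetLastError() );
    }

    CloseHandle( hThread );
}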


Posted By : Praveen V

How to cancel I/O operations for a specified file

Tip - CancelIo()

Details - This function cancels all pending input and output (I/O) operations that are issued by the calling thread for the specified file. The function does not cancel I/O operations that other threads issue for a file handle. To cancel I/O operations from another thread, use the CancelIoEx function.

BOOL WINAPI CancelIo(
  __in  HANDLE hFile
);

hFile [in]
    A handle to the file.
   
  • The function cancels all pending I/O operations for this file handle.
  • If the function succeeds, the return value is nonzero. The cancel operation for all pending I/O operations issued by the calling thread for the specified file handle was successfully requested.
  • The thread can use the GetOverlappedResult function to determine when the I/O operations themselves have been completed.
  • If the function fails, the return value is zero (0).
  • If there are any pending I/O operations in progress for the specified file handle, and they were issued by the calling thread, the CancelIo function cancels them. CancelIo cancels only outstanding I/O on the handle; it does not change the state of the handle. This means that you cannot rely on the state of the handle, because you cannot know whether the operation completed successfully or was canceled.
  • The I/O operations must be issued as overlapped I/O. If they are not, the I/O operations do not return to allow the thread to call the CancelIo function. Calling the CancelIo function with a file handle that is not opened with FILE_FLAG_OVERLAPPED does nothing.
  • All I/O operations that are canceled complete with the error ERROR_OPERATION_ABORTED, and all completion notifications for the I/O operations occur normally.
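A minimal usage sketch that starts an overlapped read and then cancels it (assuming hFile was opened with FILE_FLAG_OVERLAPPED):

#include <windows.h>
#include <stdio.h>

void StartAndCancelRead( HANDLE hFile )
{
    BYTE       buffer[4096];
    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEvent( NULL, TRUE, FALSE, NULL );   // manual-reset event

    if( !ReadFile( hFile, buffer, sizeof( buffer ), NULL, &ov ) &&
        GetLastError() == ERROR_IO_PENDING )
    {
        // Cancel every pending I/O the calling thread issued on hFile.
        CancelIo( hFile );

        // Wait for the canceled operation to actually complete.
        DWORD dwBytes = 0;
        if( !GetOverlappedResult( hFile, &ov, &dwBytes, TRUE ) &&
            GetLastError() == ERROR_OPERATION_ABORTED )
        {
            printf( "The read was canceled.\n" );
        }
    }

    CloseHandle( ov.hEvent );
}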


Posted By : Praveen V

How to determine the character set code page used by file I/O functions

Tip - AreFileApisANSI()

Details - This function determines whether the file I/O functions are using the ANSI or OEM character set code page. It has no parameters.

BOOL WINAPI AreFileApisANSI(void);

If the return value is nonzero, the file I/O functions are using the ANSI code page.
If the return value is zero, the file I/O functions are using the OEM code page.

The SetFileApisToOEM function causes a set of file I/O functions to use the OEM code page. The SetFileApisToANSI function causes the same set of file I/O functions to use the ANSI
code page. Use the AreFileApisANSI function to determine which code page the set of file I/O functions is currently using. The functions SetFileApisToOEM and SetFileApisToANSI set the code page for a process, so AreFileApisANSI returns a value indicating the code page of an entire process.
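A minimal sketch that checks the current setting, switches the process to the OEM code page, and then restores the default:

#include <windows.h>
#include <stdio.h>

int main( void )
{
    if( AreFileApisANSI() )
        printf( "File I/O functions are using the ANSI code page.\n" );
    else
        printf( "File I/O functions are using the OEM code page.\n" );

    SetFileApisToOEM();    // process-wide switch to the OEM code page
    /* ... call file I/O functions that expect OEM strings ... */
    SetFileApisToANSI();   // restore the default ANSI code page

    return 0;
}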


Posted By : Praveen V

Sunday, February 13, 2011

Adds user keys to the specified encrypted file

Tip - AddUsersToEncryptedFile()

Details - The AddUsersToEncryptedFile function adds user keys to the specified encrypted file.

DWORD WINAPI AddUsersToEncryptedFile(
  __in  LPCWSTR lpFileName,
  __in  PENCRYPTION_CERTIFICATE_LIST pUsers
);

lpFileName [in]
    The name of the encrypted file.

pUsers [in]
    A pointer to an ENCRYPTION_CERTIFICATE_LIST structure that contains the list of new user keys to be added to the file.

The ENCRYPTION_CERTIFICATE_LIST structure is defined as shown below.

typedef struct _ENCRYPTION_CERTIFICATE_LIST {
  DWORD                   nUsers;
  PENCRYPTION_CERTIFICATE *pUsers;
} ENCRYPTION_CERTIFICATE_LIST, *PENCRYPTION_CERTIFICATE_LIST;

nUsers
    The number of certificates in the list.

pUsers
    A pointer to the first ENCRYPTION_CERTIFICATE structure in the list.
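A minimal calling sketch, assuming pNewUserCert already points to a prepared ENCRYPTION_CERTIFICATE (for example, one built from the new user's EFS certificate); link against Advapi32.lib:

#include <windows.h>
#include <winefs.h>
#include <stdio.h>

DWORD AddUserKey( LPCWSTR lpFileName, PENCRYPTION_CERTIFICATE pNewUserCert )
{
    ENCRYPTION_CERTIFICATE_LIST certList;
    certList.nUsers = 1;               // one certificate in the list
    certList.pUsers = &pNewUserCert;   // pointer to the first entry

    DWORD dwResult = AddUsersToEncryptedFile( lpFileName, &certList );
    if( dwResult != ERROR_SUCCESS )
    {
        printf( "AddUsersToEncryptedFile failed: %lu\n", dwResult );
    }
    return dwResult;
}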

Posted By : Praveen V

How to get the GPU memory size and usage in CUDA?

Tip - Use cuMemGetInfo to get the GPU memory size and usage in CUDA.

Details - The CUDA driver API function cuMemGetInfo retrieves the size of the total available GPU memory and the size of the currently available GPU memory. It returns the sizes in bytes.

#include <cuda.h>    // CUDA driver API
#include <stdio.h>

unsigned int uCurAvailMemoryInBytes;
unsigned int uTotalMemoryInBytes;
int nNoOfGPUs;

CUresult result;
CUdevice device;
CUcontext context;

cuInit(0); // Initialize CUDA
cuDeviceGetCount( &nNoOfGPUs ); // Get number of devices supporting CUDA
for( int nID = 0; nID < nNoOfGPUs; nID++ )
{
    cuDeviceGet( &device, nID ); // Get handle for device
    cuCtxCreate( &context, 0, device ); // Create context
    result = cuMemGetInfo( &uCurAvailMemoryInBytes, &uTotalMemoryInBytes );
    if( result == CUDA_SUCCESS )
    {
        printf( "Device: %d\nTotal Memory: %d MB, Free Memory: %d MB\n",
                nID,
                uTotalMemoryInBytes / ( 1024 * 1024 ),
                uCurAvailMemoryInBytes / ( 1024 * 1024 ));
    }
    cuCtxDetach( context ); // Destroy context
}

Reference –
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__MEMORY.html

Posted By : Sujith R Mohan

How to get the GPU memory size and usage in OpenGL?

Tip - Use GL_NVX_gpu_memory_info for NVIDIA graphics cards and GL_ATI_meminfo for AMD/ATI cards.

Details - A few years ago this kind of information wasn't available to OpenGL developers, but today the two majors in OpenGL (I mean NVIDIA and AMD/ATI) have added some useful extensions to fetch the graphics card memory size and other related data.

OpenGL Memory Size with NVIDIA
NVIDIA has recently published the specs of a new extension called GL_NVX_gpu_memory_info. Thanks to GL_NVX_gpu_memory_info you can retrieve the size of the total available GPU memory and the size of the current available GPU memory. It returns sizes in KB.

Here is a code snippet that shows how to use it.

#define GL_GPU_MEM_INFO_TOTAL_AVAILABLE_MEM_NVX 0x9048
#define GL_GPU_MEM_INFO_CURRENT_AVAILABLE_MEM_NVX 0x9049

GLint nTotalMemoryInKB = 0;
glGetIntegerv( GL_GPU_MEM_INFO_TOTAL_AVAILABLE_MEM_NVX,
                       &nTotalMemoryInKB );

GLint nCurAvailMemoryInKB = 0;
glGetIntegerv( GL_GPU_MEM_INFO_CURRENT_AVAILABLE_MEM_NVX,
                       &nCurAvailMemoryInKB );

GL_NVX_gpu_memory_info is available in NVIDIA display drivers since R196.xx.   

OpenGL Memory Size with AMD/ATI
AMD has released two extensions to deal with the size of video memory:
  • WGL_AMD_gpu_association
  • GL_ATI_meminfo

WGL_AMD_gpu_association is an extension developed for parallel rendering: you can create a different OpenGL context on each available GPU. This extension comes with hardware query functionalities. With WGL_AMD_gpu_association you can get the amount of graphics memory available for the GPU.

GL_ATI_meminfo is used when you need more detailed information about the memory available for VBOs (vertex buffer objects) or for your textures.

Here is a code snippet that shows how to use it.

GLuint uNoOfGPUs = wglGetGPUIDsAMD( 0, 0 );   // first call returns the GPU count
GLuint* uGPUIDs = new GLuint[uNoOfGPUs];
wglGetGPUIDsAMD( uNoOfGPUs, uGPUIDs );        // second call fills in the IDs

GLuint uTotalMemoryInMB = 0;
wglGetGPUInfoAMD( uGPUIDs[0],
                  WGL_GPU_RAM_AMD,
                  GL_UNSIGNED_INT,
                  sizeof( GLuint ),
                  &uTotalMemoryInMB );
delete[] uGPUIDs;

// GL_TEXTURE_FREE_MEMORY_ATI writes four values; element 0 is the
// total free texture memory pool, in KB.
GLint nCurAvailMemoryInKB[4] = { 0 };
glGetIntegerv( GL_TEXTURE_FREE_MEMORY_ATI,
               nCurAvailMemoryInKB );


Posted By : Sujith R Mohan

Performance Optimization using cache coherence

Tip - Take advantage of cache coherency for optimizing programs

Details - High level languages such as C, C++, C#, FORTRAN, and Java all do a great job of abstracting the hardware away from the language. This means that programmers generally don’t have to worry about how the hardware goes about executing their program. However, in order to get the maximum amount of performance out of your programs, it is necessary to start thinking about how the hardware is actually going to execute your program.

Most modern CPUs have at least 1 MB of cache, and some have up to 8 MB of cache or more. Datasets for many programs are easily in the tens of megabytes if not in the hundreds or thousands of megabytes. Therefore, we have to worry about how the cache works.

What is cache coherence?
When a program requests data from a memory location, say 0x1000, and then shortly afterwards requests data from a nearby memory location, say 0x1004, the data is coherent. In other words, when a program needs to access different pieces of data around the same memory location within a short period of time, the data is coherent. When a piece of data is retrieved from main memory, more often than not, more data is retrieved than we actually requested. This is because the processor ‘hopes’ that the program will soon request data from a nearby memory location which it just so happened to have fetched already behind your back. In fact, we can use this processor behavior to our advantage. This behavior is often referred to as prefetching. Consider the following two pieces of code written in C.

int nArray[32][32] = { 0 };
int nTotal = 0;

// Code 1: row-major traversal - accesses consecutive memory addresses
for( int i = 0; i < 32; i++ )
{
    for( int j = 0; j < 32; j++ )
    {
        nTotal += nArray[i][j];
    }
}

// Code 2: column-major traversal - each access is 32 words from the last
for( int i = 0; i < 32; i++ )
{
    for( int j = 0; j < 32; j++ )
    {
        nTotal += nArray[j][i];
    }
}

The two pieces of code above perform exactly the same task. However, the first piece of code is faster than the second! The reason is how memory is organized in C and C++ applications. The first piece of code accesses consecutive memory addresses, which utilizes not only cache coherency but prefetching as well. The second piece of code is much less coherent, because each time the inner loop executes, the memory address being accessed is 32 words away from the previous execution of the inner loop. This doesn't utilize prefetching at all.

Access memory linearly as much as possible to utilize prefetching. Accesses to main memory can easily cost you hundreds of clock cycles. When you access memory linearly in a predictable fashion, chances are greater that the processor will notice the pattern and prefetch automatically for you. Try to re-use the data which is already in the cache as much as possible. If you’re going to need to access the same data more than once, it’s better to do so quickly to diminish the chances of that data being kicked out of the cache by other, unrelated data.


Posted By : Sujith R Mohan

Remote Procedure Call

Tip : Microsoft Remote Procedure Call (RPC) defines a powerful technology for creating distributed client/server programs. The RPC run-time stubs and libraries manage most of the processes relating to network protocols and communication. This enables you to focus on the details of the application rather than the details of the network.

Details :
  • RPC can be used in all client/server applications based on Windows operating systems. It can also be used to create client and server programs for heterogeneous network environments that include such operating systems as Unix and Apple.
  • RPC is designed to be used by C/C++ programmers. Familiarity with the Microsoft Interface Definition Language (MIDL) and the MIDL compiler is required.
  • The RPC run-time libraries are included with Windows. The components of the RPC development environment are installed when you install the Microsoft Windows Software Development Kit (SDK).
  • The Platform Software Development Kit (SDK) includes examples that demonstrate a variety of Remote Procedure Call (RPC) concepts, as follows:
    o ASYNCRPC illustrates the structure of an RPC application that uses asynchronous remote procedure calls. It also demonstrates various methods of notification of the call's completion.
    o CLUUID demonstrates use of the client-object UUID to enable a client to select from multiple implementations of a remote procedure.
    o The DATA directory contains four programs: DUNION illustrates discriminated (nonencapsulated) unions; INOUT demonstrates [in], [out] parameters; REPAS demonstrates the represent_as attribute; XMIT demonstrates the transmit_as attribute.
    o DYNEPT demonstrates a client application managing its connection to the server through dynamic endpoints.
    o The FILEREP directory contains four samples illustrating how developers can write a simple file replication service, a multi-user file replication service, a service supporting security features, and a service using RPC asynchronous pipes.
    o The HANDLES directory contains three programs, AUTO, CXHNDL, and USRDEF, which demonstrate auto_handle, [context_handle], and generic (user-defined) handles, respectively.
    o HELLO is a client/server implementation of "Hello, world."
    o The PICKLE directory contains two programs: PICKLP demonstrates data procedure serialization; PICKLT demonstrates data type serialization; both programs use the [encode] and [decode] attributes.
    o PIPES demonstrates the use of the pipe-type constructor.
    o RPCSVC demonstrates the implementation of a service with RPC.
    o STROUT demonstrates how to allocate memory at a server for a two-dimensional object (an array of pointers) and pass it back to the client as an [out]-only parameter. The client then frees the memory. This technique allows the stub to call the server without knowing in advance how much data will be returned.
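As a minimal sketch of what client-side RPC code looks like, the following fragment composes a string binding and converts it into a binding handle the MIDL-generated stubs can use (the protocol sequence shown is illustrative; link against Rpcrt4.lib):

#include <windows.h>
#include <rpc.h>

RPC_STATUS ConnectToServer( RPC_CSTR pszServer, RPC_BINDING_HANDLE *phBinding )
{
    RPC_CSTR pszStringBinding = NULL;

    // Compose a string binding: TCP/IP protocol sequence, dynamic endpoint.
    RPC_STATUS status = RpcStringBindingCompose(
        NULL,                         // object UUID (none)
        (RPC_CSTR)"ncacn_ip_tcp",     // protocol sequence
        pszServer,                    // network address of the server
        NULL,                         // endpoint (resolved dynamically)
        NULL,                         // network options
        &pszStringBinding );
    if( status != RPC_S_OK )
        return status;

    // Convert the string binding into a binding handle for the client stubs.
    status = RpcBindingFromStringBinding( pszStringBinding, phBinding );
    RpcStringFree( &pszStringBinding );
    return status;
}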


Posted By : Rijesh T.K.

Security Support Provider Interface (SSPI)

Tip : SSPI is a common interface between transport-level applications, such as Microsoft Remote Procedure Call (RPC), and security providers, such as Windows Distributed Security. SSPI allows a transport application to call one of several security providers to obtain an authenticated connection. These calls do not require extensive knowledge of the security protocol's details.

Details - Applications initialize SSPI using the following steps to secure a network connection for most security packages:

  • Using SecBufferDesc and SecBuffer
Communication often involves potentially large amounts of data or data of unknown length. Sending and receiving data is done in a similar or identical fashion in both of these cases. To deal with unknown length and potentially large amounts of data, SSPI communication is done using two structures, SecBufferDesc and SecBuffer (see the sketch after this list).

  • Initializing SSPI
The following steps initialize SSPI:
    o Writing and Installing a Security Support Provider
    o Initializing the Security Package
    o Getting Information About Security Packages
    o Using Security Packages

  • Establishing a Secure Connection with Authentication
In a client/server application protocol, a server binds to a communication port such as a socket or an RPC interface. The server then waits for a client to connect and request service. The role of security at connection setup is two-fold in the case of mutual authentication:
    o Client authentication by the server.
    o Server authentication by the client.

  • Ensuring Communication Integrity During Message Exchange
The client or server passes the security context and a message to the MakeSignature function to generate a secure signature that prevents the message from being modified while in transit. The receiver of the message calls the VerifySignature function. VerifySignature uses the information in the signature to verify that the message received was not modified during transmission. The client and server can also exchange encrypted messages using EncryptMessage (General) and DecryptMessage (General).

  • Ending an SSPI Session
After the client and server have finished communicating, both sides call the DeleteSecurityContext function with their respective context handles. When the client has finished communicating with any server or has finished using the additional credentials passed to the AcquireCredentialsHandle function, the client must call the FreeCredentialsHandle function. When the server is ready to shut down and before unloading the DLL, the server must call DeleteSecurityContext.
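A minimal sketch of the SecBufferDesc/SecBuffer pairing mentioned above, describing one data buffer of the kind passed to EncryptMessage or MakeSignature:

#include <windows.h>
#define SECURITY_WIN32
#include <sspi.h>

void DescribeOneBuffer( void *pData, unsigned long cbData )
{
    SecBuffer buffer;
    buffer.BufferType = SECBUFFER_DATA;   // plain message data
    buffer.cbBuffer   = cbData;           // size of the buffer, in bytes
    buffer.pvBuffer   = pData;            // pointer to the data itself

    SecBufferDesc bufferDesc;
    bufferDesc.ulVersion = SECBUFFER_VERSION;
    bufferDesc.cBuffers  = 1;             // number of SecBuffer entries
    bufferDesc.pBuffers  = &buffer;       // array of SecBuffer structures

    /* ... pass &bufferDesc to EncryptMessage, MakeSignature, etc. ... */
}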

Reference  
http://msdn.microsoft.com/en-us/library/aa378824(VS.85).aspx


Posted By : Rijesh T.K.

Wednesday, February 2, 2011

PsList

Tip - Show information about processes and threads.

Details -
pslist exp    : Shows statistics for all the processes that start with "exp", which would include Explorer.
-d            : Show thread detail.
-m            : Show memory detail.
-x            : Show processes, memory information, and threads.
-t            : Show process tree.
-s [n]        : Run in task-manager mode, for the optional number of seconds specified. Press Escape to abort.
-r n          : Task-manager mode refresh rate in seconds (default is 1).
\\computer    : Instead of showing process information for the local system, PsList will show information for the NT/Win2K system specified. Include the -u switch with a username and password to log in to the remote system if your security credentials do not permit you to obtain performance counter information from the remote system.
-u username   : If the account you are executing in does not have administrative privileges on the remote system, you must log in as an administrator using this command-line option. If you do not include the password with the -p option, PsList will prompt you for the password without echoing your input to the display.
-p password   : This option lets you specify the login password on the command line so that you can use PsList from batch files. If you specify an account name and omit the -p option, PsList prompts you interactively for a password.
name          : Show information about processes that begin with the name specified.
-e            : Exact match the process name.
pid           : Instead of listing all the running processes in the system, this parameter narrows PsList's scan to the process that has the specified PID. Thus, pslist 53 would dump statistics for the process with the PID 53.
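For example, to show thread detail for processes starting with "exp" on a remote machine (SERVER01 and the admin account name are hypothetical placeholders):

pslist -d \\SERVER01 -u admin exp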


Reference  
http://technet.microsoft.com/en-us/sysinternals/bb896682

Posted By : Rijesh T.K.