64 bit porting gotcha #1! x64 Datatype misalignment.

By , June 17, 2010 11:10 am

Datatype misalignment, there is a topic so interesting you’d probably prefer to watch paint dry.

But! There are serious consequences for getting it wrong. So perhaps you’d better read about it after all 🙂

The problem that wasted my time

Why am I writing about datatype misalignment? Because its just eaten two days of my time and if what I share with you helps save you from such trouble, all the better.

The problem I was chasing was that three calls to CreateThread() were failing. All calls were failing with ERROR_NOACCESS. They would only fail if called from functions in the Thread Validator x64 profiling DLL injected into the target x64 application. If the same functions were called later in the application (via a Win32 API hook or directly from the target application) the functions would work. That meant that the input parameters were correct.

Lots of head scratching and trying many, many variations of input parameters and asking questions on Stack Overflow and we were stuck. I could only think it was to do with the callstack but I had no idea why. So I started investigating the value of RSP during the various calls. The investigating what happened if I pushed more data onto the stack to affect the stack pointer. After some trial and error I found a combination that worked. Then I experimented with that combination to determine if it was the values being pushed that were important or the actual value of the stack pointer that was important.

At this point I was confused, as I didn’t know about any stack alignment requirements, I only knew about data alignment requirements. I then went searching for appropriate information about stack alignments and found this handy document from Microsoft clarifies that.

What is datatype misalignment?

Datatype alignment is when the data read by the CPU falls on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is aligned on a 4 byte boundary.

In the following code examples, let us assume that data points to a location aligned on a four byte boundary.

void aligned(BYTE	*data)
	DWORD	dw;

	dw = *(DWORD *)&bp[4];

Datatype misalignment is when the data read by the CPU does not fall on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is not aligned on a 4 byte boundary.

void misaligned(BYTE	*data)
	DWORD	dw;

	dw = *(DWORD *)&bp[5];

Why should I care about datatype misalignment?

Aligned data reads and data writes happen at the maximum speed the memory subsystem and processor can provide. For example to read an aligned DWORD, one 32 bit data read needs to be performed.

	BYTE	[ 1];
	BYTE	[ 2];
	BYTE	[ 3];
DWORD	BYTE	[ 4];	// ignored
	BYTE	[ 5];	// read
	BYTE	[ 6];	// read
	BYTE	[ 7];	// read
DWORD	BYTE	[ 8];	// read
	BYTE	[ 9];	// ignored
	BYTE	[10];	// ignored
	BYTE	[11];	// ignored
	BYTE	[13];
	BYTE	[14];
	BYTE	[15];

Misligned data reads and data writes do not happen at the maximum speed the memory subsystem and processor can provide. For example to read a misaligned DWORD the processor has to fetch data for the two 32 bit words that the misaligned data straddles.

In the misaligned example shown above, the read happens at offset 5 in the input array. I’ve shown the input array first 16 bytes, marking where each DWORD starts and showing which bytes are read and which are ignored. If we assume the input array is aligned then the DWORD being read has 3 byte2 in the first DWORD and 1 bytes in the second DWORD. The processor has to read both DWORDs, then shuffle the bytes around, discarding the first byte from the first DWORD and discarding the last 3 bytes from the last DWORD, then combining the remaining bytes to form the requested DWORD.

Performance tests on 32 bit x86 processors shown performance drops of between 2x and 3x. Thats quite a hit. On some other architectures, the performance hit can be much worse. This largely depends on if the processor does the rearrangement (as on x86 and x64 processors) or if an operating system exception handler handles it for you (much slower).

I’ve shown the example with DWORDs, because the they are short enough to be easily shown in a diagram whereas 8 byte or larger values would be unweildy.

The above comments also apply to 8 byte values such as doubles, __int64, DWORD_PTR (on x64), etc.

Clearly, getting your datatype alignments optimized can be very handy in performance terms. Niave porting from 32 bit to 64 bit will not necessarily get you there. You may need to reorganise the order of some data members in your structures. We’ve had to do that with Thread Validator x64.

Not just performance problems either!

In addition to the performance problems mentioned above there is another, more important consideration to be aware of: On x64 Windows operating systems you must have the stackframe correctly aligned. Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.

Failure to ensure that this is the case will mean that a few Windows API calls will fail with the cryptic error code ERROR_NOACCESS (hex: 0x3E6, decimal: 998). This means “Invalid access to memory location”.

The problem with the error code ERROR_NOACCESS in this case is that the real error code that gets converted into ERROR_NOACCESS is STATUS_DATATYPE_MISALIGNMENT, which tells you a lot more. I spent quite a bit of time digging around until I found the true error code for the bug I was chasing that lead me to write this article.

If you are writing code using a compiler, the compiler will sort the stack alignment details out for you. However if you are writing code in assembler, or writing hooks using dynamically created machine language, you need to be aware of the 16 byte stack alignment requirement.

x64 datatype alignment requirements

Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.


Size Alignment
1 1
2 2
4 4
8 8
10 10
16 16

Anything larger than 8 bytes is aligned on the next power of 2 boundary.


Correct stack frame alignment is essential to ensure calling functions works reliably.

Correct datatype alignment is essential for maximum speed when accessing data.

Failure to align stack frames correctly could lead to Win32 API calls failing and or program failure or lack of correct behaviour.

Failure to align data correctly will lead to slow speed when accessing data. This could be disasterous for your application, depending upon what it is doing.


Microsoft x86/x64/IA64 alignment document.

First steps to 64 bit

By , June 10, 2010 6:23 pm

Today Thread Validator 64 bit took it’s first steps on its 64 bit training wheels.

Interesting problems including

  • DLL Hooking
  • DLL Import structure walking
  • Injecting
  • PE 64 bit DLL inspection

The first attempt at launching a 64 bit application and injecting into it worked. Awesome! First time out and something that hard to do worked first time. We’ve got a few datastructure size mismatches and a stack walking problem to look at, but looks like our first 64 bit port is going well.

Check back on the blog for progress or follow us on twitter. We’ll be announcing the progress of our various 64 bit ports on both the blog and twitter.

How to do cpuid and rdtsc on 32 bit and 64 bit Windows

By , June 4, 2010 3:57 pm

With the introduction of WIN64 the C++ compiler has many improvements and certain restrictions. One of those restrictions is no inline assembly code. For those few of us that write hooking software this is a real inconvenience. Inline assembly is also useful for adding little snippets of code to access hardware registers that are not so easy to access from C or C++.

The 64 bit compiler also introduces some intrinsics which are defined in intrinsic.h. These intrinsics allow you to add calls to low level functionality in your 64 bit code. Such functionality includes setting breakpoints, getting CPU and hardware information (cpuid instruction) and reading the hardware timestamp counter.

In this article I’ll show you how you can use the same code for both 32 bit and 64 bit builds to have access to these intrinsics on both platforms.


The 64 bit compiler provides a convenient way for you to hard code breakpoints into your code. Often very useful for putting breakpoints in your code during testing. The __debugbreak() intrinsic provides this functionality.

There is no 32 bit __debugbreak();

For 32 bit systems you have to know 80386 assembly. The breakpoint instruction as opcode 0xcc. The inline assembly for this is __asm int 3;

	#define __debugbreak()				__asm { int 3 }

cpuid – 32 bit

On 32 bit systems there is no cpuid assembly instruction so you have to use the emit directive.

	#define cpuid	__asm __emit 0fh __asm __emit 0a2h

and then you can use cpuid anywhere you need to.

	void doCpuid()
		__asm pushad;		// save all the registers - cpuid trashes EAX, EBX, ECX, EDX
		__asm mov eax, 0; 	// get simplest cpuid data
		cpuid;			// call cpuid, results returned in EAX, EBX, ECX, EDX
		// read cpuid results here before restoring the registers...

		__asm popad;		// restore registers

cpuid – 64 bit

On 64 bit systems you have to use the intrinsic __cpuid(registers, 0) provided in the intrinsic.h file.

	void doCpuid()
	    int registers[4];

	    __cpuid(registers, 0);

rdtsc – 32 bit

On 32 bit systems there is no rdtsc assembly instruction so you have to use the emit directive.

	#define rdtsc	__asm __emit 0fh __asm __emit 031h

and then you can use rdtsc anywhere you need to.

	__int64 getTimeStamp()


	    __asm	mov	li.LowPart, eax;
	    __asm	mov	li.HighPart, edx;
	    return li.QuadPart;

rdtsc – 64 bit

On 64 bit systems you have to use the intrinsic __rdtsc() provided in the intrinsic.h file.

	__int64 getTimeStamp()
	    return __rdtsc();

Thats not very portable is it?

The problem with the above approach is that you end up having two implementations for these functions – one for your 32 bit build and one for your 64 bit build. It would be much more elegant to have a drop in replacement that you can use in your 32 bit code that will compile in the same manner as the 64 bit code that uses the intrinsics defined in intrinsic.h

Here is how you do it. Put all of the code below into a header file and #include that header file wherever you need access to __debugbreak(), __cpuid() or rdtsc().

#ifdef _WIN64
#else	// _WIN64
	// x86 architecture

	// __debugbreak()

	#if     _MSC_VER >= 1300
		// Win32, __debugbreak defined for VC2005 onwards
	#else	//_MSC_VER >= 1300
		// define for before VC 2005

		#define __debugbreak()				__asm { int 3 }
	#endif	//_MSC_VER >= 1300

	// __cpuid(registers, type)
	//		registers is int[4], 
	//		type = 0

	// DO NOT add ";" after each instruction - it screws up the code generation

	#define rdtsc	__asm __emit 0fh __asm __emit 031h
	#define cpuid	__asm __emit 0fh __asm __emit 0a2h

	inline void __cpuid(int	cpuInfo[4], 
						int	cpuType)
		__asm pushad;
		__asm mov	eax, cpuType;

		if (cpuInfo != NULL)
			__asm mov	cpuInfo[0], eax;
			__asm mov	cpuInfo[1], ebx;
			__asm mov	cpuInfo[2], ecx;
			__asm mov	cpuInfo[3], edx;

		__asm popad;

	// __rdtsc()

	inline unsigned __int64 __rdtsc()


		__asm	mov	li.LowPart, eax;
		__asm	mov	li.HighPart, edx;
		return li.QuadPart;

#endif	// _WIN64

Now you can just the the 64 bit style intrinsics in your code and not have to worry about any messy condition code for doing the 32 bit inline assembly or the 64 bit intrinsics. Much neater, elegant, readable and more maintainable.

Additional Information

If you wish to know more about __cpuid(), the parameters it takes and the values returned in the registers array, Microsoft have a __cpuid() instrinsic description which explains everything in great detail.

Cupid instruction

When I wrote this article I kept typing cupid instead of cpuid. I’m sure my mind was on something else :-). How would the cupid instruction be implemented…

Are you a First Aider? Something to think about.

By , June 1, 2010 2:00 pm

This weekend I attended a music convention in the UK.

Something happened on Sunday morning that made me re-assess if I should be a qualified First-Aider.
This post isn’t about software, but all the same it most likely applies to your workplace and quite possibly your private life.

What happened on Sunday? A gentleman I’d be talking with the evening before had a heart attack. People rushed around the campsite and surrounding buildings shouting “Medical Emergency! If you are a doctor, nurse or first aider please go to the main hall immediately”. I was in the campsite at the time. Your first thought is “What has happened?”, followed by “I’m not trained, I can’t help”. It was at that point I knew that I could do nothing to help whoever was in need of help. Its a horrible feeling. You don’t know who is ill, you most likely know who they are and you can’t help.

As it happened there was a doctor (a general practitioner) on site as well as a first aider that had done his training 20 years earlier. An ambulance arrived in short order and after some work resuscitating him they took him to hospital where he underwent appropriate treatment for the heart attack. Hopefully all will be well with him.

Over the remainder of the weekend over the course of various conversations it became clear that just about everyone had come to the same conclusion: Everyone really disliked the powerlessness of being unable to help and quite a few were considering getting themselves trained as a first aider so that they’d be suitably skilled should something happen again.

In the past I’d always been reluctant to consider getting qualified as a first aider. Not because I didn’t want to be nominated but more because I didn’t want to do the training. I can’t quite explain it – I think I had some embarrassment about the training you have to do with the dummy body. That and combined with the “it won’t happen near me” delusion – I deluded myself I’d never have any use for the training. How wrong I was.

Anyway, none of that matters now. I know if I’d be trained I could have helped and may have been of some assistance on Sunday. That would have counted. So I’m going to get some first aider training and encourage my colleagues to do so. How about you?

Panorama Theme by Themocracy