
64 bit porting gotcha #2! x64 Register preservation

By , March 9, 2012 8:01 pm

In a previous article on x64 development I mentioned the problem of aligning the callstack on 16 byte boundaries and what happens if you do not do this.

Why 16 bytes?

At the time it seemed odd to me that the stack had to be 16 byte aligned. Why? All the parameters are 8 bytes (64 bits) wide and the first four are passed in registers. Everything else spills onto the stack and anything larger is passed as a reference. Floating point values are passed in dedicated floating point registers.

And there lies the key. That last sentence. The floating point registers.

Floating point on x64 is not done using the floating point coprocessor instructions. Instead the SSE instruction set (and its extensions) is used.

If everything floats, what’s the point?

If you are just hooking x64 functions and possibly collecting callstacks you may never need to know about floating point register preservation. We managed to get all four of our C++ tools (coverage, memory, profiler and deadlock detector) functional without knowing. Why would we need to know? Floating point preservation was never important for x86 because we could never damage the floating point state without trying to do so.

But when we got into the details of the last bugs that were killing us we noticed seemingly random crashes. Closer investigation showed that calls to Win32 API functions had a tendency to wipe out the floating point registers. And that is when we got interested in what happens if we preserved the x64 floating point registers.

How to view the registers?

At first glance this wasn’t obvious to me. The Registers window in Visual Studio just shows the registers from RAX through R15 etc. However if you right click this window there is a helpful context menu that allows you to choose just how much information you display in this window.

Once you have the registers in view things get a lot easier inside the debugger. You can step through your code ensuring that nothing is getting trampled on until Voilà! the floating point registers get totally hosed. A bit more investigation and you realise that the seemingly innocent call you had in your code contains a call to a Win32 function (for example VirtualProtect) and that this function is responsible for the register death.

OK, so how do we preserve registers on x64? It’s nothing like on x86.

x64 floating point preservation

The x64 designers in their infinite wisdom took away two very useful instructions (pushad and popad). As a result x64 hook writers now have to push lots of registers and pop lots of registers at the start and end of each hook. You can even see this in parts of the Windows operating system DLLs. So much simpler just to push everything and pop everything.

However what the Lord taketh away he can give back. x64 has two dedicated instructions for saving and restoring the floating point state: fxsave and fxrstor. (They are not new to x64; later x86 processors have them too.) These instructions take one operand each. The operand must point to a 512 byte chunk of memory which is 16 byte aligned.

A common usage would be as shown below although you can use any register as the destination location. It just so happens that the stack pointer (rsp) is the most common usage.

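	sub	rsp, 200h		; reserve 512 bytes (rsp must already be 16 byte aligned)
	fxsave	[rsp]			; save the floating point state

	; .. do your task that damages the floating point registers

	fxrstor	[rsp]			; restore the floating point state
	add	rsp, 200h		; release the 512 bytes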

When you see the above usage you can see why there is the requirement for the stack to be 16 byte aligned. Why 16 bytes? It is the width of an SSE register, and a 16 byte aligned access of that size never straddles a cache line, which I suspect makes executing the instruction *SO* much quicker.
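The arithmetic behind reserving the save area can be sketched in C++. This is only an illustration, not code from our tools: the names kFxsaveAreaSize, kFxsaveAlignment, alignDown and fxsaveAreaFor are made up for this example, and the stack pointer is modelled as a plain integer rather than read from rsp. It reserves the 512 bytes and then rounds down to a 16 byte boundary, which is what a hook would need to do if it could not assume rsp was already aligned.

```cpp
#include <cassert>
#include <cstdint>

// Size and alignment that fxsave/fxrstor require for their memory operand.
const std::uintptr_t kFxsaveAreaSize  = 0x200;	// 512 bytes
const std::uintptr_t kFxsaveAlignment = 16;

// Round an address down to a multiple of alignment.
// alignment must be a power of 2.
inline std::uintptr_t alignDown(std::uintptr_t address, std::uintptr_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return address & ~(alignment - 1);
}

// Given a (modelled) stack pointer, return where a save area could start.
// The stack grows downwards, so we subtract first and then round down.
inline std::uintptr_t fxsaveAreaFor(std::uintptr_t stackPointer)
{
	return alignDown(stackPointer - kFxsaveAreaSize, kFxsaveAlignment);
}
```

If the stack pointer is already 16 byte aligned the rounding does nothing and this matches the sub rsp, 200h shown earlier. If it is off by 8 the save area simply starts 8 bytes lower.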


So now you know why the x64 callstack is 16 byte aligned. It’s all to do with ensuring your code executes as fast as possible, especially when executing a memory intensive register copy when saving and restoring the x64 floating point registers. I’ve also shown you how to preserve and restore the floating point registers.

64 bit porting gotcha #1! x64 Datatype misalignment.

By , June 17, 2010 11:10 am

Datatype misalignment, there is a topic so interesting you’d probably prefer to watch paint dry.

But! There are serious consequences for getting it wrong. So perhaps you’d better read about it after all 🙂

The problem that wasted my time

Why am I writing about datatype misalignment? Because it’s just eaten two days of my time and if what I share with you helps save you from such trouble, all the better.

The problem I was chasing was that three calls to CreateThread() were failing. All calls were failing with ERROR_NOACCESS. They would only fail if called from functions in the Thread Validator x64 profiling DLL injected into the target x64 application. If the same functions were called later in the application (via a Win32 API hook or directly from the target application) the functions would work. That meant that the input parameters were correct.

Lots of head scratching, trying many, many variations of input parameters and asking questions on Stack Overflow, and we were stuck. I could only think it was to do with the callstack but I had no idea why. So I started investigating the value of RSP during the various calls. Then I investigated what happened if I pushed more data onto the stack to affect the stack pointer. After some trial and error I found a combination that worked. Then I experimented with that combination to determine if it was the values being pushed that were important or the actual value of the stack pointer that was important.

At this point I was confused, as I didn’t know about any stack alignment requirements; I only knew about data alignment requirements. I then went searching for appropriate information about stack alignments and found this handy document from Microsoft that clarifies it.

What is datatype misalignment?

Datatype alignment is when the data read by the CPU falls on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is aligned on a 4 byte boundary.

In the following code examples, let us assume that data points to a location aligned on a four byte boundary.

void aligned(BYTE	*data)
{
	DWORD	dw;

	dw = *(DWORD *)&data[4];	// offset 4 is on a 4 byte boundary
}

Datatype misalignment is when the data read by the CPU does not fall on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is not aligned on a 4 byte boundary.

void misaligned(BYTE	*data)
{
	DWORD	dw;

	dw = *(DWORD *)&data[5];	// offset 5 is not on a 4 byte boundary
}
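If you want to check alignment at runtime, or read a DWORD from an address you cannot trust, something like the following sketch may help. It is an assumption on my part rather than code from Thread Validator: isAligned and readDword are hypothetical helper names, and it uses the standard fixed width types in place of BYTE and DWORD so that it builds without the Windows headers. The read goes through memcpy, so no misaligned pointer is ever dereferenced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if pointer sits on a multiple of alignment.
// alignment must be a power of 2.
inline bool isAligned(const void *pointer, std::size_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return (reinterpret_cast<std::uintptr_t>(pointer) & (alignment - 1)) == 0;
}

// Read a 32 bit value from any address, aligned or not,
// by copying the four bytes into a properly aligned local.
inline std::uint32_t readDword(const unsigned char *data)
{
	std::uint32_t	dw;

	std::memcpy(&dw, data, sizeof(dw));
	return dw;
}
```

With data aligned on a four byte boundary, isAligned(&data[4], 4) is true and isAligned(&data[5], 4) is false, matching the two functions above.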

Why should I care about datatype misalignment?

Aligned data reads and data writes happen at the maximum speed the memory subsystem and processor can provide. For example to read an aligned DWORD, one 32 bit data read needs to be performed.

DWORD	BYTE	[ 0];
	BYTE	[ 1];
	BYTE	[ 2];
	BYTE	[ 3];
DWORD	BYTE	[ 4];	// ignored
	BYTE	[ 5];	// read
	BYTE	[ 6];	// read
	BYTE	[ 7];	// read
DWORD	BYTE	[ 8];	// read
	BYTE	[ 9];	// ignored
	BYTE	[10];	// ignored
	BYTE	[11];	// ignored
DWORD	BYTE	[12];
	BYTE	[13];
	BYTE	[14];
	BYTE	[15];

Misaligned data reads and data writes do not happen at the maximum speed the memory subsystem and processor can provide. For example to read a misaligned DWORD the processor has to fetch data for the two 32 bit words that the misaligned data straddles.

In the misaligned example shown above, the read happens at offset 5 in the input array. I’ve shown the first 16 bytes of the input array, marking where each DWORD starts and showing which bytes are read and which are ignored. If we assume the input array is aligned then the DWORD being read has 3 bytes in the first DWORD and 1 byte in the second DWORD. The processor has to read both DWORDs, then shuffle the bytes around, discarding the first byte from the first DWORD and discarding the last 3 bytes from the second DWORD, then combining the remaining bytes to form the requested DWORD.
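The shuffle described above can be modelled in a few lines of C++. This is a sketch of the idea only, not a description of what the silicon really does: combineMisaligned is a hypothetical name, and the two aligned words are treated as little-endian values, as on x86 and x64.

```cpp
#include <cassert>
#include <cstdint>

// Combine two neighbouring aligned 32 bit words into the DWORD that starts
// byteOffset bytes into the first word (little-endian byte order assumed).
inline std::uint32_t combineMisaligned(std::uint32_t first, std::uint32_t second, unsigned byteOffset)
{
	assert(byteOffset < 4);

	if (byteOffset == 0)
		return first;		// already aligned, one read is enough

	const unsigned	shift = byteOffset * 8;

	// Discard the low byteOffset bytes of the first word, then fill the
	// top of the result with the low byteOffset bytes of the second word.
	return (first >> shift) | (second << (32 - shift));
}
```

For the read at offset 5, first holds bytes 4 to 7, second holds bytes 8 to 11 and byteOffset is 1. If each byte holds its own index, the two words are 0x07060504 and 0x0B0A0908 and the combined result is 0x08070605.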

Performance tests on 32 bit x86 processors have shown performance drops of between 2x and 3x. That’s quite a hit. On some other architectures, the performance hit can be much worse. This largely depends on whether the processor does the rearrangement (as on x86 and x64 processors) or an operating system exception handler handles it for you (much slower).

I’ve shown the example with DWORDs, because they are short enough to be easily shown in a diagram whereas 8 byte or larger values would be unwieldy.

The above comments also apply to 8 byte values such as doubles, __int64, DWORD_PTR (on x64), etc.

Clearly, getting your datatype alignments optimized can be very handy in performance terms. Naive porting from 32 bit to 64 bit will not necessarily get you there. You may need to reorganise the order of some data members in your structures. We’ve had to do that with Thread Validator x64.
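As an illustration of the kind of reorganisation I mean, here is a sketch with two made up structures (NaiveRecord and ReorderedRecord are hypothetical, not taken from Thread Validator). A 64 bit integer stands in for a pointer sized member so the example behaves the same whatever the build target. The compiler keeps each member naturally aligned by inserting padding, and putting the largest member first usually reduces how much padding is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout carried over unchanged from a 32 bit build.
struct NaiveRecord
{
	std::uint8_t	flag;
	std::uint64_t	handle;		// stands in for a pointer sized member
	std::uint8_t	state;
};

// Same members, largest first, so less padding is needed.
struct ReorderedRecord
{
	std::uint64_t	handle;
	std::uint8_t	flag;
	std::uint8_t	state;
};

// Sanity checks a porting pass might keep around (hypothetical helper).
inline void checkRecordLayouts()
{
	assert(offsetof(NaiveRecord, handle) % alignof(std::uint64_t) == 0);
	assert(offsetof(ReorderedRecord, handle) == 0);
	assert(sizeof(ReorderedRecord) <= sizeof(NaiveRecord));
}
```

On a typical x64 compiler I would expect the first layout to take 24 bytes and the second 16, but check sizeof with your own compiler and packing settings rather than taking my word for it.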

Not just performance problems either!

In addition to the performance problems mentioned above there is another, more important consideration to be aware of: On x64 Windows operating systems you must have the stackframe correctly aligned. Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.

Failure to ensure that this is the case will mean that a few Windows API calls will fail with the cryptic error code ERROR_NOACCESS (hex: 0x3E6, decimal: 998). This means “Invalid access to memory location”.

The problem with the error code ERROR_NOACCESS in this case is that the real error code that gets converted into ERROR_NOACCESS is STATUS_DATATYPE_MISALIGNMENT, which tells you a lot more. I spent quite a bit of time digging around until I found the true error code for the bug I was chasing that led me to write this article.

If you are writing code using a compiler, the compiler will sort the stack alignment details out for you. However if you are writing code in assembler, or writing hooks using dynamically created machine language, you need to be aware of the 16 byte stack alignment requirement.
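For hook writers, the check and the fix-up come down to simple arithmetic on the stack pointer. The sketch below models that arithmetic in C++ with the stack pointer as a plain integer. The names kStackAlignment, isStackAligned, paddingBeforeCall and alignForCall are hypothetical, and in a real hook you would emit the equivalent machine code rather than call functions like these.

```cpp
#include <cassert>
#include <cstdint>

const std::uintptr_t kStackAlignment = 16;

// True if the (modelled) stack pointer is on a 16 byte boundary.
inline bool isStackAligned(std::uintptr_t stackPointer)
{
	return (stackPointer % kStackAlignment) == 0;
}

// Bytes to subtract from the stack pointer so that it is 16 byte aligned
// before making a call. The stack grows downwards, so we round down.
inline std::uintptr_t paddingBeforeCall(std::uintptr_t stackPointer)
{
	return stackPointer % kStackAlignment;
}

// The stack pointer value to use at the point of the call.
inline std::uintptr_t alignForCall(std::uintptr_t stackPointer)
{
	const std::uintptr_t	aligned = stackPointer - paddingBeforeCall(stackPointer);

	assert(isStackAligned(aligned));
	return aligned;
}
```

Remember that the call instruction itself pushes an 8 byte return address, so a stack pointer that was aligned at the point of the call is off by 8 on entry to the called function. A hook that pushes an odd number of 8 byte registers on entry ends up aligned again; one that pushes an even number needs 8 bytes of padding.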

x64 datatype alignment requirements

Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.


Size (bytes)	Alignment (bytes)
1		1
2		2
4		4
8		8
10		16
16		16

Anything larger than 8 bytes is aligned on the next power of 2 boundary.
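The rule above can be written as a small function. This is just my reading of the rule expressed in C++, with a hypothetical name (alignmentForSize); it is not taken from the Microsoft document.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of 2 that is greater than or equal to size,
// which is the alignment the rule above gives for that size.
inline std::size_t alignmentForSize(std::size_t size)
{
	assert(size != 0);

	std::size_t	alignment = 1;

	while (alignment < size)
		alignment *= 2;

	return alignment;
}
```

Sizes that are already a power of 2 are aligned on their own size, and a 10 byte value is aligned on a 16 byte boundary.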


Correct stack frame alignment is essential to ensure calling functions works reliably.

Correct datatype alignment is essential for maximum speed when accessing data.

Failure to align stack frames correctly could lead to Win32 API calls failing and/or program failure or incorrect behaviour.

Failure to align data correctly will lead to slow speed when accessing data. This could be disastrous for your application, depending upon what it is doing.


Microsoft x86/x64/IA64 alignment document.
