Posts tagged: x64

Changes to injection behaviour

By , July 23, 2015 4:16 pm

We’ve just changed how we launch executables and attach to them at launch time. We’ve also changed how we inject into running executables. This blog post outlines the changes and the reasoning behind the changes.

The injected DLL

Microsoft documentation for DllMain() states that only certain functions can be called from DllMain() and you need to be careful about what you call from DllMain(). The details and reasons for this have never been made very explicit, but the executive summary is that you risk getting into a deadlock with the DLL loader lock (a lock over which you, as a programmer, have no control).

Even though this dire warning existed from the first alpha versions of our tools in 1999 until July 2015 we’ve been starting our profilers by making call from DllMain when it receives the DLL_PROCESS_ATTACH notification. We’ve never had any major problems with this, but that’s probably because we just tried to keep things simple. There are some interesting side-benefits of starting your profiler this way – the profiler stub just starts when you load the profiler DLL. You don’t need to call a specific function to start the profiler DLL. This is also the downside, you can’t control if the profiler stub starts profiling the application or not, it always starts.

Launching executables

Up until now the ability for the profiler to auto-start has been a useful attribute. But we needed to change this so that we could control when the profiler stub starts profiling the application into which it is injected. These changes involved the following:

  • Removing the call to start the profiler from DllMain().
  • Adding additional code to the launch profiler CreateProcess() injector to allow a named function to be looked up by GetProcAddress() then called.
  • Changing all calls to the launch process so that the correct named function is identified.
  • Finding all places that load the profiler DLL and modifying them so that they know to call the appropriate named function.

Injecting into running executables

The above mentioned changes also meant that we had to change the code that injects DLLs into running executables. We use the well documented technique of using CreateRemoteThread() to inject our DLLs into the target application. We now needed to add the ability to specify a DLL name, a function name, LoadLibrary() function address and GetProcAddress() function address and error handling into our dynamically injected code that can be injected into a 32 bit or 64 bit application from our tools.

Performance change

A useful side effect of this change from DllMain() auto-start to the start function being called after the DLL has loaded is that thread creation happens differently.

When the profiler stub starts via a call to the start profiler from DllMain() any threads created with CreateThread()/beginthread()/beginthreadex() all wait until the calling DllMain() call terminates before these threads start running. You can create the threads and get valid thread handles, etc but they don’t start working until the DllMain() returns. This is to part of Microsoft’s own don’t-cause-a-DLL-loader-lock-deadlock protection scheme. This means our threads that communicate data to the profiler GUI don’t run until the instrumentation process of the profiler is complete (because it all happens from a call inside DllMain()).

But now we’ve changed to calling the start profiler function from after the LoadLibrary() call all the threads start pretty much when we create them. This means that all data comms with the GUI get up to speed immediately and start sending data even as the instrumentation proceeds. This means that the new arrangement gets data to the GUI faster and the user of the software starts receiving actionable data in quicker than the old arrangement.

UX changes

In doing this work we noticed some inconsistencies with some of our tools (Coverage Validator and Thread Validator for instance) when working with an elevated Validator and non-elevated target application if we were injecting into the running target application. The shared memory used by these tools to communicate important profiling data wasn’t accessible due to permissions conflicts between the two processes. This was a problem because we were insisting the the Validator should be elevated prior to performing an injection into any process.


A bit of experimentation showed that under the new injection regime described above that we didn’t need to elevate the Validator to succeed at injecting into the target non-elevated application. It seems that you get the best results when the target application and the Validator require the same elevation level. This is also important for working with services as they tend to run with elevated permissions these days – but injecting into services is always problematic due to different security regimes for services.

This insight allowed us to remove the previously mandatory "Restart with administrator privileges" dialog and move any potential request to elevate privileges into the inject wizard and inject dialog. In this article I will describe the inject dialog, the changes to the inject wizard are similar with minor changes to accommodate the difference between dialog and wizard.


Depending upon Operating System and the version of the software there are two columns that may or may not be displayed on the inject dialog. The display can be sorted by all columns.

Elevation status

When running on Windows Vista or any more recent operating system the inject dialog will display an additional column titled Admin. If a process is running with elevated permissons this column will display an administrator permissions badge indicating the elevation may be required to inject into this process.

Processor architecture

When running 64 bit versions of our tools an additional column, titled Arch will be added to the inject dialog. This column indicates if the process is a 32 bit process or 64 bit process. We could have added a control to allow only 32 bit or 64 bit processes to be displayed but our thinking is that examining this column will be something is only done for confirmation if the user of our tools is working on both 32 bit and 64 bit versions of their tools. As such having to find the process selector and select that you are interested in 32 bit tools is overhead the user probably doesn’t need.

64 bit C++ software tool Beta Tests are complete.

By , January 9, 2014 1:33 pm

We recently closed the beta tests for the 64 bit versions of C++ Coverage Validator, C++ Memory Validator, C++ Performance Validator and C++ Thread Validator.

We launched the software on 2nd January 2014. A soft launch, no fanfare, no publicity. We just wanted to make the software available and then contact all the beta testers so that we could honour our commitments made at the start of the beta test.

Those commitments were to provide a free single user licence to any beta tester that provided feedback, usage reports, bugs reports, etc about the software. This doesn’t include anyone that couldn’t install the software because they used the wrong licence key!

We’ve written a special app here that we can use to identify all email from beta test participants and allow us to evaluate that email for beta test feedback criteria. It’s saved us a ton of time and drudge work even though writing this extension to the licence manager software took a few days. It was interesting using the tool and seeing who provided feedback and how much.

We’ve just sent out the licence keys and download instructions to all those beta testers that were kind enough to take the time to provide feedback, bug reports etc. to us. A few people went the extra mile. These people bombarded us with email containing huge bugs, trivial items and everything in between. Two of them, we were on the verge of flying out to their offices when we found some useful software that allowed to us to remotely debug their software. Special mentions go to:

Bengt Gunne (
Ciro Ettorre (
Kevin Ernst (

We’re very grateful for everyone taking part in the beta test. Thank you very much.

Why didn’t I get a free licence?

If you didn’t receive a free licence and you think you did provide feedback, please contact us. It’s always possible that a few people slipped through our process of identifying people.

Dang! I knew I should’ve provided feedback

If you didn’t provide us with any feedback, check your inbox. You’ll find a 50% off coupon for the tool that you tested.

64 bit tools leaving beta this month

By , December 5, 2013 1:14 pm

For those of you keeping a keen eye on the version numbers of the various C++ 64 bit betas we have will have noticed that the betas now have the same version number as their 32 bit counterparts. The reason for this is that we are getting ready for these tools to leave beta and to be released. We expect this to be during December 2013, if all goes well in the next week or so.

Before we can do this we need to make some modifications to the website, the shopping cart, test it all still works correctly, etc.

Once released the 32 bit and 64 bit C++ tools will have the same version number.

The 32 bit tools will continue to work with 32 bit executables on both 32 bit and 64 bit operating systems.

The 64 bit tools will only work with 64 bit executables on 64 bit operating systems.

The 64 bit tools will come bundled with the 32 bit version of the tool so that you can work with 32 bit or 64 bit executables on 32 bit and 64 bit operating systems. We anticipate a future version of the 64 bit tools that can launch and monitor 32 bit executables. This will allow you to collect more data and examine much larger datasets when testing 32 bit executables. We have prototypes of these tools now, but they are not ready for public release yet.

Why so long?

I know its been a very long time for these tools to have been in beta. We could have released some of them some time ago but we wanted to release all four at the same time so that we could also offer suites of tools and the also the combined 32+64 bundles. The problem was that C++ Memory Validator x64 had some problems with a few products being beta tested and C++ Performance Validator x64 also had some problems with SolidWorks 64 bit. We fixed the C++ Performance Validator x64 problem a few months ago and have been working on nailing all the remaining C++ Memory Validator x64 bugs.

Two beta testers in particular deserve a special mention for really helping us with these last bugs. Ciro Ettorre and Kevin Ernst. Ciro provided many logs of data for me to stare at and Kevin provided us with a machine that we could use to repeatedly try new fixes on.

I guess I didn’t really follow the advice I give to many people: “Get your product out there”. But that’s because these are new versions of existing products and I wanted them to be as good as the existing 32 bit tools. I hope we do not disappoint you.


I’d also like to recommend TeamViewer as a very useful product for remote working. We couldn’t have done the work with Kevin without this excellent tool. Best used with a headset microphone with the Voice over IP part of TeamViewer turned on.

What next?

As always we’ll keep you updated with what is happening with the betas via the email address associated with your beta test account.

“cannot open type library file” error on x64 systems

By , May 18, 2012 11:18 am

I’ve just tried building a Visual Studio 2010 helper DLL on Windows 7 x64.

The build failed with “cannot open type library file vsmso.olb : No such file or directory”.

A quick search found them in c:\program files (x86)\microsoft shared\MSEnv. This is the folder for 32 bit applications (Visual Studio is a 32 bit application).

OK, so if the files are present on the machine why does the build fail for this line?

	//The following #import imports VS Command Bars
	#import <vsmso.olb> raw_interfaces_only named_guids

The reason is the Visual Studio project include directories are setup for c:\program files\microsoft shared\MSEnv. You need to edit the properties for Debug and Release for each processor type (in my case Win32 and x64) and change the include directory (or add another include directory) c:\program files (x86)\microsoft shared\MSEnv then rebuild.

This is a potential non-obvious timewaster. Hope I’ve saved you some time.

Problem solved.

x64 Porting gotcha #3. Serializing collections

By , April 22, 2012 1:21 pm

When Microsoft ported MFC to 64 bits they also changed the return type for the GetSize() and GetCount() methods in the collection classes. They changed the return type from the 32 bit DWORD to the 64 bit DWORD_PTR on x64. This has implications if you write your own collection serialization methods. For example if you use a CMap<> to map a thread id to an object you will want to write your own serialization code.

For example (error checking removed for simplification), consider the serialization of this collection.

	CMap		threadObjectStatistics;


	CSingleLock	lock(&threadObjectStatisticsSect, TRUE);

	ar << threadObjectStatistics.GetCount();

	pos = threadObjectStatistics.GetStartPosition();
	while(pos != NULL)
		runningObjectManager	*rom = NULL;
		DWORD			threadID;
		threadObjectStatistics.GetNextAssoc(pos, threadID, rom);
		ar << threadID;


	CSingleLock	lock(&threadObjectStatisticsSect, TRUE);
	DWORD		i, count;

	ar >> count;
	for(i = 0; i < count; i++)
		runningObjectManager	*rom = new runningObjectManager();
		DWORD			threadID;
		ar >> threadID;
		threadObjectStatistics.SetAt(threadID, rom);

In the above code the first item saved/loaded is the number of objects in the CMap. After that the thread id and the complex object associated with the type is saved/loaded for each pair of objects in the CMap. The code above uses a DWORD to load the size. This won’t work for x64 because the count of objects is taken directly from the GetCount() method (or GetSize() for some collection types).

	ar << threadObjectStatistics.GetCount();

x86, return type is DWORD, count is saved as DWORD (32 bit)

x64, return type is DWORD_PTR, count is saved as DWORD_PTR (64 bit)

This is a problem because the loading code is expecting a DWORD.

	DWORD		i, count;

	ar >> count;

Update (23/4/2012): Turns out the same issue affects the STL collections as well. If you are directly serializing the result from the size() method in an STL collection you will be faced with the same problem as I describe for MFC.

Solution 1

One solution is simply to change the type being loaded from a DWORD to a DWORD_PTR.

	DWORD_PTR	i, count;

	ar >> count;

Solution 2

An alternative solution is to always save the size as a DWORD.

	DWORD		count;

	count = (DWORD)threadObjectStatistics.GetCount();
	ar << count;


You may think this is a trivial issue and why write a blog post about it? I agree the actual problem is trivial and the fix for it is also trivial. However the fact that you have a mismatch in datatypes being serialized because you directly serialized the return value from GetCount() is not so obvious. So much so that this particular issue escaped our attention (and got past a static analyser) until today.

So yes, its a trivial problem but its a hidden problem and it will cause all sorts of issues during serialization and when you go looking for it you'll probably look straight past it for a while. Hopefully I've just saved you a few hours of banging your head on a brick wall, or more likely your desk.

64 bit porting gotcha #2! x64 Register preservation

By , March 9, 2012 8:01 pm

In a previous article on x64 development I mentioned the problem of aligning the callstack on 16 byte boundaries and what happens if you do not do this.

Why 16 bytes?

At the time it seemed odd to me that the stack had to be 16 byte aligned. Why? All the parameters are 8 bytes (64 bits) wide and the first four are passed in registers. Everything else spills onto the stack and anything larger is passed as a reference. Floating point values are passed in dedicated floating point registers.

And there lies the key. That last sentence. The floating point registers.

Floating point on x64 is not done using the floating point coprocessor instructions. Instead the SSE instruction sets (and its extensions) are used.

If everything floats, whats the point?

If you are just hooking x64 functions and possibly collecting callstack you may never know need to know about floating point register preservation. We managed to get all four of our C++ tools (coverage, memory, profiler and deadlock detector) functional without knowing. Why would we need to know? Floating point preservation was never important for x86 because we could never damage the floating point without trying to do so.

But when we got into the details of the last bugs that were killing us we noticed seemingly random crashes. Closer investigation showed that calls to Win32 API functions had a tendency to wipe out the floating point registers. And that is when we got interested in what happens if we preserved the x64 floating point registers.

How to view the registers?

At first glance this wasn’t obvious to me. The Registers window in Visual Studio just shows the registers from RAX through R15 etc. However if you right click this window there is a helpful context menu that allows you to choose just how much information you display in this window.

Once you have the registers in view things get a lot easier inside the debugger. You can step through your code ensuring that nothing is getting trampled on until Viola! the floating point registers get totally hosed. A bit more investigation and you realise that seemingly innocent call you had in your code contains a call to a Win32 function (for example VirtualProtect) and that that function is responsible for the register death.

OK, so how do we preserve registers on x64? Its nothing like on x86.

x64 floating point preservation

The x64 designers in their infinite wisdom took away two very useful instructions (pushad and popad). As a result x64 hook writers now have to push lots of registers and pop lots of registers at the start and end of each hook. You can even see this in parts of the Windows operating system DLLs. So much simpler just to push everything and pop everything.

However what the Lord taketh away he can give back. And the x64 designers did that by providing two dedicated instructions for saving and restoring floating point. fxsave and fxrstor. These instructions take one operand each. The operand must point to a 512 byte chunk of memory which is 16 byte aligned.

A common usage would be as shown below although you can use any register as the destination location. It just so happens that the stack pointer (rsp) is the most common usage.

	sub	rsp, 200h;
	fxsave	[rsp];

	.. do your task that damages the floating point registers

	fxrstor	[rsp;]
	add	rsp, 200h;

When you see the above usage you can see why there is the requirement for the stack to be 16 byte aligned. Why 16 bytes? I suspect it is because it is the start of a cache line and that makes executing the instruction *SO* much quicker.


So now you know why the x64 callstack is 16 byte aligned. Its all to do with ensuring your code executes as fast as possible, especially when executing a memory intensive register copy when saving and restoring the x64 floating point registers. I’ve also shown you how to preserve and restore the floating point registers.

64 bit porting gotcha #1! x64 Datatype misalignment.

By , June 17, 2010 11:10 am

Datatype misalignment, there is a topic so interesting you’d probably prefer to watch paint dry.

But! There are serious consequences for getting it wrong. So perhaps you’d better read about it after all 🙂

The problem that wasted my time

Why am I writing about datatype misalignment? Because its just eaten two days of my time and if what I share with you helps save you from such trouble, all the better.

The problem I was chasing was that three calls to CreateThread() were failing. All calls were failing with ERROR_NOACCESS. They would only fail if called from functions in the Thread Validator x64 profiling DLL injected into the target x64 application. If the same functions were called later in the application (via a Win32 API hook or directly from the target application) the functions would work. That meant that the input parameters were correct.

Lots of head scratching and trying many, many variations of input parameters and asking questions on Stack Overflow and we were stuck. I could only think it was to do with the callstack but I had no idea why. So I started investigating the value of RSP during the various calls. The investigating what happened if I pushed more data onto the stack to affect the stack pointer. After some trial and error I found a combination that worked. Then I experimented with that combination to determine if it was the values being pushed that were important or the actual value of the stack pointer that was important.

At this point I was confused, as I didn’t know about any stack alignment requirements, I only knew about data alignment requirements. I then went searching for appropriate information about stack alignments and found this handy document from Microsoft clarifies that.

What is datatype misalignment?

Datatype alignment is when the data read by the CPU falls on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is aligned on a 4 byte boundary.

In the following code examples, let us assume that data points to a location aligned on a four byte boundary.

void aligned(BYTE	*data)
	DWORD	dw;

	dw = *(DWORD *)&bp[4];

Datatype misalignment is when the data read by the CPU does not fall on the natural datatype boundary of the datatype. For example, when you read a DWORD and the DWORD is not aligned on a 4 byte boundary.

void misaligned(BYTE	*data)
	DWORD	dw;

	dw = *(DWORD *)&bp[5];

Why should I care about datatype misalignment?

Aligned data reads and data writes happen at the maximum speed the memory subsystem and processor can provide. For example to read an aligned DWORD, one 32 bit data read needs to be performed.

	BYTE	[ 1];
	BYTE	[ 2];
	BYTE	[ 3];
DWORD	BYTE	[ 4];	// ignored
	BYTE	[ 5];	// read
	BYTE	[ 6];	// read
	BYTE	[ 7];	// read
DWORD	BYTE	[ 8];	// read
	BYTE	[ 9];	// ignored
	BYTE	[10];	// ignored
	BYTE	[11];	// ignored
	BYTE	[13];
	BYTE	[14];
	BYTE	[15];

Misligned data reads and data writes do not happen at the maximum speed the memory subsystem and processor can provide. For example to read a misaligned DWORD the processor has to fetch data for the two 32 bit words that the misaligned data straddles.

In the misaligned example shown above, the read happens at offset 5 in the input array. I’ve shown the input array first 16 bytes, marking where each DWORD starts and showing which bytes are read and which are ignored. If we assume the input array is aligned then the DWORD being read has 3 byte2 in the first DWORD and 1 bytes in the second DWORD. The processor has to read both DWORDs, then shuffle the bytes around, discarding the first byte from the first DWORD and discarding the last 3 bytes from the last DWORD, then combining the remaining bytes to form the requested DWORD.

Performance tests on 32 bit x86 processors shown performance drops of between 2x and 3x. Thats quite a hit. On some other architectures, the performance hit can be much worse. This largely depends on if the processor does the rearrangement (as on x86 and x64 processors) or if an operating system exception handler handles it for you (much slower).

I’ve shown the example with DWORDs, because the they are short enough to be easily shown in a diagram whereas 8 byte or larger values would be unweildy.

The above comments also apply to 8 byte values such as doubles, __int64, DWORD_PTR (on x64), etc.

Clearly, getting your datatype alignments optimized can be very handy in performance terms. Niave porting from 32 bit to 64 bit will not necessarily get you there. You may need to reorganise the order of some data members in your structures. We’ve had to do that with Thread Validator x64.

Not just performance problems either!

In addition to the performance problems mentioned above there is another, more important consideration to be aware of: On x64 Windows operating systems you must have the stackframe correctly aligned. Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.

Failure to ensure that this is the case will mean that a few Windows API calls will fail with the cryptic error code ERROR_NOACCESS (hex: 0x3E6, decimal: 998). This means “Invalid access to memory location”.

The problem with the error code ERROR_NOACCESS in this case is that the real error code that gets converted into ERROR_NOACCESS is STATUS_DATATYPE_MISALIGNMENT, which tells you a lot more. I spent quite a bit of time digging around until I found the true error code for the bug I was chasing that lead me to write this article.

If you are writing code using a compiler, the compiler will sort the stack alignment details out for you. However if you are writing code in assembler, or writing hooks using dynamically created machine language, you need to be aware of the 16 byte stack alignment requirement.

x64 datatype alignment requirements

Correct stack alignment on x64 systems means that the stack frame must be aligned on a 16 byte boundary.


Size Alignment
1 1
2 2
4 4
8 8
10 10
16 16

Anything larger than 8 bytes is aligned on the next power of 2 boundary.


Correct stack frame alignment is essential to ensure calling functions works reliably.

Correct datatype alignment is essential for maximum speed when accessing data.

Failure to align stack frames correctly could lead to Win32 API calls failing and or program failure or lack of correct behaviour.

Failure to align data correctly will lead to slow speed when accessing data. This could be disasterous for your application, depending upon what it is doing.


Microsoft x86/x64/IA64 alignment document.

First steps to 64 bit

By , June 10, 2010 6:23 pm

Today Thread Validator 64 bit took it’s first steps on its 64 bit training wheels.

Interesting problems including

  • DLL Hooking
  • DLL Import structure walking
  • Injecting
  • PE 64 bit DLL inspection

The first attempt at launching a 64 bit application and injecting into it worked. Awesome! First time out and something that hard to do worked first time. We’ve got a few datastructure size mismatches and a stack walking problem to look at, but looks like our first 64 bit port is going well.

Check back on the blog for progress or follow us on twitter. We’ll be announcing the progress of our various 64 bit ports on both the blog and twitter.

Panorama Theme by Themocracy