This post has some test results for several integer types introduced in C99; I will also explain a little bit about structure packing, and why I think I got the results I did.
I was creating a shooting game and needed a way to store bullets. I decided to use a fixed-size C array instead of a resizable array or a list, and the reason is simple: I have to iterate through all of the bullets every frame, so a linked list would mean a performance hit there. Also, as I will show later, adding a pointer would make each bullet MUCH bigger. Finally, I think the performance hit of deleting bullets (i.e. overwriting an invalid bullet with the last valid bullet of the array) is smaller than the performance hit of iterating through pointers.
So, this is the basic structure used for my bullet (the position and speed member names here are reconstructed from the description below):

struct Bullet {
    int x, y;                 /* position, in whole pixels */
    short speedX, speedY;     /* per-frame velocity; names reconstructed */
    unsigned char bulletType; /* meant to hold an enumeration value */
};
I used int for the x and y position as the “default” integer type; also, a bullet cannot be at a “half-pixel” position anyway, so there is no need to use a floating-point number.
I decided to use short for the speed variables to use less memory, since the speed is always quite small anyway. Finally, bulletType is supposed to hold an enumeration value; I used an unsigned char to make sure it uses less memory too, since the size of an enumeration is implementation-dependent and might even be some crazy size (like the pointer size on a 64-bit machine, thus 8 bytes).
Some people will now come after me with pitchforks, along with the entire village, chanting: “KILL THE PREMATURE OPTIMIZATION MONSTER!” Well, indeed it is premature, but I wanted to learn a bit. And we must remember that although today's machines have crazy amounts of memory (compared to a few kilobytes when the first R-Type was released back in 1987), we don't have infinite cache: the more memory you use during calculations, the bigger the chance of a CPU cache miss.
The benchmark is simple: I created several versions of the above structure, and for each of them I ran a simulation of the game calculating bullet positions. For the sake of simplicity I did not include bullet deletion, rendering, or any other logic.
So, what are the results, and their implications?
On my machine the int size is 32 bits, the short size is 16 bits, and the char size is 8 bits; those are the “default” integer types for my machine and OS combination (I ran the test on Windows 8 on an Intel i7). The default struct had a size of 16 bytes (more on why it is not 13 later).
On my machine the least, exact, and fast variations of the C99 integer types all ended up with the same size, which is the smallest size possible: whether I asked for a least, exact, or fast 8-bit type, I got 8 bits. The structures that used those types were 8 bytes.
I also made a test creating structs where all members were the same size: one with everything 16-bit (struct size was 10 bytes) and one with everything 32-bit (struct size was 20 bytes).
For floating point, the float type is 32 bits when stored in memory and the double type is 64 bits; their structures use 20 bytes and 40 bytes respectively.
All of that had the same results in both 32 and 64 bit compiled binaries.
Now for the calculation speeds, from fastest to slowest:

All 16-bit integers: 0.0126 (average time)
The C99 integer types: 0.0141
The normal integer types: 0.0184
Single-precision floating point: 0.02199
All 32-bit integers: 0.0220 (almost tied with single-precision float)
Double-precision floating point: a laggardly 0.042
Why were the structures always bigger than the sum of their members? The reason is fairly simple: alignment. Most processors take performance hits when they have to read memory that is not aligned the way they like. The most basic form of alignment is aligning everything to the processor word size; for example, a 32-bit processor would require everything in memory to start at an address that is a multiple of 32 bits. The other mode of alignment is when each type is aligned to its own size: an 8-bit type starts at a multiple of 8 bits in memory, a 16-bit type at a multiple of 16, and so on. These are called “self-aligned” types.
x86 processors (ranging from the 8086 to the current Intel i7 and AMD FX ones) support self-aligned types, and ARM (used in the iPhone, most Androids, most handheld consoles, and some other devices) supports them too.
Theoretically, then, my default struct, which starts with ints of 4 bytes and ends with a char of 1 byte, should use only 13 bytes, so why the extra 3 bytes? They are there to ensure that when the structure is used in an array, the first int of the next element is still aligned properly. The structure has a size of 16 bytes because this guarantees that if you put several of them in the same array, every element starts properly aligned, with the ints at addresses that are multiples of four bytes.
As for calculation speed, we can see that speed was mostly tied to size: the largest structures were the slowest. This is because they use more memory and increase the chance of a cache miss. Modern processors have a limited amount of cache, and if you churn through data faster than they can populate the cache, especially if your data is too big, you get a severe performance penalty.
It is also interesting to note that the 32-bit performance was the same for integers and floating point. In old processors integers were usually faster, but modern processors have separate units for floating-point calculations, and many can actually do integer and floating-point calculations at the same time, potentially doubling the speed of your program.
Finally, the 16-bit + 16-bit sum went slightly faster than the 16-bit + 8-bit sum, despite the struct being bigger. I don't know yet why this happened; this result surprised me. If someone knows, please send me an e-mail and I will update the article. I also plan on doing some assembly delving when I have time, to see if I can figure this out.