fzabkar wrote:This code works, too. I'm guessing that the "Naked" prologue and epilogue compiler overheads...
"Naked" means "no overhead": the compiler emits no prologue or epilogue, and in the listings below the Sub is only 31 bytes instead of 36. One day you should insert an asm int 3 into your code, just before calling the Sub or Function, and let the just-in-time debugger show you what happens under the hood. I recommend
OllyDbg; it's easy to learn: you just need to press F7 (step into) repeatedly. Example:
Code:
asm int 3
EndianRev16(s(0), d(0), 10)
Code:
CPU Disasm
Address Hex dump Command Comments
004016A7 │. CC int3 ; │
004016A8 │. 6A 0A push 0A ; │┌Arg3 = 0A
004016AA │. 8D45 AC lea eax, [ebp-54] ; ││
004016AD │. 50 push eax ; ││Arg2 => offset LOCAL.21
004016AE │. 8D45 E4 lea eax, [ebp-1C] ; ││
004016B1 │. 50 push eax ; ││Arg1 => offset LOCAL.7
004016B2 │. E8 D9FEFFFF call 00401590 ; │└TmpFb.00401590
...
00401590 ┌$ 8B4C24 0C mov ecx, [esp+0C] ; TmpFb.00401590(guessed Arg1,Arg2,Arg3)
00401594 │. 56 push esi
00401595 │. 57 push edi
00401596 │. 8B7424 0C mov esi, [esp+0C]
0040159A │. 8B7C24 10 mov edi, [esp+10]
0040159E │> 66:AD ┌lodsw
004015A0 │. 0FC8 │bswap eax
004015A2 │. C1E8 10 │shr eax, 10
004015A5 │. 66:AB │stosw
004015A7 │. 49 │dec ecx
004015A8 │. 7F F4 └jg short 0040159E
004015AA │. 5F pop edi
004015AB │. 5E pop esi
004015AC └. C2 0C00 retn 0C
The same but not naked:
Code:
CPU Disasm
Address Hex dump Command Comments
004016D7 │. CC int3
004016D8 │. 6A 0A push 0A
004016DA │. 8D45 AC lea eax, [ebp-54]
004016DD │. 50 push eax ; ┌Arg3 => offset LOCAL.21
004016DE │. 8D45 E4 lea eax, [ebp-1C] ; │
004016E1 │. 50 push eax ; │Arg2 => offset LOCAL.7
004016E2 │. E8 C9FEFFFF call 004015B0 ; │
...
004015B0 ┌$ 55 push ebp
004015B1 │. 89E5 mov ebp, esp
004015B3 │. 53 push ebx
004015B4 │. 56 push esi
004015B5 │. 57 push edi
004015B6 │. 8B4D 10 mov ecx, [ebp+10]
004015B9 │. 8B75 08 mov esi, [ebp+8]
004015BC │. 8B7D 0C mov edi, [ebp+0C]
004015BF │> 66:AD ┌lodsw
004015C1 │. 0FC8 │bswap eax
004015C3 │. C1E8 10 │shr eax, 10
004015C6 │. 66:AB │stosw
004015C8 │. 49 │dec ecx
004015C9 │. 7F F4 └jg short 004015BF
004015CB │. 5F pop edi
004015CC │. 5E pop esi
004015CD │. 5B pop ebx
004015CE │. 89EC mov esp, ebp
004015D0 │. 5D pop ebp
004015D1 └. C2 0C00 retn 0C
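Both listings share the same inner loop (lodsw / bswap eax / shr eax, 10 / stosw); only the code around it differs. In case it helps to read that loop, here is a rough C sketch of what it does per 16-bit word. The names endian_rev16 and bswap32 are made up for illustration and are not the FreeBASIC source. One difference: the asm loop is a "dec ecx / jg" loop, so it always runs at least once, while this sketch does nothing for a count of 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: what "bswap eax" does, reversing the four bytes. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Hypothetical C equivalent of the loop: swap the two bytes of each word. */
void endian_rev16(const uint16_t *src, uint16_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t eax = src[i];   /* lodsw: load one word (upper half assumed 0 here) */
        eax = bswap32(eax);      /* bswap eax: swapped word lands in the upper 16 bits */
        eax >>= 16;              /* shr eax, 10h: move it back down */
        dst[i] = (uint16_t)eax;  /* stosw: store the low word */
    }
}
```

In the real code the upper half of eax is whatever was there before lodsw, but after the bswap it sits in the low half and is thrown away by the shift, so the result is the same.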
marcov wrote:that code uses movdqu so that is unaligned
Yes, you can use movdqu or movups to load the source operand into an xmm reg first. pshufb also accepts a 128-bit memory operand directly, but then that operand must be 16-byte aligned.
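To illustrate the unaligned-load variant, here is a small C sketch using intrinsics that swaps the bytes of eight 16-bit words at once. It is not the code marcov refers to; the function name endian_rev16_x8 is invented, and the target attribute assumes GCC or Clang on an x86 CPU with SSSE3. _mm_loadu_si128 and _mm_storeu_si128 correspond to movdqu, _mm_shuffle_epi8 to pshufb.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical example: byte-swap 8 words (16 bytes) with one pshufb.
   Neither src nor dst has to be 16-byte aligned, because both the
   load and the store use the unaligned forms. */
__attribute__((target("ssse3")))
void endian_rev16_x8(const uint16_t *src, uint16_t *dst)
{
    /* Shuffle mask: result byte i is taken from source byte mask[i],
       so each pair of neighbouring bytes changes places. */
    const __m128i mask = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                       9, 8, 11, 10, 13, 12, 15, 14);

    __m128i v = _mm_loadu_si128((const __m128i *)src); /* movdqu load  */
    v = _mm_shuffle_epi8(v, mask);                     /* pshufb       */
    _mm_storeu_si128((__m128i *)dst, v);               /* movdqu store */
}
```

Since the data goes through a register here, only the mask could end up as a memory operand of pshufb, and the compiler takes care of aligning that constant.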