Sunday, January 22, 2017

The macOS Experiment (Part 2)

Around the beginning of the new year, I mentioned what I titled "The macOS Experiment" and some of what I had planned for it.  Over the weekend, I put a bit of effort into it to see what would happen.  Did I get it working?  Well, almost, and I'll get to that in a moment.  First, I'll share the git repo for those who want to see what I'm talking about:

https://github.com/blueshogun96/MacBox

There isn't much there but the bare minimum, and it was crappily written (no need to point out the obvious) because I just wanted to quickly see if I could get it up and running.

Basically, what it does is this: it launches 32-bit code from a desired base address (0x10000, for the sake of Xbox) by calling mmap to reserve that memory range with read, write and execute permissions, then injects the code to be run at that base address (in this case, the contents of a .xbe file).  From there, we point a function pointer at the .xbe's main function (not the entry point, which is what Xeon used) and call it.  Actually, before we call that function, we need to install our wrapper functions to HLE the necessary functions in the .xbe.  For now, I just hard-coded those to see if I could get a quick proof of concept going.
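
Here's roughly what that flow looks like in code.  This is a minimal sketch with made-up names and sizes, not the actual MacBox source:

#include <sys/mman.h>
#include <cstring>
#include <cstdint>

/* Sketch only: reserve the Xbox base address, copy the .xbe image in,
   patch the HLE hooks, then jump into the title's main function. */
bool run_xbe_image( const void* image, size_t image_size, uint32_t main_address )
{
    void* base = mmap( reinterpret_cast<void*>(0x10000), 64 * 1024 * 1024,
                       PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
    if( base == MAP_FAILED )
        return false;

    std::memcpy( base, image, image_size );     // inject the .xbe contents at the base address

    /* macbox_install_wrapper() calls go here, before any guest code runs */

    typedef void (*xbe_main_t)( void );
    xbe_main_t xbe_main = reinterpret_cast<xbe_main_t>( main_address );
    xbe_main();                                 // hand control to the title
    return true;
}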

So to start off, I took the simplest .xbe to deal with: the first D3D tutorial in the XDK, CreateDevice.  All it does is create the D3D device, clear the screen to a random colour, then present it.  This is actually simpler than it sounds, considering that you don't have to literally create a D3D device.  All the D3D functions are C-based functions using the __stdcall calling convention.  The only parameter to really worry about is the D3DPRESENT_PARAMETERS structure, which we use to initialize an OpenGL context that matches those parameters.  Clearing and swapping the screen are also very trivial things, but as you can see, I didn't actually add any of that code yet.  Why is that?  Because first I needed to make sure that my hooked functions get called correctly, and this is where things got a bit tricky.
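
As an aside, here's roughly what the Clear hook will eventually look like on the GL side once I do add it.  This is only a sketch: the parameter list is borrowed from desktop D3D8's Clear() and the D3DCLEAR_* values are assumptions, so it may not match the Xbox version exactly, and the calling convention is ignored here (more on that below):

#include <OpenGL/gl.h>
#include <cstdint>

typedef uint32_t DWORD;

/* Hypothetical HLE body for D3DDevice_Clear, mapped straight onto OpenGL */
void HLE_D3DDevice_Clear( DWORD Count, const void* pRects, DWORD Flags,
                          DWORD Color, float Z, DWORD Stencil )
{
    GLbitfield mask = 0;

    if( Flags & 0x1 ) {             // D3DCLEAR_TARGET (value assumed)
        glClearColor( ((Color >> 16) & 0xFF) / 255.0f,      // ARGB -> RGBA floats
                      ((Color >>  8) & 0xFF) / 255.0f,
                      ( Color        & 0xFF) / 255.0f,
                      ((Color >> 24) & 0xFF) / 255.0f );
        mask |= GL_COLOR_BUFFER_BIT;
    }
    if( Flags & 0x2 ) {             // D3DCLEAR_ZBUFFER (value assumed)
        glClearDepth( Z );
        mask |= GL_DEPTH_BUFFER_BIT;
    }
    if( Flags & 0x4 ) {             // D3DCLEAR_STENCIL (value assumed)
        glClearStencil( Stencil );
        mask |= GL_STENCIL_BUFFER_BIT;
    }

    glClear( mask );                // Count/pRects (sub-rect clears) ignored in this sketch
}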

Let's take a look at this code where the "magic" begins (see macbox_xbe.cpp on the repo):

void macbox_install_wrapper( void* function_addr, void* wrapper_addr )
{
    uint8_t* func = (uint8_t*) function_addr;
    
    /* Overwrite the first 5 bytes of the target with a JMP rel32; the rel32
       displacement is relative to the end of the 5-byte instruction, hence the +5 */
    *(uint8_t*) &func[0] = 0xE9;    // JMP rel32 (quick and easy)
    *(uint32_t*)&func[1] = (uint32_t) wrapper_addr - ((uint32_t) function_addr + 5);
}

Just like Cxbx does, my code goes to the top of the function that we need to hook, and places a jmp rel32 instruction right there to immediately redirect it to our wrapper function.  Quite easy to do when you know how to encode x86 instructions.

So let's move on.  Next, let's hook some functions:

/* TODO: Move this elsewhere and don't hard code it either */
macbox_install_wrapper( reinterpret_cast<void*>(0x195B0), reinterpret_cast<void*>(Direct3D_CreateDevice) );
macbox_install_wrapper( reinterpret_cast<void*>(0x1A270), reinterpret_cast<void*>(D3DDevice_Clear) );
macbox_install_wrapper( reinterpret_cast<void*>(0x1ABC0), reinterpret_cast<void*>(D3DDevice_Swap) );

How did I get these offsets?  I loaded the .xbe up in IDA Pro and used the FLIRT signatures to spot the D3D functions right away.  That trick really saved my arse and my sanity when working on Cxbx.

After this, I was ready to go, or at least I thought I was.  When Direct3D_CreateDevice was first called, the parameters were all gibberish.  On top of that, it crashed soon after the function returned.  I made sure that all the parameters were correct and the same as they were on Windows (at least from a byte perspective).  But then I remembered that these functions have to use __stdcall in order to work with Cxbx on Windows, so I had to find the equivalent for Mac.  This is what WINE uses to define __stdcall (also known as CALLBACK) on the Mac:

#define CALLBACK __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))

Kinda long, isn't it?  So I used this, and the immediate crash went away.  The parameters are still gibberish for that particular function (Direct3D_CreateDevice), but with D3DDevice_Clear the parameters were perfect!  Then, after that function returned, something went wonky and EIP ended up in a weird state, resulting in another crash.  This brings me to the next issue.
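
Applied to the hooks, the declarations end up looking something like this (again just a sketch; the parameter list carries over the same assumptions as the Clear example above):

#include <cstdint>

#define CALLBACK __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))

/* __stdcall makes the wrapper clean up its own arguments, the way the Xbox caller
   expects, and force_align_arg_pointer re-aligns the stack on entry so code built
   against macOS's 16-byte-aligned ABI behaves itself. */
void CALLBACK D3DDevice_Clear( uint32_t Count, const void* pRects, uint32_t Flags,
                               uint32_t Color, float Z, uint32_t Stencil );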

From what I have read, the stack on macOS is always kept 16-byte aligned, whereas Xbox code only kept it 4-byte aligned.  This seems like it could be the issue.  I took a moment to grab some screenshots of the relevant assembly.

From IDA, this is the code that calls Direct3D_CreateDevice:

[screenshot of IDA's disassembly]

And this is the code that Xcode generated:

[screenshot of Xcode's disassembly]

Once more, IDA's output from where D3DDevice_Clear is called:

[screenshot of IDA's disassembly]

And this is what Xcode generated:

[screenshot of Xcode's disassembly]

In the Xcode screenshots, the green line is where the breakpoint is placed at the beginning of the function.  So is the stack getting screwed up somewhere, or what?  I'm quite sure there's a way around it, considering that WINE was able to pull it off somehow.  Maybe I should ask on their forum?  Either way, I want to figure something out.

Since I didn't have all day to dedicate to this, I thought I'd at least share with you all what's going on.  The sooner I can fix this, the sooner I can move forward.  Frankly, I wouldn't be surprised if whatever I'm missing turns out to be stupidly simple.  If worse comes to worst, I guess I could do something super hacky to fix it, but I want to avoid that if I can.

So that's it for now.  Thanks for reading, and let me know what you think.

Shogun.

Monday, January 2, 2017

The macOS experiment (Part 1)

Lately, I haven't been doing much emulation-wise.  Well, to be frank, I haven't done any serious emulation work in a long time (and just in case you're reading this, JayFox, I'm not talking about xqemu, nor have I ~ever~ said I was doing any real work on it aside from a few workarounds here and there; hell, I've even stated multiple times that I'm not an xqemu dev, so pleeeeeeeeeease spare me your usual "3 lines of code" lecture because I'm really not in the mood, thanks and no disrespect).  But since I'm in between jobs/paid projects yet again, I decided to pick up one of the experiments I started a while back, out of curiosity.

What is this experiment?  Well, the goal of this little coding experiment was to claim the typical base address of 0x10000 that every .xbe uses.  For those who don't quite see what I'm getting at, let's take a look at how both Cxbx and Xeon, the first two Xbox emulators, worked internally.  They both use HLE (yeah, that's right, Sherlock) to run code natively on the host CPU, and they both use a .exe to reserve the memory range beginning at 0x10000 and extending at least 64 MB beyond that.  For Cxbx, the .exe generated from the .xbe has the base address set when the header is written.  For Xeon, the main .exe simply reserves that address range with a big global static array forced to load at 0x10000, which is then overwritten with the .xbe contents.  The actual emulator code runs from a .dll that's loaded into memory well outside that range.  That's the only way they knew how to do it at the time (for Windows, at least).  Naturally, I started to wonder how one would do that on a Mac.
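
As a rough illustration of the Xeon-style trick (this is not Xeon's actual source, and getting the linker to place the array so that it really starts at 0x10000 is the part Xeon handled by forcing the load address), the idea boils down to this:

#include <cstring>

/* 64 MB of guest memory baked into the .exe image itself */
static unsigned char g_xbox_memory[ 64 * 1024 * 1024 ];

void load_xbe( const void* xbe_image, size_t size )
{
    /* the reservation simply gets overwritten with the .xbe contents */
    std::memcpy( g_xbox_memory, xbe_image, size );
}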

Similar experiments had already been done on Linux, so I figured a Mac could do it with similar methods.  I started a thread on ngemu a long while back and asked for suggestions, hints and ideas.  At the time, I was quite new to Mac development, but as of writing this I have grown to be quite experienced with it (although still not quite as knowledgeable of the lower-level and kernel-level aspects).  Instinctively, my first try was to use mmap() directly, or vm_allocate().  That didn't work.
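
For those curious, the vm_allocate() route looks roughly like this.  This is just a minimal sketch of the idea, not the exact code I tried back then:

#include <mach/mach.h>
#include <mach/vm_map.h>
#include <mach/vm_statistics.h>

/* Ask Mach for a fixed allocation at 0x10000 in our own task, then request RWX */
bool try_vm_allocate()
{
    vm_address_t address = 0x10000;
    vm_size_t    size    = 1 << 20;

    kern_return_t kr = vm_allocate( mach_task_self(), &address, size, VM_FLAGS_FIXED );
    if( kr != KERN_SUCCESS )
        return false;

    /* unlike mmap(), execute permission has to be requested separately */
    kr = vm_protect( mach_task_self(), address, size, FALSE,
                     VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE );
    return kr == KERN_SUCCESS;
}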

One of the respondents to my thread back then was Ben Vanik, the author of Xenia.  He was running Linux, and recommended this code:

#include <sys/mman.h>
#include <iostream>

namespace {
    void * const MEMORY_ADDRESS = reinterpret_cast<void*>(0x10000);
    size_t const MEMORY_SIZE = 1 << 20;
}

int main() {
    void * p = mmap(MEMORY_ADDRESS, MEMORY_SIZE, PROT_EXEC | PROT_READ | PROT_WRITE, MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    if (p != reinterpret_cast<void*>(-1)) {
        std::cout << "Memory allocated" << std::endl;
        munmap(p, MEMORY_SIZE);
    } else {
        std::cout << "Memory could not be allocated" << std::endl;
    }
}

It worked for him, but it still didn't work for me on macOS (OS X), so I was back where I started.  Keep in mind that all of these flags were necessary: I needed to be able to read, write and execute from this memory range, and it needed to be at that exact address (hence the MAP_FIXED flag).  Many people were against that last bit, which I'll explain in a moment.

After that, I eventually forgot about it and moved on to other projects, mostly an indie game that I started less than two months after starting that thread.  I eventually did get it working, and the reason it hadn't been working was actually quite a testament to my ignorance and stupidity... I forgot to switch the compiler setting from x86_64 to i386!  Duh, no wonder it wasn't working.  I wanted to slap myself afterwards.  IIRC, there's a flag called MAP_32BIT which lets 64-bit programs map within the low 32-bit address space, but it appears to be available only on Linux, not macOS.  Oh well, no big loss for now.

Now, what I mean by "work" is that mmap() stopped failing.  I would finally get that base address, but the program would simply crash on a call to std::cout.  Even commenting that out, unmapping the pointer and letting it return from main also resulted in a crash.  I'm sure you could theorize why, but finding a solution was what I was interested in.

There aren't many Mac devs on your everyday emulation/gaming forum (most members are extremely anti-Mac; many for ignorant and uninformed reasons, and a few for legitimate and well-thought-out ones), so I ended up going to a forum dedicated to Mac-related programming topics: macrumors.com.  While most of them did give me some sound advice, they didn't quite understand why I needed this exact functionality.  Many would suggest removing MAP_FIXED, but that would give me a memory address outside the range I'm looking for and would essentially defeat the purpose.  I had to start two threads altogether, and only one person actually understood what I was trying to do, as he also needed to write a basic VM while lacking the proper resources to do it.  He tried this code in Xcode, and it worked for him, but only after adding the following linker flags in the "Other Linker Flags" section:

-pagezero_size 0x10000000 -segprot __PAGEZERO rwx rwx

Frankly, I do not fully understand what this did, even though it looks fairly self-explanatory.  A Google search didn't yield many results either, but it worked.
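
For what it's worth, I believe the command-line equivalent outside of Xcode is just forwarding those same flags to the linker, something along these lines (assuming a 32-bit clang build; the file and output names are made up):

clang++ -arch i386 main.cpp -o macbox_test -Wl,-pagezero_size,0x10000000 -Wl,-segprot,__PAGEZERO,rwx,rwx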

Now, I understand that what I'm doing is considered "unsafe" and risky; even on Linux, it's considered just as risky.  It's all part of an experiment I put a bit of free time into.  Was there any particular goal to it?  Well, for starters, I wanted to see at one point whether it was at all feasible to bring Cxbx to the Mac, since using WINE wasn't enough to do it due to the memory map requirements (my assumption; it always crashed for me).  Another idea was to provide a proof of concept that HLE and direct code execution are indeed possible on an Intel Mac without the need for a VM library or driver.  There's a VM framework for macOS now, but it requires a 2010-or-newer Mac with 10.10+ installed.  Since neither of my comps meets the former requirement, that wasn't an option for me.  Third, I've had the itch to implement just enough APIs to attempt to run Azurik: Rise of Perathia on my Mac.  I am greatly intrigued by the possibility of playing this game in some hacked form of 1080p or 4K using certain Apple-exclusive OpenGL extensions, but hesitant to re-live the pain of trying to get this game working on Cxbx.

So far, it appears to work without issues, but I'm sure there could be some hidden caveats somewhere.  What I'd like to do later on is experiment with a simple .xbe that just creates a D3D device and clears the screen, and HLE that as a proof of concept.  I don't really plan on going much farther than that, but I am curious to find out whether anything else can evolve out of this, even if it doesn't mean emulating Xbox.  Who knows?  Maybe an HLE emulator of something else, like one of Sega's recent arcade boards such as Europa?  Hey, I just like to learn, and emulating Xbox even at an HLE level has helped me tremendously, both personally and in my professional career (which is part of the reason I have this job interview coming up; I'll blog about what I mean by that later).

Happy new year (I know it's a bit late), and happy coding.  It's late, so I'm off to bed.

Shogun.