Description of problem:
I have a dual CPU Athlon machine running the Xen kernel 2.6.18-1.2784.fc6xen, with mono version 1.1.17.1-3 installed. When I go to start the mono binary, it immediately "Segmentation faults". Looking at the core file (with gdb /usr/bin/mono core.2608), I see the following:

    (gdb) bt
    #0 0x080794c0 in mono_arch_cpu_optimizazions ()
    #1 0x0805bf5e in parse_optimizations ()
    #2 0x0805c444 in mono_main ()
    #3 0x0805bee2 in main ()

I recompiled mono from the source rpm, and ran it through gdb. The call trace actually ends up looking like this:

    main (mono/mini/main.c)
    -> mono_main (mono/mini/driver.c)
    -> parse_optimizations (mono/mini/driver.c)
    -> mono_arch_cpu_optimizazions (mono/mini/mini-x86.c)
    -> cpuid (mono/mini/mini-x86.c, actually inlined)

In the cpuid function, there is a set of assembly at the top, followed by the following code:

    if (have_cpuid) {
        /* Have to use the code manager to get around WinXP DEP */
        MonoCodeManager *codeman = mono_code_manager_new_dynamic ();
        CpuidFunc func;
        void *ptr = mono_code_manager_reserve (codeman, sizeof (cpuid_impl));
        memcpy (ptr, cpuid_impl, sizeof (cpuid_impl));
        func = (CpuidFunc)ptr;
        func (id, p_eax, p_ebx, p_ecx, p_edx);

The call to func at the end there is what actually causes the Segmentation fault. This seems to be the assembly copied from the cpuid_impl array, and it all seems to be copied over properly, so I am not entirely sure what is wrong.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Start mono

Actual results:
Mono SEGFAULT

Expected results:
mono starts and runs normally
I wonder if this is due to the xen TLS optimization thing we did in 1.1.17.1-3. Have you tried an earlier version and not seen this problem? Does it go away if you remove the --with-xen_opt=yes from configure in the spec?
Also, are you running a 32bit or 64bit mono binary?
And, are you running inside a xen instance or outside?
Ah, I forgot to mention that. The with-xen_opt was my first idea, too. I actually rebuilt the rpm with the "--with-xen_opt=yes" removed, and I still got the same behavior. The crash is so early in the startup code, though, that the option probably doesn't make a difference. This is the 32-bit mono binary, by the way, running in dom0.
Wow. I think I may have figured this one out, I'm just not sure what to do about it.

I knew mono used to work, so I decided to go back and try various older fc6 mono versions. None of them worked properly, which I thought quite strange. So then I went back and tried various kernels, since that was the next most likely suspect. I eventually found out it worked in 1.2716.fc6, but not 1.2723.fc6. Looking at a diff between these two versions shows that the exec-shield options changed in there; more specifically, the stack was made non-executable. Aha!

The code in question (in mono/mini/mini-x86.c) allocates some memory, copies a bunch of opcodes into that memory (from the cpuid_impl array), and then tries to call that as a function. So it is trying to execute (I think) code on the stack, which exec-shield is disallowing. Sure enough, setting /proc/sys/kernel/exec-shield to 0 allows mono to start normally. Incidentally, as far as I can tell, this will affect *all* kernels after 2716, not just the Xen ones. Have you happened to test mono on a kernel after 2716?

So, that is the workaround. The real crux of the problem, though, is mono doing this crack with the executable code on the stack. One simple solution might be to take the cpuid_impl, turn it into a normal __asm__ __volatile__ code chunk, and drop it into cpuid, but I'm not sure how palatable that is upstream, because I don't know the reason they are doing this to begin with. Do you have any further ideas, and/or can you pursue this upstream?

Chris Lalancette
Created attachment 138852
Patch to not make cpuid try to run code on the stack

Well, here's a patch that basically does the CPUID call in a PIC friendly manner, which happened to be the problem before. This *seems* to do the right thing, and doesn't execute code on the stack, which is always a good thing.

One thing to note; I shamelessly stole this bit of assembly from kdemultimedia (http://www.google.com/codesearch?q=+PIC+version+:+save+ebx+show:8hVzTkQrhro:yQNIAg1iuac:kWmVzQ9LyY8&sa=N&cd=1&ct=rc&cs_p=http://gentoo.osuosl.org/distfiles/kdemultimedia-3.5.3.tar.bz2&cs_f=kdemultimedia-3.5.3/mpeglib/lib/util/mmx/cpu_accel.c#a0). kdemultimedia seems to be GPL, while Mono seems to be LGPL, so we might have to resolve that/figure something else out, if we decide we want to fix this (and I think we should).

Chris Lalancette
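For illustration only, and not the attached patch: a minimal sketch of querying CPUID directly from compiled code, so that nothing has to be copied into runtime-allocated memory and then called. It assumes GCC or Clang on x86 and swaps in the compiler-supplied <cpuid.h> helper __get_cpuid for hand-written assembly. cpu_vendor is a hypothetical name, and on non-x86 targets it simply reports failure.

```c
#include <assert.h>
#include <string.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header; an assumption, not part of Mono */
#endif

/* Hypothetical helper: writes the CPUID leaf 0 vendor string into out
 * (always NUL-terminated). Returns 1 on success, 0 if CPUID leaf 0 is
 * unavailable or the target is not x86. */
int cpu_vendor(char out[13])
{
    assert(out != NULL);
    memset(out, 0, 13);
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* The instruction is emitted inline in the text segment, so no
     * writable memory ever needs to become executable. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;

    /* The vendor string is stored in EBX, EDX, ECX, in that order. */
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    return 1;
#else
    return 0;
#endif
}
```

A caller would declare char vendor[13], call cpu_vendor(vendor), and print the buffer only when the return value is 1.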
It's not running it from the stack. mono_code_manager_reserve() allocates the memory dynamically. In this case I think it will use malloc(), and we then call mprotect() on that memory to make it executable. Maybe exec-shield is making the mprotect() call fail? I don't know why I'm not seeing this problem though. mono runs fine on my core duo with kernel 2.6.18-1.2726.

I don't think your patch is right though. The code manager is probably used in many places in mono, and your change is just in one place. We should figure out why it fails and fix it. Isn't there a way to flag a binary to allow heap execution with exec-shield?
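One way to test the "mprotect() is failing" theory is to reproduce the allocate, copy, and make-executable sequence in isolation and look at the return value. The following is a minimal sketch under stated assumptions, not Mono's code manager: copy_code_rx is a hypothetical helper, and it uses mmap() rather than malloc() because mprotect() needs a page-aligned region. It never calls the copied bytes.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copies len bytes of machine code into fresh
 * page-aligned memory, then flips that memory from read+write to
 * read+exec. Returns the region, or NULL if any step fails. */
void *copy_code_rx(const unsigned char *code, size_t len)
{
    assert(code != NULL && len > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    /* Round the request up to a whole number of pages. */
    size_t size = ((len + (size_t)page - 1) / (size_t)page) * (size_t)page;

    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    memcpy(mem, code, len);

    /* This is the call a security policy may refuse; report it instead
     * of ignoring the result and crashing later on the indirect call. */
    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");
        munmap(mem, size);
        return NULL;
    }
    return mem;
}
```

If the helper returns NULL with an mprotect error, the policy is refusing the permission change outright. If it returns a pointer and a later call through that pointer still faults, the permission change was accepted but is not being honored at execution time.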
alex: maybe the example patch in openoffice.org, openoffice.org-2.0.3.oooXXXXX.selinux.bridges.patch, might be helpful for a similar-sounding scenario, where I do the selinux dance to avoid a memory segment that is simultaneously writable and executable (http://people.redhat.com/drepper/selinux-mem.html)
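As a rough sketch of that kind of dance, and not a copy of the OpenOffice.org patch: one way to avoid a mapping that is writable and executable at the same time is to map the same backing file twice, once read+write and once read+exec. The names dual_map and dual_map_create are hypothetical, and the /tmp path assumes a filesystem that is not mounted noexec.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for mkstemp/ftruncate under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two views of the same pages: code is written through write_view and
 * would be executed through exec_view. */
struct dual_map {
    void  *write_view;
    void  *exec_view;
    size_t size;
};

/* Hypothetical helper. Returns 0 on success, -1 on any failure. */
int dual_map_create(struct dual_map *dm, size_t size)
{
    assert(dm != NULL && size > 0);

    char path[] = "/tmp/dualmapXXXXXX";   /* assumed exec-capable mount */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives on via fd only */

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }

    void *w = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *x = (w == MAP_FAILED)
                  ? MAP_FAILED
                  : mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);                             /* mappings keep the file alive */

    if (w == MAP_FAILED || x == MAP_FAILED) {
        if (w != MAP_FAILED)
            munmap(w, size);
        return -1;
    }

    dm->write_view = w;
    dm->exec_view  = x;
    dm->size       = size;
    return 0;
}
```

Bytes stored through write_view show up at the same offset in exec_view, so generated code can be emitted through one pointer and called through the other without any single mapping being both writable and executable.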
I do have selinux on, and it works for me. Mono has its own selinux context that allows it to do the stuff it needs. Chris: Do you have selinux on?
No, I don't have SELinux on. Actually, once I found the workaround (echo 0 > /proc/sys/kernel/exec-shield), I turned SELinux on just for a test, and ran into the problem with the heap execution. setroubleshootd told me I needed to set the execheap boolean on. In any case, that is a different problem since in my original test I had SELinux disabled. I'm not sure if it is relevant that I am using an Athlon machine? Maybe some difference in the way it protects the stack? That is a pure guess. I'll try this out on another machine (intel) at work today, and let you know if I am more successful there.
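When comparing machines, it may help to record what the exec-shield setting actually is on each one before toggling it. A small sketch, assuming the /proc/sys/kernel/exec-shield file mentioned above holds a single integer; read_int_setting is a hypothetical helper, and the file does not exist on kernels without the exec-shield patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one integer from a /proc-style file.
 * Returns the value, or -1 if the file is missing or unparsable. */
int read_int_setting(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

Calling read_int_setting("/proc/sys/kernel/exec-shield") on each test box gives a value to note next to the result; -1 means the setting could not be read on that kernel.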
OK, test cases (all with SELinux off):

Opteron box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
Opteron box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Athlon box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Pentium 4 box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT

OK, so I'm starting to get the pattern: only i686 kernels, only Xen. The one exception is the Opteron box running the i686 Xen kernel; that seems to work fine. There must be some difference in the way the Xen kernel handles the exec-shield stuff, which is causing this problem, but I can't explain the Opteron difference. Now I am really confused.
This seems to be more of a xen kernel bug. Reassigning on advice from Dave Jones.
Closing this out...it seems to be fixed in the latest FC6 kernels.