210830 – SIGSEGV starting mono on xen kernel

Summary: SIGSEGV starting mono on xen kernel

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel-xen
Sub Component:
Version:	6
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Xen Maintainance List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	212588

Reported:	2006-10-15 23:22 UTC by Chris Lalancette
Modified:	2007-11-30 22:11 UTC
CC List:	1 user
Fixed In Version:	2869
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-01-10 21:21:41 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Patch to not make cpuid try to run code on the stack (2.19 KB, patch) 2006-10-19 03:57 UTC, Chris Lalancette

Description Chris Lalancette 2006-10-15 23:22:39 UTC

Description of problem:
I have a dual CPU Athlon machine running the Xen kernel
2.6.18-1.2784.fc6xen, with mono version 1.1.17.1-3 installed.  When I start
the mono binary, it immediately dies with "Segmentation fault".  Looking at the
core file (with gdb /usr/bin/mono core.2608), I see the following:

(gdb) bt
#0  0x080794c0 in mono_arch_cpu_optimizazions ()
#1  0x0805bf5e in parse_optimizations ()
#2  0x0805c444 in mono_main ()
#3  0x0805bee2 in main ()

I recompiled mono from the source rpm, and ran it through gdb.  The call trace
actually ends up looking like this:

main (mono/mini/main.c)-> mono_main (mono/mini/driver.c) -> parse_optimizations
(mono/mini/driver.c) -> mono_arch_cpu_optimizazions (mono/mini/mini-x86.c) ->
cpuid (mono/mini/mini-x86.c, actually inlined).  In the cpuid function, there is
a set of assembly at the top, followed by the following code:

	if (have_cpuid) {
		/* Have to use the code manager to get around WinXP DEP */
		MonoCodeManager *codeman = mono_code_manager_new_dynamic ();
		CpuidFunc func;
		void *ptr = mono_code_manager_reserve (codeman, sizeof (cpuid_impl));
		memcpy (ptr, cpuid_impl, sizeof (cpuid_impl));

		func = (CpuidFunc)ptr;
		func (id, p_eax, p_ebx, p_ecx, p_edx);


The call to func at the end there is what actually causes the Segmentation
fault.  This seems to be the assembly copied from the cpuid_impl array, and it
all seems to be copied over properly, so I am not entirely sure what is wrong.

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1.  Start mono
  
Actual results:
Mono SEGFAULT

Expected results:
mono starts and runs normally

Comment 1 Alexander Larsson 2006-10-16 10:49:51 UTC

I wonder if this is due to the xen TLS optimization thing we did in 1.1.17.1-3. 
Have you tried an earlier version and not seen this problem?
Does it go away if you remove the --with-xen_opt=yes from configure in the spec?

Comment 2 Alexander Larsson 2006-10-16 10:50:50 UTC

Also, are you running a 32bit or 64bit mono binary?

Comment 3 Alexander Larsson 2006-10-16 11:04:25 UTC

And, are you running inside a xen instance or outside?

Comment 4 Chris Lalancette 2006-10-16 13:15:30 UTC

Ah, I forgot to mention that.  The with-xen_opt was my first idea, too.  I
actually rebuilt the rpm, and removed the "--with-xen_opt=yes", and I still got
the same behavior.  It's so early in the startup code, though, that it probably
doesn't make a difference.

This is the 32-bit mono binary, by the way, running in dom0.

Comment 5 Chris Lalancette 2006-10-19 00:58:04 UTC

Wow.  I think I may have figured this one out, I'm just not sure what to do
about it.  I knew mono used to work, so I decided to go back and try various
older fc6 mono versions.  None of them worked properly, which I thought quite
strange.  So then I went back and tried various kernels, since that was the next
likely.  I eventually found out it worked in 1.2716.fc6, but not 1.2723.fc6. 
Looking at a diff between these two versions shows that exec-shield options
changed in there, more specifically, making the stack non-executable.  Aha!  The
code in question (in mono/mini/mini-x86.c) allocates some memory, copies a bunch
of opcodes into that memory (from the cpuid_impl array), and then tries to call
that as a function.  So it is trying to execute (I think) code on the stack,
which exec-shield is disallowing.  Sure enough, setting
/proc/sys/kernel/exec-shield to 0 allows mono to start normally.  Incidentally,
as far as I can tell, this will affect *all* kernels after 2716, not just the
Xen ones.  Have you happened to test mono on a kernel after 2716?

So, that is the workaround.  The real crux of the problem, though, is mono doing
this crack with the executable code on the stack.  One simple solution might be
to take the cpuid_impl, turn it into a normal __asm__ __volatile__ code chunk,
and drop it into cpuid, but I'm not sure how palatable that is upstream, because
I don't know the reason they are doing this to begin with.  Do you have any
further ideas, and/or can you pursue this upstream?

Chris Lalancette
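
One way to check whether a pointer like ptr really lives on the stack, and whether its mapping is executable, is to look it up in /proc/self/maps. The following is only a minimal sketch of that idea, not anything taken from mono; the helper name mapping_for_address is made up for illustration, and it assumes a Linux /proc filesystem is mounted.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of mono): finds the /proc/self/maps entry
 * that contains addr.
 * On success returns 0 and fills perms (e.g. "rw-p") and name (e.g. "[heap]",
 * "[stack]", or "" for an anonymous mapping).
 * Returns -1 if the file cannot be opened or no mapping contains addr. */
int mapping_for_address(const void *addr, char perms[5],
                        char *name, size_t name_len)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int rc = -1;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        unsigned long start, end;
        char p[5] = "";
        char path[256] = "";

        /* Line format: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255[^\n]",
                   &start, &end, p, path) < 3)
            continue;

        if ((uintptr_t)addr >= start && (uintptr_t)addr < end) {
            memcpy(perms, p, sizeof p);
            snprintf(name, name_len, "%s", path);
            rc = 0;
            break;
        }
    }

    fclose(f);
    return rc;
}
```

Calling this on the pointer returned by the allocator, just before the call through func, would show whether the region is "[stack]", "[heap]", or an anonymous mapping, and whether the third permission character is 'x'.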

Comment 6 Chris Lalancette 2006-10-19 03:57:31 UTC

Created attachment 138852 [details]
Patch to not make cpuid try to run code on the stack

Well, here's a patch that basically does the CPUID call in a PIC friendly
manner, which happened to be the problem before.  This *seems* to do the right
thing, and doesn't execute code on the stack, which is always a good thing. 
One thing to note; I shamelessly stole this bit of assembly from kdemultimedia
(http://www.google.com/codesearch?q=+PIC+version+:+save+ebx+show:8hVzTkQrhro:yQNIAg1iuac:kWmVzQ9LyY8&sa=N&cd=1&ct=rc&cs_p=http://gentoo.osuosl.org/distfiles/kdemultimedia-3.5.3.tar.bz2&cs_f=kdemultimedia-3.5.3/mpeglib/lib/util/mmx/cpu_accel.c#a0).
 
kdemultimedia seems to be GPL, while Mono seems to be LGPL, so we might have to
resolve that/figure something else out, if we decide we want to fix this (and I
think we should).

Chris Lalancette
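
For readers without the attachment, this is roughly what a PIC-friendly CPUID looks like. It is a sketch of the well-known idiom only, not the attached patch and not the kdemultimedia code; the names cpuid_query and cpu_vendor are hypothetical, it assumes GCC-style inline assembly, and it assumes the CPU supports the CPUID instruction at all (the original mono code checks that first).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: runs CPUID for the given leaf and stores
 * EAX, EBX, ECX, EDX in regs[0..3].
 * Returns 1 on success, 0 when not compiled for x86. */
int cpuid_query(uint32_t leaf, uint32_t regs[4])
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t a, b, c, d;
#if defined(__i386__)
    /* In 32-bit PIC code %ebx holds the GOT pointer, so the compiler must
     * never see it clobbered: swap it into a scratch register around CPUID. */
    __asm__ __volatile__(
        "xchgl %%ebx, %1\n\t"
        "cpuid\n\t"
        "xchgl %%ebx, %1"
        : "=a"(a), "=&r"(b), "=c"(c), "=d"(d)
        : "0"(leaf), "2"(0));
#else
    /* On x86_64, PIC code is RIP-relative, so %rbx can be used directly. */
    __asm__ __volatile__(
        "cpuid"
        : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
        : "0"(leaf), "2"(0));
#endif
    regs[0] = a;
    regs[1] = b;
    regs[2] = c;
    regs[3] = d;
    return 1;
#else
    (void)leaf;
    (void)regs;
    return 0;
#endif
}

/* Copies the vendor string from leaf 0 (EBX, EDX, ECX order) into out.
 * Returns 1 on success, 0 when CPUID is not available in this build. */
int cpu_vendor(char out[13])
{
    uint32_t r[4];

    if (!cpuid_query(0, r))
        return 0;
    memcpy(out, &r[1], 4);      /* EBX */
    memcpy(out + 4, &r[3], 4);  /* EDX */
    memcpy(out + 8, &r[2], 4);  /* ECX */
    out[12] = '\0';
    return 1;
}
```

Because the instruction is emitted inline, nothing is copied into writable memory and then executed, so neither exec-shield nor SELinux has anything to object to for this one call.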

Comment 7 Alexander Larsson 2006-10-19 09:03:27 UTC

It's not running it from the stack. mono_code_manager_reserve() allocates the
memory dynamically. In this case I think it will use malloc(), and we then call
mprotect() on that memory to make it executable.

Maybe execshield is making it fail the mprotect() call? I dunno why I'm not
seeing this problem though. mono runs fine on my core duo with kernel 2.6.18-1.2726.

I don't think your patch is right though. The code manager is probably used in
many places in mono, and your change is just in one place. We should figure out
why it fails and fix it. 

Isn't there a way to flag a binary to allow heap execution with exec-shield?
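
To make the "does mprotect() fail?" question testable outside mono, here is a minimal sketch of the same copy-then-execute pattern with every return value checked. It is not the mono code manager: it uses mmap() rather than malloc(), the names run_copied_code and return_42 are made up, and the machine code bytes are an assumption that only holds on x86 (32-bit or 64-bit).

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*int_func)(void);

/* Hypothetical helper: copies code into a fresh anonymous mapping, switches
 * the mapping from read/write to read/execute, calls it, and stores the
 * return value in *result.
 * Returns 0 on success, -1 if mmap() or mprotect() fails (check errno). */
int run_copied_code(const unsigned char *code, size_t len, int *result)
{
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memcpy(mem, code, len);

    /* The memory is never writable and executable at the same time. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }

    /* Converting an object pointer to a function pointer is not ISO C,
     * but POSIX systems support it. */
    *result = ((int_func)mem)();

    munmap(mem, len);
    return 0;
}

#if defined(__i386__) || defined(__x86_64__)
/* x86 machine code for: mov $42, %eax ; ret */
const unsigned char return_42[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
#endif
```

If run_copied_code() returns -1, the kernel or a security policy refused the mapping or the permission change; if it crashes at the call instead, the protection change was accepted but not honoured, which is closer to what this bug looks like.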

Comment 8 Caolan McNamara 2006-10-19 09:30:21 UTC

alex: maybe the example patch in openoffice.org of 

openoffice.org-2.0.3.oooXXXXX.selinux.bridges.patch

might be helpful for a similar sounding scenario where I do the selinux dance
to avoid a simultaneously readable + writable executable memory segment

(http://people.redhat.com/drepper/selinux-mem.html)

Comment 9 Alexander Larsson 2006-10-19 09:35:17 UTC

I do have selinux on, and it works for me. Mono has its own selinux context that
allows it to do the stuff it needs. 

Chris: Do you have selinux on?

Comment 10 Chris Lalancette 2006-10-19 13:06:02 UTC

No, I don't have SELinux on.  Actually, once I found the workaround (echo 0 >
/proc/sys/kernel/exec-shield), I turned SELinux on just for a test, and ran into
the problem with the heap execution.  setroubleshootd told me I needed to set
the execheap boolean on.  In any case, that is a different problem since in my
original test I had SELinux disabled.

I'm not sure if it is relevant that I am using an Athlon machine?  Maybe some
difference in the way it protects the stack?  That is a pure guess.  I'll try
this out on another machine (intel) at work today, and let you know if I am more
successful there.

Comment 11 Chris Lalancette 2006-10-19 14:46:15 UTC

OK, test cases (all with SELinux off):

Opteron box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
Opteron box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Athlon box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Pentium 4 box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT

OK, so I'm starting to get the pattern: only i686 kernels, only Xen.  The one
exception is the Opteron box running the i686 Xen kernel; that seems to work
fine.  There must be some difference in the way the Xen kernel handles the
exec-shield stuff, which is causing this problem, but I can't explain the
Opteron difference.  Now I am really confused.

Comment 12 Alexander Larsson 2006-10-23 07:24:08 UTC

This seems to be more of a xen kernel bug. Reassigning on advice from Dave Jones.

Comment 13 Chris Lalancette 2007-01-10 21:21:41 UTC

Closing this out...it seems to be fixed in the latest FC6 kernels.
