The SWsoft Virtuozzo/OpenVZ Linux kernel team has found an issue related to the 4G/4G split:
the machine check exception handler accesses kernel-space memory
(machine_check_vector) before switching to the kernel-space context.
If an MCE interrupts a userspace application, this usually leads to a non-fatal oops
message. However, if this memory address is in use by the application, the kernel will jump
to a wrong pointer, which can lead to various troubles such as memory corruption,
hangs, or reboots.
Oops from Virtuozzo/OpenVZ kernel 2.6.8-022stab078.9-enterprise (with the 4G/4G split patch):
Unable to handle kernel paging request at virtual address 024a8e10
*pde = 0063c027
Oops: 0000 [#1]
CPU: 0, VCPU: 1:2
EIP: 0060:[<fffad03e>] Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206 (2.6.8-022stab078.9-enterprise)
eax: 409e55b8 ebx: 40c52660 ecx: 00000200 edx: 40331468
esi: 40cd0018 edi: 403317e8 ebp: bdffe2cc esp: d0e9ffe8
ds: 007b es: 007b ss: 0068
Stack: 00000000 08217e13 00000073 00000206 bdffe2a4 0000007b
Code: ff 35 10 8e 4a 02 e9 ab fd ff ff 8d 76 00 6a 00 68 50 7a 10
>>EIP; fffad03e <machine_check+2/10> <=====
>>eax; 409e55b8 <pg0+3e3aa5b8/fd971000>
>>ebx; 40c52660 <pg0+3e617660/fd971000>
>>edx; 40331468 <pg0+3dcf6468/fd971000>
>>esi; 40cd0018 <pg0+3e695018/fd971000>
>>edi; 403317e8 <pg0+3dcf67e8/fd971000>
>>ebp; bdffe2cc <pg0+bb9c32cc/fd971000>
>>esp; d0e9ffe8 <pg0+ce864fe8/fd971000>
Code; fffad03e <machine_check+2/10>
Code; fffad03e <machine_check+2/10> <=====
0: ff 35 10 8e 4a 02 pushl 0x24a8e10 <=====
Code; fffad044 <machine_check+8/10>
6: e9 ab fd ff ff jmp fffffdb6 <_EIP+0xfffffdb6>
Code; fffad049 <machine_check+d/10>
b: 8d 76 00 lea 0x0(%esi),%esi
Code; fffad04c <spurious_interrupt_bug+0/50fb4>
e: 6a 00 push $0x0
Code; fffad04e <spurious_interrupt_bug+2/50fb4>
10: 68 50 7a 10 00 push $0x107a50
from System map:
024a8e10 D machine_check_vector
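For reference, the faulting instruction can be cross-checked against the System.map entry by hand. The sketch below, in C, decodes the first six bytes of the Code: line as pushl with an absolute 32-bit operand (opcode FF, ModRM 35) and recovers the address. The function and array names are made up for illustration; they are not part of ksymoops or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode "pushl <abs32>": opcode FF with ModRM 0x35 (/6, bare disp32).
 * Returns 1 and stores the absolute address on a match, 0 otherwise.
 * Hypothetical helper, written only for this illustration.
 */
int decode_pushl_abs32(const unsigned char *code, size_t len, uint32_t *addr)
{
    assert(code != NULL && addr != NULL);
    if (len < 6 || code[0] != 0xff || code[1] != 0x35)
        return 0;
    /* x86 stores the 32-bit displacement in little-endian order. */
    *addr = (uint32_t)code[2]
          | ((uint32_t)code[3] << 8)
          | ((uint32_t)code[4] << 16)
          | ((uint32_t)code[5] << 24);
    return 1;
}

/* First six bytes of the "Code:" line in the oops above. */
const unsigned char oops_code[6] = { 0xff, 0x35, 0x10, 0x8e, 0x4a, 0x02 };
```

Called on those six bytes, it yields 0x024a8e10, which is both the faulting virtual address in the oops and the System.map address of machine_check_vector.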
Created attachment 134797 [details]
fixed by attached patch
Could you please comment on this bug?
From our point of view, it can explain the various troubles on nodes running a
kernel with the 4G split patch: memory corruption, hangs, and
reboots without any diagnostics.
Please try with our latest RHEL4 kernel. The kernel you are reporting this
problem on isn't a RHEL kernel, and all of the RHEL4 kernels appear to properly
switch to kernel space via the error_code path in entry.S before calling the
vector pushed onto the stack.
OK, please look at arch/i386/kernel/entry.S.
I've copied it from our 2.6.9-42.0.3 kernel:
We see three interrupt vectors here. Note that in the first two cases we push
constants onto the stack: $do_alignment_check and $do_page_fault.
But in the case of machine_check we read the contents of the kernel-space _variable_
machine_check_vector, and we do it _before_ the jump to error_code, where we switch
the context from user space to kernel space.
Therefore we access a kernel-space address in user-space context. Usually
this address is not mapped, and we get an oops message. But if this address is
mapped in userspace, we will access it, read its contents as
machine_check_vector, and push that onto the stack.
Then we jump to error_code, switch the context, and call the wrong pointer in
kernel-space context, with unpredictable behaviour.
movl ORIG_EAX(%esp), %esi # get the error code
movl ES(%esp), %edi # get the function address
leal 4(%esp), %edx # prepare pt_regs
pushl %edx # push pt_regs
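To make the sequence concrete, here is a rough user-space model in C of the ordering problem described above. It is not kernel code: the address space is reduced to a single word at the address of machine_check_vector, and all names (addr_space, machine_check_buggy, bogus_target and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; every name below is hypothetical, not kernel code. */
typedef int (*handler_t)(void);

int real_mce_handler(void) { return 1; }   /* stands in for the intended handler */
int bogus_target(void)     { return -1; }  /* whatever the user data points at */

/* One address space's view of the address of machine_check_vector. */
struct addr_space {
    int       mapped;   /* is that virtual address mapped in this view? */
    handler_t word;     /* the word stored there, if mapped */
};

enum outcome { HANDLED, OOPS, WRONG_POINTER_CALLED };

/*
 * Buggy ordering: the entry stub reads the variable in whichever address
 * space was active when the exception hit; only afterwards does error_code
 * switch to the kernel context and call the value that was pushed.
 */
enum outcome machine_check_buggy(const struct addr_space *active)
{
    handler_t pushed;

    assert(active != NULL);
    if (!active->mapped)
        return OOPS;            /* "pushl machine_check_vector" faults */
    pushed = active->word;      /* value that ends up on the stack */

    /* error_code: context is now kernel, but the pointer is already chosen */
    return pushed() == 1 ? HANDLED : WRONG_POINTER_CALLED;
}
```

With a kernel view the model returns HANDLED, with an unmapped user view it returns OOPS (the case in the report), and with a user view that happens to map the address it returns WRONG_POINTER_CALLED.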
I would note that this is a real issue: we (the SWsoft Virtuozzo/OpenVZ kernel team)
have received two such oopses from our customers.
Ok, my bad, I see what you're doing now. I was hung up thinking you were worried
about an alignment issue with the movl ORIG_EAX(%esp) in error_code, since
machine_check pushes a 0 error code onto the stack, which I see now that you
aren't. This looks to make sense to me now. You're getting the oops because
the machine_check_vector is holding a kernel address, and we're trying to access
it from user space context. Your patch fixes that by adding the
machine_check_vector variable to the entry.text trampoline area, so that it can
be safely accessed from user space.
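As a rough illustration of why moving the variable helps, the toy C model below assumes a trampoline region that is mapped at the same address in both the user and kernel views, so a read made before the context switch sees the same word either way. The struct and function names are invented for this sketch and do not come from the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all names are hypothetical. */
typedef int (*handler_t)(void);

int real_mce_handler(void) { return 1; }   /* stands in for the real handler */

/* Region assumed to be visible at the same address in every context. */
struct trampoline {
    handler_t machine_check_vector;
};

/* An address-space view; private mappings are left out of the model. */
struct addr_space {
    const struct trampoline *shared;   /* same object in user and kernel views */
};

/* The read done by the entry stub, in whichever view is currently active. */
handler_t read_vector(const struct addr_space *active)
{
    assert(active != NULL && active->shared != NULL);
    return active->shared->machine_check_vector;
}
```

Because both views point at the same trampoline object, read_vector returns the same handler no matter which view is active when the exception arrives.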
The patch looks fine to me, although it appears that Dave Jones is proposing to
handle it slightly differently upstream:
It appears he's trying to keep excess stuff out of the trampoline area. Unless
you have a particular objection, I'm going to try to do this in line with
upstream (assuming his patch gets taken, no sign of it yet). I'll post a patch
here for testing shortly.
Scratch that last comment, didn't see the date on that post. So it actually
looks like this needs to go upstream as well.
I'm not sure that this patch will be included in mainstream.
I would note that this is a 4G/4G split-related issue. Mainstream Linux kernels
are not vulnerable because the 4G/4G split patch is not included in the mainstream kernel.
As far as I know, the 4G split patch is now used only in RHEL i386 hugemem kernels
and in our Virtuozzo/OpenVZ kernels. Old FC kernels did use it but dropped it a
long time ago.
Do you (or Ingo Molnar) perhaps know of other vendors who use the 4G split patch?
Created attachment 140648 [details]
alternate patch to fix the read of kernel space memory in user space context
Please test this alternate patch and confirm that it solves the problem equally
well. I'm not sure which patch is more appropriate yet (add a few dozen bytes
to kernel text, or 4 bytes to the shared trampoline area, which should be as
small as possible), but I'd like to have both alternatives available when I
propose a solution. Thanks
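A minimal sketch of the "more kernel text" style of fix, under the assumption that it works by pushing the constant address of a small wrapper and letting that wrapper read machine_check_vector only after error_code has switched context. This is a guess at the general shape of such a fix, not the content of attachment 140648; every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; all names are hypothetical. */
typedef int (*handler_t)(void);

int real_mce_handler(void) { return 1; }   /* stands in for the real handler */

/* Kernel-only data: valid to read only in kernel context. */
static handler_t machine_check_vector_model = real_mce_handler;

enum context { USER_CONTEXT, KERNEL_CONTEXT };
static enum context current_context = USER_CONTEXT;

/* Wrapper in kernel text; its address is a link-time constant. */
int machine_check_wrapper(void)
{
    assert(current_context == KERNEL_CONTEXT);   /* safe to touch kernel data */
    return machine_check_vector_model();
}

/* Entry stub: pushes a constant, so no memory is read before the switch. */
int machine_check_entry(void)
{
    handler_t pushed = machine_check_wrapper;    /* like "pushl $constant" */
    int ret;

    current_context = KERNEL_CONTEXT;            /* error_code switches context */
    ret = pushed();                              /* indirect call via the stack slot */
    current_context = USER_CONTEXT;              /* back to the interrupted task */
    return ret;
}
```

In this model the variable is dereferenced only inside the wrapper, so the user-space mapping at that address never matters, at the cost of the extra wrapper code.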
Both patches are correct. There is no real difference in efficiency;
however, your patch looks a bit better to me.
From my point of view your patch is better. Our patch is unclean and looks like
a hack, but your patch looks like the correct solution. It fixes the root cause of
this error: access to a kernel-space variable before error_code. Nobody else does that,
and I assume there is some important reason. Do you think there may be
other situations where this goes wrong? Perhaps we have no guarantee that some segment
registers are correct before error_code? Perhaps when an MCE interrupts the CPU in
VM86 mode? In that case we have a chance to get your patch committed to mainstream.
Also I would note that my patch requires write permission for the trampoline
area. In our case the variable placed in this area gets changed, which is not very
good in principle. In your case the trampoline area can be write-protected, which
is yet another small advantage of your patch.
I hadn't considered the read/write dilemma of your patch, but I would guess
that entry.text needs to be read/write in the case of signal handler returns
(not sure though). Either way, if you're comfortable with my variant of the
patch, I'll go propose this upstream, and, if accepted, I'll get it into update
posted upstream for review
Just got pulled into -mm. I'll post internally soon
committed in stream U5 build 42.25. A test kernel with this patch is available
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
QE ack for RHEL4.5.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.