The SWsoft Virtuozzo/OpenVZ Linux kernel team has found an issue related to the 4/4GB split: the machine check exception handler accesses kernel-space memory (machine_check_vector) before switching to the kernel-space context. If an MCE interrupts a userspace application, this usually produces a non-fatal oops message. However, if that memory address is mapped by the application, the kernel will jump to a wrong pointer, which can lead to memory corruption, hangs, or reboots.
Oops example from Virtuozzo/OpenVZ kernel 2.6.8-022stab078.9-enterprise (with 4/4GB split patch):

Unable to handle kernel paging request at virtual address 024a8e10
fffad03e
*pde = 0063c027
Oops: 0000 [#1]
CPU: 0, VCPU: 1:2
EIP: 0060:[<fffad03e>] Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206 (2.6.8-022stab078.9-enterprise)
eax: 409e55b8 ebx: 40c52660 ecx: 00000200 edx: 40331468
esi: 40cd0018 edi: 403317e8 ebp: bdffe2cc esp: d0e9ffe8
ds: 007b es: 007b ss: 0068
Stack: 00000000 08217e13 00000073 00000206 bdffe2a4 0000007b
Call Trace:
Code: ff 35 10 8e 4a 02 e9 ab fd ff ff 8d 76 00 6a 00 68 50 7a 10

>>EIP; fffad03e <machine_check+2/10> <=====

>>eax; 409e55b8 <pg0+3e3aa5b8/fd971000>
>>ebx; 40c52660 <pg0+3e617660/fd971000>
>>edx; 40331468 <pg0+3dcf6468/fd971000>
>>esi; 40cd0018 <pg0+3e695018/fd971000>
>>edi; 403317e8 <pg0+3dcf67e8/fd971000>
>>ebp; bdffe2cc <pg0+bb9c32cc/fd971000>
>>esp; d0e9ffe8 <pg0+ce864fe8/fd971000>

Code; fffad03e <machine_check+2/10>
00000000 <_EIP>:
Code; fffad03e <machine_check+2/10> <=====
   0: ff 35 10 8e 4a 02    pushl 0x24a8e10 <=====
Code; fffad044 <machine_check+8/10>
   6: e9 ab fd ff ff       jmp fffffdb6 <_EIP+0xfffffdb6>
Code; fffad049 <machine_check+d/10>
   b: 8d 76 00             lea 0x0(%esi),%esi
Code; fffad04c <spurious_interrupt_bug+0/50fb4>
   e: 6a 00                push $0x0
Code; fffad04e <spurious_interrupt_bug+2/50fb4>
  10: 68 50 7a 10 00       push $0x107a50

From System map:
024a8e10 D machine_check_vector
Created attachment 134797: fixed by attached patch
Ingo, could you please comment on this bug? From our point of view, it could explain the various problems seen on nodes running a kernel with the 4G split patch: memory corruption, hangs, and reboots without any diagnostics.
Please try with our latest RHEL4 kernel. The kernel you are reporting this problem on isn't a RHEL kernel, and all of the RHEL4 kernels appear to properly switch to kernel space via the error_code path in entry.S before calling the vector pushed onto the stack.
OK, please look at arch/i386/kernel/entry.S. I've copied this from our 2.6.9-42.0.3 kernel:

    ENTRY(alignment_check)
        pushl $do_alignment_check
        jmp error_code

    ENTRY(page_fault)
        pushl $do_page_fault
        jmp error_code

    #ifdef CONFIG_X86_MCE
    ENTRY(machine_check)
        pushl $0
        pushl machine_check_vector
        jmp error_code
    #endif

We see three interrupt vectors here. Please note that in the first two cases we push constants onto the stack: $do_alignment_check and $do_page_fault. But in the machine_check case we read the contents of the kernel-space _variable_ machine_check_vector, and we do it _before_ jumping to error_code, where the context is switched from user space to kernel space. Therefore we access a kernel-space address in user-space context. Usually this address is not mapped, and we get an oops message. But if this address is mapped in user space, we read whatever is stored there as machine_check_vector and push it onto the stack. Then we jump to error_code, switch the context, and call a wrong pointer in kernel-space context, with unpredictable behaviour.

    error_code:
        ...
        movl ORIG_EAX(%esp), %esi    # get the error code
        movl ES(%esp), %edi          # get the function address
        ...
        __SWITCH_KERNELSPACE
        leal 4(%esp), %edx           # prepare pt_regs
        pushl %edx                   # push pt_regs
        call *%edi

I would note that this is a real issue: we (the SWsoft Virtuozzo/OpenVZ kernel team) have received two such oopses from our customers.
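The ordering problem described in this comment can be illustrated with a small toy model in C. This is only a sketch under stated assumptions, not kernel code: the type address_space and the functions load_vector, entry_load_before_switch, and entry_load_after_switch are hypothetical names invented for the illustration. The model reduces each context to two facts about the virtual address of machine_check_vector: whether it is mapped there, and whether it holds the real handler pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Possible results of entering the (modelled) machine check path. */
enum outcome { CALLS_REAL_HANDLER, OOPS_UNMAPPED, CALLS_WRONG_POINTER };

/*
 * Hypothetical model: what a load from the virtual address of
 * machine_check_vector sees under one set of page tables.
 */
struct address_space {
    bool mapped;         /* is that address mapped in this context? */
    bool holds_handler;  /* does it contain the real handler pointer? */
};

/* Model of "pushl machine_check_vector" executed in a given context. */
enum outcome load_vector(const struct address_space *as)
{
    assert(as != NULL);
    if (!as->mapped)
        return OOPS_UNMAPPED;        /* page fault on the load: oops */
    return as->holds_handler ? CALLS_REAL_HANDLER : CALLS_WRONG_POINTER;
}

/* Ordering in the quoted stub: load first, switch context afterwards. */
enum outcome entry_load_before_switch(const struct address_space *interrupted,
                                      const struct address_space *kernel)
{
    (void)kernel;                    /* not active yet when the load runs */
    return load_vector(interrupted);
}

/* Alternative ordering: switch to the kernel context, then load. */
enum outcome entry_load_after_switch(const struct address_space *interrupted,
                                     const struct address_space *kernel)
{
    (void)interrupted;               /* no longer active when the load runs */
    return load_vector(kernel);
}
```

In this model, an interrupted user task with the address unmapped gives OOPS_UNMAPPED for the load-before-switch ordering, a task that happens to map the address gives CALLS_WRONG_POINTER, and the load-after-switch ordering gives CALLS_REAL_HANDLER in both cases.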
OK, my bad, I see what you're doing now. I was hung up thinking you were worried about an alignment issue with the movl ORIG_EAX(%esp) in error_code, since machine_check pushes a 0 error code onto the stack, which I now see you aren't. This makes sense to me now. You're getting the oops because machine_check_vector is a kernel-space variable, and we're trying to access it from user-space context. Your patch fixes that by adding the machine_check_vector variable to the entry.text trampoline area, so that it can be safely accessed from user space. The patch looks fine to me, although it appears that Dave Jones is proposing to handle it slightly differently upstream: http://www.uwsg.iu.edu/hypermail/linux/kernel/0210.3/0669.html It appears he's trying to keep excess stuff out of the trampoline area. Unless you have a particular objection, I'm going to try to do this in line with upstream (assuming his patch gets taken; no sign of it yet). I'll post a patch here for testing shortly.
Scratch that last comment; I didn't see the date on that post. So it actually looks like this needs to go upstream as well.
I'm not sure that this patch will be accepted into mainline. I would note that this is a 4G/4G split-related issue. Mainline Linux kernels are not vulnerable, because the 4G/4G split patch is not included in the mainline kernel. As far as I know, the 4G split patch is now used only in RHEL i386 hugemem kernels and in our Virtuozzo/OpenVZ kernels. Old FC kernels used it but dropped it long ago. Do you (or Ingo Molnar) know of other vendors who use the 4G split patch?
Created attachment 140648: alternate patch to fix the read of kernel space memory in user space context

Please test this alternate patch and confirm that it solves the problem equally well. I'm not sure which patch is more appropriate yet (add a few dozen bytes to kernel text, or 4 bytes to the shared trampoline area, which should be as small as possible), but I'd like to have both alternatives available when I propose a solution. Thanks
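The attachment itself is not reproduced in this thread, so the following C sketch is only one plausible shape for a fix of the "few dozen bytes of kernel text" kind, not the actual patch. The idea it illustrates: the entry stub pushes the constant address of a small dispatch function, and that function reads the vector variable only once it is running in kernel context. All names here (regs_model, mce_handler_t, mce_vector_model, mce_dispatch_stub, and the two handlers) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the saved register frame. */
struct regs_model { long ip; };

typedef void (*mce_handler_t)(struct regs_model *regs, long error_code);

int default_calls;       /* how often the default handler ran */
int cpu_specific_calls;  /* how often the replacement handler ran */

/* Handler installed by default in this model. */
void default_mce_handler(struct regs_model *regs, long error_code)
{
    (void)regs;
    (void)error_code;
    default_calls++;
}

/* Handler that setup code might install later in this model. */
void cpu_specific_mce_handler(struct regs_model *regs, long error_code)
{
    (void)regs;
    (void)error_code;
    cpu_specific_calls++;
}

/* Models the writable kernel-space variable holding the handler pointer. */
mce_handler_t mce_vector_model = default_mce_handler;

/*
 * The entry stub would push the constant address of this function,
 * just as the other entries push $do_page_fault. The variable is read
 * here, after the switch to kernel context has already happened.
 */
void mce_dispatch_stub(struct regs_model *regs, long error_code)
{
    assert(mce_vector_model != NULL);
    mce_vector_model(regs, error_code);
}
```

Because only a constant is pushed before error_code in this arrangement, nothing in the shared trampoline area has to be written at runtime, at the cost of one extra indirect call per machine check.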
Both patches are correct. There is no real difference in efficiency; however, your patch looks a bit better to me.
Neil, from my point of view your patch is better. Our patch is unclean and looks like a hack, whereas yours looks like the correct solution. It fixes the root cause of this error -- the access to a kernel-space variable before error_code. Nobody else does this, and I assume there is an important reason for that. Do you think there could be other situations where this goes wrong? Perhaps we have no guarantee that some segment registers are correct before error_code? Perhaps when an MCE interrupts the CPU in VM86 mode? If so, we have a chance of getting your patch into mainline. I would also note that my patch requires write permission for the trampoline area. In our case the variable placed in this area gets modified, which is not very good in principle. With your patch the trampoline area can be write-protected, which is another small advantage.
I hadn't considered the read/write dilemma of your patch, but I would guess that entry.text needs to be read/write in the case of signal handler returns (not sure though). Either way, if you're comfortable with my variant of the patch, I'll go propose this upstream, and, if accepted, I'll get it into update 5 ASAP.
posted upstream for review
Just got pulled into -mm. I'll post internally soon
committed in stream U5 build 42.25. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
QE ack for RHEL4.5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html