Escalated to Bugzilla from IssueTracker
Event posted on 10-08-2009 12:02pm EDT by jabrown

crash> sys
      KERNEL: vmlinux
    DUMPFILE: 351790-vmcore
        CPUS: 16
        DATE: Wed Oct 7 17:22:03 2009
      UPTIME: 00:03:44
LOAD AVERAGE: 1.24, 0.41, 0.14
       TASKS: 185
    NODENAME: svlhiav-ventura-1
     RELEASE: 2.6.9-89.0.9.ELhugemem
     VERSION: #1 SMP Wed Aug 19 08:12:26 EDT 2009
     MACHINE: i686 (2926 Mhz)
      MEMORY: 64 GB
       PANIC: "Oops: 0000 [#1]" (check log for details)

crash> bt
PID: 474 TASK: 8b56adf0 CPU: 2 COMMAND: "mpt-status"
 #0 [8c90ebb4] netpoll_start_netdump at f8f1b570
 #1 [8c90ebd4] die at 210633d
 #2 [8c90ec08] do_invalid_op at 2106718
 #3 [8c90ecb8] error_code (via invalid_op) at fffecede
    EAX: 0000002f EBX: 8c90e000 ECX: 8c90ecf0 EDX: 022e72a3 EBP: 022e6f22
    DS: 007b ESI: 022dfa48 ES: 007b EDI: 00000000
    CS: 0060 EIP: 02122cce ERR: ffffffff EFLAGS: 00010286
 #4 [8c90ecf4] panic at 2122cce
 #5 [8c90ecfc] die at 21063bf
 #6 [8c90ed34] do_page_fault at 211bac5
 #7 [8c90ee14] error_code (via page_fault) at fffecede
    EAX: 00000080 EBX: 00000028 ECX: 8d719000 EDX: 8d719000 EBP: feedf604
    DS: 007b ESI: 8c90ef20 ES: 007b EDI: 00000000
    CS: 0060 EIP: f88afeb0 ERR: ffffffff EFLAGS: 00010283
 #8 [8c90ee50] mptctl_do_mpt_command at f88afeb0
 #9 [8c90eedc] mptctl_mpt_command at f88afd7b
#10 [8c90ef68] mptctl_ioctl at f88ae7bf
#11 [8c90ef94] sys_ioctl at 216cc13
#12 [8c90efc0] system_call at fffec219
    EAX: 00000036 EBX: 00000003 ECX: c0386d14 EDX: feedf5d0
    DS: 007b ESI: 08fa02e8 ES: 007b EDI: feedcdd0
    SS: 007b ESP: feeda5a8 EBP: feee1de8
    CS: 0073 EIP: 08056074 ERR: 00000036 EFLAGS: 00000282

crash> ps | grep "> "
> 0 1 1 236b10b0 RU 0.0 0 0 [swapper]
> 0 1 3 236b05b0 RU 0.0 0 0 [swapper]
> 0 1 4 236b0030 RU 0.0 0 0 [swapper]
> 0 1 5 236b3670 RU 0.0 0 0 [swapper]
> 0 1 6 236b30f0 RU 0.0 0 0 [swapper]
> 0 1 7 236b2b70 RU 0.0 0 0 [swapper]
> 0 1 8 236b25f0 RU 0.0 0 0 [swapper]
> 0 1 9 236b2070 RU 0.0 0 0 [swapper]
> 0 1 10 236c56b0 RU 0.0 0 0 [swapper]
> 0 1 11 236c5130 RU 0.0 0 0 [swapper]
> 0 1 12 236c4bb0 RU 0.0 0 0 [swapper]
> 0 1 13 236c4630 RU 0.0 0 0 [swapper]
> 0 1 14 236c40b0 RU 0.0 0 0 [swapper]
> 0 1 15 236c76f0 RU 0.0 0 0 [swapper]
> 474 473 2 8b56adf0 RU 0.0 1708 144 mpt-status
> 722 1 0 8b2ae630 RU 0.0 3232 720 xinetd

crash> ps | grep UN
31717 1 2 8ba36db0 UN 0.0 0 0 [kjournald]
32382 1 10 8cb60c70 UN 0.0 2368 728 syslogd

This event sent from IssueTracker by djeffery [SEG - Storage] issue 351790
Event posted on 10-08-2009 03:41pm EDT by djeffery

The system crash is due to a bug in the mpt driver. The second parameter of mptctl_do_mpt_command(), mfPtr, is a pointer to a userspace address. The function dereferences this pointer directly:

    if (((MPIHeader_t *)(mfPtr))->MsgContext == 0x02012020) {

which is the instruction the kernel crashed at:

    0xf88afeb0 <mptctl_do_mpt_command+295>: cmpl $0x2012020,0x8(%ebp)

While always unsafe, the driver can usually get away with this addressing violation because the race window for the page to be unmapped is small on most kernels. But this is a largemem kernel, and direct userspace address accesses from the kernel don't work on largemem kernels. So this bug can be very rare to trigger on non-largemem kernels while failing horribly on systems like this one that use a largemem kernel.

This event sent from IssueTracker by djeffery [SEG - Storage] issue 351790
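For illustration, the safe pattern implied by this analysis can be sketched in plain userspace C: copy the header out of the user buffer first, then test MsgContext on the local copy instead of dereferencing the user pointer. This is only a sketch and not the driver's code. The names demo_mpi_header_t, demo_copy_from_user() and demo_check_msg_context() are hypothetical, the struct is a simplified stand-in for MPIHeader_t that only places the field at byte offset 8 (as in the 0x8(%ebp) operand above), and the mock copy routine treats a NULL source as a faulting user address.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for MPIHeader_t. Only the position of
 * the context field matters for this sketch. */
typedef struct {
    uint8_t  reserved[8];
    uint32_t msg_context;   /* assumed to sit at byte offset 8 */
} demo_mpi_header_t;

#define DEMO_MAGIC_CONTEXT 0x02012020u

/* Userspace mock of the kernel's copy_from_user(): returns the number of
 * bytes that could NOT be copied, so 0 means success. A NULL source is
 * used here to model a user address that faults. */
static unsigned long demo_copy_from_user(void *to, const void *from,
                                         unsigned long n)
{
    if (from == NULL)
        return n;
    memcpy(to, from, n);
    return 0;
}

/* Returns 1 if the header carries the magic context, 0 if it does not,
 * and -EFAULT if the user buffer could not be read. The user pointer is
 * never dereferenced directly; only the local copy is inspected. */
static int demo_check_msg_context(const void *user_ptr)
{
    demo_mpi_header_t hdr;  /* local ("kernel-side") copy */

    assert(offsetof(demo_mpi_header_t, msg_context) == 8);

    if (demo_copy_from_user(&hdr, user_ptr, sizeof(hdr)))
        return -EFAULT;

    return hdr.msg_context == DEMO_MAGIC_CONTEXT;
}
```

A caller would pass the user-supplied pointer to demo_check_msg_context() and treat a negative return as a failed read, in the same spirit as the rc = -EFAULT handling discussed in this report.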
Sathya,

Can you take a look at comment 4 and provide a patch similar to upstream:

mptctl_do_mpt_command()

    if (copy_from_user(mf, mfPtr, karg.dataSgeOffset * 4)) {
        printk(MYIOC_s_ERR_FMT "%s@%d::mptctl_do_mpt_command - "
            "Unable to read MF from mpt_ioctl_command struct @ %p\n",
            ioc->name, __FILE__, __LINE__, mfPtr);
        function = -1;
        rc = -EFAULT;
        goto done_free_mem;
    }

Rob
I'm adding Cisco Engineering to this bugzilla. Question for Cisco Eng.: Can a B200 be used to test this, or must it be tested on a B250 (due to the high memory)? If a B250 is required, Cisco and/or Cisco IT must assist in testing and verifying the fix. Thanks!
BTW: I'm noticing this was found on an i386 arch - something that was not included in the current 4.8 hardware certification. Gary, can you confirm?
That's correct. This system was never certified on 32-bit RHEL. -Gary
> While always unsafe, the driver can usually get away with this addressing
> violation as the race window for the page to be unmapped is small on most
> kernels. But this is a largemem kernel. Direct userspace address
> accesses from the kernel don't work on largemem kernels. So this bug can
> be very rare to trigger on any non-largemem kernels while failing horribly
> on systems like this one that use a largemem kernel.

Since this bug can also happen on non-largemem kernels, it still needs to be fixed, independent of the Cisco certification or the kernel on which it was observed.

Rob
I am adding Kashyap, who currently handles this driver, for further analysis and action.
Created attachment 366579 [details] Proposed fix in mptclt.c for kernel crash

Rob,

I have attached a proposed patch for this issue. The kernel crash at "if (((MPIHeader_t *)(mfPtr))->MsgContext == 0x02012020)" is valid. This code is not present upstream. We cannot remove it because some old user-space applications require this condition check. Treating that line as mandatory for RHEL4.8, I have provided a patch that accesses user-space memory from the kernel address space in a better way.

Thanks,
Kashyap
Kashyap,

I had a few problems with this patch:

- The patch needs to be generated from rhel4.8. There is context in the patch that differs from rhel4.8.
- The patch needs to apply cleanly to rhel4.8 using the -p1 option.
- The 4.8 code has the following snippet that appears to conflict with the one in this patch:

    /* Copy the request frame
     * Reset the saved message context.
     */
    if (copy_from_user(mf, mfPtr, karg.dataSgeOffset * 4)) {
        printk(KERN_ERR "%s@%d: mptctl_do_mpt_command - "
            "Unable to read MF from mpt_ioctl_command struct @ %p\n",
            __FILE__, __LINE__, mfPtr);
        rc = -EFAULT;
        goto done_free_mem;
    }

Please resolve these issues and re-attach the patch.

Rob
Created attachment 366762 [details] recreated patch for RHEL4.8 kernel Recreated patch for RHEL4.8 kernel. Please try this new patch with -p1 option. Thanks, Kashyap
Hi Kashyap, Can you confirm whether this problem exists in rhel5? It looks like the driver has diverged a bit in rhel5 from rhel4. Thanks, Rob
This problem does not exist in RHEL5. - Kashyap
Committed in 89.16.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
The latest EUS .z stream bits can be found here: http://people.redhat.com/~cward/4.8.z/kernel/ Please report back on the testing status of this bug as soon as possible. The target END TESTING date for this 4.8.z kernel is approximately December 15th, 2009 (2009-12-15). When reporting your results, make sure to indicate which version of the kernel build you tested. Thank you for your expedited response!
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: A bug in the mptctl_do_mpt_command() function in the mpt driver may have resulted in crashes during boot on i386 systems with certain adapters using the mpt driver, and also running the hugemem kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html