Description of problem:
They have 6 new boxes; 5 are running okay, but one is oopsing frequently.

Version-Release number of selected component (if applicable):
VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
MACHINE: x86_64 (2612 Mhz)
MEMORY: 34 GB
RELEASE: 2.6.9-42.0.2.ELsmp

How reproducible:
Frequently

Steps to Reproduce:
No steps have been identified yet.

Actual results:
oopses

Additional info:
* Product Name: ProLiant DL585 G2
* The HW diags tools ran clean, no errors reported.
* The memtest86 ran clean, no errors.
* We have sysreports of both the 'bad' and 'good' systems.
* No differences in dmidecode/lspci between the 'bad' and 'good' systems.
* Dual-Core AMD Opteron(tm) Processor 8218 stepping 03
* nVidia Corporation CK804
* 3 vmcores available, 1 oops, NMI messages:

...
Uhhuh. NMI received for unknown reason 20.
Uhhuh. NMI received for unknown reason 20.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
<repeated, followed by the oops below>
egrep: Corrupted page table at address 39164b9272
PML4 4755d3067 PGD 475def067 PMD 474cb3067 PTE 5720796572666665
Bad pagetable: 001d [1] SMP
CPU 1
-------------------------------------------------------------------

On the other two oopses the registers held huge numbers:

Core1 backtrace, frame #4 dump:
 #4 [104700efd70] error_exit at ffffffff80110d91
    [exception RIP: vfs_getattr+46]
    RIP: ffffffff8018197c  RSP: 00000104700efe28  RFLAGS: 00010246
    RAX: 4631313532353046  RBX: 000001026d888a98  RCX: 0000000000000046
         ^^^^^^^^^^^^^^^^
    RDX: 00000104700efef8  RSI: 000001026d888a98  RDI: 0000010478421900
    RBP: 00000104700efef8   R8: 000000000000000f   R9: 0000000000000001
...
0xffffffff80181966 <vfs_getattr+24>: mov 0x10(%rsi),%r12
0xffffffff8018196a <vfs_getattr+28>: callq *0x1a0(%rax)
0xffffffff80181970 <vfs_getattr+34>: test %eax,%eax
0xffffffff80181972 <vfs_getattr+36>: jne 0xffffffff801819e2 <vfs_getattr+148>
0xffffffff80181974 <vfs_getattr+38>: mov 0xf8(%r12),%rax
0xffffffff8018197c <vfs_getattr+46>: mov 0x78(%rax),%rax   <--crash
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0xffffffff80181980 <vfs_getattr+50>: test %rax,%rax

The offset 0x78 is in struct inode_operations:

crash> size -o inode_operations
struct inode_operations {
  [0x0] int (*create)(struct inode *, struct dentry *, int, struct nameidata
  ....
  [0x78] int (*getattr)(struct vfsmount *, struct dentry *, struct kstat *);

The relevant code was:

int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
...
        if (inode->i_op->getattr)
                return inode->i_op->getattr(mnt, dentry, stat);
...

but the inode->i_op pointer is corrupted: it holds the value seen in RAX, 4631313532353046.

The corruption also reached the slab:

kmem: ext3_inode_cache: full list: slab: 10872f43100 bad next pointer: 40115cd0b006920
kmem: ext3_inode_cache: full list: slab: 10872f43100 bad prev pointer: 205b020931081811
kmem: ext3_inode_cache: full list: slab: 10872f43100 bad inuse counter: 1094005561
kmem: ext3_inode_cache: full list: slab: 10872f43100 bad inuse counter: 1094005561
kmem: ext3_inode_cache: full list: slab: 10872f43100 bad s_mem pointer: 3734463333423839

On core2 (another backtrace):
 #4 [102758639e0] error_exit at ffffffff80110d91
    [exception RIP: __find_get_block_slow+125]
    RIP: ffffffff8017adca  RSP: 0000010275863a98  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: 4f4c41444e415453  RCX: 000001027fb3d3f0
                                ^^^^^^^^^^^^^^^^

Note another huge number, this time in RBX. The faulting instruction was:

0xffffffff8017adca <__find_get_block_slow+125>: cmp %r12,0x20(%rbx)

so RBX should also hold a pointer. This is almost the same symptom as in core1, but no slabs were affected by this corruption.
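One way to check whether the "huge numbers" are stray data rather than damaged pointers is to read each value as eight raw bytes. The sketch below is not part of the original analysis: it is plain Python, `as_ascii` is a hypothetical helper name, and it assumes the values were loaded from memory in x86_64 little-endian byte order. Under that assumption, the corrupted values from core1, core2 and the slab check all decode to printable ASCII, which would suggest that text data is being written over kernel memory.

```python
def as_ascii(value: int) -> str:
    """Render a 64-bit register value as the 8 bytes it occupied in memory.

    Assumes little-endian layout (x86_64); non-printable bytes become '.'.
    """
    raw = value.to_bytes(8, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# Values copied from the core1/core2 register dumps and the kmem slab output.
suspects = {
    "core1 RAX (inode->i_op)": 0x4631313532353046,
    "core2 RBX": 0x4F4C41444E415453,
    "slab s_mem pointer": 0x3734463333423839,
}

for name, value in suspects.items():
    print(f"{name}: {value:016x} -> {as_ascii(value)!r}")
```

If the decoded strings can be matched to data that some driver or application handles on this box, that may help narrow down who is writing over these structures.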
On another oops:

Losing some ticks... checking if CPU frequency changed.
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
general protection fault: 0000 [1] SMP

I thought it could be timer related, so the box was booted with 'report_lost_ticks', which shows that it lost 1 tick:

time.c: Lost 1 timer tick(s)! rip __do_softirq+0x4d/0xd0)
<repeatedly>
Created attachment 292540
messages with 'report_lost_ticks' and NMI errors
full oops copied below
--------------------------------------------------------------------------
Will boot with that param now. Just rebooting caused an issue.

Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems: Unable to handle kernel paging request at 000010000000025f RIP:
<ffffffff80159d5c>{page_waitqueue+70}
PML4 0
Oops: 0000 [1] SMP
CPU 2
Modules linked in: mptctl sg autofs4 i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U) fdd(U) vxportal(U) vxfs(U) dm_mod button battery ac joydev ohci_hcd ehci_hcd uhci_hcd e1000 bnx2 ext3 jbd lpfc scsi_transport_fc mptsas cciss mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 21609, comm: umount Tainted: PF 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff80159d5c>] <ffffffff80159d5c>{page_waitqueue+70}
RSP: 0018:000001027356dd40 EFLAGS: 00010206
RAX: 512e10c73baf5018 RBX: 000001047c0f5018 RCX: 0000000000000040
RDX: 00000fffffffffff RSI: 000001047c0f5018 RDI: 0000000000000000
RBP: 430140240000a208 R08: 0000000000000004 R09: 000001027356dc28
R10: 000001047c0f5050 R11: 0000000000000000 R12: 000000000000000c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000010473b38360
FS: 0000002a95562b00(0000) GS:ffffffff804e5280(0000) knlGS:00000000f61edbb0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000010000000025f CR3: 0000000008020000 CR4: 00000000000006e0
Process umount (pid: 21609, threadinfo 000001027356c000, task 0000010116459030)
Stack: ffffffff80159d7a 000001047c0f5018 ffffffff80164144 000000000000000e
       0000000000000000 000001047c0f4d78 000001047c0f4db0 000001047c0f4de8
       000001047c0f4e20 000001047c0f4e58
Call Trace:
  <ffffffff80159d7a>{wake_up_page+9}
  <ffffffff80164144>{truncate_inode_pages+142}
  <ffffffff80191860>{dispose_list+76}
  <ffffffff80191be3>{invalidate_inodes+177}
  <ffffffff8017eb2d>{generic_shutdown_super+162}
  <ffffffff8017f979>{kill_block_super+19}
  <ffffffff8017ea72>{deactivate_super+95}
  <ffffffff8019429b>{sys_umount+925}
  <ffffffff80181d4c>{sys_newstat+17}
  <ffffffff80110d91>{error_exit+0}
  <ffffffff8011026a>{system_call+126}

Code: 2b 8a 60 02 00 00 48 d3 e8 48 6b c0 18 48 03 82 50 02 00 00
RIP <ffffffff80159d5c>{page_waitqueue+70} RSP <000001027356dd40>
CR2: 000010000000025f
CPU frozen: #0#1#3#4#5#6#7
CPU#2 is executing diskdump.
start dumping to cciss/c0d0p3
check dump partition...
dumping memory(partial dump with dump_level 19)..
539103(315254 skipped)/8388078 669 ETA \
core1 is here:
seg.rdu.redhat.com:/export/nfs/awashbro/144241/010608/144241.vmcore.010608

VMlinux is also on seg:
/usr/lib/debug/lib/modules/2.6.9-42.0.2.ELsmp/vmlinux

KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.2.ELsmp/vmlinux
DUMPFILE: 144241.vmcore.010608 [PARTIAL DUMP]
CPUS: 8
DATE: Sun Jan 6 05:38:50 2008
UPTIME: 01:11:34
LOAD AVERAGE: 0.12, 0.05, 0.01
TASKS: 456
NODENAME: uswxapstac05f
RELEASE: 2.6.9-42.0.2.ELsmp
VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
MACHINE: x86_64 (2612 Mhz)
MEMORY: 34 GB
PANIC: ""
PID: 572
COMMAND: "kjournald"
TASK: 104800ad030 [THREAD_INFO: 10275862000]
CPU: 4
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 572 TASK: 104800ad030 CPU: 4 COMMAND: "kjournald"
 #0 [10275863940] start_disk_dump at ffffffffa052736d
 #1 [10275863970] try_crashdump at ffffffff8014bd01
 #2 [10275863980] die at ffffffff80111c00
 #3 [102758639a0] do_general_protection at ffffffff801124e5
 #4 [102758639e0] error_exit at ffffffff80110d91
    [exception RIP: __find_get_block_slow+125]
    RIP: ffffffff8017adca  RSP: 0000010275863a98  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: 4f4c41444e415453  RCX: 000001027fb3d3f0
    RDX: 000001066a6212f0  RSI: 0000000000001626  RDI: 000001028003b5a0
    RBP: 000001027fb3d3f0   R8: 0000000000000008   R9: 000001000800d080
    R10: 0000000000001000  R11: 0000000000000000  R12: 0000000000001626
    R13: 000001028003b400  R14: 000001028003b520  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [10275863a90] __find_get_block_slow at ffffffff8017ada3
 #6 [10275863ad0] __find_get_block at ffffffff8017b52e
 #7 [10275863b60] __getblk at ffffffff8017d8fe
 #8 [10275863b90] journal_get_descriptor_buffer at ffffffffa00a1436
 #9 [10275863bb0] journal_commit_transaction at ffffffffa009cd59
#10 [10275863e80] kjournald at ffffffffa009f914
#11 [10275863f50] kernel_thread at ffffffff80110f47

lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
general protection fault: 0000 [1] SMP
CPU 4
Modules linked in: cpqci(U) mptctl sg ipmi_devintf ipmi_si ipmi_msghandler autofs4 i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U) fdd(U) vxportal(U) vxfs(U) dm_mod button battery ac joydev ohci_hcd ehci_hcd uhci_hcd e1000 bnx2 ext3 jbd lpfc scsi_transport_fc mptsas cciss mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 572, comm: kjournald Tainted: PF 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff8017adca>] <ffffffff8017adca>{__find_get_block_slow+125}
RSP: 0018:0000010275863a98 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 4f4c41444e415453 RCX: 000001027fb3d3f0
RDX: 000001066a6212f0 RSI: 0000000000001626 RDI: 000001028003b5a0
RBP: 000001027fb3d3f0 R08: 0000000000000008 R09: 000001000800d080
R10: 0000000000001000 R11: 0000000000000000 R12: 0000000000001626
R13: 000001028003b400 R14: 000001028003b520 R15: 0000000000000000
FS: 0000002a95562b00(0000) GS:ffffffff804e5380(0000) knlGS:00000000ebc5cbb0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000007fbffff836 CR3: 0000000878f98000 CR4: 00000000000006e0
Process kjournald (pid: 572, threadinfo 0000010275862000, task 00000104800ad030)
Stack: 0000000000000000 0000000000000000 0000000000001000 0000000000001626
       000001028003b340 0000010277336a00 0000000000000000 ffffffff8017b52e
       0000010277336a00 0000000000000216
Call Trace:
  <ffffffff8017b52e>{__find_get_block+162}
  <ffffffff8030ab37>{io_schedule+38}
  <ffffffff8017d8fe>{__getblk+20}
  <ffffffffa00a1436>{:jbd:journal_get_descriptor_buffer+43}
  <ffffffffa009cd59>{:jbd:journal_commit_transaction+1669}
  <ffffffff8030a1c1>{thread_return+0}
  <ffffffff8030a219>{thread_return+88}
  <ffffffffa009f914>{:jbd:kjournald+250}
  <ffffffff80135752>{autoremove_wake_function+0}
  <ffffffff80135752>{autoremove_wake_function+0}
  <ffffffffa009f814>{:jbd:commit_timeout+0}
  <ffffffff80110f47>{child_rip+8}
  <ffffffffa009f81a>{:jbd:kjournald+0}
  <ffffffff80110f3f>{child_rip+0}

Code: 4c 39 63 20 75 0a 48 89 1c 24 f0 ff 43 18 eb 5e 8b 03 48 8b
RIP <ffffffff8017adca>{__find_get_block_slow+125} RSP <0000010275863a98>
The above was actually core2; core1 is below:

1. Provide core file (if one is involved) and state:
* Location: dl585.gsslab.rdu.redhat.com:/work/samfw/144241
* Access info: root/redhat
* Backtrace output from the core file

Backtrace
KERNEL: /cores/20080103142241/work/vmlinux
DUMPFILE: /cores/20080103142241/work/RH144241.vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Mon Dec 31 18:11:27 2007
UPTIME: 1 days, 10:08:22
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 399
NODENAME: uswxapstac05f
RELEASE: 2.6.9-42.0.2.ELsmp
VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
MACHINE: x86_64 (2612 Mhz)
MEMORY: 34 GB
PANIC: ""

PID: 13020 TASK: 102741a0030 CPU: 1 COMMAND: "save"
 #0 [104700efcd0] start_disk_dump at ffffffffa052d36d
 #1 [104700efd00] try_crashdump at ffffffff8014bd01
 #2 [104700efd10] die at ffffffff80111c00
 #3 [104700efd30] do_general_protection at ffffffff801124e5
 #4 [104700efd70] error_exit at ffffffff80110d91
    [exception RIP: vfs_getattr+46]
    RIP: ffffffff8018197c  RSP: 00000104700efe28  RFLAGS: 00010246
    RAX: 4631313532353046  RBX: 000001026d888a98  RCX: 0000000000000046
    RDX: 00000104700efef8  RSI: 000001026d888a98  RDI: 0000010478421900
    RBP: 00000104700efef8   R8: 000000000000000f   R9: 0000000000000001
    R10: 0000000000000001  R11: ffffffff801cec14  R12: 000001066a9d7648
    R13: 0000010478421900  R14: 000000004049deb1  R15: 000000004049de10
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [104700efe20] vfs_getattr at ffffffff80181970
 #6 [104700efe50] vfs_lstat at ffffffff80181a5a
 #7 [104700efef0] sys_newlstat at ffffffff80181d75
 #8 [104700eff80] system_call at ffffffff8011026a
    RIP: 00000039164b8d25  RSP: 0000007fbffe4480  RFLAGS: 00000293
    RAX: 0000000000000006  RBX: ffffffff8011026a  RCX: 000000004049b4f0
    RDX: 0000007fbffe43c0  RSI: 0000007fbffe43c0  RDI: 0000000040478e50
    RBP: 00000000404be0e0   R8: 0000000000000001   R9: 000000004023bf00
    R10: 0000000000000001  R11: 0000000000000246  R12: 000000004049deb8
    R13: 0000007fbffe30c0  R14: 000000004026f7d0  R15: 0000000000000008
    ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
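Both the core1 and core2 backtraces go through do_general_protection rather than do_page_fault, while the umount oops reports "Unable to handle kernel paging request". A likely explanation is that x86_64 raises a general protection fault, not a page fault, when an instruction dereferences a non-canonical address. The following sketch illustrates the check; it assumes 48-bit virtual addresses (bits 63..47 must all be equal), and `is_canonical` is a hypothetical helper name, not something from the crash utility.

```python
def is_canonical(addr: int) -> bool:
    """True if bits 63..47 of a 64-bit address are all 0 or all 1.

    Assumes 48-bit virtual addressing.
    """
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> 47  # the 17 uppermost bits
    return top == 0 or top == 0x1FFFF


# Base registers from the oopses; the small displacements (0x78, 0x20)
# do not change the upper bits, so the base value decides the outcome.
for name, addr in [
    ("core1 RAX", 0x4631313532353046),
    ("core2 RBX", 0x4F4C41444E415453),
    ("umount CR2", 0x000010000000025F),
]:
    kind = "canonical" if is_canonical(addr) else "non-canonical"
    print(f"{name}: {addr:016x} is {kind}")
```

Under that assumption the two corrupted registers are non-canonical, whereas the umount CR2 value is canonical but unmapped, which would match the different fault types seen in the logs.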
Latest vmcore available:

You may view it at megatron.gsslab.rdu.redhat.com
Login with kerberos name/password

$ cd /cores/20080122085506/work
/cores/20080122085506/work$ ./crash

Backtrace
KERNEL: /cores/20080122085506/work/vmlinux
DUMPFILE: /cores/20080122085506/work/vmcore.144241.2008-01-19 [PARTIAL DUMP]
CPUS: 8
DATE: Sat Jan 19 02:05:01 2008
UPTIME: 07:49:06
LOAD AVERAGE: 0.02, 0.02, 0.00
TASKS: 458
NODENAME: uswxapstac05f
RELEASE: 2.6.9-42.0.2.ELsmp
VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
MACHINE: x86_64 (2612 Mhz)
MEMORY: 34 GB
PANIC: "Kernel panic - not syncing: Oops"

PID: 13985 TASK: 108760ff7f0 CPU: 1 COMMAND: "egrep"
 #0 [10876afdb90] start_disk_dump at ffffffffa052a36d
 #1 [10876afdbc0] try_crashdump at ffffffff8014bd01
 #2 [10876afdbd0] die at ffffffff80111c00
 #3 [10876afdbf0] do_invalid_op at ffffffff80111fc8
 #4 [10876afdcb0] error_exit at ffffffff80110d91
    [exception RIP: panic+211]
    RIP: ffffffff8013794a  RSP: 0000010876afdd68  RFLAGS: 00010286
    RAX: 0000000000000024  RBX: ffffffff8031e00f  RCX: 0000000000000246
    RDX: 00000000000113e3  RSI: 0000000000000246  RDI: ffffffff803e2000
    RBP: 00000039164b9272   R8: 0000000000000002   R9: ffffffff8031e00f
    R10: 0000000100000000  R11: 0000ffff803fcd00  R12: 000000000000001d
    R13: 0000010876afdf58  R14: 000000000000001d  R15: 0000010678737240
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #5 [10876afdd60] panic at ffffffff80137936
 #6 [10876afde40] oops_end at ffffffff80111b07
 #7 [10876afde50] pgtable_bad at ffffffff80123c8a
 #8 [10876afde70] do_page_fault at ffffffff801242a9
 #9 [10876afdf50] error_exit at ffffffff80110d91
    RIP: 00000039164b9272  RSP: 0000007fbffffbc8  RFLAGS: 00010246
    RAX: 000000000000004f  RBX: 0000000000008000  RCX: 00000039164b9272
    RDX: 0000000000008000  RSI: 00000000006170a1  RDI: 0000000000000000
    RBP: 0000000000000000   R8: 0000000000001000   R9: 0000000000000001
    R10: 0000003916630848  R11: 0000000000000246  R12: 00000000006170a1
    R13: 0000000000514660  R14: 0000000000000000  R15: 0000000000000411
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

...
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
egrep: Corrupted page table at address 39164b9272
PML4 4755d3067 PGD 475def067 PMD 474cb3067 PTE 5720796572666665
Bad pagetable: 001d [1] SMP
CPU 1
...
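The "001d" in "Bad pagetable: 001d" is the x86 page-fault error code (also visible in R12 and R14 of the panic frame). As a rough illustration, the sketch below decodes it using the architectural bit assignments; the helper name `decode_pf_error` and the wording of the descriptions are my own, not kernel or crash output.

```python
# (mask, meaning when set, meaning when clear or None if not reported)
PF_BITS = [
    (0x01, "protection violation (page present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code: int) -> list[str]:
    """Translate an x86 page-fault error code into readable flags."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


for line in decode_pf_error(0x1D):
    print(line)
```

Decoded this way, 0x1d describes a user-mode instruction fetch that hit a present entry with reserved bits set. That is consistent with the egrep RIP equal to the faulting address 39164b9272, with the pgtable_bad frame in the backtrace, and with a PTE value (5720796572666665) that does not look like a valid entry next to the PML4/PGD/PMD values ending in 067.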