429717 – Machine crashing due to memory corruption

Summary: Machine crashing due to memory corruption

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Josef Bacik
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-01-22 17:18 UTC by Flavio Leitner
Modified:	2008-02-28 21:50 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-02-28 21:50:45 UTC
Target Upstream Version:
Embargoed:

Attachments:
messages with 'report_lost_ticks' and NMI errors (1.17 MB, application/octet-stream), 2008-01-22 17:24 UTC, Flavio Leitner

Description Flavio Leitner 2008-01-22 17:18:52 UTC

Description of problem:
They have 6 new boxes and 5 are running okay but one is oopsing frequently.

Version-Release number of selected component (if applicable):

VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
MACHINE: x86_64  (2612 Mhz)
MEMORY: 34 GB
RELEASE: 2.6.9-42.0.2.ELsmp

How reproducible:
Frequently

Steps to Reproduce:
No steps have been identified yet.
  
Actual results:
oopses

Additional info:
* They have 6 new boxes and 5 are running okay but one is oopsing frequently.
* Product Name: ProLiant DL585 G2
* The HW diagnostic tools ran clean, no errors reported.
* memtest86 ran clean, no errors.
* We have sysreports from both the 'bad' and 'good' systems.
* No differences in dmidecode/lspci between the 'bad' and 'good' systems.
* Dual-Core AMD Opteron(tm) Processor 8218 stepping 03
* nVidia Corporation CK804
* 3 vmcores available, 1 oops, NMI messages:
...
Uhhuh. NMI received for unknown reason 20.
Uhhuh. NMI received for unknown reason 20.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
<repeated followed by the oops below>
egrep: Corrupted page table at address 39164b9272
PML4 4755d3067 PGD 475def067 PMD 474cb3067 PTE 5720796572666665
Bad pagetable: 001d [1] SMP
CPU 1
-------------------------------------------------------------------
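For reference, the '001d' in 'Bad pagetable: 001d' is the x86 page-fault error code. A minimal Python sketch (not part of the original analysis; the helper name decode_pf_error is made up for illustration) that decodes it using the architectural bit definitions:

```python
# x86 page-fault error code bits, as defined by the architecture.
PF_FLAGS = {
    0x01: "P: fault on a present page (not a missing page)",
    0x02: "W: access was a write",
    0x04: "U: CPU was in user mode",
    0x08: "RSVD: reserved bit set in a paging-structure entry",
    0x10: "I: access was an instruction fetch",
}


def decode_pf_error(code: int) -> list[str]:
    """Return the descriptions of every flag set in a page-fault error code."""
    return [name for bit, name in PF_FLAGS.items() if code & bit]


# Error code taken from the oops above ("Bad pagetable: 001d").
for flag in decode_pf_error(0x1D):
    print(flag)
```

For 0x1d this lists P, U, RSVD and I: a user-mode instruction fetch that hit a reserved bit in a paging-structure entry, which is consistent with the garbage PTE value printed in the oops.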
In the other two oopses, registers that should hold pointers contain huge numbers.
Core1 backtrace, frame #4 dump:
#4 [104700efd70] error_exit at ffffffff80110d91
   [exception RIP: vfs_getattr+46]
   RIP: ffffffff8018197c  RSP: 00000104700efe28  RFLAGS: 00010246
   RAX: 4631313532353046  RBX: 000001026d888a98  RCX: 0000000000000046
   ^^^^^^^^^^^^^^^^^^^^^
   RDX: 00000104700efef8  RSI: 000001026d888a98  RDI: 0000010478421900
   RBP: 00000104700efef8   R8: 000000000000000f   R9: 0000000000000001

...
0xffffffff80181966 <vfs_getattr+24>:    mov    0x10(%rsi),%r12
0xffffffff8018196a <vfs_getattr+28>:    callq  *0x1a0(%rax)
0xffffffff80181970 <vfs_getattr+34>:    test   %eax,%eax
0xffffffff80181972 <vfs_getattr+36>:    jne    0xffffffff801819e2 <vfs_getattr+148>
0xffffffff80181974 <vfs_getattr+38>:    mov    0xf8(%r12),%rax
0xffffffff8018197c <vfs_getattr+46>:    mov    0x78(%rax),%rax <--crash
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0xffffffff80181980 <vfs_getattr+50>:    test   %rax,%rax

The offset 0x78 is in struct inode_operations:
crash> size -o inode_operations
struct inode_operations {
  [0x0] int (*create)(struct inode *, struct dentry *, int, struct nameidata ....
 [0x78] int (*getattr)(struct vfsmount *, struct dentry *, struct kstat *);

The relevant code was:
int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
...
       if (inode->i_op->getattr)
               return inode->i_op->getattr(mnt, dentry, stat);
...

but the inode->i_op pointer loaded into RAX is corrupted; its value is 4631313532353046.
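A side note on why this shows up as a general protection fault rather than a page fault: on x86_64, a memory access through a non-canonical address raises #GP. A small Python sketch of the check, assuming 48-bit virtual addresses (an assumption about this CPU, not stated in the report); is_canonical is a hypothetical helper:

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of a 64-bit address are all 0 or all 1."""
    top = addr >> (va_bits - 1)                 # sign bit plus all bits above it
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


rax = 0x4631313532353046          # corrupted i_op pointer from the core1 dump
rbx = 0x000001026D888A98          # RBX from the same frame, for comparison
rip = 0xFFFFFFFF8018197C          # faulting RIP (kernel text)

print(is_canonical(rax + 0x78))   # address used by mov 0x78(%rax),%rax
print(is_canonical(rbx))
print(is_canonical(rip))
```

The corrupted RAX plus the 0x78 offset is non-canonical, while RBX and RIP from the same frame are canonical. That fits the do_general_protection frame in the core1 backtrace.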

The corruption also reached the slab metadata:
kmem: ext3_inode_cache: full list: slab: 10872f43100  bad next pointer: 40115cd0b006920
kmem: ext3_inode_cache: full list: slab: 10872f43100  bad prev pointer: 205b020931081811
kmem: ext3_inode_cache: full list: slab: 10872f43100  bad inuse counter: 1094005561
kmem: ext3_inode_cache: full list: slab: 10872f43100  bad inuse counter: 1094005561
kmem: ext3_inode_cache: full list: slab: 10872f43100  bad s_mem pointer: 3734463333423839

On core2 (another backtrace):
#4 [102758639e0] error_exit at ffffffff80110d91
   [exception RIP: __find_get_block_slow+125]
   RIP: ffffffff8017adca  RSP: 0000010275863a98  RFLAGS: 00010206
   RAX: 0000000000000000  RBX: 4f4c41444e415453  RCX: 000001027fb3d3f0
                               ^^^^^^^^^^^^^^^^
Note another huge number in RBX. The faulting instruction was:
0xffffffff8017adca <__find_get_block_slow+125>: cmp    %r12,0x20(%rbx)
so RBX should hold a pointer here as well. This is almost the same
symptom as in core1, but no slabs were affected by this corruption.
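
One thing that stands out about several of the bad values is that every byte falls in the printable ASCII range. A minimal Python sketch (my own check, not something produced by the crash utility; as_ascii is a hypothetical helper) that reinterprets them as little-endian byte strings, the way they would sit in memory on x86_64:

```python
def as_ascii(value: int, width: int = 8) -> str:
    """Render an integer as the ASCII text of its little-endian bytes."""
    raw = value.to_bytes(width, "little")
    # Replace anything outside printable ASCII with '.' so the result is safe to print.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# Values copied from the oopses and the kmem output in this report.
samples = [
    ("core1 RAX", 0x4631313532353046, 8),
    ("core2 RBX", 0x4F4C41444E415453, 8),
    ("bad PTE", 0x5720796572666665, 8),
    ("slab s_mem", 0x3734463333423839, 8),
    ("slab inuse", 1094005561, 4),       # 32-bit counter, decimal in the kmem output
]

for name, value, width in samples:
    print(f"{name:<11} {value:#x} -> {as_ascii(value, width)!r}")
```

If that reading is right, these values are fragments of text (hex-digit strings, 'STANDALO', 'effrey W') rather than random bit flips, which may suggest a buffer of text being written over kernel structures. This is only an observation, not a root cause. The slab next/prev pointers do not decode as text.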

On another oops:
Losing some ticks... checking if CPU frequency changed.
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
general protection fault: 0000 [1] SMP 

I thought it could be timer related, so I passed 'report_lost_ticks',
which shows that it lost 1 tick:
time.c: Lost 1 timer tick(s)! rip __do_softirq+0x4d/0xd0)
<repeatedly>

Comment 3 Flavio Leitner 2008-01-22 17:24:20 UTC

Created attachment 292540
messages with 'report_lost_ticks' and NMI errors

Comment 4 Flavio Leitner 2008-01-22 17:27:35 UTC

full oops copied below
--------------------------------------------------------------------------
Will boot with that param now. Just rebooting caused an issue.

Turning off quotas:
Unmounting pipe file systems:
Unmounting file systems:  Unable to handle kernel paging request at
000010000000025f RIP:
<ffffffff80159d5c>{page_waitqueue+70}
PML4 0
Oops: 0000 [1] SMP
CPU 2
Modules linked in: mptctl sg autofs4 i2c_dev i2c_core sunrpc ide_dump cciss_dump
scsi_dump diskdump zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U)
fdd(U) vxportal(U) vxfs(U) dm_mod button battery ac joydev ohci_hcd ehci_hcd
uhci_hcd e1000 bnx2 ext3 jbd lpfc scsi_transport_fc mptsas cciss mptspi mptscsi
mptbase sd_mod scsi_mod
Pid: 21609, comm: umount Tainted: PF     2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff80159d5c>] <ffffffff80159d5c>{page_waitqueue+70}
RSP: 0018:000001027356dd40  EFLAGS: 00010206
RAX: 512e10c73baf5018 RBX: 000001047c0f5018 RCX: 0000000000000040
RDX: 00000fffffffffff RSI: 000001047c0f5018 RDI: 0000000000000000
RBP: 430140240000a208 R08: 0000000000000004 R09: 000001027356dc28
R10: 000001047c0f5050 R11: 0000000000000000 R12: 000000000000000c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000010473b38360
FS:  0000002a95562b00(0000) GS:ffffffff804e5280(0000) knlGS:00000000f61edbb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000010000000025f CR3: 0000000008020000 CR4: 00000000000006e0
Process umount (pid: 21609, threadinfo 000001027356c000, task 0000010116459030)
Stack: ffffffff80159d7a 000001047c0f5018 ffffffff80164144 000000000000000e
      0000000000000000 000001047c0f4d78 000001047c0f4db0 000001047c0f4de8
      000001047c0f4e20 000001047c0f4e58
Call Trace:<ffffffff80159d7a>{wake_up_page+9}
<ffffffff80164144>{truncate_inode_pages+142}
      <ffffffff80191860>{dispose_list+76} <ffffffff80191be3>{invalidate_inodes+177}
      <ffffffff8017eb2d>{generic_shutdown_super+162}
<ffffffff8017f979>{kill_block_super+19}
      <ffffffff8017ea72>{deactivate_super+95} <ffffffff8019429b>{sys_umount+925}
      <ffffffff80181d4c>{sys_newstat+17} <ffffffff80110d91>{error_exit+0}
      <ffffffff8011026a>{system_call+126}

Code: 2b 8a 60 02 00 00 48 d3 e8 48 6b c0 18 48 03 82 50 02 00 00
RIP <ffffffff80159d5c>{page_waitqueue+70} RSP <000001027356dd40>
CR2: 000010000000025f
CPU frozen: #0#1#3#4#5#6#7
CPU#2 is executing diskdump.
start dumping to cciss/c0d0p3
check dump partition...
dumping memory(partial dump with dump_level 19)..
539103(315254 skipped)/8388078    669 ETA \
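
A quick consistency check on this oops, as a hedged Python sketch: the first instruction in the Code: line (2b 8a 60 02 00 00) appears to decode as 'sub 0x260(%rdx),%ecx' (my reading of the bytes, not stated in the report), so the faulting address should be RDX + 0x260. The function name is hypothetical:

```python
def effective_address(base: int, displacement: int) -> int:
    """Effective address of a base+displacement operand, truncated to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


rdx = 0x00000FFFFFFFFFFF   # RDX from the register dump in the oops
cr2 = 0x000010000000025F   # faulting address reported in CR2
disp = 0x260               # displacement, ASSUMED from decoding "2b 8a 60 02 00 00"

addr = effective_address(rdx, disp)
print(f"RDX + {disp:#x} = {addr:#018x}")
print("matches CR2:", addr == cr2)
```

The sum matches CR2, so the bad address comes from RDX (00000fffffffffff), which does not look like a valid kernel pointer.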

Comment 5 Flavio Leitner 2008-01-22 17:29:30 UTC

core1 is here:
seg.rdu.redhat.com:/export/nfs/awashbro/144241/010608/144241.vmcore.010608
vmlinux is also on seg: /usr/lib/debug/lib/modules/2.6.9-42.0.2.ELsmp/vmlinux

    KERNEL: /usr/lib/debug/lib/modules/2.6.9-42.0.2.ELsmp/vmlinux
   DUMPFILE: 144241.vmcore.010608  [PARTIAL DUMP]
       CPUS: 8
       DATE: Sun Jan  6 05:38:50 2008
     UPTIME: 01:11:34
LOAD AVERAGE: 0.12, 0.05, 0.01
      TASKS: 456
   NODENAME: uswxapstac05f
    RELEASE: 2.6.9-42.0.2.ELsmp
    VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
    MACHINE: x86_64  (2612 Mhz)
     MEMORY: 34 GB
      PANIC: ""
        PID: 572
    COMMAND: "kjournald"
       TASK: 104800ad030  [THREAD_INFO: 10275862000]
        CPU: 4
      STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 572    TASK: 104800ad030       CPU: 4   COMMAND: "kjournald"
#0 [10275863940] start_disk_dump at ffffffffa052736d
#1 [10275863970] try_crashdump at ffffffff8014bd01
#2 [10275863980] die at ffffffff80111c00
#3 [102758639a0] do_general_protection at ffffffff801124e5
#4 [102758639e0] error_exit at ffffffff80110d91
   [exception RIP: __find_get_block_slow+125]
   RIP: ffffffff8017adca  RSP: 0000010275863a98  RFLAGS: 00010206
   RAX: 0000000000000000  RBX: 4f4c41444e415453  RCX: 000001027fb3d3f0
   RDX: 000001066a6212f0  RSI: 0000000000001626  RDI: 000001028003b5a0
   RBP: 000001027fb3d3f0   R8: 0000000000000008   R9: 000001000800d080
   R10: 0000000000001000  R11: 0000000000000000  R12: 0000000000001626
   R13: 000001028003b400  R14: 000001028003b520  R15: 0000000000000000
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#5 [10275863a90] __find_get_block_slow at ffffffff8017ada3
#6 [10275863ad0] __find_get_block at ffffffff8017b52e
#7 [10275863b60] __getblk at ffffffff8017d8fe
#8 [10275863b90] journal_get_descriptor_buffer at ffffffffa00a1436
#9 [10275863bb0] journal_commit_transaction at ffffffffa009cd59
#10 [10275863e80] kjournald at ffffffffa009f914
#11 [10275863f50] kernel_thread at ffffffff80110f47

lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
lpfcdfc: 0:1608 libdfc get rev Data: x50 xa3
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
general protection fault: 0000 [1] SMP
CPU 4
Modules linked in: cpqci(U) mptctl sg ipmi_devintf ipmi_si ipmi_msghandler
autofs4 i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump
zlib_deflate lpfcdfc dmpaa(U) vxspec(U) vxio(U) vxdmp(U) fdd(U) vxportal(U)
vxfs(U) dm_mod button battery ac joydev ohci_hcd ehci_hcd uhci_hcd e1000 bnx2
ext3 jbd lpfc scsi_transport_fc mptsas cciss mptspi mptscsi mptbase sd_mod
scsi_mod
Pid: 572, comm: kjournald Tainted: PF     2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff8017adca>] <ffffffff8017adca>{__find_get_block_slow+125}
RSP: 0018:0000010275863a98  EFLAGS: 00010206
RAX: 0000000000000000 RBX: 4f4c41444e415453 RCX: 000001027fb3d3f0
RDX: 000001066a6212f0 RSI: 0000000000001626 RDI: 000001028003b5a0
RBP: 000001027fb3d3f0 R08: 0000000000000008 R09: 000001000800d080
R10: 0000000000001000 R11: 0000000000000000 R12: 0000000000001626
R13: 000001028003b400 R14: 000001028003b520 R15: 0000000000000000
FS:  0000002a95562b00(0000) GS:ffffffff804e5380(0000) knlGS:00000000ebc5cbb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000007fbffff836 CR3: 0000000878f98000 CR4: 00000000000006e0
Process kjournald (pid: 572, threadinfo 0000010275862000, task 00000104800ad030)
Stack: 0000000000000000 0000000000000000 0000000000001000 0000000000001626
      000001028003b340 0000010277336a00 0000000000000000 ffffffff8017b52e
      0000010277336a00 0000000000000216
Call Trace:<ffffffff8017b52e>{__find_get_block+162}
<ffffffff8030ab37>{io_schedule+38}
      <ffffffff8017d8fe>{__getblk+20}
<ffffffffa00a1436>{:jbd:journal_get_descriptor_buffer+43}
      <ffffffffa009cd59>{:jbd:journal_commit_transaction+1669}
      <ffffffff8030a1c1>{thread_return+0} <ffffffff8030a219>{thread_return+88}
      <ffffffffa009f914>{:jbd:kjournald+250}
<ffffffff80135752>{autoremove_wake_function+0}
      <ffffffff80135752>{autoremove_wake_function+0}
<ffffffffa009f814>{:jbd:commit_timeout+0}
      <ffffffff80110f47>{child_rip+8} <ffffffffa009f81a>{:jbd:kjournald+0}
      <ffffffff80110f3f>{child_rip+0}

Code: 4c 39 63 20 75 0a 48 89 1c 24 f0 ff 43 18 eb 5e 8b 03 48 8b
RIP <ffffffff8017adca>{__find_get_block_slow+125} RSP <0000010275863a98>

Comment 6 Flavio Leitner 2008-01-22 17:34:35 UTC

The above was actually core2; core1 is below:
  1. Provide core file (if one is involved) and state:
         * Location: dl585.gsslab.rdu.redhat.com:/work/samfw/144241
         * Access info: root/redhat
         * Backtrace output from the core file


Backtrace
     KERNEL: /cores/20080103142241/work/vmlinux
   DUMPFILE: /cores/20080103142241/work/RH144241.vmcore  [PARTIAL DUMP]
       CPUS: 8
       DATE: Mon Dec 31 18:11:27 2007
     UPTIME: 1 days, 10:08:22
LOAD AVERAGE: 0.00, 0.00, 0.00
      TASKS: 399
   NODENAME: uswxapstac05f
    RELEASE: 2.6.9-42.0.2.ELsmp
    VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
    MACHINE: x86_64  (2612 Mhz)
     MEMORY: 34 GB
      PANIC: ""

PID: 13020  TASK: 102741a0030       CPU: 1   COMMAND: "save"
#0 [104700efcd0] start_disk_dump at ffffffffa052d36d
#1 [104700efd00] try_crashdump at ffffffff8014bd01
#2 [104700efd10] die at ffffffff80111c00
#3 [104700efd30] do_general_protection at ffffffff801124e5
#4 [104700efd70] error_exit at ffffffff80110d91
   [exception RIP: vfs_getattr+46]
   RIP: ffffffff8018197c  RSP: 00000104700efe28  RFLAGS: 00010246
   RAX: 4631313532353046  RBX: 000001026d888a98  RCX: 0000000000000046
   RDX: 00000104700efef8  RSI: 000001026d888a98  RDI: 0000010478421900
   RBP: 00000104700efef8   R8: 000000000000000f   R9: 0000000000000001
   R10: 0000000000000001  R11: ffffffff801cec14  R12: 000001066a9d7648
   R13: 0000010478421900  R14: 000000004049deb1  R15: 000000004049de10
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#5 [104700efe20] vfs_getattr at ffffffff80181970
#6 [104700efe50] vfs_lstat at ffffffff80181a5a
#7 [104700efef0] sys_newlstat at ffffffff80181d75
#8 [104700eff80] system_call at ffffffff8011026a
   RIP: 00000039164b8d25  RSP: 0000007fbffe4480  RFLAGS: 00000293
   RAX: 0000000000000006  RBX: ffffffff8011026a  RCX: 000000004049b4f0
   RDX: 0000007fbffe43c0  RSI: 0000007fbffe43c0  RDI: 0000000040478e50
   RBP: 00000000404be0e0   R8: 0000000000000001   R9: 000000004023bf00
   R10: 0000000000000001  R11: 0000000000000246  R12: 000000004049deb8
   R13: 0000007fbffe30c0  R14: 000000004026f7d0  R15: 0000000000000008
   ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

Comment 7 Flavio Leitner 2008-01-22 17:38:56 UTC

Latest vmcore available:
You may view it at megatron.gsslab.rdu.redhat.com
Log in with your Kerberos name/password
$ cd /cores/20080122085506/work
/cores/20080122085506/work$ ./crash

Backtrace
     KERNEL: /cores/20080122085506/work/vmlinux
   DUMPFILE: /cores/20080122085506/work/vmcore.144241.2008-01-19  [PARTIAL DUMP]
       CPUS: 8
       DATE: Sat Jan 19 02:05:01 2008
     UPTIME: 07:49:06
LOAD AVERAGE: 0.02, 0.02, 0.00
      TASKS: 458
   NODENAME: uswxapstac05f
    RELEASE: 2.6.9-42.0.2.ELsmp
    VERSION: #1 SMP Thu Aug 17 17:57:31 EDT 2006
    MACHINE: x86_64  (2612 Mhz)
     MEMORY: 34 GB
      PANIC: "Kernel panic - not syncing: Oops"

PID: 13985  TASK: 108760ff7f0       CPU: 1   COMMAND: "egrep"
#0 [10876afdb90] start_disk_dump at ffffffffa052a36d
#1 [10876afdbc0] try_crashdump at ffffffff8014bd01
#2 [10876afdbd0] die at ffffffff80111c00
#3 [10876afdbf0] do_invalid_op at ffffffff80111fc8
#4 [10876afdcb0] error_exit at ffffffff80110d91
   [exception RIP: panic+211]
   RIP: ffffffff8013794a  RSP: 0000010876afdd68  RFLAGS: 00010286
   RAX: 0000000000000024  RBX: ffffffff8031e00f  RCX: 0000000000000246
   RDX: 00000000000113e3  RSI: 0000000000000246  RDI: ffffffff803e2000
   RBP: 00000039164b9272   R8: 0000000000000002   R9: ffffffff8031e00f
   R10: 0000000100000000  R11: 0000ffff803fcd00  R12: 000000000000001d
   R13: 0000010876afdf58  R14: 000000000000001d  R15: 0000010678737240
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#5 [10876afdd60] panic at ffffffff80137936
#6 [10876afde40] oops_end at ffffffff80111b07
#7 [10876afde50] pgtable_bad at ffffffff80123c8a
#8 [10876afde70] do_page_fault at ffffffff801242a9
#9 [10876afdf50] error_exit at ffffffff80110d91
   RIP: 00000039164b9272  RSP: 0000007fbffffbc8  RFLAGS: 00010246
   RAX: 000000000000004f  RBX: 0000000000008000  RCX: 00000039164b9272
   RDX: 0000000000008000  RSI: 00000000006170a1  RDI: 0000000000000000
   RBP: 0000000000000000   R8: 0000000000001000   R9: 0000000000000001
   R10: 0000003916630848  R11: 0000000000000246  R12: 00000000006170a1
   R13: 0000000000514660  R14: 0000000000000000  R15: 0000000000000411
   ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

...
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
egrep: Corrupted page table at address 39164b9272
PML4 4755d3067 PGD 475def067 PMD 474cb3067 PTE 5720796572666665
Bad pagetable: 001d [1] SMP
CPU 1 
...
