Description of problem:

The customer is migrating 6 hosts from Solaris 10 x86_64 to RHEL 5.3 x86_64 + Xen and is getting random reboots. The hardware is the Sun Blade Server Module X6220, which is certified by Red Hat. The same customer has other machines running the same hardware and OS just fine.

Hosts rebooting:
============================
ussp-pb29 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb20 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb22 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb07 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb14 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb13 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114

Hosts running just fine:
============================
ussp-pb08 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb12 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb32 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb35 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb01 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 080012
ussp-pb10 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT106

Analysis of the first ussp-pb20 vmcore:
=======================================
The crash happened in functions heavily exercised by the kernel across all subsystems (memset() called from mempool_alloc()), and we have not found any known issue with them so far. Also, crash's kmem command shows the faulting address as valid and mapped, so the page fault should not have been possible.

Here are some notes:
=========================

#crash> log
<snip>
Unable to handle kernel paging request at ffff88032d3efbc0
RIP: [<ffffffff80261012>] __memset+0x36/0xc0
PGD 4da0067 PUD 65ad067 PMD 6717067 PTE 0

#crash> kmem ffff88032d3efbc0
CACHE            NAME            OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff8807b1acc2c0 nfs_write_data      832         95    144     16     8k
  SLAB              MEMORY            TOTAL  ALLOCATED  FREE
  ffff88032d3ee140  ffff88032d3ee1c0      9          1     8
  FREE / [ALLOCATED]
  [ffff88032d3efbc0]
      PAGE          PHYSICAL   MAPPING  INDEX    CNT  FLAGS
ffff880013e9ac48    32d3ef000  0        35d9f52  0    80

Based on crash, the address ffff88032d3efbc0 is valid, so this page fault should not have happened.

Source code:

struct nfs_write_data *nfs_commit_alloc(void)
{
	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, SLAB_NOFS);

	if (p) {
		memset(p, 0, sizeof(*p));
		INIT_LIST_HEAD(&p->pages);
	}
	return p;
}

Analysis of the last two vmcores provided for ussp-pb29:
========================================================
The latest vmcores crashed in different, random places. The crashes happened in code paths heavily used by the kernel, since they are generic memory helpers. This is consistent with the earlier analysis and points to a hardware-level failure, which could be a misconfigured BIOS, a bug in the BIOS/firmware, or even a missing/incorrect parameter needed to better support this hardware.
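Before the ussp-pb29 oops dumps below, a side note on the "PGD ... PUD ... PMD ... PTE 0" lines that appear in these oopses: that line is the fault handler's 4-level page-table walk for the faulting address, and a zero PTE means there was no present leaf mapping at fault time, even though crash reports the slab object as allocated. The following user-space model only illustrates that walk and the inconsistency it exposes; the index math is simplified and the names are not the RHEL 5 fault-handler code.

/* Hedged sketch: how a "PGD ... PUD ... PMD ... PTE 0" line comes about.
 * Each level is modeled as a 512-entry table; the real kernel walks
 * physical addresses with flag bits instead of plain pointers. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 512                               /* 9 address bits per level */
#define IDX(va, shift) (((va) >> (shift)) & 0x1ffULL)

typedef uint64_t entry_t;                         /* stand-in for a page-table entry */

static entry_t *new_table(void) { return calloc(ENTRIES, sizeof(entry_t)); }

int main(void)
{
	uint64_t va = 0xffff88032d3efbc0ULL;      /* faulting address from the oops */

	entry_t *pgd = new_table();
	entry_t *pud = new_table();
	entry_t *pmd = new_table();
	entry_t *pte = new_table();               /* leaf level left all-zero on purpose */

	/* The upper levels are linked (non-zero entries, as in the vmcore)... */
	pgd[IDX(va, 39)] = (uint64_t)(uintptr_t)pud;
	pud[IDX(va, 30)] = (uint64_t)(uintptr_t)pmd;
	pmd[IDX(va, 21)] = (uint64_t)(uintptr_t)pte;

	/* ...but the PTE is zero: the walk finds no present mapping for va,
	 * which is exactly what the oops reports even though kmem shows the
	 * slab object as allocated -- the inconsistency behind the HW suspicion. */
	printf("PTE = %#llx\n", (unsigned long long)pte[IDX(va, 12)]);
	return 0;
}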
----------------------------
Unable to handle kernel paging request at ffff8801a0ac9000
RIP: [<ffffffff80260bb9>] copy_page+0x4d/0xe4
PGD 4e37067 PUD 5a3e067 PMD 5b44067 PTE 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/irq
CPU 1
Modules linked in: sr_mod cdrom usb_storage xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport joydev k8_edac k8temp hwmon edac_mc serial_core serio_raw forcedeth i2c_nforce2 i2c_core pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4643, comm: dhclient-script Not tainted 2.6.18-128.el5xen #1
RIP: e030:[<ffffffff80260bb9>]  [<ffffffff80260bb9>] copy_page+0x4d/0xe4
RSP: e02b:ffff8801a04edd48  EFLAGS: 00010206
RAX: 0000000000445c20 RBX: 0000000000445c20 RCX: 000000000000003a
RDX: 0000000000445c20 RSI: ffff88019f958000 RDI: ffff8801a0ac9000
RBP: ffff8807bbef8cc0 R08: 0000000000445c20 R09: 0000000000445c20
R10: 0000000000445c20 R11: 0000000000445c20 R12: 0000000000445c20
R13: 00000000006c0648 R14: ffff88000e871bf8 R15: ffff88019e5af600
FS:  00002b155218cdc0(0000) GS:ffffffff805ba080(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process dhclient-script (pid: 4643, threadinfo ffff8801a04ec000, task ffff8807abc91080)
Stack:  0000000000000000 ffff88000e834b40 00000000006c0648 ffffffff8021181d
 ffff88019e5ae018 ffff8807baf7fa68 ffff8807bbef8c40 ffff8801a04ede2c
 ffff8807ac5b2090 ffff8807bbef8cc0
Call Trace:
 [<ffffffff8021181d>] do_wp_page+0x3ba/0x6a3
 [<ffffffff80209ac4>] __handle_mm_fault+0x114b/0x11f6
 [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff802666ef>] do_page_fault+0xf7b/0x12e0
 [<ffffffff8025f82b>] error_exit+0x0/0x6e
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80228f5f>] do_sigaction+0x189/0x19d
 [<ffffffff8025f82b>] error_exit+0x0/0x6e

Code: 48 89 07 48 89 5f 08 48 89 57 10 4c 89 47 18 4c 89 4f 20 4c
RIP  [<ffffffff80260bb9>] copy_page+0x4d/0xe4
 RSP <ffff8801a04edd48>
-----------
Unable to handle kernel paging request at ffff88013660df58
RIP: [<ffffffff80260f19>] __memcpy+0x15/0xac
PGD 4e37067 PUD 563c067 PMD 57f0067 PTE 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/irq
CPU 2
Modules linked in: xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport joydev k8_edac serio_raw i2c_nforce2 i2c_core edac_mc serial_core forcedeth k8temp hwmon sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3224, comm: automount Not tainted 2.6.18-128.el5xen #1
RIP: e030:[<ffffffff80260f19>]  [<ffffffff80260f19>] __memcpy+0x15/0xac
RSP: e02b:ffff8807b1929de8  EFLAGS: 00010203
RAX: ffff88013660df58 RBX: ffff88013660c000 RCX: 0000000000000001
RDX: 00000000000000a8 RSI: ffff8807b1929f58 RDI: ffff88013660df58
RBP: ffff8807bddf4820 R08: 0000000000000000 R09: ffff8807b1929f58
R10: 0000000000010800 R11: 0000000000001000 R12: ffff88013660df58
R13: ffff8807bbd197a0 R14: 0000000040cdb250 R15: 00000000003d0f00
FS:  00002b5b4219c540(0063) GS:ffffffff805ba100(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process automount (pid: 3224, threadinfo ffff8807b1928000, task ffff8807bbd197a0)
Stack:  ffff88013660c000 ffffffff8022b2db ffff8807ba0620c0 ffff8807ba062378
 0000000000000000 ffff8807bddf4820 ffff8807bd6574c0 0000000040cdb9d0
 0000000000010800 ffffffff80220251
Call Trace:
 [<ffffffff8022b2db>] copy_thread+0x3b/0x18e
 [<ffffffff80220251>] copy_process+0x13b8/0x1a48
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80297d03>] alloc_pid+0x26c/0x292
 [<ffffffff80231fcf>] do_fork+0x69/0x1c1
 [<ffffffff8025f2f9>] tracesys+0xab/0xb6
 [<ffffffff8025f519>] ptregscall_common+0x3d/0x64
------------------

Also, we noticed the following in the dmesg output (see the sketch after the additional info below for what this check means):

#crash> log
<snip>
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: Using configuration type 1
<snip>

Additional info
============================
- Last Saturday, 10/15/2010, the customer replaced the 32 GB of RAM in host ussp-pb14, and since that date we have had no report of a reboot for this machine. Also, the customer was not able to find any issue with the RAM itself.
- Another initial observation was that the blades exhibiting problems appear to have over 18 GB assigned to a Xen virtual machine. To counter this, the customer has identified two servers (ussp-pb01 and ussp-pb10) that have Xen virtual machines of over 20 GB and have been up for over 6 months.
- Hardware certified at: https://hardware.redhat.com/show.cgi?id=244700
- Related Salesforce cases:
  - https://c.na7.visual.force.com/apex/Case_View?id=500A00000045ER8&sfdc.override=1
  - https://c.na7.visual.force.com/apex/Case_View?id=500A00000043M2F&sfdc.override=1
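Regarding the "PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved" message: the kernel only trusts MMCONFIG when the whole MCFG window lies inside an E820 reserved range, and otherwise falls back to configuration type 1, as seen in the log above. A rough, self-contained model of that check is below; the map entries are made up for illustration, and this is not the actual arch/x86_64 implementation.

/* Hedged sketch of the check behind "MCFG area ... is not E820-reserved". */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct e820_entry { uint64_t start, size; int type; };
#define E820_RESERVED 2

/* True if [start, start+size) is fully covered by one reserved E820 entry. */
static bool range_is_reserved(const struct e820_entry *map, int n,
                              uint64_t start, uint64_t size)
{
	for (int i = 0; i < n; i++)
		if (map[i].type == E820_RESERVED &&
		    map[i].start <= start &&
		    start + size <= map[i].start + map[i].size)
			return true;
	return false;
}

int main(void)
{
	/* Example map where the 256 MB MCFG window at 0xe0000000 is NOT reserved */
	struct e820_entry map[] = {
		{ 0x00000000, 0x0009f000, 1 },   /* usable   */
		{ 0x00100000, 0xdff00000, 1 },   /* usable   */
		{ 0xfec00000, 0x01400000, 2 },   /* reserved */
	};
	uint64_t mcfg = 0xe0000000, len = 256ULL << 20;

	if (!range_is_reserved(map, 3, mcfg, len))
		printf("MCFG area at %llx is not E820-reserved -> not using MMCONFIG\n",
		       (unsigned long long)mcfg);
	return 0;
}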
The core files are available at:

host ussp-pb20:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101013074537/work
$ ./crash

2nd core available:
$ cd /cores/20101019105733/work
$ ./crash

host ussp-pb29:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101018111357/work
$ ./crash

2nd core available:
$ cd /cores/20101018105514/work
$ ./crash

host ussp-pb07:
================================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

Core available:
$ cd /cores/20101014095555/work
$ ./crash
Thanks. The evidence is quite strong; the only problem I have is that I don't see how update_va_mapping could return ENOMEM on either the 5.3 or more recent hypervisors. I'll prepare a custom kernel that BUGs on errors from the individual hypercalls. The three error messages at startup are always there on 5.3; I think they were fixed in 5.4.
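For illustration, the instrumentation would look roughly like the wrapper below, which turns a non-zero return from the hypercall into an immediate BUG() so a vmcore is captured at the earliest point of failure. This is a minimal sketch assuming the 2.6.18 Xen dom0 HYPERVISOR_update_va_mapping() interface; the actual test-kernel patch may differ.

/* Hedged sketch of "BUG on errors from the individual hypercalls". */
#include <linux/kernel.h>
#include <asm/hypervisor.h>

static inline void checked_update_va_mapping(unsigned long va,
                                             pte_t new_val,
                                             unsigned long flags)
{
	int rc = HYPERVISOR_update_va_mapping(va, new_val, flags);

	/* -ENOMEM / -EINVAL here means the hypervisor rejected the mapping;
	 * crash immediately instead of letting the corruption propagate. */
	if (rc)
		BUG();
}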
> I am confused by the test kernel version with respect to the
> content of the RPMs, because ...
>
> - The most recent change log entry is only 2.6.18-8:
>
> - The list of patches that I see in the 'kernel-2.6.spec' file looks much
>   different from what I see, for example, in a spec file from a 2.6.18-128
>   source RPM.
>
> Could you please clarify?

There are two sources of these differences:

1) I used "make rh-srpm" on the kernel git repository to build the SRPM, not dist-cvs. I didn't know that it created such a different list of patches.

2) The hypervisor is 5.6-based even for the -128 kernel. This was not intended; if desired, the customer can keep using the stock -128 hypervisor, since there is no debug output there.

---

Thanks for double-checking the -ENOMEM vs. -EINVAL value. It really looks like some paging data structure is corrupted (I don't think it is the hypervisor's fault; it seems more likely to be the dom0 kernel).

At this point, I suggest that the customer try the BUG_ON version of the -228 test kernel (which has a WARN_ON) on some machines, and the -128 BUG_ON test kernel on others. The former will tell us whether the bug has been fixed; the latter will hopefully provide some hints about the corruption earlier, though likely a bit after it has happened.

If the machines are attached to a serial console, it can be useful to capture the hypervisor's error output from there, since those messages are lost by the time the sosreport is generated. Add the following to the hypervisor boot options: "com1=115200,8n1 guest_loglvl=9".
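For example, on a RHEL 5 Xen host those options go on the hypervisor (xen.gz) line in /boot/grub/grub.conf, roughly as below. The kernel version and root device are placeholders, and the console=com1, loglvl=all, and console=ttyS0 additions are optional extras I would suggest for getting hypervisor and dom0 output on the same serial line; only com1=115200,8n1 and guest_loglvl=9 are the requested change.

title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.el5 com1=115200,8n1 console=com1 guest_loglvl=9 loglvl=all
        module /vmlinuz-2.6.18-128.el5xen ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200
        module /initrd-2.6.18-128.el5xen.img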
There are residual issues in bug 666453, but this part was a dup. *** This bug has been marked as a duplicate of bug 479754 ***