Description of problem:

The customer is migrating 6 hosts from Solaris 10 x86_64 to RHEL 5.3 x86_64 + Xen and is getting random reboots. The hardware is the Sun Blade Server Module X6220, which is certified by Red Hat. The same customer has other machines running the same hardware and OS just fine.

Hosts rebooting:
============================
ussp-pb29 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb20 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb22 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb07 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb14 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb13 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114

Hosts running just fine:
============================
ussp-pb08 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb12 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT114
ussp-pb32 - 2.6.18-128.el5xen (SMP)     - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb35 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT110
ussp-pb01 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 080012
ussp-pb10 - 2.6.18-128.1.1.el5xen (SMP) - BIOS: American Megatrends Inc. Version 0ABJT106

Analysis of the first ussp-pb20 vmcore:
=======================================
The crash happened in functions heavily exercised by the kernel across all subsystems (memset() called from mempool_alloc()), and we have not found any known issue with them so far. Also, crash's kmem command shows the faulting address as valid and mapped, so the page fault should not have been possible.

Here are some notes:
=========================

#crash> log
<snip>
Unable to handle kernel paging request at ffff88032d3efbc0
RIP: [<ffffffff80261012>] __memset+0x36/0xc0
PGD 4da0067 PUD 65ad067 PMD 6717067 PTE 0

#crash> kmem ffff88032d3efbc0
CACHE            NAME            OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff8807b1acc2c0 nfs_write_data      832         95    144     16     8k
  SLAB              MEMORY            TOTAL  ALLOCATED  FREE
  ffff88032d3ee140  ffff88032d3ee1c0      9          1     8
  FREE / [ALLOCATED]
  [ffff88032d3efbc0]
      PAGE          PHYSICAL   MAPPING  INDEX    CNT  FLAGS
ffff880013e9ac48    32d3ef000  0        35d9f52  0    80

Based on crash, the address ffff88032d3efbc0 is valid, so this page fault should not have happened.

Source code:

struct nfs_write_data *nfs_commit_alloc(void)
{
	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, SLAB_NOFS);

	if (p) {
		memset(p, 0, sizeof(*p));
		INIT_LIST_HEAD(&p->pages);
	}
	return p;
}

Analysis of the last two vmcores provided for ussp-pb29:
========================================================
The latest vmcores crashed in different, random places. The crashes happened in code paths heavily used by the kernel, since they are generic memory helpers. This is consistent with the earlier analysis and points to a hardware-level failure, which could be a misconfigured BIOS, a bug in the BIOS/firmware, or even a missing/incorrect parameter needed to better support this hardware.
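Before the ussp-pb29 oops dumps below, a side note on the "PGD ... PUD ... PMD ... PTE 0" lines that appear in these oopses: that line is the fault handler's 4-level page-table walk for the faulting address, and a zero PTE means there was no present leaf mapping at fault time, even though crash reports the slab object as allocated. The following user-space model only illustrates that walk and the inconsistency it exposes; the index math is simplified and the names are not the RHEL 5 fault-handler code.

/* Hedged sketch: how a "PGD ... PUD ... PMD ... PTE 0" line comes about.
 * Each level is modeled as a 512-entry table; the real kernel walks
 * physical addresses with flag bits instead of plain pointers. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 512                               /* 9 address bits per level */
#define IDX(va, shift) (((va) >> (shift)) & 0x1ffULL)

typedef uint64_t entry_t;                         /* stand-in for a page-table entry */

static entry_t *new_table(void) { return calloc(ENTRIES, sizeof(entry_t)); }

int main(void)
{
	uint64_t va = 0xffff88032d3efbc0ULL;      /* faulting address from the oops */

	entry_t *pgd = new_table();
	entry_t *pud = new_table();
	entry_t *pmd = new_table();
	entry_t *pte = new_table();               /* leaf level left all-zero on purpose */

	/* The upper levels are linked (non-zero entries, as in the vmcore)... */
	pgd[IDX(va, 39)] = (uint64_t)(uintptr_t)pud;
	pud[IDX(va, 30)] = (uint64_t)(uintptr_t)pmd;
	pmd[IDX(va, 21)] = (uint64_t)(uintptr_t)pte;

	/* ...but the PTE is zero: the walk finds no present mapping for va,
	 * which is exactly what the oops reports even though kmem shows the
	 * slab object as allocated -- the inconsistency behind the HW suspicion. */
	printf("PTE = %#llx\n", (unsigned long long)pte[IDX(va, 12)]);
	return 0;
}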
----------------------------
Unable to handle kernel paging request at ffff8801a0ac9000
RIP: [<ffffffff80260bb9>] copy_page+0x4d/0xe4
PGD 4e37067 PUD 5a3e067 PMD 5b44067 PTE 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/irq
CPU 1
Modules linked in: sr_mod cdrom usb_storage xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport joydev k8_edac k8temp hwmon edac_mc serial_core serio_raw forcedeth i2c_nforce2 i2c_core pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4643, comm: dhclient-script Not tainted 2.6.18-128.el5xen #1
RIP: e030:[<ffffffff80260bb9>]  [<ffffffff80260bb9>] copy_page+0x4d/0xe4
RSP: e02b:ffff8801a04edd48  EFLAGS: 00010206
RAX: 0000000000445c20 RBX: 0000000000445c20 RCX: 000000000000003a
RDX: 0000000000445c20 RSI: ffff88019f958000 RDI: ffff8801a0ac9000
RBP: ffff8807bbef8cc0 R08: 0000000000445c20 R09: 0000000000445c20
R10: 0000000000445c20 R11: 0000000000445c20 R12: 0000000000445c20
R13: 00000000006c0648 R14: ffff88000e871bf8 R15: ffff88019e5af600
FS:  00002b155218cdc0(0000) GS:ffffffff805ba080(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process dhclient-script (pid: 4643, threadinfo ffff8801a04ec000, task ffff8807abc91080)
Stack:  0000000000000000 ffff88000e834b40 00000000006c0648 ffffffff8021181d
 ffff88019e5ae018 ffff8807baf7fa68 ffff8807bbef8c40 ffff8801a04ede2c
 ffff8807ac5b2090 ffff8807bbef8cc0
Call Trace:
 [<ffffffff8021181d>] do_wp_page+0x3ba/0x6a3
 [<ffffffff80209ac4>] __handle_mm_fault+0x114b/0x11f6
 [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff802666ef>] do_page_fault+0xf7b/0x12e0
 [<ffffffff8025f82b>] error_exit+0x0/0x6e
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80228f5f>] do_sigaction+0x189/0x19d
 [<ffffffff8025f82b>] error_exit+0x0/0x6e

Code: 48 89 07 48 89 5f 08 48 89 57 10 4c 89 47 18 4c 89 4f 20 4c
RIP  [<ffffffff80260bb9>] copy_page+0x4d/0xe4
 RSP <ffff8801a04edd48>
-----------
Unable to handle kernel paging request at ffff88013660df58
RIP: [<ffffffff80260f19>] __memcpy+0x15/0xac
PGD 4e37067 PUD 563c067 PMD 57f0067 PTE 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/irq
CPU 2
Modules linked in: xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport joydev k8_edac serio_raw i2c_nforce2 i2c_core edac_mc serial_core forcedeth k8temp hwmon sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod lpfc scsi_transport_fc shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3224, comm: automount Not tainted 2.6.18-128.el5xen #1
RIP: e030:[<ffffffff80260f19>]  [<ffffffff80260f19>] __memcpy+0x15/0xac
RSP: e02b:ffff8807b1929de8  EFLAGS: 00010203
RAX: ffff88013660df58 RBX: ffff88013660c000 RCX: 0000000000000001
RDX: 00000000000000a8 RSI: ffff8807b1929f58 RDI: ffff88013660df58
RBP: ffff8807bddf4820 R08: 0000000000000000 R09: ffff8807b1929f58
R10: 0000000000010800 R11: 0000000000001000 R12: ffff88013660df58
R13: ffff8807bbd197a0 R14: 0000000040cdb250 R15: 00000000003d0f00
FS:  00002b5b4219c540(0063) GS:ffffffff805ba100(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process automount (pid: 3224, threadinfo ffff8807b1928000, task ffff8807bbd197a0)
Stack:  ffff88013660c000 ffffffff8022b2db ffff8807ba0620c0 ffff8807ba062378
 0000000000000000 ffff8807bddf4820 ffff8807bd6574c0 0000000040cdb9d0
 0000000000010800 ffffffff80220251
Call Trace:
 [<ffffffff8022b2db>] copy_thread+0x3b/0x18e
 [<ffffffff80220251>] copy_process+0x13b8/0x1a48
 [<ffffffff80263a0d>] _spin_lock_irq+0x9/0x14
 [<ffffffff80297d03>] alloc_pid+0x26c/0x292
 [<ffffffff80231fcf>] do_fork+0x69/0x1c1
 [<ffffffff8025f2f9>] tracesys+0xab/0xb6
 [<ffffffff8025f519>] ptregscall_common+0x3d/0x64
------------------

Also, we noticed the following in the dmesg output (see the sketch after the additional info below for what this check means):

#crash> log
<snip>
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
PCI: Using configuration type 1
<snip>

Additional info
============================
- Last Saturday, 10/15/2010, the customer replaced the 32 GB of RAM in host ussp-pb14, and since that date we have had no report of a reboot for this machine. Also, the customer was not able to find any issue with the RAM itself.
- Another initial observation was that the blades exhibiting problems appear to have over 18 GB assigned to a Xen virtual machine. To counter this, the customer has identified two servers (ussp-pb01 and ussp-pb10) that have Xen virtual machines of over 20 GB and have been up for over 6 months.
- Hardware certified at: https://hardware.redhat.com/show.cgi?id=244700
- Related Salesforce cases:
  - https://c.na7.visual.force.com/apex/Case_View?id=500A00000045ER8&sfdc.override=1
  - https://c.na7.visual.force.com/apex/Case_View?id=500A00000043M2F&sfdc.override=1
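Regarding the "PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved" message: the kernel only trusts MMCONFIG when the whole MCFG window lies inside an E820 reserved range, and otherwise falls back to configuration type 1, as seen in the log above. A rough, self-contained model of that check is below; the map entries are made up for illustration, and this is not the actual arch/x86_64 implementation.

/* Hedged sketch of the check behind "MCFG area ... is not E820-reserved". */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct e820_entry { uint64_t start, size; int type; };
#define E820_RESERVED 2

/* True if [start, start+size) is fully covered by one reserved E820 entry. */
static bool range_is_reserved(const struct e820_entry *map, int n,
                              uint64_t start, uint64_t size)
{
	for (int i = 0; i < n; i++)
		if (map[i].type == E820_RESERVED &&
		    map[i].start <= start &&
		    start + size <= map[i].start + map[i].size)
			return true;
	return false;
}

int main(void)
{
	/* Example map where the 256 MB MCFG window at 0xe0000000 is NOT reserved */
	struct e820_entry map[] = {
		{ 0x00000000, 0x0009f000, 1 },   /* usable   */
		{ 0x00100000, 0xdff00000, 1 },   /* usable   */
		{ 0xfec00000, 0x01400000, 2 },   /* reserved */
	};
	uint64_t mcfg = 0xe0000000, len = 256ULL << 20;

	if (!range_is_reserved(map, 3, mcfg, len))
		printf("MCFG area at %llx is not E820-reserved -> not using MMCONFIG\n",
		       (unsigned long long)mcfg);
	return 0;
}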
The core files are available at:

host ussp-pb20:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101013074537/work
$ ./crash

2nd core available:
$ cd /cores/20101019105733/work
$ ./crash

host ussp-pb29:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101018111357/work
$ ./crash

2nd core available:
$ cd /cores/20101018105514/work
$ ./crash

host ussp-pb07:
================================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

Core available:
$ cd /cores/20101014095555/work
$ ./crash
Thanks. The evidence is quite strong; the only problem I have is that I don't see how update_va_mapping could return ENOMEM on either the 5.3 or more recent hypervisors. I'll prepare a custom kernel that BUGs on errors from the individual hypercalls. The three error messages at startup are always there on 5.3; I think they were fixed in 5.4.
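For illustration, the instrumentation would look roughly like the wrapper below, which turns a non-zero return from the hypercall into an immediate BUG() so a vmcore is captured at the earliest point of failure. This is a minimal sketch assuming the 2.6.18 Xen dom0 HYPERVISOR_update_va_mapping() interface; the actual test-kernel patch may differ.

/* Hedged sketch of "BUG on errors from the individual hypercalls". */
#include <linux/kernel.h>
#include <asm/hypervisor.h>

static inline void checked_update_va_mapping(unsigned long va,
                                             pte_t new_val,
                                             unsigned long flags)
{
	int rc = HYPERVISOR_update_va_mapping(va, new_val, flags);

	/* -ENOMEM / -EINVAL here means the hypervisor rejected the mapping;
	 * crash immediately instead of letting the corruption propagate. */
	if (rc)
		BUG();
}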
> I am confused by the test kernel version with respect to the
> content of the RPMs, because ...
>
> - The most recent change log entry is only 2.6.18-8:
>
> - The list of patches that I see in the 'kernel-2.6.spec' file looks much
>   different from what I see, for example, in a spec file from a 2.6.18-128
>   source RPM.
>
> Could you please clarify?

There are two sources of these differences:

1) I used "make rh-srpm" on the kernel git repository to build the SRPM, not dist-cvs. I didn't know that it created such a different list of patches.

2) The hypervisor is 5.6-based even for the -128 kernel. This was not intended; if desired, the customer can keep using the stock -128 hypervisor, since there is no debug output there.

---

Thanks for double-checking the -ENOMEM vs. -EINVAL value. It really looks like some paging data structure is corrupted (I don't think it is the hypervisor's fault; it seems more likely to be the dom0 kernel).

At this point, I suggest that the customer try the BUG_ON version of the -228 test kernel (which has a WARN_ON) on some machines, and the -128 BUG_ON test kernel on others. The former will tell us whether the bug has been fixed; the latter will hopefully provide some hints about the corruption earlier, though likely a bit after it has happened.

If the machines are attached to a serial console, it can be useful to capture the hypervisor's error output from there, since those messages are lost by the time the sosreport is generated. Add the following to the hypervisor boot options: "com1=115200,8n1 guest_loglvl=9".
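For example, on a RHEL 5 Xen host those options go on the hypervisor (xen.gz) line in /boot/grub/grub.conf, roughly as below. The kernel version and root device are placeholders, and the console=com1, loglvl=all, and console=ttyS0 additions are optional extras I would suggest for getting hypervisor and dom0 output on the same serial line; only com1=115200,8n1 and guest_loglvl=9 are the requested change.

title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.el5 com1=115200,8n1 console=com1 guest_loglvl=9 loglvl=all
        module /vmlinuz-2.6.18-128.el5xen ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200
        module /initrd-2.6.18-128.el5xen.img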
There are residual issues in bug 666453, but this part was a dup. *** This bug has been marked as a duplicate of bug 479754 ***