Bug 661989
Summary: [ia64] xencomm_privcmd_domctl: unknown domctl cmd 45
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.6
Hardware: ia64
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: high
Target Milestone: rc
Reporter: Alexander Todorov <atodorov>
Assignee: Laszlo Ersek <lersek>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: alex.williamson, drjones, lersek, mbrodeur, mrezanin, pbonzini, syeghiay, xen-maint
Doc Type: Bug Fix
Cloned as: 970066 (view as bug list)
Bug Blocks: 638638
Last Closed: 2011-01-10 09:46:05 UTC
Description (Alexander Todorov, 2010-12-10 09:18:28 UTC):
Both "xen/include/public/domctl.h" and "kernel/include/xen/interface/domctl.h" have

    /* Assign PCI device to HVM guest. Sets up IOMMU structures. */
    #define XEN_DOMCTL_assign_device 37
    #define XEN_DOMCTL_test_assign_device 45

but neither is handled by xencomm_privcmd_domctl() in "kernel/arch/ia64/xen/xcom_privcmd.c".

Please attach the guest config file.

Alexander, can you please add the host dmesg and the xend log? Thanks. I believe the failed XEN_DOMCTL_test_assign_device is not the direct cause of the crash. The xend log file could confirm this. I think xend continued after it received (and logged) the error for XEN_DOMCTL_test_assign_device. The lower frames of the crash should look like this (with some gaps, probably), to be read bottom-up:

* [... hypervisor stack trace here from the Description ...]
* xc_hvm_build() in "tools/libxc/ia64/xc_ia64_hvm_build.c" (xen userspace) issues XEN_DOMCTL_setvcpucontext.
* pyxc_hvm_build() in "tools/python/xen/lowlevel/xc/xc.c" (xen userspace) calls xc_hvm_build().
* buildDomain() in "tools/python/xen/xend/image.py" (xen userspace) calls "xc.hvm_build".

At the top, update_vhpi() calls ia64_call_vsa() in "arch/ia64/vmx/vmx_vsa.S". The faulting instruction pointer (f0000705fc042080) seems to fall in a very different address range than the other entries on the hypervisor's stack. So I suppose either the call to ia64_call_vsa() fails, or that latter function (which I don't understand) does something nasty here:

    br.cond.sptk b6 // call the service

I guess this might even be a hypervisor crash.

Alexander, Andrew tells me host dump captures aren't supported on ia64 and thus we'll have to debug by serial. Can you please describe the hardware more closely so that we can allocate a similar test machine and try to reproduce the crash? Or can we experiment with hp-rx8640-03.rhts.eng.bos.redhat.com itself, interactively? (With serial console connected to a nearby machine?)

(In reply to comment #3)
> Please attach the guest config file.
I will have to reserve the system again to grab this and other files. It wasn't anything special. I created a full virt guest with virt-manager, chose to boot from boot.iso, assigned a single file image for the guest disk and selected bridged networking.

(In reply to comment #5)
> Or can we experiment with hp-rx8640-03.rhts.eng.bos.redhat.com itself,
> interactively? (With serial console connected to a nearby machine?)

This system is available in Beaker and has serial console configured. You can reserve it and experiment with it.

The message

    *** xen_handle_domain_access: exception table lookup failed, iip=0xf0000705fc042080, addr=0x7f, spinning...

is from ia64_do_page_fault(). Citation from "arch/ia64/xen/faults.c" (hypervisor), reindented for bugzilla:

    251     if (!user_mode(regs)) {
    252         /* The fault occurs inside Xen. */
    253         if (!ia64_done_with_exception(regs)) {
    254             // should never happen. If it does, region 0 addr may
    255             // indicate a bad xen pointer
    256             printk("*** xen_handle_domain_access: exception table"
    257                    " lookup failed, iip=0x%lx, addr=0x%lx, "
    258                    "spinning...\n", iip, address);
    259             panic_domain(regs, "*** xen_handle_domain_access: "
    260                    "exception table lookup failed, "
    261                    "iip=0x%lx, addr=0x%lx, spinning...\n",
    262                    iip, address);

I could reproduce the bug. Host:

    # uname -r
    2.6.18-236.el5

    # yum install xen
    (10/12): xen-3.0.3-120.el5.ia64.rpm
    (12/12): kernel-xen-2.6.18-236.el5.ia64.rpm
    ...
    append="xenheap_megabytes=122 --"

After reboot:

    # uname -r
    2.6.18-236.el5xen

    # yum install virt-manager
    # rpm -ivh http://download.devel.redhat.com/brewroot/packages/xen-ia64-guest-firmware/1.0.13/1/ia64/xen-ia64-guest-firmware-1.0.13-1.ia64.rpm

Then created the following HVM guest:

- virt method: fully virtualized
- init mem: 4G
- max mem: 4G
- numvcpu: 8
- OS: RHEL 5.4 or later
- Installation source: http://download.devel.redhat.com/nightly/RHEL5.6-Server-20101215.n/5/ia64/os/ (no ia64 tree was available for RHEL5.6-Server-20101208.n on the download server)
- disk image: file based
- disk size: 4G
- network: shared physdev
- network target: xenbr0
- sound: off

Virt-manager created some storage file, then booted the guest, then I got a hypervisor dump that matched the one in comment #0, except for the following difference:

    --- 0 2010-12-16 20:44:26.535262936 +0100
    +++ 1 2010-12-16 20:43:20.371263092 +0100
    @@ -21,7 +21,7 @@
     (XEN) r14 : 000000000000007f r15 : f00000000b7e8150 r16 : 0000000000000000
     (XEN) r17 : 0000000000010000 r18 : 000001ffffffffe0 r19 : 0000007ffffffff8
     (XEN) r20 : 0000000000000001 r21 : 0000000000000280 r22 : 0000000000000001
    -(XEN) r23 : f00000000b7d0678 r24 : f0000000040ac110 r25 : f00000000b7d0000
    +(XEN) r23 : f00000000b7a0678 r24 : f0000000040ac110 r25 : f00000000b7a0000
     (XEN) r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0800000000000000
     (XEN) r29 : f00000000883fb48 r30 : 00000121f8000000 r31 : f0000000041b4f00
     (XEN)
    @@ -57,22 +57,22 @@
     (XEN) psr : 0000141208526030 ifs : 0000000000000006 ip : [<a00000010006a3e2>]
     (XEN) ip is at ???
     (XEN) unat: 0000000000000000 pfs : 8000000000000591 rsc : 000000000000000b
    -(XEN) rnat: 0000000000000000 bsps: e0000700cc0291b0 pr : 00000000006a6999
    -(XEN) ldrs: 0000000001380000 ccv : 0000000000000002 fpsr: 0009804c0270033f
    +(XEN) rnat: 0000000000000000 bsps: e0000700cd0691b0 pr : 00000000006a6999
    +(XEN) ldrs: 0000000001380000 ccv : 0000000000000000 fpsr: 0009804c0270033f
     (XEN) csd : 0000000000000000 ssd : 0000000000000000
     (XEN) b0 : a00000010006feb0 b6 : a00000010006fd90 b7 : a0000001000180b0
     (XEN) f6 : 1003e0000000000000002 f7 : 1003eaf8af8af8af8af8b
     (XEN) f8 : 1003e0000000000000046 f9 : 1003e0000000000000002
     (XEN) f10 : 1003e0000000000000046 f11 : 1003eaf8af8af8af8af8b
    -(XEN) r1 : a000000100c782a0 r2 : 000000001c032d24 r3 : 00000000e0196920
    -(XEN) r8 : 0000000000000001 r9 : bfffffffabf67894 r10 : 150261daabf67894
    -(XEN) r11 : 0000000000000000 r12 : e0000700cc02fb40 r13 : e0000700cc028000
    -(XEN) r14 : 00000000000007ff r15 : 0000000000000024 r16 : 00000000c4163bfc
    -(XEN) r17 : 00002a046fac7894 r18 : 150237d63c4a0000 r19 : 00002b5f6b049d80
    -(XEN) r20 : 0000015afb5824ec r21 : 000000568dd07a3c r22 : 000000055c9ba3e4
    -(XEN) r23 : 0000000620b1dfe0 r24 : 0000000620b1dfe0 r25 : a0007ff9df21c000
    -(XEN) r26 : a000000100a918e8 r27 : a000000100a918e8 r28 : 0000000000000000
    -(XEN) r29 : 0000000000024000 r30 : 0000000000000000 r31 : a0007fffffd39ff8
    +(XEN) r1 : a000000100c782a0 r2 : ffffffffffffc000 r3 : 00000000c416ff77
    +(XEN) r8 : 0000000000000001 r9 : a0007ff9df21c000 r10 : a000000100a918e8
    +(XEN) r11 : 0000000000000000 r12 : e0000700cd06fb40 r13 : e0000700cd068000
    +(XEN) r14 : a0007fffffd9bbb8 r15 : 0000000000000024 r16 : a0007fffffd98000
    +(XEN) r17 : a0007fffffd9bbef r18 : 0000000000000000 r19 : 8000000000000001
    +(XEN) r20 : e0000700f2a80030 r21 : 0000000000000006 r22 : e0000700f2a80000
    +(XEN) r23 : 0000000000000000 r24 : a0007fffffcd4738 r25 : 0000000000000400
    +(XEN) r26 : a000000100a91ce8 r27 : a000000100a91ce8 r28 : 000000001c03f800
    +(XEN) r29 : a000000100a91cf0 r30 : a000000100a91cf0 r31 : a0007fffffcd4748
     (XEN)
     (XEN) Call Trace:
     (XEN) [<f0000000040c0530>] show_stack+0x80/0xa0

On hp-rx2660-04 with kernel-2.6.18-225.el5xen I get a slightly different result: the first time the system panicked and rebooted (I wasn't connected to the serial console, but the ssh connection dropped and the system rebooted). After the reboot there are no guests running, and when trying to create the new guest I get:

    # xencomm_privcmd_domctl: unknown domctl cmd 45
    (XEN) No enough contiguous memory(16384KB) for init_domain_vhpt

    # free -m
                 total       used       free     shared    buffers     cached
    Mem:          3446       1082       2363          0        107        454
    -/+ buffers/cache:        521       2924
    Swap:         5951          0       5951

Hi Laszlo, on a second retry on hp-rx2660-04, this time with kernel-2.6.18-237.el5xen and xen-3.0.3-120.el5, I got:

    xencomm_privcmd_domctl: unknown domctl cmd 45
    (XEN) ia64_fault, vector=0x4, ifa=0xf300001e21e1df78, iip=0xf00000000407da30, ipsr=0x0000121008226038, isr=0x00000a0400000000
    (XEN) Alt DTLB.
    (XEN) d 0xf000000007c88080 domid 0
    (XEN) vcpu 0xf000000007c68000 vcpu 0
    (XEN)
    (XEN) CPU 0
    (XEN) psr : 0000121008226038 ifs : 800000000000040d ip : [<f00000000407da31>]
    (XEN) ip is at domain_page_flush_and_put+0x441/0x500
    (XEN) unat: 0000000000000000 pfs : 000000000000040d rsc : 0000000000000003
    (XEN) rnat: 0000000000000000 bsps: f0000000043a4b50 pr : 000000000069c199
    (XEN) ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
    (XEN) csd : 0000000000000000 ssd : 0000000000000000
    (XEN) b0 : f00000000407da30 b6 : f0000000040b98d0 b7 : f000000004002e50
    (XEN) f6 : 0ffff8000000000000000 f7 : 000000000000000000000
    (XEN) f8 : 000000000000000000000 f9 : 000000000000000000000
    (XEN) f10 : 000000000000000000000 f11 : 000000000000000000000
    (XEN) r1 : f0000000043a4b50 r2 : 0000000000000000 r3 : 000000000000003f
    (XEN) r8 : 0000000000000000 r9 : 0000000000000000 r10 : a000000100a91920
    (XEN) r11 : 0000000000000008 r12 : f000000007c6fdd0 r13 : f000000007c68000
    (XEN) r14 : f000000007c60018 r15 : f300001e21e1df78 r16 : 0000000000000000
    (XEN) r17 : f000000004c10480 r18 : 0000000000000000 r19 : 0000000000000001
    (XEN) r20 : 0000000000000001 r21 : 0000000000000000 r22 : 0000001008226038
    (XEN) r23 : 0000000000000000 r24 : 0000000000000020 r25 : 0000000000000000
    (XEN) r26 : e0000100b5393dc0 r27 : 0000000000000000 r28 : a000000201154010
    (XEN) r29 : 0000000000000000 r30 : 0000000000000000 r31 : f0000000041bc580
    (XEN)
    (XEN) Call Trace:
    (XEN) [<f0000000040c0530>] show_stack+0x80/0xa0
    (XEN) sp=f000000007c6fa00 bsp=f000000007c695e8
    (XEN) [<f000000004088430>] ia64_fault+0x9e0/0xbb0
    (XEN) sp=f000000007c6fbd0 bsp=f000000007c695a8
    (XEN) [<f0000000040b9320>] ia64_leave_kernel+0x0/0x300
    (XEN) sp=f000000007c6fbd0 bsp=f000000007c695a8
    (XEN) [<f00000000407da30>] domain_page_flush_and_put+0x440/0x500
    (XEN) sp=f000000007c6fdd0 bsp=f000000007c69540
    (XEN) [<f000000004081970>] __dom0vp_add_physmap+0x330/0x630
    (XEN) sp=f000000007c6fde0 bsp=f000000007c694d8
    (XEN) [<f00000000405fa70>] do_dom0vp_op+0x1f0/0x560
    (XEN) sp=f000000007c6fdf0 bsp=f000000007c69498
    (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0
    (XEN) sp=f000000007c6fe00 bsp=f000000007c69498
    (XEN)
    (XEN) ****************************************
    (XEN) Panic on CPU 0:
    (XEN) Fault in Xen.
    (XEN) ****************************************
    (XEN)
    (XEN) Reboot in five seconds...

Please advise whether this is the same crash or a different one; I will open another bug if it's different.

I'll bisect tomorrow.

I'm still trying to check whether this is a regression. I have now provisioned another rx2600 series Integrity machine with RHEL5.5 and am about to create a virtual machine as described in comment 12. Since (from my last attempt, which I was unable to finish before the provisioning time expired) RHEL5.5 doesn't seem to support creation of HVM guests, I'll have to do it by writing/customizing a config file for "xm create", so it will be a bit slower.
Re "unknown domctl cmd 45": now I'm convinced it has nothing to do with the problem (or problems, plural, if comment 14 actually pertains to some other problem -- I can't say if it does). Both domctls mentioned in comment 2 were added to upstream Xen as no-ops:

- XEN_DOMCTL_assign_device: http://xenbits.xensource.com/linux-2.6.18-xen.hg/rev/641
- XEN_DOMCTL_test_assign_device: http://xenbits.xensource.com/linux-2.6.18-xen.hg/rev/741

It's also clear that the hypervisor is the one to crash. I'll try to comb through the xen-ia64-devel archives.

(In reply to comment #16)
> RHEL5.5 doesn't seem to support creation of HVM guests

I meant "virt-manager in RHEL5.5", sorry.

https://beaker.engineering.redhat.com/jobs/41014

Host:

    # uname -psrn
    Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-194.el5xen ia64

Installed:

    xen 3.0.3-105.el5
    kernel-xen 2.6.18-194.el5
    xen-ia64-guest-firmware.ia64 0:1.0.13-1

I was able to create and start, with virt-manager, an HVM guest as described in comment 12, except the installation source was http://download.devel.redhat.com/released/RHEL-5-Server/U5/ia64/os/. However this is a different machine (the repro in comment 12 was made on hp-rx8640-03.rhts.eng.bos.redhat.com, see job 39757), so now I have to try to crash the HV with RHEL5.6.

Job: https://beaker.engineering.redhat.com/jobs/41117
System: hp-rx2660-03.rhts.eng.bos.redhat.com
Distro: RHEL5.6-Server-20101221.n_nfs-ia64

Right after boot:

    [root@hp-rx2660-03 ~]# uname -psrn
    Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64

Installed the following with yum: xen-3.0.3-120.el5, kernel-xen-2.6.18-238.el5. Installed the following with rpm: xen-ia64-guest-firmware-1.0.13-1. Rebooted host.

    [root@hp-rx2660-03 ~]# uname -psrn
    Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64

Succeeded to create and start (with virt-manager) an HVM guest. The difference to comment 18 was the installation path (http://download.devel.redhat.com/nightly/RHEL5.6-Server-20101221.n/5/ia64/os/).
At the EFI prompt I typed "fs0:" then "elilo" (thanks Paolo). The guest kernel booted and the installer started.

Thus the test on hp-rx2660-03 is inconclusive. Everything worked fine both with RHEL-5.5 and RHEL5.6-Server-20101221.n. I also saw none of the problems described in comment 13 and comment 14. (The machine affected by those, hp-rx2660-04, should be a "sibling" of hp-rx2660-03 -- that's why I reserved hp-rx2660-03 when I saw that hp-rx2660-04 was occupied.)

I'm waiting for the following queued jobs to start:
- 41012 -- hp-rx8640-02.rhts.eng.bos.redhat.com; need to check both with RHEL5.6 and RHEL5.5.
- 41013 -- hp-rx8640-03.rhts.eng.bos.redhat.com; need to check only with RHEL5.5 (RHEL5.6 already done in comment 12).

Current status:
- still unable to say if this is a regression
- no idea what causes either type of crash (and whether they have a common root cause)

(In reply to comment #17)
> (In reply to comment #16)
> > RHEL5.5 doesn't seem to support creation of HVM guests
>
> I meant "virt-manager in RHEL5.5", sorry.

It does support HVM guests. On ia64, however, you need to install the xen-ia64-guest-firmware package from the Supplementary CD.

Created attachment 470318 [details]: console log of RHEL5.5 crash

(This is the RHEL5.5 half of the test initiated in comment 12, on the same machine.)

https://beaker.engineering.redhat.com/jobs/41013

Host:

    [root@hp-rx8640-03 ~]# uname -psrn
    Linux hp-rx8640-03.rhts.eng.bos.redhat.com 2.6.18-194.el5 ia64

    # yum install xen
    (10/12): xen-3.0.3-105.el5.ia64.rpm
    (12/12): kernel-xen-2.6.18-194.el5.ia64.rpm

    # yum install xen-ia64-guest-firmware
    xen-ia64-guest-firmware.ia64 0:1.0.13-1
    ...

    append="xenheap_megabytes=122 --"

After reboot:

    [root@hp-rx8640-03 ~]# uname -psrn
    Linux hp-rx8640-03.rhts.eng.bos.redhat.com 2.6.18-194.el5xen ia64

Created a similar HVM guest as described in comment 12, used http://download.devel.redhat.com/released/RHEL-5-Server/U5/ia64/os/ as the boot image path. When the guest started, the hypervisor crashed.
The console output was similar to that in comment #0, see the attachment. Therefore this bug is not a regression.

Notably, if one diffs this attachment with the trace in comment #0, the invalid faulting address is the same:

    (XEN) [<f0000705fc042080>] ???
    (XEN) sp=f00000000883fba0 bsp=f000000008839478

Interestingly, the outermost frame is not fast_hypercall() this time; below that, there is

    +(XEN) [<f00000000409f080>] update_vhpi+0xb0/0xd0
    +(XEN) sp=f00000000883fe00 bsp=f000000008839308

Status:
- Not a regression (comment 21).
- "XEN_DOMCTL_test_assign_device" is irrelevant wrt. the bug (comment 16).
- Hypervisor crash (comment 9).
- Cause of the bug, and whether it is identical to comment 13 / comment 14, still unclear.

I strongly suspect that (a) the bug is hardware dependent, and (b) I don't know enough about ia64 to go deeper than this.

(In reply to comment #14)
> Hi Laszlo,
> on a second retry on hp-rx2660-04 this time with kernel-2.6.18-237.el5xen and
> xen-3.0.3-120.el5 I got:
>
> xencomm_privcmd_domctl: unknown domctl cmd 45
> (XEN) ia64_fault, vector=0x4, ifa=0xf300001e21e1df78, iip=0xf00000000407da30,
> ipsr=0x0000121008226038, isr=0x00000a0400000000
> (XEN) Alt DTLB.
> (XEN) d 0xf000000007c88080 domid 0
> (XEN) vcpu 0xf000000007c68000 vcpu 0

Confirming this crash too.
    HOSTNAME=hp-rx2660-04.rhts.eng.bos.redhat.com
    JOBID=41493
    RECIPEID=84487
    RESULT_SERVER=127.0.0.1:7080
    DISTRO=RHEL5.6-Server-20101221.n
    ARCHITECTURE=ia64

    [root@hp-rx2660-04 ~]# uname -psrn
    Linux hp-rx2660-04.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64

Installed:

    xen-3.0.3-120.el5.ia64.rpm
    kernel-xen-2.6.18-238.el5.ia64.rpm
    xen-ia64-guest-firmware-1.0.13-1.ia64.rpm

    [root@hp-rx2660-04 ~]# uname -psrn
    Linux hp-rx2660-04.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64

Guest:
- virt method: fully virtualized
- init mem: 2G
- max mem: 2G
- numvcpu: 2
- OS: RHEL 5.4 or later
- Installation source: http://download.devel.redhat.com/nightly/RHEL5.6-Server-20101222.0/5/ia64/os/
- disk image: file based
- disk size: 4G
- network: shared physdev
- network target: xenbr0
- sound: off

Trace diff:

    < xencomm_privcmd_domctl: unknown domctl cmd 45
    < (XEN) ia64_fault, vector=0x4, ifa=0xf300001e21e1df78, iip=0xf00000000407da30, ipsr=0x0000121008226038, isr=0x00000a0400000000
    ---
    > Red Hat Enterprise Linux Server release 5.6 (Tikanga)
    > Kernel 2.6.18-238.el5xen on an ia64
    >
    > hp-rx2660-04.rhts.eng.bos.redhat.com login: xencomm_privcmd_domctl: unknown domctl cmd 45
    > (XEN) ia64_fault, vector=0x4, ifa=0xf300001e21e1df78, iip=0xf00000000407da30, ipsr=0x0000121008226018, isr=0x00000a0400000000
    8c11
    < (XEN) psr : 0000121008226038 ifs : 800000000000040d ip : [<f00000000407da31>]
    ---
    > (XEN) psr : 0000121008226018 ifs : 800000000000040d ip : [<f00000000407da31>]
    11c14
    < (XEN) rnat: 0000000000000000 bsps: f0000000043a4b50 pr : 000000000069c199
    ---
    > (XEN) rnat: 0000000000000011 bsps: f0000000041af5c8 pr : 000000000069c199
    23c26
    < (XEN) r20 : 0000000000000001 r21 : 0000000000000000 r22 : 0000001008226038
    ---
    > (XEN) r20 : 0000000000000001 r21 : 0000000000000000 r22 : 0000001008226018
    25c28
    < (XEN) r26 : e0000100b5393dc0 r27 : 0000000000000000 r28 : a000000201154010
    ---
    > (XEN) r26 : e0000100c6593540 r27 : 0000000000000000 r28 : a000000201150010

Will check with RHEL5.5 too.
(In reply to comment #24)
> Confirming this crash too.
> [root@hp-rx2660-04 ~]# uname -psrn
> Linux hp-rx2660-04.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64
>
> - virt method: fully virtualized
> - init mem: 2G
> - max mem: 2G
> - numvcpu: 2
> - OS: RHEL 5.4 or later
> - Installation source:
> http://download.devel.redhat.com/nightly/RHEL5.6-Server-20101222.0/5/ia64/os/
> - disk image: file based
> - disk size: 4G
> - network: shared physdev
> - network target: xenbr0
> - sound: off
> Will check with RHEL5.5 too.

    HOSTNAME=hp-rx2660-04.rhts.eng.bos.redhat.com
    JOBID=41787
    RECIPEID=85222
    RESULT_SERVER=127.0.0.1:7094
    DISTRO=RHEL5-Server-U5
    ARCHITECTURE=ia64

    [root@hp-rx2660-04 ~]# uname -psrn
    Linux hp-rx2660-04.rhts.eng.bos.redhat.com 2.6.18-194.el5 ia64

    (15/18): xen-ia64-guest-firmware-1.0.13-1.ia64.rpm
    (16/18): xen-3.0.3-105.el5.ia64.rpm
    (18/18): kernel-xen-2.6.18-194.el5.ia64.rpm

    [root@hp-rx2660-04 ~]# uname -psrn
    Linux hp-rx2660-04.rhts.eng.bos.redhat.com 2.6.18-194.el5xen ia64

Installation source: http://download.devel.redhat.com/released/RHEL-5-Server/U5/ia64/os/

The hypervisor crashed again, this time with a very deep mutual recursion; the crash is possibly due to stack overflow. So the crash on this machine is not a regression either.

    Red Hat Enterprise Linux Server release 5.5 (Tikanga)
    Kernel 2.6.18-194.el5xen on an ia64

    hp-rx2660-04.rhts.eng.bos.redhat.com login: xencomm_privcmd_domctl: unknown domctl cmd 45
    (XEN) ia64_fault, vector=0x4, ifa=0xf300001e21e1df78, iip=0xf00000000407da10, ipsr=0x0000121008226038, isr=0x00000a0400000000
    (XEN) Alt DTLB.
(XEN) d 0xf000000007c88080 domid 0 (XEN) vcpu 0xf000000007c68000 vcpu 0 (XEN) (XEN) CPU 0 (XEN) psr : 0000121008226038 ifs : 800000000000040d ip : [<f00000000407da11>] (XEN) ip is at domain_page_flush_and_put+0x441/0x500 (XEN) unat: 0000000000000000 pfs : 000000000000040d rsc : 0000000000000003 (XEN) rnat: 0000000000000000 bsps: f0000000043a4b50 pr : 000000000069c199 (XEN) ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f (XEN) csd : 0000000000000000 ssd : 0000000000000000 (XEN) b0 : f00000000407da10 b6 : f0000000040b97f0 b7 : f000000004002e50 (XEN) f6 : 0ffff8000000000000000 f7 : 000000000000000000000 (XEN) f8 : 000000000000000000000 f9 : 000000000000000000000 (XEN) f10 : 000000000000000000000 f11 : 000000000000000000000 (XEN) r1 : f0000000043a4b50 r2 : 0000000000000000 r3 : 000000000000003f (XEN) r8 : 0000000000000000 r9 : 0000000000000000 r10 : a000000100a82830 (XEN) r11 : 0000000000000008 r12 : f000000007c6fdd0 r13 : f000000007c68000 (XEN) r14 : f000000007c60018 r15 : f300001e21e1df78 r16 : 0000000000000000 (XEN) r17 : f000000004c10480 r18 : 0000000000000000 r19 : 0000000000000001 (XEN) r20 : 0000000000000001 r21 : 0000000000000000 r22 : 0000001008226038 (XEN) r23 : 0000000000000000 r24 : 0000000000000020 r25 : 0000000000000000 (XEN) r26 : e0000100c4da6dc0 r27 : 0000000000000000 r28 : a00000020105c010 (XEN) r29 : 0000000000000000 r30 : 0000000000000000 r31 : f0000000041bc580 (XEN) (XEN) Call Trace: (XEN) [<f0000000040c0450>] show_stack+0x80/0xa0 (XEN) sp=f000000007c6fa00 bsp=f000000007c695e8 (XEN) [<f000000004088340>] ia64_fault+0x9e0/0xbb0 (XEN) sp=f000000007c6fbd0 bsp=f000000007c695a8 (XEN) [<f0000000040b9240>] ia64_leave_kernel+0x0/0x300 (XEN) sp=f000000007c6fbd0 bsp=f000000007c695a8 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fdd0 bsp=f000000007c69540 (XEN) [<f000000004081950>] __dom0vp_add_physmap+0x330/0x630 (XEN) sp=f000000007c6fde0 bsp=f000000007c694d8 (XEN) [<f00000000405fa50>] 
do_dom0vp_op+0x1f0/0x560 (XEN) sp=f000000007c6fdf0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe00 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe00 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe10 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe10 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe20 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe20 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe30 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe30 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe40 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe40 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe50 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe50 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe60 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe60 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe70 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe70 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe80 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe80 bsp=f000000007c69498 (XEN) 
[<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fe90 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fe90 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fea0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fea0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6feb0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6feb0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fec0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fec0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fed0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fed0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fee0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fee0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fef0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fef0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff00 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff00 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff10 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff10 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff20 
bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff20 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff30 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff30 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff40 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff40 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff50 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff50 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff60 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff60 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff70 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff70 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff80 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff80 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ff90 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ff90 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ffa0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ffa0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ffb0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] 
domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ffb0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ffc0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ffc0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ffd0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ffd0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6ffe0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6ffe0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c6fff0 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c6fff0 bsp=f000000007c69498 (XEN) [<f000000004002e80>] fast_hypercall+0x170/0x2f0 (XEN) sp=f000000007c70000 bsp=f000000007c69498 (XEN) [<f00000000407da10>] domain_page_flush_and_put+0x440/0x500 (XEN) sp=f000000007c70000 bsp=f000000007c69498 (XEN) (XEN) **************************************** (XEN) Panic on CPU 0: (XEN) Fault in Xen. (XEN) **************************************** (XEN) (XEN) Reboot in five seconds... Created attachment 470857 [details] hp-rx2660-04 crash logs and hp-rx2660-01.brq / hp-rx2660-04 Xen boot logs Based on the advice I received, this is/was the plan: 1. Find the first RHEL5 xen hv on hp-rx2660-04 that broke the hvm guest start. Originally that machine must have been well supported. 2. If "1" finds a working RHEL xen on hp-rx2660-04, try it out on hp-rx8640-03. 3. Try to find a xen-3 upstream release working on hp-rx8640-03. hg repo to use: http://xenbits.xensource.com/xen-unstable.hg 4. This upstream repo may have a greater chance to work for step "3": http://xenbits.xensource.com/ext/ia64/xen-unstable.hg 5. 
Builds of upstream xen ("3"/"4") might work better with open source EFI: http://xenbits.xensource.com/ext/efi-vfirmware.hg?file/3ad73b4314e3/binaries/ 6. Compare "xm dmesg" (hypervisor boot messages) right after boot on hp-rx2660-04 vs. hp-rx2660-01.brq (should be similar) and check if the latter crashes as well. ----o---- Execution status: **** 1. Find the first RHEL5 xen hv on hp-rx2660-04 that broke the hvm guest start. Originally that machine must have been well supported. I grabbed the RHEL5 xen tags, the RHEL5 kernel tags, took their intersection, then looked up the ia64 kernel-xen RPMs for that (not all versions could be found in brew). I manually, partially "bisected" the resultant list. I checked 17 versions, of which 16 crashed when starting the hvm guest, and 1 simply didn't work. Here's the command I used for hvm guest creation: virt-install \ --connect=xen \ --name=rhel56-hvm \ --ram=2048 \ --arch=ia64 \ --vcpus=2 \ --os-type=linux \ --os-variant=rhel5.4 \ --hvm \ --location=http://download.devel.redhat.com/nightly/RHEL5.6-Server-20101222.n/5/ia64/os/ \ --disk=path=/var/lib/xen/images/rhel56-hvm,size=4,sparse=true \ --network=bridge:xenbr0 \ --debug \ --prompt This is the list of tested versions and the tag dates: 2.6.18-62.el5 Dec 21 2007 15:44:52 -0500 2.6.18-91.el5 Apr 23 2008 15:11:43 -0400 2.6.18-124.el5 Nov 18 2008 16:24:38 -0500 2.6.18-146.el5 May 12 2009 13:55:39 -0400 2.6.18-162.el5 Aug 5 2009 10:33:16 -0400 2.6.18-178.el5 Dec 9 2009 13:13:24 -0500 2.6.18-194.el5 Mar 17 2010 11:53:52 -0400 2.6.18-201.el5 May 28 2010 11:22:16 -0400 2.6.18-208.el5 Jul 22 2010 18:10:08 -0400 2.6.18-212.el5 Aug 11 2010 23:00:55 -0400 2.6.18-219.el5 Sep 9 2010 16:51:06 -0400 2.6.18-221.el5 Sep 13 2010 21:38:00 -0400 2.6.18-222.el5 Sep 15 2010 21:51:35 -0400 2.6.18-225.el5 Sep 27 2010 10:08:45 -0400 2.6.18-230.el5 Oct 28 2010 16:37:07 -0400 2.6.18-235.el5 Dec 1 2010 11:54:29 -0500 2.6.18-238.el5 Dec 19 2010 13:48:52 -0500 The crash logs are in the 
"hp-rx2660-04.results" directory. -221 didn't crash but refused guest creation with "(XEN) No enough contiguous memory(16384KB) for init_domain_vhpt". **** 2. If "1" finds a working RHEL xen on hp-rx2660-04, try it out on hp-rx8640-03. N/A **** 3. Try to find a xen-3 upstream release working on hp-rx8640-03. hg repo to use: http://xenbits.xensource.com/xen-unstable.hg The NUMA configuration of hp-rx8640-03 is not optimal for Xen. Until that is fixed (see comment 26), this step is suspended. To work a bit in advance, I built 19 tagged dist bundles from the abovementioned hg repo. Once the NUMA config is fixed, those may be used for manual bisection. dist-3.0.3-branched.tar.bz2 dist-3.3.0-rc3.tar.bz2 dist-3.0.4-branched.tar.bz2 dist-3.3.0-rc4.tar.bz2 dist-3.1.0-branched.tar.bz2 dist-3.3.0-rc5.tar.bz2 dist-3.2.0-rc1.tar.bz2 dist-3.3.0-rc6.tar.bz2 dist-3.2.0-rc2.tar.bz2 dist-3.3.0-rc7.tar.bz2 dist-3.2.0-rc3.tar.bz2 dist-3.3.0-branched.tar.bz2 dist-3.2.0-rc4.tar.bz2 dist-3.4.0-rc1.tar.bz2 dist-3.2.0-rc5.tar.bz2 dist-3.4.0-rc2.tar.bz2 dist-3.2.0-rc6.tar.bz2 dist-3.4.0-rc3.tar.bz2 dist-3.3.0-rc1.tar.bz2 This script was used for building: #!/bin/bash set -e -u -x -C cd ~/src-hg/xen-unstable.hg while read TAG REV; do make -k distclean hg checkout $TAG make -j 40 world 2>&1 | tee -i ~/xenbuildlog/$TAG.log tar -c -v -f ~/xenbin/dist-$TAG.tar dist done <~/tested-upstream-xen-revisions 2>&1 \ | tee -i ~/buildall2.log **** 4. 
This upstream repo may have a greater chance to work for step "3":
http://xenbits.xensource.com/ext/ia64/xen-unstable.hg

See the status for the previous bullet:

dist-ext-ia64-3.0.3-branched.tar.bz2    dist-ext-ia64-3.3.0-rc3.tar.bz2
dist-ext-ia64-3.0.4-branched.tar.bz2    dist-ext-ia64-3.3.0-rc4.tar.bz2
dist-ext-ia64-3.1.0-branched.tar.bz2    dist-ext-ia64-3.3.0-rc5.tar.bz2
dist-ext-ia64-3.2.0-rc1.tar.bz2         dist-ext-ia64-3.3.0-rc6.tar.bz2
dist-ext-ia64-3.2.0-rc2.tar.bz2         dist-ext-ia64-3.3.0-rc7.tar.bz2
dist-ext-ia64-3.2.0-rc3.tar.bz2         dist-ext-ia64-3.3.0-branched.tar.bz2
dist-ext-ia64-3.2.0-rc4.tar.bz2         dist-ext-ia64-3.4.0-rc1.tar.bz2
dist-ext-ia64-3.2.0-rc5.tar.bz2         dist-ext-ia64-3.4.0-rc2.tar.bz2
dist-ext-ia64-3.2.0-rc6.tar.bz2         dist-ext-ia64-3.4.0-rc3.tar.bz2
dist-ext-ia64-3.3.0-rc1.tar.bz2

****

5. Builds of upstream xen ("3"/"4") might work better with open source EFI:
http://xenbits.xensource.com/ext/efi-vfirmware.hg?file/3ad73b4314e3/binaries/

Suspended, see above.

****

6. Compare "xm dmesg" (hypervisor boot messages) right after boot on
hp-rx2660-04 vs. hp-rx2660-01.brq (should be similar) and check if the
latter crashes as well.

hp-rx2660-01.rhts.eng.brq.redhat.com didn't crash but successfully created
the guest with the virt-install command described in bullet 1. (This is the
second hp-rx2660 that has no problem creating the guest, see comment 18 and
comment 19.) Both machines were provisioned with
RHEL5.6-Server-20101222.n_nfs-ia64 and booted with 2.6.18-238.el5xen. The
"xm dmesg" outputs are in the attached tarball. To me the only non-trivial
difference seems to be

+(XEN) Reducing dom0 memory allocation from 4194304K to 3928560K to fit available memory

in the hp-rx2660-04 xen boot log.

****

Summary

hp-rx2660-04 appears to be a dead-end. hp-rx8640-03 needs a NUMA reconfig
before testing can continue. Matt Brodeur provided expert help.
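The tag-intersection step from bullet 1 (taking the RHEL5 xen tags and the RHEL5 kernel tags and keeping only the versions present in both) can be sketched with coreutils `comm`. The file names and the short tag lists below are made-up placeholders, not the real brew output:

```shell
# Hypothetical stand-ins for the real tag queries; comm(1) requires both
# inputs to be sorted under the same collation, hence LC_ALL=C.
printf '%s\n' 2.6.18-62.el5 2.6.18-91.el5 2.6.18-124.el5 \
    | LC_ALL=C sort > xen-tags.txt
printf '%s\n' 2.6.18-91.el5 2.6.18-124.el5 2.6.18-146.el5 \
    | LC_ALL=C sort > kernel-tags.txt

# -1 and -2 suppress lines unique to either file, leaving the intersection:
# versions that carry both a xen tag and a kernel tag.
comm -12 xen-tags.txt kernel-tags.txt
# prints (in C-collation order):
#   2.6.18-124.el5
#   2.6.18-91.el5
```

On the real lists, the surviving versions are the ones whose ia64 kernel-xen RPMs were then looked up and manually bisected.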
On 01/07/11 20:39, Matthew Brodeur wrote:
> The 8640s are configured to have two different configurations as they'd
> be used in the field. The rx8640-03 uses interleaved memory, where -02
> has a flat configuration.

https://engineering.redhat.com/rt3/Ticket/Display.html?id=28539#txn-679741

>> Matt,
>>
>> I configured this system to use NUMA while -02 is set up not to use
>> NUMA. With the NUMA config you cannot run xen. The MCA you saw is
>> expected in this case. Sorry, I really should have mentioned that bit!
>>
>> - Doug

Alex,

I'm closing this as NOTABUG, as it was reported for hp-rx8640-03.

The other "problematic machine" mentioned in this bug was hp-rx2660-04.
Alas, I couldn't determine the cause of the crash. The difference between
hp-rx2660-04 and hp-rx2660-01/hp-rx2660-03 might be NUMA again, or it might
be something else. If you want that machine to be investigated further,
please clone this bug (if appropriate), or submit an independent bug.

Thank you.

*** Bug 696599 has been marked as a duplicate of this bug. ***