Bug 611978
Summary: | xen dom0, guests become unresponsive. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Bill Braswell <bbraswel> |
Component: | kernel-xen | Assignee: | Xen Maintainance List <xen-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5.5 | CC: | drjones, minovotn, mrezanin, pbonzini, smayhew, tao, xen-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-03-18 14:51:12 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 514489 |
Description
Bill Braswell
2010-07-06 23:13:51 UTC
(In reply to comment #0)
> The dom0 and all the guests on the customer's system become unresponsive to
> pings. All attempts to force a crash dump at these times have been
> unsuccessful. Whether using Alt-Sysrq-c or using the Hypervisor to generate
> the dump, neither responds. The only response seems to be dumping the registers
> and the run queues from within the Hypervisor.
>
> It has been suggested that this may be related to the problem with “Xen on
> Arrandale where IPI's were getting” and to bring it to the attention of Drew
> Jones.

Bill, could you provide us steps to reproduce this issue?

Thanks,
Michal

Michal,

The customer is not sure how it happens. Their sys admin just notices nothing on the machine is responding. It always happens with the same physical system, running the same guests. We have asked them to move the guests to a different box to see if the problem follows the guests or stays with the hardware. As of yet, we have not gotten any results from that.

Chris recommended opening this BZ and bringing it to the attention of Drew Jones.

Bill

(In reply to comment #2)
> Michal,
>
> The customer is not sure how it happens. Their sys admin just notices nothing
> on the machine is responding. It always happens with the same physical system,
> running the same guests. We have asked them to move the guests to a different
> box to see if the problem follows the guests or stays with the hardware. As of
> yet, we have not gotten any results from that.
>
> Chris recommended opening this BZ and bringing it to the attention of Drew
> Jones.
>
>
> Bill

It *may* be related to the bug if the processor you're running this on is Arrandale. That may be right. I'm adding him to CC list right now.

Michal

Yes, this certainly sounds related to bug 570579. I also see a new bug being worked on bare-metal that looks related, bug 612659.

Switching to _irq_ keyhandler works for me. Should I create a brew build?
Or would you just like to quick build xen.gz with the simple patch?

-static void do_crashdump_trigger(unsigned char key)
+static void do_crashdump_trigger(unsigned char key, struct cpu_user_regs *regs)

-    register_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");
+    register_irq_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");

Do you still have the packages for the kernel which gave this "xm dmesg" (especially the kernel-xen and debuginfo package)?

The messages are harmless, I see no reason for them to use XENLOG_ERR. We just got that severity level from upstream. I'm downloading the kernel packages anyway to check what the address ffff828c8013073f maps to.

Of the patches that went in between -17.1 and -26.1, the last one fixes a data corruption, so it's certainly possible that the corruption is the cause of this hang, and possibly of others that were reported in the past. However, the new kernel would "just fix it"; it wouldn't result in any of the messages of comment #18.

The messages seem OK to me, just overly verbose. Here's what the address maps to:

# addr2line -fie xen-syms-2.6.18-194.26.1.el5.bz611978.debug ffff828c8013073f
load_segments
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1002
context_switch
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1362

 978 static void load_segments(struct vcpu *n)
 ...
 996     /*
 997      * Either selector != 0 ==> reload.
 998      * Also reload to reset FS_BASE if it was non-zero.
 999      */
1000     if ( unlikely((dirty_segment_mask & (DIRTY_FS | DIRTY_FS_BASE)) |
1001                   nctxt->user_regs.fs) )
1002         all_segs_okay &= loadsegment(fs, nctxt->user_regs.fs);
1003

So we're just detecting that we need to reset the FS segment when scheduling a new domain (meaning the last domain touched it), and that causes a host trap because the segment doesn't exist (Trap 11 is TRAP_no_segment).

How about the other issues in this bug? Are things working well for the customer now?

Closing current release.
If we're wrong about that, and the bug reappears on the latest release, then this can be reopened.
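
For anyone reading the load_segments() excerpt quoted earlier in this bug, the reload condition at domain.c:1000-1001 can be sketched as a standalone predicate. This is only an illustration: the flag values and the name fs_needs_reload are assumptions made for this example and are not taken from the Xen source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen for illustration only;
 * the real Xen DIRTY_* constants may differ. */
#define DIRTY_FS       0x02u
#define DIRTY_FS_BASE  0x10u

/*
 * Hypothetical helper mirroring the quoted condition:
 * reload FS when the outgoing context left FS or FS_BASE dirty,
 * or when the incoming context has a non-zero FS selector.
 */
static bool fs_needs_reload(unsigned int dirty_segment_mask, uint16_t next_fs)
{
    return ((dirty_segment_mask & (DIRTY_FS | DIRTY_FS_BASE)) | next_fs) != 0;
}
```

With these assumed values, a clean mask and a zero selector skip the reload, while a dirty FS_BASE alone forces it even if the incoming selector is zero. That second case matches the comment in the excerpt about resetting FS_BASE.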