Bug 611978

Summary: xen dom0, guests become unresponsive.
Product: Red Hat Enterprise Linux 5 Reporter: Bill Braswell <bbraswel>
Component: kernel-xen Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.5 CC: drjones, minovotn, mrezanin, pbonzini, smayhew, tao, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-03-18 14:51:12 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On:    
Bug Blocks: 514489    

Description Bill Braswell 2010-07-06 23:13:51 UTC
The dom0 and all the guests on the customer's system become unresponsive to pings. All attempts to force a crash dump at these times have been unsuccessful: neither Alt-SysRq-c nor a dump triggered from the Hypervisor gets any response. The only thing that still works seems to be dumping the registers and the run queues from within the Hypervisor.

It has been suggested that this may be related to the problem with “Xen on Arrandale where IPI's were getting”, and that it should be brought to the attention of Drew Jones.

Comment 1 Michal Novotny 2010-07-08 12:32:14 UTC
(In reply to comment #0)
> The dom0 and all the guest on the customers system become unresponsive to
> pings.   All attempts to force a crash dump at these times have been
> unsuccessful.  Whether using Alt-Sysrq-c or using the Hypervisor to generate
> the dump, neither respond.  The only response seems to be dumping the registers
> and the run queues from within the Hypervisor.
> 
> It has been suggested that this may be related to the problem with “Xen on
> Arrandale where IPI's were getting”  and to bring it to the attention of Drew
> Jones.    

Bill, could you provide us steps to reproduce this issue?

Thanks,
Michal

Comment 2 Bill Braswell 2010-07-08 20:54:34 UTC
Michal,

The customer is not sure how it happens.  Their sys admin just notices nothing on the machine is responding.  It always happens with the same physical system, running the same guests.  We have asked them to move the guests to a different box to see if the problem follows the guests or stays with the hardware.  As of yet, we have not gotten any results from that.

Chris recommended opening this BZ and bringing it to the attention of Drew Jones.


Bill

Comment 3 Michal Novotny 2010-07-09 12:04:28 UTC
(In reply to comment #2)
> Michal,
> 
> The customer is not sure how it happens.  Their sys admin just notices nothing
> on the machine is responding.  It always happens with the same physical system,
> running the same guests.  We have asked them to move the guests to a different
> box to see if the problem follows the guests or stays with the hardware.  As of
> yet, we have not gotten any results from that.
> 
> Chris recommended opening this BZ and bringing it to the attention of Drew
> Jones.
> 
> 
> Bill    

It *may* be related to that bug if the processor you're running this on is Arrandale. I'm adding Drew to the CC list right now.

Michal

Comment 4 Andrew Jones 2010-07-12 11:08:02 UTC
Yes, this certainly sounds related to bug 570579. I also see a new bug being worked on bare-metal that looks related, bug 612659.

Comment 14 Andrew Jones 2010-11-18 07:56:28 UTC
Switching to the _irq_ keyhandler works for me. Should I create a brew build? Or would you rather just quickly build xen.gz yourself with this simple patch?

-static void do_crashdump_trigger(unsigned char key)
+static void do_crashdump_trigger(unsigned char key, struct cpu_user_regs *regs)

-    register_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");
+    register_irq_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");
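
For readers unfamiliar with the two registration calls, here is a rough sketch of why the switch can matter on a hung host. It is a simplified model, not Xen source. It assumes that a regular keyhandler is only queued for deferred work while an irq keyhandler is called directly from the interrupt path. All model_* names are hypothetical, and the regs argument of the real irq handler is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not Xen source: a key handler either runs
 * immediately in interrupt context or is queued for deferred work. */
typedef void (*model_key_fn)(unsigned char key);

struct model_key_entry {
    model_key_fn fn;
    bool irq_context;          /* true: call directly from the interrupt path */
};

static struct model_key_entry model_table[256];
static unsigned char model_pending[16];
static size_t model_npending;

int model_crashdump_requests;  /* counts how often the handler really ran */

/* Stand-in for do_crashdump_trigger(); it only records that it ran. */
void model_crashdump_trigger(unsigned char key)
{
    (void)key;
    model_crashdump_requests++;
}

void model_register(unsigned char key, model_key_fn fn, bool irq_context)
{
    model_table[key].fn = fn;
    model_table[key].irq_context = irq_context;
}

/* Called from the modelled console interrupt. Returns true if the
 * handler ran right away, false if it was only queued or is missing. */
bool model_keypress(unsigned char key)
{
    struct model_key_entry *e = &model_table[key];

    if (e->fn == NULL)
        return false;
    if (e->irq_context) {
        e->fn(key);
        return true;
    }
    if (model_npending < sizeof(model_pending))
        model_pending[model_npending++] = key;
    return false;
}

/* Deferred work. On a hung host this step may never be reached, so a
 * queued 'C' would never trigger the crash dump. */
size_t model_run_deferred(void)
{
    size_t ran = 0;

    for (size_t i = 0; i < model_npending; i++) {
        model_table[model_pending[i]].fn(model_pending[i]);
        ran++;
    }
    model_npending = 0;
    return ran;
}
```

In this model, a 'C' registered with irq_context set to false stays in the pending queue until model_run_deferred() is called, while the same key registered with irq_context set to true runs the handler inside model_keypress() itself.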

Comment 19 Paolo Bonzini 2010-12-20 14:39:35 UTC
Do you still have the packages for the kernel which gave this "xm dmesg" (especially the kernel-xen and debuginfo package)?

Comment 22 Paolo Bonzini 2011-01-10 17:33:48 UTC
The messages are harmless; I see no reason for them to use XENLOG_ERR. We just got that severity level from upstream. I'm downloading the kernel packages anyway to check what the address ffff828c8013073f maps to.

Of the patches that went in between -17.1 and -26.1, the last one fixes a data corruption, so it's certainly possible that the corruption is the cause of this hang, and possibly of others that were reported in the past. However, the new kernel would "just fix it"; it wouldn't result in any of the messages of comment #18.

Comment 23 Andrew Jones 2011-03-18 14:05:25 UTC
The messages seem OK to me, just overly verbose. Here's what the address maps to:

# addr2line -fie xen-syms-2.6.18-194.26.1.el5.bz611978.debug ffff828c8013073f
load_segments
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1002
context_switch
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1362

978 static void load_segments(struct vcpu *n)
...
 996     /*
 997      * Either selector != 0 ==> reload.
 998      * Also reload to reset FS_BASE if it was non-zero.
 999      */
1000     if ( unlikely((dirty_segment_mask & (DIRTY_FS | DIRTY_FS_BASE)) |
1001                   nctxt->user_regs.fs) )
1002         all_segs_okay &= loadsegment(fs, nctxt->user_regs.fs);
1003 

So we're just detecting that we need to reset the FS segment when scheduling a new domain (meaning the last domain touched it), and that causes a host trap because the segment doesn't exist (Trap 11 is TRAP_no_segment).
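
The reload condition on lines 1000-1001 can be sketched in isolation as below. This is only an illustration: the DIRTY_* bit values are placeholders I picked for the example and may not match the real Xen constants, and fs_needs_reload() is a hypothetical helper, not a function in domain.c.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only; the real DIRTY_*
 * constants in the Xen source may differ. */
#define DIRTY_FS       0x02u
#define DIRTY_FS_BASE  0x10u

/* Hypothetical helper mirroring the test in load_segments(): FS is
 * reloaded if the previous vcpu left FS or FS_BASE dirty, or if the
 * next vcpu wants a non-zero FS selector. */
bool fs_needs_reload(unsigned int dirty_segment_mask,
                     unsigned int next_fs_selector)
{
    return ((dirty_segment_mask & (DIRTY_FS | DIRTY_FS_BASE)) |
            next_fs_selector) != 0;
}
```

With these placeholder values, a clean mask and a zero selector skip the reload; either a dirty FS bit or a non-zero selector forces it, which is the path that ends up in loadsegment() at line 1002.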

How about the other issues in this bug? Are things working well for the customer now?

Comment 26 Andrew Jones 2011-03-18 14:51:12 UTC
Closing as CURRENTRELEASE. If we're wrong about that, and the bug reappears on the latest release, then this can be reopened.