Bug 611978 - xen dom0, guests become unresponsive.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen (Show other bugs)
Version: 5.5
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Xen Maintainance List
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 514489
Reported: 2010-07-06 19:13 EDT by Bill Braswell
Modified: 2011-03-18 10:51 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-03-18 10:51:12 EDT


Description Bill Braswell 2010-07-06 19:13:51 EDT
The dom0 and all the guests on the customer's system become unresponsive to pings. All attempts to force a crash dump at these times have been unsuccessful. Neither Alt-Sysrq-c nor a dump triggered from the Hypervisor gets any response. The only things that still respond are dumping the registers and the run queues from within the Hypervisor.

It has been suggested that this may be related to the problem with “Xen on Arrandale where IPI's were getting”, and that it should be brought to the attention of Drew Jones.
Comment 1 Michal Novotny 2010-07-08 08:32:14 EDT
(In reply to comment #0)
> The dom0 and all the guests on the customer's system become unresponsive to
> pings. All attempts to force a crash dump at these times have been
> unsuccessful. Neither Alt-Sysrq-c nor a dump triggered from the Hypervisor
> gets any response. The only things that still respond are dumping the
> registers and the run queues from within the Hypervisor.
> 
> It has been suggested that this may be related to the problem with “Xen on
> Arrandale where IPI's were getting”, and that it should be brought to the
> attention of Drew Jones.

Bill, could you provide us steps to reproduce this issue?

Thanks,
Michal
Comment 2 Bill Braswell 2010-07-08 16:54:34 EDT
Michal,

The customer is not sure how it happens.  Their sys admin just notices nothing on the machine is responding.  It always happens with the same physical system, running the same guests.  We have asked them to move the guests to a different box to see if the problem follows the guests or stays with the hardware.  As of yet, we have not gotten any results from that.

Chris recommended opening this BZ and bringing it to the attention of Drew Jones.


Bill
Comment 3 Michal Novotny 2010-07-09 08:04:28 EDT
(In reply to comment #2)
> Michal,
> 
> The customer is not sure how it happens.  Their sys admin just notices nothing
> on the machine is responding.  It always happens with the same physical system,
> running the same guests.  We have asked them to move the guests to a different
> box to see if the problem follows the guests or stays with the hardware.  As of
> yet, we have not gotten any results from that.
> 
> Chris recommended opening this BZ and bringing it to the attention of Drew
> Jones.
> 
> 
> Bill    

It *may* be related to that bug if the processor you're running this on is Arrandale. I'm adding Drew to the CC list right now.

Michal
Comment 4 Andrew Jones 2010-07-12 07:08:02 EDT
Yes, this certainly sounds related to bug 570579. I also see a new bug being worked on bare-metal that looks related, bug 612659.
Comment 14 Andrew Jones 2010-11-18 02:56:28 EST
Switching to the _irq_ keyhandler works for me. Should I create a brew build? Or would you rather just do a quick build of xen.gz with this simple patch?

-static void do_crashdump_trigger(unsigned char key)
+static void do_crashdump_trigger(unsigned char key, struct cpu_user_regs *regs)

-    register_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");
+    register_irq_keyhandler('C', do_crashdump_trigger, "trigger a crashdump");
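
To illustrate why the switch can matter, here is a minimal user-space sketch. It is not the Xen source, and it rests on one assumption: that a handler registered with register_keyhandler is deferred to run later outside interrupt context, while one registered with register_irq_keyhandler runs directly from the console interrupt. Every model_* name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpu_user_regs. */
struct model_regs { unsigned long ip; };

typedef void (*model_deferred_fn)(unsigned char key);
typedef void (*model_irq_fn)(unsigned char key, struct model_regs *regs);

struct model_key_entry {
    int is_irq;                  /* 1: run directly from the interrupt */
    model_deferred_fn deferred;  /* used when is_irq == 0 */
    model_irq_fn irq;            /* used when is_irq == 1 */
};

static struct model_key_entry model_key_table[256];
static int model_pending_key = -1;  /* single-slot queue of deferred work */
static int model_crashdumps;        /* how many times a dump was triggered */

static void model_register_keyhandler(unsigned char key, model_deferred_fn fn)
{
    assert(fn != NULL);
    model_key_table[key].is_irq = 0;
    model_key_table[key].deferred = fn;
    model_key_table[key].irq = NULL;
}

static void model_register_irq_keyhandler(unsigned char key, model_irq_fn fn)
{
    assert(fn != NULL);
    model_key_table[key].is_irq = 1;
    model_key_table[key].deferred = NULL;
    model_key_table[key].irq = fn;
}

/* Called from the modelled console interrupt. */
static void model_handle_keypress(unsigned char key, struct model_regs *regs)
{
    struct model_key_entry *e = &model_key_table[key];

    if (e->is_irq && e->irq != NULL)
        e->irq(key, regs);        /* runs immediately */
    else if (e->deferred != NULL)
        model_pending_key = key;  /* runs later, if deferred work ever runs */
}

/* Modelled deferred-work pass; a wedged host never gets this far. */
static void model_run_deferred(int host_is_wedged)
{
    int key = model_pending_key;

    if (host_is_wedged || key < 0)
        return;
    model_pending_key = -1;
    if (!model_key_table[key].is_irq && model_key_table[key].deferred != NULL)
        model_key_table[key].deferred((unsigned char)key);
}

/* Two variants of the same trigger, matching the two signatures in the patch. */
static void model_crashdump_deferred(unsigned char key)
{
    (void)key;
    model_crashdumps++;
}

static void model_crashdump_irq(unsigned char key, struct model_regs *regs)
{
    (void)key;
    (void)regs;
    model_crashdumps++;
}
```

In this model, a 'C' keypress with the deferred handler only queues the key, so nothing happens while model_run_deferred is called with host_is_wedged set. With the IRQ variant registered, the same keypress bumps model_crashdumps straight away.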
Comment 19 Paolo Bonzini 2010-12-20 09:39:35 EST
Do you still have the packages for the kernel which gave this "xm dmesg" (especially the kernel-xen and debuginfo package)?
Comment 22 Paolo Bonzini 2011-01-10 12:33:48 EST
The messages are harmless, and I see no reason for them to use XENLOG_ERR. We just got that severity level from upstream. I'm downloading the kernel packages anyway to check what the address ffff828c8013073f is for.

Of the patches that went in between -17.1 and -26.1, the last one fixes a data corruption, so it's certainly possible that the corruption is the cause of this hang, and possibly of others that were reported in the past. However, the new kernel would "just fix it"; it wouldn't result in any of the messages of comment #18.
Comment 23 Andrew Jones 2011-03-18 10:05:25 EDT
The messages seem OK to me, just overly verbose. Here's what the address maps to:

# addr2line -fie xen-syms-2.6.18-194.26.1.el5.bz611978.debug ffff828c8013073f
load_segments
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1002
context_switch
/usr/src/debug/kernel-2.6.18/xen/arch/x86/domain.c:1362

 978 static void load_segments(struct vcpu *n)
...
 996     /*
 997      * Either selector != 0 ==> reload.
 998      * Also reload to reset FS_BASE if it was non-zero.
 999      */
1000     if ( unlikely((dirty_segment_mask & (DIRTY_FS | DIRTY_FS_BASE)) |
1001                   nctxt->user_regs.fs) )
1002         all_segs_okay &= loadsegment(fs, nctxt->user_regs.fs);
1003 

So we're just detecting that we need to reset the FS segment when scheduling a new domain (meaning the last domain touched it), and that causes a host trap because the segment doesn't exist (Trap 11 is TRAP_no_segment).
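
As a rough illustration of that check, here is a small user-space model of lines 1000-1002. It is a sketch, not the Xen code: the MODEL_DIRTY_* bit values, struct model_vcpu, and the fs_is_loadable flag are all hypothetical stand-ins, and the real loadsegment() is replaced by a function that simply reports whether the load would have faulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the real DIRTY_* constants are not shown here. */
#define MODEL_DIRTY_FS      (1u << 0)
#define MODEL_DIRTY_FS_BASE (1u << 1)

struct model_vcpu {
    uint16_t fs;          /* FS selector saved for the incoming vCPU */
    bool fs_is_loadable;  /* whether a usable descriptor exists for it */
};

/* Mirrors the quoted condition: reload if the previous vCPU dirtied FS
 * or its base, or if the incoming vCPU wants a non-zero selector. */
static bool model_fs_needs_reload(unsigned int dirty_segment_mask,
                                  const struct model_vcpu *next)
{
    assert(next != NULL);
    return ((dirty_segment_mask & (MODEL_DIRTY_FS | MODEL_DIRTY_FS_BASE)) |
            next->fs) != 0;
}

/* Stand-in for loadsegment(): 1 on success, 0 if the load trapped.
 * Loading the null selector always succeeds. */
static int model_loadsegment_fs(const struct model_vcpu *next)
{
    return (next->fs == 0 || next->fs_is_loadable) ? 1 : 0;
}

/* Returns the all_segs_okay flag after the FS step of load_segments(). */
static int model_load_fs(unsigned int dirty_segment_mask,
                         const struct model_vcpu *next)
{
    int all_segs_okay = 1;

    if (model_fs_needs_reload(dirty_segment_mask, next))
        all_segs_okay &= model_loadsegment_fs(next);
    return all_segs_okay;
}
```

In this model, a vCPU whose saved FS selector has no usable descriptor makes model_load_fs return 0. That corresponds to the trapped load behind the logged message. A dirty mask combined with a null selector reloads cleanly.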

How about the other issues in this bug? Are things working well for the customer now?
Comment 26 Andrew Jones 2011-03-18 10:51:12 EDT
Closing as CURRENTRELEASE. If we're wrong about that and the bug reappears on the latest release, this can be reopened.
