Bug 623181

Summary: [RHEL6] Dumping core of RHEL6 i386 PV guest immediately after it is created got error
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.6
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: Yufang Zhang <yuzhang>
Assignee: Xen Maintainance List <xen-maint>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ddutile, drjones, xen-maint
Doc Type: Bug Fix
Last Closed: 2010-08-13 09:18:22 UTC

Attachments:
xm dmesg logs (flags: none)
xend.log (flags: none)

Description Yufang Zhang 2010-08-11 14:32:02 UTC
Created attachment 438203 [details]
xm dmesg logs

Description of problem:
When we try to dump the core of a RHEL 6 i386 PV guest with memory size >= 1G, the xm dump-core command fails with an error.

Version-Release number of selected component (if applicable):
kernel-xen-devel-2.6.18-210.el5
xen-3.0.3-115.el5
xen-debuginfo-3.0.3-115.el5
xen-devel-3.0.3-115.el5
kernel-xen-2.6.18-210.el5
xen-libs-3.0.3-115.el5

RHEL 6 PV guest:
snapshot 10 (20100807.0)
kernel-2.6.32-59.1.el6.i386

How reproducible:
Not always but quite easy to reproduce

Steps to Reproduce:
1. Reboot the host to get a fresh environment
2. Create a RHEL6 i386 PV guest with memory=1024 maxmem=2048
3. Try to dump the core of the PV guest with the xm dump-core command
  
Actual results:
# xm dump-core vm1
Dumping core of domain: vm1 ...
Error: Failed to dump core: (1, 'Internal error', 'p2m_size < nr_pages -1 (0 < 3ffff')
Usage: xm dump-core [-L|--live] [-C|--crash] <Domain> [Filename]
 
Dump core for a specific domain.


Expected results:
The core dump completes successfully.

Additional info:
(1) Cannot reproduce this bug when the guest memory size is set to 512M
(2) RHEL6 x86_64 PV guests do not hit this problem
(3) Other PV guests (RHEL5, RHEL4) do not hit this problem
(4) Once a core dump of another PV guest has succeeded, this bug can no longer be reproduced on the same host unless you reboot the host and repeat steps 1 to 3.
(5) xm dmesg is attached. xend.log will be uploaded shortly.
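
As a rough cross-check of the error text, the hex value in the message can be converted to a memory size. This is only a sketch: it assumes the value is nr_pages - 1 counted in 4 KiB pages (the x86 base page size), and the helper name pages_to_mib is made up for illustration.

```shell
#!/bin/sh
# Convert the hex "nr_pages - 1" value from the error message into MiB.
# Assumption: pages are 4 KiB each (the x86 base page size).
pages_to_mib() {
    hex_value=$1                       # e.g. 3ffff, as printed in the error
    nr_pages=$(( 0x$hex_value + 1 ))   # the message prints nr_pages - 1
    echo $(( nr_pages * 4 / 1024 ))    # 4 KiB per page, 1024 KiB per MiB
}

pages_to_mib 3ffff   # prints 1024
```

Under that assumption, 0x3ffff corresponds to 1024 MiB, which matches the memory=1024 setting in the steps above, while p2m_size is reported as 0.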

Comment 1 Yufang Zhang 2010-08-11 14:33:29 UTC
Created attachment 438204 [details]
xend.log

Comment 2 Yufang Zhang 2010-08-11 14:38:51 UTC
One more piece of additional info:
Cannot reproduce this bug with a RHEL6 snapshot 8 (20100722.0) i386 PV guest.

Comment 3 Yufang Zhang 2010-08-12 02:10:17 UTC
Re-tested this bug on an i386 host; it did not hit this problem.

Comment 4 Andrew Jones 2010-08-12 13:22:27 UTC
I can't reproduce this at all. I've tried matching everything described here (host versions HV/userspace, guest kernel version, memory config, etc.), rebooting before each try, but it always works for me.

Am I missing an ingredient to get it to reproduce? Are you still able to reproduce it every time? What about with later guest kernels like -63?

Comment 5 Yufang Zhang 2010-08-13 03:12:23 UTC
(In reply to comment #4)
> I can't reproduce this at all. I've tried matching everything described here
> (host versions HV/userspace, guest kernel version, memory config, etc.),
> rebooting before each try, but it always works for me.
> 
> Am I missing an ingredient to get it to reproduce? Are you still able to
> reproduce it every time? What about with later guest kernels like -63?    

I think I know the origin of this bug: dumping core for the VM immediately after it is created.

Using the following command, I can always reproduce this bug:
# xm cr /tmp/xm-test.conf &&  xm dump-core vm1
Using config file "/tmp/xm-test.conf".
Using <class 'grub.GrubConf.GrubConfigFile'> to parse /grub/menu.lst
Started domain vm1
Dumping core of domain: vm1 ...
Error: Failed to dump core: (1, 'Internal error', 'p2m_size < nr_pages -1 (0 < 1ffff')
Usage: xm dump-core [-L|--live] [-C|--crash] <Domain> [Filename]

Dump core for a specific domain.

If you wait a while before dumping core for the VM, you may not hit the problem. How long you need to wait depends on the memory size of the VM. For example, for a 1024M VM, the following command:
# xm cr /tmp/xm-test.conf &&  sleep 1 && xm dump-core vm1
does not hit the problem, but the following command:
# xm cr /tmp/xm-test.conf &&  sleep 0.5 && xm dump-core vm1
does hit it.

However, this problem only exists for a RHEL6 i386 guest on a RHEL5.6 Xen x86_64 host. We didn't hit this problem with other guests, nor with previous versions of the RHEL6 i386 guest. I will test this problem with the new kernel.
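
For scripts that must dump core right after creating a guest, one possible workaround is to retry instead of guessing a fixed sleep. The following is a minimal sketch, not a tested fix: dump_core_with_retry, the XM variable, the attempt count, and the delay are all hypothetical, and it assumes that xm exits non-zero when dump-core fails and that the failure is transient.

```shell
#!/bin/sh
# Hypothetical helper: retry "xm dump-core" until it succeeds or the
# attempts run out. Assumes xm returns a non-zero exit status on failure.
XM=${XM:-xm}   # overridable so the helper can be exercised without Xen

dump_core_with_retry() {
    domain=$1
    attempts=${2:-5}   # illustrative default
    delay=${3:-1}      # seconds between attempts, illustrative default
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$XM" dump-core "$domain"; then
            return 0
        fi
        echo "dump-core attempt $i for $domain failed" >&2
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$(( i + 1 ))
    done
    return 1
}
```

It could be used in place of the fixed sleep, for example as "xm cr /tmp/xm-test.conf && dump_core_with_retry vm1". Retrying avoids hard-coding a delay that depends on the guest memory size.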

Comment 6 Yufang Zhang 2010-08-13 03:26:51 UTC
Still hit this problem after updating the guest kernel to -63.

Comment 7 Yufang Zhang 2010-08-13 03:29:51 UTC
Changed the summary of this bug for clarification.

Comment 8 Yufang Zhang 2010-08-13 08:16:19 UTC
Re-tested this bug with the reproducer in comment #5. Both rhel5 (i386 and x86_64) and rhel6 (i386 and x86_64) guests can hit this problem with the "# xm cr &&  xm dump-core" command. Furthermore, previous versions of RHEL6 PV guests can also hit this problem with the reproducer.

Comment 9 Andrew Jones 2010-08-13 08:52:34 UTC
Based on comment 8 we now see this bug is addressing the inability to capture a core early in the boot. I think there are many reasons that wouldn't work. It might be an interesting exercise to try and find out the earliest point a core can be captured by setting xen up to auto-dump guests and then booting a kernel that panics early, but I don't see that exercise as being a high priority. Mainly because even if it's not as early as we might like it to be, we probably wouldn't be able to fix it. IMO the priority of the bug should be very low, and likely it will be closed as WONTFIX.

Comment 10 Yufang Zhang 2010-08-13 09:04:06 UTC
Tried triggering a kernel panic at boot time to check whether this problem has any impact on automatic core dumps of crashed guests. Tested the following two scenarios:
(1) Edit the guest grub configuration to point to a wrong root filesystem
(2) In a RHEL6 x86_64 PV guest, downgrade the kernel package to -59, which triggers a crash on a RHEL5.6 host.

In both scenarios, a core file is generated automatically when the guest crashes. No error output is found in xend.log. So it seems that such crashes do not happen early enough to hit the problem.

Comment 11 Andrew Jones 2010-08-13 09:18:22 UTC
Nice work. Thanks for those extra tests. I think that's satisfactory for dump-core. I'm closing this as NOTABUG.

Comment 12 Don Dutile (Red Hat) 2010-08-13 22:48:46 UTC
Just an added note:

(1) hooking up the necessary hypervisor callback to dump when panic()
    is invoked happens relatively early in kernel boot, but it does take a wee bit of time.

(2) for hvm guests w/pv-hvm, that time is extended until xen-platform-pci
    (virtual xenbus pci device) is configured, at which point xen dumps
    will occur when a panic() occurs. Until then, you'll get the same scenario
    as early pv kernels.