Bug 673456 - Zombie-RHEL-pv-guest appears after save/restore PV guest to an insufficient partition and destroy the guest.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.6
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Michal Novotny
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 514500
 
Reported: 2011-01-28 09:41 UTC by Yuyu Zhou
Modified: 2014-02-02 22:38 UTC
CC List: 8 users

Fixed In Version: xen-3.0.3-126.el5
Doc Type: Bug Fix
Doc Text:
After a PV (paravirtualized) guest was saved, restored and then destroyed on a disk partition with insufficient space, the xend daemon preserved reference to the non-existent guest. As a consequence, a zombie guest sometimes appeared and was reported by the "xm list" command. With this update, a memory leak has been fixed in the xc_resume() function, and now no zombie guests appear in the described scenario.
Clone Of:
Environment:
Last Closed: 2011-07-21 09:15:28 UTC
Target Upstream Version:
Embargoed:


Attachments
xend log (35.95 KB, text/plain) - 2011-01-28 09:44 UTC, Yuyu Zhou
Fix memory leak in xc_resume code (2.81 KB, patch) - 2011-03-10 14:24 UTC, Michal Novotny
Fix memory leak in xc_resume code v2 (3.17 KB, patch) - 2011-03-14 12:04 UTC, Michal Novotny


Links
Red Hat Product Errata RHBA-2011:1070 (SHIPPED_LIVE): xen bug fix and enhancement update, last updated 2011-07-21 09:12:56 UTC

Description Yuyu Zhou 2011-01-28 09:41:51 UTC
Description of problem:

A zombie RHEL PV guest appears after saving/restoring a PV guest to a partition with insufficient space and then destroying the guest.


Version-Release number of selected component (if applicable):


Xen Version: 3.0.3-120.el5
Host Version: kernel-xen-2.6.18-238.el5
Guest Version: RHEL4 pv guest and RHEL5 pv guest (32bit and 64bit)


How reproducible:

sometimes



Steps to Reproduce:

1. Create one or more PV guests.

Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     4787     4 r-----    501.5
RHEL5-pv-32                               69      512     2 -b----      0.8
RHEL5-pv-64                               70      512     2 -b----      0.2


2. Save and restore the PV guests to a partition with insufficient space.
[root@dhcp-65-174 boot]# ( rm -fr /exp/100Msave/RHEL5-pv-32.save; xm save RHEL5-pv-32 /exp/100Msave/RHEL5-pv-32.save; xm restore /exp/100Msave/RHEL5-pv-32.save) &
( rm -fr /exp/100Msave/RHEL5-pv-64.save; xm save RHEL5-pv-64 /exp/100Msave/RHEL5-pv-64.save; xm restore /exp/100Msave/RHEL5-pv-64.save) &


[root@dhcp-65-174 boot]# Error: can't write guest state file /exp/100Msave/RHEL5-pv-32.save: No space left on device
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
Error: Restore failed
Usage: xm restore <CheckpointFile>

Restore a domain from a saved state.
Error: /usr/lib64/xen/bin/xc_save 22 70 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
Error: Restore failed
Usage: xm restore <CheckpointFile>

Restore a domain from a saved state.



[root@dhcp-65-174 boot]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     4787     4 r-----    503.5
RHEL5-pv-32                               69      511     2 -b----      6.5
RHEL5-pv-64                               70      511     2 -b----      3.2


3. Destroy the PV guests.
[root@dhcp-65-174 boot]# xm destroy 69; xm destroy 70
[root@dhcp-65-174 boot]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     4787     4 r-----    505.3
Zombie-RHEL5-pv-64                        70      512     2 --p--d      4.3



Actual results:

Zombie guests sometimes remain after the destroy operation. This happens only for PV guests and only after a save/restore to a partition with insufficient space. So far it has been observed with RHEL4 PV guests (both 32-bit and 64-bit) and RHEL5 PV guests (both 32-bit and 64-bit).


Expected results:

1. After step 2, the save/restore operation is interrupted by a "no space left" error and the original guests keep working fine.
2. After the guests are destroyed, no zombie guests are left.


Additional info:
xend.log attached.

Comment 1 Yuyu Zhou 2011-01-28 09:44:09 UTC
Created attachment 475758 [details]
xend log

Comment 2 Andrew Jones 2011-01-31 11:54:37 UTC
Has this ever worked? I think these types of tests are always doomed to fail unless we have suspend cancellation support (bug 497080).

Comment 3 Yuyu Zhou 2011-02-09 03:24:07 UTC
(In reply to comment #2)
> Has this ever worked? I think these types of tests are always doomed to fail
> unless we have suspend cancellation support (bug 497080).

Most of the time the test fails, but there is at least one scenario where it works: when the partition is extremely short on space.

When an error similar to (Error: /usr/lib64/xen/bin/xc_save 22 70 0 0 0 failed) appears, a zombie guest will be left.

When an error similar to (Error: can't write guest state file /exp/100Msave/RHEL5-pv-32.save: No space left on device) appears, no zombie guest will be left.

Comment 4 Andrew Jones 2011-02-09 12:55:34 UTC
Ah, OK. Without digging into the code, it sounds like we "succeed" the failed suspend and recovery if we fail within the tools before actually beginning the suspend within the guest kernel. If we get past the places in the tools that could fail and attempt to really suspend, then we get into trouble because we can't cancel the suspend (no suspend cancellation support). I believe that means this bug is a dup of bug 497080.

Comment 5 Miroslav Rezanina 2011-02-09 13:48:30 UTC
Just a note - this is a problem in xend; it still holds a reference to the non-existent guest. The guest can keep working after a cancelled suspend with no visible symptoms. However, when the guest is destroyed, xend still keeps some information related to it even though the guest no longer exists in the HV or xenstore -> zombie record.

Comment 6 Andrew Jones 2011-02-09 14:00:28 UTC
OK, so this bug doesn't need to be dupped. This bug can cover the case where, since we don't support suspend cancellation or handle it gracefully in xend, xend's state can get messed up.

Comment 7 Michal Novotny 2011-03-02 16:26:34 UTC
Well, I've been able to reproduce it. I added some debug logging to see what's going on and found that the information coming from the xen_domains() method in xend/XendDomain.py is bogus: prior to the destroy call it reports "{'paused': 0, 'dying': 0, 'mem_kb': 524288L, 'running': 1, }", but after the destroy call it reports "{'paused': 1, 'dying': 1, 'mem_kb': 8L, 'running': 0 }", which is what leaves the domain there in the zombie state.

The xen_domains() method reads this information using the domain_getinfo() call, which basically issues the XEN_DOMCTL_getdomaininfo hypercall.

I'll investigate this further in the hypervisor code, since I think the issue is in the hypervisor: this DOMCTL is still reporting a domain that should no longer be listed.
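
For reference, roughly the same per-domain state can be read directly through libxc with xc_domain_getinfo(), which issues that same XEN_DOMCTL_getdomaininfo hypercall underneath. A minimal sketch, assuming the Xen 3.x libxc API (int handle from xc_interface_open(), xc_dominfo_t exposing dying/paused/running/nr_pages); it is only an illustration, not part of any fix:

/* Build with something like: gcc -o dominfo dominfo.c -lxenctrl */
#include <stdio.h>
#include <xenctrl.h>

int main(void)
{
    xc_dominfo_t info[64];
    int i, n;
    int xc_handle = xc_interface_open();

    if (xc_handle < 0)
        return 1;

    /* Fetch info for up to 64 domains, starting at domid 0. */
    n = xc_domain_getinfo(xc_handle, 0, 64, info);
    for (i = 0; i < n; i++)
        printf("dom %u: dying=%u paused=%u running=%u nr_pages=%lu\n",
               (unsigned)info[i].domid, (unsigned)info[i].dying,
               (unsigned)info[i].paused, (unsigned)info[i].running,
               (unsigned long)info[i].nr_pages);
    /* A leaked guest shows up here as dying=1, paused=1 with a few pages
     * still charged to it -- the zombie record reported by `xm list`. */

    xc_interface_close(xc_handle);
    return 0;
}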

Michal

Comment 8 Michal Novotny 2011-03-02 22:44:49 UTC
I added debugging output to the hypervisor's common/domctl.c source file and then rebooted dom0.

After the dom0 reboot and PV guest creation, the following lines were present in the `xm dmesg` output:

(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #0 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #1 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #2 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #3 found
(XEN) domctl.c:122:d0 getdomaininfo: Domain 0 VCPU count is 4
(XEN) domctl.c:107:d0 getdomaininfo: Domain 1 VCPU #0 found
(XEN) domctl.c:122:d0 getdomaininfo: Domain 1 VCPU count is 1

When I tried to save the guest to the partition with insufficient disk space I was still able to see:

(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #0 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #1 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #2 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #3 found
(XEN) domctl.c:122:d0 getdomaininfo: Domain 0 VCPU count is 4
(XEN) domctl.c:107:d0 getdomaininfo: Domain 1 VCPU #0 found
(XEN) domctl.c:122:d0 getdomaininfo: Domain 1 VCPU count is 1

for this iteration of `xm list`. After a xend restart there is the following output (note that the Domain 1 lines are missing):

(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #0 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #1 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #2 found
(XEN) domctl.c:107:d0 getdomaininfo: Domain 0 VCPU #3 found
(XEN) domctl.c:122:d0 getdomaininfo: Domain 0 VCPU count is 4

Investigating further, I found that besides the many occurrences of "Domain 0 VCPU #x found" and "Domain 1 VCPU #0 found" there are also 2 occurrences of the following line:

(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version

in the middle of those lines. This line comes from the common/sysctl.c source file when the command is XEN_SYSCTL_physinfo, although this command is implemented in arch/x86/sysctl.c, so I'm wondering whether there is some reset code involved.

It seems that the reinitialization is done by XendDomain.instance() from SrvDomainDir.py:39, so I've been investigating there and found that there is a xen_domains() call right before the xend stop signal and also right after the xend start, so I was able to see the following:

[2011-03-02 23:37:32 xend 7629] DEBUG (XendDomain:153) xen_domains() with the zombie domain entry
[2011-03-02 23:38:54 xend 7628] INFO (SrvDaemon:190) Xend stopped due to signal 15.
[2011-03-02 23:38:54 xend 8224] DEBUG (XendDomain:153) xen_domains() without the zombie domain entry

Some further investigation will be necessary; however, I recall bug 589123, which was about the event channel not providing the data required for domain destroy and thereby leaving a zombie record, so it may be relevant here.

Unfortunately I haven't yet found out what removes the zombie domain from the list on xend restart, but I think we're getting closer.

Michal

Comment 9 Michal Novotny 2011-03-03 15:41:10 UTC
Well, I've found the problem. It seems to be in the hypervisor. For a normal (non-resumed) domain destruction there are the following debugging lines (I added them myself, so you won't see them without patching the code):

(XEN) domain.c:350:d0 domain_kill: Ref count is 2, decrementing by 1
(XEN) domain.c:350:d0 domain_kill: Ref count is now 1
(XEN) domain.c:354:d0 domain_kill: Domain 1 is dead and ref count is 1
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 4)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 3)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 2)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 1)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 0)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 1
(XEN) page_alloc.c:917:d0 free_domheap_pages: Ref count is 1, decrementing by 1
(XEN) domain.c:541:d0 domain_destroy: Destroying domain 2 (dying 2)
(XEN) domain.c:565:d0 domain_destroy: Scheduling domain destruction for domain 2

And now, for the case of a resumed domain, the output is:

(XEN) domain.c:350:d0 domain_kill: Ref count is 2, decrementing by 1
(XEN) domain.c:350:d0 domain_kill: Ref count is now 1
(XEN) domain.c:354:d0 domain_kill: Domain 2 is dead and ref count is 1
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 6)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 5)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 4)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 3)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 2)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) domain.c:541:d0 domain_destroy: Destroying domain 2 (dying 2)

As you can see, there are 2 more pages in total that are never freed, so the tot_pages variable cannot drop to 0 and drop_dom_ref can never be set to 1. That is required to destroy the domain, but since the page count is higher by 2 it can't reach zero.

It looks like free_domheap_pages() only frees the last 4 pages and expects the rest to have been released before this point.

Also, I know it's strange that this happens on xend restart, but I was able to see the following new lines in the `xm dmesg` output after the restart:

(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 1)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 0
(XEN) page_alloc.c:899:d0 free_domheap_pages: dying = 2 (tot_pages = 0)
(XEN) page_alloc.c:914:d0 free_domheap_pages: drop_dom_ref = 1
(XEN) page_alloc.c:917:d0 free_domheap_pages: Ref count is 1, decreasing by 1
(XEN) domain.c:541:d0 domain_destroy: Destroying domain 2 (dying 2)

The only thing that happens on the restart is that the hypervisor calls put_page() from include/asm-x86/mm.h, and since the unlikely((nx & PGC_count_mask) == 0) condition is met, it frees the remaining memory pages that had not been freed yet.
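
To make that accounting concrete, here is a minimal, self-contained model (my simplification, not the actual Xen sources) of the behaviour shown in the logs above, i.e. that the last domain reference can only be dropped once tot_pages reaches zero:

#include <stdio.h>

struct domain {
    int tot_pages;  /* domheap pages still charged to the domain */
    int refcnt;     /* references held on the domain structure   */
};

static void free_domheap_page(struct domain *d)
{
    int drop_dom_ref = (--d->tot_pages == 0);

    printf("free_domheap_pages: tot_pages = %d, drop_dom_ref = %d\n",
           d->tot_pages, drop_dom_ref);
    if (drop_dom_ref && --d->refcnt == 0)
        printf("domain_destroy: scheduling domain destruction\n");
}

int main(void)
{
    struct domain d = { 6, 1 };

    /* domain_kill() frees everything except the two pages still mapped
     * by the tools after the failed save: destruction never happens. */
    while (d.tot_pages > 2)
        free_domheap_page(&d);

    /* Once those mappings are dropped (by the fix, or as a side effect
     * of the xend restart / put_page path), the last two frees finally
     * let the domain be destroyed. */
    while (d.tot_pages > 0)
        free_domheap_page(&d);

    return 0;
}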

Based on the fact that the page count is wrong after the resume, I guess the bug is in the hypervisor itself, and restarting the xend daemon only reinitializes the connection, which is when the HV frees the memory.

I would prefer finding the root cause in the hypervisor itself for the reasons mentioned above.

Andrew, what do you think about this? Feel free to reassign to kernel-xen space if it makes sense to you.

Thanks,
Michal

Comment 10 Michal Novotny 2011-03-03 15:51:31 UTC
[snip]

> (XEN) domain.c:541:d0 domain_destroy: Destroying domain 2 (dying 2)
> (XEN) domain.c:565:d0 domain_destroy: Scheduling domain destruction for domain
> 2
> 

Oh, just a note: this should be domain 1 in both lines. I made a mistake there and printed is_dying instead of domain_id.

Michal

Comment 13 Andrew Jones 2011-03-07 14:36:39 UTC
(In reply to comment #9)
> Andrew, what do you think about this? Feel free to reassign to kernel-xen space
> if it makes sense to you.

As discussed, we need to investigate this further from both userspace and HV sides. It's still unclear if there's an HV bug or if userspace isn't forcing a cleanup when it should be. In any case I'm retracting my statement to dup this to 497080, because I believe one way or another we should keep the system clean (no zombie domains), even when running guests that don't support suspend cancellation.

Comment 14 Michal Novotny 2011-03-08 19:32:22 UTC
Well, I have most likely found the block of memory in the restore code (used for resume) that's responsible for this leak. It's in the user-space stack after all, and a simple patch like:

diff --git a/tools/libxc/xc_resume.c b/tools/libxc/xc_resume.c
index 811989f..85ae93b 100644
--- a/tools/libxc/xc_resume.c
+++ b/tools/libxc/xc_resume.c
@@ -264,6 +264,10 @@ static int xc_domain_resume_any(int xc_handle, uint32_t domid)
         munmap(p2m_frame_list, P2M_FLL_ENTRIES*PAGE_SIZE);
     if (p2m_frame_list_list)
         munmap(p2m_frame_list_list, PAGE_SIZE);
+    if (live_p2m_frame_list)
+        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES*PAGE_SIZE);
+    if (live_p2m_frame_list_list)
+        munmap(live_p2m_frame_list_list, PAGE_SIZE);
     if (shinfo)
         munmap(shinfo, PAGE_SIZE);
 #endif

solves the issue. However, Andrew and I have been discussing it and we need to test with more guests, most likely a RHEL-5 PV guest as well as a Fedora-rawhide guest that should support suspend cancellation, so that we cover guests both with and without the SUSPEND_CANCELLATION bits.

I've tested this with a RHEL-5 guest and it worked fine: the guest saved successfully, restored successfully, and resumed after a failed save. Everything worked fine, but I need to do some testing with Fedora-rawhide now.

Michal

Comment 15 Michal Novotny 2011-03-09 09:42:06 UTC
Well, I compared the behaviour with and without this patch applied for a RHEL-5 PV guest and a Fedora-14 PV guest, and it doesn't seem to regress anything.

The test matrix covered the normal save/restore path to a location with enough disk space, and also saving to a location without enough disk space so that the guest is resumed automatically using the resume code.

For the RHEL-5 guest everything worked fine both with and without my patch. For Fedora-14, after the save failed, the resume left the guest in a stuck (although running) state both with and without my patch, so the patch didn't regress anything and the behaviour with and without my patch was the same in both cases (for both guests).

Fedora-14 was tried because it should support the SUSPEND_CANCELLATION bits; however, we're not sure it supports them the right way, so we should test a Fedora-14 PV guest on top of a SLES host to confirm whether the SUSPEND_CANCELLATION bits work correctly in this case.

Since Fedora-14 cannot resume successfully after a failed save, we should also file an additional patch to check for enough disk space before trying to save the guest, to avoid running into these issues. In any case it would be good to have the patch above applied, since it doesn't seem to regress anything, and if the save fails for any other reason it will at least try to resume the guest (which is confirmed to work for a RHEL-5.6 guest).

In any case I think having this patch could be a good thing.

Any objections?

Michal

Comment 17 Michal Novotny 2011-03-10 10:35:32 UTC
I've been talking to Pavel, and according to his testing of a SLES-11 x86_64 guest on top of a SLES-11 x86_64 host, SUSPEND_CANCELLATION was working fine. I tried it on top of RHEL-5, since we don't have the SUSPEND_CANCELLATION bits backported to our hypervisor; however, if the save fails we resume the guest, which we have had for some time now.

WITHOUT THIS PATCH APPLIED
==========================

When I tried to save and restore the SLES-11 x86_64 PV guest (using a partition with enough disk space), after the restore the guest was not responding correctly in the VNC window, but it was responding fine on the console.

When I tried to save to a partition without enough space, the guest hit a kernel panic:

[   54.356192] ------------[ cut here ]------------
[   54.356209] kernel BUG at /usr/src/packages/BUILD/kernel-xen-2.6.32.12/linux-2.6.32/arch/x86/mm/hypervisor.c:77!
[   54.356215] invalid opcode: 0000 [#1] SMP 
[   54.356221] last sysfs file: /sys/devices/xen/vif-0/net/eth0/statistics/collisions
[   54.356225] CPU 0 
[   54.356228] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet microcode ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop dm_mod rtc_core joydev rtc_lib xennet ext3 mbcache jbd processor thermal_sys hwmon xenblk cdrom
[   54.356288] Supported: Yes
[   54.356293] Pid: 3363, comm: suspend Not tainted 2.6.32.12-0.7-xen #1 
[   54.356297] RIP: e030:[<ffffffff803413e3>]  [<ffffffff803413e3>] setup_vcpu_info+0x53/0x110
[   54.356311] RSP: e02b:ffff88003ef47e50  EFLAGS: 00010082
[   54.356316] RAX: ffffffffffffffea RBX: ffff8800013bd7c0 RCX: 0000000000219415
[   54.356320] RDX: ffff88003ef47e50 RSI: 0000000000000000 RDI: 000000000000000a
[   54.356325] RBP: 0000000000000000 R08: 00003ffffffff000 R09: 00003ffffffff000
[   54.356329] R10: ffff8800013bd8b0 R11: ffff8800013bd820 R12: 0000000000000000
[   54.356334] R13: 0000000000000000 R14: ffffffff80629d60 R15: ffff8800013c3088
[   54.356343] FS:  00007f32e0d20710(0000) GS:ffff8800013b7000(0000) knlGS:0000000000000000
[   54.356348] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   54.356353] CR2: 00007fe697db32fc CR3: 000000003fd65000 CR4: 0000000000002620
[   54.356357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   54.356362] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[   54.356367] Process suspend (pid: 3363, threadinfo ffff88003ef46000, task ffff880030502440)
[   54.356371] Stack:
[   54.356375]  0000000000219415 ffff8800000007c0 00000000002199e9 000000018000aafb
[   54.356381] <0> ffff8800013bc020 0000000000000000 0000000000000000 ffffffff8025d777
[   54.356389] <0> 0000000200000000 0000000000000000 ffff88003ef47f00 0000000000000000
[   54.356399] Call Trace:
[   54.356416]  [<ffffffff8025d777>] post_suspend+0x187/0x3f0
[   54.356424]  [<ffffffff8025daf6>] take_machine_down+0x116/0x220
[   54.356430]  [<ffffffff8025ddd9>] __xen_suspend+0x1d9/0x330
[   54.356437]  [<ffffffff8025d572>] xen_suspend+0x52/0xd0
[   54.356445]  [<ffffffff80007f0a>] child_rip+0xa/0x20
[   54.356451] Code: 01 26 00 00 e8 4a 17 30 00 48 89 04 24 89 d8 bf 0a 00 00 00 89 ee 25 ff 0f 00 00 48 89 e2 89 44 24 08 e8 21 1f cc ff 85 c0 74 0d <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 48 83 c4 28 5b 5d c3 66 
[   54.356513] RIP  [<ffffffff803413e3>] setup_vcpu_info+0x53/0x110
[   54.356519]  RSP <ffff88003ef47e50>
[   54.356525] ---[ end trace 8ab6c5c1a6f275b8 ]---

WITH THIS PATCH APPLIED
=======================

After applying this patch I tried to save and restore the SLES-11 x86_64 PV guest both to a partition with enough space and to a partition without enough space. When I saved to the partition with enough space and restored the guest, it worked correctly after the restore in both the VNC window and the console.

When I tried to save to the partition without enough space, the guest hit a kernel panic again:

[  392.312626] ------------[ cut here ]------------
[  392.312642] kernel BUG at /usr/src/packages/BUILD/kernel-xen-2.6.32.12/linux-2.6.32/arch/x86/mm/hypervisor.c:77!
[  392.312648] invalid opcode: 0000 [#1] SMP 
[  392.312654] last sysfs file: /sys/devices/xen/vif-0/net/eth0/statistics/collisions
[  392.312661] CPU 0 
[  392.312665] Modules linked in: microcode iscsi_ibft ppdev parport_pc lp parport sg st sd_mod crc_t10dif sr_mod scsi_mod ide_cd_mod ide_core ip6t_REJECT ip6t_LOG nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_tcpudp xt_pkttype ipt_LOG xt_limit xt_state iptable_raw ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipv6 af_packet fuse loop dm_mod rtc_core joydev rtc_lib xennet ext3 mbcache jbd processor thermal_sys hwmon xenblk cdrom
[  392.312740] Supported: Yes
[  392.312746] Pid: 25163, comm: suspend Not tainted 2.6.32.12-0.7-xen #1 
[  392.312750] RIP: e030:[<ffffffff803413e3>]  [<ffffffff803413e3>] setup_vcpu_info+0x53/0x110
[  392.312764] RSP: e02b:ffff8800398b3e50  EFLAGS: 00010082
[  392.312769] RAX: ffffffffffffffea RBX: ffff8800013bd7c0 RCX: 0000000000001299
[  392.312773] RDX: ffff8800398b3e50 RSI: 0000000000000000 RDI: 000000000000000a
[  392.312778] RBP: 0000000000000000 R08: 00003ffffffff000 R09: 00003ffffffff000
[  392.312782] R10: ffff8800013bd8b0 R11: ffff8800013bd820 R12: 0000000000000000
[  392.312787] R13: 0000000000000000 R14: ffffffff80629d60 R15: ffff8800013c3088
[  392.312796] FS:  00007f92d683b700(0000) GS:ffff8800013b7000(0000) knlGS:0000000000000000
[  392.312802] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  392.312806] CR2: 0000000005a412e0 CR3: 0000000039859000 CR4: 0000000000002620
[  392.312811] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  392.312816] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[  392.312821] Process suspend (pid: 25163, threadinfo ffff8800398b2000, task ffff88003fafe400)
[  392.312826] Stack:
[  392.312829]  0000000000001299 ffff8800000007c0 00000000000036e3 000000018000aafb
[  392.312835] <0> ffff8800013bc020 0000000000000000 0000000000000000 ffffffff8025d777
[  392.312844] <0> 0000000200000000 0000000000000000 ffff8800398b3f00 0000000000000000
[  392.312854] Call Trace:
[  392.312871]  [<ffffffff8025d777>] post_suspend+0x187/0x3f0
[  392.312879]  [<ffffffff8025daf6>] take_machine_down+0x116/0x220
[  392.312885]  [<ffffffff8025ddd9>] __xen_suspend+0x1d9/0x330
[  392.312892]  [<ffffffff8025d572>] xen_suspend+0x52/0xd0
[  392.312901]  [<ffffffff80007f0a>] child_rip+0xa/0x20
[  392.312906] Code: 01 26 00 00 e8 4a 17 30 00 48 89 04 24 89 d8 bf 0a 00 00 00 89 ee 25 ff 0f 00 00 48 89 e2 89 44 24 08 e8 21 1f cc ff 85 c0 74 0d <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 48 83 c4 28 5b 5d c3 66 
[  392.312973] RIP  [<ffffffff803413e3>] setup_vcpu_info+0x53/0x110
[  392.312979]  RSP <ffff8800398b3e50>
[  392.312984] ---[ end trace 4f1a33d2523a1b94 ]---


So based on this, the patch doesn't regress a guest with SUSPEND_CANCELLATION, and it improves the results for restoring such a guest (using a normal `xm restore` on a properly saved image).

Drew, what do you think about this?

Thanks,
Michal

Comment 18 Andrew Jones 2011-03-10 10:59:35 UTC
Please analyse and explain precisely how your patch fixes the issue with VNC. Also please explain your comment in comment 15

"For RHEL-5 guest everything was working fine with and without my patch applied"

What happened to the zombies without the patch for RHEL-5 guests?

Assuming the zombies are gone in all cases and that the suspend-cancellation-enabled guests work fine for the normal case (enough disk space), the patch looks good to me. This testing also suggests we should strongly consider getting suspend cancellation into the RHEL HV (which can be done under bug 497080, along with the guest kernel updates), as we currently panic guests that expect it to be there.

Comment 19 Michal Novotny 2011-03-10 11:36:00 UTC
(In reply to comment #18)
> Please analyse and explain precisely how your patch fixes the issue with VNC.
> Also please explain your comment in comment 15


Well, I retested this using the save file of the SLES guest and I could no longer reproduce it. Apparently there was some error in the save image.


> 
> "For RHEL-5 guest everything was working fine with and without my patch
> applied"
> 
> What happened to the zombies without the patch for RHEL-5 guests?
> 


I meant that the guest was working fine until you issued a shutdown, but without the patch it left a zombie on shutdown. With the patch there was no zombie.


> Assuming the zombies are gone in all cases and that the suspend cancellation
> enabled guests work fine for the normal case (enough disk space), then the
> patch looks good to me.


So do you think this patch is good, and should I send it to the list?


> This testing also suggests we should highly consider
> getting suspend cancellation into the RHEL HV (which can be done under bug
> 497080, along with the guest kernel updates), as we currently panic guests that
> suspect it to be there.


Getting suspend cancellation into the RHEL-5 HV is definitely a good thing; however, I'm afraid it may have been broken for some time already by the resume-on-save-failure patch (I don't know the BZ number offhand).

Michal

Comment 20 Andrew Jones 2011-03-10 12:36:55 UTC
(In reply to comment #19)
> So do you think this patch is good and should I send it to the list ?
> 

OK

> Getting suspend cancellation for RHEL-5 HV is definitely a good thing however
> I'm afraid that this may be broken by the resume on save failure patch (I don't
> know the BZ number now) for some time now already.

I guess we'll cross that bridge when/if it presents itself.

Comment 21 Michal Novotny 2011-03-10 14:24:55 UTC
Created attachment 483458 [details]
Fix memory leak in xc_resume code

Hi,
this is the patch for BZ #673456 to fix the zombie issue seen when you try
to destroy a PV guest after the guest was resumed because of a failed save.
There was a memory leak: xc_resume kept memory mapped from the hypervisor,
so the hypervisor was unable to free that memory, which resulted in the
zombie guest record. This simple fix unmaps the live memory to ensure
those 2 pages are freed successfully.

Testing: This has been tested using RHEL-5 i386 and x86_64 guests, a
         SLES-11 x86_64 PV guest, and a Fedora-14 PV guest, covering
         save/restore through the normal path (i.e. using a location
         with enough space and restoring from that location) and also
         saving the guest to a partition without enough space. For
         RHEL-5 the guest resumed fine after a save to a partition
         without enough space both with and without my patch, except
         that without my patch zombies were left when destroying or
         shutting down the guest, while with my patch no zombies were
         left.
         For the Fedora-14 guest the normal save/restore path worked
         successfully, but resume after a failed save did not work
         either with or without my patch.
         For the SLES-11 guest (which has been confirmed to have
         SUSPEND_CANCELLATION) the normal save/restore path worked fine
         both with and without my patch; however, the guest was unable
         to recover from a save failure, since there was a kernel panic
         in both cases.

         To summarize, no regressions were found and the patch solves
         the zombie issue by freeing up the memory mapped from the
         hypervisor.

Michal

Comment 23 Michal Novotny 2011-03-14 12:04:47 UTC
Created attachment 484154 [details]
Fix memory leak in xc_resume code v2

Differences between v1 and v2:
 - The p2m_frame_list and p2m_frame_list_list buffers are malloc'ed, so
   they have to be freed rather than unmapped to clean up their leaks as
   well. This patch fixes that too (see the sketch below).
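
For clarity, this is roughly the shape of the cleanup path after v2, reusing the variable names from the v1 hunk above; it is only a sketch, not the attachment itself, and PAGE_SIZE / P2M_FLL_ENTRIES are stubbed purely to keep the fragment self-contained:

#include <stdlib.h>
#include <sys/mman.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096            /* placeholder for the libxc value */
#endif
#define P2M_FLL_ENTRIES 16        /* placeholder; derived from the p2m size */

static void resume_cleanup(void *p2m_frame_list, void *p2m_frame_list_list,
                           void *live_p2m_frame_list,
                           void *live_p2m_frame_list_list, void *shinfo)
{
    /* Local copies are malloc'ed, so release them with free() (v2 change). */
    free(p2m_frame_list);
    free(p2m_frame_list_list);

    /* The "live" lists and shared info are foreign mappings; munmap() them
     * so the hypervisor can drop the last page references (v1 change). */
    if (live_p2m_frame_list)
        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
    if (live_p2m_frame_list_list)
        munmap(live_p2m_frame_list_list, PAGE_SIZE);
    if (shinfo)
        munmap(shinfo, PAGE_SIZE);
}

int main(void)
{
    /* Stand-ins for the buffers xc_domain_resume_any() works with. */
    void *p2m_fl   = malloc(P2M_FLL_ENTRIES * PAGE_SIZE);
    void *p2m_fll  = malloc(PAGE_SIZE);
    void *live_fl  = mmap(NULL, P2M_FLL_ENTRIES * PAGE_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *live_fll = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *shinfo   = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    resume_cleanup(p2m_fl, p2m_fll,
                   live_fl  == MAP_FAILED ? NULL : live_fl,
                   live_fll == MAP_FAILED ? NULL : live_fll,
                   shinfo   == MAP_FAILED ? NULL : shinfo);
    return 0;
}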

Michal

Comment 25 Miroslav Rezanina 2011-03-17 09:55:27 UTC
Fix built into xen-3.0.3-126.el5

Comment 27 Yuyu Zhou 2011-03-22 07:48:07 UTC
The bug was reproduced on xen-3.0.3-120.el5 and the fix verified on xen-3.0.3-126.el5.

Steps to Reproduce:
on xen-3.0.3-120.el5
1. Create a PV guest.
2. Save the guest to a partition with insufficient space; an error like the following shows up:
Error: /usr/lib64/xen/bin/xc_save 4 60 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
3. Destroy the guest and list the guests with xm list; a zombie guest shows up.
#xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     3744     4 r-----   3622.6
Zombie-RHEL6.1-64-PV                      59     1024     1 --p--d     46.4

Steps to Verify:
on xen-3.0.3-126.el5
1. Create a PV guest.
2. Save the guest to a partition with insufficient space; an error like the following shows up:
Error: /usr/lib64/xen/bin/xc_save 22 62 0 0 0 failed
Usage: xm save <Domain> <CheckpointFile>

Save a domain state to restore later.
3. Destroy the guest and list the guests with xm list; no zombie guest shows up.
# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     3744     4 r-----   3642.3

Comment 28 Tomas Capek 2011-07-13 13:23:00 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
After a PV (paravirtualized) guest was saved, restored and then destroyed on a disk partition with insufficient space, the xend daemon preserved reference to the non-existent guest. As a consequence, a zombie guest sometimes appeared and was reported by the "xm list" command. With this update, a memory leak has been fixed in the xc_resume() function, and now no zombie guests appear in the described scenario.

Comment 29 errata-xmlrpc 2011-07-21 09:15:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

Comment 30 errata-xmlrpc 2011-07-21 12:07:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1070.html

