Bug 613513

Summary: RHEL6 PV guest hangs there after migrating back from remote host to local host
Product: Red Hat Enterprise Linux 5
Reporter: Yufang Zhang <yuzhang>
Component: xen
Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED DUPLICATE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 5.6
CC: drjones, joe.jin, leiwang, llim, minovotn, mrezanin, mshao, pbonzini, xen-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-01-14 14:28:49 UTC
Bug Blocks: 514499    
Attachments (description, flags):
xend.log of local host (flags: none)
xend.log of remote host (flags: none)
config file used by the guest (flags: none)
screen dump of the guest on the remote host (flags: none)
xend.log of destination host (flags: none)
xend.log of src host (flags: none)

Description Yufang Zhang 2010-07-12 05:41:15 UTC
Created attachment 431072 [details]
xend.log of local host

Description of problem:
A RHEL6 PV guest hangs after being migrated back from the remote host to the local host. We can neither ping nor ssh to the guest, nor get any output from the guest's console. The guest's CPU times remain steadily low.

Version-Release number of selected component (if applicable):
xen-libs-3.0.3-113.el5
xen-3.0.3-113.el5
xen-debuginfo-3.0.3-113.el5
kernel-xen-2.6.18-203.el5
xen-devel-3.0.3-113.el5
kernel-xen-devel-2.6.18-203.el5

Guest Version:
RHEL6 Beta2 snapshot7 (RHEL6.0-20100707.4)

How reproducible:
Always

Steps to Reproduce:
1. Start the PV guest with shared storage set up properly (NFS in this case)
2. Try to ssh to or ping the guest
3. Migrate the guest to a remote host
4. Try to ssh to or ping the guest again
5. Migrate the guest back from the remote host to the local host
6. Try to ssh to or ping the guest again
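
The steps above could be scripted roughly as follows. This is only a sketch, not the procedure actually used for this report: it assumes the `xm migrate <domain> <host>` form of the command, assumes the return migration is started on the remote host over ssh, and uses `ping` as the reachability check. The function names and the host names in the usage note are hypothetical.

```python
import subprocess


def ping_pong_plan(domain, guest_ip, local_host, remote_host):
    """Return the reproduction steps as a list of (description, argv) pairs."""
    check = ["ping", "-c", "3", guest_ip]
    return [
        ("check guest before migration", check),
        ("migrate to remote host", ["xm", "migrate", domain, remote_host]),
        ("check guest on remote host", check),
        # Assumption: the return trip is started on the remote host via ssh.
        ("migrate back to local host",
         ["ssh", remote_host, "xm", "migrate", domain, local_host]),
        ("check guest after migrating back", check),
    ]


def run_plan(plan, runner=subprocess.call):
    """Run each step in order; return the first failing step's description, or None."""
    for description, argv in plan:
        if runner(argv) != 0:
            return description
    return None
```

For example, `run_plan(ping_pong_plan("vm1", "10.66.93.188", "local-host", "remote-host"))` would return "check guest after migrating back" if the behaviour in this report reproduces. The `runner` parameter exists only so that the plan can be exercised without a Xen host.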
  
Actual results:
1. At steps 2 and 4, we could ssh to the guest successfully.
2. At step 6, we could neither ssh to nor ping the guest:
# ping 10.66.93.188
PING 10.66.93.188 (10.66.93.188) 56(84) bytes of data.
From 10.66.93.223 icmp_seq=1 Destination Host Unreachable
From 10.66.93.223 icmp_seq=2 Destination Host Unreachable
From 10.66.93.223 icmp_seq=3 Destination Host Unreachable

--- 10.66.93.188 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5018ms
, pipe 3

# ssh 10.66.93.188
ssh: connect to host 10.66.93.188 port 22: No route to host
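
When this check is automated, the packet-loss figure can be pulled out of the ping summary. The helper below is a minimal sketch with a hypothetical function name; it assumes the Linux `ping` summary format shown above.

```python
import re


def packet_loss_percent(ping_output):
    """Return the packet-loss percentage from ping's summary line, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None
```

Applied to the summary line above, it returns 100.0, which matches the unreachable guest seen at step 6.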

xm console shows nothing, and the guest's CPU times remain steadily low:
# xm vcpu-list 24
Name                              ID VCPUs   CPU State   Time(s) CPU Affinity
vm1                               24     0     3   -b-       0.3 any cpu
vm1                               24     1     3   -b-       0.0 any cpu
vm1                               24     2     1   -b-       0.0 any cpu
vm1                               24     3     3   -b-       0.0 any cpu
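
One way to confirm that the CPU times really are not advancing is to sample `xm vcpu-list` twice and compare the totals. The sketch below assumes the column layout shown above (Time(s) as the sixth whitespace-separated field) and a domain name without spaces; the function names and the `min_delta` threshold are hypothetical.

```python
def total_vcpu_time(vcpu_list_output):
    """Sum the Time(s) column over all VCPU rows of `xm vcpu-list` output."""
    total = 0.0
    for line in vcpu_list_output.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts and the header row.
        if len(fields) < 6 or fields[0] == "Name":
            continue
        total += float(fields[5])
    return total


def looks_stalled(before, after, min_delta=0.1):
    """Return True if total CPU time grew by less than min_delta seconds between samples."""
    return total_vcpu_time(after) - total_vcpu_time(before) < min_delta
```

A guest that is idle but healthy also accumulates little CPU time, so a result of True is only a hint and should be read together with the failed ping and the silent console.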



Expected results:
At step 2, 4 and 6, we could ssh to the guest successfully.

Additional info:
1. Other supported guest OSes (RHEL5, RHEL4) do not hit this problem.
2. The xend.log files of the local host and the remote host are attached.

Comment 1 Yufang Zhang 2010-07-12 05:46:04 UTC
Created attachment 431073 [details]
xend.log of remote host

Comment 2 Yufang Zhang 2010-07-12 05:47:52 UTC
Created attachment 431075 [details]
config file used by the guest

Comment 3 Michal Novotny 2010-07-13 13:05:41 UTC
I've looked through the logs and there seems to be no error on the user-space side.

I remember there was an issue with the netfront driver (which is used by PV guests, AFAIK). Although that bug was not for a RHEL-6 PV guest, RHEL-6 may be affected as well, or this bug may be similar.

Yufang, could you please try to migrate with the vif device commented out (not used) in the domain configuration file and tell us whether it still hangs?

Thanks,
Michal

Comment 4 Yufang Zhang 2010-07-15 10:04:24 UTC
(In reply to comment #3)
> I've been looking to the logs and there seems to be no error on the user-space
> side.
> 
> I remember that was one issue with the netfront driver (which is being used by
> PV guests AFAIK). Although the bug was not for RHEL-6 PV guest it may be
> possible that RHEL-6 is being affected as well or the bug may be similar.
> 
> Yufang, could you please try to migrate with the vif device commented out/not
> used in the domain configuration file and tell us whether it still hangs or
> not?
> 
> Thanks,
> Michal    

Hi Michal,
I retested this on two AMD machines with the vif device commented out. The guest migrated back to the local host successfully, although it took much longer (10 minutes) than migrating from the local host to the remote host (20 seconds). I will test this bug on two Intel machines tomorrow.

Comment 5 Michal Novotny 2010-07-16 08:59:03 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > I've been looking to the logs and there seems to be no error on the user-space
> > side.
> > 
> > I remember that was one issue with the netfront driver (which is being used by
> > PV guests AFAIK). Although the bug was not for RHEL-6 PV guest it may be
> > possible that RHEL-6 is being affected as well or the bug may be similar.
> > 
> > Yufang, could you please try to migrate with the vif device commented out/not
> > used in the domain configuration file and tell us whether it still hangs or
> > not?
> > 
> > Thanks,
> > Michal    
> 
> Hi Michal,
> I retest this on two AMD machines with vif device commented. It could migrate
> back to local host successfully, although it takes much more time(10 minutes)
> than migrating from local host to remote host(20 seconds). I will test this bug
> tomorrow on two Intel machines.    

Hi Yufang,
so did it take a pretty long time on the AMD machines? Was the original testing (in comment #0) on AMD or Intel? So it did migrate successfully on AMD, but extremely slowly? Could you provide the logs from the AMD testing if you still have them? They might reveal errors that explain this extreme slowdown.

Michal

Comment 6 Yufang Zhang 2010-07-18 08:34:35 UTC
Hi Michal,
I tested this bug with the latest xen and kernel-xen packages on both Intel and AMD machines. Ping-pong migration finished successfully in both cases.

xen and kernel-xen rpms: 
xen-devel-3.0.3-114.el5
xen-libs-3.0.3-114.el5
kernel-xen-2.6.18-206.el5
xen-3.0.3-114.el5
kernel-xen-devel-2.6.18-206.el5
xen-debuginfo-3.0.3-114.el5

guest kernel:
2.6.32-44

Comment 7 Michal Novotny 2010-07-19 06:05:25 UTC
(In reply to comment #6)
> Hi Michal,
> I test this bug with the latest xen and kernel-xen packages on both Intel and
> AMD machines. Ping-Pong migration finished successfully on both cases. 
> 
> xen and kernel-xen rpms: 
> xen-devel-3.0.3-114.el5
> xen-libs-3.0.3-114.el5
> kernel-xen-2.6.18-206.el5
> xen-3.0.3-114.el5
> kernel-xen-devel-2.6.18-206.el5
> xen-debuginfo-3.0.3-114.el5
> 
> guest kernel:
> 2.6.32-44    

OK, so has it been verified that this bug has disappeared with the latest RHEL-6 kernel? If so, feel free to close it yourself ;)

Michal

Comment 8 Andrew Jones 2010-07-19 07:21:52 UTC
(In reply to comment #7)
> Ok, so is it being verified this bug has disappeared in the latest RHEL-6
> kernel? If so, feel free to close it yourself ;)
> 
> Michal    

It would be nice to know what the bug was, and what made it disappear! I have a feeling it might just be due to the remote storage though. Yufang, now that you've seen it work with your current ping-pong network set up, can you please downgrade one component at a time to see if you can find what brings the bug back? If you can go all the way back to the same revisions of host and guest code you had when you reported this bug, and it never comes back, then we can just blame the network.

Drew

Comment 9 Michal Novotny 2010-07-20 08:13:58 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Ok, so is it being verified this bug has disappeared in the latest RHEL-6
> > kernel? If so, feel free to close it yourself ;)
> > 
> > Michal    
> 
> It would be nice to know what the bug was, and what made it disappear! I have a
> feeling it might just be due to the remote storage though. Yufang, now that
> you've seen it work with your current ping-pong network set up, can you please
> downgrade one component at a time to see if you can find what brings the bug
> back? If you can go all the way back to the same revisions of host and guest
> code you had when you reported this bug, and it never comes back, then we can
> just blame the network.
> 
> Drew    

I have no objections to that, and yes, it would be nice, Drew. Nevertheless, there's nothing in the attached xend logs, so I guess this is a RHEL-6 kernel issue. The kernel version used for the testing was 2.6.32-44 according to Yufang's comment #6; I don't know what the newest version is, but you can have a look at the kernel code. Still, it's good to wait for Yufang's reply about downgrading the components. I'm just saying that I doubt this is user-space related.

Michal

Comment 10 Lei Wang 2010-07-23 07:33:17 UTC
Hi Andrew and Michal,

1. We have encountered this issue with both older and newer package versions.

2. But it's hard to pin down the exact steps, so we need to investigate this issue further and will then update the relevant information here.

Comment 11 Michal Novotny 2010-07-23 07:44:08 UTC
(In reply to comment #10)
> Hi Andrew and Michal,
> 
> 1.We ever encountered this issue with both older and newer version packages.
> 

So, does it exist in the package versions stated in comment #6 ?

> 2.But it's hard to track the exact steps, so we need further investigation
> about this issue and then update the relative information here.    

OK, please let us know once you know the exact steps.

Michal

Comment 12 Miroslav Rezanina 2010-08-31 12:13:02 UTC
Hi Lei,
is there any progress on finding out how to reproduce this problem?

Comment 14 Yufang Zhang 2010-09-03 07:05:01 UTC
Created attachment 442811 [details]
screen dump of the guest on the remote host

The problem still exists with the latest RHEL6 snapshot13.

xen and kernel-xen packages:
kernel-xen-2.6.18-214.el5
xen-3.0.3-115.el5
kernel-xen-devel-2.6.18-214.el5
xen-debuginfo-3.0.3-115.el5
xen-devel-3.0.3-115.el5
xen-libs-3.0.3-115.el5

Guest:
RHEL6 snapshot13(kernel-2.6.32-70.el6)

After migrating the guest to the remote host, we couldn't ssh to or ping the guest, although the guest was still alive on the remote host. On the remote host, we could get the guest's console via virt-viewer. The ifconfig command hung when we ran it inside the guest to check the network status (as shown in the screen dump).

Comment 15 Yufang Zhang 2010-09-03 07:09:14 UTC
Created attachment 442813 [details]
xend.log of destination host

Guest information at remote host:

# xm li vm1 -l
(domain
    (domid 17)
    (uuid 1efb30c3-86fd-9dd7-4934-9b72b6a833fc)
    (vcpus 4)
    (cpu_cap 0)
    (cpu_weight 256.0)
    (memory 512)
    (shadow_memory 0)
    (maxmem 512)
    (bootloader /usr/bin/pygrub)
    (features )
    (localtime 0)
    (name vm1)
    (on_poweroff destroy)
    (on_reboot restart)
    (on_crash restart)
    (image
        (linux
            (ramdisk /var/lib/xen/boot_ramdisk.UdDyds)
            (kernel /var/lib/xen/boot_kernel.jPjTlN)
            (args
                'ro root=/dev/mapper/vg_dhcp66929-lv_root rd_LVM_LV=vg_dhcp66929/lv_root rd_LVM_LV=vg_dhcp66929/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto console=tty0 console=hvc0 rhgb quiet'
            )
        )
    )
    (cpus ())
    (device
        (vif
            (backend 0)
            (script vif-bridge)
            (bridge xenbr0)
            (mac 00:AE:9D:94:85:48)
        )
    )
    (device
        (tap
            (backend 0)
            (dev xvda:disk)
            (uname
                tap:aio:/xen-autotest/xen-autotest/client/tests/xen/images/RHEL-Server-6.0-64-pv.raw
            )
            (mode w)
        )
    )
    (device (vkbd (backend 0)))
    (device
        (vfb
            (backend 0)
            (type vnc)
            (vnclisten 0.0.0.0)
            (vncunused 1)
            (display localhost:10.0)
            (xauthority /root/.Xauthority)
        )
    )
    (state -b----)
    (shutdown_reason poweroff)
    (cpu_time 0.862081718)
    (online_vcpus 4)
    (up_time 2393.75772381)
    (start_time 1283495319.67)
    (store_mfn 238223)
    (console_mfn 253123)
)

Comment 16 Yufang Zhang 2010-09-03 07:12:07 UTC
Created attachment 442815 [details]
xend.log of src host

Comment 17 Yufang Zhang 2010-09-03 07:41:06 UTC
The problem with this bug is that once we hit it on a machine, it is 100% reproducible on that machine as long as we don't reboot the source and remote hosts. But if we reboot both the source and destination hosts, we can't reproduce it any more until we hit it again after some time.
Miroslav, could you please try this in your environment to make sure it is not related to the environment (network)? Thanks.

Comment 19 Paolo Bonzini 2011-01-04 14:56:56 UTC
If anything, the

    (store_mfn 238223)
    (console_mfn 253123)

seems a bit suspicious, since the two pages are normally allocated contiguously, e.g.

    (store_mfn 2315861)
    (console_mfn 2315860)
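
The same check can be run against saved `xm list -l` output. The sketch below uses hypothetical function names and treats adjacency of the two frame numbers as a heuristic only; nothing in this report confirms that non-adjacent values are the cause of the hang.

```python
import re


def mfn_pair(sxp_text):
    """Return (store_mfn, console_mfn) found in `xm list -l` output, or None."""
    store = re.search(r"\(store_mfn (\d+)\)", sxp_text)
    console = re.search(r"\(console_mfn (\d+)\)", sxp_text)
    if not (store and console):
        return None
    return int(store.group(1)), int(console.group(1))


def mfns_contiguous(sxp_text):
    """Return True if both frame numbers are present and differ by exactly one."""
    pair = mfn_pair(sxp_text)
    return pair is not None and abs(pair[0] - pair[1]) == 1
```

With the values from comment 15 (238223 and 253123) the check returns False, while the example pair above (2315861 and 2315860) returns True.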

Comment 20 Andrew Jones 2011-01-10 15:22:26 UTC
This is likely a dup of bug 663755 (also bug 658720 and bug 663881).

Comment 21 Andrew Jones 2011-01-14 14:28:49 UTC

*** This bug has been marked as a duplicate of bug 663755 ***