Created attachment 431072 [details]
xend.log of local host

Description of problem:
A RHEL6 PV guest hangs after migrating back from the remote host to the local host. We can neither ping nor ssh to the guest, nor get any output from the guest's console. CPU times of the guest remain steadily low.

Version-Release number of selected component (if applicable):
xen-libs-3.0.3-113.el5
xen-3.0.3-113.el5
xen-debuginfo-3.0.3-113.el5
kernel-xen-2.6.18-203.el5
xen-devel-3.0.3-113.el5
kernel-xen-devel-2.6.18-203.el5

Guest Version:
RHEL6 Beta2 snapshot7 (RHEL6.0-20100707.4)

How reproducible:
Always

Steps to Reproduce:
1. Start the PV guest and set up shared storage properly (NFS in this case)
2. Try to ssh or ping to the guest
3. Migrate the guest to a remote host
4. Try to ssh or ping to the guest again
5. Migrate the guest back to the local host from the remote host
6. Try to ssh or ping to the guest again

Actual results:
1. At steps 2 and 4, we could ssh to the guest successfully.
2. At step 6, we could neither ssh nor ping to the guest:

# ping 10.66.93.188
PING 10.66.93.188 (10.66.93.188) 56(84) bytes of data.
From 10.66.93.223 icmp_seq=1 Destination Host Unreachable
From 10.66.93.223 icmp_seq=2 Destination Host Unreachable
From 10.66.93.223 icmp_seq=3 Destination Host Unreachable

--- 10.66.93.188 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5018ms, pipe 3

# ssh 10.66.93.188
ssh: connect to host 10.66.93.188 port 22: No route to host

xm console shows nothing; CPU times of the guest remain steadily low:

# xm vcpu-list 24
Name    ID  VCPUs  CPU  State  Time(s)  CPU Affinity
vm1     24      0    3  -b-        0.3  any cpu
vm1     24      1    3  -b-        0.0  any cpu
vm1     24      2    1  -b-        0.0  any cpu
vm1     24      3    3  -b-        0.0  any cpu

Expected results:
At steps 2, 4 and 6, we could ssh to the guest successfully.

Additional info:
1. Other supported OSes (RHEL5, RHEL4) did not encounter this problem.
2. xend.log of the local host and the remote host are in the attachments.
Created attachment 431073 [details]
xend.log of remote host
Created attachment 431075 [details]
config file used by the guest
I've been looking at the logs and there seems to be no error on the user-space side.

I remember there was an issue with the netfront driver (which is used by PV guests AFAIK). Although that bug was not for a RHEL-6 PV guest, it is possible that RHEL-6 is affected as well, or that this bug is similar.

Yufang, could you please try to migrate with the vif device commented out/not used in the domain configuration file and tell us whether it still hangs or not?

Thanks,
Michal
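For reference, a sketch of what the suggested test change might look like in the guest's domain configuration file (Xen domU configs are parsed as Python). This is an illustration, not the attached config itself: the mac/bridge/script values below are taken from the `xm li vm1 -l` output later in this bug, and the exact line in the attached config may differ.

```python
# Hypothetical sketch of the change Michal suggests: disable the vif device
# so netfront is out of the picture during migration.
#
# Original network device line, commented out for the test
# (values from the xm list -l output in this bug; yours may differ):
# vif = [ "mac=00:AE:9D:94:85:48,bridge=xenbr0,script=vif-bridge" ]
vif = []  # no network device for the duration of the test
```

If the guest then survives the ping-pong migration, the netfront driver becomes the prime suspect.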
(In reply to comment #3)
> Yufang, could you please try to migrate with the vif device commented out/not
> used in the domain configuration file and tell us whether it still hangs or
> not?

Hi Michal,
I retested this on two AMD machines with the vif device commented out. The guest could migrate back to the local host successfully, although it took much more time (10 minutes) than migrating from the local host to the remote host (20 seconds). I will test this bug tomorrow on two Intel machines.
(In reply to comment #4)
> I retested this on two AMD machines with the vif device commented out. The
> guest could migrate back to the local host successfully, although it took
> much more time (10 minutes) than migrating from the local host to the remote
> host (20 seconds).

Hi Yufang,

so it did take a pretty long time on the AMD machines? Was the original testing (in comment #0) on AMD or Intel? So it did migrate successfully on the AMDs, but extremely slowly? Could you provide the logs from the AMD testing if you still have them? Investigating them might reveal errors that explain this extreme slowdown.

Michal
Hi Michal,
I tested this bug with the latest xen and kernel-xen packages on both Intel and AMD machines. Ping-pong migration finished successfully in both cases.

xen and kernel-xen rpms:
xen-devel-3.0.3-114.el5
xen-libs-3.0.3-114.el5
kernel-xen-2.6.18-206.el5
xen-3.0.3-114.el5
kernel-xen-devel-2.6.18-206.el5
xen-debuginfo-3.0.3-114.el5

guest kernel:
2.6.32-44
(In reply to comment #6)
> Hi Michal,
> I tested this bug with the latest xen and kernel-xen packages on both Intel
> and AMD machines. Ping-pong migration finished successfully in both cases.

Ok, so has it been verified that this bug has disappeared in the latest RHEL-6 kernel? If so, feel free to close it yourself ;)

Michal
(In reply to comment #7)
> Ok, so has it been verified that this bug has disappeared in the latest
> RHEL-6 kernel? If so, feel free to close it yourself ;)
>
> Michal

It would be nice to know what the bug was, and what made it disappear! I have a feeling it might just be due to the remote storage, though. Yufang, now that you've seen it work with your current ping-pong network setup, can you please downgrade one component at a time to see if you can find what brings the bug back? If you can go all the way back to the same revisions of host and guest code you had when you reported this bug, and it never comes back, then we can just blame the network.

Drew
(In reply to comment #8)
> It would be nice to know what the bug was, and what made it disappear! [...]
> If you can go all the way back to the same revisions of host and guest code
> you had when you reported this bug, and it never comes back, then we can just
> blame the network.

I have no objections to that, and yeah, it would be nice, Drew. Nevertheless, there is nothing in the attached xend logs, so I guess this is a RHEL-6 kernel thing. The kernel version used for the testing was 2.6.32-44 according to Yufang's comment #6; I don't know what the newest version is, but you can have a look at the kernel code. It's best to wait for Yufang's reply about downgrading the components, though. I'm just saying that I doubt this could be user-space related.

Michal
Hi Andrew and Michal,

1. We have encountered this issue with both older and newer package versions.
2. But it is hard to pin down the exact steps, so we need further investigation of this issue; we will then update the relevant information here.
(In reply to comment #10)
> Hi Andrew and Michal,
>
> 1. We have encountered this issue with both older and newer package versions.

So, does it exist in the package versions stated in comment #6?

> 2. But it is hard to pin down the exact steps, so we need further
> investigation of this issue; we will then update the relevant information
> here.

Ok, please let us know once you know the exact steps.

Michal
Hi Lei, is there any progress on working out how to reproduce this problem?
Created attachment 442811 [details]
screen dump of the guest on the remote host

The problem still exists with the latest RHEL6 snapshot13.

xen and kernel-xen packages:
kernel-xen-2.6.18-214.el5
xen-3.0.3-115.el5
kernel-xen-devel-2.6.18-214.el5
xen-debuginfo-3.0.3-115.el5
xen-devel-3.0.3-115.el5
xen-libs-3.0.3-115.el5

Guest: RHEL6 snapshot13 (kernel-2.6.32-70.el6)

After migrating the guest to the remote host, we could not ssh or ping to the guest, although the guest was still alive on the remote host. On the remote host, we could get the console of the guest via virt-viewer. The ifconfig command hangs when we try to check the network status from within the guest (as shown in the screen dump).
Created attachment 442813 [details]
xend.log of destination host

Guest information at remote host:

# xm li vm1 -l
(domain
    (domid 17)
    (uuid 1efb30c3-86fd-9dd7-4934-9b72b6a833fc)
    (vcpus 4)
    (cpu_cap 0)
    (cpu_weight 256.0)
    (memory 512)
    (shadow_memory 0)
    (maxmem 512)
    (bootloader /usr/bin/pygrub)
    (features )
    (localtime 0)
    (name vm1)
    (on_poweroff destroy)
    (on_reboot restart)
    (on_crash restart)
    (image
        (linux
            (ramdisk /var/lib/xen/boot_ramdisk.UdDyds)
            (kernel /var/lib/xen/boot_kernel.jPjTlN)
            (args 'ro root=/dev/mapper/vg_dhcp66929-lv_root rd_LVM_LV=vg_dhcp66929/lv_root rd_LVM_LV=vg_dhcp66929/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto console=tty0 console=hvc0 rhgb quiet')
        )
    )
    (cpus ())
    (device
        (vif
            (backend 0)
            (script vif-bridge)
            (bridge xenbr0)
            (mac 00:AE:9D:94:85:48)
        )
    )
    (device
        (tap
            (backend 0)
            (dev xvda:disk)
            (uname tap:aio:/xen-autotest/xen-autotest/client/tests/xen/images/RHEL-Server-6.0-64-pv.raw)
            (mode w)
        )
    )
    (device (vkbd (backend 0)))
    (device
        (vfb
            (backend 0)
            (type vnc)
            (vnclisten 0.0.0.0)
            (vncunused 1)
            (display localhost:10.0)
            (xauthority /root/.Xauthority)
        )
    )
    (state -b----)
    (shutdown_reason poweroff)
    (cpu_time 0.862081718)
    (online_vcpus 4)
    (up_time 2393.75772381)
    (start_time 1283495319.67)
    (store_mfn 238223)
    (console_mfn 253123)
)
Created attachment 442815 [details]
xend.log of src host
The problem with this bug is that once we hit the issue on a machine, it is 100% reproducible on that machine as long as we do not reboot the src and remote hosts. But if we reboot both the src and destination hosts, we cannot reproduce the bug any more until we hit it again after some time. Miroslav, could you please give it a try in your environment, to make sure it is not related to the environment (network)? Thanks.
If anything, the (store_mfn 238223) (console_mfn 253123) pair seems a bit suspicious, since these two pages are normally allocated contiguously, e.g. (store_mfn 2315861) (console_mfn 2315860).
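The observation above can be made concrete with a quick check (a hypothetical helper, not part of any Xen tool; the values are the ones quoted in this bug): contiguous allocation means the two machine frame numbers differ by exactly 1.

```python
# Quick sanity check of the observation above: when the xenstore ring page
# and the console ring page come from one contiguous allocation, their
# machine frame numbers are adjacent.
def mfns_contiguous(store_mfn, console_mfn):
    """Return True if the two ring pages are adjacent machine frames."""
    return abs(store_mfn - console_mfn) == 1

# Values from the xm li vm1 -l output in this bug (suspicious case):
print(mfns_contiguous(238223, 253123))    # 14900 frames apart
# Values from a healthy domain, as quoted above:
print(mfns_contiguous(2315861, 2315860))  # adjacent
```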
This is likely a dup of bug 663755 (also bug 658720 and bug 663881).
*** This bug has been marked as a duplicate of bug 663755 ***