Customer reports a regression where the domain briefly goes missing from the list of domains in xm list when undergoing a reboot initiated by xm reboot. This is not seen on RHEL 5.3 kernels.

-----

Description of Problem:
This is a REGRESSION issue. It does not occur on RHEL5.3GA. The `xm list' command may not output HVM domains if they are rebooted by `xm reboot'. Note that this issue did not occur on PV domains.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: 5
Release Number: 4 Beta
Architecture: ia64
Kernel Version: 2.6.18-155.el5xen
Related Package Version: xen-3.0.3-88.el5
Related Middleware / Application: None

Drivers or hardware or architecture dependency:
None

How reproducible:
1/20

Steps to Reproduce:
1. Create a HVM domain
2. Reboot the domain by `xm reboot'
3. Run `xm list'

Actual Results:
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0       837      1  r-----    807.4

Expected Results:
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0       837      1  r-----    807.4
rhel54_74   25      1047      1  r-----     17.4

Summary of actions taken to resolve issue:
None

Location of diagnostic data:
None
I believe this must have always been there. When a guest is restarted, xend first destroys the guest and then creates and boots a new one. It's just a matter of luck if you manage to list running domains during the short time in between.
Yes this is a fundamental limitation of this XenD version. It does not have any concept of 'inactive' guests, so if the guest is not running, XenD won't report it. During reboot you have a small window between the guest shutting down & new one booting, and thus thanks to lack of inactive guest mgmt, the guest can briefly disappear.
If it goes missing "briefly", as written in the summary and in the issue tracker, then this matches what Jiri and Daniel said: during reboot there is a small window between the guest shutting down and the new one being created, and the guest can disappear during that window. This also matches the observation that it is reproducible only about 5% of the time (1/20). It is implicit in the behavior of Xen, and libvirt (virsh) works around it. If the domain never reappears, it is a different problem. In that case the summary should be updated and the xend-debug.log and xend.log files should be attached.
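The window described above can be observed by checking whether the guest's name is present in `xm list' output while it reboots. A minimal sketch in Python, assuming the column layout shown in the original report (domain name in the first column); `domain_listed` is a hypothetical helper, not part of xm or xend, and the sample text is copied from the report rather than captured live.

```python
def domain_listed(xm_list_output, name):
    """Return True if `name` appears in the first column of `xm list` output."""
    lines = xm_list_output.strip().splitlines()
    # Skip the header row ("Name ID Mem(MiB) VCPUs State Time(s)").
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


# Sample output taken from the report (expected vs. actual results).
EXPECTED = """Name ID Mem(MiB) VCPUs State Time(s)
Domain-0 0 837 1 r----- 807.4
rhel54_74 25 1047 1 r----- 17.4"""
ACTUAL = """Name ID Mem(MiB) VCPUs State Time(s)
Domain-0 0 837 1 r----- 807.4"""

print(domain_listed(EXPECTED, "rhel54_74"))  # True
print(domain_listed(ACTUAL, "rhel54_74"))    # False
```

Polling this in a loop after `xm reboot' would distinguish a brief disappearance (the name comes back within a few seconds) from the case where the domain never reappears.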
There is not enough information in this bug report to further diagnose the problem. Please provide

- /var/log/xen/xend.log & xend-debug.log from the point in time immediately after doing a 'xm reboot' that exhibits the missing domain problem
- Output of 'xm list --long'
- Output of 'xenstore-ls'
- The /etc/xen/$GUEST config file for the guest showing problems
- The 'virsh dumpxml GUEST' output
Hmm, I wasn't able to reproduce it even after 200 reboots of an hvm guest. Could you please try to reproduce the bug with packages from http://people.redhat.com/jdenemar/xen/bz513604/ and send xend.log after running xm list --long? Thanks.
Thanks a lot. So the error is caused by a missing entry for one of the block devices in /vm/UUID/device/vbd/:

vbd = ""
 5632 = ""
  frontend = "/local/domain/11/device/vbd/5632"
  frontend-id = "11"
  backend-id = "0"
  backend = "/local/domain/0/backend/vbd/11/5632"

768 is missing in there. In the previous report, it was the cdrom (5632) which was missing. Oops, and the vif got lost during restarts, which looks almost like bug #509099.

I might have an idea why this happens... I'll report back once I know whether it's correct.
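For readers unfamiliar with the numeric vbd IDs: they follow the legacy Linux device number encoding (major * 256 + minor), so 768 is major 3, minor 0 and 5632 is major 22, minor 0. A small sketch in Python that decodes them and reports which expected devices are absent; `decode_vbd` and `missing_devices` are hypothetical helpers, only the two IDE majors seen in this report are mapped, and the expected device list is an assumption about the guest config.

```python
# Legacy Linux device numbers pack major and minor as major * 256 + minor.
# Only the two IDE majors relevant to this report are mapped here.
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}


def decode_vbd(devid):
    """Translate a numeric vbd ID into an IDE device name, or None if unknown."""
    major, minor = devid >> 8, devid & 0xFF
    disks = IDE_MAJORS.get(major)
    if disks is None or minor >= 128:
        return None
    disk = disks[minor // 64]  # each IDE disk owns 64 minor numbers
    partition = minor % 64
    return disk if partition == 0 else "%s%d" % (disk, partition)


def missing_devices(expected, present):
    """Return the expected vbd IDs that have no entry under /vm/UUID/device/vbd."""
    return sorted(set(expected) - set(present))


# Assumed guest config: one disk (768) and one cdrom (5632).
# Only 5632 was present in the xenstore dump quoted above.
for devid in missing_devices([768, 5632], [5632]):
    print(devid, decode_vbd(devid))  # 768 hda
```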
Could you try the new package from http://people.redhat.com/jdenemar/xen/bz513604/ to see if that fixes the error? And report the results and logs even if it does, please.
Thanks for the testing. Could you try yet another version of the package? http://people.redhat.com/jdenemar/xen/bz513604/ Thanks a lot.
OK, another round... Could you follow the following steps, please?

- install the new packages from http://people.redhat.com/jdenemar/xen/bz513604/ and restart xend or the whole machine
- turn on logging in xen hotplug scripts:
    # echo 'SYSLOG=yes' >>/etc/sysconfig/xenhotplug
- let udev log debugging messages:
    # udevcontrol log_priority=debug
- let syslog write all (including debugging) messages into /var/log/debug:
    # echo '*.* /var/log/debug' >>/etc/syslog.conf
    # service syslog reload
- reproduce the bug
- send me everything you normally do together with /var/log/debug

Thanks a lot
So the race condition is confirmed. As usual, the race is between the hotplug scripts and xend. Under very lucky conditions the hotplug-cleanup script runs early enough to see /local/domain/ID/vm, and is then delayed so that it actually removes /vm/UUID/device/CLASS/ID from the newly created domain instead of the old one. It looks like IA64 is a very lucky platform :-) By injecting some sleeps at the right places, I'm able to reproduce it locally, which should speed things up quite a bit.
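The interleaving described above can be illustrated with a toy model. This is only a sketch in Python that uses a dict in place of xenstore; it is not the hotplug script, xend, or the eventual fix. It assumes the restarted guest keeps its UUID and device ID, and the new domain ID 12 is made up for illustration.

```python
def simulate(cleanup_delayed):
    """Model the cleanup/restart ordering; return the new domain's vbd entry."""
    uuid = "UUID"  # placeholder; the restarted guest is assumed to keep its UUID
    dev_path = "/vm/%s/device/vbd/768" % uuid
    store = {
        "/local/domain/11/vm": "/vm/" + uuid,  # old domain, ID 11
        dev_path: "old",
    }

    # Cleanup starts early: it resolves the old domain's /vm path while
    # /local/domain/11/vm still exists.
    vm_path = store["/local/domain/11/vm"]
    target = "%s/device/vbd/768" % vm_path

    def cleanup():
        # Remove the device entry the cleanup resolved earlier.
        store.pop(target, None)

    def restart():
        # xend destroys the old domain and creates the new one (hypothetical ID 12),
        # which registers the same device under the same /vm/UUID path.
        store.pop("/local/domain/11/vm", None)
        store["/local/domain/12/vm"] = "/vm/" + uuid
        store[dev_path] = "new"

    if cleanup_delayed:
        restart()
        cleanup()  # too late: deletes the new domain's entry
    else:
        cleanup()
        restart()
    return store.get(dev_path)


print(simulate(cleanup_delayed=False))  # new
print(simulate(cleanup_delayed=True))   # None
```

In the delayed case the new domain ends up without its vbd entry, which is the state seen in the xenstore dump earlier in this report.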
Hi, could you try with the latest packages from http://people.redhat.com/jdenemar/xen/bz513604/ (xen-3.0.3-94.el5.bz513604.7)? Thanks
Great, thank you very much for the testing.
Created attachment 358693 [details] Fix race condition on domain restart
*** Bug 513265 has been marked as a duplicate of this bug. ***
Fix built into xen-3.0.3-95.el5
I verified this bug with the following steps: (1) Create a HVM domain (2) Reboot the domain by `xm reboot' (3) Run `xm list' I tried this about 30 times and found that the domain no longer goes missing from xm list when rebooted. So this bug is verified in xen-3.0.3-102.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0294.html
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).