Description of problem:
When running a VM with an ISO attached as the first boot device while the ISO domain is unreachable, the VM remains hung in vdsm in 'WaitForLaunch' state with no pid. When the backend changes the domain state to inactive, the VM gets a pid and changes state to Up with no boot device.

Running a VM with a CD attached while the ISO is unavailable is blocked by the backend with CanDoAction, but since the backend is not aware that the domain is inactive at the time the VM is run, this issue occurs.

Version-Release number of selected component (if applicable):
ic149
vdsm-4.9-112.el6.x86_64
libvirt-0.9.4-23.el6.x86_64
qemu-kvm-0.12.1.2-2.209.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. block connectivity to the ISO domain on the host
2. start a VM with an ISO attached as first boot device
3.

Actual results:
Until the backend deactivates the ISO domain, the VM remains in WaitForLaunch with no pid. Once the domain changes to inactive, the VM is run.

Expected results:
1) the VM should not hang in vdsm for 10 minutes until the backend changes the domain state to inactive
2) once we know the domain is down, a VM with no boot device should not change state to Up but to Down

Additional info:
rhevm and vdsm logs

Every 2.0s: vdsClient -s 0 list table          Mon Nov 21 13:21:26 2011

978a480a-7784-4e8b-bea8-b9b975f39ec4      0  XP      WaitForLaunch

A few minutes later, still in WaitForLaunch with no pid:

Every 2.0s: vdsClient -s 0 list table          Mon Nov 21 13:25:05 2011

978a480a-7784-4e8b-bea8-b9b975f39ec4      0  XP      WaitForLaunch
I have no guess as to what was blocking in Vdsm. Please supply vdsm.log - here, and in any Vdsm bug that you open.
Created attachment 534775 [details] logs
clientIF._prepareVolumePath has

    if drive and not os.path.exists(drive):

which should be called out of process in order to have a timeout.
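For illustration, a minimal sketch of the out-of-process check suggested above, using multiprocessing so that a hung NFS mount only blocks a disposable child process. path_exists_with_timeout is a hypothetical helper name, not a vdsm API:

    import multiprocessing
    import os


    def _exists_child(path, queue):
        # os.path.exists() can block indefinitely when the path lives on a
        # stale NFS mount; run it in a disposable child process instead.
        queue.put(os.path.exists(path))


    def path_exists_with_timeout(path, timeout=5):
        # Hypothetical helper: returns True/False, or None if the check
        # timed out (i.e. the storage is most likely unreachable).
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_exists_child, args=(path, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()  # best effort; a child stuck in D state may linger
            return None
        return queue.get()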
Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
(In reply to comment #3)
> clientIF._prepareVolumePath has
>
>     if drive and not os.path.exists(drive):
>
> which should be called out of process in order to have a timeout.

Offending line has been removed by Eduardo in upstream commit v4.9.1-32-g4e4bb26. Please verify.
Tested with si24.4 and vdsm-4.9.6-44.0.el6_3.x86_64.
The VM is still stuck in WaitForLaunch until the ISO domain is deactivated by the engine, and only then does it manage to start.

Thread-89164::ERROR::2012-11-22 19:00:48,270::utils::398::vm.Vm::(collect) vmId=`e3c87f07-0af1-469c-936f-593e69f586d1`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x1fc07b0>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 394, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 271, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 178, in _sampleDiskLatency
    diskLatency[vmDrive.name] = stats[vmDrive.alias]
AttributeError: 'Drive' object has no attribute 'alias'
Dummy-142::DEBUG::2012-11-22 19:00:48,276::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0704693 s, 14.5 MB/s\n'; <rc> = 0
VM Channels Listener::DEBUG::2012-11-22 19:00:48,541::vmChannels::71::vds::(_do_add_channels) fileno 29 was added to unconnected channels.
VM Channels Listener::DEBUG::2012-11-22 19:00:48,542::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
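The AttributeError above is a side effect of sampling stats while the VM is still in WaitForLaunch: drives get their 'alias' attribute only once libvirt has created the domain. A defensive variant of the sampling loop might look like this (a sketch only, not vdsm's actual fix):

    def sample_disk_latency(drives, stats):
        # Sketch: skip drives whose libvirt alias is not yet known, instead
        # of raising AttributeError while the VM is still in WaitForLaunch.
        disk_latency = {}
        for drive in drives:
            alias = getattr(drive, 'alias', None)
            if alias is not None and alias in stats:
                disk_latency[drive.name] = stats[alias]
        return disk_latency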
Dafna, this is a different mode of failure. Would you attach more of vdsm.log?
I ran a second test, this time adding a floppy as well. The VM was stuck in WaitForLaunch and was finally destroyed. The engine re-runs the VM, but it fails because of a different issue with the floppy. Attaching the logs.
Created attachment 651877 [details] logs
As I understand it, now that we have startupPolicy="optional", can we close this already?
No, because of e.g.:

Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 597, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1392, in _run
    domxml = hooks.before_vm_start(self._buildCmdLine(), self.conf)
  File "/usr/share/vdsm/libvirtvm.py", line 1250, in _buildCmdLine
    devElem = dev.getXML()
  File "/usr/share/vdsm/libvirtvm.py", line 1133, in getXML
    self.path)['write']:
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 728, in getUserPermissions
    fileStats = os.stat(path)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/edf0ee04-0cc2-4e13-877d-1e89541aea55/eebcad17-cbf7-4098-880d-81e2c2504169/images/11111111-1111-1111-1111-111111111111/virtio-drivers-1.0.0-45801.vfd'
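The failure here is getUserPermissions() letting the ENOENT from os.stat() propagate out of getXML() and abort VM startup. A guard along these lines would let the startupPolicy logic handle the missing image instead (a sketch assuming the dict shape implied by the traceback; not the actual patch):

    import errno
    import os


    def safe_user_permissions(path):
        # Sketch: treat a missing image as "no access" rather than letting
        # OSError(ENOENT) propagate out of getXML() and kill the VM start.
        try:
            file_stats = os.stat(path)
        except OSError as e:
            if e.errno == errno.ENOENT:
                return {'read': False, 'write': False}
            raise
        mode = file_stats.st_mode
        return {'read': bool(mode & 0o444), 'write': bool(mode & 0o222)}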
Hello Team,

I just wanted to add another scenario to this effect.
If the VM is already running and the ISO domain is not reachable, the VM goes into a hung state (although the qemu process is not in D state).

Wouldn't it be proper if vdsm notified about this / rhevm showed events for this guest that its ISO image is not accessible?
(Although in such scenarios I notice that qemu-kvm does not log this, nor does libvirtd. But the VM turns from Up to Not Responding state.)

Thanks.

Regards,
Jaison R
(In reply to Jaison Raju from comment #18)
> Hello Team,
>
> I just wanted to add another scenario to this effect.
> If the VM is already running and the ISO domain is not reachable,
> the VM goes into a hung state (although the qemu process is not in D state).
>
> Wouldn't it be proper if vdsm notified about this / rhevm showed events
> for this guest that its ISO image is not accessible?
> (Although in such scenarios I notice that qemu-kvm does not log this,
> nor does libvirtd. But the VM turns from Up to Not Responding state.)
>
> Thanks.
>
> Regards,
> Jaison R

Hello Team,

Let me describe the scenario in detail.

The ISO storage domain goes down. This is logged, and after a timeout the status of the storage domain changes, although this takes around 5 minutes:

engine-config -g StorageDomainFalureTimeoutInMinutes
StorageDomainFalureTimeoutInMinutes: 5 version: general

The VMs using ISO images from this domain go into a hung state. VM poweroff takes some time, but succeeds. Until the storage domain status goes down/inactive, any VM using the ISO domain can be powered up but will remain in 'Wait For Launch' state.

My thought was: if the file is not accessible to the qemu-kvm process, should the qemu process itself fail and send a message to the overlying components like libvirt and vdsm?

Regards,
Jaison R
Still not converging, but we have started to look at this again (finally!) for 3.5.
taking the bug
While attempting to reproduce this issue on RHEL 6.x to fix the remaining places where VDSM can get stuck, I stumbled into another place where we can get stuck inside libvirt.

In a nutshell, in qemuProcessStart libvirt iterates over all the configured disks and tries to access each of them (with the proper uid/gid). *Only after* the check fails is an appropriate action taken according to the startupPolicy. This *will* block in the described scenario.

A few snippets demonstrating the behaviour.

Background: iSCSI domain, NFS ISO storage, one VM with one (iSCSI) disk but with CDROM as first boot device.

Relevant domain XML section:

<disk device="cdrom" snapshot="no" type="file">
  <source file="/rhev/data-center/mnt/shinji:_srv_ovirt_iso/029b397f-6a0a-432a-b5ac-03861cbc6b38/images/11111111-1111-1111-1111-111111111111/Fedora-Live-Desktop-x86_64-19-1.iso" startupPolicy="optional"/>
  <target bus="ide" dev="hdc"/>
  <readonly/>
  <serial/>
  <boot order="1"/>
</disk>
<disk device="disk" snapshot="no" type="block">
  <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci"/>
  <source dev="/rhev/data-center/mnt/blockSD/bc1e6c00-7a28-4d0f-860f-e620ac3bd24d/images/e9d5398b-2f2c-4066-9f45-53ae6616899b/86226983-9b04-4b19-9b2b-85c87442d658"/>
  <target bus="virtio" dev="vda"/>
  <serial>e9d5398b-2f2c-4066-9f45-53ae6616899b</serial>
  <boot order="2"/>
  <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
</disk>

iptables output (note the NFS ports blocked; 192.168.1.64 is the host running the engine and hosting the storage; the other chains are empty with ACCEPT policy). NFS was blocked before the start of the VM:

Chain OUTPUT (policy ACCEPT 111K packets, 6954K bytes)
 pkts bytes target  prot opt in  out  source     destination
   53 64964 REJECT  tcp  --  *   *    0.0.0.0/0  192.168.1.64  tcp dpt:2049 reject-with icmp-port-unreachable
    0     0 REJECT  udp  --  *   *    0.0.0.0/0  192.168.1.64  udp dpt:2049 reject-with icmp-port-unreachable

ps -faux snippet:

root  2291 0.0 0.0 919604 14504 ?  Ssl 17:04 0:04 /usr/sbin/libvirtd --listen
qemu  1132 0.0 0.0 919604  9528 ?  D   22:10 0:00  \_ /usr/sbin/libvirtd --listen

where 'libvirtd' is stuck; output of cat /proc/1132/stack (clearly stuck in NFS):

[<ffffffffa04eb1b2>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
[<ffffffffa04eb6f5>] __rpc_execute+0xf5/0x350 [sunrpc]
[<ffffffffa04eb9b1>] rpc_execute+0x61/0xa0 [sunrpc]
[<ffffffffa04e23a5>] rpc_run_task+0x75/0x90 [sunrpc]
[<ffffffffa04e24c2>] rpc_call_sync+0x42/0x70 [sunrpc]
[<ffffffffa077f73d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
[<ffffffffa077fb1c>] nfs3_proc_access+0xbc/0x180 [nfs]
[<ffffffffa076758c>] nfs_do_access+0x19c/0x240 [nfs]
[<ffffffffa07676d8>] nfs_permission+0xa8/0x1e0 [nfs]
[<ffffffff81198d43>] __link_path_walk+0xb3/0xff0
[<ffffffff81199f3a>] path_walk+0x6a/0xe0
[<ffffffff8119a14b>] filename_lookup+0x6b/0xc0
[<ffffffff8119b277>] user_path_at+0x57/0xa0
[<ffffffff81186d11>] sys_faccessat+0xe1/0x230
[<ffffffff81186e78>] sys_access+0x18/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

(full details here: http://paste.fedoraproject.org/92815/)

Will try to reproduce on F19 to see how more modern libvirt behaves and to find what still needs to be fixed in VDSM.
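In rough Python terms, the problematic ordering described above looks like this (names and structure are purely illustrative; libvirt's actual code is C):

    import os


    class StartupError(Exception):
        pass


    def start_domain(disks):
        # Sketch of the ordering problem: the accessibility probe runs
        # unconditionally and blocks for the full NFS timeout on an
        # unreachable mount...
        for disk in disks:
            if not os.access(disk['source'], os.R_OK):
                # ...and startupPolicy is only consulted *after* the
                # blocking check has failed.
                if disk.get('startupPolicy') == 'optional':
                    disk['source'] = None  # drop the unreachable media
                else:
                    raise StartupError(disk['source'])
        # launch qemu with the (possibly pruned) disk list ...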
libvirt 1.1.3 (F19) seems to have a similar behaviour, although taking a different route (more details will follow). Using the information provided here, posted a preliminary patch to address the floppy special case.
As per comment #24, CondNAK on libvirt.
The libvirt behaviour turned out to be correct, albeit the NFS timeout is perhaps too high (will be investigated in a separate BZ).

http://gerrit.ovirt.org/#/c/26937/ was merged.

Reproduced the scenario described here again and no further blockers were spotted; after the NFS timeout, libvirt drops the ISO image and the VM completes the boot. Various warnings pop up in the VDSM logs but the VM is operative:
- can be connected to (SPICE)
- responds to stats requests (either via vdsClient or via engine)
- can be powered off
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html