Bug 755528
Summary:          VDSM: vm with cd as first device when iso domain is blocked from host, will remain in 'wait for launch' with no pid until backend will change domain state to inactive
Product:          Red Hat Enterprise Virtualization Manager
Component:        vdsm
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Version:          3.1.0
Target Milestone: ---
Target Release:   3.5.0
Hardware:         x86_64
OS:               Linux
Whiteboard:       virt
Fixed In Version: vt1.3, 4.16.0-1.el6_5
Doc Type:         Bug Fix
Last Closed:      2015-02-11 21:09:50 UTC

Reporter:         Dafna Ron <dron>
Assignee:         Francesco Romani <fromani>
QA Contact:       Lukas Svaty <lsvaty>
CC:               bazulay, bsettle, cpelland, iheim, jraju, lpeer, mavital, michal.skrivanek, ofrenkel, oramraz, rbalakri, sherold, srevivo, syeghiay, tdosek, vfeenstr, yeylon

Bug Depends On:   1089873
Bug Blocks:       1142923, 1156165
Description

Dafna Ron  2011-11-21 11:54:40 UTC

---

I have no guess what was blocking in Vdsm. Please supply vdsm.log - here, and in any Vdsm bug that you open.

---

Created attachment 534775 [details]
logs

---

Comment 3:

clientIF._prepareVolumePath has

    if drive and not os.path.exists(drive):

which should be called out of process in order to have a timeout.

---

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

---

(In reply to comment #3)
> clientIF._prepareVolumePath has
>
> if drive and not os.path.exists(drive):
>
> which should be called out of process in order to have a timeout.

The offending line has been removed by Eduardo in upstream commit v4.9.1-32-g4e4bb26. Please verify.

---

Tested with si24.4 and vdsm-4.9.6-44.0.el6_3.x86_64. The VM is still stuck in 'Wait For Launch' until the ISO domain is deactivated by the engine, and only then does it manage to start.

    Thread-89164::ERROR::2012-11-22 19:00:48,270::utils::398::vm.Vm::(collect) vmId=`e3c87f07-0af1-469c-936f-593e69f586d1`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x1fc07b0>
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 394, in collect
        statsFunction()
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 271, in __call__
        retValue = self._function(*args, **kwargs)
      File "/usr/share/vdsm/libvirtvm.py", line 178, in _sampleDiskLatency
        diskLatency[vmDrive.name] = stats[vmDrive.alias]
    AttributeError: 'Drive' object has no attribute 'alias'

    Dummy-142::DEBUG::2012-11-22 19:00:48,276::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0704693 s, 14.5 MB/s\n'; <rc> = 0
    VM Channels Listener::DEBUG::2012-11-22 19:00:48,541::vmChannels::71::vds::(_do_add_channels) fileno 29 was added to unconnected channels.
    VM Channels Listener::DEBUG::2012-11-22 19:00:48,542::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
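Comment 3's point is that a blocking os.path.exists() must not run on the request path: on an unresponsive NFS mount the call can sleep uninterruptibly, which is exactly how the VM got stuck in 'Wait For Launch'. VDSM's real fix removed the offending call (and VDSM generally runs storage checks in helper processes). Purely as an illustration of the timeout idea, here is a minimal sketch that confines the probe to a disposable daemon thread; the function name and approach are this sketch's own, not VDSM code:

```python
import os
import threading

def path_exists_with_timeout(path, timeout=5.0):
    """Return True/False, or None if the probe did not answer in time.

    os.path.exists() can block indefinitely when `path` lives on a dead
    NFS mount; running it on a daemon thread lets the caller give up
    after `timeout` seconds instead of hanging with it.
    """
    result = []

    def probe():
        result.append(os.path.exists(path))  # may hang on a dead mount

    t = threading.Thread(target=probe)
    t.daemon = True  # a stuck probe must not keep the process alive
    t.start()
    t.join(timeout)
    if not result:
        return None  # timed out; treat the path as unavailable
    return result[0]
```

Note that a stuck thread cannot be killed, which is why an out-of-process helper (which can be abandoned or SIGKILLed wholesale) is the more robust design comment 3 asks for; the thread version only demonstrates the timeout semantics.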
Dafna, this is a different mode of failure. Would you attach more of vdsm.log?

---

I ran a second test and this time added a floppy as well. The VM was stuck in 'Wait For Launch' and was finally destroyed. The engine re-runs the VM, but it fails because of a different issue on the floppy. Attaching the logs.

Created attachment 651877 [details]
logs

---

As I understand it, now that we have startupPolicy="optional", can we close this already?

---

No, because of e.g.:

    Traceback (most recent call last):
      File "/usr/share/vdsm/vm.py", line 597, in _startUnderlyingVm
        self._run()
      File "/usr/share/vdsm/libvirtvm.py", line 1392, in _run
        domxml = hooks.before_vm_start(self._buildCmdLine(), self.conf)
      File "/usr/share/vdsm/libvirtvm.py", line 1250, in _buildCmdLine
        devElem = dev.getXML()
      File "/usr/share/vdsm/libvirtvm.py", line 1133, in getXML
        self.path)['write']:
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 728, in getUserPermissions
        fileStats = os.stat(path)
    OSError: [Errno 2] No such file or directory: '/rhev/data-center/edf0ee04-0cc2-4e13-877d-1e89541aea55/eebcad17-cbf7-4098-880d-81e2c2504169/images/11111111-1111-1111-1111-111111111111/virtio-drivers-1.0.0-45801.vfd'

---

Comment 18:

Hello Team,

I just wanted to add another scenario to this effect. If the VM is already running and the ISO domain is not reachable, the VM goes into a hung state (although the qemu process is not in D state).

Wouldn't it be proper if VDSM notified about this, or RHEV-M showed events for this guest saying that its ISO image is not accessible? (Although in such scenarios I notice that neither qemu-kvm nor libvirtd logs this; still, the VM turns from Up to Not Responding state.)

Thanks.

Regards,
Jaison R

---

(In reply to Jaison Raju from comment #18)
> If the vm is already running & iso domain is not reachable,
> the vm goes in hung state (although qemu process is not in D state) .
>
> Wouldn't it be proper if the vdsm notifies this / rhevm shows events for
> this guest that it iso image is not accessible ?

Hello Team,

Let me just describe the scenario in detail. The ISO storage domain goes down.
This is logged, and after a timeout the status of this storage domain changes, although this takes around 5 minutes:

    engine-config -g StorageDomainFalureTimeoutInMinutes
    StorageDomainFalureTimeoutInMinutes: 5 version: general

The VMs using ISO images from this domain go into a hung state. VM poweroff takes some time, but succeeds. Until the storage domain status goes to down/inactive, any VM using the ISO domain can be powered up, but it will remain in 'Wait For Launch' state.

What my thought was: if the file is not accessible to the qemu-kvm process, shouldn't the qemu process itself fail and send the message to overlying components like libvirt and VDSM?

Regards,
Jaison R

---

Not converging still, but we started to look at that again (finally!) for 3.5.

---

Taking the bug.

---

While attempting to reproduce this issue on RHEL 6.x to fix the remaining places where VDSM can get stuck, I stumbled into another place where we can get stuck inside libvirt.

In a nutshell: in qemuProcessStart, libvirt iterates over all the configured disks and tries to access each of them (with the proper uid/gid). *Only after* a check has failed is an appropriate action taken according to the startupPolicy. This *will* block in the described scenario.

A few snippets demonstrating the behaviour.

Background: iSCSI data domain, NFS ISO storage, one VM with one (iSCSI) disk but with CDROM as first boot device.

Relevant domain XML section:

    <disk device="cdrom" snapshot="no" type="file">
      <source file="/rhev/data-center/mnt/shinji:_srv_ovirt_iso/029b397f-6a0a-432a-b5ac-03861cbc6b38/images/11111111-1111-1111-1111-111111111111/Fedora-Live-Desktop-x86_64-19-1.iso" startupPolicy="optional"/>
      <target bus="ide" dev="hdc"/>
      <readonly/>
      <serial/>
      <boot order="1"/>
    </disk>
    <disk device="disk" snapshot="no" type="block">
      <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci"/>
      <source dev="/rhev/data-center/mnt/blockSD/bc1e6c00-7a28-4d0f-860f-e620ac3bd24d/images/e9d5398b-2f2c-4066-9f45-53ae6616899b/86226983-9b04-4b19-9b2b-85c87442d658"/>
      <target bus="virtio" dev="vda"/>
      <serial>e9d5398b-2f2c-4066-9f45-53ae6616899b</serial>
      <boot order="2"/>
      <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
    </disk>

iptables output (note the NFS ports blocked; 192.168.1.64 is the host running the engine and hosting the storage; the other chains are empty with an ACCEPT policy). NFS was blocked before the start of the VM.

    Chain OUTPUT (policy ACCEPT 111K packets, 6954K bytes)
    pkts bytes target prot opt in  out  source     destination
      53 64964 REJECT tcp  --  *   *    0.0.0.0/0  192.168.1.64  tcp dpt:2049 reject-with icmp-port-unreachable
       0     0 REJECT udp  --  *   *    0.0.0.0/0  192.168.1.64  udp dpt:2049 reject-with icmp-port-unreachable

ps -faux snippet:

    root 2291 0.0 0.0 919604 14504 ? Ssl 17:04 0:04 /usr/sbin/libvirtd --listen
    qemu 1132 0.0 0.0 919604  9528 ? D   22:10 0:00  \_ /usr/sbin/libvirtd --listen

where 'libvirtd' is stuck; the output of cat /proc/1132/stack (clearly stuck in NFS):

    [<ffffffffa04eb1b2>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
    [<ffffffffa04eb6f5>] __rpc_execute+0xf5/0x350 [sunrpc]
    [<ffffffffa04eb9b1>] rpc_execute+0x61/0xa0 [sunrpc]
    [<ffffffffa04e23a5>] rpc_run_task+0x75/0x90 [sunrpc]
    [<ffffffffa04e24c2>] rpc_call_sync+0x42/0x70 [sunrpc]
    [<ffffffffa077f73d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
    [<ffffffffa077fb1c>] nfs3_proc_access+0xbc/0x180 [nfs]
    [<ffffffffa076758c>] nfs_do_access+0x19c/0x240 [nfs]
    [<ffffffffa07676d8>] nfs_permission+0xa8/0x1e0 [nfs]
    [<ffffffff81198d43>] __link_path_walk+0xb3/0xff0
    [<ffffffff81199f3a>] path_walk+0x6a/0xe0
    [<ffffffff8119a14b>] filename_lookup+0x6b/0xc0
    [<ffffffff8119b277>] user_path_at+0x57/0xa0
    [<ffffffff81186d11>] sys_faccessat+0xe1/0x230
    [<ffffffff81186e78>] sys_access+0x18/0x20
    [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

(Full details here: http://paste.fedoraproject.org/92815/)

Will try to reproduce on F19 to see how more modern libvirt behaves and to find what still needs to be fixed in VDSM.

---

libvirt 1.1.3 (F19) seems to have similar behaviour, although taking a different route (more details will follow).

---

Using the information provided here, posted a preliminary patch to address the floppy special case.

---

As per comment #24: CondNAK on libvirt.

---

The libvirt behaviour turned out to be correct, albeit the NFS timeout is perhaps too high (will be investigated in a separate BZ).

---

http://gerrit.ovirt.org/#/c/26937/ was merged.

Reproduced the scenario described here again, and no further blockers were spotted; after the NFS timeout, libvirt drops the ISO image and the VM completes the boot. Various warnings pop up in the VDSM logs, but the VM is operative:

- it can be connected to (SPICE)
- it responds to stats requests (either via vdsClient or via the engine)
- it can be powered off.
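The resolved behaviour described above - libvirt waits out the NFS timeout, then drops the inaccessible ISO because of startupPolicy="optional" and lets the boot proceed, while a missing mandatory image aborts the start - can be modelled roughly as follows. This is an illustrative Python sketch with invented names, not libvirt's actual qemuProcessStart logic:

```python
# Illustrative model of startupPolicy handling for removable media.
# Invented names; not libvirt or VDSM code.

def apply_startup_policy(devices, accessible):
    """Decide which device sources survive the pre-start access check.

    devices: list of dicts with 'path' and 'startupPolicy' keys.
    accessible: callable(path) -> bool; in real life this is the check
    that can stall for the NFS timeout before reporting failure.
    Returns the device list to boot with; raises if a mandatory image
    is missing.
    """
    bootable = []
    for dev in devices:
        if not dev['path'] or accessible(dev['path']):
            bootable.append(dev)
        elif dev.get('startupPolicy') == 'optional':
            # drop the medium, keep the (now empty) drive
            bootable.append(dict(dev, path=''))
        else:
            raise RuntimeError('mandatory image inaccessible: %s'
                               % dev['path'])
    return bootable
```

Under this model, the VM in the reproduction boots with an empty CDROM tray once the unreachable ISO is dropped, which matches the "various warnings, but the VM is operative" outcome reported after the gerrit 26937 merge.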
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html