Bug 755528
Summary:          VDSM: vm with cd as first device when iso domain is blocked from host, will remain in 'wait for launch' with no pid until backend will change domain state to inactive
Product:          Red Hat Enterprise Virtualization Manager
Component:        vdsm
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Version:          3.1.0
Target Milestone: ---
Target Release:   3.5.0
Hardware:         x86_64
OS:               Linux
Whiteboard:       virt
Fixed In Version: vt1.3, 4.16.0-1.el6_5
Doc Type:         Bug Fix
Last Closed:      2015-02-11 21:09:50 UTC

Reporter:         Dafna Ron <dron>
Assignee:         Francesco Romani <fromani>
QA Contact:       Lukas Svaty <lsvaty>
CC:               bazulay, bsettle, cpelland, iheim, jraju, lpeer, mavital, michal.skrivanek, ofrenkel, oramraz, rbalakri, sherold, srevivo, syeghiay, tdosek, vfeenstr, yeylon

Bug Depends On:   1089873
Bug Blocks:       1142923, 1156165
Description

Dafna Ron  2011-11-21 11:54:40 UTC

---

I have no guess what was blocking in Vdsm. Please supply vdsm.log - here, and in any Vdsm bug that you open.

---

Created attachment 534775 [details]
logs

---

Comment 3:

clientIF._prepareVolumePath has

    if drive and not os.path.exists(drive):

which should be called out of process in order to have a timeout.

---

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

---

(In reply to comment #3)
> clientIF._prepareVolumePath has
>
> if drive and not os.path.exists(drive):
>
> which should be called out of process in order to have a timeout.

The offending line has been removed by Eduardo in upstream commit v4.9.1-32-g4e4bb26. Please verify.

---

Tested with si24.4 and vdsm-4.9.6-44.0.el6_3.x86_64. The VM is still stuck in 'Wait For Launch' until the ISO domain is deactivated by the engine, and only then does it manage to start.

    Thread-89164::ERROR::2012-11-22 19:00:48,270::utils::398::vm.Vm::(collect) vmId=`e3c87f07-0af1-469c-936f-593e69f586d1`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x1fc07b0>
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 394, in collect
        statsFunction()
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 271, in __call__
        retValue = self._function(*args, **kwargs)
      File "/usr/share/vdsm/libvirtvm.py", line 178, in _sampleDiskLatency
        diskLatency[vmDrive.name] = stats[vmDrive.alias]
    AttributeError: 'Drive' object has no attribute 'alias'

    Dummy-142::DEBUG::2012-11-22 19:00:48,276::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0704693 s, 14.5 MB/s\n'; <rc> = 0
    VM Channels Listener::DEBUG::2012-11-22 19:00:48,541::vmChannels::71::vds::(_do_add_channels) fileno 29 was added to unconnected channels.
    VM Channels Listener::DEBUG::2012-11-22 19:00:48,542::vmChannels::103::vds::(_handle_unconnected) Trying to connect fileno 29.
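Comment 3's point is that a blocking os.path.exists() must not run on the request path: on an unresponsive NFS mount the call can sleep uninterruptibly, which is exactly how the VM got stuck in 'Wait For Launch'. VDSM's real fix removed the offending call (and VDSM generally runs storage checks in helper processes). Purely as an illustration of the timeout idea, here is a minimal sketch that confines the probe to a disposable daemon thread; the function name and approach are this sketch's own, not VDSM code:

```python
import os
import threading

def path_exists_with_timeout(path, timeout=5.0):
    """Return True/False, or None if the probe did not answer in time.

    os.path.exists() can block indefinitely when `path` lives on a dead
    NFS mount; running it on a daemon thread lets the caller give up
    after `timeout` seconds instead of hanging with it.
    """
    result = []

    def probe():
        result.append(os.path.exists(path))  # may hang on a dead mount

    t = threading.Thread(target=probe)
    t.daemon = True  # a stuck probe must not keep the process alive
    t.start()
    t.join(timeout)
    if not result:
        return None  # timed out; treat the path as unavailable
    return result[0]
```

Note that a stuck thread cannot be killed, which is why an out-of-process helper (which can be abandoned or SIGKILLed wholesale) is the more robust design comment 3 asks for; the thread version only demonstrates the timeout semantics.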
Dafna, this is a different mode of failure. Would you attach more of vdsm.log?

---

I ran a second test and this time added a floppy as well. The VM was stuck in 'Wait For Launch' and was finally destroyed. The engine re-runs the VM, but it fails because of a different issue on the floppy. Attaching the logs.

Created attachment 651877 [details]
logs

---

As I understand it, now that we have startupPolicy="optional", can we close this already?

---

No, because of e.g.:

    Traceback (most recent call last):
      File "/usr/share/vdsm/vm.py", line 597, in _startUnderlyingVm
        self._run()
      File "/usr/share/vdsm/libvirtvm.py", line 1392, in _run
        domxml = hooks.before_vm_start(self._buildCmdLine(), self.conf)
      File "/usr/share/vdsm/libvirtvm.py", line 1250, in _buildCmdLine
        devElem = dev.getXML()
      File "/usr/share/vdsm/libvirtvm.py", line 1133, in getXML
        self.path)['write']:
      File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 728, in getUserPermissions
        fileStats = os.stat(path)
    OSError: [Errno 2] No such file or directory: '/rhev/data-center/edf0ee04-0cc2-4e13-877d-1e89541aea55/eebcad17-cbf7-4098-880d-81e2c2504169/images/11111111-1111-1111-1111-111111111111/virtio-drivers-1.0.0-45801.vfd'

---

Comment 18:

Hello Team,

I just wanted to add another scenario to this effect. If the VM is already running and the ISO domain is not reachable, the VM goes into a hung state (although the qemu process is not in D state).

Wouldn't it be proper if VDSM notified about this, or RHEV-M showed events for this guest saying that its ISO image is not accessible? (Although in such scenarios I notice that neither qemu-kvm nor libvirtd logs this; still, the VM turns from Up to Not Responding state.)

Thanks.

Regards,
Jaison R

---

(In reply to Jaison Raju from comment #18)
> If the vm is already running & iso domain is not reachable,
> the vm goes in hung state (although qemu process is not in D state) .
>
> Wouldn't it be proper if the vdsm notifies this / rhevm shows events for
> this guest that it iso image is not accessible ?

Hello Team,

Let me just describe the scenario in detail. The ISO storage domain goes down.
This is logged, and after a timeout the status of this storage domain changes, although this takes around 5 minutes:

    engine-config -g StorageDomainFalureTimeoutInMinutes
    StorageDomainFalureTimeoutInMinutes: 5 version: general

The VMs using ISO images from this domain go into a hung state. VM poweroff takes some time, but succeeds. Until the storage domain status goes to down/inactive, any VM using the ISO domain can be powered up, but it will remain in 'Wait For Launch' state.

What my thought was: if the file is not accessible to the qemu-kvm process, shouldn't the qemu process itself fail and send the message to overlying components like libvirt and VDSM?

Regards,
Jaison R

---

Not converging still, but we started to look at that again (finally!) for 3.5.

---

Taking the bug.

---

While attempting to reproduce this issue on RHEL 6.x to fix the remaining places where VDSM can get stuck, I stumbled into another place where we can get stuck inside libvirt.

In a nutshell: in qemuProcessStart, libvirt iterates over all the configured disks and tries to access each of them (with the proper uid/gid). *Only after* a check has failed is an appropriate action taken according to the startupPolicy. This *will* block in the described scenario.

A few snippets demonstrating the behaviour.

Background: iSCSI data domain, NFS ISO storage, one VM with one (iSCSI) disk but with CDROM as first boot device.

Relevant domain XML section:

    <disk device="cdrom" snapshot="no" type="file">
      <source file="/rhev/data-center/mnt/shinji:_srv_ovirt_iso/029b397f-6a0a-432a-b5ac-03861cbc6b38/images/11111111-1111-1111-1111-111111111111/Fedora-Live-Desktop-x86_64-19-1.iso" startupPolicy="optional"/>
      <target bus="ide" dev="hdc"/>
      <readonly/>
      <serial/>
      <boot order="1"/>
    </disk>
    <disk device="disk" snapshot="no" type="block">
      <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci"/>
      <source dev="/rhev/data-center/mnt/blockSD/bc1e6c00-7a28-4d0f-860f-e620ac3bd24d/images/e9d5398b-2f2c-4066-9f45-53ae6616899b/86226983-9b04-4b19-9b2b-85c87442d658"/>
      <target bus="virtio" dev="vda"/>
      <serial>e9d5398b-2f2c-4066-9f45-53ae6616899b</serial>
      <boot order="2"/>
      <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
    </disk>

iptables output (note the NFS ports blocked; 192.168.1.64 is the host running the engine and hosting the storage; the other chains are empty with an ACCEPT policy). NFS was blocked before the start of the VM.

    Chain OUTPUT (policy ACCEPT 111K packets, 6954K bytes)
    pkts bytes target prot opt in  out  source     destination
      53 64964 REJECT tcp  --  *   *    0.0.0.0/0  192.168.1.64  tcp dpt:2049 reject-with icmp-port-unreachable
       0     0 REJECT udp  --  *   *    0.0.0.0/0  192.168.1.64  udp dpt:2049 reject-with icmp-port-unreachable

ps -faux snippet:

    root 2291 0.0 0.0 919604 14504 ? Ssl 17:04 0:04 /usr/sbin/libvirtd --listen
    qemu 1132 0.0 0.0 919604  9528 ? D   22:10 0:00  \_ /usr/sbin/libvirtd --listen

where 'libvirtd' is stuck; the output of cat /proc/1132/stack (clearly stuck in NFS):

    [<ffffffffa04eb1b2>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
    [<ffffffffa04eb6f5>] __rpc_execute+0xf5/0x350 [sunrpc]
    [<ffffffffa04eb9b1>] rpc_execute+0x61/0xa0 [sunrpc]
    [<ffffffffa04e23a5>] rpc_run_task+0x75/0x90 [sunrpc]
    [<ffffffffa04e24c2>] rpc_call_sync+0x42/0x70 [sunrpc]
    [<ffffffffa077f73d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
    [<ffffffffa077fb1c>] nfs3_proc_access+0xbc/0x180 [nfs]
    [<ffffffffa076758c>] nfs_do_access+0x19c/0x240 [nfs]
    [<ffffffffa07676d8>] nfs_permission+0xa8/0x1e0 [nfs]
    [<ffffffff81198d43>] __link_path_walk+0xb3/0xff0
    [<ffffffff81199f3a>] path_walk+0x6a/0xe0
    [<ffffffff8119a14b>] filename_lookup+0x6b/0xc0
    [<ffffffff8119b277>] user_path_at+0x57/0xa0
    [<ffffffff81186d11>] sys_faccessat+0xe1/0x230
    [<ffffffff81186e78>] sys_access+0x18/0x20
    [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

(Full details here: http://paste.fedoraproject.org/92815/)

Will try to reproduce on F19 to see how more modern libvirt behaves and to find what still needs to be fixed in VDSM.

---

libvirt 1.1.3 (F19) seems to have similar behaviour, although taking a different route (more details will follow).

---

Using the information provided here, posted a preliminary patch to address the floppy special case.

---

As per comment #24: CondNAK on libvirt.

---

The libvirt behaviour turned out to be correct, albeit the NFS timeout is perhaps too high (will be investigated in a separate BZ).

---

http://gerrit.ovirt.org/#/c/26937/ was merged.

Reproduced the scenario described here again, and no further blockers were spotted; after the NFS timeout, libvirt drops the ISO image and the VM completes the boot. Various warnings pop up in the VDSM logs, but the VM is operative:

- it can be connected to (SPICE)
- it responds to stats requests (either via vdsClient or via the engine)
- it can be powered off.
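The resolved behaviour described above - libvirt waits out the NFS timeout, then drops the inaccessible ISO because of startupPolicy="optional" and lets the boot proceed, while a missing mandatory image aborts the start - can be modelled roughly as follows. This is an illustrative Python sketch with invented names, not libvirt's actual qemuProcessStart logic:

```python
# Illustrative model of startupPolicy handling for removable media.
# Invented names; not libvirt or VDSM code.

def apply_startup_policy(devices, accessible):
    """Decide which device sources survive the pre-start access check.

    devices: list of dicts with 'path' and 'startupPolicy' keys.
    accessible: callable(path) -> bool; in real life this is the check
    that can stall for the NFS timeout before reporting failure.
    Returns the device list to boot with; raises if a mandatory image
    is missing.
    """
    bootable = []
    for dev in devices:
        if not dev['path'] or accessible(dev['path']):
            bootable.append(dev)
        elif dev.get('startupPolicy') == 'optional':
            # drop the medium, keep the (now empty) drive
            bootable.append(dict(dev, path=''))
        else:
            raise RuntimeError('mandatory image inaccessible: %s'
                               % dev['path'])
    return bootable
```

Under this model, the VM in the reproduction boots with an empty CDROM tray once the unreachable ISO is dropped, which matches the "various warnings, but the VM is operative" outcome reported after the gerrit 26937 merge.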
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html