Bug 1666795 - VMs migrated to 4.3 are missing the appropriate virt XML for dynamic ownership, and are reset to root:root, preventing them from starting
Summary: VMs migrated to 4.3 are missing the appropriate virt XML for dynamic ownership, and are reset to root:root, preventing them from starting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.2
Target Release: 4.3.0
Assignee: Milan Zamazal
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Duplicates: 1679882 1687126
Depends On:
Blocks: 1679884
 
Reported: 2019-01-16 14:52 UTC by Liran Rotenberg
Modified: 2022-10-07 12:57 UTC
CC List: 38 users

Fixed In Version: vdsm-4.30.10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 12:36:32 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments
/var/log/vdsm (985.81 KB, application/gzip)
2019-01-22 22:04 UTC, Simone Tiraboschi
sanlock.log (9.97 KB, text/plain)
2019-01-22 22:05 UTC, Simone Tiraboschi


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47969 0 None None None 2022-10-07 12:36:51 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:38 UTC
oVirt gerrit 97721 0 'None' MERGED vm: Fix bad comment about root_squash 2021-01-26 05:44:02 UTC
oVirt gerrit 97892 0 'None' MERGED virt: Remove an obsolete comment from the libvirt hook 2021-01-26 05:44:03 UTC
oVirt gerrit 97893 0 'None' MERGED virt: Remove min_guarantee removal from vm_migrate_hook.py 2021-01-26 05:44:03 UTC
oVirt gerrit 97894 0 'None' MERGED network: Remove vm_migrate_hook.py 2021-01-26 05:44:03 UTC
oVirt gerrit 97895 0 'None' MERGED virt: Add seclabel to element only if not already present 2021-01-26 05:44:45 UTC
oVirt gerrit 97896 0 'None' MERGED virt: Add missing seclabel's on migrations 2021-01-26 05:44:03 UTC
oVirt gerrit 98069 0 'None' MERGED network: Remove vm_migrate_hook.py 2021-01-26 05:44:46 UTC
oVirt gerrit 98070 0 'None' MERGED virt: Add seclabel to element only if not already present 2021-01-26 05:44:04 UTC
oVirt gerrit 98071 0 'None' MERGED virt: Add missing seclabel's on migrations 2021-01-26 05:44:04 UTC
oVirt gerrit 98088 0 'None' MERGED core: Add seclabel's to disks 2021-01-26 05:44:47 UTC

Internal Links: 1719789

Description Liran Rotenberg 2019-01-16 14:52:19 UTC
Description of problem:
A Hosted Engine environment was upgraded from 4.1 to 4.2, then to 4.3 (alpha2).
After enabling global maintenance on the environment and running --vm-power-off followed by --vm-start, the engine VM does not come up.

-- The start-up result is done.
Jan 16 16:43:38 ocelot02.qa.lab.tlv.redhat.com sudo[274947]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 16 16:43:38 ocelot02.qa.lab.tlv.redhat.com sudo[274947]: pam_unix(sudo:session): session closed for user root
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com systemd[1]: Started Session c52389 of user root.
-- Subject: Unit session-c52389.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit session-c52389.scope has finished starting up.
-- 
-- The start-up result is done.
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com sudo[274995]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]: ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch record "." column external_ids
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com sudo[274995]: pam_unix(sudo:session): session closed for user root
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com systemd[1]: Started Session c52390 of user root.
-- Subject: Unit session-c52390.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit session-c52390.scope has finished starting up.

In the vdsm logs I saw a storage traceback:
2019-01-16 16:42:46,953+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='5118431d-94a3-4fae-bd6d-50819d043af6') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3179, in prepareImage
    legality = dom.produceVolume(imgUUID, volUUID).getLegality()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 802, in produceVolume
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 800, in __init__
    self._manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 71, in __init__
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 86, in __init__
    self.validate()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 112, in validate
    self.validateVolumePath()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 131, in validateVolumePath
    raise se.VolumeDoesNotExist(self.volUUID)
VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
2019-01-16 16:42:46,953+0200 INFO  (vm/3074151e) [storage.TaskManager.Task] (Task='5118431d-94a3-4fae-bd6d-50819d043af6') aborting: Task is aborted: "Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4
f',)" - code 201 (task:1181)
2019-01-16 16:42:46,953+0200 ERROR (vm/3074151e) [storage.Dispatcher] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) (dispatcher:81)
2019-01-16 16:42:46,954+0200 ERROR (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') The vm start process failed (vm:937)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 866, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2749, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2587, in _make_devices
    devices = self._make_devices_from_dict()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _make_devices_from_dict
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1011, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 415, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}
2019-01-16 16:42:46,954+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Changed state to Down: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'} (code=1) (vm:1675)
2019-01-16 16:42:46,956+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:454)

Version-Release number of selected component (if applicable):
ovirt-engine 4.3 alpha2
vdsm-4.30.4-1.el7ev.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
openvswitch2.10-ovn-common-2.10.0-28.el7fdp.x86_64
openvswitch2.10-ovn-host-2.10.0-28.el7fdp.x86_64
rhv-openvswitch-ovn-common-2.10-3.el7ev.noarch
ovirt-provider-ovn-driver-1.2.17-1.el7ev.noarch
rhv-openvswitch-ovn-host-2.10-3.el7ev.noarch
openvswitch2.10-ovn-common-2.10.0-28.el7fdp.x86_64
openvswitch2.10-ovn-host-2.10.0-28.el7fdp.x86_64
rhv-openvswitch-2.10-3.el7ev.noarch
rhv-openvswitch-ovn-common-2.10-3.el7ev.noarch
openvswitch2.10-2.10.0-28.el7fdp.x86_64
python-openvswitch2.10-2.10.0-28.el7fdp.x86_64
rhv-python-openvswitch-2.10-3.el7ev.noarch
rhv-openvswitch-ovn-host-2.10-3.el7ev.noarch
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch

How reproducible:
Occurred once.

Steps to Reproduce:
1. Deploy SHE environment on 4.1.
2. Upgrade the environment to 4.2.
3. Upgrade the environment to 4.3.
4. Set global maintenance mode on.
5. Power off the HE VM (# hosted-engine --vm-power-off)
6. Power on the HE VM (# hosted-engine --vm-start)

Actual results:
The VM never came back up and the engine stayed down.

Expected results:
The VM should start and the engine should come back up.

Additional info:
After several tries, I turned global maintenance off to see whether HA would trigger and manage to bring the HE VM up.
This didn't work, resulting in a loop:
MainThread::INFO::2019-01-16 15:43:24,183::hosted_engine::949::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2019-01-16 15:43:24,218::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2019-01-16 15:43:24,227::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2019-01-16 15:43:34,243::states::785::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.
MainThread::INFO::2019-01-16 15:43:34,250::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
MainThread::INFO::2019-01-16 15:43:34,283::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineMaybeAway) sent? sent
MainThread::INFO::2019-01-16 15:43:34,383::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMaybeAway (score: 3400)
MainThread::INFO::2019-01-16 15:43:34,388::states::879::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
MainThread::INFO::2019-01-16 15:43:34,396::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2019-01-16 15:43:34,427::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMaybeAway-EngineUnexpectedlyDown) sent? sent
MainThread::INFO::2019-01-16 15:43:34,435::states::699::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Wed Jan 16 15:43:34 2019
MainThread::INFO::2019-01-16 15:43:34,435::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)

Comment 2 Liran Rotenberg 2019-01-16 19:37:15 UTC
This reproduced for me again, this time on ovirt-engine-4.3.0-0.8.rc2.el7.noarch.
Raising the severity.

Comment 4 Dan Kenigsberg 2019-01-18 16:05:23 UTC
Liran, from where did you copy the line

  ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

? Why do you think this is an openvswitch bug?

What I see here is Vdsm failing to prepare the storage volume of the hosted Engine.
Vdsm claims that 
  VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
Did you check whether it does? Please work with storage QE to understand what is going on there.


btw, the line

  Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]: ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch record "." column external_ids

is log noise. We should not generate it. This deserves a separate Vdsm|Network bug.

Comment 5 Liran Rotenberg 2019-01-21 08:19:13 UTC
(In reply to Dan Kenigsberg from comment #4)
> Liran, from where did you copy the line
> 
>   ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
> database connection failed (No such file or directory)
> 
> ? why do you think this is an openvswitch bug?
> 

The errors are from the messages log.
Because of the OVS DB error, I thought it might be related.

> What I see here is Vdsm failing to prepare the storage volume of the hosted
> Engine.
> Vdsm claims that 
>   VolumeDoesNotExist: Volume does not exist:
> (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
> Did you check if it does? Please work with stroage QE to understand what is
> going on there.
> 
> 
> btw, the line
> 
>   Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]:
> ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch
> record "." column external_ids
> 
> is log noise. We should not generate it. This deserves a separate
> Vdsm|Network bug.

As it seems less connected to network, moving the bug to storage.

Comment 6 Fred Rolland 2019-01-21 09:16:19 UTC
Simone hi, can you take a look at this issue? Maybe something in the upgrade went wrong?

Comment 8 Simone Tiraboschi 2019-01-22 13:28:09 UTC
(In reply to Fred Rolland from comment #6)
> Simone hi, can you take a look at this issue? Maybe something in the upgrade
> went wrong?

Between 4.1 and 4.3 we don't need to touch the hosted-engine storage domain at all during upgrades.

This looks strange:
VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}
2019-01-16 16:42:46,954+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Changed state to Down: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'} (code=1) (vm:1675)
2019-01-16 16:42:46,956+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:454)

but honestly I don't know what VolumeError: Bad volume specification really means.

Comment 9 Ryan Barry 2019-01-22 14:48:18 UTC
I suspect this is from lib/vdsm/clientIF.py:145 

Following it back, it goes to the VolumeDoesNotExist exception from comment#1. Is the HE storage mounted?

Comment 10 Liran Rotenberg 2019-01-22 15:03:26 UTC
(In reply to Ryan Barry from comment #9)
> I suspect this is from lib/vdsm/clientIF.py:145 
> 
> Following it back, it goes to the VolumeDoesNotExist exception from
> comment#1. Is the HE storage mounted?

In the reproduced environment:

[root@ocelot03 f8412707-fe3e-490a-9c80-1881db33afb8]# cd images/ 
[root@ocelot03 images]# tree |grep 77756d49-7cb9-45a3-a61d-c87902979eeb 
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb 
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb.lease 
│   └── 77756d49-7cb9-45a3-a61d-c87902979eeb.meta 
[root@ocelot03 images]# tree |grep -C 4 77756d49-7cb9-45a3-a61d-c87902979eeb 
│   ├── 77ffe32a-774c-4a7f-b478-b5a69849f5e5 
│   ├── 77ffe32a-774c-4a7f-b478-b5a69849f5e5.lease 
│   └── 77ffe32a-774c-4a7f-b478-b5a69849f5e5.meta 
├── bbc99d7c-93bb-44e9-8bbd-8268d19301c2 
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb 
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb.lease 
│   └── 77756d49-7cb9-45a3-a61d-c87902979eeb.meta 
└── fa8c6ea4-df94-43b4-8a99-6867b8ed2e71 
   ├── df1645f7-97ab-4647-8091-c052d46234af 
   ├── df1645f7-97ab-4647-8091-c052d46234af.lease 
   └── df1645f7-97ab-4647-8091-c052d46234af.meta 
[root@ocelot03 images]#  
[root@ocelot03 images]# cd bbc99d7c-93bb-44e9-8bbd-8268d19301c2 
[root@ocelot03 bbc99d7c-93bb-44e9-8bbd-8268d19301c2]# ll 
total 8687180 
-rw-rw----. 1 root root 62277025792 Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb 
-rw-rw----. 1 vdsm kvm      1048576 Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb.lease 
-rw-r--r--. 1 vdsm kvm          285 Jan 16 17:58 77756d49-7cb9-45a3-a61d-c87902979eeb.meta 
[root@ocelot03 bbc99d7c-93bb-44e9-8bbd-8268d19301c2]# ls -lah 
total 8.3G 
drwxr-xr-x. 2 vdsm kvm  4.0K Jan 16 17:58 . 
drwxr-xr-x. 8 vdsm kvm  4.0K Jan 16 20:00 .. 
-rw-rw----. 1 root root  58G Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb 
-rw-rw----. 1 vdsm kvm  1.0M Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb.lease 
-rw-r--r--. 1 vdsm kvm   285 Jan 16 17:58 77756d49-7cb9-45a3-a61d-c87902979eeb.meta

Comment 11 Ryan Barry 2019-01-22 15:12:49 UTC
Waiting on storage, then. Nir?

Comment 12 Nir Soffer 2019-01-22 16:04:21 UTC
(In reply to Ryan Barry from comment #11)
> Waiting on storage, then. Nir?

This means the volume 0ddd524f-4616-4877-88bb-c5a65cadfb4f does not exist:

    VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)

And this case

    VolumeError: Bad volume specification ...

is expected, although the error message is not very helpful.

From the exception we see:

domainID: 7d394a51-c3a3-47e1-ac31-697710ef0f70
imageID: c8f4766d-5e44-4798-b0e9-000cf97faebe
volumeID: 0ddd524f-4616-4877-88bb-c5a65cadfb4f

Do we have such a volume and image in this domain? Is the domain mounted?

Waiting on hosted engine developers to explain why the volume should exist.

This is probably incorrect initialization on the hosted engine side.

Comment 13 Simone Tiraboschi 2019-01-22 17:45:14 UTC
According to the provided sosreport file:

[root@t470s sosreport-ocelot02-2019-01-16-bstqhuu]# cat ./sos_commands/vdsm/su_vdsm_-s_.bin.sh_-c_.usr.bin.tree_-l_.rhev.data-center
/rhev/data-center
|-- 5c3c478b-012f-008c-03ba-00000000005b
`-- mnt
    |-- yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage
    |   |-- 7d394a51-c3a3-47e1-ac31-697710ef0f70
    |   |   |-- dom_md
    |   |   |   |-- ids
    |   |   |   |-- inbox
    |   |   |   |-- leases
    |   |   |   |-- metadata
    |   |   |   |-- outbox
    |   |   |   `-- xleases
    |   |   |-- ha_agent
    |   |   |   |-- hosted-engine.lockspace -> /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/b2157c5a-8680-4778-aaca-14acc7022a95/b83461fa-96e2-41b7-b46b-43677068f43d
    |   |   |   `-- hosted-engine.metadata -> /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/17b8c084-3139-4fd7-8150-8de4a711dd94/65e74ab2-e242-4e4a-afa1-0ddc9eeec586
    |   |   `-- images
    |   |       |-- 17b8c084-3139-4fd7-8150-8de4a711dd94
    |   |       |   |-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586
    |   |       |   |-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586.lease
    |   |       |   `-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586.meta
    |   |       |-- 44d7093c-3616-4499-bdd1-ba59633e382c
    |   |       |   |-- 65bed271-871d-4feb-b4e5-3dc6391952e8
    |   |       |   |-- 65bed271-871d-4feb-b4e5-3dc6391952e8.lease
    |   |       |   `-- 65bed271-871d-4feb-b4e5-3dc6391952e8.meta
    |   |       |-- 7448e1fb-a0d2-4fda-9edc-e8c5eaad6b7e
    |   |       |   |-- d25e22b1-fcc8-4961-bc7b-3bd218292ade
    |   |       |   |-- d25e22b1-fcc8-4961-bc7b-3bd218292ade.lease
    |   |       |   `-- d25e22b1-fcc8-4961-bc7b-3bd218292ade.meta
    |   |       |-- 7fab4d37-aeb2-4ee0-9e5e-113fee053fff
    |   |       |   |-- 2055c803-76e7-4671-b1df-c2d58b3ae20c
    |   |       |   |-- 2055c803-76e7-4671-b1df-c2d58b3ae20c.lease
    |   |       |   `-- 2055c803-76e7-4671-b1df-c2d58b3ae20c.meta
    |   |       |-- b2157c5a-8680-4778-aaca-14acc7022a95
    |   |       |   |-- b83461fa-96e2-41b7-b46b-43677068f43d
    |   |       |   |-- b83461fa-96e2-41b7-b46b-43677068f43d.lease
    |   |       |   `-- b83461fa-96e2-41b7-b46b-43677068f43d.meta
    |   |       `-- c8f4766d-5e44-4798-b0e9-000cf97faebe
    |   |           |-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f
    |   |           |-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f.lease
    |   |           `-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f.meta
    |   `-- __DIRECT_IO_TEST__
    `-- yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_nfs__0
        |-- __DIRECT_IO_TEST__
        `-- d542adfd-e61d-4503-ad91-b19458827917
            |-- dom_md
            |   |-- ids
            |   |-- inbox
            |   |-- leases
            |   |-- metadata
            |   |-- outbox
            |   `-- xleases
            |-- images
            |   |-- 505b326b-2455-4cfc-a4cc-70300ac14afe
            |   |   |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a
            |   |   |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.lease
            |   |   `-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.meta
            |   |-- 97e9c9ef-b2b4-4a10-b9a1-016a54fa4fd9
            |   |   |-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1
            |   |   |-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1.lease
            |   |   `-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1.meta
            |   |-- e56ccc92-4504-49e7-9834-63127f251651
            |   |   |-- e58caa53-54f5-4fb3-97c0-9973ce242b32
            |   |   |-- e58caa53-54f5-4fb3-97c0-9973ce242b32.lease
            |   |   `-- e58caa53-54f5-4fb3-97c0-9973ce242b32.meta
            |   `-- f84cab00-14b1-4242-9ddc-7a5609a0f2d1
            |       |-- 2780716f-e894-41a7-8542-213b2b8d19d0
            |       |-- 2780716f-e894-41a7-8542-213b2b8d19d0.lease
            |       |-- 2780716f-e894-41a7-8542-213b2b8d19d0.meta
            |       |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a
            |       |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.lease
            |       `-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.meta
            `-- master
                |-- tasks
                `-- vms


/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage/images/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f was there as expected (at least at sos report creation time).

Comment 14 Nir Soffer 2019-01-22 17:53:20 UTC
Simone, why was the volume not there at 2019-01-16 16:42:46,953+0200?

We need to check the hosted engine startup flow and find the reason.
Was the domain perhaps not mounted when we tried to start the VM?

Comment 15 Simone Tiraboschi 2019-01-22 17:56:42 UTC
The issue reported by Liran happened the first time at 15:27:23,136+0200

2019-01-16 15:27:23,136+0200 INFO  (vm/3074151e) [vdsm.api] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) from=internal, task_id=2b44a209-380d-45e9-8be5-01234bd1fc5b (api:52)
2019-01-16 15:27:23,136+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='2b44a209-380d-45e9-8be5-01234bd1fc5b') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3179, in prepareImage
    legality = dom.produceVolume(imgUUID, volUUID).getLegality()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 802, in produceVolume
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 800, in __init__
    self._manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 71, in __init__
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 86, in __init__
    self.validate()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 112, in validate
    self.validateVolumePath()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 131, in validateVolumePath
    raise se.VolumeDoesNotExist(self.volUUID)
VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
2019-01-16 15:27:23,138+0200 INFO  (vm/3074151e) [storage.TaskManager.Task] (Task='2b44a209-380d-45e9-8be5-01234bd1fc5b') aborting: Task is aborted: "Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)" - code 201 (task:1181)
2019-01-16 15:27:23,138+0200 ERROR (vm/3074151e) [storage.Dispatcher] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) (dispatcher:81)
2019-01-16 15:27:23,138+0200 ERROR (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') The vm start process failed (vm:937)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 866, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2749, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2587, in _make_devices
    devices = self._make_devices_from_dict()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _make_devices_from_dict
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1011, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 415, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}


But please notice that only 5 milliseconds earlier a getVolumeSize on the same volume (volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f') returned 'truesize': '8600494080'

2019-01-16 15:27:23,131+0200 INFO  (vm/3074151e) [vdsm.api] START getVolumeSize(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', options=None) from=internal, task_id=809a1669-5057-4252-8fa6-735040af97e2 (api:48)
2019-01-16 15:27:23,132+0200 INFO  (vm/3074151e) [vdsm.api] FINISH getVolumeSize return={'truesize': '8600494080', 'apparentsize': '62277025792'} from=internal, task_id=809a1669-5057-4252-8fa6-735040af97e2 (api:54)

Now the question is, how could a getVolumeSize successfully complete at 15:27:23,131 and a prepareImage on the same volume fail at 15:27:23,136 with VolumeDoesNotExist (Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',))?

Comment 16 Simone Tiraboschi 2019-01-22 18:01:55 UTC
And the same pattern repeated 5 minutes later, at 15:32:48:

2019-01-16 15:32:48,159+0200 INFO  (vm/3074151e) [vdsm.api] START getVolumeSize(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', options=None) from=internal, task_id=79624513-215f-4884-97ec-06199222e0c0 (api:48)
2019-01-16 15:32:48,161+0200 INFO  (vm/3074151e) [vdsm.api] FINISH getVolumeSize return={'truesize': '8600494080', 'apparentsize': '62277025792'} from=internal, task_id=79624513-215f-4884-97ec-06199222e0c0 (api:54)
2019-01-16 15:32:48,162+0200 INFO  (vm/3074151e) [vdsm.api] START prepareImage(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', leafUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', allowIllegal=False) from=internal, task_id=72124e10-a230-4c69-9f09-8aad80df24d0 (api:48)
2019-01-16 15:32:48,165+0200 INFO  (vm/3074151e) [vdsm.api] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) from=internal, task_id=72124e10-a230-4c69-9f09-8aad80df24d0 (api:52)
2019-01-16 15:32:48,165+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='72124e10-a230-4c69-9f09-8aad80df24d0') Unexpected error (task:875)

At 15:32:48,159 we got the size of the volume, but the volume had disappeared by 15:32:48,165.

Comment 17 Simone Tiraboschi 2019-01-22 18:12:47 UTC
repoStats for that SD was also fine at that time:

2019-01-16 15:32:47,930+0200 INFO  (jsonrpc/5) [vdsm.api] START repoStats(domains=[u'7d394a51-c3a3-47e1-ac31-697710ef0f70']) from=::1,51930, task_id=bcf34e9c-26eb-49c3-a158-eb8d2c06c488 (api:48)
2019-01-16 15:32:47,930+0200 INFO  (jsonrpc/5) [vdsm.api] FINISH repoStats return={u'7d394a51-c3a3-47e1-ac31-697710ef0f70': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.2012', 'lastCheck': '2.1', 'valid': True}} from=::1,51930, task_id=bcf34e9c-26eb-49c3-a158-eb8d2c06c488 (api:54)

Comment 18 Nir Soffer 2019-01-22 20:35:07 UTC
Simone, interesting!

Getting the volume size uses:

    int(int(self.oop.os.stat(volPath).st_size) / bs)

    int(int(self.oop.os.stat(volPath).st_blocks) * BLOCK_SIZE)

Which means calling stat() in ioprocess.

Validating the volume path uses:

    self.oop.fileUtils.pathExists(volPath)

Which is:

    check = os.R_OK

    if writable:
        check |= os.W_OK

    # This function is workaround for a NFS issue where sometimes
    # os.exists/os.access fails due to NFS stale handle, in such
    # case we need to test os.access a second time.
    if os.access(filename, check):
        return True

    return os.access(filename, check)

The last change in this code was:

commit 8b7c2d07b477d78cbaa630a37aa8ea571ee5f5e7
Author: Federico Simoncelli <fsimonce>
Date:   Thu Aug 4 12:00:18 2011 +0000

    BZ#726630 Use os.access twice for NFS stale handle
    
    Change-Id: Id6a9e94361c19438495a46cc708c76e4d2b43a92
    Reviewed-on: http://gerrit.usersys.redhat.com/789
    Tested-by: Federico Simoncelli <fsimonce>
    Reviewed-by: Dan Kenigsberg <danken>


The change before that was:

Author: Federico Simoncelli <fsimonce>
Date:   Fri Jul 22 14:48:56 2011 +0000

    BZ#726630 Use only os.access to check permissions
    
    The old implementation of fileUtils.pathExists was using a flawed
    re-implementation of os.access as backup check for files with a NFS
    stale handle, the consequence was that accessible files were
    reported as non-accessible.
    We now use os.stat to refresh the NFS handle and then os.access to
    determine weather we can read and write the file.
    
    Change-Id: Ic7b892886416e866178ac88c04cbfb68ed0c055b
    Reviewed-on: http://gerrit.usersys.redhat.com/733
    Reviewed-by: Dan Kenigsberg <danken>
    Tested-by: Jakub Libosvar <libosvar>


All these checks look wrong. I don't think we should use os.access for
checking whether a file exists, and raising VolumeDoesNotExist does not make
any sense when os.access fails.

Even worse, checking whether the file exists from within vdsm itself may
cause the entire vdsm process to get stuck.

But this code has seemed to work since 2011; we never had a report about such a
failure.

The underlying issue is that os.access() fails twice - something is wrong
with this mount.

What is the output of:

    ls -Zh /rhev/data-center/mnt/server:_path/sd-id/images/img-id/

When we got this error?

Please attach sanlock.log and /var/log/messages from the time of this issue.

Is this reproducible without hosted engine? How? What are the vdsm API calls
that reproduce this issue?
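
To make the failure mode described above concrete, here is a condensed sketch (an illustration only, not the actual vdsm code) of how a failed os.access() ends up being reported as VolumeDoesNotExist, so a permission problem on the image file surfaces as a "missing" volume:

    import os

    class VolumeDoesNotExist(Exception):
        pass

    def path_exists(path, writable=False):
        check = os.R_OK
        if writable:
            check |= os.W_OK
        # os.access is tried twice as a workaround for stale NFS handles
        if os.access(path, check):
            return True
        return os.access(path, check)

    def validate_volume_path(vol_path, vol_uuid):
        # A file that exists but is unreadable (e.g. root:root with mode 0660
        # over NFS with root squash) fails the os.access() check too, hence
        # the misleading "volume does not exist" error.
        if not path_exists(vol_path):
            raise VolumeDoesNotExist(vol_uuid)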

Comment 19 Simone Tiraboschi 2019-01-22 22:01:43 UTC
OK the issue probably comes from here:

[root@t470s sosreport-ocelot02-2019-01-16-bstqhuu]# cat sos_commands/vdsm/su_vdsm_-s_.bin.sh_-c_.bin.ls_-lR_.rhev.data-center
...
/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/images/c8f4766d-5e44-4798-b0e9-000cf97faebe:
total 8399948
-rw-rw----. 1 root root 62277025792 Jan 16 15:27 0ddd524f-4616-4877-88bb-c5a65cadfb4f
-rw-rw----. 1 vdsm kvm      1048576 Jan 16 15:27 0ddd524f-4616-4877-88bb-c5a65cadfb4f.lease
-rw-r--r--. 1 vdsm kvm          285 Jan 14 10:22 0ddd524f-4616-4877-88bb-c5a65cadfb4f.meta

Everything is vdsm:kvm except the file for that volume, which is root:root with mode 660, so neither the vdsm user nor the kvm group can read it.
Now the point is: how is that possible?
Before the upgrade the engine VM was running, which is not possible with those permissions. What changed the permissions at upgrade time?


(In reply to Liran Rotenberg from comment #0)
> Triggering global maintenance on the environment, then using --vm-power-off,

This is also wrong since --vm-poweroff is going to call destroy for an immediate poweroff and not shutdown for a clean shutdown.

Comment 20 Simone Tiraboschi 2019-01-22 22:04:50 UTC
Created attachment 1522514 [details]
/var/log/vdsm

Comment 21 Simone Tiraboschi 2019-01-22 22:05:39 UTC
Created attachment 1522515 [details]
sanlock.log

Comment 22 Simone Tiraboschi 2019-01-22 22:34:58 UTC
OK, in /var/log/messages we see:

Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: info : virSecuritySELinuxSetFileconHelper:1156 : Setting SELinux context on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' to 'system_u:object_r:svirt_image_t:s0:c227,c985'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: debug : virFileIsSharedFSType:3664 : Check if path /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f with FS magic 26985 is shared
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: info : virSecuritySELinuxSetFileconHelper:1200 : Setting security context 'system_u:object_r:svirt_image_t:s0:c227,c985' on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' not supported
...
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '<null>' to '107:107'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: debug : virFileFindResourceFull:1818 : Resolved 'fs' to '/usr/lib64/libvirt/storage-file/libvirt_storage_file_fs.so'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: debug : virModuleLoadFile:47 : Load module file '/usr/lib64/libvirt/storage-file/libvirt_storage_file_fs.so'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virModuleLoadFunc:68 : Lookup function 'virStorageFileFsRegister'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'file' protocol 'none'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'block' protocol 'none'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'dir' protocol 'none'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virFileFindResourceFull:1818 : Resolved 'gluster' to '/usr/lib64/libvirt/storage-file/libvirt_storage_file_gluster.so'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virModuleLoadFile:47 : Load module file '/usr/lib64/libvirt/storage-file/libvirt_storage_file_gluster.so'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: debug : virModuleLoadFunc:68 : Lookup function 'virStorageFileGlusterRegister'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'network' protocol 'gluster'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' to '107:107'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.com.redhat.rhevm.vdsm' to '107:107'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.qemu.guest_agent.0' to '107:107'
Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.ovirt.hosted-engine-setup.0' to '107:107'

And on that host we have:
libvirt.x86_64                  4.5.0-10.el7_6.3      @rhel-76z-internal-compose

Maybe it could be related to https://bugzilla.redhat.com/1633389

Comment 23 Fred Rolland 2019-01-23 10:16:05 UTC
It looks like the introduction of DAC changed the behavior regarding file permissions.
It was introduced here:
https://gerrit.ovirt.org/#/c/89830/

    devices: enable dynamic_ownership
    
    Libvirt provides dynamic_ownership option. When enabled, libvirt takes
    care of ownership and permissions of endpoints required for VM run. It
    was previously disabled due to legacy shared storage issues; these
    should be fixed by now or, if any problem is discovered, fixed ASAP.
    
    Enabling the option allows us to drop ownership handling from VDSM for
    host devices, and allows OVMF guest vars store without additional
    permission handling code. The new behavior is explicitly disabled for
    storage code - the disk ownership is still handled by VDSM.
    
    One notable device not handled by libvirt dynamic ownership is hwrng.
    We must still handle ownership of that device ourselves.
    
    Change-Id: Ibfd8f67f88e361600f417aaf52625b5bf6ea1214
    Signed-off-by: Martin Polednik <mpolednik>

Milan, can you take a look?

Comment 24 Milan Zamazal 2019-01-23 14:05:51 UTC
While dynamic ownership looks like a suspect when file permissions are wrong, I don't know how it could apply in this particular case. Normally, NFS is expected to be configured as (anonuid=36,anongid=36,all_squash) so all files have the right owners automatically. I don't think libvirt can have responsibility here, since the file has wrong permissions even before the VM is attempted to be started (if I look into the logs correctly). And we exclude storage from dynamic ownership in Vdsm, so it shouldn't be touched by libvirt anyway (unless there is a bug).

Comment 26 Jillian Morgan 2019-02-04 15:54:00 UTC
Same issue here (I think) on my 4.2.8 to 4.3.0(.4) cluster upgrade. Each VM was waiting to be rebooted to apply the new configuration after the cluster upgrade.

After each VM was shut down, it would not restart, with the same red-herring message about "bad volume specification", which had me scrambling to figure out what on earth was wrong with a specification that looked totally fine to me. I eventually discovered that the permissions on the disk images had been changed (to root:root and mode 0644). I reset the permissions properly (vdsm:kvm, 0660) and then the VMs started up fine.

I did this on a subset of HA VMs first, thankfully. On the 2nd set, I first verified that the permissions on the disk images were good before the shutdown (while still in the "vm has a new configuration waiting to be applied" state). During the shutdown process of each VM, the disk image file permissions changed to the unworkable root:root and 0644.

With the permissions manually fixed, they seem to be retained properly during subsequent shutdown/restart or reboot cycles, so the issue was only caused by the VM-config-update process? At what other times might the permissions get hosed again?

Comment 27 Simone Tiraboschi 2019-02-10 22:49:33 UTC
Raising this since we had at least 5 upstream users hitting this!

Comment 28 Nir Soffer 2019-02-12 19:18:09 UTC
(In reply to Milan Zamazal from comment #24)
> While dynamic ownership looks like a suspect when file permissions are
> wrong, I don't know how it could apply in this particular case. Normally,
> NFS is expected to be configured as (anonuid=36,anongid=36,all_squash) so
> all files have the right owners automatically.

I don't think we require this setting, but even with this you can have 
wrong permissions:

Here is my test NFS server:

$ cat /etc/exports
/export             *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36,fsid=0)

And session showing that this does not effect file permissions:

# ls -lht /rhev/data-center/
total 0
drwxr-xr-x. 2 vdsm kvm 154 Feb 10 17:50 a0011271-88a4-491f-a566-aec38b2000e9
drwxr-xr-x. 4 vdsm kvm  64 Feb 10 17:50 mnt

# mkdir /rhev/data-center/foo

# ls -lht /rhev/data-center/
total 0
drwxr-xr-x. 2 root root   6 Feb 12 20:57 foo
drwxr-xr-x. 2 vdsm kvm  154 Feb 10 17:50 a0011271-88a4-491f-a566-aec38b2000e9
drwxr-xr-x. 4 vdsm kvm   64 Feb 10 17:50 mnt

# touch /rhev/data-center/foo/bar

# ls -lh /rhev/data-center/foo/
total 0
-rw-r--r--. 1 root root 0 Feb 12 20:58 bar

NFS all_squash,anonuid=36,anongid=36 probably affects the effective uid/gid,
not file permissions.

> I don't think libvirt can
> have responsibility here, since the file has wrong permissions even before
> the VM is attempted to be started (if I look into the logs correctly). And
> we exclude storage from dynamic ownership in Vdsm, so it shouldn't be
> touched by libvirt anyway (unless there is a bug).

Storage has no code to change file ownership; we only check that we can access
files. Storage provisioning code runs as vdsm, and anything we create (in file storage)
is owned by vdsm.

In block storage we use sudo to run lvm, and use udev rules to change volumes so
they are owned by vdsm. Libvirt creates its own namespace for block devices, and is
responsible for the ownership and permissions of these block devices.
If this were an issue with block device ownership, it could be a storage issue.

$ git grep chown lib/vdsm/supervdsm_server.py
lib/vdsm/supervdsm_server.py:from vdsm.storage.fileUtils import chown, resolveGid, resolveUid
lib/vdsm/supervdsm_server.py:            chown(address, args.sock_user, args.sock_group)

$ git grep chown lib/vdsm/supervdsm_api/
lib/vdsm/supervdsm_api/udev.py:        os.chown(_HWRNG_PATH, 0, 0)
lib/vdsm/supervdsm_api/virt.py:            os.chown(socketFile,

$ git grep 'chown('
lib/vdsm/gluster/api.py:            os.chown(sshDir, uid, gid)
lib/vdsm/gluster/api.py:        os.chown(authKeysFile, uid, gid)
lib/vdsm/mkimage.py:        os.chown(media, resolveUid(DISKIMAGE_USER),
lib/vdsm/mkimage.py:            os.chown(_P_PAYLOAD_IMAGES,
lib/vdsm/network/configurators/ifcfg.py:        os.chown(dirName, vdsm_uid, -1)
lib/vdsm/network/configurators/ifcfg.py:        os.chown(backup, vdsm_uid, -1)
lib/vdsm/storage/fileUtils.py:def chown(path, user=-1, group=-1):
lib/vdsm/storage/fileUtils.py:    os.chown(path, uid, gid)
lib/vdsm/supervdsm_api/udev.py:        os.chown(_HWRNG_PATH, 0, 0)
lib/vdsm/supervdsm_api/virt.py:            os.chown(socketFile,
lib/vdsm/supervdsm_server.py:            chown(address, args.sock_user, args.sock_group)
lib/vdsm/tool/configurators/managedvolumedb.py:    fileUtils.chown(mvdb.DB_FILE, constants.VDSM_USER, constants.VDSM_GROUP)
lib/vdsm/tool/register.py:            os.chown(_auth_keys_dir, _uid, _uid)
lib/vdsm/tool/register.py:            os.chown(_auth_keys, _uid, _uid)
lib/vdsm/tool/transient.py:    os.chown(TRANSIENT_DISKS_REPO, vdsm_uid, vdsm_gid)
tests/storage/fileutil_test.py:            fileUtils.chown(srcPath, targetId, targetId)
tests/storage/fileutil_test.py:            fileUtils.chown(srcPath, tmpId, tmpId)
tests/storage/fileutil_test.py:            fileUtils.chown(srcPath, "root", "root")

This is a new issue since we started to use DAC with libvirt. Libvirt is running
as root, and letting it handle permissions is the most likely reason that
after stopping VMs the volumes are owned by root.

I think we need to enable libvirt debug logs and find out what happens when we shut down
a VM, maybe an unclean shutdown.

Moving to virt to investigate this.

Comment 29 Nir Soffer 2019-02-12 19:20:19 UTC
Peter, do you have any suggestion on how to debug this, or what info you need
to look at this issue?

Comment 30 Ryan Barry 2019-02-12 19:49:45 UTC
Nir, I'm not able to reproduce this --

On my test system, setting all_squash (or even root_squash), anonuid, and anongid appropriately squashes the user. Are you running Fedora or EL?

I'm also guessing you re-exported+re-mounted the share after making any changes, but as much information as possible would be helpful. Either way, I think this is not virt. If all_squash (or root_squash) doesn't actually squash the remote user, it seems like a very severe platform bug with large security implications.

Comment 31 Simone Tiraboschi 2019-02-12 19:52:45 UTC
With root_squash or all_squash you are probably avoiding it on the storage server side; I'd suggest trying to reproduce with no_root_squash.

Comment 33 Ryan Barry 2019-02-12 20:59:17 UTC
Yes, this is reproducible without all_squash or root_squash -- but not with dynamic_ownership=1 (which the vdsm configurator should already set, per https://gerrit.ovirt.org/#/c/89830/26/lib/vdsm/tool/configurators/libvirt.py) 

Can someone else on this thread verify this externally? Available workarounds are:

Changing ownership manually
Setting all_squash or root_squash + anonuid/anongid
(Best) running "vdsm-tool configure --force" to ensure dynamic_ownership=1

Comment 34 Endre Karlson 2019-02-12 21:30:47 UTC
Hi, I have tried the following: 
service vdsmd stop
vdsm-tool configure --force
service vdsmd start

and then restarted the HA agent on all nodes, but it doesn't help; the node upgraded to 4.3 is still not able to start the engine.

Currently my cluster is a brick so help / workaround would be nice!

Comment 35 Ryan Barry 2019-02-12 22:07:01 UTC
Hey Endre - thanks for the report! Is dynamic_ownership=1 set in libvirtd.conf ?

To get it running again, please chown 36:36 the disk for hosted engine (doing this recursively on the entire share may be safe)

It will be something like 1.2.3.4:/storage/hosted-engine/$(uuid)/images/$(uuid)/$(uuid)

uuid is dynamically generated, but tab completion will get you far

The entire tree should be 36:36

Comment 36 Nir Soffer 2019-02-12 22:18:39 UTC
(In reply to Ryan Barry from comment #30)
> Nir, I'm not able to reproduce this --

Right, I tested local file system :-)
(using /rhev/data-center instead of /rhev/data-center/mnt/server:_path)

I tested again both Fedora 28 server and CentOS 7.6 server with:

# cat /etc/exports
/export  *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export  <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,root_squash,all_squash)


Both do squash the permissions when using all_squash.

Changing the permissions fail with:
chown: changing ownership of 'test': Operation not permitted


I tested also without all_squash:

# cat /etc/exports
/export  *(rw,sync,no_subtree_check,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export  <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,root_squash,no_all_squash)

We get root_squash by default, and testing the mount does
squash root as expected.


And with no_root_squash:

# cat /etc/exports
/export  *(rw,sync,no_subtree_check,no_root_squash,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export  <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,no_root_squash,no_all_squash)

With this we can create files as root and change their
ownership to root.


So this looks like a configuration issue on the user side - but even with a non-default
configuration disabling root_squash (or maybe some other NFS server that has
different defaults), files are not expected to change ownership after we pass
the file to libvirt.

Recommending root_squash should avoid the issue, but I think it hides
the real issue: something running as root is changing file ownership.


Checking our docs, we don't mention root_squash.
- https://www.ovirt.org/documentation/admin-guide/chap-Storage.html#preparing-nfs-storage
- https://www.ovirt.org/develop/troubleshooting-nfs-storage-issues.html


We had this bug about failing to upgrade hosted engine if root_squash
is enabled:
https://bugzilla.redhat.com/1466234

We had this old bug with root_squash:
https://bugzilla.redhat.com/963881

So I think we have these issues:
- storage 1: verify that we test with root_squash - maybe we test it
  without root squash because of historic issues.
- storage 2: fix bad comment in vdsm source about root_squash
  https://gerrit.ovirt.org/c/97721/
- docs issue: recommend root_squash after testing shows that we
  don't have any issue.
- virt: understand why files used by libvirt change ownership

Comment 37 Nir Soffer 2019-02-12 22:30:27 UTC
Elad, do we test NFS with root_squash, all_squash or no_root_squash?

Comment 38 Simone Tiraboschi 2019-02-12 23:04:34 UTC
(In reply to Nir Soffer from comment #36)
> (In reply to Ryan Barry from comment #30)
> So I think we have these issues:
> - storage 1: verify that we test with root_squash - maybe we test it
>   without root squash because of historic issues.
> - storage 2: fix bad comment in vdsm source about root_squash
>   https://gerrit.ovirt.org/c/97721/
> - docs issue: recommend to use root_squash after testing show that we
>   don't have any issue.

This is currently affecting GlusterFS SDs as well, and I'm not sure we have a root_squash equivalent for GlusterFS fuse mounts.
Sahina?

> - virt: understand why files used by libvirt change ownership

Comment 39 Elad 2019-02-13 07:40:07 UTC
(In reply to Nir Soffer from comment #37)
> Elad, do we test NFS with root_squash, all_squash or no_root_squash?

No

Comment 40 Nir Soffer 2019-02-13 12:18:05 UTC
(In reply to Elad from comment #39)
> (In reply to Nir Soffer from comment #37)

Elad, sorry if my question was not specific enough.

I need the NFS configuration used for testing RHV, both manual tests and automation.

Please share the /etc/exports file used in all NFS servers used for testing, all 
configurations.

Comment 42 Nir Soffer 2019-02-13 13:48:04 UTC
(In reply to Elad from comment #41)

Thanks Elad!

So we have many servers configured with:

    (rw,no_root_squash)

In this setup libvirt can change the ownership of images, and they will
be owned by root.

And many servers configured with:

    (rw,root_squash)

In this setup libvirt is squashed to the default uid/gid 65534. If libvirt
changes image ownership, the images will not belong to root, and vdsm will not
be able to access them.

Why are we testing both no_root_squash and root_squash to the default uid/gid?
Do we have any documentation why we use these settings?

It is clear that the configuration that may prevent the issue in this bug:

    (rw,root_squash,anonuid=36,anongid=36)

is not tested by QE, so we should be careful about recommending it to users.

Generally I'm not happy about squashing root to 36:36. root has no reason
to touch any file in vdsm storage domain mount, and this setting hides bugs
in code running as root.

I think it is better to fail early if some component is making incorrect
changes instead of hiding them by squashing.

Comment 43 Michal Skrivanek 2019-02-14 12:30:34 UTC
Elad, can we get a clean reproducer? We can then try switching dynamic_ownership and the seclabel settings to confirm that this is the cause.

Comment 44 Elad 2019-02-14 14:17:27 UTC
Nir, this server is used by many clients; some of the configuration in /etc/exports was added for specific needs.
The relevant line from /etc/exports for the path used for reproducing the bug is:
/Compute_NFS 10.0.0.0/255.0.0.0(rw,no_root_squash)

Michal, I understand you took this one offline with Liran.

Comment 45 Peter Krempa 2019-02-19 12:39:50 UTC
Reposting important bits from the private email thread about this bug:

> >>>> *2019-02-18 15:01:19.200+0000: 14951: info :
> >>>> virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on
> >>>> '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/53*
> >>>> *5e0512-987d-4781-be80-f60f36cceacd/images/c2400df6-1d9b-487e-b4c5-abce730ee7a2/16fa2ea1-2baf-4af4-aa09-b16067aa9988'
> >>>> to '0:0'*
> >
> > Is this on the source or destination?
>
> Destination host with dynamic_ownership=1, on shutdown

Ah, I see. So the problem is that after you migrate this VM in from a host
which did not have dynamic_ownership enabled (and thus the VM XML did not
use the <seclabel> to prevent relabelling) and then shut the VM down, libvirt
will relabel it to root:root.

In this case the libvirt code is correct though I'm afraid. I don't
think it's worth applying any workarounds in libvirt as we've never
encouraged users to disable dynamic ownership.

> > If the migration is controlled by a host which is aware (e.g. non
> > peer-to-peer migration initiated from the destination host ) it's better
> > to use the destination XML parameter directly rather than writing ugly
> > filter code to the hook.
>
> Source is sending it, though destination can make changes too. But it
> would be needed for VM resume too.
> Maybe hook is simpler

Note that you can use virDomainSaveImageDefineXML to modify the XML in
the save image if you opt not to use the hook. Especially if you can
send the correct XML during migration.
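
For reference, the kind of per-disk <seclabel> that prevents libvirt's DAC relabelling looks roughly like the following; this is an illustration of the mechanism only, the file path is a placeholder and the exact attributes are defined by the vdsm patches linked above:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source file='/rhev/data-center/mnt/SERVER:_PATH/SD-UUID/images/IMG-UUID/VOL-UUID'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>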

Comment 46 Jillian Morgan 2019-02-19 13:29:06 UTC
(In reply to Peter Krempa from comment #45)
> In this case the libvirt code is correct though I'm afraid. I don't
> think it's worth applying any workarounds in libvirt as we've never
> encouraged users to disable dynamic ownership.

To be clear, in my case of hitting this bug, I never (knowingly?) disabled dynamic ownership. As I understand it, that is a new feature in 4.3. All I did was upgrade from 4.2(.8) to 4.3, and it broke my VMs. So I think it is up to oVirt to manage the necessary corrections on upgrade to avoid breaking VM disk permissions.

An additional point, for clarity, since a lot of this thread discusses NFS-based storage domains, is that on my cluster I use GlusterFS storage domains, so the whole root_squash/no_root_squash issue is moot.

Comment 47 Milan Zamazal 2019-02-19 13:55:33 UTC
Ian, yes, when migrating a running or suspended VM from an environment < 4.3, Vdsm must modify the domain XML accordingly. I'm going to implement that in a libvirt hook.
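
As a rough illustration of that approach (not the actual Vdsm patch; the file location, argument handling, and element paths are assumptions based on the libvirt QEMU hook interface), a hook along these lines could add the missing DAC seclabel to the incoming domain XML on migration:

    #!/usr/bin/python
    # Sketch of a libvirt QEMU hook (e.g. /etc/libvirt/hooks/qemu). For the
    # 'migrate' operation libvirt passes the incoming domain XML on stdin and
    # uses whatever the hook prints on stdout as the replacement XML.
    import sys
    import xml.etree.ElementTree as ET

    def add_dac_seclabels(dom_xml):
        root = ET.fromstring(dom_xml)
        for source in root.findall('./devices/disk/source'):
            # Skip disks that already carry a DAC seclabel (VMs started on 4.3).
            if source.find("seclabel[@model='dac']") is None:
                ET.SubElement(source, 'seclabel',
                              {'model': 'dac', 'relabel': 'no'})
        return ET.tostring(root).decode()

    def main():
        # Hook arguments: <vm name> <operation> <sub-operation> <extra>
        if len(sys.argv) > 2 and sys.argv[2] == 'migrate':
            sys.stdout.write(add_dac_seclabels(sys.stdin.read()))

    if __name__ == '__main__':
        main()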

Comment 48 SATHEESARAN 2019-02-25 10:00:59 UTC
(In reply to Simone Tiraboschi from comment #38)
> (In reply to Nir Soffer from comment #36)
> > (In reply to Ryan Barry from comment #30)
> > So I think we have these issues:
> > - storage 1: verify that we test with root_squash - maybe we test it
> >   without root squash because of historic issues.
> > - storage 2: fix bad comment in vdsm source about root_squash
> >   https://gerrit.ovirt.org/c/97721/
> > - docs issue: recommend to use root_squash after testing show that we
> >   don't have any issue.
> 
> This is currently affecting also GlusterFS SD and I'm not sure we have a
> root_squash equivalent on GlusterFS fuse mounts.
> Sahina?
> 
> > - virt: understand why files used by libvirt change ownership

Hi Simone,

Yes, we do see this problem while upgrading from RHV 4.2.8-2 to RHV 4.3.
The ownership of the image files becomes root:root, and when it is reset to vdsm:kvm,
the VM can start.

Comment 49 bipin 2019-02-25 10:06:50 UTC
*** Bug 1679882 has been marked as a duplicate of this bug. ***

Comment 50 Sandro Bonazzola 2019-02-28 07:41:21 UTC
Moving back to POST since https://gerrit.ovirt.org/#/c/98088/ is not merged yet

Comment 51 Michal Skrivanek 2019-02-28 13:29:59 UTC
The core issue is already fixed, though.

Comment 52 RHV bug bot 2019-03-05 21:22:50 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 53 Liran Rotenberg 2019-03-06 13:02:09 UTC
Verified on:
vdsm-4.30.10-1.el7ev.x86_64

Steps:
Have an environment with a 4.3 host (vdsm-4.30.10-1.el7ev.x86_64) and an older host < 4.3, for example 4.2 with vdsm-4.20.46-1.el7ev.x86_64.
1. Create a VM.
2. Run it on the 4.2 host.
3. Migrate the VM to the 4.3 host.
4. Shutdown the VM.
5. Start the VM.

Results:
The VM could start again. Checking the image with:
# ll /rhev/data-center/mnt/<IMAGE PATH>
showed ownership vdsm:kvm.

Shutting down didn't show a DAC change of the image to '0:0' in the libvirt log.

On the source host: virsh -r dumpxml <VM>
showed:
 <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/535e0512-987d-4781-be80-f60f36cceacd/images/7d7719b2-578d-4a89-b6fb-15433f6e47dd/223b153a-761c-4017-8694-e434c8764155'/>

...

  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c758,c994</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c758,c994</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>

On the source host, /etc/libvirt/qemu.conf has dynamic_ownership=0.

After migrating, the domain XML on the destination host shows:
<source file='/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/535e0512-987d-4781-be80-f60f36cceacd/images/7d7719b2-578d-4a89-b6fb-15433f6e47dd/223b153a-761c-4017-8694-e434c8764155'>
        <seclabel model='dac' relabel='no'/>
      </source>

...

  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c240,c534</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c240,c534</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>

The seclabel is added successfully on live migration from a host < 4.3 to a 4.3 host.

Comment 54 Darrell 2019-03-16 02:45:35 UTC
This is also occurring when the oVirt HA agent launches the hosted engine VM. I had this happen to me last night, and it occurred in a reproducible fashion. It seems to be triggered by the act of the VM being started (I'd stopped it from the VM), or possibly by the VM stopping. It first happened while I was upgrading from 4.2.8 to 4.3.1, but continued once all my HA hosts were upgraded. Resetting the ownership on the hosted engine VM disk resolved the issue in each case.

ovirt-release42-4.2.8-1.el7.noarch
ovirt-imageio-daemon-1.5.0-0.el7.noarch
ovirt-host-dependencies-4.3.1-1.el7.x86_64
cockpit-machines-ovirt-176-4.el7.centos.noarch
ovirt-release43-4.3.1-1.el7.noarch
ovirt-vmconsole-host-1.0.7-2.el7.noarch
ovirt-hosted-engine-setup-2.3.5-1.el7.noarch
ovirt-ansible-engine-setup-1.1.8-1.el7.noarch
ovirt-imageio-common-1.5.0-0.el7.x86_64
python2-ovirt-host-deploy-1.8.0-1.el7.noarch
ovirt-host-4.3.1-1.el7.x86_64
ovirt-vmconsole-1.0.7-2.el7.noarch
ovirt-ansible-repositories-1.1.5-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.11-1.el7.noarch
ovirt-provider-ovn-driver-1.2.20-1.el7.noarch
ovirt-hosted-engine-ha-2.3.1-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.centos.noarch
python2-ovirt-setup-lib-1.2.0-1.el7.noarch
python-ovirt-engine-sdk4-4.3.0-2.el7.x86_64
cockpit-ovirt-dashboard-0.12.3-1.el7.noarch
ovirt-host-deploy-common-1.8.0-1.el7.noarch

libvirt-daemon-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-10.el7_6.4.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.4.x86_64
libvirt-lock-sanlock-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-lxc-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-iscsi-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-kvm-4.5.0-10.el7_6.4.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-nodedev-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-config-network-4.5.0-10.el7_6.4.x86_64
libvirt-client-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-interface-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-network-4.5.0-10.el7_6.4.x86_64
libvirt-bash-completion-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-core-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-10.el7_6.4.x86_64
libvirt-libs-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-secret-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.4.x86_64

Comment 55 nardusg 2019-04-10 06:23:13 UTC
Also having this issue. Upgraded from 4.2.8 to 4.3.2. I am on FCP storage and see the same issues. Any idea how to fix permissions on FC storage? If I detach the disks and reattach them to a newly created VM, the VM starts.

Comment 56 Milan Zamazal 2019-04-10 11:34:27 UTC
(In reply to nardusg from comment #55)
> Also having this issue. Upgraded from 4.2.8 to 4.3.2. I am on FCP storage
> and see the same issues. Any idea how to fix permissions on FC storage? If I
> detach the disks and reattach them to a newly created VM, the VM starts.

As discussed on the oVirt users list, the problem is unrelated to this bug; the VM doesn't start for a completely different reason.

Comment 58 errata-xmlrpc 2019-05-08 12:36:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 59 Sahina Bose 2019-05-21 12:08:51 UTC
*** Bug 1687126 has been marked as a duplicate of this bug. ***

Comment 60 ΒΑΣΙΛΟΠΟΥΛΟΣ ΓΕΩΡΓΙΟΣ ΕΛΕΥΘΕΡΙΟΣ 2019-07-23 19:01:52 UTC
I think this is not fully resolved yet. I encountered this after upgrading from 4.2.8 to 4.3.4.
All machines are on gluster volumes. I changed the cluster compatibility to 4.3 after the upgrade.
When I shut down a VM, its image gets permissions root:root and I have to manually reset the permissions to vdsm:kvm in order for it to restart (see the sketch below).
Machines that were down during the update are not affected; they start normally and their ownership is not altered.
VMs that were running during the upgrade get ownership root:root when they are shut down. After changing the ownership they seem to behave normally.
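
For reference, the manual workaround described here amounts to roughly the following; the GlusterFS mount path and UUIDs are placeholders, and 36:36 is vdsm:kvm on oVirt hosts:

    # after shutting the VM down, on the host that mounts the storage domain
    chown 36:36 /rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd-uuid>/images/<img-uuid>/*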

Comment 61 Milan Zamazal 2019-07-24 07:37:31 UTC
Do you use gluster with gfapi? It's a known bug; a fix is in preparation, see https://bugzilla.redhat.com/1719789.

Comment 62 ΒΑΣΙΛΟΠΟΥΛΟΣ ΓΕΩΡΓΙΟΣ ΕΛΕΥΘΕΡΙΟΣ 2019-07-24 21:09:09 UTC
I use libgfapi, but this only happened to me once, not every time. I only need to change the ownership once; it does not change again on the next shutdown, AFAIK. I mean I only have to change the ownership to 36:36 once, and then the machine shuts down and starts normally.
This only happens on machines that were running during the transition from 4.2 to 4.3. After a shutdown -> chown 36:36 vm_image, the machines shut down and restart normally. Machines that were stopped during the upgrade are not affected.

Comment 63 Milan Zamazal 2019-07-25 05:49:20 UTC
The bug means that both libvirt and Vdsm (plus your manual changes) try to manage the owners of the GlusterFS images. The owners may remain correct, or they can later be changed inappropriately.

