Description of problem:

A Hosted Engine environment was upgraded from 4.1 to 4.2, then to 4.3 (alpha2). After enabling global maintenance on the environment and running --vm-power-off followed by --vm-start, the engine VM does not come up.

From the journal:

-- The start-up result is done.
Jan 16 16:43:38 ocelot02.qa.lab.tlv.redhat.com sudo[274947]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 16 16:43:38 ocelot02.qa.lab.tlv.redhat.com sudo[274947]: pam_unix(sudo:session): session closed for user root
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com systemd[1]: Started Session c52389 of user root.
-- Subject: Unit session-c52389.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-c52389.scope has finished starting up.
--
-- The start-up result is done.
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com sudo[274995]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]: ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch record "." column external_ids
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com sudo[274995]: pam_unix(sudo:session): session closed for user root
Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com systemd[1]: Started Session c52390 of user root.
-- Subject: Unit session-c52390.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-c52390.scope has finished starting up.

In the vdsm logs I saw a storage traceback:

2019-01-16 16:42:46,953+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='5118431d-94a3-4fae-bd6d-50819d043af6') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3179, in prepareImage
    legality = dom.produceVolume(imgUUID, volUUID).getLegality()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 802, in produceVolume
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 800, in __init__
    self._manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 71, in __init__
    volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 86, in __init__
    self.validate()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 112, in validate
    self.validateVolumePath()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 131, in validateVolumePath
    raise se.VolumeDoesNotExist(self.volUUID)
VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
2019-01-16 16:42:46,953+0200 INFO (vm/3074151e) [storage.TaskManager.Task] (Task='5118431d-94a3-4fae-bd6d-50819d043af6') aborting: Task is aborted: "Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)" - code 201 (task:1181)
2019-01-16 16:42:46,953+0200 ERROR (vm/3074151e) [storage.Dispatcher] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) (dispatcher:81)
2019-01-16 16:42:46,954+0200 ERROR (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') The vm start process failed (vm:937)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 866, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2749, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2587, in _make_devices
    devices = self._make_devices_from_dict()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _make_devices_from_dict
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1011, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 415, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}
2019-01-16 16:42:46,954+0200 INFO (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Changed state to Down: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'} (code=1) (vm:1675)
2019-01-16 16:42:46,956+0200 INFO (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:454)

Version-Release number of selected component (if applicable):
ovirt-engine 4.3 alpha2
vdsm-4.30.4-1.el7ev.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
openvswitch2.10-ovn-common-2.10.0-28.el7fdp.x86_64
openvswitch2.10-ovn-host-2.10.0-28.el7fdp.x86_64
rhv-openvswitch-ovn-common-2.10-3.el7ev.noarch
ovirt-provider-ovn-driver-1.2.17-1.el7ev.noarch
rhv-openvswitch-ovn-host-2.10-3.el7ev.noarch
openvswitch2.10-ovn-common-2.10.0-28.el7fdp.x86_64
openvswitch2.10-ovn-host-2.10.0-28.el7fdp.x86_64
rhv-openvswitch-2.10-3.el7ev.noarch
rhv-openvswitch-ovn-common-2.10-3.el7ev.noarch
openvswitch2.10-2.10.0-28.el7fdp.x86_64
python-openvswitch2.10-2.10.0-28.el7fdp.x86_64
rhv-python-openvswitch-2.10-3.el7ev.noarch
rhv-openvswitch-ovn-host-2.10-3.el7ev.noarch
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch

How reproducible:
Occurred once.

Steps to Reproduce:
1. Deploy SHE environment on 4.1.
2. Upgrade the environment to 4.2.
3. Upgrade the environment to 4.3.
4. Set global maintenance mode on.
5. Power off the HE VM (# hosted-engine --vm-power-off)
6. Power on the HE VM (# hosted-engine --vm-start)

Actual results:
The VM never came back up; the engine died.

Expected results:
The VM should start and the engine should come back.

Additional info:
After several tries, I turned global maintenance off to see whether HA would kick in and manage to bring the HE VM up. It didn't work, and the agent entered a loop:

MainThread::INFO::2019-01-16 15:43:24,183::hosted_engine::949::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2019-01-16 15:43:24,218::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
MainThread::INFO::2019-01-16 15:43:24,227::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2019-01-16 15:43:34,243::states::785::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is unexpectedly down.
MainThread::INFO::2019-01-16 15:43:34,250::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
MainThread::INFO::2019-01-16 15:43:34,283::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineMaybeAway) sent? sent
MainThread::INFO::2019-01-16 15:43:34,383::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMaybeAway (score: 3400)
MainThread::INFO::2019-01-16 15:43:34,388::states::879::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
MainThread::INFO::2019-01-16 15:43:34,396::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2019-01-16 15:43:34,427::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMaybeAway-EngineUnexpectedlyDown) sent? sent
MainThread::INFO::2019-01-16 15:43:34,435::states::699::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Wed Jan 16 15:43:34 2019
MainThread::INFO::2019-01-16 15:43:34,435::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
This reproduced for me again, this time on ovirt-engine-4.3.0-0.8.rc2.el7.noarch. Raising the severity.
Liran, from where did you copy the line

ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

? Why do you think this is an openvswitch bug?

What I see here is Vdsm failing to prepare the storage volume of the hosted Engine. Vdsm claims that

VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)

Did you check whether it does? Please work with storage QE to understand what is going on there.

BTW, the line

Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]: ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch record "." column external_ids

is log noise. We should not generate it. This deserves a separate Vdsm|Network bug.
(In reply to Dan Kenigsberg from comment #4)
> Liran, from where did you copy the line
>
> ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
> database connection failed (No such file or directory)
>
> ? Why do you think this is an openvswitch bug?

The errors are from the messages log. Because of the DB-OVS error, I thought it might be related.

> What I see here is Vdsm failing to prepare the storage volume of the hosted
> Engine. Vdsm claims that
>
> VolumeDoesNotExist: Volume does not exist:
> (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)
>
> Did you check whether it does? Please work with storage QE to understand
> what is going on there.
>
> BTW, the line
>
> Jan 16 16:43:48 ocelot02.qa.lab.tlv.redhat.com ovs-vsctl[274998]:
> ovs|00001|db_ctl_base|ERR|no key "odl_os_hostconfig_hostid" in Open_vSwitch
> record "." column external_ids
>
> is log noise. We should not generate it. This deserves a separate
> Vdsm|Network bug.

As it seems less connected to network, moving the bug to storage.
Simone hi, can you take a look at this issue? Maybe something in the upgrade went wrong?
(In reply to Fred Rolland from comment #6) > Simone hi, can you take a look at this issue? Maybe something in the upgrade > went wrong? Between 4.1 and 4.3 we don't need to touch the hosted-engine storage domain at all during upgrades. This looks strange: VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'} 2019-01-16 16:42:46,954+0200 INFO (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Changed state to Down: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'} (code=1) (vm:1675) 2019-01-16 16:42:46,956+0200 INFO (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:454) but honestly I don't know what VolumeError: Bad volume specification really means.
I suspect this is from lib/vdsm/clientIF.py:145 Following it back, it goes to the VolumeDoesNotExist exception from comment#1. Is the HE storage mounted?
(In reply to Ryan Barry from comment #9)
> I suspect this is from lib/vdsm/clientIF.py:145
>
> Following it back, it goes to the VolumeDoesNotExist exception from
> comment#1. Is the HE storage mounted?

In the reproduced environment:

[root@ocelot03 f8412707-fe3e-490a-9c80-1881db33afb8]# cd images/
[root@ocelot03 images]# tree |grep 77756d49-7cb9-45a3-a61d-c87902979eeb
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb.lease
│   └── 77756d49-7cb9-45a3-a61d-c87902979eeb.meta
[root@ocelot03 images]# tree |grep -C 4 77756d49-7cb9-45a3-a61d-c87902979eeb
│   ├── 77ffe32a-774c-4a7f-b478-b5a69849f5e5
│   ├── 77ffe32a-774c-4a7f-b478-b5a69849f5e5.lease
│   └── 77ffe32a-774c-4a7f-b478-b5a69849f5e5.meta
├── bbc99d7c-93bb-44e9-8bbd-8268d19301c2
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb
│   ├── 77756d49-7cb9-45a3-a61d-c87902979eeb.lease
│   └── 77756d49-7cb9-45a3-a61d-c87902979eeb.meta
└── fa8c6ea4-df94-43b4-8a99-6867b8ed2e71
    ├── df1645f7-97ab-4647-8091-c052d46234af
    ├── df1645f7-97ab-4647-8091-c052d46234af.lease
    └── df1645f7-97ab-4647-8091-c052d46234af.meta
[root@ocelot03 images]#
[root@ocelot03 images]# cd bbc99d7c-93bb-44e9-8bbd-8268d19301c2
[root@ocelot03 bbc99d7c-93bb-44e9-8bbd-8268d19301c2]# ll
total 8687180
-rw-rw----. 1 root root 62277025792 Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb
-rw-rw----. 1 vdsm kvm      1048576 Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb.lease
-rw-r--r--. 1 vdsm kvm          285 Jan 16 17:58 77756d49-7cb9-45a3-a61d-c87902979eeb.meta
[root@ocelot03 bbc99d7c-93bb-44e9-8bbd-8268d19301c2]# ls -lah
total 8.3G
drwxr-xr-x. 2 vdsm kvm 4.0K Jan 16 17:58 .
drwxr-xr-x. 8 vdsm kvm 4.0K Jan 16 20:00 ..
-rw-rw----. 1 root root  58G Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb
-rw-rw----. 1 vdsm kvm  1.0M Jan 16 21:27 77756d49-7cb9-45a3-a61d-c87902979eeb.lease
-rw-r--r--. 1 vdsm kvm   285 Jan 16 17:58 77756d49-7cb9-45a3-a61d-c87902979eeb.meta
Waiting on storage, then. Nir?
(In reply to Ryan Barry from comment #11)
> Waiting on storage, then. Nir?

This means the volume 0ddd524f-4616-4877-88bb-c5a65cadfb4f does not exist:

VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)

And in this case

VolumeError: Bad volume specification ...

is expected, although it is not a very helpful error message.

From the exception we see:

domainID: 7d394a51-c3a3-47e1-ac31-697710ef0f70
imageID:  c8f4766d-5e44-4798-b0e9-000cf97faebe
volumeID: 0ddd524f-4616-4877-88bb-c5a65cadfb4f

Do we have such a volume and image in this domain? Is the domain mounted?

Waiting on hosted engine developers to explain why the volume should exist. This is probably wrong initialization on the hosted engine side.
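(A minimal check sketch for the two questions above - not from the original comment; the mount directory pattern and UUIDs are taken from later comments in this bug, so treat the exact paths as illustrative:)

  # Is the hosted-engine storage domain mounted?
  mount | grep hosted__engine__storage

  # Does the image/volume exist, and with what owner, mode and SELinux label?
  # (layout: /rhev/data-center/mnt/<server:_export>/<sdUUID>/images/<imgUUID>/)
  ls -lZ /rhev/data-center/mnt/*hosted__engine__storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/images/c8f4766d-5e44-4798-b0e9-000cf97faebe/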
According to the provided sosreport file: [root@t470s sosreport-ocelot02-2019-01-16-bstqhuu]# cat ./sos_commands/vdsm/su_vdsm_-s_.bin.sh_-c_.usr.bin.tree_-l_.rhev.data-center /rhev/data-center |-- 5c3c478b-012f-008c-03ba-00000000005b `-- mnt |-- yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage | |-- 7d394a51-c3a3-47e1-ac31-697710ef0f70 | | |-- dom_md | | | |-- ids | | | |-- inbox | | | |-- leases | | | |-- metadata | | | |-- outbox | | | `-- xleases | | |-- ha_agent | | | |-- hosted-engine.lockspace -> /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/b2157c5a-8680-4778-aaca-14acc7022a95/b83461fa-96e2-41b7-b46b-43677068f43d | | | `-- hosted-engine.metadata -> /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/17b8c084-3139-4fd7-8150-8de4a711dd94/65e74ab2-e242-4e4a-afa1-0ddc9eeec586 | | `-- images | | |-- 17b8c084-3139-4fd7-8150-8de4a711dd94 | | | |-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586 | | | |-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586.lease | | | `-- 65e74ab2-e242-4e4a-afa1-0ddc9eeec586.meta | | |-- 44d7093c-3616-4499-bdd1-ba59633e382c | | | |-- 65bed271-871d-4feb-b4e5-3dc6391952e8 | | | |-- 65bed271-871d-4feb-b4e5-3dc6391952e8.lease | | | `-- 65bed271-871d-4feb-b4e5-3dc6391952e8.meta | | |-- 7448e1fb-a0d2-4fda-9edc-e8c5eaad6b7e | | | |-- d25e22b1-fcc8-4961-bc7b-3bd218292ade | | | |-- d25e22b1-fcc8-4961-bc7b-3bd218292ade.lease | | | `-- d25e22b1-fcc8-4961-bc7b-3bd218292ade.meta | | |-- 7fab4d37-aeb2-4ee0-9e5e-113fee053fff | | | |-- 2055c803-76e7-4671-b1df-c2d58b3ae20c | | | |-- 2055c803-76e7-4671-b1df-c2d58b3ae20c.lease | | | `-- 2055c803-76e7-4671-b1df-c2d58b3ae20c.meta | | |-- b2157c5a-8680-4778-aaca-14acc7022a95 | | | |-- b83461fa-96e2-41b7-b46b-43677068f43d | | | |-- b83461fa-96e2-41b7-b46b-43677068f43d.lease | | | `-- b83461fa-96e2-41b7-b46b-43677068f43d.meta | | `-- c8f4766d-5e44-4798-b0e9-000cf97faebe | | |-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f | | |-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f.lease | | `-- 0ddd524f-4616-4877-88bb-c5a65cadfb4f.meta | `-- __DIRECT_IO_TEST__ `-- yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_nfs__0 |-- __DIRECT_IO_TEST__ `-- d542adfd-e61d-4503-ad91-b19458827917 |-- dom_md | |-- ids | |-- inbox | |-- leases | |-- metadata | |-- outbox | `-- xleases |-- images | |-- 505b326b-2455-4cfc-a4cc-70300ac14afe | | |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a | | |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.lease | | `-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.meta | |-- 97e9c9ef-b2b4-4a10-b9a1-016a54fa4fd9 | | |-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1 | | |-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1.lease | | `-- 72e6a592-01e1-4b5b-bbdf-09e0cb4a7cc1.meta | |-- e56ccc92-4504-49e7-9834-63127f251651 | | |-- e58caa53-54f5-4fb3-97c0-9973ce242b32 | | |-- e58caa53-54f5-4fb3-97c0-9973ce242b32.lease | | `-- e58caa53-54f5-4fb3-97c0-9973ce242b32.meta | `-- f84cab00-14b1-4242-9ddc-7a5609a0f2d1 | |-- 2780716f-e894-41a7-8542-213b2b8d19d0 | |-- 2780716f-e894-41a7-8542-213b2b8d19d0.lease | |-- 2780716f-e894-41a7-8542-213b2b8d19d0.meta | |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a | |-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.lease | `-- a2c04c53-4ab6-4416-a1e1-4de2bff88e8a.meta `-- master |-- tasks `-- vms /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage/images/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f was there as expected (at least at sos report creation time).
Simone, why was the volume not there at 2019-01-16 16:42:46,953+0200? We need to check the hosted engine startup flow and find the reason. Perhaps the domain was not mounted when we tried to start the VM?
The issue reported by Liran happened the first time at 15:27:23,136+0200 2019-01-16 15:27:23,136+0200 INFO (vm/3074151e) [vdsm.api] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) from=internal, task_id=2b44a209-380d-45e9-8be5-01234bd1fc5b (api:52) 2019-01-16 15:27:23,136+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='2b44a209-380d-45e9-8be5-01234bd1fc5b') Unexpected error (task:875) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run return fn(*args, **kargs) File "<string>", line 2, in prepareImage File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3179, in prepareImage legality = dom.produceVolume(imgUUID, volUUID).getLegality() File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 802, in produceVolume volUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 800, in __init__ self._manifest = self.manifestClass(repoPath, sdUUID, imgUUID, volUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 71, in __init__ volUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 86, in __init__ self.validate() File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 112, in validate self.validateVolumePath() File "/usr/lib/python2.7/site-packages/vdsm/storage/fileVolume.py", line 131, in validateVolumePath raise se.VolumeDoesNotExist(self.volUUID) VolumeDoesNotExist: Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) 2019-01-16 15:27:23,138+0200 INFO (vm/3074151e) [storage.TaskManager.Task] (Task='2b44a209-380d-45e9-8be5-01234bd1fc5b') aborting: Task is aborted: "Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',)" - code 201 (task:1181) 2019-01-16 15:27:23,138+0200 ERROR (vm/3074151e) [storage.Dispatcher] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) (dispatcher:81) 2019-01-16 15:27:23,138+0200 ERROR (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') The vm start process failed (vm:937) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 866, in _startUnderlyingVm self._run() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2749, in _run self._devices = self._make_devices() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2587, in _make_devices devices = self._make_devices_from_dict() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _make_devices_from_dict self._preparePathsForDrives(dev_spec_map[hwclass.DISK]) File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1011, in _preparePathsForDrives drive['path'] = self.cif.prepareVolumePath(drive, self.id) File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 415, in prepareVolumePath raise vm.VolumeError(drive) VolumeError: Bad volume specification {u'index': u'0', u'iface': u'virtio', 'reqsize': '0', 'vm_custom': {}, u'format': u'raw', u'bootOrder': u'1', u'poolID': u'00000000-0000-0000-0000-000000000000', 'vmid': '3074151e-b38f-461f-bf17-90fb36dd41cb', u'volumeID': u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', 'apparentsize': '62277025792', u'imageID': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', u'readonly': u'false', u'domainID': u'7d394a51-c3a3-47e1-ac31-697710ef0f70', u'deviceId': u'c8f4766d-5e44-4798-b0e9-000cf97faebe', 
'truesize': '8600494080', u'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'device': u'disk', u'shared': u'exclusive', u'propagateErrors': u'off', u'type': u'disk'}

But please notice that only 5 milliseconds earlier, a getVolumeSize on the same volume (volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f') returned 'truesize': '8600494080':

2019-01-16 15:27:23,131+0200 INFO (vm/3074151e) [vdsm.api] START getVolumeSize(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', options=None) from=internal, task_id=809a1669-5057-4252-8fa6-735040af97e2 (api:48)
2019-01-16 15:27:23,132+0200 INFO (vm/3074151e) [vdsm.api] FINISH getVolumeSize return={'truesize': '8600494080', 'apparentsize': '62277025792'} from=internal, task_id=809a1669-5057-4252-8fa6-735040af97e2 (api:54)

Now the question is: how could a getVolumeSize complete successfully at 15:27:23,131 and a prepareImage on the same volume fail at 15:27:23,136 with VolumeDoesNotExist (Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',))?
And the same pattern repeated 5 minutes later, at 15:32:48:

2019-01-16 15:32:48,159+0200 INFO (vm/3074151e) [vdsm.api] START getVolumeSize(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', volUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', options=None) from=internal, task_id=79624513-215f-4884-97ec-06199222e0c0 (api:48)
2019-01-16 15:32:48,161+0200 INFO (vm/3074151e) [vdsm.api] FINISH getVolumeSize return={'truesize': '8600494080', 'apparentsize': '62277025792'} from=internal, task_id=79624513-215f-4884-97ec-06199222e0c0 (api:54)
2019-01-16 15:32:48,162+0200 INFO (vm/3074151e) [vdsm.api] START prepareImage(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', leafUUID=u'0ddd524f-4616-4877-88bb-c5a65cadfb4f', allowIllegal=False) from=internal, task_id=72124e10-a230-4c69-9f09-8aad80df24d0 (api:48)
2019-01-16 15:32:48,165+0200 INFO (vm/3074151e) [vdsm.api] FINISH prepareImage error=Volume does not exist: (u'0ddd524f-4616-4877-88bb-c5a65cadfb4f',) from=internal, task_id=72124e10-a230-4c69-9f09-8aad80df24d0 (api:52)
2019-01-16 15:32:48,165+0200 ERROR (vm/3074151e) [storage.TaskManager.Task] (Task='72124e10-a230-4c69-9f09-8aad80df24d0') Unexpected error (task:875)

At 15:32:48,159 we got the size of the volume, but the volume had "disappeared" by 15:32:48,165.
repoStats for that SD was also fine at that time:

2019-01-16 15:32:47,930+0200 INFO (jsonrpc/5) [vdsm.api] START repoStats(domains=[u'7d394a51-c3a3-47e1-ac31-697710ef0f70']) from=::1,51930, task_id=bcf34e9c-26eb-49c3-a158-eb8d2c06c488 (api:48)
2019-01-16 15:32:47,930+0200 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats return={u'7d394a51-c3a3-47e1-ac31-697710ef0f70': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.2012', 'lastCheck': '2.1', 'valid': True}} from=::1,51930, task_id=bcf34e9c-26eb-49c3-a158-eb8d2c06c488 (api:54)
Simone, interesting!

Getting the volume size uses:

    int(int(self.oop.os.stat(volPath).st_size) / bs)
    int(int(self.oop.os.stat(volPath).st_blocks) * BLOCK_SIZE)

which means calling stat() in ioprocess.

Validating the volume path is:

    self.oop.fileUtils.pathExists(volPath)

which is:

    check = os.R_OK

    if writable:
        check |= os.W_OK

    # This function is workaround for a NFS issue where sometimes
    # os.exists/os.access fails due to NFS stale handle, in such
    # case we need to test os.access a second time.
    if os.access(filename, check):
        return True

    return os.access(filename, check)

The last change in this code was:

commit 8b7c2d07b477d78cbaa630a37aa8ea571ee5f5e7
Author: Federico Simoncelli <fsimonce>
Date:   Thu Aug 4 12:00:18 2011 +0000

    BZ#726630 Use os.access twice for NFS stale handle

    Change-Id: Id6a9e94361c19438495a46cc708c76e4d2b43a92
    Reviewed-on: http://gerrit.usersys.redhat.com/789
    Tested-by: Federico Simoncelli <fsimonce>
    Reviewed-by: Dan Kenigsberg <danken>

The change before that was:

Author: Federico Simoncelli <fsimonce>
Date:   Fri Jul 22 14:48:56 2011 +0000

    BZ#726630 Use only os.access to check permissions

    The old implementation of fileUtils.pathExists was using a flawed
    re-implementation of os.access as backup check for files with a NFS
    stale handle, the consequence was that accessible files were reported
    as non-accessible. We now use os.stat to refresh the NFS handle and
    then os.access to determine weather we can read and write the file.

    Change-Id: Ic7b892886416e866178ac88c04cbfb68ed0c055b
    Reviewed-on: http://gerrit.usersys.redhat.com/733
    Reviewed-by: Dan Kenigsberg <danken>
    Tested-by: Jakub Libosvar <libosvar>

All these checks look wrong. I don't think we should use os.access for checking whether a file exists, and raising VolumeDoesNotExist does not make any sense if os.access failed. Even worse, we should not check whether the file exists from within vdsm itself; this may cause the entire vdsm to get stuck. But this code has seemed to work since 2011; we never had a report about such a failure.

The underlying issue is that os.access() fails twice - something is wrong with this mount.

What is the output of:

    ls -Zh /rhev/data-center/mnt/server:_path/sd-id/images/img-id/

when we got this error?

Please attach sanlock.log and /var/log/messages from the time of this issue.

Is this reproducible without hosted engine? How? What are the vdsm API calls that reproduce this issue?
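(For illustration only, not vdsm code: a minimal sketch of why a stat()-based size check can succeed while an os.access()-based check fails, assuming the image file ends up owned by root:root with mode 0660 and the process runs as the vdsm user. The path below is a hypothetical placeholder.)

  import os

  # Hypothetical path standing in for the hosted-engine disk image.
  path = "/rhev/data-center/mnt/server:_export/sd-id/images/img-id/vol-id"

  # stat() only needs search permission on the parent directories, so it
  # returns the size even for a root:root 0660 file.
  st = os.stat(path)
  print("apparentsize:", st.st_size, "truesize:", st.st_blocks * 512)

  # os.access() checks the real uid/gid of this process (e.g. vdsm, uid 36)
  # against the file mode, so for a root:root 0660 file it returns False,
  # which pathExists() then surfaces as "volume does not exist".
  print("readable/writable:", os.access(path, os.R_OK | os.W_OK))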
OK, the issue probably comes from here:

[root@t470s sosreport-ocelot02-2019-01-16-bstqhuu]# cat sos_commands/vdsm/su_vdsm_-s_.bin.sh_-c_.bin.ls_-lR_.rhev.data-center
...
/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/images/c8f4766d-5e44-4798-b0e9-000cf97faebe:
total 8399948
-rw-rw----. 1 root root 62277025792 Jan 16 15:27 0ddd524f-4616-4877-88bb-c5a65cadfb4f
-rw-rw----. 1 vdsm kvm      1048576 Jan 16 15:27 0ddd524f-4616-4877-88bb-c5a65cadfb4f.lease
-rw-r--r--. 1 vdsm kvm          285 Jan 14 10:22 0ddd524f-4616-4877-88bb-c5a65cadfb4f.meta

Everything is vdsm:kvm except the volume file itself, which is root:root with mode 0660, so neither the vdsm user nor the kvm group can read it.

Now the point is: how is that possible? Before the upgrade the engine VM was running, and with those permissions that is not possible. What changed the permissions at upgrade time?

(In reply to Liran Rotenberg from comment #0)
> Triggering global maintenance on the environment, then using --vm-power-off,

This is also wrong, since --vm-poweroff calls destroy for an immediate power-off rather than shutdown for a clean shutdown.
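(Not part of the original comment - a hedged recovery sketch, assuming the only problem is the ownership shown above. The chown 36:36 (vdsm:kvm) workaround is the one suggested later in this bug; the mount path is the one from this environment.)

  # Find files under the hosted-engine storage domain mount that are not owned by vdsm
  find /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage ! -user vdsm -ls

  # Restore the expected ownership (uid/gid 36:36 == vdsm:kvm) on the affected volume
  chown 36:36 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-5_hosted__engine__storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/images/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f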
Created attachment 1522514 [details] /var/log/vdsm
Created attachment 1522515 [details] sanlock.log
OK, in /var/log/messages we see: Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: info : virSecuritySELinuxSetFileconHelper:1156 : Setting SELinux context on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' to 'system_u:object_r:svirt_image_t:s0:c227,c985' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: debug : virFileIsSharedFSType:3664 : Check if path /var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f with FS magic 26985 is shared Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.141+0000: 157822: info : virSecuritySELinuxSetFileconHelper:1200 : Setting security context 'system_u:object_r:svirt_image_t:s0:c227,c985' on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' not supported ... Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '<null>' to '107:107' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: debug : virFileFindResourceFull:1818 : Resolved 'fs' to '/usr/lib64/libvirt/storage-file/libvirt_storage_file_fs.so' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.143+0000: 157823: debug : virModuleLoadFile:47 : Load module file '/usr/lib64/libvirt/storage-file/libvirt_storage_file_fs.so' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virModuleLoadFunc:68 : Lookup function 'virStorageFileFsRegister' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'file' protocol 'none' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'block' protocol 'none' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'dir' protocol 'none' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virFileFindResourceFull:1818 : Resolved 'gluster' to '/usr/lib64/libvirt/storage-file/libvirt_storage_file_gluster.so' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.144+0000: 157823: debug : virModuleLoadFile:47 : Load module file '/usr/lib64/libvirt/storage-file/libvirt_storage_file_gluster.so' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: debug : virModuleLoadFunc:68 : Lookup function 'virStorageFileGlusterRegister' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: debug : virStorageFileBackendRegister:96 : Registering storage file backend 'network' protocol 'gluster' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f' to '107:107' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.com.redhat.rhevm.vdsm' to '107:107' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 
157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.qemu.guest_agent.0' to '107:107' Jan 15 11:05:27 ocelot02 libvirtd: 2019-01-15 09:05:27.145+0000: 157823: info : virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on '/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.ovirt.hosted-engine-setup.0' to '107:107' And on that host we have: libvirt.x86_64 4.5.0-10.el7_6.3 @rhel-76z-internal-compose Maybe it could be related to https://bugzilla.redhat.com/1633389
It looks like the DAC introduction changed the behavior regarding file permissions. It was introduced here:

https://gerrit.ovirt.org/#/c/89830/

    devices: enable dynamic_ownership

    Libvirt provides dynamic_ownership option. When enabled, libvirt takes
    care of ownership and permissions of endpoints required for VM run. It
    was previously disabled due to legacy shared storage issues; these
    should be fixed by now or, if any problem is discovered, fixed ASAP.

    Enabling the option allows us to drop ownership handling from VDSM for
    host devices, and allows OVMF guest vars store without additional
    permission handling code.

    The new behavior is explicitly disabled for storage code - the disk
    ownership is still handled by VDSM. One notable device not handled by
    libvirt dynamic ownership is hwrng. We must still handle ownership of
    that device ourselves.

    Change-Id: Ibfd8f67f88e361600f417aaf52625b5bf6ea1214
    Signed-off-by: Martin Polednik <mpolednik>

Milan, can you take a look?
While dynamic ownership looks like a suspect when file permissions are wrong, I don't know how it could apply in this particular case. Normally, NFS is expected to be configured as (anonuid=36,anongid=36,all_squash) so all files have the right owners automatically. I don't think libvirt can have responsibility here, since the file has wrong permissions even before the VM is attempted to be started (if I look into the logs correctly). And we exclude storage from dynamic ownership in Vdsm, so it shouldn't be touched by libvirt anyway (unless there is a bug).
Same issue here (I think) on my 4.2.8 to 4.3.0(.4) cluster upgrade. Each VM was waiting to be rebooted to apply the new configuration after the cluster upgrade. After each VM was shut down, it would not restart, with the same red-herring message about "bad volume specification", which had me scrambling to figure out what on earth was wrong with a specification that looked totally fine to me.

I eventually discovered that the permissions on the disk images had been changed (to root:root and mode 0644). I reset the permissions properly (vdsm:kvm, 0660) and then the VMs started up fine. I did this on a subset of HA VMs first, thankfully.

On the second set, I first verified that the permissions on the disk images were good before the shutdown (while still in the "VM has a new configuration waiting to be applied" state). During the shutdown process of each VM, the disk image file permissions changed to the unworkable root:root and 0644.

With the permissions manually fixed, they seem to be retained properly during subsequent shutdown/restart or reboot cycles, so was the issue only caused by the VM-config-update process? At what other times might the permissions get hosed again?
Raising this since we had at least 5 upstream users hitting this!
(In reply to Milan Zamazal from comment #24)
> While dynamic ownership looks like a suspect when file permissions are
> wrong, I don't know how it could apply in this particular case. Normally,
> NFS is expected to be configured as (anonuid=36,anongid=36,all_squash) so
> all files have the right owners automatically.

I don't think we require this setting, but even with it you can have wrong permissions.

Here is my test NFS server:

$ cat /etc/exports
/export *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36,fsid=0)

And a session showing that this does not affect file permissions:

# ls -lht /rhev/data-center/
total 0
drwxr-xr-x. 2 vdsm kvm 154 Feb 10 17:50 a0011271-88a4-491f-a566-aec38b2000e9
drwxr-xr-x. 4 vdsm kvm  64 Feb 10 17:50 mnt

# mkdir /rhev/data-center/foo

# ls -lht /rhev/data-center/
total 0
drwxr-xr-x. 2 root root   6 Feb 12 20:57 foo
drwxr-xr-x. 2 vdsm kvm  154 Feb 10 17:50 a0011271-88a4-491f-a566-aec38b2000e9
drwxr-xr-x. 4 vdsm kvm   64 Feb 10 17:50 mnt

# touch /rhev/data-center/foo/bar

# ls -lh /rhev/data-center/foo/
total 0
-rw-r--r--. 1 root root 0 Feb 12 20:58 bar

NFS all_squash,anonuid=36,anongid=36 probably affects the effective uid/gid, not file permissions.

> I don't think libvirt can
> have responsibility here, since the file has wrong permissions even before
> the VM is attempted to be started (if I look into the logs correctly). And
> we exclude storage from dynamic ownership in Vdsm, so it shouldn't be
> touched by libvirt anyway (unless there is a bug).

Storage has no code to change file ownership; we only check that we can access files. Storage provisioning code runs as vdsm, and anything we create (in file storage) is owned by vdsm. In block storage we use sudo to run lvm, and use udev rules to change volumes so they are owned by vdsm. Libvirt creates its own namespace for block devices and is responsible for the ownership and permissions of those block devices. If this were an issue with block device ownership, it could be a storage issue.
$ git grep chown lib/vdsm/supervdsm_server.py
lib/vdsm/supervdsm_server.py:from vdsm.storage.fileUtils import chown, resolveGid, resolveUid
lib/vdsm/supervdsm_server.py:        chown(address, args.sock_user, args.sock_group)

$ git grep chown lib/vdsm/supervdsm_api/
lib/vdsm/supervdsm_api/udev.py:        os.chown(_HWRNG_PATH, 0, 0)
lib/vdsm/supervdsm_api/virt.py:        os.chown(socketFile,

$ git grep 'chown('
lib/vdsm/gluster/api.py:        os.chown(sshDir, uid, gid)
lib/vdsm/gluster/api.py:        os.chown(authKeysFile, uid, gid)
lib/vdsm/mkimage.py:        os.chown(media, resolveUid(DISKIMAGE_USER),
lib/vdsm/mkimage.py:        os.chown(_P_PAYLOAD_IMAGES,
lib/vdsm/network/configurators/ifcfg.py:        os.chown(dirName, vdsm_uid, -1)
lib/vdsm/network/configurators/ifcfg.py:        os.chown(backup, vdsm_uid, -1)
lib/vdsm/storage/fileUtils.py:def chown(path, user=-1, group=-1):
lib/vdsm/storage/fileUtils.py:    os.chown(path, uid, gid)
lib/vdsm/supervdsm_api/udev.py:        os.chown(_HWRNG_PATH, 0, 0)
lib/vdsm/supervdsm_api/virt.py:        os.chown(socketFile,
lib/vdsm/supervdsm_server.py:        chown(address, args.sock_user, args.sock_group)
lib/vdsm/tool/configurators/managedvolumedb.py:    fileUtils.chown(mvdb.DB_FILE, constants.VDSM_USER, constants.VDSM_GROUP)
lib/vdsm/tool/register.py:        os.chown(_auth_keys_dir, _uid, _uid)
lib/vdsm/tool/register.py:        os.chown(_auth_keys, _uid, _uid)
lib/vdsm/tool/transient.py:        os.chown(TRANSIENT_DISKS_REPO, vdsm_uid, vdsm_gid)
tests/storage/fileutil_test.py:        fileUtils.chown(srcPath, targetId, targetId)
tests/storage/fileutil_test.py:        fileUtils.chown(srcPath, tmpId, tmpId)
tests/storage/fileutil_test.py:        fileUtils.chown(srcPath, "root", "root")

This is a new issue since we started to use DAC with libvirt. Libvirt runs as root, and letting it handle permissions is the most likely reason that volumes are owned by root after stopping VMs.

I think we need to enable libvirt debug logs and find out what happens when we shut down a VM - maybe an unclean shutdown. Moving to virt to investigate this.
Peter, do you have any suggestion how to debug this, or info you need to look at this issue?
Nir, I'm not able to reproduce this -- On my test system, setting all_squash (or even root_squash), anonuid, and anongid appropriately squashes the user. Are you running Fedora or EL? I'm also guessing you re-exported+re-mounted the share after making any changes, but as much information as possible would be helpful. Either way, I think this is not virt. If all_squash (or root_squash) doesn't actually squash the remote user, it seems like a very severe platform bug with large security implications.
With root_squash or all_squash you are probably avoiding it on the storage server side; I'd suggest trying to reproduce with no_root_squash.
Yes, this is reproducible without all_squash or root_squash -- but not with dynamic_ownership=1 (which the vdsm configurator should already set, per https://gerrit.ovirt.org/#/c/89830/26/lib/vdsm/tool/configurators/libvirt.py).

Can someone else on this thread verify this externally?

Available workarounds are:
- Changing ownership manually
- Setting all_squash or root_squash + anonuid/anongid
- (Best) Running "vdsm-tool configure --force" to ensure dynamic_ownership=1
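(An illustrative sketch of the last workaround - not from the original comment. The qemu.conf location and the vdsm-tool/vdsmd commands are the ones that appear elsewhere in this bug; the exact procedure may vary per host.)

  # Check whether libvirt dynamic ownership is enabled for the qemu driver
  grep dynamic_ownership /etc/libvirt/qemu.conf

  # Reconfigure vdsm (expected to set dynamic_ownership=1) and restart it
  systemctl stop vdsmd
  vdsm-tool configure --force
  systemctl start vdsmd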
Hi, I have tried the following:

service vdsmd stop
vdsm-tool configure --force
service vdsmd start

and then restarted the HA agent on all nodes, but it doesn't help; the node upgraded to 4.3 is still not able to start the engine. Currently my cluster is a brick, so help / a workaround would be nice!
Hey Endre - thanks for the report!

Is dynamic_ownership=1 set in libvirtd.conf?

To get it running again, please chown 36:36 the disk for hosted engine (doing this recursively on the entire share may be safe). It will be something like:

1.2.3.4:/storage/hosted-engine/$(uuid)/images/$(uuid)/$(uuid)

The uuid is dynamically generated, but tab completion will get you far. The entire tree should be 36:36.
(In reply to Ryan Barry from comment #30)
> Nir, I'm not able to reproduce this --

Right, I tested a local file system :-) (using /rhev/data-center instead of /rhev/data-center/mnt/server:_path)

I tested again on both a Fedora 28 server and a CentOS 7.6 server with:

# cat /etc/exports
/export *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export         <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,root_squash,all_squash)

Both do squash the permissions when using all_squash. Changing the permissions fails with:

chown: changing ownership of 'test': Operation not permitted

I also tested without all_squash:

# cat /etc/exports
/export *(rw,sync,no_subtree_check,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export         <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,root_squash,no_all_squash)

We get root_squash by default, and testing the mount shows it does squash root as expected.

And with no_root_squash:

# cat /etc/exports
/export *(rw,sync,no_subtree_check,no_root_squash,anonuid=36,anongid=36,fsid=0)

# exportfs -v
/export         <world>(sync,wdelay,hide,no_subtree_check,fsid=0,anonuid=36,anongid=36,sec=sys,rw,secure,no_root_squash,no_all_squash)

With this one can create files as root and change their ownership to root.

So this looks like a configuration issue on the user side - but even with a non-default configuration disabling root_squash (or maybe some other NFS server that has different defaults), files are not expected to change ownership after we pass them to libvirt. Recommending root_squash should avoid the issue, but I think it hides the real issue - someone running as root is changing file ownership.

Checking our docs, we don't mention root_squash:
- https://www.ovirt.org/documentation/admin-guide/chap-Storage.html#preparing-nfs-storage
- https://www.ovirt.org/develop/troubleshooting-nfs-storage-issues.html

We had this bug about failing to upgrade hosted engine if root_squash is enabled:
https://bugzilla.redhat.com/1466234

We had this old bug with root_squash:
https://bugzilla.redhat.com/963881

So I think we have these issues:
- storage 1: verify that we test with root_squash - maybe we test without root_squash because of historic issues.
- storage 2: fix the bad comment in the vdsm source about root_squash: https://gerrit.ovirt.org/c/97721/
- docs issue: recommend root_squash after testing shows that we don't have any issue.
- virt: understand why files used by libvirt change ownership.
Elad, do we test NFS with root_squash, all_squash or no_root_squash?
(In reply to Nir Soffer from comment #36)
> (In reply to Ryan Barry from comment #30)
> So I think we have these issues:
> - storage 1: verify that we test with root_squash - maybe we test without
> root_squash because of historic issues.
> - storage 2: fix the bad comment in the vdsm source about root_squash:
> https://gerrit.ovirt.org/c/97721/
> - docs issue: recommend root_squash after testing shows that we don't have
> any issue.

This is currently also affecting GlusterFS SDs, and I'm not sure we have a root_squash equivalent on GlusterFS fuse mounts.
Sahina?

> - virt: understand why files used by libvirt change ownership
(In reply to Nir Soffer from comment #37) > Elad, do we test NFS with root_squash, all_squash or no_root_squash? No
(In reply to Elad from comment #39) > (In reply to Nir Soffer from comment #37) Elad, sorry if my question was not specific enough. I need the NFS configuration used for testing RHV, both manual tests and automation. Please share the /etc/exports file used in all NFS servers used for testing, all configurations.
(In reply to Elad from comment #41)

Thanks Elad! So we have many servers configured with:

(rw,no_root_squash)

In this setup libvirt can change the ownership of images, and they will end up owned by root.

And many servers configured with:

(rw,root_squash)

In this setup libvirt is squashed to the default uid/gid 65534. If libvirt changes image ownership, the images will not belong to root, and vdsm will not be able to access them.

Why are we testing both no_root_squash and root_squash with the default uid/gid? Do we have any documentation on why we use these settings?

It is clear that the configuration that may prevent the issue in this bug:

(rw,root_squash,anonuid=36,anongid=36)

is not tested by QE, so we should be careful about recommending it to users.

Generally I'm not happy about squashing root to 36:36. root has no reason to touch any file in a vdsm storage domain mount, and this setting hides bugs in code running as root. I think it is better to fail early if some component is making incorrect changes instead of hiding them by squashing.
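(For concreteness only - a sketch of the export line being discussed, which the comment above notes is not tested by QE and should not be read as a recommendation; the export path and client network are placeholders:)

  # /etc/exports - squash any root access on this export to uid/gid 36:36 (vdsm:kvm)
  /export/rhv_data 10.0.0.0/24(rw,sync,root_squash,anonuid=36,anongid=36)

  # apply and inspect the effective options
  exportfs -ra
  exportfs -v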
Elad, can we get a clean reproducer? We can then try switching dynamic_ownership and the seclabel settings to make sure that is really the cause.
Nir, this server is being used by many clients; some of the configuration injected into /etc/exports was for specific needs. The relevant line from /etc/exports for the path used for reproducing the bug is:

/Compute_NFS 10.0.0.0/255.0.0.0(rw,no_root_squash)

Michal, I understand you took this one offline with Liran.
Reposting important bits from the private email thread about this bug:

> >>>> *2019-02-18 15:01:19.200+0000: 14951: info :
> >>>> virSecurityDACSetOwnershipInternal:567 : Setting DAC user and group on
> >>>> '/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/53*
> >>>> *5e0512-987d-4781-be80-f60f36cceacd/images/c2400df6-1d9b-487e-b4c5-abce730ee7a2/16fa2ea1-2baf-4af4-aa09-b16067aa9988'
> >>>> to '0:0'*
> >
> > Is this on the source or destination?
>
> Destination host with dynamic_ownership=1, on shutdown

Ah, I see. So the problem is that after you migrate this VM in from a host which did not have dynamic_ownership enabled (and thus the VM XML did not use <seclabel> to prevent relabelling) and then shut down the VM, libvirt will relabel it to root:root.

In this case the libvirt code is correct, though, I'm afraid. I don't think it's worth applying any workarounds in libvirt, as we've never encouraged users to disable dynamic ownership.

> > If the migration is controlled by a host which is aware (e.g. non
> > peer-to-peer migration initiated from the destination host) it's better
> > to use the destination XML parameter directly rather than writing ugly
> > filter code to the hook.
>
> Source is sending it, though destination can make changes too. But it
> would be needed for VM resume too.
> Maybe hook is simpler

Note that you can use virDomainSaveImageDefineXML to modify the XML in the save image if you opt not to use the hook. Especially if you can send the correct XML during migration.
(In reply to Peter Krempa from comment #45)
> In this case the libvirt code is correct though I'm afraid. I don't
> think it's worth applying any workarounds in libvirt as we've never
> encouraged users to disable dynamic ownership.

To be clear, in my case of hitting this bug, I never (knowingly?) disabled dynamic ownership. As I understand it, that is a new feature in 4.3. All I did was upgrade from 4.2(.8) to 4.3 and it broke my VMs. So I think it is up to oVirt to manage the necessary corrections on upgrade to avoid breaking VM disk permissions.

An additional point, for clarity, since a lot of this thread discusses NFS-based storage domains: on my cluster I use GlusterFS storage domains, so the whole root_squash/no_root_squash issue is moot.
Ian, yes, when migrating a running or suspended VM from environment < 4.3, Vdsm must modify domain XML accordingly. I'm going to implement that in a libvirt hook.
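(For reference, a sketch of the kind of per-disk source element such a hook would need to end up with, modeled on the verified destination-host XML shown in the verification comment further down; the path is from that example and the placeholders/ellipses are illustrative, not the hook's literal output:)

  <disk type='file' device='disk' snapshot='no'>
    ...
    <source file='/rhev/data-center/mnt/.../images/<img-uuid>/<vol-uuid>'>
      <!-- tells libvirt not to relabel (chown) this image on VM start/stop -->
      <seclabel model='dac' relabel='no'/>
    </source>
    ...
  </disk>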
(In reply to Simone Tiraboschi from comment #38)
> (In reply to Nir Soffer from comment #36)
> > (In reply to Ryan Barry from comment #30)
> > So I think we have these issues:
> > - storage 1: verify that we test with root_squash - maybe we test without
> > root_squash because of historic issues.
> > - storage 2: fix the bad comment in the vdsm source about root_squash:
> > https://gerrit.ovirt.org/c/97721/
> > - docs issue: recommend root_squash after testing shows that we don't
> > have any issue.
>
> This is currently also affecting GlusterFS SDs, and I'm not sure we have a
> root_squash equivalent on GlusterFS fuse mounts.
> Sahina?
>
> > - virt: understand why files used by libvirt change ownership

Hi Simone,

Yes, we do see this problem while upgrading from RHV 4.2.8-2 to RHV 4.3. The ownership of the image files becomes root:root, and once it is reset to vdsm:kvm, the VM can start.
*** Bug 1679882 has been marked as a duplicate of this bug. ***
Moving back to POST since https://gerrit.ovirt.org/#/c/98088/ is not merged yet.
core issue is fixed already though
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ] For more info please contact: rhv-devops
Verified on:
vdsm-4.30.10-1.el7ev.x86_64

Steps:
Have an environment with a 4.3 host (vdsm-4.30.10-1.el7ev.x86_64) and an older host < 4.3, for example 4.2 with vdsm-4.20.46-1.el7ev.x86_64.
1. Create a VM.
2. Run it on the 4.2 host.
3. Migrate the VM.
4. Shut down the VM.
5. Start the VM.

Results:
The VM could start again. Looking at the image:

# ll /rhev/data-center/mnt/<IMAGE PATH>

showed ownership vdsm:kvm. Shutting down did not show a DAC change of the image to '0:0' in the libvirt log.

On the source host, virsh -r dumpxml <VM> showed:

<disk type='file' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
  <source file='/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/535e0512-987d-4781-be80-f60f36cceacd/images/7d7719b2-578d-4a89-b6fb-15433f6e47dd/223b153a-761c-4017-8694-e434c8764155'/>
...
<seclabel type='dynamic' model='selinux' relabel='yes'>
  <label>system_u:system_r:svirt_t:s0:c758,c994</label>
  <imagelabel>system_u:object_r:svirt_image_t:s0:c758,c994</imagelabel>
</seclabel>
<seclabel type='dynamic' model='dac' relabel='yes'>
  <label>+107:+107</label>
  <imagelabel>+107:+107</imagelabel>
</seclabel>

In /etc/libvirt/qemu.conf there is dynamic_ownership=0.

After migrating, the domxml on the destination host shows:

<source file='/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-ge-4_nfs__0/535e0512-987d-4781-be80-f60f36cceacd/images/7d7719b2-578d-4a89-b6fb-15433f6e47dd/223b153a-761c-4017-8694-e434c8764155'>
  <seclabel model='dac' relabel='no'/>
</source>
...
<seclabel type='dynamic' model='selinux' relabel='yes'>
  <label>system_u:system_r:svirt_t:s0:c240,c534</label>
  <imagelabel>system_u:object_r:svirt_image_t:s0:c240,c534</imagelabel>
</seclabel>
<seclabel type='dynamic' model='dac' relabel='yes'>
  <label>+107:+107</label>
  <imagelabel>+107:+107</imagelabel>
</seclabel>

The seclabel is added successfully on live migration from a host < 4.3 to a 4.3 host.
This also occurs when the oVirt HA agent launches the hosted engine VM. I had this happen to me last night, and it occurred in a reproducible fashion. It seems to be triggered by the act of the VM being started (I'd stopped it from inside the VM), or possibly by the VM stopping. It first happened while I was upgrading from 4.2.8 to 4.3.1, but continued once all my HA hosts were upgraded. Resetting the ownership on the hosted engine VM disk resolved the issue in each case.

ovirt-release42-4.2.8-1.el7.noarch
ovirt-imageio-daemon-1.5.0-0.el7.noarch
ovirt-host-dependencies-4.3.1-1.el7.x86_64
cockpit-machines-ovirt-176-4.el7.centos.noarch
ovirt-release43-4.3.1-1.el7.noarch
ovirt-vmconsole-host-1.0.7-2.el7.noarch
ovirt-hosted-engine-setup-2.3.5-1.el7.noarch
ovirt-ansible-engine-setup-1.1.8-1.el7.noarch
ovirt-imageio-common-1.5.0-0.el7.x86_64
python2-ovirt-host-deploy-1.8.0-1.el7.noarch
ovirt-host-4.3.1-1.el7.x86_64
ovirt-vmconsole-1.0.7-2.el7.noarch
ovirt-ansible-repositories-1.1.5-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.11-1.el7.noarch
ovirt-provider-ovn-driver-1.2.20-1.el7.noarch
ovirt-hosted-engine-ha-2.3.1-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.centos.noarch
python2-ovirt-setup-lib-1.2.0-1.el7.noarch
python-ovirt-engine-sdk4-4.3.0-2.el7.x86_64
cockpit-ovirt-dashboard-0.12.3-1.el7.noarch
ovirt-host-deploy-common-1.8.0-1.el7.noarch

libvirt-daemon-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-10.el7_6.4.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.4.x86_64
libvirt-lock-sanlock-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-lxc-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-iscsi-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-kvm-4.5.0-10.el7_6.4.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-nodedev-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-config-network-4.5.0-10.el7_6.4.x86_64
libvirt-client-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-interface-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-network-4.5.0-10.el7_6.4.x86_64
libvirt-bash-completion-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-core-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-10.el7_6.4.x86_64
libvirt-libs-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-secret-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-10.el7_6.4.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.4.x86_64
Also having this issue. Upgraded from 4.2.8 to 4.3.2. I am on FCP storage and see the same symptoms. Any idea how to fix permissions on FC storage? If I detach the disks and reattach them to a newly created VM, the VM starts.
(In reply to nardusg from comment #55)
> Also having this issue. Upgraded from 4.2.8 to 4.3.2. I am on FCP storage
> and see the same symptoms. Any idea how to fix permissions on FC storage?
> If I detach the disks and reattach them to a newly created VM, the VM starts.

As discussed on the oVirt users list, this problem is unrelated to this bug; the VM doesn't start for a completely different reason.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077
*** Bug 1687126 has been marked as a duplicate of this bug. ***
I think this is not fully resolved yet. I encountered this after upgrading from 4.2.8 to 4.3.4. All machines are on Gluster volumes. I changed the cluster compatibility to 4.3 after the upgrade.

When I shut down a VM it gets permissions root:root, and I have to manually reset the permissions to vdsm:kvm in order for it to restart. Machines that were down during the update are not affected; they start normally and their ownership is not altered. VMs that were running during the upgrade get ownership root:root when they shut down. After changing the ownership they seem to behave normally.
Do you use gluster with gfapi? It's a known bug, a fix is in preparation, see https://bugzilla.redhat.com/1719789.
I use libgfapi, but this only happened to me once, not every time. I only need to change the ownership once; it does not change again on the next shutdown, AFAIK. I mean I only have to change the ownership to 36:36 once, and then the machine shuts down and starts normally. This only happens on machines that were running during the transition from 4.2 to 4.3. After a shutdown followed by "chown 36:36 vm_image", the machines shut down and restart normally. Machines that were stopped during the upgrade are not affected.
The bug means that both libvirt and Vdsm (+ your manual changes) try to manage owners of the GlusterFS images. The owners may remain correct or they can be changed later inappropriately.