Created attachment 1367909 [details] logs from purple-vds1

Description of problem:
On two freshly reprovisioned hosts I'm getting failed SHE deployments. Here is what happened on the first host, purple-vds1: http://pastebin.test.redhat.com/540761, and here is what happened on the second host, purple-vds2: http://pastebin.test.redhat.com/540764.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch

How reproducible:
100% on this specific pair of hosts.

Steps to Reproduce:
1. Deploy SHE.

Actual results:
Deployment fails.

Expected results:
Deployment should succeed.

Additional info:
Logs from both hosts attached.
Created attachment 1367910 [details] logs from purple-vds2
Nikolai, when filing a vdsm bug, you must specify the vdsm version. According to one of the logs it is vdsm-4.20.9.1-1.el7ev.

Could it be that a failure to initialize storage is delaying vdsm startup?

MainProcess|hsm/init::ERROR::2017-12-13 18:09:34,699::supervdsm_server::98::SuperVdsm.ServerCallback::(wrapper) Error in resizeMap
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 96, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 146, in resizeMap
    return multipath._resize_map(devName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/multipath.py", line 140, in _resize_map
    % (name, out, err))
Error: Resizing map 'dm-10' failed: out='fail\n' err=''
(In reply to Dan Kenigsberg from comment #2)
> Nikolai, when filing vdsm bug, you must specify vdsm version. According to
> one of the logs it is vdsm-4.20.9.1-1.el7ev
>
> could it be that a failure to initialize storage is delaying vdsm startup?
>
> MainProcess|hsm/init::ERROR::2017-12-13
> 18:09:34,699::supervdsm_server::98::SuperVdsm.ServerCallback::(wrapper)
> Error in resizeMap
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 96,
> in wrapper
>     res = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line
> 146, in resizeMap
>     return multipath._resize_map(devName)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/multipath.py", line
> 140, in _resize_map
>     % (name, out, err))
> Error: Resizing map 'dm-10' failed: out='fail\n' err=''

vdsm-4.20.9-37.gitb42ea79.el7.centos.x86_64
This is not part of a typical vdsm flow. From multipath.py in the storage code:

 84 def resize_devices():
 85     """
 86     This is needed in case a device has been increased on the storage server
 87     Resize multipath map if the underlying slaves are bigger than
 88     the map size.
 89     The slaves can be bigger if the LUN size has been increased on the storage
 90     server after the initial discovery.
 91     """

Did you make some changes to the storage environment during your setup of SHE?
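To illustrate the condition that docstring describes, here is a minimal sketch of how one could check whether a multipath map needs resizing, by comparing the map's size with its slave devices through sysfs. This is only an illustration of the idea, not the actual vdsm implementation; the sysfs layout is standard, but the helper names are assumptions.

import os

SECTOR_SIZE = 512  # sysfs 'size' files report 512-byte sectors


def read_size(path):
    # Return the size in bytes from a sysfs 'size' file.
    with open(path) as f:
        return int(f.read().strip()) * SECTOR_SIZE


def needs_resize(dm_name):
    # True if any slave of the dm device (e.g. 'dm-10') is larger than the
    # map itself, i.e. the LUN grew after the initial discovery.
    map_size = read_size("/sys/block/%s/size" % dm_name)
    slaves_dir = "/sys/block/%s/slaves" % dm_name
    for slave in os.listdir(slaves_dir):
        if read_size("/sys/block/%s/size" % slave) > map_size:
            return True
    return False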
I attached a patch to this bug by mistake; this doesn't seem to be a virt issue.

On purple1, I see plenty of storage errors, maybe relevant, maybe not (as Dan already mentioned).

On purple2, I see far fewer storage errors, but still some of them, and the HostedEngine VM successfully started:

2017-12-13 19:08:17,950+0200 INFO (vm/37cec957) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=0 err= (hooks:110)
2017-12-13 19:08:18,121+0200 INFO (vm/37cec957) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2017-12-13 19:08:18,121+0200 INFO (vm/37cec957) [virt.vm] (vmId='37cec957-13ab-468c-8b40-3a86f7649f87') <?xml version="1.0" encoding="utf-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <name>HostedEngine</name>
    <uuid>37cec957-13ab-468c-8b40-3a86f7649f87</uuid>
    <memory>9118720</memory>
2017-12-13 19:08:21,361+0200 INFO (vm/37cec957) [virt.vm] (vmId='37cec957-13ab-468c-8b40-3a86f7649f87') CPU running: domain initialization (vm:5888)

I see no evidence of a virt bug, and I advise reassigning this bug for further analysis; I cannot do anything else on it for the time being.
Let's focus on purple-vds1, since the issues are probably different here.

The issue is for sure caused by:
2017-12-13 17:47:15,562+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:585)
2017-12-13 17:47:15,562+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 failed (error 99) in 0.00 seconds (__init__:573)

as described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c41

but the point here is why vdsm discarded that ping2 message.

According to vdsm logs (from purple-vds1), it started at
2017-12-13 17:47:13,954+0200 INFO (MainThread) [vds] (PID: 17560) I am the actual vdsm 4.20.9.1-1.el7ev purple-vds1.qa.lab.tlv.redhat.com (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)

and completed recovery mode at
2017-12-13 17:47:14,703+0200 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:569)

but it was still discarding ping2 at 17:47:15,562+0200 due to recovery mode, although recovery had completed at 17:47:14,703+0200, and this, in my opinion, is a bug by itself.
(In reply to Adam Litke from comment #4)
> This is not part of a typical vdsm flow. From multipath.py in the storage
> code:
>
> 84 def resize_devices():
>
> 85     """
>
> 86     This is needed in case a device has been increased on the storage
> server
> 87     Resize multipath map if the underlying slaves are bigger than
>
> 88     the map size.
>
> 89     The slaves can be bigger if the LUN size has been increased on the
> storage
>
> 90     server after the initial discovery.
>
> 91     """
>
> Did you make some changes to the storage environment while/during your setup
> of SHE?

No, I did not.
(In reply to Simone Tiraboschi from comment #6)
> and completed the recovery mode at
> 2017-12-13 17:47:14,703+0200 INFO (vmrecovery) [vds] recovery: completed
> in 0s (clientIF:569)
>
> but then it was still discarding the ping2 at 17:47:15,562+0200 due to the
> recovery mode although it completed it at 17:47:14,703+0200 and this on my
> opinion is a bug by itself.

VM recovery has finished, but not storage initialization. That's why I called in Adam 4 days ago.
How systematic is this issue?
My bad, I've provided you with vdsm-4.20.9-37.gitb42ea79.el7.centos.x86_64 by mistake. The right one was vdsm-4.20.9.1-1.el7ev.

(In reply to Doron Fediuck from comment #9)
> How systematic is this issue?

It happens on this specific HW every time. Both purple-vds1 and purple-vds2 suffer from this issue.
(In reply to Nikolai Sednev from comment #10)
> My bad, I've provided you with vdsm-4.20.9-37.gitb42ea79.el7.centos.x86_64
> by mistake. The right one was vdsm-4.20.9.1-1.el7ev.
> (In reply to Doron Fediuck from comment #9)
> > How systematic is this issue?
>
> It happened on specific HW all the time. Both purple-vds1 and purple-vds2
> suffers from this issue.

Please try to reproduce on a different pair of hosts (even nested), and report back the results here.
Thanks!
(In reply to Allon Mureinik from comment #11)
> (In reply to Nikolai Sednev from comment #10)
> > My bad, I've provided you with vdsm-4.20.9-37.gitb42ea79.el7.centos.x86_64
> > by mistake. The right one was vdsm-4.20.9.1-1.el7ev.
> > (In reply to Doron Fediuck from comment #9)
> > > How systematic is this issue?
> >
> > It happened on specific HW all the time. Both purple-vds1 and purple-vds2
> > suffers from this issue.
>
> Please try to reproduce on a different pair of hosts (even nested), and
> report back the results here.
> Thanks!

On different hosts it worked, and I have already reported that here:
https://bugzilla.redhat.com/show_bug.cgi?id=1511094#c9

The thing is, it really depends on the HW: on some machines it works flawlessly, on others it does not.
I've just verified on fast hosts (almas) and deployment over NFS was fine.

ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
sanlock-3.5.0-1.el7.x86_64
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
vdsm-4.20.9.2-1.el7ev.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch

Slow hosts retesting is still in progress.
(In reply to Nikolai Sednev from comment #13)
> I've just verified on fast hosts (almas) and deployment over NFS was fine.
> ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
> ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
> qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
> sanlock-3.5.0-1.el7.x86_64
> libvirt-client-3.2.0-14.el7_4.5.x86_64
> ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
> mom-0.5.11-1.el7ev.noarch
> vdsm-4.20.9.2-1.el7ev.x86_64
> ovirt-host-deploy-1.7.0-1.el7ev.noarch
> ovirt-vmconsole-1.0.4-1.el7ev.noarch
>
> Slow hosts retesting is still in progress.

Deployment on the slow host over FC has failed:

[ ERROR ] Error creating Volume Group: Command LVMVolumeGroup.create with args {'devlist': ['3514f0c5a51601092'], 'force': True, 'name': 'f3f15c92-9d84-4c5e-a65c-fb5212181069'} failed: (code=601, message=Failed to initialize physical device: ("['/dev/mapper/3514f0c5a51601092']",))
[ ERROR ] Failed to execute stage 'Misc configuration': Command LVMVolumeGroup.create with args {'devlist': ['3514f0c5a51601092'], 'force': True, 'name': 'f3f15c92-9d84-4c5e-a65c-fb5212181069'} failed: (code=601, message=Failed to initialize physical device: ("['/dev/mapper/3514f0c5a51601092']",))

This looks like a different issue now. Sosreport is attached.
Created attachment 1369506 [details] logs from purple-vds1 after failed SHE deployment over FC.
Adding components from the host here:

ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
vdsm-4.20.9.2-1.el7ev.x86_64
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
libvirt-client-3.2.0-14.el7_4.5.x86_64

Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux purple-vds1.qa.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)
Adding more details from failed FC deployment: http://pastebin.test.redhat.com/541628
In the vdsm logs we see:

2017-12-18 15:27:21,752+0200 INFO (jsonrpc/0) [vdsm.api] START createVG(vgname='f3f15c92-9d84-4c5e-a65c-fb5212181069', devlist=['3514f0c5a51601092'], force=True, options=None) from=::1,49016, task_id=5b5cab39-060b-4a1d-9b24-911fbf741264 (api:46)
2017-12-18 15:27:21,890+0200 ERROR (jsonrpc/0) [storage.LVM] pvcreate failed with rc=5 (lvm:754)
2017-12-18 15:27:21,890+0200 ERROR (jsonrpc/0) [storage.LVM] [], [' Device /dev/mapper/3514f0c5a51601092 not found (or ignored by filtering).'] (lvm:755)
2017-12-18 15:27:21,891+0200 INFO (jsonrpc/0) [vdsm.api] FINISH createVG error=Failed to initialize physical device: ("['/dev/mapper/3514f0c5a51601092']",) from=::1,49016, task_id=5b5cab39-060b-4a1d-9b24-911fbf741264 (api:50)
2017-12-18 15:27:21,891+0200 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='5b5cab39-060b-4a1d-9b24-911fbf741264') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in createVG
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2129, in createVG
    force=force)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 983, in createVG
    _initpvs(pvs, metadataSize, force)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 756, in _initpvs
    raise se.PhysDevInitializationError(str(devices))
PhysDevInitializationError: Failed to initialize physical device: ("['/dev/mapper/3514f0c5a51601092']",)
2017-12-18 15:27:21,951+0200 INFO (jsonrpc/0) [storage.TaskManager.Task] (Task='5b5cab39-060b-4a1d-9b24-911fbf741264') aborting: Task is aborted: 'Failed to initialize physical device: ("[\'/dev/mapper/3514f0c5a51601092\']",)' - code 601 (task:1181)
2017-12-18 15:27:21,951+0200 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH createVG error=Failed to initialize physical device: ("['/dev/mapper/3514f0c5a51601092']",) (dispatcher:82)

The device seems fine:

[root@purple-vds1 ~]# multipath -ll
3514f0c5a51601092 dm-1 XtremIO ,XtremApp
size=60G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 0:0:0:1 sda 8:0  active ready running
  `- 0:0:1:1 sdb 8:16 active ready running
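The "not found (or ignored by filtering)" message from pvcreate typically means that either an LVM filter in lvm.conf is hiding the device, or an existing signature on the LUN is getting in the way. A small triage sketch, assuming the device path from the log above and standard LVM/util-linux tools; this is not part of vdsm's flow, just how one might inspect the host:

import subprocess

# Assumed device path, taken from the createVG log above.
DEVICE = "/dev/mapper/3514f0c5a51601092"

# Show the effective LVM device filter from lvm.conf, if one is set.
subprocess.call(["lvmconfig", "devices/filter"])

# Check whether the LUN already carries a partition/filesystem/PV signature
# that would make pvcreate refuse to initialize it.
subprocess.call(["blkid", "-p", DEVICE])

# Ask pvcreate to explain itself in test mode, without touching the disk.
subprocess.call(["pvcreate", "--test", "-v", DEVICE])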
I've opened a separate bug, https://bugzilla.redhat.com/show_bug.cgi?id=1527077, further to comments 14-18.

The original issue has not been reproduced at the moment. I'll try to reproduce it again on another host and, if it does not reproduce, I will update this bug.
(In reply to Simone Tiraboschi from comment #18)

This could be caused by a "dirty" LUN. Check here for how to solve it:
https://bugzilla.redhat.com/show_bug.cgi?id=1343043

For now, we could not find any storage-related issues in this bug. Moving to Hosted Engine.
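In case the LUN really is dirty (stale partition table, filesystem label or old PV metadata from a previous deployment), the usual remedy is to wipe the start of the device before redeploying. A minimal, destructive sketch, assuming the device path from comment 18; only run it against a LUN whose contents are known to be disposable:

import subprocess

# Assumed LUN path; double-check it before running, this destroys data.
DEVICE = "/dev/mapper/3514f0c5a51601092"

# Remove known signatures (partition tables, filesystem labels, old LVM metadata).
subprocess.check_call(["wipefs", "--all", DEVICE])

# Zero the first MiB as well, where leftover metadata usually lives.
subprocess.check_call(
    ["dd", "if=/dev/zero", "of=" + DEVICE, "bs=1M", "count=1", "oflag=direct"])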
For sure hosted-engine-setup now correctly manages to talk with vdsm as a side effect of https://gerrit.ovirt.org/#/c/85416/ for https://bugzilla.redhat.com/1512534.

No other patches are needed, in my opinion. Moving to QE for repeating the tests.
Deployment still failed with:

[ INFO ] Engine is still unreachable
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ ERROR ] Engine is still not reachable
[ ERROR ] Failed to execute stage 'Closing up': Engine is still not reachable
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20171218173244.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171218170411-tj2a1c.log
Moving back to assigned per Nir's hint: we also have to explicitly check for a successful pong.
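A rough sketch of the kind of explicit check meant here: keep calling Host.ping2 and treat a "still recovering" answer (the error 99 seen in the logs) as transient, only proceeding on a real, successful pong. The client object and its call() method are hypothetical placeholders for whatever jsonrpc client hosted-engine-setup actually uses:

import time


class StillRecovering(Exception):
    # Raised by the (hypothetical) client when vdsm answers with error 99.
    pass


def wait_for_pong(client, timeout=300, interval=2):
    # Poll Host.ping2 until vdsm really answers, instead of assuming it is up
    # after the first response or a discarded message.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            client.call("Host.ping2")  # hypothetical jsonrpc call
            return True  # explicit, successful pong
        except StillRecovering:
            # vdsm is still in recovery mode: wait and retry.
            time.sleep(interval)
    raise RuntimeError("vdsm did not answer Host.ping2 within %s seconds" % timeout)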
Created attachment 1369564 [details] sosreport from purple vds1
And we would, but the jsonrpc client gets stuck anyway...
I see the same issue happening on FC deployment; the error looks the same to me:

[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ INFO ] Engine is still not reachable, waiting...
[ INFO ] Engine is still unreachable
[ ERROR ] Engine is still not reachable
[ ERROR ] Failed to execute stage 'Closing up': Engine is still not reachable
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20171219135626.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171219124044-eizanb.log

Logs from purple-vds2 attached.
Created attachment 1370034 [details] logs from purple-vds2
Deployed on RHEL 7.4 purple-vds2 over FC with:

rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-host-4.2.0-1.el7ev.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7ev.noarch

Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)
(In reply to Nikolai Sednev from comment #26)
> I see the same issue happening on FC deployment, error looks the same to me:
> [ INFO ] Engine is still not reachable, waiting...

The engine VM, configured to rely on DHCP, failed to get an IP address; see the attached screenshot. Please ensure that the network is working properly.
Created attachment 1370039 [details] Network on nsednev-he-2
*** Bug 1524076 has been marked as a duplicate of this bug. ***
(In reply to Simone Tiraboschi from comment #30)
> Created attachment 1370039 [details]
> Network on nsednev-he-2

Network issue fixed. FC deployment passed just fine with these components:

rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
ovirt-setup-lib-1.1.4-1.el7ev.noarch
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.11-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch

Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux

I did not reproduce the initial errors during the deployment.
Worked just fine using ansible deployment with these components:

ovirt-hosted-engine-ha-2.2.5-0.0.master.20180117160214.20180117160210.gitd5b58a8.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.8-0.0.master.20180123120416.gitfa69a78.el7.centos.noarch
ovirt-engine-appliance-4.2-20180121.1.el7.centos.noarch
Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Moving to verified.
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.