Created attachment 1187424 [details]
he-1

Description of problem:
Hosted Engine always shows "Not running" status after deployment. The HE VM can come up after running hosted-engine --vm-start, but the hostname of the engine is lost and the HE status is still shown as "Not running" in cockpit.

# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli

# hosted-engine --vm-start
/usr/share/vdsm/vdsClient.py:33: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  from vdsm import utils, vdscli, constants
/usr/share/vdsm/vdsClient.py:33: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  from vdsm import utils, vdscli, constants

97004290-0d22-4366-9d28-27471d608f9e
    Status = WaitForLaunch
    nicModel = rtl8139,pv
    statusTime = 4313210600
    emulatedMachine = rhel6.5.0
    pid = 0
    vmName = HostedEngine
    devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '5403a8ac-d264-4b08-b5b1-5035bf81db65', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '3decb275-9fea-4be0-b19f-dcd94bb479b8', 'imageID': '36255e02-3c95-40cd-b21e-d6843a20cc04', 'specParams': {}, 'readonly': 'false', 'domainID': '8f0b4420-af5d-4ef7-95f7-b1efecea5cfa', 'optional': 'false', 'deviceId': '36255e02-3c95-40cd-b21e-d6843a20cc04', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:4e:44:ea', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': 'f8c80999-c733-4a89-9282-7447f09d16a3', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': 'd325fdb4-7c54-43b8-85d1-16ef9e4387fe', 'alias': 'console0'}, {'device': 'vga', 'alias': 'video0', 'type': 'video'}]
    guestDiskMapping = {}
    vmType = kvm
    clientIp = 
    displaySecurePort = -1
    memSize = 4096
    displayPort = -1
    cpuType = Opteron_G3
    spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
    smp = 2
    displayIp = 0
    display = vnc

Version-Release number of selected component (if applicable):
redhat-virtualization-host-4.0-20160803.3
imgbased-0.7.4-0.1.el7ev.noarch
cockpit-ovirt-dashboard-0.10.6-1.3.4.el7ev.noarch
cockpit-0.114-2.el7.x86_64
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.3-1.el7ev.noarch
20160731.0-1.el7ev.4.0.ova

How reproducible:
100%

Steps to Reproduce:
1. Anaconda interactive install of RHVH via PXE with the ks below.
2. Log in to RHVH via the cockpit UI.
3. Deploy Hosted Engine via cockpit, following the correct steps.
4. After the VM shuts down, wait a few minutes and check the HE status.

Actual results:
Hosted Engine always shows "Not running" status after deployment.

Expected results:
Hosted Engine comes up and works well after deployment.

Additional info:
1. ks:
liveimg --url=http://10.66.10.22:8090/rhevh/rhevh7-ng-36/redhat-virtualization-host-4.0-20160803.3/redhat-virtualization-host-4.0-20160803.3.x86_64.liveimg.squashfs

%post
imgbase layout --init
%end

2. No such issue on redhat-virtualization-host-4.0-20160727.1, so this is a regression bug.
Created attachment 1187425 [details] HE-VM
Created attachment 1187426 [details] all log info
Adding keywords "Regression" and "TestBlocker" because there is no such issue on redhat-virtualization-host-4.0-20160727.1 and this blocks HE testing.
What does "hosted-engine --vm-status" show?
(In reply to Ryan Barry from comment #4)
> What does "hosted-engine --vm-status" show?

The output of "hosted-engine --vm-status" is in the bug description:

# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli
Moving to he-setup due to the import issue.
With vdsm-python-4.18.6-1.el7.centos.noarch 4.18.6-1.el7.centos @ovirt-4.0 we don't have the warning. It's instead
Ok, reproduced with vdsm-python.noarch 0:4.18.10-1.el7.centos.

The issue is in ovirt_hosted_engine_ha/lib/storage_backends.py and it's a direct result of rhbz#1101554, which is targeted to 4.1 and could potentially have a wide impact.

On the other side, the warning message is on stderr while the json output is on stdout.

[root@foobar ovirt-hosted-engine-setup]# hosted-engine --vm-status --json
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli
{"1": {"live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=31141 (Fri Aug 5 18:33:04 2016)\nhost-id=1\nscore=3400\nmaintenance=False\nstate=EngineDown\nstopped=False\n", "hostname": "foobar.localdomain", "host-id": 1, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "7f8be5ec", "host-ts": 31141}, "global_maintenance": false}

[root@foobar ovirt-hosted-engine-setup]# hosted-engine --vm-status --json 2>>/dev/null
{"1": {"live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=31171 (Fri Aug 5 18:33:33 2016)\nhost-id=1\nscore=3400\nmaintenance=False\nstate=EngineStart\nstopped=False\n", "hostname": "foobar.localdomain", "host-id": 1, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "b8890b47", "host-ts": 31171}, "global_maintenance": false}

Ryan, is there any reason to parse stderr as well?
(In reply to Simone Tiraboschi from comment #8)
> Ryan, is there any reason to parse stderr as well?

No, there isn't, but cockpit.spawn discards stderr (sends it to the journal) by default.
http://cockpit-project.org/guide/latest/cockpit-spawn.html

I haven't tested this (we don't expect stderr), but I'd expect that it works as written. The problem here seems to be that the version of vdsm-python in the latest GA compose doesn't send any output at all to stdout, at a guess (I haven't tested this, and I'm out until Wednesday, but see comment#5).
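For reference, as long as only stdout is consumed, the DeprecationWarning on stderr is harmless. A minimal sketch of that pattern (plain Python subprocess here rather than cockpit.spawn, purely for illustration; it assumes Python 3 and the JSON layout shown in comment 8):

import json
import subprocess

def hosted_engine_status():
    """Run 'hosted-engine --vm-status --json' and parse only stdout.

    Anything printed on stderr (e.g. the DeprecationWarning) is discarded,
    so it cannot break the JSON parsing.
    """
    proc = subprocess.run(
        ["hosted-engine", "--vm-status", "--json"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # warnings go here; we do not parse them
        check=True,
    )
    return json.loads(proc.stdout.decode("utf-8"))

if __name__ == "__main__":
    status = hosted_engine_status()
    # Per-host entries use numeric string keys; 'global_maintenance' is a flag.
    for host_id, host in status.items():
        if host_id == "global_maintenance":
            continue
        print(host["hostname"], host["engine-status"]["vm"], host["score"])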
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Here we have an issue in the agent logs:

MainThread::ERROR::2016-08-04 18:28:35,587::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/lib/ovirt-hosted-engine-ha/ha.conf, key=local_maintenance'' - trying to restart agent
MainThread::WARNING::2016-08-04 18:28:40,592::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-08-04 18:28:40,593::agent::210::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
It's a permission issue:

MainThread::DEBUG::2016-08-08 19:58:26,832::config::122::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_load) Configuration file '/var/lib/ovirt-hosted-engine-ha/ha.conf' not available [[Errno 13] Permission denied: '/var/lib/ovirt-hosted-engine-ha/ha.conf']
MainThread::ERROR::2016-08-08 19:58:26,832::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/lib/ovirt-hosted-engine-ha/ha.conf, key=local_maintenance'' - trying to restart agent

Indeed /var/lib/ovirt-hosted-engine-ha was
  drwx------. 2 root kvm
and /var/lib/ovirt-hosted-engine-ha/*.conf was
  -rw-r--r--. 1 root kvm
while they are expected to be owned by the vdsm user. Fixing the permissions seems to be enough to solve it. Let's now investigate why the permissions got messed up.

By the way, the system was also configured to send notifications via SMTP on localhost, but the postfix service was down; this could lead to other issues. See rhbz#1364286
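As a manual workaround until the root cause is fixed, the expected ownership can be restored by hand; a minimal sketch (assuming the vdsm user and kvm group exist on the host, and the paths shown above — this is not part of ovirt-hosted-engine-ha itself):

import grp
import os
import pwd

HA_DIR = "/var/lib/ovirt-hosted-engine-ha"

def fix_ha_ownership():
    """Restore the ownership the HA agent expects: vdsm user, kvm group."""
    uid = pwd.getpwnam("vdsm").pw_uid
    gid = grp.getgrnam("kvm").gr_gid
    os.chown(HA_DIR, uid, gid)
    for name in os.listdir(HA_DIR):  # broker.conf, ha.conf, ...
        os.chown(os.path.join(HA_DIR, name), uid, gid)

if __name__ == "__main__":
    fix_ha_ownership()  # must run as root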
Manually reinstalling the same ovirt-hosted-engine-ha rpm is enough to fix the permission issue, so it looks like a permission drift in Node, probably a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1364037#c10
How was this bug produced?

Was a Node update performed, then he-setup run?

or was it a clean installation?
(In reply to Fabian Deutsch from comment #14)
> How was this bug produced?
> 
> Was a Node update performed, then he-setup run?
> 
> or was it a clean installation?

It was a clean installation.
(In reply to shaochen from comment #15)
> (In reply to Fabian Deutsch from comment #14)
> > How was this bug produced?

As the bug description says, and as confirmed with the reporter, it was reproduced 100% of the time following the test steps.

> > 
> > Was a Node update performed, then he-setup run?
> > 
> > or was it a clean installation?
> 
> It was a clean installation.
*** Bug 1365322 has been marked as a duplicate of this bug. ***
Considering comment 15, this bug sounds different from bug 1364037: in this case the ownership is wrong (it has not drifted, because the vdsm id is hard-coded) and it happens on a clean install.

Simone confirmed that the user is correct in the rpm, but the files are already wrong inside the image:

$ sudo find l/var/lib/ovirt-hosted-engine-ha/ -ls
137195 4 drwx------ 2 root kvm 4096 Aug 4 01:07 l/var/lib/ovirt-hosted-engine-ha/
137196 4 -rw-r--r-- 1 root kvm 171 Jul 12 17:27 l/var/lib/ovirt-hosted-engine-ha/broker.conf
137197 4 -rw-r--r-- 1 root kvm 24 Jul 12 17:27 l/var/lib/ovirt-hosted-engine-ha/ha.conf
The problem is in the build process, from the image:

23:12:36,623 INFO packaging: ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch (666/743)
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: /var/tmp/rpm-tmp.it86fo: line 23: /usr/bin/systemctl: No such file or directory
23:12:36,624 INFO packaging: /var/tmp/rpm-tmp.it86fo: line 24: /usr/bin/systemctl: No such file or directory
23:12:36,624 INFO packaging: warning: %post(ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch) scriptlet failed, exit status 127
23:12:36,624 INFO packaging: vdsm-4.18.10-1.el7ev.x86_64 (667/743)

This shows that ovirt-hosted-engine-ha was installed before vdsm, and thus the vdsm user was not available. But looking at the ovirt-hosted-engine-ha requirements, it is clear that *-ha should depend on vdsm.
Bottom line: ovirt-hosted-engine-ha must use "Requires(pre): vdsm >= …" to ensure that vdsm (and thus the vdsm user) is available when its %post scriptlet runs.
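A rough post-compose sanity check for this, sketched in Python (it only verifies the end result on an installed host; it does not implement the Requires(pre) fix itself, and the package/user names are the ones discussed above):

import os
import pwd
import subprocess

HA_DIR = "/var/lib/ovirt-hosted-engine-ha"

def check_ha_packaging():
    """Check that the symptoms described above are gone on an installed host."""
    # 1. The installed package should declare a dependency on vdsm.
    #    (A plain 'Requires' also shows up here, so this is only a hint
    #    that the scriptlet-ordering fix is in place.)
    reqs = subprocess.run(
        ["rpm", "-q", "--requires", "ovirt-hosted-engine-ha"],
        stdout=subprocess.PIPE, check=True,
    ).stdout.decode()
    if not any(line.strip().startswith("vdsm") for line in reqs.splitlines()):
        print("WARNING: ovirt-hosted-engine-ha does not require vdsm")

    # 2. The %post scriptlet must have found the vdsm user, i.e. the
    #    state directory must not have fallen back to root ownership.
    vdsm_uid = pwd.getpwnam("vdsm").pw_uid  # raises KeyError if vdsm is missing
    if os.stat(HA_DIR).st_uid != vdsm_uid:
        print("WARNING: %s is not owned by the vdsm user" % HA_DIR)

if __name__ == "__main__":
    check_ha_packaging()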
Test version:
redhat-virtualization-host-4.0-20160810.1
imgbased-0.8.3-0.1.el7ev.noarch
redhat-release-virtualization-host-4.0-0.29.el7.x86_64
vdsm-4.18.11-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch
rhevm-appliance-20160731.0-1.el7ev.ova

Test steps:
1. Anaconda interactive install of RHVH via PXE.
2. Log in to RHVH via the cockpit UI.
3. Deploy Hosted Engine via cockpit, following the correct steps.
4. After the VM shuts down, wait a few minutes and check the HE status.
5. Reboot and check the HE status.

Test result:
After steps 4 and 5, Hosted Engine comes up and works well, so the bug is fixed with the above versions.
Changing status to VERIFIED according to comment 22.