Bug 1364034 - Hosted Engine always shows "Not running" status in cockpit after deploying it.
Summary: Hosted Engine always shows "Not running" status in cockpit after deploying it.
Keywords: Regression, TestBlocker
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Packaging.rpm
Version: 2.0.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.0.2
Target Release: 2.0.2
Assignee: Simone Tiraboschi
QA Contact: cshao
URL:
Whiteboard:
Duplicates: 1365322
Depends On:
Blocks:
 
Reported: 2016-08-04 10:45 UTC by cshao
Modified: 2017-05-11 09:23 UTC
CC List: 13 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-08-12 14:23:18 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
ylavi: planning_ack+
sbonazzo: devel_ack+
ycui: testing_ack+


Attachments
he-1 (34.04 KB, image/png) - 2016-08-04 10:45 UTC, cshao - no flags
HE-VM (34.79 KB, image/png) - 2016-08-04 10:45 UTC, cshao - no flags
all log info (7.49 MB, application/x-gzip) - 2016-08-04 10:47 UTC, cshao - no flags


Links
System ID | Priority/Branch | Status | Summary | Last Updated
Red Hat Bugzilla 1101554 | high | CLOSED | [RFE] HE-ha: use vdsm api instead of vdsClient | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1364037 | urgent | CLOSED | uid/gid drift - Breaks Cockpit and HE | 2021-02-22 00:41:40 UTC
oVirt gerrit 62105 | master | MERGED | rpm: fix VDSM dependency | 2020-11-06 06:20:04 UTC
oVirt gerrit 62109 | v2.0.z | MERGED | rpm: fix VDSM dependency | 2020-11-06 06:20:22 UTC

Internal Links: 1101554 1364037

Description cshao 2016-08-04 10:45:25 UTC
Created attachment 1187424 [details]
he-1

Description of problem:
Hosted Engine always shows "Not running" status after deploying it.

The HE VM can come up after running hosted-engine --vm-start, but the engine's hostname gets lost, and the HE status is still shown as "Not running" in cockpit.


# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli


# hosted-engine --vm-start
/usr/share/vdsm/vdsClient.py:33: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  from vdsm import utils, vdscli, constants
/usr/share/vdsm/vdsClient.py:33: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  from vdsm import utils, vdscli, constants

97004290-0d22-4366-9d28-27471d608f9e
	Status = WaitForLaunch
	nicModel = rtl8139,pv
	statusTime = 4313210600
	emulatedMachine = rhel6.5.0
	pid = 0
	vmName = HostedEngine
	devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '5403a8ac-d264-4b08-b5b1-5035bf81db65', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '3decb275-9fea-4be0-b19f-dcd94bb479b8', 'imageID': '36255e02-3c95-40cd-b21e-d6843a20cc04', 'specParams': {}, 'readonly': 'false', 'domainID': '8f0b4420-af5d-4ef7-95f7-b1efecea5cfa', 'optional': 'false', 'deviceId': '36255e02-3c95-40cd-b21e-d6843a20cc04', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:4e:44:ea', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': 'f8c80999-c733-4a89-9282-7447f09d16a3', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': 'd325fdb4-7c54-43b8-85d1-16ef9e4387fe', 'alias': 'console0'}, {'device': 'vga', 'alias': 'video0', 'type': 'video'}]
	guestDiskMapping = {}
	vmType = kvm
	clientIp = 
	displaySecurePort = -1
	memSize = 4096
	displayPort = -1
	cpuType = Opteron_G3
	spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
	smp = 2
	displayIp = 0
	display = vnc




Version-Release number of selected component (if applicable):
redhat-virtualization-host-4.0-20160803.3.
imgbased-0.7.4-0.1.el7ev.noarch
cockpit-ovirt-dashboard-0.10.6-1.3.4.el7ev.noarch
cockpit-0.114-2.el7.x86_64
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.3-1.el7ev.noarch
20160731.0-1.el7ev.4.0.ova 

How reproducible:
100%

Steps to Reproduce:
1. Interactively install RHVH via PXE with Anaconda, using the kickstart below.
2. Log in to RHVH via the cockpit UI.
3. Deploy Hosted Engine via cockpit, following the correct steps.
4. After the VM shuts down, wait a few minutes and check the HE status.

Actual results:
Hosted Engine always shows "Not running" status after deploying it.

Expected results:
Hosted Engine comes up and works well after deployment.


Additional info:
1. ks:
liveimg --url=http://10.66.10.22:8090/rhevh/rhevh7-ng-36/redhat-virtualization-host-4.0-20160803.3/redhat-virtualization-host-4.0-20160803.3.x86_64.liveimg.squashfs
%post
imgbase layout --init
%end 

2. No such issue on redhat-virtualization-host-4.0-20160727.1, so this is a regression bug.

Comment 1 cshao 2016-08-04 10:45:49 UTC
Created attachment 1187425 [details]
HE-VM

Comment 2 cshao 2016-08-04 10:47:06 UTC
Created attachment 1187426 [details]
all log info

Comment 3 cshao 2016-08-04 10:49:38 UTC
Adding keywords "Regression" and "TestBlocker" since there is no such issue on redhat-virtualization-host-4.0-20160727.1 and this blocks HE testing.

Comment 4 Ryan Barry 2016-08-05 04:17:37 UTC
What does "hosted-engine --vm-status" show?

Comment 5 cshao 2016-08-05 04:41:00 UTC
(In reply to Ryan Barry from comment #4)
> What does "hosted-engine --vm-status" show?

The output of "hosted-engine --vm-status" is in the bug description:

# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli

Comment 6 Fabian Deutsch 2016-08-05 13:14:43 UTC
Moving to he-setup due to the import issue.

Comment 7 Simone Tiraboschi 2016-08-05 16:19:25 UTC
With
vdsm-python-4.18.6-1.el7.centos.noarch                            4.18.6-1.el7.centos            @ovirt-4.0
we don't see the warning; it shows up instead with a newer vdsm-python build (see the next comment).

Comment 8 Simone Tiraboschi 2016-08-05 16:35:59 UTC
Ok, reproduced with vdsm-python.noarch 0:4.18.10-1.el7.centos.

The issue is in ovirt_hosted_engine_ha/lib/storage_backends.py and it's a direct result of rhbz#1101554 which is targeted to 4.1 and could potentially have a wide impact.

On the other hand, the warning message goes to stderr while the JSON output goes to stdout.

[root@foobar ovirt-hosted-engine-setup]# hosted-engine --vm-status --json
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli
{"1": {"live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=31141 (Fri Aug  5 18:33:04 2016)\nhost-id=1\nscore=3400\nmaintenance=False\nstate=EngineDown\nstopped=False\n", "hostname": "foobar.localdomain", "host-id": 1, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "7f8be5ec", "host-ts": 31141}, "global_maintenance": false}


[root@foobar ovirt-hosted-engine-setup]# hosted-engine --vm-status --json 2>>/dev/null
{"1": {"live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=31171 (Fri Aug  5 18:33:33 2016)\nhost-id=1\nscore=3400\nmaintenance=False\nstate=EngineStart\nstopped=False\n", "hostname": "foobar.localdomain", "host-id": 1, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "b8890b47", "host-ts": 31171}, "global_maintenance": false}

Ryan, is there any reason to also parse stderr?

Comment 9 Ryan Barry 2016-08-05 17:09:04 UTC
(In reply to Simone Tiraboschi from comment #8)
> Ryan, is there any reasons to parse also the stderr?

No, there isn't, but cockpit.spawn discards stderr (sends it to the journal) by default.

http://cockpit-project.org/guide/latest/cockpit-spawn.html

I haven't tested this (we don't expect stderr), but I'd expect it to work as written. At a guess, the problem here is that the version of vdsm-python in the latest GA compose doesn't send any output at all to stdout (I haven't verified this, and I'm out until Wednesday, but see comment #5).

Comment 10 Red Hat Bugzilla Rules Engine 2016-08-08 07:44:39 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 11 Simone Tiraboschi 2016-08-08 08:08:50 UTC
Here we have an issue in the agent logs:

MainThread::ERROR::2016-08-04 18:28:35,587::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/lib/ovirt-hosted-engine-ha/ha.conf, key=local_maintenance'' - trying to restart agent
MainThread::WARNING::2016-08-04 18:28:40,592::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-08-04 18:28:40,593::agent::210::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
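
A quick way to double-check whether the agent can actually read that configuration file (just a diagnostic sketch, assuming the agent runs as the vdsm user):

# ls -l /var/lib/ovirt-hosted-engine-ha/
# sudo -u vdsm cat /var/lib/ovirt-hosted-engine-ha/ha.conf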

Comment 12 Simone Tiraboschi 2016-08-08 12:20:21 UTC
It's a permission issue:
MainThread::DEBUG::2016-08-08 19:58:26,832::config::122::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_load) Configuration file '/var/lib/ovirt-hosted-engine-ha/ha.conf' not available [[Errno 13] Permission denied: '/var/lib/ovirt-hosted-engine-ha/ha.conf']
MainThread::ERROR::2016-08-08 19:58:26,832::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: ''Configuration value not found: file=/var/lib/ovirt-hosted-engine-ha/ha.conf, key=local_maintenance'' - trying to restart agent

Indeed, /var/lib/ovirt-hosted-engine-ha was
drwx------. 2 root    kvm
and /var/lib/ovirt-hosted-engine-ha/*.conf was
-rw-r--r--. 1 root    kvm
while they are expected to be owned by the vdsm user.
Fixing the permissions seems to be enough to solve it.

Let's investigate now why the permissions got messed up.

By the way, the system was also configured to send notifications via SMTP on localhost, but the postfix service was down, which could lead to other issues. See rhbz#1364286.
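
As a manual workaround, restoring the expected ownership should be enough (a sketch of the manual fix, assuming vdsm:kvm is the intended owner and group as described above):

# chown -R vdsm:kvm /var/lib/ovirt-hosted-engine-ha
# ls -ln /var/lib/ovirt-hosted-engine-ha/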

Comment 13 Simone Tiraboschi 2016-08-08 12:40:41 UTC
Manually reinstalling the same ovirt-hosted-engine-ha rpm is enough to fix the permission issue, so this looks like a permission drift on Node, similar to (and probably a duplicate of) https://bugzilla.redhat.com/show_bug.cgi?id=1364037#c10.
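
For reference, either of these should reapply the ownership and modes recorded in the package (illustrative; --setugids and --setperms reset file owners/groups and permissions to the packaged values):

# yum reinstall ovirt-hosted-engine-ha
# rpm --setugids ovirt-hosted-engine-ha
# rpm --setperms ovirt-hosted-engine-ha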

Comment 14 Fabian Deutsch 2016-08-08 15:14:21 UTC
How was this bug produced?

Was a Node update performed, then he-setup run?

or was it a clean installation?

Comment 15 cshao 2016-08-09 02:43:34 UTC
(In reply to Fabian Deutsch from comment #14)
> How was this bug produced?
> 
> Was a Node update performed, then he-setup run?
> 
> or was it a clean installation?

It was a clean installation.

Comment 16 Ying Cui 2016-08-09 03:22:08 UTC
(In reply to shaochen from comment #15)
> (In reply to Fabian Deutsch from comment #14)
> > How was this bug produced?

As the bug description says, and as confirmed with the reporter, it is reproduced 100% of the time following the test steps.

> > 
> > Was a Node update performed, then he-setup run?
> > 
> > or was it a clean installation?
> 
> It was a clean installation.

Comment 17 Simone Tiraboschi 2016-08-09 07:15:53 UTC
*** Bug 1365322 has been marked as a duplicate of this bug. ***

Comment 18 Fabian Deutsch 2016-08-09 08:02:37 UTC
Considering comment 15, this bug sounds different from bug 1364037: in this case the ownership is simply wrong (it has not drifted, because the vdsm id is hard-coded), and it happens on a clean install.

Simone confirmed that the user is correct in the rpm.

But the ownership is already wrong inside the image:

$ sudo find l/var/lib/ovirt-hosted-engine-ha/ -ls
137195    4 drwx------   2 root     kvm          4096 Aug  4 01:07 l/var/lib/ovirt-hosted-engine-ha/
137196    4 -rw-r--r--   1 root     kvm           171 Jul 12 17:27 l/var/lib/ovirt-hosted-engine-ha/broker.conf
137197    4 -rw-r--r--   1 root     kvm            24 Jul 12 17:27 l/var/lib/ovirt-hosted-engine-ha/ha.conf

Comment 19 Fabian Deutsch 2016-08-09 08:08:16 UTC
The problem is in the build process, from the image:

23:12:36,623 INFO packaging: ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch (666/743)
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: warning: user vdsm does not exist - using root
23:12:36,623 INFO packaging: /var/tmp/rpm-tmp.it86fo: line 23: /usr/bin/systemctl: No such file or directory
23:12:36,624 INFO packaging: /var/tmp/rpm-tmp.it86fo: line 24: /usr/bin/systemctl: No such file or directory
23:12:36,624 INFO packaging: warning: %post(ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch) scriptlet failed, exit status 127
23:12:36,624 INFO packaging: vdsm-4.18.10-1.el7ev.x86_64 (667/743)

This shows that ovirt-hosted-engine-ha was installed before vdsm, and thus the vdsm user was not available.

But looking at the ovirt-hosted-engine-ha requirements it is clear that *-ha should depend on vdsm.
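
For reference, the declared requirements and the %post scriptlet that needs the vdsm user can be inspected on an installed system with (illustrative):

# rpm -q --requires ovirt-hosted-engine-ha | grep -i vdsm
# rpm -q --scripts ovirt-hosted-engine-ha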

Comment 21 Fabian Deutsch 2016-08-09 08:15:01 UTC
Bottom line: ovirt-hosted-engine-ha must use Requires(pre): vdsm >= … to ensure that vdsm is available when its %post scriptlet runs.
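
Once a fixed build is available, the dependency can be verified against the rpm itself (a sketch; the package file name here is just an example):

# rpm -qp --requires ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch.rpm | grep -i vdsm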

Comment 22 cshao 2016-08-11 13:44:58 UTC
Test version:
redhat-virtualization-host-4.0-20160810.1 
imgbased-0.8.3-0.1.el7ev.noarch
redhat-release-virtualization-host-4.0-0.29.el7.x86_64
vdsm-4.18.11-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch
rhevm-appliance-20160731.0-1.el7ev.ova


Test steps:
1. Interactively install RHVH via PXE with Anaconda.
2. Log in to RHVH via the cockpit UI.
3. Deploy Hosted Engine via cockpit, following the correct steps.
4. After the VM shuts down, wait a few minutes and check the HE status.
5. Reboot, check the HE status.

Test result:
After steps 4 and 5, Hosted Engine comes up and works well.
So the bug is fixed in the above versions.

Comment 23 Ying Cui 2016-08-11 14:31:11 UTC
Changing status to VERIFIED according to comment 22.

