Description of problem:

Hosted Engine VM failed to start, giving a python attribute error. engine-upgrade-check indicated no RHEV upgrade. Updates were applied via yum, which included a kernel update. Shutdown of the engine completed; however, hosted-engine --vm-start fails with a python error:

~~~
[root@vhacdwdwhvsh223 ~]# hosted-engine --vm-start
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 149, in <module>
    args.command(args)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 63, in checkVmStatus
    status = cli.getVmStats(args.vmid)
AttributeError: '_Client' object has no attribute 'getVmStats'
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 149, in <module>
    args.command(args)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 41, in create
    status = cli.create(vm_params)
AttributeError: '_Client' object has no attribute 'create'
~~~

How reproducible:

Steps to Reproduce:
This error was encountered in the customer's HE setup; the sequence of events is as follows:

Observations-1:
===============
- Upon checking the vm status and the ps output from each host, we observed the HE VM as UP on one of the hosts, host3:
~~~
$ less 50-clu200_hosted-engine_status|egrep '==|status'
--== Host 1 status ==--
Engine status              : unknown stale-data
--== Host 2 status ==--
Engine status              : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
--== Host 3 status ==--
Engine status              : {"health": "good", "vm": "up", "detail": "Up"}
--== Host 4 status ==--
Engine status              : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
--== Host 5 status ==--
Engine status              : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
~~~
However, upon further checking there was no qemu process running, the virsh output showed the hosted engine as shut off, and the ps output showed the HE VM as unreachable.

Actions Taken:
==============
1. Restarted the agent and broker services in sequence.
2. Checked the HE VM status on all 5 hosts to confirm what HostedEngine believes now.

Outcome:
========
1. The vm status output now showed the HE VM as down on every host.
2. Started the HE VM with the command on host-3 to check whether we still get the python error.
3. The VM did not start and returned the same missing-attribute python errors:
~~~
AttributeError: '_Client' object has no attribute 'getVmStats'
AttributeError: '_Client' object has no attribute 'create'
~~~

Observations-2:
===============
- I searched for similar kbases and BZs and found nothing relevant to the specific attributes mentioned above.
- I observed one more error in the host's syslog, a libvirtd error which made me suspect that the storage was unavailable to that host:
~~~
Oct 25 18:20:37 vhacdwdwhvsh221 libvirtd: 2018-10-25 23:20:37.422+0000: 5996: error : qemuOpenFileAs:3234 : Failed to open file '/var/run/vdsm/storage/eac03447-50c9-4542-a9f3-65aebb58d68f/3188be35-fc63-4f37-a25c-cb82ba2ceeee/44dde38e-d0f5-4d7d-92a3-7994c425ec87': No such file or directory
~~~
- But upon checking the sosreport, it is confirmed that the LV is not missing and the LUN is present:
~~~
/dev/mapper/360002ac000000000000000240001ea6e  eac03447-50c9-4542-a9f3-65aebb58d68f  lvm2  a--  99.62g  33.25g
100.00g  4i1USJ-P9gG-tC9P-BavB-Yw9J-9RlG-iDi5xi  lvm2  4i1USJ-P9gG-tC9P-BavB-Yw9J-9RlG-iDi5xi  100.00g  /dev/mapper/360002ac000000000000000240001ea6e  253  4  63.99m  128.00m  2  144.00m  99.62g  33.25g  <66.38g  a--  allocatable  797  531  2  2  0  0  used  lvm2
0nedej-kANN-DVnZ-BZs0-LbeX-rmey-q4GraN  eac03447-50c9-4542-a9f3-65aebb58d68f  wz--n-  writeable  extendable  normal  99.62g  33.25g  128.00m  797  266  0  0  1  0  13  0  26  MDT_CLASS=Data,MDT_DESCRIPTION=hosted_storage,MDT_IOOPTIMEOUTSEC=10,MDT_LEASERETRIES=3,MDT_LEASETIMESEC=60,MDT_LOCKPOLICY=,MDT_LOCKRENEWALINTERVALSEC=5,MDT_LOGBLKSIZE=512,MDT_PHYBLKSIZE=512,MDT_POOL_UUID=5a870ad7-01be-02c2-00dc-00000000030d,MDT_PV0=pv:360002ac000000000000000240001ea6e&44&uuid:4i1USJ-P9gG-tC9P-BavB-Yw9J-9RlG-iDi5xi&44&pestart:0&44&pecount:797&44&mapoffset:0,MDT_ROLE=Regular,MDT_SDUUID=eac03447-50c9-4542-a9f3-65aebb58d68f,MDT_TYPE=FCP,MDT_VERSION=4,MDT_VGUUID=0nedej-kANN-DVnZ-BZs0-LbeX-rmey-q4GraN,MDT__SHA_CKSUM=946e4c59e0f29b022e0b9f09d6b4d37d587e9939,RHAT_storage_domain  2  2  63.99m  128.00m  unmanaged

$ less su_vdsm_-s_.bin.sh_-c_.usr.bin.tree_-l_.rhev.data-center | grep 44dde38e-d0f5-4d7d-92a3-7994c425ec87
|   |   |   `-- 44dde38e-d0f5-4d7d-92a3-7994c425ec87 -> /dev/eac03447-50c9-4542-a9f3-65aebb58d68f/44dde38e-d0f5-4d7d-92a3-7994c425ec87
|   |   `-- 44dde38e-d0f5-4d7d-92a3-7994c425ec87 -> /dev/eac03447-50c9-4542-a9f3-65aebb58d68f/44dde38e-d0f5-4d7d-92a3-7994c425ec87
~~~
- The DNS configuration shows the host pointing to "localhost" in resolv.conf, which should not be the case (also see #54):
~~~
$ cat etc/resolv.conf
# Managed by ansible, hand edits will be overwritten.
search vha.med.va.gov
domain vha.med.va.gov
nameserver 127.0.0.1
nameserver 10.224.149.150
nameserver 10.224.45.3
nameserver 10.3.27.33
~~~

Actual results:

Expected results:

Additional info:
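As a consolidated cross-check of the observations above, the following shell sketch (commands are illustrative and not taken from the original case notes; UUIDs are the ones from the logs in this report) compares what libvirt, qemu, the HA agent, and LVM each report on the host:

~~~
# Compare what libvirt, qemu and the HA agent each believe about the HE VM:
virsh -r list --all                               # read-only; HostedEngine shown as running or shut off
ps aux | grep qemu                                # no qemu-kvm process => the VM is not actually running
hosted-engine --vm-status                         # the agent's view, as collected on every host above
systemctl restart ovirt-ha-broker ovirt-ha-agent  # restart broker and agent, as in "Actions Taken"

# Verify the hosted-engine image LV behind the path libvirtd failed to open:
lvs eac03447-50c9-4542-a9f3-65aebb58d68f
ls -l /var/run/vdsm/storage/eac03447-50c9-4542-a9f3-65aebb58d68f/3188be35-fc63-4f37-a25c-cb82ba2ceeee/
ls -l /dev/eac03447-50c9-4542-a9f3-65aebb58d68f/44dde38e-d0f5-4d7d-92a3-7994c425ec87
# If the LV exists but the device node is absent, it may simply not be activated on this host:
lvchange -ay eac03447-50c9-4542-a9f3-65aebb58d68f/44dde38e-d0f5-4d7d-92a3-7994c425ec87
~~~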
Just to add some additional information: the customer states that they were simply applying the recommended minor updates to the system when this issue started.
sosreport-wcarlson.02236185-20181026073634 shows
ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch (from ovirt-4.1)
but
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch and vdsm-client-4.20.27.2-1.el7ev.noarch (from ovirt-4.2)

vdsm-client changed drastically between 4.1 and 4.2; we should have had a "Conflicts: ovirt-hosted-engine-ha < 2.2" there.
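A quick way to spot this kind of mixed 4.1/4.2 install on a host is a plain rpm query; the sketch below is illustrative, with the output echoing the versions reported in the sosreport:

~~~
$ rpm -qa 'ovirt-hosted-engine-*' 'vdsm-client*'
ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch    <- ovirt-4.1 stream
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch        <- ovirt-4.2 stream
vdsm-client-4.20.27.2-1.el7ev.noarch                <- ovirt-4.2 stream
~~~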
(In reply to Dan Kenigsberg from comment #3)
> sosreport-wcarlson.02236185-20181026073634 shows
> ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch (from ovirt-4.1)
> but
> ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch and
> vdsm-client-4.20.27.2-1.el7ev.noarch (from ovirt-4.2)
>
> vdsm-client had a drastic change between 4.1 to 4.2, we should have had a
> "Conflicts: ovirt-hosted-engine-ha < 2.2" there.

Yes, I think the issue is simply due to having mixed ovirt-hosted-engine-setup from 4.1 with ovirt-hosted-engine-ha from 4.2.

In the ovirt-hosted-engine-setup spec file from 4.2 we have
  Requires: ovirt-hosted-engine-ha >= 2.2.13
but we are missing a
  Conflicts: ovirt-hosted-engine-ha >= 2.2
in the ovirt-hosted-engine-setup spec file from 4.1; alternatively, a
  Conflicts: ovirt-hosted-engine-setup < 2.2
in the ovirt-hosted-engine-ha spec file from 4.2 (next build) should prevent this issue.

Upgrading ovirt-hosted-engine-setup on all the hosts to the latest 2.2.z should also solve this.
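The packaging metadata Simone describes can also be inspected directly on an affected host; a hedged sketch (queries are illustrative, output omitted):

~~~
# What the 4.2 setup package declares (per the comment above, it carries
# "Requires: ovirt-hosted-engine-ha >= 2.2.13"):
$ rpm -q --requires ovirt-hosted-engine-setup | grep ovirt-hosted-engine-ha

# What the installed ovirt-hosted-engine-ha declares; in the affected build there is no
# "Conflicts: ovirt-hosted-engine-setup < 2.2", which is why the mixed install was allowed:
$ rpm -q --conflicts ovirt-hosted-engine-ha
~~~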
moving to Integration for consideration of further defensive spec changes, but it doesn't seem like a product bug
The customer was able to downgrade ovirt-hosted-engine-ha and restart the host, and ovirt-engine/Hosted Engine started again. Is there a need to collect new log collector output at this time?
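For reference, a minimal sketch of that workaround, assuming the 4.1 repositories are still enabled so yum can find the older ovirt-hosted-engine-ha build:

~~~
# Roll ovirt-hosted-engine-ha back to the release matching the installed 4.1 setup package:
yum downgrade ovirt-hosted-engine-ha
# Restart the HA services (or reboot the host, as the customer did) and retry the engine VM:
systemctl restart ovirt-ha-broker ovirt-ha-agent
hosted-engine --vm-start
~~~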
Further to our latest conversation with Simone, here are the steps that were executed for verification:

1. Installed 4.1 ovirt-hosted-engine-setup-2.1.4.2-1.el7ev.noarch and ovirt-hosted-engine-ha-2.1.11-1.el7ev.noarch on RHEL7.5.
2. echo "ovirt-hosted-engine-setup" > /etc/yum/pluginconf.d/versionlock.list
3. Upgraded the OS to RHEL7.6.
4. Added 4.2 repos.
5. Tried "yum update ovirt-hosted-engine-ha" while keeping the older ovirt-hosted-engine-setup via versionlock. The update executed successfully and both ovirt-hosted-engine-setup and ovirt-hosted-engine-ha were updated:
   ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
   ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch

The mixed-versions issue is fixed. Moving to verified.
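A simple post-update check (illustrative) confirming both packages now come from the 4.2 stream:

~~~
$ rpm -q ovirt-hosted-engine-ha ovirt-hosted-engine-setup
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
~~~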
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1049
sync2jira