+++ This bug was initially created as a clone of Bug #1469143 +++

Description of problem:
Whenever an upgrade of RHV-H 4.1.2 to 4.1.3 is done, the Hosted Engine HA state ends up in Local Maintenance.

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.6.6-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup with the RHV-H 4.1.2 async build.
2. Add all the required repos.
3. An upgrade symbol appears next to the hypervisor.
4. Click on that.

Actual results:
The RHV-H host gets upgraded to 4.1.3, leaving the Hosted Engine HA state in 'Local Maintenance'.

Expected results:
The RHV-H host gets upgraded to 4.1.3 and the Hosted Engine HA state should not be in 'Local Maintenance'.

Additional info:
Adding hosted-engine --vm-status before and after upgrade:

> Output of hosted-engine --vm-status before upgrade:
> =======================================================
>
> [root@yarrow ~]# hosted-engine --vm-status
>
>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : yarrow.lab.eng.blr.redhat.com
> Host ID : 1
> Engine status : {"health": "good", "vm": "up", "detail": "up"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : b4359588
> local_conf_timestamp : 75583
> Host timestamp : 75567
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=75567 (Thu Jul 6 15:09:26 2017)
>   host-id=1
>   score=3400
>   vm_conf_refresh_time=75583 (Thu Jul 6 15:09:42 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineUp
>   stopped=False
>
>
> --== Host 2 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : tettnang.lab.eng.blr.redhat.com
> Host ID : 2
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 1800
> stopped : False
> Local maintenance : False
> crc32 : 7bfbbfd5
> local_conf_timestamp : 1440
> Host timestamp : 1423
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=1423 (Thu Jul 6 15:09:07 2017)
>   host-id=2
>   score=1800
>   vm_conf_refresh_time=1440 (Thu Jul 6 15:09:23 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
>
> --== Host 3 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : zod.lab.eng.blr.redhat.com
> Host ID : 3
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : 7caabb48
> local_conf_timestamp : 75597
> Host timestamp : 75581
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=75581 (Thu Jul 6 15:09:23 2017)
>   host-id=3
>   score=3400
>   vm_conf_refresh_time=75597 (Thu Jul 6 15:09:39 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
> Output of hosted-engine --vm-status after upgrade:
> ===================================================
>
> [root@yarrow ~]# hosted-engine --vm-status
>
>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : yarrow.lab.eng.blr.redhat.com
> Host ID : 1
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 0
> stopped : False
> Local maintenance : True
> crc32 : bc34659d
> local_conf_timestamp : 7624
> Host timestamp : 7608
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=7608 (Thu Jul 6 17:50:33 2017)
>   host-id=1
>   score=0
>   vm_conf_refresh_time=7624 (Thu Jul 6 17:50:48 2017)
>   conf_on_shared_storage=True
>   maintenance=True
>   state=LocalMaintenance
>   stopped=False
>
>
> --== Host 2 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : tettnang.lab.eng.blr.redhat.com
> Host ID : 2
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 1800
> stopped : False
> Local maintenance : False
> crc32 : 521f80d4
> local_conf_timestamp : 11121
> Host timestamp : 11105
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=11105 (Thu Jul 6 17:50:29 2017)
>   host-id=2
>   score=1800
>   vm_conf_refresh_time=11121 (Thu Jul 6 17:50:45 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
>
> --== Host 3 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : zod.lab.eng.blr.redhat.com
> Host ID : 3
> Engine status : {"health": "good", "vm": "up", "detail": "up"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : 77b3a2d6
> local_conf_timestamp : 85262
> Host timestamp : 85246
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=85246 (Thu Jul 6 17:50:28 2017)
>   host-id=3
>   score=3400
>   vm_conf_refresh_time=85262 (Thu Jul 6 17:50:44 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineUp
>   stopped=False
>
> cat /var/lib/ovirt-hosted-engine-ha/ha.conf
> local_maintenance=True

--- Additional comment from Yaniv Lavi on 2017-07-17 05:19:22 EDT ---

Can you check for a regression in the host activation flow? It is supposed to move the host out of local maintenance.

--- Additional comment from Artyom on 2017-07-19 07:48:04 EDT ---

So it is not a regression in the host activation flow. The problem is:
1) Move the host to maintenance via the engine (this activates the HE "LocalMaintenance" state).
2) Upgrade the host via the engine. After the upgrade the host moves straight to the Up state, so from the engine side the host is UP, but from the HE side the host still has the "LocalMaintenance" state, because no one ran the activate command on the engine side.

See also a bug with a similar problem - https://bugzilla.redhat.com/show_bug.cgi?id=1468875

--- Additional comment from Sandro Bonazzola on 2017-11-18 02:50:36 EST ---

Denis, is this going to land in 4.2.0? If not, please re-target.
Germano, sounds like my bz#1489982.
(In reply to Marina from comment #2)
> Germano, sounds like my bz#1489982.

Indeed. So your BZ was a dup of Bug #1469143, which wasn't closed when it was fixed, and then I cloned the original BZ downstream. Also, your BZ says this was fixed in 4.2, but the original BZ is targeted to 4.3, and I reproduced this on 4.1.8. Can it get any more confusing? ;)

Should we close them all, or do we want to get this fixed in 4.1.10? I think it should be fixed in 4.1.10 too: after a round of upgrades all HE hosts might be in maintenance mode, defeating HA, so it's quite serious. What do you think?
This is severe and should not be targeted so far in the future. The HE maintenance mode should be locked to the engine maintenance mode if the engine is up. Maintaining this during an upgrade is elementary. Retargeting.
Nikolai, we need to figure out if this is still broken and where. Can you please try reproducing it with 4.1.8 -> 4.1.9 upgrade? It might be RHEV-H specific too.
(In reply to Martin Sivák from comment #6)
> Nikolai, we need to figure out if this is still broken and where. Can you
> please try reproducing it with 4.1.8 -> 4.1.9 upgrade? It might be RHEV-H
> specific too.

It's an HC-specific issue. Kasturi Narra, please provide your input.
Jiri, have you seen such an issue during your latest set of upgrade tests?
Hey, why is it HC specific? I believe what happens here is that when the host comes out of engine-side maintenance due to an upgrade or reinstall, it should also cancel the HE local maintenance, that's all. Today the engine enables HE local maintenance once we put the host into maintenance in the RHV UI, but it never cancels the HE maintenance when the host is auto-activated back on the engine side. And this is the problem.
(In reply to Marina from comment #9)
> Hey, why is it HC specific? I believe what happens here is that when the
> host comes out of engine-side maintenance due to an upgrade or reinstall,
> it should also cancel the HE local maintenance, that's all. Today the engine
> enables HE local maintenance once we put the host into maintenance in the
> RHV UI, but it never cancels the HE maintenance when the host is
> auto-activated back on the engine side. And this is the problem.

So there is confirmation from your side that this is not HC specific. Regular RHEL/RHVH HE hosts will hit the same issue during the upgrade. Martin, please review comment #9.
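For reference, the flow described in comment #9 can be driven explicitly through the engine REST API: the deactivate action is what sets HE local maintenance, and the activate action is the step that is supposed to clear it. A rough sketch (the engine URL, credentials, and host ID below are placeholders, not values from this bug):

  ENGINE=https://engine.example.com/ovirt-engine/api   # placeholder
  AUTH='admin@internal:password'                       # placeholder

  # move the host to Maintenance (the engine also sets HE local maintenance)
  curl -k -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
       -d '<action/>' "$ENGINE/hosts/<host-id>/deactivate"

  # activate the host again (expected to clear HE local maintenance)
  curl -k -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
       -d '<action/>' "$ENGINE/hosts/<host-id>/activate"

The bug here is that the upgrade flow skips the explicit activate step, so the HE flag is never cleared.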
Nikolai, we asked for a test of this to see whether it is really happening and where. There is conflicting information with regard to RHEV-H and branches (4.1 vs 4.2). Since all we have now are opinions, I would like someone from QE to provide some hard data before we decide what to do with all the linked bugs.
Before upgrade
==============
# nodectl info
layers:
  rhvh-4.1-0.20180102.0:
    rhvh-4.1-0.20180102.0+1
bootloader:
  default: rhvh-4.1-0.20180102.0+1
  entries:
    rhvh-4.1-0.20180102.0+1:
      index: 0
      title: rhvh-4.1-0.20180102.0
      kernel: /boot/rhvh-4.1-0.20180102.0+1/vmlinuz-3.10.0-693.11.6.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma05/rhvh-4.1-0.20180102.0+1 rd.lvm.lv=rhvh_alma05/swap rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180102.0+1"
      initrd: /boot/rhvh-4.1-0.20180102.0+1/initramfs-3.10.0-693.11.6.el7.x86_64.img
      root: /dev/rhvh_alma05/rhvh-4.1-0.20180102.0+1
current_layer: rhvh-4.1-0.20180102.0+1

After upgrade
=============
# nodectl info
layers:
  rhvh-4.1-0.20180126.0:
    rhvh-4.1-0.20180126.0+1
  rhvh-4.1-0.20180102.0:
    rhvh-4.1-0.20180102.0+1
bootloader:
  default: rhvh-4.1-0.20180126.0+1
  entries:
    rhvh-4.1-0.20180102.0+1:
      index: 1
      title: rhvh-4.1-0.20180102.0
      kernel: /boot/rhvh-4.1-0.20180102.0+1/vmlinuz-3.10.0-693.11.6.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma06/swap rd.lvm.lv=rhvh_alma06/rhvh-4.1-0.20180102.0+1 rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180102.0+1"
      initrd: /boot/rhvh-4.1-0.20180102.0+1/initramfs-3.10.0-693.11.6.el7.x86_64.img
      root: /dev/rhvh_alma06/rhvh-4.1-0.20180102.0+1
    rhvh-4.1-0.20180126.0+1:
      index: 0
      title: rhvh-4.1-0.20180126.0
      kernel: /boot/rhvh-4.1-0.20180126.0+1/vmlinuz-3.10.0-693.17.1.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma06/swap rd.lvm.lv=rhvh_alma06/rhvh-4.1-0.20180126.0+1 rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180126.0+1"
      initrd: /boot/rhvh-4.1-0.20180126.0+1/initramfs-3.10.0-693.17.1.el7.x86_64.img
      root: /dev/rhvh_alma06/rhvh-4.1-0.20180126.0+1
current_layer: rhvh-4.1-0.20180126.0+1

1) Host UP
2) Host has a repository with new packages:
   Check for available updates on host alma06.qa.lab.tlv.redhat.com was completed successfully with message 'found updates for packages redhat-virtualization-host-image-update-4.1-20180126.0.el7_4'.
3) Click on the Upgrade link "A new version is available. Upgrade":

   Feb 19, 2018 1:53:21 PM Host alma06.qa.lab.tlv.redhat.com upgrade was completed successfully.
   Feb 19, 2018 1:53:20 PM Host alma06.qa.lab.tlv.redhat.com was restarted using SSH by the engine.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Termination.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Retrieving installation logs to: '/var/log/ovirt-engine/host-deploy/ovirt-host-mgmt-20180219065319-alma06.qa.lab.tlv.redhat.com-f75d262d-cc5f-4d2c-bf2d-4ddc0c24988c.log'.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Pre-termination.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Closing up.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Transaction commit.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Misc configuration.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Verify: 2/2: redhat-virtualization-host-image-update-placeholder.noarch 0:4.1-8.1.el7 - od.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Verify: 1/2: redhat-virtualization-host-image-update.noarch 0:4.1-20180126.0.el7_4 - u.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum erase: 2/2: redhat-virtualization-host-image-update-placeholder.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum obsoleting: 1/2: redhat-virtualization-host-image-update-4.1-20180126.0.el7_4.noarch.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Status: Running Transaction.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Status: Running Test Transaction.

4) Host UP under the engine, but in the LocalMaintenance state under hosted-engine --vm-status:

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma06.qa.lab.tlv.redhat.com
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : True
crc32 : 69c202ba
local_conf_timestamp : 3758
Host timestamp : 3758
Extra metadata (valid at timestamp):
  metadata_parse_version=1
  metadata_feature_version=1
  timestamp=3758 (Mon Feb 19 15:07:31 2018)
  host-id=2
  score=0
  vm_conf_refresh_time=3758 (Mon Feb 19 15:07:31 2018)
  conf_on_shared_storage=True
  maintenance=True
  state=LocalMaintenance
  stopped=False
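A quick way to spot the mismatch shown in step 4 on any HE host (only commands and files that already appear in this bug) is to compare the HA agent's view with its persisted flag:

  # HA-side view: hostname, score, "Local maintenance" flag and agent state per host
  hosted-engine --vm-status | grep -E 'Hostname|Score|Local maintenance|state='

  # the agent's persisted flag on the host itself
  cat /var/lib/ovirt-hosted-engine-ha/ha.conf

A host that is Up in the webadmin UI but reports "Local maintenance : True" and score=0 here is affected.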
I wonder if this could be closed as DUP of #1489982
Can you please check what the states are when you upgrade a RHEV-H host?
1) You put the host into maintenance using the webadmin button.
2) You update the node.
3) The node reboots.
4) Does it stay in maintenance mode (in the engine) or does it move to Up automatically?
It stays in the Maintenance state.
*** This bug has been marked as a duplicate of bug 1489982 ***