+++ This bug was initially created as a clone of Bug #1469143 +++

Description of problem:
Whenever an upgrade of RHV-H 4.1.2 to 4.1.3 is done, the Hosted Engine HA state ends up in Local Maintenance.

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.6.6-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup with the RHV-H 4.1.2 async build.
2. Add all the required repos.
3. An upgrade symbol appears next to the hypervisor.
4. Click on that.

Actual results:
The RHV-H host gets upgraded to 4.1.3, leaving the Hosted Engine HA state in 'Local Maintenance'.

Expected results:
The RHV-H host gets upgraded to 4.1.3 and the Hosted Engine HA state should not be in 'Local Maintenance'.

Additional info:
Adding hosted-engine --vm-status before and after upgrade:

> Output of hosted-engine --vm-status before upgrade:
> =======================================================
>
> [root@yarrow ~]# hosted-engine --vm-status
>
>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : yarrow.lab.eng.blr.redhat.com
> Host ID : 1
> Engine status : {"health": "good", "vm": "up", "detail": "up"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : b4359588
> local_conf_timestamp : 75583
> Host timestamp : 75567
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=75567 (Thu Jul 6 15:09:26 2017)
>   host-id=1
>   score=3400
>   vm_conf_refresh_time=75583 (Thu Jul 6 15:09:42 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineUp
>   stopped=False
>
>
> --== Host 2 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : tettnang.lab.eng.blr.redhat.com
> Host ID : 2
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 1800
> stopped : False
> Local maintenance : False
> crc32 : 7bfbbfd5
> local_conf_timestamp : 1440
> Host timestamp : 1423
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=1423 (Thu Jul 6 15:09:07 2017)
>   host-id=2
>   score=1800
>   vm_conf_refresh_time=1440 (Thu Jul 6 15:09:23 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
>
> --== Host 3 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : zod.lab.eng.blr.redhat.com
> Host ID : 3
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : 7caabb48
> local_conf_timestamp : 75597
> Host timestamp : 75581
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=75581 (Thu Jul 6 15:09:23 2017)
>   host-id=3
>   score=3400
>   vm_conf_refresh_time=75597 (Thu Jul 6 15:09:39 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
> Output of hosted-engine --vm-status after upgrade:
> ===================================================
>
> [root@yarrow ~]# hosted-engine --vm-status
>
>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : yarrow.lab.eng.blr.redhat.com
> Host ID : 1
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 0
> stopped : False
> Local maintenance : True
> crc32 : bc34659d
> local_conf_timestamp : 7624
> Host timestamp : 7608
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=7608 (Thu Jul 6 17:50:33 2017)
>   host-id=1
>   score=0
>   vm_conf_refresh_time=7624 (Thu Jul 6 17:50:48 2017)
>   conf_on_shared_storage=True
>   maintenance=True
>   state=LocalMaintenance
>   stopped=False
>
>
> --== Host 2 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : tettnang.lab.eng.blr.redhat.com
> Host ID : 2
> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
> Score : 1800
> stopped : False
> Local maintenance : False
> crc32 : 521f80d4
> local_conf_timestamp : 11121
> Host timestamp : 11105
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=11105 (Thu Jul 6 17:50:29 2017)
>   host-id=2
>   score=1800
>   vm_conf_refresh_time=11121 (Thu Jul 6 17:50:45 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineDown
>   stopped=False
>
>
> --== Host 3 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date : True
> Hostname : zod.lab.eng.blr.redhat.com
> Host ID : 3
> Engine status : {"health": "good", "vm": "up", "detail": "up"}
> Score : 3400
> stopped : False
> Local maintenance : False
> crc32 : 77b3a2d6
> local_conf_timestamp : 85262
> Host timestamp : 85246
> Extra metadata (valid at timestamp):
>   metadata_parse_version=1
>   metadata_feature_version=1
>   timestamp=85246 (Thu Jul 6 17:50:28 2017)
>   host-id=3
>   score=3400
>   vm_conf_refresh_time=85262 (Thu Jul 6 17:50:44 2017)
>   conf_on_shared_storage=True
>   maintenance=False
>   state=EngineUp
>   stopped=False
>
> cat /var/lib/ovirt-hosted-engine-ha/ha.conf
> local_maintenance=True

--- Additional comment from Yaniv Lavi on 2017-07-17 05:19:22 EDT ---

Can you check for a regression in the host activation flow? It is supposed to move the host out of local maintenance.

--- Additional comment from Artyom on 2017-07-19 07:48:04 EDT ---

So it is not a regression in the host activation flow. The problem is:
1) Move the host to maintenance via the engine (this activates the HE "LocalMaintenance" state).
2) Upgrade the host via the engine. After the upgrade the host moves straight to the Up state, so from the engine side the host is UP, but from the HE side the host still has the "LocalMaintenance" state, because no one ran the activate command on the engine side.

See also a bug with a similar problem - https://bugzilla.redhat.com/show_bug.cgi?id=1468875

--- Additional comment from Sandro Bonazzola on 2017-11-18 02:50:36 EST ---

Denis, is this going to land in 4.2.0? If not, please re-target.
Germano, sounds like my bz#1489982.
(In reply to Marina from comment #2)
> Germano, sounds like my bz#1489982.

Indeed. So your BZ was a dup of Bug #1469143, which wasn't closed when it was fixed, and then I cloned the original BZ downstream. Also, your BZ says this was fixed in 4.2, but the original BZ is targeted to 4.3, and I reproduced this on 4.1.8. Can it get any more confusing? ;)

Should we close them all, or do we want to get this fixed in 4.1.10? I think it should be fixed in 4.1.10 too: after a round of upgrades all HE hosts might be in maintenance mode, defeating HA, so it's quite serious. What do you think?
This is severe and should not be targeted so far in the future. The HE maintenance mode should be locked to the engine maintenance mode if the engine is up. Maintaining this during an upgrade is elementary. Retargeting.
Nikolai, we need to figure out if this is still broken and where. Can you please try reproducing it with 4.1.8 -> 4.1.9 upgrade? It might be RHEV-H specific too.
(In reply to Martin Sivák from comment #6)
> Nikolai, we need to figure out if this is still broken and where. Can you
> please try reproducing it with 4.1.8 -> 4.1.9 upgrade? It might be RHEV-H
> specific too.

It's an HC-specific issue. Kasturi Narra, please provide your input.
Jiri, have you seen such an issue during your latest set of upgrade tests?
Hey, why is it HC specific? I believe what happens here is that when the host comes out of engine-side maintenance due to an upgrade or reinstall, it should also cancel the HE local maintenance, that's all. Today the engine enables HE local maintenance once we put the host into maintenance in the RHV UI, but it never cancels the HE maintenance when the host is auto-activated back on the engine side. And this is the problem.
(In reply to Marina from comment #9)
> Hey, why is it HC specific? I believe what happens here is that when the
> host comes out of engine-side maintenance due to an upgrade or reinstall,
> it should also cancel the HE local maintenance, that's all. Today the engine
> enables HE local maintenance once we put the host into maintenance in the
> RHV UI, but it never cancels the HE maintenance when the host is
> auto-activated back on the engine side. And this is the problem.

So there is confirmation from your side that this is not HC specific. Regular RHEL/RHVH HE hosts will hit the same issue during the upgrade. Martin, please review comment #9.
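For reference, the flow described in comment #9 can be driven explicitly through the engine REST API: the deactivate action is what sets HE local maintenance, and the activate action is the step that is supposed to clear it. A rough sketch (the engine URL, credentials, and host ID below are placeholders, not values from this bug):

  ENGINE=https://engine.example.com/ovirt-engine/api   # placeholder
  AUTH='admin@internal:password'                       # placeholder

  # move the host to Maintenance (the engine also sets HE local maintenance)
  curl -k -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
       -d '<action/>' "$ENGINE/hosts/<host-id>/deactivate"

  # activate the host again (expected to clear HE local maintenance)
  curl -k -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
       -d '<action/>' "$ENGINE/hosts/<host-id>/activate"

The bug here is that the upgrade flow skips the explicit activate step, so the HE flag is never cleared.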
Nikolai, we asked for a test of this to see whether it is really happening and where. There is conflicting information with regard to RHEV-H and branches (4.1 vs 4.2). Since all we have now are opinions, I would like someone from QE to provide some hard data before we decide what to do with all the linked bugs.
Before upgrade
==============
# nodectl info
layers:
  rhvh-4.1-0.20180102.0:
    rhvh-4.1-0.20180102.0+1
bootloader:
  default: rhvh-4.1-0.20180102.0+1
  entries:
    rhvh-4.1-0.20180102.0+1:
      index: 0
      title: rhvh-4.1-0.20180102.0
      kernel: /boot/rhvh-4.1-0.20180102.0+1/vmlinuz-3.10.0-693.11.6.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma05/rhvh-4.1-0.20180102.0+1 rd.lvm.lv=rhvh_alma05/swap rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180102.0+1"
      initrd: /boot/rhvh-4.1-0.20180102.0+1/initramfs-3.10.0-693.11.6.el7.x86_64.img
      root: /dev/rhvh_alma05/rhvh-4.1-0.20180102.0+1
current_layer: rhvh-4.1-0.20180102.0+1

After upgrade
=============
# nodectl info
layers:
  rhvh-4.1-0.20180126.0:
    rhvh-4.1-0.20180126.0+1
  rhvh-4.1-0.20180102.0:
    rhvh-4.1-0.20180102.0+1
bootloader:
  default: rhvh-4.1-0.20180126.0+1
  entries:
    rhvh-4.1-0.20180102.0+1:
      index: 1
      title: rhvh-4.1-0.20180102.0
      kernel: /boot/rhvh-4.1-0.20180102.0+1/vmlinuz-3.10.0-693.11.6.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma06/swap rd.lvm.lv=rhvh_alma06/rhvh-4.1-0.20180102.0+1 rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180102.0+1"
      initrd: /boot/rhvh-4.1-0.20180102.0+1/initramfs-3.10.0-693.11.6.el7.x86_64.img
      root: /dev/rhvh_alma06/rhvh-4.1-0.20180102.0+1
    rhvh-4.1-0.20180126.0+1:
      index: 0
      title: rhvh-4.1-0.20180126.0
      kernel: /boot/rhvh-4.1-0.20180126.0+1/vmlinuz-3.10.0-693.17.1.el7.x86_64
      args: "ro crashkernel=auto rd.lvm.lv=rhvh_alma06/swap rd.lvm.lv=rhvh_alma06/rhvh-4.1-0.20180126.0+1 rhgb quiet LANG=en_US.UTF-8 img.bootid=rhvh-4.1-0.20180126.0+1"
      initrd: /boot/rhvh-4.1-0.20180126.0+1/initramfs-3.10.0-693.17.1.el7.x86_64.img
      root: /dev/rhvh_alma06/rhvh-4.1-0.20180126.0+1
current_layer: rhvh-4.1-0.20180126.0+1

1) Host UP
2) Host has a repository with new packages:
   Check for available updates on host alma06.qa.lab.tlv.redhat.com was completed successfully with message 'found updates for packages redhat-virtualization-host-image-update-4.1-20180126.0.el7_4'.
3) Click on the Upgrade link "A new version is available. Upgrade":

   Feb 19, 2018 1:53:21 PM Host alma06.qa.lab.tlv.redhat.com upgrade was completed successfully.
   Feb 19, 2018 1:53:20 PM Host alma06.qa.lab.tlv.redhat.com was restarted using SSH by the engine.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Termination.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Retrieving installation logs to: '/var/log/ovirt-engine/host-deploy/ovirt-host-mgmt-20180219065319-alma06.qa.lab.tlv.redhat.com-f75d262d-cc5f-4d2c-bf2d-4ddc0c24988c.log'.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Pre-termination.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Closing up.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Transaction commit.
   Feb 19, 2018 1:53:19 PM Installing Host alma06.qa.lab.tlv.redhat.com. Stage: Misc configuration.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Verify: 2/2: redhat-virtualization-host-image-update-placeholder.noarch 0:4.1-8.1.el7 - od.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Verify: 1/2: redhat-virtualization-host-image-update.noarch 0:4.1-20180126.0.el7_4 - u.
   Feb 19, 2018 1:53:18 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum erase: 2/2: redhat-virtualization-host-image-update-placeholder.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum obsoleting: 1/2: redhat-virtualization-host-image-update-4.1-20180126.0.el7_4.noarch.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Status: Running Transaction.
   Feb 19, 2018 1:45:28 PM Installing Host alma06.qa.lab.tlv.redhat.com. Yum Status: Running Test Transaction.

4) Host UP under the engine, but in the LocalMaintenance state under hosted-engine --vm-status:

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma06.qa.lab.tlv.redhat.com
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : True
crc32 : 69c202ba
local_conf_timestamp : 3758
Host timestamp : 3758
Extra metadata (valid at timestamp):
  metadata_parse_version=1
  metadata_feature_version=1
  timestamp=3758 (Mon Feb 19 15:07:31 2018)
  host-id=2
  score=0
  vm_conf_refresh_time=3758 (Mon Feb 19 15:07:31 2018)
  conf_on_shared_storage=True
  maintenance=True
  state=LocalMaintenance
  stopped=False
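A quick way to spot the mismatch shown in step 4 on any HE host (only commands and files that already appear in this bug) is to compare the HA agent's view with its persisted flag:

  # HA-side view: hostname, score, "Local maintenance" flag and agent state per host
  hosted-engine --vm-status | grep -E 'Hostname|Score|Local maintenance|state='

  # the agent's persisted flag on the host itself
  cat /var/lib/ovirt-hosted-engine-ha/ha.conf

A host that is Up in the webadmin UI but reports "Local maintenance : True" and score=0 here is affected.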
I wonder if this could be closed as DUP of #1489982
Can you please check what the states are when you upgrade a RHEV-H host?
1) You put the host into maintenance using the webadmin button.
2) You update the node.
3) The node reboots.
4) Does it stay in maintenance mode (in the engine) or does it move to Up automatically?
It stays in the Maintenance state.
*** This bug has been marked as a duplicate of bug 1489982 ***