Bug 1683161

Summary: No proper events or error message notified when the host upgrade fails
Product: [oVirt] ovirt-engine
Component: Frontend.WebAdmin
Version: 4.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: SATHEESARAN <sasundar>
Assignee: Dana <delfassy>
QA Contact: bipin <bshetty>
CC: bshetty, bugs, delfassy, godas, lleistne, mperina, rhs-bugs, sabose, sankarshan, sasundar
Keywords: Reopened, Tracking
Target Milestone: ovirt-4.3.4
Target Release: 4.3.4
Flags: pm-rhel: ovirt-4.3+, lleistne: testing_ack+
Fixed In Version: ovirt-engine-4.3.4
Clone Of: 1649502
Bug Blocks: 1649502
oVirt Team: Infra
Last Closed: 2019-06-11 06:25:42 UTC

Description SATHEESARAN 2019-02-26 11:18:03 UTC
Description of problem:
-----------------------

When an upgrade of a RHVH host is initiated from the RHV Manager UI, the host is first moved into Maintenance, the redhat-virtualization-host-image-update package is updated, and then the host is rebooted.

As part of this upgrade/update procedure, if moving the host into Maintenance fails, no proper events or messages are shown to the user.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHHI-V 1.5 ( RHV 4.2.7 & RHGS 3.4.1 )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Stop one of the bricks of the volume on host1
2. Try to upgrade host2 from the RHV Manager UI
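
For reference, the same reproduction can be scripted against the engine REST API instead of the UI. Below is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and host name are placeholder assumptions:

# Sketch: trigger the host upgrade via the oVirt Python SDK (ovirtsdk4)
# instead of the UI. URL, credentials and host name are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=host2')[0]

# Equivalent of the UI upgrade action; the engine first tries to move
# the host to Maintenance, which is the step that fails in this scenario.
hosts_service.host_service(host.id).upgrade()
connection.close()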

Actual results:
---------------
The upgrade initiated from the RHV Manager UI silently fails without raising any errors or events.

Expected results:
-----------------
The upgrade should fail with a meaningful error or event, so that the user is aware of the reason behind the failure.

Additional info:
----------------
When the host is moved into Maintenance directly (after stopping the gluster service), proper error messages are thrown. The same should be implemented for the upgrade/update procedure initiated from the RHV Manager UI.

Comment 1 Martin Perina 2019-02-28 11:53:55 UTC
The issue was most probably fixed by BZ1631215, which was released as part of RHV 4.2.8, so could you please retest with an updated version?

Comment 2 bipin 2019-03-07 06:50:07 UTC
Hi Martin,

I tested with rhvh-4.3.0.5-0.20190305 and still see the issue persisting.

Pasting the comments from the cloned bug:

As described in the reproduction steps, quorum will clearly be lost if we upgrade Host2, since a brick on Host1 is already down. But when we try the upgrade in the UI, there is no warning along the lines of "The quorum will be lost", and the upgrade fails without any specific error event from which the user could infer the cause.

Comment 3 Martin Perina 2019-03-07 09:33:54 UTC
(In reply to bipin from comment #2)
> Hi Martin,
> 
> I tested with rhvh-4.3.0.5-0.20190305 and still see the issue persisting.
> 
> Pasting the comments from the cloned bug:
> 
> As described in the reproduction steps, quorum will clearly be lost if we
> upgrade Host2, since a brick on Host1 is already down. But when we try the
> upgrade in the UI, there is no warning along the lines of "The quorum will
> be lost", and the upgrade fails without any specific error event from which
> the user could infer the cause.

We don't have such warnings, and we can't have them: those warnings/errors are part of moving the host to Maintenance, where the quorum/healing status is checked [1]. But moving a host to Maintenance is an asynchronous operation that can run for a long time, so users should just start the upgrade; if moving the host to Maintenance fails, they should then be able to find the reason for the failure in the Events (audit log). If the above operation were possible (moving a second host to Maintenance, which would make the gluster storage read-only), then there would be a bug in the gluster part of the engine, which should not allow such an operation.

If you want to upgrade multiple hosts at once, I recommend using the ovirt.cluster-upgrade Ansible role [2], which upgrades hosts serially (the second host is not moved to Maintenance until the first one is Up again); a rough sketch of this serial pattern follows the links below.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceNumberOfVdssCommand.java#L455
[2] https://github.com/ovirt/ovirt-ansible-cluster-upgrade
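
For illustration, the serial pattern that the role automates looks roughly like the following sketch in the oVirt Python SDK (ovirtsdk4). This is not the role's actual implementation; the engine URL, credentials, cluster name, and timeout are placeholder assumptions:

# Sketch of the serial upgrade pattern automated by the
# ovirt.cluster-upgrade role (placeholder URL, credentials, cluster).
import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
hosts_service = connection.system_service().hosts_service()

def wait_for_status(host_id, status, timeout=1800):
    # Poll until the host reaches the expected status or time runs out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if hosts_service.host_service(host_id).get().status == status:
            return
        time.sleep(10)
    raise TimeoutError('host {} never reached {}'.format(host_id, status))

# Upgrade the hosts one at a time: the next host is only touched once
# the previous one is back Up, so gluster quorum is never at risk.
for host in hosts_service.list(search='cluster=Default'):
    host_service = hosts_service.host_service(host.id)
    host_service.deactivate()                 # move to Maintenance
    wait_for_status(host.id, types.HostStatus.MAINTENANCE)
    host_service.upgrade()                    # update packages and reboot
    wait_for_status(host.id, types.HostStatus.UP)

connection.close()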

Comment 4 Sahina Bose 2019-03-07 10:13:19 UTC
(In reply to Martin Perina from comment #3)
> (In reply to bipin from comment #2)
> > Hi Martin,
> > 
> > I tested in rhvh-4.3.0.5-0.20190305, still see the issue persisting.
> > 
> > Pasting the comments from cloned bug:
> > 
> > So as mentioned in the steps executed, we can clearly see that the quorum
> > will be lost if we upgrade Host2 (since already brick in Host 1 is down).
> > But in UI when we try upgrading i don't see any warning  while upgrading
> > like say "The quorum will be lost" so something like that sort, and also it
> > fails without any specific error event through which user can assume.
> 
> We don't have such warnings and we even can't have. Those warnings/errors
> are part of moving host to maintenance, where quorum/healing status should
> be checked [1]. But moving host to maintenance is asynchronous operation
> which can be executed for a long time, so users should just start the
> upgrade and if moving host to maintenance fails, then they should be able to
> find reasons of failure in the Events (audit log). If above operation is
> possible (moving 2 host to maintenance which would make gluster storage read
> only), then there is a bug in gluster part of engine, which should not allow
> such operation.
> 
> If you want to upgrade multiple hosts at once then I recommend to use
> ovirt.cluster-upgrade Ansible role [2], which performs upgrade of hosts
> serially (we are not going to move 2nd host to Maintenance untill the 1st
> one is already Up).
> 
> [1]
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/
> bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceNumberOfVdssCommand.
> java#L455
> [2] https://github.com/ovirt/ovirt-ansible-cluster-upgrade

From the logs in Bug 1649502, in engine.log

2019-02-26 15:18:03,333+05 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-3) [d4185969-4c15-435b-b9d6-220e84def4c8] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_UNSYNCED_ENTRIES_PRESENT_IN_GLUSTER_BRICKS,$BricksList [rhsqa-grafton8-nic2.lab.eng.blr.redhat.com:/gluster_bricks/data/data],$HostsList rhsqa-grafton8-nic2.lab.eng.blr.redhat.com
2019-02-26 15:18:04,201+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-805) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com upgrade was started (User: admin@internal-authz).
2019-02-26 15:18:05,696+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [d4185969-4c15-435b-b9d6-220e84def4c8] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-02-26 15:18:05,719+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [d4185969-4c15-435b-b9d6-220e84def4c8] EVENT_ID: VDS_MAINTENANCE_FAILED(17), Failed to switch Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com to Maintenance mode.
2019-02-26 15:18:06,780+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [d4185969-4c15-435b-b9d6-220e84def4c8] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).

The issue here seems to be that validation failures are not logged in the audit log?
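
One way to check what actually reaches the audit log after a failed upgrade is to list the most recent events through the SDK. A minimal sketch with placeholder connection details; the event codes mentioned in the comments are the ones visible in the engine.log above:

# List recent audit-log events (placeholder URL and credentials).
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
events_service = connection.system_service().events_service()

# Codes seen in engine.log: 840 = HOST_UPGRADE_STARTED,
# 841 = HOST_UPGRADE_FAILED, 17 = VDS_MAINTENANCE_FAILED.
for event in events_service.list(max=20):
    print(event.time, event.code, event.description)

# Before the fix, only the generic 17/841 events show up here; the
# validation reason (unsynced gluster entries / lost quorum) stays in
# engine.log only and never reaches the audit log.
connection.close()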

Comment 5 Martin Perina 2019-03-07 13:24:01 UTC
Yes, that makes sense; we will need to tie the UpgradeHost and MaintenanceHost commands together much more tightly.

Comment 7 Dana 2019-03-21 09:08:14 UTC
I reproduced the issue; both failures appear in the Events with the following errors:

Failed to upgrade Host <hostname> (User: <username>).

Failed to switch Host <hostname> to Maintenance mode.

Comment 8 Dana 2019-03-21 09:09:40 UTC

*** This bug has been marked as a duplicate of bug 1679399 ***

Comment 10 bipin 2019-06-07 07:41:00 UTC
Tested with ovirt-engine-4.3.4.3-0.1.el7.noarch and the fix works, so moving the bug to Verified.


Steps:
=====
1. Deploy RHHI-V
2. Bring a brick down on one of the hosts, say host1
3. Now try to upgrade host2; it fails, and the failure reason now appears in the events (see the logs below)


Logs:
====
2019-06-07 12:38:29,371+05 INFO  [org.ovirt.engine.core.bll.hostdeploy.UpgradeHostCommand] (default task-126) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Running command: UpgradeHostCommand internal: false. Entities 
affected :  ID: d5cb1684-e96d-49ab-a095-0234f4c1a017 Type: VDSAction group EDIT_HOST_CONFIGURATION with role type ADMIN
2019-06-07 12:38:29,395+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-126) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com 
upgrade was started (User: admin@internal-authz).
2019-06-07 12:38:29,460+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-6) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch the following Host(s) to Maintenance mode: rhsqa-grafton8-nic2.lab.eng.blr.redhat.com.
Gluster quorum will be lost for the following Volumes: vmstore.
2019-06-07 12:38:29,460+05 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-6) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET,$VolumesList vmstore,$HostsList rhsqa-grafton8-nic2.lab.eng.blr.redhat.com
2019-06-07 12:38:29,948+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-06-07 12:38:31,104+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).
2019-06-07 12:38:31,505+05 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler2) [98d833e] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{hostId='d5cb1684-e96d-49ab-a095-0234f4c1a017'}), log id: 316ed0f0


2019-06-07 13:08:42,481+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-132) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com upgrade was started (User: admin@internal-authz).
2019-06-07 13:08:42,521+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-7) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch the following Host(s) to Maintenance mode: rhsqa-grafton8-nic2.lab.eng.blr.redhat.com.
Gluster quorum will be lost for the following Volumes: data.
2019-06-07 13:08:42,521+05 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-7) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET,$VolumesList data,$HostsList rhsqa-grafton8-nic2.lab.eng.blr.redhat.com
2019-06-07 13:08:43,428+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-06-07 13:08:43,435+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: VDS_MAINTENANCE_FAILED(17), Failed to switch Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com to Maintenance mode.
2019-06-07 13:08:44,451+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).

Comment 11 Sandro Bonazzola 2019-06-11 06:25:42 UTC
This bugzilla is included in the oVirt 4.3.4 release, published on June 11th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 12 bipin 2019-06-19 11:04:33 UTC
*** Bug 1721111 has been marked as a duplicate of this bug. ***