Bug 1649502

Summary: No proper events or error message notified when the host upgrade fails
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rhhi
Version: rhhiv-1.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Status: CLOSED ERRATA
Reporter: SATHEESARAN <sasundar>
Assignee: Sahina Bose <sabose>
QA Contact: bipin <bshetty>
CC: bshetty, godas, rhs-bugs, sabose
Keywords: ZStream
Target Milestone: ---
Target Release: RHHI-V 1.6.z Async Update
Doc Type: Bug Fix
Doc Text:
If pending heals exist or gluster quorum cannot be maintained when you attempt to upgrade an online host, the host upgrade fails so that data or quorum are not lost. Previously, failure occurred with a message that indicated the host could not be moved to maintenance, but did not specify the reason. Logs are now more informative, for example: Cannot switch the following Host(s) to Maintenance mode: server1.example.com. Gluster quorum will be lost for the following Volumes: myvolume.
Clones: 1649503 1683161 1721111
Bug Depends On: 1679399, 1683161, 1721111
Last Closed: 2019-10-03 12:23:57 UTC
Type: Bug
Attachments:
UI_Screenshot

Description SATHEESARAN 2018-11-13 18:20:36 UTC
Description of problem:
-----------------------

When an upgrade of a RHVH host is initiated from the RHV Manager UI, the host is first moved into maintenance mode, the redhat-virtualization-host-image-update package is updated, and then the host is rebooted.

During this upgrade/update procedure, if moving the host into maintenance fails, no proper events or messages are shown to the user.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHHI-V 1.5 (RHV 4.2.7 & RHGS 3.4.1)

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Stop one of the bricks of the volume on host1
2. Try to upgrade host2 from RHV Manager UI

Actual results:
---------------
The upgrade initiated from the RHV Manager UI fails silently, without raising any errors or events

Expected results:
-----------------
The upgrade should fail with a meaningful error or event, so that the user is aware of the reason behind the failure.

Additional info:
----------------
When moving the host into maintenance directly would stop the gluster service, proper error messages are thrown. The same should be implemented for the upgrade/update procedure initiated from the RHV Manager UI.

Comment 3 bipin 2019-02-26 10:02:59 UTC
Since verification failed, reassigning the bug.

Steps executed:
==============
1. Killed a brick (data) on Host1
2. Clicked Host2 --> Installation --> Upgrade

After executing the above, the upgrade failed without any errors or pop-up alerts. I would expect an error such as "quorum would be lost if the upgrade proceeds".
It just failed with a generic error:
2019-02-26 15:18:05,696+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [d4185969-4c15-435b-b9d6-220e84def4c8] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-02-26 15:18:05,719+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [d4185969-4c15-435b-b9d6-220e84def4c8] EVENT_ID: VDS_MAINTENANCE_FAILED(17), Failed to switch Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com to Maintenance mode.
2019-02-26 15:18:06,780+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [d4185969-4c15-435b-b9d6-220e84def4c8] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).

Comment 4 bipin 2019-02-26 10:03:35 UTC
Created attachment 1538724 [details]
UI_Screenshot

Comment 6 Sahina Bose 2019-03-07 06:15:53 UTC
The original bug was about no error message being returned on upgrade failure: the only event was "Upgrade started", with no indication of failure.
I think there's an event indicating failure to move to maintenance now?
Are you asking for a specific error message on why moving to maintenance failed?

Comment 7 bipin 2019-03-07 06:32:33 UTC
Sahina,

So as mentioned in the steps executed, we can clearly see that quorum will be lost if we upgrade Host2 (since a brick on Host1 is already down). But in the UI, when we try upgrading, I don't see any warning such as "the quorum will be lost", and the upgrade also fails without any specific error event from which the user could infer the cause. Let me know if you think otherwise.
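
The quorum concern described here can be sketched in Python. This is only an illustration of the validation logic, not the actual ovirt-engine code; the data structure and function name are hypothetical.

```python
def quorum_lost_volumes(volumes, host_to_maintain):
    """Return names of volumes whose gluster client quorum would be lost
    if host_to_maintain goes into maintenance.

    volumes: {volume_name: [(host, brick_is_up), ...]} -- a hypothetical
    representation of each volume's bricks and their current state.
    """
    at_risk = []
    for name, bricks in volumes.items():
        # Bricks still up after the host is taken into maintenance.
        up_after = sum(1 for host, is_up in bricks
                       if is_up and host != host_to_maintain)
        # Client quorum requires a strict majority of bricks to be up.
        if up_after <= len(bricks) // 2:
            at_risk.append(name)
    return at_risk
```

With a replica-3 volume whose brick on host1 is already down, taking host2 into maintenance leaves only one of three bricks up, so that volume is reported as at risk.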

Comment 8 Sahina Bose 2019-03-15 06:11:32 UTC
(In reply to bipin from comment #7)
> Sahina,
> 
> So as mentioned in the steps executed, we can clearly see that the quorum
> will be lost if we upgrade Host2 (since already brick in Host 1 is down).
> But in UI when we try upgrading i don't see any warning  while upgrading
> like say "The quorum will be lost" so something like that sort, and also it
> fails without any specific error event through which user can assume. Let me
> know if you think otherwise

Currently there's an error message provided when a host cannot be moved to maintenance; the specific error message regarding heals etc. is thrown as part of validations and not logged in the audit log. Providing a specific error message would require the commands to be changed, so this cannot be targeted for 1.6.

Comment 9 SATHEESARAN 2019-03-21 08:44:23 UTC
(In reply to Sahina Bose from comment #8)
> (In reply to bipin from comment #7)
> > Sahina,
> > 
> > So as mentioned in the steps executed, we can clearly see that the quorum
> > will be lost if we upgrade Host2 (since already brick in Host 1 is down).
> > But in UI when we try upgrading i don't see any warning  while upgrading
> > like say "The quorum will be lost" so something like that sort, and also it
> > fails without any specific error event through which user can assume. Let me
> > know if you think otherwise
> 
> Currently there's an error message provide when a host cannot be moved to
> maintenance, the specific error message regarding heal etc is thrown as part
> of validations and not logged in audit log. Providing specific error message
> will required the commands to be changed, so cannot be targeted for 1.6

Yes, that makes sense. But the user should be informed that if, during upgrade/update of
RHVH hosts, the host doesn't move into the upgrade phase, one possible cause is that
moving that host into maintenance would cause cluster quorum to be lost, or that
self-heal is in progress.

The user has to make sure all the bricks in the volume are shown as up in the RHV Manager UI
and that there are no pending heal entries on the bricks.
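
Checking for pending heal entries can be done by parsing the output of `gluster volume heal <volname> info`. A minimal sketch, assuming the "Number of entries: N" line format printed by RHGS 3.x; the function name is ours:

```python
import re

def pending_heal_entries(heal_info_output):
    """Sum the 'Number of entries: N' counters from the output of
    `gluster volume heal <volname> info`. A non-zero total means the
    host should not be taken into maintenance yet."""
    return sum(int(m.group(1))
               for m in re.finditer(r"Number of entries: (\d+)",
                                    heal_info_output))
```

Note that bricks that are down print "Number of entries: -", which the digit-only pattern skips.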

With these facts, I am marking this bug as a known_issue; it can be deferred out of RHHI-V 1.6 scope.

Comment 12 Sahina Bose 2019-04-01 05:15:34 UTC
I've modified the text. Please check.

Comment 16 bipin 2019-07-31 10:25:26 UTC
Pasting the output from the base bug:

Tested with ovirt-engine-4.3.4.3-0.1.el7.noarch and the fix works, so moving the bug to verified state.


Steps:
=====
1. Deploy RHHI-V
2. Bring a brick down on one of the hosts, say host1
3. Now try to upgrade host2; it fails


Logs:
====
2019-06-07 12:38:29,371+05 INFO  [org.ovirt.engine.core.bll.hostdeploy.UpgradeHostCommand] (default task-126) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Running command: UpgradeHostCommand internal: false. Entities 
affected :  ID: d5cb1684-e96d-49ab-a095-0234f4c1a017 Type: VDSAction group EDIT_HOST_CONFIGURATION with role type ADMIN
2019-06-07 12:38:29,395+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-126) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com 
upgrade was started (User: admin@internal-authz).
2019-06-07 12:38:29,460+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-6) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch the following Host(s) to Maintenance mode: rhsqa-grafton8-nic2.lab.eng.blr.redhat.com.
Gluster quorum will be lost for the following Volumes: vmstore.
2019-06-07 12:38:29,460+05 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-6) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET,$VolumesList vmstore,$HostsList rhsqa-grafton8-nic2.lab.eng.blr.redhat.com
2019-06-07 12:38:29,948+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-06-07 12:38:31,104+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [682b85e9-8120-4f7e-bca8-4e80a0eb7843] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).
2019-06-07 12:38:31,505+05 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler2) [98d833e] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{hostId='d5cb1684-e96d-49ab-a095-0234f4c1a017'}), log id: 316ed0f0


2019-06-07 13:08:42,481+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-132) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com upgrade was started (User: admin@internal-authz).
2019-06-07 13:08:42,521+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-7) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch the following Host(s) to Maintenance mode: rhsqa-grafton8-nic2.lab.eng.blr.redhat.com.
Gluster quorum will be lost for the following Volumes: data.
2019-06-07 13:08:42,521+05 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-7) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_GLUSTER_QUORUM_CANNOT_BE_MET,$VolumesList data,$HostsList rhsqa-grafton8-nic2.lab.eng.blr.redhat.com
2019-06-07 13:08:43,428+05 ERROR [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] Host 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com' failed to move to maintenance mode. Upgrade process is terminated.
2019-06-07 13:08:43,435+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: VDS_MAINTENANCE_FAILED(17), Failed to switch Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com to Maintenance mode.
2019-06-07 13:08:44,451+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [e496c065-0c69-4f16-ba2c-3219fd5a8cc6] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host rhsqa-grafton8-nic2.lab.eng.blr.redhat.com (User: admin@internal-authz).

Comment 18 errata-xmlrpc 2019-10-03 12:23:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2963