Bug 1277646 - the agent should avoid trying to upgrade the host if it's not in maintenance mode
Status: CLOSED CURRENTRELEASE
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
1.3.1
Unspecified Unspecified
unspecified Severity high
: ovirt-3.6.1
: 1.3.3.5
Assigned To: Simone Tiraboschi
Artyom
integration
: Triaged
: 1277642 1277645
Depends On:
Blocks: ovirt-hosted-engine-ha-1.3.4.3 RHEV3.6Upgrade
 
Reported: 2015-11-03 12:45 EST by Simone Tiraboschi
Modified: 2016-01-13 09:38 EST
12 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In the 3.5 -> 3.6 upgrade flow the user has to set the host in maintenance mode before upgrading. Now we explicitly enforce this, providing a clear error if it is not.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-13 09:38:58 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments
Screenshot (194.39 KB, image/png)
2015-11-03 12:47 EST, Simone Tiraboschi
no flags Details
agent, broker and vdsmd logs (144.86 KB, application/x-xz)
2015-11-04 04:20 EST, Simone Tiraboschi
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 49024 master MERGED upgrade: avoid upgrading if not in maintenance from the engine Never
oVirt gerrit 49151 ovirt-hosted-engine-ha-1.3 MERGED upgrade: avoid upgrading if not in maintenance from the engine Never
oVirt gerrit 50283 master MERGED upgrade: recovery action if failed because not in maintenance Never
oVirt gerrit 50284 ovirt-hosted-engine-ha-1.3 MERGED upgrade: recovery action if failed because not in maintenance Never

Description Simone Tiraboschi 2015-11-03 12:45:56 EST
Description of problem:
Our wiki ( http://www.ovirt.org/Features/Self_Hosted_Engine_Maintenance_Flows ) says that 

'The two types of HA maintenance will be configured in different ways.
Local maintenance, which affects only the host on which it is enabled, will be tied into the existing VDS maintenance operation.'

But now, after explicitly putting the host into local maintenance, it still appears as up to the engine.
In order to really put the host into maintenance it's necessary to do it from the engine.

Please see the attached screenshot:
1. Hosted Engine HA: Local Maintenance Enabled
2. Status: UP

On the host:
[root@c71het20151029 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : c71het20151028.localdomain
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 9501
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=9501 (Tue Nov  3 18:22:14 2015)
	host-id=1
	score=2400
	maintenance=False
	state=EngineUp


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : c71het20151029.localdomain
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
Local maintenance                  : True
Host timestamp                     : 9499
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=9499 (Tue Nov  3 18:22:11 2015)
	host-id=2
	score=0
	maintenance=True
	state=LocalMaintenance
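The "Extra metadata" blocks above are plain key=value lines. As an illustration only (this is a hypothetical helper, not the actual agent parser), they can be read into a dict like this:

```python
def parse_extra_metadata(text):
    """Parse 'Extra metadata' key=value lines into a dict of strings."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # drop the human-readable annotation, e.g. "9499 (Tue Nov  3 ...)"
        meta[key.strip()] = value.split(" (")[0].strip()
    return meta
```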

[root@c71het20151029 ~]# vdsClient -s 0 getVdsStats | grep -A 4 haStats
	haStats = {'active': True,
	           'configured': True,
	           'globalMaintenance': False,
	           'localMaintenance': True,
	           'score': 0}


setHaMaintenanceMode verb on VDSM always fails.


[root@c71het20151029 ~]# vdsClient -s 0 setHaMaintenanceMode type=local enabled=false
Failed to set Hosted Engine HA policy
[root@c71het20151029 ~]# vdsClient -s 0 setHaMaintenanceMode type=local enabled=true
Failed to set Hosted Engine HA policy
[root@c71het20151029 ~]# vdsClient -s 0 setHaMaintenanceMode type=none enabled=true
Failed to set Hosted Engine HA policy
[root@c71het20151029 ~]# vdsClient -s 0 setHaMaintenanceMode type=global enabled=true
Failed to set Hosted Engine HA policy
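For reference, the verb invocations above all follow the same shape. A minimal wrapper sketch (hypothetical helper names; it assumes vdsClient signals the "Failed to set Hosted Engine HA policy" case with a non-zero exit status):

```python
import subprocess

def build_set_maintenance_cmd(mtype, enabled):
    """Build the vdsClient command line shown above."""
    if mtype not in ("local", "global", "none"):
        raise ValueError("unknown maintenance type: %r" % mtype)
    return ["vdsClient", "-s", "0", "setHaMaintenanceMode",
            "type=%s" % mtype, "enabled=%s" % str(bool(enabled)).lower()]

def set_ha_maintenance(mtype, enabled):
    """Run the verb; treats a non-zero exit status as failure
    (an assumption, matching the failing calls seen in this bug)."""
    return subprocess.call(build_set_maintenance_cmd(mtype, enabled)) == 0
```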



Version-Release number of selected component (if applicable):


How reproducible:
Fully reproducible

Steps to Reproduce:
1. Put a host in local maintenance and check what happens on the engine

Actual results:
ovirt-ha-agent says that the host is in local maintenance while the engine says that it's up.

Expected results:
Putting a host into local maintenance should also put it into maintenance mode for the engine.

Additional info:
It doesn't seem to be a matter of timing: neither explicitly refreshing nor waiting solves it.

Because the host is not really in maintenance, the engine keeps other (non-HE) storageServer connections up and keeps the host connected to its datacenter storagePool, so the hosted-engine upgrade process fails: VDSM refuses to connect to more than one storage pool (in 3.5 the hosted-engine storage domain was still connected to its bootstrap storage pool, and the host needs to connect to that pool to correctly upgrade).
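The single-pool constraint described here can be modeled with a toy sketch (illustrative only, not VDSM code): a host may be connected to at most one storage pool, so while the engine holds it in the datacenter pool, connecting to the 3.5 bootstrap pool fails.

```python
class StoragePoolError(Exception):
    pass

class Host:
    """Toy model of the VDSM constraint: at most one connected storage pool."""
    def __init__(self):
        self.connected_pool = None

    def connect_storage_pool(self, pool_id):
        # connecting to a second, different pool is refused
        if self.connected_pool is not None and self.connected_pool != pool_id:
            raise StoragePoolError(
                "already connected to pool %s" % self.connected_pool)
        self.connected_pool = pool_id

    def disconnect_storage_pool(self):
        self.connected_pool = None
```

Disconnecting from the datacenter pool first (i.e. real maintenance from the engine) is what makes the bootstrap-pool connection, and hence the upgrade, possible.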
Comment 1 Simone Tiraboschi 2015-11-03 12:47 EST
Created attachment 1089162 [details]
Screenshot
Comment 2 Red Hat Bugzilla Rules Engine 2015-11-04 04:07:02 EST
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Comment 3 Simone Tiraboschi 2015-11-04 04:20 EST
Created attachment 1089461 [details]
agent, broker and vdsmd logs

Reproduce with:
local maintenance -> no maintenance -> local maintenance

The host was always up for the engine
Comment 4 Simone Tiraboschi 2015-11-04 05:27:14 EST
Just an additional info,
I started the local maintenance mode with
 hosted-engine --set-maintenance --mode=local
directly on the host.
If I put the host into maintenance mode from the webUI the engine obviously recognizes it.
Comment 5 Roy Golan 2015-11-05 03:27:13 EST
Conceptually, HA maintenance means this host isn't part of the HA cluster, while it is still functional as a regular host in its virtualization cluster. Those are not the same. 

<< so hosted-engine upgrade process fails cause VDSM refuses to connect to more 

Since 3.6 we have a single DC; is that still a must?
Comment 6 Roy Golan 2015-11-05 03:28:54 EST
*** Bug 1277642 has been marked as a duplicate of this bug. ***
Comment 7 Roy Golan 2015-11-05 03:29:25 EST
*** Bug 1277645 has been marked as a duplicate of this bug. ***
Comment 8 Simone Tiraboschi 2015-11-05 04:03:10 EST
(In reply to Roy Golan from comment #5)
> << so hosted-engine upgrade process fails cause VDSM refuses to connect to
> more 
> 
> Since 3.6 we have a single DC is that still a must ?

The issue is in the 3.5 to 3.6 upgrade process, where the bootstrap storagePool created in 3.5 or earlier is still there.
The host has to become the SPM of that storagePool to perform some direct operations on the hosted-engine storageDomain, but it cannot if the engine keeps requiring it to be connected to its datacenter storagePool (VDSM refuses to connect to more than one storagePool).

The workaround is to have the user put the host into maintenance mode from the engine.
We probably have some issues if the user has just one host.
Comment 9 Doron Fediuck 2015-11-05 10:48:29 EST
(In reply to Simone Tiraboschi from comment #8)
> (In reply to Roy Golan from comment #5)
> > << so hosted-engine upgrade process fails cause VDSM refuses to connect to
> > more 
> > 
> > Since 3.6 we have a single DC is that still a must ?
> 
> The issue is on 3.5 to 3.6 upgrade process where the bootstrap storagePool
> created in 3.5 or before is still there.
> The host has to become the SPM of that storagePool to perform some direct
> operations on the hosted-engine storageDomain but it cannot if the engine
> keeps requiring to be connected to its datacenter storagePool (VDSM refuses
> to connect to more than one storagePool).
> 
> The workaround is to have the user putting the host into maintenance mode
> from the engine.
> Probably we have some issues if the user have just one host.

Simone Roy is correct. HE maintenance is not the same as host maintenance.
Based on what you wrote we can assume that any host that is being upgraded
should first move to maintenance in the engine regardless of hosted engine.
Is there a different scenario you found?
Comment 10 Simone Tiraboschi 2015-11-05 12:40:19 EST
(In reply to Doron Fediuck from comment #9)
> Simone Roy is correct. HE maintenance is not the same as host maintenance.
> Based on what you wrote we can assume that any host that is being upgraded
> should first move to maintenance in the engine regardless of hosted engine.
> Is there a different scenario you found?

The flow is a bit different in the user's eyes but all the code is still valid.
User has to:
1. put the hosted-engine into global maintenance mode to avoid unwanted migrations
2. select one of the hosted-engine hosts where the engine VM is not running (we still need to understand what happens if the user has just one hosted-engine host)
3. put it into maintenance mode from the engine
4. add the 3.6 repo on that host
5. yum update
6. manually restart sanlock and wdmd because they can fail during the upgrade: https://bugzilla.redhat.com/show_bug.cgi?id=1278369
7. manually reconfigure vdsm because it will not restart otherwise: https://bugzilla.redhat.com/show_bug.cgi?id=1276736
8. manually restart ovirt-ha-broker and ovirt-ha-agent
9. wait for the agent to do the job. If everything is correct the host will get up to 3400 points
10. exit from global maintenance mode to have the VM migrate here, or manually migrate it to the first 3.6 host
11. repeat 1-9 for the other hosts

0 or 12: upgrade the engine to 3.6; it should automatically start importing the hosted-engine storage domain when the cluster compatibility mode reaches 3.6.
Currently this step is still failing due to the sanlock issue.

I'm a bit afraid because it is not that simple nor that intuitive.
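The enforcement this bug adds, refusing the upgrade unless the engine has the host in maintenance, boils down to a simple gate in the agent. A sketch with hypothetical names (the real code lives in the gerrit patches listed above):

```python
class UpgradeBlocked(Exception):
    pass

def check_upgrade_allowed(host_in_engine_maintenance):
    """Gate matching the fix in this bug: the agent refuses to upgrade
    unless the host was put into maintenance mode from the engine."""
    if not host_in_engine_maintenance:
        # same wording as the error message introduced by the fix
        raise UpgradeBlocked(
            "Unable to upgrade while not in maintenance mode: please put "
            "this host into maintenance mode from the engine")
```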
Comment 11 Roy Golan 2015-11-23 09:43:52 EST
(In reply to Simone Tiraboschi from comment #10)
> (In reply to Doron Fediuck from comment #9)
> > Simone Roy is correct. HE maintenance is not the same as host maintenance.
> > Based on what you wrote we can assume that any host that is being upgraded
> > should first move to maintenance in the engine regardless of hosted engine.
> > Is there a different scenario you found?
> 
> The flow is a bit different on user eyes but all the code is still valid.
> User has to:
> 1. put the hosted-engine into global maintenance mode to avoid unwanted
> migrations 
> 2. select one of hosted-engine hosts where the engine VM is not running (we
> still need to understand what happens if the user has just one hosted-engine
> host)
> 2. put it into maintenance mode from the engine
> 3. add 3.6 repo on that host
> 4. yum update
> 5. manually restart sanlock and wdmd cause they can fail during the upgrade:
> https://bugzilla.redhat.com/show_bug.cgi?id=1278369
> 5. manually reconfigure vdsm () cause it will not restart otherwise:
> https://bugzilla.redhat.com/show_bug.cgi?id=1276736
> 7. manually restart ovirt-ha-agent and ovirt-ha-agent
> 8. wait for the agent to do the job. If everything is correct the host will
> get up to 3400 points
> 9. exits from global-maintenance mode to have the VM migrating here or
> manually migrate it on the first 3.6 host
> 10. repeat 1-8 for other hosts
> 
> 0 or 12: upgrade the engine to 3.6 and it should automatically tart
> importing hosted-engine storage domain when the cluster compatibility mode
> reaches 3.6.

and the storage pool to 3.5 at least (to support import of the domain) and Active (i.e. it has master data)

> Currently this step is still failing due to sanlock issue.
> 

Solved; the vdsm patch is pending approval: https://gerrit.ovirt.org/#/c/48217/

> I'm a bit afraid cause is not that simple nor that intuitive.
Comment 12 Red Hat Bugzilla Rules Engine 2015-11-23 11:38:51 EST
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Comment 13 Red Hat Bugzilla Rules Engine 2015-11-24 06:49:22 EST
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.
Comment 14 Artyom 2015-12-07 09:00:31 EST
Hi Simone, can you please provide the best way to verify this bug?
Thanks
Comment 15 Simone Tiraboschi 2015-12-10 04:29:41 EST
(In reply to Artyom from comment #14)
> Hi Simone, can you please provide the best way to verify this bug?
> Thanks

You have to deploy hosted-engine from oVirt 3.5 on a couple of hosts; add at least one regular storage domain and start at least one additional VM.
Then select the host where the engine VM is not running (to keep the engine up); put it into local maintenance from the hosted-engine CLI.
Add the 3.6 repo and update what is available. At the end restart vdsm, ovirt-ha-broker and ovirt-ha-agent.

Now, after the patch, in this case the ha-agent will not upgrade the host because it is not in maintenance from the engine's point of view; you should see 'Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine' in the ha-agent logs.

Now put the same host into maintenance from the engine, and after a few minutes you should see: 'Successfully upgraded' in ha-agent logs.
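For the verification itself, scanning the ha-agent log for the two messages quoted above could look like this (illustrative helper, not part of the test suite):

```python
def upgrade_outcome(agent_log):
    """Classify the upgrade outcome from ovirt-ha-agent log text,
    based on the two messages quoted in this bug."""
    if "Successfully upgraded" in agent_log:
        return "upgraded"
    if "Unable to upgrade while not in maintenance mode" in agent_log:
        return "blocked"
    return None
```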
Comment 16 Red Hat Bugzilla Rules Engine 2015-12-18 11:11:30 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 17 Roy Golan 2015-12-20 10:49:28 EST
Simone, Sandro, see the rules engine notification. I really didn't see this bug in the changelog of the latest packages, but I might be missing something.
Comment 18 Red Hat Bugzilla Rules Engine 2015-12-21 09:23:39 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 19 Artyom 2015-12-23 08:50:40 EST
Verified on ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
If the host is not under maintenance you can see:
"Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready"
Comment 20 Sandro Bonazzola 2016-01-13 09:38:58 EST
oVirt 3.6.1 has been released, closing current release
