Bug 1281539

Summary: [upgrade] possible race condition upgrading different hosts
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Simone Tiraboschi <stirabos>
Component: Agent
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: medium
Priority: medium
Version: 1.3.0
CC: bmcclain, bugs, gklein, mavital, msivak, rgolan, sbonazzo, ylavi
Target Milestone: ovirt-3.6.1
Target Release: 1.3.3
Keywords: Triaged
Flags: rule-engine: ovirt-3.6.z+
       bmcclain: planning_ack+
       sbonazzo: devel_ack+
       mavital: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard: integration
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-12-16 12:23:04 UTC
Bug Depends On: 1269768    
Bug Blocks: 1284954, 1285700    

Description Simone Tiraboschi 2015-11-12 17:28:35 UTC
Description of problem:
In oVirt 3.5 the hosted-engine storage domain was attached to a bootstrap storage pool that was not actually in use.
In 3.6 (and in a 3.5 to 3.6 upgrade) we detach the storage domain from that pool so that the engine can import it; once the import is done, the engine attaches the storage domain to the host's datacenter storage pool.

The current agent code simply assumes that if the host still has to be upgraded to 3.6 (its hosted-engine conf file still has the previous structure) and the storage domain is still attached to a storage pool, then that pool is the 3.5 bootstrap one.

But this is not always true: if our host was left in maintenance mode and the engine has already completed the hosted-engine storage domain import, then when our host runs its check it finds the storage domain still attached to a storage pool and tries to disconnect the hosted-engine storage domain from it.
The disconnect fails because that storage pool doesn't exist anymore (the first successfully upgraded host already removed it), and this failure is enough to prevent the agent from starting.
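
A guard on the pool identity avoids the wrong assumption. Below is a minimal, self-contained sketch of such a check in Python; every name in it is illustrative (this is not the actual ovirt-hosted-engine-ha agent API), and the UUIDs are example values:

# Illustrative sketch only: not the real ovirt-hosted-engine-ha agent code.

# Pool UUID recorded when hosted-engine was deployed on 3.5 (example value,
# matching the one in the log below).
BOOTSTRAP_POOL_UUID = '390f5141-20bc-4f07-9d14-758356cb068f'


def should_detach(attached_pool_uuid):
    """Return True only when the hosted-engine storage domain is still
    attached to the 3.5 bootstrap pool and this host must detach it."""
    if attached_pool_uuid is None:
        # Already detached (or never attached): nothing to do.
        return False
    if attached_pool_uuid != BOOTSTRAP_POOL_UUID:
        # Attached to the engine's datacenter pool: another host already
        # completed the import, so skip the disconnect step entirely.
        return False
    return True


# The engine already re-attached the SD to its datacenter pool (example
# UUID), so the agent must not try to disconnect the removed bootstrap pool.
assert not should_detach('5849b030-626e-47cb-ad90-3ce782d831b3')
assert should_detach(BOOTSTRAP_POOL_UUID)
assert not should_detach(None)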

Version-Release number of selected component (if applicable):
1.3.1

How reproducible:
100%

Steps to Reproduce:
1. deploy hosted-engine from 3.5
2. put one host in maintenance, upgrade everything else (hosts and engine VM) to 3.6 
3. let the engine auto-import the hosted-engine storage domain
4. upgrade the last host to 3.6 and activate it

Actual results:
The ha-agent tries to connect to the 3.5 bootstrap storage pool in order to detach the hosted-engine storage domain, but at that point the 3.5 bootstrap storage pool doesn't exist anymore, so the attempt fails:
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::839::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::657::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2015-11-12 11:49:54,728::upgrade::125::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2015-11-12 11:49:54,818::upgrade::165::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:30ea2f43-9d1c-44ee-ac07-80b0639bb6f1, volUUID:2b902c7d-5424-4a78-b44c-278e2aed5bd0
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::595::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::ERROR::2015-11-12 11:49:54,836::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('390f5141-20bc-4f07-9d14-758356cb068f',)' - trying to restart agent


Expected results:
The agent recognizes that the storage pool the hosted-engine storage domain is attached to is not the 3.5 bootstrap one, and proceeds to the next step of the upgrade process.
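
A complementary way to make this step robust is to treat a missing pool as "already detached". A hedged sketch, again with illustrative names only; the real agent would inspect VDSM's structured error codes rather than matching message text:

# Illustrative sketch only: not the real agent or VDSM error handling.

class StoragePoolError(Exception):
    pass


def vdsm_disconnect_pool(pool_uuid):
    # Stand-in for the real VDSM call; here it always reports a missing
    # pool, which is exactly the situation this bug describes.
    raise StoragePoolError(
        "Unknown pool id, pool not connected: ('%s',)" % pool_uuid)


def detach_if_still_attached(pool_uuid):
    try:
        vdsm_disconnect_pool(pool_uuid)
    except StoragePoolError as e:
        if 'Unknown pool id' in str(e):
            # The bootstrap pool is already gone: another host removed it,
            # so treat this as "already detached" and continue the upgrade.
            return
        raise


detach_if_still_attached('390f5141-20bc-4f07-9d14-758356cb068f')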

Additional info:

Comment 1 Artyom 2015-12-01 16:24:48 UTC
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) Deploy hosted-engine 3.5 on two hosts with NFS storage
2) Put the first host into maintenance via webadmin
3) Upgrade the packages and restart the host (the restart is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1282187)
4) Wait for a correct status via hosted-engine --vm-status (can take around 5-7 minutes; see the polling sketch after this list)
5) Activate the host via webadmin
6) Put the second host into maintenance (wait until all VMs, including the hosted-engine VM, have migrated to the first host)
7) Upgrade the packages and restart the second host
8) Wait for a correct status via hosted-engine --vm-status (again around 5-7 minutes)
9) Activate the second host via webadmin
10) Put the environment into global maintenance
11) Update the rhevm-setup.noarch package on the engine
12) Run engine-setup on the engine VM and finish the upgrade process
13) Disable global maintenance via webadmin
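
For steps 4 and 8, the manual wait can be replaced by a small polling helper. A sketch, assuming (verify on your version) that the plain-text output of hosted-engine --vm-status contains "health": "good" in its Engine status line once the agent has settled:

import subprocess
import time


def wait_for_good_health(timeout=600, interval=20):
    """Poll `hosted-engine --vm-status` until the engine reports good
    health, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            out = subprocess.check_output(
                ['hosted-engine', '--vm-status'],
                stderr=subprocess.STDOUT).decode('utf-8', 'replace')
        except subprocess.CalledProcessError:
            out = ''  # the agent/broker may still be starting up
        if '"health": "good"' in out:
            return True
        time.sleep(interval)
    return False


if __name__ == '__main__':
    ok = wait_for_good_health()
    print('engine healthy' if ok else 'timed out waiting for good health')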

Comment 2 Sandro Bonazzola 2015-12-16 12:23:04 UTC
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.