Bug 1281539 - [upgrade] possible race condition upgrading different hosts
Summary: [upgrade] possible race condition upgrading different hosts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.1
Target Release: 1.3.3
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard: integration
Depends On: 1269768
Blocks: ovirt-hosted-engine-ha-1.3.4.3 RHEV3.6Upgrade
Reported: 2015-11-12 17:28 UTC by Simone Tiraboschi
Modified: 2015-12-16 12:23 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-12-16 12:23:04 UTC
oVirt Team: ---
Embargoed:
rule-engine: ovirt-3.6.z+
bmcclain: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 48509 | 0 | master | MERGED | upgrade: specifically check if still attached to bootstrap SP | Never
oVirt gerrit | 48521 | 0 | ovirt-hosted-engine-ha-1.3 | MERGED | upgrade: specifically check if still attached to bootstrap SP | Never

Description Simone Tiraboschi 2015-11-12 17:28:35 UTC
Description of problem:
In oVirt 3.5 the hosted-engine storage domain was attached to a bootstrap storage pool (SP) that was not in use.
In 3.6 (and during a 3.5 to 3.6 upgrade) we detach the storage domain from that pool so the engine can import it; once the import is done, the engine attaches it to the SP of the host's datacenter.

The current agent code simply assumes that if the host still has to be upgraded to 3.6 (the hosted-engine conf file has the previous structure) and the storage domain is still attached to a storage pool, then that pool is the 3.5 bootstrap one.

But this is not always true. If our host was left in maintenance mode while the engine completed the hosted-engine storage domain import, then when the host performs this check it finds that the storage domain is still attached to a storage pool, and so it tries to disconnect the hosted-engine storage domain from what it assumes is the 3.5 bootstrap storage pool.
This fails because that storage pool no longer exists (the first successfully upgraded host already removed it), and the failure is enough to prevent the agent from starting.
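
The difference between the assumption described above and a stricter check can be sketched as follows. This is a minimal illustration only, not the merged patch: the function names, the idea that the agent has a list of attached pool UUIDs and a known bootstrap pool UUID at hand, and the blank-UUID convention are all assumptions made for this example.

```python
# Hypothetical sketch; none of these names come from ovirt-hosted-engine-ha.
BLANK_UUID = "00000000-0000-0000-0000-000000000000"  # assumed "no pool" marker


def legacy_should_detach(conf_is_3_5, attached_pools):
    """Old assumption: any attached pool on a 3.5-style host is the bootstrap one."""
    return conf_is_3_5 and bool(attached_pools)


def attached_to_bootstrap_pool(attached_pools, bootstrap_sp_uuid):
    """True only if the domain is attached to the known bootstrap pool."""
    if not bootstrap_sp_uuid or bootstrap_sp_uuid == BLANK_UUID:
        return False
    return bootstrap_sp_uuid in attached_pools


def should_detach(conf_is_3_5, attached_pools, bootstrap_sp_uuid):
    """Stricter check: detach only from the bootstrap pool, never from another one."""
    return conf_is_3_5 and attached_to_bootstrap_pool(
        attached_pools, bootstrap_sp_uuid
    )


if __name__ == "__main__":
    bootstrap = "11111111-1111-1111-1111-111111111111"   # placeholder UUID
    datacenter = "22222222-2222-2222-2222-222222222222"  # placeholder UUID
    # Host left in maintenance: the engine already imported the domain,
    # so it is attached to the datacenter pool, not the bootstrap one.
    print(legacy_should_detach(True, [datacenter]))
    print(should_detach(True, [datacenter], bootstrap))
```

In the scenario from this report the legacy check asks for a detach while the stricter one does not, which would let the upgrade continue to its next step.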

Version-Release number of selected component (if applicable):
1.3.1

How reproducible:
100%

Steps to Reproduce:
1. deploy hosted-engine from 3.5
2. put one host in maintenance, upgrade everything else (hosts and engine VM) to 3.6 
3. let the engine auto-import the hosted-engine storage domain
4. upgrade the last host to 3.6 and activate it

Actual results:
The ha-agent tries to connect to the 3.5 bootstrap storage pool to detach the hosted-engine storage domain, but at that point the 3.5 bootstrap storage pool no longer exists, so the operation fails:
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::839::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::657::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2015-11-12 11:49:54,728::upgrade::125::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2015-11-12 11:49:54,818::upgrade::165::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:30ea2f43-9d1c-44ee-ac07-80b0639bb6f1, volUUID:2b902c7d-5424-4a78-b44c-278e2aed5bd0
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::595::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::ERROR::2015-11-12 11:49:54,836::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('390f5141-20bc-4f07-9d14-758356cb068f',)' - trying to restart agent
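
When triaging a log like the one above, it can help to split each line into its fields and filter on the level. The sketch below assumes the layout seen in these lines (fields separated by `::`, followed by the function name in parentheses and the message); the `LogRecord` type and `parse_agent_log_line` helper are hypothetical names for this example, not part of the agent.

```python
import re
from collections import namedtuple

# Hypothetical record type mirroring the fields visible in the log lines above.
LogRecord = namedtuple(
    "LogRecord",
    "thread level timestamp module lineno logger func message",
)

# The trailing part looks like "(function_name) free-form message".
_FUNC_RE = re.compile(r"^\((?P<func>[^)]*)\)\s*(?P<message>.*)$")


def parse_agent_log_line(line):
    """Parse one agent log line; return None if it does not match the layout."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    match = _FUNC_RE.match(rest)
    if match is None or not lineno.isdigit():
        return None
    return LogRecord(
        thread, level, timestamp, module, int(lineno), logger,
        match.group("func"), match.group("message"),
    )


def error_records(lines):
    """Return only the records logged at ERROR level."""
    records = (parse_agent_log_line(line) for line in lines)
    return [r for r in records if r is not None and r.level == "ERROR"]
```

Applied to the excerpt above, `error_records` would keep only the `_run_agent` line reporting the unknown pool id.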


Expected results:
The ha-agent recognizes that the storage pool to which the hosted-engine storage domain is attached is not the 3.5 bootstrap one, and so it proceeds to the next step of the upgrade process.

Additional info:

Comment 1 Artyom 2015-12-01 16:24:48 UTC
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) Deploy hosted-engine 3.5 on two hosts with NFS storage
2) Put the first host into maintenance via webadmin
3) Upgrade packages and restart the host (the restart is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1282187)
4) Wait for the correct status via hosted-engine --vm-status (can take around 5-7 minutes)
5) Activate the host via webadmin
6) Put the second host into maintenance (wait until all VMs have migrated and the HE VM has migrated to the first host)
7) Upgrade packages and restart the second host
8) Wait for the correct status via hosted-engine --vm-status (can take around 5-7 minutes)
9) Activate the second host via webadmin
10) Put the environment into global maintenance
11) Update the rhevm-setup.noarch package on the engine
12) Run engine-setup on the VM and finish the upgrade process
13) Disable global maintenance via webadmin

Comment 2 Sandro Bonazzola 2015-12-16 12:23:04 UTC
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.

