Bug 1281539 - [upgrade] possible race condition upgrading different hosts
Status: CLOSED CURRENTRELEASE
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.0
Hardware: Unspecified OS: Unspecified
Priority: medium Severity: medium
Target Milestone: ovirt-3.6.1
Target Release: 1.3.3
Assigned To: Simone Tiraboschi
QA Contact: Artyom
Whiteboard: integration
Keywords: Triaged
Depends On: 1269768
Blocks: ovirt-hosted-engine-ha-1.3.4.3 RHEV3.6Upgrade
Reported: 2015-11-12 12:28 EST by Simone Tiraboschi
Modified: 2015-12-16 07:23 EST
CC List: 8 users

Doc Type: Bug Fix
Last Closed: 2015-12-16 07:23:04 EST
Type: Bug
rule-engine: ovirt-3.6.z+
bmcclain: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
oVirt gerrit | 48509 | master | MERGED | upgrade: specifically check if still attached to bootstrap SP | Never
oVirt gerrit | 48521 | ovirt-hosted-engine-ha-1.3 | MERGED | upgrade: specifically check if still attached to bootstrap SP | Never

Description Simone Tiraboschi 2015-11-12 12:28:35 EST
Description of problem:
In oVirt 3.5, the hosted-engine storage domain was attached to a bootstrap storage pool that was not in use.
In 3.6 (and during a 3.5 to 3.6 upgrade) we detach it from that pool to let the engine import it; once that is done, the engine attaches it to the storage pool (SP) of the host's datacenter.

The current agent code simply assumes that if the host still has to be upgraded to 3.6 (the hosted-engine conf file has the previous structure) and the storage domain is still attached to a storage pool, then that pool is the 3.5 bootstrap one.

But this is not always true: if our host was left in maintenance mode and the engine has already completed the hosted-engine storage domain import, then when our host performs the check it finds that the storage domain is still attached to a storage pool, and so it tries to detach the hosted-engine storage domain from what it assumes is the bootstrap storage pool.
This fails because the bootstrap storage pool no longer exists (the first successfully upgraded host already removed it), and that is enough to prevent the agent from starting.
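
The difference between the assumption described above and a more specific check can be sketched as follows. This is a minimal illustration, not the agent's actual code: the function names, the `attached_pools` list and the `bootstrap_sp_uuid` parameter are hypothetical, and it assumes the agent can learn both the pools the storage domain is attached to and the UUID of the 3.5 bootstrap pool.

```python
def detach_needed_current(legacy_conf, attached_pools):
    """Behaviour as described above: any attached pool is assumed to be
    the 3.5 bootstrap one (hypothetical sketch)."""
    return legacy_conf and bool(attached_pools)


def detach_needed_specific(legacy_conf, attached_pools, bootstrap_sp_uuid):
    """Detach only when the domain is attached to the bootstrap pool
    itself (hypothetical sketch)."""
    return legacy_conf and bootstrap_sp_uuid in attached_pools


# Made-up UUIDs, for illustration only.
BOOTSTRAP_SP = "11111111-1111-1111-1111-111111111111"
DATACENTER_SP = "22222222-2222-2222-2222-222222222222"

# Host left in maintenance: the engine has already imported the domain, so
# it is attached to the datacenter pool and not to the bootstrap one.
print(detach_needed_current(True, [DATACENTER_SP]))
print(detach_needed_specific(True, [DATACENTER_SP], BOOTSTRAP_SP))
```

With the first predicate the late host still attempts the detach; with the second it skips it, because the pool it finds is not the bootstrap one.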

Version-Release number of selected component (if applicable):
1.3.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted-engine from 3.5.
2. Put one host into maintenance and upgrade everything else (hosts and engine VM) to 3.6.
3. Let the engine auto-import the hosted-engine storage domain.
4. Upgrade the last host to 3.6 and activate it.

Actual results:
The ha-agent tries to connect to the 3.5 bootstrap storage pool to detach the hosted-engine storage domain, but at that point the 3.5 bootstrap storage pool no longer exists, so it fails:
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::839::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::657::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2015-11-12 11:49:54,728::upgrade::125::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2015-11-12 11:49:54,818::upgrade::165::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:30ea2f43-9d1c-44ee-ac07-80b0639bb6f1, volUUID:2b902c7d-5424-4a78-b44c-278e2aed5bd0
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::595::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::ERROR::2015-11-12 11:49:54,836::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('390f5141-20bc-4f07-9d14-758356cb068f',)' - trying to restart agent
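
To spot this failure in a longer agent log, the lines can be split into fields and filtered for the "Unknown pool id" error. The sketch below is only an illustration: the field layout is inferred from the log lines above, and `LogRecord`, `parse_agent_line` and `unknown_pool_ids` are hypothetical names, not part of ovirt-hosted-engine-ha.

```python
import re
from collections import namedtuple

# Field layout inferred from the log excerpt above (an assumption).
LogRecord = namedtuple(
    "LogRecord", "thread level timestamp module lineno logger func message"
)

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def parse_agent_line(line):
    """Split one '::'-separated agent log line into named fields.

    Raises ValueError if the line has fewer than seven fields.
    """
    thread, level, timestamp, module, lineno, logger, rest = line.split("::", 6)
    match = re.match(r"\((\w+)\)\s*(.*)", rest, re.DOTALL)
    func, message = match.groups() if match else ("", rest)
    return LogRecord(thread, level, timestamp, module, int(lineno),
                     logger, func, message)


def unknown_pool_ids(lines):
    """Return the pool UUIDs named in 'Unknown pool id' error records."""
    pools = []
    for line in lines:
        record = parse_agent_line(line)
        if record.level == "ERROR" and "Unknown pool id" in record.message:
            pools.extend(UUID_RE.findall(record.message))
    return pools


sample = (
    "MainThread::ERROR::2015-11-12 11:49:54,836::agent::205::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) "
    "Error: 'Unable to check SPM: Unknown pool id, pool not connected: "
    "('390f5141-20bc-4f07-9d14-758356cb068f',)' - trying to restart agent"
)
print(unknown_pool_ids([sample]))
```

For the error line above this yields the UUID of the pool the agent could not find, which can then be compared with the pool the storage domain is actually attached to.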


Expected results:
The agent recognizes that the storage pool to which the hosted-engine storage domain is attached is not the 3.5 bootstrap one, so the ha-agent proceeds to the next step of the upgrade process.

Additional info:
Comment 1 Artyom 2015-12-01 11:24:48 EST
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) Deploy hosted-engine 3.5 on two hosts, on NFS storage
2) Put the first host into maintenance via webadmin
3) Upgrade the packages and restart the host (the restart is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1282187)
4) Wait for the correct status via hosted-engine --vm-status (can take around 5-7 minutes)
5) Activate the host via webadmin
6) Put the second host into maintenance (wait until all VMs have migrated and the HE VM has migrated to the first host)
7) Upgrade the packages and restart the second host
8) Wait for the correct status via hosted-engine --vm-status (can take around 5-7 minutes)
9) Activate the second host via webadmin
10) Put the environment into global maintenance
11) Update the rhevm-setup.noarch package on the engine
12) Run engine-setup on the VM and finish the upgrade process
13) Disable global maintenance via webadmin
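
The waiting in steps 4 and 8 could be scripted with a small polling helper along these lines. This is a hedged sketch: `wait_until` is a hypothetical helper, the 420-second default simply mirrors the "5-7 minutes" mentioned above, and what counts as "correct status" is left to the caller-supplied `check` function.

```python
import time


def wait_until(check, timeout=420.0, interval=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or timeout seconds elapse.

    Returns True as soon as check() succeeds, False on timeout.
    The clock and sleep parameters exist so the helper can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A `check` function might run `hosted-engine --vm-status` through `subprocess` and inspect its output; the criteria for a healthy status are not shown in this report, so they are not encoded here.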
Comment 2 Sandro Bonazzola 2015-12-16 07:23:04 EST
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.
