Red Hat Bugzilla – Bug 1281539
[upgrade] possible race condition upgrading different hosts
Last modified: 2015-12-16 07:23:04 EST
Description of problem:
In oVirt 3.5 the hosted-engine storage domain was attached to a bootstrap storage pool that was not in use.
In 3.6 (and in the 3.5 to 3.6 upgrade) we detach it from that pool to let the engine import it; once done, the engine attaches it to the host's datacenter storage pool.
The current agent code simply assumes that if the host still has to be upgraded to 3.6 (the hosted-engine conf file has the previous structure) and the storage domain is still attached to a storage pool, then that pool is the 3.5 bootstrap one.
But this is not always true: if our host was left in maintenance mode and the engine already completed the hosted-engine storage domain import, then when our host performs that check it finds that the storage domain is still attached to a storage pool, and so it tries to disconnect the hosted-engine storage domain from that pool.
This fails because that storage pool doesn't exist anymore (the first successfully upgraded host already removed it), but it is enough to prevent the agent from starting.
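The race can be modeled with a minimal Python sketch (all names here are illustrative, not the real ovirt_hosted_engine_ha API): the pre-fix logic treats "storage domain still attached to some pool" as "attached to the 3.5 bootstrap pool" and unconditionally tries to detach, which fails once the first upgraded host has already removed that pool.

```python
# Toy model of the upgrade race. All classes and functions are
# illustrative assumptions, not the real ovirt-hosted-engine-ha code.

BOOTSTRAP_POOL = "390f5141-20bc-4f07-9d14-758356cb068f"  # 3.5 bootstrap SP


class PoolNotConnected(Exception):
    """Stand-in for VDSM's 'Unknown pool id, pool not connected' error."""


class Cluster:
    """Shared storage state as seen by every host."""
    def __init__(self):
        self.pools = {BOOTSTRAP_POOL}            # bootstrap pool still exists
        self.he_sd_attached_to = BOOTSTRAP_POOL  # hosted-engine SD attachment


def engine_auto_import(cluster, dc_pool):
    """The first upgraded host detaches the SD, the engine imports it and
    re-attaches it to the real datacenter pool; the bootstrap pool is gone."""
    cluster.pools.discard(BOOTSTRAP_POOL)
    cluster.pools.add(dc_pool)
    cluster.he_sd_attached_to = dc_pool


def old_agent_upgrade(cluster):
    """Pre-fix agent logic: 'SD still attached to a pool' is assumed to
    mean 'attached to the bootstrap pool', so it always tries to detach."""
    if cluster.he_sd_attached_to is not None:
        if BOOTSTRAP_POOL not in cluster.pools:
            # spmStart/isSPM against the vanished pool fails, which is
            # what the log excerpt below shows.
            raise PoolNotConnected(BOOTSTRAP_POOL)
        cluster.he_sd_attached_to = None  # happy path: real 3.5 leftover


cluster = Cluster()
# Another host upgrades first and the engine completes the auto-import.
engine_auto_import(cluster, dc_pool="datacenter-pool-uuid")
try:
    # Now the host that was left in maintenance runs its upgrade.
    old_agent_upgrade(cluster)
except PoolNotConnected as exc:
    print("agent fails to start:", exc)
```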
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. deploy hosted-engine from 3.5
2. put one host in maintenance, upgrade everything else (hosts and engine VM) to 3.6
3. let the engine auto-import the hosted-engine storage domain
4. upgrade the last host to 3.6 and activate it
The ha-agent will try to connect to the 3.5 bootstrap storage pool to detach the hosted-engine storage domain, but at that point the 3.5 bootstrap storage pool doesn't exist anymore, so it will fail.
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::839::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::INFO::2015-11-12 11:49:52,719::upgrade::657::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2015-11-12 11:49:54,728::upgrade::125::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2015-11-12 11:49:54,818::upgrade::165::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:30ea2f43-9d1c-44ee-ac07-80b0639bb6f1, volUUID:2b902c7d-5424-4a78-b44c-278e2aed5bd0
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2015-11-12 11:49:54,829::upgrade::595::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::ERROR::2015-11-12 11:49:54,836::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('390f5141-20bc-4f07-9d14-758356cb068f',)' - trying to restart agent
It successfully recognizes that the storage pool to which the hosted-engine storage domain is attached is not the 3.5 bootstrap one, so the ha-agent proceeds to the next step of the upgrade process.
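The fixed behavior amounts to comparing pool UUIDs before detaching. A minimal sketch, with hypothetical names that do not match the actual agent code: detach only when the attached pool really is the 3.5 bootstrap one, and otherwise skip the step and continue the upgrade.

```python
# Illustrative sketch of the fixed check; not the actual
# ovirt_hosted_engine_ha.lib.upgrade implementation.

BOOTSTRAP_POOL = "390f5141-20bc-4f07-9d14-758356cb068f"  # from the 3.5 conf


def next_upgrade_action(attached_pool_uuid):
    """Decide the next upgrade step for the hosted-engine storage domain.

    attached_pool_uuid: UUID of the pool the SD is currently attached to,
    or None if it is already detached.
    """
    if attached_pool_uuid is None:
        return "proceed"                   # nothing left to detach
    if attached_pool_uuid == BOOTSTRAP_POOL:
        return "detach-from-bootstrap"     # genuine 3.5 leftover
    # Attached to the engine's datacenter pool: the auto-import already
    # happened on another host, so skip the detach and keep upgrading.
    return "proceed"


print(next_upgrade_action("datacenter-pool-uuid"))  # → proceed
```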
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) Deploy hosted-engine 3.5 on two hosts with NFS storage
2) Put the first host into maintenance via webadmin
3) Upgrade packages and restart the host (the host restart is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1282187)
4) Wait for correct status via hosted-engine --vm-status (can take around 5-7 minutes)
5) Activate the host via webadmin
6) Put the second host into maintenance (wait until all VMs have migrated and the HE VM has migrated to the first host)
7) Upgrade packages and restart the second host
8) Wait for correct status via hosted-engine --vm-status (can take around 5-7 minutes)
9) Activate the second host via webadmin
10) Put the environment into global maintenance
11) Update the rhevm-setup.noarch package on the engine
12) Run engine-setup on the VM and finish the upgrade process
13) Disable global maintenance via webadmin
According to the verification status and target milestone, this issue should be fixed in oVirt 3.6.1. Closing current release.