Bug 1292652 - [upgrade] the upgrade from 3.5 to 3.6 can fail if interrupted in the middle and restarted after a reboot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.3.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ovirt-3.6.5
Target Release: 1.3.5.3
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard:
Depends On:
Blocks: RHEV3.6Upgrade 1322020 1326023 1333143
Reported: 2015-12-18 00:43 UTC by Simone Tiraboschi
Modified: 2020-03-11 15:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The upgrade from 3.5 to 3.6 can fail if it is interrupted midway and restarted after a reboot, because at that point the storage pool is needed but not connected.
Consequence: ovirt-ha-agent fails to upgrade the hosted-engine storage domain to the 3.6 structure and restarts itself in a loop.
Fix: The environment is now checked more carefully, ensuring that the storage pool is actually connected when it is needed.
Result: After the reboot, ovirt-ha-agent can correctly resume the upgrade procedure.
Clone Of:
Environment:
Last Closed: 2016-04-21 14:41:57 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-3.6.z+
ylavi: blocker-
bmcclain: planning_ack+
sbonazzo: devel_ack+
pstehlik: testing_ack+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 50699 0 master MERGED upgrade: fix upgrade from partially upgraded condition 2015-12-18 15:04:08 UTC
oVirt gerrit 50700 0 ovirt-hosted-engine-ha-1.3 MERGED upgrade: fix upgrade from partially upgraded condition 2015-12-18 15:05:18 UTC
oVirt gerrit 54049 0 ovirt-hosted-engine-ha-1.3 MERGED upgrade: trying reconstructMaster as a recovery action 2016-03-29 08:11:45 UTC
oVirt gerrit 55253 0 ovirt-hosted-engine-ha-1.3 NEW upgrade: trying reconstructMaster as a recovery action 2016-03-25 16:31:51 UTC
oVirt gerrit 55643 0 master MERGED upgrade: _fake_mastersd_uuid as a scalar value 2016-04-05 10:16:53 UTC
oVirt gerrit 55644 0 ovirt-hosted-engine-ha-1.3 MERGED upgrade: _fake_mastersd_uuid as a scalar value 2016-04-05 11:32:28 UTC

Description Simone Tiraboschi 2015-12-18 00:43:35 UTC
Description of problem:
The upgrade from 3.5 to 3.6 can fail if interrupted in the middle and restarted after a reboot.

The upgrade procedure has to move the hosted-engine configuration to the shared storage, and it has to disconnect the hosted-engine storage domain from its bootstrap storage pool.
The two tasks can be performed in different attempts.

_move_to_shared_conf was connecting the bootstrap storage pool and leaving it connected for further steps.
_remove_storage_pool was erroneously assuming that the bootstrap storage pool was already connected.
But if the configuration volume is already OK, the upgrade procedure skips _move_to_shared_conf and, if the system has been rebooted in the middle, _remove_storage_pool fails because the storage pool is not connected.

Connecting the storage pool from _remove_storage_pool as well, when needed, solves it.
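
The idea behind the fix can be illustrated with a minimal, self-contained sketch. The `FakePool` and `Upgrader` classes and the `_ensure_pool_connected` helper are hypothetical stand-ins made up for illustration; they are not the actual ovirt-hosted-engine-ha code or the VDSM API. Only the names `_move_to_shared_conf` and `_remove_storage_pool` come from the description above.

```python
class FakePool:
    """Hypothetical stand-in for the bootstrap storage pool (not the VDSM API)."""

    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        if not self.connected:
            # Mirrors the failure mode described in this report.
            raise RuntimeError("Unknown pool id, pool not connected")
        self.connected = False


class Upgrader:
    """Hypothetical, heavily simplified model of the two upgrade tasks."""

    def __init__(self, pool, conf_on_shared_storage=False):
        self.pool = pool
        self.conf_on_shared_storage = conf_on_shared_storage

    def _ensure_pool_connected(self):
        # Idempotent: safe to call whether or not an earlier step connected the pool.
        if not self.pool.connected:
            self.pool.connect()

    def _move_to_shared_conf(self):
        self._ensure_pool_connected()
        self.conf_on_shared_storage = True

    def _remove_storage_pool(self):
        # The point of the fix: do not assume an earlier step left the pool connected.
        self._ensure_pool_connected()
        self.pool.disconnect()

    def upgrade(self):
        if not self.conf_on_shared_storage:
            self._move_to_shared_conf()
        self._remove_storage_pool()


# Resume after a reboot: the configuration was already moved, the pool is not connected.
resumed = Upgrader(FakePool(), conf_on_shared_storage=True)
resumed.upgrade()
print(resumed.pool.connected)
```

In this sketch the resumed run finishes with the pool disconnected instead of raising, because each step establishes its own precondition rather than relying on a side effect of a step that may have been skipped.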

Version-Release number of selected component (if applicable):
1.3.3.4

How reproducible:
It's really a corner case: the user has to stop the agent in the middle of the upgrade (it lasts a couple of seconds) and reboot before it retries.

Steps to Reproduce:
1. deploy HE 3.5
2. start upgrading to 3.6 
3. stop the agent after 'Upgrading to current version' but before 'Successfully upgraded', then reboot the host before the agent retries

Actual results:
The upgrade fails with:
MainThread::INFO::2015-12-17 08:37:42,900::hosted_engine::660::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::ERROR::2015-12-17 08:38:18,900::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('b5ec210d-5f08-46e5-80ac-07085ee6096e',)' - trying to restart agent
MainThread::ERROR::2015-12-17 08:40:42,900::agent::210::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
MainThread::INFO::2015-12-17 08:40:42,900::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down


Expected results:
It successfully upgrades.

Additional info:

Comment 1 Artyom 2016-02-25 13:55:48 UTC
Verify on ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch

MainThread::INFO::2016-02-25 15:46:52,951::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:46:52,972::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:46:52,985::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:46:52,999::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:46:52,999::upgrade::938::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2016-02-25 15:46:53,010::upgrade::298::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectStoragePool) Connecting storage pool - master 'c777a4d3-ac78-49a2-83f5-d8aa8be036ab' - dom_dict '{'c777a4d3-ac78-49a2-83f5-d8aa8be036ab': 'active'}'
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:53,387::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:55,438::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:57,489::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:59,535::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:01,568::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:03,608::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,656::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,822::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:07,854::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:09,892::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:11,926::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:13,988::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,040::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,093::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:47:16,112::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:47:16,113::upgrade::262::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_shared_conf_volume) Creating hosted-engine configuration volume on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,684::upgrade::387::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::375::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::380::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) unable to read '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: hosted-engine.conf
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: broker.conf
MainThread::INFO::2016-02-25 15:47:42,686::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: vm.conf
MainThread::INFO::2016-02-25 15:47:42,717::upgrade::955::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:44,286::upgrade::553::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectFakeStorageDomainServer) connectFakeStorageDomainServer
MainThread::INFO::2016-02-25 15:47:44,345::upgrade::529::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_createFakeStorageDomain) createFakeStorageDomain
MainThread::INFO::2016-02-25 15:47:44,867::upgrade::593::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_attachFakeStorageDomain) _attachFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,436::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_activateFakeStorageDomain) _activateFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:48:08,571::upgrade::319::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_disconnectStoragePool) Disconnecting storage pool
MainThread::INFO::2016-02-25 15:48:14,279::upgrade::756::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_reconstructMaster) _reconstructMaster
MainThread::INFO::2016-02-25 15:49:39,256::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.4.3 started
MainThread::INFO::2016-02-25 15:49:39,273::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: master-vds10.qa.lab.tlv.redhat.com
MainThread::INFO::2016-02-25 15:49:39,274::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-02-25 15:49:39,319::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-02-25 15:49:39,320::storage_server::207::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,357::storage_server::211::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,373::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-02-25 15:49:39,607::hosted_engine::681::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,609::image::116::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::684::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::518::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2016-02-25 15:49:39,785::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.64.254'}
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40558160
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'rhevm', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40556816
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40517136
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118658448
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,801::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118655504
MainThread::INFO::2016-02-25 15:49:39,916::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140144984512144
MainThread::INFO::2016-02-25 15:49:39,916::hosted_engine::610::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-02-25 15:49:39,948::hosted_engine::723::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/c777a4d3-ac78-49a2-83f5-d8aa8be036ab/7acaea0e-60c3-457c-9753-1ee423dec5bb/6a3d914a-00a2-49ee-9861-db5b5a9af260)
MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:49:40,038::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:49:40,051::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2016-02-25 15:49:40,140::upgrade::203::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:af071ec2-e0d2-4027-b704-3f6e66d39c7d, volUUID:86874316-0da4-439d-9e1c-bc0bba455965

....
MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded
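
The two markers named in the reproduction steps, 'Upgrading to current version' and 'Successfully upgraded', make it possible to check from the agent log whether an upgrade attempt ran to completion. The following is a rough sketch, assuming the `::`-separated line layout seen in the excerpts in this report; `parse_line` and `summarize` are hypothetical helper names, not part of ovirt-hosted-engine-ha.

```python
from collections import namedtuple

LogEntry = namedtuple(
    "LogEntry", "thread level timestamp module lineno logger function message"
)


def parse_line(line):
    """Split one agent log line into fields; return None if it does not match."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    if not lineno.isdigit() or not rest.startswith("("):
        return None
    # 'rest' looks like "(function_name) free-form message"
    function, _, message = rest[1:].partition(") ")
    return LogEntry(thread, level, timestamp, module, int(lineno), logger,
                    function, message)


def summarize(lines):
    """Count upgrade attempts, report whether the last one completed, collect errors."""
    attempts, completed, errors = 0, False, []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # e.g. the "...." elision or blank lines
        if entry.level == "ERROR":
            errors.append(entry.message)
        if entry.function == "upgrade_35_36":
            if entry.message == "Upgrading to current version":
                attempts += 1
                completed = False
            elif entry.message == "Successfully upgraded":
                completed = True
    return attempts, completed, errors


# Lines copied from the excerpts in this report.
sample = [
    "MainThread::INFO::2016-02-25 15:46:52,951::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version",
    "MainThread::INFO::2016-02-25 15:49:39,256::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.4.3 started",
    "MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version",
    "....",
    "MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded",
]
attempts, completed, errors = summarize(sample)
print(attempts, completed, errors)
```

For the sample above, the sketch reports two attempts, the last of which completed. A log that ends after an 'Upgrading to current version' line with no matching 'Successfully upgraded' line would indicate an interrupted attempt.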

Comment 2 Artyom 2016-02-25 15:19:54 UTC
The upgrade now fails if the host is rebooted after reconstructMaster but before the upgrade completes.

Comment 3 Red Hat Bugzilla Rules Engine 2016-02-25 15:20:01 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 4 Mike McCune 2016-03-28 23:37:25 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 5 Red Hat Bugzilla Rules Engine 2016-04-04 10:18:56 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 6 Red Hat Bugzilla Rules Engine 2016-04-04 10:19:46 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 7 Artyom 2016-04-14 16:36:42 UTC
The change is still not present in the latest build, so moving it to MODIFIED.

Comment 8 Gil Klein 2016-04-17 08:05:10 UTC
Moving back to assigned based on comment #7

Comment 9 Red Hat Bugzilla Rules Engine 2016-04-17 08:05:16 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 10 Sandro Bonazzola 2016-04-18 07:15:55 UTC
Artyom, are you sure? All referenced patches are included in ovirt-hosted-engine-ha-1.3.5.3.

Comment 14 Artyom 2016-04-20 10:18:16 UTC
Verified on ovirt-hosted-engine-ha-1.3.5.3-1.el7ev.noarch
Steps:
=====================
1) Stop and mask ovirt-ha-agent
2) Update all packages
3) Add two raise Exception statements to upgrade.py:
 1. self._activateFakeStorageDomain()
    raise Exception("MY Exception")
    self._spmStop()
 2. raise Exception("MY Exception")
    self._reconstructMaster(master, dom_dict)
    self._connectStoragePool(master, dom_dict)
4) Unmask and start ovirt-ha-agent
5) Wait until the log contains a line with "MY Exception"
6) Delete the first exception from the code and restart ovirt-ha-agent
7) Wait until the log contains a line with "MY Exception" again
8) Delete the second exception from the code and restart ovirt-ha-agent
9) Check that the upgrade succeeds
PASS
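
The verification steps above amount to fault injection: interrupt the procedure at two chosen points and confirm that each restart resumes correctly. A toy model of that test pattern is sketched below. The step names are taken from the log and the steps above, but the step order is simplified, and `run_attempt`, `InjectedFault` and the `done` list are hypothetical illustrations, not how upgrade.py tracks its progress.

```python
class InjectedFault(Exception):
    """Stands in for the temporary raise Exception("MY Exception") lines."""


# Simplified step order (an assumption for this sketch).
STEPS = [
    "_activateFakeStorageDomain",
    "_spmStop",
    "_reconstructMaster",
    "_connectStoragePool",
]


def run_attempt(done, fail_before=None):
    """Run the steps not yet recorded in `done`; raise just before `fail_before`."""
    for step in STEPS:
        if step in done:
            continue  # effect already in place from an earlier attempt
        if step == fail_before:
            raise InjectedFault(step)
        done.append(step)


done = []      # hypothetical state that survives agent restarts
outcomes = []
# Attempt 1 fails before _spmStop, attempt 2 before _reconstructMaster,
# attempt 3 runs with both injected faults removed.
for fault in ("_spmStop", "_reconstructMaster", None):
    try:
        run_attempt(done, fail_before=fault)
        outcomes.append("success")
    except InjectedFault as exc:
        outcomes.append("failed before " + str(exc))
print(outcomes)
```

Each attempt picks up where the previous one stopped, so the third attempt succeeds with every step applied exactly once.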

Comment 15 Marina Kalinin 2016-05-04 18:58:44 UTC
Simone,
I cannot understand what upgrade we are talking about here.
It does not sound like it is talking about engine-setup.

IHAC where engine-setup was interrupted in an unrecoverable way, but I cannot see how the logs here relate to my customer's problem.

Can you please elaborate?

Comment 16 Simone Tiraboschi 2016-05-05 12:10:33 UTC
(In reply to Marina from comment #15)
> Simone,
> I cannot understand what upgrade we are talking about here.
> It does not sound like it is talking about engine-setup.

No, this was about upgrading hosted-engine hosts from 3.5 to 3.6.
The first upgraded host triggers a procedure to copy the hosted-engine configuration to the shared volume, and this was failing badly if the host was rebooted just after the rpm upgrade, in the middle of this procedure.

Now it can correctly recover on next attempt.

What I think you are talking about is instead tracked here:
https://bugzilla.redhat.com/show_bug.cgi?id=1290073

