Description of problem:
The upgrade from 3.5 to 3.6 can fail if it is interrupted in the middle and restarted after a reboot.
The upgrade procedure has to move the hosted-engine configuration to the shared storage, and it has to disconnect the hosted-engine storage domain from its bootstrap storage pool. The two tasks can be performed by different attempts.
_move_to_shared_conf was connecting the bootstrap storage pool and leaving it connected for further steps. _remove_storage_pool was erroneously assuming that the bootstrap storage pool was already connected.
But if the configuration volume is already OK, the upgrade procedure skips _move_to_shared_conf and, if the system has been rebooted in the middle, _remove_storage_pool fails because the storage pool is not connected.
Connecting the storage pool, if needed, from _remove_storage_pool as well solves it.

Version-Release number of selected component (if applicable):
1.3.3.4

How reproducible:
It's really a corner case: the user has to stop the agent in the middle of the upgrade (it lasts a couple of seconds) and reboot before it retries.

Steps to Reproduce:
1. deploy HE 3.5
2. start upgrading to 3.6
3. stop the agent after 'Upgrading to current version' but before 'Successfully upgraded'

Actual results:
The upgrade fails with:
MainThread::INFO::2015-12-17 08:37:42,900::hosted_engine::660::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::ERROR::2015-12-17 08:38:18,900::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('b5ec210d-5f08-46e5-80ac-07085ee6096e',)' - trying to restart agent
MainThread::ERROR::2015-12-17 08:40:42,900::agent::210::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
MainThread::INFO::2015-12-17 08:40:42,900::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

Expected results:
It successfully upgrades.

Additional info:
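The fix described above (connect the bootstrap storage pool from _remove_storage_pool when it is not already connected) can be illustrated with a minimal sketch. This is not the actual ovirt-hosted-engine-ha code: FakeStorageBackend, its methods, and remove_storage_pool are hypothetical stand-ins that only model the "connected" state lost on reboot.

```python
class PoolNotConnected(Exception):
    """Raised by the fake backend when the pool is used while disconnected."""


class FakeStorageBackend:
    """Hypothetical in-memory stand-in for the storage layer (not the VDSM API)."""

    def __init__(self):
        self.connected = False      # a reboot leaves the pool disconnected
        self.pool_removed = False

    def connect_pool(self):
        self.connected = True

    def remove_pool(self):
        # Mirrors the failure mode in the report: the operation needs
        # a connected pool and fails otherwise.
        if not self.connected:
            raise PoolNotConnected("Unknown pool id, pool not connected")
        self.pool_removed = True
        self.connected = False


def remove_storage_pool(backend):
    """Detach from the bootstrap pool, connecting first if needed."""
    if not backend.connected:
        # A previous attempt may have been interrupted, so do not assume
        # that an earlier step left the pool connected.
        backend.connect_pool()
    backend.remove_pool()


# Simulate a retry after a reboot: the pool starts out disconnected.
backend = FakeStorageBackend()
remove_storage_pool(backend)
```

The point of the sketch is that each step establishes its own preconditions instead of relying on side effects of a step that may be skipped on a later attempt.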
Verify on ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch

MainThread::INFO::2016-02-25 15:46:52,951::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:46:52,972::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:46:52,985::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:46:52,999::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:46:52,999::upgrade::938::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2016-02-25 15:46:53,010::upgrade::298::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectStoragePool) Connecting storage pool - master 'c777a4d3-ac78-49a2-83f5-d8aa8be036ab' - dom_dict '{'c777a4d3-ac78-49a2-83f5-d8aa8be036ab': 'active'}'
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:53,387::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:55,438::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:57,489::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:59,535::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:01,568::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:03,608::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,656::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,822::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:07,854::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:09,892::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:11,926::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:13,988::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,040::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,093::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:47:16,112::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:47:16,113::upgrade::262::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_shared_conf_volume) Creating hosted-engine configuration volume on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,684::upgrade::387::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::375::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::380::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) unable to read '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: hosted-engine.conf
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: broker.conf
MainThread::INFO::2016-02-25 15:47:42,686::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: vm.conf
MainThread::INFO::2016-02-25 15:47:42,717::upgrade::955::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:44,286::upgrade::553::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectFakeStorageDomainServer) connectFakeStorageDomainServer
MainThread::INFO::2016-02-25 15:47:44,345::upgrade::529::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_createFakeStorageDomain) createFakeStorageDomain
MainThread::INFO::2016-02-25 15:47:44,867::upgrade::593::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_attachFakeStorageDomain) _attachFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,436::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_activateFakeStorageDomain) _activateFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:48:08,571::upgrade::319::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_disconnectStoragePool) Disconnecting storage pool
MainThread::INFO::2016-02-25 15:48:14,279::upgrade::756::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_reconstructMaster) _reconstructMaster
MainThread::INFO::2016-02-25 15:49:39,256::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.4.3 started
MainThread::INFO::2016-02-25 15:49:39,273::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: master-vds10.qa.lab.tlv.redhat.com
MainThread::INFO::2016-02-25 15:49:39,274::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-02-25 15:49:39,319::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-02-25 15:49:39,320::storage_server::207::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,357::storage_server::211::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,373::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-02-25 15:49:39,607::hosted_engine::681::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,609::image::116::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::684::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::518::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2016-02-25 15:49:39,785::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.64.254'}
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40558160
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'rhevm', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40556816
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40517136
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118658448
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,801::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118655504
MainThread::INFO::2016-02-25 15:49:39,916::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140144984512144
MainThread::INFO::2016-02-25 15:49:39,916::hosted_engine::610::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-02-25 15:49:39,948::hosted_engine::723::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/c777a4d3-ac78-49a2-83f5-d8aa8be036ab/7acaea0e-60c3-457c-9753-1ee423dec5bb/6a3d914a-00a2-49ee-9861-db5b5a9af260)
MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:49:40,038::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:49:40,051::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2016-02-25 15:49:40,140::upgrade::203::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:af071ec2-e0d2-4027-b704-3f6e66d39c7d, volUUID:86874316-0da4-439d-9e1c-bc0bba455965
....
MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded
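To tell from an agent log whether the last upgrade attempt reached 'Successfully upgraded' or stopped after 'Upgrading to current version', something like the following sketch could be used. The regular expression is inferred only from the log lines quoted in this report, and parse_line and upgrade_state are hypothetical helper names, not part of ovirt-hosted-engine-ha.

```python
import re

# Field layout inferred from the log lines quoted in this report:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LOG_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::(?P<logger>[\w.]+)::"
    r"\((?P<func>\w+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def upgrade_state(lines):
    """Classify the last upgrade attempt seen in the given log lines."""
    state = "not started"
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["func"] != "upgrade_35_36":
            continue
        if entry["message"] == "Upgrading to current version":
            state = "interrupted"   # started, no success message seen yet
        elif entry["message"] == "Successfully upgraded":
            state = "completed"
    return state


sample = [
    "MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::"
    "ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) "
    "Upgrading to current version",
    "MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::"
    "ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) "
    "Successfully upgraded",
]
print(upgrade_state(sample[:1]), upgrade_state(sample))
```

Lines that do not match the pattern (for example the "...." elision in the log excerpt) are simply skipped.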
Now the upgrade fails if the host is rebooted after reconstructMaster but before the upgrade completes.
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.
The change still does not exist in the latest build, so moving it to MODIFIED.
Moving back to assigned based on comment #7
Artyom, are you sure? All referenced patches are included in ovirt-hosted-engine-ha-1.3.5.3.
Verified on ovirt-hosted-engine-ha-1.3.5.3-1.el7ev.noarch

Steps:
=====================
1) Stop and mask ovirt-ha-agent
2) Update all packages
3) Add two "raise Exception" statements to upgrade.py:
   1. self._activateFakeStorageDomain()
      raise Exception("MY Exception")
      self._spmStop()
   2. raise Exception("MY Exception")
      self._reconstructMaster(master, dom_dict)
      self._connectStoragePool(master, dom_dict)
4) Unmask and start ovirt-ha-agent
5) Wait until the log has a line with my exception
6) Delete the first exception from the code and restart ovirt-ha-agent
7) Wait until the log has a line with my exception
8) Delete the second exception from the code and restart ovirt-ha-agent
9) Check that the upgrade succeeds

PASS
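The verification procedure above (inject a failure at one point, restart, inject a failure at a later point, restart, then let the upgrade finish) can be modelled with a small toy harness. This is only a sketch of the idea: the step names are labels borrowed from the steps above, while run_upgrade, InjectedFailure and the `done` list are hypothetical. In particular, recording completed steps in a list is an assumption for illustration, not how upgrade.py tracks progress.

```python
# Step labels taken from the verification steps; they are only strings here.
STEPS = [
    "_activateFakeStorageDomain",
    "_spmStop",
    "_reconstructMaster",
    "_connectStoragePool",
]


class InjectedFailure(Exception):
    """Stands in for the manually added raise Exception("MY Exception")."""


def run_upgrade(done, fail_before=None):
    """Run the steps in order, skipping those already recorded in `done`.

    If `fail_before` names a pending step, raise just before running it,
    which simulates an attempt interrupted at that point.
    """
    for step in STEPS:
        if step in done:
            continue
        if step == fail_before:
            raise InjectedFailure("MY Exception before " + step)
        done.append(step)


done = []       # survives across attempts, like state on shared storage
attempts = []
# Attempt 1 fails before _spmStop, attempt 2 before _reconstructMaster,
# attempt 3 runs with both injected failures removed.
for fail_before in ("_spmStop", "_reconstructMaster", None):
    try:
        run_upgrade(done, fail_before)
        attempts.append("ok")
    except InjectedFailure:
        attempts.append("failed")

print(attempts, done)
```

Each retry resumes from the first step that has not completed, which is the behaviour the verification steps check for.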
Simone, I cannot understand what upgrade we are talking about here. It does not sound like it is talking about engine-setup. IHAC where engine-setup was interrupted in an unrecoverable way, but I cannot see how the logs here relate to my customer's problem. Can you please elaborate?
(In reply to Marina from comment #15)
> Simone,
> I cannot understand what upgrade we are talking about here.
> It does not sound like it is talking about engine-setup.

No, this was about upgrading hosted-engine hosts from 3.5 to 3.6.
The first upgraded host triggers a procedure to copy the hosted-engine configuration to the shared volume, and this was failing in a bad way if the host was rebooted just after the rpm upgrade, in the middle of this procedure. Now it can correctly recover on the next attempt.

What I think you are talking about is instead tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1290073