Bug 1292652
Summary: | [upgrade] the upgrade from 3.5 to 3.6 can fail if interrupted in the middle and restarted after a reboot | ||
---|---|---|---|
Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Simone Tiraboschi <stirabos> |
Component: | Agent | Assignee: | Simone Tiraboschi <stirabos> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Artyom <alukiano> |
Severity: | high | Docs Contact: | |
Priority: | low | ||
Version: | 1.3.3.4 | CC: | alukiano, bmcclain, bugs, eedri, gklein, mavital, mkalinin, pstehlik, sbonazzo, stirabos, ylavi |
Target Milestone: | ovirt-3.6.5 | Keywords: | Triaged |
Target Release: | 1.3.5.3 | Flags: | rule-engine:
ovirt-3.6.z+
ylavi: blocker- bmcclain: planning_ack+ sbonazzo: devel_ack+ pstehlik: testing_ack+ |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
The upgrade from 3.5 to 3.6 can fail if it is interrupted midway and restarted after a reboot, because at that point the storage pool is needed but not connected.
Consequence:
ovirt-ha-agent fails to upgrade the hosted-engine storage domain to the 3.6 structure and restarts itself in a loop.
Fix:
Check the environment state more carefully and ensure that the storagePool is really connected when needed.
Result:
After the reboot, ovirt-ha-agent can correctly resume the upgrade procedure.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2016-04-21 14:41:57 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1285700, 1322020, 1326023, 1333143 |
Description
Simone Tiraboschi
2015-12-18 00:43:35 UTC
Verify on ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch

MainThread::INFO::2016-02-25 15:46:52,951::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:46:52,972::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:46:52,985::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:46:52,999::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:46:52,999::upgrade::938::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2016-02-25 15:46:53,010::upgrade::298::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectStoragePool) Connecting storage pool - master 'c777a4d3-ac78-49a2-83f5-d8aa8be036ab' - dom_dict '{'c777a4d3-ac78-49a2-83f5-d8aa8be036ab': 'active'}'
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:53,387::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:55,438::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:57,489::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:59,535::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:01,568::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:03,608::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,656::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,822::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:07,854::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:09,892::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:11,926::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:13,988::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,040::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,093::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:47:16,112::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:47:16,113::upgrade::262::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_shared_conf_volume) Creating hosted-engine configuration volume on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,684::upgrade::387::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::375::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::380::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) unable to read '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: hosted-engine.conf
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: broker.conf
MainThread::INFO::2016-02-25 15:47:42,686::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: vm.conf
MainThread::INFO::2016-02-25 15:47:42,717::upgrade::955::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:44,286::upgrade::553::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectFakeStorageDomainServer) connectFakeStorageDomainServer
MainThread::INFO::2016-02-25 15:47:44,345::upgrade::529::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_createFakeStorageDomain) createFakeStorageDomain
MainThread::INFO::2016-02-25 15:47:44,867::upgrade::593::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_attachFakeStorageDomain) _attachFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,436::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_activateFakeStorageDomain) _activateFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:48:08,571::upgrade::319::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_disconnectStoragePool) Disconnecting storage pool
MainThread::INFO::2016-02-25 15:48:14,279::upgrade::756::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_reconstructMaster) _reconstructMaster
MainThread::INFO::2016-02-25 15:49:39,256::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.4.3 started
MainThread::INFO::2016-02-25 15:49:39,273::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: master-vds10.qa.lab.tlv.redhat.com
MainThread::INFO::2016-02-25 15:49:39,274::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-02-25 15:49:39,319::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-02-25 15:49:39,320::storage_server::207::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,357::storage_server::211::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,373::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-02-25 15:49:39,607::hosted_engine::681::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,609::image::116::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::684::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::518::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2016-02-25 15:49:39,785::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.64.254'}
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40558160
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'rhevm', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40556816
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40517136
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118658448
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,801::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118655504
MainThread::INFO::2016-02-25 15:49:39,916::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140144984512144
MainThread::INFO::2016-02-25 15:49:39,916::hosted_engine::610::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-02-25 15:49:39,948::hosted_engine::723::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/c777a4d3-ac78-49a2-83f5-d8aa8be036ab/7acaea0e-60c3-457c-9753-1ee423dec5bb/6a3d914a-00a2-49ee-9861-db5b5a9af260)
MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:49:40,038::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:49:40,051::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2016-02-25 15:49:40,140::upgrade::203::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:af071ec2-e0d2-4027-b704-3f6e66d39c7d, volUUID:86874316-0da4-439d-9e1c-bc0bba455965
....
MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded

Now the upgrade fails if the host is rebooted after reconstructMaster but before the upgrade completes.

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

The change still does not exist under the last build, so moving it to MODIFIED.

Moving back to assigned based on comment #7.

Artyom, are you sure? All referenced patches are included in ovirt-hosted-engine-ha-1.3.5.3.

Verified on ovirt-hosted-engine-ha-1.3.5.3-1.el7ev.noarch

Steps:
=====================
1) Stop and mask ovirt-ha-agent.
2) Update all packages.
3) Add two raise Exception statements to upgrade.py:
   1.
      self._activateFakeStorageDomain()
      raise Exception("MY Exception")
      self._spmStop()
   2.
      raise Exception("MY Exception")
      self._reconstructMaster(master, dom_dict)
      self._connectStoragePool(master, dom_dict)
4) Unmask and start ovirt-ha-agent.
5) Wait until the log contains a line with "MY Exception".
6) Delete the first exception from the code and restart ovirt-ha-agent.
7) Wait until the log contains a line with "MY Exception".
8) Delete the second exception from the code and restart ovirt-ha-agent.
9) Check that the upgrade succeeds.
PASS

Simone,
I cannot understand which upgrade we are talking about here. It does not sound like it is about engine-setup. I have a customer (IHAC) where engine-setup was interrupted in an unrecoverable way, but I cannot see how the logs in here relate to my customer's problem. Can you please elaborate?

(In reply to Marina from comment #15)
> Simone,
> I cannot understand what upgrade we are talking about here.
> It does not sound like it is talking about engine-setup.

No, this was about upgrading hosted-engine hosts from 3.5 to 3.6. The first upgraded host triggers a procedure to copy the hosted-engine configuration to the shared volume, and this was failing badly if the host was rebooted just after the rpm upgrade, in the middle of this procedure. Now it can correctly recover on the next attempt.

What I think you are talking about is instead tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1290073
|
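
For readers who want to see the failure mode and the verification approach in isolation, here is a minimal toy model in Python. It is not the ovirt-hosted-engine-ha code: the class `ToyUpgrade`, its attributes, and the step names are hypothetical and only loosely mirror the log messages above. It assumes that the configuration volume survives a restart, that the storage pool connection does not, and that each run re-checks the pool connection instead of trusting an earlier attempt. The loop in `verify_resume` mimics the two injected exceptions from the verification steps.

```python
class ToyUpgrade:
    """Hypothetical stand-in for a resumable multi-step upgrade."""

    def __init__(self):
        self.conf_volume_exists = False  # persistent: survives a restart
        self.upgraded = False            # persistent: survives a restart
        self.pool_connected = False      # volatile: lost on restart or reboot
        self.fail_at = None              # step before which to raise (fault injection)
        self.log = []

    def restart(self):
        # Simulate an agent restart or host reboot: volatile state is lost.
        self.pool_connected = False

    def _ensure_pool_connected(self):
        # Check the real condition instead of assuming a previous run left it set.
        if not self.pool_connected:
            self.log.append("connect_pool")
            self.pool_connected = True

    def _step(self, name):
        if not self.pool_connected:
            raise RuntimeError(name + ": storage pool not connected")
        if self.fail_at == name:
            raise RuntimeError("MY Exception before " + name)
        self.log.append(name)

    def run(self):
        if self.upgraded:
            return "already upgraded"
        self._ensure_pool_connected()
        if not self.conf_volume_exists:
            # Done only once; a resumed run finds the volume and skips this.
            self._step("create_conf_volume")
            self.conf_volume_exists = True
        self._step("activate_fake_domain")
        self._step("reconstruct_master")
        self.upgraded = True
        return "upgraded"


def verify_resume():
    """Interrupt the upgrade twice, restarting in between, then let it finish."""
    up = ToyUpgrade()
    outcomes = []
    for fail_at in ("activate_fake_domain", "reconstruct_master", None):
        up.fail_at = fail_at
        try:
            outcomes.append(up.run())
        except RuntimeError as exc:
            outcomes.append("failed: " + str(exc))
        up.restart()
    return outcomes, up.log
```

If the call to `_ensure_pool_connected()` is removed from `run()`, every attempt in this toy fails with "storage pool not connected", which is the kind of retry loop described in the Doc Text.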