Bug 1292652
| Summary: | [upgrade] the upgrade from 3.5 to 3.6 can fail if interrupted in the middle and restarted after a reboot | ||
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Simone Tiraboschi <stirabos> |
| Component: | Agent | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Artyom <alukiano> |
| Severity: | high | Docs Contact: | |
| Priority: | low | ||
| Version: | 1.3.3.4 | CC: | alukiano, bmcclain, bugs, eedri, gklein, mavital, mkalinin, pstehlik, sbonazzo, stirabos, ylavi |
| Target Milestone: | ovirt-3.6.5 | Keywords: | Triaged |
| Target Release: | 1.3.5.3 | Flags: | rule-engine:
ovirt-3.6.z+
ylavi: blocker- bmcclain: planning_ack+ sbonazzo: devel_ack+ pstehlik: testing_ack+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause:
The upgrade from 3.5 to 3.6 can fail if it is interrupted in the middle and restarted after a reboot, because at that point the storage pool is needed but not connected.
Consequence:
ovirt-ha-agent fails to upgrade the hosted-engine storage domain to the 3.6 structure and restarts itself in a loop.
Fix:
Check the environment state more carefully and ensure that the storage pool is really connected when it is needed.
Result:
After the reboot, ovirt-ha-agent can correctly resume the upgrade procedure.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-04-21 14:41:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1285700, 1322020, 1326023, 1333143 | ||
Verify on ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch
MainThread::INFO::2016-02-25 15:46:52,951::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:46:52,972::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:46:52,985::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:46:52,999::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:46:52,999::upgrade::938::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2016-02-25 15:46:53,010::upgrade::298::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectStoragePool) Connecting storage pool - master 'c777a4d3-ac78-49a2-83f5-d8aa8be036ab' - dom_dict '{'c777a4d3-ac78-49a2-83f5-d8aa8be036ab': 'active'}'
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:46:53,313::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:53,387::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:55,438::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:57,489::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:46:59,535::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:01,568::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:03,608::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,656::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:05,822::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:07,854::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:09,892::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:11,926::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:13,988::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,040::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:16,093::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-02-25 15:47:16,112::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-02-25 15:47:16,113::upgrade::262::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_shared_conf_volume) Creating hosted-engine configuration volume on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,684::upgrade::387::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::375::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2016-02-25 15:47:42,685::upgrade::380::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) unable to read '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: hosted-engine.conf
MainThread::INFO::2016-02-25 15:47:42,685::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: broker.conf
MainThread::INFO::2016-02-25 15:47:42,686::upgrade::354::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: vm.conf
MainThread::INFO::2016-02-25 15:47:42,717::upgrade::955::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::668::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStart) spmStart
MainThread::INFO::2016-02-25 15:47:42,758::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:47:44,286::upgrade::553::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_connectFakeStorageDomainServer) connectFakeStorageDomainServer
MainThread::INFO::2016-02-25 15:47:44,345::upgrade::529::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_createFakeStorageDomain) createFakeStorageDomain
MainThread::INFO::2016-02-25 15:47:44,867::upgrade::593::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_attachFakeStorageDomain) _attachFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,436::upgrade::605::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_activateFakeStorageDomain) _activateFakeStorageDomain
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::706::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_spmStop) spmStop
MainThread::INFO::2016-02-25 15:48:08,462::upgrade::658::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_isSPM) isSPM
MainThread::INFO::2016-02-25 15:48:08,571::upgrade::319::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_disconnectStoragePool) Disconnecting storage pool
MainThread::INFO::2016-02-25 15:48:14,279::upgrade::756::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_reconstructMaster) _reconstructMaster
MainThread::INFO::2016-02-25 15:49:39,256::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.4.3 started
MainThread::INFO::2016-02-25 15:49:39,273::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: master-vds10.qa.lab.tlv.redhat.com
MainThread::INFO::2016-02-25 15:49:39,274::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-02-25 15:49:39,319::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-02-25 15:49:39,320::storage_server::207::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,357::storage_server::211::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-02-25 15:49:39,373::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-02-25 15:49:39,607::hosted_engine::681::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,609::image::116::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::684::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-02-25 15:49:39,784::hosted_engine::518::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2016-02-25 15:49:39,785::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.35.64.254'}
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40558160
MainThread::INFO::2016-02-25 15:49:39,788::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'rhevm', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40556816
MainThread::INFO::2016-02-25 15:49:39,791::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 40517136
MainThread::INFO::2016-02-25 15:49:39,794::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118658448
MainThread::INFO::2016-02-25 15:49:39,797::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': 'c5c349b3-8907-43a9-b1a8-57b91635636e', 'address': '0'}
MainThread::INFO::2016-02-25 15:49:39,801::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140145118655504
MainThread::INFO::2016-02-25 15:49:39,916::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140144984512144
MainThread::INFO::2016-02-25 15:49:39,916::hosted_engine::610::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-02-25 15:49:39,948::hosted_engine::723::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/c777a4d3-ac78-49a2-83f5-d8aa8be036ab/7acaea0e-60c3-457c-9753-1ee423dec5bb/6a3d914a-00a2-49ee-9861-db5b5a9af260)
MainThread::INFO::2016-02-25 15:49:39,967::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:49:40,038::upgrade::720::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-02-25 15:49:40,051::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::INFO::2016-02-25 15:49:40,140::upgrade::203::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Found conf volume: imgUUID:af071ec2-e0d2-4027-b704-3f6e66d39c7d, volUUID:86874316-0da4-439d-9e1c-bc0bba455965
....
MainThread::INFO::2016-02-25 15:51:58,351::upgrade::1011::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded
Now the upgrade fails if the host is rebooted after reconstructMaster but before the upgrade completes.

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

The change still does not exist in the last build, so moving it to MODIFIED.

Moving back to ASSIGNED based on comment #7.

Artyom, are you sure? All referenced patches are included in ovirt-hosted-engine-ha-1.3.5.3.

Verified on ovirt-hosted-engine-ha-1.3.5.3-1.el7ev.noarch
Steps:
=====================
1) Stop and mask ovirt-ha-agent
2) Update all packages
3) Add two raise Exception statements to upgrade.py:
   1. After self._activateFakeStorageDomain():
        self._activateFakeStorageDomain()
        raise Exception("MY Exception")
        self._spmStop()
   2. Before self._reconstructMaster(master, dom_dict):
        raise Exception("MY Exception")
        self._reconstructMaster(master, dom_dict)
        self._connectStoragePool(master, dom_dict)
4) Unmask and start ovirt-ha-agent
5) Wait until the log contains a line with "MY Exception"
6) Delete the first exception from the code and restart ovirt-ha-agent
7) Wait until the log contains a line with "MY Exception"
8) Delete the second exception from the code and restart ovirt-ha-agent
9) Check that the upgrade succeeds
PASS
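
The verification steps above amount to fault injection: the upgrade is forced to fail at two points, and each restart must resume and get further than the previous attempt. The following is only a minimal Python sketch of that test pattern. The function `run_upgrade`, the `state` dictionary and the step names are hypothetical and only loosely mirror the calls named in the steps; the real agent decides what to skip by inspecting the storage itself, not by keeping a list of completed steps.

```python
# Hypothetical step names, loosely modelled on the calls named in the steps above.
STEPS = [
    "move_to_shared_conf",
    "activate_fake_storage_domain",
    "spm_stop",
    "reconstruct_master",
    "connect_storage_pool",
]


def run_upgrade(state, fail_before):
    """Run the remaining steps; raise just before any step listed in fail_before.

    `state` stands in for whatever survives an agent restart (in the real
    system, the condition of the shared storage). This is an assumption
    made for illustration only.
    """
    for step in STEPS:
        if step in state["done"]:
            continue  # already completed by an earlier attempt
        if step in fail_before:
            raise Exception("MY Exception")  # injected failure
        state["done"].append(step)
    return "Successfully upgraded"


def verify():
    """Replay the scenario: two injected failures, removed one per restart."""
    state = {"done": []}
    fail_before = {"spm_stop", "reconstruct_master"}
    attempts = []
    for fix in ("spm_stop", "reconstruct_master", None):
        try:
            attempts.append(run_upgrade(state, fail_before))
        except Exception as exc:
            attempts.append(str(exc))
        # "Delete the exception from the code" before the next restart.
        fail_before.discard(fix)
    return attempts, state


attempts, state = verify()
```

In this sketch the first two attempts stop at the injected exception and the third runs to completion, which is the behaviour the PASS result above reports for the real agent.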
Simone,
I cannot understand what upgrade we are talking about here. It does not sound like it is talking about engine-setup. I have a customer (IHAC) where engine-setup was interrupted in an unrecoverable way, but I cannot see how the logs here relate to my customer's problem. Can you please elaborate?

(In reply to Marina from comment #15)
> Simone,
> I cannot understand what upgrade we are talking about here.
> It does not sound like it is talking about engine-setup.

No, this was about upgrading hosted-engine hosts from 3.5 to 3.6. The first upgraded host triggers a procedure to copy the hosted-engine configuration to the shared volume, and this was failing badly if the host was rebooted just after the rpm upgrade, in the middle of this procedure. Now it can correctly recover on the next attempt.

What I think you are talking about is instead tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1290073 |
Description of problem:
The upgrade from 3.5 to 3.6 can fail if interrupted in the middle and restarted after a reboot.
The upgrade procedure has to move the hosted-engine configuration to the shared storage, and it has to disconnect the hosted-engine storage domain from its bootstrap storage pool. The two tasks can be performed by different attempts.
_move_to_shared_conf was connecting the bootstrap storage pool and leaving it connected for further steps.
_remove_storage_pool was erroneously assuming that the bootstrap storage pool was already connected.
But if the configuration volume is already OK, the upgrade procedure skips _move_to_shared_conf and, if the system has been rebooted in the middle, _remove_storage_pool fails because the storage pool is not connected.
Connecting the storage pool, if needed, from _remove_storage_pool as well solves it.

Version-Release number of selected component (if applicable):
1.3.3.4

How reproducible:
It's really a corner case: the user has to stop the agent in the middle of the upgrade (it lasts a couple of seconds) and reboot before it retries.

Steps to Reproduce:
1. deploy HE 3.5
2. start upgrading to 3.6
3. stop the agent after 'Upgrading to current version' but before 'Successfully upgraded'

Actual results:
The upgrade fails with:
MainThread::INFO::2015-12-17 08:37:42,900::hosted_engine::660::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::ERROR::2015-12-17 08:38:18,900::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to check SPM: Unknown pool id, pool not connected: ('b5ec210d-5f08-46e5-80ac-07085ee6096e',)' - trying to restart agent
MainThread::ERROR::2015-12-17 08:40:42,900::agent::210::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Too many errors occurred, giving up. Please review the log and consider filing a bug.
MainThread::INFO::2015-12-17 08:40:42,900::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

Expected results:
It successfully upgrades.

Additional info:
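
The fix described above (connecting the storage pool from _remove_storage_pool as well, when needed) can be illustrated with a small, self-contained Python sketch. This is not the actual upgrade.py code: `FakeStorage`, `Upgrade`, `_ensure_pool_connected` and the `conf_volume_present` flag are hypothetical names used for illustration, and the storage calls are stand-ins rather than the VDSM API. Only `_move_to_shared_conf` and `_remove_storage_pool` are names taken from the description.

```python
class FakeStorage:
    """Hypothetical stand-in for the storage backend (not the VDSM API)."""

    def __init__(self):
        self.pool_connected = False  # a reboot leaves the pool disconnected
        self.connect_calls = 0

    def connect_pool(self):
        self.connect_calls += 1
        self.pool_connected = True

    def check_spm(self):
        # Mirrors the failure mode in the report: SPM checks need the pool.
        if not self.pool_connected:
            raise RuntimeError("Unknown pool id, pool not connected")
        return False


class Upgrade:
    def __init__(self, storage, conf_volume_present):
        self.storage = storage
        self.conf_volume_present = conf_volume_present

    def _ensure_pool_connected(self):
        # Idempotent guard: connect only when the pool is not connected yet.
        if not self.storage.pool_connected:
            self.storage.connect_pool()

    def _move_to_shared_conf(self):
        self._ensure_pool_connected()
        self.conf_volume_present = True

    def _remove_storage_pool(self):
        # Before the fix, this step assumed an earlier step had connected
        # the pool. The guard makes it safe to run on its own.
        self._ensure_pool_connected()
        self.storage.check_spm()
        self.storage.pool_connected = False  # pool is disconnected at the end

    def run(self):
        if not self.conf_volume_present:
            self._move_to_shared_conf()
        self._remove_storage_pool()
        return "Successfully upgraded"


# Resumed attempt after a reboot: the conf volume already exists, so
# _move_to_shared_conf is skipped and the pool starts out disconnected.
resumed_storage = FakeStorage()
resumed_result = Upgrade(resumed_storage, conf_volume_present=True).run()

# Fresh attempt: both steps run, and the pool is still connected only once.
fresh_storage = FakeStorage()
fresh_result = Upgrade(fresh_storage, conf_volume_present=False).run()
```

The design point is that each step checks its own precondition instead of relying on a side effect of a step that may have been skipped, so the procedure can be resumed from any point.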