Bug 1460513
Summary: | [DR] - Hosted-engine VM remains paused even though it has been started on another host as part of recovering from EIO | |
---|---|---|---
Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Elad <ebenahar>
Component: | Agent | Assignee: | Andrej Krejcir <akrejcir>
Status: | CLOSED DUPLICATE | QA Contact: | Artyom <alukiano>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 2.1.1 | CC: | bugs, dfediuck, fgarciad, ylavi
Target Milestone: | ovirt-4.3.0 | Flags: | dfediuck: ovirt-4.3+
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-07-04 12:44:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1230788 | |
Bug Blocks: | 1284364, 1393839, 1534978 | |
Attachments: | | |
This appears to be a known issue: BZ #1393839.

*** This bug has been marked as a duplicate of bug 1393839 ***
Created attachment 1286847: logs from engine and hosts

Description of problem:
The hosted-engine VM was paused on EIO on one host and was then started on another host. After the first host recovered and returned to the up state, the VM remained paused on it.

Version-Release number of selected component (if applicable):
vdsm-4.19.17-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.1-1.el7ev.noarch
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64
sanlock-3.4.0-1.el7.x86_64
selinux-policy-3.13.1-102.el7_3.16.noarch

How reproducible:
1/1

Steps to Reproduce:
Setup topology:
- 1 DC, 1 cluster with 4 hosted-engine deployed hosts. 2 of the hosts have 10ms latency to the rest of the network, simulated with 'tc'.
- 1 hosted-engine storage domain in the DC, along with 1 more iSCSI domain from the same storage server (XtremIO) and 1 NFS domain.

1. Changed the path state of the hosted_storage SD LUN from active to offline on the 2 main-site hosts (the ones without latency) by running echo "offline" > /sys/block/sdc/device/state against the hosted_storage SD LUN, while the HE VM was running on one of them.
2. The hosted-engine VM moved to paused and was started on one of the remaining hosts.
3. Once the engine was accessible again, changed the LUN's path state to 'running' on the disconnected hosts (echo "running" > /sys/block/sdc/device/state). The hosts moved to the up state.

Actual results:
The hosted-engine VM was started on a different host, but it also remained in the paused state on the host that had been disconnected from the LUN, so a qemu process was still held on that host.

Expected results:
The hosted-engine VM process should be terminated on the original host once the VM is started on another host. The VM should not remain paused on one host while it is running on another.
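The path-state toggling in steps 1 and 3 can also be scripted. The following is only a minimal sketch of those two echo commands, not part of the original reproduction: the function name set_path_state and the sysfs_root parameter are hypothetical, and the only states accepted are the two values used in this report. The sysfs_root parameter exists so the function can be exercised against a scratch directory instead of the real /sys/block.

```python
from pathlib import Path

# The two values written in the reproduction steps of this report.
VALID_STATES = {"offline", "running"}


def set_path_state(device: str, state: str, sysfs_root: str = "/sys/block") -> Path:
    """Write `state` to <sysfs_root>/<device>/device/state.

    Equivalent to: echo "<state>" > /sys/block/<device>/device/state
    Returns the path of the file that was written.
    """
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state!r}")
    state_file = Path(sysfs_root) / device / "device" / "state"
    # echo appends a trailing newline, so do the same here.
    state_file.write_text(state + "\n")
    return state_file
```

On a real host this needs root privileges, and the device name (sdc in this report) depends on the host. For example, set_path_state("sdc", "offline") would correspond to step 1 and set_path_state("sdc", "running") to step 3.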
Additional info:

HE VM moving to Paused (green-vdsd), vdsm.log:

2017-06-11 18:17:20,573+0300 INFO (jsonrpc/5) [throttled] Current getAllVmStats: {'2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322': 'Paused'} (throttledlog:105)

HE VM starting on new host (rose11), agent.log:

MainThread::INFO::2017-06-11 18:21:55,506::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host green-vdsd.scl.lab.tlv.redhat.com (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=2818 (Sun Jun 11 18:20:49 2017)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=2837 (Sun Jun 11 18:21:08 2017)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n', 'hostname': 'green-vdsd.scl.lab.tlv.redhat.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'paused'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'a414fe63', 'local_conf_timestamp': 2837, 'host-ts': 2818}

[root@green-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : green-vdsd.scl.lab.tlv.redhat.com
Host ID : 1
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : c86def39
local_conf_timestamp : 4017
Host timestamp : 3999
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=3999 (Sun Jun 11 18:40:30 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=4017 (Sun Jun 11 18:40:48 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : green-vdse.scl.lab.tlv.redhat.com
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : ba982705
local_conf_timestamp : 4035
Host timestamp : 4016
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=4016 (Sun Jun 11 18:40:38 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=4035 (Sun Jun 11 18:40:56 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 3 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rose11.scl.lab.tlv.redhat.com
Host ID : 3
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 53f8e9d0
local_conf_timestamp : 1151108
Host timestamp : 1151048
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1151048 (Sun Jun 11 18:39:06 2017)
    host-id=3
    score=3400
    vm_conf_refresh_time=1151108 (Sun Jun 11 18:40:06 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

--== Host 4 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rose12.scl.lab.tlv.redhat.com
Host ID : 4
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : cdcc15a6
local_conf_timestamp : 1151195
Host timestamp : 1151135
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1151135 (Sun Jun 11 18:39:22 2017)
    host-id=4
    score=3400
    vm_conf_refresh_time=1151195 (Sun Jun 11 18:40:22 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

**The hosts that were disconnected from the hosted_storage LUN are green-vdsd and green-vdse.**
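The condition reported here (one host still holding a paused VM while another host reports a healthy engine) can be checked mechanically from the "Engine status" values above. The sketch below is illustrative only and is not how the HA agent itself decides anything: the function name find_stale_paused_hosts and the REPORTED mapping are hypothetical, and the JSON strings are copied from the hosted-engine --vm-status output in this report.

```python
import json


def find_stale_paused_hosts(engine_status_by_host):
    """Return hosts that hold a paused VM while another host is healthy.

    `engine_status_by_host` maps a hostname to the JSON string shown in the
    "Engine status" field of the hosted-engine --vm-status output.
    """
    parsed = {host: json.loads(raw) for host, raw in engine_status_by_host.items()}
    # A stale paused VM only matters if the engine is healthy somewhere else.
    if not any(status.get("health") == "good" for status in parsed.values()):
        return []
    return sorted(
        host
        for host, status in parsed.items()
        if status.get("vm") == "up" and status.get("detail") == "paused"
    )


# Engine status values copied from the four hosts in this report.
REPORTED = {
    "green-vdsd.scl.lab.tlv.redhat.com": '{"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}',
    "green-vdse.scl.lab.tlv.redhat.com": '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
    "rose11.scl.lab.tlv.redhat.com": '{"health": "good", "vm": "up", "detail": "up"}',
    "rose12.scl.lab.tlv.redhat.com": '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
}

if __name__ == "__main__":
    print(find_stale_paused_hosts(REPORTED))
    # prints ['green-vdsd.scl.lab.tlv.redhat.com']
```

With the values from this report, only green-vdsd is flagged: rose11 reports a healthy engine, and green-vdse, although it was also disconnected, reports the VM as down.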