Bug 1460513 - [DR] - Hosted-engine VM remains paused even though it has been started on another host as part of recovering from EIO
Keywords:
Status: CLOSED DUPLICATE of bug 1393839
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.1.1
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Andrej Krejcir
QA Contact: Artyom
URL:
Whiteboard:
Depends On: rhv_turn_off_autoresume_of_paused_VMs
Blocks: RHV_DR 1393839 1534978
Reported: 2017-06-11 15:51 UTC by Elad
Modified: 2018-07-04 12:44 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-04 12:44:21 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-4.3+


Attachments
logs from engine and hosts (7.23 MB, application/x-gzip)
2017-06-11 15:51 UTC, Elad

Description Elad 2017-06-11 15:51:55 UTC
Created attachment 1286847
logs from engine and hosts

Description of problem:
After the hosted-engine VM is paused on EIO on one host and is then started on another host, the VM remains paused on the original host even after that host has recovered and is back in an up state.

Version-Release number of selected component (if applicable):
vdsm-4.19.17-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.1-1.el7ev.noarch
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64
sanlock-3.4.0-1.el7.x86_64
selinux-policy-3.13.1-102.el7_3.16.noarch


How reproducible:
1/1

Steps to Reproduce:
My setup topology: 
- 1 DC, 1 cluster with 4 hosts that have hosted-engine deployed. 2 of the hosts have 10 ms latency to the rest of the network, simulated with 'tc'.
- 1 hosted-engine storage domain in the DC, along with 1 more iSCSI domain from the same storage server (XtremIO) and 1 NFS domain
1. Changed the path state of the hosted_storage SD LUN from active to offline on the 2 main-site hosts (the ones that don't have latency) by running echo "offline" > /sys/block/sdc/device/state for the hosted_storage SD LUN, while the HE VM was running on one of them
2. The hosted-engine VM moved to paused and was started on one of the remaining hosts
3. Once the engine was accessible again, changed the LUN's path state back to 'running' on the disconnected hosts (by echo "running" > /sys/block/sdc/device/state); the hosts moved to up state
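
The path toggling in steps 1 and 3 can be scripted for repeated runs. The following is a minimal Python sketch and was not part of the original reproduction; the helper name set_scsi_device_state and the sysfs_root parameter are illustrative assumptions, and 'sdc' is simply the device name from this setup. It writes the same values as the echo commands in the steps above.

```python
from pathlib import Path

VALID_STATES = ("offline", "running")


def set_scsi_device_state(device: str, state: str, sysfs_root: str = "/sys/block") -> Path:
    """Write `state` to <sysfs_root>/<device>/device/state (needs root on a real host)."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state {state!r}; expected one of {VALID_STATES}")
    state_file = Path(sysfs_root) / device / "device" / "state"
    if not state_file.exists():
        raise FileNotFoundError(f"{state_file} not found; is {device!r} a SCSI device?")
    # Same effect as: echo "<state>" > /sys/block/<device>/device/state
    state_file.write_text(state + "\n")
    return state_file
```

On a test host, set_scsi_device_state("sdc", "offline") would simulate the path failure and set_scsi_device_state("sdc", "running") would restore it. The sysfs_root parameter exists only so the helper can be exercised against a temporary directory.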

Actual results:
The hosted-engine VM was started on a different host, but it also remained in a paused state on the original host, so a qemu process was still held on the host that had previously been disconnected from the LUN.

Expected results:
The paused hosted-engine VM process on the original host should be terminated once the VM is started on another host. The VM should not remain paused on one host while running on another.

Additional info:


HE VM moving to Paused (green-vdsd) vdsm.log: 

2017-06-11 18:17:20,573+0300 INFO  (jsonrpc/5) [throttled] Current getAllVmStats: {'2eb55ea4-dd9e-42b2-bec7-9eea1cfaf322': 'Paused'} (throttledlog:105)


HE VM starting on new host (rose11) agent.log:

MainThread::INFO::2017-06-11 18:21:55,506::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host green-vdsd.scl.lab.tlv.redhat.com (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=2818 (Sun Jun 11 18:20:49 2017)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=2837 (Sun Jun 11 18:21:08 2017)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n', 'hostname': 'green-vdsd.scl.lab.tlv.redhat.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'paused'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'a414fe63', 'local_conf_timestamp': 2837, 'host-ts': 2818}
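
For anyone going through the attached agent.log, the host metadata in a '(refresh) Host ...' line like the one above can be pulled out programmatically. This is a rough Python sketch, assuming the line is available unwrapped on a single line and that the dict is printed as a Python literal (as it is here); the function names parse_refresh_line and is_stuck_paused are hypothetical and not part of ovirt-hosted-engine-ha.

```python
import ast


def parse_refresh_line(line: str) -> dict:
    """Extract the host metadata dict from an agent.log '(refresh) Host ...' line."""
    start = line.index("{")  # the dict literal starts at the first brace
    data = ast.literal_eval(line[start:].strip())
    # 'extra' is a newline-separated block of key=value pairs
    data["extra"] = dict(
        item.split("=", 1) for item in data["extra"].splitlines() if "=" in item
    )
    return data


def is_stuck_paused(host: dict) -> bool:
    """True when the agent reports the engine VM as present on the host but paused."""
    status = host["engine-status"]
    return status.get("vm") == "up" and status.get("detail") == "paused"
```

Applied to the line above, the parsed 'extra' block gives state=EngineStarting while 'engine-status' still reports the VM as up/paused on green-vdsd, which is the mismatch this bug describes.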



[root@green-vdsd ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : green-vdsd.scl.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c86def39
local_conf_timestamp               : 4017
Host timestamp                     : 3999
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3999 (Sun Jun 11 18:40:30 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=4017 (Sun Jun 11 18:40:48 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : green-vdse.scl.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : ba982705
local_conf_timestamp               : 4035
Host timestamp                     : 4016
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4016 (Sun Jun 11 18:40:38 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=4035 (Sun Jun 11 18:40:56 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rose11.scl.lab.tlv.redhat.com
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 53f8e9d0
local_conf_timestamp               : 1151108
Host timestamp                     : 1151048
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1151048 (Sun Jun 11 18:39:06 2017)
        host-id=3
        score=3400
        vm_conf_refresh_time=1151108 (Sun Jun 11 18:40:06 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 4 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rose12.scl.lab.tlv.redhat.com
Host ID                            : 4
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : cdcc15a6
local_conf_timestamp               : 1151195
Host timestamp                     : 1151135
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1151135 (Sun Jun 11 18:39:22 2017)
        host-id=4
        score=3400
        vm_conf_refresh_time=1151195 (Sun Jun 11 18:40:22 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


**the hosts that got disconnected from the hosted_storage LUN are green-vdsd and green-vdse**
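
The inconsistency in the status output above (Host 1 reports the VM as up/paused while Host 3 reports it as healthy) can be checked for mechanically. The sketch below, in Python, parses the plain-text output of hosted-engine --vm-status as pasted in this report; it assumes that exact layout, and the names parse_vm_status and find_stale_paused_hosts are hypothetical helpers, not part of any oVirt tool.

```python
import json
import re

HOST_HEADER = re.compile(r"^--== Host (\d+) status ==--$")


def parse_vm_status(text: str) -> dict:
    """Map host id -> {field: value} from `hosted-engine --vm-status` text output."""
    hosts, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_HEADER.match(line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
        elif current is not None and " : " in line:
            # Field lines look like "Hostname      : value"
            key, value = line.split(" : ", 1)
            current[key.strip()] = value.strip()
    return hosts


def find_stale_paused_hosts(hosts: dict) -> list:
    """Hostnames reporting a paused engine VM while another host reports it healthy."""
    statuses = {
        h["Hostname"]: json.loads(h["Engine status"])
        for h in hosts.values()
        if "Hostname" in h and "Engine status" in h
    }
    running_elsewhere = any(s.get("health") == "good" for s in statuses.values())
    if not running_elsewhere:
        return []
    return [
        name
        for name, s in statuses.items()
        if s.get("vm") == "up" and s.get("detail") == "paused"
    ]
```

Fed the four-host output above, find_stale_paused_hosts(parse_vm_status(output)) returns only green-vdsd.scl.lab.tlv.redhat.com, the host still holding the paused qemu process.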

Comment 1 Elad 2017-06-11 21:28:00 UTC
This appears to be a known issue: BZ #1393839

Comment 3 Andrej Krejcir 2018-07-04 12:44:21 UTC

*** This bug has been marked as a duplicate of bug 1393839 ***

