Bug 1205225 - [RHEV-H] hosted-engine: [iSCSI support] ha-agent service becomes 'dead but subsys locked'
Summary: [RHEV-H] hosted-engine: [iSCSI support] ha-agent service becomes 'dead but subsys locked'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.5.1
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.4
Assignee: Ala Hino
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: 1059435 1198639 1250199
 
Reported: 2015-03-24 13:19 UTC by Elad
Modified: 2016-02-10 19:53 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-06 08:01:26 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
/var/log/ from the host (12.73 MB, application/x-gzip), 2015-03-24 13:19 UTC, Elad
logs2 (601.65 KB, application/x-gzip), 2015-03-25 11:35 UTC, Elad
logs 6.4.15 (1.18 MB, application/x-gzip), 2015-04-06 14:50 UTC, Elad

Description Elad 2015-03-24 13:19:46 UTC
Created attachment 1005846 [details]
/var/log/ from the host

Description of problem:
On Hosted-engine installed over RHEV-H:
I disconnected the hosted-engine host from the iSCSI storage server where the engine VM's disk is located by unmapping the LUN. Then I reconnected the host by mapping the LUN back on the storage server. The HostedEngine VM was not resumed after the LUN was exposed to the host again.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150304.0.el6ev)
vdsm-4.16.8.1-7.el6ev.x86_64
libvirt-0.10.2-46.el6_6.3.x86_64
qemu-img-rhev-0.12.1.2-2.446.el6.x86_64
gpxe-roms-qemu-0.9.7-6.12.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.446.el6.x86_64
sanlock-2.8-1.el6.x86_64
ovirt-node-selinux-3.2.1-9.el6.noarch
selinux-policy-3.7.19-260.el6_6.2.noarch


How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted-engine using iSCSI over RHEV-H
2. Once deployment is done, unmap the LUN where the engine VM's disk is located from the host. Wait for the VM to enter the paused state (check 'virsh -r list'; a small polling sketch follows these steps).
3. When the VM is paused, map the LUN back to the host.
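
For reference, a minimal polling sketch for step 2, assuming the domain is named 'HostedEngine' and only stock virsh plus Python on the host (illustrative only, not part of any oVirt component):

# Poll 'virsh -r list' until the HostedEngine domain is reported as paused.
import subprocess
import time

def wait_for_paused(vm_name='HostedEngine', timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        proc = subprocess.Popen(['virsh', '-r', 'list'],
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        for line in out.decode().splitlines():
            if vm_name in line and 'paused' in line:
                return True
        time.sleep(5)
    return False

print('VM paused: %s' % wait_for_paused())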

Actual results:
The LUN is exposed back to the host:

multipath -ll:

3514f0c5447601046 dm-6 XtremIO,XtremApp
size=45G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 5:0:0:2 sde 8:64 failed faulty running


But the VM is not resumed from the paused state. I waited for one day.

/var/log/vdsm/vdsm.log.5.xz:Thread-68227::DEBUG::2015-03-23 07:29:13,762::BindingXMLRPC::1140::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'displayInfo': [{'tlsPort': '-1', 'ipAddress': '0', 'type': 'vnc', 'port': u'5900'}], 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '24392', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '0', 'balloonInfo': {}, 'pauseCode': 'EIO', 'network': {u'vnet0': {'macAddr': '00:16:3E:76:D5:D5', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmType': 'kvm', 'cpuUser': '-0.24', 'elapsedTime': '3092', 'vmJobs': {}, 'cpuSys': '0.27', 'appsList': [], 'displayType': 'vnc', 'vcpuCount': '2', 'clientIp': '', 'hash': '-8319244157651588328', 'vmId': '32911199-6e36-46da-8296-7912dce890fd', 'displayIp': '0', 'vcpuPeriod': 100000L, 'displayPort': u'5900', 'vcpuQuota': '-1', 'kvmEnable': 'true', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '26843545600', 'writeLatency': '0', 'imageID': 'c488e544-c788-44ad-9507-08f02b755664', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '26843545600', 'writeRate': '0.00'}, u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '1.69', 'username': 'Unknown', 'status': 'Paused', 'guestCPUCount': -1, 'ioTune': [], 'guestIPs': ''}]}


Expected results:
The HostedEngine VM should be resumed automatically after the EIO condition is resolved.


Additional info: /var/log/ from the host

Comment 1 Elad 2015-03-24 14:24:28 UTC
It seems that ovirt-ha-agent doesn't function:

[root@green-vdsa ~]# /etc/init.d/ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked


From /var/log/ovirt-hosted-engine-ha/agent.log :


MainThread::WARNING::2015-03-24 14:16:14,591::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::INFO::2015-03-24 14:16:14,655::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: green-vdsa.qa.lab.tlv.redhat.com
MainThread::ERROR::2015-03-24 14:16:15,116::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'path to storage domain fd6aafd6-bac4-44bf-b41c-50df720a034d not found in /rhev/data-center/mnt' - trying to restart agent
MainThread::WARNING::2015-03-24 14:16:20,123::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::INFO::2015-03-24 14:16:20,183::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: green-vdsa.qa.lab.tlv.redhat.com
MainThread::ERROR::2015-03-24 14:16:20,513::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'path to storage domain fd6aafd6-bac4-44bf-b41c-50df720a034d not found in /rhev/data-center/mnt' - trying to restart agent
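
For context, the failing lookup is essentially a scan of /rhev/data-center/mnt for a directory named after the storage domain UUID. The sketch below is an assumption based only on the error text, not the actual ovirt_hosted_engine_ha code; it shows why a missing vdsm mount link makes the agent raise and restart:

# Rough sketch of the failing lookup (inferred from the error message above,
# not taken from ovirt_hosted_engine_ha.lib.storage_backends).
import os

MNT_ROOT = '/rhev/data-center/mnt'

class BackendFailureException(Exception):
    pass

def get_domain_path(sd_uuid):
    # vdsm creates <mount-point>/<sd_uuid> directories for connected domains;
    # if no mount point contains the UUID, the domain is not set up on this host.
    for mount in os.listdir(MNT_ROOT):
        candidate = os.path.join(MNT_ROOT, mount, sd_uuid)
        if os.path.isdir(candidate):
            return candidate
    raise BackendFailureException(
        'path to storage domain %s not found in %s' % (sd_uuid, MNT_ROOT))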

Comment 2 Adam Litke 2015-03-24 15:57:23 UTC
I have found the place in the logs where the VM goes -EIO:

2015-03-23 07:14:43.402|INFO|vm.Vm|libvirtEventLoop|vm|4881|_onIOError|vmId=`32911199-6e36-46da-8296-7912dce890fd`::abnormal vm stop device virtio-disk0 error eio

However, it does not appear that the domain ever becomes Valid again after this point:

$ sqlite3 logdb "select * from messages where text glob '*to Valid*' and timestamp > '2015-03-23 07:14:43.402' order by timestamp"
[alitke@brewer vdsm]$

In the log I do see this warning repeated:
2015-03-23 07:14:44.125|DEBUG|Storage.Misc.excCmd|Thread-66470|lvm|288|cmd|SUCCESS: <err> = '  /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 48318316544: Input/output error\n  /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 48318373888: Input/output error\n  /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3514f0c5447601046 was disabled\n'; <rc> = 0

I wonder if the underlying device was permanently disabled, so that even though you remapped the LUN it won't start working again without first reconnecting.
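
One way to test this theory is to force the host to re-probe the remapped LUN before expecting an automatic resume. A hedged sketch, assuming the stock iscsiadm/multipath tools shown elsewhere in this bug and root access (not an oVirt-specific procedure):

# Re-probe the iSCSI sessions and reload multipath maps after the LUN is
# mapped back on the storage server.
import subprocess

def rescan_iscsi_and_multipath():
    # Rescan every logged-in iSCSI session for new or returned LUNs.
    subprocess.call(['iscsiadm', '-m', 'session', '--rescan'])
    # Reload the multipath maps so previously disabled paths are re-evaluated.
    subprocess.call(['multipath', '-r'])
    # Show the resulting topology for the device in question.
    subprocess.call(['multipath', '-ll'])

rescan_iscsi_and_multipath()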

Comment 3 Elad 2015-03-24 17:20:01 UTC
Hi Adam,

After re-mapping the device, it appeared to device-mapper as 'active ready running'.


3514f0c5447601046 dm-6 XtremIO,XtremApp
size=45G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 5:0:0:2 sde 8:64 active ready running


Also, I rebooted the host and it didn't help.

Comment 4 Elad 2015-03-25 09:11:25 UTC
Since the link to the storage domain doesn't exist under /rhev/data-center, the VM cannot be started (manually).



[root@green-vdsa ~]# hosted-engine --vm-start
VM exists and is down, destroying it
Machine destroyed

32911199-6e36-46da-8296-7912dce890fd
        Status = WaitForLaunch
        nicModel = rtl8139,pv
        emulatedMachine = rhel6.5.0
        pid = 0
        vmName = HostedEngine
        devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': 'e41722aa-d639-4a4e-ab3f-ea4687348f04', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '03bc33f9-f62a-4eb0-b97e-83a97565bb53', 'imageID': 'c488e544-c788-44ad-9507-08f02b755664', 'specParams': {}, 'readonly': 'false', 'domainID': 'fd6aafd6-bac4-44bf-b41c-50df720a034d', 'optional': 'false', 'deviceId': 'c488e544-c788-44ad-9507-08f02b755664', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3E:76:D5:D5', 'linkActive': 'true', 'network': 'rhevm', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '320a30f0-8e25-4f10-83cf-4c9a3329f103', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '746c957f-a33f-40f5-9554-f10902d3c87a', 'alias': 'console0'}]
        smp = 2
        vmType = kvm
        display = vnc
        displaySecurePort = -1
        memSize = 4096
        displayPort = -1
        cpuType = Conroe
        clientIp = 
        guestDiskMapping = {}
        displayIp = 0
        spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir



[root@green-vdsa ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 155, in get_all_host_stats
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 98, in get_all_stats
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 175, in _configure_broker_conn
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'iscsi', 'sd_uuid': 'fd6aafd6-bac4-44bf-b41c-50df720a034d'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>

Comment 5 Elad 2015-03-25 11:35:13 UTC
Created attachment 1006297 [details]
logs2

It might be that the issue with the ha-agent is much more critical than it first seemed.

The ovirt-ha-agent service doesn't function at all and gets into the 'dead but subsys locked' state.
I tested another scenario, in which I simply rebooted the host that the hosted-engine VM was running on. When the host came up, the service started and after a few seconds became 'dead but subsys locked'.

I tested this on a different environment that had not been used by another hosted engine before. (I didn't do anything related to storage, just a reboot.)



Mar 25 09:23:50 green-vdsb ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.ConnectionHandler ERROR Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=iscsi sd_uuid=1b3447ad-e875-4ec1-aa0e-9fa4b211f71b'#012Traceback (most recent call last):#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 65, in set_storage_domain#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 434, in connect#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 107, in get_domain_path#012BackendFailureException: path to storage domain 1b3447ad-e875-4ec1-aa0e-9fa4b211f71b not found in /rhev/data-center/mnt

Comment 6 Adam Litke 2015-03-25 17:50:13 UTC
According to /var/log/messages, the host starts to have iSCSI errors at 09:23

Mar 25 09:23:27 green-vdsb iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)


This corresponds to the vdsm log where the connectStorageServer verb (although successful) does not result in any new storage domains being found:

Thread-13::DEBUG::2015-03-25 09:23:10,979::hsm::2387::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
Thread-13::DEBUG::2015-03-25 09:23:10,979::hsm::2443::Storage.HSM::(connectStorageServer) knownSDs: {}


Because of this, no links are created under /rhev/data-center/mnt.  And since the ha-agent has no luck connecting storage after 10 attempts, it gives up.
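
The 'Restarting agent, attempt N' lines in comment 1 are consistent with a bounded restart loop of roughly the following shape. This is an illustrative sketch only, not the actual agent code:

# Illustrative sketch of a bounded restart loop: each failed storage setup
# triggers a restart, and once the retry budget is exhausted the daemon exits
# while /var/lock/subsys/ovirt-ha-agent still exists, which is what
# 'dead but subsys locked' reports.
import time

MAX_ATTEMPTS = 10
RETRY_DELAY = 5  # seconds, matching the ~5s gap between the log lines above

def run_agent_with_retries(run_once):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            run_once()  # raises if the storage domain path is missing
            return
        except Exception as e:
            print("Error: %s - trying to restart agent" % e)
            print("Restarting agent, attempt '%d'" % attempt)
            time.sleep(RETRY_DELAY)
    # Out of attempts: the process exits and the subsys lock is left behind.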


It looks like there is a problem with the iSCSI storage or the network connection.  Both vdsm and hosted-engine-ha-agent seem to be working as designed.  I am seeing lots of /var/log/messages entries about eth0 going up and down.

Elad, can you ensure that the host networking is behaving properly and that connectivity to the storage server is reliable?

Comment 7 Elad 2015-03-26 07:47:44 UTC
I can't see any network issue with the connection to the storage server. eth0 on this host is down; eth1 is the physical interface attached to the 'rhevm' bridge, through which the connection to the storage server is made.

Comment 8 Elad 2015-03-26 13:19:22 UTC
Sorry, on this host eth0 is the one attached to the 'rhevm' bridge, so please ignore what I wrote in the previous comment.
In any case, I don't see any network malfunction on this host either.

Comment 9 Allon Mureinik 2015-03-30 09:05:31 UTC
Is this a regression? Does this happen in 3.5.0 too?

Comment 10 Allon Mureinik 2015-03-30 09:06:41 UTC
Pushing out, since the PRD item this blocks is also targeted for 3.5.3

Comment 11 Elad 2015-03-30 10:44:02 UTC
(In reply to Allon Mureinik from comment #9)
> Is this a regression? Does this happen in 3.5.0 too?

AFAIK, hosted-engine over RHEV-H wasn't available in 3.5.0

Comment 12 Allon Mureinik 2015-03-30 15:09:57 UTC
(In reply to Elad from comment #11)
> (In reply to Allon Mureinik from comment #9)
> > Is this a regression? Does this happen in 3.5.0 too?
> 
> AFAIK, hosted-engine over RHEV-H wasn't available in 3.5.0
Is this RHEV-H specific, or does it reproduce in RHEL too?

Comment 13 Elad 2015-03-31 08:06:18 UTC
(In reply to Allon Mureinik from comment #12)
> Is this RHEV-H specific, or does it reproduce in RHEL too?

Tested on RHEL7.1, the bug doesn't reproduce.
After the hosted-engine VM entered the paused state on EIO due to unmapping the LUN where the hosted-engine VM image is located, I mapped the LUN back and the VM resumed.

ovirt-hosted-engine-setup-1.2.1-9.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-5.el7ev.noarch
vdsm-4.16.8.1-6.el7ev.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64

Comment 14 Fabian Deutsch 2015-04-02 13:00:19 UTC
Elad, could you please test this bug with the same package versions as given in the description on RHEL?

Currently I do not see anything indicating that this is RHEV-H specific, so I wonder if a specific package combination is triggering this bug.

Comment 15 Elad 2015-04-06 14:50:12 UTC
Created attachment 1011392 [details]
logs 6.4.15

(In reply to Fabian Deutsch from comment #14)
> Elad, could you please test this bug with the same package versions as given
> in the description on RHEL?
> 
> Currently I do not see anything highlighting that it is RHEV-H specific, so
> I wonder if it's a specific package combination triggering this bug.

Checked with RHEL6.6. 
vdsm-4.16.13-1.el6ev.x86_64
ovirt-hosted-engine-setup-1.2.2-3.el6ev.noarch
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
selinux-policy-3.7.19-260.el6_6.2.noarch
sanlock-2.8-1.el6.x86_64
libvirt-0.10.2-46.el6_6.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6_6.1.x86_64


The ha-agent service continues to function when the LUN where the engine image is located is unmapped from the host.

Attaching the logs

Comment 16 Adam Litke 2015-06-08 17:36:31 UTC
Elad,

Since the feeling over two months ago was that this bug may have been caused by a specific bad combination of packages on RHEV-H, could you retest with the latest RHEV-H to see whether the problem persists?

Comment 17 Elad 2015-06-09 11:00:22 UTC
Adam, 

Currently there is no new HE RHEV-H build to test on. According to Gil, we will have it in the next few weeks.

Comment 18 Allon Mureinik 2015-06-18 15:40:33 UTC
(In reply to Elad from comment #17)
> Adam, 
> 
> Currently there is no new HE RHEV-H build to test on. According to Gil, we
> will have it in the next few weeks.
Reducing severity till we can prove this still happens.

Comment 19 Allon Mureinik 2015-06-24 12:52:42 UTC
(In reply to Elad from comment #17)
> Adam, 
> 
> Currently there is no new HE RHEV-H build to test on. According to Gil, we
> will have it in the next few weeks.
Do we have one now?

Comment 20 Elad 2015-06-24 12:55:12 UTC
Yes we have, I'll check that again.

Comment 21 Elad 2015-07-01 11:40:11 UTC
Apparently we don't have RHEV-H 6.7 with HE support yet; I'll update when we have it.

Comment 24 Elad 2015-07-04 23:09:58 UTC
(In reply to Allon Mureinik from comment #19)
> (In reply to Elad from comment #17)
> > Adam, 
> > 
> > Currently there is no new HE RHEV-H build to test on. According to Gil, we
> > will have it in the next few weeks.
> Do we have one now?

Yes. Tested using:
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150609.0.el6ev)
vdsm-4.16.20-1.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.6-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.4-2.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-15.0.el6ev.noarch
libvirt-0.10.2-54.el6.x86_64
rhevm-3.5.4-1.1.el6ev.noarch

The problem is fixed; the hosted-engine VM returns to functioning after its connectivity with the storage is restored.

Both ha-broker and ha-agent are running as expected.

Comment 25 Allon Mureinik 2015-07-05 06:10:20 UTC
So can this BZ just be moved to VERIFIED with the given versions?

Comment 26 Elad 2015-07-06 08:01:26 UTC
Closing as CURRENTRELEASE. 


The bug is fixed on RHEV-H 6.7, based on comment #24.

