Created attachment 1005846 [details]
/var/log/ from the host

Description of problem:
On a hosted engine installed over RHEV-H: I disconnected the hosted-engine host from the iSCSI storage server where the engine's VM disk is located by unmapping the LUN. Then I reconnected the host by mapping the LUN back on the storage server. The HostedEngine VM was not resumed after the LUN was exposed back to the host.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor 6.6 (20150304.0.el6ev)
vdsm-4.16.8.1-7.el6ev.x86_64
libvirt-0.10.2-46.el6_6.3.x86_64
qemu-img-rhev-0.12.1.2-2.446.el6.x86_64
gpxe-roms-qemu-0.9.7-6.12.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.446.el6.x86_64
sanlock-2.8-1.el6.x86_64
ovirt-node-selinux-3.2.1-9.el6.noarch
selinux-policy-3.7.19-260.el6_6.2.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted-engine using iSCSI over RHEV-H.
2. Once deployment is done, unmap the host from the LUN where the engine's VM disk is located. Wait for the VM to enter the paused state (check 'virsh -r list').
3. When the VM is paused, map the LUN back to the host.

Actual results:
The LUN is exposed back to the host:

multipath -ll:
3514f0c5447601046 dm-6 XtremIO,XtremApp
size=45G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 5:0:0:2 sde 8:64 failed faulty running

But the VM is not resumed from the paused state. I waited for 1 day.

/var/log/vdsm/vdsm.log.5.xz:Thread-68227::DEBUG::2015-03-23 07:29:13,762::BindingXMLRPC::1140::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'displayInfo': [{'tlsPort': '-1', 'ipAddress': '0', 'type': 'vnc', 'port': u'5900'}], 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '24392', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '0', 'balloonInfo': {}, 'pauseCode': 'EIO', 'network': {u'vnet0': {'macAddr': '00:16:3E:76:D5:D5', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmType': 'kvm', 'cpuUser': '-0.24', 'elapsedTime': '3092', 'vmJobs': {}, 'cpuSys': '0.27', 'appsList': [], 'displayType': 'vnc', 'vcpuCount': '2', 'clientIp': '', 'hash': '-8319244157651588328', 'vmId': '32911199-6e36-46da-8296-7912dce890fd', 'displayIp': '0', 'vcpuPeriod': 100000L, 'displayPort': u'5900', 'vcpuQuota': '-1', 'kvmEnable': 'true', 'disks': {u'vda': {'readLatency': '0', 'apparentsize': '26843545600', 'writeLatency': '0', 'imageID': 'c488e544-c788-44ad-9507-08f02b755664', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '26843545600', 'writeRate': '0.00'}, u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '1.69', 'username': 'Unknown', 'status': 'Paused', 'guestCPUCount': -1, 'ioTune': [], 'guestIPs': ''}]}

Expected results:
The HostedEngine VM should be resumed automatically after the EIO condition clears.

Additional info:
/var/log/ from the host
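For reference, a minimal way to follow this reproduction from the host side is sketched below. The device name and VM ID are the ones from the logs above; the vdsClient call assumes vdsm is configured with SSL (-s 0) and may need adjusting for the local setup.

# 1. After unmapping the LUN, wait for the guest to hit EIO and pause
virsh -r list                      # the HostedEngine VM should show state "paused"
# 2. After mapping the LUN back, confirm the path recovered
multipath -ll 3514f0c5447601046    # expect "active ready running", not "failed faulty"
# 3. Check whether vdsm still reports the VM as Paused with pauseCode EIO
vdsClient -s 0 getVmStats 32911199-6e36-46da-8296-7912dce890fd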
It seems that ovirt-ha-agent is not functioning:

[root@green-vdsa ~]# /etc/init.d/ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked

From /var/log/ovirt-hosted-engine-ha/agent.log:

MainThread::WARNING::2015-03-24 14:16:14,591::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::INFO::2015-03-24 14:16:14,655::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: green-vdsa.qa.lab.tlv.redhat.com
MainThread::ERROR::2015-03-24 14:16:15,116::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'path to storage domain fd6aafd6-bac4-44bf-b41c-50df720a034d not found in /rhev/data-center/mnt' - trying to restart agent
MainThread::WARNING::2015-03-24 14:16:20,123::agent::175::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::INFO::2015-03-24 14:16:20,183::hosted_engine::222::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: green-vdsa.qa.lab.tlv.redhat.com
MainThread::ERROR::2015-03-24 14:16:20,513::agent::172::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'path to storage domain fd6aafd6-bac4-44bf-b41c-50df720a034d not found in /rhev/data-center/mnt' - trying to restart agent
I have found the place in the logs where the VM goes -EIO:

2015-03-23 07:14:43.402|INFO|vm.Vm|libvirtEventLoop|vm|4881|_onIOError|vmId=`32911199-6e36-46da-8296-7912dce890fd`::abnormal vm stop device virtio-disk0 error eio

However, it does not appear that the domain ever becomes Valid again after this point:

$ sqlite3 logdb "select * from messages where text glob '*to Valid*' and timestamp > '2015-03-23 07:14:43.402' order by timestamp"
[alitke@brewer vdsm]$

In the log I do see this warning repeated:

2015-03-23 07:14:44.125|DEBUG|Storage.Misc.excCmd|Thread-66470|lvm|288|cmd|SUCCESS: <err> = ' /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 48318316544: Input/output error\n /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 48318373888: Input/output error\n /dev/mapper/3514f0c5447601046: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/3514f0c5447601046 was disabled\n'; <rc> = 0

I wonder whether the underlying device was permanently disabled, so that even though you remapped the LUN it won't start working again without first reconnecting.
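A hedged way to double-check from the host whether the device really recovered after the remap (the device name is the one from the log above; this is only a diagnostic sketch, not a proposed fix):

# Rescan all iSCSI sessions so the kernel re-reads the LUN state
iscsiadm -m session --rescan
# Reload the multipath maps and confirm the path is no longer "failed faulty"
multipath -r
multipath -ll 3514f0c5447601046
# Try a direct read from the multipath device to rule out a stale/disabled path
dd if=/dev/mapper/3514f0c5447601046 of=/dev/null bs=4096 count=1 iflag=direct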
Hi Adam,

After re-mapping the device, it appeared to device-mapper as 'active ready running':

3514f0c5447601046 dm-6 XtremIO,XtremApp
size=45G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 5:0:0:2 sde 8:64 active ready running

Also, I rebooted the host and it didn't help.
Since the link to the storage domain doesn't exist under /rhev/data-center, the VM cannot be started (manually).

[root@green-vdsa ~]# hosted-engine --vm-start
VM exists and is down, destroying it
Machine destroyed

32911199-6e36-46da-8296-7912dce890fd
Status = WaitForLaunch
nicModel = rtl8139,pv
emulatedMachine = rhel6.5.0
pid = 0
vmName = HostedEngine
devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': 'e41722aa-d639-4a4e-ab3f-ea4687348f04', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'poolID': '00000000-0000-0000-0000-000000000000', 'volumeID': '03bc33f9-f62a-4eb0-b97e-83a97565bb53', 'imageID': 'c488e544-c788-44ad-9507-08f02b755664', 'specParams': {}, 'readonly': 'false', 'domainID': 'fd6aafd6-bac4-44bf-b41c-50df720a034d', 'optional': 'false', 'deviceId': 'c488e544-c788-44ad-9507-08f02b755664', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3E:76:D5:D5', 'linkActive': 'true', 'network': 'rhevm', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '320a30f0-8e25-4f10-83cf-4c9a3329f103', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '746c957f-a33f-40f5-9554-f10902d3c87a', 'alias': 'console0'}]
smp = 2
vmType = kvm
display = vnc
displaySecurePort = -1
memSize = 4096
displayPort = -1
cpuType = Conroe
clientIp =
guestDiskMapping = {}
displayIp = 0
spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir

[root@green-vdsa ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 155, in get_all_host_stats
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 98, in get_all_stats
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 175, in _configure_broker_conn
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'iscsi', 'sd_uuid': 'fd6aafd6-bac4-44bf-b41c-50df720a034d'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
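For completeness, the missing link can be checked directly on the host; a quick sketch, using the domain UUID from the error above:

# The hosted-engine tools look the storage domain up under /rhev/data-center/mnt
ls -l /rhev/data-center/mnt/
# Look specifically for the hosted-engine storage domain UUID
# (for block/iSCSI domains vdsm normally links it in a blockSD subdirectory)
find /rhev/data-center/mnt -maxdepth 2 -name fd6aafd6-bac4-44bf-b41c-50df720a034d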
Created attachment 1006297 [details]
logs2

It may be that the issue with the ha-agent is more serious than it first seemed. The ovirt-ha-agent service doesn't function at all and ends up in the 'dead but subsys locked' state.

I tested another scenario, in which I simply rebooted the host the hosted-engine VM was running on. When the host came up, the service was started and after a few seconds became 'dead but subsys locked'. I tested it on a different environment that had not been used by another hosted engine before. (I didn't do anything related to storage, just the reboot.)

Mar 25 09:23:50 green-vdsb ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.ConnectionHandler ERROR Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=iscsi sd_uuid=1b3447ad-e875-4ec1-aa0e-9fa4b211f71b'#012Traceback (most recent call last):#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 65, in set_storage_domain#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 434, in connect#012 File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 107, in get_domain_path#012BackendFailureException: path to storage domain 1b3447ad-e875-4ec1-aa0e-9fa4b211f71b not found in /rhev/data-center/mnt
According to /var/log/messages, the host starts to have iSCSI errors at 09:23:

Mar 25 09:23:27 green-vdsb iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)

This corresponds to the vdsm log, where the connectStorageServer verb (although successful) does not result in any new storage domains being found:

Thread-13::DEBUG::2015-03-25 09:23:10,979::hsm::2387::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
Thread-13::DEBUG::2015-03-25 09:23:10,979::hsm::2443::Storage.HSM::(connectStorageServer) knownSDs: {}

Because of this, no links are created under /rhev/data-center/mnt. And since the ha-agent has no luck connecting to the storage after 10 attempts, it gives up.

It looks like there is a problem with the iSCSI storage or the network connection. Both vdsm and the hosted-engine ha-agent seem to be working as designed. I am seeing lots of /var/log/messages entries about eth0 going up and down. Elad, can you ensure that the host networking is behaving properly and that the connectivity to the storage server is reliable?
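A few checks that could confirm or rule out a connectivity problem on the host (a minimal sketch; the interface name is taken from the comment above and may differ on the actual setup):

# Are the iSCSI sessions to the storage server established and logged in?
iscsiadm -m session -P 1
# Is the interface carrying the storage traffic up?
ip link show eth0
# How often has the link flapped according to the system log?
grep -ci 'eth0.*link' /var/log/messages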
I can't see any network issue with the connection to the storage server. eth0 on this host is down; eth1 is the physical interface attached to the 'rhevm' bridge, through which the connection to the storage server is made.
Sorry, on this host eth0 is the one attached to the 'rhevm' bridge, so please ignore what I wrote in the previous comment. In any case, I don't see any network malfunction on this host either.
Is this a regression? Does this happen in 3.5.0 too?
Pushing out, since the PRD item this blocks is also targeted for 3.5.3
(In reply to Allon Mureinik from comment #9)
> Is this a regression? Does this happen in 3.5.0 too?

AFAIK, hosted-engine over RHEV-H wasn't available in 3.5.0
(In reply to Elad from comment #11)
> (In reply to Allon Mureinik from comment #9)
> > Is this a regression? Does this happen in 3.5.0 too?
>
> AFAIK, hosted-engine over RHEV-H wasn't available in 3.5.0

Is this RHEV-H specific, or does it reproduce in RHEL too?
(In reply to Allon Mureinik from comment #12)
> Is this RHEV-H specific, or does it reproduce in RHEL too?

Tested on RHEL 7.1; the bug does not reproduce there. After the hosted-engine VM entered the paused state on EIO due to unmapping the LUN where the hosted-engine VM image is located, I mapped the LUN back and the VM resumed.

ovirt-hosted-engine-setup-1.2.1-9.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-5.el7ev.noarch
vdsm-4.16.8.1-6.el7ev.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64
Elad, could you please test this bug on RHEL with the same package versions as given in the description?

Currently I do not see anything indicating that it is RHEV-H specific, so I wonder whether a specific package combination is triggering this bug.
Created attachment 1011392 [details]
logs 6.4.15

(In reply to Fabian Deutsch from comment #14)
> Elad, could you please test this bug on RHEL with the same package versions
> as given in the description?
>
> Currently I do not see anything indicating that it is RHEV-H specific, so I
> wonder whether a specific package combination is triggering this bug.

Checked with RHEL 6.6:

vdsm-4.16.13-1.el6ev.x86_64
ovirt-hosted-engine-setup-1.2.2-3.el6ev.noarch
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
selinux-policy-3.7.19-260.el6_6.2.noarch
sanlock-2.8-1.el6.x86_64
libvirt-0.10.2-46.el6_6.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6_6.1.x86_64

The ha-agent service continues to function when the LUN where the engine image is located is unmapped from the host. Attaching the logs.
Elad,

Since the feeling over 2 months ago was that this bug may have been caused by a specific bad combination of packages on RHEV-H, could you retest with the latest RHEV-H to see whether the problem persists?
Adam,

Currently there is no new HE RHEV-H build to test on. According to Gil, we will have it in the next few weeks.
(In reply to Elad from comment #17)
> Adam,
>
> Currently there is no new HE RHEV-H build to test on. According to Gil, we
> will have it in the next few weeks.

Reducing severity till we can prove this still happens.
(In reply to Elad from comment #17)
> Adam,
>
> Currently there is no new HE RHEV-H build to test on. According to Gil, we
> will have it in the next few weeks.

Do we have one now?
Yes, we have. I'll check that again.
Apparently we don't have a RHEV-H 6.7 build with HE support yet; I'll update when we have it.
(In reply to Allon Mureinik from comment #19)
> (In reply to Elad from comment #17)
> > Adam,
> >
> > Currently there is no new HE RHEV-H build to test on. According to Gil, we
> > will have it in the next few weeks.
> Do we have one now?

Yes. Tested using:

Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150609.0.el6ev)
vdsm-4.16.20-1.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.6-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.4-2.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-15.0.el6ev.noarch
libvirt-0.10.2-54.el6.x86_64
rhevm-3.5.4-1.1.el6ev.noarch

The problem is fixed: the hosted-engine VM resumes operation after its connectivity to the storage is restored. Both the ha-broker and the ha-agent are running as expected.
So can this BZ just be moved to VERIFIED with the given versions?
Closing as CURRENTRELEASE. The bug is fixed on RHEV-H 6.7, based on comment #24