Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1098285

Summary: VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nikolai Sednev <nsednev>
Component: ovirt-hosted-engine-ha
Assignee: Jiri Moskovcak <jmoskovc>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Priority: unspecified
Version: 3.4.0
CC: alukiano, amureini, bazulay, dfediuck, ecohen, eedri, fsimonce, iheim, jmoskovc, lpeer, lsurette, mavital, nsednev, scohen, sherold, tdosek, tnisan, wdaniel, yeylon
Keywords: TestOnly
Target Release: 3.4.4
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version: av13
Doc Type: Bug Fix
Last Closed: 2014-12-03 09:55:52 UTC
Type: Bug
oVirt Team: SLA
Attachments:
  logs of agent, broker, and vdsm from both hosts, and engine logs from the HE VM (flags: none)
  screenshot (flags: none)

Description Nikolai Sednev 2014-05-15 16:09:15 UTC
Created attachment 896001 [details]
logs of agent, broker, and vdsm from both hosts, and engine logs from the HE VM

Description of problem:
Getting "VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243." error, while following steps from TCMS https://tcms.engineering.redhat.com/run/136034/#caserun_5038732.

Version-Release number of selected component (if applicable):
libvirt-0.10.2-29.el6_5.7.x86_64
sanlock-2.8-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.9.x86_64
ovirt-hosted-engine-ha-1.1.2-3.el6ev.noarch
vdsm-4.14.7-2.el6ev.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Assemble an HE setup containing 2 hosts and 2 storage domains, one of which is an imported NFS ISO domain.
2. Block the connection to the non-master storage domain (the ISO domain) on both hosts: "iptables -I INPUT -s 10.35.64.102 -j DROP" (a scripted version is sketched below).
3. Expect to see one of the hosts going to "Inactive" state and the error message "VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243." in the engine's UI.
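
To make the block/unblock step repeatable it can be scripted. The following is a minimal sketch; the IP address is the one from step 2, while the helper names are purely illustrative:

import subprocess

ISO_SD_IP = "10.35.64.102"  # the imported NFS/ISO storage domain from step 2
RULE = ["INPUT", "-s", ISO_SD_IP, "-j", "DROP"]

def block_iso_domain():
    # Same as step 2: insert a DROP rule at the top of the INPUT chain
    subprocess.check_call(["iptables", "-I"] + RULE)

def unblock_iso_domain():
    # Delete the exact same rule once the test is done
    subprocess.check_call(["iptables", "-D"] + RULE)

Run it as root on both hosts; deleting the rule afterwards matters, because comment 2 below notes that one host stayed Inactive even after the rules were removed.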

Actual results:
"VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243." and I saw VM on both hosts instead of seeing it only on one:

Rose:
[root@rose05 ~]# vdsClient -s 0 list

68dabad2-3ee6-4830-a1c2-a2c2af019428
        Status = Down               
        nicModel = rtl8139,pv       
        exitMessage = internal error Failed to acquire lock: error -243
        emulatedMachine = rhel6.5.0                                    
        pid = 0                                                        
        displayIp = 0                                                  
        devices = [{'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '50eeca54-8ea5-4f13-bf68-8d914352916f', 'alias': 'console0'}, {'device': 'memballoon', 'specParams': {'model': 'none'}, 'type': 'balloon'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:61:02:bd', 'linkActive': 'true', 'network': 'rhevm', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': '53143a7e-a98e-4301-bd14-34808cceea6e', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '506dbdd2-e80f-4fa7-b5c6-6c574ff86665', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '26843545600', 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'readonly': 'false', 'shared': 'exclusive', 'truesize': '5840838656', 'type': 'disk', 'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volumeInfo': {'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'leasePath': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166.lease', 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'path': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166'}, 'format': 'raw', 'deviceId': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'path': '/var/run/vdsm/storage/b1cf00e3-982c-424a-827c-95984a7d7d2f/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166', 'propagateErrors': 'off', 'optional': 'false', 'bootOrder': '1', 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'specParams': {}, 'volumeChain': [{'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'leasePath': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166.lease', 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'path': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166'}]}]                                  
        smp = 2                                                                                                                                            
        vmType = kvm                                                                                                                                       
        display = vnc                                                                                                                                      
        displaySecurePort = -1                                                                                                                             
        memSize = 4096                                                                                                                                     
        displayPort = -1                                                                                                                                   
        cpuType = Conroe                                                                                                                                   
        spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir                                                        
        vmName = HostedEngine                                                                                                                              
        clientIp =                                                                                                                                         
        exitCode = 1                            

Master:
[root@master-vds10 ~]# vdsClient -s 0 list                                                                                                                 

68dabad2-3ee6-4830-a1c2-a2c2af019428
        Status = Up
        nicModel = rtl8139,pv
        emulatedMachine = rhel6.5.0
        pid = 9319
        displayIp = 0
        devices = [{'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '50eeca54-8ea5-4f13-bf68-8d914352916f', 'alias': 'console0'}, {'device': 'memballoon', 'specParams': {'model': 'none'}, 'type': 'balloon', 'alias': 'balloon0'}, {'device': 'scsi', 'alias': 'scsi0', 'model': 'virtio-scsi', 'type': 'controller', 'address': {'slot': '0x04', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}, {'nicModel': 'pv', 'macAddr': '00:16:3e:61:02:bd', 'linkActive': True, 'network': 'rhevm', 'specParams': {}, 'filter': 'vdsm-no-mac-spoofing', 'alias': 'net0', 'deviceId': '53143a7e-a98e-4301-bd14-34808cceea6e', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface', 'name': 'vnet0'}, {'index': '2', 'iface': 'ide', 'name': 'hdc', 'alias': 'ide0-1-0', 'specParams': {}, 'readonly': 'True', 'deviceId': '506dbdd2-e80f-4fa7-b5c6-6c574ff86665', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '26843545600', 'specParams': {}, 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'readonly': 'False', 'shared': 'exclusive', 'truesize': '5840814080', 'type': 'disk', 'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volumeInfo': {'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'leasePath': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166.lease', 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'path': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166'}, 'format': 'raw', 'deviceId': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'path': '/var/run/vdsm/storage/b1cf00e3-982c-424a-827c-95984a7d7d2f/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166', 'propagateErrors': 'off', 'optional': 'false', 'name': 'vda', 'bootOrder': '1', 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'alias': 'virtio-disk0', 'volumeChain': [{'domainID': 'b1cf00e3-982c-424a-827c-95984a7d7d2f', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'cd755562-0063-46b6-bd4a-def730236166', 'leasePath': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166.lease', 'imageID': '42672899-f6a5-4309-aae3-bb309bdbc01a', 'path': '/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/images/42672899-f6a5-4309-aae3-bb309bdbc01a/cd755562-0063-46b6-bd4a-def730236166'}]}, {'device': 'usb', 'alias': 'usb0', 'type': 'controller', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x2'}}, {'device': 'ide', 'alias': 'ide0', 'type': 'controller', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x1'}}, {'device': 'virtio-serial', 'alias': 'virtio-serial0', 'type': 'controller', 
'address': {'slot': '0x05', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}, {'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}, {'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}, {'device': '', 'alias': 'video0', 'type': 'video', 'address': {'slot': '0x02', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}]
        smp = 2
        vmType = kvm
        display = vnc
        displaySecurePort = -1
        memSize = 4096
        displayPort = 5900
        cpuType = Conroe
        spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
        vmName = HostedEngine
        clientIp =
        pauseCode = NOERR

Command "hosted-engine --vm-status" issued on master, caused it to be stuck via ssh shell and non responsive even to ctrl+c/d:

[root@master-vds10 subsys]# hosted-engine --vm-status
^C



Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 85, in get_all_stats
    path.get_metadata_path(self._config),
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 47, in get_metadata_path
    return os.path.join(get_domain_path(config_),
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 37, in get_domain_path
    if os.access(path, os.F_OK):
KeyboardInterrupt
[root@master-vds10 subsys]#
[root@master-vds10 subsys]# service ovirt-ha-agent status
ovirt-ha-agent (pid 7364) is running...
[root@master-vds10 subsys]# hosted-engine --vm-status


[root@master-vds10 ~]# ps aux | grep hosted
vdsm      2738  0.5  0.0 1508288 20040 ?       Sl   16:44   0:49 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
root      4661  0.0  0.0 106228  1368 pts/0    S+   18:17   0:00 /bin/sh /usr/sbin/hosted-engine --vm-status
root      4662  0.0  0.0 170592  6980 pts/0    D+   18:17   0:00 python -m ovirt_hosted_engine_setup.vm_status
vdsm      7364  0.1  0.0 234680 14028 ?        D    16:53   0:13 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
root     12656  0.0  0.0 103256   856 pts/4    S+   19:05   0:00 grep hosted
 
During these troubles, the HE VM actually kept running uninterrupted and remained accessible via the UI.

Expected results:
Both hosts should remain active, no errors should occur, and nothing should hang or throw tracebacks.
   

Additional info:
Logs of agent, broker, and vdsm from both hosts, and engine logs from the HE VM, are attached.

Comment 1 Martin Sivák 2014-05-16 09:44:41 UTC
> 1. Assemble an HE setup containing 2 hosts and 2 storage domains,
> one of which is an imported NFS ISO domain.

Are you aware of the third storage domain that is used for the hosted engine purposes? Where is it? Can you describe the setup more clearly?

> 2. Block the connection to the non-master storage domain (the ISO domain) on both hosts: "iptables -I INPUT -s 10.35.64.102 -j DROP".

What is 10.35.64.102? Can you describe the setup more clearly? Hostnames, IP addresses, and what runs where... I find the report too confusing in this matter.

> exitMessage = internal error Failed to acquire lock: error -243

This is reported from VDSM/libvirt. Not really related to hosted engine...

> Command "hosted-engine --vm-status" issued on master, caused it to be
> stuck via ssh shell and non responsive even to ctrl+c/d:

This usually happens when you kill the connection to the NFS server with the hosted engine storage domain.
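
For context: the traceback above ends inside os.access() in ovirt_hosted_engine_ha/env/path.py, and on a hard NFS mount that has stopped responding such a call blocks in uninterruptible disk sleep (the "D" state visible in the ps output above), which is why the shell seems deaf to Ctrl+C. A minimal sketch of one defensive pattern, with hypothetical helper names rather than the actual ovirt-hosted-engine-ha fix, is to probe the path from a worker process with a timeout:

import multiprocessing
import os

def _probe(path, queue):
    # os.access() itself can hang forever on a dead hard NFS mount
    queue.put(os.access(path, os.F_OK))

def path_accessible(path, timeout=5):
    # Returns True/False, or None if the probe timed out (storage stuck)
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_probe, args=(path, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # A child stuck in D state ignores SIGTERM until NFS recovers,
        # but the caller can still give up and report the storage as stuck.
        proc.terminate()
        return None
    return queue.get()

With a guard like this the tool could report the storage as unreachable instead of hanging; whether the actual fix (av13, per the header) works this way is not shown in this bug.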

> Expected: Both hosts should remain in active mode

What do you mean by this? The engine must not run on more than one host.


I do not see any real errors here except the freeze of the hosted-engine tool. The VM is running properly on the master host and is not running on the second host, as expected.

Rose:
[root@rose05 ~]# vdsClient -s 0 list

68dabad2-3ee6-4830-a1c2-a2c2af019428
        Status = Down               

Master:
[root@master-vds10 ~]# vdsClient -s 0 list                                                                                                                 
68dabad2-3ee6-4830-a1c2-a2c2af019428
        Status = Up

> During these troubles, the HE VM actually kept running uninterrupted and remained accessible via the UI.

So hosted engine did its job properly and you only have a reporting issue?

Comment 2 Nikolai Sednev 2014-05-18 07:59:05 UTC
(In reply to Martin Sivák from comment #1)
> > 1. Assemble an HE setup containing 2 hosts and 2 storage domains,
> > one of which is an imported NFS ISO domain.
> 
> Are you aware of the third storage domain that is used for the hosted engine
> purposes? Where is it? Can you describe the setup more clearly?
> 
> > 2. Block the connection to the non-master storage domain (the ISO domain) on both hosts: "iptables -I INPUT -s 10.35.64.102 -j DROP".
> 
> What is 10.35.64.102? Can you describe the setup more clearly? Hostnames,
> IP addresses, and what runs where... I find the report too confusing in this
> matter.
> 
> > exitMessage = internal error Failed to acquire lock: error -243
> 
> This is reported from VDSM/libvirt. Not really related to hosted engine...
> 
> > Command "hosted-engine --vm-status" issued on master, caused it to be
> > stuck via ssh shell and non responsive even to ctrl+c/d:
> 
> This usually happens when you kill the connection to the NFS server with the
> hosted engine storage domain.
> 
> > Expected: Both hosts should remain in active mode
> 
> What do you mean by this? The engine must not run on more than one host.
> 
> 
> I do not see any real errors here except the freeze of the hosted-engine
> tool. The VM is running properly on the master host and is not running on
> the second host, as expected.
> 
> Rose:
> [root@rose05 ~]# vdsClient -s 0 list
> 
> 68dabad2-3ee6-4830-a1c2-a2c2af019428
>         Status = Down               
> 
> Master:
> [root@master-vds10 ~]# vdsClient -s 0 list                                  
> 
> 68dabad2-3ee6-4830-a1c2-a2c2af019428
>         Status = Up
> 
> > During these troubles, the HE VM actually kept running uninterrupted and remained accessible via the UI.
> 
> So hosted engine did its job properly and you only have a reporting issue?

1. HE used its own NFSv3 SD, which is obviously hidden from the engine; it lives at 10.35.160.108:/RHEV/artyom_hosted_engine.
2. 10.35.64.102 is the IP address of the imported NFS/ISO SD, the second SD besides the regular NFSv3 SD used as the first one.
3. Not the issue here, as I blocked only the connectivity to the imported ISO SD, which HE does not use for itself, since it runs from the 10.35.160.108:/RHEV/artyom_hosted_engine NFSv3 share.
4. A host with a "green arrow upwards" means an active host; I got one of them as inactive, with no connection to where HE is running at all.
5. Yes, the HE VM continued to run properly, but the engine reported it as down for some reason, and one of the hosts went into "Inactive" and remained inactive even after I removed the iptables rules from both hosts.

Comment 3 Jiri Moskovcak 2014-07-17 08:14:06 UTC
> 5. Yes, the HE VM continued to run properly, but the engine reported it as
> down for some reason, and one of the hosts went into "Inactive" and remained
> inactive even after I removed the iptables rules from both hosts.

- that seems more like a problem in vdsm or the engine; hosted-engine doesn't report the state of the host, that's up to vdsm

- from the engine log it seems like everything is working as expected: the engine is trying to periodically do the RPC call GetStatsVDSCommand to master-vds10.qa.lab.tlv.redhat.com, but still getting timeouts, thus keeping the host NonOperational; it seems like the host still can't access the storage domain properly: "Host master-vds10.qa.lab.tlv.redhat.com reports about one of the Active Storage Domains as Problematic." Moving to vdsm, because it handles the storage.

Comment 6 Martin Sivák 2014-09-16 11:44:18 UTC
Can you please take a look at this?

Comment 13 Federico Simoncelli 2014-10-02 13:11:39 UTC
Analyzing the logs:

$ grep "return vmGetStats.*'vmId': '68dabad2-3ee6-4830-a1c2-a2c2af019428'" mastersvdsm.log | grep -v "'status': 'Up'" | wc -l
0

we can see that from 2014-05-15 18:01:03,048 to 2014-05-15 18:39:24,229 vdsm consistently reported (roughly every 10 seconds) 68dabad2-3ee6-4830-a1c2-a2c2af019428 (the HostedEngine VM) as up on master-vds10.

So I have no idea why the agent wanted to start the HostedEngine VM on rose as well:

$ grep "client.*vmCreate.*HostedEngine" rosesvdsm.log | cut -c 21-114
2014-05-15 18:04:28,355::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:14:57,202::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:25:26,240::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:35:55,714::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate

which obviously resulted in the locking failures:

Thread-1349::DEBUG::2014-05-15 18:04:30,274::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-1896::DEBUG::2014-05-15 18:14:59,043::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-2460::DEBUG::2014-05-15 18:25:28,071::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-3008::DEBUG::2014-05-15 18:35:57,547::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243

and that finally resulted in dead VMs being picked up by the engine and reported to the audit log.
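
For reference, error -243 is sanlock's return code; in the sanlock sources (sanlock_rv.h) it corresponds, as far as I can tell, to SANLK_ACQUIRE_IDLIVE: the resource lease is still held by a live host ID. That is consistent with the analysis above, since master-vds10 held the HostedEngine disk lease while rose kept retrying vmCreate. A tiny lookup helper for log triage; the mapping is a hedged transcription and should be verified against the installed sanlock version:

# Partial sanlock return-code map, transcribed from sanlock_rv.h;
# double-check against the sanlock version actually installed.
SANLOCK_RV = {
    -243: "SANLK_ACQUIRE_IDLIVE: lease still held by a live host ID",
}

def explain_sanlock_rv(rv):
    return SANLOCK_RV.get(rv, "unknown sanlock return code: %d" % rv)

print(explain_sanlock_rv(-243))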

Comment 14 Allon Mureinik 2014-10-02 15:39:59 UTC
Doron, can someone from your group take a look please?

Comment 17 Nikolai Sednev 2014-10-06 15:16:41 UTC
What's the question?

Comment 18 Jiri Moskovcak 2014-10-07 06:44:20 UTC
(In reply to Nikolai Sednev from comment #17)
> What's the question?

The question is whether you also think that it's a dupe of bug 1140824.

Comment 19 Jiri Moskovcak 2014-10-08 09:34:46 UTC
*** Bug 1150076 has been marked as a duplicate of this bug. ***

Comment 20 Nikolai Sednev 2014-10-12 06:11:27 UTC
(In reply to Jiri Moskovcak from comment #18)
> (In reply to Nikolai Sednev from comment #17)
> > What's the question?
> 
> The question is whether you also think that it's a dupe of bug 1140824.

1. I think that if it's a duplicate, then mine is not the bug to close, as it was opened earlier (Reported: 2014-05-15 12:09 EDT by Nikolai Sednev) than 1140824 (Reported: 2014-09-11 14:42 EDT by wdaniel).
2. I really don't think that these two bugs are the same, as the scenario in 1140824 is different: there, putting HE into maintenance mode and then retrieving the VM status causes the trouble, and the ISO domain connection is not blocked by any iptables rules during that test run, IMHO.

Comment 21 Doron Fediuck 2014-10-21 14:28:12 UTC
(In reply to Nikolai Sednev from comment #20)
> (In reply to Jiri Moskovcak from comment #18)
> > (In reply to Nikolai Sednev from comment #17)
> > > What's the question?
> > 
> > The question is whether you also think that it's a dupe of bug 1140824.
> 
> 1. I think that if it's a duplicate, then mine is not the bug to close, as
> it was opened earlier (Reported: 2014-05-15 12:09 EDT by Nikolai Sednev)
> than 1140824 (Reported: 2014-09-11 14:42 EDT by wdaniel).
> 2. I really don't think that these two bugs are the same, as the scenario
> in 1140824 is different: there, putting HE into maintenance mode and then
> retrieving the VM status causes the trouble, and the ISO domain connection
> is not blocked by any iptables rules during that test run, IMHO.

Even if the scenario is different, if the root cause is the same and the
solution is the same, then it should be marked as a duplicate. In this case,
having no connection can be related to iptables or other scenarios, but
the root cause is lack of connectivity and the resolution will be the same.

Comment 22 Nikolai Sednev 2014-10-22 15:04:51 UTC
(In reply to Doron Fediuck from comment #21)
> (In reply to Nikolai Sednev from comment #20)
> > (In reply to Jiri Moskovcak from comment #18)
> > > (In reply to Nikolai Sednev from comment #17)
> > > > What's the question?
> > > 
> > > The question is whether you also think that it's a dupe of bug 1140824.
> > 
> > 1. I think that if it's a duplicate, then mine is not the bug to close, as
> > it was opened earlier (Reported: 2014-05-15 12:09 EDT by Nikolai Sednev)
> > than 1140824 (Reported: 2014-09-11 14:42 EDT by wdaniel).
> > 2. I really don't think that these two bugs are the same, as the scenario
> > in 1140824 is different: there, putting HE into maintenance mode and then
> > retrieving the VM status causes the trouble, and the ISO domain connection
> > is not blocked by any iptables rules during that test run, IMHO.
> 
> Even if the scenario is different, if the root cause is the same and the
> solution is the same, then it should be marked as a duplicate. In this case,
> having no connection can be related to iptables or other scenarios, but
> the root cause is lack of connectivity and the resolution will be the same.

I totally agree with you, except for a single difference: my bug was opened in May this year, while the referred bug was opened in September this year (and it in turn was already closed as a duplicate of this one).

Please let me know if any additional information is required.

Comment 23 Doron Fediuck 2014-10-23 07:33:25 UTC
(In reply to Nikolai Sednev from comment #22)

> I totally agree with you, except for a single difference: my bug was opened
> in May this year, while the referred bug was opened in September this year
> (and it in turn was already closed as a duplicate of this one).
> 
> Please let me know if any additional information is required.

In this case please try to reproduce, since we suspect it's already resolved.

Comment 24 Nikolai Sednev 2014-10-28 15:57:42 UTC
(In reply to Doron Fediuck from comment #23)
> (In reply to Nikolai Sednev from comment #22)
> 
> > I totally agree with you, except for a single difference: my bug was opened
> > in May this year, while the referred bug was opened in September this year
> > (and it in turn was already closed as a duplicate of this one).
> > 
> > Please let me know if any additional information is required.
> 
> In this case please try to reproduce, since we suspect it's already
> resolved.

Sure, but the problem is that I see it fixed/targeted for release in 3.4.4, while the latest is 3.4.3.
rhevm-3.4.3-1.2.el6ev.noarch
http://bob.eng.lab.tlv.redhat.com/builds/latest_av/

I tried to reproduce it on the latest build, and it works for me with several exceptions:

The original error was not reproduced: one of the 3 hosts remains active with the HE VM running on top of it and acts as SPM, but the 2 additional hosts keep changing their states to not accessible or down continuously.
Screenshot is attached.
I ran the setup on 3 RHEL6.6 hosts, with the HE VM upgraded from RHEL6.5 to RHEL6.6.

Comment 25 Nikolai Sednev 2014-10-28 15:59:46 UTC
Created attachment 951438 [details]
screenshot

Comment 26 Nikolai Sednev 2014-10-28 16:12:33 UTC
The question here is whether the current system behaviour looks OK. We're losing HA for HE when the 2 other hosts keep jumping between states; we may have to define them as inactive for regular VMs while still good for HE HA.

I rebooted the host that was shown as active and that hosted HE; all 3 hosts became down and not accessible, and the HE VM changed its status to unknown, yet HE kept running, although not shown properly via the web UI.

Moreover, if I run "hosted-engine --vm-status" on both running hosts while the third, previously rebooted host is still powering up, no response to the command is received and both hosts get stuck.

Comment 27 Jiri Moskovcak 2014-10-29 07:29:03 UTC
The problem with the agent getting stuck when ANY of the storage domains is not accessible is not fixed in 3.4.x, but the problem you're describing is something else: a frozen agent doesn't cause the engine to think that the host is down or non-responsive. I think that your test case, where you kill the connection to the storage domain, somehow messes up the connection between the host and the engine.

Comment 28 Nikolai Sednev 2014-10-30 08:37:45 UTC
(In reply to Jiri Moskovcak from comment #27)
> The problem with the agent getting stuck when ANY of the storage domains is
> not accessible is not fixed in 3.4.x, but the problem you're describing is
> something else: a frozen agent doesn't cause the engine to think that the
> host is down or non-responsive. I think that your test case, where you kill
> the connection to the storage domain, somehow messes up the connection
> between the host and the engine.

Anyway, for the original issue we should close this one as verified.
The additional troubles should get their own bug and be fixed based on decisions and input received from Doron.