Bug 1329166

Summary: VDSM hosted_engine_2 command failed: Message timeout which can be caused by communication issues
Product: [oVirt] vdsm
Reporter: Bhaskarakiran <byarlaga>
Component: Core
Assignee: Dan Kenigsberg <danken>
Status: CLOSED DUPLICATE
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: medium
Version: 4.17.23.1
CC: bugs, byarlaga, knarra, mzywusko, sabose, sasundar, stirabos, ylavi
Target Milestone: ---
Flags: ykaul: ovirt-4.0.z?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-04 08:37:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1277939
Attachments: vdsm log (none)
             supervdsm log (none)

Description Bhaskarakiran 2016-04-21 10:11:46 UTC
Created attachment 1149418 [details]
vdsm log

Description of problem:

While adding a third node to the hosted-engine setup, the error from the summary ("Message timeout which can be caused by communication issues") is seen.

Version-Release number of selected component (if applicable):
---------------------------------------------------------------
3.6.5.3-0.1.el6

How reproducible:
100%

Steps to Reproduce:
1. Add a third node to the hosted-engine cluster


log snippet:
============
ioprocess communication (23550)::ERROR::2016-04-21 15:15:05,121::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (23539)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (23575)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
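The "FD closed" exception means the ioprocess client found the pipe to its helper process at end-of-file, typically because the helper died or was torn down. As an illustration (the function below is a hypothetical sketch of the pattern, not the actual ioprocess code), the same failure mode can be reproduced with a plain pipe:

```python
import os

def read_response(fd, bufsize=4096):
    """Read one chunk from a pipe, raising if the peer closed its end.

    os.read() returning an empty bytestring means EOF, i.e. the helper
    process closed its side of the pipe, so the client raises instead
    of blocking forever on a dead channel.
    """
    data = os.read(fd, bufsize)
    if not data:
        raise Exception("FD closed")
    return data

# Demonstration: close the write end first, so the read sees EOF.
r, w = os.pipe()
os.close(w)
try:
    read_response(r)
except Exception as exc:
    print(exc)  # prints: FD closed
finally:
    os.close(r)
```

Note that in the log above, three separate IOProcessClient instances hit this at the same instant, which suggests the helper processes were torn down together (e.g. during a VDSM restart or network reconfiguration) rather than failing independently.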




Actual results:


Expected results:


Additional info:

Comment 1 SATHEESARAN 2016-04-21 10:46:55 UTC
The issue is seen with a RHEV-RHGS HCI setup, where the hypervisors were running RHEL 7.2.

The installation of the hosted engine on NODE1 was successful; a communication issue was observed while adding the third node to the hosted-engine setup.

Comment 2 SATHEESARAN 2016-04-21 10:47:45 UTC
Also observed that during this failure, the ovirtmgmt bridge was missing on the third node.
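Since the ovirtmgmt bridge was reported missing, a quick generic check on the affected node (plain sysfs, no oVirt-specific tooling assumed) is whether the kernel sees an interface of that name that is actually a Linux bridge:

```shell
#!/bin/sh
# A Linux bridge exposes a bridge/ directory under /sys/class/net/<name>.
is_bridge() {
    [ -d "/sys/class/net/$1/bridge" ]
}

if is_bridge ovirtmgmt; then
    echo "ovirtmgmt bridge present"
else
    echo "ovirtmgmt bridge MISSING"
fi
```

`ip link show ovirtmgmt` or `bridge link` (iproute2) give the same answer with more detail; if the bridge is missing right after a failed host deployment, supervdsm.log is where the network setup that should have created it would be logged.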

Comment 3 Sahina Bose 2016-04-21 11:12:29 UTC
Can you also attach the supervdsm.log from /var/log/vdsm/?

Comment 4 Bhaskarakiran 2016-04-25 09:21:01 UTC
I forgot to copy the logs. I am setting up a new environment now and will copy the logs once the issue is hit.

Comment 5 Bhaskarakiran 2016-04-26 11:02:00 UTC
Created attachment 1150855 [details]
supervdsm log

Comment 6 Bhaskarakiran 2016-04-26 11:02:27 UTC
Attached the supervdsm.log file.

Comment 7 Yaniv Lavi 2016-04-27 09:06:35 UTC
Please try this on a non-HCI setup and let us know if this reproduces. Please move back only if you find this happens on non-HCI with a clear cause.

Comment 8 Sahina Bose 2016-05-03 08:53:26 UTC
(In reply to Yaniv Dary from comment #7)
> Please try this on a non-HCI setup and let us know if this reproduces.
> Please move back only if you find this happens on non-HCI with a clear cause.

This seems to be a network issue, not HC related - similar to issue encountered by user at https://www.mail-archive.com/users@ovirt.org/msg32710.html

Comment 9 Yaniv Lavi 2016-05-03 09:06:01 UTC
(In reply to Sahina Bose from comment #8)
> (In reply to Yaniv Dary from comment #7)
> > Please try this on a non-HCI setup and let us know if this reproduces.
> > Please move back only if you find this happens on non-HCI with a clear cause.
> 
> This seems to be a network issue, not HC related - similar to issue
> encountered by user at
> https://www.mail-archive.com/users@ovirt.org/msg32710.html

So please provide full logs and steps to reproduce.
Once we have that I ask the right person to check.

Comment 10 Sahina Bose 2016-05-03 09:24:52 UTC
Bhaskar, could you provide engine.log, vdsm.log and supervdsm.log when you encounter this again?

Comment 11 Sahina Bose 2016-05-19 08:27:02 UTC
Removing this from the 3.6.z target, as this bug is not consistently reproducible.

Comment 12 Yaniv Lavi 2016-05-23 13:18:00 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 13 Yaniv Lavi 2016-05-23 13:22:01 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 14 RamaKasturi 2016-06-03 09:10:24 UTC
I have hit this issue with 3.6.7 and logs can be found in this link below.

hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message timeout which can be caused by communication issues.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

Comment 15 Yaniv Kaul 2016-06-15 07:03:23 UTC
(In reply to RamaKasturi from comment #14)
> I have hit this issue with 3.6.7 and logs can be found in this link below.
> 
> hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message
> timeout which can be caused by communication issues.
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

Again, I'd like to ensure - is this reproducible in a non HC setup?

Comment 16 SATHEESARAN 2016-06-17 09:50:22 UTC
(In reply to Yaniv Kaul from comment #15)
> (In reply to RamaKasturi from comment #14)
> > I have hit this issue with 3.6.7 and logs can be found in this link below.
> > 
> > hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message
> > timeout which can be caused by communication issues.
> > 
> > http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/
> 
> Again, I'd like to ensure - is this reproducible in a non HC setup?

Now I have tested it with a non-HC setup with RHEV 3.6.7-4 and RHGS 3.1.3 RC2 (glusterfs-3.7.9-10):

1. Installed 3 machines with RHEL 7.2
2. Installed hosted engine on the host1
3. Added the second host to the cluster
4. Added the third host to the cluster

And there are no issues.

Kasturi is now installing the HC setup with RHEV 3.6.7 once again.
Awaiting the results.

Comment 17 RamaKasturi 2016-06-20 13:29:35 UTC
I tried this with an HC setup with RHEV 3.6.7-4 and RHGS 3.1.3.

1) Installed 3 machines with RHEL 7.2
2) Installed hosted engine on host 1
3) Tried adding the second host to the cluster; it failed with the error:

 [ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
 [ ERROR ] Unable to add hosted_engine_2 to the manager.

I am not sure why I saw this error. I went and updated my engine VM, since I had not updated it after installing.

4) After some time my host went to a non_responsive state, and in the events log I see "VDSM hosted_engine_2 command failed: Message timeout which can be caused by communication issues."

5) Moved it to maintenance, removed it from the UI, and tried installing again. It failed again with the error "VDSM hosted_engine_2 command failed: waiting for connect interrupted." and then failed to configure the management network on hosted_engine_2 due to a setup networks failure.

6) Moved it to maintenance again, removed it from the UI, and tried adding it again. This time it worked fine.

Logs can be found at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

Comment 18 Simone Tiraboschi 2016-06-30 13:13:06 UTC
In my opinion this is just a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=1349218

Comment 19 Simone Tiraboschi 2016-06-30 13:16:27 UTC
(In reply to Simone Tiraboschi from comment #18)
> In my opinion this is just a duplicate of:
> https://bugzilla.redhat.com/show_bug.cgi?id=1349218

sorry, https://bugzilla.redhat.com/show_bug.cgi?id=1350763

Comment 20 Yaniv Lavi 2016-07-03 12:14:09 UTC
Should we close this bug then?

Comment 21 Simone Tiraboschi 2016-07-04 08:37:27 UTC

*** This bug has been marked as a duplicate of bug 1350763 ***