Bug 1329166
Summary: | VDSM hosted_engine_2 command failed: Message timeout which can be caused by communication issues
---|---
Product: | [oVirt] vdsm
Component: | Core
Status: | CLOSED DUPLICATE
Severity: | high
Priority: | medium
Version: | 4.17.23.1
Reporter: | Bhaskarakiran <byarlaga>
Assignee: | Dan Kenigsberg <danken>
QA Contact: | Aharon Canan <acanan>
CC: | bugs, byarlaga, knarra, mzywusko, sabose, sasundar, stirabos, ylavi
Flags: | ykaul: ovirt-4.0.z?; rule-engine: planning_ack?; rule-engine: devel_ack?; rule-engine: testing_ack?
Target Milestone: | ---
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Story Points: | ---
Last Closed: | 2016-07-04 08:37:27 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | Network
Cloudforms Team: | ---
Bug Blocks: | 1277939
The issue is seen with a RHEV RHGS HCI setup, where the hypervisors were running RHEL 7.2. The installation of the hosted_engine on NODE1 was successful, and a communication issue was observed while adding the third NODE to the hosted-engine setup. Also observed that during this failure, the ovirtmgmt bridge was missing on the third node.

Can you also attach the supervdsm.log from /var/log/vdsm/?

I did not copy the logs. Setting up a new environment now, will copy the logs once the issue is hit.

Created attachment 1150855
supervdsm log
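The missing ovirtmgmt bridge reported above can be checked directly on a host. The following is a minimal sketch, not part of vdsm: it relies only on the fact that the Linux kernel exposes a `bridge` subdirectory under `/sys/class/net/<name>/` for bridge devices. The function name `bridge_exists` and the `sysfs_root` parameter are hypothetical, added here so the check can be exercised against a fake directory tree.

```python
import os


def bridge_exists(name, sysfs_root="/sys/class/net"):
    """Return True if `name` exists under sysfs_root and is a bridge device.

    A Linux bridge has a `bridge` subdirectory in its sysfs entry; a plain
    NIC or a missing interface does not.
    """
    return os.path.isdir(os.path.join(sysfs_root, name, "bridge"))


if __name__ == "__main__":
    # On the affected third node this would be expected to report False
    # while the failure is occurring.
    print("ovirtmgmt bridge present:", bridge_exists("ovirtmgmt"))
```

Running this on each node before and after adding the host would show whether the bridge disappears at the time of the timeout.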
Attached the supervdsm.log file.

Please try this on non HCI and let us know if this reproduces. Please move back only if you find this happen on non-HCI and with a clear cause.

(In reply to Yaniv Dary from comment #7)
> Please try this on non HCI and let us know if this reproduces. Please move
> back only if you find this happen on non-HCI and with a clear cause.

This seems to be a network issue, not HC related - similar to the issue encountered by a user at https://www.mail-archive.com/users@ovirt.org/msg32710.html

(In reply to Sahina Bose from comment #8)
> (In reply to Yaniv Dary from comment #7)
> > Please try this on non HCI and let us know if this reproduces. Please move
> > back only if you find this happen on non-HCI and with a clear cause.
>
> This seems to be a network issue, not HC related - similar to issue
> encountered by user at
> https://www.mail-archive.com/users@ovirt.org/msg32710.html

So please provide full logs and steps to reproduce. Once we have that, I will ask the right person to check.

Bhaskar, could you provide engine.log, vdsm.log and supervdsm.log when you encounter this again?

Removing this from the 3.6.z target as this bug is not consistently reproducible.

oVirt 4.0 beta has been released, moving to RC milestone.

I have hit this issue with 3.6.7 and the logs can be found at the link below. hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message timeout which can be caused by communication issues.
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

(In reply to RamaKasturi from comment #14)
> I have hit this issue with 3.6.7 and logs can be found in this link below.
>
> hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message
> timeout which can be caused by communication issues.
>
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

Again, I'd like to ensure - is this reproducible in a non HC setup?
(In reply to Yaniv Kaul from comment #15)
> (In reply to RamaKasturi from comment #14)
> > I have hit this issue with 3.6.7 and logs can be found in this link below.
> >
> > hosted_engine3 (zod.lab.eng.blr.redhat.com) failed to add due to message
> > timeout which can be caused by communication issues.
> >
> > http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/
>
> Again, I'd like to ensure - is this reproducible in a non HC setup?

I have now tested this with a non-HC setup with RHEV 3.6.7-4 and RHGS 3.1.3 RC2 (glusterfs-3.7.9-10):
1. Installed 3 machines with RHEL 7.2
2. Installed hosted engine on host1
3. Added the second host to the cluster
4. Added the third host to the cluster
There were no issues. Kasturi is now installing the HC setup with RHEV 3.6.7 once again. Awaiting the results.

I tried this with an HC setup with RHEV 3.6.7-4 and RHGS 3.1.3.
1) Installed 3 machines with RHEL 7.2
2) Installed hosted engine on host 1
3) Tried adding the second host to the cluster; it failed with the errors:
[ ERROR ] The VDSM host was found in a failed state. Please check engine and bootstrap installation logs.
[ ERROR ] Unable to add hosted_engine_2 to the manager.
I am not sure why I saw this error. I went and updated my engine VM, since I had not updated it after installing.
4) After some time my host went to non_responsive state, and in the events log I see "VDSM hosted_engine_2 command failed: Message timeout which can be caused by communication issues."
5) Moved it to maintenance, removed it from the UI, and tried installing again. It failed again with the errors "VDSM hosted_engine_2 command failed: waiting for connect interrupted." and "failed to configure management network on hosted_engine_2 due to setup networks failure."
6) Moved it to maintenance again, removed it from the UI, and tried adding it again. This time it worked fine.
Logs can be found at the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1329166/

In my opinion this is just a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=1349218

(In reply to Simone Tiraboschi from comment #18)
> On my opinion this is just a duplicate of:
> https://bugzilla.redhat.com/show_bug.cgi?id=1349218

Sorry, https://bugzilla.redhat.com/show_bug.cgi?id=1350763

Should we close this bug then?

*** This bug has been marked as a duplicate of bug 1350763 ***
Created attachment 1149418
vdsm log

Description of problem:
While adding a third node to the hosted engine setup, the error in the summary is seen.

Version-Release number of selected component (if applicable):
3.6.5.3-0.1.el6

How reproducible:
100%

Steps to Reproduce:
1. Just add a 3rd node to the hosted engine cluster

log snippet:
============
ioprocess communication (23550)::ERROR::2016-04-21 15:15:05,121::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (23539)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (23575)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed

Actual results:

Expected results:

Additional info:
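To see how often the "IOProcess failure" entries from the log snippet occur in a full vdsm.log, the entries can be counted with a short script. This is a rough sketch, assuming every entry follows the `::`-separated layout of the three lines quoted in this report; the field names (`thread`, `ident`, `logger`, and so on) are labels chosen here for illustration and are not taken from vdsm itself.

```python
import re
from collections import Counter

# Layout inferred only from the log lines quoted in this bug report.
# Group names are illustrative labels, not vdsm terminology.
LOG_LINE = re.compile(
    r"^(?P<thread>.+?) \((?P<ident>\d+)\)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for a log entry header, or None otherwise."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_errors(lines):
    """Count ERROR entries, keyed by (logger, message)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR":
            counts[(record["logger"], record["message"])] += 1
    return counts


# Entry headers copied from the report; traceback lines are skipped
# by parse_line because they do not match the header layout.
SAMPLE = """\
ioprocess communication (23550)::ERROR::2016-04-21 15:15:05,121::__init__::174::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 129, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (23539)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
ioprocess communication (23575)::ERROR::2016-04-21 15:15:05,123::__init__::174::IOProcessClient::(_communicate) IOProcess failure
"""

if __name__ == "__main__":
    for key, count in count_errors(SAMPLE.splitlines()).items():
        print(key, count)
```

For a real log, the same function could be fed an open file object, for example `count_errors(open("/var/log/vdsm/vdsm.log"))`, since it only iterates over lines.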