Bug 973249 - "Couldn't connect to supervdsm" after new host installation
Summary: "Couldn't connect to supervdsm" after new host installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Yaniv Bronhaim
QA Contact:
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-06-11 14:42 UTC by Petr Spacek
Modified: 2016-02-10 19:23 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-14 11:38:38 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
vdsm.log.bz2 (14.01 KB, application/octet-stream)
2013-06-11 14:42 UTC, Petr Spacek


Links
System ID: Red Hat Knowledge Base (Solution) 64965 | Private: 0 | Priority: None | Status: None | Summary: None | Last Updated: Never

Description Petr Spacek 2013-06-11 14:42:51 UTC
Created attachment 759658 [details]
vdsm.log.bz2

Description of problem:
VDSMd is dead after adding a machine to the RHEV cluster. See attached log and additional info below.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

# rhn-channel -l
rhel-x86_64-rhev-mgmt-agent-6
rhel-x86_64-rhev-mgmt-agent-6-debuginfo
rhel-x86_64-server-6
rhel-x86_64-server-optional-6
rhel-x86_64-server-optional-6-debuginfo
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-supplementary-6-debuginfo

All updates applied.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-22.0.el6ev.x86_64
The RHEV cluster's 'Compatibility Version' is configured to 3.0

How reproducible:
100 %

Steps to Reproduce:
1. Join machine to RHEV cluster
2. Reboot the machine - it will be in 'Non-Responsive' state
3. Run "tail -f /var/log/vdsm/vdsm.log" and see error messages

Actual results:
VDSMd is not functional and RHEV-M shows the machine as 'Non-Responsive'.

Expected results:
The machine works :-)

Additional info:
The machine hosted plain libvirt VMs in the past. I undefined all libvirt VMs before the first attempt to join the RHEV cluster.

The first attempt to install the RHEV agent and join the machine to the RHEV cluster failed because I had left one network defined in libvirt (the network was called 'private').

The host installation finished successfully after manually un-defining that network in libvirt. The problem is that after a reboot the machine is in the 'Non-Responsive' state.

Comment 1 Yaniv Bronhaim 2013-06-13 10:02:53 UTC
Please provide older vdsm logs (from before the error started) so we can check what led to this.

At first glance, it resembles a logging error we had that caused supervdsmServer.py to crash on each run. Vdsm couldn't connect to its socket and killed itself over and over again.

We changed this logic for 3.2. For 3.3 we redesigned this whole flow, so I believe this issue won't appear in 3.3.

I don't understand from the description how you reproduce it, and whether it's really reproducible.

If it is, I'll try to reproduce it on 3.3 to check whether it has already been fixed.
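
(Not from the original report, just a minimal sketch for anyone hitting the same loop: scan the vdsm log for the supervdsm connection failures named in the summary. Only /var/log/vdsm/vdsm.log is taken from this bug; /var/log/vdsm/supervdsm.log is an assumed default location and may not exist on every build.)

# grep -ic "couldn't connect to supervdsm" /var/log/vdsm/vdsm.log
# grep -iE "supervdsm|traceback" /var/log/vdsm/vdsm.log | tail -20
# tail -50 /var/log/vdsm/supervdsm.log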

Comment 3 Yaniv Bronhaim 2013-06-13 13:27:58 UTC
After deeper investigation, the cause of the restarts was that /etc/sudoers.d/50_vdsm was not being included from the sudoers file.

The vdsm user couldn't run "sudo python /usr/share/vdsm/supervdsmServer.py command", in addition to all the other sudo commands, and that led to vdsm restarting over and over.

On each host, the file /etc/sudoers should include:
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

or

# vdsm customizations
#include /etc/sudoers.d/50_vdsm
# end vdsm customizations
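
(A minimal check sketch, not part of the original comment: verify that the drop-in directive is present and that sudo actually grants the vdsm user the supervdsm command. All paths are the ones mentioned above; nothing else is assumed.)

# grep -nE "^#(includedir /etc/sudoers.d|include /etc/sudoers.d/50_vdsm)" /etc/sudoers
# visudo -c
# sudo -l -U vdsm | grep -i supervdsm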

Comment 4 Petr Spacek 2013-06-13 13:33:42 UTC
I reopened the bug because re-installing the machine (from the web admin portal) didn't fix the problem.

I tried deleting the three lines around #include manually, re-installed/re-provisioned the hypervisor, and the lines are still not present.

It seems like some glitch in the installation process.

Comment 5 Michal Skrivanek 2013-06-14 11:38:38 UTC
After investigating the affected host, the cause is a local modification to the sudo config:

# diff /etc/sudoers /etc/sudoers.rpmnew | tail -2
> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
> #includedir /etc/sudoers.d
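
(A remediation sketch, not part of the original comment, assuming /etc/sudoers.rpmnew still carries the packaged default: back up the live file, re-add the two missing lines from the diff above through visudo so the syntax stays validated, then confirm the vdsm drop-in is picked up.)

# cp -p /etc/sudoers /etc/sudoers.bak
# visudo
# visudo -c
# sudo -l -U vdsm | grep -i supervdsm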

Comment 6 Petr Spacek 2013-06-14 11:52:47 UTC
The machine had an /etc/sudoers file installed by a sudo package older than 1.7.4p5-4, which didn't contain the #includedir directive.

Relevant article is: https://access.redhat.com/site/solutions/64965
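
(A minimal sketch for checking whether another host carries the same stale /etc/sudoers; only standard rpm/grep tooling is assumed.)

# rpm -q sudo
# rpm -V sudo | grep sudoers
# grep -c "^#includedir /etc/sudoers.d" /etc/sudoers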

The error reporting could be better :-), but in general, I agree that this is NOTABUG. Thank you for the assistance.

