Bug 973249

Summary: "Couldn't connect to supervdsm" after new host installation
Product: Red Hat Enterprise Virtualization Manager
Reporter: Petr Spacek <pspacek>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Version: 3.1.3
CC: abaron, bazulay, hateya, iheim, lpeer, michal.skrivanek, pspacek, ybronhei, ykaul
Target Milestone: ---
Keywords: Reopened
Target Release: 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2013-06-14 11:38:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Infra
Cloudforms Team: ---
Attachments: vdsm.log.bz2

Description Petr Spacek 2013-06-11 14:42:51 UTC
Created attachment 759658 [details]
vdsm.log.bz2

Description of problem:
VDSMd is dead after adding a machine to the RHEV cluster. See attached log and additional info below.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

# rhn-channel -l
rhel-x86_64-rhev-mgmt-agent-6
rhel-x86_64-rhev-mgmt-agent-6-debuginfo
rhel-x86_64-server-6
rhel-x86_64-server-optional-6
rhel-x86_64-server-optional-6-debuginfo
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-supplementary-6-debuginfo

All updates applied.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-22.0.el6ev.x86_64
The RHEV cluster's 'Compatibility Version' is set to 3.0.

How reproducible:
100%

Steps to Reproduce:
1. Join machine to RHEV cluster
2. Reboot the machine - it will be in 'Non-Responsive' state
3. Run "tail -f /var/log/vdsm/vdsm.log" and see error messages

Actual results:
VDSMd is not functional and RHEV-M shows machine as 'Non-Responsive'.

Expected results:
The machine works :-)

Additional info:
The machine hosted plain libvirt VMs in the past. I undefined all libvirt VMs before the first attempt to join RHEV cluster.

The first attempt to install RHEV agent and join the machine to RHEV cluster failed because I forgot one network definition in libvirt (the network was called 'private').

The host installation finished successfully after manually undefining that network in libvirt. The problem is that after a reboot the machine goes into the 'Non-Responsive' state.

Comment 1 Yaniv Bronhaim 2013-06-13 10:02:53 UTC
Please provide older vdsm logs (from before the error started) so we can check what led to this.

At first glance, it resembles a logging error we had that caused supervdsmServer.py to crash on each run. Vdsm couldn't connect to its socket and killed itself over and over again.

We changed this logic for 3.2. For 3.3 we redesigned this whole flow, so I believe this issue won't appear in 3.3.

I don't understand from the description how you reproduced it, or whether it is really reproducible.

If it is, I'll try to reproduce it on 3.3 to check whether it has already been fixed.

Comment 3 Yaniv Bronhaim 2013-06-13 13:27:58 UTC
After deeper investigation, the cause of the restarts was that sudoers.d/50_vdsm was not included in the sudoers file.

The vdsm user couldn't run "sudo python /usr/share/vdsm/supervdsmServer.py command", or any of its other sudo commands, which caused vdsm to restart over and over.

On each host, /etc/sudoers should include either:
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

or

# vdsm customizations
#include /etc/sudoers.d/50_vdsm
# end vdsm customizations
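A quick way to test for the condition above is to grep the sudoers file for either directive. This is a hypothetical diagnostic sketch, not part of vdsm; the function name and the default path are assumptions:

```shell
# Hypothetical check: does a sudoers file pull in the vdsm drop-in rules?
# The leading '#' in '#includedir'/'#include' is part of the directive,
# not a comment marker.
check_vdsm_sudoers() {
    sudoers="${1:-/etc/sudoers}"    # default path is an assumption
    grep -Eq '^#includedir[[:space:]]+/etc/sudoers\.d|^#include[[:space:]]+/etc/sudoers\.d/50_vdsm' "$sudoers"
}
```

For example, `check_vdsm_sudoers || echo "sudoers.d not included; supervdsm will fail"` run on the host would have flagged the misconfiguration before vdsm entered its restart loop.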

Comment 4 Petr Spacek 2013-06-13 13:33:42 UTC
I reopened the bug because host re-installation (from the web admin portal) didn't fix the problem.

I manually deleted the three lines around #include, re-installed/re-provisioned the hypervisor, and the lines are still not present afterwards.

It looks like a glitch in the installation process.

Comment 5 Michal Skrivanek 2013-06-14 11:38:38 UTC
After investigation on the affected host, the cause is a local modification to the sudo config:

# diff /etc/sudoers /etc/sudoers.rpmnew | tail -2
> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
> #includedir /etc/sudoers.d
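On a host with a locally modified /etc/sudoers like this, the missing directive can be re-added idempotently. This is a sketch under assumptions (the function name `add_includedir` is made up; on a real host you would append to a copy and validate it with `visudo -c -f` before installing it over /etc/sudoers):

```shell
# Hypothetical repair sketch: append the drop-in directive to a sudoers
# file if (and only if) it is missing.
add_includedir() {
    sudoers="$1"
    # Already present? Nothing to do.
    grep -q '^#includedir /etc/sudoers\.d' "$sudoers" && return 0
    printf '%s\n%s\n' \
        '## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)' \
        '#includedir /etc/sudoers.d' >> "$sudoers"
}
```

Editing a copy and validating before install matters because a syntax error in /etc/sudoers can lock out all sudo access on the host.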

Comment 6 Petr Spacek 2013-06-14 11:52:47 UTC
The machine had an /etc/sudoers file installed by a sudo package older than version 1.7.4p5-4, which didn't contain the #includedir directive.

The relevant article is: https://access.redhat.com/site/solutions/64965

The error reporting could be better :-), but in general I agree that this is NOTABUG. Thank you for your assistance.