Bug 973249 - "Couldn't connect to supervdsm" after new host installation
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.3
Priority: unspecified
Severity: urgent
Target Release: 3.3.0
Assigned To: Yaniv Bronhaim
Whiteboard: infra
Reported: 2013-06-11 10:42 EDT by Petr Spacek
Modified: 2016-02-10 14:23 EST
CC: 9 users
Doc Type: Bug Fix
Last Closed: 2013-06-14 07:38:38 EDT
Type: Bug
oVirt Team: Infra

Attachments:
vdsm.log.bz2 (14.01 KB, application/octet-stream), attached 2013-06-11 10:42 EDT by Petr Spacek

External Trackers:
Red Hat Knowledge Base (Solution) 64965
Description Petr Spacek 2013-06-11 10:42:51 EDT
Created attachment 759658 [details]
vdsm.log.bz2

Description of problem:
VDSMd is dead after adding a machine to the RHEV cluster. See the attached log and the additional info below.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

# rhn-channel -l
rhel-x86_64-rhev-mgmt-agent-6
rhel-x86_64-rhev-mgmt-agent-6-debuginfo
rhel-x86_64-server-6
rhel-x86_64-server-optional-6
rhel-x86_64-server-optional-6-debuginfo
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-supplementary-6-debuginfo

All updates applied.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-22.0.el6ev.x86_64
The RHEV cluster's 'Compatibility Version' is set to 3.0

How reproducible:
100 %

Steps to Reproduce:
1. Join machine to RHEV cluster
2. Reboot the machine - it will be in 'Non-Responsive' state
3. Run "tail -f /var/log/vdsm/vdsm.log" and see error messages

Actual results:
VDSMd is not functional and RHEV-M shows machine as 'Non-Responsive'.

Expected results:
The machine works :-)

Additional info:
The machine hosted plain libvirt VMs in the past. I undefined all libvirt VMs before the first attempt to join RHEV cluster.

The first attempt to install the RHEV agent and join the machine to the RHEV cluster failed because I had forgotten one network definition in libvirt (the network was called 'private').

The host installation finished successfully after manually undefining that network in libvirt. The problem is that after the reboot the machine is in the 'Non-Responsive' state.
Comment 1 Yaniv Bronhaim 2013-06-13 06:02:53 EDT
Please provide older vdsm logs (from before the error started) so we can check what led to this.

At first glance, this resembles a logging error we had that caused supervdsmServer.py to crash on each run. Vdsm couldn't connect to its socket and killed itself over and over again.

We changed this logic for 3.2. For 3.3 we redesigned this whole flow, so I believe this issue won't appear in 3.3.

I don't understand from the description how you reproduced this, or whether it is really reproducible.

If it is, I'll try to reproduce it on 3.3 to check whether it has already been fixed.
Comment 3 Yaniv Bronhaim 2013-06-13 09:27:58 EDT
After deep investigation, the cause of the restarts was that /etc/sudoers.d/50_vdsm was not included by the sudoers file.

The vdsm user couldn't run "sudo python /usr/share/vdsm/supervdsmServer.py command", or any of the other sudo commands, and that led to the restarts over and over.

On each host, /etc/sudoers should include either:
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

or

# vdsm customizations
#include /etc/sudoers.d/50_vdsm
# end vdsm customizations
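The check above can be sketched as a small script. This is not part of the original report, just a minimal illustration; the `has_vdsm_include` function name and the sample files are assumptions for the demo, and a real check would point it at `/etc/sudoers` instead:

```shell
#!/bin/sh
# Sketch: check whether a sudoers-style file pulls in the vdsm drop-in,
# either via the #includedir directive or a direct #include of 50_vdsm.
# (Both directives really do start with '#'; it is not a comment marker.)

has_vdsm_include() {
    grep -Eq '^#includedir[[:space:]]+/etc/sudoers\.d|^#include[[:space:]]+/etc/sudoers\.d/50_vdsm' "$1"
}

# Demo on mock files: an old-style sudoers without the directive,
# and a fixed one with it.
old=$(mktemp); new=$(mktemp)
printf 'root ALL=(ALL) ALL\n' > "$old"
printf 'root ALL=(ALL) ALL\n#includedir /etc/sudoers.d\n' > "$new"

has_vdsm_include "$old" && echo "old: include present" || echo "old: include MISSING"
has_vdsm_include "$new" && echo "new: include present" || echo "new: include MISSING"
rm -f "$old" "$new"
```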
Comment 4 Petr Spacek 2013-06-13 09:33:42 EDT
I reopened the bug because machine re-installation (from the web admin portal) didn't fix the problem.

To test, I manually deleted the three lines around #include and re-ran hypervisor installation/re-provisioning; the lines were still not restored.

It seems like a glitch in the installation process.
Comment 5 Michal Skrivanek 2013-06-14 07:38:38 EDT
After investigation on the affected host, the cause is a local modification to the sudo config:

# diff /etc/sudoers /etc/sudoers.rpmnew | tail -2
> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
> #includedir /etc/sudoers.d
Comment 6 Petr Spacek 2013-06-14 07:52:47 EDT
The machine had an /etc/sudoers file installed by a sudo package older than 1.7.4p5-4, which didn't contain the #includedir directive.

Relevant article is: https://access.redhat.com/site/solutions/64965
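The diagnosis in comment 5 can be reproduced in miniature. This is a sketch, not from the original report: the two files below are mocks standing in for the live /etc/sudoers (written by the old sudo package, so missing the directive) and the /etc/sudoers.rpmnew shipped by the newer package:

```shell
#!/bin/sh
# Sketch: a sudoers file from an old sudo package (< 1.7.4p5-4) lacks the
# #includedir directive, while the .rpmnew from the newer package has it.
tmp=$(mktemp -d)
printf 'root ALL=(ALL) ALL\n' > "$tmp/sudoers"
printf 'root ALL=(ALL) ALL\n## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)\n#includedir /etc/sudoers.d\n' > "$tmp/sudoers.rpmnew"

# Same comparison as in comment 5: lines prefixed with '>' exist only in
# the .rpmnew file, i.e. they are missing from the live sudoers.
diff "$tmp/sudoers" "$tmp/sudoers.rpmnew" | tail -2
rm -rf "$tmp"
```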

The error reporting could be better :-), but in general I agree that this is NOTABUG. Thank you for the assistance.
