Created attachment 759658 [details]
vdsm.log.bz2

Description of problem:
VDSMd is dead after adding a machine to the RHEV cluster. See the attached log and additional info below.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)
# rhn-channel -l
rhel-x86_64-rhev-mgmt-agent-6
rhel-x86_64-rhev-mgmt-agent-6-debuginfo
rhel-x86_64-server-6
rhel-x86_64-server-optional-6
rhel-x86_64-server-optional-6-debuginfo
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-supplementary-6-debuginfo

All updates applied.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-22.0.el6ev.x86_64
The RHEV cluster's 'Compatibility Version' is 3.0.

How reproducible:
100 %

Steps to Reproduce:
1. Join the machine to the RHEV cluster.
2. Reboot the machine - it will be in the 'Non-Responsive' state.
3. Run "tail -f /var/log/vdsm/vdsm.log" and watch the error messages.

Actual results:
VDSMd is not functional and RHEV-M shows the machine as 'Non-Responsive'.

Expected results:
The machine works :-)

Additional info:
The machine hosted plain libvirt VMs in the past. I undefined all libvirt VMs before the first attempt to join the RHEV cluster. The first attempt to install the RHEV agent and join the machine to the cluster failed because I had forgotten one network definition in libvirt (the network was called 'private'). The host installation finished successfully after manually undefining that network in libvirt. The problem is that after the reboot the machine is in the 'Non-Responsive' state.
Please provide older vdsm logs (from before the error started) so we can check what led to it. At first glance it resembles a logging error we once had that caused supervdsmServer.py to crash on each run; vdsm couldn't connect to its socket and killed itself over and over. We changed this logic for 3.2, and for 3.3 we redesigned this whole flow, so I believe this issue won't appear in 3.3. I don't understand from the description how you reproduce it, or whether it is really reproducible. If it is, I'll try to reproduce it on 3.3 to check whether it isn't already fixed.
After deeper investigation, the cause of the restarts was that sudoers.d/50_vdsm was not included from the sudoers file. The vdsm user couldn't run "sudo python /usr/share/vdsm/supervdsmServer.py command", nor any of the other sudo commands, and that led to vdsm restarting over and over. On each host, /etc/sudoers should include either:

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

or:

# vdsm customizations
#include /etc/sudoers.d/50_vdsm
# end vdsm customizations
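A quick way to confirm on a host that the drop-in is actually honored is sketched below. This is an illustrative check only, assuming the stock RHEL 6 layout and root access; it is not part of vdsm itself.

```shell
# Check that the main sudoers file pulls in /etc/sudoers.d
# (the leading '#' of #includedir is part of the directive, not a comment):
grep -q '^#includedir /etc/sudoers.d' /etc/sudoers \
    && echo "includedir present" \
    || echo "includedir MISSING - vdsm's sudo rules are ignored"

# Syntax-check the whole configuration, drop-ins included:
visudo -c

# List the commands the vdsm user may actually run via sudo:
sudo -l -U vdsm
```

If the drop-in is ignored, `sudo -l -U vdsm` will not show the supervdsm rules even though /etc/sudoers.d/50_vdsm exists on disk.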
I reopened the bug because machine re-installation (from the web admin portal) didn't fix the problem. I manually deleted the three lines around #include, re-installed/re-provisioned the hypervisor, and the lines were still not present afterwards. It looks like a glitch in the installation process.
After investigation on the affected host, the cause is a local modification to the sudo config:

# diff /etc/sudoers /etc/sudoers.rpmnew | tail -2
> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
> #includedir /etc/sudoers.d
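One hedged way to repair such a host, assuming sudo left the packaged default at /etc/sudoers.rpmnew as shown above (run as root; this is a sketch, not an official procedure):

```shell
# Show what the local file is missing compared to the packaged default:
diff /etc/sudoers /etc/sudoers.rpmnew

# Append the missing lines (the leading '#' of #includedir is part of
# the directive, not a comment), then syntax-check before trusting it:
cat >> /etc/sudoers <<'EOF'
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d
EOF
visudo -c
```

`visudo -c` validates the merged configuration, drop-ins included, and will report an error rather than silently accept a broken file.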
The machine had an /etc/sudoers file installed by a sudo package version < 1.7.4p5-4, which didn't contain the #includedir directive. The relevant article is: https://access.redhat.com/site/solutions/64965 The error reporting could be better :-), but in general I agree that this is NOTABUG. Thank you for your assistance.
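For hosts suspected of the same history, a rough detection sketch follows. It assumes an RPM-based system and GNU `sort -V`, and takes 1.7.4p5-4 as the first build whose packaged /etc/sudoers carries the directive, per the article above.

```shell
# Compare the installed sudo build against the first fixed build.
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' sudo)
fixed='1.7.4p5-4'
oldest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "sudo $installed predates $fixed - check /etc/sudoers for #includedir"
fi
```

Note that an old sudoers file survives package upgrades (it is a config file), so checking the file itself, not just the package version, is the decisive test.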