Bug 1270203

Summary: Failed to Establish Libvirt Connection on RHEV-H 7.2 after several times reboot
Product: Red Hat Enterprise Virtualization Manager Reporter: Ying Cui <ycui>
Component: ovirt-nodeAssignee: Douglas Schilling Landgraf <dougsland>
Status: CLOSED ERRATA QA Contact: Ying Cui <ycui>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.6.0CC: cshao, cwu, danken, dougsland, fdeutsch, gklein, huiwa, huzhao, leiwang, lsurette, mgoldboi, rbarry, ycui, ykaul
Target Milestone: ovirt-3.6.0-rc3Keywords: Regression, Reopened, TestBlocker
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-node-3.3.0-0.18.20151022git82dc52c.el7ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1276125 (view as bug list) Environment:
Last Closed: 2016-03-09 14:40:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Node RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1264065, 1276125    
Attachments:
Description Flags
rhevh_var_log
none
sosreport_on_host
none
pic
none
freshlog_varlog_coment8
none
freshlog_sosreport_comment8 none

Description Ying Cui 2015-10-09 09:24:19 UTC
Description of problem:
Failed to Establish Libvirt Connection on RHEV-H 7.2 for 3.6.0 after several times reboot, maybe 2 times reboot, maybe 3 times, maybe 4 times...

Version-Release number of selected component (if applicable):
# rpm -qa ovirt-node vdsm libvirt
libvirt-1.2.17-11.el7.x86_64
vdsm-4.17.8-1.el7ev.noarch
ovirt-node-3.3.0-0.10.20150928gite7ee3f1.el7ev.noarch
# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20150928.0.el7ev)

Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.18.el6

How reproducible:
100% 

Steps to Reproduce:
1. TUI clean installed rhevh 7.2 successful
2. Login TUI
3. Setup network via dhcp
4. Add rhevh via rhevm portal successful, and host is UP.
5. Maintenance the Host
6. Login RHEV-H TUI
7. Restart RHEV-H check whether failed to establish libvirt connection. reproducible: <5%
8. Restart RHEV-H twice to check whether failed to establish libvirt connection. reproducible: 70%
9. Restart RHEV-H third times to check whether failed to establish libvirt connection. reproducible: 95%
...

Actual results:
Failed to Establish Libvirt Connection.

Expected results:
without such error

Test machine:
All machines HP and Dell server.

Additional info:
# systemctl status libvirtd vdsmd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf
   Active: failed (Result: start-limit) since Fri 2015-10-09 07:42:58 UTC; 1h 34min ago
     Docs: man:libvirtd(8)
           http://libvirt.org
  Process: 13855 ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS (code=exited, status=6)
 Main PID: 13855 (code=exited, status=6)

Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Unit libvirtd.service entered failed state.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: libvirtd.service failed.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: libvirtd.service holdoff time over, scheduling restart.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: start request repeated too quickly for libvirtd.service
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Failed to start Virtualization daemon.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Unit libvirtd.service entered failed state.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: libvirtd.service failed.

● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Oct 09 07:42:57 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Oct 09 07:42:58 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.

# systemctl restart libvirtd
Job for libvirtd.service failed because the control process exited with error code. See "systemctl status libvirtd.service" and "journalctl -xe" for details.

# journalctl -xe
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit vdsm-network.service has failed.
-- 
-- The result is dependency.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
-- Subject: Unit vdsmd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit vdsmd.service has failed.
-- 
-- The result is dependency.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Dependency failed for MOM instance configured for VDSM purposes.
-- Subject: Unit mom-vdsm.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit mom-vdsm.service has failed.
-- 
-- The result is dependency.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Job mom-vdsm.service/start failed with result 'dependency'.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Job vdsm-network.service/start failed with result 'dependency'.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Unit libvirtd.service entered failed state.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: libvirtd.service failed.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
-- Subject: Unit supervdsmd.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit supervdsmd.service has begun shutting down.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: start request repeated too quickly for supervdsmd.service
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Failed to start Auxiliary vdsm service for running helper functions as root.
-- Subject: Unit supervdsmd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit supervdsmd.service has failed.
-- 
-- The result is failed.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: Unit supervdsmd.service entered failed state.
Oct 09 09:17:48 dhcp-9-114.nay.redhat.com systemd[1]: supervdsmd.service failed.

# systemctl status supervdsmd
● supervdsmd.service - Auxiliary vdsm service for running helper functions as root
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static; vendor preset: enabled)
   Active: failed (Result: start-limit) since Fri 2015-10-09 09:21:54 UTC; 23s ago
  Process: 17402 ExecStart=/usr/share/vdsm/daemonAdapter /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock (code=killed, signal=TERM)
 Main PID: 17402 (code=killed, signal=TERM)

Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: Started Auxiliary vdsm service for running helper functions as root.
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: Starting Auxiliary vdsm service for running helper functions as root...
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: start request repeated too quickly for supervdsmd.service
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: Failed to start Auxiliary vdsm service for running helper functions as root.
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: Unit supervdsmd.service entered failed state.
Oct 09 09:21:54 dhcp-9-114.nay.redhat.com systemd[1]: supervdsmd.service failed.
Oct 09 09:22:01 dhcp-9-114.nay.redhat.com systemd[1]: start request repeated too quickly for supervdsmd.service
Oct 09 09:22:01 dhcp-9-114.nay.redhat.com systemd[1]: Failed to start Auxiliary vdsm service for running helper functions as root.
Oct 09 09:22:01 dhcp-9-114.nay.redhat.com systemd[1]: supervdsmd.service failed.

Comment 1 Ying Cui 2015-10-09 09:25:32 UTC
Created attachment 1081233 [details]
rhevh_var_log

Comment 2 Ying Cui 2015-10-09 09:26:34 UTC
Created attachment 1081234 [details]
sosreport_on_host

Comment 3 Ying Cui 2015-10-09 09:28:50 UTC
Created attachment 1081235 [details]
pic

Comment 4 Ying Cui 2015-10-09 09:36:11 UTC
Additional info:
1. Once we encountered this error, whatever reboot rhevh how many times, we still encountered this error always.
2. can reproduce this issue without rhevm registration, just need to reboot rhevh several times to reproduce this error.

Comment 5 Ying Cui 2015-10-09 09:38:29 UTC
(In reply to Ying Cui from comment #4)
> Additional info:
> 1. Once we encountered this error, whatever reboot rhevh how many times, we
> still encountered this error always.

I tested 4 times on two machines for above.

Comment 6 Fabian Deutsch 2015-10-09 09:50:38 UTC
This looks a lot like bug 1251151

Comment 7 Fabian Deutsch 2015-10-09 09:51:36 UTC

*** This bug has been marked as a duplicate of bug 1251151 ***

Comment 8 Ying Cui 2015-10-13 05:11:55 UTC
We have to re-open this bug, because the bug 1251151 is verified, but this bug are still here on build rhev-hypervisor7-7.2-20151009.0(ovirt-node-3.3.0-0.13.20151008git03eefb5.el7ev.noarch) and rhev-hypervisor7-7.2-20151013.76.iso(ovirt-node-3.3.0-0.14.20151013git5f84da0.el7ev.noarch)

so reopen this urgent issue now.
log will be provided later with build rhev-hypervisor7-7.2-20151013.76.iso(ovirt-node-3.3.0-0.14.20151013git5f84da0.el7ev.noarch)

Comment 9 Fabian Deutsch 2015-10-13 06:49:41 UTC
In the logs I see:
Oct  9 07:30:01 dhcp-9-114 journal: libvirt version: 1.2.17, package: 11.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-09-25-04:15:16, x86-036.build.eng.bos.redhat.com)
Oct  9 07:30:01 dhcp-9-114 journal: Cannot read certificate '/etc/pki/libvirt/servercert.pem': No such file or directory
Oct  9 07:30:01 dhcp-9-114 systemd: libvirtd.service: main process exited, code=exited, status=6/NOTCONFIGURED
Oct  9 07:30:01 dhcp-9-114 systemd: Failed to start Virtualization daemon.

In my testing that dir is also missing, but libvirtd comes up.

Comment 10 Ying Cui 2015-10-13 09:59:59 UTC
Created attachment 1082318 [details]
freshlog_varlog_coment8

Comment 11 Ying Cui 2015-10-13 10:01:17 UTC
Created attachment 1082319 [details]
freshlog_sosreport_comment8

Comment 14 Douglas Schilling Landgraf 2015-10-13 16:39:50 UTC
Hi,

My findings so far.

I have reproduced the report, in the 3th or 4th reboot after installation, not even needed to setup network or register the node.

In the second reboot VDSM was able to setup itself and configured/persisted the files: /etc/libvirt/libvirtd.conf, /etc/libvirt/qemu-sanlock.conf and /etc/libvirt/qemu.conf.

However, in one of the reboots I have noticed that /etc/libvirt/libvirtd.conf was not persisted anymore and in the next reboot it returned to the original state, without the VDSM configuration affecting the system.

The libvirt error in the logs, shows an error finding the cacert.pem as VDSM configuration is missing to setup the right path to cacert.pem.
==============================================================================
2015-10-13 15:26:24.596+0000: 17114: error : virNetTLSContextCheckCertFile:120 : Cannot read CA certificate '/etc/pki/CA/cacert.pem': No such file or directory

# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151009.0.el7ev)

# rpm -qa | grep -i vdsm
vdsm-python-4.17.8-1.el7ev.noarch
vdsm-infra-4.17.8-1.el7ev.noarch
vdsm-yajsonrpc-4.17.8-1.el7ev.noarch
vdsm-xmlrpc-4.17.8-1.el7ev.noarch
vdsm-jsonrpc-4.17.8-1.el7ev.noarch
vdsm-4.17.8-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-1.el7ev.noarch
vdsm-cli-4.17.8-1.el7ev.noarch
vdsm-hook-ethtool-options-4.17.8-1.el7ev.noarch

# rpm -qa | grep -i libvirt
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-1.2.17-13.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
libvirt-client-1.2.17-13.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7.x86_64
libvirt-daemon-config-network-1.2.17-13.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7.x86_64
libvirt-cim-0.6.3-19.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.el7.x86_64
libvirt-lock-sanlock-1.2.17-13.el7.x86_64
libvirt-daemon-driver-lxc-1.2.17-13.el7.x86_64
libvirt-daemon-kvm-1.2.17-13.el7.x86_64

Comment 15 Douglas Schilling Landgraf 2015-10-13 19:59:33 UTC
Hi Dan,

Could VDSM configurator removing itself just in libvirtd.conf?

Comment 16 Dan Kenigsberg 2015-10-14 06:57:58 UTC
(In reply to Douglas Schilling Landgraf from comment #15)
> Hi Dan,
> 
> Could VDSM configurator removing itself just in libvirtd.conf?

I can happen only if someone explicitly asked to do so via the vdsm-tool command line, which I don't think has happened here.

Comment 17 Douglas Schilling Landgraf 2015-10-15 21:33:33 UTC
Working in this report, investigating what's causing the vdsm-tool configurator failure (to unpersist?). I don't see it happening in previous versions, like RHEV-H 7.1.

Oct 15 21:20:19 Running handler: /usr/libexec/ovirt-node/hooks/on-boot/01-vdsm-configure
Traceback (most recent call last):
  File "/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in wrapper
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 156, in configure
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line 103, in _configure
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line 65, in configure
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line 104, in removeConf
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line 225, in _removeSection
  File "/usr/lib/python2.7/site-packages/vdsm/tool/configfile.py", line 179, in __exit__
OSError: [Errno 16] Device or resource busy

Comment 18 Ryan Barry 2015-10-15 21:41:50 UTC
(In reply to Douglas Schilling Landgraf from comment #17)
> Working in this report, investigating what's causing the vdsm-tool
> configurator failure (to unpersist?). I don't see it happening in previous
> versions, like RHEV-H 7.1.
> 
> Oct 15 21:20:19 Running handler:
> /usr/libexec/ovirt-node/hooks/on-boot/01-vdsm-configure
> Traceback (most recent call last):
>   File "/bin/vdsm-tool", line 219, in main
>     return tool_command[cmd]["command"](*args)
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py", line 38, in
> wrapper
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line
> 156, in configure
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py", line
> 103, in _configure
>   File
> "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line
> 65, in configure
>   File
> "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line
> 104, in removeConf
>   File
> "/usr/lib/python2.7/site-packages/vdsm/tool/configurators/libvirt.py", line
> 225, in _removeSection
>   File "/usr/lib/python2.7/site-packages/vdsm/tool/configfile.py", line 179,
> in __exit__
> OSError: [Errno 16] Device or resource busy

Note:

I have also experienced this with --force, but not without. I hope to find the time to check the vdsm-tool source to find the difference later tonight.

Comment 19 Fabian Deutsch 2015-10-19 19:31:17 UTC
Appears on 3.5.6 as well.

Comment 20 Fabian Deutsch 2015-10-19 19:45:44 UTC
But maybe a different cause (in comment 19 the ntpd namespace patch was not applied correctly)

Comment 21 Ying Cui 2015-11-06 06:33:59 UTC
Followed the bug description test steps, this bug is fixed and verified on build 

# rpm -qa ovirt-node
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151104.0.el7ev)

Test steps:
1. TUI clean installed rhevh 7.2 successful
2. Login TUI
3. Setup network via dhcp
4. add rhevh via rhevm 3.6.0.3-0.1 portal successful, and host is UP.
5. Maintenance the Host
6. Login RHEV-H TUI
7. Restart RHEV-H more than 7 times, every time there is NO "Failed to Establish Libvirt Connection " error.

Comment 23 errata-xmlrpc 2016-03-09 14:40:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0378.html