Bug 1260892

Summary: vdsmd fails to come up because networking prevents libvirtd from coming up
Product: Red Hat Enterprise Virtualization Manager
Reporter: Pavel Zhukov <pzhukov>
Component: vdsm
Assignee: Ido Barkan <ibarkan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Meni Yakove <myakove>
Severity: unspecified
Priority: unspecified
Version: 3.5.4
CC: bazulay, danken, ecohen, eedri, fdeutsch, gklein, lsurette, mburman, pzhukov, ycui, yeylon, ylavi
Target Milestone: ovirt-3.5.7
Keywords: Reopened
Target Release: 3.5.7
Hardware: Unspecified
OS: Unspecified
Whiteboard: network
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2015-11-25 10:16:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Network
Cloudforms Team: ---

Description Pavel Zhukov 2015-09-08 07:29:26 UTC
Description of problem:
RHEV-H ISO
After system boot, both libvirtd and vdsmd are in a failed state because libvirtd is unable to initialize its socket.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150828.0.el6ev)
Red Hat Enterprise Virtualization Hypervisor release 6.6 (20150603.0.el6ev)


How reproducible:
100% on some systems

Actual results:
libvirtd is down
with error message:
Starting libvirtd daemon: libvirtd: error: Unable to initialize network sockets. Check /var/log/messages or run without --daemon for more info.
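
The error message itself suggests the next diagnostic step. Something along these lines should surface the real cause (a sketch using standard RHEL 6 paths, not taken from this report):

"""
# see what libvirtd logged while failing to start
grep libvirtd /var/log/messages | tail -n 20

# check whether the runtime socket directory is present
ls -ld /var/run/libvirt

# run libvirtd in the foreground (without --daemon) for the full error
libvirtd --verbose
"""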

Expected results:
vdsmd and libvirtd should be running

Additional info:
I was able to reproduce this behaviour with the following:

"""
service vdsmd stop
service libvirtd stop
rm -rf /var/run/libvirt
service vdsmd start
"""

Libvirtd was running before:
Sep 07 10:53:53 Completed ovirt-cim
libvirtd start/running, process 15706
[  OK  ]
supervdsm start[  OK  ]
supervdsm start[  OK  ]

So it looks like something removed or remounted the run directory before vdsmd started.
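
To check that theory, one could look at how the run directory is mounted right after boot (illustrative commands only; nothing here comes from the attached logs):

"""
# is /var/run (or /run) a tmpfs that is recreated empty at boot?
mount | grep -E ' /(var/)?run '

# does the libvirt runtime directory survive until vdsmd starts?
ls -ld /var/run /var/run/libvirt
"""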

Comment 2 Pavel Zhukov 2015-09-08 07:31:09 UTC
Created attachment 1071216 [details]
libvirt log

Comment 4 Fabian Deutsch 2015-09-08 07:43:49 UTC
Moving this to vdsm, as it looks like the network is not getting restored. From the case: "The network is not configured because vdsm is not started because libvirtd is failed to start."

Comment 5 Pavel Zhukov 2015-09-08 08:01:41 UTC
(In reply to Fabian Deutsch from comment #4)
> Moving this to vdsm, as it looks like the network is not getting restored,
> from the case: "The network is not configured because vdsm is not started
> because libvirtd is failed to start. "

Fabian, 
Sorry, but this is not a networking issue; it is a unix socket one. I've opened a bug against libvirt (BZ#1260885) to change the error message.

Comment 6 Fabian Deutsch 2015-09-09 09:11:06 UTC
It has been discussed whether libvirt is behaving correctly by not coming up when there is no network available.

But independent of that behavior, the problem here is that the networking did not come up, as stated in comment 4.

Comment 7 Fabian Deutsch 2015-09-09 09:12:22 UTC
The libvirt behavior is actually nicely explained in bug 1260885 comment 2.

Comment 8 Michael Burman 2015-09-09 09:24:01 UTC
Pavel please add the exact steps to reproduce this bug.
Was the server installed in rhev-m?
Are networks configured via Setup Networks?
Was the server registered to the engine via the TUI?
Was there any upgrade involved here?
Your description is not clear.

Thanks.

Comment 9 Dan Kenigsberg 2015-09-09 13:41:24 UTC
Please provide supervdsm.log and the content of /var/lib/vdsm and /etc/sysconfig/network-scripts
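
For convenience, all of the requested data can be collected in one archive (a sketch assuming the default supervdsm.log location under /var/log/vdsm):

"""
# bundle the logs and configuration requested above
tar czf /tmp/bz1260892-info.tgz \
    /var/log/vdsm/supervdsm.log \
    /var/lib/vdsm \
    /etc/sysconfig/network-scripts
"""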

Is this an upgrade to 20150828.0.el6ev ?
Shouldn't the Version field be set to 3.5.4?

Comment 10 Michael Burman 2015-09-09 14:24:14 UTC
Can't reproduce this report with RHEV Hypervisor - 6.7 - 20150828.0.el6ev (vdsm-4.16.26-1.el6ev)

Comment 12 Ido Barkan 2015-10-28 08:57:25 UTC
from supervdsm.log I see an older vdsm version:
# Generated by VDSM version 4.16.13.1-1.el6ev
which is 3.5.1.
And if it is, a lot has changed in this area from 3.5.1 to the latest 3.5.4 (where vdsm is tagged 4.16.26).

*** This bug has been marked as a duplicate of bug 1203422 ***

Comment 13 Pavel Zhukov 2015-11-01 10:47:40 UTC
(In reply to Ido Barkan from comment #12)
> from supervdsm.log I see an older vdsm version:
> # Generated by VDSM version 4.16.13.1-1.el6ev
> which is 3.5.1.
> > And if it is, a lot has changed in this area from 3.5.1 to the latest 3.5.4
That's not true. You've pasted an old log record (see the timestamp).

Comment 14 Eyal Edri 2015-11-01 14:16:36 UTC
This bug missed the build date of 3.5.6.
If you believe this is a blocker for the release, please set the blocker flag and get the relevant acks.

Comment 15 Ido Barkan 2015-11-02 07:22:01 UTC
(In reply to Pavel Zhukov from comment #13)
> (In reply to Ido Barkan from comment #12)
> > from supervdsm.log I see an older vdsm version:
> > # Generated by VDSM version 4.16.13.1-1.el6ev
> > which is 3.5.1.
> > And if it is, a lot has changed in this area from 3.5.1 to the latest 3.5.4
> That's not true. You've pasted an old log record (see the timestamp).

Okay.
In that case we need more info. Pavel, can you please add the info requested in comment 8 and comment 9?

Comment 18 Pavel Zhukov 2015-11-02 08:07:04 UTC
(In reply to Michael Burman from comment #8)
> Pavel please add the exact steps to reproduce this bug.
I don't have hardware to reproduce it at home. It is not reproducible with a simple network configuration in a nested environment.
> Was the server installed in rhev-m?
Can you please elaborate? It was registered in rhevm.
> Are networks configured via Setup Networks?
It's an upgraded hypervisor.
> Was the server registrated to engine via TUI?
It's an upgraded hypervisor.
> Was there any upgrade involved here?
For sure it was. They hit BZ#1203422 and tried to upgrade to fix the issue.
> Your description is not clear.
> 
> Thanks.

Comment 19 Ido Barkan 2015-11-02 12:03:14 UTC
OK, so now I understand the versions. Sorry about that:

An upgrade of rhev-h 20150603.0.el6ev to rhev-h 20150828.0.el6ev
is an upgrade from vdsm 4.16.20-1 to 4.16.26-1
which is an upgrade from rhev 3.5.3 to rhev 3.5.4.

Since all I see in supervdsm.log is a lonely restart message, I can only guess that somehow the restoration process failed to start.

Sadly, until 3.5.4 the ifcfg files were not persisted by rhev-h, so after boot the node would come up without any ifcfg files owned by vdsm, and vdsm would recreate them according to the stored persistence. This was finally fixed in 3dd0baa (which is only part of 3.5.4, v4.16.24).
If, for some reason, during boot, vdsm failed to call the restoration script, or failed to load at all (libvirt being down is a possible reason), you are left with no networks at all. In your case, only ifcfg-eth0, which existed before vdsm, is present.
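
A quick way to see which of those states a host is in (a sketch; the "Generated by VDSM" marker is the header vdsm writes into its ifcfg files, and the persistence path assumes vdsm's unified persistence layout):

"""
# which ifcfg files, if any, were written by vdsm?
grep -l 'Generated by VDSM' /etc/sysconfig/network-scripts/ifcfg-*

# is the persisted network configuration still on the node?
ls -l /var/lib/vdsm/persistence/netconf/nets/
"""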

We can try to investigate further, but what happened between 3.5.1 and 3.5.4 in the network area is bad for many reasons, most of which I hope are already fixed.

Can you please ask the customer to restore his damaged host by hand on 3.5.4, persist the networks, and see if things are lost again when upgrading to the latest 3.5? If all is OK, there is nothing we can really do to help here.
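
For the manual restore, something along these lines should do it (a sketch: vdsm-tool restore-nets re-applies the persisted configuration, and persist is the RHEV-H command that keeps files across reboots; exact usage may differ between versions):

"""
# re-apply the persisted networks, then make them survive a reboot
vdsm-tool restore-nets
persist /etc/sysconfig/network-scripts/ifcfg-*
"""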