Bug 1240921 - EL7: ovirt node restarts network service while vdsm-network is running
Summary: EL7: ovirt node restarts network service while vdsm-network is running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Fabian Deutsch
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks: 1245240
 
Reported: 2015-07-08 07:01 UTC by Pavel Zhukov
Modified: 2019-10-10 09:58 UTC (History)
15 users

Fixed In Version: ovirt-node-3.3.0-0.4.20150906git14a6024.el7ev
Doc Type: Bug Fix
Doc Text:
Cause: During boot on RHEV-H, a race could sometimes occur between vdsm and the networking service.
Consequence: Defined networks could be missing on the host after boot.
Fix: The race was resolved by restarting networking early during boot.
Result: All defined networks are brought up at boot time.
Clone Of:
Clones: 1245240
Environment:
Last Closed: 2016-03-09 14:32:42 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:
mgoldboi: Triaged+


Attachments
var log content (6.46 MB, application/x-gzip)
2015-07-08 07:02 UTC, Pavel Zhukov
no flags
ifcfg files after reboot (615 bytes, application/x-gzip)
2015-07-08 11:42 UTC, Pavel Zhukov
no flags
logs (262.76 KB, application/x-gzip)
2015-07-17 13:30 UTC, Pavel Zhukov
no flags
vdsm persistent (1.19 KB, application/x-gzip)
2015-07-17 13:31 UTC, Pavel Zhukov
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0378 0 normal SHIPPED_LIVE ovirt-node bug fix and enhancement update for RHEV 3.6 2016-03-09 19:06:36 UTC
oVirt gerrit 43781 0 master MERGED init: Restart networking early during boot Never
oVirt gerrit 43833 0 ovirt-3.5 MERGED init: Restart networking early during boot Never

Description Pavel Zhukov 2015-07-08 07:01:55 UTC
Description of problem:
One of the two identical networks is missing from the host after reboot

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.1-20150603.0.iso

How reproducible:
100%

Steps to Reproduce:
1. Create 2 RHEV networks using the UI (no IP addresses assigned)
2. Attach the networks to the hypervisor and activate it
3. Put the host into maintenance and reboot
4. Activate the host

Actual results:
Only one NIC is up

Expected results:
Both NICs should be up

Additional info:
This is a use case for FCoE multipathing

Comment 1 Pavel Zhukov 2015-07-08 07:02:51 UTC
Created attachment 1049731 [details]
var log content

Comment 2 Pavel Zhukov 2015-07-08 08:04:32 UTC
(In reply to Dan Kenigsberg from comment #8 of bug 1237212)
> Is this issue a 3.5.0 regression? If so, I guess it is yet another
> consequence of Bug 1203422 (which is to be fixed in 3.5.4).
> 
> Both ifcfg-em1 and ifcfg-em2 have ONBOOT=no, and are taken up by vdsm too
> late (after fcoe has failed to start on top of them).
> 
> # Generated by VDSM version 4.16.20-1.el7ev
> DEVICE=em1
> HWADDR=XXXX
> ONBOOT=no
> MTU=9000
> DEFROUTE=no
> NM_CONTROLLED=no
> 
> However, I don't understand where the symmetry between em1 and em2 breaks.
Hi Dan,
I have reproduced the issue at home even without enabling fcoe/lldpad.
The only problem I can see is that the source-based route is the same for both networks...

Comment 3 Pavel Zhukov 2015-07-08 10:02:50 UTC
It's weird.
I did more tests. It seems the network goes missing at random.

Comment 4 Dan Kenigsberg 2015-07-08 10:30:46 UTC
Pavel, can you attach the ifcfg files after reboot?

Comment 5 Pavel Zhukov 2015-07-08 11:42:40 UTC
Created attachment 1049827 [details]
ifcfg files after reboot

NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.

Comment 6 Dan Kenigsberg 2015-07-08 16:24:54 UTC
(In reply to Pavel Zhukov from comment #5)
> 
> NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.

Could you rephrase that? Does the network fabric2 come up LATE, or not at all?



In the logs I find

MainThread::ERROR::2015-07-03 12:43:37,492::__init__::53::root::(__exit__) Failed rollback transaction last known good network. ERR=%s
Traceback (most recent call last):
  File "/usr/share/vdsm/network/api.py", line 694, in setupNetworks
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 833, in updateDevices
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 739, in get
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 565, in _bridgeinfo
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 177, in ports
OSError: [Errno 2] No such file or directory: '/sys/class/net/fabric2/brif'

which suggests that we have a race between the two lines

            addNetwork(network, configurator=configurator,
                       implicitBonding=True, _netinfo=_netinfo, **d)
            _netinfo.updateDevices()  # Things like a bond mtu can change

Apparently, addNetwork returns before the bridge device exists in the host's kernel. Somehow, this race shows up during boot (when the host is busy doing other things?). When bug 1203422 is fixed, this would become less of an issue (as we would commonly not call addNetwork on boot).

The race should be understood and fixed, regardless.
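
For illustration only, a minimal sketch of how that window could be closed by waiting for the bridge's brif directory to appear before updateDevices() runs. The helper name, the timeout, and the call-site placement are assumptions; this is not the fix that was eventually merged, which reorders networking at boot instead.

import errno
import os
import time


def wait_for_bridge_ports_dir(bridge, timeout=10.0, interval=0.1):
    """Poll until /sys/class/net/<bridge>/brif exists, i.e. until the
    kernel has finished creating the bridge device (hypothetical helper)."""
    brif = '/sys/class/net/%s/brif' % bridge
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.isdir(brif):
            return
        time.sleep(interval)
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), brif)


# Hypothetical placement at the call site quoted above:
#     addNetwork(network, configurator=configurator,
#                implicitBonding=True, _netinfo=_netinfo, **d)
#     wait_for_bridge_ports_dir(network)  # close the race window
#     _netinfo.updateDevices()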

Comment 7 Pavel Zhukov 2015-07-09 07:09:54 UTC
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Pavel Zhukov from comment #5)
> > 
> > NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.
> 
> Could you rephrase that? Does the network fabric2 come up LATE, or not at
> all?
Not at all. It comes up after reboot.

Comment 10 Pavel Zhukov 2015-07-15 14:52:38 UTC
Reproduced again.
Steps to reproduce:
1) Install RHEVH with 3 interfaces
2) Configure 1st interface as management
3) Add two bridgeless "dummy" networks (no gateway, no IPs)
4. Configure the host as an fcoe client: https://access.redhat.com/solutions/1268183
5) Activate host
6) Reboot the host

Actual result:
One of the two network interfaces is down, at random (either of them, or sometimes neither)

The RHEL 6-based hypervisor works fine. Only RHEL 7.1 is affected

Comment 12 Pavel Zhukov 2015-07-15 14:57:06 UTC
Dan, 
The issue reproduces without any bridges at all. I think the summary is not correct...

Comment 13 Pavel Zhukov 2015-07-17 13:30:37 UTC
Created attachment 1053085 [details]
logs

Comment 14 Pavel Zhukov 2015-07-17 13:31:25 UTC
Created attachment 1053086 [details]
vdsm persistent

Comment 15 Dan Kenigsberg 2015-07-17 14:11:01 UTC
in ovirt.log

Jul 15 14:31:35 Hardware virtualization detected
Restarting network (via systemctl):  [  OK  ]

while messages have

Jul 15 14:30:47 rhevh7 systemd: Starting Virtual Desktop Server Manager network restoration...
Jul 15 14:31:47 rhevh7 systemd: Failed to start Virtual Desktop Server Manager network restoration.

at the very same time. This cannot work. We must make sure that ovirt restarts its network well before it allows vdsm-network to run.

Fabian, I think we've seen this before. Do you recall?
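
For reference, a minimal sketch of the ordering this comment asks for. The unit names network.service and vdsm-network.service are assumptions inferred from the log messages above; the merged patches implement this in ovirt-node's init ("init: Restart networking early during boot") rather than in a standalone script.

import subprocess


def restart_network_before_vdsm_restoration():
    # systemctl blocks until the requested job has finished, so the
    # network restart completes before the vdsm network restoration
    # service is started.
    subprocess.check_call(['systemctl', 'restart', 'network.service'])
    subprocess.check_call(['systemctl', 'start', 'vdsm-network.service'])


if __name__ == '__main__':
    restart_network_before_vdsm_restoration()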

Comment 16 Pavel Zhukov 2015-07-17 15:49:46 UTC
ovirt#43781 fixed the bug for me. 
Tested using the same system as in https://bugzilla.redhat.com/show_bug.cgi?id=1240921#c10

Comment 19 Michael Burman 2015-12-06 08:14:01 UTC
Verified on - 3.6.1.1-0.1.el6 with:
vdsm-4.17.12-0.el7ev.noarch
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151201.2.el7ev)
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch

Comment 21 errata-xmlrpc 2016-03-09 14:32:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0378.html

