Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1240921

Summary: EL7: ovirt node restarts network service while vdsm-network is running
Product: Red Hat Enterprise Virtualization Manager
Reporter: Pavel Zhukov <pzhukov>
Component: ovirt-node
Assignee: Fabian Deutsch <fdeutsch>
Status: CLOSED ERRATA
QA Contact: Michael Burman <mburman>
Severity: urgent
Docs Contact:
Priority: medium
Version: 3.5.1
CC: bazulay, bmcclain, danken, dougsland, fdeutsch, gklein, lpeer, lsurette, mburman, mgoldboi, pavel, pzhukov, ycui, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rc
Keywords: ZStream
Target Release: 3.6.0
Flags: mgoldboi: Triaged+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-node-3.3.0-0.4.20150906git14a6024.el7ev
Doc Type: Bug Fix
Doc Text:
Cause: During boot, a race could sometimes occur between vdsm and networking on RHEV-H.
Consequence: Networks could be missing on the host after boot.
Fix: The race was resolved.
Result: All defined networks are brought up at boot time.
Story Points: ---
Clone Of:
Cloned To: 1245240 (view as bug list)
Environment:
Last Closed: 2016-03-09 14:32:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Node
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1245240    
Attachments:
  Description               Flags
  var log content           none
  ifcfg files after reboot  none
  logs                      none
  vdsm persistent           none

Description Pavel Zhukov 2015-07-08 07:01:55 UTC
Description of problem:
One of two identical networks is missing from the host after reboot

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.1-20150603.0.iso

How reproducible:
100%

Steps to Reproduce:
1. Create 2 RHEV networks using the UI (no IP addresses assigned)
2. Attach the networks to the hypervisor and activate it
3. Put the host into maintenance mode and reboot
4. Activate the host

Actual results:
Only one NIC is up

Expected results:
Both NICs should be up

Additional info:
This is a use case for FCoE multipathing

Comment 1 Pavel Zhukov 2015-07-08 07:02:51 UTC
Created attachment 1049731 [details]
var log content

Comment 2 Pavel Zhukov 2015-07-08 08:04:32 UTC
(In reply to Dan Kenigsberg from comment #8 of bug 1237212)
> Is this issue a 3.5.0 regression? If so, I guess it is yet another
> consequence of Bug 1203422 (which is to be fixed in 3.5.4).
> 
> Both ifcfg-em1 and ifcfg-em2 have ONBOOT=no, and are taken up by vdsm too
> late (after fcoe have failed to start on top them).
> 
> # Generated by VDSM version 4.16.20-1.el7ev
> DEVICE=em1
> HWADDR=XXXX
> ONBOOT=no
> MTU=9000
> DEFROUTE=no
> NM_CONTROLLED=no
> 
> However, I don't understand where the symmetry between em1 and em2 breaks.
Hi Dan,
I have reproduced the issue at home even without enabling fcoe/lldpad.
The only problem I can see is that the source-based route is the same for both networks...

Comment 3 Pavel Zhukov 2015-07-08 10:02:50 UTC
It's weird.
I ran more tests. It seems the network goes missing at random.

Comment 4 Dan Kenigsberg 2015-07-08 10:30:46 UTC
Pavel, can you attach the ifcfg files after reboot?

Comment 5 Pavel Zhukov 2015-07-08 11:42:40 UTC
Created attachment 1049827 [details]
ifcfg files after reboot

NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than fabric1 and ens8.

Comment 6 Dan Kenigsberg 2015-07-08 16:24:54 UTC
(In reply to Pavel Zhukov from comment #5)
> 
> NOTE:ifcfg-fabric2  and ifcfg-ens9 appears later than fabric1 and ens8.

Could you rephrase that? Does the network fabric2 come up LATE, or not at all?



In the logs I find

MainThread::ERROR::2015-07-03 12:43:37,492::__init__::53::root::(__exit__) Failed rollback transaction last known good network. ERR=%s
Traceback (most recent call last):
  File "/usr/share/vdsm/network/api.py", line 694, in setupNetworks
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 833, in updateDevices
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 739, in get
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 565, in _bridgeinfo
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 177, in ports
OSError: [Errno 2] No such file or directory: '/sys/class/net/fabric2/brif'

which suggests that we have a race between the two lines

            addNetwork(network, configurator=configurator,
                       implicitBonding=True, _netinfo=_netinfo, **d)
            _netinfo.updateDevices()  # Things like a bond mtu can change

Apparently, addNetwork returns before the bridge device exists in the host's kernel. Somehow, this race shows up during boot (when the host is busy doing other things?). When bug 1203422 is fixed, this will become less of an issue (as we will commonly not call addNetwork on boot).

The race should be understood and fixed, regardless.
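As a hedged sketch of one way such a race could be closed (illustrative only: `wait_for_bridge_ports` is a hypothetical helper, not part of vdsm), the caller could poll sysfs until the kernel has actually created the bridge's `brif` directory before re-reading device info:

```python
import os
import time


def wait_for_bridge_ports(bridge, timeout=5.0, interval=0.1,
                          sysfs_root="/sys/class/net"):
    """Poll until the bridge's brif directory exists in sysfs.

    Hypothetical helper: instead of assuming addNetwork() left the
    bridge fully created in the kernel, wait (up to `timeout` seconds)
    for /sys/class/net/<bridge>/brif to appear before proceeding.
    Returns True if the directory exists, False on timeout.
    """
    brif = os.path.join(sysfs_root, bridge, "brif")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isdir(brif):
            return True
        time.sleep(interval)
    # One last check in case the device appeared right at the deadline.
    return os.path.isdir(brif)
```

This is only a sketch of the polling idea; a real fix in vdsm would likely belong inside the configurator rather than bolted on after addNetwork.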

Comment 7 Pavel Zhukov 2015-07-09 07:09:54 UTC
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Pavel Zhukov from comment #5)
> > 
> > NOTE:ifcfg-fabric2  and ifcfg-ens9 appears later than fabric1 and ens8.
> 
> Could you rephrase that? Does the network fabric2 come up LATE, or not at
> all?
Not at all. It comes up after reboot.

Comment 10 Pavel Zhukov 2015-07-15 14:52:38 UTC
Reproduced again.
Steps to reproduce:
1) Install RHEVH with 3 interfaces
2) Configure 1st interface as management
3) Add two bridgeless "dummy" networks (no gateway, no IPs)
4) Configure hosts as fcoe client https://access.redhat.com/solutions/1268183
5) Activate host
6) Reboot the host

Actual result:
One of the two network interfaces is down at random (either one of them, or neither).

The RHEL 6-based hypervisor works fine. Only RHEL 7.1 is affected.

Comment 12 Pavel Zhukov 2015-07-15 14:57:06 UTC
Dan, 
The issue reproduces even without bridges. I think the summary is not correct...

Comment 13 Pavel Zhukov 2015-07-17 13:30:37 UTC
Created attachment 1053085 [details]
logs

Comment 14 Pavel Zhukov 2015-07-17 13:31:25 UTC
Created attachment 1053086 [details]
vdsm persistent

Comment 15 Dan Kenigsberg 2015-07-17 14:11:01 UTC
in ovirt.log

Jul 15 14:31:35 Hardware virtualization detected
Restarting network (via systemctl):  [  OK  ]

while messages have

Jul 15 14:30:47 rhevh7 systemd: Starting Virtual Desktop Server Manager network restoration...
Jul 15 14:31:47 rhevh7 systemd: Failed to start Virtual Desktop Server Manager network restoration.

at the very same time. This cannot work. We must make sure that ovirt restarts its network well before it allows vdsm-network to run.

Fabian, I think we've seen this before. Do you recall?
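For what it's worth, this kind of ordering is exactly what systemd unit dependencies express. A minimal sketch, assuming the node's network restart runs as its own unit (the name "ovirt-node-network.service" below is an assumption for illustration, not the real unit name in the ovirt-node packaging):

```ini
# Hypothetical drop-in: /etc/systemd/system/vdsm-network.service.d/order.conf
# Delays vdsm-network until the node's own network restart has completed.
# "ovirt-node-network.service" is an assumed name for illustration.
[Unit]
After=ovirt-node-network.service
```

With such an ordering in place, systemd would not start vdsm-network until the node's network setup had finished, rather than letting the two restart the network concurrently.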

Comment 16 Pavel Zhukov 2015-07-17 15:49:46 UTC
ovirt#43781 fixed the bug for me. 
Tested using the same system as in https://bugzilla.redhat.com/show_bug.cgi?id=1240921#c10

Comment 19 Michael Burman 2015-12-06 08:14:01 UTC
Verified on - 3.6.1.1-0.1.el6 with:
vdsm-4.17.12-0.el7ev.noarch
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151201.2.el7ev)
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch

Comment 21 errata-xmlrpc 2016-03-09 14:32:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0378.html