Bug 1240921 - EL7: ovirt node restarts network service while vdsm-network is running
Summary: EL7: ovirt node restarts network service while vdsm-network is running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Fabian Deutsch
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks: 1245240
 
Reported: 2015-07-08 07:01 UTC by Pavel Zhukov
Modified: 2019-10-10 09:58 UTC (History)
15 users

Fixed In Version: ovirt-node-3.3.0-0.4.20150906git14a6024.el7ev
Doc Type: Bug Fix
Doc Text:
Cause: During boot on RHEV-H, a race could sometimes occur between vdsm and the networking service.
Consequence: Defined networks could be missing on the host after boot.
Fix: The race was resolved by restarting networking early during boot.
Result: All defined networks are brought up at boot time.
Clone Of:
Clones: 1245240
Environment:
Last Closed: 2016-03-09 14:32:42 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:
mgoldboi: Triaged+


Attachments
var log content (6.46 MB, application/x-gzip)
2015-07-08 07:02 UTC, Pavel Zhukov
no flags
ifcfg files after reboot (615 bytes, application/x-gzip)
2015-07-08 11:42 UTC, Pavel Zhukov
no flags
logs (262.76 KB, application/x-gzip)
2015-07-17 13:30 UTC, Pavel Zhukov
no flags
vdsm persistent (1.19 KB, application/x-gzip)
2015-07-17 13:31 UTC, Pavel Zhukov
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0378 0 normal SHIPPED_LIVE ovirt-node bug fix and enhancement update for RHEV 3.6 2016-03-09 19:06:36 UTC
oVirt gerrit 43781 0 master MERGED init: Restart networking early during boot Never
oVirt gerrit 43833 0 ovirt-3.5 MERGED init: Restart networking early during boot Never

Description Pavel Zhukov 2015-07-08 07:01:55 UTC
Description of problem:
One of the two identical networks is missing from the host after reboot

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.1-20150603.0.iso

How reproducible:
100%

Steps to Reproduce:
1. Create 2 RHEV networks using the UI (no IP addresses assigned)
2. Attach the networks to the hypervisor and activate it
3. Put the host into maintenance and reboot
4. Activate the host

Actual results:
Only one NIC is up

Expected results:
Both NICs should be up

Additional info:
This is a use case for FCoE multipathing

Comment 1 Pavel Zhukov 2015-07-08 07:02:51 UTC
Created attachment 1049731 [details]
var log content

Comment 2 Pavel Zhukov 2015-07-08 08:04:32 UTC
(In reply to Dan Kenigsberg from comment #8 of bug 1237212)
> Is this issue a 3.5.0 regression? If so, I guess it is yet another
> consequence of Bug 1203422 (which is to be fixed in 3.5.4).
> 
> Both ifcfg-em1 and ifcfg-em2 have ONBOOT=no, and are taken up by vdsm too
> late (after fcoe has failed to start on top of them).
> 
> # Generated by VDSM version 4.16.20-1.el7ev
> DEVICE=em1
> HWADDR=XXXX
> ONBOOT=no
> MTU=9000
> DEFROUTE=no
> NM_CONTROLLED=no
> 
> However, I don't understand where the symmetry between em1 and em2 breaks.
Hi Dan,
I have reproduced the issue at home even without enabling fcoe/lldpad.
The only problem I can see is that the source-based route is the same for both networks...

Comment 3 Pavel Zhukov 2015-07-08 10:02:50 UTC
It's weird.
I did more tests. It seems the network goes missing at random.

Comment 4 Dan Kenigsberg 2015-07-08 10:30:46 UTC
Pavel, can you attach the ifcfg files after reboot?

Comment 5 Pavel Zhukov 2015-07-08 11:42:40 UTC
Created attachment 1049827 [details]
ifcfg files after reboot

NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.

Comment 6 Dan Kenigsberg 2015-07-08 16:24:54 UTC
(In reply to Pavel Zhukov from comment #5)
> 
> NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.

Could you rephrase that? Does the network fabric2 come up LATE, or not at all?



In the logs I find

MainThread::ERROR::2015-07-03 12:43:37,492::__init__::53::root::(__exit__) Failed rollback transaction last known good network. ERR=%s
Traceback (most recent call last):
  File "/usr/share/vdsm/network/api.py", line 694, in setupNetworks
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 833, in updateDevices
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 739, in get
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 565, in _bridgeinfo
  File "/usr/lib/python2.7/site-packages/vdsm/netinfo.py", line 177, in ports
OSError: [Errno 2] No such file or directory: '/sys/class/net/fabric2/brif'

which suggests that we have a race between the two lines

            addNetwork(network, configurator=configurator,
                       implicitBonding=True, _netinfo=_netinfo, **d)
            _netinfo.updateDevices()  # Things like a bond mtu can change

Apparently, addNetwork returns before the bridge device exists in the host's kernel. Somehow, this race shows up during boot (when the host is busy doing other things?). When bug 1203422 is fixed, this would become less of an issue (as we would commonly not call addNetwork on boot).

The race should be understood and fixed, regardless.
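
For illustration only, a minimal sketch of how that window could be closed by waiting for the bridge's brif directory to appear before updateDevices() runs. The helper name, the timeout, and the call-site placement are assumptions; this is not the fix that was eventually merged, which reorders networking at boot instead.

import errno
import os
import time


def wait_for_bridge_ports_dir(bridge, timeout=10.0, interval=0.1):
    """Poll until /sys/class/net/<bridge>/brif exists, i.e. until the
    kernel has finished creating the bridge device (hypothetical helper)."""
    brif = '/sys/class/net/%s/brif' % bridge
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.isdir(brif):
            return
        time.sleep(interval)
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), brif)


# Hypothetical placement at the call site quoted above:
#     addNetwork(network, configurator=configurator,
#                implicitBonding=True, _netinfo=_netinfo, **d)
#     wait_for_bridge_ports_dir(network)  # close the race window
#     _netinfo.updateDevices()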

Comment 7 Pavel Zhukov 2015-07-09 07:09:54 UTC
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Pavel Zhukov from comment #5)
> > 
> > NOTE: ifcfg-fabric2 and ifcfg-ens9 appear later than ifcfg-fabric1 and ifcfg-ens8.
> 
> Could you rephrase that? Does the network fabric2 come up LATE, or not at
> all?
Not at all. It comes up after reboot.

Comment 10 Pavel Zhukov 2015-07-15 14:52:38 UTC
Reproduced again.
Steps to reproduce:
1) Install RHEVH with 3 interfaces
2) Configure 1st interface as management
3) Add two bridgeless "dummy" networks (no gateway, no IPs)
4. Configure the host as an fcoe client: https://access.redhat.com/solutions/1268183
5) Activate host
6) Reboot the host

Actual result:
One of the two network interfaces is down, at random (either of them, or sometimes neither)

The RHEL 6-based hypervisor works fine. Only RHEL 7.1 is affected

Comment 12 Pavel Zhukov 2015-07-15 14:57:06 UTC
Dan, 
The issue reproduces without any bridges at all. I think the summary is not correct...

Comment 13 Pavel Zhukov 2015-07-17 13:30:37 UTC
Created attachment 1053085 [details]
logs

Comment 14 Pavel Zhukov 2015-07-17 13:31:25 UTC
Created attachment 1053086 [details]
vdsm persistent

Comment 15 Dan Kenigsberg 2015-07-17 14:11:01 UTC
in ovirt.log

Jul 15 14:31:35 Hardware virtualization detected
Restarting network (via systemctl):  [  OK  ]

while messages have

Jul 15 14:30:47 rhevh7 systemd: Starting Virtual Desktop Server Manager network restoration...
Jul 15 14:31:47 rhevh7 systemd: Failed to start Virtual Desktop Server Manager network restoration.

at the very same time. This cannot work. We must make sure that ovirt restarts its network well before it allows vdsm-network to run.

Fabian, I think we've seen this before. Do you recall?
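
For reference, a minimal sketch of the ordering this comment asks for. The unit names network.service and vdsm-network.service are assumptions inferred from the log messages above; the merged patches implement this in ovirt-node's init ("init: Restart networking early during boot") rather than in a standalone script.

import subprocess


def restart_network_before_vdsm_restoration():
    # systemctl blocks until the requested job has finished, so the
    # network restart completes before the vdsm network restoration
    # service is started.
    subprocess.check_call(['systemctl', 'restart', 'network.service'])
    subprocess.check_call(['systemctl', 'start', 'vdsm-network.service'])


if __name__ == '__main__':
    restart_network_before_vdsm_restoration()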

Comment 16 Pavel Zhukov 2015-07-17 15:49:46 UTC
ovirt#43781 fixed the bug for me. 
Tested using the same system as in https://bugzilla.redhat.com/show_bug.cgi?id=1240921#c10

Comment 19 Michael Burman 2015-12-06 08:14:01 UTC
Verified on - 3.6.1.1-0.1.el6 with:
vdsm-4.17.12-0.el7ev.noarch
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151201.2.el7ev)
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch

Comment 21 errata-xmlrpc 2016-03-09 14:32:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0378.html

