Bug 856737

Summary: 3.2 - IP of network is silently not set if used by another host on LAN, and no rollback is performed
Product: Red Hat Enterprise Virtualization Manager Reporter: Pavel Stehlik <pstehlik>
Component: vdsm Assignee: Antoni Segura Puimedon <asegurap>
Status: CLOSED CURRENTRELEASE QA Contact: Martin Pavlik <mpavlik>
Severity: high Docs Contact:
Priority: high    
Version: 3.2.0 CC: asegurap, bazulay, cpelland, danken, dnaori, gklein, hateya, iheim, lpeer, mavital, sgrinber, ykaul
Target Milestone: --- Keywords: Reopened
Target Release: 3.2.0   
Hardware: All   
OS: Linux   
Whiteboard: network
Fixed In Version: vdsm-4.10.2-4.0 Doc Type: Bug Fix
Doc Text:
Cause: vdsm used to call the `ifup` script but did not wait to check its return code. Consequence: if ifup failed (for example, because another host on the LAN already had the same IP address), vdsm neither noticed the failure nor reported it to Engine. Fix: make the return code of ifup significant to the success of the setupNetwork verb (unless non-blocking dhcp was requested). Result: a failure in ifup is reported to Engine as a failure of setupNetwork.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-16 02:02:01 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 915537    
Attachments:
Description Flags
vdsm.log none

Description Pavel Stehlik 2012-09-12 13:26:24 EDT
Created attachment 612186 [details]
vdsm-engine-logs

Description of problem:
 1. Go to DataCenter and create a new network (and add the new network to a cluster).
 2. Go to the host, choose a NIC and attach the created network to it.

The bridge is created and the ifcfg file contains the IP, but RHEVM doesn't show it and the host doesn't have this IP.

Version-Release number of selected component (if applicable):
vdsm-4.9.6-31.0.el6_3.x86_64

How reproducible:
15%

Steps to Reproduce:
1. see above
  
Actual results:
See the logs; the relevant call starts with:
Thread-506::DEBUG::2012-09-12 16:45:17,667::BindingXMLRPC::864::vds::(wrapper) client [10.34.63.19]::call setupNetworks with ({'sit2': {'nic': 'eth1', 'netmask': '255.255.255.0', 'ipaddr': '192.168.99.5', 'bridged': 'true', 'STP': 'no'}}, {}, {'connectivityCheck': 'true', 'connectivityTimeout': 120}) {} flowID [26d7c5d7]


Expected results:


Additional info:
Comment 4 Dan Kenigsberg 2012-09-13 08:02:31 EDT
Indeed there is a bug here: network 'sit2' reports an empty addr, even though its cfg requests it

'networks': {
...
'sit2': {'iface': 'sit2', 'addr': '', 'cfg': {'IPADDR': '192.168.99.5', 'DELAY': '0', 'NM_CONTROLLED': 'no', 'NETMASK': '255.255.255.0', 'BOOTPROTO': 'none', 'STP': 'no', 'DEVICE': 'sit2', 'TYPE': 'Bridge', 'ONBOOT': 'yes'}, 'mtu': '1500', 'netmask': '', 'stp': 'off', 'bridged': True, 'gateway': '0.0.0.0', 'ports': ['eth1']}}

Would you please reproduce this with vdsm-4.9.6-34.0, which added useful logging of ifup/ifdown? Is there anything fishy in /var/log/messages?
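
The mismatch described in this comment (an empty 'addr' next to a configured 'IPADDR') can be checked mechanically. The following is a minimal sketch, assuming the report has the dictionary shape quoted above; the function name find_unapplied_addresses is hypothetical and not part of vdsm.

```python
def find_unapplied_addresses(networks):
    """Return {network: configured_ip} for networks whose cfg requests an
    IP address that differs from the address actually reported."""
    unapplied = {}
    for name, attrs in networks.items():
        configured = attrs.get('cfg', {}).get('IPADDR', '')
        reported = attrs.get('addr', '')
        # Only flag networks that explicitly ask for a static address.
        if configured and configured != reported:
            unapplied[name] = configured
    return unapplied


# Trimmed copy of the 'sit2' entry quoted in this comment.
networks = {
    'sit2': {'iface': 'sit2', 'addr': '',
             'cfg': {'IPADDR': '192.168.99.5', 'BOOTPROTO': 'none'}},
}
print(find_unapplied_addresses(networks))
```

A network is reported only when its cfg carries an IPADDR, so DHCP-configured networks are ignored by this check.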
Comment 5 Pavel Stehlik 2012-09-14 05:50:05 EDT
I don't have that machine any more with its messages. Maybe some partners could help here.
I can't reproduce on RHEL with vdsm-4.9.6-34.0.el6_3.x86_64.
Comment 6 Dan Kenigsberg 2012-09-16 02:02:01 EDT
Please reopen when this reproduces.
Comment 8 Martin Pavlik 2012-09-17 08:14:06 EDT
I've tried on SI18

rhevh (20120910.0.rhev31.el6_3) with vdsm-reg-4.9.6-31.0.el6_3.noarch

and

RHEL 6.3 with vdsm-4.9.6-34.0.el6_3.x86_64

multiple times, but the bug does not reproduce.
Comment 9 Pavel Stehlik 2012-09-18 06:51:19 EDT
Tried on si18, same HW, with a clean rhevh 20120910.0.rhev31.el6_3 - can't repro - tried again 8 times.
Comment 10 Pavel Stehlik 2012-09-25 07:44:35 EDT
happened again on si18.1, rhevh, vdsm -34
Comment 12 Dan Kenigsberg 2012-10-04 16:37:51 EDT
Pavel, vdsm.log is missing from this tarball - it seems to include only engine sosreport, which is of little use here.

However, we've found an ovirt-node (RHEV-H) bug 846326 with serious consequences for network setup. When you reproduce the bug again, please either use a node image with this bug fixed, or make sure that the files under /etc/libvirt/qemu/networks/ are not bind-mounted (do not appear in /proc/mounts).
Comment 13 Pavel Stehlik 2012-10-05 03:04:05 EDT
Dan, I can clearly see the vdsm log is there. 

Please confirm that you can see it as well in package under:
/tmp/logcollector/RHEVH-and-PostgreSQL-reports/10.34.63.136/10.34.63.136-sos...
you need to unpack the 2nd archive and then browse to the vdsm.log
(I just rechecked it and I can see it there).
Comment 14 Antoni Segura Puimedon 2012-10-05 16:20:46 EDT
MainProcess|Thread-5133::DEBUG::2012-09-25 11:36:00,619::__init__::1164::Storage.Misc.excCmd::(_log) '/sbin/ifup vvv' (cwd None)
MainProcess|Thread-5133::DEBUG::2012-09-25 11:36:01,684::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ''; <rc> = 1

The problem, obviously, is that the ifup of the bridge fails. This is not a
very common thing to happen and, unfortunately, as we can see, ifup does not
give any further information as to what the cause might be. In any case, this
exposes something that we do very wrong.

We ifup everything without checking whether it fails and then, and this is the
worst part, call "configWriter.createLibvirtNetwork(network, bridged, iface)",
which, unless skipBackup is set, backs up the new non-working configuration.

This all creates a situation in which the configuration shows the new ifcfg
we want, but the config is not applied. If we were to set skipBackup on any
ifup error, at least the new non-working configuration would not get backed up
and the ping-failure rollback would restore the old configuration. Having said
that, I'd rather throw an exception on ifup error that would bubble up and
return a "Configuration could not be successfully applied" to the engine.
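
The exception-based approach proposed in this comment could look roughly like the sketch below. It is not vdsm's code: IfupError, ifup, setup_network and the backup_config callback are hypothetical names chosen for illustration, and the command runner is injectable so the logic can be exercised without root privileges or a real interface.

```python
import subprocess


class IfupError(Exception):
    """Hypothetical exception for a failed ifup; not vdsm's actual class."""


def ifup(iface, run=subprocess.run):
    # Run the initscripts helper and wait for it, so its exit status is known.
    result = run(['/sbin/ifup', iface],
                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise IfupError('ifup %s failed with rc=%d'
                        % (iface, result.returncode))


def setup_network(iface, backup_config, run=subprocess.run):
    # Bring the device up first. If ifup raises, the exception bubbles up
    # to the caller and the backup step below is never reached, so a
    # non-working configuration does not get backed up.
    ifup(iface, run=run)
    backup_config(iface)
```

With this ordering the caller sees the failure directly, rather than relying on a later connectivity check to trigger the rollback.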
Comment 15 Dan Kenigsberg 2012-10-06 18:53:55 EDT
Pavel, thanks for showing me where vdsm.log hides. /tmp is not very intuitive.

Would you be so kind as to try to reproduce this issue after changing the first line of /etc/sysconfig/network-scripts/ifup-eth to

  #!/bin/bash -xv

? This is going to generate a lot of noise into the log, but may give a clue on why ifup occasionally fails.

Toni, yes, the fact that we happily continue with the operation even after a crucial step (ifup) failed is questionable.
Comment 16 Antoni Segura Puimedon 2012-10-08 12:00:49 EDT
I found the culprit of the ifup mishap:

Sep 25 11:36:01 slot-6 /etc/sysconfig/network-scripts/ifup-eth: Error, some other host already uses address 192.168.99.5.
Sep 25 11:36:03 slot-6 ntpd[7898]: Listening on interface #14 em2, fe80::868f:69ff:fe67:1f04#123 Enabled

As you see, some other host already had that IP set, so ifup failed. IP
collision, then.
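
Since ifup itself returned an empty <err>, the syslog message quoted above was the only trace of the collision. As a minimal sketch, such lines can be picked out of a log with a regular expression; the pattern is derived solely from the message quoted in this comment, it handles IPv4 addresses only, and the function name find_duplicate_ip is hypothetical.

```python
import re

# Matches the ifup-eth message quoted in this comment (IPv4 only).
_DUPLICATE_IP = re.compile(
    r'Error, some other host already uses address (\d+(?:\.\d+){3})')


def find_duplicate_ip(log_lines):
    """Return the addresses reported as already in use, in order of appearance."""
    found = []
    for line in log_lines:
        match = _DUPLICATE_IP.search(line)
        if match:
            found.append(match.group(1))
    return found


sample = [
    "Sep 25 11:36:01 slot-6 /etc/sysconfig/network-scripts/ifup-eth: "
    "Error, some other host already uses address 192.168.99.5.",
    "Sep 25 11:36:03 slot-6 ntpd[7898]: Listening on interface #14 em2, "
    "fe80::868f:69ff:fe67:1f04#123 Enabled",
]
print(find_duplicate_ip(sample))
```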

@Pavel, I think it is not necessary anymore for you to
reproduce with what Dan suggested.

@Dan, this kind of error is not an isolated case; it can easily happen through
an admin's mistake. I will work on adding checks of the ifup return values.
Comment 17 Dan Kenigsberg 2012-10-09 05:42:25 EDT
Thanks Toni!

Pavel, there's no need to dig any further (unless you find another case with this behavior).

Given the new information, I do not think it is urgent enough to rush a fix such as http://gerrit.ovirt.org/8415 into rhev-3.1.
Comment 18 Dan Kenigsberg 2012-10-09 16:06:32 EDT
*** Bug 787709 has been marked as a duplicate of this bug. ***
Comment 19 lpeer 2012-10-29 02:56:02 EDT
Removing the regression keyword, as this is not a regression from previous versions.

Having said that, I think this is an important bug to fix: if we fail to ifup a device, we should not continue as if nothing happened. Let's try to push it to rhev-3.2.
Comment 22 Martin Pavlik 2013-02-07 08:04:49 EST
The behavior has now changed (for details see the attached vdsm.log).

ifup does not fail when the user assigns a duplicate IP.

The sequence used when the IP is changed is:

1) ifdown bridge
2) ifdown physical interface
3) ifup physical interface
4) ifup bridge (with duplicate IP 10.34.67.1 configured)

The sequence above does not detect that a duplicate IP was used:
MainProcess|Thread-5453::DEBUG::2013-02-07 13:44:26,659::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '+ . /etc/init.d/functions\n++ TEXTDOMAIN=initscripts\n++ umask 022\n++ PATH=/sbin:/usr/sbin:/bin:/usr/bin\n++ export PATH\n++ \'[\' -z \'\' \']\'\n++ COLUMNS=80\n++ \'[\' -z \'\' \']\'\n+++ /sbin/consoletype\n++ CONSOLETYPE=serial\n++ \'[\' -f /etc/sysconfig/i18n


If just the bridge is brought down and up manually (ifdown/ifup), the duplicate IP is detected:

1) ifdown bridge
2) ifup bridge (with duplicate IP 10.34.67.1 configured)

+ /usr/bin/logger -p daemon.err -t /etc/sysconfig/network-scripts/ifup-eth 'Error, some other host already uses address 10.34.67.1.'
Comment 24 Martin Pavlik 2013-02-07 08:07:49 EST
Created attachment 694473 [details]
vdsm.log
Comment 25 Dan Kenigsberg 2013-03-06 08:47:37 EST
The fact that ifup does not fail as expected is not really a vdsm issue - I suspect initscripts or the kernel. Please open a separate bug on that, and add it to the rhev-3.2 tracker bug.

Could you find other circumstances where ifup can be made to fail? Please try, and test that vdsm notices the error, and reports it back to Engine.
Comment 26 Martin Pavlik 2013-03-08 10:25:05 EST
verified on SF9
Comment 27 Itamar Heim 2013-06-11 04:39:10 EDT
3.2 has been released