Bug 1259468 - Setupnetworks fails from time to time with error 'Failed to bring interface up'
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.1
Target Release: 4.17.11
Assigned To: Petr Horáček
QA Contact: Meni Yakove
Whiteboard: network
Keywords: Automation, AutomationBlocker
Depends On: 1272368
Blocks: 1154205 1283245
Reported: 2015-09-02 12:30 EDT by Meni Yakove
Modified: 2016-02-10 14:16 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1283245
Environment:
Last Closed: 2015-12-16 07:19:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑3.6.z+
rule-engine: blocker+
ylavi: Triaged+
ylavi: planning_ack+
rule-engine: devel_ack+
myakove: testing_ack+


Attachments
vdsm, supervdsm and engine logs (1.73 MB, application/zip)
2015-09-02 12:31 EDT, Meni Yakove
vdsm and supervdsm logs with set -x in ifup and ifup-eth (4.33 MB, application/zip)
2015-09-24 05:48 EDT, Meni Yakove


External Trackers
Tracker ID Priority Status Summary Last Updated
FreeDesktop.org 86520 None None None Never
oVirt gerrit 47327 master ABANDONED net: retry ifup if an error occured Never
oVirt gerrit 47627 master MERGED net: fix systemd race in exec_ifup Never
oVirt gerrit 48755 ovirt-3.6 MERGED net: fix systemd race in exec_ifup Never

Description Meni Yakove 2015-09-02 12:30:41 EDT
Description of problem:
Sometimes when calling SetupNetwork it fails with error 'Failed to bring interface up'

Version-Release number of selected component (if applicable):
vdsm-4.17.3-1.el7ev.noarch
rhevm-3.6.0-0.12.master.el6.noarch


Steps to Reproduce:
1. Run host_network_api automation test
Comment 1 Meni Yakove 2015-09-02 12:31:28 EDT
Created attachment 1069521
vdsm, supervdsm and engine logs
Comment 3 Yaniv Kaul 2015-09-16 06:54:38 EDT
Meni, can you be more specific about the 'sometimes'? Does it happen on a specific RHEL version (7.x? 7.2 only?), or in some specific topology? Anything that would help determine the frequency of the issue is useful. How often is 'sometimes'?
Comment 4 Petr Horáček 2015-09-16 10:16:49 EDT
Hi Meni,

could you please add `set -x` to /etc/sysconfig/network-scripts/ifup, try to reproduce the problem, and share the ifup output with us? If possible, please also add `sleep 1` after the line 'ip link add dev ${DEVICE} link ${PHYSDEV} type vlan id ${VID} ${FLAG_REORDER_HDR} ${FLAG_GVRP} || {' in the ifup file. Is the problem still reproducible then?

Would it be possible to grant me access to a machine with a reproducer?

Thanks
Comment 5 Dan Kenigsberg 2015-09-21 07:42:55 EDT
(please add set -x to /etc/sysconfig/network-scripts/ifup-eth too)
Comment 6 Meni Yakove 2015-09-24 05:48 EDT
Created attachment 1076425
vdsm and supervdsm logs with set -x in ifup and ifup-eth
Comment 7 Petr Horáček 2015-09-24 11:52:18 EDT
Meni, could you please share with me the supervdsm.log of a non-failing automation test run?
Comment 8 Petr Horáček 2015-10-07 13:20:24 EDT
Hi, we suspect that this is caused by a systemd problem [1]. If so, it should be fixed in systemd v220. Could you please try to reproduce it with systemd >= 220? Thanks

[1] https://bugs.freedesktop.org/show_bug.cgi?id=86520
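
A rough way to tell whether a host already has a systemd new enough to carry the upstream fix is to compare its version against 220. The sketch below parses the first line of `systemctl --version` output; the helper names are hypothetical, and it assumes that line has the usual `systemd NNN` form. A distribution build may backport the fix without changing the version number, so treat the result as a hint only.

```python
import re

# Per comment 8, the upstream fix is expected in systemd v220.
FIXED_IN = 220


def parse_systemd_version(output):
    """Return the version number from `systemctl --version` output.

    Hypothetical helper; assumes the first line looks like 'systemd 219'.
    """
    first_line = output.splitlines()[0] if output else ''
    match = re.match(r'systemd\s+(\d+)', first_line)
    if match is None:
        raise ValueError('unrecognized version output: %r' % first_line)
    return int(match.group(1))


def has_upstream_fix(output):
    """True if the reported systemd version is at least FIXED_IN."""
    return parse_systemd_version(output) >= FIXED_IN
```

Feeding it the text printed by `systemctl --version` on the host under test gives a quick yes/no before rerunning the automation.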
Comment 9 Meni Yakove 2015-10-13 07:54:07 EDT
Where can I get systemd v220 for RHEL 7.2? I can't find it on Brew.
Comment 10 Red Hat Bugzilla Rules Engine 2015-10-19 06:49:51 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Comment 11 GenadiC 2015-10-21 01:20:04 EDT
Ran with systemd from https://brewweb.devel.redhat.com/taskinfo?taskID=9969138 and it looks like the problem of bringing the interface up is solved.
Comment 12 Petr Horáček 2015-10-22 04:54:17 EDT
Following [1], I created a patch [2] which solves the problem on our side and makes sure such a race will not occur.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
[2] https://gerrit.ovirt.org/#/c/47627/
Comment 13 Petr Horáček 2015-10-22 05:23:46 EDT
(In reply to Petr Horáček from comment #12)
> Following [1], I created a patch [2] which solves the problem on our side
> and makes sure such a race will not occur.
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
> [2] https://gerrit.ovirt.org/#/c/47627/

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c8
Comment 14 Petr Horáček 2015-10-22 11:29:34 EDT
Meni, could you please try to (not) reproduce it with this patch https://gerrit.ovirt.org/#/c/47627/ and the standard systemd without the backported fix? Thanks a lot.
Comment 15 Meni Yakove 2015-10-25 05:53:46 EDT
Petr, I get error: 
systemd_run() got an unexpected keyword argument 'uuid', code = -32603
Comment 16 Yaniv Lavi 2015-10-29 08:32:03 EDT
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Comment 17 Petr Horáček 2015-11-03 07:21:54 EST
That's strange. The network functional tests passed and there was no such error in the logs. I also tried it on the command line:

    In [1]: import uuid
    In [2]: from vdsm import utils, cmdutils
    In [3]: c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(), slice='foo')
    In [4]: utils.execCmd(c)
    Out[4]: 
    (0,
     ['alignmentScanTests.py',
     ...

Used software:
Linux 3.10.0-229.14.1.el7.x86_64
systemd-208-20.el7.x86_64
python-2.7.5-18.el7_1.1.x86_64

I have no idea why it fails (or what that error code means). Anyway, hopefully the systemd guys will solve it on their side and there will be no need for this patch.

On which system, systemd and Python does it fail for you?

Thanks and regards
Comment 18 Meni Yakove 2015-11-11 12:04:28 EST
Linux 3.10.0-322.el7.x86_64
systemd-219-19.el7.x86_64
python-2.7.5-34.el7.x86_64

from python:
>>> import uuid
>>> from vdsm import utils, cmdutils
>>> c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(),slice='foo')
>>> utils.execCmd(c)
(0, ['anaconda-ks.cfg', 'kickstart-default-provision.log', 'openscap_data', 'puppet.log'], ['Running scope as unit e08e51f2-e2ba-49d4-a281-94d8d43aae8d.scope.'])
>>> 

when running via vdsm:
2015-11-11 19:01:45,282 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-3) [hosts_syncAction_b105d104-635b-4305] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Attempt to call function: <bound method Global.setupNetworks of <API.Global object at 0x7fb03c15c610>> with arguments: ({u'case2_sn1': {u'nic': u'enp1s0f1', u'vlan': u'11', u'STP': u'no', u'bridged': u'true', u'mtu': u'1500'}}, {}, {u'connectivityCheck': u'true', u'connectivityTimeout': 60}) error: systemd_run() got an unexpected keyword argument 'uuid', code = -32603
Comment 19 Meni Yakove 2015-11-11 12:34:16 EST
Correction:
I had missed `unit=` in ifcfg.py; now it's working.
I will run our tests a few more times to make sure this solves the issue.
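
For readers puzzled by the message in comment 15: it is the generic Python `TypeError` raised when a function is called with a keyword it does not declare. The stand-in below is not the real `cmdutils.systemd_run`; its signature is only assumed from the call shown in comment 17 (`scope`, `unit`, `slice`), and it merely illustrates why passing `uuid=` where `unit=` was intended fails.

```python
import uuid


def systemd_run(cmd, scope=False, unit=None, slice=None):
    """Hypothetical stand-in; signature assumed from the call in comment 17."""
    return {'cmd': cmd, 'scope': scope, 'unit': unit, 'slice': slice}


def call_with_wrong_keyword():
    """Reproduce the mistake: `uuid=` passed where `unit=` was intended."""
    try:
        systemd_run(['ls'], scope=True, uuid=uuid.uuid4(), slice='foo')
    except TypeError as exc:
        # The message names the unexpected keyword argument.
        return str(exc)
    return None
```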
Comment 20 Meni Yakove 2015-11-11 15:08:29 EST
After 3 runs with no 'Failed to bring interface up' failure, I think we can say that the fix is working.
Should I verify the bug?
Comment 21 Yaniv Lavi 2015-11-12 01:03:41 EST
(In reply to Meni Yakove from comment #20)
> After 3 runs with no 'Failed to bring interface up' failure, I think we can
> say that the fix is working.
> Should I verify the bug?

How can this be fixed without resolving the platform bug? Should that be closed as well?
Comment 22 Meni Yakove 2015-11-12 02:17:30 EST
Petr can you answer Yaniv?
Comment 23 Petr Horáček 2015-11-17 06:37:07 EST
There is a race in systemd-run that sometimes makes it run twice under the same unit name at the same time (which is wrong). We can prevent this by using our own unit name (a generated UUID).

It would be better if they backported the fix (introduced in systemd v220) on their side, but it's not a big deal to fix it temporarily in VDSM and drop the workaround once v220 is available.

I'm not sure we really want them to backport it if we have it fixed.
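
The idea of supplying our own unit name can be sketched as follows: generate a fresh UUID for every invocation and pass it to `systemd-run` explicitly, so that concurrent calls effectively never share a unit name. This is a minimal illustration, not the merged VDSM patch; `build_systemd_run_cmd` and the `ifup eth0` arguments are hypothetical, while `--scope`, `--unit=` and `--slice=` are standard `systemd-run` options.

```python
import uuid


def build_systemd_run_cmd(cmd, slice_name=None, unit=None):
    """Build a systemd-run command line with an explicit unit name.

    Hypothetical helper. When no unit is given, a random UUID is used so
    that systemd-run does not have to pick a name itself.
    """
    if unit is None:
        unit = str(uuid.uuid4())
    argv = ['systemd-run', '--scope', '--unit=%s' % unit]
    if slice_name is not None:
        argv.append('--slice=%s' % slice_name)
    return argv + list(cmd)
```

For example, `build_systemd_run_cmd(['ifup', 'eth0'], slice_name='foo')` yields an argument list with a different `--unit=` value on every call.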
Comment 24 Yaniv Lavi 2015-11-17 06:40:01 EST
(In reply to Petr Horáček from comment #23)
> There is a race in systemd-run that sometimes makes it run twice under the
> same unit name at the same time (which is wrong). We can prevent this by
> using our own unit name (a generated UUID).
> 
> It would be better if they backported the fix (introduced in systemd v220)
> on their side, but it's not a big deal to fix it temporarily in VDSM and
> drop the workaround once v220 is available.
> 
> I'm not sure we really want them to backport it if we have it fixed.

The workaround should be temporary. Let's wait for them to fix the issue and then drop the workaround on our side. Leave this bug open until they fix it.
Comment 25 Petr Horáček 2015-11-18 08:53:05 EST
We need this patch anyways for CentOS and older Fedoras. Let's keep it until we have systemd >= v220 everywhere.
Comment 26 Sandro Bonazzola 2015-11-24 11:43:50 EST
Please set target release or I can't move the bug to ON_QA automatically.
Comment 27 Red Hat Bugzilla Rules Engine 2015-11-24 13:09:23 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 28 Sandro Bonazzola 2015-12-16 07:19:51 EST
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.
