Bug 1259468

Summary: Setupnetworks fails from time to time with error 'Failed to bring interface up'
Product: [oVirt] vdsm
Reporter: Meni Yakove <myakove>
Component: General
Assignee: Petr Horáček <phoracek>
Status: CLOSED CURRENTRELEASE
QA Contact: Meni Yakove <myakove>
Severity: high
Docs Contact:
Priority: high
Version: 4.17.3
CC: bazulay, bugs, cwu, danken, ecohen, gcheresh, gklein, lsurette, myakove, phoracek, ycui, yeylon, ylavi
Target Milestone: ovirt-3.6.1
Keywords: Automation, AutomationBlocker
Target Release: 4.17.11
Flags: rule-engine: ovirt-3.6.z+
rule-engine: blocker+
ylavi: Triaged+
ylavi: planning_ack+
rule-engine: devel_ack+
myakove: testing_ack+
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1283245 (view as bug list)
Environment:
Last Closed: 2015-12-16 12:19:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1272368
Bug Blocks: 1154205, 1283245
Attachments:
vdsm, supervdsm and engine logs (flags: none)
vdsm and supervdsm logs with set -x in ifup and ifup-eth (flags: none)

Description Meni Yakove 2015-09-02 16:30:41 UTC
Description of problem:
Calling setupNetworks sometimes fails with the error 'Failed to bring interface up'.

Version-Release number of selected component (if applicable):
vdsm-4.17.3-1.el7ev.noarch
rhevm-3.6.0-0.12.master.el6.noarch


Steps to Reproduce:
1. Run the host_network_api automation test.

Comment 1 Meni Yakove 2015-09-02 16:31:28 UTC
Created attachment 1069521
vdsm, supervdsm and engine logs

Comment 3 Yaniv Kaul 2015-09-16 10:54:38 UTC
Meni, can you be more specific about 'sometimes'? Does it happen on a specific RHEL version (7.x? 7.2 only?), or in some specific topology? Anything that would help determine the frequency of the issue is useful. How often is 'sometimes'?

Comment 4 Petr Horáček 2015-09-16 14:16:49 UTC
Hi Meni,

could you please add `set -x` to /etc/sysconfig/network-scripts/ifup, try to reproduce the problem, and share the ifup output with us? If possible, please also add `sleep 1` after the line 'ip link add dev ${DEVICE} link ${PHYSDEV} type vlan id ${VID} ${FLAG_REORDER_HDR} ${FLAG_GVRP} || {' in the ifup file. Is it still reproducible then?

Would it be possible to grant me access to a machine with the reproducer?

Thanks

Comment 5 Dan Kenigsberg 2015-09-21 11:42:55 UTC
(please add set -x to /etc/sysconfig/network-scripts/ifup-eth too)

Comment 6 Meni Yakove 2015-09-24 09:48:08 UTC
Created attachment 1076425
vdsm and supervdsm logs with set -x in ifup and ifup-eth

Comment 7 Petr Horáček 2015-09-24 15:52:18 UTC
Meni, could you please share the supervdsm.log of an automation test run that did not fail?

Comment 8 Petr Horáček 2015-10-07 17:20:24 UTC
Hi, we suspect that this is caused by a systemd problem [1]. If that is the case, it should be fixed in systemd v220. Could you please try to reproduce it with systemd >= 220? Thanks

[1] https://bugs.freedesktop.org/show_bug.cgi?id=86520

Comment 9 Meni Yakove 2015-10-13 11:54:07 UTC
Where can I get systemd v220 for RHEL 7.2? I can't find it on brew.

Comment 10 Red Hat Bugzilla Rules Engine 2015-10-19 10:49:51 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 11 GenadiC 2015-10-21 05:20:04 UTC
Ran with the systemd build from https://brewweb.devel.redhat.com/taskinfo?taskID=9969138 and it looks like the problem of bringing the interface up is solved.

Comment 12 Petr Horáček 2015-10-22 08:54:17 UTC
Based on [1], I created a patch [2] that solves the problem on our side and makes sure such a race will not occur.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
[2] https://gerrit.ovirt.org/#/c/47627/

Comment 13 Petr Horáček 2015-10-22 09:23:46 UTC
(In reply to Petr Horáček from comment #12)
> Based on [1], I created a patch [2] that solves the problem on our side
> and makes sure such a race will not occur.
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
> [2] https://gerrit.ovirt.org/#/c/47627/

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c8

Comment 14 Petr Horáček 2015-10-22 15:29:34 UTC
Meni, could you please try to (not) reproduce it with this patch https://gerrit.ovirt.org/#/c/47627/ and the standard systemd, without the backported fix? Thanks a lot.

Comment 15 Meni Yakove 2015-10-25 09:53:46 UTC
Petr, I get this error:
systemd_run() got an unexpected keyword argument 'uuid', code = -32603

Comment 16 Yaniv Lavi 2015-10-29 12:32:03 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 17 Petr Horáček 2015-11-03 12:21:54 UTC
That's strange. The network functional tests passed OK and there was no such error in the logs. I also tried it on the command line:

    In [1]: import uuid
    In [2]: from vdsm import utils, cmdutils
    In [3]: c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(), slice='foo')
    In [4]: utils.execCmd(c)
    Out[4]: 
    (0,
     ['alignmentScanTests.py',
     ...

Used software:
Linux 3.10.0-229.14.1.el7.x86_64
systemd-208-20.el7.x86_64
python-2.7.5-18.el7_1.1.x86_64

I have no idea why it fails (or what that error code means). Anyway, hopefully the systemd guys will solve it on their side and there will be no need for this patch.

On which system, systemd and Python versions does it fail for you?

Thanks and regards

Comment 18 Meni Yakove 2015-11-11 17:04:28 UTC
Linux 3.10.0-322.el7.x86_64
systemd-219-19.el7.x86_64
python-2.7.5-34.el7.x86_64

from python:
>>> import uuid
>>> from vdsm import utils, cmdutils
>>> c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(),slice='foo')
>>> utils.execCmd(c)
(0, ['anaconda-ks.cfg', 'kickstart-default-provision.log', 'openscap_data', 'puppet.log'], ['Running scope as unit e08e51f2-e2ba-49d4-a281-94d8d43aae8d.scope.'])
>>> 

when running via vdsm:
2015-11-11 19:01:45,282 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-3) [hosts_syncAction_b105d104-635b-4305] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Attempt to call function: <bound method Global.setupNetworks of <API.Global object at 0x7fb03c15c610>> with arguments: ({u'case2_sn1': {u'nic': u'enp1s0f1', u'vlan': u'11', u'STP': u'no', u'bridged': u'true', u'mtu': u'1500'}}, {}, {u'connectivityCheck': u'true', u'connectivityTimeout': 60}) error: systemd_run() got an unexpected keyword argument 'uuid', code = -32603

Comment 19 Meni Yakove 2015-11-11 17:34:16 UTC
Correction:
I missed unit= in ifcfg.py; with that fixed, it's working now.
I will run our tests a few more times to make sure this solves the issue.
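
For readers puzzled by the error text in comment 18: the exact mistaken call in ifcfg.py is not shown in this thread, so the following is only a minimal sketch of how Python produces that kind of message when a function is called with a keyword it does not accept. The `systemd_run` below is a hypothetical stand-in that borrows the keyword names used in the interactive session above (scope, unit, slice); it is not VDSM's implementation.

```python
def systemd_run(cmd, scope=False, unit=None, slice=None):
    # Hypothetical stand-in: only the keyword names are taken from the
    # session in this thread; the body is a placeholder, not VDSM code.
    return ['systemd-run'] + list(cmd)


try:
    # 'uuid' is not a parameter of the stand-in, so Python raises TypeError
    # before the function body runs.
    systemd_run(['ls'], scope=True, uuid='1234')
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Passing the value as `unit=` instead of `uuid=` matches the signature and avoids the TypeError.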

Comment 20 Meni Yakove 2015-11-11 20:08:29 UTC
After 3 runs with no 'Failed to bring interface up' failure, I think we can say that the fix is working.
Should I verify the bug?

Comment 21 Yaniv Lavi 2015-11-12 06:03:41 UTC
(In reply to Meni Yakove from comment #20)
> After 3 runs with no 'Failed to bring interface up' failure, I think we can
> say that the fix is working.
> Should I verify the bug?

How can this be fixed without resolving the platform bug? Should that be closed as well?

Comment 22 Meni Yakove 2015-11-12 07:17:30 UTC
Petr, can you answer Yaniv?

Comment 23 Petr Horáček 2015-11-17 11:37:07 UTC
There is a race in systemd-run that sometimes causes two runs to use the same unit name at the same time (which is wrong). We can prevent this by passing our own unit name (a generated UUID).

It would be better if they backported the fix (introduced in systemd v220) on their side, but it's not a big deal to fix it temporarily in VDSM and drop the workaround once v220 is available.

I'm not sure if we really want them to backport it if we have it fixed.
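
The workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the gerrit patch: `build_systemd_run_cmd` and `unique_unit_cmd` are hypothetical helpers that only assemble an argument list, using the `--scope`, `--unit=` and `--slice=` options of the systemd-run command line. Nothing is executed.

```python
import uuid


def build_systemd_run_cmd(cmd, unit=None, scope=False, slice_name=None):
    """Hypothetical helper (not VDSM's): build a systemd-run argv list."""
    argv = ['systemd-run']
    if scope:
        argv.append('--scope')
    if unit is not None:
        # An explicit unit name replaces the one systemd-run would pick itself.
        argv.append('--unit=%s' % unit)
    if slice_name is not None:
        argv.append('--slice=%s' % slice_name)
    return argv + list(cmd)


def unique_unit_cmd(cmd):
    # A fresh random UUID per call, so two concurrent invocations
    # do not end up with the same unit name.
    return build_systemd_run_cmd(cmd, unit=uuid.uuid4(), scope=True)


first = unique_unit_cmd(['ls'])
second = unique_unit_cmd(['ls'])
print(first)
print(second)
```

Each call yields a different `--unit=` value, which is the property the workaround relies on.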

Comment 24 Yaniv Lavi 2015-11-17 11:40:01 UTC
(In reply to Petr Horáček from comment #23)
> There is a race in systemd-run that sometimes causes two runs to use the
> same unit name at the same time (which is wrong). We can prevent this by
> passing our own unit name (a generated UUID).
> 
> It would be better if they backported the fix (introduced in systemd v220)
> on their side, but it's not a big deal to fix it temporarily in VDSM and
> drop the workaround once v220 is available.
> 
> I'm not sure if we really want them to backport it if we have it fixed.

The workaround should be temporary. Let's wait for them to fix the issue and then drop the workaround on our side. Leave this bug open until they fix it.

Comment 25 Petr Horáček 2015-11-18 13:53:05 UTC
We need this patch anyway for CentOS and older Fedoras. Let's keep it until we have systemd >= v220 everywhere.

Comment 26 Sandro Bonazzola 2015-11-24 16:43:50 UTC
Please set target release or I can't move the bug to ON_QA automatically.

Comment 27 Red Hat Bugzilla Rules Engine 2015-11-24 18:09:23 UTC
Bug tickets that are moved to testing must have the target release set so that the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 28 Sandro Bonazzola 2015-12-16 12:19:51 UTC
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.