Bug 1259468
Summary: | Setupnetworks fails from time to time with error 'Failed to bring interface up' | |
---|---|---|---
Product: | [oVirt] vdsm | Reporter: | Meni Yakove <myakove>
Component: | General | Assignee: | Petr Horáček <phoracek>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Meni Yakove <myakove>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.17.3 | CC: | bazulay, bugs, cwu, danken, ecohen, gcheresh, gklein, lsurette, myakove, phoracek, ycui, yeylon, ylavi
Target Milestone: | ovirt-3.6.1 | Keywords: | Automation, AutomationBlocker
Target Release: | 4.17.11 | Flags: | rule-engine: ovirt-3.6.z+, rule-engine: blocker+, ylavi: Triaged+, ylavi: planning_ack+, rule-engine: devel_ack+, myakove: testing_ack+
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | network | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1283245 | Environment: |
Last Closed: | 2015-12-16 12:19:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1272368 | |
Bug Blocks: | 1154205, 1283245 | |
Attachments: | | |
Description
Meni Yakove
2015-09-02 16:30:41 UTC
Created attachment 1069521 [details]
vdsm, supervdsm and engine logs

Meni, can you be more specific about the 'sometimes'? Does it happen on a specific RHEL version (7.x? 7.2 only?), or in some specific topology? Anything that would help determine the frequency of the issue. How often is 'sometimes'?

Hi Meni, could you please add `set -x` to /etc/sysconfig/network-scripts/ifup, try to reproduce the problem and share the ifup output with us?
If possible, please also add `sleep 1` after the line 'ip link add dev ${DEVICE} link ${PHYSDEV} type vlan id ${VID} ${FLAG_REORDER_HDR} ${FLAG_GVRP} || {' in the ifup file. Is it still reproducible?
Would it be possible to grant me access to a machine with a reproducer? Thanks

(please add `set -x` to /etc/sysconfig/network-scripts/ifup-eth too)

Created attachment 1076425 [details]
vdsm and supervdsm logs with set -x in ifup and ifup-eth

Meni, could you please share with me the supervdsm.log of a non-failed automation test?

Hi, we suspect that it is caused by a systemd problem [1]. If this is the case, the problem should be solved in systemd v220. Could you please try to reproduce it with systemd >= 220? Thanks
[1] https://bugs.freedesktop.org/show_bug.cgi?id=86520

From where can I get systemd v220 for rhel7.2, can find it on brew

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Ran with systemd from https://brewweb.devel.redhat.com/taskinfo?taskID=9969138 and it looks like the problem of bringing the interface up is solved.

According to [1] I created a patch [2] which solves the problem on our side and makes sure such race will not occur.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
[2] https://gerrit.ovirt.org/#/c/47627/

(In reply to Petr Horáček from comment #12)
> According to [1] I created a patch [2] which solves the problem on our side
> and makes sure such race will not occur.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c7
> [2] https://gerrit.ovirt.org/#/c/47627/
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1272368#c8

Meni, could you please try to (not) reproduce it with this patch https://gerrit.ovirt.org/#/c/47627/ and the standard systemd without the backported fix? Thanks a lot.

Petr, I get this error: systemd_run() got an unexpected keyword argument 'uuid', code = -32603

In oVirt testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

That's strange. Network functional tests passed OK and there was no such error in the logs.
I also tried it on the command line:
    In [1]: import uuid
    In [2]: from vdsm import utils, cmdutils
    In [3]: c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(), slice='foo')
    In [4]: utils.execCmd(c)
    Out[4]: (0, ['alignmentScanTests.py', ...
Used software:
    Linux 3.10.0-229.14.1.el7.x86_64
    systemd-208-20.el7.x86_64
    python-2.7.5-18.el7_1.1.x86_64
I have no idea why it fails (or what that error code means). Anyway, hopefully the systemd guys will solve it on their side and there will be no need for this patch.
On what system, systemd and Python does it fail for you?
Thanks and regards

    Linux 3.10.0-322.el7.x86_64
    systemd-219-19.el7.x86_64
    python-2.7.5-34.el7.x86_64
from python:
    >>> import uuid
    >>> from vdsm import utils, cmdutils
    >>> c = cmdutils.systemd_run(['ls'], scope=True, unit=uuid.uuid4(),slice='foo')
    >>> utils.execCmd(c)
    (0, ['anaconda-ks.cfg', 'kickstart-default-provision.log', 'openscap_data', 'puppet.log'], ['Running scope as unit e08e51f2-e2ba-49d4-a281-94d8d43aae8d.scope.'])
    >>>
when running via vdsm:
    2015-11-11 19:01:45,282 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (ajp-/127.0.0.1:8702-3) [hosts_syncAction_b105d104-635b-4305] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HostSetupNetworksVDS, error = Attempt to call function: <bound method Global.setupNetworks of <API.Global object at 0x7fb03c15c610>> with arguments: ({u'case2_sn1': {u'nic': u'enp1s0f1', u'vlan': u'11', u'STP': u'no', u'bridged': u'true', u'mtu': u'1500'}}, {}, {u'connectivityCheck': u'true', u'connectivityTimeout': 60}) error: systemd_run() got an unexpected keyword argument 'uuid', code = -32603

Correction: I missed unit= in ifcfg.py and now it's working. I will run our tests a few more times to make sure this solves the issue.

After 3 runs and no fail on 'Failed to bring interface up' I think we can say that the fix is working.
Should I verify the bug?

(In reply to Meni Yakove from comment #20)
> After 3 runs and no fail on 'Failed to bring interface up' i think we can
> say that the fix is working.
> Should I verify the bug?
How can this be fixed without resolving the platform bug? Should that be closed as well?

Petr, can you answer Yaniv?

There is a race in systemd-run which sometimes causes it to run twice under the same unit name at the same time (which is wrong). We can prevent this by using our own unit name (a generated uuid).
It would be better if they backported the fix (introduced in systemd v220) on their side, but it's not a big deal to fix it temporarily in VDSM and drop it when v220 is available.
I'm not sure if we really want them to backport it if we have it fixed.

(In reply to Petr Horáček from comment #23)
> There is a race in systemd-run which causes that it sometimes runs twice
> under the same unit name in the same time (which is wrong). We can prevent
> this by using our own unit name (generated uuid).
>
> It would be better if they backport fix (introduced in systemd v220) on
> their side, but it's not a big deal to fix it temporary in VDSM and drop it
> when v220 will be available.
>
> I'm not sure if we really want them to backport it if we have it fixed.
The workaround should be temporary. Let's wait for them to fix the issue and then drop the workaround from our side. Leave this bug open until they fix this.

We need this patch anyway for CentOS and older Fedoras. Let's keep it until we have systemd >= v220 everywhere.

Please set target release or I can't move the bug to ON_QA automatically.

Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.
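
For readers following the thread, the workaround discussed in the comments (passing a freshly generated UUID as the transient unit name so that two concurrent systemd-run invocations cannot end up under the same name) can be sketched roughly as below. This is only an illustration under stated assumptions and not the actual VDSM patch: `build_systemd_run_cmd`, `ifup_cmd` and the slice name `example` are hypothetical names introduced here; only the `systemd-run` options `--scope`, `--unit=` and `--slice=` are taken as given.

```python
import uuid


def build_systemd_run_cmd(cmd, scope=False, unit=None, slice_name=None):
    """Build a systemd-run argument list.

    Hypothetical helper for illustration; it is NOT vdsm's
    cmdutils.systemd_run, whose real signature may differ.
    """
    argv = ['systemd-run']
    if scope:
        argv.append('--scope')
    if unit is not None:
        # Explicit unit name instead of letting systemd-run pick one.
        argv.append('--unit=%s' % unit)
    if slice_name is not None:
        argv.append('--slice=%s' % slice_name)
    argv.extend(cmd)
    return argv


def ifup_cmd(iface):
    """Wrap `ifup <iface>` in systemd-run with a unique unit name."""
    # A new random UUID per call, so concurrent calls get distinct
    # transient unit names.
    return build_systemd_run_cmd(
        ['ifup', iface],
        scope=True,
        unit=uuid.uuid4(),
        slice_name='example',  # assumed slice name, purely illustrative
    )


if __name__ == '__main__':
    # Print the command line only; nothing is executed here.
    print(' '.join(ifup_cmd('eth0')))
```

The sketch only builds the argument list, so it can be inspected without root privileges or systemd being present; actually running the command is left to whatever execution helper the caller already uses.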