| Summary: | Team devices cannot be brought up with network service | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dan Sneddon <dsneddon> |
| Component: | libteam | Assignee: | Xin Long <lxin> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Network QE <network-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.3 | CC: | atragler, bgalvani, dsneddon, mleitner, myllynen, racedoro, sukulkar |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-17 17:03:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1402537 | | |
| Attachments: | | | |

Probably related to a fix that went into 7.3 for dealing with ordering during shutdown.

(In reply to Marcelo Ricardo Leitner from comment #1)
> Probably related to a fix that went into 7.3 for dealing with ordering
> during shutdown.

That makes sense. I wasn't 100% sure, but I thought I remembered testing this on RHEL 7.2 and it worked with the network service.

Maybe it's not caused by the fix dealing with ordering.
After I remove it, the issue is still there.

# use the config file from comment0, reboot with NM_CONTROLLED=no
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1

err: Error: Connection activation failed.

I guess before you ifup team1 with NM_CONTROLLED=yes, you didn't ifdown team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, pls try ifdown team1 first before changing NM_CONTROLLED=yes.
but anyway, it should have been handled by NM-team.

Another issue I also found is:
After running the following command, NM-team cannot be recovered.

# use the config file from comment0, reboot with NM_CONTROLLED=yes
# run with the following command, the team would not work with NM.
sed -i s/NM_CONTROLLED=yes/NM_CONTROLLED=no/g /etc/sysconfig/network-scripts/ifcfg-team1
ifdown team1
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1
sed -i s/NM_CONTROLLED=yes/NM_CONTROLLED=no/g /etc/sysconfig/network-scripts/ifcfg-team1
ifdown team1
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1

err:
Job for teamd failed because the control process exited with error code. See "systemctl status teamd" and "journalctl -xe" for details.

Hi Beniamino,
We need your help. Can you check how these two issues were caused in NM-team?
Thanks.

Created attachment 1236882
[PATCH] settings: fix assertion when changing connection managed state

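The toggle-and-reactivate sequences above repeat the same sed substitution in both directions. As a minimal sketch only, the helper below sets NM_CONTROLLED to a given value regardless of its current one. The function name `set_nm_controlled` is hypothetical and not part of initscripts or NetworkManager, and it assumes the key already exists on its own line in the ifcfg file.

```shell
# Hypothetical helper (not part of initscripts or NetworkManager):
# rewrite an existing NM_CONTROLLED= line in an ifcfg file.
# $1 = path to the ifcfg file, $2 = yes or no
set_nm_controlled() {
    file=$1
    value=$2
    case $value in
        yes|no) ;;
        *) echo "value must be yes or no" >&2; return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # Write through a temp file, then copy the content back so the
    # original file keeps its ownership and permissions.
    if sed "s/^NM_CONTROLLED=.*/NM_CONTROLLED=$value/" "$file" > "$tmp"; then
        cat "$tmp" > "$file"
        result=$?
    else
        result=1
    fi
    rm -f "$tmp"
    return $result
}
```

With this, the first reproducer would read `set_nm_controlled /etc/sysconfig/network-scripts/ifcfg-team1 yes` followed by `ifup team1`. If the file has no NM_CONTROLLED line, the sketch leaves it unchanged.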
(In reply to Xin Long from comment #4)
> Maybe it's not caused by the fix dealing with ordering.
> After I remove it, the issue is still there.
>
> # use the config file from comment0, reboot with NM_CONTROLLED=no
> sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
> ifup team1
>
> err: Error: Connection activation failed.

I can't reproduce this, please attach NM logs.

> I guess before you ifup team1 with NM_CONTROLLED=yes, you didn't ifdown
> team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, pls
> try ifdown team1 first before changing NM_CONTROLLED=yes.
> but anyway, it should have been handled by NM-team.
>
> Another issue I also found is:
> After running the following command, NM-team cannot be recovered.
>
> ...
>
> err:
> Job for teamd failed because the control process exited with
> error code. See "systemctl status teamd" and "journalctl -xe"
> for details.

This is caused by a failed assertion in NM, fixed by the patch in comment 5.

However it's unclear to me if this can be the cause of the original issue reported in comment 0, as I'm unable to reproduce it.

The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I don't think NM can be involved at all?

(In reply to Beniamino Galvani from comment #6)
> (In reply to Xin Long from comment #4)
> > Maybe it's not caused by the fix dealing with ordering.
> > After I remove it, the issue is still there.
> >
> > # use the config file from comment0, reboot with NM_CONTROLLED=no
> > sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
> > ifup team1
> >
> > err: Error: Connection activation failed.
>
> I can't reproduce this, please attach NM logs.

http://pastebin.test.redhat.com/442681
pls check, from line 220 (Jan 4 01:00:14), I ifup team1.

> > I guess before you ifup team1 with NM_CONTROLLED=yes, you didn't ifdown
> > team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, pls
> > try ifdown team1 first before changing NM_CONTROLLED=yes.
> > but anyway, it should have been handled by NM-team.
> >
> > Another issue I also found is:
> > After running the following command, NM-team cannot be recovered.
> >
> > ...
> >
> > err:
> > Job for teamd failed because the control process exited with
> > error code. See "systemctl status teamd" and "journalctl -xe"
> > for details.
>
> This is caused by a failed assertion in NM, fixed by the patch in comment 5.

since which version is the patch in comment 5 in rhel7.4 ? I'm using:
# NetworkManager -V
1.4.0-12.el7

> However it's unclear to me if this can be the cause of the original issue
> reported in comment 0, as I'm unable to reproduce it.
>
> The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I
> don't think NM can be involved at all?

I got your point. what I'm thinking is: after I set 'NM_CONTROLLED=yes' with the following commands, then team1 will be taken over by NM-team, when ifup team1, it should clean env for team1 first (like, kill the old teamd daemon for team1 ...).

But from log, it seemed not to work like that.

> > sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
> > ifup team1

Thanks for checking it.

(In reply to Xin Long from comment #7)
> http://pastebin.test.redhat.com/442681
> pls check, from line 220 (Jan 4 01:00:14), I ifup team1.

Ok, I see the cause of the failure:

[1483509614.7430] device (team1): Activation: (team) started teamd [pid 2667]...
[1483509614.7436] device (team1): disconnecting for new activation request.
[1483509614.7436] audit: op="connection-activate" uuid="4293abb7-d898-84ff-dae6-bffba04cbee9" name="Team team1" pid=2661 uid=0 result="success"
[1483509614.7444] device (team1): deactivation: stopping teamd...
[1483509614.7446] device (team1): Activation: starting connection 'Team team1' (4293abb7-d898-84ff-dae6-bffba04cbee9)
Daemon already running on PID 2667.

'ifup' reloads the connection, which also activates it because the connection has ONBOOT=yes. 'ifup' then calls 'nmcli connection up' and so the connection is brought down and up again. While deactivating it, NM stops teamd by calling 'teamd -k', and then launches a new teamd instance. It seems that the old teamd is not dead when the new one is started, causing the failure of activation.

Xin, can you please file a new bug for this and provide NM logs at trace level (*)? I don't think the original problem reported here is caused by this NM bug because the steps in the bug description don't involve NM.

(*) set 'level=TRACE' in the [logging] section of /etc/NetworkManager/NetworkManager.conf and restart NM.

> > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> # NetworkManager -V
> 1.4.0-12.el7

The patch is not upstream or in RHEL yet.

> > However it's unclear to me if this can be the cause of the original issue
> > reported in comment 0, as I'm unable to reproduce it.
> >
> > The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I
> > don't think NM can be involved at all?
> I got your point. what I'm thinking is: after I set 'NM_CONTROLLED=yes' with
> the following commands, then team1 will be taken over by NM-team, when ifup
> team1, it should clean env for team1 first (like, kill the old teamd daemon
> for team1 ...).
>
> But from log, it seemed not to work like that.

NM kills any existing teamd instance for the device, but there is another problem when the connection is activated multiple times in a short interval (see above).

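The failure described above is a race: the new teamd is launched before the old instance (PID 2667 in the log) has exited. As an illustration of the general remedy only, and not of how NetworkManager or the eventual fix handles it, the sketch below polls until a given PID is gone or a timeout expires. The function name `wait_for_exit` and the one-second polling interval are assumptions made for this example.

```shell
# Hypothetical helper: block until the process with the given PID has
# exited, or give up after a timeout.
# $1 = PID to watch, $2 = timeout in seconds (default 5)
# Returns 0 once the PID is gone, 1 if it is still alive at the timeout.
wait_for_exit() {
    pid=$1
    timeout=${2:-5}
    elapsed=0
    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For the log above, a caller could run `wait_for_exit 2667 5 || echo "old teamd still running"` between stopping the old daemon and starting the new one.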
(In reply to Beniamino Galvani from comment #8)
> (In reply to Xin Long from comment #7)
>
> > http://pastebin.test.redhat.com/442681
> > pls check, from line 220 (Jan 4 01:00:14), I ifup team1.
>
> Ok, I see the cause of the failure:
>
> [1483509614.7430] device (team1): Activation: (team) started teamd [pid 2667]...
> [1483509614.7436] device (team1): disconnecting for new activation request.
> [1483509614.7436] audit: op="connection-activate" uuid="4293abb7-d898-84ff-dae6-bffba04cbee9" name="Team team1" pid=2661 uid=0 result="success"
> [1483509614.7444] device (team1): deactivation: stopping teamd...
> [1483509614.7446] device (team1): Activation: starting connection 'Team team1' (4293abb7-d898-84ff-dae6-bffba04cbee9)
> Daemon already running on PID 2667.
>
> 'ifup' reloads the connection, which also activates it because the
> connection has ONBOOT=yes. 'ifup' then calls 'nmcli connection up' and
> so the connection is brought down and up again. While deactivating it,
> NM stops teamd by calling 'teamd -k', and then launches a new teamd
> instance. It seems that the old teamd is not dead when the new one is
> started, causing the failure of activation.
>
> Xin, can you please file a new bug for this and provide NM logs at
> trace level (*)? I don't think the original problem reported here is
> caused by this NM bug because the steps in the bug description don't
> involve NM.
>

The steps in the bug description are actually the same as what I did here.
1. use the config file from comment0, reboot with NM_CONTROLLED=no
2. # sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
3. # ifup team1

I will file a new bug for NM as you wish. As for this one, I would ask Dan to try in his env with your fix after you fix it.

> (*) set 'level=TRACE' in the [logging] section of
> /etc/NetworkManager/NetworkManager.conf and restart NM.
>
> > > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> > since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> > # NetworkManager -V
> > 1.4.0-12.el7
>
> The patch is not upstream or in RHEL yet.

do you plan to backport it for rhel7.4 ?

> > > However it's unclear to me if this can be the cause of the original issue
> > > reported in comment 0, as I'm unable to reproduce it.
> > >
> > > The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I
> > > don't think NM can be involved at all?
> > I got your point. what I'm thinking is: after I set 'NM_CONTROLLED=yes' with
> > the following commands, then team1 will be taken over by NM-team, when ifup
> > team1, it should clean env for team1 first (like, kill the old teamd daemon
> > for team1 ...).
> >
> > But from log, it seemed not to work like that.
>
> NM kills any existing teamd instance for the device, but there is another
> problem when the connection is activated multiple times in a short interval
> (see above).

(In reply to Xin Long from comment #9)
> > > > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> > > since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> > > # NetworkManager -V
> > > 1.4.0-12.el7
> >
> > The patch is not upstream or in RHEL yet.
>
> do you plan to backport it for rhel7.4 ?

The patch is now in the master branch:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=a9384452ed61ca3f1c6e1db175f499307da9c388
and RHEL 7.4 will include it.

(In reply to Dan Sneddon from comment #0)
> Steps to Reproduce:
> 1. Configure /etc/sysconfig/network-scripts/ifcfg-team1 with the following:
> NM_CONTROLLED=yes
>
> 2. Configure /etc/sysconfig/network-scripts/ifcfg-eth1 (or other interface)
> with the following:
> NM_CONTROLLED=no
>
> 3. Configure /etc/sysconfig/network-scripts/ifcfg-eth2 (or other interface)
> with the following:
> NM_CONTROLLED=no

Hi Dan, is this inconsistency expected? It's weird to have the team interface be NM controlled and not its slaves. You should either have them all handled by NM, or them all not handled. This mixed setup is not tested/supported.

(In reply to Marcelo Ricardo Leitner from comment #11)
> Hi Dan, is this inconsistency expected? It's weird to have the team
> interface be NM controlled and not its slaves. You should either have
> them all handled by NM, or them all not handled. This mixed setup is not
> tested/supported.

No, Team interfaces should work with either the 'network' service, NetworkManager, or should work when both are installed. Currently, Team interfaces do not function with the 'network' service if NetworkManager is also installed. This is due to a conflict between the two services, if I understand correctly.

The root cause is tracked in the attached bug https://bugzilla.redhat.com/show_bug.cgi?id=1402535 and the fix has been merged and will be available in RHEL 7.4 and above.

(In reply to Dan Sneddon from comment #12)
> The root cause is tracked in the attached bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1402535 and the fix has been
> merged and will be available in RHEL 7.4 and above.

Actually, the above may not be the root cause, it may be an ancillary bug that was discovered while troubleshooting this issue. But yes, it is expected that Team interfaces would work with the network service as well as NetworkManager.

The bug for NM:
https://bugzilla.redhat.com/show_bug.cgi?id=1415641

Since the fix has been merged and will be available in RHEL 7.4 and above, closing this. If there is still any problem with this, please reopen.

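The mixed setup questioned in comment 11 (team master with one NM_CONTROLLED value, ports with another) can be spotted before running ifup. The following is a rough sketch, not an initscripts or NetworkManager tool: `ifcfg_get` and `check_team_nm_consistency` are hypothetical names, the parser only handles simple KEY=value lines, and an unset NM_CONTROLLED is compared as an empty string rather than resolved to any default.

```shell
# Hypothetical helper: print the value of a key from an ifcfg file.
# $1 = file, $2 = key. The last assignment wins; quotes are stripped.
ifcfg_get() {
    sed -n "s/^$2=//p" "$1" | tail -n 1 | tr -d "\"'"
}

# Hypothetical check: compare NM_CONTROLLED of a team master with that of
# every ifcfg file naming it in TEAM_MASTER.
# $1 = directory holding the ifcfg-* files, $2 = team device name
# Returns 0 if all ports match the master, 1 if any port differs.
check_team_nm_consistency() {
    dir=$1
    team=$2
    master_val=$(ifcfg_get "$dir/ifcfg-$team" NM_CONTROLLED)
    status=0
    for f in "$dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Skip files that are not ports of this team.
        [ "$(ifcfg_get "$f" TEAM_MASTER)" = "$team" ] || continue
        port_val=$(ifcfg_get "$f" NM_CONTROLLED)
        if [ "$port_val" != "$master_val" ]; then
            echo "mismatch: $f has NM_CONTROLLED=$port_val, ifcfg-$team has NM_CONTROLLED=$master_val"
            status=1
        fi
    done
    return $status
}
```

Run as `check_team_nm_consistency /etc/sysconfig/network-scripts team1`, it would print one line per port file whose NM_CONTROLLED differs from ifcfg-team1, which is the case for the files in the steps quoted from comment 0.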
Description of problem:
On RHEL 7.3 (and probably older versions), a teamd bond cannot be brought up with 'ifup' if the ifcfg file contains "NM_CONTROLLED=no".

Version-Release number of selected component (if applicable):
RHEL: 7.3
NetworkManager-team.x86_64 1:1.4.0-12.el7
libteam.x86_64 1.25-4.el7
teamd.x86_64 1.25-4.el7

How reproducible:
100%

Steps to Reproduce:
1. Configure /etc/sysconfig/network-scripts/ifcfg-team1 with the following:
DEVICE=team1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=yes
PEERDNS=no
MACADDR="52:54:00:2a:49:2d"
DEVICETYPE=Team
TEAM_CONFIG='{"runner": {"name": "activebackup"}}'

2. Configure /etc/sysconfig/network-scripts/ifcfg-eth1 (or other interface) with the following:
DEVICE=eth1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
TEAM_MASTER=team1
TEAM_PORT_CONFIG='{"prio": 100}'
BOOTPROTO=none

3. Configure /etc/sysconfig/network-scripts/ifcfg-eth2 (or other interface) with the following:
DEVICE=eth2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
TEAM_MASTER=team1
BOOTPROTO=none

4. Run "sudo ifup team1"

Actual results:
# ifup team1
Job for teamd failed because the control process exited with error code. See "systemctl status teamd" and "journalctl -xe" for details.
ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Device team1 does not seem to be present, delaying initialization.

Expected results:
The team should be enabled. If I change the NM_CONTROLLED=no to NM_CONTROLLED=yes in the ifcfg-team1 file, it works when I run "ifup team1".

Additional info:
ovs-vswitchd.log contains these error messages:
2016-12-07T18:29:56.190Z|00264|bridge|WARN|could not open network device bond1 (No such device)

It is desired for teaming to work with the network service as well as NetworkManager, since we use the network service instead of NetworkManager to manage interfaces on OpenStack servers.
I'm not sure if this is something that needs to be addressed in libteam, or the network service, or in the initscripts, so I'm opening this BZ against libteam initially. Please change the component if there is a more correct one.