Bug 1402535

Summary: Team devices cannot be brought up with network service
Product: Red Hat Enterprise Linux 7
Reporter: Dan Sneddon <dsneddon>
Component: libteam
Assignee: Xin Long <lxin>
Status: CLOSED CURRENTRELEASE
QA Contact: Network QE <network-qe>
Severity: high
Priority: high
Version: 7.3
CC: atragler, bgalvani, dsneddon, mleitner, myllynen, racedoro, sukulkar
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-05-17 17:03:50 UTC
Type: Bug
Bug Blocks: 1402537    
Attachments:
[PATCH] settings: fix assertion when changing connection managed state (flags: none)

Description Dan Sneddon 2016-12-07 18:36:34 UTC
Description of problem: On RHEL 7.3 (and probably older versions), a teamd bond cannot be brought up with 'ifup' if the ifcfg file contains "NM_CONTROLLED=no".


Version-Release number of selected component (if applicable):
RHEL: 7.3
NetworkManager-team.x86_64   1:1.4.0-12.el7
libteam.x86_64               1.25-4.el7
teamd.x86_64                 1.25-4.el7


How reproducible: 100%


Steps to Reproduce:
1. Configure /etc/sysconfig/network-scripts/ifcfg-team1 with the following:
DEVICE=team1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=yes
PEERDNS=no
MACADDR="52:54:00:2a:49:2d"
DEVICETYPE=Team
TEAM_CONFIG='{"runner": {"name": "activebackup"}}'

2. Configure /etc/sysconfig/network-scripts/ifcfg-eth1 (or other interface) with the following:
DEVICE=eth1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
TEAM_MASTER=team1
TEAM_PORT_CONFIG='{"prio": 100}'
BOOTPROTO=none

3. Configure /etc/sysconfig/network-scripts/ifcfg-eth2 (or other interface) with the following:
DEVICE=eth2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
TEAM_MASTER=team1
BOOTPROTO=none

4. Run "sudo ifup team1"

Actual results:
# ifup team1
Job for teamd failed because the control process exited with error code. See "systemctl status teamd" and "journalctl -xe" for details.
ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Device team1 does not seem to be present, delaying initialization.


Expected results: The team should be enabled. If I change the NM_CONTROLLED=no to NM_CONTROLLED=yes in the ifcfg-team1 file, it works when I run "ifup team1".

Additional info:
ovs-vswitchd.log contains these error messages:

2016-12-07T18:29:56.190Z|00264|bridge|WARN|could not open network device bond1 (No such device)

It is desired for teaming to work with the network service as well as NetworkManager, since we use the network service instead of NetworkManager to manage interfaces on OpenStack servers.

I'm not sure if this is something that needs to be addressed in libteam, or the network service, or in the initscripts, so I'm opening this BZ against libteam initially. Please change the component if there is a more correct one.

Comment 1 Marcelo Ricardo Leitner 2016-12-07 18:45:58 UTC
Probably related to a fix that went into 7.3 for dealing with ordering during shutdown.

Comment 2 Dan Sneddon 2016-12-07 18:48:39 UTC
(In reply to Marcelo Ricardo Leitner from comment #1)
> Probably related to a fix that went into 7.3 for dealing with ordering
> during shutdown.

That makes sense. I wasn't 100% sure, but I thought I remembered testing this on RHEL 7.2 and it worked with the network service.

Comment 4 Xin Long 2016-12-19 07:33:37 UTC
This may not be caused by the fix dealing with ordering.
After I removed that fix, the issue is still there.

# use the config file from comment0, reboot with NM_CONTROLLED=no
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1

err: Error: Connection activation failed.

I guess that before you ran ifup team1 with NM_CONTROLLED=yes, you didn't run ifdown team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, so please try ifdown team1 first, before changing to NM_CONTROLLED=yes.
But anyway, it should have been handled by NM-team.
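
The ordering suggested above (take the device down under the old setting, then change the file, then bring it up) can be sketched as a small helper. This is only an illustration, not a fix: `set_nm_controlled`, `run`, `IFCFG_DIR` and `DRY_RUN` are hypothetical names introduced here, and the `sed -i` form assumes GNU sed as used elsewhere in this report.

```shell
#!/bin/sh
# Hypothetical knobs for illustration: IFCFG_DIR (where the ifcfg files live)
# and DRY_RUN (print ifdown/ifup instead of running them).
IFCFG_DIR="${IFCFG_DIR:-/etc/sysconfig/network-scripts}"

run() {
    # With DRY_RUN=1, only print the command that would be executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

set_nm_controlled() {
    dev="$1"
    value="$2"
    file="$IFCFG_DIR/ifcfg-$dev"
    # Bring the device down while the old NM_CONTROLLED value is still in
    # place, so whatever started it gets the chance to stop it.
    run ifdown "$dev"
    # Rewrite the NM_CONTROLLED line in place.
    sed -i "s/^NM_CONTROLLED=.*/NM_CONTROLLED=$value/" "$file"
    # Bring the device up under the new setting.
    run ifup "$dev"
}
```

For example, `DRY_RUN=1; set_nm_controlled team1 yes` edits ifcfg-team1 and prints the ifdown/ifup calls without touching the device.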


Another issue I found:
After running the following commands, NM-team cannot be recovered.

# use the config file from comment0, reboot with NM_CONTROLLED=yes
# run with the following command, the team would not work with NM.
sed -i s/NM_CONTROLLED=yes/NM_CONTROLLED=no/g /etc/sysconfig/network-scripts/ifcfg-team1
ifdown team1
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1
sed -i s/NM_CONTROLLED=yes/NM_CONTROLLED=no/g /etc/sysconfig/network-scripts/ifcfg-team1
ifdown team1
sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
ifup team1

err:
Job for teamd failed because the control process exited with error code. See "systemctl status teamd" and "journalctl -xe" for details.

Hi Beniamino,

We need your help: can you check how these two issues are caused in NM-team?

Thanks.

Comment 5 Beniamino Galvani 2017-01-03 13:02:25 UTC
Created attachment 1236882
[PATCH] settings: fix assertion when changing connection managed state

Comment 6 Beniamino Galvani 2017-01-03 13:07:17 UTC
(In reply to Xin Long from comment #4)
> Maybe it's not caused by the fix dealing with ordering.
> As after I remove it, the issue is there.
> 
> # use the config file from comment0, reboot with NM_CONTROLLED=no
> sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g
> /etc/sysconfig/network-scripts/ifcfg-team1
> ifup team1
> 
> err: Error: Connection activation failed.

I can't reproduce this, please attach NM logs.


> I guess before you ifup team1 with NM_CONTROLLED=yes, you didn't ifdown
> team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, pls
> try ifdown team1 first before changing NM_CONTROLLED=yes.
> but anyway, it should have been handled by NM-team.
> 
> 
> Another issue I also found is:
> After running the following command, NM-team cannot be recovered.
> 
> ...
>
> err:
> Job for teamd failed because the control process exited with
> error code. See "systemctl status teamd" and "journalctl -xe"
> for details.

This is caused by a failed assertion in NM, fixed by the patch in comment 5.

However it's unclear to me if this can be the cause of the original issue reported in comment 0, as I'm unable to reproduce it.

The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I don't think NM can be involved at all?

Comment 7 Xin Long 2017-01-04 06:12:35 UTC
(In reply to Beniamino Galvani from comment #6)
> (In reply to Xin Long from comment #4)
> > Maybe it's not caused by the fix dealing with ordering.
> > As after I remove it, the issue is there.
> > 
> > # use the config file from comment0, reboot with NM_CONTROLLED=no
> > sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g
> > /etc/sysconfig/network-scripts/ifcfg-team1
> > ifup team1
> > 
> > err: Error: Connection activation failed.
> 
> I can't reproduce this, please attach NM logs.
http://pastebin.test.redhat.com/442681
Please check from line 220 (Jan  4 01:00:14), where I ran ifup team1.

> 
> 
> > I guess before you ifup team1 with NM_CONTROLLED=yes, you didn't ifdown
> > team1 with NM_CONTROLLED=no. The old team1 device/daemon may affect it, pls
> > try ifdown team1 first before changing NM_CONTROLLED=yes.
> > but anyway, it should have been handled by NM-team.
> > 
> > 
> > Another issue I also found is:
> > After running the following command, NM-team cannot be recovered.
> > 
> > ...
> >
> > err:
> > Job for teamd failed because the control process exited with
> > error code. See "systemctl status teamd" and "journalctl -xe"
> > for details.
> 
> This is caused by a failed assertion in NM, fixed by the patch in comment 5.
Which RHEL 7.4 version will include the patch from comment 5? I'm using:
# NetworkManager -V
1.4.0-12.el7

> 
> However it's unclear to me if this can be the cause of the original issue
> reported in comment 0, as I'm unable to reproduce it.
> 
> The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I
> don't think NM can be involved at all?
I see your point. What I'm thinking is: after I set 'NM_CONTROLLED=yes' with the commands below, team1 is taken over by NM-team, so 'ifup team1' should first clean up the environment for team1 (for example, kill the old teamd daemon for team1).

But from the log, it does not seem to work like that.

> > sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g
> > /etc/sysconfig/network-scripts/ifcfg-team1
> > ifup team1

Thanks for checking it.

Comment 8 Beniamino Galvani 2017-01-09 13:47:15 UTC
(In reply to Xin Long from comment #7)

> http://pastebin.test.redhat.com/442681
> pls check, from line 220 (Jan  4 01:00:14), I ifup team1.

Ok, I see the cause of the failure:

 [1483509614.7430] device (team1): Activation: (team) started teamd [pid 2667]...
 [1483509614.7436] device (team1): disconnecting for new activation request.
 [1483509614.7436] audit: op="connection-activate" uuid="4293abb7-d898-84ff-dae6-bffba04cbee9" name="Team team1" pid=2661 uid=0 result="success"
 [1483509614.7444] device (team1): deactivation: stopping teamd...
 [1483509614.7446] device (team1): Activation: starting connection 'Team team1' (4293abb7-d898-84ff-dae6-bffba04cbee9)
 Daemon already running on PID 2667.

'ifup' reloads the connection, which also activates it because the
connection has ONBOOT=yes. 'ifup' then calls 'nmcli connection up' and
so the connection is brought down and up again. While deactivating it,
NM stops teamd by calling 'teamd -k', and then launches a new teamd
instance. It seems that the old teamd is not dead when the new one is
started, causing the failure of activation.

Xin, can you please file a new bug for this and provide NM logs at
trace level (*)?  I don't think the original problem reported here is
caused by this NM bug because the steps in the bug description don't
involve NM.

(*) set 'level=TRACE' in the [logging] section of
/etc/NetworkManager/NetworkManager.conf and restart NM.
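
For reference, the logging fragment described in the footnote can be written out as below. The helper name `write_trace_logging` is hypothetical, and it overwrites the file it is given, so it is meant for a separate snippet file rather than the main NetworkManager.conf; placing that file under /etc/NetworkManager/conf.d/ is an assumption and not something stated in this report.

```shell
# Write a [logging] section that raises the NM log level to TRACE
# into the file named by $1 (the file is overwritten).
write_trace_logging() {
    cat > "$1" <<'EOF'
[logging]
level=TRACE
EOF
}
```

After writing the fragment, restart NM (for example with 'systemctl restart NetworkManager') so the new level takes effect.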

> > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> # NetworkManager -V
> 1.4.0-12.el7

The patch is not upstream or in RHEL yet.


> > However it's unclear to me if this can be the cause of the original issue
> > reported in comment 0, as I'm unable to reproduce it.
> > 
> > The original issue had all ifcfg-files with NM_CONTROLLED=no, and thus I
> > don't think NM can be involved at all?
> I got your point. what I'm thinking is: after I set 'NM_CONTROLLED=yes' with
> the following commands, then team1 will be taken over by NM-team, when ifup
> team1, it should clean env for team1 first (like, kill the old teamd daemon
> for team1 ...).
> 
> But from log, it seemed not to work like that.

NM kills any existing teamd instance for the device, but there is another problem
when the connection is activated multiple times in a short interval (see above).

Comment 9 Xin Long 2017-01-16 13:42:37 UTC
(In reply to Beniamino Galvani from comment #8)
> (In reply to Xin Long from comment #7)
> 
> > http://pastebin.test.redhat.com/442681
> > pls check, from line 220 (Jan  4 01:00:14), I ifup team1.
> 
> Ok, I see the cause of the failure:
> 
>  [1483509614.7430] device (team1): Activation: (team) started teamd [pid
> 2667]...
>  [1483509614.7436] device (team1): disconnecting for new activation request.
>  [1483509614.7436] audit: op="connection-activate"
> uuid="4293abb7-d898-84ff-dae6-bffba04cbee9" name="Team team1" pid=2661 uid=0
> result="success"
>  [1483509614.7444] device (team1): deactivation: stopping teamd...
>  [1483509614.7446] device (team1): Activation: starting connection 'Team
> team1' (4293abb7-d898-84ff-dae6-bffba04cbee9)
>  Daemon already running on PID 2667.
> 
> 'ifup' reloads the connection, which also activates it because the
> connection has ONBOOT=yes. 'ifup' then calls 'nmcli connection up' and
> so the connection is brought down and up again. While deactivating it,
> NM stops teamd by calling 'teamd -k', and then launches a new teamd
> instance. It seems that the old teamd is not dead when the new one is
> started, causing the failure of activation.
> 
> Xin, can you please file a new bug for this and provide NM logs at
> trace level (*)?  I don't think the original problem reported here is
> caused by this NM bug because the steps in the bug description don't
> involve NM.
> 
The steps in the bug description are actually the same as what I did here:
1. use the config file from comment0, reboot with NM_CONTROLLED=no
2. # sed -i s/NM_CONTROLLED=no/NM_CONTROLLED=yes/g /etc/sysconfig/network-scripts/ifcfg-team1
3. # ifup team1

I will file a new bug for NM as you asked. As for this one, I will ask Dan to retest in his environment once your fix is available.


> (*) set 'level=TRACE' in the [logging] section of
> /etc/NetworkManager/NetworkManager.conf and restart NM.
> 
> > > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> > since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> > # NetworkManager -V
> > 1.4.0-12.el7
> 
> The patch is not upstream or in RHEL yet.
Do you plan to backport it to RHEL 7.4?


Comment 10 Beniamino Galvani 2017-01-23 15:24:51 UTC
(In reply to Xin Long from comment #9)
> > > > This is caused by a failed assertion in NM, fixed by the patch in comment 5.
> > > since which version is the patch in comment 5 in rhel7.4 ? I'm using:
> > > # NetworkManager -V
> > > 1.4.0-12.el7
> > 
> > The patch is not upstream or in RHEL yet.
>
> do you plan to backport it for rhel7.4 ?

The patch is now in the master branch:

 https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=a9384452ed61ca3f1c6e1db175f499307da9c388

and RHEL 7.4 will include it.

Comment 11 Marcelo Ricardo Leitner 2017-01-28 11:16:46 UTC
(In reply to Dan Sneddon from comment #0)
> Steps to Reproduce:
> 1. Configure /etc/sysconfig/network-scripts/ifcfg-team1 with the following:
> NM_CONTROLLED=yes
> 
> 2. Configure /etc/sysconfig/network-scripts/ifcfg-eth1 (or other interface)
> with the following:
> NM_CONTROLLED=no
> 
> 3. Configure /etc/sysconfig/network-scripts/ifcfg-eth2 (or other interface)
> with the following:
> NM_CONTROLLED=no

Hi Dan, is this inconsistency expected? It's weird to have the team interface NM controlled but not its slaves. Either all of them should be handled by NM, or none of them. This mixed setup is not tested/supported.
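
A quick way to spot such a mixed setup is to compare the NM_CONTROLLED value of the team's ifcfg file with that of every file naming it in TEAM_MASTER. The sketch below is only an illustration: `nm_controlled_of` and `check_team_consistency` are hypothetical helper names, and treating a missing NM_CONTROLLED line as "yes" is an assumption.

```shell
# Print the NM_CONTROLLED value of an ifcfg file.
# Assumption: a file without the key counts as "yes".
nm_controlled_of() {
    v=$(sed -n 's/^NM_CONTROLLED=//p' "$1" | tail -n 1 | tr -d "\"'")
    echo "${v:-yes}"
}

# Compare the team's NM_CONTROLLED value with that of each port in dir $1
# whose TEAM_MASTER is $2. Prints one line per mismatch; returns 1 if any.
check_team_consistency() {
    dir="$1"
    team="$2"
    want=$(nm_controlled_of "$dir/ifcfg-$team")
    status=0
    for f in "$dir"/ifcfg-*; do
        # Skip files that are not ports of this team.
        grep -Eq "^TEAM_MASTER=[\"']?$team[\"']?\$" "$f" 2>/dev/null || continue
        got=$(nm_controlled_of "$f")
        if [ "$got" != "$want" ]; then
            echo "mismatch: $f has NM_CONTROLLED=$got, ifcfg-$team has $want"
            status=1
        fi
    done
    return $status
}
```

Run against the files from comment 0 (for example `check_team_consistency /etc/sysconfig/network-scripts team1`), this would flag both ifcfg-eth1 and ifcfg-eth2.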

Comment 12 Dan Sneddon 2017-01-30 16:49:54 UTC
(In reply to Marcelo Ricardo Leitner from comment #11)

> Hi Dan, is this inconsistency expected? It's weird to have the team
> interface to be NM controlled and not its slaves. You should either have
> them all handed by NM, or them all not handled. This mixed setup is not
> tested/supported.

No, Team interfaces should work with either the 'network' service, NetworkManager, or should work when both are installed. Currently, Team interfaces do not function with the 'network' service if NetworkManager is also installed. This is due to a conflict between the two services, if I understand correctly.

The root cause is tracked in the attached bug https://bugzilla.redhat.com/show_bug.cgi?id=1402535 and the fix has been merged and will be available in RHEL 7.4 and above.

Comment 13 Dan Sneddon 2017-01-30 20:33:47 UTC
(In reply to Dan Sneddon from comment #12)

> The root cause is tracked in the attached bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1402535 and the fix has been
> merged and will be available in RHEL 7.4 and above.

Actually, the above may not be the root cause, it may be an ancillary bug that was discovered while troubleshooting this issue. But yes, it is expected that Team interfaces would work with the network service as well as NetworkManager.

Comment 14 Xin Long 2017-02-04 07:17:22 UTC
The bug for NM: https://bugzilla.redhat.com/show_bug.cgi?id=1415641

Comment 15 Xin Long 2017-05-17 17:03:50 UTC
Since the fix has been merged and will be available in RHEL 7.4 and above, I am closing this. If any problem remains, please reopen.