Bug 1357738

Summary:	The device's master is unset when downed outside NetworkManager
Product:	Red Hat Enterprise Linux 7	Reporter:	Zhengtong <zhengtli>
Component:	NetworkManager	Assignee:	Lubomir Rintel <lrintel>
Status:	CLOSED ERRATA	QA Contact:	Desktop QE <desktop-qa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.3	CC:	aloughla, atragler, bgalvani, dcbw, knoel, lrintel, qzhang, rkhan, sukulkar, thaller, thuth, vbenes, virt-maint, weliao, zhengtli
Target Milestone:	rc	Keywords:	Regression
Target Release:	---
Hardware:	All
OS:	Unspecified
Whiteboard:
Fixed In Version:	NetworkManager-1.4.0-3.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1523572		Environment:
Last Closed:	2016-11-03 19:24:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1523572

Description Zhengtong 2016-07-19 03:40:41 UTC

Description of problem:
The guest network does not recover after the backend (tap) device is set link down and then link up again on the host.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-13.el7
Host&guest kernel:3.10.0-470.el7.ppc64le


How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with a virtio-net-pci network device:
# /usr/libexec/qemu-kvm \
...
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c4:37:94,bus=pci.0,addr=0x5 \
...
2. After the guest has booted, ping an external host from inside the guest:
[root@dhcp70-246 ~]# ping 10.19.112.40
PING 10.19.112.40 (10.19.112.40) 56(84) bytes of data.
64 bytes from 10.19.112.40: icmp_seq=1 ttl=64 time=0.224 ms
64 bytes from 10.19.112.40: icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from 10.19.112.40: icmp_seq=3 ttl=64 time=0.050 ms

3. On the host, set the backend link down:
[root@ibm-p8-rhevm-05 staf-kvm-devel]# ip link set tap0 down

4. After a few seconds, set the backend link up again:
[root@ibm-p8-rhevm-05 staf-kvm-devel]# ip link set tap0 up

Actual results:
The guest still cannot ping out:
....
From 10.19.113.72 icmp_seq=123 Destination Host Unreachable
From 10.19.113.72 icmp_seq=124 Destination Host Unreachable
From 10.19.113.72 icmp_seq=125 Destination Host Unreachable
From 10.19.113.72 icmp_seq=126 Destination Host Unreachable
From 10.19.113.72 icmp_seq=127 Destination Host Unreachable
From 10.19.113.72 icmp_seq=128 Destination Host Unreachable
From 10.19.113.72 icmp_seq=129 Destination Host Unreachable
From 10.19.113.72 icmp_seq=130 Destination Host Unreachable
....

Expected results:
The guest can ping out again successfully

Additional info:

Comment 2 David Gibson 2016-08-01 03:04:57 UTC

Does this bug also occur on x86, or only on ppc64le?

Did this bug also occur on RHEL7.2 (host), or is it a regression in RHEL7.3?

Comment 3 Zhengtong 2016-08-01 05:10:42 UTC

It only happens on ppc64le. The network resumes on x86.

Comment 4 Thomas Huth 2016-08-02 20:45:08 UTC

Observation: When taking the link down, the tap interface is removed from the bridge:

# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.9a616a17e4a0	yes		tap0
# ip link set tap0 down
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.000000000000	yes		

And after enabling the interface again, the tap0 is not automatically connected again:

# ip link set tap0 up
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.000000000000	yes		

I can manually connect the tap0 interface to the bridge again:

# brctl addif virbr0 tap0
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.9a616a17e4a0	yes		tap0

... and after doing so, the network in the guest is working properly again!

Now the question is: Why is there a difference between x86 and ppc64?
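
The before/after comparison of "brctl show" in this comment can be scripted. The following is a minimal sketch and not part of the original report: bridge_ports is a hypothetical helper name, and it assumes the tabular "brctl show" layout shown above (one header line, the bridge name in the first column, additional ports on indented continuation lines).

```shell
# Hypothetical helper (not from the report): list the ports of one bridge.
# Reads `brctl show` output on stdin; assumes the layout shown in this comment.
bridge_ports() {
    awk -v br="$1" '
        NR == 1 { next }                    # skip the column header line
        /^[^ \t]/ {                         # a new bridge row starts in column 0
            cur = $1
            if (cur == br && NF >= 4) print $4
            next
        }
        cur == br && NF >= 1 { print $1 }   # indented continuation row: one more port
    '
}
```

For example, running "brctl show | bridge_ports virbr0" before and after "ip link set tap0 down" should list tap0 both times if the port stays attached to the bridge.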

Comment 5 Thomas Huth 2016-08-03 10:33:19 UTC

I've now checked this on an x86_64 host, too, and I get the very same behavior there as on ppc64le: after setting the link down, the tap interface is removed from the bridge, so the guest network does not come back automatically after "ip link set tap0 up". I also had to execute "brctl addif virbr0 tap0" manually there to get it working again.

Zhengtong, could you please describe your setup on x86 (where it was working) in more detail? Which host kernel version and qemu version did you use on x86? What do you get when you execute "brctl show" there before/after each step?

Comment 6 Zhengtong 2016-08-04 02:45:30 UTC


The configuration I tested with is:

Host: 3.10.0-370.el7.x86_64
qemu-kvm-rhev-2.6.0-17.el7

Guest: 3.10.0-327.10.1.el7.x86_64

That's interesting. After I set tap0 down, the tap0 device is still attached to the virbr0 bridge.


Steps and result:

Step 1. After the guest has booted, keep pinging virbr0 (192.168.122.1).

Step 2. On the host, set tap0 down:
[root@dhcp-9-217 ~]# ip link set tap0 down
[root@dhcp-9-217 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
switch		8000.00151736e1c5	no		enp1s0f1
							enp2s0
virbr0		8000.525400d3c77f	no		tap0
							virbr0-nic
And the pinging process paused.

Step 3. On the host, set tap0 up again:
[root@dhcp-9-217 ~]# ip link set tap0 up
[root@dhcp-9-217 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
switch		8000.00151736e1c5	no		enp1s0f1
							enp2s0
virbr0		8000.525400d3c77f	no		tap0
							virbr0-nic

And the pinging process resumed.

I didn't use the latest host and guest kernels, but I think that the network coming back automatically is the natural behaviour.

Comment 7 Thomas Huth 2016-08-04 06:57:24 UTC

OK, thanks a lot for the information! ... I think we're on the right track here: Yesterday, when I was seeing the failure on x86, too, I was using the latest snapshot of RHEL 7.3 on the host (kernel 3.10.0-481.el7.x86_64, qemu-kvm-rhev-2.6.0-18).
Today, I installed RHEL 7.2 on the x86 host (kernel 3.10.0-327.18.2.el7.x86_64, qemu-kvm-1.5.3-105.el7_2.3.x86_64), and now I get the same behavior as you, i.e. the guest network continues to work after setting up the link again!
So it seems that a modification between the two versions has introduced this different behavior ... I'll do more tests to isolate the exact problem ...

Comment 8 Thomas Huth 2016-08-04 09:31:47 UTC

I've now also installed RHEL 7.2 (with kernel 3.10.0-327.18.2.el7.ppc64le) with qemu-kvm-rhev-2.6.0-18.el7.ppc64le on our POWER8 servers - and the guest network continues to work there, too, after setting the link up again. So this is definitely a regression from RHEL 7.2 to the current version of 7.3.

Since I used qemu-kvm-rhev-2.6.0 on both RHEL 7.2 and RHEL 7.3 when testing on ppc, I think the problem is likely not in QEMU itself. So I've now done an additional test, too: I've installed kernel 3.10.0-481 on the RHEL 7.2 installation and tried again after booting it - however, the guest network then still continues to work after setting the link up again, so the problem is likely also not in the kernel... not sure what else can be the culprit... I'll keep on searching...

Comment 9 Thomas Huth 2016-08-04 17:14:02 UTC

As mentioned earlier, the problem can also be reproduced on x86 (I just tried the RHEL-7.3-20160802 snapshot again and reproduced it there), so I'm changing the "Hardware" field to "All".

Comment 10 Thomas Huth 2016-08-04 17:18:38 UTC

I think I've now found the component that is causing the problems: NetworkManager. If I disable NetworkManager before the test, tap0 does _not_ get removed from the bridge when setting the interface down, and the guest network continues to work after setting the link up again! So I'm re-assigning this ticket to the NetworkManager component for further investigation.
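
To see whether NetworkManager is tracking the tap device at all, its device state can be queried. This is a small sketch under stated assumptions: nm_device_state is a hypothetical helper name, and it parses the terse "DEVICE:STATE" lines printed by "nmcli -t -f DEVICE,STATE device". It has not been run against the exact NetworkManager builds discussed in this bug.

```shell
# Hypothetical helper (not from the report): print the NetworkManager state of one device.
# Reads terse "DEVICE:STATE" lines on stdin, as printed by
#   nmcli -t -f DEVICE,STATE device
nm_device_state() {
    awk -F: -v dev="$1" '$1 == dev { print $2; exit }'
}
```

For example, if "nmcli -t -f DEVICE,STATE device | nm_device_state tap0" prints "unmanaged", NetworkManager is leaving tap0 alone; any other state suggests it is tracking the device and may react when the link goes down.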

Comment 12 Dan Williams 2016-08-16 15:34:24 UTC

Quick question, why does the tap device need to be "down"?  I'm not saying there is no NM bug here, just curious.

Comment 13 Zhengtong 2016-08-17 00:18:06 UTC

(In reply to Dan Williams from comment #12)
> Quick question, why does the tap device need to be "down"?  I'm not saying
> there is no NM bug here, just curious.

I think this is a test of product robustness. It is possible that we would set the tap device down while debugging or testing, but that is not certain.

Comment 14 Lubomir Rintel 2016-08-30 15:36:07 UTC

I can see what is wrong. Working on the fix.

Comment 15 Lubomir Rintel 2016-08-31 10:32:40 UTC

For QE testing:

    ip link add br0 type bridge
    ip addr add 192.0.2.1/24 dev br0
    ip tuntap add tap0 mode tap
    ip addr add 192.0.2.2/24 dev tap0
    ip link set tap0 master br0
    ip link set tap0 up
    ip link set br0 up
     
    # The device has a master...
    sleep 1
    ip link show tap0
    nmcli c
     
    nmcli c down tap0
     
    # ...and now it does not.
    sleep 1
    ip link show tap0
    nmcli c
     
    ip link del br0
    ip link del tap0
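
The two "ip link show tap0" checks above rely on reading the output by eye. As a sketch under stated assumptions, the helper below (hypothetical name link_master) extracts the master from the one-line output of "ip -o link show dev tap0", so the before and after states can be compared in a script. It assumes the master appears as the token following the word "master", as in the usual iproute2 output.

```shell
# Hypothetical helper (not from the report): print the master of a link, if any.
# Reads the output of `ip -o link show dev DEV` on stdin; prints nothing when
# the device has no master.
link_master() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "master") { print $(i + 1); exit } }'
}

# Intended use around the step under test (commented out: needs root and a real tap0):
#   before=$(ip -o link show dev tap0 | link_master)
#   nmcli c down tap0
#   after=$(ip -o link show dev tap0 | link_master)
#   [ "$before" = "$after" ] && echo "master preserved" || echo "master lost"
```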

Comment 16 Thomas Haller 2016-09-01 09:00:54 UTC

fixed upstream: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=3127fb0d17bff0b250218c7bf82b4335b5290825

Comment 18 Vladimir Benes 2016-09-07 15:44:41 UTC

The bridge slave correctly preserves its IP address and master settings when toggled with the "ip link set dev $dev up/down" command.

Comment 20 errata-xmlrpc 2016-11-03 19:24:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html