Bug 1357738 - The device's master is unset when downed outside NetworkManager
Summary: The device's master is unset when downed outside NetworkManager
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.3
Hardware: All
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Lubomir Rintel
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks: 1523572
 
Reported: 2016-07-19 03:40 UTC by Zhengtong
Modified: 2017-12-08 10:35 UTC (History)

Fixed In Version: NetworkManager-1.4.0-3.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1523572
Environment:
Last Closed: 2016-11-03 19:24:35 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2581 normal SHIPPED_LIVE Low: NetworkManager security, bug fix, and enhancement update 2016-11-03 12:08:07 UTC

Description Zhengtong 2016-07-19 03:40:41 UTC
Description of problem:
The guest network does not resume after the backend device's link is set down and then up again.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-13.el7
Host&guest kernel:3.10.0-470.el7.ppc64le


How reproducible:
2/2

Steps to Reproduce:
1. Boot the guest with a virtio-net-pci network device:
# /usr/libexec/qemu-kvm \
...
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c4:37:94,bus=pci.0,addr=0x5 \
...
2. After the guest boots, ping an external host from inside the guest:
[root@dhcp70-246 ~]# ping 10.19.112.40
PING 10.19.112.40 (10.19.112.40) 56(84) bytes of data.
64 bytes from 10.19.112.40: icmp_seq=1 ttl=64 time=0.224 ms
64 bytes from 10.19.112.40: icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from 10.19.112.40: icmp_seq=3 ttl=64 time=0.050 ms

3. On the host, set the backend link down:
[root@ibm-p8-rhevm-05 staf-kvm-devel]# ip link set tap0 down

4. After a few seconds, set the backend link up again:
[root@ibm-p8-rhevm-05 staf-kvm-devel]# ip link set tap0 up

Actual results:
The guest still cannot ping out:
....
From 10.19.113.72 icmp_seq=123 Destination Host Unreachable
From 10.19.113.72 icmp_seq=124 Destination Host Unreachable
From 10.19.113.72 icmp_seq=125 Destination Host Unreachable
From 10.19.113.72 icmp_seq=126 Destination Host Unreachable
From 10.19.113.72 icmp_seq=127 Destination Host Unreachable
From 10.19.113.72 icmp_seq=128 Destination Host Unreachable
From 10.19.113.72 icmp_seq=129 Destination Host Unreachable
From 10.19.113.72 icmp_seq=130 Destination Host Unreachable
....

Expected results:
The guest can ping out again successfully

Additional info:

Comment 2 David Gibson 2016-08-01 03:04:57 UTC
Does this bug also occur on x86, or only on ppc64le?

Did this bug also occur on RHEL7.2 (host), or is it a regression in RHEL7.3?

Comment 3 Zhengtong 2016-08-01 05:10:42 UTC
It only happens on ppc64le. The network resumes on x86.

Comment 4 Thomas Huth 2016-08-02 20:45:08 UTC
Observation: When taking the link down, the tap interface is removed from the bridge:

# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.9a616a17e4a0	yes		tap0
# ip link set tap0 down
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.000000000000	yes		

After enabling the interface again, tap0 is not automatically reconnected:

# ip link set tap0 up
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.000000000000	yes		

I can manually connect the tap0 interface to the bridge again:

# brctl addif virbr0 tap0
# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.9a616a17e4a0	yes		tap0

... and after doing so, the network in the guest is working properly again!
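
The manual check above (run "brctl show", look for tap0 under virbr0, re-add it if it is missing) could be scripted roughly as follows. This is only a sketch: the helper name bridge_has_port is hypothetical, and it assumes "brctl show" prints the tab-separated layout shown in this comment, with additional ports of a bridge on continuation lines.

```sh
#!/bin/sh
# bridge_has_port BRIDGE PORT (hypothetical helper, not part of bridge-utils)
# Reads `brctl show` output on stdin. Exits 0 if PORT is listed under BRIDGE,
# 1 otherwise. Assumes the column layout shown in this comment.
bridge_has_port() {
    awk -v br="$1" -v port="$2" '
        NR == 1 { next }          # skip the header line
        { iface = "" }            # reset for every data line
        NF >= 3 { cur = $1 }      # a row that starts a new bridge
        NF >= 4 { iface = $4 }    # first port on the same row as the bridge
        NF == 1 { iface = $1 }    # continuation row: another port of cur
        cur == br && iface != "" && iface == port { found = 1 }
        END { exit !found }
    '
}
```

On an affected host it might be used as: brctl show | bridge_has_port virbr0 tap0 || brctl addif virbr0 tap0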

Now the question is: Why is there a difference between x86 and ppc64?

Comment 5 Thomas Huth 2016-08-03 10:33:19 UTC
I've now checked this on an x86_64 host, too, and I get the very same behavior there as on ppc64le: after setting the link down, the tap interface is removed from the bridge, so the guest network does not come up automatically again when doing the "ip link set tap0 up". I also had to execute "brctl addif virbr0 tap0" manually there to get it working again.

Zhengtong, could you please describe your setup on x86 (where it was working) in more detail? Which host kernel version and qemu version did you use on x86? What do you get when you execute "brctl show" there before/after each step?

Comment 6 Zhengtong 2016-08-04 02:45:30 UTC

The configuration I tested with is:

Host: 3.10.0-370.el7.x86_64
qemu-kvm-rhev-2.6.0-17.el7

Guest: 3.10.0-327.10.1.el7.x86_64

That's interesting. After I set tap0 down, the tap0 device is still attached to the virbr0 bridge.


Steps and result:

Step 1. After the guest boots, keep pinging virbr0 (192.168.122.1).

Step 2. On the host, set tap0 down:
[root@dhcp-9-217 ~]# ip link set tap0 down
[root@dhcp-9-217 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
switch		8000.00151736e1c5	no		enp1s0f1
							enp2s0
virbr0		8000.525400d3c77f	no		tap0
							virbr0-nic
The ping process paused.

Step 3. On the host, set tap0 up again:
[root@dhcp-9-217 ~]# ip link set tap0 up
[root@dhcp-9-217 ~]# brctl show
bridge name	bridge id		STP enabled	interfaces
switch		8000.00151736e1c5	no		enp1s0f1
							enp2s0
virbr0		8000.525400d3c77f	no		tap0
							virbr0-nic

The ping process resumed.

I didn't use the latest host and guest kernels, but I think that re-linking automatically is the natural behaviour.

Comment 7 Thomas Huth 2016-08-04 06:57:24 UTC
OK, thanks a lot for the information! ... I think we're on the right track here: Yesterday, when I was seeing the failure on x86, too, I was using the latest snapshot of RHEL 7.3 on the host (kernel 3.10.0-481.el7.x86_64, qemu-kvm-rhev-2.6.0-18).
Today, I installed RHEL 7.2 on the x86 host (kernel 3.10.0-327.18.2.el7.x86_64, qemu-kvm-1.5.3-105.el7_2.3.x86_64), and now I get the same behavior as you, i.e. the guest network continues to work after setting up the link again!
So it seems that a modification between the two versions has introduced this different behavior ... I'll do more tests to isolate the exact problem ...

Comment 8 Thomas Huth 2016-08-04 09:31:47 UTC
I've now also installed RHEL 7.2 (with kernel 3.10.0-327.18.2.el7.ppc64le) with qemu-kvm-rhev-2.6.0-18.el7.ppc64le on our POWER8 servers - and the guest network continues to work there, too, after setting the link up again. So this is definitely a regression from RHEL 7.2 to the current version of 7.3.

Since I used qemu-kvm-rhev-2.6.0 on both RHEL 7.2 and RHEL 7.3 when testing on ppc, I think the problem is likely not in QEMU itself. So I've now done an additional test, too: I've installed kernel 3.10.0-481 on the RHEL 7.2 installation and tried again after booting it - however, the guest network then still continues to work after setting the link up again, so the problem is likely also not in the kernel... not sure what else can be the culprit... I'll keep on searching...

Comment 9 Thomas Huth 2016-08-04 17:14:02 UTC
As mentioned earlier, the problem can also be reproduced on x86 (I just tried the RHEL-7.3-20160802 snapshot again and reproduced it there), so I'm changing the "Hardware" field to "All".

Comment 10 Thomas Huth 2016-08-04 17:18:38 UTC
I think I've now found the component that is causing the problems: NetworkManager. If I disable NetworkManager before the test, tap0 does _not_ get removed from the bridge when setting the interface down, and the guest network continues to work after setting the link up again! So I'm re-assigning this ticket to the NetworkManager component for further investigation.

Comment 12 Dan Williams 2016-08-16 15:34:24 UTC
Quick question, why does the tap device need to be "down"?  I'm not saying there is no NM bug here, just curious.

Comment 13 Zhengtong 2016-08-17 00:18:06 UTC
(In reply to Dan Williams from comment #12)
> Quick question, why does the tap device need to be "down"?  I'm not saying
> there is no NM bug here, just curious.

I think this is a test of product robustness. We may want to set the tap device down for debugging or testing, but that is not certain.

Comment 14 Lubomir Rintel 2016-08-30 15:36:07 UTC
I can see what is wrong. Working on the fix.

Comment 15 Lubomir Rintel 2016-08-31 10:32:40 UTC
For QE testing:

    ip link add br0 type bridge
    ip addr add 192.0.2.1/24 dev br0
    ip tuntap add tap0 mode tap
    ip addr add 192.0.2.2/24 dev tap0
    ip link set tap0 master br0
    ip link set tap0 up
    ip link set br0 up
     
    # The device has a master...
    sleep 1
    ip link show tap0
    nmcli c
     
    nmcli c down tap0
     
    # ...and now it does not.
    sleep 1
    ip link show tap0
    nmcli c
     
    ip link del br0
    ip link del tap0
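
The "has a master / does not" comparison in the script above is done by eye from the "ip link show tap0" output. A minimal sketch of an automated check follows. The helper name link_master is hypothetical, and it assumes the one-line output of "ip -o link show DEV", in which an enslaved device carries a "master NAME" token.

```sh
#!/bin/sh
# link_master (hypothetical helper): reads `ip -o link show DEV` output on
# stdin and prints the name that follows the "master" token, or nothing if
# the link has no master.
link_master() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "master") { print $(i + 1); exit }
    }'
}
```

In the test script it could be used by saving before=$(ip -o link show tap0 | link_master) ahead of the down step, saving after=... the same way once the step has run, and failing if the two values differ.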

Comment 18 Vladimir Benes 2016-09-07 15:44:41 UTC
The bridge slave now correctly preserves its IP address and master settings when toggled with the "ip link set dev $dev up/down" command.

Comment 20 errata-xmlrpc 2016-11-03 19:24:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html

