Bug 1505893

Summary: [NMCI] show_zones_after_firewalld_install test failure
Product: Red Hat Enterprise Linux 7 Reporter: Vladimir Benes <vbenes>
Component: NetworkManager Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.5 CC: aloughla, atragler, bgalvani, fgiudici, lmiksik, lrintel, rkhan, sukulkar, thaller, vbenes
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: NetworkManager-1.10.2-6.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 13:31:31 UTC Type: Bug

Description Vladimir Benes 2017-10-24 13:57:20 UTC
Description of problem:
See the failing test log for more details:

https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/NetworkManager/job/beaker-NetworkManager-master-upstream/410/artifact/beaker/J_2107949/FAIL_test_log-NetworkManager-ci_Test401_show_zones_after_firewalld_install.html

It looks like eth1 took precedence when ethie was created, even though eth0 was connected first. There may be a race, since the failure is not reproducible when only this one test is executed. When the whole master test suite is run, it fails consistently.


Version-Release number of selected component (if applicable):
NetworkManager-1.9.2-18636.b3b9b2bf38

Comment 2 Beniamino Galvani 2017-11-24 09:39:43 UTC
I think this is a kernel issue. DNS requests are sent through eth1:

  [root@testhostname NetworkManager-ci]# ping download.eng.bos.redhat.com &                                                                                                                               
  [1] 21610

  [root@testhostname NetworkManager-ci]# tcpdump -i eth0 -n                                                                                                                                               
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
  04:22:21.328980 IP6 fe80::527b:9dff:fed8:38e6 > ff02::1:ffd8:38e6: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ffd8:38e6, length 24
  04:22:21.515952 IP6 fe80::527b:9dff:fed8:30fe > ff02::1:ffd8:30fe: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ffd8:30fe, length 24
  ...
  
  [root@testhostname NetworkManager-ci]# tcpdump -i eth1 -n -xx                                                                                                                              
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
  04:22:51.270410 IP 10.16.122.90.52940 > 10.16.36.29.domain: 41986+ A? download.eng.bos.redhat.com.wlan.rhts.eng.bos.redhat.com. (74)
        0x0000:  829d 6753 84e0 f011 2233 4455 0800 4500
        0x0010:  0066 4b53 4000 4011 3c9d 0a10 7a5a 0a10
        0x0020:  241d cecc 0035 0052 b2fa a402 0100 0001
        0x0030:  0000 0000 0000 0864 6f77 6e6c 6f61 6403
        0x0040:  656e 6703 626f 7306 7265 6468 6174 0363
        0x0050:  6f6d 0477 6c61 6e04 7268 7473 0365 6e67
        0x0060:  0362 6f73 0672 6564 6861 7403 636f 6d00
        0x0070:  0001 0001

The source address (10.16.122.90) belongs to another interface, eth0,
and the default route is also through eth0:

  [root@testhostname NetworkManager-ci]#  ip r
  default via 10.16.122.254 dev eth0 proto dhcp metric 100 
  default via 192.168.100.1 dev eth1 proto dhcp metric 100 
  10.16.122.0/24 dev eth0 proto kernel scope link src 10.16.122.90 metric 100 
  192.168.100.0/24 dev eth1 proto kernel scope link src 192.168.100.20 metric 100 
  
  [root@testhostname NetworkManager-ci]# ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host 
         valid_lft forever preferred_lft forever
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
      link/ether 50:7b:9d:d8:38:e6 brd ff:ff:ff:ff:ff:ff
      inet 10.16.122.90/24 brd 10.16.122.255 scope global noprefixroute dynamic eth0
         valid_lft 85905sec preferred_lft 85905sec
      inet6 2620:52:0:107a:527b:9dff:fed8:38e6/64 scope global noprefixroute dynamic 
         valid_lft 2591848sec preferred_lft 604648sec
      inet6 fe80::527b:9dff:fed8:38e6/64 scope link noprefixroute 
         valid_lft forever preferred_lft forever
  10: eth1@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
      link/ether f0:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff link-netnsid 0
      inet 192.168.100.20/24 brd 192.168.100.255 scope global noprefixroute dynamic eth1
         valid_lft 217sec preferred_lft 217sec
      inet6 fe80::7232:87a3:eb24:325e/64 scope link noprefixroute 
         valid_lft forever preferred_lft forever
  
  [root@testhostname NetworkManager-ci]# ip n
  192.168.100.1 dev eth1 lladdr 82:9d:67:53:84:e0 REACHABLE
  10.16.122.254 dev eth0 lladdr 28:8a:1c:09:a5:c1 STALE
  fe80:52:0:107a::fe dev eth0 lladdr 28:8a:1c:09:a5:c1 router STALE
  fe80::1875:2aff:fe6a:ce23 dev eth10 lladdr 1a:75:2a:6a:ce:23 router STALE
  
The kernel is 3.10.0-730.el7.sgruszka1.x86_64. I'll try with a more recent kernel.
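
The `ip r` output above lists two default routes with the same metric. As a minimal sketch (not part of the CI scripts), the check below parses such output and reports default routes that share a metric. The function names are hypothetical, and the parser assumes every attribute after `default` is a key/value pair, which does not hold for routes carrying bare flags such as `onlink`.

```python
from collections import defaultdict

# Routing table copied from the "ip r" output in this comment.
IP_ROUTE_OUTPUT = """\
default via 10.16.122.254 dev eth0 proto dhcp metric 100
default via 192.168.100.1 dev eth1 proto dhcp metric 100
10.16.122.0/24 dev eth0 proto kernel scope link src 10.16.122.90 metric 100
192.168.100.0/24 dev eth1 proto kernel scope link src 192.168.100.20 metric 100
"""


def default_routes_by_metric(text):
    """Group default routes as {metric: [(device, gateway), ...]}."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        # Assumption: tokens after "default" come in key/value pairs.
        attrs = dict(zip(fields[1::2], fields[2::2]))
        metric = int(attrs.get("metric", 0))
        groups[metric].append((attrs.get("dev"), attrs.get("via")))
    return dict(groups)


def conflicting_defaults(text):
    """Return only the metrics used by more than one default route."""
    grouped = default_routes_by_metric(text)
    return {m: routes for m, routes in grouped.items() if len(routes) > 1}


conflicts = conflicting_defaults(IP_ROUTE_OUTPUT)
for metric, routes in conflicts.items():
    print(f"metric {metric} is shared by: {routes}")
```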

Comment 3 Beniamino Galvani 2017-11-24 16:14:08 UTC
Hi Vladimir,

The problem doesn't seem to happen with kernel 3.10.0-783.

Can you update CI scripts to use a more recent kernel?

Comment 4 Beniamino Galvani 2017-12-05 13:56:07 UTC
When the connection is added on eth1 we get the following default
routes:

 default via 10.16.122.254 dev eth0 proto dhcp metric 100
 default via 192.168.100.1 dev eth1 proto dhcp metric 100

while in the past the device that was activated first got the lower metric:

 default via 10.16.122.254 dev eth0 proto dhcp metric 100
 default via 192.168.100.1 dev eth1 proto dhcp metric 101

The change is the result of commit [1] that removed the
default-route-manager and started to add default routes without
tweaking the metric. The previous behavior is described in [2].
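
To illustrate the difference between the two behaviors, here is a minimal sketch. It is not the default-route-manager code; `assign_bumped_metrics` and `assign_plain_metrics` are hypothetical names, and the "lowest free metric" rule is an assumption chosen only to reproduce the 100/101 result shown above.

```python
def assign_bumped_metrics(devices, base_metric=100):
    """Old-style result: devices are listed in activation order and each
    one gets the lowest metric >= base_metric that is not already taken."""
    used = set()
    metrics = {}
    for dev in devices:
        metric = base_metric
        while metric in used:
            metric += 1
        used.add(metric)
        metrics[dev] = metric
    return metrics


def assign_plain_metrics(devices, base_metric=100):
    """New-style result: every device keeps the configured metric as is."""
    return {dev: base_metric for dev in devices}


activation_order = ["eth0", "eth1"]
print(assign_bumped_metrics(activation_order))  # {'eth0': 100, 'eth1': 101}
print(assign_plain_metrics(activation_order))   # {'eth0': 100, 'eth1': 100}
```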

The effect of having multiple routes with the same metric is that ECMP
(multi-path routing) is used and packets flow through one gateway or
the other based on a layer-3 hash.
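
As a rough sketch of hash-based next-hop selection: the code below hashes the source and destination addresses and uses the result to pick a gateway. This is not the kernel's hash function; CRC32 and the name `pick_next_hop` are stand-ins chosen for illustration. The point is only that the choice depends on the address pair and not on which device was activated first.

```python
import ipaddress
import zlib


def pick_next_hop(src, dst, next_hops):
    """Pick a next hop from a hash of the layer-3 addresses.

    Assumption: CRC32 over the packed addresses stands in for the
    real kernel hash, which is different.
    """
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Gateways of the two equal-metric default routes from this report.
gateways = ["10.16.122.254", "192.168.100.1"]

# The same address pair always maps to the same gateway, but
# different destinations may be spread across both gateways.
for dst in ("10.16.36.29", "10.16.36.30", "10.16.36.31"):
    print(dst, "->", pick_next_hop("10.16.122.90", dst, gateways))
```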

In the test scenario, eth1 is added as a default route but doesn't
actually route packets, and this leads to the test failure. We can
easily fix the test by specifying a higher metric for eth1 (or by
disabling the default route for eth1), but I wonder if the change in
behavior is acceptable.
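
The two test-side workarounds could be expressed as nmcli invocations. The sketch below only builds and prints the command lines and does not run them. It assumes the eth1 profile is the one named `ethie` in the description and uses 101 as an illustrative metric; the helper names are hypothetical. `ipv4.route-metric` and `ipv4.never-default` are the connection properties involved.

```python
import shlex

CONNECTION = "ethie"  # assumption: profile name of the eth1 connection


def raise_metric_cmd(connection, metric):
    """Command that gives the connection's IPv4 routes a higher metric."""
    return ["nmcli", "connection", "modify", connection,
            "ipv4.route-metric", str(metric)]


def never_default_cmd(connection):
    """Command that stops the connection from providing a default route."""
    return ["nmcli", "connection", "modify", connection,
            "ipv4.never-default", "yes"]


commands = [raise_metric_cmd(CONNECTION, 101), never_default_cmd(CONNECTION)]
for cmd in commands:
    # Print a shell-safe rendering; running it is left to the test setup.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```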

[1] https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=77ec302714795f905301d500b9aab6c88001f32e
[2] https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=e8824f6a5205ffcf761abd3e0897a22b254c7797

Comment 5 Thomas Haller 2017-12-07 16:59:56 UTC
How about th/device-route-metric-rh1505893?

Comment 6 Beniamino Galvani 2017-12-14 08:51:57 UTC
LGTM (didn't test).

Comment 8 Vladimir Benes 2017-12-20 12:19:12 UTC
working on altered branch with tests

Comment 12 errata-xmlrpc 2018-04-10 13:31:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0778