Bug 1578668 - [NMCI] nmcli_general_correct_profile_activated_after_restart test failure
Summary: [NMCI] nmcli_general_correct_profile_activated_after_restart test failure
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.5
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Beniamino Galvani
QA Contact: Desktop QE
URL:
Whiteboard:
: 1592830 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-16 06:32 UTC by Vladimir Benes
Modified: 2018-09-17 12:14 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-17 12:14:34 UTC
Target Upstream Version:


Attachments (Terms of Use)
NM logfile from test failure (colorized, watch with `less -R`) (3.80 MB, application/octet-stream)
2018-05-25 08:23 UTC, Thomas Haller
no flags Details
[PATCH] platform: sort known IPv6 addresses by scope before a sync (2.04 KB, patch)
2018-05-30 14:03 UTC, Beniamino Galvani
no flags Details | Diff

Description Vladimir Benes 2018-05-16 06:32:10 UTC
Description of problem:
after this test I can see this, very odd:
[root@gsm-r6s14-01 ~]# nmcli  device 
DEVICE   TYPE      STATE         CONNECTION 
testG    ethernet  connected     --         
eth0     ethernet  connected     testeth0   
eth1     ethernet  disconnected  --         
eth10    ethernet  disconnected  --     


https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/NetworkManager/job/beaker-NetworkManager-master-upstream/625/artifact/beaker/J_2479044/FAIL_report_NetworkManager-ci_Test245_nmcli_general_correct_profile_activated_after_restart.html

I've slightly modified the test and it's visible every time:
    @nmcli_general_correct_profile_activated_after_restart
    Scenario: nmcli - general - correct profile activated after restart
    * Prepare simulated test "testG" device
    * Add a new connection of type "ethernet" and options "ifname testG con-name con_general -- ipv4.method auto ipv6.method auto ipv4.may-fail no ipv6.may-fail no"
    * Add a new connection of type "ethernet" and options "ifname testG con-name con_general2 -- ipv4.method auto ipv6.method auto ipv4.may-fail no ipv6.may-fail no"
    * Wait for at least "2" seconds
    * Bring up connection "con_general"
    When "100" is visible with command "nmcli  -t -f GENERAL.STATE device show testG"
    When "connected:con_general:testG" is visible with command "nmcli -t -f STATE,CONNECTION,DEVICE device" in "10" seconds
    * Restart NM
    Then "connected:con_general:testG" is visible with command "nmcli -t -f STATE,CONNECTION,DEVICE device" in "10" seconds
     And "con_general2" is not visible with command "nmcli device"


Version-Release number of selected component (if applicable):
NetworkManager-1.11.3-20184.c905ee94f0.el7.x86_64

How reproducible:
always

Comment 4 Thomas Haller 2018-05-25 08:23:09 UTC
Created attachment 1441448 [details]
NM logfile from test failure (colorized, watch with `less -R`)

It seems, after restart of NM, assuming the connection on device "testG" fails.


There are several issues.


Issue 1.):

# start assuming connection

[1527230126.9200] manager: (testG): assume: will attempt to assume matching connection 'con_general' (2f22bc2f-ef70-4b5e-9f77-14bf5678d1ea) (indicated)

[1527230127.1760] device (testG): state change: config -> ip-config (reason 'none', sys-iface-state: 'assume')


# start slaac/DHCPv6

[1527230128.5580] ndisc-lndp[0x5569a2ead200,"testG"]: received router advertisement at 3

[1527230128.5581] ndisc[0x5569a2ead200,"testG"]:   dhcp-level managed

[1527230128.6013] dhcp6 (testG): dhclient started with pid 2072


# DHCPv6 fails

[1527230128.6169] dhcp6 (testG): client pid 2072 exited with status 1

[1527230128.6170] device[0x5569a2e727d0] (testG): DHCPv6 failed


# at this point, no more DHCPv6 is tried, until we eventually fail with:

[1527230608.9838] device (testG): DHCPv6: grace period expired

[1527230608.9839] device (testG): state change: ip-config -> failed (reason 'ip-config-expired', sys-iface-state: 'assume')


Bug: having the grace period, only if DHCP is still running.




Issue 2.):

in the logfile, we receive very often router advertisements, for example:

[1527230598.5162] ndisc-lndp[0x5569a2ead200,"testG"]: received router advertisement at 473

whenever that happens, we remove and add IPv6 addresses

[1527230598.5164] platform: address: deleting IPv6 address fe80::780c:1459:5c81:4527/64, ifindex 38 dev testG

this seems wrong, because repeated RAs should not result in any changes (unless, of course, a router goes away or appears or change config, which does not seem the case here). Needs investigation.



Issue 3.)

Figure out why DHCPv6 fails in the first place at 

[1527230128.6169] dhcp6 (testG): client pid 2072 exited with status 1

Comment 5 Thomas Haller 2018-05-29 16:09:44 UTC
> Issue 3.)
>
> Figure out why DHCPv6 fails in the first place at 
>
> [1527230128.6169] dhcp6 (testG): client pid 2072 exited with status 1

It seems the reason is that a conflicting dhclient instance is running. But I couldn't figure out why that is happening, because we should kill dhclient via the PID file...

Comment 6 Beniamino Galvani 2018-05-30 07:57:09 UTC
I think the problem is:

[1527665185.7301] device (testG):  ---- IPv6 config merge and apply
[1527665185.7302]  ---- composite after merging AC
[1527665185.7302]       a: 2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1800sec pref 1800sec lifetime 6-6[1800,1800] dev 35 flags noprefixroute src ndisc
[1527665185.7303]  ---- composite after merging EXT
[1527665185.7303]       a: 2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1800sec pref 1800sec lifetime 6-6[1800,1800] dev 35 flags noprefixroute src ndisc
[1527665185.7303]       a: fe80::3bf3:e4ef:bbd:3b43/64 lft forever pref forever lifetime 6-0[4294967295,4294967295] dev 35 flags permanent,noprefixroute src kernel
[1527665185.7303]       a: 2620:dead:beaf::18b/128 lft 23sec pref 23sec lifetime 6-4[25,25] dev 35 flags noprefixroute,tentative src kernel
[1527665185.7303]  ---- composite  after merging CON
[1527665185.7304]       a: 2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1800sec pref 1800sec lifetime 6-6[1800,1800] dev 35 flags noprefixroute src ndisc
[1527665185.7304]       a: fe80::3bf3:e4ef:bbd:3b43/64 lft forever pref forever lifetime 6-0[4294967295,4294967295] dev 35 flags permanent,noprefixroute src kernel
[1527665185.7304]       a: 2620:dead:beaf::18b/128 lft 23sec pref 23sec lifetime 6-4[25,25] dev 35 flags noprefixroute,tentative src kernel
[1527665185.7304] device[0x23c74d0] (testG): ip6-config: update (commit=1, new-config=0x23d3e50)
[1527665185.7304] platform:  ---- ip6 address sync, full 0 ----
[1527665185.7304] platform:    (known) 2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1800sec pref 1800sec lifetime 6-6[1800,1800] dev 35 flags noprefixroute src ndisc
[1527665185.7304] platform:    (known) fe80::3bf3:e4ef:bbd:3b43/64 lft forever pref forever lifetime 6-0[4294967295,4294967295] dev 35 flags permanent,noprefixroute src kernel
[1527665185.7305] platform:    (known) 2620:dead:beaf::18b/128 lft 23sec pref 23sec lifetime 6-4[25,25] dev 35 flags noprefixroute,tentative src kernel
[1527665185.7305] platform:    -
[1527665185.7305] platform:    (plat)  2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1766sec pref 1766sec lifetime 6-4[1768,1768] dev 35 flags noprefixroute src kernel
[1527665185.7305] platform:    (plat)  2620:dead:beaf::18b/128 lft 23sec pref 23sec lifetime 6-4[25,25] dev 35 flags noprefixroute,tentative src kernel
[1527665185.7305] platform:    (plat)  fe80::3bf3:e4ef:bbd:3b43/64 lft forever pref forever lifetime 6-0[4294967295,4294967295] dev 35 flags permanent,noprefixroute src kernel
[1527665185.7305] platform:  ### deleting address because doesn't match:
[1527665185.7306] platform:  ###    plat : fe80::3bf3:e4ef:bbd:3b43/64 lft forever pref forever lifetime 6-0[4294967295,4294967295] dev 35 flags permanent,noprefixroute src kernel
[1527665185.7306] platform:  ###    known: 2620:dead:beaf:0:34fa:fb:a0a6:730d/64 lft 1800sec pref 1800sec lifetime 6-6[1800,1800] dev 35 flags noprefixroute src ndisc

After restart the IPv6 LL address is added to the composite config we
want to apply in the wrong order because it is added as part of the
captured external IPv6 config and so at each commit NM removes and
adds again addresses. Therefore, the link-local address stays in
tentative state, which causes problems with DHCP and NDP.

Comment 7 Beniamino Galvani 2018-05-30 14:03:21 UTC
Created attachment 1445908 [details]
[PATCH] platform: sort known IPv6 addresses by scope before a sync

Comment 8 Thomas Haller 2018-05-30 15:29:33 UTC
+    NMPlatformIP6Address *x = NMP_OBJECT_CAST_IP6_ADDRESS (*(void **) a);

"const NMPlatformIP6Address *x"

Comment 9 Thomas Haller 2018-05-30 15:34:38 UTC
(In reply to Thomas Haller from comment #8)
> +    NMPlatformIP6Address *x = NMP_OBJECT_CAST_IP6_ADDRESS (*(void **) a);
> 
> "const NMPlatformIP6Address *x"

otherwise, patch from comment 7 lgtm.

I assuem, this fixes issue #2 (from comment 4).



Can you confirm, that the tentative IPv6 addresss caused dhclient to exit immediately? (issue #3 from comment 4)


It seems, issue #1 still needs fixing (although, it doesn't happen now anymore).

Comment 10 Beniamino Galvani 2018-05-30 16:13:51 UTC
(In reply to Thomas Haller from comment #9)
> (In reply to Thomas Haller from comment #8)
> > +    NMPlatformIP6Address *x = NMP_OBJECT_CAST_IP6_ADDRESS (*(void **) a);
> > 
> > "const NMPlatformIP6Address *x"

Ok.

> otherwise, patch from comment 7 lgtm.
> 
> I assuem, this fixes issue #2 (from comment 4).

Yes.

> Can you confirm, that the tentative IPv6 addresss caused dhclient to exit
> immediately? (issue #3 from comment 4)

Yes:

  root@fedora ~ # echo 40 > /proc/sys/net/ipv6/conf/ens4/dad_transmits
  root@fedora ~ # ip l set ens4 down
  root@fedora ~ # ip l set ens4 up
  root@fedora ~ # ip a show dev ens4
  3: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
      link/ether 52:54:00:a2:8a:9a brd ff:ff:ff:ff:ff:ff
      inet6 fe80::5054:ff:fea2:8a9a/64 scope link tentative 
         valid_lft forever preferred_lft forever
  root@fedora ~ # dhclient -v -6 ens4
  Internet Systems Consortium DHCP Client 4.3.6
  Copyright 2004-2017 Internet Systems Consortium.
  All rights reserved.
  For info, please visit https://www.isc.org/software/dhcp/

  Can't bind to dhcp address: Cannot assign requested address
  ...
  exiting.

Applied the patch to master branch:

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=851d89dc6ab59cd19de03fc842893fabe8c33036

> It seems, issue #1 still needs fixing (although, it doesn't happen now
> anymore).

Correct, I'll investigate this other issue in the next days.

Comment 11 Thomas Haller 2018-07-10 13:30:04 UTC
*** Bug 1592830 has been marked as a duplicate of this bug. ***

Comment 12 Beniamino Galvani 2018-09-04 21:04:55 UTC
Hi Vladimir,

the test is now passing on all branches. Can this be closed now? I'll file a separate bug for the issue #1 suggested by Thomas.

Comment 13 Beniamino Galvani 2018-09-06 08:16:48 UTC
Filed bug 1625901 for issue #1.


Note You need to log in before you can comment on or make changes to this bug.