Bug 2096386

Summary:

crio fails to bind to tentative IP, causing service failure since RHOCS was rebased on RHEL 8.6

Product:

Red Hat Enterprise Linux 8

Reporter:

Dan Williams <dcbw>

Component:

NetworkManager

Assignee:

Beniamino Galvani <bgalvani>

Status:

CLOSED ERRATA

QA Contact:

David Jaša <djasa>

Severity:

high

Docs Contact:

Priority:

high

Version:

8.5

CC:

bgalvani, dcbw, derekh, ferferna, lrintel, mko, rkhan, sfaye, stbenjam, sukulkar, thaller, till, vbenes

Target Milestone:

Keywords:

Triaged

Target Release:

---

Flags:

mko: needinfo-
pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

NetworkManager-1.40.2-1.el8

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

2096226

Environment:

Last Closed:

2023-05-16 09:04:54 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

2096226

Bug Blocks:

Attachments:

Description	Flags
DAD not effective for DHCP address	none
DAD not effective for EUI64 address	none
EUI64 address - 'full' connectivity with link-local address only	none
DAD not effective for EUI64 address	none
EUI64 address - 'full' connectivity with link-local address only	none
DAD not effective for DHCP address	none
report where behaviour described in previous comment is clearly visible	none

Description Dan Williams 2022-06-13 17:02:18 UTC

+++ This bug was initially created as a clone of Bug #2096226 +++

In some cases, crio is failing to start on a baremetal IPv6 run, with a message like this:


  Jun 10 23:14:42 master-0.ostest.test.metalkube.org crio[2878]:
  time="2022-06-10 23:14:42.595976451Z" level=fatal msg="Failed to start
  streaming server: listen tcp [fd2e:6f44:5dd8:c956::14]:10010: bind: cannot
  assign requested address"

Shortly before, we see configure-ovs.sh moving the IP to br-ex:

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + 
  echo 'Brought up connection ovs-if-br-ex successfully'

And the IP is marked tentative, which Derek Higgens tested and confirms makes crio refuse to bind to it.

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     
  inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute

And cri-o's service unit is set to restart on-abnormal not on-failure: https://github.com/cri-o/cri-o/blob/main/contrib/systemd/crio.service#L25, which means it won't retry on this kind of failure:

  If set to on-abnormal, the service will be restarted when the process is 
  terminated by a signal (including on core dump, excluding the aforementioned 
  four signals), when an operation times out, or when the watchdog timeout is   
  triggered.


NetworkManager from 8.6 has had significant parts rewritten, as well as changes to configure-ovs.sh to compensate, so it's likely there's been some timing changes that makes this error occur.

--- Additional comment from Dan Williams on 2022-06-13 11:54:13 CDT ---

It appears that NM isn't waiting for DHCPv6 addresses to finish DAD before indicating the connection has activated when v6.may-fail=no

Jun 10 23:14:38 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + nmcli c add type ovs-interface slave-type ovs-port conn.interface br-ex master ovs-port-br-ex con-name ovs-if-br-ex 802-3-ethernet.mtu 1500 802-3-ethernet.cloned-mac-address 00:5f:27:59:1f:38 ipv4.route-metric 48 ipv6.route-metric 48 ipv6.may-fail no ipv6.addr-gen-mode eui64 connection.autoconnect no
<snip>
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8275] dhcp6 (br-ex): activation: beginning transaction (timeout in 45 seconds)
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8284] dhcp6 (br-ex): state changed new lease, address=fd2e:6f44:5dd8:c956::14
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8923] policy: set 'ovs-if-br-ex' (br-ex) as default for IPv6 routing and DNS
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7901] device (br-ex): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7903] device (br-ex): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7906] device (br-ex): Activation: successful, device activated.
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]: 5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     link/ether 00:5f:27:59:1f:38 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     openvswitch numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft 3599sec preferred_lft 3599sec
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fe80::25f:27ff:fe59:1f38/64 scope link noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft forever preferred_lft forever
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org systemd[1]: Starting Wait for a non-localhost hostname...
Jun 10 23:14:42 master-0.ostest.test.metalkube.org systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...

Comment 1 Dan Williams 2022-06-13 17:40:55 UTC

It appears that the DHCPv6 code doesn't wait for addresses DAD to complete before signaling the connection is activated...

Seen in 1.36.0-5.el8_6

Comment 6 David Jaša 2022-08-03 10:55:54 UTC

Testing using WIP scenarios, I tend to think this is FailedQA but I'd need a confirmation. When run in the loop, the scenarios I made occasionally fail for both DHCP-assigned address and for global EUI64 SLAAC address. Logs only mention DAD taking place for fe80:: link-local addresses even at trace level.

The failure is not reliable unless run in a loop.

Comment 7 David Jaša 2022-08-03 10:57:36 UTC

Created attachment 1903249 [details]
DAD not effective for DHCP address

Comment 8 David Jaša 2022-08-03 10:58:12 UTC

Created attachment 1903250 [details]
DAD not effective for EUI64 address

Comment 10 David Jaša 2022-08-04 13:15:06 UTC

Setting as FailedQA. While less likely to happen, it still can happen and logs do not indicate DAD taking place for global addresses; and even when this happens and NM  marks the connection as active/connected.

Comment 14 David Jaša 2022-08-05 13:17:07 UTC

Created attachment 1903934 [details]
EUI64 address - 'full' connectivity with link-local address only

One more thing. In this case, system ended up in state with testX6 device (con_ipv6 connection) having only link-local v6 address, yet NM reports this device as having 'full' v6 connectivity. I couldn't get system to this state in a 'regular' way (not involving duplicate address in network).

Comment 15 David Jaša 2022-08-10 09:46:44 UTC

Created attachment 1904662 [details]
DAD not effective for EUI64 address

With fixed logging settings in the test system, I could only hit the failure in EUI scenario in ~5 % of runs.

Comment 16 David Jaša 2022-08-10 09:50:35 UTC

Created attachment 1904674 [details]
EUI64 address - 'full' connectivity with link-local address only

... in the other 95 % of cases of EUI64 scenarios, the system reliably ends up in state with 'full' ipv6 connectivity over link-local address.

Comment 17 David Jaša 2022-08-10 11:29:56 UTC

Created attachment 1904686 [details]
DAD not effective for DHCP address

Running it in a loop again, we're back to ratio of 5 failures for DHCP scenario per 100 runs as before even with proper level=TRACE logging. One of those runs is attached.

Comment 18 Beniamino Galvani 2022-08-19 13:13:16 UTC

Hi David,

I think there is a race condition in the test. The test activates
con_ipv6 and then checks that the DHCPv6 address is not configured on
the interface. But since IPv6 DAD is performed by kernel, the way it
works in NM is that NM first adds the address to the interface; the
address initially has flag "tentative".

When DAD completes successfully, kernel removes the "tentative" flag
and then NM considers the interface ready.

In case of collision, the kernel sets the "dadfailed" flag on the
address and immediately removes the address from the interface. So,
there is a small interval in which the address is configured (as
tentative) on the interface, and this causes the test failure.

I think we need 2 tests:

- one to check that DAD works. Basically, the one that you wrote, but
  it should wait some time before checking that the address is
  missing, to ensure that DAD detected the collision;

- another test without address collision. In that case, we need to
  ensure that NM waits DAD to complete before the connection becomes
  "activated". To do that, it's sufficient to check that the address
  is non tentative when "nmcli con up" returns.

What do you think?

Comment 19 David Jaša 2022-08-19 16:22:09 UTC

(In reply to Beniamino Galvani from comment #18)
> Hi David,
> 
> I think there is a race condition in the test. The test activates
> con_ipv6 and then checks that the DHCPv6 address is not configured on
> the interface. But since IPv6 DAD is performed by kernel, the way it
> works in NM is that NM first adds the address to the interface; the
> address initially has flag "tentative".
> 
> When DAD completes successfully, kernel removes the "tentative" flag
> and then NM considers the interface ready.
> 
> In case of collision, the kernel sets the "dadfailed" flag on the
> address and immediately removes the address from the interface. So,
> there is a small interval in which the address is configured (as
> tentative) on the interface, and this causes the test failure.
> 

Yeah. The problem as of now is multi-layered:
1) as you say test doesn't give time for DAD done this way to take place
2) NM gives no explicit indication of DAD result (kernel does*)
3) NM accepts lease in spite of DAD failure and given just one address (it can be offered e.g ::1234:5678/126 range by DHCP server, then in case of DAD failure of ::1234:5678, it can choose from ::1234:5678[9ab])

I'd like to hear Dan Williams's opinion on what is the correct or least harmful behaviour after address collision of single DHCPv6 address. The interface ends up in state that
- it has valid link-local address
- it has valid stateless global address (derived from RA), but
- it fails to get DHCPv6 address which I assume is being given to it so it gets known predictable addres that is to be used directly or to which is machine's DNS name pointed to?
So from POV of the machine itself, the network on this interface is fully working, however for the outside world, the machine is invisible/down. What should NM do or report in such a case? Full connectivity, limited connectivity? Keep configuration as-is, try configuring again right away or after some wait, report connection as failed?

* stuff like this in dmsg or journal:
[ 5910.336066] IPv6: dup: IPv6 duplicate address 2620:dead:beaf::1234:5678 used by b2:a0:29:95:cc:84 detected!


(In reply to Beniamino Galvani from comment #18)
> I think we need 2 tests:
> 
> - one to check that DAD works. Basically, the one that you wrote, but
>   it should wait some time before checking that the address is
>   missing, to ensure that DAD detected the collision;
> 
> - another test without address collision. In that case, we need to
>   ensure that NM waits DAD to complete before the connection becomes
>   "activated". To do that, it's sufficient to check that the address
>   is non tentative when "nmcli con up" returns.
> 
> What do you think?

What you said, plus:

- 3rd test verifying that in case of DHCPv6 address with prefix < 128, some other address is chosen (this works but should probably also be covered by the CI)
- NM clearly indicating DAD result in the logs at sensible level (as you said, warning in case of collision and probably info or debug if nothing comes)
- NM needs to act upon DAD collision:
* it must at very least send DHCP DECLINE for that address if I read RFC correctly: https://datatracker.ietf.org/doc/html/rfc8415#page-70
* it must do some sensible action next. I however can't see what it really is when single address was offered and stateless address is present on the interface, see what I ask Dan Williams above. When scope of addresses was offered, rinse & repeat with available addresses...

Comment 20 David Jaša 2022-08-19 16:23:08 UTC

Created attachment 1906581 [details]
report where behaviour described in previous comment is clearly visible

Comment 21 Beniamino Galvani 2022-09-05 08:02:19 UTC

Hi David,

I updated the code to send a DHCPv6 decline when all addresses in the lease fail DAD:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/1363

At the moment only the dhclient plugin supports sending a DECLINE. For the internal one we reuse systemd code and it doesn't support declining the lease.

In case of DAD failure we report a error in logs, and the transaction continues because it could possibly get another address.

Comment 26 Mat Kowalski 2022-09-28 13:43:52 UTC

This issue has been very initially observed by the Assisted Installer team in OCP 4.11; my understanding is that fixing it in RHEL 8.7 makes that fix available only in OCP 4.12. If this is correct, can we please track the backport?

Comment 29 Thomas Haller 2022-11-18 10:43:04 UTC

> Fixed In Version: NetworkManager-1.39.10-1.el8 → NetworkManager-1.40.2-1.el8
> Status: POST → MODIFIED

The follow up fix was https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/fabefd9bd541345fbbb3b6a258e78405c818f346, which is in 1.40.2. Update the fixed-in-version and move forward to errata.

Comment 30 Thomas Haller 2022-11-18 10:46:02 UTC

(In reply to Mat Kowalski from comment #26)
> This issue has been very initially observed by the Assisted Installer team
> in OCP 4.11; my understanding is that fixing it in RHEL 8.7 makes that fix
> available only in OCP 4.12. If this is correct, can we please track the
> backport?

It's "worse". The fix is slanted for the next rhel-release, which will be rhel-8.8.

If a Z-stream update for an earlier RHEL release is required (rhel-8.6.z or rhel-8.7.z), then the Z-stream process needs to be followed. Where exactly do you need this fixed? What exact version of NetworkManager are you currently using (in that environment)? Sorry, I am not familiar, which package versions are in which OCP versions.

Comment 31 Thomas Haller 2022-11-18 10:58:58 UTC

bug 2099794 seems to be a duplicate of this one.
And that bug already has Z-stream updates for rhel-8.6 and rhel-8.7 scheduled...


the only open question is, that this issue was partly fixed a while ago. that fix is present in 8.7 already and no further action is would be necessary.
However, this bug then FailedQA (for some test cases), this bug was reopened and will finally be fixed in 8.8.
The question is, whether 8.7 now needs a Z-stream update for the remaining case.

Comment 34 David Jaša 2022-11-22 03:24:35 UTC

(In reply to Thomas Haller from comment #31)
> bug 2099794 seems to be a duplicate of this one.

Setting as VERIFIED for NM 1.40.2-1.el8 as the same test as in Bug 2099794 comment 37.

Comment 35 Mat Kowalski 2022-11-28 19:13:08 UTC

>If a Z-stream update for an earlier RHEL release is required (rhel-8.6.z or rhel-8.7.z), then the Z-stream process needs to be followed. Where exactly do you need this fixed? What exact version of NetworkManager are you currently using (in that environment)? Sorry, I am not familiar, which package versions are in which OCP versions.

@thaller to get it propagated down to OCP 4.11 we need to have it in RHEL 8.6; this seems to be in line with what we initially observed, i.e. that the issue appeared for the first time in RHEL 8.6

Comment 36 Thomas Haller 2022-12-16 20:48:17 UTC

(In reply to Mat Kowalski from comment #35)
> >If a Z-stream update for an earlier RHEL release is required (rhel-8.6.z or rhel-8.7.z), then the Z-stream process needs to be followed. Where exactly do you need this fixed? What exact version of NetworkManager are you currently using (in that environment)? Sorry, I am not familiar, which package versions are in which OCP versions.
> 
> @thaller to get it propagated down to OCP 4.11 we need to have it
> in RHEL 8.6; this seems to be in line with what we initially observed, i.e.
> that the issue appeared for the first time in RHEL 8.6

OK, thanks. As said in comment 31, there is a z-stream update for rhel-8.6 and rhel-8.7 already in progress.
Please track those bugs, or comment if anything is missing. Thanks.

Comment 39 errata-xmlrpc 2023-05-16 09:04:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968