Bug 2096226

Summary: crio fails to bind to tentative IP, causing service failure since RHCOS was rebased on RHEL 8.6
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: Networking Assignee: Derek Higgins <derekh>
Networking sub component: runtime-cfg QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dcbw, derekh
Version: 4.11   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2096386 (view as bug list) Environment:
Last Closed: 2022-08-10 11:17:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2096386    

Description Stephen Benjamin 2022-06-13 10:28:29 UTC
In some cases, crio is failing to start on a baremetal IPv6 run, with a message like this:


  Jun 10 23:14:42 master-0.ostest.test.metalkube.org crio[2878]:
  time="2022-06-10 23:14:42.595976451Z" level=fatal msg="Failed to start
  streaming server: listen tcp [fd2e:6f44:5dd8:c956::14]:10010: bind: cannot
  assign requested address"

Shortly before, we see configure-ovs.sh moving the IP to br-ex:

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + 
  echo 'Brought up connection ovs-if-br-ex successfully'

And the IP is marked tentative. Derek Higgins tested this and confirmed that crio fails to bind to a tentative address.

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     
  inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
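
For anyone triaging similar logs, the tentative flag can be picked out of `ip addr` style output mechanically. This is a minimal Python sketch, not part of any shipped fix. The function name `tentative_addresses` is hypothetical, and the regex assumes the iproute2 text layout seen in the log lines above.

```python
import re

# Matches iproute2 lines such as:
#   inet6 fd2e:...::14/128 scope global tentative dynamic noprefixroute
# It uses search(), so a journald prefix in front of the line is tolerated.
INET6_LINE = re.compile(
    r"\binet6\s+(?P<addr>[0-9a-fA-F:]+)/(?P<prefix>\d+)"
    r"\s+scope\s+(?P<scope>\S+)(?P<flags>.*)$"
)


def tentative_addresses(ip_addr_output):
    """Return the IPv6 addresses flagged 'tentative' in `ip addr` text."""
    found = []
    for line in ip_addr_output.splitlines():
        match = INET6_LINE.search(line)
        if match and "tentative" in match.group("flags").split():
            found.append(match.group("addr"))
    return found


# Sample taken from the configure-ovs.sh output in this report.
SAMPLE = """\
    inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
       valid_lft 3599sec preferred_lft 3599sec
    inet6 fe80::25f:27ff:fe59:1f38/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
"""

print(tentative_addresses(SAMPLE))  # ['fd2e:6f44:5dd8:c956::14']
```

Only the DHCPv6-assigned global address is reported. The link-local address has already completed DAD and carries no tentative flag.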

cri-o's service unit is also set to Restart=on-abnormal rather than on-failure (https://github.com/cri-o/cri-o/blob/main/contrib/systemd/crio.service#L25), which means systemd won't restart it after this kind of failure. From the systemd documentation:

  If set to on-abnormal, the service will be restarted when the process is 
  terminated by a signal (including on core dump, excluding the aforementioned 
  four signals), when an operation times out, or when the watchdog timeout is   
  triggered.
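
To make the difference between the two settings concrete, here is a small Python model of the restart table in systemd.service(5). It is an illustration only: the names `RESTART_TABLE` and `will_restart` are hypothetical, and it assumes that the `level=fatal` line above corresponds to crio exiting with a non-zero exit code rather than being killed by a signal.

```python
# Exit causes that trigger a restart for each Restart= value, following the
# table in systemd.service(5). "clean" covers a clean exit code or signal.
RESTART_TABLE = {
    "no": set(),
    "always": {"clean", "unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-success": {"clean"},
    "on-failure": {"unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
}


def will_restart(restart_setting, exit_cause):
    """Return True if systemd would restart the service for this exit cause."""
    return exit_cause in RESTART_TABLE[restart_setting]


# Assumption: the fatal bind error ends the process with a non-zero exit code.
print(will_restart("on-abnormal", "unclean-exit-code"))  # False
print(will_restart("on-failure", "unclean-exit-code"))   # True
```

Under that assumption, on-abnormal leaves crio down after the bind error, while on-failure would have retried it.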


NetworkManager in RHEL 8.6 has had significant parts rewritten, and configure-ovs.sh was changed to compensate, so it's likely that timing changes are what make this error occur.

Comment 1 Dan Williams 2022-06-13 16:54:13 UTC
It appears that NM isn't waiting for DHCPv6 addresses to finish DAD before indicating that the connection has activated when ipv6.may-fail=no:

Jun 10 23:14:38 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + nmcli c add type ovs-interface slave-type ovs-port conn.interface br-ex master ovs-port-br-ex con-name ovs-if-br-ex 802-3-ethernet.mtu 1500 802-3-ethernet.cloned-mac-address 00:5f:27:59:1f:38 ipv4.route-metric 48 ipv6.route-metric 48 ipv6.may-fail no ipv6.addr-gen-mode eui64 connection.autoconnect no
<snip>
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8275] dhcp6 (br-ex): activation: beginning transaction (timeout in 45 seconds)
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8284] dhcp6 (br-ex): state changed new lease, address=fd2e:6f44:5dd8:c956::14
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8923] policy: set 'ovs-if-br-ex' (br-ex) as default for IPv6 routing and DNS
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7901] device (br-ex): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7903] device (br-ex): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7906] device (br-ex): Activation: successful, device activated.
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]: 5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     link/ether 00:5f:27:59:1f:38 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     openvswitch numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft 3599sec preferred_lft 3599sec
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fe80::25f:27ff:fe59:1f38/64 scope link noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft forever preferred_lft forever
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org systemd[1]: Starting Wait for a non-localhost hostname...
Jun 10 23:14:42 master-0.ostest.test.metalkube.org systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
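
The timeline above shows activation reported at 23:14:41 while the address is still tentative, and crio starting at 23:14:42. One possible mitigation is to poll until DAD completes before starting anything that binds to the address. The Python sketch below illustrates that idea and is not the fix delivered in the erratum. `wait_for_dad` and `read_flags` are hypothetical names, and the caller is assumed to supply the address flags (for example, parsed from `ip addr`).

```python
import time


def wait_for_dad(read_flags, timeout=10.0, interval=0.2,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until the address is no longer tentative.

    read_flags is a caller-supplied callable returning the current set of
    flags for the address of interest. Returns True once 'tentative' is
    gone, and False on timeout or if the kernel reports 'dadfailed'.
    """
    deadline = clock() + timeout
    while True:
        flags = read_flags()
        if "dadfailed" in flags:
            return False  # duplicate detected; waiting will not help
        if "tentative" not in flags:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated flag sequence: tentative twice, then DAD completes.
states = iter([{"tentative", "dynamic"}, {"tentative", "dynamic"}, {"dynamic"}])
print(wait_for_dad(lambda: next(states), sleep=lambda _: None))  # True
```

The clock and sleep functions are injectable so that the loop can be exercised without real delays.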

Comment 5 errata-xmlrpc 2022-08-10 11:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069