Bug 2096226 - crio fails to bind to tentative IP, causing service failure since RHCOS was rebased on RHEL 8.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Derek Higgins
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks: 2096386
Reported: 2022-06-13 10:28 UTC by Stephen Benjamin
Modified: 2022-08-10 11:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2096386
Environment:
Last Closed: 2022-08-10 11:17:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-runtimecfg pull 181 | 0 | None | open | Bug 2096226: Check chosen node-ip can be used | 2022-06-13 20:03:23 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:17:40 UTC

Description Stephen Benjamin 2022-06-13 10:28:29 UTC
In some cases, crio is failing to start on a baremetal IPv6 run, with a message like this:


  Jun 10 23:14:42 master-0.ostest.test.metalkube.org crio[2878]:
  time="2022-06-10 23:14:42.595976451Z" level=fatal msg="Failed to start
  streaming server: listen tcp [fd2e:6f44:5dd8:c956::14]:10010: bind: cannot
  assign requested address"

Shortly before, we see configure-ovs.sh moving the IP to br-ex:

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + 
  echo 'Brought up connection ovs-if-br-ex successfully'

The IP is then marked tentative, which Derek Higgins tested and confirmed makes crio unable to bind to it:

  Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     
  inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
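
A quick way to spot this state in captured logs is to scan the `ip addr` style output for IPv6 addresses that still carry the tentative flag. The following is only a minimal sketch: the function name `tentative_addresses` and the regular expression are illustrative and not part of any tool mentioned in this bug.

```python
import re
from typing import List

# Matches lines such as:
#   inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
# A journal prefix in front of "inet6" is tolerated because search() is used.
INET6_LINE = re.compile(
    r"\binet6\s+(?P<addr>[0-9a-fA-F:]+)/(?P<plen>\d+)\s+(?P<flags>.*)$"
)


def tentative_addresses(ip_addr_output: str) -> List[str]:
    """Return the IPv6 addresses flagged 'tentative' in `ip addr` style output."""
    found = []
    for line in ip_addr_output.splitlines():
        match = INET6_LINE.search(line)
        if match and "tentative" in match.group("flags").split():
            found.append(match.group("addr"))
    return found
```

Fed the two inet6 lines that configure-ovs.sh logged for br-ex, this would report only the DHCPv6 address and skip the link-local one.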

In addition, cri-o's service unit sets Restart=on-abnormal rather than Restart=on-failure (https://github.com/cri-o/cri-o/blob/main/contrib/systemd/crio.service#L25), so systemd won't restart it after this kind of failure:

  If set to on-abnormal, the service will be restarted when the process is 
  terminated by a signal (including on core dump, excluding the aforementioned 
  four signals), when an operation times out, or when the watchdog timeout is   
  triggered.
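
The difference between the two policies can be written down as a small lookup. This is a sketch of the restart table in systemd.service(5) restricted to the two settings discussed here; the cause labels and the `will_restart` helper are made up for illustration, and it assumes that crio's level=fatal path ends the process with a non-zero exit code rather than a signal.

```python
# Exit causes that trigger a restart, following the table in systemd.service(5).
# Only the two policies discussed in this bug are modelled; the cause labels
# are illustrative names, not systemd identifiers.
RESTART_CAUSES = {
    "on-failure": {"unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
}


def will_restart(policy: str, cause: str) -> bool:
    """Return True if systemd would restart the unit for this exit cause."""
    return cause in RESTART_CAUSES[policy]


# Assumption: the fatal bind error makes crio exit with a non-zero status.
cause = "unclean-exit-code"
for policy in ("on-abnormal", "on-failure"):
    print(f"Restart={policy}: restarted={will_restart(policy, cause)}")
```

Under that assumption, only Restart=on-failure would have brought crio back up after the failed bind.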


Significant parts of NetworkManager were rewritten for RHEL 8.6, and configure-ovs.sh was changed to compensate, so it's likely that timing changes are what make this error occur now.

Comment 1 Dan Williams 2022-06-13 16:54:13 UTC
It appears that NM isn't waiting for DHCPv6 addresses to finish DAD before indicating that the connection has activated when ipv6.may-fail=no:

Jun 10 23:14:38 master-0.ostest.test.metalkube.org configure-ovs.sh[1758]: + nmcli c add type ovs-interface slave-type ovs-port conn.interface br-ex master ovs-port-br-ex con-name ovs-if-br-ex 802-3-ethernet.mtu 1500 802-3-ethernet.cloned-mac-address 00:5f:27:59:1f:38 ipv4.route-metric 48 ipv6.route-metric 48 ipv6.may-fail no ipv6.addr-gen-mode eui64 connection.autoconnect no
<snip>
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8275] dhcp6 (br-ex): activation: beginning transaction (timeout in 45 seconds)
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8284] dhcp6 (br-ex): state changed new lease, address=fd2e:6f44:5dd8:c956::14
Jun 10 23:14:40 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902880.8923] policy: set 'ovs-if-br-ex' (br-ex) as default for IPv6 routing and DNS
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7901] device (br-ex): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7903] device (br-ex): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 10 23:14:41 master-0.ostest.test.metalkube.org NetworkManager[1398]: <info>  [1654902881.7906] device (br-ex): Activation: successful, device activated.
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]: 5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     link/ether 00:5f:27:59:1f:38 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     openvswitch numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fd2e:6f44:5dd8:c956::14/128 scope global tentative dynamic noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft 3599sec preferred_lft 3599sec
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:     inet6 fe80::25f:27ff:fe59:1f38/64 scope link noprefixroute
Jun 10 23:14:41 master-0.ostest.test.metalkube.org configure-ovs.sh[2751]:        valid_lft forever preferred_lft forever
<snip>
Jun 10 23:14:41 master-0.ostest.test.metalkube.org systemd[1]: Starting Wait for a non-localhost hostname...
Jun 10 23:14:42 master-0.ostest.test.metalkube.org systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
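
One way to guard against this race is to check that the chosen address can actually be bound before starting anything that listens on it, and to retry while DAD is still in progress. The sketch below is an assumption-laden illustration of that idea, not the code from the linked baremetal-runtimecfg pull request: `can_bind`, `wait_until_usable`, and the retry parameters are all hypothetical.

```python
import socket
import time
from typing import Callable


def can_bind(address: str) -> bool:
    """Return True if a TCP socket can be bound to the given IPv6 address."""
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind((address, 0))  # port 0 lets the kernel pick a free port
        return True
    except OSError:
        # Includes "cannot assign requested address", the error crio logged.
        return False


def wait_until_usable(
    address: str,
    probe: Callable[[str], bool] = can_bind,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the address up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(address):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A caller would run something like `wait_until_usable("fd2e:6f44:5dd8:c956::14")` before starting the service. The `probe` and `sleep` parameters exist only so the loop can be exercised without a real interface.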

Comment 5 errata-xmlrpc 2022-08-10 11:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

