Bug 1854402 - [release-4.5] CEO chooses 1st host interface as bootstrap ip rather than one that belongs to machine network CIDR
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1846093
Blocks:
 
Reported: 2020-07-07 12:26 UTC by Dan Mace
Modified: 2020-07-22 12:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1846093
Environment:
Last Closed: 2020-07-22 12:20:42 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 385 None closed Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations 2020-08-26 09:10:42 UTC
Red Hat Product Errata RHBA-2020:2956 None None None 2020-07-22 12:21:05 UTC

Description Dan Mace 2020-07-07 12:26:31 UTC
+++ This bug was initially created as a clone of Bug #1846093 +++

In the assisted installer flow, any node (with any number of interfaces) can be a bootstrap node.
There is no guarantee that the 1st interface will have the right connectivity to other nodes.

If the 1st interface is always chosen, we will run into cases where etcd is not in the machine network and will not be able to form a quorum.

--- Additional comment from Stephen Cuppett on 2020-06-11 12:13:23 UTC ---

This isn't a showstopper for 4.5.0 GA at this point. Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.

--- Additional comment from Sam Batschelet on 2020-06-20 13:12:45 UTC ---

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

--- Additional comment from Udi on 2020-06-26 15:17:05 UTC ---

I tested the fix on libvirt with VMs that were connected to 2 networks. At first, I was able to discover the hosts and start an installation - but the installation hung in the middle and never completed. The bootstrap node was stuck at step 0/7 and the other nodes at step 4/7. I got the same results when I tried setting the VIP on the other network.

I then tried to reverse the order of the NICs. This time - the nodes never even registered back in the service. I could see the following warnings repeating themselves in the journal log on one of the masters:

Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: WARNING: Unable to read board_asset_tag: open /sys/class/dmi/id/board_asset_tag: no such >
Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: WARNING: Unable to read board_serial: open /sys/class/dmi/id/board_serial: no such file o>
Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: WARNING: Unable to read board_vendor: open /sys/class/dmi/id/board_vendor: no such file o>
Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: WARNING: Unable to read board_version: open /sys/class/dmi/id/board_version: no such file>
Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: time="26-06-2020 15:03:17" level=warning msg="Could not find motherboard serial number" f>
Jun 26 15:03:17 master-1-1.****.redhat.com agent[1727]: time="26-06-2020 15:03:17" level=warning msg="Error registering host: Post http://assiste>

--- Additional comment from Udi on 2020-07-06 13:15:02 UTC ---

The latest fix got past the above error, but the cluster deployment still got stuck at a later stage. Need to test again on a clean environment.

Comment 3 ge liu 2020-07-20 07:08:34 UTC
Confirmed with David: he encountered and verified this issue on 4.6, and there is no issue on 4.5.

Comment 5 errata-xmlrpc 2020-07-22 12:20:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2956

