Bug 2046520 - OCP 4.9.17 installation with no default gateway on ocp nodes will fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.9
Hardware: s390x
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Silke Niemann
QA Contact: Douglas Slavens
Docs Contact: Latha S
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-26 21:02 UTC by Philip Chan
Modified: 2022-03-02 17:18 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-02 17:18:20 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather log (1022.19 KB, application/gzip)
2022-01-26 21:02 UTC, Philip Chan
bootstrap-0 journal log (1.70 MB, text/plain)
2022-01-26 21:03 UTC, Philip Chan
master-0 journal log (5.79 MB, text/plain)
2022-01-26 21:03 UTC, Philip Chan


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker MULTIARCH-2057 0 None None None 2022-01-26 21:08:44 UTC

Description Philip Chan 2022-01-26 21:02:48 UTC
Created attachment 1855639
must-gather log

Description of problem:
Please also refer to bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=2040933 as it is related.

We are performing an OCP 4.9.17 installation on KVM.  The network on the OCP nodes does not have a default gateway configured.  Static routes are added to each node so that it can reach the outside world, such as quay.io, to pull the images needed for installation.

When standing up the control plane, the master nodes fail to come online.  As shown in the journal log for one of the master nodes, the configure-ovs.sh script fails while looking for a default gateway route:

Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + '[' 12 -lt 12 ']'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + extra_bridge_file=/etc/ovnk/extra_bridge
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + '[' '' '!=' br-ex ']'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + '[' -f /etc/ovnk/extra_bridge ']'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + convert_to_bridge '' br-ex phys0
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + iface=
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + bridge_name=br-ex
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + port_name=phys0
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + ovs_port=ovs-port-br-ex
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + ovs_interface=ovs-if-br-ex
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + default_port_name=ovs-port-phys0
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + bridge_interface_name=ovs-if-phys0
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + '[' '' = br-ex ']'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + '[' -z '' ']'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + echo 'ERROR: Unable to find default gateway interface'
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: ERROR: Unable to find default gateway interface
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1298]: + exit 1
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Jan 26 20:05:23 master-0.pok-106.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Consumed 355ms CPU time

This is a single-NIC setup with no default gateway, which should be a valid network configuration.  However, configure-ovs.sh appears to require at least one default gateway route.
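
For illustration, the kind of check that fails here can be sketched in Python. This is not the actual configure-ovs.sh logic; it is a minimal sketch that assumes the interface is chosen from the first "default" entry in `ip route show` style output. The function name, interface name (enc1), and addresses are hypothetical.

```python
from typing import Optional


def default_route_interface(ip_route_output: str) -> Optional[str]:
    """Return the device of the first default route, or None if there is none."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        # A default route line looks like: "default via <gw> dev <iface> ..."
        if fields and fields[0] == "default" and "dev" in fields:
            idx = fields.index("dev")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


# Hypothetical routing tables, for illustration only.
with_gateway = """default via 192.0.2.1 dev enc1 proto static metric 100
192.0.2.0/24 dev enc1 proto kernel scope link src 192.0.2.10 metric 100"""

# Static routes only, as in the setup described in this bug.
static_only = """198.51.100.0/24 via 192.0.2.1 dev enc1 proto static metric 100
192.0.2.0/24 dev enc1 proto kernel scope link src 192.0.2.10 metric 100"""

print(default_route_interface(with_gateway))  # enc1
print(default_route_interface(static_only))   # None
```

With only static routes the lookup returns nothing, which matches the empty `iface=` value in the trace above just before the script exits with status 1.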

Version-Release number of selected component (if applicable):
OCP 4.9.17
RHCOS 4.9.0

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Perform OCP 4.9.17 installation
2. Start the bootstrap and master nodes with no default gateway defined.
3. The master nodes fail when running the configure-ovs.sh script, because the script selects the interface to use based on the default gateway.

Actual results:
The bootstrap and master (control plane) nodes boot.  The master nodes fail to come online and do not report any status.

Expected results:
The bootstrap, master (control plane), and worker (compute) nodes should all install the RHCOS build successfully and become Ready.

Additional info:
Attached journal logs from bootstrap-0 and master-0 nodes.  A must-gather log is also attached.
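
To locate the failure in a journal log such as the attached ones, a small scan like the following may help. It is a rough sketch: the two message prefixes are copied from the master-0 excerpt above, while the function name and the shortened hostname in the sample are hypothetical.

```python
from typing import List

# Message prefixes taken from the journal excerpt in this bug.
ERROR_MARKER = "ERROR: Unable to find default gateway interface"
UNIT_FAILED = "ovs-configuration.service: Failed with result"


def find_ovs_failures(journal_text: str) -> List[str]:
    """Return journal lines that report the configure-ovs.sh gateway failure."""
    hits = []
    for line in journal_text.splitlines():
        # Journal lines look like "<timestamp> <host> <unit>[<pid>]: <message>".
        if "]: " not in line:
            continue
        message = line.split("]: ", 1)[1]
        # Skip xtrace lines ("+ echo ..."), which only repeat the message.
        if message.startswith(ERROR_MARKER) or message.startswith(UNIT_FAILED):
            hits.append(line)
    return hits


# Sample lines adapted from the excerpt above (hostname shortened).
sample = "\n".join([
    "Jan 26 20:05:23 master-0 configure-ovs.sh[1298]: + echo 'ERROR: Unable to find default gateway interface'",
    "Jan 26 20:05:23 master-0 configure-ovs.sh[1298]: ERROR: Unable to find default gateway interface",
    "Jan 26 20:05:23 master-0 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.",
])

for hit in find_ovs_failures(sample):
    print(hit)
```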

Comment 1 Philip Chan 2022-01-26 21:03:25 UTC
Created attachment 1855647
bootstrap-0 journal log

Comment 2 Philip Chan 2022-01-26 21:03:54 UTC
Created attachment 1855648
master-0 journal log

Comment 4 Prashanth Sundararaman 2022-01-26 22:58:40 UTC
Do the docs specify that the default gateway can be omitted? Is this a valid configuration?

Comment 5 Philip Chan 2022-01-27 00:57:01 UTC
Hi Prashanth,

I have verified with our STSM that having no default gateway set at all is a valid network configuration.  This issue was originally found while performing an OCP 4.8.14 installation on zVM in his environment.  I was able to replicate it in ours on KVM.  There are two instances of "default gateway" mentioned within the OCP install documentation:
https://docs.openshift.com/container-platform/4.9/installing/installing_ibm_z/installing-ibm-z.html#installation-user-infra-machines-routing-bonding_installing-ibm-z
Currently, a default gateway is not mentioned as a requirement.

Comment 6 Dan Li 2022-01-27 18:57:55 UTC
Moving this bug to the Documentation component per the conversation at: https://coreos.slack.com/archives/C0138QKKYTU/p1643286051024000

Re-assigning to Silke for future documentation update. 

Hi Prashanth, Muhammad, and Phil, please feel free to include any information or requirements that Silke needs to update the documentation.

Comment 7 Silke Niemann 2022-02-16 15:15:44 UTC
Created a PR to update the docs https://github.com/openshift/openshift-docs/pull/41848/

Comment 8 Latha S 2022-02-17 04:31:58 UTC
Please check whether the fix is applicable to versions 4.6.z and later.
QA ack required

Comment 9 Dan Li 2022-02-17 13:04:46 UTC
Making comment 8 public, as Silke does not have access to view private comments.

Comment 10 Silke Niemann 2022-03-02 17:18:20 UTC
The docs PR is merged. The change applies to OCP 4.10 and later only.
