Bug 1551088

Summary: Listen address of 0.0.0.0 causes OpenShift to hang during failure mode
Product: OpenShift Container Platform
Reporter: Greg Rodriguez II <grodrigu>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, bbennett, grodrigu, hongli, pasik, scuppett
Version: 3.6.0
Keywords: NeedsTestCase
Target Milestone: ---
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-20 14:53:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Greg Rodriguez II 2018-03-02 17:18:24 UTC
Description of problem:
From Customer - 
I have OpenShift systems with multiple NICs.  The OpenShift network is configured to run on a bonded, VLAN-tagged NIC (not the default-gateway NIC).

I also have an external HAProxy load balancer for the API connection to the 3 masters.

During a failure test (taking down networking on one of the 3 masters), the API hangs completely even though the HAProxy LB IP still responds to ping.

What I have found is that, by default, /etc/sysconfig/atomic-openshift-master-api and /etc/sysconfig/atomic-openshift-master-controllers are both configured with a LISTEN address of 0.0.0.0.  This doesn't seem to work well during a failure and causes OpenShift to hang.

If I set the LISTEN address to the respective IP of each master, restart the master controllers and API services, and redo the network failure test, the API connection tests work great.

< OPTIONS=--loglevel=2 --listen=https://10.1.202.11:8443 --master=https://master01.os01.infra.gen.adm1.prod.bamtech.co:8443
---
> OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8443 --master=https://master01.os01.infra.gen.adm1.prod.bamtech.co:8443
[root@master01 sysconfig]# diff atomic-openshift-master-controllers atomic-openshift-master-controllers.mike
1c1
< OPTIONS=--loglevel=2 --listen=https://10.1.202.11:8444
---
> OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444
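
The change shown in the diffs above amounts to replacing the wildcard address in the --listen flag with the master's own IP while leaving the port and the other flags alone. A minimal sketch of that substitution follows; the function name pin_listen_address is hypothetical and the script is an illustration of the manual edit, not part of OpenShift or its installer.

```python
import re

# Matches "--listen=https://0.0.0.0:<port>" inside an OPTIONS line.
_WILDCARD_LISTEN = re.compile(r"(--listen=https://)0\.0\.0\.0(:\d+)")


def pin_listen_address(options_line: str, master_ip: str) -> str:
    """Replace a wildcard --listen address with master_ip, keeping the port.

    Lines that already listen on a specific address are returned unchanged.
    (Hypothetical helper; mirrors the manual edit described in this report.)
    """
    return _WILDCARD_LISTEN.sub(
        lambda m: m.group(1) + master_ip + m.group(2), options_line
    )


if __name__ == "__main__":
    before = "OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444"
    print(pin_listen_address(before, "10.1.202.11"))
```

Each master would need its own IP passed in, and the master API and controllers services would still have to be restarted after the files are changed, as described above.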

Is this a bug in how Ansible sets up these files?  Or should there be specific entries in the Ansible inventory file when setting up the cluster to ensure that the LISTEN address equals the openshift_ip of each master?

Version-Release number of selected component (if applicable):
OCP 3.6

How reproducible:
Customer Verified

Steps to Reproduce:
Information is in Description

Actual results:
OpenShift hangs during failure with listen address of 0.0.0.0

Expected results:
OpenShift should not hang when one master's network fails.

Additional info:
N/A

Comment 1 Ben Bennett 2018-03-02 17:56:02 UTC
@dcbw: Can you clarify the expected behavior please?

Comment 2 Rajat Chopra 2018-05-03 18:17:21 UTC
We need some more details:
1. HAProxy config details
   - What is the load balancing policy (roundrobin? stick on srcIP?)
   - What are the endpoint IPs?
   - What health checks are performed for the endpoint IPs?
2. How is the network brought down in the failure test?
3. What are the addresses of the other interfaces?

Clearly, the workaround in the interim is to use specific listen addresses.
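
On the health-check question: a load balancer IP that still answers ping says nothing about whether each master's API port is reachable. As a rough illustration only (this is not HAProxy's own check logic), a TCP-level probe of each master endpoint could look like the sketch below. The function names and the backend list are assumptions; 10.1.202.11 and port 8443 are taken from the description, and the other masters' addresses are not given in this report.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def probe_backends(backends, timeout: float = 2.0):
    """Map each (host, port) backend to its TCP reachability."""
    return {
        (host, port): is_reachable(host, port, timeout)
        for host, port in backends
    }


if __name__ == "__main__":
    # Hypothetical backend list; extend with the other masters' addresses.
    backends = [("10.1.202.11", 8443)]
    for (host, port), up in probe_backends(backends).items():
        print(f"{host}:{port} {'up' if up else 'DOWN'}")
```

Running this from the load balancer host before and after taking a master's network down would show whether the failed endpoint is detectable at the TCP level, which helps separate a health-check gap from a problem on the surviving masters.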

Comment 5 Stephen Cuppett 2019-11-20 14:53:36 UTC
OCP 3.6 has reached the end of full support [1]. Closing this BZ as WONTFIX. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen.

[1] - https://access.redhat.com/support/policy/updates/openshift