Description of problem:

From Customer -

I have OpenShift systems with multiple NICs. The OpenShift network is configured to run on a VLAN-tagged bond NIC (not the default gateway NIC). I also have an external HAProxy LB for the API connection to the 3 masters. While doing a failure test (taking down networking on one of the 3 masters), the API hangs completely even though the HAProxy LB IP still responds to ping.

What I have found is that, by default, /etc/sysconfig/atomic-openshift-master-api and /etc/sysconfig/atomic-openshift-master-controllers are both configured with a LISTEN address of 0.0.0.0. This doesn't seem to work well in a failure mode and causes OpenShift to hang. If I set the LISTEN address to the respective IP of each master, restart the master controllers and API services, and redo the network failure test, the API connection tests work great.

< OPTIONS=--loglevel=2 --listen=https://10.1.202.11:8443 --master=https://master01.os01.infra.gen.adm1.prod.bamtech.co:8443
---
> OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8443 --master=https://master01.os01.infra.gen.adm1.prod.bamtech.co:8443

[root@master01 sysconfig]# diff atomic-openshift-master-controllers atomic-openshift-master-controllers.mike
1c1
< OPTIONS=--loglevel=2 --listen=https://10.1.202.11:8444
---
> OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444

Is this a bug in how Ansible sets up these files? Or should there be specific entries in the Ansible inventory file when setting up the cluster to ensure the LISTEN address = openshift_ip of each master?

Version-Release number of selected component (if applicable):
OCP 3.6

How reproducible:
Customer Verified

Steps to Reproduce:
Information is in Description

Actual results:
OpenShift hangs during the failure test with a listen address of 0.0.0.0

Expected results:
OpenShift does not hang

Additional info:
N/A
@dcbw: Can you clarify the expected behavior please?
Need some more details.

1. haproxy config details
   - what is the load balancing policy (roundrobin? stick on srcIP?)
   - what are the endpoint IPs
   - what are the health checks performed for the endpoint IPs
2. How is the network brought down in the failure test?
3. What are the addresses of the other interfaces?

Clearly, the workaround in the interim is to use specific listen addresses.
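
For anyone applying that workaround by hand, here is a minimal sketch of the edit to the OPTIONS line in the two sysconfig files. The helper name pin_listen_address is hypothetical and is not part of openshift-ansible. It only assumes the line format shown in the diff in the description, with the wildcard written literally as 0.0.0.0.

```python
import re

# Matches the --listen flag as it appears in the sysconfig OPTIONS line,
# e.g. --listen=https://0.0.0.0:8443 (format taken from the diff in the report).
LISTEN_RE = re.compile(r"(--listen=https?://)0\.0\.0\.0(:\d+)")


def pin_listen_address(options_line: str, master_ip: str) -> str:
    """Replace a wildcard --listen address with master_ip.

    Hypothetical helper: the scheme and port are kept as-is, and a line
    that already names a specific address is returned unchanged.
    """
    # A function replacement avoids backreference ambiguity when the
    # IP itself starts with a digit.
    return LISTEN_RE.sub(
        lambda m: m.group(1) + master_ip + m.group(2), options_line
    )


if __name__ == "__main__":
    line = "OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444"
    print(pin_listen_address(line, "10.1.202.11"))
```

As in the customer's test, the master API and controllers services need a restart afterwards for the new listen address to take effect.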
OCP 3.6 has reached the end of full support [1]. Closing this BZ as WONTFIX. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen. [1] - https://access.redhat.com/support/policy/updates/openshift