Description of problem:

While scale testing RHACM, with OpenShift on OpenStack, the OpenShift cluster (IPI-installed) became unreachable from outside of the cluster (the address https://api.vlan608.rdu2.scalelab.redhat.com:6443 was unreachable).

Ex:
$ oc get no
Unable to connect to the server: dial tcp 10.1.57.3:6443: i/o timeout

We found we could still reach anything over cluster ingress (console, Prometheus, Grafana, RHACM UI...), indicating the scale lab network was probably not the issue. We also discovered we could use our kubeconfig from master-0 while swapping the api address with localhost (see the sketch under Additional info below). We could then run oc commands and see that the cluster was still somewhat healthy.

When the cluster became unreachable externally, several pods on the cluster were crash looping:

# oc get po -n openshift-sdn; oc get po -n openshift-kube-scheduler; oc get po -n openshift-kube-controller-manager
NAME                                               READY   STATUS             RESTARTS   AGE
...
sdn-controller-962v7                               1/1     Running            89         23h
sdn-controller-thwk7                               1/1     Running            73         23h
sdn-controller-wzd57                               1/1     Running            71         23h
...
openshift-kube-scheduler-vlan608-489fb-master-0    2/2     Running            106        23h
openshift-kube-scheduler-vlan608-489fb-master-1    1/2     CrashLoopBackOff   99         23h
openshift-kube-scheduler-vlan608-489fb-master-2    1/2     CrashLoopBackOff   97         23h
...
kube-controller-manager-vlan608-489fb-master-0     3/4     CrashLoopBackOff   205        23h
kube-controller-manager-vlan608-489fb-master-1     4/4     Running            212        23h
kube-controller-manager-vlan608-489fb-master-2     4/4     Running            199        23h
...

We then looked at the OpenStack ports and the IP addresses more closely:

$ openstack port list -c "Name" -c "Fixed IP Addresses" -c "Status"
+------------------------------+-----------------------------------------------------------------------------+--------+
| Name                         | Fixed IP Addresses                                                          | Status |
+------------------------------+-----------------------------------------------------------------------------+--------+
|                              | ip_address='10.196.0.1', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | ACTIVE |
| vlan608-489fb-worker-0-xvtrh | ip_address='10.196.0.24', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-worker-0-qgws6 | ip_address='10.196.1.42', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-master-port-0  | ip_address='10.196.0.97', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
|                              | ip_address='10.1.57.4', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
|                              | ip_address='10.196.0.10', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | DOWN   |
| vlan608-489fb-ingress-port   | ip_address='10.196.0.7', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-master-port-1  | ip_address='10.196.0.210', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.9', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
| vlan608-489fb-master-port-2  | ip_address='10.196.2.26', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-api-port       | ip_address='10.196.0.5', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-worker-0-plnsb | ip_address='10.196.0.159', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.3', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
+------------------------------+-----------------------------------------------------------------------------+--------+

Looking for the API port IP address (10.196.0.5) on the master nodes:

[root@vlan608-489fb-master-2 ~]# ip a | grep "10.196.0.5"
    inet 10.196.0.5/32 scope global ens3
[root@vlan608-489fb-master-1 ~]# ip a | grep "10.196.0.5"
    inet 10.196.0.5/32 scope global ens3
[root@vlan608-489fb-master-0 tmp]# ip a | grep "10.196.0.5"
[root@vlan608-489fb-master-0 tmp]#

As you can see, master-0 does not have the API port IP address. We rebooted that node, the API began to work again, and the pods we had observed crash looping stopped crashing.

Version-Release number of selected component (if applicable):
OSP 16.1
OCP 4.6.7

How reproducible:
We produced it once in this scale test.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
Cluster does not become inaccessible after 12 hours.

Additional info:
Must-gather was run after the cluster was restored.

Lastly, I am unsure of what product/component to file this bug against.
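For reference, a rough sketch of the localhost workaround mentioned above (the kubeconfig path and sed expression here are illustrative, not copied from our session; depending on whether localhost is in the API server certificate SANs, TLS verification may also need to be skipped):

  # On master-0, with a copy of the admin kubeconfig at /tmp/kubeconfig (hypothetical path)
  $ sed -i 's#https://api.vlan608.rdu2.scalelab.redhat.com:6443#https://localhost:6443#' /tmp/kubeconfig
  $ oc --kubeconfig /tmp/kubeconfig get nodes
  $ oc --kubeconfig /tmp/kubeconfig get co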
Assigning this to the Installer team, since the issue seems to be with OpenStack components being mis-configured or degraded after install.
(In reply to egarcia from comment #2)
> Assigning this to the Installer team, since the issue seems to be with
> OpenStack components being mis-configured or degraded after install.

This is recurring in our environment; rebooting the master nodes and allowing some time for things to normalize will permit the API to work again. I believe this is an issue with the keepalived pods in the "openshift-openstack-infra" namespace.
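If it helps triage, this is a hedged sketch of what we plan to capture from the keepalived side the next time the VIP disappears (the pod name below is an assumption based on the usual static-pod naming, not taken from this cluster; it would be run against the localhost kubeconfig described in the original report):

  $ oc -n openshift-openstack-infra get pods -o wide | grep keepalived
  # Pod name below is hypothetical; substitute the names returned above
  $ oc -n openshift-openstack-infra logs keepalived-vlan608-489fb-master-0 --all-containers --tail=200
  # Check which master(s) currently hold the API VIP (10.196.0.5)
  $ for m in 0 1 2; do ssh core@vlan608-489fb-master-$m 'ip -4 addr show ens3 | grep 10.196.0.5'; done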
@Alex Krzos: Has this defect been seen or reproduced on other systems (i.e. more than once)? Also, in the environment where this was seen and the system recovered, was rebooting a requirement? In other words, will the system recover if left to normalize for a period of time *without* rebooting it? Do you happen to have an estimate of how long the system takes to "normalize" before the API services start working correctly again? Thanks.
Created attachment 1749824 [details]
Grafana network dashboard showing TCP retransmit rate out of all sent segments
Created attachment 1749865 [details]
master-0 haproxy
Closing as a duplicate, since we believe this is a symptom of https://bugzilla.redhat.com/show_bug.cgi?id=1915080

*** This bug has been marked as a duplicate of bug 1915080 ***