Description of problem:

While scale testing RHACM, with OpenShift on OpenStack, the OpenShift cluster (IPI-installed) became unreachable from outside of the cluster (the address https://api.vlan608.rdu2.scalelab.redhat.com:6443 was unreachable).

Ex:
$ oc get no
Unable to connect to the server: dial tcp 10.1.57.3:6443: i/o timeout

We found we could still reach anything over cluster ingress (console, Prometheus, Grafana, RHACM UI...), indicating the scale lab network was probably not the issue. We also discovered we could use our kubeconfig from master-0 while swapping the api address with localhost (see the sketch under Additional info below). We could then run oc commands and see that the cluster was still somewhat healthy.

When the cluster became unreachable externally, several pods on the cluster were crash looping:

# oc get po -n openshift-sdn; oc get po -n openshift-kube-scheduler; oc get po -n openshift-kube-controller-manager
NAME                                               READY   STATUS             RESTARTS   AGE
...
sdn-controller-962v7                               1/1     Running            89         23h
sdn-controller-thwk7                               1/1     Running            73         23h
sdn-controller-wzd57                               1/1     Running            71         23h
...
openshift-kube-scheduler-vlan608-489fb-master-0    2/2     Running            106        23h
openshift-kube-scheduler-vlan608-489fb-master-1    1/2     CrashLoopBackOff   99         23h
openshift-kube-scheduler-vlan608-489fb-master-2    1/2     CrashLoopBackOff   97         23h
...
kube-controller-manager-vlan608-489fb-master-0     3/4     CrashLoopBackOff   205        23h
kube-controller-manager-vlan608-489fb-master-1     4/4     Running            212        23h
kube-controller-manager-vlan608-489fb-master-2     4/4     Running            199        23h
...

We then looked at the OpenStack ports and the IP addresses more closely:

$ openstack port list -c "Name" -c "Fixed IP Addresses" -c "Status"
+------------------------------+-----------------------------------------------------------------------------+--------+
| Name                         | Fixed IP Addresses                                                          | Status |
+------------------------------+-----------------------------------------------------------------------------+--------+
|                              | ip_address='10.196.0.1', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | ACTIVE |
| vlan608-489fb-worker-0-xvtrh | ip_address='10.196.0.24', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-worker-0-qgws6 | ip_address='10.196.1.42', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-master-port-0  | ip_address='10.196.0.97', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
|                              | ip_address='10.1.57.4', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
|                              | ip_address='10.196.0.10', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | DOWN   |
| vlan608-489fb-ingress-port   | ip_address='10.196.0.7', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-master-port-1  | ip_address='10.196.0.210', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.9', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
| vlan608-489fb-master-port-2  | ip_address='10.196.2.26', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-api-port       | ip_address='10.196.0.5', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-worker-0-plnsb | ip_address='10.196.0.159', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.3', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
+------------------------------+-----------------------------------------------------------------------------+--------+

Looking for the API port IP address (10.196.0.5) on the master nodes:

[root@vlan608-489fb-master-2 ~]# ip a | grep "10.196.0.5"
    inet 10.196.0.5/32 scope global ens3
[root@vlan608-489fb-master-1 ~]# ip a | grep "10.196.0.5"
    inet 10.196.0.5/32 scope global ens3
[root@vlan608-489fb-master-0 tmp]# ip a | grep "10.196.0.5"
[root@vlan608-489fb-master-0 tmp]#

As you can see, master-0 does not have the API port IP address. We rebooted that node, the API began to work again, and the pods we had observed crash looping stopped crashing.

Version-Release number of selected component (if applicable):
OSP 16.1
OCP 4.6.7

How reproducible:
We produced it once in this scale test.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
Cluster does not become inaccessible after 12 hours.

Additional info:
Must-gather was run after the cluster was restored.

Lastly, I am unsure of what product/component to file this bug against.
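For reference, a rough sketch of the localhost workaround mentioned above (the kubeconfig path and sed expression here are illustrative, not copied from our session; depending on whether localhost is in the API server certificate SANs, TLS verification may also need to be skipped):

  # On master-0, with a copy of the admin kubeconfig at /tmp/kubeconfig (hypothetical path)
  $ sed -i 's#https://api.vlan608.rdu2.scalelab.redhat.com:6443#https://localhost:6443#' /tmp/kubeconfig
  $ oc --kubeconfig /tmp/kubeconfig get nodes
  $ oc --kubeconfig /tmp/kubeconfig get co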
Assigning this to the Installer team, since the issue seems to be with OpenStack components being mis-configured or degraded after install.
(In reply to egarcia from comment #2)
> Assigning this to the Installer team, since the issue seems to be with
> OpenStack components being mis-configured or degraded after install.

This is recurring in our environment; rebooting the master nodes and allowing some time for things to normalize will permit the API to work again. I believe this is an issue with the keepalived pods in the "openshift-openstack-infra" namespace.
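If it helps triage, this is a hedged sketch of what we plan to capture from the keepalived side the next time the VIP disappears (the pod name below is an assumption based on the usual static-pod naming, not taken from this cluster; it would be run against the localhost kubeconfig described in the original report):

  $ oc -n openshift-openstack-infra get pods -o wide | grep keepalived
  # Pod name below is hypothetical; substitute the names returned above
  $ oc -n openshift-openstack-infra logs keepalived-vlan608-489fb-master-0 --all-containers --tail=200
  # Check which master(s) currently hold the API VIP (10.196.0.5)
  $ for m in 0 1 2; do ssh core@vlan608-489fb-master-$m 'ip -4 addr show ens3 | grep 10.196.0.5'; done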
@Alex Krzos: Has this defect been seen or reproduced on other systems (i.e. more than once)? Also, in the environment where this was seen and the system recovered, was rebooting a requirement? In other words, will the system recover if left to normalize for a period of time *without* rebooting it? Do you happen to have an estimate of how long the system takes to "normalize" before the API services start working correctly again? Thanks.
Created attachment 1749824 [details]
Grafana network dashboard showing TCP retransmit rate out of all sent segments
Created attachment 1749865 [details]
master-0 haproxy
Closing as a duplicate, since we believe this is a symptom of https://bugzilla.redhat.com/show_bug.cgi?id=1915080

*** This bug has been marked as a duplicate of bug 1915080 ***