Bug 1906194 - OpenShift cluster on OpenStack lost api ingress after ~12 hours
Summary: OpenShift cluster on OpenStack lost api ingress after ~12 hours
Keywords:
Status: CLOSED DUPLICATE of bug 1915080
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Adolfo Duarte
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-09 21:30 UTC by Alex Krzos
Modified: 2021-02-17 17:01 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-17 17:01:39 UTC
Target Upstream Version:
Embargoed:


Attachments
Grafana network dashboard showing TCP retransmission rate out of all sent segments (199.28 KB, image/png)
2021-01-22 15:46 UTC, Alex Krzos
master-0 haproxy (1.51 MB, application/gzip)
2021-01-22 18:06 UTC, Alex Krzos

Description Alex Krzos 2020-12-09 21:30:20 UTC
Description of problem:
While scale testing RHACM with OpenShift on OpenStack, the OpenShift cluster (IPI installed) became unreachable from outside the cluster (the address https://api.vlan608.rdu2.scalelab.redhat.com:6443 was unreachable).

Ex: 
$ oc get no
Unable to connect to the server: dial tcp 10.1.57.3:6443: i/o timeout

We found we could still reach everything exposed through cluster ingress (console, Prometheus, Grafana, RHACM UI, ...), indicating the scale lab network was probably not the issue. We also discovered we could use our kubeconfig from master-0 after swapping the API address for localhost; we could then run oc commands and see that the cluster was still mostly healthy.
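
Roughly, that workaround looked like the following (the /tmp/kubeconfig path is an assumption for illustration; we simply used the installer kubeconfig copied onto master-0):

$ sed -i 's|https://api.vlan608.rdu2.scalelab.redhat.com:6443|https://localhost:6443|' /tmp/kubeconfig
$ oc --kubeconfig /tmp/kubeconfig get nodes
# If the localhost endpoint's serving certificate is not trusted by this kubeconfig,
# adding --insecure-skip-tls-verify=true to the oc command may be needed.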

When the cluster became unreachable externally - several pods on the cluster were crash looping:
# oc get po -n openshift-sdn; oc get po -n openshift-kube-scheduler; oc get po -n openshift-kube-controller-manager
NAME                   READY   STATUS    RESTARTS   AGE
...
sdn-controller-962v7   1/1     Running   89         23h
sdn-controller-thwk7   1/1     Running   73         23h
sdn-controller-wzd57   1/1     Running   71         23h
...
openshift-kube-scheduler-vlan608-489fb-master-0   2/2     Running            106        23h
openshift-kube-scheduler-vlan608-489fb-master-1   1/2     CrashLoopBackOff   99         23h
openshift-kube-scheduler-vlan608-489fb-master-2   1/2     CrashLoopBackOff   97         23h
...
kube-controller-manager-vlan608-489fb-master-0   3/4     CrashLoopBackOff   205        23h
kube-controller-manager-vlan608-489fb-master-1   4/4     Running            212        23h
kube-controller-manager-vlan608-489fb-master-2   4/4     Running            199        23h
...


We then looked at the openstack ports and the ip addresses more closely:
$ openstack port list -c "Name" -c "Fixed IP Addresses" -c "Status"
+------------------------------+-----------------------------------------------------------------------------+--------+
| Name                         | Fixed IP Addresses                                                          | Status |
+------------------------------+-----------------------------------------------------------------------------+--------+
|                              | ip_address='10.196.0.1', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | ACTIVE |
| vlan608-489fb-worker-0-xvtrh | ip_address='10.196.0.24', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-worker-0-qgws6 | ip_address='10.196.1.42', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-master-port-0  | ip_address='10.196.0.97', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
|                              | ip_address='10.1.57.4', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
|                              | ip_address='10.196.0.10', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | DOWN   |
| vlan608-489fb-ingress-port   | ip_address='10.196.0.7', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-master-port-1  | ip_address='10.196.0.210', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.9', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
| vlan608-489fb-master-port-2  | ip_address='10.196.2.26', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'  | ACTIVE |
| vlan608-489fb-api-port       | ip_address='10.196.0.5', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04'   | DOWN   |
| vlan608-489fb-worker-0-plnsb | ip_address='10.196.0.159', subnet_id='f133cc85-641d-43c7-8d9a-553ab55e9c04' | ACTIVE |
|                              | ip_address='10.1.57.3', subnet_id='732130c5-793b-424d-a0cf-7966781110e7'    | N/A    |
+------------------------------+-----------------------------------------------------------------------------+--------+
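
If I understand the IPI setup correctly, the api and ingress ports are unbound "reservation" ports (hence DOWN), and the VIPs are actually carried on the master ports via allowed address pairs. A quick way to double-check that (a suggested command, not something we ran at the time):

$ openstack port show vlan608-489fb-master-port-0 -c allowed_address_pairs -f value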

Looking at the api port ip address (10.196.0.5) on the master nodes:

[root@vlan608-489fb-master-2 ~]# ip a | grep "10.196.0.5"                                                                                                                                                    
    inet 10.196.0.5/32 scope global ens3       

[root@vlan608-489fb-master-1 ~]# ip a | grep "10.196.0.5"                                                                                                                                                    
    inet 10.196.0.5/32 scope global ens3       

[root@vlan608-489fb-master-0 tmp]# ip a | grep "10.196.0.5"                                                                                                                                                  
[root@vlan608-489fb-master-0 tmp]#

As you can see, master-0 does not have the API port IP address. We rebooted the node, the API began to work again, and the pods we had seen crash looping stopped crashing.
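
For the record, the reboot and the follow-up check would look roughly like this (the server name is assumed to match the hostname above; how the node is rebooted should not matter):

$ openstack server reboot vlan608-489fb-master-0
$ ssh core@vlan608-489fb-master-0 'ip -o addr show dev ens3 | grep 10.196.0.5'   # VIP reappears after the reboot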


Version-Release number of selected component (if applicable):
osp 16.1
ocp - 4.6.7

How reproducible:
We have hit it once in this scale test.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
The cluster should not become inaccessible after ~12 hours.

Additional info:
A must-gather was run after the cluster was restored.

Lastly, I am unsure which product/component to file this bug against.

Comment 2 egarcia 2020-12-10 19:30:40 UTC
Assigning this to the Installer team, since the issue seems to be with OpenStack components being mis-configured or degraded after install.

Comment 3 Alex Krzos 2020-12-11 15:25:48 UTC
(In reply to egarcia from comment #2)
> Assigning this to the Installer team, since the issue seems to be with
> OpenStack components being mis-configured or degraded after install.

This is recurring in our environment; rebooting the master nodes and allowing some time for things to normalize lets the API work again. I believe this is an issue with the keepalived pods in the "openshift-openstack-infra" namespace.
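
A couple of commands that may help pin this on keepalived the next time it happens (the pod and container names are my assumption of how the static pods are named in that namespace):

$ oc -n openshift-openstack-infra get pods -o wide | grep keepalived
$ oc -n openshift-openstack-infra logs keepalived-vlan608-489fb-master-0 -c keepalived | grep -iE 'state|vip'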

Comment 5 Adolfo Duarte 2021-01-06 19:04:39 UTC
@Alex Krzos: Has this defect been seen or reproduced on other systems (more than once)?
Also, in the environment where this was seen and the system recovered, was rebooting the system a requirement? In other words, will the system recover if left to normalize for a period of time *without* rebooting it?
Do you happen to have an estimate of how long the system takes to "normalize" before the API services start working correctly again?
Thanks

Comment 12 Alex Krzos 2021-01-22 15:46:56 UTC
Created attachment 1749824 [details]
Grafana network dashboard showing TCP retransmission rate out of all sent segments

Comment 13 Alex Krzos 2021-01-22 18:06:11 UTC
Created attachment 1749865 [details]
master-0 haproxy

Comment 14 egarcia 2021-02-17 17:01:39 UTC
Closing as a duplicate, since we believe this is a symptom of https://bugzilla.redhat.com/show_bug.cgi?id=1915080

*** This bug has been marked as a duplicate of bug 1915080 ***

