Bug 1857723
| Summary: | Many infra pods restarting periodically on master node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | wlewis |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-21 06:47:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
> Launch a fresh cluster (UPI install on Openstack). Let it run for some time.

Can you clarify the hardware profile for this UPI cluster? I am seeing on your master node kewang1561-hmfnf-master-0:

> node.kubernetes.io/instance-type: m1.xlarge

If this is correct and the backing infra is AWS, then we should look at underpowered hardware as a major contributing factor. In IPI we use m5.2xlarge as the default[1]; as you can see, m1 is not even a valid option. Let's start here.

[1] https://github.com/openshift/installer/blob/e1a76aba96b794ea152cfc46059fd98dbc788ca3/upi/aws/cloudformation/05_cluster_master_nodes.yaml#L52

The problem was found on a UPI install on OpenStack, not AWS. I checked the installation template: the master node vm_type is 'm1.xlarge', which provides 8 vCPUs, 16384 MB RAM, and a 160 GB root disk.

> Checked the etcd servers, found many similar errors 'embed: rejected connection from "<server ip>:<port>" (error "EOF", ServerName "")'

This problem seems to have been solved in bug 1855284; I will verify it to confirm. I am not sure whether the problem 'Infra pods restart periodically on master node' has something to do with bug 1855284.

Judging by when this bug and bug 1855284 were opened, they could be the same problem. Ke, could you double-check whether this issue is solved by bug 1855284? Thanks.

I checked the etcd logs from must-gather: the etcd member logs show the rejected-connection error about 4 times per second, like the description in bug 1855284, so I can confirm they are the same issue.
```
2020-07-16T08:09:54.234698992Z 2020-07-16 08:09:54.234602 I | embed: rejected connection from "192.168.3.130:51058" (error "EOF", ServerName "")
2020-07-16T08:09:54.409955945Z 2020-07-16 08:09:54.409871 I | embed: rejected connection from "192.168.0.130:41920" (error "EOF", ServerName "")
2020-07-16T08:09:54.933405182Z 2020-07-16 08:09:54.933266 I | embed: rejected connection from "[::1]:55238" (error "EOF", ServerName "")
2020-07-16T08:09:54.933405182Z 2020-07-16 08:09:54.933340 I | embed: rejected connection from "192.168.2.234:52822" (error "EOF", ServerName "")
2020-07-16T08:09:55.234834196Z 2020-07-16 08:09:55.234773 I | embed: rejected connection from "192.168.3.130:51098" (error "EOF", ServerName "")
2020-07-16T08:09:55.410148927Z 2020-07-16 08:09:55.409994 I | embed: rejected connection from "192.168.0.130:41958" (error "EOF", ServerName "")
2020-07-16T08:09:55.931878451Z 2020-07-16 08:09:55.931625 I | embed: rejected connection from "192.168.2.234:52862" (error "EOF", ServerName "")
2020-07-16T08:09:55.932613547Z 2020-07-16 08:09:55.932381 I | embed: rejected connection from "[::1]:55286" (error "EOF", ServerName "")
2020-07-16T08:09:56.235472471Z 2020-07-16 08:09:56.235342 I | embed: rejected connection from "192.168.3.130:51134" (error "EOF", ServerName "")
2020-07-16T08:09:56.410273017Z 2020-07-16 08:09:56.410160 I | embed: rejected connection from "192.168.0.130:42006" (error "EOF", ServerName "")
2020-07-16T08:09:56.931671873Z 2020-07-16 08:09:56.931547 I | embed: rejected connection from "192.168.2.234:52894" (error "EOF", ServerName "")
2020-07-16T08:09:56.933056282Z 2020-07-16 08:09:56.932979 I | embed: rejected connection from "[::1]:55328" (error "EOF", ServerName "")
...
```

*** This bug has been marked as a duplicate of bug 1855284 ***
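For reference, the roughly 4-per-second rate cited above can be checked mechanically from the log timestamps. A minimal sketch (the helper name is illustrative, and the sample input abbreviates the real log lines with `...`):

```shell
# Count 'rejected connection' entries per wall-clock second: the first
# field of each container log line is an RFC3339 timestamp, so truncating
# it to 19 characters (YYYY-MM-DDTHH:MM:SS) buckets entries by second.
rejections_per_second() {
  grep 'embed: rejected connection from' \
    | awk '{ sec = substr($1, 1, 19); n[sec]++ } END { for (s in n) print s, n[s] }' \
    | sort
}

# Illustrative usage with abbreviated copies of two lines quoted above;
# against the real log, one would instead pipe in something like:
#   oc logs -n openshift-etcd <etcd-pod> -c etcd | rejections_per_second
printf '%s\n' \
  '2020-07-16T08:09:54.234698992Z ... embed: rejected connection from "192.168.3.130:51058" (error "EOF", ServerName "")' \
  '2020-07-16T08:09:54.409955945Z ... embed: rejected connection from "192.168.0.130:41920" (error "EOF", ServerName "")' \
  | rejections_per_second
# -> 2020-07-16T08:09:54 2
```

Buckets whose count hovers around 4 would confirm the rate described in bug 1855284.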
Description of problem:

Some infra pods restart periodically on the master node after the cluster has been running for a while. More details in the must-gather below.

Version-Release number of selected component (if applicable):

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-15-031221   True        False         22h     Cluster version is 4.6.0-0.nightly-2020-07-15-031221
```

How reproducible:

Always

Steps to Reproduce:

1. Launch a fresh cluster (UPI install on Openstack).
2. Let it run for some time.

Actual results:

Some infra pods restart periodically, as below.

```
$ oc get po -A | awk '$5 > 5 {print}'
NAMESPACE                                          NAME                                                      READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-6d4bff8c4b-vhxs4             1/1     Running   19         22h
openshift-authentication-operator                  authentication-operator-89c95d686-lcqkv                   1/1     Running   16         22h
openshift-cloud-credential-operator                cloud-credential-operator-76cfb6866-8sbzj                 2/2     Running   58         22h
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-6b4cb9f889-dljgz         1/1     Running   15         22h
openshift-config-operator                          openshift-config-operator-6d49cdc8b4-nvhfg                1/1     Running   39         22h
openshift-console-operator                         console-operator-65877477c4-szzfs                         1/1     Running   30         22h
openshift-controller-manager-operator              openshift-controller-manager-operator-db9cf6554-lp6dk     1/1     Running   18         22h
openshift-controller-manager                       controller-manager-7dlm2                                  1/1     Running   10         3h32m
openshift-etcd-operator                            etcd-operator-5fb5d5dc9d-lnt7c                            1/1     Running   19         22h
openshift-image-registry                           cluster-image-registry-operator-5ccc7d9f47-scrtd          1/1     Running   15         22h
openshift-kube-apiserver-operator                  kube-apiserver-operator-88f49b4d8-dbszf                   1/1     Running   18         22h
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-0                  5/5     Running   12         3h26m
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-1                  5/5     Running   14         3h32m
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-2                  5/5     Running   12         3h31m
openshift-kube-controller-manager-operator         kube-controller-manager-operator-bd4f4ccf-8cmzn           1/1     Running   16         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-0         4/4     Running   56         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-1         4/4     Running   47         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-2         4/4     Running   51         22h
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-5d866459dd-vjbd7        1/1     Running   18         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-0        2/2     Running   14         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-1        2/2     Running   13         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-2        2/2     Running   18         22h
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-5f4f6c7d55-xnkn5   1/1     Running   18         22h
openshift-machine-api                              machine-api-controllers-7ddfb44fb5-xxvnq                  7/7     Running   206        22h
openshift-machine-api                              machine-api-operator-8567f9f77d-ktl5r                     2/2     Running   9          22h
openshift-machine-config-operator                  machine-config-controller-6b666f7bd9-jtwxc                1/1     Running   11         22h
openshift-machine-config-operator                  machine-config-operator-7778567df-zkg6d                   1/1     Running   10         22h
openshift-sdn                                      sdn-controller-8mhs2                                      1/1     Running   11         22h
openshift-sdn                                      sdn-controller-xt4q2                                      1/1     Running   11         22h
openshift-service-ca-operator                      service-ca-operator-849df96d8c-2hh4n                      1/1     Running   18         22h
openshift-service-ca                               service-ca-774c76978c-b8wmj                               1/1     Running   18         22h
```

```
$ oc describe pod -n openshift-kube-controller-manager kube-controller-manager-kewang1561-hmfnf-master-0
...
```
```
Type     Reason     Age                   From                                 Message
----     ------     ----                  ----                                 -------
Warning  BackOff    10h (x3 over 11h)     kubelet, kewang1561-hmfnf-master-0   Back-off restarting failed container
Normal   Killing    9h (x3 over 11h)      kubelet, kewang1561-hmfnf-master-0   Container kube-controller-manager failed liveness probe, will be restarted
Warning  Unhealthy  9h (x10 over 11h)     kubelet, kewang1561-hmfnf-master-0   Liveness probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Warning  Unhealthy  9h (x15 over 11h)     kubelet, kewang1561-hmfnf-master-0   Readiness probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Normal   Killing    5h47m                 kubelet, kewang1561-hmfnf-master-0   Container cluster-policy-controller failed liveness probe, will be restarted
Warning  BackOff    5h45m (x4 over 11h)   kubelet, kewang1561-hmfnf-master-0   Back-off restarting failed container
Normal   Killing    4h46m (x5 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container cluster-policy-controller failed startup probe, will be restarted
Warning  Unhealthy  4h13m (x10 over 11h)  kubelet, kewang1561-hmfnf-master-0   Liveness probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Warning  Unhealthy  4h13m (x10 over 11h)  kubelet, kewang1561-hmfnf-master-0   Readiness probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Normal   Pulled     3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ed11a27c7887dc4211bdd10a0c0269831bf45ceede49cf12164f83e2c56adce" already present on machine
Normal   Started    3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Started container kube-controller-manager-recovery-controller
Normal   Created    3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Created container kube-controller-manager-recovery-controller
Normal   Created    160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Created container cluster-policy-controller
Normal   Pulled     160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:457084ba87cd278d51a7d2e7f6b3b878c2a58c157c2b2869b267323cc76009a5" already present on machine
Normal   Started    160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Started container cluster-policy-controller
Warning  Unhealthy  160m (x19 over 22h)   kubelet, kewang1561-hmfnf-master-0   Startup probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Normal   Pulled     133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40af4b93b9826f8c992269cf6b19e7f4b1756d1d8101cd519d8fa241c878830f" already present on machine
Normal   Started    133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Started container kube-controller-manager
Normal   Created    133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Created container kube-controller-manager
Warning  Unhealthy  133m (x24 over 10h)   kubelet, kewang1561-hmfnf-master-0   Startup probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Normal   Killing    131m (x8 over 9h)     kubelet, kewang1561-hmfnf-master-0   Container kube-controller-manager failed startup probe, will be restarted
```

Checked the logs of pod kube-controller-manager-kewang1561-hmfnf-master-0 on the master-0 node and found many timeout errors:

```
$ oc debug node/kewang1561-hmfnf-master-0
sh-4.2# chroot /host
sh-4.4# cd /var/log/pods
sh-4.4# grep -nri 'timeout' /var/log/pods/openshift-kube-controller-manager_kube-controller-manager-kewang1561-hmfnf-master-0_0b3387df0165fd7fa9fb9849f2172347
...
```
```
kube-apiserver-check-endpoints.log:221878:I0716 04:44:38.690568 1 connection_checker.go:174] Failure | TCPConnectError | 10.000121601s | Failed to establish a TCP connection to 172.30.60.10:443: dial tcp 172.30.60.10:443: i/o timeout
...
kube-apiserver-cert-regeneration-controller.log:7:E0716 02:02:36.832823 1 leaderelection.go:320] error retrieving resource lock openshift-kube-apiserver/cert-regeneration-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps/cert-regeneration-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
...
kube-apiserver-cert-syncer.log:1005:E0716 01:03:12.634125 1 reflector.go:382] k8s.io/client-go.3/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps?allowWatchBookmarks=true&resourceVersion=586296&timeout=7m23s&timeoutSeconds=443&watch=true": dial tcp [::1]:6443: connect: connection refused
...
sh-4.4# grep -nri 'timeout' /var/log/pods/openshift-kube-controller-manager_kube-controller-manager-kewang1561-hmfnf-master-0_0b3387df0165fd7fa9fb9849f2172347 | wc -l
3518
```

Checked the etcd servers and found many similar errors 'embed: rejected connection from "<server ip>:<port>" (error "EOF", ServerName "")':

```
$ oc logs -n openshift-etcd etcd-kewang1561-hmfnf-master-0 -c etcd | grep 'embed: rejected connection from' | wc -l
42347
```

I suspect the etcd server is not serving well, which causes the other components to time out and fail.

Expected results:

Infra pods shouldn't restart periodically on the master node.

Additional info:
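The 'timeout' hits found by the grep above can also be attributed to individual container logs, which helps show which container produces most of them. A sketch, assuming grep's `file:line:content` output is piped in (the helper name and the inlined sample lines, with messages abbreviated, are illustrative):

```shell
# Aggregate grep -nri output (file:line:content) by file name, so hits
# can be counted per container log and sorted by frequency.
summarize_by_file() {
  awk -F: '{ n[$1]++ } END { for (f in n) print n[f], f }' | sort -rn
}

# Illustrative input; on the node one would instead run something like:
#   grep -nri 'timeout' /var/log/pods/<pod-log-dir> | summarize_by_file
printf '%s\n' \
  'kube-apiserver-check-endpoints.log:221878:... i/o timeout' \
  'kube-apiserver-cert-syncer.log:1005:... connection refused' \
  'kube-apiserver-check-endpoints.log:221912:... i/o timeout' \
  | summarize_by_file
```

The highest-count file points at the noisiest container, which is a quick way to confirm whether the timeouts cluster in one component or are spread across all of them.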