Bug 1857723
| Summary: | Many infra pods restarting periodically on master node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | wlewis |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-21 06:47:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
> Launch a fresh cluster (UPI install on Openstack). Let it run for some time.

Can you clarify the hardware profile for this UPI cluster? I am seeing on your master node kewang1561-hmfnf-master-0:

> node.kubernetes.io/instance-type: m1.xlarge

If this is correct and the backing infra is AWS, then we should look at underpowered hardware as a major contributing factor. In IPI we use m5.2xlarge as the default[1]; as you can see, m1 is not even a valid option. Let's start here.

[1] https://github.com/openshift/installer/blob/e1a76aba96b794ea152cfc46059fd98dbc788ca3/upi/aws/cloudformation/05_cluster_master_nodes.yaml#L52

The problem was found on a UPI install on OpenStack, not AWS. I checked the installation template: the master node vm_type is 'm1.xlarge', which provides 8 vCPUs, 16384 MB RAM, and a 160 GB root disk.

> Checked the etcd servers, found many similar errors 'embed: rejected connection from "<server ip>:<port>" (error "EOF", ServerName "")'

This problem seems to have been solved in bug 1855284; I will verify it to confirm. I am not sure whether the problem 'Infra pods restart periodically on master node' has something to do with bug 1855284.

Judging by when this bug and bug 1855284 were opened, they could be the same problem. Ke, could you double-check whether this issue is solved by bug 1855284? Thanks.

I checked the etcd logs from must-gather: the etcd member logs show the rejected-connection error about 4 times per second, like the description in bug 1855284, so I can confirm they are the same issue.
```
2020-07-16T08:09:54.234698992Z 2020-07-16 08:09:54.234602 I | embed: rejected connection from "192.168.3.130:51058" (error "EOF", ServerName "")
2020-07-16T08:09:54.409955945Z 2020-07-16 08:09:54.409871 I | embed: rejected connection from "192.168.0.130:41920" (error "EOF", ServerName "")
2020-07-16T08:09:54.933405182Z 2020-07-16 08:09:54.933266 I | embed: rejected connection from "[::1]:55238" (error "EOF", ServerName "")
2020-07-16T08:09:54.933405182Z 2020-07-16 08:09:54.933340 I | embed: rejected connection from "192.168.2.234:52822" (error "EOF", ServerName "")
2020-07-16T08:09:55.234834196Z 2020-07-16 08:09:55.234773 I | embed: rejected connection from "192.168.3.130:51098" (error "EOF", ServerName "")
2020-07-16T08:09:55.410148927Z 2020-07-16 08:09:55.409994 I | embed: rejected connection from "192.168.0.130:41958" (error "EOF", ServerName "")
2020-07-16T08:09:55.931878451Z 2020-07-16 08:09:55.931625 I | embed: rejected connection from "192.168.2.234:52862" (error "EOF", ServerName "")
2020-07-16T08:09:55.932613547Z 2020-07-16 08:09:55.932381 I | embed: rejected connection from "[::1]:55286" (error "EOF", ServerName "")
2020-07-16T08:09:56.235472471Z 2020-07-16 08:09:56.235342 I | embed: rejected connection from "192.168.3.130:51134" (error "EOF", ServerName "")
2020-07-16T08:09:56.410273017Z 2020-07-16 08:09:56.410160 I | embed: rejected connection from "192.168.0.130:42006" (error "EOF", ServerName "")
2020-07-16T08:09:56.931671873Z 2020-07-16 08:09:56.931547 I | embed: rejected connection from "192.168.2.234:52894" (error "EOF", ServerName "")
2020-07-16T08:09:56.933056282Z 2020-07-16 08:09:56.932979 I | embed: rejected connection from "[::1]:55328" (error "EOF", ServerName "")
...
```

*** This bug has been marked as a duplicate of bug 1855284 ***
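For reference, the roughly 4-per-second rate cited above can be checked mechanically from the log timestamps. A minimal sketch (the helper name is illustrative, and the sample input abbreviates the real log lines with `...`):

```shell
# Count 'rejected connection' entries per wall-clock second: the first
# field of each container log line is an RFC3339 timestamp, so truncating
# it to 19 characters (YYYY-MM-DDTHH:MM:SS) buckets entries by second.
rejections_per_second() {
  grep 'embed: rejected connection from' \
    | awk '{ sec = substr($1, 1, 19); n[sec]++ } END { for (s in n) print s, n[s] }' \
    | sort
}

# Illustrative usage with abbreviated copies of two lines quoted above;
# against the real log, one would instead pipe in something like:
#   oc logs -n openshift-etcd <etcd-pod> -c etcd | rejections_per_second
printf '%s\n' \
  '2020-07-16T08:09:54.234698992Z ... embed: rejected connection from "192.168.3.130:51058" (error "EOF", ServerName "")' \
  '2020-07-16T08:09:54.409955945Z ... embed: rejected connection from "192.168.0.130:41920" (error "EOF", ServerName "")' \
  | rejections_per_second
# -> 2020-07-16T08:09:54 2
```

Buckets whose count hovers around 4 would confirm the rate described in bug 1855284.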
Description of problem:

Some infra pods restart periodically on the master node after the cluster has been running for a while. More details in the must-gather below.

Version-Release number of selected component (if applicable):

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-15-031221   True        False         22h     Cluster version is 4.6.0-0.nightly-2020-07-15-031221
```

How reproducible:

Always

Steps to Reproduce:

1. Launch a fresh cluster (UPI install on Openstack).
2. Let it run for some time.

Actual results:

Some infra pods restart periodically, as below.

```
$ oc get po -A | awk '$5 > 5 {print}'
NAMESPACE                                          NAME                                                      READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-6d4bff8c4b-vhxs4             1/1     Running   19         22h
openshift-authentication-operator                  authentication-operator-89c95d686-lcqkv                   1/1     Running   16         22h
openshift-cloud-credential-operator                cloud-credential-operator-76cfb6866-8sbzj                 2/2     Running   58         22h
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-6b4cb9f889-dljgz         1/1     Running   15         22h
openshift-config-operator                          openshift-config-operator-6d49cdc8b4-nvhfg                1/1     Running   39         22h
openshift-console-operator                         console-operator-65877477c4-szzfs                         1/1     Running   30         22h
openshift-controller-manager-operator              openshift-controller-manager-operator-db9cf6554-lp6dk     1/1     Running   18         22h
openshift-controller-manager                       controller-manager-7dlm2                                  1/1     Running   10         3h32m
openshift-etcd-operator                            etcd-operator-5fb5d5dc9d-lnt7c                            1/1     Running   19         22h
openshift-image-registry                           cluster-image-registry-operator-5ccc7d9f47-scrtd          1/1     Running   15         22h
openshift-kube-apiserver-operator                  kube-apiserver-operator-88f49b4d8-dbszf                   1/1     Running   18         22h
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-0                  5/5     Running   12         3h26m
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-1                  5/5     Running   14         3h32m
openshift-kube-apiserver                           kube-apiserver-kewang1561-hmfnf-master-2                  5/5     Running   12         3h31m
openshift-kube-controller-manager-operator         kube-controller-manager-operator-bd4f4ccf-8cmzn           1/1     Running   16         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-0         4/4     Running   56         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-1         4/4     Running   47         22h
openshift-kube-controller-manager                  kube-controller-manager-kewang1561-hmfnf-master-2         4/4     Running   51         22h
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-5d866459dd-vjbd7        1/1     Running   18         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-0        2/2     Running   14         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-1        2/2     Running   13         22h
openshift-kube-scheduler                           openshift-kube-scheduler-kewang1561-hmfnf-master-2        2/2     Running   18         22h
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-5f4f6c7d55-xnkn5   1/1     Running   18         22h
openshift-machine-api                              machine-api-controllers-7ddfb44fb5-xxvnq                  7/7     Running   206        22h
openshift-machine-api                              machine-api-operator-8567f9f77d-ktl5r                     2/2     Running   9          22h
openshift-machine-config-operator                  machine-config-controller-6b666f7bd9-jtwxc                1/1     Running   11         22h
openshift-machine-config-operator                  machine-config-operator-7778567df-zkg6d                   1/1     Running   10         22h
openshift-sdn                                      sdn-controller-8mhs2                                      1/1     Running   11         22h
openshift-sdn                                      sdn-controller-xt4q2                                      1/1     Running   11         22h
openshift-service-ca-operator                      service-ca-operator-849df96d8c-2hh4n                      1/1     Running   18         22h
openshift-service-ca                               service-ca-774c76978c-b8wmj                               1/1     Running   18         22h
```

```
$ oc describe pod -n openshift-kube-controller-manager kube-controller-manager-kewang1561-hmfnf-master-0
...
```
```
Type     Reason     Age                   From                                 Message
----     ------     ----                  ----                                 -------
Warning  BackOff    10h (x3 over 11h)     kubelet, kewang1561-hmfnf-master-0   Back-off restarting failed container
Normal   Killing    9h (x3 over 11h)      kubelet, kewang1561-hmfnf-master-0   Container kube-controller-manager failed liveness probe, will be restarted
Warning  Unhealthy  9h (x10 over 11h)     kubelet, kewang1561-hmfnf-master-0   Liveness probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Warning  Unhealthy  9h (x15 over 11h)     kubelet, kewang1561-hmfnf-master-0   Readiness probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Normal   Killing    5h47m                 kubelet, kewang1561-hmfnf-master-0   Container cluster-policy-controller failed liveness probe, will be restarted
Warning  BackOff    5h45m (x4 over 11h)   kubelet, kewang1561-hmfnf-master-0   Back-off restarting failed container
Normal   Killing    4h46m (x5 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container cluster-policy-controller failed startup probe, will be restarted
Warning  Unhealthy  4h13m (x10 over 11h)  kubelet, kewang1561-hmfnf-master-0   Liveness probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Warning  Unhealthy  4h13m (x10 over 11h)  kubelet, kewang1561-hmfnf-master-0   Readiness probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Normal   Pulled     3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ed11a27c7887dc4211bdd10a0c0269831bf45ceede49cf12164f83e2c56adce" already present on machine
Normal   Started    3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Started container kube-controller-manager-recovery-controller
Normal   Created    3h8m (x9 over 22h)    kubelet, kewang1561-hmfnf-master-0   Created container kube-controller-manager-recovery-controller
Normal   Created    160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Created container cluster-policy-controller
Normal   Pulled     160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:457084ba87cd278d51a7d2e7f6b3b878c2a58c157c2b2869b267323cc76009a5" already present on machine
Normal   Started    160m (x20 over 22h)   kubelet, kewang1561-hmfnf-master-0   Started container cluster-policy-controller
Warning  Unhealthy  160m (x19 over 22h)   kubelet, kewang1561-hmfnf-master-0   Startup probe failed: Get "https://192.168.2.234:10357/healthz": dial tcp 192.168.2.234:10357: connect: connection refused
Normal   Pulled     133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40af4b93b9826f8c992269cf6b19e7f4b1756d1d8101cd519d8fa241c878830f" already present on machine
Normal   Started    133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Started container kube-controller-manager
Normal   Created    133m (x27 over 22h)   kubelet, kewang1561-hmfnf-master-0   Created container kube-controller-manager
Warning  Unhealthy  133m (x24 over 10h)   kubelet, kewang1561-hmfnf-master-0   Startup probe failed: Get "https://192.168.2.234:10257/healthz": dial tcp 192.168.2.234:10257: connect: connection refused
Normal   Killing    131m (x8 over 9h)     kubelet, kewang1561-hmfnf-master-0   Container kube-controller-manager failed startup probe, will be restarted
```

Checked the logs of pod kube-controller-manager-kewang1561-hmfnf-master-0 on the master-0 node and found many timeout errors:

```
$ oc debug node/kewang1561-hmfnf-master-0
sh-4.2# chroot /host
sh-4.4# cd /var/log/pods
sh-4.4# grep -nri 'timeout' /var/log/pods/openshift-kube-controller-manager_kube-controller-manager-kewang1561-hmfnf-master-0_0b3387df0165fd7fa9fb9849f2172347
...
```
```
kube-apiserver-check-endpoints.log:221878:I0716 04:44:38.690568 1 connection_checker.go:174] Failure | TCPConnectError | 10.000121601s | Failed to establish a TCP connection to 172.30.60.10:443: dial tcp 172.30.60.10:443: i/o timeout
...
kube-apiserver-cert-regeneration-controller.log:7:E0716 02:02:36.832823 1 leaderelection.go:320] error retrieving resource lock openshift-kube-apiserver/cert-regeneration-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps/cert-regeneration-controller-lock?timeout=35s": dial tcp [::1]:6443: connect: connection refused
...
kube-apiserver-cert-syncer.log:1005:E0716 01:03:12.634125 1 reflector.go:382] k8s.io/client-go.3/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/configmaps?allowWatchBookmarks=true&resourceVersion=586296&timeout=7m23s&timeoutSeconds=443&watch=true": dial tcp [::1]:6443: connect: connection refused
...
sh-4.4# grep -nri 'timeout' /var/log/pods/openshift-kube-controller-manager_kube-controller-manager-kewang1561-hmfnf-master-0_0b3387df0165fd7fa9fb9849f2172347 | wc -l
3518
```

Checked the etcd servers and found many similar errors 'embed: rejected connection from "<server ip>:<port>" (error "EOF", ServerName "")':

```
$ oc logs -n openshift-etcd etcd-kewang1561-hmfnf-master-0 -c etcd | grep 'embed: rejected connection from' | wc -l
42347
```

I suspect the etcd server is not serving well, which causes the other components to time out and fail.

Expected results:

Infra pods shouldn't restart periodically on the master node.

Additional info:
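The 'timeout' hits found by the grep above can also be attributed to individual container logs, which helps show which container produces most of them. A sketch, assuming grep's `file:line:content` output is piped in (the helper name and the inlined sample lines, with messages abbreviated, are illustrative):

```shell
# Aggregate grep -nri output (file:line:content) by file name, so hits
# can be counted per container log and sorted by frequency.
summarize_by_file() {
  awk -F: '{ n[$1]++ } END { for (f in n) print n[f], f }' | sort -rn
}

# Illustrative input; on the node one would instead run something like:
#   grep -nri 'timeout' /var/log/pods/<pod-log-dir> | summarize_by_file
printf '%s\n' \
  'kube-apiserver-check-endpoints.log:221878:... i/o timeout' \
  'kube-apiserver-cert-syncer.log:1005:... connection refused' \
  'kube-apiserver-check-endpoints.log:221912:... i/o timeout' \
  | summarize_by_file
```

The highest-count file points at the noisiest container, which is a quick way to confirm whether the timeouts cluster in one component or are spread across all of them.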