Description of problem:

Some pods have already restarted on one worker:

omg get pod -A -o wide | grep ip-10-0-61-174.us-east-2.compute.internal
openshift-image-registry                node-ca-8qzt2                                             1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress                       router-default-86f56f7d65-mkqb5                           0/1   Running   10   30m     10.129.2.120   ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-node-tuning-operator  tuned-mg4kw                                               1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        multus-additional-cni-plugins-wn4sb                       1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        multus-m7qv2                                              1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        network-metrics-daemon-qm9xk                              2/2   Running   0    2h23m   10.129.2.5     ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress-canary                ingress-canary-sxgv5                                      1/1   Running   0    2h22m   10.129.2.7     ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-jb74t                             3/3   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-ovn-kubernetes                ovnkube-node-85x7t                                        5/5   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                           dns-default-mjbf9                                         2/2   Running   0    2h22m   10.129.2.6     ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                           node-resolver-nm8ns                                       1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-console                       downloads-5b6658dc6d-tm4xp                                0/1   Running   13   30m     10.129.2.112   ip-10-0-61-174.us-east-2.compute.internal
openshift-machine-config-operator       machine-config-daemon-fmkbr                               2/2   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    alertmanager-main-1                                       6/6   Running   0    31m     10.129.2.127   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    node-exporter-t7n67                                       2/2   Running   0    2h22m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-adapter-69c9bbc468-pl6h4                       0/1   Running   10   31m     10.129.2.119   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-k8s-1                                          6/6   Running   0    29m     10.129.2.128   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-operator-admission-webhook-6bcb565bc9-s9xhb    0/1   Running   13   31m     10.129.2.121   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    thanos-querier-5b5675cb64-j7bmx                           6/6   Running   0    31m     10.129.2.124   ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics           network-check-source-c77957f56-p8jqv                      1/1   Running   0    31m     10.129.2.115   ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics           network-check-target-p2b4j                                1/1   Running   0    2h23m   10.129.2.4     ip-10-0-61-174.us-east-2.compute.internal

From the must-gather log `./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ingress/pods/router-default-86f56f7d65-mkqb5/router-default-86f56f7d65-mkqb5.yaml`, it shows "dial tcp 172.30.0.1:443: i/o timeout":

containerStatuses:
- containerID: cri-o://19354680c498e0464e515c46463b5bfceb789e81da388dbcffea70f53063e57e
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
  lastState:
    terminated:
      containerID: cri-o://23f15ac22168d816b67d069f8f0e5d4401e43dbf57fa17a18f674b40fd3b1130
      exitCode: 137
      finishedAt: "2022-07-27T06:51:42Z"
      message: |
        top requested
        E0727 06:51:31.918125       1 factory.go:130] failed to sync cache for *v1.Route shared informer
        I0727 06:51:31.918144       1 shared_informer.go:281] stop requested
        E0727 06:51:31.918156       1 factory.go:130] failed to sync cache for *v1.EndpointSlice shared informer
        I0727 06:51:31.919259       1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670} was not added to shared informer because it has stopped already
        I0727 06:51:31.919279       1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670} was not added to shared informer because it has stopped already
        I0727 06:51:31.919323       1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
        E0727 06:51:31.920473       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
        I0727 06:51:32.066608       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
        W0727 06:51:39.194933       1 reflector.go:324] github.com/openshift/router/pkg/router/template/service_lookup.go:33: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
        I0727 06:51:39.194989       1 trace.go:205] Trace[16201266]: "Reflector ListAndWatch" name:github.com/openshift/router/pkg/router/template/service_lookup.go:33 (27-Jul-2022 06:51:09.194) (total time: 30000ms):
        Trace[16201266]: ---"Objects listed" error:Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout 30000ms (06:51:39.194)
        Trace[16201266]: [30.000478548s] [30.000478548s] END
        E0727 06:51:39.195000       1 reflector.go:138] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
      reason: Error

And the following error was found in ovn-controller (quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-85x7t/ovn-controller/ovn-controller/logs/current.log):

2022-07-27T06:23:01.990095126Z 2022-07-27T06:23:01.990Z|00947|ofctrl|INFO|OpenFlow error: OFPT_ERROR (OF1.5) (xid=0x5aba): OFPBFC_MSG_FAILED

Version-Release number of selected component (if applicable):
4.11.0-rc.5-aarch64 --> 4.11.0-rc.6-aarch64
05_aarch64_IPI on AWS & Private cluster & FIPS on & OVN & Etcd Encryption

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
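For reference, a minimal sketch of how the two errors above can be located in the extracted must-gather (assuming the archive is unpacked into ./must-gather/; the quay-io-...-sha256 directory name is the one shown above):

# Router pod: the apiserver i/o timeout in the terminated container status
grep -n "dial tcp 172.30.0.1:443: i/o timeout" \
  must-gather/*/namespaces/openshift-ingress/pods/router-default-86f56f7d65-mkqb5/router-default-86f56f7d65-mkqb5.yaml

# ovn-controller on the same node: the OpenFlow group-mod failure
grep -n "OFPBFC_MSG_FAILED" \
  must-gather/*/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-85x7t/ovn-controller/ovn-controller/logs/current.log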
must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather-124715-307932630.tar.gz
The issue happens on the 4.11.0-rc.5 version, so it should not be related to the upgrade.
Also, this issue cannot always be reproduced.
This bug might be the same as: https://bugzilla.redhat.com/show_bug.cgi?id=2111619#c4

I need a kubeconfig or an sos-report from ip-10-0-61-174.us-east-2.compute.internal so that I can check ovs dump-groups on the node where the router pod lives and make sure the necessary flows were installed properly for the k8s api clusterIP service. If I had access I could also do an ovs trace. So far, from the ovn-controller logs alone provided in the must-gather, I didn't spot the group-mod issue for 10.0.48.125:6443, 10.0.65.181:6443, or 10.0.70.203:6443.
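For reference, this is roughly what would be run from a debug shell on that node once access is available (a sketch only, not verified on this cluster; <router-pod-ofport> is a placeholder that would have to be taken from the router pod's interface on br-int):

# Get a host shell on the affected node (needs a kubeconfig for the cluster).
oc debug node/ip-10-0-61-174.us-east-2.compute.internal
chroot /host

# Dump the OpenFlow groups on br-int and check that buckets exist for the
# kube-apiserver endpoints behind the 172.30.0.1:443 clusterIP.
ovs-ofctl -O OpenFlow15 dump-groups br-int | grep -E '10\.0\.48\.125|10\.0\.65\.181|10\.0\.70\.203'

# Trace a packet from the router pod (10.129.2.120) toward the clusterIP to
# see which group, if any, it hits.
ovs-appctl ofproto/trace br-int 'in_port=<router-pod-ofport>,tcp,nw_src=10.129.2.120,nw_dst=172.30.0.1,tp_dst=443'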
The controller is also seeing long polls:

2022-07-27T05:48:57.775586061Z 2022-07-27T05:48:47.067Z|00005|timeval(ovn_pinctrl0)|WARN|Unreasonably long 163228ms poll interval (0ms user, 3126ms system)
2022-07-27T06:07:04.726817283Z 2022-07-27T06:06:55.617Z|00548|timeval|WARN|Unreasonably long 1318281ms poll interval (0ms user, 19886ms system)
Let's use this bug to track the actual fix from OVN; it will track the bump to an OVN version where this is fixed properly.
Still have not hit this issue today with the following kinds of testing:
1. Creating more than 200 pods across the 3 workers.
2. Restarting openvswitch on a worker.
3. Deleting the openshift-ovn-kubernetes pods.
4. Rebooting all workers.
5. Deleting all 200 pods and recreating them.
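For the record, a rough sketch of the pod-churn part of that testing (the image, deployment name, and label selector below are illustrative, not the exact manifests that were used):

# Create ~200 test pods spread across the workers.
oc create deployment churn-test --image=registry.access.redhat.com/ubi8/ubi-minimal --replicas=200 -- sleep infinity
oc rollout status deployment/churn-test

# Restart openvswitch on a worker.
oc debug node/<worker> -- chroot /host systemctl restart ovs-vswitchd

# Delete the openshift-ovn-kubernetes node pods so they get recreated.
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node

# Delete all test pods and let the deployment recreate them.
oc delete pod -l app=churn-test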
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399