Bug 2111733

Summary:	pod cannot access kubernetes service
Product:	OpenShift Container Platform	Reporter:	zhaozhanqi <zzhao>
Component:	Networking	Assignee:	Surya Seetharaman <surya>
Networking sub component:	ovn-kubernetes	QA Contact:	Anurag saxena <anusaxen>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	anbhat, dcbw, lwan, mifiedle, surya, wking
Version:	4.11
Target Milestone:	---
Target Release:	4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-01-17 19:53:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2112111
Bug Blocks:	2111619

Description zhaozhanqi 2022-07-28 04:09:07 UTC

Description of problem:

Some pods on one worker have already restarted several times:

 omg get pod -A -o wide | grep ip-10-0-61-174.us-east-2.compute.internal
openshift-image-registry                          node-ca-8qzt2                                                             1/1    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress                                 router-default-86f56f7d65-mkqb5                                           0/1    Running    10        30m    10.129.2.120  ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-node-tuning-operator            tuned-mg4kw                                                               1/1    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                                  multus-additional-cni-plugins-wn4sb                                       1/1    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                                  multus-m7qv2                                                              1/1    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                                  network-metrics-daemon-qm9xk                                              2/2    Running    0         2h23m  10.129.2.5    ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress-canary                          ingress-canary-sxgv5                                                      1/1    Running    0         2h22m  10.129.2.7    ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-csi-drivers                     aws-ebs-csi-driver-node-jb74t                                             3/3    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-ovn-kubernetes                          ovnkube-node-85x7t                                                        5/5    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                                     dns-default-mjbf9                                                         2/2    Running    0         2h22m  10.129.2.6    ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                                     node-resolver-nm8ns                                                       1/1    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-console                                 downloads-5b6658dc6d-tm4xp                                                0/1    Running    13        30m    10.129.2.112  ip-10-0-61-174.us-east-2.compute.internal
openshift-machine-config-operator                 machine-config-daemon-fmkbr                                               2/2    Running    0         2h23m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              alertmanager-main-1                                                       6/6    Running    0         31m    10.129.2.127  ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              node-exporter-t7n67                                                       2/2    Running    0         2h22m  10.0.61.174   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              prometheus-adapter-69c9bbc468-pl6h4                                       0/1    Running    10        31m    10.129.2.119  ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              prometheus-k8s-1                                                          6/6    Running    0         29m    10.129.2.128  ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              prometheus-operator-admission-webhook-6bcb565bc9-s9xhb                    0/1    Running    13        31m    10.129.2.121  ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                              thanos-querier-5b5675cb64-j7bmx                                           6/6    Running    0         31m    10.129.2.124  ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics                     network-check-source-c77957f56-p8jqv                                      1/1    Running    0         31m    10.129.2.115  ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics                     network-check-target-p2b4j                                                1/1    Running    0         2h23m  10.129.2.4    ip-10-0-61-174.us-east-2.compute.internal
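
As a rough illustration only (not part of the original report), the restarting pods can be picked out of a listing like the one above with a few lines of Python. This sketch assumes the eight whitespace-separated columns shown in the output (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE, IP, NODE); the function name `restarting_pods` and the shortened `SAMPLE` text are hypothetical.

```python
def restarting_pods(listing, node, min_restarts=1):
    """Return (namespace, pod, restarts) for pods on `node` that restarted."""
    hits = []
    for line in listing.splitlines():
        cols = line.split()
        # Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
        if len(cols) != 8 or cols[7] != node:
            continue
        if not cols[4].isdigit():
            continue  # skip headers or rows with an unexpected RESTARTS field
        restarts = int(cols[4])
        if restarts >= min_restarts:
            hits.append((cols[0], cols[1], restarts))
    return hits


NODE = "ip-10-0-61-174.us-east-2.compute.internal"

# Three rows copied from the listing above, with the column padding collapsed.
SAMPLE = """\
openshift-ingress router-default-86f56f7d65-mkqb5 0/1 Running 10 30m 10.129.2.120 ip-10-0-61-174.us-east-2.compute.internal
openshift-dns dns-default-mjbf9 2/2 Running 0 2h22m 10.129.2.6 ip-10-0-61-174.us-east-2.compute.internal
openshift-console downloads-5b6658dc6d-tm4xp 0/1 Running 13 30m 10.129.2.112 ip-10-0-61-174.us-east-2.compute.internal
"""

if __name__ == "__main__":
    for namespace, pod, restarts in restarting_pods(SAMPLE, NODE):
        print(f"{namespace}/{pod}: {restarts} restarts")
```

In the full listing, every pod with a non-zero restart count is a pod-network pod (10.129.2.x address) that was started about 30 minutes before the capture; the host-network pods on the same node show no restarts.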


The must-gather file `./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ingress/pods/router-default-86f56f7d65-mkqb5/router-default-86f56f7d65-mkqb5.yaml` shows "dial tcp 172.30.0.1:443: i/o timeout":


  containerStatuses:
  - containerID: cri-o://19354680c498e0464e515c46463b5bfceb789e81da388dbcffea70f53063e57e
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
    lastState:
      terminated:
        containerID: cri-o://23f15ac22168d816b67d069f8f0e5d4401e43dbf57fa17a18f674b40fd3b1130
        exitCode: 137
        finishedAt: "2022-07-27T06:51:42Z"
        message: "top requested\nE0727 06:51:31.918125       1 factory.go:130] failed
          to sync cache for *v1.Route shared informer\nI0727 06:51:31.918144       1
          shared_informer.go:281] stop requested\nE0727 06:51:31.918156       1 factory.go:130]
          failed to sync cache for *v1.EndpointSlice shared informer\nI0727 06:51:31.919259
          \      1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670}
          was not added to shared informer because it has stopped already\nI0727 06:51:31.919279
          \      1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670}
          was not added to shared informer because it has stopped already\nI0727 06:51:31.919323
          \      1 template.go:704] router \"msg\"=\"Shutdown requested, waiting 45s
          for new connections to cease\"  \nE0727 06:51:31.920473       1 haproxy.go:418]
          can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect:
          no such file or directory\nI0727 06:51:32.066608       1 router.go:618]
          template \"msg\"=\"router reloaded\"  \"output\"=\" - Checking http://localhost:80
          using PROXY protocol ...\\n - Health check ok : 0 retry attempt(s).\\n\"\nW0727
          06:51:39.194933       1 reflector.go:324] github.com/openshift/router/pkg/router/template/service_lookup.go:33:
          failed to list *v1.Service: Get \"https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0\":
          dial tcp 172.30.0.1:443: i/o timeout\nI0727 06:51:39.194989       1 trace.go:205]
          Trace[16201266]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/template/service_lookup.go:33
          (27-Jul-2022 06:51:09.194) (total time: 30000ms):\nTrace[16201266]: ---\"Objects
          listed\" error:Get \"https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0\":
          dial tcp 172.30.0.1:443: i/o timeout 30000ms (06:51:39.194)\nTrace[16201266]:
          [30.000478548s] [30.000478548s] END\nE0727 06:51:39.195000       1 reflector.go:138]
          github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed
          to watch *v1.Service: failed to list *v1.Service: Get \"https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0\":
          dial tcp 172.30.0.1:443: i/o timeout\n"
        reason: Error
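
A minimal sketch, assuming the `message` field above has already been decoded from the pod YAML into a plain string, of how the timed-out endpoints could be counted. The helper name `timed_out_endpoints` and the shortened `MESSAGE` sample are hypothetical; only the error text itself comes from the log.

```python
import re

# Matches the Go net error text seen in the router log.
DIAL_TIMEOUT = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): i/o timeout"
)


def timed_out_endpoints(message):
    """Count 'dial tcp <ip>:<port>: i/o timeout' occurrences per endpoint."""
    counts = {}
    for ip, port in DIAL_TIMEOUT.findall(message):
        key = (ip, int(port))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Two entries taken from the terminated container's message above.
MESSAGE = (
    'failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services'
    '?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout\n'
    'Failed to watch *v1.Service: failed to list *v1.Service: Get '
    '"https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": '
    'dial tcp 172.30.0.1:443: i/o timeout\n'
)

if __name__ == "__main__":
    for (ip, port), count in timed_out_endpoints(MESSAGE).items():
        print(f"{ip}:{port} timed out {count} time(s)")
```

Only the `dial tcp` portion is matched, so the same address appearing inside the request URL is not double-counted.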




The following error was also found in the ovn-controller log:

quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-85x7t/ovn-controller/ovn-controller/logs/current.log

2022-07-27T06:23:01.990095126Z 2022-07-27T06:23:01.990Z|00947|ofctrl|INFO|OpenFlow error: OFPT_ERROR (OF1.5) (xid=0x5aba): OFPBFC_MSG_FAILED
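
For scanning a long `current.log` for entries like the one above, a small parser can help. This is a sketch under assumptions: the line layout `<container timestamp> <OVS timestamp>|<sequence>|<module>|<level>|<message>` is inferred from the single line shown here, and `parse_ofctrl_error` is a hypothetical helper, not an OVN tool.

```python
import re

OF_ERROR = re.compile(
    r"OpenFlow error: OFPT_ERROR \((?P<version>OF[\d.]+)\) "
    r"\(xid=(?P<xid>0x[0-9a-fA-F]+)\): (?P<code>\w+)"
)


def parse_ofctrl_error(line):
    """Return details of an ofctrl OpenFlow error line, or None otherwise."""
    # Assumed layout: "<container ts> <ovs ts>|<seq>|<module>|<level>|<message>"
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    stamps, _seq, module, level, message = parts
    match = OF_ERROR.search(message)
    if module != "ofctrl" or match is None:
        return None
    stamp_fields = stamps.split()
    return {
        # The last timestamp before the first "|" is the one written by OVS.
        "timestamp": stamp_fields[-1] if stamp_fields else "",
        "level": level,
        **match.groupdict(),
    }


LINE = (
    "2022-07-27T06:23:01.990095126Z 2022-07-27T06:23:01.990Z|00947|ofctrl|INFO|"
    "OpenFlow error: OFPT_ERROR (OF1.5) (xid=0x5aba): OFPBFC_MSG_FAILED"
)

if __name__ == "__main__":
    print(parse_ofctrl_error(LINE))
```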




Version-Release number of selected component (if applicable):

4.11.0-rc.5-aarch64 -->  4.11.0-rc.6-aarch64

05_aarch64_IPI on AWS & Private cluster & FIPS on & OVN & Etcd Encryption




Comment 1 zhaozhanqi 2022-07-28 04:10:25 UTC

must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather-124715-307932630.tar.gz

Comment 2 zhaozhanqi 2022-07-28 06:38:23 UTC


The issue happens on the 4.11.0-rc.5 version, so it should not be related to the upgrade.

Comment 3 zhaozhanqi 2022-07-28 07:39:50 UTC

Also, this issue cannot always be reproduced.

Comment 8 Surya Seetharaman 2022-07-28 20:13:21 UTC

This bug might be the same as: https://bugzilla.redhat.com/show_bug.cgi?id=2111619#c4

I need a kubeconfig or an sos-report from ip-10-0-61-174.us-east-2.compute.internal so that I can check ovs dump-groups on the node where the router pod lives and make sure the necessary flows were installed properly for the k8s API clusterIP service. With access I could also do an ovs trace. So far, from the ovn-controller logs provided in the must-gather alone, I haven't spotted the group mod issue for 10.0.48.125:6443, 10.0.65.181:6443, or 10.0.70.203:6443.

Comment 9 Surya Seetharaman 2022-07-28 20:16:29 UTC

ovn-controller is also seeing unreasonably long poll intervals:

2022-07-27T05:48:57.775586061Z 2022-07-27T05:48:47.067Z|00005|timeval(ovn_pinctrl0)|WARN|Unreasonably long 163228ms poll interval (0ms user, 3126ms system)
2022-07-27T06:07:04.726817283Z 2022-07-27T06:06:55.617Z|00548|timeval|WARN|Unreasonably long 1318281ms poll interval (0ms user, 19886ms system)
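
To put those warnings in more readable units, here is a small sketch that extracts the durations from the two lines above. The helper name `poll_intervals` is hypothetical; the regular expression only assumes the message wording seen in these two entries.

```python
import re

POLL = re.compile(
    r"Unreasonably long (\d+)ms poll interval "
    r"\((\d+)ms user, (\d+)ms system\)"
)


def poll_intervals(log_text):
    """Return a dict of millisecond values for each long-poll warning found."""
    found = []
    for match in POLL.finditer(log_text):
        total_ms, user_ms, system_ms = (int(value) for value in match.groups())
        found.append(
            {"total_ms": total_ms, "user_ms": user_ms, "system_ms": system_ms}
        )
    return found


LOG = """\
2022-07-27T05:48:57.775586061Z 2022-07-27T05:48:47.067Z|00005|timeval(ovn_pinctrl0)|WARN|Unreasonably long 163228ms poll interval (0ms user, 3126ms system)
2022-07-27T06:07:04.726817283Z 2022-07-27T06:06:55.617Z|00548|timeval|WARN|Unreasonably long 1318281ms poll interval (0ms user, 19886ms system)
"""

if __name__ == "__main__":
    for entry in poll_intervals(LOG):
        minutes = entry["total_ms"] / 60000
        print(f"{entry['total_ms']} ms is about {minutes:.1f} min "
              f"(system time {entry['system_ms']} ms)")
```

The two intervals work out to roughly 2.7 minutes and 22 minutes respectively.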

Comment 12 Surya Seetharaman 2022-07-29 09:48:09 UTC

Let's use this bug to track the actual fix from OVN; that is, it will track the bump to an OVN version where this is fixed properly.

Comment 13 zhaozhanqi 2022-07-29 10:49:17 UTC

I still have not hit this issue today with several kinds of testing, including:

1. Creating more than 200 pods across 3 workers.
2. Restarting openvswitch on a worker.
3. Deleting the openshift-ovn-kubernetes pods.
4. Rebooting all workers.
5. Deleting all 200 pods and recreating them.

Comment 19 errata-xmlrpc 2023-01-17 19:53:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399