Description of problem:

Some pods have already restarted on one worker:

omg get pod -A -o wide | grep ip-10-0-61-174.us-east-2.compute.internal
openshift-image-registry                node-ca-8qzt2                                             1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress                       router-default-86f56f7d65-mkqb5                           0/1   Running   10   30m     10.129.2.120   ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-node-tuning-operator  tuned-mg4kw                                               1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        multus-additional-cni-plugins-wn4sb                       1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        multus-m7qv2                                              1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-multus                        network-metrics-daemon-qm9xk                              2/2   Running   0    2h23m   10.129.2.5     ip-10-0-61-174.us-east-2.compute.internal
openshift-ingress-canary                ingress-canary-sxgv5                                      1/1   Running   0    2h22m   10.129.2.7     ip-10-0-61-174.us-east-2.compute.internal
openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-jb74t                             3/3   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-ovn-kubernetes                ovnkube-node-85x7t                                        5/5   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                           dns-default-mjbf9                                         2/2   Running   0    2h22m   10.129.2.6     ip-10-0-61-174.us-east-2.compute.internal
openshift-dns                           node-resolver-nm8ns                                       1/1   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-console                       downloads-5b6658dc6d-tm4xp                                0/1   Running   13   30m     10.129.2.112   ip-10-0-61-174.us-east-2.compute.internal
openshift-machine-config-operator       machine-config-daemon-fmkbr                               2/2   Running   0    2h23m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    alertmanager-main-1                                       6/6   Running   0    31m     10.129.2.127   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    node-exporter-t7n67                                       2/2   Running   0    2h22m   10.0.61.174    ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-adapter-69c9bbc468-pl6h4                       0/1   Running   10   31m     10.129.2.119   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-k8s-1                                          6/6   Running   0    29m     10.129.2.128   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    prometheus-operator-admission-webhook-6bcb565bc9-s9xhb    0/1   Running   13   31m     10.129.2.121   ip-10-0-61-174.us-east-2.compute.internal
openshift-monitoring                    thanos-querier-5b5675cb64-j7bmx                           6/6   Running   0    31m     10.129.2.124   ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics           network-check-source-c77957f56-p8jqv                      1/1   Running   0    31m     10.129.2.115   ip-10-0-61-174.us-east-2.compute.internal
openshift-network-diagnostics           network-check-target-p2b4j                                1/1   Running   0    2h23m   10.129.2.4     ip-10-0-61-174.us-east-2.compute.internal

From the must-gather log `./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ingress/pods/router-default-86f56f7d65-mkqb5/router-default-86f56f7d65-mkqb5.yaml`, it shows "dial tcp 172.30.0.1:443: i/o timeout":

containerStatuses:
- containerID: cri-o://19354680c498e0464e515c46463b5bfceb789e81da388dbcffea70f53063e57e
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee700fabad64d7d55adf4493394c06cfa7558d9b921e7b927ec5d5d33af3a079
  lastState:
    terminated:
      containerID: cri-o://23f15ac22168d816b67d069f8f0e5d4401e43dbf57fa17a18f674b40fd3b1130
      exitCode: 137
      finishedAt: "2022-07-27T06:51:42Z"
      message: |
        top requested
        E0727 06:51:31.918125       1 factory.go:130] failed to sync cache for *v1.Route shared informer
        I0727 06:51:31.918144       1 shared_informer.go:281] stop requested
        E0727 06:51:31.918156       1 factory.go:130] failed to sync cache for *v1.EndpointSlice shared informer
        I0727 06:51:31.919259       1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670} was not added to shared informer because it has stopped already
        I0727 06:51:31.919279       1 shared_informer.go:521] Handler {0x10149f0 0x1014970 0x1014670} was not added to shared informer because it has stopped already
        I0727 06:51:31.919323       1 template.go:704] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
        E0727 06:51:31.920473       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
        I0727 06:51:32.066608       1 router.go:618] template "msg"="router reloaded" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
        W0727 06:51:39.194933       1 reflector.go:324] github.com/openshift/router/pkg/router/template/service_lookup.go:33: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
        I0727 06:51:39.194989       1 trace.go:205] Trace[16201266]: "Reflector ListAndWatch" name:github.com/openshift/router/pkg/router/template/service_lookup.go:33 (27-Jul-2022 06:51:09.194) (total time: 30000ms):
        Trace[16201266]: ---"Objects listed" error:Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout 30000ms (06:51:39.194)
        Trace[16201266]: [30.000478548s] [30.000478548s] END
        E0727 06:51:39.195000       1 reflector.go:138] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
      reason: Error

And the following error was found in ovn-controller (quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f3561052cfbce58451c47fbb3ae99694866e4cce50db113b77a3a78a99906c47/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-85x7t/ovn-controller/ovn-controller/logs/current.log):

2022-07-27T06:23:01.990095126Z 2022-07-27T06:23:01.990Z|00947|ofctrl|INFO|OpenFlow error: OFPT_ERROR (OF1.5) (xid=0x5aba): OFPBFC_MSG_FAILED

Version-Release number of selected component (if applicable):
4.11.0-rc.5-aarch64 --> 4.11.0-rc.6-aarch64
05_aarch64_IPI on AWS & Private cluster & FIPS on & OVN & Etcd Encryption

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
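For reference, a minimal sketch of how the two errors above can be located in the extracted must-gather (assuming the archive is unpacked into ./must-gather/; the quay-io-...-sha256 directory name is the one shown above):

# Router pod: the apiserver i/o timeout in the terminated container status
grep -n "dial tcp 172.30.0.1:443: i/o timeout" \
  must-gather/*/namespaces/openshift-ingress/pods/router-default-86f56f7d65-mkqb5/router-default-86f56f7d65-mkqb5.yaml

# ovn-controller on the same node: the OpenFlow group-mod failure
grep -n "OFPBFC_MSG_FAILED" \
  must-gather/*/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-85x7t/ovn-controller/ovn-controller/logs/current.log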
must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather-124715-307932630.tar.gz
The issue happens on the 4.11.0-rc.5 version, so it should not be related to the upgrade.
Also, this issue cannot always be reproduced.
This bug might be the same as: https://bugzilla.redhat.com/show_bug.cgi?id=2111619#c4

I need a kubeconfig or an sos-report from ip-10-0-61-174.us-east-2.compute.internal so that I can check ovs dump-groups on the node where the router pod lives and make sure the necessary flows were installed properly for the k8s api clusterIP service. If I had access I could also do an ovs trace. So far, from the ovn-controller logs alone provided in the must-gather, I didn't spot the group-mod issue for 10.0.48.125:6443, 10.0.65.181:6443, or 10.0.70.203:6443.
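For reference, this is roughly what would be run from a debug shell on that node once access is available (a sketch only, not verified on this cluster; <router-pod-ofport> is a placeholder that would have to be taken from the router pod's interface on br-int):

# Get a host shell on the affected node (needs a kubeconfig for the cluster).
oc debug node/ip-10-0-61-174.us-east-2.compute.internal
chroot /host

# Dump the OpenFlow groups on br-int and check that buckets exist for the
# kube-apiserver endpoints behind the 172.30.0.1:443 clusterIP.
ovs-ofctl -O OpenFlow15 dump-groups br-int | grep -E '10\.0\.48\.125|10\.0\.65\.181|10\.0\.70\.203'

# Trace a packet from the router pod (10.129.2.120) toward the clusterIP to
# see which group, if any, it hits.
ovs-appctl ofproto/trace br-int 'in_port=<router-pod-ofport>,tcp,nw_src=10.129.2.120,nw_dst=172.30.0.1,tp_dst=443'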
The controller is also seeing long polls:

2022-07-27T05:48:57.775586061Z 2022-07-27T05:48:47.067Z|00005|timeval(ovn_pinctrl0)|WARN|Unreasonably long 163228ms poll interval (0ms user, 3126ms system)
2022-07-27T06:07:04.726817283Z 2022-07-27T06:06:55.617Z|00548|timeval|WARN|Unreasonably long 1318281ms poll interval (0ms user, 19886ms system)
Let's use this bug to track the actual fix from OVN; it will track the bump to an OVN version where this is fixed properly.
Still have not hit this issue today with the following kinds of testing:
1. Creating more than 200 pods across the 3 workers.
2. Restarting openvswitch on a worker.
3. Deleting the openshift-ovn-kubernetes pods.
4. Rebooting all workers.
5. Deleting all 200 pods and recreating them.
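For the record, a rough sketch of the pod-churn part of that testing (the image, deployment name, and label selector below are illustrative, not the exact manifests that were used):

# Create ~200 test pods spread across the workers.
oc create deployment churn-test --image=registry.access.redhat.com/ubi8/ubi-minimal --replicas=200 -- sleep infinity
oc rollout status deployment/churn-test

# Restart openvswitch on a worker.
oc debug node/<worker> -- chroot /host systemctl restart ovs-vswitchd

# Delete the openshift-ovn-kubernetes node pods so they get recreated.
oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node

# Delete all test pods and let the deployment recreate them.
oc delete pod -l app=churn-test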
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399