Bug 1952819
| Summary: | failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | mchebbi <mchebbi> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | high | CC: | aconstan, aharchin, alegrand, anbhat, anpicker, astoycos, bjarolim, dblack, erooth, fpaoline, gwest, kakkoyun, lcosic, mkarg, nkim, openshift-bugs-escalate, pawankum, pkrupa, rbrattai, spasquie, surbania, trozet, uselessidbr, ychoukse, zzhao |
| Version: | 4.7 | Target Release: | 4.8.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | No Doc Update | Type: | Bug |
| Last Closed: | 2021-07-27 23:03:27 UTC | Clones: | 1962608 (view as bug list) |
| Bug Blocks: | 1962608 | | |
Description
mchebbi@redhat.com, 2021-04-23 08:58:54 UTC
Looking at the logs:

```
ovnkube-master-7dl6w/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:35:24.233947454Z I0419 11:35:24.233935 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
ovnkube-master-7dl6w/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:35:24.327140074Z I0419 11:35:24.327114 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 93.1917ms
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:01.923797846Z I0419 11:40:01.923775 1 pods.go:60] Stale logical port found: openshift-monitoring_prometheus-k8s-0. This logical port will be deleted.
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:01.925549283Z 2021-04-19T11:40:01.925Z|00290|nbctl|INFO|Running command run --if-exists -- lsp-del openshift-monitoring_prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156372889Z I0419 11:40:29.156364 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.2.17/23"],"mac_address":"0a:58:0a:81:02:11","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.17/23","gateway_ip":"10.129.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.208380617Z I0419 11:40:29.208352 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 51.992078ms
ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.208508428Z E0419 11:40:29.208494 1 ovn.go:627] error while creating logical port openshift-monitoring_prometheus-k8s-0 error: connection is shut down
```

One issue I see here is that we sync the pod via an nbctl command, but then try to add the port again using the go-ovn bindings. It's possible we could run into a cache-out-of-sync issue where the go-ovn cache thinks the port is still there, but it has really been deleted via nbctl. We should fix the sync method to use go-ovn as well. I'm not saying that's the root cause of this bug, just noticing it while I'm looking at the logs.
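To make the suspected race concrete, here is a minimal, self-contained Go sketch of the two paths described above. `nbCache` and `runNbctl` are hypothetical stand-ins for the go-ovn client cache and for shelling out to ovn-nbctl; they do not reflect the real go-ovn API:

```go
package main

import (
	"fmt"
	"os/exec"
)

// nbCache is a hypothetical stand-in for the go-ovn client's in-memory
// copy of the nbdb, keyed by logical switch port name.
type nbCache struct {
	ports map[string]bool
}

// LSPExists answers from the cache only; it never re-reads the database.
func (c *nbCache) LSPExists(name string) bool { return c.ports[name] }

// runNbctl deletes the port directly in the database, bypassing the
// bindings entirely; the cache only learns about the deletion later,
// when (and if) the corresponding event is delivered.
func runNbctl(name string) error {
	return exec.Command("ovn-nbctl", "--if-exists", "lsp-del", name).Run()
}

func main() {
	lsp := "openshift-monitoring_prometheus-k8s-0"
	cache := &nbCache{ports: map[string]bool{lsp: true}}

	// Sync path: the stale port is removed via nbctl, outside the bindings.
	_ = runNbctl(lsp)

	// Add path: the bindings consult their cache, which has not yet seen
	// the deletion event, so the port still appears to exist.
	if cache.LSPExists(lsp) {
		fmt.Println("LSP already exists for port:", lsp, "(possibly stale)")
	}
}
```

Doing the sync through go-ovn as well would funnel both paths through the same cache, removing this window.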
What looks odd from these logs is the timing between binding and annotations:

```
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.2.17/23"],"mac_address":"0a:58:0a:81:02:11","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.17/23","gateway_ip":"10.129.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
```

```
[trozet@trozet ci.]$ grep -r prometheus-k8s-0 * | grep ovn-controller
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-whdjz/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:22.817976433Z 2021-04-19T11:26:22Z|00038|binding|INFO|Releasing lport openshift-monitoring_prometheus-k8s-0 from this chassis.
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:36.082279873Z 2021-04-19T11:26:36Z|00028|binding|INFO|Claiming lport openshift-monitoring_prometheus-k8s-0 for this chassis.
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:26:36.082279873Z 2021-04-19T11:26:36Z|00029|binding|INFO|openshift-monitoring_prometheus-k8s-0: Claiming 0a:58:0a:80:02:09 10.128.2.9
namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-vmzzl/ovn-controller/ovn-controller/logs/current.log:2021-04-19T11:38:58.508607415Z 2021-04-19T11:38:58Z|00049|binding|INFO|Releasing lport openshift-monitoring_prometheus-k8s-0 from this chassis.
```

We can see that at 11:26:36 ovn-controller bound 10.128.2.9, but we don't see that new IP annotated until:

```
namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-86vmg/ovnkube-master/ovnkube-master/logs/current.log:2021-04-19T12:19:31.964007339Z I0419 12:19:31.963998 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.9/23"],"mac_address":"0a:58:0a:80:02:09","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.2.3"}],"ip_address":"10.128.2.9/23","gateway_ip":"10.128.2.1"}}] on pod openshift-monitoring/prometheus-k8s-0
```

We think we found what happened here. When ovnkube-master started, it found the prometheus pod not running and deleted its port:

```
2021-04-19T11:40:01.923797846Z I0419 11:40:01.923775 1 pods.go:60] Stale logical port found: openshift-monitoring_prometheus-k8s-0. This logical port will be deleted.
2021-04-19T11:40:01.925549283Z 2021-04-19T11:40:01.925Z|00290|nbctl|INFO|Running command run --if-exists -- lsp-del openshift-monitoring_prometheus-k8s-0
```

There was then a disconnection between ovnkube-master and the nbdb:

```
2021-04-19T11:40:26.193749278Z 2021/04/19 11:40:26 rpc2: client protocol error: read tcp 172.21.8.55:52212->172.21.8.53:9642: read: connection reset by peer
2021-04-19T11:40:26.193812317Z 2021/04/19 11:40:26 ssl:172.21.8.53:9642,ssl:172.21.8.55:9642,ssl:172.21.8.59:9642 disconnected. Reconnecting ...
2021-04-19T11:40:26.701238456Z 2021/04/19 11:40:26 ssl:172.21.8.53:9642,ssl:172.21.8.55:9642,ssl:172.21.8.59:9642 reconnected.
2021-04-19T11:40:28.754648075Z 2021/04/19 11:40:28 rpc2: client protocol error: read tcp 172.21.8.55:45308->172.21.8.53:9641: read: connection reset by peer
2021-04-19T11:40:28.754903588Z 2021/04/19 11:40:28 ssl:172.21.8.53:9641,ssl:172.21.8.55:9641,ssl:172.21.8.59:9641 disconnected. Reconnecting ...
```

which recovered here:

```
2021-04-19T11:40:29.286480541Z 2021/04/19 11:40:29 ssl:172.21.8.53:9641,ssl:172.21.8.55:9641,ssl:172.21.8.59:9641 reconnected.
```

In the meantime, the prometheus pod came back (with a fixed name, which is what the logical switch port is named after):

```
2021-04-19T11:40:29.156372889Z I0419 11:40:29.156364 1 pods.go:297] LSP already exists for port: openshift-monitoring_prometheus-k8s-0
2021-04-19T11:40:29.156415283Z I0419 11:40:29.156403 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":[
2021-04-19T11:40:29.208380617Z I0419 11:40:29.208352 1 pods.go:261] [openshift-monitoring/prometheus-k8s-0] addLogicalPort took 51.992078ms
2021-04-19T11:40:29.208508428Z E0419 11:40:29.208494 1 ovn.go:627] error while creating logical port openshift-monitoring_prometheus-k8s-0 error: connection is shut down
```

What's odd is that the master thinks the LSP is there even though the connection is not up.

This is because the go-ovn binding library we use holds a cache of the contents of the nbdb. The cache gets refreshed when it receives update (or deletion) events, and gets realigned when the master starts or when it reconnects. We found an issue where deletion events that occur while the connection is down are not processed when reconnecting, which causes ovnkube-master to think that the port is still there while it's not.
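For illustration, here is a minimal Go sketch of that missed-deletion hazard, assuming a hypothetical event-driven cache (`portCache`) rather than the real go-ovn types. It shows why fully realigning the cache on reconnection closes the window:

```go
package main

import "fmt"

// portCache is a hypothetical model of an event-driven cache like the
// one the go-ovn bindings keep; it is not the real go-ovn implementation.
type portCache struct {
	ports map[string]bool
}

// applyEvent keeps the cache aligned while the connection is up.
func (c *portCache) applyEvent(op, name string) {
	if op == "delete" {
		delete(c.ports, name)
		return
	}
	c.ports[name] = true
}

// resyncFromDB is what a safe reconnect handler must do: replace the
// cache wholesale with a fresh dump of the database, so that deletions
// missed while disconnected cannot survive.
func (c *portCache) resyncFromDB(db map[string]bool) {
	fresh := make(map[string]bool, len(db))
	for name := range db {
		fresh[name] = true
	}
	c.ports = fresh
}

func main() {
	const lsp = "openshift-monitoring_prometheus-k8s-0"

	// Database and cache start out aligned.
	db := map[string]bool{lsp: true}
	c := &portCache{ports: map[string]bool{}}
	c.applyEvent("add", lsp)

	// The connection drops; the port is deleted server side (lsp-del),
	// but the deletion event never reaches the client.
	delete(db, lsp)

	// Reconnect without resync: the stale entry survives, so the master
	// reports "LSP already exists" for a port that is gone.
	fmt.Println("without resync, cached:", c.ports[lsp]) // true: stale

	// Reconnect with resync: the cache is realigned with the database.
	c.resyncFromDB(db)
	fmt.Println("with resync, cached:", c.ports[lsp]) // false
}
```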
*** Bug 1983109 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438