Bug 2052398

Summary: 4.9 to 4.10 upgrade fails for ovnkube-masters
Product: OpenShift Container Platform
Reporter: Ravi Trivedi <travi>
Component: Networking
Assignee: Andreas Karis <akaris>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: anbhat, bpickard, cblecker, pmannidi, rbryant, sdodson, trozet, wking
Version: 4.10
Keywords: FastFix, ServiceDeliveryBlocker, ServiceDeliveryImpact
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In libovsdb, the goroutine responsible for handling cache updates from ovsdb could stall while writing to an unbuffered channel, all while holding a mutex. Consequence: This could deadlock the OVN-Kubernetes master processes. Fix: libovsdb's handling of concurrency was improved. Result: With this update, the race condition no longer occurs.
Story Points: ---
Clone Of:
Clones: 2058729, 2058762 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:48:33 UTC
Type: Bug
Bug Blocks: 2058729, 2058762    

Description Ravi Trivedi 2022-02-09 07:55:54 UTC
Description of problem:

Cluster upgrade is stuck midway; one of the ovnkube-master pods reports the following:

```
2022-02-09T06:57:58.453422184Z F0209 06:57:58.453412 1986765 node_upgrade.go:122] Failed to detect completion of master upgrade after 30 minutes. Check if ovnkube-masters have upgraded correctly!
2022-02-09T06:57:58.455950112Z goroutine 308 [running]:
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.stacks(0x1)
2022-02-09T06:57:58.455950112Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).output(0x2cf5020, 0x3, 0x0, 0xc00098a1c0, 0x1, {0x226e4a1, 0x10}, 0xc00099a000, 0x0)
2022-02-09T06:57:58.455950112Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).printDepth(0xc0002e5ed8, 0x2e5e3c, 0x0, {0x0, 0x0}, 0x1, {0xc0011c43e0, 0x1, 0x1})
```

Additionally, the openshift-dns-operator namespace has the following event:

```
LAST SEEN   TYPE      REASON                   OBJECT                              MESSAGE
4m10s       Warning   FailedCreatePodSandBox   pod/dns-operator-5d956b48ff-s2qqv   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-5d956b48ff-s2qqv_openshift-dns-operator_64352aa5-d9f6-409b-9ea3-7013ed06c040_0(93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040): error adding pod openshift-dns-operator_dns-operator-5d956b48ff-s2qqv to CNI network "multus-cni-network": [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv/64352aa5-d9f6-409b-9ea3-7013ed06c040:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded...
```

Version-Release number of selected component (if applicable):

Cluster upgrade from 4.9.19 to 4.10.0-rc.1

Actual results:
The dns and network clusteroperators are stuck during the upgrade.

Expected results:
Upgrade should progress as usual

Additional info:
This seems similar to BZ https://bugzilla.redhat.com/show_bug.cgi?id=2011386, but I wasn't sure.

Comment 20 Tim Rozet 2022-02-24 22:40:50 UTC
Great analysis, Andreas! After staring at this for hours, I think we know what is happening; we just don't know the exact conditions under which it happened. We know that the goroutine responsible for handling cache updates from ovsdb is stalled because it is blocked writing to an unbuffered channel. This goroutine holds the cache mutex while it is stalled, which deadlocks the entire ovnkube-master. We don't know exactly why the channel is not being read from; from inspecting the stack, Andreas and I think the goroutine that reads the channel may have been respawned and is now reading from a new channel. Either way, I think we have enough information to come up with a proper fix.
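
For readers unfamiliar with this failure mode, here is a minimal, self-contained Go sketch of the pattern described above. The type, function, and channel names are hypothetical and are not libovsdb's actual API; the point is only to show how a producer that holds a mutex while sending on an unbuffered channel hangs forever once its reader has been respawned onto a new channel, and how every later Lock() caller then hangs behind it.

```
// Hypothetical names, not libovsdb's API: a producer holding the cache mutex
// sends on an unbuffered channel whose reader has been respawned onto a new
// channel. The send blocks forever, and every later mu.Lock() hangs behind it.
// Running this trips Go's runtime deadlock detector.
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct {
	mu      sync.Mutex
	updates chan string
}

func runReader(ch <-chan string, stop <-chan struct{}) {
	for {
		select {
		case msg := <-ch:
			fmt.Println("handled:", msg)
		case <-stop:
			return
		}
	}
}

func main() {
	c := &client{updates: make(chan string)}
	stop := make(chan struct{})
	go runReader(c.updates, stop)

	// A producer captures the current channel before the "reconnect" below.
	oldCh := c.updates
	go func() {
		time.Sleep(50 * time.Millisecond)
		c.mu.Lock()
		defer c.mu.Unlock()
		oldCh <- "row changed" // nobody reads oldCh any more: blocks forever
	}()

	// Simulated reconnect: stop the old reader and respawn it on a new channel.
	close(stop)
	c.updates = make(chan string)
	go runReader(c.updates, make(chan struct{}))

	time.Sleep(100 * time.Millisecond)
	c.mu.Lock() // hangs behind the stuck producer
	fmt.Println("never reached")
}
```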

I can see a potential race condition: when we shut down the client due to a disconnect from the OVSDB server, we stop the goroutine that receives on this channel. At that point the cache update goroutine could still be executing, which would cause this hang. Additionally, the handleCacheErrors goroutine was not functioning correctly: it is supposed to detect cache errors and then disconnect/reconnect the client, but a bug there prevented it from doing so.
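
As an illustration of the direction such a fix can take (a generic sketch under assumed names, not a claim about what the actual libovsdb fix implements), the sender can select between the channel send and a stop signal, so that tearing down the reader on disconnect can never leave the cache-update goroutine blocked while it holds the mutex:

```
// Generic mitigation sketch with hypothetical names (cache, publish, stop):
// the sender selects between the channel send and a stop signal, so shutting
// down the reader on disconnect cannot strand the sender holding the mutex.
package main

import (
	"fmt"
	"sync"
	"time"
)

type cache struct {
	mu      sync.Mutex
	updates chan string
	stop    chan struct{}
}

// publish delivers an update if a reader is active, or gives up cleanly once
// the client has been shut down.
func (c *cache) publish(msg string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	select {
	case c.updates <- msg:
	case <-c.stop: // reader was torn down; drop the update instead of hanging
	}
}

func main() {
	c := &cache{updates: make(chan string), stop: make(chan struct{})}

	// Simulate a disconnect: the reader is stopped before the sender runs.
	close(c.stop)

	done := make(chan struct{})
	go func() {
		c.publish("row changed") // returns instead of deadlocking
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("publish returned cleanly after shutdown")
	case <-time.After(time.Second):
		fmt.Println("publish hung; this is the hang the select avoids")
	}
}
```

Buffering the channel or having the sender re-check which channel is current are other options; the essential property is that a send can always be abandoned once the consumer is gone.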

I pushed what I think will fix this: https://github.com/ovn-org/libovsdb/pull/297

I think with ^ plus Andreas' fix to regenerate the models, we should be covered.

Comment 21 Tim Rozet 2022-02-24 22:43:55 UTC
Ravi, is this bug reproducible? If this is the race we think it is, I would expect it to be more likely during upgrade, when the OVSDB servers are rolled and clients are disconnecting/reconnecting.

Comment 30 errata-xmlrpc 2022-08-10 10:48:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069