Description of problem:

Cluster upgrade is stuck midway where one of the ovnkube-master pods reports the following:

```
2022-02-09T06:57:58.453422184Z F0209 06:57:58.453412 1986765 node_upgrade.go:122] Failed to detect completion of master upgrade after 30 minutes. Check if ovnkube-masters have upgraded correctly!
2022-02-09T06:57:58.455950112Z goroutine 308 [running]:
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.stacks(0x1)
2022-02-09T06:57:58.455950112Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).output(0x2cf5020, 0x3, 0x0, 0xc00098a1c0, 0x1, {0x226e4a1, 0x10}, 0xc00099a000, 0x0)
2022-02-09T06:57:58.455950112Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).printDepth(0xc0002e5ed8, 0x2e5e3c, 0x0, {0x0, 0x0}, 0x1, {0xc0011c43e0, 0x1, 0x1})
```

Additionally, the openshift-dns-operator namespace has the following event:

```
LAST SEEN   TYPE      REASON                   OBJECT                              MESSAGE
4m10s       Warning   FailedCreatePodSandBox   pod/dns-operator-5d956b48ff-s2qqv   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-5d956b48ff-s2qqv_openshift-dns-operator_64352aa5-d9f6-409b-9ea3-7013ed06c040_0(93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040): error adding pod openshift-dns-operator_dns-operator-5d956b48ff-s2qqv to CNI network "multus-cni-network": [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv/64352aa5-d9f6-409b-9ea3-7013ed06c040:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded...
```

Version-Release number of selected component (if applicable):

Cluster upgrade from 4.9.19 to 4.10.0-rc.1

Actual results:

The dns and network clusteroperators are stuck mid-upgrade.

Expected results:

The upgrade should progress as usual.

Additional info:

This seems similar to BZ https://bugzilla.redhat.com/show_bug.cgi?id=2011386, but I wasn't sure.
Great analysis, Andreas! After staring at this for hours, I think we know what is happening; we just don't know the exact conditions under which it happened. We know that the goroutine responsible for handling cache updates from ovsdb is stalled, because it is blocked writing to an unbuffered channel. This goroutine holds the cache mutex while stalled, which deadlocks the entire ovnkube-master. We don't know exactly why the channel is not being read from, other than that Andreas and I inspected the stack and we suspect the goroutine that reads from the channel has been respawned and is now reading from a new channel. Either way, I think we have enough information to come up with a proper fix.

I can see a potential race condition: when we shut down the client due to a disconnect from the OVSDB server, we stop the goroutine that receives on this channel. However, at that point the cache-update goroutine could still be executing, which would cause this hang. Additionally, the handleCacheErrors goroutine was not functioning correctly. It is supposed to detect cache errors and then disconnect/reconnect the client, but a bug there was preventing this.

I pushed what I think will fix this: https://github.com/ovn-org/libovsdb/pull/297

I think with ^ plus Andreas' fix to regenerate the models we should be covered.
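To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above (not the actual libovsdb code; the names and channel shapes are illustrative assumptions): a goroutine takes a mutex and then sends on an unbuffered channel whose receiver has already been stopped, so the send blocks forever and the mutex is never released.

```go
// Hypothetical reproduction of the hang: a cache-update goroutine holds a
// mutex while sending on an unbuffered channel. If the receiver goroutine
// has already been stopped (e.g. during client shutdown on an OVSDB
// disconnect), the send blocks forever and every other user of the mutex
// deadlocks with it.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var cacheMu sync.Mutex
	errCh := make(chan error)       // unbuffered: a send needs a ready receiver
	stopReader := make(chan struct{})

	// Receiver goroutine: drains errCh until it is told to stop.
	go func() {
		for {
			select {
			case err := <-errCh:
				fmt.Println("handled cache error:", err)
			case <-stopReader:
				return // after this, nothing reads errCh anymore
			}
		}
	}()

	// Simulate the client shutdown winning the race against the cache update.
	close(stopReader)
	time.Sleep(100 * time.Millisecond)

	// Cache-update goroutine: takes the mutex, then tries to report an error.
	go func() {
		cacheMu.Lock()
		defer cacheMu.Unlock()
		errCh <- fmt.Errorf("cache update failed") // blocks forever: no receiver
	}()

	time.Sleep(100 * time.Millisecond)

	// Any other code path that needs the cache mutex now hangs as well.
	fmt.Println("trying to take cache mutex...")
	cacheMu.Lock() // never returns; Go reports "all goroutines are asleep - deadlock!"
	fmt.Println("never reached")
}
```

In the real process there are other runnable goroutines, so the Go runtime's deadlock detector does not fire and the ovnkube-master simply hangs, which matches what we observed.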
Ravi, is this bug reproducible? If this is the race we think it is, I would expect it to be more likely during upgrade, when things are disconnecting and reconnecting as the OVSDB servers are rolled.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069