Description of problem:

Cluster upgrade is stuck midway where one of the ovnkube-master pods reports the following:

```
2022-02-09T06:57:58.453422184Z F0209 06:57:58.453412 1986765 node_upgrade.go:122] Failed to detect completion of master upgrade after 30 minutes. Check if ovnkube-masters have upgraded correctly!
2022-02-09T06:57:58.455950112Z goroutine 308 [running]:
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.stacks(0x1)
2022-02-09T06:57:58.455950112Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).output(0x2cf5020, 0x3, 0x0, 0xc00098a1c0, 0x1, {0x226e4a1, 0x10}, 0xc00099a000, 0x0)
2022-02-09T06:57:58.455950112Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
2022-02-09T06:57:58.455950112Z k8s.io/klog/v2.(*loggingT).printDepth(0xc0002e5ed8, 0x2e5e3c, 0x0, {0x0, 0x0}, 0x1, {0xc0011c43e0, 0x1, 0x1})
```

Additionally, the openshift-dns-operator namespace has the following event:

```
LAST SEEN   TYPE      REASON                   OBJECT                              MESSAGE
4m10s       Warning   FailedCreatePodSandBox   pod/dns-operator-5d956b48ff-s2qqv   (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-5d956b48ff-s2qqv_openshift-dns-operator_64352aa5-d9f6-409b-9ea3-7013ed06c040_0(93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040): error adding pod openshift-dns-operator_dns-operator-5d956b48ff-s2qqv to CNI network "multus-cni-network": [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv/64352aa5-d9f6-409b-9ea3-7013ed06c040:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] [openshift-dns-operator/dns-operator-5d956b48ff-s2qqv 93da98b87abe32e48c421d553ba7a1674ea1478e977355b08da4897293d0d040] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded...
```

Version-Release number of selected component (if applicable):

Cluster upgrade from 4.9.19 to 4.10.0-rc.1

Actual results:

The dns and network clusteroperators are stuck mid-upgrade.

Expected results:

The upgrade should progress as usual.

Additional info:

This seems similar to BZ https://bugzilla.redhat.com/show_bug.cgi?id=2011386, but I wasn't sure.
Great analysis, Andreas! After staring at this for hours, I think we know what is happening; we just don't know the exact conditions under which it happened. We know that the goroutine responsible for handling cache updates from ovsdb is stalled, because it is blocked writing to an unbuffered channel. This goroutine holds the cache mutex while stalled, which deadlocks the entire ovnkube-master. We don't know exactly why the channel is not being read from, other than that Andreas and I inspected the stack and we suspect the goroutine that reads from the channel has been respawned and is now reading from a new channel. Either way, I think we have enough information to come up with a proper fix.

I can see a potential race condition: when we shut down the client due to a disconnect from the OVSDB server, we stop the goroutine that receives on this channel. However, at that point the cache-update goroutine could still be executing, which would cause this hang. Additionally, the handleCacheErrors goroutine was not functioning correctly. It is supposed to detect cache errors and then disconnect/reconnect the client, but a bug there was preventing this.

I pushed what I think will fix this: https://github.com/ovn-org/libovsdb/pull/297

I think with ^ plus Andreas' fix to regenerate the models we should be covered.
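To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above (not the actual libovsdb code; the names and channel shapes are illustrative assumptions): a goroutine takes a mutex and then sends on an unbuffered channel whose receiver has already been stopped, so the send blocks forever and the mutex is never released.

```go
// Hypothetical reproduction of the hang: a cache-update goroutine holds a
// mutex while sending on an unbuffered channel. If the receiver goroutine
// has already been stopped (e.g. during client shutdown on an OVSDB
// disconnect), the send blocks forever and every other user of the mutex
// deadlocks with it.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var cacheMu sync.Mutex
	errCh := make(chan error)       // unbuffered: a send needs a ready receiver
	stopReader := make(chan struct{})

	// Receiver goroutine: drains errCh until it is told to stop.
	go func() {
		for {
			select {
			case err := <-errCh:
				fmt.Println("handled cache error:", err)
			case <-stopReader:
				return // after this, nothing reads errCh anymore
			}
		}
	}()

	// Simulate the client shutdown winning the race against the cache update.
	close(stopReader)
	time.Sleep(100 * time.Millisecond)

	// Cache-update goroutine: takes the mutex, then tries to report an error.
	go func() {
		cacheMu.Lock()
		defer cacheMu.Unlock()
		errCh <- fmt.Errorf("cache update failed") // blocks forever: no receiver
	}()

	time.Sleep(100 * time.Millisecond)

	// Any other code path that needs the cache mutex now hangs as well.
	fmt.Println("trying to take cache mutex...")
	cacheMu.Lock() // never returns; Go reports "all goroutines are asleep - deadlock!"
	fmt.Println("never reached")
}
```

In the real process there are other runnable goroutines, so the Go runtime's deadlock detector does not fire and the ovnkube-master simply hangs, which matches what we observed.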
Ravi, is this bug reproducible? If this is the race we think it is, I would expect it to be more likely during upgrade, when things are disconnecting and reconnecting as the OVSDB servers are rolled.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069