The kubelet has a bug where something triggers it to send enormous volumes of WATCH requests for Secrets, taking down entire clusters. Please see the data and discussion in this post-mortem: https://docs.google.com/document/d/1VfwmECbpCnDTOb0JVE37wcEQm4KnGwbatgIynTa6Wvg/edit#heading=h.5qe176r0lgq5
I suspected this was due to the lack of immutable Secrets / ConfigMaps [1] (still beta?) in these clusters. ci-operator is being improved to make its mounts immutable [2,3]. But even in the absence of that feature or considerate users, it seems like the kubelet could do a better job throttling its WATCH requests.

[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1412-immutable-secrets-and-configmaps/README.md
[2]: https://github.com/openshift/ci-tools/pull/1786
[3]: https://github.com/openshift/ci-tools/pull/1787
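For context, marking a Secret immutable is a single field on the object, and per the KEP the kubelet stops watching immutable objects after the initial fetch, which is exactly the WATCH load in question here. A minimal sketch, with made-up names and data purely for illustration:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: example-pull-credentials   # hypothetical name, for illustration only
  namespace: ci-op-example         # hypothetical namespace
type: Opaque
stringData:
  token: not-a-real-token
immutable: true                    # the kubelet does not watch objects marked immutable
EOF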
Not a regression, and AFAIK a relatively rare condition, so not a blocker.
What was the RCA?
Hi Steve,

Sorry for the delay on the RCA. I have completed an investigation. This does not appear to be a kubelet issue.

I pulled node logs off the cluster from 3 different nodes that I confirmed went offline during the incident, and looked for WATCH requests. What I found was a large number of reflector log entries, across different workloads, whose frequency increased significantly around the time of the incident, like so:

Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113131 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"deployer-dockercfg-ltgw5"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113118 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-token-9nthv"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113142 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"ci-pull-credentials"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113240 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"ci-operator-dockercfg-c8xhg"

<note the large time gap here, log frequency increases rapidly after this point>

Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226059 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-dockercfg-tczmg"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226073 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"deployer-token-w7jwq"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226084 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226102 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade-token-9ks24"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226121 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-token-g4h82"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226136 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"gce-sa-credentials-gcs-publisher"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226165 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade-token-8hzg7"

Whereas before and after the incident we might only see messages for a single workload per hour (see the three-hour gap in the example above), during the incident we might see 5-6 of these occur simultaneously on the node.

My hypothesis is that there was a cascading failure due to a misbehaving application, which impacted capacity and caused a lot of churn. When too many large workloads were simultaneously scheduled on a node, the node dropped offline, then they were rescheduled on another node, causing a worsening effect, which impacted capacity and hence took out other components.

I'm wondering why it was possible to schedule so many ci-op/* workloads at once; should their requests/limits be examined?

I don't see any smoking guns suggesting that the kubelet demonstrated pathological behaviour. I believe this was an application-level issue, so I am closing this bug. Immutable Secrets/ConfigMaps might help mitigate any issues with WATCH overhead in the future.
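For anyone retracing this on a live cluster, this is roughly the kind of per-hour tally that surfaces the frequency change. The node name is a placeholder and this is only an illustration, not the exact query I ran:

$ oc debug node/<node-that-went-offline> -- chroot /host journalctl --no-pager \
    | grep "Starting reflector" \
    | awk '{print $1, $2, substr($3, 1, 2) ":00"}' \
    | sort | uniq -c

The hourly buckets around the incident window should stand out clearly against the quiet hours on either side.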
> When too many large workloads were simultaneously scheduled on a node, the node dropped offline, then they were rescheduled on another node, causing a worsening effect, which impacted capacity and hence took out other components.

I thought CI jobs were one-shot, which is why they have 'maxUnavailable: 0' PDB guards. But regardless, API-server capacity is a cluster-scoped resource. Node availability is a node-scoped resource. CI-operator behavior is user workload. Should it be this easy for user workload to DoS the shared components? Is it appropriate for nodes to go unavailable when they have many large workloads scheduled?

I feel like we want guards at each level, with the API-server protecting itself against greedy nodes, nodes protecting themselves against greedy workloads, and workloads trying to behave themselves. I'd expect this bug to be able to find some sort of improvement we can make to node guards against greedy workloads (the immutable stuff in comment 2 is about the workload trying to be better behaved; I'm not sure what the API-server has outside of bug 1937916).
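To make the requests/limits question concrete, one shape such a guard could take is a LimitRange that forces defaults onto pods that don't declare their own, so the scheduler's view of node capacity stays honest. This is only a sketch with made-up names and values (the namespace is borrowed from the logs above purely as an example), not a proposal for the actual CI namespaces:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: ci-op-defaults          # hypothetical name
  namespace: ci-op-p17y3nk6     # namespace taken from the logs above, as an example only
spec:
  limits:
  - type: Container
    defaultRequest:             # applied when a container omits resource requests
      cpu: 100m
      memory: 200Mi
    default:                    # applied when a container omits resource limits
      cpu: "2"
      memory: 4Gi
EOF

That only addresses the node-level guard; it doesn't by itself protect the API server, which is the other half of the question.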
> I'd expect this bug to be able to find some sort of improvement we can make to node guards against greedy workloads

A lot of those sorts of improvements are gated on feature work in node, such as swap enablement and cgroups v2. It's coming, but I don't think there's anything specific from this bug that warrants keeping it open.
Let me revive this. We apparently had an upstream patch come in which should address this for 4.8:

https://github.com/openshift/kubernetes/commit/57a3b0abd678c66a9a04e553b6d6ae49671a4779

Per https://bugzilla.redhat.com/show_bug.cgi?id=1951815, we now have a request to backport that fix back to 4.6.
Note to QE: the cascading failure may be difficult to reproduce. I think if you reboot nodes in a cluster, you should be able to measure the number of WATCHes opened from kubelets before and after the patch, and confirm they are reduced.

This commit landed in https://github.com/openshift/kubernetes/pull/641 as part of one of the upstream rebases.
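Something along these lines should do for the check, assuming you take the same count on a build without the patch and on one with it (the node name is a placeholder):

$ oc debug node/<worker-node>
sh-4.4# chroot /host
sh-4.4# journalctl -b | grep -ic "starting reflector"    # reflector starts for the current boot only

A noticeably lower count on the patched build, with a comparable set of pods on the node, is the signal to look for.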
Checked on 4.8.0-0.nightly-2021-04-30-201824 and rebooted the node multiple times. I see the number of watch calls is low.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-30-201824   True        False         27m     Cluster version is 4.8.0-0.nightly-2021-04-30-201824

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-151.us-east-2.compute.internal   Ready    worker   43m   v1.21.0-rc.0+aa1dc1f
ip-10-0-138-235.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f
ip-10-0-177-201.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f
ip-10-0-180-38.us-east-2.compute.internal    Ready    worker   42m   v1.21.0-rc.0+aa1dc1f
ip-10-0-194-94.us-east-2.compute.internal    Ready    worker   43m   v1.21.0-rc.0+aa1dc1f
ip-10-0-197-138.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f

$ oc debug node/ip-10-0-136-151.us-east-2.compute.internal
Starting pod/ip-10-0-136-151us-east-2computeinternal-debug ...
...
sh-4.4# journalctl | grep -i "Starting reflector" | wc -l
292
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438