The kubelet has a bug where something triggers it to send enormous volumes of WATCH requests for Secrets, taking down entire clusters. Please see the data and discussion in this post-mortem: https://docs.google.com/document/d/1VfwmECbpCnDTOb0JVE37wcEQm4KnGwbatgIynTa6Wvg/edit#heading=h.5qe176r0lgq5
I suspected this was due to the lack of immutable Secrets / ConfigMaps [1] (still beta?) in these clusters. ci-operator is being improved to make its mounts immutable [2,3]. But even in the absence of that feature or considerate users, it seems like the kubelet could do a better job throttling its WATCH requests.

[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1412-immutable-secrets-and-configmaps/README.md
[2]: https://github.com/openshift/ci-tools/pull/1786
[3]: https://github.com/openshift/ci-tools/pull/1787
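For context, marking a Secret immutable is a single field on the object, and per the KEP the kubelet stops watching immutable objects after the initial fetch, which is exactly the WATCH load in question here. A minimal sketch, with made-up names and data purely for illustration:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: example-pull-credentials   # hypothetical name, for illustration only
  namespace: ci-op-example         # hypothetical namespace
type: Opaque
stringData:
  token: not-a-real-token
immutable: true                    # the kubelet does not watch objects marked immutable
EOF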
Not a regression, and AFAIK a relatively rare condition, so not a blocker.
What was the RCA?
Hi Steve,

Sorry for the delay on the RCA. I have completed an investigation. This does not appear to be a kubelet issue.

I pulled node logs off the cluster from 3 different nodes that I confirmed went offline during the incident, and looked for WATCH requests. What I found was a large number of reflector log entries, across different workloads, whose frequency increased significantly around the time of the incident, like so:

Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113131 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"deployer-dockercfg-ltgw5"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113118 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-token-9nthv"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113142 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"ci-pull-credentials"
Mar 11 14:44:38 ip-10-0-143-249 hyperkube[1744]: I0311 22:44:38.113240 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"ci-operator-dockercfg-c8xhg"

<note the large time gap here, log frequency increases rapidly after this point>

Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226059 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-dockercfg-tczmg"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226073 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"deployer-token-w7jwq"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226084 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226102 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade-token-9ks24"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226121 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"default-token-g4h82"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226136 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"gce-sa-credentials-gcs-publisher"
Mar 11 17:21:28 ip-10-0-143-249 hyperkube[1744]: I0312 01:21:28.226165 1744 reflector.go:219] Starting reflector *v1.Secret (0s) from object-"ci-op-p17y3nk6"/"e2e-ovirt-upgrade-token-8hzg7"

Whereas before and after the incident we might only see messages for a single workload per hour (see the three-hour gap in the example above), during the incident we might see 5-6 of these occur simultaneously on the node.

My hypothesis is that there was a cascading failure due to a misbehaving application, which impacted capacity and caused a lot of churn. When too many large workloads were simultaneously scheduled on a node, the node dropped offline, then they were rescheduled on another node, causing a worsening effect, which impacted capacity and hence took out other components.

I'm wondering why it was possible to schedule so many ci-op/* workloads at once; should their requests/limits be examined?

I don't see any smoking guns suggesting that the kubelet demonstrated pathological behaviour. I believe this was an application-level issue, so I am closing this bug. Immutable Secrets/ConfigMaps might help mitigate any issues with WATCH overhead in the future.
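For anyone retracing this on a live cluster, this is roughly the kind of per-hour tally that surfaces the frequency change. The node name is a placeholder and this is only an illustration, not the exact query I ran:

$ oc debug node/<node-that-went-offline> -- chroot /host journalctl --no-pager \
    | grep "Starting reflector" \
    | awk '{print $1, $2, substr($3, 1, 2) ":00"}' \
    | sort | uniq -c

The hourly buckets around the incident window should stand out clearly against the quiet hours on either side.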
> When too many large workloads were simultaneously scheduled on a node, the node dropped offline, then they were rescheduled on another node, causing a worsening effect, which impacted capacity and hence took out other components.

I thought CI jobs were one-shot, which is why they have 'maxUnavailable: 0' PDB guards. But regardless, API-server capacity is a cluster-scoped resource. Node availability is a node-scoped resource. CI-operator behavior is user workload. Should it be this easy for user workload to DoS the shared components? Is it appropriate for nodes to go unavailable when they have many large workloads scheduled?

I feel like we want guards at each level, with the API-server protecting itself against greedy nodes, nodes protecting themselves against greedy workloads, and workloads trying to behave themselves. I'd expect this bug to be able to find some sort of improvement we can make to node guards against greedy workloads (the immutable stuff in comment 2 is about the workload trying to be better behaved; I'm not sure what the API-server has outside of bug 1937916).
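To make the requests/limits question concrete, one shape such a guard could take is a LimitRange that forces defaults onto pods that don't declare their own, so the scheduler's view of node capacity stays honest. This is only a sketch with made-up names and values (the namespace is borrowed from the logs above purely as an example), not a proposal for the actual CI namespaces:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: ci-op-defaults          # hypothetical name
  namespace: ci-op-p17y3nk6     # namespace taken from the logs above, as an example only
spec:
  limits:
  - type: Container
    defaultRequest:             # applied when a container omits resource requests
      cpu: 100m
      memory: 200Mi
    default:                    # applied when a container omits resource limits
      cpu: "2"
      memory: 4Gi
EOF

That only addresses the node-level guard; it doesn't by itself protect the API server, which is the other half of the question.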
> I'd expect this bug to be able to find some sort of improvement we can make to node guards against greedy workloads

A lot of those sorts of improvements are gated on feature work in node, such as swap enablement and cgroups v2. It's coming, but I don't think there's anything specific from this bug that warrants keeping it open.
Let me revive this. We apparently had an upstream patch come in which should address this for 4.8:

https://github.com/openshift/kubernetes/commit/57a3b0abd678c66a9a04e553b6d6ae49671a4779

Per https://bugzilla.redhat.com/show_bug.cgi?id=1951815, we now have a request to backport that fix back to 4.6.
Note to QE: the cascading failure may be difficult to reproduce. I think if you reboot nodes in a cluster, you should be able to measure the number of WATCHes opened from kubelets before and after the patch, and confirm they are reduced.

This commit landed in https://github.com/openshift/kubernetes/pull/641 as part of one of the upstream rebases.
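Something along these lines should do for the check, assuming you take the same count on a build without the patch and on one with it (the node name is a placeholder):

$ oc debug node/<worker-node>
sh-4.4# chroot /host
sh-4.4# journalctl -b | grep -ic "starting reflector"    # reflector starts for the current boot only

A noticeably lower count on the patched build, with a comparable set of pods on the node, is the signal to look for.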
Checked on 4.8.0-0.nightly-2021-04-30-201824 and rebooted the node multiple times. I see the number of watch calls is low.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-30-201824   True        False         27m     Cluster version is 4.8.0-0.nightly-2021-04-30-201824

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-136-151.us-east-2.compute.internal   Ready    worker   43m   v1.21.0-rc.0+aa1dc1f
ip-10-0-138-235.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f
ip-10-0-177-201.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f
ip-10-0-180-38.us-east-2.compute.internal    Ready    worker   42m   v1.21.0-rc.0+aa1dc1f
ip-10-0-194-94.us-east-2.compute.internal    Ready    worker   43m   v1.21.0-rc.0+aa1dc1f
ip-10-0-197-138.us-east-2.compute.internal   Ready    master   52m   v1.21.0-rc.0+aa1dc1f

$ oc debug node/ip-10-0-136-151.us-east-2.compute.internal
Starting pod/ip-10-0-136-151us-east-2computeinternal-debug ...
...
sh-4.4# journalctl | grep -i "Starting reflector" | wc -l
292
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438