Bug 1934071

Summary: etcd consuming high amount of memory and CPU after upgrade to 4.6.17
Product: OpenShift Container Platform
Reporter: Santiago Maudet <smaudet>
Component: OLM
Assignee: Vu Dinh <vdinh>
OLM sub component: OLM
QA Contact: xzha
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: akashem, ankithom, aos-bugs, bleanhar, dageoffr, ddelcian, dgautam, evalenzu, hgomes, jkaur, krizza, mfojtik, nhale, oarribas, owasserm, pducai, rbost, sbatsche, tflannag, vdinh, xxia
Version: 4.6.z
Target Milestone: ---
Target Release: 4.8.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1951657
Attachments: olm output (flags: none)

Description Santiago Maudet 2021-03-02 12:49:04 UTC
Description of problem:

Customer upgraded cluster from 4.5 to 4.6.17

After the upgrade they started to see the following behavior on the cluster: 

- High cpu/memory usage in master nodes. 
- etcd is unstable. It gets disconnected from endpoints and it has frequent leader elections.
- master nodes are presenting very high cpu and memory usage. kube-apiserver and etcd are using all the resources.
- api is not available unless we restart the platform, after restart api is available just for a minute or two.
- When we lose access to api we are not able to log to master nodes either because cpu/memory is too high. 
- After some time we are able to login again to nodes, but again we lose connection with them because cpu/memory usage.
- etcd was defrag moving from 3GB to 957 MB but after 10m the database size is 2GB again...

- All master nodes are affected with some issue that cause etcd and api-server consume 100% memory and 100% cpu. 
- Hosts stay inaccessible even trough ssh. Restarting the nodes makes api-server return but it last just minutes. 
- Cannot send commands to nodes, crioctl and event top are not able to respond

In the must-gather there are a lot of errors from OLM:

Starting: 06:34 - Ending 07:07
=============================
2021-02-26T06:43:03.556442096-03:00 {"level":"error","ts":1614332583.5563064,"logger":"controllers.adoption","msg":"component not found","error":"InstallPlan.operators.coreos.com \"install-gssr9\" not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/build/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*AdoptionReconciler).adopt\n\t/build/pkg/controller/operators/adoption_controller.go:269\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*AdoptionReconciler).adoptInstallPlan\n\t/build/pkg/controller/operators/adoption_controller.go:383\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*AdoptionReconciler).ReconcileSubscription\n\t/build/pkg/controller/operators/adoption_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/reconcile/reconcile.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}
============================
wc -l = 12519


* Before the upgrade the cluster was running fine, no issues. 


Version-Release number of selected component (if applicable):

OCP 4.6.17

How reproducible:

Customer-specific at this moment; I wasn't able to reproduce the issue.

Additional info:

Comment 2 Sam Batschelet 2021-03-02 13:11:27 UTC
> etcd defrag works, but it starts growing again, causing the master nodes to run out of resources.

The growth of the etcd DB after defragmentation is not a bug in etcd; it is a reflection of scale and workload.

> * Customer needs to restart the master nodes to regain control of the cluster for a few minutes. Then the problem starts again. 

I see no proof that etcd is the root cause of the resource consumption. Can you provide a metrics dump[1] so that we can review cluster performance?

[1] dump-prom.sh
```
#!/usr/bin/env bash

# queue TARGET CMD... — run CMD in the background with its output redirected to TARGET.
# Throttles to at most 45 concurrent background jobs; if FILTER is set, output is piped through it.
function queue() {
  local TARGET="${1}"
  shift
  local LIVE
  LIVE="$(jobs | wc -l)"
  while [[ "${LIVE}" -ge 45 ]]; do
    sleep 1
    LIVE="$(jobs | wc -l)"
  done
  echo "${@}"
  if [[ -n "${FILTER:-}" ]]; then
    "${@}" | "${FILTER}" >"${TARGET}" &
  else
    "${@}" >"${TARGET}" &
  fi
}


ARTIFACT_DIR=$PWD

mkdir -p $ARTIFACT_DIR/metrics

echo "Snapshotting prometheus (may take 15s) ..."
queue ${ARTIFACT_DIR}/metrics/prometheus.tar.gz oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus .
FILTER=gzip queue ${ARTIFACT_DIR}/metrics/prometheus-target-metadata.json.gz oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c "curl -G http://localhost:9090/api/v1/targets/metadata --data-urlencode 'match_target={instance!=\"\"}'"
```
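For reference, a minimal sketch of running this and packaging the snapshot for upload (the split size, part names and checksum file are illustrative assumptions, not part of the script; note that the script backgrounds its `oc exec` jobs, so wait for them before touching the output files):
```
# Source the script so its background jobs belong to this shell, then wait for them.
source ./dump-prom.sh
wait

# Sanity-check the archive before uploading it.
tar tzf metrics/prometheus.tar.gz >/dev/null && echo "archive OK"

# Split the snapshot to fit attachment size limits and record checksums so the
# receiving side can verify reassembly (the 19M split size is arbitrary).
split -b 19M metrics/prometheus.tar.gz prometheus.tar.gz.part-
sha256sum metrics/prometheus.tar.gz prometheus.tar.gz.part-* > SHA256SUMS
```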

Comment 12 Abu Kashem 2021-03-03 00:48:38 UTC
hgomes,

I assembled the parts and tried to extract it, but I am running into the following error:

> tar xvzf prometheus.tar.gz
>
> gzip: stdin: invalid compressed data--format violated
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now

Can you verify that you can assemble it on your end? Maybe my download is corrupt for some reason.
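For what it's worth, a sketch of how I would reassemble and check the parts on either end (the `prometheus.tar.gz.part-*` naming and the checksum file are assumptions on my side):
```
# Concatenate the parts in lexical order to rebuild the archive.
cat prometheus.tar.gz.part-* > prometheus.tar.gz

# If checksums were shared alongside the parts, verify them first.
sha256sum -c SHA256SUMS

# Integrity check of the rebuilt archive without extracting it.
gzip -t prometheus.tar.gz && tar tzf prometheus.tar.gz >/dev/null && echo "archive OK"
```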

Comment 14 Abu Kashem 2021-03-03 15:46:46 UTC
Hi smaudet,
Can you share the file via Google Drive, please?

Thanks!

Comment 15 Abu Kashem 2021-03-03 15:50:57 UTC
Hi smaudet,
I have kicked off a download from the customer site; stay tuned :)

Comment 16 Abu Kashem 2021-03-03 16:15:43 UTC
Hi smaudet,
another ask from my side: did the customer ever capture the audit logs using must-gather? I want to look at the audit logs; they help us track down the bad actors. The Prometheus data does not have the user information.

Comment 19 Abu Kashem 2021-03-03 22:50:06 UTC
smaudet,
I was able to load the prometheus snapshot into a local grafana instance. The metrics cover a window from 2021-02-28 07:00 to 2021-03-02 11:45. Is there a particular time window (where the incidents occur) you want me to focus on?

Can you also tell me what the event TTL is for this cluster? It looks like the number of events reached as high as 40K in this cluster.

Comment 20 Robert Bost 2021-03-03 23:02:15 UTC
> I was able to load the prometheus snapshot into a local grafana instance. The metrics cover a window from 2021-02-28 07:00 to 2021-03-02 11:45. Is there a particular time window (where the incidents occur) you want me to focus on?

At this point, the symptoms seem to be present all of the time. I will double-check with the customer about events on March 2.

> Can you also tell me what the event TTL is for this cluster? It looks like the number of events reached as high as 40K in this cluster.

The event-ttl is configured for 3 hours right now (I believe this is the default). I'll see if we can find some namespaces generating high numbers of events that we can delete from; a quick sketch for that is below.
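One way to find those namespaces might be something like the following sketch (with `-A`, the first column of `oc get events` is the namespace; `openshift-storage` is just an example namespace to drill into):
```
# Count events per namespace, highest first.
oc get events -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# Break a suspicious namespace down by reason and involved object.
oc get events -n openshift-storage --no-headers \
  -o custom-columns=REASON:.reason,KIND:.involvedObject.kind,NAME:.involvedObject.name \
  | sort | uniq -c | sort -rn | head -20
```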

Comment 24 Abu Kashem 2021-03-04 19:13:30 UTC
smaudet,
can you get a dump of all events with 'oc get events --all-namespaces -o json' and share it with me? (I suspect it will time out due to the large number of events.)
It looks like prod-bhp, openshift-marketplace and openshift-logging have more events than others.
I am going to do some more investigation along this line. I see the rate of event API operations is not that high: about 6 event create/patch calls per second.


> https://access.redhat.com/solutions/5170631
> This customer has been suffering from this issue since Feb 25 in their production environment. Do you think decreasing event-ttl could give the customer some relief while we work on this BZ? But as this is not a supported change in OCP 4.x, would it be safe to apply? Or do you not recommend this?

Yes, you can reduce the event TTL; this might improve the situation temporarily. When you make this change, the cluster is marked as 'Upgradeable=False'. This means you have to revert the change before you upgrade the cluster.
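For context, roughly what such an unsupported override looks like (a sketch only; this is not a supported or recommended configuration, the exact fields should be taken from the KCS article above, and the override must be reverted as noted):
```
# Inspect the current event-ttl from the rendered kube-apiserver config
# (may print null if the built-in default is in effect).
oc get cm config -n openshift-kube-apiserver -o jsonpath='{.data.config\.yaml}' \
  | jq '.apiServerArguments["event-ttl"]'

# Unsupported override to lower the TTL; this marks the cluster Upgradeable=False.
oc patch kubeapiserver cluster --type=merge \
  -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"event-ttl":["15m"]}}}}'

# Revert before upgrading the cluster.
oc patch kubeapiserver cluster --type=merge \
  -p '{"spec":{"unsupportedConfigOverrides":null}}'
```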

Comment 26 Abu Kashem 2021-03-04 19:47:04 UTC
smaudet,
Whatever is causing these events to be created is still doing it. I think the frequent etcd leader elections are resetting the TTL, as Suresh explained. So I think it won't get in the way of our investigation.

Suresh, what are your thoughts?

Comment 27 Abu Kashem 2021-03-04 21:18:24 UTC
I see a lot of KubePodCrashLooping alerts; they are firing consistently throughout.
> {alertname="KubePodCrashLooping", severity="Warning"} 63

The crash looping pods are mostly from the "openshift-storage" namespace.
> topk(10, sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) by(namespace))


I tracked the sources of the events:
>  cat events.txt | jq '.items | .[] | .source.component' | sort | uniq -c  
>
>    4893 "kubelet"
>    1292 "node-controller"
>    1106 "cronjob-controller"

Those are the top contenders. Most of the events from the kubelet are due to activity in:
> openshift-storage
> prod-bhp

Can we check what's going on with "openshift-storage"? I would like to know whether this is normal or something has gone wrong with this operator.
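If it helps, the same events.txt dump can be broken down further, for example by namespace and reason (a sketch using the standard core/v1 Event fields):
```
# Top namespace/reason pairs in the event dump.
jq -r '.items[] | "\(.metadata.namespace) \(.reason)"' events.txt \
  | sort | uniq -c | sort -rn | head -20

# Which objects in openshift-storage are generating them.
jq -r '.items[] | select(.metadata.namespace == "openshift-storage")
       | "\(.involvedObject.kind)/\(.involvedObject.name) \(.reason)"' events.txt \
  | sort | uniq -c | sort -rn | head -20
```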

Comment 28 Robert Bost 2021-03-04 21:19:09 UTC
The customer ran the following:

  $ etcdctl get --keys-only --prefix /kubernetes.io/events/ > all-events.out
  $ etcdctl get --keys-only --prefix /kubernetes.io/events/ | tr '/' ' ' | awk '{print $3}' | sort | uniq -c

I will attach the list of all events. 

$ grep -c events all-events.out 
6979

For the second command above, the sum of all events comes out to only 6,979, which I believe is insignificant at this point, correct?
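Count aside, it may be worth checking how many bytes those event values actually occupy, to compare against the multi-GB DB size. A rough sketch, run with the same etcdctl environment as the commands above (`-w json` base64-encodes each value, so scaling the total length by 3/4 gives an approximate byte count):
```
# Approximate total size in bytes of all event values stored in etcd.
etcdctl get --prefix /kubernetes.io/events/ -w json \
  | jq '[.kvs[]?.value | length] | (add // 0) * 3 / 4'
```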

Comment 30 Santiago Maudet 2021-03-05 12:53:30 UTC
Hi,

I've just got feedback from the customer about the event-ttl parameter change to 15m. Unfortunately the issue is still present: no change in the cluster behaviour, and etcd is still consuming a high amount of memory on the nodes.

Comment 31 Santiago Maudet 2021-03-05 15:28:18 UTC
Hi,

The customer performed the defrag and the size of the etcd database went down from 2.5 GB to 1 GB, stabilized, and started to grow again, remaining at 1.4 GB for a couple of minutes and then growing to 1.6 GB.

In terms of memory consumption, the etcd pods in the openshift-etcd namespace were low at first but started to grow:

master01: 11GB
master02: 15GB
master03: 9GB (graphs of etcd for the last 45 minutes are attached).

The cluster is showing the same behavior. No stability is observed after the event-ttl change and etcd defrag.
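For tracking this over time, a sketch of the checks that can be run against the etcd pods (the pod name below is a placeholder; `oc get pods -n openshift-etcd -l app=etcd` lists the real ones):
```
# Database size, leader and raft state per endpoint.
oc exec -n openshift-etcd etcd-master01.example.com -c etcdctl -- \
  etcdctl endpoint status -w table --cluster

# Memory and CPU of the etcd pods as reported by the metrics pipeline.
oc adm top pods -n openshift-etcd
```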

Comment 54 Daniel Del Ciancio 2021-03-19 21:26:02 UTC
Is there a plan to backport this fix to 4.6.z?

Comment 59 peter ducai 2021-04-07 07:21:53 UTC
Created attachment 1769774 [details]
olm output

Comment 64 xzha 2021-04-25 11:37:28 UTC
verify:

zhaoxia@xzha-mac operator-registry % oc exec catalog-operator-786555fc6b-7qnbq -- olm --version
OLM version: 0.17.0
git commit: 9fa1f1249e3acc15b1f628d5f96e7b7047e9f176


1) install etcd operator
zhaoxia@xzha-mac 41174 % cat sub-etcd.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: test-41174
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace

2) Delete the CSV to trigger a ResolutionFailed event, for example as shown below.
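A sketch of that step, with the CSV name and namespace taken from the install above and the logs below:
```
# Delete the CSV referenced by the Subscription; per the steps above this
# triggers the ResolutionFailed events shown in the catalog-operator log.
oc delete csv etcdoperator.v0.9.4 -n test-41174
```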

3) check catalog-operator log

time="2021-04-25T09:15:42Z" level=debug msg="resolving sources" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking if subscriptions need update" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking for existing installplan" channel=singlenamespace-alpha id=+SsMO namespace=test-41174 pkg=etcd source=community-operators sub=etcd-operator
time="2021-04-25T09:15:42Z" level=debug msg="resolving subscriptions in namespace" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=warning msg="an error was encountered during reconciliation" error="Operation cannot be fulfilled on subscriptions.operators.coreos.com \"etcd-operator\": the object has been modified; please apply your changes to the latest version and try again" event=update reconciling="*v1alpha1.Subscription" selflink=
E0425 09:15:42.579105       1 queueinformer_operator.go:290] sync {"update" "test-41174/etcd-operator"} failed: Operation cannot be fulfilled on subscriptions.operators.coreos.com "etcd-operator": the object has been modified; please apply your changes to the latest version and try again
time="2021-04-25T09:15:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2021-04-25T09:15:42Z" level=debug msg="reconciliation successful" event=update reconciling="*v1alpha1.Subscription" selflink= state="*subscription.catalogHealthyState"
time="2021-04-25T09:15:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2021-04-25T09:15:42Z" level=debug msg="reconciliation successful" event=update reconciling="*v1alpha1.Subscription" selflink= state="*subscription.catalogHealthyState"
time="2021-04-25T09:15:42Z" level=debug msg="resolution failed" error="constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 provide EtcdCluster (etcd.database.coreos.com/v1beta2), subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="resolving sources" id=yQKL6 namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking if subscriptions need update" id=yQKL6 namespace=test-41174
I0425 09:15:42.596426       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"test-41174", UID:"492cbf3d-0075-4087-b9d5-26d1e37ec23d", APIVersion:"v1", ResourceVersion:"31724", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 provide EtcdCluster (etcd.database.coreos.com/v1beta2), subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0
time="2021-04-25T09:15:42Z" level=debug msg="checking for existing installplan" channel=singlenamespace-alpha id=yQKL6 namespace=test-41174 pkg=etcd source=community-operators sub=etcd-operator
time="2021-04-25T09:15:42Z" level=debug msg="resolving subscriptions in namespace" id=yQKL6 namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="resolution failed" error="constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 originate from package etcd, subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0" id=yQKL6 namespace=test-41174
I0425 09:15:42.622905       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"test-41174", UID:"492cbf3d-0075-4087-b9d5-26d1e37ec23d", APIVersion:"v1", ResourceVersion:"31724", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 originate from package etcd, subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcd-operator namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4-etcd-operator-7c45f9c46 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4-etcd-operator-7c45f9c46 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4 namespace=test-41174
time="2021-04-25T09:16:11Z" level=debug msg="handling object deletion" name=etcd-operator-59c8fccd54-6qst5 namespace=test-41174


There is no "creating installplan" log entry after the "ResolutionFailed" event happens.

lgtm, verified

Comment 71 errata-xmlrpc 2021-07-27 22:48:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 72 Red Hat Bugzilla 2023-09-18 00:25:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days