Bug 1934071
Summary: etcd consuming high amount of memory and CPU after upgrade to 4.6.17
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OLM
Version: 4.6.z
Target Release: 4.8.0
Hardware: x86_64
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Santiago Maudet <smaudet>
Assignee: Vu Dinh <vdinh>
QA Contact: xzha
CC: akashem, ankithom, aos-bugs, bleanhar, dageoffr, ddelcian, dgautam, evalenzu, hgomes, jkaur, krizza, mfojtik, nhale, oarribas, owasserm, pducai, rbost, sbatsche, tflannag, vdinh, xxia
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 22:48:44 UTC
Bug Blocks: 1951657
Description
Santiago Maudet
2021-03-02 12:49:04 UTC
> etcd defrag works, but it starts growing again, causing the master nodes to run short of resources.

The growth of the etcd db after defragmentation is not a bug in etcd; it is a reflection of scale and workload.

> * Customer needs to restart the master nodes to regain control of the cluster for a few minutes. Then the problem starts again.

I see no proof that etcd is the root cause of the resource consumption. Can you provide a metrics dump[1] so that we can review cluster performance?

[1] dump-prom.sh

```
#!/usr/bin/env bash

# Run a command in the background, writing its output to TARGET,
# while keeping at most 45 background jobs in flight.
function queue() {
  local TARGET="${1}"
  shift
  local LIVE
  LIVE="$(jobs | wc -l)"
  while [[ "${LIVE}" -ge 45 ]]; do
    sleep 1
    LIVE="$(jobs | wc -l)"
  done
  echo "${@}"
  # If FILTER is set, pipe the output through it (e.g. gzip) before writing.
  if [[ -n "${FILTER:-}" ]]; then
    "${@}" | "${FILTER}" >"${TARGET}" &
  else
    "${@}" >"${TARGET}" &
  fi
}

ARTIFACT_DIR=$PWD
mkdir -p "${ARTIFACT_DIR}/metrics"

echo "Snapshotting prometheus (may take 15s) ..."
queue "${ARTIFACT_DIR}/metrics/prometheus.tar.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus .
FILTER=gzip queue "${ARTIFACT_DIR}/metrics/prometheus-target-metadata.json.gz" oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c "curl -G http://localhost:9090/api/v1/targets/metadata --data-urlencode 'match_target={instance!=\"\"}'"
```

hgomes, I assembled the parts and tried to unzip it, but I am running into the following error:

> tar xvzf prometheus.tar.gz
>
> gzip: stdin: invalid compressed data--format violated
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now

Can you verify that you can assemble it at your end? Maybe my download is corrupt for some reason.

Hi smaudet, can you share the file in a Google Drive please? Thanks!

Hi smaudet, I have kicked off a download from the customer site, stay tuned :)

Hi smaudet, another ask from my side: did the customer ever capture the audit logs using must-gather? I wanted to look at the audit logs; they help us track the bad actors. The prometheus data does not have the user information.

smaudet, I was able to load the prometheus snapshot into a local grafana instance. The metrics cover a window from 2021-02-28 07:00 to 2021-03-02 11:45. Is there a particular time window (where the incidents occur) you want me to focus on? Can you also tell me what the event TTL is for this cluster? It looks like the number of events reached as high as 40K in this cluster.

> I was able to load the prometheus snapshot into a local grafana instance. The metrics cover a window from 2021-02-28 07:00 to 2021-03-02 11:45. Is there a particular time window (where the incidents occur) you want me to focus on?

At this point, the symptoms seem to be present all of the time. I will double-check with the customer about events on March 2.

> Can you also tell me what the event TTL is for this cluster? It looks like the number of events reached as high as 40K in this cluster.

The event-ttl is configured for 3 hours right now (I believe this is the default). I'll see if we can find some namespaces generating high numbers of events we can delete from.
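If such namespaces are found, clearing their accumulated events is straightforward; a minimal sketch, assuming cluster-admin access, with the namespace name left as a placeholder:

```
# Delete all events in one noisy namespace (events are transient records, safe to purge).
oc delete events --all -n <noisy-namespace>
```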
smaudet, can you get a dump of all events with 'oc get events --all-namespaces -o json' and share it with me? (I suspect it will time out due to the large number of events.)
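If the plain dump does time out, server-side list chunking can help; a minimal sketch, assuming an oc/kubectl version that supports the standard --chunk-size and --request-timeout flags:

```
# Fetch events in pages of 500 instead of one huge LIST call,
# with a generous client-side timeout.
oc get events --all-namespaces -o json --chunk-size=500 --request-timeout=5m > events.json
```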
Looks like prod-bhp, openshift-marketplace and openshift-logging have more events than others.
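For reference, per-namespace counts like these can be derived from the dump with a one-liner; a sketch, assuming the dump was saved as events.json as in the sketch above (the exact query used here is not shown in the report):

```
# Count events per namespace, highest first.
jq -r '.items[].metadata.namespace' events.json | sort | uniq -c | sort -rn | head
```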
I am going to do some more investigation along this line. I see the rate of event API operations is not that high: about 6 event create/patch operations per second.
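A rate like this can be read from the apiserver request metrics; a minimal sketch, assuming the standard apiserver_request_total metric is available and using the same in-pod curl approach as dump-prom.sh above:

```
# Per-second rate of event create/patch calls over the last 5 minutes.
oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- /bin/bash -c \
  "curl -sG http://localhost:9090/api/v1/query \
   --data-urlencode 'query=sum(rate(apiserver_request_total{resource=\"events\",verb=~\"create|patch\"}[5m]))'"
```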
> https://access.redhat.com/solutions/5170631
> This customer has been suffering from this issue since last Feb 25 in their production environment. Do you think decreasing event-ttl could give the customer some relief while we work on this BZ? But since this is not a supported change in OCP 4.x, would it be safe to apply? Or do you not recommend it?
Yes, you can reduce the event TTL; this might improve the situation temporarily. When you make this change, the cluster is marked as 'Upgradeable=False'. This means you have to revert the change before you upgrade the cluster.
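A sketch of what such an unsupported override looks like, assuming the mechanism described in the linked KCS article (the exact fields should be confirmed against the article before use):

```
# Unsupported override: lower event-ttl on the kube-apiserver.
# This marks the cluster Upgradeable=False until reverted.
oc patch kubeapiserver cluster --type=merge -p \
  '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"event-ttl":["15m"]}}}}'

# Revert before upgrading:
oc patch kubeapiserver cluster --type=merge -p \
  '{"spec":{"unsupportedConfigOverrides":null}}'
```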
smaudet, whatever is causing these events to be created, it is still doing it. I think the frequent etcd leader elections are resetting the TTL timer, as Suresh explained, so I think it won't get in the way of our investigation. Suresh, what are your thoughts?

I see a lot of KubePodCrashLooping alerts firing; they fire consistently throughout.

> {alertname="KubePodCrashLooping", severity="Warning"} 63

The crash-looping pods are mostly from the "openshift-storage" namespace:

> topk(10, sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) by(namespace))

I tracked the sources of the events:

> cat events.txt | jq '.items | .[] | .source.component' | sort | uniq -c
>
>   4893 "kubelet"
>   1292 "node-controller"
>   1106 "cronjob-controller"

These are the top contenders. Most of the events from kubelet are due to activities in:

> openshift-storage
> prod-bhp

Can we check what's going on with "openshift-storage"? I would like to know if this is normal or whether something has gone wrong with this operator.

The customer ran the following:

```
$ etcdctl get --keys-only --prefix /kubernetes.io/events/ > all-events.out
$ etcdctl get --keys-only --prefix /kubernetes.io/events/ | tr '/' ' ' | awk '{print $3}' | sort | uniq -c
```

I will attach the list of all events.

```
$ grep -c events all-events.out
6979
```

For the second command above, the sum of all events comes out to only 6,979, which I believe is insignificant at this point, correct?

Hi, I've just got feedback from the customer about the event-ttl parameter change to 15m. Unfortunately the issue is still present; no change in the cluster's behaviour. etcd is still consuming a high amount of memory on the nodes.

The customer performed the defrag and the size of the etcd database went down from 2.5GB to 1GB, stabilized, and started to grow again, staying at 1.4GB for a couple of minutes and then growing to 1.6GB. In terms of memory consumption, the etcd pods in the openshift-etcd namespace were low at first but started to grow:

master01: 11GB
master02: 15GB
master03: 9GB

(Attached: graphics from etcd over the last 45 minutes.) The cluster is showing the same behavior. No stability is observed after the event-ttl change and etcd defrag.
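For reference, commands of the kind used for the status check and defragmentation described above (a sketch; the pod name is a placeholder taken from 'oc get pods -n openshift-etcd', and endpoints/flags should be adapted to the cluster):

```
# Show DB size, leader, and raft status for this member.
oc exec -n openshift-etcd etcd-master01 -c etcdctl -- etcdctl endpoint status -w table

# Defragment this member (run one member at a time).
oc exec -n openshift-etcd etcd-master01 -c etcdctl -- etcdctl defrag
```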
Is there a plan to backport this fix to 4.6.z?

Created attachment 1769774 [details]
olm output
verify:

zhaoxia@xzha-mac operator-registry % oc exec catalog-operator-786555fc6b-7qnbq -- olm --version
OLM version: 0.17.0
git commit: 9fa1f1249e3acc15b1f628d5f96e7b7047e9f176

1) Install the etcd operator:

```
zhaoxia@xzha-mac 41174 % cat sub-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: test-41174
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
```

2) Delete the CSV to trigger a ResolutionFailed event.

3) Check the catalog-operator log:

```
time="2021-04-25T09:15:42Z" level=debug msg="resolving sources" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking if subscriptions need update" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking for existing installplan" channel=singlenamespace-alpha id=+SsMO namespace=test-41174 pkg=etcd source=community-operators sub=etcd-operator
time="2021-04-25T09:15:42Z" level=debug msg="resolving subscriptions in namespace" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=warning msg="an error was encountered during reconciliation" error="Operation cannot be fulfilled on subscriptions.operators.coreos.com \"etcd-operator\": the object has been modified; please apply your changes to the latest version and try again" event=update reconciling="*v1alpha1.Subscription" selflink=
E0425 09:15:42.579105       1 queueinformer_operator.go:290] sync {"update" "test-41174/etcd-operator"} failed: Operation cannot be fulfilled on subscriptions.operators.coreos.com "etcd-operator": the object has been modified; please apply your changes to the latest version and try again
time="2021-04-25T09:15:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2021-04-25T09:15:42Z" level=debug msg="reconciliation successful" event=update reconciling="*v1alpha1.Subscription" selflink= state="*subscription.catalogHealthyState"
time="2021-04-25T09:15:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2021-04-25T09:15:42Z" level=debug msg="reconciliation successful" event=update reconciling="*v1alpha1.Subscription" selflink= state="*subscription.catalogHealthyState"
time="2021-04-25T09:15:42Z" level=debug msg="resolution failed" error="constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 provide EtcdCluster (etcd.database.coreos.com/v1beta2), subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0" id=+SsMO namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="resolving sources" id=yQKL6 namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="checking if subscriptions need update" id=yQKL6 namespace=test-41174
I0425 09:15:42.596426       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"test-41174", UID:"492cbf3d-0075-4087-b9d5-26d1e37ec23d", APIVersion:"v1", ResourceVersion:"31724", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 provide EtcdCluster (etcd.database.coreos.com/v1beta2), subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0
time="2021-04-25T09:15:42Z" level=debug msg="checking for existing installplan" channel=singlenamespace-alpha id=yQKL6 namespace=test-41174 pkg=etcd source=community-operators sub=etcd-operator
time="2021-04-25T09:15:42Z" level=debug msg="resolving subscriptions in namespace" id=yQKL6 namespace=test-41174
time="2021-04-25T09:15:42Z" level=debug msg="resolution failed" error="constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 originate from package etcd, subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0" id=yQKL6 namespace=test-41174
I0425 09:15:42.622905       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"test-41174", UID:"492cbf3d-0075-4087-b9d5-26d1e37ec23d", APIVersion:"v1", ResourceVersion:"31724", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription etcd-operator exists, @existing/test-41174//etcdoperator.v0.9.4 is mandatory, @existing/test-41174//etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 and community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0 originate from package etcd, subscription etcd-operator requires at least one of community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.4, community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.2 or community-operators/openshift-marketplace/singlenamespace-alpha/etcdoperator.v0.9.0
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcd-operator namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4-etcd-operator-7c45f9c46 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4-etcd-operator-7c45f9c46 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4 namespace=test-41174
time="2021-04-25T09:16:02Z" level=debug msg="handling object deletion" name=etcdoperator.v0.9.4 namespace=test-41174
time="2021-04-25T09:16:11Z" level=debug msg="handling object deletion" name=etcd-operator-59c8fccd54-6qst5 namespace=test-41174
```

There is no "creating installplan" log entry after the ResolutionFailed event happens.

lgtm, verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.