Description of problem:
After deploying CLO, the operator pod became `Evicted` because the node was low on memory. A new CLO pod was started, but it could not become the leader. In this state, after a clusterlogging instance was created, the EFK pods could not start.

$ oc get pod
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-logging-operator-8c8ffd6c4-7mkwt   1/1     Running   0          110m
cluster-logging-operator-8c8ffd6c4-g7vtz   0/1     Evicted   0          4h29m

$ oc describe pod cluster-logging-operator-8c8ffd6c4-g7vtz
Name:         cluster-logging-operator-8c8ffd6c4-g7vtz
Namespace:    openshift-logging
Priority:     0
Node:         qitang-b462p-worker-c-j74cx.c.openshift-qe.internal/
Start Time:   Mon, 22 Jun 2020 20:49:31 -0400
Labels:       name=cluster-logging-operator
              pod-template-hash=8c8ffd6c4
Annotations:  alm-examples:
                [
                  {
                    "apiVersion": "logging.openshift.io/v1",
                    "kind": "ClusterLogging",
                    "metadata": { "name": "instance", "namespace": "openshift-logging" },
                    "spec": {
                      "managementState": "Managed",
                      "logStore": {
                        "type": "elasticsearch",
                        "elasticsearch": {
                          "nodeCount": 3,
                          "redundancyPolicy": "SingleRedundancy",
                          "storage": { "storageClassName": "gp2", "size": "200G" }
                        },
                        "retentionPolicy": { "logs.app": { "maxAge": "7d" } }
                      },
                      "visualization": { "type": "kibana", "kibana": { "replicas": 1 } },
                      "curation": { "type": "curator", "curator": { "schedule": "30 3 * * *" } },
                      "collection": { "logs": { "type": "fluentd", "fluentd": {} } }
                    }
                  },
                  {
                    "apiVersion": "logging.openshift.io/v1alpha1",
                    "kind": "LogForwarding",
                    "metadata": { "name": "instance", "namespace": "openshift-logging" },
                    "spec": {
                      "outputs": [
                        {
                          "name": "clo-default-output-es",
                          "type": "elasticsearch",
                          "endpoint": "elasticsearch.openshift-logging.svc:9200",
                          "secret": { "name": "elasticsearch" }
                        }
                      ],
                      "pipelines": [
                        { "name": "clo-default-app-pipeline", "inputSource": "logs.app", "outputRefs": ["clo-managaged-output-es"] },
                        { "name": "clo-default-infra-pipeline", "inputSource": "logs.app", "outputRefs": ["clo-managaged-output-es"] }
                      ]
                    }
                  }
                ]
              capabilities: Seamless Upgrades
              categories: OpenShift Optional, Logging & Tracing
              certified: false
              containerImage: quay.io/openshift-qe-optional-operators/ose-cluster-logging-operator@sha256:1223f4cbb171852e7133ae15256bd0f21e254a70349d17ce7fdd437c993402...
              createdAt: 2018-08-01T08:00:00Z
              description: The Cluster Logging Operator for OKD provides a means for configuring and managing your aggregated logging stack.
              k8s.v1.cni.cncf.io/network-status:
                [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.131.0.18" ], "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status:
                [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.131.0.18" ], "default": true, "dns": {} }]
              olm.operatorGroup: openshift-logging
              olm.operatorNamespace: openshift-logging
              olm.skipRange: >=4.4.0-0 <4.5.0-202006180838
              olm.targetNamespaces: openshift-logging
              openshift.io/scc: restricted
              operatorframework.io/cluster-monitoring: true
              operatorframework.io/suggested-namespace: openshift-logging
              support: AOS Logging
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: memory. Container cluster-logging-operator was using 35552Ki, which exceeds its request of 0.
IP:
IPs:            <none>
Controlled By:  ReplicaSet/cluster-logging-operator-8c8ffd6c4
Containers:
  cluster-logging-operator:
    Image:      quay.io/openshift-qe-optional-operators/ose-cluster-logging-operator@sha256:1223f4cbb171852e7133ae15256bd0f21e254a70349d17ce7fdd437c99340217
    Port:       <none>
    Host Port:  <none>
    Command:
      cluster-logging-operator
    Environment:
      WATCH_NAMESPACE:  (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:         cluster-logging-operator-8c8ffd6c4-g7vtz (v1:metadata.name)
      OPERATOR_NAME:    cluster-logging-operator
      FLUENTD_IMAGE:    quay.io/openshift-qe-optional-operators/ose-logging-fluentd@sha256:feb0ffc45e06c7e8ab50284c0baed322b04bca9bae03f90166818de56865c545
      CURATOR_IMAGE:    quay.io/openshift-qe-optional-operators/ose-logging-curator5@sha256:ea1c1e012186383efa07255794613edb155ac2cf97e0bbfd5676906533a90d7e
      PROMTAIL_IMAGE:   quay.io/openshift/origin-promtail:latest
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-logging-operator-token-xrxc8 (ro)
Volumes:
  cluster-logging-operator-token-xrxc8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-logging-operator-token-xrxc8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                   From                                                          Message
  ----     ------   ----                  ----                                                          -------
  Warning  BackOff  110m (x3 over 112m)   kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  Back-off restarting failed container
  Normal   Created  110m (x5 over 4h29m)  kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  Created container cluster-logging-operator
  Normal   Started  110m (x5 over 4h29m)  kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  Started container cluster-logging-operator
  Normal   Pulled   110m (x4 over 127m)   kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  Container image "quay.io/openshift-qe-optional-operators/ose-cluster-logging-operator@sha256:1223f4cbb171852e7133ae15256bd0f21e254a70349d17ce7fdd437c99340217" already present on machine
  Warning  Evicted  110m                  kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  The node was low on resource: memory. Container cluster-logging-operator was using 35552Ki, which exceeds its request of 0.
  Normal   Killing  110m                  kubelet, qitang-b462p-worker-c-j74cx.c.openshift-qe.internal  Stopping container cluster-logging-operator

$ oc logs cluster-logging-operator-8c8ffd6c4-7mkwt
{"level":"info","ts":1592882911.0760407,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1592882911.076073,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1592882911.0760798,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1592882911.0765388,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1592882911.2322025,"logger":"leader","msg":"Found existing lock","LockOwner":"cluster-logging-operator-8c8ffd6c4-g7vtz"}
{"level":"info","ts":1592882911.2413642,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592882912.3386421,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592882914.3940628,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592882919.1144118,"logger":"leader","msg":"Not the leader. Waiting."}
......
{"level":"info","ts":1592889560.6530204,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592889577.0135827,"logger":"leader","msg":"Not the leader. Waiting."}
$ oc get cm -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: ConfigMap
  metadata:
    creationTimestamp: "2020-06-23T00:49:37Z"
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:ownerReferences:
            .: {}
            k:{"uid":"3c667be5-7a3b-4b79-8cb9-a090e244248c"}:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
      manager: cluster-logging-operator
      operation: Update
      time: "2020-06-23T00:49:37Z"
    name: cluster-logging-operator-lock
    namespace: openshift-logging
    ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: cluster-logging-operator-8c8ffd6c4-g7vtz
      uid: 3c667be5-7a3b-4b79-8cb9-a090e244248c
    resourceVersion: "32841"
    selfLink: /api/v1/namespaces/openshift-logging/configmaps/cluster-logging-operator-lock
    uid: 8d231844-f205-4d63-b399-08f832cd7cea
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Version-Release number of selected component (if applicable):
clusterlogging.4.5.0-202006180838

How reproducible:
Not sure

Steps to Reproduce:
1. deploy CLO
2.
3.

Actual results:

Expected results:
The new CLO pod should be able to become the leader if the old one is not in a normal state.

Additional info:
After the Evicted pod is deleted, the new pod becomes the leader. This is consistent with the ConfigMap shown above: the cluster-logging-operator-lock ConfigMap is owned (via ownerReferences) by the evicted pod cluster-logging-operator-8c8ffd6c4-g7vtz, so deleting that pod lets garbage collection remove the lock, and the new pod can then claim it.
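To show why deleting the pod is what unblocks things, here is a minimal Go sketch of the leader-for-life pattern, assuming a recent client-go (v0.18+); the function name tryBecomeLeader and the retry interval are illustrative and this is not the vendored operator-sdk v0.8.2 code:

package leader

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader (hypothetical helper) sketches leader-for-life: the first pod
// to create the lock ConfigMap is the leader; every other pod finds the lock
// already present and simply waits and retries.
func tryBecomeLeader(ctx context.Context, c kubernetes.Interface, ns string, me *corev1.Pod) error {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "cluster-logging-operator-lock",
			Namespace: ns,
			// The lock is owned by the leader pod, so it is only garbage-collected
			// when that pod object is deleted; eviction alone leaves the pod (and
			// therefore the lock) in place.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       me.Name,
				UID:        me.UID,
			}},
		},
	}
	for {
		_, err := c.CoreV1().ConfigMaps(ns).Create(ctx, lock, metav1.CreateOptions{})
		switch {
		case err == nil:
			return nil // "Became the leader."
		case apierrors.IsAlreadyExists(err):
			// "Found existing lock" -> "Not the leader. Waiting."
			time.Sleep(5 * time.Second)
		default:
			return err
		}
	}
}

With this scheme there is no lease to expire, which is why the second CLO pod loops forever while the Evicted pod object still exists.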
This is a known issue in the Operator SDK's implementation of leader-for-life, with the most current fix being in the latest version, v0.18.1. CLO vendors operator-sdk v0.8.2.

Upstream:
Evicted pod do not release controller ConfigMap lock Bug 1749620
https://github.com/operator-framework/operator-sdk/issues/1874

Fix:
leader election bugfix: Delete evicted leader pods
https://github.com/operator-framework/operator-sdk/pull/2210

Fix for the fix:
leader: get most recent lock owner when attempting a claim
https://github.com/operator-framework/operator-sdk/pull/3059

[operator-sdk] Evicted pod do not release controller ConfigMap lock
https://bugzilla.redhat.com/show_bug.cgi?id=1749620

Note that the same issue affects the EO, which vendors the same version of operator-sdk.
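For reference, a rough Go sketch of what those fixes add, again assuming a recent client-go (v0.18+); the helper name handleExistingLock is illustrative, and the evicted check follows the PR descriptions rather than quoting the actual operator-sdk source:

package leader

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isPodEvicted mirrors the idea of the upstream fix: an evicted pod ends up in
// phase Failed with reason "Evicted" and never releases the lock on its own.
func isPodEvicted(p corev1.Pod) bool {
	return p.Status.Phase == corev1.PodFailed && p.Status.Reason == "Evicted"
}

// handleExistingLock (hypothetical helper) resolves the pod that owns an existing
// lock ConfigMap and deletes it if it was evicted, so that garbage collection can
// remove the lock and the next claim attempt succeeds.
func handleExistingLock(ctx context.Context, c kubernetes.Interface, ns string, lock *corev1.ConfigMap) error {
	owner := lock.OwnerReferences[0] // leader-for-life locks carry a single pod owner
	pod, err := c.CoreV1().Pods(ns).Get(ctx, owner.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if isPodEvicted(*pod) {
		// Corresponds to "Operator pod with leader lock has been evicted." and
		// "Deleting evicted leader." in the verification logs later in this bug.
		return c.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	return nil // the owner is healthy: keep waiting
}

With operator-sdk v0.8.2 none of this cleanup exists, which matches the endless "Not the leader. Waiting." loop shown in the description.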
It turns out the Operator SDK update is already underway upstream:

Migrate operator-sdk and deps to v0.18.1
https://github.com/openshift/cluster-logging-operator/pull/576
(In reply to Sergey Yedrikov from comment #2)
> Note that the same issue affects the EO, which vendors the same version of
> operator-sdk.

EO has already been upgraded to operator-sdk v0.18.1 by:
https://github.com/openshift/elasticsearch-operator/pull/291
I am moving this to ON_QA because the tech debt PRs are merged.
Verified with quay.io/openshift/origin-elasticsearch-operator@sha256:4843a5191961655e7913c0d55efd225ce502ad7e4b8f6ba5fe58cb995492317c and quay.io/openshift/origin-cluster-logging-operator@sha256:8a63a377f26afe46786b5f3cb94b908dcae071001f9e0ade2d82aa318a4f081a. The operator-sdk version is v0.18.1.

The new pod could become the leader:

$ oc logs -n openshift-operators-redhat elasticsearch-operator-85f596849b-7t5x8
{"level":"info","ts":1593311872.4003904,"logger":"cmd","msg":"Operator Version: 0.0.1"}
{"level":"info","ts":1593311872.4004252,"logger":"cmd","msg":"Go Version: go1.13.8"}
{"level":"info","ts":1593311872.40043,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1593311872.4004338,"logger":"cmd","msg":"Version of operator-sdk: v0.18.1"}
{"level":"info","ts":1593311872.4008894,"logger":"leader","msg":"Trying to become the leader."}
I0628 02:37:53.451944       1 request.go:621] Throttling request took 1.035717719s, request: GET:https://172.30.0.1:443/apis/operator.openshift.io/v1?timeout=32s
{"level":"info","ts":1593311874.6751797,"logger":"leader","msg":"Found existing lock","LockOwner":"elasticsearch-operator-85f596849b-vn2wv"}
{"level":"info","ts":1593311874.693283,"logger":"leader","msg":"Operator pod with leader lock has been evicted.","leader":"elasticsearch-operator-85f596849b-vn2wv"}
{"level":"info","ts":1593311874.693376,"logger":"leader","msg":"Deleting evicted leader."}
{"level":"info","ts":1593311875.8036397,"logger":"leader","msg":"Became the leader."}

$ oc logs cluster-logging-operator-58b6875d75-6jk99
{"level":"info","ts":1593311339.1577475,"logger":"cmd","msg":"Operator Version: 0.0.1"}
{"level":"info","ts":1593311339.1577837,"logger":"cmd","msg":"Go Version: go1.13.8"}
{"level":"info","ts":1593311339.15779,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1593311339.1577952,"logger":"cmd","msg":"Version of operator-sdk: v0.18.1"}
{"level":"info","ts":1593311339.1585007,"logger":"leader","msg":"Trying to become the leader."}
I0628 02:29:00.210226       1 request.go:621] Throttling request took 1.034629822s, request: GET:https://172.30.0.1:443/apis/logging.openshift.io/v1alpha1?timeout=32s
{"level":"info","ts":1593311341.4317126,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":1593311341.4317534,"logger":"leader","msg":"Continuing as the leader."}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196