Description of problem:
1. Deploy logging 4.5 on a 4.5 cluster.
2. Upgrade logging to 4.6.
3. Upgrade the cluster to 4.6. The ES deployments disappeared after the cluster upgraded successfully.

There are plenty of logs like the following in the EO:

{"level":"error","ts":1600656185.4719505,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:213","msg":"Failed to progress update of unschedulable node","node":"elasticsearch-cdm-2wg9lezz-1","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}
{"level":"error","ts":1600656185.4720213,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to progress unschedulable nodes","cluster":"elasticsearch","namespace":"openshift-logging","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}

I also found that lots of secrets were recreated:

$ oc get secrets
NAME                                       TYPE                                  DATA   AGE
builder-dockercfg-wws6m                    kubernetes.io/dockercfg               1      4h10m
builder-token-qqg2t                        kubernetes.io/service-account-token   4      4h10m
builder-token-xcd6k                        kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-dockercfg-f29rl   kubernetes.io/dockercfg               1      4h10m
cluster-logging-operator-token-kx45d       kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-token-lhvg8       kubernetes.io/service-account-token   4      4h10m
default-dockercfg-w45ql                    kubernetes.io/dockercfg               1      4h10m
default-token-7xxvz                        kubernetes.io/service-account-token   4      4h10m
default-token-mk62p                        kubernetes.io/service-account-token   4      4h10m
deployer-dockercfg-bf96k                   kubernetes.io/dockercfg               1      4h10m
deployer-token-4d9p4                       kubernetes.io/service-account-token   4      4h10m
deployer-token-cshkb                       kubernetes.io/service-account-token   4      4h10m
elasticsearch                              Opaque                                7      169m
elasticsearch-dockercfg-rbxv5              kubernetes.io/dockercfg               1      169m
elasticsearch-metrics                      kubernetes.io/tls                     2      169m
elasticsearch-token-lbxgf                  kubernetes.io/service-account-token   4      169m
elasticsearch-token-p594b                  kubernetes.io/service-account-token   4      169m
fluentd                                    Opaque                                3      169m
fluentd-metrics                            kubernetes.io/tls                     2      169m
kibana                                     Opaque                                3      169m
kibana-dockercfg-bn6dk                     kubernetes.io/dockercfg               1      169m
kibana-proxy                               Opaque                                3      169m
kibana-token-qqksm                         kubernetes.io/service-account-token   4      169m
kibana-token-zs7w2                         kubernetes.io/service-account-token   4      169m
logcollector-dockercfg-gbwqn               kubernetes.io/dockercfg               1      169m
logcollector-token-f5qxl                   kubernetes.io/service-account-token   4      169m
logcollector-token-gbvf8                   kubernetes.io/service-account-token   4      169m
master-certs                               Opaque                                2      169m

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-779f857c67-zkq6k       1/1     Running                 0          167m
elasticsearch-delete-app-1600665300-bq9lq       0/1     Error                   0          14m
elasticsearch-delete-audit-1600665300-5pnrd     0/1     Error                   0          14m
elasticsearch-delete-infra-1600665300-kbm6l     0/1     Error                   0          14m
elasticsearch-rollover-app-1600665300-7tlhh     0/1     Error                   0          14m
elasticsearch-rollover-audit-1600665300-27b5q   0/1     Error                   0          14m
elasticsearch-rollover-infra-1600665300-psf7m   0/1     Error                   0          14m
fluentd-7jhf7                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-fzg9m                                   0/1     Init:CrashLoopBackOff   30         169m
fluentd-k7zqc                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-l5ms4                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-m4829                                   0/1     Init:CrashLoopBackOff   29         169m
fluentd-zphfg                                   0/1     Init:CrashLoopBackOff   30         169m
kibana-549dff7bcd-ml6r7                         2/2     Running                 0          167m

Elasticsearch/elasticsearch:

spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 1d
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 12h
        hot:
          actions:
            rollover:
              maxAge: 36m
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 2w
        hot:
          actions:
            rollover:
              maxAge: 2h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: 2wg9lezz
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - conditions:
    - lastTransitionTime: "2020-09-21T02:38:19Z"
      message: '0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: },
        that the pod didn''t tolerate, 2 node(s) had volume node affinity conflict,
        2 node(s) were unschedulable.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-2wg9lezz-1
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ElasticsearchContainerTerminated
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ProxyContainerTerminated
    deploymentName: elasticsearch-cdm-2wg9lezz-2
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ElasticsearchContainerWaiting
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ProxyContainerWaiting
    deploymentName: elasticsearch-cdm-2wg9lezz-3
    upgradeStatus:
      upgradePhase: controllerUpdated
  pods:
    client:
      failed: []
      notReady: []
      ready: []
    data:
      failed: []
      notReady: []
      ready: []
    master:
      failed: []
      notReady: []
      ready: []
  shardAllocationEnabled: shard allocation unknown

$ oc get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           4h12m
kibana                     1/1     1            1           171m

Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES                                       PHASE
clusterlogging.4.6.0-202009192030.p0           Cluster Logging          4.6.0-202009192030.p0   clusterlogging.4.5.0-202009161248.p0           Succeeded
elasticsearch-operator.4.6.0-202009192030.p0   Elasticsearch Operator   4.6.0-202009192030.p0   elasticsearch-operator.4.5.0-202009182238.p0   Succeeded

How reproducible:
I tried 3 times and only hit it once.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
must-gather: http://file.apac.redhat.com/~qitang/must-gather-0921.tar.gz
@Qiaoling Tang This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873652 Can you confirm that the fix is included in the EO operator image you used?
@qiaoling Tang Did you resolve this issue on this cluster: `'0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'`?
(In reply to Periklis Tsirakidis from comment #2)
> @Qiaoling Tang
>
> This looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1873652
>
> Can you confirm that the fix is included in the EO operator image you used?

Per https://bugzilla.redhat.com/show_bug.cgi?id=1873652#c13, that BZ was verified on 2020-09-16, and the image tag here is 4.6.0-202009192030.p0, so I think the image has the fix.

> Did you resolve this issue on this cluster: `'0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t
> tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'`?

The unschedulable events were generated while the cluster was being upgraded. Upgrading the cluster from 4.5 to 4.6 upgrades all the nodes, and the nodes are unschedulable while they are being upgraded. After the nodes were upgraded successfully, they became schedulable again.
@Qiaoling Tang

I've compiled a fix, and the following routine both reproduces this BZ and verifies the fix.

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster.
2. Install clusterlogging with 3 nodes for ES - no need for selectors.
3. Wait until the setup is fully working.
4. Update the ES resources in the CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number.
5. Wait until the status.nodes field marks at least one deployment as unschedulable.
6. Switch the CL+ES CRs to Unmanaged.
7. Delete one of the three ES deployments.

To verify the fix:

1. Upgrade the operators to 4.6.0.
2. Switch the resources in the CL CR back to normal.
3. Reset Unmanaged to Managed and see what happens.
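For reference, steps 4, 6, and 7 above expressed as oc commands - this is a rough sketch, not taken from the original comment; it assumes the CL CR is named "instance" in openshift-logging and the usual spec.logStore.elasticsearch field path, and <genUUID> is a placeholder for the real suffix:

# step 4: request an absurd amount of memory so the ES pods become unschedulable (assumed field path)
$ oc -n openshift-logging patch clusterlogging instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"requests":{"memory":"500Gi"}}}}}}'

# step 6: switch both CRs to Unmanaged
$ oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
$ oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# step 7: delete one of the three ES deployments
$ oc -n openshift-logging delete deployment elasticsearch-cdm-<genUUID>-1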
For the record, this keeps happening because the deployments get garbage collected by kube-system's GC during the cluster upgrade. However, this should not happen at all, because the ownerReferences ensure that the deployments stay alive as long as their owner does. To this day we don't know which component keeps deleting our ClusterLogging/Elasticsearch CRs in different scenarios, like in this one as well as in [1] and [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1868300
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1873652
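As a quick check that the owner chain is still intact on such a cluster (ES deployments owned by the Elasticsearch CR, which in turn is owned by the ClusterLogging CR), something like the following can be used - a sketch with jsonpath, the deployment name is a placeholder and this is not part of the original comment:

$ oc -n openshift-logging get deployment elasticsearch-cdm-<genUUID>-1 \
    -o jsonpath='{.metadata.ownerReferences}{"\n"}'
$ oc -n openshift-logging get elasticsearch elasticsearch \
    -o jsonpath='{.metadata.ownerReferences}{"\n"}'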
For the record, here are the audit logs, telling the story that our deployments get garbage collected:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"80e425e0-2165-4e55-a705-4fca893430ee","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.348424Z","stageTimestamp":"2020-09-21T02:40:18.415860Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9ff43d02-7740-4423-a0cb-d8e057dc1e83","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.458157Z","stageTimestamp":"2020-09-21T02:40:18.502307Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"683e3884-6971-4b85-8a44-1571e7e8fedb","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.554628Z","stageTimestamp":"2020-09-21T02:40:18.624505Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}
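For anyone who wants to pull these events out of the API-server audit logs themselves, here is a sketch of the kind of query that surfaces them; the master node name is a placeholder and jq is assumed to be available on the workstation:

# list delete events on deployments in openshift-logging issued by the generic garbage collector
$ oc adm node-logs <master-node> --path=kube-apiserver/audit.log \
    | jq -c 'select(.verb=="delete"
                    and .objectRef.resource=="deployments"
                    and .objectRef.namespace=="openshift-logging"
                    and (.user.username | contains("generic-garbage-collector")))'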
(In reply to Periklis Tsirakidis from comment #5)
> @Qiaoling Tang
>
> I've compiled a fix and the following routine to reproduce this BZ as well
> as to verify the fix:
>
> To reproduce this bug, follow these instructions:
>
> 1. Fresh install 4.5.z on a cluster
> 2. Install clusterlogging with 3 nodes for ES - no need for selectors
> 3. Wait until the setup is fully workable.
> 4. Upgrade the ES resource in CL CR to make the ES nodes go "Unschedulable",
>    e.g. set memory requests/limits to an astronomic number
> 5. Wait until the status.nodes fields marks at least one deployment as
>    unschedulable.
> 6. Switch CL+ES CR to Unmanaged
> 7. Delete one of the three ES deployments
>
> To verify the fix:
>
> 1. Upgrade the operators to 4.6.0
> 2. Switch resources in CL CR back to normal
> 3. Reset Unmanaged to Managed and see what happens.

I tested with clusterlogging.4.6.0-202010061132.p0 and elasticsearch-operator.4.6.0-202010081538.p0.

After I reset Unmanaged to Managed, the EO went into CrashLoopBackOff and couldn't recreate the ES deployment.

$ oc get pod -n openshift-operators-redhat
NAME                                      READY   STATUS             RESTARTS   AGE
elasticsearch-operator-5d75f86c7d-x44c8   0/1     CrashLoopBackOff   5          7m10s

[qitang@preserve-docker-slave gcp]$ oc logs -n openshift-operators-redhat elasticsearch-operator-5d75f86c7d-x44c8
{"level":"info","ts":1602229567.48116,"logger":"elasticsearch-operator","caller":"manager/main.go:45","msg":"starting up...","operator_version":"0.0.1","go_version":"go1.15.0","go_os":"linux","go_arch":"amd64","operator-sdk_version":"v0.18.1"}
I1009 07:46:08.533427       1 request.go:621] Throttling request took 1.015335812s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229572.7882924,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Registering Components.","namespace":""}
I1009 07:46:18.749922       1 request.go:621] Throttling request took 1.046099375s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229582.7128692,"logger":"elasticsearch-operator","caller":"manager/main.go:173","msg":"Could not create ServiceMonitor object","error":"servicemonitors.monitoring.coreos.com \"elasticsearch-operator-metrics\" already exists"}
{"level":"info","ts":1602229582.7129803,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"This operator no longer honors the image specified by the custom resources so that it is able to properly coordinate the configuration with the image.","namespace":""}
{"level":"info","ts":1602229582.712998,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Starting the Cmd.","namespace":""}
{"level":"info","ts":1602229584.7186341,"logger":"elasticsearch-operator","caller":"kibana/controller.go:256","msg":"Registering future events","name":"openshift-logging/kibana"}
{"level":"info","ts":1602229585.06746,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-8heixwh2-2","upgradeStatus":{"underUpgrade":"True","upgradePhase":"nodeRestarting"},"conditions":[{"type":"Unschedulable","status":"True","lastTransitionTime":"2020-10-09T07:31:38Z","reason":"Unschedulable","message":"0/6 nodes are available: 6 Insufficient memory."}]}}
E1009 07:46:25.067702       1 runtime.go:78] Observed a panic: runtime.boundsError{x:2, y:2, signed:true, code:0x0} (runtime error: index out of range [2] with length 2)
goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16efae0, 0xc000807cc0)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x16efae0, 0xc000807cc0)
    /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: index out of range [2] with length 2 [recovered]
    panic: runtime error: index out of range [2] with length 2

goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x16efae0, 0xc000807cc0)
    /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
    /go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
    /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
If I deploy logging 4.6 and do the following steps:

1. Install clusterlogging with 3 nodes for ES - no need for selectors.
2. Wait until the setup is fully working.
3. Update the ES resources in the CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number.
4. Wait until the status.nodes field marks at least one deployment as unschedulable.
5. Switch the CL+ES CRs to Unmanaged.
6. Delete one of the three ES deployments.
7. Switch the resources in the CL CR back to normal.
8. Reset Unmanaged to Managed and see what happens.

then the EO keeps running as usual, but it can't bring the ES cluster back to normal.

Logs in the EO:

{"level":"info","ts":1602229844.2539637,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229845.446016,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 7 shards in preparation for cluster restart"}
{"level":"info","ts":1602229845.4970524,"logger":"elasticsearch-operator","caller":"retry/util.go:51","msg":"failed to update node resource","node":"elasticsearch-cdm-5dqdymyw-1","error":"Operation cannot be fulfilled on deployments.apps \"elasticsearch-cdm-5dqdymyw-1\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1602229905.018403,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:519","msg":"Completed restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"info","ts":1602229905.0530834,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-2","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229906.0889528,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 4 shards in preparation for cluster restart"}
{"level":"error","ts":1602229940.184816,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:131","msg":"failed to perform rolling update","error":"failed to get cluster state for \"elasticsearch\". Error code: true, map[error:map[reason:<nil> root_cause:[map[reason:<nil> type:master_not_discovered_exception]] type:master_not_discovered_exception] status:503]"}
{"level":"error","ts":1602229972.703489,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230004.35125,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230035.9973016,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230067.7721658,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230099.7733471,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230131.6252205,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230163.6494224,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230195.5667257,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230227.1968317,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230259.2925787,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230291.9073987,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230323.9044044,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230355.6235714,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230387.3924816,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230419.190769,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230451.3558466,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230483.1463294,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230514.8978357,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"info","ts":1602230585.3486938,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-5dqdymyw-3","upgradeStatus":{"scheduledUpgrade":"True"}}}
{"level":"error","ts":1602230616.3881042,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230619.3374827,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: connect: connection refused"}
{"level":"error","ts":1602230678.285918,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}

$ oc get elasticsearch elasticsearch -oyaml
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-10-09T05:48:12Z"
  generation: 6
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 6901675c-b8c2-4bcb-9611-0f8a116b0eff
  resourceVersion: "293498"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 2fcaf92d-ac8c-414a-868a-a07c0528dc9d
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: 5dqdymyw
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage: {}
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - deploymentName: elasticsearch-cdm-5dqdymyw-1
    upgradeStatus:
      upgradePhase: controllerUpdated
  - deploymentName: elasticsearch-cdm-5dqdymyw-2
    upgradeStatus:
      scheduledUpgrade: "True"
      underUpgrade: "True"
      upgradePhase: nodeRestarting
  pods:
    client:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      ready: []
    data:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      ready: []
    master:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      ready: []
  shardAllocationEnabled: shard allocation unknown

Besides, the EO works well after doing a normal upgrade (from 4.5 to 4.6).
Why is this only a medium severity? ES failing after upgrades sounds very severe to me.
Tested with elasticsearch-operator.4.6.0-202010130127.p0. The EO didn't go into CrashLoopBackOff, but it still couldn't recreate the ES deployments after resetting Unmanaged to Managed.

EO log:

{"level":"error","ts":1602577056.8270836,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to restart node","cluster":"elasticsearch","namespace":"openshift-logging","node":"elasticsearch-cdm-pw6l3eqr-3","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.212.81:9200: i/o timeout"}

  nodes:
  - genUUID: pw6l3eqr
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - conditions:
    - lastTransitionTime: "2020-10-13T07:54:24Z"
      message: '0/6 nodes are available: 1 node(s) were unschedulable, 5 Insufficient memory.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-pw6l3eqr-3
    upgradeStatus:
      underUpgrade: "True"
      upgradePhase: nodeRestarting

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-5747454486-86rpq       1/1     Running                 0          15m
elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89   1/2     Running                 0          11m
elasticsearch-delete-app-1602576900-wsvg7       0/1     Error                   0          4m29s
elasticsearch-delete-audit-1602576900-24jpp     0/1     Error                   0          4m29s
elasticsearch-delete-infra-1602576900-vjmhz     0/1     Error                   0          4m29s
elasticsearch-rollover-app-1602576900-s56dr     0/1     Error                   0          4m29s
elasticsearch-rollover-audit-1602576900-vpdb9   0/1     Error                   0          4m29s
elasticsearch-rollover-infra-1602576900-nmgxp   0/1     Error                   0          4m29s
fluentd-8qcmd                                   1/1     Running                 0          34m
fluentd-9l77k                                   0/1     Init:CrashLoopBackOff   5          11m
fluentd-bfh7s                                   1/1     Running                 0          34m
fluentd-bzdkt                                   1/1     Running                 0          34m
fluentd-qqkvq                                   1/1     Running                 0          34m
fluentd-rq8c8                                   1/1     Running                 0          34m
kibana-684cd479bd-bt82s                         2/2     Running                 0          14m
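For reference, the flip back to Managed and the subsequent check whether the EO recreates the missing deployments looked roughly like this - a sketch assuming the CL CR is named "instance", not copied from the original comment:

$ oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Managed"}}'
$ oc -n openshift-logging get deployment -w
$ oc -n openshift-operators-redhat logs deploy/elasticsearch-operator -f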
(In reply to Qiaoling Tang from comment #15)
> Tested with elasticsearch-operator.4.6.0-202010130127.p0. The EO didn't go
> into CrashLoopBackOff, but it still couldn't recreate the ES deployments
> after resetting Unmanaged to Managed.
>
> EO log:
> {"level":"error","ts":1602577056.8270836,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to restart node","cluster":"elasticsearch","namespace":"openshift-logging","node":"elasticsearch-cdm-pw6l3eqr-3","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.212.81:9200: i/o timeout"}
>
> [...]

@Qiaoling Could you please share the link to the cluster?
Putting this back to ON_QA, because this depends on backporting: https://bugzilla.redhat.com/show_bug.cgi?id=1887357
*** Bug 1887943 has been marked as a duplicate of this bug. ***
Verified with elasticsearch-operator.4.6.0-202010132327.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4198