Bug 1880926
| Field | Value |
| --- | --- |
| Summary | The EO can't recreate ES deployments after upgrade cluster from 4.5 to 4.6 |
| Product | OpenShift Container Platform |
| Component | Logging |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Upgrades |
| Reporter | Qiaoling Tang <qitang> |
| Assignee | Periklis Tsirakidis <periklis> |
| QA Contact | Qiaoling Tang <qitang> |
| CC | agurenko, aivaraslaimikis, andcosta, aos-bugs, bparees, jcantril, jeder, periklis, rsandu, scuppett |
| Whiteboard | logging-exploration osd-45-logging |
| Clones | 1883777 (view as bug list) |
| Bug Depends On | 1886796, 1887357 |
| Bug Blocks | 1883777 |
| Type | Bug |
| Last Closed | 2020-10-27 15:12:50 UTC |
Description
Qiaoling Tang
2020-09-21 07:28:12 UTC
@Qiaoling Tang: This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873652. Can you confirm that the fix is included in the EO operator image you used?

@Qiaoling Tang: Did you resolve this issue on this cluster: `'0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'`?

(In reply to Periklis Tsirakidis from comment #2)
> This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873652
> Can you confirm that the fix is included in the EO operator image you used?

Per https://bugzilla.redhat.com/show_bug.cgi?id=1873652#c13, that BZ was verified on 2020-09-16 and the image tag is 4.6.0-202009192030.p0, so I think the image has the fix.

> Did you resolve this issue on this cluster: `'0/6 nodes are available: ...'`?

The unschedulable events were generated while the cluster was being upgraded. Upgrading a cluster from 4.5 to 4.6 upgrades all of the nodes, and the nodes are unschedulable while they are being upgraded. After the nodes were upgraded successfully, they became schedulable again.

@Qiaoling Tang: I've compiled a fix and the following routine to reproduce this BZ as well as to verify the fix.

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster.
2. Install clusterlogging with 3 nodes for ES - no need for selectors.
3. Wait until the setup is fully working.
4. Update the ES resources in the CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number.
5. Wait until the status.nodes field marks at least one deployment as unschedulable.
6. Switch the CL and ES CRs to Unmanaged.
7. Delete one of the three ES deployments.

To verify the fix:

1. Upgrade the operators to 4.6.0.
2. Switch the resources in the CL CR back to normal.
3. Reset Unmanaged to Managed and see what happens.

For the record, this keeps happening because the deployments get garbage collected by kube-system's garbage collector during the cluster upgrade. This should not happen at all, because the ownerReferences should ensure that the deployments stay alive.
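For convenience, a minimal sketch of these reproduction steps as `oc` commands. This is an assumption-laden illustration, not taken from this report: it assumes the default `instance` ClusterLogging CR in `openshift-logging`, the standard `spec.logStore.elasticsearch` layout, and an illustrative deployment name.

```bash
# Make the ES nodes unschedulable by requesting an unrealistic amount of memory
# (the patch path assumes the standard ClusterLogging CR layout).
oc -n openshift-logging patch clusterlogging instance --type merge \
  -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"limits":{"memory":"500Gi"},"requests":{"memory":"500Gi"}}}}}}'

# Wait until the Elasticsearch CR reports an Unschedulable node condition.
oc -n openshift-logging get elasticsearch elasticsearch \
  -o jsonpath='{range .status.nodes[*]}{.deploymentName}{": "}{.conditions[*].type}{"\n"}{end}'

# Switch both CRs to Unmanaged, then delete one of the ES deployments
# (the deployment name below is an example; list them first).
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
oc -n openshift-logging get deployments
oc -n openshift-logging delete deployment elasticsearch-cdm-2wg9lezz-3

# The remaining deployments should carry an ownerReference to the Elasticsearch CR;
# without it, kube-system's generic-garbage-collector is free to remove them.
oc -n openshift-logging get deployment elasticsearch-cdm-2wg9lezz-2 \
  -o jsonpath='{.metadata.ownerReferences[*].kind}'
```

To check a fix along these lines, revert the resource patch, set `managementState` back to `Managed` on both CRs, and watch whether the deleted deployment is recreated.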
To date, we don't know which component keeps deleting our ClusterLogging/Elasticsearch CRs in different scenarios, such as this one as well as [1] and [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1868300
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1873652

For the record, here are the audit logs telling the story that our deployments get garbage collected:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"80e425e0-2165-4e55-a705-4fca893430ee","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.348424Z","stageTimestamp":"2020-09-21T02:40:18.415860Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9ff43d02-7740-4423-a0cb-d8e057dc1e83","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.458157Z","stageTimestamp":"2020-09-21T02:40:18.502307Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"683e3884-6971-4b85-8a44-1571e7e8fedb","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.554628Z","stageTimestamp":"2020-09-21T02:40:18.624505Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}
(In reply to Periklis Tsirakidis from comment #5)

I tested with clusterlogging.4.6.0-202010061132.p0 and elasticsearch-operator.4.6.0-202010081538.p0. After I reset Unmanaged to Managed, the EO went into CrashLoopBackOff and couldn't recreate the ES deployment.

$ oc get pod -n openshift-operators-redhat
NAME                                      READY   STATUS             RESTARTS   AGE
elasticsearch-operator-5d75f86c7d-x44c8   0/1     CrashLoopBackOff   5          7m10s

[qitang@preserve-docker-slave gcp]$ oc logs -n openshift-operators-redhat elasticsearch-operator-5d75f86c7d-x44c8
{"level":"info","ts":1602229567.48116,"logger":"elasticsearch-operator","caller":"manager/main.go:45","msg":"starting up...","operator_version":"0.0.1","go_version":"go1.15.0","go_os":"linux","go_arch":"amd64","operator-sdk_version":"v0.18.1"}
I1009 07:46:08.533427 1 request.go:621] Throttling request took 1.015335812s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229572.7882924,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Registering Components.","namespace":""}
I1009 07:46:18.749922 1 request.go:621] Throttling request took 1.046099375s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229582.7128692,"logger":"elasticsearch-operator","caller":"manager/main.go:173","msg":"Could not create ServiceMonitor object","error":"servicemonitors.monitoring.coreos.com \"elasticsearch-operator-metrics\" already exists"}
{"level":"info","ts":1602229582.7129803,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"This operator no longer honors the image specified by the custom resources so that it is able to properly coordinate the configuration with the image.","namespace":""}
{"level":"info","ts":1602229582.712998,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Starting the Cmd.","namespace":""}
{"level":"info","ts":1602229584.7186341,"logger":"elasticsearch-operator","caller":"kibana/controller.go:256","msg":"Registering future events","name":"openshift-logging/kibana"}
{"level":"info","ts":1602229585.06746,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-8heixwh2-2","upgradeStatus":{"underUpgrade":"True","upgradePhase":"nodeRestarting"},"conditions":[{"type":"Unschedulable","status":"True","lastTransitionTime":"2020-10-09T07:31:38Z","reason":"Unschedulable","message":"0/6 nodes are available: 6 Insufficient memory."}]}}
E1009 07:46:25.067702 1 runtime.go:78] Observed a panic: runtime.boundsError{x:2, y:2, signed:true, code:0x0} (runtime error: index out of range [2] with length 2)
goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16efae0, 0xc000807cc0)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x16efae0, 0xc000807cc0)
        /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: index out of range [2] with length 2 [recovered]
        panic: runtime error: index out of range [2] with length 2

goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x16efae0, 0xc000807cc0)
        /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
        /go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
        /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
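The panic is thrown from pruneStatus while the operator reconciles status.nodes entries for deployments that no longer exist. A hedged way to see the mismatch it trips over from the CLI; these commands are illustrations, not taken from this report, though the CR and namespace names match the ones used above.

```bash
# Compare the ES deployments that actually exist with the node entries the
# operator still tracks in the Elasticsearch CR status; after the manual
# delete, the status list has more entries than there are real deployments.
oc -n openshift-logging get deployments -o name | grep elasticsearch-cdm
oc -n openshift-logging get elasticsearch elasticsearch \
  -o jsonpath='{range .status.nodes[*]}{.deploymentName}{"\t"}{.upgradeStatus.upgradePhase}{"\n"}{end}'
```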
If I deploy logging 4.6 and follow these steps:

1. Install clusterlogging with 3 nodes for ES - no need for selectors.
2. Wait until the setup is fully working.
3. Update the ES resources in the CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number.
4. Wait until the status.nodes field marks at least one deployment as unschedulable.
5. Switch the CL and ES CRs to Unmanaged.
6. Delete one of the three ES deployments.
7. Switch the resources in the CL CR back to normal.
8. Reset Unmanaged to Managed and see what happens.
Then the EO keeps running as usual, but it couldn't bring the ES cluster back to normal. Logs from the EO:

{"level":"info","ts":1602229844.2539637,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229845.446016,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 7 shards in preparation for cluster restart"}
{"level":"info","ts":1602229845.4970524,"logger":"elasticsearch-operator","caller":"retry/util.go:51","msg":"failed to update node resource","node":"elasticsearch-cdm-5dqdymyw-1","error":"Operation cannot be fulfilled on deployments.apps \"elasticsearch-cdm-5dqdymyw-1\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1602229905.018403,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:519","msg":"Completed restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"info","ts":1602229905.0530834,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-2","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229906.0889528,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 4 shards in preparation for cluster restart"}
{"level":"error","ts":1602229940.184816,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:131","msg":"failed to perform rolling update","error":"failed to get cluster state for \"elasticsearch\". Error code: true, map[error:map[reason:<nil> root_cause:[map[reason:<nil> type:master_not_discovered_exception]] type:master_not_discovered_exception] status:503]"}
{"level":"error","ts":1602229972.703489,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230004.35125,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230035.9973016,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230067.7721658,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230099.7733471,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230131.6252205,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230163.6494224,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230195.5667257,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230227.1968317,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230259.2925787,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230291.9073987,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230323.9044044,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230355.6235714,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230387.3924816,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230419.190769,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230451.3558466,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230483.1463294,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230514.8978357,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"info","ts":1602230585.3486938,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-5dqdymyw-3","upgradeStatus":{"scheduledUpgrade":"True"}}}
{"level":"error","ts":1602230616.3881042,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230619.3374827,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: connect: connection refused"}
\"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: connect: connection refused"} {"level":"error","ts":1602230678.285918,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"} $ oc get elasticsearch elasticsearch -oyaml apiVersion: logging.openshift.io/v1 kind: Elasticsearch metadata: creationTimestamp: "2020-10-09T05:48:12Z" generation: 6 name: elasticsearch namespace: openshift-logging ownerReferences: - apiVersion: logging.openshift.io/v1 controller: true kind: ClusterLogging name: instance uid: 6901675c-b8c2-4bcb-9611-0f8a116b0eff resourceVersion: "293498" selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch uid: 2fcaf92d-ac8c-414a-868a-a07c0528dc9d spec: indexManagement: mappings: - aliases: - app - logs.app name: app policyRef: app-policy - aliases: - infra - logs.infra name: infra policyRef: infra-policy - aliases: - audit - logs.audit name: audit policyRef: audit-policy policies: - name: app-policy phases: delete: minAge: 7d hot: actions: rollover: maxAge: 8h pollInterval: 15m - name: infra-policy phases: delete: minAge: 7d hot: actions: rollover: maxAge: 8h pollInterval: 15m - name: audit-policy phases: delete: minAge: 7d hot: actions: rollover: maxAge: 8h pollInterval: 15m managementState: Managed nodeSpec: proxyResources: limits: memory: 64Mi requests: cpu: 100m memory: 64Mi resources: requests: memory: 2Gi nodes: - genUUID: 5dqdymyw nodeCount: 3 proxyResources: {} resources: {} roles: - client - data - master storage: {} redundancyPolicy: SingleRedundancy status: cluster: activePrimaryShards: 0 activeShards: 0 initializingShards: 0 numDataNodes: 0 numNodes: 0 pendingTasks: 0 relocatingShards: 0 status: cluster health unknown unassignedShards: 0 nodes: - deploymentName: elasticsearch-cdm-5dqdymyw-1 upgradeStatus: upgradePhase: controllerUpdated - deploymentName: elasticsearch-cdm-5dqdymyw-2 upgradeStatus: scheduledUpgrade: "True" underUpgrade: "True" upgradePhase: nodeRestarting pods: client: failed: [] notReady: - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68 - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq ready: [] data: failed: [] notReady: - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68 - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq ready: [] master: failed: [] notReady: - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68 ready: [] shardAllocationEnabled: shard allocation unknown Besides, the EO works well after doing a normal upgrade(from 4.5 to 4.6). why is this only a medium? ES failing after upgrades sounds very severe to me. Tested with elasticsearch-operator.4.6.0-202010130127.p0, the EO didn't become CrashLoopBackOff, but it still couldn't recreate the ES deployments after reset Unmanaged to Managed. 
EO log:

{"level":"error","ts":1602577056.8270836,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to restart node","cluster":"elasticsearch","namespace":"openshift-logging","node":"elasticsearch-cdm-pw6l3eqr-3","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.212.81:9200: i/o timeout"}

  nodes:
  - genUUID: pw6l3eqr
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - conditions:
    - lastTransitionTime: "2020-10-13T07:54:24Z"
      message: '0/6 nodes are available: 1 node(s) were unschedulable, 5 Insufficient memory.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-pw6l3eqr-3
    upgradeStatus:
      underUpgrade: "True"
      upgradePhase: nodeRestarting

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-5747454486-86rpq       1/1     Running                 0          15m
elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89   1/2     Running                 0          11m
elasticsearch-delete-app-1602576900-wsvg7       0/1     Error                   0          4m29s
elasticsearch-delete-audit-1602576900-24jpp     0/1     Error                   0          4m29s
elasticsearch-delete-infra-1602576900-vjmhz     0/1     Error                   0          4m29s
elasticsearch-rollover-app-1602576900-s56dr     0/1     Error                   0          4m29s
elasticsearch-rollover-audit-1602576900-vpdb9   0/1     Error                   0          4m29s
elasticsearch-rollover-infra-1602576900-nmgxp   0/1     Error                   0          4m29s
fluentd-8qcmd                                   1/1     Running                 0          34m
fluentd-9l77k                                   0/1     Init:CrashLoopBackOff   5          11m
fluentd-bfh7s                                   1/1     Running                 0          34m
fluentd-bzdkt                                   1/1     Running                 0          34m
fluentd-qqkvq                                   1/1     Running                 0          34m
fluentd-rq8c8                                   1/1     Running                 0          34m
kibana-684cd479bd-bt82s                         2/2     Running                 0          14m
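With only one of the three ES pods running, the remaining node cannot elect a master, which matches the i/o timeouts and the `master_not_discovered_exception` seen above. For completeness, a hedged way to check this from inside the surviving ES pod, assuming the `es_util` helper shipped in the OpenShift Elasticsearch image; the pod name is the one from the listing above and would differ on another cluster.

```bash
# Query cluster health and the member list directly from the surviving ES pod;
# with two of three master-eligible nodes missing, no master can be elected and
# requests against the elasticsearch service keep timing out.
oc -n openshift-logging exec -c elasticsearch elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89 -- \
  es_util --query=_cluster/health?pretty
oc -n openshift-logging exec -c elasticsearch elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89 -- \
  es_util --query=_cat/nodes?v
```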
(In reply to Qiaoling Tang from comment #15)

@Qiaoling: Could you please share the link to the cluster?

Putting this back to ON_QA, because this depends on backporting https://bugzilla.redhat.com/show_bug.cgi?id=1887357.

*** Bug 1887943 has been marked as a duplicate of this bug. ***

Verified with elasticsearch-operator.4.6.0-202010132327.p0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198