Bug 1880926

Summary:	The EO can't recreate ES deployments after upgrade cluster from 4.5 to 4.6
Product:	OpenShift Container Platform	Reporter:	Qiaoling Tang <qitang>
Component:	Logging	Assignee:	Periklis Tsirakidis <periklis>
Status:	CLOSED ERRATA	QA Contact:	Qiaoling Tang <qitang>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.6	CC:	agurenko, aivaraslaimikis, andcosta, aos-bugs, bparees, jcantril, jeder, periklis, rsandu, scuppett
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	logging-exploration osd-45-logging
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1883777 (view as bug list)		Environment:
Last Closed:	2020-10-27 15:12:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1886796, 1887357
Bug Blocks:	1883777

Description Qiaoling Tang 2020-09-21 07:28:12 UTC

Description of problem:
1. deploy logging 4.5 on a 4.5 cluster
2. upgrade logging to 4.6
3. upgrade cluster to 4.6, the ES deployments are disappeared after the cluster upgraded successfully

There are plenty of below logs in the EO:
{"level":"error","ts":1600656185.4719505,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:213","msg":"Failed to progress update of unschedulable node","node":"elasticsearch-cdm-2wg9lezz-1","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}
{"level":"error","ts":1600656185.4720213,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to progress unschedulable nodes","cluster":"elasticsearch","namespace":"openshift-logging","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}


I found lots of secrets were recreated:

$ oc get secrets
NAME                                       TYPE                                  DATA   AGE
builder-dockercfg-wws6m                    kubernetes.io/dockercfg               1      4h10m
builder-token-qqg2t                        kubernetes.io/service-account-token   4      4h10m
builder-token-xcd6k                        kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-dockercfg-f29rl   kubernetes.io/dockercfg               1      4h10m
cluster-logging-operator-token-kx45d       kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-token-lhvg8       kubernetes.io/service-account-token   4      4h10m
default-dockercfg-w45ql                    kubernetes.io/dockercfg               1      4h10m
default-token-7xxvz                        kubernetes.io/service-account-token   4      4h10m
default-token-mk62p                        kubernetes.io/service-account-token   4      4h10m
deployer-dockercfg-bf96k                   kubernetes.io/dockercfg               1      4h10m
deployer-token-4d9p4                       kubernetes.io/service-account-token   4      4h10m
deployer-token-cshkb                       kubernetes.io/service-account-token   4      4h10m
elasticsearch                              Opaque                                7      169m
elasticsearch-dockercfg-rbxv5              kubernetes.io/dockercfg               1      169m
elasticsearch-metrics                      kubernetes.io/tls                     2      169m
elasticsearch-token-lbxgf                  kubernetes.io/service-account-token   4      169m
elasticsearch-token-p594b                  kubernetes.io/service-account-token   4      169m
fluentd                                    Opaque                                3      169m
fluentd-metrics                            kubernetes.io/tls                     2      169m
kibana                                     Opaque                                3      169m
kibana-dockercfg-bn6dk                     kubernetes.io/dockercfg               1      169m
kibana-proxy                               Opaque                                3      169m
kibana-token-qqksm                         kubernetes.io/service-account-token   4      169m
kibana-token-zs7w2                         kubernetes.io/service-account-token   4      169m
logcollector-dockercfg-gbwqn               kubernetes.io/dockercfg               1      169m
logcollector-token-f5qxl                   kubernetes.io/service-account-token   4      169m
logcollector-token-gbvf8                   kubernetes.io/service-account-token   4      169m
master-certs                               Opaque                                2      169m

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-779f857c67-zkq6k       1/1     Running                 0          167m
elasticsearch-delete-app-1600665300-bq9lq       0/1     Error                   0          14m
elasticsearch-delete-audit-1600665300-5pnrd     0/1     Error                   0          14m
elasticsearch-delete-infra-1600665300-kbm6l     0/1     Error                   0          14m
elasticsearch-rollover-app-1600665300-7tlhh     0/1     Error                   0          14m
elasticsearch-rollover-audit-1600665300-27b5q   0/1     Error                   0          14m
elasticsearch-rollover-infra-1600665300-psf7m   0/1     Error                   0          14m
fluentd-7jhf7                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-fzg9m                                   0/1     Init:CrashLoopBackOff   30         169m
fluentd-k7zqc                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-l5ms4                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-m4829                                   0/1     Init:CrashLoopBackOff   29         169m
fluentd-zphfg                                   0/1     Init:CrashLoopBackOff   30         169m
kibana-549dff7bcd-ml6r7                         2/2     Running                 0          167m

Elasticsearch/elasticsearch:
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 1d
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 12h
        hot:
          actions:
            rollover:
              maxAge: 36m
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 2w
        hot:
          actions:
            rollover:
              maxAge: 2h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: 2wg9lezz
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - conditions:
    - lastTransitionTime: "2020-09-21T02:38:19Z"
      message: '0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-2wg9lezz-1
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ElasticsearchContainerTerminated
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ProxyContainerTerminated
    deploymentName: elasticsearch-cdm-2wg9lezz-2
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ElasticsearchContainerWaiting
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ProxyContainerWaiting
    deploymentName: elasticsearch-cdm-2wg9lezz-3
    upgradeStatus:
      upgradePhase: controllerUpdated
  pods:
    client:
      failed: []
      notReady: []
      ready: []
    data:
      failed: []
      notReady: []
      ready: []
    master:
      failed: []
      notReady: []
      ready: []
  shardAllocationEnabled: shard allocation unknown

$ oc get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           4h12m
kibana                     1/1     1            1           171m


Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES                                       PHASE
clusterlogging.4.6.0-202009192030.p0           Cluster Logging          4.6.0-202009192030.p0   clusterlogging.4.5.0-202009161248.p0           Succeeded
elasticsearch-operator.4.6.0-202009192030.p0   Elasticsearch Operator   4.6.0-202009192030.p0   elasticsearch-operator.4.5.0-202009182238.p0   Succeeded


How reproducible:
I tried 3 times, only hit once

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
must-gather: http://file.apac.redhat.com/~qitang/must-gather-0921.tar.gz

Comment 2 Periklis Tsirakidis 2020-09-21 13:12:51 UTC

@Qiaoling Tang

This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873652

Can you confirm that the fix is included in the EO operator image you used?

Comment 3 Periklis Tsirakidis 2020-09-21 14:19:00 UTC

@qiaoling Tang

Did you resolve this issue on this cluster: `'0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'`?

Comment 4 Qiaoling Tang 2020-09-22 00:02:11 UTC

(In reply to Periklis Tsirakidis from comment #2)
> @Qiaoling Tang
> 
> This looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1873652
> 
> Can you confirm that the fix is included in the EO operator image you used?

Per https://bugzilla.redhat.com/show_bug.cgi?id=1873652#c13, the bz was verified on 2020-09-16, the image tag is 4.6.0-202009192030.p0, I think the image has the fix.


> Did you resolve this issue on this cluster: `'0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t 
> tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'`?

The unschedulable events was generated when the cluster was under upgrading. 

Upgrading cluster from 4.5 to 4.6 needs to upgrade all the nodes, the nodes were unschedulable when they were under upgrading. After the nodes were upgraded successfully, they were schedulable.

Comment 5 Periklis Tsirakidis 2020-09-24 09:18:52 UTC

@Qiaoling Tang

I've compiled a fix and the following routine to reproduce this BZ as well as to verify the fix:

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster
2. Install clusterlogging with 3 nodes for ES - no need for selectors
3. Wait until the setup is fully workable.
4. Upgrade the ES resource in CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number
5. Wait until the status.nodes fields marks at least one deployment as unschedulable.
6. Switch CL+ES CR to Unmanaged
7. Delete one of the three ES deployments

To verify the fix:

1. Upgrade the operators to 4.6.0
2. Switch resources in CL CR back to normal
3. Reset Unmanaged to Managed and see what happens.

Comment 6 Periklis Tsirakidis 2020-09-24 15:03:59 UTC

For the record, this keeps happening because the deployments get garbage collected during the cluster upgrade by kube-system's GC. However, this should not happen at all because ownerReferences ensure that the deployments stay alive. Till today, we don't know which component keeps deleting our ClusterLogging/Elasticsearch CR on different scenarios like in one as well as in [1] and [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1868300
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1873652

Comment 7 Periklis Tsirakidis 2020-09-24 15:12:44 UTC

For the record, here are the audit logs, telling the story that our deployments get garbage collected:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"80e425e0-2165-4e55-a705-4fca893430ee","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-3","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-3","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.348424Z","stageTimestamp":"2020-09-21T02:40:18.415860Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"9ff43d02-7740-4423-a0cb-d8e057dc1e83","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-2","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-2","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.458157Z","stageTimestamp":"2020-09-21T02:40:18.502307Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"683e3884-6971-4b85-8a44-1571e7e8fedb","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-logging/deployments/elasticsearch-cdm-2wg9lezz-1","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"54762d94-cf4c-4103-9813-4dc2c7a9f944","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.0.6"],"userAgent":"kube-controller-manager/v1.19.0+f5121a6 (linux/amd64) kubernetes/f5121a6/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-logging","name":"elasticsearch-cdm-2wg9lezz-1","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2020-09-21T02:40:18.554628Z","stageTimestamp":"2020-09-21T02:40:18.624505Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:kube-system:generic-garbage-collector","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

Comment 10 Qiaoling Tang 2020-10-09 07:48:06 UTC

(In reply to Periklis Tsirakidis from comment #5)
> @Qiaoling Tang
> 
> I've compiled a fix and the following routine to reproduce this BZ as well
> as to verify the fix:
> 
> To reproduce this bug, follow these instructions:
> 
> 1. Fresh install 4.5.z on a cluster
> 2. Install clusterlogging with 3 nodes for ES - no need for selectors
> 3. Wait until the setup is fully workable.
> 4. Upgrade the ES resource in CL CR to make the ES nodes go "Unschedulable",
> e.g. set memory requests/limits to an astronomic number
> 5. Wait until the status.nodes fields marks at least one deployment as
> unschedulable.
> 6. Switch CL+ES CR to Unmanaged
> 7. Delete one of the three ES deployments
> 
> To verify the fix:
> 
> 1. Upgrade the operators to 4.6.0
> 2. Switch resources in CL CR back to normal
> 3. Reset Unmanaged to Managed and see what happens.

I tested with clusterlogging.4.6.0-202010061132.p0 and elasticsearch-operator.4.6.0-202010081538.p0 

After I reset Unmanaged to Managed, the EO became CrashLoopBackOff, and it couldn't recreate the ES deployment.

$ oc get pod -n openshift-operators-redhat
NAME                                      READY   STATUS             RESTARTS   AGE
elasticsearch-operator-5d75f86c7d-x44c8   0/1     CrashLoopBackOff   5          7m10s

$ oc get pod -n openshift-operators-redhat
NAME                                      READY   STATUS             RESTARTS   AGE
elasticsearch-operator-5d75f86c7d-x44c8   0/1     CrashLoopBackOff   5          7m10s
[qitang@preserve-docker-slave gcp]$ oc logs -n openshift-operators-redhat elasticsearch-operator-5d75f86c7d-x44c8
{"level":"info","ts":1602229567.48116,"logger":"elasticsearch-operator","caller":"manager/main.go:45","msg":"starting up...","operator_version":"0.0.1","go_version":"go1.15.0","go_os":"linux","go_arch":"amd64","operator-sdk_version":"v0.18.1"}
I1009 07:46:08.533427       1 request.go:621] Throttling request took 1.015335812s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229572.7882924,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Registering Components.","namespace":""}
I1009 07:46:18.749922       1 request.go:621] Throttling request took 1.046099375s, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
{"level":"info","ts":1602229582.7128692,"logger":"elasticsearch-operator","caller":"manager/main.go:173","msg":"Could not create ServiceMonitor object","error":"servicemonitors.monitoring.coreos.com \"elasticsearch-operator-metrics\" already exists"}
{"level":"info","ts":1602229582.7129803,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"This operator no longer honors the image specified by the custom resources so that it is able to properly coordinate the configuration with the image.","namespace":""}
{"level":"info","ts":1602229582.712998,"logger":"elasticsearch-operator","caller":"runtime/proc.go:204","msg":"Starting the Cmd.","namespace":""}
{"level":"info","ts":1602229584.7186341,"logger":"elasticsearch-operator","caller":"kibana/controller.go:256","msg":"Registering future events","name":"openshift-logging/kibana"}
{"level":"info","ts":1602229585.06746,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-8heixwh2-2","upgradeStatus":{"underUpgrade":"True","upgradePhase":"nodeRestarting"},"conditions":[{"type":"Unschedulable","status":"True","lastTransitionTime":"2020-10-09T07:31:38Z","reason":"Unschedulable","message":"0/6 nodes are available: 6 Insufficient memory."}]}}
E1009 07:46:25.067702       1 runtime.go:78] Observed a panic: runtime.boundsError{x:2, y:2, signed:true, code:0x0} (runtime error: index out of range [2] with length 2)
goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16efae0, 0xc000807cc0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x16efae0, 0xc000807cc0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: index out of range [2] with length 2 [recovered]
	panic: runtime error: index out of range [2] with length 2

goroutine 1550 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x16efae0, 0xc000807cc0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).pruneStatus(0xc000b96200, 0xc0002dacf0, 0x2, 0xc000524be0, 0x1c, 0xc0028da000, 0x1, 0x1)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:300 +0xa10
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).updateNodeConditions(0xc000b96200, 0xc0002dacf0, 0xc00016eba0, 0xd)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:291 +0x46f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).UpdateClusterStatus(0xc000b96200, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/status.go:64 +0x205
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc000b96200, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:63 +0x21e
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc000528b40, 0x19ea8a0, 0xc00033b170, 0xc0005248a0, 0x11)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:65 +0x5bd
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc000017dc0, 0xc0005248a0, 0x11, 0xc00016eba0, 0xd, 0x3d132a567, 0xc000be2120, 0xc000be2008, 0xc000be2000)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:112 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001e0b40, 0x16640e0, 0xc000110140, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001e0b40, 0x203000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001e0b40)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0004ca060)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004ca060, 0x19a1320, 0xc0006e0210, 0x1, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004ca060, 0x3b9aca00, 0x0, 0x1, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004ca060, 0x3b9aca00, 0xc000a15260)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d

Comment 11 Qiaoling Tang 2020-10-09 08:45:43 UTC

If deploy logging 4.6, and do the following steps:

1. Install clusterlogging with 3 nodes for ES - no need for selectors
2. Wait until the setup is fully workable.
3. Upgrade the ES resource in CL CR to make the ES nodes go "Unschedulable", e.g. set memory requests/limits to an astronomic number
4. Wait until the status.nodes fields marks at least one deployment as unschedulable.
5. Switch CL+ES CR to Unmanaged
6. Delete one of the three ES deployments
7. Switch resources in CL CR back to normal
8. Reset Unmanaged to Managed and see what happens.

The EO can running as usual, but it couldn't make the ES cluster back to normal.
Logs in the EO:
{"level":"info","ts":1602229844.2539637,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229845.446016,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 7 shards in preparation for cluster restart"}
{"level":"info","ts":1602229845.4970524,"logger":"elasticsearch-operator","caller":"retry/util.go:51","msg":"failed to update node resource","node":"elasticsearch-cdm-5dqdymyw-1","error":"Operation cannot be fulfilled on deployments.apps \"elasticsearch-cdm-5dqdymyw-1\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1602229905.018403,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:519","msg":"Completed restart of node","node":"elasticsearch-cdm-5dqdymyw-1","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"info","ts":1602229905.0530834,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:492","msg":"Beginning restart of node","node":"elasticsearch-cdm-5dqdymyw-2","cluster":"elasticsearch","namespace":"openshift-logging"}
{"level":"error","ts":1602229906.0889528,"logger":"elasticsearch-operator","caller":"k8shandler/clusterrestart.go:288","msg":"Unable to perform synchronized flush","namespace":"openshift-logging","name":"elasticsearch","error":"Failed to flush 4 shards in preparation for cluster restart"}
{"level":"error","ts":1602229940.184816,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:131","msg":"failed to perform rolling update","error":"failed to get cluster state for \"elasticsearch\". Error code: true, map[error:map[reason:<nil> root_cause:[map[reason:<nil> type:master_not_discovered_exception]] type:master_not_discovered_exception] status:503]"}
{"level":"error","ts":1602229972.703489,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230004.35125,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230035.9973016,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230067.7721658,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230099.7733471,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230131.6252205,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230163.6494224,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230195.5667257,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230227.1968317,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230259.2925787,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230291.9073987,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230323.9044044,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230355.6235714,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230387.3924816,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230419.190769,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230451.3558466,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230483.1463294,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230514.8978357,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"info","ts":1602230585.3486938,"logger":"elasticsearch-operator","caller":"k8shandler/status.go:291","msg":"Pruning node status","cluster":"elasticsearch","namespace":"openshift-logging","node":{"deploymentName":"elasticsearch-cdm-5dqdymyw-3","upgradeStatus":{"scheduledUpgrade":"True"}}}
{"level":"error","ts":1602230616.3881042,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}
{"level":"error","ts":1602230619.3374827,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: connect: connection refused"}
{"level":"error","ts":1602230678.285918,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to update node","cluster":"elasticsearch","namespace":"openshift-logging","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.220.101:9200: i/o timeout"}

$ oc get elasticsearch elasticsearch -oyaml
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-10-09T05:48:12Z"
  generation: 6
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 6901675c-b8c2-4bcb-9611-0f8a116b0eff
  resourceVersion: "293498"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 2fcaf92d-ac8c-414a-868a-a07c0528dc9d
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: 5dqdymyw
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage: {}
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - deploymentName: elasticsearch-cdm-5dqdymyw-1
    upgradeStatus:
      upgradePhase: controllerUpdated
  - deploymentName: elasticsearch-cdm-5dqdymyw-2
    upgradeStatus:
      scheduledUpgrade: "True"
      underUpgrade: "True"
      upgradePhase: nodeRestarting
  pods:
    client:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      ready: []
    data:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      ready: []
    master:
      failed: []
      notReady:
      - elasticsearch-cdm-5dqdymyw-2-59c8fd59cd-jvvkq
      - elasticsearch-cdm-5dqdymyw-1-58c87fd6c8-kxh68
      ready: []
  shardAllocationEnabled: shard allocation unknown

Besides, the EO works well after doing a normal upgrade(from 4.5 to 4.6).

Comment 13 Ben Parees 2020-10-12 16:55:17 UTC

why is this only a medium?  ES failing after upgrades sounds very severe to me.

Comment 15 Qiaoling Tang 2020-10-13 08:20:07 UTC

Tested with elasticsearch-operator.4.6.0-202010130127.p0, the EO didn't become CrashLoopBackOff, but it still couldn't recreate the ES deployments after reset Unmanaged to Managed.

EO log:
{"level":"error","ts":1602577056.8270836,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to restart node","cluster":"elasticsearch","namespace":"openshift-logging","node":"elasticsearch-cdm-pw6l3eqr-3","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\": dial tcp 172.30.212.81:9200: i/o timeout"}

    nodes:
    - genUUID: pw6l3eqr
      nodeCount: 3
      proxyResources: {}
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 20Gi
        storageClassName: standard
    redundancyPolicy: SingleRedundancy
  status:
    cluster:
      activePrimaryShards: 0
      activeShards: 0
      initializingShards: 0
      numDataNodes: 0
      numNodes: 0
      pendingTasks: 0
      relocatingShards: 0
      status: cluster health unknown
      unassignedShards: 0
    nodes:
    - conditions:
      - lastTransitionTime: "2020-10-13T07:54:24Z"
        message: '0/6 nodes are available: 1 node(s) were unschedulable, 5 Insufficient memory.'
        reason: Unschedulable
        status: "True"
        type: Unschedulable
      deploymentName: elasticsearch-cdm-pw6l3eqr-3
      upgradeStatus:
        underUpgrade: "True"
        upgradePhase: nodeRestarting

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-5747454486-86rpq       1/1     Running                 0          15m
elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89   1/2     Running                 0          11m
elasticsearch-delete-app-1602576900-wsvg7       0/1     Error                   0          4m29s
elasticsearch-delete-audit-1602576900-24jpp     0/1     Error                   0          4m29s
elasticsearch-delete-infra-1602576900-vjmhz     0/1     Error                   0          4m29s
elasticsearch-rollover-app-1602576900-s56dr     0/1     Error                   0          4m29s
elasticsearch-rollover-audit-1602576900-vpdb9   0/1     Error                   0          4m29s
elasticsearch-rollover-infra-1602576900-nmgxp   0/1     Error                   0          4m29s
fluentd-8qcmd                                   1/1     Running                 0          34m
fluentd-9l77k                                   0/1     Init:CrashLoopBackOff   5          11m
fluentd-bfh7s                                   1/1     Running                 0          34m
fluentd-bzdkt                                   1/1     Running                 0          34m
fluentd-qqkvq                                   1/1     Running                 0          34m
fluentd-rq8c8                                   1/1     Running                 0          34m
kibana-684cd479bd-bt82s                         2/2     Running                 0          14m

Comment 16 Periklis Tsirakidis 2020-10-13 08:39:20 UTC

(In reply to Qiaoling Tang from comment #15)
> Tested with elasticsearch-operator.4.6.0-202010130127.p0, the EO didn't
> become CrashLoopBackOff, but it still couldn't recreate the ES deployments
> after reset Unmanaged to Managed.
> 
> EO log:
> {"level":"error","ts":1602577056.8270836,"logger":"elasticsearch-operator",
> "caller":"k8shandler/reconciler.go:65","msg":"unable to restart
> node","cluster":"elasticsearch","namespace":"openshift-logging","node":
> "elasticsearch-cdm-pw6l3eqr-3","error":"Get
> \"https://elasticsearch.openshift-logging.svc:9200/_cluster/state/nodes\":
> dial tcp 172.30.212.81:9200: i/o timeout"}
> 
>     nodes:
>     - genUUID: pw6l3eqr
>       nodeCount: 3
>       proxyResources: {}
>       resources: {}
>       roles:
>       - client
>       - data
>       - master
>       storage:
>         size: 20Gi
>         storageClassName: standard
>     redundancyPolicy: SingleRedundancy
>   status:
>     cluster:
>       activePrimaryShards: 0
>       activeShards: 0
>       initializingShards: 0
>       numDataNodes: 0
>       numNodes: 0
>       pendingTasks: 0
>       relocatingShards: 0
>       status: cluster health unknown
>       unassignedShards: 0
>     nodes:
>     - conditions:
>       - lastTransitionTime: "2020-10-13T07:54:24Z"
>         message: '0/6 nodes are available: 1 node(s) were unschedulable, 5
> Insufficient memory.'
>         reason: Unschedulable
>         status: "True"
>         type: Unschedulable
>       deploymentName: elasticsearch-cdm-pw6l3eqr-3
>       upgradeStatus:
>         underUpgrade: "True"
>         upgradePhase: nodeRestarting
> 
> $ oc get pod
> NAME                                            READY   STATUS              
> RESTARTS   AGE
> cluster-logging-operator-5747454486-86rpq       1/1     Running             
> 0          15m
> elasticsearch-cdm-pw6l3eqr-3-787f9bfcf7-6sc89   1/2     Running             
> 0          11m
> elasticsearch-delete-app-1602576900-wsvg7       0/1     Error               
> 0          4m29s
> elasticsearch-delete-audit-1602576900-24jpp     0/1     Error               
> 0          4m29s
> elasticsearch-delete-infra-1602576900-vjmhz     0/1     Error               
> 0          4m29s
> elasticsearch-rollover-app-1602576900-s56dr     0/1     Error               
> 0          4m29s
> elasticsearch-rollover-audit-1602576900-vpdb9   0/1     Error               
> 0          4m29s
> elasticsearch-rollover-infra-1602576900-nmgxp   0/1     Error               
> 0          4m29s
> fluentd-8qcmd                                   1/1     Running             
> 0          34m
> fluentd-9l77k                                   0/1    
> Init:CrashLoopBackOff   5          11m
> fluentd-bfh7s                                   1/1     Running             
> 0          34m
> fluentd-bzdkt                                   1/1     Running             
> 0          34m
> fluentd-qqkvq                                   1/1     Running             
> 0          34m
> fluentd-rq8c8                                   1/1     Running             
> 0          34m
> kibana-684cd479bd-bt82s                         2/2     Running             
> 0          14m


@Qiaoling

Could you please share the link to the cluster?

Comment 17 Periklis Tsirakidis 2020-10-13 09:03:45 UTC

Putting this back to ON_QA, because this depends on backporting: https://bugzilla.redhat.com/show_bug.cgi?id=1887357

Comment 18 Ben Parees 2020-10-13 16:04:49 UTC

*** Bug 1887943 has been marked as a duplicate of this bug. ***

Comment 20 Qiaoling Tang 2020-10-14 02:28:40 UTC

Verified with elasticsearch-operator.4.6.0-202010132327.p0

Comment 22 errata-xmlrpc 2020-10-27 15:12:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198