Bug 1899589 - panic observed in elasticsearch operator logs
Summary: panic observed in elasticsearch operator logs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: ewolinet
QA Contact: Qiaoling Tang
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-exploration
Depends On:
Blocks:
 
Reported: 2020-11-19 15:54 UTC by Vimal Kumar
Modified: 2021-02-24 11:23 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, omitting the storage size of the Elasticsearch node caused a panic in the Elasticsearch operator code. This appeared in the logs as: `Observed a panic: "invalid memory address or nil pointer dereference"`. This happened because, although the storage size is a required field, the software did not check for it. The current release fixes this issue so that there is no panic if the storage size is omitted. Instead, the storage defaults to ephemeral storage and a log message is generated for the user. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1899589[*BZ#1899589*])
Clone Of:
Environment:
Last Closed: 2021-02-24 11:22:27 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift elasticsearch-operator pull 580 (closed): Bug 1899589: Updating CRD to make storage size required if storage is provided (last updated 2021-02-12 22:22:33 UTC)
Red Hat Product Errata RHBA-2021:0652 (last updated 2021-02-24 11:23:08 UTC)

Description Vimal Kumar 2020-11-19 15:54:25 UTC
Description of problem:

A panic is observed in the elasticsearch operator logs.



Version-Release number of selected component (if applicable): 4.7 (master at commit 271fcc2712e70d055d9d5a6506ac0301cc320807)


How reproducible:
Always

Steps to Reproduce:
1. Build the elasticsearch-operator (ESO) from the latest master.
2. Run make deploy in the ESO repo so that the cluster imagestream has the images.
3. Set the ESO and ESO-REGISTRY environment variables to point to the imagestream images.
4. Run make deploy from the CLO repo.
5. All four images (the CLO and EO operator and operator-registry images) are now built locally from the latest master.
6. Create a ClusterLogging (CL) instance using the following:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 500m
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 1
      redundancyPolicy: ZeroRedundancy
      resources:
        limits:
          cpu: 600m
        requests:
          cpu: 600m
      storage:
        storageClassName: "standard"
        #        size: "2Gi"
    type: elasticsearch
  managementState: Managed

The elasticsearch CR created by the CLO is:
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-11-19T15:02:22Z"
  generation: 2
  managedFields:
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:indexManagement:
          .: {}
          f:mappings: {}
          f:policies: {}
        f:managementState: {}
        f:nodeSpec:
          .: {}
          f:proxyResources:
            .: {}
            f:limits:
              .: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:resources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
            f:requests:
              .: {}
              f:cpu: {}
        f:redundancyPolicy: {}
      f:status:
        f:clusterHealth: {}
        f:conditions: {}
        f:nodes: {}
        f:pods: {}
        f:shardAllocationEnabled: {}
    manager: cluster-logging-operator
    operation: Update
    time: "2020-11-19T15:02:22Z"
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:nodes: {}
      f:status:
        .: {}
        f:cluster:
          .: {}
          f:activePrimaryShards: {}
          f:activeShards: {}
          f:initializingShards: {}
          f:numDataNodes: {}
          f:numNodes: {}
          f:pendingTasks: {}
          f:relocatingShards: {}
          f:status: {}
          f:unassignedShards: {}
    manager: elasticsearch-operator
    operation: Update
    time: "2020-11-19T15:02:23Z"
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: ba311798-6765-414e-93cd-f81518774fd7
  resourceVersion: "757970"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 1b796d53-01a0-471e-83c5-09a866aa12b8
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 256Mi
    resources:
      limits:
        cpu: 600m
      requests:
        cpu: 600m
  nodes:
  - genUUID: imnxcsqs
    nodeCount: 1
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      storageClassName: standard
  redundancyPolicy: ZeroRedundancy


Actual results: a panic is observed in the elasticsearch operator logs:

{"component":"elasticsearch-operator","go_arch":"amd64","go_os":"linux","go_version":"go1.15.0","level":"0","message":"Starting the Cmd.","namespace":"","operator-sdk_version":"v0.19.4","operator_version":"0.0.1","ts":"2020-11-19T08:58:26.342390874Z"}
E1119 08:58:27.526030       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1445 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16807e0, 0x244d8b0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x16807e0, 0x244d8b0)
	/usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/openshift/elasticsearch-operator/pkg/k8shandler.newVolumeSource(0xc000044460, 0xd, 0xc001813660, 0x1c, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:599 +0x24f
github.com/openshift/elasticsearch-operator/pkg/k8shandler.newVolumes(0xc000044460, 0xd, 0xc001813660, 0x1c, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:555 +0xd8
github.com/openshift/elasticsearch-operator/pkg/k8shandler.newPodTemplateSpec(0xc001813660, 0x1c, 0xc000044460, 0xd, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:379 +0xc58
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*deploymentNode).populateReference(0xc002048000, 0xc001813660, 0x1c, 0xc00203c510, 0x3, 0x3, 0x1, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/deployment.go:68 +0x9a8
github.com/openshift/elasticsearch-operator/pkg/k8shandler.newDeploymentNode(...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/nodetypefactory.go:85
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).GetNodeTypeInterface(0xc0008a25c0, 0xc000044740, 0x8, 0xc00203c510, 0x3, 0x3, 0x1, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/nodetypefactory.go:49 +0x445
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).populateNodes(0xc0008a25c0, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:296 +0x1c5
github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc0008a25c0, 0x0, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:54 +0x1e9
github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc0005d0000, 0x1a81980, 0xc000af7620, 0xc000794bc0, 0x11)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:62 +0x965
github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc0008f4360, 0xc000794bc0, 0x11, 0xc000044460, 0xd, 0x31447ef98, 0xc00029c2d0, 0xc00029c248, 0xc00029c240)
	/go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:111 +0x22a
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002566c0, 0x16df6c0, 0xc000840a60, 0x0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002566c0, 0x203000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0002566c0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0008166d0)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0008166d0, 0x1a36880, 0xc00081ecc0, 0x1, 0xc0008ee000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008166d0, 0x3b9aca00, 0x0, 0x1, 0xc0008ee000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0008166d0, 0x3b9aca00, 0xc0008ee000)
	/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145a80f]


Expected results: no panic


Additional info:
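The trace above points at newVolumeSource in pkg/k8shandler/common.go, which appears to panic while handling the storage spec whose size was omitted from the CR. Below is a minimal, self-contained Go sketch of the kind of nil guard that avoids the dereference by falling back to an EmptyDir volume; the storageSpec type, the newVolumeSourceSafe function name, and the claim naming are simplified assumptions for illustration, not the operator's actual code.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// storageSpec mirrors the shape of the Elasticsearch CR storage block:
// both fields are optional pointers, so either may be nil.
type storageSpec struct {
	StorageClassName *string
	Size             *resource.Quantity
}

// newVolumeSourceSafe returns an EmptyDir-backed volume source when no size is
// given (instead of dereferencing a nil Size), and a PVC-backed one otherwise.
func newVolumeSourceSafe(claimName string, storage *storageSpec) corev1.VolumeSource {
	if storage == nil || storage.Size == nil {
		// Matches the behavior described in the fix: default to ephemeral
		// storage and tell the user to adjust the CR.
		fmt.Println("Storage size is required but was missing. Defaulting to EmptyDirVolume. Please adjust your CR accordingly.")
		return corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}}
	}
	return corev1.VolumeSource{
		PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
			ClaimName: claimName, // the real operator derives the claim name from the node
		},
	}
}

func main() {
	class := "standard"

	// Storage block with a class name but no size, as in the reproducer CR:
	// falls back to EmptyDir instead of panicking.
	fmt.Printf("%+v\n", newVolumeSourceSafe("elasticsearch-example-1", &storageSpec{StorageClassName: &class}))

	// Storage block with a size: a PVC-backed volume as before.
	size := resource.MustParse("2Gi")
	fmt.Printf("%+v\n", newVolumeSourceSafe("elasticsearch-example-1", &storageSpec{StorageClassName: &class, Size: &size}))
}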

Comment 3 Qiaoling Tang 2020-11-26 03:41:55 UTC
Tested with elasticsearch-operator.4.7.0-202011251213.p0; found this error message in the EO logs:

{"cluster":"elasticsearch","component":"elasticsearch-operator","error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-ppble947-1"},"go_arch":"amd64","go_os":"linux","go_version":"go1.15.2","level":"0","message":"Storage size is required but was missing. Defaulting to EmptyDirVolume. Please adjust your CR accordingly.","name":{"Namespace":"openshift-logging","Name":"kibana"},"namespace":"openshift-logging","node":"elasticsearch-cdm-ppble947-1","objectKey":{"Namespace":"openshift-logging","Name":"elasticsearch"},"operator-sdk_version":"v0.19.4","operator_version":"0.0.1","ts":"2020-11-26T02:59:42.617295079Z"}

The EO and ES pods were able to reach the Running state.

After adding storage.size to the cl/instance, the ES pods were restarted and could mount the new PVCs.

Per the above results, moving the BZ to VERIFIED.

Comment 4 Qiaoling Tang 2020-11-26 05:49:15 UTC
@ewolinet

The EO panic is gone, but I found an issue related to these changes, so I'm asking here:

The ES cluster health status stayed RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods were updated to mount the new PVCs.

Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it didn't: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).

Is this acceptable behavior?

Comment 5 ewolinet 2020-11-30 17:47:49 UTC
(In reply to Qiaoling Tang from comment #4)
> @ewolinet
> 
> The EO panic is gone, but I found an issue related to these changes, so I'm asking here:
> 
> The ES cluster health status stayed RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods were updated to mount the new PVCs.
> 
> Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it didn't: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).
> 
> Is this acceptable behavior?

I'm not following the test case here...

Is this going from ephemeral storage to persisted storage on a live cluster?
What was the redundancy policy when you did this? If it is "Zero" I would expect the cluster to stay Red because it lost the primary shards that were on the node that was restarted. If it was "Single" I would expect the cluster to eventually recreate the primary shards that were lost and continue the update once the cluster got to "Yellow".
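
To make the distinction above concrete: the redundancy policy determines how many replica copies exist for each primary shard, so with zero replicas a node whose ephemeral storage is wiped on restart takes its primary shards with it, while with one replica a surviving copy can be promoted. A rough illustrative sketch follows; the replicasFor helper and its mapping are assumptions for illustration, not the operator's actual types.

package main

import "fmt"

// replicasFor illustrates how a redundancy policy typically maps to the number
// of replica shards kept per primary shard.
func replicasFor(redundancyPolicy string) int {
	switch redundancyPolicy {
	case "SingleRedundancy":
		return 1 // a surviving replica can be promoted to primary after a node restart
	default: // "ZeroRedundancy"
		return 0 // primaries only: losing a node's ephemeral storage loses those shards
	}
}

func main() {
	for _, policy := range []string{"ZeroRedundancy", "SingleRedundancy"} {
		fmt.Printf("%s -> %d replica(s) per primary shard\n", policy, replicasFor(policy))
	}
}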

Comment 6 Qiaoling Tang 2020-12-07 05:47:58 UTC
(In reply to ewolinet from comment #5)
> (In reply to Qiaoling Tang from comment #4)
> > @ewolinet
> > 
> > The EO panic is gone, but I found an issue related to these changes, so I'm asking here:
> > 
> > The ES cluster health status stayed RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods were updated to mount the new PVCs.
> > 
> > Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it didn't: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).
> > 
> > Is this acceptable behavior?
> 
> I'm not following the test case here...
> 
> Is this going from ephemeral storage to persisted storage on a live cluster?
Yes

> What was the redundancy policy when you did this? If it is "Zero" I would
> expect the cluster to stay Red because it lost the primary shards that were
> on the node that was restarted. If it was "Single" I would expect the
> cluster to eventually recreate the primary shards that were lost and
> continue the update once the cluster got to "Yellow".

It's "Single", but the indices couldn't become yellow/green, I'll open a new bz to track the issue. Thank you.

Comment 8 Qiaoling Tang 2020-12-08 06:47:55 UTC
How long will it take for ES to recreate the primary shards? I waited for about 1 hour, but the primary shards were still not recreated.

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4  -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
Tue Dec  8 06:45:18 UTC 2020
health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   app-000002                   _XpnWYCUSuSFCdBm88uduw   3   1          0            0          0              0
green  open   .kibana_1                    he_yqpIdSyO0wcIiW3ff4g   1   1          0            0          0              0
green  open   infra-000003                 3E4RZTHBT1K8dKsWd-wCwA   3   1     202441            0        292            146
red    open   app-000001                   3j2NLFQeQY-W9lDi2ty2og   3   1         28            0          0              0
green  open   .security                    x8eY55RWTJGiUV6VLtJ36A   1   1          5            0          0              0
red    open   infra-000001                 aSlD9YxtTPibrXiGWBON6w   3   1      74888            0         98             49
green  open   infra-000004                 9JX_TQZXT9qMoTGSHHU01g   3   1          0            0          0              0
green  open   infra-000002                 8BBgRr0ySMqQ5pGjY_oIfg   3   1     204957            0        282            141
red    open   audit-000001                 FudxjJL0SpGYCL0D9tffGw   3   1          0            0          0              0
green  open   .kibana_-377444158_kubeadmin RItvXRPsR9aTl5gwtDKkgg   1   1          1            0          0              0

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
app-000001                   1 p STARTED    
app-000001                   1 r STARTED    
app-000001                   2 p STARTED    
app-000001                   2 r STARTED    
app-000001                   0 p UNASSIGNED NODE_LEFT
app-000001                   0 r UNASSIGNED NODE_LEFT
.kibana_-377444158_kubeadmin 0 p STARTED    
.kibana_-377444158_kubeadmin 0 r STARTED    
.security                    0 p STARTED    
.security                    0 r STARTED    
infra-000001                 1 p UNASSIGNED NODE_LEFT
infra-000001                 1 r UNASSIGNED NODE_LEFT
infra-000001                 2 p STARTED    
infra-000001                 2 r STARTED    
infra-000001                 0 p STARTED    
infra-000001                 0 r STARTED    
.kibana_1                    0 p STARTED    
.kibana_1                    0 r STARTED    
audit-000001                 1 p STARTED    
audit-000001                 1 r STARTED    
audit-000001                 2 p STARTED    
audit-000001                 2 r STARTED    
audit-000001                 0 p UNASSIGNED NODE_LEFT
audit-000001                 0 r UNASSIGNED NODE_LEFT
infra-000003                 1 r STARTED    
infra-000003                 1 p STARTED    
infra-000003                 2 p STARTED    
infra-000003                 2 r STARTED    
infra-000003                 0 p STARTED    
infra-000003                 0 r STARTED    
infra-000002                 1 r STARTED    
infra-000002                 1 p STARTED    
infra-000002                 2 p STARTED    
infra-000002                 2 r STARTED    
infra-000002                 0 p STARTED    
infra-000002                 0 r STARTED    

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.2.211           59          73  23    0.43    0.73     0.91 mdi       -      elasticsearch-cdm-fvhvad15-1
10.128.2.40            51          77  32    2.03    1.48     1.29 mdi       -      elasticsearch-cdm-fvhvad15-2
10.131.0.56            22          69  15    1.81    1.48     1.26 mdi       *      elasticsearch-cdm-fvhvad15-3

Comment 14 errata-xmlrpc 2021-02-24 11:22:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

