Description of problem:

Panic observed in the elasticsearch operator.

Version-Release number of selected component (if applicable):

4.7 (master at commit 271fcc2712e70d055d9d5a6506ac0301cc320807)

How reproducible:

Always

Steps to Reproduce:
1. Built ESO from latest master.
2. Did "make deploy" in the ESO repo, so that the cluster imagestream has the images.
3. Created env variables for ESO and ESO-REGISTRY to point to the imagestream images.
4. Did "make deploy" from CLO.
5. Now all 4 images (CLO/EO operator/operator-registry) are locally built from latest master.
6. Created a CL instance using:

apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 500m
      type: fluentd
  curation:
    curator:
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 1
      redundancyPolicy: ZeroRedundancy
      resources:
        limits:
          cpu: 600m
        requests:
          cpu: 600m
      storage:
        storageClassName: "standard"
        # size: "2Gi"
    type: elasticsearch
  managementState: Managed

The elasticsearch CR created by CLO is:

apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-11-19T15:02:22Z"
  generation: 2
  managedFields:
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:indexManagement:
          .: {}
          f:mappings: {}
          f:policies: {}
        f:managementState: {}
        f:nodeSpec:
          .: {}
          f:proxyResources:
            .: {}
            f:limits:
              .: {}
              f:memory: {}
            f:requests:
              .: {}
              f:cpu: {}
              f:memory: {}
          f:resources:
            .: {}
            f:limits:
              .: {}
              f:cpu: {}
            f:requests:
              .: {}
              f:cpu: {}
        f:redundancyPolicy: {}
      f:status:
        f:clusterHealth: {}
        f:conditions: {}
        f:nodes: {}
        f:pods: {}
        f:shardAllocationEnabled: {}
    manager: cluster-logging-operator
    operation: Update
    time: "2020-11-19T15:02:22Z"
  - apiVersion: logging.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:nodes: {}
      f:status:
        .: {}
        f:cluster:
          .: {}
          f:activePrimaryShards: {}
          f:activeShards: {}
          f:initializingShards: {}
          f:numDataNodes: {}
          f:numNodes: {}
          f:pendingTasks: {}
          f:relocatingShards: {}
          f:status: {}
          f:unassignedShards: {}
    manager: elasticsearch-operator
    operation: Update
    time: "2020-11-19T15:02:23Z"
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: ba311798-6765-414e-93cd-f81518774fd7
  resourceVersion: "757970"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 1b796d53-01a0-471e-83c5-09a866aa12b8
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 256Mi
    resources:
      limits:
        cpu: 600m
      requests:
        cpu: 600m
  nodes:
  - genUUID: imnxcsqs
    nodeCount: 1
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      storageClassName: standard
  redundancyPolicy: ZeroRedundancy

Actual results:

Panic observed in the elasticsearch operator logs:
{"component":"elasticsearch-operator","go_arch":"amd64","go_os":"linux","go_version":"go1.15.0","level":"0","message":"Starting the Cmd.","namespace":"","operator-sdk_version":"v0.19.4","operator_version":"0.0.1","ts":"2020-11-19T08:58:26.342390874Z"} E1119 08:58:27.526030 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 1445 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16807e0, 0x244d8b0) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6 k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89 panic(0x16807e0, 0x244d8b0) /usr/lib/golang/src/runtime/panic.go:969 +0x175 github.com/openshift/elasticsearch-operator/pkg/k8shandler.newVolumeSource(0xc000044460, 0xd, 0xc001813660, 0x1c, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:599 +0x24f github.com/openshift/elasticsearch-operator/pkg/k8shandler.newVolumes(0xc000044460, 0xd, 0xc001813660, 0x1c, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:555 +0xd8 github.com/openshift/elasticsearch-operator/pkg/k8shandler.newPodTemplateSpec(0xc001813660, 0x1c, 0xc000044460, 0xd, 0xc000794bc0, 0x11, 0xc00203c510, 0x3, 0x3, 0x1, ...) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/common.go:379 +0xc58 github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*deploymentNode).populateReference(0xc002048000, 0xc001813660, 0x1c, 0xc00203c510, 0x3, 0x3, 0x1, 0x0, 0x0, 0x0, ...) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/deployment.go:68 +0x9a8 github.com/openshift/elasticsearch-operator/pkg/k8shandler.newDeploymentNode(...) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/nodetypefactory.go:85 github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).GetNodeTypeInterface(0xc0008a25c0, 0xc000044740, 0x8, 0xc00203c510, 0x3, 0x3, 0x1, 0x0, 0x0, 0x0, ...) 
/go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/nodetypefactory.go:49 +0x445 github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).populateNodes(0xc0008a25c0, 0x0, 0x0) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:296 +0x1c5 github.com/openshift/elasticsearch-operator/pkg/k8shandler.(*ElasticsearchRequest).CreateOrUpdateElasticsearchCluster(0xc0008a25c0, 0x0, 0x0) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/cluster.go:54 +0x1e9 github.com/openshift/elasticsearch-operator/pkg/k8shandler.Reconcile(0xc0005d0000, 0x1a81980, 0xc000af7620, 0xc000794bc0, 0x11) /go/src/github.com/openshift/elasticsearch-operator/pkg/k8shandler/reconciler.go:62 +0x965 github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile(0xc0008f4360, 0xc000794bc0, 0x11, 0xc000044460, 0xd, 0x31447ef98, 0xc00029c2d0, 0xc00029c248, 0xc00029c240) /go/src/github.com/openshift/elasticsearch-operator/pkg/controller/elasticsearch/controller.go:111 +0x22a sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002566c0, 0x16df6c0, 0xc000840a60, 0x0) /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x166 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002566c0, 0x203000) /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xb0 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0002566c0) /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0008166d0) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0008166d0, 0x1a36880, 0xc00081ecc0, 0x1, 0xc0008ee000) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xad k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008166d0, 0x3b9aca00, 0x0, 0x1, 0xc0008ee000) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98 k8s.io/apimachinery/pkg/util/wait.Until(0xc0008166d0, 0x3b9aca00, 0xc0008ee000) /go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1 /go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x32d panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145a80f] Expected results: np panic Additional info:
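For illustration, here is a minimal, self-contained Go sketch of the failure mode the stack trace points at: newVolumeSource (common.go:599) appears to dereference the node storage size while the CR above only sets storage.storageClassName and leaves size commented out. The type and function names below are hypothetical simplifications, not the operator's real code.

// Hypothetical sketch only -- not the elasticsearch-operator source.
// It reproduces the same class of crash: an unguarded dereference of a
// storage size field that is nil because storage.size was omitted.
package main

import "fmt"

// storageSpec is a simplified stand-in for the node storage spec; Size is a
// pointer, so it is nil when the field is left out of the CR.
type storageSpec struct {
	StorageClassName *string
	Size             *string
}

// newVolumeSourceSketch mimics building a volume request from the spec.
func newVolumeSourceSketch(s storageSpec) string {
	// Unguarded dereference: panics with "invalid memory address or nil
	// pointer dereference" when storage.size is missing.
	return "requested PVC size: " + *s.Size
}

func main() {
	class := "standard"
	// storage.size omitted, matching the ClusterLogging CR in this report.
	fmt.Println(newVolumeSourceSketch(storageSpec{StorageClassName: &class}))
}

Running this produces the same "invalid memory address or nil pointer dereference" runtime error seen in the trace above.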
Tested with elasticsearch-operator.4.7.0-202011251213.p0 and found this error message in the EO log:

{"cluster":"elasticsearch","component":"elasticsearch-operator","error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-ppble947-1"},"go_arch":"amd64","go_os":"linux","go_version":"go1.15.2","level":"0","message":"Storage size is required but was missing. Defaulting to EmptyDirVolume. Please adjust your CR accordingly.","name":{"Namespace":"openshift-logging","Name":"kibana"},"namespace":"openshift-logging","node":"elasticsearch-cdm-ppble947-1","objectKey":{"Namespace":"openshift-logging","Name":"elasticsearch"},"operator-sdk_version":"v0.19.4","operator_version":"0.0.1","ts":"2020-11-26T02:59:42.617295079Z"}

The EO and ES pods were in Running state. After adding storage.size to the cl/instance, the ES pods were restarted and mounted the new PVCs.

Per the above results, moving the bz to VERIFIED.
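For reference, a minimal Go sketch of the guarded behaviour the fixed build's log message suggests: when storage.size is missing, warn and fall back to an EmptyDir volume instead of panicking. This is only an illustration under that assumption; the struct, function name, and message wiring are hypothetical and not the operator's actual implementation.

// Hypothetical sketch only -- illustrates "missing size => warn and use
// EmptyDir" as reported in the EO log above, not the operator's real code.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// storageSpec is a simplified stand-in for the node storage spec.
type storageSpec struct {
	StorageClassName *string
	Size             *string // nil when storage.size is omitted from the CR
}

// newVolumeSourceSafe falls back to an EmptyDir volume instead of
// dereferencing a nil size, matching the observed warning message.
func newVolumeSourceSafe(claimName string, s storageSpec) corev1.VolumeSource {
	if s.Size == nil {
		fmt.Println("Storage size is required but was missing. " +
			"Defaulting to EmptyDirVolume. Please adjust your CR accordingly.")
		return corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}}
	}
	return corev1.VolumeSource{
		PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
			ClaimName: claimName, // e.g. elasticsearch-cdm-ppble947-1 (illustrative)
		},
	}
}

func main() {
	class := "standard"
	vs := newVolumeSourceSafe("elasticsearch-cdm-ppble947-1",
		storageSpec{StorageClassName: &class})
	fmt.Printf("%+v\n", vs)
}

With storage.size actually set in the CR (the commented-out size: "2Gi" in the reproduction steps), the non-nil branch would request a PVC instead, which matches the pods remounting new PVCs after the change.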
@ewolinet

The EO panic disappears, but I found an issue related to these changes, so I'm asking here:

The ES cluster health status was always RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods could be updated to mount the new PVCs.

Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it did not: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).

Is this acceptable behavior?
(In reply to Qiaoling Tang from comment #4)
> @ewolinet
>
> The EO panic disappears, but I found an issue related to these changes, so I'm asking here:
>
> The ES cluster health status was always RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods could be updated to mount the new PVCs.
>
> Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it did not: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).
>
> Is this acceptable behavior?

I'm not following the test case here...

Is this going from ephemeral storage to persistent storage on a live cluster?

What was the redundancy policy when you did this? If it is "Zero" I would expect the cluster to stay Red because it lost the primary shards that were on the node that was restarted. If it was "Single" I would expect the cluster to eventually recreate the primary shards that were lost and continue the update once the cluster got to "Yellow".
(In reply to ewolinet from comment #5)
> (In reply to Qiaoling Tang from comment #4)
> > @ewolinet
> >
> > The EO panic disappears, but I found an issue related to these changes, so I'm asking here:
> >
> > The ES cluster health status was always RED after adding storage.size to the cl/instance when there were 3 ES pods, and only the first and second ES pods could be updated to mount the new PVCs.
> >
> > Sometimes the whole ES cluster worked: new indices could be created and all 3 ES pods were in a normal state (1 of 3 tries). But sometimes it did not: the elasticsearch container couldn't start, and the rollover and delete jobs all failed (2 of 3 tries).
> >
> > Is this acceptable behavior?
>
> I'm not following the test case here...
>
> Is this going from ephemeral storage to persistent storage on a live cluster?

Yes

> What was the redundancy policy when you did this? If it is "Zero" I would expect the cluster to stay Red because it lost the primary shards that were on the node that was restarted. If it was "Single" I would expect the cluster to eventually recreate the primary shards that were lost and continue the update once the cluster got to "Yellow".

It's "Single", but the indices couldn't become yellow/green. I'll open a new bz to track the issue. Thank you.
How long will it take for ES to recreate the primary shards? I waited for about 1 hour, but the primary shard couldn't be recreated.

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
Tue Dec 8 06:45:18 UTC 2020
health status index                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   app-000002                   _XpnWYCUSuSFCdBm88uduw   3   1          0            0          0              0
green  open   .kibana_1                    he_yqpIdSyO0wcIiW3ff4g   1   1          0            0          0              0
green  open   infra-000003                 3E4RZTHBT1K8dKsWd-wCwA   3   1     202441            0        292            146
red    open   app-000001                   3j2NLFQeQY-W9lDi2ty2og   3   1         28            0          0              0
green  open   .security                    x8eY55RWTJGiUV6VLtJ36A   1   1          5            0          0              0
red    open   infra-000001                 aSlD9YxtTPibrXiGWBON6w   3   1      74888            0         98             49
green  open   infra-000004                 9JX_TQZXT9qMoTGSHHU01g   3   1          0            0          0              0
green  open   infra-000002                 8BBgRr0ySMqQ5pGjY_oIfg   3   1     204957            0        282            141
red    open   audit-000001                 FudxjJL0SpGYCL0D9tffGw   3   1          0            0          0              0
green  open   .kibana_-377444158_kubeadmin RItvXRPsR9aTl5gwtDKkgg   1   1          1            0          0              0

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
app-000001                   1 p STARTED
app-000001                   1 r STARTED
app-000001                   2 p STARTED
app-000001                   2 r STARTED
app-000001                   0 p UNASSIGNED NODE_LEFT
app-000001                   0 r UNASSIGNED NODE_LEFT
.kibana_-377444158_kubeadmin 0 p STARTED
.kibana_-377444158_kubeadmin 0 r STARTED
.security                    0 p STARTED
.security                    0 r STARTED
infra-000001                 1 p UNASSIGNED NODE_LEFT
infra-000001                 1 r UNASSIGNED NODE_LEFT
infra-000001                 2 p STARTED
infra-000001                 2 r STARTED
infra-000001                 0 p STARTED
infra-000001                 0 r STARTED
.kibana_1                    0 p STARTED
.kibana_1                    0 r STARTED
audit-000001                 1 p STARTED
audit-000001                 1 r STARTED
audit-000001                 2 p STARTED
audit-000001                 2 r STARTED
audit-000001                 0 p UNASSIGNED NODE_LEFT
audit-000001                 0 r UNASSIGNED NODE_LEFT
infra-000003                 1 r STARTED
infra-000003                 1 p STARTED
infra-000003                 2 p STARTED
infra-000003                 2 r STARTED
infra-000003                 0 p STARTED
infra-000003                 0 r STARTED
infra-000002                 1 r STARTED
infra-000002                 1 p STARTED
infra-000002                 2 p STARTED
infra-000002                 2 r STARTED
infra-000002                 0 p STARTED
infra-000002                 0 r STARTED

$ oc exec elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-fvhvad15-1-654fbfcb4b-shzr4 -n openshift-logging' to see all of the containers in this pod.
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.2.211           59          73  23    0.43    0.73     0.91 mdi       -      elasticsearch-cdm-fvhvad15-1
10.128.2.40            51          77  32    2.03    1.48     1.29 mdi       -      elasticsearch-cdm-fvhvad15-2
10.131.0.56            22          69  15    1.81    1.48     1.26 mdi       *      elasticsearch-cdm-fvhvad15-3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652