Bug 2126037
| Summary: | ODF4.12 Deployment, ocs-metrics-exporter pod stuck on CrashLoopBackOff state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | ocs-operator | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Bukatovic <mbukatov> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | aaaggarw, akrai, amohan, asagare, kramdoss, muagarwa, nthomas, ocs-bugs, odf-bz-bot, pbalogh, sostapov, tdesala, uchapaga |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.12.0-70 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Oded
2022-09-12 08:18:36 UTC
Fix added with PR: https://github.com/red-hat-storage/ocs-operator/pull/1805

Submitted fix with PR: https://github.com/red-hat-storage/ocs-operator/pull/1826

Observed the same issue without LSO as well. On the IBM Power platform (ppc64le), the ocs-metrics-exporter-* pod is in Error state with the latest-stable ODF 4.12 build.
[root@rdr-aaaggarw-lon06-bastion-0 scripts]# oc get csv -A
NAMESPACE NAME DISPLAY VERSION REPLACES PHASE
openshift-local-storage local-storage-operator.4.12.0-202209161347 Local Storage 4.12.0-202209161347 Succeeded
openshift-operator-lifecycle-manager packageserver Package Server 0.19.0 Succeeded
openshift-storage mcg-operator.v4.12.0 NooBaa Operator 4.12.0 Succeeded
openshift-storage ocs-operator.v4.12.0 OpenShift Container Storage 4.12.0 Installing
openshift-storage odf-csi-addons-operator.v4.12.0 CSI Addons 4.12.0 Succeeded
openshift-storage odf-operator.v4.12.0 OpenShift Data Foundation 4.12.0 Succeeded
[root@rdr-aaaggarw-lon06-bastion-0 scripts]# oc get csv odf-operator.v4.12.0 -n openshift-storage -o yaml |grep full_version
full_version: 4.12.0-65
Pods:
[root@rdr-aaaggarw-lon06-bastion-0 scripts]# oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-7b87dc8945-f6zg4 2/2 Running 0 13m
csi-cephfsplugin-25mpc 2/2 Running 0 12m
csi-cephfsplugin-47ws4 2/2 Running 0 12m
csi-cephfsplugin-bmbcr 2/2 Running 0 12m
csi-cephfsplugin-provisioner-7fcdd97ddb-khw5w 5/5 Running 0 12m
csi-cephfsplugin-provisioner-7fcdd97ddb-qcqzs 5/5 Running 0 12m
csi-rbdplugin-4wx2p 3/3 Running 0 12m
csi-rbdplugin-mcr6b 3/3 Running 0 12m
csi-rbdplugin-provisioner-75f6dcfd48-mdzkv 6/6 Running 0 12m
csi-rbdplugin-provisioner-75f6dcfd48-qr9d2 6/6 Running 0 12m
csi-rbdplugin-s7q9m 3/3 Running 0 12m
noobaa-core-0 1/1 Running 0 9m20s
noobaa-db-pg-0 1/1 Running 0 9m20s
noobaa-endpoint-5f444f44dd-h6f6h 1/1 Running 0 7m47s
noobaa-operator-6f4d8d4b78-svzj7 1/1 Running 0 13m
ocs-metrics-exporter-766f6b65d6-g8jsv 0/1 Error 6 (3m1s ago) 13m
ocs-operator-6df99899bb-b5gqm 1/1 Running 0 13m
odf-console-84878864c5-x7dbz 1/1 Running 0 14m
odf-operator-controller-manager-855d7ffcbb-fnn95 2/2 Running 0 14m
rook-ceph-crashcollector-lon06-worker-0.rdr-aaaggarw.ibm.ch9gmb 1/1 Running 0 10m
rook-ceph-crashcollector-lon06-worker-1.rdr-aaaggarw.ibm.cs8zhf 1/1 Running 0 10m
rook-ceph-crashcollector-lon06-worker-2.rdr-aaaggarw.ibm.c5vshk 1/1 Running 0 10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6c7ff57dlmwwk 2/2 Running 0 10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-fc8f88f8l66rs 2/2 Running 0 10m
rook-ceph-mgr-a-66cb6849d-577kd 2/2 Running 0 11m
rook-ceph-mon-a-5fbcfcdc7c-6ds48 2/2 Running 0 12m
rook-ceph-mon-b-59986fcbf9-2pvj9 2/2 Running 0 11m
rook-ceph-mon-c-7858d4884f-n8xnt 2/2 Running 0 11m
rook-ceph-operator-57bc4f8d9c-59l5r 1/1 Running 0 13m
rook-ceph-osd-0-55dcbb7896-g6hgt 2/2 Running 0 10m
rook-ceph-osd-1-7bffdc6fc9-26dr6 2/2 Running 0 10m
rook-ceph-osd-2-675945d86f-6p7mt 2/2 Running 0 10m
rook-ceph-osd-prepare-b4dc1a8fc4c3d3f125294f31d31b26ce-6fsk4 0/1 Completed 0 10m
rook-ceph-osd-prepare-c327fb94b8a00969bde17a058a76b71a-df6z9 0/1 Completed 0 10m
rook-ceph-osd-prepare-de7658dc2d8813fbbe2b3cc1d8915463-vdp62 0/1 Completed 0 10m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7c99c8bf9zwj 2/2 Running 0 9m48s
rook-ceph-tools-5996bdc9-8xzvd 1/1 Running 0 9m20s
Events section of ocs-metrics-exporter-* pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned openshift-storage/ocs-metrics-exporter-766f6b65d6-g8jsv to lon06-worker-2.rdr-aaaggarw.ibm.com by lon06-master-1.rdr-aaaggarw.ibm.com
Normal AddedInterface 14m multus Add eth0 [10.128.2.18/23] from openshift-sdn
Normal Pulling 14m kubelet Pulling image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:ab190d5de6c1e1f504f3ad6bf4da7daf45439fcda3ae64f3831c57eb132fd419"
Normal Pulled 13m kubelet Successfully pulled image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:ab190d5de6c1e1f504f3ad6bf4da7daf45439fcda3ae64f3831c57eb132fd419" in 23.705691487s
Normal Pulled 5m32s (x4 over 9m17s) kubelet Container image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:ab190d5de6c1e1f504f3ad6bf4da7daf45439fcda3ae64f3831c57eb132fd419" already present on machine
Normal Created 5m31s (x5 over 13m) kubelet Created container ocs-metrics-exporter
Normal Started 5m31s (x5 over 13m) kubelet Started container ocs-metrics-exporter
Warning BackOff 4m10s (x14 over 8m21s) kubelet Back-off restarting failed container
Logs of ocs-metrics-exporter-* pod:
[root@rdr-aaaggarw-lon06-bastion-0 scripts]# oc logs -f pod/ocs-metrics-exporter-766f6b65d6-g8jsv -n openshift-storage
I0928 12:47:04.428236 1 main.go:29] using options: &{Apiserver: KubeconfigPath: Host:0.0.0.0 Port:8080 ExporterHost:0.0.0.0 ExporterPort:8081 Help:false AllowedNamespaces:[openshift-storage] flags:0xc00071c000 StopCh:<nil> Kubeconfig:<nil>}
W0928 12:47:04.428533 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0928 12:47:04.431138 1 main.go:70] Running metrics server on 0.0.0.0:8080
I0928 12:47:04.431159 1 main.go:71] Running telemetry server on 0.0.0.0:8081
W0928 12:47:04.444823 1 reflector.go:324] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
E0928 12:47:04.444904 1 reflector.go:138] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
I0928 12:47:04.445317 1 rbd-mirror.go:194] skipping rbd mirror status update for pool openshift-storage/ocs-storagecluster-cephblockpool because mirroring is disabled
I0928 12:47:04.452122 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-361593ed
I0928 12:47:04.452138 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-b4d54cc8
I0928 12:47:04.452149 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-2d84a845
I0928 12:47:04.452158 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-b5b47f2
I0928 12:47:04.452166 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-acc40f3e
I0928 12:47:04.452174 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-b423ba9c
I0928 12:47:04.452183 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-92a12a4a
I0928 12:47:04.452191 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-dca91665
I0928 12:47:04.452198 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-3c0717a0
I0928 12:47:04.452204 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-d2a8a521
I0928 12:47:04.452211 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-c71ea078
I0928 12:47:04.452217 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-65bcd9d4
I0928 12:47:04.452224 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-aad7dd43
I0928 12:47:04.452230 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-98ea1e7f
I0928 12:47:04.452237 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-db7e7f7
I0928 12:47:04.452244 1 pv.go:55] Skipping non Ceph CSI RBD volume pvc-e5b83885-946d-4fca-91ee-492cb55879ec
I0928 12:47:04.452250 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-4b1170a7
I0928 12:47:04.452257 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-d16bdf2d
I0928 12:47:04.452263 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-a9cef73f
I0928 12:47:04.452270 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-abc21c52
I0928 12:47:04.452277 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-e699a89
I0928 12:47:04.452283 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-d4d18a5b
I0928 12:47:04.452290 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-f09b1ed6
I0928 12:47:04.452297 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-179b3886
I0928 12:47:04.452303 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-29885e11
I0928 12:47:04.452309 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-14e03203
I0928 12:47:04.452315 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-cd4f9bc9
I0928 12:47:04.452321 1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-611c74bb
W0928 12:47:05.633347 1 reflector.go:324] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
E0928 12:47:05.633417 1 reflector.go:138] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
W0928 12:47:08.523159 1 reflector.go:324] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
E0928 12:47:08.523192 1 reflector.go:138] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
W0928 12:47:12.964015 1 reflector.go:324] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
E0928 12:47:12.964051 1 reflector.go:138] /remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:163: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: forbidden: User "system:serviceaccount:openshift-storage:ocs-metrics-exporter" cannot get path "/storageclasses"
E0928 12:47:20.982705 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1218eb0]
goroutine 243 [running]:
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.(*CephObjectStoreAdvancedFeatureProvider).AdvancedFeature(0xc0005ca250?, {0xc00060c720, 0x1, 0x1})
/remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:57 +0xf0
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.(*ClusterAdvanceFeatureCollector).Collect(0xc000a1a880, 0xc0000b0f50?)
/remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:183 +0xb0
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:453 +0xe8
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:464 +0x508
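Before the panic, the log above also shows repeated `forbidden` errors: the `ocs-metrics-exporter` service account cannot list or watch `storageclasses`, which is a cluster-scoped resource, so a namespaced Role is not enough. The RBAC fix referenced in the linked PRs would take roughly this shape (a hedged sketch only — the ClusterRole name below is illustrative, not the exact manifest from the PRs):

```yaml
# Hypothetical ClusterRole granting the exporter read access to
# StorageClasses. Being cluster-scoped, StorageClasses require a
# ClusterRole + ClusterRoleBinding rather than a namespaced Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ocs-metrics-exporter-storageclass-reader  # illustrative name
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
```

A matching ClusterRoleBinding would bind this role to the `system:serviceaccount:openshift-storage:ocs-metrics-exporter` subject named in the error messages.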
https://github.com/red-hat-storage/ocs-operator/pull/1830 will prevent the nil pointer exception and stop the crash loop.

Tested with the latest ocs-operator image, which has Umanga's changes merged. No ocs-metrics-exporter crash or any errors due to the advanced-feature metric were noticed. Following are the logs:

```
1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
1 rbd-mirror.go:292] RBD mirror store resync started at 2022-10-03 16:10:40.821475786 +0000 UTC m=+5671.204528376
1 rbd-mirror.go:317] RBD mirror store resync ended at 2022-10-03 16:10:40.821553818 +0000 UTC m=+5671.204606641
1 rbd-mirror.go:292] RBD mirror store resync started at 2022-10-03 16:11:10.821706396 +0000 UTC m=+5701.204758272
1 rbd-mirror.go:317] RBD mirror store resync ended at 2022-10-03 16:11:10.821759647 +0000 UTC m=+5701.204811454
```
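The stack trace shows the SIGSEGV happening inside `AdvancedFeature` in `cluster-advance-feature-use.go`, i.e. the collector dereferences a field that can legitimately be absent on some clusters. The kind of nil guard PR 1830 adds can be sketched as follows — all type and function names here are illustrative stand-ins, not the actual ocs-operator code:

```go
package main

import "fmt"

// SecuritySpec stands in for an optional security section that may be
// absent (nil) on clusters not using advanced features.
type SecuritySpec struct {
	KMSEnabled bool
}

// CephObjectStore stands in for the real CRD object the collector walks.
type CephObjectStore struct {
	Security *SecuritySpec
}

// advancedFeature returns 1 if any object store uses an advanced
// feature, else 0. Checking both the entry and its Security field for
// nil before dereferencing avoids the panic seen in the exporter.
func advancedFeature(stores []*CephObjectStore) int {
	for _, s := range stores {
		if s == nil || s.Security == nil {
			continue // skip objects without a security section
		}
		if s.Security.KMSEnabled {
			return 1
		}
	}
	return 0
}

func main() {
	stores := []*CephObjectStore{
		nil,             // would previously cause a nil dereference
		{Security: nil}, // missing security section: also skipped
		{Security: &SecuritySpec{KMSEnabled: true}},
	}
	fmt.Println(advancedFeature(stores)) // prints 1
}
```

With guards like these the collector reports the feature as unused instead of crashing, which matches the post-fix behavior observed above (the metric simply stops producing errors).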