Created attachment 1998057 [details]
rook-ceph-exporter logs

Description of problem (please be as detailed as possible and provide log snippets):

rook-ceph-exporter restarts multiple times.

rook-ceph-exporter logs:

2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
system:0
*** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter

 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()

2023-11-09T08:29:35.752+0000 7f75b4165e80 -1 *** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter

 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    -3> 2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists

$ ceph status
  cluster:
    id:     f112b846-0527-4ef9-ac6e-519e7011a676
    health: HEALTH_WARN
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 2h), 15 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 642 pgs
    objects: 201 objects, 346 MiB
    usage:   800 MiB used, 105 TiB / 105 TiB avail
    pgs:     642 active+clean

  io:
    client:   852 B/s rd, 8.3 KiB/s wr, 1 op/s rd, 1 op/s wr

$ ceph crash ls
ID                                                                 ENTITY        NEW
2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b   client.admin  *
2023-11-09T06:46:09.706085Z_30028a56-be4f-45c5-bf0c-1000706ce9b9   client.admin  *

$ ceph crash info 2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f4c6a384df0]",
        "(std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f4c6a61a034]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x559c0f26ae2e]",
        "ceph-exporter(+0x45eb0) [0x559c0f26ceb0]",
        "ceph-exporter(+0x5cb1d) [0x559c0f283b1d]",
        "ceph-exporter(+0xacb9f) [0x559c0f2d3b9f]",
        "(DaemonMetricCollector::main()+0x212) [0x559c0f256c22]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f4c6a36feb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "445f7e928870d7f3a4ac83dd88c42c1ea1b453f27da54f8999b3570b25614589",
    "timestamp": "2023-11-09T06:46:08.690038Z",
    "utsname_hostname": "compute-1-ru5.rackm01.rtp.raleigh.ibm.com",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.36.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Oct 5 08:11:31 EDT 2023"
}

Version of all relevant components (if applicable):
ODF: 4.14.0-162

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?
Yes, it occurs on all freshly installed HCI clusters.

Can this issue be reproduced from the UI?

Steps to Reproduce:
1. Install ODF
2. Create a storagecluster

Actual results:
Storage cluster is created and the Ceph cluster is in HEALTH_WARN

Expected results:
Storage cluster is created and the Ceph cluster is in HEALTH_OK
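The stale socket can also be inspected directly in the affected pod. A minimal sketch, assuming the rook-ceph-tools toolbox deployment is enabled; the exporter pod name is a placeholder, substitute a real one from `oc get pods`:

# look for a leftover ceph-client.admin.asok held by another process
$ oc -n openshift-storage exec <rook-ceph-exporter-pod> -- ls -l /var/run/ceph/
# once the behaviour is understood/fixed, the old crash reports can be acknowledged so the warning clears
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph crash archive-all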
Update on the bug: the issue is that ceph-exporter crashes when, on restart, it tries to create its admin socket file but the file has already been created, either by an already-running exporter or by some other daemon. @athakkar is working on renaming the socket file to avoid the conflict.
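Some background on the rename idea: Ceph builds admin socket paths from the `admin_socket` option, whose default is `/var/run/ceph/$cluster-$name.asok`, and because the exporter registers as client.admin it collides with any other client.admin socket on the same node. Purely to illustrate how those paths are templated (this is not the actual patch), a per-process override could look like the second command below, which is hypothetical:

$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph config help admin_socket
# hypothetical override for illustration only: $pid makes the socket path unique per process
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph config set client.admin admin_socket '/var/run/ceph/$cluster-$name.$pid.asok'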
The Fixed-in-version was correct; the fix is present in that build as well. It is still not working on the Fusion HCI setup.
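If it helps to rule out a stale image, the build that is actually running can be cross-checked; a rough sketch (resource names taken from this cluster, adjust as needed):

$ oc -n openshift-storage get csv -o name
$ oc -n openshift-storage get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].image}'
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph versions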
The cause of the failure during installation is that the ceph-exporter and ceph-crashcollector deployments get created again. Because the toleration for `node.kubernetes.io/unreachable` changes (it is set to either 5 or 300), reconciling the pods creates a race for the exporter's asok file, and the new pod keeps crashing until the old pod is fully deleted. This results in a crash report being generated.
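The toleration in question can be read straight off the exporter deployment; a sketch, with the deployment name left as a placeholder since the real names embed the node name:

$ oc -n openshift-storage get deployment <rook-ceph-exporter-nodename> -o jsonpath='{.spec.template.spec.tolerations}'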
@dkamboj

> Due to the changes in the toleration

From a quick check with Rohan, this symptom resembles the deployment strategy. Did you try with Recreate? The reasoning is as follows:

t0 -> deployment created, pod-0 is running
t1 -> toleration updated; since the strategy is RollingUpdate, the ReplicaSet just rolls out pod-1 and waits for it to become healthy

By itself this shouldn't create an issue, but combined with one pod still holding the socket it is treated as a failure, and so the crash is reported. Anyway, this is only my opinion; disregard it if it doesn't apply to the issue you observed. Thanks.
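For reference, a quick way to see which strategy the exporter deployments currently use (a sketch using standard oc output options):

$ oc -n openshift-storage get deployments -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.strategy.type | grep rook-ceph-exporter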
(In reply to Mudit Agarwal from comment #17)
> Fixed in version was correct, fix is present in that build also. It is still
> not working on Fusion HCI setup.

So should this bug continue to be ON_QA, or do we have to check with the build on an HCI setup?
Based on the discussion with Mudit, and also input from Dhanashree at IBM, it seems this bug is still not fixed, hence moving it back to ASSIGNED.

Also, @athakkar, please provide all possible verification steps for the fix (during deployment, post deployment, commands to check the toleration if needed, etc.). That would let us verify it thoroughly, with every angle explored.
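For instance, checks along these lines would cover deployment and post-deployment (these are assumed command forms, not an official verification procedure):

# exporter pods should show 0 restarts during and after deployment
$ oc -n openshift-storage get pods | grep rook-ceph-exporter
# cluster health should stay HEALTH_OK and no new ceph-exporter crash entries should appear
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph crash ls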
Changed the deployment strategy to rolling update and tested it out on provider mode clusters; that fixes the issue. @nberry, you can reproduce the issue on any cluster that is in provider mode. After installation, the Ceph health should not go to HEALTH_WARN.
Fresh installation of ODF 4.14.1-13. After 3h there are no restarts, and the age of the rook-ceph-exporter pods stays the same as that of the other rook resources.

oc -n openshift-storage get csv odf-operator.v4.14.1-rhodf -ojsonpath={.metadata.labels.full_version}
4.14.1-13

oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS         AGE
csi-addons-controller-manager-57c78f8dcc-qhhrd                    2/2     Running     0                4m
noobaa-core-0                                                     1/1     Running     0                3h2m
noobaa-db-pg-0                                                    1/1     Running     0                3h2m
noobaa-endpoint-7b4cc64766-p675z                                  1/1     Running     0                58m
noobaa-operator-5db6879bd8-j4hpr                                  2/2     Running     0                3h8m
ocs-metrics-exporter-78cdb76d7f-t2rlf                             1/1     Running     0                3h7m
ocs-operator-747cb68d6d-lmjzc                                     1/1     Running     10 (2m13s ago)   3h7m
ocs-provider-server-5c96bd9959-xthtk                              1/1     Running     0                3h3m
odf-console-84798894d9-fx75k                                      1/1     Running     0                3h7m
odf-operator-controller-manager-f6954947-hw55k                    2/2     Running     8 (2m36s ago)    3h7m
rook-ceph-crashcollector-00-50-56-8f-2e-87-b8cdbc894-khc5j        1/1     Running     0                3h1m
rook-ceph-crashcollector-00-50-56-8f-7d-c3-5985d47bdc-m9865       1/1     Running     0                3h
rook-ceph-crashcollector-00-50-56-8f-bc-1d-65c989c856-hgp77       1/1     Running     0                3h1m
rook-ceph-exporter-00-50-56-8f-2e-87-9f9fb4f5d-ljs22              1/1     Running     0                3h1m
rook-ceph-exporter-00-50-56-8f-7d-c3-6f97d76b6c-cbrh2             1/1     Running     0                3h
rook-ceph-exporter-00-50-56-8f-bc-1d-58bfc4bfbd-spj2r             1/1     Running     0                3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b859c66q5nxn   2/2     Running     0                3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6f75f8c7m6gqh   2/2     Running     0                3h1m
rook-ceph-mgr-a-5679c86dd7-8x96z                                  2/2     Running     0                3h2m
rook-ceph-mon-a-85cb58fdb9-zlggc                                  2/2     Running     1 (78m ago)      3h3m
rook-ceph-mon-b-8695bcb7cb-qkgbd                                  2/2     Running     0                3h3m
rook-ceph-mon-c-d5cf44b-b6tdz                                     2/2     Running     1 (78m ago)      3h3m
rook-ceph-operator-57dc54fc8-v6sjv                                1/1     Running     0                3h3m
rook-ceph-osd-0-69dcb6bf7d-zrkc4                                  2/2     Running     0                3h2m
rook-ceph-osd-1-67589dff6f-sm4sj                                  2/2     Running     0                3h2m
rook-ceph-osd-2-5dd888dcd9-g6d4f                                  2/2     Running     0                3h2m
rook-ceph-osd-prepare-3456f52398cce3c85a50f4ba965cf80f-7mhxh      0/1     Completed   0                3h2m
rook-ceph-osd-prepare-45478423b0aa91d96006e92e856edbaa-swwj5      0/1     Completed   0                3h2m
rook-ceph-osd-prepare-e5260a8ca7e86f1362df996245319234-rmgwm      0/1     Completed   0                3h2m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7fbd8497qv6t   2/2     Running     0                3h
rook-ceph-tools-67c876b65c-qhszm                                  1/1     Running     0                114m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696