+++ This bug was initially created as a clone of Bug #2248850 +++

Description of problem (please be as detailed as possible and provide log snippets):

rook-ceph-exporter restarts multiple times.

rook-ceph-exporter logs:

2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
system:0
*** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter
 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
2023-11-09T08:29:35.752+0000 7f75b4165e80 -1 *** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter
 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
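The "(17) File exists" bind failure in the log above is the classic symptom of a leftover socket file at the daemon's asok path. The following is a minimal, hedged sketch of the kind of defensive cleanup a startup wrapper could do before launching a daemon; it is NOT what ceph-exporter or Rook actually does, and the `clear_stale_asok` helper name is hypothetical:

```shell
# clear_stale_asok is a hypothetical helper: if a plain file (not a socket)
# sits at the asok path, binding can never succeed, so remove it.
# A path that is already a socket is conservatively left alone here,
# since a live daemon may be listening on it.
clear_stale_asok() {
    asok="$1"
    if [ -e "$asok" ] && [ ! -S "$asok" ]; then
        # Regular file blocking the socket path: safe to delete.
        rm -f "$asok"
    fi
}
```

A fuller version would also probe an existing socket with a connect attempt (e.g. via socat) to distinguish a live daemon from a stale socket left by a crashed one.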
    -3> 2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists

$ ceph status
  cluster:
    id:     f112b846-0527-4ef9-ac6e-519e7011a676
    health: HEALTH_WARN
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 2h), 15 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 642 pgs
    objects: 201 objects, 346 MiB
    usage:   800 MiB used, 105 TiB / 105 TiB avail
    pgs:     642 active+clean

  io:
    client: 852 B/s rd, 8.3 KiB/s wr, 1 op/s rd, 1 op/s wr

$ ceph crash ls
ID                                                                ENTITY        NEW
2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b  client.admin   *
2023-11-09T06:46:09.706085Z_30028a56-be4f-45c5-bf0c-1000706ce9b9  client.admin   *

$ ceph crash info 2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f4c6a384df0]",
        "(std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f4c6a61a034]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x559c0f26ae2e]",
        "ceph-exporter(+0x45eb0) [0x559c0f26ceb0]",
        "ceph-exporter(+0x5cb1d) [0x559c0f283b1d]",
        "ceph-exporter(+0xacb9f) [0x559c0f2d3b9f]",
        "(DaemonMetricCollector::main()+0x212) [0x559c0f256c22]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f4c6a36feb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "445f7e928870d7f3a4ac83dd88c42c1ea1b453f27da54f8999b3570b25614589",
    "timestamp": "2023-11-09T06:46:08.690038Z",
    "utsname_hostname": "compute-1-ru5.rackm01.rtp.raleigh.ibm.com",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.36.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Oct 5 08:11:31 EDT 2023"
}

Version of all relevant components (if applicable):
ODF: 4.14.0-162

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes, it occurs on all freshly installed HCI clusters.

Can this issue be reproduced from the UI?

Steps to Reproduce:
1. Install ODF
2. Create a StorageCluster

Actual results:
Storage Cluster is created and the Ceph cluster is in HEALTH_WARN.

Expected results:
Storage Cluster is created and the Ceph cluster is in HEALTH_OK.

--- Additional comment from RHEL Program Management on 2023-11-09 09:33:38 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.15.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since acks are set against a release flag.

--- Additional comment from avan on 2023-11-10 10:05:05 UTC ---

Can you please confirm with Ken Dreyer that the Ceph image you are using has Ceph v6.1z2?
--- Additional comment from Rohan Gupta on 2023-11-10 14:19:47 UTC ---

@athakkar
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Image is cp.icr.io/cp/ibm-ceph/ceph-6-rhel9@sha256:162ce3abf5e4adc81e6b4957b22caddbb7a87ae30340348a06e21b7c8123b86e
Will check with Ken and update here.

--- Additional comment from Juan Miguel Olmo on 2023-11-13 11:58:54 UTC ---

I do not know what the root cause could be for having a socket file already created when the Ceph exporter starts. But apart from solving what seems to be the root cause of the problem (why we are trying to start the Ceph exporter twice or more times), I would suggest handling the error properly in the Ceph exporter. A socket file already being in use must not cause a segmentation fault; it should probably produce just an error line in the log.

--- Additional comment from Leela Venkaiah Gangavarapu on 2023-11-17 03:58:39 UTC ---

tracker for https://github.ibm.com/ProjectAbell/abell-tracking/issues/29079

--- Additional comment from avan on 2023-11-20 12:02:43 UTC ---

(In reply to Juan Miguel Olmo from comment #4)
> I do not know what the root cause could be for having a socket file already
> created when the Ceph exporter starts. But apart from solving what seems to
> be the root cause of the problem (why we are trying to start the Ceph
> exporter twice or more times), I would suggest handling the error properly
> in the Ceph exporter. A socket file already being in use must not cause a
> segmentation fault; it should probably produce just an error line in the log.

The error isn't coming from the ceph exporter; it's the admin socket file. Debugging why this fails only on this env and works fine on others.

--- Additional comment from Divyansh Kamboj on 2023-11-20 17:16:57 UTC ---

Update on the bug: ceph-exporter crashes when it tries to create the admin socket file on restart, but the file has already been created, either by an already-running exporter or by some other daemon.
@athakkar is working on renaming the socket file to avoid the conflict.

--- Additional comment from Juan Miguel Olmo on 2023-11-21 09:17:24 UTC ---

(In reply to avan from comment #6)
> [...]
> The error isn't coming from the ceph exporter; it's the admin socket file.
> Debugging why this fails only on this env and works fine on others.

The exporter explodes with a "segmentation fault" error because the socket cannot be used. Better error handling could probably replace the "explosion" with log lines and a standby behavior. Do you think the Ceph exporter's behavior is robust in this specific case?
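The renaming approach mentioned above comes down to deriving the asok filename from a per-daemon entity name instead of the shared client.admin. A small sketch of that idea, assuming Ceph's usual `$run_dir/ceph-$name.asok` naming convention; the `asok_path` helper itself is hypothetical, introduced only for illustration:

```shell
# asok_path is a hypothetical helper mirroring Ceph's
# /var/run/ceph/ceph-<entity>.asok naming convention. With distinct entity
# names, two daemons on the same host no longer race for the same path.
asok_path() {
    entity="$1"
    printf '/var/run/ceph/ceph-%s.asok' "$entity"
}
```

With the shared name, `asok_path client.admin` yields exactly the path from the crash log ('/var/run/ceph/ceph-client.admin.asok'), which is why any second daemon using the same entity collides; a per-exporter entity name produces a distinct path.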
--- Additional comment from avan on 2023-11-21 09:51:30 UTC ---

I've raised the fix for the issue, but it seems there are still some issues with caps for the exporter: https://github.com/rook/rook/pull/13239

--- Additional comment from Juan Miguel Olmo on 2023-11-21 10:00:11 UTC ---

It is good to have a fix in OCS to avoid hitting the issue again, but it would be awesome to have more robust behavior in the Ceph exporter, and unit tests, to avoid explosions like this:

2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
system:0
*** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter
 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
2023-11-09T08:29:35.752+0000 7f75b4165e80 -1 *** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter

--- Additional comment from avan on 2023-11-21 10:10:37 UTC ---

(In reply to Juan Miguel Olmo from comment #10)
> It is good to have a fix in OCS to avoid hitting the issue again, but it
> would be awesome to have more robust behavior in the Ceph exporter, and unit
> tests, to avoid explosions like this:
> [quoted log and backtrace above]

The issue isn't coming from ceph-exporter here, as it seems to work fine in other ODF environments and also in standalone Ceph. This is a very specific case where it seems the fix should be in Rook only. I don't see how an exporter unit test would have helped, given that the traceback already comes from AdminSocket.

--- Additional comment from avan on 2023-11-21 10:42:14 UTC ---

(In reply to Juan Miguel Olmo from comment #8)
> (In reply to avan from comment #6)
> > (In reply to Juan Miguel Olmo from comment #4)
> > > I do not know what the root cause could be for having a socket file
> > > already created when the Ceph exporter starts. But apart from solving
> > > what seems to be the root cause of the problem (why we are trying to
> > > start the Ceph exporter twice or more times), I would suggest handling
> > > the error properly in the Ceph exporter. A socket file already being in
> > > use must not cause a segmentation fault; it should probably produce just
> > > an error line in the log.
> >
> > The error isn't coming from the ceph exporter; it's the admin socket file.
> > Debugging why this fails only on this env and works fine on others.
>
> The exporter explodes with a "segmentation fault" error because the socket
> cannot be used. Better error handling could probably replace the
> "explosion" with log lines and a standby behavior.
> Do you think the Ceph exporter's behavior is robust in this specific case?

Well, the exporter crashes because the socket is already in use, not because of the exporter itself: it works fine in vstart and mstart standalone Ceph environments, which generate the same socket file name for the exporter (client.admin.asok). So I'd put it this way: the issue is more about the deployment of the exporter in the mentioned environment than about the exporter itself. Simply having a dedicated user/keyring pair for the exporter in Rook, like we have for cephadm[1], should avoid the crashes we are facing.

[1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/services/cephadmservice.py#L1168C11-L1168C11

--- Additional comment from avan on 2023-11-21 19:18:41 UTC ---

The upstream fix is merged. Created the backport for 4.14: https://github.com/red-hat-storage/rook/pull/541
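The dedicated user/keyring fix described above amounts to issuing a `ceph auth get-or-create` for an exporter-specific entity instead of reusing client.admin. As a sketch that stays runnable without a live cluster, the hypothetical helper below merely assembles such a command; the entity name and capability set are assumptions loosely modeled on the cephadm code linked above, not the exact caps the Rook PR uses:

```shell
# build_exporter_auth_cmd assembles a "ceph auth get-or-create" invocation
# for a dedicated exporter user. The helper name, the "client.<name>" entity,
# and the mon/mgr/osd read-only caps are illustrative assumptions.
build_exporter_auth_cmd() {
    name="$1"
    printf "ceph auth get-or-create client.%s mon 'allow r' mgr 'allow r' osd 'allow r'" "$name"
}
```

Run against a cluster (e.g. `eval "$(build_exporter_auth_cmd ceph-exporter)"` from the toolbox pod), this would create a distinct entity, and hence a distinct asok filename, eliminating the client.admin collision.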
Fresh installation of ODF 4.14.1-13: after 3h there are no restarts, and the age of the rook-ceph-exporter pods stays the same as that of the other rook resources.

$ oc -n openshift-storage get csv odf-operator.v4.14.1-rhodf -ojsonpath={.metadata.labels.full_version}
4.14.1-13

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS        AGE
csi-addons-controller-manager-57c78f8dcc-qhhrd                    2/2     Running     0               4m
noobaa-core-0                                                     1/1     Running     0               3h2m
noobaa-db-pg-0                                                    1/1     Running     0               3h2m
noobaa-endpoint-7b4cc64766-p675z                                  1/1     Running     0               58m
noobaa-operator-5db6879bd8-j4hpr                                  2/2     Running     0               3h8m
ocs-metrics-exporter-78cdb76d7f-t2rlf                             1/1     Running     0               3h7m
ocs-operator-747cb68d6d-lmjzc                                     1/1     Running     10 (2m13s ago)  3h7m
ocs-provider-server-5c96bd9959-xthtk                              1/1     Running     0               3h3m
odf-console-84798894d9-fx75k                                      1/1     Running     0               3h7m
odf-operator-controller-manager-f6954947-hw55k                    2/2     Running     8 (2m36s ago)   3h7m
rook-ceph-crashcollector-00-50-56-8f-2e-87-b8cdbc894-khc5j        1/1     Running     0               3h1m
rook-ceph-crashcollector-00-50-56-8f-7d-c3-5985d47bdc-m9865       1/1     Running     0               3h
rook-ceph-crashcollector-00-50-56-8f-bc-1d-65c989c856-hgp77       1/1     Running     0               3h1m
rook-ceph-exporter-00-50-56-8f-2e-87-9f9fb4f5d-ljs22              1/1     Running     0               3h1m
rook-ceph-exporter-00-50-56-8f-7d-c3-6f97d76b6c-cbrh2             1/1     Running     0               3h
rook-ceph-exporter-00-50-56-8f-bc-1d-58bfc4bfbd-spj2r             1/1     Running     0               3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b859c66q5nxn   2/2     Running     0               3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6f75f8c7m6gqh   2/2     Running     0               3h1m
rook-ceph-mgr-a-5679c86dd7-8x96z                                  2/2     Running     0               3h2m
rook-ceph-mon-a-85cb58fdb9-zlggc                                  2/2     Running     1 (78m ago)     3h3m
rook-ceph-mon-b-8695bcb7cb-qkgbd                                  2/2     Running     0               3h3m
rook-ceph-mon-c-d5cf44b-b6tdz                                     2/2     Running     1 (78m ago)     3h3m
rook-ceph-operator-57dc54fc8-v6sjv                                1/1     Running     0               3h3m
rook-ceph-osd-0-69dcb6bf7d-zrkc4                                  2/2     Running     0               3h2m
rook-ceph-osd-1-67589dff6f-sm4sj                                  2/2     Running     0               3h2m
rook-ceph-osd-2-5dd888dcd9-g6d4f                                  2/2     Running     0               3h2m
rook-ceph-osd-prepare-3456f52398cce3c85a50f4ba965cf80f-7mhxh      0/1     Completed   0               3h2m
rook-ceph-osd-prepare-45478423b0aa91d96006e92e856edbaa-swwj5      0/1     Completed   0               3h2m
rook-ceph-osd-prepare-e5260a8ca7e86f1362df996245319234-rmgwm      0/1     Completed   0               3h2m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7fbd8497qv6t   2/2     Running     0               3h
rook-ceph-tools-67c876b65c-qhszm                                  1/1     Running     0               114m
OCP version: 4.15.0-0.nightly-2024-01-25-051548
ODF version: 4.15.0-126.stable

No restarts since the Provider / Client deployment:

$ oc get pods -n openshift-storage | awk 'NR==1 || /rook-ceph-exporter/'
NAME                                                              READY   STATUS    RESTARTS   AGE
rook-ceph-exporter-b2.fd.3da9.ip4.static.sl-reverse.com-cbjrjc7   1/1     Running   0          74m
rook-ceph-exporter-b5.fd.3da9.ip4.static.sl-reverse.com-59m477k   1/1     Running   0          25h
rook-ceph-exporter-b8.fd.3da9.ip4.static.sl-reverse.com-86mgcm8   1/1     Running   0          25h
rook-ceph-exporter-b9.fd.3da9.ip4.static.sl-reverse.com-75smhd2   1/1     Running   0          68m
rook-ceph-exporter-bd.fd.3da9.ip4.static.sl-reverse.com-6dppb2r   1/1     Running   0          20m

$ oc rsh -n openshift-storage $TOOLBOX
sh-5.1$ ceph status
  cluster:
    id:     e0028f51-9387-49b2-9cd9-ec7a20ebb8a6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 62m)
    mgr: a(active, since 25h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 13m), 12 in (since 2d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 1227 pgs
    objects: 44.34k objects, 169 GiB
    usage:   509 GiB used, 10 TiB / 10 TiB avail
    pgs:     1227 active+clean

  io:
    client: 1.2 KiB/s rd, 5.1 MiB/s wr, 2 op/s rd, 478 op/s wr

Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days