Bug 2250995
Summary: | [Tracker][29079] rook-ceph-exporter pod restarts multiple times on a fresh installed HCI cluster | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Mudit Agarwal <muagarwa> |
Component: | rook | Assignee: | Divyansh Kamboj <dkamboj> |
Status: | CLOSED ERRATA | QA Contact: | Daniel Osypenko <dosypenk> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.14 | CC: | athakkar, dkamboj, dosypenk, ebenahar, jolmomar, kbg, lgangava, muagarwa, nberry, nthomas, odf-bz-bot, omitrani, rohgupta, sapillai, tnielsen |
Target Milestone: | --- | ||
Target Release: | ODF 4.15.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | isf-provider | ||
Fixed In Version: | 4.15.0-112 | Doc Type: | Bug Fix |
Doc Text: |
.Deployment strategy to avoid rook-ceph-exporter pod restarts
Previously, the `rook-ceph-exporter` pod restarted multiple times on a freshly installed HCI cluster, which crashed the exporter pod and left Ceph health in the WARN status. This happened because restarting the exporter with the `RollingUpdate` strategy caused a race condition that crashed the exporter.
With this fix, the deployment strategy is changed to `Recreate`. As a result, the exporter pods no longer crash and Ceph no longer reports the WARN health status. (A quick way to check the configured strategy is sketched just after the metadata table below.)
|
Story Points: | --- |
Clone Of: | 2248850 | Environment: | |
Last Closed: | 2024-03-19 15:29:07 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2248850 | ||
Bug Blocks: | 2246375 |
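
As a quick way to confirm the change described in the Doc Text above, the update strategy can be read directly from the exporter Deployments. This is a minimal sketch, assuming the exporter Deployments live in the `openshift-storage` namespace and carry an `app=rook-ceph-exporter` label (adjust the selector if your cluster labels them differently):

```
# Print the update strategy of each rook-ceph-exporter Deployment.
# Assumes the app=rook-ceph-exporter label; adjust the selector if needed.
oc -n openshift-storage get deployments -l app=rook-ceph-exporter \
  -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.strategy.type

# Expected on builds carrying the fix (4.15.0-112 and later): STRATEGY is "Recreate".
# Affected builds use the rolling-update default instead.
```

With `Recreate`, the old exporter pod is fully terminated before its replacement starts, so old and new exporter processes never run side by side on the same node, which is the general reason this strategy avoids that class of race.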
Description
Mudit Agarwal
2023-11-22 07:32:51 UTC
Fresh installation of ODF 4.14.1-13: after 3h there are no restarts, and the age of the rook-ceph-exporter pods stays the same as that of the other rook resources.

oc -n openshift-storage get csv odf-operator.v4.14.1-rhodf -ojsonpath={.metadata.labels.full_version}
4.14.1-13

oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-addons-controller-manager-57c78f8dcc-qhhrd 2/2 Running 0 4m
noobaa-core-0 1/1 Running 0 3h2m
noobaa-db-pg-0 1/1 Running 0 3h2m
noobaa-endpoint-7b4cc64766-p675z 1/1 Running 0 58m
noobaa-operator-5db6879bd8-j4hpr 2/2 Running 0 3h8m
ocs-metrics-exporter-78cdb76d7f-t2rlf 1/1 Running 0 3h7m
ocs-operator-747cb68d6d-lmjzc 1/1 Running 10 (2m13s ago) 3h7m
ocs-provider-server-5c96bd9959-xthtk 1/1 Running 0 3h3m
odf-console-84798894d9-fx75k 1/1 Running 0 3h7m
odf-operator-controller-manager-f6954947-hw55k 2/2 Running 8 (2m36s ago) 3h7m
rook-ceph-crashcollector-00-50-56-8f-2e-87-b8cdbc894-khc5j 1/1 Running 0 3h1m
rook-ceph-crashcollector-00-50-56-8f-7d-c3-5985d47bdc-m9865 1/1 Running 0 3h
rook-ceph-crashcollector-00-50-56-8f-bc-1d-65c989c856-hgp77 1/1 Running 0 3h1m
rook-ceph-exporter-00-50-56-8f-2e-87-9f9fb4f5d-ljs22 1/1 Running 0 3h1m
rook-ceph-exporter-00-50-56-8f-7d-c3-6f97d76b6c-cbrh2 1/1 Running 0 3h
rook-ceph-exporter-00-50-56-8f-bc-1d-58bfc4bfbd-spj2r 1/1 Running 0 3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b859c66q5nxn 2/2 Running 0 3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6f75f8c7m6gqh 2/2 Running 0 3h1m
rook-ceph-mgr-a-5679c86dd7-8x96z 2/2 Running 0 3h2m
rook-ceph-mon-a-85cb58fdb9-zlggc 2/2 Running 1 (78m ago) 3h3m
rook-ceph-mon-b-8695bcb7cb-qkgbd 2/2 Running 0 3h3m
rook-ceph-mon-c-d5cf44b-b6tdz 2/2 Running 1 (78m ago) 3h3m
rook-ceph-operator-57dc54fc8-v6sjv 1/1 Running 0 3h3m
rook-ceph-osd-0-69dcb6bf7d-zrkc4 2/2 Running 0 3h2m
rook-ceph-osd-1-67589dff6f-sm4sj 2/2 Running 0 3h2m
rook-ceph-osd-2-5dd888dcd9-g6d4f 2/2 Running 0 3h2m
rook-ceph-osd-prepare-3456f52398cce3c85a50f4ba965cf80f-7mhxh 0/1 Completed 0 3h2m
rook-ceph-osd-prepare-45478423b0aa91d96006e92e856edbaa-swwj5 0/1 Completed 0 3h2m
rook-ceph-osd-prepare-e5260a8ca7e86f1362df996245319234-rmgwm 0/1 Completed 0 3h2m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7fbd8497qv6t 2/2 Running 0 3h
rook-ceph-tools-67c876b65c-qhszm 1/1 Running 0 114m

OCP version: 4.15.0-0.nightly-2024-01-25-051548
ODF version: 4.15.0-126.stable

No restarts since the Provider / Client deployment:

oc get pods -n openshift-storage | awk 'NR==1 || /rook-ceph-exporter/'
NAME READY STATUS RESTARTS AGE
rook-ceph-exporter-b2.fd.3da9.ip4.static.sl-reverse.com-cbjrjc7 1/1 Running 0 74m
rook-ceph-exporter-b5.fd.3da9.ip4.static.sl-reverse.com-59m477k 1/1 Running 0 25h
rook-ceph-exporter-b8.fd.3da9.ip4.static.sl-reverse.com-86mgcm8 1/1 Running 0 25h
rook-ceph-exporter-b9.fd.3da9.ip4.static.sl-reverse.com-75smhd2 1/1 Running 0 68m
rook-ceph-exporter-bd.fd.3da9.ip4.static.sl-reverse.com-6dppb2r 1/1 Running 0 20m

oc rsh -n openshift-storage $TOOLBOX
sh-5.1$ ceph status
  cluster:
    id:     e0028f51-9387-49b2-9cd9-ec7a20ebb8a6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 62m)
    mgr: a(active, since 25h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 13m), 12 in (since 2d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 1227 pgs
    objects: 44.34k objects, 169 GiB
    usage:   509 GiB used, 10 TiB / 10 TiB avail
    pgs:     1227 active+clean

  io:
    client: 1.2 KiB/s rd, 5.1 MiB/s wr, 2 op/s rd, 478 op/s wr

Moving to VERIFIED.
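
The awk filter above works for spot checks; an equivalent check that selects the exporter pods by label and prints only the restart counters is sketched below, again assuming an `app=rook-ceph-exporter` label, which is not visible in the listing itself:

```
# Show only the exporter pods and their restart counters.
# Assumes the pods carry the app=rook-ceph-exporter label.
oc -n openshift-storage get pods -l app=rook-ceph-exporter \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,STARTED:.status.startTime

# On a fixed build every RESTARTS value should stay at 0, as in the listings above.
```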
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.