Bug 2248850 - [Tracker][29079] rook-ceph-exporter pod restarts multiple times on a fresh installed HCI cluster
Summary: [Tracker][29079] rook-ceph-exporter pod restarts multiple times on a fresh installed HCI cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.1
Assignee: avan
QA Contact: Daniel Osypenko
URL:
Whiteboard: isf-provider
Depends On:
Blocks: 2250995
 
Reported: 2023-11-09 09:33 UTC by Rohan Gupta
Modified: 2023-12-07 13:21 UTC

Fixed In Version: 4.14.1-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2250995
Environment:
Last Closed: 2023-12-07 13:21:29 UTC
Embargoed:


Attachments
rook-ceph-exporter logs (attachment 1998057)


Links
Github red-hat-storage/rook pull 541 (Merged): Bug 2248850: exporter: run exporter with specific keyring (last updated 2023-11-27 16:33:51 UTC)
Github red-hat-storage/rook pull 542 (Merged): Bug 2248850: exporter: change deployment strategy to Recreate (last updated 2023-11-27 16:33:51 UTC)
Github rook/rook pull 13239 (Merged): exporter: run exporter with specific keyring (last updated 2023-11-27 16:33:52 UTC)
Github rook/rook pull 13265 (Merged): exporter: change deployment strategy to Recreate (last updated 2023-11-27 16:33:53 UTC)
Red Hat Product Errata RHBA-2023:7696 (last updated 2023-12-07 13:21:38 UTC)

Description Rohan Gupta 2023-11-09 09:33:25 UTC
Created attachment 1998057 [details]
rook-ceph-exporter logs

Description of problem (please be as detailed as possible and provide log
snippets):
The rook-ceph-exporter pod restarts multiple times.


rook-ceph-exporter logs:

2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
system:0
*** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter
 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
2023-11-09T08:29:35.752+0000 7f75b4165e80 -1 *** Caught signal (Segmentation fault) **
 in thread 7f75b4165e80 thread_name:ceph-exporter

 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f75b4859df0]
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f75b4aef034]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x55d8a6329e2e]
 4: ceph-exporter(+0x45eb0) [0x55d8a632beb0]
 5: ceph-exporter(+0x5cb1d) [0x55d8a6342b1d]
 6: ceph-exporter(+0xacb9f) [0x55d8a6392b9f]
 7: (DaemonMetricCollector::main()+0x212) [0x55d8a6315c22]
 8: main()
 9: /lib64/libc.so.6(+0x3feb0) [0x7f75b4844eb0]
 10: __libc_start_main()
 11: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -3> 2023-11-09T08:29:35.751+0000 7f75b4165e80 -1 asok(0x55d8a71f94d0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists


$ ceph status
  cluster:
    id:     f112b846-0527-4ef9-ac6e-519e7011a676
    health: HEALTH_WARN
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 2h), 15 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 642 pgs
    objects: 201 objects, 346 MiB
    usage:   800 MiB used, 105 TiB / 105 TiB avail
    pgs:     642 active+clean

  io:
    client:   852 B/s rd, 8.3 KiB/s wr, 1 op/s rd, 1 op/s wr




$ ceph crash ls
ID                                                                ENTITY        NEW
2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b  client.admin   *
2023-11-09T06:46:09.706085Z_30028a56-be4f-45c5-bf0c-1000706ce9b9  client.admin   *

$ ceph crash info 2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f4c6a384df0]",
        "(std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0x7f4c6a61a034]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1fe) [0x559c0f26ae2e]",
        "ceph-exporter(+0x45eb0) [0x559c0f26ceb0]",
        "ceph-exporter(+0x5cb1d) [0x559c0f283b1d]",
        "ceph-exporter(+0xacb9f) [0x559c0f2d3b9f]",
        "(DaemonMetricCollector::main()+0x212) [0x559c0f256c22]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f4c6a36feb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-09T06:46:08.690038Z_ceec941c-4c1c-4b60-b037-efc3cc25905b",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "445f7e928870d7f3a4ac83dd88c42c1ea1b453f27da54f8999b3570b25614589",
    "timestamp": "2023-11-09T06:46:08.690038Z",
    "utsname_hostname": "compute-1-ru5.rackm01.rtp.raleigh.ibm.com",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.36.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Oct 5 08:11:31 EDT 2023"
}




Version of all relevant components (if applicable):
ODF: 4.14.0-162


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes, it occurs on all freshly installed HCI clusters

Can this issue be reproduced from the UI?



Steps to Reproduce:
1. Install ODF
2. Create a StorageCluster


Actual results:
Storage Cluster is created and the Ceph cluster is in HEALTH_WARN

Expected results:
Storage Cluster is created and the Ceph cluster is in HEALTH_OK

Comment 7 Divyansh Kamboj 2023-11-20 17:16:57 UTC
Update on the bug: the issue is that ceph-exporter crashes when it tries to create the admin socket file on restart, but the file has already been created, either by an existing exporter or by some other daemon. @athakkar is working on renaming the socket file to avoid the conflict.
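
For illustration, here is a minimal Go sketch of the failure mode (the exporter itself is C++ inside Ceph, and the socket path is taken from the log above; this is not Rook or Ceph code):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Socket path taken from the log above; any shared host path behaves the same.
	const sockPath = "/var/run/ceph/ceph-client.admin.asok"

	// bind(2) on a UNIX domain socket fails when the path already exists on
	// disk, e.g. created by another exporter pod sharing the host directory.
	l, err := net.Listen("unix", sockPath)
	if err != nil {
		fmt.Println("bind failed:", err) // analogue of the "(17) File exists" error above
		return
	}
	defer l.Close()
	fmt.Println("listening on", sockPath)
}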

Comment 17 Mudit Agarwal 2023-11-24 12:19:09 UTC
The Fixed In Version was correct, and the fix is present in that build as well. It is still not working on the Fusion HCI setup.

Comment 18 Divyansh Kamboj 2023-11-24 13:08:32 UTC
The cause of the failure during installation is that the ceph-exporter and ceph-crashcollector deployments are created again. Because the toleration `node.kubernetes.io/unreachable` changes (its tolerationSeconds is set to either 5 or 300), reconciling the pods creates a race condition on the exporter's asok file, and the new pod keeps crashing until the other pod is fully deleted, which results in a crash report being generated.
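
For reference, a hedged Go sketch of the toleration in question, built with the standard k8s.io/api/core/v1 types (the helper name is illustrative, not Rook's actual code):

package exporter

import (
	corev1 "k8s.io/api/core/v1"
)

// unreachableToleration builds the toleration whose tolerationSeconds value
// (5 vs 300, per the comment above) changes the pod template and therefore
// triggers a fresh rollout of the exporter deployment. tolerationSeconds is
// only honored together with the NoExecute effect.
func unreachableToleration(seconds int64) corev1.Toleration {
	return corev1.Toleration{
		Key:               "node.kubernetes.io/unreachable",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}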

Comment 19 Leela Venkaiah Gangavarapu 2023-11-27 03:33:01 UTC
@dkamboj 

> Due to the changes in the toleration
- from a quick check w/ Rohan, this symptom points to the deployment strategy
- did you try w/ Recreate? The reasoning is as follows:

t0 -> deployment created, pod-0 is running
t1 -> toleration updated; since the strategy is RollingUpdate, the ReplicaSet rolls out pod-1 and waits for it to become healthy

By itself that shouldn't cause an issue, but combined with pod-0 still holding the socket, the failed bind is treated as a failure and the crash gets reported. Anyway, this is only my opinion; disregard it if it doesn't apply to the issue you observed.

thanks.
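
A sketch of the suggested change, assuming the deployment spec is built with the standard k8s.io/api/apps/v1 types (illustrative only; the actual fix is in the linked rook PRs):

package exporter

import (
	appsv1 "k8s.io/api/apps/v1"
)

// withRecreateStrategy switches a deployment from the default RollingUpdate
// strategy to Recreate: the old pod is terminated before the new one starts,
// so two exporter pods never overlap on the same host socket path.
func withRecreateStrategy(d *appsv1.Deployment) {
	d.Spec.Strategy = appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
}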

Comment 20 Neha Berry 2023-11-27 05:32:04 UTC
(In reply to Mudit Agarwal from comment #17)
> Fixed in version was correct, fix is present in that build also. It is still
> not working on Fusion HCI setup.

So should this bug continue to be ON_QA?

Or do we need to verify the build on an HCI setup?

Comment 22 Neha Berry 2023-11-27 05:37:49 UTC
Based on the discussion with Mudit, and input from Dhanashree at IBM, it seems this bug is still not fixed, hence moving it back to ASSIGNED.

Also, @athakkar, please provide all possible verification steps for the fix (during deployment, post deployment, commands to check the toleration if needed, etc.).

That would let us verify it thoroughly, with every angle explored.

Comment 23 Divyansh Kamboj 2023-11-27 08:04:37 UTC
Changed the deployment strategy to Recreate, and tested it out on provider mode clusters. That fixes the issue.

@nberry you can reproduce the issue on any cluster that's in provider mode. After installation, the ceph health should not go to HEALTH_WARN.

Comment 24 Daniel Osypenko 2023-12-04 09:46:02 UTC
Fresh installation of ODF 4.14.1-13.
After 3h there are no restarts; the age of the rook-ceph-exporter pods stays the same as that of the other rook resources.

$ oc -n openshift-storage get csv odf-operator.v4.14.1-rhodf -ojsonpath={.metadata.labels.full_version}
4.14.1-13
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS         AGE
csi-addons-controller-manager-57c78f8dcc-qhhrd                    2/2     Running     0                4m
noobaa-core-0                                                     1/1     Running     0                3h2m
noobaa-db-pg-0                                                    1/1     Running     0                3h2m
noobaa-endpoint-7b4cc64766-p675z                                  1/1     Running     0                58m
noobaa-operator-5db6879bd8-j4hpr                                  2/2     Running     0                3h8m
ocs-metrics-exporter-78cdb76d7f-t2rlf                             1/1     Running     0                3h7m
ocs-operator-747cb68d6d-lmjzc                                     1/1     Running     10 (2m13s ago)   3h7m
ocs-provider-server-5c96bd9959-xthtk                              1/1     Running     0                3h3m
odf-console-84798894d9-fx75k                                      1/1     Running     0                3h7m
odf-operator-controller-manager-f6954947-hw55k                    2/2     Running     8 (2m36s ago)    3h7m
rook-ceph-crashcollector-00-50-56-8f-2e-87-b8cdbc894-khc5j        1/1     Running     0                3h1m
rook-ceph-crashcollector-00-50-56-8f-7d-c3-5985d47bdc-m9865       1/1     Running     0                3h
rook-ceph-crashcollector-00-50-56-8f-bc-1d-65c989c856-hgp77       1/1     Running     0                3h1m
rook-ceph-exporter-00-50-56-8f-2e-87-9f9fb4f5d-ljs22              1/1     Running     0                3h1m
rook-ceph-exporter-00-50-56-8f-7d-c3-6f97d76b6c-cbrh2             1/1     Running     0                3h
rook-ceph-exporter-00-50-56-8f-bc-1d-58bfc4bfbd-spj2r             1/1     Running     0                3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6b859c66q5nxn   2/2     Running     0                3h1m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6f75f8c7m6gqh   2/2     Running     0                3h1m
rook-ceph-mgr-a-5679c86dd7-8x96z                                  2/2     Running     0                3h2m
rook-ceph-mon-a-85cb58fdb9-zlggc                                  2/2     Running     1 (78m ago)      3h3m
rook-ceph-mon-b-8695bcb7cb-qkgbd                                  2/2     Running     0                3h3m
rook-ceph-mon-c-d5cf44b-b6tdz                                     2/2     Running     1 (78m ago)      3h3m
rook-ceph-operator-57dc54fc8-v6sjv                                1/1     Running     0                3h3m
rook-ceph-osd-0-69dcb6bf7d-zrkc4                                  2/2     Running     0                3h2m
rook-ceph-osd-1-67589dff6f-sm4sj                                  2/2     Running     0                3h2m
rook-ceph-osd-2-5dd888dcd9-g6d4f                                  2/2     Running     0                3h2m
rook-ceph-osd-prepare-3456f52398cce3c85a50f4ba965cf80f-7mhxh      0/1     Completed   0                3h2m
rook-ceph-osd-prepare-45478423b0aa91d96006e92e856edbaa-swwj5      0/1     Completed   0                3h2m
rook-ceph-osd-prepare-e5260a8ca7e86f1362df996245319234-rmgwm      0/1     Completed   0                3h2m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7fbd8497qv6t   2/2     Running     0                3h
rook-ceph-tools-67c876b65c-qhszm                                  1/1     Running     0                114m

Comment 29 errata-xmlrpc 2023-12-07 13:21:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696

