2263256 – [GSS] rook ceph operator crash loop backoff

Bug 2263256 - [GSS] rook ceph operator crash loop backoff

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	ODF 4.12.12
Assignee:	Malay Kumar parida
QA Contact:	Mahesh Shetty
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported:	2024-02-07 21:54 UTC by khover
Modified:	2024-04-01 13:44 UTC
CC List:	13 users
Fixed In Version:	4.12.12-1
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-04-01 13:42:32 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-operator pull 2049	None	Merged	Sort values of topologymap labels to avoid frequent updates to CR	2024-03-05 15:34:09 UTC
Github	red-hat-storage ocs-operator pull 2496	None	open	Bug 2263256: [release-4.12] Bug 2193220:Sort values of topologymap labels to avoid frequent updates to CR	2024-03-06 05:16:22 UTC
Red Hat Bugzilla	2193220	unspecified	CLOSED	[Stretch cluster] CephCluster is updated frequently due to changing ordering of zones	2024-02-14 22:54:30 UTC

Description khover 2024-02-07 21:54:17 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

rook ceph operator in crash loop backoff after upgrade from 4.12.10 to 4.12.11

Version of all relevant components (if applicable):

ODF 4.12.11

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Customer is concerned this may persist and affect reconcile ops.

Is there any workaround available to the best of your knowledge?

None 

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?

Unknown at this time, but seen on 2 customer clusters running the same version.

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 5 Subham Rai 2024-02-09 12:09:56 UTC

Question:

What was the state of the cluster before the upgrade? Were the mons in quorum?

Comment 6 khover 2024-02-09 12:34:23 UTC

Hello,

We don't have logs from before the upgrade, but the customer stated there were no issues prior.

Current state:

  cluster:
    id:     e6d56c78-3697-4095-ba29-658a2745359e
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum a,c,d,e,f (age 5m)
    mgr: a(active, since 4d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 16 osds: 16 up (since 15h), 16 in (since 11w)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 449 pgs
    objects: 376.25k objects, 1.2 TiB
    usage:   4.9 TiB used, 18 TiB / 23 TiB avail
    pgs:     449 active+clean
 
  io:
    client:   257 KiB/s rd, 22 MiB/s wr, 54 op/s rd, 2.86k op/s wr

Comment 28 krishnaram Karthick 2024-03-06 05:03:21 UTC

Malay - could you please share the steps to verify this bug?

Comment 31 Malay Kumar parida 2024-03-06 05:26:05 UTC

Steps to verify:
1. Set up a stretch cluster on ODF 4.12.10.
2. Upgrade to ODF 4.12.11 and observe that the rook-ceph-operator pod has gone into CrashLoopBackOff (CLBO).
3. Upgrade to ODF 4.12.12; the rook-ceph-operator pod should now be back up and running.
@tnielsen please add if anything else needs to be checked.

Comment 33 Travis Nielsen 2024-03-06 21:51:17 UTC

Another verification step could be to check the rook operator log for many messages such as those described in https://bugzilla.redhat.com/show_bug.cgi?id=2187952#c28

Comment 35 krishnaram Karthick 2024-04-01 13:42:32 UTC

Closing the bug as we don't intend to test the fix in 4.12.12 for the reason above.