Bug 2263256 - [GSS] rook ceph operator crash loop backoff
Summary: [GSS] rook ceph operator crash loop backoff
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ODF 4.12.12
Assignee: Malay Kumar parida
QA Contact: Mahesh Shetty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-07 21:54 UTC by khover
Modified: 2024-04-01 13:44 UTC (History)

Fixed In Version: 4.12.12-1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-01 13:42:32 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2049 0 None Merged Sort values of topologymap labels to avoid frequent updates to CR 2024-03-05 15:34:09 UTC
Github red-hat-storage ocs-operator pull 2496 0 None open Bug 2263256: [release-4.12] Bug 2193220:Sort values of topologymap labels to avoid frequent updates to CR 2024-03-06 05:16:22 UTC
Red Hat Bugzilla 2193220 0 unspecified CLOSED [Stretch cluster] CephCluster is updated frequently due to changing ordering of zones 2024-02-14 22:54:30 UTC

Description khover 2024-02-07 21:54:17 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The rook-ceph-operator pod is in CrashLoopBackOff after the upgrade from 4.12.10 to 4.12.11.

Version of all relevant components (if applicable):

ODF 4.12.11

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

The customer is concerned that this may persist and affect reconcile operations.

Is there any workaround available to the best of your knowledge?

None 

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?

Unknown at this time, but it has been seen on 2 customer clusters running the same version.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 5 Subham Rai 2024-02-09 12:09:56 UTC
Question:

What was the state of the cluster before the upgrade? Were the mons in quorum?

Comment 6 khover 2024-02-09 12:34:23 UTC
Hello,

We don't have logs from before the upgrade, but the customer stated there were no issues prior.

Current state:

  cluster:
    id:     e6d56c78-3697-4095-ba29-658a2745359e
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum a,c,d,e,f (age 5m)
    mgr: a(active, since 4d), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 16 osds: 16 up (since 15h), 16 in (since 11w)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 449 pgs
    objects: 376.25k objects, 1.2 TiB
    usage:   4.9 TiB used, 18 TiB / 23 TiB avail
    pgs:     449 active+clean
 
  io:
    client:   257 KiB/s rd, 22 MiB/s wr, 54 op/s rd, 2.86k op/s wr

Comment 28 krishnaram Karthick 2024-03-06 05:03:21 UTC
Malay, could you please share the steps to verify this bug?

Comment 31 Malay Kumar parida 2024-03-06 05:26:05 UTC
Steps to verify:
1. Set up a stretch cluster on ODF 4.12.10.
2. Upgrade to ODF 4.12.11 and observe that the rook-ceph-operator pod has gone into CrashLoopBackOff (CLBO).
3. Upgrade to ODF 4.12.12; the rook-ceph-operator pod should now be back up and running.
@tnielsen please add if anything else needs to be checked.
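
The pod check in the steps above could be scripted roughly as follows. This is only a sketch: it assumes the pod object has already been fetched as JSON (for example with `oc get pod <pod-name> -o json`), and it relies on the standard Kubernetes pod status field `status.containerStatuses[].state.waiting.reason`. The helper name `inCrashLoop` and the sample JSON are hypothetical and were not taken from the affected clusters.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod mirrors only the Kubernetes Pod status fields needed for this check.
type pod struct {
	Status struct {
		ContainerStatuses []struct {
			Name         string `json:"name"`
			RestartCount int    `json:"restartCount"`
			State        struct {
				Waiting *struct {
					Reason string `json:"reason"`
				} `json:"waiting"`
			} `json:"state"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// inCrashLoop (hypothetical helper) reports whether any container in the
// given pod JSON is waiting with reason CrashLoopBackOff.
func inCrashLoop(podJSON []byte) (bool, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return false, err
	}
	for _, cs := range p.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical sample, shaped like the output of `oc get pod ... -o json`.
	sample := []byte(`{"status":{"containerStatuses":[{"name":"rook-ceph-operator",` +
		`"restartCount":12,"state":{"waiting":{"reason":"CrashLoopBackOff"}}}]}}`)

	clbo, err := inCrashLoop(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("crash looping:", clbo)
}
```

Running the same check before and after the upgrade to 4.12.12 would show whether the operator pod has left CLBO.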

Comment 33 Travis Nielsen 2024-03-06 21:51:17 UTC
Another verification step could be to look at the rook operator log and see many messages such as those described in https://bugzilla.redhat.com/show_bug.cgi?id=2187952#c28

Comment 35 krishnaram Karthick 2024-04-01 13:42:32 UTC
Closing the bug as we don't intend to test the fix in 4.12.12 for the reason above.

