Bug 2322531

Summary: Disable pool crush updates by default since new crush rule is causing rebalance
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Travis Nielsen <tnielsen>
Component: ocs-operator
Assignee: Malay Kumar Parida <mparida>
Status: NEW
QA Contact: Elad <ebenahar>
Severity: unspecified
Priority: unspecified
Version: 4.16
CC: odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified

Description Travis Nielsen 2024-10-29 19:36:18 UTC
This bug was initially created as a copy of Bug #2319878

I am copying this bug because: 

Pool crush updates should be disabled by default by the ocs-operator.
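
For illustration only, a rough sketch of what that default could look like from the CR side, assuming Rook exposes an enableCrushUpdates flag in the pool spec (the field name is an assumption, not confirmed in this bug):

# Hypothetical check: if ocs-operator leaves the (assumed) enableCrushUpdates
# field unset/false on the pools it manages, Rook should not rewrite the
# pool's crush rule on reconcile.
oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
  -o jsonpath='{.spec.enableCrushUpdates}'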
 

Description of problem (please be as detailed as possible and provide log
snippets):

- We noticed that some new crush rules were pushed out with an update to ODF 4.16:

.mgr_host_ssd
ocs-storagecluster-cephblockpool_host_ssd
.rgw.root_host_ssd
ocs-storagecluster-cephobjectstore.rgw.otp_host_ssd
ocs-storagecluster-cephobjectstore.rgw.meta_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.index_host_ssd
ocs-storagecluster-cephobjectstore.rgw.control_host_ssd
ocs-storagecluster-cephobjectstore.rgw.log_host_ssd
ocs-storagecluster-cephobjectstore.rgw.buckets.data_host_ssd
ocs-storagecluster-cephfilesystem-metadata_host_ssd
ocs-storagecluster-cephfilesystem-data0_host_ssd
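
The rules above, and the rule each pool currently points at, can be inspected from the rook-ceph toolbox; a minimal sketch, assuming toolbox access:

# List all crush rules and the crush_rule assigned to each pool
ceph osd crush rule ls
ceph osd pool ls detail | grep crush_rule
# Dump one of the new rules to see its failure domain and device class
ceph osd crush rule dump ocs-storagecluster-cephblockpool_host_ssd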

This has resulted in at least one case of a large rebalance after updating ODF:

cluster:
    id:     c745f785-45cc-4c32-b62d-67fd61b87321
    health: HEALTH_WARN
            1 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 7 pgs backfill_toofull
            12 pool(s) nearfull
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,c,e (age 13h)
    mgr: a(active, since 13h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 12 osds: 12 up (since 13h), 12 in (since 19M); 125 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 281 pgs
    objects: 8.31M objects, 8.9 TiB
    usage:   27 TiB used, 21 TiB / 48 TiB avail
    pgs:     12427371/24917634 objects misplaced (49.874%)
             156 active+clean
             116 active+remapped+backfill_wait
             7   active+remapped+backfill_wait+backfill_toofull
             2   active+remapped+backfilling
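
While the backfill is in flight, the immediate risk is the nearfull OSD; a short sketch of commands (run from the toolbox pod) for tracking per-OSD fill levels and the PGs still being moved:

# Per-OSD utilization with the crush tree for context
ceph osd df tree
# PGs currently remapped/backfilling, plus overall cluster state
ceph pg ls remapped | head
ceph status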

- Here are the rook-ceph-operator logs from during the event:

2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807746 I | cephclient: creating a new crush rule for changed failure domain ("host"-->"rack") on crush rule "replicated_rule"
2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807771 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "replicated_rule"
2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842722 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephblockpool_rack"
2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441216 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule ".rgw.root_rack"
2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456778 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack"
2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458423 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack"
2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464360 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack"
2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470228 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack"
2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477762 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack"
2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484742 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack"
2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696712 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack"
2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628941 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-metadata_rack"
2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.195994 I | cephclient: creating a new crush rule for changed deviceClass ("default"-->"ssd") on crush rule "ocs-storagecluster-cephfilesystem-data0_rack"
2024-10-16T02:37:37.807980427Z 2024-10-16 02:37:37.807779 I | cephclient: crush rule "replicated_rule" will no longer be used by pool ".mgr"
2024-10-16T02:37:58.842770528Z 2024-10-16 02:37:58.842744 I | cephclient: crush rule "ocs-storagecluster-cephblockpool_rack" will no longer be used by pool "ocs-storagecluster-cephblockpool"
2024-10-16T02:40:58.441261067Z 2024-10-16 02:40:58.441238 I | cephclient: crush rule ".rgw.root_rack" will no longer be used by pool ".rgw.root"
2024-10-16T02:40:58.456827394Z 2024-10-16 02:40:58.456810 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.meta_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.meta"
2024-10-16T02:40:58.458467757Z 2024-10-16 02:40:58.458446 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.control_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.control"
2024-10-16T02:40:58.464400294Z 2024-10-16 02:40:58.464385 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.otp_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.otp"
2024-10-16T02:40:58.470327637Z 2024-10-16 02:40:58.470254 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.index_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.index"
2024-10-16T02:40:58.477802276Z 2024-10-16 02:40:58.477785 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.log_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.log"
2024-10-16T02:40:58.484785130Z 2024-10-16 02:40:58.484769 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec"
2024-10-16T02:41:03.696760054Z 2024-10-16 02:41:03.696736 I | cephclient: crush rule "ocs-storagecluster-cephobjectstore.rgw.buckets.data_rack" will no longer be used by pool "ocs-storagecluster-cephobjectstore.rgw.buckets.data"
2024-10-16T02:43:30.629001666Z 2024-10-16 02:43:30.628966 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-metadata_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-metadata"
2024-10-16T02:43:34.196044470Z 2024-10-16 02:43:34.196026 I | cephclient: crush rule "ocs-storagecluster-cephfilesystem-data0_rack" will no longer be used by pool "ocs-storagecluster-cephfilesystem-data0"
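
These log lines show Rook recreating each pool's rule with the device class changed from "default" to "ssd" (and, for the .mgr pool, the failure domain changed from "host" to "rack"). Because a device-class-specific rule selects OSDs through a different (shadow) hierarchy, the CRUSH placement changes and PGs are remapped even though the same OSDs remain eligible. A hedged sketch for comparing an old and new rule from the toolbox (rule names taken from the log and the listing above):

# Compare the 'take' step (device class / failure domain) of the old and new
# rule, then confirm which rule id the pool now references
ceph osd crush rule dump ocs-storagecluster-cephblockpool_rack
ceph osd crush rule dump ocs-storagecluster-cephblockpool_host_ssd
ceph osd pool get ocs-storagecluster-cephblockpool crush_rule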

Version of all relevant components (if applicable):
ODF 4.16

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes, it temporarily degrades the ODF cluster until backfilling completes
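
Not a workaround for the rule change itself, but the client-facing impact can sometimes be reduced by throttling backfill while the data moves; a sketch, assuming toolbox access (on releases using the mClock scheduler, osd_mclock_override_recovery_settings may need to be enabled first for these to take effect):

# Lower backfill/recovery concurrency during the move; revert later with
# 'ceph config rm osd <option>'
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1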

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

I haven't been able to reproduce it in the lab yet, but have tried once and will be trying again.

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:

No


Actual results:
Ceph has to go through a rebalance after upgrading to ODF 4.16

Expected results:
Ceph doesn't have to go through a rebalance after upgrading to ODF 4.16

Additional info:
It would be very helpful if engineering could shed some light on a) whether the rebalance is expected behaviour, and b) if it is expected, what can be done to mitigate this interruption. This is my main question/concern.

Relevant attachments in supportshell:
[bmcmurra@supportshell-1 03961471]$ ll
total 132
drwxrwxrwx+ 3 yank     yank         54 Oct 16 16:02 0010-inspect-lbs-i2l.tar.gz
drwxrwxrwx+ 3 yank     yank         55 Oct 16 16:02 0020-inspect-openshift-storage.tar.gz
-rw-rw-rw-+ 1 yank     yank     112257 Oct 16 17:11 0030-image.png
drwxrwxrwx+ 3 yank     yank         59 Oct 16 17:24 0040-must-gather-openshift-logging.tar.gz
drwxrwxrwx+ 3 yank     yank         59 Oct 17 18:22 0050-must-gather-openshift-storage.tar.gz

Let me know if you require any more data than what's already in supportshell.

Thanks

Brandon McMurray
Technical Support Engineer, RHCE
Software Defined Storage and OpenShift Data Foundation