Description of problem (please be as detailed as possible and provide log snippets):

The Ceph PG autoscaler did not increase the PG count for the storage pool consuming most of the available space (all of the used space). This results in one OSD being nearfull well before the other OSDs are, when the cluster is supposedly only 68% full. The whole point of the PG autoscaler (and the balancer) is to prevent this kind of OSD imbalance. It is not clear whether the balancer module was engaged or whether it could have helped here.

Why do I care? Efficient storage space utilization is a performance dimension of sorts - higher utilization of storage devices means lower total system cost.

Version of all relevant components (if applicable):

sh-4.2# ceph version
ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)

[kni@e23-h15-740xd ~]$ oc -n openshift-storage get csv
NAME                  DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.5.2   OpenShift Container Storage   4.5.2                Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

No. Just testing at this time; we have a little less space to work with.

Is there any workaround available to the best of your knowledge?

We can force the PG autoscaler into action with:

ceph osd pool set ocs-storagecluster-cephblockpool target_size_ratio 0.95

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

1

Can this issue be reproduced?

Probably.

Can this issue be reproduced from the UI?

Probably.

Steps to Reproduce:
1. Create an OCS cluster on bare metal with 6 OCS nodes, 2 NVMe partitions/node
2. Install OpenShift Virtualization
3. Create 100 VMs that together use up or overcommit the available space
4. Run a workload that fills up the storage space given to the VMs

I do not suspect this has anything to do with OpenShift Virtualization; I think you could get the same result by writing any kind of data to Ceph using a single pool.

Actual results:
1 OSD is overflowing while the others are not. The most-used OSD had 1094 GB out of 1453 GB used (75%), whereas the least-used OSD had 880 GB (60%).

Expected results:
All OSDs are within 10% of each other in used space.
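As an aside, here is a back-of-the-envelope sketch (mine, not taken from the cluster) of why the autoscaler may be sitting still. It assumes the documented Nautilus heuristic - roughly capacity ratio x mon_target_pg_per_osd (default 100) x OSD count / replica size, with a change only applied once the ideal value is off from the current pg_num by about a factor of 3 - so treat the constants below as assumptions, not measurements:

#!/usr/bin/env python3
# Rough sketch of the Nautilus pg_autoscaler sizing heuristic (my reading of the
# docs, not the mgr code). All constants below are assumed defaults.

OSD_COUNT = 12           # from `ceph osd tree` below
REPLICA_SIZE = 3         # pool 1 is replicated size 3
TARGET_PG_PER_OSD = 100  # assumed mon_target_pg_per_osd default
CURRENT_PG_NUM = 128     # pool 1, from `ceph osd pool ls detail` below

def ideal_pg_num(capacity_ratio):
    """Approximate pg_num the autoscaler would consider ideal for the pool."""
    return capacity_ratio * TARGET_PG_PER_OSD * OSD_COUNT / REPLICA_SIZE

for label, ratio in [("target_size_ratio", 0.49), ("actual raw utilization", 0.70)]:
    ideal = ideal_pg_num(ratio)
    print(f"{label} {ratio}: ideal pg_num ~{ideal:.0f} "
          f"({ideal / CURRENT_PG_NUM:.1f}x the current {CURRENT_PG_NUM})")

If those assumptions are in the right ballpark, the ideal pg_num (~196-280) never reaches 3x the current 128, so the autoscaler would never split PGs even though this pool holds essentially all of the data.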
Additional info:

oc adm must-gather info is in this tarball:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/tmp/ocp4/cnv-boaz-must-gather-bz.tgz

[kni@e23-h15-740xd ~]$ ocos rsh rook-ceph-tools-6c4ff47568-pr4dt

We add up the GB promised to every RBD volume created by OpenShift Virtualization via the ocs-storagecluster-ceph-rbd storageclass:

sh-4.2# for v in $(rbd -p ocs-storagecluster-cephblockpool ls) ; do \
          rbd -p ocs-storagecluster-cephblockpool info $v ; done \
        | awk '/size/{sum+=$2}END{print sum}'
8070

sh-4.2# ceph -s
  cluster:
    id:     ea293d9e-b5c4-4858-9b14-30724100c548
    health: HEALTH_WARN
            1 nearfull osd(s)
            10 pool(s) nearfull

  services:
    mon: 3 daemons, quorum a,b,c (age 15h)
    mgr: a(active, since 15h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 12 up (since 14h), 12 in (since 14h)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 368 pgs
    objects: 1.03M objects, 3.9 TiB
    usage:   12 TiB used, 5.3 TiB / 17 TiB avail
    pgs:     368 active+clean

  io:
    client:   852 B/s rd, 40 KiB/s wr, 1 op/s rd, 6 op/s wr

All the space is in -cephblockpool:

sh-4.2# ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL       USED       RAW USED     %RAW USED
    ssd       17 TiB     5.3 TiB     12 TiB     12 TiB       68.99
    TOTAL     17 TiB     5.3 TiB     12 TiB     12 TiB       68.99

POOLS:
    POOL                                                     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    ocs-storagecluster-cephblockpool                          1     3.9 TiB     1.03M       12 TiB      87.65     564 GiB
    ocs-storagecluster-cephobjectstore.rgw.control            2     0 B         8           0 B         0         564 GiB
    ocs-storagecluster-cephfilesystem-metadata                3     2.2 KiB     22          96 KiB      0         564 GiB
    ocs-storagecluster-cephfilesystem-data0                   4     0 B         0           0 B         0         564 GiB
    ocs-storagecluster-cephobjectstore.rgw.meta               5     1.4 KiB     7           72 KiB      0         564 GiB
    ocs-storagecluster-cephobjectstore.rgw.log                6     3.5 KiB     179         408 KiB     0         564 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.index      7     0 B         11          0 B         0         564 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec     8     0 B         0           0 B         0         564 GiB
    .rgw.root                                                 9     4.7 KiB     16          180 KiB     0         564 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.data      10     1 KiB       1           12 KiB      0         564 GiB

See osd 5 here:

sh-4.2# ceph osd status
+----+-----------+-------+-------+--------+---------+--------+---------+--------------------+
| id |    host   |  used | avail | wr ops | wr data | rd ops | rd data |       state        |
+----+-----------+-------+-------+--------+---------+--------+---------+--------------------+
| 0  | worker002 |  938G |  514G |    0   |   1843  |    4   |    0    |     exists,up      |
| 1  | worker001 |  880G |  573G |    0   |    0    |    2   |    0    |     exists,up      |
| 2  | worker000 |  970G |  482G |    0   |   819   |    1   |    0    |     exists,up      |
| 3  | worker000 |  908G |  545G |    1   |   8908  |    3   |    0    |     exists,up      |
| 4  | worker001 | 1063G |  389G |    0   |    0    |    3   |    0    |     exists,up      |
| 5  | worker002 | 1094G |  359G |    0   |   4505  |    3   |    0    | exists,nearfull,up |
| 6  |  master-0 | 1064G |  388G |    0   |   1638  |    5   |    0    |     exists,up      |
| 7  |  master-1 |  910G |  542G |    0   |  13.6k  |    5   |    0    |     exists,up      |
| 8  |  master-2 | 1065G |  388G |    0   |    0    |    5   |   106   |     exists,up      |
| 9  |  master-2 | 1066G |  386G |    1   |  13.7k  |    2   |    0    |     exists,up      |
| 10 |  master-0 | 1001G |  451G |    0   |   819   |    2   |    0    |     exists,up      |
| 11 |  master-1 | 1067G |  386G |    0   |   2355  |    3   |    0    |     exists,up      |
+----+-----------+-------+-------+--------+---------+--------+---------+--------------------+

Each OSD is a partition containing 1/2 of an NVMe device and fed to the LSO.

sh-4.2# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                STATUS REWEIGHT PRI-AFF
 -1       17.03156 root default
-12        5.67719     rack rack0
-17        2.83859         host master-2
  8   ssd  1.41930             osd.8            up  1.00000 1.00000
  9   ssd  1.41930             osd.9            up  1.00000 1.00000
-11        2.83859         host worker000
  2   ssd  1.41930             osd.2            up  1.00000 1.00000
  3   ssd  1.41930             osd.3            up  1.00000 1.00000
 -8        5.67719     rack rack1
-15        2.83859         host master-0
  6   ssd  1.41930             osd.6            up  1.00000 1.00000
 10   ssd  1.41930             osd.10           up  1.00000 1.00000
 -7        2.83859         host worker001
  1   ssd  1.41930             osd.1            up  1.00000 1.00000
  4   ssd  1.41930             osd.4            up  1.00000 1.00000
 -4        5.67719     rack rack2
-19        2.83859         host master-1
  7   ssd  1.41930             osd.7            up  1.00000 1.00000
 11   ssd  1.41930             osd.11           up  1.00000 1.00000
 -3        2.83859         host worker002
  0   ssd  1.41930             osd.0            up  1.00000 1.00000
  5   ssd  1.41930             osd.5            up  1.00000 1.00000

cephblockpool has autoscale_mode on:

sh-4.2# ceph osd pool ls detail
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 151 lfor 0/0/86 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd
        removed_snaps [1~3]
pool 2 'ocs-storagecluster-cephobjectstore.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 3 'ocs-storagecluster-cephfilesystem-metadata' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 4 'ocs-storagecluster-cephfilesystem-data0' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 151 lfor 0/0/88 flags hashpspool,nearfull stripe_width 0 target_size_ratio 0.49 application cephfs
pool 5 'ocs-storagecluster-cephobjectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 6 'ocs-storagecluster-cephobjectstore.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 7 'ocs-storagecluster-cephobjectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 8 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 9 '.rgw.root' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 10 'ocs-storagecluster-cephobjectstore.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 151 flags hashpspool,nearfull stripe_width 0 application rook-ceph-rgw

OCS operator has enabled osd_pool_default_pg_autoscale_mode:

sh-4.2# ceph config dump
WHO                                               MASK  LEVEL     OPTION                               VALUE                               RO
global                                                  basic     log_file                                                                 *
global                                                  advanced  mon_allow_pool_delete                true
global                                                  advanced  mon_cluster_log_file
global                                                  advanced  mon_pg_warn_min_per_osd              0
global                                                  advanced  osd_pool_default_pg_autoscale_mode   on
global                                                  advanced  rbd_default_features                 3
mgr                                                     advanced  mgr/balancer/active                  true
mgr                                                     advanced  mgr/balancer/mode                    upmap
mgr                                                     advanced  mgr/orchestrator_cli/orchestrator    rook                                *
mds.ocs-storagecluster-cephfilesystem-a                 basic     mds_cache_memory_limit               4294967296
mds.ocs-storagecluster-cephfilesystem-b                 basic     mds_cache_memory_limit               4294967296
client.rgw.ocs.storagecluster.cephobjectstore.a         advanced  rgw_enable_usage_log                 true
client.rgw.ocs.storagecluster.cephobjectstore.a         advanced  rgw_log_nonexistent_bucket           true
client.rgw.ocs.storagecluster.cephobjectstore.a         advanced  rgw_log_object_name_utc              true
client.rgw.ocs.storagecluster.cephobjectstore.a         advanced  rgw_zone                             ocs-storagecluster-cephobjectstore  *
client.rgw.ocs.storagecluster.cephobjectstore.a         advanced  rgw_zonegroup                        ocs-storagecluster-cephobjectstore  *
client.rgw.ocs.storagecluster.cephobjectstore.b         advanced  rgw_enable_usage_log                 true
client.rgw.ocs.storagecluster.cephobjectstore.b         advanced  rgw_log_nonexistent_bucket           true
client.rgw.ocs.storagecluster.cephobjectstore.b         advanced  rgw_log_object_name_utc              true
client.rgw.ocs.storagecluster.cephobjectstore.b         advanced  rgw_zone                             ocs-storagecluster-cephobjectstore  *
client.rgw.ocs.storagecluster.cephobjectstore.b         advanced  rgw_zonegroup                        ocs-storagecluster-cephobjectstore  *

Yes, the pg_autoscaler and balancer modules are enabled:

sh-4.2# ceph mgr module ls
{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes"
    ],
    "enabled_modules": [
        "iostat",
        "pg_autoscaler",
        "prometheus",
        "restful",
        "rook"
    ],
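Related back-of-the-envelope reasoning (again mine, and a simplification: it treats CRUSH placement as uniformly random balls-into-bins and ignores the nearly empty pools, since ceph df above shows essentially all data in the 128-PG block pool): with only ~32 PG replicas per OSD, a noticeable capacity spread is statistically expected before the balancer does anything.

#!/usr/bin/env python3
# Rough estimate of the expected per-OSD fullness spread when all the data sits
# in one pool with relatively few PGs. This is a simplification, not a model of
# this exact cluster: CRUSH is not perfectly uniform and PG sizes vary.
import math

PG_NUM = 128
REPLICA_SIZE = 3
OSD_COUNT = 12
MEAN_UTIL = 0.69   # ~69% raw used, from `ceph df` above

pg_replicas_per_osd = PG_NUM * REPLICA_SIZE / OSD_COUNT      # ~32
# Poisson-like relative spread: stddev of a count n is ~sqrt(n)
relative_spread = math.sqrt(pg_replicas_per_osd) / pg_replicas_per_osd
print(f"~{pg_replicas_per_osd:.0f} PG replicas per OSD")
print(f"expected relative spread ~{relative_spread:.0%} of the mean, i.e. roughly "
      f"{MEAN_UTIL * (1 - relative_spread):.0%}..{MEAN_UTIL * (1 + relative_spread):.0%} "
      f"utilization across OSDs at one standard deviation")

The observed 60%-75% spread falls inside that one-standard-deviation band, which suggests the imbalance is about what this PG count allows without upmap balancing or a pg_num increase.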
Most likely a Ceph issue that I don't see us fixing for 4.6.0 (and unlikely to be a new issue) - deferring to 4.7.
This was already seen before and fixed as part of bug 1782756, so this is most likely a regression. In addition, the situation is exposed more easily with CNV - bug 1897351.
(In reply to Elad from comment #3)
> This was already seen before and fixed as part of 1782756.

The correct one is https://bugzilla.redhat.com/show_bug.cgi?id=1797918
I was able to reproduce this issue by performing the following steps:

1) On a 3-node OCS cluster with one 512 GB OSD per node, fill up the capacity
2) Add capacity; 1 more OSD per node
3) Follow https://access.redhat.com/solutions/3001761 so that recovery IOs can start
4) Allow the rebalance to the new OSDs to complete

[At this point, we could see that the PGs are not quite equally distributed]

cat ceph_osd_df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 1   ssd 0.50000  1.00000 512 GiB 380 GiB 379 GiB  43 KiB  1.5 GiB 132 GiB 74.27 1.49 189     up
 5   ssd 0.50000  1.00000 512 GiB 129 GiB 128 GiB  20 KiB 1024 MiB 383 GiB 25.15 0.50  78     up
 2   ssd 0.50000  1.00000 512 GiB 388 GiB 386 GiB  39 KiB  1.5 GiB 124 GiB 75.70 1.52 198     up
 4   ssd 0.50000  1.00000 512 GiB 129 GiB 128 GiB  27 KiB 1024 MiB 383 GiB 25.28 0.51  72     up
 0   ssd 0.50000  1.00000 512 GiB 366 GiB 364 GiB  47 KiB  1.5 GiB 146 GiB 71.43 1.43 167     up
 3   ssd 0.50000  1.00000 512 GiB 140 GiB 139 GiB  20 KiB 1024 MiB 372 GiB 27.35 0.55  97     up
                    TOTAL 3 TiB   1.5 TiB 1.5 TiB 199 KiB  7.5 GiB 1.5 TiB 49.86
MIN/MAX VAR: 0.50/1.52  STDDEV: 23.98

5) Write more data to fill up the cluster.

[At this point, one of the OSDs hit the full ratio way ahead of the other (new) OSDs]
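For anyone re-deriving the imbalance figures above, a quick sanity check (assumption: ceph's VAR column is each OSD's %USE divided by the mean %USE, and STDDEV is the population standard deviation of %USE - the values below match the MIN/MAX VAR and STDDEV that ceph osd df printed):

#!/usr/bin/env python3
# Recompute the imbalance figures from the %USE column of the `ceph osd df`
# output above (OSDs 1, 5, 2, 4, 0, 3 in the order printed).
import statistics

use_pct = [74.27, 25.15, 75.70, 25.28, 71.43, 27.35]
mean = statistics.mean(use_pct)
var = [u / mean for u in use_pct]
print(f"mean %USE: {mean:.2f}")
print(f"MIN/MAX VAR: {min(var):.2f}/{max(var):.2f}")   # ~0.50/1.52
print(f"STDDEV: {statistics.pstdev(use_pct):.2f}")     # ~23.98

The ~190 PGs on the original OSDs versus ~72-97 on the new ones is consistent with the backfill to the new OSDs not having completed when this was captured, which the next comment confirms.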
Created attachment 1731025 [details]
must_gather for comment#6
(In reply to krishnaram Karthick from comment #7)
> Created attachment 1731025 [details]
> must_gather for comment#6

Thanks for the detailed data Karthick. In your case, the cluster was not rebalanced yet, there were still many backfilling pgs:

  data:
    pools:   3 pools, 288 pgs
    objects: 130.90k objects, 505 GiB
    usage:   1.5 TiB used, 1.5 TiB / 3 TiB avail
    pgs:     96496/392685 objects misplaced (24.573%)
             232 active+clean
             54  active+remapped+backfill_wait
             2   active+remapped+backfilling

  io:
    client:   853 B/s rd, 303 MiB/s wr, 1 op/s rd, 94 op/s wr
    recovery: 15 MiB/s, 3 objects/s

The balancer won't run until <5% of objects are misplaced. As you can see, at this point in time nearly 25% of the objects were still being rebalanced. Thus in this case, the balancer hasn't run at all. You can verify this by observing that there are no upmaps in ceph_osd_dump, which is how the balancer redistributes pgs.

What happened after this must-gather was taken? I'd expect backfill to complete, and then the balancer to redistribute pgs as needed at that point.
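A small helper for rechecking this on a future run (hypothetical, not from the must-gather; it assumes the ceph CLI is reachable, e.g. from the rook-ceph-tools pod, and that `ceph osd dump -f json` exposes the pg_upmap/pg_upmap_items lists, as it does on Nautilus clusters as far as I know). `ceph balancer status` additionally shows whether the module is active and in which mode.

#!/usr/bin/env python3
# Count upmap entries in the OSD map; zero entries means the upmap balancer has
# not remapped any PGs yet. Hypothetical helper, assumes the `ceph` CLI works
# from wherever this runs (e.g. the rook-ceph-tools pod).
import json
import subprocess

osd_dump = json.loads(subprocess.check_output(["ceph", "osd", "dump", "-f", "json"]))
upmaps = osd_dump.get("pg_upmap", []) + osd_dump.get("pg_upmap_items", [])
print(f"{len(upmaps)} upmap entries in the OSD map")
if not upmaps:
    print("No upmaps yet - the balancer waits until the misplaced-object ratio "
          "drops below its threshold (~5%) before it starts remapping.")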
(In reply to Josh Durgin from comment #8)
> What happened after this must-gather was taken? I'd expect backfill to
> complete, and then the balancer to redistribute pgs as needed at that point.

Thanks Josh. I don't have the cluster anymore; QE's AWS clusters automatically get deleted after 12 hours. I'll rerun this test and update once I have the results.
I reran the test and waited for a long time. I see that this time the OSDs are more evenly distributed.

After expanding to 6 OSDs:

ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 2   ssd 0.50000  1.00000 512 GiB 216 GiB 215 GiB  72 KiB  1.5 GiB 296 GiB 42.24 0.96 133     up
 3   ssd 0.50000  1.00000 512 GiB 236 GiB 235 GiB  27 KiB 1024 MiB 276 GiB 46.01 1.04 155     up
 1   ssd 0.50000  1.00000 512 GiB 226 GiB 224 GiB  72 KiB  1.5 GiB 286 GiB 44.04 1.00 146     up
 4   ssd 0.50000  1.00000 512 GiB 226 GiB 225 GiB  27 KiB 1024 MiB 286 GiB 44.21 1.00 140     up
 0   ssd 0.50000  1.00000 512 GiB 239 GiB 238 GiB  75 KiB  1.6 GiB 273 GiB 46.75 1.06 149     up
 5   ssd 0.50000  1.00000 512 GiB 213 GiB 212 GiB  45 KiB 1024 MiB 299 GiB 41.51 0.94 139     up
                    TOTAL 3 TiB   1.3 TiB 1.3 TiB 321 KiB  7.6 GiB 1.7 TiB 44.13
MIN/MAX VAR: 0.94/1.06  STDDEV: 1.86

After expanding to 9 OSDs:

ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 2   ssd 0.50000  1.00000 512 GiB 293 GiB 292 GiB  42 KiB  1.4 GiB 219 GiB 57.25 0.96  93     up
 3   ssd 0.50000  1.00000 512 GiB 307 GiB 305 GiB  40 KiB  1.7 GiB 205 GiB 59.92 1.01 103     up
 6   ssd 0.50000  1.00000 512 GiB 313 GiB 312 GiB  32 KiB 1024 MiB 199 GiB 61.19 1.03  92     up
 1   ssd 0.50000  1.00000 512 GiB 279 GiB 277 GiB  55 KiB  1.8 GiB 233 GiB 54.54 0.92  93     up
 4   ssd 0.50000  1.00000 512 GiB 307 GiB 305 GiB  43 KiB  1.5 GiB 205 GiB 59.87 1.01  95     up
 7   ssd 0.50000  1.00000 512 GiB 328 GiB 327 GiB  35 KiB 1024 MiB 184 GiB 63.98 1.08  99     up
 0   ssd 0.50000  1.00000 512 GiB 328 GiB 327 GiB  51 KiB  1.4 GiB 184 GiB 64.05 1.08 101     up
 5   ssd 0.50000  1.00000 512 GiB 272 GiB 271 GiB  39 KiB  1.4 GiB 240 GiB 53.20 0.89  89     up
 8   ssd 0.50000  1.00000 512 GiB 313 GiB 312 GiB  24 KiB 1024 MiB 199 GiB 61.05 1.03  98     up
                    TOTAL 4.5 TiB 2.7 TiB 2.7 TiB 366 KiB   12 GiB 1.8 TiB 59.45
MIN/MAX VAR: 0.89/1.08  STDDEV: 3.58
(In reply to krishnaram Karthick from comment #10)
> I reran the test and waited for a long time.
> I see that this time the OSDs are more evenly distributed.

Good - so what's the next step?
As the balancer is working as expected, this is not a regression or a blocker. Removing the blocker flag as discussed in the OCS meeting yesterday; this should probably be closed as not a bug.
(In reply to Yaniv Kaul from comment #11)
> (In reply to krishnaram Karthick from comment #10)
> > I reran the test and waited for a long time.
> > I see that this time the OSDs are more evenly distributed.
>
> Good - so what's the next step?

Reaching out to the performance team running CNV workloads to see if this is seen on a scaled-up cluster with CNV workload, as that is where the issue was originally seen.
Moving out of 4.6; once we have the inputs from the perf team we can move forward.
(In reply to krishnaram Karthick from comment #13)
> (In reply to Yaniv Kaul from comment #11)
> > (In reply to krishnaram Karthick from comment #10)
> > > I reran the test and waited for a long time.
> > > I see that this time the OSDs are more evenly distributed.
> >
> > Good - so what's the next step?
>
> Reaching out to the performance team running CNV workloads to see if this is
> seen on a scaled-up cluster with CNV workload as that is where the issue was
> originally seen.

Any update on this?
(In reply to Josh Durgin from comment #15)
> Any update on this?

The last time I reached out, I couldn't get a CNV system that runs with a storage capacity as described in the bug. But I'm retaining the needinfo to check once again, or maybe see if there is an automated test that we could run on our test environments.
Please reopen if you see this again.
Removing the needinfo flag. We weren't able to reproduce this scenario.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.