Description of problem:

[Workload-DFG] EC profile is setting min_size to K+1 instead of K

Version-Release number of selected component (if applicable):
RHCS 5.1 - 16.2.7-14.el8cp

How reproducible:
always

Steps to Reproduce:
1. Deploy an RHCS 5.1 cluster
2. Create an EC pool
3. Check min_size

In RHCS 5 (Pacific) and above we have a feature to recover a PG of an EC pool with only K shards, but when creating a pool the min_size is still set to K+1. This causes PGs to go inactive when only K shards are available and leaves recovery stuck for those PGs.

For example, in the workload-DFG cluster:

  services:
    mon: 3 daemons, quorum f04-h29-b05-5039ms.rdu2.scalelab.redhat.com,f04-h29-b06-5039ms,f04-h29-b07-5039ms (age 30h)
    mgr: f04-h29-b06-5039ms.urzwpy(active, since 22h), standbys: f04-h29-b07-5039ms.harxed, f04-h29-b05-5039ms.rdu2.scalelab.redhat.com.bonpxg
    osd: 288 osds: 240 up (since 11m), 240 in (since 87s); 1046 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 10 daemons active (10 hosts, 1 zones)

  data:
    pools:   7 pools, 1513 pgs
    objects: 376.61M objects, 5.1 TiB
    usage:   30 TiB used, 425 TiB / 455 TiB avail
    pgs:     20.291% pgs not active
             373382957/2259669324 objects degraded (16.524%)
             52436093/2259669324 objects misplaced (2.321%)
             702 active+undersized+degraded+remapped+backfill_wait
             467 active+clean
             248 undersized+degraded+remapped+backfill_wait+peered
             59  undersized+degraded+remapped+backfilling+peered
             25  active+undersized+degraded+remapped+backfilling
             11  active+remapped+backfill_wait
             1   active+recovery_wait+undersized+degraded+remapped

We can clearly see the following PGs are inactive:

             248 undersized+degraded+remapped+backfill_wait+peered
             59  undersized+degraded+remapped+backfilling+peered

Checking min_size for the EC pool, you can clearly see that for EC 4+2 it has min_size 5:

# ceph osd erasure-code-profile get myprofile
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

pool 17 'default.rgw.buckets.data' erasure profile myprofile size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 1270 pgp_num 1207 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 8071 lfor 0/8069/8067 flags hashpspool stripe_width 16384 application rgw

[root@f04-h29-b05-5039ms ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "default.rgw.buckets.data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 6,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

When we set min_size to 4, the inactive PGs go to active state!
[root@f04-h29-b05-5039ms ~]# ceph osd pool set default.rgw.buckets.data min_size 4
set pool 17 min_size to 4

    pgs:     371088857/2259669324 objects degraded (16.422%)
             52299025/2259669324 objects misplaced (2.314%)
             919 active+undersized+degraded+remapped+backfill_wait
             498 active+clean
             83  active+undersized+degraded+remapped+backfilling
             11  active+remapped+backfill_wait
             2   active+undersized+remapped+backfill_wait

[root@f04-h29-b05-5039ms ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "default.rgw.buckets.data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 6,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

There is no change in the crush rule (it still has min_size 3); the issue is mainly during pool creation. So it looks like the feature to recover with only K shards has been fixed and brought into Pacific and above, but we may also need to fix pool creation so that min_size for EC pools is set to K, not K+1.
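To illustrate why lowering min_size flipped these PGs from peered (inactive) to active, here is a minimal sketch of the activation condition (a PG needs at least min_size shards available to go active). This is a simplified illustration, not Ceph's actual peering code, and the function name is made up:

#include <cstdio>
#include <initializer_list>

// Simplified model of the activation check: a PG can go active only if the
// number of shards it can bring up is at least the pool's min_size.
// (Illustration only; this is not Ceph's peering logic.)
static bool pg_can_go_active(unsigned available_shards, unsigned pool_min_size) {
  return available_shards >= pool_min_size;
}

int main() {
  const unsigned k = 4;                   // EC 4+2 (k=4, m=2), pool size = 6
  const unsigned available = k;           // only the K data shards are reachable
  for (unsigned min_size : {k + 1, k}) {  // 5 (default at pool creation) vs 4 (set manually above)
    std::printf("min_size=%u, available shards=%u -> %s\n",
                min_size, available,
                pg_can_go_active(available, min_size) ? "active" : "inactive (peered)");
  }
  return 0;
}

With min_size 5, a PG that is down to its 4 data shards stays peered, which matches the 248+59 inactive PGs above; with min_size 4 the same PG can go active and backfill.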
I think this is where we need to fix the code. In the pacific branch, file src/mon/OSDMonitor.cc:

int OSDMonitor::prepare_pool_size(const unsigned pool_type,
                                  const string &erasure_code_profile,
                                  uint8_t repl_size,
                                  unsigned *size, unsigned *min_size,
                                  ostream *ss)
<.........>
  case pg_pool_t::TYPE_ERASURE:
    {
      if (osdmap.stretch_mode_enabled) {
        *ss << "prepare_pool_size: we are in stretch mode; cannot create EC pools!";
        return -EINVAL;
      }
      ErasureCodeInterfaceRef erasure_code;
      err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
      if (err == 0) {
        *size = erasure_code->get_chunk_count();
        *min_size = erasure_code->get_data_chunk_count() +
          std::min<int>(1, erasure_code->get_coding_chunk_count() - 1);   <=======================
        assert(*min_size <= *size);
        assert(*min_size >= erasure_code->get_data_chunk_count());
      }
    }
    break;

*size is correct because it is the total chunk count: for EC 4+2 size is 6 and for EC 8+3 size is 11, which matches the pool:

pool 17 'default.rgw.buckets.data' erasure profile myprofile size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 1270 pgp_num 1207 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 8071 lfor 0/8069/8067 flags hashpspool stripe_width 16384 application rgw

With the current code, EC 4+2 gets min_size 5:

*min_size = erasure_code->get_data_chunk_count() + std::min<int>(1, erasure_code->get_coding_chunk_count() - 1);

min_size = 4 + min(1, (2 - 1)) = 4 + 1 = 5

I think the fix should be:

*min_size = erasure_code->get_data_chunk_count();
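To make the arithmetic concrete, here is a minimal standalone sketch comparing the current formula with the change proposed in this report. The Profile struct and both helper functions are made up for illustration; in Ceph the K and M values come from the ErasureCodeInterface getters shown above:

#include <algorithm>
#include <cstdio>

// Stand-in for an EC profile; in Ceph these come from
// ErasureCodeInterface::get_data_chunk_count() / get_coding_chunk_count().
struct Profile { unsigned k; unsigned m; };

// min_size as currently computed in prepare_pool_size(): K + min(1, M - 1).
static unsigned min_size_current(const Profile &p) {
  return p.k + std::min(1u, p.m - 1);
}

// min_size as proposed in this report: just K, so a PG with only its data
// chunks available can still go active and recover.
static unsigned min_size_proposed(const Profile &p) {
  return p.k;
}

int main() {
  const Profile profiles[] = {{4, 2}, {8, 3}};
  for (const auto &p : profiles) {
    std::printf("EC %u+%u: size=%u, current min_size=%u, proposed min_size=%u\n",
                p.k, p.m, p.k + p.m,
                min_size_current(p), min_size_proposed(p));
  }
  return 0;
}

For 4+2 this prints a current min_size of 5 and a proposed min_size of 4, matching the pool dump above; for 8+3 it would be 9 versus 8.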
Closing this one. Please check https://tracker.ceph.com/issues/53940#note-1 for more details. We will create a KCS and attach it to this bug.