Bug 2039585

Summary: [Workload-DFG] EC pool creation is setting min_size to K+1 instead of K
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vikhyat Umrao <vumrao>
Component: RADOS    Assignee: Laura Flores <lflores>
Status: CLOSED DUPLICATE QA Contact: Pawan <pdhiran>
Severity: high Docs Contact:
Priority: unspecified    
Version: 5.1    CC: akupczyk, amathuri, bhubbard, ceph-eng-bugs, ceph-qe-bugs, choffman, gsitlani, ksirivad, lflores, nojha, pdhange, rfriedma, rzarzyns, sseshasa, vumrao
Target Milestone: ---   
Target Release: 5.2   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2022-01-20 22:27:37 UTC    Type: Bug

Description Vikhyat Umrao 2022-01-12 01:08:27 UTC
Description of problem:
[Workload-DFG] EC pool creation is setting min_size to K+1 instead of K


Version-Release number of selected component (if applicable):
RHCS 5.1 - 16.2.7-14.el8cp

How reproducible:
always

Steps to Reproduce:
1. Deploy an RHCS 5.1 cluster
2. Create an EC pool
3. Check the pool's min_size (see the example commands below)
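
A minimal reproduction sketch, assuming a hypothetical 4+2 profile and pool name (ec42profile and ecpool are illustrative, not from the cluster below):

# ceph osd erasure-code-profile set ec42profile k=4 m=2 crush-failure-domain=host
# ceph osd pool create ecpool 32 32 erasure ec42profile
# ceph osd pool get ecpool min_size
(with the current behavior this is expected to report min_size: 5, i.e. K+1, rather than 4, i.e. K)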

In RHCS 5 (Pacific) and above we have a feature to recover a PG for an erasure-coded pool with only K shards, but when creating a pool the min_size is still set to K+1. That causes PGs to go inactive when only K shards are available and leaves recovery stuck for those PGs.

For example, in the Workload-DFG cluster:

services:
    mon: 3 daemons, quorum f04-h29-b05-5039ms.rdu2.scalelab.redhat.com,f04-h29-b06-5039ms,f04-h29-b07-5039ms (age 30h)
    mgr: f04-h29-b06-5039ms.urzwpy(active, since 22h), standbys: f04-h29-b07-5039ms.harxed, f04-h29-b05-5039ms.rdu2.scalelab.redhat.com.bonpxg
    osd: 288 osds: 240 up (since 11m), 240 in (since 87s); 1046 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 10 daemons active (10 hosts, 1 zones)
 
  data:
    pools:   7 pools, 1513 pgs
    objects: 376.61M objects, 5.1 TiB
    usage:   30 TiB used, 425 TiB / 455 TiB avail
    pgs:     20.291% pgs not active
             373382957/2259669324 objects degraded (16.524%)
             52436093/2259669324 objects misplaced (2.321%)
             702 active+undersized+degraded+remapped+backfill_wait
             467 active+clean
             248 undersized+degraded+remapped+backfill_wait+peered
             59  undersized+degraded+remapped+backfilling+peered
             25  active+undersized+degraded+remapped+backfilling
             11  active+remapped+backfill_wait
             1   active+recovery_wait+undersized+degraded+remapped


We can clearly see the following PGs are inactive:

  248 undersized+degraded+remapped+backfill_wait+peered
   59  undersized+degraded+remapped+backfilling+peered


Checking min_size for the EC pool, you can clearly see that for EC 4+2 it has min_size 5.


# ceph osd erasure-code-profile get myprofile
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8


pool 17 'default.rgw.buckets.data' erasure profile myprofile size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 1270 pgp_num 1207 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 8071 lfor 0/8069/8067 flags hashpspool stripe_width 16384 application rgw



[root@f04-h29-b05-5039ms ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "default.rgw.buckets.data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 6,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]


When we set min_size to 4, the inactive PGs go active!

[root@f04-h29-b05-5039ms ~]# ceph osd pool set default.rgw.buckets.data min_size 4
set pool 17 min_size to 4
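
To confirm the new value, something along these lines can be used (output not captured here):

[root@f04-h29-b05-5039ms ~]# ceph osd pool get default.rgw.buckets.data min_size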


pgs:     371088857/2259669324 objects degraded (16.422%)
             52299025/2259669324 objects misplaced (2.314%)
             919 active+undersized+degraded+remapped+backfill_wait
             498 active+clean
             83  active+undersized+degraded+remapped+backfilling
             11  active+remapped+backfill_wait
             2   active+undersized+remapped+backfill_wait



[root@f04-h29-b05-5039ms ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "default.rgw.buckets.data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 6,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]


There is no change in the CRUSH rule output because the rule already has min_size set to 3; the issue is mainly during pool creation!

So it looks like we have the feature to recover with only K shards in Pacific and above, but maybe we need to fix the pool creation code to set min_size for EC pools to K, not K+1.

Comment 1 Vikhyat Umrao 2022-01-12 02:14:41 UTC
I think this is where we need to fix the code:

In the Pacific branch, file src/mon/OSDMonitor.cc:

int OSDMonitor::prepare_pool_size(const unsigned pool_type,
                                  const string &erasure_code_profile,
                                  uint8_t repl_size,
                                  unsigned *size, unsigned *min_size,
                                  ostream *ss)


<.........>


  case pg_pool_t::TYPE_ERASURE:
    {
      if (osdmap.stretch_mode_enabled) {
        *ss << "prepare_pool_size: we are in stretch mode; cannot create EC pools!";
        return -EINVAL;
      }
      ErasureCodeInterfaceRef erasure_code;
      err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
      if (err == 0) {
        *size = erasure_code->get_chunk_count();

          ^^ size is set correctly because it is the total chunk count: for EC 4+2, size is 6; for EC 8+3, size is 11

pool 17 'default.rgw.buckets.data' erasure profile myprofile size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 1270 pgp_num 1207 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 8071 lfor 0/8069/8067 flags hashpspool stripe_width 16384 application rgw


        *min_size =
          erasure_code->get_data_chunk_count() +
          std::min<int>(1, erasure_code->get_coding_chunk_count() - 1);  <=======================



        assert(*min_size <= *size);
        assert(*min_size >= erasure_code->get_data_chunk_count());
      }
    }
    break;

With the current code, for EC 4+2 it gives min_size = 5:

  *min_size =
          erasure_code->get_data_chunk_count() +
          std::min<int>(1, erasure_code->get_coding_chunk_count() - 1);

  min_size = 4+min(1, (2-1)) 
           = 4+1
           = 5

I think the fix should be:

  *min_size = erasure_code->get_data_chunk_count();

With that change, EC 4+2 would get min_size = 4 (and EC 8+3 would get 8), matching the K-shard recovery behavior.

Comment 2 Vikhyat Umrao 2022-01-20 22:27:37 UTC
Closing this one. Please check https://tracker.ceph.com/issues/53940#note-1 for more details. We will create a KCS and attach it to this bug.