Bug 1814681 - [RFE] use topologySpreadConstraints to evenly spread OSDs across hosts
Summary: [RFE] use topologySpreadConstraints to evenly spread OSDs across hosts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Kesavan
QA Contact: Warren
URL:
Whiteboard:
Duplicates: 1778216 (view as bug list)
Depends On:
Blocks: 1776562 1817438
 
Reported: 2020-03-18 14:12 UTC by acalhoun
Modified: 2023-09-15 00:30 UTC
CC List: 21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:14:51 UTC
Embargoed:


Attachments
PGs per OSD (26.31 KB, image/png), 2020-03-18 14:12 UTC, acalhoun
Per OSD Capacity (53.26 KB, image/png), 2020-03-18 14:13 UTC, acalhoun


Links
Red Hat Product Errata RHSA-2021:2041, last updated 2021-05-19 09:15:38 UTC

Description acalhoun 2020-03-18 14:12:33 UTC
Created attachment 1671097 [details]
PGs per OSD

Description of problem:
While conducting dry runs for bare-metal capacity scaling, I observed that when transitioning from 3 to 12 OSDs, PGs were not re-adjusted across all of the new OSDs. As testing continued, the OSDs with more PGs filled up before the others, resulting in uneven distribution and an incorrect full status which prevents further data from being written. This ultimately resulted in a failed test.

Version-Release number of selected component (if applicable):
quay.io/ocs-dev/ocs-registry:latest
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)

How reproducible:
Every time

Steps to Reproduce:
Deploy OCP on a bare-metal environment
Deploy OCS with 3 OSDs
Modify the device set and increase the number of OSDs, choose

Actual results:

ceph status
  cluster:
    id:     c8a5852b-6035-4333-b2db-d4498e85c2a6
    health: HEALTH_ERR
            1 backfillfull osd(s)
            1 full osd(s)
            7 pool(s) full

  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 104m)
    mgr: a(active, since 104m)
    mds: example-storagecluster-cephfilesystem:1 {0=example-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 12 up (since 90m), 12 in (since 90m)

  data:
    pools:   7 pools, 672 pgs
    objects: 977.60k objects, 3.7 TiB
    usage:   11 TiB used, 16 TiB / 28 TiB avail
    pgs:     672 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

pg balancer status 
{
    "last_optimize_duration": "0:00:00.010144",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Tue Mar 17 23:19:12 2020"
}


Expected results:
Expect Ceph to rebalance PGs as additional OSDs are added to the cluster.

Additional info:
Attached two graphs that depict PG per OSD counts and Per OSD Capacity.

Installation and adding OSDs were performed following this guide: https://docs.google.com/document/d/1AqFw3ylCfZ7vxq-63QVbGPHESPW-xep4reGVS0mPgXQ/edit?usp=sharing

Comment 2 acalhoun 2020-03-18 14:13:49 UTC
Created attachment 1671098 [details]
Per OSD Capacity

Observe that OSDs 0, 1, 2, and 11 have a significantly higher used capacity than the remaining OSDs. At ~20:00 these OSDs reached their full limits (total capacity is ~2.2 TiB), which prevented additional writes to the cluster.

Comment 3 acalhoun 2020-03-18 14:15:40 UTC
Comment on attachment 1671097 [details]
PGs per OSD

Added additional OSDs at ~18:37; initial PGs were allocated but rebalancing did not occur.

Comment 4 Josh Durgin 2020-03-18 14:26:49 UTC
14.2.8 would be post-4.3 - that's the basis of RHCS 4.1, which is still
in development.

As noted in email, the pg counts are lower than they should be.

Which pool(s) were you filling up?

ocs-operator sets up the target ratio for rbd and cephfs data pools [1].
If you're creating additional pools, you'll need to set a target ratio
or size for them so the autoscaler can set the pg count appropriately.

You can see the target size ratios for each pool in 'ceph osd pool ls detail'.

Comment 5 acalhoun 2020-03-18 14:49:15 UTC
My cluster is no longer responding, but I was using the default pools provided by the storage cluster, example-storagecluster-cephblockpool

below is the previously recorded pg auto scaling status and ceph osd df output 

oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph osd pool autoscale-status
POOL                                                 SIZE TARGET SIZE RATE RAW CAPACITY  RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE 
example-storagecluster-cephblockpool                3801G              3.0       28317G 0.4027       0.4900  1.0    256            on        
example-storagecluster-cephobjectstore.rgw.control     0               3.0       28317G 0.0000               1.0     32            on        
example-storagecluster-cephobjectstore.rgw.log         0               3.0       28317G 0.0000               1.0     32            on        
.rgw.root                                              0               3.0       28317G 0.0000               1.0     32            on        
example-storagecluster-cephfilesystem-metadata      2286               3.0       28317G 0.0000               4.0     32            on        
example-storagecluster-cephfilesystem-data0            0               3.0       28317G 0.0000       0.4900  1.0    256            on        
example-storagecluster-cephobjectstore.rgw.meta        0               3.0       28317G 0.0000               1.0     32            on   

[root@f03-h29-000-r620 test-files]#  oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS 
 2   ssd 2.91100  1.00000 2.9 TiB 2.0 TiB 2.0 TiB  71 KiB 3.6 GiB 886 GiB 70.28 1.74 341     up 
11   ssd 2.91100  1.00000 2.9 TiB 1.7 TiB 1.7 TiB  32 KiB 3.0 GiB 1.2 TiB 57.47 1.42 319     up 
 0   ssd 2.18320  1.00000 2.2 TiB 1.9 TiB 1.9 TiB  43 KiB 3.2 GiB 333 GiB 85.11 2.11 329     up 
 1   ssd 2.18320  1.00000 2.2 TiB 1.9 TiB 1.9 TiB  52 KiB 3.2 GiB 330 GiB 85.22 2.11 327     up 
 3   ssd 2.18320  1.00000 2.2 TiB 551 GiB 549 GiB  20 KiB 1.3 GiB 1.6 TiB 24.63 0.61 103     up 
 4   ssd 2.18320  1.00000 2.2 TiB 397 GiB 396 GiB  27 KiB 1.0 GiB 1.8 TiB 17.75 0.44  97     up 
 5   ssd 2.18320  1.00000 2.2 TiB 539 GiB 538 GiB  12 KiB 1.2 GiB 1.7 TiB 24.12 0.60  98     up 
 6   ssd 2.18320  1.00000 2.2 TiB 474 GiB 473 GiB  20 KiB 1.2 GiB 1.7 TiB 21.21 0.53  78     up 
 7   ssd 2.18320  1.00000 2.2 TiB 417 GiB 416 GiB  19 KiB 1.0 GiB 1.8 TiB 18.65 0.46  80     up 
 8   ssd 2.18320  1.00000 2.2 TiB 461 GiB 460 GiB  12 KiB 1.0 GiB 1.7 TiB 20.63 0.51  78     up 
 9   ssd 2.18320  1.00000 2.2 TiB 493 GiB 492 GiB  20 KiB 1.2 GiB 1.7 TiB 22.06 0.55  90     up 
10   ssd 2.18320  1.00000 2.2 TiB 478 GiB 477 GiB  22 KiB 1.2 GiB 1.7 TiB 21.40 0.53  76     up 
                    TOTAL  28 TiB  11 TiB  11 TiB 354 KiB  22 GiB  16 TiB 40.35

Comment 6 Josh Durgin 2020-03-18 16:14:16 UTC
Doing the math again, the number of pgs does match the target ratio:

12 OSDs * 100 target PGs per OSD / 3 replicas * 0.49 target ratio = 196 -> rounded up to the next power of two, 256 pgs

When you reproduce, can you attach a pg dump and a binary osdmap (ceph osd getmap -o /tmp/osdmap)? That will let us see the distribution of pgs per pool and check the balancer's behavior.

Comment 7 Michael Adam 2020-03-18 22:40:21 UTC
not 4.3 material

Comment 8 acalhoun 2020-03-19 22:41:48 UTC
Retested with quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc2 and the same issue occurred.

Evaluated the cluster configuration with Ben England and observed that I had non-uniform racks: Rack0 had 3 hosts, Rack1 had 3 hosts, and Rack2 had 2 hosts. We suspected that because each rack did not have the same number of hosts, OCS had improperly balanced OSDs/PGs.

When deploying OSDs, OCS deployed them as below:

Rack 0 
 -host 0
   --osd0
 -host 1
   --osd1
Rack 1
 -host 2
   --osd2
 -host 3
   --osd11
Rack 2
 -host4
   --osd3
   --osd4
   --osd5
   --osd6
 -host5
   --osd7
   --osd8
   --osd9
   --osd10

Is it expected that users have the same number of hosts/capacity in each rack? Is this just a configuration error, or is it still a functional issue?

Comment 9 Josh Durgin 2020-03-20 20:15:05 UTC
(In reply to acalhoun from comment #8)
> Retested with quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc2 and the same
> issue occured. 
> 
> Evaluated cluster Configuration with Ben England and observed that I had non
> uniform Racks that is Rack0 had 3host, Rack1 had 3host, Rack2 had 2host. We
> suspected that because each rack did not have the same number of host, OCS
> had improperly balanced OSDS/PGs.
> 
> When deploying OSDs, OCS deployed them as below:
> 
> Rack 0 
>  -host 0
>    --osd0
>  -host 1
>    --osd1
> Rack 1
>  -host 2
>    --osd2
>  -host 3
>    --osd11
> Rack 2
>  -host4
>    --osd3
>    --osd4
>    --osd5
>    --osd6
>  -host5
>    --osd7
>    --osd8
>    --osd9
>    --osd10
> 
> Is it expected that users have the same number of host/capacity for each
> Rack, Is this just a configuration error or still a functional issue?

If crush is configured to split across racks, then it's expected to have similar capacity in each, otherwise you may not be able to use the full capacity or balance data appropriately. Same for splitting across hosts. Without this, crush constrains how balanced the cluster can be. For example, if you have two hosts with 5x8 TB osds and one host with 1x8 TB osd you can only use up to 10TB of 3x replicated space, and the single osd host will have 5x the pgs/osd as the other two. For maximum parallelism and performance, you need equally sized hosts and racks.

I'm not sure how ocs/rook/ocp is controlling scheduling of osds onto hosts/racks - Seb where is this controlled?

Comment 10 Raz Tamir 2020-03-23 07:08:37 UTC
Hi acalhoun,

Is this a bare metal only issue?

Comment 11 acalhoun 2020-03-23 17:17:15 UTC
Tamir,

Based on my testing I haven't observed this issue on AWS, although the rack topology setup is different between AWS and BM. I believe on AWS "Rack" aligns with availability zones, while on BM it is a rack in the traditional sense.

Overall, I believe Josh's assessment about the poor distribution due to the variation in rack/host capacity is correct. I am surprised at how significant the effect is, but I am not sure whether this is "okay" or changes are necessary. I re-ran this test with balanced racks (2 hosts per rack, 4 racks in total), although devices varied from 2.2 TiB to 2.9 TiB, and this difference still resulted in high variation of data distribution and an incorrect full status at 12 OSDs.

Comment 12 Sébastien Han 2020-03-24 15:09:47 UTC
@Josh, the kube scheduler is responsible for this. OCS labels the nodes and passes that selector to Rook, which then creates the resources and leaves the rest to Kubernetes.
We need Kubernetes 1.18 to get proper placement with topologySpreadConstraints; this is already tracked upstream: https://github.com/rook/rook/issues/4387
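
For illustration only, a spread constraint of the kind referred to here could look roughly like the sketch below. This is not the stanza Rook/OCS ends up generating; the pod label (app: rook-ceph-osd) and the host-level topology key are assumptions.

# Hypothetical sketch: spread OSD pods evenly across hosts.
# The pod label and topologyKey are assumptions, not values taken from the fix.
topologySpreadConstraints:
- maxSkew: 1                            # allow at most one OSD of difference between hosts
  topologyKey: kubernetes.io/hostname   # spread across individual nodes
  whenUnsatisfiable: ScheduleAnyway     # prefer an even spread, but do not block scheduling
  labelSelector:
    matchLabels:
      app: rook-ceph-osd

With whenUnsatisfiable set to ScheduleAnyway the scheduler treats the spread as a preference; DoNotSchedule would make it a hard requirement.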

Comment 13 Kyle Bader 2020-03-24 18:12:39 UTC
We noticed that Kubernetes didn't have anything like topologySpreadConstraints when we were doing failure-domain testing of 4.2. The resulting scheme was to have storageDeviceSets in the storageCluster, and then build a cephCluster storageClassDeviceSets list with a set for each failure domain (rack or zone). If rack or zone failure-domain labels didn't exist, we'd create virtual 'racks' (for VMware, because their cloud provider didn't lend the ability to surface failure-domain data to OCP, at least in 4.2). Pools would use rack or zone as the failure domain to decluster replicas. The UI for 4.2 would surface the failure-domain information, and when you "picked" nodes you needed to pick an equal number from each failure domain, and have at least 3 failure domains.

With bare metal, we don't use the UI, and the docs don't have "here be dragons" text where we instruct the user to add OCS labels to hosts. The result is someone might end up labeling hosts in 2 racks, with an unbalanced number of nodes in each. Because the rack label exists, we can't create virtual ones. Since there are < 3 rack failure domains, we fall back to host, and we presumably have a single set in storageClassDeviceSets. In this scenario, there is no way to ensure OSDs are balanced across hosts.

Docs should probably add text that tells people not to point a gun at their foot when labeling hosts on bare metal. Second, it might be worthwhile to have the OCS Operator use a crush bucket type between "rack" and "host" for "virtual racks" if rack/zone failure-domain count is less than three. I haven't thought about how topologySpreadConstraints would change the strategy.

Comment 14 Kyle Bader 2020-03-24 18:14:53 UTC
Also, since we have a bug where the PVC ID is used for the host bucket name in crush, even when portable: false, we can't even guarantee replicas are on distinct hosts!

Comment 16 Travis Nielsen 2020-03-25 15:25:55 UTC
@Kyle Besides the potential "portable: false" issue, are you seeing any reason we couldn't solve these issues with documentation for bare metal? Or the concern is basically how we can help the user not shoot themselves in the foot while following the documentation?

Comment 17 Kyle Bader 2020-03-25 15:38:47 UTC
I'm doing this in parallel with Annette. Basically after folks label their nodes, we're going to have them run -

oc get nodes -L failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/rack -l cluster.ocs.openshift.io/openshift-storage=''

NAME                                         STATUS   ROLES    AGE     VERSION   ZONE         RACK
ip-10-0-128-167.us-west-2.compute.internal   Ready    worker   41h     v1.16.2   us-west-2a
ip-10-0-133-93.us-west-2.compute.internal    Ready    worker   5d15h   v1.16.2   us-west-2a
ip-10-0-159-206.us-west-2.compute.internal   Ready    worker   5d15h   v1.16.2   us-west-2b
ip-10-0-172-122.us-west-2.compute.internal   Ready    worker   5d15h   v1.16.2   us-west-2c

To verify they do in fact have an even number of nodes in at least 3 distinct racks or zones.

Personally, I'd set the minimum at closing #1816820 combined with a docs fix along the lines of the above, and then look into the viability of switching to a new crush bucket type between host and rack/zone as our "virtual rack" for 4.4.

Comment 18 Michael Adam 2020-03-25 22:40:59 UTC
(In reply to leseb from comment #12)
> @Josh, the kube scheduler is responsible for this, OCS labels the nodes and
> passes that selector to Rook which then creates the resources and leave the
> rest to Kubernetes.

@Seb, Not 100% correct, afaik. OCS operator does a bit more. See below:

(In reply to Kyle Bader from comment #13)
> We noticed that Kubernetes didn't have a thing like
> topologySpreadConstraints when we were doing failure domain testing of 4.2.
> The resulting scheme was to have a storageDeviceSets in the storageCluster,
> and then build a cephCluster storageClassDeviceSets list with a set for each
> failure-domain (rack or zone). If there rack or zone failure-domain labels
> didn't exist, we'd create virtual 'racks' (for VMware, because their cloud
> provider didn't lend the ability to surface failure-domain data to OCP, at
> least in 4.2). Pools would be use rack or zone as the failure-domain to
> decluster replicas.

@Kyle, I'm a little confused about the "we" here: This sounds like what we did in
the development of OCS operator (see below). Are you referring to that work,
or are you describing what you did in testing with manual preparation of the OCP cluster?

> The UI for 4.2 would surfacce the failure-domain
> information, and when you "picked" nodes you needed to pick an equal number
> from each failure domain, and have at least 3 failure domains.
> 
> With bare metal, we don't use the UI,

Yeah... My original understanding was that we would only be using the CLI to
set up the LSO and PVs for the backend disks. And would use the UI just like
normal from there on. Maybe that is not how it ended up.

> and the docs don't have "here be
> dragons" text when we instruct the user to add OCS labels to hosts. The
> result is someone might end up labeling hosts in 2 racks, with an unbalanced
> number of nodes each each. Because the rack label exists, we can't create
> virtual ones. Since there are < 3 failure-domain rack, we fall back to host,
> and we presumably have a single set in storageClassDeviceSets. In this
> scenario, there is no way to ensure OSDs are balanced across hosts.
> 
> Docs should probably add text that tells people not to point a gun at their
> foot when labeling hosts on bare metal. Second, it might be worthwhile to
> have the OCS Operator use a crush bucket type between "rack" and "host" for
> "virtual racks" if rack/zone failure-domain count is less than three.

ocs-operator is using "rack" as a virtual zone.
This was done for cases where we have < #replica AZs in AWS.
In general, if it does not find enough (>= #replica) zone labels on the nodes,
it will create #replica rack labels on the nodes, distributing the nodes
as evenly as possible across the racks, and will try to ensure
an as-even-as-possible distribution of the OSDs among the racks
using affinity and anti-affinity settings.
It also chops the StorageDeviceSet up into multiple StorageClassDeviceSets.
Within the rack label, the kubernetes scheduler is responsible for placing OSDs,
so we will not necessarily have an even distribution of OSDs among nodes with the
same rack.

So if, in a bare-metal environment, the admin has already created rack labels,
then ocs-operator would honor them and just try and distribute the OSDs among
them as evenly as possible. But it is indeed important to spread the storage
nodes evenly across the racks.

The description here looks as if the hosts have been distributed across the racks
well (2 in each rack), so that is fine. Not sure why the OSDs are not distributed
well across racks.

Note that all this "magic" would only happen in ocs-operator if no "placement" is
explicitly configured in the StorageDeviceSet.

See: https://github.com/openshift/ocs-operator/blob/release-4.3/deploy/olm-catalog/ocs-operator/4.3.0/storagecluster.crd.yaml#L110
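
As a rough, hypothetical sketch of what such an explicit "placement" could look like (the fields under placement and the PVC values are illustrative assumptions, not copied from the CRD linked above):

storageDeviceSets:
- name: example-deviceset
  count: 3
  placement:                          # if this is set, ocs-operator skips its rack-labelling "magic"
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
  dataPVCTemplate:
    spec:
      accessModes: ["ReadWriteOnce"]
      volumeMode: Block
      resources:
        requests:
          storage: 2Ti
      storageClassName: localblock    # placeholder; whichever LSO storage class backs the OSDs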

I am not 100% sure which doc was followed to set this up.

- Was the UI not used for setup after the LSO and backend PVs were created?
- Was "placement" used?
- Can I see the StorageCluster cr that was used?


Regarding the introduction of a bucket between rack and host:
We are doing pretty much what you are describing, but with racks.
We were not aware of any existing level between rack and host, so
ended up using rack since ceph knows about it, and we didn't know
that OCP would set these labels automatically (like it sets zone
labels automatically for AWS AZs...).


> I haven't thought about how topologySpreadConstraints would change the
> strategy.

Comment 19 Raz Tamir 2020-04-12 13:59:04 UTC
As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS (comment #11), marking as a blocker to 4.4

Comment 20 Michael Adam 2020-04-28 11:34:59 UTC
(In reply to Raz Tamir from comment #19)
> As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS
> (comment #11), marking as a blocker to 4.4

Have you not seen it in AWS ever, or possibly just not when running with at least 3 zones?

Comment 21 Michael Adam 2020-04-28 11:47:36 UTC
This whole BZ is a bit convoluted.
I cannot clearly see what the actual problem is that this BZ is about.

If I understand it correctly, what is described here is a combination of various aspects and currently mostly works as designed...


(In reply to Josh Durgin from comment #9)
> 
> If crush is configured to split across racks, then it's expected to have
> similar capacity in each, otherwise you may not be able to use the full
> capacity or balance data appropriately. Same for splitting across hosts.
> Without this, crush constrains how balanced the cluster can be. For example,
> if you have two hosts with 5x8 TB osds and one host with 1x8 TB osd you can
> only use up to 10TB of 3x replicated space, and the single osd host will
> have 5x the pgs/osd as the other two. For maximum parallelism and
> performance, you need equally sized hosts and racks.
> 
> I'm not sure how ocs/rook/ocp is controlling scheduling of osds onto
> hosts/racks - Seb where is this controlled?

As explained in comment #18,

* the ocs operator either detects a failure domain (zone, corresponding to
  AWS AZ) or creates one (rack) by labelling nodes into racks artificially.

* The various OSDs should be distributed across the failure domains (rack or
  zone) as evenly as possible. In particular, we should have roughly the same
  capacity in each failure domain (zone or rack).
  ==> If this is not the case, then this is a bug.

* Within the failure domain (rack/zone), the distribution is entirely up
  to the kubernetes scheduler. This is currently NOT done homogeneously 
  across nodes, but it will frequently happen that some nodes get many
  OSDs, some nodes get only a few and some get none.
  There is just nothing we can currently do about it, and if Ceph assumes
  the hosts to be of similar capacity (even if the failure domain is set
  to rack or zone), then this is just a fact that we have to accept at this
  point.
  With Kube 1.18 / OCP 4.6, we will fix this by the use of topologySpreadConstraints.

* There is one possible problem with the portable=true on OSDs for bare metal.
  But it is treated in a separate BZ.
  https://bugzilla.redhat.com/show_bug.cgi?id=1816820

* Is there an additional problem at all with the pg adjusting?
  (I really don't know it, it just seems that it is a combination of the above...)
  ==> @Josh?

Comment 22 Rohan CJ 2020-04-28 12:14:43 UTC
I think the google doc you're following is outdated. It specifically mentions rack labels in the StorageCluster object [1]. It should not.
I don't see a must-gather attached, so I can't check if that's the exact cause.

I think it was corrected in doc review process [2]

[1] https://docs.google.com/document/d/1AqFw3ylCfZ7vxq-63QVbGPHESPW-xep4reGVS0mPgXQ/edit
[2] Step 1.2. ctrl+f "kind: storagecluster" 
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/deploying_openshift_container_storage/index?lb_target=preview#installing-openshift-container-storage-using-local-storage-devices_rhocs

Comment 23 Rohan CJ 2020-04-28 12:17:18 UTC
> * The various OSDs should be distributed across the failure domain (rack or
>  zone) as evenly as possible. In particular we should roughly the same
>  capacity in each failure domain (zone or rack).
>  ==> If this is not the case, then this is a bug.

To clarify, specifying the rack labels in the StorageCluster.Spec.StorageDeviceSets would cause the issue Michael mentioned here.
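
For concreteness, a purely hypothetical fragment of the problematic pattern (the label key and the per-rack layout are assumptions about what the outdated doc asked for, not a confirmed excerpt): pinning each device set to one rack takes the spreading decision away from ocs-operator.

storageDeviceSets:
- name: deviceset-rack0                # one hand-written set per rack
  count: 1
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.rook.io/rack   # assumed label key
            operator: In
            values: ["rack0"]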

Comment 24 Michael Adam 2020-04-28 14:19:12 UTC
(In reply to Rohan CJ from comment #23)
> > * The various OSDs should be distributed across the failure domain (rack or
> >  zone) as evenly as possible. In particular we should roughly the same
> >  capacity in each failure domain (zone or rack).
> >  ==> If this is not the case, then this is a bug.
> 
> To clarify, specifying the rack labels in the
> StorageCluster.Spec.StorageDeviceSets would cause the issue Michael
> mentioned here.

Hmm, I don't understand. It is perfectly fine if rack labels already exist.
In this case ocs-operator is going to honour them. ocs-operator will only
create rack labels if nodes are in less than 3 zones and if there are no
rack labels already.

If the admin has distributed the nodes into racks with labels, it should
just work, and ocs-operator should still make sure to distribute the
disks evenly across those racks.

If that is not working (the even distribution across racks or zones),
it is a bug in ocs-operator that needs to be fixed, I believe.

I'd have to double check whether the expansion scenario is different, though.

Comment 25 Josh Durgin 2020-04-28 21:24:36 UTC
(In reply to Michael Adam from comment #21)
> * Within the failure domain (rack/zone), the distribution is entirely up
>   to the kubernetes scheduler. This is currently NOT done homogeneously 
>   across nodes, but it will frequently happen that some nodes get many
>   OSDs, some nodes get only a few and some get none.
>   There is just nothing we can currently do about it, and if Ceph assumes
>   the hosts to be of similar capacity (even if the failure domain is set
>   to rack or zone), then this is just a fact that we have to accept at this
>   point.
>   With Kube 1.18 / OCP 4.6, we will fix this by the use of
> topologySpreadConstraints.

IMO this BZ should track this problem.

> * Is there an additional problem at all with the pg adjusting?
>   (I really don't know it, it just seems that it is a combination of the
> above...)
>   ==> @Josh?

There is no additional problem to my knowledge.

Comment 26 Yaniv Kaul 2020-05-04 06:46:28 UTC
(In reply to Josh Durgin from comment #25)
> (In reply to Michael Adam from comment #21)
> > * Within the failure domain (rack/zone), the distribution is entirely up
> >   to the kubernetes scheduler. This is currently NOT done homogeneously 
> >   across nodes, but it will frequently happen that some nodes get many
> >   OSDs, some nodes get only a few and some get none.
> >   There is just nothing we can currently do about it, and if Ceph assumes
> >   the hosts to be of similar capacity (even if the failure domain is set
> >   to rack or zone), then this is just a fact that we have to accept at this
> >   point.
> >   With Kube 1.18 / OCP 4.6, we will fix this by the use of
> > topologySpreadConstraints.
> 
> IMO this BZ should track this problem.

So why is this bug not targeted to 4.6?

> 
> > * Is there an additional problem at all with the pg adjusting?
> >   (I really don't know it, it just seems that it is a combination of the
> > above...)
> >   ==> @Josh?
> 
> There is no additional problem to my knowledge.

Comment 27 Josh Durgin 2020-05-04 16:02:18 UTC
(In reply to Yaniv Kaul from comment #26)
> (In reply to Josh Durgin from comment #25)
> > (In reply to Michael Adam from comment #21)
> > > * Within the failure domain (rack/zone), the distribution is entirely up
> > >   to the kubernetes scheduler. This is currently NOT done homogeneously 
> > >   across nodes, but it will frequently happen that some nodes get many
> > >   OSDs, some nodes get only a few and some get none.
> > >   There is just nothing we can currently do about it, and if Ceph assumes
> > >   the hosts to be of similar capacity (even if the failure domain is set
> > >   to rack or zone), then this is just a fact that we have to accept at this
> > >   point.
> > >   With Kube 1.18 / OCP 4.6, we will fix this by the use of
> > > topologySpreadConstraints.
> > 
> > IMO this BZ should track this problem.
> 
> So why is this bug not targeted to 4.6?

I assumed Michael was waiting for other comments. Since there are none in a week, going ahead with my suggestion to track only this problem with this BZ, and retitling/moving as appropriate.

If anyone wants to track another issue, please open a separate BZ.

Comment 28 Raz Tamir 2020-05-05 07:17:39 UTC
(In reply to Michael Adam from comment #20)
> (In reply to Raz Tamir from comment #19)
> > As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS
> > (comment #11), marking as a blocker to 4.4
> 
> Have you not seen it in AWS ever, or possibly just not when running with at
> least 3 zones?

Not that I'm aware of.
We are checking OSD distribution, and I remember we had a few bugs, but nothing new on AWS.

Comment 29 Rohan CJ 2020-05-07 14:37:46 UTC
@Rajat confirmed that the API is available from 4.5 onwards. We're okay with using beta.

We're aiming to land this in 4.6.

Comment 30 Mudit Agarwal 2020-09-18 04:28:32 UTC
Removing the blocker flag which was added in 4.4

Comment 31 Michael Adam 2020-09-18 09:07:17 UTC
This is now an epic for 4.7:
https://issues.redhat.com/browse/KNIP-1512

==> moving to 4.7

Comment 33 Neha Berry 2020-12-15 12:39:46 UTC
Hi Kesavan,

Are the fixes for https://bugzilla.redhat.com/show_bug.cgi?id=1817438 and this bug inter-related ? 

Or they need to be verified separately ?

Comment 34 Kesavan 2020-12-15 12:52:49 UTC
Hey Neha, the two bugs need to be verified separately, as one of them is a bare-metal scenario and the other one is AWS.
The topology spread for bare metal is rack and for AWS is AZ (zones).
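
To make that concrete, a hedged sketch of the two variants: the constraint shape is the same, only the topology key differs. The label keys shown are assumptions (the zone key is the one visible in comment 17), not confirmed values from the fix.

# Bare metal: spread OSD pods across racks (assumed rack label key)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.rook.io/rack
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: rook-ceph-osd

# AWS: identical constraint, but spreading across availability zones, e.g.
#   topologyKey: failure-domain.beta.kubernetes.io/zone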

Comment 35 Michael Adam 2021-01-14 18:01:21 UTC
Adding pm-ack (which was not given automatically b/c of the [RFE] tag), since this is an approved epic for 4.7.

Comment 38 Jose A. Rivera 2021-02-03 14:46:06 UTC
*** Bug 1778216 has been marked as a duplicate of this bug. ***

Comment 39 Warren 2021-02-10 05:40:47 UTC
On AWS, PG distribution is in the range of 92 to 100, which is better than 32 to 256.

I will try to do this on vSphere on Wednesday

====================================================================================================
(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep osd
rook-ceph-osd-0-7dc45754fc-8w5vs                                  2/2     Running     0          7h53m
rook-ceph-osd-1-588f9fdf9-t8v4d                                   2/2     Running     0          7h53m
rook-ceph-osd-2-779d9c795b-bxjdk                                  2/2     Running     0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0x52qn-zq8gc           0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-0-data-1xc52x-2tcws           0/1     Init:0/2    0          8s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fw9zb-qkmzd           0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-1-data-16cg8r-9vxjv           0/1     Init:0/2    0          7s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0l8zh2-gd8st           0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-2-data-1h46wn-g6wqk           0/1     Init:0/2    0          7s
(venv) wusui@localhost:~/ocs-ci$ sleep 120; !!
sleep 120; oc -n openshift-storage get pods | grep osd
rook-ceph-osd-0-7dc45754fc-8w5vs                                  2/2     Running     0          7h55m
rook-ceph-osd-1-588f9fdf9-t8v4d                                   2/2     Running     0          7h55m
rook-ceph-osd-2-779d9c795b-bxjdk                                  2/2     Running     0          7h55m
rook-ceph-osd-3-58577cf8c5-przg2                                  2/2     Running     0          106s
rook-ceph-osd-4-799f945b7f-zwrp5                                  2/2     Running     0          105s
rook-ceph-osd-5-856545cfc7-7bspr                                  2/2     Running     0          103s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0x52qn-zq8gc           0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-0-data-1xc52x-2tcws           0/1     Completed   0          2m14s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fw9zb-qkmzd           0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-1-data-16cg8r-9vxjv           0/1     Completed   0          2m13s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0l8zh2-gd8st           0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-2-data-1h46wn-g6wqk           0/1     Completed   0          2m13s

====================================================================================================

sh-4.4# ceph status
  cluster:
    id:     467e00f5-3885-4fb5-949e-6f3eef7d40a1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 8h)
    mgr: a(active, since 8h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 18m), 6 in (since 18m)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 192 pgs
    objects: 1.33k objects, 4.4 GiB
    usage:   18 GiB used, 12 TiB / 12 TiB avail
    pgs:     192 active+clean

  io:
    client:   853 B/s rd, 124 KiB/s wr, 1 op/s rd, 1 op/s wr

sh-4.4# ceph osd pool autoscale-status
POOL                                         SIZE TARGET SIZE RATE RAW CAPACITY  RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE 
ocs-storagecluster-cephblockpool            4513M              3.0       12288G 0.0011       0.4900          1.0000  1.0    128            on        
ocs-storagecluster-cephfilesystem-metadata 56886               3.0       12288G 0.0000                               4.0     32            on        
ocs-storagecluster-cephfilesystem-data0      158               3.0       12288G 0.0000                               1.0     32            on        
sh-4.4# ceph osd df output
Error EINVAL: you must specify both 'filter_by' and 'filter'
sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS 
 4   ssd 2.00000  1.00000  2 TiB 3.3 GiB 2.3 GiB  0 B 1 GiB 2.0 TiB 0.16 1.03  94     up 
 2   ssd 2.00000  1.00000  2 TiB 3.1 GiB 2.1 GiB  0 B 1 GiB 2.0 TiB 0.15 0.97  98     up 
 1   ssd 2.00000  1.00000  2 TiB 3.0 GiB 2.0 GiB  0 B 1 GiB 2.0 TiB 0.15 0.95  96     up 
 3   ssd 2.00000  1.00000  2 TiB 3.4 GiB 2.4 GiB  0 B 1 GiB 2.0 TiB 0.16 1.05  96     up 
 0   ssd 2.00000  1.00000  2 TiB 3.1 GiB 2.1 GiB  0 B 1 GiB 2.0 TiB 0.15 0.95  92     up 
 5   ssd 2.00000  1.00000  2 TiB 3.4 GiB 2.4 GiB  0 B 1 GiB 2.0 TiB 0.16 1.05 100     up 
                    TOTAL 12 TiB  19 GiB  13 GiB  0 B 6 GiB  12 TiB 0.16                 
MIN/MAX VAR: 0.95/1.05  STDDEV: 0.01

Comment 41 Warren 2021-02-16 21:12:48 UTC
On vSphere, I added three OSDs to a 3 OSD cluster and got the following results:

ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS 
 2       2.00000  1.00000  2 TiB 1.3 GiB 347 MiB  0 B 1 GiB 2.0 TiB 0.07 0.99 133     up 
 5   hdd 2.00000  1.00000  2 TiB 1.4 GiB 366 MiB  0 B 1 GiB 2.0 TiB 0.07 1.00 127     up 
 0   hdd 2.00000  1.00000  2 TiB 1.5 GiB 480 MiB  0 B 1 GiB 2.0 TiB 0.07 1.08 146     up 
 3   hdd 2.00000  1.00000  2 TiB 1.3 GiB 341 MiB  0 B 1 GiB 2.0 TiB 0.07 0.98 128     up 
 1   hdd 2.00000  1.00000  2 TiB 1.3 GiB 314 MiB  0 B 1 GiB 2.0 TiB 0.06 0.96 127     up 
 4   hdd 2.00000  1.00000  2 TiB 1.3 GiB 347 MiB  0 B 1 GiB 2.0 TiB 0.07 0.99 155     up 
                    TOTAL 12 TiB 8.1 GiB 2.1 GiB  0 B 6 GiB  12 TiB 0.07                 
MIN/MAX VAR: 0.96/1.08  STDDEV: 0.00

Is the range from 127 to 155 pgs per osd considered balanced?

Comment 42 Warren 2021-02-17 06:53:21 UTC
(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep ceph-osd-
rook-ceph-osd-0-6b9cbf8bc-h954h                                   2/2     Running     0          10h
rook-ceph-osd-1-bb68456bb-ssp6r                                   2/2     Running     0          10h
rook-ceph-osd-2-6f87d6fdcc-8sknc                                  2/2     Running     0          10h
rook-ceph-osd-3-69f8fd65d9-nhcv5                                  2/2     Running     0          9h
rook-ceph-osd-4-6fffbb8c46-btcdz                                  2/2     Running     0          9h
rook-ceph-osd-5-6bd4654c58-2pc5x                                  2/2     Running     0          9h
rook-ceph-osd-6-5b4d8c9595-8r6f6                                  0/2     Pending     0          25m
rook-ceph-osd-7-57df9c4b4-xwdrg                                   0/2     Pending     0          25m
rook-ceph-osd-8-8bcf9447b-fhmrk                                   0/2     Pending     0          25m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0r9rr2-475cs           0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-0-data-1g2jp2-q5m88           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-0-data-24tn86-cnf7h           0/1     Completed   0          25m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0z54ww-kjh7g           0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-1-data-1g7qfw-gn2dn           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-1-data-2xjzmk-zpdbw           0/1     Completed   0          25m
rook-ceph-osd-prepare-ocs-deviceset-2-data-069r9t-xlscm           0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-2-data-1fhmtq-m9rgm           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-2-data-2zmc4m-729jf           0/1     Completed   0          25m
(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep ceph-osd-  | egrep "Running|Pending" | sed -e 's/ .*//'
rook-ceph-osd-0-6b9cbf8bc-h954h
rook-ceph-osd-1-bb68456bb-ssp6r
rook-ceph-osd-2-6f87d6fdcc-8sknc
rook-ceph-osd-3-69f8fd65d9-nhcv5
rook-ceph-osd-4-6fffbb8c46-btcdz
rook-ceph-osd-5-6bd4654c58-2pc5x
rook-ceph-osd-6-5b4d8c9595-8r6f6
rook-ceph-osd-7-57df9c4b4-xwdrg
rook-ceph-osd-8-8bcf9447b-fhmrk
(venv) wusui@localhost:~/ocs-ci$ for  i in `cat /tmp/foo`; do         oc -n openshift-storage describe pod $i | grep topology-location; done

              topology-location-host=ocs-deviceset-1-data-0z54ww
              topology-location-rack=rack1
              topology-location-root=default

              topology-location-host=ocs-deviceset-0-data-0r9rr2
              topology-location-rack=rack2
              topology-location-root=default
              topology-location-host=ocs-deviceset-2-data-069r9t
              topology-location-rack=rack0
              topology-location-root=default
              topology-location-host=ocs-deviceset-1-data-1g7qfw
              topology-location-rack=rack1
              topology-location-root=default
              topology-location-host=ocs-deviceset-0-data-1g2jp2
              topology-location-rack=rack2
              topology-location-root=default
              topology-location-host=ocs-deviceset-2-data-1fhmtq
              topology-location-rack=rack0
              topology-location-root=default
                topology-location-host=ocs-deviceset-2-data-2zmc4m
                topology-location-rack=rack0
                topology-location-root=default
                topology-location-host=ocs-deviceset-0-data-24tn86
                topology-location-rack=rack2
                topology-location-root=default
                topology-location-host=ocs-deviceset-1-data-2xjzmk
                topology-location-rack=rack1
                topology-location-root=default

So the above output shows what I saw after I added three OSDs, waited some number of hours, and added three more OSDs.

The first three added OSDs appear to be part of the cluster. The next three are still in the Pending state, which is causing Ceph to not be healthy.

The OSDs are evenly allocated on the nodes, so that appears to be correct. However, the new OSDs are not Running.

Is this a separate bug to be reported? And this verifies that the new OSDs are evenly distributed, even if faulty. So has the actual gist of this change been verified, and are we running into another issue?

I am asking for more info for both of these questions.

Comment 43 Kesavan 2021-02-17 09:33:00 UTC
On looking into the cluster, the newly added OSDs moved to the Pending state because of insufficient memory. We probably need to run on a cluster with higher specs when adding multiple OSDs.

0/6 nodes are available: 3 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Comment 44 Warren 2021-02-17 11:19:13 UTC
According to Kesavan, the Pending status is due to the fact that we are out of available memory on this cluster.

This is the expected behavior in this case.

Since the distribution of the added OSDs is even among all of the nodes, I am marking this as verified.

Comment 47 errata-xmlrpc 2021-05-19 09:14:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 48 Red Hat Bugzilla 2023-09-15 00:30:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

