Created attachment 1671097 [details]
PGs per OSD

Description of problem:
While conducting dry runs for bare-metal capacity scaling, I observed that when transitioning from 3 to 12 OSDs, PGs were not re-adjusted across all of the new OSDs. As testing continued, the OSDs with more PGs filled up before the others, resulting in uneven distribution and an incorrect "full" status that prevents further data from being written, ultimately resulting in a failed test.

Version-Release number of selected component (if applicable):
quay.io/ocs-dev/ocs-registry:latest
ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)

How reproducible:
Every time

Steps to Reproduce:
Deploy OCP on a bare-metal env
Deploy OCS with 3 OSDs
Modify device set and increase number of OSDs, choose

Actual results:

ceph status
  cluster:
    id:     c8a5852b-6035-4333-b2db-d4498e85c2a6
    health: HEALTH_ERR
            1 backfillfull osd(s)
            1 full osd(s)
            7 pool(s) full

  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 104m)
    mgr: a(active, since 104m)
    mds: example-storagecluster-cephfilesystem:1 {0=example-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 12 up (since 90m), 12 in (since 90m)

  data:
    pools:   7 pools, 672 pgs
    objects: 977.60k objects, 3.7 TiB
    usage:   11 TiB used, 16 TiB / 28 TiB avail
    pgs:     672 active+clean

  io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

pg balancer status
{
    "last_optimize_duration": "0:00:00.010144",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "last_optimize_started": "Tue Mar 17 23:19:12 2020"
}

Expected results:
Expect ceph to rebalance PGs as additional OSDs are added to the cluster.

Additional info:
Attached two graphs that depict per-OSD PG counts and per-OSD capacity. Installation and adding OSDs was performed following this guide:
https://docs.google.com/document/d/1AqFw3ylCfZ7vxq-63QVbGPHESPW-xep4reGVS0mPgXQ/edit?usp=sharing
Created attachment 1671098 [details]
Per OSD Capacity

Observe that OSDs 0, 1, 2, and 11 have significantly higher used capacity than the remaining OSDs. At ~20:00 these OSDs reached their full limit (total capacity is ~2.2 TiB) and prevented additional writes from occurring on the cluster.
Comment on attachment 1671097 [details] PGs per OSD added additional OSDs at ~18:37, initial PGs were allocated but balancing did not occur.
14.2.8 would be post-4.3 - that's the basis of RHCS 4.1, which is still in development. As noted in email, the pg counts are lower than they should be. Which pool(s) were you filling up? ocs-operator sets up the target ratio for rbd and cephfs data pools [1]. If you're creating additional pools, you'll need to set a target ratio or size for them so the autoscaler can set the pg count appropriately. You can see the target size ratios for each pool in 'ceph osd pool ls detail'.
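For reference, a minimal sketch of how a target ratio or size could be set on an additional pool so the autoscaler can pick an appropriate pg count (the pool name "my-extra-pool" is hypothetical, only the default pools exist in this cluster):

  # inspect current pools and autoscaler state
  ceph osd pool ls detail
  ceph osd pool autoscale-status

  # hypothetical extra pool: give it either a share of the cluster ...
  ceph osd pool set my-extra-pool target_size_ratio 0.2
  # ... or an absolute expected size in bytes (1 TiB here)
  ceph osd pool set my-extra-pool target_size_bytes 1099511627776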
My cluster is no longer responding, but I was using the default pools provided by the storage cluster (example-storagecluster-cephblockpool). Below is the previously recorded pg autoscaling status and ceph osd df output.

oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph osd pool autoscale-status
 POOL                                                  SIZE  TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 example-storagecluster-cephblockpool                 3801G               3.0   28317G        0.4027  0.4900        1.0   256                 on
 example-storagecluster-cephobjectstore.rgw.control       0               3.0   28317G        0.0000                1.0   32                  on
 example-storagecluster-cephobjectstore.rgw.log           0               3.0   28317G        0.0000                1.0   32                  on
 .rgw.root                                                0               3.0   28317G        0.0000                1.0   32                  on
 example-storagecluster-cephfilesystem-metadata        2286               3.0   28317G        0.0000                4.0   32                  on
 example-storagecluster-cephfilesystem-data0              0               3.0   28317G        0.0000  0.4900        1.0   256                 on
 example-storagecluster-cephobjectstore.rgw.meta          0               3.0   28317G        0.0000                1.0   32                  on

[root@f03-h29-000-r620 test-files]# oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 2   ssd 2.91100  1.00000 2.9 TiB 2.0 TiB 2.0 TiB  71 KiB 3.6 GiB 886 GiB 70.28 1.74 341     up
11   ssd 2.91100  1.00000 2.9 TiB 1.7 TiB 1.7 TiB  32 KiB 3.0 GiB 1.2 TiB 57.47 1.42 319     up
 0   ssd 2.18320  1.00000 2.2 TiB 1.9 TiB 1.9 TiB  43 KiB 3.2 GiB 333 GiB 85.11 2.11 329     up
 1   ssd 2.18320  1.00000 2.2 TiB 1.9 TiB 1.9 TiB  52 KiB 3.2 GiB 330 GiB 85.22 2.11 327     up
 3   ssd 2.18320  1.00000 2.2 TiB 551 GiB 549 GiB  20 KiB 1.3 GiB 1.6 TiB 24.63 0.61 103     up
 4   ssd 2.18320  1.00000 2.2 TiB 397 GiB 396 GiB  27 KiB 1.0 GiB 1.8 TiB 17.75 0.44  97     up
 5   ssd 2.18320  1.00000 2.2 TiB 539 GiB 538 GiB  12 KiB 1.2 GiB 1.7 TiB 24.12 0.60  98     up
 6   ssd 2.18320  1.00000 2.2 TiB 474 GiB 473 GiB  20 KiB 1.2 GiB 1.7 TiB 21.21 0.53  78     up
 7   ssd 2.18320  1.00000 2.2 TiB 417 GiB 416 GiB  19 KiB 1.0 GiB 1.8 TiB 18.65 0.46  80     up
 8   ssd 2.18320  1.00000 2.2 TiB 461 GiB 460 GiB  12 KiB 1.0 GiB 1.7 TiB 20.63 0.51  78     up
 9   ssd 2.18320  1.00000 2.2 TiB 493 GiB 492 GiB  20 KiB 1.2 GiB 1.7 TiB 22.06 0.55  90     up
10   ssd 2.18320  1.00000 2.2 TiB 478 GiB 477 GiB  22 KiB 1.2 GiB 1.7 TiB 21.40 0.53  76     up
                    TOTAL  28 TiB  11 TiB  11 TiB 354 KiB  22 GiB  16 TiB 40.35
Doing the math again, the number of pgs does match the target ratio:

12 OSDs * 100 PGs per OSD / 3 replicas * 0.49 target ratio = 196 -> rounded up to the next power of two, 256 pgs

When you reproduce, can you attach a pg dump and a binary osdmap (ceph osd getmap -o /tmp/osdmap)? That will let us see the distribution of pgs per pool, and check the balancer's behavior.
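As a reference sketch, the requested diagnostics could be collected via the toolbox pod along these lines (the pod name is copied from comment #5; output file names are arbitrary):

  # text pg dump, redirected to a local file
  oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph pg dump > pg_dump.txt
  # write the binary osdmap inside the toolbox pod, then copy it out
  oc rsh rook-ceph-tools-7f96779fb9-48c6h ceph osd getmap -o /tmp/osdmap
  oc -n openshift-storage cp rook-ceph-tools-7f96779fb9-48c6h:/tmp/osdmap ./osdmap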
not 4.3 material
Retested with quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc2 and the same issue occurred.

Evaluated the cluster configuration with Ben England and observed that I had non-uniform racks: Rack0 had 3 hosts, Rack1 had 3 hosts, Rack2 had 2 hosts. We suspect that because each rack did not have the same number of hosts, OCS balanced OSDs/PGs improperly.

When deploying OSDs, OCS deployed them as below:

Rack 0
-host 0
--osd0
-host 1
--osd1
Rack 1
-host 2
--osd2
-host 3
--osd11
Rack 2
-host4
--osd3
--osd4
--osd5
--osd6
-host5
--osd7
--osd8
--osd9
--osd10

Is it expected that users have the same number of hosts/capacity in each rack? Is this just a configuration error, or still a functional issue?
(In reply to acalhoun from comment #8) > Retested with quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc2 and the same > issue occured. > > Evaluated cluster Configuration with Ben England and observed that I had non > uniform Racks that is Rack0 had 3host, Rack1 had 3host, Rack2 had 2host. We > suspected that because each rack did not have the same number of host, OCS > had improperly balanced OSDS/PGs. > > When deploying OSDs, OCS deployed them as below: > > Rack 0 > -host 0 > --osd0 > -host 1 > --osd1 > Rack 1 > -host 2 > --osd2 > -host 3 > --osd11 > Rack 2 > -host4 > --osd3 > --osd4 > --osd5 > --osd6 > -host5 > --osd7 > --osd8 > --osd9 > --osd10 > > Is it expected that users have the same number of host/capacity for each > Rack, Is this just a configuration error or still a functional issue? If crush is configured to split across racks, then it's expected to have similar capacity in each, otherwise you may not be able to use the full capacity or balance data appropriately. Same for splitting across hosts. Without this, crush constrains how balanced the cluster can be. For example, if you have two hosts with 5x8 TB osds and one host with 1x8 TB osd you can only use up to 10TB of 3x replicated space, and the single osd host will have 5x the pgs/osd as the other two. For maximum parallelism and performance, you need equally sized hosts and racks. I'm not sure how ocs/rook/ocp is controlling scheduling of osds onto hosts/racks - Seb where is this controlled?
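For anyone reproducing this, a quick sanity check of how crush sees the topology and how PGs ended up per failure domain could look something like the following (read-only commands, run from the toolbox as in the earlier comments):

  # show the crush hierarchy (root -> rack -> host -> osd) and bucket weights
  ceph osd tree
  # same hierarchy with utilization and PG counts per OSD, summed per bucket
  ceph osd df tree
  # dump the crush rules to see which failure domain each pool splits across
  ceph osd crush rule dump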
Hi acalhoun, Is this a bare metal only issue?
Tamir,

Based on my testing I haven't observed this issue on AWS, although the rack topology setup differs between AWS and BM. I believe that on AWS the rack aligns with availability zones, while on BM it is a rack in the traditional sense.

Overall, I believe Josh's assessment is correct about the poor distribution being due to the variation in rack/host capacity. I am surprised at how significant the effect is, but I'm not sure whether this is "okay" or whether changes are necessary.

I re-ran this test with balanced racks (2 hosts per rack, 4 racks total), although device sizes varied from 2.2 TiB to 2.9 TiB, and this difference still resulted in high variation in data distribution and an incorrect full status at 12 OSDs.
@Josh, the kube scheduler is responsible for this. OCS labels the nodes and passes that selector to Rook, which then creates the resources and leaves the rest to Kubernetes. We need 1.18 to get proper placement with topologySpreadConstraints; this is already tracked upstream: https://github.com/rook/rook/issues/4387
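For context, a minimal sketch of the kind of constraint referred to above (Kubernetes topologySpreadConstraints, beta as of 1.18). The label keys and values here are illustrative only, not the spec Rook actually emits:

  cat <<'EOF'
  spec:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname   # spread OSD pods evenly across hosts
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: rook-ceph-osd
  EOF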
We noticed that Kubernetes didn't have anything like topologySpreadConstraints when we were doing failure-domain testing of 4.2. The resulting scheme was to have storageDeviceSets in the storageCluster, and then build a cephCluster storageClassDeviceSets list with a set for each failure domain (rack or zone). If rack or zone failure-domain labels didn't exist, we'd create virtual 'racks' (for VMware, because their cloud provider didn't lend the ability to surface failure-domain data to OCP, at least in 4.2). Pools would use rack or zone as the failure domain to decluster replicas.

The UI for 4.2 would surface the failure-domain information, and when you "picked" nodes you needed to pick an equal number from each failure domain, and have at least 3 failure domains.

With bare metal, we don't use the UI, and the docs don't have "here be dragons" text when we instruct the user to add OCS labels to hosts. The result is that someone might end up labeling hosts in 2 racks, with an unbalanced number of nodes in each. Because the rack label exists, we can't create virtual ones. Since there are < 3 rack failure domains, we fall back to host, and we presumably have a single set in storageClassDeviceSets. In this scenario, there is no way to ensure OSDs are balanced across hosts.

Docs should probably add text that tells people not to point a gun at their foot when labeling hosts on bare metal. Second, it might be worthwhile to have the OCS Operator use a crush bucket type between "rack" and "host" for "virtual racks" if the rack/zone failure-domain count is less than three.

I haven't thought about how topologySpreadConstraints would change the strategy.
Also, since we have a bug where the PVC ID is used for the host bucket name in crush, even when portable: false, we can't even guarantee replicas are on distinct hosts!
https://bugzilla.redhat.com/show_bug.cgi?id=1816820
@Kyle Besides the potential "portable: false" issue, are you seeing any reason we couldn't solve these issues with documentation for bare metal? Or the concern is basically how we can help the user not shoot themselves in the foot while following the documentation?
I'm doing this in parallel with Annette. Basically, after folks label their nodes, we're going to have them run:

oc get nodes -L failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/rack -l cluster.ocs.openshift.io/openshift-storage=''

NAME                                         STATUS   ROLES    AGE     VERSION   ZONE         RACK
ip-10-0-128-167.us-west-2.compute.internal   Ready    worker   41h     v1.16.2   us-west-2a
ip-10-0-133-93.us-west-2.compute.internal    Ready    worker   5d15h   v1.16.2   us-west-2a
ip-10-0-159-206.us-west-2.compute.internal   Ready    worker   5d15h   v1.16.2   us-west-2b
ip-10-0-172-122.us-west-2.compute.internal   Ready    worker   5d15h   v1.16.2   us-west-2c

to verify that they do in fact have an even number of nodes in at least 3 distinct racks or zones.

Personally, I'd set the minimum at closing #1816820 combined with a docs fix along the lines of the above, and then look into the viability of switching to a new crush bucket type between host and rack/zone as our "virtual rack" for 4.4.
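A small addition that could make the check easier to eyeball is counting labeled nodes per failure domain; a sketch, assuming the same labels as above (swap rack for zone as appropriate; nodes missing the label will just show their version string in the count):

  oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' \
    -L failure-domain.beta.kubernetes.io/rack --no-headers \
    | awk '{print $NF}' | sort | uniq -c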
(In reply to leseb from comment #12)
> @Josh, the kube scheduler is responsible for this, OCS labels the nodes and
> passes that selector to Rook which then creates the resources and leave the
> rest to Kubernetes.

@Seb, not 100% correct, afaik. OCS operator does a bit more. See below:

(In reply to Kyle Bader from comment #13)
> We noticed that Kubernetes didn't have a thing like
> topologySpreadConstraints when we were doing failure domain testing of 4.2.
> The resulting scheme was to have a storageDeviceSets in the storageCluster,
> and then build a cephCluster storageClassDeviceSets list with a set for each
> failure-domain (rack or zone). If there rack or zone failure-domain labels
> didn't exist, we'd create virtual 'racks' (for VMware, because their cloud
> provider didn't lend the ability to surface failure-domain data to OCP, at
> least in 4.2). Pools would be use rack or zone as the failure-domain to
> decluster replicas.

@Kyle, I'm a little confused about the "we" here: this sounds like what we did in the development of the OCS operator (see below). Are you referring to that work, or are you describing what you did in testing with manual preparation of the OCP cluster?

> The UI for 4.2 would surfacce the failure-domain
> information, and when you "picked" nodes you needed to pick an equal number
> from each failure domain, and have at least 3 failure domains.
>
> With bare metal, we don't use the UI,

Yeah... My original understanding was that we would only be using the CLI to set up the LSO and PVs for the backend disks, and would use the UI just like normal from there on. Maybe that is not how it ended up.

> and the docs don't have "here be
> dragons" text when we instruct the user to add OCS labels to hosts. The
> result is someone might end up labeling hosts in 2 racks, with an unbalanced
> number of nodes each each. Because the rack label exists, we can't create
> virtual ones. Since there are < 3 failure-domain rack, we fall back to host,
> and we presumably have a single set in storageClassDeviceSets. In this
> scenario, there is no way to ensure OSDs are balanced across hosts.
>
> Docs should probably add text that tells people not to point a gun at their
> foot when labeling hosts on bare metal. Second, it might be worthwhile to
> have the OCS Operator use a crush bucket type between "rack" and "host" for
> "virtual racks" if rack/zone failure-domain count is less than three.

ocs-operator is using "rack" as a virtual zone. This was done for cases where we have < #replica AZs in AWS. In general, if it does not find enough (>= #replica) zone labels on the nodes, it will create #replica rack labels on the nodes, distributing the nodes as evenly as possible across the racks, and it will try to make the distribution of the OSDs among the racks as even as possible, using affinity and anti-affinity settings. It also chops the StorageDeviceSet up into multiple StorageClassDeviceSets.

Within the rack label, the kubernetes scheduler is responsible for placing OSDs, so we will not necessarily have an even distribution of OSDs among nodes within the same rack.

So if, in a bare-metal environment, the admin has already created rack labels, then ocs-operator would honor them and just try to distribute the OSDs among them as evenly as possible. But it is indeed important to spread the storage nodes evenly across the racks. The description here looks as if the hosts have been distributed across the racks well (2 in each rack), so that is fine.
Not sure why the OSDs are not distributed well across racks. Note that all this "magic" would only happen in ocs-operator if no "placement" is explicitly configured in the StorageDeviceSet. See: https://github.com/openshift/ocs-operator/blob/release-4.3/deploy/olm-catalog/ocs-operator/4.3.0/storagecluster.crd.yaml#L110 I am not 100% sure which doc was followed to set this up. - Was the UI not used for setup after the LSO and backend PVs were created? - Was "placement" used? - Can I see the StorageCluster cr that was used? Regarding the introduction of a bucket between rack and host: We are doing pretty much what you are describing, but with racks. We were not aware of any existing level between rack and host, so ended up using rack since ceph knows about it, and we didn't know that OCP would set these labels automatically (like it sets zone labels automatically for AWS AZs...). > I haven't thought about how topologySpreadConstraints would change the > strategy.
As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS (comment #11), marking as a blocker to 4.4
(In reply to Raz Tamir from comment #19) > As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS > (comment #11), marking as a blocker to 4.4 Have you not seen it in AWS ever, or possibly just not when running with at least 3 zones?
This whole BZ is a bit convoluted. I cannot see clearly what the actual problem is that this BZ is about. If I understand it correctly, what is described here is a combination of various aspects and currently mostly works as designed...

(In reply to Josh Durgin from comment #9)
> If crush is configured to split across racks, then it's expected to have
> similar capacity in each, otherwise you may not be able to use the full
> capacity or balance data appropriately. Same for splitting across hosts.
> Without this, crush constrains how balanced the cluster can be. For example,
> if you have two hosts with 5x8 TB osds and one host with 1x8 TB osd you can
> only use up to 10TB of 3x replicated space, and the single osd host will
> have 5x the pgs/osd as the other two. For maximum parallelism and
> performance, you need equally sized hosts and racks.
>
> I'm not sure how ocs/rook/ocp is controlling scheduling of osds onto
> hosts/racks - Seb where is this controlled?

As explained in comment #18,

* The ocs operator either detects a failure domain (zone, corresponding to AWS AZ) or creates one (rack) by labelling nodes into racks artificially.

* The various OSDs should be distributed across the failure domains (rack or zone) as evenly as possible. In particular, we should have roughly the same capacity in each failure domain (zone or rack).
  ==> If this is not the case, then this is a bug.

* Within the failure domain (rack/zone), the distribution is entirely up to the kubernetes scheduler. This is currently NOT done homogeneously across nodes; it will frequently happen that some nodes get many OSDs, some nodes get only a few, and some get none. There is just nothing we can currently do about it, and if Ceph assumes the hosts to be of similar capacity (even if the failure domain is set to rack or zone), then this is just a fact that we have to accept at this point. With Kube 1.18 / OCP 4.6, we will fix this by the use of topologySpreadConstraints.

* There is one possible problem with portable=true on OSDs for bare metal, but it is treated in a separate BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1816820

* Is there an additional problem at all with the pg adjusting? (I really don't know; it just seems that it is a combination of the above...)
  ==> @Josh?
I think the google doc you're following is outdated. It specifically mentions rack labels in the StorageCluster object [1]. It should not. I don't see a must-gather attached, so I can't check whether that's the exact cause. I think it was corrected in the doc review process [2].

[1] https://docs.google.com/document/d/1AqFw3ylCfZ7vxq-63QVbGPHESPW-xep4reGVS0mPgXQ/edit
[2] Step 1.2, ctrl+f "kind: storagecluster": https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/deploying_openshift_container_storage/index?lb_target=preview#installing-openshift-container-storage-using-local-storage-devices_rhocs
> * The various OSDs should be distributed across the failure domain (rack or > zone) as evenly as possible. In particular we should roughly the same > capacity in each failure domain (zone or rack). > ==> If this is not the case, then this is a bug. To clarify, specifying the rack labels in the StorageCluster.Spec.StorageDeviceSets would cause the issue Michael mentioned here.
(In reply to Rohan CJ from comment #23)
> > * The various OSDs should be distributed across the failure domain (rack or
> > zone) as evenly as possible. In particular we should roughly the same
> > capacity in each failure domain (zone or rack).
> > ==> If this is not the case, then this is a bug.
>
> To clarify, specifying the rack labels in the
> StorageCluster.Spec.StorageDeviceSets would cause the issue Michael
> mentioned here.

Hmm, I don't understand. It is perfectly fine if rack labels already exist. In this case ocs-operator is going to honour them.

ocs-operator will only create rack labels if nodes are in less than 3 zones and if there are no rack labels already.

If the admin has distributed the nodes into racks with labels, it should just work, and ocs-operator should still make sure to distribute the disks evenly across those racks. If that is not working (the even distribution across racks or zones), it is a bug in ocs-operator that needs to be fixed, I believe.

I'd have to double check whether the expansion scenario is different, though.
(In reply to Michael Adam from comment #21) > * Within the failure domain (rack/zone), the distribution is entirely up > to the kubernetes scheduler. This is currently NOT done homogeneously > across nodes, but it will frequently happen that some nodes get many > OSDs, some nodes get only a few and some get none. > There is just nothing we can currently do about it, and if Ceph assumes > the hosts to be of similar capacity (even if the failure domain is set > to rack or zone), then this is just a fact that we have to accept at this > point. > With Kube 1.18 / OCP 4.6, we will fix this by the use of > topologySpreadConstraints. IMO this BZ should track this problem. > * Is there an additional problem at all with the pg adjusting? > (I really don't know it, it just seems that it is a combination of the > above...) > ==> @Josh? There is no additional problem to my knowledge.
(In reply to Josh Durgin from comment #25) > (In reply to Michael Adam from comment #21) > > * Within the failure domain (rack/zone), the distribution is entirely up > > to the kubernetes scheduler. This is currently NOT done homogeneously > > across nodes, but it will frequently happen that some nodes get many > > OSDs, some nodes get only a few and some get none. > > There is just nothing we can currently do about it, and if Ceph assumes > > the hosts to be of similar capacity (even if the failure domain is set > > to rack or zone), then this is just a fact that we have to accept at this > > point. > > With Kube 1.18 / OCP 4.6, we will fix this by the use of > > topologySpreadConstraints. > > IMO this BZ should track this problem. So why is this bug not targeted to 4.6? > > > * Is there an additional problem at all with the pg adjusting? > > (I really don't know it, it just seems that it is a combination of the > > above...) > > ==> @Josh? > > There is no additional problem to my knowledge.
(In reply to Yaniv Kaul from comment #26) > (In reply to Josh Durgin from comment #25) > > (In reply to Michael Adam from comment #21) > > > * Within the failure domain (rack/zone), the distribution is entirely up > > > to the kubernetes scheduler. This is currently NOT done homogeneously > > > across nodes, but it will frequently happen that some nodes get many > > > OSDs, some nodes get only a few and some get none. > > > There is just nothing we can currently do about it, and if Ceph assumes > > > the hosts to be of similar capacity (even if the failure domain is set > > > to rack or zone), then this is just a fact that we have to accept at this > > > point. > > > With Kube 1.18 / OCP 4.6, we will fix this by the use of > > > topologySpreadConstraints. > > > > IMO this BZ should track this problem. > > So why is this bug not targeted to 4.6? I assumed Michael was waiting for other comments. Since there are none in a week, going ahead with my suggestion to track only this problem with this BZ, and retitling/moving as appropriate. If anyone wants to track another issue, please open a separate BZ.
(In reply to Michael Adam from comment #20)
> (In reply to Raz Tamir from comment #19)
> > As BM is going to be GAed in 4.4 and this issue wasn't observed in AWS
> > (comment #11), marking as a blocker to 4.4
>
> Have you not seen it in AWS ever, or possibly just not when running with at
> least 3 zones?

Not that I'm aware of. We are checking OSD distribution, and I remember we had a few bugs, but nothing new on AWS.
@Rajat confirmed that the API is available from 4.5 onwards. We're okay with using the beta API. We're aiming to land this in 4.6.
Removing the blocker flag which was added in 4.4
This is now an epic for 4.7: https://issues.redhat.com/browse/KNIP-1512 ==> moving to 4.7
Hi Kesavan,

Are the fixes for https://bugzilla.redhat.com/show_bug.cgi?id=1817438 and this bug inter-related, or do they need to be verified separately?
Hey Neha, the two bugs need to be verified separately, as one of them is a bare-metal scenario and the other one is AWS. The topology spread domain for bare metal is rack and for AWS it is the AZ (zone).
adding pm-ack, (which was not given automatically b/c of the [RFE] tag), since this is an approved epic for 4.7
*** Bug 1778216 has been marked as a duplicate of this bug. ***
On AWS, the pg distribution is in the range of 92 to 100, which is better than the previous 32 to 256. I will try to do this on vSphere on Wednesday.

====================================================================================================
(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep osd
rook-ceph-osd-0-7dc45754fc-8w5vs                          2/2     Running     0          7h53m
rook-ceph-osd-1-588f9fdf9-t8v4d                           2/2     Running     0          7h53m
rook-ceph-osd-2-779d9c795b-bxjdk                          2/2     Running     0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0x52qn-zq8gc   0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-0-data-1xc52x-2tcws   0/1     Init:0/2    0          8s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fw9zb-qkmzd   0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-1-data-16cg8r-9vxjv   0/1     Init:0/2    0          7s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0l8zh2-gd8st   0/1     Completed   0          7h53m
rook-ceph-osd-prepare-ocs-deviceset-2-data-1h46wn-g6wqk   0/1     Init:0/2    0          7s

(venv) wusui@localhost:~/ocs-ci$ sleep 120; !!
sleep 120; oc -n openshift-storage get pods | grep osd
rook-ceph-osd-0-7dc45754fc-8w5vs                          2/2     Running     0          7h55m
rook-ceph-osd-1-588f9fdf9-t8v4d                           2/2     Running     0          7h55m
rook-ceph-osd-2-779d9c795b-bxjdk                          2/2     Running     0          7h55m
rook-ceph-osd-3-58577cf8c5-przg2                          2/2     Running     0          106s
rook-ceph-osd-4-799f945b7f-zwrp5                          2/2     Running     0          105s
rook-ceph-osd-5-856545cfc7-7bspr                          2/2     Running     0          103s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0x52qn-zq8gc   0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-0-data-1xc52x-2tcws   0/1     Completed   0          2m14s
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fw9zb-qkmzd   0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-1-data-16cg8r-9vxjv   0/1     Completed   0          2m13s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0l8zh2-gd8st   0/1     Completed   0          7h55m
rook-ceph-osd-prepare-ocs-deviceset-2-data-1h46wn-g6wqk   0/1     Completed   0          2m13s
====================================================================================================
sh-4.4# ceph status
  cluster:
    id:     467e00f5-3885-4fb5-949e-6f3eef7d40a1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 8h)
    mgr: a(active, since 8h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 18m), 6 in (since 18m)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 192 pgs
    objects: 1.33k objects, 4.4 GiB
    usage:   18 GiB used, 12 TiB / 12 TiB avail
    pgs:     192 active+clean

  io:
    client: 853 B/s rd, 124 KiB/s wr, 1 op/s rd, 1 op/s wr

sh-4.4# ceph osd pool autoscale-status
 POOL                                         SIZE   TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 ocs-storagecluster-cephblockpool             4513M               3.0   12288G        0.0011  0.4900        1.0000           1.0   128                 on
 ocs-storagecluster-cephfilesystem-metadata   56886               3.0   12288G        0.0000                                 4.0   32                  on
 ocs-storagecluster-cephfilesystem-data0        158               3.0   12288G        0.0000                                 1.0   32                  on

sh-4.4# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS
 4   ssd 2.00000  1.00000 2 TiB 3.3 GiB 2.3 GiB  0 B 1 GiB 2.0 TiB 0.16 1.03  94     up
 2   ssd 2.00000  1.00000 2 TiB 3.1 GiB 2.1 GiB  0 B 1 GiB 2.0 TiB 0.15 0.97  98     up
 1   ssd 2.00000  1.00000 2 TiB 3.0 GiB 2.0 GiB  0 B 1 GiB 2.0 TiB 0.15 0.95  96     up
 3   ssd 2.00000  1.00000 2 TiB 3.4 GiB 2.4 GiB  0 B 1 GiB 2.0 TiB 0.16 1.05  96     up
 0   ssd 2.00000  1.00000 2 TiB 3.1 GiB 2.1 GiB  0 B 1 GiB 2.0 TiB 0.15 0.95  92     up
 5   ssd 2.00000  1.00000 2 TiB 3.4 GiB 2.4 GiB  0 B 1 GiB 2.0 TiB 0.16 1.05 100     up
                    TOTAL 12 TiB  19 GiB  13 GiB 0 B 6 GiB  12 TiB 0.16
MIN/MAX VAR: 0.95/1.05  STDDEV: 0.01
On vSphere, I added three OSDs to a 3-OSD cluster and got the following results:

ID CLASS WEIGHT  REWEIGHT SIZE  RAW USE DATA    OMAP META  AVAIL   %USE VAR  PGS STATUS
 2       2.00000  1.00000 2 TiB 1.3 GiB 347 MiB  0 B 1 GiB 2.0 TiB 0.07 0.99 133     up
 5   hdd 2.00000  1.00000 2 TiB 1.4 GiB 366 MiB  0 B 1 GiB 2.0 TiB 0.07 1.00 127     up
 0   hdd 2.00000  1.00000 2 TiB 1.5 GiB 480 MiB  0 B 1 GiB 2.0 TiB 0.07 1.08 146     up
 3   hdd 2.00000  1.00000 2 TiB 1.3 GiB 341 MiB  0 B 1 GiB 2.0 TiB 0.07 0.98 128     up
 1   hdd 2.00000  1.00000 2 TiB 1.3 GiB 314 MiB  0 B 1 GiB 2.0 TiB 0.06 0.96 127     up
 4   hdd 2.00000  1.00000 2 TiB 1.3 GiB 347 MiB  0 B 1 GiB 2.0 TiB 0.07 0.99 155     up
                    TOTAL 12 TiB 8.1 GiB 2.1 GiB 0 B 6 GiB  12 TiB 0.07
MIN/MAX VAR: 0.96/1.08  STDDEV: 0.00

Is the range from 127 to 155 pgs per osd considered balanced?
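For what it's worth, a couple of read-only commands that can help judge this from the same toolbox environment (a sketch): the VAR/STDDEV summary of ceph osd df, plus the balancer's own verdict as shown in the original description.

  # variance / stddev across OSDs is printed at the bottom of:
  ceph osd df
  # ask the balancer whether it still sees something to optimize:
  ceph balancer status
  ceph balancer eval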
(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep ceph-osd-
rook-ceph-osd-0-6b9cbf8bc-h954h                           2/2     Running     0          10h
rook-ceph-osd-1-bb68456bb-ssp6r                           2/2     Running     0          10h
rook-ceph-osd-2-6f87d6fdcc-8sknc                          2/2     Running     0          10h
rook-ceph-osd-3-69f8fd65d9-nhcv5                          2/2     Running     0          9h
rook-ceph-osd-4-6fffbb8c46-btcdz                          2/2     Running     0          9h
rook-ceph-osd-5-6bd4654c58-2pc5x                          2/2     Running     0          9h
rook-ceph-osd-6-5b4d8c9595-8r6f6                          0/2     Pending     0          25m
rook-ceph-osd-7-57df9c4b4-xwdrg                           0/2     Pending     0          25m
rook-ceph-osd-8-8bcf9447b-fhmrk                           0/2     Pending     0          25m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0r9rr2-475cs   0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-0-data-1g2jp2-q5m88   0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-0-data-24tn86-cnf7h   0/1     Completed   0          25m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0z54ww-kjh7g   0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-1-data-1g7qfw-gn2dn   0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-1-data-2xjzmk-zpdbw   0/1     Completed   0          25m
rook-ceph-osd-prepare-ocs-deviceset-2-data-069r9t-xlscm   0/1     Completed   0          10h
rook-ceph-osd-prepare-ocs-deviceset-2-data-1fhmtq-m9rgm   0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-2-data-2zmc4m-729jf   0/1     Completed   0          25m

(venv) wusui@localhost:~/ocs-ci$ oc -n openshift-storage get pods | grep ceph-osd- | egrep "Running|Pending" | sed -e 's/ .*//'
rook-ceph-osd-0-6b9cbf8bc-h954h
rook-ceph-osd-1-bb68456bb-ssp6r
rook-ceph-osd-2-6f87d6fdcc-8sknc
rook-ceph-osd-3-69f8fd65d9-nhcv5
rook-ceph-osd-4-6fffbb8c46-btcdz
rook-ceph-osd-5-6bd4654c58-2pc5x
rook-ceph-osd-6-5b4d8c9595-8r6f6
rook-ceph-osd-7-57df9c4b4-xwdrg
rook-ceph-osd-8-8bcf9447b-fhmrk

(venv) wusui@localhost:~/ocs-ci$ for i in `cat /tmp/foo`; do oc -n openshift-storage describe pod $i | grep topology-location; done
topology-location-host=ocs-deviceset-1-data-0z54ww
topology-location-rack=rack1
topology-location-root=default
topology-location-host=ocs-deviceset-0-data-0r9rr2
topology-location-rack=rack2
topology-location-root=default
topology-location-host=ocs-deviceset-2-data-069r9t
topology-location-rack=rack0
topology-location-root=default
topology-location-host=ocs-deviceset-1-data-1g7qfw
topology-location-rack=rack1
topology-location-root=default
topology-location-host=ocs-deviceset-0-data-1g2jp2
topology-location-rack=rack2
topology-location-root=default
topology-location-host=ocs-deviceset-2-data-1fhmtq
topology-location-rack=rack0
topology-location-root=default
topology-location-host=ocs-deviceset-2-data-2zmc4m
topology-location-rack=rack0
topology-location-root=default
topology-location-host=ocs-deviceset-0-data-24tn86
topology-location-rack=rack2
topology-location-root=default
topology-location-host=ocs-deviceset-1-data-2xjzmk
topology-location-rack=rack1
topology-location-root=default

The above output shows what I saw after I added three OSDs, waited some number of hours, and added three more OSDs. The first three added OSDs appear to be part of the cluster. The next three are still in Pending state and are causing Ceph to not be healthy.

The OSDs are evenly allocated across the racks and nodes, so that appears to be correct. However, the new OSDs are not Running. Is this a separate bug to be reported? And this verifies that the new OSDs are evenly distributed, even if faulty. So has the actual gist of this change been verified, and are we just running into another issue? I am asking for more info on both of these questions.
On looking into the cluster, the newly added OSDs moved to Pending state because of insufficient memory; we probably need to run on a cluster with higher specs when adding multiple OSDs:

0/6 nodes are available: 3 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
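For reference, a sketch of how the scheduling failure above can be confirmed (the pod name is taken from comment #41):

  # the Events section shows the FailedScheduling reason quoted above
  oc -n openshift-storage describe pod rook-ceph-osd-6-5b4d8c9595-8r6f6
  # or watch recent scheduling events for the whole namespace
  oc -n openshift-storage get events --sort-by=.lastTimestamp | grep -i schedul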
According to Kesavan, the Pending status is due to the fact that we are out of available memory on this cluster. This is the expected behavior in this case. Since the distribution of the added OSDs is even among all of the nodes, I am marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days