Bug 1817438 - OSDs not distributed uniformly across OCS nodes on a 9-node AWS IPI setup
Summary: OSDs not distributed uniformly across OCS nodes on a 9-node AWS IPI setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Kesavan
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Duplicates: 1821161
Depends On: 1814681
Blocks:
 
Reported: 2020-03-26 11:02 UTC by Manoj Pillai
Modified: 2023-09-15 00:30 UTC
CC: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:22:27 UTC
Embargoed:




Links: Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:22:58 UTC)

Description Manoj Pillai 2020-03-26 11:02:18 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Following up on https://bugzilla.redhat.com/show_bug.cgi?id=1790500#c47, I'm trying to run performance tests on a 9-node OCS setup with 27 OSDs, on an AWS IPI OCP cluster. However, I end up with some of the OCS nodes having 6 OSDs and others having none.

OCS is installed and managed through the UI, following our official documentation for the initial cluster creation and for the subsequent scaling to 9 nodes and 27 OSDs.

Version of all relevant components (if applicable):
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.1     True        False         46h     Cluster version is 4.3.1

# oc get csv -n openshift-storage
ocs-operator.v4.3.0-376.ci      OpenShift Container Storage   4.3.0-376.ci


Steps to Reproduce:

* create AWS IPI OCP cluster
* oc apply -f deploy-with-olm.yaml (with image: quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc1)
* from here use the UI
* create initial OCS cluster following OCS 4.2 GA documentation, to get 3 OCS nodes with 1 OSD each of 2TiB gp2.
* add 4TiB storage capacity so that we have 3 OCS nodes, each with 3 OSDs of 2TiB gp2.
* scale the setup to 9 nodes using the UI according to steps in the documentation (labelling of nodes done via CLI)
* add 12 TiB storage capacity (to get total 18TiB) and wait for cluster to become ready


Actual results:
* there are now 27 OSDs, as expected, but they are not distributed equally across the OCS nodes. In my latest attempt, 5 OCS nodes have 3 OSDs each, 2 OCS nodes have 6 OSDs each, and 2 OCS nodes have none.
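
The per-node distribution can be checked with a one-liner along these lines (a sketch; it assumes the openshift-storage namespace and the usual rook-ceph-osd pod naming, with NODE being the 7th column of the wide output):

# count OSD pods per node, excluding the osd-prepare jobs
oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare | awk '{print $7}' | sort | uniq -c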

Expected results:
The OSDs should be distributed uniformly across the OCS nodes (3 per node here). At least according to the OCS 4.2 guidelines, the layout above is an unsupported configuration; following our documented steps for cluster scaling should not lead to it.

Additional info:

Stats collected during the performance test show output like this on 2 of the OCS nodes, indicating 6 OSDs receiving IO:

03/26/20 10:06:54
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00    0.00    1.40     0.00    10.60    15.14     0.00    0.50    0.00    0.50   1.00   0.14
dm-0              0.00     0.00    0.00    1.40     0.00    10.60    15.14     0.00    0.43    0.00    0.43   1.00   0.14
loop0             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop1             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop2             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop3             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop4             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme2n1           0.00     3.20 2121.10    2.20  8484.40    22.00     8.01     0.01    0.39    0.39    0.64   0.44  92.41
nvme3n1           0.00     0.00 2391.40    0.00  9566.00     0.00     8.00     0.07    0.39    0.39    0.00   0.39  93.08
nvme4n1           0.00     1.30 3001.40    0.70 12005.60    10.80     8.01     0.63    0.58    0.58    0.57   0.32  95.96
nvme5n1           0.00     0.40 3126.10    0.20 12504.40     4.00     8.00     0.01    0.39    0.39    0.50   0.31  96.60
nvme6n1           0.00     0.00 1900.40    0.00  7601.60     0.00     8.00     0.01    0.36    0.36    0.00   0.47  89.07
nvme7n1           0.00     0.00 2257.70    0.00  9030.80     0.00     8.00     0.01    0.35    0.35    0.00   0.41  92.74
ceph--d9a3f7f4--1ba3--496c--ae03--6c890550b28f-osd--block--e3fe9b48--fd74--47c4--bda1--85cef9ae1068     0.00     0.00 3126.40    0.60 12505.60     4.00     8.00     1.20    0.38    0.38    1.00   0.31  96.59
ceph--02cd8c67--07c0--4381--84fd--75668fb2d240-osd--block--97a6a773--6cb6--43d1--a91e--63eeae4228da     0.00     0.00 2121.40    5.40  8485.60    22.00     8.00     0.83    0.39    0.39    0.69   0.43  92.36
ceph--ee11d629--21e4--419d--8895--c498d94d305e-osd--block--bf34dbdf--6bb6--40b8--9e7c--6768f9f325d3     0.00     0.00 1900.40    0.00  7601.60     0.00     8.00     0.69    0.36    0.36    0.00   0.47  89.07
ceph--8282bf71--a988--4446--9037--7f3bf9b73246-osd--block--0dc8cc5e--86bf--4fcb--9404--0fbb1c7ec120     0.00     0.00 3001.00    2.00 12004.00    10.80     8.00     1.73    0.58    0.58    0.85   0.32  95.98
loop5             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
ceph--c8258555--bac0--4beb--a4a8--5b1f05c5024b-osd--block--00ac521e--35d9--48a4--915a--b35b86d5bd69     0.00     0.00 2257.90    0.00  9031.60     0.00     8.00     0.79    0.35    0.35    0.00   0.41  92.71
ceph--05e81cc2--6a4c--4a40--b9b0--436f441377e3-osd--block--382b31de--1038--4a0f--b46a--fb54af7aed9b     0.00     0.00 2391.40    0.00  9566.00     0.00     8.00     0.93    0.39    0.39    0.00   0.39  93.05
nvme1n1           0.00     2.40    0.00    3.60     0.00    30.00    16.67     0.00    0.56    0.00    0.56   0.92   0.33

Whereas on 2 of the OCS nodes I see no OSDs:

03/26/20 10:06:54
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.60    0.20   23.40     6.00   122.30    10.87     0.00    0.58    0.50    0.58   0.28   0.65
dm-0              0.00     0.00    0.20   24.00     6.00   122.30    10.60     0.02    0.66    0.00    0.67   0.26   0.64

Others show 3 OSDs:

03/26/20 10:06:54
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.20    0.00    3.30     0.00    35.90    21.76     0.00    0.82    0.00    0.82   0.52   0.17
dm-0              0.00     0.00    0.00    3.50     0.00    35.90    20.51     0.00    0.71    0.00    0.71   0.49   0.17
loop0             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop1             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop2             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
loop3             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme2n1           0.00     0.40 3081.10    0.20 12324.40     4.00     8.00     0.28    0.52    0.52    0.50   0.31  97.01
nvme3n1           0.00     0.00 1970.70    0.00  7882.80     0.00     8.00     0.13    0.49    0.49    0.00   0.46  91.36
nvme4n1           0.00     0.00 2233.30    0.00  8933.20     0.00     8.00     0.01    0.41    0.41    0.00   0.42  92.75
ceph--77c3f7bb--4ce8--493f--b3fe--51a8425dbe89-osd--block--504a47c5--105e--4708--9cb0--71af98dfde5e     0.00     0.00 2233.20    0.00  8932.80     0.00     8.00     0.89    0.40    0.40    0.00   0.42  92.72
ceph--63730511--0b48--4b5a--a04f--a747814ff6b0-osd--block--69ff341f--0571--4156--89f5--698cb99bd143     0.00     0.00 1970.70    0.00  7882.80     0.00     8.00     0.94    0.48    0.48    0.00   0.46  91.32
ceph--8fd8ca6b--361a--4faf--a348--de334f07ef40-osd--block--a424782e--f5ac--422e--883f--0281c66b8c70     0.00     0.00 3081.20    0.60 12324.80     4.00     8.00     1.58    0.51    0.51    0.33   0.31  97.03
nvme1n1           0.00     2.40    0.00    3.60     0.00    35.20    19.56     0.00    0.53    0.00    0.53   0.89   0.32

Comment 2 Manoj Pillai 2020-03-26 11:08:11 UTC
ceph_osd_tree:

ID  CLASS WEIGHT   TYPE NAME                                STATUS REWEIGHT PRI-AFF
 -1       53.97281 root default
 -5       53.97281     region us-east-1
-10       17.99094         zone us-east-1a
 -9        1.99899             host ocs-deviceset-1-0-z444z
  2   ssd  1.99899                 osd.2                        up  1.00000 1.00000
-21        1.99899             host ocs-deviceset-1-1-xc6fz
  4   ssd  1.99899                 osd.4                        up  1.00000 1.00000
-25        1.99899             host ocs-deviceset-1-2-vd52m
  6   ssd  1.99899                 osd.6                        up  1.00000 1.00000
-35        1.99899             host ocs-deviceset-1-3-ms5tm
 18   ssd  1.99899                 osd.18                       up  1.00000 1.00000
-29        1.99899             host ocs-deviceset-1-4-7bwqb
 12   ssd  1.99899                 osd.12                       up  1.00000 1.00000
-33        1.99899             host ocs-deviceset-1-5-9fpqz
 13   ssd  1.99899                 osd.13                       up  1.00000 1.00000
-37        1.99899             host ocs-deviceset-1-6-jbbdn
 15   ssd  1.99899                 osd.15                       up  1.00000 1.00000
-31        1.99899             host ocs-deviceset-1-7-hp8fd
 14   ssd  1.99899                 osd.14                       up  1.00000 1.00000
-63        1.99899             host ocs-deviceset-1-8-whv7w
 26   ssd  1.99899                 osd.26                       up  1.00000 1.00000
-14       17.99094         zone us-east-1b
-13        1.99899             host ocs-deviceset-2-0-rwqzz
  1   ssd  1.99899                 osd.1                        up  1.00000 1.00000
-27        1.99899             host ocs-deviceset-2-1-mgxxz
  7   ssd  1.99899                 osd.7                        up  1.00000 1.00000
-23        1.99899             host ocs-deviceset-2-2-5x2pq
  8   ssd  1.99899                 osd.8                        up  1.00000 1.00000
-43        1.99899             host ocs-deviceset-2-3-dcdbd
 16   ssd  1.99899                 osd.16                       up  1.00000 1.00000
-45        1.99899             host ocs-deviceset-2-4-tnk8p
 22   ssd  1.99899                 osd.22                       up  1.00000 1.00000
-57        1.99899             host ocs-deviceset-2-5-scs28
 23   ssd  1.99899                 osd.23                       up  1.00000 1.00000
-55        1.99899             host ocs-deviceset-2-6-cd4hz
 25   ssd  1.99899                 osd.25                       up  1.00000 1.00000
-39        1.99899             host ocs-deviceset-2-7-vltbt
 24   ssd  1.99899                 osd.24                       up  1.00000 1.00000
-49        1.99899             host ocs-deviceset-2-8-m8fxs
 21   ssd  1.99899                 osd.21                       up  1.00000 1.00000
 -4       17.99094         zone us-east-1c
 -3        1.99899             host ocs-deviceset-0-0-rh5xd
  0   ssd  1.99899                 osd.0                        up  1.00000 1.00000
-17        1.99899             host ocs-deviceset-0-1-mdnj7
  5   ssd  1.99899                 osd.5                        up  1.00000 1.00000
-19        1.99899             host ocs-deviceset-0-2-4zc87
  3   ssd  1.99899                 osd.3                        up  1.00000 1.00000
-41        1.99899             host ocs-deviceset-0-3-kftrr
 10   ssd  1.99899                 osd.10                       up  1.00000 1.00000
-59        1.99899             host ocs-deviceset-0-4-8t6dj
 11   ssd  1.99899                 osd.11                       up  1.00000 1.00000
-47        1.99899             host ocs-deviceset-0-5-6ljc7
 17   ssd  1.99899                 osd.17                       up  1.00000 1.00000
-61        1.99899             host ocs-deviceset-0-6-nml8q
 20   ssd  1.99899                 osd.20                       up  1.00000 1.00000
-51        1.99899             host ocs-deviceset-0-7-78jfn
 19   ssd  1.99899                 osd.19                       up  1.00000 1.00000
-53        1.99899             host ocs-deviceset-0-8-5x8n4
  9   ssd  1.99899                 osd.9                        up  1.00000 1.00000

Comment 3 Michael Adam 2020-03-26 11:35:05 UTC
@Manoj, within a zone, the scheduling of OSDs across nodes is up to the Kubernetes scheduler. We currently have no control over it.

There is an upcoming Kubernetes feature (topology spread constraints) that will lead to even distribution as a result of scheduling. Currently the behaviour is rather the opposite, or at least more random.

Comment 6 Jose A. Rivera 2020-04-01 14:00:57 UTC
This is dependent on an upcoming feature in Kubernetes 1.18, which will be the base for OCP 4.5. Rook-Ceph will also be implementing this for OCS 4.5, so we just need to update the StorageDevice placement in ocs-operator to use this new feature. Moving to OCS 4.5.
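
For reference, the Kubernetes feature adds a topologySpreadConstraints stanza to the pod spec, roughly along these lines (a generic sketch of the upstream API, not the exact placement the operators will generate; the app=rook-ceph-osd selector is the usual Rook label and is an assumption here):

topologySpreadConstraints:
- maxSkew: 1                          # at most 1 more matching pod on any one node than on another
  topologyKey: kubernetes.io/hostname # spread across individual nodes, not just zones
  whenUnsatisfiable: ScheduleAnyway   # soft constraint; DoNotSchedule would make it a hard requirement
  labelSelector:
    matchLabels:
      app: rook-ceph-osd              # pods counted when computing the skew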

Comment 7 Michael Adam 2020-04-07 12:05:16 UTC
*** Bug 1821161 has been marked as a duplicate of this bug. ***

Comment 22 Elad 2020-07-02 11:00:36 UTC
Hi, this bug was already acked for 4.5, as a blocker. Can we please have a discussion first before pushing it out?
Eran, Michael, please review

Comment 23 Michael Adam 2020-07-02 12:17:53 UTC
Yeah, this was approved for 4.5, but prematurely.

As José said, it's a feature.
It is also effectively the same as this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1814681

There are design discussions going on about this.
As José said, the basic feature is there in k8s/OCP, but making proper use of it in rook/ocs-operator, in a way that solves our problem and is upgrade-safe, needs much more discussion than a simple bugfix. Major work will be required in Rook.

This will also fix https://bugzilla.redhat.com/show_bug.cgi?id=1776562

Comment 24 Michael Adam 2020-07-02 12:20:23 UTC
So:

1) I agree with José's assessment

2) I agree that this should not just have been moved but first discussed with you.

Comment 28 Eran Tamir 2020-07-06 07:46:26 UTC
I'm not sure why it's a blocker and why it's a serious scalability issue. I understand that it would be great if we could spread them, but as long as we get the resources, we should be fine; hence we trust K8s scheduling. I don't see a problem with pushing it to 4.6.

Comment 32 Mudit Agarwal 2020-09-29 16:10:11 UTC
This is still dependent on the "topology spread constraints" feature, which is targeted for 4.7.

Comment 35 Jose A. Rivera 2020-10-05 14:15:13 UTC
The PR for topology spread constraints has merged.

Comment 36 Kesavan 2020-10-06 08:10:17 UTC
This bug has been fixed in OCS 4.7 by topology spread constraints, and the corresponding PR has been merged.

Note: Topology spread constraints are supported on OCP 4.6+.

Tested in OCP 4.6 by following the "Steps to Reproduce"; the OSDs were distributed uniformly across the OCS nodes on a 9-node AWS setup (3 OSDs each).
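
One way to confirm that the OSD deployments actually carry the new constraints (a sketch; it assumes the standard app=rook-ceph-osd label on the OSD deployments):

oc -n openshift-storage get deploy -l app=rook-ceph-osd -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.topologySpreadConstraints}{"\n"}{end}'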


Worker nodes:
oc get nodes | grep worker
ip-10-0-131-148.ec2.internal   Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty
ip-10-0-149-130.ec2.internal   Ready    worker   144m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-158-252.ec2.internal   Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty
ip-10-0-161-65.ec2.internal    Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty
ip-10-0-165-63.ec2.internal    Ready    worker   144m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-185-7.ec2.internal     Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty
ip-10-0-193-103.ec2.internal   Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty
ip-10-0-212-165.ec2.internal   Ready    worker   144m   v1.19.0-rc.2+99cb93a-dirty
ip-10-0-213-134.ec2.internal   Ready    worker   69m    v1.19.0-rc.2+99cb93a-dirty


OSD pod placement:
oc get pods -owide | grep osd
rook-ceph-osd-0-6d57d4dc47-j92nf                                  1/1     Running     0          94m     10.131.0.32    ip-10-0-149-130.ec2.internal   <none>           <none>
rook-ceph-osd-1-5dc5fd969d-fs7p7                                  1/1     Running     0          94m     10.129.2.19    ip-10-0-165-63.ec2.internal    <none>           <none>
rook-ceph-osd-10-5d7f855999-qq68x                                 1/1     Running     0          23m     10.131.4.12    ip-10-0-185-7.ec2.internal     <none>           <none>
rook-ceph-osd-11-64b5854558-chzw8                                 1/1     Running     0          23m     10.129.4.7     ip-10-0-131-148.ec2.internal   <none>           <none>
rook-ceph-osd-12-6cb8475dd6-lj9qs                                 1/1     Running     0          19m     10.130.2.8     ip-10-0-193-103.ec2.internal   <none>           <none>
rook-ceph-osd-13-7685644678-ml9vl                                 1/1     Running     0          19m     10.131.2.11    ip-10-0-161-65.ec2.internal    <none>           <none>
rook-ceph-osd-14-5f846855bf-ghk8s                                 1/1     Running     0          19m     10.130.4.9     ip-10-0-158-252.ec2.internal   <none>           <none>
rook-ceph-osd-15-75d655b657-tx668                                 1/1     Running     0          13m     10.131.4.15    ip-10-0-185-7.ec2.internal     <none>           <none>
rook-ceph-osd-16-574d5ddb6d-wsrj9                                 1/1     Running     0          13m     10.130.4.12    ip-10-0-158-252.ec2.internal   <none>           <none>
rook-ceph-osd-17-b869d8c76-h8n4s                                  1/1     Running     0          13m     10.128.4.12    ip-10-0-213-134.ec2.internal   <none>           <none>
rook-ceph-osd-18-5985f978cd-8g5nz                                 1/1     Running     0          9m59s   10.131.2.14    ip-10-0-161-65.ec2.internal    <none>           <none>
rook-ceph-osd-19-58f467fdd8-mhk47                                 1/1     Running     0          9m58s   10.130.2.10    ip-10-0-193-103.ec2.internal   <none>           <none>
rook-ceph-osd-2-594449cd9d-tvjkn                                  1/1     Running     0          94m     10.128.2.27    ip-10-0-212-165.ec2.internal   <none>           <none>
rook-ceph-osd-20-cb8f574d-smnkq                                   1/1     Running     0          9m48s   10.129.4.11    ip-10-0-131-148.ec2.internal   <none>           <none>
rook-ceph-osd-21-75bf5bdc7d-hcq65                                 1/1     Running     0          5m46s   10.131.2.15    ip-10-0-161-65.ec2.internal    <none>           <none>
rook-ceph-osd-22-5c6d7754f4-zrvnf                                 1/1     Running     0          5m45s   10.128.4.14    ip-10-0-213-134.ec2.internal   <none>           <none>
rook-ceph-osd-23-7dcc8dcfbd-6q7zw                                 1/1     Running     0          5m40s   10.129.4.13    ip-10-0-131-148.ec2.internal   <none>           <none>
rook-ceph-osd-24-5f9f94d444-tqmzb                                 1/1     Running     0          60s     10.130.4.14    ip-10-0-158-252.ec2.internal   <none>           <none>
rook-ceph-osd-25-f9cf7558c-f5zvc                                  1/1     Running     0          57s     10.130.2.14    ip-10-0-193-103.ec2.internal   <none>           <none>
rook-ceph-osd-26-d46dd8944-dmrmj                                  1/1     Running     0          56s     10.131.4.18    ip-10-0-185-7.ec2.internal     <none>           <none>
rook-ceph-osd-3-558db7498b-c59x4                                  1/1     Running     0          83m     10.131.0.36    ip-10-0-149-130.ec2.internal   <none>           <none>
rook-ceph-osd-4-65dcbb54f6-6grp8                                  1/1     Running     0          82m     10.129.2.27    ip-10-0-165-63.ec2.internal    <none>           <none>
rook-ceph-osd-5-675d4dcc74-wmpjn                                  1/1     Running     0          82m     10.128.2.31    ip-10-0-212-165.ec2.internal   <none>           <none>
rook-ceph-osd-6-bb48b855c-4k4ck                                   1/1     Running     0          74m     10.128.2.33    ip-10-0-212-165.ec2.internal   <none>           <none>
rook-ceph-osd-7-86bcf856db-tbhhw                                  1/1     Running     0          74m     10.129.2.30    ip-10-0-165-63.ec2.internal    <none>           <none>
rook-ceph-osd-8-7864cd7df5-ktqpq                                  1/1     Running     0          74m     10.131.0.38    ip-10-0-149-130.ec2.internal   <none>           <none>
rook-ceph-osd-9-8d7bf6df6-dprxm                                   1/1     Running     0          23m     10.128.4.9     ip-10-0-213-134.ec2.internal   <none>           <none>

Comment 39 errata-xmlrpc 2020-12-17 06:22:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

Comment 43 Red Hat Bugzilla 2023-09-15 00:30:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

