Description of problem:

If an upscale is performed with the command:

$ rosa edit service --id=<service_ID> --size="<new_size in TiB>"

then some OSDs are not up and the user does not get the full requested capacity. For example, when the size is changed from 4 to 8 and then to 20:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     f009e4c6-06e5-4f09-9476-3d55e9d439b0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 9 up (since 9m), 11 in (since 14s)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 609 pgs
    objects: 23 objects, 14 KiB
    usage:   156 MiB used, 36 TiB / 36 TiB avail
    pgs:     609 active+clean

  io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    36 TiB  36 TiB  155 MiB  155 MiB   0
TOTAL  36 TiB  36 TiB  155 MiB  155 MiB   0

--- POOLS ---
POOL                                                                ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                                1    1  0 B     0        0 B      0      10 TiB
ocs-storagecluster-cephfilesystem-metadata                           2   32  21 KiB  22       155 KiB  0      10 TiB
ocs-storagecluster-cephfilesystem-data0                              3  512  0 B     0        0 B      0      10 TiB
cephblockpool-storageconsumer-3717e957-2e13-4339-b433-dc9e65fdc3ae   4   64  19 B    1        12 KiB   0      10 TiB

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a 4 TiB provider cluster.
2. Upscale it to 20 TiB:
   $ rosa edit service --id=<service_ID> --size="20"
3. Check the Ceph status:
   $ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
     cluster:
       id:     f009e4c6-06e5-4f09-9476-3d55e9d439b0
       health: HEALTH_OK

     services:
       mon: 3 daemons, quorum a,b,c (age 3h)
       mgr: a(active, since 3h)
       mds: 1/1 daemons up, 1 hot standby
       osd: 15 osds: 9 up (since 9m), 11 in (since 14s)

     data:
       volumes: 1/1 healthy
       pools:   4 pools, 609 pgs
       objects: 23 objects, 14 KiB
       usage:   156 MiB used, 36 TiB / 36 TiB avail
       pgs:     609 active+clean

     io:
       client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

4. Check the events of the OSD pods that are down (a sketch of how to gather them follows the Actual results below).

Actual results:
Some OSDs are down because of insufficient memory on the nodes:

  0/18 nodes are available: 12 Insufficient memory, 12 node(s) had no available volume zone, 12 node(s) had volume node affinity conflict, 15 node(s) didn't match Pod's node affinity/selector, 3 Insufficient cpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 Preemption is not helpful for scheduling, 3 Insufficient memory.

Users have only 10 TiB available instead of 20 TiB.
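For reference, the scheduling event above was taken from the OSD pods stuck in Pending. A minimal sketch of how to collect these events (the rook-ceph-osd label and the pod name placeholder are assumptions and may differ per deployment):

$ oc get pods -n openshift-storage -l app=rook-ceph-osd
$ oc describe pod -n openshift-storage <pending_osd_pod_name>
$ oc get events -n openshift-storage --field-selector involvedObject.name=<pending_osd_pod_name>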
Expected results:
All new OSDs are up.

Additional info:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
 -1         36.00000  root default
 -5         36.00000      region us-east-1
 -4         12.00000          zone us-east-1a
-17          4.00000              host default-0-data-1997d8
  3    ssd   4.00000                  osd.3               up       1.00000  1.00000
 -3          4.00000              host default-2-data-0kmcrj
  0    ssd   4.00000                  osd.0               up       1.00000  1.00000
-23          4.00000              host default-2-data-3vz9vb
 11    ssd   4.00000                  osd.11              up       1.00000  1.00000
-14         12.00000          zone us-east-1b
-27          4.00000              host default-0-data-2jt598
  9    ssd   4.00000                  osd.9               up       1.00000  1.00000
-13          4.00000              host default-1-data-0sq66t
  2    ssd   4.00000                  osd.2               up       1.00000  1.00000
-21          4.00000              host default-2-data-1fk8ck
  5    ssd   4.00000                  osd.5               up       1.00000  1.00000
-10         12.00000          zone us-east-1c
 -9          4.00000              host default-0-data-0vlnfx
  1    ssd   4.00000                  osd.1               up       1.00000  1.00000
-25          4.00000              host default-0-data-4z6vps
  6    ssd   4.00000                  osd.6               up       1.00000  1.00000
-19          4.00000              host default-1-data-12xm7l
  4    ssd   4.00000                  osd.4               up       1.00000  1.00000
  7                0  osd.7                               down           0  1.00000
  8                0  osd.8                               down           0  1.00000
 10                0  osd.10                              down           0  1.00000
 12                0  osd.12                              down           0  1.00000
 13                0  osd.13                              down     1.00000  1.00000
 14                0  osd.14                              down     1.00000  1.00000

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    36 TiB  36 TiB  156 MiB  156 MiB   0
TOTAL  36 TiB  36 TiB  156 MiB  156 MiB   0

--- POOLS ---
POOL                                                                ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                                1    1  0 B     0        0 B      0      10 TiB
ocs-storagecluster-cephfilesystem-metadata                           2   32  21 KiB  22       155 KiB  0      10 TiB
ocs-storagecluster-cephfilesystem-data0                              3  512  0 B     0        0 B      0      10 TiB
cephblockpool-storageconsumer-3717e957-2e13-4339-b433-dc9e65fdc3ae   4   64  19 B    1        12 KiB   0      10 TiB
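To narrow the output above to just the affected OSDs and the node-side memory pressure reported by the scheduler, something along these lines should work (a sketch only; the grep patterns and the state filter for ceph osd tree are assumptions about the toolbox and CLI versions in use):

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree down
$ oc get pods -n openshift-storage | grep rook-ceph-osd | grep -v Running
$ oc describe nodes | grep -A 6 "Allocated resources:"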
All OSDs are in after the upscale from 4 TiB -> 20 TiB. --> VERIFIED

Tested with: ocs-osd-deployer.v2.0.11
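For completeness, a quick way to confirm the deployer version and the OSD state on the verified cluster (the namespace and CSV name prefix are assumptions based on this report):

$ oc get csv -n openshift-storage | grep ocs-osd-deployer
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd stat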
Closing this bug as fixed in v2.0.11 and tested by QE.