Bug 2142013
| Summary: | Not all osds are up after upscale | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | odf-managed-service | Assignee: | Leela Venkaiah Gangavarapu <lgangava> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Filip Balák <fbalak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | aeyal, lgangava, ocs-bugs, odf-bz-bot, rchikatw |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-14 15:28:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
All osds are in after upscale from 4 -> 20. --> VERIFIED

Tested with: ocs-osd-deployer.v2.0.11

Closing this bug as fixed in v2.0.11 and tested by QE.
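For reference, a quick way to confirm the OSD state after an upscale is via the same toolbox pod used throughout this report; the command below is only a sketch, and the expected output shape is an assumption for a 20 TiB cluster:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd stat
# expected after the fix (assumed shape): 20 osds: 20 up, 20 in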
Description of problem:
If an upscale is performed with the command

$ rosa edit service --id=<service_ID> --size="<new_size in TiB>"

then some osds are not up and the user doesn't have enough capacity. For example, when the size is changed from 4 to 8 and then to 20:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     f009e4c6-06e5-4f09-9476-3d55e9d439b0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 9 up (since 9m), 11 in (since 14s)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 609 pgs
    objects: 23 objects, 14 KiB
    usage:   156 MiB used, 36 TiB / 36 TiB avail
    pgs:     609 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS   SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd     36 TiB  36 TiB  155 MiB  155 MiB   0
TOTAL   36 TiB  36 TiB  155 MiB  155 MiB   0

--- POOLS ---
POOL                                                                 ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                                1   1    0 B     0        0 B      0      10 TiB
ocs-storagecluster-cephfilesystem-metadata                           2   32   21 KiB  22       155 KiB  0      10 TiB
ocs-storagecluster-cephfilesystem-data0                              3   512  0 B     0        0 B      0      10 TiB
cephblockpool-storageconsumer-3717e957-2e13-4339-b433-dc9e65fdc3ae   4   64   19 B    1        12 KiB   0      10 TiB

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a 4 TiB provider cluster.
2. Upscale it to 20 TiB:
   $ rosa edit service --id=<service_ID> --size="20"
3. Check ceph status:
   $ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
     cluster:
       id:     f009e4c6-06e5-4f09-9476-3d55e9d439b0
       health: HEALTH_OK
     services:
       mon: 3 daemons, quorum a,b,c (age 3h)
       mgr: a(active, since 3h)
       mds: 1/1 daemons up, 1 hot standby
       osd: 15 osds: 9 up (since 9m), 11 in (since 14s)
     data:
       volumes: 1/1 healthy
       pools:   4 pools, 609 pgs
       objects: 23 objects, 14 KiB
       usage:   156 MiB used, 36 TiB / 36 TiB avail
       pgs:     609 active+clean
     io:
       client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
4. Check the events of the osd pods that are down (an example of how to pull these events is shown after the Actual results below).

Actual results:
Some osds are down because of insufficient memory:

0/18 nodes are available: 12 Insufficient memory, 12 node(s) had no available volume zone, 12 node(s) had volume node affinity conflict, 15 node(s) didn't match Pod's node affinity/selector, 3 Insufficient cpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 Preemption is not helpful for scheduling, 3 Insufficient memory.

Users have only 10 TiB available instead of 20 TiB.
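One way to retrieve the scheduling events quoted above (referenced from step 4) is to inspect the OSD pods stuck in Pending. The label selector below is an assumption based on the usual rook-ceph pod labels and may need to be adjusted for the cluster:

$ oc get pods -n openshift-storage -l app=rook-ceph-osd          # list OSD pods; label is assumed
$ oc describe pod -n openshift-storage <pending-osd-pod-name>    # Events section shows the FailedScheduling message
$ oc get events -n openshift-storage --field-selector reason=FailedScheduling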
Expected results:
All new osds are up.

Additional info:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
 -1         36.00000  root default
 -5         36.00000      region us-east-1
 -4         12.00000          zone us-east-1a
-17          4.00000              host default-0-data-1997d8
  3    ssd   4.00000                  osd.3                   up   1.00000  1.00000
 -3          4.00000              host default-2-data-0kmcrj
  0    ssd   4.00000                  osd.0                   up   1.00000  1.00000
-23          4.00000              host default-2-data-3vz9vb
 11    ssd   4.00000                  osd.11                  up   1.00000  1.00000
-14         12.00000          zone us-east-1b
-27          4.00000              host default-0-data-2jt598
  9    ssd   4.00000                  osd.9                   up   1.00000  1.00000
-13          4.00000              host default-1-data-0sq66t
  2    ssd   4.00000                  osd.2                   up   1.00000  1.00000
-21          4.00000              host default-2-data-1fk8ck
  5    ssd   4.00000                  osd.5                   up   1.00000  1.00000
-10         12.00000          zone us-east-1c
 -9          4.00000              host default-0-data-0vlnfx
  1    ssd   4.00000                  osd.1                   up   1.00000  1.00000
-25          4.00000              host default-0-data-4z6vps
  6    ssd   4.00000                  osd.6                   up   1.00000  1.00000
-19          4.00000              host default-1-data-12xm7l
  4    ssd   4.00000                  osd.4                   up   1.00000  1.00000
  7          0                    osd.7                     down         0  1.00000
  8          0                    osd.8                     down         0  1.00000
 10          0                    osd.10                    down         0  1.00000
 12          0                    osd.12                    down         0  1.00000
 13          0                    osd.13                    down   1.00000  1.00000
 14          0                    osd.14                    down   1.00000  1.00000

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS   SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd     36 TiB  36 TiB  156 MiB  156 MiB   0
TOTAL   36 TiB  36 TiB  156 MiB  156 MiB   0

--- POOLS ---
POOL                                                                 ID  PGS  STORED  OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                                1   1    0 B     0        0 B      0      10 TiB
ocs-storagecluster-cephfilesystem-metadata                           2   32   21 KiB  22       155 KiB  0      10 TiB
ocs-storagecluster-cephfilesystem-data0                              3   512  0 B     0        0 B      0      10 TiB
cephblockpool-storageconsumer-3717e957-2e13-4339-b433-dc9e65fdc3ae   4   64   19 B    1        12 KiB   0      10 TiB