Description of problem:

Deployment of an MS provider cluster with the QE addon created a cluster without the required number of OSD pods. With cluster size 20, the expected number of OSDs is 15, but only 14 OSD pods are present. In the output of the 'ceph osd tree' command, 16 OSDs are listed, with 2 marked as down. This issue was seen twice.

$ oc get pods -o wide -l osd
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-69fb65d74-628qz     2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-10-6f57f5b96-wl6hp    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-11-65b7b96f78-l2rv9   2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-12-5c97f8dd5-xnpkr    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-13-54dbb7bfcf-n8ppw   2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-14-67998c764c-5lxch   2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-2-7465f44d75-s49wz    2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-3-ffbbbfd7-pwt9r      2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-4-8d7db8c69-rj2g4     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-5-6b8858587-lpqq6     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-6-69c85cd994-zm2b4    2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-7-b7f66f6bf-7lwhr     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-8-74d444bf4f-vxhwq    2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-9-7d8b8bdf49-lpvfr    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph status
  cluster:
    id:     9d589944-620e-4949-80e7-adb11468c634
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 9h)
    mgr: a(active, since 9h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 16 osds: 14 up (since 9h), 14 in (since 9h)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 801 pgs
    objects: 24 objects, 22 KiB
    usage:   130 MiB used, 56 TiB / 56 TiB avail
    pgs:     801 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
 -1         56.00000  root default
 -5         56.00000      region us-east-1
 -4         20.00000          zone us-east-1a
 -9          4.00000              host default-0-data-0bgnf5
  9    ssd   4.00000                  osd.9                     up   1.00000  1.00000
-15          4.00000              host default-0-data-3r4ppn
 13    ssd   4.00000                  osd.13                    up   1.00000  1.00000
-13          4.00000              host default-1-data-1dw8z9
 10    ssd   4.00000                  osd.10                    up   1.00000  1.00000
 -3          4.00000              host default-1-data-44rpbg
 12    ssd   4.00000                  osd.12                    up   1.00000  1.00000
-11          4.00000              host default-2-data-2jstsl
 11    ssd   4.00000                  osd.11                    up   1.00000  1.00000
-30         16.00000          zone us-east-1b
-33          4.00000              host default-0-data-2ckx4q
  0    ssd   4.00000                  osd.0                     up   1.00000  1.00000
-29          4.00000              host default-1-data-36fh9t
  2    ssd   4.00000                  osd.2                     up   1.00000  1.00000
-35          4.00000              host default-2-data-1xqdp8
  3    ssd   4.00000                  osd.3                     up   1.00000  1.00000
-37          4.00000              host default-2-data-4qzh48
 14    ssd   4.00000                  osd.14                    up   1.00000  1.00000
-18         20.00000          zone us-east-1c
-21          4.00000              host default-0-data-1x9xxp
  8    ssd   4.00000                  osd.8                     up   1.00000  1.00000
-23          4.00000              host default-0-data-49zhbn
  4    ssd   4.00000                  osd.4                     up   1.00000  1.00000
-27          4.00000              host default-1-data-2dzx2r
  7    ssd   4.00000                  osd.7                     up   1.00000  1.00000
-25          4.00000              host default-2-data-058gh4
  5    ssd   4.00000                  osd.5                     up   1.00000  1.00000
-17          4.00000              host default-2-data-3nbwkq
  6    ssd   4.00000                  osd.6                     up   1.00000  1.00000
  1          0                    osd.1                       down         0  1.00000
 15          0                    osd.15                      down         0  1.00000
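Note that osd.1 and osd.15 appear in the CRUSH map with zero weight and no host bucket, which is consistent with OSD IDs that were registered but never fully prepared on their PVCs. For triage, the failing prepare job can be located with something like the commands below (this assumes the usual Rook label app=rook-ceph-osd-prepare; adjust if the labels differ in this build). The describe output of the failed job follows.

$ oc -n openshift-storage get jobs -l app=rook-ceph-osd-prepare    # list the OSD prepare jobs; the failed one shows 0/1 completions
$ oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare    # the corresponding prepare pods and their phases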
$ oc describe job rook-ceph-osd-prepare-default-1-data-0gcdm2 | grep "Events" -A 10
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      142m  job-controller  Created pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-4j6cn
  Normal   SuccessfulCreate      141m  job-controller  Created pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-2rfk9
  Normal   SuccessfulDelete      133m  job-controller  Deleted pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-2rfk9
  Warning  BackoffLimitExceeded  133m  job-controller  Job has reached the specified backoff limit

OSD deployments exist for only 14 OSDs, and all 14 are Running.

$ oc get deployment | grep rook-ceph-osd
rook-ceph-osd-0    1/1     1            1           141m
rook-ceph-osd-10   1/1     1            1           143m
rook-ceph-osd-11   1/1     1            1           143m
rook-ceph-osd-12   1/1     1            1           143m
rook-ceph-osd-13   1/1     1            1           143m
rook-ceph-osd-14   1/1     1            1           141m
rook-ceph-osd-2    1/1     1            1           141m
rook-ceph-osd-3    1/1     1            1           141m
rook-ceph-osd-4    1/1     1            1           143m
rook-ceph-osd-5    1/1     1            1           143m
rook-ceph-osd-6    1/1     1            1           143m
rook-ceph-osd-7    1/1     1            1           143m
rook-ceph-osd-8    1/1     1            1           143m
rook-ceph-osd-9    1/1     1            1           143m

rook-ceph-operator logs:

2022-12-07 07:52:00.303704 E | op-osd: failed to provision OSD(s) on PVC default-1-data-0gcdm2. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to initialize devices on PVC: failed to run ceph-volume. stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-07 07:52:14.583414 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on nodes in namespace "openshift-storage".
2022-12-07 07:53:30.604141 E | op-osd: failed to provision OSD(s) on PVC default-1-data-0gcdm2. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to initialize devices on PVC: failed to run ceph-volume. stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-07 07:53:30.800427 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on nodes in namespace "openshift-storage".
2022-12-07 07:54:14.499775 E | ceph-spec: failed to update cluster condition to {Type:Progressing Status:True Reason:ClusterProgressing Message:Processing OSD 3 on PVC "default-2-data-1xqdp8" LastHeartbeatTime:2022-12-07 07:54:14.433328238 +0000 UTC m=+633.045903685 LastTransitionTime:2022-12-07 07:54:14.433328157 +0000 UTC m=+633.045903623}. failed to update object "openshift-storage/ocs-storagecluster-cephcluster" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "ocs-storagecluster-cephcluster": the object has been modified; please apply your changes to the latest version and try again
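The ceph-volume failure above comes from the prepare pod for PVC default-1-data-0gcdm2. Its full output can be pulled for triage with something like the following (the ceph.rook.io/pvc label selector is assumed from the usual Rook labeling of PVC-backed OSD prepare pods; the pod name from the job events above can also be used directly if the pod is still present):

$ oc -n openshift-storage logs -l app=rook-ceph-osd-prepare,ceph.rook.io/pvc=default-1-data-0gcdm2 --tail=-1    # full ceph-volume output from the prepare pod for this PVC
$ oc -n openshift-storage logs rook-ceph-osd-prepare-default-1-data-0gcdm2-4j6cn                                 # same, by pod name, if the pod still exists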
managedocs status:

  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
    reconcileStrategy: strict

must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-d7-pr/jijoy-d7-pr_20221207T062041/logs/testcases_1670426210/

==========================================================================================================

Version-Release number of selected component (if applicable):

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.8                      NooBaa Operator               4.10.8            mcg-operator.v4.10.7                      Succeeded
observability-operator.v0.0.15            Observability Operator        0.0.15            observability-operator.v0.0.15-rc         Succeeded
ocs-operator.v4.10.7                      OpenShift Container Storage   4.10.7            ocs-operator.v4.10.6                      Succeeded
ocs-osd-deployer.v2.0.10                  OCS OSD Deployer              2.0.10            ocs-osd-deployer.v2.0.9                   Succeeded
odf-csi-addons-operator.v4.10.7           CSI Addons                    4.10.7            odf-csi-addons-operator.v4.10.6           Succeeded
odf-operator.v4.10.7                      OpenShift Data Foundation     4.10.7            odf-operator.v4.10.6                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded

=============================================================================================================

How reproducible:

Intermittent; observed twice. The first occurrence was with size 4, the second with size 20, which is the one reported here. There have also been successful deployments with sizes 4 and 20.

=============================================================================================================

Steps to Reproduce:
1. Deploy an MS provider cluster with the QE addon.

Note: This is an intermittent issue.

=============================================================================================================

Actual results:
Fewer OSDs than required are created.

Expected results:
The required number of OSDs for the given cluster size should be available.

Additional info:
Can be tested with the latest build
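When re-testing, a quick sanity check is to compare the number of OSD deployments against what Ceph reports (a minimal sketch; the app=rook-ceph-osd label is assumed from the usual Rook labeling, and the tools-pod lookup mirrors the commands above):

$ oc -n openshift-storage get deployment -l app=rook-ceph-osd --no-headers | wc -l                                  # number of OSD deployments
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph osd stat      # OSDs known to Ceph and their up/in counts

Both counts should equal the expected OSD count for the chosen size (15 for size 20), with no OSDs reported down.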