Bug 2151640 - All OSDs were not created in MS provider cluster
Summary: All OSDs were not created in MS provider cluster
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ohad
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-07 17:30 UTC by Jilju Joy
Modified: 2023-08-09 17:16 UTC
CC: 4 users

Fixed In Version: v2.0.11
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Jilju Joy 2022-12-07 17:30:23 UTC
Description of problem:
Deployment of an MS provider cluster with the QE addon created a cluster without the required number of OSD pods. With cluster size 20, the expected number of OSDs is 15, but only 14 OSD pods are present. In the output of the 'ceph osd tree' command, 16 OSDs are listed, 2 of which are marked as down.
This issue was seen twice.

$ oc get pods -o wide -l osd
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-69fb65d74-628qz     2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-10-6f57f5b96-wl6hp    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-11-65b7b96f78-l2rv9   2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-12-5c97f8dd5-xnpkr    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-13-54dbb7bfcf-n8ppw   2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
rook-ceph-osd-14-67998c764c-5lxch   2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-2-7465f44d75-s49wz    2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-3-ffbbbfd7-pwt9r      2/2     Running   0          9h    10.0.159.229   ip-10-0-159-229.ec2.internal   <none>           <none>
rook-ceph-osd-4-8d7db8c69-rj2g4     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-5-6b8858587-lpqq6     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-6-69c85cd994-zm2b4    2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-7-b7f66f6bf-7lwhr     2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-8-74d444bf4f-vxhwq    2/2     Running   0          9h    10.0.175.29    ip-10-0-175-29.ec2.internal    <none>           <none>
rook-ceph-osd-9-7d8b8bdf49-lpvfr    2/2     Running   0          9h    10.0.133.149   ip-10-0-133-149.ec2.internal   <none>           <none>
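
A quick way to count the running OSD pods (a sketch; it assumes the standard Rook label app=rook-ceph-osd and the openshift-storage namespace):

$ oc get pods -n openshift-storage -l app=rook-ceph-osd --field-selector=status.phase=Running --no-headers | wc -l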


$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph status
  cluster:
    id:     9d589944-620e-4949-80e7-adb11468c634
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 9h)
    mgr: a(active, since 9h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 16 osds: 14 up (since 9h), 14 in (since 9h)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 801 pgs
    objects: 24 objects, 22 KiB
    usage:   130 MiB used, 56 TiB / 56 TiB avail
    pgs:     801 active+clean
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr



$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         56.00000  root default                                                     
 -5         56.00000      region us-east-1                                             
 -4         20.00000          zone us-east-1a                                          
 -9          4.00000              host default-0-data-0bgnf5                           
  9    ssd   4.00000                  osd.9                       up   1.00000  1.00000
-15          4.00000              host default-0-data-3r4ppn                           
 13    ssd   4.00000                  osd.13                      up   1.00000  1.00000
-13          4.00000              host default-1-data-1dw8z9                           
 10    ssd   4.00000                  osd.10                      up   1.00000  1.00000
 -3          4.00000              host default-1-data-44rpbg                           
 12    ssd   4.00000                  osd.12                      up   1.00000  1.00000
-11          4.00000              host default-2-data-2jstsl                           
 11    ssd   4.00000                  osd.11                      up   1.00000  1.00000
-30         16.00000          zone us-east-1b                                          
-33          4.00000              host default-0-data-2ckx4q                           
  0    ssd   4.00000                  osd.0                       up   1.00000  1.00000
-29          4.00000              host default-1-data-36fh9t                           
  2    ssd   4.00000                  osd.2                       up   1.00000  1.00000
-35          4.00000              host default-2-data-1xqdp8                           
  3    ssd   4.00000                  osd.3                       up   1.00000  1.00000
-37          4.00000              host default-2-data-4qzh48                           
 14    ssd   4.00000                  osd.14                      up   1.00000  1.00000
-18         20.00000          zone us-east-1c                                          
-21          4.00000              host default-0-data-1x9xxp                           
  8    ssd   4.00000                  osd.8                       up   1.00000  1.00000
-23          4.00000              host default-0-data-49zhbn                           
  4    ssd   4.00000                  osd.4                       up   1.00000  1.00000
-27          4.00000              host default-1-data-2dzx2r                           
  7    ssd   4.00000                  osd.7                       up   1.00000  1.00000
-25          4.00000              host default-2-data-058gh4                           
  5    ssd   4.00000                  osd.5                       up   1.00000  1.00000
-17          4.00000              host default-2-data-3nbwkq                           
  6    ssd   4.00000                  osd.6                       up   1.00000  1.00000
  1                0  osd.1                                     down         0  1.00000
 15                0  osd.15                                    down         0  1.00000
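
To list only the down OSDs directly, the tree output can be filtered by state (a sketch; 'ceph osd tree down' takes an optional state filter in recent Ceph releases):

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph osd tree down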



$ oc describe job rook-ceph-osd-prepare-default-1-data-0gcdm2 | grep "Events" -A 10
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      142m  job-controller  Created pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-4j6cn
  Normal   SuccessfulCreate      141m  job-controller  Created pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-2rfk9
  Normal   SuccessfulDelete      133m  job-controller  Deleted pod: rook-ceph-osd-prepare-default-1-data-0gcdm2-2rfk9
  Warning  BackoffLimitExceeded  133m  job-controller  Job has reached the specified backoff limit
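
To check whether other prepare jobs hit the same failure, the jobs and the related events can be listed like this (a sketch; it assumes the standard Rook label app=rook-ceph-osd-prepare):

$ oc get jobs -n openshift-storage -l app=rook-ceph-osd-prepare
$ oc get events -n openshift-storage --field-selector reason=BackoffLimitExceeded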



Deployments are present for only 14 OSDs. All 14 are Running.
$ oc get deployment | grep rook-ceph-osd
rook-ceph-osd-0                                         1/1     1            1           141m
rook-ceph-osd-10                                        1/1     1            1           143m
rook-ceph-osd-11                                        1/1     1            1           143m
rook-ceph-osd-12                                        1/1     1            1           143m
rook-ceph-osd-13                                        1/1     1            1           143m
rook-ceph-osd-14                                        1/1     1            1           141m
rook-ceph-osd-2                                         1/1     1            1           141m
rook-ceph-osd-3                                         1/1     1            1           141m
rook-ceph-osd-4                                         1/1     1            1           143m
rook-ceph-osd-5                                         1/1     1            1           143m
rook-ceph-osd-6                                         1/1     1            1           143m
rook-ceph-osd-7                                         1/1     1            1           143m
rook-ceph-osd-8                                         1/1     1            1           143m
rook-ceph-osd-9                                         1/1     1            1           143m


rook-ceph-operator logs:

2022-12-07 07:52:00.303704 E | op-osd: failed to provision OSD(s) on PVC default-1-data-0gcdm2. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to initialize devices on PVC: failed to run ceph-volume. stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-07 07:52:14.583414 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on nodes in namespace "openshift-storage". 
2022-12-07 07:53:30.604141 E | op-osd: failed to provision OSD(s) on PVC default-1-data-0gcdm2. &{OSDs:[] Status:failed PvcBackedOSD:true Message:failed to configure devices: failed to initialize devices on PVC: failed to run ceph-volume. stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
2022-12-07 07:53:30.800427 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph osds: 1 failures encountered while running osds on nodes in namespace "openshift-storage". 
2022-12-07 07:54:14.499775 E | ceph-spec: failed to update cluster condition to {Type:Progressing Status:True Reason:ClusterProgressing Message:Processing OSD 3 on PVC "default-2-data-1xqdp8" LastHeartbeatTime:2022-12-07 07:54:14.433328238 +0000 UTC m=+633.045903685 LastTransitionTime:2022-12-07 07:54:14.433328157 +0000 UTC m=+633.045903623}. failed to update object "openshift-storage/ocs-storagecluster-cephcluster" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "ocs-storagecluster-cephcluster": the object has been modified; please apply your changes to the latest version and try again
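
Since ceph-volume complains about a missing/unknown device path on PVC default-1-data-0gcdm2, inspecting that PVC and its bound PV may help narrow things down (a sketch, not a confirmed diagnosis):

$ oc get pvc default-1-data-0gcdm2 -n openshift-storage
$ oc get pv "$(oc get pvc default-1-data-0gcdm2 -n openshift-storage -o jsonpath='{.spec.volumeName}')"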


managedocs status:

status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
    reconcileStrategy: strict



must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-d7-pr/jijoy-d7-pr_20221207T062041/logs/testcases_1670426210/

==========================================================================================================
Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.8                      NooBaa Operator               4.10.8            mcg-operator.v4.10.7                      Succeeded
observability-operator.v0.0.15            Observability Operator        0.0.15            observability-operator.v0.0.15-rc         Succeeded
ocs-operator.v4.10.7                      OpenShift Container Storage   4.10.7            ocs-operator.v4.10.6                      Succeeded
ocs-osd-deployer.v2.0.10                  OCS OSD Deployer              2.0.10            ocs-osd-deployer.v2.0.9                   Succeeded
odf-csi-addons-operator.v4.10.7           CSI Addons                    4.10.7            odf-csi-addons-operator.v4.10.6           Succeeded
odf-operator.v4.10.7                      OpenShift Data Foundation     4.10.7            odf-operator.v4.10.6                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded


=============================================================================================================

How reproducible:
Observed twice: first with cluster size 4, then with size 20 (the occurrence reported here). There are also successful deployments with sizes 4 and 20, so the issue is intermittent.

=============================================================================================================

Steps to Reproduce:
1. Deploy an MS provider cluster with the QE addon.
Note: This is an intermittent issue.
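
For verification, the expected OSD count can be derived from the StorageCluster spec (device set count x replica) and compared with the number of running OSD pods counted earlier (a sketch; it assumes the default StorageCluster name ocs-storagecluster):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets[0].count}{" x "}{.spec.storageDeviceSets[0].replica}{"\n"}'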
=============================================================================================================
Actual results:
Fewer OSDs than required.

Expected results:
The required number of OSDs for the given cluster size should be available.


Additional info:

Comment 4 Dhruv Bindra 2023-01-20 10:05:45 UTC
Can be tested with the latest build

