Bug 2167045

Summary: Not all OSDs were created in the MS provider cluster of size 20
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Jilju Joy <jijoy>
Component: odf-managed-service
Assignee: Ohad <omitrani>
Status: CLOSED WORKSFORME
QA Contact: Jilju Joy <jijoy>
Severity: high
Priority: unspecified
Version: 4.10
CC: cblum, fbalak, ocs-bugs, odf-bz-bot
Last Closed: 2023-03-27 10:48:09 UTC
Type: Bug

Description Jilju Joy 2023-02-04 06:00:51 UTC
Description of problem:
2 of the expected 15 OSD pods were not created in the Managed Services provider cluster of size 20.
Installation was done with the ocs-provider-qe addon.

$ oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
addon-ocs-provider-qe-catalog-6hlvd                               1/1     Running     0          11h   10.130.2.22   ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          11h   10.131.2.17   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
csi-addons-controller-manager-759b488df-xrhx4                     2/2     Running     0          11h   10.131.2.18   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
ocs-metrics-exporter-5dd96c885b-x9ls7                             1/1     Running     0          11h   10.131.2.14   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
ocs-operator-6888799d6b-qn9b7                                     1/1     Running     0          11h   10.129.2.9    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
ocs-osd-aws-data-gather-87db84b8b-rh452                           1/1     Running     0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
ocs-osd-controller-manager-8d55ffccd-kzwmr                        3/3     Running     0          11h   10.130.2.28   ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
ocs-provider-server-6c47b6c7c9-c65n4                              1/1     Running     0          11h   10.131.2.12   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
odf-console-57b8476cd4-fkmwg                                      1/1     Running     0          11h   10.130.2.29   ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
odf-operator-controller-manager-6f44676f4f-p48b2                  2/2     Running     0          11h   10.130.2.26   ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          11h   10.131.2.16   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
prometheus-operator-8547cc9f89-lgjqz                              1/1     Running     0          11h   10.130.2.24   ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-04c17f9d1a57254b9b8f55072ae1557b-24ll5   1/1     Running     0          11h   10.0.19.29    ip-10-0-19-29.us-east-2.compute.internal    <none>           <none>
rook-ceph-crashcollector-0a0d3c68c7ab5b2c0e551505cd3d86fc-jp5x9   1/1     Running     0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-b319c5fd4e7a6a3619e66719a3d16180-t4gzx   1/1     Running     0          11h   10.0.23.98    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-crashcollector-c0db9449293af286deacfef5500c908c-59j59   1/1     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-c5123a40ecb6868b9caaab671f018de4-hb49r   1/1     Running     0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-crashcollector-cbdb47f615d01227c71b30b63707e135-p86c7   1/1     Running     0          11h   10.0.14.108   ip-10-0-14-108.us-east-2.compute.internal   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79b6b894rf5kg   2/2     Running     0          11h   10.0.23.98    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-d9876ff7vkt4x   2/2     Running     0          11h   10.0.14.108   ip-10-0-14-108.us-east-2.compute.internal   <none>           <none>
rook-ceph-mgr-a-f8895b454-fgj55                                   2/2     Running     0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-a-6bd7ccb97c-qk6gb                                  2/2     Running     0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-79d9cb59f6-j92bf                                  2/2     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-c9fbf66fb-fdqd9                                   2/2     Running     0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-operator-548b87d44b-mncjg                               1/1     Running     0          11h   10.129.2.8    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-0-7b87b47858-kpfv6                                  2/2     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-88c6867fb-wrfgs                                   2/2     Running     0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-11-7d7458f76-wql6k                                  2/2     Running     0          11h   10.0.23.98    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-13-846c6fcd8c-6gl52                                 2/2     Running     0          11h   10.0.19.29    ip-10-0-19-29.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-14-5445548848-nkllt                                 2/2     Running     0          11h   10.0.19.29    ip-10-0-19-29.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-2-6c5466b9d7-qpr8n                                  2/2     Running     0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-3-5df7c9f96c-rxjp4                                  2/2     Running     0          11h   10.0.14.108   ip-10-0-14-108.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-4-556c86bc54-kw94j                                  2/2     Running     0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-5-7ff7958f64-kldhm                                  2/2     Running     0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-6-c67b4f987-rdw22                                   2/2     Running     0          11h   10.0.23.98    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-7-76cf788868-nvbsb                                  2/2     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-8-7d5f78c798-4dn5w                                  2/2     Running     0          11h   10.0.14.108   ip-10-0-14-108.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-9-99977b6dd-p9cpx                                   2/2     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-02pt7c-7chxv                 0/1     Completed   0          11h   10.0.17.148   ip-10-0-17-148.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-16zc9r-4k45d                 0/1     Completed   0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-prepare-default-0-data-24q4cg-mlsql                 0/1     Completed   0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-0dstck-mbj9v                 0/1     Completed   0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-1d6k26-cnhjl                 0/1     Pending     0          11h   <none>        <none>                                      <none>           <none>
rook-ceph-osd-prepare-default-1-data-3hbbtb-6wj6m                 0/1     Completed   0          11h   10.0.23.98    ip-10-0-23-98.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-prepare-default-1-data-4npcdn-sbl84                 0/1     Completed   0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-0kdf28-t54mb                 0/1     Completed   0          11h   10.0.14.9     ip-10-0-14-9.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-prepare-default-2-data-1mjk7w-nss5f                 0/1     Completed   0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-4s9snn-ntr9j                 0/1     Pending     0          11h   <none>        <none>                                      <none>           <none>
rook-ceph-tools-7c8c77bd96-9rtnx                                  1/1     Running     0          11h   10.0.23.153   ip-10-0-23-153.us-east-2.compute.internal   <none>           <none>
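
For reference, a quick way to count the OSD pods in the listing above (a sketch; the grep patterns are illustrative and not part of the original report):

$ oc get pods --no-headers | grep -cE '^rook-ceph-osd-[0-9]+-'                  # running OSD pods: 13, expected 15
$ oc get pods --no-headers | grep '^rook-ceph-osd-prepare' | grep -c Pending    # prepare pods stuck Pending: 2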



$ oc describe pod rook-ceph-osd-prepare-default-1-data-1d6k26-cnhjl | grep "Events:" -A 20
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  29m (x806 over 11h)  default-scheduler  0/12 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had volume node affinity conflict.
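
The "volume node affinity conflict" above suggests the Pending prepare pod's PVC is bound to a PV pinned to an availability zone whose nodes cannot accept the pod. A diagnostic sketch (the shell variables are mine; only the pod name is taken from the report):

$ pvc=$(oc get pod rook-ceph-osd-prepare-default-1-data-1d6k26-cnhjl -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}')
$ pv=$(oc get pvc "$pvc" -o jsonpath='{.spec.volumeName}')
$ oc describe pv "$pv" | grep -A 5 'Node Affinity'    # shows which zone the PV is pinned to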


$ oc rsh rook-ceph-tools-7c8c77bd96-9rtnx ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         52.00000  root default                                                     
 -5         52.00000      region us-east-2                                             
-10         16.00000          zone us-east-2a                                          
 -9          4.00000              host default-0-data-16zc9r                           
  2    ssd   4.00000                  osd.2                       up   1.00000  1.00000
-13          4.00000              host default-0-data-4rw242                           
  3    ssd   4.00000                  osd.3                       up   1.00000  1.00000
-15          4.00000              host default-2-data-0kdf28                           
  5    ssd   4.00000                  osd.5                       up   1.00000  1.00000
-27          4.00000              host default-2-data-3w45nk                           
  8    ssd   4.00000                  osd.8                       up   1.00000  1.00000
 -4         16.00000          zone us-east-2b                                          
 -3          4.00000              host default-0-data-02pt7c                           
  1    ssd   4.00000                  osd.1                       up   1.00000  1.00000
-17          4.00000              host default-0-data-3h8zxb                           
  4    ssd   4.00000                  osd.4                       up   1.00000  1.00000
-31          4.00000              host default-1-data-28gzn7                           
 13    ssd   4.00000                  osd.13                      up   1.00000  1.00000
-33          4.00000              host default-2-data-2npbmj                           
 14    ssd   4.00000                  osd.14                      up   1.00000  1.00000
-20         20.00000          zone us-east-2c                                          
-23          4.00000              host default-0-data-24q4cg                           
  0    ssd   4.00000                  osd.0                       up   1.00000  1.00000
-19          4.00000              host default-1-data-0dstck                           
  7    ssd   4.00000                  osd.7                       up   1.00000  1.00000
-35          4.00000              host default-1-data-3hbbtb                           
 11    ssd   4.00000                  osd.11                      up   1.00000  1.00000
-25          4.00000              host default-1-data-4npcdn                           
  6    ssd   4.00000                  osd.6                       up   1.00000  1.00000
-29          4.00000              host default-2-data-1mjk7w                           
  9    ssd   4.00000                  osd.9                       up   1.00000  1.00000
 10                0  osd.10                                    down         0  1.00000
 12                0  osd.12                                    down         0  1.00000
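
osd.10 and osd.12 sit in the CRUSH map with zero weight and no host or zone, which matches the two osd-prepare pods stuck in Pending: their OSDs were never deployed. A cross-check sketch (assuming Rook's usual app=rook-ceph-osd label on OSD deployments):

$ oc get deployments -l app=rook-ceph-osd --no-headers | wc -l    # expect 13 here, not 15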



$ oc rsh rook-ceph-tools-7c8c77bd96-9rtnx ceph status
  cluster:
    id:     b271d171-d9c5-4b32-ba39-c095e12f4d28
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 12h)
    mgr: a(active, since 12h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 13 up (since 11h), 13 in (since 11h)
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 545 pgs
    objects: 38 objects, 3.3 KiB
    usage:   112 MiB used, 52 TiB / 52 TiB avail
    pgs:     545 active+clean
 
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
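
Note that health reports HEALTH_OK despite the two down OSDs: osd.10 and osd.12 carry zero CRUSH weight and were never marked in, so no placement groups map to them. To list only the down OSDs (a sketch; the state filter is available on recent Ceph releases):

$ oc rsh rook-ceph-tools-7c8c77bd96-9rtnx ceph osd tree down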




must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-20tb-pr/jijoy-20tb-pr_20230203T170743/logs/failed_testcase_ocs_logs_1675448890/test_deployment_ocs_logs/
=================================================================================
Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.456-02ea942   Route Monitor Operator        0.1.456-02ea942   route-monitor-operator.v0.1.454-494fffd   Succeeded


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         12h     Cluster version is 4.10.50

=================================================================================

How reproducible:
This is the first observed occurrence.

Steps to Reproduce:
1. Deploy an MS provider cluster of size 20 with the ocs-provider-qe addon.
Deployment example: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/20290/

=================================================================================
Actual results:
Cluster verification after the addon installation failed because 2 OSD pods were missing.

Expected results:
All 15 required OSD pods should be running.

Additional info:

Comment 1 Chris Blum 2023-02-10 10:24:15 UTC
Is this resolved by the new image?