Bug 2188000

Summary: Missing osds and mon on provider - 0/9 nodes are available
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation    Reporter: Filip Balák <fbalak>
Component: odf-managed-service    Assignee: Ohad <omitrani>
Status: CLOSED NOTABUG    QA Contact: Neha Berry <nberry>
Severity: high    Docs Contact:
Priority: unspecified
Version: 4.12    CC: ocs-bugs, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2023-04-19 14:11:37 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:
Embargoed:

Description Filip Balák 2023-04-19 12:31:44 UTC
Description of problem:
Installation according to the Fusion aaS guide [1] fails: some OSD and mon pods are not deployed because they cannot be scheduled:

$ oc describe pod rook-ceph-mon-c-76f95dd57c-hs4tv -n openshift-storage 
Name:                 rook-ceph-mon-c-76f95dd57c-hs4tv
Namespace:            openshift-storage
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 <none>
Labels:               app=rook-ceph-mon
                      app.kubernetes.io/component=cephclusters.ceph.rook.io
                      app.kubernetes.io/created-by=rook-ceph-operator
                      app.kubernetes.io/instance=c
                      app.kubernetes.io/managed-by=rook-ceph-operator
                      app.kubernetes.io/name=ceph-mon
                      app.kubernetes.io/part-of=ocs-storagecluster-cephcluster
                      ceph_daemon_id=c
                      ceph_daemon_type=mon
                      mon=c
                      mon_cluster=openshift-storage
                      pod-template-hash=76f95dd57c
                      pvc_name=rook-ceph-mon-c
                      pvc_size=50Gi
                      rook.io/operator-namespace=openshift-storage
                      rook_cluster=openshift-storage
Annotations:          openshift.io/scc: rook-ceph
Status:               Pending
(...)
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  51m                default-scheduler  0/9 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  31m (x3 over 46m)  default-scheduler  0/9 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
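
For reference, the event message accounts for all 9 nodes: the 3 master and 3 infra nodes carry untolerated taints, 2 nodes do not match the pod's node affinity/selector, and 1 worker node is cordoned. The node taints and the pod's scheduling constraints can be re-checked with commands along these lines (the pod name is taken from the output above):

$ oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
$ oc get pod rook-ceph-mon-c-76f95dd57c-hs4tv -n openshift-storage -o yaml | grep -A 20 affinity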

Version-Release number of selected component (if applicable):
ROSA 4.12.12
quay.io/resoni/managed-fusion-agent-index:4.13.0-164

How reproducible:
1/1

Steps to Reproduce:
1. Deploy ODF on Fusion according to the guide [1]
2. Check the storagecluster and related cluster resources (see the example commands below)
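
For step 2, a minimal set of checks could look like this (assuming the openshift-storage namespace used by the provider deployment):

$ oc get storagecluster -n openshift-storage
$ oc get cephcluster -n openshift-storage
$ oc get pods -n openshift-storage -o wide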

Actual results:
Deployment of the ODF resources does not complete: the rook-ceph-mon-c and rook-ceph-osd-prepare pods stay Pending because no node is available to schedule them.

Expected results:
All ODF resources are deployed successfully.

Additional info:
[1] https://docs.google.com/document/d/1Jdx8czlMjbumvilw8nZ6LtvWOMAx3H4TfwoVwiBs0nE/edit#

Comment 1 Filip Balák 2023-04-19 13:16:09 UTC
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
875729c50f952b4913290e018f605305b1f88af3654c1c625da1364b39zbxts   0/1     Completed   0          123m
managed-fusion-offering-catalog-vsxhd                             1/1     Running     0          126m
ocs-metrics-exporter-695dc5d6dc-bznk6                             1/1     Running     0          123m
ocs-operator-59cd8cd764-tfdt2                                     1/1     Running     0          123m
ocs-provider-server-7dcdbf87fc-lgrwq                              1/1     Running     0          122m
rook-ceph-crashcollector-86278db17f8b36c54319352a92416617-zf6gt   1/1     Running     0          112m
rook-ceph-crashcollector-93d940652afee5da0612a8bdb72a3bd4-qzq87   1/1     Running     0          112m
rook-ceph-mgr-a-86d6d7d46b-tskvs                                  2/2     Running     0          112m
rook-ceph-mon-a-54f64d7b95-4zmhs                                  2/2     Running     0          120m
rook-ceph-mon-b-5c9966b6dc-mpz7b                                  2/2     Running     0          120m
rook-ceph-mon-c-76f95dd57c-hs4tv                                  0/2     Pending     0          118m
rook-ceph-operator-66fd6f59f5-xjj49                               1/1     Running     0          122m
rook-ceph-osd-0-84945579cc-8vsgh                                  2/2     Running     0          112m
rook-ceph-osd-1-58bd6b9494-94nwn                                  2/2     Running     0          112m
rook-ceph-osd-prepare-default-0-data-06kthp-jvkqk                 0/1     Completed   0          112m
rook-ceph-osd-prepare-default-1-data-0clt8x-ltl4v                 0/1     Completed   0          112m
rook-ceph-osd-prepare-default-2-data-0m96dr-sgwfx                 0/1     Pending     0          112m
rook-ceph-tools-78d8f5799-l4zx6                                   1/1     Running     0          123m
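
Both Pending pods appear to be PVC-backed (see pvc_name=rook-ceph-mon-c in the describe output above), so apart from describing them, checking the PVC binding state could help rule out a storage-class issue, e.g.:

$ oc describe pod rook-ceph-osd-prepare-default-2-data-0m96dr-sgwfx -n openshift-storage
$ oc get pvc -n openshift-storage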

Comment 2 Filip Balák 2023-04-19 13:19:17 UTC
$ oc get pods -n openshift-storage -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE    IP             NODE                                        NOMINATED NODE   READINESS GATES
875729c50f952b4913290e018f605305b1f88af3654c1c625da1364b39zbxts   0/1     Completed   0          126m   10.129.2.93    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
managed-fusion-offering-catalog-vsxhd                             1/1     Running     0          129m   10.128.2.64    ip-10-0-14-242.us-east-2.compute.internal   <none>           <none>
ocs-metrics-exporter-695dc5d6dc-bznk6                             1/1     Running     0          126m   10.129.2.109   ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
ocs-operator-59cd8cd764-tfdt2                                     1/1     Running     0          126m   10.129.2.107   ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
ocs-provider-server-7dcdbf87fc-lgrwq                              1/1     Running     0          125m   10.129.2.115   ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-86278db17f8b36c54319352a92416617-zf6gt   1/1     Running     0          115m   10.0.14.242    ip-10-0-14-242.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-93d940652afee5da0612a8bdb72a3bd4-qzq87   1/1     Running     0          115m   10.0.17.184    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-mgr-a-86d6d7d46b-tskvs                                  2/2     Running     0          115m   10.0.17.184    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-a-54f64d7b95-4zmhs                                  2/2     Running     0          123m   10.0.17.184    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-5c9966b6dc-mpz7b                                  2/2     Running     0          123m   10.0.14.242    ip-10-0-14-242.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-76f95dd57c-hs4tv                                  0/2     Pending     0          120m   <none>         <none>                                      <none>           <none>
rook-ceph-operator-66fd6f59f5-xjj49                               1/1     Running     0          125m   10.129.2.116   ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-84945579cc-8vsgh                                  2/2     Running     0          115m   10.0.14.242    ip-10-0-14-242.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-58bd6b9494-94nwn                                  2/2     Running     0          115m   10.0.17.184    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-06kthp-jvkqk                 0/1     Completed   0          115m   10.0.14.242    ip-10-0-14-242.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-0clt8x-ltl4v                 0/1     Completed   0          115m   10.0.17.184    ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-0m96dr-sgwfx                 0/1     Pending     0          115m   <none>         <none>                                      <none>           <none>
rook-ceph-tools-78d8f5799-l4zx6                                   1/1     Running     0          125m   10.129.2.114   ip-10-0-17-184.us-east-2.compute.internal   <none>           <none>
$ oc get nodes -o wide
NAME                                        STATUS                     ROLES                  AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-13-203.us-east-2.compute.internal   Ready                      control-plane,master   156m   v1.25.7+eab9cc9   10.0.13.203   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-14-22.us-east-2.compute.internal    Ready                      infra,worker           134m   v1.25.7+eab9cc9   10.0.14.22    <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-14-242.us-east-2.compute.internal   Ready                      worker                 148m   v1.25.7+eab9cc9   10.0.14.242   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-16-158.us-east-2.compute.internal   Ready                      control-plane,master   156m   v1.25.7+eab9cc9   10.0.16.158   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-17-184.us-east-2.compute.internal   Ready                      worker                 145m   v1.25.7+eab9cc9   10.0.17.184   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-19-133.us-east-2.compute.internal   Ready                      infra,worker           135m   v1.25.7+eab9cc9   10.0.19.133   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-23-214.us-east-2.compute.internal   Ready                      infra,worker           135m   v1.25.7+eab9cc9   10.0.23.214   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-23-239.us-east-2.compute.internal   Ready,SchedulingDisabled   worker                 148m   v1.25.7+eab9cc9   10.0.23.239   <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
ip-10-0-23-91.us-east-2.compute.internal    Ready                      control-plane,master   156m   v1.25.7+eab9cc9   10.0.23.91    <none>        Red Hat Enterprise Linux CoreOS 412.86.202303241612-0 (Ootpa)   4.18.0-372.49.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
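
ip-10-0-23-239 is the only worker that is SchedulingDisabled; its conditions, taints, and machine-config state can be inspected with, for example:

$ oc describe node ip-10-0-23-239.us-east-2.compute.internal
$ oc get node ip-10-0-23-239.us-east-2.compute.internal -o yaml | grep machineconfiguration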

Comment 3 Filip Balák 2023-04-19 14:11:37 UTC
This looks like an infrastructure problem: one of the worker nodes is degraded (Ready,SchedulingDisabled, and the worker MachineConfigPool below reports one degraded machine). --> CLOSED
The bug will be reopened if it is reproduced again.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-abbd35f0db6c3dd53c35d134083b08a2   True      False      False      3              3                   3                     0                      159m
worker   rendered-worker-bd7605d18e1361d6e608cdd48564dac1   False     True       True       6              4                   4                     1                      159m
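
Which machine is degraded and why should be visible from the pool and node status, e.g.:

$ oc describe mcp worker
$ oc get nodes -l node-role.kubernetes.io/worker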