Bug 2156988 - Managed Service cluster with size 20 cannot be installed
Summary: Managed Service cluster with size 20 cannot be installed
Keywords:
Status: CLOSED DUPLICATE of bug 2131237
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-30 05:19 UTC by Jilju Joy
Modified: 2023-08-09 17:00 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-02 11:59:30 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSD-14421 0 None None None 2022-12-30 06:06:27 UTC

Description Jilju Joy 2022-12-30 05:19:03 UTC
Description of problem:
A Managed Services provider cluster deployed with the QE addon (v2.0.11), which contains the topology changes related to ODFMS-55, cannot finish the ODF addon installation and is stuck in the Installing state. The size parameter was set to 20.

$ rosa list addons -c jijoy-size20-pr | grep ocs-provider-qe
ocs-provider-qe             Red Hat OpenShift Data Foundation Managed Service Provider (QE)       installing


Some pods are in the Pending state due to insufficient resources.

$ oc get pods | egrep -v '(Running|Completed)'
NAME                                                              READY   STATUS      RESTARTS   AGE
alertmanager-managed-ocs-alertmanager-0                           0/2     Pending     0          110m
ocs-metrics-exporter-5dd96c885b-lf46k                             0/1     Pending     0          110m
rook-ceph-crashcollector-ip-10-0-142-35.ec2.internal-7d947nsp5c   0/1     Pending     0          110m
rook-ceph-crashcollector-ip-10-0-154-93.ec2.internal-56688gknf4   0/1     Pending     0          111m
rook-ceph-crashcollector-ip-10-0-160-72.ec2.internal-7f454ffxjm   0/1     Pending     0          109m
rook-ceph-osd-14-66c75f68dc-sxs7z                                 0/2     Pending     0          111m
rook-ceph-osd-6-55d85cccf8-kddcx                                  0/2     Pending     0          111m
rook-ceph-osd-7-74d599f4b6-pcx7j                                  0/2     Pending     0          111m
rook-ceph-osd-8-78bd587c97-mgz54                                  0/2     Pending     0          111m
rook-ceph-osd-9-5b68fc68b4-2p99f                                  0/2     Pending     0          111m

One node is in the "SchedulingDisabled" state.

$ oc get nodes
NAME                           STATUS                     ROLES          AGE    VERSION
ip-10-0-128-44.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-132-177.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-133-84.ec2.internal    Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-136-188.ec2.internal   Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-142-35.ec2.internal    Ready                      worker         132m   v1.23.12+8a6bfe4
ip-10-0-143-114.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-147-121.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-151-231.ec2.internal   Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-153-87.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-154-93.ec2.internal    Ready                      worker         131m   v1.23.12+8a6bfe4
ip-10-0-155-208.ec2.internal   Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-157-56.ec2.internal    Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-160-174.ec2.internal   Ready                      infra,worker   121m   v1.23.12+8a6bfe4
ip-10-0-160-72.ec2.internal    Ready,SchedulingDisabled   worker         120m   v1.23.12+8a6bfe4
ip-10-0-161-135.ec2.internal   Ready                      worker         134m   v1.23.12+8a6bfe4
ip-10-0-162-68.ec2.internal    Ready                      master         140m   v1.23.12+8a6bfe4
ip-10-0-164-82.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4
ip-10-0-168-82.ec2.internal    Ready                      worker         120m   v1.23.12+8a6bfe4

Events from one of the pods (rook-ceph-osd-14-66c75f68dc-sxs7z):

Warning  FailedScheduling  25m (x112 over 114m)  default-scheduler  0/18 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 9 node(s) didn't match Pod's node affinity/selector.
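The event above points at exhausted CPU/memory plus taints and node selectors. A few standard oc/kubectl diagnostics can confirm this from the cluster (pod and node names are taken from the listings above; these are generic commands, not part of the addon):

```shell
# Show the full event list for the unschedulable OSD pod
oc describe pod rook-ceph-osd-14-66c75f68dc-sxs7z

# Compare a node's allocatable CPU/memory against what is already requested
oc describe node ip-10-0-142-35.ec2.internal | grep -A 8 'Allocated resources'

# Check whether the cordoned node is marked unschedulable in its spec
oc get node ip-10-0-160-72.ec2.internal -o jsonpath='{.spec.unschedulable}{"\n"}'
```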



$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.17            Observability Operator        0.0.17            observability-operator.v0.0.17-rc         Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Installing
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Installing
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded

[Note: ocs-operator.v4.10.9 was also observed in the Failed phase, and ocs-osd-deployer.v2.0.11 in the Pending phase.]


managedocs status:

status:
    components:
      alertmanager:
        state: Pending
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
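For reference, the status fragment above can be retrieved with something like the following (the resource/namespace names used here are assumptions based on a typical deployer install, not confirmed from this cluster):

```shell
# Print the per-component states reported by the managedocs resource
# (namespace openshift-storage is an assumption for this environment)
oc get managedocs managedocs -n openshift-storage -o jsonpath='{.status.components}{"\n"}'
```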


OCS and OCP must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-size20-pr/jijoy-size20-pr_20221229T154719/logs/failed_testcase_ocs_logs_1672329486/deployment_ocs_logs/

===========================================================================
Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.11
odf-operator.v4.10.9

===========================================================================
How reproducible:
2/2

===========================================================================

Steps to Reproduce:
1. Deploy an MS provider cluster with the QE addon (v2.0.11) and the size parameter set to 20


Actual results:
Installation does not complete. Some pods remain in the Pending state due to insufficient resources, and one node is in the "SchedulingDisabled" state.


Expected results:
Installation should be successful.

Additional info:

Comment 4 Leela Venkaiah Gangavarapu 2023-01-02 11:59:30 UTC

*** This bug has been marked as a duplicate of bug 2131237 ***

