Storagecluster is stuck in Progressing state after patching it for NonResilientPools, while validating the feature "Replica 1 - Non-resilient pool" and following the steps in https://hackmd.io/@Yh4a4hAATcW2BNYBJVSx4w/BJsr4dQeo#

Version of all relevant components (if applicable):
OCP 4.12 and the latest ODF 4.12 build

oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-ppc64le-2022-11-08-015301   True        False         3h21m   Cluster version is 4.12.0-0.nightly-ppc64le-2022-11-08-015301

[root@rdr-cicd-odf-69bf-bastion-0 scripts]# oc get csv odf-operator.v4.12.0 -n openshift-storage -o yaml | grep full_version
  full_version: 4.12.0-91

ocs-storagecluster   21m   Ready   2022-11-08T12:08:27Z   4.12.0

[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc rsh rook-ceph-tools-868cff5cf6-vszmr
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable)": 10
    }
}
sh-4.4$

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? No

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 3

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy the latest ODF 4.12 build and confirm the storagecluster is in the Ready state.
2. Patch it with the command below:
   oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephNonResilientPools/enable", "value": true }]'
   storagecluster.ocs.openshift.io/ocs-storagecluster patched
3. The storagecluster is stuck in the Progressing state.

Actual results:
The storagecluster is stuck in the Progressing state; on the PowerVM cluster it is not able to create the cephblockpool.

Expected results:
cephblockpool creation should work and the storagecluster should reach the Ready state.

Additional info:
Must-gather logs are uploaded to Google Drive: https://drive.google.com/file/d/1SKFyycKJgYKuJ-8xMkDdyLkZ3qk2Mpxz/view?usp=sharing
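For anyone retracing these steps, the reconcile state can be watched with standard commands after applying the patch (a sketch; resource names as used in this cluster):

oc get storagecluster ocs-storagecluster -n openshift-storage -w
oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'
oc get cephblockpools -n openshift-storage
oc describe storagecluster ocs-storagecluster -n openshift-storage | grep -A2 'Message'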
outputs of commands requested are attached. [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get storagecluster ocs-storagecluster -o yaml apiVersion: ocs.openshift.io/v1 kind: StorageCluster metadata: annotations: cluster.ocs.openshift.io/local-devices: “true” uninstall.ocs.openshift.io/cleanup-policy: delete uninstall.ocs.openshift.io/mode: graceful creationTimestamp: “2022-11-08T12:08:27Z” finalizers: storagecluster.ocs.openshift.io generation: 3 name: ocs-storagecluster namespace: openshift-storage ownerReferences: apiVersion: odf.openshift.io/v1alpha1 kind: StorageSystem name: ocs-storagecluster-storagesystem uid: ebd7bb6e-e051-4837-b2ee-3f30e9bdc8d4 resourceVersion: “983940” uid: 5508f168-bc46-4b84-89c5-b8d64a06776c spec: arbiter: {} encryption: kms: {} externalStorage: {} flexibleScaling: true managedResources: cephBlockPools: {} cephCluster: {} cephConfig: {} cephDashboard: {} cephFilesystems: {} cephNonResilientPools: enable: true cephObjectStoreUsers: {} cephObjectStores: {} cephToolbox: {} mirroring: {} monDataDirHostPath: /var/lib/rook storageDeviceSets: config: {} count: 3 dataPVCTemplate: metadata: {} spec: accessModes: - ReadWriteOnce resources: requests: storage: 100Gi storageClassName: localblock volumeMode: Block status: {} name: ocs-deviceset-localblock placement: {} preparePlacement: {} replica: 1 resources: {} status: conditions: lastHeartbeatTime: “2022-11-09T05:55:47Z” lastTransitionTime: “2022-11-08T12:30:32Z” message: ‘Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-ceph-non-resilient-rbd]’ reason: ReconcileFailed status: “False” type: ReconcileComplete lastHeartbeatTime: “2022-11-08T12:30:30Z” lastTransitionTime: “2022-11-08T12:16:39Z” message: Reconcile completed successfully reason: ReconcileCompleted status: “True” type: Available lastHeartbeatTime: “2022-11-08T12:30:30Z” lastTransitionTime: “2022-11-08T12:16:39Z” message: Reconcile completed successfully reason: ReconcileCompleted status: “False” type: Progressing lastHeartbeatTime: “2022-11-08T12:30:30Z” lastTransitionTime: “2022-11-08T12:08:28Z” message: Reconcile completed successfully reason: ReconcileCompleted status: “False” type: Degraded lastHeartbeatTime: “2022-11-08T12:30:32Z” lastTransitionTime: “2022-11-08T12:30:31Z” message: StorageCluster is expanding reason: Expanding status: “False” type: Upgradeable externalStorage: grantedCapacity: “0” failureDomain: host failureDomainKey: kubernetes.io/hostname failureDomainValues: worker-2 worker-0 worker-1 images: ceph: actualImage: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7 desiredImage: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7 noobaaCore: actualImage: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ee1bc56dc3cf3b7f0136184668700caca835712f3252bb79c6c745e772850e25 desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ee1bc56dc3cf3b7f0136184668700caca835712f3252bb79c6c745e772850e25 noobaaDB: actualImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:f9393bef938580aa39aacf94bc56fd6f2ac515173f770c75f7fac9650eff62ba desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:f9393bef938580aa39aacf94bc56fd6f2ac515173f770c75f7fac9650eff62ba kmsServerConnection: {} nodeTopologies: labels: kubernetes.io/hostname: worker-2 worker-0 worker-1 phase: Progressing relatedObjects: apiVersion: ceph.rook.io/v1 kind: CephCluster name: ocs-storagecluster-cephcluster namespace: openshift-storage 
resourceVersion: “983639” uid: 4e3d64c0-ee31-49d7-9bfd-2d7c70a60db4 apiVersion: noobaa.io/v1alpha1 kind: NooBaa name: noobaa namespace: openshift-storage resourceVersion: “133883” uid: 85666aa5-138a-47c4-93cb-23f3f4e62b91 version: 4.12.0 [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cephblockpools NAME PHASE ocs-storagecluster-cephblockpool Ready ocs-storagecluster-cephblockpool-worker-0 Failure ocs-storagecluster-cephblockpool-worker-1 Failure ocs-storagecluster-cephblockpool-worker-2 Failure [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cephblockpools ocs-storagecluster-cephblockpool-worker-0 -o yaml apiVersion: ceph.rook.io/v1 kind: CephBlockPool metadata: creationTimestamp: “2022-11-08T12:30:31Z” finalizers: cephblockpool.ceph.rook.io generation: 1 name: ocs-storagecluster-cephblockpool-worker-0 namespace: openshift-storage ownerReferences: apiVersion: ocs.openshift.io/v1 blockOwnerDeletion: true controller: true kind: StorageCluster name: ocs-storagecluster uid: 5508f168-bc46-4b84-89c5-b8d64a06776c resourceVersion: “134070” uid: e6836b78-d825-4a74-a003-9d68df4fec39 spec: deviceClass: worker-0 enableRBDStats: true erasureCoded: codingChunks: 0 dataChunks: 0 failureDomain: host mirroring: {} quotas: {} replicated: size: 1 statusCheck: mirror: {} status: phase: Failure [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get storageclass NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE localblock kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 17h ocs-storagecluster-ceph-rbd openshift-storage.rbd.csi.ceph.com Delete Immediate true 17h ocs-storagecluster-ceph-rgw openshift-storage.ceph.rook.io/bucket Delete Immediate false 17h ocs-storagecluster-cephfs openshift-storage.cephfs.csi.ceph.com Delete Immediate true 17h openshift-storage.noobaa.io openshift-storage.noobaa.io/obc Delete Immediate false 17h [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get pods | grep osd rook-ceph-osd-0-748f6f8897-ww995 2/2 Running 0 17h rook-ceph-osd-1-7f9585774-ldg2d 2/2 Running 0 17h rook-ceph-osd-2-b8cf8cd6-z8dzb 2/2 Running 0 17h rook-ceph-osd-prepare-40704edebd520f1ff9d6d8f09e8a5545-mltnm 0/1 Completed 0 17h rook-ceph-osd-prepare-42fdf53e28e5f8f91945f982560011a3-5mlqn 0/1 Completed 0 17h rook-ceph-osd-prepare-90c417e325953a4bb1a96ea237e474e2-hl8gs 0/1 Completed 0 17h rook-ceph-osd-prepare-worker-0-data-0jtpn7-bt7ql 0/1 Completed 0 17h rook-ceph-osd-prepare-worker-1-data-0kxwn9-ld6t9 0/1 Completed 0 17h rook-ceph-osd-prepare-worker-2-data-05jq7k-sqqlj 0/1 Completed 0 17h [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get pvc | grep data ocs-deviceset-localblock-0-data-07qcxj Bound local-pv-d215812c 500Gi RWO localblock 17h ocs-deviceset-localblock-0-data-1wkjdp Bound local-pv-49015b6b 500Gi RWO localblock 17h ocs-deviceset-localblock-0-data-2crrhx Bound local-pv-3ac6d77f 500Gi RWO localblock 17h worker-0-data-0jtpn7 Bound local-pv-8a3b2355 500Gi RWO localblock 17h worker-1-data-0kxwn9 Bound local-pv-e5de8aa9 500Gi RWO localblock 17h worker-2-data-05jq7k Bound local-pv-13390437 500Gi RWO localblock 17h [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc rsh rook-ceph-tools-868cff5cf6-vszmr sh-4.4$ ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 1.46489 root default -5 0.48830 host worker-0 2 hdd 0.48830 osd.2 up 1.00000 1.00000 -7 0.48830 host worker-1 0 hdd 
0.48830 osd.0 up 1.00000 1.00000 -3 0.48830 host worker-2 1 hdd 0.48830 osd.1 up 1.00000 1.00000 sh-4.4$ sh-4.4$ ceph osd pool ls detail pool 1 ‘device_health_metrics’ replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 13 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth pool 2 ‘ocs-storagecluster-cephblockpool’ replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 468 lfor 0/465/463 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd pool 3 ‘ocs-storagecluster-cephobjectstore.rgw.buckets.index’ replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 4 ‘ocs-storagecluster-cephobjectstore.rgw.meta’ replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 5 ‘ocs-storagecluster-cephobjectstore.rgw.control’ replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 6 ‘.rgw.root’ replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 7 ‘ocs-storagecluster-cephobjectstore.rgw.otp’ replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 8 ‘ocs-storagecluster-cephobjectstore.rgw.log’ replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 9 ‘ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec’ replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 10 ‘ocs-storagecluster-cephfilesystem-metadata’ replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs pool 11 ‘ocs-storagecluster-cephobjectstore.rgw.buckets.data’ replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 469 flags hashpspool stripe_width 0 target_size_ratio 0.49 application rook-ceph-rgw pool 12 ‘ocs-storagecluster-cephfilesystem-data0’ replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 470 flags hashpspool stripe_width 0 target_size_ratio 0.49 application cephfs sh-4.4$ [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cm rook-ceph-operator-config -n openshift-storage -o yaml apiVersion: v1 kind: ConfigMap metadata: creationTimestamp: “2022-11-08T12:06:15Z” name: rook-ceph-operator-config namespace: openshift-storage resourceVersion: “111551” uid: e365976c-bc79-464d-aa04-d9816970b525 [root@rdr-cicd-odf-69bf-bastion-0 ~]# [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc describe cephcluster ocs-storagecluster-cephcluster -n openshift-storage Name: ocs-storagecluster-cephcluster Namespace: openshift-storage Labels: 
app=ocs-storagecluster Annotations: <none> API Version: ceph.rook.io/v1 Kind: CephCluster Metadata: Creation Timestamp: 2022-11-08T12:08:27Z Finalizers: cephcluster.ceph.rook.io Generation: 2 Managed Fields: API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:finalizers: .: v:“cephcluster.ceph.rook.io”: Manager: rook Operation: Update Time: 2022-11-08T12:08:27Z API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:labels: .: f:app: f:ownerReferences: .: k:{“uid”:“5508f168-bc46-4b84-89c5-b8d64a06776c”}: f:spec: .: f:cephVersion: .: f:image: f:cleanupPolicy: .: f:sanitizeDisks: f:continueUpgradeAfterChecksEvenIfNotHealthy: f:crashCollector: f:dashboard: f:dataDirHostPath: f:disruptionManagement: .: f:machineDisruptionBudgetNamespace: f:managePodBudgets: f:external: f:healthCheck: .: f:daemonHealth: .: f:mon: f:osd: f:status: f:labels: .: f:monitoring: .: f:rook.io/managedBy: f:logCollector: .: f:enabled: f:maxLogSize: f:periodicity: f:mgr: .: f:modules: f:mon: .: f:count: f:monitoring: .: f:enabled: f:network: f:placement: .: f:all: .: f:nodeAffinity: .: f:requiredDuringSchedulingIgnoredDuringExecution: .: f:nodeSelectorTerms: f:tolerations: f:arbiter: .: f:tolerations: f:mon: .: f:nodeAffinity: .: f:requiredDuringSchedulingIgnoredDuringExecution: .: f:nodeSelectorTerms: f:podAntiAffinity: .: f:requiredDuringSchedulingIgnoredDuringExecution: f:priorityClassNames: .: f:mgr: f:mon: f:osd: f:resources: .: f:mds: .: f:limits: .: f:cpu: f:memory: f:requests: .: f:cpu: f:memory: f:mgr: .: f:limits: .: f:cpu: f:memory: f:requests: .: f:cpu: f:memory: f:mon: .: f:limits: .: f:cpu: f:memory: f:requests: .: f:cpu: f:memory: f:rgw: .: f:limits: .: f:cpu: f:memory: f:requests: .: f:cpu: f:memory: f:security: .: f:kms: f:storage: .: f:storageClassDeviceSets: Manager: ocs-operator Operation: Update Time: 2022-11-08T12:30:31Z API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:status: .: f:ceph: .: f:capacity: .: f:bytesAvailable: f:bytesTotal: f:bytesUsed: f:lastUpdated: f:fsid: f:health: f:lastChecked: f:versions: .: f:mds: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:mgr: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:mon: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:osd: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:overall: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:rgw: .: f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): f:conditions: f:message: f:observedGeneration: f:phase: f:state: f:storage: .: f:deviceClasses: f:version: .: f:image: f:version: Manager: rook Operation: Update Subresource: status Time: 2022-11-09T06:55:09Z Owner References: API Version: ocs.openshift.io/v1 Block Owner Deletion: true Controller: true Kind: StorageCluster Name: ocs-storagecluster UID: 5508f168-bc46-4b84-89c5-b8d64a06776c Resource Version: 1033769 UID: 4e3d64c0-ee31-49d7-9bfd-2d7c70a60db4 Spec: Ceph Version: Image: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7 Cleanup Policy: Sanitize Disks: Continue Upgrade After Checks Even If Not Healthy: true Crash Collector: Dashboard: Data Dir Host Path: /var/lib/rook Disruption Management: Machine Disruption Budget Namespace: openshift-machine-api Manage Pod Budgets: 
true External: Health Check: Daemon Health: Mon: Osd: Status: Labels: Monitoring: rook.io/managedBy: ocs-storagecluster Log Collector: Enabled: true Max Log Size: 500Mi Periodicity: daily Mgr: Modules: Enabled: true Name: pg_autoscaler Enabled: true Name: balancer Mon: Count: 3 Monitoring: Enabled: true Network: Placement: All: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: cluster.ocs.openshift.io/openshift-storage Operator: Exists Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Arbiter: Tolerations: Effect: NoSchedule Key: node-role.kubernetes.io/master Operator: Exists Mon: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: cluster.ocs.openshift.io/openshift-storage Operator: Exists Pod Anti Affinity: Required During Scheduling Ignored During Execution: Label Selector: Match Expressions: Key: app Operator: In Values: rook-ceph-mon Topology Key: kubernetes.io/hostname Priority Class Names: Mgr: system-node-critical Mon: system-node-critical Osd: system-node-critical Resources: Mds: Limits: Cpu: 3 Memory: 8Gi Requests: Cpu: 3 Memory: 8Gi Mgr: Limits: Cpu: 1 Memory: 3Gi Requests: Cpu: 1 Memory: 3Gi Mon: Limits: Cpu: 1 Memory: 2Gi Requests: Cpu: 1 Memory: 2Gi Rgw: Limits: Cpu: 2 Memory: 4Gi Requests: Cpu: 2 Memory: 4Gi Security: Kms: Storage: Storage Class Device Sets: Count: 3 Name: ocs-deviceset-localblock-0 Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: cluster.ocs.openshift.io/openshift-storage Operator: Exists Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Topology Spread Constraints: Label Selector: Match Expressions: Key: ceph.rook.io/pvc Operator: Exists Max Skew: 1 Topology Key: kubernetes.io/hostname When Unsatisfiable: ScheduleAnyway Prepare Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: cluster.ocs.openshift.io/openshift-storage Operator: Exists Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Topology Spread Constraints: Label Selector: Match Expressions: Key: ceph.rook.io/pvc Operator: Exists Max Skew: 1 Topology Key: kubernetes.io/hostname When Unsatisfiable: ScheduleAnyway Resources: Limits: Cpu: 2 Memory: 5Gi Requests: Cpu: 2 Memory: 5Gi Volume Claim Templates: Metadata: Annotations: Crush Device Class: replicated Spec: Access Modes: ReadWriteOnce Resources: Requests: Storage: 100Gi Storage Class Name: localblock Volume Mode: Block Status: Count: 1 Name: worker-2 Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-2 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Prepare Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-2 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Resources: Limits: Cpu: 2 Memory: 5Gi Requests: Cpu: 2 Memory: 5Gi Volume Claim Templates: Metadata: Annotations: Crush Device Class: worker-2 Spec: Access Modes: ReadWriteOnce Resources: Requests: Storage: 100Gi Storage Class Name: localblock Volume Mode: 
Block Status: Count: 1 Name: worker-0 Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-0 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Prepare Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-0 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Resources: Limits: Cpu: 2 Memory: 5Gi Requests: Cpu: 2 Memory: 5Gi Volume Claim Templates: Metadata: Annotations: Crush Device Class: worker-0 Spec: Access Modes: ReadWriteOnce Resources: Requests: Storage: 100Gi Storage Class Name: localblock Volume Mode: Block Status: Count: 1 Name: worker-1 Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-1 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Prepare Placement: Node Affinity: Required During Scheduling Ignored During Execution: Node Selector Terms: Match Expressions: Key: kubernetes.io/hostname Operator: In Values: worker-1 Tolerations: Effect: NoSchedule Key: node.ocs.openshift.io/storage Operator: Equal Value: true Resources: Limits: Cpu: 2 Memory: 5Gi Requests: Cpu: 2 Memory: 5Gi Volume Claim Templates: Metadata: Annotations: Crush Device Class: worker-1 Spec: Access Modes: ReadWriteOnce Resources: Requests: Storage: 100Gi Storage Class Name: localblock Volume Mode: Block Status: Status: Ceph: Capacity: Bytes Available: 1570786713600 Bytes Total: 1610612736000 Bytes Used: 39826022400 Last Updated: 2022-11-09T06:55:07Z Fsid: b8ab4bab-769b-495a-ab68-26cf669644e4 Health: HEALTH_OK Last Checked: 2022-11-09T06:55:07Z Versions: Mds: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 2 Mgr: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 1 Mon: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 3 Osd: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 3 Overall: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 10 Rgw: ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 1 Conditions: Last Heartbeat Time: 2022-11-09T06:55:09Z Last Transition Time: 2022-11-08T12:11:35Z Message: Cluster created successfully Reason: ClusterCreated Status: True Type: Ready Message: Cluster created successfully Observed Generation: 2 Phase: Ready State: Created Storage: Device Classes: Name: hdd Version: Image: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7 Version: 16.2.10-50 Events: <none> [root@rdr-cicd-odf-69bf-bastion-0 ~]#
error details below: [root@rdr-cicd-odf-69bf-bastion-0 ~]# oc describe cephblockpool ocs-storagecluster-cephblockpool-worker-0 Name: ocs-storagecluster-cephblockpool-worker-0 Namespace: openshift-storage Labels: <none> Annotations: <none> API Version: ceph.rook.io/v1 Kind: CephBlockPool Metadata: Creation Timestamp: 2022-11-08T12:30:31Z Finalizers: cephblockpool.ceph.rook.io Generation: 1 Managed Fields: API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:ownerReferences: .: k:{"uid":"5508f168-bc46-4b84-89c5-b8d64a06776c"}: f:spec: .: f:deviceClass: f:enableRBDStats: f:erasureCoded: .: f:codingChunks: f:dataChunks: f:failureDomain: f:mirroring: f:quotas: f:replicated: .: f:size: f:statusCheck: .: f:mirror: Manager: ocs-operator Operation: Update Time: 2022-11-08T12:30:31Z API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:finalizers: .: v:"cephblockpool.ceph.rook.io": Manager: rook Operation: Update Time: 2022-11-08T12:30:34Z API Version: ceph.rook.io/v1 Fields Type: FieldsV1 fieldsV1: f:status: .: f:phase: Manager: rook Operation: Update Subresource: status Time: 2022-11-08T12:30:40Z Owner References: API Version: ocs.openshift.io/v1 Block Owner Deletion: true Controller: true Kind: StorageCluster Name: ocs-storagecluster UID: 5508f168-bc46-4b84-89c5-b8d64a06776c Resource Version: 134070 UID: e6836b78-d825-4a74-a003-9d68df4fec39 Spec: Device Class: worker-0 Enable RBD Stats: true Erasure Coded: Coding Chunks: 0 Data Chunks: 0 Failure Domain: host Mirroring: Quotas: Replicated: Size: 1 Status Check: Mirror: Status: Phase: Failure Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning ReconcileFailed 11m (x20 over 51m) rook-ceph-block-pool-controller failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool-worker-0". failed to create pool "ocs-storagecluster-cephblockpool-worker-0".: failed to create pool "ocs-storagecluster-cephblockpool-worker-0".: failed to create pool "ocs-storagecluster-cephblockpool-worker-0": failed to create replicated crush rule "ocs-storagecluster-cephblockpool-worker-0": failed to create crush rule ocs-storagecluster-cephblockpool-worker-0: exit status 22
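For reference, exit status 22 from the crush rule creation is EINVAL; one hedged reading (not confirmed in this comment) is that the rule references the per-node device class (worker-0) before any OSD carries that class. The relevant checks can be run from the toolbox, for example:

oc rsh -n openshift-storage deploy/rook-ceph-tools
ceph osd crush class ls        # device classes currently known to the cluster (here only hdd would be expected)
ceph osd crush rule ls
ceph osd crush rule create-replicated ocs-storagecluster-cephblockpool-worker-0 default host worker-0   # roughly what Rook attempts; fails with EINVAL if the class does not exist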
How are the local PVs configured? It almost seems that the local devices are mounted in multiple locations. Are multiple PVs actually pointing to the same device?

In one prepare log it shows osd.2 was provisioned:

2022-11-08T12:11:33.572775637Z 2022-11-08 12:11:33.572672 D | cephosd: {
2022-11-08T12:11:33.572775637Z   "771e58ed-e4bd-4468-80ef-971301838fe1": {
2022-11-08T12:11:33.572775637Z     "ceph_fsid": "b8ab4bab-769b-495a-ab68-26cf669644e4",
2022-11-08T12:11:33.572775637Z     "device": "/mnt/ocs-deviceset-localblock-0-data-2crrhx",
2022-11-08T12:11:33.572775637Z     "osd_id": 2,
2022-11-08T12:11:33.572775637Z     "osd_uuid": "771e58ed-e4bd-4468-80ef-971301838fe1",
2022-11-08T12:11:33.572775637Z     "type": "bluestore"
2022-11-08T12:11:33.572775637Z   }
2022-11-08T12:11:33.572775637Z }

And another OSD prepare log shows a different device, but the same osd.2 and the same other properties:

2022-11-08T12:31:47.690917126Z 2022-11-08 12:31:47.690824 D | cephosd: {
2022-11-08T12:31:47.690917126Z   "771e58ed-e4bd-4468-80ef-971301838fe1": {
2022-11-08T12:31:47.690917126Z     "ceph_fsid": "b8ab4bab-769b-495a-ab68-26cf669644e4",
2022-11-08T12:31:47.690917126Z     "device": "/mnt/worker-0-data-0jtpn7",
2022-11-08T12:31:47.690917126Z     "osd_id": 2,
2022-11-08T12:31:47.690917126Z     "osd_uuid": "771e58ed-e4bd-4468-80ef-971301838fe1",
2022-11-08T12:31:47.690917126Z     "type": "bluestore"
2022-11-08T12:31:47.690917126Z   }
2022-11-08T12:31:47.690917126Z }

Notice the different device path. There is also a 20-minute difference between running the OSDs. Could the devices have been cleaned and the OSDs attempted to be re-configured? Was this a clean install with clean local PVs?
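A quick way to check for that kind of overlap (a sketch; the jsonpath below assumes the node-affinity layout LSO puts on local PVs):

# map each local PV to its device path and node
for pv in $(oc get pv -o name | grep local-pv); do
  oc get $pv -o jsonpath='{.metadata.name}{"\t"}{.spec.local.path}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}'
done
# on each node, list devices with their WWNs; two sdX entries sharing a WWN are the same LUN over different paths
oc debug node/worker-0 -- chroot /host lsblk -o NAME,TYPE,SIZE,WWN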
Hi Travis, this is a PowerVM cluster and it was a fresh deployment done for the feature testing. Malay was able to see the same error in his environment when he created a storagecluster with the failure domain set to host.
Can we connect to your cluster? Digging through the must-gather I'm not finding any other meaningful clues.
This cluster is on PowerVM. I am not sure if you are able to connect to it; you can try with the cluster details shared earlier over chat.
Hi, we have created another new cluster, this time on PowerVS, and after patching to enable non-resilient pools the storage cluster is again stuck in the Progressing state. You can access the cluster as well: https://console-openshift-console.apps.rdr-odf412.ibm.com. I will share the credentials over IM.
Moving back to 4.12 as a potential blocker, otherwise the replica 1 feature is not working.
In the operator log, I see that the cluster and OSDs were originally created without the replica 1 feature enabled:

2022-11-08T12:11:31.566267784Z 2022-11-08 12:11:31.566181 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-1wkjdp is "completed"
2022-11-08T12:11:31.566267784Z 2022-11-08 12:11:31.566229 I | op-osd: creating OSD 1 on PVC "ocs-deviceset-localblock-0-data-1wkjdp"
2022-11-08T12:11:31.566378182Z 2022-11-08 12:11:31.566256 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-1wkjdp"
2022-11-08T12:11:32.370803330Z 2022-11-08 12:11:32.370703 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-07qcxj is "completed"
2022-11-08T12:11:32.370803330Z 2022-11-08 12:11:32.370768 I | op-osd: creating OSD 0 on PVC "ocs-deviceset-localblock-0-data-07qcxj"
2022-11-08T12:11:32.370930142Z 2022-11-08 12:11:32.370801 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-07qcxj"
2022-11-08T12:11:33.637998416Z 2022-11-08 12:11:33.637944 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-2crrhx is "completed"
2022-11-08T12:11:33.637998416Z 2022-11-08 12:11:33.637974 I | op-osd: creating OSD 2 on PVC "ocs-deviceset-localblock-0-data-2crrhx"
2022-11-08T12:11:33.638088923Z 2022-11-08 12:11:33.637991 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-2crrhx"

Then 20 minutes later, the cephcluster CR was updated with the replica 1 storageClassDeviceSets:

2022-11-08T12:30:31.547876784Z 2022-11-08 12:30:31.547778 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". diff= v1.ClusterSpec{

While I would like to get this cluster update scenario working, please just create the replica 1 configuration from the start to see if that will get a working cluster. Later we can try the scenario of updating an existing cluster to add replica 1 OSDs.
Hi, can you please share the steps? How do we enable the replica 1 configuration before deploying ODF?
Malay: as long as the StorageCluster CR is created initially with non-resilient pools, we shouldn't see the cephcluster CR updated later like this, right? I'd like to confirm this really is a clean install with non-resilient pools before investigating the upgrade case. Narayanaswamy: how are you creating the ODF cluster? From the UI, or by creating the StorageCluster CR?
I think Narayanswami is creating the storagecluster from the UI (most people do it this way), and later the storagecluster is patched to enable non-resilient pools. If we want a replica 1 configuration from the start, we have to set the value in a StorageCluster CR and then create the CR directly, as sketched below.
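For illustration, a minimal sketch of such a CR with cephNonResilientPools enabled from the start (field values copied from the StorageCluster spec pasted earlier in this bug; storage size and storage class would need to match the target environment):

cat <<EOF | oc create -f -
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  flexibleScaling: true
  monDataDirHostPath: /var/lib/rook
  managedResources:
    cephNonResilientPools:
      enable: true
  storageDeviceSets:
  - name: ocs-deviceset-localblock
    count: 3
    replica: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: localblock
        volumeMode: Block
EOF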
Deployed ODF using the UI, as Malay mentioned.
Could I get a connection to the repro to take a look again? The original cluster looks like it is no longer running since it has been a while. Even better if it's possible to get a connection to the cluster before the non-resilient setting is changed to see the state before and after we apply that setting. Thanks!
The original cluster is not available. I have created a new OCP 4.12 cluster; nothing else is deployed. You can connect to it and check. Let me know if I need to perform any steps.

https://console-openshift-console.apps.rdr-nara3.ibm.com

The hosts file needs to be updated with the entry below:
158.176.146.114 api.rdr-nara3.ibm.com console-openshift-console.apps.rdr-nara3.ibm.com integrated-oauth-server-openshift-authentication.apps.rdr-nara3.ibm.com oauth-openshift.apps.rdr-nara3.ibm.com prometheus-k8s-openshift-monitoring.apps.rdr-nara3.ibm.com grafana-openshift-monitoring.apps.rdr-nara3.ibm.com example.apps.rdr-nara3.ibm.com

The kubeadmin password was shared over chat with Travis & Malay.
When I tried to connect to the console in my browser, it warned me about the insecure connection; I told it to go to the site anyway, and then it couldn't connect. So it seems there was an initial connection at least. Let's try again next week.
Looking in detail at the osd prepare logs of a live cluster, with Malay and Narayan we were able to repro independently from the non-resilient cluster. All we had to do was create two OSDs per node. In that case, the OSDs conflicted with each other and failed to come up, with the same symptoms as the original repro for the non-resilient cluster. The symptom is that the second OSD to be prepare discovers the device already to be configured as the first OSD provisioned, and returns that OSD ID instead of provisioning a new OSD. Thus, only one OSD remains provisioned per node even though two were requested per node. Let's look in detail at two of the OSD prepare jobs that show the conflict. We will call them A and B: A: rook-ceph-osd-prepare-bf01f3ae932a303a855e7bf451e01629-bvr5n 0/1 Completed 0 3h36m B: rook-ceph-osd-prepare-bfc9ab8ba369e0d62d70e7fcabd0f4e6-7gqsj 0/1 Completed 0 3h37m OSD A: The log shows that the path to the device mounted to the pod is: /mnt/ocs-deviceset-localblock-0-data-07ckv5 The PV name is local-pv-6fe167c7 The PV spec shows the local.path of /mnt/local-storage/localblock/sdc This PV has node affinity to worker-0 OSD B: The log shows that the path to the device mounted to the pod is: /mnt/ocs-deviceset-localblock-0-data-4hjn82 The PV name is local-pv-a933fcc1 The PV spec shows the local.path of /mnt/local-storage/localblock/sda This PV also has node affinity to worker-0 So far, everything looks independent between these two OSDs as expected. The key here is that the dev links are showing some of the same SCSI paths under the covers. In particular, notice that these two devlinks are the same for both devices: /dev/disk/by-id/scsi-3600507681081818c2000000000008c50 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 Dev links for OSD A: 2022-12-08 16:49:33.539589 D | exec: Running command: udevadm info --query=property /dev/sdc 2022-12-08 16:49:33.546408 D | sys: udevadm info output:"DEVLINKS= /dev/disk/by-id/scsi-3600507681081818c2000000000008c50 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 /dev/disk/by-id/wwn-0x600507681081818c2000000000008c50 /dev/disk/by-path/fc-0xc050760c345122ae-0x5005076810243184-lun-0 /dev/disk/by-path/fc-0x5005076810243184-lun-0\nDEVNAME=/dev/sdc\nDEVPATH=/devices/vio/30000004/host1/rport-1:0-1/target1:0:1/1:0:1:0/block/sdc\nDEVTYPE=disk\nFC_INITIATOR_WWPN=0xc050760c345122ae\nFC_TARGET_LUN=0\nFC_TARGET_WWPN=0x5005076810243184\nID_BUS=scsi\nID_MODEL=2145\nID_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nID_PATH=fc-0x5005076810243184-lun-0\nID_PATH_TAG=fc-0x5005076810243184-lun-0\nID_REVISION=0000\nID_SCSI=1\nID_SCSI_INQUIRY=1\nID_SCSI_SERIAL=020420606308XX00\nID_SERIAL=3600507681081818c2000000000008c50\nID_SERIAL_SHORT=600507681081818c2000000000008c50\nID_TARGET_PORT=0\nID_TYPE=disk\nID_VENDOR=IBM\nID_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nID_WWN=0x600507681081818c\nID_WWN_VENDOR_EXTENSION=0x2000000000008c50\nID_WWN_WITH_EXTENSION=0x600507681081818c2000000000008c50\nMAJOR=8\nMINOR=32\nSCSI_IDENT_LUN_NAA_REGEXT=600507681081818c2000000000008c50\nSCSI_IDENT_PORT_RELATIVE=135\nSCSI_IDENT_PORT_TARGET_PORT_GROUP=0x0\nSCSI_IDENT_PORT_VENDOR=600507681081818c2000000000000001\nSCSI_IDENT_SERIAL=020420606308XX00\nSCSI_MODEL=2145\nSCSI_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nSCSI_REVISION=0000\nSCSI_TPGS=1\nSCSI_TYPE=disk\nSCSI_VENDOR=IBM\nSCSI_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=11331746" Dev links for OSD B: 2022-12-08 16:48:41.021847 D 
| exec: Running command: udevadm info --query=property /dev/sda 2022-12-08 16:48:41.027232 D | sys: udevadm info output:"DEVLINKS= /dev/disk/by-id/scsi-3600507681081818c2000000000008c50 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 /dev/disk/by-path/fc-0x5005076810243152-lun-0 /dev/disk/by-path/fc-0xc050760c345122ae-0x5005076810243152-lun-0 /dev/disk/by-id/wwn-0x600507681081818c2000000000008c50\nDEVNAME=/dev/sda\nDEVPATH=/devices/vio/30000004/host1/rport-1:0-0/target1:0:0/1:0:0:0/block/sda\nDEVTYPE=disk\nFC_INITIATOR_WWPN=0xc050760c345122ae\nFC_TARGET_LUN=0\nFC_TARGET_WWPN=0x5005076810243152\nID_BUS=scsi\nID_MODEL=2145\nID_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nID_PATH=fc-0x5005076810243152-lun-0\nID_PATH_TAG=fc-0x5005076810243152-lun-0\nID_REVISION=0000\nID_SCSI=1\nID_SCSI_INQUIRY=1\nID_SCSI_SERIAL=020420606308XX00\nID_SERIAL=3600507681081818c2000000000008c50\nID_SERIAL_SHORT=600507681081818c2000000000008c50\nID_TARGET_PORT=1\nID_TYPE=disk\nID_VENDOR=IBM\nID_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nID_WWN=0x600507681081818c\nID_WWN_VENDOR_EXTENSION=0x2000000000008c50\nID_WWN_WITH_EXTENSION=0x600507681081818c2000000000008c50\nMAJOR=8\nMINOR=0\nSCSI_IDENT_LUN_NAA_REGEXT=600507681081818c2000000000008c50\nSCSI_IDENT_PORT_RELATIVE=2183\nSCSI_IDENT_PORT_TARGET_PORT_GROUP=0x1\nSCSI_IDENT_PORT_VENDOR=600507681081818c2000000000000002\nSCSI_IDENT_SERIAL=020420606308XX00\nSCSI_MODEL=2145\nSCSI_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nSCSI_REVISION=0000\nSCSI_TPGS=1\nSCSI_TYPE=disk\nSCSI_VENDOR=IBM\nSCSI_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=11335171" Therefore, this is an incorrectly configured environment. We can't be using SCSI disks that are conflicting under the covers. Please remove the SCSI overlap and then see if this will repro. This does not appear to be an ODF bug.
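For reference, the same overlap can be spotted directly on a node without digging through the prepare logs; a sketch of the checks (run via oc debug node/<node> -- chroot /host):

lsblk -o NAME,TYPE,SIZE,WWN,SERIAL                         # sdX entries sharing a WWN/serial are the same LUN reached over different paths
udevadm info --query=property /dev/sda | grep ID_SERIAL=   # compare ID_SERIAL across the suspect devices, as in the output above
multipath -ll                                              # if dm-multipath is configured, shows which sdX paths belong to one mpath device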
Removing blocker for 4.12 while finalizing investigation.
Thanks for the update, Travis. PowerVS clusters come with multipath; are you saying multipath is not supported?
(In reply to narayanspg from comment #22)
> Thanks for the update, Travis. PowerVS clusters come with multipath; are you
> saying multipath is not supported?

LSO will need to be configured to create local PVs such that multiple PVs are not created pointing to the same device. On the node I was looking at, sda and sdc were both pointing to the same disk, so the OSDs were conflicting. We will need to find filters for LSO to create the correct PVs.
Malay and I tested a LocalVolume instead of a LocalVolumeSet, with the disk/by-id paths specified in the yaml for each worker node as below:

apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  logLevel: Normal
  managementState: Managed
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0
        - worker-1
        - worker-2
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f96
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f95
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f97
    storageClassName: localblock
    volumeMode: Block

The OSDs were in the Pending state after patching to enable cephNonResilientPools, and the storagecluster was in the Progressing state.
Were the OSD PVCs bound to these PVs and did the OSD prepare jobs run? If so, please share their logs.
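For reference, the requested information can be collected with standard commands (a sketch; the pod name is a placeholder):

oc get pvc -n openshift-storage | grep data
oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
oc logs -n openshift-storage <rook-ceph-osd-prepare-pod> --all-containers > osd-prepare.log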
This cluster was destroyed, as it was giving different results with the disk/by-id usage. If required, I can recreate the cluster to simulate the scenario and share it.
Yes please, we need a repro to confirm if there is still a similar configuration issue as described in Comment 19, or if there is another issue here. So far, we can only repro in this mpath environment so it appears environmental.
I have created new OCP cluster - https://console-openshift-console.apps.rdr-res2.ibm.com below are the disk details on the worker nodes: [core@lon06-worker-0 ~]$ ls -l /dev/disk/by-id/* lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c2000000000009190 -> ../../sdp lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f -> ../../sdm lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part1 -> ../../sdm1 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part2 -> ../../sdk2 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part3 -> ../../sdm3 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part4 -> ../../sdm4 lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdm lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdm1 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdm2 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdm3 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdm4 lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c2000000000009190 -> ../../sdp lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f -> ../../sdm lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part1 -> ../../sdm1 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part2 -> ../../sdm2 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part3 -> ../../sdm3 lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part4 -> ../../sdm4 [core@lon06-worker-0 ~]$ [core@lon06-worker-0 ~]$ # [core@lon06-worker-1 ~]$ ls -l /dev/disk/by-id/* lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009191 -> ../../sdo lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199 -> ../../sdn lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part1 -> ../../sdp1 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part2 -> ../../sdh2 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part3 -> ../../sdh3 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part4 -> ../../sdn4 lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdo lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdp1 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdh2 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdh3 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdn4 lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009191 -> ../../sdo lrwxrwxrwx. 
1 root root 9 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199 -> ../../sdn lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part1 -> ../../sdp1 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part2 -> ../../sdh2 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part3 -> ../../sdh3 lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part4 -> ../../sdn4 [core@lon06-worker-1 ~]$ # [core@lon06-worker-2 ~]$ ls -l /dev/disk/by-id/* lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c2000000000009192 -> ../../sdp lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e -> ../../sdo lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part1 -> ../../sdo1 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part2 -> ../../sdo2 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part3 -> ../../sdo3 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part4 -> ../../sdo4 lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdo lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdo1 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdo2 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdo3 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdo4 lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c2000000000009192 -> ../../sdp lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e -> ../../sdo lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part1 -> ../../sdo1 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part2 -> ../../sdo2 lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part3 -> ../../sdo3 lrwxrwxrwx. 
1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part4 -> ../../sdo4 [core@lon06-worker-2 ~]$ [core@lon06-worker-2 ~]$ lsblk -o name,type,wwn NAME TYPE WWN sda disk 0x600507681081818c200000000000919e ├─sda1 part 0x600507681081818c200000000000919e ├─sda2 part 0x600507681081818c200000000000919e ├─sda3 part 0x600507681081818c200000000000919e └─sda4 part 0x600507681081818c200000000000919e sdb disk 0x600507681081818c2000000000009192 sdc disk 0x600507681081818c200000000000919e ├─sdc1 part 0x600507681081818c200000000000919e ├─sdc2 part 0x600507681081818c200000000000919e ├─sdc3 part 0x600507681081818c200000000000919e └─sdc4 part 0x600507681081818c200000000000919e sdd disk 0x600507681081818c2000000000009192 sde disk 0x600507681081818c200000000000919e ├─sde1 part 0x600507681081818c200000000000919e ├─sde2 part 0x600507681081818c200000000000919e ├─sde3 part 0x600507681081818c200000000000919e └─sde4 part 0x600507681081818c200000000000919e sdf disk 0x600507681081818c2000000000009192 sdg disk 0x600507681081818c200000000000919e ├─sdg1 part 0x600507681081818c200000000000919e ├─sdg2 part 0x600507681081818c200000000000919e ├─sdg3 part 0x600507681081818c200000000000919e └─sdg4 part 0x600507681081818c200000000000919e sdh disk 0x600507681081818c2000000000009192 sdi disk 0x600507681081818c200000000000919e ├─sdi1 part 0x600507681081818c200000000000919e ├─sdi2 part 0x600507681081818c200000000000919e ├─sdi3 part 0x600507681081818c200000000000919e └─sdi4 part 0x600507681081818c200000000000919e sdj disk 0x600507681081818c2000000000009192 sdk disk 0x600507681081818c200000000000919e ├─sdk1 part 0x600507681081818c200000000000919e ├─sdk2 part 0x600507681081818c200000000000919e ├─sdk3 part 0x600507681081818c200000000000919e └─sdk4 part 0x600507681081818c200000000000919e sdl disk 0x600507681081818c2000000000009192 sdm disk 0x600507681081818c200000000000919e ├─sdm1 part 0x600507681081818c200000000000919e ├─sdm2 part 0x600507681081818c200000000000919e ├─sdm3 part 0x600507681081818c200000000000919e └─sdm4 part 0x600507681081818c200000000000919e sdn disk 0x600507681081818c2000000000009192 sdo disk 0x600507681081818c200000000000919e ├─sdo1 part 0x600507681081818c200000000000919e ├─sdo2 part 0x600507681081818c200000000000919e ├─sdo3 part 0x600507681081818c200000000000919e └─sdo4 part 0x600507681081818c200000000000919e sdp disk 0x600507681081818c2000000000009192 I will be creating(we can do over call today) localvolume with yaml updated as below. #######if using UI then dont use line from cat cat <<EOF | oc create -f - apiVersion: local.storage.openshift.io/v1 kind: LocalVolume metadata: name: localblock namespace: openshift-local-storage spec: logLevel: Normal managementState: Managed nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - worker-0 - worker-1 - worker-2 storageClassDevices: - devicePaths: - /dev/disk/by-id/scsi-3600507681081818c2000000000009190 - /dev/disk/by-id/scsi-3600507681081818c2000000000009191 - /dev/disk/by-id/scsi-3600507681081818c2000000000009192 storageClassName: localblock volumeMode: Block EOF
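Once that LocalVolume is created, one PV per listed disk should show up before ODF is deployed; a quick sanity check (standard commands, names as used here):

oc get sc localblock
oc get pv | grep localblock
oc get localvolume localblock -n openshift-local-storage -o yaml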
As discussed, we need at least two PVs on each node for the scenario of non-resilient pools. One PV is for the replicated pools, and one PV per node for the non-resilient pools. In this repro there were only three PVs, so the remaining OSDs would remain pending until more PVs are available.
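For illustration, that means the LocalVolume from the previous comment would need to list two dedicated disks per worker, six in total. A sketch with placeholder by-id values (the real IDs must be read from each node and must not be multipath aliases of the same LUN):

cat <<EOF | oc create -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  managementState: Managed
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0
        - worker-1
        - worker-2
  storageClassDevices:
  - devicePaths:
    # one disk per worker for the replicated OSDs plus one per worker for the
    # non-resilient OSDs; the scsi-... IDs below are placeholders
    - /dev/disk/by-id/scsi-<worker-0-disk-1>
    - /dev/disk/by-id/scsi-<worker-0-disk-2>
    - /dev/disk/by-id/scsi-<worker-1-disk-1>
    - /dev/disk/by-id/scsi-<worker-1-disk-2>
    - /dev/disk/by-id/scsi-<worker-2-disk-1>
    - /dev/disk/by-id/scsi-<worker-2-disk-2>
    storageClassName: localblock
    volumeMode: Block
EOF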
Malay and I tested again today with an additional disk on each worker node. The storagecluster-stuck-after-patching issue is resolved now; since the feature requires 6 PVs, the OSDs had been stuck in the Pending state. We should document this requirement for an additional disk per node for this feature. We created a new deployment and added new disks for each node; then we got 6 PVs created and the storagecluster went to the Ready state. I started validating the feature and see an issue with the volume mount, which I am discussing with Malay.
Hi Travis/Malay, please let us know if we have to raise a new BZ for the pod staying in the Pending state. I have shared the cluster details with Malay for debugging. While validating "Replica 1 - Non-resilient pool - Dev Preview": when we create a Pod with a volume mount using a PVC created with the non-resilient storageclass, the Pod stays in the Pending state forever, and the PVC the Pod refers to also stays in the Pending state forever.
The original BZ was that the cluster was stuck in Progressing, and that is no longer the case, right? I'd recommend we close this BZ and open a new one.

Some ideas to troubleshoot the latest issue (a few example commands follow below):
- Does the storage class use the correct pool?
- Does the pool have the correct deviceClass applied? (look at the CRUSH rules for the pool)
- Do the OSDs have the expected deviceClasses? (ceph osd tree)
- If the PV is stuck pending, see the CSI troubleshooting guide [1] or let's ask someone from the CSI team to take a look.

[1] https://rook.io/docs/rook/latest/Troubleshooting/ceph-csi-common-issues/
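A sketch of those checks against this cluster (the storage class name is taken from the reconcile error earlier in this bug; run the ceph commands from the rook-ceph-tools pod):

oc get storageclass ocs-storagecluster-ceph-non-resilient-rbd -o yaml    # confirm the pool/topology parameters point at the per-node pools
oc rsh -n openshift-storage deploy/rook-ceph-tools
ceph osd tree                                                            # the non-resilient OSDs should carry the worker-0/1/2 device classes
ceph osd pool ls detail                                                  # the ocs-storagecluster-cephblockpool-worker-* pools should exist with size 1
ceph osd crush rule dump ocs-storagecluster-cephblockpool-worker-0       # the rule should be constrained to the matching device class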
After adding the additional disks, the OSDs were created and the storagecluster went to the Ready state. Documentation should be added for this feature to call out the requirement for an additional disk per node.