Bug 2141265
| Summary: | Storagecluster is stuck in Progressing state after patching it for NonResilientPools | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | narayanspg <ngowda> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | madam, mparida, muagarwa, ngowda, ocs-bugs, odf-bz-bot, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | ppc64le | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-12-21 06:07:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
narayanspg
2022-11-09 11:31:14 UTC
Outputs of the requested commands are attached.
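(For context, enabling the feature on a running cluster amounts to a patch along these lines; a sketch inferred from the spec.managedResources field visible in the dump below, since the exact command used is not recorded in this report.)

    # Hypothetical reconstruction of the patch that enables non-resilient
    # (replica 1) pools; the field path matches spec.managedResources in
    # the StorageCluster dump below.
    oc patch storagecluster ocs-storagecluster -n openshift-storage \
        --type merge \
        -p '{"spec":{"managedResources":{"cephNonResilientPools":{"enable":true}}}}'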
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get storagecluster ocs-storagecluster -o yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    cluster.ocs.openshift.io/local-devices: "true"
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  creationTimestamp: "2022-11-08T12:08:27Z"
  finalizers:
  - storagecluster.ocs.openshift.io
  generation: 3
  name: ocs-storagecluster
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: odf.openshift.io/v1alpha1
    kind: StorageSystem
    name: ocs-storagecluster-storagesystem
    uid: ebd7bb6e-e051-4837-b2ee-3f30e9bdc8d4
  resourceVersion: "983940"
  uid: 5508f168-bc46-4b84-89c5-b8d64a06776c
spec:
  arbiter: {}
  encryption:
    kms: {}
  externalStorage: {}
  flexibleScaling: true
  managedResources:
    cephBlockPools: {}
    cephCluster: {}
    cephConfig: {}
    cephDashboard: {}
    cephFilesystems: {}
    cephNonResilientPools:
      enable: true
    cephObjectStoreUsers: {}
    cephObjectStores: {}
    cephToolbox: {}
  mirroring: {}
  monDataDirHostPath: /var/lib/rook
  storageDeviceSets:
  - config: {}
    count: 3
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: localblock
        volumeMode: Block
      status: {}
    name: ocs-deviceset-localblock
    placement: {}
    preparePlacement: {}
    replica: 1
    resources: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-11-09T05:55:47Z"
    lastTransitionTime: "2022-11-08T12:30:32Z"
    message: 'Error while reconciling: some StorageClasses were skipped while waiting
      for pre-requisites to be met: [ocs-storagecluster-ceph-non-resilient-rbd]'
    reason: ReconcileFailed
    status: "False"
    type: ReconcileComplete
  - lastHeartbeatTime: "2022-11-08T12:30:30Z"
    lastTransitionTime: "2022-11-08T12:16:39Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "True"
    type: Available
  - lastHeartbeatTime: "2022-11-08T12:30:30Z"
    lastTransitionTime: "2022-11-08T12:16:39Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2022-11-08T12:30:30Z"
    lastTransitionTime: "2022-11-08T12:08:28Z"
    message: Reconcile completed successfully
    reason: ReconcileCompleted
    status: "False"
    type: Degraded
  - lastHeartbeatTime: "2022-11-08T12:30:32Z"
    lastTransitionTime: "2022-11-08T12:30:31Z"
    message: StorageCluster is expanding
    reason: Expanding
    status: "False"
    type: Upgradeable
  externalStorage:
    grantedCapacity: "0"
  failureDomain: host
  failureDomainKey: kubernetes.io/hostname
  failureDomainValues:
  - worker-2
  - worker-0
  - worker-1
  images:
    ceph:
      actualImage: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7
      desiredImage: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7
    noobaaCore:
      actualImage: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ee1bc56dc3cf3b7f0136184668700caca835712f3252bb79c6c745e772850e25
      desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ee1bc56dc3cf3b7f0136184668700caca835712f3252bb79c6c745e772850e25
    noobaaDB:
      actualImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:f9393bef938580aa39aacf94bc56fd6f2ac515173f770c75f7fac9650eff62ba
      desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:f9393bef938580aa39aacf94bc56fd6f2ac515173f770c75f7fac9650eff62ba
  kmsServerConnection: {}
  nodeTopologies:
    labels:
      kubernetes.io/hostname:
      - worker-2
      - worker-0
      - worker-1
  phase: Progressing
  relatedObjects:
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    namespace: openshift-storage
    resourceVersion: "983639"
    uid: 4e3d64c0-ee31-49d7-9bfd-2d7c70a60db4
  - apiVersion: noobaa.io/v1alpha1
    kind: NooBaa
    name: noobaa
    namespace: openshift-storage
    resourceVersion: "133883"
    uid: 85666aa5-138a-47c4-93cb-23f3f4e62b91
  version: 4.12.0
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cephblockpools
NAME PHASE
ocs-storagecluster-cephblockpool Ready
ocs-storagecluster-cephblockpool-worker-0 Failure
ocs-storagecluster-cephblockpool-worker-1 Failure
ocs-storagecluster-cephblockpool-worker-2 Failure
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cephblockpools ocs-storagecluster-cephblockpool-worker-0 -o yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  creationTimestamp: "2022-11-08T12:30:31Z"
  finalizers:
  - cephblockpool.ceph.rook.io
  generation: 1
  name: ocs-storagecluster-cephblockpool-worker-0
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ocs.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: StorageCluster
    name: ocs-storagecluster
    uid: 5508f168-bc46-4b84-89c5-b8d64a06776c
  resourceVersion: "134070"
  uid: e6836b78-d825-4a74-a003-9d68df4fec39
spec:
  deviceClass: worker-0
  enableRBDStats: true
  erasureCoded:
    codingChunks: 0
    dataChunks: 0
  failureDomain: host
  mirroring: {}
  quotas: {}
  replicated:
    size: 1
  statusCheck:
    mirror: {}
status:
  phase: Failure
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
localblock kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 17h
ocs-storagecluster-ceph-rbd openshift-storage.rbd.csi.ceph.com Delete Immediate true 17h
ocs-storagecluster-ceph-rgw openshift-storage.ceph.rook.io/bucket Delete Immediate false 17h
ocs-storagecluster-cephfs openshift-storage.cephfs.csi.ceph.com Delete Immediate true 17h
openshift-storage.noobaa.io openshift-storage.noobaa.io/obc Delete Immediate false 17h
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get pods | grep osd
rook-ceph-osd-0-748f6f8897-ww995 2/2 Running 0 17h
rook-ceph-osd-1-7f9585774-ldg2d 2/2 Running 0 17h
rook-ceph-osd-2-b8cf8cd6-z8dzb 2/2 Running 0 17h
rook-ceph-osd-prepare-40704edebd520f1ff9d6d8f09e8a5545-mltnm 0/1 Completed 0 17h
rook-ceph-osd-prepare-42fdf53e28e5f8f91945f982560011a3-5mlqn 0/1 Completed 0 17h
rook-ceph-osd-prepare-90c417e325953a4bb1a96ea237e474e2-hl8gs 0/1 Completed 0 17h
rook-ceph-osd-prepare-worker-0-data-0jtpn7-bt7ql 0/1 Completed 0 17h
rook-ceph-osd-prepare-worker-1-data-0kxwn9-ld6t9 0/1 Completed 0 17h
rook-ceph-osd-prepare-worker-2-data-05jq7k-sqqlj 0/1 Completed 0 17h
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get pvc | grep data
ocs-deviceset-localblock-0-data-07qcxj Bound local-pv-d215812c 500Gi RWO localblock 17h
ocs-deviceset-localblock-0-data-1wkjdp Bound local-pv-49015b6b 500Gi RWO localblock 17h
ocs-deviceset-localblock-0-data-2crrhx Bound local-pv-3ac6d77f 500Gi RWO localblock 17h
worker-0-data-0jtpn7 Bound local-pv-8a3b2355 500Gi RWO localblock 17h
worker-1-data-0kxwn9 Bound local-pv-e5de8aa9 500Gi RWO localblock 17h
worker-2-data-05jq7k Bound local-pv-13390437 500Gi RWO localblock 17h
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc rsh rook-ceph-tools-868cff5cf6-vszmr
sh-4.4$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.46489 root default
-5 0.48830 host worker-0
2 hdd 0.48830 osd.2 up 1.00000 1.00000
-7 0.48830 host worker-1
0 hdd 0.48830 osd.0 up 1.00000 1.00000
-3 0.48830 host worker-2
1 hdd 0.48830 osd.1 up 1.00000 1.00000
sh-4.4$
sh-4.4$ ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 13 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth
pool 2 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 468 lfor 0/465/463 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd
pool 3 'ocs-storagecluster-cephobjectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 4 'ocs-storagecluster-cephobjectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 5 'ocs-storagecluster-cephobjectstore.rgw.control' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 6 '.rgw.root' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 7 'ocs-storagecluster-cephobjectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 8 'ocs-storagecluster-cephobjectstore.rgw.log' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 9 'ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 10 'ocs-storagecluster-cephfilesystem-metadata' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 38 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 11 'ocs-storagecluster-cephobjectstore.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 469 flags hashpspool stripe_width 0 target_size_ratio 0.49 application rook-ceph-rgw
pool 12 'ocs-storagecluster-cephfilesystem-data0' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 470 flags hashpspool stripe_width 0 target_size_ratio 0.49 application cephfs
sh-4.4$
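Worth noting from the outputs above: ceph osd tree reports all three OSDs with device class "hdd", while the failing CephBlockPools request per-node device classes ("worker-0", etc., per the deviceClass field shown earlier). A few toolbox commands to inspect this (a sketch; these are standard Ceph CLI commands):

    # List the device classes Ceph actually knows about; a missing
    # "worker-0" class here would explain the pool creation failures.
    ceph osd crush class ls
    # Show which OSDs belong to each class, including shadow trees.
    ceph osd crush tree --show-shadow
    # Dump the CRUSH rules to see which device class each rule targets.
    ceph osd crush rule dump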
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc get cm rook-ceph-operator-config -n openshift-storage -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: "2022-11-08T12:06:15Z"
  name: rook-ceph-operator-config
  namespace: openshift-storage
  resourceVersion: "111551"
  uid: e365976c-bc79-464d-aa04-d9816970b525
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc describe cephcluster ocs-storagecluster-cephcluster -n openshift-storage
Name: ocs-storagecluster-cephcluster
Namespace: openshift-storage
Labels: app=ocs-storagecluster
Annotations: <none>
API Version: ceph.rook.io/v1
Kind: CephCluster
Metadata:
Creation Timestamp: 2022-11-08T12:08:27Z
Finalizers:
cephcluster.ceph.rook.io
Generation: 2
Managed Fields:
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"cephcluster.ceph.rook.io":
Manager: rook
Operation: Update
Time: 2022-11-08T12:08:27Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.:
f:app:
f:ownerReferences:
.:
k:{"uid":"5508f168-bc46-4b84-89c5-b8d64a06776c"}:
f:spec:
.:
f:cephVersion:
.:
f:image:
f:cleanupPolicy:
.:
f:sanitizeDisks:
f:continueUpgradeAfterChecksEvenIfNotHealthy:
f:crashCollector:
f:dashboard:
f:dataDirHostPath:
f:disruptionManagement:
.:
f:machineDisruptionBudgetNamespace:
f:managePodBudgets:
f:external:
f:healthCheck:
.:
f:daemonHealth:
.:
f:mon:
f:osd:
f:status:
f:labels:
.:
f:monitoring:
.:
f:rook.io/managedBy:
f:logCollector:
.:
f:enabled:
f:maxLogSize:
f:periodicity:
f:mgr:
.:
f:modules:
f:mon:
.:
f:count:
f:monitoring:
.:
f:enabled:
f:network:
f:placement:
.:
f:all:
.:
f:nodeAffinity:
.:
f:requiredDuringSchedulingIgnoredDuringExecution:
.:
f:nodeSelectorTerms:
f:tolerations:
f:arbiter:
.:
f:tolerations:
f:mon:
.:
f:nodeAffinity:
.:
f:requiredDuringSchedulingIgnoredDuringExecution:
.:
f:nodeSelectorTerms:
f:podAntiAffinity:
.:
f:requiredDuringSchedulingIgnoredDuringExecution:
f:priorityClassNames:
.:
f:mgr:
f:mon:
f:osd:
f:resources:
.:
f:mds:
.:
f:limits:
.:
f:cpu:
f:memory:
f:requests:
.:
f:cpu:
f:memory:
f:mgr:
.:
f:limits:
.:
f:cpu:
f:memory:
f:requests:
.:
f:cpu:
f:memory:
f:mon:
.:
f:limits:
.:
f:cpu:
f:memory:
f:requests:
.:
f:cpu:
f:memory:
f:rgw:
.:
f:limits:
.:
f:cpu:
f:memory:
f:requests:
.:
f:cpu:
f:memory:
f:security:
.:
f:kms:
f:storage:
.:
f:storageClassDeviceSets:
Manager: ocs-operator
Operation: Update
Time: 2022-11-08T12:30:31Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:ceph:
.:
f:capacity:
.:
f:bytesAvailable:
f:bytesTotal:
f:bytesUsed:
f:lastUpdated:
f:fsid:
f:health:
f:lastChecked:
f:versions:
.:
f:mds:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:mgr:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:mon:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:osd:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:overall:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:rgw:
.:
f:ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable):
f:conditions:
f:message:
f:observedGeneration:
f:phase:
f:state:
f:storage:
.:
f:deviceClasses:
f:version:
.:
f:image:
f:version:
Manager: rook
Operation: Update
Subresource: status
Time: 2022-11-09T06:55:09Z
Owner References:
API Version: ocs.openshift.io/v1
Block Owner Deletion: true
Controller: true
Kind: StorageCluster
Name: ocs-storagecluster
UID: 5508f168-bc46-4b84-89c5-b8d64a06776c
Resource Version: 1033769
UID: 4e3d64c0-ee31-49d7-9bfd-2d7c70a60db4
Spec:
Ceph Version:
Image: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7
Cleanup Policy:
Sanitize Disks:
Continue Upgrade After Checks Even If Not Healthy: true
Crash Collector:
Dashboard:
Data Dir Host Path: /var/lib/rook
Disruption Management:
Machine Disruption Budget Namespace: openshift-machine-api
Manage Pod Budgets: true
External:
Health Check:
Daemon Health:
Mon:
Osd:
Status:
Labels:
Monitoring:
rook.io/managedBy: ocs-storagecluster
Log Collector:
Enabled: true
Max Log Size: 500Mi
Periodicity: daily
Mgr:
Modules:
Enabled: true
Name: pg_autoscaler
Enabled: true
Name: balancer
Mon:
Count: 3
Monitoring:
Enabled: true
Network:
Placement:
All:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: cluster.ocs.openshift.io/openshift-storage
Operator: Exists
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Arbiter:
Tolerations:
Effect: NoSchedule
Key: node-role.kubernetes.io/master
Operator: Exists
Mon:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: cluster.ocs.openshift.io/openshift-storage
Operator: Exists
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app
Operator: In
Values:
rook-ceph-mon
Topology Key: kubernetes.io/hostname
Priority Class Names:
Mgr: system-node-critical
Mon: system-node-critical
Osd: system-node-critical
Resources:
Mds:
Limits:
Cpu: 3
Memory: 8Gi
Requests:
Cpu: 3
Memory: 8Gi
Mgr:
Limits:
Cpu: 1
Memory: 3Gi
Requests:
Cpu: 1
Memory: 3Gi
Mon:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 1
Memory: 2Gi
Rgw:
Limits:
Cpu: 2
Memory: 4Gi
Requests:
Cpu: 2
Memory: 4Gi
Security:
Kms:
Storage:
Storage Class Device Sets:
Count: 3
Name: ocs-deviceset-localblock-0
Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: cluster.ocs.openshift.io/openshift-storage
Operator: Exists
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Topology Spread Constraints:
Label Selector:
Match Expressions:
Key: ceph.rook.io/pvc
Operator: Exists
Max Skew: 1
Topology Key: kubernetes.io/hostname
When Unsatisfiable: ScheduleAnyway
Prepare Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: cluster.ocs.openshift.io/openshift-storage
Operator: Exists
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Topology Spread Constraints:
Label Selector:
Match Expressions:
Key: ceph.rook.io/pvc
Operator: Exists
Max Skew: 1
Topology Key: kubernetes.io/hostname
When Unsatisfiable: ScheduleAnyway
Resources:
Limits:
Cpu: 2
Memory: 5Gi
Requests:
Cpu: 2
Memory: 5Gi
Volume Claim Templates:
Metadata:
Annotations:
Crush Device Class: replicated
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 100Gi
Storage Class Name: localblock
Volume Mode: Block
Status:
Count: 1
Name: worker-2
Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-2
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Prepare Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-2
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Resources:
Limits:
Cpu: 2
Memory: 5Gi
Requests:
Cpu: 2
Memory: 5Gi
Volume Claim Templates:
Metadata:
Annotations:
Crush Device Class: worker-2
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 100Gi
Storage Class Name: localblock
Volume Mode: Block
Status:
Count: 1
Name: worker-0
Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-0
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Prepare Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-0
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Resources:
Limits:
Cpu: 2
Memory: 5Gi
Requests:
Cpu: 2
Memory: 5Gi
Volume Claim Templates:
Metadata:
Annotations:
Crush Device Class: worker-0
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 100Gi
Storage Class Name: localblock
Volume Mode: Block
Status:
Count: 1
Name: worker-1
Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-1
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Prepare Placement:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: kubernetes.io/hostname
Operator: In
Values:
worker-1
Tolerations:
Effect: NoSchedule
Key: node.ocs.openshift.io/storage
Operator: Equal
Value: true
Resources:
Limits:
Cpu: 2
Memory: 5Gi
Requests:
Cpu: 2
Memory: 5Gi
Volume Claim Templates:
Metadata:
Annotations:
Crush Device Class: worker-1
Spec:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 100Gi
Storage Class Name: localblock
Volume Mode: Block
Status:
Status:
Ceph:
Capacity:
Bytes Available: 1570786713600
Bytes Total: 1610612736000
Bytes Used: 39826022400
Last Updated: 2022-11-09T06:55:07Z
Fsid: b8ab4bab-769b-495a-ab68-26cf669644e4
Health: HEALTH_OK
Last Checked: 2022-11-09T06:55:07Z
Versions:
Mds:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 2
Mgr:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 1
Mon:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 3
Osd:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 3
Overall:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 10
Rgw:
ceph version 16.2.10-50.el8cp (f311fa3856a155d4cd9b658e25a78def0ae7a7c3) pacific (stable): 1
Conditions:
Last Heartbeat Time: 2022-11-09T06:55:09Z
Last Transition Time: 2022-11-08T12:11:35Z
Message: Cluster created successfully
Reason: ClusterCreated
Status: True
Type: Ready
Message: Cluster created successfully
Observed Generation: 2
Phase: Ready
State: Created
Storage:
Device Classes:
Name: hdd
Version:
Image: quay.io/rhceph-dev/rhceph@sha256:9b9d1dffa2254ee04f6d7628daa244e805637cf03420bad89545495fadb491d7
Version: 16.2.10-50
Events: <none>
[root@rdr-cicd-odf-69bf-bastion-0 ~]#
Error details below:
[root@rdr-cicd-odf-69bf-bastion-0 ~]# oc describe cephblockpool ocs-storagecluster-cephblockpool-worker-0
Name: ocs-storagecluster-cephblockpool-worker-0
Namespace: openshift-storage
Labels: <none>
Annotations: <none>
API Version: ceph.rook.io/v1
Kind: CephBlockPool
Metadata:
Creation Timestamp: 2022-11-08T12:30:31Z
Finalizers:
cephblockpool.ceph.rook.io
Generation: 1
Managed Fields:
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.:
k:{"uid":"5508f168-bc46-4b84-89c5-b8d64a06776c"}:
f:spec:
.:
f:deviceClass:
f:enableRBDStats:
f:erasureCoded:
.:
f:codingChunks:
f:dataChunks:
f:failureDomain:
f:mirroring:
f:quotas:
f:replicated:
.:
f:size:
f:statusCheck:
.:
f:mirror:
Manager: ocs-operator
Operation: Update
Time: 2022-11-08T12:30:31Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"cephblockpool.ceph.rook.io":
Manager: rook
Operation: Update
Time: 2022-11-08T12:30:34Z
API Version: ceph.rook.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:phase:
Manager: rook
Operation: Update
Subresource: status
Time: 2022-11-08T12:30:40Z
Owner References:
API Version: ocs.openshift.io/v1
Block Owner Deletion: true
Controller: true
Kind: StorageCluster
Name: ocs-storagecluster
UID: 5508f168-bc46-4b84-89c5-b8d64a06776c
Resource Version: 134070
UID: e6836b78-d825-4a74-a003-9d68df4fec39
Spec:
Device Class: worker-0
Enable RBD Stats: true
Erasure Coded:
Coding Chunks: 0
Data Chunks: 0
Failure Domain: host
Mirroring:
Quotas:
Replicated:
Size: 1
Status Check:
Mirror:
Status:
Phase: Failure
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReconcileFailed 11m (x20 over 51m) rook-ceph-block-pool-controller failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool-worker-0". failed to create pool "ocs-storagecluster-cephblockpool-worker-0".: failed to create pool "ocs-storagecluster-cephblockpool-worker-0".: failed to create pool "ocs-storagecluster-cephblockpool-worker-0": failed to create replicated crush rule "ocs-storagecluster-cephblockpool-worker-0": failed to create crush rule ocs-storagecluster-cephblockpool-worker-0: exit status 22
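Exit status 22 from the Ceph CLI is EINVAL. A plausible way to reproduce the failing step from the toolbox, assuming Rook builds the replicated CRUSH rule from the pool's failureDomain and deviceClass (a sketch; the exact internal invocation is not shown in the log):

    # Mirrors the rule the pool above needs: root "default", failure
    # domain "host", device class "worker-0". This returns EINVAL when
    # no OSD reports a "worker-0" device class (ceph osd tree above
    # shows only "hdd" classes in this cluster).
    ceph osd crush rule create-replicated \
        ocs-storagecluster-cephblockpool-worker-0 default host worker-0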
How are the local PVs configured? It almost seems that the local devices are mounted in multiple locations. Are multiple PVs actually pointing to the same device?
In one prepare log it shows osd.2 was provisioned:
2022-11-08T12:11:33.572775637Z 2022-11-08 12:11:33.572672 D | cephosd: {
2022-11-08T12:11:33.572775637Z "771e58ed-e4bd-4468-80ef-971301838fe1": {
2022-11-08T12:11:33.572775637Z "ceph_fsid": "b8ab4bab-769b-495a-ab68-26cf669644e4",
2022-11-08T12:11:33.572775637Z "device": "/mnt/ocs-deviceset-localblock-0-data-2crrhx",
2022-11-08T12:11:33.572775637Z "osd_id": 2,
2022-11-08T12:11:33.572775637Z "osd_uuid": "771e58ed-e4bd-4468-80ef-971301838fe1",
2022-11-08T12:11:33.572775637Z "type": "bluestore"
2022-11-08T12:11:33.572775637Z }
2022-11-08T12:11:33.572775637Z }
And another OSD prepare log shows a different device, but the same osd.2 and other properties:
2022-11-08T12:31:47.690917126Z 2022-11-08 12:31:47.690824 D | cephosd: {
2022-11-08T12:31:47.690917126Z "771e58ed-e4bd-4468-80ef-971301838fe1": {
2022-11-08T12:31:47.690917126Z "ceph_fsid": "b8ab4bab-769b-495a-ab68-26cf669644e4",
2022-11-08T12:31:47.690917126Z "device": "/mnt/worker-0-data-0jtpn7",
2022-11-08T12:31:47.690917126Z "osd_id": 2,
2022-11-08T12:31:47.690917126Z "osd_uuid": "771e58ed-e4bd-4468-80ef-971301838fe1",
2022-11-08T12:31:47.690917126Z "type": "bluestore"
2022-11-08T12:31:47.690917126Z }
2022-11-08T12:31:47.690917126Z }
Notice the different device path. There is also a 20-minute gap between the two provisioning runs. Could the devices have been cleaned and the OSDs re-provisioned? Was this a clean install with clean local PVs?
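One quick spot check from a node for whether two kernel device names are actually paths to the same underlying LUN (a sketch with standard tools; the udevadm analysis later in this report reaches the same conclusion for sda and sdc):

    # Duplicate WWNs/serials across different device names indicate
    # multiple SCSI paths to one disk (multipath presented as separate
    # block devices).
    lsblk -o NAME,TYPE,WWN
    udevadm info --query=property /dev/sdc | grep -E '^ID_(SERIAL|WWN)='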
Hi Travis, this is a PowerVM cluster and it was a fresh deployment done for the feature testing. Malay was able to see the same error in his environment when he created a StorageCluster with the failure domain set to host.

Can we connect to your cluster? Digging through the must-gather I'm not finding any other meaningful clues.

This cluster is on PowerVM; I am not sure if you are able to connect to it. You can try with the cluster details shared earlier over chat.

Hi, we have tried to create another new cluster on PowerVS, and after patching to enable non-resilient pools the StorageCluster is stuck in the Progressing state. You can access that cluster as well: https://console-openshift-console.apps.rdr-odf412.ibm.com. I will share the credentials over IM.

Moving back to 4.12 as a potential blocker; otherwise the replica 1 feature is not working. In the operator log, I see that the cluster and OSDs were originally created without the replica 1 feature enabled:
2022-11-08T12:11:31.566267784Z 2022-11-08 12:11:31.566181 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-1wkjdp is "completed"
2022-11-08T12:11:31.566267784Z 2022-11-08 12:11:31.566229 I | op-osd: creating OSD 1 on PVC "ocs-deviceset-localblock-0-data-1wkjdp"
2022-11-08T12:11:31.566378182Z 2022-11-08 12:11:31.566256 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-1wkjdp"
2022-11-08T12:11:32.370803330Z 2022-11-08 12:11:32.370703 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-07qcxj is "completed"
2022-11-08T12:11:32.370803330Z 2022-11-08 12:11:32.370768 I | op-osd: creating OSD 0 on PVC "ocs-deviceset-localblock-0-data-07qcxj"
2022-11-08T12:11:32.370930142Z 2022-11-08 12:11:32.370801 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-07qcxj"
2022-11-08T12:11:33.637998416Z 2022-11-08 12:11:33.637944 I | op-osd: OSD orchestration status for PVC ocs-deviceset-localblock-0-data-2crrhx is "completed"
2022-11-08T12:11:33.637998416Z 2022-11-08 12:11:33.637974 I | op-osd: creating OSD 2 on PVC "ocs-deviceset-localblock-0-data-2crrhx"
2022-11-08T12:11:33.638088923Z 2022-11-08 12:11:33.637991 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-localblock-0-data-2crrhx"
Then 20 minutes later, the cephcluster CR was updated with the replica 1 storageClassDeviceSets:
2022-11-08T12:30:31.547876784Z 2022-11-08 12:30:31.547778 I | ceph-cluster-controller: CR has changed for "ocs-storagecluster-cephcluster". diff= v1.ClusterSpec{
While I would like to get this cluster update scenario working, please just create the replica 1 configuration from the start to see if that will get a working cluster. Later we can try the scenario of updating an existing cluster to add replica 1 OSDs.
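For reference, "from the start" would mean the non-resilient field is present in the StorageCluster CR at creation time rather than patched in later. A minimal sketch of the relevant spec fragment (only the field already shown in the dump at the top of this report; not a complete, installable CR):

    # Partial StorageCluster manifest; merge this field into the full CR
    # generated for the deployment before creating it.
    apiVersion: ocs.openshift.io/v1
    kind: StorageCluster
    metadata:
      name: ocs-storagecluster
      namespace: openshift-storage
    spec:
      managedResources:
        cephNonResilientPools:
          enable: true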
Hi, can you please share the steps: how do we enable the replica 1 configuration before deploying ODF?

Malay, as long as the StorageCluster CR is created initially with non-resilient pools, we shouldn't see the cephcluster CR updated later like this, right? I'd like to confirm this really is a clean install with non-resilient pools before investigating the upgraded case. Narayanaswamy, how are you creating the ODF cluster? From the UI or by creating the StorageCluster CR?

I think Narayanswami is creating the StorageCluster from the UI (most people do it this way), and later the StorageCluster is patched for non-resilient pools. If we want a replica 1 configuration from the start we have to use a StorageCluster CR, set the value there, and then create the CR.

Deployed ODF using the UI, as mentioned by Malay.

Could I get a connection to the repro to take a look again? The original cluster looks like it is no longer running since it has been a while. Even better if it's possible to get a connection to the cluster before the non-resilient setting is changed, to see the state before and after we apply that setting. Thanks!

The original cluster is not available. I have created a new OCP 4.12 cluster; nothing else is deployed. You can connect to this and check; let me know if I need to perform any steps. https://console-openshift-console.apps.rdr-nara3.ibm.com

The hosts file needs to be updated with the details below:

158.176.146.114 api.rdr-nara3.ibm.com console-openshift-console.apps.rdr-nara3.ibm.com integrated-oauth-server-openshift-authentication.apps.rdr-nara3.ibm.com oauth-openshift.apps.rdr-nara3.ibm.com prometheus-k8s-openshift-monitoring.apps.rdr-nara3.ibm.com grafana-openshift-monitoring.apps.rdr-nara3.ibm.com example.apps.rdr-nara3.ibm.com

The kubeadmin password was shared over chat with Travis and Malay.

When I tried to connect to the console in my browser, it warned me about the insecure connection; I told it to go to the site anyway, and then it couldn't connect. So it seems there was an initial connection at least. Let's try again next week.

Looking in detail at the OSD prepare logs of a live cluster, with Malay and Narayan we were able to repro independently of the non-resilient configuration. All we had to do was create two OSDs per node. In that case, the OSDs conflicted with each other and failed to come up, with the same symptoms as the original repro for the non-resilient cluster. The symptom is that the second OSD to be prepared discovers that the device is already configured as the first provisioned OSD, and returns that OSD ID instead of provisioning a new OSD. Thus, only one OSD remains provisioned per node even though two were requested per node.

Let's look in detail at two of the OSD prepare jobs that show the conflict. We will call them A and B:

A: rook-ceph-osd-prepare-bf01f3ae932a303a855e7bf451e01629-bvr5n 0/1 Completed 0 3h36m
B: rook-ceph-osd-prepare-bfc9ab8ba369e0d62d70e7fcabd0f4e6-7gqsj 0/1 Completed 0 3h37m

OSD A:
- The log shows that the path to the device mounted to the pod is /mnt/ocs-deviceset-localblock-0-data-07ckv5.
- The PV name is local-pv-6fe167c7.
- The PV spec shows the local.path of /mnt/local-storage/localblock/sdc.
- This PV has node affinity to worker-0.

OSD B:
- The log shows that the path to the device mounted to the pod is /mnt/ocs-deviceset-localblock-0-data-4hjn82.
- The PV name is local-pv-a933fcc1.
- The PV spec shows the local.path of /mnt/local-storage/localblock/sda.
- This PV also has node affinity to worker-0.

So far, everything looks independent between these two OSDs, as expected. The key here is that the dev links show some of the same SCSI paths under the covers. In particular, notice that these two devlinks are the same for both devices:

/dev/disk/by-id/scsi-3600507681081818c2000000000008c50
/dev/disk/by-id/scsi-SIBM_2145_020420606308XX00

Dev links for OSD A:

2022-12-08 16:49:33.539589 D | exec: Running command: udevadm info --query=property /dev/sdc
2022-12-08 16:49:33.546408 D | sys: udevadm info output:"DEVLINKS= /dev/disk/by-id/scsi-3600507681081818c2000000000008c50 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 /dev/disk/by-id/wwn-0x600507681081818c2000000000008c50 /dev/disk/by-path/fc-0xc050760c345122ae-0x5005076810243184-lun-0 /dev/disk/by-path/fc-0x5005076810243184-lun-0\nDEVNAME=/dev/sdc\nDEVPATH=/devices/vio/30000004/host1/rport-1:0-1/target1:0:1/1:0:1:0/block/sdc\nDEVTYPE=disk\nFC_INITIATOR_WWPN=0xc050760c345122ae\nFC_TARGET_LUN=0\nFC_TARGET_WWPN=0x5005076810243184\nID_BUS=scsi\nID_MODEL=2145\nID_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nID_PATH=fc-0x5005076810243184-lun-0\nID_PATH_TAG=fc-0x5005076810243184-lun-0\nID_REVISION=0000\nID_SCSI=1\nID_SCSI_INQUIRY=1\nID_SCSI_SERIAL=020420606308XX00\nID_SERIAL=3600507681081818c2000000000008c50\nID_SERIAL_SHORT=600507681081818c2000000000008c50\nID_TARGET_PORT=0\nID_TYPE=disk\nID_VENDOR=IBM\nID_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nID_WWN=0x600507681081818c\nID_WWN_VENDOR_EXTENSION=0x2000000000008c50\nID_WWN_WITH_EXTENSION=0x600507681081818c2000000000008c50\nMAJOR=8\nMINOR=32\nSCSI_IDENT_LUN_NAA_REGEXT=600507681081818c2000000000008c50\nSCSI_IDENT_PORT_RELATIVE=135\nSCSI_IDENT_PORT_TARGET_PORT_GROUP=0x0\nSCSI_IDENT_PORT_VENDOR=600507681081818c2000000000000001\nSCSI_IDENT_SERIAL=020420606308XX00\nSCSI_MODEL=2145\nSCSI_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nSCSI_REVISION=0000\nSCSI_TPGS=1\nSCSI_TYPE=disk\nSCSI_VENDOR=IBM\nSCSI_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=11331746"

Dev links for OSD B:

2022-12-08 16:48:41.021847 D | exec: Running command: udevadm info --query=property /dev/sda
2022-12-08 16:48:41.027232 D | sys: udevadm info output:"DEVLINKS= /dev/disk/by-id/scsi-3600507681081818c2000000000008c50 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 /dev/disk/by-path/fc-0x5005076810243152-lun-0 /dev/disk/by-path/fc-0xc050760c345122ae-0x5005076810243152-lun-0 /dev/disk/by-id/wwn-0x600507681081818c2000000000008c50\nDEVNAME=/dev/sda\nDEVPATH=/devices/vio/30000004/host1/rport-1:0-0/target1:0:0/1:0:0:0/block/sda\nDEVTYPE=disk\nFC_INITIATOR_WWPN=0xc050760c345122ae\nFC_TARGET_LUN=0\nFC_TARGET_WWPN=0x5005076810243152\nID_BUS=scsi\nID_MODEL=2145\nID_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nID_PATH=fc-0x5005076810243152-lun-0\nID_PATH_TAG=fc-0x5005076810243152-lun-0\nID_REVISION=0000\nID_SCSI=1\nID_SCSI_INQUIRY=1\nID_SCSI_SERIAL=020420606308XX00\nID_SERIAL=3600507681081818c2000000000008c50\nID_SERIAL_SHORT=600507681081818c2000000000008c50\nID_TARGET_PORT=1\nID_TYPE=disk\nID_VENDOR=IBM\nID_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nID_WWN=0x600507681081818c\nID_WWN_VENDOR_EXTENSION=0x2000000000008c50\nID_WWN_WITH_EXTENSION=0x600507681081818c2000000000008c50\nMAJOR=8\nMINOR=0\nSCSI_IDENT_LUN_NAA_REGEXT=600507681081818c2000000000008c50\nSCSI_IDENT_PORT_RELATIVE=2183\nSCSI_IDENT_PORT_TARGET_PORT_GROUP=0x1\nSCSI_IDENT_PORT_VENDOR=600507681081818c2000000000000002\nSCSI_IDENT_SERIAL=020420606308XX00\nSCSI_MODEL=2145\nSCSI_MODEL_ENC=2145\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\\x20\nSCSI_REVISION=0000\nSCSI_TPGS=1\nSCSI_TYPE=disk\nSCSI_VENDOR=IBM\nSCSI_VENDOR_ENC=IBM\\x20\\x20\\x20\\x20\\x20\nSUBSYSTEM=block\nTAGS=:systemd:\nUSEC_INITIALIZED=11335171"

Therefore, this is an incorrectly configured environment. We can't be using SCSI disks that conflict under the covers. Please remove the SCSI overlap and then see if this still repros. This does not appear to be an ODF bug. Removing the blocker for 4.12 while finalizing the investigation.

Thanks for the update, Travis. PowerVS clusters come with multipath. Are you saying multipath is not supported?

(In reply to narayanspg from comment #22)
> Thanks for the update Travis. PowerVS clusters comes with multipath. are you saying Multipath is not supported?

LSO will need to be configured to create local PVs such that multiple PVs are not created pointing to the same device. In the node I was looking at, sda and sdc were both pointing to the same disk, so the OSDs were conflicting. We will need to find filters for LSO to create the correct PVs.

Malay and I tested LocalVolume instead of LocalVolumeSet, with disk/by-id specified in the YAML for each worker node as below.
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  logLevel: Normal
  managementState: Managed
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0
        - worker-1
        - worker-2
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f96
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f95
    - /dev/disk/by-id/scsi-3600507681081818c2000000000008f97
    storageClassName: localblock
    volumeMode: Block
The OSDs were in Pending state after patching to enable cephNonResilientPools, and the StorageCluster was in the Progressing state.
Were the OSD PVCs bound to these PVs, and did the OSD prepare jobs run? If so, please share their logs.

This cluster is destroyed, as it was giving different results with the disk/by-id usage. If required, I can recreate the cluster to simulate the scenario and share it.

Yes please; we need a repro to confirm whether there is still a similar configuration issue as described in Comment 19, or whether there is another issue here. So far we can only repro in this mpath environment, so it appears environmental.

I have created a new OCP cluster: https://console-openshift-console.apps.rdr-res2.ibm.com. Below are the disk details on the worker nodes:

[core@lon06-worker-0 ~]$ ls -l /dev/disk/by-id/*
lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c2000000000009190 -> ../../sdp
lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f -> ../../sdm
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part1 -> ../../sdm1
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part2 -> ../../sdk2
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part3 -> ../../sdm3
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-3600507681081818c200000000000919f-part4 -> ../../sdm4
lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdm
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdm1
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdm2
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdm3
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdm4
lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c2000000000009190 -> ../../sdp
lrwxrwxrwx. 1 root root 9 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f -> ../../sdm
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part1 -> ../../sdm1
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part2 -> ../../sdm2
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part3 -> ../../sdm3
lrwxrwxrwx. 1 root root 10 Dec 14 09:14 /dev/disk/by-id/wwn-0x600507681081818c200000000000919f-part4 -> ../../sdm4
[core@lon06-worker-0 ~]$
[core@lon06-worker-0 ~]$ #

[core@lon06-worker-1 ~]$ ls -l /dev/disk/by-id/*
lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009191 -> ../../sdo
lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199 -> ../../sdn
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part1 -> ../../sdp1
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part2 -> ../../sdh2
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part3 -> ../../sdh3
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-3600507681081818c2000000000009199-part4 -> ../../sdn4
lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdo
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdp1
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdh2
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdh3
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdn4
lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009191 -> ../../sdo
lrwxrwxrwx. 1 root root 9 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199 -> ../../sdn
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part1 -> ../../sdp1
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part2 -> ../../sdh2
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part3 -> ../../sdh3
lrwxrwxrwx. 1 root root 10 Dec 14 09:02 /dev/disk/by-id/wwn-0x600507681081818c2000000000009199-part4 -> ../../sdn4
[core@lon06-worker-1 ~]$ #

[core@lon06-worker-2 ~]$ ls -l /dev/disk/by-id/*
lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c2000000000009192 -> ../../sdp
lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e -> ../../sdo
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part1 -> ../../sdo1
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part2 -> ../../sdo2
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part3 -> ../../sdo3
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-3600507681081818c200000000000919e-part4 -> ../../sdo4
lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00 -> ../../sdo
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part1 -> ../../sdo1
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part2 -> ../../sdo2
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part3 -> ../../sdo3
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/scsi-SIBM_2145_020420606308XX00-part4 -> ../../sdo4
lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c2000000000009192 -> ../../sdp
lrwxrwxrwx. 1 root root 9 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e -> ../../sdo
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part1 -> ../../sdo1
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part2 -> ../../sdo2
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part3 -> ../../sdo3
lrwxrwxrwx. 1 root root 10 Dec 14 09:08 /dev/disk/by-id/wwn-0x600507681081818c200000000000919e-part4 -> ../../sdo4
[core@lon06-worker-2 ~]$
[core@lon06-worker-2 ~]$ lsblk -o name,type,wwn
NAME   TYPE WWN
sda    disk 0x600507681081818c200000000000919e
├─sda1 part 0x600507681081818c200000000000919e
├─sda2 part 0x600507681081818c200000000000919e
├─sda3 part 0x600507681081818c200000000000919e
└─sda4 part 0x600507681081818c200000000000919e
sdb    disk 0x600507681081818c2000000000009192
sdc    disk 0x600507681081818c200000000000919e
├─sdc1 part 0x600507681081818c200000000000919e
├─sdc2 part 0x600507681081818c200000000000919e
├─sdc3 part 0x600507681081818c200000000000919e
└─sdc4 part 0x600507681081818c200000000000919e
sdd    disk 0x600507681081818c2000000000009192
sde    disk 0x600507681081818c200000000000919e
├─sde1 part 0x600507681081818c200000000000919e
├─sde2 part 0x600507681081818c200000000000919e
├─sde3 part 0x600507681081818c200000000000919e
└─sde4 part 0x600507681081818c200000000000919e
sdf    disk 0x600507681081818c2000000000009192
sdg    disk 0x600507681081818c200000000000919e
├─sdg1 part 0x600507681081818c200000000000919e
├─sdg2 part 0x600507681081818c200000000000919e
├─sdg3 part 0x600507681081818c200000000000919e
└─sdg4 part 0x600507681081818c200000000000919e
sdh    disk 0x600507681081818c2000000000009192
sdi    disk 0x600507681081818c200000000000919e
├─sdi1 part 0x600507681081818c200000000000919e
├─sdi2 part 0x600507681081818c200000000000919e
├─sdi3 part 0x600507681081818c200000000000919e
└─sdi4 part 0x600507681081818c200000000000919e
sdj    disk 0x600507681081818c2000000000009192
sdk    disk 0x600507681081818c200000000000919e
├─sdk1 part 0x600507681081818c200000000000919e
├─sdk2 part 0x600507681081818c200000000000919e
├─sdk3 part 0x600507681081818c200000000000919e
└─sdk4 part 0x600507681081818c200000000000919e
sdl    disk 0x600507681081818c2000000000009192
sdm    disk 0x600507681081818c200000000000919e
├─sdm1 part 0x600507681081818c200000000000919e
├─sdm2 part 0x600507681081818c200000000000919e
├─sdm3 part 0x600507681081818c200000000000919e
└─sdm4 part 0x600507681081818c200000000000919e
sdn    disk 0x600507681081818c2000000000009192
sdo    disk 0x600507681081818c200000000000919e
├─sdo1 part 0x600507681081818c200000000000919e
├─sdo2 part 0x600507681081818c200000000000919e
├─sdo3 part 0x600507681081818c200000000000919e
└─sdo4 part 0x600507681081818c200000000000919e
sdp    disk 0x600507681081818c2000000000009192

I will be creating the LocalVolume (we can do it over a call today) with the YAML updated as below:

####### if using the UI, then don't use the line from "cat"
cat <<EOF | oc create -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  logLevel: Normal
  managementState: Managed
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-0
        - worker-1
        - worker-2
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-3600507681081818c2000000000009190
    - /dev/disk/by-id/scsi-3600507681081818c2000000000009191
    - /dev/disk/by-id/scsi-3600507681081818c2000000000009192
    storageClassName: localblock
    volumeMode: Block
EOF

As discussed, we need at least two PVs on each node for the non-resilient pools scenario: one PV per node for the replicated pools and one PV per node for the non-resilient pools. In this repro there were only three PVs, so the remaining OSDs would remain pending until more PVs are available.

Malay and I tested again today with an additional disk on the worker nodes; the StorageCluster-stuck-after-patching issue is resolved now.
Since it requires six PVs, it was in the Pending state. We should document this requirement of an additional disk for this feature.

Created a new deployment and added new disks for each node; we then got six PVs created and the StorageCluster reached the Ready state. I started validating the feature and see an issue with a volume mount, which I am discussing with Malay.

Hi Travis/Malay, please let us know if we have to raise a new BZ for the pod staying in the Pending state. I have shared the cluster details with Malay for debugging. While validating "Replica 1 - Non-resilient pool - Dev Preview": when we try to create a pod with a volume mount using a PVC created with the non-resilient storage class, the pod stays forever in the Pending state, and the PVC it refers to also stays forever in Pending.

The original BZ was that the cluster was stuck in Progressing, and that is no longer the case, right? I'd recommend we close this BZ and open a new one. Some ideas to troubleshoot the latest issue:
- Does the storage class use the correct pool?
- Does the pool have the correct deviceClass applied? (Look at the CRUSH rules for the pool.)
- Do the OSDs have the expected deviceClasses? (ceph osd tree)
- If the PV is stuck pending, see the CSI troubleshooting guide [1], or let's ask someone from the CSI team to take a look.

[1] https://rook.io/docs/rook/latest/Troubleshooting/ceph-csi-common-issues/

After adding the additional disk, the OSDs were created and the StorageCluster went to the Ready state. Additional documentation should be added for this feature covering the requirement for an additional disk per node.
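To close the loop on the documented requirement: with the feature enabled, each of the three nodes needs two local PVs, one OSD for the replicated device set plus one non-resilient OSD per node, i.e. six data PVCs in total. A quick check (a sketch; the name patterns match the PVC listing earlier in this report):

    # Expect six Bound data PVCs: three ocs-deviceset-localblock-*-data-*
    # (replicated OSDs) and three worker-<N>-data-* (non-resilient OSDs),
    # plus six rook-ceph-osd-* pods once reconciliation completes.
    oc get pvc -n openshift-storage | grep data
    oc get pods -n openshift-storage | grep rook-ceph-osd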