Created attachment 1804381 [details]
rook-ceph-operator logs

Description of problem (please be as detailed as possible and provide log snippets):

Mon pods are failing with `Init:CreateContainerConfigError`, with the following event:

  Warning  Failed  9m30s (x73 over 24m)  kubelet, azure-vms000004  Error: stat /var/data/kubelet/pods/5003c6bd-bed7-4293-85cf-112dcf751138/volumes/kubernetes.io~csi/pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e/mount: no such file or directory

Version of all relevant components (if applicable):
OCP 4.7
OCS 4.7

Does this issue impact your ability to continue to work with the product?
Yes, as OCS is not getting installed successfully.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create an IBM ROKS Satellite cluster using Azure machines
2. Install OCS on it

Actual results:
Mon pods did not come up successfully and went into the Init:CreateContainerConfigError state.

Expected results:
Mon pods should reach the Running state.

Additional info:

The PVs are created successfully and bound:

% oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS                 REASON   AGE
pvc-011e60df-98a1-4862-a098-7b2243ec0fa7   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-c   sat-azure-block-gold-metro            22m
pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-a   sat-azure-block-gold-metro            22m
pvc-faf485b2-3d83-4354-8fe3-9bc6755a9da1   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-b   sat-azure-block-gold-metro            22m

Pods in the openshift-storage namespace:

% oc get pods -n openshift-storage
NAME                                                        READY   STATUS                            RESTARTS   AGE
csi-cephfsplugin-kfpbc                                      3/3     Running                           0          24m
csi-cephfsplugin-lg5sd                                      3/3     Running                           0          24m
csi-cephfsplugin-n2p8d                                      3/3     Running                           0          24m
csi-cephfsplugin-provisioner-645fc77d9f-7cqmk               6/6     Running                           0          24m
csi-cephfsplugin-provisioner-645fc77d9f-w2nvt               6/6     Running                           0          24m
csi-rbdplugin-nj4bc                                         3/3     Running                           0          24m
csi-rbdplugin-provisioner-676b57fb6b-kxkrc                  6/6     Running                           0          24m
csi-rbdplugin-provisioner-676b57fb6b-nw2p9                  6/6     Running                           0          24m
csi-rbdplugin-qk6zt                                         3/3     Running                           0          24m
csi-rbdplugin-xzljv                                         3/3     Running                           0          24m
noobaa-operator-6d769d678d-c9vnb                            1/1     Running                           0          24m
ocs-metrics-exporter-77b7dc5d8b-4jfv8                       1/1     Running                           0          24m
ocs-operator-577676bfbb-f9knq                               0/1     Running                           0          24m
rook-ceph-crashcollector-azure-vms000004-559d6596f7-vbbsc   0/1     Init:0/2                          0          24m
rook-ceph-crashcollector-azure-vms000005-7fdcc75d46-b9m6k   0/1     Init:0/2                          0          117s
rook-ceph-crashcollector-azure-vms000009-78cd6b9594-jd9ps   0/1     Init:0/2                          0          13m
rook-ceph-mon-a-f778c8475-49kkz                             0/2     Init:CreateContainerConfigError   0          24m
rook-ceph-mon-b-68998cf469-q4srk                            0/2     Init:CreateContainerConfigError   0          13m
rook-ceph-mon-c-7bd7469975-b9cqz                            0/2     Init:CreateContainerConfigError   0          117s
rook-ceph-operator-6fbdcbc9d7-446gf                         1/1     Running                           0          24m

Describe output of the failing mon pod:

% oc describe pod rook-ceph-mon-a-f778c8475-49kkz -n openshift-storage
Name:         rook-ceph-mon-a-f778c8475-49kkz
Namespace:    openshift-storage
Priority:     0
Node:         azure-vms000004/10.0.0.9
Start Time:   Thu, 22 Jul 2021 13:01:30 +0530
Labels:       app=rook-ceph-mon
              ceph_daemon_id=a
              ceph_daemon_type=mon
              mon=a
              mon_cluster=openshift-storage
              pod-template-hash=f778c8475
              pvc_name=rook-ceph-mon-a
              rook_cluster=openshift-storage
Annotations:  cni.projectcalico.org/podIP: 172.30.45.130/32
              cni.projectcalico.org/podIPs: 172.30.45.130/32
              k8s.v1.cni.cncf.io/network-status: [{ "name": "", "ips": [ "172.30.45.130" ], "default": true, "dns": {} }]
              k8s.v1.cni.cncf.io/networks-status: [{ "name": "", "ips": [ "172.30.45.130" ], "default": true, "dns": {} }]
              openshift.io/scc: rook-ceph
Status:       Pending
IP:           172.30.45.130
IPs:
  IP:           172.30.45.130
Controlled By:  ReplicaSet/rook-ceph-mon-a-f778c8475
Init Containers:
  chown-container-data-dir:
    Container ID:
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
      /var/lib/ceph/mon/ceph-a
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:        1
      memory:     2Gi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
  init-mon-fs:
    Container ID:
    Image:      registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:
    Port:       <none>
    Host Port:  <none>
    Command:
      ceph-mon
    Args:
      --fsid=42be1453-f05a-4590-b7c8-f5363ef9bc29
      --keyring=/etc/ceph/keyring-store/keyring
      --log-to-stderr=true
      --err-to-stderr=true
      --mon-cluster-log-to-stderr=true
      --log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --public-addr=172.21.254.247
      --mkfs
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
      POD_NAME:                       rook-ceph-mon-a-f778c8475-49kkz (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               2147483648 (limits.memory)
      POD_MEMORY_REQUEST:             2147483648 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>                Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>     Optional: false
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
Containers:
  mon:
    Container ID:
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:
    Port:          6789/TCP
    Host Port:     0/TCP
    Command:
      ceph-mon
    Args:
      --fsid=42be1453-f05a-4590-b7c8-f5363ef9bc29
      --keyring=/etc/ceph/keyring-store/keyring
      --log-to-stderr=true
      --err-to-stderr=true
      --mon-cluster-log-to-stderr=true
      --log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --foreground
      --public-addr=172.21.254.247
      --setuser-match-path=/var/lib/ceph/mon/ceph-a/store.db
      --public-bind-addr=$(ROOK_POD_IP)
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
      POD_NAME:                       rook-ceph-mon-a-f778c8475-49kkz (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               2147483648 (limits.memory)
      POD_MEMORY_REQUEST:             2147483648 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>                Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>     Optional: false
      ROOK_POD_IP:                     (v1:status.podIP)
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
  log-collector:
    Container ID:
    Image:      registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      set -xe

      CEPH_CLIENT_ID=ceph-mon.a
      PERIODICITY=24h
      LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph

      if [ -z "$PERIODICITY" ]; then
          PERIODICITY=24h
      fi

      # edit the logrotate file to only rotate a specific daemon log
      # otherwise we will logrotate log files without reloading certain daemons
      # this might happen when multiple daemons run on the same machine
      sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE"

      while true; do
          sleep "$PERIODICITY"
          echo "starting log rotation"
          logrotate --verbose --force "$LOG_ROTATE_CEPH_FILE"
          echo "I am going to sleep now, see you in $PERIODICITY"
      done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
  rook-ceph-mons-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mons-keyring
    Optional:    false
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/log
    HostPathType:
  rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/crash
    HostPathType:
  ceph-daemon-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rook-ceph-mon-a
    ReadOnly:   false
  default-token-lkjlq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lkjlq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason            Age                   From                      Message
  ----     ------            ----                  ----                      -------
  Warning  FailedScheduling  25m (x2 over 25m)     default-scheduler         0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match pod anti-affinity rules.
  Normal   Scheduled         25m                   default-scheduler         Successfully assigned openshift-storage/rook-ceph-mon-a-f778c8475-49kkz to azure-vms000004
  Normal   AddedInterface    24m                   multus                    Add eth0 [172.30.45.130/32]
  Warning  Failed            9m30s (x73 over 24m)  kubelet, azure-vms000004  Error: stat /var/data/kubelet/pods/5003c6bd-bed7-4293-85cf-112dcf751138/volumes/kubernetes.io~csi/pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e/mount: no such file or directory
  Normal   Pulled            4m32s (x97 over 24m)  kubelet, azure-vms000004  Container image "registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c" already present on machine

I've attached the ocs-operator and rook-ceph-operator logs.
Created attachment 1804383 [details] ocs-operator logs
By default, the globalmount path is /var/lib/kubelet. This is failing because here the globalmount path has been changed to /var/data/kubelet. Do we know why the mount path was changed? Please follow [1] https://cloud.ibm.com/docs/openshift?topic=openshift-ocs-storage-install
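A quick way to see what is actually present under each candidate kubelet directory is a debug shell on the affected node. The commands below are only a sketch, reusing the node name, pod UID, and PV name from the event above; the plugins/ layout can vary by CSI driver and kubelet version:

% oc debug node/azure-vms000004
sh# chroot /host
# the pod-scoped publish path from the failing event
sh# ls -ld /var/data/kubelet/pods/5003c6bd-bed7-4293-85cf-112dcf751138/volumes/kubernetes.io~csi/pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e/mount
# check where (if anywhere) the volume is actually mounted on the host
sh# mount | grep pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e
# compare what lives under the default and the overridden kubelet dirs
sh# ls /var/lib/kubelet/plugins /var/data/kubelet/plugins 2>/dev/null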
We're using the Azure CSI driver to provision remote block volumes. The mon PVCs and PVs were created and bound successfully.
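If it helps to confirm the provisioning path, the storage class and driver registration can be checked with standard objects; the class name is the one used in this report, and disk.csi.azure.com is the provisioner recorded later on the mon PVC annotation:

% oc get sc sat-azure-block-gold-metro -o yaml    # provisioner, reclaimPolicy, volumeBindingMode
% oc get csidriver                                # should include disk.csi.azure.com
% oc get csinode azure-vms000004 -o yaml          # drivers registered on the affected node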
Hi, in IBM ROKS, for OCS, we create a ConfigMap to change the kubelet path for OCS to /var/data/kubelet, like this:

% oc describe cm rook-ceph-operator-config -n openshift-storage
Name:         rook-ceph-operator-config
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>

Data
====
ROOK_CSI_KUBELET_DIR_PATH:
----
/var/data/kubelet

Events:  <none>
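For reference, a manifest equivalent to that ConfigMap (same name, namespace, and key as in the describe output above) would be roughly:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage
data:
  # tells the Rook operator which kubelet root dir to pass to the Ceph CSI driver it deploys
  ROOK_CSI_KUBELET_DIR_PATH: /var/data/kubelet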
OK, if you are using ROOK_CSI_KUBELET_DIR_PATH then the problem lies somewhere else. Since these are mon pods, it's better if the Rook folks take a look.
Travis, could someone look at this?
Shrisha, can we get a must-gather for more info? For example, we at least need:
- CephCluster CR
- Mon PVCs

But it sounds like something isn't working with the Azure volumes being mounted in the mons. Are you able to create a test pod with a PVC from the same Azure storage class?
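A minimal test PVC and pod for that check might look like the following; this is only a sketch, with placeholder names (test-azure-pvc, test-azure-pod) and the storage class taken from this report:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-azure-pvc            # placeholder name
  namespace: openshift-storage
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: sat-azure-block-gold-metro
---
apiVersion: v1
kind: Pod
metadata:
  name: test-azure-pod            # placeholder name
  namespace: openshift-storage
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-azure-pvc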
Travis, Yes I was able to create a test pod and mount a PVC of the same storage class successfully. This is the cephcluster CR : % oc get cephcluster -n openshift-storage -o yaml apiVersion: v1 items: - apiVersion: ceph.rook.io/v1 kind: CephCluster metadata: creationTimestamp: "2021-07-22T07:31:07Z" finalizers: - cephcluster.ceph.rook.io generation: 1 labels: app: ocs-storagecluster managedFields: - apiVersion: ceph.rook.io/v1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:labels: .: {} f:app: {} f:ownerReferences: .: {} k:{"uid":"3ca890d2-ddb7-4209-aaa3-6f5c779d77ad"}: .: {} f:apiVersion: {} f:blockOwnerDeletion: {} f:controller: {} f:kind: {} f:name: {} f:uid: {} f:spec: .: {} f:cephVersion: .: {} f:image: {} f:cleanupPolicy: .: {} f:sanitizeDisks: {} f:continueUpgradeAfterChecksEvenIfNotHealthy: {} f:crashCollector: .: {} f:disable: {} f:dashboard: {} f:dataDirHostPath: {} f:disruptionManagement: .: {} f:machineDisruptionBudgetNamespace: {} f:managePodBudgets: {} f:external: .: {} f:enable: {} f:healthCheck: .: {} f:daemonHealth: {} f:logCollector: .: {} f:enabled: {} f:periodicity: {} f:mgr: .: {} f:modules: {} f:mon: .: {} f:count: {} f:volumeClaimTemplate: .: {} f:metadata: {} f:spec: {} f:status: {} f:monitoring: .: {} f:enabled: {} f:rulesNamespace: {} f:network: .: {} f:hostNetwork: {} f:provider: {} f:selectors: {} f:placement: .: {} f:all: {} f:arbiter: {} f:mon: {} f:removeOSDsIfOutAndSafeToRemove: {} f:resources: .: {} f:mds: {} f:mgr: {} f:mon: {} f:rgw: {} f:security: .: {} f:kms: {} f:storage: .: {} f:config: {} f:storageClassDeviceSets: {} manager: ocs-operator operation: Update time: "2021-07-22T07:31:07Z" - apiVersion: ceph.rook.io/v1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:finalizers: .: {} v:"cephcluster.ceph.rook.io": {} f:status: .: {} f:conditions: {} f:message: {} f:phase: {} f:state: {} f:version: {} manager: rook operation: Update time: "2021-07-22T07:31:13Z" name: ocs-storagecluster-cephcluster namespace: openshift-storage ownerReferences: - apiVersion: ocs.openshift.io/v1 blockOwnerDeletion: true controller: true kind: StorageCluster name: ocs-storagecluster uid: 3ca890d2-ddb7-4209-aaa3-6f5c779d77ad resourceVersion: "2837895" selfLink: /apis/ceph.rook.io/v1/namespaces/openshift-storage/cephclusters/ocs-storagecluster-cephcluster uid: ebd50e6c-97af-42a6-b6b6-eee525d2a2de spec: cephVersion: image: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c cleanupPolicy: sanitizeDisks: {} continueUpgradeAfterChecksEvenIfNotHealthy: true crashCollector: disable: false dashboard: {} dataDirHostPath: /var/lib/rook disruptionManagement: machineDisruptionBudgetNamespace: openshift-machine-api managePodBudgets: true external: enable: false healthCheck: daemonHealth: mon: timeout: 15m osd: {} status: {} logCollector: enabled: true periodicity: 24h mgr: modules: - enabled: true name: pg_autoscaler - enabled: true name: balancer mon: count: 3 volumeClaimTemplate: metadata: creationTimestamp: null spec: accessModes: - ReadWriteOnce resources: requests: storage: 20Gi storageClassName: sat-azure-block-gold-metro volumeMode: Filesystem status: {} monitoring: enabled: true rulesNamespace: openshift-storage network: hostNetwork: false provider: "" selectors: null placement: all: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: 
Equal value: "true" arbiter: tolerations: - effect: NoSchedule key: node-role.kubernetes.io/master operator: Exists mon: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: app operator: In values: - rook-ceph-mon topologyKey: topology.rook.io/rack removeOSDsIfOutAndSafeToRemove: false resources: mds: limits: cpu: "3" memory: 8Gi requests: cpu: "3" memory: 8Gi mgr: limits: cpu: "1" memory: 3Gi requests: cpu: "1" memory: 3Gi mon: limits: cpu: "1" memory: 2Gi requests: cpu: "1" memory: 2Gi rgw: limits: cpu: "2" memory: 4Gi requests: cpu: "2" memory: 4Gi security: kms: {} storage: config: null storageClassDeviceSets: - count: 1 name: ocs-deviceset-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" creationTimestamp: null spec: accessModes: - ReadWriteOnce resources: requests: storage: 100Gi storageClassName: sat-azure-block-gold-metro volumeMode: Block status: {} - count: 1 name: ocs-deviceset-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: 
topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" creationTimestamp: null spec: accessModes: - ReadWriteOnce resources: requests: storage: 100Gi storageClassName: sat-azure-block-gold-metro volumeMode: Block status: {} - count: 1 name: ocs-deviceset-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.rook.io/rack whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" creationTimestamp: null spec: accessModes: - ReadWriteOnce resources: requests: storage: 100Gi storageClassName: sat-azure-block-gold-metro volumeMode: Block status: {} status: conditions: - lastHeartbeatTime: "2021-07-22T07:31:13Z" lastTransitionTime: "2021-07-22T07:31:13Z" message: Cluster is creating reason: ClusterProgressing status: "True" type: Progressing - lastHeartbeatTime: "2021-07-22T07:41:25Z" lastTransitionTime: "2021-07-22T07:41:25Z" message: Failed to create cluster reason: ClusterFailure status: "True" type: Failure message: Failed to create cluster phase: Failure state: Error version: image: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c version: 14.2.11-181 kind: List metadata: resourceVersion: "" selfLink: "" And this is the PVC : % oc get pvc rook-ceph-mon-a -n openshift-storage -o yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: annotations: pv.kubernetes.io/bind-completed: "yes" pv.kubernetes.io/bound-by-controller: "yes" volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com volume.kubernetes.io/selected-node: azure-vms000004 creationTimestamp: "2021-07-22T07:31:18Z" finalizers: - kubernetes.io/pvc-protection labels: app: rook-ceph-mon ceph-version: 14.2.11-181 ceph_daemon_id: a ceph_daemon_type: mon mon: a mon_canary: "true" mon_cluster: openshift-storage pvc_name: rook-ceph-mon-a rook-version: 4.7-145.a418c74.release_4.7 rook_cluster: openshift-storage managedFields: - apiVersion: v1 fieldsType: FieldsV1 fieldsV1: 
f:metadata: f:labels: .: {} f:app: {} f:ceph-version: {} f:ceph_daemon_id: {} f:ceph_daemon_type: {} f:mon: {} f:mon_canary: {} f:mon_cluster: {} f:pvc_name: {} f:rook-version: {} f:rook_cluster: {} f:ownerReferences: .: {} k:{"uid":"ebd50e6c-97af-42a6-b6b6-eee525d2a2de"}: .: {} f:apiVersion: {} f:blockOwnerDeletion: {} f:controller: {} f:kind: {} f:name: {} f:uid: {} f:spec: f:accessModes: {} f:resources: f:requests: .: {} f:storage: {} f:storageClassName: {} f:volumeMode: {} manager: rook operation: Update time: "2021-07-22T07:31:18Z" - apiVersion: v1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:annotations: .: {} f:volume.kubernetes.io/selected-node: {} manager: kube-scheduler operation: Update time: "2021-07-22T07:31:19Z" - apiVersion: v1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:annotations: f:pv.kubernetes.io/bind-completed: {} f:pv.kubernetes.io/bound-by-controller: {} f:volume.beta.kubernetes.io/storage-provisioner: {} f:spec: f:volumeName: {} f:status: f:accessModes: {} f:capacity: .: {} f:storage: {} f:phase: {} manager: kube-controller-manager operation: Update time: "2021-07-22T07:31:21Z" name: rook-ceph-mon-a namespace: openshift-storage ownerReferences: - apiVersion: ceph.rook.io/v1 blockOwnerDeletion: true controller: true kind: CephCluster name: ocs-storagecluster-cephcluster uid: ebd50e6c-97af-42a6-b6b6-eee525d2a2de resourceVersion: "2086629" selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-a uid: f7cb5eb2-5231-4ba2-9757-2bed3cf0207e spec: accessModes: - ReadWriteOnce resources: requests: storage: 20Gi storageClassName: sat-azure-block-gold-metro volumeMode: Filesystem volumeName: pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e status: accessModes: - ReadWriteOnce capacity: storage: 20Gi phase: Bound
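A couple of additional low-level checks that may help narrow down where the mount is getting lost; object names are taken from this bug, and the VolumeAttachment name itself is generated, hence the grep:

% oc get volumeattachment | grep pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e   # was the attach to azure-vms000004 successful?
% oc get pv pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e -o yaml                # spec.csi.driver is expected to be disk.csi.azure.com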
Please find the must-gather for the cluster here: https://drive.google.com/file/d/1Hmxk-G8J_olGqJI0nGgVkfNeGm9KNt-d/view?usp=sharing
The interesting point here is that the mon PVCs are bound to their PVs and successfully mounted in the mon canary pods, so we know the volumes are basically working. Then Rook deletes the canary pods and starts up the mon daemon pods, where these volumes claim they don't exist.

Shirisha, could you check on the following?
- What nodes are the mon-canary pods assigned to? Are the mon daemon pods starting on the same nodes or on different nodes?
- If you create another test pod with a test volume, then delete the pod (without deleting the PVC), then create a new pod, does it start successfully? Or does that test pod hit the same issue after the volume is re-used? Is it a factor whether the pod starts on the same node or a different node?
- Does it change anything if you change the reclaimPolicy on the Azure storage class to Retain?
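A rough sketch of those checks, reusing the placeholder test pod/PVC names from earlier in this bug; note that the reclaimPolicy on an existing StorageClass generally cannot be edited in place, so a per-PV patch is shown as an alternative:

# current node assignment of the mon pods (the canary placement can usually be found in the rook-ceph-operator log)
% oc get pods -n openshift-storage -o wide | grep rook-ceph-mon

# re-use test: delete only the pod, keep the PVC, then recreate the pod and watch its events
% oc delete pod test-azure-pod -n openshift-storage
% oc apply -f test-azure-pod.yaml        # placeholder file containing the test pod manifest
% oc get events -n openshift-storage --field-selector involvedObject.name=test-azure-pod

# keep a specific PV regardless of the storage class default reclaim policy
% oc patch pv pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'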
Not a 4.8 blocker
Hi Travis, there was a mount propagation issue with the installed Azure driver, and fixing that resolved this issue. Thank you for looking into it.
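For anyone who hits the same symptom: the relevant knob is the mountPropagation setting on the CSI node plugin's kubelet-dir volume mount. The snippet below is a generic, illustrative DaemonSet skeleton (not the actual Azure driver manifest; names and image are placeholders), using the non-default /var/data/kubelet path from this environment:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-csi-node              # illustrative only
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-csi-node
  template:
    metadata:
      labels:
        app: example-csi-node
    spec:
      containers:
      - name: csi-driver
        image: registry.example.com/example-csi-driver:latest   # placeholder image
        securityContext:
          privileged: true             # Bidirectional propagation requires a privileged container
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/data/kubelet
          # Bidirectional propagation is what lets mounts created inside the driver
          # container become visible to the kubelet on the host; without it, the
          # kubelet can fail to find the staged/published volume paths, which is
          # the kind of "no such file or directory" error seen above.
          mountPropagation: Bidirectional
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/data/kubelet
          type: Directory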