Bug 1984763

Summary: [ROKS][Azure Satellite] Mon pods failing with `Init:CreateContainerConfigError`
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: rook
Version: 4.7
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED NOTABUG
Type: Bug
Last Closed: 2021-08-04 07:29:16 UTC
Reporter: Shirisha S Rao <shrao>
Assignee: Travis Nielsen <tnielsen>
QA Contact: Elad <ebenahar>
CC: madam, muagarwa, ocs-bugs, odf-bz-bot, sabose, tnielsen
Flags: tnielsen: needinfo? (shrao)

Attachments:
- rook-ceph-operator logs
- ocs-operator logs
Description Shirisha S Rao 2021-07-22 08:00:31 UTC
Created attachment 1804381 [details]
rook-ceph-operator logs

Description of problem (please be as detailed as possible and provide log
snippets):

Mon pods are failing with `Init:CreateContainerConfigError`, with the kubelet reporting:
Warning  Failed            9m30s (x73 over 24m)  kubelet, azure-vms000004  Error: stat /var/data/kubelet/pods/5003c6bd-bed7-4293-85cf-112dcf751138/volumes/kubernetes.io~csi/pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e/mount: no such file or directory



Version of all relevant components (if applicable):
OCP 4.7
OCS 4.7


Does this issue impact your ability to continue to work with the product?
Yes, as OCS does not get installed successfully


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create an IBM ROKS Satellite cluster using Azure machines
2. Install OCS on it

Actual results:
Mon pods did not come up successfully and went into the Init:CreateContainerConfigError state

Expected results:
Expected the mon pods to reach the Running state

Additional info:
The PVs are created successfully and bound

% oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS                 REASON   AGE
pvc-011e60df-98a1-4862-a098-7b2243ec0fa7   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-c   sat-azure-block-gold-metro            22m
pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-a   sat-azure-block-gold-metro            22m
pvc-faf485b2-3d83-4354-8fe3-9bc6755a9da1   20Gi       RWO            Delete           Bound    openshift-storage/rook-ceph-mon-b   sat-azure-block-gold-metro            22m

Pods in openshift-storage namespace

% oc get pods -n openshift-storage
NAME                                                        READY   STATUS                            RESTARTS   AGE
csi-cephfsplugin-kfpbc                                      3/3     Running                           0          24m
csi-cephfsplugin-lg5sd                                      3/3     Running                           0          24m
csi-cephfsplugin-n2p8d                                      3/3     Running                           0          24m
csi-cephfsplugin-provisioner-645fc77d9f-7cqmk               6/6     Running                           0          24m
csi-cephfsplugin-provisioner-645fc77d9f-w2nvt               6/6     Running                           0          24m
csi-rbdplugin-nj4bc                                         3/3     Running                           0          24m
csi-rbdplugin-provisioner-676b57fb6b-kxkrc                  6/6     Running                           0          24m
csi-rbdplugin-provisioner-676b57fb6b-nw2p9                  6/6     Running                           0          24m
csi-rbdplugin-qk6zt                                         3/3     Running                           0          24m
csi-rbdplugin-xzljv                                         3/3     Running                           0          24m
noobaa-operator-6d769d678d-c9vnb                            1/1     Running                           0          24m
ocs-metrics-exporter-77b7dc5d8b-4jfv8                       1/1     Running                           0          24m
ocs-operator-577676bfbb-f9knq                               0/1     Running                           0          24m
rook-ceph-crashcollector-azure-vms000004-559d6596f7-vbbsc   0/1     Init:0/2                          0          24m
rook-ceph-crashcollector-azure-vms000005-7fdcc75d46-b9m6k   0/1     Init:0/2                          0          117s
rook-ceph-crashcollector-azure-vms000009-78cd6b9594-jd9ps   0/1     Init:0/2                          0          13m
rook-ceph-mon-a-f778c8475-49kkz                             0/2     Init:CreateContainerConfigError   0          24m
rook-ceph-mon-b-68998cf469-q4srk                            0/2     Init:CreateContainerConfigError   0          13m
rook-ceph-mon-c-7bd7469975-b9cqz                            0/2     Init:CreateContainerConfigError   0          117s
rook-ceph-operator-6fbdcbc9d7-446gf                         1/1     Running                           0          24m

The describe output of the failing mon pod:

% oc describe pod rook-ceph-mon-a-f778c8475-49kkz -n openshift-storage
Name:         rook-ceph-mon-a-f778c8475-49kkz
Namespace:    openshift-storage
Priority:     0
Node:         azure-vms000004/10.0.0.9
Start Time:   Thu, 22 Jul 2021 13:01:30 +0530
Labels:       app=rook-ceph-mon
              ceph_daemon_id=a
              ceph_daemon_type=mon
              mon=a
              mon_cluster=openshift-storage
              pod-template-hash=f778c8475
              pvc_name=rook-ceph-mon-a
              rook_cluster=openshift-storage
Annotations:  cni.projectcalico.org/podIP: 172.30.45.130/32
              cni.projectcalico.org/podIPs: 172.30.45.130/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "ips": [
                        "172.30.45.130"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "ips": [
                        "172.30.45.130"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: rook-ceph
Status:       Pending
IP:           172.30.45.130
IPs:
  IP:           172.30.45.130
Controlled By:  ReplicaSet/rook-ceph-mon-a-f778c8475
Init Containers:
  chown-container-data-dir:
    Container ID:  
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
      /var/lib/ceph/mon/ceph-a
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:        1
      memory:     2Gi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
  init-mon-fs:
    Container ID:  
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      ceph-mon
    Args:
      --fsid=42be1453-f05a-4590-b7c8-f5363ef9bc29
      --keyring=/etc/ceph/keyring-store/keyring
      --log-to-stderr=true
      --err-to-stderr=true
      --mon-cluster-log-to-stderr=true
      --log-stderr-prefix=debug 
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --public-addr=172.21.254.247
      --mkfs
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
      POD_NAME:                       rook-ceph-mon-a-f778c8475-49kkz (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               2147483648 (limits.memory)
      POD_MEMORY_REQUEST:             2147483648 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
Containers:
  mon:
    Container ID:  
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:      
    Port:          6789/TCP
    Host Port:     0/TCP
    Command:
      ceph-mon
    Args:
      --fsid=42be1453-f05a-4590-b7c8-f5363ef9bc29
      --keyring=/etc/ceph/keyring-store/keyring
      --log-to-stderr=true
      --err-to-stderr=true
      --mon-cluster-log-to-stderr=true
      --log-stderr-prefix=debug 
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --foreground
      --public-addr=172.21.254.247
      --setuser-match-path=/var/lib/ceph/mon/ceph-a/store.db
      --public-bind-addr=$(ROOK_POD_IP)
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Liveness:  exec [env -i sh -c ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
      POD_NAME:                       rook-ceph-mon-a-f778c8475-49kkz (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               2147483648 (limits.memory)
      POD_MEMORY_REQUEST:             2147483648 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
      ROOK_POD_IP:                     (v1:status.podIP)
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw,path="data")
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
  log-collector:
    Container ID:  
    Image:         registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      
      set -xe
      
      CEPH_CLIENT_ID=ceph-mon.a
      PERIODICITY=24h
      LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph
      
      if [ -z "$PERIODICITY" ]; then
        PERIODICITY=24h
      fi
      
      # edit the logrotate file to only rotate a specific daemon log
      # otherwise we will logrotate log files without reloading certain daemons
      # this might happen when multiple daemons run on the same machine
      sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE"
      
      while true; do
        sleep "$PERIODICITY"
        echo "starting log rotation"
        logrotate --verbose --force "$LOG_ROTATE_CEPH_FILE"
        echo "I am going to sleep now, see you in $PERIODICITY"
      done
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lkjlq (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
  rook-ceph-mons-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mons-keyring
    Optional:    false
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/log
    HostPathType:  
  rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/crash
    HostPathType:  
  ceph-daemon-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rook-ceph-mon-a
    ReadOnly:   false
  default-token-lkjlq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lkjlq
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason            Age                   From                      Message
  ----     ------            ----                  ----                      -------
  Warning  FailedScheduling  25m (x2 over 25m)     default-scheduler         0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match pod anti-affinity rules.
  Normal   Scheduled         25m                   default-scheduler         Successfully assigned openshift-storage/rook-ceph-mon-a-f778c8475-49kkz to azure-vms000004
  Normal   AddedInterface    24m                   multus                    Add eth0 [172.30.45.130/32]
  Warning  Failed            9m30s (x73 over 24m)  kubelet, azure-vms000004  Error: stat /var/data/kubelet/pods/5003c6bd-bed7-4293-85cf-112dcf751138/volumes/kubernetes.io~csi/pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e/mount: no such file or directory
  Normal   Pulled            4m32s (x97 over 24m)  kubelet, azure-vms000004  Container image "registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c" already present on machine

I've attached the ocs-operator and rook-ceph-operator logs

Comment 2 Shirisha S Rao 2021-07-22 08:02:41 UTC
Created attachment 1804383 [details]
ocs-operator logs

Comment 3 Mudit Agarwal 2021-07-22 08:13:17 UTC
By default the global mount path is under /var/lib/kubelet. This is failing because here the kubelet path has been changed to /var/data/kubelet.

Do we know why the mount path is changed?

Please follow 
[1] https://cloud.ibm.com/docs/openshift?topic=openshift-ocs-storage-install
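
If it helps, one way to double-check which kubelet root directory is actually in effect on a node is to look at the kubelet process arguments; the exact invocation varies by node image, so treat this as a sketch:

% oc debug node/azure-vms000004 -- chroot /host sh -c 'ps -o args= -C kubelet | tr " " "\n" | grep -- --root-dir'

If the path was overridden this should print something like --root-dir=/var/data/kubelet; if nothing comes back, the kubelet may be taking its default (/var/lib/kubelet) or reading the path from a config file instead of a flag.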

Comment 4 Shirisha S Rao 2021-07-22 08:17:22 UTC
We're using the Azure CSI driver to provision remote block volumes. The mon PVCs and PVs were successfully created and bound.

Comment 5 Shirisha S Rao 2021-07-22 08:18:56 UTC
Hi, in IBM ROKS we create a ConfigMap to change the kubelet path for OCS to /var/data/kubelet, like this:

% oc describe cm rook-ceph-operator-config -n openshift-storage
Name:         rook-ceph-operator-config
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>

Data
====
ROOK_CSI_KUBELET_DIR_PATH:
----
/var/data/kubelet
Events:  <none>
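
For reference, the equivalent manifest for that ConfigMap would look roughly like the following; this is a sketch rather than the exact object from this cluster, and the Rook operator pod may need a restart to pick up a changed path:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: openshift-storage
data:
  # Points the Rook-managed CSI drivers at the node's kubelet root directory,
  # so plugin sockets and volume mounts are created under the right path.
  ROOK_CSI_KUBELET_DIR_PATH: "/var/data/kubelet"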

Comment 6 Mudit Agarwal 2021-07-22 08:51:58 UTC
OK, if you are using ROOK_CSI_KUBELET_DIR_PATH then the problem lies somewhere else.
Since these are the mon pods, it's better if the Rook team takes a look.

Comment 7 Sahina Bose 2021-07-22 14:39:10 UTC
Travis, could someone look at this?

Comment 8 Travis Nielsen 2021-07-22 18:28:00 UTC
Shirisha, can we get a must-gather for more info? For example, we at least need:
- CephCluster CR
- Mon PVCs

But it sounds like something isn't working with the Azure volumes being mounted in the mons. Are you able to create a test pod with a PVC from the same Azure storage class?
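
A minimal check could look roughly like the following (the names test-azure-pvc and test-azure-pod are hypothetical, not from this cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-azure-pvc
  namespace: openshift-storage
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
  storageClassName: sat-azure-block-gold-metro
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
  name: test-azure-pod
  namespace: openshift-storage
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /mnt/test    # the PVC should show up mounted here if the CSI mount path is healthy
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-azure-pvc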

Comment 9 Shirisha S Rao 2021-07-23 06:38:32 UTC
Travis,
Yes, I was able to create a test pod and mount a PVC of the same storage class successfully.
This is the CephCluster CR:

 % oc get cephcluster -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    creationTimestamp: "2021-07-22T07:31:07Z"
    finalizers:
    - cephcluster.ceph.rook.io
    generation: 1
    labels:
      app: ocs-storagecluster
    managedFields:
    - apiVersion: ceph.rook.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"3ca890d2-ddb7-4209-aaa3-6f5c779d77ad"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
        f:spec:
          .: {}
          f:cephVersion:
            .: {}
            f:image: {}
          f:cleanupPolicy:
            .: {}
            f:sanitizeDisks: {}
          f:continueUpgradeAfterChecksEvenIfNotHealthy: {}
          f:crashCollector:
            .: {}
            f:disable: {}
          f:dashboard: {}
          f:dataDirHostPath: {}
          f:disruptionManagement:
            .: {}
            f:machineDisruptionBudgetNamespace: {}
            f:managePodBudgets: {}
          f:external:
            .: {}
            f:enable: {}
          f:healthCheck:
            .: {}
            f:daemonHealth: {}
          f:logCollector:
            .: {}
            f:enabled: {}
            f:periodicity: {}
          f:mgr:
            .: {}
            f:modules: {}
          f:mon:
            .: {}
            f:count: {}
            f:volumeClaimTemplate:
              .: {}
              f:metadata: {}
              f:spec: {}
              f:status: {}
          f:monitoring:
            .: {}
            f:enabled: {}
            f:rulesNamespace: {}
          f:network:
            .: {}
            f:hostNetwork: {}
            f:provider: {}
            f:selectors: {}
          f:placement:
            .: {}
            f:all: {}
            f:arbiter: {}
            f:mon: {}
          f:removeOSDsIfOutAndSafeToRemove: {}
          f:resources:
            .: {}
            f:mds: {}
            f:mgr: {}
            f:mon: {}
            f:rgw: {}
          f:security:
            .: {}
            f:kms: {}
          f:storage:
            .: {}
            f:config: {}
            f:storageClassDeviceSets: {}
      manager: ocs-operator
      operation: Update
      time: "2021-07-22T07:31:07Z"
    - apiVersion: ceph.rook.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"cephcluster.ceph.rook.io": {}
        f:status:
          .: {}
          f:conditions: {}
          f:message: {}
          f:phase: {}
          f:state: {}
          f:version: {}
      manager: rook
      operation: Update
      time: "2021-07-22T07:31:13Z"
    name: ocs-storagecluster-cephcluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ocs.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: StorageCluster
      name: ocs-storagecluster
      uid: 3ca890d2-ddb7-4209-aaa3-6f5c779d77ad
    resourceVersion: "2837895"
    selfLink: /apis/ceph.rook.io/v1/namespaces/openshift-storage/cephclusters/ocs-storagecluster-cephcluster
    uid: ebd50e6c-97af-42a6-b6b6-eee525d2a2de
  spec:
    cephVersion:
      image: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
    cleanupPolicy:
      sanitizeDisks: {}
    continueUpgradeAfterChecksEvenIfNotHealthy: true
    crashCollector:
      disable: false
    dashboard: {}
    dataDirHostPath: /var/lib/rook
    disruptionManagement:
      machineDisruptionBudgetNamespace: openshift-machine-api
      managePodBudgets: true
    external:
      enable: false
    healthCheck:
      daemonHealth:
        mon:
          timeout: 15m
        osd: {}
        status: {}
    logCollector:
      enabled: true
      periodicity: 24h
    mgr:
      modules:
      - enabled: true
        name: pg_autoscaler
      - enabled: true
        name: balancer
    mon:
      count: 3
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
          storageClassName: sat-azure-block-gold-metro
          volumeMode: Filesystem
        status: {}
    monitoring:
      enabled: true
      rulesNamespace: openshift-storage
    network:
      hostNetwork: false
      provider: ""
      selectors: null
    placement:
      all:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
      arbiter:
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
      mon:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - rook-ceph-mon
            topologyKey: topology.rook.io/rack
    removeOSDsIfOutAndSafeToRemove: false
    resources:
      mds:
        limits:
          cpu: "3"
          memory: 8Gi
        requests:
          cpu: "3"
          memory: 8Gi
      mgr:
        limits:
          cpu: "1"
          memory: 3Gi
        requests:
          cpu: "1"
          memory: 3Gi
      mon:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      rgw:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
    security:
      kms: {}
    storage:
      config: null
      storageClassDeviceSets:
      - count: 1
        name: ocs-deviceset-0
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        portable: true
        preparePlacement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        resources:
          limits:
            cpu: "2"
            memory: 5Gi
          requests:
            cpu: "2"
            memory: 5Gi
        tuneFastDeviceClass: true
        volumeClaimTemplates:
        - metadata:
            annotations:
              crushDeviceClass: ""
            creationTimestamp: null
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 100Gi
            storageClassName: sat-azure-block-gold-metro
            volumeMode: Block
          status: {}
      - count: 1
        name: ocs-deviceset-1
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        portable: true
        preparePlacement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        resources:
          limits:
            cpu: "2"
            memory: 5Gi
          requests:
            cpu: "2"
            memory: 5Gi
        tuneFastDeviceClass: true
        volumeClaimTemplates:
        - metadata:
            annotations:
              crushDeviceClass: ""
            creationTimestamp: null
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 100Gi
            storageClassName: sat-azure-block-gold-metro
            volumeMode: Block
          status: {}
      - count: 1
        name: ocs-deviceset-2
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        portable: true
        preparePlacement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cluster.ocs.openshift.io/openshift-storage
                  operator: Exists
          tolerations:
          - effect: NoSchedule
            key: node.ocs.openshift.io/storage
            operator: Equal
            value: "true"
          topologySpreadConstraints:
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchExpressions:
              - key: ceph.rook.io/pvc
                operator: Exists
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        resources:
          limits:
            cpu: "2"
            memory: 5Gi
          requests:
            cpu: "2"
            memory: 5Gi
        tuneFastDeviceClass: true
        volumeClaimTemplates:
        - metadata:
            annotations:
              crushDeviceClass: ""
            creationTimestamp: null
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 100Gi
            storageClassName: sat-azure-block-gold-metro
            volumeMode: Block
          status: {}
  status:
    conditions:
    - lastHeartbeatTime: "2021-07-22T07:31:13Z"
      lastTransitionTime: "2021-07-22T07:31:13Z"
      message: Cluster is creating
      reason: ClusterProgressing
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2021-07-22T07:41:25Z"
      lastTransitionTime: "2021-07-22T07:41:25Z"
      message: Failed to create cluster
      reason: ClusterFailure
      status: "True"
      type: Failure
    message: Failed to create cluster
    phase: Failure
    state: Error
    version:
      image: registry.redhat.io/rhceph/rhceph-4-rhel8@sha256:725f93133acc0fb1ca845bd12e77f20d8629cad0e22d46457b2736578698eb6c
      version: 14.2.11-181
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

And this is the PVC:

% oc get pvc rook-ceph-mon-a -n openshift-storage -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
    volume.kubernetes.io/selected-node: azure-vms000004
  creationTimestamp: "2021-07-22T07:31:18Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: rook-ceph-mon
    ceph-version: 14.2.11-181
    ceph_daemon_id: a
    ceph_daemon_type: mon
    mon: a
    mon_canary: "true"
    mon_cluster: openshift-storage
    pvc_name: rook-ceph-mon-a
    rook-version: 4.7-145.a418c74.release_4.7
    rook_cluster: openshift-storage
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
          f:ceph-version: {}
          f:ceph_daemon_id: {}
          f:ceph_daemon_type: {}
          f:mon: {}
          f:mon_canary: {}
          f:mon_cluster: {}
          f:pvc_name: {}
          f:rook-version: {}
          f:rook_cluster: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"ebd50e6c-97af-42a6-b6b6-eee525d2a2de"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:accessModes: {}
        f:resources:
          f:requests:
            .: {}
            f:storage: {}
        f:storageClassName: {}
        f:volumeMode: {}
    manager: rook
    operation: Update
    time: "2021-07-22T07:31:18Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:volume.kubernetes.io/selected-node: {}
    manager: kube-scheduler
    operation: Update
    time: "2021-07-22T07:31:19Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:pv.kubernetes.io/bind-completed: {}
          f:pv.kubernetes.io/bound-by-controller: {}
          f:volume.beta.kubernetes.io/storage-provisioner: {}
      f:spec:
        f:volumeName: {}
      f:status:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-07-22T07:31:21Z"
  name: rook-ceph-mon-a
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    uid: ebd50e6c-97af-42a6-b6b6-eee525d2a2de
  resourceVersion: "2086629"
  selfLink: /api/v1/namespaces/openshift-storage/persistentvolumeclaims/rook-ceph-mon-a
  uid: f7cb5eb2-5231-4ba2-9757-2bed3cf0207e
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: sat-azure-block-gold-metro
  volumeMode: Filesystem
  volumeName: pvc-f7cb5eb2-5231-4ba2-9757-2bed3cf0207e
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  phase: Bound

Comment 10 Shirisha S Rao 2021-07-26 12:26:18 UTC
Please find the must-gather for the cluster here : https://drive.google.com/file/d/1Hmxk-G8J_olGqJI0nGgVkfNeGm9KNt-d/view?usp=sharing

Comment 11 Travis Nielsen 2021-07-29 19:37:31 UTC
The interesting point here is that the mon PVCs are bound to their PVs and successfully mounted in the mon canary pods, so we know the volumes are basically working. Then Rook deletes the canary pods and starts up the mon daemon pods, where the kubelet reports that these volumes don't exist.

Shirisha, could you check on the following?
- What nodes are the mon-canary pods assigned to? Are the mon daemon pods starting on the same nodes or on different nodes?
- If you create another test pod with a test volume, then delete the pod (without deleting the PVC) and create a new pod, does it start successfully? Or does that test pod hit the same issue after the volume is re-used? Is it a factor whether the pod starts on the same node or a different node? (A quick command sequence for this check is sketched below.)
- Does it change anything if you change the reclaimPolicy on the Azure storage class to Retain?
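
For the second check, a quick sequence could look roughly like this (test-azure-pvc/test-azure-pod are the hypothetical names from the sketch under comment 8, and the manifest file names are placeholders; only the pod is deleted, the PVC stays bound):

% oc apply -f test-pvc-and-pod.yaml                            # hypothetical manifest: PVC + pod
% oc wait pod/test-azure-pod -n openshift-storage --for=condition=Ready --timeout=5m
% oc delete pod test-azure-pod -n openshift-storage            # remove only the pod, keep the PVC
% oc apply -f test-pod-only.yaml                               # recreate the pod against the same PVC
% oc get pod test-azure-pod -n openshift-storage -o wide       # does it start? same node or a different one?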

Comment 12 Mudit Agarwal 2021-08-03 06:40:26 UTC
Not a 4.8 blocker

Comment 13 Shirisha S Rao 2021-08-04 07:28:49 UTC
Hi Travis,

There was a mount propagation issue with the installed Azure CSI driver, and fixing that resolved this issue.
Thank you for looking into it.
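
For anyone hitting something similar: the setting usually involved is mount propagation on the CSI node plugin's kubelet-dir volume. Below is a generic, illustrative fragment of what a node-plugin DaemonSet pod spec typically needs — not the actual Azure driver manifest from this cluster, and the image reference is hypothetical:

spec:
  containers:
  - name: csi-node-driver
    image: example.registry/azure-disk-csi-node:latest   # hypothetical image reference
    volumeMounts:
    - name: kubelet-dir
      mountPath: /var/data/kubelet        # must match the node's actual kubelet root directory
      mountPropagation: Bidirectional     # so mounts made by the driver propagate back to the host
  volumes:
  - name: kubelet-dir
    hostPath:
      path: /var/data/kubelet
      type: Directory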