Bug 1744958 - Pod fails to move to state "Running" due to incorrect "FailedMount" event
Summary: Pod fails to move to state "Running" due to incorrect "FailedMount" event
Keywords:
Status: CLOSED DUPLICATE of bug 1745773
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Fabio Bertinatto
QA Contact: Liang Xia
URL:
Whiteboard: ocs-monkey
Depends On:
Blocks:
 
Reported: 2019-08-23 11:23 UTC by Shyamsundar
Modified: 2019-08-29 14:47 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-29 14:47:50 UTC
Target Upstream Version:
Embargoed:


Attachments
Must gather output for OCP and OCS must gather (11.75 MB, application/gzip)
2019-08-23 11:23 UTC, Shyamsundar

Description Shyamsundar 2019-08-23 11:23:56 UTC
Created attachment 1607280 [details]
Must gather output for OCP and OCS must gather

Description of problem:

A pod fails to move to the "Running" state due to a "FailedMount" event on a CSI-provisioned volume.

Investigating the CSI (Ceph RBD in this case) nodeplugin logs and state shows that the volume is mounted successfully (and within a reasonable time) on the node where the pod is scheduled.

The problem therefore seems to be that other parts of the system do not pick up the change in state and keep reporting "FailedMount" events.

The end result is that the pod fails to move into the "Running" state.
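In other words, the kubelet-side pod events and the CSI nodeplugin disagree about the mount. A rough way to cross-check the two on a live cluster is sketched below; the pod, node, and PV names are placeholders and vary per run:

$ oc describe pod -n testws <stuck-pod> | grep -A 15 'Events:'
$ oc debug node/<node-where-pod-is-scheduled>
sh-4.4# chroot /host
sh-4.4# findmnt | grep <pv-name>   # the RBD mount is present on the node despite the FailedMount events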

Version-Release number of selected component (if applicable):

The setup was on an AWS OCP dev cluster.

$ oc version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v4.2.0-alpha.0-21-g3daf746", GitCommit:"3daf746", GitTreeState:"clean", BuildDate:"2019-08-21T16:17:30Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+757c32e", GitCommit:"757c32e", GitTreeState:"clean", BuildDate:"2019-08-21T17:56:00Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
OpenShift Version: 4.2.0-0.ci-2019-08-22-111635

How reproducible:

Happens once in every 2 to 3 runs of the reproducer described in the "Steps to Reproduce" section. When it occurs, it takes anywhere between 2 and 10 minutes to reproduce.

Steps to Reproduce:
1. Set up an OCP+OCS cluster with Rook+Ceph and a storage class created for RBD (named csi-rbd)
2. Run the ocs-monkey tool (https://github.com/JohnStrunk/ocs-monkey) against this setup with the following CLI options:
  - ./workload_runner.py -l ./ -m ReadWriteOnce -n testws --oc /path/to/oc --ocs-namespace <rook-ceph namespace> -s csi-rbd
3. In about 2-10 minutes the workload runner will report a failure and start gathering OCS and OCP must-gather logs

Actual results:

After the failure is reported and the logs are gathered, check whether the failure reported by ocs-monkey is of this form:
2019-08-22 14:17:44,014 osio - ERROR - Unhealthy: testws/osio-worker-375766658
2019-08-22 14:17:44,015 root - INFO - starting log collection

If so, inspect the pods in the namespace "testws" to see whether any are stuck in the "ContainerCreating" state and, when described, report "FailedMount" reasons in their events like the ones below:
Events:
  Type     Reason                  Age                    From                                   Message
  ----     ------                  ----                   ----                                   -------
  Normal   Scheduled               11m                    default-scheduler                      Successfully assigned testws/osio-worker-375766658-7f4cf9698c-mjtgp to ip-10-0-149-116.ec2.internal
  Warning  FailedAttachVolume      11m                    attachdetach-controller                Multi-Attach error for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  10m                    attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570"
  Warning  FailedMount             2m10s (x4 over 8m58s)  kubelet, ip-10-0-149-116.ec2.internal  Unable to mount volumes for pod "osio-worker-375766658-7f4cf9698c-mjtgp_testws(643b767a-c4e7-11e9-a719-0ac717fcefd8)": timeout expired waiting for volumes to attach or mount for pod "testws"/"osio-worker-375766658-7f4cf9698c-mjtgp". list of unmounted volumes=[data]. list of unattached volumes=[temp data default-token-rxwbx]
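A quick way to list such pods and pull their events is sketched below (the grep filters are just one convenient approach; <pod-name> is a placeholder):

$ oc get pods -n testws --no-headers | grep ContainerCreating
$ oc describe pod -n testws <pod-name> | grep -A 15 'Events:'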

Expected results:

The pod should not keep failing with "FailedMount" events when the CSI nodeplugin has already completed the actual mount.

Master Log:

Attached ocp must-gather and ocs must-gather

Node Log (of failed PODs):

Attached ocp must-gather and ocs must-gather

PV Dump:

$ oc get pv -o yaml pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2019-08-22T14:13:20Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570
  resourceVersion: "21693"
  selfLink: /api/v1/persistentvolumes/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570
  uid: fe7227e4-c4e6-11e9-926c-02d05bd2f570
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-375766658
    namespace: testws
    resourceVersion: "21668"
    uid: fdd30dfe-c4e6-11e9-926c-02d05bd2f570
  csi:
    driver: rook-ceph.rbd.csi.ceph.com
    fsType: xfs
    nodeStageSecretRef:
      name: rook-ceph-csi
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      pool: replicapool
      storage.kubernetes.io/csiProvisionerIdentity: 1566482577706-8081-rook-ceph.rbd.csi.ceph.com
    volumeHandle: 0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-rbd
  volumeMode: Filesystem
status:
  phase: Bound

PVC Dump:

$ oc get pvc -n testws -o yaml pvc-375766658
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2019-08-22T14:13:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pvc-375766658
  namespace: testws
  resourceVersion: "21697"
  selfLink: /api/v1/namespaces/testws/persistentvolumeclaims/pvc-375766658
  uid: fdd30dfe-c4e6-11e9-926c-02d05bd2f570
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: csi-rbd
  volumeMode: Filesystem
  volumeName: pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  phase: Bound

StorageClass Dump (if StorageClass used by PV/PVC):

$ oc get sc/csi-rbd -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"csi-rbd"},"parameters":{"clusterID":"rook-ceph","csi.storage.k8s.io/fstype":"xfs","csi.storage.k8s.io/node-stage-secret-name":"rook-ceph-csi","csi.storage.k8s.io/node-stage-secret-namespace":"rook-ceph","csi.storage.k8s.io/provisioner-secret-name":"rook-ceph-csi","csi.storage.k8s.io/provisioner-secret-namespace":"rook-ceph","pool":"replicapool"},"provisioner":"rook-ceph.rbd.csi.ceph.com","reclaimPolicy":"Delete"}
  creationTimestamp: "2019-08-22T14:01:56Z"
  name: csi-rbd
  resourceVersion: "17408"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/csi-rbd
  uid: 66e6086b-c4e5-11e9-926c-02d05bd2f570
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/fstype: xfs
  csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  pool: replicapool
provisioner: rook-ceph.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Additional info:

### Here is the full application pod description that failed ###
$ oc describe -n testws pod/osio-worker-375766658-7f4cf9698c-mjtgp
Name:               osio-worker-375766658-7f4cf9698c-mjtgp
Namespace:          testws
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-149-116.ec2.internal/10.0.149.116
Start Time:         Thu, 22 Aug 2019 10:16:10 -0400
Labels:             deployment-id=375766658
                    pod-template-hash=7f4cf9698c
Annotations:        openshift.io/scc: restricted
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/osio-worker-375766658-7f4cf9698c
Containers:
  osio-workload:
    Container ID:  
    Image:         quay.io/johnstrunk/osio-workload
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --untar-rate
      10
      --rm-rate
      10
      --kernel-slots
      3
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Readiness:      exec [/health.sh] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from data (rw)
      /status from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rxwbx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-375766658
    ReadOnly:   false
  default-token-rxwbx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rxwbx
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                                   Message
  ----     ------                  ----                   ----                                   -------
  Normal   Scheduled               11m                    default-scheduler                      Successfully assigned testws/osio-worker-375766658-7f4cf9698c-mjtgp to ip-10-0-149-116.ec2.internal
  Warning  FailedAttachVolume      11m                    attachdetach-controller                Multi-Attach error for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  10m                    attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570"
  Warning  FailedMount             2m10s (x4 over 8m58s)  kubelet, ip-10-0-149-116.ec2.internal  Unable to mount volumes for pod "osio-worker-375766658-7f4cf9698c-mjtgp_testws(643b767a-c4e7-11e9-a719-0ac717fcefd8)": timeout expired waiting for volumes to attach or mount for pod "testws"/"osio-worker-375766658-7f4cf9698c-mjtgp". list of unmounted volumes=[data]. list of unattached volumes=[temp data default-token-rxwbx]

### CSI RBD nodeplugin information ###
$ oc get pods -n rook-ceph -o wide | grep rbdplugin
csi-rbdplugin-6lh65                           3/3     Running     0          25m   10.0.149.116   ip-10-0-149-116.ec2.internal   <none>           <none>
csi-rbdplugin-gmlpx                           3/3     Running     0          25m   10.0.139.83    ip-10-0-139-83.ec2.internal    <none>           <none>
csi-rbdplugin-grpjn                           3/3     Running     0          25m   10.0.171.100   ip-10-0-171-100.ec2.internal   <none>           <none>
csi-rbdplugin-provisioner-0                   5/5     Running     0          25m   10.131.0.13    ip-10-0-171-100.ec2.internal   <none>           <none>

### Pertinent logs from the RBD nodeplugin when the volume was staged and published: ###
(Source: ocs-must-gather, ./ocs-monkey-173462618/ocs-must-gather/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-6lh65/csi-rbdplugin/csi-rbdplugin/logs/current.log)
Time stamp when the volume was staged: 14:16:37.683840

2019-08-22T14:16:30.683964044Z I0822 14:16:30.681870   39231 log.go:44] ID: 100 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-22T14:16:37.606735716Z I0822 14:16:37.600294   39231 log.go:44] ID: 100 GRPC response: {}

2019-08-22T14:16:37.620975755Z I0822 14:16:37.620492   39231 log.go:44] ID: 110 GRPC call: /csi.v1.Node/NodePublishVolume
2019-08-22T14:16:37.625522042Z I0822 14:16:37.620517   39231 log.go:44] ID: 110 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/globalmount","target_path":"/var/lib/kubelet/pods/643b767a-c4e7-11e9-a719-0ac717fcefd8/volumes/kubernetes.io~csi/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566482577706-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d"}
2019-08-22T14:16:37.683960521Z I0822 14:16:37.683840   39231 log.go:50] ID: 110 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/globalmount/0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d to targetPath /var/lib/kubelet/pods/643b767a-c4e7-11e9-a719-0ac717fcefd8/volumes/kubernetes.io~csi/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/mount
2019-08-22T14:16:37.684804717Z I0822 14:16:37.683869   39231 log.go:44] ID: 110 GRPC response: {}
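For reference, the entries above can also be pulled from the running nodeplugin pod on the affected node, roughly as follows (assuming the plugin container inside the csi-rbdplugin pod is named csi-rbdplugin, as in the must-gather layout; the grep pattern simply matches the messages quoted above):

$ oc logs -n rook-ceph csi-rbdplugin-6lh65 -c csi-rbdplugin | grep -E 'NodeStageVolume|NodePublishVolume|successfully mounted'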

### Partial kubelet logs from the node where volume was staged and published ###

sh-4.4# journalctl -u kubelet > /tmp/kub.log
sh-4.4# cat /tmp/kub.log | grep fdd30dfe-c4e6-11e9-926c-02d05bd2f570
Aug 22 14:15:57 ip-10-0-149-116 hyperkube[1163]: I0822 14:15:57.964398    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:15:57 ip-10-0-149-116 hyperkube[1163]: E0822 14:15:57.964525    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:15:58.464479379 +0000 UTC m=+1782.779281076 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:15:58 ip-10-0-149-116 hyperkube[1163]: I0822 14:15:58.537435    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:15:58 ip-10-0-149-116 hyperkube[1163]: E0822 14:15:58.538046    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:15:59.538001315 +0000 UTC m=+1783.852803039 (durationBeforeRetry 1s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:15:59 ip-10-0-149-116 hyperkube[1163]: I0822 14:15:59.548020    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:15:59 ip-10-0-149-116 hyperkube[1163]: E0822 14:15:59.548268    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:16:01.548228037 +0000 UTC m=+1785.863029746 (durationBeforeRetry 2s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:16:01 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:01.605943    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:16:01 ip-10-0-149-116 hyperkube[1163]: E0822 14:16:01.606188    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:16:05.606140163 +0000 UTC m=+1789.920941841 (durationBeforeRetry 4s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:16:05 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:05.651696    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:16:06 ip-10-0-149-116 hyperkube[1163]: E0822 14:16:06.457283    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:16:14.457235875 +0000 UTC m=+1798.772037565 (durationBeforeRetry 8s). Error: "Volume not attached according to node status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:16:14 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:14.476106    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-pbhcp" (UID: "5abd666d-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:16:14 ip-10-0-149-116 hyperkube[1163]: E0822 14:16:14.485682    1163 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\"" failed. No retries permitted until 2019-08-22 14:16:30.485636719 +0000 UTC m=+1814.800438468 (durationBeforeRetry 16s). Error: "Volume not attached according to node status for volume \"pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d\") pod \"osio-worker-375766658-7f4cf9698c-pbhcp\" (UID: \"5abd666d-c4e7-11e9-a719-0ac717fcefd8\") "
Aug 22 14:16:30 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:30.553588    1163 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:16:30 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:30.560062    1163 operation_generator.go:1339] Controller attach succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") device path: "csi-c3f012e05fd3cff043645fbd2b3ba824aa055564194b2e1f55574787ac5a8dbe"
Aug 22 14:16:30 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:30.657457    1163 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8")
Aug 22 14:16:30 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:30.658345    1163 operation_generator.go:587] MountVolume.WaitForAttach entering for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") DevicePath "csi-c3f012e05fd3cff043645fbd2b3ba824aa055564194b2e1f55574787ac5a8dbe"
Aug 22 14:16:30 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:30.665243    1163 operation_generator.go:596] MountVolume.WaitForAttach succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") DevicePath "csi-c3f012e05fd3cff043645fbd2b3ba824aa055564194b2e1f55574787ac5a8dbe"
Aug 22 14:16:37 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:37.603104    1163 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/globalmount"

Comment 2 Fabio Bertinatto 2019-08-29 13:10:23 UTC
The last log message from the Kubelet was:

> Aug 22 14:16:37 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:37.603104    1163 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/globalmount"

The following message should be something along the lines of:

> MountVolume.SetUp succeeded for volume...

Since that was the last message shown in the journalctl output, I believe the MountVolume.SetUp call was still running by the time the logs were taken. That is where the file ownership is set, so I believe this is a duplicate of bug #1745773.
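For completeness, whether SetUp eventually completed could be checked after the fact on the node with something like the following (the grep strings mirror the kubelet messages quoted above):

sh-4.4# journalctl -u kubelet | grep 'MountVolume.SetUp succeeded' | grep pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570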

Can you confirm it's a duplicate so we can close this?

Comment 3 Shyamsundar 2019-08-29 14:47:50 UTC
(In reply to Fabio Bertinatto from comment #2)
> The last log message from the Kubelet was:
> 
> > Aug 22 14:16:37 ip-10-0-149-116 hyperkube[1163]: I0822 14:16:37.603104    1163 operation_generator.go:623] MountVolume.MountDevice succeeded for volume "pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-fde9dd57-c4e6-11e9-ad2b-0a580a83000d") pod "osio-worker-375766658-7f4cf9698c-mjtgp" (UID: "643b767a-c4e7-11e9-a719-0ac717fcefd8") device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-fdd30dfe-c4e6-11e9-926c-02d05bd2f570/globalmount"
> 
> The following message should be something along the lines of: 
> 
> > MountVolume.SetUp succeeded for volume...
> 
> Since that was the last message shown in the journalctl output, I believe
> the MountVolume.SetUp call was still running by the time the logs were taken.
> That is where the file ownership is set, so I believe this is a duplicate of
> bug #1745773.
> 
> Can you confirm it's a duplicate so we can close this?

By the looks of it, I did not wait long enough to see whether the pod eventually came up. It also seems related to the other bug. Closing this as a duplicate; I will reopen/rekindle the conversation if I see a pod stuck for a much longer duration.

Thanks for taking a look at this.

*** This bug has been marked as a duplicate of bug 1745773 ***

