Bug 1745776
| Summary: | Pod reports "FailedMount" events for Ceph-CSI PVs due to stale mount/stage of the same PV on a different node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Shyamsundar <srangana> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Liang Xia <lxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | aos-bugs, aos-storage-staff, bbennett, bchilds, fdeutsch, fsimonce, hchiramm, hekumar, jsafrane, jstrunk, mrajanna |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | ocs-monkey | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text:

Cause: Kubernetes did not unmount a CSI volume when a pod was deleted in the case where the volume mount had timed out but was eventually completed by the CSI driver.

Consequence: A volume was mounted on a node without Kubernetes knowing about it, and could not be mounted anywhere else.

Fix: Kubernetes now waits for a final success or error after the CSI driver returns a timeout or a similar transient error.

Result: Kubernetes knows whether a volume was mounted or unmounted, and cleans up possible stray mounts when a pod is deleted.
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-04 15:39:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Shyamsundar, 2019-08-26 21:42:14 UTC
**** Here is my RCA of this bug ****

*) NodeStageVolume did not complete within the "15s" timeout [1], which kicked off the cleanup operation; the cleanup then failed [2] because of the method Ceph-CSI uses for **staging at present** [3], compared to how it staged some time back.

1) The question of why a "15s" timeout?

--> Short answer: this would be solved by the PR I filed against OpenShift Origin 10 days back (filed proactively, so there is no bug reference in it).

PR reference: https://github.com/openshift/origin/pull/23624

--snip-- from the PR description
In case of a short timeout (15s), some CSI plugins like RBD can behave in odd ways. That is, when the timeout occurs, the RBD plugin tries an unmap and performs other cleanup/rollback operations. In most cases the mapping is simply taking more than 15s because there are parallel requests to serve and map. If the timeout is higher, most of the mapping operations will succeed, which avoids rollback/cleanup operations in the system and in turn avoids other issues.
--/snip--

[3] We recently changed the way Ceph-CSI does the staging, via this PR: https://github.com/ceph/ceph-csi/pull/510

In short: instead of mounting on the 'staging target path' given by the CO (the container orchestrator, i.e. Kubernetes), we create a 'metadata file' and a 'new directory' inside the CO-provided 'staging target path', and then treat this "new" directory as our 'staging target path'. Fetching this new directory name from the logs, it is:

/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010

----> NOTE that this directory is INSIDE 'globalmount'. The metadata file name is, in general, the 'volume ID'.

[2] The 'entire' staging operation did not finish within this '15s', which triggered the cleanup operation; in this setup that means an attempted removal of the 'staging target path'. The removal of this path failed because the path contained an 'extra file and directory'.

Some interesting things I would like to point out here:
-------------------------------------------------------------------

1) The CO passes the 'staging target path' to the SP (the storage provider, here Ceph-CSI), and per the spec (https://github.com/container-storage-interface/spec/blob/master/spec.md), or at least per our understanding of it, the SP can choose its own 'staging target path'.

The first immediate question: is that understanding of the SPEC correct? At the least, the SPEC is not clear on whether staging MUST happen on the exact path provided by the CO. If it is not mandatory, I see a couple of bugs, or possibilities of issues, here. The SP then has "2" options: either "stage" on a totally different path, or stage on a directory/path inside the CO-provided "staging target path" (.../globalmount/).

-> On a "different path": if the SP stages on a different path, the cleanup code in the CO is ***not*** correct; it is either a no-op or can always fail, because cleanup is attempted on the only path KNOWN to the CO, namely the path the CO passed in the Stage RPC call.

-> On a 'directory' or 'path' inside the staging target path: if the SP uses a subdirectory, as in this case, or puts any file/directory inside the staging target path, the removal/cleanup will fail as shown in the logs, with the error 'directory not empty'.

The immediate fix I can think of here: in 'removeMountDir', use 'os.RemoveAll' instead of 'os.Remove'!!!
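To make the failure mode concrete, here is a minimal stand-alone sketch (the file and directory names are illustrative; this is not the kubelet code) showing that os.Remove fails with 'directory not empty' on a staging path that still holds the SP's metadata file and stage subdirectory, while os.RemoveAll would delete it:

--snip--
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Stand-in for the CO-provided .../pvc-xxx/globalmount directory.
	staging, err := os.MkdirTemp("", "globalmount")
	if err != nil {
		panic(err)
	}

	// Stand-ins for what the SP (ceph-csi) leaves inside the staging path:
	// a metadata file named after the volume ID, plus a new stage directory.
	volID := "0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010"
	if err := os.WriteFile(filepath.Join(staging, volID), []byte("{}"), 0600); err != nil {
		panic(err)
	}
	if err := os.Mkdir(filepath.Join(staging, "stage-dir"), 0750); err != nil {
		panic(err)
	}

	// What removeMountDir effectively does today: fails, directory not empty.
	fmt.Println("os.Remove:", os.Remove(staging))

	// The suggested change: succeeds, at the cost of deleting whatever the
	// SP stored under the staging path.
	fmt.Println("os.RemoveAll:", os.RemoveAll(staging))
}
--/snip--

Run on Linux, the first call prints a 'directory not empty' error, matching the kubelet log below.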
For reference, the current removeMountDir code in the kubelet:

--snip--
func removeMountDir(plug *csiPlugin, mountPath string) error {
	klog.V(4).Info(log("removing mount path [%s]", mountPath))
	....
	if !mnt {
		klog.V(4).Info(log("dir not mounted, deleting it [%s]", mountPath))
		if err := os.Remove(mountPath); err != nil && !os.IsNotExist(err) {
			return errors.New(log("failed to remove dir [%s]: %v", mountPath, err))
		}
	....
--/snip--

2) The other interesting fact here is this: suppose the cleanup failed for the above reason. It looks like the CO will **think** that staging actually failed on NODE 1, and will reschedule or move the workload/pod to a second node, which causes "staging" to happen on the second node. There is a possibility that, meanwhile, the SP has completed 'staging' of the volume on NODE 1 too!

If that is what happened, I believe it answers ***how we landed on 'staging' of the volume on MORE than ONE node at the same time***, and thus the 'data corruption' or "FS integrity" issue!

Reference bugzillas:

https://bugzilla.redhat.com/show_bug.cgi?id=1745773 (Delay in detecting a successful CSI mount, or incorrect report of a failed mount, by kubelet for a Ceph-CSI based volume)
https://bugzilla.redhat.com/show_bug.cgi?id=1746111 (RWO Ceph RBD PV was staged on multiple worker nodes, causing filesystem header corruption)

----------------------------------------------
LOGS used for RCA or hypothesis
----------------------------------------------------------------------

---> ceph-csi NodeStageVolume invoked here as per timestamps (logs from the CSI node plugin interspersed for clarity) <---

2019-08-26T15:48:06.921973038Z I0826 15:48:06.921427 30300 utils.go:119] ID: 346 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:48:06.927130275Z I0826 15:48:06.921449 30300 utils.go:120] ID: 346 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010"

---> end interspersed logs (Note: the NodeStageVolume request has not yet returned success or failure at this point in time) <---

Aug 26 15:48:10 ip-10-0-129-7 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010.

---> Some error is triggered and hence cleanup routines have kicked in (within 15 seconds?)
Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868121 1189 csi_mounter.go:433] kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Directory cleanup fails as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868142 1189 csi_attacher.go:322] kubernetes.io/csi: attacher.MountDevice failed to remove mount dir after errir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Error carried over from above

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868262 1189 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\"" failed. No retries permitted until 2019-08-26 15:48:22.368216697 +0000 UTC m=+1632.155530185 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\") pod \"osio-worker-389152222-7b9786fb59-rtr66\" (UID: \"d1b70966-c818-11e9-9022-02f7b213f410\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"

---> Looks like the original error was a deadline exceeded, IOW a request timeout

-------------------------------------------------------------------------------------------------------------------

Your thoughts on this analysis/RCA and the suggested fix would be much appreciated, so that we can take it up from there.

(In reply to Humble Chirammal from comment #4)
> 1) The question on why "15s" timeout ?
>
> --> Short answer would be, this would be solved with the PR which I filed
> 10 days back ( actually proactively, so no BUG reference in the same) to
> Openshift Origin.
>
> PR reference # https://github.com/openshift/origin/pull/23624

I was looking at the openshift/kubernetes branch, where it is 2 minutes, but this explains it. We would need to move this timeout up. BUT, to reiterate, moving it to 2 minutes is NOT the fix I am looking for. I am attempting to understand why the volume was staged on two nodes when the first node did not respond with a success or a failure.

> [2] The 'entire' staging operation didnt finish within this '15s' and caused
> the cleanup operation to be triggered which is actually 'staging target
> path' removal attempted in this setup. When it was attempted, the 'removal'
> of this path failed as it carried an 'extra file and directory' inside this
> path.

Yes, this is what I noted as "---> Directory cleanup fails as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)".

> Suppose the cleanup failed due to above reason, it looks like 'CO' will
> **think** that, staging is actually failed on NODE 1, so reschedule or move
> the workload/POD to second NODE!

If the removeMountDir error is masking the original timeout error, and the state machine assumes the CSI plugin returned an error, then this is what requires a fix.

BUT, if irrespective of the error the state machine assumes the Stage failed (i.e. even on timeout (deadline exceeded) errors), and still proceeds with detach and attach on a different node (because the pod moved), then that is what we need to fix.

NOTE: Changing the CO to perform a RemoveAll instead of a Remove would be incorrect; it is the plugin's responsibility to keep the staging path clean at the end of an Unstage or a failed Stage. Further, as this is the staging path, on timeout errors such a RemoveAll may inadvertently remove user/app data from the mount if the mount succeeds a little later. I would not change the CO behaviour here.

(In reply to Shyamsundar from comment #5)
> If the removeMountDir error is masking the original timeout error, and the
> state machine assumes the CSI plugin returned an error, then this is what
> requires a fix.

Just to clear up the possibility of error masking by the deferred removeMountDir call: this is not feasible, as the defer declares its own err variable in its own scope, due to the := in the statement:
https://github.com/openshift/origin/blob/f4eaba02647295e3b1c26ac859dc2e723ad08149/vendor/k8s.io/kubernetes/pkg/volume/csi/csi_attacher.go#L321

Tested out in a sample go program: https://play.golang.org/p/EiSfoU3zNXg
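For reference, a minimal sketch along the lines of the linked playground program (the playground's exact content is not reproduced here; the names are illustrative), showing why the := inside the deferred closure cannot overwrite the function's named return value:

--snip--
package main

import (
	"errors"
	"fmt"
)

func mountDevice() (err error) {
	defer func() {
		// The := declares a NEW err, scoped to this closure; it shadows,
		// and does not overwrite, the function's named return value.
		if err := errors.New("remove mount dir failed: directory not empty"); err != nil {
			fmt.Println("cleanup error (logged only):", err)
		}
	}()
	// The original failure, e.g. "rpc error: ... DeadlineExceeded".
	return errors.New("context deadline exceeded")
}

func main() {
	// Prints the deadline-exceeded error, not the cleanup error.
	fmt.Println("returned:", mountDevice())
}
--/snip--

The function returns the original timeout error, confirming that the cleanup failure is only logged and does not mask what the volume manager sees.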
[Status update] https://github.com/openshift/origin/pull/23624 has now been merged in OCP upstream.

*** Bug 1746111 has been marked as a duplicate of this bug. ***

What's the status of this BZ? We have to fix it, as it may cause corruption.

We are unable to fix this before the 4.2 release date but will continue working on it for 4.3. Once we have the fix we will re-evaluate the bug and decide if a backport to 4.2 is necessary.
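For context on the eventual fix direction summarized in the Doc Text above ("Kubernetes waits for a final success / error after CSI driver returns timeout or a similar transient error"), here is a minimal sketch of that pattern; the helper names (stageFn, stageUntilFinal, isTransient) are illustrative assumptions, not the actual kubelet implementation:

--snip--
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stageFn stands in for the CSI NodeStageVolume RPC.
type stageFn func(ctx context.Context) error

func isTransient(err error) bool {
	// A deadline means the driver may still complete the mount later, so
	// Kubernetes cannot assume the volume is unstaged on this node.
	return errors.Is(err, context.DeadlineExceeded)
}

func stageUntilFinal(stage stageFn, timeout, retryDelay time.Duration) error {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := stage(ctx)
		cancel()
		if !isTransient(err) {
			// Final answer: nil (mounted) or an error meaning "not mounted".
			// Only now is it safe to clean up the staging path or let the
			// volume be attached on another node.
			return err
		}
		// Uncertain state: do NOT remove the staging dir and do NOT move the
		// pod to another node; retry the same call against the same node.
		time.Sleep(retryDelay)
	}
}

func main() {
	// Simulate a driver that needs three attempts before the mount sticks.
	attempts := 0
	err := stageUntilFinal(func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			return context.DeadlineExceeded
		}
		return nil
	}, 15*time.Second, 500*time.Millisecond)
	fmt.Println("staged after", attempts, "attempts, err =", err)
}
--/snip--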