Description of problem:

The pod reports "FailedMount" events for PVs backed by ceph-csi, while kubelet never attempts to unmount the said PV from the other node where it is mounted, or to mount it on the node where the pod is currently scheduled.

$ oc version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v4.2.0-alpha.0-27-gb833744", GitCommit:"b833744", GitTreeState:"clean", BuildDate:"2019-08-23T21:44:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+be7ae76", GitCommit:"be7ae76", GitTreeState:"clean", BuildDate:"2019-08-26T09:22:11Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

OpenShift Version: 4.2.0-0.ci-2019-08-26-110123

How reproducible:
Happens about 1 in 5 times the ocs-monkey workload is executed.

Steps to Reproduce:
1. Set up an OCP+OCS cluster, with Rook+Ceph and a storage class created for RBD (named csi-rbd).
2. Run the ocs-monkey tool (https://github.com/JohnStrunk/ocs-monkey) against this setup with the following CLI options:
   ./workload_runner.py -l ./ -m ReadWriteOnce -n testws --oc /path/to/oc --ocs-namespace <rook-ceph namespace> -s csi-rbd
3. In about 2-10 minutes the workload runner reports a failure and starts gathering OCS and OCP must-gather logs.
4. The failure is along the lines of "Unhealthy: testws/osio-worker-524728073", which means the pod did not report as "Running" within a certain timeout.

Actual results:
The workload stops running because the pod is not detected in the "Running" state within the specified time limit (around 30 seconds).

Expected results:
The workload should run without any such failures.
Master Log: See http://file.rdu.redhat.com/srangana/bz1745773.tar.gz (under directory "./ocs-monkey-82240754/must-gather-2/")

Node Log (of failed PODs): See http://file.rdu.redhat.com/srangana/bz1745773.tar.gz (under ./ocs-monkey-82240754/ip-10-0*.node.log)

PV Dump:
$ oc get pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2019-08-26T15:47:42Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c
  resourceVersion: "29344"
  selfLink: /api/v1/persistentvolumes/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c
  uid: d72bd051-c818-11e9-b67a-0e9b9a4cfc94
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-389152222
    namespace: testws
    resourceVersion: "29109"
    uid: d18e08c9-c818-11e9-b2cc-0a7451fe601c
  csi:
    driver: rook-ceph.rbd.csi.ceph.com
    fsType: xfs
    nodeStageSecretRef:
      name: rook-ceph-csi
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      pool: replicapool
      storage.kubernetes.io/csiProvisionerIdentity: 1566833504844-8081-rook-ceph.rbd.csi.ceph.com
    volumeHandle: 0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-rbd
  volumeMode: Filesystem
status:
  phase: Bound

PVC Dump:
$ oc -n testws get pvc/pvc-389152222 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
  creationTimestamp: "2019-08-26T15:47:33Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pvc-389152222
  namespace: testws
  resourceVersion: "29346"
  selfLink: /api/v1/namespaces/testws/persistentvolumeclaims/pvc-389152222
  uid: d18e08c9-c818-11e9-b2cc-0a7451fe601c
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: csi-rbd
  volumeMode: Filesystem
  volumeName: pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 3Gi
  phase: Bound

StorageClass Dump (if StorageClass used by PV/PVC):
$ oc get sc/csi-rbd -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"csi-rbd"},"parameters":{"clusterID":"rook-ceph","csi.storage.k8s.io/fstype":"xfs","csi.storage.k8s.io/node-stage-secret-name":"rook-ceph-csi","csi.storage.k8s.io/node-stage-secret-namespace":"rook-ceph","csi.storage.k8s.io/provisioner-secret-name":"rook-ceph-csi","csi.storage.k8s.io/provisioner-secret-namespace":"rook-ceph","pool":"replicapool"},"provisioner":"rook-ceph.rbd.csi.ceph.com","reclaimPolicy":"Delete"}
  creationTimestamp: "2019-08-26T15:29:29Z"
  name: csi-rbd
  resourceVersion: "15434"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/csi-rbd
  uid: 4b8e1065-c816-11e9-b67a-0e9b9a4cfc94
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/fstype: xfs
  csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  pool: replicapool
provisioner: rook-ceph.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Additional info:

- Describe output of failing pod:
-------------------------------------------
$ oc -n testws describe pod/osio-worker-389152222-7b9786fb59-9trgz
Name:               osio-worker-389152222-7b9786fb59-9trgz
Namespace:          testws
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-162-206.ec2.internal/10.0.162.206
Start Time:         Mon, 26 Aug 2019 11:48:21 -0400
Labels:             deployment-id=389152222
                    pod-template-hash=7b9786fb59
Annotations:        openshift.io/scc: restricted
Status:             Pending
IP:
Controlled By:
ReplicaSet/osio-worker-389152222-7b9786fb59
Containers:
  osio-workload:
    Container ID:
    Image:          quay.io/johnstrunk/osio-workload
    Image ID:
    Port:           <none>
    Host Port:      <none>
    Args:           --untar-rate 10 --rm-rate 10 --kernel-slots 3
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Readiness:      exec [/health.sh] delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from data (rw)
      /status from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mtrzt (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-389152222
    ReadOnly:   false
  default-token-mtrzt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mtrzt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                   From                                   Message
  ----     ------                  ----                  ----                                   -------
  Normal   Scheduled               100m                  default-scheduler                      Successfully assigned testws/osio-worker-389152222-7b9786fb59-9trgz to ip-10-0-162-206.ec2.internal
  Warning  FailedAttachVolume      100m                  attachdetach-controller                Multi-Attach error for volume "pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  100m                  attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c"
  Warning  FailedMount             14m (x38 over 98m)    kubelet, ip-10-0-162-206.ec2.internal  Unable to mount volumes for pod "osio-worker-389152222-7b9786fb59-9trgz_testws(edf5bece-c818-11e9-9022-02f7b213f410)": timeout expired waiting for volumes to attach or mount for pod "testws"/"osio-worker-389152222-7b9786fb59-9trgz". list of unmounted volumes=[data]. list of unattached volumes=[temp data default-token-mtrzt]
  Warning  FailedMount             4m27s (x49 over 99m)  kubelet, ip-10-0-162-206.ec2.internal  MountVolume.MountDevice failed for volume "pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
-------------------------------------------

- Ceph-CSI rbd nodeplugin activities:
-------------------------------------------
$ oc -n rook-ceph get pods -o wide | grep rbd
csi-rbdplugin-hdtdc          3/3  Running  0  166m  10.0.129.7    ip-10-0-129-7.ec2.internal    <none>  <none>
csi-rbdplugin-n7ddn          3/3  Running  0  166m  10.0.159.243  ip-10-0-159-243.ec2.internal  <none>  <none>
csi-rbdplugin-provisioner-0  5/5  Running  0  166m  10.131.0.16   ip-10-0-129-7.ec2.internal    <none>  <none>
csi-rbdplugin-qgwql          3/3  Running  0  166m  10.0.162.206  ip-10-0-162-206.ec2.internal  <none>  <none>
-------------------------------------------

Node: ip-10-0-162-206.ec2.internal
This is the node where the pod is scheduled, per "Start Time: Mon, 26 Aug 2019 11:48:21 -0400" from the pod describe output above.
-------------------------------------------

kubelet node log errors pertaining to the mount of the PV (from node ip-10-0-162-206.ec2.internal):

Aug 26 15:48:21 ip-10-0-162-206 hyperkube[1156]: I0826 15:48:21.154089 1156 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010") pod "osio-worker-389152222-7b9786fb59-9trgz" (UID: "edf5bece-c818-11e9-9022-02f7b213f410")

Aug 26 15:48:21 ip-10-0-162-206 hyperkube[1156]: E0826 15:48:21.154246 1156 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\"" failed.
No retries permitted until 2019-08-26 15:48:21.654203474 +0000 UTC m=+1625.547452690 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\") pod \"osio-worker-389152222-7b9786fb59-9trgz\" (UID: \"edf5bece-c818-11e9-9022-02f7b213f410\") "
-------------------------------------------

NOTE: The mount in CSI is split into Stage and then Publish. In the following logs, the PV had been mounted on this node in the past, but when the pod is restarted on this node later, there are no mount requests for it (CSI data from another node, where the PV is actually mounted, is presented further below).

CSI-rbd nodeplugin logs from pod csi-rbdplugin-qgwql:
ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-qgwql/csi-rbdplugin/csi-rbdplugin/logs/current.log

Log analysis:

NodeStage request:
2019-08-26T15:44:38.19923549Z I0826 15:44:38.198954 18583 utils.go:119] ID: 156 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:44:38.202653365Z I0826 15:44:38.198972 18583 utils.go:120] ID: 156 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"}

NodeStage response:
2019-08-26T15:44:45.643081872Z I0826 15:44:45.641918 18583 nodeserver.go:161] ID: 156 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 to stagingTargetPath
/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 2019-08-26T15:44:45.643081872Z I0826 15:44:45.641962 18583 utils.go:125] ID: 156 GRPC response: {} NodePublish request: 2019-08-26T15:44:45.658363729Z I0826 15:44:45.658330 18583 utils.go:119] ID: 158 GRPC call: /csi.v1.Node/NodePublishVolume 2019-08-26T15:44:45.65975691Z I0826 15:44:45.658351 18583 utils.go:120] ID: 158 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/5f7de71a-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodePublish response: 2019-08-26T15:44:45.679689085Z I0826 15:44:45.679563 18583 nodeserver.go:261] ID: 158 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/5f7de71a-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount 2019-08-26T15:44:45.680487981Z I0826 15:44:45.679589 18583 utils.go:125] ID: 158 GRPC response: {} NodeStage request (again): 2019-08-26T15:45:49.139843301Z I0826 15:45:49.138878 18583 utils.go:119] ID: 202 GRPC call: /csi.v1.Node/NodeStageVolume 2019-08-26T15:45:49.141522389Z I0826 15:45:49.138899 18583 utils.go:120] ID: 202 GRPC request: 
{"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodeStage response (already mounted): 2019-08-26T15:45:49.152646957Z I0826 15:45:49.152112 18583 nodeserver.go:110] ID: 202 rbd: volume 0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 is already mounted to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010, skipping 2019-08-26T15:45:49.152906796Z I0826 15:45:49.152160 18583 utils.go:125] ID: 202 GRPC response: {} NodePublish request (different pod ID on the same node): 2019-08-26T15:45:49.169713816Z I0826 15:45:49.168956 18583 utils.go:119] ID: 204 GRPC call: /csi.v1.Node/NodePublishVolume 2019-08-26T15:45:49.171125362Z I0826 15:45:49.168975 18583 utils.go:120] ID: 204 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/936c319e-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodePublish response: 2019-08-26T15:45:49.20473302Z I0826 15:45:49.194322 18583 nodeserver.go:261] ID: 204 rbd: 
successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/936c319e-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount 2019-08-26T15:45:49.20473302Z I0826 15:45:49.194344 18583 utils.go:125] ID: 204 GRPC response: {} NodeUnpublish request: 2019-08-26T15:46:09.113677418Z I0826 15:46:09.113264 18583 utils.go:119] ID: 218 GRPC call: /csi.v1.Node/NodeUnpublishVolume 2019-08-26T15:46:09.114394307Z I0826 15:46:09.113288 18583 utils.go:120] ID: 218 GRPC request: {"target_path":"/var/lib/kubelet/pods/5f7de71a-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodeUnpublish response: 2019-08-26T15:46:09.179370298Z I0826 15:46:09.179254 18583 nodeserver.go:404] ID: 218 rbd: successfully unbound volume 0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/5f7de71a-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount 2019-08-26T15:46:09.180695974Z I0826 15:46:09.179280 18583 utils.go:125] ID: 218 GRPC response: {} NodeUnpublish request: 2019-08-26T15:47:19.566715155Z I0826 15:47:19.565844 18583 utils.go:119] ID: 271 GRPC call: /csi.v1.Node/NodeUnpublishVolume 2019-08-26T15:47:19.566715155Z I0826 15:47:19.565862 18583 utils.go:120] ID: 271 GRPC request: {"target_path":"/var/lib/kubelet/pods/936c319e-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount","volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodeUnpublish response: 2019-08-26T15:47:19.605667638Z I0826 15:47:19.604987 18583 nodeserver.go:404] ID: 271 rbd: successfully 
unbound volume 0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 from /var/lib/kubelet/pods/936c319e-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount
2019-08-26T15:47:19.605853907Z I0826 15:47:19.605013 18583 utils.go:125] ID: 271 GRPC response: {}

NodeUnstage request:
2019-08-26T15:47:19.726735877Z I0826 15:47:19.726222 18583 utils.go:119] ID: 275 GRPC call: /csi.v1.Node/NodeUnstageVolume
2019-08-26T15:47:19.731722772Z I0826 15:47:19.726243 18583 utils.go:120] ID: 275 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"}

NodeUnstage response:
2019-08-26T15:47:20.228667204Z I0826 15:47:20.227693 18583 nodeserver.go:472] ID: 275 successfully unmounted volume (0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010)
2019-08-26T15:47:20.228667204Z I0826 15:47:20.227830 18583 utils.go:125] ID: 275 GRPC response: {}

There are no further mount requests to this ceph-csi node plugin instance. The timestamp of the last unmount (15:47:20.228667204) on this node is well before the pod restart time on it ("11:48:21 -0400").
-------------------------------------------

Data from the node where the PV is actually mounted is presented below.

Node: ip-10-0-159-243.ec2.internal
This is the node where the last instance of the PV was staged, per the ceph-csi nodeplugin logs. This occurs after the unmount of the PV on the previous node, but there are no subsequent calls to unmount the PV from this node before a mount is attempted on the node where the failing pod is scheduled.
ceph-csi pod: csi-rbdplugin-n7ddn CSI-rbd Nodeplugin logs: ocs-must-gather-2/ceph/namespaces/rook-ceph/pods/csi-rbdplugin-n7ddn/csi-rbdplugin/csi-rbdplugin/logs/current.log ------------------------------------------- Log analysis: NodeStage request: 2019-08-26T15:47:30.89587171Z I0826 15:47:30.887132 20545 utils.go:119] ID: 279 GRPC call: /csi.v1.Node/NodeStageVolume 2019-08-26T15:47:30.89587171Z I0826 15:47:30.887152 20545 utils.go:120] ID: 279 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodeStage response: 2019-08-26T15:47:33.580775634Z I0826 15:47:33.580600 20545 nodeserver.go:161] ID: 279 rbd: successfully mounted volume 0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 to stagingTargetPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 2019-08-26T15:47:33.583885329Z I0826 15:47:33.580652 20545 utils.go:125] ID: 279 GRPC response: {} NodePublish request: 2019-08-26T15:47:33.601456621Z I0826 15:47:33.600735 20545 utils.go:119] ID: 281 GRPC call: /csi.v1.Node/NodePublishVolume 2019-08-26T15:47:33.603306004Z I0826 15:47:33.600757 20545 utils.go:120] ID: 281 GRPC request: 
{"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount","target_path":"/var/lib/kubelet/pods/bad60295-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010"} NodePublish response: 2019-08-26T15:47:33.715244475Z I0826 15:47:33.714462 20545 nodeserver.go:261] ID: 281 rbd: successfully mounted stagingPath /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/globalmount/0001-0009-rook-ceph-0000000000000001-4e253305-c818-11e9-be8f-0a580a830010 to targetPath /var/lib/kubelet/pods/bad60295-c818-11e9-9022-02f7b213f410/volumes/kubernetes.io~csi/pvc-4df1f65f-c818-11e9-9022-02f7b213f410/mount 2019-08-26T15:47:33.715244475Z I0826 15:47:33.714488 20545 utils.go:125] ID: 281 GRPC response: {} NOTE: PV is left mounted here, there are no Unpublish or Unstage calls, IOW no unmount requests send to this node. -------------------------------------------
**** Here is my RCA of this bug ****

*) NodeStageVolume did not complete within the "15s" timeout [1], which kicked off the cleanup operation; the cleanup failed [2] due to the way Ceph-CSI performs **staging at present** [3], compared to some time back.

[1] The question of why a "15s" timeout?

--> Short answer: this would be solved by the PR I filed 10 days back against OpenShift Origin (filed proactively, so there is no bug reference in it).

PR reference # https://github.com/openshift/origin/pull/23624

--snip-- from the PR description
In case of a short timeout (15s), some CSI plugins like RBD can behave weirdly. That is, if a timeout occurs, the RBD plugin tries an unmap and does other cleanup/rollback operations. In most cases, it is simply taking more than 15s when there are parallel requests to serve and map. If the timeout is higher, most of the mapping operations will succeed, which avoids rollback/cleanup operations in the system and thereby avoids other issues.
--/snip--

[3] Recently we changed the way Ceph-CSI does the staging, via https://github.com/ceph/ceph-csi/pull/510. In short: instead of mounting on the 'staging target path' given by the CO, we create a 'metadata file' and a 'new directory' inside this CO-provided 'staging target path', and treat this "new" directory as our 'staging target path'. Fetching this new directory name from the logs, it is:
/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010

----> NOTE that this directory is INSIDE 'globalmount'. The metadata file is, in general, named after the volume ID.

[2] The 'entire' staging operation did not finish within this '15s', which triggered the cleanup operation, which in this setup is an attempted removal of the 'staging target path'.
When it was attempted, the removal of this path failed because it carried an extra file and directory inside it.

Some interesting things I would like to point out here:
-------------------------------------------------------------------

1) The CO passes the 'staging target path' to the SP, and as per the spec (https://github.com/container-storage-interface/spec/blob/master/spec.md) the understanding was that the SP can choose the 'staging target path' on its own. The first immediate question: is this understanding of the SPEC correct? At least it is not clear in the SPEC whether staging must happen exactly on THIS path provided by the CO. If that is not mandatory, I see a couple of bugs or possible issues here. The SP then has "2" options: either "stage" on a totally different path, or stage on a "directory" or path which is inside the "staging target path" given by the CO (".../globalmount/").

-> On a "different path": if the SP stages on a different path, the cleanup operation code in the CO is ***not*** correct, OR it is a no-op, or it can always fail. The reason is that cleanup is attempted on the path KNOWN to the CO, which is nothing but the path the CO passed in the stage RPC call.

-> On a 'directory' or 'path' inside the staging target path: if the SP uses a directory, as in this case, or puts some file/directory inside the staging target path, the removal/cleanup will fail as shown in the logs, with the error 'directory not empty'.

The immediate fix I can think of here: in 'removeMountDir', make use of 'os.RemoveAll' instead of 'os.Remove'!

--snip--
func removeMountDir(plug *csiPlugin, mountPath string) error {
	klog.V(4).Info(log("removing mount path [%s]", mountPath))
	....
	if !mnt {
		klog.V(4).Info(log("dir not mounted, deleting it [%s]", mountPath))
		if err := os.Remove(mountPath); err != nil && !os.IsNotExist(err) {
			return errors.New(log("failed to remove dir [%s]: %v", mountPath, err))
		}
	....
--/snip--

2) The other interesting fact is this: suppose the cleanup failed for the above reason. It looks like the CO will **think** that staging actually failed on NODE 1, and so reschedules/moves the workload/POD to a second NODE, which causes "staging" to happen on the second NODE. Meanwhile, there is a possibility that the SP had completed 'staging' of the volume on NODE 1 too!! If that is what happened, I believe it answers ***how we landed upon 'staging' of the volume on MORE than ONE NODE at the same time***, and thus the 'data corruption' / "FS integrity" issue!

Reference bugzillas:
https://bugzilla.redhat.com/show_bug.cgi?id=1745773 (Delay in detecting a successful CSI mount, or incorrect report of failed mount, by kubelet for Ceph CSI based volume)
https://bugzilla.redhat.com/show_bug.cgi?id=1746111 (RWO ceph RBD PV was staged on multiple worker nodes, causing filesystem header corruption)

----------------------------------------------
LOGS used for RCA or hypothesis
----------------------------------------------------------------------

---> ceph-csi NodeStageVolume invoked here as per timestamps (logs from CSI node plugin interspersed for clarity) <---

2019-08-26T15:48:06.921973038Z I0826 15:48:06.921427 30300 utils.go:119] ID: 346 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:48:06.927130275Z I0826 15:48:06.921449 30300 utils.go:120] ID: 346 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010"

---> end interspersed logs (Note: the NodeStageVolume request has not yet returned success or failure at this point in time) <---

Aug 26 15:48:10 ip-10-0-129-7 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010.

---> Some error is triggered and hence the cleanup routines have kicked in (within 15 seconds?)

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868121 1189 csi_mounter.go:433] kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Directory cleanup fails, as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868142 1189 csi_attacher.go:322] kubernetes.io/csi: attacher.MountDevice failed to remove mount dir after errir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Error carried over from above

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868262 1189 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\"" failed. No retries permitted until 2019-08-26 15:48:22.368216697 +0000 UTC m=+1632.155530185 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\") pod \"osio-worker-389152222-7b9786fb59-rtr66\" (UID: \"d1b70966-c818-11e9-9022-02f7b213f410\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"

---> It looks like the original error was a DeadlineExceeded, IOW a request timeout
-------------------------------------------------------------------------------------------------------------------

Your thoughts on this analysis/RCA and the suggested fix would be really appreciated, so that we can take it up from there.
(In reply to Humble Chirammal from comment #4)
> 1) The question on why "15s" timeout ?
>
> --> Short answer would be, this would be solved with the PR which I filed
> 10 days back (actually proactively, so no BUG reference in the same) to
> Openshift Origin.
>
> PR reference # https://github.com/openshift/origin/pull/23624

I was looking at the openshift/kubernetes branch, where it was 2 minutes, but this explains it. We would need to move this timeout up. BUT, to reiterate, moving it to 2 minutes is NOT the fix I am looking for. I am attempting to understand why the volume was staged on two nodes when the first node did not respond with a success or a failure.

> [2] The 'entire' staging operation didnt finish within this '15s' and caused
> the cleanup operation to be triggered which is actually 'staging target
> path' removal attempted in this setup. When it was attempted, the 'removal'
> of this path failed as it carried an 'extra file and directory' inside this
> path.

Yes, this is what I noted as "---> Directory cleanup fails as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)".

> Suppose the cleanup failed due to above reason, it looks like 'CO' will
> **think** that, staging is actually failed on NODE 1, so reschedule or move
> the workload/POD to second NODE!

If the removeMountDir error is masking the original timeout error, and the state machine assumes the CSI plugin returned an error, then this is what requires a fix. BUT, if irrespective of the error the state machine assumes the Stage failed (i.e., even on timeout (deadline exceeded) errors) and still progresses with detach and attach on a different node (because the pod moved), then we need to fix that instead.

NOTE: Changing the CO to perform a RemoveAll instead of a Remove would be incorrect; it is the plugin's responsibility to keep the staging path clean at the end of an Unstage or a failed Stage.
Further, as this is the staging path, on timeout errors and the like a RemoveAll may inadvertently remove user/app data from the mount if the mount succeeds a little later. I would not change the CO behaviour here.
(In reply to Shyamsundar from comment #5)
> (In reply to Humble Chirammal from comment #4)
> > Suppose the cleanup failed due to above reason, it looks like 'CO' will
> > **think** that, staging is actually failed on NODE 1, so reschedule or move
> > the workload/POD to second NODE!
>
> If the removeMountDir error is masking the original timeout error, and the
> state machine assumes the CSI plugin returned an error, then this is what
> requires a fix.

Just to clear up the error-masking possibility due to the deferred call to removeMountDir: it is not feasible, because the defer declares its own err variable in its scope, due to the := in the statement:
https://github.com/openshift/origin/blob/f4eaba02647295e3b1c26ac859dc2e723ad08149/vendor/k8s.io/kubernetes/pkg/volume/csi/csi_attacher.go#L321

Tested out in a sample Go program: https://play.golang.org/p/EiSfoU3zNXg
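A minimal standalone reproduction of the scoping point (same shape as the linked playground program; the function and error strings here are invented stand-ins, not the actual csi_attacher.go code):

```go
package main

import (
	"errors"
	"fmt"
)

// cleanup stands in for removeMountDir failing on a non-empty directory.
func cleanup() error {
	return errors.New("remove /.../globalmount: directory not empty")
}

// mountDevice mimics the shape of MountDevice: a named return value err,
// plus a deferred cleanup whose `err :=` declares a NEW variable scoped
// to the deferred function. The cleanup failure is only logged and never
// overwrites the outer err.
func mountDevice() (err error) {
	defer func() {
		if err := cleanup(); err != nil {
			fmt.Println("cleanup failed (logged only):", err)
		}
	}()
	// The original staging error, which the caller should act on:
	return errors.New("rpc error: code = DeadlineExceeded desc = context deadline exceeded")
}

func main() {
	// Prints the DeadlineExceeded error, not the cleanup error.
	fmt.Println("MountDevice returned:", mountDevice())
}
```

Had the deferred function instead assigned with `err =` (no colon), it would have overwritten the named return value and masked the timeout error.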
[Status update] https://github.com/openshift/origin/pull/23624 has been merged now in OCP upstream.
*** Bug 1746111 has been marked as a duplicate of this bug. ***
What's the status of this BZ? We have to fix it, as it may cause corruption.
We are unable to fix this before the 4.2 release date, but we will continue working on it for 4.3. Once we have the fix, we will re-evaluate the bug and decide whether a backport to 4.2 is necessary.