Bug 1745776
| Summary: | Pod reports "FailedMount" events for Ceph-CSI PVs due to stale mount/stage of the same PV on a different node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Shyamsundar <srangana> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Liang Xia <lxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | aos-bugs, aos-storage-staff, bbennett, bchilds, fdeutsch, fsimonce, hchiramm, hekumar, jsafrane, jstrunk, mrajanna |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | ocs-monkey | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text:

Cause: Kubernetes did not unmount a CSI volume when a pod was deleted in the case where the volume mount had timed out but was eventually completed by the CSI driver.

Consequence: A volume was mounted on a node without Kubernetes knowing about it, and could not be mounted anywhere else.

Fix: Kubernetes now waits for a final success or error after the CSI driver returns a timeout or a similar transient error.

Result: Kubernetes knows whether a volume was mounted or unmounted, and cleans up possible stray mounts when a pod is deleted.
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-04 15:39:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Shyamsundar, 2019-08-26 21:42:14 UTC
**** Here is my RCA of this bug ****

*) NodeStageVolume did not complete within the "15s" timeout [1], which kicked off the cleanup operation; the cleanup then failed [2] because of the method Ceph-CSI uses for **staging at present** [3], compared to how it staged some time back.

1) The question of why a "15s" timeout?

--> Short answer: this would be solved by the PR I filed against OpenShift Origin 10 days back (filed proactively, so there is no bug reference in it).

PR reference: https://github.com/openshift/origin/pull/23624

--snip-- from the PR description
In case of a short timeout (15s), some CSI plugins like RBD can behave in odd ways. That is, when the timeout occurs, the RBD plugin tries an unmap and performs other cleanup/rollback operations. In most cases the mapping is simply taking more than 15s because there are parallel requests to serve and map. If the timeout is higher, most of the mapping operations will succeed, which avoids rollback/cleanup operations in the system and in turn avoids other issues.
--/snip--

[3] We recently changed the way Ceph-CSI does the staging, via this PR: https://github.com/ceph/ceph-csi/pull/510

In short: instead of mounting on the 'staging target path' given by the CO (the container orchestrator, i.e. Kubernetes), we create a 'metadata file' and a 'new directory' inside the CO-provided 'staging target path', and then treat this "new" directory as our 'staging target path'. Fetching this new directory name from the logs, it is:

/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010

----> NOTE that this directory is INSIDE 'globalmount'. The metadata file name is, in general, the 'volume ID'.

[2] The 'entire' staging operation did not finish within this '15s', which triggered the cleanup operation; in this setup that means an attempted removal of the 'staging target path'. The removal of this path failed because the path contained an 'extra file and directory'.

Some interesting things I would like to point out here:
-------------------------------------------------------------------

1) The CO passes the 'staging target path' to the SP (the storage provider, here Ceph-CSI), and per the spec (https://github.com/container-storage-interface/spec/blob/master/spec.md), or at least per our understanding of it, the SP can choose its own 'staging target path'.

The first immediate question: is that understanding of the SPEC correct? At the least, the SPEC is not clear on whether staging MUST happen on the exact path provided by the CO. If it is not mandatory, I see a couple of bugs, or possibilities of issues, here. The SP then has "2" options: either "stage" on a totally different path, or stage on a directory/path inside the CO-provided "staging target path" (.../globalmount/).

-> On a "different path": if the SP stages on a different path, the cleanup code in the CO is ***not*** correct; it is either a no-op or can always fail, because cleanup is attempted on the only path KNOWN to the CO, namely the path the CO passed in the Stage RPC call.

-> On a 'directory' or 'path' inside the staging target path: if the SP uses a subdirectory, as in this case, or puts any file/directory inside the staging target path, the removal/cleanup will fail as shown in the logs, with the error 'directory not empty'.

The immediate fix I can think of here: in 'removeMountDir', use 'os.RemoveAll' instead of 'os.Remove'!!!
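To make the failure mode concrete, here is a minimal stand-alone sketch (the file and directory names are illustrative; this is not the kubelet code) showing that os.Remove fails with 'directory not empty' on a staging path that still holds the SP's metadata file and stage subdirectory, while os.RemoveAll would delete it:

--snip--
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Stand-in for the CO-provided .../pvc-xxx/globalmount directory.
	staging, err := os.MkdirTemp("", "globalmount")
	if err != nil {
		panic(err)
	}

	// Stand-ins for what the SP (ceph-csi) leaves inside the staging path:
	// a metadata file named after the volume ID, plus a new stage directory.
	volID := "0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010"
	if err := os.WriteFile(filepath.Join(staging, volID), []byte("{}"), 0600); err != nil {
		panic(err)
	}
	if err := os.Mkdir(filepath.Join(staging, "stage-dir"), 0750); err != nil {
		panic(err)
	}

	// What removeMountDir effectively does today: fails, directory not empty.
	fmt.Println("os.Remove:", os.Remove(staging))

	// The suggested change: succeeds, at the cost of deleting whatever the
	// SP stored under the staging path.
	fmt.Println("os.RemoveAll:", os.RemoveAll(staging))
}
--/snip--

Run on Linux, the first call prints a 'directory not empty' error, matching the kubelet log below.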
For reference, the current removeMountDir code in the kubelet:

--snip--
func removeMountDir(plug *csiPlugin, mountPath string) error {
	klog.V(4).Info(log("removing mount path [%s]", mountPath))
	....
	if !mnt {
		klog.V(4).Info(log("dir not mounted, deleting it [%s]", mountPath))
		if err := os.Remove(mountPath); err != nil && !os.IsNotExist(err) {
			return errors.New(log("failed to remove dir [%s]: %v", mountPath, err))
		}
	....
--/snip--

2) The other interesting fact here is this: suppose the cleanup failed for the above reason. It looks like the CO will **think** that staging actually failed on NODE 1, and will reschedule or move the workload/pod to a second node, which causes "staging" to happen on the second node. There is a possibility that, meanwhile, the SP has completed 'staging' of the volume on NODE 1 too!

If that is what happened, I believe it answers ***how we landed on 'staging' of the volume on MORE than ONE node at the same time***, and thus the 'data corruption' or "FS integrity" issue!

Reference bugzillas:

https://bugzilla.redhat.com/show_bug.cgi?id=1745773 (Delay in detecting a successful CSI mount, or incorrect report of a failed mount, by kubelet for a Ceph-CSI based volume)
https://bugzilla.redhat.com/show_bug.cgi?id=1746111 (RWO Ceph RBD PV was staged on multiple worker nodes, causing filesystem header corruption)

----------------------------------------------
LOGS used for RCA or hypothesis
----------------------------------------------------------------------

---> ceph-csi NodeStageVolume invoked here as per timestamps (logs from the CSI node plugin interspersed for clarity) <---

2019-08-26T15:48:06.921973038Z I0826 15:48:06.921427 30300 utils.go:119] ID: 346 GRPC call: /csi.v1.Node/NodeStageVolume
2019-08-26T15:48:06.927130275Z I0826 15:48:06.921449 30300 utils.go:120] ID: 346 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","pool":"replicapool","storage.kubernetes.io/csiProvisionerIdentity":"1566833504844-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010"

---> end interspersed logs (Note: the NodeStageVolume request has not yet returned success or failure at this point in time) <---

Aug 26 15:48:10 ip-10-0-129-7 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount/0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010.

---> Some error is triggered and hence cleanup routines have kicked in (within 15 seconds?)
Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868121 1189 csi_mounter.go:433] kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Directory cleanup fails as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868142 1189 csi_attacher.go:322] kubernetes.io/csi: attacher.MountDevice failed to remove mount dir after errir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount]: remove /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c/globalmount: directory not empty

---> Error carried over from above

Aug 26 15:48:21 ip-10-0-129-7 hyperkube[1189]: E0826 15:48:21.868262 1189 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\"" failed. No retries permitted until 2019-08-26 15:48:22.368216697 +0000 UTC m=+1632.155530185 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-d18e08c9-c818-11e9-b2cc-0a7451fe601c\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-d23fa5e3-c818-11e9-be8f-0a580a830010\") pod \"osio-worker-389152222-7b9786fb59-rtr66\" (UID: \"d1b70966-c818-11e9-9022-02f7b213f410\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"

---> Looks like the original error was a deadline exceeded, IOW a request timeout

-------------------------------------------------------------------------------------------------------------------

Your thoughts on this analysis/RCA and the suggested fix would be much appreciated, so that we can take it up from there.

(In reply to Humble Chirammal from comment #4)
> 1) The question on why "15s" timeout ?
>
> --> Short answer would be, this would be solved with the PR which I filed
> 10 days back ( actually proactively, so no BUG reference in the same) to
> Openshift Origin.
>
> PR reference # https://github.com/openshift/origin/pull/23624

I was looking at the openshift/kubernetes branch, where it is 2 minutes, but this explains it. We would need to move this timeout up. BUT, to reiterate, moving it to 2 minutes is NOT the fix I am looking for. I am attempting to understand why the volume was staged on two nodes when the first node did not respond with a success or a failure.

> [2] The 'entire' staging operation didnt finish within this '15s' and caused
> the cleanup operation to be triggered which is actually 'staging target
> path' removal attempted in this setup. When it was attempted, the 'removal'
> of this path failed as it carried an 'extra file and directory' inside this
> path.

Yes, this is what I noted as "---> Directory cleanup fails as ceph-csi stores a meta file in the global mount path for future use (so this error is logically correct)".

> Suppose the cleanup failed due to above reason, it looks like 'CO' will
> **think** that, staging is actually failed on NODE 1, so reschedule or move
> the workload/POD to second NODE!

If the removeMountDir error is masking the original timeout error, and the state machine assumes the CSI plugin returned an error, then this is what requires a fix.

BUT, if irrespective of the error the state machine assumes the Stage failed (i.e. even on timeout (deadline exceeded) errors), and still proceeds with detach and attach on a different node (because the pod moved), then that is what we need to fix.

NOTE: Changing the CO to perform a RemoveAll instead of a Remove would be incorrect; it is the plugin's responsibility to keep the staging path clean at the end of an Unstage or a failed Stage. Further, as this is the staging path, on timeout errors such a RemoveAll may inadvertently remove user/app data from the mount if the mount succeeds a little later. I would not change the CO behaviour here.

(In reply to Shyamsundar from comment #5)
> If the removeMountDir error is masking the original timeout error, and the
> state machine assumes the CSI plugin returned an error, then this is what
> requires a fix.

Just to clear up the possibility of error masking by the deferred removeMountDir call: this is not feasible, as the defer declares its own err variable in its own scope, due to the := in the statement:
https://github.com/openshift/origin/blob/f4eaba02647295e3b1c26ac859dc2e723ad08149/vendor/k8s.io/kubernetes/pkg/volume/csi/csi_attacher.go#L321

Tested out in a sample go program: https://play.golang.org/p/EiSfoU3zNXg
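For reference, a minimal sketch along the lines of the linked playground program (the playground's exact content is not reproduced here; the names are illustrative), showing why the := inside the deferred closure cannot overwrite the function's named return value:

--snip--
package main

import (
	"errors"
	"fmt"
)

func mountDevice() (err error) {
	defer func() {
		// The := declares a NEW err, scoped to this closure; it shadows,
		// and does not overwrite, the function's named return value.
		if err := errors.New("remove mount dir failed: directory not empty"); err != nil {
			fmt.Println("cleanup error (logged only):", err)
		}
	}()
	// The original failure, e.g. "rpc error: ... DeadlineExceeded".
	return errors.New("context deadline exceeded")
}

func main() {
	// Prints the deadline-exceeded error, not the cleanup error.
	fmt.Println("returned:", mountDevice())
}
--/snip--

The function returns the original timeout error, confirming that the cleanup failure is only logged and does not mask what the volume manager sees.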
[Status update] https://github.com/openshift/origin/pull/23624 has now been merged in OCP upstream.

*** Bug 1746111 has been marked as a duplicate of this bug. ***

What's the status of this BZ? We have to fix it, as it may cause corruption.

We are unable to fix this before the 4.2 release date but will continue working on it for 4.3. Once we have the fix we will re-evaluate the bug and decide if a backport to 4.2 is necessary.
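For context on the eventual fix direction summarized in the Doc Text above ("Kubernetes waits for a final success / error after CSI driver returns timeout or a similar transient error"), here is a minimal sketch of that pattern; the helper names (stageFn, stageUntilFinal, isTransient) are illustrative assumptions, not the actual kubelet implementation:

--snip--
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stageFn stands in for the CSI NodeStageVolume RPC.
type stageFn func(ctx context.Context) error

func isTransient(err error) bool {
	// A deadline means the driver may still complete the mount later, so
	// Kubernetes cannot assume the volume is unstaged on this node.
	return errors.Is(err, context.DeadlineExceeded)
}

func stageUntilFinal(stage stageFn, timeout, retryDelay time.Duration) error {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := stage(ctx)
		cancel()
		if !isTransient(err) {
			// Final answer: nil (mounted) or an error meaning "not mounted".
			// Only now is it safe to clean up the staging path or let the
			// volume be attached on another node.
			return err
		}
		// Uncertain state: do NOT remove the staging dir and do NOT move the
		// pod to another node; retry the same call against the same node.
		time.Sleep(retryDelay)
	}
}

func main() {
	// Simulate a driver that needs three attempts before the mount sticks.
	attempts := 0
	err := stageUntilFinal(func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			return context.DeadlineExceeded
		}
		return nil
	}, 15*time.Second, 500*time.Millisecond)
	fmt.Println("staged after", attempts, "attempts, err =", err)
}
--/snip--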